Thread: Research/Implementation of Nested Loop Join optimization
Does PostgreSQL already works this way?
When you install the source tree (e.g. in folder \postgresql-8.3.x) you will want to examine nodeMergejoin.c typically found in a path similar to this:
\postgresql-8.3.x\src\backend\executor\nodeMergejoin.c
Here are the comments from the version on my machine:
/*
* INTERFACE ROUTINES
* ExecMergeJoin mergejoin outer and inner relations.
* ExecInitMergeJoin creates and initializes run time states
* ExecEndMergeJoin cleans up the node.
*
* NOTES
*
* Merge-join is done by joining the inner and outer tuples satisfying
* join clauses of the form ((= outerKey innerKey) ...).
* The join clause list is provided by the query planner and may contain
* more than one (= outerKey innerKey) clause (for composite sort key).
*
* However, the query executor needs to know whether an outer
* tuple is "greater/smaller" than an inner tuple so that it can
* "synchronize" the two relations. For example, consider the following
* relations:
*
* outer: (0 ^1 1 2 5 5 5 6 6 7) current tuple: 1
* inner: (1 ^3 5 5 5 5 6) current tuple: 3
*
* To continue the merge-join, the executor needs to scan both inner
* and outer relations till the matching tuples 5. It needs to know
* that currently inner tuple 3 is "greater" than outer tuple 1 and
* therefore it should scan the outer relation first to find a
* matching tuple and so on.
*
* Therefore, rather than directly executing the merge join clauses,
* we evaluate the left and right key expressions separately and then
* compare the columns one at a time (see MJCompare). The planner
* passes us enough information about the sort ordering of the inputs
* to allow us to determine how to make the comparison. We may use the
* appropriate btree comparison function, since Postgres' only notion
* of ordering is specified by btree opfamilies.
*
*
* Consider the above relations and suppose that the executor has
* just joined the first outer "5" with the last inner "5". The
* next step is of course to join the second outer "5" with all
* the inner "5's". This requires repositioning the inner "cursor"
* to point at the first inner "5". This is done by "marking" the
* first inner 5 so we can restore the "cursor" to it before joining
* with the second outer 5. The access method interface provides
* routines to mark and restore to a tuple.
*
*
* Essential operation of the merge join algorithm is as follows:
*
* Join {
* get initial outer and inner tuples INITIALIZE
* do forever {
* while (outer != inner) { SKIP_TEST
* if (outer < inner)
* advance outer SKIPOUTER_ADVANCE
* else
* advance inner SKIPINNER_ADVANCE
* }
* mark inner position SKIP_TEST
* do forever {
* while (outer == inner) {
* join tuples JOINTUPLES
* advance inner position NEXTINNER
* }
* advance outer position NEXTOUTER
* if (outer == mark) TESTOUTER
* restore inner position to mark TESTOUTER
* else
* break // return to top of outer loop
* }
* }
* }
*
* The merge join operation is coded in the fashion
* of a state machine. At each state, we do something and then
* proceed to another state. This state is stored in the node's
* execution state information and is preserved across calls to
* ExecMergeJoin. -cim 10/31/89
*/
From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Manoel Henrique
Sent: Wednesday, July 23, 2008 1:17 PM
To: pgsql-hackers@postgresql.org
Subject: [HACKERS] Research/Implementation of Nested Loop Join optimization
Hi! I`m a researcher from PUC-Rio (Brazil) and we`re studying about an Joins, and we`d like to implement an optimization on the Nested Loop Join, this optimization consists on while scanning the inner table, the iteration would go from up-down then backwards(down-up) to take advantage of the buffer pages in memory.
We`d work with MaterialScan and only NestedLoop (we`re dropping all indexes and keys to make it this way). The research objective is to show some students how a DBMS works.
Does PostgreSQL already works this way?
Is it possible to implement such thing? Is it easy? how hard?
Thank you in advance,
Manoel Henrique Souza.
When you install the source tree (e.g. in folder \postgresql-8.3.x) you will want to examine nodeMergejoin.c typically found in a path similar to this:
\postgresql-8.3.x\src\backend\executor\nodeMergejoin.c
Here are the comments from the version on my machine:
/*
* INTERFACE ROUTINES
* ExecMergeJoin mergejoin outer and inner relations.
* ExecInitMergeJoin creates and initializes run time states
* ExecEndMergeJoin cleans up the node.
*
* NOTES
*
* Merge-join is done by joining the inner and outer tuples satisfying
* join clauses of the form ((= outerKey innerKey) ...).
* The join clause list is provided by the query planner and may contain
* more than one (= outerKey innerKey) clause (for composite sort key).
*
* However, the query executor needs to know whether an outer
* tuple is "greater/smaller" than an inner tuple so that it can
* "synchronize" the two relations. For example, consider the following
* relations:
*
* outer: (0 ^1 1 2 5 5 5 6 6 7) current tuple: 1
* inner: (1 ^3 5 5 5 5 6) current tuple: 3
*
* To continue the merge-join, the executor needs to scan both inner
* and outer relations till the matching tuples 5. It needs to know
* that currently inner tuple 3 is "greater" than outer tuple 1 and
* therefore it should scan the outer relation first to find a
* matching tuple and so on.
*
* Therefore, rather than directly executing the merge join clauses,
* we evaluate the left and right key expressions separately and then
* compare the columns one at a time (see MJCompare). The planner
* passes us enough information about the sort ordering of the inputs
* to allow us to determine how to make the comparison. We may use the
* appropriate btree comparison function, since Postgres' only notion
* of ordering is specified by btree opfamilies.
*
*
* Consider the above relations and suppose that the executor has
* just joined the first outer "5" with the last inner "5". The
* next step is of course to join the second outer "5" with all
* the inner "5's". This requires repositioning the inner "cursor"
* to point at the first inner "5". This is done by "marking" the
* first inner 5 so we can restore the "cursor" to it before joining
* with the second outer 5. The access method interface provides
* routines to mark and restore to a tuple.
*
*
* Essential operation of the merge join algorithm is as follows:
*
* Join {
* get initial outer and inner tuples INITIALIZE
* do forever {
* while (outer != inner) { SKIP_TEST
* if (outer < inner)
* advance outer SKIPOUTER_ADVANCE
* else
* advance inner SKIPINNER_ADVANCE
* }
* mark inner position SKIP_TEST
* do forever {
* while (outer == inner) {
* join tuples JOINTUPLES
* advance inner position NEXTINNER
* }
* advance outer position NEXTOUTER
* if (outer == mark) TESTOUTER
* restore inner position to mark TESTOUTER
* else
* break // return to top of outer loop
* }
* }
* }
*
* The merge join operation is coded in the fashion
* of a state machine. At each state, we do something and then
* proceed to another state. This state is stored in the node's
* execution state information and is preserved across calls to
* ExecMergeJoin. -cim 10/31/89
*/
From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Manoel Henrique
Sent: Wednesday, July 23, 2008 1:17 PM
To: pgsql-hackers@postgresql.org
Subject: [HACKERS] Research/Implementation of Nested Loop Join optimization
Hi! I`m a researcher from PUC-Rio (Brazil) and we`re studying about an Joins, and we`d like to implement an optimization on the Nested Loop Join, this optimization consists on while scanning the inner table, the iteration would go from up-down then backwards(down-up) to take advantage of the buffer pages in memory.
We`d work with MaterialScan and only NestedLoop (we`re dropping all indexes and keys to make it this way). The research objective is to show some students how a DBMS works.
Does PostgreSQL already works this way?Is it possible to implement such thing? Is it easy? how hard?
Thank you in advance,
Manoel Henrique Souza.
From: Manoel Henrique [mailto:mhenriquesgbd@gmail.com]
Sent: Wednesday, July 23, 2008 1:47 PM
To: Dann Corbit
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Research/Implementation of Nested Loop Join optimization
The nodeMergejoin.c is the code for the Merge Join isn`t it? I am trying to find a way to change the Nested Loop Join, It would be more like on nodeNestloop.c when rescanning the inner plan, (second time scanning the inner plan and so on) he`d change the scan direction, If the scan direction was from first tuple to last tuple it would go backwards, if it was from last to first it would go forward... The code I`m looking atm is from 8.3.1 , seems to have some kind of direction manager but doesn`t seems to be in use.
>>
You are right. You want file: nodeNestloop.c
<<
"Manoel Henrique" <mhenriquesgbd@gmail.com> writes: > The nodeMergejoin.c is the code for the Merge Join isn`t it? I am trying to > find a way to change the Nested Loop Join, It would be more like on > nodeNestloop.c when rescanning the inner plan, (second time scanning the > inner plan and so on) he`d change the scan direction, If the scan direction > was from first tuple to last tuple it would go backwards, if it was from > last to first it would go forward... The code I`m looking atm is from 8.3.1 > , seems to have some kind of direction manager but doesn`t seems to be in > use. I find this a bit dubious. If the inner rel is small enough to fit in memory then it buys nothing. If not, then you win only to the extent that a pretty large fraction of the inner rel fits in memory. In any case you are relying on the assumption that backwards scan is just as efficient as forward scan, which seems to me to be a pretty large assumption --- we expect forward seqscans to get a performance boost from kernel readahead, but I'd be surprised if the kernel recognized what was happening in a backwards scan. Note also that backwards scan doesn't work at all in some plan node types (cf ExecSupportsBackwardScan). You'd need to check what the inner input node was before trying this. regards, tom lane
Yet, all plan node types we are testing works with backwards scan (looking on ExecSupportsBackwardScan). But, is there a easy way to make a query execute only in backwards scan? How we can do that? Our first objective is to make a backwards scan only and then test a forward-and-backward scan.
-- Manoel
"Manoel Henrique" <mhenriquesgbd@gmail.com> writes:I find this a bit dubious. If the inner rel is small enough to fit in
> The nodeMergejoin.c is the code for the Merge Join isn`t it? I am trying to
> find a way to change the Nested Loop Join, It would be more like on
> nodeNestloop.c when rescanning the inner plan, (second time scanning the
> inner plan and so on) he`d change the scan direction, If the scan direction
> was from first tuple to last tuple it would go backwards, if it was from
> last to first it would go forward... The code I`m looking atm is from 8.3.1
> , seems to have some kind of direction manager but doesn`t seems to be in
> use.
memory then it buys nothing. If not, then you win only to the extent
that a pretty large fraction of the inner rel fits in memory. In any
case you are relying on the assumption that backwards scan is just as
efficient as forward scan, which seems to me to be a pretty large
assumption --- we expect forward seqscans to get a performance boost
from kernel readahead, but I'd be surprised if the kernel recognized
what was happening in a backwards scan.
Note also that backwards scan doesn't work at all in some plan
node types (cf ExecSupportsBackwardScan). You'd need to check
what the inner input node was before trying this.
regards, tom lane
"Manoel Henrique" <mhenriquesgbd@gmail.com> writes: > Yes, I'm relying on the assumption that backwards scan has the same cost as > forward scan, why shouldn't it? Because hard drives only spin one direction -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
On Fri, Jul 25, 2008 at 6:10 PM, Gregory Stark <stark@enterprisedb.com> wrote: > "Manoel Henrique" <mhenriquesgbd@gmail.com> writes: > >> Yes, I'm relying on the assumption that backwards scan has the same cost as >> forward scan, why shouldn't it? > > Because hard drives only spin one direction :) -- Jonah H. Harris, Senior DBA myYearbook.com
On Fri, 2008-07-25 at 19:31 -0400, Jonah H. Harris wrote: > On Fri, Jul 25, 2008 at 6:10 PM, Gregory Stark <stark@enterprisedb.com> wrote: > > "Manoel Henrique" <mhenriquesgbd@gmail.com> writes: > > > >> Yes, I'm relying on the assumption that backwards scan has the same cost as > >> forward scan, why shouldn't it? > > > > Because hard drives only spin one direction > > :) What if you are below the equator? > -- > Jonah H. Harris, Senior DBA > myYearbook.com > -- The PostgreSQL Company since 1997: http://www.commandprompt.com/ PostgreSQL Community Conference: http://www.postgresqlconference.org/ United States PostgreSQL Association: http://www.postgresql.us/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
Joshua D. Drake escribió: > On Fri, 2008-07-25 at 19:31 -0400, Jonah H. Harris wrote: > > On Fri, Jul 25, 2008 at 6:10 PM, Gregory Stark <stark@enterprisedb.com> wrote: > > > "Manoel Henrique" <mhenriquesgbd@gmail.com> writes: > > > > > >> Yes, I'm relying on the assumption that backwards scan has the same cost as > > >> forward scan, why shouldn't it? > > > > > > Because hard drives only spin one direction > > > > :) > > What if you are below the equator? They spin the same direction here too, thanks :-) (Coriolis does not affect much in this case) -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Gregory Stark <stark@enterprisedb.com> writes: > "Manoel Henrique" <mhenriquesgbd@gmail.com> writes: >> Yes, I'm relying on the assumption that backwards scan has the same cost as >> forward scan, why shouldn't it? > Because hard drives only spin one direction Good joke, but to be serious: we expect that forward scans will result in the kernel doing read-ahead, which will allow overlapping of CPU work to process one page with the I/O to bring in the next page. A backwards scan will get no such overlapping and thus be up to 2X slower, unless the kernel is smart enough to do read-ahead for descending-order read requests. Which seems not too probable. A fairly typical kernel behavior is that read-ahead is triggered by successive read() requests without any intervening seek(), and this is impossible for a backward scan. (Yes, we do optimize out the seek calls in a forward scan. IIRC it's done in fd.c.) regards, tom lane
Tom Lane escribió: > Gregory Stark <stark@enterprisedb.com> writes: > > "Manoel Henrique" <mhenriquesgbd@gmail.com> writes: > >> Yes, I'm relying on the assumption that backwards scan has the same cost as > >> forward scan, why shouldn't it? > > > Because hard drives only spin one direction > > Good joke, but to be serious: we expect that forward scans will result > in the kernel doing read-ahead, which will allow overlapping of > CPU work to process one page with the I/O to bring in the next page. I wonder if this is spoiled (or rather, the backwards case fixed) by the attempts to call posix_fadvise() on certain types of scan. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes: > Tom Lane escribi�: >> Good joke, but to be serious: we expect that forward scans will result >> in the kernel doing read-ahead, which will allow overlapping of >> CPU work to process one page with the I/O to bring in the next page. > I wonder if this is spoiled (or rather, the backwards case fixed) by the > attempts to call posix_fadvise() on certain types of scan. Yeah, I started wondering about that too after sending off the above. The fadvise patch might eliminate the distinction ... on platforms where fadvise exists and actually works well. regards, tom lane
Tom Lane wrote: > Gregory Stark <stark@enterprisedb.com> writes: >> "Manoel Henrique" <mhenriquesgbd@gmail.com> writes: >>> Yes, I'm relying on the assumption that backwards scan has the same cost as >>> forward scan, why shouldn't it? > > G...we expect that forward scans will result > in the kernel doing read-ahead, ... > A backwards scan will get no such overlapping and thus be up to 2X > slower, unless the kernel is smart enough to do read-ahead for > descending-order read requests. Which seems not too probable. Linux's old adaptive readahead patches claimed to[1]: It also have methods to detect some less common cases: - readingbackward" Interestingly the author of that patch used postgres as the example application that benefits from the patch (30%). I'm not sure if the backward reading feature got kept in the simplified on-demand readahead that seems to have superseded the adaptive readahead stuff in 2.6.23[2]. [1] http://lwn.net/Articles/185469/ [2] http://kernelnewbies.org/Linux_2_6_23#head-102af265937262a7a21766ae58fddc1a29a5d8d7
Ron Mayer <rm_pg@cheapcomplexdevices.com> writes: > Tom Lane wrote: >> A backwards scan will get no such overlapping and thus be up to 2X >> slower, unless the kernel is smart enough to do read-ahead for >> descending-order read requests. Which seems not too probable. > Linux's old adaptive readahead patches claimed to[1]: I didn't say that there were *no* platforms that could do it. regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > Gregory Stark <stark@enterprisedb.com> writes: >> "Manoel Henrique" <mhenriquesgbd@gmail.com> writes: >>> Yes, I'm relying on the assumption that backwards scan has the same cost as >>> forward scan, why shouldn't it? > >> Because hard drives only spin one direction > > Good joke, but to be serious: we expect that forward scans will result > in the kernel doing read-ahead, which will allow overlapping of > CPU work to process one page with the I/O to bring in the next page. Well it wasn't a joke but you're right that it's not the whole picture. But then neither is considering interleaving of I/O with CPU work. Hard drives spin in a particular direction which means if you stream I/O requests to them in that direction you can stream data off the hard drive as fast as it passes under the read head. That's going to be 50-60MB/s for a single modern 7200rpm drive. On the other hand if you send an I/O request for the previous block then you have to wait a whole rotation before it passes under the head. On a 7200rpm drive that's over 8ms which is a *lot* of CPU work to interleave. The most bandwidth you'll be able to get is under 1MB/s. However there's another reason fadvise helps -- the kernel or the drive gets a chance to reorder the I/O. If we read-ahead a whole track's worth of I/O backwards then during the first 8ms latency the kernel has a chance to notice that it should reorder the queued up requests and do them the right way around. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!