Re: Internal operations when the planner makes a hash join. - Mailing list pgsql-performance

From Igor Neyman
Subject Re: Internal operations when the planner makes a hash join.
Date
Msg-id F4C27E77F7A33E4CA98C19A9DC6722A20598C110@EXCHANGE.corp.perceptron.com
Whole thread Raw
In response to Re: Internal operations when the planner makes a hash join.  (negora <negora@negora.com>)
List pgsql-performance

> -----Original Message-----
> From: negora [mailto:negora@negora.com]
> Sent: Tuesday, February 23, 2010 4:33 PM
> To: Scott Carey
> Cc: Alvaro Herrera; pgsql-performance@postgresql.org
> Subject: Re: Internal operations when the planner makes a hash join.
>
> Thank you for explaining me the internal behaviour of the
> PostgreSQL engine. I'll try to look for more information
> about that hash tables. It sounds really really interesting.
> Your information was very useful.
>
> The origin of my doubt resides in the fact that I need to do
> a joint between 3 HUGE tables (millions of registries) and do
> certain operations with the retrieved information. I was
> deciding whether to use one SELECT with 3 JOINs, as I've been
> doing since the beginning, or build a PL/PgSQL function based
> on 3 nested "FOR ... IN SELECT ... LOOP"
> structures which tried to minimize the subsequent table
> searches storing intermediate useful data in arrays
> (curiously, these would act as the hash tables which you
> mention, but in a very very rudimentary way). In a case like
> this one (possibly unable to fit in RAM), Is also JOIN the
> best solution?
>
> Since I've to retrieve such a big amount of columns and
> crossed registries I had started to think that using 1 SELECT
> with 3 JOINs would increase the number of table searches a
> LOT and also "duplicate" the information too much. I mean
> "duplicate" as in this case, where the Factor 1 appears
> millions of times for every Element:
>
> Element 1 | Sub-factor 1 | Factor 1
> Element 2 | Subf-actor 1 | Factor 1
> ...
> Element 12639747465586 | Sub-factor 1 | Factor 1 Element 1 |
> Sub-factor 2 | Factor 1
>
> I hope not to robber you much time but... What do you think
> about it? Is it better either 1 SELECT with 3 JOINs or build
> nested "FOR ... IN SELECT ... LOOP" structures? Could it be
> one of that cases in which I've to choose between either
> higher speed but higher memory consume (3
> JOINs) or lower speed but less memory expense (3 FORs)?
>
> Thanks again and apologizes for extending this topic too much.
>
>
> Scott Carey wrote:
> > On Feb 23, 2010, at 8:53 AM, Alvaro Herrera wrote:
> >
> >
> >> negora wrote:
> >>
> >>
> >>> According to how I understood the process, the engine
> would get the
> >>> name from the student with ID 1 and would look for the
> name of the
> >>> father with ID 1 in the hashed table. It'd do exactly the
> same with
> >>> the student #2 and father #2. But my big doubt is about
> the 3rd one
> >>> (Anthony). Would the engine "know" that it already had
> retrieved the
> >>> father's name for the student 1 and would avoid searching for it
> >>> into the hashed table (using some kind of internal
> mechanism which
> >>> allows to "re-utilize" the name)? Or would it search into
> the hashed
> >>> table again?<br>
> >>>
> >> The hash table is searched again.  But that's fast, because it's a
> >> hash table.
> >>
> >>
> >
> > To answer the question another way, "remembering" that it
> has already seen father A once and tracking that would use a
> hash table to remember that fact.
> >
> > The hash table created by the first scan IS the "remember
> you have seen this father" data structure, optimized for fast
> lookup.  So before even looking at the first student, the
> hash table is built so that it is fast to find out if a
> father has been seen before, and if so where that father's
> data is located.  Looking this data up is often referred to
> as a "probe" and not a "scan" because it takes just as long
> to do if the hash table has 100 entries or 10000 entries.
> The drawback is that the whole thing has to fit in RAM.
> >
> >
> >
> >> --
> >> Alvaro Herrera
> http://www.CommandPrompt.com/
> >> The PostgreSQL Company - Command Prompt, Inc.
> >>
> >> --
> >> Sent via pgsql-performance mailing list
> >> (pgsql-performance@postgresql.org)
> >> To make changes to your subscription:
> >> http://www.postgresql.org/mailpref/pgsql-performance
> >>
> >
> >
> >
>

So, you are trying to do "nested loop" in PL/PgSQL.
Why not let optimizer decide between "nested loop" and "hash join" based
on your memory settings and statistics collected for objects involved?
I'm pretty sure, it'll be faster than PL/PgSQL 3 nested loops.

Igor Neyman

pgsql-performance by date:

Previous
From: Kevin Kempter
Date:
Subject: partitioned tables query not using indexes
Next
From: "A. Kretschmer"
Date:
Subject: Re: partitioned tables query not using indexes