How to read query plan

From: Miroslav Šulc
Hi all,

I am new to PostgreSQL and query optimizations. We have recently moved
our project from MySQL to PostgreSQL and we are having a performance
problem with one of our most often used queries. On MySQL the speed was
sufficient but PostgreSQL chooses a time-expensive query plan. I would
like to optimize it somehow but the query plan from EXPLAIN ANALYZE is a
little bit cryptic to me.

So the first thing I would like is to understand the query plan. I have
read the "performance tips" and the FAQ but they didn't get me much further.

I would appreciate it if someone could help me understand the query plan
and what general options I can test. I think at this moment the most
expensive part is the "Sort". Am I right? If so, how could I generally
avoid it (turning something on or off, using parentheses for JOINs,
etc.) to force a more efficient query plan?

Thank you for any suggestions.

QUERY PLAN

Merge Right Join  (cost=9868.84..9997.74 rows=6364 width=815) (actual time=9982.022..10801.216 rows=6364 loops=1)
  Merge Cond: ("outer".idpk = "inner".cadastralunitidfk)
  ->  Index Scan using cadastralunits_pkey on cadastralunits  (cost=0.00..314.72 rows=13027 width=31) (actual time=0.457..0.552 rows=63 loops=1)
  ->  Sort  (cost=9868.84..9884.75 rows=6364 width=788) (actual time=9981.405..10013.708 rows=6364 loops=1)
        Sort Key: addevicessites.cadastralunitidfk
        ->  Hash Left Join  (cost=5615.03..7816.51 rows=6364 width=788) (actual time=3898.603..9884.248 rows=6364 loops=1)
              Hash Cond: ("outer".addevicessitepartnerstickeridfk = "inner".idpk)
              ->  Hash Left Join  (cost=5612.27..7718.29 rows=6364 width=762) (actual time=3898.243..9104.791 rows=6364 loops=1)
                    Hash Cond: ("outer".addevicessitepartnermaintaineridfk = "inner".idpk)
                    ->  Hash Left Join  (cost=5609.51..7620.06 rows=6364 width=736) (actual time=3897.996..8341.965 rows=6364 loops=1)
                          Hash Cond: ("outer".addevicessitepartnerelectricitysupplieridfk = "inner".idpk)
                          ->  Hash Left Join  (cost=5606.74..7521.84 rows=6364 width=710) (actual time=3897.736..7572.182 rows=6364 loops=1)
                                Hash Cond: ("outer".addevicessitepartneridentificationoperatoridfk = "inner".idpk)
                                ->  Nested Loop Left Join  (cost=5603.98..7423.62 rows=6364 width=684) (actual time=3897.436..6821.713 rows=6364 loops=1)
                                      Join Filter: ("outer".addevicessitestatustypeidfk = "inner".idpk)
                                      ->  Nested Loop Left Join  (cost=5602.93..6706.61 rows=6364 width=657) (actual time=3897.294..6038.976 rows=6364 loops=1)
                                            Join Filter: ("outer".addevicessitepositionidfk = "inner".idpk)
                                            ->  Nested Loop Left Join  (cost=5601.89..6276.01 rows=6364 width=634) (actual time=3897.158..5303.575 rows=6364 loops=1)
                                                  Join Filter: ("outer".addevicessitevisibilityidfk = "inner".idpk)
                                                  ->  Merge Right Join  (cost=5600.85..5702.21 rows=6364 width=602) (actual time=3896.963..4583.749 rows=6364 loops=1)
                                                        Merge Cond: ("outer".idpk = "inner".addevicessitesizeidfk)
                                                        ->  Index Scan using addevicessitesizes_pkey on addevicessitesizes  (cost=0.00..5.62 rows=110 width=14) (actual time=0.059..0.492 rows=110 loops=1)
                                                        ->  Sort  (cost=5600.85..5616.76 rows=6364 width=592) (actual time=3896.754..3915.022 rows=6364 loops=1)
                                                              Sort Key: addevicessites.addevicessitesizeidfk
                                                              ->  Hash Left Join  (cost=2546.59..4066.81 rows=6364 width=592) (actual time=646.162..3792.310 rows=6364 loops=1)
                                                                    Hash Cond: ("outer".addevicessitedistrictidfk = "inner".idpk)
                                                                    ->  Hash Left Join  (cost=2539.29..3964.05 rows=6364 width=579) (actual time=645.296..3142.128 rows=6364 loops=1)
                                                                          Hash Cond: ("outer".addevicessitestreetdescriptionidfk = "inner".idpk)
                                                                          ->  Hash Left Join  (cost=2389.98..2724.64 rows=6364 width=544) (actual time=632.806..2466.030 rows=6364 loops=1)
                                                                                Hash Cond: ("outer".addevicessitestreetidfk = "inner".idpk)
                                                                                ->  Hash Left Join  (cost=2324.25..2515.72 rows=6364 width=518) (actual time=626.081..1822.137 rows=6364 loops=1)
                                                                                      Hash Cond: ("outer".addevicessitecityidfk = "inner".idpk)
                                                                                      ->  Merge Right Join  (cost=2321.70..2417.71 rows=6364 width=505) (actual time=625.598..1220.967 rows=6364 loops=1)
                                                                                            Merge Cond: ("outer".idpk = "inner".addevicessitecountyidfk)
                                                                                            ->  Sort  (cost=5.83..6.10 rows=110 width=17) (actual time=0.348..0.391 rows=110 loops=1)
                                                                                                  Sort Key: addevicessitecounties.idpk
                                                                                                  ->  Seq Scan on addevicessitecounties  (cost=0.00..2.10 rows=110 width=17) (actual time=0.007..0.145 rows=110 loops=1)
                                                                                            ->  Sort  (cost=2315.87..2331.78 rows=6364 width=492) (actual time=625.108..640.325 rows=6364 loops=1)
                                                                                                  Sort Key: addevicessites.addevicessitecountyidfk
                                                                                                  ->  Merge Right Join  (cost=0.00..1006.90 rows=6364 width=492) (actual time=0.145..543.043 rows=6364 loops=1)
                                                                                                        Merge Cond: ("outer".idpk = "inner".addevicessiteregionidfk)
                                                                                                        ->  Index Scan using addevicessiteregions_pkey on addevicessiteregions  (cost=0.00..3.17 rows=15 width=23) (actual time=0.011..0.031 rows=15 loops=1)
                                                                                                        ->  Index Scan using addevicessites_addevicessiteregionidfk on addevicessites  (cost=0.00..924.14 rows=6364 width=473) (actual time=0.010..9.825 rows=6364 loops=1)
                                                                                      ->  Hash  (cost=2.24..2.24 rows=124 width=17) (actual time=0.238..0.238 rows=0 loops=1)
                                                                                            ->  Seq Scan on addevicessitecities  (cost=0.00..2.24 rows=124 width=17) (actual time=0.009..0.145 rows=124 loops=1)
                                                                                ->  Hash  (cost=58.58..58.58 rows=2858 width=34) (actual time=6.532..6.532 rows=0 loops=1)
                                                                                      ->  Seq Scan on addevicessitestreets  (cost=0.00..58.58 rows=2858 width=34) (actual time=0.040..4.129 rows=2858 loops=1)
                                                                          ->  Hash  (cost=96.85..96.85 rows=4585 width=43) (actual time=11.786..11.786 rows=0 loops=1)
                                                                                ->  Seq Scan on addevicessitestreetdescriptions  (cost=0.00..96.85 rows=4585 width=43) (actual time=0.036..7.290 rows=4585 loops=1)
                                                                    ->  Hash  (cost=6.44..6.44 rows=344 width=21) (actual time=0.730..0.730 rows=0 loops=1)
                                                                          ->  Seq Scan on addevicessitedistricts  (cost=0.00..6.44 rows=344 width=21) (actual time=0.027..0.478 rows=344 loops=1)
                                                  ->  Materialize  (cost=1.04..1.08 rows=4 width=36) (actual time=0.000..0.002 rows=4 loops=6364)
                                                        ->  Seq Scan on addevicessitevisibilities  (cost=0.00..1.04 rows=4 width=36) (actual time=0.036..0.050 rows=4 loops=1)
                                            ->  Materialize  (cost=1.03..1.06 rows=3 width=27) (actual time=0.001..0.002 rows=3 loops=6364)
                                                  ->  Seq Scan on addevicessitepositions  (cost=0.00..1.03 rows=3 width=27) (actual time=0.013..0.017 rows=3 loops=1)
                                      ->  Materialize  (cost=1.05..1.10 rows=5 width=31) (actual time=0.000..0.002 rows=5 loops=6364)
                                            ->  Seq Scan on addevicessitestatustypes  (cost=0.00..1.05 rows=5 width=31) (actual time=0.012..0.019 rows=5 loops=1)
                                ->  Hash  (cost=2.61..2.61 rows=61 width=34) (actual time=0.171..0.171 rows=0 loops=1)
                                      ->  Seq Scan on partneridentifications partneridentificationsoperator  (cost=0.00..2.61 rows=61 width=34) (actual time=0.027..0.126 rows=61 loops=1)
                          ->  Hash  (cost=2.61..2.61 rows=61 width=34) (actual time=0.130..0.130 rows=0 loops=1)
                                ->  Seq Scan on partners partnerselectricitysupplier  (cost=0.00..2.61 rows=61 width=34) (actual time=0.003..0.076 rows=61 loops=1)
                    ->  Hash  (cost=2.61..2.61 rows=61 width=34) (actual time=0.118..0.118 rows=0 loops=1)
                          ->  Seq Scan on partners partnersmaintainer  (cost=0.00..2.61 rows=61 width=34) (actual time=0.003..0.075 rows=61 loops=1)
              ->  Hash  (cost=2.61..2.61 rows=61 width=34) (actual time=0.171..0.171 rows=0 loops=1)
                    ->  Seq Scan on partners partnerssticker  (cost=0.00..2.61 rows=61 width=34) (actual time=0.029..0.120 rows=61 loops=1)

Total runtime: 10811.567 ms


--
Miroslav Šulc



Re: How to read query plan

From: John Arbash Meinel
Miroslav Šulc wrote:

> Hi all,
>
> I am new to PostgreSQL and query optimizations. We have recently moved
> our project from MySQL to PostgreSQL and we are having a performance
> problem with one of our most often used queries. On MySQL the speed
> was sufficient but PostgreSQL chooses a time-expensive query plan. I
> would like to optimize it somehow but the query plan from EXPLAIN
> ANALYZE is a little bit cryptic to me.
>
> So the first thing I would like is to understand the query plan. I
> have read the "performance tips" and the FAQ but they didn't get me
> much further.
>
> I would appreciate it if someone could help me understand the query
> plan and what general options I can test. I think at this moment the
> most expensive part is the "Sort". Am I right? If so, how could I
> generally avoid it (turning something on or off, using parentheses
> for JOINs, etc.) to force a more efficient query plan?
>
> Thank you for any suggestions.
>
You really need to post the original query, so we can see *why* postgres
thinks it needs to run the plan this way.

Also, the final sort actually isn't that expensive.

When you have the numbers (cost=xxx..yyy), the xxx is the cost before the
step can return its first row, and the yyy is the total cost for the step
to return all of its rows (the actual time=xxx..yyy pair reads the same
way, in milliseconds). A lot of steps can start returning rows while
their sub-steps are still feeding back more data; others have to wait for
the sub-steps to finish.

The first thing to look for, is to make sure the estimated number of
rows is close to the actual number of rows. If they are off, then
postgres may be mis-estimating the optimal plan. (If postgres thinks it
is going to only need 10 rows, it may use an index scan, but when 1000
rows are returned, a seq scan might have been faster.)
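
For example (just a sketch -- "cadastralunits" is taken from your plan;
substitute your real query):

ANALYZE cadastralunits;
EXPLAIN ANALYZE SELECT * FROM cadastralunits;

If the estimated and actual "rows=" values now agree, the planner at
least has good statistics to work with.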

You seem to be doing a lot of outer joins. Is that necessary? I don't
really know what you are looking for, but you are joining against enough
tables, that I think this query is always going to be slow.

From what I can tell, you have 1 table which has 6364 rows, and you are
grabbing all of those rows, and then outer joining it with about 11
other tables.

I would actually guess that the most expensive parts of the plan are the
NESTED LOOPs, which, when they go to materialize, have to do a sequential
scan, and they get executed 6364 times. It looks like the other tables
are small (only 3-5 rows), so each seqscan takes about 0.05 ms; the
problem is that because you are doing it 6k times, it ends up taking
about 300 ms of your time.

You could try setting "set enable_nestloop to off".
I don't know that it will be faster, but it could be.
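
For example (a sketch only -- the setting lasts just for the current
session, and the two tables here are stand-ins taken from your plan):

SET enable_nestloop TO off;
EXPLAIN ANALYZE
SELECT s.idpk, v.*
FROM addevicessites s
LEFT JOIN addevicessitevisibilities v
    ON s.addevicessitevisibilityidfk = v.idpk;
SET enable_nestloop TO on;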

In general, though, it seems like you should be asking a different
question, rather than trying to optimize the query that you have.

Can you post the original SQL statement, and maybe describe what you are
trying to do?

John
=:->



Re: How to read query plan

From: Ragnar Hafstað
On Sun, 2005-03-13 at 16:32 +0100, Miroslav Šulc wrote:
> Hi all,
>
> I am new to PostgreSQL and query optimizations. We have recently moved
> our project from MySQL to PostgreSQL and we are having a performance
> problem with one of our most often used queries. On MySQL the speed was
> sufficient but PostgreSQL chooses a time-expensive query plan. I would
> like to optimize it somehow but the query plan from EXPLAIN ANALYZE is a
> little bit cryptic to me.
>

[snip output of EXPLAIN ANALYZE]

for those of us who have not yet reached the level where one can
infer it from the query plan, how about showing us the actual
query too?

but as an example of what to look for, consider the first few lines
(reformatted):

> Merge Right Join  (cost=9868.84..9997.74 rows=6364 width=815)
>                   (actual time=9982.022..10801.216 rows=6364 loops=1)
>   Merge Cond: ("outer".idpk = "inner".cadastralunitidfk)
>   ->  Index Scan using cadastralunits_pkey on cadastralunits
>       (cost=0.00..314.72 rows=13027 width=31)
>       (actual time=0.457..0.552 rows=63 loops=1)
>   ->  Sort  (cost=9868.84..9884.75 rows=6364 width=788)
>             (actual time=9981.405..10013.708 rows=6364 loops=1)

notice that the index scan is expected to return 13027 rows, but
actually returns 63. this might influence the choice of plan.

gnari




Re: How to read query plan

From: Miroslav Šulc
Hi John,

thank you for your response.

John Arbash Meinel wrote:

> You really need to post the original query, so we can see *why* postgres
> thinks it needs to run the plan this way.

Here it is:

SELECT AdDevicesSites.IDPK, AdDevicesSites.AdDevicesSiteSizeIDFK,
AdDevicesSites.AdDevicesSiteRegionIDFK,
AdDevicesSites.AdDevicesSiteCountyIDFK,
AdDevicesSites.AdDevicesSiteCityIDFK,
AdDevicesSites.AdDevicesSiteDistrictIDFK,
AdDevicesSites.AdDevicesSiteStreetIDFK,
AdDevicesSites.AdDevicesSiteStreetDescriptionIDFK,
AdDevicesSites.AdDevicesSitePositionIDFK,
AdDevicesSites.AdDevicesSiteVisibilityIDFK,
AdDevicesSites.AdDevicesSiteStatusTypeIDFK,
AdDevicesSites.AdDevicesSitePartnerIdentificationOperatorIDFK,
AdDevicesSites.AdDevicesSitePartnerElectricitySupplierIDFK,
AdDevicesSites.AdDevicesSitePartnerMaintainerIDFK,
AdDevicesSites.AdDevicesSitePartnerStickerIDFK,
AdDevicesSites.CadastralUnitIDFK, AdDevicesSites.MediaType,
AdDevicesSites.Mark, AdDevicesSites.Amount, AdDevicesSites.Distance,
AdDevicesSites.OwnLightening, AdDevicesSites.LocationDownTown,
AdDevicesSites.LocationSuburb, AdDevicesSites.LocationBusinessDistrict,
AdDevicesSites.LocationResidentialDistrict,
AdDevicesSites.LocationIndustrialDistrict,
AdDevicesSites.LocationNoBuildings, AdDevicesSites.ParkWayHighWay,
AdDevicesSites.ParkWayFirstClassRoad, AdDevicesSites.ParkWayOtherRoad,
AdDevicesSites.ParkWayStreet, AdDevicesSites.ParkWayAccess,
AdDevicesSites.ParkWayExit, AdDevicesSites.ParkWayParkingPlace,
AdDevicesSites.ParkWayPassangersOnly, AdDevicesSites.ParkWayCrossRoad,
AdDevicesSites.PositionStandAlone,
AdDevicesSites.NeighbourhoodPublicTransportation,
AdDevicesSites.NeighbourhoodInterCityTransportation,
AdDevicesSites.NeighbourhoodPostOffice,
AdDevicesSites.NeighbourhoodNewsStand,
AdDevicesSites.NeighbourhoodAmenities,
AdDevicesSites.NeighbourhoodSportsSpot,
AdDevicesSites.NeighbourhoodHealthServiceSpot,
AdDevicesSites.NeighbourhoodShops,
AdDevicesSites.NeighbourhoodShoppingCenter,
AdDevicesSites.NeighbourhoodSuperMarket,
AdDevicesSites.NeighbourhoodPetrolStation,
AdDevicesSites.NeighbourhoodSchool, AdDevicesSites.NeighbourhoodBank,
AdDevicesSites.NeighbourhoodRestaurant,
AdDevicesSites.NeighbourhoodHotel, AdDevicesSites.RestrictionCigarettes,
AdDevicesSites.RestrictionPolitics, AdDevicesSites.RestrictionSpirits,
AdDevicesSites.RestrictionSex, AdDevicesSites.RestrictionOther,
AdDevicesSites.RestrictionNote, AdDevicesSites.SpotMapFile,
AdDevicesSites.SpotPhotoFile, AdDevicesSites.SourcePhotoTimeStamp,
AdDevicesSites.SourceMapTimeStamp, AdDevicesSites.Price,
AdDevicesSites.WebPrice, AdDevicesSites.CadastralUnitCode,
AdDevicesSites.BuildingNumber, AdDevicesSites.ParcelNumber,
AdDevicesSites.GPSLatitude, AdDevicesSites.GPSLongitude,
AdDevicesSites.GPSHeight, AdDevicesSites.MechanicalOpticalCoordinates,
AdDevicesSites.Deleted, AdDevicesSites.Protected,
AdDevicesSites.DateCreated, AdDevicesSites.DateLastModified,
AdDevicesSites.DateDeleted, AdDevicesSites.CreatedByUserIDFK,
AdDevicesSites.LastModifiedByUserIDFK, AdDevicesSites.DeletedByUserIDFK,
AdDevicesSites.PhotoLastModificationDate,
AdDevicesSites.MapLastModificationDate, AdDevicesSites.DateLastImported,
AdDevicesSiteRegions.Name AS AdDevicesSiteRegionName,
AdDevicesSiteCounties.Name AS AdDevicesSiteCountyName,
AdDevicesSiteCities.Name AS AdDevicesSiteCityName,
AdDevicesSiteStreets.Name AS AdDevicesSiteStreetName,
AdDevicesSiteDistricts.Name AS AdDevicesSiteDistrictName,
AdDevicesSiteStreetDescriptions.Name_cs AS
AdDevicesSiteStreetDescriptionName_cs,
AdDevicesSiteStreetDescriptions.Name_en AS
AdDevicesSiteStreetDescriptionName_en, AdDevicesSiteSizes.Name AS
AdDevicesSiteSizeName, SUBSTRING(AdDevicesSiteVisibilities.Name_cs, 3)
AS AdDevicesSiteVisibilityName_cs,
SUBSTRING(AdDevicesSiteVisibilities.Name_en, 3) AS
AdDevicesSiteVisibilityName_en, AdDevicesSitePositions.Name_cs AS
AdDevicesSitePositionName_cs, AdDevicesSitePositions.Name_en AS
AdDevicesSitePositionName_en, AdDevicesSiteStatusTypes.Name_cs AS
AdDevicesSiteStatusTypeName_cs, AdDevicesSiteStatusTypes.Name_en AS
AdDevicesSiteStatusTypeName_en, PartnerIdentificationsOperator.Name AS
PartnerIdentificationOperatorName, PartnersElectricitySupplier.Name AS
PartnerElectricitySupplierName, PartnersMaintainer.Name AS
PartnerMaintainerName, PartnersSticker.Name AS PartnerStickerName,
CadastralUnits.Code AS CadastralUnitCodeNative, CadastralUnits.Name AS
CadastralUnitName
FROM AdDevicesSites
LEFT JOIN AdDevicesSiteRegions ON AdDevicesSites.AdDevicesSiteRegionIDFK
= AdDevicesSiteRegions.IDPK
LEFT JOIN AdDevicesSiteCounties ON
AdDevicesSites.AdDevicesSiteCountyIDFK = AdDevicesSiteCounties.IDPK
LEFT JOIN AdDevicesSiteCities ON AdDevicesSites.AdDevicesSiteCityIDFK =
AdDevicesSiteCities.IDPK
LEFT JOIN AdDevicesSiteStreets ON AdDevicesSites.AdDevicesSiteStreetIDFK
= AdDevicesSiteStreets.IDPK
LEFT JOIN AdDevicesSiteStreetDescriptions ON
AdDevicesSites.AdDevicesSiteStreetDescriptionIDFK =
AdDevicesSiteStreetDescriptions.IDPK
LEFT JOIN AdDevicesSiteDistricts ON
AdDevicesSites.AdDevicesSiteDistrictIDFK = AdDevicesSiteDistricts.IDPK
LEFT JOIN AdDevicesSiteSizes ON AdDevicesSites.AdDevicesSiteSizeIDFK =
AdDevicesSiteSizes.IDPK
LEFT JOIN AdDevicesSiteVisibilities ON
AdDevicesSites.AdDevicesSiteVisibilityIDFK = AdDevicesSiteVisibilities.IDPK
LEFT JOIN AdDevicesSitePositions ON
AdDevicesSites.AdDevicesSitePositionIDFK = AdDevicesSitePositions.IDPK
LEFT JOIN AdDevicesSiteStatusTypes ON
AdDevicesSites.AdDevicesSiteStatusTypeIDFK = AdDevicesSiteStatusTypes.IDPK
LEFT JOIN PartnerIdentifications AS PartnerIdentificationsOperator ON
AdDevicesSites.AdDevicesSitePartnerIdentificationOperatorIDFK =
PartnerIdentificationsOperator.IDPK
LEFT JOIN Partners AS PartnersElectricitySupplier ON
AdDevicesSites.AdDevicesSitePartnerElectricitySupplierIDFK =
PartnersElectricitySupplier.IDPK
LEFT JOIN Partners AS PartnersMaintainer ON
AdDevicesSites.AdDevicesSitePartnerMaintainerIDFK = PartnersMaintainer.IDPK
LEFT JOIN Partners AS PartnersSticker ON
AdDevicesSites.AdDevicesSitePartnerStickerIDFK = PartnersSticker.IDPK
LEFT JOIN CadastralUnits ON AdDevicesSites.CadastralUnitIDFK =
CadastralUnits.IDPK

> Also, the final sort actually isn't that expensive.
>
> When you have the numbers (cost=xxx..yyy), the xxx is the cost before the
> step can return its first row, and the yyy is the total cost for the step
> to return all of its rows (the actual time=xxx..yyy pair reads the same
> way, in milliseconds). A lot of steps can start returning rows while
> their sub-steps are still feeding back more data; others have to wait for
> the sub-steps to finish.

This is the bit of information I didn't find in the documentation and
was looking for. Thank you for the enlightenment :-) With this knowledge
I can see that the JOINs are the bottleneck.

> The first thing to look for, is to make sure the estimated number of
> rows is close to the actual number of rows. If they are off, then
> postgres may be mis-estimating the optimal plan. (If postgres thinks it
> is going to only need 10 rows, it may use an index scan, but when 1000
> rows are returned, a seq scan might have been faster.)

The "row=" numbers are equal to those of the total count of items in
that tables (generated by VACUUM ANALYZE).

> You seem to be doing a lot of outer joins. Is that necessary?

These external tables contain information that is a unique parameter of
the AdDevice (like Position, Region, County, City, etc.), some of them
containing a localized description of the property attribute. Some of
them could be moved into the main table but that would create redundancy;
some of them cannot be moved into the main table (like the information
about Partners, which is definitely a separate object with respect to
AdDevices). I think the names of the tables are self-explanatory so it
should be clear what each table stores. Is this design incorrect?

In fact, we only need about 30 records at a time but LIMIT can speed up
the query only when looking for the first 30 records. Setting OFFSET
slows the query down.

> I don't
> really know what you are looking for, but you are joining against enough
> tables, that I think this query is always going to be slow.

In MySQL the query was not so slow and I don't see any reason why there
should be large differences in SELECT speed. But if the design of the
tables is incorrect, we will correct it.

> From what I can tell, you have 1 table which has 6364 rows, and you are
> grabbing all of those rows, and then outer joining it with about 11
> other tables.

Here are the exact numbers:

AdDevicesSites - 6364
AdDevicesSiteRegions - 15
AdDevicesSiteCounties - 110
AdDevicesSiteCities - 124
AdDevicesSiteStreets - 2858
AdDevicesSiteStreetDescriptions - 4585
AdDevicesSiteDistricts - 344
AdDevicesSiteSizes - 110
AdDevicesSiteVisibilities - 4
AdDevicesSitePositions - 3
AdDevicesSiteStatusTypes - 5
PartnerIdentifications - 61
Partners - 61
CadastralUnits - 13027

> I would actually guess that the most expensive parts of the plan are the
> NESTED LOOPs, which, when they go to materialize, have to do a sequential
> scan, and they get executed 6364 times. It looks like the other tables
> are small (only 3-5 rows), so each seqscan takes about 0.05 ms; the
> problem is that because you are doing it 6k times, it ends up taking
> about 300 ms of your time.
>
> You could try setting "set enable_nestloop to off".
> I don't know that it will be faster, but it could be.

I have tried that and it resulted in about a 2 sec slowdown :-(

> In general, though, it seems like you should be asking a different
> question, rather than trying to optimize the query that you have.

You mean "how should I improve the design to make the query faster"?

> Can you post the original SQL statement, and maybe describe what you are
> trying to do?

I hope the explanation above is clear and sufficient :-)

>
> John
> =:->
>


Re: How to read query plan

From: Miroslav Šulc
Hi Ragnar,

Ragnar Hafstað wrote:

>[snip output of EXPLAIN ANALYZE]
>
>for those of us who have not yet reached the level where one can
>infer it from the query plan, how about showing us the actual
>query too?
>
>
I thought it would be sufficient to show where the main bottleneck is.
And in fact, the query is rather lengthy. But I have included it in my
response to John, so sorry for the incompleteness.

>but as an example of what to look for, consider the first few lines
>(reformatted):
>
>
>>Merge Right Join  (cost=9868.84..9997.74 rows=6364 width=815)
>>                  (actual time=9982.022..10801.216 rows=6364 loops=1)
>>  Merge Cond: ("outer".idpk = "inner".cadastralunitidfk)
>>  ->  Index Scan using cadastralunits_pkey on cadastralunits
>>      (cost=0.00..314.72 rows=13027 width=31)
>>      (actual time=0.457..0.552 rows=63 loops=1)
>>  ->  Sort  (cost=9868.84..9884.75 rows=6364 width=788)
>>            (actual time=9981.405..10013.708 rows=6364 loops=1)
>>
>>
>notice that the index scan is expected to return 13027 rows, but
>actually returns 63. this might influence the choice of plan.
>
>
Yes, the situation in this scenario is that the CadastralUnits table
contains all units in the country, but the AdDevices in this case come
from only 63 CadastralUnits. So the result - 63 rows - is just this
little subset. On top of that, not all AdDevices have CadastralUnitIDFK
set to an IDPK that exists in CadastralUnits; some have it set to zero
(= no CadastralUnit set).

>gnari
>
>
Miroslav Šulc


Re: How to read query plan

From: John Arbash Meinel
Miroslav Šulc wrote:

> Hi John,
>
> thank you for your response.
>
How about a quick side track.
Have you played around with your shared_buffers, maintenance_work_mem,
and work_mem settings?
What version of postgres are you using? The above names changed in 8.0,
and 8.0 also has some performance improvements over the 7.4 series.

What is your hardware? Are you testing this while there is load on the
system, or under no load?
Are you re-running the query multiple times, and reporting the later
speeds, or just the first time? (If nothing is loaded into memory, the
first run is easily 10x slower than later ones.)

Just some background info. If you have set these to reasonable values,
we probably don't need to spend much time here, but it's just one of
those things to check.

John
=:->



Re: How to read query plan

From: John Arbash Meinel
Miroslav Šulc wrote:

> Hi John,
>
> thank you for your response.
>
I will comment on things separately.

> John Arbash Meinel wrote:
>
...

> These external tables contain information that is a unique parameter
> of the AdDevice (like Position, Region, County, City, etc.), some of
> them containing a localized description of the property attribute.
> Some of them could be moved into the main table but that would create
> redundancy; some of them cannot be moved into the main table (like the
> information about Partners, which is definitely a separate object with
> respect to AdDevices). I think the names of the tables are
> self-explanatory so it should be clear what each table stores. Is this
> design incorrect?
>
It's actually more of a question as to why you are doing left outer
joins, rather than simple joins.
Are the tables not fully populated? If so, why not?

How are you using this information? Why is it useful to get back rows
that don't have all of their information filled out?
Why is it useful to have so many columns returned? It seems like in most
cases, you are only going to be able to use *some* of the information,
so why not create more queries that are specialized, rather than one
get-everything query?

> In fact, we only need about 30 records at a time but LIMIT can
> speed up the query only when looking for the first 30 records. Setting
> OFFSET slows the query down.
>
Have you thought about using a cursor instead of using limit + offset?
This may not help the overall time, but it might let you split up when
the time is spent.
BEGIN;
DECLARE <cursor_name> CURSOR FOR SELECT ... FROM ...;
FETCH FORWARD 30 FROM <cursor_name>;
FETCH FORWARD 30 FROM <cursor_name>;
...
END;

>> I don't
>> really know what you are looking for, but you are joining against enough
>> tables, that I think this query is always going to be slow.
>
>
> In MySQL the query was not so slow and I don't see any reason why
> there should be large differences in SELECT speed. But if the design
> of the tables is incorrect, we will correct it.
>
In the other post I asked about your postgres settings. The defaults are
pretty stingy, so that *might* be an issue.

>> From what I can tell, you have 1 table which has 6364 rows, and you are
>> grabbing all of those rows, and then outer joining it with about 11
>> other tables.
>
>
> Here are the exact numbers:
>
> AdDevicesSites - 6364
> AdDevicesSiteRegions - 15
> AdDevicesSiteCounties - 110
> AdDevicesSiteCities - 124
> AdDevicesSiteStreets - 2858
> AdDevicesSiteStreetDescriptions - 4585
> AdDevicesSiteDistricts - 344
> AdDevicesSiteSizes - 110
> AdDevicesSiteVisibilities - 4
> AdDevicesSitePositions - 3
> AdDevicesSiteStatusTypes - 5
> PartnerIdentifications - 61
> Partners - 61
> CadastralUnits - 13027
>
And if I understand correctly, you consider all of these to be outer
joins. Meaning you want *all* of AdDevicesSites, and whatever info goes
along with it, but there are no restrictions as to what rows you want.
You want everything you can get.

Do you actually need *everything*? You mention only needing 30, what for?

>> I would actually guess that the most expensive parts of the plan are the
>> NESTED LOOPs, which, when they go to materialize, have to do a sequential
>> scan, and they get executed 6364 times. It looks like the other tables
>> are small (only 3-5 rows), so each seqscan takes about 0.05 ms; the
>> problem is that because you are doing it 6k times, it ends up taking
>> about 300 ms of your time.
>>
>> You could try setting "set enable_nestloop to off".
>> I don't know that it will be faster, but it could be.
>
>
> I have tried that and it resulted in about a 2 sec slowdown :-(

Generally, the optimizer *does* select the best query plan, as long as
it has accurate statistics, which it seems to in this case.

>
>> In general, though, it seems like you should be asking a different
>> question, rather than trying to optimize the query that you have.
>
>
> You mean "how should I improve the design to make the query faster"?
>
There is one possibility if we don't find anything nicer, which is to
create a lazy materialized view. Basically, you run this query and
store the result in a table. Then when you want to do the SELECT, you
just do that against the unrolled table.
You can then create triggers, etc. to keep the data up to date.
Here is some good documentation of it:
http://jonathangardner.net/PostgreSQL/materialized_views/matviews.html

It is basically a way that you can un-normalize data, in a safe way.
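
A minimal sketch of the idea (the "addevicessites_flat" name is made up,
and this shows only the snapshot half -- the link above covers keeping it
fresh with triggers):

CREATE TABLE addevicessites_flat AS
SELECT s.*, r.Name AS AdDevicesSiteRegionName  -- plus the other joined columns
FROM AdDevicesSites s
LEFT JOIN AdDevicesSiteRegions r
    ON s.AdDevicesSiteRegionIDFK = r.IDPK;

Your SELECTs (with LIMIT/OFFSET, filters, etc.) would then run against
addevicessites_flat instead of the 14-way join.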

Also, another thing that you can do, instead of using a cursor, is
create a temporary table with the results of the query, with a primary
key which is just a simple counter. Then instead of doing limit +
offset, you can select * where id > 0 and id <= 30; ... select * where
id > 30 and id <= 60; etc.
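
A sketch of that idea (8.0 has no ROW_NUMBER, so a temporary sequence
can provide the counter; the names here are made up):

CREATE TEMPORARY SEQUENCE page_seq;
CREATE TEMPORARY TABLE page_cache AS
    SELECT nextval('page_seq') AS row_id, t.*
    FROM (SELECT * FROM AdDevicesSites ORDER BY IDPK) AS t;
SELECT * FROM page_cache
    WHERE row_id > 30 AND row_id <= 60;  -- rows 31-60, i.e. page 2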

It still requires the original query to be run, though, so it is not
necessarily optimal for you.

>> Can you post the original SQL statement, and maybe describe what you are
>> trying to do?
>
>
> I hope the explanation above is clear and sufficient :-)
>
>>
>> John
>> =:->
>

Unfortunately, I don't really see any obvious problems with your query
in the way that you are using it. The problem is that you are not
applying any selectivity, so postgres has to go to all the tables, and
get all the rows, and then try to logically merge them together. It is
doing a hash merge, which is generally one of the faster ones and it
seems to be doing the right thing.

I would be curious to see how mysql was handling this query, to see if
there was something different it was trying to do. I'm also curious how
much of a difference there was.

John
=:->



Re: How to read query plan

From: Miroslav Šulc
John Arbash Meinel wrote:

> How about a quick side track.
> Have you played around with your shared_buffers, maintenance_work_mem,
> and work_mem settings?

I have tried to set shared_buffers to 48000 now but there is no speedup
(11,098.813 ms on the third try). The others are still at the defaults.
I'll look at the documentation and play with the other parameters.

> What version of postgres are you using?

8.0.1

> The above names changed in 8.0,
> and 8.0 also has some performance improvements over the 7.4 series.
>
> What is your hardware?

My dev notebook Acer TravelMate 292LMi
$ cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 9
model name      : Intel(R) Pentium(R) M processor 1500MHz
stepping        : 5
cpu MHz         : 1495.485
cache size      : 1024 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr mce cx8 sep mtrr pge mca cmov
pat clflush dts acpi mmx fxsr sse sse2 tm pbe est tm2
bogomips        : 2957.31

$ cat /proc/meminfo
MemTotal:       516136 kB
MemFree:         18024 kB
Buffers:         21156 kB
Cached:         188868 kB
SwapCached:         24 kB
Active:         345596 kB
Inactive:       119344 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       516136 kB
LowFree:         18024 kB
SwapTotal:     1004020 kB
SwapFree:      1003996 kB
Dirty:               4 kB
Writeback:           0 kB
Mapped:         343676 kB
Slab:            18148 kB
CommitLimit:   1262088 kB
Committed_AS:   951536 kB
PageTables:       2376 kB
VmallocTotal:   516056 kB
VmallocUsed:     90528 kB
VmallocChunk:   424912 kB

IDE disc.

# hdparm -Tt /dev/hda
/dev/hda:
 Timing cached reads:   1740 MB in  2.00 seconds = 870.13 MB/sec
 Timing buffered disk reads:   40 MB in  3.30 seconds =  12.10 MB/sec

> Are you testing this while there is load on the
> system, or under no load?

The load is low. This is a few seconds after I ran the EXPLAIN ANALYZE.

# cat /proc/loadavg
0.31 0.51 0.33 1/112 6909

> Are you re-running the query multiple times, and reporting the later
> speeds, or just the first time? (If nothing is loaded into memory, the
> first run is easily 10x slower than later ones.)

The times change only a little. The first run was about 13 sec, the
second about 10 sec, the third about 11 sec, etc.

> Just some background info. If you have set these to reasonable values,
> we probably don't need to spend much time here, but it's just one of
> those things to check.

Sure, you are right. I'll try the other parameters.

>
> John
> =:->
>
Miroslav


Re: How to read query plan

From: Miroslav Šulc
John Arbash Meinel wrote:

> It's actually more of a question as to why you are doing left outer
> joins, rather than simple joins.
> Are the tables not fully populated? If so, why not?

Some records do not contain full information (they are collected from
different sources which use different approaches to data collection),
so using INNER JOIN would cause some records not to be displayed, which
is unacceptable.

> How are you using this information? Why is it useful to get back rows
> that don't have all of their information filled out?

Each row contains the main information, which is important. The other
information is also important but may be missing. The information is
displayed in lists of 30 rows or on a card. When using a filter the
query is much faster, but the case without a filter gives these results.

> Why is it useful to have so many columns returned? It seems like in most
> cases, you are only going to be able to use *some* of the information,
> so why not create more queries that are specialized, rather than one
> get-everything query?

Many of the columns are just varchar(1) (because of the migration from
the MySQL enum field type) so the record is not as long as it might seem.
These fields are just switches (Y(es) or N(o)). The problem is users can
define their own templates, and in different scenarios different
information might be displayed, so reducing the number of fields would
mean in some cases it wouldn't work as expected. But if we can't
speed the query up, we will try to improve it another way.
Is there any serious reason not to use so many fields, except memory
usage? It seems to me that it shouldn't have a great impact on the speed
in this case.

> Have you thought about using a cursor instead of using limit + offset?
> This may not help the overall time, but it might let you split up when
> the time is spent.
> ......

No. I come from the MySQL world where these things are not common (at
least when using MyISAM databases). The other reason (if I understand it
well) is that the retrieval of the packages of 30 records is not
sequential. Our app is web based and we use paging. The user can select
page 1 and then page 10, then go backward to page 9, etc.

> And if I understand correctly, you consider all of these to be outer
> joins. Meaning you want *all* of AdDevicesSites, and whatever info goes
> along with it, but there are no restrictions as to what rows you want.
> You want everything you can get.
>
> Do you actually need *everything*? You mention only needing 30, what for?

For display of a single page consisting of 30 rows. The reason I query
all rows is that this is one of the filters users can use. The user can
display just bigboards or billboards (or specify more advanced filters)
but he/she can also display AdDevices without any filter (page by page).
Before I select the 30 rows, I need to order them by a key and after
that select the records, so this is also the reason to ask for all rows.
The key for sorting might be different for each run.

> There is one possibility if we don't find anything nicer, which is to
> create a lazy materialized view. Basically, you run this query and
> store the result in a table. Then when you want to do the SELECT, you
> just do that against the unrolled table.
> You can then create triggers, etc. to keep the data up to date.
> Here is some good documentation of it:
> http://jonathangardner.net/PostgreSQL/materialized_views/matviews.html
>
> It is basically a way that you can un-normalize data, in a safe way.
>
> Also, another thing that you can do, instead of using a cursor, is
> create a temporary table with the results of the query, with a primary
> key which is just a simple counter. Then instead of doing limit +
> offset, you can select * where id > 0 and id <= 30; ... select * where
> id > 30 and id <= 60; etc.
>
> It still requires the original query to be run, though, so it is not
> necessarily optimal for you.

These might be the next steps in case we cannot speed up the query. I
would prefer to speed the query up :-)

> Unfortunately, I don't really see any obvious problems with your query
> in the way that you are using it. The problem is that you are not
> applying any selectivity, so postgres has to go to all the tables, and
> get all the rows, and then try to logically merge them together. It is
> doing a hash merge, which is generally one of the faster ones and it
> seems to be doing the right thing.
>
> I would be curious to see how mysql was handling this query, to see if
> there was something different it was trying to do. I'm also curious how
> much of a difference there was.

In fact, on MySQL I didn't see any slow reactions so I didn't measure
and inspect it. But I can try it if I figure out how to copy the
database from PostgreSQL to MySQL.

>
> John
> =:->
>
Thank you for your time and help.

Miroslav


Re: How to read query plan

From: Tom Lane
John Arbash Meinel <john@arbash-meinel.com> writes:
> How about a quick side track.
> Have you played around with your shared_buffers, maintenance_work_mem,
> and work_mem settings?

Indeed.  The hash joins seem unreasonably slow considering how little
data they are processing (unless this is being run on some ancient
toaster...).  One thought that comes to mind is that work_mem may be
set so small that the hashes are forced into multiple batches.

Another question worth asking is what are the data types of the columns
being joined on.  If they are character types, what locale and encoding
is the database using?
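
Both are easy to check, for instance:

SHOW server_encoding;
SHOW lc_collate;

and \d tablename in psql will show the column datatypes.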

> Are you re-running the query multiple times, and reporting the later
> speeds, or just the first time? (If nothing is loaded into memory, the
> first run is easily 10x slower than later ones.)

That cost would be paid during the bottom-level scans though.  The thing
that strikes me here is that nearly all of the cost is being spent
joining.

> What version of postgres are you using?

And what's the platform (hardware and OS)?

            regards, tom lane

Re: How to read query plan

From: John Arbash Meinel
Miroslav Šulc wrote:

> John Arbash Meinel wrote:

...

> Many of the columns are just varchar(1) (because of the migration from
> the MySQL enum field type) so the record is not as long as it might
> seem. These fields are just switches (Y(es) or N(o)). The problem is
> users can define their own templates, and in different scenarios
> different information might be displayed, so reducing the number of
> fields would mean in some cases it wouldn't work as expected. But if
> we can't speed the query up, we will try to improve it another way.
> Is there any serious reason not to use so many fields, except memory
> usage? It seems to me that it shouldn't have a great impact on the
> speed in this case.

Is there a reason to use varchar(1) instead of char(1)? There is
probably 0 performance difference; I'm just curious.

>
>> Have you thought about using a cursor instead of using limit + offset?
>> This may not help the overall time, but it might let you split up when
>> the time is spent.
>> ......
>
>
> No. I come from the MySQL world where these things are not common (at
> least when using MyISAM databases). The other reason (if I understand
> it well) is that the retrieval of the packages of 30 records is not
> sequential. Our app is web based and we use paging. The user can select
> page 1 and then page 10, then go backward to page 9, etc.
>
Well, with cursors you can also do "FETCH ABSOLUTE 1 FROM
<cursor_name>", which sets the cursor position, and then you can "FETCH
FORWARD 30".
I honestly don't know how the performance will be, but it is something
that you could try.
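
A sketch of what fetching page 10 might look like (the cursor name and
the inner query are placeholders):

BEGIN;
DECLARE sites_cur SCROLL CURSOR FOR
    SELECT * FROM AdDevicesSites ORDER BY IDPK;  -- your full query here
MOVE ABSOLUTE 270 IN sites_cur;   -- position just before row 271
FETCH FORWARD 30 FROM sites_cur;  -- rows 271..300 = page 10
COMMIT;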

>> And if I understand correctly, you consider all of these to be outer
>> joins. Meaning you want *all* of AdDevicesSites, and whatever info goes
>> along with it, but there are no restrictions as to what rows you want.
>> You want everything you can get.
>>
>> Do you actually need *everything*? You mention only needing 30, what
>> for?
>
>
> For display of a single page consisting of 30 rows. The reason I query
> all rows is that this is one of the filters users can use. The user can
> display just bigboards or billboards (or specify more advanced filters)
> but he/she can also display AdDevices without any filter (page by page).
> Before I select the 30 rows, I need to order them by a key and after
> that select the records, so this is also the reason to ask for all
> rows. The key for sorting might be different for each run.
>
How are you caching the information in the background in order to
support paging? Since you aren't using limit/offset, and you don't seem
to be creating a temporary table, I assume you have a layer in between
the web server and the database (or possibly inside the webserver) which
keeps track of current session information. Is that true?

> These might be the next steps in case we cannot speed up the query. I
> would prefer to speed the query up :-)

Naturally a fast query comes first. I just have the feeling it is either
a postgres configuration problem or a problem intrinsic to postgres.
Given your constraints, there's not much that we can change about the
query itself.

> In fact, on MySQL I didn't see any slow reactions so I didn't measure
> and inspect it. But I can try it if I figure out how to copy the
> database from PostgreSQL to MySQL.

I figured you still had a copy of the MySQL around to compare to. You
probably don't need to spend too much time on it yet.

John
=:->


Re: How to read query plan

From: Miroslav Šulc
Tom Lane wrote:

>John Arbash Meinel <john@arbash-meinel.com> writes:
>
>
>>How about a quick side track.
>>Have you played around with your shared_buffers, maintenance_work_mem,
>>and work_mem settings?
>>
>>
>
>Indeed.  The hash joins seem unreasonably slow considering how little
>data they are processing (unless this is being run on some ancient
>toaster...).  One thought that comes to mind is that work_mem may be
>set so small that the hashes are forced into multiple batches.
>
>
I've just tried to uncomment the settings for these parameters, with
no impact on the query speed.

shared_buffers = 48000          # min 16, at least max_connections*2, 8KB each
work_mem = 1024         # min 64, size in KB
maintenance_work_mem = 16384    # min 1024, size in KB
max_stack_depth = 2048          # min 100, size in KB

>Another question worth asking is what are the data types of the columns
>being joined on.  If they are character types, what locale and encoding
>is the database using?
>
>
I have checked this and there are some JOINs of smallint against integer.
Is that a problem? I would use smallint for the IDPKs of some smaller
tables, but the lack of SMALLSERIAL and my laziness made me use SERIAL
instead, which is integer.

>That cost would be paid during the bottom-level scans though.  The thing
>that strikes me here is that nearly all of the cost is being spent
>joining.
>
>
>>What version of postgres are you using?
>>
>>
>
>And what's the platform (hardware and OS)?
>
>
I've already posted the hardware info. OS is Linux (Gentoo) with kernel
2.6.11.

>            regards, tom lane
>
>
Miroslav


Re: How to read query plan

From: Miroslav Šulc
John Arbash Meinel wrote:

> Is there a reason to use varchar(1) instead of char(1)? There is
> probably 0 performance difference; I'm just curious.

No, not at all. I'm just not used to char().

> Well, with cursors you can also do "FETCH ABSOLUTE 1 FROM
> <cursor_name>", which sets the cursor position, and then you can "FETCH
> FORWARD 30".
> I honestly don't know how the performance will be, but it is something
> that you could try.

This is science for me at this moment :-)

>> For display of a single page consisting of 30 rows. The reason I query
>> all rows is that this is one of the filters users can use. The user
>> can display just bigboards or billboards (or specify more advanced
>> filters) but he/she can also display AdDevices without any filter
>> (page by page). Before I select the 30 rows, I need to order them by a
>> key and after that select the records, so this is also the reason to
>> ask for all rows. The key for sorting might be different for each
>> run.
>>
> How are you caching the information in the background in order to
> support paging? Since you aren't using limit/offset, and you don't seem
> to be creating a temporary table, I assume you have a layer in between
> the web server and the database (or possibly inside the webserver) which
> keeps track of current session information. Is that true?

I just need three pieces of information:
1) used filter (stored in session, identified by filter index in query
string)
2) page length (static predefined)
3) what page to display (in query string)

>> In fact, on MySQL I didn't see any slow reactions so I didn't measure
>> and inspect it. But I can try it if I figure out how to copy the
>> database from PostgreSQL to MySQL.
>
>
> I figured you still had a copy of the MySQL around to compare to. You
> probably don't need to spend too much time on it yet.

It's not so simple because there are some differences between MySQL and
PostgreSQL in how they handle case sensitivity, etc. The database table
structures are not the same either, because of differences in data type
and data value support.

>
> John
> =:->

Miroslav


Re: How to read query plan

From: Tom Lane
Miroslav Šulc <miroslav.sulc@startnet.cz> writes:
> I've just tried to uncomment the settings for these parameters, with
> no impact on the query speed.

> shared_buffers = 48000          # min 16, at least max_connections*2, 8KB each
> work_mem = 1024         # min 64, size in KB
> maintenance_work_mem = 16384    # min 1024, size in KB
> max_stack_depth = 2048          # min 100, size in KB

Hmm.  Given the small size of the auxiliary tables, you'd think they'd
fit in 1MB work_mem no problem.  But try bumping work_mem up to 10MB
just to see if it makes a difference.  (BTW, you do know that altering
the .conf file doesn't in itself do anything?  You have to SIGHUP the
postmaster to make it notice the change ... and for certain parameters
such as shared_buffers, you actually have to stop and restart the
postmaster.  You can use the SHOW command to verify whether a change
has taken effect.)
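
For instance:

SHOW work_mem;           -- verify what the server is actually using
SET work_mem = 10240;    -- 10MB, for the current session only

SET lets you experiment per-session without touching the .conf file at
all.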

> I have checked this and there are some JOINs of smallint against integer.
> Is that a problem?

That probably explains why some of the joins are merges instead of
hashes --- hash join doesn't work across datatypes.  Doesn't seem like
it should be a huge problem though.  I was more concerned about the
possibility of slow locale-dependent string comparisons.
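
If you want to experiment, 8.0 can change a column's type in place, so
that both sides of a join have the same type and a hash join becomes
possible (the column here is hypothetical -- pick the ones that are
actually smallint):

ALTER TABLE AdDevicesSites
    ALTER COLUMN AdDevicesSitePositionIDFK TYPE integer;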

            regards, tom lane

Re: How to read query plan

From: Miroslav Šulc
Tom Lane wrote:

>Miroslav Šulc <miroslav.sulc@startnet.cz> writes:
>
>
>>shared_buffers = 48000          # min 16, at least max_connections*2, 8KB each
>>work_mem = 1024         # min 64, size in KB
>>maintenance_work_mem = 16384    # min 1024, size in KB
>>max_stack_depth = 2048          # min 100, size in KB
>>
>>
>
>Hmm.  Given the small size of the auxiliary tables, you'd think they'd
>fit in 1MB work_mem no problem.  But try bumping work_mem up to 10MB
>just to see if it makes a difference.  (BTW, you do know that altering
>the .conf file doesn't in itself do anything?  You have to SIGHUP the
>postmaster to make it notice the change ... and for certain parameters
>such as shared_buffers, you actually have to stop and restart the
>postmaster.  You can use the SHOW command to verify whether a change
>has taken effect.)
>
>
I've tried to set work_mem to 10240, restarted the postmaster and tried
the EXPLAIN ANALYZE, but there is only about a 200 ms speedup.

>>I have checked this and there are some JOINs of smallint against integer.
>>Is that a problem?
>>
>>
>That probably explains why some of the joins are merges instead of
>hashes --- hash join doesn't work across datatypes.  Doesn't seem like
>it should be a huge problem though.  I was more concerned about the
>possibility of slow locale-dependent string comparisons.
>
>
There are only JOINs of number against number. I've tried to change one
of the fields from smallint to integer but there was no speedup.

>            regards, tom lane
>
>
Miroslav


Re: How to read query plan

From: Tom Lane
Miroslav Šulc <miroslav.sulc@startnet.cz> writes:
> There are only JOINs of number against number.

Hmph.  There's no reason I can see that hash joins should be as slow as
they seem to be in your test.

Is the data confidential?  If you'd be willing to send me a pg_dump
off-list, I'd like to replicate this test and try to see where the time
is going.

            regards, tom lane

Re: How to read query plan

From: Tom Lane
Miroslav Šulc <miroslav.sulc@startnet.cz> writes:
>> Is the data confidential?  If you'd be willing to send me a pg_dump
>> off-list, I'd like to replicate this test and try to see where the time
>> is going.
>>
> Thank you very much for your offer. The data are partially confidential
> so I hashed some of the text information and changed some values (not
> the values for the JOINs) so I could send it to you. I've checked the
> EXPLAIN ANALYZE to see if anything changed and the result is nearly the
> same (maybe about 1 sec slower - maybe because the hashing made the text
> data longer).

No problem; thank you for supplying the test case.  What I find is
rather surprising: most of the runtime can be blamed on disassembling
and reassembling tuples during the join steps.  Here are the hot spots
according to gprof:

-----------------------------------------------
                1.27    7.38    8277/103737      ExecScan [16]
                2.93   17.02   19092/103737      ExecNestLoop <cycle 2> [14]
                3.91   22.70   25456/103737      ExecMergeJoin <cycle 2> [13]
                7.81   45.40   50912/103737      ExecHashJoin <cycle 2> [12]
[9]     86.3   15.92   92.50  103737         ExecProject [9]
                7.65   76.45 8809835/9143692     ExecEvalVar [10]
                3.42    4.57  103737/103775      heap_formtuple [17]
                0.03    0.24   12726/143737      ExecMakeFunctionResultNoSets [24]
                0.02    0.12  103737/290777      ExecStoreTuple [44]
                0.01    0.00       2/2           ExecEvalFunc [372]
                0.00    0.00       2/22          ExecMakeFunctionResult [166]
-----------------------------------------------
                0.00    0.00      42/9143692     ExecEvalFuncArgs [555]
                0.05    0.51   59067/9143692     ExecHashGetHashValue [32]
                0.24    2.38  274748/9143692     ExecMakeFunctionResultNoSets [24]
                7.65   76.45 8809835/9143692     ExecProject [9]
[10]    69.5    7.94   79.34 9143692         ExecEvalVar [10]
               79.34    0.00 8750101/9175517     nocachegetattr [11]
-----------------------------------------------

I think the reason this is popping to the top of the runtime is that the
joins are so wide (an average of ~85 columns in a join tuple according
to the numbers above).  Because there are lots of variable-width columns
involved, most of the time the fast path for field access doesn't apply
and we end up going to nocachegetattr --- which itself is going to be
slow because it has to scan over so many columns.  So the cost is
roughly O(N^2) in the number of columns.

As a short-term hack, you might be able to improve matters if you can
reorder your LEFT JOINs to have the minimum number of columns
propagating up from the earlier join steps.  In other words make the
later joins add more columns than the earlier, as much as you can.
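
Roughly along these lines (only a sketch of the shape, not tested):

SELECT wide.*, r.Name AS AdDevicesSiteRegionName  -- wide columns last
FROM (SELECT IDPK, AdDevicesSiteRegionIDFK FROM AdDevicesSites) AS narrow
LEFT JOIN AdDevicesSiteRegions r
    ON r.IDPK = narrow.AdDevicesSiteRegionIDFK
LEFT JOIN AdDevicesSites wide ON wide.IDPK = narrow.IDPK;

i.e. carry only the join keys through the early steps and attach the
~80 wide columns in the last join.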

This is actually good news, because before 8.0 we had much worse
problems than this with extremely wide tuples --- there were O(N^2)
behaviors all over the place due to the old List algorithms.  Neil
Conway's rewrite of the List code got rid of those problems, and now
we can see the places that are left to optimize.  The fact that there
seems to be just one is very nice indeed.

Since ExecProject operations within a nest of joins are going to be
dealing entirely with Vars, I wonder if we couldn't speed matters up
by having a short-circuit case for a projection that is only Vars.
Essentially it would be a lot like execJunk.c, except able to cope
with two input tuples.  Using heap_deformtuple instead of retail
extraction of fields would eliminate the O(N^2) penalty for wide tuples.

            regards, tom lane

Re: [HACKERS] How to read query plan

From: Tom Lane
I wrote:
> Since ExecProject operations within a nest of joins are going to be
> dealing entirely with Vars, I wonder if we couldn't speed matters up
> by having a short-circuit case for a projection that is only Vars.
> Essentially it would be a lot like execJunk.c, except able to cope
> with two input tuples.  Using heap_deformtuple instead of retail
> extraction of fields would eliminate the O(N^2) penalty for wide tuples.

Actually, we already had a pending patch (from Atsushi Ogawa) that
eliminates that particular O(N^2) behavior in another way.  After
applying it, I get about a factor-of-4 reduction in the runtime for
Miroslav's example.

ExecEvalVar and associated routines are still a pretty good fraction of
the runtime, so it might still be worth doing something like the above,
but it'd probably be just a marginal win instead of a big win.

            regards, tom lane

Re: How to read query plan

From
Miroslav Šulc
Date:
Tom Lane wrote:

>...
>I think the reason this is popping to the top of the runtime is that the
>joins are so wide (an average of ~85 columns in a join tuple according
>to the numbers above).  Because there are lots of variable-width columns
>involved, most of the time the fast path for field access doesn't apply
>and we end up going to nocachegetattr --- which itself is going to be
>slow because it has to scan over so many columns.  So the cost is
>roughly O(N^2) in the number of columns.
>
>
As there are a lot of varchar(1) in the AdDevicesSites table, wouldn't
it be helpful to change them to char(1)? Would it solve the
variable-width problem at least for some fields and speed the query up?

>As a short-term hack, you might be able to improve matters if you can
>reorder your LEFT JOINs to have the minimum number of columns
>propagating up from the earlier join steps.  In other words make the
>later joins add more columns than the earlier, as much as you can.
>
>
That will be hard, as the main table, which contains most of the fields,
is LEFT JOINed with the others. I'll look at it and see if I can find
some way to improve it.

I'm not sure whether I understand the process of performing the plan,
but I imagine that the data from AdDevicesSites are retrieved only once
when they are loaded and maybe stored in memory. Are the columns stored
in the order they are in the SQL command? If so, wouldn't it be useful
to move all varchar fields to the end of the SELECT query? I'm just
guessing because I don't know at all how a database server is
implemented and what it really does.

>..
>            regards, tom lane
>
>
Miroslav


Re: [HACKERS] How to read query plan

From
Miroslav Šulc
Date:
Tom Lane wrote:

>I wrote:
>
>
>>Since ExecProject operations within a nest of joins are going to be
>>dealing entirely with Vars, I wonder if we couldn't speed matters up
>>by having a short-circuit case for a projection that is only Vars.
>>Essentially it would be a lot like execJunk.c, except able to cope
>>with two input tuples.  Using heap_deformtuple instead of retail
>>extraction of fields would eliminate the O(N^2) penalty for wide tuples.
>>
>>
>
>Actually, we already had a pending patch (from Atsushi Ogawa) that
>eliminates that particular O(N^2) behavior in another way.  After
>applying it, I get about a factor-of-4 reduction in the runtime for
>Miroslav's example.
>
>
Is there a chance we will see this patch in the 8.0.2 release? And when
can we expect this release?

>ExecEvalVar and associated routines are still a pretty good fraction of
>the runtime, so it might still be worth doing something like the above,
>but it'd probably be just a marginal win instead of a big win.
>
>            regards, tom lane
>
>
Miroslav

Re: How to read query plan

From
Miroslav Šulc
Date:
John Arbash Meinel wrote:

>> In fact, on MySQL I didn't see any slow reactions so I didn't measure
>> and inspect it. But I can try it if I figure out how to copy the
>> database from PostgreSQL to MySQL.
>
>
> I figured you still had a copy of the MySQL around to compare to. You
> probably don't need to spend too much time on it yet.

So I have some results. I have tested the query on both PostgreSQL 8.0.1
and MySQL 4.1.8 with LIMIT set to 30 and OFFSET set to 6000. PostgreSQL
result is 11,667.916 ms, MySQL result is 448.4 ms.

Both databases are running on the same machine (my laptop) and contain
the same data. However there are some differences in the data table
definitions:
1) in PostgreSQL I use 'varchar(1)' for a lot of fields and in MySQL I
use 'enum'
2) in PostgreSQL in some cases I use connection fields that are not of
the same type (smallint <-> integer (SERIAL)), in MySQL I use the same types

>
> John
> =:->

Miroslav


Re: How to read query plan

From
Christopher Kings-Lynne
Date:
> 1) in PostgreSQL I use 'varchar(1)' for a lot of fields and in MySQL I
> use 'enum'
> 2) in PostgreSQL in some cases I use connection fields that are not of
> the same type (smallint <-> integer (SERIAL)), in MySQL I use the same
> types

Well both those things will make PostgreSQL slower...

Chris

Re: How to read query plan

From
PFC
Date:
    Instead of a varchar(1) containing 'y' or 'n' you could use a BOOL or an
integer.
    Your query seems of the form :

    SELECT FROM main_table LEFT JOIN a lot of tables ORDER BY sort_key LIMIT
N OFFSET M;

    I would suggest rewriting it in a simpler way: instead of generating
the whole result set, sorting it, and then grabbing a slice, generate only
the row ids, grab a slice, and then generate the full rows from that.

    - If you order by a field which is in main_table :
    SELECT FROM main_table LEFT JOIN a lot of tables WHERE main_table.id IN
(SELECT id FROM main_table ORDER BY sort_key LIMIT N OFFSET M
) ORDER BY sort_key LIMIT N OFFSET M;

    - If you order by a field in one of the child tables, I guess you only
want to display the rows in the main table which have this field, ie.
not-null in the LEFT JOIN. You can also use the principle above.

    - You can use a straight join instead of an IN.
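
    Spelled out with hypothetical names (main has id and sort_key; child is
one of the LEFT JOINed decorating tables), the first variant above is
roughly:

SELECT main.*, child.name
FROM main
LEFT JOIN child ON child.id = main.child_id
WHERE main.id IN (
    -- sort and paginate on the narrow key column alone...
    SELECT id FROM main ORDER BY sort_key LIMIT 30 OFFSET 6000
)
-- ...then order just the surviving rows
ORDER BY main.sort_key;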


On Mon, 14 Mar 2005 09:58:49 +0100, Miroslav Šulc
<miroslav.sulc@startnet.cz> wrote:

> John Arbash Meinel wrote:
>
>>> In fact, on MySQL I didn't see any slow reactions so I didn't measure
>>> and inspect it. But I can try it if I figure out how to copy the
>>> database from PostgreSQL to MySQL.
>>
>>
>> I figured you still had a copy of the MySQL around to compare to. You
>> probably don't need to spend too much time on it yet.
>
> So I have some results. I have tested the query on both PostgreSQL 8.0.1
> and MySQL 4.1.8 with LIMIT set to 30 and OFFSET set to 6000. PostgreSQL
> result is 11,667.916 ms, MySQL result is 448.4 ms.
>
> Both databases are running on the same machine (my laptop) and contain
> the same data. However there are some differences in the data table
> definitions:
> 1) in PostgreSQL I use 'varchar(1)' for a lot of fields and in MySQL I
> use 'enum'
> 2) in PostgreSQL in some cases I use connection fields that are not of
> the same type (smallint <-> integer (SERIAL)), in MySQL I use the same
> types
>
>>
>> John
>> =:->
>
> Miroslav



Re: How to read query plan

From
Miroslav Šulc
Date:
Christopher Kings-Lynne wrote:

>> 1) in PostgreSQL I use 'varchar(1)' for a lot of fields and in MySQL
>> I use 'enum'
>> 2) in PostgreSQL in some cases I use connection fields that are not
>> of the same type (smallint <-> integer (SERIAL)), in MySQL I use the
>> same types
>
>
> Well both those things will make PostgreSQL slower...

I think so. I'll change the varchar(1) fields to char(1) where possible
and will think about what to do with the smallint <-> integer JOINs.
Something like SMALLSERIAL would be pleasant :-) I thought I would wait
for Tom Lane's reaction to the improvement suggestions I posted in
another mail, but maybe it is deep night in his time zone.

>
> Chris

Miroslav


Re: How to read query plan

From
Miroslav Šulc
Date:
PFC wrote:

>
>     Instead of a varchar(1) containing 'y' or 'n' you could use a BOOL
> or an  integer.

Sure I could. The problem is our project still supports both MySQL and
PostgreSQL. We used enum('Y','N') in MySQL, so there would be a lot of
changes in the code if we changed to the BOOL data type.

>     Your query seems of the form :
>
>     SELECT FROM main_table LEFT JOIN a lot of tables ORDER BY sort_key
> LIMIT  N OFFSET M;
>
>     I would suggest rewriting it in a simpler way: instead of
> generating  the whole result set, sorting it, and then grabbing a
> slice, generate only  the row ids, grab a slice, and then generate
> the full rows from that.
>
>     - If you order by a field which is in main_table :
>     SELECT FROM main_table LEFT JOIN a lot of tables WHERE
> main_table.id IN  (SELECT id FROM main_table ORDER BY sort_key LIMIT N
> OFFSET M
> ) ORDER BY sort_key LIMIT N OFFSET M;
>
>     - If you order by a field in one of the child tables, I guess you
> only  want to display the rows in the main table which have this
> field, ie.  not-null in the LEFT JOIN. You can also use the principle
> above.
>
>     - You can use a straight join instead of an IN.

Do you mean something like this?

SELECT Table.IDPK, Table2.varchar1, Table2.varchar2, ...
FROM Table
LEFT JOIN many tables
INNER JOIN Table AS Table2 ON Table.IDPK = Table2.IDPK

Miroslav


Re: How to read query plan

From
Tom Lane
Date:
Miroslav Šulc <miroslav.sulc@startnet.cz> writes:
> As there are a lot of varchar(1) in the AdDevicesSites table, wouldn't
> it be helpful to change them to char(1)? Would it solve the
> variable-width problem at least for some fields and speed the query up?

No, because char(1) isn't physically fixed-width (consider multibyte
encodings).  There's really no advantage to char(N) in Postgres.

I don't know what you're doing with those fields, but if they are
effectively booleans or small codes you might be able to convert them to
bool or int fields.  There is also the "char" datatype (not to be
confused with char(1)) which can hold single ASCII characters, but is
nonstandard and a bit impoverished as to functionality.
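
For instance, a y/n column could be rewritten as a real boolean like this
(a sketch only; OwnLightening is one of the columns from Miroslav's table,
and the 'Y'/'N' convention is an assumption):

ALTER TABLE AdDevicesSites
    ALTER COLUMN OwnLightening TYPE boolean
    -- the USING clause maps the old text values onto true/false
    USING (OwnLightening = 'Y');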

However, I doubt this is worth pursuing.  One of the things I tested
yesterday was a quick hack to organize the storage of intermediate join
tuples with fixed-width fields first and non-fixed ones later.  It
really didn't help much at all :-(.  I think the trouble with your
example is that in the existing code, the really fast path applies only
when the tuple contains no nulls --- and since you're doing all that
left joining, there's frequently at least one null lurking.

            regards, tom lane

Re: [HACKERS] How to read query plan

From
Tom Lane
Date:
Miroslav Šulc <miroslav.sulc@startnet.cz> writes:
> Tom Lane wrote:
>> Actually, we already had a pending patch (from Atsushi Ogawa) that
>> eliminates that particular O(N^2) behavior in another way.  After
>> applying it, I get about a factor-of-4 reduction in the runtime for
>> Miroslav's example.
>>
> Is there a chance we will see this patch in the 8.0.2 release?

No.  We are not in the habit of making non-bug-fix changes in stable
branches.  Ogawa's patch is in CVS for 8.1.

            regards, tom lane

Re: How to read query plan

From
Harald Fuchs
Date:
In article <423556B8.4020500@startnet.cz>,
Miroslav Šulc <miroslav.sulc@startnet.cz> writes:

>> Instead of a varchar(1) containing 'y' or 'n' you could use a
>> BOOL or an  integer.

> Sure I could. The problem is our project still supports both MySQL and
> PostgreSQL. We used enum('Y','N') in MySQL, so there would be a lot of
> changes in the code if we changed to the BOOL data type.

Since BOOL is exactly what you want to express and since MySQL also
supports BOOL (*), you should make that change anyway.

(*) MySQL recognizes BOOL as a column type and silently uses
TINYINT(1) instead.
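
So a table definition like the following sketch should work unchanged on
both systems (assuming a MySQL version recent enough to accept BOOL):

CREATE TABLE example_flags (
    -- BOOLEAN in PostgreSQL, TINYINT(1) in MySQL
    deleted BOOL NOT NULL
);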

Re: How to read query plan

From
Miroslav Šulc
Date:
Tom Lane wrote:

>Miroslav Šulc <miroslav.sulc@startnet.cz> writes:
>
>
>>As there are a lot of varchar(1) in the AdDevicesSites table, wouldn't
>>it be helpful to change them to char(1)? Would it solve the variable-width
>>problem at least for some fields and speed the query up?
>>
>>
>
>No, because char(1) isn't physically fixed-width (consider multibyte
>encodings).  There's really no advantage to char(N) in Postgres.
>
>
I was aware of that :-(

>I don't know what you're doing with those fields, but if they are
>effectively booleans or small codes you might be able to convert them to
>bool or int fields.  There is also the "char" datatype (not to be
>confused with char(1)) which can hold single ASCII characters, but is
>nonstandard and a bit impoverished as to functionality.
>
>
The problem lies in the migration from MySQL to PostgreSQL. In MySQL we
(badly) chose enum for yes/no switches (as far as I know there's nothing
like a boolean field type in MySQL, but we could have used tinyint). It
will be very time consuming to rewrite all such enums and check whether
the code still works.

>However, I doubt this is worth pursuing.  One of the things I tested
>yesterday was a quick hack to organize the storage of intermediate join
>tuples with fixed-width fields first and non-fixed ones later.  It
>really didn't help much at all :-(.  I think the trouble with your
>example is that in the existing code, the really fast path applies only
>when the tuple contains no nulls --- and since you're doing all that
>left joining, there's frequently at least one null lurking.
>
>
Unfortunately I don't see any other way than LEFT JOINing in this case.

>            regards, tom lane
>
>
Miroslav


Re: How to read query plan

From
Miroslav Šulc
Date:
Harald Fuchs wrote:

>>Sure I could. The problem is our project still supports both MySQL and
>>PostgreSQL. We used enum('Y','N') in MySQL so there would be a lot of
>>changes in the code if we would change to the BOOL data type.
>>
>>
>
>Since BOOL is exactly what you want to express and since MySQL also
>supports BOOL (*), you should make that change anyway.
>
>
I know that. The time will have to come.

>(*) MySQL recognizes BOOL as a column type and silently uses
>TINYINT(1) instead.
>
>
I've checked that and you are right, but BOOL has been in MySQL only
since version 4.1.0. Still, we could have used tinyint instead of enum -
our bad choice.

Miroslav


Re: How to read query plan

From
John Arbash Meinel
Date:
Miroslav Šulc wrote:

> Tom Lane wrote:
>
>> ...
>> I think the reason this is popping to the top of the runtime is that the
>> joins are so wide (an average of ~85 columns in a join tuple according
>> to the numbers above).  Because there are lots of variable-width columns
>> involved, most of the time the fast path for field access doesn't apply
>> and we end up going to nocachegetattr --- which itself is going to be
>> slow because it has to scan over so many columns.  So the cost is
>> roughly O(N^2) in the number of columns.
>>
>>
> As there are a lot of varchar(1) in the AdDevicesSites table, wouldn't
> it be helpful to change them to char(1)? Would it solve the
> variable-width problem at least for some fields and speed the query up?
>
I'm guessing there really wouldn't be a difference. I think varchar()
and char() are stored the same way, just one always has space padding. I
believe they are both varlena types, so they are still "variable" length.

>> As a short-term hack, you might be able to improve matters if you can
>> reorder your LEFT JOINs to have the minimum number of columns
>> propagating up from the earlier join steps.  In other words make the
>> later joins add more columns than the earlier, as much as you can.
>>
>>
> That will be hard, as the main table, which contains most of the fields,
> is LEFT JOINed with the others. I'll look at it and see if I can find
> some way to improve it.

One thing that you could try, is to select just the primary keys from
the main table, and then later on, join back to that table to get the
rest of the columns. It is a little bit hackish, but if it makes your
query faster, you might want to try it.
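
A rough sketch of that shape (hypothetical names; only the primary keys go
through the sort and slice, and the wide columns are fetched at the end):

SELECT m2.*            -- the bulk of the columns comes from the re-join
FROM (
    SELECT idpk FROM main_table ORDER BY idpk LIMIT 30 OFFSET 6000
) AS keys
JOIN main_table AS m2 ON m2.idpk = keys.idpk;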

>
> I'm not sure whether I understand the process of performing the plan,
> but I imagine that the data from AdDevicesSites are retrieved only
> once when they are loaded and maybe stored in memory. Are the columns
> stored in the order they are in the SQL command? If so, wouldn't it be
> useful to move all varchar fields to the end of the SELECT query? I'm
> just guessing because I don't know at all how a database server is
> implemented and what it really does.
>
I don't think they are stored in the order of the SELECT <> portion. I'm
guessing they are loaded and saved as you go, but the order of the
LEFT JOINs at the end is probably important.

>> ..
>>             regards, tom lane
>>
>>
> Miroslav


John
=:->



Re: How to read query plan

From
Greg Stark
Date:
Miroslav Šulc <miroslav.sulc@startnet.cz> writes:

> I think so. I'll change the varchar(1) fields to char(1) where possible

char isn't faster than varchar on postgres. If anything it may be slightly
slower because every comparison first needs to pad both sides with spaces.

--
greg

Re: How to read query plan

From
Tom Lane
Date:
Miroslav Šulc <miroslav.sulc@startnet.cz> writes:
> PFC wrote:
>> Instead of a varchar(1) containing 'y' or 'n' you could use a BOOL
>> or an  integer.

> Sure I could. The problem is our project still supports both MySQL and
> PostgreSQL. We used enum('Y','N') in MySQL, so there would be a lot of
> changes in the code if we changed to the BOOL data type.

Just FYI, I did a quick search-and-replace on your dump to replace
varchar(1) by "char", which makes the column fixed-width without any
change in the visible data.  This made hardly any difference in the
join speed though :-(.  So that is looking like a dead end.

John's idea about re-joining to the main table to pick up the bulk of
its fields only after joining to the sub-tables might work.

            regards, tom lane

Re: How to read query plan

From
Miroslav Šulc
Date:
Tom Lane wrote:

>Just FYI, I did a quick search-and-replace on your dump to replace
>varchar(1) by "char", which makes the column fixed-width without any
>change in the visible data.  This made hardly any difference in the
>join speed though :-(.  So that is looking like a dead end.
>
>
I'll try to change the data type to bool, but the problem I face is that
the code expects SELECTs to return 'Y' or 'N', while from what I have
found out so far, PostgreSQL returns 't' or 'f' for bool data. I'm
thinking about some solutions, but they all use CPU :-(

>John's idea about re-joining to the main table to pick up the bulk of
>its fields only after joining to the sub-tables might work.
>
>
I'll try that. It seems it could work.

>            regards, tom lane
>
>
Miroslav


Re: How to read query plan

From
John Arbash Meinel
Date:
Miroslav Šulc wrote:

> PFC wrote:
>
>>     Your query seems of the form :
>>
>>     SELECT FROM main_table LEFT JOIN a lot of tables ORDER BY
>> sort_key LIMIT  N OFFSET M;
>>
>>     I would suggest rewriting it in a simpler way: instead of
>> generating  the whole result set, sorting it, and then grabbing a
>> slice, generate only  the row ids, grab a slice, and then generate
>> the full rows from that.
>>
>>     - If you order by a field which is in main_table :
>>     SELECT FROM main_table LEFT JOIN a lot of tables WHERE
>> main_table.id IN  (SELECT id FROM main_table ORDER BY sort_key LIMIT
>> N OFFSET M
>> ) ORDER BY sort_key LIMIT N OFFSET M;
>>
>>     - If you order by a field in one of the child tables, I guess you
>> only  want to display the rows in the main table which have this
>> field, ie.  not-null in the LEFT JOIN. You can also use the principle
>> above.
>>
>>     - You can use a straight join instead of an IN.
>
>
> Do you mean something like this?
>
> SELECT Table.IDPK, Table2.varchar1, Table2.varchar2, ...
> FROM Table
> LEFT JOIN many tables
> INNER JOIN Table AS Table2 ON Table.IDPK = Table2.IDPK
>
> Miroslav

I would also recommend using the subselect format, where any columns
that you are going to need to sort on show up in the subselect.

So you would have:

SELECT ...
    FROM main_table
    LEFT JOIN tablea ON ...
    LEFT JOIN tableb ON ...
    ...
    JOIN other_table ON ...
    WHERE main_table.idpk IN
       (SELECT idpk
            FROM main_table JOIN other_table ON main_table.idpk =
other_table.<main_idpk>
            WHERE ...
            ORDER BY other_table.abcd LIMIT n OFFSET m)
;

I think the final LIMIT + OFFSET would give you the wrong results, since
you have already filtered down to the important rows.
I also think you don't need the final ORDER BY, since the results should
already be in sorted order.

Now this also assumes that if someone is sorting on a column, then they
don't want null entries. If they do, then you can change the subselect
into a left join. But with appropriate selectivity and indexes, an inner
join can filter out a lot of rows, and give you better performance.

The inner subselect gives you selectivity on the main table, so that you
don't have to deal with all the columns in the search, and then you
don't have to deal with all the rows later on.

I think you can also do this:

SELECT ...
    FROM (SELECT main_table.idpk, other_table.<columns> FROM main_table
JOIN other_table ....) as p
    LEFT JOIN ...
    JOIN main_table ON main_table.idpk = p.idpk;

In that case instead of selecting out the id and putting that into the
where, you put it in the from, and then join against it.
I don't really know which is better.

John
=:->



Re: How to read query plan

From
Kaloyan Iliev Iliev
Date:
Hi,

I have an idea about your problem. Would it be possible to change not
the entire code but only the queries? You could change the type in
Postgres to bool. Then, when selecting data, you can use a CASE..WHEN to
return 'Y' or 'N', or even write a little function which accepts a bool
and returns 'Y' or 'N'. In that case, in all your queries you would have
to replace the select of the bool field with a select from the function.
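
A sketch of such a function (the name yn is made up here):

-- maps true/false onto the 'Y'/'N' the application expects
CREATE OR REPLACE FUNCTION yn(boolean) RETURNS text AS '
    SELECT CASE WHEN $1 THEN ''Y'' ELSE ''N'' END;
' LANGUAGE SQL IMMUTABLE;

-- hypothetical usage: SELECT yn(Deleted) FROM AdDevicesSites;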

Kaloyan

Miroslav Šulc wrote:

> Tom Lane wrote:
>
>> Just FYI, I did a quick search-and-replace on your dump to replace
>> varchar(1) by "char", which makes the column fixed-width without any
>> change in the visible data.  This made hardly any difference in the
>> join speed though :-(.  So that is looking like a dead end.
>
> I'll try to change the data type to bool, but the problem I face is that
> the code expects SELECTs to return 'Y' or 'N', while from what I have
> found out so far, PostgreSQL returns 't' or 'f' for bool data. I'm
> thinking about some solutions, but they all use CPU :-(
>
>> John's idea about re-joining to the main table to pick up the bulk of
>> its fields only after joining to the sub-tables might work.
>
> I'll try that. It seems it could work.
>
>>             regards, tom lane
>
> Miroslav


Re: How to read query plan

From
Miroslav Šulc
Date:
Kaloyan Iliev Iliev wrote:

> Hi,
>
> I have an idea about your problem. Would it be possible to change not
> the entire code but only the queries? You could change the type in
> Postgres to bool. Then, when selecting data, you can use a CASE..WHEN
> to return 'Y' or 'N', or even write a little function which accepts a
> bool and returns 'Y' or 'N'. In that case, in all your queries you
> would have to replace the select of the bool field with a select from
> the function.

Thank you for your suggestion. I had a private message exchange with
Harald Fuchs, who suggested the same (except the function). Here is what
we "exchanged":

Harald Fuchs wrote:

> If you can control the SELECTs, just use
>
>  SELECT CASE col WHEN true THEN 'Y' ELSE 'N' END
>
> instead of
>
>  SELECT col
>
> Thus you wouldn't need to change your application code.
>
>
I use a single SELECT for both PostgreSQL and MySQL. I could use your
solution. It would just require some tagging of bool fields in the
SELECTs so I could add the CASE statement when using the PostgreSQL
backend.

Miroslav


Avoiding tuple construction/deconstruction during joining

From
Tom Lane
Date:
Miroslav Šulc <miroslav.sulc@startnet.cz> writes:
> [ concerning a deeply nested LEFT JOIN to get data from a star schema ]

> So I have some results. I have tested the query on both PostgreSQL 8.0.1
> and MySQL 4.1.8 with LIMIT set to 30 and OFFSET set to 6000. PostgreSQL
> result is 11,667.916 ms, MySQL result is 448.4 ms.

That's a fairly impressive discrepancy :-(, and even the slot_getattr()
patch that Atsushi Ogawa provided isn't going to close the gap.
(I got about a 4x speedup on Miroslav's example in my testing, which
leaves us still maybe 6x slower than MySQL.)

Looking at the post-patch profile for the test case, there is still
quite a lot of cycles going into tuple assembly and disassembly:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 24.47      4.49     4.49                             _mcount
  8.01      5.96     1.47  9143692     0.00     0.00  ExecEvalVar
  6.92      7.23     1.27  6614373     0.00     0.00  slot_deformtuple
  6.54      8.43     1.20  9143692     0.00     0.00  slot_getattr
  6.21      9.57     1.14   103737     0.01     0.03  ExecTargetList
  5.56     10.59     1.02   103775     0.01     0.01  DataFill
  3.22     11.18     0.59   103775     0.01     0.01  ComputeDataSize
  2.83     11.70     0.52                             ExecEvalVar
  2.72     12.20     0.50  9094122     0.00     0.00  memcpy
  2.51     12.66     0.46                             encore
  2.40     13.10     0.44   427448     0.00     0.00  nocachegetattr
  2.13     13.49     0.39   103775     0.00     0.02  heap_formtuple
  2.07     13.87     0.38                             noshlibs
  1.20     14.09     0.22   225329     0.00     0.00  _doprnt
  1.20     14.31     0.22                             msquadloop
  1.14     14.52     0.21                             chunks
  0.98     14.70     0.18   871885     0.00     0.00  AllocSetAlloc
  0.98     14.88     0.18                             $$dyncall
  0.76     15.02     0.14   594242     0.00     0.00  FunctionCall3
  0.71     15.15     0.13   213312     0.00     0.00  comparetup_heap
  0.65     15.27     0.12     6364     0.02     0.13  printtup
  0.60     15.38     0.11   790702     0.00     0.00  pfree

(_mcount is profiling overhead, ignore it.)  It looks to me like just
about everything in the top dozen functions is there as a result of the
fact that join steps form new tuples that are the merge of their input
tuples.  Even our favorite villains, palloc and pfree, are down in the
sub-percent range.

I am guessing that the reason MySQL wins on this is that they avoid
doing any data copying during a join step.  I wonder whether we could
accomplish the same by taking Ogawa's patch to the next level: allow
a TupleTableSlot to contain either a "materialized" tuple as now,
or a "virtual" tuple that is simply an array of Datums and null flags.
(It's virtual in the sense that any pass-by-reference Datums would have
to be pointing to data at the next level down.)  This would essentially
turn the formtuple and deformtuple operations into no-ops, and get rid
of a lot of the associated overhead such as ComputeDataSize and
DataFill.  The only operations that would have to forcibly materialize
a tuple would be ones that need to keep the tuple till after they fetch
their next input tuple --- hashing and sorting are examples, but very
many plan node types don't ever need to do that.

I haven't worked out the details, but it seems likely that this could be
a relatively nonintrusive patch.  The main thing that would be an issue
would be that direct reference to slot->val would become verboten (since
you could no longer be sure there was a materialized tuple there).
I think this would possibly affect some contrib stuff, which is a strong
hint that it'd break some existing user-written code out there.

Thoughts?

            regards, tom lane

Re: Avoiding tuple construction/deconstruction during joining

From
Miroslav Šulc
Date:
Tom Lane wrote:

>>So I have some results. I have tested the query on both PostgreSQL 8.0.1
>>and MySQL 4.1.8 with LIMIT set to 30 and OFFSET set to 6000. PostgreSQL
>>result is 11,667.916 ms, MySQL result is 448.4 ms.
>>
>>
>
>That's a fairly impressive discrepancy :-(, and even the slot_getattr()
>patch that Atsushi Ogawa provided isn't going to close the gap.
>(I got about a 4x speedup on Miroslav's example in my testing, which
>leaves us still maybe 6x slower than MySQL.)
>
>
As I wrote, the comparison is not "fair". Here are the conditions:

"Both databases are running on the same machine (my laptop) and contain
the same data. However there are some differences in the data table
definitions:
1) in PostgreSQL I use 'varchar(1)' for a lot of fields and in MySQL I
use 'enum'
2) in PostgreSQL in some cases I use connection fields that are not of
the same type (smallint <-> integer (SERIAL)), in MySQL I use the same
types"

For those not used to MySQL, enum is an integer "mapped" to a text
string (that's how I see it). That means that when you have a field such
as enum('yes','no','DK'), the table stores the numbers 1, 2 and 3, which
are mapped to the text values 'yes', 'no' and 'DK'. The description may
not be accurate (I'm not a MySQL programmer, I didn't check it recently
and I didn't inspect the code - I wouldn't understand it either), but I
think that's not that important. What is important is the fact that
MySQL gets to work with a few dozen fields that are numbers while
PostgreSQL has to work with the same fields as varchar(). Some of the
other fields are also varchars. This might (or might not) cause the
speed difference. However, I think that if the devs figure out how to
speed this up, other cases will benefit from the improvement too.
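
In SQL terms, the MySQL side looks roughly like this (hypothetical column):

CREATE TABLE example (
    answer ENUM('yes','no','DK')  -- stored internally as 1, 2, 3
);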

As I understood from the contributions of others, item 2) shouldn't have
a great impact on the speed.

>Looking at the post-patch profile for the test case, there is still
>quite a lot of cycles going into tuple assembly and disassembly:
>
>Each sample counts as 0.01 seconds.
>  %   cumulative   self              self     total
> time   seconds   seconds    calls  ms/call  ms/call  name
> 24.47      4.49     4.49                             _mcount
>  8.01      5.96     1.47  9143692     0.00     0.00  ExecEvalVar
>  6.92      7.23     1.27  6614373     0.00     0.00  slot_deformtuple
>  6.54      8.43     1.20  9143692     0.00     0.00  slot_getattr
>  6.21      9.57     1.14   103737     0.01     0.03  ExecTargetList
>  5.56     10.59     1.02   103775     0.01     0.01  DataFill
>  3.22     11.18     0.59   103775     0.01     0.01  ComputeDataSize
>  2.83     11.70     0.52                             ExecEvalVar
>  2.72     12.20     0.50  9094122     0.00     0.00  memcpy
>  2.51     12.66     0.46                             encore
>  2.40     13.10     0.44   427448     0.00     0.00  nocachegetattr
>  2.13     13.49     0.39   103775     0.00     0.02  heap_formtuple
>  2.07     13.87     0.38                             noshlibs
>  1.20     14.09     0.22   225329     0.00     0.00  _doprnt
>  1.20     14.31     0.22                             msquadloop
>  1.14     14.52     0.21                             chunks
>  0.98     14.70     0.18   871885     0.00     0.00  AllocSetAlloc
>  0.98     14.88     0.18                             $$dyncall
>  0.76     15.02     0.14   594242     0.00     0.00  FunctionCall3
>  0.71     15.15     0.13   213312     0.00     0.00  comparetup_heap
>  0.65     15.27     0.12     6364     0.02     0.13  printtup
>  0.60     15.38     0.11   790702     0.00     0.00  pfree
>
>(_mcount is profiling overhead, ignore it.)  It looks to me like just
>about everything in the top dozen functions is there as a result of the
>fact that join steps form new tuples that are the merge of their input
>tuples.  Even our favorite villains, palloc and pfree, are down in the
>sub-percent range.
>
>I am guessing that the reason MySQL wins on this is that they avoid
>doing any data copying during a join step.  I wonder whether we could
>accomplish the same by taking Ogawa's patch to the next level: allow
>a TupleTableSlot to contain either a "materialized" tuple as now,
>or a "virtual" tuple that is simply an array of Datums and null flags.
>(It's virtual in the sense that any pass-by-reference Datums would have
>to be pointing to data at the next level down.)  This would essentially
>turn the formtuple and deformtuple operations into no-ops, and get rid
>of a lot of the associated overhead such as ComputeDataSize and
>DataFill.  The only operations that would have to forcibly materialize
>a tuple would be ones that need to keep the tuple till after they fetch
>their next input tuple --- hashing and sorting are examples, but very
>many plan node types don't ever need to do that.
>
>I haven't worked out the details, but it seems likely that this could be
>a relatively nonintrusive patch.  The main thing that would be an issue
>would be that direct reference to slot->val would become verboten (since
>you could no longer be sure there was a materialized tuple there).
>I think this would possibly affect some contrib stuff, which is a strong
>hint that it'd break some existing user-written code out there.
>
>Thoughts?
>
>
My first thought is that I don't know what you are talking about :-) Now
seriously, I am far below this level of knowledge. But I can contribute
a test that (maybe) can help. I have rewritten the query so it JOINs the
varchar() fields (in fact all fields except the IDPK) at the last INNER
JOIN. Though there is one more JOIN, the query is more than 5 times
faster (1975.312 ms) :-)

So my silly opinion is that if the planner could detect that there are
too many expensive fields that are not needed while performing the
JOINs, and that these could be joined in at the last step, it would do
it this way :-)

Below is the adjusted query and the EXPLAIN ANALYZE output. (Tom, you
can run it on the data I have sent you and it should run without changes.)

SELECT AdDevicesSites.IDPK, AdDevicesSites2.AdDevicesSiteSizeIDFK,
AdDevicesSites2.AdDevicesSiteRegionIDFK,
AdDevicesSites2.AdDevicesSiteCountyIDFK,
AdDevicesSites2.AdDevicesSiteCityIDFK,
AdDevicesSites2.AdDevicesSiteDistrictIDFK,
AdDevicesSites2.AdDevicesSiteStreetIDFK,
AdDevicesSites2.AdDevicesSiteStreetDescriptionIDFK,
AdDevicesSites2.AdDevicesSitePositionIDFK,
AdDevicesSites2.AdDevicesSiteVisibilityIDFK,
AdDevicesSites2.AdDevicesSiteStatusTypeIDFK,
AdDevicesSites2.AdDevicesSitePartnerIdentificationOperatorIDFK,
AdDevicesSites2.AdDevicesSitePartnerElectricitySupplierIDFK,
AdDevicesSites2.AdDevicesSitePartnerMaintainerIDFK,
AdDevicesSites2.AdDevicesSitePartnerStickerIDFK,
AdDevicesSites2.CadastralUnitIDFK, AdDevicesSites2.MediaType,
AdDevicesSites2.Mark, AdDevicesSites2.Amount, AdDevicesSites2.Distance,
AdDevicesSites2.OwnLightening, AdDevicesSites2.LocationDownTown,
AdDevicesSites2.LocationSuburb,
AdDevicesSites2.LocationBusinessDistrict,
AdDevicesSites2.LocationResidentialDistrict,
AdDevicesSites2.LocationIndustrialDistrict,
AdDevicesSites2.LocationNoBuildings, AdDevicesSites2.ParkWayHighWay,
AdDevicesSites2.ParkWayFirstClassRoad, AdDevicesSites2.ParkWayOtherRoad,
AdDevicesSites2.ParkWayStreet, AdDevicesSites2.ParkWayAccess,
AdDevicesSites2.ParkWayExit, AdDevicesSites2.ParkWayParkingPlace,
AdDevicesSites2.ParkWayPassangersOnly, AdDevicesSites2.ParkWayCrossRoad,
AdDevicesSites2.PositionStandAlone,
AdDevicesSites2.NeighbourhoodPublicTransportation,
AdDevicesSites2.NeighbourhoodInterCityTransportation,
AdDevicesSites2.NeighbourhoodPostOffice,
AdDevicesSites2.NeighbourhoodNewsStand,
AdDevicesSites2.NeighbourhoodAmenities,
AdDevicesSites2.NeighbourhoodSportsSpot,
AdDevicesSites2.NeighbourhoodHealthServiceSpot,
AdDevicesSites2.NeighbourhoodShops,
AdDevicesSites2.NeighbourhoodShoppingCenter,
AdDevicesSites2.NeighbourhoodSuperMarket,
AdDevicesSites2.NeighbourhoodPetrolStation,
AdDevicesSites2.NeighbourhoodSchool, AdDevicesSites2.NeighbourhoodBank,
AdDevicesSites2.NeighbourhoodRestaurant,
AdDevicesSites2.NeighbourhoodHotel,
AdDevicesSites2.RestrictionCigarettes,
AdDevicesSites2.RestrictionPolitics, AdDevicesSites2.RestrictionSpirits,
AdDevicesSites2.RestrictionSex, AdDevicesSites2.RestrictionOther,
AdDevicesSites2.RestrictionNote, AdDevicesSites2.SpotMapFile,
AdDevicesSites2.SpotPhotoFile, AdDevicesSites2.SourcePhotoTimeStamp,
AdDevicesSites2.SourceMapTimeStamp, AdDevicesSites2.Price,
AdDevicesSites2.WebPrice, AdDevicesSites2.CadastralUnitCode,
AdDevicesSites2.BuildingNumber, AdDevicesSites2.ParcelNumber,
AdDevicesSites2.GPSLatitude, AdDevicesSites2.GPSLongitude,
AdDevicesSites2.GPSHeight, AdDevicesSites2.MechanicalOpticalCoordinates,
AdDevicesSites2.Deleted, AdDevicesSites2.Protected,
AdDevicesSites2.DateCreated, AdDevicesSites2.DateLastModified,
AdDevicesSites2.DateDeleted, AdDevicesSites2.CreatedByUserIDFK,
AdDevicesSites2.LastModifiedByUserIDFK,
AdDevicesSites2.DeletedByUserIDFK,
AdDevicesSites2.PhotoLastModificationDate,
AdDevicesSites2.MapLastModificationDate,
AdDevicesSites2.DateLastImported, AdDevicesSiteRegions.Name AS
AdDevicesSiteRegionName, AdDevicesSiteCounties.Name AS
AdDevicesSiteCountyName, AdDevicesSiteCities.Name AS
AdDevicesSiteCityName, AdDevicesSiteStreets.Name AS
AdDevicesSiteStreetName, AdDevicesSiteDistricts.Name AS
AdDevicesSiteDistrictName, AdDevicesSiteStreetDescriptions.Name_cs AS
AdDevicesSiteStreetDescriptionName_cs,
AdDevicesSiteStreetDescriptions.Name_en AS
AdDevicesSiteStreetDescriptionName_en, AdDevicesSiteSizes.Name AS
AdDevicesSiteSizeName, SUBSTRING(AdDevicesSiteVisibilities.Name_cs, 3)
AS AdDevicesSiteVisibilityName_cs,
SUBSTRING(AdDevicesSiteVisibilities.Name_en, 3) AS
AdDevicesSiteVisibilityName_en, AdDevicesSitePositions.Name_cs AS
AdDevicesSitePositionName_cs, AdDevicesSitePositions.Name_en AS
AdDevicesSitePositionName_en, AdDevicesSiteStatusTypes.Name_cs AS
AdDevicesSiteStatusTypeName_cs, AdDevicesSiteStatusTypes.Name_en AS
AdDevicesSiteStatusTypeName_en, PartnerIdentificationsOperator.Name AS
PartnerIdentificationOperatorName, PartnersElectricitySupplier.Name AS
PartnerElectricitySupplierName, PartnersMaintainer.Name AS
PartnerMaintainerName, PartnersSticker.Name AS PartnerStickerName,
CadastralUnits.Code AS CadastralUnitCodeNative, CadastralUnits.Name AS
CadastralUnitName
FROM AdDevicesSites
LEFT JOIN AdDevicesSiteRegions ON AdDevicesSites.AdDevicesSiteRegionIDFK
= AdDevicesSiteRegions.IDPK
LEFT JOIN AdDevicesSiteCounties ON
AdDevicesSites.AdDevicesSiteCountyIDFK = AdDevicesSiteCounties.IDPK
LEFT JOIN AdDevicesSiteCities ON AdDevicesSites.AdDevicesSiteCityIDFK =
AdDevicesSiteCities.IDPK
LEFT JOIN AdDevicesSiteStreets ON AdDevicesSites.AdDevicesSiteStreetIDFK
= AdDevicesSiteStreets.IDPK
LEFT JOIN AdDevicesSiteStreetDescriptions ON
AdDevicesSites.AdDevicesSiteStreetDescriptionIDFK =
AdDevicesSiteStreetDescriptions.IDPK
LEFT JOIN AdDevicesSiteDistricts ON
AdDevicesSites.AdDevicesSiteDistrictIDFK = AdDevicesSiteDistricts.IDPK
LEFT JOIN AdDevicesSiteSizes ON AdDevicesSites.AdDevicesSiteSizeIDFK =
AdDevicesSiteSizes.IDPK
LEFT JOIN AdDevicesSiteVisibilities ON
AdDevicesSites.AdDevicesSiteVisibilityIDFK = AdDevicesSiteVisibilities.IDPK
LEFT JOIN AdDevicesSitePositions ON
AdDevicesSites.AdDevicesSitePositionIDFK = AdDevicesSitePositions.IDPK
LEFT JOIN AdDevicesSiteStatusTypes ON
AdDevicesSites.AdDevicesSiteStatusTypeIDFK = AdDevicesSiteStatusTypes.IDPK
LEFT JOIN PartnerIdentifications AS PartnerIdentificationsOperator ON
AdDevicesSites.AdDevicesSitePartnerIdentificationOperatorIDFK =
PartnerIdentificationsOperator.IDPK
LEFT JOIN Partners AS PartnersElectricitySupplier ON
AdDevicesSites.AdDevicesSitePartnerElectricitySupplierIDFK =
PartnersElectricitySupplier.IDPK
LEFT JOIN Partners AS PartnersMaintainer ON
AdDevicesSites.AdDevicesSitePartnerMaintainerIDFK = PartnersMaintainer.IDPK
LEFT JOIN Partners AS PartnersSticker ON
AdDevicesSites.AdDevicesSitePartnerStickerIDFK = PartnersSticker.IDPK
LEFT JOIN CadastralUnits ON AdDevicesSites.CadastralUnitIDFK =
CadastralUnits.IDPK
INNER JOIN AdDevicesSites AS AdDevicesSites2 ON AdDevicesSites.IDPK =
AdDevicesSites2.IDPK

QUERY PLAN

Hash Join  (cost=193867.68..235417.92 rows=142556 width=815) (actual time=1577.080..2200.677 rows=6364 loops=1)
  Hash Cond: ("outer".idpk = "inner".idpk)
  ->  Seq Scan on addevicessites addevicessites2  (cost=0.00..13118.56 rows=142556 width=473) (actual time=40.650..49.195 rows=6364 loops=1)
  ->  Hash  (cost=186898.29..186898.29 rows=142556 width=350) (actual time=1534.080..1534.080 rows=0 loops=1)
        ->  Merge Right Join  (cost=184758.38..186898.29 rows=142556 width=350) (actual time=1187.653..1244.955 rows=6364 loops=1)
              Merge Cond: ("outer".idpk = "inner".cadastralunitidfk)
              ->  Index Scan using cadastralunits_pkey on cadastralunits  (cost=0.00..314.49 rows=13027 width=31) (actual time=0.034..0.128 rows=63 loops=1)
              ->  Sort  (cost=184758.38..185114.77 rows=142556 width=325) (actual time=1187.582..1190.111 rows=6364 loops=1)
                    Sort Key: addevicessites.cadastralunitidfk
                    ->  Hash Left Join  (cost=93887.04..143074.76 rows=142556 width=325) (actual time=502.584..1129.145 rows=6364 loops=1)
                          Hash Cond: ("outer".addevicessitepartnerstickeridfk = "inner".idpk)
                          ->  Hash Left Join  (cost=93884.28..140933.65 rows=142556 width=307) (actual time=502.388..1075.572 rows=6364 loops=1)
                                Hash Cond: ("outer".addevicessitepartnermaintaineridfk = "inner".idpk)
                                ->  Hash Left Join  (cost=93881.51..138792.55 rows=142556 width=289) (actual time=502.208..1023.802 rows=6364 loops=1)
                                      Hash Cond: ("outer".addevicessitepartnerelectricitysupplieridfk = "inner".idpk)
                                      ->  Hash Left Join  (cost=93878.75..136651.45 rows=142556 width=271) (actual time=502.041..969.959 rows=6364 loops=1)
                                            Hash Cond: ("outer".addevicessitepartneridentificationoperatoridfk = "inner".idpk)
                                            ->  Nested Loop Left Join  (cost=93875.99..134510.35 rows=142556 width=253) (actual time=501.849..915.256 rows=6364 loops=1)
                                                  Join Filter: ("outer".addevicessitestatustypeidfk = "inner".idpk)
                                                  ->  Nested Loop Left Join  (cost=93874.93..118471.74 rows=142556 width=228) (actual time=501.826..818.436 rows=6364 loops=1)
                                                        Join Filter: ("outer".addevicessitepositionidfk = "inner".idpk)
                                                        ->  Nested Loop Left Join  (cost=93873.90..108848.18 rows=142556 width=207) (actual time=501.802..737.137 rows=6364 loops=1)
                                                              Join Filter: ("outer".addevicessitevisibilityidfk = "inner".idpk)
                                                              ->  Merge Right Join  (cost=93872.86..96017.09 rows=142556 width=177) (actual time=501.741..554.834 rows=6364 loops=1)
                                                                    Merge Cond: ("outer".idpk = "inner".addevicessitesizeidfk)
                                                                    ->  Index Scan using addevicessitesizes_pkey on addevicessitesizes  (cost=0.00..5.62 rows=110 width=14) (actual time=0.009..0.264 rows=110 loops=1)
                                                                    ->  Sort  (cost=93872.86..94229.25 rows=142556 width=169) (actual time=501.697..504.267 rows=6364 loops=1)
                                                                          Sort Key: addevicessites.addevicessitesizeidfk
                                                                          ->  Hash Left Join  (cost=57752.91..65764.23 rows=142556 width=169) (actual time=234.321..466.130 rows=6364 loops=1)
                                                                                Hash Cond: ("outer".addevicessitedistrictidfk = "inner".idpk)
                                                                                ->  Hash Left Join  (cost=57743.17..63616.15 rows=142556 width=164) (actual time=233.456..421.267 rows=6364 loops=1)
                                                                                      Hash Cond: ("outer".addevicessitestreetdescriptionidfk = "inner".idpk)
                                                                                      ->  Hash Left Join  (cost=57634.86..62707.43 rows=142556 width=137) (actual time=223.396..362.945 rows=6364 loops=1)
                                                                                            Hash Cond: ("outer".addevicessitestreetidfk = "inner".idpk)
                                                                                            ->  Hash Left Join  (cost=57566.95..61844.20 rows=142556 width=119) (actual time=217.101..312.605 rows=6364 loops=1)
                                                                                                  Hash Cond: ("outer".addevicessitecityidfk = "inner".idpk)
                                                                                                  ->  Merge Right Join  (cost=57561.85..59700.76 rows=142556 width=110) (actual time=216.635..266.672 rows=6364 loops=1)
                                                                                                        Merge Cond: ("outer".idpk = "inner".addevicessitecountyidfk)
                                                                                                        ->  Sort  (cost=6.19..6.48 rows=117 width=17) (actual time=0.350..0.389 rows=116 loops=1)
                                                                                                              Sort Key: addevicessitecounties.idpk
                                                                                                              ->  Seq Scan on addevicessitecounties  (cost=0.00..2.17 rows=117 width=17) (actual time=0.008..0.146 rows=117 loops=1)
                                                                                                        ->  Sort  (cost=57555.66..57912.05 rows=142556 width=99) (actual time=216.250..218.611 rows=6364 loops=1)
                                                                                                              Sort Key: addevicessites.addevicessitecountyidfk
                                                                                                              ->  Merge Right Join  (cost=33573.63..35712.03 rows=142556 width=99) (actual time=173.646..199.036 rows=6364 loops=1)
                                                                                                                    Merge Cond: ("outer".idpk = "inner".addevicessiteregionidfk)
                                                                                                                    ->  Sort  (cost=1.44..1.48 rows=15 width=23) (actual time=0.055..0.059 rows=13 loops=1)
                                                                                                                          Sort Key: addevicessiteregions.idpk
                                                                                                                          ->  Seq Scan on addevicessiteregions  (cost=0.00..1.15 rows=15 width=23) (actual time=0.016..0.032 rows=15 loops=1)
                                                                                                                    ->  Sort  (cost=33572.19..33928.58 rows=142556 width=82) (actual time=173.559..176.398 rows=6364 loops=1)
                                                                                                                          Sort Key: addevicessites.addevicessiteregionidfk
                                                                                                                          ->  Seq Scan on addevicessites  (cost=0.00..13118.56 rows=142556 width=82) (actual time=62.345..164.783 rows=6364 loops=1)
                                                                                                  ->  Hash  (cost=4.48..4.48 rows=248 width=17) (actual time=0.283..0.283 rows=0 loops=1)
                                                                                                        ->  Seq Scan on addevicessitecities  (cost=0.00..4.48 rows=248 width=17) (actual time=0.011..0.162 rows=138 loops=1)
                                                                                            ->  Hash  (cost=60.53..60.53 rows=2953 width=34) (actual time=6.229..6.229 rows=0 loops=1)
                                                                                                  ->  Seq Scan on addevicessitestreets  (cost=0.00..60.53 rows=2953 width=34) (actual time=0.010..3.816 rows=2984 loops=1)
                                                                                      ->  Hash  (cost=96.85..96.85 rows=4585 width=43) (actual time=10.017..10.017 rows=0 loops=1)
                                                                                            ->  Seq Scan on addevicessitestreetdescriptions  (cost=0.00..96.85 rows=4585 width=43) (actual time=0.008..6.371 rows=4585 loops=1)
                                                                                ->  Hash  (cost=8.59..8.59 rows=459 width=21) (actual time=0.815..0.815 rows=0 loops=1)
                                                                                      ->  Seq Scan on addevicessitedistricts  (cost=0.00..8.59 rows=459 width=21) (actual time=0.007..0.541 rows=382 loops=1)
                                                              ->  Materialize  (cost=1.04..1.08 rows=4 width=36) (actual time=0.000..0.002 rows=4 loops=6364)
                                                                    ->  Seq Scan on addevicessitevisibilities  (cost=0.00..1.04 rows=4 width=36) (actual time=0.026..0.035 rows=4 loops=1)
                                                        ->  Materialize  (cost=1.03..1.06 rows=3 width=27) (actual time=0.000..0.001 rows=3 loops=6364)
                                                              ->  Seq Scan on addevicessitepositions  (cost=0.00..1.03 rows=3 width=27) (actual time=0.007..0.010 rows=3 loops=1)
                                                  ->  Materialize  (cost=1.05..1.10 rows=5 width=31) (actual time=0.000..0.002 rows=5 loops=6364)
                                                        ->  Seq Scan on addevicessitestatustypes  (cost=0.00..1.05 rows=5 width=31) (actual time=0.006..0.016 rows=5 loops=1)
                                            ->  Hash  (cost=2.61..2.61 rows=61 width=34) (actual time=0.144..0.144 rows=0 loops=1)
                                                  ->  Seq Scan on partneridentifications partneridentificationsoperator  (cost=0.00..2.61 rows=61 width=34) (actual time=0.007..0.103 rows=61 loops=1)
                                      ->  Hash  (cost=2.61..2.61 rows=61 width=34) (actual time=0.122..0.122 rows=0 loops=1)
                                            ->  Seq Scan on partners partnerselectricitysupplier  (cost=0.00..2.61 rows=61 width=34) (actual time=0.004..0.072 rows=61 loops=1)
                                ->  Hash  (cost=2.61..2.61 rows=61 width=34) (actual time=0.136..0.136 rows=0 loops=1)
                                      ->  Seq Scan on partners partnersmaintainer  (cost=0.00..2.61 rows=61 width=34) (actual time=0.005..0.086 rows=61 loops=1)
                          ->  Hash  (cost=2.61..2.61 rows=61 width=34) (actual time=0.143..0.143 rows=0 loops=1)
                                ->  Seq Scan on partners partnerssticker  (cost=0.00..2.61 rows=61 width=34) (actual time=0.009..0.098 rows=61 loops=1)

Total runtime: 2210.937 ms


>            regards, tom lane
>
>
Miroslav


Re: Avoiding tuple construction/deconstruction during joining

From
Tom Lane
Date:
Miroslav Šulc <miroslav.sulc@startnet.cz> writes:
> seriously, I am far below this level of knowledge. But I can contribute
> a test that (maybe) can help. I have rewritten the query so it JOINs the
> varchar() fields (in fact all fields except the IDPK) at the last INNER
> JOIN. Though there is one more JOIN, the query is more than 5 times
> faster (1975.312 ms) :-)

That confirms my thought that passing the data up through multiple
levels of join is what's killing us.  I'll work on a solution.  This
will of course be even less back-patchable to 8.0.* than Ogawa's work,
but hopefully it will fix the issue for 8.1.

            regards, tom lane

Re: Avoiding tuple construction/deconstruction during joining

From
Miroslav Šulc
Date:
Tom Lane wrote:

>Miroslav Šulc <miroslav.sulc@startnet.cz> writes:
>
>
>>seriously, I am far below this level of knowledge. But I can contribute
>>a test that (maybe) can help. I have rewritten the query so it JOINs the
>>varchar() fields (in fact all fields except the IDPK) at the last INNER
>>JOIN. Though there is one more JOIN, the query is more than 5 times
>>faster (1975.312 ms) :-)
>>
>>
>
>That confirms my thought that passing the data up through multiple
>levels of join is what's killing us.  I'll work on a solution.  This
>will of course be even less back-patchable to 8.0.* than Ogawa's work,
>but hopefully it will fix the issue for 8.1.
>
>            regards, tom lane
>
>
Tom, thank you and the others for your help. I'm glad that my problem
can help PostgreSQL improve, and that there are people like you who
understand every little bit of the software :-)

Miroslav


Re: Avoiding tuple construction/deconstruction during joining

From
PFC
Date:
    I have asked him for the data and played with his queries, and obtained
massive speedups with the following queries :

http://boutiquenumerique.com/pf/miroslav/query.sql
http://boutiquenumerique.com/pf/miroslav/query2.sql
http://boutiquenumerique.com/pf/miroslav/materialize.sql

    Note that my optimized version of the Big Joins is not much faster than
the materialized view without an index (hash joins are damn fast in
postgres) but of course using an index...

Re: Avoiding tuple construction/deconstruction during joining

From
PFC
Date:
    On my machine (Laptop with Pentium-M 1.6 GHz and 512MB DDR333) I get the
following timings :

    Big Joins Query with all the fields and no order by (I just put a SELECT
* on the first table), yielding about 6k rows :
    => 12136.338 ms

    Replacing the SELECT * from the many-field table with just a SELECT
of the foreign key columns :
    => 1874.612 ms

    I felt like playing a bit so I implemented a hash join in python
(download the file, it works on Miroslav's data) :
    All timings do not include time to fetch the data from the database.
Fetching all the tables takes about 1.1 secs.

    * With something that looks like the current implementation (copying
tuples around) and fetching all the fields from the big table :
    => Fetching all the tables : 1.1 secs.
    => Joining : 4.3 secs

    * Fetching only the integer fields
    => Fetching all the tables : 0.4 secs.
    => Joining : 1.7 secs

    * A smarter join which copies nothing and updates the rows as they are
processed, adding fields :
    => Fetching all the tables :  1.1 secs.
    => Joining : 0.4 secs
    With the just-in-time compiler activated, it goes down to about 0.25
seconds.

    First thing, this confirms what Tom said.
    It also means that doing this query in the application can be a lot
faster than doing it in postgres, even including the time to fetch all of
the tables. There's a problem somewhere ! It should be the other way
around ! The python mappings (dictionaries : { key : value } ) are
optimized like crazy but they store column names for each row. And it's a
dynamic scripting language ! Argh.

    Note : run the program like this :

python test.py |less -S

    So that the time spent scrolling your terminal does not spoil the
measurements.

Download test program :
http://boutiquenumerique.com/pf/miroslav/test.py