Re: The REAL cost of joins - Mailing list pgsql-general

From Peter Hunsberger
Subject Re: The REAL cost of joins
Date
Msg-id cc159a4a1003031433r27c8de5erea85d0b841117b9@mail.gmail.com
Whole thread Raw
In response to The REAL cost of joins  (Marcin Krol <mrkafk@gmail.com>)
List pgsql-general
On Wed, Mar 3, 2010 at 3:59 PM, Marcin Krol <mrkafk@gmail.com> wrote:
>
> Hello everyone,
>
> I have inadvertently set off a furball on an unrelated ng on what is the actual cost of SQL joins. But there's no
reliableconclusion. I would like to actually know that, that is, are JOINs truly expensive? 
>
> As they say, one measurement is worth a thousand opinions, so I've done measurement on my PG app:
>
> $ time echo "\c hrs;
> SELECT hosts.ip, reservation.start_date, architecture.architecture, os_kind.os_kind, os_rel.os_rel,
os_version.os_version,project.project, email.email FROM hosts 
>  INNER JOIN project ON project.id = hosts.project_id
>  INNER JOIN architecture ON hosts.architecture_id = architecture.id
>  INNER JOIN os_kind ON os_kind.id = hosts.os_kind_id
>  INNER JOIN os_rel ON hosts.os_rel_id = os_rel.id
>  INNER JOIN os_version ON hosts.os_version_id = os_version.id
>  INNER JOIN reservation_hosts ON hosts.id = reservation_hosts.host_id
>  INNER JOIN reservation on reservation.id = reservation_hosts.reservation_id
>  INNER JOIN email ON reservation.email_id = email.id
>
> ;" | psql > /dev/null
>
> real    0m0.099s
> user    0m0.015s
> sys     0m0.005s
>
>
>
> $ time echo "\c hrs;
> > SELECT hosts.ip FROM hosts;
> > SELECT reservation.start_date FROM reservation;
> > SELECT architecture.architecture FROM architecture;
> > SELECT os_rel.os_rel FROM os_rel;
> > SELECT os_version.os_version FROM os_version;
> > SELECT project.project FROM project;
> > SELECT email.email FROM email;
> > " | psql > /dev/null
>
> real    0m0.046s
> user    0m0.008s
> sys     0m0.004s
>
> Note: I've created indexes on those tables, both on data columns like hosts.ip and on .id columns.
>
> What do you think of this? And in general: when (if?) should one denormalize data?

I don't get what these two examples are supposed to show?  They do
not, in general, create equivalent output data. If you had to write
procedural logic to do the same correlation between the individual
result sets as the joins are performing then you'd be back to at least
the same performance and probably worse, so what exactly is being
compared here?

As to your second question; well, outside of not doing premature
optimization I'll observe that for operational databases,
denormalization rarely gives me enough savings to justify the
resultant extra data management required to maintain it.  If you're
talking about a data warehouse (and given the way you've worded the
question I supsect you are not?) then that's a different question, in
that case I prefer to generate the data from the operational side of
the house so you're no longer paying an operational cost to maintain
the denormalized data (you've got a whole different set of costs
instead).

--
Peter Hunsberger

pgsql-general by date:

Previous
From: Guillaume Yziquel
Date:
Subject: Re: PQntuples returns an int.
Next
From: Justin Graf
Date:
Subject: Re: Scratching my head why results are different between machines.