Thread: join with redundant results VS simpler join plus multiple selects

join with redundant results VS simpler join plus multiple selects

From

WireSpot

Date:

20 November 2008, 07:54:03

I have a schema with galleries, people and images. Each person has a
bunch of private images. People can join any number of galleries and
can publish any of their images to the galleries they join (or not).

I'd like to retrieve a data set where for a given gallery id I get all
the people AND all the images they've published to that gallery.

I can do this in two ways.
1) Do a join that will give me the people that belong to said gallery,
then loop in the code and do simple selects to retrieve images in that
gallery for each of them.
2) Do a join between all three tables. The end result will have as
many rows as total images for all the people in the gallery.
Obviously, there's going to be redundant data, since a person's info
will be repeated for each image.

Which is better in terms of performance? I used EXPLAIN ANALYZE and
actual queries and it seems to suggest that option 2, while returning
redundant info, is faster.

Re: join with redundant results VS simpler join plus multiple selects

From

Craig Ringer

Date:

20 November 2008, 09:05:39

WireSpot wrote:
> I have a schema with galleries, people and images. Each person has a
> bunch of private images. People can join any number of galleries and
> can publish any of their images to the galleries they join (or not).
>
> I'd like to retrieve a data set where for a given gallery id I get all
> the people AND all the images they've published to that gallery.
>
> I can do this in two ways.
> 1) Do a join that will give me the people that belong to said gallery,
> then loop in the code and do simple selects to retrieve images in that
> gallery for each of them.
> 2) Do a join between all three tables. The end result will have as
> many rows as total images for all the people in the gallery.
> Obviously, there's going to be redundant data, since a person's info
> will be repeated for each image.
>
> Which is better in terms of performance? I used EXPLAIN ANALYZE and
> actual queries and it seems to suggest that option 2, while returning
> redundant info, is faster.

That's probably going to be the case.  PostgreSQL won't need to read the
redundant info in from disk each time, and relative to the image data
it's going to be pretty small. By doing it all in one join you're
avoiding the overhead of all those network round trips (if on a
network), statement preparation and planning, etc etc etc. Additionally,
PostgreSQL is probably going to be using a join plan that's much more
efficient than anything you'll get by looping over each user and asking
for images.

If you wanted to avoid returning too much redundant info, you could
always do it in two queries:

- Find a list of all users belonging to the gallery and any other
non-image data associated with them; then

- Retrieve all images in one query using a join against the list of
users who're members of the gallery, but only actually return (eg) the
user id, gallery id, and image data for each image.

In all honestly, though, it probably doesn't matter unless there's a LOT
of additional data you want to obtain about each user.

Note, however, that when testing method (1) in your post you will REALLY
need to make sure that you're using parameterized prepared statements
for the image queries.

--
Craig Ringer

Re: join with redundant results VS simpler join plus multiple selects

From

WireSpot

Date:

20 November 2008, 09:56:15

On Thu, Nov 20, 2008 at 15:05, Craig Ringer <craig@postnewspapers.com.au> wrote:
> That's probably going to be the case.  PostgreSQL won't need to read the
> redundant info in from disk each time, and relative to the image data it's
> going to be pretty small. By doing it all in one join you're avoiding the
> overhead of all those network round trips (if on a network), statement
> preparation and planning, etc etc etc. Additionally, PostgreSQL is probably
> going to be using a join plan that's much more efficient than anything
> you'll get by looping over each user and asking for images.

How about if the subset of images for each user is randomized? As in
ORDER BY RANDOM() LIMIT 3. I'm guessing that will put somewhat of a
cramp on the big join scenario and perhaps it becomes better to have
the RANDOM() in the small individual selects?

I'm probably going to give myself the answer, please advise if I'm not
thinking straight:

In this case, from EXPLAIN ANALYZE I get that after introducing
random() and limit, while the cost for the big join scenario is
practically the same, the actual execution time increases with about
0.100ms. Whereas on individual selects with random() limit I get an
increase of 0.040, and since the people will be shown paginated 10 per
page, I'm looking at a 10 x 0.040 = 0.400 increase.

So the big join still comes ahead.

> Note, however, that when testing method (1) in your post you will REALLY
> need to make sure that you're using parameterized prepared statements for
> the image queries.

Definitely, and I'm already working on that (see my other thread).

Re: join with redundant results VS simpler join plus multiple selects

From

Craig Ringer

Date:

20 November 2008, 14:41:00

WireSpot wrote:
> On Thu, Nov 20, 2008 at 15:05, Craig Ringer <craig@postnewspapers.com.au> wrote:
>> That's probably going to be the case.  PostgreSQL won't need to read the
>> redundant info in from disk each time, and relative to the image data it's
>> going to be pretty small. By doing it all in one join you're avoiding the
>> overhead of all those network round trips (if on a network), statement
>> preparation and planning, etc etc etc. Additionally, PostgreSQL is probably
>> going to be using a join plan that's much more efficient than anything
>> you'll get by looping over each user and asking for images.
>
> How about if the subset of images for each user is randomized? As in
> ORDER BY RANDOM() LIMIT 3. I'm guessing that will put somewhat of a
> cramp on the big join scenario and perhaps it becomes better to have
> the RANDOM() in the small individual selects?

I'm not even sure how you'd achieve that (exactly 3 randomly selected
images per user) in with a single query. Then again, it's stupidly late
here, so my brain may not be working. Any chance you can post a query
that shows what you're doing?

--
Craig Ringer

Re: join with redundant results VS simpler join plus multiple selects

From

WireSpot

Date:

21 November 2008, 03:42:38

On Thu, Nov 20, 2008 at 20:40, Craig Ringer <craig@postnewspapers.com.au> wrote:
> I'm not even sure how you'd achieve that (exactly 3 randomly selected
> images per user) in with a single query. Then again, it's stupidly late
> here, so my brain may not be working. Any chance you can post a query
> that shows what you're doing?

Basically the big join would go like this:

SELECT * FROM gall2ppl gp
JOIN people p ON (gp.gallery_id=N AND gp.gallery_id=p.gallery_id)
JOIN gall2ppl2img gpi ON (gp.gp_id=gpi.gp_id)
JOIN images i ON (gpi.image_id=i.image_id)

Where gp links galleries to people, images are images, and
gall2ppl2img links images to the links between galleries and people.
As you'll notice, I have a gallery_id=N condition, which means I work
with one gallery at a time.

The second scenario would mean doing just the first JOIN, then cycling
through the results and doing the last JOIN for each person.

Getting 3 random images is achieved by slapping an ORDER BY random()
LIMIT 3 at the end of the last JOIN.

I think this means that, although the big join would be more
efficient, it's not practical. Because I also need to limit the number
of persons to get pagination (means using LIMIT and OFFSET on the
people set) and I don't see how I can do that simultaneously (limit
both the people set and the image set in the same join). Not to
mention the added headache of ordering first by some attribute of the
people (name or date of join) then by the images with the random()
thing. The mixed ordering could probably be done but I really dunno
about the simultaneous LIMITing. And in achieving this the big join
may well lose its initial efficiency.

Would it be better if I took the first join and made a view out of it,
then SELECT on that to get the images? The possible advantage would be
I wouldn't have to do the last JOIN for the images, it would be all
straight SELECT's.