Thread: test data for query optimizer
I'd like to look at the performance of the query optimizer (both the traditional one and GEQO) when joining large numbers of tables: 10-15, or more. In order to do that (and to get meaningful results), I'll need to work with some data that actually requires joins of that magnitude. Ideally, I'd like the data to be somewhat realistic -- so that the performance I'm seeing will reflect the performance a typical user might see (i.e. I don't want an artificial benchmark).

However, I don't possess any data of that nature, and I'm unsure where I can find some (or how to generate some of my own). Does anyone know of:

- a freely available collection of data that requires queries of this type, and is reasonably representative of "real world" applications

- or, a means to generate programmatically some data that fits the above criteria.

Thanks in advance,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC
nconway@klamath.dyndns.org (Neil Conway) writes:
> I'd like to look at the performance of the query optimizer (both the
> traditional one and GEQO) when joining large numbers of tables: 10-15,
> or more. In order to do that (and to get meaningful results), I'll
> need to work with some data that actually requires joins of that
> magnitude.

The easiest way to construct a realistic many-way join is to use a star schema. Here you have a primary "fact table" that includes a lot of columns that individually join to the primary keys of other "detail tables". For example, you might have a column "State" in the fact table with values like "PA", "NY", etc, and you want to join it to a table states(abbrev,fullname,...) so your query can display "Pennsylvania", "New York", etc. It's easy to make up realistic examples that involve any number of joins.

This is of course only one usage pattern for lots-o-joins, so don't put too much credence in it alone as a benchmark, but it's certainly a widely used pattern.

Searching for "star schema" at Google turned up some interesting things last time I tried it.

			regards, tom lane
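To make the star-schema pattern concrete, here is a minimal sketch. Apart from the states table Tom mentions, the table and column names are made up for illustration; reaching 10-15 joins is just a matter of adding more detail tables and matching columns of the same shape:

    -- Hypothetical star schema: one fact table, several detail tables.
    CREATE TABLE states   (abbrev  char(2) PRIMARY KEY, fullname  text);
    CREATE TABLE products (prod_id integer PRIMARY KEY, prod_name text);
    CREATE TABLE channels (chan_id integer PRIMARY KEY, chan_name text);

    CREATE TABLE sales_fact (
        sale_id serial  PRIMARY KEY,
        state   char(2) REFERENCES states(abbrev),
        prod_id integer REFERENCES products(prod_id),
        chan_id integer REFERENCES channels(chan_id)
        -- ...one more column per additional detail table to reach 10-15 joins
    );

    -- Each displayed column turns into one join against its detail table.
    SELECT s.fullname, p.prod_name, c.chan_name, count(*)
    FROM sales_fact f
    JOIN states   s ON s.abbrev  = f.state
    JOIN products p ON p.prod_id = f.prod_id
    JOIN channels c ON c.chan_id = f.chan_id
    GROUP BY s.fullname, p.prod_name, c.chan_name;

Populating the fact table with random foreign-key values is enough to exercise the planner, since the join count (not the realism of each row) is what drives the search space.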
What about the OSDB benchmark? Does that contain a large dataset?

Chris
On Sat, Jul 13, 2002 at 11:18:14AM +0800, Christopher Kings-Lynne wrote:
> What about the OSDB benchmark? Does that contain a large dataset?

No -- it only uses 5 relations total, with the most complex query only involving 4 joins.

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC
What about the TPC-H benchmark? I can't recall if it has more than 10 tables, but it seemed like the queries were "quite good" for a benchmark. In addition, it comes with a data generator.

regards

Mark
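For reference, a join in the spirit of the larger TPC-H queries Mark mentions looks roughly like the sketch below. The table and column names follow the published TPC-H schema as recalled from memory, so treat them as an assumption rather than a quote from the spec:

    -- Rough sketch of an 8-table join over the TPC-H schema
    -- (table/column names recalled from the spec, not verified here).
    SELECT n.n_name, sum(l.l_extendedprice) AS revenue
    FROM lineitem l
    JOIN orders   o  ON o.o_orderkey  = l.l_orderkey
    JOIN customer c  ON c.c_custkey   = o.o_custkey
    JOIN supplier s  ON s.s_suppkey   = l.l_suppkey
    JOIN partsupp ps ON ps.ps_partkey = l.l_partkey
                    AND ps.ps_suppkey = l.l_suppkey
    JOIN part     p  ON p.p_partkey   = l.l_partkey
    JOIN nation   n  ON n.n_nationkey = s.s_nationkey
    JOIN region   r  ON r.r_regionkey = n.n_regionkey
    GROUP BY n.n_name;

Even joining every relation in that schema tops out around eight tables, so hitting 10-15 joins would still mean extending it -- for instance with extra detail tables along the lines of the star-schema sketch above -- although the data generator does make the tables arbitrarily large.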