Thread: test data for query optimizer
I'd like to look at the performance of the query optimizer (both the traditional one and GEQO) when joining large numbers of tables: 10-15, or more. In order to do that (and to get meaningful results), I'll need to work with some data that actually requires joins of that magnitude. Ideally, I'd like the data to be somewhat realistic -- so that the performance I'm seeing will reflect the performance a typical user might see (i.e. I don't want an artificial benchmark).

However, I don't possess any data of that nature, and I'm unsure where I can find some (or how to generate some of my own). Does anyone know of:

- a freely available collection of data that requires queries of this type, and is reasonably representative of "real world" applications

- or, a means to generate programmatically some data that fits the above criteria.

Thanks in advance,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC
nconway@klamath.dyndns.org (Neil Conway) writes:
> I'd like to look at the performance of the query optimizer (both the
> traditional one and GEQO) when joining large numbers of tables: 10-15,
> or more. In order to do that (and to get meaningful results), I'll
> need to work with some data that actually requires joins of that
> magnitude.

The easiest way to construct a realistic many-way join is to use a star schema. Here you have a primary "fact table" that includes a lot of columns that individually join to the primary keys of other "detail tables". For example, you might have a column "State" in the fact table with values like "PA", "NY", etc, and you want to join it to a table states(abbrev,fullname,...) so your query can display "Pennsylvania", "New York", etc. It's easy to make up realistic examples that involve any number of joins.

This is of course only one usage pattern for lots-o-joins, so don't put too much credence in it alone as a benchmark, but it's certainly a widely used pattern.

Searching for "star schema" at Google turned up some interesting things last time I tried it.

			regards, tom lane
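To make the star-schema pattern concrete, here is a minimal sketch. Apart from the states table Tom mentions, the table and column names are made up for illustration; reaching 10-15 joins is just a matter of adding more detail tables and matching columns of the same shape:

    -- Hypothetical star schema: one fact table, several detail tables.
    CREATE TABLE states   (abbrev  char(2) PRIMARY KEY, fullname  text);
    CREATE TABLE products (prod_id integer PRIMARY KEY, prod_name text);
    CREATE TABLE channels (chan_id integer PRIMARY KEY, chan_name text);

    CREATE TABLE sales_fact (
        sale_id serial  PRIMARY KEY,
        state   char(2) REFERENCES states(abbrev),
        prod_id integer REFERENCES products(prod_id),
        chan_id integer REFERENCES channels(chan_id)
        -- ...one more column per additional detail table to reach 10-15 joins
    );

    -- Each displayed column turns into one join against its detail table.
    SELECT s.fullname, p.prod_name, c.chan_name, count(*)
    FROM sales_fact f
    JOIN states   s ON s.abbrev  = f.state
    JOIN products p ON p.prod_id = f.prod_id
    JOIN channels c ON c.chan_id = f.chan_id
    GROUP BY s.fullname, p.prod_name, c.chan_name;

Populating the fact table with random foreign-key values is enough to exercise the planner, since the join count (not the realism of each row) is what drives the search space.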
What about the OSDB benchmark? Does that contain a large dataset?

Chris
On Sat, Jul 13, 2002 at 11:18:14AM +0800, Christopher Kings-Lynne wrote:
> What about the OSDB benchmark? Does that contain a large dataset?

No -- it only uses 5 relations total, with the most complex query only involving 4 joins.

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC
What about the TPC-H benchmark? I can't recall if it has more than 10 tables, but it seemed like the queries were "quite good" for a benchmark. In addition, it comes with a data generator.

regards

Mark
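For reference, a join in the spirit of the larger TPC-H queries Mark mentions looks roughly like the sketch below. The table and column names follow the published TPC-H schema as recalled from memory, so treat them as an assumption rather than a quote from the spec:

    -- Rough sketch of an 8-table join over the TPC-H schema
    -- (table/column names recalled from the spec, not verified here).
    SELECT n.n_name, sum(l.l_extendedprice) AS revenue
    FROM lineitem l
    JOIN orders   o  ON o.o_orderkey  = l.l_orderkey
    JOIN customer c  ON c.c_custkey   = o.o_custkey
    JOIN supplier s  ON s.s_suppkey   = l.l_suppkey
    JOIN partsupp ps ON ps.ps_partkey = l.l_partkey
                    AND ps.ps_suppkey = l.l_suppkey
    JOIN part     p  ON p.p_partkey   = l.l_partkey
    JOIN nation   n  ON n.n_nationkey = s.s_nationkey
    JOIN region   r  ON r.r_regionkey = n.n_regionkey
    GROUP BY n.n_name;

Even joining every relation in that schema tops out around eight tables, so hitting 10-15 joins would still mean extending it -- for instance with extra detail tables along the lines of the star-schema sketch above -- although the data generator does make the tables arbitrarily large.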