Thread: Sample databases?

Sample databases?

From: mlw

I am doing some testing and development on Postgres.

Is there, by chance, a good source of data which can be used as a test
database? I have been using a music database, but it is proprietary, and
makes me uncomfortable to post public tests.

What do you guys use?

Perhaps we can create a substantial test database? (Millions of records,
many tables, and a number of relations.) So when we see a problem, we
can all see it right away. I like "real world" data, because it is often
more organic than randomized test data, and brings out more issues. Take
index selection during a select, for instance.

-- 
http://www.mohawksoft.com


Re: Sample databases?

From: Thomas Lockhart

> What do you guys use?

The regression database, which you can augment with some "insert into x
select * from x;" commands. It would also be useful to have a "database
generation" script, but of course this would be cooked data.

> Perhaps we can create a substantial test database? (Millions of records,
> many tables, and a number of relations.) So when we see a problem, we
> can all see it right away. I like "real world" data, because it is often
> more organic than randomized test data, and brings out more issues. Take
> index selection during a select, for instance.

The regression database is such a beast, but is not large enough for the
millions of records kinds of tests.

Suggestions?
                    - Thomas
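
The self-insert trick mentioned above doubles a table on every pass, so
20 passes multiply the row count by 2**20 (about a million). A minimal
sketch of the idea follows. It uses Python's built-in sqlite3 module as a
stand-in so that it runs anywhere; the table "x" matches the quoted
command, but its columns and the helper name grow_table are hypothetical.
Against Postgres the same INSERT ... SELECT statement would be issued
through whatever client is at hand.

```python
import sqlite3


def grow_table(conn, table, doublings):
    """Double the row count of `table` `doublings` times via self-insert."""
    # Table names cannot be bound as query parameters, so validate the
    # name before interpolating it into the statement.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    for _ in range(doublings):
        conn.execute(f"INSERT INTO {table} SELECT * FROM {table}")
    conn.commit()
    return conn.execute(f"SELECT count(*) FROM {table}").fetchone()[0]


# Hypothetical three-row table standing in for a regression-test table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO x VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])

row_count = grow_table(conn, "x", 10)
print(row_count)  # 3 rows doubled 10 times: 3 * 2**10 = 3072
```

Every copy repeats the same values, so the result is large but keeps the
original value distribution. That is the "cooked data" caveat: it
exercises volume, not the variety of real-world data.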


Re: Sample databases?

From: Jeff Hoffmann

Thomas Lockhart wrote:
>  
> > Perhaps we can create a substantial test database? (Millions of records,
> > many tables, and a number of relations.) So when we see a problem, we
> > can all see it right away. I like "real world" data, because it is often
> > more organic than randomized test data, and brings out more issues. Take
> > index selection during a select, for instance.
> 
> The regression database is such a beast, but is not large enough for the
> millions of records kinds of tests.
> 
> Suggestions?
> 

maybe the TIGER database.  it's certainly big enough & freely
available.  if you're not familiar with TIGER, it's a street database
from the U.S. Census Bureau.  you can find it at
ftp://ftp.linuxvc.com/pub/US-map.  it's in plain text format, but
trivial to import.  it's set up in several (at least a dozen) tables
which are heavily interrelated, sometimes in fairly complex ways.

-- 

Jeff Hoffmann
PropertyKey.com


Re: Sample databases?

From: Josh Rovero

The NIMA web site has a tab-delimited version of the
Airfield Information database files.  Lots of data,
many tables to relate.  Some elements are geographic,
others are text and numeric feature attributes.

mlw wrote:

> I am doing some testing and development on Postgres.
> 
> Is there, by chance, a good source of data which can be used as a test
> database? I have been using a music database, but it is proprietary, and
> makes me uncomfortable to post public tests.
>
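
Importing a tab-delimited file of this kind takes only a few lines. The
sketch below is illustrative: the column names and rows are invented and
do not reflect the actual NIMA file layout, the helper name load_tsv is
hypothetical, and sqlite3 stands in for Postgres so that the example is
self-contained.

```python
import csv
import io
import sqlite3

# Invented sample data; a real file would have different columns.
sample = (
    "ident\tname\televation_ft\n"
    "AAA\tAlpha Field\t120\n"
    "BBB\tBravo Strip\t2450\n"
)


def load_tsv(conn, table, text):
    """Create `table` from the header row of `text` and load the data rows."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    # Identifiers cannot be bound as parameters, so validate them first.
    if not table.isidentifier() or not all(c.isidentifier() for c in header):
        raise ValueError("unexpected table or column name")
    columns = ", ".join(header)
    marks = ", ".join("?" for _ in header)
    conn.execute(f"CREATE TABLE {table} ({columns})")
    # The remaining rows of the reader are inserted as-is (all text).
    conn.executemany(f"INSERT INTO {table} VALUES ({marks})", reader)
    conn.commit()
    return conn.execute(f"SELECT count(*) FROM {table}").fetchone()[0]


conn = sqlite3.connect(":memory:")
loaded = load_tsv(conn, "airfield", sample)
print(loaded)  # 2 data rows
```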



Re: Sample databases?

From: Tom Lane

mlw <markw@mohawksoft.com> writes:
> Perhaps we can create a substantial test database? (Millions of records,
> many tables, and a number of relations.) So when we see a problem, we
> can all see it right away. I like "real world" data, because it is often
> more organic than randomized test data, and brings out more issues.

That's true, but a single test database strikes me as the wrong way
to go.  The real-life examples that people throw at Postgres are so
varied that a test database could never hope to be an adequate
substitute.  I think a test database would likely be subject to
"benchmark syndrome", i.e., it'd encourage us to optimize with blinders on.

The regression database is actually sufficient to reproduce most simpler
sorts of performance problems, once you know what to look for.
        regards, tom lane


Re: Sample databases?

From: ncm@zembu.com (Nathan Myers)

On Wed, Dec 20, 2000 at 12:41:01AM +0000, Josh Rovero wrote:
> mlw wrote:
> > I am doing some testing and development on Postgres.
> > 
> > Is there, by chance, a good source of data which can be used as a test
> > database? I have been using a music database, but it is proprietary, and
> > makes me uncomfortable to post public tests.
> 
> The NIMA web site has a tab-delimited version of the
> Airfield Information database files.  Lots of data,
> many tables to relate.  Some elements are geographic,
> others are text and numeric feature attributes.

It would be no bad thing to include benchmarks against large, real
sample databases.  However, it would be very bad indeed to include
those large databases in the distribution.

I suggest that each such benchmark script include code to check if 
the sample database is present and, if not, download it from its 
canonical site, massage it into shape and import it.  Then there 
would be no need to limit the number and variety of large sample 
databases that a build may be tried against.

I gather that it takes two weeks to run the regression tests for
IBM's DB2 for a single target platform.

Nathan Myers
ncm@zembu.com
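
The check-then-download step suggested above might look roughly like the
following. This is a sketch under stated assumptions, not an existing
script: the function name ensure_dataset, the file name, and the URL are
all hypothetical, and the demonstration injects a fake fetcher so that no
network access is needed. The massage and import steps would follow once
the file is in place.

```python
import os
import tempfile
import urllib.request


def ensure_dataset(path, url, fetch=None):
    """Download `url` to `path` unless `path` already exists.

    Returns True if a download happened, False if the file was present.
    """
    if os.path.exists(path):
        return False
    if fetch is None:
        # Default fetcher: read the whole response body into memory.
        def fetch(u):
            with urllib.request.urlopen(u) as resp:
                return resp.read()
    data = fetch(url)
    # Write to a temporary name and rename, so an interrupted write
    # never leaves a partial file that looks like a complete dataset.
    part = path + ".part"
    with open(part, "wb") as out:
        out.write(data)
    os.replace(part, path)
    return True


# Demonstration with a fake fetcher and a placeholder URL.
SAMPLE_URL = "http://example.invalid/sample.tsv"  # hypothetical
calls = []


def fake_fetch(url):
    calls.append(url)
    return b"1\talpha\n2\tbravo\n"


target = os.path.join(tempfile.mkdtemp(), "sample.tsv")
first = ensure_dataset(target, SAMPLE_URL, fetch=fake_fetch)
second = ensure_dataset(target, SAMPLE_URL, fetch=fake_fetch)
print(first, second, len(calls))  # True False 1
```

Keeping the download behind an existence check means the large data never
has to ship with the distribution, and repeated benchmark runs reuse the
local copy.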


Re: Sample databases?

From: Thomas Lockhart

> I suggest that each such benchmark script include code to check if
> the sample database is present and, if not, download it from its
> canonical site, massage it into shape and import it.  Then there
> would be no need to limit the number and variety of large sample
> databases that a build may be tried against.

The contrib/mac directory does something similar with respect to fetching
data, using wget to get a file of manufacturer MAC addresses to populate
the database as necessary.

Although not everyone will use such a large test database, it certainly
cannot hurt to have someone pursue assembling this as a toolkit, and
running tests if they find that interesting and helpful. It may be hard
to predict if, when, and how we will find this useful until it exists ;)
                    - Thomas