Thread: Sample databases?
I am doing some testing and development on Postgres.

Is there, by chance, a good source of data which can be used as a test database? I have been using a music database, but it is proprietary, and makes me uncomfortable to post public tests.

What do you guys use?

Perhaps we can create a substantial test database? (Millions of records, many tables, and a number of relations.) So when we see a problem, we can all see it right away. I like "real world" data, because it is often more organic than randomized test data, and brings out more issues. Take index selection during a select, for instance.

--
http://www.mohawksoft.com
> What do you guys use?

The regression database, which you can augment with some "insert into x select * from x;" commands. It would also be useful to have a "database generation" script, but of course this would be cooked data.

> Perhaps we can create a substantial test database? (Millions of records,
> many tables, and a number of relations.) So when we see a problem, we
> can all see it right away. I like "real world" data, because it is often
> more organic than randomized test data, and brings out more issues. Take
> index selection during a select, for instance.

The regression database is such a beast, but it is not large enough for millions-of-records kinds of tests.

Suggestions?

- Thomas
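A minimal sketch of the "insert into x select * from x;" trick Thomas mentions, showing how quickly repeated self-inserts grow a table. It uses Python's built-in `sqlite3` module purely as a stand-in so the example runs anywhere; the table `x`, its columns, and the helper `double_rows` are hypothetical, and against Postgres the same SQL statement would be issued through whatever client you normally use.

```python
import sqlite3


def double_rows(conn, table, passes):
    """Run INSERT INTO t SELECT * FROM t `passes` times; return the row count.

    Each pass doubles the table, so n passes multiply its size by 2**n.
    `table` is interpolated into the SQL text (identifiers cannot be bound
    as parameters), so it must be a trusted name.
    """
    for _ in range(passes):
        conn.execute(f"INSERT INTO {table} SELECT * FROM {table}")
    conn.commit()
    return conn.execute(f"SELECT count(*) FROM {table}").fetchone()[0]


# Hypothetical three-row starting table, standing in for a regression table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO x VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])

final_count = double_rows(conn, "x", 10)
print(final_count)  # 3 * 2**10 = 3072
```

Twenty passes over a thousand-row table would already exceed a billion rows, so a modest number of passes reaches the "millions of records" range. The copies are exact duplicates, though, which is the "cooked data" caveat: value distributions and index selectivity will not resemble organic data.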
Thomas Lockhart wrote:
>
> > Perhaps we can create a substantial test database? (Millions of records,
> > many tables, and a number of relations.) So when we see a problem, we
> > can all see it right away. I like "real world" data, because it is often
> > more organic than randomized test data, and brings out more issues. Take
> > index selection during a select, for instance.
>
> The regression database is such a beast, but is not large enough for the
> millions of records kinds of tests.
>
> Suggestions?
>

maybe the Tiger database. it's certainly big enough & freely available. if you're not familiar with tiger, it's a street database from the census department. you can find it at ftp://ftp.linuxvc.com/pub/US-map. it's in plain text format, but trivial to import. it's set up in several (at least a dozen) tables which are heavily interrelated, sometimes in fairly complex ways.

--
Jeff Hoffmann
PropertyKey.com
The NIMA web site has a tab-delimited version of the Airfield Information database files. Lots of data, many tables to relate. Some elements are geographic, others are text and numeric feature attributes.

mlw wrote:
> I am doing some testing and development on Postgres.
>
> Is there, by chance, a good source of data which can be used as a test
> database? I have been using a music database, but it is proprietary, and
> makes me uncomfortable to post public tests.
>
mlw <markw@mohawksoft.com> writes:
> Perhaps we can create a substantial test database? (Millions of records,
> many tables, and a number of relations.) So when we see a problem, we
> can all see it right away. I like "real world" data, because it is often
> more organic than randomized test data, and brings out more issues.

That's true, but a single test database strikes me as the wrong way to go. The real-life examples that people throw at Postgres are so varied that a test database could never hope to be an adequate substitute. I think a test database would likely be subject to "benchmark syndrome", i.e., it'd encourage us to optimize with blinders on.

The regression database is actually sufficient to reproduce most simpler sorts of performance problems, once you know what to look for.

regards, tom lane
On Wed, Dec 20, 2000 at 12:41:01AM +0000, Josh Rovero wrote:
> mlw wrote:
> > I am doing some testing and development on Postgres.
> >
> > Is there, by chance, a good source of data which can be used as a test
> > database? I have been using a music database, but it is proprietary, and
> > makes me uncomfortable to post public tests.
>
> The NIMA web site has tab-delimited version of the
> Airfield Information database files. Lots of data,
> many tables to relate. Some elements are geographic,
> others are text and numeric feature attributes.

It would be no bad thing to include benchmarks against large, real sample databases. However, it would be very bad indeed to include those large databases in the distribution.

I suggest that each such benchmark script include code to check if the sample database is present and, if not, download it from its canonical site, massage it into shape and import it. Then there would be no need to limit the number and variety of large sample databases that a build may be tried against.

I gather that it takes two weeks to run the regression tests for IBM's DB2 for a single target platform.

Nathan Myers
ncm@zembu.com
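A rough sketch of the check, download, massage, import sequence Nathan suggests, under several assumptions: every function name here is hypothetical, the URL is a placeholder rather than a real canonical site, the download is injected as a `fetch` callable so the demo runs offline, and `sqlite3` stands in for the Postgres import step.

```python
import csv
import os
import sqlite3
import tempfile


def ensure_downloaded(path, url, fetch):
    """Fetch `url` into `path` only if `path` is missing; return True if fetched."""
    if os.path.exists(path):
        return False
    data = fetch(url)
    part = path + ".part"
    with open(part, "wb") as f:
        f.write(data)
    # Rename last, so an interrupted download never looks like a complete file.
    os.replace(part, path)
    return True


def load_rows(path):
    """'Massage' step: parse tab-delimited text, trim fields, skip blank lines."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        return [[cell.strip() for cell in row] for row in reader if row]


def import_rows(conn, rows):
    """Import step: load the cleaned rows into a (hypothetical) two-column table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sample (code TEXT, name TEXT)")
    conn.executemany("INSERT INTO sample VALUES (?, ?)", rows)
    conn.commit()


# Offline demo: a fake fetcher records its calls and returns canned bytes.
calls = []


def fake_fetch(url):
    calls.append(url)
    return b"AAA\tAlpha Field\n\nBBB\t Bravo Field \n"


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "sample.tsv")
url = "https://example.invalid/sample.tsv"  # placeholder, not a real data source

first = ensure_downloaded(path, url, fake_fetch)   # file missing: fetches
second = ensure_downloaded(path, url, fake_fetch)  # file present: skips

conn = sqlite3.connect(":memory:")
import_rows(conn, load_rows(path))
row_count = conn.execute("SELECT count(*) FROM sample").fetchone()[0]
print(first, second, len(calls), row_count)
```

In a real benchmark script, `fetch` could be something like `lambda u: urllib.request.urlopen(u).read()`, and the massage step would be specific to each data set. Keeping the presence check separate from the import means the large download happens once per machine rather than once per test run.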
> I suggest that each such benchmark script include code to check if
> the sample database is present and, if not, download it from its
> canonical site, massage it into shape and import it. Then there
> would be no need to limit the number and variety of large sample
> databases that a build may be tried against.

The contrib/mac directory does something similar wrt fetching data, using wget to get a file of manufacturer MAC addresses to populate the database as necessary.

Although not everyone will use such a large test database, it certainly cannot hurt to have someone pursue assembling this as a toolkit, and running tests if they find that interesting and helpful. It may be hard to predict if, when, and how we will find this useful until it exists ;)

- Thomas