Thread: OSS database needed for testing

OSS database needed for testing

From
Josh Berkus
Date:
Folks,

Please pardon the cross-posting.

A small group of us on the Performance list were discussing the first steps
toward constructing a comprehensive Postgresql installation benchmarking
tool, mostly to compare different operating systems and file systemsm but
later to be used as a foundation for a tuning wizard.

To do this, we need one or more real (not randomly generated*) medium-large
database which is or can be BSD-licensed (data AND schema).   This database
must have:

1) At least one "main" table with 12+ columns and 100,000+ rows (each).
2) At least 10-12 additional tables of assorted sizes, at least half of which
should have Foriegn Key relationships to the main table(s) or each other.
3) At least one large text or varchar field among the various tables.

In addition, the following items would be helpful, but are not required:
4) Views, triggers, and functions built on the database
5) A query log of database activity to give us sample queries to work with.
6) Some complex data types, such as geometric, network, and/or custom data
types.

Thanks for any leads you can give me!

(* To forestall knee-jerk responses:  Randomly generated data does not look or
perform the same as real data in my professional opinion, and I'm the one
writing the test scripts.)

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: [HACKERS] OSS database needed for testing

From
pgsql@mohawksoft.com
Date:
I don't know that it meets your criteria, but.....

I have a set of scripts and a program that will load the US Census TigerUA
database into PostgreSQL. The thing is absolutely freak'n huge. I forget
which, but it is either 30g or 60g of data excluding indexes.

Also, if that is too much, I have a similar setup to load the FreeDB music
database, from www.freedb.org. It has roughly 670,000 entries in "cdtitles"
and 8 million entries in "cdsongs."

Either one of which, I would be willing to send you the actual DB on cd(s)
if you pay for postage and media.


> Folks,
>
> Please pardon the cross-posting.
>
> A small group of us on the Performance list were discussing the first
> steps  toward constructing a comprehensive Postgresql installation
> benchmarking  tool, mostly to compare different operating systems and
> file systemsm but  later to be used as a foundation for a tuning
> wizard.
>
> To do this, we need one or more real (not randomly generated*)
> medium-large  database which is or can be BSD-licensed (data AND
> schema).   This database  must have:
>
> 1) At least one "main" table with 12+ columns and 100,000+ rows (each).
> 2) At least 10-12 additional tables of assorted sizes, at least half of
> which  should have Foriegn Key relationships to the main table(s) or
> each other. 3) At least one large text or varchar field among the
> various tables.
>
> In addition, the following items would be helpful, but are not
> required: 4) Views, triggers, and functions built on the database
> 5) A query log of database activity to give us sample queries to work
> with. 6) Some complex data types, such as geometric, network, and/or
> custom data  types.
>
> Thanks for any leads you can give me!
>
> (* To forestall knee-jerk responses:  Randomly generated data does not
> look or  perform the same as real data in my professional opinion, and
> I'm the one  writing the test scripts.)
>
> --
> -Josh Berkus
>  Aglio Database Solutions
>  San Francisco
>
>
> ---------------------------(end of
> broadcast)--------------------------- TIP 1: subscribe and unsubscribe
> commands go to majordomo@postgresql.org


Re: [HACKERS] OSS database needed for testing

From
Bruno Wolff III
Date:
On Thu, Apr 03, 2003 at 13:26:01 -0500,
  pgsql@mohawksoft.com wrote:
> I don't know that it meets your criteria, but.....
>
> I have a set of scripts and a program that will load the US Census TigerUA
> database into PostgreSQL. The thing is absolutely freak'n huge. I forget
> which, but it is either 30g or 60g of data excluding indexes.

Are the data model or the loading scripts available publicly?
I have the tiger data and a program that uses it to convert addresses
to latitude and longitude, but I don't really like the program and
was thinking about trying to load the data into a database and do
queries against the database to find location.


Re: [HACKERS] OSS database needed for testing

From
mlw
Date:

Bruno Wolff III wrote:

>On Thu, Apr 03, 2003 at 13:26:01 -0500,
>  pgsql@mohawksoft.com wrote:
>
>
>>I don't know that it meets your criteria, but.....
>>
>>I have a set of scripts and a program that will load the US Census TigerUA
>>database into PostgreSQL. The thing is absolutely freak'n huge. I forget
>>which, but it is either 30g or 60g of data excluding indexes.
>>
>>
>
>Are the data model or the loading scripts available publicly?
>I have the tiger data and a program that uses it to convert addresses
>to latitude and longitude, but I don't really like the program and
>was thinking about trying to load the data into a database and do
>queries against the database to find location.
>
>
>
I have a set of scripts, SQL table defs, a small C program, along with a
set of field with files that loads it into PGSQL using the "copy from
stdin" It works fairly well, but takes a good long time to load it all.

Should I put it in the download section of my website?


Re: [HACKERS] OSS database needed for testing

From
Bruno Wolff III
Date:
On Thu, Apr 03, 2003 at 17:19:13 -0500,
  mlw <pgsql@mohawksoft.com> wrote:
>
> I have a set of scripts, SQL table defs, a small C program, along with a
> set of field with files that loads it into PGSQL using the "copy from
> stdin" It works fairly well, but takes a good long time to load it all.
>
> Should I put it in the download section of my website?

Yes. I would be interested in looking at it even if I don't use exactly
the same way to do stuff. Taking a logn time to load the data into the
database isn't a big deal for me. reading through the tiger (and FIPS) data
documentation it seemed like there might be some gotchas in unusual cases
and I am not sure the google contest program really handled things right
so I would like to see another implementation. I am also interested in the
data model as that will save me some time.


Re: [HACKERS] OSS database needed for testing

From
"Jeffrey D. Brower"
Date:
Hi Josh,

Let me vote on the Tiger data.  I used to use this database.  It is public,
updated by the government, VERY useful in own right, it works well with the
earthdistance contribution, a real world database a lot of us use and I
think you can put together some killer scripts on it.

Can I vote twice?  <g>

    Jeff

----- Original Message -----
From: <pgsql@mohawksoft.com>
To: <josh@agliodbs.com>
Cc: <pgsql-general@postgresql.org>; <pgsql-performance@postgresql.org>;
<pgsql-hackers@postgresql.org>
Sent: Thursday, April 03, 2003 1:26 PM
Subject: Re: [PERFORM] [HACKERS] OSS database needed for testing


> I don't know that it meets your criteria, but.....
>
> I have a set of scripts and a program that will load the US Census TigerUA
> database into PostgreSQL. The thing is absolutely freak'n huge. I forget
> which, but it is either 30g or 60g of data excluding indexes.
>
> Also, if that is too much, I have a similar setup to load the FreeDB music
> database, from www.freedb.org. It has roughly 670,000 entries in
"cdtitles"
> and 8 million entries in "cdsongs."
>
> Either one of which, I would be willing to send you the actual DB on cd(s)
> if you pay for postage and media.
>
>
> > Folks,
> >
> > Please pardon the cross-posting.
> >
> > A small group of us on the Performance list were discussing the first
> > steps  toward constructing a comprehensive Postgresql installation
> > benchmarking  tool, mostly to compare different operating systems and
> > file systemsm but  later to be used as a foundation for a tuning
> > wizard.
> >
> > To do this, we need one or more real (not randomly generated*)
> > medium-large  database which is or can be BSD-licensed (data AND
> > schema).   This database  must have:
> >
> > 1) At least one "main" table with 12+ columns and 100,000+ rows (each).
> > 2) At least 10-12 additional tables of assorted sizes, at least half of
> > which  should have Foriegn Key relationships to the main table(s) or
> > each other. 3) At least one large text or varchar field among the
> > various tables.
> >
> > In addition, the following items would be helpful, but are not
> > required: 4) Views, triggers, and functions built on the database
> > 5) A query log of database activity to give us sample queries to work
> > with. 6) Some complex data types, such as geometric, network, and/or
> > custom data  types.
> >
> > Thanks for any leads you can give me!
> >
> > (* To forestall knee-jerk responses:  Randomly generated data does not
> > look or  perform the same as real data in my professional opinion, and
> > I'm the one  writing the test scripts.)
> >
> > --
> > -Josh Berkus
> >  Aglio Database Solutions
> >  San Francisco
> >
> >
> > ---------------------------(end of
> > broadcast)--------------------------- TIP 1: subscribe and unsubscribe
> > commands go to majordomo@postgresql.org
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)


Re: [HACKERS] OSS database needed for testing

From
Josh Berkus
Date:
Jeff,

> Let me vote on the Tiger data.  I used to use this database.  It is public,
> updated by the government, VERY useful in own right, it works well with the
> earthdistance contribution, a real world database a lot of us use and I
> think you can put together some killer scripts on it.

We'd have to use a subset of it.   30G is a little larger than anything we
want people to download as  a test package.

--
Josh Berkus
Aglio Database Solutions
San Francisco


Re: [HACKERS] OSS database needed for testing

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> We'd have to use a subset of it.   30G is a little larger than anything we
> want people to download as  a test package.

Yeah, it seems a bit over the top ...

The FCC database sounded like an interesting alternative to me.

            regards, tom lane


Re: [HACKERS] OSS database needed for testing

From
mlw
Date:

Josh Berkus wrote:

>Jeff,
>
>
>
>>Let me vote on the Tiger data.  I used to use this database.  It is public,
>>updated by the government, VERY useful in own right, it works well with the
>>earthdistance contribution, a real world database a lot of us use and I
>>think you can put together some killer scripts on it.
>>
>>
>
>We'd have to use a subset of it.   30G is a little larger than anything we
>want people to download as  a test package.
>
>
>
Actually, come to think of it, the TigerUA DB is in chunks. You can use
as much or as little as you want. I'll put the loader scripts on my
download page tonight.

Here is the home page for the data:
http://www.census.gov/geo/www/tiger/tigerua/ua_tgr2k.html



Re: [HACKERS] OSS database needed for testing

From
"Jeffrey D. Brower"
Date:
Absolutely.  We could just use one large state or several small ones and let
folks download the whole thing if they wanted.  Using that technique you
could control the size of the test quite closely and still make something
potentially quite valuable as a contribution beyond the bench.

----- Original Message -----
From: "Josh Berkus" <josh@agliodbs.com>
To: "Jeffrey D. Brower" <jeff@pointhere.net>; <pgsql@mohawksoft.com>
Cc: <pgsql-performance@postgresql.org>
Sent: Thursday, April 03, 2003 11:29 PM
Subject: Re: [PERFORM] [HACKERS] OSS database needed for testing


> Jeff,
>
> > Let me vote on the Tiger data.  I used to use this database.  It is
public,
> > updated by the government, VERY useful in own right, it works well with
the
> > earthdistance contribution, a real world database a lot of us use and I
> > think you can put together some killer scripts on it.
>
> We'd have to use a subset of it.   30G is a little larger than anything we
> want people to download as  a test package.
>
> --
> Josh Berkus
> Aglio Database Solutions
> San Francisco


Re: [HACKERS] OSS database needed for testing

From
Josh Berkus
Date:
Jeff, Mlw,

> Absolutely.  We could just use one large state or several small ones and
> let folks download the whole thing if they wanted.  Using that technique
> you could control the size of the test quite closely and still make
> something potentially quite valuable as a contribution beyond the bench.

Hold on a second.  The FCC database is still a better choice because it is
more complex with a carefully defined schema.   The Tiger database would be
good for doing tests of type 1 and 3, but not for tests of types 2 and 4.

It would certainly be interesting to use the Tiger database as the basis for
an additional type of test:

6) Very Large Data Set: querying, then updating, 300+ selected rows from a
2,000,000 + row table.

... but I still see the FCC database as our best candidate for the battery of
tests 1-5.

--
Josh Berkus
Aglio Database Solutions
San Francisco


Re: OSS database needed for testing

From
"Jeffrey D. Brower"
Date:
I think you got me there.  I have to agree with both points.

(Besides, you are the one coding this thing and I think you understand it
better than I do.)

Let me know if I can help.

    Jeff

----- Original Message -----
From: "Josh Berkus" <josh@agliodbs.com>
To: "Jeffrey D. Brower" <jeff@Green-Visor.US>; <pgsql@mohawksoft.com>
Cc: <pgsql-performance@postgresql.org>; <pgsql-hackers@postgresql.org>
Sent: Friday, April 04, 2003 11:09 AM
Subject: Re: [PERFORM] [HACKERS] OSS database needed for testing


> Jeff, Mlw,
>
> > Absolutely.  We could just use one large state or several small ones and
> > let folks download the whole thing if they wanted.  Using that technique
> > you could control the size of the test quite closely and still make
> > something potentially quite valuable as a contribution beyond the bench.
>
> Hold on a second.  The FCC database is still a better choice because it is
> more complex with a carefully defined schema.   The Tiger database would
be
> good for doing tests of type 1 and 3, but not for tests of types 2 and 4.
>
> It would certainly be interesting to use the Tiger database as the basis
for
> an additional type of test:
>
> 6) Very Large Data Set: querying, then updating, 300+ selected rows from a
> 2,000,000 + row table.
>
> ... but I still see the FCC database as our best candidate for the battery
of
> tests 1-5.
>
> --
> Josh Berkus
> Aglio Database Solutions
> San Francisco
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster