Thread: Complex database for testing, U.S. Census Tiger/UA

Complex database for testing, U.S. Census Tiger/UA

From
mlw
Date:
The U.S. Census provides a database of street polygons and other data 
about landmarks, elevation, etc. This was discussed in a separate thread.

The main URL is here:
http://www.census.gov/geo/www/tiger/index.html

My loader was written for the 2000 version; the 2002 version has some 
differences, but it should be easy enough to add the fields.

On my site, in the downloads section, at the bottom is the tigerua 
loader. It is very raw, just hacked together to load the data. It may 
take a little work to make it handle the 2002 files; I have not looked 
at that yet.
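
To give a feel for what loading involves: the TIGER/Line files are
fixed-width text records, so the loader mostly just slices fields and
feeds them to COPY. A rough sketch of the idea in Python (the field
offsets are my reading of the 2000 Record Type 1 layout, and the output
targets a hypothetical rt1 table, so double check everything against
the Census spec):

    import sys

    # Record Type 1 ("complete chains") is one fixed-width line per road
    # segment; the offsets below are assumptions from the 2000 spec.
    def parse_rt1(line):
        tlid = int(line[5:15])            # permanent TIGER/Line record ID
        fename = line[19:49].strip()      # feature (street) name
        fetype = line[49:53].strip()      # feature type (Ave, St, ...)
        # Coordinates are signed integers, six implied decimal places.
        frlong = int(line[190:200]) / 1000000.0
        frlat = int(line[200:209]) / 1000000.0
        tolong = int(line[209:219]) / 1000000.0
        tolat = int(line[219:228]) / 1000000.0
        return (tlid, fename, fetype, frlong, frlat, tolong, tolat)

    if __name__ == "__main__":
        # Emit tab-separated rows suitable for piping into
        #   psql -c "COPY rt1 FROM stdin"
        for line in open(sys.argv[1]):
            rec = parse_rt1(line.rstrip("\r\n"))
            sys.stdout.write("\t".join(str(f) for f in rec) + "\n")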

My site:
http://www.mohawksoft.com



Re: Complex database for testing, U.S. Census Tiger/UA

From
Jan Wieck
Date:
mlw wrote:
> 
> The U.S. Census provides a database of street polygons and other data
> about landmarks, elevation, etc. This was discussed in a separate thread.
> 
> The main URL is here:
> http://www.census.gov/geo/www/tiger/index.html

While yes, the tiger database (or rather its content) is interesting, I
don't think that it can be counted as a "complex database". Just because
something is big doesn't make it complex.

> 
> My loader was written for the 2000 version; the 2002 version has some
> differences, but it should be easy enough to add the fields.

OT:

Just out of curiosity, do you plan to do more with this? I was playing
around with the 2000 version a while back, but the Garmin GPS units
unfortunately use a proprietary map format, so one cannot generate one's
own detail maps for download. The waypoint and route data protocol is
well known, though.
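
(For reference, the published Garmin device interface spec represents
positions as signed 32-bit "semicircles", so the mapping to plain
lat/long degrees is trivial. A quick sketch in Python; the function
names are mine:)

    # 2^31 semicircles span 180 degrees, per the published spec.
    def semicircles_to_degrees(sc):
        return sc * (180.0 / 2 ** 31)

    def degrees_to_semicircles(deg):
        return int(deg * (2 ** 31 / 180.0))

    # e.g. degrees_to_semicircles(45.0) == 536870912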


Jan

-- 

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #



Re: Complex database for testing, U.S. Census Tiger/UA

From
pgsql@mohawksoft.com
Date:
> mlw wrote:
>>
>> The U.S. Census provides a database of street polygons and other data
>> about landmarks, elevation, etc. This was discussed in a separate
>> thread.
>>
>> The main URL is here:
>> http://www.census.gov/geo/www/tiger/index.html
>
> While yes, the tiger database (or rather its content) is interesting,
> I don't think that it can be counted as a "complex database". Just
> because something is big doesn't make it complex.

I guess you are right, but there are a lot of related tables. I wouldn't
call it simple, though it certainly gets huge.
>
>>
>> My loader was written for the 2000 version; the 2002 version has some
>> differences, but it should be easy enough to add the fields.
>
> OT:
>
> Just out of curiosity, do you plan to do more with this? I was playing
> around with the 2000 version a while back, but the Garmin GPS units
> unfortunately use a proprietary map format, so one cannot generate one's
> own detail maps for download. The waypoint and route data protocol is
> well known, though.

I'm not sure what a Garmin GPS unit is, but the TigerUA DB uses longitude
and latitude. Any reasonable geographical system must somehow map to lat/long.

Actually, I am going to download the latest version and get it installed on
a system. There is a project I plan to work on in the near future, after all
the other crap I gotta do, that will make use of the data.



Re: Complex database for testing, U.S. Census Tiger/UA

From
cbbrowne@cbbrowne.com
Date:
Jan Wieck wrote:
> mlw wrote:
> > 
> > The U.S. Census provides a database of street polygons and other data
> > about landmarks, elevation, etc. This was discussed in a separate thread.
> > 
> > The main URL is here:
> > http://www.census.gov/geo/www/tiger/index.html
> 
> While yes, the tiger database (or rather its content) is interesting, I
> don't think that it can be counted as a "complex database". Just because
> something is big doesn't make it complex.

Just so.

There are doubtless interesting cases that may be tested by virtue of
having a data set that is large, and perhaps "deeply interlinked."

But that only covers cases that have to do with "largeness."  It doesn't
help ensure that PostgreSQL plays well when it gets hit by nested sets
of updates, where the challenge is to keep the system performing
acceptably, and free of deadlocks, under complex sets of transactions.

So an "interesting" test might involve not only a database, but also a
set of transactions that hit multiple tables in order to update that
database; in effect, something like the "readers/writers" workloads that
get used to test locking semantics.

This could not consist solely of a set of tables; it would have to
include streams of updates.  Something like one of the TPC benchmarks...
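
To sketch the shape I mean: a TPC-B-style transaction (roughly the
profile pgbench uses) makes every stream touch a shared branch row, so
concurrent streams genuinely contend. In Python, with table names and
the psycopg2 driver chosen purely for illustration:

    import random
    import psycopg2  # any DB-API driver would do; this one is an assumption

    def one_transaction(conn, n_branches, n_accounts):
        # Debit/credit an account, roll the amount up into a shared
        # branch row, and append to a history table. The branch update
        # is the contended part.
        bid = random.randint(1, n_branches)
        aid = random.randint(1, n_accounts)
        delta = random.randint(-5000, 5000)
        cur = conn.cursor()
        cur.execute("UPDATE accounts SET abalance = abalance + %s"
                    " WHERE aid = %s", (delta, aid))
        cur.execute("UPDATE branches SET bbalance = bbalance + %s"
                    " WHERE bid = %s", (delta, bid))
        cur.execute("INSERT INTO history (bid, aid, delta)"
                    " VALUES (%s, %s, %s)", (bid, aid, delta))
        conn.commit()
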
--
output = reverse("moc.enworbbc@" "enworbbc")
http://www3.sympatico.ca/cbbrowne/rdbms.html
"If I  could find  a way to  get [Saddam  Hussein] out of  there, even
putting a  contract out on him,  if the CIA  still did that sort  of a
thing, assuming it ever did, I would be for it."  -- Richard M. Nixon



Re: Complex database for testing, U.S. Census Tiger/UA

From
Dustin Sallings
Date:
Around 11:24 on Apr 8, 2003, cbbrowne@cbbrowne.com said:
I think my first Python application was one that parsed the zip files
containing these data and shoved them into a postgres system. I had
multiple clients on four or five computers running nonstop for about two
weeks to get it all populated.

By the time I was done and got my first index created, I began to run
out of disk space.  I think I only had about 70GB to work with on the
RAID array.

--
SPY                      My girlfriend asked me which one I like better.
pub  1024/3CAE01D5 1994/11/03 Dustin Sallings <dustin@spy.net>
|    Key fingerprint =  87 02 57 08 02 D0 DA D6  C8 0F 3E 65 51 98 D8 BE
L_______________________ I hope the answer won't upset her. ____________



Re: Complex database for testing, U.S. Census Tiger/UA

From
cbbrowne@cbbrowne.com
Date:
Dustin Sallings wrote:
>     I think my first Python application was one that parsed the zip
> files containing these data and shoved them into a postgres system. I
> had multiple clients on four or five computers running nonstop for
> about two weeks to get it all populated.
> 
>     By the time I was done and got my first index created, I began to
> run out of disk space.  I think I only had about 70GB to work with on
> the RAID array.

But this does not establish that this data represents a meaningful
"transactional" load.

Since the source files presumably contain unique data, the
"transactions" all touch independent sets of data, and are likely to be
totally uninteresting from the perspective of seeing how the system
works under /TRANSACTION/ load.

TRANSACTION loading involves doing updates that actually have some
opportunity to trample on one another: multiple transactions
concurrently updating a single balance table, multiple transactions
concurrently trying to attach links to a table entry.  That sort of
thing.
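
For instance (a minimal sketch; the table, connection string, and
psycopg2 driver are all assumptions), several threads all updating the
same balance row, so the row locks actually collide instead of the
workload degenerating into independent I/O:

    import threading
    import psycopg2  # assumed driver

    def hammer(dsn, iterations):
        conn = psycopg2.connect(dsn)
        cur = conn.cursor()
        for _ in range(iterations):
            # Every worker hits the same row, so this measures lock
            # waits and commit behavior, not just raw update speed.
            cur.execute("UPDATE balance SET amount = amount + 1"
                        " WHERE id = 1")
            conn.commit()
        conn.close()

    threads = [threading.Thread(target=hammer, args=("dbname=test", 1000))
               for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()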

I remember a while back when MSFT did an "enterprise scalability day,"
where they were trumpeting SQL Server performance on "hundreds of
millions of transactions."  At the time, I was at Sabre, who actually do
tens of millions of transactions per day for passenger reservations
across lots of airlines.  Microsoft was making loud noises to the effect
that NT Server was wonderful for "enterprise transaction" work; the guys
at work just laughed, because the kind of performance they got involved
considerable amounts of 370 assembler to tune vital bits of the
systems.

What happened in the "scalability tests" was that Microsoft did much the
same thing you did: they had hordes of transactions going through that
were, well, basically independent of one another.  They could "scale"
things up trivially by adding extra boxes.  Need to handle 10x the
transactions?  Well, since they don't actually modify any shared
resources, you just need to put in 10x as many servers.

And that's essentially what happens whenever a TPC-? benchmark reaches
the point of irrelevance: someone figures out a "hack" that successfully
partitions the work load.  At that point, they merely need to add a bit
of extra hardware, and increasing performance is as easy as adding extra
processor boards.  The real world doesn't scale so easily...
--
(concatenate 'string "cbbrowne" "@acm.org")
http://cbbrowne.com/info/emacs.html
Send  messages calling for fonts  not  available to the  recipient(s).
This can (in the case of Zmail) totally disable the user's machine and
mail system for up to a whole day in some circumstances.
-- from the Symbolics Guidelines for Sending Mail



Re: Complex database for testing, U.S. Census Tiger/UA

From
Josh Berkus
Date:
MLW,

> > The U.S. Census provides a database of street polygons and other data
> > about landmarks, elevation, etc. This was discussed in a separate thread.
> >

Yeah, this was me.  We decided to go with the FCC database because it is
more manageably sized and has extensive schema documentation.
Personally, I'd be happy to see someone put together a "huge table" test
using the Tiger database, but for general tests we're aiming more at the
50-100 MB size.

--
-Josh Berkus
Aglio Database Solutions
San Francisco



Re: Complex database for testing, U.S. Census Tiger/UA

From
"Merlin Moncure"
Date:
Josh Berkus wrote:
> MLW,
>
> > > The U.S. Census provides a database of street polygons and other
> > > data about landmarks, elevation, etc. This was discussed in a
> > > separate thread.
> > >
> Personally, I'd be happy to see someone put together a "huge table" test
> using the Tiger database, but for general tests we're aiming more at the
> 50-100 MB size.

The Tiger US street-level data would be an excellent test of the polygon
storage and extraction routines.  My information might no longer be
current, but the last time I checked, Tiger gave the street-level data
out on cd (er, cds) as one huge table of disconnected road 'segments'
broken up by state.  Connecting the segments into longer streets for
more meaningful processing is a good benchmarking procedure (see the
sketch below).  I think this is interesting strictly on that level.  It
doesn't test the optimizer or esoteric features much (except for geo
features), but it is a good test of index/cache/random tuple access.
It's typical of the scientific/data-processing problem domain, which is
much less common (but much more interesting!) than your average
business-based app.  I definitely understand mlw's thinking.
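
Something like this is the chaining exercise I mean (purely
illustrative Python; real TIGER data adds reversed segments, loops, and
name variants that this ignores):

    def chain_segments(segments):
        # segments: list of (name, from_pt, to_pt), with points as
        # (lon, lat) tuples. Returns (name, [points]) chains, built
        # greedily by matching a chain's end to another segment's start.
        by_start = {}
        for i, (name, frm, to) in enumerate(segments):
            by_start.setdefault((name, frm), []).append(i)
        used = set()
        streets = []
        for i, (name, frm, to) in enumerate(segments):
            if i in used:
                continue
            used.add(i)
            points = [frm, to]
            # Keep walking while exactly one unused segment with the
            # same street name starts where the chain currently ends.
            while True:
                nxt = [j for j in by_start.get((name, points[-1]), [])
                       if j not in used]
                if len(nxt) != 1:
                    break
                used.add(nxt[0])
                points.append(segments[nxt[0]][2])
            streets.append((name, points))
        return streets
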
That being said, since the major competitors lack the robust geo
types/indices of postgres (the only way, IMHO, to do this type of
thing), it wouldn't be a very fair test.  I would also point out that
the problem is scalable: pick one state, e.g. Rhode Island :), and
build off that.  One thing at a time, though.

It's no accident we have a 'PostGIS' and not a 'MyGIS' :)

Merlin