Re: Table Design for Hierarchical Data - Mailing list pgsql-sql

From Lee Hachadoorian
Subject Re: Table Design for Hierarchical Data
Date
Msg-id r2v5ab13581004071127j4cdbdabao3040241774820d50@mail.gmail.com
Whole thread Raw
In response to Re: Table Design for Hierarchical Data  (Achilleas Mantzios <achill@matrix.gatewaynet.com>)
List pgsql-sql
On Tue, Apr 6, 2010 at 2:34 PM, Little, Douglas
<DOUGLAS.LITTLE@orbitz.com> wrote:
> Hey Lee,
>
> I’m on dm-discuss@yahoogroups.com

Thanks for the pointer. I'm looking at their archives now.

> Ie a row is a point in time, or average for a period of time.  Are the
> Numbers actual, or estimates?    To be useful, you’ll want to be able to
> insert new records while retaining  history and projections and easily be
> able to build forecasts.

Yes, each row represents employment for a given time period (usually
year), geography (county, ZIP code, etc.), and NAICS code. I'm
planning to partition by year, particularly since the agency we get
data from releases preliminary data first, final data later, so that I
can easily drop or disinherit preliminary data. Numbers are actual
employment based on establishment reporting.

> I suppose you can get data at the 61, 611, and 6111 level.  You want to be
> able to accurately sum  where code like ‘61%’

We would never do a sum like 61%, because that would double or triple
count data all the way down the hierarchy. The employment for NAICS 61
at a particular geography and time is the same as the sum of all
3-digit children (61#) which is also the same as the sum of all
4-digit grandchildren (61##), etc.

On Tue, Apr 6, 2010 at 6:04 PM, Michael Glaesemann <grzm@seespotcode.net> wrote:
> Another is nested sets which performs quite nicely for loads which are more read than write (which I suspect is the
casehere). 

You are right that we will be reading more than writing, but the SQL
looks complicated, and I don't have the skills to build the kind of
application layer that would allow our users to work with data stored
as a nested set.

>> The problem is that because of nondisclosure rules, the
>> data is sometimes censored at the more specific level.
>
> I don't know if this is per-user or per-category or what, but it may be something you store separately from the main
table.

Suppression is per-category. All users at our org will have access to
the same data. More info below.

On Tue, Apr 6, 2010 at 8:23 PM, Steve Crawford
<scrawford@pinpointresearch.com> wrote:
> reports. The CTE/recursive-query features in 8.4 are great for this. But in
> the case you have described, the number of levels is well defined as is the
> type of information associated with each level.

The number of levels is well-defined, but we won't have data for all
time-periods/geographies/NAICS levels. Because of nondisclosure rules,
we might have 4-digit data at the county level but only 2-digit data
at the ZIP code, and not have full coverage for all years.

> One question that might impact this is the coding of your source data. Is it
> all full 5-digit coding or are some records coded at a high level of detail
> and others only to the top-level?

We actually don't usually get below 4-digit, but the answer is the
latter: some data is available at a detailed level and some data only
at the top level.

> What do you mean by censored? Is the data supplied to you pre-aggregated to
> some level and censored to preserve confidentiality or are do you have the
> record-level source data and the responsibility to suppress data in your
> reports? Is the data suppression ad-hoc (i.e. someone comes to you and says

The state agency gives us pre-aggregated data not microdata. The exact
suppression rule is we don't get any data for a cell with fewer than 3
firms or where one firm has >80% of the total employment. Thus, the
smaller the cells (smaller geography, more specific NAICS
categorization), the more likely we run into empty cells.

On Wed, Apr 7, 2010 at 2:17 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> Since it's an identifier and not really a numeric per se, I'd store it
> as text.  I mean it could as easily be a 5 character alpha code as 5
> character number code.

Yes, already decided to store as text. Thanks for the substring index
suggestion.

On Wed, Apr 7, 2010 at 4:26 AM, Sergey Konoplev <gray.ru@gmail.com> wrote:
> Haven't you thought about ltree contrib? From the description of
> ltree: "This module implements a data type ltree for representing
> labels of data stored in a hierarchical tree-like structure".
>
> http://www.postgresql.org/docs/8.4/interactive/ltree.html

No I was not familiar with this, and it looks really promising. Thanks
for the pointer. It's a little repetitive, but it looks like the path
should be stored as 61.611.6111. Assuming I define a column naics as
ltree, being able to query WHERE nlevel(naics) = [2|3|4] will work
nicely, and with the right views, my users never have to see it.

Thanks again to everyone who replied. Any further remarks, questions,
comments are welcome.

--Lee

--
Lee Hachadoorian
PhD Student, Geography
Program in Earth & Environmental Sciences
CUNY Graduate Center


pgsql-sql by date:

Previous
From: Achilleas Mantzios
Date:
Subject: Re: Table Design for Hierarchical Data
Next
From: Yeb Havinga
Date:
Subject: Re: Table Design for Hierarchical Data