Re: The "many nulls" problem - Mailing list pgsql-performance

From Oleg Bartunov
Subject Re: The "many nulls" problem
Date
Msg-id Pine.LNX.4.64.0803142158140.27666@sn.sai.msu.ru
Whole thread Raw
In response to The "many nulls" problem  ("Kynn Jones" <kynnjo@gmail.com>)
Responses Re: The "many nulls" problem  ("Kynn Jones" <kynnjo@gmail.com>)
List pgsql-performance
Kynn,

have you seen contrib/hstore ? You can have one table with common attributes
and hide others in hstore

Oleg
On Fri, 14 Mar 2008, Kynn Jones wrote:

> It often happens that a particular pieces of information is non-null for a
> small minority of cases.  A superficially different manifestation of this is
> when two pieces of information are identical in all but a small minority of
> cases.  This can be easily mapped to the previous description by defining a
> null in one column to mean that its contents should be obtained from those
> of another column.  A further variant of this is when one piece of
> information is a simple function of another one in all but a small minority
> of cases.
>
> (BTW, I vaguely recall that RDb theorists have a technical term for this
> particular design issue, but I don't remember it.)
>
> In all these cases, the design choice, at least according to RDb's 101, is
> between including a column in the table that will be NULL most of the time,
> or defining a second auxiliary column that references the first one and
> holds the non-redundant information for the minority of cases for which this
> is necessary (and maybe define a VIEW that includes all the columns).
>
> But for me it is a frequent occurrence that my quaint and simple RDb's 101
> reasoning doesn't really apply for PostgreSQL.  Basically, Pg is too smart
> for it!  For example, does a large proportion of NULLs really imply a lot of
> wasted space?  Maybe this is true for fixed-length data types, but what
> about for type TEXT or VARCHAR?
>
> Just to be concrete, consider the case of a customers database for some home
> shopping website.  Suppose that, as it happens, for the majority of this
> site's customers, the shipping and billing addresses are identical.  Or
> consider the scenario of a company in which, for most employees, the email
> address can be readily computed from the first and last name using the rule
> First M. Last => first_last@acme.com, but the company allows some
> flexibility for special cases (e.g. for people like Yasuhiro Tanaka who's
> known to everyone by his nickname, Yaz, the email is
> yaz_tanaka@acme.combecause hardly anyone remembers or even knows his
> full name.)
>
> What's your schema design approach for such situations?  How would you go
> about deciding whether the number of exceptional cases is small enough to
> warrant a second table?  Of course, one could do a systematic profiling of
> various possible scenarios, but as a first approximation what's your
> rule-of-thumb?
>
> TIA!
>
> Kynn
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

pgsql-performance by date:

Previous
From: "Kynn Jones"
Date:
Subject: The "many nulls" problem
Next
From: Pascal Cohen
Date:
Subject: Re: Hardware question for a DB server