Re: SQL - planet redundant data - Mailing list pgsql-general

From John D. Burger
Subject Re: SQL - planet redundant data
Date
Msg-id e7747ead06fc966805d2f993a2df3edb@mitre.org
Whole thread Raw
In response to Re: SQL - planet redundant data  (Poul Jensen <flyvholm@gfy.ku.dk>)
Responses Re: SQL - planet redundant data
List pgsql-general
>> Hmm, in fact if the redundant values you're worried about come in
>> long stretches (e.g., temperature is the same for many observations
>> in a row), I suppose you could do the same thing - map a constant
>> value to the range of observation IDs for which it holds.  This gets
>> back to having many tables, though.
>
> This is in effect normalization? But if the observation ID takes just
> as much storage as the original value, have we gained anything? BTW,
> I'm not aiming at redundancy in the measurements - this is minimal
> compared to e.g. year and station ID.

I don't think this is exactly what people meant by normalization, but
maybe.  My basic thought was that you since you have some redundancy in
your data, you might want to use some form of compression.  One of the
simplest forms of compression is called run-length encoding
(http://en.wikipedia.org/wiki/Run_length_encoding).  So you'd have most
of your data in a main table:

   create table observations (
     obsID            integer    primary key,    -- Maybe a BIGINT
     temperature    float,
     etc.
   );

and some other "compressed" tables for those features that have long
runs of repetitive values:

   create table obsYears (
    startObs    integer    primary key    references observations (obsID),
    endObs    integer                references observations (obsID),
    year        integer);

   create table obsStations (
    startObs    integer    primary key    references observations (obsID),
    endObs    integer                references observations (obsID),
    stationID    integer);

(Caution, I haven't checked these for syntax.)  I've introduced an
observation ID, and then I have "compressed" tables that map =ranges=
of these IDs to values that are constant for long stretches.  Each year
occupies only one row, same with each station.  (I think your reply to
Tom may have been getting at something like this.)  Now you can do
queries like this, say, for temperature statistics in a particular
year:

   select avg(temperature), stddev(temperature) from observations,
obsYears
     where obsID between startObs and endObs
     and year = 2001;

You could join in other compressed tables in the same way.  In fact,
you could glue them all together with a VIEW, and you'd be able to
treat the whole thing like one giant table, with much of the redundancy
removed.  Note that if you define indexes on the startObs and endObs
columns, Postgresql might avoid scanning through the compressed tables
every time you do a query.  You might also benefit from a composite
index on (startObs, endObs).  For features like year, which are
massively repetitive, this might even be faster than storing the
feature in the main table, since the compressed table will easily fit
in memory.

So the basic idea is run-length encoding for repetitive values.  I
think I can use this in some of my own data - I don't know why I never
thought of it before.

- John D. Burger
   MITRE



pgsql-general by date:

Previous
From: Tom Lane
Date:
Subject: Re: If an index is based on 3 columns will a query using two of the columns utilize the index?
Next
From: Tom Lane
Date:
Subject: Re: If an index is based on 3 columns will a query using two of the columns utilize the index?