Re: SQL - planet redundant data - Mailing list pgsql-general

From Poul Jensen
Subject Re: SQL - planet redundant data
Date
Msg-id 43252772.4050808@gfy.ku.dk
Whole thread Raw
In response to Query questions  (Poul Jensen <flyvholm@gfy.ku.dk>)
Responses Re: SQL - planet redundant data
Re: SQL - planet redundant data
List pgsql-general
Thank you for your input! Individual replies follow below.

##############################

Chris Travers wrote:

> Ok.  Imagine two huge huge tables:
>
> file_data    and additional_data
>
> create_table file_data (
>    file_id serial primary key,
>    station_id text,
>     ....
> );
>
> create table additional_data (
>    data_id bigserial primary key,
>    file_id int REFERENCES file_data (file_id),
>    temp,
>    ....
> );

So the 500 million entries in additional_data.file_id are now restricted
to belong to the same group of values found in file_data.file_id. But I
don't see how this can get me rid of redundant data - I still have 500M
entries? Consider the column "year". My 500M records come from 13 years.
Saving 500M values when only 13 are distinct is very redundant.

>
> Also you can normalize your way out of the redundant data problem.
>
I can tell how to do this  for filename and station name - I save
something by saving an integer rather than a text string. But all the
rest of my columns are already smallints. Do I save anything by storing
"5" rather than "1997"?

>
> I would *highly* recommend waiting until 8.1 goes live to impliment
> this in production.
>
It will be built/rebuilt after 8.1 is released.

##############################

John D. Burger wrote:

> Why not effectively append all of your per-file tables into one huge
> table, with an additional column indicating which file the data comes
> from?  Then you have only two tables, one with a row for each
> observation in your data, one with a row for each file.  Some queries
> need to join the two tables, but that's not a big deal.

That big table with a row for each observation will have to include all
columns, such as year and station_id. For such columns I store 500M
values of which only ~15 are distinct in my case - the redundancy I'm
trying to minimize. By splitting into smaller tables, many columns will
contain only one distinct value. It *should* be possible to delete such
columns and instead store their corresponding values in a "table
header". If I had one table pr. file, the table you suggest with one row
pr. file could effectively serve as the collection of table headers.

>
> It also seems to me that you may be tying your schema design too
> closely to the current way that the data is represented.  Do you
> really need to have the data file figure so prominently in your design?

The archive of data files is shared by a number of research groups
around the world. There's a whole software package that people use for
data analysis, accessing the data files as they are. So I expect a lot
of file-specific queries.

> Hmm, in fact if the redundant values you're worried about come in long
> stretches (e.g., temperature is the same for many observations in a
> row), I suppose you could do the same thing - map a constant value to
> the range of observation IDs for which it holds.  This gets back to
> having many tables, though.
>
This is in effect normalization? But if the observation ID takes just as
much storage as the original value, have we gained anything? BTW, I'm
not aiming at redundancy in the measurements - this is minimal compared
to e.g. year and station ID. I do have an idea of how it *should* be
possible to get rid of much of it - read the response to Tom Lane below.

##############################

Tom Lane wrote:

>No, tableoid is sort of a virtual column ... it doesn't exist on disk.
>When you query it you get a value fetched from the internal data
>structure representing the table.
>
>
So virtual columns are possible - THIS is a way to clear redundant data!
Is it possible for a user to create a virtual column? If not, this would
make a big improvement.

What I really need are "partial virtual columns". I'm imagining an
alternative version of VACUUM ANALYZE that could do the following:
1) Order the rows in the table so that for each column, identical values
are placed next to each other for as far as possible (the row order that
optimizes one column will probably not be optimal for other columns).
2) For each column, identify the stretches that contain only one
distinct value. Save that value together with ID of start and end row
and delete stretch.
It is not obvious how to do a perfect optimization process in 1), at
least not to me - I'm sure a skilled mathematician would know exactly
how to do it. But here's a simple approach that would get us part of the
way:
1.1) Grab the column w. most redundancy (fewest distinct values) and
sort it into groups according to the distinct values.
1.2) For each of these groups, grab the column w. next most redundancy
and sort into groups according to the distinct values.
And so on. Stop whenever groups become so small that there's nothing to
gain.
Such an analysis would make it much less expensive to combine
same-schema tables, and having everything in the same table is really
convenient. It would obviously save a lot of storage space, but I
imagine it would enable more efficient queries too - having to check
just 3 values instead of the thousands (or even millions) they may
replace must give a considerable gain.

'What is the big benefit of not having ordered rows? I imagine it could
be a disadvantage for dynamic databases, but for a static database like
mine which won't be modified, except for maybe adding new data once a
year, I imagine an optimization including row ordering could be highly
beneficial.

##############################

Jim C. Nasby wrote:

>What you seem to be looking for is a form of partitioning. PostgreSQL
>doesn't currently support partitioning of this form, but there's work in
>progress to change that.
>
>
Any idea how far out in the future this is? Would it make the
optimization process described above (reply to Tom Lane) obsolete? Well,
maybe my ideas about an optimal solution just illustrate lack of
knowledge about SQL, but I'm hoping somebody can see what I'm trying to
suggest.

>As someone else mentioned, you could do it with a union all view.
>http://cvs.distributed.net/viewcvs.cgi/stats-sql/logdb/ has an example
>of this.
>
>
Thank you - it does look as if some union all views could come in handy.

##############################

Thanks and best regards,

Poul Jensen

pgsql-general by date:

Previous
From: Sim Zacks
Date:
Subject: Re: pg_autovacuum not doing anything
Next
From: suresh ramasamy
Date:
Subject: help me learn