Thread: Query questions

From: Poul Jensen
I'm building a database containing key parameters for ~500,000 data
files. The design I found logical is

Two tables for each file:
1) Larger table with detailed key parameters
    (10-15 columns, ~1000 rows), call it large_table
2) Small table with file summary
    (~30 columns, 1 row), call it small_table

A major purpose is to gain the ability to identify all data in the original
archive (~3 terabyte) satisfying some criteria that can be specified in
the database queries. But I find the query options surprisingly limited
(or difficult - sure, things can be done if you write enough code...)
Here is what I would like to be possible:

SELECT <large_table columns> FROM <regular expression>
     WHERE <condition on large_table>
    IF <condition on corresponding small_table>;

In words: For each of many large_tables, check corresponding small
table and only process large_table if some condition is met.

2 problems:
1) Can't use regular expression to specify tables. Wouldn't that be nice?
    Is it possible to specify a lot of tables without having to write custom
    functions in e.g. plsql?
2) No IF statement. How do I issue commands (or redesign
    database) so I can check the file summary before diving into the
    large table? This ought to be simple...

Any input will be highly appreciated, thanks in advance!

Poul Jensen

Re: Query questions

From: Ragnar Hafstað
On Sat, 2005-09-03 at 00:59 -0800, Poul Jensen wrote:
> I'm building a database containing key parameters for ~500,000 data
> files. The design I found logical is
>
> Two tables for each file:
> 1) Larger table with detailed key parameters
>     (10-15 columns, ~1000 rows), call it large_table
> 2) Small table with file summary
>     (~30 columns, 1 row), call it small_table

you want to create 1 million tables, all with one of
2 schemas?

why not just 2 tables, each with the additional "file"
column ?

> ...
> SELECT <large_table columns> FROM <regular expression>
>      WHERE <condition on large_table>
>     IF <condition on corresponding small_table>;

this would then be something like:

  SELECT <large_table columns> FROM large_table
    WHERE file ~ <regular expression>
    AND <condition on large_table>
    AND <subquery involving small_table>
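
to see it work end to end, here is a toy version (Python with SQLite
standing in for postgresql; table contents and column names are invented,
and LIKE stands in for the regex match):

```python
import sqlite3

# one large_table and one small_table, each with a "file" column,
# instead of one table pair per file
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE small_table (file TEXT PRIMARY KEY, avg_tmp REAL)")
cur.execute("CREATE TABLE large_table (file TEXT, mn INTEGER, tmp REAL)")
cur.executemany("INSERT INTO small_table VALUES (?, ?)",
                [("78_f6548.dat", 15.5), ("57_f4296.dat", 8.3)])
cur.executemany("INSERT INTO large_table VALUES (?, ?, ?)",
                [("78_f6548.dat", 0, 15.3), ("78_f6548.dat", 1, 15.4),
                 ("57_f4296.dat", 0, 8.1)])

# only process large_table rows whose file's summary meets a condition
rows = cur.execute("""
    SELECT lt.file, lt.mn, lt.tmp
    FROM large_table lt
    WHERE lt.file LIKE '78%'              -- pattern match on the file column
      AND lt.tmp > 15.0                   -- condition on large_table
      AND EXISTS (SELECT 1 FROM small_table st
                  WHERE st.file = lt.file
                    AND st.avg_tmp > 10.0)  -- condition on small_table
""").fetchall()
print(rows)  # only the rows from 78_f6548.dat survive
```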

gnari



Re: Query questions

From: Poul Jensen
Thank you very much for your response! It leads to another couple of
questions:

>>I'm building a database containing key parameters for ~500,000 data
>>files. The design I found logical is
>>
>>Two tables for each file:
>>1) Larger table with detailed key parameters
>>    (10-15 columns, ~1000 rows), call it large_table
>>2) Small table with file summary
>>    (~30 columns, 1 row), call it small_table
>
>you want to create 1 million tables, all with one of
>2 schemas?
>
>

I started out with a schema for each file, thinking I could utilize the
schema structure in queries, but I don't see how. Schemas are useful for
grouping tables according to users/owners. Other than that, do they add
anything but a dot in the table name?

>why not just 2 tables, each with the additional "file"
>column ?
>
>
>>...
>>SELECT <large_table columns> FROM <regular expression>
>>     WHERE <condition on large_table>
>>    IF <condition on corresponding small_table>;
>
>this would then be something like:
>
>  SELECT <large_table columns> FROM large_table
>    WHERE file ~ <regular expression>
>    AND <condition on large_table>
>    AND <subquery involving small_table>
>
>

The large_table would have ~500 million rows, each of which would have
to be checked for the first condition (pattern matching above). With
separate tables there are "only" ~500,000 initial checks to do.
Also, I don't see how to construct that small_table subquery. If it is
possible I would love to know how! Can you (or anybody else) give an
example?

Thank you,

Poul

Re: Query questions

From: Douglas McNaught
Poul Jensen <flyvholm@gfy.ku.dk> writes:

>>you want to create 1 million tables, all with one of
>>2 schemas?
>
> I started out with a schema for each file, thinking I could utilize
> the schema
> structure in queries, but I don't see how. Schemas are useful for grouping
> tables according to users/owners. Other than that, do they add anything
> but a dot in the table name?

The word "schema" is a bit overloaded--here it means "table
specification" rather than "namespace".

A general rule of database design is: if you end up creating multiple
tables with the same schema, you're doing it wrong--they should all be
one table with an additional key (usually indexed) to distinguish the
data.

-Doug

Re: Query questions

From: Poul Jensen
>>>you want to create 1 million tables, all with one of
>>>2 schemas?
>>
>>I started out with a schema for each file, thinking I could utilize
>>the schema
>>structure in queries, but I don't see how. Schemas are useful for grouping
>>tables according to users/owners. Other than that, do they add anything
>>but a dot in the table name?
>
>The word "schema" is a bit overloaded--here it means "table
>specification" rather than "namespace".
>
>A general rule of database design is: if you end up creating multiple
>tables with the same schema, you're doing it wrong--they should all be
>one table with an additional key (usually indexed) to distinguish the
>data.
>
>-Doug
>
>
Thank you for clarifying - that had me confused!

Would the reason for this general rule be that if you combine same-schema
tables, you can access them all with one query? At least that would be the
motivation in my case, but it isn't free. It would cost space (500 million
entries for that extra key, plus an index) and query time (the 1st condition
would now have to be checked on 500 million entries, of which "only" the
500,000 unique ones should need checking - indexing helps, but there must
be some loss).

If this is really necessary, it's because it's so unbelievably hard to
access many tables in one query that I'm thinking the developers maintain
restrictions on query options in order to force certain database designs -
your general rule above. If so, what is the advantage? I see only
disadvantages (a pain, to be honest).

Thank you,

Poul

Re: Query questions

From: Chris Travers
Poul Jensen wrote:

>
>>>> you want to create 1 million tables, all with one of
>>>> 2 schemas?
>>>
>>> I started out with a schema for each file, thinking I could utilize
>>> the schema
>>> structure in queries, but I don't see how. Schemas are useful for
>>> grouping
>>> tables according to users/owners. Other than that, do they add anything
>>> but a dot in the table name?
>>
>> The word "schema" is a bit overloaded--here it means "table
>> specification" rather than "namespace".
>>
>> A general rule of database design is: if you end up creating multiple
>> tables with the same schema, you're doing it wrong--they should all be
>> one table with an additional key (usually indexed) to distinguish the
>> data.
>>
>> -Doug
>>
>>
> Thank you for clarifying -that had me confused!
>
> Would the reason for this general rule be that if you combine same-schema
> tables, you can access them all with one query?

No.  It is so that the single query becomes manageable.  It also greatly
simplifies database maintenance.

> At least that would be the
> motivation in my case, but it isn't free. It would cost space (500
> million entries
> for that extra key, add index) and query time (1st condition need now be
> checked on 500 million entries of which "only" the 500,000 unique ones
> should have to be checked - indexing helps, but there must be a loss).

I suppose it depends on how the data is distributed in the table.  For
the record, I don't think that 500M records should be that problematic.

Now, suppose you need to partition this further (best to wait until 8.1
for performance reasons): you could partition it into, say, 10
subtables.  You could use inheritance and check constraints to make this
transparent to the application and still perform reasonably well.

>
> If this is really necessary it's because it's so unbelievably hard to
> access
> many tables in one query that I'm thinking the developers maintain
> restrictions
> on query options in order to force certain database designs - your
> general rule
> above. If so, what is the advantage? I see only disadvantages (a pain
> to be
> honest).

Look into inheritance.  It makes this easier.  However, I don't care
which RDBMS you use, management of 1000 identical tables is going to be
a real pain and I think that everyone here will probably suggest that it
is not exactly a sane thing to do.

Best Wishes,
Chris Travers
Metatron Technology Consulting


Re: Query questions

From: Poul Jensen
> Look into inheritance.  It makes this easier.  However, I don't care
> which RDBMS you use, management of 1000 identical tables is going to
> be a real pain and I think that everyone here will probably suggest
> that it is not exactly a sane thing to do.

Thank you, Chris. I have omitted two important points. One is:
The database needs no maintenance. Once created nothing in it will be
modified - there'll only
be SELECT queries to look up data satisfying certain criteria.

To summarize the task:
I have ~500,000 data files containing ~1,000 records each. The database
should then contain:
1) A detailed table (~1,000 rows, ~15 columns) for each file
2) A small table with file summary (1 row, ~30 columns) for each file
The typical query should then check file summaries to identify which
detailed tables/subtables
to check for further conditions.

The other important point I forgot about (much worse):
The detailed table for each file is created with a number of columns
which for most files have the same value for all records. I had planned
to delete columns containing only a single value and push it to the file
summary instead. Hence, the 500,000 detailed tables wouldn't all be
identical. The initial check on the file summary would determine whether
the detailed table for a given file should be searched and, if so, what
columns are found in it.

I guess I could either:
1) Add a lot of redundant data to the database so tables can be
    combined, allowing single-query searches. Judging from the size of
    my test database I would end up with ~200 GB even without the
    redundant data, so I do consider this a problem.
2) Write code (e.g. in C) with a loop to do separate queries for every
    file - I imagine this would be terribly inefficient compared to the
    single-query option.

Question:
If I create a table with all the columns needed to combine the 500,000
tables and I simply omit the redundant columns when entering the
subtable for a given file, will PostgreSQL fill in values, not allowing
me to save any space? Would I have to make an inheritance hierarchy to
save the space? If so, all columns not in the parent table can only be
accessed via separate, child-table-specific queries, which is
troublesome... :-|

Help!

Poul

SQL - planet redundant data

From: Poul Jensen
I have ~500,000 data files each containing ~1,000 records that I want to
put into a database for easy access.
Fictive example for illustration: File w. meteorological data from a
given station.

 stat_id |  yr  | d_o_y | hr | mn | tmp  | wind
---------|------|-------|----|----|------|------
    78   | 2001 |  211  | 14 |  0 | 15.3 |  4.7
    78   | 2001 |  211  | 14 |  1 | 15.4 |  5.4
    78   | 2001 |  211  | 14 |  2 | 15.4 |  5.0
    ..   | .... |  ...  | .. |  . | .... |  ...
    78   | 2001 |  211  | 14 | 24 | 15.8 |  2.7

Some parameters are constant for any file, e.g. station id. Others are
constant for a varying percentage of the files (hr~58%, d_o_y~98%,
yr~99.995% due to the oddly chosen 25 min. pr. file)

It seems like a table with file summaries (1 row for each file) would be
very useful. Some of the columns/rows could be:

    file     | st_id |  yr  | d_o_y | avg_tmp | std_dev_tmp
-------------|-------|------|-------|---------|--------------
78_f6548.dat |   78  | 2001 |  211  |   15.5  |    0.24
57_f4296.dat |   57  | 2000 |   -1  |   8.3   |    0.11
    ...      |  ...  |  ... |  ...  |   ...   |    ...
(-1 for day of year indicates an odd file spanning across midnight)

To store the detailed records the SQL novice would construct one table
pr. file and exclude any constant columns since these are redundant
(given in summary table). These detailed tables would then have
different column schemas - (mn,tmp,wind) for ~58%, (hr,mn,tmp,wind) for
~40%, (d_o_y,hr,mn,tmp,wind) for ~2% and (yr,d_o_y,hr,mn,tmp,wind) for
~0.005%.

SQL will discourage this novice. To enable simple queries data tables
must be combined. Say I want to find all summer data with high
temperature standard deviation and low wind. If I made the ultimate
table for easy queries, containing all columns for both detailed data
and file summary, I could find the data like this:

SELECT file, std_dev_tmp, tmp, wind FROM huge_table
    WHERE std_dev_tmp > 0.2 AND d_o_y > 150 AND d_o_y < 240 AND wind < 4.0;

The ability to make this easy query comes with a ridiculous amount of
redundant data in huge_table. In order to remove all the stuff that is
already given by the summary table I must split the huge table into a
number of tables; one for each of above mentioned column schemas (in my
case many more than in the example above). Difficult to build and no
easy queries anyway - I'd need to write a C program to launch the many
queries that would in effect return the same data as the single query
above. Defeats the purpose of easy, efficient access.

This novice must be missing a sneaky way to avoid massive redundancy and
still maintain easy access. I've been suggested to look at inheritance
and foreign keys. Foreign keys I don't see how to use, but I could make
an inheritance hierarchy of the split up tables to make the parent table
columns (mn,tmp,wind in above example) available across tables. But
accessing the rest of the columns still require child-table specific
queries, and my many child tables will not be nicely nested as in above
example, so I still don't see how it can get me there. I could really
use some input.

One thought: PostgreSQL adds to every table a system column tableoid
containing a constant value. Is that value really stored 1,000 times for
a 1,000-row table? Or...?

(Apologies for the length of this post)

Re: SQL - planet redundant data

From: Bruno Wolff III
On Thu, Sep 08, 2005 at 19:58:55 -0800,
  Poul Jensen <flyvholm@gfy.ku.dk> wrote:
>
> This novice must be missing a sneaky way to avoid massive redundancy and
> still maintain easy access. I've been suggested to look at inheritance
> and foreign keys. Foreign keys I don't see how to use, but I could make
> an inheritance hierarchy of the split up tables to make the parent table
> columns (mn,tmp,wind in above example) available across tables. But
> accessing the rest of the columns still require child-table specific
> queries, and my many child tables will not be nicely nested as in above
> example, so I still don't see how it can get me there. I could really
> use some input.

You can also use views. Updatable views can be created using the rule system
if you need them.

Re: SQL - planet redundant data

From: Tom Lane
Poul Jensen <flyvholm@gfy.ku.dk> writes:
> One thought: PostgreSQL adds to every table a system column tableoid
> containing a constant value. Is that value really stored 1,000 times for
> a 1,000-row table? Or...?

No, tableoid is sort of a virtual column ... it doesn't exist on disk.
When you query it you get a value fetched from the internal data
structure representing the table.

There is plenty of *real* overhead per row, mind you, but tableoid isn't
part of it.  You can read some details here:
http://developer.postgresql.org/docs/postgres/storage-page-layout.html

            regards, tom lane

Re: SQL - planet redundant data

From: "John D. Burger"
> To store the detailed records the SQL novice would construct one table
> pr. file and exclude any constant columns since these are redundant
> (given in summary table). These detailed tables would then have
> different column schemas - (mn,tmp,wind) for ~58%, (hr,mn,tmp,wind)
> for ~40%, (d_o_y,hr,mn,tmp,wind) for ~2% and (yr,d_o_y,hr,mn,tmp,wind)
> for ~0.005%.

I don't understand this at all - why would you have one table per file?
  Why not effectively append all of your per-file tables into one huge
table, with an additional column indicating which file the data comes
from?  Then you have only two tables, one with a row for each
observation in your data, one with a row for each file.  Some queries
need to join the two tables, but that's not a big deal.

There may be a lot of redundancy in your data, but that's life.  If
it's not deterministic redundancy (e.g., file ID => station ID), then I
would just live with it.

It also seems to me that you may be tying your schema design too
closely to the current way that the data is represented.  Do you really
need to have the data file figure so prominently in your design?  If
you do need to keep track of which file an observation came from, I
might have one huge table of observations, each with an observation ID,
and have another table that maps filenames to ranges of observation IDs
(filename, startObs, endObs).  A range query could then quickly find
which file is responsible for an observation ID.
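
For instance (a sketch only - SQLite standing in for Postgresql, with
invented filenames and ID ranges):

```python
import sqlite3

# Map each filename to the range of observation IDs it contains.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE files "
            "(filename TEXT, startObs INTEGER, endObs INTEGER)")
cur.executemany("INSERT INTO files VALUES (?, ?, ?)",
                [("78_f6548.dat", 1, 1000), ("57_f4296.dat", 1001, 2000)])

# A range query finds which file a given observation ID came from.
fname, = cur.execute(
    "SELECT filename FROM files WHERE ? BETWEEN startObs AND endObs",
    (1500,)).fetchone()
print(fname)  # the file whose range covers observation 1500
```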

Hmm, in fact if the redundant values you're worried about come in long
stretches (e.g., temperature is the same for many observations in a
row), I suppose you could do the same thing - map a constant value to
the range of observation IDs for which it holds.  This gets back to
having many tables, though.

- John Burger
   MITRE



Re: SQL - planet redundant data

From: "Jim C. Nasby"
On Thu, Sep 08, 2005 at 07:58:55PM -0800, Poul Jensen wrote:
> This novice must be missing a sneaky way to avoid massive redundancy and
> still maintain easy access. I've been suggested to look at inheritance
> and foreign keys. Foreign keys I don't see how to use, but I could make

What you seem to be looking for is a form of partitioning. PostgreSQL
doesn't currently support partitioning of this form, but there's work in
progress to change that.

As someone else mentioned, you could do it with a union all view.
http://cvs.distributed.net/viewcvs.cgi/stats-sql/logdb/ has an example
of this.
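
A self-contained toy of the union all view approach (SQLite standing in
for PostgreSQL; the table and column names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Two per-station tables, as in the one-table-per-file design; the
# constant column (station id) has been dropped from each.
cur.execute("CREATE TABLE obs_78 (mn INTEGER, tmp REAL)")
cur.execute("CREATE TABLE obs_57 (mn INTEGER, tmp REAL)")
cur.executemany("INSERT INTO obs_78 VALUES (?, ?)", [(0, 15.3), (1, 15.4)])
cur.executemany("INSERT INTO obs_57 VALUES (?, ?)", [(0, 8.1)])

# The view adds the constant column back and glues the tables together,
# so queries can treat the whole thing as one relation.
cur.execute("""
    CREATE VIEW all_obs AS
        SELECT 78 AS stat_id, mn, tmp FROM obs_78
        UNION ALL
        SELECT 57 AS stat_id, mn, tmp FROM obs_57
""")
rows = cur.execute(
    "SELECT stat_id, mn, tmp FROM all_obs WHERE tmp > 10 ORDER BY mn"
).fetchall()
print(rows)
```

The point of the view is that the per-table bookkeeping lives in one
place (the view definition) instead of in every query.
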
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: SQL - planet redundant data

From: Poul Jensen
Thank you for your input! Individual replies follow below.

##############################

Chris Travers wrote:

> Ok.  Imagine two huge huge tables:
>
> file_data    and additional_data
>
> create table file_data (
>    file_id serial primary key,
>    station_id text,
>     ....
> );
>
> create table additional_data (
>    data_id bigserial primary key,
>    file_id int REFERENCES file_data (file_id),
>    temp,
>    ....
> );

So the 500 million entries in additional_data.file_id are now restricted
to belong to the same group of values found in file_data.file_id. But I
don't see how this can get me rid of redundant data - I still have 500M
entries? Consider the column "year". My 500M records come from 13 years.
Saving 500M values when only 13 are distinct is very redundant.

>
> Also you can normalize your way out of the redundant data problem.
>
I can see how to do this for filename and station name - I save
something by saving an integer rather than a text string. But all the
rest of my columns are already smallints. Do I save anything by storing
"5" rather than "1997"?

>
> I would *highly* recommend waiting until 8.1 goes live to impliment
> this in production.
>
It will be built/rebuilt after 8.1 is released.

##############################

John D. Burger wrote:

> Why not effectively append all of your per-file tables into one huge
> table, with an additional column indicating which file the data comes
> from?  Then you have only two tables, one with a row for each
> observation in your data, one with a row for each file.  Some queries
> need to join the two tables, but that's not a big deal.

That big table with a row for each observation will have to include all
columns, such as year and station_id. For such columns I store 500M
values of which only ~15 are distinct in my case - the redundancy I'm
trying to minimize. By splitting into smaller tables, many columns will
contain only one distinct value. It *should* be possible to delete such
columns and instead store their corresponding values in a "table
header". If I had one table pr. file, the table you suggest with one row
pr. file could effectively serve as the collection of table headers.

>
> It also seems to me that you may be tying your schema design too
> closely to the current way that the data is represented.  Do you
> really need to have the data file figure so prominently in your design?

The archive of data files is shared by a number of research groups
around the world. There's a whole software package that people use for
data analysis, accessing the data files as they are. So I expect a lot
of file-specific queries.

> Hmm, in fact if the redundant values you're worried about come in long
> stretches (e.g., temperature is the same for many observations in a
> row), I suppose you could do the same thing - map a constant value to
> the range of observation IDs for which it holds.  This gets back to
> having many tables, though.
>
This is in effect normalization? But if the observation ID takes just as
much storage as the original value, have we gained anything? BTW, I'm
not aiming at redundancy in the measurements - this is minimal compared
to e.g. year and station ID. I do have an idea of how it *should* be
possible to get rid of much of it - read the response to Tom Lane below.

##############################

Tom Lane wrote:

>No, tableoid is sort of a virtual column ... it doesn't exist on disk.
>When you query it you get a value fetched from the internal data
>structure representing the table.
>
>
So virtual columns are possible - THIS is a way to clear redundant data!
Is it possible for a user to create a virtual column? If not, this would
make a big improvement.

What I really need are "partial virtual columns". I'm imagining an
alternative version of VACUUM ANALYZE that could do the following:
1) Order the rows in the table so that for each column, identical values
are placed next to each other for as far as possible (the row order that
optimizes one column will probably not be optimal for other columns).
2) For each column, identify the stretches that contain only one
distinct value. Save that value together with ID of start and end row
and delete stretch.
It is not obvious how to do a perfect optimization process in 1), at
least not to me - I'm sure a skilled mathematician would know exactly
how to do it. But here's a simple approach that would get us part of the
way:
1.1) Grab the column w. most redundancy (fewest distinct values) and
sort it into groups according to the distinct values.
1.2) For each of these groups, grab the column w. next most redundancy
and sort into groups according to the distinct values.
And so on. Stop whenever groups become so small that there's nothing to
gain.
Such an analysis would make it much less expensive to combine
same-schema tables, and having everything in the same table is really
convenient. It would obviously save a lot of storage space, but I
imagine it would enable more efficient queries too - having to check
just 3 values instead of the thousands (or even millions) they may
replace must give a considerable gain.
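
A toy sketch of steps 1.1)-1.2) (illustrative Python; the data values
are invented):

```python
from itertools import groupby

# Greedy ordering: sort rows by the columns with the fewest distinct
# values first, then count, per column, how many runs of identical
# values remain (each run could be stored once, with start/end row IDs,
# instead of once per row).
def optimize_order(rows, columns):
    # column indices ordered by ascending number of distinct values
    order = sorted(range(len(columns)),
                   key=lambda i: len({r[i] for r in rows}))
    return sorted(rows, key=lambda r: tuple(r[i] for i in order))

def run_counts(rows, columns):
    # number of maximal stretches of identical values in each column
    return {col: sum(1 for _ in groupby(r[i] for r in rows))
            for i, col in enumerate(columns)}

columns = ("yr", "stat_id", "mn")
rows = [(2001, 78, 0), (2000, 57, 0), (2001, 78, 1), (2000, 57, 1)]
ordered = optimize_order(rows, columns)
print(run_counts(rows, columns))     # {'yr': 4, 'stat_id': 4, 'mn': 2}
print(run_counts(ordered, columns))  # {'yr': 2, 'stat_id': 2, 'mn': 4}
```

Note how yr and stat_id collapse to one run per value while mn gets
worse - exactly the conflict between columns mentioned above.
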

What is the big benefit of not having ordered rows? I imagine it could
be a disadvantage for dynamic databases, but for a static database like
mine which won't be modified, except for maybe adding new data once a
year, I imagine an optimization including row ordering could be highly
beneficial.

##############################

Jim C. Nasby wrote:

>What you seem to be looking for is a form of partitioning. PostgreSQL
>doesn't currently support partitioning of this form, but there's work in
>progress to change that.
>
>
Any idea how far out in the future this is? Would it make the
optimization process described above (reply to Tom Lane) obsolete? Well,
maybe my ideas about an optimal solution just illustrate lack of
knowledge about SQL, but I'm hoping somebody can see what I'm trying to
suggest.

>As someone else mentioned, you could do it with a union all view.
>http://cvs.distributed.net/viewcvs.cgi/stats-sql/logdb/ has an example
>of this.
>
>
Thank you - it does look as if some union all views could come in handy.

##############################

Thanks and best regards,

Poul Jensen

Re: SQL - planet redundant data

From: "John D. Burger"
>> Hmm, in fact if the redundant values you're worried about come in
>> long stretches (e.g., temperature is the same for many observations
>> in a row), I suppose you could do the same thing - map a constant
>> value to the range of observation IDs for which it holds.  This gets
>> back to having many tables, though.
>
> This is in effect normalization? But if the observation ID takes just
> as much storage as the original value, have we gained anything? BTW,
> I'm not aiming at redundancy in the measurements - this is minimal
> compared to e.g. year and station ID.

I don't think this is exactly what people meant by normalization, but
maybe.  My basic thought was that you since you have some redundancy in
your data, you might want to use some form of compression.  One of the
simplest forms of compression is called run-length encoding
(http://en.wikipedia.org/wiki/Run_length_encoding).  So you'd have most
of your data in a main table:

   create table observations (
     obsID        integer primary key,    -- Maybe a BIGINT
     temperature  float,
     etc.
   );

and some other "compressed" tables for those features that have long
runs of repetitive values:

   create table obsYears (
     startObs   integer primary key references observations (obsID),
     endObs     integer             references observations (obsID),
     year       integer);

   create table obsStations (
     startObs   integer primary key references observations (obsID),
     endObs     integer             references observations (obsID),
     stationID  integer);

(Caution, I haven't checked these for syntax.)  I've introduced an
observation ID, and then I have "compressed" tables that map =ranges=
of these IDs to values that are constant for long stretches.  Each year
occupies only one row, same with each station.  (I think your reply to
Tom may have been getting at something like this.)  Now you can do
queries like this, say, for temperature statistics in a particular
year:

   select avg(temperature), stddev(temperature) from observations,
obsYears
     where obsID between startObs and endObs
     and year = 2001;

You could join in other compressed tables in the same way.  In fact,
you could glue them all together with a VIEW, and you'd be able to
treat the whole thing like one giant table, with much of the redundancy
removed.  Note that if you define indexes on the startObs and endObs
columns, Postgresql might avoid scanning through the compressed tables
every time you do a query.  You might also benefit from a composite
index on (startObs, endObs).  For features like year, which are
massively repetitive, this might even be faster than storing the
feature in the main table, since the compressed table will easily fit
in memory.
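
Here is the same idea as a tiny runnable sketch (SQLite standing in for
Postgresql; the observation values are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE observations "
            "(obsID INTEGER PRIMARY KEY, temperature REAL)")
cur.execute("CREATE TABLE obsYears "
            "(startObs INTEGER, endObs INTEGER, year INTEGER)")
cur.executemany("INSERT INTO observations VALUES (?, ?)",
                [(1, 15.3), (2, 15.4), (3, 8.1), (4, 8.3)])
# Run-length encoding: one row per run of years.
# Observations 1-2 are from 2001, 3-4 from 2002.
cur.executemany("INSERT INTO obsYears VALUES (?, ?, ?)",
                [(1, 2, 2001), (3, 4, 2002)])

# Join the main table to the compressed table via the range condition.
avg, = cur.execute("""
    SELECT avg(temperature) FROM observations, obsYears
    WHERE obsID BETWEEN startObs AND endObs
      AND year = 2001
""").fetchone()
print(avg)  # approximately 15.35, the mean of the two 2001 observations
```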

So the basic idea is run-length encoding for repetitive values.  I
think I can use this in some of my own data - I don't know why I never
thought of it before.

- John D. Burger
   MITRE



Re: SQL - planet redundant data

From: Poul Jensen
Thank you, John!
I misunderstood you the first time, but I now see we have the same thing
in mind.

> So you'd have most of your data in a main table:
>
>   create table observations (
>     obsID        integer primary key,    -- Maybe a BIGINT
>     temperature  float,
>     etc.
>   );
>
> and some other "compressed" tables for those features that have long
> runs of repetitive values:
>
>   create table obsYears (
>     startObs   integer primary key references observations (obsID),
>     endObs     integer             references observations (obsID),
>     year       integer);
>
>   create table obsStations (
>     startObs   integer primary key references observations (obsID),
>     endObs     integer             references observations (obsID),
>     stationID  integer);
>
> (Caution, I haven't checked these for syntax.)  I've introduced an
> observation ID, and then I have "compressed" tables that map =ranges=
> of these IDs to values that are constant for long stretches.  Each
> year occupies only one row, same with each station.

One complication: Applying the observation ID you're in effect ordering
the rows. If you order them chronologically, "year" is perfectly lined
up, giving you one row pr. value in your compressed table, but e.g.
"month" will be split up in n_years*12 stretches of obsIDs, and
"station_id" may not have any continuous stretches of obsIDs at all. I
don't see any solution to this, but better compression can be achieved
by ordering rows optimally when applying the obsID. The reply to Tom
Lane in my previous post suggested one way to do this - it may not
always be optimal, but at least it's simple.

> Now you can do queries like this, say, for temperature statistics in a
> particular year:
>
>   select avg(temperature), stddev(temperature) from observations,
> obsYears
>     where obsID between startObs and endObs
>     and year = 2001;

This works! I had not yet realized how to make this connection between
two tables, so that was a major help - thank you.

>
> You could join in other compressed tables in the same way.  In fact,
> you could glue them all together with a VIEW, and you'd be able to
> treat the whole thing like one giant table, with much of the
> redundancy removed.

That is exactly what I want, and now I finally see how to do it (I
think!). However, it is a considerable amount of work to set this up
manually, plus, it has been a headache realizing how to get there at
all. I'm hoping that one or more of the developers think it would be a
good idea for PostgreSQL to perform an internal table optimization
process using run-length encoding. Imagine you could just throw all your
data into one table, run OPTIMIZE TABLE and you'd be done. With SQL
being all about tables I'm surprised this idea (or something even
better) hasn't been implemented already.

Poul Jensen

Re: SQL - planet redundant data

From: Brent Wood
>
> That is exactly what I want, and now I finally see how to do it (I
> think!). However, it is a considerable amount of work to set this up
> manually, plus, it has been a headache realizing how to get there at
> all. I'm hoping that one or more of the developers think it would be a
> good idea for PostgreSQL to perform an internal table optimization
> process using run-length encoding. Imagine you could just throw all your
> data into one table, run OPTIMIZE TABLE and you'd be done. With SQL
> being all about tables I'm surprised this idea (or something even
> better) hasn't been implemented already.

There was a recent brief thread here on storing timeseries data, where the
use of clustered indices for static tables was suggested. This might also
be useful in your situation...

Cheers,

 Brent Wood

Re: SQL - planet redundant data

From: "Jim C. Nasby"
On Sun, Sep 11, 2005 at 11:00:02PM -0800, Poul Jensen wrote:
> Tom Lane wrote:
>
> >No, tableoid is sort of a virtual column ... it doesn't exist on disk.
> >When you query it you get a value fetched from the internal data
> >structure representing the table.
> >
> >
> So virtual columns are possible - THIS is a way to clear redundant data!
> Is it possible for a user to create a virtual column? If not, this would
> make a big improvement.
>
> What I really need are "partial virtual columns". I'm imagining an
> alternative version of VACUUM ANALYZE that could do the following:
> 1) Order the rows in the table so that for each column, identical values
> are placed next to each other for as far as possible (the row order that
> optimizes one column will probably not be optimal for other columns).
> 2) For each column, identify the stretches that contain only one
> distinct value. Save that value together with ID of start and end row
> and delete stretch.
> It is not obvious how to do a perfect optimization process in 1), at
> least not to me - I'm sure a skilled mathematician would know exactly
> how to do it. But here's a simple approach that would get us part of the
> way:
> 1.1) Grab the column w. most redundancy (fewest distinct values) and
> sort it into groups according to the distinct values.
> 1.2) For each of these groups, grab the column w. next most redundancy
> and sort into groups according to the distinct values.
> And so on. Stop whenever groups become so small that there's nothing to
> gain.
> Such an analysis would make it much less expensive to combine
> same-schema tables, and having everything in the same table is really
> convenient. It would obviously save a lot of storage space, but I
> imagine it would enable more efficient queries too - having to check
> just 3 values instead of the thousands (or even millions) they may
> replace must give a considerable gain.
>
> What is the big benefit of not having ordered rows? I imagine it could
> be a disadvantage for dynamic databases, but for a static database like
> mine which won't be modified, except for maybe adding new data once a
> year, I imagine an optimization including row ordering could be highly
> beneficial.

Oracle supports something akin to this. On an index-organized table you
can tell it to pull part of the index key out of individual rows. For example:

CREATE TABLE (a, b, c, d, e, f)
    PRIMARY KEY(a, b, c, d)
    INDEX ORGANIZED COMBINE(a, b)
;

In this case, every time the combination of (a, b) changes, Oracle
stores a special record of some kind that indicates the change, and it
doesn't store a or b with each row. (Note that I'm guessing on syntax,
it's been a while since I've used Oracle).

> ##############################
>
> Jim C. Nasby wrote:
>
> >What you seem to be looking for is a form of partitioning. PostgreSQL
> >doesn't currently support partitioning of this form, but there's work in
> >progress to change that.
> >
> >
> Any idea how far out in the future this is? Would it make the
> optimization process described above (reply to Tom Lane) obsolete? Well,
> maybe my ideas about an optimal solution just illustrate lack of
> knowledge about SQL, but I'm hoping somebody can see what I'm trying to
> suggest.

Well, the first important step is slated for 8.1: effective partition
elimination. You can search for more info, especially in the BizGres
archives. The next step is creating actual partitioning syntax to make
managing partitions easier.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

ECPG - how to fetch then sort strings

From
Poul Jensen
Date:
I need to fetch strings from a database with ECPG and then sort them in
C. Here is one of my failed attempts:

###########################

int main(int argc, char *argv[]) {

  int maxlen=20;
  long nrec;

  EXEC SQL BEGIN DECLARE SECTION;
  varchar filenms[][maxlen]=NULL;
  char dbnm[50];
  EXEC SQL END DECLARE SECTION;

  sprintf(dbnm,"%s",argv[1]);

  EXEC SQL CONNECT TO :dbnm;

  EXEC SQL SELECT filenm INTO :filenms FROM beamdata;
  nrec = sqlca.sqlerrd[2];   /* Returns number of rows retrieved */

  EXEC SQL COMMIT;
  EXEC SQL DISCONNECT;

  qsort(filenms, nrec, maxlen*sizeof(char), scmp);

  free(filenms);

  return 0;

}

static int scmp( const void *sp1, const void *sp2 )
{
    return( strcmp(*(char **)sp1, *(char **)sp2) );
}


###########################

It compiles ok, but I get garbage in variable filenms. If I change the
declaration of filenms to:
  char **filenms=NULL;
the SQL query returns strings ok. But the strings have variable length,
and I need to specify one length in qsort, so it won't work. Another
attempt to try to specify string length:
  char (*filenms)[maxlen]=NULL;
Won't compile; ECPG doesn't accept this syntax. Also, strcmp crashes
(segmentation fault) in function scmp regardless of what I try. Not an SQL
error, I know, but if anybody can tell me why, I'd be grateful.

Any suggestions?

Thanks,
Poul Jensen

Re: ECPG - how to fetch then sort strings

From
Martijn van Oosterhout
Date:
Please don't use "reply" to start a new thread, thanks.

On Fri, Sep 08, 2006 at 05:55:44AM -0800, Poul Jensen wrote:
> I need to fetch strings from a database with ECPG and then sort them in
> C. Here is one of my failed attempts:

<snip>
>  varchar filenms[][maxlen]=NULL;

I think you need to reread the C manual on the difference between
arrays and pointers. I don't know about ECPG, but I think your other
attempt, using "char **", was much closer to the mark.

> It compiles ok, but I get garbage in variable filenms. If I change the
> declaration of filenms to:
>  char **filenms=NULL;
> the SQL query returns strings ok. But the strings have variable length,
> and I need to specify one length in qsort, so it won't work. Another

Since you're declaring an array of pointers to char, the width you have
to specify to qsort would be sizeof(char *). I think you can then use
the normal strcmp() function with qsort.

Of course, you could get the database to sort them for you...

Hope this helps,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


Re: ECPG - how to fetch then sort strings

From
Poul Jensen
Date:
Martijn van Oosterhout wrote:
> Please don't use "reply" to start new thread, thanks.
>
> On Fri, Sep 08, 2006 at 05:55:44AM -0800, Poul Jensen wrote:
>
>> I need to fetch strings from a database with ECPG and then sort them in
>> C. Here is one of my failed attempts:
>>
>
> <snip>
>
>>  varchar filenms[][maxlen]=NULL;
>>
>
> I think you need to reread the C manual as to the difference between
> arrays and pointers. I'm don't know about ECPG, but I think your other
> attempt, using "char**" was much closer to the mark.
>
>
>> It compiles ok, but I get garbage in variable filenms. If I change the
>> declaration of filenms to:
>>  char **filenms=NULL;
>> the SQL query returns strings ok. But the strings have variable length,
>> and I need to specify one length in qsort, so it won't work. Another
>>
>
> Since you're declaring an array of pointers to char, the width you have
> to specify to qsort would be sizeof(char *). I think you can then use
> the normal strcmp() function with qsort.
>
> Of course, you could get the database to sort them for you...
>
> Hope this helps,
>
I'm afraid it didn't for various reasons, but I appreciate you trying.
What I ended up doing was simply declaring an extra array outside the
SQL declare section:

char (*tmp)[maxlen];

Then allocate the required memory once known:

  if (( tmp = malloc(maxlen*nrec*sizeof(char)) ) == NULL) {
    fprintf(stderr,"Memory allocation failure\n");
    exit(-1);
  }

and then just copy the strings into it one by one. As for strcmp, it
*can* be used directly in qsort, but not without some manipulation:

 qsort(tmp, nrec, maxlen*sizeof(char), (int(*)(const void*, const void*))&strcmp);

This resolved my issues. For now. ;-)