Thread: spectral datasets in postgresql

spectral datasets in postgresql

From
Glenn Sullivan
Date:
Hi,

Is it possible/feasible/desirable to use postgresql to hold large spectral
datasets.  The dataset would consist of a number of attributes and a
spectrum.  The spectrum is simply a list of floating point values.  However,
there can be from 128 points to 32 million points for one spectrum.

There would then of course be thousands of these spectra.

I don't know how you would save those 32 million points. If you can do
so, I don't know if it is a feasible way to use postgresql.


I welcome your comments.

Glenn

Re: spectral datasets in postgresql

From
"Nick Fankhauser"
Date:
Glenn

If I understand correctly, you're asking if it is possible to represent the
one to many relationship between the spectra & the values they consist of.
The answer is yes- that is one of the things relational databases like
PostgreSQL do well, and what you describe is a fairly simple data structure.

-Nick

--------------------------------------------------------------------------
Nick Fankhauser  nickf@ontko.com  Phone 1.765.935.4283  Fax 1.765.962.9788
Ray Ontko & Co.     Software Consulting Services     http://www.ontko.com/



> -----Original Message-----
> From: pgsql-general-owner@postgresql.org
> [mailto:pgsql-general-owner@postgresql.org]On Behalf Of Glenn Sullivan
> Sent: Thursday, November 15, 2001 5:32 PM
> To: pgsql-general@postgresql.org
> Subject: [GENERAL] spectral datasets in postgresql
>
>
> Hi,
>
> Is it possible/feasible/desirable to use postgresql to hold large spectral
> datasets.  The dataset would consist of a number of attributes and a
> spectrum.  The spectrum is simply a list of floating point
> values.  However,
> there can be from 128 points to 32 million points for one spectrum.
>
> There would then of course be thousands of these spectra.
>
> I don't know how you would save those 32 million points. If you can do
> so, I don't know if it is a feasible way to use postgresql.
>
>
> I welcome your comments.
>
> Glenn
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
>


Re: spectral datasets in postgresql

From
Antonio Fiol Bonnín
Date:
Nick,

I am afraid that Glenn is not as worried about that as he is aabout the fact
that there may be up to 32 million entries per spectrum times 1000+ spectra,
which makes a total of about 3200+ million entries, supposedly on the same
relation.

Maybe I'm off-base, but I am thinking of something like:

CREATE TABLE spectra ( id SERIAL, element VARCHAR(30), ...other data about the
spectra... );
CREATE TABLE data ( id int, value float );

Table spectra would hold 1000+ entries (quite small table, after all).
Table data would hold all the "value"s for all spectra, i.e. 3200+ million
entries, and not even id is unique.

Structure seems fine to me for modest purposes, but not for this one. Of
course, you did not talk about any particular structure, Nick, but I can't
think of another one at the moment.

Of course, what here applies por PostgreSQL seems to me that may also apply for
any other RDBMS I can think of.

Nick, you are certainly a more experienced programmer than I am. Could you
please tell me what I am missing? And maybe give your opinion on the following
proposal? Thank you!!

(After re-reading my e-mail, I have come to the following point: It may be
feasible, as the rows are not very large. I am not sure if the influence of
that is very important or not. In any case, supposing 10byte/row, which is
probably a really low estimate, that makes a table with 32 Terabytes of data.
IIRC, PostgreSQL needs all the data on a particular table to fit on a
filesystem. I guess you will need an enormous partition...)

However, ...

Glenn,

Will you need to search by your "values"? A search like "I would like to know
which spectra have a spectrum "value" between X and Y", I mean...

If you will not need that, you may think of BLOBS to store your spectrum data.
That will avoid you 32 million inserts per spectrum ;-) Even if you issue them
in the same transaction, I guess the performance will be a big issue there.

(Again, re-reading and recalculating, that makes 32 million 32 bits (?)
floating point numbers per BLOB, times 1000. That is, at least, about 1
Terabyte, if I did my calculations properly).

A question has just finally come to my mind: Is compression possible, either on
the original data (transformation to avoid redundant "spectral" information) or
on the float values (gzip compression or something like that)? In that case,
you may save quite a lot of disk space.

Good luck!

Antonio Fiol

P.S. Of course, you need to add any overhead data generated by the RDBMS in
both cases.


Nick Fankhauser wrote:

> Glenn
>
> If I understand correctly, you're asking if it is possible to represent the
> one to many relationship between the spectra & the values they consist of.
> The answer is yes- that is one of the things relational databases like
> PostgreSQL do well, and what you describe is a fairly simple data structure.
>
> -Nick
>
> --------------------------------------------------------------------------
> Nick Fankhauser  nickf@ontko.com  Phone 1.765.935.4283  Fax 1.765.962.9788
> Ray Ontko & Co.     Software Consulting Services     http://www.ontko.com/
>
> > -----Original Message-----
> > From: pgsql-general-owner@postgresql.org
> > [mailto:pgsql-general-owner@postgresql.org]On Behalf Of Glenn Sullivan
> > Sent: Thursday, November 15, 2001 5:32 PM
> > To: pgsql-general@postgresql.org
> > Subject: [GENERAL] spectral datasets in postgresql
> >
> >
> > Hi,
> >
> > Is it possible/feasible/desirable to use postgresql to hold large spectral
> > datasets.  The dataset would consist of a number of attributes and a
> > spectrum.  The spectrum is simply a list of floating point
> > values.  However,
> > there can be from 128 points to 32 million points for one spectrum.
> >
> > There would then of course be thousands of these spectra.
> >
> > I don't know how you would save those 32 million points. If you can do
> > so, I don't know if it is a feasible way to use postgresql.
> >
> >
> > I welcome your comments.
> >
> > Glenn
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
> >
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org


Re: spectral datasets in postgresql

From
"Nick Fankhauser"
Date:
Antonio-

You're right- I focused on the wide variation of 128->32 million & thought
that was Glenn's question. While I was peripherally aware of the hugeness of
the data, I chocked that up as an equal issue under just about any storage
system and I was indeed thinking of the simple data structure you proposed.
However, as you point out, 32 Billion is big enough to present its own
unique set of problems.

Now that I look at it from this angle, I note that one of the
characteristics of this data is that it is always retrieved in the same
groups, since the float values for one spectrum can belong to no other.
Although I don't know the details, I'd guess that the values are unlikely to
change- maybe Glenn can weigh in with some more information on that.

This being the case, perhaps a better structure would involve storing the
spectra values in BLOBs, and simply having a single table with the other
attributes and OIDs for the BLOBs in them. This would eliminate the need to
scan billions of rows to retrieve values.

However, once we have a structure of this type, it seems almost a waste to
use a relational database.

Glenn- Do the values for a spectrum need to be sorted? that is, do they come
to you out of order & then need to be retrieved in order? Knowing this would
help determine whether storing them in a BLOB with no further organization
is a reasonable approach. Also, do the values need to be changeable, or is
this basically static data?

Antonio- Thanks for setting pointing me at the real issue- I completely
missed it!

-Nick





> -----Original Message-----
> From: admin@ontko.com [mailto:admin@ontko.com]On Behalf Of Antonio Fiol
> Bonnín
> Sent: Friday, November 16, 2001 3:25 AM
> To: nickf@ontko.com
> Cc: Glenn Sullivan; pgsql-general@postgresql.org
> Subject: Re: [GENERAL] spectral datasets in postgresql
>
>
> Nick,
>
> I am afraid that Glenn is not as worried about that as he is
> aabout the fact
> that there may be up to 32 million entries per spectrum times
> 1000+ spectra,
> which makes a total of about 3200+ million entries, supposedly on the same
> relation.
>
> Maybe I'm off-base, but I am thinking of something like:
>
> CREATE TABLE spectra ( id SERIAL, element VARCHAR(30), ...other
> data about the
> spectra... );
> CREATE TABLE data ( id int, value float );
>
> Table spectra would hold 1000+ entries (quite small table, after all).
> Table data would hold all the "value"s for all spectra, i.e. 3200+ million
> entries, and not even id is unique.
>
> Structure seems fine to me for modest purposes, but not for this one. Of
> course, you did not talk about any particular structure, Nick, but I can't
> think of another one at the moment.
>
> Of course, what here applies por PostgreSQL seems to me that may
> also apply for
> any other RDBMS I can think of.
>
> Nick, you are certainly a more experienced programmer than I am. Could you
> please tell me what I am missing? And maybe give your opinion on
> the following
> proposal? Thank you!!
>
> (After re-reading my e-mail, I have come to the following point: It may be
> feasible, as the rows are not very large. I am not sure if the
> influence of
> that is very important or not. In any case, supposing 10byte/row, which is
> probably a really low estimate, that makes a table with 32
> Terabytes of data.
> IIRC, PostgreSQL needs all the data on a particular table to fit on a
> filesystem. I guess you will need an enormous partition...)
>
> However, ...
>
> Glenn,
>
> Will you need to search by your "values"? A search like "I would
> like to know
> which spectra have a spectrum "value" between X and Y", I mean...
>
> If you will not need that, you may think of BLOBS to store your
> spectrum data.
> That will avoid you 32 million inserts per spectrum ;-) Even if
> you issue them
> in the same transaction, I guess the performance will be a big
> issue there.
>
> (Again, re-reading and recalculating, that makes 32 million 32 bits (?)
> floating point numbers per BLOB, times 1000. That is, at least, about 1
> Terabyte, if I did my calculations properly).
>
> A question has just finally come to my mind: Is compression
> possible, either on
> the original data (transformation to avoid redundant "spectral"
> information) or
> on the float values (gzip compression or something like that)? In
> that case,
> you may save quite a lot of disk space.
>
> Good luck!
>
> Antonio Fiol
>
> P.S. Of course, you need to add any overhead data generated by
> the RDBMS in
> both cases.
>
>
> Nick Fankhauser wrote:
>
> > Glenn
> >
> > If I understand correctly, you're asking if it is possible to
> represent the
> > one to many relationship between the spectra & the values they
> consist of.
> > The answer is yes- that is one of the things relational databases like
> > PostgreSQL do well, and what you describe is a fairly simple
> data structure.
> >
> > -Nick
> >
> >
> --------------------------------------------------------------------------
> > Nick Fankhauser  nickf@ontko.com  Phone 1.765.935.4283  Fax
> 1.765.962.9788
> > Ray Ontko & Co.     Software Consulting Services
http://www.ontko.com/
>
> > -----Original Message-----
> > From: pgsql-general-owner@postgresql.org
> > [mailto:pgsql-general-owner@postgresql.org]On Behalf Of Glenn Sullivan
> > Sent: Thursday, November 15, 2001 5:32 PM
> > To: pgsql-general@postgresql.org
> > Subject: [GENERAL] spectral datasets in postgresql
> >
> >
> > Hi,
> >
> > Is it possible/feasible/desirable to use postgresql to hold large
spectral
> > datasets.  The dataset would consist of a number of attributes and a
> > spectrum.  The spectrum is simply a list of floating point
> > values.  However,
> > there can be from 128 points to 32 million points for one spectrum.
> >
> > There would then of course be thousands of these spectra.
> >
> > I don't know how you would save those 32 million points. If you can do
> > so, I don't know if it is a feasible way to use postgresql.
> >
> >
> > I welcome your comments.
> >
> > Glenn
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
> >
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org