Thread: spectral datasets in postgresql
Hi, Is it possible/feasible/desirable to use postgresql to hold large spectral datasets. The dataset would consist of a number of attributes and a spectrum. The spectrum is simply a list of floating point values. However, there can be from 128 points to 32 million points for one spectrum. There would then of course be thousands of these spectra. I don't know how you would save those 32 million points. If you can do so, I don't know if it is a feasible way to use postgresql. I welcome your comments. Glenn
Glenn If I understand correctly, you're asking if it is possible to represent the one to many relationship between the spectra & the values they consist of. The answer is yes- that is one of the things relational databases like PostgreSQL do well, and what you describe is a fairly simple data structure. -Nick -------------------------------------------------------------------------- Nick Fankhauser nickf@ontko.com Phone 1.765.935.4283 Fax 1.765.962.9788 Ray Ontko & Co. Software Consulting Services http://www.ontko.com/ > -----Original Message----- > From: pgsql-general-owner@postgresql.org > [mailto:pgsql-general-owner@postgresql.org]On Behalf Of Glenn Sullivan > Sent: Thursday, November 15, 2001 5:32 PM > To: pgsql-general@postgresql.org > Subject: [GENERAL] spectral datasets in postgresql > > > Hi, > > Is it possible/feasible/desirable to use postgresql to hold large spectral > datasets. The dataset would consist of a number of attributes and a > spectrum. The spectrum is simply a list of floating point > values. However, > there can be from 128 points to 32 million points for one spectrum. > > There would then of course be thousands of these spectra. > > I don't know how you would save those 32 million points. If you can do > so, I don't know if it is a feasible way to use postgresql. > > > I welcome your comments. > > Glenn > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org >
Nick, I am afraid that Glenn is not as worried about that as he is aabout the fact that there may be up to 32 million entries per spectrum times 1000+ spectra, which makes a total of about 3200+ million entries, supposedly on the same relation. Maybe I'm off-base, but I am thinking of something like: CREATE TABLE spectra ( id SERIAL, element VARCHAR(30), ...other data about the spectra... ); CREATE TABLE data ( id int, value float ); Table spectra would hold 1000+ entries (quite small table, after all). Table data would hold all the "value"s for all spectra, i.e. 3200+ million entries, and not even id is unique. Structure seems fine to me for modest purposes, but not for this one. Of course, you did not talk about any particular structure, Nick, but I can't think of another one at the moment. Of course, what here applies por PostgreSQL seems to me that may also apply for any other RDBMS I can think of. Nick, you are certainly a more experienced programmer than I am. Could you please tell me what I am missing? And maybe give your opinion on the following proposal? Thank you!! (After re-reading my e-mail, I have come to the following point: It may be feasible, as the rows are not very large. I am not sure if the influence of that is very important or not. In any case, supposing 10byte/row, which is probably a really low estimate, that makes a table with 32 Terabytes of data. IIRC, PostgreSQL needs all the data on a particular table to fit on a filesystem. I guess you will need an enormous partition...) However, ... Glenn, Will you need to search by your "values"? A search like "I would like to know which spectra have a spectrum "value" between X and Y", I mean... If you will not need that, you may think of BLOBS to store your spectrum data. That will avoid you 32 million inserts per spectrum ;-) Even if you issue them in the same transaction, I guess the performance will be a big issue there. (Again, re-reading and recalculating, that makes 32 million 32 bits (?) floating point numbers per BLOB, times 1000. That is, at least, about 1 Terabyte, if I did my calculations properly). A question has just finally come to my mind: Is compression possible, either on the original data (transformation to avoid redundant "spectral" information) or on the float values (gzip compression or something like that)? In that case, you may save quite a lot of disk space. Good luck! Antonio Fiol P.S. Of course, you need to add any overhead data generated by the RDBMS in both cases. Nick Fankhauser wrote: > Glenn > > If I understand correctly, you're asking if it is possible to represent the > one to many relationship between the spectra & the values they consist of. > The answer is yes- that is one of the things relational databases like > PostgreSQL do well, and what you describe is a fairly simple data structure. > > -Nick > > -------------------------------------------------------------------------- > Nick Fankhauser nickf@ontko.com Phone 1.765.935.4283 Fax 1.765.962.9788 > Ray Ontko & Co. Software Consulting Services http://www.ontko.com/ > > > -----Original Message----- > > From: pgsql-general-owner@postgresql.org > > [mailto:pgsql-general-owner@postgresql.org]On Behalf Of Glenn Sullivan > > Sent: Thursday, November 15, 2001 5:32 PM > > To: pgsql-general@postgresql.org > > Subject: [GENERAL] spectral datasets in postgresql > > > > > > Hi, > > > > Is it possible/feasible/desirable to use postgresql to hold large spectral > > datasets. The dataset would consist of a number of attributes and a > > spectrum. The spectrum is simply a list of floating point > > values. However, > > there can be from 128 points to 32 million points for one spectrum. > > > > There would then of course be thousands of these spectra. > > > > I don't know how you would save those 32 million points. If you can do > > so, I don't know if it is a feasible way to use postgresql. > > > > > > I welcome your comments. > > > > Glenn > > > > ---------------------------(end of broadcast)--------------------------- > > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org > > > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
Antonio- You're right- I focused on the wide variation of 128->32 million & thought that was Glenn's question. While I was peripherally aware of the hugeness of the data, I chocked that up as an equal issue under just about any storage system and I was indeed thinking of the simple data structure you proposed. However, as you point out, 32 Billion is big enough to present its own unique set of problems. Now that I look at it from this angle, I note that one of the characteristics of this data is that it is always retrieved in the same groups, since the float values for one spectrum can belong to no other. Although I don't know the details, I'd guess that the values are unlikely to change- maybe Glenn can weigh in with some more information on that. This being the case, perhaps a better structure would involve storing the spectra values in BLOBs, and simply having a single table with the other attributes and OIDs for the BLOBs in them. This would eliminate the need to scan billions of rows to retrieve values. However, once we have a structure of this type, it seems almost a waste to use a relational database. Glenn- Do the values for a spectrum need to be sorted? that is, do they come to you out of order & then need to be retrieved in order? Knowing this would help determine whether storing them in a BLOB with no further organization is a reasonable approach. Also, do the values need to be changeable, or is this basically static data? Antonio- Thanks for setting pointing me at the real issue- I completely missed it! -Nick > -----Original Message----- > From: admin@ontko.com [mailto:admin@ontko.com]On Behalf Of Antonio Fiol > Bonnín > Sent: Friday, November 16, 2001 3:25 AM > To: nickf@ontko.com > Cc: Glenn Sullivan; pgsql-general@postgresql.org > Subject: Re: [GENERAL] spectral datasets in postgresql > > > Nick, > > I am afraid that Glenn is not as worried about that as he is > aabout the fact > that there may be up to 32 million entries per spectrum times > 1000+ spectra, > which makes a total of about 3200+ million entries, supposedly on the same > relation. > > Maybe I'm off-base, but I am thinking of something like: > > CREATE TABLE spectra ( id SERIAL, element VARCHAR(30), ...other > data about the > spectra... ); > CREATE TABLE data ( id int, value float ); > > Table spectra would hold 1000+ entries (quite small table, after all). > Table data would hold all the "value"s for all spectra, i.e. 3200+ million > entries, and not even id is unique. > > Structure seems fine to me for modest purposes, but not for this one. Of > course, you did not talk about any particular structure, Nick, but I can't > think of another one at the moment. > > Of course, what here applies por PostgreSQL seems to me that may > also apply for > any other RDBMS I can think of. > > Nick, you are certainly a more experienced programmer than I am. Could you > please tell me what I am missing? And maybe give your opinion on > the following > proposal? Thank you!! > > (After re-reading my e-mail, I have come to the following point: It may be > feasible, as the rows are not very large. I am not sure if the > influence of > that is very important or not. In any case, supposing 10byte/row, which is > probably a really low estimate, that makes a table with 32 > Terabytes of data. > IIRC, PostgreSQL needs all the data on a particular table to fit on a > filesystem. I guess you will need an enormous partition...) > > However, ... > > Glenn, > > Will you need to search by your "values"? A search like "I would > like to know > which spectra have a spectrum "value" between X and Y", I mean... > > If you will not need that, you may think of BLOBS to store your > spectrum data. > That will avoid you 32 million inserts per spectrum ;-) Even if > you issue them > in the same transaction, I guess the performance will be a big > issue there. > > (Again, re-reading and recalculating, that makes 32 million 32 bits (?) > floating point numbers per BLOB, times 1000. That is, at least, about 1 > Terabyte, if I did my calculations properly). > > A question has just finally come to my mind: Is compression > possible, either on > the original data (transformation to avoid redundant "spectral" > information) or > on the float values (gzip compression or something like that)? In > that case, > you may save quite a lot of disk space. > > Good luck! > > Antonio Fiol > > P.S. Of course, you need to add any overhead data generated by > the RDBMS in > both cases. > > > Nick Fankhauser wrote: > > > Glenn > > > > If I understand correctly, you're asking if it is possible to > represent the > > one to many relationship between the spectra & the values they > consist of. > > The answer is yes- that is one of the things relational databases like > > PostgreSQL do well, and what you describe is a fairly simple > data structure. > > > > -Nick > > > > > -------------------------------------------------------------------------- > > Nick Fankhauser nickf@ontko.com Phone 1.765.935.4283 Fax > 1.765.962.9788 > > Ray Ontko & Co. Software Consulting Services http://www.ontko.com/ > > > -----Original Message----- > > From: pgsql-general-owner@postgresql.org > > [mailto:pgsql-general-owner@postgresql.org]On Behalf Of Glenn Sullivan > > Sent: Thursday, November 15, 2001 5:32 PM > > To: pgsql-general@postgresql.org > > Subject: [GENERAL] spectral datasets in postgresql > > > > > > Hi, > > > > Is it possible/feasible/desirable to use postgresql to hold large spectral > > datasets. The dataset would consist of a number of attributes and a > > spectrum. The spectrum is simply a list of floating point > > values. However, > > there can be from 128 points to 32 million points for one spectrum. > > > > There would then of course be thousands of these spectra. > > > > I don't know how you would save those 32 million points. If you can do > > so, I don't know if it is a feasible way to use postgresql. > > > > > > I welcome your comments. > > > > Glenn > > > > ---------------------------(end of broadcast)--------------------------- > > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org > > > > ---------------------------(end of broadcast)--------------------------- > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org