Thread: arrays of floating point numbers / linear algebra operations into the DB
Hello,
I'd like to perform linear algebra operations on float4/8 arrays. These tasks are typically carried out using ad hoc optimized libraries (e.g. BLAS). In order to do this, I studied a bit how arrays are stored internally by the DB: from what I understood, arrays are basically a vector of Datum, and floating point numbers are stored by reference inside Datums. At first glance, this seems to settle the matter, because in order to perform fast linear algebra operations you need to store array items in consecutive memory cells.
What are the alternatives? Create a new specialized data type for floating point vectors?
Basically, the use case is to be able to rescale, add and multiply (element by element) vectors.
Thanks for your help,
e.
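For concreteness, over a contiguous C array the three operations amount to nothing more than this (a plain-C sketch; the function names are purely illustrative):

/* y[i] = a * x[i]  -- rescale */
static void vec_scale(int n, double a, const double *x, double *y)
{
    int i;
    for (i = 0; i < n; i++)
        y[i] = a * x[i];
}

/* z[i] = x[i] + y[i]  -- element-by-element add */
static void vec_add(int n, const double *x, const double *y, double *z)
{
    int i;
    for (i = 0; i < n; i++)
        z[i] = x[i] + y[i];
}

/* z[i] = x[i] * y[i]  -- element-by-element multiply */
static void vec_mul(int n, const double *x, const double *y, double *z)
{
    int i;
    for (i = 0; i < n; i++)
        z[i] = x[i] * y[i];
}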
Re: arrays of floating point numbers / linear algebra operations into the DB
From
Colin Wetherbee
Date:
Enrico Sirola wrote:
> Hello,
> I'd like to perform linear algebra operations on float4/8 arrays. These
> tasks are typically carried out using ad hoc optimized libraries (e.g.
> BLAS). In order to do this, I studied a bit how arrays are stored
> internally by the DB: from what I understood, arrays are basically a
> vector of Datum, and floating point numbers are stored by reference
> inside Datums. At first glance, this seems to settle the matter, because
> in order to perform fast linear algebra operations you need to store
> array items in consecutive memory cells.
> What are the alternatives? Create a new specialized data type for
> floating point vectors?
> Basically, the use case is to be able to rescale, add and multiply
> (element by element) vectors.

I'm not sure about the internals of PostgreSQL (e.g. the Datum object you mention), but if you're just scaling vectors, consecutive memory addresses shouldn't be strictly necessary. Add and multiply operations over a linked list (which is how I'm naively assuming array storage is implemented in memory) would be roughly as fast.

How many scaling operations are you planning to execute per second, and how many elements do you scale per operation?

Colin
Re: arrays of floating point numbers / linear algebra operations into the DB
From
Martijn van Oosterhout
Date:
On Fri, Feb 01, 2008 at 11:31:37AM +0100, Enrico Sirola wrote:
> Hello,
> I'd like to perform linear algebra operations on float4/8 arrays.
> These tasks are typically carried out using ad hoc optimized libraries
> (e.g. BLAS). In order to do this, I studied a bit how arrays are
> stored internally by the DB: from what I understood, arrays are
> basically a vector of Datum, and floating point numbers are stored by
> reference inside Datums.

Well, arrays are not vectors of Datum; they are a vector of the objects they contain. When passed to a function, floats, arrays and other by-ref types are passed by reference, but the array object itself does not contain references: it contains the actual objects. That doesn't necessarily make it the same as a C array, though; the alignment considerations may be different. But at first glance it certainly seems like an array would be in the right format for what you're doing.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Those who make peaceful revolution impossible will make violent
> revolution inevitable.
>   -- John F Kennedy
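For illustration, here is roughly what that layout buys you on the C side: the float8 payload of a 1-D, NULL-free array can be walked as an ordinary C array of double. (A minimal sketch; vec_sum is a made-up example function, and it assumes one dimension and no NULL elements.)

#include "postgres.h"
#include "fmgr.h"
#include "utils/array.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(vec_sum);

/* Sum the elements of a 1-D float8[]; assumes no NULL elements,
 * so the data is one contiguous block of double behind the header
 * and no per-element Datum indirection is involved. */
Datum
vec_sum(PG_FUNCTION_ARGS)
{
    ArrayType  *a = PG_GETARG_ARRAYTYPE_P(0);
    int         n = ArrayGetNItems(ARR_NDIM(a), ARR_DIMS(a));
    float8     *x = (float8 *) ARR_DATA_PTR(a);
    float8      s = 0.0;
    int         i;

    for (i = 0; i < n; i++)
        s += x[i];

    PG_RETURN_FLOAT8(s);
}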
Re: arrays of floating point numbers / linear algebra operations into the DB
From
Enrico Sirola
Date:
Hi Colin,

On 01/feb/08, at 15:22, Colin Wetherbee wrote:
> I'm not sure about the internals of PostgreSQL (e.g. the Datum object
> you mention), but if you're just scaling vectors, consecutive memory
> addresses shouldn't be strictly necessary. Add and multiply operations
> over a linked list (which is how I'm naively assuming array storage is
> implemented in memory) would be roughly as fast.

I'm not an expert, but the SSE instruction family should make the difference for this kind of workload, and those instructions work on consecutive memory cells.

> How many scaling operations are you planning to execute per second,
> and how many elements do you scale per operation?

Typically, arrays contain 1000 elements, and an operation is either multiplying an array by a scalar or multiplying it element by element with another array. Rescaling 1000 arrays, multiplying each by another array, and finally summing the 1000 resulting arrays should be fast enough for an interactive application (let's say 0.5 s), assuming no disk access is required. Disk access will obviously degrade performance a bit at the beginning, but the workload is mostly read-only, so after a while the whole table will be cached anyway. The table containing the arrays would be truncated and repopulated every day, and the number of arrays is expected to be more or less 150000 (at least this is what we have now). Nowadays we have a C++ middleware between the calculations and the DB, with aggressive caching of the table contents (and we don't use arrays, just a row per element), but the application could be refactored (and simplified a lot) if we had a smart way to save data into the DB.

Bye,
e.
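To put the workload in code: per array, that is one BLAS scale, one element-by-element product (which has no dedicated BLAS-1 routine, so it stays a plain loop), and one accumulation. A CBLAS sketch of the inner step, with the array length hard-coded purely for illustration:

#include <cblas.h>

#define N 1000

/* One step of the workload described above:
 * acc += (alpha * a) .* b, over 1000-element vectors. */
static void
step(double alpha, double *a, const double *b, double *acc)
{
    double tmp[N];
    int    i;

    cblas_dscal(N, alpha, a, 1);            /* a   = alpha * a */
    for (i = 0; i < N; i++)                 /* tmp = a .* b    */
        tmp[i] = a[i] * b[i];
    cblas_daxpy(N, 1.0, tmp, 1, acc, 1);    /* acc = acc + tmp */
}

Repeating this 1000 times per query is only a few million floating point operations, comfortably within the 0.5 s budget once the data is in memory.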
Re: arrays of floating point numbers / linear algebra operations into the DB
From
"Webb Sprague"
Date:
On Feb 1, 2008 2:31 AM, Enrico Sirola <enrico.sirola@gmail.com> wrote:
> Hello,
> I'd like to perform linear algebra operations on float4/8 arrays.
> These tasks are typically carried out using ad hoc optimized libraries
> (e.g. BLAS).

If there were a coherently designed, simple, and fast LAPACK/MATLAB style library and set of datatypes for matrices and vectors in Postgres, I think that would be a HUGE plus for the project! I would have used it on a project I am working on in mortality forecasting (I would have been able to put all of my mathematics in the database instead of using scipy); it would tie in beautifully with the GIS and imagery efforts; it would ease fancy statistics calculation on database infrastructure; it would provide useful libraries for the datamining/knowledge-discovery types; etc., etc.

If we just had fast matrix arithmetic, eigen-stuff (including singular value decomposition), convolution, random matrix generation, and table <-> matrix functions, that would be amazing and would provide the material for further library development, since a lot of complex algorithms just fall out when you can do advanced linear algebra. We need to be able to convert transparently between matrices/vectors (which I think should be simple N by 1 matrices by default) and arrays, but we would probably want to go for a separate datatype in order to get speed, since scientifically important matrices can be HUGE.

Just my fairly worthless $0.02, as all I would provide would be testing and membership in the peanut gallery, but there you go. Seems like a perfect Summer of Code project for someone better at C-level programming than me.

-W
Webb Sprague wrote:
> On Feb 1, 2008 2:31 AM, Enrico Sirola <enrico.sirola@gmail.com> wrote:
>> I'd like to perform linear algebra operations on float4/8 arrays...
>
> If there were a coherently designed, simple, and fast LAPACK/MATLAB
> style library and set of datatypes for matrices and vectors in
> Postgres, I think that would be a HUGE plus for the project!

I'd also be very excited about this project.

Especially if some GiST or similar index could efficiently search for vectors "close" to other vectors. I assume something like "within an n-dimensional bounding box" would be possible with GiST...

I'd be eager to help, test, debug, etc.; but I'm probably not qualified to take the lead on such a project.
Re: arrays of floating point numbers / linear algebra operations into the DB
From
"Webb Sprague"
Date:
(I had meant also to add that a linear algebra package would help Postgres be the mediator for real-time data from things like temperature sensors, etc., and their relationship to not-so-scientific data, say in a manufacturing environment.)

On Feb 1, 2008 12:19 PM, Ron Mayer <rm_pg@cheapcomplexdevices.com> wrote:
> Webb Sprague wrote:
> > On Feb 1, 2008 2:31 AM, Enrico Sirola <enrico.sirola@gmail.com> wrote:
> >> I'd like to perform linear algebra operations on float4/8 arrays...
> >
> > If there were a coherently designed, simple, and fast LAPACK/MATLAB
> > style library and set of datatypes for matrices and vectors in
> > Postgres, I think that would be a HUGE plus for the project!
>
> I'd also be very excited about this project.
>
> Especially if some GiST or similar index could efficiently search
> for vectors "close" to other vectors.

That would be very interesting, as we could play with a multitude of different distance metrics from analysis! Wow!

> I'd be eager to help, test, debug, etc.; but I'm probably not qualified
> to take the lead on such a project.

I almost think the hardest part would be to spec it out and design the interface to the libraries. Once we had that, the libraries are already there, though figuring out how we are going to handle gigabyte-size elements (e.g. a satellite image) will require some finesse, and perhaps some tiling... Hmm. If I get some more interest on this list (I need just one LAPACK/BLAS hacker...), I will apply for a pgFoundry project and appoint myself head of the peanut gallery...
Enrico Sirola wrote:
> Typically, arrays contain 1000 elements, and an operation is either
> multiplying an array by a scalar or multiplying it element by element
> with another array. Rescaling 1000 arrays, multiplying each by another
> array, and finally summing the 1000 resulting arrays should be fast
> enough for an interactive application (let's say 0.5 s), assuming no
> disk access is required. Disk access will obviously degrade performance
> a bit at the beginning, but the workload is mostly read-only, so after
> a while the whole table will be cached anyway. The table containing the
> arrays would be truncated and repopulated every day, and the number of
> arrays is expected to be more or less 150000 (at least this is what we
> have now). Nowadays we have a C++ middleware between the calculations
> and the DB, with aggressive caching of the table contents (and we don't
> use arrays, just a row per element), but the application could be
> refactored (and simplified a lot) if we had a smart way to save data
> into the DB.

I don't know if the speed will meet your needs, but you might test to see if PL/R will work for you:

http://www.joeconway.com/plr/

You could use pg.spi.exec() from within the R procedure to grab the arrays, do all of your processing inside R (which uses whatever BLAS you've set it up to use), and then return the result out to Postgres.

Joe
Re: arrays of floating point numbers / linear algebra operations into the DB
From
Enrico Sirola
Date:
Hi Joe,

> I don't know if the speed will meet your needs, but you might test
> to see if PL/R will work for you:
>
> http://www.joeconway.com/plr/
>
> You could use pg.spi.exec() from within the R procedure to grab the
> arrays, do all of your processing inside R (which uses whatever BLAS
> you've set it up to use), and then return the result out to Postgres.

Thanks a lot for the hint, I'll give it a try. It will also make it much easier to implement a prototype :-)

Bye,
e.
Webb Sprague wrote:
> On Feb 1, 2008 12:19 PM, Ron Mayer <rm_pg@cheapcomplexdevices.com> wrote:
>> Webb Sprague wrote:
>>> On Feb 1, 2008 2:31 AM, Enrico Sirola <enrico.sirola@gmail.com> wrote:
>>>> ...linear algebra ...
>>> ... matrices and vectors .
>> ...Especially if some GiST or similar index could efficiently search
>> for vectors "close" to other vectors...
>
> Hmm. If I get some more interest on this list (I need just one LAPACK
> / BLAS hacker...), I will apply for a pgFoundry project and appoint
> myself head of the peanut gallery...

I think you should start one. I'd be happy to help. I'm rather proficient in C; somewhat literate about postgres' GiST stuff (I think a couple of my bugfix patches were accepted in PostGIS); and I deal with a big database doing lots of similarity-based searches (a 6'2" guy with light brown hair being similar to a 6'1" guy with dark blond hair) - and am experimenting with modeling some of the data as vectors in postgres.
Re: arrays of floating point numbers / linear algebra operations into the DB
From
"Webb Sprague"
Date:
> >>>> ...linear algebra ...
> >>> ... matrices and vectors .
> >> ...Especially if some GiST or similar index could efficiently search
> >> for vectors "close" to other vectors...
> >
> > Hmm. If I get some more interest on this list (I need just one LAPACK
> > / BLAS hacker...), I will apply for a pgFoundry project and appoint
> > myself head of the peanut gallery...
>
> I think you should start one. I'd be happy to help.

OK. You are on. I think designing an interface is the first step, and I am inclined to use MATLAB syntax plus cool things I wish it had (convolution matrices, recycling, etc.).

> I'm rather proficient in C; somewhat literate about postgres' GiST
> stuff (I think a couple of my bugfix patches were accepted in PostGIS);

Nifty! I am having trouble bending my head around how we can fit 10K by 10K matrices into Datums, but if you have worked with PostGIS, then a lot of those big geographic fields might help.

> and I deal with a big database doing lots of similarity-based searches
> (a 6'2" guy with light brown hair being similar to a 6'1" guy with dark
> blond hair) - and am experimenting with modeling some of the data as
> vectors in postgres.

Well, I bet a good linear algebra library would help. A lot. :)
--- Webb Sprague <webb.sprague@gmail.com> wrote:
>>>>> ...linear algebra ...
>>>> ... matrices and vectors .
>>> ...Especially if some GiST or similar index could efficiently search
>>> for vectors "close" to other vectors...

I see a potential problem here, in terms of how one defines "close" or similitude. I think, though, practical answers can be found in examples of applying quantitative methods in some subdisciplines of biology.

> Hmm. If I get some more interest on this list (I need just one LAPACK
> / BLAS hacker...), I will apply for a pgFoundry project and appoint
> myself head of the peanut gallery...

Someone pointed to the potential utility of PL/R. I would be interested at least in learning about your assessment of the two (PostGIS and PL/R). Alas, I don't have decent data I could use to experiment with either (except possibly for time series analysis, which is a completely different kettle of fish).

> > and I deal with a big database doing lots of similarity-based searches
> > (a 6'2" guy with light brown hair being similar to a 6'1" guy with dark
> > blond hair) - and am experimenting with modeling some of the data as
> > vectors in postgres.
>
> Well, I bet a good linear algebra library would help. A lot. :)

If you're looking at similarity, and some practicality in the use of quantitative procedures, you may want to look into the biogeography and numerical taxonomy literature, and to a lesser extent quantitative plant ecology. All three subdisciplines of biology have decades of experience, and copious literature, on similarity measures; in my experience it is much more practical or pragmatic than the 'pure' biostatistics literature, and infinitely more practical than any theoretical statistical or mathematical literature I have seen (trust me, I have a bookcase full of this "stuff").

A good linear algebra library would be useful, but there are a lot of nonlinear analyses that would be of interest, and there are nonparametric yet quantitative approaches that are of considerable interest in assessing similarity. I don't know of work applying things like discriminant function analysis, cluster analysis, or any of the many ordination analyses to searches in a database, but then I haven't looked at the question since I graduated. I am interested in the question, though, and would be interested in hearing about your experience with it.

If I can manage the time, I hope to start a project where I can store description data for specimens of plants and animals and use analyses including, but not limited to, ordination, clustering, discriminant functions, and canonical correlation to create a structure for comparing them and for identifying new specimens - or, at a minimum, if the specimen is truly something unknown, learn what known specimens or groups thereof it is most similar to, and how it is different. I have managed to install PL/R, but I haven't had the time to figure out how best to analyze data stored in the database using it. The data I do have changes daily, and some of the tables are well over 100MB, so I am a bit worried about how well it can handle such an amount of data, and how long it would take.

Cheers,
Ted
Re: arrays of floating point numbers / linear algebra operations into the DB
From
"Webb Sprague"
Date:
On Feb 1, 2008 2:31 AM, Enrico Sirola <enrico.sirola@gmail.com> wrote:
> Hello,
> I'd like to perform linear algebra operations on float4/8 arrays

Having avoided a bunch of real work by wondering about linear algebra and PG, did you consider the GNU Scientific Library (GSL)? We would still need to hook everything together, but it seems to do a lot of this, is written in C, etc.
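For a taste of the API: GSL's vector type stores its elements contiguously, and the basic operations are one call each (a minimal sketch, independent of Postgres):

#include <gsl/gsl_vector.h>

/* Build a small vector and rescale it in place with GSL. */
int
main(void)
{
    gsl_vector *v = gsl_vector_alloc(3);

    gsl_vector_set(v, 0, 1.0);
    gsl_vector_set(v, 1, 2.0);
    gsl_vector_set(v, 2, 3.0);

    gsl_vector_scale(v, 2.0);   /* v = 2 * v -> {2,4,6} */

    gsl_vector_free(v);
    return 0;
}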
Ted Byers wrote:
> --- Webb Sprague <webb.sprague@gmail.com> wrote:
>>>>>> ...linear algebra ...
>>>>> ... matrices and vectors .
>>>> ...Especially if some GiST or similar index could efficiently search
>>>> for vectors "close" to other vectors...
>
> I see a potential problem here, in terms of how one defines "close" or
> similitude. I think, though, practical answers can be found in examples
> of applying quantitative methods in some subdisciplines of biology.

Even if the best GiST can give is selecting vectors constrained by an n-dimensional bounding box (in much the same way it does for PostGIS), it can help a lot. Then your application can select everything in a conservatively large box and use whatever its favorite metric is to narrow the data further.

> Someone pointed to the potential utility of PL/R. I would be interested
> at least in learning about your assessment of the two (PostGIS and PL/R).

I think they'd be complementary. IMHO, if a native PostgreSQL datatype could allow indexes to narrow the amount of data that needs to be processed, it'd be great to do the rest of the work using R (though we're, perhaps foolishly, using something else in our application).

> If you're looking at similarity, and some practicality in the use of
> quantitative procedures, you may want to look into the biogeography and
> numerical taxonomy literature, and to a lesser extent quantitative
> plant ecology.

Indeed. Though the current literature I wade through is crime analysis. Ideally our software tries to match witness descriptions of a "black SUV" as substantially similar to a "dark green 4Runner" - especially when seen at night; and to understand that a "Glock 19" is a "9mm handgun" and that a "9mm handgun" might be, but isn't necessarily, a "Glock 19". Also, two person records with a tattoo of a cross might contribute a little similarity, but two records with tattoos of Darth Maul / Maori face art contribute much more to the similarity scores because they're so much more distinctive. And of course there are entire companies focused on similarity metrics for fingerprints, DNA, and names (Bill is arguably more similar to William than to Bell). Any magic indexes to help such queries would be very cool; but so far we do most of it in our application logic.

> A good linear algebra library would be useful, but there are a lot of
> nonlinear analyses that would be of interest, and there are
> nonparametric yet quantitative approaches that are of considerable
> interest in assessing similarity.

True, many things don't map cleanly to linear algebra; but it would still be quite useful.

> If I can manage the time, I hope to start a project where I can store
> description data for specimens of plants and animals and use analyses
> including, but not limited to, ordination, clustering, discriminant
> functions, and canonical correlation to create a structure for
> comparing them and for identifying new specimens - or, at a minimum,
> if the specimen is truly something unknown, learn what known specimens
> or groups thereof it is most similar to, and how it is different.

Very cool. Sounds like a somewhat similar issue to mine.
Re: arrays of floating point numbers / linear algebra operations into the DB
From
Enrico Sirola
Date:
Hi Webb, Joe, Martijn,

Webb Sprague wrote:
> On Feb 1, 2008 2:31 AM, Enrico Sirola <enrico.sirola@gmail.com> wrote:
>> Hello,
>> I'd like to perform linear algebra operations on float4/8 arrays
>
> Having avoided a bunch of real work by wondering about linear algebra
> and PG, did you consider the GNU Scientific Library (GSL)? We would
> still need to hook everything together, but it seems to do a lot of
> this, is written in C, etc.

I experimented a bit today with CBLAS and wrapped the BLAS function for scaling a vector. The following session shows the usage:

create or replace function scale(float8, float8[])
returns float8[]
as '$libdir/linalg', 'scale'
language 'C' immutable strict;

sps_dev=# select scale(k, '{1,2,3}') from generate_series(1,10) k;
   scale
------------
 {1,2,3}
 {2,4,6}
 {3,6,9}
 {4,8,12}
 {5,10,15}
 {6,12,18}
 {7,14,21}
 {8,16,24}
 {9,18,27}
 {10,20,30}
(10 rows)

sps_dev=# create operator * (leftarg=float8, rightarg=float8[], procedure=scale);
sps_dev=# select k * '{1,2,3}'::float8[] from generate_series(1,10) k;
  ?column?
------------
 {1,2,3}
 {2,4,6}
 {3,6,9}
 {4,8,12}
 {5,10,15}
 {6,12,18}
 {7,14,21}
 {8,16,24}
 {9,18,27}
 {10,20,30}
(10 rows)

I'm quite proud, this is my first C extension function ;-)
I'd gladly post the code if it's OK for the list users. It's more or less 100 lines of code. This approach seems promising...

By the way, Webb: I took a look at GSL and it seems to me that, from a linear algebra point of view, it's basically CBLAS, so I'd use CBLAS directly.

Please let me know your thoughts/advice,
e.
Re: arrays of floating point numbers / linear algebra operations into the DB
From
"Webb Sprague"
Date:
> I'm quite proud, this is my first C extension function ;-)
> I'd gladly post the code if it's OK for the list users. It's more or
> less 100 lines of code. This approach seems promising...

I would definitely like to see it.

> By the way, Webb: I took a look at GSL and it seems to me that, from a
> linear algebra point of view, it's basically CBLAS, so I'd use CBLAS
> directly. Please let me know your thoughts/advice,

The only thing about GSL is that it would make it easier to tie into some very sophisticated stuff later, and (I think) its basic linear algebra is probably just as fast as CBLAS, so we could implement that first. It would also be easy to define a big project as "bring GSL to Postgres", and then people could work on pieces. But if you actually write it, you get to decide :)

GSL licensing is GNU-ish, so maybe that is a deal breaker, too.

w
Re: arrays of floating point numbers / linear algebra operations into the DB
From
Enrico Sirola
Date:
Hi Webb,

Webb Sprague wrote:
>> I'm quite proud, this is my first C extension function ;-)
>> I'd gladly post the code if it's OK for the list users. It's more or
>> less 100 lines of code. This approach seems promising...
>
> I would definitely like to see it.

here it goes:

-------------------------->linalg.h<----------------------------------

#ifndef linalg_h
#define linalg_h

#include "postgres.h"
#include "utils/array.h"

Datum scale(PG_FUNCTION_ARGS);

#endif /* linalg_h */

-------------------------->linalg.c<----------------------------------

#include "linalg.h"
#include "fmgr.h"
#include <cblas.h>

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(scale);

Datum
scale(PG_FUNCTION_ARGS)
{
    float8      x = PG_GETARG_FLOAT8(0);
    ArrayType  *v1 = PG_GETARG_ARRAYTYPE_P(1);
    int        *dims1, *lbs1, ndims1, nitems1, ndatabytes1;
    int        *arrlbound1, *arrlbound2;
    char       *arrdatap1, *arrdatap2;
    ArrayType  *rv;

    /* get argument array details */
    lbs1 = ARR_LBOUND(v1);
    dims1 = ARR_DIMS(v1);
    ndims1 = v1->ndim;
    nitems1 = ArrayGetNItems(ndims1, dims1);
    ndatabytes1 = ARR_SIZE(v1) - ARR_DATA_OFFSET(v1);

    if (ndims1 != 1)
        ereport(ERROR,
                (errcode(ERRCODE_DATATYPE_MISMATCH),
                 errmsg("multidimensional array given"),
                 errdetail("Array has %d dimensions.", ndims1)));
    if (ARR_HASNULL(v1))
        ereport(ERROR,
                (errcode(ERRCODE_DATATYPE_MISMATCH),
                 errmsg("null in array"),
                 errdetail("Array should not contain null elements.")));

    /* allocate the result vector */
    rv = (ArrayType *) palloc(ndatabytes1);
    SET_VARSIZE(rv, ndatabytes1);
    rv->ndim = v1->ndim;
    rv->dataoffset = v1->dataoffset;        /* no nulls (0) */
    rv->elemtype = v1->elemtype;
    memcpy(ARR_DIMS(rv), ARR_DIMS(v1), sizeof(int));
    arrlbound2 = ARR_LBOUND(rv);
    arrlbound1 = ARR_LBOUND(v1);
    memcpy(arrlbound2, arrlbound1, sizeof(int));
    arrdatap1 = ARR_DATA_PTR(v1);
    arrdatap2 = ARR_DATA_PTR(rv);
    memcpy(arrdatap2, arrdatap1, nitems1 * sizeof(float8));

    /* scale the vector a la BLAS */
    cblas_dscal(nitems1, x, (float8 *) arrdatap2, 1);

    PG_RETURN_ARRAYTYPE_P(rv);
}

--------------------------->linalg.sql<---------------------------------

/* -*-sql-*- */
create or replace function scale(float8, float8[])
returns float8[]
as '$libdir/linalg', 'scale'
language 'C' immutable strict;

create aggregate array_accum (
    sfunc = array_append,
    basetype = anyelement,
    stype = anyarray,
    initcond = '{}'
);

create operator * (leftarg=float8, rightarg=float8[], procedure=scale);

-------------------------------->end<------------------------------------

> GSL licensing is GNU-ish, so maybe that is a deal breaker, too.

Well, I don't know. This is just a proof of concept. Anyway, yes, there could be problems with the GPL.

On the above code: coupling CBLAS functions with PG float8[] seems easy; you just have to know which black-magic macros to use in order to access the data structure. It took me a while to figure out how it works (I'm not actually sure I understood everything, but at least the above code seems to work). The problem with the code above is that it doesn't work for vectors longer than 1000 elements or so (try it with 2000 and it doesn't). I guess I should handle the "toasting" machinery in some way - any suggestion is appreciated.

Bye,
e.
Re: arrays of floating point numbers / linear algebra operations into the DB
From
Enrico Sirola
Date:
I reply to myself:

Enrico Sirola wrote:
[...]
> seems to work). The problem with the code above is that it doesn't work
> for vectors longer than 1000 elements or so (try it with 2000 and it
> doesn't). I guess I should handle the "toasting" machinery in some way -
> any suggestion is appreciated

Wrong. It was just that I had forgotten to add ARR_OVERHEAD_NONULLS(ndims1) to the memory allocation for rv:

rv = (ArrayType *) palloc(ndatabytes1);

becomes

rv = (ArrayType *) palloc(ndatabytes1 + ARR_OVERHEAD_NONULLS(ndims1));

and now it seems to work :)
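Note that the SET_VARSIZE call must be given the same total size - an array datum's varlena length covers the header as well as the data - so the corrected allocation, written out in full against the code above, looks like this:

int nbytes = ndatabytes1 + ARR_OVERHEAD_NONULLS(ndims1);

/* allocate header + data, and record the full datum size */
rv = (ArrayType *) palloc(nbytes);
SET_VARSIZE(rv, nbytes);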