Thread: How To: A large [2D] matrix, 100,000+ rows/columns
Experts,
A very high level question... Subject sums it up.
I love PGSQL, but I don't know if it is a good fit for this back-end... I hope it is.
Imagine something akin to stocks, where you have a row for every stock, and a column for every stock. Except where the same stock is the row & col, a number is at each X-Y (row/column), and that is the big picture. I need to have a very large matrix to maintain & query, and if not (1,600 column limit), then how could such data be broken down to work?
By wanting postgresql as a solution, am I round-hole, square-pegging myself?
I don't mind keeping, say, a 1,500 column max per table, but then adding new items (stocks in the above analogy) might make it difficult to keep track of things...
Hoping someone has tackled this before, and performance isn't too big a concern, as the data changes seldom.
I know it's a weird one, so thanks!
:-)
> On 9 Jun 2023, at 04:17, Pat Trainor <pat.trainor@gmail.com> wrote:
(…)
> Imagine something akin to stocks, where you have a row for every stock, and a
> column for every stock. Except where the same stock is the row & col, a number
> is at each X-Y (row/column), and that is the big picture. I need to have a
> very large matrix to maintain & query, and if not (1,600 column limit), then
> how could such data be broken down to work?

If your matrix contains values that are all of the same type, as matrices usually do, then a matrix can be described as the Cartesian product of rows and columns, with values connecting those.

For rows and columns you could enumerate them using generate_series() or a pair of recursive CTEs, or you could put them into their own table. For the values (or cells), a tuple of (row, column, value) would be sufficient.

Then in the end, the matrix would be a presentation of the left joins of the Cartesian product of rows and columns with your cell values. The left joins are to account for missing cell values (empty cells), or you could explicitly add tuples for those with an 'empty' value. For presentation, I would use something like Python Pandas and xlsxwriter.

Data entry is going to be a bit of a pain if you cannot automate it, and it's not going to be very space-efficient, but it does fit the relational model this way and it would be easy to expand the matrix in either direction.

> By wanting postgresql as a solution, am I round-hole, square-pegging myself?

I expect that there are solutions that were explicitly designed for handling (large) matrices and that those would perhaps perform better.

> I don't mind keeping, say, a 1,500 column max per table, but then adding new
> items (stocks in the above analogy) might make it difficult to keep track of
> things...

That's also a possibility, but that sort of pushes the column lookups down to the catalog level and induces overhead on all other catalog lookups as well. It's not a huge amount, though.
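The (row, column, value) layout with a left join over the Cartesian product can be sketched end to end. The example below uses SQLite in place of Postgres purely so it runs standalone; the table and ticker names (stocks, cells, AAA, …) are made up for illustration.

```python
# Sketch of the sparse (row, column, value) model, using SQLite as a
# stand-in for Postgres. Only non-empty cells are stored.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE stocks (id INTEGER PRIMARY KEY, ticker TEXT);
    -- One tuple per non-empty cell of the matrix.
    CREATE TABLE cells (
        row_id INTEGER REFERENCES stocks(id),
        col_id INTEGER REFERENCES stocks(id),
        value  REAL,
        PRIMARY KEY (row_id, col_id)
    );
    INSERT INTO stocks VALUES (1, 'AAA'), (2, 'BBB'), (3, 'CCC');
    INSERT INTO cells  VALUES (1, 2, 0.5), (2, 3, -0.1);
""")

# The full matrix is the Cartesian product of rows and columns,
# left-joined against the stored cells; missing cells come back NULL.
matrix = con.execute("""
    SELECT r.ticker, c.ticker, cells.value
    FROM stocks AS r
    CROSS JOIN stocks AS c
    LEFT JOIN cells ON cells.row_id = r.id AND cells.col_id = c.id
    ORDER BY r.id, c.id
""").fetchall()

print(len(matrix))   # 3 x 3 = 9 cells, empty ones included
print(matrix[1])     # ('AAA', 'BBB', 0.5)
```

In Postgres itself, the row and column sets could instead come from generate_series() or a recursive CTE, as suggested above, and adding a stock only means one new row in the stocks table.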
An alternative approach would be a table of rows (or columns), each with an array of values, especially since you don't expect many updates. That would be a first attempt at optimisation if the pure relational model doesn't work out.

> Hoping someone has tackled this before, and performance isn't too big a
> concern, as the data changes seldom.

Not in practice, AFAIR, but I was thinking of a solution like this for small matrices (Sudokus; I hate the things, but I need solutions to a few to help me with a number of test scripts).

Alban Hertroys
--
If you can't see the forest for the trees, cut the trees and you'll find there is no forest.
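The array-per-row alternative can be sketched in plain Python; in Postgres the array would be a float8[] column on a one-record-per-matrix-row table, and the names here are purely illustrative.

```python
# Sketch of the array-per-row layout: each matrix row is stored whole,
# and a separate mapping gives each column its array position.
col_index = {"AAA": 0, "BBB": 1, "CCC": 2}   # column name -> array position

rows = {
    "AAA": [0.0, 0.5, 0.3],
    "BBB": [0.5, 0.0, -0.1],
    "CCC": [0.3, -0.1, 0.0],
}

def cell(row: str, col: str) -> float:
    """Look up one matrix cell by row and column name."""
    return rows[row][col_index[col]]

print(cell("BBB", "CCC"))   # -0.1
```

The trade-off is the one noted above: reads of a whole row are cheap and there is far less per-cell overhead, but adding a new stock means appending a slot to every stored array as well as adding a new row.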
On 6/8/23 22:17, Pat Trainor wrote:
> Imagine something akin to stocks, where you have a row for every stock,
> and a column for every stock. Except where the same stock is the row &
> col, a number is at each X-Y (row/column), and that is the big picture.
> I need to have a very large matrix to maintain & query, and if not
> (1,600 column limit), then how could such data be broken down to work?

100,000 rows *
100,000 columns *
8 bytes (assuming float8)
= about 80 GB per matrix, if I got the math correct.

Is this really a dense matrix, or is it sparse? What kind of operations? Does it really need to be stored as such, or could it be stored as vectors that are converted to a matrix on the fly when needed?

Seems like using Python or R makes more sense. Perhaps it might make sense to store the data in Postgres and use plpython or plr. But it is hard to say without more details.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
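The arithmetic above is easy to check directly; this counts raw values only, ignoring the per-tuple and per-page overhead that would make the on-disk size in Postgres larger.

```python
# Back-of-the-envelope size of a dense 100,000 x 100,000 float8 matrix.
rows = cols = 100_000
bytes_per_value = 8                    # float8

total = rows * cols * bytes_per_value
print(f"{total:,} bytes")              # 80,000,000,000 bytes
print(total / 1e9, "GB (decimal)")     # 80.0 GB
print(round(total / 2**30, 1), "GiB")  # ~74.5 GiB
```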
Joe Conway wrote on Fri 09-06-2023 at 09:16 [-0400]:
> On 6/8/23 22:17, Pat Trainor wrote:
> > I need to have a very large matrix to maintain & query, and if not
> > (1,600 column limit), then how could such data be broken down to
> > work?
>
> 100,000 rows *
> 100,000 columns *
> 8 bytes (assuming float8)
> = about 80 GB per matrix if I got the math correct.

Based on my personal experience, I would not use Postgres in a case where you need many columns. You can work around this with JSON, for example, but it will likely end up being less easy to work with.

As Joe replied: R or Python are probably a better fit, or another database that can easily handle a lot of columns. Postgres is a great database, but not when you need a lot of columns.

(As you noted: there might be another backend storage for Postgres that can handle this better (or in the future?), but I don't think there is one. Also, there is the page size, for which a standard 8K is provisioned anyway, so that is the first bottleneck (you can change this value if you compile Postgres yourself):
https://www.postgresql.org/docs/current/limits.html )

Wim
> On 09/06/2023 16:00 CEST Wim Bertels <wim.bertels@ucll.be> wrote:
(…)
> as Joe replied: R or Python are probably a better fit,
> or another database that can easily handle a lot of columns,
> postgres is a great database, but not when you need a lot of columns

Rasdaman may also be an option. Saw it a few weeks ago on this very list.

https://rasdaman.org
https://www.postgresql.org/message-id/CAFj8pRDjE0mdL6_b86ZDawHtNeRPQLciWos3m3PGJueJ5COSjQ%40mail.gmail.com

--
Erik