Thread: Array columns vs normalized table
I work with state labor data which is reported to us in the form industry, year, quarter1, quarter2, quarter3, quarter4 where each quarter represents an employment count. Obviously, this can be normalized to industry, year, quarter, employment Can anyone comment on, or point to me to an article or discussion regarding, why one would use an array column instead of normalizing the data? That is, would there be any benefit to storing it as industry int, year smallint, employment int[ ] where the last column would be a four element array with data for the four quarters. Thanks, --Lee -- Lee Hachadoorian PhD Student, Geography Program in Earth & Environmental Sciences CUNY Graduate Center
Hello 2010/3/2 Lee Hachadoorian <lee.hachadoorian@gmail.com>: > I work with state labor data which is reported to us in the form > > industry, year, quarter1, quarter2, quarter3, quarter4 > > where each quarter represents an employment count. Obviously, this can > be normalized to > > industry, year, quarter, employment > > Can anyone comment on, or point to me to an article or discussion > regarding, why one would use an array column instead of normalizing > the data? That is, would there be any benefit to storing it as for very large timeseries you can use arrays. The storage can be more effective. sample industry year q1 q2 q3 q4 1, 2001, 10,11,12,13 1, 2002, 14, 15,16,17 can be transformed to tuple (industry: 1, start_year: 2002, data: 10,11,12,13,14,15,16,17, ..... ) your model isn't more safer then normalised tables and I think is better prefer normalized tables. the queries to arrays are difficult - so usually data from arrays are dynamically unpacked to tables. But time series stored as array take significantly less space on disc. Look on http://www.postgresql.org/docs/8.3/static/storage-page-layout.html - you safe repeated row headers. Regards Pavel Stehule I hope so nobody uses this technique. It cannot substitute normalizated tables. > > industry int, year smallint, employment int[ ] if > > where the last column would be a four element array with data for the > four quarters. > > Thanks, > --Lee > > -- > Lee Hachadoorian > PhD Student, Geography > Program in Earth & Environmental Sciences > CUNY Graduate Center > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general >
On Tue, Mar 2, 2010 at 11:21 AM, Lee Hachadoorian <lee.hachadoorian@gmail.com> wrote: > I work with state labor data which is reported to us in the form > > industry, year, quarter1, quarter2, quarter3, quarter4 > > where each quarter represents an employment count. Obviously, this can > be normalized to > > industry, year, quarter, employment > > Can anyone comment on, or point to me to an article or discussion > regarding, why one would use an array column instead of normalizing > the data? That is, would there be any benefit to storing it as > > industry int, year smallint, employment int[ ] > > where the last column would be a four element array with data for the > four quarters. I think you might want to step back and ask yourself why you'd want to normalize this data at all? Unless you are trying to operate on all 4 columns at once (comparisons or aggregation, etc.) I see no reason to. It will take more space and give you no new capabilities. If you need do need to operate on all 4 quarters simultaneously then you may very well want to normalize, but in that case there's certainly no advantage in going to an array type. Personally, the only reason I'd see for ever using an array type is when you have many very closely related values that would cause some huge number of rows if the data is stored normalized -- Peter Hunsberger
Lee Hachadoorian wrote: > I work with state labor data which is reported to us in the form > > industry, year, quarter1, quarter2, quarter3, quarter4 > > where each quarter represents an employment count. Obviously, this can > be normalized to > > industry, year, quarter, employment > > Can anyone comment on, or point to me to an article or discussion > regarding, why one would use an array column instead of normalizing > the data? That is, would there be any benefit to storing it as > > industry int, year smallint, employment int[ ] > > where the last column would be a four element array with data for the > four quarters. > > Thanks, > --Lee > > -- > Lee Hachadoorian > PhD Student, Geography > Program in Earth & Environmental Sciences > CUNY Graduate Center If you want to do that, I'd recommend: industry int, year smallint, emp_q1 int, emp_q2 int, emp_q3 int, emp_q4 int That way it is more clear, easier to query, uses less space and you wont end up with employment data for the 5th quarter or something odd like that. Arrays are great for working with your data during the query process. But you should generally avoid using them to store your data on disk. Scott
Pavel, the idea of using arrays to store long time-series data sounds good, but I take your point that normalized tables might be better and are easier to query. I suppose the choice will be between normalizing or using the denormalized industry int, year smallint, emp_q1 int, emp_q2 int, emp_q3 int, emp_q4 int as suggested by Peter and Scott. We're mostly actually interested in annual numbers, but need to preserve the quarterly data for verification and for unusual analyses. So perhaps storing denormalized with an additional emp_annual int field, and a view that keeps the quarterly data out of sight. Thanks for your replies. Please feel free to comment if you think of anything else. Best, --Lee -- Lee Hachadoorian PhD Student, Geography Program in Earth & Environmental Sciences CUNY Graduate Center