Thread: PostgreSQL Arrays and Performance
A few performance issues using PostgreSQL's arrays led us to the question of how Postgres actually stores variable-length arrays. First, let me explain our situation.

We have a rather large table containing a simple integer primary key and a couple more columns of fixed size. However, there is a dates column of type "timestamp without time zone[]" that is apparently causing some severe performance problems.

During a daily update process, new timestamps are collected and existing data rows are updated (new rows are also being added). These changes affect a large percentage of the existing rows.

What we have been observing over the last few weeks is that the overall database size is increasing rapidly due to this table, and vacuum processes seem to deadlock with other processes querying data from this table. Therefore, the database keeps growing and becomes more and more unusable. The only thing that helps is dumping and restoring it, which is nothing you are eager to do on a large live system on a daily basis.

This problem led us to the question of how these arrays are stored internally. Are they stored "in-place" with the other columns or merely as a pointer to another file? Would it be more efficient not to use an array for this purpose but to split the table in two?

Any help is appreciated!

Marc Philipp
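For illustration only, the setup described above roughly corresponds to a schema like the following; the table and column names (events, flags) are hypothetical stand-ins, only the dates column matches the description:

    -- Hypothetical sketch of the schema described above
    CREATE TABLE events (
        id     integer PRIMARY KEY,           -- simple integer primary key
        flags  integer NOT NULL,              -- stand-in for the fixed-size columns
        dates  timestamp without time zone[]  -- the problematic array column
    );

    -- The daily process appends timestamps to existing rows; each such
    -- UPDATE rewrites the entire row, including the whole array:
    UPDATE events
       SET dates = dates || now()::timestamp
     WHERE id = 42;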
Marc Philipp <mail@marcphilipp.de> writes:
> A few performance issues using PostgreSQL's arrays led us to the
> question of how Postgres actually stores variable-length arrays.
> First, let me explain our situation.
>
> We have a rather large table containing a simple integer primary key
> and a couple more columns of fixed size. However, there is a dates
> column of type "timestamp without time zone[]" that is apparently
> causing some severe performance problems.

How large are the arrays? PG is definitely not designed to do well with
very large arrays (say more than a couple hundred elements). You should
reconsider your data design if you find yourself trying to do that.

			regards, tom lane
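To answer Tom's question, the actual array sizes can be checked with array_upper (assuming the hypothetical events table sketched above, and the default array lower bound of 1):

    SELECT max(array_upper(dates, 1)) AS max_len,
           avg(array_upper(dates, 1)) AS avg_len
      FROM events;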
Marc Philipp wrote:
> During a daily update process, new timestamps are collected and
> existing data rows are updated (new rows are also being added). These
> changes affect a large percentage of the existing rows.
>
> What we have been observing over the last few weeks is that the
> overall database size is increasing rapidly due to this table, and
> vacuum processes seem to deadlock with other processes querying data
> from this table.

This sounds like it has more to do with inadequate free space map
settings than use of arrays. Every update creates a dead tuple, and if
it is large (because the array is large) and leaked (because you have no
room in your free space map), that would explain a rapidly increasing
database size.

> Therefore, the database keeps growing and becomes more and more
> unusable. The only thing that helps is dumping and restoring it, which
> is nothing you are eager to do on a large live system on a daily
> basis.

Arrays are stored as compressed varlenas, pretty much exactly like
varchar or text fields. However, if your use of arrays causes you to
need to update a large percentage of your rows on a daily basis,
instead of just inserting new rows, you should probably rethink your
data model.

> Would it be more efficient not to use an array for this purpose but to
> split the table in two?

I think so.

Joe
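A sketch of the split design suggested here, again using the hypothetical names from above: each timestamp becomes its own row in a detail table, so the daily load turns into plain inserts instead of rewrites of large rows:

    CREATE TABLE event_dates (
        event_id integer NOT NULL REFERENCES events(id),
        ts       timestamp without time zone NOT NULL,
        PRIMARY KEY (event_id, ts)
    );

    -- Daily load: insert-only, no large rows are rewritten
    INSERT INTO event_dates (event_id, ts) VALUES (42, now()::timestamp);

    -- An array can still be reconstructed on demand where one is needed:
    SELECT ARRAY(SELECT ts FROM event_dates
                  WHERE event_id = 42 ORDER BY ts);

With this layout, updates no longer produce large dead tuples, which takes most of the pressure off vacuum and the free space map.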
> How large are the arrays? PG is definitely not designed to do well
> with very large arrays (say more than a couple hundred elements). You
> should reconsider your data design if you find yourself trying to do
> that.

At the moment, the arrays are not larger than 200 entries. But there is
no upper bound on their size.

Regards,
Marc Philipp
> This sounds like it has more to do with inadequate free space map
> settings than use of arrays. Every update creates a dead tuple, and if
> it is large (because the array is large) and leaked (because you have
> no room in your free space map), that would explain a rapidly
> increasing database size.

Can you tell me more about free space map settings? What exactly is the
free space map? The information in the documentation is not very
helpful. How can dead tuples leak?

Regards,
Marc Philipp
On Sun, Jan 08, 2006 at 10:22:22AM +0100, Marc Philipp wrote:
> > This sounds like it has more to do with inadequate free space map
> > settings than use of arrays. Every update creates a dead tuple, and
> > if it is large (because the array is large) and leaked (because you
> > have no room in your free space map), that would explain a rapidly
> > increasing database size.
>
> Can you tell me more about free space map settings? What exactly is
> the free space map? The information in the documentation is not very
> helpful. How can dead tuples leak?

http://www.pervasive-postgres.com/lp/newsletters/2005/Insights_opensource_Nov.asp#3
is an article I wrote that might clear things up.

http://www.pervasive-postgres.com/lp/newsletters/2005/Insights_opensource_Dec.asp#2
might also be an interesting read, though it's just about MVCC in
general.

--
Jim C. Nasby, Sr. Engineering Consultant   jnasby@pervasive.com
Pervasive Software   http://pervasive.com   work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf   cell: 512-569-9461
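For reference, on PostgreSQL versions of this era (before 8.4, when the free space map became self-managing) the FSM is a fixed-size shared-memory structure sized by two postgresql.conf settings; the values below are examples, not recommendations:

    # postgresql.conf
    max_fsm_pages = 200000     # must cover all pages with reclaimable free space
    max_fsm_relations = 1000   # number of tables and indexes tracked in the FSM

A database-wide VACUUM VERBOSE, run as superuser, reports the current FSM usage in its final output lines, which shows whether these limits are large enough for the workload.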