Thread: newbie design question re impact of VACUUM
After looking at "Chapter 22. Routine Database Maintenance Tasks" (http://www.postgresql.org/docs/8.1/interactive/maintenance.html), I started wondering about what (if any) consideration to give to to VACUUM issues in the following context. As a background, I'll be using Postgres in part as a processing queue for a 40-column stream of information (~ 250 bytes/row) with a sustained input rate of 20 rows/sec. This queue will be processed periodically (every few minutes), design constraints are to (1) only process each row once, and (2) keep the processed rows around for a period of time (say a month or so). My first (naive?) idea was to add a boolean "was_processed" column to the table (defaulted to false) and UPDATE it to true as part of (1). After reading Chapter 22, though, it seems that even a minor UPDATE like that copies the row and requires VACUUMing. Given that, my basic question is whether row width is a consideration in UPDATE or VACUUM performance, and if so if it is generally accepted practice to design around it? For example, if I were to make a child table to effectively hold the "was_processed" flag I could probably avoid UPDATEs entirely, but I'm not sure how to value that in this context. Thanks in advance for and help/info/pointers.
shakahshakah@gmail.com wrote: > After looking at "Chapter 22. Routine Database Maintenance Tasks" > (http://www.postgresql.org/docs/8.1/interactive/maintenance.html), I > started wondering about what (if any) consideration to give to to > VACUUM issues in the following context. > > As a background, I'll be using Postgres in part as a processing queue > for a 40-column stream of information (~ 250 bytes/row) with a > sustained input rate of 20 rows/sec. This queue will be processed > periodically (every few minutes), design constraints are to (1) only > process each row once, and (2) keep the processed rows around for a > period of time (say a month or so). I would have the application vacuum the queue table after processing a batch of rows. That's about as small as you could keep the table's disk usage. You are right in that updating the row will basically create a new version of the row. -- Richard Huxton Archonet Ltd
In article <1131466149.590544.159870@g43g2000cwa.googlegroups.com>, "shakahshakah@gmail.com" <shakahshakah@gmail.com> writes: > As a background, I'll be using Postgres in part as a processing queue > for a 40-column stream of information (~ 250 bytes/row) with a > sustained input rate of 20 rows/sec. This queue will be processed > periodically (every few minutes), design constraints are to (1) only > process each row once, and (2) keep the processed rows around for a > period of time (say a month or so). > My first (naive?) idea was to add a boolean "was_processed" column to > the table (defaulted to false) and UPDATE it to true as part of (1). > After reading Chapter 22, though, it seems that even a minor UPDATE > like that copies the row and requires VACUUMing. That's true, but there might be a way to avoid it. If your queue elements have a timestamp, you could run your processing routine not over elements where "was_processed" is false, but over elements within some time interval, e.g. the last minute. This would eliminate the need for an UPDATE.
First rule of performance tuning: don't. See how well things run using the simple plan you've drawn up. If performance is acceptable, you're done. Yes, you could keep the flag in a seperate table, but remember that every row has a ~20 byte overhead, which is non-trivial. If you want to go this route, I suggest keeping a table of queue entries yet to be processed (can probably just be a table of ints). To see what you need to process, join that table back to the main queue table. When you've processed something, delete it's row from the 'side table'. Another possibility is to have two tables; one is the active queue and the other is historical. This is probably a better way to go so that your queue process doesn't have to slog through months of data. After you've processed an entry, just move it from the active table to the history table. On Tue, Nov 08, 2005 at 08:09:09AM -0800, shakahshakah@gmail.com wrote: > After looking at "Chapter 22. Routine Database Maintenance Tasks" > (http://www.postgresql.org/docs/8.1/interactive/maintenance.html), I > started wondering about what (if any) consideration to give to to > VACUUM issues in the following context. > > As a background, I'll be using Postgres in part as a processing queue > for a 40-column stream of information (~ 250 bytes/row) with a > sustained input rate of 20 rows/sec. This queue will be processed > periodically (every few minutes), design constraints are to (1) only > process each row once, and (2) keep the processed rows around for a > period of time (say a month or so). > > My first (naive?) idea was to add a boolean "was_processed" column to > the table (defaulted to false) and UPDATE it to true as part of (1). > After reading Chapter 22, though, it seems that even a minor UPDATE > like that copies the row and requires VACUUMing. Given that, my basic > question is whether row width is a consideration in UPDATE or VACUUM > performance, and if so if it is generally accepted practice to design > around it? For example, if I were to make a child table to effectively > hold the "was_processed" flag I could probably avoid UPDATEs entirely, > but I'm not sure how to value that in this context. > > Thanks in advance for and help/info/pointers. > > > ---------------------------(end of broadcast)--------------------------- > TIP 1: if posting/reading through Usenet, please send an appropriate > subscribe-nomail command to majordomo@postgresql.org so that your > message can get through to the mailing list cleanly > -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461