Thread: newbie design question re impact of VACUUM

newbie design question re impact of VACUUM

From

"shakahshakah@gmail.com"

Date:

09 November 2005, 10:34:00

After looking at "Chapter 22. Routine Database Maintenance Tasks"
(http://www.postgresql.org/docs/8.1/interactive/maintenance.html), I
started wondering about what (if any) consideration to give to to
VACUUM issues in the following context.

As a background, I'll be using Postgres in part as a processing queue
for a 40-column stream of information (~ 250 bytes/row) with a
sustained input rate of 20 rows/sec. This queue will be processed
periodically (every few minutes), design constraints are to (1) only
process each row once, and (2) keep the processed rows around for a
period of time (say a month or so).

My first (naive?) idea was to add a boolean "was_processed" column to
the table (defaulted to false) and UPDATE it to true as part of (1).
After reading Chapter 22, though, it seems that even a minor UPDATE
like that copies the row and requires VACUUMing. Given that, my basic
question is whether row width is a consideration in UPDATE or VACUUM
performance, and if so if it is generally accepted practice to design
around it? For example, if I were to make a child table to effectively
hold the "was_processed" flag I could probably avoid UPDATEs entirely,
but I'm not sure how to value that in this context.

Thanks in advance for and help/info/pointers.

Re: newbie design question re impact of VACUUM

From

Richard Huxton

Date:

09 November 2005, 11:00:46

shakahshakah@gmail.com wrote:
> After looking at "Chapter 22. Routine Database Maintenance Tasks"
> (http://www.postgresql.org/docs/8.1/interactive/maintenance.html), I
> started wondering about what (if any) consideration to give to to
> VACUUM issues in the following context.
>
> As a background, I'll be using Postgres in part as a processing queue
> for a 40-column stream of information (~ 250 bytes/row) with a
> sustained input rate of 20 rows/sec. This queue will be processed
> periodically (every few minutes), design constraints are to (1) only
> process each row once, and (2) keep the processed rows around for a
> period of time (say a month or so).

I would have the application vacuum the queue table after processing a
batch of rows. That's about as small as you could keep the table's disk
usage.

You are right in that updating the row will basically create a new
version of the row.

--
   Richard Huxton
   Archonet Ltd

Re: newbie design question re impact of VACUUM

From

Harald Fuchs

Date:

09 November 2005, 12:24:18

In article <1131466149.590544.159870@g43g2000cwa.googlegroups.com>,
"shakahshakah@gmail.com" <shakahshakah@gmail.com> writes:

> As a background, I'll be using Postgres in part as a processing queue
> for a 40-column stream of information (~ 250 bytes/row) with a
> sustained input rate of 20 rows/sec. This queue will be processed
> periodically (every few minutes), design constraints are to (1) only
> process each row once, and (2) keep the processed rows around for a
> period of time (say a month or so).

> My first (naive?) idea was to add a boolean "was_processed" column to
> the table (defaulted to false) and UPDATE it to true as part of (1).
> After reading Chapter 22, though, it seems that even a minor UPDATE
> like that copies the row and requires VACUUMing.

That's true, but there might be a way to avoid it.  If your queue
elements have a timestamp, you could run your processing routine not
over elements where "was_processed" is false, but over elements within
some time interval, e.g. the last minute.  This would eliminate the
need for an UPDATE.

Re: newbie design question re impact of VACUUM

From

"Jim C. Nasby"

Date:

09 November 2005, 16:36:49

First rule of performance tuning: don't. See how well things run using
the simple plan you've drawn up. If performance is acceptable, you're
done.

Yes, you could keep the flag in a seperate table, but remember that
every row has a ~20 byte overhead, which is non-trivial. If you want to
go this route, I suggest keeping a table of queue entries yet to be
processed (can probably just be a table of ints). To see what you need
to process, join that table back to the main queue table. When you've
processed something, delete it's row from the 'side table'.

Another possibility is to have two tables; one is the active queue and
the other is historical. This is probably a better way to go so that
your queue process doesn't have to slog through months of data. After
you've processed an entry, just move it from the active table to the
history table.

On Tue, Nov 08, 2005 at 08:09:09AM -0800, shakahshakah@gmail.com wrote:
> After looking at "Chapter 22. Routine Database Maintenance Tasks"
> (http://www.postgresql.org/docs/8.1/interactive/maintenance.html), I
> started wondering about what (if any) consideration to give to to
> VACUUM issues in the following context.
>
> As a background, I'll be using Postgres in part as a processing queue
> for a 40-column stream of information (~ 250 bytes/row) with a
> sustained input rate of 20 rows/sec. This queue will be processed
> periodically (every few minutes), design constraints are to (1) only
> process each row once, and (2) keep the processed rows around for a
> period of time (say a month or so).
>
> My first (naive?) idea was to add a boolean "was_processed" column to
> the table (defaulted to false) and UPDATE it to true as part of (1).
> After reading Chapter 22, though, it seems that even a minor UPDATE
> like that copies the row and requires VACUUMing. Given that, my basic
> question is whether row width is a consideration in UPDATE or VACUUM
> performance, and if so if it is generally accepted practice to design
> around it? For example, if I were to make a child table to effectively
> hold the "was_processed" flag I could probably avoid UPDATEs entirely,
> but I'm not sure how to value that in this context.
>
> Thanks in advance for and help/info/pointers.
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
>        subscribe-nomail command to majordomo@postgresql.org so that your
>        message can get through to the mailing list cleanly
>

--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461