Thread: Deleting bytea, autovacuum, and 8.2/8.4 differences

Deleting bytea, autovacuum, and 8.2/8.4 differences

From
"fkater@googlemail.com"
Date:
Hi all,

My posting on 2010-01-14 about the performance of writing
bytea to disk led to a longer discussion. While it still
holds that the overall PostgreSQL write performance is
roughly 25% of the sequential disk I/O performance, in my
particular use case this was compensated for by doing other,
non-PostgreSQL-related work in parallel.

Now I cannot optimize my processes any further, and I am
facing another quite unexpected performance issue: deleting
rows from my simple table (each holding 16 MB in its bytea
column) takes roughly as long as writing them!

A little more detail:

* The table has just 5 unused int columns, a timestamp,
OIDs, and the bytea column, no indices; the bytea storage
type is 'extended', and the 16 MB compress to roughly half
that.

* All the usual optimizations have been done to improve
write throughput (pg_xlog on a separate disk, many tweaks to
the server configuration, etc.). However, this does not
matter here, since what is of interest is not the absolute
performance but the fact that deleting takes roughly 100% of
the writing time.

* I need to write 15 rows of 16 MB each to disk within a
maximum of 15 s, which currently takes roughly 10 seconds.
However, keeping my database tidy (deleting rows) now takes
another 5-15 s (10 s on average), so my process exceeds the
15 s limit by about 5 s.

* Right now I am deleting like this:

DELETE FROM table WHERE (CURRENT_TIMESTAMP -
my_timestamp_column) > interval '2 minutes';

while the final version is planned to have the interval set
to 6 hours (thus creating a FIFO buffer for the latest 6
hours of inserted data; the FIFO will then keep approx.
10,000 rows spanning 160-200 GB of data).

* This deletion SQL command was simply executed repeatedly
from pgAdmin while my app kept adding the 16 MB rows.

* Autovacuum is on; I believe I need to keep it on,
otherwise I do not free the disk space, right? If I switch
it off, the deletion time drops from the 10 s average down
to 4 s.

* I am using server + libpq version 8.2.4, currently on
WinXP. Will an upgrade to 8.4 help here?

Do you have any other ideas to help me out?
Oh, please...

Thank You
 Felix





Re: Deleting bytea, autovacuum, and 8.2/8.4 differences

From
Dave Crooke
Date:
Hi there

I'm not an expert on PG's "toast" system, but a couple of thoughts inline below.

Cheers
Dave

On Sat, Mar 13, 2010 at 3:17 PM, fkater@googlemail.com <fkater@googlemail.com> wrote:
Hi all,

My posting on 2010-01-14 about the performance of writing
bytea to disk led to a longer discussion. While it still
holds that the overall PostgreSQL write performance is
roughly 25% of the sequential disk I/O performance, in my
particular use case this was compensated for by doing other,
non-PostgreSQL-related work in parallel.

Now I cannot optimize my processes any further, and I am
facing another quite unexpected performance issue: deleting
rows from my simple table (each holding 16 MB in its bytea
column) takes roughly as long as writing them!

A little more detail:

* The table has just 5 unused int columns, a timestamp,
OIDs, and the bytea column, no indices; the bytea storage
type is 'extended', and the 16 MB compress to roughly half
that.
 
Why no indices?
 

* All the usual optimizations have been done to improve
write throughput (pg_xlog on a separate disk, many tweaks to
the server configuration, etc.). However, this does not
matter here, since what is of interest is not the absolute
performance but the fact that deleting takes roughly 100% of
the writing time.

* I need to write 15 rows of 16 MB each to disk within a
maximum of 15 s, which currently takes roughly 10 seconds.
However, keeping my database tidy (deleting rows) now takes
another 5-15 s (10 s on average), so my process exceeds the
15 s limit by about 5 s.

* Right now I am deleting like this:

DELETE FROM table WHERE (CURRENT_TIMESTAMP -
my_timestamp_column) > interval '2 minutes';

You need an index on my_timestamp_column.
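
For example (untested sketch; the real table name isn't shown, so call it my_table):

CREATE INDEX my_table_ts_idx ON my_table (my_timestamp_column);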
 

while the final version is planned to have the interval set
to 6 hours (thus creating a FIFO buffer for the latest 6
hours of inserted data; the FIFO will then keep approx.
10,000 rows spanning 160-200 GB of data).

That's not the way to keep a 6-hour rolling buffer ... what you need to do is run the delete frequently, with interval '6 hours' in the SQL acting as the cutoff.

If you really do want to drop the entire table contents before refilling it, do a DROP TABLE and recreate it.
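
A sketch of the rolling-buffer delete (again untested; run it every minute or so from the application or a scheduled job):

DELETE FROM my_table
WHERE my_timestamp_column < CURRENT_TIMESTAMP - interval '6 hours';

Note the rewritten condition: with the bare column on one side and a constant expression on the other, the planner can use the index above, which the original (CURRENT_TIMESTAMP - my_timestamp_column) > interval form cannot.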


* This deletion SQL command was simply executed repeatedly
from pgAdmin while my app kept adding the 16 MB rows.

Are you sure you are timing the delete, and not pgAdmin re-populating some kind of buffer?
 

* Autovacuum is on; I believe I need to keep it on,
otherwise I do not free the disk space, right? If I switch
it off, the deletion time drops from the 10 s average down
to 4 s.

You may be running autovacuum too aggressively; it may be interfering with I/O to the tables.

Postgres vacuuming does not free disk space (in the sense of returning it to the OS); it removes old versions of rows that have been UPDATEd or DELETEd and makes that space in the table file available for new writes.
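
Roughly, in SQL terms (sketch; "mytable" stands in for your bytea table):

-- marks the space of dead rows as reusable within the table files;
-- this is what autovacuum does for you in the background
VACUUM mytable;

-- actually shrinks the files and returns space to the OS, but takes an
-- exclusive lock and is slow on a table this size
VACUUM FULL mytable;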


* I am using server + libpq version 8.2.4, currently on
WinXP. Will an upgrade to 8.4 help here?

8.4 has a lot of performance improvements. It's definitely worth a shot. I'd also consider switching to another OS where you can use a 64-bit version of PG and a much bigger buffer cache.


Do you have any other ideas to help me out?
Oh, please...

Thank You
 Felix






Re: Deleting bytea, autovacuum, and 8.2/8.4 differences

From
"fkater@googlemail.com"
Date:
Hi Dave,

thank you for your answers! Here are some comments:

Dave Crooke:

> > * The table has just 5 unused int columns, a timestamp,
> > OIDs, and the bytea column, *no indices*; the bytea storage
> > type is 'extended', and the 16 MB compress to roughly half
> > that.
> >
>
> Why no indices?

Simply because the test case had just < 50 rows (deleting
all rows older than 2 minutes). Later on I would use indices.


> > while the final version is planned to have the interval set
> > to 6 hours (thus creating a FIFO buffer for the latest 6
> > hours of inserted data; the FIFO will then keep approx.
> > 10,000 rows spanning 160-200 GB of data).
> >
>
> That's not the way to keep a 6-hour rolling buffer ... what you need to do
> is run the delete frequently, with *interval '6 hours'* in the SQL acting
> as the cutoff.

In fact the delete was run frequently to cut everything
older than 6 hours *immediately*.


> If you really do want to drop the entire table contents before refilling it,
> do a *DROP TABLE* and recreate it.

No, I do not want to drop the whole table.


> > * This deletion SQL command was simply executed repeatedly
> > from pgAdmin while my app kept adding the 16 MB rows.
> >
>
> Are you sure you are timing the delete, and not pgAdmin re-populating some
> kind of buffer?

Quite sure, yes, because I launched just the delete command
in pgAdmin while the rest was executed by my application
outside pgAdmin.



> > * Autovacuum is on; I believe I need to keep it on,
> > otherwise I do not free the disk space, right? If I switch
> > it off, the deletion time drops from the 10 s average down
> > to 4 s.
> >
>
> You may be running autovacuum too aggressively; it may be interfering with
> I/O to the tables.

Hm, so what should I change then? I wonder whether running
autovacuum less aggressively helps when there is never a
situation where the whole process pauses for a while. But
I'd like to understand what to change here.


> 8.4 has a lot of performance improvements. It's definitely worth a shot. I'd
> also consider switching to another OS where you can use a 64-bit version of
> PG and a much bigger buffer cache.

O.k., I'll give it a try.


Thank You.
 Felix


Re: Deleting bytea, autovacuum, and 8.2/8.4 differences

From
VJK
Date:

A quick test:
-----
1. create table x1(x int, y bytea);
2. Load some data, say with Python:
cp /opt/java/src.zip ~/tmp/a.dat (19MB)
##
import psycopg2

conn = psycopg2.connect("dbname='test' user='*****' password='****' host='127.0.0.1'")
cur = conn.cursor()
# read the file in binary mode and pass it as a bytea parameter
cur.execute("INSERT INTO x1 VALUES (1, %s)",
            (psycopg2.Binary(open("a.dat", "rb").read()),))
conn.commit()
##
3. create table x2(x int, y bytea);
4. Copy table x1 100 times to x2 (1.9GB) and monitor/measure IO:
insert into x2 select a x, y from generate_series(1,100) a, x1;

Results:
-----------
On Linux 2.6.32 with an ext3 file system on one 15K rpm disk, we saw with SystemTap that the source 1.9 GB (19 MB x 100) resulted in 5 GB of actual disk IO and took 61 seconds (52 s CPU + 9 s sleep/wait for IO).

Deletion (delete from x2) took 32 seconds, with 12 seconds of CPU and 20 seconds of sleep + wait for IO. Actual disk IO was about 4 GB.

Since Pg does not use the concept of rollback segments,  it is unclear why deletion produces so much disk IO (4GB).

VJ





Re: Deleting bytea, autovacuum, and 8.2/8.4 differences

From
"Kevin Grittner"
Date:
"fkater@googlemail.com" <fkater@googlemail.com> wrote:

> Simply because the test case had just < 50 rows (deleting
> all rows older than 2 minutes). Later on I would use indices.

Running a performance test with 50 rows without indexes and
extrapolating to a much larger data set with indexes won't tell you
much.  The plans chosen by the PostgreSQL optimizer will probably be
completely different, and the behavior of the caches (at all levels)
will be very different.

>> > while the final version is planned to have the interval set
>> > to 6 hours (thus creating a FIFO buffer for the latest 6
>> > hours of inserted data; the FIFO will then keep approx.
>> > 10,000 rows spanning 160-200 GB of data).

This might lend itself to partitioning.  Dropping a partition
containing data older than six hours would be very fast.  Without
knowing what kinds of queries you want to run on the data, it's hard
to predict the performance impact on your other operations, though.
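
On 8.2 or 8.4 this is done with table inheritance plus CHECK
constraints, roughly like the sketch below (the column names are made
up to match the description):

CREATE TABLE frames (
    a int, b int, c int, d int, e int,
    created timestamptz,
    payload bytea
);

-- one child table per hour, constrained so constraint_exclusion can
-- prune it from queries
CREATE TABLE frames_2010_03_15_16 (
    CHECK (created >= '2010-03-15 16:00' AND created < '2010-03-15 17:00')
) INHERITS (frames);

-- the application inserts into the current child, and expiring the
-- oldest hour is then just a cheap
DROP TABLE frames_2010_03_15_10;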

>> > * Autovacuum is on; I believe I need to keep it on,
>> > otherwise I do not free the disk space, right? If I switch
>> > it off, the deletion time drops from the 10 s average down
>> > to 4 s.
>> >
>>
>> You may be running autovacuum too aggressively; it may be
>> interfering with I/O to the tables.
>
> Hm, so what should I change then? I wonder whether running
> autovacuum less aggressively helps when there is never a
> situation where the whole process pauses for a while. But
> I'd like to understand what to change here.

I'd be very careful about this; I've seen performance problems more
often (and more dramatic) from not running it aggressively enough.
Monitor performance and bloat closely when you adjust this, and make
sure the data and load are modeling what you expect in production,
or you'll tune for the wrong environment and likely make matters
worse for the environment that really matters.

-Kevin

Re: Deleting bytea, autovacuum, and 8.2/8.4 differences

From
"Kevin Grittner"
Date:
VJK <vjkmail@gmail.com> wrote:

> the source 1.9GB  (19MB x 100) resulted in 5GB of actual disk IO

> Deletion (delete from x2) took 32 seconds with 12 seconds CPU and
> 20 sec sleep + wait for IO. Actual disk IO was about 4GB.
>
> Since Pg does not use the concept of rollback segments,  it is
> unclear why deletion produces so much disk IO (4GB).

One delete would mark the xmax of the tuple, so that transactions
without that transaction ID in their visible set would ignore it.
The next table scan would set hint bits, which would store
information within the tuple to indicate that the deleting
transaction successfully committed, then the vacuum would later wake
up and rewrite the page with the deleted tuples removed.

If you have enough battery backed cache space on a hardware RAID
controller card, and that cache is configured in write-back mode,
many of these writes might be combined -- the original delete, the
hint bit write, and the vacuum might all combine into one physical
write to disk.  What does your disk system look like, exactly?

-Kevin

Re: Deleting bytea, autovacuum, and 8.2/8.4 differences

From
Greg Smith
Date:
VJK wrote:
> Since Pg does not use the concept of rollback segments,  it is unclear
> why deletion produces so much disk IO (4GB).

With PostgreSQL's write-ahead log, MVCC and related commit log, and
transactional DDL features, there's actually even more overhead that can
be involved than a simple rollback segment design when you delete things:

http://www.postgresql.org/docs/current/static/wal.html
http://www.postgresql.org/docs/current/static/mvcc-intro.html
http://wiki.postgresql.org/wiki/Hint_Bits
http://wiki.postgresql.org/wiki/Transactional_DDL_in_PostgreSQL:_A_Competitive_Analysis

One fun thing to try here is to increase shared_buffers and
checkpoint_segments, then see if the total number of writes goes down.
The defaults for both are really low, which makes buffer page writes
that might otherwise get combined as local memory changes instead get
pushed constantly to disk.
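
For instance, something along these lines in postgresql.conf
(illustrative values only -- size them to the machine's RAM, and note
that changing shared_buffers requires a server restart):

shared_buffers = 512MB        # default is typically 32MB
checkpoint_segments = 32      # default is 3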

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us


Re: Deleting bytea, autovacuum, and 8.2/8.4 differences

From
VJK
Date:
Inline:

On Mon, Mar 15, 2010 at 10:12 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
VJK <vjkmail@gmail.com> wrote:

> the source 1.9GB  (19MB x 100) resulted in 5GB of actual disk IO

> Deletion (delete from x2) took 32 seconds with 12 seconds CPU and
> 20 sec sleep + wait for IO. Actual disk IO was about 4GB.
>
> Since Pg does not use the concept of rollback segments,  it is
> unclear why deletion produces so much disk IO (4GB).

One delete would mark the xmax of the tuple, so that transactions
without that transaction ID in their visible set would ignore it.
The next table scan would set hint bits, which would store
information within the tuple to indicate that the deleting
transaction successfully committed, then the vacuum would later wake
up and rewrite the page with the deleted tuples removed.
 
I did not observe any vacuuming activity during the deletion process. However, even with vacuuming, 4 GB of disk IO is rather excessive for deleting 1.9 GB of data.

If you have enough battery backed cache space on a hardware RAID
controller card, and that cache is configured in write-back mode,
many of these writes might be combined -- the original delete, the
hint bit write, and the vacuum might all combine into one physical
write to disk.  

They are combined all right: I see 170-200 MB/s IO spikes on the iotop screen, which means writes to the cache -- the disk itself is capable of only 110(ic)-160(oc) MB/s with sequential 1 MB block-size writes.

What does your disk system look like, exactly?

As I wrote before,  it's actually a single 15K rpm mirrored pair that you can look at as a single disk for performance purposes.  It is connected through a PERC6i controller to a Dell 2950.

The disk subsystem is not really important here. What is really interesting is why so much IO is generated during the deletion process.

-Kevin

Re: Deleting bytea, autovacuum, and 8.2/8.4 differences

From
Tom Lane
Date:
Greg Smith <greg@2ndquadrant.com> writes:
> VJK wrote:
>> Since Pg does not use the concept of rollback segments,  it is unclear
>> why deletion produces so much disk IO (4GB).

> With PostgreSQL's write-ahead log, MVCC and related commit log, and
> transactional DDL features, there's actually even more overhead that can
> be involved than a simple rollback segment design when you delete things:

For an example like this one, you have to keep in mind that the
toast-table rows for the large bytea value have to be marked deleted,
too.  Also, since I/O happens in units of pages, the I/O volume to
delete a tuple is just as much as the I/O to create it.  (The WAL
entry for deletion might be smaller, but that's all.)  So it is entirely
unsurprising that "DELETE FROM foo" is about as expensive as filling the
table initially.
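
One way to see how much of the table actually lives in its toast
relation (sketch, using the x2 table from the earlier test):

SELECT pg_size_pretty(pg_relation_size('x2'))       AS heap_only,
       pg_size_pretty(pg_total_relation_size('x2')) AS total_incl_toast;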

If deleting a whole table is significant for you performance-wise,
you might look into using TRUNCATE instead.
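
For the test table above that would simply be

TRUNCATE TABLE x2;

which throws away the heap and toast files wholesale instead of
deleting row by row -- though, unlike DELETE, it cannot be restricted
with a WHERE clause.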

            regards, tom lane

Re: Deleting bytea, autovacuum, and 8.2/8.4 differences

From
Tom Lane
Date:
Matthew Wakeling <matthew@flymine.org> writes:
> On Mon, 15 Mar 2010, Tom Lane wrote:
>> If deleting a whole table is significant for you performance-wise,
>> you might look into using TRUNCATE instead.

> Might you still end up with a normal delete operation
> on the TOAST table when performing a TRUNCATE on the owner table?

No, you get a TRUNCATE on its toast table too.

            regards, tom lane

Re: Deleting bytea, autovacuum, and 8.2/8.4 differences

From
VJK
Date:
Inline:

On Mon, Mar 15, 2010 at 10:42 AM, Greg Smith <greg@2ndquadrant.com> wrote:
VJK wrote:
Since Pg does not use the concept of rollback segments,  it is unclear why deletion produces so much disk IO (4GB).

With PostgreSQL's write-ahead log, MVCC and related commit log, and transactional DDL features, there's actually even more overhead that can be involved than a simple rollback segment design when you delete things:

There does not appear to be much WAL activity.  Here's the insertion of 100 rows as seen by iotop:
   4.39 G  0.00 %  9.78 % postgres: writer process          
   5.34 G  0.00 %  5.93 % postgres: postgr~0.5.93(1212) idle
  27.84 M  0.00 %  1.77 % postgres: wal writer process      
 144.00 K  0.00 %  0.00 % postgres: stats collector process 
   0.00 B  0.00 %  0.00 % postgres: autova~ launcher process
   0.00 B  0.00 %  0.00 % postgres: postgr~0.5.93(4632) idle
                                                            


.. and the deletion:
288.18 M  0.00 % 37.80 % postgres: writer process          
  3.41 G  0.00 % 19.76 % postgres: postgr~0.5.93(1212) DELETE
 27.27 M  0.00 %  3.18 % postgres: wal writer process      
 72.00 K  0.00 %  0.03 % postgres: stats collector process 
  0.00 B  0.00 %  0.00 % postgres: autova~ launcher process
  0.00 B  0.00 %  0.00 % postgres: postgr~0.5.93(4632) idle

So the original 1.9 GB of useful data generates about 10 GB of IO, 5 GB of which ends up being written to the disk. The deletion generates about 3.8 GB of IO, all of which results in disk IO. WAL activity is about 27 MB in both cases.
 

http://www.postgresql.org/docs/current/static/wal.html
http://www.postgresql.org/docs/current/static/mvcc-intro.html
http://wiki.postgresql.org/wiki/Hint_Bits
http://wiki.postgresql.org/wiki/Transactional_DDL_in_PostgreSQL:_A_Competitive_Analysis


I read all of the above, but it does not really clarify why deletion generates so much IO.
 
One fun thing to try here is to increase shared_buffers and checkpoint_segments, then see if the total number of writes goes down.  The defaults for both are really low, which makes buffer page writes that might otherwise get combined as local memory changes instead get pushed constantly to disk.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us


Re: Deleting bytea, autovacuum, and 8.2/8.4 differences

From
Matthew Wakeling
Date:
On Mon, 15 Mar 2010, Tom Lane wrote:
> For an example like this one, you have to keep in mind that the
> toast-table rows for the large bytea value have to be marked deleted,
> too.  Also, since I/O happens in units of pages, the I/O volume to
> delete a tuple is just as much as the I/O to create it.  (The WAL
> entry for deletion might be smaller, but that's all.)  So it is entirely
> unsurprising that "DELETE FROM foo" is about as expensive as filling the
> table initially.
>
> If deleting a whole table is significant for you performance-wise,
> you might look into using TRUNCATE instead.

What are the implications of using TRUNCATE on a table that has TOASTed
data? Is TOAST all stored in one single table, or is it split up by owner
table/column name? Might you still end up with a normal delete operation
on the TOAST table when performing a TRUNCATE on the owner table?

Matthew

--
sed -e '/^[when][coders]/!d;/^...[discover].$/d;/^..[real].[code]$/!d
' <`locate dict/words`

Re: Deleting bytea, autovacuum, and 8.2/8.4 differences

From
VJK
Date:


On Mon, Mar 15, 2010 at 10:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Greg Smith <greg@2ndquadrant.com> writes:
> VJK wrote:
>> Since Pg does not use the concept of rollback segments,  it is unclear
>> why deletion produces so much disk IO (4GB).

For an example like this one, you have to keep in mind that the
toast-table rows for the large bytea value have to be marked deleted,
too.  Also, since I/O happens in units of pages, the I/O volume to
delete a tuple is just as much as the I/O to create it.

That makes sense.
 
                       regards, tom lane