Thread: random observations while testing with a 1,8B row table

random observations while testing with a 1,8B row table

From: Stefan Kaltenbrunner
Hi all!

During my testing of large work_mem and maintenance_work_mem settings
with respect to CREATE INDEX and sorting, I encountered a number of things
while doing various operations on such a large table (about 106GB on disk
with no dead tuples).
I will summarize some of them here, just in case somebody is interested:

-> the table used has 5 integer columns, non-indexed during the loads
-> hardware is a Dual Opteron 280 with 4 cores @ 2.4GHz and 16GB RAM; data
is on a multipathed (busy) SAN with different (RAID 10) arrays for WAL
and data.


1. data loading - I'm using COPY with batches of 300M rows:

*) with one COPY running it takes about 20 minutes/batch to load the data
(~250k rows/sec) and virtually no context switches.

*) with two COPYs running concurrently it takes a bit less than 30
minutes/batch and a steady 40k context switches/sec (~340k rows/sec overall)

*) with three COPYs it takes about 40min/batch at 140k context
switches/sec (380k rows/sec overall)
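
for reference, each batch load is really just a plain COPY along these lines
(the file path and the column names other than "a" are made up here):

    -- one 300M row batch, tab-separated text file on the server
    COPY testtable (a, b, c, d, e) FROM '/data/batch_01.tsv';

one file per batch, each in its own transaction.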

the profiles for those runs look very similar to:

samples  %        symbol name
5065118  20.9607  XLogInsert
3496868  14.4709  DoCopy
2807313  11.6174  CopyReadLine
1373621   5.6844  PageAddItem
1227069   5.0779  heap_formtuple
1193319   4.9383  LWLockAcquire
894243    3.7006  hash_search
717427    2.9689  LWLockRelease
699359    2.8941  pg_atoi
691385    2.8611  FunctionCall3
640383    2.6501  heap_insert
579331    2.3974  int4in
411286    1.7020  AllocSetReset
376452    1.5579  hash_any
349220    1.4452  RelationGetBufferForTuple
261568    1.0824  AllocSetAlloc
257511    1.0656  ReadBuffer

while there is quite a lot of IO going on, it looks like we are
still mostly CPU-bound for COPY.

2. updating all of the rows in the table:

I updated all of the rows in the table with a simple UPDATE testtable
set a=a+1;
this took about 2.5 hours (~200k rows/sec)
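
worth remembering when looking at this: the UPDATE writes a complete new
version of every single row (plus the WAL for it), so the heap roughly
doubles in size. A quick way to see the damage afterwards (just a sketch,
assuming the 8.1 dbsize functions):

    SELECT pg_relation_size('testtable');   -- heap size in bytes, roughly doubled now
    VACUUM VERBOSE testtable;               -- reports the dead row versions left behind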

with a profile looking like:
samples  %        symbol name
27860285 26.5844  XLogInsert
4828077   4.6070  PageAddItem
4490535   4.2849  heap_update
4267647   4.0722  slot_deform_tuple
3996750   3.8137  LWLockAcquire
3716184   3.5460  slot_getattr
3454679   3.2965  hash_search
2998742   2.8614  hash_any
2909261   2.7760  heap_fill_tuple
2825256   2.6959  LWLockRelease
2283086   2.1785  LockBuffer
2135048   2.0373  ExecTargetList
1636017   1.5611  ExecEvalVar
1632377   1.5576  UnpinBuffer
1566087   1.4944  RelationGetBufferForTuple
1561378   1.4899  ExecMakeFunctionResultNoSets
1511366   1.4421  ReadBuffer
1381614   1.3183  heap_compute_data_size



3. vacuuming this table - it turned out that VACUUM FULL is completely
unusable on a table of this size (which I actually expected beforehand), not
only due to the locking involved but rather due to a gigantic memory
requirement and unbelievable slowness.

It seems that the heap-scan part of VACUUM FULL completed after about 2
hours, ending up with a postmaster having a resident size of about
8.5GB(!!!) with maintenance_work_mem set to 1GB.

profile for this stage looks like:

samples  %        symbol name
941058   26.0131  scan_heap
444435   12.2852  HeapTupleSatisfiesVacuum
242117    6.6927  TransactionIdIsInProgress
220044    6.0825  _mdfd_getseg
212571    5.8760  hash_search
186963    5.1681  TransactionIdPrecedes
176016    4.8655  SetBufferCommitInfoNeedsSave
137668    3.8055  TransactionIdDidCommit
137068    3.7889  PageRepairFragmentation
111474    3.0814  TransactionLogFetch
103814    2.8697  LWLockAcquire
102925    2.8451  LWLockRelease
102456    2.8321  hash_any
67199     1.8575  BufferAlloc

after that the postmaster started slowly consuming more and more memory,
doing virtually no IO and eating CPU like mad, with a profile similar to:

samples  %        symbol name
2708391248 94.1869  repair_frag
155395833  5.4040  enough_space
5707137   0.1985  XLogInsert
1410703   0.0491  PageAddItem
691616    0.0241  BgBufferSync

I actually ended up canceling the VACUUM FULL after about 50 hours of
runtime, with a resident size of ~11.5GB.
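
for completeness, what I ran was essentially the following (the VERBOSE is
added here just for illustration; maintenance_work_mem is given in kB):

    SET maintenance_work_mem = 1048576;   -- 1GB
    VACUUM FULL VERBOSE testtable;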


Stefan


Re: random observations while testing with a 1,8B row

From: "Luke Lonergan"
Stefan,

On 3/10/06 9:40 AM, "Stefan Kaltenbrunner" <stefan@kaltenbrunner.cc> wrote:

> I will summarize some of them here, just in case somebody is interested:

I am!

> -> the table used has 5 integer columns, non-indexed during the loads
> -> hardware is a Dual Opteron 280 with 4 cores @ 2.4GHz and 16GB RAM; data
> is on a multipathed (busy) SAN with different (RAID 10) arrays for WAL
> and data.

How many connections out of the machine?  How many disks behind each LUN?

So - about 20 bytes per row (5*4) unless those are int8, but on disk it's
108GB/1.8B = 60 bytes per row.  I wonder what all that overhead is?
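
Most of that is presumably the per-tuple header (a bit under 30 bytes before
alignment on current releases), the 4-byte line pointer per tuple, plus
per-page header and slack.  The real figure is easy to check - a sketch,
with the table name taken from your mail and assuming the 8.1
pg_relation_size() function is available:

    -- run ANALYZE first, reltuples is only an estimate
    SELECT pg_relation_size('testtable') / reltuples AS bytes_per_row
      FROM pg_class
     WHERE relname = 'testtable';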

> 1. data loading - I'm using COPY with batches of 300M rows:
> 
> *) with one COPY running it takes about 20 minutes/batch to load the data
> (~250k rows/sec) and virtually no context switches.
> 
> *) with two COPYs running concurrently it takes a bit less than 30
> minutes/batch and a steady 40k context switches/sec (~340k rows/sec overall)
> 
> *) with three COPYs it takes about 40min/batch at 140k context
> switches/sec (380k rows/sec overall)

So, from 15 MB/s up to about 20 MB/s.

> while there is quite a lot of IO going on, it looks like we are
> still mostly CPU-bound for COPY.

It's what we see almost always.  In this case, if your I/O configuration is
capable of performing at about 3x the 20MB/s max parsing rate, or 60MB/s,
you will be CPU limited.

The 3x is approximate and based on observations; the reasoning underneath
it is that Postgres writes the data several times: once to the WAL, then
from the WAL to the heap files.
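
To put rough numbers on that (back-of-the-envelope, using the ~60 bytes/row
on-disk figure from above): 250k rows/sec is only about 15 MB/s of heap data,
so with the WAL write plus the heap write the I/O system only needs to
sustain somewhere around 45-60 MB/s before the CPUs become the bottleneck.
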
> 2. updating all of the rows in the table:
> 
> I updated all of the rows in the table with a simple UPDATE testtable
> set a=a+1;
> this took about 2.5 hours (~200k rows/sec)

Ugh.  This is where Bizgres MPP shines, I'll try to recreate your test and
post results.  This scales linearly in Bizgres MPP with the number of disks
and CPUs available, but I would hope for much more than that.

> 3. vacuuming this table - it turned out that VACUUM FULL is completely
> unusable on a table of this size (which I actually expected beforehand), not
> only due to the locking involved but rather due to a gigantic memory
> requirement and unbelievable slowness.

Simple vacuum should be enough IMO.
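
(meaning just the plain lazy form, roughly:

    VACUUM VERBOSE testtable;

it won't shrink the files much, but it records the reclaimable space in the
FSM - assuming max_fsm_pages is set high enough to remember it all - so
later inserts and updates reuse it.)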

- Luke




Re: random observations while testing with a 1,8B row table

From: Stefan Kaltenbrunner
Luke Lonergan wrote:
> Stefan,
> 
> On 3/10/06 9:40 AM, "Stefan Kaltenbrunner" <stefan@kaltenbrunner.cc> wrote:
> 
> 
>>I will summarize some of them here, just in case somebody is interested:
> 
> 
> I am!

heh - not surprised :-)

> 
> 
>>-> the table used has 5 integer columns, non-indexed during the loads
>>-> hardware is a Dual Opteron 280 with 4 cores @ 2.4GHz and 16GB RAM; data
>>is on a multipathed (busy) SAN with different (RAID 10) arrays for WAL
>>and data.
> 
> 
> How many connections out of the machine?  How many disks behind each LUN?

2 HBAs in the server, 2x2 possible paths to each LUN.
6 disks for the WAL and 12 disks for the data


> 
> So - about 20 bytes per row (5*4) unless those are int8, but on disk it's
> 108GB/1.8B = 60 bytes per row.  I wonder what all that overhead is?
> 
> 
>>1. data loading - I'm using COPY with batches of 300M rows:
>>
>>*) with one COPY running it takes about 20 minutes/batch to load the data
>>(~250k rows/sec) and virtually no context switches.
>>
>>*) with two COPYs running concurrently it takes a bit less than 30
>>minutes/batch and a steady 40k context switches/sec (~340k rows/sec overall)
>>
>>*) with three COPYs it takes about 40min/batch at 140k context
>>switches/sec (380k rows/sec overall)
> 
> 
> So, from 15 MB/s up to about 20 MB/s.
>  
> 
> 
>>while there is quite a lot of IO going on, it looks like we are
>>still mostly CPU-bound for COPY.
> 
> 
> It's what we see almost always.  In this case if your I/O configuration is
> capable of performing at about 3x the 20MB/s max parsing rate, or 60MB/s,
> you will be CPU limited.

the IO-System I use should be capable of doing that if pushed hard
enough :-)

> 
> The 3x is approximate and based on observations; the reasoning underneath
> it is that Postgres writes the data several times: once to the WAL, then
> from the WAL to the heap files.
>  
> 
>>2. updating all of the rows in the table:
>>
>>I updated all of the rows in the table with a simple UPDATE testtable
>>set a=a+1;
>>this took about 2.5 hours (~200k rows/sec)
> 
> 
> Ugh.  This is where Bizgres MPP shines, I'll try to recreate your test and
> post results.  This scales linearly in Bizgres MPP with the number of disks
> and CPUs available, but I would hope for much more than that.

interesting to know, but still I'm testing/playing with postgresql here
not bizgres MPP ...

> 
> 
>>3. vacuuming this table - it turned out that VACUUM FULL is completely
>>unusable on a table of this size (which I actually expected beforehand), not
>>only due to the locking involved but rather due to a gigantic memory
>>requirement and unbelievable slowness.
> 
> 
> Simple vacuum should be enough IMO.

sure, that was mostly meant as an experiment; if I had to do this on a
production database I would most likely use CLUSTER to get the desired
effect (which in my case was purely getting back the disk space wasted by
dead tuples)
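
ie. roughly this (the index name is made up - this test table has no index
yet, so one would have to be created first):

    CREATE INDEX testtable_a_idx ON testtable (a);
    CLUSTER testtable_a_idx ON testtable;   -- rewrites the table without the dead tuples

with the caveat that it needs about twice the table's size in free disk
space while it runs.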



Stefan


Re: random observations while testing with a 1,8B row table

From: Tom Lane
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>> 3. vacuuming this table - it turned out that VACUUM FULL is completely
>>> unusable on a table of this size (which I actually expected beforehand), not
>>> only due to the locking involved but rather due to a gigantic memory
>>> requirement and unbelievable slowness.

> sure, that was mostly meant as an experiment; if I had to do this on a
> production database I would most likely use CLUSTER to get the desired
> effect (which in my case was purely getting back the disk space wasted by
> dead tuples)

Yeah, the VACUUM FULL algorithm is really designed for situations where
just a fraction of the rows have to be moved to re-compact the table.
It might be interesting to teach it to abandon that plan and go to a
CLUSTER-like table rewrite once the percentage of dead space is seen to
reach some suitable level.  CLUSTER has its own disadvantages though
(2X peak disk space usage, doesn't work on core catalogs, etc).
        regards, tom lane


Re: random observations while testing with a 1,8B row

From: "Luke Lonergan"
Stefan,

On 3/10/06 11:48 AM, "Stefan Kaltenbrunner" <stefan@kaltenbrunner.cc> wrote:

> 2 HBAs in the server, 2x2 possible paths to each LUN.
> 6 disks for the WAL and 12 disks for the data

So - you have 18 disks worth of potential bandwidth, not factoring loss due
to RAID.  That's roughly 18 * 60 = 1,080 MB/s.  If we organized that into
four banks, one for each CPU and made each one RAID5 and left two disks for
spares, you'd have 12 disks working for you at 720MB/s, which is possibly
double the number of active FC channels you have, unless they are all
active, in which case you have a nicely matched 800MB/s of FC.

>> So, from 15 MB/s up to about 20 MB/s.

Gee - seems a long distance from 700 MB/s potential :-)

> the IO-System I use should be capable of doing that if pushed hard
> enough :-)

I would expect some 10x this if configured well.

> interesting to know, but still I'm testing/playing with postgresql here
> not bizgres MPP ...

Sure.  Still, what I'd expect is something like 10x this update rate using
the parallelism buried in your hardware.

If you configure the same machine with 4 Bizgres MPP segments running on 4
LUNs I think you'd be shocked at the speedups.

- Luke




Re: random observations while testing with a 1,8B row table

From: Steve Atkins
On Mar 10, 2006, at 11:54 AM, Tom Lane wrote:

> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
>>>> 3. vacuuming this table - it turned out that VACUUM FULL is completely
>>>> unusable on a table of this size (which I actually expected beforehand), not
>>>> only due to the locking involved but rather due to a gigantic memory
>>>> requirement and unbelievable slowness.
>
>> sure, that was mostly meant as an experiment; if I had to do this on a
>> production database I would most likely use CLUSTER to get the desired
>> effect (which in my case was purely getting back the disk space wasted by
>> dead tuples)
>
> Yeah, the VACUUM FULL algorithm is really designed for situations where
> just a fraction of the rows have to be moved to re-compact the table.
> It might be interesting to teach it to abandon that plan and go to a
> CLUSTER-like table rewrite once the percentage of dead space is seen to
> reach some suitable level.  CLUSTER has its own disadvantages though
> (2X peak disk space usage, doesn't work on core catalogs, etc).

I get bitten by this quite often (customer machines, one giant table,
purge out a lot of old data).

CLUSTER is great for that, given the headroom, though I've often
resorted to a dump and restore because I've not had the headroom
for CLUSTER, and it's a lot less downtime than a full vacuum.

While the right fix there is to redo the application engine side to use
table partitioning, I keep wondering whether it would be possible
to move rows near the end of the table to the beginning in one,
non-locking phase (vacuum to populate the FSM with free space near the
beginning of the table, touch rows starting at the end of the table,
repeat) and then finish off with a vacuum full to tidy up the remainder
and truncate the files (or a simpler "lock the table and truncate
anything unused at the end").

Cheers,  Steve





Re: random observations while testing with a 1,8B row table

From: Stefan Kaltenbrunner
Luke Lonergan wrote:
> Stefan,
> 
> On 3/10/06 11:48 AM, "Stefan Kaltenbrunner" <stefan@kaltenbrunner.cc> wrote:
> 
> 
>>2 HBAs in the server, 2x2 possible paths to each LUN.
>>6 disks for the WAL and 12 disks for the data
> 
> 
> So - you have 18 disks worth of potential bandwidth, not factoring loss due
> to RAID.  That's roughly 18 * 60 = 1,080 MB/s.  If we organized that into
> four banks, one for each CPU and made each one RAID5 and left two disks for
> spares, you'd have 12 disks working for you at 720MB/s, which is possibly
> double the number of active FC channels you have, unless they are all
> active, in which case you have a nicely matched 800MB/s of FC.

wrong (or rather extremely optimistic) - the array itself only has two
(redundant) FC loops (@2Gbit) to the attached expansion chassis. The array
has 2 active/active controllers (with a failover penalty) with two host
interfaces each; furthermore it has write-cache mirroring (to the standby
controller) enabled, which means the traffic has to go over the internal
FC loop too.
Besides that, the host (as I said) itself only has two HBAs @2Gbit which are
configured for failover, which limits the host's maximum available
bandwidth to less than 200MB/s per LUN.

> 
> 
>>>So, from 15 MB/s up to about 20 MB/s.
> 
> 
> Gee - seems a long distance from 700 MB/s potential :-)

well, the array is capable of about 110MB/s write per controller head (a
bit more than half of what would otherwise be possible, due to write-cache
mirroring being enabled, which uses delta-synchronisation).
WAL and data are on different controllers by default though.

> 
> 
>>the IO-System I use should be capable of doing that if pushed hard
>>enough :-)
> 
> 
> I would expect some 10x this if configured well.

see above ...

> 
> 
>>interesting to know, but still I'm testing/playing with postgresql here
>>not bizgres MPP ...
> 
> 
> Sure.  Still, what I'd expect is something like 10x this update rate using
> the parallelism buried in your hardware.
> 
> If you configure the same machine with 4 Bizgres MPP segments running on 4
> LUNs I think you'd be shocked at the speedups.

that might be true, but - though it might sound a bit harsh - I really
prefer to spend the small amount of spare time I have on testing (and
helping to improve, if possible) postgresql rather than playing with a
piece of commercial software I'm not going to use anyway ...


Stefan


Re: random observations while testing with a 1,8B row

From: "Luke Lonergan"
Stefan,

On 3/10/06 12:23 PM, "Stefan Kaltenbrunner" <stefan@kaltenbrunner.cc> wrote:

> wrong (or rather extremely optimistic) - the array itself only has two
> (redundant) FC loops (@2Gbit) to the attached expansion chassis. The array
> has 2 active/active controllers (with a failover penalty) with two host
> interfaces each; furthermore it has write-cache mirroring (to the standby
> controller) enabled, which means the traffic has to go over the internal
> FC loop too.
> Besides that, the host (as I said) itself only has two HBAs @2Gbit which are
> configured for failover, which limits the host's maximum available
> bandwidth to less than 200MB/s per LUN.

Wow - the ickiness of SANs from a performance / value standpoint never ceases
to astound me.

>> Gee - seems a long distance from 700 MB/s potential :-)
> 
> well, the array is capable of about 110MB/s write per controller head (a
> bit more than half of what would otherwise be possible, due to write-cache
> mirroring being enabled, which uses delta-synchronisation).
> WAL and data are on different controllers by default though.

So - you're getting 20MB/s on loading from a potential of 200MB/s?

>> I would expect some 10x this if configured well.
> 
> see above ...

OTOH - configured well could include taking the disks out of the smart (?)
chassis, plugging them into a dumb chassis and deploying 2 dual-channel U320
SCSI adapters - total cost of about $3,000.

> that might be true, but - though it might sound a bit harsh - I really
> prefer to spend the small amount of spare time I have on testing (and
> helping to improve, if possible) postgresql rather than playing with a
> piece of commercial software I'm not going to use anyway ...

No problem - that's our job anyway - to make the case for Postgres' use in
typical large scale use-cases like the one you describe.

- Luke




Re: random observations while testing with a 1,8B row table

From: Stefan Kaltenbrunner
Luke Lonergan wrote:
> Stefan,
> 
> On 3/10/06 12:23 PM, "Stefan Kaltenbrunner" <stefan@kaltenbrunner.cc> wrote:
> 
> 
>>wrong (or rather extremely optimistic) - the array itself only has two
>>(redundant) FC loops (@2Gbit) to the attached expansion chassis. The array
>>has 2 active/active controllers (with a failover penalty) with two host
>>interfaces each; furthermore it has write-cache mirroring (to the standby
>>controller) enabled, which means the traffic has to go over the internal
>>FC loop too.
>>Besides that, the host (as I said) itself only has two HBAs @2Gbit which are
>>configured for failover, which limits the host's maximum available
>>bandwidth to less than 200MB/s per LUN.
> 
> 
> Wow - the ickiness of SANs from a performance / value standpoint never ceases
> to astound me.

Well, while you make it sound a bit like that, performance is not everything.
One has to factor in manageability, scalability (in terms of future upgrades
using the same platform and such) and high-availability features too.
With that in mind a SAN (or a NAS - depends on the actual use cases)
suddenly looks much more interesting than plain old DASD.

> 
> 
>>>Gee - seems a long distance from 700 MB/s potential :-)
>>
>>well, the array is capable of about 110MB/s write per controller head (a
>>bit more than half of what would otherwise be possible, due to write-cache
>>mirroring being enabled, which uses delta-synchronisation).
>>WAL and data are on different controllers by default though.
> 
> 
> So - you're getting 20MB/s on loading from a potential of 200MB/s?

no - I can write 110MB/s on the WAL LUN and 110MB/s on the other LUN
concurrently.

> 
> 
>>>I would expect some 10x this if configured well.
>>
>>see above ...
> 
> 
> OTOH - configured well could include taking the disks out of the smart (?)
> chassis, plugging them into a dumb chassis and deploying 2 dual channel U320
> SCSI adapters - total cost of about $3,000.

as I said above, even if that would work (it does not, because the disks
have FC connectors) I would lose a LOT of features, like being able to
use the SAN for more than a single host (big one!), doing
firmware upgrades without downtime, SAN replication, cable lengths
exceeding 12m (which makes it possible to place parts of the
infrastructure at remote sites), out-of-band management, scriptability(!), ...

Besides that, sequential IO, which you keep propagating everywhere, is NOT
the holy grail or the sole solution to a fast database.
While the SAN above really is not a screamer for that kind of
application, it is actually a very good performer (compared with some of
the DASD-based boxes) under heavy random IO and concurrent load.
This has a directly measurable influence on the overall speed of our
production applications, which are mostly OLTP ;-)

>  
> 
>>that might be true, but - though it might sound a bit harsh - I really
>>prefer to spend the small amount of spare time I have on testing (and
>>helping to improve, if possible) postgresql rather than playing with a
>>piece of commercial software I'm not going to use anyway ...
> 
> 
> No problem - that's our job anyway - to make the case for Postgres' use in
> typical large scale use-cases like the one you describe.

yep


Stefan


Re: random observations while testing with a 1,8B row

From: "Luke Lonergan"
Stefan,

On 3/11/06 12:21 AM, "Stefan Kaltenbrunner" <stefan@kaltenbrunner.cc> wrote:

>> So - you're getting 20MB/s on loading from a potential of 200MB/s?
> 
> no - I can write 110MB/s on the WAL LUN and 110MB/s on the other LUN
> concurrently.

The numbers you published earlier show you are getting a maximum of 20MB/s
on data loading.  It's CPU limited by Postgres COPY.

> Besides that, sequential IO, which you keep propagating everywhere, is NOT
> the holy grail or the sole solution to a fast database.
> While the SAN above really is not a screamer for that kind of
> application, it is actually a very good performer (compared with some of
> the DASD-based boxes) under heavy random IO and concurrent load.
> This has a directly measurable influence on the overall speed of our
> production applications, which are mostly OLTP ;-)

The same DASD can be configured with RAID10 and will far surpass the
external FC SAN configuration you describe at the same price.

The DASD story is not all about sequential I/O.  The main limitation is the
number of devices you can make available using DASD, and that's a Postgres
limitation that we solve.

For OLTP apps, you may be fast enough with the limited bandwidth of your FC
SAN, but it would still be faster with hundreds of channels and disks.

- Luke




Re: random observations while testing with a 1,8B row table

From: Stefan Kaltenbrunner
Tom Lane wrote:
> Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> 
>>>>3. vacuuming this table - it turned out that VACUUM FULL is completely
>>>>unusable on a table of this size (which I actually expected beforehand), not
>>>>only due to the locking involved but rather due to a gigantic memory
>>>>requirement and unbelievable slowness.
> 
> 
>>sure, that was mostly meant as an experiment; if I had to do this on a
>>production database I would most likely use CLUSTER to get the desired
>>effect (which in my case was purely getting back the disk space wasted by
>>dead tuples)
> 
> 
> Yeah, the VACUUM FULL algorithm is really designed for situations where
> just a fraction of the rows have to be moved to re-compact the table.
> It might be interesting to teach it to abandon that plan and go to a
> CLUSTER-like table rewrite once the percentage of dead space is seen to
> reach some suitable level.  CLUSTER has its own disadvantages though
> (2X peak disk space usage, doesn't work on core catalogs, etc).

hmm, very interesting idea - I for myself like it, but from what I have
seen people quite often use VACUUM FULL to get their disk usage down
_because_ they are running low on space (and because it's not that well
known that CLUSTER can be much faster) - maybe we should add a
note/hint about this to the maintenance/vacuum docs at least?


Stefan


Re: random observations while testing with a 1,8B row

From
Hannu Krosing
Date:
On Fri, 2006-03-10 at 12:23, Steve Atkins wrote:

> I get bitten by this quite often (customer machines, one giant table,
> purge out a lot of old data).
> 
> CLUSTER is great for that, given the headroom, though I've often
> resorted to a dump and restore because I've not had the headroom
> for cluster, and it's a lot less downtime than a full vacuum.
> 
> While the right fix there is to redo the application engine side to use
> table partitioning, I keep wondering whether it would be possible
> to move rows near the end of the table to the beginning in one,
> non-locking phase (vacuum to populate the FSM with free space near the
> beginning of the table, touch rows starting at the end of the table,
> repeat) and then finish off with a vacuum full to tidy up the remainder
> and truncate the files (or a simpler "lock the table and truncate
> anything unused at the end").

At some point I had to compress a very busily updated table. I used the
following approach:

1) VACUUM buzytable; (lazy not full)

2) SELECT primary_key_value, ctid FROM buzytable;

3) Extract the N last records from there and for each keep repeating

  3A) UPDATE buzytable
        SET primary_key_value = primary_key_value
      WHERE primary_key_value = extracted_value

  3B) SELECT ctid FROM buzytable
       WHERE primary_key_value = extracted_value

  until the tuple is moved to another page, hopefully nearer to
  the beginning of the table

repeat from 1) until the page of the last row (extracted from its ctid) is
smaller than some threshold.

This was the only way I was able to get a table back down to a small
enough size without a service interruption.
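
In other words, something along these lines for each tuple still sitting
near the end of the table (the key value 12345 is of course just a
placeholder):

    -- force a new row version; with luck the FSM hands out a page near the
    -- start of the table that the preceding lazy VACUUM marked as free
    UPDATE buzytable
       SET primary_key_value = primary_key_value
     WHERE primary_key_value = 12345;

    -- check where the new version landed (the block number is the first
    -- part of the ctid); repeat the UPDATE if it is still too far out
    SELECT ctid FROM buzytable WHERE primary_key_value = 12345;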

--------------
Hannu




Re: random observations while testing with a 1,8B row

From: Tom Lane
Hannu Krosing <hannu@skype.net> writes:
> At some point I had to compress a very busily updated table. I used the
> following approach:
> [ move a few rows at a time ]

We could possibly do something similar with VACUUM FULL as well: once
maintenance_work_mem is filled, start discarding per-page data to stay
under the memory limit.  This would mean that some area near the middle
of the table remains uncompacted, but it would allow putting an upper
bound on the time and space used by any one pass of VACUUM FULL ...
        regards, tom lane


Re: random observations while testing with a 1,8B row

From: "Jim C. Nasby"
On Sat, Mar 11, 2006 at 10:21:43PM +0200, Hannu Krosing wrote:
> > table partitioning, I keep wondering whether it would be possible
> > to move rows near the end of the table to the beginning in one,
> > non-locking phase (vacuum to populate the FSM with free space near the
> > beginning of the table, touch rows starting at the end of the table,
> > repeat) and then finish off with a vacuum full to tidy up the remainder
> > and truncate the files (or a simpler "lock the table and truncate
> > anything unused at the end").
> 
> At some point I had to compress a very busily updated table. I used the
> following approach:
> 
> 1) VACUUM buzytable; (lazy not full)
> 
> 2) SELECT primary_key_value, ctid FROM buzytable;
> 
> 3) Extract N last records from there and for each keep repeating
> 
>   3A) UPDATE buzytable 
>         SET primary_key_value = primary_key_value
>       WHERE primary_key_value = extracted_value
>   
>   3B) SELECT ctid FROM buzytable 
>        WHERE primary_key_value = extracted_value
> 
>   until the tuple is moved to another page, hopefully nearer to
>   the beginning of the table
> 
> repeat from 1) until the page of the last row (extracted from its ctid) is
> smaller than some threshold.

BTW, this is what the following TODO would hopefully fix:

Allow FSM to return free space toward the beginning of the heap file, in
hopes that empty pages at the end can be truncated by VACUUM
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461