Thread: A long-running transaction

A long-running transaction

From
John Summerfield
Date:
I have a Java (Java 1.1) program that I wrote some years ago, to read 
records from a text file and insert them into a postgresql database.

One of the assumptions I made was that one file contained one day's 
data, maybe as many as 1500 records, and I coded it to do the whole lot 
as one transaction so either a single file was loaded in its entirety, 
or none of its data was.

I lost the Java code, but revived the idea and I've collected about two 
years' data using (Linux) shell scripts, and loading the data using psql.

Then, I found the Java code on a disused hard disk:-)

I made the necessary changes for it to build in java 1.5, and used psql 
to extract data from my new database in the correct format for the old 
program. This time, I have a little more data than I ever loaded at once 
before:
summer@Bandicoot:~$ wc -l testdata
6242217 testdata
summer@Bandicoot:~$ \ls -hl testdata
-rw-r--r-- 1 summer summer 285M 2007-03-28 22:32 testdata
summer@Bandicoot:~$

Now, it wouldn't surprise me if postgresql used lots of memory - but how 
much could it possibly need? My laptop, where I first tried this, has 
1.25 Gbytes, so I could allow it some.

It wouldn't surprise me a lot if it used lots of memory and caused all 
sorts of problems, but no, it's chugging away, still using no more RAM 
than it could have had on my old Pentium 133 all those years ago.

In the process of checking it out, I've set it running on a machine with 
an AMD Sempron(tm) 2400+ running Kubuntu 6.10 (kernel is 
2.6.17-6-server-xen0) and 512 Mbytes of RAM.

This is the java program:-)
summer   pts/6    :0.0             Thu20    5days  1:07   1:07 
/usr/bin/gij-4.1 -cp /usr/s
It's been running five days so far, and I can see where it's up to by 
attaching strace. It's reading 2k of the input file every few seconds.

Okay, clearly something's wrong, and I don't think it's all my cruddy code.
No probs swapping:
summer@Bandicoot:~$ free
             total       used       free     shared    buffers     cached
Mem:        460800     456472       4328          0        860     262164
-/+ buffers/cache:     193448     267352
Swap:      1461872        284    1461588
summer@Bandicoot:~$

It is hitting the disk pretty hard now on this machine, but the laptop's 
still going too, and the disk seems to run about half the time, part of 
a second running, part idle (but the intervals are getting shorter).

It struck me as fairly curious that neither postgresql nor the 
application was hogging the CPU.

Perhaps the laptop is more interesting: look at the size of the buffer pool:
summer@Echidna:~> free
             total       used       free     shared    buffers     cached
Mem:       1295528    1268548      26980          0       3976     392388
-/+ buffers/cache:     872184     423344
Swap:      1941496      32656    1908840
summer@Echidna:~>
Again, no problem with over-use of RAM, and I'm logged on using KDE too 
and that's running fine.

It's been running a little longer here:
summer   pts/23    28Mar07  5days 25:12  25:11  java -cp 
/home/summer/Classes/:/usr/share/p

This is Sun's Java 1.5 on OpenSUSE 10.2.


This is what suggested I should write:
summer@Echidna:~> procinfo
Linux 2.6.18.8-0.1-default (geeko@buildhost) (gcc 4.1.2 20061115) #1 
1CPU [Echidna.]

Memory:      Total        Used        Free      Shared     Buffers
Mem:       1295528     1271720       23808           0        3716
Swap:      1941496       32656     1908840

Bootup: Tue Mar 27 18:50:19 2007    Load average: 2.21 2.65 2.69 2/243 19305

user  :   1d  3:17:04.03  16.0%  page in :  131097310  disk 1: 
3079516r20087664w
nice  :       0:05:39.64   0.1%  page out:  197016649
system:   2d 20:38:37.13  40.1%  page act:   87906251
IOwait:   2d  0:46:37.33  28.5%  page dea:   16218135
hw irq:       0:44:46.71   0.4%  page flt:  306255213
sw irq:       0:50:04.69   0.5%  swap in :       4026
idle  :   1d  0:36:29.73  14.4%  swap out:       9552
uptime:   7d  2:59:20.97         context :  702502442

irq  0: 153880209 timer                 irq  7:         0 parport0
irq  1:     69402 i8042                 irq  8:         2 rtc
irq  2:         0 cascade [4]           irq  9:   1696942 acpi
irq  3:         4                       irq 10:         1
irq  4:         4                       irq 11:  71842329 ehci_hcd:usb1, 
uhci_
irq  5:  28545863 Intel 82801DB-ICH4    irq 12:    467432 i8042
irq  6:         1                       irq 14:  25021586 ide0

summer@Echidna:~>


Look at that line beginning "system:": two days and 20 hours in the Linux 
kernel. It's my guess that the Linux kernel is spending a great deal of 
time manipulating that buffer pool.

This shows postgresql taking 60% CPU:
summer@Echidna:~> ps xaru
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
summer   20237  0.2  1.2 263716 15988 pts/23   Rl+  Mar28  25:11 java 
-cp /home/summer/Class
postgres 19321  0.1  0.3  19844  3984 ?        D    21:50   0:00 
postgres: summer stocksshar
summer   19344  0.0  0.0   2484   852 pts/25   R+   21:52   0:00 ps xaru
summer@Echidna:~> ps xaru
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
postgres 20248 60.0  0.8  20760 11536 ?        D    Mar28 5167:27 
postgres: summer sharetrad
postgres 19321  0.1  0.3  19844  4008 ?        D    21:50   0:00 
postgres: summer stocksshar
summer   19348  0.0  0.0   2480   848 pts/25   R+   21:52   0:00 ps xaru
summer@Echidna:~>

Note that pid 20248 is the one in question; the other has nothing to do 
with the problem.

This is the SQL I used to create the table:
      String createTableStatement = "create table TJS0102_trades ("
                 + "   trade date"
                 + ",  ttime int2"
                 + ",  ASX varchar(7)"
                 + ",  thigh int4"
                 + ",  tlow int4"
                 + ",  tclose int4"
                 + ",  topen int4"
                 + ",  tvolume int4"
                 + ",  tvalue int4"
//                 + ")"
                 + ",  unique(ASX,trade,ttime)"
                 + ");";


For each record, I update a non-key field in another table; the source 
data for that other table is less than a megabyte.
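
In outline, the program does something like this. This is a
reconstruction, not the real code; the JDBC URL, the second table and
its column, and the scaling of the prices into int4 are all invented
for illustration:

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.Date;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class LoadTrades {
    public static void main(String[] args) throws Exception {
        Class.forName("org.postgresql.Driver");
        Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/sharetrading", "summer", "");
        con.setAutoCommit(false);          // the whole input file is one transaction

        PreparedStatement ins = con.prepareStatement(
                "insert into TJS0102_trades (trade, ASX, thigh, tlow, tclose, topen, tvolume)"
              + " values (?, ?, ?, ?, ?, ?, ?)");
        // The second table ("companies") and its latest_trade column are hypothetical.
        PreparedStatement upd = con.prepareStatement(
                "update companies set latest_trade = ? where ASX = ?");

        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
            // e.g. 2005-02-03,AAC,1.900,1.800,1.850,1.820,328984,0
            String[] f = line.split(",");
            Date trade = Date.valueOf(f[0]);
            ins.setDate(1, trade);
            ins.setString(2, f[1]);
            for (int i = 2; i <= 5; i++)   // prices assumed stored as thousandths
                ins.setInt(i + 1, (int) Math.round(Double.parseDouble(f[i]) * 1000));
            ins.setInt(7, Integer.parseInt(f[6]));
            ins.executeUpdate();

            upd.setDate(1, trade);         // bump the "latest trade" date for this code
            upd.setString(2, f[1]);
            upd.executeUpdate();
        }
        con.commit();                      // all or nothing
        con.close();
    }
}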


I have changed the program so as to load each day's data as a single 
transaction; it ran on a Pentium IV 2.8 GHz with HT and 512 Mbytes in about 
three and a half hours. This is more-or-less what I expected.
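
The change itself is small: schematically, it is just a helper that
commits whenever the date column in the input rolls over (again a
sketch, not the real code):

import java.sql.Connection;
import java.sql.SQLException;

// Commit whenever the trade date in the input changes, so each day's
// records form their own transaction. The insert and update statements
// themselves stay exactly as they were.
class DailyCommitter {
    private final Connection con;
    private String prevDate;

    DailyCommitter(Connection con) {
        this.con = con;
    }

    // Call once per input record, before inserting it.
    void recordSeen(String tradeDate) throws SQLException {
        if (prevDate != null && !prevDate.equals(tradeDate))
            con.commit();                  // close off the previous day's work
        prevDate = tradeDate;
    }

    // Call once after the last record.
    void finish() throws SQLException {
        con.commit();
    }
}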

If you agree with me that postgresql should do better, now is a good 
time to ask for this as a test case.

Note, you could also get suitable test data from float.com.au - it's not 
where mine came from, but it should be substantially the same.





Re: A long-running transaction

From
Andrew Sullivan
Date:
On Tue, Apr 03, 2007 at 10:16:13PM +0800, John Summerfield wrote:
> It is hitting the disk pretty hard now on this machine, but the laptop's 
> still going too, and the disk seems to run about half the time, part of 
> a second running, part idle (but the intervals are getting shorter).
> 
> It struck me as fairly curious that neither postgresql nor the 
> application was hogging the CPU.

Why?  Nothing about this seems likely CPU bound.  It's probably I/O. 
I note this number:

> IOwait:   2d  0:46:37.33  28.5%  page dea:   16218135

which is pretty awful.  Also

> For each record, I update a non-key field in another table; the source 
> data for that other table is less than a megabyte.

this is a real issue.  Basically, you're constrained at the rotation
speed of your disk, because for each record, you have to first find
then update one row somewhere else.  

A

-- 
Andrew Sullivan  | ajs@crankycanuck.ca
The plural of anecdote is not data.    --Roger Brinner


Re: A long-running transaction

From
John Summerfield
Date:
Andrew Sullivan wrote:
> On Tue, Apr 03, 2007 at 10:16:13PM +0800, John Summerfield wrote:
>> It is hitting the disk pretty hard now on this machine, but the laptop's 
>> still going too, and the disk seems to run about half the time, part of 
>> a second running, part idle (but the intervals are getting shorter).
>>
>> It struck me as fairly curious that neither postgresql nor the 
>> application was hogging the CPU.
> 
> Why?  Nothing about this seems likely CPU bound.  It's probably I/O. 
> I note this number:

For the first day or so, the disk light was switching off about half the
time.

> 
>> IOwait:   2d  0:46:37.33  28.5%  page dea:   16218135
> 
> which is pretty awful.  Also

It was some days in before it became so. Eventually, the machines (I
kept it running on the laptop, because of the amount of RAM, plus on another
machine) did begin to thrash, but that certainly wasn't the case at first.

Note that the proportion of system time is fairly unusual:
user  :   1d  3:17:04.03  16.0%  page in :  131097310  disk 1:
3079516r20087664w
nice  :       0:05:39.64   0.1%  page out:  197016649
system:   2d 20:38:37.13  40.1%  page act:   87906251

The Linux kernel's clearly doing a lot of work, and the disk supports
DMA and DMA is turned on, so it's not using a PIO mode. According to
hdparm, it's using udma5.

> 
>> For each record, I update a non-key field in another table; the source 
>> data for that other table is less than a megabyte.
> 
> this is a real issue.  Basically, you're constrained at the rotation
> speed of your disk, because for each record, you have to first find
> then update one row somewhere else.  

It should be in cache: it's all one transaction, and on the laptop,
there's over a Gbyte of RAM. Indeed, I would expect postgresql itself to
cache it (except the fact it uses so little RAM suggests it doesn't do
that at all).




> 
> A
> 




Re: A long-running transaction

From
Andrew Sullivan
Date:
On Wed, Apr 11, 2007 at 12:41:23AM +0800, John Summerfield wrote:

> The Linux kernel's clearly doing a lot of work, and the disk supports

You might also be into context-switch hell.  What processor, which
kernel, and which Postgres version again?

> >>For each record, I update a non-key field in another table; the source 
> >>data for that other table is less than a megabyte.
> >
> >this is a real issue.  Basically, you're constrained at the rotation
> >speed of your disk, because for each record, you have to first find
> >then update one row somewhere else.  
> 
> It should be in cache: it's all one transaction, and on the laptop,

It's not in cache if you're updating -- you have to write it.   

> there's over a Gbyte of RAM. Indeed, I would expect postgresql itself to
> cache it (except the fact it uses so little RAM suggests it doesn't do
> that at all).

What do you have configured as your shared buffers?  If you haven't
given very much, there won't be much in the way of buffers used, of
course.  Note that there's a much earlier diminishing return on the
size of shared buffers in Postgres than in many systems.

A

-- 
Andrew Sullivan  | ajs@crankycanuck.ca
The whole tendency of modern prose is away from concreteness.    --George Orwell


Re: A long-running transaction

From
John Summerfield
Date:
Andrew Sullivan wrote:
> On Wed, Apr 11, 2007 at 12:41:23AM +0800, John Summerfield wrote:
> 
>> The Linux kernel's clearly doing a lot of work, and the disk supports
> 
> You might also be into context-switch hell.  What processor, which
> kernel, and which Postgres version again?
on opensuse 10.2 (the laptop)
model name      : Intel(R) Pentium(R) M processor 1500MHz
2.6.18.8-0.1-xen
Might not have been xen-enabled, I switch fairly often
summer@Lyrebird:~> rpm -qa postg\*
postgresql-server-8.1.5-13
postgresql-libs-8.1.5-13
postgresql-devel-8.1.5-13
postgresql-jdbc-8.1-12.2
postgresql-contrib-8.1.5-13
postgresql-pl-8.1.5-15
postgresql-8.1.5-13
summer@Lyrebird:~>

Also:
model name      : AMD Sempron(tm)   2400+
2.6.17-6-server-xen0
summer@Bandicoot:~$ dpkg --list postg\* | grep ^ii
ii  postgresql-8.2           8.2.3-1~edgy1  object-relational SQL
database, version 8.2

> 
>>>> For each record, I update a non-key field in another table; the source 
>>>> data for that other table is less than a megabyte.
>>> this is a real issue.  Basically, you're constrained at the rotation
>>> speed of your disk, because for each record, you have to first find
>>> then update one row somewhere else.  
>> It should be in cache: it's all one transaction, and on the laptop,
> 
> It's not in cache if you're updating -- you have to write it.

Linux caches writes, I don't think it should be hitting disk at all. The
table being updated contains 7482 records (658K raw data), of which
probably fewer than 2000 are being updated, and typically the same ones
all the time: we're updating the date of the latest trade.

> 
>> there's over a Gbyte of RAM. Indeed, I would expect postgresql itself to
>> cache it (except the fact it uses so little RAM suggests it doesn't do
>> that at all).
> 
> What do you have configured as your shared buffers?  If you haven't
> given very much, there won't be much in the way of buffers used, of
> course.  Note that there's a much earlier diminishing return on the
> size of shared buffers in Postgres than in many systems.
Laptop (1.25 Gbytes)
shared_buffers = 1000                   # min 16 or max_connections*2,
8KB each

AMD (512 Mbytes less some for video)
shared_buffers = 24MB                   # min 128kB or max_connections*16kB

Note we're only running one connexion here, and data transfers (I
believe) are tens of bytes: here is a typical data record:
2005-02-03,AAC,1.900,1.800,1.850,1.820,328984,0


> 
> A
> 




Re: A long-running transaction

From
Andrew Sullivan
Date:
On Wed, Apr 11, 2007 at 05:54:45AM +0800, John Summerfield wrote:
> Linux caches writes, I don't think it should be hitting disk at all. The

I _sure hope_ you don't care about this data, then.  That's not a
real safe way to work.  But. . .

> table being updated contains records 7482 (658K raw data) of which
> probably fewer than 2000 are being updated, and typically the same ones
> all the time: we're updating the date of the latest trade.

. . . this is likely your problem.  The updates probably get slower
and slower.  What's happening is that you're expiring a row _for each
update_, which means it _isn't_ the same row every time.  This is
approximately the worst use model for PostgreSQL's MVCC approach. 
Worse, though it's not the same row, you have to grovel through all
the dead rows to find the actually live one.  So that's probably
what's killing you.

> Laptop (1.25 Gbytes)
> shared_buffers = 1000                   # min 16 or max_connections*2,
> 8KB each

so you have 8000 K configured as your shared buffers there.  That's
as much as you'll ever use for shared memory by Postgres.  You can
probably bump a little in this case.  Your other config seems ok to
me.  But I don't think this is your problem -- the update pattern is.  

A

-- 
Andrew Sullivan  | ajs@crankycanuck.ca
If they don't do anything, we don't need their acronym.    --Josh Hamilton, on the US FEMA


Re: A long-running transaction

From
John Summerfield
Date:
Andrew Sullivan wrote:
> On Wed, Apr 11, 2007 at 05:54:45AM +0800, John Summerfield wrote:
>> Linux caches writes, I don't think it should be hitting disk at all. The
> 
> I _sure hope_ you don't care about this data, then.  That's not a
> real safe way to work.  But. . .

As I said in the first place, this is all one transaction. No doubt that
contributes to the problem.

> 
>> table being updated contains records 7482 (658K raw data) of which
>> probably fewer than 2000 are being updated, and typically the same ones
>> all the time: we're updating the date of the latest trade.
> 
> . . . this is likely your problem.  The updates probably get slower
> and slower.  What's happening is that you're expiring a row _for each
> update_, which means it _isn't_ the same row every time.  This is
> approximately the worst use model for PostgreSQL's MVCC approach. 
> Worse, though it's not the same row, you have to grovel through all
> the dead rows to find the actually live one.  So that's probably
> what's killing you.

I don't understand what you're telling me. Your surmise correctly
reflects my experience, but I don't understand the explanation.

Each row's identified by a key, and the key doesn't change.

I don't know how Postgresql works, but 25 or so years ago I had a good
understanding of how ADABAS* (now MAXDB I believe, but doubtless
rewritten several times) worked then. And the OS then was IBM's OS/VS or
a close relative from Fujitsu, OSIV/F4, and those worked very
differently from how Linux works today.

ADABAS would put the updated record right back where it came from; it
maintained some free space in each block in case an update increased the
size of a record. Whether Postgresql does that, or something different, I
don't know; the Adabas technique would handle this process well.

I'll point out this line again, in case your explanation means you think
postgresql is spending lots of CPU:
user  :   1d  3:17:04.03  16.0%  page in :  131097310  disk 1:
3079516r20087664w
nice  :       0:05:39.64   0.1%  page out:  197016649
system:   2d 20:38:37.13  40.1%  page act:   87906251

That line above. The way I figure it the kernel's doing an enormous
amount of work handling its buffers. System time is overhead, it's like
a government tax - you gotta pay it, but you try really hard not to pay
too much.

user time is time doing real work (perhaps not when it's in postgresql,
but in general), and one does not expect to see it dwarfed by the system
time. Here, we have the kernel-space code spending twice as much time
doing something as the user-space code.

It's not normal Linux kernel behaviour, and it's not normal on my
laptop. The best idea I have is that postgresql is doing something that
causes this bad behaviour. If you think that updating records as I do
causes this, I won't argue the point because I simply do not know.

I don't need a workaround, I have two: one is to commit the updates when
the date field in the input records changes, and the other is to create
the input data in separate files, one for each date. Both give
reasonable performance: the first took around three and a half hours
(on a different system, which currently declines to boot:-((, a Pentium IV 3.0
GHz with HT), and the other ran a little longer on the laptop. Roughly what
I expected.
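
For what it's worth, the second workaround needs nothing more than a
small splitter along these lines (a sketch; it assumes the date is the
first comma-separated field, as in the sample record I posted):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;

// Split the big extract into one file per trading date, so each file
// can be loaded, and committed, on its own. Two years of trading days
// is only a few hundred open files, which is fine here.
public class SplitByDate {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        Map<String, PrintWriter> writers = new HashMap<String, PrintWriter>();
        String line;
        while ((line = in.readLine()) != null) {
            String date = line.substring(0, line.indexOf(','));
            PrintWriter w = writers.get(date);
            if (w == null) {
                w = new PrintWriter(new FileWriter(args[0] + "." + date));
                writers.put(date, w);
            }
            w.println(line);
        }
        for (PrintWriter w : writers.values())
            w.close();
        in.close();
    }
}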


Since all the work is within one transaction, there's no real need for
postgresql to write data at all, unless it's running out of memory.


* ADABAS then didn't do SQL, and wasn't relational. It was about when
IBM's DB2 came out, and Cincom had Supra. According to Cincom, DB2 wasn't
relational either.
> 
>> Laptop (1.25 Gbytes)
>> shared_buffers = 1000                   # min 16 or max_connections*2,
>> 8KB each
> 
> so you have 8000 K configured as your shared buffers there.  That's
> as much as you'll ever use for shared memory by Postgres.  You can
> probably bump a little in this case.  Your other config seems ok to
> me.  But I don't think this is your problem -- the update pattern is.  

In both cases, I am using default values. Unless someone thinks there's
clearly something to be gained by changing them, I won't.

> 
> A
> 




Re: A long-running transaction

From
Andrew Sullivan
Date:
On Wed, Apr 11, 2007 at 10:23:34PM +0800, John Summerfield wrote:
> Each row's identified by a key, and the key doesn't change.

That doesn't matter.

> ADABAS would put the updated record right back where it came from, it

That isn't how PostgreSQL works.

I'm having a hard time laying my hands on the bits of the docs that
explain this (anyone?), but MVCC stands for Multi-Version Concurrency
Control.  When you write a row, it has a version number.  That
version number tells other transactions what version of the row they
look at.  

Here's a _really over simplified_ explanation, at a conceptual level. 
[Note to experts: please don't correct details of how this is wrong. 
I know that.  But John's problem is that he doesn't have this
conceptually yet.  The following explanation has worked for me in the
past in getting the idea across first, so that the details can later
make sense.]

Suppose you have two transactions, t1 and t2.  They start at
approximately the same time.  I'm assuming they're READ COMMITTED.

Suppose t1 updates row R.  That version of R has a number R(t1).  

Suppose at the same time, t2 is looking at R.  The _version_ of R
that t2 sees is R(tn), where n<1.  This way, t2 does not have to
wait on t1 in order for t2 to proceed (unless t2 actually modifies R. 
Ignore that for the sake of this explanation, please).

The way this works is that, in marking R(t1), the system says "R(t1) is
valid for transactions that committed after t1".  Since t2 hasn't
committed, it sees the old version.  So when t1 commits, there are
_two_ versions of R -- R(t1) and R(tn),n<1.  When all transactions
have committed such that there is no (tn),n<1, then the row is marked
dead, and can be recovered by VACUUM (this is one of the main things
VACUUM does.  The idea is to move the housekeeping of concurrency
control out of the main transaction.  Oracle does something
conceptually similar, except using rollback segments, which is why
long-running transactions can exhaust the supply of rollback segments
on Oracle).

Now, you're thinking, "But this is all in one transaction!"  Yes, but
the implementation details make it impossible that t1 rewrite the
same row over and over again, so your rewriting of the same row again
and again actually is creating huge numbers of dead rows.  The
unfortunate thing is that you have to follow the whole dead-row chain
to find the currently live version of your row.

The upshot of this is that updating the same row over and over in one
transaction will make your transaction go slower on each round, until
you are nearly stopped.  That's why a lot of updates of the sort you
describe are in fact the worst case under Postgres.  EnterpriseDB
seems to have a proposal (which I believe is going to make 8.3) that
will mitigate but not completely solve some of this.

> user  :   1d  3:17:04.03  16.0%  page in :  131097310  disk 1:
> 3079516r20087664w
> nice  :       0:05:39.64   0.1%  page out:  197016649
> system:   2d 20:38:37.13  40.1%  page act:   87906251
> 
> That line above. The way I figure it the kernel's doing an enormous
> amount of work handling its buffers. 

No, I think it's doing an enormous amount of I/O, because it has to
keep looking for these new rows (and scanning over the old areas
while it's at it).  This is system time, if I recall correctly
through the haze that is now my understanding of Linux I/O, because
it's down at the I/O subsystem and can't actually be done by the user
program.  (Feel free to correct me on this.  But I still bet it's
dead rows.)

> Since all the work is within one transaction, there's no real need for
> postgresql to write data at all, unless it's running out of memory.

No, I'm sure this has to do with the multi-version row writing.  Your
alternatives are ok, but I'd be more tempted to commit a range of
data in a timestamped series, and then join on a subselect picking
the latest insert (and add another daemon to knock old values off the
bottom of the list).

> In both cases, I am using default values. Unless someone thinks there's
> clearly something to be gained by changing them, I won't.

The PostgreSQL install defaults are there to get things started, not
to perform well.  There is a long-standing debate in the community
about whether that's a good idea (and the defaults are way more
sensible than they used to be), but in general you won't get really
good performance from the default configuration of a PostgreSQL
installation.  

Here is a possibly useful bit of reading to start with on tuning,
although it's getting long in the tooth:

http://www.powerpostgresql.com/Downloads/annotated_conf_80.html

Also

http://varlena.com/varlena/GeneralBits/Tidbits/perf.html

And the -performance list will probably help you more than this one.

Hope this helps.

A

-- 
Andrew Sullivan  | ajs@crankycanuck.ca
Users never remark, "Wow, this software may be buggy and hard 
to use, but at least there is a lot of code underneath."    --Damien Katz


Re: A long-running transaction

From
John Summerfield
Date:
Andrew Sullivan wrote:
> On Wed, Apr 11, 2007 at 10:23:34PM +0800, John Summerfield wrote:
>
>>Each row's identified by a key, and the key doesn't change.
>
>
> That doesn't matter.
>
>
>>ADABAS would put the updated record right back where it came from, it
>
>
> That isn't how PostgreSQL works.
>
> I'm having a hard time laying my hands on the bits of the docs that
> explain this (anyone?), but MVCC stands for Multi-Version Concurrency
> Control.  When you write a row, it has a version number.  That
> version number tells other transactions what version of the row they
> look at.
>
> Here's a _really over simplified_ explanation, at a conceptual level.
> [Note to experts: please don't correct details of how this is wrong.
> I know that.  But John's problem is that he doesn't have this
> conceptually yet.  The following explanation has worked for me in the
> past in getting the idea across first, so that the details can later
> make sense.]
>
> Suppose you have two transactions, t1 and t2.  They start at
> approximately the same time.  I'm assuming they're READ COMMITTED.
>
> Suppose t1 updates row R.  That version of R has a number R(t1).
>
> Suppose at the same time, t2 is looking at R.  The _version_ of R
> that t2 sees is R(tn), where n<1.  This way, t2 does not have to
> wait on t1 in order for t2 to proceed (unless t2 actually modifies R.
> Ignore that for the sake of this explanation, please).
>
> The way this works is that, in marking R(t1), the system says "R(t1) is
> valid for transactions that committed after t1".  Since t2 hasn't
> committed, it sees the old version.  So when t1 commits, there are
> _two_ versions of R -- R(t1) and R(tn),n<1.  When all transactions
> have committed such that there is no (tn),n<1, then the row is marked
> dead, and can be recovered by VACUUM (this is one of the main things
> VACUUM does.  The idea is to move the housekeeping of concurrency
> control out of the main transaction.  Oracle does something
> conceptually similar, except using rollback segments, which is why
> long-running transactions can exhaust the supply of rollback segments
> on Oracle).
>
> Now, you're thinking, "But this is all in one transaction!"  Yes, but
> the implementation details make it impossible that t1 rewrite the
> same row over and over again, so your rewriting of the same row again
> and again actually is creating huge numbers of dead rows.  The
> unfortunate thing is that you have to follow the whole dead-row chain
> to find the currently live version of your row.

Why is this implementation preferable to not doing that?

>
> The upshot of this is that updating the same row over and over in one
> transaction will make your transaction go slower on each round, until
> you are nearly stopped.  That's why a lot of updates of the sort you
> describe are in fact the worst case under Postgres.  EnterpriseDB

Where is enterprisedb? enterprisedb.com seems to be held by a squatter
in Kingston. About where they're playing an important cricket
competition about now.

> seems to have a proposal (which I believe is going to make 8.3) that
> will mitigate but not completely solve some of this.

I can see we need before and after, but why all those intermediate rows?
How can they possibly be used? If I roll back the transaction, they're
all discarded, and if I commit then only the last is required.

Also, I don't see why (apparently) a sequential search is used; surely
if all of these rows might be required, still a pointer to the last
would be right? Or last two, if you need the ability to check the order.

Is there a development version I can try, with this improvement in it?

>
>
>>user  :   1d  3:17:04.03  16.0%  page in :  131097310  disk 1:
>>3079516r20087664w
>>nice  :       0:05:39.64   0.1%  page out:  197016649
>>system:   2d 20:38:37.13  40.1%  page act:   87906251
>>
>>That line above. The way I figure it the kernel's doing an enormous
>>amount of work handling its buffers.
>
>
> No, I think it's doing an enormous amount of I/O, because it has to
> keep looking for these new rows (and scanning over the old areas
> while it's at it).  This is system time, if I recall correctly
> through the haze that is now my understanding of Linux I/O, because
> it's down at the I/O subsystem and can't actually be done by the user
> program.  (Feel free to correct me on this.  But I still bet it's
> dead rows.)

Your argument is fairly convincing, and it does reflect that the
postmaster seems always to be doing an update, but I am still dubious.
1. For the first day or so, my observation was that the disk was not
particularly busy. It seemed to be running about half the time. It is
fairly noticeable because, for reasons I don't understand at all, it (on
the laptop) generates EMI and upsets the radio in the next room.
Unplugging the power cord stops the EMI - go figure, but of course I
only have a few hours before the battery runs out.

2. System time, unless I am mistaken (and that is possible), is time the
CPU is active in the kernel. It is separate from IOWait.


I'm reposting the whole lot:
Bootup: Tue Mar 27 18:50:19 2007    Load average: 2.21 2.65 2.69 2/243 19305

user  :   1d  3:17:04.03  16.0%  page in :  131097310  disk 1:
3079516r20087664w
nice  :       0:05:39.64   0.1%  page out:  197016649
system:   2d 20:38:37.13  40.1%  page act:   87906251
IOwait:   2d  0:46:37.33  28.5%  page dea:   16218135
hw irq:       0:44:46.71   0.4%  page flt:  306255213
sw irq:       0:50:04.69   0.5%  swap in :       4026
idle  :   1d  0:36:29.73  14.4%  swap out:       9552
uptime:   7d  2:59:20.97         context :  702502442

If I add user + system + iowait + idle I get to about the 7 days and some
of total uptime; I expect that if I did the arithmetic more precisely, the
result would match even better.

>
>
>>Since all the work is within one transaction, there's no real need for
>>postgresql to write data at all, unless it's running out of memory.
>
>
> No, I'm sure this has to do with the multi-version row writing.  Your
> alternatives are ok, but I'd be more tempted to commit a range of
> data in a timestamped series, and then join on a subselect picking
> the latest insert (and add another daemon to knock old values off the
> bottom of the list).

I don't plan on programming round this, because it's a conversion
problem, not an ongoing one, and in any event dumping the source database
into separate files for each date works well enough. But if I did, I'd
simply use a vector in Java and cache the updates myself, and write the
updates before committing.

In the usual case, there's no significant difference.
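
If I did go that way, it would amount to something like this (a
sketch; a map keyed on the ASX code is simpler than a vector here, and
the second table and its column are the same invented ones as in my
earlier sketch):

import java.sql.Connection;
import java.sql.Date;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;

// Instead of updating the "latest trade" row once per input record,
// remember the newest date per code in memory and write each row
// exactly once, just before committing.
class LatestTradeCache {
    private final Map<String, Date> latest = new HashMap<String, Date>();

    // Called once per input record; no SQL is issued here.
    void note(String asxCode, Date tradeDate) {
        Date seen = latest.get(asxCode);
        if (seen == null || tradeDate.after(seen))
            latest.put(asxCode, tradeDate);
    }

    // Called once, immediately before con.commit().
    void flush(Connection con) throws SQLException {
        PreparedStatement upd = con.prepareStatement(
                "update companies set latest_trade = ? where ASX = ?");
        for (Map.Entry<String, Date> e : latest.entrySet()) {
            upd.setDate(1, e.getValue());
            upd.setString(2, e.getKey());
            upd.executeUpdate();
        }
        upd.close();
    }
}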


>
>
>>In both cases, I am using default values. Unless someone thinks there's
>>clearly something to be gained by changing them, I won't.
>
>
> The PostgreSQL install defaults are there to get things started, not
> to perform well.  There is a long-standing debate in the community
> about whether that's a good idea (and the defaults are way more
> sensible than they used to be), but in general you won't get really
> good performance from the default configuration of a PostgreSQL
> installation.

and the distribution's defaults are not necessarily the same as yours.
That said, I didn't see anything in postgresql.conf that obviously
needed changing; with only one connexion I couldn't see a need to
increase shared memory, and I didn't see anything at all about buffer pools.

>
> Here is a possibly useful bit of reading to start with on tuning,
> although it's getting long in the tooth:
>
> http://www.powerpostgresql.com/Downloads/annotated_conf_80.html
>
> Also
>
> http://varlena.com/varlena/GeneralBits/Tidbits/perf.html

There's nothing that leaps out at me; the WAL stuff doesn't apply until
I commit, and I haven't. Work memory, maybe, but I'm not doing complex
queries.

Effective cache size might.


>
> And the -performance list will probably help you more than this one.


At present I'm trying to clarify in my mind the nature of the problem.
What I'm trying to do seems to me reasonable. I have some data, and I
want it all in or none of it, so it fits the idea of a single transaction.

It might be that my demands exceed Postgresql's current capabilities,
but by itself it doesn't make what I'm trying to do unreasonable.


>
> Hope this helps.

I appreciate the time you're spending on this. I will be especially
pleased if the final result is an improvement in postgresql.

>
> A
>




Re: A long-running transaction

From
Andrew Sullivan
Date:
On Fri, Apr 13, 2007 at 09:02:58AM +0800, John Summerfield wrote:
> 
> Why is this implementation preferable to not doing that?

Because this way, readers never wait for writers.  On most database
systems, that pending UPDATE would block anyone reading the row too,
even if they weren't going to write on it.  It's a trade-off.  It
happens to mean that the trade is bad for the thing you happen to be
doing, but since most database operations aren't bulk ones in the way
you're trying to achieve this, that's quite possibly a reasonable
trade off.

> Where is enterprisedb? enterprisedb.com seems to be held by a squatter
> in Kingston. About where they're playing an important cricket
> competition about now.

Google took me here in one:

http://www.enterprisedb.com

But anyway, their stuff is built on top of Pg, so going to their site
won't help you.

> I can see we need before and after, but why all those intermediate rows?

Because in the general case, you need the intermediate rows.  The
problem is that you'd have to write a special piece of code to catch
the case where nobody else can see the row that you're about to
expire, and that test isn't free.  Therefore, you do it the same way
any other row gets expired.

> Also, I don't see why (apparently) a sequential search is used; surely
> if all of these rows might be required, still a pointer to the last
> would be right? Or last two, if you need the ability to check the order.

No, it's not a seqscan.  It's following the chain of versions of the
row.  You don't know until you look at the actual row whether the
version of it you are looking at is valid for you.  There isn't any
other place to store that information.  (If you want someone who
really understands how all this works to explain it to you in more
accurate terms, you should ask the question on -hackers.  They'll
point you to the developers' docs that I can't seem to find right
now.)

> Is there a development version I can try, with this improvement in it?

Well, you could try using the current CVS HEAD, which is close to
feature freeze.  It'd sure be a service to the community, at least,
because we'd learn whether the proposed change fixes this sort of
case.  (But I wouldn't trust my data to the HEAD for real.  You said
you're not actually in production yet, though.)

> 1. For the first day or so, my observation was that the disk was not
> particularly busy.

That's completely consistent with the theory I have.  As the number
of dead tuples goes up, your disk activity will slowly get worse.

> At present I'm trying to clarify in my mind the nature of the problem.
> What I'm trying to do seems to me reasonable. I have some data, and I
> want it all in or none of it, so it fits the idea of a single transaction.
> 
> It might be that my demands exceed Postgresql's current capabilities,
> but by itself it doesn't make what I'm trying to do unreasonable.

No, it's not unreasonable, but it happens to be a pessimal case under
Postgres.

A

-- 
Andrew Sullivan  | ajs@crankycanuck.ca
"The year's penultimate month" is not in truth a good way of saying
November.    --H.W. Fowler


Re: A long-running transaction

From
Andrew Sullivan
Date:
On Fri, Apr 13, 2007 at 07:49:19AM -0400, Andrew Sullivan wrote:
> On Fri, Apr 13, 2007 at 09:02:58AM +0800, John Summerfield wrote:

> > 1. For the first day or so, my observation was that the disk was not
> > particularly busy.
> 
> That's completely consistent with the theory I have.  As the number
> of dead tuples goes up, your disk activity will slowly get worse.

This simple demonstration occurred to me on the subway on the way
here, to show that the dead tuples really will mount.

testing=# \d testtab
                            Table "public.testtab"
 Column |  Type   |                      Modifiers
--------+---------+------------------------------------------------------
 id     | integer | not null default nextval('testtab_id_seq'::regclass)
 col1   | text    | 
 col2   | text    | 
Indexes:
    "testtab_pkey" PRIMARY KEY, btree (id)

testing=# SELECT * from testtab;
 id | col1 | col2 
----+------+------
  1 | one  | 
(1 row)

Now, we check how many dead tuples we have:

testing=# VACUUM VERBOSE testtab ;
INFO:  vacuuming "public.testtab"
INFO:  index "testtab_pkey" now contains 1 row versions in 2 pages
DETAIL:  0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "testtab": found 0 removable, 1 nonremovable row versions in 1 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 3 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  vacuuming "pg_toast.pg_toast_90325"
INFO:  index "pg_toast_90325_index" now contains 0 row versions in 1 pages
DETAIL:  0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "pg_toast_90325": found 0 removable, 0 nonremovable row versions in 0 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM

Notice the lines:

INFO:  "testtab": found 0 removable, 1 nonremovable row versions in 1 pages
DETAIL:  0 dead row versions cannot be removed yet.


So, let's do an update

testing=# begin;
BEGIN
testing=# UPDATE testtab set col2='1';
UPDATE 1
testing=# commit;
COMMIT

This time, when we run vacuum, we get this (snipped out for brevity):

INFO:  "testtab": found 1 removable, 1 nonremovable row versions in 1 pages
DETAIL:  0 dead row versions cannot be removed yet.

So, several updates in one transaction:

testing=# begin;
BEGIN
testing=# UPDATE testtab set col2='2';
UPDATE 1
testing=# UPDATE testtab set col2='3';
UPDATE 1
testing=# UPDATE testtab set col2='4';
UPDATE 1
testing=# UPDATE testtab set col2='5';
UPDATE 1
testing=# commit;
COMMIT

This time, VACUUM VERBOSE tells us that all of those were dead:

INFO:  "testtab": found 4 removable, 1 nonremovable row versions in 1 pages
DETAIL:  0 dead row versions cannot be removed yet.

Ok, what if we just update without actually changing anything?

testing=# begin;
BEGIN
testing=# UPDATE testtab set col2='5';
UPDATE 1
testing=# UPDATE testtab set col2='5';
UPDATE 1
testing=# UPDATE testtab set col2='5';
UPDATE 1
testing=# UPDATE testtab set col2='5';
UPDATE 1
testing=# UPDATE testtab set col2='5';
UPDATE 1
testing=# commit;
COMMIT

We get the same result:

INFO:  "testtab": found 5 removable, 1 nonremovable row versions in 1 pages
DETAIL:  0 dead row versions cannot be removed yet.

A






-- 
Andrew Sullivan  | ajs@crankycanuck.ca
Users never remark, "Wow, this software may be buggy and hard 
to use, but at least there is a lot of code underneath."    --Damien Katz


Re: A long-running transaction

From
Joe
Date:
Hi Andrew,

On Fri, 2007-04-13 at 07:49 -0400, Andrew Sullivan wrote:
> Because in the general case, you need the intermediate rows.  The
> problem is that you'd have to write a special piece of code to catch
> the case where nobody else can see the row that you're about to
> expire, and that test isn't free.  Therefore, you do it the same way
> any other row gets expired.

Just curious:  is there a way to defeat MVCC?  i.e., if you can lock the
database exclusively, there won't be any readers or writers?

Joe



Re: A long-running transaction

From
Andrew Sullivan
Date:
On Fri, Apr 13, 2007 at 12:42:36PM -0400, Joe wrote:
> 
> Just curious:  is there a way to defeat MVCC?  

No.  That is,

> i.e., if you can lock the
> database exclusively, there won't be any readers or writers?

while this is true, it doesn't shut off the way writes go through the
system.  MVCC is rather close to the central guts of PostgreSQL. 

If you want to understand more about why this is the case, what the
trade-off considerations are, &c., then I strongly advise you to troll
through the -hackers archives.  Approximately once every 8-10 months
someone comes along with an idea for a change that disables MVCC in
just this or that case.  Most of the time, the actual effects of this
are different than people think they will be -- concurrency is hard,
and making changes so that it's impossible to get into a race is
tricky.  This is why the in-place storage management that
EnterpriseDB contributed was both somewhat controversial and somewhat
limited in its application.

A

-- 
Andrew Sullivan  | ajs@crankycanuck.ca
This work was visionary and imaginative, and goes to show that visionary
and imaginative work need not end up well.     --Dennis Ritchie