Re: [pgsql-sql] Daily digest v1.2492 (19 messages) - Mailing list pgsql-sql

From Steve Midgley
Subject Re: [pgsql-sql] Daily digest v1.2492 (19 messages)
Date
Msg-id 20070403183618.60EB89FB486@postgresql.org
Whole thread Raw
List pgsql-sql
Hi John,

It sounds like a disk-bound operation, so cpu is not maxed out. I'm not 
clear on all the details of your operation but it sounds like you're 
using Java to do row-by-row based inserts, selects and updates within a 
transaction, from a file. This can be a very slow process if you have 
many rows. The OS stats you describe fits that theory (but not 
conclusively).

If you are using (psuedo-)code such as:

Open file {  Read line {    select from Pg: "select from [other_table] where val = 
[line[colN]]"    exec to Pg: "insert into [table] (col1, col2, ...) values 
(line[col1], line[col2]..."  }
}

You can radically speed up such a system by using the "copy" 
(http://www.postgresql.org/docs/8.2/interactive/sql-copy.html) command 
to load all the data at once from the file into Pg and then do 
post-processing with Java/SQL to get all the fields looking right. 
Doing a bulk update with a join across several tables is so much faster 
than looping through them with a wrapper in Java (or other lang) you 
won't believe it.

I hope this helps and is on-topic for you.

Steve

At 09:38 AM 4/3/2007, pgsql-sql-owner@postgresql.org wrote:
>Date: Tue, 03 Apr 2007 22:16:13 +0800
>From: John Summerfield <postgres@herakles.homelinux.org>
>To: pgsql-sql@postgresql.org
>Subject: A long-running transaction
>Message-ID: <461261AD.6060108@herakles.homelinux.org>
>
>I have a Java (java 1.1) program that I wrote some years ago, to read
>records from a text file and insert it into a ostgresql database.
>
>One of the assumptions I made was that one file contained one day's
>data, maybe as many as 1500 records, and I coded it to do the whole 
>lot
>as one transaction so either a single file was loaded in its entirity, 
>
>or none of its data was.
>
>I lost the Java code, but revived the idea and I've collected about 
>two
>years' data using (Linux) shell scripts, and loading the data using 
>psql.
>
>Then, I found the Java code on a disused hard disk:-)
>
>I made the necessary changes for it to build in java 1.5, and used 
>psql
>to extract data from my new database in the correct format for the old 
>
>program. This time, I have a little more data than I ever loaded at 
>once
>before:
>summer@Bandicoot:~$ wc -l testdata
>6242217 testdata
>summer@Bandicoot:~$ \ls -hl testdata
>-rw-r--r-- 1 summer summer 285M 2007-03-28 22:32 testdata
>summer@Bandicoot:~$
>
>Now, it wouldn't surprise me if postgresql used lots of memory - but 
>how
>much could it possibly need? My laptop, where I first tried this, has
>1.25 Gbytes, so I could allow it some.
>
>It wouldn't surprise me a lot if it used lots of memory and caused all 
>
>sorts of problems, but no, it's chugging away, still using no more RAM 
>
>than it could have had on my old Pentium 133 all those years ago.
>
>In the process of checking it out, I've set it running on a machine 
>with
>a AMD Sempron(tm)   2400+ running Kubuntu 6.10 (kernel is
>2.6.17-6-server-xen0) and 512 Mbytes of RAM.
>
>This is the java program:-)
>summer   pts/6    :0.0             Thu20    5days  1:07   1:07
>/usr/bin/gij-4.1 -cp /usr/s
>It's been running five days so far, and I can see where it's up to by
>attaching strace. It's reading 2k of the input file every few seconds.
>
>Okay, clearly something's wrong, and I don't think it's all my crddu 
>code.
>No probs swapping:
>summer@Bandicoot:~$ free
>               total       used       free     shared    buffers 
> cached
>Mem:        460800     456472       4328          0        860 
>262164
>-/+ buffers/cache:     193448     267352
>Swap:      1461872        284    1461588
>summer@Bandicoot:~$
>
>It is hitting the disk pretty hard now on this machine, but the 
>laptop's
>still going too, and the disk seems to run about half the time, part 
>of
>a second running, part idle (but the intervals are getting shorter).
>
>It struck me as fairly curious that neither postgresql nor the
>application was hogging the CPU.
>
>Perhaps the laptop is more interesting: look at the size of the buffer 
>pool:
>summer@Echidna:~> free
>               total       used       free     shared    buffers 
> cached
>Mem:       1295528    1268548      26980          0       3976 
>392388
>-/+ buffers/cache:     872184     423344
>Swap:      1941496      32656    1908840
>summer@Echidna:~>
>Again, no problem with over-use of RAM, and I'm logged on using KDE 
>too
>and that's running fine.
>
>It's been running a little longer here:
>summer   pts/23    28Mar07  5days 25:12  25:11  java -cp
>/home/summer/Classes/:/usr/share/p
>
>This is Sun's Java 1.5 on OpenSUSE 10.2.
>
>
>This is what suggested I should write:
>summer@Echidna:~> procinfo
>Linux 2.6.18.8-0.1-default (geeko@buildhost) (gcc 4.1.2 20061115) #1
>1CPU [Echidna.]
>
>Memory:      Total        Used        Free      Shared     Buffers
>Mem:       1295528     1271720       23808           0        3716
>Swap:      1941496       32656     1908840
>
>Bootup: Tue Mar 27 18:50:19 2007    Load average: 2.21 2.65 2.69 2/243 
>19305
>
>user  :   1d  3:17:04.03  16.0%  page in :  131097310  disk 1:
>3079516r20087664w
>nice  :       0:05:39.64   0.1%  page out:  197016649
>system:   2d 20:38:37.13  40.1%  page act:   87906251
>IOwait:   2d  0:46:37.33  28.5%  page dea:   16218135
>hw irq:       0:44:46.71   0.4%  page flt:  306255213
>sw irq:       0:50:04.69   0.5%  swap in :       4026
>idle  :   1d  0:36:29.73  14.4%  swap out:       9552
>uptime:   7d  2:59:20.97         context :  702502442
>
>irq  0: 153880209 timer                 irq  7:         0 parport0
>irq  1:     69402 i8042                 irq  8:         2 rtc
>irq  2:         0 cascade [4]           irq  9:   1696942 acpi
>irq  3:         4                       irq 10:         1
>irq  4:         4                       irq 11:  71842329 
>ehci_hcd:usb1,
>uhci_
>irq  5:  28545863 Intel 82801DB-ICH4    irq 12:    467432 i8042
>irq  6:         1                       irq 14:  25021586 ide0
>
>summer@Echidna:~>
>
>
>Look at that line beginning "System:" two days 20 hours in the Linux
>kernel. It's my guess that the Linux kernel is spending a great deal 
>of
>time manipulating that buffer pool.
>
>This shows postgresql taking 60% CPU:
>summer@Echidna:~> ps xaru
>USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME 
>COMMAND
>summer   20237  0.2  1.2 263716 15988 pts/23   Rl+  Mar28  25:11 java
>-cp /home/summer/Class
>postgres 19321  0.1  0.3  19844  3984 ?        D    21:50   0:00
>postgres: summer stocksshar
>summer   19344  0.0  0.0   2484   852 pts/25   R+   21:52   0:00 ps 
>xaru
>summer@Echidna:~> ps xaru
>USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME 
>COMMAND
>postgres 20248 60.0  0.8  20760 11536 ?        D    Mar28 5167:27
>postgres: summer sharetrad
>postgres 19321  0.1  0.3  19844  4008 ?        D    21:50   0:00
>postgres: summer stocksshar
>summer   19348  0.0  0.0   2480   848 pts/25   R+   21:52   0:00 ps 
>xaru
>summer@Echidna:~>
>
>Note that pid 20248 is the one, the other has nothing to do with the
>problem.
>
>This is the SQL I used to create the table:
>        String createTableStatement = "create table TJS0102_trades ("
>                    + "   trade date"
>                    + ",  ttime int2"
>                    + ",  ASX varchar(7)"
>                    + ",  thigh int4"
>                    + ",  tlow int4"
>                    + ",  tclose int4"
>                    + ",  topen int4"
>                    + ",  tvolume int4"
>                    + ",  tvalue int4"
>//                 + ")"
>                    + ",  unique(ASX,trade,ttime)"
>                    + ");";
>
>
>For each record, I update a non-key field in another table; the source 
>
>data for that other table is less than a megabyte.
>
>
>I have changed the program so as to load each day's data as a single
>transaction; it ran on a Pentium IV 2.8 Ghz with HT, 512 Mbytes in 
>about
>three and an half hours. This is more-or-less what I expected.
>
>If you agree with me that postgresql should do better, now is a good
>time to ask for this as a test case.
>
>Note, you could also get suitable test data from float.com.au - it's 
>not
>where mine came from, but it should be substantially the same.



pgsql-sql by date:

Previous
From: Karthikeyan Sundaram
Date:
Subject: Re: plpgsql function question
Next
From: "A. Kretschmer"
Date:
Subject: Re: plpgsql function question