Re: AIX support - Mailing list pgsql-hackers

From Tom Lane
Subject Re: AIX support
Msg-id 1299410.1770136400@sss.pgh.pa.us
In response to RE: AIX support  (Aditya Kamath <Aditya.Kamath1@ibm.com>)
List pgsql-hackers
Just to play devil's advocate for a minute:

I've managed to run check-world (with --enable-tap-tests, but few
optional features) on the GCC compile farm's AIX 7.3 machine,
cfarm119.cfarm.net.

It took more than five hours.

bash-5.3$ time make -s check-world -j2 PROVE_FLAGS='--quiet --nocolor --nocount' >/dev/null

real    311m10.942s
user    4m26.936s
sys     4m13.243s

(I can't go higher than -j2 due to the machine's restrictive ulimit -u
setting.  Perfectly reasonable policy for a shared resource, and
that's not what I'm griping about.)

This compares unfavorably to my 2004-vintage Mac PPC G4 laptop
(running NetBSD 10.1), let alone anything remotely modern.  The G4
needs about three-and-two-thirds hours for substantially the same
test:

$ time make -s check-world -j2 PROVE_FLAGS='--quiet --nocolor --nocount' >/dev/null
13188.92s real  1229.68s user  1210.25s system

If the user/system times are to be trusted, the AIX machine is indeed
several times faster than the G4 CPU-wise, so why is it so slow?
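
For reference, the CPU-versus-wall-clock arithmetic behind that comparison can be reproduced with a few lines of awk. This is only a sketch using the figures quoted above; the variable names are made up for illustration.

```shell
#!/bin/sh
# Figures copied from the two "time make check-world" runs above.
summary=$(awk 'BEGIN {
    aix_cpu  = (4*60 + 26.936) + (4*60 + 13.243)  # user + sys, in seconds
    aix_real = 311*60 + 10.942                    # wall clock, in seconds
    g4_cpu   = 1229.68 + 1210.25                  # user + system, in seconds
    g4_real  = 13188.92                           # wall clock, in seconds
    printf "G4/AIX CPU-time ratio: %.1f\n", g4_cpu / aix_cpu
    printf "AIX CPU share of wall clock: %.1f%%\n", 100 * aix_cpu / aix_real
    printf "G4 CPU share of wall clock: %.1f%%\n", 100 * g4_cpu / g4_real
}')
echo "$summary"
```

The G4 burns roughly 4.7 times as much CPU time as the AIX machine, yet CPU time accounts for under 3% of the AIX run's elapsed time versus about 18.5% on the G4.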

Apparently, because its file system sucks.  One thing we do over and
over in the TAP tests is to copy an initialized data directory to
prepare a new instance, basically "cp -RPp template-dir $PGDATA".
I'm observing that taking about 22 seconds on the AIX machine
(which is actually slower than running initdb would be: about 15s),
compared to 2.6s on the G4, and about 0.035s on my Linux workstation
(which can do the same overall -j2 check-world in five minutes).
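
A rough way to reproduce this kind of measurement without a PostgreSQL build is sketched below. It is an approximation under stated assumptions: the tree is synthetic (many small 8 kB files, not a real initdb-generated data directory), the file count and directory names are hypothetical, and `mktemp -d` and `date +%s` are assumed to be available on the platform.

```shell
#!/bin/sh
# Sketch: time "cp -RPp" on a synthetic tree of many small files.
# The tree is a stand-in, NOT a real initdb-generated data directory.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

template="$workdir/template-dir"
target="$workdir/pgdata-copy"
nfiles=200          # hypothetical count, chosen only to keep the run short

mkdir -p "$template/base/1" "$template/global"
i=0
while [ "$i" -lt "$nfiles" ]; do
    # One 8 kB file per iteration (8 kB is PostgreSQL's default block size).
    dd if=/dev/zero of="$template/base/1/$i" bs=8192 count=1 2>/dev/null
    i=$((i + 1))
done

# date +%s has whole-second resolution only, so fast file systems report 0.
start=$(date +%s)
cp -RPp "$template" "$target"
end=$(date +%s)
elapsed=$((end - start))

copied=$(find "$target" -type f | wc -l | tr -d ' ')
echo "copied $copied files in about ${elapsed}s"
```

Because the copy consists almost entirely of file creation and metadata updates, a result measured in whole seconds for a tree this small would point at the file system rather than the CPU.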

To be clear, there is as far as I can tell next to zero background
I/O load on cfarm119.  This is a typical readout when I'm not
running anything:

$ iostat 1

System configuration: lcpu=40 drives=12 ent=4.00 paths=10 vdisks=6

tty:      tin         tout    avg-cpu: % user % sys % idle % iowait physc % entc
          0.0         70.0               24.9  33.5   41.6      0.0   4.0  100.5

Disks:         % tm_act     Kbps      tps    Kb_read   Kb_wrtn
cd1               0.0       0.0       0.0          0         0
cd0               0.0       0.0       0.0          0         0
hdisk1            0.0       0.0       0.0          0         0
hdisk4            0.0       0.0       0.0          0         0
hdisk9            0.0       0.0       0.0          0         0
hdisk8            0.0       0.0       0.0          0         0
hdisk7            0.0       0.0       0.0          0         0
hdisk6            0.0       0.0       0.0          0         0
hdisk3            0.0       0.0       0.0          0         0
hdisk2          100.0     640.0     160.0          0       640
hdisk5            0.0       0.0       0.0          0         0
hdisk0            0.0       0.0       0.0          0         0

Unless there is something seriously wrong with how cfarm119 is set up,
the conclusion has to be that AIX is mind-bogglingly bad at disk I/O.

This conclusion is borne out by some simple pgbench testing:
the AIX machine performs somewhat-respectably on select-only
tests, or on pgbench's default test with fsync off, but on the
default test with fsync on it gets less than half the TPS rate of the G4:

cfarm119:
bash-5.3$ pgbench -T 60 -j 4 -c 4 bench
pgbench (19devel)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 10
query mode: simple
number of clients: 4
number of threads: 4
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 2493
number of failed transactions: 0 (0.000%)
latency average = 96.455 ms
initial connection time = 33.624 ms
tps = 41.469913 (without initial connection time)

g4:
[tgl@g42]$ pgbench -T 60 -j 4 -c 4 bench
pgbench (19devel)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 10
query mode: simple
number of clients: 4
number of threads: 4
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 5767
number of failed transactions: 0 (0.000%)
latency average = 41.619 ms
initial connection time = 122.045 ms
tps = 96.109550 (without initial connection time)
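
As a cross-check, the two summaries above can be compared directly. The sketch below assumes the reported average latency is approximately 1000 * clients / tps; that relation is inferred from the numbers shown here, not taken from pgbench's source.

```shell
#!/bin/sh
# tps values and client count copied from the two pgbench runs above.
report=$(awk 'BEGIN {
    clients = 4
    aix_tps = 41.469913
    g4_tps  = 96.109550
    # Assumed relation: average latency (ms) ~= 1000 * clients / tps
    printf "AIX latency estimate: %.3f ms\n", 1000 * clients / aix_tps
    printf "G4 latency estimate: %.3f ms\n", 1000 * clients / g4_tps
    printf "AIX tps as share of G4: %.0f%%\n", 100 * aix_tps / g4_tps
}')
echo "$report"
```

Both estimates match the reported latency averages (96.455 ms and 41.619 ms), and the AIX machine comes out at about 43% of the G4's throughput.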


Remind me again why anyone would choose to run Postgres on this
platform?  Why are we moving mountains to make it possible?

            regards, tom lane


