Re: Inserting 8MB bytea: just 25% of disk perf used? - Mailing list pgsql-performance

From fkater@googlemail.com
Subject Re: Inserting 8MB bytea: just 25% of disk perf used?
Date
Msg-id 20100122204202.GA1843@comppasch2
In response to Re: Inserting 8MB bytea: just 25% of disk perf used?  (Scott Carey <scott@richrelevance.com>)
Responses Re: Inserting 8MB bytea: just 25% of disk perf used?
List pgsql-performance
Scott Carey:

> Well, something is causing the system to alternate between
> CPU and disk bound here.  (see below).
> It would be useful to see what effect the index has.

Ok, I simply deleted the index and repeated the test: I did
not notice any difference. That is probably because the
test consists of just 10 INSERTs anyway.
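
(Roughly what I did; table and index names are just
placeholders here, the real ones do not matter:)

  -- drop the index on the OID column before the test run
  DROP INDEX bytea_test_oid_idx;
  -- ...run the 10 bytea INSERTs and take the time, then
  -- recreate the index afterwards:
  CREATE INDEX bytea_test_oid_idx ON bytea_test (oid);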


> > (B) perfmon34.png: Results/graphs (performance monitor):
> >
> Great data!

BTW: I have some more screenshots, but since attachments do
not arrive on the mailing list I am holding them back. The
new graphs are basically the same anyway.


> > (1) The upper dark/gray graph touching the 100% sometimes is
> > "disk write time %" of the data disk G:
> >
> > (2) The yellow graph is nearly completly overpainted by (1)
> > since it is "disk time %".
> >
> > (3) The brown graph below (1) is "Disk Write Bytes/s" divided
> > by 1.000.000, so around 40 MB/s average.
> >
>
> Looks like it is writing everything twice, or close to it.
> Alternatively the index writes occupy half, but that is
> unlikely.

'Writing twice': That is the most interesting point, I
believe. Why is the data disk writing 40 MB/s (*not*
including WAL) when the actual data throughput is only
20 MB/s? It looks like: 20 MB/s for the data, 20 MB/s for
some unexplained X, and 20 MB/s for WAL.

Although that question is still unanswered: I verified
again that I am disk bound by temporarily replacing the
RAID-0 with a slower setup, a single attached SATA disk of
the same type. This *did* slow down the test considerably
(by approx. 20%). So, yes, I am disk bound, but, again, why
at such a low throughput...

About removing the index on OIDs: No impact (see above).


> > (4) The read graph is "Disk Time %" of the WAL drive E:,
> > average approx. 30%.
> >
>
> WAL doesn't look like a bottleneck here, as other tests
> have shown.  A larger wal_buffers setting might lower this
> more, since your record overflows the buffer for sure.
> You might want to change your test case to write records
> of a similar size to what you expect (do you expect 80MB?) and
> then set wal_buffers up to the size of one checkpoint
> segment (16MB) if you expect larger data per transaction.

Ok, without knowing the exact effect of each setting, I
changed some of the configuration values (from the 8.2.4
defaults) and ran some tests:

(1) First, the most important 8.2.4 defaults (for you to
review; a psql sketch for checking them follows the list):

#shared_buffers=32MB
#temp_buffers=8MB
#max_prepared_transactions=5
#work_mem=1MB
#maintenance_work_mem=16MB
#max_stack_depth=2MB
#max_fsm_pages=204800
#max_fsm_relations=1000
#max_files_per_process=1000
#shared_preload_libraries=''
#vacuum_cost_delay=0
#vacuum_cost_page_hit=1
#vacuum_cost_page_miss=10
#vacuum_cost_page_dirty=20
#vacuum_cost_limit=200
#bgwriter_delay=200ms
#bgwriter_lru_percent=1.0
#bgwriter_lru_maxpages=5
#bgwriter_all_percent=0.333
#bgwriter_all_maxpages=5
#fsync=on
#full_page_writes=on
#wal_buffers=64kB
#checkpoint_segments=3
#checkpoint_timeout=5min
#checkpoint_warning=30s
#seq_page_cost=1.0
#random_page_cost=4.0
#cpu_tuple_cost=0.01
#cpu_index_tuple_cost=0.005
#cpu_operator_cost=0.0025
#effective_cache_size=128MB
#default_statistics_target=10
#constraint_exclusion=off
#from_collapse_limit=8
#join_collapse_limit=8
#autovacuum=on
#autovacuum_naptime=1min
#autovacuum_vacuum_threshold=500
#autovacuum_analyze_threshold=250
#autovacuum_vacuum_scale_factor=0.2
#autovacuum_analyze_scale_factor=0.1
#autovacuum_freeze_max_age=200000000
#autovacuum_vacuum_cost_delay=-1
#autovacuum_vacuum_cost_limit=-1
#deadlock_timeout=1s
#max_locks_per_transaction=64
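
(The list above was taken straight from postgresql.conf;
the effective values can also be checked from within psql,
e.g. like this, just a sketch:)

  -- show the currently effective value of a single setting
  SHOW wal_buffers;
  -- or list everything that is not at its default value
  -- (assuming pg_settings has the 'source' column in this
  -- version)
  SELECT name, setting FROM pg_settings WHERE source <> 'default';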


(2) The tests:

Note: The baseline speed with the defaults was about
800MB/40s, so 20MB/s.


a)
What I changed: fsync=off
Result: 35s, so 5s faster.


b) like a) but:
checkpoint_segments=128 (was 3)
autovacuum=off

Result: 35s (no change...?!)


c) like b) but:
temp_buffers=200MB (was 8MB)
wal_sync_method=open_datasync (was fsync)
wal_buffers=1024kB (was 64kB)

Result:
The best result so far: it took just 29s, so 800MB/29s =
27.5MB/s. However, autovacuum=off probably means that
deleted rows will keep occupying disk space? And I also
fear that checkpoint_segments=128 means that at some point
in the future there will be a huge checkpoint delay (?).
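
(For clarity, the combined non-default settings of test c)
as they would look in postgresql.conf, values taken from
the tests above:)

  fsync = off                      # test a)
  checkpoint_segments = 128        # test b), default 3
  autovacuum = off                 # test b), default on
  temp_buffers = 200MB             # test c), default 8MB
  wal_sync_method = open_datasync  # test c), was fsync
  wal_buffers = 1024kB             # test c), default 64kB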


d) also like b) but:
temp_buffers=1000MB
wal_buffers=4096kB
checkpoint_segments=3
autovacuum=on

Result: Slower again, 36s.



I am not able to interpret these results in depth.





> > (C) My interpretation
> >
> > (1) Although the data disk G: sometimes hits 100%: All in
> > all it seems that neither the CPUs nor the data disk
> > (approx. 65%) nor the WAL disk (approx. 30%) are at their
> > limits. See also 1000 writes/s, 40MB/s write throughput.
> >
>
> I think it is alternating.  Whatever is causing the 25%
> CPU jump during the 'slow' periods is a clue.  Some
> process on the system must be increasing its time
> significantly in these bursts.  I suspect it is postgres
> flushing data from its shared_buffers to the OS.  8.2 is
> not very efficient at its ability to write out to the OS
> in a constant stream, and tends to be very 'bursty' like
> this.  I suspect that 8.3 or 8.4 would perform a lot
> better here, when tuned right.

Ok, I've managed to run the test on 8.4 here.
Unfortunately, there was nearly no improvement in speed.
For example, test 2d) completed in 35s.


> > The write throughput is still disappointing to me: Even if we
> > would find a way to avoid those 'interrupts' of disk
> > inactivity (see E 1), we are too far beyond serial disk
> > write throughput (20 MB/s data disk + 20 MB/s other (!) WAL

... or rather: 20MB/s data disk + 20MB/s unexplained
writes to the data disk + 20 MB/s WAL disk...

> > disk: is far below 100-200 MB/s resp. 40-70 MB/s).
> >
>
> Although reaching 'full' sequential throughput is very
> hard because not all of the writes are sequential, there
> is a rather large gap here.

Yes, it's a pity.


Thank you again so much.

 Felix


