Re: Inserting 8MB bytea: just 25% of disk perf used? - Mailing list pgsql-performance

From Scott Carey
Subject Re: Inserting 8MB bytea: just 25% of disk perf used?
Date
Msg-id BA8C2169-BB56-4148-8468-89DC8D3E50EF@richrelevance.com
Whole thread Raw
In response to Inserting 8MB bytea: just 25% of disk perf used?  ("fkater@googlemail.com" <fkater@googlemail.com>)
Responses Re: Inserting 8MB bytea: just 25% of disk perf used?
Re: Inserting 8MB bytea: just 25% of disk perf used?
List pgsql-performance
On Jan 20, 2010, at 5:32 AM, fkater@googlemail.com wrote:

>
>> Bulk inserts into an indexed table is always significantly
>> slower than inserting unindexed and then indexing later.
>
> Agreed. However, shouldn't this be included in the disk-time
> counters? If so, it should by near 100%.
>

Well, something is causing the system to alternate between CPU and disk bound here.
(see below).
It would be useful to see what affect the index has.


>
> THE TESTS:
>
> In the attachement you'll find 2 screenshots perfmon34.png
> and perfmon35.png (I hope 2x14 kb is o.k. for the mailing
> list).
>
> To explain it a bit:
>
>
> (A) The setting (to recall):
>
> WinXP, 4 CPU 3 Ghz, 4 GB RAM, Postgres 8.2.4, postgres
> system data on hd D:, postgres data on separate sata raid-0
> hd G: (serial write performance of 100-200 MB/s), pg_xlog
> symlinked to separate hd E: (sata 10.000 rpm); using libpq
> and PQexecParams() with $001.. notation and Format=1
> (binary) for doing 10 times a simple INSERT command to
> insert 80 MB of bytea data (so 800 MB in total) into a
> simple 5 col table (1 bytea with STORAGE EXTERNAL, rest
> unused cols, with oids, index on oid; all done locally.
>
>
> (B) perfmon34.png: Results/graphs (performance monitor):
>
Great data!

> (1) The upper dark/gray graph touching the 100% sometimes is
> "disk write time %" of the data disk G:
>
> (2) The yellow graph is nearly completly overpainted by (1)
> since it is "disk time %".
>
> (3) The brown graph below (1) is "Disk Write Byts/s" divided
> by 1.000.000, so around 40 MB/s average.
>

Looks like it is writing everything twice, or close to it.  Alternatively the index writes occupy half, but that is
unlikely.


> (4) The read graph is "Disk Time %" of the WAL drive E:,
> average approx. 30%.
>

WAL doesn't look like a bottleneck here, as other tests have shown.
A larger wal_buffers setting might lower this more, since your record overflows the buffer for sure.
You might want to change your test case to write records similar size to what you expect (do you expect 80MB?) and then
setwal_buffers up to the size of one checkpoint segment (16MB) if you expect larger data per transaction. 

> (5) Below (4) there is CPU time in total (average around
> 20%), all 4 CPUs counted -- and please beleave me: separate
> CPU logs shows CPUs peaks never above 40-50%. Watching the
> several postgres processes: 0-10% CPU usage.

What I see here is the following:
The system is I/O bound, or close, and then it is CPU bound.  Note how the CPU spikes up by about 25% (one CPU) when
thedisk time drops. 


>
> (6) The blue/cyan line is "Disk Writes/sec" divided by 100,
> so around 1000 writes/s max for the data drive G:
>
> (7) The pink graph (Disk reads/s of data disk G:) shows
> nearly zero activity.
>
> (8)
> Duration of it all 40s, so inserting 800MB is done at a
> speed of 20MB/s.
>
> (9) The other tool mentioned (DiskMon) tool had some
> problems to list all data writes in parallel. It continued
> to fill its list for 5 min.  after the test was done. I have
> not examined its results.
>
Yes, that tool by default will log a LOT of data.  It will be useful later if we want to figure out what
sort of writes happen during the peaks and valleys on your chart.
>
> (C) My interpretation
>
> (1) Although the data disk G: sometimes hits 100%: All in
> all it seems that neither the CPUs nor the data disk
> (approx. 65%) nor the WAL disk (approx. 30%) are at their
> limits. See also 1000 writes/s, 40MB/s write thoughput.
>

I think it is alternating.  Whatever is causing the 25% CPU jump during the 'slow' periods is a clue.  Some process on
thesystem must be increasing its time significantly in these bursts.  I suspect it is postgres flushing data from its
shared_buffersto the OS.  8.2 is not very efficient at its ability to write out to the OS in a constant stream, and
tendsto be very 'bursty' like this.  I suspect that 8.3 or 8.4 would perform a lot better here, when tuned right. 

> (2) The screenshot also demonstrates that the whole process
> is not done smoothly but seems to be interrupted: About 15
> times the data disk time% changes between 100% and ~40%.
>
>
> (D) perfmod35.png: Further tests (fsync=off)
>
> I repeated the whole thing with fsync=off. The result was
> remarkably better (35s instead of 40s, so 22.8 MB/s). The
> former 100% peaks of the data disk G: are now flat 100%
> tops for approx 50% of the time.
>
> See attachement perfmon35.png
>
>
> (E) Remaining questions
>
> (1)
> It seems that I am disk bound, however it is still a bit
> unclear what the system is waiting for during these
> significant 'interrupts' when neither the disk nor the CPU
> (nor postgress processes) seem to be at its limits.
>

I suspect it is the postgres checkpoint system interacting with the write volume and size of shared_buffers.
This interaction was significantly improved in 8.3.  If you can, running a contemporary version would probably help a
lot. 8.2.anything is rather old from a performance perspective. 
Adjusting the work_mem and checkpoint tuning would likely help as well.  If you can get the transaction to commit
beforethe data has hit disk the total write volume should be lower.  That means larger work_mem, or smaller write
chunksper transaction. 

> (2)
> The write troughput is still disappointing to me: Even if we
> would find a way to avoid those 'interrupts' of disk
> inactivity (see E 1), we are too far beyond serial disk
> write throughput (20 MB/s data disk + 20 MB/s other (!) WAL
> disk: is far below 100-200 MB/s resp. 40-70 MB/s).
>

Although reaching 'full' sequential throughput is very hard because not all of the writes are sequential, there is a
ratherlarge gap here.  

> (3)
> It would be nice to have an SQL script for the whole test. I
> failed though to read/create 80 MB data and insert it 10
> times in a loop.
>
>
> BTY: While writing data I've tested parallel readout via
> Gigabit Ethernet by an external linux client which performed
> greately and slowed down the write process for about 5s
> only!
>
>
> Felix
>
> <perfmon34.png><perfmon35.png>


pgsql-performance by date:

Previous
From: Greg Smith
Date:
Subject: Re: ext4 finally doing the right thing
Next
From: Greg Smith
Date:
Subject: Re: Inserting 8MB bytea: just 25% of disk perf used?