Re: checkpointer continuous flushing - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: checkpointer continuous flushing
Msg-id CAA4eK1+uDKCEzeOLzT5Sok3ukMjzy-ov-=QnZaOY0o3bCm9=Yw@mail.gmail.com
In response to Re: checkpointer continuous flushing  (Fabien COELHO <coelho@cri.ensmp.fr>)
Responses Re: checkpointer continuous flushing  (Fabien COELHO <coelho@cri.ensmp.fr>)
List pgsql-hackers

On Tue, Sep 1, 2015 at 5:30 PM, Fabien COELHO <coelho@cri.ensmp.fr> wrote:

Hello Amit,

About the disks: what kind of HDD (RAID? speed?)? HDD write cache?

Speed of Reads -
Timing cached reads:   27790 MB in  1.98 seconds = 14001.86 MB/sec
Timing buffered disk reads: 3830 MB in  3.00 seconds = 1276.55 MB/sec

Woops.... 14 GB/s and 1.2 GB/s?! Is this a *hard* disk??

Yes, there is no SSD in the system; I have confirmed the same. The drives are
spinning disks in a RAID configuration.
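
As a side note, the read figures above look like hdparm -tT output. A hedged
way to reproduce them and to answer the write-cache question (the /dev/sda
device path is an assumption; with hardware RAID the controller cache may
also matter) would be:

# cached vs. buffered read timings, as quoted above
sudo hdparm -tT /dev/sda

# report whether the drive's write cache is enabled
sudo hdparm -W /dev/sda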
 


Copy speed -

dd if=/dev/zero of=/tmp/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 1.30993 s, 1.6 GB/s

Woops, 1.6 GB/s write... same questions, "rotating plates"??

One thing to notice is that if I don't remove the output file (output.img), the
speed is much slower; see the output below. I think this means that in our case
we will get ~320 MB/s.

dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 1.28086 s, 1.7 GB/s

dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 6.72301 s, 319 MB/s

dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 6.73963 s, 319 MB/s

If I remove the file each time:

dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 1.2855 s, 1.7 GB/s

rm /data/akapila/output.img

dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 1.27725 s, 1.7 GB/s

rm /data/akapila/output.img

dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k
262144+0 records in
262144+0 records out
2147483648 bytes (2.1 GB) copied, 1.27417 s, 1.7 GB/s

rm /data/akapila/output.img


 
Looks more like several SSDs... Or is the file kept in memory and not yet committed to disk? Try a "sync" afterwards??
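
Following the "sync" suggestion, a hedged way to take the page cache out of
the measurement (same file path as in the runs above):

# ask dd itself to fdatasync() the file before reporting the rate
dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k conv=fdatasync

# or time the write together with an explicit sync of dirty pages
time sh -c 'dd if=/dev/zero of=/data/akapila/output.img bs=8k count=256k && sync'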


 
If these are SSD, or if there is some SSD cache on top of the HDD, I would not expect the patch to do much, because the SSD random I/O writes are pretty comparable to sequential I/O writes.

I would be curious whether flushing helps, though.


Yes, me too. I think we should try to reach a consensus on the exact scenarios
and configurations where this patch (or patches) can give a benefit, or where we
want to verify whether there is any regression, as I have access to this machine
for a very limited time.  The machine might get formatted soon for some other purpose.
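
As a concrete starting point for agreeing on a scenario, a hedged sketch of the
kind of run discussed in this thread (scale factor 300, 2-hour duration; the
database name and the client/thread counts are illustrative assumptions, not
values from this thread):

# initialize a scale-factor-300 database
pgbench -i -s 300 pgbench

# 2-hour run with per-minute progress; -c/-j are assumptions, tune to the machine
pgbench -c 64 -j 16 -T 7200 -P 60 pgbench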

 
max_wal_size=5GB

Hmmm... Maybe quite small given the average performance?

We can check with a larger value, but do you expect different results, and
if so, why?

Because checkpoints are either xlog-triggered (which depends on max_wal_size) or time-triggered (which depends on checkpoint_timeout). Given the high tps, I expect that the WAL fills very quickly and hence may trigger checkpoints every ... that is the question.
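
A hedged way to see which trigger actually fires during a run (log_checkpoints,
pg_stat_bgwriter and pg_reload_conf() are stock PostgreSQL; the bare psql
invocations assume a suitable default connection):

# log each checkpoint with its cause ("time" vs "xlog") and its timings
psql -c "ALTER SYSTEM SET log_checkpoints = on"
psql -c "SELECT pg_reload_conf()"

# or compare counters of time-triggered vs. requested (xlog/forced) checkpoints
psql -c "SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter"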

checkpoint_timeout=2min

This seems rather small. Are the checkpoints xlog or time triggered?

I wanted to test by triggering more checkpoints, but I can test with a
larger checkpoint interval as well, like 5 or 10 minutes. Any suggestions?

For a 2+ hour test, I would suggest 10 or 15 minutes.


Okay, let's keep it at 10 minutes.
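
If it helps, a hedged sketch of applying the values discussed so far
(connection details assumed; both GUCs only need a reload, not a restart):

psql -c "ALTER SYSTEM SET checkpoint_timeout = '10min'"
psql -c "ALTER SYSTEM SET max_wal_size = '5GB'"
psql -c "SELECT pg_reload_conf()"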

I don't think increasing shared_buffers would have any impact, because
8GB is sufficient for 300 scale factor data,

It fits at the beginning, but when updates and inserts are performed, Postgres adds new pages (update = delete + insert), and the deleted space is eventually reclaimed by vacuum later on.

Now if space is available in the page it is reused, so what really happens is not that simple...

At 8500 tps, the disk space extension for the tables may be up to 3 MB/s at the beginning; it would evolve over time, but should average at least about 0.6 MB/s (inserts into the history table, assuming updates are performed within the page).

So whether the database fits in 8 GB shared buffer during the 2 hours of the pgbench run is an open question.
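
That open question is easy to track during the run; a hedged check (the
database name pgbench is an assumption, as are the default pgbench table names):

# total on-disk size of the benchmark database vs. the 8GB of shared_buffers
psql -c "SELECT pg_size_pretty(pg_database_size('pgbench'))"

# growth of the two tables that matter most here
psql -c "SELECT pg_size_pretty(pg_total_relation_size('pgbench_accounts'))"
psql -c "SELECT pg_size_pretty(pg_total_relation_size('pgbench_history'))"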


With this kind of configuration, I have noticed that more than 80% of the
updates are HOT updates, so there is not much bloat; I think it won't
cross the 8GB limit, but I can still keep it at 32GB if you have any doubts.
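
The HOT-update ratio is also easy to confirm from the stats views; a hedged
query (pgbench default table names assumed):

# share of updates that were HOT, per pgbench table
psql -c "SELECT relname, n_tup_upd, n_tup_hot_upd, round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_pct FROM pg_stat_user_tables WHERE relname LIKE 'pgbench%'"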


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
