Thread: full_page_writes on SSD?
On Tue, Nov 24, 2015 at 12:48 PM, Marcin Mańk <marcin.mank@gmail.com> wrote:
> if SSDs have 4kB/8kB sectors, and we'd make the Postgres page
> size equal to the SSD page size, do we still need full_page_writes?

If an OS write of the PostgreSQL page size has no chance of being
partially persisted (a/k/a torn), I don't think full page writes
are needed. That seems likely to be true if pg page size matches
SSD sector size.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2015-11-24 13:09:58 -0600, Kevin Grittner wrote:
> On Tue, Nov 24, 2015 at 12:48 PM, Marcin Mańk <marcin.mank@gmail.com> wrote:
>
> > if SSDs have 4kB/8kB sectors, and we'd make the Postgres page
> > size equal to the SSD page size, do we still need full_page_writes?
>
> If an OS write of the PostgreSQL page size has no chance of being
> partially persisted (a/k/a torn), I don't think full page writes
> are needed. That seems likely to be true if pg page size matches
> SSD sector size.

At the very least it also needs to match the page size used by the OS
(4KB on x86).

But be generally wary of turning off fpw's if you use replication. Not
having them often turns an asynchronously batched write workload into
one containing a lot of synchronous, single-threaded reads. Even with
SSDs that can very quickly lead to not being able to keep up with
replay anymore.
On 11/24/2015 10:48 AM, Marcin Mańk wrote:
> I saw this:
> http://blog.pgaddict.com/posts/postgresql-on-ssd-4kb-or-8kB-pages
>
> It made me wonder: if SSDs have 4kB/8kB sectors, and we'd make the
> Postgres page size equal to the SSD page size, do we still need
> full_page_writes?

an SSD's actual write block is much much larger than that. they emulate
512 or 4k sectors, but they are not actually written in sector order,
rather new writes are accumulated in a buffer on the drive, then
written out to a whole block, and a sector mapping table is maintained
by the drive.

--
john r pierce, recycling bits in santa cruz
On 11/24/2015 08:14 PM, Andres Freund wrote:
> On 2015-11-24 13:09:58 -0600, Kevin Grittner wrote:
>> On Tue, Nov 24, 2015 at 12:48 PM, Marcin Mańk <marcin.mank@gmail.com> wrote:
>>
>>> if SSDs have 4kB/8kB sectors, and we'd make the Postgres page
>>> size equal to the SSD page size, do we still need
>>> full_page_writes?
>>
>> If an OS write of the PostgreSQL page size has no chance of being
>> partially persisted (a/k/a torn), I don't think full page writes
>> are needed. That seems likely to be true if pg page size matches
>> SSD sector size.
>
> At the very least it also needs to match the page size used by the
> OS (4KB on x86).

Right. I find this possibility (when the OS and SSD page sizes match)
interesting, exactly because it might make the storage resilient to
torn pages.

>
> But be generally wary of turning off fpw's if you use replication.
> Not having them often turns an asynchronously batched write workload
> into one containing a lot of synchronous, single-threaded reads.
> Even with SSDs that can very quickly lead to not being able to keep
> up with replay anymore.
>

I don't immediately see why that would happen? Can you elaborate?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 11/24/2015 08:40 PM, John R Pierce wrote:
> On 11/24/2015 10:48 AM, Marcin Mańk wrote:
>> I saw this:
>> http://blog.pgaddict.com/posts/postgresql-on-ssd-4kb-or-8kB-pages
>>
>> It made me wonder: if SSDs have 4kB/8kB sectors, and we'd make the
>> Postgres page size equal to the SSD page size, do we still need
>> full_page_writes?
>
>
> an SSD's actual write block is much much larger than that. they
> emulate 512 or 4k sectors, but they are not actually written in
> sector order, rather new writes are accumulated in a buffer on the
> drive, then written out to a whole block, and a sector mapping table
> is maintained by the drive.

I don't see how that's related to full_page_writes?

It's true that SSDs optimize the writes in various ways, generally
along the lines you described, because they do work with "erase blocks"
(generally 256kB - 1MB chunks) and such.

But the internal structure of SSD has very little to do with FPW
because what matters is whether the on-drive write cache is volatile or
not (SSD can't really work without it).

What matters (when it comes to resiliency to torn pages) is the page
size at the OS level, because that's what's being handed over to the
SSD.

Of course, there might be other benefits of further lowering page sizes
at the OS/database level (and AFAIK there are SSD drives that use pages
smaller than 4kB).

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
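As a rough illustration of the point that the OS page size is what matters here, the following Python sketch compares the OS page size with a PostgreSQL block size. The 8192-byte value is an assumption (the PostgreSQL default; a custom build may differ), and the helper name pages_per_block is hypothetical. It only shows the arithmetic; it says nothing about what a given drive actually guarantees.

```python
import os

# Assumption: 8192 bytes is the PostgreSQL default block size (BLCKSZ).
# A custom build may use a different value.
PG_BLOCK_SIZE = 8192


def pages_per_block(pg_block_size: int, os_page_size: int) -> int:
    """Return how many OS pages one PostgreSQL block spans (rounded up)."""
    return -(-pg_block_size // os_page_size)


if __name__ == "__main__":
    os_page = os.sysconf("SC_PAGE_SIZE")  # OS page size in bytes
    n = pages_per_block(PG_BLOCK_SIZE, os_page)
    print(f"OS page size: {os_page} bytes")
    print(f"One {PG_BLOCK_SIZE}-byte block spans {n} OS page(s)")
    if n > 1:
        # The block is handed to the OS as several pages, so the
        # precondition discussed in the thread is not met.
        print("Block size and OS page size do not match.")
```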
I looked into how SSDs work and how they need to be aligned.

My conclusion is that, in an ideal world, the various Linux tools would have a general --ebs=xxx switch to ensure alignment. Without one, the calculation has to be done by hand.

SSDs on the market have a page size of 4 or 8 kB. But SSDs also have another typical property: the EBS, or Erase Block Size. When the disk writes to a single sector, the drive electronics must read the whole erase block, modify it, and write it back to the drive.

Devices on the market come with various EBS sizes: 128, 256, 512, 1024, 1536, 2048 KiB, etc.

In my case, a Samsung 850 EVO, there are 8 kB pages and a 1536 KiB erase block.

So the first alignment problem: the partition should start on an erase block boundary. An --ebs switch in the partitioning tools would be practical for proper alignment; otherwise, calculate by hand. In my case, 1536 KiB = 3072 512-byte sectors.
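The hand calculation above can be sketched in Python. The 1536 KiB erase block size is the figure given in this message, not a verified specification, and the function names are hypothetical.

```python
def ebs_in_sectors(ebs_kib: int, sector_bytes: int = 512) -> int:
    """Convert an erase block size in KiB to a count of logical sectors."""
    return ebs_kib * 1024 // sector_bytes


def next_aligned_sector(min_start: int, ebs_kib: int, sector_bytes: int = 512) -> int:
    """Return the first sector >= min_start that lies on an erase block boundary."""
    step = ebs_in_sectors(ebs_kib, sector_bytes)
    return -(-min_start // step) * step  # round up to a multiple of step


if __name__ == "__main__":
    # 1536 KiB erase block, 512-byte sectors, as in the message above.
    print(ebs_in_sectors(1536))             # sectors per erase block
    print(next_aligned_sector(2048, 1536))  # first aligned start at or after sector 2048
```

The resulting sector number is what one would pass as the partition start in a partitioning tool.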
Things get complicated if you use mdadm RAID. Because the RAID superblock is located at the beginning of the RAID device and does not fill a whole erase block, it is practical to set --offset when creating the array so that the real filesystem starts at the next erase block from the beginning of the RAID device; that way the underlying filesystem is aligned as well. So --ebs=xxx in mdadm would be practical too.

And now ext4, with a block size of 4096. Because the SSD page size is 8 kB, it is practical to set the stride (the smallest unit the filesystem operates on within one disk) to 2 so that it fills an SSD page. The stripe size can be set as EBS/page size or as the whole EBS. It might also be useful to use an ext4 --offset to the EBS as well.

This should align the partition, the RAID, and the filesystem. Correct me if I am wrong.
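The ext4 numbers can be worked out the same way. This sketch takes the "whole EBS" reading of the stripe size and expresses both values in filesystem blocks, which is the unit used by the stride and stripe_width extended options of mke2fs. Whether these settings help on a single SSD is the suggestion made in this message, not something the sketch verifies, and the function name is hypothetical.

```python
def ext4_alignment(fs_block_bytes: int, ssd_page_bytes: int, ebs_kib: int):
    """Return (stride, stripe_width), both counted in filesystem blocks.

    stride       = filesystem blocks per SSD page
    stripe_width = filesystem blocks per erase block (the "whole EBS" variant)
    """
    stride = ssd_page_bytes // fs_block_bytes
    stripe_width = ebs_kib * 1024 // fs_block_bytes
    return stride, stripe_width


if __name__ == "__main__":
    # Values from the message: 4096-byte ext4 blocks, 8 kB SSD pages,
    # 1536 KiB erase block.
    stride, stripe_width = ext4_alignment(4096, 8192, 1536)
    print(f"-E stride={stride},stripe_width={stripe_width}")
```

With the other reading (EBS divided by SSD page size), the stripe size would be 1536 / 8 = 192 instead.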
And now it is the database storage engine's turn. I think that trying to write on erase block boundaries, in erase-block-sized amounts of data, may have some benefit, not in speed but in lower wear on the whole SSD.
---------- Original message ----------
From: Marcin Mańk <marcin.mank@gmail.com>
To: PostgreSQL <pgsql-general@postgresql.org>
Date: 24. 11. 2015 20:07:30
Subject: [GENERAL] full_page_writes on SSD?

I saw this: http://blog.pgaddict.com/posts/postgresql-on-ssd-4kb-or-8kB-pages

It made me wonder: if SSDs have 4kB/8kB sectors, and we'd make the Postgres page size equal to the SSD page size, do we still need full_page_writes?

Regards
Marcin Mańk
I use SSDs all the time for both my OS and my database and have had none of these problems.

However, I don't use an SSD for the OS's virtual memory.

From what I have read in this thread, there could also be a situation where the SSD has been overused and is hitting the limit of its auto-recovery.

It is well known that using SSDs for the OS's virtual memory causes them to wear out much more quickly.

To really test all this, one needs to use a brand new SSD. Also ensure you are not using the OS's virtual memory on the same SSD as the DB and its log file.

You might also want to double-check the language of the OS and of the installed PostgreSQL, as these determine the final size of memory used to read and write.
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of NTPT
Sent: 25 November 2015 12:10
To: Marcin Mańk
Cc: PostgreSQL
Subject: Re: [GENERAL] full_page_writes on SSD?
On 11/25/15 5:38 AM, Tomas Vondra wrote:
>> But be generally wary of turning off fpw's if you use replication.
>> Not having them often turns an asynchronously batched write workload
>> into one containing a lot of synchronous, single-threaded reads.
>> Even with SSDs that can very quickly lead to not being able to keep
>> up with replay anymore.
>>
>
> I don't immediately see why that would happen? Can you elaborate?

If there are no FPI records in WAL then recovery/replay has to read the
blocks from disk before it can apply the real WAL record. Way back in
the day, recovery would always do this... someone had the bright idea
around 8.0 to make use of the FPIs if they're present. IIRC that
resulted in order-of-magnitude improvements in recovery time in many
cases.

For SR, the effect might not be as large, if the slave is actively
being used, and if the queries hitting the slave tend to be grabbing
the same data that's being written on the master. In many environments
I expect that to be the case. But if it's not, it wouldn't surprise me
if it became very easy to lag a slave as replay constantly waited for
blocks to come in.

If running with full_page_writes turned off becomes remotely common
it'd probably be worth finding a way to pre-issue read requests to the
OS, similar to what we do in some cases if effective_io_concurrency > 1.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
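To illustrate what "pre-issuing read requests to the OS" could look like, here is a minimal Python sketch using posix_fadvise with POSIX_FADV_WILLNEED. It is not PostgreSQL code: the function name, the list of block numbers, and the 8192-byte block size are assumptions made for the example, and it only shows the general technique of hinting the kernel before the blocks are needed.

```python
import os

# Assumption: 8192-byte blocks, the PostgreSQL default.
BLOCK_SIZE = 8192


def prefetch_blocks(fd: int, block_numbers, block_size: int = BLOCK_SIZE) -> int:
    """Ask the OS to start reading the given blocks in the background.

    Returns the number of read hints issued. Duplicate block numbers
    are hinted only once. Requires a Unix platform with posix_fadvise.
    """
    issued = 0
    for blkno in sorted(set(block_numbers)):
        os.posix_fadvise(fd, blkno * block_size, block_size,
                         os.POSIX_FADV_WILLNEED)
        issued += 1
    return issued
```

A replay loop built on this idea would scan ahead in the incoming records, call the hint function for the blocks they touch, and only then apply them, so that the later synchronous reads are more likely to be served from the OS cache.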