PostgreSQL, Asynchronous I/O, Buffered I/O and why did fsync-gate not affect Oracle or MySQL? - Mailing list pgsql-general

Hi all,


On 2019/10/09, I posted a question here concerning PostgreSQL I/O, the
(primarily) Linux fsync problem and my lecturer's attitude to
PostgreSQL (text of that email is at the bottom of this post).


I asked why the fsync issue didn't affect Oracle and/or MySQL for example?

As far as I can see, this was because Oracle uses Direct I/O whereas
PostgreSQL uses Buffered I/O. I know that the issue has been resolved
and no longer affects currently supported versions - but I'm still
curious...


From here: https://www.percona.com/blog/2019/02/22/postgresql-fsync-failure-fixed-minor-versions-released-feb-14-2019/

> Whereas, writing the modified/dirty buffers to datafiles from shared buffers is always through Buffered IO.


Now, I'm not quite sure that I completely comprehend matters: Is there
a difference between Asynchronous I/O and Buffered I/O?

If so, could some kind person point me to referernces that explain the
difference?


But, my foggy brain aside, I read (can't find URL - paraphrasing):
PostgreSQL is happy to let the kernel take the load off the server and
look after I/O - that's fine, but you'd better be able to trust your
kernel.


However, MySQL also uses Asynchronous I/O by default - or does it?

From here:
https://dev.mysql.com/doc/refman/8.0/en/optimizing-innodb-diskio.html#:~:text=InnoDB%20uses%20the%20asynchronous%20I,which%20is%20enabled%20by%20default,
we have:

> InnoDB uses the asynchronous I/O subsystem (native AIO) on Linux to perform read-ahead and write requests for data
filepages. This behavior is controlled by the innodb_use_native_aio configuration option, which is enabled by default. 


Now, I haven't (AFAIK) seen references to problems with this fsync
issue on MySQL. Maybe they're so used to losing data, nobody noticed?
:-)


Seriously though, it does beg the question - why did this cause a
major issue for PostgreSQL but not for MySQL?


Is it because of (a) difference(s) between Asynchronous I/O and
Buffered I/O asked about above?


A couple of pointers (excuse the pun!) about this issue would be
great. Some stuff which shows the difference between Direct I/O and
the others would also be helpful. I seem to remember there was mention
of this for PostgreSQL but that it would be a mulit-year project. Why
so if PostgreSQL can already use Direct I/O for the WAL logs?

Obviously, I can search and I have been - but I'd appreciate material
from people here who can sort the wheat from the chaff and point me to
solid references. If here is not a suitable forum, then kindly
redirect me.

TIA and rgs,


Pól Ua...



=============================

> 2019/10/09 Is my lecturer wrong about PostgreSQL? I think he is!


I recently started a Masters in Computer Science (and not at the
institution in my email address).

One of my courses is "Advanced Databases" - yummy I thought - it's not
even compulsory for me but I just *_had_* to take this module. The
lecturer is a bit of an Oracle fan-boy (ACE director no less...
hmmm...) and I want(ed) - becoming less enthusiasic by the minute - to
do my dissertation with him. So, we're having a chat and I make plain
my love of good 'ol PostgreSQL as my RDBMS of choice and he tells me
that there are problems with random block corruption with PostgreSQL.
I said "really" and before that conversation could go any further,
another student came over and asked a question.

So, I toddled off and did some research - I had heard something about
this before (vague fuzzy memories) of a problem with the Linux kernel
so I searched for a bit and duly dug up a couple of pages

https://lwn.net/Articles/752063/ : PostgreSQL's fsync() surprise - and

https://news.ycombinator.com/item?id=19238121 : Linux Fsync Issue for
Buffered IO and Its Preliminary Fix for PostgreSQL

So, this week I go back to my lecturer and say, yep, there was some
issue but it was a Linux kernel problem and not PostgreSQL's fault and
has been resolved.

He tells me that he knew about that but that there was another issue
(he had "spoken to people" at meetings!). I said "well, why isn't it
fixed?" and he replied "where's the impetus?" to which I responded
(quite shocked at this stage) something like "well, I know that the
core team values correctness very highly" to which he came back with
"yes, but they have no commercial imperative to fix anything - they
have to wait until somebody is capable enough and interested enough to
do the work". He then muttered something about this mysterious flaw
having been fixed in EnterpriseDB.

At this point, I lost interest. Having lurked on lists and going by my
general "gut feeling" - if there was a serious issue causing
irrecoverable block corruption, I'm pretty sure that it would be "all
hands on deck" until this problem had been solved and "nice-to-haves"
(GENERATED AS... for example) would have been parked till then.

Now, I have four questions:

1) Is my lecturer full of it or does he really have a point?

2) The actual concrete acknowledged problem with fsync that affected
PostgreSQL - why didn't it affect Oracle? Or MySQL? Or did it but it
was so rare that it never became apparent - it wasn't that obvious
with PostgreSQL either - one of those rare and intermittent problems?

3) Were there ever any problems with BSD?

4) What is the OS of choice for *_serious_* PostgreSQL installations?

I hope that I have been clear, but should anyone require any
clarification, please don't hesitate to ask me.

Tia and rgs,

Pól...



pgsql-general by date:

Previous
From: Dmitry Dolgov
Date:
Subject: Re: Why is writing JSONB faster than just JSON?
Next
From: Pól Ua Laoínecháin
Date:
Subject: Is this the future of I/O for the RDBMS?