Thread: Anything to be gained from a 'Postgres Filesystem'?

Anything to be gained from a 'Postgres Filesystem'?

From

"Matt Clark"

Date:

21 October 2004, 08:58:23

I suppose I'm just idly wondering really.  Clearly it's against PG
philosophy to build an FS or direct IO management into PG, but now it's so
relatively easy to plug filesystems into the main open-source OSes, it
struck me that there might be some useful changes to, say, XFS or ext3, that
could be made that would help PG out.

I'm thinking along the lines of an FS that's aware of PG's strategies and
requirements and therefore optimised to make those activities as efficient
as possible - possibly even being aware of PG's disk layout and treating
files differently on that basis.

Not being an FS guru I'm not really clear on whether this would help much
(enough to be worth it anyway) or not - any thoughts?  And if there were
useful gains to be had, would it need a whole new FS or could an existing
one be modified?

So there might be (as I said, I'm not an FS guru...):
* great append performance for the WAL?
* optimised scattered writes for checkpointing?
* Knowledge that FSYNC is being used for preserving ordering a lot of the
time, rather than requiring actual writes to disk (so long as the writes
eventually happen in order...)?


Matt



Matt Clark
Ymogen Ltd
P: 0845 130 4531
W: https://ymogen.net/
M: 0774 870 1584

Re: Anything to be gained from a 'Postgres Filesystem'?

From

"Leeuw van der, Tim"

Date:

21 October 2004, 09:27:46

Hiya,

Looking at that list, I got the feeling that you'd want to push that PG-awareness down into the block-io layer as well,
then, so as to be able to optimise for (perhaps) conflicting goals depending on what the app does; for the IO system to
be able to read the app's mind it needs to have some knowledge of what the app is / needs / wants and I get the
impression that this awareness needs to go deeper than the FS only.

--Tim

(But you might have time to rewrite Linux/BSD as a PG-OS? just kidding!)


Re: Anything to be gained from a 'Postgres Filesystem'?

From

Pierre-Frédéric Caillaud

Date:

21 October 2004, 09:32:52

    Reiser4 ?


Re: Anything to be gained from a 'Postgres Filesystem'?

From

"Matt Clark"

Date:

21 October 2004, 09:38:53

> Looking at that list, I got the feeling that you'd want to
> push that PG-awareness down into the block-io layer as well,
> then, so as to be able to optimise for (perhaps) conflicting
> goals depending on what the app does; for the IO system to be
> able to read the app's mind it needs to have some knowledge of
> what the app is / needs / wants and I get the impression that
> this awareness needs to go deeper than the FS only.

That's a fair point, it would need to be a kernel patch really, although not
necessarily a very big one, more a case of looking at FDs and if they're
flagged in some way then get the PGfs to do the job instead of/as well as
the normal code path.

Re: Anything to be gained from a 'Postgres Filesystem'?

From

"Steinar H. Gunderson"

Date:

21 October 2004, 11:27:32

On Thu, Oct 21, 2004 at 08:58:01AM +0100, Matt Clark wrote:
> I suppose I'm just idly wondering really.  Clearly it's against PG
> philosophy to build an FS or direct IO management into PG, but now it's so
> relatively easy to plug filesystems into the main open-source OSes, it
> struck me that there might be some useful changes to, say, XFS or ext3, that
> could be made that would help PG out.

This really sounds like a poor replacement for just making PostgreSQL use raw
devices to me. (I have no idea why that isn't done already, but presumably it
isn't all that easy to get right. :-) )

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: Anything to be gained from a 'Postgres Filesystem'?

From

"Leeuw van der, Tim"

Date:

21 October 2004, 11:44:29

Hi,

I guess the difference is in 'severe hacking inside PG' vs. 'some unknown amount of hacking that doesn't touch PG
code'.

Hacking PG internally to handle raw devices will meet with strong resistance from large portions of the development
team. I don't expect (m)any core devs of PG will be excited about rewriting the entire I/O architecture of PG and
duplicating large amounts of OS type of code inside the application, just to try to attain an unknown performance
benefit.

PG doesn't use one big file, as some databases do, but many small files. Now PG would need to be able to do
file-management, if you put the PG database on a raw disk partition! That's icky stuff, and you'll find much resistance
against putting such code inside PG.
So why not try to have the external FS know a bit about PG and its directory-layout, and its IO requirements? Then
that type of code can at least be maintained outside the application, and will not be as much of a burden to the rest of
the application.

(I'm not sure if it's a good idea to create a PG-specific FS in your OS of choice, but it's certainly gonna be easier
than getting FS code inside of PG)

cheers,

--Tim


Re: Anything to be gained from a 'Postgres Filesystem'?

From

Aaron Werman

Date:

21 October 2004, 12:47:18

The intuitive thing would be to put pg into a file system.

/Aaron



--

Regards,
/Aaron

Re: Anything to be gained from a 'Postgres Filesystem'?

From

Neil Conway

Date:

21 October 2004, 13:02:14

Matt Clark wrote:
> I'm thinking along the lines of an FS that's aware of PG's strategies and
> requirements and therefore optimised to make those activities as efficient
> as possible - possibly even being aware of PG's disk layout and treating
> files differently on that basis.

As someone else noted, this doesn't belong in the filesystem (rather the
kernel's block I/O layer/buffer cache). But I agree, an API by which we
can tell the kernel what kind of I/O behavior to expect would be good.
The kernel needs to provide good behavior for a wide range of
applications, but the DBMS can take advantage of a lot of
domain-specific information. In theory, being able to pass that
domain-specific information on to the kernel would mean we could get
better performance without needing to reimplement large chunks of
functionality that really ought to be done by the kernel anyway (as
implementing raw I/O would require, for example). On the other hand, it
would probably mean adding a fair bit of OS-specific hackery, which
we've largely managed to avoid in the past.

The closest API to what you're describing that I'm aware of is
posix_fadvise(). While that is technically-speaking a POSIX standard, it
is not widely implemented (I know Linux 2.6 implements it; based on some
quick googling, it looks like AIX does too). Using posix_fadvise() has
been discussed in the past, so you might want to search the archives. We
could use FADV_SEQUENTIAL to request more aggressive readahead on a file
that we know we're about to sequentially scan. We might be able to use
FADV_NOREUSE on the WAL. We might be able to get away with specifying
FADV_RANDOM for indexes all of the time, or at least most of the time.
One question is how this would interact with concurrent access (AFAICS
there is no way to fetch the "current advice" on an fd...)
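
For illustration, the hints listed above could be issued roughly as follows.
This is only a minimal sketch in Python, whose os.posix_fadvise() wraps the
same call; it is not PostgreSQL code. The helper names advise() and
sequential_scan() are made up for the example, and the sketch assumes a
platform that exposes the call (it simply reports False otherwise).

```python
import os


def advise(fd, pattern):
    """Pass an access-pattern hint covering the whole file.

    Returns False when the platform or filesystem does not accept the hint.
    """
    if not hasattr(os, "posix_fadvise"):
        return False  # no posix_fadvise on this platform
    advice = {
        "sequential": os.POSIX_FADV_SEQUENTIAL,  # ask for more readahead
        "random": os.POSIX_FADV_RANDOM,          # readahead is not useful
        "noreuse": os.POSIX_FADV_NOREUSE,        # data will be touched once
    }[pattern]
    try:
        # offset 0 with length 0 means "from the start to the end of the file"
        os.posix_fadvise(fd, 0, 0, advice)
    except OSError:
        return False
    return True


def sequential_scan(path, block_size=8192):
    """Read a file front to back after hinting sequential access."""
    fd = os.open(path, os.O_RDONLY)
    try:
        advise(fd, "sequential")  # advisory only; the read works either way
        total = 0
        while True:
            block = os.read(fd, block_size)
            if not block:
                break
            total += len(block)
        return total
    finally:
        os.close(fd)
```

The hint is attached to the open file description, which is why concurrent
users of one fd with different access patterns would be a concern.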

Also, I would imagine Win32 provides some means to inform the kernel
about your expected I/O pattern, but I haven't checked. Does anyone know
of any other relevant APIs?

-Neil

Re: Anything to be gained from a 'Postgres Filesystem'?

From

"Steinar H. Gunderson"

Date:

21 October 2004, 14:45:21

On Thu, Oct 21, 2004 at 12:44:10PM +0200, Leeuw van der, Tim wrote:
> Hacking PG internally to handle raw devices will meet with strong
> resistance from large portions of the development team. I don't expect
> (m)any core devs of PG will be excited about rewriting the entire I/O
> architecture of PG and duplicating large amounts of OS type of code inside
> the application, just to try to attain an unknown performance benefit.

Well, at least I see people claiming >30% difference between different file
systems, but no, I'm not shouting "bah, you'd better do this or I'll warez
Oracle" :-) I have no idea how much you can improve over the "best"
filesystems out there, but having two layers of journalling (both WAL _and_
FS journalling) on top of each other doesn't make all that much sense to me.
:-)

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: Anything to be gained from a 'Postgres Filesystem'?

From

Tom Lane

Date:

21 October 2004, 15:21:12

"Steinar H. Gunderson" <sgunderson@bigfoot.com> writes:
> ... I have no idea how much you can improve over the "best"
> filesystems out there, but having two layers of journalling (both WAL _and_
> FS journalling) on top of each other doesn't make all that much sense to me.

Which is why setting the FS to journal metadata but not file contents is
often suggested as best practice for a PG-only filesystem.

            regards, tom lane

Re: Anything to be gained from a 'Postgres Filesystem'?

From

Jan Dittmer

Date:

21 October 2004, 16:02:32

Neil Conway wrote:
> Also, I would imagine Win32 provides some means to inform the kernel
> about your expected I/O pattern, but I haven't checked. Does anyone know
> of any other relevant APIs?

See CreateFile, Parameter dwFlagsAndAttributes

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/base/createfile.asp

There is FILE_FLAG_NO_BUFFERING, FILE_FLAG_OPEN_NO_RECALL,
FILE_FLAG_RANDOM_ACCESS and even FILE_FLAG_POSIX_SEMANTICS

Jan

Re: Anything to be gained from a 'Postgres Filesystem'?

From

"Steinar H. Gunderson"

Date:

21 October 2004, 16:49:33

On Thu, Oct 21, 2004 at 10:20:55AM -0400, Tom Lane wrote:
>> ... I have no idea how much you can improve over the "best"
>> filesystems out there, but having two layers of journalling (both WAL _and_
>> FS journalling) on top of each other doesn't make all that much sense to me.
> Which is why setting the FS to journal metadata but not file contents is
> often suggested as best practice for a PG-only filesystem.

Mm, but you still journal the metadata. Oh well, noatime etc.. :-)

By the way, I'm probably hitting a FAQ here, but would O_DIRECT help
PostgreSQL any, given large enough shared_buffers?

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: Anything to be gained from a 'Postgres Filesystem'?

From

Sean Chittenden

Date:

21 October 2004, 19:40:49

> As someone else noted, this doesn't belong in the filesystem (rather
> the kernel's block I/O layer/buffer cache). But I agree, an API by
> which we can tell the kernel what kind of I/O behavior to expect would
> be good.
[snip]
> The closest API to what you're describing that I'm aware of is
> posix_fadvise(). While that is technically-speaking a POSIX standard,
> it is not widely implemented (I know Linux 2.6 implements it; based on
> some quick googling, it looks like AIX does too).

Don't forget about the existence/usefulness/widely implemented
madvise(2)/posix_madvise(2) call, which can give the OS the following
hints: MADV_NORMAL, MADV_SEQUENTIAL, MADV_RANDOM, MADV_WILLNEED,
MADV_DONTNEED, and MADV_FREE.  :)  -sc
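
As a rough illustration of the madvise() route, the sketch below maps a file
and hints sequential access before walking it. It uses Python's mmap module
(mmap.madvise() needs Python 3.8 or later), and the function name
scan_mapped() is invented for the example. Note that this only applies to
memory-mapped files, which is an assumption of the sketch rather than how
PostgreSQL reads its data.

```python
import mmap
import os


def scan_mapped(path, step=65536):
    """Map a file read-only, hint sequential access, and sum its bytes."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        if size == 0:
            return 0  # an empty file cannot be mapped
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # The hint is optional: skip it where the platform lacks it.
            if hasattr(m, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
                m.madvise(mmap.MADV_SEQUENTIAL)
            total = 0
            for off in range(0, size, step):
                total += sum(m[off:off + step])
            return total
```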

--
Sean Chittenden

Re: Anything to be gained from a 'Postgres Filesystem'?

From

"Jim C. Nasby"

Date:

21 October 2004, 23:20:19

Note that most people are now moving away from raw devices for databases
in most applications. The relatively small performance gain isn't worth
the hassles.

On Thu, Oct 21, 2004 at 12:27:27PM +0200, Steinar H. Gunderson wrote:
> On Thu, Oct 21, 2004 at 08:58:01AM +0100, Matt Clark wrote:
> > I suppose I'm just idly wondering really.  Clearly it's against PG
> > philosophy to build an FS or direct IO management into PG, but now it's so
> > relatively easy to plug filesystems into the main open-source OSes, it
> > struck me that there might be some useful changes to, say, XFS or ext3, that
> > could be made that would help PG out.
>
> This really sounds like a poor replacement for just making PostgreSQL use raw
> devices to me. (I have no idea why that isn't done already, but presumably it
> isn't all that easy to get right. :-) )
>
> /* Steinar */
> --
> Homepage: http://www.sesse.net/

--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: Anything to be gained from a 'Postgres Filesystem'?

From

Markus Schaber

Date:

04 November 2004, 11:01:11

Hi, Leeuw,

On Thu, 21 Oct 2004 12:44:10 +0200
"Leeuw van der, Tim" <tim.leeuwvander@nl.unisys.com> wrote:

> (I'm not sure if it's a good idea to create a PG-specific FS in your
> OS of choice, but it's certainly gonna be easier than getting FS code
> inside of PG)

I don't think PG really needs a specific FS. I rather think that PG
could profit from some functionality that's missing in traditional UN*X
file systems.

posix_fadvise(2) may be a candidate. Read/write barriers are another one, as
well as syncing a bunch of data in different files with a single call
(so that the OS can determine the best write order). I can also imagine
some interaction with the FS journalling system (to avoid duplicate
efforts).

We should create a list of those needs, and then communicate those to
the kernel/fs developers. Then we (as well as other apps) can make use
of those features where they are available, and use the old way
everywhere else.

Maybe Reiser4 is a step in the right direction, and maybe even a postgres
plugin for Reiser4 will be worth the effort. Maybe XFS/JFS etc. already
have such capabilities. Maybe that's completely wrong.

cheers,
Markus

--
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:schabios@logi-track.com | www.logi-track.com

Re: Anything to be gained from a 'Postgres Filesystem'?

From

Pierre-Frédéric Caillaud

Date:

04 November 2004, 12:29:16

> posix_fadvise(2) may be a candidate. Read/write barriers are another one, as
> well as syncing a bunch of data in different files with a single call
> (so that the OS can determine the best write order). I can also imagine
> some interaction with the FS journalling system (to avoid duplicate
> efforts).

    There is also the fact that syncing after every transaction could be
changed to syncing every N transactions (N fixed or depending on the data
size written by the transactions) which would be more efficient than the
current behaviour with a sleep. HOWEVER suppressing the sleep() would lead
to postgres returning from the COMMIT while it is in fact not synced,
which somehow rings a huge alarm bell somewhere.
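
To make the trade-off concrete, here is a toy sketch of "sync every N
records" on an append-only log. It is not how PostgreSQL's WAL works; the
class BatchedLog and its fields are invented for the illustration. The point
it tries to show is the alarm bell above: a record may only be reported as
committed once an fsync covering it has returned.

```python
import os


class BatchedLog:
    """Append-only log that calls fsync once per batch of records."""

    def __init__(self, path, batch_size):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        self.batch_size = batch_size
        self.pending = 0   # records written but not yet fsynced
        self.durable = 0   # records covered by a completed fsync
        self.syncs = 0     # number of fsync calls issued

    def append(self, record):
        """Write one record; return True only if it is durable on return."""
        os.write(self.fd, record + b"\n")
        self.pending += 1
        if self.pending >= self.batch_size:
            self.flush()
        return self.pending == 0

    def flush(self):
        """Force all pending records to disk with a single fsync."""
        if self.pending:
            os.fsync(self.fd)
            self.syncs += 1
            self.durable += self.pending
            self.pending = 0

    def close(self):
        self.flush()
        os.close(self.fd)
```

With a batch size of 3, seven appends cost two fsyncs plus one more at
close, but the callers of the first two records in each batch must wait (or
be told "not yet") until the third arrives.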

    What about read order ?
    This could be very useful for SELECT queries involving indexes, which in
case of a non-clustered table lead to random seeks in the table.
    There's fadvise to tell the OS to readahead on a seq scan (I think the OS
detects it anyway), but if there was a system call telling the OS "in the
next seconds I'm going to read these chunks of data from this file (gives
a list of offsets and lengths), could you put them in your cache in the
most efficient order without seeking too much, so that when I read() them
in random order, they will be in the cache already ?". This would be an
asynchronous call which would return immediately, just queuing up the data
somewhere in the kernel, and maybe sending a signal to the application
when a certain percentage of the data has been cached.
    PG could take advantage of this without many code changes, simply by
putting a fifo between the index scan and the tuple fetches, to wait the
time necessary for the OS to have enough reads to cluster them efficiently.
    On very large tables this would maybe not gain much, but on tables which
are explicitly clustered, or naturally clustered like accessing an index
on a serial primary key in order, it could be interesting.

    Just a thought.

Re: Anything to be gained from a 'Postgres Filesystem'?

From

Chris Browne

Date:

04 November 2004, 16:23:58

lists@boutiquenumerique.com (Pierre-Frédéric Caillaud) writes:
>> posix_fadvise(2) may be a candidate. Read/write barriers are another one, as
>> well as syncing a bunch of data in different files with a single call
>> (so that the OS can determine the best write order). I can also imagine
>> some interaction with the FS journalling system (to avoid duplicate
>> efforts).
>
>     There is also the fact that syncing after every transaction
> could be  changed to syncing every N transactions (N fixed or
> depending on the data  size written by the transactions) which would
> be more efficient than the  current behaviour with a sleep. HOWEVER
> suppressing the sleep() would lead  to postgres returning from the
> COMMIT while it is in fact not synced,  which somehow rings a huge
> alarm bell somewhere.
>
>     What about read order ?
>     This could be very useful for SELECT queries involving
> indexes, which in  case of a non-clustered table lead to random seeks
> in the table.

Another thing that would be valuable would be to have some way to say:

  "Read this data; don't bother throwing other data out of the cache
   to stuff this in."

Something like a "read_uncached()" call...

That would mean that a seq scan or a vacuum wouldn't force useful data
out of cache.
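
One approximation of such a call, on systems that implement posix_fadvise(),
is to read normally and then tell the kernel the pages are no longer needed.
The sketch below is an assumption-laden illustration in Python, and the
generator read_uncached() is just a name borrowed from the idea above. It is
not quite the same thing: POSIX_FADV_DONTNEED is advisory, and it may also
drop pages that some other process had already cached.

```python
import os


def read_uncached(path, block_size=1 << 20):
    """Yield a file's blocks, hinting after each that it can leave the cache."""
    can_advise = hasattr(os, "posix_fadvise")
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while True:
            block = os.read(fd, block_size)
            if not block:
                break
            yield block
            if can_advise:
                try:
                    # Advisory only: the kernel is free to ignore this.
                    os.posix_fadvise(fd, offset, len(block),
                                     os.POSIX_FADV_DONTNEED)
                except OSError:
                    can_advise = False  # stop trying on this file
            offset += len(block)
    finally:
        os.close(fd)
```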
--
let name="cbbrowne" and tld="cbbrowne.com" in String.concat "@" [name;tld];;
http://www.ntlug.org/~cbbrowne/linuxxian.html
A VAX is virtually a computer, but not quite.

Re: Anything to be gained from a 'Postgres Filesystem'?

From

Simon Riggs

Date:

04 November 2004, 19:08:08

On Thu, 2004-11-04 at 15:47, Chris Browne wrote:

> Another thing that would be valuable would be to have some way to say:
>
>   "Read this data; don't bother throwing other data out of the cache
>    to stuff this in."
>
> Something like a "read_uncached()" call...
>
> That would mean that a seq scan or a vacuum wouldn't force useful data
> out of cache.

ARC does almost exactly those two things in 8.0.

Seq scans do get put in cache, but in a way that means they don't spoil
the main bulk of the cache.

--
Best Regards, Simon Riggs

Re: Anything to be gained from a 'Postgres Filesystem'?

From

"Steinar H. Gunderson"

Date:

04 November 2004, 19:20:29

On Thu, Nov 04, 2004 at 10:47:31AM -0500, Chris Browne wrote:
> Another thing that would be valuable would be to have some way to say:
>
>   "Read this data; don't bother throwing other data out of the cache
>    to stuff this in."
>
> Something like a "read_uncached()" call...

You mean, like, open(filename, O_DIRECT)? :-)

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: Anything to be gained from a 'Postgres Filesystem'?

From

Tom Lane

Date:

04 November 2004, 19:34:28

Simon Riggs <simon@2ndquadrant.com> writes:
> On Thu, 2004-11-04 at 15:47, Chris Browne wrote:
>> Something like a "read_uncached()" call...
>>
>> That would mean that a seq scan or a vacuum wouldn't force useful data
>> out of cache.

> ARC does almost exactly those two things in 8.0.

But only for Postgres' own shared buffers.  The kernel cache still gets
trashed, because we have no way to suggest to the kernel that it not
hang onto the data read in.

            regards, tom lane

Re: Anything to be gained from a 'Postgres Filesystem'?

From

Simon Riggs

Date:

04 November 2004, 20:43:57

On Thu, 2004-11-04 at 19:34, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Thu, 2004-11-04 at 15:47, Chris Browne wrote:
> >> Something like a "read_uncached()" call...
> >>
> >> That would mean that a seq scan or a vacuum wouldn't force useful data
> >> out of cache.
>
> > ARC does almost exactly those two things in 8.0.
>
> But only for Postgres' own shared buffers.  The kernel cache still gets
> trashed, because we have no way to suggest to the kernel that it not
> hang onto the data read in.

I guess a difference in viewpoints. I'm inclined to give most of the RAM
to PostgreSQL, since as you point out, the kernel is out of our control.
That way, we can do what we like with it - keep it or not, as we choose.

--
Best Regards, Simon Riggs

Re: Anything to be gained from a 'Postgres Filesystem'?

From

Tom Lane

Date:

04 November 2004, 21:01:20

Simon Riggs <simon@2ndquadrant.com> writes:
> On Thu, 2004-11-04 at 19:34, Tom Lane wrote:
>> But only for Postgres' own shared buffers.  The kernel cache still gets
>> trashed, because we have no way to suggest to the kernel that it not
>> hang onto the data read in.

> I guess a difference in viewpoints. I'm inclined to give most of the RAM
> to PostgreSQL, since as you point out, the kernel is out of our control.
> That way, we can do what we like with it - keep it or not, as we choose.

That's always been a Bad Idea for three or four different reasons, of
which ARC will eliminate no more than one.

            regards, tom lane

Re: Anything to be gained from a 'Postgres Filesystem'?

From

Christopher Browne

Date:

05 November 2004, 02:54:30

In an attempt to throw the authorities off his trail, schabios@logi-track.com (Markus Schaber) transmitted:
> We should create a list of those needs, and then communicate those
> to the kernel/fs developers. Then we (as well as other apps) can
> make use of those features where they are available, and use the old
> way everywhere else.

Which kernel/fs developers did you have in mind?  The ones working on
Linux?  Or FreeBSD?  Or DragonflyBSD?  Or Solaris?  Or AIX?

Please keep in mind that many of the PostgreSQL developers are BSD
folk that aren't particularly interested in creating bleeding edge
Linux capabilities.

Furthermore, I'd think long and hard before jumping into such a
_spectacularly_ bleeding edge kind of project.  The reason why you
would want this would be if you needed to get some margin of
performance.  I can't see wanting that without also wanting some
_assurance_ of system reliability, at which point I also want things
like vendor support.

If you've ever contacted Red Hat Software, you'd know that they very
nearly refuse to provide support for any filesystem other than ext3.
Use anything else and they'll make noises about not being able to
assure you of anything at all.

If you need high performance, you'd also want to use interesting sorts
of hardware.  Disk arrays, RAID controllers, that sort of thing.
Vendors of such things don't particularly want to talk to you unless
you're using a "supported" Linux distribution and a "supported"
filesystem.

Jumping into a customized filesystem that neither hardware nor
software vendors would remotely consider supporting just doesn't look
like a viable strategy to me.

> Maybe Reiser4 is a step into the right way, and maybe even a
> postgres plugin for Reiser4 will be worth the effort. Maybe XFS/JFS
> etc. already have such capabilities. Maybe that's completely wrong.

The capabilities tend to be redundant.  They tend to implement vaguely
similar transactional capabilities to what databases have to
implement.  The similarities are not close enough to eliminate either
variety of "commit" as redundant.
--
"cbbrowne","@","linuxfinances.info"
http://linuxfinances.info/info/linux.html
Rules of the  Evil Overlord #128. "I will not  employ robots as agents
of  destruction  if  there  is  any  possible way  that  they  can  be
re-programmed  or if their  battery packs  are externally  mounted and
easily removable." <http://www.eviloverlord.com/>

Re: Anything to be gained from a 'Postgres Filesystem'?

From

Christopher Browne

Date:

05 November 2004, 02:55:39

After a long battle with technology, simon@2ndquadrant.com (Simon Riggs), an earthling, wrote:
> On Thu, 2004-11-04 at 15:47, Chris Browne wrote:
>
>> Another thing that would be valuable would be to have some way to say:
>>
>>   "Read this data; don't bother throwing other data out of the cache
>>    to stuff this in."
>>
>> Something like a "read_uncached()" call...
>>
>> That would mean that a seq scan or a vacuum wouldn't force useful
>> data out of cache.
>
> ARC does almost exactly those two things in 8.0.
>
> Seq scans do get put in cache, but in a way that means they don't
> spoil the main bulk of the cache.

We're not talking about the same cache.

ARC does these exact things for _shared memory_ cache, and is the
obvious inspiration.

But it does more or less nothing about the way OS file buffer cache is
managed, and the handling of _that_ would be the point of modifying OS
filesystem semantics.
--
select 'cbbrowne' || '@' || 'linuxfinances.info';
http://www3.sympatico.ca/cbbrowne/oses.html
Have you ever considered beating yourself with a cluestick?

Re: Anything to be gained from a 'Postgres Filesystem'?

From

Neil Conway

Date:

05 November 2004, 04:29:14

On Fri, 2004-11-05 at 06:20, Steinar H. Gunderson wrote:
> You mean, like, open(filename, O_DIRECT)? :-)

This disables readahead (at least on Linux), which is certainly not what
we want: for the very case where we don't want to keep the data in cache
for a while (sequential scans, VACUUM), we also want aggressive
readahead.

-Neil

Re: Anything to be gained from a 'Postgres Filesystem'?

From

Neil Conway

Date:

05 November 2004, 04:34:32

On Thu, 2004-11-04 at 23:29, Pierre-Frédéric Caillaud wrote:
>     There is also the fact that syncing after every transaction could be
> changed to syncing every N transactions (N fixed or depending on the data
> size written by the transactions) which would be more efficient than the
> current behaviour with a sleep.

Uh, which "sleep" are you referring to?

Also, how would interacting with the filesystem's journal affect how
often we need to force-write the WAL to disk? (ISTM we need to sync
_something_ to disk when a transaction commits in order to maintain the
WAL invariant.)

>     There's fadvise to tell the OS to readahead on a seq scan (I think the OS
> detects it anyway)

Not perfectly, though; also, Linux will do a more aggressive readahead
if you tell it to do so via posix_fadvise().
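
As a rough illustration of that hint, here is a minimal sketch in Python, which exposes the call as os.posix_fadvise() on Unix platforms. The function name read_sequential and the 64 kB chunk size are made up for the example; the hint is advisory, and how much extra readahead it buys depends on the kernel.

```python
import os
import tempfile


def read_sequential(path, chunk_size=64 * 1024):
    """Read a file front to back after hinting sequential access."""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(os, "posix_fadvise"):
            try:
                # offset 0 with length 0 covers the whole file.
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
            except OSError:
                pass  # the hint is advisory; carry on without it
        while True:
            buf = os.read(fd, chunk_size)
            if not buf:
                break
            total += len(buf)
    finally:
        os.close(fd)
    return total


# Small demo on a scratch file (the 200000-byte size is arbitrary).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 200000)
demo_total = read_sequential(tmp.name)
os.unlink(tmp.name)
```

The reads themselves are unchanged; only the hint is added, so the code behaves the same on systems where the call is missing or ignored.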

> if there was a system call telling the OS "in the
> next seconds I'm going to read these chunks of data from this file (gives
> a list of offsets and lengths), could you put them in your cache in the
> most efficient order without seeking too much, so that when I read() them
> in random order, they will be in the cache already ?".

http://www.opengroup.org/onlinepubs/009695399/functions/posix_fadvise.html

POSIX_FADV_WILLNEED
        Specifies that the application expects to access the specified
        data in the near future.
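
A minimal sketch of how the "list of offsets and lengths" idea could map onto this hint, written in Python (os.posix_fadvise and os.pread, both Unix-only). The helper names prefetch_chunks and read_chunk are hypothetical, and issuing the hints in offset order is only an assumption about what suits the disk; the kernel is free to schedule the reads as it likes.

```python
import os
import tempfile


def prefetch_chunks(fd, chunks):
    """Hint that each (offset, length) range will be read soon."""
    hinted = 0
    if not hasattr(os, "posix_fadvise"):
        return hinted
    # Assumption: hinting in ascending offset order; the OS may reorder anyway.
    for offset, length in sorted(chunks):
        try:
            os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
            hinted += 1
        except OSError:
            pass  # advisory only; a failed hint does not affect correctness
    return hinted


def read_chunk(fd, offset, length):
    """Read one range; it may already be cached if the hint was honoured."""
    return os.pread(fd, length, offset)


# Demo: a scratch file with a recognisable byte pattern, read in random order.
data = bytes(i % 251 for i in range(256 * 1024))
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(data)

wanted = [(200000, 10), (0, 10), (100000, 10)]
fd = os.open(tmp.name, os.O_RDONLY)
try:
    demo_hinted = prefetch_chunks(fd, wanted)
    demo_reads = [read_chunk(fd, off, length) for off, length in wanted]
finally:
    os.close(fd)
    os.unlink(tmp.name)
```

The hint only starts the reads; the application still issues its own pread() calls afterwards, in whatever order it wants.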

-Neil

Re: Anything to be gained from a 'Postgres Filesystem'?

From

Neil Conway

Date:

05 November 2004, 04:42:03

On Fri, 2004-11-05 at 02:47, Chris Browne wrote:
> Another thing that would be valuable would be to have some way to say:
>
>   "Read this data; don't bother throwing other data out of the cache
>    to stuff this in."

This is similar, although not exactly the same thing:

http://www.opengroup.org/onlinepubs/009695399/functions/posix_fadvise.html

POSIX_FADV_NOREUSE
        Specifies that the application expects to access the specified
        data once and then not reuse it thereafter.
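
A minimal sketch of a read-once scan using this hint, in Python (os.posix_fadvise, Unix-only). The names advise and scan_once are invented for the example. Since the hint is advisory and some kernels have treated POSIX_FADV_NOREUSE as a no-op, the sketch also drops each range with POSIX_FADV_DONTNEED after reading it; that second step is an assumption about a workable fallback, not something proposed in this thread.

```python
import os
import tempfile


def advise(fd, offset, length, flag_name):
    """Apply a posix_fadvise hint by name; return False if unavailable."""
    flag = getattr(os, flag_name, None)
    if flag is None or not hasattr(os, "posix_fadvise"):
        return False
    try:
        os.posix_fadvise(fd, offset, length, flag)
    except OSError:
        return False  # hints are advisory, so failure is not fatal
    return True


def scan_once(path, chunk_size=64 * 1024):
    """Read a file once, telling the OS the data will not be reused."""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Length 0 means "to the end of the file".
        advise(fd, 0, 0, "POSIX_FADV_NOREUSE")
        offset = 0
        while True:
            buf = os.read(fd, chunk_size)
            if not buf:
                break
            total += len(buf)
            # Fallback: tell the OS the range just read is no longer needed.
            advise(fd, offset, len(buf), "POSIX_FADV_DONTNEED")
            offset += len(buf)
    finally:
        os.close(fd)
    return total


# Demo on a scratch file (the 300000-byte size is arbitrary).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"y" * 300000)
demo_scanned = scan_once(tmp.name)
os.unlink(tmp.name)
```

Note that DONTNEED drops the pages for every process using the file, not just this reader, which is part of why it is "not exactly the same thing" as the read_uncached() idea.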

-Neil

Re: Anything to be gained from a 'Postgres Filesystem'?

From

Markus Schaber

Date:

14 December 2004, 18:20:24

Hi, Christopher,
[sorry for the delay in my answer, we were rather busy these last weeks]

On Thu, 04 Nov 2004 21:29:04 -0500
Christopher Browne <cbbrowne@acm.org> wrote:

> In an attempt to throw the authorities off his trail, schabios@logi-track.com (Markus Schaber) transmitted:
> > We should create a list of those needs, and then communicate those
> > to the kernel/fs developers. Then we (as well as other apps) can
> > make use of those features where they are available, and use the old
> > way everywhere else.
>
> Which kernel/fs developers did you have in mind?  The ones working on
> Linux?  Or FreeBSD?  Or DragonflyBSD?  Or Solaris?  Or AIX?

All of them, and others (e.g. Windows).

Once we have a list of those needs, the advocates can talk to the OS
developers. Some OS developers will follow, others not.

Then the postgres folks (and other application developers that benefit
from these capabilities) can point interested users to our benchmarks and
tell them that Foox performs 3 times as fast as BaarOs because it
provides better support for database needs.

> Please keep in mind that many of the PostgreSQL developers are BSD
> folk that aren't particularly interested in creating bleeding edge
> Linux capabilities.

Then this should be motivation to add those things to BSD, maybe as a
patch or loadable module so it does not bloat mainstream. I personally
would prefer it to appear in BSD first, because in case it really pays
off, it won't be long until it appears in Linux as well :-)

> Jumping into a customized filesystem that neither hardware nor
> software vendors would remotely consider supporting just doesn't look
> like a viable strategy to me.

I did not vote for a custom filesystem, as the OP did. I did vote for
isolating a set of useful capabilities PostgreSQL could exploit, and
then trying to convince the kernel developers to include these capabilities,
so they are likely to be included in the main distributions.

I don't know about the BSD market, but I know that Redhat and SuSE often
ship their patched versions of the kernels (so then they officially
support the extensions), and most of this is likely to be included in
the mainstream later.

> > Maybe Reiser4 is a step into the right way, and maybe even a
> > postgres plugin for Reiser4 will be worth the effort. Maybe XFS/JFS
> > etc. already have such capabilities. Maybe that's completely wrong.
>
> The capabilities tend to be redundant.  They tend to implement vaguely
> similar transactional capabilities to what databases have to
> implement.  The similarities are not close enough to eliminate either
> variety of "commit" as redundant.

But a speed gain may be possible by coordinating DB and FS transactions.

Thanks,
Markus

--
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:schabios@logi-track.com | www.logi-track.com