Thread: Streaming base backups

Streaming base backups

From: Magnus Hagander

Attached is an updated streaming base backup patch, based off the work that
Heikki started. It includes support for tablespaces, permissions, progress
reporting and some actual documentation of the protocol changes (user
interface documentation is going to depend on exactly what the frontend
client will look like, so I'm holding off on that one for a while).

The basic implementation is: add a new command to the replication mode, called
BASE_BACKUP, that will initiate a base backup, stream the contents (in a
tar-compatible format) of the data directory and all tablespaces, and then
end the base backup, all in a single operation.
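
From the client's end, the conversation looks roughly like this (a minimal
libpq sketch; the exact BASE_BACKUP option syntax and the sequence of result
sets are still settling, so treat the command string and the single-stream
assumption here as illustrative):

    #include <stdio.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        char       *buf;
        int         len;
        PGconn     *conn;
        PGresult   *res;

        /* replication=true puts the connection in walsender mode */
        conn = PQconnectdb("host=master user=rep replication=true");
        if (PQstatus(conn) != CONNECTION_OK)
            return 1;

        /* the label is illustrative; the option list may differ */
        res = PQexec(conn, "BASE_BACKUP LABEL 'nightly'");
        if (PQresultStatus(res) != PGRES_COPY_OUT)
            return 1;
        PQclear(res);

        /* the tar-format contents arrive as plain COPY data messages */
        while ((len = PQgetCopyData(conn, &buf, 0)) > 0)
        {
            fwrite(buf, 1, len, stdout);
            PQfreemem(buf);
        }

        PQfinish(conn);
        return 0;
    }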

Other than the basic implementation, there is a small refactoring done of
pg_start_backup() and pg_stop_backup() splitting them into a "backend function"
that is easier to call internally and a "user facing function" that remains
identical to the previous one, and I've also added a pg_abort_backup()
internal-only function to get out of crashes while in backup mode in a safer
way (so it can be called from error handlers). Also, the walsender needs a
resource owner in order to call pg_start_backup().
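
The shape of the split, roughly (a sketch only - do_pg_start_backup() and
the exact signatures here are illustrative, the real ones are in the patch):

    #include "postgres.h"
    #include "fmgr.h"
    #include "access/xlog.h"
    #include "utils/builtins.h"

    /* user-facing function: same signature and behaviour as before */
    Datum
    pg_start_backup(PG_FUNCTION_ARGS)
    {
        text       *backupid = PG_GETARG_TEXT_P(0);
        bool        fast = PG_GETARG_BOOL(1);
        char       *backupidstr = text_to_cstring(backupid);
        XLogRecPtr  startpoint;
        char        location[64];

        /* the backend workhorse, callable from the walsender as well */
        startpoint = do_pg_start_backup(backupidstr, fast);

        snprintf(location, sizeof(location), "%X/%X",
                 startpoint.xlogid, startpoint.xrecoff);
        PG_RETURN_TEXT_P(cstring_to_text(location));
    }

    /* pg_abort_backup() is the internal-only error-path counterpart,
       kept simple enough to be callable from an error handler */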

I've implemented a frontend for this in pg_streamrecv, based on the assumption
that we want to include this in bin/ for 9.1 - it seems like a reasonable
place to put it. This can obviously be moved elsewhere if we want to.
That code needs a lot more cleanup, but I wanted to make sure I got the backend
patch out for review quickly. You can find the current WIP branch for
pg_streamrecv on my github page at https://github.com/mhagander/pg_streamrecv,
in the branch "baserecv". I'll be posting that as a separate patch once it's
been a bit more cleaned up (it does work now if you want to test it, though).


Some remaining thoughts and must-dos:

* Compression: Do we want to be able to compress the backups server-side? Or
  defer that to whenever we get compression in libpq? (You can still tunnel it
  through, for example, SSH to get compression if you want to.) My thinking is
  to defer it.
* Compression: We could still implement compression of the tar files in
  pg_streamrecv (probably easier, possibly more useful?)
* Windows support (need to implement readlink)
* Tar code is copied from pg_dump and modified. Should we try to factor it out
  into port/? There are changes in the middle of it, so it can't be done with
  the current calling points; it would need a refactor. I think it's not worth
  it, given how simple the format is (see the sketch after this list).
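
Since "how simple it is" keeps being the argument: a ustar header is one
512-byte block, and a directory entry (e.g. an empty pg_xlog/) can be
sketched in a few lines - illustrative code, not the pg_dump copy:

    #include <stdio.h>
    #include <string.h>

    static void
    tar_dir_header(char h[512], const char *name)
    {
        unsigned    sum = 0;
        int         i;

        memset(h, 0, 512);
        snprintf(h, 100, "%s/", name);        /* trailing slash = directory */
        memcpy(&h[100], "0000700 ", 8);       /* mode, octal ASCII */
        memcpy(&h[108], "0000000 ", 8);       /* uid */
        memcpy(&h[116], "0000000 ", 8);       /* gid */
        memcpy(&h[124], "00000000000 ", 12);  /* size: 0 for a directory */
        memcpy(&h[136], "00000000000 ", 12);  /* mtime */
        h[156] = '5';                         /* typeflag '5' = directory */
        memcpy(&h[257], "ustar\0" "00", 8);   /* magic + version */

        /* checksum is computed with the chksum field itself as spaces */
        memset(&h[148], ' ', 8);
        for (i = 0; i < 512; i++)
            sum += (unsigned char) h[i];
        snprintf(&h[148], 8, "%06o", sum);    /* 6 octal digits + NUL */
        h[155] = ' ';
    }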

Improvements I want to add, but that aren't required for basic operation:

* Stefan mentioned it might be useful to put some
  posix_fadvise(POSIX_FADV_DONTNEED) calls in the process that streams all
  the files out (see the sketch after this list). Seems useful, as long as
  that doesn't kick them out of the cache *completely* for other backends
  as well. Do we know if that is the case?
* Include all the necessary WAL files in the backup. This way we could generate
  a tar file that would work on its own - right now, you still need to set up
  log archiving (or use streaming repl) to get the remaining logfiles from the
  master. This is fine for replication setups, but not for backups.
  This would also require us to block recycling of WAL files during the backup,
  of course.
* Suggestion from Heikki: don't put backup_label in $PGDATA during the backup.
  Rather, include it just in the tar file. That way, if you crash during the
  backup, the master doesn't start recovery from the backup_label, which in
  the worst case leads to a failure to start up.
* Suggestion from Heikki: perhaps at some point we're going to need a full
  bison grammar for walsender commands.
* Relocation of tablespaces (can at least partially be done client-side)
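
Re the posix_fadvise() item above, what I have in mind is roughly the
following (hedged sketch: fwrite-to-stdout stands in for the COPY stream,
and whether DONTNEED evicts pages other backends still want is exactly the
open question):

    #define _XOPEN_SOURCE 600   /* for posix_fadvise */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    static void
    send_file_and_drop(const char *path)
    {
        char        buf[32768];
        ssize_t     n;
        int         fd = open(path, O_RDONLY);

        if (fd < 0)
            return;

        while ((n = read(fd, buf, sizeof(buf))) > 0)
            fwrite(buf, 1, (size_t) n, stdout);

        /* offset 0, len 0 means "the whole file" */
        (void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        close(fd);
    }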


--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: Streaming base backups

From: Stefan Kaltenbrunner

On 01/05/2011 02:54 PM, Magnus Hagander wrote:
[..]
> Some remaining thoughts and must-dos:
>
> * Compression: Do we want to be able to compress the backups server-side? Or
>    defer that to whenever we get compression in libpq? (you can still tunnel it
>    through for example SSH to get compression if you want to) My thinking is
>    defer it.
> * Compression: We could still implement compression of the tar files in
>    pg_streamrecv (probably easier, possibly more useful?)

Hmm, compression would be nice, but I don't think it is required for
this initial implementation.


> * Windows support (need to implement readlink)
> * Tar code is copied from pg_dump and modified. Should we try to factor it out
>    into port/? There are changes in the middle of it so it can't be done with
>    the current calling points, it would need a refactor. I think it's not worth
>    it, given how simple it is.
>
> Improvements I want to add, but that aren't required for basic operation:
>
> * Stefan mentioned it might be useful to put some
> posix_fadvise(POSIX_FADV_DONTNEED)
>    in the process that streams all the files out. Seems useful, as long as that
>    doesn't kick them out of the cache *completely*, for other backends as well.
>    Do we know if that is the case?

Well, my main concern is that a base backup done that way might blow up
the OS buffer cache, causing temporary performance issues.
This might be more serious with an in-core solution than with what
people use now, because a number of backup tools (like some
of the commercial backup solutions) employ various tricks to avoid that.
One interesting tidbit I found was:

http://insights.oetiker.ch/linux/fadvise/

which is very Linux-specific but interesting nevertheless...




Stefan


Re: Streaming base backups

From: Dimitri Fontaine

Magnus Hagander <magnus@hagander.net> writes:
> Attached is an updated streaming base backup patch, based off the work

Thanks! :)

> * Compression: Do we want to be able to compress the backups server-side? Or
>   defer that to whenever we get compression in libpq? (you can still tunnel it
>   through for example SSH to get compression if you want to) My thinking is
>   defer it.

Compression in libpq would be a nice way to solve it, later.

> * Compression: We could still implement compression of the tar files in
>   pg_streamrecv (probably easier, possibly more useful?)

What about pg_streamrecv | gzip > …, which has the big advantage of
being friendly to *any* compression command-line tool, whatever the
patents and licenses?

> * Stefan mentioned it might be useful to put some
> posix_fadvise(POSIX_FADV_DONTNEED)
>   in the process that streams all the files out. Seems useful, as long as that
>   doesn't kick them out of the cache *completely*, for other backends as well.
>   Do we know if that is the case?

Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
not already in SHM?

> * include all the necessary WAL files in the backup. This way we could generate
>   a tar file that would work on its own - right now, you still need to set up
>   log archiving (or use streaming repl) to get the remaining logfiles from the
>   master. This is fine for replication setups, but not for backups.
>   This would also require us to block recycling of WAL files during the backup,
>   of course.

Well, I would guess that if you're streaming the WAL files in parallel
while the base backup is taken, then you're able to have it all without
an archiving setup, and the server could still recycle them.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support


Re: Streaming base backups

From: Magnus Hagander

On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> Attached is an updated streaming base backup patch, based off the work
>
> Thanks! :)
>
>> * Compression: Do we want to be able to compress the backups server-side? Or
>>   defer that to whenever we get compression in libpq? (you can still tunnel it
>>   through for example SSH to get compression if you want to) My thinking is
>>   defer it.
>
> Compression in libpq would be a nice way to solve it, later.

Yeah, I'm pretty much set on postponing that one.


>> * Compression: We could still implement compression of the tar files in
>>   pg_streamrecv (probably easier, possibly more useful?)
>
> What about pg_streamrecv | gzip > …, which has the big advantage of
> being friendly to *any* compression command line tool, whatever patents
> and licenses?

That's part of what I meant by "easier and more useful".

Right now though, pg_streamrecv will output one tar file for each
tablespace, so you can't get it on stdout. But that can be changed of
course. The easiest step 1 is to just use gzopen() from zlib on the
files and use the same code as now :-)
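
I.e., something like this (sketch only; file name, flags and error handling
are illustrative):

    #include <zlib.h>

    /* open the output file compressed instead of plain... */
    static gzFile
    open_tarfile_compressed(const char *path)
    {
        return gzopen(path, "wb9");     /* "9" = best compression */
    }

    /* ...and the write path keeps the same shape as today */
    static int
    write_tar_chunk(gzFile f, const char *buf, int len)
    {
        return gzwrite(f, buf, (unsigned) len) == len ? 0 : -1;
    }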


>> * Stefan mentioned it might be useful to put some
>> posix_fadvise(POSIX_FADV_DONTNEED)
>>   in the process that streams all the files out. Seems useful, as long as that
>>   doesn't kick them out of the cache *completely*, for other backends as well.
>>   Do we know if that is the case?
>
> Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
> not already in SHM?

I think that's way more complex than we want to go here.


>> * include all the necessary WAL files in the backup. This way we could generate
>>   a tar file that would work on its own - right now, you still need to set up
>>   log archiving (or use streaming repl) to get the remaining logfiles from the
>>   master. This is fine for replication setups, but not for backups.
>>   This would also require us to block recycling of WAL files during the backup,
>>   of course.
>
> Well, I would guess that if you're streaming the WAL files in parallel
> while the base backup is taken, then you're able to have it all without
> archiving setup, and the server could still recycle them.

Yes, this was mostly for the use-case of "getting a single tarfile
that you can actually use to restore from without needing the log
archive at all".

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: Streaming base backups

From: Dimitri Fontaine

Magnus Hagander <magnus@hagander.net> writes:
>> Compression in libpq would be a nice way to solve it, later.
>
> Yeah, I'm pretty much set on postponing that one.

+1, in case it was not clear to whoever's counting the votes :)

>> What about pg_streamrecv | gzip > …, which has the big advantage of
>
> That's part of what I meant with "easier and more useful".

Well…

> Right now though, pg_streamrecv will output one tar file for each
> tablespace, so you can't get it on stdout. But that can be changed of
> course. The easiest step 1 is to just use gzopen() from zlib on the
> files and use the same code as now :-)

Oh if integrating it is easier :)

>> Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
>> not already in SHM?
>
> I think that's way more complex than we want to go here.

Yeah.

>> Well, I would guess that if you're streaming the WAL files in parallel
>> while the base backup is taken, then you're able to have it all without
>> archiving setup, and the server could still recycle them.
>
> Yes, this was mostly for the use-case of "getting a single tarfile
> that you can actually use to restore from without needing the log
> archive at all".

It also allows for a simpler kick-start procedure for preparing a
standby, and lets you stop worrying too much about wal_keep_segments
and archive servers.

When does the standby launch its walreceiver? It would be extra nice for
the base backup tool to optionally continue streaming WAL until the
standby starts doing it itself, so that wal_keep_segments is really
deprecated.  No idea how feasible that is, though.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support


Re: Streaming base backups

From: Heikki Linnakangas

On 06.01.2011 00:27, Dimitri Fontaine wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>>> What about pg_streamrecv | gzip > …, which has the big advantage of
>>
>> That's part of what I meant with "easier and more useful".
>
> Well…

One thing to keep in mind is that if you do compression in libpq for the 
transfer, and gzip the tar file in the client, that's quite inefficient. 
You compress the data once in the server, decompress in the client, then 
compress it again in the client.  If you're going to write the backup to 
a compressed file, and you want to transfer it compressed to save 
bandwidth, you want to gzip it in the server to begin with.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Re: Streaming base backups

From: Marti Raudsepp

On Wed, Jan 5, 2011 at 23:58, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
>> * Stefan mentioned it might be useful to put some
>> posix_fadvise(POSIX_FADV_DONTNEED)
>>   in the process that streams all the files out. Seems useful, as long as that
>>   doesn't kick them out of the cache *completely*, for other backends as well.
>>   Do we know if that is the case?
>
> Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
> not already in SHM?

It's not much of an improvement. For pages that we already have in
shared memory, OS cache is mostly useless. OS cache matters for pages
that *aren't* in shared memory.

Regards,
Marti


Re: Streaming base backups

From: Magnus Hagander

On Wed, Jan 5, 2011 at 23:27, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
> Magnus Hagander <magnus@hagander.net> writes:

>>> Well, I would guess that if you're streaming the WAL files in parallel
>>> while the base backup is taken, then you're able to have it all without
>>> archiving setup, and the server could still recycle them.
>>
>> Yes, this was mostly for the use-case of "getting a single tarfile
>> that you can actually use to restore from without needing the log
>> archive at all".
>
> It also allows for a simpler kick-start procedure for preparing a
> standby, and allows to stop worrying too much about wal_keep_segments
> and archive servers.
>
> When do the standby launch its walreceiver? It would be extra-nice for
> the base backup tool to optionally continue streaming WALs until the
> standby starts doing it itself, so that wal_keep_segments is really
> deprecated.  No idea how feasible that is, though.

I think we're inventing a whole lot of complexity that may not
be necessary at all. Let's do it the simple way and see how far we can
get with that one - we can always improve this for 9.2.

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: Streaming base backups

From: Heikki Linnakangas

On 05.01.2011 15:54, Magnus Hagander wrote:
> Attached is an updated streaming base backup patch, based off the work
> that Heikki started.
> ...
> I've implemented a frontend for this in pg_streamrecv, based on the assumption
> that we wanted to include this in bin/ for 9.1 - and that it seems like a
> reasonable place to put it. This can obviously be moved elsewhere if we want to.

Hmm, is there any point in keeping the two functionalities in the same
binary - taking the base backup, and streaming WAL to an archive
directory? Looks like the only common options between the two modes are
the connection string and the verbose flag. A separate pg_basebackup
binary would probably make more sense.

> That code needs a lot more cleanup, but I wanted to make sure I got the backend
> patch out for review quickly. You can find the current WIP branch for
> pg_streamrecv on my github page at https://github.com/mhagander/pg_streamrecv,
> in the branch "baserecv". I'll be posting that as a separate patch once it's
> been a bit more cleaned up (it does work now if you want to test it, though).

Looks like pg_streamrecv creates the pg_xlog and pg_tblspc directories,
because they're not included in the streamed tar. Wouldn't it be better
to include them in the tar as empty directories on the server side?
Otherwise, if you write the tar file to disk and untar it later, you
have to manually create them.

It would be nice to have an option in pg_streamrecv to specify the 
backup label to use.

An option to stream the tar to stdout instead of a file would be very 
handy too, so that you could pipe it directly to gzip for example. I 
realize you get multiple tar files if tablespaces are used, but even if 
you just throw an error in that case, it would be handy.

> * Suggestion from Heikki: perhaps at some point we're going to need a full
>    bison grammar for walsender commands.

Maybe we should at least start using the lexer; we're not quite there to 
need a full-blown grammar yet, but even a lexer might help.


BTW, looking at the WAL-streaming side of pg_streamrecv: if you start it
from scratch with an empty target directory, it needs to connect to the
"postgres" database to run pg_current_xlog_location(), and then
reconnect in replication mode. That's a bit awkward; there might not be
a "postgres" database, and even if there is, you might not have
permission to connect to it. It would be much better to have a variant
of the START_REPLICATION command on the server side that begins
streaming from the current location. Maybe just by leaving out the
start-location parameter.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Re: Streaming base backups

From: Magnus Hagander

On Thu, Jan 6, 2011 at 23:57, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 05.01.2011 15:54, Magnus Hagander wrote:
>>
>> Attached is an updated streaming base backup patch, based off the work
>> that Heikki started.
>> ...
>> I've implemented a frontend for this in pg_streamrecv, based on the
>> assumption
>> that we wanted to include this in bin/ for 9.1 - and that it seems like a
>> reasonable place to put it. This can obviously be moved elsewhere if we
>> want to.
>
> Hmm, is there any point in keeping the two functionalities in the same
> binary, taking the base backup and streaming WAL to an archive directory?
> Looks like the only common option between the two modes is passing the
> connection string, and the verbose flag. A separate pg_basebackup binary
> would probably make more sense.

Yeah, once I broke things apart for better readability, I started
leaning in that direction as well.

However, if you consider the things that Dimitri mentioned about
streaming at the same time as downloading, having them in the same one
would make more sense. I don't think that's something for now,
though...


>> That code needs a lot more cleanup, but I wanted to make sure I got the
>> backend
>> patch out for review quickly. You can find the current WIP branch for
>> pg_streamrecv on my github page at
>> https://github.com/mhagander/pg_streamrecv,
>> in the branch "baserecv". I'll be posting that as a separate patch once
>> it's
>> been a bit more cleaned up (it does work now if you want to test it,
>> though).
>
> Looks like pg_streamrecv creates the pg_xlog and pg_tblspc directories,
> because they're not included in the streamed tar. Wouldn't it be better to
> include them in the tar as empty directories at the server-side? Otherwise
> if you write the tar file to disk and untar it later, you have to manually
> create them.

Yeah, good point. Originally, the tar code (your tar code, btw :P)
didn't create *any* directories, so I stuck it in there. I agree it
should be moved to the backend patch now.


> It would be nice to have an option in pg_streamrecv to specify the backup
> label to use.

Agreed.


> An option to stream the tar to stdout instead of a file would be very handy
> too, so that you could pipe it directly to gzip for example. I realize you
> get multiple tar files if tablespaces are used, but even if you just throw
> an error in that case, it would be handy.

Makes sense.


>> * Suggestion from Heikki: perhaps at some point we're going to need a full
>>   bison grammar for walsender commands.
>
> Maybe we should at least start using the lexer; we're not quite there to
> need a full-blown grammar yet, but even a lexer might help.

Might. I don't speak flex very well, so I'm not really sure what that
would mean.


> BTW, looking at the WAL-streaming side of pg_streamrecv, if you start it
> from scratch with an empty target directory, it needs to connect to
> "postgres" database, to run pg_current_xlog_location(), and then reconnect
> in replication mode. That's a bit awkward, there might not be a "postgres"
> database, and even if there is, you might not have the permission to connect
> to it. It would be much better to have a variant of the START_REPLICATION
> command at the server-side that begins streaming from the current location.
> Maybe just by leaving out the start-location parameter.

Agreed. That part is unchanged from the one that runs against 9.0
though, where that wasn't a possibility. But adding something like
that to the walsender in 9.1 would be good.

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: Streaming base backups

From: Cédric Villemain

2011/1/5 Magnus Hagander <magnus@hagander.net>:
> On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
>> Magnus Hagander <magnus@hagander.net> writes:
>>> * Stefan mentioned it might be useful to put some
>>> posix_fadvise(POSIX_FADV_DONTNEED)
>>>   in the process that streams all the files out. Seems useful, as long as that
>>>   doesn't kick them out of the cache *completely*, for other backends as well.
>>>   Do we know if that is the case?
>>
>> Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
>> not already in SHM?
>
> I think that's way more complex than we want to go here.
>

DONTNEED will remove the block from OS buffer every time.

It should not be that hard to implement a snapshot (it needs mincore())
and to restore the previous state. I don't know how basebackup is
performed exactly... so perhaps I am wrong.

posix_fadvise support is already in postgresql core... we can start by
just doing a snapshot of the files before starting, or at some point
in the basebackup; it will need only 256kB per GB of data...
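
For illustration, such a snapshot can be pretty small (hedged sketch: one
byte per OS page from mincore(), so a 1GB segment at 4kB pages is 262144
bytes - that's where the 256kB per GB figure comes from):

    #include <fcntl.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static unsigned char *
    cache_snapshot(const char *path, size_t *npages)
    {
        struct stat     st;
        long            pagesize = sysconf(_SC_PAGESIZE);
        unsigned char  *vec = NULL;
        void           *map;
        int             fd = open(path, O_RDONLY);

        if (fd < 0)
            return NULL;
        if (fstat(fd, &st) < 0 || st.st_size == 0)
        {
            close(fd);
            return NULL;
        }

        *npages = ((size_t) st.st_size + pagesize - 1) / pagesize;
        map = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (map != MAP_FAILED)
        {
            vec = malloc(*npages);
            if (vec != NULL)
                mincore(map, (size_t) st.st_size, vec); /* bit 0 = resident */
            munmap(map, (size_t) st.st_size);
        }
        close(fd);
        return vec;
    }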
--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


Re: Streaming base backups

From: Simon Riggs

On Wed, 2011-01-05 at 14:54 +0100, Magnus Hagander wrote:

> The basic implementation is: Add a new command to the replication mode called
> BASE_BACKUP, that will initiate a base backup, stream the contents (in tar
> compatible format) of the data directory and all tablespaces, and then end
> the base backup in a single operation.

I'm a little dubious of the performance of that approach for some users,
though it does seem a popular idea.

One very useful feature will be some way of confirming the number and
size of files to transfer, so that the base backup client can find out
the progress.

It would also be good to avoid writing a backup_label file at all on the
master, so there was no reason why multiple concurrent backups could not
be taken. The current coding allows for the idea that the start and stop
might be in different sessions, whereas here we know we are in one
session.

--
 Simon Riggs           http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services



Re: Streaming base backups

From: Magnus Hagander

On Fri, Jan 7, 2011 at 02:15, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Wed, 2011-01-05 at 14:54 +0100, Magnus Hagander wrote:
>
>> The basic implementation is: Add a new command to the replication mode called
>> BASE_BACKUP, that will initiate a base backup, stream the contents (in tar
>> compatible format) of the data directory and all tablespaces, and then end
>> the base backup in a single operation.
>
> I'm a little dubious of the performance of that approach for some users,
> though it does seem a popular idea.

Well, it's of course only going to be an *option*. We should keep our
flexibility and allow the current ways as well.


> One very useful feature will be some way of confirming the number and
> size of files to transfer, so that the base backup client can find out
> the progress.

The patch already does this - or rather, as it's coded, it does this
once per tablespace.

It'll only give you an approximation of course, and that can change, but
it should be enough for the purposes of a progress indication.


> It would also be good to avoid writing a backup_label file at all on the
> master, so there was no reason why multiple concurrent backups could not
> be taken. The current coding allows for the idea that the start and stop
> might be in different sessions, whereas here we know we are in one
> session.

Yeah, I have that on the todo list suggested by Heikki. I consider it
a later phase though.


--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: Streaming base backups

From: Magnus Hagander

On Fri, Jan 7, 2011 at 01:47, Cédric Villemain
<cedric.villemain.debian@gmail.com> wrote:
> 2011/1/5 Magnus Hagander <magnus@hagander.net>:
>> On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
>>> Magnus Hagander <magnus@hagander.net> writes:
>>>> * Stefan mentioned it might be useful to put some
>>>> posix_fadvise(POSIX_FADV_DONTNEED)
>>>>   in the process that streams all the files out. Seems useful, as long as that
>>>>   doesn't kick them out of the cache *completely*, for other backends as well.
>>>>   Do we know if that is the case?
>>>
>>> Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
>>> not already in SHM?
>>
>> I think that's way more complex than we want to go here.
>>
>
> DONTNEED will remove the block from OS buffer every time.

Then we definitely don't want to use it - because some other backend
might well want the file. Better leave it up to the standard logic in
the kernel.

> It should not be that hard to implement a snapshot(it needs mincore())
> and to restore previous state. I don't know how basebackup is
> performed exactly...so perhaps I am wrong.

Uh, it just reads the files out of the filesystem. Just like you'd do
today, except it's now integrated and streams the data across a
regular libpq connection.

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: Streaming base backups

From: Garick Hamlin

On Thu, Jan 06, 2011 at 07:47:39PM -0500, Cédric Villemain wrote:
> 2011/1/5 Magnus Hagander <magnus@hagander.net>:
> > On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
> >> Magnus Hagander <magnus@hagander.net> writes:
> >>> * Stefan mentioned it might be useful to put some
> >>> posix_fadvise(POSIX_FADV_DONTNEED)
> >>>   in the process that streams all the files out. Seems useful, as long as that
> >>>   doesn't kick them out of the cache *completely*, for other backends as well.
> >>>   Do we know if that is the case?
> >>
> >> Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
> >> not already in SHM?
> >
> > I think that's way more complex than we want to go here.
> >
> 
> DONTNEED will remove the block from OS buffer every time.
> 
> It should not be that hard to implement a snapshot(it needs mincore())
> and to restore previous state. I don't know how basebackup is
> performed exactly...so perhaps I am wrong.
> 
> posix_fadvise support is already in postgresql core...we can start by
> just doing a snapshot of the files before starting, or at some point
> in the basebackup, it will need only 256kB per GB of data...

It is actually possible to be more scalable than the simple solution you
outline here (although that solution works pretty well).  

I've written a program that synchronizes the OS cache state using
mmap()/mincore() between two computers.  I haven't actually tested its
impact on performance yet, but I was surprised by how fast it actually runs
and how compact cache maps can be.

If one encodes the data so one remembers the number of zeros between 1s 
one, storage scale by the amount of memory in each size rather than the 
dataset size.  I actually played with doing that, then doing huffman 
encoding of that.  I get around 1.2-1.3 bits / page of _physical memory_ 
on my tests.

I don't have my notes handy, but here are some numbers from memory...

The obvious worst cases are 1 bit per page of _dataset_ or 19 bits per page
of physical memory in the machine.  The latter limit gets better, however,
since there are < 1024 symbols possible for the encoder (since in this
case symbols are spans of zeros that need to fit in a file that is 1 GB in
size).  So the actual worst case is much closer to 1 bit per page of
the dataset or ~10 bits per page of physical memory.  The real performance
I see with huffman is more like 1.3 bits per page of physical memory.  All the
encoding/decoding is actually very fast.  zlib would actually compress even
better than huffman, but the huffman encoder/decoder is pretty good and
very straightforward code.

I would like to integrate something like this into PG or perhaps even into
something like rsync, but it was written as a proof of concept and I haven't
had time to work on it recently.

Garick

> -- 
> Cédric Villemain               2ndQuadrant
> http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
> 


Re: Streaming base backups

From: Garick Hamlin

On Fri, Jan 07, 2011 at 10:26:29AM -0500, Garick Hamlin wrote:
> On Thu, Jan 06, 2011 at 07:47:39PM -0500, Cédric Villemain wrote:
> > 2011/1/5 Magnus Hagander <magnus@hagander.net>:
> > > On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
> > >> Magnus Hagander <magnus@hagander.net> writes:
> > >>> * Stefan mentioned it might be useful to put some
> > >>> posix_fadvise(POSIX_FADV_DONTNEED)
> > >>>   in the process that streams all the files out. Seems useful, as long as that
> > >>>   doesn't kick them out of the cache *completely*, for other backends as well.
> > >>>   Do we know if that is the case?
> > >>
> > >> Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
> > >> not already in SHM?
> > >
> > > I think that's way more complex than we want to go here.
> > >
> > 
> > DONTNEED will remove the block from OS buffer every time.
> > 
> > It should not be that hard to implement a snapshot(it needs mincore())
> > and to restore previous state. I don't know how basebackup is
> > performed exactly...so perhaps I am wrong.
> > 
> > posix_fadvise support is already in postgresql core...we can start by
> > just doing a snapshot of the files before starting, or at some point
> > in the basebackup, it will need only 256kB per GB of data...
> 
> It is actually possible to be more scalable than the simple solution you
> outline here (although that solution works pretty well).  
> 
> I've written a program that synchronizes the OS cache state using
> mmap()/mincore() between two computers.  I haven't actually tested its
> impact on performance yet, but I was surprised by how fast it actually runs
> and how compact cache maps can be.
> 
> If one encodes the data so one remembers the number of zeros between 1s 
> one, storage scale by the amount of memory in each size rather than the 

Sorry for the typos, that should read:

the storage scales by the number of pages resident in memory rather than the 
total dataset size.

> dataset size.  I actually played with doing that, then doing huffman 
> encoding of that.  I get around 1.2-1.3 bits / page of _physical memory_ 
> on my tests.
> 
> I don't have my notes handy, but here are some numbers from memory...
> 
> The obvious worst cases are 1 bit per page of _dataset_ or 19 bits per page
> of physical memory in the machine.  The latter limit gets better, however,
> since there are < 1024 symbols possible for the encoder (since in this
> case symbols are spans of zeros that need to fit in a file that is 1 GB in
> size).  So the actual worst case is much closer to 1 bit per page of
> the dataset or ~10 bits per page of physical memory.  The real performance
> I see with huffman is more like 1.3 bits per page of physical memory.  All the
> encoding/decoding is actually very fast.  zlib would actually compress even
> better than huffman, but the huffman encoder/decoder is pretty good and
> very straightforward code.
> 
> I would like to integrate something like this into PG or perhaps even into
> something like rsync, but it was written as a proof of concept and I haven't
> had time to work on it recently.
> 
> Garick
> 
> > -- 
> > Cédric Villemain               2ndQuadrant
> > http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
> 


Re: Streaming base backups

From: Heikki Linnakangas

On 05.01.2011 15:54, Magnus Hagander wrote:
> * Suggestion from Heikki: perhaps at some point we're going to need a full
>    bison grammar for walsender commands.

Here's a patch for this (also available at
git@github.com:hlinnaka/postgres.git, branch "streaming_base"). I
thought I knew our bison/flex magic pretty well by now, but it turned
out to take much longer than I thought. But here it is.

I'm not 100% sure if this is worth the trouble quite yet. It adds quite
a lot of boilerplate code. OTOH, having a bison grammar file makes it
easier to see what exactly the grammar is, and I like that. It's not too
bad with three commands yet, but if it expands much further a bison
grammar is a must.

At first I tried using the backend lexer for this, but it couldn't parse
the xlog-start location in the "START_REPLICATION 0/47000000" command.
In hindsight that may have been a badly chosen syntax. But as you
pointed out on IM, the lexer needed to handle this limited set of
commands is very small, so I wrote a dedicated flex lexer instead that
can handle it.
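
(For the archives: once the lexer hands the location over as a single
token, taking it apart is trivial - an illustrative sketch, not the patch
code:)

    #include <stdio.h>

    /* "START_REPLICATION 0/47000000" carries the xlog location in %X/%X
       form; the standard SQL lexer sees integer, '/', integer instead,
       which is why it choked. */
    static int
    parse_recptr(const char *tok, unsigned int *xlogid, unsigned int *xrecoff)
    {
        return sscanf(tok, "%X/%X", xlogid, xrecoff) == 2 ? 0 : -1;
    }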

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com


Re: Streaming base backups

From: Heikki Linnakangas

On 05.01.2011 15:54, Magnus Hagander wrote:
> I've implemented a frontend for this in pg_streamrecv, based on the assumption
> that we wanted to include this in bin/ for 9.1 - and that it seems like a
> reasonable place to put it. This can obviously be moved elsewhere if we want to.
> That code needs a lot more cleanup, but I wanted to make sure I got the backend
> patch out for review quickly. You can find the current WIP branch for
> pg_streamrecv on my github page at https://github.com/mhagander/pg_streamrecv,
> in the branch "baserecv". I'll be posting that as a separate patch once it's
> been a bit more cleaned up (it does work now if you want to test it, though).

One more thing, now that I've played a bit with pg_streamrecv:

I find it strange that the data directory must exist when you call 
pg_streamrecv in base-backup mode. I would expect it to work like 
initdb, and create the directory if it doesn't exist.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Streaming base backups

From: Magnus Hagander

On Thu, Jan 6, 2011 at 23:57, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
>
> Looks like pg_streamrecv creates the pg_xlog and pg_tblspc directories,
> because they're not included in the streamed tar. Wouldn't it be better to
> include them in the tar as empty directories at the server-side? Otherwise
> if you write the tar file to disk and untar it later, you have to manually
> create them.

Attached is an updated patch that does this.

It also collects all the header records as a single resultset at the
beginning. This made for cleaner code, but more importantly makes it
possible to get the total size of the backup even if there are
multiple tablespaces.
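
On the client side that makes the progress math trivial - roughly like this
hedged sketch (assumes the command was sent with PQsendQuery(), and the
column order in the header resultset is an assumption too):

    #include <stdlib.h>
    #include <libpq-fe.h>

    /* read the per-tablespace header resultset that now precedes the
       COPY streams, and sum up the size estimates */
    static long
    total_backup_estimate(PGconn *conn)
    {
        long        total = 0;
        int         i;
        PGresult   *res = PQgetResult(conn);

        if (res != NULL && PQresultStatus(res) == PGRES_TUPLES_OK)
        {
            /* assumed columns: 0 = tablespace oid, 1 = path, 2 = size */
            for (i = 0; i < PQntuples(res); i++)
                total += atol(PQgetvalue(res, i, 2));
        }
        PQclear(res);
        return total;
    }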

It also changes the tar members to use relative paths instead of
absolute ones - since we send the root of the directory in the header
anyway. That also takes away the "./" portion in all tar members.

git branch on github updated as well, of course.

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: Streaming base backups

From: Hannu Krosing

On 7.1.2011 15:45, Magnus Hagander wrote:
> On Fri, Jan 7, 2011 at 02:15, Simon Riggs <simon@2ndquadrant.com> wrote:
>
>> One very useful feature will be some way of confirming the number and
>> size of files to transfer, so that the base backup client can find out
>> the progress.
> The patch already does this. Or rather, as it's coded it does this
> once per tablespace.
>
> It'll give you an approximation only of course, that can change,
In this case you actually could send exact numbers, as you only need to
transfer the files up to the size they were when starting the base backup.
The rest will be taken care of by WAL replay.

>   but
> it should be enough for the purposes of a progress indication.
>
>
>> It would also be good to avoid writing a backup_label file at all on the
>> master, so there was no reason why multiple concurrent backups could not
>> be taken. The current coding allows for the idea that the start and stop
>> might be in different sessions, whereas here we know we are in one
>> session.
> Yeah, I have that on the todo list suggested by Heikki. I consider it
> a later phase though.
>
>


-- 
--------------------------------------------
Hannu Krosing
Senior Consultant,
Infinite Scalability & Performance
http://www.2ndQuadrant.com/books/



Re: Streaming base backups

From: Magnus Hagander

On Sun, Jan 9, 2011 at 09:55, Hannu Krosing <hannu@2ndquadrant.com> wrote:
> On 7.1.2011 15:45, Magnus Hagander wrote:
>>
>> On Fri, Jan 7, 2011 at 02:15, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>>> One very useful feature will be some way of confirming the number and
>>> size of files to transfer, so that the base backup client can find out
>>> the progress.
>>
>> The patch already does this. Or rather, as it's coded it does this
>> once per tablespace.
>>
>> It'll give you an approximation only of course, that can change,
>
> In this case you actually could send exact numbers, as you only need to
> transfer the files up to the size they were when starting the base backup.
> The rest will be taken care of by WAL replay.

It will still be an estimate, because files can get smaller, and even
go away completely.

But we really don't need more than an estimate...


--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: Streaming base backups

From: Hannu Krosing

On 9.1.2011 10:44, Magnus Hagander wrote:
> On Sun, Jan 9, 2011 at 09:55, Hannu Krosing <hannu@2ndquadrant.com> wrote:
>> On 7.1.2011 15:45, Magnus Hagander wrote:
>>> On Fri, Jan 7, 2011 at 02:15, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>
>>>> One very useful feature will be some way of confirming the number and
>>>> size of files to transfer, so that the base backup client can find out
>>>> the progress.
>>> The patch already does this. Or rather, as it's coded it does this
>>> once per tablespace.
>>>
>>> It'll give you an approximation only of course, that can change,
>> In this case you actually could send exact numbers, as you only need to
>> transfer the files up to the size they were when starting the base backup.
>> The rest will be taken care of by WAL replay.
> It will still be an estimate, because files can get smaller, and even
> go away completely.
Sure. I just wanted to point out that you don't need to send the tail
part of the file which was added after the start of the backup.

And you can give a "worst case" estimate for the space needed by the
base backup.

OTOH, streaming the WAL files in parallel can still fill up all 
available space :P

> But we really don't need more than an estimate...
>
Agreed.

-- 
--------------------------------------------
Hannu Krosing
Senior Consultant,
Infinite Scalability & Performance
http://www.2ndQuadrant.com/books/



Re: Streaming base backups

From: Magnus Hagander

On Sun, Jan 9, 2011 at 12:05, Hannu Krosing <hannu@2ndquadrant.com> wrote:
> On 9.1.2011 10:44, Magnus Hagander wrote:
>>
>> On Sun, Jan 9, 2011 at 09:55, Hannu Krosing <hannu@2ndquadrant.com> wrote:
>>>
>>> On 7.1.2011 15:45, Magnus Hagander wrote:
>>>>
>>>> On Fri, Jan 7, 2011 at 02:15, Simon Riggs <simon@2ndquadrant.com>
>>>>  wrote:
>>>>
>>>>> One very useful feature will be some way of confirming the number and
>>>>> size of files to transfer, so that the base backup client can find out
>>>>> the progress.
>>>>
>>>> The patch already does this. Or rather, as it's coded it does this
>>>> once per tablespace.
>>>>
>>>> It'll give you an approximation only of course, that can change,
>>>
>>> In this case you actually could send exact numbers, as you only need to
>>> transfer the files up to the size they were when starting the base backup.
>>> The rest will be taken care of by WAL replay.
>>
>> It will still be an estimate, because files can get smaller, and even
>> go away completely.
>
> Sure. I just wanted to point out that you don't need to send the tail
> part of the file which was added after the start of the backup.

True - but that's a PITA to keep track of. We do this if the file
changes during the transmission of that *file*, since otherwise the
tar header would specify an incorrect size, but not through the whole
backup.
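
The per-file fix-up is basically this (hedged sketch: send exactly the byte
count the tar header announced, pad with zeros if the file shrank, cap it if
it grew; fwrite-to-stdout stands in for the COPY stream):

    #include <stdio.h>
    #include <string.h>

    static void
    send_exactly(FILE *fp, long announced)
    {
        char    buf[32768];
        long    sent = 0;
        size_t  n;

        while (sent < announced &&
               (n = fread(buf, 1, sizeof(buf), fp)) > 0)
        {
            if ((long) n > announced - sent)
                n = (size_t) (announced - sent);    /* file grew: cap it */
            fwrite(buf, 1, n, stdout);
            sent += (long) n;
        }

        memset(buf, 0, sizeof(buf));
        while (sent < announced)                    /* file shrank: pad */
        {
            size_t  pad = sizeof(buf);

            if ((long) pad > announced - sent)
                pad = (size_t) (announced - sent);
            fwrite(buf, 1, pad, stdout);
            sent += (long) pad;
        }
    }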


> And you can give a "worst case" estimate for the space needed by the base backup.
>
> OTOH, streaming the WAL files in parallel can still fill up all available
> space :P

Yeah. I don't think it's worth the extra complexity of having to
enumerate and keep records for every individual file before the
streaming starts.

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: Streaming base backups

From: Cédric Villemain

2011/1/7 Garick Hamlin <ghamlin@isc.upenn.edu>:
> On Thu, Jan 06, 2011 at 07:47:39PM -0500, Cédric Villemain wrote:
>> 2011/1/5 Magnus Hagander <magnus@hagander.net>:
>> > On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
>> >> Magnus Hagander <magnus@hagander.net> writes:
>> >>> * Stefan mentioned it might be useful to put some
>> >>> posix_fadvise(POSIX_FADV_DONTNEED)
>> >>>   in the process that streams all the files out. Seems useful, as long as that
>> >>>   doesn't kick them out of the cache *completely*, for other backends as well.
>> >>>   Do we know if that is the case?
>> >>
>> >> Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
>> >> not already in SHM?
>> >
>> > I think that's way more complex than we want to go here.
>> >
>>
>> DONTNEED will remove the block from OS buffer every time.
>>
>> It should not be that hard to implement a snapshot(it needs mincore())
>> and to restore previous state. I don't know how basebackup is
>> performed exactly...so perhaps I am wrong.
>>
>> posix_fadvise support is already in postgresql core...we can start by
>> just doing a snapshot of the files before starting, or at some point
>> in the basebackup, it will need only 256kB per GB of data...
>
> It is actually possible to be more scalable than the simple solution you
> outline here (although that solution works pretty well).

Yes, I suggest something pretty simple to go with as a first shot.

>
> I've written a program that synchronizes the OS cache state using
> mmap()/mincore() between two computers.  I haven't actually tested its
> impact on performance yet, but I was surprised by how fast it actually runs
> and how compact cache maps can be.
>
> If one encodes the data so one remembers the number of zeros between 1s
> one, storage scale by the amount of memory in each size rather than the
> dataset size.  I actually played with doing that, then doing huffman
> encoding of that.  I get around 1.2-1.3 bits / page of _physical memory_
> on my tests.
>
> I don't have my notes handy, but here are some numbers from memory...

That is interesting. Even if I haven't had issues with the size of the
maps so far, I thought that simple zlib compression should be
enough.

>
> The obvious worst cases are 1 bit per page of _dataset_ or 19 bits per page
> of physical memory in the machine.  The latter limit gets better, however,
> since there are < 1024 symbols possible for the encoder (since in this
> case symbols are spans of zeros that need to fit in a file that is 1 GB in
> size).  So the actual worst case is much closer to 1 bit per page of
> the dataset or ~10 bits per page of physical memory.  The real performance
> I see with huffman is more like 1.3 bits per page of physical memory.  All the
> encoding/decoding is actually very fast.  zlib would actually compress even
> better than huffman, but the huffman encoder/decoder is pretty good and
> very straightforward code.

pgfincore currently holds that information in a flat file. The ongoing
dev is simpler and provides the data as bits, so you can store it
in a table, restore it on your slave thanks to SR, and use it on
the slave.

>
> I would like to integrate something like this into PG or perhaps even into
> something like rsync, but it was written as a proof of concept and I haven't
> had time to work on it recently.
>
> Garick
>
>> --
>> Cédric Villemain               2ndQuadrant
>> http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
>



--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


Re: Streaming base backups

From: Cédric Villemain

2011/1/7 Magnus Hagander <magnus@hagander.net>:
> On Fri, Jan 7, 2011 at 01:47, Cédric Villemain
> <cedric.villemain.debian@gmail.com> wrote:
>> 2011/1/5 Magnus Hagander <magnus@hagander.net>:
>>> On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
>>>> Magnus Hagander <magnus@hagander.net> writes:
>>>>> * Stefan mentioned it might be useful to put some
>>>>> posix_fadvise(POSIX_FADV_DONTNEED)
>>>>>   in the process that streams all the files out. Seems useful, as long as that
>>>>>   doesn't kick them out of the cache *completely*, for other backends as well.
>>>>>   Do we know if that is the case?
>>>>
>>>> Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
>>>> not already in SHM?
>>>
>>> I think that's way more complex than we want to go here.
>>>
>>
>> DONTNEED will remove the block from OS buffer every time.
>
> Then we definitely don't want to use it - because some other backend
> might well want the file. Better leave it up to the standard logic in
> the kernel.

Looking at the patch, it is (very) easy to add support for that in
basebackup.c. That supposes allowing mincore(), so mmap(), and so
probably switching the fopen() to an open() (or adding an open() just
for the mmap requirement...)

Let's go ?

>
>> It should not be that hard to implement a snapshot(it needs mincore())
>> and to restore previous state. I don't know how basebackup is
>> performed exactly...so perhaps I am wrong.
>
> Uh, it just reads the files out of the filesystem. Just like you'd do
> today, except it's now integrated and streams the data across a
> regular libpq connection.
>
> --
>  Magnus Hagander
>  Me: http://www.hagander.net/
>  Work: http://www.redpill-linpro.com/
>



--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


Re: Streaming base backups

From: Magnus Hagander

On Sun, Jan 9, 2011 at 23:33, Cédric Villemain
<cedric.villemain.debian@gmail.com> wrote:
> 2011/1/7 Magnus Hagander <magnus@hagander.net>:
>> On Fri, Jan 7, 2011 at 01:47, Cédric Villemain
>> <cedric.villemain.debian@gmail.com> wrote:
>>> 2011/1/5 Magnus Hagander <magnus@hagander.net>:
>>>> On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
>>>>> Magnus Hagander <magnus@hagander.net> writes:
>>>>>> * Stefan mentioned it might be useful to put some
>>>>>> posix_fadvise(POSIX_FADV_DONTNEED)
>>>>>>   in the process that streams all the files out. Seems useful, as long as that
>>>>>>   doesn't kick them out of the cache *completely*, for other backends as well.
>>>>>>   Do we know if that is the case?
>>>>>
>>>>> Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
>>>>> not already in SHM?
>>>>
>>>> I think that's way more complex than we want to go here.
>>>>
>>>
>>> DONTNEED will remove the block from OS buffer every time.
>>
>> Then we definitely don't want to use it - because some other backend
>> might well want the file. Better leave it up to the standard logic in
>> the kernel.
>
> Looking at the patch, it is (very) easy to add the support for that in
> basebackup.c
> That supposed allowing mincore(), so mmap(), and so probably switch
> the fopen() to an open() (or add an open() just for mmap
> requirement...)
>
> Let's go ?

Per above, I still don't think we *should* do this. We don't want to
kick things out of the cache underneath other backends, and we can't
control that. Either way, it shouldn't happen in the beginning, and if
it does, it should be backed with proper benchmarks.

I've committed the backend side of this, without that. Still working
on the client, and on cleaning up Heikki's patch for grammar/parser
support.

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: Streaming base backups

From: Cédric Villemain

2011/1/10 Magnus Hagander <magnus@hagander.net>:
> On Sun, Jan 9, 2011 at 23:33, Cédric Villemain
> <cedric.villemain.debian@gmail.com> wrote:
>> 2011/1/7 Magnus Hagander <magnus@hagander.net>:
>>> On Fri, Jan 7, 2011 at 01:47, Cédric Villemain
>>> <cedric.villemain.debian@gmail.com> wrote:
>>>> 2011/1/5 Magnus Hagander <magnus@hagander.net>:
>>>>> On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
>>>>>> Magnus Hagander <magnus@hagander.net> writes:
>>>>>>> * Stefan mentioned it might be useful to put some
>>>>>>> posix_fadvise(POSIX_FADV_DONTNEED)
>>>>>>>   in the process that streams all the files out. Seems useful, as long as that
>>>>>>>   doesn't kick them out of the cache *completely*, for other backends as well.
>>>>>>>   Do we know if that is the case?
>>>>>>
>>>>>> Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
>>>>>> not already in SHM?
>>>>>
>>>>> I think that's way more complex than we want to go here.
>>>>>
>>>>
>>>> DONTNEED will remove the block from OS buffer every time.
>>>
>>> Then we definitely don't want to use it - because some other backend
>>> might well want the file. Better leave it up to the standard logic in
>>> the kernel.
>>
>> Looking at the patch, it is (very) easy to add the support for that in
>> basebackup.c
>> That supposed allowing mincore(), so mmap(), and so probably switch
>> the fopen() to an open() (or add an open() just for mmap
>> requirement...)
>>
>> Let's go ?
>
> Per above, I still don't think we *should* do this. We don't want to
> kick things out of the cache underneath other backends, and since we

We are dropping stuff underneath other backends anyway, but I
understand your point.

> can't control that. Either way, it shouldn't happen in the beginning,
> and if it does, should be backed with proper benchmarks.

I agree.

>
> I've committed the backend side of this, without that. Still working
> on the client, and on cleaning up Heikki's patch for grammar/parser
> support.

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


Re: Streaming base backups

From: Stefan Kaltenbrunner

On 01/10/2011 08:13 PM, Cédric Villemain wrote:
> 2011/1/10 Magnus Hagander <magnus@hagander.net>:
>> On Sun, Jan 9, 2011 at 23:33, Cédric Villemain
>> <cedric.villemain.debian@gmail.com> wrote:
>>> 2011/1/7 Magnus Hagander <magnus@hagander.net>:
>>>> On Fri, Jan 7, 2011 at 01:47, Cédric Villemain
>>>> <cedric.villemain.debian@gmail.com> wrote:
>>>>> 2011/1/5 Magnus Hagander <magnus@hagander.net>:
>>>>>> On Wed, Jan 5, 2011 at 22:58, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
>>>>>>> Magnus Hagander <magnus@hagander.net> writes:
>>>>>>>> * Stefan mentioned it might be useful to put some
>>>>>>>> posix_fadvise(POSIX_FADV_DONTNEED)
>>>>>>>>    in the process that streams all the files out. Seems useful, as long as that
>>>>>>>>    doesn't kick them out of the cache *completely*, for other backends as well.
>>>>>>>>    Do we know if that is the case?
>>>>>>>
>>>>>>> Maybe have a look at pgfincore to only tag DONTNEED for blocks that are
>>>>>>> not already in SHM?
>>>>>>
>>>>>> I think that's way more complex than we want to go here.
>>>>>>
>>>>>
>>>>> DONTNEED will remove the block from OS buffer every time.
>>>>
>>>> Then we definitely don't want to use it - because some other backend
>>>> might well want the file. Better leave it up to the standard logic in
>>>> the kernel.
>>>
>>> Looking at the patch, it is (very) easy to add the support for that in
>>> basebackup.c
>>> That supposed allowing mincore(), so mmap(), and so probably switch
>>> the fopen() to an open() (or add an open() just for mmap
>>> requirement...)
>>>
>>> Let's go ?
>>
>> Per above, I still don't think we *should* do this. We don't want to
>> kick things out of the cache underneath other backends, and since we
>
> we are dropping stuff underneath other backends  anyway but I
> understand your point.
>
>> can't control that. Either way, it shouldn't happen in the beginning,
>> and if it does, should be backed with proper benchmarks.
>
> I agree.

Well, I want to point out that the link I provided upthread actually
provides a (Linux-centric) way to get the property of interest for this:

* if the data blocks are in the OS buffer cache, just leave them alone; if
they are NOT, tell the OS that "this current user" is not interested in
having them there
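
Combined with a mincore() snapshot taken before the file is read (one byte
per page, as sketched upthread), that could look roughly like this
(illustrative):

    #define _XOPEN_SOURCE 600   /* for posix_fadvise */
    #include <fcntl.h>
    #include <unistd.h>

    /* evict only the runs of pages that were NOT resident before we
       started reading the file */
    static void
    evict_cold_pages(int fd, const unsigned char *before, size_t npages,
                     long pagesize)
    {
        size_t  i = 0;

        while (i < npages)
        {
            size_t  start;

            if (before[i] & 1)      /* was already cached: leave it alone */
            {
                i++;
                continue;
            }
            start = i;
            while (i < npages && !(before[i] & 1))
                i++;
            (void) posix_fadvise(fd, (off_t) start * pagesize,
                                 (off_t) (i - start) * pagesize,
                                 POSIX_FADV_DONTNEED);
        }
    }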

I would like to see something like that implemented in the backend
sometime, maybe even as a GUC of some sort; that way we could actually
use it for, say, a pg_dump run as well. I have seen the response times
of big boxes tank not because of the CPU and lock load pg_dump imposes,
but because of the way it can cause the OS buffer cache to get spoiled
with not-really-important data.



Anyway, I agree that the (positive and/or negative) effect of something
like that needs to be measured, but this effect is not too easy to see
in very simple setups...


Stefan


Re: Streaming base backups

From: Cédric Villemain

2011/1/10 Stefan Kaltenbrunner <stefan@kaltenbrunner.cc>:
> On 01/10/2011 08:13 PM, Cédric Villemain wrote:
>>
>> 2011/1/10 Magnus Hagander <magnus@hagander.net>:
>>>
>>> On Sun, Jan 9, 2011 at 23:33, Cédric Villemain
>>> <cedric.villemain.debian@gmail.com> wrote:
>>>>
>>>> 2011/1/7 Magnus Hagander <magnus@hagander.net>:
>>>>>
>>>>> On Fri, Jan 7, 2011 at 01:47, Cédric Villemain
>>>>> <cedric.villemain.debian@gmail.com> wrote:
>>>>>>
>>>>>> 2011/1/5 Magnus Hagander <magnus@hagander.net>:
>>>>>>>
>>>>>>> On Wed, Jan 5, 2011 at 22:58, Dimitri
>>>>>>> Fontaine <dimitri@2ndquadrant.fr> wrote:
>>>>>>>>
>>>>>>>> Magnus Hagander <magnus@hagander.net> writes:
>>>>>>>>>
>>>>>>>>> * Stefan mentioned it might be useful to put some
>>>>>>>>> posix_fadvise(POSIX_FADV_DONTNEED)
>>>>>>>>>   in the process that streams all the files out. Seems useful, as
>>>>>>>>> long as that
>>>>>>>>>   doesn't kick them out of the cache *completely*, for other
>>>>>>>>> backends as well.
>>>>>>>>>   Do we know if that is the case?
>>>>>>>>
>>>>>>>> Maybe have a look at pgfincore to only tag DONTNEED for blocks that
>>>>>>>> are
>>>>>>>> not already in SHM?
>>>>>>>
>>>>>>> I think that's way more complex than we want to go here.
>>>>>>>
>>>>>>
>>>>>> DONTNEED will remove the block from OS buffer every time.
>>>>>
>>>>> Then we definitely don't want to use it - because some other backend
>>>>> might well want the file. Better leave it up to the standard logic in
>>>>> the kernel.
>>>>
>>>> Looking at the patch, it is (very) easy to add the support for that in
>>>> basebackup.c
>>>> That supposed allowing mincore(), so mmap(), and so probably switch
>>>> the fopen() to an open() (or add an open() just for mmap
>>>> requirement...)
>>>>
>>>> Let's go ?
>>>
>>> Per above, I still don't think we *should* do this. We don't want to
>>> kick things out of the cache underneath other backends, and since we
>>
>> we are dropping stuff underneath other backends  anyway but I
>> understand your point.
>>
>>> can't control that. Either way, it shouldn't happen in the beginning,
>>> and if it does, should be backed with proper benchmarks.
>>
>> I agree.
>
> well I want to point out that the link I provided upthread actually provides
> a (linux centric) way to get the property of interest here:

Yes, that is exactly what we are talking about here:
mincore() and posix_fadvise().

FreeBSD should allow that later (at least it is on the todo list), and
Windows may allow it too, with a different API.

>
> * if the datablocks are in the OS buffercache just leave them alone; if they
> are NOT, tell the OS that "this current user" is not interested in having
> them there

My experience is that posix_fadvise on a specific block behaves more
brutally than flagging a whole file. In the latter case the kernel may
not do what you ask if it estimates the advice is not welcome (because
of other IO requests).

What Magnus points out is that other backends execute queries and
request blocks (and load them into PostgreSQL's shared buffers), and it
is *hard* to be sure we don't remove blocks just loaded by another
backend (the worst case being flushing prefetched blocks not yet in
shared buffers, cf. effective_io_concurrency).
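
To make that concrete, here is a minimal sketch of the mincore()-guided
approach under discussion (illustration only, not a patch against
basebackup.c: error handling is elided, and real code would coalesce
adjacent ranges instead of advising page by page):

    #include <fcntl.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Drop from the OS cache only the pages we pulled in ourselves. */
    static void
    evict_cold_pages(int fd)
    {
        struct stat st;
        long        pagesz = sysconf(_SC_PAGESIZE);
        size_t      npages, i;
        unsigned char *vec;
        void       *map;

        if (fstat(fd, &st) < 0 || st.st_size == 0)
            return;
        npages = (st.st_size + pagesz - 1) / pagesz;
        vec = malloc(npages);
        map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);

        /* remember which pages were already resident ... */
        mincore(map, st.st_size, vec);
        munmap(map, st.st_size);

        /* ... stream the file out here, as basebackup.c does ... */

        /* ... and advise away only the pages that were NOT resident */
        for (i = 0; i < npages; i++)
            if ((vec[i] & 1) == 0)
                posix_fadvise(fd, (off_t) i * pagesz, pagesz,
                              POSIX_FADV_DONTNEED);
        free(vec);
    }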

>
> I would like to see something like that implemented in the backend sometime,
> maybe even as a GUC of some sort; that way we could actually use it for, say,
> a pg_dump run as well. I have seen the response times of big boxes tank not
> because of the CPU and lock load pg_dump imposes but because of the way it
> can spoil the OS buffercache with not-really-important data.

Glad to hear that; pgfincore is also a POC about those topics.
The best solution is to mmap in postgres, but that is not possible, so
we have to take a snapshot of the objects and restore them afterwards
(again, *it is* what Tobias does with his rsync). Side note: because of
readahead, inspecting block by block while you read the file gives bad
results (or you need to fadvise POSIX_FADV_RANDOM to disable the
readahead behavior, which is not good at all).

>
> anyway I agree that the (positive and/or negative) effect of something like
> that needs to be measured but this effect is not too easy to see in very
> simple setups...

Yes. And with pg_basebackup, copying 1GB over the network takes longer
than 2 seconds, so we will probably need a specific strategy.


--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


Re: Streaming base backups

From
Cédric Villemain
Date:
2011/1/10 Magnus Hagander <magnus@hagander.net>:
> [...]
>
> I've committed the backend side of this, without that. Still working
> on the client, and on cleaning up Heikki's patch for grammar/parser
> support.

Attached is a small patch fixing "-d basedir" when it's called with an
absolute path.
Maybe we can use pg_mkdir_p() instead of mkdir?




--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

Attachment

Re: Streaming base backups

From
Magnus Hagander
Date:
On Tue, Jan 11, 2011 at 01:28, Cédric Villemain
<cedric.villemain.debian@gmail.com> wrote:
> 2011/1/10 Magnus Hagander <magnus@hagander.net>:
>> I've committed the backend side of this, without that. Still working
>> on the client, and on cleaning up Heikki's patch for grammar/parser
>> support.
>
> Attached is a small patch fixing "-d basedir" when it's called with an
> absolute path.
> Maybe we can use pg_mkdir_p() instead of mkdir?

Heh, that was actually a hack to be able to run pg_basebackup on the
same machine as the database with the tablespaces. It will be removed
before commit :-) (It was also in the wrong place to work - I realize I
managed to break it in a refactor.) I've put in a big ugly comment to
make sure it gets removed :-)

And yes, using pg_mkdir_p() is good. I used to do that; I think I
removed it by mistake when something else was supposed to be removed.
I've put it back.

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: Streaming base backups

From
Garick Hamlin
Date:
On Mon, Jan 10, 2011 at 09:09:28AM -0500, Magnus Hagander wrote:
> [...]
> 
> Per above, I still don't think we *should* do this. We don't want to
> kick things out of the cache underneath other backends, and we
> can't control that. Either way, it shouldn't happen in the beginning,
> and if it does, it should be backed with proper benchmarks.

Another option that occurs to me is to use direct IO (or another
means as needed) to bypass the cache.  Rather than kicking pages out of
the cache, we simply avoid polluting it: bypass it for cold pages, and
use either normal IO for 'hot' pages or a read() to "heat"
the cache afterward.
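
For illustration, a rough sketch of that bypass (assumptions: Linux-style
O_DIRECT, a 4096-byte alignment requirement, and a hypothetical send step;
as noted downthread, whether this works at all varies by platform and
filesystem):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define XFER_SIZE (64 * 1024)   /* multiple of the block size */

    /* Stream a cold file without pulling it through the OS cache. */
    static void
    stream_file_direct(const char *path)
    {
        void   *buf;
        int     fd;
        ssize_t n;

        /* O_DIRECT requires aligned buffers and transfer sizes */
        if (posix_memalign(&buf, 4096, XFER_SIZE) != 0)
            return;
        fd = open(path, O_RDONLY | O_DIRECT);
        if (fd >= 0)
        {
            while ((n = read(fd, buf, XFER_SIZE)) > 0)
            {
                /* ... send n bytes to the client here ... */
            }
            close(fd);
        }
        free(buf);
    }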

Garick



Re: Streaming base backups

From
Cédric Villemain
Date:
2011/1/11 Garick Hamlin <ghamlin@isc.upenn.edu>:
> [...]
>
> Another option that occurs to me is to use direct IO (or another
> means as needed) to bypass the cache.  Rather than kicking pages out of
> the cache, we simply avoid polluting it: bypass it for cold pages, and
> use either normal IO for 'hot' pages or a read() to "heat"
> the cache afterward.

AFAIR, even Linus rejected the idea of using it seriously - unless I'm
shuffling things in my memory.




--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


Re: Streaming base backups

From
Garick Hamlin
Date:
On Tue, Jan 11, 2011 at 11:39:20AM -0500, Cédric Villemain wrote:
> [...]
> 
> AFAIR, even Linus rejected the idea of using it seriously - unless I'm
> shuffling things in my memory.

Direct IO is generally a pain.

POSIX_FADV_NOREUSE is an alternative (I think).  Realistically I wasn't sure
which way(s) actually worked.  My gut was that direct IO would likely work
right on Linux and Solaris, at least.  If POSIX_FADV_NOREUSE works then maybe
that is the answer instead, but I haven't tested either.
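
For comparison, the POSIX_FADV_NOREUSE variant would just be a single
advisory call before streaming each file (a sketch; as noted downthread,
on Linux this flag has historically been a no-op):

    #include <fcntl.h>

    /* Declare up front that we will read this file once and move on. */
    static void
    advise_noreuse(int fd, off_t len)
    {
        (void) posix_fadvise(fd, 0, len, POSIX_FADV_NOREUSE);
    }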

Garick




Re: Streaming base backups

From
Florian Pflug
Date:
On Jan11, 2011, at 18:09 , Garick Hamlin wrote:
> My gut was that direct io would likely work right on Linux
> and Solaris, at least.

Didn't we discover recently that O_DIRECT fails for ext4 on linux
if ordered=data, or something like that?

best regards,
Florian Pflug




Re: Streaming base backups

From
Cédric Villemain
Date:
2011/1/11 Garick Hamlin <ghamlin@isc.upenn.edu>:
> On Tue, Jan 11, 2011 at 11:39:20AM -0500, Cédric Villemain wrote:
>> [...]
>>
>> AFAIR, even Linus rejected the idea of using it seriously - unless I'm
>> shuffling things in my memory.
>
> Direct IO is generally a pain.
>
> POSIX_FADV_NOREUSE is an alternative (I think).  Realistically I wasn't sure
> which way(s) actually worked.  My gut was that direct IO would likely work
> right on Linux and Solaris, at least.  If POSIX_FADV_NOREUSE works then maybe
> that is the answer instead, but I haven't tested either.

Yes, it should be the best option; unfortunately it is a ghost flag -
it doesn't do anything.
At some point there were a libprefetch library and a Linux fincore()
syscall in the air. Unfortunately the people behind those items stopped
communicating with open source, AFAICS. (I didn't get answers myself,
and neither did the Linux ML.)





--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support


Re: Streaming base backups

From
Tom Lane
Date:
Florian Pflug <fgp@phlo.org> writes:
> On Jan11, 2011, at 18:09 , Garick Hamlin wrote:
>> My gut was that direct io would likely work right on Linux
>> and Solaris, at least.

> Didn't we discover recently that O_DIRECT fails for ext4 on linux
> if ordered=data, or something like that?

Quite.  Blithe assertions that something like this "should work" aren't
worth the electrons they're written on.
        regards, tom lane


Re: Streaming base backups

From
Garick Hamlin
Date:
On Tue, Jan 11, 2011 at 12:45:02PM -0500, Tom Lane wrote:
> Florian Pflug <fgp@phlo.org> writes:
> > On Jan11, 2011, at 18:09 , Garick Hamlin wrote:
> >> My gut was that direct io would likely work right on Linux
> >> and Solaris, at least.
> 
> > Didn't we discover recently that O_DIRECT fails for ext4 on linux
> > if ordered=data, or something like that?
> 
> Quite.  Blithe assertions that something like this "should work" aren't
> worth the electrons they're written on.

Indeed.  I wasn't making such a claim, in case that wasn't clear.  I believe,
in fact, there is no single way that will work everywhere.  This isn't
needed for correctness, of course; it is merely a performance tweak, as
long as the 'not working' case on platform + filesystem X degrades to
something close to what would have happened if we hadn't tried.  I expected
POSIX_FADV_NOREUSE not to work on Linux, but haven't looked at it recently,
and not all systems are Linux, so I mentioned it.  This is why I thought
direct IO might be more realistic.

I did not have a chance to test before I wrote this email, so I attempted to 
make my uncertainty clear.  I _know_ it will not work in some environments,
but I thought it was worth looking at whether it worked on more than one sane 
common setup - though I can understand if you feel differently about that.

Garick



Re: Streaming base backups

From
Fujii Masao
Date:
On Mon, Jan 10, 2011 at 11:09 PM, Magnus Hagander <magnus@hagander.net> wrote:
> I've committed the backend side of this, without that. Still working
> on the client, and on cleaning up Heikki's patch for grammar/parser
> support.

Great work!

I have some comments:

While walsender is sending a base backup, WalSndWakeup should
not send the signal to that walsender?

In sendFile or elsewhere, we should periodically check whether
postmaster is alive and whether the flag was set by the signal?

At the end of the backup by walsender, it forces a switch to a new
WAL file and waits until the last WAL file has been archived. So we
should change postmaster so that it doesn't cause the archiver to
end before walsender ends when shutdown is requested?

Also, when shutdown is requested, the walsender which is
streaming WAL should not end before another walsender which
is sending a backup ends, to stream the backup-end WAL?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Streaming base backups

From
Magnus Hagander
Date:
On Wed, Jan 12, 2011 at 10:39, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Mon, Jan 10, 2011 at 11:09 PM, Magnus Hagander <magnus@hagander.net> wrote:
>> I've committed the backend side of this, without that. Still working
>> on the client, and on cleaning up Heikki's patch for grammar/parser
>> support.
>
> Great work!
>
> I have some comments:
>
> While walsender is sending a base backup, WalSndWakeup should
> not send the signal to that walsender?

True, it's not necessary. How bad does it actually hurt things though?
Given that the walsender running the backup isn't actually waiting on
the latch, it doesn't actually send a signal, does it?


> In sendFile or elsewhere, we should periodically check whether
> postmaster is alive and whether the flag was set by the signal?

That, however, we probably should.


> At the end of the backup by walsender, it forces a switch to a new
> WAL file and waits until the last WAL file has been archived. So we
> should change postmaster so that it doesn't cause the archiver to
> end before walsender ends when shutdown is requested?

Um. I have to admit I'm not entirely following what you mean enough to
confirm it, but it *sounds* correct :-)

What scenario exactly is the problematic one?


> Also, when shutdown is requested, the walsender which is
> streaming WAL should not end before another walsender which
> is sending a backup ends, to stream the backup-end WAL?

Not sure I see the reason for that. If we're shutting down in the
middle of the base backup, we don't have any support for continuing
that one after we're back up - you have to start over.

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: Streaming base backups

From
Fujii Masao
Date:
On Fri, Jan 14, 2011 at 4:13 AM, Magnus Hagander <magnus@hagander.net> wrote:
>> While walsender is sending a base backup, WalSndWakeup should
>> not send the signal to that walsender?
>
> True, it's not necessary. How bad does it actually hurt things though?
> Given that the walsender running the backup isn't actually waiting on
> the latch, it doesn't actually send a signal, does it?

Yeah, you are right. Once WalSndWakeup sends the signal to a walsender,
latch->is_set is set, and WalSndWakeup does nothing further to that
walsender until latch->is_set is reset. Since ResetLatch is not called while
walsender is sending a base backup, that would be harmless.
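
A compressed sketch of the latch protocol in play (simplified from the
walsender code; the exact WaitLatch() arguments are elided here and have
changed over time):

    /* WalSndWakeup(), after WAL is flushed: */
    SetLatch(&walsnd->latch);       /* sets is_set; a no-op if already set */

    /* walsender streaming loop: */
    for (;;)
    {
        ResetLatch(&MyWalSnd->latch);       /* re-arm the latch */
        /* ... send any pending WAL ... */
        WaitLatch(&MyWalSnd->latch, ...);   /* sleep until SetLatch() */
    }

    /* A walsender busy with a base backup never reaches ResetLatch(),
     * so repeated wakeups are absorbed by the already-set latch. */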

>> At the end of the backup by walsender, it forces a switch to a new
>> WAL file and waits until the last WAL file has been archived. So we
>> should change postmaster so that it doesn't cause the archiver to
>> end before walsender ends when shutdown is requested?
>
> Um. I have to admit I'm not entirely following what you mean enough to
> confirm it, but it *sounds* correct :-)
>
> What scenario exactly is the problematic one?

1. Smart shutdown is requested while walsender is sending a backup.
2. Shutdown causes archiver to end.
   (Though shutdown sends SIGUSR2 to walsender to exit, walsender
   running backup doesn't respond for now.)
3. At the end of backup, walsender calls do_pg_stop_backup, which
   forces a switch to a new WAL file and waits until the last WAL
   file has been archived.
   *BUT*, since archiver has already been dead, walsender waits for
   that forever.

>> Also, when shutdown is requested, the walsender which is
>> streaming WAL should not end before another walsender which
>> is sending a backup ends, to stream the backup-end WAL?
>
> Not sure I see the reason for that. If we're shutting down in the
> middle of the base backup, we don't have any support for continuing
> that one after we're back up - you have to start over.

For now, shutdown is designed to cause walsender to end after
sending all the WAL records. So that was my thinking.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Streaming base backups

From
Heikki Linnakangas
Date:
On 14.01.2011 08:45, Fujii Masao wrote:
> On Fri, Jan 14, 2011 at 4:13 AM, Magnus Hagander<magnus@hagander.net>  wrote:
>>> At the end of the backup by walsender, it forces a switch to a new
>>> WAL file and waits until the last WAL file has been archived. So we
>>> should change postmaster so that it doesn't cause the archiver to
>>> end before walsender ends when shutdown is requested?
>>
>> Um. I have to admit I'm not entirely following what you mean enough to
>> confirm it, but it *sounds* correct :-)
>>
>> What scenario exactly is the problematic one?
>
> 1. Smart shutdown is requested while walsender is sending a backup.
> 2. Shutdown causes archiver to end.
>       (Though shutdown sends SIGUSR2 to walsender to exit, walsender
>        running backup doesn't respond for now)
> 3. At the end of backup, walsender calls do_pg_stop_backup, which
>       forces a switch to a new WAL file and waits until the last WAL file has
>       been archived.
>       *BUT*, since archiver has already been dead, walsender waits for
>       that forever.

Not only does it wait forever, but it writes the end-of-backup WAL 
record after bgwriter has already exited and written the shutdown 
checkpoint record.

I think postmaster should treat a walsender as a regular backend, until 
it has started streaming.

We can achieve that by starting up the child as PM_CHILD_ACTIVE, and 
changing the state to PM_CHILD_WALSENDER later, when streaming is 
started. Looking at the postmaster.c, that should be safe, postmaster 
will treat a backend as a regular backend anyway until it has connected 
to shared memory. It is *not* safe to switch a walsender back to a 
regular process, but we have no need to do that.
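
A sketch of what that might look like (the helper name and call site
here are illustrative, following the pmsignal.c child-slot states):

    /* walsender.c, at the point where WAL streaming actually starts: */
    MarkPostmasterChildWalSender();  /* PM_CHILD_ACTIVE -> PM_CHILD_WALSENDER */

    /* postmaster.c keeps counting PM_CHILD_ACTIVE children - including
     * a walsender still running a base backup - as regular backends,
     * and treats only PM_CHILD_WALSENDER slots as walsender-only. */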

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Streaming base backups

From
Magnus Hagander
Date:
On Fri, Jan 14, 2011 at 11:19, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> On 14.01.2011 08:45, Fujii Masao wrote:
>>
>> On Fri, Jan 14, 2011 at 4:13 AM, Magnus Hagander<magnus@hagander.net>
>>  wrote:
>>>>
>>>> At the end of the backup by walsender, it forces a switch to a new
>>>> WAL file and waits until the last WAL file has been archived. So we
>>>> should change postmaster so that it doesn't cause the archiver to
>>>> end before walsender ends when shutdown is requested?
>>>
>>> Um. I have to admit I'm not entirely following what you mean enough to
>>> confirm it, but it *sounds* correct :-)
>>>
>>> What scenario exactly is the problematic one?
>>
>> 1. Smart shutdown is requested while walsender is sending a backup.
>> 2. Shutdown causes archiver to end.
>>      (Though shutdown sends SIGUSR2 to walsender to exit, walsender
>>       running backup doesn't respond for now)
>> 3. At the end of backup, walsender calls do_pg_stop_backup, which
>>      forces a switch to a new WAL file and waits until the last WAL file
>> has
>>      been archived.
>>      *BUT*, since archiver has already been dead, walsender waits for
>>      that forever.
>
> Not only does it wait forever, but it writes the end-of-backup WAL record
> after bgwriter has already exited and written the shutdown checkpoint
> record.
>
> I think postmaster should treat a walsender as a regular backend, until it
> has started streaming.
>
> We can achieve that by starting up the child as PM_CHILD_ACTIVE, and
> changing the state to PM_CHILD_WALSENDER later, when streaming is started.
> Looking at the postmaster.c, that should be safe, postmaster will treat a
> backend as a regular backend anyway until it has connected to shared memory.
> It is *not* safe to switch a walsender back to a regular process, but we
> have no need to do that.

Seems reasonable to me.

I've applied a patch that exits base backups when the postmaster is
shutting down - I'm happily waiting for Heikki to submit one that
changes the shutdown logic in the postmaster :-)

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: Streaming base backups

From
Heikki Linnakangas
Date:
On 14.01.2011 13:38, Magnus Hagander wrote:
> On Fri, Jan 14, 2011 at 11:19, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com>  wrote:
>> On 14.01.2011 08:45, Fujii Masao wrote:
>>> 1. Smart shutdown is requested while walsender is sending a backup.
>>> 2. Shutdown causes archiver to end.
>>>       (Though shutdown sends SIGUSR2 to walsender to exit, walsender
>>>        running backup doesn't respond for now)
>>> 3. At the end of backup, walsender calls do_pg_stop_backup, which
>>>       forces a switch to a new WAL file and waits until the last WAL file
>>> has
>>>       been archived.
>>>       *BUT*, since archiver has already been dead, walsender waits for
>>>       that forever.
>>
>> Not only does it wait forever, but it writes the end-of-backup WAL record
>> after bgwriter has already exited and written the shutdown checkpoint
>> record.
>>
>> I think postmaster should treat a walsender as a regular backend, until it
>> has started streaming.
>>
>> We can achieve that by starting up the child as PM_CHILD_ACTIVE, and
>> changing the state to PM_CHILD_WALSENDER later, when streaming is started.
>> Looking at the postmaster.c, that should be safe, postmaster will treat a
>> backend as a regular backend anyway until it has connected to shared memory.
>> It is *not* safe to switch a walsender back to a regular process, but we
>> have no need to do that.
>
> Seems reasonable to me.
>
> I've applied a patch that exits base backups when the postmaster is
> shutting down - I'm happily waiting for Heikki to submit one that
> changes the shutdown logic in the postmaster :-)

Ok, committed a fix for that.

BTW, I just spotted a small race condition between creating a new 
tablespace and base backup. We take a snapshot of all the tablespaces in 
pg_tblspc before calling pg_start_backup(). If someone creates a new 
tablespace and puts some data in it in the window between the base backup 
acquiring the list of tablespaces and starting the backup, the new 
tablespace won't be included in the backup.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Streaming base backups

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> BTW, I just spotted a small race condition between creating a new table 
> space and base backup. We take a snapshot of all the tablespaces in 
> pg_tblspc before calling pg_start_backup(). If someone creates a new 
> tablespace and puts some data in it in the window between base backup 
> acquiring the list tablespaces and starting the backup, the new 
> tablespace won't be included in the backup.

So what?  The needed actions will be covered by WAL replay.
        regards, tom lane


Re: Streaming base backups

From
Heikki Linnakangas
Date:
On 15.01.2011 17:30, Tom Lane wrote:
> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:
>> BTW, I just spotted a small race condition between creating a new table
>> space and base backup. We take a snapshot of all the tablespaces in
>> pg_tblspc before calling pg_start_backup(). If someone creates a new
>> tablespace and puts some data in it in the window between base backup
>> acquiring the list tablespaces and starting the backup, the new
>> tablespace won't be included in the backup.
>
> So what?  The needed actions will be covered by WAL replay.

No, they won't, if pg_start_backup() is called *after* getting the list 
of tablespaces.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Streaming base backups

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 15.01.2011 17:30, Tom Lane wrote:
>> So what?  The needed actions will be covered by WAL replay.

> No, they won't, if pg_start_backup() is called *after* getting the list 
> of tablespaces.

Ah.  Then the fix is to change the order in which those things are done.
        regards, tom lane


Re: Streaming base backups

From
Magnus Hagander
Date:
On Sat, Jan 15, 2011 at 16:54, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> On 15.01.2011 17:30, Tom Lane wrote:
>>> So what?  The needed actions will be covered by WAL replay.
>
>> No, they won't, if pg_start_backup() is called *after* getting the list
>> of tablespaces.
>
> Ah.  Then the fix is to change the order in which those things are done.

Grumble. It used to be that way. For some reason I can't recall, I broke it.

Something like this to fix? or is this going to put those "warnings by
stupid versions of gcc" back?
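
(The attached patch itself is not reproduced here, but roughly the
reordering under discussion looks like this - names illustrative,
signatures simplified:)

    /* perform_base_backup(), fixed ordering */
    startptr = do_pg_start_backup(opt->label, opt->fastcheckpoint);

    /* only now take the snapshot of existing tablespaces, so anything
     * created after the start checkpoint is covered by WAL replay */
    tablespaces = scan_pg_tblspc();     /* illustrative helper */

    /* ... stream $PGDATA and each tablespace in tar format ... */

    do_pg_stop_backup();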


--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

Attachment

Re: Streaming base backups

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> Something like this to fix? or is this going to put those "warnings by
> stupid versions of gcc" back?

Possibly.  If so, I'll fix it --- I have an old gcc to test against
here.
        regards, tom lane


Re: Streaming base backups

From
Magnus Hagander
Date:
On Sat, Jan 15, 2011 at 19:27, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> Something like this to fix? or is this going to put those "warnings by
>> stupid versions of gcc" back?
>
> Possibly.  If so, I'll fix it --- I have an old gcc to test against
> here.

Ok, thanks, I'll commit this one then.


--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: Streaming base backups

From
Tatsuo Ishii
Date:
> When does the standby launch its walreceiver? It would be extra-nice for
> the base backup tool to optionally continue streaming WALs until the
> standby starts doing it itself, so that wal_keep_segments is really
> deprecated.  No idea how feasible that is, though.

Good point. I have always been wondering why we can't use the existing
WAL transport infrastructure for sending/receiving WAL archive
segments in streaming replication.
If my memory serves, Fujii has already proposed such an idea but it was
rejected for some reason I don't understand.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


Re: Streaming base backups

From
Robert Haas
Date:
On Sat, Jan 15, 2011 at 8:33 PM, Tatsuo Ishii <ishii@postgresql.org> wrote:
>> When does the standby launch its walreceiver? It would be extra-nice for
>> the base backup tool to optionally continue streaming WALs until the
>> standby starts doing it itself, so that wal_keep_segments is really
>> deprecated.  No idea how feasible that is, though.
>
> Good point. I have always been wondering why we can't use the existing
> WAL transport infrastructure for sending/receiving WAL archive
> segments in streaming replication.
> If my memory serves, Fujii has already proposed such an idea but it was
> rejected for some reason I don't understand.

I must be confused, because you can use archive_command/restore_command
to transport WAL segments, in conjunction with streaming replication.

What Fujii-san unsuccessfully proposed was to have the master restore
segments from the archive and stream them to clients, on request.  It
was deemed better to have the slave obtain them from the archive
directly.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Streaming base backups

From
Tatsuo Ishii
Date:
>> Good point. I have always been wondering why we can't use the existing
>> WAL transport infrastructure for sending/receiving WAL archive
>> segments in streaming replication.
>> If my memory serves, Fujii has already proposed such an idea but it was
>> rejected for some reason I don't understand.
> 
> I must be confused, because you can use archive_command/restore_command
> to transport WAL segments, in conjunction with streaming replication.

Yes, but using restore_command is not terribly convenient. On
Linux/UNIX systems you typically have to enable ssh access, and that is
extremely hard on Windows.

IMO streaming replication is not yet easy enough to set up for
ordinary users. Making base backups easier has already been proposed,
and I think that's good. Why don't we go a little bit further?

> What Fujii-san unsuccessfully proposed was to have the master restore
> segments from the archive and stream them to clients, on request.  It
> was deemed better to have the slave obtain them from the archive
> directly.

Did Fujii-san agree with the conclusion?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


Re: Streaming base backups

From
Fujii Masao
Date:
On Mon, Jan 17, 2011 at 11:32 AM, Tatsuo Ishii <ishii@postgresql.org> wrote:
>>> Good point. I have always been wondering why we can't use the existing
>>> WAL transport infrastructure for sending/receiving WAL archive
>>> segments in streaming replication.
>>> If my memory serves, Fujii has already proposed such an idea but it was
>>> rejected for some reason I don't understand.
>>
>> I must be confused, because you can use archive_command/restore_command
>> to transport WAL segments, in conjunction with streaming replication.
>
> Yes, but using restore_command is not terribly convenient. On
> Linux/UNIX systems you typically have to enable ssh access, and that is
> extremely hard on Windows.

Agreed.

> IMO streaming replication is not yet easy enough to set up for
> ordinary users. Making base backups easier has already been proposed,
> and I think that's good. Why don't we go a little bit further?
>
>> What Fujii-san unsuccessfully proposed was to have the master restore
>> segments from the archive and stream them to clients, on request.  It
>> was deemed better to have the slave obtain them from the archive
>> directly.
>
> Did Fujii-san agree with the conclusion?

No. If that conclusion held, we would not need a streaming backup feature.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: Streaming base backups

From
Magnus Hagander
Date:
On Mon, Jan 17, 2011 at 03:32, Tatsuo Ishii <ishii@postgresql.org> wrote:
>>> Good point. I have always been wondering why we can't use the existing
>>> WAL transport infrastructure for sending/receiving WAL archive
>>> segments in streaming replication.
>>> If my memory serves, Fujii has already proposed such an idea but it was
>>> rejected for some reason I don't understand.
>>
>> I must be confused, because you can use archive_command/restore_command
>> to transport WAL segments, in conjunction with streaming replication.
>
> Yes, but using restore_command is not terribly convenient. On
> Linux/UNIX systems you typically have to enable ssh access, and that is
> extremely hard on Windows.

Agreed.


> IMO streaming replication is not yet easy enough to set up for
> ordinary users. Making base backups easier has already been proposed,
> and I think that's good. Why don't we go a little bit further?

With pg_basebackup, you can set up streaming replication in what's
basically a single command (run the base backup, copy in a
recovery.conf file). In my first version I even had a switch that
would create the recovery.conf file for you - should we bring that
back?

It does require you to set a "reasonable" wal_keep_segments, though,
but that's really all you need to do on the master side.
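
For concreteness, the whole standby setup then looks roughly like this
(paths and the connection string are made up; option names follow the
pg_basebackup patch):

    # take the base backup from the master
    pg_basebackup -D /srv/pgsql/standby -h master.example.com

    # recovery.conf in the new data directory
    standby_mode = 'on'
    primary_conninfo = 'host=master.example.com'

    # on the master, postgresql.conf
    wal_keep_segments = 32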


>> What Fujii-san unsuccessfully proposed was to have the master restore
>> segments from the archive and stream them to clients, on request.  It
>> was deemed better to have the slave obtain them from the archive
>> directly.
>
> Did Fujii-san agree with the conclusion?

I can see the point of the master being able to do this, but it
seems like a pretty narrow use case, really. I think we invented
wal_keep_segments partially to solve this problem in a neater way?

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: Streaming base backups

From
Dimitri Fontaine
Date:
Magnus Hagander <magnus@hagander.net> writes:
> With pg_basebackup, you can set up streaming replication in what's
> basically a single command (run the base backup, copy in a
> recovery.conf file). In my first version I even had a switch that
> would create the recovery.conf file for you - should we bring that
> back?

+1.  Well, make it optional maybe?

> It does require you to set a "reasonable" wal_keep_segments, though,
> but that's really all you need to do on the master side.

Until we get integrated WAL streaming while the base backup is ongoing.
We don't know when that is (9.1 or future), but that's what we're aiming
to now, right?

>>> What Fujii-san unsuccessfully proposed was to have the master restore
>>> segments from the archive and stream them to clients, on request.  It
>>> was deemed better to have the slave obtain them from the archive
>>> directly.
>>
>> Did Fujii-san agree with the conclusion?
>
> I can see the point of the master being able to do this, but it
> seems like a pretty narrow use case, really. I think we invented
> wal_keep_segments partially to solve this problem in a neater way?

Well I still think that the easiest setup we can offer here is to ship
with integrated libpq-based archive and restore commands.  Those could
be bin/pg_walsender and bin/pg_walreceiver.  They would have some
switches to make them suitable for running as subprocesses of either the
base backup utility or the default libpq-based archive daemon.

Again, all of that is not necessarily material for 9.1, despite having
all the pieces already coded and tested, mainly in Magnus's hands.  But
could we get agreement about going this route?

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support


Re: Streaming base backups

From
Magnus Hagander
Date:
On Mon, Jan 17, 2011 at 11:18, Dimitri Fontaine <dimitri@2ndquadrant.fr> wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> With pg_basebackup, you can set up streaming replication in what's
>> basically a single command (run the base backup, copy in a
>> recovery.conf file). In my first version I even had a switch that
>> would create the recovery.conf file for you - should we bring that
>> back?
>
> +1.  Well, make it optional maybe?

It has always been optional. Basically it just creates a recovery.conf file with:

    primary_conninfo=<whatever pg_streamrecv was using>
    standby_mode=on


>> It does require you to set a "reasonable" wal_keep_segments, though,
>> but that's really all you need to do on the master side.
>
> Until we get integrated WAL streaming while the base backup is ongoing.
> We don't know when that is (9.1 or future), but that's what we're aiming
> to now, right?

Yeah, it does sound like a plan. But we should still allow both -
streaming in parallel will eat two connections, and I'm sure some
people might consider that a higher cost.


>>>> What Fujii-san unsuccessfully proposed was to have the master restore
>>>> segments from the archive and stream them to clients, on request.  It
>>>> was deemed better to have the slave obtain them from the archive
>>>> directly.
>>>
>>> Did Fujii-san agree with the conclusion?
>>
>> I can see the point of the master being able to do this, but it
>> seems like a pretty narrow use case, really. I think we invented
>> wal_keep_segments partially to solve this problem in a neater way?
>
> Well I still think that the easiest setup we can offer here is to ship
> with integrated libpq-based archive and restore commands.  Those could
> be bin/pg_walsender and bin/pg_walreceiver.  They would have some
> switches to make them suitable for running as subprocesses of either the
> base backup utility or the default libpq-based archive daemon.

Not sure why they'd run as an archive command and not like now as a
replication client - but let's keep that out of this thread and in a
new one :)

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: Streaming base backups

From
Dimitri Fontaine
Date:
Magnus Hagander <magnus@hagander.net> writes:
>> Until we get integrated WAL streaming while the base backup is ongoing.
>> We don't know when that is (9.1 or future), but that's what we're aiming
>> to now, right?
>
> Yeah, it does sound like a plan. But we should still allow both -
> streaming in parallel will eat two connections, and I'm sure some
> people might consider that a higher cost.

Sure.  Ah, tradeoffs :)

>> Well I still think that the easiest setup we can offer here is to ship
>> with integrated libpq-based archive and restore commands.  Those could
>> be bin/pg_walsender and bin/pg_walreceiver.  They would have some
>> switches to make them suitable for running as subprocesses of either the
>> base backup utility or the default libpq-based archive daemon.
>
> Not sure why they'd run as an archive command and not like now as a
> replication client - but let's keep that out of this thread and in a
> new one :)

On the archiving side you're right that it's not necessary, but it
would be for the restore side.  Sure enough, thinking about it some
more, what we would like here is for the standby to be able to talk to
the archive server (pg_streamsendrecv) rather than the primary, in
order to offload it.  Ok, scratch all that and get cascading support
instead :)

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support