Thread: refactoring basebackup.c
Hi,

I'd like to propose a fairly major refactoring of the server's basebackup.c. The current code isn't horrific or anything, but the base backup mechanism has grown quite a few features over the years and all of the code knows about all of the features. This is going to make it progressively more difficult to add additional features, and I have a few in mind that I'd like to add, as discussed below and also on several other recent threads.[1][2] The attached patch set shows what I have in mind. It needs more work, but I believe that there's enough here for someone to review the overall direction, and even some of the specifics, and hopefully give me some useful feedback.

This patch set is built around the idea of creating two new abstractions, a base backup sink -- or bbsink -- and a base backup archiver -- or bbarchiver. Each of these works like a foreign data wrapper or custom scan or TupleTableSlot. That is, there's a table of function pointers that act like method callbacks. Every implementation can allocate a struct of sufficient size for its own bookkeeping data, and the first member of the struct is always the same, and basically holds the data that all implementations must store, including a pointer to the table of function pointers. If we were using C++, bbarchiver and bbsink would be abstract base classes.
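To make the callback-table idea concrete, here is a minimal sketch of roughly the kind of thing I mean. Every type, field, and callback name below is purely illustrative and not necessarily what the attached patches actually use:

#include "postgres.h"           /* for the usual uint64/int64 typedefs */

typedef struct bbsink bbsink;

/* Hypothetical table of "methods"; the names are made up for illustration. */
typedef struct bbsink_ops
{
    void        (*begin_backup) (bbsink *sink);
    void        (*begin_archive) (bbsink *sink, const char *archive_name);
    void        (*archive_contents) (bbsink *sink, const char *data, size_t len);
    void        (*end_archive) (bbsink *sink);
    void        (*end_backup) (bbsink *sink);
    void        (*cleanup) (bbsink *sink);
} bbsink_ops;

/* State common to every implementation; always the first struct member. */
struct bbsink
{
    const bbsink_ops *ops;      /* pointer to the callback table */
    bbsink     *next;           /* next sink in the chain, if any */
};

/* An implementation embeds the common part as its first member and then
 * adds whatever private bookkeeping it needs, e.g. for throttling. */
typedef struct bbsink_throttle
{
    bbsink      base;
    uint64      throttling_sample;  /* bytes allowed per sampling interval */
    int64       throttling_counter; /* bytes remaining in this interval */
} bbsink_throttle;

basebackup.c would then just invoke the callbacks, e.g. sink->ops->archive_contents(sink, buf, nbytes), without caring whether the other end talks to libpq, applies a rate limit, or does something else entirely. bbarchiver would look much the same, except that its callbacks would be per-file rather than per-archive.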
They represent closely-related concepts, so much so that I initially thought we could get by with just one new abstraction layer. I found on experimentation that this did not work well, so I split it up into two and that worked a lot better. The distinction is this: a bbsink is something to which you can send a bunch of archives -- currently, each would be a tarfile -- and also a backup manifest. A bbarchiver is something to which you send every file in the data directory individually, or at least the ones that are getting backed up, plus any that are being injected into the backup (e.g. the backup_label). Commonly, a bbsink will do something with the data and then forward it to a subsequent bbsink, or a bbarchiver will do something with the data and then forward it to a subsequent bbarchiver or bbsink. For example, there's a bbarchiver_tar object which, like any bbarchiver, sees all the files and their contents as input. The output is a tarfile, which gets sent to a bbsink. As things stand in the patch set now, the tar archives are ultimately sent to the "libpq" bbsink, which sends them to the client.

In the future, we could have other bbarchivers. For example, we could add a "pax", "zip", or "cpio" bbarchiver which produces archives of that format, and any given backup could choose which one to use. Or, we could have a bbarchiver that runs each individual file through a compression algorithm and then forwards the resulting data to a subsequent bbarchiver. That would make it easy to produce a tarfile of individually compressed files, which is one possible way of creating a seekable archive.[3] Likewise, we could have other bbsinks. For example, we could have a "localdisk" bbsink that causes the server to write the backup somewhere in the local filesystem instead of streaming it out over libpq. Or, we could have an "s3" bbsink that writes the archives to S3. We could also have bbsinks that compress the input archives using some compressor (e.g. lz4, zstd, bzip2, ...) and forward the resulting compressed archives to the next bbsink in the chain.

I'm not trying to pass judgement on whether any of these particular things are things we want to do, nor am I saying that this patch set solves all the problems with doing them. However, I believe it will make such things a whole lot easier to implement, because all of the knowledge about whatever new functionality is being added is centralized in one place, rather than being spread across the entirety of basebackup.c. As an example of this, look at how 0010 changes basebackup.c and basebackup_tar.c: afterwards, basebackup.c no longer knows anything that is tar-specific, whereas right now it knows about tar-specific things in many places.

Here's an overview of this patch set:

0001-0003 are cleanup patches that I have posted for review on separate threads.[4][5] They are included here to make it easy to apply this whole series if someone wishes to do so.

0004 is a minor refactoring that reduces by 1 the number of functions in basebackup.c that know about the specifics of tarfiles. It is just a preparatory patch and probably not very interesting.

0005 invents the bbsink abstraction.

0006 creates basebackup_libpq.c and moves all code that knows about the details of sending archives via libpq there. The functionality is exposed for use by basebackup.c as a new type of bbsink, bbsink_libpq.

0007 creates basebackup_throttle.c and moves all code that knows about throttling backups there. The functionality is exposed for use by basebackup.c as a new type of bbsink, bbsink_throttle. This means that the throttling logic could be reused to throttle output to any final destination. Essentially, this is a bbsink that just passes everything it gets through to the next bbsink, but with a rate limit. If throttling's not enabled, no bbsink_throttle object is created, so all of the throttling code is completely out of the execution pipeline.
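Using the illustrative names from the sketch above, the pass-through behaviour of such a sink could be as simple as the following; throttle() here stands in for whatever sleeping logic enforces the rate limit and is not an existing function:

/* Delay as needed to honor the rate limit, then hand the data on. */
static void
bbsink_throttle_archive_contents(bbsink *sink, const char *data, size_t len)
{
    throttle((bbsink_throttle *) sink, len);    /* hypothetical rate limiter */
    sink->next->ops->archive_contents(sink->next, data, len);
}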
0008 creates basebackup_progress.c and moves all code that knows about progress reporting there. The functionality is exposed for use by basebackup.c as a new type of bbsink, bbsink_progress. Since the abstraction doesn't fit perfectly in this case, some extra functions are added to work around the problem. This is not entirely elegant, but I think it's still an improvement over what we have now, and I don't have a better idea.

0009 invents the bbarchiver abstraction.

0010 invents two new bbarchivers, a tar bbarchiver and a tarsize bbarchiver, and refactors basebackup.c to make use of them. The tar bbarchiver puts the files it sees into tar archives and forwards the resulting archives to a bbsink. The tarsize bbarchiver is used to support the PROGRESS option to the BASE_BACKUP command. It just estimates the size of the backup by summing up the file sizes without reading them. This approach is good for a couple of reasons. First, without something like this, it's impossible to keep basebackup.c from knowing something about the tar format, because the PROGRESS option doesn't just figure out how big the files to be backed up are: it figures out how big it thinks the archives will be, and that involves tar-specific considerations. This area needs more work, as the whole idea of measuring progress by estimating the archive size is going to break down as soon as server-side compression is in the picture. Second, this makes the code path that we use for figuring out the backup size much more similar to the path we use for performing the actual backup. For instance, with this patch, we include the exact same files in the calculation that we will include in the backup, and in the same order, something that's not true today. The basebackup_tar.c file added by this patch is sadly lacking in comments, which I will add in a future version of the patch set. I think, though, that it will not be too unclear what's going on here.

0011 invents another new kind of bbarchiver. This bbarchiver just eavesdrops on the stream of files to facilitate backup manifest construction, and then forwards everything through to a subsequent bbarchiver. Like bbsink_throttle, it can be entirely omitted if not used. This patch is a bit clunky at the moment and needs some polish, but it is another demonstration of how these abstractions can be used to simplify basebackup.c, so that basebackup.c only has to worry about determining what should be backed up and not have to worry much about all the specific things that need to be done as part of that.

Although this patch set adds quite a bit of code on net, it makes basebackup.c considerably smaller and simpler, removing more than 400 lines of code from that file, about 20% of the current total. There are some gratifying changes vs. the status quo. For example, in master, we have this:

sendDir(const char *path, int basepathlen, bool sizeonly, List *tablespaces,
        bool sendtblspclinks, backup_manifest_info *manifest,
        const char *spcoid)

Notably, the sizeonly flag makes the function not do what the name of the function suggests that it does. Also, we've got to pass some extra fields through to enable specific features. With the patch set, the equivalent function looks like this:

archive_directory(bbarchiver *archiver, const char *path, int basepathlen,
                  List *tablespaces, bool sendtblspclinks)

The question "what should I do with the directories and files we find as we recurse?" is now answered by the choice of which bbarchiver to pass to the function, rather than by the values of sizeonly, manifest, and spcoid. That's not night and day, but I think it's better, especially as you imagine adding more features in the future. The really important part, for me, is that you can make the bbarchiver do anything you like without needing to make any more changes to this function. It just arranges to invoke your callbacks. You take it from there.

One pretty major question that this patch set doesn't address is what the user interface for any of the hypothetical features mentioned above ought to look like, or how basebackup.c ought to support them. The syntax for the BASE_BACKUP command, like the contents of basebackup.c, has grown organically, and doesn't seem to be very scalable. Also, the wire protocol - a series of CopyData results which the client is entirely responsible for knowing how to interpret and about which the server provides only minimal information - doesn't much lend itself to extensibility. Some careful design work is likely needed in both areas, and this patch does not try to do any of it. I am quite interested in discussing those questions, but I felt that they weren't the most important problems to solve first.

What do you all think?
Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] http://postgr.es/m/CA+TgmoZubLXYR+Pd_gi3MVgyv5hQdLm-GBrVXkun-Lewaw12Kg@mail.gmail.com
[2] http://postgr.es/m/CA+TgmoYr7+-0_vyQoHbTP5H3QGZFgfhnrn6ewDteF=kUqkG=Fw@mail.gmail.com
[3] http://postgr.es/m/CA+TgmoZQCoCyPv6fGoovtPEZF98AXCwYDnSB0=p5XtxNY68r_A@mail.gmail.com and following
[4] http://postgr.es/m/CA+TgmoYq+59SJ2zBbP891ngWPA9fymOqntqYcweSDYXS2a620A@mail.gmail.com
[5] http://postgr.es/m/CA+TgmobWbfReO9-XFk8urR1K4wTNwqoHx_v56t7=T8KaiEoKNw@mail.gmail.com
So it might be good if I'd remembered to attach the patches. Let's try that again. ...Robert
Attachment
- v1-0001-Don-t-export-basebackup.c-s-sendTablespace.patch
- v1-0002-Minor-code-cleanup-for-perform_base_backup.patch
- v1-0003-Assorted-cleanup-of-tar-related-code.patch
- v1-0004-Recast-_tarWriteDirectory-as-convert_link_to_dire.patch
- v1-0005-Introduce-bbsink-abstraction.patch
- v1-0006-Convert-libpq-related-code-to-a-bbsink.patch
- v1-0007-Convert-throttling-related-code-to-a-bbsink.patch
- v1-0008-Convert-progress-reporting-code-to-a-bbsink.patch
- v1-0009-Introduce-bbarchiver-abstraction.patch
- v1-0010-Create-and-use-bbarchiver-implementations-for-tar.patch
- v1-0011-WIP-Convert-backup-manifest-generation-to-a-bbarc.patch
Hi,

On 2020-05-08 16:53:09 -0400, Robert Haas wrote:
> They represent closely-related concepts, so much so that I initially
> thought we could get by with just one new abstraction layer. I found
> on experimentation that this did not work well, so I split it up into
> two and that worked a lot better. The distinction is this: a bbsink is
> something to which you can send a bunch of archives -- currently, each
> would be a tarfile -- and also a backup manifest. A bbarchiver is
> something to which you send every file in the data directory
> individually, or at least the ones that are getting backed up, plus
> any that are being injected into the backup (e.g. the backup_label).
> Commonly, a bbsink will do something with the data and then forward it
> to a subsequent bbsink, or a bbarchiver will do something with the
> data and then forward it to a subsequent bbarchiver or bbsink. For
> example, there's a bbarchiver_tar object which, like any bbarchiver,
> sees all the files and their contents as input. The output is a
> tarfile, which gets sent to a bbsink. As things stand in the patch set
> now, the tar archives are ultimately sent to the "libpq" bbsink, which
> sends them to the client.

Hm. I wonder if there are cases where recursively forwarding like this will cause noticeable performance effects. The only operation that seems frequent enough to potentially be noticeable would be "chunks" of the file. So perhaps it'd be good to make sure we read in large enough chunks?

> 0010 invents two new bbarchivers, a tar bbarchiver and a tarsize
> bbarchiver, and refactors basebackup.c to make use of them. The tar
> bbarchiver puts the files it sees into tar archives and forwards the
> resulting archives to a bbsink. The tarsize bbarchiver is used to
> support the PROGRESS option to the BASE_BACKUP command. It just
> estimates the size of the backup by summing up the file sizes without
> reading them. This approach is good for a couple of reasons. First,
> without something like this, it's impossible to keep basebackup.c from
> knowing something about the tar format, because the PROGRESS option
> doesn't just figure out how big the files to be backed up are: it
> figures out how big it thinks the archives will be, and that involves
> tar-specific considerations.

ISTM that it's not actually good to have the progress calculations include the tar overhead. As you say:

> This area needs more work, as the whole idea of measuring progress by
> estimating the archive size is going to break down as soon as
> server-side compression is in the picture.

This, to me, indicates that we should measure the progress solely based on how much of the "source" data was processed. The overhead of tar, the reduction due to compression, shouldn't be included.

> What do you all think?

I've not thought enough about the specifics, but I think it looks like it's going roughly in a better direction.

One thing I wonder about is how stateful the interface is. Archivers will pretty much always track which file is currently open etc. Somehow such a repeating state machine seems a bit ugly - but I don't really have a better answer.

Greetings,

Andres Freund
On Fri, May 8, 2020 at 5:27 PM Andres Freund <andres@anarazel.de> wrote:
> I wonder if there are cases where recursively forwarding like this will
> cause noticeable performance effects. The only operation that seems
> frequent enough to potentially be noticeable would be "chunks" of the
> file. So perhaps it'd be good to make sure we read in large enough
> chunks?

Yeah, that needs to be tested. Right now the chunk size is 32kB but it might be a good idea to go larger. Another thing is that right now the chunk size is tied to the protocol message size, and I'm not sure whether the size that's optimal for disk reads is also optimal for protocol messages.

> This, to me, indicates that we should measure the progress solely based
> on how much of the "source" data was processed. The overhead of tar, the
> reduction due to compression, shouldn't be included.

I don't think it's a particularly bad thing that we include a small amount of progress for sending an empty file, a directory, or a symlink. That could make the results more meaningful if you have a database with lots of empty relations in it. However, I agree that the effect of compression shouldn't be included. To get there, I think we need to redesign the wire protocol. Right now, the server has no way of letting the client know how many uncompressed bytes it's sent, and the client has no way of figuring it out without uncompressing, which seems like something we want to avoid.

There are some other problems with the current wire protocol, too:

1. The syntax for the BASE_BACKUP command is large and unwieldy. We really ought to adopt an extensible options syntax, like COPY, VACUUM, EXPLAIN, etc. do, rather than using a zillion ad-hoc bolt-ons, each with bespoke lexer and parser support.

2. The client is sent a list of tablespaces and is supposed to use that to expect an equal number of archives, computing the name for each one on the client side from the tablespace info. However, I think we should be able to support modes like "put all the tablespaces in a single archive" or "send a separate archive for every 256GB" or "ship it all to the cloud and don't send me any archives". To get there, I think we should have the server send the archive name to the clients, and the client should just keep receiving the next archive until it's told that there are no more. Then if there's one archive or ten archives or no archives, the client doesn't have to care. It just receives what the server sends until it hears that there are no more. It also doesn't know how the server is naming the archives; the server can, for example, adjust the archive names based on which compression format is being chosen, without knowledge of the server's naming conventions needing to exist on the client side.

I think we should keep support for the current BASE_BACKUP command but either add a new variant using an extensible options syntax, or else invent a whole new command with a different name (BACKUP, SEND_BACKUP, whatever) that takes extensible options. This command should send back all the archives and the backup manifest using a single COPY stream rather than multiple COPY streams. Within the COPY stream, we'll invent a sub-protocol, e.g. based on the first letter of the message, e.g.:

t = Tablespace boundary. No further message payload. Indicates, for progress reporting purposes, that we are advancing to the next tablespace.

f = Filename. The remainder of the message payload is the name of the next file that will be transferred.

d = Data. The next four bytes contain the number of uncompressed bytes covered by this message, for progress reporting purposes. The rest of the message is payload, possibly compressed. Could be empty, if the data is being shipped elsewhere, and these messages are only being sent to update the client's notion of progress.
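Just to illustrate how dumb the client could then be, here is a rough sketch of the dispatch on that first byte. None of this code exists anywhere; the message letters follow the proposal above and every helper named here is hypothetical:

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* Hypothetical helpers, named only for illustration. */
extern void advance_to_next_tablespace(void);
extern void note_current_filename(const char *name, size_t len);
extern void update_progress(uint32_t uncompressed_bytes);
extern void write_payload(const char *data, size_t len);
extern void protocol_error(char msgtype);

/* Handle one CopyData message from the proposed backup sub-protocol. */
static void
process_backup_message(const char *msg, size_t len)
{
    switch (msg[0])
    {
        case 't':               /* tablespace boundary, no payload */
            advance_to_next_tablespace();
            break;
        case 'f':               /* rest of the message is a file name */
            note_current_filename(msg + 1, len - 1);
            break;
        case 'd':               /* 4-byte uncompressed byte count, then payload */
            {
                uint32_t    progress;

                memcpy(&progress, msg + 1, sizeof(progress));
                update_progress(ntohl(progress));
                if (len > 5)
                    write_payload(msg + 5, len - 5);    /* possibly compressed */
                break;
            }
        default:
            protocol_error(msg[0]);
            break;
    }
}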
> I've not thought enough about the specifics, but I think it looks like
> it's going roughly in a better direction.

Good to hear.

> One thing I wonder about is how stateful the interface is. Archivers
> will pretty much always track which file is currently open etc. Somehow
> such a repeating state machine seems a bit ugly - but I don't really
> have a better answer.

I thought about that a bit, too. There might be some way to unify that by having some common context object that's defined by basebackup.c and all archivers get it, so that they have some commonly-desired details without needing bespoke code, but I'm not sure at this point whether that will actually produce a nicer result. Even if we don't have it initially, it seems like it wouldn't be very hard to add it later, so I'm not too stressed about it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, May 9, 2020 at 2:23 AM Robert Haas <robertmhaas@gmail.com> wrote:
> [...]
> What do you all think?

The overall idea looks quite nice. I had a look at some of the patches, at least 0005 and 0006.
At first look, I have one comment:

+/*
+ * Each archive is sent as a separate stream of COPY data, and thus begins
+ * with a CopyOutResponse message.
+ */
+static void
+bbsink_libpq_begin_archive(bbsink *sink, const char *archive_name)
+{
+    SendCopyOutResponse();
+}

Some of the bbsink_libpq_* functions call pq_* functions directly, e.g. bbsink_libpq_begin_backup, whereas others call SendCopy* functions, which in turn call the pq_* functions. I think the bbsink_libpq_* functions could call the pq_* functions directly instead of adding one more level of function call.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Tue, May 12, 2020 at 4:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Some of the bbsink_libpq_* functions call pq_* functions directly,
> e.g. bbsink_libpq_begin_backup, whereas others call SendCopy*
> functions, which in turn call the pq_* functions. I think the
> bbsink_libpq_* functions could call the pq_* functions directly
> instead of adding one more level of function call.

I think all the helper functions have more than one caller, though. That's why I created them - to avoid duplicating code.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, May 13, 2020 at 1:56 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, May 12, 2020 at 4:32 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > Some of the bbsink_libpq_* functions call pq_* functions directly,
> > e.g. bbsink_libpq_begin_backup, whereas others call SendCopy*
> > functions, which in turn call the pq_* functions. I think the
> > bbsink_libpq_* functions could call the pq_* functions directly
> > instead of adding one more level of function call.
>
> I think all the helper functions have more than one caller, though.
> That's why I created them - to avoid duplicating code.

You are right, somehow I missed that part. Sorry for the noise.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
- The max rate at which the transfer happens when TAR_SEND_SIZE is 128kB is at most 0.48 GB/sec. Would it be possible to find out what buffer size is actually being used? That could help explain part of the puzzle.
- Secondly, taking just the min of two runs is a bit problematic: how do we justify the performance numbers and establish that the differences are not just noise? It might be better to do a few experiments of each kind, fit a basic linear model, and report the standard deviation. "Order statistics" such as min(X1, X2, ..., Xn) are generally biased estimators, and a variance calculation on a biased statistic is tricky, so the results could be corrupted by noise.
Hi,

Did some performance testing by varying TAR_SEND_SIZE with Robert's refactor patch and without the patch to check the impact. Below are the details:

Backup type: local backup using pg_basebackup
Data size: around 200GB (200 tables, each table around 1.05 GB)
TAR_SEND_SIZE values tried: 8kB, 32kB (default value), 128kB, 1MB (1024kB)

Server details:
RAM: 500 GB
CPU: Architecture x86_64, CPU op-mode(s) 32-bit/64-bit, Byte Order Little Endian, CPU(s) 128
Filesystem: ext4
| TAR_SEND_SIZE | 8kB | 32kB (default value) | 128kB | 1024kB |
|---|---|---|---|---|
| Without refactor patch | real 10m22.718s, user 1m23.629s, sys 8m51.410s | real 8m36.245s, user 1m8.471s, sys 7m21.520s | real 6m54.299s, user 0m55.690s, sys 5m46.502s | real 18m3.511s, user 1m38.197s, sys 9m36.517s |
| With refactor patch (Robert's patch) | real 10m11.350s, user 1m25.038s, sys 8m39.226s | real 8m56.226s, user 1m9.774s, sys 7m41.032s | real 7m26.678s, user 0m54.833s, sys 6m20.057s | real 18m17.230s, user 1m42.749s, sys 9m53.704s |

The above numbers are taken from the minimum of two runs of each scenario. I can see that when TAR_SEND_SIZE is 32kB or 128kB we get good performance, whereas at 1MB it takes about 2.5x more time.

Please let me know your thoughts/suggestions on the same.

--
Thanks & Regards,
Suraj Kharage,
EnterpriseDB Corporation,
The Postgres Database Company.
So the patch came out slightly faster at 8kB and slightly slower in the other tests. That's kinda strange. I wonder if it's just noise. How much do the results vary run to run?
Hi,

On Wed, May 13, 2020 at 7:49 PM Robert Haas <robertmhaas@gmail.com> wrote:
> So the patch came out slightly faster at 8kB and slightly slower in the
> other tests. That's kinda strange. I wonder if it's just noise. How much
> do the results vary run to run?

It is not varying much except for the 8kB run. Please see below the details for both runs of each scenario.
| | Run | 8kB | 32kB (default value) | 128kB | 1024kB |
|---|---|---|---|---|---|
| Without refactor patch | 1st run | real 10m50.924s, user 1m29.774s, sys 9m13.058s | real 8m36.245s, user 1m8.471s, sys 7m21.520s | real 7m8.690s, user 0m54.840s, sys 6m1.725s | real 18m16.898s, user 1m39.105s, sys 9m42.803s |
| | 2nd run | real 10m22.718s, user 1m23.629s, sys 8m51.410s | real 8m44.455s, user 1m7.896s, sys 7m28.909s | real 6m54.299s, user 0m55.690s, sys 5m46.502s | real 18m3.511s, user 1m38.197s, sys 9m36.517s |
| With refactor patch | 1st run | real 10m11.350s, user 1m25.038s, sys 8m39.226s | real 8m56.226s, user 1m9.774s, sys 7m41.032s | real 7m26.678s, user 0m54.833s, sys 6m20.057s | real 19m5.218s, user 1m44.122s, sys 10m17.623s |
| | 2nd run | real 11m30.500s, user 1m45.221s, sys 9m37.815s | real 9m4.103s, user 1m6.893s, sys 7m49.393s | real 7m26.713s, user 0m54.868s, sys 6m19.652s | real 18m17.230s, user 1m42.749s, sys 9m53.704s |

--
Thanks & Regards,
Suraj Kharage,
EnterpriseDB Corporation,
The Postgres Database Company.
Hi,

I have repeated the experiment with the 8kB block size and found that the results are not varying much after applying the patch. Please find the details below. Later I connected with Suraj to validate the experiment details and found that the setup and steps followed are exactly the same in this experiment when compared with the previous experiment.

| Iteration | Without refactor patch | With refactor patch |
|---|---|---|
| 1st run | real 10m19.001s, user 1m37.895s, sys 8m33.008s | real 9m45.291s, user 1m23.192s, sys 8m14.993s |
| 2nd run | real 9m33.970s, user 1m19.490s, sys 8m6.062s | real 9m30.560s, user 1m22.124s, sys 8m0.979s |
| 3rd run | real 9m19.327s, user 1m21.772s, sys 7m50.613s | real 8m59.241s, user 1m19.001s, sys 7m32.645s |
| 4th run | real 9m56.873s, user 1m22.370s, sys 8m27.054s | real 9m52.290s, user 1m22.175s, sys 8m23.052s |
| 5th run | real 9m45.343s, user 1m23.113s, sys 8m15.418s | real 9m49.633s, user 1m23.122s, sys 8m19.240s |
On Fri, May 8, 2020 at 4:55 PM Robert Haas <robertmhaas@gmail.com> wrote:
> So it might be good if I'd remembered to attach the patches. Let's try
> that again.

Here's an updated patch set. This is now rebased over master and includes as 0001 the patch I posted separately at http://postgr.es/m/CA+TgmobAczXDRO_Gr2euo_TxgzaH1JxbNxvFx=HYvBinefNH8Q@mail.gmail.com but drops some other patches that were committed meanwhile. 0002-0009 of this series are basically the same as 0004-0011 from the previous series, except for rebasing and fixing a bug I discovered in what's now 0006. 0010 does a refactoring of pg_basebackup along similar lines to the server-side refactoring from patches earlier in the series. 0011 is a really terrible, hacky, awful demonstration of how this infrastructure can support server-side compression. If you apply it and take a tar-format backup without -R, you will get .tar files that are actually .tar.gz files. You can rename them, decompress them, and use pg_verifybackup to check that everything is OK. If you try to do anything else with 0011 applied, everything will break.

In the process of working on this, I learned a lot about how pg_basebackup actually works, and found out about a number of things that, with the benefit of hindsight, seem like they might not have been the best way to go.

1. pg_basebackup -R injects recovery.conf (on older versions) or injects standby.signal and appends to postgresql.auto.conf (on newer versions) by parsing the tar file sent by the server and editing it on the fly. From the point of view of server-side compression, this is not ideal, because if you want to make these kinds of changes when server-side compression is in use, you'd have to decompress the stream on the client side in order to figure out where in the stream you ought to inject your changes. But having to do that is a major expense. If the client instead told the server what to change when generating the archive, and the server did it, this expense could be avoided. It would have the additional advantage that the backup manifest could reflect the effects of those changes; right now it doesn't, and pg_verifybackup just knows to expect differences in those files.

2. According to the comments, some tar programs require two tar blocks (i.e. 512-byte blocks) of zero bytes at the end of an archive. The server does not generate these blocks of zero bytes, so it basically creates a tar file that works fine with my copy of tar but might break with somebody else's. Instead, the client appends 1024 zero bytes to the end of every file it receives from the server. That is an odd way of fixing this problem, and it makes things rather inflexible. If the server sends you any kind of a file OTHER THAN a tar file with the last 1024 zero bytes stripped off, then adding 1024 zero bytes will be the wrong thing to do. It would be better if the server just generated fully correct tar files (whatever we think that means) and the client wrote out exactly what it got from the server. Then, we could have the server generate cpio archives or zip files or gzip-compressed tar files or lz4-compressed tar files or anything we like, and the client wouldn't really need to care as long as it didn't need to extract those archives. That seems a lot cleaner.
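For what it's worth, generating that trailer on the server side is trivial, since a tar archive just ends with two 512-byte blocks of zero bytes. Roughly, with bbsink_archive_contents standing in for whatever the sink's put-data callback ends up being called:

/* Properly terminate a tar archive: two 512-byte blocks of zeros. */
static void
tar_write_end_of_archive(bbsink *sink)
{
    char        zerobuf[2 * 512];

    memset(zerobuf, 0, sizeof(zerobuf));
    bbsink_archive_contents(sink, zerobuf, sizeof(zerobuf));   /* hypothetical */
}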
3. The way that progress reporting works relies on the server knowing exactly how large the archive sent to the client is going to be. Progress as reckoned by the client is equal to the number of archive payload bytes the client has received. This works OK with a tar because we know how big the tar file is going to be based on the size of the input files we intend to send, but it's unsuitable for any sort of compressed archive (tar.gz, zip, whatever) because the compression ratio cannot be predicted in advance. It would be better if the server sent the payload bytes (possibly compressed) interleaved with progress indicators, so that the client could correctly indicate that, say, the backup is 30% complete because 30GB of 100GB has been processed on the server side, even though the amount of data actually received by the client might be 25GB or 20GB or 10GB or whatever because it got compressed before transmission.

4. A related consideration is that we might want to have an option to do something with the backup other than send it to the client. For example, it might be useful to have an option for pg_basebackup to tell the server to write the backup files to some specified server directory, or to, say, S3. There are security concerns there, and I'm not proposing to do anything about this immediately, but it seems like something we might eventually want to have. In such a case, we are not going to send any payload to the client, but the client probably still wants progress indicators, so the current system of coupling progress to the number of bytes received by the client breaks down for that reason also.

5. As things stand today, the client must know exactly how many archives it should expect to receive from the server and what each one is. It can do that, because it knows to expect one archive per tablespace, and the archive must be an uncompressed tarfile, so there is no ambiguity. But, if the server could send archives to other places, or send other kinds of archives to the client, then this would become more complex. There is no intrinsic reason why the logic on the client side can't simply be made more complicated in order to cope, but it doesn't seem like great design, because then every time you enhance the server, you've also got to enhance the client, and that limits cross-version compatibility, and also seems more fragile. I would rather that the server advertise the number of archives and the names of each archive to the client explicitly, allowing the client to be dumb unless it needs to post-process (e.g. extract) those archives.

Putting all of the above together, what I propose - but have not yet tried to implement - is a new COPY sub-protocol for taking base backups. Instead of sending a COPY stream per archive, the server would send a single COPY stream where the first byte of each message is a type indicator, like we do with the replication sub-protocol today. For example, if the first byte is 'a' that could indicate that we're beginning a new archive and the rest of the message would indicate the archive name and perhaps some flags or options. If the first byte is 'p' that could indicate that we're sending archive payload, perhaps with the first four bytes of the message being progress, i.e. the number of newly-processed bytes on the server side prior to any compression, and the remaining bytes being payload. On receipt of such a message, the client would increment the progress indicator by the value indicated in those first four bytes, and then process the remaining bytes by writing them to a file or whatever behavior the user selected via -Fp, -Ft, -Z, etc. To be clear, I'm not saying that this specific thing is the right thing, just something of this sort.
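On the server side, emitting messages of that sort would not need much new machinery. A rough sketch using the existing pqformat routines might look like this; the message letters and the exact layout are, again, just an assumption for illustration:

#include "postgres.h"
#include "libpq/pqformat.h"

/* Announce the start of a new archive to the client. */
static void
send_new_archive_message(const char *archive_name)
{
    StringInfoData buf;

    pq_beginmessage(&buf, 'd');     /* still wrapped in CopyData, as today */
    pq_sendbyte(&buf, 'a');         /* sub-protocol: beginning a new archive */
    pq_sendstring(&buf, archive_name);
    pq_endmessage(&buf);
}

/* Send (possibly compressed) payload plus the number of newly-processed
 * uncompressed bytes, so the client can track progress accurately. */
static void
send_payload_message(uint32 uncompressed_bytes, const char *data, int len)
{
    StringInfoData buf;

    pq_beginmessage(&buf, 'd');
    pq_sendbyte(&buf, 'p');         /* sub-protocol: payload + progress */
    pq_sendint32(&buf, uncompressed_bytes);
    pq_sendbytes(&buf, data, len);
    pq_endmessage(&buf);
}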
The server would need to continue supporting the current multi-copy protocol for compatibility with older pg_basebackup versions, and pg_basebackup would need to continue to support it for compatibility with older server versions, but we could use the new approach going forward. (Or, we could break compatibility, but that would probably be unpopular and seems unnecessary and even risky to me at this point.) The ideas in the previous paragraph would address #3-#5 directly, but they also indirectly address #2 because while we're switching protocols we could easily move the padding with zero bytes to the server side, and I think we should.

#1 is a bit of a separate consideration. To tackle #1 along the lines proposed above, the client needs a way to send the recovery.conf contents to the server so that the server can inject them into the tar file. It's not exactly clear to me what the best way of permitting this is, and maybe there's a totally different approach that would be altogether better. One thing to consider is that we might well want the client to be able to send *multiple* chunks of data to the server at the start of a backup. For instance, suppose we want to support incremental backups. I think the right approach is for the client to send the backup_manifest file from the previous full backup to the server. What exactly the server does with it afterward depends on your preferred approach, but the necessary information is there. Maybe incremental backup is based on comparing cryptographic checksums, so the server looks at all the files and sends to the client those where the checksum (hopefully SHA-something!) does not match. I wouldn't favor this approach myself, but I know some people like it. Or maybe it's based on finding blocks modified since the LSN of the previous backup; the manifest has enough information for that to work, too. In such an approach, there can be altogether new files with old LSNs, because files can be flat-copied without changing block LSNs, so it's important to have the complete list of files from the previous backup, and that too is in the manifest. There are even timestamps for the bold among you.

Anyway, my point is to advocate for a design where the client says (1) I want a backup with these options and then (2) here are 0, 1, or >1 files (recovery parameters and/or backup manifest and/or other things) in support of that, and then the server hands back a stream of archives which the client may or may not choose to post-process.

It's tempting to think about solving this problem by appealing to CopyBoth, but I think that might be the wrong idea. The reason we use CopyBoth for the replication subprotocol is because there are periodic messages flowing in both directions that are only loosely coupled to each other. Apart from reading frequently enough to avoid a deadlock because both sides have full write buffers, each end of the connection can kind of do whatever it wants. But for the kinds of use cases I'm talking about here, that's not so. First the client talks and the server acknowledges, then the reverse. That doesn't mean we couldn't use CopyBoth, but maybe a CopyIn followed by a CopyOut would be more straightforward. I am not sure of the details here and am happy to hear the ideas of others.

One final thought is that the options framework for pg_basebackup is a little unfortunate. As of today, what the client receives, always, is a series of tar files. If you say -Fp, it doesn't change the backup format; it just extracts the tar files.
If you say -Ft, it doesn't. If you say -Ft but also -Z, it compresses the tar files. Thinking just about server-side compression and ignoring for the moment more remote features like alternate archive formats (e.g. zip) or things like storing the backup to an alternate location rather than returning it to the client, you probably want the client to be able to specify at least:

(1) server-side compression (perhaps with one of several algorithms) and the client just writes the results,
(2) server-side compression (still with a choice of algorithm) and the client decompresses but does not extract,
(3) server-side compression (still with a choice of algorithms) and the client decompresses and extracts,
(4) client-side compression (with a choice of algorithms), and
(5) client-side extraction.

You might also want (6) server-side compression (with a choice of algorithms) where the client decompresses and then re-compresses with a different algorithm (e.g. lz4 on the server to save bandwidth at moderate CPU cost into parallel bzip2 on the client for minimum archival storage). Or, as also discussed upthread, you might want (7) server-side compression of each file individually, so that you get a seekable archive of individually compressed files (e.g. to support fast delta restore).

I think that with these refactoring patches - and the wire protocol redesign mentioned above - it is very reasonable to offer as many of these options as we believe users will find useful, but it is not very clear how to extend the current command-line option framework to support them. It probably would have been better if pg_basebackup, instead of having -Fp and -Ft, just had an --extract option that you could either specify or omit, because that would not have presumed anything about the archive format, but the existing structure is well-baked at this point.

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
- v2-0001-Flexible-options-for-BASE_BACKUP-and-CREATE_REPLI.patch
- v2-0002-Recast-_tarWriteDirectory-as-convert_link_to_dire.patch
- v2-0003-Introduce-bbsink-abstraction.patch
- v2-0004-Convert-libpq-related-code-to-a-bbsink.patch
- v2-0005-Convert-throttling-related-code-to-a-bbsink.patch
- v2-0006-Convert-progress-reporting-code-to-a-bbsink.patch
- v2-0007-Introduce-bbarchiver-abstraction.patch
- v2-0008-Create-and-use-bbarchiver-implementations-for-tar.patch
- v2-0009-WIP-Convert-backup-manifest-generation-to-a-bbarc.patch
- v2-0010-WIP-Introduce-bbstreamer-abstration-and-adapt-pg_.patch
- v2-0011-POC-Embarrassingly-bad-server-side-compression-pa.patch
Hi,

On 2020-07-29 11:31:26 -0400, Robert Haas wrote:
> Here's an updated patch set. This is now rebased over master and
> includes as 0001 the patch I posted separately at
> http://postgr.es/m/CA+TgmobAczXDRO_Gr2euo_TxgzaH1JxbNxvFx=HYvBinefNH8Q@mail.gmail.com
> but drops some other patches that were committed meanwhile. 0002-0009
> of this series are basically the same as 0004-0011 from the previous
> series, except for rebasing and fixing a bug I discovered in what's
> now 0006. 0010 does a refactoring of pg_basebackup along similar
> lines to the server-side refactoring from patches earlier in the
> series.

Have you tested whether this still works against older servers? Or do you think we should not have that as a goal?

> 1. pg_basebackup -R injects recovery.conf (on older versions) or
> injects standby.signal and appends to postgresql.auto.conf (on newer
> versions) by parsing the tar file sent by the server and editing it on
> the fly. From the point of view of server-side compression, this is
> not ideal, because if you want to make these kinds of changes when
> server-side compression is in use, you'd have to decompress the stream
> on the client side in order to figure out where in the stream you
> ought to inject your changes. But having to do that is a major
> expense. If the client instead told the server what to change when
> generating the archive, and the server did it, this expense could be
> avoided. It would have the additional advantage that the backup
> manifest could reflect the effects of those changes; right now it
> doesn't, and pg_verifybackup just knows to expect differences in those
> files.

Hm. I don't think I terribly like the idea of things like -R having to be processed server side. That'll be awfully annoying to keep working across versions, for one. But perhaps the config file should just not be in the main tar file going forward?

I think we should eventually be able to use one archive for multiple purposes, e.g. to set up a standby as well as using it for a base backup. Or multiple standbys with different tablespace remappings.

> 2. According to the comments, some tar programs require two tar blocks
> (i.e. 512-byte blocks) of zero bytes at the end of an archive. The
> server does not generate these blocks of zero bytes, so it basically
> creates a tar file that works fine with my copy of tar but might break
> with somebody else's. Instead, the client appends 1024 zero bytes to
> the end of every file it receives from the server. That is an odd way
> of fixing this problem, and it makes things rather inflexible. If the
> server sends you any kind of a file OTHER THAN a tar file with the
> last 1024 zero bytes stripped off, then adding 1024 zero bytes will be
> the wrong thing to do. It would be better if the server just generated
> fully correct tar files (whatever we think that means) and the client
> wrote out exactly what it got from the server. Then, we could have the
> server generate cpio archives or zip files or gzip-compressed tar
> files or lz4-compressed tar files or anything we like, and the client
> wouldn't really need to care as long as it didn't need to extract
> those archives. That seems a lot cleaner.

Yea.

> 5. As things stand today, the client must know exactly how many
> archives it should expect to receive from the server and what each one
> is. It can do that, because it knows to expect one archive per
> tablespace, and the archive must be an uncompressed tarfile, so there
> is no ambiguity.
But, if the server could send archives to other > places, or send other kinds of archives to the client, then this would > become more complex. There is no intrinsic reason why the logic on the > client side can't simply be made more complicated in order to cope, > but it doesn't seem like great design, because then every time you > enhance the server, you've also got to enhance the client, and that > limits cross-version compatibility, and also seems more fragile. I > would rather that the server advertise the number of archives and the > names of each archive to the client explicitly, allowing the client to > be dumb unless it needs to post-process (e.g. extract) those archives. ISTM that that can help to some degree, but things like tablespace remapping etc IMO aren't best done server side, so I think the client will continue to need to know about the contents to a significant degree? > Putting all of the above together, what I propose - but have not yet > tried to implement - is a new COPY sub-protocol for taking base > backups. Instead of sending a COPY stream per archive, the server > would send a single COPY stream where the first byte of each message > is a type indicator, like we do with the replication sub-protocol > today. For example, if the first byte is 'a' that could indicate that > we're beginning a new archive and the rest of the message would > indicate the archive name and perhaps some flags or options. If the > first byte is 'p' that could indicate that we're sending archive > payload, perhaps with the first four bytes of the message being > progress, i.e. the number of newly-processed bytes on the server side > prior to any compression, and the remaining bytes being payload. On > receipt of such a message, the client would increment the progress > indicator by the value indicated in those first four bytes, and then > process the remaining bytes by writing them to a file or whatever > behavior the user selected via -Fp, -Ft, -Z, etc. Wonder if there's a way to get this to be less stateful. It seems a bit ugly that the client would know what the last 'a' was for a 'p'? Perhaps we could actually make 'a' include an identifier for each archive, and then 'p' would append to a specific archive? Which would then also allow for concurrent processing of those archives on the server side. I'd personally rather have a separate message type for progress and payload. Seems odd to have to send payload messages with 0 payload just because we want to update progress (in case of uploading to e.g. S3). And I think it'd be nice if we could have a more extensible progress measurement approach than a fixed length prefix. E.g. it might be nice to allow it to report both the overall progress, as well as a per archive progress. Or we might want to send progress when uploading to S3, even when not having pre-calculated the total size of the data directory. Greetings, Andres Freund
On Fri, Jul 31, 2020 at 12:49 PM Andres Freund <andres@anarazel.de> wrote: > Have you tested whether this still works against older servers? Or do > you think we should not have that as a goal? I haven't tested that recently but I intended to keep it working. I'll make sure to nail that down before I get to the point of committing anything, but I don't expect big problems. It's kind of annoying to have so much backward compatibility stuff here but I think ripping any of that out should wait for another time. > Hm. I don't think I terribly like the idea of things like -R having to > be processed server side. That'll be awfully annoying to keep working > across versions, for one. But perhaps the config file should just not be > in the main tar file going forward? That'd be a user-visible change, though, whereas what I'm proposing isn't. Instead of directly injecting stuff, the client can just send it to the server and have the server inject it, provided the server is new enough. Cross-version issues don't seem to be any worse than now. That being said, I don't love it, either. We could just suggest to people that using -R together with server compression is not a great combination. > I think we should eventually be able to use one archive for multiple > purposes, e.g. to set up a standby as well as using it for a base > backup. Or multiple standbys with different tablespace remappings. I don't think I understand your point here. > ISTM that that can help to some degree, but things like tablespace > remapping etc IMO aren't best done server side, so I think the client > will continue to need to know about the contents to a significant > degree? If I'm not mistaken, those mappings are only applied with -Fp i.e. if we're extracting. And it's no problem to jigger things in that case; we can only do this if we understand the archive in the first place. The problem is when you have to decompress and recompress to jigger things. > Wonder if there's a way to get this to be less stateful. It seems a bit > ugly that the client would know what the last 'a' was for a 'p'? Perhaps > we could actually make 'a' include an identifier for each archive, and > then 'p' would append to a specific archive? Which would then also > allow for concurrent processing of those archives on the server side. ...says the guy working on asynchronous I/O. I don't know, it's not a bad idea, but I think we'd have to change a LOT of code to make it actually do something useful. I feel like this could be added as a later extension of the protocol, rather than being something that we necessarily need to do now. > I'd personally rather have a separate message type for progress and > payload. Seems odd to have to send payload messages with 0 payload just > because we want to update progress (in case of uploading to > e.g. S3). And I think it'd be nice if we could have a more extensible > progress measurement approach than a fixed length prefix. E.g. it might > be nice to allow it to report both the overall progress, as well as a > per archive progress. Or we might want to send progress when uploading > to S3, even when not having pre-calculated the total size of the data > directory. I don't mind a separate message type here, but if you want merging of short messages with adjacent longer messages to generate a minimal number of system calls, that might have some implications for the other thread where we're talking about how to avoid extra memory copies when generating protocol messages.
If you don't mind them going out as separate network packets, then it doesn't matter. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
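To make the proposed sub-protocol a little more concrete, here is a minimal client-side dispatch sketch for the type-tagged COPY messages described above ('a' for a new archive, 'p' for payload with a 4-byte progress prefix). The handler functions, the byte order of the progress field, and the variable names are all assumptions made for illustration, not part of any patch.

/*
 * Illustrative dispatch of the type-tagged COPY messages sketched above:
 * 'a' announces a new archive (name follows), 'p' carries a 4-byte progress
 * delta followed by payload bytes.  Everything here is hypothetical.
 */
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static uint64_t total_done;		/* progress reported by the server so far */

static void
begin_new_archive(const char *name, size_t namelen)
{
	printf("starting archive: %.*s\n", (int) namelen, name);
}

static void
write_archive_payload(const char *data, size_t len)
{
	/* write to the current output file, or feed an extractor, etc. */
	(void) data;
	(void) len;
}

static void
process_copy_message(const char *buf, size_t len)
{
	switch (buf[0])
	{
		case 'a':				/* new archive; rest of message is its name */
			begin_new_archive(buf + 1, len - 1);
			break;
		case 'p':				/* payload, prefixed by a progress delta */
			{
				uint32_t	progress;

				memcpy(&progress, buf + 1, sizeof(progress));
				total_done += ntohl(progress);
				write_archive_payload(buf + 5, len - 5);
				break;
			}
		default:
			fprintf(stderr, "unexpected message type '%c'\n", buf[0]);
			exit(1);
	}
}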
> On Jul 29, 2020, at 8:31 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, May 8, 2020 at 4:55 PM Robert Haas <robertmhaas@gmail.com> wrote: >> So it might be good if I'd remembered to attach the patches. Let's try >> that again. > > Here's an updated patch set. Hi Robert, v2-0001 through v2-0009 still apply cleanly, but v2-0010 no longer applies. It seems to be conflicting with Heikki's work from August. Could you rebase please? — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Oct 21, 2020 at 12:14 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > v2-0001 through v2-0009 still apply cleanly, but v2-0010 no longer applies. It seems to be conflicting with Heikki's workfrom August. Could you rebase please? Here at last is a new version. I've dropped the "bbarchiver" patch for now, added a new patch that I'll talk about below, and revised the others. I'm pretty happy with the code now, so I guess the main things that I'd like feedback on are (1) whether design changes seem to be needed and (2) the UI. Once we have that stuff hammered out, I'll work on adding documentation, which is missing at present. The interesting patches in terms of functionality are 0006 and 0007; the rest is preparatory refactoring. 0006 adds a concept of base backup "targets," which means that it lets you send the base backup to someplace other than the client. You specify the target using a new "-t" option to pg_basebackup. By way of example, 0006 adds a "blackhole" target which throws the backup away instead of sending it anywhere, and also a "server" target which stores the backup to the server filesystem in lieu of streaming it to the client. So you can say something like "pg_basebackup -Xnone -Ft -t server:/backup/2021-07-08" and, provided that you're superuser, the server will try to drop the backup there. At present, you can't use -Fp or -Xfetch or -Xstream with a backup target, because that functionality is implemented on the client side. I think that's an acceptable restriction. Eventually I imagine we will want to have targets like "aws" or "s3" or maybe some kind of plug-in system for new targets. I haven't designed anything like that yet, but I think it's probably not all that hard to generalize what I've got. 0007 adds server-side compression; currently, it only supports server-side compression using gzip, but I hope that it won't be hard to generalize that to support LZ4 as well, and Andres told me he thinks we should aim to support zstd since that library has built-in parallel compression which is very appealing in this context. So you say something like "pg_basebackup -Ft --server-compression=gzip -D /backup/2021-07-08" or, if you want that compressed backup stored on the server and compressed as hard as possible, you could say "pg_basebackup -Xnone -Ft --server-compression=gzip9 -t server:/backup/2021-07-08". Unfortunately, here again there are a number of features that are implemented on the client side, and they don't work in combination with this. -Fp could be made to work by teaching the client to decompress; I just haven't written the code to do that. It's probably not very useful in general, but maybe there's a use case if you're really tight on network bandwidth. Making -R work looks outright useless, because the client would have to get the whole compressed tarfile from the server and then uncompress it, edit the tar file, and recompress. That seems like a thing no one can possibly want. Also, if you say pg_basebackup -Ft -D- >whatever.tar, the server injects the backup manifest into the tarfile, which if you used --server-compression would require decompressing and recompressing the whole thing, so it doesn't seem worth supporting. It's more likely to be a footgun than to help anybody. This option can be used with -Xstream or -Xfetch, but it doesn't compress pg_wal.tar, because that's generated on the client side. The thing I'm really unhappy with here is the -F option to pg_basebackup, which presently allows only p for plain or t for tar. 
For purposes of these patches, I've essentially treated this as if -Fp means "I want the tar files the server sends to be extracted" and "-Ft" as if it means "I'm happy with them the way they are." Under that interpretation, it's fine for --server-compression to cause e.g. base.tar.gz to be written, because that's what the server sent. But it's not really a "tar" output format; it's a "tar.gz" output format. However, it doesn't seem to make any sense to define -Fz to mean "i want tar.gz output" because -Z or -z already produces tar.gz output when used with -Ft, and also because it would be redundant to make people specify both -Fz and --server-compression. Similarly, when you use --target, the output format is arguably, well, nothing. I mean, some tar files got stored to the target, but you don't have them, but again it seems redundant to have people specify --target and then also have to change the argument to -F. Hindsight being 20-20, I think we would have been better off not having a -Ft or -Fp option at all, and having an --extract option that says you want to extract what the server sends you, but it's probably too late to make that change now. Or maybe it isn't, and we should just break command-line argument compatibility for v15. I don't know. Opinions appreciated, especially if they are nuanced. If you're curious about what the other patches in the series do, here's a very fast recap; see commit messages for more. 0001 revises the grammar for some replication commands to use an extensible-options syntax. 0002 is a trivial refactoring of basebackup.c. 0003 and 0004 refactor the server's basebackup.c and the client's pg_basebackup.c, respectively, by introducing abstractions called bbsink and bbstreamer. 0005 introduces a new COPY sub-protocol for taking base backups. I think it's worth mentioning that I believe that this refactoring is quite powerful and could let us do a bunch of other things that this patch set doesn't attempt. For instance, since this makes it pretty easy to implement server-side compression, it could probably also pretty easily be made to do server-side encryption, if you're brave enough to want to have a discussion on pgsql-hackers about how to design an encryption feature. Thanks to my colleague Tushar Ahuja for helping test some of this code. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
- v3-0001-Flexible-options-for-BASE_BACKUP-and-CREATE_REPLI.patch
- v3-0006-Support-base-backup-targets.patch
- v3-0005-Modify-pg_basebackup-to-use-a-new-COPY-subprotoco.patch
- v3-0007-WIP-Server-side-gzip-compression.patch
- v3-0004-Introduce-bbstreamer-abstraction-to-modularize-pg.patch
- v3-0002-Refactor-basebackup.c-s-_tarWriteDir-function.patch
- v3-0003-Introduce-bbsink-abstraction-to-modularize-base-b.patch
On 7/8/21 9:26 PM, Robert Haas wrote: > Here at last is a new version. Please refer to this scenario, where a backup target used with --server-compression causes the server to close the connection unexpectedly if we don't provide the --no-manifest option: [tushar@localhost bin]$ ./pg_basebackup --server-compression=gzip4 -t server:/tmp/data_1 -Xnone NOTICE: WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backup pg_basebackup: error: could not read COPY data: server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. If we try the same scenario with -Ft instead, it works: [tushar@localhost bin]$ ./pg_basebackup --server-compression=gzip4 -Ft -D data_0 -Xnone NOTICE: WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backup [tushar@localhost bin]$ -- regards,tushar EnterpriseDB https://www.enterprisedb.com/ The Enterprise PostgreSQL Company
On Mon, Jul 12, 2021 at 5:51 PM tushar <tushar.ahuja@enterprisedb.com> wrote: > > On 7/8/21 9:26 PM, Robert Haas wrote: > > Here at last is a new version. > Please refer this scenario ,where backup target using > --server-compression is closing the server > unexpectedly if we don't provide -no-manifest option > > [tushar@localhost bin]$ ./pg_basebackup --server-compression=gzip4 -t > server:/tmp/data_1 -Xnone > NOTICE: WAL archiving is not enabled; you must ensure that all required > WAL segments are copied through other means to complete the backup > pg_basebackup: error: could not read COPY data: server closed the > connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > I think the problem is that bbsink_gzip_end_archive() is not forwarding the end request to the next bbsink. The attached patch should fix it. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
If I try to perform pg_basebackup using the "-t server" option against localhost vs. a remote machine,
I can see a difference in backup size.
The data directory size is
[edb@centos7tushar bin]$ du -sch data/
578M data/
578M total
-h=localhost
[edb@centos7tushar bin]$ ./pg_basebackup -t server:/tmp/all_data2 -h localhost -Xnone --no-manifest -P -v
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
NOTICE: all required WAL segments have been archived
329595/329595 kB (100%), 1/1 tablespace
pg_basebackup: base backup completed
[edb@centos7tushar bin]$ du -sch /tmp/all_data2
322M /tmp/all_data2
322M total
[edb@centos7tushar bin]$
-h=remote
[edb@centos7tushar bin]$ ./pg_basebackup -t server:/tmp/all_data2 -h <remote IP> -Xnone --no-manifest -P -v
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
NOTICE: all required WAL segments have been archived
170437/170437 kB (100%), 1/1 tablespace
pg_basebackup: base backup completed
[edb@0 bin]$ du -sch /tmp/all_data2
167M /tmp/all_data2
167M total
[edb@0 bin]$
-- regards,tushar EnterpriseDB https://www.enterprisedb.com/ The Enterprise PostgreSQL Company
On Fri, Jul 16, 2021 at 12:43 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Jul 12, 2021 at 5:51 PM tushar <tushar.ahuja@enterprisedb.com> wrote: > > > > On 7/8/21 9:26 PM, Robert Haas wrote: > > > Here at last is a new version. > > Please refer this scenario ,where backup target using > > --server-compression is closing the server > > unexpectedly if we don't provide -no-manifest option > > > > [tushar@localhost bin]$ ./pg_basebackup --server-compression=gzip4 -t > > server:/tmp/data_1 -Xnone > > NOTICE: WAL archiving is not enabled; you must ensure that all required > > WAL segments are copied through other means to complete the backup > > pg_basebackup: error: could not read COPY data: server closed the > > connection unexpectedly > > This probably means the server terminated abnormally > > before or while processing the request. > > > > I think the problem is that bbsink_gzip_end_archive() is not > forwarding the end request to the next bbsink. The attached patch so > fix it. I was going through the patch; I think the refactoring has made the base backup code really clean and readable. I have a few minor suggestions. v3-0003 1. + Assert(sink->bbs_next != NULL); + bbsink_begin_archive(sink->bbs_next, gz_archive_name); I have noticed that the interface for forwarding the request to the next bbsink is not uniform. For example, bbsink_gzip_begin_archive() is calling bbsink_begin_archive(sink->bbs_next, gz_archive_name); to forward the request to the next bbsink, whereas bbsink_progress_begin_backup() is calling bbsink_forward_begin_backup(sink); I think it will be good if we keep the usage uniform. 2. I have noticed that bbsink_copytblspc_* are not forwarding the request to the next sink; that's probably because we assume this should always be the last sink. I agree that it's true for this patch, but the commit message of the patch says that in future this might change, so wouldn't it be good to keep the interface generic? I mean, bbsink_copytblspc_new() should take the next sink as an input and the caller can pass it as NULL. And the other APIs can also try to forward the request if next is not NULL? 3. It would make more sense to order the functions in basebackup_progress.c the same way as in the other files, i.e. bbsink_progress_begin_backup, bbsink_progress_archive_contents and then bbsink_progress_end_archive, and this will also be in sync with the function pointer declarations in bbsink_ops. v3-0005 4. + * + * 'copystream' sends a starts a single COPY OUT operation and transmits + * all the archives and the manifest if present during the course of that typo 'copystream' sends a starts a single COPY OUT --> 'copystream' sends a single COPY OUT -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
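To make the contrast in point 1 a bit more concrete, the two call styles being compared look roughly like this (condensed and partly guessed from the snippets above; the function bodies here are illustrative, not the patch's exact code):

/*
 * A pass-through sink can use the generic forwarding helper...
 */
static void
bbsink_progress_begin_backup(bbsink *sink)
{
	/* ...do the progress-reporting bookkeeping here... */

	bbsink_forward_begin_backup(sink);
}

/*
 * ...whereas a sink that changes the arguments, like gzip renaming
 * "base.tar" to "base.tar.gz", calls the next sink directly.
 */
static void
bbsink_gzip_begin_archive(bbsink *sink, const char *archive_name)
{
	char	   *gz_archive_name = psprintf("%s.gz", archive_name);

	Assert(sink->bbs_next != NULL);
	bbsink_begin_archive(sink->bbs_next, gz_archive_name);
}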
On 7/16/21 12:43 PM, Dilip Kumar wrote: > I think the problem is that bbsink_gzip_end_archive() is not > forwarding the end request to the next bbsink. The attached patch so > fix it. Thanks Dilip. Reported issue seems to be fixed now with your patch [edb@centos7tushar bin]$ ./pg_basebackup --server-compression=gzip4 -t server:/tmp/data_2 -v -Xnone -R pg_basebackup: initiating base backup, waiting for checkpoint to complete pg_basebackup: checkpoint completed NOTICE: all required WAL segments have been archived pg_basebackup: base backup completed [edb@centos7tushar bin]$ OR [edb@centos7tushar bin]$ ./pg_basebackup -t server:/tmp/pv1 -Xnone --server-compression=gzip4 -r 1024 -P NOTICE: all required WAL segments have been archived 23133/23133 kB (100%), 1/1 tablespace [edb@centos7tushar bin]$ Please refer this scenario ,where -R option is working with '-t server' but not with -Ft --not working [edb@centos7tushar bin]$ ./pg_basebackup --server-compression=gzip4 -Ft -D ccv -Xnone -R --no-manifest pg_basebackup: error: unable to parse archive: base.tar.gz pg_basebackup: only tar archives can be parsed pg_basebackup: the -R option requires pg_basebackup to parse the archive pg_basebackup: removing data directory "ccv" --working [edb@centos7tushar bin]$ ./pg_basebackup --server-compression=gzip4 -t server:/tmp/ccv -Xnone -R --no-manifest NOTICE: all required WAL segments have been archived [edb@centos7tushar bin]$ -- regards,tushar EnterpriseDB https://www.enterprisedb.com/ The Enterprise PostgreSQL Company
On Mon, Jul 19, 2021 at 6:02 PM tushar <tushar.ahuja@enterprisedb.com> wrote: > > On 7/16/21 12:43 PM, Dilip Kumar wrote: > > I think the problem is that bbsink_gzip_end_archive() is not > > forwarding the end request to the next bbsink. The attached patch so > > fix it. > > Thanks Dilip. Reported issue seems to be fixed now with your patch Thanks for the confirmation. > Please refer this scenario ,where -R option is working with '-t server' > but not with -Ft > > --not working > > [edb@centos7tushar bin]$ ./pg_basebackup --server-compression=gzip4 > -Ft -D ccv -Xnone -R --no-manifest > pg_basebackup: error: unable to parse archive: base.tar.gz > pg_basebackup: only tar archives can be parsed > pg_basebackup: the -R option requires pg_basebackup to parse the archive > pg_basebackup: removing data directory "ccv" As per the error message and the code, if we are giving -R then we need to inject the recovery-conf file, and that is only supported with the tar format; since you are enabling server compression, the output is no longer a plain .tar, so it is giving an error. > --working > > [edb@centos7tushar bin]$ ./pg_basebackup --server-compression=gzip4 -t > server:/tmp/ccv -Xnone -R --no-manifest > NOTICE: all required WAL segments have been archived > [edb@centos7tushar bin]$ I am not sure why this case is working; from the code I could not tell whether, when the backup target is the server, we are doing anything with the -R option or just silently ignoring it. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
> On Jul 8, 2021, at 8:56 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > The interesting > patches in terms of functionality are 0006 and 0007; The difficulty in v3-0007 with pg_basebackup only knowing how to parse tar archives seems to be a natural consequence of not sufficiently abstracting out the handling of the tar format. If the bbsink and bbstreamer abstractions fully encapsulated a set of parsing callbacks, then pg_basebackup wouldn't contain things like: streamer = bbstreamer_tar_parser_new(streamer); but instead would use the parser callbacks without knowledge of whether they were parsing tar vs. cpio vs. whatever. It just seems really odd that pg_basebackup is using the extensible abstraction layer and then defeating the purpose by knowing too much about the format. It might even be a useful exercise to write cpio support into this patch set rather than waiting until v16, just to make sure the abstraction layer doesn't have tar-specific assumptions left over. printf(_(" -F, --format=p|t output format (plain (default), tar)\n")); printf(_(" -z, --gzip compress tar output\n")); printf(_(" -Z, --compress=0-9 compress tar output with given compression level\n")); This is the pre-existing --help output, not changed by your patch, but if you anticipate that other output formats will be supported in future releases, perhaps it's better not to write the --help output in such a way as to imply that -z and -Z are somehow connected with the choice of tar format? Would changing the --help now make for less confusion later? I'm just asking... The new options to pg_basebackup should have test coverage in src/bin/pg_basebackup/t/010_pg_basebackup.pl, though I expect you are waiting to hammer out the interface before writing the tests. > the rest is > preparatory refactoring. patch v3-0001: The new function AppendPlainCommandOption writes too many spaces, which does no harm, but seems silly, resulting in lines like: LOG: received replication command: BASE_BACKUP ( LABEL 'pg_basebackup base backup', PROGRESS, WAIT 0, MANIFEST 'yes') patch v3-0003: The introduction of the sink abstraction seems incomplete, as basebackup.c still has knowledge of things like tar headers. Calls like _tarWriteHeader(sink, ...) feel like an abstraction violation. I expected perhaps this would get addressed in later patches, but it doesn't. + * 'bbs_buffer' is the buffer into which data destined for the bbsink + * should be stored. It must be a multiple of BLCKSZ. + * + * 'bbs_buffer_length' is the allocated length of the buffer. The length must be a multiple of BLCKSZ, not the pointer. patch v3-0005: + * 'copystream' sends a starts a single COPY OUT operation and transmits too many verbs. + * Regardless of which method is used, we sent a result set with "is used" vs. "sent" verb tense mismatch. + * So we only check it after the number of bytes sine the last check reaches typo. s/sine/since/ - * (2) we need to inject backup_manifest or recovery configuration into it. + * (2) we need to inject backup_manifest or recovery configuration into + * it. src/bin/pg_basebackup/pg_basebackup.c contains word wrap changes like the above which would better be left to a different commit, if done at all. + if (state.manifest_file !=NULL) Need a space after != — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jul 19, 2021 at 2:51 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > The difficulty in v3-0007 with pg_basebackup only knowing how to parse tar archives seems to be a natural consequence ofnot sufficiently abstracting out the handling of the tar format. If the bbsink and bbstreamer abstractions fully encapsulateda set of parsing callbacks, then pg_basebackup wouldn't contain things like: > > streamer = bbstreamer_tar_parser_new(streamer); > > but instead would use the parser callbacks without knowledge of whether they were parsing tar vs. cpio vs. whatever. Itjust seems really odd that pg_basebackup is using the extensible abstraction layer and then defeating the purpose by knowingtoo much about the format. It might even be a useful exercise to write cpio support into this patch set rather thanwaiting until v16, just to make sure the abstraction layer doesn't have tar-specific assumptions left over. Well, I had a patch in an earlier patch set that tried to get knowledge of tar out of basebackup.c, but it couldn't use the bbsink abstraction; it needed a whole separate abstraction layer which I had called bbarchiver with a different API. So I dropped it, for fear of being told, not without some justification, that I was just changing things for the sake of changing them, and also because having exactly one implementation of some interface is really not great. I do conceptually like the idea of making the whole thing flexible enough to generate cpio or zip archives, because like you I think that having tar-specific stuff all over the place is grotty, but I have a feeling there's little market demand for having pg_basebackup produce cpio, pax, zip, iso, etc. archives. On the other hand, server-side compression and server-side backup seem like functionality with real utility. Still, if you or others want to vote for resurrecting bbarchiver on the grounds that general code cleanup is worthwhile for its own sake, I'm OK with that, too. I don't really understand what your problem is with how the patch set leaves pg_basebackup. On the server side, because I dropped the bbarchiver stuff, basebackup.c still ends up knowing a bunch of stuff about tar. pg_basebackup.c, however, really doesn't know anything much about tar any more. It knows that if it's getting a tar file and needs to parse a tar file then it had better call the tar parsing code, but that seems difficult to avoid. What we can avoid, and I think the patch set does, is pg_basebackup.c having any real knowledge of what the tar parser is doing under the hood. Thanks also for the detailed comments. I'll try to the right number of verbs in each sentence in the next version of the patch. I will also look into the issues mentioned by Dilip and Tushar. -- Robert Haas EDB: http://www.enterprisedb.com
> On Jul 20, 2021, at 11:57 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > I don't really understand what your problem is with how the patch set > leaves pg_basebackup. I don't have a problem with how the patch set leaves pg_basebackup. > On the server side, because I dropped the > bbarchiver stuff, basebackup.c still ends up knowing a bunch of stuff > about tar. pg_basebackup.c, however, really doesn't know anything much > about tar any more. It knows that if it's getting a tar file and needs > to parse a tar file then it had better call the tar parsing code, but > that seems difficult to avoid. I was only imagining having a callback for injecting manifests or recovery configurations. It is not necessary that this be done in the current patch set, or perhaps ever. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jul 20, 2021 at 4:03 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > I was only imagining having a callback for injecting manifests or recovery configurations. It is not necessary that thisbe done in the current patch set, or perhaps ever. A callback where? I actually think the ideal scenario would be if the server always did all the work and the client wasn't involved in editing the tarfile, but it's not super-easy to get there from here. We could add an option to tell the server whether to inject the manifest into the archive, which probably wouldn't be too bad. For it to inject the recovery configuration, we'd have to send that configuration to the server somehow. I thought about using COPY BOTH mode instead of COPY OUT mode to allow for stuff like that, but it seems pretty complicated, and I wasn't really sure that we'd get consensus that it was better even if I went to the trouble of coding it up. If we don't do that and stick with the current system where it's handled on the client side, then I agree that we want to separate the tar-specific concerns from the injection-type concerns, which the patch does by making those operations different kinds of bbstreamer that know only a relatively limited amount about what each other are doing. You get [server] => [tar parser] => [recovery injector] => [tar archiver], where the [recovery injector] step nukes the archive file headers for the files it adds or modifies, and the [tar archiver] step fixes them up again. So the only thing that the [recovery injector] piece needs to know is that if it makes any changes to a file, it should send that file to the next step with a 0-length archive header, and all the [tar archiver] piece needs to know is that already-valid headers can be left alone and 0-length ones need to be regenerated. There may be a better scheme; I don't think this is perfectly elegant. I do think it's better than what we've got now. -- Robert Haas EDB: http://www.enterprisedb.com
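If it helps to visualize the [server] => [tar parser] => [recovery injector] => [tar archiver] chain described above, the client-side construction might look roughly like the following sketch. Apart from bbstreamer_tar_parser_new(), which is mentioned upthread, the constructor names and signatures here are illustrative; the patch set may spell them differently.

/*
 * Rough sketch of assembling the pipeline in pg_basebackup.  The chain is
 * built from the final consumer backwards; data received from the server
 * then flows tar parser -> recovery injector -> tar archiver -> writer.
 */
bbstreamer *streamer;

/* final consumer: write the finished tar data to a file */
streamer = bbstreamer_plain_writer_new(archive_filename, archive_file);

/* regenerate any tar headers that the injector zeroed out */
streamer = bbstreamer_tar_archiver_new(streamer);

/* add or replace recovery files, forwarding them with zero-length headers */
streamer = bbstreamer_recovery_injector_new(streamer, recoveryconfcontents);

/* split the raw COPY data from the server into individual tar members */
streamer = bbstreamer_tar_parser_new(streamer);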
> On Jul 21, 2021, at 8:09 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > A callback where? If you were going to support lots of formats, not just tar, you might want the streamer class for each format to have a callback which sets up the injector, rather than having CreateBackupStreamer do it directly. Even then, having now studied CreateBackupStreamer a bit more, the idea seems less appealing than it did initially. I don't think it makes things any cleaner when only supporting tar, and maybe not even when supporting multiple formats, so I'll withdraw the suggestion. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jul 21, 2021 at 12:11 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > If you were going to support lots of formats, not just tar, you might want the streamer class for each format to have a callback which sets up the injector, rather than having CreateBackupStreamer do it directly. Even then, having now studied CreateBackupStreamer a bit more, the idea seems less appealing than it did initially. I don't think it makes things any cleaner when only supporting tar, and maybe not even when supporting multiple formats, so I'll withdraw the suggestion. Gotcha. I think if we had a lot of formats I'd probably make a separate function where you passed in the file extension and archive type and it hands you back a parser for the appropriate kind of archive, or something like that. And then maybe a second, similar function where you pass in the injector and archive type and it wraps an archiver of the right type around it and hands that back. But I don't think that's worth doing until we have 2 or 3 formats, which may or may not happen any time in the foreseeable future. -- Robert Haas EDB: http://www.enterprisedb.com
On 7/19/21 8:29 PM, Dilip Kumar wrote: > I am not sure why this is working, from the code I could not find if > the backup target is server then are we doing anything with the -R > option or we are just silently ignoring it OK, in an another scenario I can see , "-t server" working with "--server-compression" option but not with -z or -Z ? "-t server" with option "-z" / or (-Z ) [tushar@localhost bin]$ ./pg_basebackup -t server:/tmp/dataN -Xnone -z --no-manifest -p 9033 pg_basebackup: error: only tar mode backups can be compressed Try "pg_basebackup --help" for more information. tushar@localhost bin]$ ./pg_basebackup -t server:/tmp/dataNa -Z 1 -Xnone --server-compression=gzip4 --no-manifest -p 9033 pg_basebackup: error: only tar mode backups can be compressed Try "pg_basebackup --help" for more information. "-t server" with "server-compression" (working) [tushar@localhost bin]$ ./pg_basebackup -t server:/tmp/dataN -Xnone --server-compression=gzip4 --no-manifest -p 9033 NOTICE: WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backup [tushar@localhost bin]$ -- regards,tushar EnterpriseDB https://www.enterprisedb.com/ The Enterprise PostgreSQL Company
On Thu, Jul 22, 2021 at 1:14 PM tushar <tushar.ahuja@enterprisedb.com> wrote: > On 7/19/21 8:29 PM, Dilip Kumar wrote: > > I am not sure why this is working, from the code I could not find if > > the backup target is server then are we doing anything with the -R > > option or we are just silently ignoring it > > OK, in an another scenario I can see , "-t server" working with > "--server-compression" option but not with -z or -Z ? Right. The error messages or documentation might need some work, but it's expected that you won't be able to do client-side compression if the backup is being sent someplace other than to the client. -- Robert Haas EDB: http://www.enterprisedb.com
0007 adds server-side compression; currently, it only supports
server-side compression using gzip, but I hope that it won't be hard
to generalize that to support LZ4 as well, and Andres told me he
thinks we should aim to support zstd since that library has built-in
parallel compression which is very appealing in this context.
So, I gave a try to LZ4 streaming API for server-side compression.
LZ4 APIs are documented here[1].
With the attached WIP patch, I am now able to take the backup using the lz4
compression. The attached patch is basically applicable on top of Robert's V3
patch-set[2].
I could take the backup using the command:
pg_basebackup -t server:/tmp/data_lz4 -Xnone --server-compression=lz4
Further, when I restored the backup `/tmp/data_lz4` and started the server, I
could see the tables I created, along with the data inserted on the original
server.
When I tried to look into the binary difference between the original data
directory and the backup `data_lz4` directory here is how it looked:
$ diff -qr data/ /tmp/data_lz4
Only in /tmp/data_lz4: backup_label
Only in /tmp/data_lz4: backup_manifest
Only in data/base: pgsql_tmp
Only in /tmp/data_lz4: base.tar
Only in /tmp/data_lz4: base.tar.lz4
Files data/global/pg_control and /tmp/data_lz4/global/pg_control differ
Files data/logfile and /tmp/data_lz4/logfile differ
Only in data/pg_stat: db_0.stat
Only in data/pg_stat: global.stat
Only in data/pg_subtrans: 0000
Only in data/pg_wal: 000000010000000000000099.00000028.backup
Only in data/pg_wal: 00000001000000000000009A
Only in data/pg_wal: 00000001000000000000009B
Only in data/pg_wal: 00000001000000000000009C
Only in data/pg_wal: 00000001000000000000009D
Only in data/pg_wal: 00000001000000000000009E
Only in data/pg_wal/archive_status: 000000010000000000000099.00000028.backup.done
Only in data/: postmaster.opts
For now, what concerns me here is the following `LZ4F_compressUpdate()` API,
which is the one doing the core work of streaming compression:
size_t LZ4F_compressUpdate(LZ4F_cctx* cctx,
void* dstBuffer, size_t dstCapacity,
const void* srcBuffer, size_t srcSize,
const LZ4F_compressOptions_t* cOptPtr);
where, `dstCapacity`, is basically provided by the earlier call to
`LZ4F_compressBound()` which provides minimum `dstCapacity` required to
guarantee success of `LZ4F_compressUpdate()`, given a `srcSize` and
`preferences`, for a worst-case scenario. `LZ4F_compressBound()` is:
size_t LZ4F_compressBound(size_t srcSize, const LZ4F_preferences_t* prefsPtr);
Now, the hard lesson here is that the `dstCapacity` returned by
`LZ4F_compressBound()` even for a single byte, i.e. 1 as `srcSize`, is about
~256K (it seems to have something to do with the blockSize in the lz4 frame that we
chose; the minimum we can have is 64K), though the actual length of the compressed
data produced by `LZ4F_compressUpdate()` is much smaller. Meanwhile, the destination
buffer length for us i.e. `mysink->base.bbs_next->bbs_buffer_length` is only
32K. In the function call `LZ4F_compressUpdate()`, if I directly try to provide
this `mysink->base.bbs_next->bbs_buffer + bytes_written` as `dstBuffer` and
the value returned by `LZ4F_compressBound()` as the `dstCapacity`, that
seems quite incorrect to me, since the actual output buffer length remaining
is much less than what is calculated for the worst case by `LZ4F_compressBound()`.
For now, I am creating a temporary buffer of the required size, passing it
for compression, asserting that the actual compressed bytes are less than
whatever length we have available, and then copying the result to our output buffer.
To give an example, I put some logging statements, and I can see in the log:
"
bytes remaining in mysink->base.bbs_next->bbs_buffer: 16537
input size to be compressed: 512
estimated size for compressed buffer by LZ4F_compressBound(): 262667
actual compressed size: 16
"
Will really appreciate any inputs, comments, suggestions here.
Regards,
[1] https://fossies.org/linux/lz4/doc/lz4frame_manual.html
[2] https://www.postgresql.org/message-id/CA+TgmoYgVN=-Yoh71r3P9N7eKysd7_9b9s+1QFfFcs3w7Z-tig@mail.gmail.com
Attachment
On Wed, Sep 8, 2021 at 2:14 PM Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> wrote: > To give an example, I put some logging statements, and I can see in the log: > " > bytes remaining in mysink->base.bbs_next->bbs_buffer: 16537 > input size to be compressed: 512 > estimated size for compressed buffer by LZ4F_compressBound(): 262667 > actual compressed size: 16 > " That is pretty lame. I don't know why it needs a ~256k buffer to produce 16 bytes of output. The way the gzip APIs I used work, you tell it how big the output buffer is and it writes until it fills that buffer, or until the input buffer is empty, whichever happens first. But this seems to be the other way around: you tell it how much input you have, and it tells you how big a buffer it needs. To handle that elegantly, I think I need to make some changes to the design of the bbsink stuff. What I'm thinking is that each bbsink somehow tells the next bbsink how big to make the buffer. So if the LZ4 buffer is told that its buffer should be at least, I don't know, say 64kB. Then it can compute how large an output buffer the LZ4 library requires for 64kB. Hopefully we can assume that liblz4 never needs a smaller buffer for a larger input. Then we can assume that if a 64kB input requires, say, a 300kB output buffer, every possible input < 64kB also requires an output buffer <= 300 kB. But we can't just say, well, we were asked to create a 64kB buffer (or whatever) so let's ask the next bbsink for a 300kB buffer (or whatever), because then as soon as we write any data at all into it the remaining buffer space might be insufficient for the next chunk. So instead what I think we should do is have bbsink_lz4 set the size of the next sink's buffer to its own buffer size + LZ4F_compressBound(its own buffer size). So in this example if it's asked to create a 64kB buffer and LZ4F_compressBound(64kB) = 300kB then it asks the next sink to set the buffer size to 364kB. Now, that means that there will always be at least 300 kB available in the output buffer until we've accumulated a minimum of 64 kB of compressed data, and then at that point we can flush. I think this would be relatively clean and would avoid the need for the double copying that the current design forced you to do. What do you think? + /* + * If we do not have enough space left in the output buffer for this + * chunk to be written, first archive the already written contents. + */ + if (nextChunkLen > mysink->base.bbs_next->bbs_buffer_length - mysink->bytes_written || + mysink->bytes_written >= mysink->base.bbs_next->bbs_buffer_length) + { + bbsink_archive_contents(sink->bbs_next, mysink->bytes_written); + mysink->bytes_written = 0; + } I think this is flat-out wrong. It assumes that the compressor will never generate more than N bytes of output given N bytes of input, which is not true. Not sure there's much point in fixing it now because with the changes described above this code will have to change anyway, but I think it's just lucky that this has worked for you in your testing. + /* + * LZ4F_compressUpdate() returns the number of bytes written into output + * buffer. We need to keep track of how many bytes have been cumulatively + * written into the output buffer(bytes_written). But, + * LZ4F_compressUpdate() returns 0 in case the data is buffered and not + * written to output buffer, set autoFlush to 1 to force the writing to the + * output buffer. + */ + prefs->autoFlush = 1; I don't see why this should be necessary. 
Elsewhere you have code that caters to bytes being stuck inside LZ4's buffer, so why do we also require this? Thanks for researching this! -- Robert Haas EDB: http://www.enterprisedb.com
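A sketch of how that buffer-size handshake might look in the lz4 sink's begin_backup callback, under the assumption that begin_backup carries the requested buffer size down the chain; the struct fields and the exact bbsink_begin_backup() signature are illustrative, not settled:

/*
 * Proposed handshake: the lz4 sink asks the next sink for a buffer large
 * enough that a worst-case compressed chunk always fits in the space left
 * before a flush becomes necessary.
 */
static void
bbsink_lz4_begin_backup(bbsink *sink)
{
	bbsink_lz4 *mysink = (bbsink_lz4 *) sink;
	size_t		output_bound;

	/* worst-case output size for one full input buffer */
	output_bound = LZ4F_compressBound(sink->bbs_buffer_length, &mysink->prefs);

	/*
	 * Request our own buffer size plus that bound: until a full input
	 * buffer's worth of compressed data has accumulated, at least
	 * output_bound bytes of space remain, so LZ4F_compressUpdate() can
	 * never overflow the next sink's buffer.
	 */
	bbsink_begin_backup(sink->bbs_next,
						sink->bbs_buffer_length + output_bound);
}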
On Wed, Sep 8, 2021 at 3:39 PM Robert Haas <robertmhaas@gmail.com> wrote: > The way the gzip APIs I used work, you tell it how big the output > buffer is and it writes until it fills that buffer, or until the input > buffer is empty, whichever happens first. But this seems to be the > other way around: you tell it how much input you have, and it tells > you how big a buffer it needs. To handle that elegantly, I think I > need to make some changes to the design of the bbsink stuff. What I'm > thinking is that each bbsink somehow tells the next bbsink how big to > make the buffer. Here's a new patch set with that design change (and a bug fix for 0001). -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
- v4-0007-WIP-Server-side-gzip-compression.patch
- v4-0004-Introduce-bbstreamer-abstraction-to-modularize-pg.patch
- v4-0006-Support-base-backup-targets.patch
- v4-0005-Modify-pg_basebackup-to-use-a-new-COPY-subprotoco.patch
- v4-0001-Flexible-options-for-BASE_BACKUP-and-CREATE_REPLI.patch
- v4-0002-Refactor-basebackup.c-s-_tarWriteDir-function.patch
- v4-0003-Introduce-bbsink-abstraction-to-modularize-base-b.patch
On Wed, Sep 8, 2021 at 2:14 PM Jeevan Ladhe
<jeevan.ladhe@enterprisedb.com> wrote:
> To give an example, I put some logging statements, and I can see in the log:
> "
> bytes remaining in mysink->base.bbs_next->bbs_buffer: 16537
> input size to be compressed: 512
> estimated size for compressed buffer by LZ4F_compressBound(): 262667
> actual compressed size: 16
> "
That is pretty lame. I don't know why it needs a ~256k buffer to
produce 16 bytes of output.
This seems to be because of the lz4 frame blocksize. Currently, I have chosen it as 256kB, which is 262144 bytes,
and here LZ4F_compressBound() has returned 262667 for worst-case
accommodation of 512 bytes, i.e. 262144 (256kB) + 512 + I guess some
book-keeping bytes. If I choose a blocksize of 64K instead, then this bound comes down accordingly, to roughly 64K plus similar overhead.
The way the gzip APIs I used work, you tell it how big the output
buffer is and it writes until it fills that buffer, or until the input
buffer is empty, whichever happens first. But this seems to be the
other way around: you tell it how much input you have, and it tells
you how big a buffer it needs. To handle that elegantly, I think I
need to make some changes to the design of the bbsink stuff. What I'm
thinking is that each bbsink somehow tells the next bbsink how big to
make the buffer. So if the LZ4 buffer is told that its buffer should
be at least, I don't know, say 64kB. Then it can compute how large an
output buffer the LZ4 library requires for 64kB. Hopefully we can
assume that liblz4 never needs a smaller buffer for a larger input.
Then we can assume that if a 64kB input requires, say, a 300kB output
buffer, every possible input < 64kB also requires an output buffer <=
300 kB.
But we can't just say, well, we were asked to create a 64kB buffer (or
whatever) so let's ask the next bbsink for a 300kB buffer (or
whatever), because then as soon as we write any data at all into it
the remaining buffer space might be insufficient for the next chunk.
So instead what I think we should do is have bbsink_lz4 set the size
of the next sink's buffer to its own buffer size +
LZ4F_compressBound(its own buffer size). So in this example if it's
asked to create a 64kB buffer and LZ4F_compressBound(64kB) = 300kB
then it asks the next sink to set the buffer size to 364kB. Now, that
means that there will always be at least 300 kB available in the
output buffer until we've accumulated a minimum of 64 kB of compressed
data, and then at that point we can flush.
I think this would be relatively clean and would avoid the need for
the double copying that the current design forced you to do. What do
you think?
+ /*
+ * If we do not have enough space left in the output buffer for this
+ * chunk to be written, first archive the already written contents.
+ */
+ if (nextChunkLen > mysink->base.bbs_next->bbs_buffer_length -
mysink->bytes_written ||
+ mysink->bytes_written >= mysink->base.bbs_next->bbs_buffer_length)
+ {
+ bbsink_archive_contents(sink->bbs_next, mysink->bytes_written);
+ mysink->bytes_written = 0;
+ }
I think this is flat-out wrong. It assumes that the compressor will
never generate more than N bytes of output given N bytes of input,
which is not true. Not sure there's much point in fixing it now
because with the changes described above this code will have to change
anyway, but I think it's just lucky that this has worked for you in
your testing.
I see your point. But for it to be accurate, I think we need to then
consider the return value of LZ4F_compressBound() to check if that
many bytes are available. But, as explained earlier, our output buffer is
already way smaller than that.
+ /*
+ * LZ4F_compressUpdate() returns the number of bytes written into output
+ * buffer. We need to keep track of how many bytes have been cumulatively
+ * written into the output buffer(bytes_written). But,
+ * LZ4F_compressUpdate() returns 0 in case the data is buffered and not
+ * written to output buffer, set autoFlush to 1 to force the writing to the
+ * output buffer.
+ */
+ prefs->autoFlush = 1;
I don't see why this should be necessary. Elsewhere you have code that
caters to bytes being stuck inside LZ4's buffer, so why do we also
require this?
This is needed to know the actual number of bytes written to the output buffer. If it is
set to 0, then LZ4F_compressUpdate() would return either 0 or the actual
number of bytes written to the output buffer, depending on whether it has buffered
the data or really flushed it to the output buffer.
IIUC, you are referring to the following comment for bbsink_lz4_end_archive():
"
* There might be some data inside lz4's internal buffers; we need to get
* that flushed out, also finalize the lz4 frame and then get that forwarded
* to the successor sink as archive content.
"
I think it should be modified to:
"
* Finalize the lz4 frame and then get that forwarded to the successor sink as
* archive content.
"
On Fri, Sep 10, 2021 at 5:25 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Sep 8, 2021 at 3:39 PM Robert Haas <robertmhaas@gmail.com> wrote: > > The way the gzip APIs I used work, you tell it how big the output > > buffer is and it writes until it fills that buffer, or until the input > > buffer is empty, whichever happens first. But this seems to be the > > other way around: you tell it how much input you have, and it tells > > you how big a buffer it needs. To handle that elegantly, I think I > > need to make some changes to the design of the bbsink stuff. What I'm > > thinking is that each bbsink somehow tells the next bbsink how big to > > make the buffer. > > Here's a new patch set with that design change (and a bug fix for 0001). Seems like nothing has been done about the issue reported in [1] This one line change shall fix the issue, --- a/src/backend/replication/basebackup_gzip.c +++ b/src/backend/replication/basebackup_gzip.c @@ -264,6 +264,8 @@ bbsink_gzip_end_archive(bbsink *sink) bbsink_archive_contents(sink->bbs_next, mysink->bytes_written); mysink->bytes_written = 0; } + + bbsink_forward_end_archive(sink); } [1] https://www.postgresql.org/message-id/CAFiTN-uhg4iKA7FGWxaG9J8WD_LTx655%2BAUW3_KiK1%3DSakQy4A%40mail.gmail.com -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Mon, Sep 13, 2021 at 6:03 AM Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> wrote: >> + /* >> + * If we do not have enough space left in the output buffer for this >> + * chunk to be written, first archive the already written contents. >> + */ >> + if (nextChunkLen > mysink->base.bbs_next->bbs_buffer_length - >> mysink->bytes_written || >> + mysink->bytes_written >= mysink->base.bbs_next->bbs_buffer_length) >> + { >> + bbsink_archive_contents(sink->bbs_next, mysink->bytes_written); >> + mysink->bytes_written = 0; >> + } >> >> I think this is flat-out wrong. It assumes that the compressor will >> never generate more than N bytes of output given N bytes of input, >> which is not true. Not sure there's much point in fixing it now >> because with the changes described above this code will have to change >> anyway, but I think it's just lucky that this has worked for you in >> your testing. > > I see your point. But for it to be accurate, I think we need to then > considered the return value of LZ4F_compressBound() to check if that > many bytes are available. But, as explained earlier our output buffer is > already way smaller than that. Well, in your last version of the patch, you kind of had two output buffers: a bigger one that you use internally and then the "official" one which is associated with the next sink. With my latest patch set you should be able to make that go away by just arranging for the next sink's buffer to be as big as you need it to be. But, if we were going to stick with using an extra buffer, then the solution would not be to do this, but to copy the internal buffer to the official buffer in multiple chunks if needed. So don't bother doing this here but just wait and see how much data you get and then chunk it to the next sink's buffer, calling bbsink_archive_contents() multiple times if required. That would be annoying and expensive so I'm glad we're not doing it that way, but it could be done correctly. >> + /* >> + * LZ4F_compressUpdate() returns the number of bytes written into output >> + * buffer. We need to keep track of how many bytes have been cumulatively >> + * written into the output buffer(bytes_written). But, >> + * LZ4F_compressUpdate() returns 0 in case the data is buffered and not >> + * written to output buffer, set autoFlush to 1 to force the writing to the >> + * output buffer. >> + */ >> + prefs->autoFlush = 1; >> >> I don't see why this should be necessary. Elsewhere you have code that >> caters to bytes being stuck inside LZ4's buffer, so why do we also >> require this? > > This is needed to know the actual bytes written in the output buffer. If it is > set to 0, then LZ4F_compressUpdate() would randomly return 0 or actual > bytes are written to the output buffer, depending on whether it has buffered > or really flushed data to the output buffer. The problem is that if we autoflush, I think it will cause the compression ratio to be less good. Try un-lz4ing a file that is produced this way and then re-lz4 it and compare the size of the re-lz4'd file to the original one. Compressors rely on postponing decisions about how to compress until they've seen as much of the input as possible, and flushing forces them to decide earlier, and maybe making a decision that isn't as good as it could have been. So I believe we should look for a way of avoiding this. Now I realize there's a problem there with doing that and also making sure the output buffer is large enough, and I'm not quite sure how we solve that problem, but there is probably a way to do it. 
-- Robert Haas EDB: http://www.enterprisedb.com
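For clarity, the "chunking" fallback described above (draining an oversized internal buffer into the next sink's smaller buffer in pieces, calling bbsink_archive_contents() once per piece) would amount to a loop like this; the variable names are made up for the example:

/*
 * Illustrative only: copy 'compressed_len' bytes sitting in 'tmpbuf' into
 * the next sink's buffer one chunk at a time.
 */
size_t		remaining = compressed_len;
size_t		offset = 0;

while (remaining > 0)
{
	size_t		chunk = Min(remaining, (size_t) next->bbs_buffer_length);

	memcpy(next->bbs_buffer, tmpbuf + offset, chunk);
	bbsink_archive_contents(next, chunk);

	offset += chunk;
	remaining -= chunk;
}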
On Mon, Sep 13, 2021 at 7:19 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Seems like nothing has been done about the issue reported in [1] > > This one line change shall fix the issue, Oops. Try this version. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
- v5-0006-Modify-pg_basebackup-to-use-a-new-COPY-subprotoco.patch
- v5-0001-Flexible-options-for-BASE_BACKUP.patch
- v5-0008-WIP-Server-side-gzip-compression.patch
- v5-0007-Support-base-backup-targets.patch
- v5-0005-Introduce-bbstreamer-abstraction-to-modularize-pg.patch
- v5-0002-Flexible-options-for-CREATE_REPLICATION_SLOT.patch
- v5-0003-Refactor-basebackup.c-s-_tarWriteDir-function.patch
- v5-0004-Introduce-bbsink-abstraction-to-modularize-base-b.patch
On Mon, Sep 13, 2021 at 9:42 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Sep 13, 2021 at 7:19 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Seems like nothing has been done about the issue reported in [1] > > > > This one line change shall fix the issue, > > Oops. Try this version. Thanks, this version works fine. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Hello I found that in 0001 you propose to rename a few options. Perhaps we could rename another option for clarity? I think the FAST (is it about some bandwidth limit?) and WAIT (wait for what? a checkpoint?) option names are confusing. Could we replace FAST with "CHECKPOINT [fast|spread]" and WAIT with WAIT_WAL_ARCHIVED? I think such names would be more descriptive. - if (PQserverVersion(conn) >= 100000) - /* pg_recvlogical doesn't use an exported snapshot, so suppress */ - appendPQExpBufferStr(query, " NOEXPORT_SNAPSHOT"); + /* pg_recvlogical doesn't use an exported snapshot, so suppress */ + if (use_new_option_syntax) + AppendStringCommandOption(query, use_new_option_syntax, + "SNAPSHOT", "nothing"); + else + AppendPlainCommandOption(query, use_new_option_syntax, + "NOEXPORT_SNAPSHOT"); In 0002, it looks like the condition for 9.x releases was lost? Also my gcc version 8.3.0 is not happy with v5-0007-Support-base-backup-targets.patch and produces: basebackup.c: In function ‘parse_basebackup_options’: basebackup.c:970:7: error: ‘target_str’ may be used uninitialized in this function [-Werror=maybe-uninitialized] errmsg("target '%s' does not accept a target detail", ^~~~~~ regards, Sergei
Also, the buffer length (bbs_buffer_length) could perhaps be size_t instead of int, because that's what most of the compression
libraries have their length variables defined as.
Attachment
On Tue, Sep 14, 2021 at 11:30 AM Sergei Kornilov <sk@zsrv.org> wrote:
> I found that in 0001 you propose to rename few options. Probably we could rename another option for clarify? I think FAST(it's about some bw limits?) and WAIT (wait for what? checkpoint?) option names are confusing.
> Could we replace FAST with "CHECKPOINT [fast|spread]" and WAIT to WAIT_WAL_ARCHIVED? I think such names would be more descriptive.

I think CHECKPOINT { 'spread' | 'fast' } is probably a good idea; the
options logic for pg_basebackup uses the same convention, and if
somebody ever wanted to introduce a third kind of checkpoint, it would
be a lot easier if you could just make pg_basebackup -cbanana send
CHECKPOINT 'banana' to the server.

I don't think renaming WAIT -> WAIT_WAL_ARCHIVED has much value. The
replication grammar isn't really intended to be consumed directly by
end-users, and it's also not clear that WAIT_WAL_ARCHIVED would
attract more support than any of 5 or 10 other possible variants. I'd
rather leave it alone.

> - if (PQserverVersion(conn) >= 100000)
> - /* pg_recvlogical doesn't use an exported snapshot, so suppress */
> - appendPQExpBufferStr(query, " NOEXPORT_SNAPSHOT");
> + /* pg_recvlogical doesn't use an exported snapshot, so suppress */
> + if (use_new_option_syntax)
> + AppendStringCommandOption(query, use_new_option_syntax,
> + "SNAPSHOT", "nothing");
> + else
> + AppendPlainCommandOption(query, use_new_option_syntax,
> + "NOEXPORT_SNAPSHOT");
>
> In 0002, it looks like condition for 9.x releases was lost?

Good catch, thanks. I'll post an updated version of these two patches
on the thread dedicated to those two patches, which can be found at
http://postgr.es/m/CA+Tgmob2cbCPNbqGoixp0J6aib0p00XZerswGZwx-5G=0M+BMA@mail.gmail.com

> Also my gcc version 8.3.0 is not happy with v5-0007-Support-base-backup-targets.patch and produces:
>
> basebackup.c: In function ‘parse_basebackup_options’:
> basebackup.c:970:7: error: ‘target_str’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
> errmsg("target '%s' does not accept a target detail",
> ^~~~~~

OK, I'll fix that. Thanks.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Sep 21, 2021 at 7:54 AM Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> wrote:
> I was wondering if we should change the bbs_buffer_length in bbsink to
> be size_t instead of int, because that's what most of the compression
> libraries have their length variables defined as.

I looked into this and found that I was already using size_t or Size
in a bunch of related places, so this seems to make sense.

Here's a new patch set, responding also to Sergei's comments.

--
Robert Haas
EDB: http://www.enterprisedb.com
Attachment
- v6-0006-Modify-pg_basebackup-to-use-a-new-COPY-subprotoco.patch
- v6-0007-Support-base-backup-targets.patch
- v6-0001-Flexible-options-for-BASE_BACKUP.patch
- v6-0008-WIP-Server-side-gzip-compression.patch
- v6-0005-Introduce-bbstreamer-abstraction-to-modularize-pg.patch
- v6-0003-Refactor-basebackup.c-s-_tarWriteDir-function.patch
- v6-0002-Flexible-options-for-CREATE_REPLICATION_SLOT.patch
- v6-0004-Introduce-bbsink-abstraction-to-modularize-base-b.patch
On Tue, Sep 21, 2021 at 9:08 AM Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> wrote:
> Yes, you are right here, and I could verify this fact with an experiment.
> When autoflush is 1, the file gets less compressed i.e. the compressed file
> is of more size than the one generated when autoflush is set to 0.
> But, as of now, I couldn't think of a solution as we need to really advance the
> bytes written to the output buffer so that we can write into the output buffer.

I don't understand why you think we need to do that. What happens if
you just change prefs->autoFlush = 1 to set it to 0 instead? What I
think will happen is that you'll call LZ4F_compressUpdate a bunch of
times without outputting anything, and then suddenly one of the calls
will produce a bunch of output all at once. But so what? I don't see
that anything in bbsink_lz4_archive_contents() would get broken by
that.

It would be a problem if LZ4F_compressUpdate() didn't produce anything
and also didn't buffer the data internally, and expected us to keep
the input around. That we would have difficulty doing, because we
wouldn't be calling LZ4F_compressUpdate() if we didn't need to free up
some space in that sink's input buffer. But if it buffers the data
internally, I don't know why we care.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Sep 21, 2021 at 9:35 AM Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> wrote:
> Here is a patch for lz4 based on the v5 set of patches. The patch adapts with the
> bbsink changes, and is now able to make the provision for the required length
> for the output buffer using the new callback function bbsink_lz4_begin_backup().
>
> Sample command to take backup:
> pg_basebackup -t server:/tmp/data_lz4 -Xnone --server-compression=lz4
>
> Please let me know your thoughts.

This pretty much looks right, with the exception of the autoFlush
thing about which I sent a separate email. I need to write docs for
all of this, and ideally test cases. It might also be good if
pg_basebackup had an option to un-gzip or un-lz4 archives, but I
haven't thought too hard about what would be required to make that
work.

+ if (opt->compression == BACKUP_COMPRESSION_LZ4)

else if

+ /* First of all write the frame header to destination buffer. */
+ Assert(CHUNK_SIZE >= LZ4F_HEADER_SIZE_MAX);
+ headerSize = LZ4F_compressBegin(mysink->ctx,
+ mysink->base.bbs_next->bbs_buffer,
+ CHUNK_SIZE,
+ prefs);

I think this is wrong. I think you should be passing bbs_buffer_length
instead of CHUNK_SIZE, and I think you can just delete CHUNK_SIZE. If
you think otherwise, why?

+ * sink's bbs_buffer of length that can accomodate the compressed input

Spelling.

+ * Make it next multiple of BLCKSZ since the buffer length is expected so.

The buffer length is expected to be a multiple of BLCKSZ, so round up.

+ * If we are falling short of available bytes needed by
+ * LZ4F_compressUpdate() per the upper bound that is decided by
+ * LZ4F_compressBound(), send the archived contents to the next sink to
+ * process it further.

If the number of available bytes has fallen below the value computed
by LZ4F_compressBound(), ask the next sink to process the data so that
we can empty the buffer.

--
Robert Haas
EDB: http://www.enterprisedb.com
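For clarity, here is a sketch of what the suggested change amounts to. The bbsink_lz4 struct, ctx, prefs, and bytes_written are taken from the patch fragments quoted in this thread; forwarding the call to the next sink is elided, and the error handling shown is only illustrative.

static void
bbsink_lz4_begin_archive(bbsink *sink, const char *archive_name)
{
	bbsink_lz4 *mysink = (bbsink_lz4 *) sink;
	size_t		headerSize;

	/*
	 * Write the LZ4 frame header into the next sink's buffer, telling
	 * liblz4 how much space is really there instead of a fixed CHUNK_SIZE.
	 */
	headerSize = LZ4F_compressBegin(mysink->ctx,
									mysink->base.bbs_next->bbs_buffer,
									mysink->base.bbs_next->bbs_buffer_length,
									&mysink->prefs);
	if (LZ4F_isError(headerSize))
		elog(ERROR, "could not begin LZ4 compression: %s",
			 LZ4F_getErrorName(headerSize));
	mysink->bytes_written += headerSize;

	/* ... forward begin_archive to the next sink here ... */
}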
Attachment
On Wed, Sep 22, 2021 at 12:41 PM Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> wrote:
> If I set prefs->autoFlush to 0, then LZ4F_compressUpdate() returns an
> error: ERROR_dstMaxSize_tooSmall after a few iterations.
>
> After digging a bit in the source of LZ4F_compressUpdate() in LZ4 repository, I
> see that it throws this error when the destination buffer capacity, which in
> our case is mysink->base.bbs_next->bbs_buffer_length is less than the
> compress bound which it calculates internally by calling LZ4F_compressBound()
> internally for buffered_bytes + input buffer(CHUNK_SIZE in this case). Not sure
> how can we control this.

Uggh. It had been my guess that the reason why LZ4F_compressBound()
was returning such a large value was because it had to allow for the
possibility of bytes inside of its internal buffers. But, if the
amount of internally buffered data counts against the argument that
you have to pass to LZ4F_compressBound(), then that makes it more
complicated.

Still, there's got to be a simple way to make this work, and it can't
involve setting autoFlush. Like, look at this:

https://github.com/lz4/lz4/blob/dev/examples/frameCompress.c

That uses the same APIs that we're using here, with a fixed-size input
buffer and a fixed-size output buffer, just as we have here, to
compress a file. And it probably works, because otherwise it likely
wouldn't be in the "examples" directory. And it sets autoFlush to 0.

--
Robert Haas
EDB: http://www.enterprisedb.com
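The essence of what that example does for buffer sizing can be sketched as follows. output_buffer_bound is the field name used later in this thread; how the next sink's buffer size is actually requested is part of the patch set and elided here.

static void
bbsink_lz4_begin_backup(bbsink *sink)
{
	bbsink_lz4 *mysink = (bbsink_lz4 *) sink;

	/*
	 * Worst case LZ4F_compressUpdate() can emit for one full input buffer,
	 * including anything liblz4 may still be holding in its internal
	 * buffers, since autoFlush is not used.
	 */
	mysink->output_buffer_bound =
		LZ4F_compressBound(mysink->base.bbs_buffer_length, &mysink->prefs);

	/*
	 * The patch set then arranges for the next sink's buffer to be at least
	 * this large; the exact mechanism is elided in this sketch.
	 */
}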
in the "examples" directory. And it sets autoFlush to 0.Thanks, Robert. I have seen this example, and it is similar to what we have.I went through each of the steps and appears that I have done it correctly.I am still trying to debug and figure out where it is going wrong.I am going to try hooking the pg_basebackup with the lz4 source anddebug both the sources.Regards,Jeevan Ladhe
Attachment
I think the patch v6-0007-Support-base-backup-targets.patch has broken
the case for multiple tablespaces. When I tried to take a backup
for target 'none' and extracted the base.tar, I was not able to locate
the tablespace_map file.
I debugged and figured out that in the normal tar backup case, i.e. '-Ft',
the pg_basebackup command is sent with TABLESPACE_MAP to the server:
BASE_BACKUP ( LABEL 'pg_basebackup base backup', PROGRESS,
TABLESPACE_MAP, MANIFEST 'yes', TARGET 'client')
But with the target option, i.e. "pg_basebackup -t server:/tmp/data_v1
-Xnone", we are not sending TABLESPACE_MAP; here is how the command
is sent:
BASE_BACKUP ( LABEL 'pg_basebackup base backup', PROGRESS, MANIFEST
'yes', TARGET 'server', TARGET_DETAIL '/tmp/data_none')
I am attaching a patch to fix this issue.
Regards,
Jeevan Ladhe
Attachment
On Thu, Oct 7, 2021 at 7:50 AM Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> wrote:
> I think the patch v6-0007-Support-base-backup-targets.patch has broken
> the case for multiple tablespaces. When I tried to take the backup
> for target 'none' and extract the base.tar I was not able to locate
> tablespace_map file.
>
> I debugged and figured out in normal tar backup i.e. '-Ft' case
> pg_basebackup command is sent with TABLESPACE_MAP to the server:
> BASE_BACKUP ( LABEL 'pg_basebackup base backup', PROGRESS,
> TABLESPACE_MAP, MANIFEST 'yes', TARGET 'client')
>
> But, with the target command i.e. "pg_basebackup -t server:/tmp/data_v1
> -Xnone", we are not sending the TABLESPACE_MAP, here is how the command
> is sent:
> BASE_BACKUP ( LABEL 'pg_basebackup base backup', PROGRESS, MANIFEST
> 'yes', TARGET 'server', TARGET_DETAIL '/tmp/data_none')
>
> I am attaching a patch to fix this issue.

Thanks. Here's a new patch set incorporating that change.

I committed the preparatory patches to add an extensible options
syntax for CREATE_REPLICATION_SLOT and BASE_BACKUP, so those patches
are no longer included in this patch set. Barring objections, I will
also push 0001, a small preparatory refactoring patch, soon.

--
Robert Haas
EDB: http://www.enterprisedb.com
Attachment
- v7-0001-Refactor-basebackup.c-s-_tarWriteDir-function.patch
- v7-0004-Modify-pg_basebackup-to-use-a-new-COPY-subprotoco.patch
- v7-0005-Support-base-backup-targets.patch
- v7-0006-WIP-Server-side-gzip-compression.patch
- v7-0003-Introduce-bbstreamer-abstraction-to-modularize-pg.patch
- v7-0002-Introduce-bbsink-abstraction-to-modularize-base-b.patch
On Tue, Oct 5, 2021 at 5:51 AM Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> wrote:
> I have fixed the autoFlush issue. Basically, I was wrongly initializing
> the lz4 preferences in bbsink_lz4_begin_archive() instead of
> bbsink_lz4_begin_backup(). I have fixed the issue in the attached
> patch, please have a look at it.

Thanks for the new patch. Seems like this is getting closer, but:

+/*
+ * Read the input buffer in CHUNK_SIZE length in each iteration and pass it to
+ * the lz4 compression. Defined as 8k, since the input buffer is multiple of
+ * BLCKSZ i.e. multiple of 8k.
+ */
+#define CHUNK_SIZE 8192

BLCKSZ does not have to be 8kB.

+ size_t compressedSize;
+ int nextChunkLen = CHUNK_SIZE;
+
+ /* Last chunk to be read from the input. */
+ if (avail_in < CHUNK_SIZE)
+ nextChunkLen = avail_in;

This is the only place where CHUNK_SIZE gets used, and I don't think I
see any point to it. I think the 5th argument to LZ4F_compressUpdate
could just be avail_in. And as soon as you do that then I think
bbsink_lz4_archive_contents() no longer needs to be a loop.

For gzip, the output buffer isn't guaranteed to be big enough to write
all the data, so the compression step can fail to compress all the
data. But LZ4 forces us to make the output buffer big enough that no
such failure can happen. Therefore, that can't happen here except if
you artificially limit the amount of data that you pass to
LZ4F_compressUpdate() to something less than the size of the input
buffer. And I don't see any reason to do that.

+ /* First of all write the frame header to destination buffer. */
+ headerSize = LZ4F_compressBegin(mysink->ctx,
+ mysink->base.bbs_next->bbs_buffer,
+ mysink->base.bbs_next->bbs_buffer_length,
+ &mysink->prefs);

+ compressedSize = LZ4F_compressEnd(mysink->ctx,
+ mysink->base.bbs_next->bbs_buffer + mysink->bytes_written,
+ mysink->base.bbs_next->bbs_buffer_length - mysink->bytes_written,
+ NULL);

I think there's some issue with these two chunks of code. What happens
if one of these functions wants to write more data than will fit in
the output buffer? It seems like either there needs to be some code
someplace that ensures adequate space in the output buffer at the time
of these calls, or else there needs to be a retry loop that writes as
much of the data as possible, flushes the output buffer, and then
loops to generate more output data. But there's clearly no retry loop
here, and I don't see any code that guarantees that the output buffer
has to be large enough (and in the case of LZ4F_compressEnd, have
enough remaining space) either. In other words, all the same concerns
that apply to LZ4F_compressUpdate() also apply here ... but in
LZ4F_compressUpdate() you seem to BOTH have a retry loop and ALSO code
to make sure that the buffer is certain to be large enough (which is
more than you need, you only need one of those) and here you seem to
have NEITHER of those things (which is not enough, you need one or the
other).

+ /* Initialize compressor object. */
+ prefs->frameInfo.blockSizeID = LZ4F_max256KB;
+ prefs->frameInfo.blockMode = LZ4F_blockLinked;
+ prefs->frameInfo.contentChecksumFlag = LZ4F_noContentChecksum;
+ prefs->frameInfo.frameType = LZ4F_frame;
+ prefs->frameInfo.contentSize = 0;
+ prefs->frameInfo.dictID = 0;
+ prefs->frameInfo.blockChecksumFlag = LZ4F_noBlockChecksum;
+ prefs->compressionLevel = 0;
+ prefs->autoFlush = 0;
+ prefs->favorDecSpeed = 0;
+ prefs->reserved[0] = 0;
+ prefs->reserved[1] = 0;
+ prefs->reserved[2] = 0;

How about instead using memset() to zero the whole thing and then
omitting the zero initializations? That seems like it would be less
fragile, if the upstream structure definition ever changes.

--
Robert Haas
EDB: http://www.enterprisedb.com
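In other words, something along these lines (a sketch; prefs is assumed to be the LZ4F_preferences_t member shown in the quoted code, and zero happens to be the library default for every field it zeroes):

	/* Zero the whole structure; zero is the default for every field. */
	memset(&mysink->prefs, 0, sizeof(LZ4F_preferences_t));

	/* The only non-default setting in the quoted patch. */
	mysink->prefs.frameInfo.blockSizeID = LZ4F_max256KB;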
Attachment
On Thu, Oct 14, 2021 at 1:21 PM Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> wrote:
> Agree. Removed the CHUNK_SIZE and the loop.

Try harder. :-) The loop is gone, but CHUNK_SIZE itself seems to have
evaded the executioner.

> Fair enough. I have made the change in the bbsink_lz4_begin_backup() to
> make sure we reserve enough extra bytes for the header and the footer those
> are written by LZ4F_compressBegin() and LZ4F_compressEnd() respectively.
> The LZ4F_compressBound() when passed the input size as "0", would give
> the upper bound for output buffer needed by the LZ4F_compressEnd().

I think this is not the best way to accomplish the goal. Adding
LZ4F_compressBound(0) to next_buf_len makes the buffer substantially
bigger for something that's only going to happen once. We are assuming
in any case, I think, that LZ4F_compressBound(0) <=
LZ4F_compressBound(mysink->base.bbs_buffer_length), so all you need to
do is have bbsink_end_archive() empty the buffer, if necessary, before
calling LZ4F_compressEnd(). With just that change, you can set
next_buf_len = LZ4F_HEADER_SIZE_MAX + mysink->output_buffer_bound --
but that's also more than you need. You can instead do next_buf_len =
Min(LZ4F_HEADER_SIZE_MAX, mysink->output_buffer_bound).

Now, you're probably thinking that won't work, because
bbsink_lz4_begin_archive() could fill up the buffer partway, and then
the first call to bbsink_lz4_archive_contents() could overrun it. But
that problem can be solved by reversing the order of operations in
bbsink_lz4_archive_contents(): before you call LZ4F_compressUpdate(),
test whether you need to empty the buffer first, and if so, do it.

That's actually less confusing than the way you've got it, because as
you have it written, we don't really know why we're emptying the
buffer -- is it to prepare for the next call to LZ4F_compressUpdate(),
or is it to prepare for the call to LZ4F_compressEnd()? How do we know
now how much space the next person writing into the buffer is going to
need? It seems better if bbsink_lz4_archive_contents() empties the
buffer before calling LZ4F_compressUpdate() if that call might not
have enough space, and likewise bbsink_lz4_end_archive() empties the
buffer before calling LZ4F_compressEnd() if that's needed. That way,
each callback makes the space *it* needs, not the space the *next*
caller needs. (bbsink_lz4_end_archive() still needs to ALSO empty the
buffer after LZ4F_compressEnd(), so we don't orphan any data.)

On another note, if the call to LZ4F_freeCompressionContext() is
required in bbsink_lz4_end_archive(), then I think this code is going
to just leak the memory used by the compression context if an error
occurs before this code is reached. That kind of sucks. The way to fix
it, I suppose, is a TRY/CATCH block, but I don't think that can be
something internal to basebackup_lz4.c: I think the bbsink stuff would
need to provide some kind of infrastructure for basebackup_lz4.c to
use. It would be a lot better if we could instead get LZ4 to allocate
memory using palloc(), but a quick Google search suggests that you
can't accomplish that without recompiling liblz4, and that's not
workable since we don't want to require a liblz4 built specifically
for PostgreSQL. Do you see any other solution?

--
Robert Haas
EDB: http://www.enterprisedb.com
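Concretely, the reordering suggested above looks roughly like this. It is a sketch using the field names from the patch under review, with output_buffer_bound assumed to be LZ4F_compressBound() of a full input buffer, computed at begin-backup time; the error handling is illustrative only.

static void
bbsink_lz4_archive_contents(bbsink *sink, size_t len)
{
	bbsink_lz4 *mysink = (bbsink_lz4 *) sink;
	bbsink	   *next = mysink->base.bbs_next;
	size_t		compressedSize;

	/*
	 * Make the space *this* call needs: if the worst-case output might not
	 * fit in what is left of the next sink's buffer, empty it first.
	 */
	if (mysink->output_buffer_bound >
		next->bbs_buffer_length - mysink->bytes_written)
	{
		bbsink_archive_contents(next, mysink->bytes_written);
		mysink->bytes_written = 0;
	}

	compressedSize = LZ4F_compressUpdate(mysink->ctx,
										 next->bbs_buffer + mysink->bytes_written,
										 next->bbs_buffer_length - mysink->bytes_written,
										 mysink->base.bbs_buffer,
										 len,
										 NULL);
	if (LZ4F_isError(compressedSize))
		elog(ERROR, "could not compress data: %s",
			 LZ4F_getErrorName(compressedSize));

	mysink->bytes_written += compressedSize;
}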
I am sorry, but I did not really get it. Or is it what you have pointed
out in the following paragraphs?
> I think this is not the best way to accomplish the goal. Adding
> LZ4F_compressBound(0) to next_buf_len makes the buffer substantially
> bigger for something that's only going to happen once.
Yes, you are right. I missed this.
> We are assuming in any case, I think, that LZ4F_compressBound(0) <=
> LZ4F_compressBound(mysink->base.bbs_buffer_length), so all you need to
> do is have bbsink_end_archive() empty the buffer, if necessary, before
> calling LZ4F_compressEnd().
This is a fair enough assumption.
> With just that change, you can set
> next_buf_len = LZ4F_HEADER_SIZE_MAX + mysink->output_buffer_bound --
> but that's also more than you need. You can instead do next_buf_len =
> Min(LZ4F_HEADER_SIZE_MAX, mysink->output_buffer_bound). Now, you're
> probably thinking that won't work, because bbsink_lz4_begin_archive()
> could fill up the buffer partway, and then the first call to
> bbsink_lz4_archive_contents() could overrun it. But that problem can
> be solved by reversing the order of operations in
> bbsink_lz4_archive_contents(): before you call LZ4F_compressUpdate(),
> test whether you need to empty the buffer first, and if so, do it.
I am still not able to get - how can we survive with a mere
size of Min(LZ4F_HEADER_SIZE_MAX, mysink->output_buffer_bound).
LZ4F_HEADER_SIZE_MAX is defined as 19 in lz4 library. With this
proposal, it is almost guaranteed that the next buffer length will
be always set to 19, which will result in failure of a call to
LZ4F_compressUpdate() with the error LZ4F_ERROR_dstMaxSize_tooSmall,
even if we had called bbsink_archive_contents() before.
> That's actually less confusing than the way you've got it, because as
> you have it written, we don't really know why we're emptying the
> buffer -- is it to prepare for the next call to LZ4F_compressUpdate(),
> or is it to prepare for the call to LZ4F_compressEnd()? How do we know
> now how much space the next person writing into the buffer is going to
> need? It seems better if bbsink_lz4_archive_contents() empties the
> buffer before calling LZ4F_compressUpdate() if that call might not
> have enough space, and likewise bbsink_lz4_end_archive() empties the
> buffer before calling LZ4F_compressEnd() if that's needed. That way,
> each callback makes the space *it* needs, not the space the *next*
> caller needs. (bbsink_lz4_end_archive() still needs to ALSO empty the
> buffer after LZ4F_compressEnd(), so we don't orphan any data.)
Sure, I get your point here.
> On another note, if the call to LZ4F_freeCompressionContext() is
> required in bbsink_lz4_end_archive(), then I think this code is going
> to just leak the memory used by the compression context if an error
> occurs before this code is reached. That kind of sucks.
Yes, the LZ4F_freeCompressionContext() is needed to clear the
LZ4F_cctx. The structure LZ4F_cctx_s maintains internal stages
of compression, internal buffers, etc.
> The way to fix
> it, I suppose, is a TRY/CATCH block, but I don't think that can be
> something internal to basebackup_lz4.c: I think the bbsink stuff would
> need to provide some kind of infrastructure for basebackup_lz4.c to
> use. It would be a lot better if we could instead get LZ4 to allocate
> memory using palloc(), but a quick Google search suggests that you
> can't accomplish that without recompiling liblz4, and that's not
> workable since we don't want to require a liblz4 built specifically
> for PostgreSQL. Do you see any other solution?
You mean the way gzip allows us to use our own alloc and free functions
by means of providing the function pointers for them. Unfortunately,
no, LZ4 does not have that kind of provision. Maybe that makes a
good proposal for LZ4 library ;-).
I cannot think of another solution to it right away.
Regards,
Jeevan Ladhe
On Fri, Oct 15, 2021 at 7:54 AM Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> wrote:
> > The loop is gone, but CHUNK_SIZE itself seems to have evaded the executioner.
>
> I am sorry, but I did not really get it. Or it is what you have pointed
> in the following paragraphs?

I mean #define CHUNK_SIZE is still in the patch.

> > With just that change, you can set
> > next_buf_len = LZ4F_HEADER_SIZE_MAX + mysink->output_buffer_bound --
> > but that's also more than you need. You can instead do next_buf_len =
> > Min(LZ4F_HEADER_SIZE_MAX, mysink->output_buffer_bound). Now, you're
> > probably thinking that won't work, because bbsink_lz4_begin_archive()
> > could fill up the buffer partway, and then the first call to
> > bbsink_lz4_archive_contents() could overrun it. But that problem can
> > be solved by reversing the order of operations in
> > bbsink_lz4_archive_contents(): before you call LZ4F_compressUpdate(),
> > test whether you need to empty the buffer first, and if so, do it.
>
> I am still not able to get - how can we survive with a mere
> size of Min(LZ4F_HEADER_SIZE_MAX, mysink->output_buffer_bound).
> LZ4F_HEADER_SIZE_MAX is defined as 19 in lz4 library. With this
> proposal, it is almost guaranteed that the next buffer length will
> be always set to 19, which will result in failure of a call to
> LZ4F_compressUpdate() with the error LZ4F_ERROR_dstMaxSize_tooSmall,
> even if we had called bbsink_archive_contents() before.

Sorry, should have been Max(), not Min().

> You mean the way gzip allows us to use our own alloc and free functions
> by means of providing the function pointers for them. Unfortunately,
> no, LZ4 does not have that kind of provision. Maybe that makes a
> good proposal for LZ4 library ;-).
> I cannot think of another solution to it right away.

OK. Will give it some thought.

--
Robert Haas
EDB: http://www.enterprisedb.com
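So the intended sizing, with the Min/Max slip corrected, is simply (sketch, using the names from the preceding messages):

	/* Big enough for the frame header or for one worst-case compress call. */
	next_buf_len = Max(LZ4F_HEADER_SIZE_MAX, mysink->output_buffer_bound);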
Attachment
On Fri, Oct 15, 2021 at 8:05 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > You mean the way gzip allows us to use our own alloc and free functions
> > by means of providing the function pointers for them. Unfortunately,
> > no, LZ4 does not have that kind of provision. Maybe that makes a
> > good proposal for LZ4 library ;-).
> > I cannot think of another solution to it right away.
>
> OK. Will give it some thought.

Here's a new patch set. I've tried adding a "cleanup" callback to the
bbsink method and ensuring that it gets called even in case of an
error. The code for that is untested since I have no use for it with
the existing basebackup sink types, so let me know how it goes when
you try to use it for LZ4.

I've also added documentation for the new pg_basebackup options in
this version, and I fixed up a couple of these patches to be
pgindent-clean when they previously were not.

--
Robert Haas
EDB: http://www.enterprisedb.com
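The general shape of such an arrangement, as a sketch only: bbsink_cleanup() is the callback name that appears later in this thread, perform_base_backup() is basebackup.c's existing entry point, and the extra sink argument is an assumption about how the patch wires things together. PG_TRY/PG_FINALLY/PG_END_TRY are PostgreSQL's standard error-handling macros.

	PG_TRY();
	{
		/* Run the backup, pushing all data through the sink chain. */
		perform_base_backup(&opt, sink);
	}
	PG_FINALLY();
	{
		/* Let every sink release resources that palloc won't clean up. */
		bbsink_cleanup(sink);
	}
	PG_END_TRY();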
Attachment
I tried to take a backup using gzip compression and got a core.
$ pg_basebackup -t server:/tmp/data_gzip -Xnone --server-compression=gzip
NOTICE: WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backup
pg_basebackup: error: could not read COPY data: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The backtrace:
(gdb) bt
#0 0x0000000000000000 in ?? ()
#1 0x0000558264bfc40a in bbsink_cleanup (sink=0x55826684b5f8) at ../../../src/include/replication/basebackup_sink.h:268
#2 0x0000558264bfc838 in bbsink_forward_cleanup (sink=0x55826684b710) at basebackup_sink.c:124
#3 0x0000558264bf4cab in bbsink_cleanup (sink=0x55826684b710) at ../../../src/include/replication/basebackup_sink.h:268
#4 0x0000558264bf7738 in SendBaseBackup (cmd=0x55826683bd10) at basebackup.c:1020
#5 0x0000558264c10915 in exec_replication_command (
cmd_string=0x5582667bc580 "BASE_BACKUP ( LABEL 'pg_basebackup base backup', PROGRESS, MANIFEST 'yes', TABLESPACE_MAP, TARGET 'server', TARGET_DETAIL '/tmp/data_g
zip', COMPRESSION 'gzip')") at walsender.c:1731
#6 0x0000558264c8a69b in PostgresMain (dbname=0x5582667e84d8 "", username=0x5582667e84b8 "hadoop") at postgres.c:4493
#7 0x0000558264bb10a6 in BackendRun (port=0x5582667de160) at postmaster.c:4560
#8 0x0000558264bb098b in BackendStartup (port=0x5582667de160) at postmaster.c:4288
#9 0x0000558264bacb55 in ServerLoop () at postmaster.c:1801
#10 0x0000558264bac2ee in PostmasterMain (argc=3, argv=0x5582667b68c0) at postmaster.c:1473
#11 0x0000558264aa0950 in main (argc=3, argv=0x5582667b68c0) at main.c:198
bbsink_gzip_ops have the cleanup() callback set to NULL, and when the
bbsink_cleanup() callback is triggered, it tries to invoke a function that
is NULL. I think either bbsink_gzip_ops should set the cleanup callback
to bbsink_forward_cleanup or we should be calling the cleanup() callback
from PG_CATCH instead of PG_FINALLY()? But in the latter case, even if
we call from PG_CATCH, it will have a similar problem for gzip and other
sinks which may not need a custom cleanup() callback in case there is any
error before the backup could finish up normally.
I have attached a patch to fix this issue.
Thoughts?
Attachment
On Fri, Oct 29, 2021 at 8:59 AM Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> wrote:
> bbsink_gzip_ops have the cleanup() callback set to NULL, and when the
> bbsink_cleanup() callback is triggered, it tries to invoke a function that
> is NULL. I think either bbsink_gzip_ops should set the cleanup callback
> to bbsink_forward_cleanup or we should be calling the cleanup() callback
> from PG_CATCH instead of PG_FINALLY()? But in the latter case, even if
> we call from PG_CATCH, it will have a similar problem for gzip and other
> sinks which may not need a custom cleanup() callback in case there is any
> error before the backup could finish up normally.
>
> I have attached a patch to fix this issue.

Yes, this is the right fix. Apologies for the oversight.

--
Robert Haas
EDB: http://www.enterprisedb.com
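For reference, the agreed fix amounts to filling in the missing slot in the gzip sink's callback table, roughly as below. The bbsink_ops type name and the elided members are assumptions; bbsink_gzip_ops and bbsink_forward_cleanup are the names from the report above.

static const bbsink_ops bbsink_gzip_ops = {
	/* ... other callbacks unchanged ... */
	.cleanup = bbsink_forward_cleanup	/* previously NULL, causing the crash */
};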
I have implemented the cleanup callback bbsink_lz4_cleanup() in the attached patch.
Please have a look and let me know of any comments.
Regards,
Jeevan Ladhe
Attachment
On Tue, Nov 2, 2021 at 7:53 AM Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> wrote:
> I have implemented the cleanup callback bbsink_lz4_cleanup() in the attached patch.
>
> Please have a look and let me know of any comments.

Looks pretty good. I think you should work on stuff like documentation
and tests, and I need to do some work on that stuff, too.

Also, I think you should try to figure out how to support different
compression levels. For gzip, I did that by making gzip1..gzip9
possible compression settings. But that might not have been the right
idea, because something like lz43 to mean lz4 at level 3 would be
confusing. Also, for the lz4 command line utility, there's not only
"lz4 -3" which means LZ4 with level 3 compression, but also "lz4
--fast=3" which selects "ultra-fast compression level 3" rather than
regular old level 3. And apparently LZ4 levels go up to 12 rather than
just 9 like gzip. I'm thinking maybe we should go with something like
"gzip@9" rather than just "gzip9" to mean gzip with compression level
9, and then things like "lz4@3" or "lz4@fast3" would select either the
regular compression levels or the ultra-fast compression levels.

Meanwhile, I think it's probably OK for me to go ahead and commit
0001-0003 from my patches at this point, since it seems we have pretty
good evidence that the abstraction basically works, and there doesn't
seem to be any value in holding off and maybe having to do a bunch
more rebasing.

We may also want to look into making -Fp work with
--server-compression, which would require pg_basebackup to know how to
decompress. I'm actually not sure if this is worthwhile; you'd need to
have a network connection slow enough that it's worth spending a lot
of CPU time compressing on the server and decompressing on the client
to make up for the cost of network transfer. But some people might
have that case. It might make it easier to test this, too, since we
probably can't rely on having an LZ4 binary installed.

Another thing that you probably need to investigate is also supporting
client-side LZ4 compression. I think that is probably a really
desirable addition to your patch set, since people might find it odd
if that were exclusively a server-side option. Hopefully it's not that
much work.

One minor nitpick in terms of the code:

+ mysink->bytes_written = mysink->bytes_written + headerSize;

I would use += here.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Nov 2, 2021 at 10:32 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Looks pretty good. I think you should work on stuff like documentation
> and tests, and I need to do some work on that stuff, too. Also, I
> think you should try to figure out how to support different
> compression levels.

On second thought, maybe we don't need to do this. There's a thread on
"Teach pg_receivewal to use lz4 compression" which concluded that
supporting different compression levels was unnecessary.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Nov 2, 2021 at 10:32 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Meanwhile, I think it's probably OK for me to go ahead and commit
> 0001-0003 from my patches at this point, since it seems we have pretty
> good evidence that the abstraction basically works, and there doesn't
> seem to be any value in holding off and maybe having to do a bunch
> more rebasing.

I went ahead and committed 0001 and 0002, but got nervous about
proceeding with 0003.

For those who may not have been following along closely, what was 0003
and is now 0001 introduces a new COPY subprotocol for taking backups.
That probably needs to be documented, and as of now the patch does not
do that, but the bigger question is what to do about backward
compatibility. I wrote the patch in such a way that, post-patch, the
server can do backups either the way that we do them now, or the new
way that it introduces, but I'm wondering if I should rip that out and
just support the new way only.

If you run a newer pg_basebackup against an older server, it will
work, and it still does with the patch. If, however, you run an older
pg_basebackup against a newer server, it complains. For example,
running a pg13 pg_basebackup against a pg14 cluster produces this:

pg_basebackup: error: incompatible server version 14.0
pg_basebackup: removing data directory "pgstandby"

Now for all I know there is out-of-core software out there that speaks
the replication protocol and can take base backups using it and would
like it to continue working as it does today, and that's easy for me
to do, because that's the way the patch works. But on the other hand,
since the patch adapts the in-core tools to use the new method when
talking to a new server, we wouldn't have test coverage for the old
method any more, which might possibly make it annoying to maintain.
But then again, that is a problem we could leave for the future, and
rip it out then rather than now. I'm not sure which way to jump.
Anyone else have thoughts?

--
Robert Haas
EDB: http://www.enterprisedb.com
Attachment
On Fri, Nov 5, 2021 at 11:50 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I went ahead and committed 0001 and 0002, but got nervous about
> proceeding with 0003.

It turns out that these commits are causing failures on prairiedog.
Per email from Tom off-list, that's apparently because prairiedog has
a fussy version of tar that doesn't like it when you omit the trailing
NUL blocks that are supposed to be part of a tar file.

So how did this get broken? It turns out that in the current state of
the world, the server sends an almost-tarfile to the client. What I
mean by an almost-tarfile is that it sends something that looks like a
valid tarfile except that the two blocks of trailing NUL bytes are
omitted. Prior to these patches, that was a very strategic omission,
because the pg_basebackup code wants to edit the tar files, and it
wasn't smart enough to parse them, so it just received all the data
from the server, then added any members that it wanted to add (e.g.
recovery.signal), and then added the terminator itself. I would
classify this as an ugly hack, but it worked.

With these changes, the client is now capable of really parsing a
tarfile, so it would have no problem injecting new files into the
archive whether or not the server terminates it properly. It also has
no problem adding the two blocks of terminating NUL bytes if the
server omits them, but not otherwise. All in all, it's significantly
smarter code. However, I also set things up so that the client doesn't
bother parsing the tar file from the server if it's not doing anything
that requires editing the tar file on the fly. That saves some
overhead, and it's also important for the rest of the patch set, which
wants to make it so that the server could send us something besides a
tarfile, like maybe a .tar.gz. We can't just have a convention of
adding 1024 NUL bytes to any file the server sends us unless what the
server sends us is always and precisely an unterminated tarfile.

Unfortunately, that means that in the case where the tar parsing logic
isn't used, the tar file does not end up with the proper terminator.
Because most 'tar' implementations are happy to ignore that defect,
the tests pass on my machine, but not on prairiedog. I think I
realized this problem at some point during the development process of
this patch, but then I forgot about it again and ended up committing
something that has a problem of which, at some earlier point in time,
I had been entirely aware. Oops.

It's tempting to try to fix this problem by changing the server so
that it properly terminates the tar files it sends to the client.
Honestly, I don't know how we ever thought it was OK to design a
protocol for base backups that involved the server sending something
that is almost but not quite a valid tarfile. However, that's not
quite good enough, because pg_basebackup is supposed to be backward
compatible, so we'd still have the same problem if a new version of
pg_basebackup were used with an old server. So what I'm inclined to do
is fix both the server and pg_basebackup. On the server side, properly
terminate the tarfile. On the client side, if we're talking to a
pre-v15 server and don't need to parse the tarfile, blindly add 1024
NUL bytes at the end.

I think I can get patches for this done today. Please let me know ASAP
if you have objections to this line of attack.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com
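On the client side, that padding amounts to no more than appending two 512-byte zero blocks when the server is older than v15 and pg_basebackup is not already parsing (and therefore terminating) the archive itself. A rough sketch, with the flag and helper names invented purely for illustration; PQserverVersion() is the normal libpq call.

#define TAR_BLOCK_SIZE 512

	if (PQserverVersion(conn) < 150000 && !parsing_archive)	/* flag is hypothetical */
	{
		char		zerobuf[2 * TAR_BLOCK_SIZE] = {0};

		/* Pre-v15 servers omit the tar trailer; supply it ourselves. */
		write_to_output_file(outfile, zerobuf, sizeof(zerobuf));	/* hypothetical helper */
	}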
Robert Haas <robertmhaas@gmail.com> writes:
> It turns out that these commits are causing failures on prairiedog.
> Per email from Tom off-list, that's apparently because prairiedog has
> a fussy version of tar that doesn't like it when you omit the trailing
> NUL blocks that are supposed to be part of a tar file.

FTR, prairiedog is green. It's Noah's AIX menagerie that's complaining.

It's actually a little bit disturbing that we're only seeing a failure
on that one platform, because that means that nothing else is anchoring
us to the strict POSIX specification for tarfile format. We knew that
GNU tar is forgiving about missing trailing zero blocks, but apparently
so is BSD tar.

One part of me wants to add some explicit test for the trailing blocks.
Another says, well, the *de facto* tar standard seems not to require
the trailing blocks, never mind the letter of POSIX --- so when AIX
dies, will anyone care anymore? Maybe not.

regards, tom lane
On Mon, Nov 8, 2021 at 10:59 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > It turns out that these commits are causing failures on prairiedog.
> > Per email from Tom off-list, that's apparently because prairiedog has
> > a fussy version of tar that doesn't like it when you omit the trailing
> > NUL blocks that are supposed to be part of a tar file.
>
> FTR, prairiedog is green. It's Noah's AIX menagerie that's complaining.

Woops.

> It's actually a little bit disturbing that we're only seeing a failure
> on that one platform, because that means that nothing else is anchoring
> us to the strict POSIX specification for tarfile format. We knew that
> GNU tar is forgiving about missing trailing zero blocks, but apparently
> so is BSD tar.

Yeah.

> One part of me wants to add some explicit test for the trailing blocks.
> Another says, well, the *de facto* tar standard seems not to require
> the trailing blocks, never mind the letter of POSIX --- so when AIX
> dies, will anyone care anymore? Maybe not.

FWIW, I think both of those are pretty defensible positions. Honestly,
I'm not sure how likely the bug is to recur once we fix it here,
either. The only reason this is a problem is the kludge of having the
server generate the entire output file except for the last 1kB. If we
eliminate that behavior, I don't know that this particular problem is
especially likely to come back. But adding a test isn't stupid either,
just a bit tricky to write.

When I was testing locally this morning, I found that there were
considerably more than 1024 zero bytes at the end of the file, because
the last file it backs up is pg_control, which ends with lots of zero
bytes. So it's not sufficient to just write a test that checks for
non-zero bytes in the last 1kB of the file. What I think you'd need to
do is figure out the number of files in the archive and the sizes of
each one, and based on that work out how big the tar archive should
be: 512 bytes per file or directory or symlink, plus enough extra
512-byte chunks to cover the contents of each file, plus an extra 1024
bytes at the end. That doesn't seem particularly simple to code. We
could run 'tar tvf' and parse the output to get the number of files
and their lengths, but that seems likely to cause more portability
headaches than the underlying issue.

Since pg_basebackup now has the logic to do all of this parsing
internally, we could make it complain if it receives from a v15+
server an archive trailer that is not 1024 bytes of zeroes, but that
wouldn't help with this exact problem, because the issue in this case
is when pg_basebackup decides it doesn't need to parse in the first
place. We could add a pg_basebackup option
--force-parsing-and-check-if-the-server-seems-broken, but that seems
like overkill to me. So overall I'm inclined to just do nothing about
this unless someone has a better idea how to write a reasonable test.

Anyway, here's my proposal for fixing the issue immediately before us.
0001 adds logic to pad out the unterminated tar archives, and 0002
makes the server terminate its tar archives while preserving the logic
added by 0001 for cases where we're talking to an older server. I
assume that it's best to get something committed quickly here, so I
will do that in ~4 hours if there are no major objections, or sooner
if I hear some enthusiastic endorsement.

--
Robert Haas
EDB: http://www.enterprisedb.com
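For what it's worth, the arithmetic itself is easy to write down. The sketch below is purely illustrative of the calculation described above and is not a proposal for an actual test.

/*
 * Expected size of a POSIX tar archive: a 512-byte header per member,
 * contents rounded up to a multiple of 512 bytes (directories and symlinks
 * contribute no contents), plus two 512-byte zero blocks at the end.
 */
static uint64
expected_tar_size(const uint64 *member_sizes, int nmembers)
{
	uint64		total = 0;

	for (int i = 0; i < nmembers; i++)
		total += 512 + ((member_sizes[i] + 511) / 512) * 512;

	return total + 1024;
}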
Attachment
On Mon, Nov 8, 2021 at 11:34 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Anyway, here's my proposal for fixing the issue immediately before us.
> 0001 adds logic to pad out the unterminated tar archives, and 0002
> makes the server terminate its tar archives while preserving the logic
> added by 0001 for cases where we're talking to an older server. I
> assume that it's best to get something committed quickly here so will
> do that in ~4 hours if there are no major objections, or sooner if I
> hear some enthusiastic endorsement.

I have now committed 0001 and will wait to see what the buildfarm
thinks about that before doing anything more.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Nov 8, 2021 at 4:41 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Nov 8, 2021 at 11:34 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > Anyway, here's my proposal for fixing the issue immediately before us.
> > 0001 adds logic to pad out the unterminated tar archives, and 0002
> > makes the server terminate its tar archives while preserving the logic
> > added by 0001 for cases where we're talking to an older server. I
> > assume that it's best to get something committed quickly here so will
> > do that in ~4 hours if there are no major objections, or sooner if I
> > hear some enthusiastic endorsement.
>
> I have now committed 0001 and will wait to see what the buildfarm
> thinks about that before doing anything more.

It seemed OK, so I have now committed 0002 as well.

--
Robert Haas
EDB: http://www.enterprisedb.com
> On Fri, Nov 05, 2021 at 11:50:01AM -0400, Robert Haas wrote:
> On Tue, Nov 2, 2021 at 10:32 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > Meanwhile, I think it's probably OK for me to go ahead and commit
> > 0001-0003 from my patches at this point, since it seems we have pretty
> > good evidence that the abstraction basically works, and there doesn't
> > seem to be any value in holding off and maybe having to do a bunch
> > more rebasing.
>
> I went ahead and committed 0001 and 0002, but got nervous about
> proceeding with 0003.

Hi,

I'm observing a strange issue which I can only relate to bef47ff85d,
where the bbsink abstraction was introduced. The problem is a failing
assertion when doing:

DETAIL: Failed process was running: BASE_BACKUP ( LABEL 'pg_basebackup base backup', PROGRESS, WAIT 0, MAX_RATE 102400, MANIFEST 'yes')

Walsender tries to send a backup manifest, but crashes on the throttling sink:

#2 0x0000560857b551af in ExceptionalCondition (conditionName=0x560857d15d27 "sink->bbs_next != NULL", errorType=0x560857d15c23 "FailedAssertion", fileName=0x560857d15d15 "basebackup_sink.c", lineNumber=91) at assert.c:69
#3 0x0000560857918a94 in bbsink_forward_manifest_contents (sink=0x5608593f73f8, len=32768) at basebackup_sink.c:91
#4 0x0000560857918d68 in bbsink_throttle_manifest_contents (sink=0x5608593f7450, len=32768) at basebackup_throttle.c:125
#5 0x00005608579186d0 in bbsink_manifest_contents (sink=0x5608593f7450, len=32768) at ../../../src/include/replication/basebackup_sink.h:240
#6 0x0000560857918b1b in bbsink_forward_manifest_contents (sink=0x5608593f74e8, len=32768) at basebackup_sink.c:94
#7 0x0000560857911edc in bbsink_manifest_contents (sink=0x5608593f74e8, len=32768) at ../../../src/include/replication/basebackup_sink.h:240
#8 0x00005608579129f6 in SendBackupManifest (manifest=0x7ffdaea9d120, sink=0x5608593f74e8) at backup_manifest.c:373

Looking at the similar bbsink_throttle_archive_contents(), it's not
clear why the comments for both functions (archive and manifest
throttling) say "pass archive contents to next sink", but only
bbsink_throttle_manifest_contents() passes bbs_next into
bbsink_forward_manifest_contents(). Is it supposed to be like that?
Passing the same sink object instead of the next one into
bbsink_forward_manifest_contents() seems to solve the problem in this
case.
On Mon, Nov 15, 2021 at 11:25 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Walsender tries to send a backup manifest, but crashes on the trottling sink:
>
> #2 0x0000560857b551af in ExceptionalCondition (conditionName=0x560857d15d27 "sink->bbs_next != NULL", errorType=0x560857d15c23 "FailedAssertion", fileName=0x560857d15d15 "basebackup_sink.c", lineNumber=91) at assert.c:69
> #3 0x0000560857918a94 in bbsink_forward_manifest_contents (sink=0x5608593f73f8, len=32768) at basebackup_sink.c:91
> #4 0x0000560857918d68 in bbsink_throttle_manifest_contents (sink=0x5608593f7450, len=32768) at basebackup_throttle.c:125
> #5 0x00005608579186d0 in bbsink_manifest_contents (sink=0x5608593f7450, len=32768) at ../../../src/include/replication/basebackup_sink.h:240
> #6 0x0000560857918b1b in bbsink_forward_manifest_contents (sink=0x5608593f74e8, len=32768) at basebackup_sink.c:94
> #7 0x0000560857911edc in bbsink_manifest_contents (sink=0x5608593f74e8, len=32768) at ../../../src/include/replication/basebackup_sink.h:240
> #8 0x00005608579129f6 in SendBackupManifest (manifest=0x7ffdaea9d120, sink=0x5608593f74e8) at backup_manifest.c:373
>
> Looking at the similar bbsink_throttle_archive_contents it's not clear
> why comments for both functions (archive and manifest throttling) say
> "pass archive contents to next sink", but only bbsink_throttle_manifest_contents
> does pass bbs_next into the bbsink_forward_manifest_contents. Is it
> supposed to be like that? Passing the same sink object instead the next
> one into bbsink_forward_manifest_contents seems to solve the problem in
> this case.

Yeah, that's what it should be doing. I'll commit a fix, thanks for
the report and diagnosis.

--
Robert Haas
EDB: http://www.enterprisedb.com
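Based on the backtrace and diagnosis above, the fix presumably makes the throttling sink's manifest callback mirror its archive-contents sibling, roughly as sketched below. throttle() is assumed to be the rate-limiting helper that basebackup_throttle.c inherited from the old basebackup.c code.

static void
bbsink_throttle_manifest_contents(bbsink *sink, size_t len)
{
	throttle((bbsink_throttle *) sink, len);

	/*
	 * Forward from *this* sink; the forwarding helper itself looks up
	 * sink->bbs_next, so passing bbs_next here skips a link in the chain
	 * and trips the Assert seen in the backtrace.
	 */
	bbsink_forward_manifest_contents(sink, len);
}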
On Mon, Nov 15, 2021 at 2:23 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Yeah, that's what it should be doing. I'll commit a fix, thanks for
> the report and diagnosis.

Here's a new patch set.

0001 - When I committed the patch to add the missing 2 blocks of zero
bytes to the tar archives generated by the server, I failed to adjust
the documentation. So 0001 does that. This is the only new patch in
the series. I was not sure whether to just remove the statement from
the documentation saying that those blocks aren't included, or whether
to mention that we used to include them and no longer do. I went for
the latter; opinions welcome.

0002 - This adds a new COPY subprotocol for taking base backups. I've
improved it over the previous version by adding documentation. I'm
still seeking comments on the points I raised in
http://postgr.es/m/CA+TgmobrOXbDh+hCzzVkD3weV3R-QRy3SPa=FRb_Rv9wF5iPJw@mail.gmail.com
but what I'm leaning toward doing is committing the patch as is and
then submitting a patch - or maybe several - later to rip some of this
and a few other old things out. That way the debate - or lack thereof
- about what to do here doesn't have to block the main patch set, and
also, it feels safer to make removing the existing stuff a separate
effort rather than doing it now.

0003 - This adds "server" and "blackhole" as backup targets. In this
version, I've improved the documentation. Also, the previous version
only let you use a backup target with -Xnone, and I realized that was
stupid. -Xfetch is OK too. -Xstream still doesn't work, since that's
implemented via client-side logic. I think this still needs some work
to be committable, like adding tests, but I don't expect to make any
major changes.

0004 - Server-side gzip compression. Similar level of maturity to 0003.

--
Robert Haas
EDB: http://www.enterprisedb.com
Attachment
On Mon, Nov 15, 2021 at 2:23 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Yeah, that's what it should be doing. I'll commit a fix, thanks for
> the report and diagnosis.
Here's a new patch set.
0001 - When I committed the patch to add the missing 2 blocks of zero
bytes to the tar archives generated by the server, I failed to adjust
the documentation. So 0001 does that. This is the only new patch in
the series. I was not sure whether to just remove the statement from
the documentation saying that those blocks aren't included, or whether
to mention that we used to include them and no longer do. I went for
the latter; opinions welcome.
0002 - This adds a new COPY subprotocol for taking base backups. I've
improved it over the previous version by adding documentation. I'm
still seeking comments on the points I raised in
http://postgr.es/m/CA+TgmobrOXbDh+hCzzVkD3weV3R-QRy3SPa=FRb_Rv9wF5iPJw@mail.gmail.com
but what I'm leaning toward doing is committing the patch as is and
then submitting - or maybe several patches - later to rip some this
and a few other old things out. That way the debate - or lack thereof
- about what to do here doesn't have to block the main patch set, and
also, it feels safer to make removing the existing stuff a separate
effort rather than doing it now.
0003 - This adds "server" and "blackhole" as backup targets. In this
version, I've improved the documentation. Also, the previous version
only let you use a backup target with -Xnone, and I realized that was
stupid. -Xfetch is OK too. -Xstream still doesn't work, since that's
implemented via client-side logic. I think this still needs some work
to be committable, like adding tests, but I don't expect to make any
major changes.
0004 - Server-side gzip compression. Similar level of maturity to 0003.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 11/22/21 11:05 PM, Jeevan Ladhe wrote:
> Please find the lz4 compression patch here that basically has:
Thanks. Could you please rebase your patch? It is failing at my end -
[edb@centos7tushar pg15_lz]$ git apply /tmp/v8-0001-LZ4-compression.patch
error: patch failed: doc/src/sgml/ref/pg_basebackup.sgml:230
error: doc/src/sgml/ref/pg_basebackup.sgml: patch does not apply
error: patch failed: src/backend/replication/Makefile:19
error: src/backend/replication/Makefile: patch does not apply
error: patch failed: src/backend/replication/basebackup.c:64
error: src/backend/replication/basebackup.c: patch does not apply
error: patch failed: src/include/replication/basebackup_sink.h:285
error: src/include/replication/basebackup_sink.h: patch does not apply
--
regards,tushar
EnterpriseDB https://www.enterprisedb.com/
The Enterprise PostgreSQL Company
On 12/28/21 1:11 PM, Jeevan Ladhe wrote: > You need to apply Robert's v10 version patches 0002, 0003 and 0004, > before applying the lz4 patch(v8 version). Thanks, able to apply now. -- regards,tushar EnterpriseDB https://www.enterprisedb.com/ The Enterprise PostgreSQL Company
On 11/22/21 11:05 PM, Jeevan Ladhe wrote:
> Please find the lz4 compression patch here that basically has:
One small issue: in the "pg_basebackup --help" output, we are not displaying the lz4 value under the --server-compression option.

[edb@tusharcentos7-v14 bin]$ ./pg_basebackup --help | grep server-compression
  --server-compression=none|gzip|gzip[1-9]
--
regards,tushar
EnterpriseDB https://www.enterprisedb.com/
The Enterprise PostgreSQL Company
On 11/22/21 11:05 PM, Jeevan Ladhe wrote:
> Please find the lz4 compression patch here that basically has:
Please refer to this scenario, where --server-compression compresses only the base backup into lz4 format but not the pg_wal directory:

[edb@centos7tushar bin]$ ./pg_basebackup -Ft --server-compression=lz4 -Xstream -D foo
[edb@centos7tushar bin]$ ls foo
backup_manifest base.tar.lz4 pg_wal.tar

The same is valid for gzip as well, if server-compression is set to gzip:

[edb@centos7tushar bin]$ ./pg_basebackup -Ft --server-compression=gzip4 -Xstream -D foo1
[edb@centos7tushar bin]$ ls foo1
backup_manifest base.tar.gz pg_wal.tar

If this scenario is valid then both archives should be in lz4 format; otherwise we should get an error, something like "not a valid option"?
--
regards,tushar
EnterpriseDB https://www.enterprisedb.com/
The Enterprise PostgreSQL Company
On Mon, Jan 3, 2022 at 12:12 PM tushar <tushar.ahuja@enterprisedb.com> wrote: > On 11/22/21 11:05 PM, Jeevan Ladhe wrote: > > Please find the lz4 compression patch here that basically has: > Please refer to this scenario , where --server-compression is only > compressing > base backup into lz4 format but not pg_wal directory > > [edb@centos7tushar bin]$ ./pg_basebackup -Ft --server-compression=lz4 > -Xstream -D foo > > [edb@centos7tushar bin]$ ls foo > backup_manifest base.tar.lz4 pg_wal.tar > > this same is valid for gzip as well if server-compression is set to gzip > > edb@centos7tushar bin]$ ./pg_basebackup -Ft --server-compression=gzip4 > -Xstream -D foo1 > > [edb@centos7tushar bin]$ ls foo1 > backup_manifest base.tar.gz pg_wal.tar > > if this scenario is valid then both the folders format should be in lz4 > format otherwise we should > get an error something like - not a valid option ? Before sending an email like this, it would be a good idea to read the documentation for the --server-compression option. -- Robert Haas EDB: http://www.enterprisedb.com
On 1/4/22 8:07 PM, Robert Haas wrote:
> Before sending an email like this, it would be a good idea to read the
> documentation for the --server-compression option.
Sure, thanks Robert. Here is one scenario where I feel the error message is confusing; if this combination is not supported at all, then the error message needs to be a little bit clearer.

If we use -z (or -Z) with -t, we get this error:

[edb@centos7tushar bin]$ ./pg_basebackup -t server:/tmp/test0 -Xfetch -z
pg_basebackup: error: only tar mode backups can be compressed
Try "pg_basebackup --help" for more information.

but after removing the -z option, the backup is in tar mode anyway:

[edb@centos7tushar bin]$ ./pg_basebackup -t server:/tmp/test0 -Xfetch
[edb@centos7tushar bin]$ ls /tmp/test0
backup_manifest base.tar
--
regards,tushar
EnterpriseDB https://www.enterprisedb.com/
The Enterprise PostgreSQL Company
On Wed, Jan 5, 2022 at 5:11 AM tushar <tushar.ahuja@enterprisedb.com> wrote: > One scenario where I feel error message is confusing and if it is not > supported at all then error message need to be a little bit more clear > > if we use -z (or -Z ) with -t , we are getting this error > [edb@centos7tushar bin]$ ./pg_basebackup -t server:/tmp/test0 -Xfetch -z > pg_basebackup: error: only tar mode backups can be compressed > Try "pg_basebackup --help" for more information. > > but after removing -z option backup is in tar mode only > > edb@centos7tushar bin]$ ./pg_basebackup -t server:/tmp/test0 -Xfetch > [edb@centos7tushar bin]$ ls /tmp/test0 > backup_manifest base.tar OK, fair enough, I can adjust the error message for that case. -- Robert Haas EDB: http://www.enterprisedb.com
Hi Tushar,
You need to apply Robert's v10 version patches 0002, 0003 and 0004, before applying the lz4 patch (v8 version). Please let me know if you still face any issues.
On Tue, Dec 28, 2021 at 1:12 PM Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> wrote:
> Hi Tushar,
> You need to apply Robert's v10 version patches 0002, 0003 and 0004, before applying the lz4 patch (v8 version). Please let me know if you still face any issues.
> Thanks, Jeevan.
I tested the --server-compression option together with various other pg_basebackup options, and also checked that -t/--server-compression from a v15 pg_basebackup will throw an error if the server version is v14 or below. Things are looking good to me. Two open issues -
1)lz4 value is missing for --server-compression in pg_basebackup --help
2)Error messages need to improve if using -t server with -z/-Z
regards,
On Tue, Nov 16, 2021 at 4:47 PM Robert Haas <robertmhaas@gmail.com> wrote: > Here's a new patch set. And here's another one. I've committed the first two patches from the previous set, the second of those just today, and so we're getting down to the meat of the patch set. 0001 adds "server" and "blackhole" as backup targets. It now has some tests. This might be more or less ready to ship, unless somebody else sees a problem, or I find one. 0002 adds server-side gzip compression. This one hasn't got tests yet. Also, it's going to need some adjustment based on the parallel discussion on the new options structure. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Jan 18, 2022 at 9:43 AM Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> wrote:
> The patch surely needs some grooming, but I am expecting some initial
> review, specially in the area where we are trying to close the zstd stream
> in bbsink_zstd_end_archive(). We need to tell the zstd library to end the
> compression by calling ZSTD_compressStream2() thereby sending a
> ZSTD_e_end flag. But, this also needs some input string, which per
> example[1] line # 686, I have taken as an empty ZSTD_inBuffer.

As far as I can see, this is correct. I found https://zstd.docsforge.com/dev/api-documentation/#streaming-compression-howto which seems to endorse what you've done here.

One (minor) thing that I notice is that, the way you've written the loop in bbsink_zstd_end_archive(), I think it will typically call bbsink_archive_contents() twice. It will flush whatever is already present in the next sink's buffer as a result of the previous calls to bbsink_zstd_archive_contents(), and then it will call ZSTD_compressStream2() which will partially refill the buffer you just emptied, and then there will be nothing left in the internal buffer, so it will call bbsink_archive_contents() again. But ... the initial flush may not have been necessary. It could be that there was enough space already in the output buffer for the ZSTD_compressStream2() call to succeed without a prior flush. So maybe:

do
{
    yet_to_flush = ZSTD_compressStream2(..., ZSTD_e_end);
    check ZSTD_isError here;
    if (mysink->zstd_outBuf.pos > 0)
        bbsink_archive_contents();
} while (yet_to_flush > 0);

I believe this might be very slightly more efficient.

--
Robert Haas
EDB: http://www.enterprisedb.com
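To make that concrete, here is roughly the shape I have in mind for the end-of-archive path (untested sketch; the cctx and zstd_outBuf field names follow the patch under review and may differ):

    #include <zstd.h>

    static void
    bbsink_zstd_end_archive(bbsink *sink)
    {
        bbsink_zstd *mysink = (bbsink_zstd *) sink;
        ZSTD_inBuffer in = {NULL, 0, 0};    /* no more input, just flush */
        size_t      yet_to_flush;

        do
        {
            /* Ask the library to finish the frame, writing into the output buffer. */
            yet_to_flush = ZSTD_compressStream2(mysink->cctx,
                                                &mysink->zstd_outBuf,
                                                &in, ZSTD_e_end);
            if (ZSTD_isError(yet_to_flush))
                elog(ERROR, "could not compress data: %s",
                     ZSTD_getErrorName(yet_to_flush));

            /* Only flush to the next sink if we actually produced something. */
            if (mysink->zstd_outBuf.pos > 0)
            {
                bbsink_archive_contents(sink->bbs_next, mysink->zstd_outBuf.pos);
                mysink->zstd_outBuf.dst = sink->bbs_next->bbs_buffer;
                mysink->zstd_outBuf.size = sink->bbs_next->bbs_buffer_length;
                mysink->zstd_outBuf.pos = 0;
            }
        } while (yet_to_flush > 0);

        bbsink_forward_end_archive(sink);
    }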
On Wed, Jan 19, 2022 at 7:16 AM Dipesh Pandit <dipesh.pandit@gmail.com> wrote:
> I have added support for decompressing a gzip compressed tar file
> at client. pg_basebackup can enable server side compression for
> plain format backup with this change.
>
> Added a gzip extractor which decompresses the compressed archive
> and forwards it to the next streamer. I have done initial testing and
> working on updating the test coverage.

Cool. It's going to need some documentation changes, too.

I don't like the way you coded this in CreateBackupStreamer(). I would like the decision about whether to use bbstreamer_gzip_extractor_new(), and/or throw an error about not being able to parse an archive, to be based on the file type i.e. "did we get a .tar.gz file?" rather than on whether we asked for server-side compression. Notice that the existing logic checks whether we actually got a .tar file from the server rather than assuming that's what must have happened. As a matter of style, I don't think it's good for the only thing inside of an "if" statement to be another "if" statement. The two could be merged, but we also don't want to have the "if" conditional be too complex. I am imagining that this should end up saying something like if (must_parse_archive && !is_tar && !is_tar_gz) { pg_log_error(...

+ * "windowBits" must be greater than or equal to "windowBits" value
+ * provided to deflateInit2 while compressing.

It would be nice to clarify why we know the value we're using is safe. Maybe we're using the maximum possible value, in which case you could just add that to the end of the comment: "...so we use the maximum possible value for safety."

+ /*
+ * End of the stream, if there is some pending data in output buffers then
+ * we must forward it to next streamer.
+ */
+ if (res == Z_STREAM_END) {
+ bbstreamer_content(mystreamer->base.bbs_next, member, mystreamer->base.bbs_buffer.data,
+ mystreamer->bytes_written, context);
+ }

Uncuddle the brace.

It probably doesn't make much difference, but I would be inclined to do the final flush in bbstreamer_gzip_extractor_finalize() rather than here. That way we rely on our own notion of when there's no more input data rather than zlib's notion. Probably terrible things are going to happen if those two ideas don't match up .... but there might be some other compression algorithm that doesn't return a distinguishing code at end-of-stream. Such an algorithm would have to take care of any leftover data in the finalize function, so I think we should do that here too, so the code can be similar in all cases.

Perhaps we should move all the gzip stuff to a new file bbstreamer_gzip.c.

--
Robert Haas
EDB: http://www.enterprisedb.com
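To show the shape I'm imagining for the content callback, flushing downstream only when the output buffer fills up and leaving any final leftover to the finalize callback, here's a rough, untested sketch (field names follow the patch under review and may differ):

    #include <zlib.h>

    static void
    bbstreamer_gzip_decompressor_content(bbstreamer *streamer,
                                         bbstreamer_member *member,
                                         const char *data, int len,
                                         bbstreamer_archive_context context)
    {
        bbstreamer_gzip_decompressor *mystreamer =
            (bbstreamer_gzip_decompressor *) streamer;
        z_stream   *zs = &mystreamer->zstream;

        zs->next_in = (uint8 *) unconstify(char *, data);
        zs->avail_in = len;

        /* Keep inflating until this chunk of input has been consumed. */
        while (zs->avail_in > 0)
        {
            int         res;

            zs->next_out = (uint8 *)
                mystreamer->base.bbs_buffer.data + mystreamer->bytes_written;
            zs->avail_out =
                mystreamer->base.bbs_buffer.maxlen - mystreamer->bytes_written;

            res = inflate(zs, Z_NO_FLUSH);
            if (res == Z_STREAM_ERROR || res == Z_DATA_ERROR || res == Z_MEM_ERROR)
            {
                pg_log_error("could not decompress data: %s", zs->msg);
                exit(1);
            }

            mystreamer->bytes_written =
                mystreamer->base.bbs_buffer.maxlen - zs->avail_out;

            /* Whenever the output buffer fills up, pass it to the next streamer. */
            if (mystreamer->bytes_written >= mystreamer->base.bbs_buffer.maxlen)
            {
                bbstreamer_content(mystreamer->base.bbs_next, member,
                                   mystreamer->base.bbs_buffer.data,
                                   mystreamer->bytes_written, context);
                mystreamer->bytes_written = 0;
            }

            /* Any data still buffered at end-of-stream is pushed out in finalize. */
            if (res == Z_STREAM_END)
                break;
        }
    }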
On 1/18/22 8:12 PM, Jeevan Ladhe wrote:
> Similar to LZ4 server-side compression, I have also tried to add a ZSTD
> server-side compression in the attached patch.
Thanks Jeevan. While testing, I found one scenario where the server crashes while performing pg_basebackup with server-compression=zstd against a large data set for the second time.

Steps to reproduce
--PG sources (apply v11-0001,v11-0001,v9-0001,v9-0002, configure --with-lz4,--with-zstd, make/install, initdb, start server)
--insert huge data (./pgbench -i -s 2000 postgres)
--restart the server (./pg_ctl -D data restart)
--pg_basebackup (./pg_basebackup -t server:/tmp/yc1 --server-compression=zstd -R -Xnone -n -N -l 'ccc' --no-estimate-size -v)
--insert huge data (./pgbench -i -s 1000 postgres)
--restart the server (./pg_ctl -D data restart)
--run pg_basebackup again (./pg_basebackup -t server:/tmp/yc11 --server-compression=zstd -v -Xnone)

[edb@centos7tushar bin]$ ./pg_basebackup -t server:/tmp/yc11 --server-compression=zstd -v -Xnone
pg_basebackup: initiating base backup, waiting for checkpoint to complete
2022-01-19 21:23:26.508 IST [30219] LOG: checkpoint starting: force wait
2022-01-19 21:23:26.608 IST [30219] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.101 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16369 kB, estimate=16369 kB
pg_basebackup: checkpoint completed
TRAP: FailedAssertion("len > 0 && len <= sink->bbs_buffer_length", File: "../../../src/include/replication/basebackup_sink.h", Line: 208, PID: 30226)
postgres: walsender edb [local] sending backup "pg_basebackup base backup"(ExceptionalCondition+0x7a)[0x94ceca]
postgres: walsender edb [local] sending backup "pg_basebackup base backup"[0x7b9a08]
postgres: walsender edb [local] sending backup "pg_basebackup base backup"[0x7b9be2]
postgres: walsender edb [local] sending backup "pg_basebackup base backup"[0x7b5b30]
postgres: walsender edb [local] sending backup "pg_basebackup base backup"(SendBaseBackup+0x563)[0x7b7053]
postgres: walsender edb [local] sending backup "pg_basebackup base backup"(exec_replication_command+0x961)[0x7c9a41]
postgres: walsender edb [local] sending backup "pg_basebackup base backup"(PostgresMain+0x92f)[0x81ca3f]
postgres: walsender edb [local] sending backup "pg_basebackup base backup"[0x48e430]
postgres: walsender edb [local] sending backup "pg_basebackup base backup"(PostmasterMain+0xfd2)[0x785702]
postgres: walsender edb [local] sending backup "pg_basebackup base backup"(main+0x1c6)[0x48fb96]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f63642c8555]
postgres: walsender edb [local] sending backup "pg_basebackup base backup"[0x48feb5]
pg_basebackup: error: could not read COPY data: server closed the connection unexpectedly
This probably means the server terminated abnormally before or while processing the request.

2022-01-19 21:25:34.485 IST [30205] LOG: server process (PID 30226) was terminated by signal 6: Aborted
2022-01-19 21:25:34.485 IST [30205] DETAIL: Failed process was running: BASE_BACKUP ( LABEL 'pg_basebackup base backup', PROGRESS, MANIFEST 'yes', TABLESPACE_MAP, TARGET 'server', TARGET_DETAIL '/tmp/yc11', COMPRESSION 'zstd')
2022-01-19 21:25:34.485 IST [30205] LOG: terminating any other active server processes
[edb@centos7tushar bin]$ 2022-01-19 21:25:34.489 IST [30205] LOG: all server processes terminated; reinitializing
2022-01-19 21:25:34.536 IST [30228] LOG: database system was interrupted; last known up at 2022-01-19 21:23:26 IST
2022-01-19 21:25:34.669 IST [30228] LOG: database system was not properly shut down; automatic recovery in progress
2022-01-19 21:25:34.671 IST [30228] LOG: redo starts at 9/7000028
2022-01-19 21:25:34.671 IST [30228] LOG: invalid record length at 9/7000148: wanted 24, got 0
2022-01-19 21:25:34.671 IST [30228] LOG: redo done at 9/7000110 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2022-01-19 21:25:34.673 IST [30229] LOG: checkpoint starting: end-of-recovery immediate wait
2022-01-19 21:25:34.713 IST [30229] LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.003 s, sync=0.001 s, total=0.041 s; sync files=2, longest=0.001 s, average=0.001 s; distance=0 kB, estimate=0 kB
2022-01-19 21:25:34.718 IST [30205] LOG: database system is ready to accept connections

Observation - if we change the server-compression method from zstd to lz4 then this does NOT happen.

[edb@centos7tushar bin]$ ./pg_basebackup -t server:/tmp/ycc1 --server-compression=lz4 -v -Xnone
pg_basebackup: initiating base backup, waiting for checkpoint to complete
2022-01-19 21:27:51.642 IST [30229] LOG: checkpoint starting: force wait
2022-01-19 21:27:51.687 IST [30229] LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.046 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16383 kB, estimate=16383 kB
pg_basebackup: checkpoint completed
NOTICE: WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backup
pg_basebackup: base backup completed
[edb@centos7tushar bin]$
--
regards,tushar
EnterpriseDB https://www.enterprisedb.com/
The Enterprise PostgreSQL Company
On Wed, Jan 19, 2022 at 7:16 AM Dipesh Pandit <dipesh.pandit@gmail.com> wrote: > I have done initial testing and > working on updating the test coverage. I spent some time thinking about test coverage for the server-side backup code today and came up with the attached (v12-0003). It does an end-to-end test that exercises server-side backup and server-side compression and then untars the backup and validity-checks it using pg_verifybackup. In addition to being good test coverage for these patches, it also plugs a gap in the test coverage of pg_verifybackup, which currently has no test case that untars a tar-format backup and then verifies the result. I couldn't figure out a way to do that back at the time I was working on pg_verifybackup, because I didn't think we had any existing precedent for using 'tar' from a TAP test. But it was pointed out to me that we do, so I used that as the model for this test. It should be easy to generalize this test case to test lz4 and zstd as well, I think. But I guess we'll still need something different to test what your patch is doing. -- Robert Haas EDB: http://www.enterprisedb.com
> backup code today and came up with the attached (v12-0003). It does an
> end-to-end test that exercises server-side backup and server-side
> compression and then untars the backup and validity-checks it using
> pg_verifybackup. In addition to being good test coverage for these
> patches, it also plugs a gap in the test coverage of pg_verifybackup,
> which currently has no test case that untars a tar-format backup and
> then verifies the result. I couldn't figure out a way to do that back
> at the time I was working on pg_verifybackup, because I didn't think
> we had any existing precedent for using 'tar' from a TAP test. But it
> was pointed out to me that we do, so I used that as the model for this
> test. It should be easy to generalize this test case to test lz4 and
> zstd as well, I think. But I guess we'll still need something
> different to test what your patch is doing.
of the patches 0001, 0002 and 0003.
On Thu, Jan 20, 2022 at 8:00 AM Dipesh Pandit <dipesh.pandit@gmail.com> wrote: > Thanks for the feedback, I have incorporated the suggestions and > updated a new patch v2. Cool. I'll do a detailed review later, but I think this is going in a good direction. > I tried to add the test coverage for server side gzip compression with > plain format backup using pg_verifybackup. I have modified the test > to use a flag specific to plain format. If this flag is set then it takes a > plain format backup (with server compression enabled) and verifies > this using pg_verifybackup. I have updated (v2-0002) for the test > coverage. Interesting approach. This unfortunately has the effect of making that test case file look a bit incoherent -- the comment at the top of the file isn't really accurate any more, for example, and the plain_format flag does more than just cause us to use -Fp; it also causes us NOT to use --target server:X. However, that might be something we can figure out a way to clean up. Alternatively, we could have a new test case file that is structured like 002_algorithm.pl but looping over compression methods rather than checksum algorithms, and testing each one with --server-compress and -Fp. It might be easier to make that look nice (but I'm not 100% sure). -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Jan 19, 2022 at 4:26 PM Robert Haas <robertmhaas@gmail.com> wrote: > I spent some time thinking about test coverage for the server-side > backup code today and came up with the attached (v12-0003). I committed the base backup target patch yesterday, and today I updated the remaining code in light of Michael Paquier's commit 5c649fe153367cdab278738ee4aebbfd158e0546. Here is the resulting patch. Michael, I am proposing to that we remove this message as part of this commit: - pg_log_info("no value specified for compression level, switching to default"); I think most people won't want to specify a compression level, so emitting a message when they don't seems too verbose. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Jan 20, 2022 at 11:10 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jan 20, 2022 at 8:00 AM Dipesh Pandit <dipesh.pandit@gmail.com> wrote:
> > Thanks for the feedback, I have incorporated the suggestions and
> > updated a new patch v2.
>
> Cool. I'll do a detailed review later, but I think this is going in a
> good direction.

Here is a more detailed review.

+ if (inflateInit2(zs, 15 + 16) != Z_OK)
+ {
+ pg_log_error("could not initialize compression library");
+ exit(1);
+
+ }

Extra blank line.

+ /* At present, we only know how to parse tar and gzip archives. */

gzip -> tar.gz. You can gzip something that is not a tar.

+ * Extract the gzip compressed archive using a gzip extractor and then
+ * forward it to next streamer.

This comment is not good. First, we're not necessarily doing it. Second, it just describes what the code does, not why it does it. Maybe something like "If the user requested both that the server compress the backup and also that we extract the backup, we need to decompress it."

+ if (server_compression != NULL)
+ {
+ if (strcmp(server_compression, "gzip") == 0)
+ server_compression_type = BACKUP_COMPRESSION_GZIP;
+ else if (strlen(server_compression) == 5 &&
+ strncmp(server_compression, "gzip", 4) == 0 &&
+ server_compression[4] >= '1' && server_compression[4] <= '9')
+ {
+ server_compression_type = BACKUP_COMPRESSION_GZIP;
+ server_compression_level = server_compression[4] - '0';
+ }
+ }
+ else
+ server_compression_type = BACKUP_COMPRESSION_NONE;

I think this is not required any more. I think probably some other things need to be adjusted as well, based on Michael's changes and the updates in my patch to match.

--
Robert Haas
EDB: http://www.enterprisedb.com
> test case file look a bit incoherent -- the comment at the top of the
> file isn't really accurate any more, for example, and the plain_format
> flag does more than just cause us to use -Fp; it also causes us NOT to
> use --target server:X. However, that might be something we can figure
> out a way to clean up. Alternatively, we could have a new test case
> file that is structured like 002_algorithm.pl but looping over
> compression methods rather than checksum algorithms, and testing each
> one with --server-compress and -Fp. It might be easier to make that
> look nice (but I'm not 100% sure).
> updated the remaining code in light of Michael Paquier's commit
> 5c649fe153367cdab278738ee4aebbfd158e0546. Here is the resulting patch.
On Mon, Jan 24, 2022 at 9:30 AM Dipesh Pandit <dipesh.pandit@gmail.com> wrote: > v13 patch does not apply on the latest head, it requires a rebase. I have applied > it on commit dc43fc9b3aa3e0fa9c84faddad6d301813580f88 to validate gzip > decompression patches. It only needed trivial rebasing; I have committed it after doing that. -- Robert Haas EDB: http://www.enterprisedb.com
Hi,
Thank you for committing a great feature. I have tested the committed features. The attached small patch fixes the output of the --help message. In the previous commit, only gzip and none were output, but in the attached patch, client-gzip and server-gzip are added.
Regards,
Noriyoshi Shinoda

-----Original Message-----
From: Robert Haas <robertmhaas@gmail.com>
Sent: Saturday, January 22, 2022 3:33 AM
To: Dipesh Pandit <dipesh.pandit@gmail.com>; Michael Paquier <michael@paquier.xyz>
Cc: Jeevan Ladhe <jeevan.ladhe@enterprisedb.com>; tushar <tushar.ahuja@enterprisedb.com>; Dmitry Dolgov <9erthalion6@gmail.com>; Mark Dilger <mark.dilger@enterprisedb.com>; pgsql-hackers@postgresql.org
Subject: Re: refactoring basebackup.c

On Wed, Jan 19, 2022 at 4:26 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I spent some time thinking about test coverage for the server-side
> backup code today and came up with the attached (v12-0003).

I committed the base backup target patch yesterday, and today I updated the remaining code in light of Michael Paquier's commit 5c649fe153367cdab278738ee4aebbfd158e0546. Here is the resulting patch.

Michael, I am proposing to that we remove this message as part of this commit:

- pg_log_info("no value specified for compression level, switching to default");

I think most people won't want to specify a compression level, so emitting a message when they don't seems too verbose.

--
Robert Haas
EDB: http://www.enterprisedb.com
"Shinoda, Noriyoshi (PN Japan FSIP)" <noriyoshi.shinoda@hpe.com> writes: > Hi, > Thank you for committing a great feature. I have tested the committed features. > The attached small patch fixes the output of the --help message. In the > previous commit, only gzip and none were output, but in the attached > patch, client-gzip and server-gzip are added. I think it would be better to write that as `[{client,server}-]gzip`, especially as we add more compression agorithms, where it would presumably become `[{client,server}-]METHOD` (assuming all methods are supported on both the client and server side). I also noticed that in the docs, the `client` and `server` are marked up as replaceable parameters, when they are actually literals, plus the hyphen is misplaced. The `--checkpoint` option also has the `fast` and `spread` literals marked up as parameters. All of these are fixed in the attached patch. - ilmari From 8e3d191917984a6d17f2c72212d90c96467463b0 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= <ilmari@ilmari.org> Date: Tue, 25 Jan 2022 13:04:05 +0000 Subject: [PATCH] pg_basebackup documentation and help fixes Don't mark up literals as replaceable parameters and indicate alternatives correctly with {...|...}. --- doc/src/sgml/ref/pg_basebackup.sgml | 6 +++--- src/bin/pg_basebackup/pg_basebackup.c | 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml index 1d0df346b9..98c89751b3 100644 --- a/doc/src/sgml/ref/pg_basebackup.sgml +++ b/doc/src/sgml/ref/pg_basebackup.sgml @@ -400,7 +400,7 @@ <term><option>-Z <replaceable class="parameter">level</replaceable></option></term> <term><option>-Z <replaceable class="parameter">method</replaceable></option>[:<replaceable>level</replaceable>]</term> <term><option>--compress=<replaceable class="parameter">level</replaceable></option></term> - <term><option>--compress=[[{<replaceable class="parameter">client|server</replaceable>-}]<replaceable class="parameter">method</replaceable></option>[:<replaceable>level</replaceable>]</term> + <term><option>--compress=[[{client|server}-]<replaceable class="parameter">method</replaceable></option>[:<replaceable>level</replaceable>]</term> <listitem> <para> Requests compression of the backup. If <literal>client</literal> or @@ -441,8 +441,8 @@ <variablelist> <varlistentry> - <term><option>-c <replaceable class="parameter">fast|spread</replaceable></option></term> - <term><option>--checkpoint=<replaceable class="parameter">fast|spread</replaceable></option></term> + <term><option>-c {fast|spread}</option></term> + <term><option>--checkpoint={fast|spread}</option></term> <listitem> <para> Sets checkpoint mode to fast (immediate) or spread (the default) diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c index 72c27c78d0..46f6f53e9b 100644 --- a/src/bin/pg_basebackup/pg_basebackup.c +++ b/src/bin/pg_basebackup/pg_basebackup.c @@ -391,7 +391,7 @@ usage(void) printf(_(" -X, --wal-method=none|fetch|stream\n" " include required WAL files with specified method\n")); printf(_(" -z, --gzip compress tar output\n")); - printf(_(" -Z, --compress={gzip,none}[:LEVEL] or [LEVEL]\n" + printf(_(" -Z, --compress={[{client,server}-]gzip,none}[:LEVEL] or [LEVEL]\n" " compress tar output with given compression method or level\n")); printf(_("\nGeneral options:\n")); printf(_(" -c, --checkpoint=fast|spread\n" -- 2.30.2
Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> writes:
> "Shinoda, Noriyoshi (PN Japan FSIP)" <noriyoshi.shinoda@hpe.com> writes:
>
>> Hi,
>> Thank you for committing a great feature. I have tested the committed features.
>> The attached small patch fixes the output of the --help message. In the
>> previous commit, only gzip and none were output, but in the attached
>> patch, client-gzip and server-gzip are added.
>
> I think it would be better to write that as `[{client,server}-]gzip`,
> especially as we add more compression algorithms, where it would
> presumably become `[{client,server}-]METHOD` (assuming all methods are
> supported on both the client and server side).
>
> I also noticed that in the docs, the `client` and `server` are marked up
> as replaceable parameters, when they are actually literals, plus the
> hyphen is misplaced. The `--checkpoint` option also has the `fast` and
> `spread` literals marked up as parameters.
>
> All of these are fixed in the attached patch.

I just noticed there was a superfluous [ in the SGML documentation, and that the short form was missing the [{client|server}-] part. Updated patch attached.

- ilmari

From 2164f1a9fc97a5f88f57c7cc9cdafa67398dcc0e Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Dagfinn=20Ilmari=20Manns=C3=A5ker?= <ilmari@ilmari.org>
Date: Tue, 25 Jan 2022 13:04:05 +0000
Subject: [PATCH v2] pg_basebackup documentation and help fixes

Don't mark up literals as replaceable parameters and indicate
alternatives correctly with {...|...}, and add missing
[{client,server}-] to the -Z form.
---
 doc/src/sgml/ref/pg_basebackup.sgml | 8 ++++----
 src/bin/pg_basebackup/pg_basebackup.c | 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 1d0df346b9..a5e03d2c66 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -398,9 +398,9 @@
 <varlistentry>
 <term><option>-Z <replaceable class="parameter">level</replaceable></option></term>
- <term><option>-Z <replaceable class="parameter">method</replaceable></option>[:<replaceable>level</replaceable>]</term>
+ <term><option>-Z [{client|server}-]<replaceable class="parameter">method</replaceable></option>[:<replaceable>level</replaceable>]</term>
 <term><option>--compress=<replaceable class="parameter">level</replaceable></option></term>
- <term><option>--compress=[[{<replaceable class="parameter">client|server</replaceable>-}]<replaceable class="parameter">method</replaceable></option>[:<replaceable>level</replaceable>]</term>
+ <term><option>--compress=[{client|server}-]<replaceable class="parameter">method</replaceable></option>[:<replaceable>level</replaceable>]</term>
 <listitem>
 <para>
 Requests compression of the backup. If <literal>client</literal> or
@@ -441,8 +441,8 @@
 <variablelist>
 <varlistentry>
- <term><option>-c <replaceable class="parameter">fast|spread</replaceable></option></term>
- <term><option>--checkpoint=<replaceable class="parameter">fast|spread</replaceable></option></term>
+ <term><option>-c {fast|spread}</option></term>
+ <term><option>--checkpoint={fast|spread}</option></term>
 <listitem>
 <para>
 Sets checkpoint mode to fast (immediate) or spread (the default)
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 72c27c78d0..46f6f53e9b 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -391,7 +391,7 @@ usage(void)
 printf(_(" -X, --wal-method=none|fetch|stream\n"
 " include required WAL files with specified method\n"));
 printf(_(" -z, --gzip compress tar output\n"));
- printf(_(" -Z, --compress={gzip,none}[:LEVEL] or [LEVEL]\n"
+ printf(_(" -Z, --compress={[{client,server}-]gzip,none}[:LEVEL] or [LEVEL]\n"
 " compress tar output with given compression method or level\n"));
 printf(_("\nGeneral options:\n"));
 printf(_(" -c, --checkpoint=fast|spread\n"
--
2.30.2
On 1/22/22 12:03 AM, Robert Haas wrote:
> I committed the base backup target patch yesterday, and today I
> updated the remaining code in light of Michael Paquier's commit
> 5c649fe153367cdab278738ee4aebbfd158e0546. Here is the resulting patch.
Thanks Robert, I tested against the latest PG head and found a few issues -

A) Getting a syntax error if -z is used along with -t

[edb@centos7tushar bin]$ ./pg_basebackup -t server:/tmp/data902 -z -Xfetch
pg_basebackup: error: could not initiate base backup: ERROR: syntax error

OR

[edb@centos7tushar bin]$ ./pg_basebackup -t server:/tmp/t2 --compress=server-gzip:9 -Xfetch -v -z
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: error: could not initiate base backup: ERROR: syntax error

B) No information about "client-gzip" or "server-gzip" is shown under the "--compress" option of ./pg_basebackup --help.

C) The -R option is silently ignored

[edb@centos7tushar bin]$ ./pg_basebackup -Z 4 -v -t server:/tmp/pp -Xfetch -R
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/30000028 on timeline 1
pg_basebackup: write-ahead log end point: 0/30000100
pg_basebackup: base backup completed
[edb@centos7tushar bin]$

Go to the /tmp/pp folder and extract it - there is no "standby.signal" file, and if we start a cluster against this data directory it will not be in standby mode. If this is not supported then I think we should throw an error.
--
regards,tushar
EnterpriseDB https://www.enterprisedb.com/
The Enterprise PostgreSQL Company
On Tue, Jan 25, 2022 at 8:42 AM Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> wrote: > I just noticed there was a superfluous [ in the SGM documentation, and > that the short form was missing the [{client|server}-] part. Updated > patch attaced. Committed, thanks. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Jan 25, 2022 at 03:54:52AM +0000, Shinoda, Noriyoshi (PN Japan FSIP) wrote: > Michael, I am proposing to that we remove this message as part of > this commit: > > - pg_log_info("no value specified for compression > level, switching to default"); > > I think most people won't want to specify a compression level, so > emitting a message when they don't seems too verbose. (Just noticed this message as I am not in CC.) Removing this message is fine by me, thanks! -- Michael
On Tue, Jan 25, 2022 at 09:52:12PM +0530, tushar wrote: > C) -R option is silently ignoring > > go to /tmp/pp folder and extract it - there is no "standby.signal" file and > if we start cluster against this data directory,it will not be in slave > mode. Yeah, I don't think it's good to silently ignore the option, and we should not generate the file on the server-side. Rather than erroring in this case, you'd better add the file to the existing compressed file of the base data folder on the client-side. This makes me wonder whether we should begin tracking any open items for v15.. We don't want to lose track of any issue with features committed already in the tree. -- Michael
On Tue, Jan 25, 2022 at 8:23 PM Michael Paquier <michael@paquier.xyz> wrote: > On Tue, Jan 25, 2022 at 03:54:52AM +0000, Shinoda, Noriyoshi (PN Japan FSIP) wrote: > > Michael, I am proposing to that we remove this message as part of > > this commit: > > > > - pg_log_info("no value specified for compression > > level, switching to default"); > > > > I think most people won't want to specify a compression level, so > > emitting a message when they don't seems too verbose. > > (Just noticed this message as I am not in CC.) > Removing this message is fine by me, thanks! Oh, I thought I'd CC'd you. I know I meant to do so. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Jan 25, 2022 at 11:22 AM tushar <tushar.ahuja@enterprisedb.com> wrote: > A)Getting syntax error if -z is used along with -t > > [edb@centos7tushar bin]$ ./pg_basebackup -t server:/tmp/data902 -z -Xfetch > pg_basebackup: error: could not initiate base backup: ERROR: syntax error Oops. The attached patch should fix this. > B)No information of "client-gzip" or "server-gzip" added under > "--compress" option/method of ./pg_basebackup --help. Already fixed by e1f860f13459e186479319aa9f65ef184277805f. > C) -R option is silently ignoring The attached patch should fix this, too. Thanks for finding these issues. -- Robert Haas EDB: http://www.enterprisedb.com
On 1/27/22 2:15 AM, Robert Haas wrote: > The attached patch should fix this, too. Thanks, the issues seem to be fixed now. -- regards,tushar EnterpriseDB https://www.enterprisedb.com/ The Enterprise PostgreSQL Company
On Thu, Jan 27, 2022 at 7:15 AM tushar <tushar.ahuja@enterprisedb.com> wrote: > On 1/27/22 2:15 AM, Robert Haas wrote: > > The attached patch should fix this, too. > Thanks, the issues seem to be fixed now. Cool. I committed that patch. -- Robert Haas EDB: http://www.enterprisedb.com
On 1/27/22 10:17 PM, Robert Haas wrote:
> Cool. I committed that patch.
Thanks. Please refer to this scenario, where the level is set to 0 for server-gzip but the archives are still compressed:

[edb@centos7tushar bin]$ ./pg_basebackup -t server:/tmp/11 --gzip --compress=0 -Xnone
NOTICE: all required WAL segments have been archived
[edb@centos7tushar bin]$ ls /tmp/11
16384.tar backup_manifest base.tar

[edb@centos7tushar bin]$ ./pg_basebackup -t server:/tmp/10 --gzip --compress=server-gzip:0 -Xnone
NOTICE: all required WAL segments have been archived
[edb@centos7tushar bin]$ ls /tmp/10
16384.tar.gz backup_manifest base.tar.gz

0 means no compression, so the output should not be compressed if we specify server-gzip:0; shouldn't both of these scenarios produce the same result?
--
regards,tushar
EnterpriseDB https://www.enterprisedb.com/
The Enterprise PostgreSQL Company
On Thu, Jan 27, 2022 at 12:08 PM tushar <tushar.ahuja@enterprisedb.com> wrote: > On 1/27/22 10:17 PM, Robert Haas wrote: > > Cool. I committed that patch. > Thanks , Please refer to this scenario where the label is set to 0 for > server-gzip but the directory is still compressed > > [edb@centos7tushar bin]$ ./pg_basebackup -t server:/tmp/11 --gzip > --compress=0 -Xnone > NOTICE: all required WAL segments have been archived > [edb@centos7tushar bin]$ ls /tmp/11 > 16384.tar backup_manifest base.tar > > > [edb@centos7tushar bin]$ ./pg_basebackup -t server:/tmp/10 --gzip > --compress=server-gzip:0 -Xnone > NOTICE: all required WAL segments have been archived > [edb@centos7tushar bin]$ ls /tmp/10 > 16384.tar.gz backup_manifest base.tar.gz > > 0 is for no compression so the directory should not be compressed if we > mention server-gzip:0 and both these > above scenarios should match? Well what's weird here is that you are using both --gzip and also --compress. Those both control the same behavior, so it's a surprising idea to specify both. But I guess if someone does, we should make the second one fully override the first one. Here's a patch to try to do that. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Jan 27, 2022 at 2:37 AM Dipesh Pandit <dipesh.pandit@gmail.com> wrote: > I have updated the patches to support server compression (gzip) for > plain format backup. Please find attached v4 patches. I made a pass over these patches today and made a bunch of minor corrections. New version attached. The two biggest things I changed are (1) s/gzip_extractor/gzip_compressor/, because I feel like you extract an archive like a tarfile, but that is not what is happening here, this is not an archive and (2) I took a few bits of out of the test case that didn't seem to be necessary. There wasn't any reason that I could see why testing for PG_VERSION needed to be skipped when the compression method is 'none', so my first thought was to just take out the 'if' statement around that, but then after more thought that test and the one for pg_verifybackup are certainly going to fail if those files are not present, so why have an extra test? It might make sense if we were only conditionally able to run pg_verifybackup and wanted to have some test coverage even when we can't, but that's not the case here, so I see no point. I studied this a bit to see whether I needed to make any adjustments along the lines of 4f0bcc735038e96404fae59aa16ef9beaf6bb0aa in order for this to work on msys. I think I don't, because 002_algorithm.pl and 003_corruption.pl both pass $backup_path, not $real_backup_path, to command_ok -- and I think something inside there does the translation, which is weird, but we might as well be consistent. 008_untar.pl and 4f0bcc735038e96404fae59aa16ef9beaf6bb0aa needed to do something different because --target server:X confused the msys magic, but I think that shouldn't be an issue for this patch. However, I might be wrong. Barring objections or problems, I plan to commit this version tomorrow. I'd do it today, but I have plans for tonight that are incompatible with discovering that the build farm hates this .... -- Robert Haas EDB: http://www.enterprisedb.com
> corrections. New version attached. The two biggest things I changed
> are (1) s/gzip_extractor/gzip_compressor/, because I feel like you
> extract an archive like a tarfile, but that is not what is happening
> here, this is not an archive and (2) I took a few bits of out of the
> test case that didn't seem to be necessary. There wasn't any reason
> that I could see why testing for PG_VERSION needed to be skipped when
> the compression method is 'none', so my first thought was to just take
> out the 'if' statement around that, but then after more thought that
> test and the one for pg_verifybackup are certainly going to fail if
> those files are not present, so why have an extra test? It might make
> sense if we were only conditionally able to run pg_verifybackup and
> wanted to have some test coverage even when we can't, but that's not
> the case here, so I see no point.
+ /*
+ * If the user has requested a server compressed archive along with archive
+ * extraction at client then we need to decompress it.
+ */
+ if (format == 'p' && compressmethod == COMPRESSION_GZIP &&
+ compressloc == COMPRESS_LOCATION_SERVER)
+ streamer = bbstreamer_gzip_decompressor_new(streamer);
+#endif
> Well what's weird here is that you are using both --gzip and also
> --compress. Those both control the same behavior, so it's a surprising
> idea to specify both. But I guess if someone does, we should make the
> second one fully override the first one. Here's a patch to try to do that.
Right, the current behavior was -
[edb@centos7tushar bin]$ ./pg_basebackup -t server:/tmp/y101 --gzip -Z none -Xnone
pg_basebackup: error: cannot use compression level with method none
Try "pg_basebackup --help" for more information.
and even this did not match the PG v14 behavior,
e.g
./pg_basebackup -Ft -z -Z none -D /tmp/test1 ( working in PG v14 but throwing above error on PG HEAD)
so somewhere we were breaking backward compatibility.
Now, with your patch, this seems to be working fine:
[edb@centos7tushar bin]$ ./pg_basebackup -t server:/tmp/y101 --gzip -Z none -Xnone
NOTICE: WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backup
[edb@centos7tushar bin]$ ls /tmp/y101
backup_manifest base.tar
OR
[edb@centos7tushar bin]$ ./pg_basebackup -t server:/tmp/y0p -Z none -Xfetch -z
[edb@centos7tushar bin]$ ls /tmp/y0p
backup_manifest base.tar.gz
but what about server-gzip:0? should it allow compressing the directory?
[edb@centos7tushar bin]$ ./pg_basebackup -t server:/tmp/1 --compress=server-gzip:0 -Xfetch
[edb@centos7tushar bin]$ ls /tmp/1
backup_manifest base.tar.gz
--
regards,tushar
EnterpriseDB https://www.enterprisedb.com/
The Enterprise PostgreSQL Company
On Fri, Jan 28, 2022 at 3:54 AM Dipesh Pandit <dipesh.pandit@gmail.com> wrote:
> Thanks. This makes sense.
>
> +#ifdef HAVE_LIBZ
> + /*
> + * If the user has requested a server compressed archive along with archive
> + * extraction at client then we need to decompress it.
> + */
> + if (format == 'p' && compressmethod == COMPRESSION_GZIP &&
> + compressloc == COMPRESS_LOCATION_SERVER)
> + streamer = bbstreamer_gzip_decompressor_new(streamer);
> +#endif
>
> I think it is not required to have HAVE_LIBZ check in pg_basebackup.c
> while creating a new gzip writer/decompressor. This check is already
> in place in bbstreamer_gzip_writer_new() and bbstreamer_gzip_decompressor_new()
> and it throws an error in case the build does not have required library
> support. I have removed this check from pg_basebackup.c and updated
> a delta patch. The patch can be applied on v5 patch.

Right, makes sense. Committed with that change, plus I realized the skip count in the test case file was wrong after the changes I made yesterday, so I fixed that as well.

--
Robert Haas
EDB: http://www.enterprisedb.com
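For anyone wondering why the check in pg_basebackup.c is redundant: the constructor itself is the natural place that knows about zlib, roughly like this sketch (allocation and ops-table wiring are elided, and the error wording may differ from the actual code):

    bbstreamer *
    bbstreamer_gzip_decompressor_new(bbstreamer *next)
    {
    #ifdef HAVE_LIBZ
        bbstreamer_gzip_decompressor *streamer;

        streamer = palloc0(sizeof(bbstreamer_gzip_decompressor));
        streamer->base.bbs_next = next;
        initStringInfo(&streamer->base.bbs_buffer);

        /* ... set up the ops table and call inflateInit2(&streamer->zstream, 15 + 16) ... */

        return &streamer->base;
    #else
        pg_log_error("this build does not support gzip compression");
        exit(1);
        return NULL;            /* keep compiler quiet */
    #endif
    }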
On Fri, Jan 28, 2022 at 12:48 PM Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> wrote:
> I have attached the latest rebased version of the LZ4 server-side compression
> patch on the recent commits. This patch also introduces the compression level
> and adds a tap test.

In view of this morning's commit of d45099425eb19e420433c9d81d354fe585f4dbd6 I think the threshold for committing this patch has gone up. We need to make it support decompression with LZ4 on the client side, as we now have for gzip.

Other comments:

- Even if we were going to support LZ4 only on the server side, surely it's not right to refuse --compress lz4 and --compress client-lz4 at the parsing stage. I don't even think the message you added to main() is reachable.

- In the new test case you set decompress_flags but according to the documentation I have here, -m is for multiple files (and so should not be needed here) and -d is for decompression (which is what we want here). So I'm confused why this is like this.

Other than that this seems like it's in pretty good shape.

> Also, while adding the lz4 case in the pg_verifybackup/t/008_untar.pl, I found
> an unused variable {have_zlib}. I have attached a cleanup patch for that as well.

This part seems clearly correct, so I have committed it.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Jan 28, 2022 at 12:48 PM Jeevan Ladhe
<jeevan.ladhe@enterprisedb.com> wrote:
> I have attached the latest rebased version of the LZ4 server-side compression
> patch on the recent commits. This patch also introduces the compression level
> and adds a tap test.
In view of this morning's commit of
d45099425eb19e420433c9d81d354fe585f4dbd6 I think the threshold for
committing this patch has gone up. We need to make it support
decompression with LZ4 on the client side, as we now have for gzip.
- In the new test case you set decompress_flags but according to the
documentation I have here, -m is for multiple files (and so should not
be needed here) and -d is for decompression (which is what we want
here). So I'm confused why this is like this.
output in scripts. -c ensures that output will be stdout. Conversely,
providing a destination name, or using -m ensures that the output will
be either the specified name, or filename.lz4 respectively."
This part seems clearly correct, so I have committed it.
- Even if we were going to support LZ4 only on the server side, surely
it's not right to refuse --compress lz4 and --compress client-lz4 at
the parsing stage. I don't even think the message you added to main()
is reachable.
- In the new test case you set decompress_flags but according to the
documentation I have here, -m is for multiple files (and so should not
be needed here) and -d is for decompression (which is what we want
here). So I'm confused why this is like this.
On Mon, Jan 31, 2022 at 6:11 AM Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> wrote:
> I had an offline discussion with Dipesh, and he will be working on the
> lz4 client side decompression part.

OK. I guess we should also be thinking about client-side LZ4 compression. It's probably best to focus on that before worrying about ZSTD, even though ZSTD would be really cool to have.

>> - In the new test case you set decompress_flags but according to the
>> documentation I have here, -m is for multiple files (and so should not
>> be needed here) and -d is for decompression (which is what we want
>> here). So I'm confused why this is like this.
>
> As explained earlier in the tap test the 'lz4 -d base.tar.lz4' command was
> throwing the decompression to stdout. Now, I have removed the '-m',
> added '-d' for decompression, and also added the target file explicitly in
> the command.

I don't see the behavior you describe here. For me:

[rhaas ~]$ lz4 q.lz4
Decoding file q
q.lz4 : decoded 3785 bytes
[rhaas ~]$ rm q
[rhaas ~]$ lz4 -m q.lz4
[rhaas ~]$ ls q
q
[rhaas ~]$ rm q
[rhaas ~]$ lz4 -d q.lz4
Decoding file q
q.lz4 : decoded 3785 bytes
[rhaas ~]$ rm q
[rhaas ~]$ lz4 -d -m q.lz4
[rhaas ~]$ ls q
q

In other words, on my system, the file gets decompressed with or without -d, and with or without -m. The only difference I see is that using -m makes it happen silently, without printing anything on the terminal. Anyway, I wasn't saying that using -m was necessarily wrong, just that I didn't understand why you had it like that. Now that I'm more informed, I recommend that we use -d -m, the former to be explicit about wanting to decompress and the latter because it either makes it less noisy (on my system) or makes it work at all (on yours). It's surprising that the command behavior would be different like that on different systems, but it is what it is. I think any set of flags we put here is better than adding more logic in perl, as it keeps things simpler.

--
Robert Haas
EDB: http://www.enterprisedb.com
I think you are right. I have removed the message and introduced the Assert() back again.
On Tue, Jan 18, 2022 at 1:55 PM Robert Haas <robertmhaas@gmail.com> wrote: > 0001 adds "server" and "blackhole" as backup targets. It now has some > tests. This might be more or less ready to ship, unless somebody else > sees a problem, or I find one. I played around with this a bit and it seems quite easy to extend this further. So please find attached a couple more patches to generalize this mechanism. 0001 adds an extensibility framework for backup targets. The idea is that an extension loaded via shared_preload_libraries can call BaseBackupAddTarget() to define a new base backup target, which the user can then access via pg_basebackup --target TARGET_NAME, or if they want to pass a detail string, pg_basebackup --target TARGET_NAME:DETAIL. There might be slightly better ways of hooking this into the system. I'm not unhappy with this approach, but there might be a better idea out there. 0002 adds an example contrib module called basebackup_to_shell. The system administrator can set basebackup_to_shell.command='SOMETHING'. A backup directed to the 'shell' target will cause the server to execute the configured command once per generated archive, and once for the backup_manifest, if any. When executing the command, %f gets replaced with the archive filename (e.g. base.tar) and %d gets replaced with the detail. The actual contents of the file are passed to the command's standard input, and it can then do whatever it likes with that data. Clearly, this is not state of the art; for instance, if what you really want is to upload the backup files someplace via HTTP, using this to run 'curl' is probably not so good of an idea as using an extension module that links with libcurl. That would likely lead to better error checking, better performance, nicer configuration, and just generally fewer things that can go wrong. On the other hand, writing an integration in C is kind of tricky, and this thing is quite easy to use -- and it does work. There are a couple of things to be concerned about with 0002 from a security perspective. First, in a backend environment, we have a function to spawn a subprocess via popen(), namely OpenPipeStream(), but there is no function to spawn a subprocess with execve() and end up with a socket connected to its standard input. And that means that whatever command the administrator configures is being interpreted by the shell, which is a potential problem given that we're interpolating the target detail string supplied by the user, who must have at least replication privileges but need not be the superuser. I chose to handle this by allowing the target detail to contain only alphanumeric characters. Refinement is likely possible, but whether the effort is worthwhile seems questionable. Second, what if the superuser wants to allow the use of this module to only some of the users who have replication privileges? That seems a bit unlikely but it's possible, so I added a GUC basebackup_to_shell.required_role. If set, the functionality is only usable by members of the named role. If unset, anyone with replication privilege can use it. I guess someone could criticize this as defaulting to the least secure setting, but considering that you have to have replication privileges to use this at all, I don't find that argument much to get excited about. I have to say that I'm incredibly happy with how easy these patches were to write. I think this is going to make adding new base backup targets as accessible as we can realistically hope to make it. 
There is some boilerplate code, as an examination of the patches will reveal, but it's not a lot, and at least IMHO it's pretty straightforward. Granted, coding up a new base backup target is something only experienced C hackers are likely to do, but the fact that I was able to throw this together so quickly suggests to me that I've got the design basically right, and that anyone who does want to plug into the new mechanism shouldn't have too much trouble doing so. Thoughts? -- Robert Haas EDB: http://www.enterprisedb.com
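To illustrate the alphanumeric-only restriction on the target detail, the check can be as simple as the sketch below (the function name and error wording are invented for the example; the module may phrase this differently):

    #include <ctype.h>

    /* Runs in the walsender backend, so backend error reporting is available. */
    static void
    shell_check_detail_is_safe(const char *detail)
    {
        const char *p;

        for (p = detail; *p != '\0'; p++)
        {
            if (!isalnum((unsigned char) *p))
                ereport(ERROR,
                        (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                         errmsg("target detail must contain only alphanumeric characters")));
        }
    }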
With a quick look at the patch, I have the following observations:
----------------------------------------------------------
on client side:
/* Align the output buffer length. */
compressed_bound += compressed_bound + BLCKSZ - (compressed_bound %
BLCKSZ);
----------------------------------------------------------
not changed. I think we can simply change the len to avail_in in the
argument list.
----------------------------------------------------------
+ * Update the offset and capacity of output buffer based on based on number
+ * of bytes written to output buffer.
I think it is a thinko:
+ * Update the offset and capacity of output buffer based on number of
+ * bytes written to output buffer.
----------------------------------------------------------
+ if ((mystreamer->base.bbs_buffer.maxlen - mystreamer->bytes_written) <=
+ footer_bound)
----------------------------------------------------------
I think similar to bbstreamer_lz4_compressor_content() in
bbstreamer_lz4_decompressor_content() we can change len to avail_in.
Hi,

> On Mon, Jan 31, 2022 at 4:41 PM Jeevan Ladhe <jeevan.ladhe@enterprisedb.com> wrote:
> Hi Robert,
> I had an offline discussion with Dipesh, and he will be working on the
> lz4 client side decompression part.

Please find the attached patch to support client side compression and
decompression using lz4.

Added a new lz4 bbstreamer to compress the archive chunks at the client if
the user has specified the --compress=client-lz4:[LEVEL] option in
pg_basebackup. The new streamer accepts archive chunks, compresses them,
and forwards them to the plain-writer.

Similarly, if a user has specified a server-compressed lz4 archive with a
plain format (-F p) backup, then the compressed archive chunks need to be
decompressed before being forwarded to the tar extractor. Added a new
bbstreamer to decompress the compressed archive and forward it to the tar
extractor.

Note: This patch can be applied on Jeevan Ladhe's v12 patch for lz4
compression.

Thanks,
Dipesh
Hi,

Thanks for the feedback, I have incorporated the suggestions and updated a
new patch. PFA v2 patch.

> I think similar to bbstreamer_lz4_compressor_content() in
> bbstreamer_lz4_decompressor_content() we can change len to avail_in.

In bbstreamer_lz4_decompressor_content(), we are modifying avail_in based
on the number of bytes decompressed in each iteration. I think we cannot
replace it with "len" here.

Jeevan, Your v12 patch does not apply on HEAD, it requires a rebase. I
have applied it on commit 400fc6b6487ddf16aa82c9d76e5cfbe64d94f660 to
validate my v2 patch.

Thanks,
Dipesh
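[Editor's note] To illustrate the point about avail_in, here is a rough standalone sketch of that consume loop using liblz4's frame API directly (not the patch's bbstreamer code): LZ4F_decompress() reports how many input bytes it actually consumed on each call, and the caller advances by that amount rather than by the original chunk length.

#include <lz4frame.h>

static void
decompress_chunk(LZ4F_dctx *dctx, const char *in, size_t in_len,
				 char *out, size_t out_cap)
{
	while (in_len > 0)
	{
		size_t		avail_in = in_len;
		size_t		avail_out = out_cap;
		size_t		ret;

		ret = LZ4F_decompress(dctx, out, &avail_out, in, &avail_in, NULL);
		if (LZ4F_isError(ret))
			return;				/* report LZ4F_getErrorName(ret) and bail */

		/* hand off avail_out bytes of decompressed data here */

		in += avail_in;			/* bytes consumed in this iteration */
		in_len -= avail_in;
	}
}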
On Fri, Feb 11, 2022 at 5:58 AM Jeevan Ladhe <jeevanladhe.os@gmail.com> wrote:
> > Jeevan, Your v12 patch does not apply on HEAD, it requires a
> > rebase.
>
> Sure, please find the rebased patch attached.

It's Friday today, but I'm feeling brave, and it's still morning here, so
... committed.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Feb 11, 2022 at 7:20 AM Dipesh Pandit <dipesh.pandit@gmail.com> wrote:
> > Sure, please find the rebased patch attached.
>
> Thanks, I have validated v2 patch on top of rebased patch.
I'm still feeling brave, so I committed this too after fixing a few
things. In the process I noticed that we don't have support for LZ4
compression of streamed WAL (cf. CreateWalTarMethod). It would be good
to fix that. I'm not quite sure whether
http://postgr.es/m/pm1bMV6zZh9_4tUgCjSVMLxDX4cnBqCDGTmdGlvBLHPNyXbN18x_k00eyjkCCJGEajWgya2tQLUDpvb2iIwlD22IcUIrIt9WnMtssNh-F9k=@pm.me
is basically what we need or whether something else is required.
--
Robert Haas
EDB: http://www.enterprisedb.com
On Fri, Feb 11, 2022 at 10:29 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> FYI: there's a couple typos in the last 2 patches.

Hmm. OK. But I don't consider "can be optionally specified" incorrect or
worse than "can optionally be specified". I do agree that spelling words
correctly is a good idea.

--
Robert Haas
EDB: http://www.enterprisedb.com
Hi, Hackers.

Thank you for developing a great feature. The current help message shown
below does not seem to be able to specify the 'client-' or 'server-' for
lz4 compression.

--compress = {[{client, server}-]gzip, lz4, none}[:LEVEL]

The attached small patch fixes the help message as follows:

--compress = {[{client, server}-]{gzip, lz4}, none}[:LEVEL]

Regards,
Noriyoshi Shinoda
Please find the attached updated version of patch for ZSTD server side
compression.
This patch has following changes:
- Fixes the issue Tushar reported[1].
- Adds a tap test.
- Makes document changes related to zstd.
- Updates the pg_basebackup help output. Here I have chosen the
suggestion by Robert upthread (as given below):
>> I would be somewhat inclined to leave the level-only variant
>> undocumented and instead write it like this:
>> -Z, --compress={[{client|server}-]{gzip|lz4}}[:LEVEL]|none}
- pg_indent on basebackup_zstd.c.
Thanks Tushar, for offline help for testing the patch.
[1] https://www.postgresql.org/message-id/6c3f1558-1e56-9946-78a2-c59340da1dbf%40enterprisedb.com
Regards,
Jeevan Ladhe
On Sat, Feb 12, 2022 at 1:01 AM Shinoda, Noriyoshi (PN Japan FSIP)
<noriyoshi.shinoda@hpe.com> wrote:
> Thank you for developing a great feature.
> The current help message shown below does not seem to be able to specify the 'client-' or 'server-' for lz4 compression.
> --compress = {[{client, server}-]gzip, lz4, none}[:LEVEL]
>
> The attached small patch fixes the help message as follows:
> --compress = {[{client, server}-]{gzip, lz4}, none}[:LEVEL]
Hmm. After studying this a bit more closely, I think this might
actually need a bit more revision than what you propose here. In most
places, we use vertical bars to separate alternatives:
-X, --wal-method=none|fetch|stream
But here, we're using commas in some places and the word "or" in one
case as well:
-Z, --compress={[{client,server}-]gzip,lz4,none}[:LEVEL] or [LEVEL]
We're also not consistently using braces for grouping, which makes the
order of operations a bit unclear, and it makes no sense to put
brackets around LEVEL when it's the only thing that's part of that
alternative.
A more consistent way of writing the supported syntax would be like this:
-Z, --compress={[{client|server}-]{gzip|lz4}}[:LEVEL]|LEVEL|none}
I would be somewhat inclined to leave the level-only variant
undocumented and instead write it like this:
-Z, --compress={[{client|server}-]{gzip|lz4}}[:LEVEL]|none}
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Feb 9, 2022 at 8:41 AM Abhijit Menon-Sen <ams@toroid.org> wrote:
> It took me a while to assimilate these patches, including the backup
> targets one, which I hadn't looked at before. Now that I've wrapped my
> head around how to put the pieces together, I really like the idea. As
> you say, writing non-trivial integrations in C will take some effort,
> but it seems worthwhile. It's also nice that one can continue to use
> pg_basebackup to trigger the backups and see progress information.

Cool. Thanks for having a look.

> Yes, it looks simple to follow the example set by basebackup_to_shell to
> write a custom target. The complexity will be in whatever we need to do
> to store/forward the backup data, rather than in obtaining the data in
> the first place, which is exactly as it should be.

Yeah, that's what made me really happy with how this came out.

Here's v2, rebased and with documentation added.

--
Robert Haas
EDB: http://www.enterprisedb.com
I further worked on ZSTD and now have implemented client side
compression as well. Attached are the patches for both server-side and
client-side compression.
The patch 0001 is a server-side patch, and has not changed since the previous version.
Patch 0002 is the client-side compression patch.
Regards,
Jeevan Ladhe
On 2/15/22 6:48 PM, Jeevan Ladhe wrote:
> Please find the attached updated version of patch for ZSTD server side
Thanks, Jeevan, I again tested with the attached patch, and as mentioned
the crash is fixed now.
also, I tested with different labels with gzip V/s zstd against data
directory size which is 29GB and found these results
====
./pg_basebackup -t server:/tmp/<directory>
--compress=server-zstd:<label> -Xnone -n -N --no-estimate-size -v
--compress=server-zstd:1 = compress directory size is 1.3GB
--compress=server-zstd:4 = compress directory size is 1.3GB
--compress=server-zstd:7 = compress directory size is 1.2GB
--compress=server-zstd:12 = compress directory size is 1.2GB
====
===
./pg_basebackup -t server:/tmp/<directory>
--compress=server-gzip:<label> -Xnone -n -N --no-estimate-size -v
--compress=server-gzip:1 = compress directory size is 1.8GB
--compress=server-gzip:4 = compress directory size is 1.6GB
--compress=server-gzip:9 = compress directory size is 1.6GB
===
--
regards,tushar
EnterpriseDB https://www.enterprisedb.com/
The Enterprise PostgreSQL Company
On Tue, Feb 15, 2022 at 12:59 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> There's superfluous changes to ./configure unrelated to the changes in
> configure.ac. Probably because you're using a different version of autotools,
> or a vendor's patched copy. You can remove the changes with git checkout -p or
> similar.

I noticed this already and fixed it in the version of the patch I posted
on the other thread.

> +++ b/src/backend/replication/basebackup_zstd.c
> +bbsink *
> +bbsink_zstd_new(bbsink *next, int compresslevel)
> +{
> +#ifndef HAVE_LIBZSTD
> +	ereport(ERROR,
> +			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +			 errmsg("zstd compression is not supported by this build")));
> +#else
>
> This should have an return; like what's added by 71cbbbbe8 and 302612a6c.
> Also, the parens() around errcode aren't needed since last year.

The parens are still acceptable style, though. The return I guess is
needed.

> +	bbsink_zstd *sink;
> +
> +	Assert(next != NULL);
> +	Assert(compresslevel >= 0 && compresslevel <= 22);
> +
> +	if (compresslevel < 0 || compresslevel > 22)
> +		ereport(ERROR,
>
> This looks like dead code in assert builds.
> If it's unreachable, it can be elog().

Actually, the right thing to do here is remove the assert, I think. I
don't believe that the code is unreachable. If I'm wrong and it is
unreachable then the test-and-ereport should be removed.

> + * Compress the input data to the output buffer until we run out of input
> + * data. Each time the output buffer falls below the compression bound for
> + * the input buffer, invoke the archive_contents() method for then next sink.
>
> *the next sink ?

Yeah.

> Does anyone plan to include this for pg15 ? If so, I think at least the WAL
> compression should have support added too. I'd plan to rebase Michael's patch.
> https://www.postgresql.org/message-id/YNqWd2GSMrnqWIfx@paquier.xyz

Yes, I'd like to get this into PG15. It's very similar to the LZ4
compression support which was already committed, so it feels like
finishing it up and including it in the release makes a lot of sense. I'm
not against the idea of using ZSTD in other places where it makes sense as
well, but I think that's a separate issue from this patch. As far as I'm
concerned, either basebackup compression with ZSTD or WAL compression with
ZSTD could be committed even if the other is not, and I plan to spend my
time on this project, not that project. However, if you're saying you want
to work on the WAL compression stuff, I've got no objection to that.

--
Robert Haas
EDB: http://www.enterprisedb.com
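[Editor's note] Putting those two review points together, the suggested shape is roughly the following -- a sketch only, with the success path reduced to a placeholder; it is not the committed code, and later messages in this thread switch the guard from HAVE_LIBZSTD to USE_ZSTD:

bbsink *
bbsink_zstd_new(bbsink *next, int compresslevel)
{
#ifndef HAVE_LIBZSTD
	ereport(ERROR,
			(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
			 errmsg("zstd compression is not supported by this build")));
	return NULL;				/* keep compiler quiet */
#else
	/* No Assert() duplicating the runtime check: just validate and error. */
	if (compresslevel < 0 || compresslevel > 22)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				 errmsg("zstd compression level %d is out of range",
						compresslevel)));

	/* allocate and initialize a bbsink_zstd wrapping 'next' here */
	return NULL;				/* placeholder for the real sink */
#endif
}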
On 2022-Feb-14, Robert Haas wrote:
> A more consistent way of writing the supported syntax would be like this:
>
> -Z, --compress={[{client|server}-]{gzip|lz4}}[:LEVEL]|LEVEL|none}
>
> I would be somewhat inclined to leave the level-only variant
> undocumented and instead write it like this:
>
> -Z, --compress={[{client|server}-]{gzip|lz4}}[:LEVEL]|none}

This is hard to interpret for humans though because of the nested
brackets and braces. It gets considerably easier if you split it in
separate variants:

  -Z, --compress=[{client|server}-]{gzip|lz4}[:LEVEL]
  -Z, --compress=LEVEL
  -Z, --compress=none
                 compress tar output with given compression method or level

or, if you choose to leave the level-only variant undocumented, then

  -Z, --compress=[{client|server}-]{gzip|lz4}[:LEVEL]
  -Z, --compress=none
                 compress tar output with given compression method or level

There still are some nested brackets and braces, but the scope is reduced
enough that interpreting seems quite a bit simpler.

--
Álvaro Herrera  39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
So, I went ahead and have now also implemented client side decompression
for zstd.
Robert separated[1] the ZSTD configure switch from my original patch
of server side compression and also added documentation related to
the switch. I have included that patch here in the patch series for
simplicity.
The server side compression patch
0002-ZSTD-add-server-side-compression-support.patch has also taken care
of Justin Pryzby's comments[2]. Also, made changes to pg_basebackup help
as suggested by Álvaro Herrera.
[1] https://www.postgresql.org/message-id/CA%2BTgmobRisF-9ocqYDcMng6iSijGj1EZX99PgXA%3D3VVbWuahog%40mail.gmail.com
[2] https://www.postgresql.org/message-id/20220215175944.GY31460%40telsasoft.com
Regards,
Jeevan Ladhe
On Wed, Feb 16, 2022 at 11:11 AM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> This is hard to interpret for humans though because of the nested
> brackets and braces. It gets considerably easier if you split it in
> separate variants:
>
> -Z, --compress=[{client|server}-]{gzip|lz4}[:LEVEL]
> -Z, --compress=LEVEL
> -Z, --compress=none
> compress tar output with given compression method or level
>
>
> or, if you choose to leave the level-only variant undocumented, then
>
> -Z, --compress=[{client|server}-]{gzip|lz4}[:LEVEL]
> -Z, --compress=none
> compress tar output with given compression method or level
>
> There still are some nested brackets and braces, but the scope is
> reduced enough that interpreting seems quite a bit simpler.
I could go for that. I'm also just noticing that "none" is not really
a compression method or level, and the statement that it can only
compress "tar" output is no longer correct, because server-side
compression can be used together with -Fp. So maybe we should change
the sentence afterward to something a bit more generic, like "specify
whether and how to compress the backup".
--
Robert Haas
EDB: http://www.enterprisedb.com
On Wed, Feb 16, 2022 at 12:46 PM Jeevan Ladhe <jeevanladhe.os@gmail.com> wrote:
> So, I went ahead and have now also implemented client side decompression
> for zstd.
>
> Robert separated[1] the ZSTD configure switch from my original patch
> of server side compression and also added documentation related to
> the switch. I have included that patch here in the patch series for
> simplicity.
>
> The server side compression patch
> 0002-ZSTD-add-server-side-compression-support.patch has also taken care
> of Justin Pryzby's comments[2]. Also, made changes to pg_basebackup help
> as suggested by Álvaro Herrera.
The first hunk of the documentation changes is missing a comma between
gzip and lz4.
+ * At the start of each archive we reset the state to start a new
+ * compression operation. The parameters are sticky and they would stick
+ * around as we are resetting with option ZSTD_reset_session_only.
I don't think "would" is what you mean here. If you say something
would stick around, that means it could be that way it isn't. ("I
would go to the store and buy some apples, but I know they don't have
any so there's no point.") I think you mean "will".
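[Editor's note] For context, the libzstd behavior that comment is describing looks roughly like this -- a simplified sketch, not the patch's code, with error checking omitted: parameters set on a ZSTD_CCtx are sticky across a session-only reset, so they are configured once and only the session is reset per archive.

#include <zstd.h>

/* Done once, when the compression sink is created. */
static void
configure_cctx(ZSTD_CCtx *cctx, int level)
{
	/* Sticky: persists across ZSTD_reset_session_only resets. */
	ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, level);
}

/* Done at the start of each archive. */
static void
begin_archive(ZSTD_CCtx *cctx)
{
	/* Start a new frame but keep the previously-set parameters. */
	ZSTD_CCtx_reset(cctx, ZSTD_reset_session_only);
}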
- printf(_(" -Z,
--compress={[{client,server}-]gzip,lz4,none}[:LEVEL] or [LEVEL]\n"
- " compress tar output with given
compression method or level\n"));
+ printf(_(" -Z, --compress=[{client|server}-]{gzip|lz4|zstd}[:LEVEL]\n"));
+ printf(_(" -Z, --compress=none\n"));
You deleted a line that you should have preserved here.
Overall there doesn't seem to be much to complain about here on a
first read-through. It will be good if we can also fix
CreateWalTarMethod to support LZ4 and ZSTD.
--
Robert Haas
EDB: http://www.enterprisedb.com
- It first writes the header in the function tar_open_for_write, flushes the contents of tar to disk and stores the header offset.
- Next, the contents of WAL are written to the tar archive.
- In the end, it recalculates the checksum in function tar_close() and overwrites the header at an offset stored in step #1.
On Fri, Mar 4, 2022 at 3:32 AM Dipesh Pandit <dipesh.pandit@gmail.com> wrote:
> GZIP manages to overcome this problem as it provides an option to turn on/off
> compression on the fly while writing a compressed archive with the help of zlib
> library function deflateParams(). The current gzip implementation for
> CreateWalTarMethod uses this library function to turn off compression just before
> step #1 and it writes the uncompressed header of size equal to TAR_BLOCK_SIZE.
> It uses the same library function to turn on the compression for writing the contents
> of the WAL file as part of step #2. It again turns off the compression just before step
> #3 to overwrite the header. The header is overwritten at the same offset with size
> equal to TAR_BLOCK_SIZE.

This is a real mess. To me, it seems like a pretty big hack to use
deflateParams() to shut off compression in the middle of the compressed
data stream so that we can go back and overwrite that part of the data
later. It appears that the only reason we need that hack is because we
don't know the file size starting out. Except we kind of do know the size,
because pad_to_size specifies a minimum size for the file. It's true that
the maximum file size is unbounded, but I'm not sure why that's important.
I wonder if anyone else has an idea why we didn't just set the file size
to pad_to_size exactly when we write the tar header the first time,
instead of this IMHO kind of nutty approach where we back up. I'd try to
figure it out from the comments, but there basically aren't any. I also
had a look at the relevant commit messages and didn't see anything
relevant there either. If I'm missing something, please point it out.

While I'm complaining, I noticed while looking at this code that it is
documented that "The caller must ensure that only one method is
instantiated in any given program, and that it's only instantiated once!"
As far as I can see, this is because somebody thought about putting all of
the relevant data into a struct and then decided on an alternative
strategy of storing some of it there, and the rest in a global variable. I
can't quite imagine why anyone would think that was a good idea. There may
be some reason that I can't see right now, but here again there appear to
be no relevant code comments.

I'm somewhat inclined to wonder whether we could just get rid of
walmethods.c entirely and use the new bbstreamer stuff instead. That code
also knows how to write plain files into a directory, and write tar
archives, and compress stuff, but in my totally biased opinion as the
author of most of that code, it's better code. It has no restriction on
using at most one method per program, or of instantiating that method only
once, and it already has LZ4 support, and there's a pending patch for ZSTD
support that I intend to get committed soon as well. It also has, and I
know I might be beating a dead horse here, comments. Now, admittedly, it
does need to know the size of each archive member up front in order to
work, so if we can't solve the problem then we can't go this route. But if
we can't solve that problem, then we also can't add LZ4 and ZSTD support
to walmethods.c, because random access to compressed data is not really a
thing, even if we hacked it to work for gzip.

Thoughts?

--
Robert Haas
EDB: http://www.enterprisedb.com
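[Editor's note] For reference, the zlib mechanism described in the quoted text is roughly the following -- a simplified sketch, not the actual walmethods.c code; note that deflateParams() generally expects pending output to have been flushed before the parameters are changed:

#include <zlib.h>

/*
 * Switching to Z_NO_COMPRESSION makes subsequent deflate() output stored
 * (uncompressed) blocks, which is what lets the TAR_BLOCK_SIZE header be
 * located and overwritten in place later; switching back to the configured
 * level re-enables compression for the WAL contents.
 */
static int
set_deflate_level(z_stream *zs, int level)
{
	return deflateParams(zs, level, Z_DEFAULT_STRATEGY);
}

/*
 * usage: set_deflate_level(zs, Z_NO_COMPRESSION) before writing the header,
 * then set_deflate_level(zs, configured_level) before the file contents.
 */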
On Wed, Feb 16, 2022 at 8:46 PM Jeevan Ladhe <jeevanladhe.os@gmail.com> wrote:
> Thanks for the comments Robert. I have addressed your comments in the
> attached patch v13-0002-ZSTD-add-server-side-compression-support.patch.
> Rest of the patches are similar to v12, but just bumped the version number.

OK, here's a consolidated patch with all your changes from 0002-0004 as
0001 plus a few proposed edits of my own in 0002. By and large I think
this is fine.

My proposed changes are largely cosmetic, but one thing that isn't is
revising the size - pos <= bound tests to instead check size - pos <
bound. My reasoning for that change is: if the number of bytes remaining
in the buffer is exactly equal to the maximum number we can write, we
don't need to flush it yet. If that sounds correct, we should fix the LZ4
code the same way.

--
Robert Haas
EDB: http://www.enterprisedb.com
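[Editor's note] In code form, the proposed change boils down to something like this (a distilled sketch with invented names, not the committed hunk):

#include <stdbool.h>
#include <stddef.h>

static bool
need_flush(size_t maxlen, size_t bytes_written, size_t out_bound)
{
	/*
	 * Flush only when the worst-case output of the next chunk no longer
	 * fits; if the remaining space equals the bound exactly, it still fits.
	 */
	return (maxlen - bytes_written) < out_bound;
}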
On Tue, Mar 8, 2022 at 4:49 AM Jeevan Ladhe <jeevanladhe.os@gmail.com> wrote:
> I agree with your patch. The patch looks good to me.
> Yes, the LZ4 flush check should also be fixed. Please find the attached
> patch to fix the LZ4 code.

OK, committed all that stuff.

I think we also need to fix one other thing. Right now, for LZ4 support we
test HAVE_LIBLZ4, but TOAST and XLOG compression are testing USE_LZ4, so I
think we should be doing the same here. And similarly I think we should be
testing USE_ZSTD not HAVE_LIBZSTD. Patch for that attached.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Mar 8, 2022 at 11:32 AM Jeevan Ladhe <jeevanladhe.os@gmail.com> wrote:
> I reviewed the patch, and it seems to be capturing and replacing all the
> places of HAVE_LIB* with USE_* correctly.
> Just curious, apart from consistency, do you see other problems as well
> when testing one vs the other?
So, the kind of problem you would worry about in a case like this is:
suppose that configure detects LIBLZ4, but the user specifies
--without-lz4. Then maybe there is some way for HAVE_LIBLZ4 to be
true, while USE_LIBLZ4 is false, and therefore we should not be
compiling code that uses LZ4 but do anyway. As configure.ac is
currently coded, I think that's impossible, because we only search for
liblz4 if the user says --with-lz4, and if they do that, then USE_LZ4
will be set. Therefore, I don't think there is a live problem here,
just an inconsistency.
Probably still best to clean it up before an angry Andres chases me
down, since I know he's working on the build system...
--
Robert Haas
EDB: http://www.enterprisedb.com
I'm getting errors from pg_basebackup when using both -D- and --compress=server-*
The issue seems to go away if I use --no-manifest.

$ ./src/bin/pg_basebackup/pg_basebackup -h /tmp -Ft -D- --wal-method none --compress=server-gzip >/dev/null ; echo $?
pg_basebackup: error: tar member has empty name
1

$ ./src/bin/pg_basebackup/pg_basebackup -h /tmp -Ft -D- --wal-method none --compress=server-gzip >/dev/null ; echo $?
NOTICE: WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backup
pg_basebackup: error: COPY stream ended before last file was finished
1
On Thu, Mar 10, 2022 at 8:02 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> I'm getting errors from pg_basebackup when using both -D- and --compress=server-*
> The issue seems to go away if I use --no-manifest.
>
> $ ./src/bin/pg_basebackup/pg_basebackup -h /tmp -Ft -D- --wal-method none --compress=server-gzip >/dev/null ; echo $?
> pg_basebackup: error: tar member has empty name
> 1
>
> $ ./src/bin/pg_basebackup/pg_basebackup -h /tmp -Ft -D- --wal-method none --compress=server-gzip >/dev/null ; echo $?
> NOTICE: WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backup
> pg_basebackup: error: COPY stream ended before last file was finished
> 1

Thanks for the report. The problem here is that, when the output is
standard output (-D -), pg_basebackup can only produce a single output
file, so the manifest gets injected into the tar file on the client side
rather than being written separately as we do in normal cases. However,
that only works if we're receiving a tar file that we can parse from the
server, and here the server is sending a compressed tarfile. The current
code mistakenly attempts to parse the compressed tarfile as if it were an
uncompressed tarfile, which causes the error messages that you are seeing
(and which I can also reproduce here).

We actually have enough infrastructure available in pg_basebackup now that
we could do the "right thing" in this case: decompress the data received
from the server, parse the resulting tar file, inject the backup manifest,
construct a new tar file, and recompress. However, I think that's probably
not a good idea, because it's unlikely that the user will understand that
the data is being compressed on the server, then decompressed, and then
recompressed again, and the performance of the resulting pipeline will
probably not be very good. So I think we should just refuse this command.
Patch for that attached.

--
Robert Haas
EDB: http://www.enterprisedb.com
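[Editor's note] The refusal being proposed would amount to a client-side check along these lines -- a hypothetical sketch; the variable names are invented and not necessarily those used in pg_basebackup.c:

if (strcmp(basedir, "-") == 0 &&
	compression_is_server_side &&
	manifest_requested)
{
	pg_log_error("cannot inject backup manifest into a server-compressed archive written to stdout");
	pg_log_error("consider client-side compression or --no-manifest");
	exit(1);
}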
On Fri, Mar 11, 2022 at 10:19:29AM -0500, Robert Haas wrote:
> So I think we should just refuse this command. Patch for that attached.

Sounds right.

Also, I think the magic 8 for .gz should actually be a 7.

I'm not sure why it tests for ".gz" but not ".tar.gz", which would help to
make them all less magic.

commit 1fb1e21ba7a500bb2b85ec3e65f59130fcdb4a7e
Author: Justin Pryzby <pryzbyj@telsasoft.com>
Date:   Thu Mar 10 21:22:16 2022 -0600

    pg_basebackup: make magic numbers less magic

    The magic 8 for .gz should actually be a 7.

    .tar.gz
    1234567

    .tar.lz4
    .tar.zst
    12345678

    See d45099425, 751b8d23b, 7cf085f07.

diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 9f3ecc60fbe..8dd9721323d 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -1223,17 +1223,17 @@ CreateBackupStreamer(char *archive_name, char *spclocation,
 	is_tar = (archive_name_len > 4 &&
 			  strcmp(archive_name + archive_name_len - 4, ".tar") == 0);

-	/* Is this a gzip archive? */
-	is_tar_gz = (archive_name_len > 8 &&
-				 strcmp(archive_name + archive_name_len - 3, ".gz") == 0);
+	/* Is this a .tar.gz archive? */
+	is_tar_gz = (archive_name_len > 7 &&
+				 strcmp(archive_name + archive_name_len - 7, ".tar.gz") == 0);

-	/* Is this a LZ4 archive? */
+	/* Is this a .tar.lz4 archive? */
 	is_tar_lz4 = (archive_name_len > 8 &&
-				  strcmp(archive_name + archive_name_len - 4, ".lz4") == 0);
+				  strcmp(archive_name + archive_name_len - 8, ".tar.lz4") == 0);

-	/* Is this a ZSTD archive? */
+	/* Is this a .tar.zst archive? */
 	is_tar_zstd = (archive_name_len > 8 &&
-				   strcmp(archive_name + archive_name_len - 4, ".zst") == 0);
+				   strcmp(archive_name + archive_name_len - 8, ".tar.zst") == 0);

 	/*
 	 * We have to parse the archive if (1) we're suppose to extract it, or if
On Fri, Mar 11, 2022 at 11:29 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> Sounds right.

OK, committed.

> Also, I think the magic 8 for .gz should actually be a 7.
>
> I'm not sure why it tests for ".gz" but not ".tar.gz", which would help to make
> them all less magic.
>
> commit 1fb1e21ba7a500bb2b85ec3e65f59130fcdb4a7e
> Author: Justin Pryzby <pryzbyj@telsasoft.com>
> Date:   Thu Mar 10 21:22:16 2022 -0600

Yeah, your patch looks right. Committed that, too.

--
Robert Haas
EDB: http://www.enterprisedb.com
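[Editor's note] One way to make those checks less magic, as suggested above, is a small suffix helper (hypothetical sketch, not what was committed):

#include <stdbool.h>
#include <string.h>

static bool
ends_with(const char *name, const char *suffix)
{
	size_t		name_len = strlen(name);
	size_t		suffix_len = strlen(suffix);

	return name_len > suffix_len &&
		strcmp(name + name_len - suffix_len, suffix) == 0;
}

/* e.g. is_tar_gz = ends_with(archive_name, ".tar.gz"); */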
On Tue, Feb 15, 2022 at 11:26 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Feb 9, 2022 at 8:41 AM Abhijit Menon-Sen <ams@toroid.org> wrote:
> > It took me a while to assimilate these patches, including the backup
> > targets one, which I hadn't looked at before. Now that I've wrapped my
> > head around how to put the pieces together, I really like the idea. As
> > you say, writing non-trivial integrations in C will take some effort,
> > but it seems worthwhile. It's also nice that one can continue to use
> > pg_basebackup to trigger the backups and see progress information.
>
> Cool. Thanks for having a look.
>
> > Yes, it looks simple to follow the example set by basebackup_to_shell to
> > write a custom target. The complexity will be in whatever we need to do
> > to store/forward the backup data, rather than in obtaining the data in
> > the first place, which is exactly as it should be.
>
> Yeah, that's what made me really happy with how this came out.
>
> Here's v2, rebased and with documentation added.

I don't hear many comments on this, but I'm pretty sure that it's a good
idea, and there haven't been many objections to this patch series as a
whole, so I'd like to proceed with it. If nobody objects vigorously, I'll
commit this next week.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

On 2022-03-11 10:19:29 -0500, Robert Haas wrote:
> Thanks for the report. The problem here is that, when the output is
> standard output (-D -), pg_basebackup can only produce a single output
> file, so the manifest gets injected into the tar file on the client
> side rather than being written separately as we do in normal cases.
> However, that only works if we're receiving a tar file that we can
> parse from the server, and here the server is sending a compressed
> tarfile. The current code mistakenly attempts to parse the compressed
> tarfile as if it were an uncompressed tarfile, which causes the error
> messages that you are seeing (and which I can also reproduce here). We
> actually have enough infrastructure available in pg_basebackup now
> that we could do the "right thing" in this case: decompress the data
> received from the server, parse the resulting tar file, inject the
> backup manifest, construct a new tar file, and recompress. However, I
> think that's probably not a good idea, because it's unlikely that the
> user will understand that the data is being compressed on the server,
> then decompressed, and then recompressed again, and the performance of
> the resulting pipeline will probably not be very good. So I think we
> should just refuse this command. Patch for that attached.

You could also just append a manifest as a compressed tar to the compressed
tar stream. Unfortunately GNU tar requires -i to read concatenated
compressed archives, so perhaps that's not quite an alternative.

Greetings,

Andres Freund
On Fri, Mar 11, 2022 at 8:52 PM Andres Freund <andres@anarazel.de> wrote:
> You could also just append a manifest as a compressed tar to the compressed
> tar stream. Unfortunately GNU tar requires -i to read concatenated compressed
> archives, so perhaps that's not quite an alternative.

s/Unfortunately/Fortunately/ :-p

I think we've already gone way too far in the direction of making this
stuff rely on specific details of the tar format. What if someday we
wanted to switch to pax, cpio, zip, 7zip, whatever, or even just have one
of those things as an option? It's not that I'm dying to have PostgreSQL
produce rar or arj files, but I think we box ourselves into a corner when
we just assume tar everywhere.

As an example of a similar issue with real consequences, consider the
recent discovery that we can't easily add support for LZ4 or ZSTD
compression of pg_wal.tar. The problem is that the existing code tells the
gzip library to emit the tar header as part of the compressed stream
without actually compressing it, and then it goes back and overwrites that
data later! Unsurprisingly, that's not a feature every compression library
offers.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Mar 14, 2022 at 09:41:35PM +0530, Dipesh Pandit wrote:
> I tried to implement support for parallel ZSTD compression. The
> library provides an option (ZSTD_c_nbWorkers) to specify the
> number of compression workers. The number of parallel
> workers can be set as part of compression parameter and if this
> option is specified then the library performs parallel compression
> based on the specified number of workers.
>
> User can specify the number of parallel worker as part of
> --compress option by appending an integer value after at sign (@).
> (-Z, --compress=[{client|server}-]{gzip|lz4|zstd}[:LEVEL][@WORKERS])

I suggest to use a syntax that's more general than that, maybe something like

:[level=]N,parallel=N,flag,flag,...

For example, someone may want to use zstd "long" mode or (when it's released)
rsyncable mode, or specify fine-grained compression parameters (strategy,
windowLog, hashLog, etc).

I hope the same syntax will be shared with wal_compression and pg_dump.
And libpq, if that patch progresses.

BTW, I think this may be better left for PG16.

--
Justin
On Mon, Mar 14, 2022 at 12:35 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> I suggest to use a syntax that's more general than that, maybe something like
>
> :[level=]N,parallel=N,flag,flag,...
>
> For example, someone may want to use zstd "long" mode or (when it's released)
> rsyncable mode, or specify fine-grained compression parameters (strategy,
> windowLog, hashLog, etc).

That's an interesting idea. I wonder what the replication protocol ought
to look like in that case. Should we have a COMPRESSION_DETAIL argument
that is just a string, and let the server parse it out? Or separate
protocol-level options? It does feel reasonable to have both
COMPRESSION_LEVEL and COMPRESSION_WORKERS as first-class options, but I
don't know that we want COMPRESSION_HASHLOG true as part of our
first-class grammar.

> I hope the same syntax will be shared with wal_compression and pg_dump.
> And libpq, if that patch progresses.
>
> BTW, I think this may be better left for PG16.

Possibly so ... but if we're thinking of any revisions to the newly-added
grammar, we had better take care of that now, before it's set in stone.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Mon, Mar 14, 2022 at 01:02:20PM -0400, Robert Haas wrote:
> On Mon, Mar 14, 2022 at 12:35 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > I suggest to use a syntax that's more general than that, maybe something like
> >
> > :[level=]N,parallel=N,flag,flag,...
> >
> > For example, someone may want to use zstd "long" mode or (when it's released)
> > rsyncable mode, or specify fine-grained compression parameters (strategy,
> > windowLog, hashLog, etc).
>
> That's an interesting idea. I wonder what the replication protocol
> ought to look like in that case. Should we have a COMPRESSION_DETAIL
> argument that is just a string, and let the server parse it out? Or
> separate protocol-level options? It does feel reasonable to have both
> COMPRESSION_LEVEL and COMPRESSION_WORKERS as first-class options, but
> I don't know that we want COMPRESSION_HASHLOG true as part of our
> first-class grammar.

I was only referring to the user-facing grammar.

Internally, I was thinking they'd all be handled as first-class options,
with separate struct fields and separate replication protocol options. If
an option isn't known, it'd be rejected on the client side, rather than
causing an error on the server.

Maybe there'd be an option parser for this in common/ (I think that might
require having new data structure there too, maybe one for each
compression method, or maybe a union{} to handle them all). Most of the
~100 lines to support wal_compression='zstd:N' are to parse out the N.

--
Justin
On Mon, Mar 14, 2022 at 1:11 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> Internally, I was thinking they'd all be handled as first-class options, with
> separate struct fields and separate replication protocol options. If an option
> isn't known, it'd be rejected on the client side, rather than causing an error
> on the server.

There's some appeal to that, but one downside is that it means that the
client can't be used to fetch data that is compressed in a way that the
server knows about and the client doesn't. I don't think that's great. Why
should, for example, pg_basebackup need to be compiled with zstd support
in order to request zstd compression on the server side? If the server
knows about the brand new justin-magic-sauce compression algorithm, maybe
the client should just be able to request it and, when given various .jms
files by the server, shrug its shoulders and accept them for what they
are. That doesn't work if -Fp is involved, or similar, but it should work
fine for simple cases if we set things up right.

> Maybe there'd be an option parser for this in common/ (I think that might
> require having new data structure there too, maybe one for each compression
> method, or maybe a union{} to handles them all). Most of the ~100 lines to
> support wal_compression='zstd:N' are to parse out the N.

Yes, it's actually a very simple feature now that we've got the rest of
the infrastructure set up correctly for it.

--
Robert Haas
EDB: http://www.enterprisedb.com
I had a look at the patch and also tried to take the backup. I have
following suggestions and observations:
I get following error at my end:
$ pg_basebackup -D /tmp/zstd_bk -Ft -Xfetch --compress=server-zstd:7@4
pg_basebackup: error: could not initiate base backup: ERROR: could not compress data: Unsupported parameter
pg_basebackup: removing data directory "/tmp/zstd_bk"
This is mostly because I have the zstd library version v1.4.4, which
does not have default support for parallel workers. Maybe we should
have a better error, something hinting that parallelism is not supported
by this particular build.
The regression for pg_verifybackup test 008_untar.pl also fails with a
similar error. Here, I think we should have some logic in the regression
test to skip it if the parameter is not supported?
+ if (ZSTD_isError(ret))
+ elog(ERROR,
+ "could not compress data: %s",
+ ZSTD_getErrorName(ret));
I think all of this can go on one line, but anyhow we have to improve
the error message here.
Also, just a thought, for the versions where parallelism is not
supported, should we instead just throw a warning and fall back to
non-parallel behavior?
Regards,
Jeevan Ladhe
Hi,

I tried to implement support for parallel ZSTD compression. The library
provides an option (ZSTD_c_nbWorkers) to specify the number of compression
workers. The number of parallel workers can be set as part of the
compression parameters, and if this option is specified then the library
performs parallel compression based on the specified number of workers.

The user can specify the number of parallel workers as part of the
--compress option by appending an integer value after an at sign (@).
(-Z, --compress=[{client|server}-]{gzip|lz4|zstd}[:LEVEL][@WORKERS])

Please find the attached patch v1 with the above changes.

Note: ZSTD library version 1.5.x supports parallel compression by default,
and if the library version is lower than 1.5.x then parallel compression
is enabled only if the library source is compiled with the build macro
ZSTD_MULTITHREAD. If the linked library version doesn't support parallel
compression, then setting the parameter ZSTD_c_nbWorkers to a value other
than 0 will be a no-op and return an error.

Thanks,
Dipesh
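[Editor's note] The worker count is handed to libzstd as just another compression parameter, roughly like this (a simplified sketch, with error handling reduced to a boolean). With a libzstd built without ZSTD_MULTITHREAD, setting a nonzero worker count fails, which is what produces the "Unsupported parameter" error quoted in Jeevan's review above.

#include <stdbool.h>
#include <zstd.h>

static bool
set_zstd_workers(ZSTD_CCtx *cctx, int workers)
{
	size_t		ret;

	ret = ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, workers);
	return !ZSTD_isError(ret);	/* on failure, see ZSTD_getErrorName(ret) */
}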
On Tue, Mar 15, 2022 at 6:33 AM Jeevan Ladhe <jeevanladhe.os@gmail.com> wrote:
> I get following error at my end:
>
> $ pg_basebackup -D /tmp/zstd_bk -Ft -Xfetch --compress=server-zstd:7@4
> pg_basebackup: error: could not initiate base backup: ERROR: could not compress data: Unsupported parameter
> pg_basebackup: removing data directory "/tmp/zstd_bk"
>
> This is mostly because I have the zstd library version v1.4.4, which
> does not have default support for parallel workers. Maybe we should
> have a better error, something that is hinting that the parallelism is
> not supported by the particular build.

I'm not averse to trying to improve that error message, but honestly I'd
consider that to be good enough already to be acceptable. We could think
about trying to add an errhint() telling you that the problem may be with
your libzstd build.

> The regression for pg_verifybackup test 008_untar.pl also fails with a
> similar error. Here, I think we should have some logic in regression to
> skip the test if the parameter is not supported?

Or at least to have the test not fail.

> Also, just a thought, for the versions where parallelism is not
> supported, should we instead just throw a warning and fall back to
> non-parallel behavior?

I don't think so. I think it's better for the user to get an error and
then change their mind and request something we can do.

--
Robert Haas
EDB: http://www.enterprisedb.com
Should zstd's negative compression levels be supported here ?
Here's a POC patch which is enough to play with it.

$ src/bin/pg_basebackup/pg_basebackup --wal-method fetch -Ft -D - -h /tmp --no-sync --compress=zstd |wc -c
12305659
$ src/bin/pg_basebackup/pg_basebackup --wal-method fetch -Ft -D - -h /tmp --no-sync --compress=zstd:1 |wc -c
13827521
$ src/bin/pg_basebackup/pg_basebackup --wal-method fetch -Ft -D - -h /tmp --no-sync --compress=zstd:0 |wc -c
12304018
$ src/bin/pg_basebackup/pg_basebackup --wal-method fetch -Ft -D - -h /tmp --no-sync --compress=zstd:-1 |wc -c
16443893
$ src/bin/pg_basebackup/pg_basebackup --wal-method fetch -Ft -D - -h /tmp --no-sync --compress=zstd:-2 |wc -c
17349563
$ src/bin/pg_basebackup/pg_basebackup --wal-method fetch -Ft -D - -h /tmp --no-sync --compress=zstd:-4 |wc -c
19452631
$ src/bin/pg_basebackup/pg_basebackup --wal-method fetch -Ft -D - -h /tmp --no-sync --compress=zstd:-7 |wc -c
21871505

Also, with a partial regression DB, this crashes when writing to stdout.

$ src/bin/pg_basebackup/pg_basebackup --wal-method fetch -Ft -D - -h /tmp --no-sync --compress=lz4 |wc -c
pg_basebackup: bbstreamer_lz4.c:172: bbstreamer_lz4_compressor_content: Assertion `mystreamer->base.bbs_buffer.maxlen >= out_bound' failed.
24117248

#4  0x000055555555e8b4 in bbstreamer_lz4_compressor_content (streamer=0x5555555a5260, member=0x7fffffffc760,
    data=0x7ffff3068010 "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n\"Files\": [\n{ \"Path\": \"backup_label\", \"Size\": 227, \"Last-Modified\": \"2022-03-16 02:29:11 GMT\", \"Checksum-Algorithm\": \"CRC32C\", \"Checksum\": \"46f69d99\" },\n{ \"Pa"...,
    len=401072, context=BBSTREAMER_MEMBER_CONTENTS) at bbstreamer_lz4.c:172
        mystreamer = 0x5555555a5260
        next_in = 0x7ffff3068010 "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n\"Files\": [\n{ \"Path\": \"backup_label\", \"Size\": 227, \"Last-Modified\": \"2022-03-16 02:29:11 GMT\", \"Checksum-Algorithm\": \"CRC32C\", \"Checksum\": \"46f69d99\" },\n{ \"Pa"...
...

(gdb) p mystreamer->base.bbs_buffer.maxlen
$1 = 524288
(gdb) p (int) LZ4F_compressBound(len, &mystreamer->prefs)
$4 = 524300

This is with:
liblz4-1:amd64 1.9.2-2ubuntu0.20.04.1
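[Editor's note] If negative levels were allowed, the range check could be written against libzstd's own limits rather than a hard-coded 0..22. A sketch only, assuming a libzstd new enough to provide ZSTD_minCLevel()/ZSTD_maxCLevel(); whether to allow this at all is exactly the open question being raised here:

#include <stdbool.h>
#include <zstd.h>

static bool
zstd_level_in_range(int level)
{
	/* ZSTD_minCLevel() is negative; ZSTD_maxCLevel() is currently 22. */
	return level >= ZSTD_minCLevel() && level <= ZSTD_maxCLevel();
}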
On Mon, Mar 14, 2022 at 1:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
> There's some appeal to that, but one downside is that it means that
> the client can't be used to fetch data that is compressed in a way
> that the server knows about and the client doesn't. I don't think
> that's great. Why should, for example, pg_basebackup need to be
> compiled with zstd support in order to request zstd compression on the
> server side? If the server knows about the brand new
> justin-magic-sauce compression algorithm, maybe the client should just
> be able to request it and, when given various .jms files by the
> server, shrug its shoulders and accept them for what they are. That
> doesn't work if -Fp is involved, or similar, but it should work fine
> for simple cases if we set things up right.

Concretely, I propose the attached patch for v15. It renames the
newly-added COMPRESSION_LEVEL option to COMPRESSION_DETAIL, introduces a
flexible syntax for options along the lines you proposed, and adjusts
things so that a client that doesn't support a particular type of
compression can still request that type of compression from the server. I
think it's important to do this for v15 so that we don't end up with
backward-compatibility problems down the road.

--
Robert Haas
EDB: http://www.enterprisedb.com
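[Editor's note] The client-side half of that amounts to splitting a --compress value of the form [{client|server}-]ALGORITHM[:DETAIL] into its pieces. A rough sketch follows (invented names, not the patch's parse_compress_options(); malloc error handling and the bare-integer-means-gzip special case are omitted):

#include <stdlib.h>
#include <string.h>

typedef enum
{
	COMPRESS_LOCATION_UNSPECIFIED,
	COMPRESS_LOCATION_CLIENT,
	COMPRESS_LOCATION_SERVER
} CompressionLocation;

static void
split_compress_option(const char *option, CompressionLocation *loc,
					  char **algorithm, char **detail)
{
	const char *sep;

	*loc = COMPRESS_LOCATION_UNSPECIFIED;
	if (strncmp(option, "client-", 7) == 0)
	{
		*loc = COMPRESS_LOCATION_CLIENT;
		option += 7;
	}
	else if (strncmp(option, "server-", 7) == 0)
	{
		*loc = COMPRESS_LOCATION_SERVER;
		option += 7;
	}

	sep = strchr(option, ':');
	if (sep == NULL)
	{
		*algorithm = strdup(option);
		*detail = NULL;
	}
	else
	{
		size_t		len = (size_t) (sep - option);

		*algorithm = malloc(len + 1);
		memcpy(*algorithm, option, len);
		(*algorithm)[len] = '\0';
		*detail = strdup(sep + 1);
	}
}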
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 9178c779ba..00c593f1af 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -2731,14 +2731,24 @@ The commands accepted in replication mode are:
+      <para>
+        For <literal>gzip</literal> the compression level should be an

gzip comma

+++ b/src/backend/replication/basebackup.c
@@ -18,6 +18,7 @@
 #include "access/xlog_internal.h"	/* for pg_start/stop_backup */
 #include "common/file_perm.h"
+#include "common/backup_compression.h"

alphabetical

-				 errmsg("unrecognized compression algorithm: \"%s\"",
+				 errmsg("unrecognized compression algorithm \"%s\"",

Most other places seem to say "compression method". So I'd suggest to change
that here, and in doc/src/sgml/ref/pg_basebackup.sgml.

-	if (o_compression_level && !o_compression)
+	if (o_compression_detail && !o_compression)
 		ereport(ERROR,
 				(errcode(ERRCODE_SYNTAX_ERROR),
 				 errmsg("compression level requires compression")));

s/level/detail/

 /*
+ * Basic parsing of a value specified for -Z/--compress.
+ *
+ * We're not concerned here with understanding exactly what behavior the
+ * user wants, but we do need to know whether the user is requesting client
+ * or server side compression or leaving it unspecified, and we need to
+ * separate the name of the compression algorithm from the detail string.
+ *
+ * For instance, if the user writes --compress client-lz4:6, we want to
+ * separate that into (a) client-side compression, (b) algorithm "lz4",
+ * and (c) detail "6". Note, however, that all the client/server prefix is
+ * optional, and so is the detail. The algorithm name is required, unless
+ * the whole string is an integer, in which case we assume "gzip" as the
+ * algorithm and use the integer as the detail.
..
 */
 static void
+parse_compress_options(char *option, char **algorithm, char **detail,
+					   CompressionLocation *locationres)

It'd be great if this were re-usable for wal_compression, which I hope in
pg16 will support at least level=N. And eventually pg_dump. But those
clients shouldn't accept a client/server prefix. Maybe the way to handle
that is for those tools to check locationres and reject it if it was
specified.

+ * We're not concerned with validation at this stage, so if the user writes
+ * --compress client-turkey:sandwhich, the requested algorithm is "turkey"
+ * and the detail string is "sandwhich". We'll sort out whether that's legal

sp: sandwich

+	WalCompressionMethod wal_compress_method;

This is confusingly similar to src/include/access/xlog.h:WalCompression.
I think someone else mentioned this before ?

+ * A compression specification specifies the parameters that should be used
+ * when * performing compression with a specific algorithm. The simplest

star

+/*
+ * Get the human-readable name corresponding to a particular compression
+ * algorithm.
+ */
+char *
+get_bc_algorithm_name(bc_algorithm algorithm)

should be const ?

+	/* As a special case, the specification can be a bare integer. */
+	bare_level = strtol(specification, &bare_level_endp, 10);

Should this call expect_integer_value()? See below.

+		result->parse_error =
+			pstrdup("found empty string where a compression option was expected");

Needs to be localized with _() ? Also, document that it's pstrdup'd.

+/*
+ * Parse 'value' as an integer and return the result.
+ *
+ * If parsing fails, set result->parse_error to an appropriate message
+ * and return -1.
+ */
+static int
+expect_integer_value(char *keyword, char *value, bc_specification *result)

-1 isn't great, since it's also an integer, and, also a valid compression
level for zstd (did you see my message about that?). Maybe INT_MIN is ok.

+{
+	int			ivalue;
+	char	   *ivalue_endp;
+
+	ivalue = strtol(value, &ivalue_endp, 10);

Should this also set/check errno ? And check if value != ivalue_endp ?
See strtol(3)

+char *
+validate_bc_specification(bc_specification *spec)
...
+	/*
+	 * If a compression level was specified, check that the algorithm expects
+	 * a compression level and that the level is within the legal range for
+	 * the algorithm.

It would be nice if this could be shared with wal_compression and pg_dump.
We shouldn't need multiple places with structures giving the algorithms and
range of compression levels.

+	unsigned	options;		/* OR of BACKUP_COMPRESSION_OPTION constants */

Should be "unsigned int" or "bits32" ?

The server crashes if I send an unknown option - you should hit that in the
regression tests.

$ src/bin/pg_basebackup/pg_basebackup --wal-method fetch -Ft -D - -h /tmp --no-sync --no-manifest --compress=server-lz4:a |wc -c
TRAP: FailedAssertion("pointer != NULL", File: "../../../../src/include/utils/memutils.h", Line: 123, PID: 8627)
postgres: walsender pryzbyj [local] BASE_BACKUP(ExceptionalCondition+0xa0)[0x560b45d7b64b]
postgres: walsender pryzbyj [local] BASE_BACKUP(pfree+0x5d)[0x560b45dad1ea]
postgres: walsender pryzbyj [local] BASE_BACKUP(parse_bc_specification+0x154)[0x560b45dc5d4f]
postgres: walsender pryzbyj [local] BASE_BACKUP(+0x43d56c)[0x560b45bc556c]
postgres: walsender pryzbyj [local] BASE_BACKUP(SendBaseBackup+0x2d)[0x560b45bc85ca]
postgres: walsender pryzbyj [local] BASE_BACKUP(exec_replication_command+0x3a2)[0x560b45bdddb2]
postgres: walsender pryzbyj [local] BASE_BACKUP(PostgresMain+0x6b2)[0x560b45c39131]
postgres: walsender pryzbyj [local] BASE_BACKUP(+0x40530e)[0x560b45b8d30e]
postgres: walsender pryzbyj [local] BASE_BACKUP(+0x408572)[0x560b45b90572]
postgres: walsender pryzbyj [local] BASE_BACKUP(+0x4087b9)[0x560b45b907b9]
postgres: walsender pryzbyj [local] BASE_BACKUP(PostmasterMain+0x1135)[0x560b45b91d9b]
postgres: walsender pryzbyj [local] BASE_BACKUP(main+0x229)[0x560b45ad0f78]

This is interpreted like client-gzip-1; should multiple specifications of
compress be prohibited ?

| src/bin/pg_basebackup/pg_basebackup --wal-method fetch -Ft -D - -h /tmp --no-sync --no-manifest --compress=server-lz4 --compress=1
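[Editor's note] The stricter integer parsing being suggested might look roughly like this (a sketch only; returning a boolean sidesteps the problem that -1 is itself a valid zstd level):

#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

static bool
parse_int_strict(const char *value, int *result)
{
	char	   *endp;
	long		lval;

	errno = 0;
	lval = strtol(value, &endp, 10);

	/* require no overflow, at least one digit, and full consumption */
	if (errno != 0 || endp == value || *endp != '\0' ||
		lval < INT_MIN || lval > INT_MAX)
		return false;

	*result = (int) lval;
	return true;
}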
Thanks for the review! I'll address most of these comments later, but quickly for right now... On Thu, Mar 17, 2022 at 3:41 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > It'd be great if this were re-usable for wal_compression, which I hope in pg16 will > support at least level=N. And eventually pg_dump. But those clients shouldn't > accept a client/server prefix. Maybe the way to handle that is for those tools > to check locationres and reject it if it was specified. > [...] > This is confusingly similar to src/include/access/xlog.h:WalCompression. > I think someone else mentioned this before ? A couple of people before me have had delusions of grandeur in this area. We have the WalCompression enum, which has values of the form COMPRESSION_*, instead of WAL_COMPRESSION_*, as if the WAL were going to be the only thing that ever got compressed. And pg_dump.h also has a CompressionAlgorithm enum, with values like COMPR_ALG_*, which isn't great naming either. Clearly there's some cleanup needed here: if we can use the same enum for multiple systems, then it can have a name implying that it's the only game in town, but otherwise both the enum name and the corresponding value need to use a suitable prefix. I think that's a job for another patch, probably post-v15. For now I plan to do the right thing with the new names I'm adding, and leave the existing names alone. That can be changed in the future, if and when it seems sensible. As I said elsewhere, I think the WAL compression stuff is badly designed and should probably be rewritten completely, maybe to reuse the bbstreamer stuff. In that case, WalCompressionMethod would probably go away entirely, making the naming confusion moot, and picking up zstd and lz4 compression support for free. If that doesn't happen, we can probably find some way to at least make them share an enum, but I think that's too hairy to try to clean up right now with feature freeze pending. > The server crashes if I send an unknown option - you should hit that in the > regression tests. > > $ src/bin/pg_basebackup/pg_basebackup --wal-method fetch -Ft -D - -h /tmp --no-sync --no-manifest --compress=server-lz4:a|wc -c > TRAP: FailedAssertion("pointer != NULL", File: "../../../../src/include/utils/memutils.h", Line: 123, PID: 8627) > postgres: walsender pryzbyj [local] BASE_BACKUP(ExceptionalCondition+0xa0)[0x560b45d7b64b] > postgres: walsender pryzbyj [local] BASE_BACKUP(pfree+0x5d)[0x560b45dad1ea] > postgres: walsender pryzbyj [local] BASE_BACKUP(parse_bc_specification+0x154)[0x560b45dc5d4f] > postgres: walsender pryzbyj [local] BASE_BACKUP(+0x43d56c)[0x560b45bc556c] > postgres: walsender pryzbyj [local] BASE_BACKUP(SendBaseBackup+0x2d)[0x560b45bc85ca] > postgres: walsender pryzbyj [local] BASE_BACKUP(exec_replication_command+0x3a2)[0x560b45bdddb2] > postgres: walsender pryzbyj [local] BASE_BACKUP(PostgresMain+0x6b2)[0x560b45c39131] > postgres: walsender pryzbyj [local] BASE_BACKUP(+0x40530e)[0x560b45b8d30e] > postgres: walsender pryzbyj [local] BASE_BACKUP(+0x408572)[0x560b45b90572] > postgres: walsender pryzbyj [local] BASE_BACKUP(+0x4087b9)[0x560b45b907b9] > postgres: walsender pryzbyj [local] BASE_BACKUP(PostmasterMain+0x1135)[0x560b45b91d9b] > postgres: walsender pryzbyj [local] BASE_BACKUP(main+0x229)[0x560b45ad0f78] That's odd - I thought I had tested that case. Will double-check. > This is interpreted like client-gzip-1; should multiple specifications of > compress be prohibited ? 
>
> | src/bin/pg_basebackup/pg_basebackup --wal-method fetch -Ft -D - -h /tmp --no-sync --no-manifest --compress=server-lz4 --compress=1

They're not now and haven't been in the past. I think the last one should just win (as it apparently does, here). We do that in some places and throw an error in others and I'm not sure if we have a 100% consistent rule for it, but flipping one location between one behavior and the other isn't going to make things more consistent overall.

-- Robert Haas EDB: http://www.enterprisedb.com
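(As an aside, last-one-wins is also what falls out of the usual getopt_long() loop with no extra effort, since each occurrence of the switch simply overwrites the previous value. A generic sketch, not pg_basebackup's actual option handling:)

#include <getopt.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(int argc, char **argv)
{
	static struct option long_options[] = {
		{"compress", required_argument, NULL, 'Z'},
		{NULL, 0, NULL, 0}
	};
	char	   *compress = NULL;
	int			c;

	while ((c = getopt_long(argc, argv, "Z:", long_options, NULL)) != -1)
	{
		if (c == 'Z')
		{
			free(compress);
			compress = strdup(optarg);	/* a later --compress overwrites an earlier one */
		}
	}

	printf("effective --compress value: %s\n", compress ? compress : "(none)");
	free(compress);
	return 0;
}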
On Thu, Mar 17, 2022 at 3:41 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > gzip comma I think it's fine the way it's written. If we made that change, then we'd have a comma for gzip and not for the other two algorithms. Also, I'm just moving that sentence, so any change that there is to be made here is a job for some other patch. > alphabetical Fixed. > - errmsg("unrecognized compression algorithm: \"%s\"", > + errmsg("unrecognized compression algorithm \"%s\"", > > Most other places seem to say "compression method". So I'd suggest to change > that here, and in doc/src/sgml/ref/pg_basebackup.sgml. I'm not sure that's really better, and I don't think this patch is introducing an altogether novel usage. I think I would probably try to standardize on algorithm rather than method if I were standardizing the whole source tree, but I think we can leave that discussion for another time. > - if (o_compression_level && !o_compression) > + if (o_compression_detail && !o_compression) > ereport(ERROR, > (errcode(ERRCODE_SYNTAX_ERROR), > errmsg("compression level requires compression"))); > > s/level/detail/ Fixed. > It'd be great if this were re-usable for wal_compression, which I hope in pg16 will > support at least level=N. And eventually pg_dump. But those clients shouldn't > accept a client/server prefix. Maybe the way to handle that is for those tools > to check locationres and reject it if it was specified. One thing I forgot to mention in my previous response is that I think the parsing code is actually well set up for this the way I have it. server- and client- gets parsed off in a different place than we interpret the rest, which fits well with your observation that other cases wouldn't have a client or server prefix. > sp: sandwich Fixed. > star Fixed. > should be const ? OK. > > + /* As a special case, the specification can be a bare integer. */ > + bare_level = strtol(specification, &bare_level_endp, 10); > > Should this call expect_integer_value()? > See below. I don't think that would be useful. We have no keyword to pass for the error message, nor would we use the error message if one got constructed. > + result->parse_error = > + pstrdup("found empty string where a compression option was expected"); > > Needs to be localized with _() ? > Also, document that it's pstrdup'd. Did the latter. The former would need to be fixed in a bunch of places and while I'm happy to accept an expert opinion on exactly what needs to be done here, I don't want to try to do it and do it wrong. Better to let someone with good knowledge of the subject matter patch it up later than do a crummy job now. > -1 isn't great, since it's also an integer, and, also a valid compression level > for zstd (did you see my message about that?). Maybe INT_MIN is ok. It really doesn't matter. Could just return 42. The client shouldn't use the value if there's an error. > +{ > + int ivalue; > + char *ivalue_endp; > + > + ivalue = strtol(value, &ivalue_endp, 10); > > Should this also set/check errno ? > And check if value != ivalue_endp ? > See strtol(3) Even after reading the man page for strtol, it's not clear to me that this is needed. That page represents checking *endptr != '\0' as sufficient to tell whether an error occurred. Maybe it wouldn't catch an out of range value, but in practice all of the algorithms we support now and any we support in the future are going to catch something clamped to LONG_MIN or LONG_MAX as out of range and display the correct error message. What's your specific thinking here? 
> + unsigned options; /* OR of BACKUP_COMPRESSION_OPTION constants */ > > Should be "unsigned int" or "bits32" ? I do not see why either of those would be better. > The server crashes if I send an unknown option - you should hit that in the > regression tests. Turns out I was testing this on the client side but not the server side. Fixed and added more tests. v2 attached. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
Robert Haas <robertmhaas@gmail.com> writes: >> Should this also set/check errno ? >> And check if value != ivalue_endp ? >> See strtol(3) > Even after reading the man page for strtol, it's not clear to me that > this is needed. That page represents checking *endptr != '\0' as > sufficient to tell whether an error occurred. I'm not sure whose man page you looked at, but the POSIX standard [1] has a pretty clear opinion about this: Since 0, {LONG_MIN} or {LLONG_MIN}, and {LONG_MAX} or {LLONG_MAX} are returned on error and are also valid returns on success, an application wishing to check for error situations should set errno to 0, then call strtol() or strtoll(), then check errno. Checking *endptr != '\0' is for detecting whether there is trailing garbage after the number; which may be an error case or not as you choose, but it's a different matter. regards, tom lane [1] https://pubs.opengroup.org/onlinepubs/9699919799/
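(Spelled out in code, the POSIX recipe plus the trailing-garbage check looks roughly like this -- a generic sketch, not the patch's expect_integer_value():)

#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Parse STR as a long.  Returns true and stores the value in *RESULT on
 * success; returns false on empty input, trailing garbage, or overflow.
 */
static bool
parse_long_strict(const char *str, long *result)
{
	char	   *end;
	long		val;

	errno = 0;
	val = strtol(str, &end, 10);

	if (end == str)				/* no digits at all, including "" */
		return false;
	if (*end != '\0')			/* trailing garbage, e.g. "3abc" */
		return false;
	if (errno == ERANGE)		/* value was clamped to LONG_MIN/LONG_MAX */
		return false;

	*result = val;
	return true;
}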
On Sun, Mar 20, 2022 at 03:05:28PM -0400, Robert Haas wrote:
> On Thu, Mar 17, 2022 at 3:41 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > - errmsg("unrecognized compression algorithm: \"%s\"",
> > + errmsg("unrecognized compression algorithm \"%s\"",
> >
> > Most other places seem to say "compression method". So I'd suggest to change
> > that here, and in doc/src/sgml/ref/pg_basebackup.sgml.
>
> I'm not sure that's really better, and I don't think this patch is
> introducing an altogether novel usage. I think I would probably try to
> standardize on algorithm rather than method if I were standardizing
> the whole source tree, but I think we can leave that discussion for
> another time.

The user-facing docs are already standardized using "compression method", with 2 exceptions, of which one is contrib/ and the other is what I'm suggesting to make consistent here.

$ git grep 'compression algorithm' doc
doc/src/sgml/pgcrypto.sgml:        Which compression algorithm to use. Only available if
doc/src/sgml/ref/pg_basebackup.sgml:        compression algorithm is selected, or if server-side compression

> > + result->parse_error =
> > + pstrdup("found empty string where a compression option was expected");
> >
> > Needs to be localized with _() ?
> > Also, document that it's pstrdup'd.
>
> Did the latter. The former would need to be fixed in a bunch of places
> and while I'm happy to accept an expert opinion on exactly what needs
> to be done here, I don't want to try to do it and do it wrong. Better
> to let someone with good knowledge of the subject matter patch it up
> later than do a crummy job now.

I believe it just needs _("foo")
See git grep '= _('

I mentioned another issue off-list:

pg_basebackup.c:2741:10: warning: suggest parentheses around assignment used as truth value [-Wparentheses]
 2741 |  Assert(compressloc = COMPRESS_LOCATION_SERVER);
      |         ^~~~~~~~~~~
pg_basebackup.c:2741:3: note: in expansion of macro ‘Assert’
 2741 |   Assert(compressloc = COMPRESS_LOCATION_SERVER);

This crashes the server using your v2 patch:

src/bin/pg_basebackup/pg_basebackup --wal-method fetch -Ft -D - -h /tmp --no-sync --no-manifest --compress=server-zstd:level,|wc -c

I wonder whether the syntax should really use both ":" and ",". Maybe ":" isn't needed at all.

This patch also needs to update the other user-facing docs.

typo: contain a an

-- Justin
On Sun, Mar 20, 2022 at 3:11 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Even after reading the man page for strtol, it's not clear to me that > > this is needed. That page represents checking *endptr != '\0' as > > sufficient to tell whether an error occurred. > > I'm not sure whose man page you looked at, but the POSIX standard [1] > has a pretty clear opinion about this: > > Since 0, {LONG_MIN} or {LLONG_MIN}, and {LONG_MAX} or {LLONG_MAX} are > returned on error and are also valid returns on success, an > application wishing to check for error situations should set errno to > 0, then call strtol() or strtoll(), then check errno. > > Checking *endptr != '\0' is for detecting whether there is trailing > garbage after the number; which may be an error case or not as you > choose, but it's a different matter. I think I'm guilty of verbal inexactitude here but not bad coding. Checking for *endptr != '\0', as I did, is not sufficient to detect "whether an error occurred," as I alleged. But, in the part of my response you didn't quote, I believe I made it clear that I only need to detect garbage, not out-of-range values. And I think *endptr != '\0' will do that. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > I think I'm guilty of verbal inexactitude here but not bad coding. > Checking for *endptr != '\0', as I did, is not sufficient to detect > "whether an error occurred," as I alleged. But, in the part of my > response you didn't quote, I believe I made it clear that I only need > to detect garbage, not out-of-range values. And I think *endptr != > '\0' will do that. Hmm ... do you consider an empty string to be valid input? regards, tom lane
On Sun, Mar 20, 2022 at 3:40 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > The user-facing docs are already standardized using "compression method", with > 2 exceptions, of which one is contrib/ and the other is what I'm suggesting to > make consistent here. > > $ git grep 'compression algorithm' doc > doc/src/sgml/pgcrypto.sgml: Which compression algorithm to use. Only available if > doc/src/sgml/ref/pg_basebackup.sgml: compression algorithm is selected, or if server-side compression Well, if you just count the number of occurrences of each string in the documentation, sure. But all of the ones that are talking about a compression method seem to have to do with configurable TOAST compression, and the fact that the documentation for that feature is more extensive than for the pre-existing feature that refers to a compression algorithm does not, at least in my view, turn it into a project standard from which no deviation is permitted. > > Did the latter. The former would need to be fixed in a bunch of places > > and while I'm happy to accept an expert opinion on exactly what needs > > to be done here, I don't want to try to do it and do it wrong. Better > > to let someone with good knowledge of the subject matter patch it up > > later than do a crummy job now. > > I believe it just needs _("foo") > See git grep '= _(' Hmm. Maybe. > I mentioned another issue off-list: > pg_basebackup.c:2741:10: warning: suggest parentheses around assignment used as truth value [-Wparentheses] > 2741 | Assert(compressloc = COMPRESS_LOCATION_SERVER); > | ^~~~~~~~~~~ > pg_basebackup.c:2741:3: note: in expansion of macro ‘Assert’ > 2741 | Assert(compressloc = COMPRESS_LOCATION_SERVER); > > This crashes the server using your v2 patch: > > src/bin/pg_basebackup/pg_basebackup --wal-method fetch -Ft -D - -h /tmp --no-sync --no-manifest --compress=server-zstd:level,|wc -c Well that's unfortunate. Will fix. > I wonder whether the syntax should really use both ":" and ",". > Maybe ":" isn't needed at all. I don't think we should treat the compression method name in the same way as a compression algorithm option. > This patch also needs to update the other user-facing docs. Which ones exactly? > typo: contain a an OK, will fix. -- Robert Haas EDB: http://www.enterprisedb.com
On Sun, Mar 20, 2022 at 9:32 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > I think I'm guilty of verbal inexactitude here but not bad coding. > > Checking for *endptr != '\0', as I did, is not sufficient to detect > > "whether an error occurred," as I alleged. But, in the part of my > > response you didn't quote, I believe I made it clear that I only need > > to detect garbage, not out-of-range values. And I think *endptr != > > '\0' will do that. > > Hmm ... do you consider an empty string to be valid input? No, and I thought I had checked properly for that condition before reaching the point in the code where I call strtol(), but it turns out I have not, which I guess is what Justin has been trying to tell me for a few emails now. I'll send an updated patch tomorrow after looking this all over more carefully. -- Robert Haas EDB: http://www.enterprisedb.com
On Sun, Mar 20, 2022 at 09:38:44PM -0400, Robert Haas wrote:
> > This patch also needs to update the other user-facing docs.
>
> Which ones exactly?

I mean pg_basebackup -Z
>
> -Z level
> -Z [{client|server}-]method[:level]
> --compress=level
> --compress=[{client|server}-]method[:level]
On Mon, Mar 21, 2022 at 9:18 AM Justin Pryzby <pryzby@telsasoft.com> wrote: > On Sun, Mar 20, 2022 at 09:38:44PM -0400, Robert Haas wrote: > > > This patch also needs to update the other user-facing docs. > > > > Which ones exactly? > > I mean pg_basebackup -Z > > -Z level > -Z [{client|server}-]method[:level] > --compress=level > --compress=[{client|server}-]method[:level] Ah, right. Thanks. Here's v3. I have updated that section of the documentation. I also went and added a bunch more test cases for validation of compression detail strings, many inspired by your examples, and fixed all the bugs that I found in the process. I think the crashes you complained about are now fixed, but please let me know if I have missed any. I also added _() calls as you suggested. I searched for the "contain a an" typo that you mentioned but was not able to find it. Can you give me a more specific pointer? I looked a little bit more at the compression method vs. compression algorithm thing. I agree that there is some inconsistency in terminology here, but I'm still not sure that we are well-served by trying to make it totally uniform, especially if we pick the word "method" as the standard rather than "algorithm". In my opinion, "method" is less specific than "algorithm". If someone asks me to choose a compression algorithm, I know that I should give an answer like "lz4" or "zstd". If they ask me to pick a compression method, I'm not quite sure whether they want that kind of answer or whether they want something more detailed, like "use lz4 with compression level 3 and a 1MB block size". After all, that is (at least according to my understanding of how English works) a perfectly valid answer to the question "what method should I use to compress this data?" -- but not to the question "what algorithm should I use to compress this data?". The latter can ONLY be properly answered by saying something like "lz4". And I think that's really the root of my hesitation to make the kinds of changes you want here. If it's just a question of specifying a compression algorithm and a level, I don't think using the name "method" for the algorithm is going to be too bad. But as we enrich the system with multiple compression algorithms each of which may have multiple and different parameters, I think the whole thing becomes murkier and the need for precision in language goes up. Now that is of course an arguable position and you're welcome to disagree with it, but I think that's part of why I'm hesitating. Another part of it, at least for me, is that complete uniformity is not always a positive. I suppose all of us have had the experience at some point of reading a manual that says something like "to activate the boil water function, press and release the 'boil water' button" and rolled our eyes at how useless it was. It's important to me that we don't fall into that trap. We clearly don't want to go ballistic and have random inconsistencies in language for no reason, but at the same time, it's not useful to tell people that METHOD should be replaced with a compression method and LEVEL with a compression level. I mean, if you end up saying something like that interspersed with non-obvious information, that is OK, and I don't want to overstate the point I'm trying to make. But it seems to me that if there's a little variation in phrasing and we end up saying that METHOD means the compression algorithm or that ALGORITHM means the compression method or whatever, that can actually make things more clear. 
Here again it's debatable: how much variation in phraseology is helpful, and at what point does it just start to seem inconsistent? Well, everyone may have their own opinion. I'm not trying to pretend that this patch (or the existing code base) gets this all right. But I do think that, to the extent that we have a considered position on what to do here, we can make that change later, perhaps even after getting some user feedback on what does and does not make sense to other people. And I also think that what we end up doing here may well end up being more nuanced than a blanket search-and-replace. I'm not saying we couldn't make a blanket search-and-replace. I just don't see it as necessarily creating value, or being all that closely connected to the goal of this patch, which is to quickly clean up a forward-compatibility risk before we hit feature freeze. Thanks, -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
On Mon, Mar 21, 2022 at 12:57:36PM -0400, Robert Haas wrote: > > typo: contain a an > I searched for the "contain a an" typo that you mentioned but was not able to > find it. Can you give me a more specific pointer? Here: + * during parsing, and will otherwise contain a an appropriate error message. > I looked a little bit more at the compression method vs. compression > algorithm thing. I agree that there is some inconsistency in > terminology here, but I'm still not sure that we are well-served by > trying to make it totally uniform, especially if we pick the word > "method" as the standard rather than "algorithm". In my opinion, > "method" is less specific than "algorithm". If someone asks me to > choose a compression algorithm, I know that I should give an answer > like "lz4" or "zstd". If they ask me to pick a compression method, I'm > not quite sure whether they want that kind of answer or whether they > want something more detailed, like "use lz4 with compression level 3 > and a 1MB block size". After all, that is (at least according to my > understanding of how English works) a perfectly valid answer to the > question "what method should I use to compress this data?" -- but not > to the question "what algorithm should I use to compress this data?". > The latter can ONLY be properly answered by saying something like > "lz4". And I think that's really the root of my hesitation to make the > kinds of changes you want here. I think "algorithm" could be much more nuanced than "lz4", but I also think we've spent more than enough time on it now :) -- Justin
On Mon, Mar 21, 2022 at 2:22 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > + * during parsing, and will otherwise contain a an appropriate error message. OK, thanks. v4 attached. > I think "algorithm" could be much more nuanced than "lz4", but I also think > we've spent more than enough time on it now :) Oh dear. But yes. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
[ Changing subject line in the hopes of attracting more eyeballs. ]

On Mon, Mar 14, 2022 at 12:11 PM Dipesh Pandit <dipesh.pandit@gmail.com> wrote:
> I tried to implement support for parallel ZSTD compression.

Here's a new patch for this. It's more of a rewrite than an update, honestly; commit ffd53659c46a54a6978bcb8c4424c1e157a2c0f1 necessitated totally different options handling, but I also redid the test cases, the documentation, and the error message.

For those who may not have been following along, here's an executive summary: libzstd offers an option for parallel compression. It's intended to be transparent: you just say you want it, and the library takes care of it for you. Since we have the ability to do backup compression on either the client or the server side, we can expose this option in both locations. That would be cool, because it would allow for really fast backup compression with a good compression ratio. It would also mean that we would be, or really libzstd would be, spawning threads inside the PostgreSQL backend. Short of cats and dogs living together, it's hard to think of anything more terrifying, because the PostgreSQL backend is very much not thread-safe. However, a lot of the things we usually worry about when people make noises about using threads in the backend don't apply here, because the threads are hidden away behind libzstd interfaces and can't execute any PostgreSQL code. Therefore, I think it might be safe to just ... turn this on. One reason I think that is that this whole approach was recommended to me by Andres ... but that's not to say that there couldn't be problems. I worry a bit that the mere presence of threads could in some way mess things up, but I don't know what the mechanism for that would be, and I don't want to postpone shipping useful features based on nebulous fears.

In my ideal world, I'd like to push this into v15. I've done a lot of work to improve the backup code in this release, and this is actually a very small change yet one that potentially enables the project to get a lot more value out of the work that has already been committed. That said, I also don't want to break the world, so if you have an idea what this would break, please tell me.

For those curious as to how this affects performance and backup size, I loaded up the UK land registry database. That creates a 3769MB database. Then I backed it up using client-side compression and server-side compression using the various different algorithms that are supported in the master branch, plus parallel zstd.

no compression: 3.7GB, 9 seconds
gzip: 1.5GB, 140 seconds with server-side, 141 seconds with client-side
lz4: 2.0GB, 13 seconds with server-side, 12 seconds with client-side

For both parallel and non-parallel zstd compression, I see differences between the compressed size depending on where the compression is done. I don't know whether this is an expected behavior of the zstd library or a bug. Both files uncompress OK and pass pg_verifybackup, but that doesn't mean we're not, for example, selecting different compression levels where we shouldn't be. I'll try to figure out what's going on here.

zstd, client-side: 1.7GB, 17 seconds
zstd, server-side: 1.3GB, 25 seconds
parallel zstd, 4 workers, client-side: 1.7GB, 7.5 seconds
parallel zstd, 4 workers, server-side: 1.3GB, 7.2 seconds

Notice that compressing the backup with parallel zstd is actually faster than taking an uncompressed backup, even though this test is all being run on the same machine.
That's kind of crazy to me: the parallel compression is so fast that we save more time on I/O than we spend compressing. This assumes of course that you have plenty of CPU resources and limited I/O resources, which won't be true for everyone, but it's not an unusual situation. I think the documentation changes in this patch might not be quite up to scratch. I think there's a brewing problem here: as we add more compression options, whether or not that happens in this release, and regardless of what specific options we add, the way things are structured right now, we're going to end up either duplicating a bunch of stuff between the pg_basebackup documentation and the BASE_BACKUP documentation, or else one of those places is going to end up lacking information that someone reading it might like to have. I'm not exactly sure what to do about this, though. This patch contains a trivial adjustment to PostgreSQL::Test::Cluster::run_log to make it return a useful value instead of not. I think that should be pulled out and committed independently regardless of what happens to this patch overall, and possibly back-patched. Thanks, -- Robert Haas EDB: http://www.enterprisedb.com
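(For anyone who wants to see what "the library takes care of it for you" looks like at the API level, here is a bare-bones, standalone example of threaded streaming compression with libzstd. It is an illustration only, not the patch; in real code the ZSTD_CCtx_setParameter() calls should be error-checked too, since older libraries do not have ZSTD_c_nbWorkers:)

#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

/* Compress stdin to stdout, letting libzstd spawn its own worker threads. */
int
main(void)
{
	ZSTD_CCtx  *cctx = ZSTD_createCCtx();
	size_t		in_size = ZSTD_CStreamInSize();
	size_t		out_size = ZSTD_CStreamOutSize();
	char	   *inbuf = malloc(in_size);
	char	   *outbuf = malloc(out_size);
	size_t		nread;

	ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 3);
	/* 0 = single-threaded; >= 1 spawns that many compression workers. */
	ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, 4);

	while ((nread = fread(inbuf, 1, in_size, stdin)) > 0)
	{
		ZSTD_inBuffer input = {inbuf, nread, 0};

		while (input.pos < input.size)
		{
			ZSTD_outBuffer output = {outbuf, out_size, 0};
			size_t		ret = ZSTD_compressStream2(cctx, &output, &input,
												   ZSTD_e_continue);

			if (ZSTD_isError(ret))
			{
				fprintf(stderr, "zstd: %s\n", ZSTD_getErrorName(ret));
				return 1;
			}
			fwrite(outbuf, 1, output.pos, stdout);
		}
	}

	/* Finish the frame, draining whatever the workers still have buffered. */
	for (;;)
	{
		ZSTD_inBuffer empty = {NULL, 0, 0};
		ZSTD_outBuffer output = {outbuf, out_size, 0};
		size_t		remaining = ZSTD_compressStream2(cctx, &output, &empty,
													 ZSTD_e_end);

		if (ZSTD_isError(remaining))
		{
			fprintf(stderr, "zstd: %s\n", ZSTD_getErrorName(remaining));
			return 1;
		}
		fwrite(outbuf, 1, output.pos, stdout);
		if (remaining == 0)
			break;
	}

	ZSTD_freeCCtx(cctx);
	free(inbuf);
	free(outbuf);
	return 0;
}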
Attachment
Hi, On 2022-03-23 16:34:04 -0400, Robert Haas wrote: > Therefore, I think it might be safe to just ... turn this on. One reason I > think that is that this whole approach was recommended to me by Andres ... I didn't do a super careful analysis of the issues... But I do think it's pretty much the one case where it "should" be safe. The most likely source of problem would errors thrown while zstd threads are alive. Should make sure that that can't happen. What is the lifetime of the threads zstd spawns? Are they tied to a single compression call? A single ZSTD_createCCtx()? If the latter, how bulletproof is our code ensuring that we don't leak such contexts? If they're short-lived, are we compressing large enough batches to not waste a lot of time starting/stopping threads? > but that's not to say that there couldn't be problems. I worry a bit that > the mere presence of threads could in some way mess things up, but I don't > know what the mechanism for that would be, and I don't want to postpone > shipping useful features based on nebulous fears. One thing that'd be good to tests for is cancelling in-progress server-side compression. And perhaps a few assertions that ensure that we don't escape with some threads still running. That'd have to be platform dependent, but I don't see a problem with that in this case. > For both parallel and non-parallel zstd compression, I see differences > between the compressed size depending on where the compression is > done. I don't know whether this is an expected behavior of the zstd > library or a bug. Both files uncompress OK and pass pg_verifybackup, > but that doesn't mean we're not, for example, selecting different > compression levels where we shouldn't be. I'll try to figure out > what's going on here. > > zstd, client-side: 1.7GB, 17 seconds > zstd, server-side: 1.3GB, 25 seconds > parallel zstd, 4 workers, client-side: 1.7GB, 7.5 seconds > parallel zstd, 4 workers, server-side: 1.3GB, 7.2 seconds What causes this fairly massive client-side/server-side size difference? > + /* > + * We check for failure here because (1) older versions of the library > + * do not support ZSTD_c_nbWorkers and (2) the library might want to > + * reject unreasonable values (though in practice it does not seem to do > + * so). > + */ > + ret = ZSTD_CCtx_setParameter(streamer->cctx, ZSTD_c_nbWorkers, > + compress->workers); > + if (ZSTD_isError(ret)) > + { > + pg_log_error("could not set compression worker count to %d: %s", > + compress->workers, ZSTD_getErrorName(ret)); > + exit(1); > + } Will this cause test failures on systems with older zstd? Greetings, Andres Freund
+	 * We check for failure here because (1) older versions of the library
+	 * do not support ZSTD_c_nbWorkers and (2) the library might want to
+	 * reject an unreasonable values (though in practice it does not seem to do
+	 * so).
+	 */
+	ret = ZSTD_CCtx_setParameter(mysink->cctx, ZSTD_c_nbWorkers,
+								 mysink->workers);
+	if (ZSTD_isError(ret))
+		ereport(ERROR,
+				errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				errmsg("could not set compression worker count to %d: %s",
+					   mysink->workers, ZSTD_getErrorName(ret)));

Also because the library may not be compiled with threading. A few days ago, I tried to rebase the original "parallel workers" patch over the COMPRESS DETAIL patch but then couldn't test it, even after trying various versions of the zstd package and trying to compile it locally. I'll try again soon...

I think you should also test the return value when setting the compress level. Not only because it's generally a good idea, but also because I suggested to support negative compression levels. Which weren't allowed before v1.3.4, and then the range is only defined since 1.3.6 (ZSTD_minCLevel). At some point, the range may have been -7..22 but now it's -131072..22.

lib/compress/zstd_compress.c:int ZSTD_minCLevel(void) { return (int)-ZSTD_TARGETLENGTH_MAX; }
lib/zstd.h:#define ZSTD_TARGETLENGTH_MAX ZSTD_BLOCKSIZE_MAX
lib/zstd.h:#define ZSTD_BLOCKSIZE_MAX (1<<ZSTD_BLOCKSIZELOG_MAX)
lib/zstd.h:#define ZSTD_BLOCKSIZELOG_MAX 17
; -1<<17
-131072
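(To make the suggestion concrete: on the server side, checking both the range and the return value could look roughly like the sketch below. This assumes the backend's elog() and a zstd new enough to have ZSTD_minCLevel(), i.e. >= 1.3.6; it is not the patch's actual code:)

#include "postgres.h"

#include <zstd.h>

/*
 * Illustrative only: set the compression level on a context, taking the
 * legal range from the library itself instead of hard-coding 1..22.
 */
static void
set_zstd_level(ZSTD_CCtx *cctx, int level)
{
	size_t		ret;

	if (level < ZSTD_minCLevel() || level > ZSTD_maxCLevel())
		elog(ERROR, "zstd compression level %d is out of range %d..%d",
			 level, ZSTD_minCLevel(), ZSTD_maxCLevel());

	ret = ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, level);
	if (ZSTD_isError(ret))
		elog(ERROR, "could not set compression level to %d: %s",
			 level, ZSTD_getErrorName(ret));
}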
Attachment
On Wed, Mar 23, 2022 at 5:14 PM Andres Freund <andres@anarazel.de> wrote:
> The most likely source of problem would errors thrown while zstd threads are
> alive. Should make sure that that can't happen.
>
> What is the lifetime of the threads zstd spawns? Are they tied to a single
> compression call? A single ZSTD_createCCtx()? If the latter, how bulletproof
> is our code ensuring that we don't leak such contexts?

I haven't found any real documentation explaining how libzstd manages its threads. I am assuming that it is tied to the ZSTD_CCtx, but I don't know. I guess I could try to figure it out from the source code. Anyway, what we have now is a PG_TRY()/PG_CATCH() block around the code that uses the bbsink which will cause bbsink_zstd_cleanup() to get called in the event of an error. That will do ZSTD_freeCCtx().

It's probably also worth mentioning here that even if, contrary to expectations, the compression threads hang around to the end of time and chill, in practice nobody is likely to run BASE_BACKUP and then keep the connection open for a long time afterward. So it probably wouldn't really affect resource utilization in real-world scenarios even if the threads never exited, as long as they didn't, you know, busy-loop in the background. And I assume the actual library behavior can't be nearly that bad. This is a pretty mainstream piece of software.

> If they're short-lived, are we compressing large enough batches to not waste a
> lot of time starting/stopping threads?

Well, we're using a single ZSTD_CCtx for an entire base backup. Again, I haven't found documentation explaining what libzstd is actually doing, but it's hard to see how we could make the batch any bigger than that. The context gets reset for each new tablespace, which may or may not do anything to the compression threads.

> > but that's not to say that there couldn't be problems. I worry a bit that
> > the mere presence of threads could in some way mess things up, but I don't
> > know what the mechanism for that would be, and I don't want to postpone
> > shipping useful features based on nebulous fears.
>
> One thing that'd be good to tests for is cancelling in-progress server-side
> compression. And perhaps a few assertions that ensure that we don't escape
> with some threads still running. That'd have to be platform dependent, but I
> don't see a problem with that in this case.

More specific suggestions, please?

> > For both parallel and non-parallel zstd compression, I see differences
> > between the compressed size depending on where the compression is
> > done. I don't know whether this is an expected behavior of the zstd
> > library or a bug. Both files uncompress OK and pass pg_verifybackup,
> > but that doesn't mean we're not, for example, selecting different
> > compression levels where we shouldn't be. I'll try to figure out
> > what's going on here.
> >
> > zstd, client-side: 1.7GB, 17 seconds
> > zstd, server-side: 1.3GB, 25 seconds
> > parallel zstd, 4 workers, client-side: 1.7GB, 7.5 seconds
> > parallel zstd, 4 workers, server-side: 1.3GB, 7.2 seconds
>
> What causes this fairly massive client-side/server-side size difference?

You seem not to have read what I wrote about this exact point in the text which you quoted.

> Will this cause test failures on systems with older zstd?

I put a bunch of logic in the test case to try to avoid that, so hopefully not, but if it does, we can adjust the logic.

-- Robert Haas EDB: http://www.enterprisedb.com
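(For anyone who hasn't read the patch, the error-cleanup shape being described is essentially the standard PG_TRY() pattern sketched below, with invented names; in the real code the freeing happens via the bbsink_zstd_cleanup() callback rather than inline:)

#include "postgres.h"

#include <zstd.h>

/*
 * Sketch: tie the ZSTD_CCtx (and with it any worker threads) to error
 * recovery, so an ERROR thrown mid-backup still frees the context.
 */
static void
compress_with_cleanup(void)
{
	ZSTD_CCtx  *cctx = ZSTD_createCCtx();

	if (cctx == NULL)
		elog(ERROR, "could not create zstd compression context");

	PG_TRY();
	{
		/* ... feed archive data through ZSTD_compressStream2() here ... */
	}
	PG_CATCH();
	{
		ZSTD_freeCCtx(cctx);
		PG_RE_THROW();
	}
	PG_END_TRY();

	ZSTD_freeCCtx(cctx);
}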
On Wed, Mar 23, 2022 at 5:52 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > Also because the library may not be compiled with threading. A few days ago, I > tried to rebase the original "parallel workers" patch over the COMPRESS DETAIL > patch but then couldn't test it, even after trying various versions of the zstd > package and trying to compile it locally. I'll try again soon... Ah. Right, I can update the comment to mention that. > I think you should also test the return value when setting the compress level. > Not only because it's generally a good idea, but also because I suggested to > support negative compression levels. Which weren't allowed before v1.3.4, and > then the range is only defined since 1.3.6 (ZSTD_minCLevel). At some point, > the range may have been -7..22 but now it's -131072..22. Yeah, I was thinking that might be a good change. It would require adjusting some other code though, because right now only compression levels 1..22 are accepted anyhow. > lib/compress/zstd_compress.c:int ZSTD_minCLevel(void) { return (int)-ZSTD_TARGETLENGTH_MAX; } > lib/zstd.h:#define ZSTD_TARGETLENGTH_MAX ZSTD_BLOCKSIZE_MAX > lib/zstd.h:#define ZSTD_BLOCKSIZE_MAX (1<<ZSTD_BLOCKSIZELOG_MAX) > lib/zstd.h:#define ZSTD_BLOCKSIZELOG_MAX 17 > ; -1<<17 > -131072 So does that, like, compress the value by making it way bigger? :-) -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Mar 23, 2022 at 04:34:04PM -0400, Robert Haas wrote:
> be, spawning threads inside the PostgreSQL backend. Short of cats and
> dogs living together, it's hard to think of anything more terrifying,
> because the PostgreSQL backend is very much not thread-safe. However,
> a lot of the things we usually worry about when people make noises
> about using threads in the backend don't apply here, because the
> threads are hidden away behind libzstd interfaces and can't execute
> any PostgreSQL code. Therefore, I think it might be safe to just ...
> turn this on. One reason I think that is that this whole approach was
> recommended to me by Andres ... but that's not to say that there
> couldn't be problems. I worry a bit that the mere presence of threads
> could in some way mess things up, but I don't know what the mechanism
> for that would be, and I don't want to postpone shipping useful
> features based on nebulous fears.

Note that the PGDG .RPMs and .DEBs are already linked with pthread, via libxml => liblzma.

$ ldd /usr/pgsql-14/bin/postgres |grep xm
	libxml2.so.2 => /lib64/libxml2.so.2 (0x00007faab984e000)
$ objdump -p /lib64/libxml2.so.2 |grep NEED
  NEEDED               libdl.so.2
  NEEDED               libz.so.1
  NEEDED               liblzma.so.5
  NEEDED               libm.so.6
  NEEDED               libc.so.6
  VERNEED              0x0000000000019218
  VERNEEDNUM           0x0000000000000005
$ objdump -p /lib64/liblzma.so.5 |grep NEED
  NEEDED               libpthread.so.0

Did you try this on windows at all ? It's probably no surprise that zstd implements threading differently there.
Hi,

On 2022-03-23 18:31:12 -0400, Robert Haas wrote:
> On Wed, Mar 23, 2022 at 5:14 PM Andres Freund <andres@anarazel.de> wrote:
> > The most likely source of problem would errors thrown while zstd threads are
> > alive. Should make sure that that can't happen.
> >
> > What is the lifetime of the threads zstd spawns? Are they tied to a single
> > compression call? A single ZSTD_createCCtx()? If the latter, how bulletproof
> > is our code ensuring that we don't leak such contexts?
>
> I haven't found any real documentation explaining how libzstd manages
> its threads. I am assuming that it is tied to the ZSTD_CCtx, but I
> don't know. I guess I could try to figure it out from the source code.

I found the following section in the manual [1]:

    ZSTD_c_nbWorkers=400,    /* Select how many threads will be spawned to compress in parallel.
                              * When nbWorkers >= 1, triggers asynchronous mode when invoking ZSTD_compressStream*() :
                              * ZSTD_compressStream*() consumes input and flush output if possible, but immediately gives back control to caller,
                              * while compression is performed in parallel, within worker thread(s).
                              * (note : a strong exception to this rule is when first invocation of ZSTD_compressStream2() sets ZSTD_e_end :
                              * in which case, ZSTD_compressStream2() delegates to ZSTD_compress2(), which is always a blocking call).
                              * More workers improve speed, but also increase memory usage.
                              * Default value is `0`, aka "single-threaded mode" : no worker is spawned,
                              * compression is performed inside Caller's thread, and all invocations are blocking */

"ZSTD_compressStream*() consumes input ... immediately gives back control" pretty much confirms that.

Do we care about zstd's memory usage here? I think it's OK to mostly ignore work_mem/maintenance_work_mem here, but I could also see limiting concurrency so that estimated memory usage would fit into work_mem/maintenance_work_mem.

> It's probably also worth mentioning here that even if, contrary to
> expectations, the compression threads hang around to the end of time
> and chill, in practice nobody is likely to run BASE_BACKUP and then
> keep the connection open for a long time afterward. So it probably
> wouldn't really affect resource utilization in real-world scenarios
> even if the threads never exited, as long as they didn't, you know,
> busy-loop in the background. And I assume the actual library behavior
> can't be nearly that bad. This is a pretty mainstream piece of
> software.

I'm not really worried about resource utilization, more about the existence of threads moving us into undefined behaviour territory or such. I don't think that's possible, but it's IIRC UB to fork() while threads are present and do pretty much *anything* other than immediately exec*().

> > > but that's not to say that there couldn't be problems. I worry a bit that
> > > the mere presence of threads could in some way mess things up, but I don't
> > > know what the mechanism for that would be, and I don't want to postpone
> > > shipping useful features based on nebulous fears.
> >
> > One thing that'd be good to tests for is cancelling in-progress server-side
> > compression. And perhaps a few assertions that ensure that we don't escape
> > with some threads still running. That'd have to be platform dependent, but I
> > don't see a problem with that in this case.
>
> More specific suggestions, please?

I was thinking of doing something like calling pthread_is_threaded_np() before and after the zstd section and erroring out if they differ. But I forgot that that's a mac-ism.
> > > For both parallel and non-parallel zstd compression, I see differences > > > between the compressed size depending on where the compression is > > > done. I don't know whether this is an expected behavior of the zstd > > > library or a bug. Both files uncompress OK and pass pg_verifybackup, > > > but that doesn't mean we're not, for example, selecting different > > > compression levels where we shouldn't be. I'll try to figure out > > > what's going on here. > > > > > > zstd, client-side: 1.7GB, 17 seconds > > > zstd, server-side: 1.3GB, 25 seconds > > > parallel zstd, 4 workers, client-side: 1.7GB, 7.5 seconds > > > parallel zstd, 4 workers, server-side: 1.3GB, 7.2 seconds > > > > What causes this fairly massive client-side/server-side size difference? > > You seem not to have read what I wrote about this exact point in the > text which you quoted. Somehow not... Perhaps it's related to the amounts of memory fed to ZSTD_compressStream2() in one invocation? I recall that there's some differences between basebackup client / serverside around buffer sizes - but that's before all the recent-ish changes... Greetings, Andres Freund [1] http://facebook.github.io/zstd/zstd_manual.html
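(Sketching the check that was floated -- and only that; whether the indicator ever drops back once the workers have exited is exactly the sort of thing that would need verifying -- it would look something like this, compiled only where pthread_is_threaded_np() exists, as postmaster.c already does:)

#include "postgres.h"

#ifdef HAVE_PTHREAD_IS_THREADED_NP
#include <pthread.h>
#endif

/*
 * Remember whether the process was already threaded, run the zstd section,
 * and complain if that changed.  pthread_is_threaded_np() is a macOS-only
 * interface, hence the #ifdef.
 */
static void
run_zstd_section_with_thread_check(void (*zstd_section) (void))
{
#ifdef HAVE_PTHREAD_IS_THREADED_NP
	int			was_threaded = pthread_is_threaded_np();
#endif

	zstd_section();

#ifdef HAVE_PTHREAD_IS_THREADED_NP
	if (pthread_is_threaded_np() != was_threaded)
		elog(ERROR, "zstd compression left worker threads running");
#endif
}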
On 2022-03-23 18:07:01 -0500, Justin Pryzby wrote: > Did you try this on windows at all ? Really should get zstd installed in the windows cf environment... > It's probably no surprise that zstd implements threading differently there. Worth noting that we have a few of our own threads running on windows already - so we're guaranteed to build against the threaded standard libraries etc already.
On Wed, Mar 23, 2022 at 7:07 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > Did you try this on windows at all ? It's probably no surprise that zstd > implements threading differently there. I did not. I haven't had a properly functioning Windows development environment in about a decade. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Mar 23, 2022 at 7:31 PM Andres Freund <andres@anarazel.de> wrote:
> I found the following section in the manual [1]:
>
> ZSTD_c_nbWorkers=400, /* Select how many threads will be spawned to compress in parallel.
> * When nbWorkers >= 1, triggers asynchronous mode when invoking ZSTD_compressStream*() :
> * ZSTD_compressStream*() consumes input and flush output if possible, but immediately gives back control to caller,
> * while compression is performed in parallel, within worker thread(s).
> * (note : a strong exception to this rule is when first invocation of ZSTD_compressStream2() sets ZSTD_e_end :
> * in which case, ZSTD_compressStream2() delegates to ZSTD_compress2(), which is always a blocking call).
> * More workers improve speed, but also increase memory usage.
> * Default value is `0`, aka "single-threaded mode" : no worker is spawned,
> * compression is performed inside Caller's thread, and all invocations are blocking */
>
> "ZSTD_compressStream*() consumes input ... immediately gives back control"
> pretty much confirms that.

I saw that too, but I didn't consider it conclusive. It would be nice if their documentation had a bit more detail on what's really happening.

> Do we care about zstd's memory usage here? I think it's OK to mostly ignore
> work_mem/maintenance_work_mem here, but I could also see limiting concurrency
> so that estimated memory usage would fit into work_mem/maintenance_work_mem.

I think it's possible that we want to do nothing and possible that we want to do something, but I think it's very unlikely that the thing we want to do is related to maintenance_work_mem. Say we soft-cap the compression level to the one which we think will fit within maintenance_work_mem. I think the most likely outcome is that people will not get the compression level they request and be confused about why that has happened. It also seems possible that we'll be wrong about how much memory will be used - say, because somebody changes the library behavior in a new release - and will limit it to the wrong level.

If we're going to do anything here, I think it should be to limit based on the compression level itself and not based on how much memory we think that level will use. But that leaves the question of whether we should even try to impose some kind of limit, and there I'm not sure. It feels like it might be overengineered, because we're only talking about users who have replication privileges, and if those accounts are subverted there are big problems anyway. I think if we imposed a governance system here it would get very little use. On the other hand, I think that the higher zstd compression levels of 20+ can actually use a ton of memory, so we might want to limit access to those somehow. Apparently on the command line you have to say --ultra -- not sure if there's a corresponding API call or if that's a guard that's built specifically into the CLI.

> Perhaps it's related to the amounts of memory fed to ZSTD_compressStream2() in
> one invocation? I recall that there's some differences between basebackup
> client / serverside around buffer sizes - but that's before all the recent-ish
> changes...

That thought occurred to me too but I haven't investigated yet.

-- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Mar 23, 2022 at 5:52 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > I think you should also test the return value when setting the compress level. > Not only because it's generally a good idea, but also because I suggested to > support negative compression levels. Which weren't allowed before v1.3.4, and > then the range is only defined since 1.3.6 (ZSTD_minCLevel). At some point, > the range may have been -7..22 but now it's -131072..22. Hi, The attached patch fixes a few goofs around backup compression. It adds a check that setting the compression level succeeds, although it does not allow the broader range of compression levels Justin notes above. That can be done separately, I guess, if we want to do it. It also fixes the problem that client and server-side zstd compression don't actually compress equally well; that turned out to be a bug in the handling of compression options. Finally it adds an exit call to an unlikely failure case so that we would, if that case should occur, print a message and exit, rather than the current behavior of printing a message and then dereferencing a null pointer. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
On Wed, Mar 23, 2022 at 06:57:04PM -0400, Robert Haas wrote:
> On Wed, Mar 23, 2022 at 5:52 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > Also because the library may not be compiled with threading. A few days ago, I
> > tried to rebase the original "parallel workers" patch over the COMPRESS DETAIL
> > patch but then couldn't test it, even after trying various versions of the zstd
> > package and trying to compile it locally. I'll try again soon...
>
> Ah. Right, I can update the comment to mention that.

Actually, I suggest to remove those comments:
| "We check for failure here because..."

That should be the rule rather than the exception, so shouldn't require justifying why one might check the return value of library and system calls.

In bbsink_zstd_new(), I think you need to check to see if workers were requested (same as the issue you found with "level"). If someone builds against a version of zstd which doesn't support some parameter, you'll currently call SetParameter with that flag anyway, with a default value. That's not currently breaking anything for me (even though workers=N doesn't work) but I think it's fragile and could break, maybe when compiled against an old zstd, or with future options. SetParameter should only be called when the user requested to set the parameter. I handled that for workers in 003, but didn't touch "level", which is probably fine, but maybe should change for consistency.

src/backend/replication/basebackup_zstd.c: elog(ERROR, "could not set zstd compression level to %d: %s",
src/bin/pg_basebackup/bbstreamer_gzip.c: pg_log_error("could not set compression level %d: %s",
src/bin/pg_basebackup/bbstreamer_zstd.c: pg_log_error("could not set compression level to: %d: %s",

I'm not sure why these messages sometimes mention the current compression method and sometimes don't. I suggest that they shouldn't - errcontext will have the algorithm, and the user already specified it anyway. It'd allow the compiler to merge strings.

Here's a patch for zstd --long mode. (I don't actually use pg_basebackup, but I will want to use long mode with pg_dump). The "strategy" params may also be interesting, but I haven't played with it.

rsyncable is certainly interesting, but currently an experimental, nonpublic interface - and a good example of why to not call SetParameter for params which the user didn't specify: PGDG might eventually compile postgres against a zstd which supports the rsyncable flag. And someone might install somewhere which doesn't support rsyncable, but the server would try to call SetParameter(rsyncable, 0), and the rsyncable ID number would've changed, so zstd would probably reject it, and basebackup would be unusable...

$ time src/bin/pg_basebackup/pg_basebackup -h /tmp -Ft -D- --wal-method=none --no-manifest -Z zstd:long=1 --checkpoint fast|wc -c
4625935

real    0m1,334s

$ time src/bin/pg_basebackup/pg_basebackup -h /tmp -Ft -D- --wal-method=none --no-manifest -Z zstd:long=0 --checkpoint fast|wc -c
8426516

real    0m0,880s
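(To illustrate the "only call SetParameter for what the user asked for" point together with long mode, one possible shape is sketched below. The options struct is invented, and it assumes a libzstd new enough to expose ZSTD_c_enableLongDistanceMatching and ZSTD_c_windowLog as public parameters -- it is not the attached patch:)

#include "postgres.h"

#include <zstd.h>

/* Invented option struct, for illustration only. */
typedef struct zstd_opts
{
	bool		level_set;
	int			level;
	bool		workers_set;
	int			workers;
	bool		long_set;
	int			long_window_log;	/* 0 means "library default" */
} zstd_opts;

/* Apply only the parameters the user explicitly requested. */
static void
apply_zstd_options(ZSTD_CCtx *cctx, const zstd_opts *opts)
{
	size_t		ret;

	if (opts->level_set)
	{
		ret = ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, opts->level);
		if (ZSTD_isError(ret))
			elog(ERROR, "could not set compression level to %d: %s",
				 opts->level, ZSTD_getErrorName(ret));
	}

	if (opts->workers_set)
	{
		ret = ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, opts->workers);
		if (ZSTD_isError(ret))
			elog(ERROR, "could not set compression worker count to %d: %s",
				 opts->workers, ZSTD_getErrorName(ret));
	}

	if (opts->long_set)
	{
		ret = ZSTD_CCtx_setParameter(cctx, ZSTD_c_enableLongDistanceMatching, 1);
		if (ZSTD_isError(ret))
			elog(ERROR, "could not enable long-distance matching: %s",
				 ZSTD_getErrorName(ret));
		if (opts->long_window_log != 0)
		{
			ret = ZSTD_CCtx_setParameter(cctx, ZSTD_c_windowLog, opts->long_window_log);
			if (ZSTD_isError(ret))
				elog(ERROR, "could not set window log to %d: %s",
					 opts->long_window_log, ZSTD_getErrorName(ret));
		}
	}
}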
Attachment
On Fri, Mar 25, 2022 at 9:23 AM Dipesh Pandit <dipesh.pandit@gmail.com> wrote: > The changes look good to me. Thanks. Committed. -- Robert Haas EDB: http://www.enterprisedb.com
On Sun, Mar 27, 2022 at 4:50 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> Actually, I suggest to remove those comments:
> | "We check for failure here because..."
>
> That should be the rule rather than the exception, so shouldn't require
> justifying why one might check the return value of library and system calls.

I went for modifying the comment rather than removing it. I agree with you that checking for failure doesn't really require justification, but I think that in a case like this it is useful to explain what we know about why it might fail.

> In bbsink_zstd_new(), I think you need to check to see if workers were
> requested (same as the issue you found with "level").

Fixed.

> src/backend/replication/basebackup_zstd.c: elog(ERROR, "could not set zstd compression level to %d: %s",
> src/bin/pg_basebackup/bbstreamer_gzip.c: pg_log_error("could not set compression level %d: %s",
> src/bin/pg_basebackup/bbstreamer_zstd.c: pg_log_error("could not set compression level to: %d: %s",
>
> I'm not sure why these messages sometimes mention the current compression
> method and sometimes don't. I suggest that they shouldn't - errcontext will
> have the algorithm, and the user already specified it anyway. It'd allow the
> compiler to merge strings.

I don't think that errcontext() helps here. On the client side, it doesn't exist. On the server side, it's not in use. I do see STATEMENT: <whatever> in the server log when a replication command throws a server-side error, which is similar, but pg_basebackup doesn't display that STATEMENT line.

I don't really know how to balance the legitimate desire for fewer messages against the also-legitimate desire for clarity about where things are failing. I'm slightly inclined to think that including the algorithm name is better, because options are in the end algorithm-specific, but it's certainly debatable. I would be interested in hearing other opinions...

Here's an updated and rebased version of my patch.

-- Robert Haas EDB: http://www.enterprisedb.com
Attachment
On Mon, Mar 28, 2022 at 12:57 PM Robert Haas <robertmhaas@gmail.com> wrote: > Here's an updated and rebased version of my patch. Well, that only updated the comment on the client side. Let's try again. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
On Mon, Mar 28, 2022 at 4:53 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > I suggest to write it differently, as in 0002. That doesn't seem better to me. What's the argument for it? -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Mar 28, 2022 at 05:39:31PM -0400, Robert Haas wrote:
> On Mon, Mar 28, 2022 at 4:53 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > I suggest to write it differently, as in 0002.
>
> That doesn't seem better to me. What's the argument for it?

I find this much easier to understand:

 		/* If we got an error or have reached the end of the string, stop. */
-		if (result->parse_error != NULL || *kwend == '\0' || *vend == '\0')
+		if (result->parse_error != NULL)
+			break;
+		if (*kwend == '\0')
+			break;
+		if (vend != NULL && *vend == '\0')
 			break;

than

 		/* If we got an error or have reached the end of the string, stop. */
-		if (result->parse_error != NULL || *kwend == '\0' || *vend == '\0')
+		if (result->parse_error != NULL ||
+			(vend == NULL ? *kwend == '\0' : *vend == '\0'))

Also, why wouldn't *kwend be checked in any case ?
Justin Pryzby <pryzby@telsasoft.com> writes: > Also, why wouldn't *kwend be checked in any case ? I suspect Robert wrote it that way intentionally --- but if so, I agree it could do with more than zero commentary. regards, tom lane
On Mon, Mar 28, 2022 at 8:11 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I suspect Robert wrote it that way intentionally --- but if so, > I agree it could do with more than zero commentary. Well, the point is, we stop advancing kwend when we get to the end of the keyword, and *vend when we get to the end of the value. If there's a value, the end of the keyword can't have been the end of the string, but the end of the value might have been. If there's no value, the end of the keyword could be the end of the string. Maybe if I just put that last sentence into the comment it's clear enough? -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Mar 29, 2022 at 8:51 AM Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Mar 28, 2022 at 8:11 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I suspect Robert wrote it that way intentionally --- but if so, > > I agree it could do with more than zero commentary. > > Well, the point is, we stop advancing kwend when we get to the end of > the keyword, and *vend when we get to the end of the value. If there's > a value, the end of the keyword can't have been the end of the string, > but the end of the value might have been. If there's no value, the end > of the keyword could be the end of the string. > > Maybe if I just put that last sentence into the comment it's clear enough? Done that way, since I thought it was better to fix the bug than wait for more feedback on the wording. We can still adjust the wording, or the coding, if it's not clear enough. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: >> Maybe if I just put that last sentence into the comment it's clear enough? > Done that way, since I thought it was better to fix the bug than wait > for more feedback on the wording. We can still adjust the wording, or > the coding, if it's not clear enough. FWIW, I thought that explanation was fine, but I was deferring to Justin who was the one who thought things were unclear. regards, tom lane
On Wed, Mar 30, 2022 at 04:14:47PM -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > >> Maybe if I just put that last sentence into the comment it's clear enough? > > > Done that way, since I thought it was better to fix the bug than wait > > for more feedback on the wording. We can still adjust the wording, or > > the coding, if it's not clear enough. > > FWIW, I thought that explanation was fine, but I was deferring to > Justin who was the one who thought things were unclear. I still think it's unnecessarily confusing to nest "if" and "?:" conditionals in one statement, instead of 2 or 3 separate "if"s, or "||"s. But it's also not worth fussing over any more.
On Thu, Mar 23, 2023 at 2:50 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> In rem: commit 3500ccc3,
>
> for X in ` grep -E '^[^*]+event_name = "'
> src/backend/utils/activity/wait_event.c |
> sed 's/^.* = "//;s/";$//;/unknown/d' `
> do
> if ! git grep "$X" doc/src/sgml/monitoring.sgml > /dev/null
> then
> echo "$X is not documented"
> fi
> done
>
> BaseBackupSync is not documented
> BaseBackupWrite is not documented

[Resending with trimmed CC: list, because the mailing list told me to due to a blocked account, sorry if you already got the above.]
On Wed, Mar 22, 2023 at 10:09 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > BaseBackupSync is not documented > > BaseBackupWrite is not documented > > [Resending with trimmed CC: list, because the mailing list told me to > due to a blocked account, sorry if you already got the above.] Bummer. I'll write a patch to fix that tomorrow, unless somebody beats me to it. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Mar 23, 2023 at 4:11 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Mar 22, 2023 at 10:09 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > > BaseBackupSync is not documented > > > BaseBackupWrite is not documented > > > > [Resending with trimmed CC: list, because the mailing list told me to > > due to a blocked account, sorry if you already got the above.] > > Bummer. I'll write a patch to fix that tomorrow, unless somebody beats me to it. Here's a patch for that, and a patch to add the missing error check Peter noticed. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
On Fri, Mar 24, 2023 at 10:46:37AM -0400, Robert Haas wrote: > On Thu, Mar 23, 2023 at 4:11 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Mar 22, 2023 at 10:09 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > > > BaseBackupSync is not documented > > > > BaseBackupWrite is not documented > > > > > > [Resending with trimmed CC: list, because the mailing list told me to > > > due to a blocked account, sorry if you already got the above.] > > > > Bummer. I'll write a patch to fix that tomorrow, unless somebody beats me to it. > > Here's a patch for that, and a patch to add the missing error check > Peter noticed. I think these maybe got forgotten ?
On Wed, Apr 12, 2023 at 10:57 AM Justin Pryzby <pryzby@telsasoft.com> wrote: > I think these maybe got forgotten ? Committed. -- Robert Haas EDB: http://www.enterprisedb.com