Thread: wal-g (https://github.com/wal-g/wal-g) reliability

wal-g (https://github.com/wal-g/wal-g) reliability

From
Victor Sudakov
Date:
Dear Colleagues,

What do you think of wal-g? Can you entrust your data to it? Has it ever
failed you? Any hidden caveats?

-- 
Victor Sudakov,  VAS4-RIPE, VAS47-RIPN
2:5005/49@fidonet http://vas.tomsk.ru/



Re: wal-g (https://github.com/wal-g/wal-g) reliability

From
Nikolay Samokhvalov
Date:
In short, it's reliable and battle-tested. It is used successfully at companies such as Yandex.Cloud and GitLab.com.

You can find materials about it from Yandex.Cloud -- for example, from Andrey Borodin, who is one of the WAL-G maintainers. He is a frequent guest at our online community sessions -- see https://YouTube.com/RuPostgres (in Russian).

Additionally, you can reach out to the people who use WAL-G here:
- Postgres community Slack, https://postgres-slack.herokuapp.com/, which has a WAL-G channel (English)
- Postgres Telegram group, https://t.me/pgsql (Russian).

Whatever others tell you, I strongly recommend running periodic (say, daily) automated verification jobs that check your backups -- this is useful both for building trust in the backup tool and for ensuring that your backups are in good shape. Without automated verification, a DR strategy is definitely incomplete.

That being said, it's not a small project, so it may have issues depending on how you use it. Among possible caveats: if you use it in Google Cloud, you might see backup-push failures when GCS has instability events -- this was fixed in the very recent codebase. Also, some issues on AWS have been reported against the new codebase (master), but I don't have details (I suppose using an older version would be safer here).
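
A daily verification job of the kind recommended above could look roughly like the sketch below. It is only an illustration under stated assumptions: PostgreSQL 12 or later (recovery.signal), wal-g already configured through environment variables visible to both the script and the server, and a scratch directory and port (the function names, `5499` and the `SELECT 1` smoke test are all made up for this example). It restores the latest backup into an empty directory, replays WAL from the archive, runs one query and shuts the instance down.

```python
import subprocess
from pathlib import Path

# Assumption: wal-g picks up its storage settings (WALG_*_PREFIX, credentials)
# from the environment of this script and of the PostgreSQL server it starts.
RESTORE_COMMAND = "wal-g wal-fetch %f %p"


def build_restore_check(scratch: str, port: int = 5499,
                        backup: str = "LATEST") -> list[list[str]]:
    """Return the external commands of one restore test, in execution order."""
    return [
        # Download and unpack the chosen base backup into an empty directory.
        ["wal-g", "backup-fetch", scratch, backup],
        # Start a throwaway instance; -w waits until it accepts connections.
        ["pg_ctl", "-D", scratch, "-w", "-t", "3600", "-o", f"-p {port}", "start"],
        # Smoke test only; replace with queries that matter for your data.
        ["psql", "-p", str(port), "-d", "postgres", "-At", "-c", "SELECT 1"],
        ["pg_ctl", "-D", scratch, "-m", "fast", "stop"],
    ]


def prepare_recovery(scratch: str) -> None:
    """Ask PostgreSQL to run archive recovery using wal-g as restore_command."""
    data_dir = Path(scratch)
    (data_dir / "recovery.signal").touch()
    with open(data_dir / "postgresql.auto.conf", "a") as conf:
        conf.write(f"restore_command = '{RESTORE_COMMAND}'\n")


def run_restore_check(scratch: str, port: int = 5499) -> None:
    """Run the whole check; any failing step raises CalledProcessError."""
    fetch, start, smoke, stop = build_restore_check(scratch, port)
    subprocess.run(fetch, check=True)
    prepare_recovery(scratch)
    subprocess.run(start, check=True)
    try:
        subprocess.run(smoke, check=True)
    finally:
        # Always stop the scratch instance, even if the smoke test failed.
        subprocess.run(stop, check=True)
```

Calling `run_restore_check("/var/tmp/restore-test")` from cron or a CI job, and alerting on a non-zero exit, would be one way to wire it up. A real job would probably also run checks such as pg_dump to /dev/null or checksum verification.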

On Fri, Feb 5, 2021 at 01:54 Victor Sudakov <vas@sibptus.ru> wrote:
Dear Colleagues,

What do you think of wal-g? Can you entrust your data to it? Has it ever
failed you? Any hidden caveats?

--
Victor Sudakov,  VAS4-RIPE, VAS47-RIPN
2:5005/49@fidonet http://vas.tomsk.ru/


Re: wal-g (https://github.com/wal-g/wal-g) reliability

From
Victor Sudakov
Date:
Hello Nikolay,

Thank you for the comprehensive reply. 

Nikolay Samokhvalov wrote:
> In short, it's reliable and battle-tested. It's used in  companies such as
> Yandex.Cloud and GitLab.com, successfully.
> 
> You can find materials about it from Yandex.Cloud -- for example, from
> Andrey Borodin who is one of WAL-G maintainers. He is a frequent guest of
> our online community sessions -- see https://YouTube.com/RuPostgres (in
> Russian).
> 
> Additionally, you can reach out to the people who use WAL-G here:
> - Postgres community Slack
> https://postgres-slack.herokuapp.com/, it has WAL-G channel (English)
> - Postgres telegram group https://t.me/pgsql (Russian).
> 
> Despite talking to others, I strongly recommend having periodical (say,
> daily) automated verification jobs that check your backups -- this is both
> useful to start trusting the backup tool and to ensure that your backups
> are in a good shape. Without automated verification, a DR strategy is
> definitely incomplete.
> 
> That being said, it's not a small project so it may have issues depending
> on how you use it. Among possible caveats: if you use it in Google Cloud,
> you might have issues with backup-push failures when GCS  has instability
> events -- this was fixed in the very recent codebase. Also, there are some
> issues on AWS for the new codebase that are reported for the master, but I
> don't have details (I suppose using some older version should be better
> here).
> 
> On Fri, Feb 5, 2021 at 01:54 Victor Sudakov <vas@sibptus.ru> wrote:
> 
> > Dear Colleagues,
> >
> > What do you think of wal-g? Can you entrust your data to it? Has it ever
> > failed you? Any hidden caveats?
> >
> > --
> > Victor Sudakov,  VAS4-RIPE, VAS47-RIPN
> > 2:5005/49@fidonet http://vas.tomsk.ru/
> >
> >
> >

-- 
Victor Sudakov,  VAS4-RIPE, VAS47-RIPN
2:5005/49@fidonet http://vas.tomsk.ru/



Re: wal-g (https://github.com/wal-g/wal-g) reliability

From
Victor Sudakov
Date:
Nikolay Samokhvalov wrote:
> In short, it's reliable and battle-tested. It's used in  companies such as
> Yandex.Cloud and GitLab.com, successfully.

Nikolay, would you mind another question?

I'm used to the fact that "pg_basebackup -X" creates a self-sufficient
backup of a cluster which can be started right away as it contains all
the WAL files required for recovery. `touch recovery.signal` is never necessary,
and `touch standby.signal` is optional (when you do PITR etc).

It's not the case with wal-g, the result of the `wal-g backup-fetch` 
command requires `touch recovery.signal` and a restore_command
configured to fetch WALs from the wal-g storage.

I have also noticed that wal-g keeps pg_control in a separate tar
archive, and keeps a lot of metadata.

The questions are: 

1. If the metadata in the wal-g storage ever becomes corrupt, will I be
able to restore the database manually from the archives in
$WALG_*_PREFIX/{basebackups,wal}_005/ ?

2. Is there a `wal-g backup-fetch` option for truly self-sufficient
restoration?

-- 
Victor Sudakov,  VAS4-RIPE, VAS47-RIPN
2:5005/49@fidonet http://vas.tomsk.ru/



Re: wal-g (https://github.com/wal-g/wal-g) reliability

From
Stephen Frost
Date:
Greetings,

* Victor Sudakov (vas@sibptus.ru) wrote:
> Nikolay Samokhvalov wrote:
> > In short, it's reliable and battle-tested. It's used in  companies such as
> > Yandex.Cloud and GitLab.com, successfully.
>
> Nikolay, don't you mind another question?
>
> I'm used to the fact that "pg_basebackup -X" creates a self-sufficient
> backup of a cluster which can be started right away as it contains all
> the WAL files required for recovery. `touch recovery.signal` is never necessary,
> and `touch standby.signal` is optional (when you do PITR etc).
>
> It's not the case with wal-g, the result of the `wal-g backup-fetch`
> command requires `touch recovery.signal` and a restore_command
> configured to fetch WALs from the wal-g storage.
>
> I have also noticed that wal-g keeps pg_control in a separate tar
> archive, and keeps a lot of metadata.
>
> The questions are:
>
> 1. If the metadata in the wal-g storage ever becomes corrupt, will I be
> able to restore the database manually from the archives in
> $WALG_*_PREFIX/{basebackups,wal}_005/ ?
>
> 2. Is there a `wal-g backup-fetch` option for truly self-sufficient
> restoration?

Interesting that you ask this!  I say that because it's actually a case
that pgbackrest contemplated and explicitly added support for-
specifically, if you set --archive-copy (or archive-copy=true) and
disable compression (or decompress everything before you start PG), then
you can just start PG from the backup and it'll perform the necessary
recovery from the WAL and start up.  We considered this an interesting
use-case and used it extensively and have maintained support for it.

All that said, of course, this limits the ability to do typical PITR,
but at least it gives you a consistent backup which you can start PG
from.

Thanks,

Stephen


Re: wal-g (https://github.com/wal-g/wal-g) reliability

From
Ron
Date:
On 2/8/21 3:28 PM, Stephen Frost wrote:
[snip]
> Interesting that you ask this!  I say that because it's actually a case
> that pgbackrest contemplated and explicitly added support for-
> specifically, if you set --archive-copy (or archive-copy=true) and
> disable compression (or decompress everything before you start PG), then
> you can just start PG from the backup and it'll perform the necessary
> recovery from the WAL and start up.  We considered this an interesting
> use-case and used it extensively and have maintained support for it.
>
> All that said, of course, this limits the ability to do typical PITR,
> but at least it gives you a consistent backup which you can start PG
> from.

In the SQL Server world, that's a COPY_ONLY backup.

--
Angular momentum makes the world go 'round.

Re: wal-g (https://github.com/wal-g/wal-g) reliability

From
Victor Sudakov
Date:
Stephen Frost wrote:
> > 
> > 2. Is there a `wal-g backup-fetch` option for truly self-sufficient
> > restoration?
> 
> Interesting that you ask this!  I say that because it's actually a case
> that pgbackrest contemplated and explicitly added support for-
> specifically, if you set --archive-copy (or archive-copy=true) and
> disable compression (or decompress everything before you start PG), then
> you can just start PG from the backup and it'll perform the necessary
> recovery from the WAL and start up.  We considered this an interesting
> use-case and used it extensively and have maintained support for it.

Well, the release I've downloaded from https://github.com/wal-g/wal-g/releases
does not seem to have the --archive-copy option.
> 
> All that said, of course, this limits the ability to do typical PITR,

Why? If you configure recovery_target_time and restore_command, it's all
the same, isn't it? It doesn't matter whether the WAL files are already in
pg_wal, though it's better to have them.

> but at least it gives you a consistent backup which you can start PG
> from.
> 

-- 
Victor Sudakov,  VAS4-RIPE, VAS47-RIPN
2:5005/49@fidonet http://vas.tomsk.ru/



Re: wal-g (https://github.com/wal-g/wal-g) reliability

From
Andrey Borodin
Date:
Hi!

Thanks, Nikolay, I'm pleased to read good things about a project I maintain.

I'll add some Pros and Cons of WAL-G from my side.

Pros:
WAL-G is very efficient. At Y.Cloud we have a few petabytes of PG, 12K+ hosts. Each cluster is backed up every night
(with deltas), and network traffic costs us real money (our datacenters are in different countries, BTW). This is our
data and we highly value each byte, so we aim to make WAL-G 100% safe and tested.
WAL-G works with PG, MySQL, MongoDB, FDB and MS SQL Server, and with almost any storage (cloud, file system or scp).
WAL-G does not need any extra space near the database.

Cons:
We are not a PG vendor. We contribute to the PG ecosystem to make our systems better and reduce risks for Yandex.Cloud
customers. You are not expected to buy commercial support for WAL-G from us directly. That said, the community behind
WAL-G is big enough, and there are many hackers who can implement a feature you want.
And as a result, the docs are not very detailed. We would appreciate any docs enhancements.

> 8 февр. 2021 г., в 10:23, Victor Sudakov <vas@sibptus.ru> написал(а):
>
> Nikolay Samokhvalov wrote:
>> In short, it's reliable and battle-tested. It's used in  companies such as
>> Yandex.Cloud and GitLab.com, successfully.
>
> Nikolay, don't you mind another question?
>
> I'm used to the fact that "pg_basebackup -X" creates a self-sufficient
> backup of a cluster which can be started right away as it contains all
> the WAL files required for recovery. `touch recovery.signal` is never necessary,
> and `touch standby.signal` is optional (when you do PITR etc).
>
> It's not the case with wal-g, the result of the `wal-g backup-fetch`
> command requires `touch recovery.signal` and a restore_command
> configured to fetch WALs from the wal-g storage.
>
> I have also noticed that wal-g keeps pg_control in a separate tar
> archive, and keeps a lot of metadata.
pg_control is kept in a separate file to guard against incomplete restoration: if for some reason WAL-G is killed
during restoration, you won't be able to start the resulting corrupted DB.


> The questions are:
>
> 1. If the metadata in the wal-g storage ever becomes corrupt, will I be
> able to restore the database manually from the archives in
> $WALG_*_PREFIX/{basebackups,wal}_005/ ?
Yes, technically you can restore a full backup of a PG database without any metadata at all. You will need to create
an empty sentinel JSON file, though.

>
> 2. Is there a `wal-g backup-fetch` option for truly self-sufficient
> restoration?
We would appreciate a pull request for this feature. All you need is to read the start and stop LSNs and run wal-fetch
for every segment in between from the WAL-G process.
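
The "every segment in between" step can be sketched as follows. This is not WAL-G code, just an illustration of the arithmetic: it assumes the default 16 MB WAL segment size, a single timeline for the whole backup, and LSNs in PostgreSQL's usual `X/Y` hexadecimal notation. The function names are made up for this example.

```python
def parse_lsn(lsn: str) -> int:
    """Convert an LSN such as '0/2000028' into a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_segment_names(start_lsn: str, stop_lsn: str, timeline: int = 1,
                      segment_size: int = 16 * 1024 * 1024) -> list[str]:
    """Name every WAL segment from the one holding start_lsn to the one
    holding stop_lsn, inclusive.

    A stop LSN that falls exactly on a segment boundary yields one extra
    segment, which is harmless for a restore.
    """
    segments_per_log = 0x100000000 // segment_size
    first = parse_lsn(start_lsn) // segment_size
    last = parse_lsn(stop_lsn) // segment_size
    if last < first:
        raise ValueError("stop LSN precedes start LSN")
    # A segment file name is timeline, "log" number and segment number,
    # each printed as 8 upper-case hex digits.
    return [
        f"{timeline:08X}{n // segments_per_log:08X}{n % segments_per_log:08X}"
        for n in range(first, last + 1)
    ]


def wal_fetch_commands(names: list[str], pg_wal_dir: str) -> list[list[str]]:
    """Build one 'wal-g wal-fetch <segment> <destination>' call per segment."""
    return [["wal-g", "wal-fetch", name, f"{pg_wal_dir}/{name}"] for name in names]
```

For a backup that started at `0/2000028` and stopped at `0/4000100` on timeline 1, `wal_segment_names` returns segments `...02` through `...04`; running the resulting commands would place them in pg_wal so the restored cluster can reach consistency without a restore_command.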


Thanks!

Best regards, Andrey Borodin.


Re: wal-g (https://github.com/wal-g/wal-g) reliability

From
Victor Sudakov
Date:
Andrey Borodin wrote:
> 
> Thanks, Nikolay, I'm pleased to read good things about project I maintain.
> 
> I'll add some Pros and Cons of WAL-G from my side.
> 
> Pros:
> WAL-G is very efficient. At Y.Cloud we have a few petabytes of PG, 12K+ hosts. Each cluster is backed up every night
> (with deltas), and network traffic costs us real money (our datacenters are in different countries, BTW). This is our
> data and we highly value each byte, so we aim to make WAL-G 100% safe and tested.
> WAL-G works with PG, MySQL, MongoDB, FDB and MS SQL Server, and with almost any storage (cloud, file system or scp).
> WAL-G does not need any extra space near the database.
>
> Cons:
> We are not a PG vendor. We contribute to the PG ecosystem to make our systems better and reduce risks for Yandex.Cloud
> customers. You are not expected to buy commercial support for WAL-G from us directly. That said, the community behind
> WAL-G is big enough, and there are many hackers who can implement a feature you want.
> And as a result, the docs are not very detailed. We would appreciate any docs enhancements.

I'm considering WAL-G now, not as a backup solution per se, but as an
advanced WAL archiver which can:

1. Upload/download WALs directly to S3

2. Verify WALs integrity (a very nice feature, but I think does not work
if there are no backups)

3. Prevent WAL overwrite (does it work?)

4. Concurrently download WALs to speed up recovery.


-- 
Victor Sudakov,  VAS4-RIPE, VAS47-RIPN
2:5005/49@fidonet http://vas.tomsk.ru/



Re: wal-g (https://github.com/wal-g/wal-g) reliability

From
Andrey Borodin
Date:

> 9 февр. 2021 г., в 15:12, Victor Sudakov <vas@sibptus.ru> написал(а):
>
> I'm considering WAL-G now, not as a backup solution per se, but as an
> advanced WAL archiver which can:
Keep in mind that sooner or later you will have to delete WAL segments. If you just use wal-g delete, you have to
point out which backup to keep.
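
Why deletion needs a backup as its anchor can be shown with a small sketch. It is not how wal-g implements `delete`; it only illustrates the rule that WAL older than the first segment needed by the oldest kept backup is no longer useful for restoring that backup or anything newer. The `Backup` type, the function name and the sample data are hypothetical, and a single timeline is assumed so that segment names sort in LSN order.

```python
from typing import NamedTuple


class Backup(NamedTuple):
    name: str
    wal_start: str  # first WAL segment this backup needs for recovery


def obsolete_segments(archived: list[str], backups: list[Backup],
                      keep_from: str) -> list[str]:
    """Return archived segments that precede the backup named keep_from."""
    anchor = next((b for b in backups if b.name == keep_from), None)
    if anchor is None:
        raise ValueError(f"unknown backup: {keep_from}")
    # Only plain 24-character segment names are compared; history and
    # backup-label files in the archive are left alone.
    return sorted(s for s in archived if len(s) == 24 and s < anchor.wal_start)


backups = [
    Backup("base_000000010000000000000002", "000000010000000000000002"),
    Backup("base_000000010000000000000004", "000000010000000000000004"),
]
archived = [f"0000000100000000{n:08X}" for n in range(1, 6)]
to_delete = obsolete_segments(archived, backups, "base_000000010000000000000004")
```

Here `to_delete` holds segments `...01` through `...03`. With no backup at all there is no anchor, so nothing can be declared safe to remove, which is the point to keep in mind when using WAL-G purely as a WAL archiver.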

> 1. Upload/download WALs directly to S3
>
> 2. Verify WALs integrity (a very nice feature, but I think does not work
> if there are no backups)
It will not raise an alarm if WALs from before the backup are missing, though it should still analyze chains correctly.

> 3. Prevent WAL overwrite (does it work?)
I think so.


Thanks!

Best regards, Andrey Borodin.


Re: wal-g (https://github.com/wal-g/wal-g) reliability

From
Victor Sudakov
Date:
Andrey Borodin wrote:
> 
> 
> > 9 февр. 2021 г., в 15:12, Victor Sudakov <vas@sibptus.ru> написал(а):
> > 
> > I'm considering WAL-G now, not as a backup solution per se, but as an
> > advanced WAL archiver which can:
> Keep in mind that sooner or later you will have to delete WAL segments. If you just use wal-g delete, you have to
> point out which backup to keep.

Thank you, this is an important note.

> 
> > 1. Upload/download WALs directly to S3
> > 
> > 2. Verify WALs integrity (a very nice feature, but I think does not work
> > if there are no backups)
> It will not raise an alarm if WALs from before the backup are missing, though it should still analyze chains correctly.
> 
> > 3. Prevent WAL overwrite (does it work?)
> I think so.
> 

OK.

-- 
Victor Sudakov,  VAS4-RIPE, VAS47-RIPN
2:5005/49@fidonet http://vas.tomsk.ru/



Re: wal-g (https://github.com/wal-g/wal-g) reliability

From
Stephen Frost
Date:
Greetings,

* Victor Sudakov (vas@sibptus.ru) wrote:
> Stephen Frost wrote:
> > > 2. Is there a `wal-g backup-fetch` option for truly self-sufficient
> > > restoration?
> >
> > Interesting that you ask this!  I say that because it's actually a case
> > that pgbackrest contemplated and explicitly added support for-
> > specifically, if you set --archive-copy (or archive-copy=true) and
> > disable compression (or decompress everything before you start PG), then
> > you can just start PG from the backup and it'll perform the necessary
> > recovery from the WAL and start up.  We considered this an interesting
> > use-case and used it extensively and have maintained support for it.
>
> Well, the release I've downloaded from https://github.com/wal-g/wal-g/releases
> does not seem to have the --archive-copy option.

Right- as I mentioned above, that's an option which pgbackrest has.

> > All that said, of course, this limits the ability to do typical PITR,
>
> Why? If you configure recovery_target_time and restore_command, it's all
> the same, isn't it? No matter if you have WAL archives in pg_wal, but
> it's better to have them.

If you have a restore_command configured to be able to pull from a full
repo that has all the new WAL, sure, but the idea behind archive-copy is
specifically that you can pull out that backup and restore it *without*
having to have access to the original repo.  If the assumption is that
you've got access to the full repo, then archive-copy isn't really
gaining you anything and it wouldn't make sense to use it.

Thanks,

Stephen
