2.10. Logging

Shardman stores all of your data, which makes it a critical part of your infrastructure and makes logging essential, so it is important to understand how logging works in Shardman. Logs come from two sources: the shardmand daemon, which manages the cluster configuration, and the PostgreSQL database instances.

2.10.1. PostgreSQL Logs

Shardman uses the standard PostgreSQL logging settings, which are described in the PostgreSQL documentation. Place logging settings in the pgParameters section of sdmspec.json, as shown in the example below:

    {
      "ShardSpec": {
        "pgParameters": {
          "log_line_prefix": "%m [%r][%p]",
          "log_min_messages": "INFO",
          "log_statement": "none",
          "log_destination": "stderr",
          "log_filename": "pg.log",
          "logging_collector": "on",
          "log_checkpoints": "false",
          ...
        },
        ...
      },
      ...
    }
                

By default, logs are placed in a directory such as /var/lib/pgpro/sdm-14/data/keeper-cluster0-clover-1-shrn1-0/postgres/log. In this example, cluster0 is the name of the current cluster, clover-1-shrn1 is the name of the current shard, and 0 is the identifier of the integrated keeper process. To change the log directory, set the log_directory parameter.
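
The path layout above can be sketched as a small shell helper. This is only an illustration: the function name sdm_log_dir is hypothetical, and it assumes that every keeper follows the same keeper-<cluster>-<shard>-<keeper id> pattern as the sample path and that log_directory has not been changed.

```shell
# Build the default PostgreSQL log directory for one keeper.
# Assumption: the layout matches the sample path in this section,
#   <data root>/keeper-<cluster>-<shard>-<keeper id>/postgres/log
# sdm_log_dir is a hypothetical helper, not part of Shardman.
sdm_log_dir() {
    cluster=$1
    shard=$2
    keeper_id=$3
    # The data root defaults to the one shown in the sample path.
    data_root=${4:-/var/lib/pgpro/sdm-14/data}
    printf '%s/keeper-%s-%s-%s/postgres/log\n' \
        "$data_root" "$cluster" "$shard" "$keeper_id"
}

# Print the directory for the shard used in the example above.
sdm_log_dir cluster0 clover-1-shrn1 0
```

With the log_filename value from the configuration example, the current log file could then be followed with tail -f "$(sdm_log_dir cluster0 clover-1-shrn1 0)/pg.log".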

2.10.2. shardmand Logs

shardmand runs as a systemd unit, so its logs are written to journald. Use journalctl to examine them, for example:

                    $ journalctl -u shardmand@cluster0.service
                

You can filter logs by arbitrary time limits using the --since and --until options, which restrict the entries displayed to those after or before the given time, respectively. The time values can come in a variety of formats. For absolute time values, you should use YYYY-MM-DD HH:MM:SS. For instance, we can see all of the entries since January 10th, 2023 at 5:15 PM by typing:

                    $ journalctl -u shardmand@cluster0.service --since "2023-01-10 17:15:00"
                

If components of the above format are left off, some defaults will be applied. For instance, if the date is omitted, the current date will be assumed. If the time component is missing, “00:00:00” (midnight) will be substituted. The seconds field can be left off as well to default to “00”:

                    $ journalctl -u shardmand@cluster0.service --since "2023-01-10" --until "2023-01-11 03:00"
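
The time filters above can be combined with other standard journalctl options, such as relative times, priority filtering (-p), and follow mode (-f). The wrappers below are a minimal sketch: the function names are hypothetical, and filtering by priority is only useful if shardmand entries carry meaningful journald priorities, which is an assumption here.

```shell
# Print the systemd unit name for a cluster, following the
# shardmand@<cluster>.service pattern used in the examples above.
shardmand_unit() {
    printf 'shardmand@%s.service\n' "$1"
}

# Show entries at priority "err" or more severe from the last hour.
# Assumption: shardmand messages are tagged with journald priorities.
shardmand_recent_errors() {
    journalctl -u "$(shardmand_unit "$1")" -p err --since "1 hour ago" --no-pager
}

# Show the last 50 entries, then keep printing new ones as they arrive.
shardmand_follow() {
    journalctl -u "$(shardmand_unit "$1")" -n 50 -f
}
```

For example, shardmand_recent_errors cluster0 would print recent error entries for the cluster0 unit, and shardmand_follow cluster0 would stream its log until interrupted.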
                

To control the log verbosity for all Shardman services, set SDM_LOG_LEVEL in the shardmand configuration file.

2.10.3. Getting Information on Backend Crashes

Some crashes are caused by hardware failures or DBMS issues. To understand the root cause of a crash, use crash_info. To set it up, follow these steps:

  • On each cluster node, create a directory that the Shardman operating system user (usually postgres) has access to. Crash reports will be written to this directory.

    install -d -o postgres -g postgres -m 700  /var/lib/postgresql/crashinfo
    

  • Set the crash_info_location value.

    Note

    This will cause the DBMS to restart.

    shardmanctl --store-endpoints http://etcdserver:2379 set -y  crash_info_location=/var/lib/postgresql/crashinfo
    

  • To make sure the changes are applied, send a signal to a backend that makes it crash. This creates a core dump and restarts the instance.

    Note

    Do it in your test environment only.

Connect to your DBMS and find the PID of the backend associated with the current session:

postgres=# select pg_backend_pid();
pg_backend_pid
----------------
    23770

Then send the SIGSEGV signal to the process with that PID:

kill -11 23770

The backend crashes, and a log file containing the time, backtrace, and cause of the error is written to /var/lib/postgresql/crashinfo:

# Signal
Program received signal: 11 (SIGSEGV)
Signal    UTC date time: 25.10.2024 08:37:02

# Program
                          pid: 23770
                         ppid: 17506
      program_invocation_name: postgres: postgres postgres 10.42.42.10(34202) idle
program_invocation_short_name: tgres 10.42.42.10(34202) idle
                     exe_path: /opt/pgpro/sdm-14/bin/postgres
                          exe: postgres

# Backtrace
1   postgres + 0x5b55c0              0x55c5ba8459b7  0x00007ffcbef19070  bt_crash_handler + 0x3f7
2   libc.so.6 + 0x4251f              0x7f01c2caa520  0x00007ffcbef19140  __sigaction + 0x50
unknown  ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
3   libc.so.6 + 0x125f80             0x7f01c2d8df9a  0x00007ffcbef195b8  epoll_wait + 0x1a
epoll_wait  ../sysdeps/unix/sysv/linux/epoll_wait.c:30
4   postgres + 0x433870              0x55c5ba6c39bb  0x00007ffcbef195c0  WaitEventSetWait + 0x14b
5   postgres + 0x320de0              0x55c5ba5b0e74  0x00007ffcbef19650  secure_read + 0x94
6   postgres + 0x327d20              0x55c5ba5b7dae  0x00007ffcbef196a0  pq_recvbuf + 0x8e
7   postgres + 0x328980              0x55c5ba5b8995  0x00007ffcbef196c0  pq_getbyte + 0x15
8   postgres + 0x457da0              0x55c5ba6e909c  0x00007ffcbef196d0  PostgresMain + 0x12fc
9   postgres + 0x3ce210              0x55c5ba65ef86  0x00007ffcbef19a60  ServerLoop + 0xd76
10  postgres + 0x3cf240              0x55c5ba65fe18  0x00007ffcbef1a040  PostmasterMain + 0xbd8
11  postgres + 0x14ecc0              0x55c5ba3df182  0x00007ffcbef1a0c0  main + 0x4c2
12  libc.so.6 + 0x29d10              0x7f01c2c91d90  0x00007ffcbef1a0f0  __libc_init_first + 0x90
__libc_start_call_main  ../sysdeps/nptl/libc_start_call_main.h:58
13  libc.so.6 + 0x29dc0              0x7f01c2c91e40  0x00007ffcbef1a190  __libc_start_main + 0x80
call_init  ../csu/libc-start.c:128
__libc_start_main_impl  ../csu/libc-start.c:379
14  postgres + 0x14f200              0x55c5ba3df225  0x00007ffcbef1a1e0  _start + 0x25
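
After a crash, the signal that terminated the backend can be pulled from the most recent report with a short shell helper. This is a sketch under stated assumptions: the function name crash_signal is hypothetical, report file names are not assumed to follow any pattern, and each report is assumed to be a plain-text file containing a "Program received signal:" line, as in the sample above.

```shell
# Print the signal line from the newest file in a crash_info directory.
# crash_signal is a hypothetical helper, not part of Shardman.
crash_signal() {
    dir=${1:-/var/lib/postgresql/crashinfo}
    # Pick the most recently modified entry. No naming pattern is assumed
    # for report files, so the whole directory is considered.
    newest=$(ls -1t -- "$dir" 2>/dev/null | head -n 1)
    if [ -z "$newest" ]; then
        echo "no crash reports in $dir" >&2
        return 1
    fi
    # Assumption: the report contains a line in the format shown above.
    grep -m 1 '^Program received signal:' "$dir/$newest"
}
```

Run as a user that can read the directory, for example sudo -u postgres sh -c '...' or directly as postgres, since the directory was created with mode 700.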
