Thread: RFC: seccomp-bpf support
SECCOMP ("SECure COMPuting with filters") is a Linux kernel syscall filtering mechanism which allows reduction of the kernel attack surface by preventing (or at least audit logging) normally unused syscalls. Quoting from this link: https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt "A large number of system calls are exposed to every userland process with many of them going unused for the entire lifetime of the process. As system calls change and mature, bugs are found and eradicated. A certain subset of userland applications benefit by having a reduced set of available system calls. The resulting set reduces the total kernel surface exposed to the application. System call filtering is meant for use with those applications." Recent security best-practices recommend, and certain highly security-conscious organizations are beginning to require, that SECCOMP be used to the extent possible. The major web browsers, container runtime engines, and systemd are all examples of software that already support seccomp. --------- A seccomp (bpf) filter is comprised of a default action, and a set of rules with actions pertaining to specific syscalls (possibly with even more specific sets of arguments). Once loaded into the kernel, a filter is inherited by all child processes and cannot be removed. It can, however, be overlaid with another filter. For any given syscall match, the most restrictive (a.k.a. highest precedence) action will be taken by the kernel. PostgreSQL has already been run "in the wild" under seccomp control in containers, and possibly systemd. Adding seccomp support into PostgreSQL itself mitigates issues with these approaches, and has several advantages: * Container seccomp filters tend to be extremely broad/permissive, typically allowing about 6 out 7 of all syscalls. They must do this because the use cases for containers vary widely. * systemd does not implement seccomp filters by default. Packagers may decide to do so, but there is no guarantee. Adding them post install potentially requires cooperation by groups outside control of the database admins. * In the container and systemd case there is no particularly good way to inspect what filters are active. It is possible to observe actions taken, but again, control is possibly outside the database admin group. For example, the best way to understand what happened is to review the auditd log, which is likely not readable by the DBA. * With built-in support, it is possible to lock down backend processes more tightly than the postmaster. * With built-in support, it is possible to lock down different backend processes differently than each other, for example by using ALTER ROLE ... SET or ALTER DATABASE ... SET. * With built-in support, it is possible to calculate and return (in the form of an SRF) the effective filters being applied to the postmaster and the current backend. * With built-in support, it could be possible (this part not yet implemented) to have separate filters for different backend types, e.g. autovac workers, background writer, etc. --------- Attached is a patch for discussion, adding support for seccomp-bpf (nowadays generally just called seccomp) syscall filtering at configure-time using libseccomp. I would like to get this in shape to be committed by the end of the November CF if possible. The code itself has been through several rounds of revision based on discussions I have had with the author of libseccomp as well as a few other folks. However as of the moment: * Documentation - general discussion missing entirely * No regression tests --------- For convenience, here are a couple of additional links to relevant information regarding seccomp: https://en.wikipedia.org/wiki/Seccomp https://github.com/seccomp/libseccomp --------- Specific feedback requested: 1. Placement of pg_get_seccomp_filter() in src/backend/utils/adt/genfile.c originally made sense but after several rewrites no longer does. Ideas where it *should* go? 2. Where should a general discussion section go in the docs, if at all? 3. Currently this supports a global filter at the postmaster level, which is inherited by all child processes, and a secondary filter at the client backend session level. It likely makes sense to support secondary filters for other types of child processes, e.g. autovacuum workers, etc. Add that now (pg13), later release, or never? 4. What is the best way to approach testing of this feature? Tap testing perhaps? 5. Default GUC values - should we provide "starter" lists, or only a procedure for generating a list (as below). --------- Notes on usage: =============== In order to determine your minimally required allow lists, do something like the following on a non-production server with the same architecture as production: 0. Setup: * install libseccomp, libseccomp-dev, and seccomp * install auditd if not already installed * configure postgres --with-seccomp and maybe --enable-tap-tests to improve feature coverage (see below) 1. Modify postgresql.conf and/or create <pg_source_dir>/postgresql_tmp.conf 8<-------------------- seccomp = on global_syscall_default = allow global_syscall_allow = '' global_syscall_log = '' global_syscall_error = '' global_syscall_kill = '' session_syscall_default = log session_syscall_allow = '*' session_syscall_log = '*' session_syscall_error = '*' session_syscall_kill = '*' 8<-------------------- 2. Modify /etc/audit/auditd.conf * disp_qos = 'lossless' * change max_log_file_action = 'ignore' 3. Stop auditd, clear out all audit.logs, start auditd: * systemctl stop auditd.service # if running * echo -n "" > /var/log/audit/audit.log * systemctl start auditd.service 4. Start/restart postgres. 5. Exercise postgres as much as possible (one or more of the following): * make installcheck-world * make check world \ EXTRA_REGRESS_OPTS=--temp-config=<pg_source_dir>/postgresql_tmp.conf * run your application through its paces * other random testing of relevant postgres features Note: at this point audit.log will start growing quickly. During `make check world` mine grew to just under 1 GB. 6. Process results: a) systemctl stop auditd.service b) Run the provided "get_syscalls.sh" script c) Cut and paste the result as the value of session_syscall_allow. 7. Optional: a) global_syscall_default = 'log' b) Repeat steps 3-5 c) Repeat step 6a and 6b d) Cut and paste the result as the value of global_syscall_allow 8. Iterate steps 3-6b. * Output should be empty. * If there are any new syscalls, add to global_syscall_allow and session_syscall_allow. * Iterate until output of "get_syscalls.sh" script is empty. 9. Optional: * Change global and session defaults to "error" or "kill" * Reduce the allow lists if desired * This can be done for specific database users, by doing ALTER ROLE... SET session_syscall_allow to '<some reduced allow list>' 10. Adjust settings to taste, restart postgres, and monitor audit.log going forward. Below are some values from my system. Note that I have made no attempt thus far to do static code analysis -- this list was build using `make check world` several times. 8<------------------------- seccomp = on global_syscall_default = log global_syscall_allow = 'accept,access,bind,brk,chmod,clone,close,connect,dup,epoll_create1,epoll_ctl,epoll_wait,exit_group,fadvise64,fallocate,fcntl,fdatasync,fstat,fsync,ftruncate,futex,getdents,getegid,geteuid,getgid,getpeername,getpid,getppid,getrandom,getrusage,getsockname,getsockopt,getuid,ioctl,kill,link,listen,lseek,lstat,mkdir,mmap,mprotect,mremap,munmap,openat,pipe,poll,prctl,pread64,prlimit64,pwrite64,read,readlink,recvfrom,recvmsg,rename,rmdir,rt_sigaction,rt_sigprocmask,rt_sigreturn,seccomp,select,sendto,setitimer,set_robust_list,setsid,setsockopt,shmat,shmctl,shmdt,shmget,shutdown,socket,stat,statfs,symlink,sync_file_range,sysinfo,umask,uname,unlink,utime,wait4,write' global_syscall_log = '' global_syscall_error = '' global_syscall_kill = '' session_syscall_default = log session_syscall_allow = 'access,brk,chmod,close,connect,epoll_create1,epoll_ctl,epoll_wait,exit_group,fadvise64,fallocate,fcntl,fdatasync,fstat,fsync,ftruncate,futex,getdents,getegid,geteuid,getgid,getpeername,getpid,getrandom,getrusage,getsockname,getsockopt,getuid,ioctl,kill,link,lseek,lstat,mkdir,mmap,mprotect,mremap,munmap,openat,poll,pread64,pwrite64,read,readlink,recvfrom,recvmsg,rename,rmdir,rt_sigaction,rt_sigprocmask,rt_sigreturn,select,sendto,setitimer,setsockopt,shutdown,socket,stat,symlink,sync_file_range,sysinfo,umask,uname,unlink,utime,write' session_syscall_log = '*' session_syscall_error = '*' session_syscall_kill = '*' 8<------------------------- That results in the following effective filters at the ("context" equals) global and session levels: 8<------------------------- select * from pg_get_seccomp_filter() order by 4,1; syscall | syscallnum | filter_action | context -----------------+------------+----------------+--------- accept | 43 | global->allow | global access | 21 | global->allow | global bind | 49 | global->allow | global brk | 12 | global->allow | global chmod | 90 | global->allow | global clone | 56 | global->allow | global close | 3 | global->allow | global connect | 42 | global->allow | global <default> | -1 | global->log | global dup | 32 | global->allow | global epoll_create1 | 291 | global->allow | global epoll_ctl | 233 | global->allow | global epoll_wait | 232 | global->allow | global exit_group | 231 | global->allow | global fadvise64 | 221 | global->allow | global fallocate | 285 | global->allow | global fcntl | 72 | global->allow | global fdatasync | 75 | global->allow | global fstat | 5 | global->allow | global fsync | 74 | global->allow | global ftruncate | 77 | global->allow | global futex | 202 | global->allow | global getdents | 78 | global->allow | global getegid | 108 | global->allow | global geteuid | 107 | global->allow | global getgid | 104 | global->allow | global getpeername | 52 | global->allow | global getpid | 39 | global->allow | global getppid | 110 | global->allow | global getrandom | 318 | global->allow | global getrusage | 98 | global->allow | global getsockname | 51 | global->allow | global getsockopt | 55 | global->allow | global getuid | 102 | global->allow | global ioctl | 16 | global->allow | global kill | 62 | global->allow | global link | 86 | global->allow | global listen | 50 | global->allow | global lseek | 8 | global->allow | global lstat | 6 | global->allow | global mkdir | 83 | global->allow | global mmap | 9 | global->allow | global mprotect | 10 | global->allow | global mremap | 25 | global->allow | global munmap | 11 | global->allow | global openat | 257 | global->allow | global pipe | 22 | global->allow | global poll | 7 | global->allow | global prctl | 157 | global->allow | global pread64 | 17 | global->allow | global prlimit64 | 302 | global->allow | global pwrite64 | 18 | global->allow | global read | 0 | global->allow | global readlink | 89 | global->allow | global recvfrom | 45 | global->allow | global recvmsg | 47 | global->allow | global rename | 82 | global->allow | global rmdir | 84 | global->allow | global rt_sigaction | 13 | global->allow | global rt_sigprocmask | 14 | global->allow | global rt_sigreturn | 15 | global->allow | global seccomp | 317 | global->allow | global select | 23 | global->allow | global sendto | 44 | global->allow | global setitimer | 38 | global->allow | global set_robust_list | 273 | global->allow | global setsid | 112 | global->allow | global setsockopt | 54 | global->allow | global shmat | 30 | global->allow | global shmctl | 31 | global->allow | global shmdt | 67 | global->allow | global shmget | 29 | global->allow | global shutdown | 48 | global->allow | global socket | 41 | global->allow | global stat | 4 | global->allow | global statfs | 137 | global->allow | global symlink | 88 | global->allow | global sync_file_range | 277 | global->allow | global sysinfo | 99 | global->allow | global umask | 95 | global->allow | global uname | 63 | global->allow | global unlink | 87 | global->allow | global utime | 132 | global->allow | global wait4 | 61 | global->allow | global write | 1 | global->allow | global accept | 43 | session->log | session access | 21 | session->allow | session bind | 49 | session->log | session brk | 12 | session->allow | session chmod | 90 | session->allow | session clone | 56 | session->log | session close | 3 | session->allow | session connect | 42 | session->allow | session <default> | -1 | session->log | session dup | 32 | session->log | session epoll_create1 | 291 | session->allow | session epoll_ctl | 233 | session->allow | session epoll_wait | 232 | session->allow | session exit_group | 231 | session->allow | session fadvise64 | 221 | session->allow | session fallocate | 285 | session->allow | session fcntl | 72 | session->allow | session fdatasync | 75 | session->allow | session fstat | 5 | session->allow | session fsync | 74 | session->allow | session ftruncate | 77 | session->allow | session futex | 202 | session->allow | session getdents | 78 | session->allow | session getegid | 108 | session->allow | session geteuid | 107 | session->allow | session getgid | 104 | session->allow | session getpeername | 52 | session->allow | session getpid | 39 | session->allow | session getppid | 110 | session->log | session getrandom | 318 | session->allow | session getrusage | 98 | session->allow | session getsockname | 51 | session->allow | session getsockopt | 55 | session->allow | session getuid | 102 | session->allow | session ioctl | 16 | session->allow | session kill | 62 | session->allow | session link | 86 | session->allow | session listen | 50 | session->log | session lseek | 8 | session->allow | session lstat | 6 | session->allow | session mkdir | 83 | session->allow | session mmap | 9 | session->allow | session mprotect | 10 | session->allow | session mremap | 25 | session->allow | session munmap | 11 | session->allow | session openat | 257 | session->allow | session pipe | 22 | session->log | session poll | 7 | session->allow | session prctl | 157 | session->log | session pread64 | 17 | session->allow | session prlimit64 | 302 | session->log | session pwrite64 | 18 | session->allow | session read | 0 | session->allow | session readlink | 89 | session->allow | session recvfrom | 45 | session->allow | session recvmsg | 47 | session->allow | session rename | 82 | session->allow | session rmdir | 84 | session->allow | session rt_sigaction | 13 | session->allow | session rt_sigprocmask | 14 | session->allow | session rt_sigreturn | 15 | session->allow | session seccomp | 317 | session->log | session select | 23 | session->allow | session sendto | 44 | session->allow | session setitimer | 38 | session->allow | session set_robust_list | 273 | session->log | session setsid | 112 | session->log | session setsockopt | 54 | session->allow | session shmat | 30 | session->log | session shmctl | 31 | session->log | session shmdt | 67 | session->log | session shmget | 29 | session->log | session shutdown | 48 | session->allow | session socket | 41 | session->allow | session stat | 4 | session->allow | session statfs | 137 | session->log | session symlink | 88 | session->allow | session sync_file_range | 277 | session->allow | session sysinfo | 99 | session->allow | session umask | 95 | session->allow | session uname | 63 | session->allow | session unlink | 87 | session->allow | session utime | 132 | session->allow | session wait4 | 61 | session->log | session write | 1 | session->allow | session (170 rows) 8<------------------------- If you made it all the way to here, thank you for your attention :-) Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development
Attachment
On Wed, Aug 28, 2019 at 11:13:27AM -0400, Joe Conway wrote: > SECCOMP ("SECure COMPuting with filters") is a Linux kernel syscall > filtering mechanism which allows reduction of the kernel attack surface > by preventing (or at least audit logging) normally unused syscalls. > > Quoting from this link: > https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt > > "A large number of system calls are exposed to every userland process > with many of them going unused for the entire lifetime of the > process. As system calls change and mature, bugs are found and > eradicated. A certain subset of userland applications benefit by > having a reduced set of available system calls. The resulting set > reduces the total kernel surface exposed to the application. System > call filtering is meant for use with those applications." > > Recent security best-practices recommend, and certain highly > security-conscious organizations are beginning to require, that SECCOMP > be used to the extent possible. The major web browsers, container > runtime engines, and systemd are all examples of software that already > support seccomp. Neat! Are the seccomp interfaces for other kernels arranged in a manner similar enough to have a unified interface in PostgreSQL, or is this more of a Linux-only feature? Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On 2019-08-28 17:13, Joe Conway wrote: > * systemd does not implement seccomp filters by default. Packagers may > decide to do so, but there is no guarantee. Adding them post install > potentially requires cooperation by groups outside control of > the database admins. Well, if we are interested in this, why don't we just ask packagers to do so? That seems like a much easier solution. > * In the container and systemd case there is no particularly good way to > inspect what filters are active. It is possible to observe actions > taken, but again, control is possibly outside the database admin > group. For example, the best way to understand what happened is to > review the auditd log, which is likely not readable by the DBA. Why does one need to know what filters are active (other than, obviously, checking that the filters one has set were actually activated)? What decisions would a DBA or database user be able to make based on whether a filter is set or not? > * With built-in support, it is possible to lock down backend processes > more tightly than the postmaster. Also the other way around? > * With built-in support, it is possible to lock down different backend > processes differently than each other, for example by using ALTER ROLE > ... SET or ALTER DATABASE ... SET. What are use cases for this? > Attached is a patch for discussion, adding support for seccomp-bpf > (nowadays generally just called seccomp) syscall filtering at > configure-time using libseccomp. I would like to get this in shape to be > committed by the end of the November CF if possible. To compute the initial set of allowed system calls, you need to have fantastic test coverage. What you don't want is some rarely used error recovery path to cause a system crash. I wouldn't trust our current coverage for this. Also, the list of system calls in use changes all the time. The same function call from PostgreSQL could be a system call or a glibc implementation, depending on the specific installed packages or run-time settings. Extensions would need to maintain their own permissions list, and they would need to be merged manually into the respective existing settings. Without good answers to these, I suspect that this feature would go straight to the top of the list of "if in doubt, turn off". Overall, I think this sounds like a maintenance headache, and the possible benefits are unclear. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 8/28/19 1:03 PM, Peter Eisentraut wrote: > On 2019-08-28 17:13, Joe Conway wrote: >> * systemd does not implement seccomp filters by default. Packagers may >> decide to do so, but there is no guarantee. Adding them post install >> potentially requires cooperation by groups outside control of >> the database admins. > > Well, if we are interested in this, why don't we just ask packagers to > do so? That seems like a much easier solution. For the reason listed below >> * In the container and systemd case there is no particularly good way to >> inspect what filters are active. It is possible to observe actions >> taken, but again, control is possibly outside the database admin >> group. For example, the best way to understand what happened is to >> review the auditd log, which is likely not readable by the DBA. > > Why does one need to know what filters are active (other than, > obviously, checking that the filters one has set were actually > activated)? What decisions would a DBA or database user be able to make > based on whether a filter is set or not? So that when an enforement action happens, you can understand why it happened. Perhaps it was a bug (omission) in your allow list, or perhaps it was an intentional rule to prevent abuse (say blocking certain syscalls from plperlu), or it was because someone is trying to compromise you system (e.g. some obscure and clearly not needed syscall). >> * With built-in support, it is possible to lock down backend processes >> more tightly than the postmaster. > > Also the other way around? As I stated in the original email, the filters can add restrictions but never remove them. >> * With built-in support, it is possible to lock down different backend >> processes differently than each other, for example by using ALTER ROLE >> ... SET or ALTER DATABASE ... SET. > > What are use cases for this? For example to allow a specific user to use plperlu to exec shell code while others cannot. >> Attached is a patch for discussion, adding support for seccomp-bpf >> (nowadays generally just called seccomp) syscall filtering at >> configure-time using libseccomp. I would like to get this in shape to be >> committed by the end of the November CF if possible. > > To compute the initial set of allowed system calls, you need to have > fantastic test coverage. What you don't want is some rarely used error > recovery path to cause a system crash. I wouldn't trust our current > coverage for this. So if you are worried about that make your default action 'log' and watch audit.log. There will be no errors or crashes of postgres caused by that because there will be no change in postgres visible behavior. And if returning an error from a syscall causes a crash that would be a serious bug and we should fix it. > Also, the list of system calls in use changes all the time. The same > function call from PostgreSQL could be a system call or a glibc > implementation, depending on the specific installed packages or run-time > settings. True. That is why I did not provide an initial list and believe folks who want to use this should develop their own. > Extensions would need to maintain their own permissions list, and they > would need to be merged manually into the respective existing settings. People would have to generate their own list for extensions -- I don't believe it is the extension writers' problem. > Without good answers to these, I suspect that this feature would go > straight to the top of the list of "if in doubt, turn off". That is fine. Perhaps most people never use this, but when needed (and increasingly will be required) it is available. > Overall, I think this sounds like a maintenance headache, and the > possible benefits are unclear. The only people who will incur the maintenance headache are those who need to use it. The benefits are compliance with requirements. Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development
Attachment
On 8/28/19 12:47 PM, David Fetter wrote: > On Wed, Aug 28, 2019 at 11:13:27AM -0400, Joe Conway wrote: >> SECCOMP ("SECure COMPuting with filters") is a Linux kernel syscall >> filtering mechanism which allows reduction of the kernel attack surface >> by preventing (or at least audit logging) normally unused syscalls. >> >> Quoting from this link: >> https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt >> >> "A large number of system calls are exposed to every userland process >> with many of them going unused for the entire lifetime of the >> process. As system calls change and mature, bugs are found and >> eradicated. A certain subset of userland applications benefit by >> having a reduced set of available system calls. The resulting set >> reduces the total kernel surface exposed to the application. System >> call filtering is meant for use with those applications." >> >> Recent security best-practices recommend, and certain highly >> security-conscious organizations are beginning to require, that SECCOMP >> be used to the extent possible. The major web browsers, container >> runtime engines, and systemd are all examples of software that already >> support seccomp. > > Neat! > > Are the seccomp interfaces for other kernels arranged in a manner > similar enough to have a unified interface in PostgreSQL, or is this > more of a Linux-only feature? As far as I know libseccomp is Linux specific, at least at the moment. Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development
Attachment
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes: > On 2019-08-28 17:13, Joe Conway wrote: >> Attached is a patch for discussion, adding support for seccomp-bpf >> (nowadays generally just called seccomp) syscall filtering at >> configure-time using libseccomp. I would like to get this in shape to be >> committed by the end of the November CF if possible. > To compute the initial set of allowed system calls, you need to have > fantastic test coverage. What you don't want is some rarely used error > recovery path to cause a system crash. I wouldn't trust our current > coverage for this. Yeah, that seems like quite a serious problem. I think you'd want to have some sort of static-code-analysis-based way of identifying the syscalls in use, rather than trying to test your way to it. > Overall, I think this sounds like a maintenance headache, and the > possible benefits are unclear. After thinking about this for awhile, I really don't follow what threat model it's trying to protect against. Given that we'll allow any syscall that an unmodified PG executable might use, it seems like the only scenarios being protected against involve someone having already compromised the server enough to have arbitrary code execution. OK, fine, but then why wouldn't the attacker just bypass libseccomp? Or tell it to let through the syscall he wants to use? Having the list of allowed syscalls be determined inside the process seems like fundamentally the wrong implementation. I'd have expected a feature like this to be implemented by SELinux, or some similar technology where the filtering is done by logic that's outside the executable you wish to not trust. (After googling for libseccomp, I see that it's supposed to not allow syscalls to be turned back on once turned off, but that isn't any protection against this problem. An attacker who's found an ACE hole in Postgres can just issue ALTER SYSTEM SET to disable the feature, then force a postmaster restart, then profit.) I follow the idea of limiting the attack surface for kernel bugs, but this doesn't seem like a useful implementation of that, even ignoring the ease-of-use problems Peter mentions. regards, tom lane
Hi, On 2019-08-28 11:13:27 -0400, Joe Conway wrote: > Recent security best-practices recommend, and certain highly > security-conscious organizations are beginning to require, that SECCOMP > be used to the extent possible. The major web browsers, container > runtime engines, and systemd are all examples of software that already > support seccomp. Maybe I'm missing something, but it's not clear to me what meaningful attack surface can be reduced for PostgreSQL by forbidding certain syscalls, given the wide variety of syscalls required to run postgres. That's different from something like a browser's CSS process, or such, which really doesn't need much beyond some IPC and memory allocations. But postgres is going to need syscalls as broad as fork/clone, exec, connect, shm*, etc. I guess you can argue that we'd still reduce the attack surface for kernel escalations, but that seems like a pretty small win compared to the cost. > * With built-in support, it is possible to lock down backend processes > more tightly than the postmaster. Which important syscalls would you get away with removing in backends that postmaster needs? I think the only one - which is a good one though - that I can think of is listen(). But even that might be too restrictive for some PLs running out of process. My main problem with seccomp is that it's *incredibly* fragile, especially for a program as complex as postgres. We already had seccomp related bug reports on list, even just due to the very permissive filtering by some container solutions. There's regularly new syscalls (e.g. epoll_create1(), and we'll soon get openat2()), different versions of glibc use different syscalls (e.g. switching from open() to always using openat()), the system configuration influences which syscalls are being used (e.g. using vsyscalls only being used for certain clock sources), and kernel. bugfixes change the exact set of syscalls being used ([1]). [1] https://lwn.net/Articles/795128/ Then there's also the issue that many extensions are going to need additional syscalls. > Notes on usage: > =============== > In order to determine your minimally required allow lists, do something > like the following on a non-production server with the same architecture > as production: > c) Cut and paste the result as the value of session_syscall_allow. That seems nearly guaranteed to miss a significant fraction of syscalls. There's just no way we're going to cover all the potential paths and configurations in our testsuite. I think if you actually wanted to do something like this, you'd need to use static analysis to come up with a more reliable list. Greetings, Andres Freund
Hi, On 2019-08-28 13:28:06 -0400, Joe Conway wrote: > > To compute the initial set of allowed system calls, you need to have > > fantastic test coverage. What you don't want is some rarely used error > > recovery path to cause a system crash. I wouldn't trust our current > > coverage for this. > So if you are worried about that make your default action 'log' and > watch audit.log. There will be no errors or crashes of postgres caused > by that because there will be no change in postgres visible behavior. But the benefit of integrating this into postgres become even less clear. > And if returning an error from a syscall causes a crash that would be a > serious bug and we should fix it. Err, there's a lot of syscall failures that'll cause PANICs, and where there's no reasonable way around that. Greetings, Andres Freund
On Wed, Aug 28, 2019 at 1:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes: > > On 2019-08-28 17:13, Joe Conway wrote: > >> Attached is a patch for discussion, adding support for seccomp-bpf > >> (nowadays generally just called seccomp) syscall filtering at > >> configure-time using libseccomp. I would like to get this in shape to be > >> committed by the end of the November CF if possible. > > > To compute the initial set of allowed system calls, you need to have > > fantastic test coverage. What you don't want is some rarely used error > > recovery path to cause a system crash. I wouldn't trust our current > > coverage for this. > > Yeah, that seems like quite a serious problem. I think you'd want > to have some sort of static-code-analysis-based way of identifying > the syscalls in use, rather than trying to test your way to it. > > > Overall, I think this sounds like a maintenance headache, and the > > possible benefits are unclear. > > After thinking about this for awhile, I really don't follow what > threat model it's trying to protect against. Given that we'll allow > any syscall that an unmodified PG executable might use, it seems > like the only scenarios being protected against involve someone > having already compromised the server enough to have arbitrary code > execution. OK, fine, but then why wouldn't the attacker just > bypass libseccomp? Or tell it to let through the syscall he wants > to use? Having the list of allowed syscalls be determined inside > the process seems like fundamentally the wrong implementation. > I'd have expected a feature like this to be implemented by SELinux, SELinux is generally an object model and while it does implement e.g., capability checks, that is not the main intent, nor is it possible for LSMs to implement syscall filters, the features are orthogonal. > or some similar technology where the filtering is done by logic > that's outside the executable you wish to not trust. > (After googling for libseccomp, I see that it's supposed to not > allow syscalls to be turned back on once turned off, but that isn't > any protection against this problem. An attacker who's found an ACE > hole in Postgres can just issue ALTER SYSTEM SET to disable the > feature, then force a postmaster restart, then profit.) > My preference would have been to enable it unconditionally but Joe was being more practical. > I follow the idea of limiting the attack surface for kernel bugs, > but this doesn't seem like a useful implementation of that, even > ignoring the ease-of-use problems Peter mentions. Limiting the kernel attack surface for network facing daemons is imperative to hardening systems. All modern attacks are chained together so a kernel bug is useful only if you can execute code, and PG is a decent vector for executing code. At a minimum I would urge the community to look at adding high risk syscalls to the kill list, systemd has some predefined sets we can pick from like @obsoluete, @cpu-emulation, @privileged, @mount, and @module. The goal is to prevent an ACE hole in Postgres from becoming a complete system compromise. This may not do it alone, and security conscious integrators will want to, for example, add seccomp filters to systemd to prevent superuser from disabling them. The postmaster and per-role lists can further reduce the available syscalls based on the exact extensions and PLs being used. Each step reduced the surface more and throwing it all out because one step can go rogue is unsatisfying. Thank you.
Andres Freund <andres@anarazel.de> writes: > Maybe I'm missing something, but it's not clear to me what meaningful > attack surface can be reduced for PostgreSQL by forbidding certain > syscalls, given the wide variety of syscalls required to run postgres. I think the idea is to block access to seldom-used syscalls because those are exactly the ones most likely to have bugs. (We certainly hope that all the ones PG uses are well-debugged...) That seems fine. Whether the incremental protection is really worth the effort is debatable, but as long as it's only people who think it *is* worth the effort who have to deal with it, I don't mind. What I don't like about this proposal is that the filter configuration is kept somewhere where it's not at all hard for an attacker to modify it. It can't be a GUC, indeed it can't be in any file that the server has permissions to write on, or you might as well not bother. So I'd throw away pretty much everything in the submitted patch, and instead think about doing the filtering/limiting in something that's launched from the service start script and in turn launches the postmaster. That makes it (mostly?) Not Our Problem, but rather an independent component. Admittedly, you can't get per-subprocess restrictions that way, but the incremental value from that seems *really* tiny. If listen() has a bug you need to fix the bug, not invent this amount of rickety infrastructure to limit who can call it. (And, TBH, I'm still wondering why SELinux isn't the way to address this.) regards, tom lane
Hi, On 2019-08-28 14:23:00 -0400, Joshua Brindle wrote: > > or some similar technology where the filtering is done by logic > > that's outside the executable you wish to not trust. > > (After googling for libseccomp, I see that it's supposed to not > > allow syscalls to be turned back on once turned off, but that isn't > > any protection against this problem. An attacker who's found an ACE > > hole in Postgres can just issue ALTER SYSTEM SET to disable the > > feature, then force a postmaster restart, then profit.) A postmaster restart might not be enough, because the postmaster's restrictions can't be removed, once in place. But all that's needed to circumvent that is force postmaster to die, and rely on systemd etc to restart it. > My preference would have been to enable it unconditionally but Joe was > being more practical. Well, the current approach is to configure the list of allowed syscalls in postgres. How would you ever secure that against the attacks described by Tom? As long as the restrictions are put into place by postgres itself, and as long they're configurable, such attacks are possible, no? And as long as extensions etc need different syscalls, you'll need configurability. > > I follow the idea of limiting the attack surface for kernel bugs, > > but this doesn't seem like a useful implementation of that, even > > ignoring the ease-of-use problems Peter mentions. > > Limiting the kernel attack surface for network facing daemons is > imperative to hardening systems. All modern attacks are chained > together so a kernel bug is useful only if you can execute code, and > PG is a decent vector for executing code. I don't really buy that in case pof postgres. Normally, in a medium to high security world, once you have RCE in postgres, the valuable data can already be exfiltrated. And that's game over. The only real benefits of a kernel vulnerabily is that that might allow to persist the attack for longer - but there's already plenty you can do inside postgres, once you have RCE. > At a minimum I would urge the community to look at adding high risk > syscalls to the kill list, systemd has some predefined sets we can > pick from like @obsoluete, @cpu-emulation, @privileged, @mount, and > @module. I can see some small value in disallowing these - but we're back to the point where that is better done one layer *above* postgres, by a process with more privileges than PG. Because then a PG RCE doesn't have a way to prevent those filters from being applied. > The postmaster and per-role lists can further reduce the available > syscalls based on the exact extensions and PLs being used. I don't buy that per-session configurable lists buy you anything meaningful. With an RCE in one session, it's pretty trivial to corrupt shared memory to also trigger RCE in other sessions. And there's no way seccomp or anything like that will prevent that. An additional reason I'm quite sceptical about more fine grained restrictions is that I think we're going to have to go for some use of threading in the next few years. While I think that's still far from agreed upon, I think there's a pretty large number of "senior" hackers that see this as the future. You can have per-thread seccomp filters, but that's so trivial to circumvent (just overwrite some vtable like data in another thread's data, and have it call whatever gadget you want), that it's not even worth considering. Greetings, Andres Freund
On Wed, Aug 28, 2019 at 2:30 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > <snip> > > (And, TBH, I'm still wondering why SELinux isn't the way to address this.) Just going to address this one now. SELinux is an LSM and therefore only makes decisions when LSM hooks are invoked, which are not 1:1 for syscalls (not even close). Further, SELinux generally determines what objects a subject can access and only implements capabilities because it has to, to be non-bypassable. Seccomp filtering is an orthogonal system to SELinux and LSMs in general. We are already doing work to further restrict PG subprocesses with SELinux via [1] and [2], but that doesn't address the unused, high-risk, obsolete, etc syscall issue. A prime example is madvise() which was a catastrophic failure that 1) isn't preventable by any LSM including SELinux, 2) isn't used by PG and is therefore a good candidate for a kill list, and 3) a clear win in the dont-let-PG-be-a-vector-for-kernel-compromise arena. We are using SELinux, we are also going to use this, they work together.
On 2019-08-28 14:30:20 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > Maybe I'm missing something, but it's not clear to me what meaningful > > attack surface can be reduced for PostgreSQL by forbidding certain > > syscalls, given the wide variety of syscalls required to run postgres. > > I think the idea is to block access to seldom-used syscalls because > those are exactly the ones most likely to have bugs. Yea, I can see some benefit in that. I'm just quite doubtful that the set of syscalls pg relies on doesn't already allow any determined attacker to trigger kernel bugs. E.g. the whole sysv ipc code is among those seldomly used areas of the code. As is the permission transfer through unix sockets. As is forking from within a syscall. ... > (We certainly hope that all the ones PG uses are well-debugged...) One would hope, but it's not quite my experience... > Whether the incremental protection is really worth the effort is > debatable, but as long as it's only people who think it *is* worth the > effort who have to deal with it, I don't mind. It seems almost guaranteed that there'll be bug reports about ominous crashes due to some less commonly used syscall being blacklisted. In a lot of cases that'll be hard to debug. After all, we already got such bug reports, without us providing anything builtin - and it's not like us adding our own filter suport will make container solutions not use their filter, so there's no argument that doing so ourselves will reduce the fragility. > Admittedly, you can't get per-subprocess restrictions that way, but the > incremental value from that seems *really* tiny. If listen() has a bug > you need to fix the bug, not invent this amount of rickety infrastructure > to limit who can call it. And, as I mentioned in another email, once you can corrupt shared memory in arbitrary ways, the differing restrictions aren't worth much anyway. Postmaster might be separated out enough to survive attacks like that, but backends definitely aren't. Greetings, Andres Freund
On Wed, Aug 28, 2019 at 2:47 PM Joshua Brindle <joshua.brindle@crunchydata.com> wrote: > > On Wed, Aug 28, 2019 at 2:30 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > <snip> > > > > (And, TBH, I'm still wondering why SELinux isn't the way to address this.) > > Just going to address this one now. SELinux is an LSM and therefore > only makes decisions when LSM hooks are invoked, which are not 1:1 for > syscalls (not even close). Further, SELinux generally determines what > objects a subject can access and only implements capabilities because > it has to, to be non-bypassable. > > Seccomp filtering is an orthogonal system to SELinux and LSMs in > general. We are already doing work to further restrict PG subprocesses > with SELinux via [1] and [2], but that doesn't address the unused, And I forgot the citations *sigh*, actually there should have only been [1]: 1. https://commitfest.postgresql.org/24/2259/ > high-risk, obsolete, etc syscall issue. A prime example is madvise() > which was a catastrophic failure that 1) isn't preventable by any LSM > including SELinux, 2) isn't used by PG and is therefore a good > candidate for a kill list, and 3) a clear win in the > dont-let-PG-be-a-vector-for-kernel-compromise arena. > > We are using SELinux, we are also going to use this, they work together.
Hi, On 2019-08-28 14:47:04 -0400, Joshua Brindle wrote: > A prime example is madvise() which was a catastrophic failure that 1) > isn't preventable by any LSM including SELinux, 2) isn't used by PG > and is therefore a good candidate for a kill list, and 3) a clear win > in the dont-let-PG-be-a-vector-for-kernel-compromise arena. IIRC it's used by glibc as part of its malloc implementation (also threading etc) - but not necessarily hit during the most common paths. That's *precisely* my problem with this approach. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2019-08-28 14:30:20 -0400, Tom Lane wrote: >> Admittedly, you can't get per-subprocess restrictions that way, but the >> incremental value from that seems *really* tiny. If listen() has a bug >> you need to fix the bug, not invent this amount of rickety infrastructure >> to limit who can call it. > And, as I mentioned in another email, once you can corrupt shared memory > in arbitrary ways, the differing restrictions aren't worth much > anyway. Postmaster might be separated out enough to survive attacks like > that, but backends definitely aren't. Another point in this area is that if you did feel a need for per-process syscall sets, having a restriction that the postmaster's allowed set be a superset of all the childrens' allowed sets seems quite the wrong thing. The set of calls the postmaster needs is probably a lot smaller than what the children need, seeing that it does so little. It's just different because it includes bind+listen which the children likely don't need. So the hierarchical way seccomp goes about this seems fairly wrong for our purposes regardless. regards, tom lane
On Wed, Aug 28, 2019 at 2:53 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2019-08-28 14:47:04 -0400, Joshua Brindle wrote: > > A prime example is madvise() which was a catastrophic failure that 1) > > isn't preventable by any LSM including SELinux, 2) isn't used by PG > > and is therefore a good candidate for a kill list, and 3) a clear win > > in the dont-let-PG-be-a-vector-for-kernel-compromise arena. > > IIRC it's used by glibc as part of its malloc implementation (also > threading etc) - but not necessarily hit during the most common > paths. That's *precisely* my problem with this approach. > As long as glibc handles a returned error cleanly the syscall could be denied without harming the process and the bug would be mitigated. seccomp also allows argument whitelisting so things can get very granular, depending on who is setting up the lists.
Andres Freund <andres@anarazel.de> writes: > On 2019-08-28 14:47:04 -0400, Joshua Brindle wrote: >> A prime example is madvise() which was a catastrophic failure that 1) >> isn't preventable by any LSM including SELinux, 2) isn't used by PG >> and is therefore a good candidate for a kill list, and 3) a clear win >> in the dont-let-PG-be-a-vector-for-kernel-compromise arena. > IIRC it's used by glibc as part of its malloc implementation (also > threading etc) - but not necessarily hit during the most common > paths. That's *precisely* my problem with this approach. I think Andres is right here. There are madvise calls in glibc: glibc-2.28/malloc/malloc.c: __madvise (paligned_mem, size & ~psm1, MADV_DONTNEED); glibc-2.28/malloc/arena.c: __madvise ((char *) h + new_size, diff, MADV_DONTNEED); It appears that the first is only reachable from __malloc_trim which we don't use, but the second is reachable from free(). However, strace'ing says that it's never called during our standard regression tests, confirming Andres' thought that it's in seldom-reached paths. (I did not go through the free() logic in any detail, but it looks like it is only reached when dealing with quite-large allocations, which'd make sense.) So this makes a perfect example for Peter's point that testing is going to be a very fallible way of finding the set of syscalls that need to be allowed. Even if we had 100.00% code coverage of PG proper, we would not necessarily find calls like this. regards, tom lane
Hi, On 2019-08-28 15:02:17 -0400, Joshua Brindle wrote: > On Wed, Aug 28, 2019 at 2:53 PM Andres Freund <andres@anarazel.de> wrote: > > On 2019-08-28 14:47:04 -0400, Joshua Brindle wrote: > > > A prime example is madvise() which was a catastrophic failure that 1) > > > isn't preventable by any LSM including SELinux, 2) isn't used by PG > > > and is therefore a good candidate for a kill list, and 3) a clear win > > > in the dont-let-PG-be-a-vector-for-kernel-compromise arena. > > > > IIRC it's used by glibc as part of its malloc implementation (also > > threading etc) - but not necessarily hit during the most common > > paths. That's *precisely* my problem with this approach. > > > > As long as glibc handles a returned error cleanly the syscall could be > denied without harming the process and the bug would be mitigated. And we'd hit mysterious slowdowns in production uses of PG when seccomp is enabled. Greetings, Andres Freund
On Wed, Aug 28, 2019 at 3:22 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2019-08-28 15:02:17 -0400, Joshua Brindle wrote: > > On Wed, Aug 28, 2019 at 2:53 PM Andres Freund <andres@anarazel.de> wrote: > > > On 2019-08-28 14:47:04 -0400, Joshua Brindle wrote: > > > > A prime example is madvise() which was a catastrophic failure that 1) > > > > isn't preventable by any LSM including SELinux, 2) isn't used by PG > > > > and is therefore a good candidate for a kill list, and 3) a clear win > > > > in the dont-let-PG-be-a-vector-for-kernel-compromise arena. > > > > > > IIRC it's used by glibc as part of its malloc implementation (also > > > threading etc) - but not necessarily hit during the most common > > > paths. That's *precisely* my problem with this approach. > > > > > > > As long as glibc handles a returned error cleanly the syscall could be > > denied without harming the process and the bug would be mitigated. > > And we'd hit mysterious slowdowns in production uses of PG when seccomp > is enabled. It seems like complete system compromises should be prioritized over slowdowns, and it seems very unlikely to cause a noticeable slowdown anyway. Are there PG users that backed out all of the Linux KPTI patches due to the slowdown? I think we need to reign in the thread somewhat. The feature allows end users to define some sandboxing within PG. Nothing is being forced on anyone but we would like the capability to harden a PG installation for many reasons already stated. This is being done in places all across the Linux ecosystem and is IMO a very useful mitigation. Thank you.
On Thu, Aug 29, 2019 at 7:08 AM Joshua Brindle <joshua.brindle@crunchydata.com> wrote: > On Wed, Aug 28, 2019 at 2:53 PM Andres Freund <andres@anarazel.de> wrote: > > On 2019-08-28 14:47:04 -0400, Joshua Brindle wrote: > > > A prime example is madvise() which was a catastrophic failure that 1) > > > isn't preventable by any LSM including SELinux, 2) isn't used by PG > > > and is therefore a good candidate for a kill list, and 3) a clear win > > > in the dont-let-PG-be-a-vector-for-kernel-compromise arena. > > > > IIRC it's used by glibc as part of its malloc implementation (also > > threading etc) - but not necessarily hit during the most common > > paths. That's *precisely* my problem with this approach. > > > > As long as glibc handles a returned error cleanly the syscall could be > denied without harming the process and the bug would be mitigated. > > seccomp also allows argument whitelisting so things can get very > granular, depending on who is setting up the lists. Just by the way, there may also be differences between architectures. After some head scratching, we recently discovered[1] that default seccomp whitelists currently cause PostgreSQL to panic for users of Docker, Nspawn etc on POWER and ARM because of that. That's a bug being fixed elsewhere, but it reveals another thing to be careful of if you're trying to build your own whitelist by guessing what libc needs to call. [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGLiR569VHLjtCNp3NT%2BjnKdhy8g2sdgKzWNojyWQVt6Bw%40mail.gmail.com -- Thomas Munro https://enterprisedb.com
Hi, On 2019-08-28 15:38:11 -0400, Joshua Brindle wrote: > It seems like complete system compromises should be prioritized over > slowdowns, and it seems very unlikely to cause a noticeable slowdown > anyway. The point isn't really this specific issue, but that the argument that you'll not cause problems by disabling certain syscalls, or that it's easy to find which ones are used, just plainly isn't true. > Are there PG users that backed out all of the Linux KPTI patches due > to the slowdown? Well, not backed out on a code level, but straight out disabled at boot time (i.e. pti=off)? Yea, I know of several. > I think we need to reign in the thread somewhat. The feature allows > end users to define some sandboxing within PG. Nothing is being forced > on anyone Well, we'll have to deal with the fallout of this to some degree. When postgres breaks people will complain, when it's slow, people will complain, ... Greetings, Andres Freund
On 2019-08-28 21:38, Joshua Brindle wrote: > I think we need to reign in the thread somewhat. The feature allows > end users to define some sandboxing within PG. Nothing is being forced > on anyone Features come with a maintenance cost. If we ship it, then people are going to try it out. Then weird things will happen. They will report mysterious bugs. They will complain to their colleagues. It all comes with a cost. > but we would like the capability to harden a PG installation > for many reasons already stated. Most if not all of those reasons seem to have been questioned. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2019-Aug-28, Joshua Brindle wrote: > I think we need to reign in the thread somewhat. The feature allows > end users to define some sandboxing within PG. Nothing is being forced > on anyone but we would like the capability to harden a PG installation > for many reasons already stated. My own objection to this line of development is that it doesn't seem that any useful policy (allowed/denied syscall list) is part or intends to be part of the final feature. So we're shipping a hook system for which each independent vendor is going to develop their own policy. Joe provided an example syscall list, but it's not part of the patch proper; and it seems, per the discussion, that the precise syscall list to use is a significant fraction of this. So, as part of a committable patch, IMO it'd be good to have some sort of final list of syscalls -- maybe as part of the docbook part of the patch. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 8/28/19 4:07 PM, Peter Eisentraut wrote: > On 2019-08-28 21:38, Joshua Brindle wrote: >> I think we need to reign in the thread somewhat. The feature allows >> end users to define some sandboxing within PG. Nothing is being forced >> on anyone > > Features come with a maintenance cost. If we ship it, then people are > going to try it out. Then weird things will happen. They will report > mysterious bugs. They will complain to their colleagues. It all comes > with a cost. > >> but we would like the capability to harden a PG installation >> for many reasons already stated. > > Most if not all of those reasons seem to have been questioned. Clearly Joshua and I disagree, but understand that the consensus is not on our side. It is our assessment that PostgreSQL will be subject to seccomp willingly or not (e.g., via docker, systemd, etc.) and the community might be better served to get out in front and have first class support. But I don't want to waste any more of anyone's time on this topic, except to ask if two strategically placed hooks are asking too much? Specifically hooks to replace these two stanzas in the patch: 8<-------------------------- diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index 62dc93d..2216d49 100644 *** a/src/backend/postmaster/postmaster.c --- b/src/backend/postmaster/postmaster.c *************** PostmasterMain(int argc, char *argv[]) *** 963,968 **** --- 963,982 ---- [...] diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c index 43b9f17..aac1940 100644 *** a/src/backend/utils/init/postinit.c --- b/src/backend/utils/init/postinit.c *************** InitPostgres(const char *in_dbname, Oid *** 1056,1061 **** --- 1056,1076 ---- [...] 8<-------------------------- We will continue to pursue this development for customers that require it and plan to provide an update on our analysis and results. We thank you for your comments and suggestions. Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development
Attachment
Joe Conway <mail@joeconway.com> writes: > Clearly Joshua and I disagree, but understand that the consensus is not > on our side. It is our assessment that PostgreSQL will be subject to > seccomp willingly or not (e.g., via docker, systemd, etc.) and the > community might be better served to get out in front and have first > class support. Sure, but ... > But I don't want to waste any more of anyone's time on this topic, > except to ask if two strategically placed hooks are asking too much? ... hooks are still implying a design with the filter control inside Postgres. Which, as I said before, seems like a fundamentally incorrect architecture. I'm not objecting to having such control, but I think it has to be outside the postmaster, or it's just not a credible security improvement. It doesn't help to say "I'm going to install a lock to keep out a thief who *by assumption* is carrying lock picking tools." regards, tom lane
On Thu, Aug 29, 2019 at 10:01 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Joe Conway <mail@joeconway.com> writes: > > Clearly Joshua and I disagree, but understand that the consensus is not > > on our side. It is our assessment that PostgreSQL will be subject to > > seccomp willingly or not (e.g., via docker, systemd, etc.) and the > > community might be better served to get out in front and have first > > class support. > > Sure, but ... > > > But I don't want to waste any more of anyone's time on this topic, > > except to ask if two strategically placed hooks are asking too much? > > ... hooks are still implying a design with the filter control inside > Postgres. Which, as I said before, seems like a fundamentally incorrect > architecture. I'm not objecting to having such control, but I think > it has to be outside the postmaster, or it's just not a credible > security improvement. It doesn't help to say "I'm going to install > a lock to keep out a thief who *by assumption* is carrying lock > picking tools." > I recognize this discussion is over but this is a mischaracterization of the intent. Upthread I said: "This may not do it alone, and security conscious integrators will want to, for example, add seccomp filters to systemd to prevent superuser from disabling them. The postmaster and per-role lists can further reduce the available syscalls based on the exact extensions and PLs being used. Each step reduced the surface more and throwing it all out because one step can go rogue is unsatisfying." There are no security silver bullets, each thing we do is risk reduction, and that includes this patchset, whether you can see it or not. Thank you.
On 8/29/19 10:00 AM, Tom Lane wrote: > Joe Conway <mail@joeconway.com> writes: >> Clearly Joshua and I disagree, but understand that the consensus is not >> on our side. It is our assessment that PostgreSQL will be subject to >> seccomp willingly or not (e.g., via docker, systemd, etc.) and the >> community might be better served to get out in front and have first >> class support. > > Sure, but ... > >> But I don't want to waste any more of anyone's time on this topic, >> except to ask if two strategically placed hooks are asking too much? > > ... hooks are still implying a design with the filter control inside > Postgres. Which, as I said before, seems like a fundamentally incorrect > architecture. I'm not objecting to having such control, but I think > it has to be outside the postmaster, or it's just not a credible > security improvement. I disagree. Once a filter is loaded there is no way to unload it short of a postmaster restart. That is an easily detected event that can be alerted upon, and that is definitely a security improvement. Perhaps that is a reason to also set the session level GUC to PGC_POSTMASTER, but that is an easy change if deemed necessary. Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development
Attachment
Hi, This patch is currently in "needs review" state, but the last message is from August 29, and my understanding is that there have been a couple of objections / disagreements about the architecture, difficulties with producing the set of syscalls, and not providing any built-in policy. I don't think we're any closer to resolve those disagreements since August, so I think we should make some decision about this patch, instead of just moving it from one CF to the next one. The "needs review" status seems not reflecting the situation. Are there any plans to post a new version of the patch with a different design, or something like that? If not, I propose we mark it either as rejected or returned with feedback (and maybe get a new patch in the future). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 1/6/20 8:37 PM, Tomas Vondra wrote: > Hi, > > This patch is currently in "needs review" state, but the last message is > from August 29, and my understanding is that there have been a couple of > objections / disagreements about the architecture, difficulties with > producing the set of syscalls, and not providing any built-in policy. > > I don't think we're any closer to resolve those disagreements since > August, so I think we should make some decision about this patch, > instead of just moving it from one CF to the next one. The "needs > review" status seems not reflecting the situation. > > Are there any plans to post a new version of the patch with a different > design, or something like that? If not, I propose we mark it either as > rejected or returned with feedback (and maybe get a new patch in the > future). I assumed it was rejected. Joe -- Crunchy Data - http://crunchydata.com PostgreSQL Support for Secure Enterprises Consulting, Training, & Open Source Development
Attachment
On Tue, Jan 07, 2020 at 06:02:14AM -0500, Joe Conway wrote: >On 1/6/20 8:37 PM, Tomas Vondra wrote: >> Hi, >> >> This patch is currently in "needs review" state, but the last message is >> from August 29, and my understanding is that there have been a couple of >> objections / disagreements about the architecture, difficulties with >> producing the set of syscalls, and not providing any built-in policy. >> >> I don't think we're any closer to resolve those disagreements since >> August, so I think we should make some decision about this patch, >> instead of just moving it from one CF to the next one. The "needs >> review" status seems not reflecting the situation. >> >> Are there any plans to post a new version of the patch with a different >> design, or something like that? If not, I propose we mark it either as >> rejected or returned with feedback (and maybe get a new patch in the >> future). > > >I assumed it was rejected. > I don't know. I still see it in the CF app with "needs review" status: https://commitfest.postgresql.org/26/2263/ Barring objections, I'll mark it as rejected. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jan 7, 2020 at 7:59 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > Barring objections, I'll mark it as rejected. I think that's right. Since I just caught up on this thread, I'd like to offer a few belated comments: 1. I don't think it would kill us to add a few hooks that would allow this feature to be added by a loadable module. Someone may argue that we should never add a hook unless we know exactly how it's going to be used and agree with it as a goal, but I don't agree with that. Refusing to add hooks just causes people to fork the server. If somebody loads code that uses a hook at least you can tell that they've done it by looking at shared_preload_libraries; if they fork the server it may be much harder to tell that you're not dealing with straight-up PostgreSQL. At any rate, ease-of-debugging considerations for core developers should not take precedence over letting people do with PostgreSQL what they wish. 2. I feel strongly that shipping this feature with mechanism but not policy is unwise; I thought Alvaro articulated this problem particularly well. I think the evidence on this thread is pretty clear: this WILL break for some users, and it WILL need fixing. If the mechanism is in core and the policy is not, then it seems likely that employees at Crunchy, who apparently run into customers that need this on a regular basis, will develop a set of best practices which will allow them to advise customers as to what settings will or will not work well, but because that knowledge will not be embedded in core, it will be pretty hard for anybody else to support such customers, unless they too have a lot of customers who want to run in this mode. I would be a lot more supportive of this if both the mechanism and the policy were going to ship in core and be maintained in core, with adequate documentation. 3. The difficulty in making that happen is that the set of system calls that need to be whitelisted seems likely to vary based on platform, kernel version, glibc version, PostgreSQL build options, loadable modules used, and which specific PostgreSQL features you care about. I can't help feeling that this is designed mostly for processes that do far simpler things than PostgreSQL. It would be interesting, for example, to know what bash or perl does about this. They have the same problem that PostgreSQL does, namely, that they are intended to let users do almost arbitrary things by design -- not a totally unlimited set of things, but an awful lot of things. Perhaps over time this mechanism will undergo design changes, or a clearer set of best practices will emerge, so that it's easier to see how PostgreSQL could use this without breaking things. If indeed this is the future, you can imagine something like glibc getting a "seccomp-clean" mode in which it can be built, and if that happened and were widely used, then the difficulties for PostgreSQL would be reduced. Because such improvements typically happen over time through trial and error and the efforts of many people, I think it is to our advantage to allow people to experiment with the feature even as it exists today out of core, which gets me back to point #1. I agree with Joshua Brindle's point that holding our breath in response to a widely-adopted technology is not a very useful response. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jan 07, 2020 at 11:33:07AM -0500, Robert Haas wrote: >On Tue, Jan 7, 2020 at 7:59 AM Tomas Vondra ><tomas.vondra@2ndquadrant.com> wrote: >> Barring objections, I'll mark it as rejected. > >I think that's right. Done. >Since I just caught up on this thread, I'd like to offer a few belated >comments: > >1. I don't think it would kill us to add a few hooks that would allow >this feature to be added by a loadable module. Someone may argue that >we should never add a hook unless we know exactly how it's going to be >used and agree with it as a goal, but I don't agree with that. >Refusing to add hooks just causes people to fork the server. If >somebody loads code that uses a hook at least you can tell that they've >done it by looking at shared_preload_libraries; if they fork the server >it may be much harder to tell that you're not dealing with straight-up >PostgreSQL. At any rate, ease-of-debugging considerations for core >developers should not take precedence over letting people do with >PostgreSQL what they wish. > Not sure I understand. I don't think anyone argued by hooks vs. forking the server in this particular thread, but the thread is fairly long so maybe I'm missing something. I think the "hook issue" is that the actual code is somewhere else. On the one hand that minimizes the dev/testing/maintenance burden for us, on the other hand it means we're not really testing the hooks. Meh. But in this case I think the main argument against hooks was that Tom thinks it's not really the right way to implement this. I don't know if he's right or not, I don't have an opinion on how to integrate seccomp. >2. I feel strongly that shipping this feature with mechanism but not >policy is unwise; I thought Alvaro articulated this problem >particularly well. I think the evidence on this thread is pretty clear: >this WILL break for some users, and it WILL need fixing. If the >mechanism is in core and the policy is not, then it seems likely that >employees at Crunchy, who apparently run into customers that need this >on a regular basis, will develop a set of best practices which will >allow them to advise customers as to what settings will or will not >work well, but because that knowledge will not be embedded in core, it >will be pretty hard for anybody else to support such customers, unless >they too have a lot of customers who want to run in this mode. I would >be a lot more supportive of this if both the mechanism and the policy >were going to ship in core and be maintained in core, with adequate >documentation. > Well, but this exact argument applies to various other approaches: 1) no hooks, forking PostgreSQL 2) hooks added, but neither code nor policy included 3) hooks aded, code included, policy not included Essentially the only case where Crunchy would not have this "lock-in" advantage is when everything is included into PostgreSQL, at which point we can probably make this work without hooks I suppose. >3. The difficulty in making that happen is that the set of system calls >that need to be whitelisted seems likely to vary based on platform, >kernel version, glibc version, PostgreSQL build options, loadable >modules used, and which specific PostgreSQL features you care about. I >can't help feeling that this is designed mostly for processes that do >far simpler things than PostgreSQL. It would be interesting, for >example, to know what bash or perl does about this. They have the same >problem that PostgreSQL does, namely, that they are intended to let >users do almost arbitrary things by design -- not a totally unlimited >set of things, but an awful lot of things. Perhaps over time this >mechanism will undergo design changes, or a clearer set of best >practices will emerge, so that it's easier to see how PostgreSQL could >use this without breaking things. If indeed this is the future, you can >imagine something like glibc getting a "seccomp-clean" mode in which it >can be built, and if that happened and were widely used, then the >difficulties for PostgreSQL would be reduced. Because such improvements >typically happen over time through trial and error and the efforts of >many people, I think it is to our advantage to allow people to >experiment with the feature even as it exists today out of core, which >gets me back to point #1. I agree with Joshua Brindle's point that >holding our breath in response to a widely-adopted technology is not a >very useful response. > I think this is probably the main challenge of this patch - development, maintenance and testing of the policies. I think it's pretty clear the community can't really support this on all possible environments, or with third-party extensions. I don't know if it's even possible to write a "universal policy" covering significant range of systems? It seems much more realistic that individual providers will develop policies for systems of customers. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jan 7, 2020 at 12:56 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > I think the "hook issue" is that the actual code is somewhere else. On > the one hand that minimizes the dev/testing/maintenance burden for us, > on the other hand it means we're not really testing the hooks. Meh. I don't care about the testing the hooks. If we provide hooks and someone finds them useful, great. If not, they don't have to use them. The mere existence of this hook isn't going to put any significant maintenance burden on the community, while the feature as a whole probably would. > Well, but this exact argument applies to various other approaches: > > 1) no hooks, forking PostgreSQL > 2) hooks added, but neither code nor policy included > 3) hooks aded, code included, policy not included > > Essentially the only case where Crunchy would not have this "lock-in" > advantage is when everything is included into PostgreSQL, at which point > we can probably make this work without hooks I suppose. Well, from my point of view, in case 1 or 2, the feature is entirely Crunchy's. If it works great, good for them. If it sucks, it's their problem. In case 3, the feature is ostensibly a community feature but probably nobody other than Crunchy can actually make it work. That latter situation seems a lot more problematic to me. I don't like PostgreSQL features that I can't make work. If it's too complicated for other developers to figure out, it's probably going to be a real pain for users, too. Putting my cards on the table, I think it's likely that the proposed design contains a significant amount of suckitude. Serious usability and security concerns have been raised, and I find those concerns legitimate. On the other hand, it may still be useful to some people. More importantly, if they can more easily experiment with it, they'll have a chance to find out whether it sucks and, if so, make it better. Perhaps something that we can accept into core will ultimately result. That would be good for everybody. Also, generally, I don't think we should block features (hooks or otherwise) because some other company might get more benefit than our own employer. That seems antithetical to the concept of open source. Blocking them because they're poorly designed or will impose a burden on the community is a different thing. > I think this is probably the main challenge of this patch - development, > maintenance and testing of the policies. I think it's pretty clear the > community can't really support this on all possible environments, or > with third-party extensions. I don't know if it's even possible to write > a "universal policy" covering significant range of systems? It seems > much more realistic that individual providers will develop policies for > systems of customers. I generally agree with this, although I'm not sure that I understand what you're advocating that we do. Accepting the feature as-is seems like a no-go given the remarks already made, and while I don't think I feel as strongly about it as some of the people who have already spoken on the topic, I do share their doubts to some degree and am not in a position to override them even if I disagreed completely. But, hooks would give individual providers those same options, without really burdening anyone else. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company