I’ll just assert that there’s no way to use seccomp() correctly, just like there’s no way to use gets() correctly, which is why gets() was eventually removed from the C and C++ standards.

seccomp, briefly

seccomp allows you to filter syscalls with a ruleset.

The obvious thing is to filter anything your program isn’t supposed to be doing. If it doesn’t do file IO, don’t let it open files. If it’s not supposed to execute anything, don’t let it do that.

But whether you use a whitelist (e.g. only allow working with already open file descriptors), or a blacklist (e.g. don’t allow it to open these files), it’s fundamentally flawed.

1. Syscalls change. Sometimes without even recompiling

open() in your code actually becomes the openat syscall. Maybe. At least today. At least on my machine, today.

select() actually becomes pselect6. At least on Fridays.

If you upgrade libc or distribute a binary to other systems, this may start to fail.
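Here is a rough sketch of how you could check this on your own machine. The helper name open_is_blocked_by() is made up for this post, and the x86_64/aarch64 handling and the use of /dev/null are assumptions of the sketch. It forks a child, installs a filter that refuses exactly one syscall number with EPERM, and then checks whether libc's open() is affected.

```c
#include <errno.h>
#include <fcntl.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumption: this sketch only knows two architectures. */
#if defined(__x86_64__)
#define NATIVE_AUDIT_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define NATIVE_AUDIT_ARCH AUDIT_ARCH_AARCH64
#else
#error "this sketch only covers x86_64 and aarch64"
#endif

/* Returns 1 if open() failed with EPERM in a child that denies syscall
 * number nr, 0 if open() was unaffected, -1 if the filter could not be
 * set up at all. */
static int open_is_blocked_by(int nr)
{
    struct sock_filter insns[] = {
        /* Kill on an unexpected architecture: numbers differ per arch. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, NATIVE_AUDIT_ARCH, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        /* Deny exactly one syscall number, allow everything else. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (unsigned)nr, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(insns) / sizeof(insns[0]),
        .filter = insns,
    };

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* The filter only applies to this child, not to the caller. */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0 ||
            prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
            _exit(2);
        int fd = open("/dev/null", O_RDONLY);
        _exit(fd < 0 && errno == EPERM ? 1 : 0);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status) ||
        WEXITSTATUS(status) == 2)
        return -1;
    return WEXITSTATUS(status);
}
```

Comparing open_is_blocked_by(SYS_openat) with open_is_blocked_by(SYS_open), on architectures where SYS_open exists at all, tells you which one your libc happens to use today. I deliberately don't state the expected result, since that is exactly the part that isn't guaranteed.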

2. Surprising syscalls

Calling printf() makes the newfstatat syscall, a name that’s hard to even parse into words. But only the first time you call it! So after your first printf() you can block newfstatat.

Maybe this will all work just fine, normally. But then an unrelated bug happens, and your tool tries to log it, but can’t because newfstatat is blocked. So you get no logs.

So it’s not just about which functions you call. It also depends heavily on the order in which you call them relative to dropping privileges.

In my example it worked fine when I ran with verbose mode turned on, but not with it off. That’s because in verbose mode I called printf() before dropping privs.

3. (hinting at the solution): There’s no grouping

I would say that the most common thing everyone wants to do is this: After everything’s set up, don’t allow anything done by the process to interact with anything else, except via already open file descriptors.

That’s almost true. Getting the current time, and memory allocation, is probably also safe.

(But the original binary on/off seccomp mode, which only allows read(), write(), _exit() and sigreturn(), blocked even those.)

But there’s no way to express this. In order to actually interact with open network sockets in the most minimal of ways I’d need at least:

  • pselect6
  • select
  • poll
  • ppoll
  • write
  • pwrite64
  • writev
  • pwritev
  • read
  • pread64
  • readv
  • preadv
  • close
  • sendfile
  • sendto
  • sendmsg
  • sendmmsg
  • recvfrom
  • recvmsg
  • recvmmsg

And that’s just for the most trivial of examples, where you have some unsafe code (e.g. a parser) that takes input on one fd and gives output on another. For example, an oracle that takes an X.509 certificate (famously tricky to parse) and a hostname, and returns whether the certificate is valid for that hostname.
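For illustration, the list above could be turned into a filter roughly like this. It's a sketch, not a recommendation: the names fd_only_syscalls, install_allowlist() and demo_fd_only_sandbox() are made up for this post, the list is spelled with the kernel's names from <sys/syscall.h> (so pread64 and readv), exit_group is added so the process can terminate, and only x86_64 and aarch64 are handled.

```c
#include <errno.h>
#include <fcntl.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__x86_64__)
#define NATIVE_AUDIT_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define NATIVE_AUDIT_ARCH AUDIT_ARCH_AARCH64
#else
#error "this sketch only covers x86_64 and aarch64"
#endif
#define MAX_ALLOWED 64

/* select and poll do not exist as syscalls on every architecture. */
static const int fd_only_syscalls[] = {
#ifdef SYS_select
    SYS_select,
#endif
#ifdef SYS_poll
    SYS_poll,
#endif
    SYS_pselect6, SYS_ppoll,
    SYS_write, SYS_pwrite64, SYS_writev, SYS_pwritev,
    SYS_read, SYS_pread64, SYS_readv, SYS_preadv,
    SYS_close, SYS_sendfile, SYS_sendto, SYS_sendmsg, SYS_sendmmsg,
    SYS_recvfrom, SYS_recvmsg, SYS_recvmmsg,
    SYS_exit_group,
};
#define N_FD_ONLY (sizeof fd_only_syscalls / sizeof fd_only_syscalls[0])

/* Allow only the listed syscalls; everything else fails with EPERM.
 * Returns 0 on success, -1 on failure. */
static int install_allowlist(const int *allowed, size_t n)
{
    struct sock_filter f[5 + 2 * MAX_ALLOWED];
    size_t i = 0;

    if (n > MAX_ALLOWED)
        return -1;
    /* Kill on an unexpected architecture: numbers differ per arch. */
    f[i++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch));
    f[i++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, NATIVE_AUDIT_ARCH, 1, 0);
    f[i++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS);
    f[i++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr));
    for (size_t j = 0; j < n; j++) {
        /* On a match fall through to ALLOW, otherwise skip over it. */
        f[i++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (unsigned)allowed[j], 0, 1);
        f[i++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW);
    }
    f[i++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM);

    struct sock_fprog prog = { .len = (unsigned short)i, .filter = f };
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) == 0 ? 0 : -1;
}

/* Usage sketch: returns 0 if, in a sandboxed child, an already open pipe
 * still works while opening a new file is refused. */
static int demo_fd_only_sandbox(void)
{
    int p[2];
    if (pipe(p) != 0)
        return -1;
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        char c = 0;
        if (install_allowlist(fd_only_syscalls, N_FD_ONLY) != 0)
            _exit(2);
        int io_ok = write(p[1], "x", 1) == 1 && read(p[0], &c, 1) == 1 && c == 'x';
        int open_refused = open("/dev/null", O_RDONLY) == -1 && errno == EPERM;
        _exit(io_ok && open_refused ? 0 : 1);
    }
    int status;
    close(p[0]);
    close(p[1]);
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Note how much of this is bookkeeping that has nothing to do with the actual intent ("only already open file descriptors"), and how the #ifdef lines already hint at the architecture problem.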

And what’s worse: this is completely dynamic, and depends on the architecture and on the libc version. It can change from execution to execution, or millisecond to millisecond. Which syscalls a libc function makes is just not part of the ABI.

There’s nothing stopping libc from changing to implementing read() as a special case of readv(). select() could be implemented in terms of poll(), tomorrow.

There are 300+ syscalls, and that number will likely grow. Do you know which ones are “just read or write from the sockets”?

So I don’t think the seccomp(2) manpage is realistic when it says:

It is strongly recommended to use an allow-list approach whenever
possible because such an approach is more robust and simple.  A
deny-list will have to be updated whenever a potentially dangerous
system call is added

Good luck with that.

The solution

OpenBSD clearly got this right. Don’t list syscalls. Who cares if it’s poll() or select()?

For example, arping uses pledge() to prevent it from doing anything bad at all.

Go on, think about it. Even with full control of the process, what could you possibly do after it runs pledge("stdio", "")? Print profanities to the user? Exit with the wrong exit code? Yeah, but that’s about it.
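For comparison with the seccomp lists, this is roughly what such a call looks like. It's a minimal sketch and not arping's actual source: the wrapper name drop_to_stdio() is made up, and the non-OpenBSD branch is only there so the snippet compiles elsewhere.

```c
#include <errno.h>
#include <unistd.h>

/* Returns 0 once the process is restricted to the "stdio" promise,
 * -1 with errno set otherwise (ENOSYS where pledge() does not exist). */
static int drop_to_stdio(void)
{
#ifdef __OpenBSD__
    /* One promise, no syscall list: "stdio" covers I/O on fds that are
     * already open, plus things like memory allocation and the clock. */
    return pledge("stdio", "");
#else
    errno = ENOSYS;
    return -1;
#endif
}
```

You'd call it once, after all sockets and files have been opened and before touching any untrusted input.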

But seccomp() allows more fine-grained restrictions. In arping I restricted the process so that it can only write to stdout and stderr, not read from them. But so what? I may have to eat my hat on this, but being able to read from stdout doesn’t sound like it’ll cause a security problem.

pledge(), and unveil(), are clearly the right solution here.

But what about Linux?

Maybe one day Landlock will be the thing. But considering the previous nightmare, with many generations of Linux solutions getting it wrong, I’m not holding my breath.

For now I guess unshare() is the way to go. But even that’s tricky (and doesn’t block as much). I’m planning a follow-up post about how to drop access to the outside world using available tools.