This syscall is a complement to pledge() and adds the same sort of
incremental relinquishing of capabilities for filesystem access.
The first call to unveil() will "drop a veil" on the process, and from
now on, only unveiled parts of the filesystem are visible to it.
Each call to unveil() specifies a path to either a directory or a file
along with permissions for that path. The permissions are a combination
of the following:
- r: Read access (like the "rpath" promise)
- w: Write access (like the "wpath" promise)
- x: Execute access
- c: Create/remove access (like the "cpath" promise)
Attempts to open a path that has not been unveiled with fail with
ENOENT. If the unveiled path lacks sufficient permissions, it will fail
with EACCES.
Like pledge(), subsequent calls to unveil() with the same path can only
remove permissions, not add them.
Once you call unveil(nullptr, nullptr), the veil is locked, and it's no
longer possible to unveil any more paths for the process, ever.
This concept comes from OpenBSD, and their implementation does various
things differently, I'm sure. This is just a first implementation for
SerenityOS, and we'll keep improving on it as we go. :^)
Background: DoubleBuffer is a handy buffer class in the kernel that
allows you to keep writing to it from the "outside" while the "inside"
reads from it. It's used for things like LocalSocket and TTY's.
Internally, it has a read buffer and a write buffer, but the two will
swap places when the read buffer is exhausted (by reading from it.)
Before this patch, it was internally implemented as two Vector<u8>
that we would swap between when the reader side had exhausted the data
in the read buffer. Now instead we preallocate a large KBuffer (64KB*2)
on DoubleBuffer construction and use that throughout its lifetime.
This removes all the kmalloc heap traffic caused by DoubleBuffers :^)
Before this, we would end up in memcpy() churn hell when a program was
doing repeated write() calls to a file in /tmp.
An even better solution will be to only grow the VM allocation of the
underlying buffer and keep using the same physical pages. This would
eliminate all the memcpy() work.
I've benchmarked this using g++ to compile Kernel/Process.cpp.
With these changes, compilation goes from ~35 sec to ~31 sec. :^)
Previously, VFS::open() would only use the passed flags for permission checking
purposes, and Process::sys$open() would set them on the created FileDescription
explicitly. Now, they should be set by VFS::open() on any files being opened,
including files that the kernel opens internally.
This also lets us get rid of the explicit check for whether or not the returned
FileDescription was a preopen fd, and in fact, fixes a bug where a read-only
preopen fd without any other flags would be considered freshly opened (due to
O_RDONLY being indistinguishable from 0) and granted a new set of flags.
As suggested by Joshua, this commit adds the 2-clause BSD license as a
comment block to the top of every source file.
For the first pass, I've just added myself for simplicity. I encourage
everyone to add themselves as copyright holders of any file they've
added or modified in some significant way. If I've added myself in
error somewhere, feel free to replace it with the appropriate copyright
holder instead.
Going forward, all new source files should include a license header.
Symlink resolution is now a virtual method on an inode,
Inode::resolve_as_symlink(). The default implementation just reads the stored
inode contents, treats them as a path and calls through to VFS::resolve_path().
This will let us support other, magical files that appear to be plain old
symlinks but resolve to something else. This is particularly useful for ProcFS.
It turns out we don't even need to store the whole custody chain, as we only
ever access its last element. So we can just store one custody. This also fixes
a performance FIXME :^)
Also, rename parent_custody to out_parent.
Also don't uncache inodes when they reach i_links_count==0 unless they
also have no ref counts other than the +1 from the inode cache.
This prevents the FS from deleting the on-disk inode too soon.
This makes the implementation easier to follow, but also fixes multiple issues
with the old implementation. In particular, it now deals properly with . and ..
in paths, including around mount points.
Hopefully there aren't many new bugs this introduces :^)
You can now bind-mount files and directories. This essentially exposes an
existing part of the file system in another place, and can be used as an
alternative to symlinks or hardlinks.
Here's an example of doing this:
# mkdir /tmp/foo
# mount /home/anon/myfile.txt /tmp/foo -o bind
# cat /tmp/foo
This is anon's file.
We now support these mount flags:
* MS_NODEV: disallow opening any devices from this file system
* MS_NOEXEC: disallow executing any executables from this file system
* MS_NOSUID: ignore set-user-id bits on executables from this file system
The fourth flag, MS_BIND, is defined, but currently ignored.
O_EXEC is mentioned by POSIX, so let's have it. Currently, it is only used
inside the kernel to ensure the process has the right permissions when opening
an executable.
At the moment, the actual flags are ignored, but we correctly propagate them all
the way from the original mount() syscall to each custody that resides on the
mounted FS.
No need to pass around RefPtr<>s and NonnullRefPtr<>s and no need to
heap-allocate them.
Also remove VFS::mount(NonnullRefPtr<FS>&&, StringView path) - it has been
unused for a long time.
In order to preserve the absolute path of the process root, we save the
custody used by chroot() before stripping it to become the new "/".
There's probably a better way to do this.
The chroot() syscall now allows the superuser to isolate a process into
a specific subtree of the filesystem. This is not strictly permanent,
as it is also possible for a superuser to break *out* of a chroot, but
it is a useful mechanism for isolating unprivileged processes.
The VFS now uses the current process's root_directory() as the root for
path resolution purposes. The root directory is stored as an uncached
Custody in the Process object.
As Sergey pointed out, it's silly to have proper entries for . and ..
in TmpFS when we can just synthesize them on the fly.
Note that we have to tolerate removal of . and .. via remove_child()
to keep VFS::rmdir() happy.
This encourages callers to strongly reference file descriptions while
working with them.
This fixes a use-after-free issue where one thread would close() an
open fd while another thread was blocked on it becoming readable.
Test: Kernel/uaf-close-while-blocked-in-read.cpp
Instead of using the FIFO's memory address as part of its absolute path
identity, just use an incrementing FIFO index instead.
Note that this is not used for anything other than debugging (it helps
you identify which file descriptors refer to the same FIFO by looking
at /proc/PID/fds
Supervisor Mode Access Prevention (SMAP) is an x86 CPU feature that
prevents the kernel from accessing userspace memory. With SMAP enabled,
trying to read/write a userspace memory address while in the kernel
will now generate a page fault.
Since it's sometimes necessary to read/write userspace memory, there
are two new instructions that quickly switch the protection on/off:
STAC (disables protection) and CLAC (enables protection.)
These are exposed in kernel code via the stac() and clac() helpers.
There's also a SmapDisabler RAII object that can be used to ensure
that you don't forget to re-enable protection before returning to
userspace code.
THis patch also adds copy_to_user(), copy_from_user() and memset_user()
which are the "correct" way of doing things. These functions allow us
to briefly disable protection for a specific purpose, and then turn it
back on immediately after it's done. Going forward all kernel code
should be moved to using these and all uses of SmapDisabler are to be
considered FIXME's.
Note that we're not realizing the full potential of this feature since
I've used SmapDisabler quite liberally in this initial bring-up patch.
This has been a FIXME for a long time. We now apply the provided
read/write permissions to the constructed FileDescription when opening
a File object via File::open().
We were running without the sticky bit and mode 777, which meant that
the /tmp directory was world-writable *without* protection.
With this fixed, it's no longer possible for everyone to steal root's
files in /tmp.
In order to ensure a specific owner and mode when the local socket
filesystem endpoint is instantiated, we need to be able to call
fchmod() and fchown() on a socket fd between socket() and bind().
This is because until we call bind(), there is no filesystem inode
for the socket yet.