Previously we would crash the process immediately when a promise
violation was found during a syscall. This is error prone, as we
don't unwind the stack. This means that in certain cases we can
leak resources, like an OwnPtr / RefPtr tracked on the stack. Or
even leak a lock acquired in a ScopeLockLocker.
To remedy this situation we move the promise violation handling to
the syscall handler, right before we return to user space. This
allows the code to follow the normal unwind path, and grantees
there is no longer any cleanup that needs to occur.
The Process::require_promise() and Process::require_no_promises()
functions were modified to return ErrorOr<void> so we enforce that
the errors are always propagated by the caller.
This change lays the foundation for making the require_promise return
an error hand handling the process abort outside of the syscall
implementations, to avoid cases where we would leak resources.
It also has the advantage that it makes removes a gs pointer read
to look up the current thread, then process for every syscall. We
can instead go through the Process this pointer in most cases.
As a small cleanup, this also makes `page_round_up` verify its
precondition with `page_round_up_would_wrap` (which callers are expected
to call), rather than having its own logic.
Fixes#11297.
Most other syscalls pass address arguments as `void*` instead of
`uintptr_t`, so let's do that here too. Besides improving consistency,
this commit makes `strace` correctly pretty-print these arguments in
hex.
This feature was introduced in version 4.17 of the Linux kernel, and
while it's not specified by POSIX, I think it will be a nice addition to
our system.
MAP_FIXED_NOREPLACE provides a less error-prone alternative to
MAP_FIXED: while regular fixed mappings would cause any intersecting
ranges to be unmapped, MAP_FIXED_NOREPLACE returns EEXIST instead. This
ensures that we don't corrupt our process's address space if something
is already at the requested address.
Note that the more portable way to do this is to use regular
MAP_ANONYMOUS, and check afterwards whether the returned address matches
what we wanted. This, however, has a large performance impact on
programs like Wine which try to reserve large portions of the address
space at once, as the non-matching addresses have to be unmapped
separately.
Since we're iterating over multiple regions that interesect with the
requested range, just one of them having the requested access flags
is not enough to finish the syscall early.
The advices are almost always exclusive of one another, and while POSIX
does not define madvise, most other unix-like and *BSD systems also only
accept a singular value per call.
This allows userspace to trigger a full (FIXME) flush of a shared file
mapping to disk. We iterate over all the mapped pages in the VMObject
and write them out to the underlying inode, one by one. This is rather
naive, and there's lots of room for improvement.
Note that shared file mappings are currently not possible since mmap()
returns ENOTSUP for PROT_WRITE+MAP_SHARED. That restriction will be
removed in a subsequent commit. :^)
This feels like it was a refactor transition kind of conversion. The
places that were relying on it can easily be changed to explicitly ask
for the ptr() or a new vaddr() method on Userspace<T*>.
FlatPtr can still implicitly convert to Userspace<T> because the
constructor is not explicit, but there's quite a few more places that
are relying on that conversion.
We now use AK::Error and AK::ErrorOr<T> in both kernel and userspace!
This was a slightly tedious refactoring that took a long time, so it's
not unlikely that some bugs crept in.
Nevertheless, it does pass basic functionality testing, and it's just
real nice to finally see the same pattern in all contexts. :^)
Instead of checking it at every call site (to generate EBADF), we make
file_description(fd) return a KResultOr<NonnullRefPtr<FileDescription>>.
This allows us to wrap all the calls in TRY(). :^)
The only place that got a little bit messier from this is sys$mount(),
and there's a whole bunch of things there in need of cleanup.
There's no need to disable interrupts when trying to access an inode's
shared vmobject. Additionally, Inode::shared_vmobject() acquires a Mutex
which is a recipe for deadlock if acquired with interrupts disabled.
We have seen cases where the map fails, but we return the region
to the caller, causing them to page fault later on when they touch
the region.
The fix is to always observe the return code of map/remap.
This makes for nicer handling of errors compared to checking whether a
RefPtr is null. Additionally, this will give way to return different
types of errors in the future.
...and also RangeAllocator => VirtualRangeAllocator.
This clarifies that the ranges we're dealing with are *virtual* memory
ranges and not anything else.
AnonymousVMObject::set_volatile() assumes that nobody ever calls it on
non-purgeable objects, so let's make sure we don't do that.
Also return EINVAL instead of EPERM for non-anonymous VM objects so the
error codes match.
This patch changes the semantics of purgeable memory.
- AnonymousVMObject now has a "purgeable" flag. It can only be set when
constructing the object. (Previously, all anonymous memory was
effectively purgeable.)
- AnonymousVMObject now has a "volatile" flag. It covers the entire
range of physical pages. (Previously, we tracked ranges of volatile
pages, effectively making it a page-level concept.)
- Non-volatile objects maintain a physical page reservation via the
committed pages mechanism, to ensure full coverage for page faults.
- When an object is made volatile, it relinquishes any unused committed
pages immediately. If later made non-volatile again, we then attempt
to make a new committed pages reservation. If this fails, we return
ENOMEM to userspace.
mmap() now creates purgeable objects if passed the MAP_PURGEABLE option
together with MAP_ANONYMOUS. anon_create() memory is always purgeable.
Before we start disabling acquisition of the big process lock for
specific syscalls, make sure to document and assert that all the
lock is held during all syscalls.