Since we know for sure that the virtual memory regions in the new
process being created are not being used on any CPU, there's no need
to do TLB flushes for every mapped page.
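
To illustrate the idea, here's a minimal sketch with hypothetical names
(ShouldFlushTLB, map_page and friends are stand-ins, not the kernel's real
API): callers populating a brand-new address space can simply opt out of the
per-page flush.

    #include <AK/Types.h>

    enum class ShouldFlushTLB {
        No,
        Yes,
    };

    // Stand-ins for the real page table plumbing:
    static void write_page_table_entry(FlatPtr, FlatPtr) { }
    static void invalidate_tlb_entry(FlatPtr) { /* e.g. invlpg on x86 */ }

    static void map_page(FlatPtr vaddr, FlatPtr paddr, ShouldFlushTLB flush)
    {
        write_page_table_entry(vaddr, paddr);
        if (flush == ShouldFlushTLB::Yes)
            invalidate_tlb_entry(vaddr);
    }

    static void map_into_new_address_space(FlatPtr vaddr, FlatPtr paddr, size_t page_count)
    {
        // The new process' page tables aren't active on any CPU yet,
        // so there are no stale TLB entries to invalidate.
        for (size_t i = 0; i < page_count; ++i)
            map_page(vaddr + i * 4096, paddr + i * 4096, ShouldFlushTLB::No);
    }
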
Dynamic Vector allocations in sys$select() were showing up in the
full-system profile. Since there will never be more than FD_SETSIZE
file descriptors to worry about, we can confidently give this Vector
enough inline capacity that it never has to kmalloc.
To compensate for the increased stack usage, reduce the size of the
FDInfo struct while we're here. :^)
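
A minimal sketch of the Vector change, assuming AK::Vector's inline-capacity
template parameter (FDInfo and the capacity constant here are simplified
stand-ins for the real kernel types):

    #include <AK/Types.h>
    #include <AK/Vector.h>

    static constexpr size_t max_selectable_fds = 64; // stand-in for FD_SETSIZE

    struct FDInfo {
        int fd { -1 };
        u8 block_flags { 0 }; // kept small to offset the extra stack usage
    };

    void collect_fds_for_select()
    {
        // Inline storage: appending up to max_selectable_fds entries never
        // touches kmalloc.
        Vector<FDInfo, max_selectable_fds> fds;
        fds.append({ 0, 0 });
    }
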
The perfcore file format was previously limited to a single process
since the pid/executable/regions data was top-level in the JSON.
This patch moves the process-specific data into a top-level array
named "processes" and we now add entries for each process that has
been sampled during the profile run.
This makes it possible to see samples from multiple threads when
viewing a perfcore file with Profiler. This is extremely cool! :^)
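
Roughly, the new layout looks like this when built with AK's JSON classes
(the "events" field name and the values are illustrative assumptions):

    #include <AK/JsonArray.h>
    #include <AK/JsonObject.h>

    JsonObject build_example_perfcore()
    {
        JsonObject process;
        process.set("pid", 42);
        process.set("executable", "/bin/SomeProgram");
        process.set("regions", JsonArray());

        JsonArray processes;
        processes.append(process);

        JsonObject root;
        root.set("processes", processes); // previously pid/executable/regions sat at the top level
        root.set("events", JsonArray());
        return root;
    }
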
The superuser can now call sys$profiling_enable() with PID -1 to enable
profiling of all running threads in the system. The perf events are
collected in a global PerformanceEventBuffer (currently 32 MiB in size).
The events can be accessed via /proc/profile.
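
A hedged usage sketch; the LibC wrapper names (profiling_enable /
profiling_disable) and their exact signatures are assumptions on top of
what this commit describes:

    #include <serenity.h>
    #include <stdio.h>

    int main()
    {
        // PID -1: profile every running thread in the system (superuser only).
        if (profiling_enable(-1) < 0) {
            perror("profiling_enable");
            return 1;
        }

        // ... let the interesting workload run for a while ...

        profiling_disable(-1);

        // The collected events can now be read from /proc/profile.
        return 0;
    }
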
If we can't allocate a PerformanceEventBuffer to store the profiling
events, we now fail sys$profiling_enable() and sys$perf_event()
with ENOMEM instead of carrying on with a broken buffer.
I don't dare touch the multi-threading logic and locking mechanism, so it stays
timespec for now. However, this could and should be changed to AK::Time, and I
bet it will simplify the "increment_time_since_boot()" code.
This commit is very invasive, because Thread likes to take a pointer and write
to it. This means that translating between timespec/timeval/Time would have been
more difficult than just changing everything that hands a raw pointer to Thread,
in bulk.
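
For reference, a hedged sketch of what the boundary could look like after
such a change, assuming AK::Time's from_timespec()/to_timespec() helpers;
Thread would keep handing out raw timespec values while the arithmetic
moves into Time:

    #include <AK/Time.h>
    #include <time.h>

    timespec advance_time_since_boot(const timespec& current, const timespec& delta)
    {
        auto sum = Time::from_timespec(current) + Time::from_timespec(delta);
        return sum.to_timespec();
    }
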
fuzz-syscalls found a bunch of unaligned accesses into struct sigaction
via this syscall. This patch fixes that issue by porting the syscall
to Userspace<T>, which we should have done anyway. :^)
Fixes #5500.
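
The core of the pattern looks roughly like this (read_user_sigaction is a
hypothetical helper, and the errno plumbing around it is omitted): copy
through the checked helpers into a properly aligned kernel-stack copy
instead of dereferencing the raw user pointer.

    // 'act' lives on the kernel stack with natural alignment, so no
    // unaligned loads of the user's struct sigaction can happen here.
    bool read_user_sigaction(Userspace<const sigaction*> user_act, sigaction& out)
    {
        sigaction act {};
        if (!copy_from_user(&act, user_act))
            return false; // the caller turns this into EFAULT
        out = act;
        return true;
    }
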
This was another vestige from a long time ago, when exiting a thread
would mutate global data structures that were only protected by the
interrupt flag.
This was necessary in the past when crash handling would modify
various global things, but all that stuff is long gone so we can
simplify crashes by leaving the interrupt flag alone.
Make more of the kernel compile in 64-bit mode, and make some things
pointer-size-agnostic (by using FlatPtr).
There's a lot of work to do here before the kernel will even compile.
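
A small illustration of what "pointer-size-agnostic" means in practice:
FlatPtr is an unsigned integer as wide as a pointer (u32 on i686, u64 on
x86_64), so address arithmetic written against it compiles unchanged in
both builds.

    #include <AK/Types.h>

    void* add_page_offset(void* ptr)
    {
        FlatPtr address = reinterpret_cast<FlatPtr>(ptr);
        address += 0x1000; // plain integer math, no assumptions about pointer width
        return reinterpret_cast<void*>(address);
    }
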
Instead of writing to the userspace utsname struct one field at a time,
build up a utsname on the kernel stack and copy it out to userspace
once it's finished. This is both simpler and gets validity checking
built-in for free.
Found by KUBSAN! :^)
Fixes #5499.
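
Roughly the shape of the new code (a simplified sketch; the exact string
helpers and error-return plumbing are assumptions):

    utsname buffer {};
    strlcpy(buffer.sysname, "SerenityOS", sizeof(buffer.sysname));
    // ... fill in nodename, release, version and machine the same way ...

    // One checked copy at the end instead of a separate write per field:
    if (!copy_to_user(user_buf, &buffer))
        return EFAULT;
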
(...and ASSERT_NOT_REACHED => VERIFY_NOT_REACHED)
Since all of these checks are done in release builds as well,
let's rename them to VERIFY to prevent confusion, as everyone is
used to assertions being compiled out in release.
We can introduce a new ASSERT macro that is specifically for debug
checks, but I'm doing this wholesale conversion first since we've
accumulated thousands of these already, and it's not immediately
obvious which ones are suitable for ASSERT.
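
For example (using AK/Assertions.h as it looks after the rename):

    #include <AK/Assertions.h>

    void handle_mode(int mode)
    {
        VERIFY(mode >= 0); // enforced in release builds too, hence the new name

        switch (mode) {
        case 0:
            // ...
            break;
        default:
            VERIFY_NOT_REACHED(); // replaces ASSERT_NOT_REACHED()
        }
    }
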
- We've already computed the number of fds * sizeof(pollfd), so use it
instead of needlessly computing it again.
- Use fds_copy.data() instead of taking the address of an indexed element
of the vector.
The `default_signal_action(u8 signal)` function already has the
full mapping. The only caveat is that the thread constructor and the
clear_signals() method now need to do the work of resetting the
m_signal_action_data array themselves, instead of relying on the previous
logic in set_default_signal_dispositions.
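
A minimal sketch of what that means, with stand-in types and a hypothetical
reset_signal_actions() helper (the real member layout differs):

    #include <AK/Array.h>
    #include <AK/Types.h>

    struct SignalActionData {
        FlatPtr handler_or_sigaction { 0 };
        int flags { 0 };
    };

    struct Thread {
        Thread() { reset_signal_actions(); }

        void clear_signals()
        {
            m_pending_signals = 0;
            reset_signal_actions();
        }

        void reset_signal_actions()
        {
            for (auto& action : m_signal_action_data)
                action = {};
        }

        u32 m_pending_signals { 0 };
        Array<SignalActionData, 32> m_signal_action_data;
    };
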
This is a new promise that guards access to mmap() with MAP_FIXED.
Fixed-address mappings are rarely used, but can be useful if you are
trying to groom the process address space for malicious purposes.
None of our programs need this at the moment, as the only user of
MAP_FIXED is DynamicLoader, but the fixed mappings are constructed
before the process has had a chance to pledge anything.
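
A hedged usage sketch; the promise spelling ("map_fixed") and the mapping
address are assumptions, the point is just that MAP_FIXED now sits behind
its own pledge:

    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main()
    {
        if (pledge("stdio map_fixed", nullptr) < 0) {
            perror("pledge");
            return 1;
        }

        // Only allowed because we pledged the fixed-mapping promise above;
        // without it, the kernel refuses MAP_FIXED requests.
        void* ptr = mmap((void*)0x20000000, 4096, PROT_READ | PROT_WRITE,
            MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
        if (ptr == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        return 0;
    }
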
If we have a tracer process waiting for us to exec, we need to release
the ptrace lock before stopping ourselves, since otherwise the tracer
will block forever on the lock.
Fixes #5409.
sys$mmap() and related syscalls must round the base address of the
specified range down to the nearest page boundary *and* round its end
address up to the nearest page boundary. Since we have to do this in many
places, let's make a helper.
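
Something along these lines (the helper name and exact shape here are a
sketch, not necessarily what landed):

    #include <AK/Types.h>

    static constexpr FlatPtr page_mask = ~FlatPtr(0xfff); // 4 KiB pages

    struct PageAlignedRange {
        FlatPtr base { 0 };
        size_t size { 0 };
    };

    // Round the base down and the end up, so the result fully covers the
    // caller's (address, size) request.
    static PageAlignedRange expand_range_to_page_boundaries(FlatPtr address, size_t size)
    {
        FlatPtr base = address & page_mask;
        FlatPtr end = (address + size + 0xfff) & page_mask;
        return { base, static_cast<size_t>(end - base) };
    }
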
This bug is a good example of why copy-paste code should eventually be
eliminated from the code base: Apparently the code was copied from read.cpp
before c6027ed7cc, so the same bug got introduced here.
To recap: A malicious program can ask the Kernel to prepare a vectored I/O
syscall with a huge number of iovecs. The Kernel must first copy all the
vector locations into 'vecs', and to hold them it first allocates an
arbitrary amount of memory:
vecs.resize(iov_count);
This can cause Kernel memory exhaustion, triggered by any malicious userland
program.
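
The fix boils down to rejecting unreasonable counts before allocating
anything on their behalf; the exact limit and error code below are
assumptions, not quoted from the actual change:

    if (iov_count < 0 || iov_count > IOV_MAX)
        return EINVAL;
    // Only now is it safe to size 'vecs' based on a user-controlled count:
    vecs.resize(iov_count);
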
This patch adds a random offset between 0 and 4096 to the initial
stack pointer in new processes. Since the stack has to be 16-byte
aligned, the bottom bits can't be randomized.
Yet another thing to make things less predictable. :^)
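
A minimal sketch of the math ('random_value' stands in for a value from
the kernel's RNG):

    #include <AK/Types.h>

    static FlatPtr randomize_initial_stack_pointer(FlatPtr stack_top, u32 random_value)
    {
        FlatPtr pointer = stack_top - (random_value % 4096);
        return pointer & ~FlatPtr(0xf); // the low 4 bits must stay 0 for 16-byte alignment
    }
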
We were doing stack and syscall-origin region validations before
taking the big process lock. There was a window of time where those
regions could then be unmapped/remapped by another thread before we
proceed with our syscall.
This patch closes that window, and makes sys$get_stack_bounds() rely
on the fact that we now know the userspace stack pointer to be valid.
Thanks to @BenWiederhake for spotting this! :^)
If we try to align a number above 0xfffff000 to the next multiple of
the page size (4 KiB), it would wrap around to 0. This is most likely
never what we want, so let's assert if that happens.
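
A minimal sketch of the check (constants simplified, 4 KiB pages assumed):

    #include <AK/Assertions.h>
    #include <AK/Types.h>

    inline FlatPtr page_round_up(FlatPtr value)
    {
        FlatPtr rounded = (value + 0xfff) & ~FlatPtr(0xfff);
        // With a 32-bit FlatPtr, anything above 0xfffff000 wraps to 0 here.
        VERIFY(value == 0 || rounded != 0);
        return rounded;
    }
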