This patch begins the work of sharing types and macros between Kernel
and LibC instead of duplicating them via the kludge in UnixTypes.h.
The basic idea is that the Kernel vends various POSIX headers via
Kernel/API/POSIX/ and LibC simply #include's them to get the macros.
Since the InodeIndex encapsulates a 64 bit value, it is correct to
ensure that the Kernel is exposing the entire value and the LibC is
aware of it.
This commit requires an entire re-compile because it's essentially a
change in the Kernel ABI, together with a corresponding change in LibC.
This allows tracing the syscalls made by a thread through the kernel's
performance event framework, which is similar in principle to strace.
Currently, this merely logs a stack backtrace to the current thread's
performance event buffer whenever a syscall is made, if profiling is
enabled. Future improvements could include tracing the arguments and
the return value, for example.
This patch changes the semantics of purgeable memory.
- AnonymousVMObject now has a "purgeable" flag. It can only be set when
constructing the object. (Previously, all anonymous memory was
effectively purgeable.)
- AnonymousVMObject now has a "volatile" flag. It covers the entire
range of physical pages. (Previously, we tracked ranges of volatile
pages, effectively making it a page-level concept.)
- Non-volatile objects maintain a physical page reservation via the
committed pages mechanism, to ensure full coverage for page faults.
- When an object is made volatile, it relinquishes any unused committed
pages immediately. If later made non-volatile again, we then attempt
to make a new committed pages reservation. If this fails, we return
ENOMEM to userspace.
mmap() now creates purgeable objects if passed the MAP_PURGEABLE option
together with MAP_ANONYMOUS. anon_create() memory is always purgeable.
Advisory locks don't actually prevent other processes from writing to
the file, but they do prevent other processes looking to acquire and
advisory lock on the file.
This implementation currently only adds non-blocking locks, which are
all I need for now.
Hook the kernel page fault handler and capture page fault events when
the fault has a current thread attached in TLS. We capture the eip and
ebp so we can unwind the stack and locate which pieces of code are
generating the most page faults.
Co-authored-by: Gunnar Beutner <gbeutner@serenityos.org>
The function fstatat can do the same thing as the stat and lstat
functions. However, it can be passed the file descriptor of a directory
which will be used when as the starting point for relative paths. This
is contrary to stat and lstat which use the current working directory as
the starting for relative paths.
An IP socket can now join a multicast group by using the
IP_ADD_MEMBERSHIP sockopt, which will cause it to start receiving
packets sent to the multicast address, even though this address does
not belong to this host.
This commit will add MSG_PEEK support, which allows a package to be
seen without taking it from the buffer, so that a subsequent recv()
without the MSG_PEEK flag can pick it up.
This turns the perfcore format into more a log than it was before,
which lets us properly log process, thread and region
creation/destruction. This also makes it unnecessary to dump the
process' regions every time it is scheduled like we did before.
Incidentally this also fixes 'profile -c' because we previously ended
up incorrectly dumping the parent's region map into the profile data.
Log-based mmap support enables profiling shared libraries which
are loaded at runtime, e.g. via dlopen().
This enables profiling both the parent and child process for
programs which use execve(). Previously we'd discard the profiling
data for the old process.
The Profiler tool has been updated to not treat thread IDs as
process IDs anymore. This enables support for processes with more
than one thread. Also, there's a new widget to filter which
process should be displayed.
SPDX License Identifiers are a more compact / standardized
way of representing file license information.
See: https://spdx.dev/resources/use/#identifiers
This was done with the `ambr` search and replace tool.
ambr --no-parent-ignore --key-from-file --rep-from-file key.txt rep.txt *
This adds PT_PEEKDEBUG and PT_POKEDEBUG to allow for reading/writing
the debug registers, and updates the Kernel's debug handler to read the
new information from the debug status register.
Mostly due to the fact that clang-format allows aligned comments via
AlignTrailingComments.
We could also use raw string literals in inline asm, which clang-format
deals with properly (and would be nicer in a lot of places).
This prevented from dmidecode to get the right PAGE_SIZE when using the
sysconf syscall.
I found this bug, when I tried to figure why dmidecode fails to mmap
/dev/mem when I passed --no-procfs, and the conclusion is that it tried
to mmap unaligned physical address 0xf5ae0 (SMBIOS data), and that was
caused by a wrong value returned after using the sysconf syscall to get
the plaform page size, therefore, allowing to send an unaligned address
to the mmap syscall.
This can be used to request random VM placement instead of the highly
predictable regular mmap(nullptr, ...) VM allocation strategy.
It will soon be used to implement ASLR in the dynamic loader. :^)
This adds support for FUTEX_WAKE_OP, FUTEX_WAIT_BITSET, FUTEX_WAKE_BITSET,
FUTEX_REQUEUE, and FUTEX_CMP_REQUEUE, as well well as global and private
futex and absolute/relative timeouts against the appropriate clock. This
also changes the implementation so that kernel resources are only used when
a thread is blocked on a futex.
Global futexes are implemented as offsets in VMObjects, so that different
processes can share a futex against the same VMObject despite potentially
being mapped at different virtual addresses.
This patch merges the profiling functionality in the kernel with the
performance events mechanism. A profiler sample is now just another
perf event, rather than a dedicated thing.
Since perf events were already per-process, this now makes profiling
per-process as well.
Processes with perf events would already write out a perfcore.PID file
to the current directory on death, but since we may want to profile
a process and then let it continue running, recorded perf events can
now be accessed at any time via /proc/PID/perf_events.
This patch also adds information about process memory regions to the
perfcore JSON format. This removes the need to supply a core dump to
the Profiler app for symbolication, and so the "profiler coredump"
mechanism is removed entirely.
There's still a hard limit of 4MB worth of perf events per process,
so this is by no means a perfect final design, but it's a nice step
forward for both simplicity and stability.
Fixes#4848Fixes#4849
This brings mmap more in line with other operating systems. Prior to
this, it was impossible to request memory that was definitely committed,
instead MAP_PURGEABLE would provide a region that was not actually
purgeable, but also not fully committed, which meant that using such memory
still could cause crashes when the underlying pages could no longer be
allocated.
This fixes some random crashes in low-memory situations where non-volatile
memory is mapped (e.g. malloc, tls, Gfx::Bitmap, etc) but when a page in
these regions is first accessed, there is insufficient physical memory
available to commit a new page.
Rather than lazily committing regions by default, we now commit
the entire region unless MAP_NORESERVE is specified.
This solves random crashes in low-memory situations where e.g. the
malloc heap allocated memory, but using pages that haven't been
used before triggers a crash when no more physical memory is available.
Use this flag to create large regions without actually committing
the backing memory. madvise() can be used to commit arbitrary areas
of such regions after creating them.
Compared to version 10 this fixes a bunch of formatting issues, mostly
around structs/classes with attributes like [[gnu::packed]], and
incorrect insertion of spaces in parameter types ("T &"/"T &&").
I also removed a bunch of // clang-format off/on and FIXME comments that
are no longer relevant - on the other hand it tried to destroy a couple of
neatly formatted comments, so I had to add some as well.
This implements a number of changes related to time:
* If a HPET is present, it is now used only as a system timer, unless
the Local APIC timer is used (in which case the HPET timer will not
trigger any interrupts at all).
* If a HPET is present, the current time can now be as accurate as the
chip can be, independently from the system timer. We now query the
HPET main counter for the current time in CPU #0's system timer
interrupt, and use that as a base line. If a high precision time is
queried, that base line is used in combination with quering the HPET
timer directly, which should give a much more accurate time stamp at
the expense of more overhead. For faster time stamps, the more coarse
value based on the last interrupt will be returned. This also means
that any missed interrupts should not cause the time to drift.
* The default system interrupt rate is reduced to about 250 per second.
* Fix calculation of Thread CPU usage by using the amount of ticks they
used rather than the number of times a context switch happened.
* Implement CLOCK_REALTIME_COARSE and CLOCK_MONOTONIC_COARSE and use it
for most cases where precise timestamps are not needed.