This commit adds minimal support for compiler-instrumentation based
memory access sanitization.
Currently we only support detection of kmalloc redzone accesses, and
kmalloc use-after-free accesses.
Support for inline checks (for improved performance), and for stack
use-after-return and use-after-return detection is left for future PRs.
When writing to /sys/kernel/request_panic it will do a kernel panic.
Trying to truncate the node will result in kernel panic with a slightly
different message.
The name for this directory is a bit awkward. Also, the distinction of
constant information is not really valuable as I thought it would be, so
let's bring that information back into the /sys/kernel directory.
The name "variables" is a bit awkward and what the directory entries are
really about is kernel configuration so let's make it clear with the new
name.
Instead, use the FixedCharBuffer class to ensure we always use a static
buffer storage for these names. This ensures that if a Process or a
Thread were created, there's a guarantee that setting a new name will
never fail, as only copying of strings should be done to that static
storage.
The limits which are set are 32 characters for processes' names and 64
characters for thread names - this is because threads' names could be
more verbose than processes' names.
Since this is the block size that file system drivers *should* set,
let's name it the logical block size, just like most file systems such
as ext2 already do anyways.
For a long time, our shutdown procedure has basically been:
- Acquire big process lock.
- Switch framebuffer to Kernel debug console.
- Sync and lock all file systems so that disk caches are flushed and
files are in a good state.
- Use firmware and architecture-specific functionality to perform
hardware shutdown.
This naive and simple shutdown procedure has multiple issues:
- No processes are terminated properly, meaning they cannot perform more
complex cleanup work. If they were in the middle of I/O, for instance,
only the data that already reached the Kernel is written to disk, and
data corruption due to unfinished writes can therefore still occur.
- No file systems are unmounted, meaning that any important unmount work
will never happen. This is important for e.g. Ext2, which has
facilites for detecting improper unmounts (see superblock's s_state
variable) and therefore requires a proper unmount to be performed.
This was also the starting point for this PR, since I wanted to
introduce basic Ext2 file system checking and unmounting.
- No hardware is properly shut down beyond what the system firmware does
on its own.
- Shutdown is performed within the write() call that asked the Kernel to
change its power state. If the shutdown procedure takes longer (i.e.
when it's done properly), this blocks the process causing the shutdown
and prevents any potentially-useful interactions between Kernel and
userland during shutdown.
In essence, current shutdown is a glorified system crash with minimal
file system cleanliness guarantees.
Therefore, this commit is the first step in improving our shutdown
procedure. The new shutdown flow is now as follows:
- From the write() call to the power state SysFS node, a new task is
started, the Power State Switch Task. Its only purpose is to change
the operating system's power state. This task takes over shutdown and
reboot duties, although reboot is not modified in this commit.
- The Power State Switch Task assumes that userland has performed all
shutdown duties it can perform on its own. In particular, it assumes
that all kinds of clean process shutdown have been done, and remaining
processes can be hard-killed without consequence. This is an important
separation of concerns: While this commit does not modify userland, in
the future SystemServer will be responsible for performing proper
shutdown of user processes, including timeouts for stubborn processes
etc.
- As mentioned above, the task hard-kills remaining user processes.
- The task hard-kills all Kernel processes except itself and the
Finalizer Task. Since Kernel processes can delay their own shutdown
indefinitely if they want to, they have plenty opportunity to perform
proper shutdown if necessary. This may become a problem with
non-cooperative Kernel tasks, but as seen two commits earlier, for now
all tasks will cooperate within a few seconds.
- The task waits for the Finalizer Task to clean up all processes.
- The task hard-kills and finalizes the Finalizer Task itself, meaning
that it now is the only remaining process in the system.
- The task syncs and locks all file systems, and then unmounts them. Due
to an unknown refcount bug we currently cannot unmount the root file
system; therefore the task is able to abort the clean unmount if
necessary.
- The task performs platform-dependent hardware shutdown as before.
This commit has multiple remaining issues (or exposed existing ones)
which will need to be addressed in the future but are out of scope for
now:
- Unmounting the root filesystem is impossible due to remaining
references to the inodes /home and /home/anon. I investigated this
very heavily and could not find whoever is holding the last two
references.
- Userland cannot perform proper cleanup, since the Kernel's power state
variable is accessed directly by tools instead of a proper userland
shutdown procedure directed by SystemServer.
The recently introduced Firmware/PowerState procedures are removed
again, since all of the architecture-independent code can live in the
power state switch task. The architecture-specific code is kept,
however.
Instead of using ifdefs to use the correct platform-specific methods, we
can just use the same pattern we use for the microseconds_delay function
which has specific implementations for each Arch CPU subdirectory.
When linking a kernel image, the actual correct and platform-specific
power-state changing methods will be called in Firmware/PowerState.cpp
file.
All code that is related to PC BIOS should not be in the Kernel/Firmware
directory as this directory is for abstracted and platform-agnostic code
like ACPI (and device tree parsing in the future).
This fixes a problem with the aarch64 architecure, as these machines
don't have any PC-BIOS in them so actually trying to access these memory
locations (EBDA, BIOS ROM) does not make any sense, as they're specific
to x86 machines only.
Previously, reads would only be successful for offset 0. For this
reason, the maximum size that could be correctly read from the PCI
expansion ROM SysFS node was limited to the block size, and
subsequent blocks would fail. This commit fixes the computation of
the number of bytes to read.
Like the HID, Audio and Storage subsystem, the Graphics subsystem (which
handles GPUs technically) exposes unix device files (typically in /dev).
To ensure consistency across the repository, move all related files to a
new directory under Kernel/Devices called "GPU".
Also remove the redundant "GPU" word from the VirtIO driver directory,
and the word "Graphics" from GraphicsManagement.{h,cpp} filenames.
This has KString, KBuffer, DoubleBuffer, KBufferBuilder, IOWindow,
UserOrKernelBuffer and ScopedCritical classes being moved to the
Kernel/Library subdirectory.
Also, move the panic and assertions handling code to that directory.
The Storage subsystem, like the Audio and HID subsystems, exposes Unix
device files (for example, in the /dev directory). To ensure consistency
across the repository, we should make the Storage subsystem to reside in
the Kernel/Devices directory like the two other mentioned subsystems.
The Raspberry Pi hardware doesn't support a proper software-initiated
shutdown, so this instead uses the watchdog to reboot to a special
partition which the firmware interprets as an immediate halt on
shutdown. When running under Qemu, this causes the emulator to exit.
`process.fds()` is protected by a Mutex, which causes issues when we try
to acquire it while holding a Spinlock. Since nothing seems to use this
value, let's just remove it entirely for now.
To do this we also need to get rid of LockRefPtrs in the USB code as
well.
Most of the SysFS nodes are statically generated during boot and are not
mutated afterwards.
The same goes for general device code - once we generate the appropriate
SysFS nodes, we almost never mutate the node pointers afterwards, making
locking unnecessary.
These were easy to pick-up as these pointers are assigned during the
construction point and are never changed afterwards.
This small change to these pointers will ensure that our code will not
accidentally assign these pointers with a new object which is always a
kind of bug we will want to prevent.
This is done with 2 major steps:
1. Remove JailManagement singleton and use a structure that resembles
what we have with the Process object. This is required later for the
second step in this commit, but on its own, is a major change that
removes this clunky singleton that had no real usage by itself.
2. Use IntrusiveLists to keep references to Process objects in the same
Jail so it will be much more straightforward to iterate on this kind
of objects when needed. Previously we locked the entire Process list
and we did a simple pointer comparison to check if the checked
Process we iterate on is in the same Jail or not, which required
taking multiple Spinlocks in a very clumsy and heavyweight way.
This subdirectory is meant to hold all constant data related to the
kernel. This means that this data is never meant to updated and is
relevant from system boot to system shutdown.
Move the inodes of "load_base", "cmdline" and "system_mode" to that
directory. All nodes under this new subdirectory are generated during
boot, and therefore don't require calling kmalloc each time we need to
read them. Locking is also not necessary, because these nodes and their
data are completely static once being generated.
This replaces manually grabbing the thread's main lock.
This lets us remove the `get_thread_name` and `set_thread_name` syscalls
from the big lock. :^)
For each exposed PCI device in sysfs, there's a new node called "rom"
and by reading it, it exposes the raw data of a PCI option ROM blob to
a user for examining the blob.
There are now 2 separate classes for almost the same object type:
- EnumerableDeviceIdentifier, which is used in the enumeration code for
all PCI host controller classes. This is allowed to be moved and
copied, as it doesn't support ref-counting.
- DeviceIdentifier, which inherits from EnumerableDeviceIdentifier. This
class uses ref-counting, and is not allowed to be copied. It has a
spinlock member in its structure to allow safely executing complicated
IO sequences on a PCI device and its space configuration.
There's a static method that allows a quick conversion from
EnumerableDeviceIdentifier to DeviceIdentifier while creating a
NonnullRefPtr out of it.
The reason for doing this is for the sake of integrity and reliablity of
the system in 2 places:
- Ensure that "complicated" tasks that rely on manipulating PCI device
registers are done in a safe manner. For example, determining a PCI
BAR space size requires multiple read and writes to the same register,
and if another CPU tries to do something else with our selected
register, then the result will be a catastrophe.
- Allow the PCI API to have a united form around a shared object which
actually holds much more data than the PCI::Address structure. This is
fundamental if we want to do certain types of optimizations, and be
able to support more features of the PCI bus in the foreseeable
future.
This patch already has several implications:
- All PCI::Device(s) hold a reference to a DeviceIdentifier structure
being given originally from the PCI::Access singleton. This means that
all instances of DeviceIdentifier structures are located in one place,
and all references are pointing to that location. This ensures that
locking the operation spinlock will take effect in all the appropriate
places.
- We no longer support adding PCI host controllers and then immediately
allow for enumerating it with a lambda function. It was found that
this method is extremely broken and too much complicated to work
reliably with the new paradigm being introduced in this patch. This
means that for Volume Management Devices (Intel VMD devices), we
simply first enumerate the PCI bus for such devices in the storage
code, and if we find a device, we attach it in the PCI::Access method
which will scan for devices behind that bridge and will add new
DeviceIdentifier(s) objects to its internal Vector. Afterwards, we
just continue as usual with scanning for actual storage controllers,
so we will find a corresponding NVMe controllers if there were any
behind that VMD bridge.
We really don't want callers of this function to accidentally change
the jail, or even worse - remove the Process from an attached jail.
To ensure this never happens, we can just declare this method as const
so nobody can mutate it this way.
Use this helper function in various places to replace the old code of
acquiring the SpinlockProtected<RefPtr<Jail>> of a Process to do that
validation.
Only do so after a brief check if we are in a Jail or not. This fixes
SMP, because apparently it is crashing when calling try_generate()
from the SysFSGlobalInformation::refresh_data method, so the fix for
this is to simply not do that inside the Process' Jail spinlock scope,
because otherwise we will simply have a possible flow of taking
multiple conflicting Spinlocks (in the wrong order multiple times), for
the SysFSOverallProcesses generation code:
Process::current().jail(), and then Process::for_each_in_same_jail being
called, we take Process::all_instances(), and Process::current().jail()
again.
Therefore, we should at the very least eliminate the first taking of the
Process::current().jail() spinlock, in the refresh_data method of the
SysFSGlobalInformation class.
These instances were detected by searching for files that include
Kernel/Debug.h, but don't match the regex:
\\bdbgln_if\(|_DEBUG\\b
This regex is pessimistic, so there might be more files that don't check
for any real *_DEBUG macro. There seem to be no corner cases anyway.
In theory, one might use LibCPP to detect things like this
automatically, but let's do this one step after another.
This step would ideally not have been necessary (increases amount of
refactoring and templates necessary, which in turn increases build
times), but it gives us a couple of nice properties:
- SpinlockProtected inside Singleton (a very common combination) can now
obtain any lock rank just via the template parameter. It was not
previously possible to do this with SingletonInstanceCreator magic.
- SpinlockProtected's lock rank is now mandatory; this is the majority
of cases and allows us to see where we're still missing proper ranks.
- The type already informs us what lock rank a lock has, which aids code
readability and (possibly, if gdb cooperates) lock mismatch debugging.
- The rank of a lock can no longer be dynamic, which is not something we
wanted in the first place (or made use of). Locks randomly changing
their rank sounds like a disaster waiting to happen.
- In some places, we might be able to statically check that locks are
taken in the right order (with the right lock rank checking
implementation) as rank information is fully statically known.
This refactoring even more exposes the fact that Mutex has no lock rank
capabilites, which is not fixed here.
Instead, allow userspace to decide on the coredump directory path. By
default, SystemServer sets it to the /tmp/coredump directory, but users
can now change this by writing a new path to the sysfs node at
/sys/kernel/variables/coredump_directory, and also to read this node to
check where coredumps are currently generated at.