GCC 11 added a new option, `-fzero-call-used-regs`, which causes the
compiler to zero call-used (caller-saved) registers on function return.
The goal is to reduce the possible attack surface by disarming ROP
gadgets that might be useful to attackers, and to reduce the risk of
information leaks via stale register data. You can find the GCC commit
below[0].
This is a mitigation I noticed on the Linux KSPP issue tracker[1] and
thought would be useful for the SerenityOS Kernel as well.
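GCC also exposes the same behavior as a per-function attribute. As a
hedged illustration (the "used-gpr" argument is one of several variants
the flag accepts, and the function here is a made-up example; the kernel
build would enable this globally via compiler flags rather than per
function):

// GCC 11+: zero the used general-purpose registers when this function returns.
__attribute__((zero_call_used_regs("used-gpr")))
int handle_untrusted_input(int value)
{
    return value * 2;
}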
The reduction in ROP gadgets is observable using the ROPgadget utility:
$ ROPgadget --nosys --nojop --binary Kernel | tail -n1
Unique gadgets found: 42754
$ ROPgadget --nosys --nojop --binary Kernel.RegZeroing | tail -n1
Unique gadgets found: 41238
The size difference for the i686 Kernel binary is negligible:
$ size Kernel Kernel.RegZeroing
text data bss dec hex filename
13253648 7729637 6302360 27285645 1a0588d Kernel
13277504 7729637 6302360 27309501 1a0b5bd Kernel.RegZeroing
We don't have any great workloads to measure regressions in Kernel
performance, but Kees Cook mentioned he measured only around a 1%
performance regression with this enabled on his Linux kernel build.[2]
References:
[0] d10f3e900b
[1] https://github.com/KSPP/linux/issues/84
[2] https://lore.kernel.org/lkml/20210714220129.844345-1-keescook@chromium.org/
This patch greatly simplifies VMObject locking by doing two things:
1. Giving VMObject an IntrusiveList of all its mapping Region objects.
2. Removing VMObject::m_paging_lock in favor of VMObject::m_lock
Before (1), VMObject::for_each_region() was forced to acquire the
global MM lock (since it worked by walking MemoryManager's list of
all regions and checking for regions that pointed back to the VMObject.)
With each VMObject having its own list of Regions, VMObject's own
m_lock is all we need.
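A minimal sketch of the shape this enables, assuming m_regions is the
new per-VMObject IntrusiveList and m_lock is VMObject's own lock (the
real SerenityOS signatures may differ):

template<typename Callback>
void VMObject::for_each_region(Callback callback)
{
    ScopedSpinLock lock(m_lock);
    for (auto& region : m_regions)
        callback(region);
}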
Before (2), page fault handlers used a separate mutex for preventing
overlapping work. This design required multiple temporary unlocks
and was generally extremely hard to reason about.
Instead, page fault handlers now use VMObject's own m_lock as well.
Despite what the declaration would have us believe, these are not "u8*".
If they were, we wouldn't have to use the & operator to get their
address and then cast them to "u8*"/FlatPtr afterwards.
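To illustrate the distinction with a made-up symbol name: a
linker-script symbol is itself an address, so an array declaration
matches reality, while "extern u8*" would imply a pointer object stored
at that address:

// The symbol's *address* is the interesting value; there is no u8* object.
extern "C" u8 start_of_some_section[]; // hypothetical linker symbol

FlatPtr section_base = reinterpret_cast<FlatPtr>(start_of_some_section);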
Since we're taking from the committed set of pages, there should never
be a reason for this call to fail.
Also add a Badge to disallow taking committed pages from anywhere but
the Region class.
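The Badge here is AK's usual compile-time capability pattern; a hedged
sketch with an illustrative method name:

// Only Region can construct Badge<Region>, so only Region code can call
// this, even though the method itself is public. (Name is illustrative.)
NonnullRefPtr<PhysicalPage> take_committed_page(Badge<Region>);

// Call site inside Region: `{}` builds the badge, which compiles only there.
auto page = vmobject().take_committed_page({});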
We don't need to have a dedicated API for creating a VMObject with a
single page; the multi-page API works in all cases.
Also make the API take a Span<NonnullRefPtr<PhysicalPage>> instead of
a NonnullRefPtrVector<PhysicalPage>.
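Sketching the shape of the change (not the exact SerenityOS
declaration, and some_page() is a stand-in):

static RefPtr<AnonymousVMObject> try_create_with_physical_pages(
    Span<NonnullRefPtr<PhysicalPage>> physical_pages);

// The single-page case is just a one-element span:
NonnullRefPtr<PhysicalPage> page = some_page();
auto vmobject = AnonymousVMObject::try_create_with_physical_pages({ &page, 1 });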
Depending on the values, it might be difficult to figure out whether a
value is decimal or hexadecimal, so let's make this more obvious. This
also allows copying and pasting those numbers into GNOME Calculator and
probably other apps that auto-detect the base.
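For example (a hedged sketch using the kernel's dbgln formatting):

u32 value = 27309501;
dbgln("value={}", value);     // ambiguous at a glance
dbgln("value=0x{:x}", value); // value=0x1a0b5bd, base is obvious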
KBufferBuilder is always allowed to expand if it wants to. This
restriction was added a long time ago when it was unsafe to allocate
VM while generating ProcFS contents.
Use a Mutex instead of a SpinLock to protect the per-FileDescription
generated data cache. This allows processes to go to sleep while
waiting for their turn.
Also don't try to be clever by reusing existing cache buffers.
Just allocate KBuffers as needed (and make sure to surface failures.)
KBufferBuilder exists for code that wants to build a KBuffer instead
of a String. KBuffer is backed by anonymous VM, while String is backed
by a kernel heap allocation.
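A hedged sketch of typical KBufferBuilder usage, from memory of the API
(details may differ):

KBufferBuilder builder;
builder.append("cpu: ");
builder.appendff("{}", cpu_index);
auto buffer = builder.build(); // OwnPtr<KBuffer>; check for null on failure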
Advisory locks don't actually prevent other processes from writing to
the file, but they do prevent other processes from acquiring an
advisory lock on the file.
This implementation currently only adds non-blocking locks, which are
all I need for now.
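From userspace, taking such a non-blocking advisory lock looks like
this (standard flock() usage):

#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

int main()
{
    int fd = open("/tmp/example.lock", O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return 1;
    // LOCK_NB makes this fail with EWOULDBLOCK instead of blocking.
    if (flock(fd, LOCK_EX | LOCK_NB) < 0) {
        perror("flock");
        close(fd);
        return 1;
    }
    // ... exclusive section: other flock() callers are kept out ...
    flock(fd, LOCK_UN);
    close(fd);
    return 0;
}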
Before we start disabling acquisition of the big process lock for
specific syscalls, make sure to document and assert that the lock is
held during all syscalls.
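A sketch of the kind of assertion meant here (the macro and method
names are illustrative, not the exact SerenityOS spelling):

#define VERIFY_PROCESS_BIG_LOCK_ACQUIRED(process) \
    VERIFY((process)->big_lock().is_locked_by_current_thread())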
There are cases where we want to conditionally take a lock, but still
would like to use an RAII type to make sure we don't leak the lock.
This was previously impossible to do with `MutexLocker` due to its
design. This commit tweaks the design to allow the object to be
initialized to an "empty" state without a lock associated, so it does
nothing, and then later a lock can be "attached" to the locker.
I realized that the get_lock() APIs were also unused, and would no
longer make sense for empty locks, so they were removed.
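A simplified sketch of the resulting design (not the exact SerenityOS
code):

class MutexLocker {
public:
    MutexLocker() = default; // empty state: no lock attached, destructor no-ops
    explicit MutexLocker(Mutex& lock)
        : m_lock(&lock)
    {
        m_lock->lock();
    }
    ~MutexLocker()
    {
        if (m_lock)
            m_lock->unlock();
    }
    void attach_and_lock(Mutex& lock)
    {
        VERIFY(!m_lock); // may only attach to an empty locker
        m_lock = &lock;
        m_lock->lock();
    }

private:
    Mutex* m_lock { nullptr };
};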
Move these to MM to simplify the flow of the syscall handler.
While here, also make sure we hold the process space lock for
the duration of the validation to avoid potential issues where
another thread attempts to modify the process space during the
validation. This will allow us to move the validation out of the
big process lock scope in a future change.
Additionally utilize the new no_lock variants of functions to avoid
unnecessary recursive process space spinlock acquisitions.
Currently all syscalls run under Process::m_big_lock, which is an
obvious bottleneck. Long term we would like to remove the big lock and
replace it with the required fine-grained locking.
To facilitate this goal we need a way of gradually decomposing the big
lock into all of the required fine-grained locks. This commit
introduces instrumentation to the syscall table, allowing the big lock
requirement to be toggled on/off per syscall.
Eventually when we are finished, no syscall will require the big lock,
and we'll be able to remove all of this instrumentation.
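A hedged sketch of what such instrumentation can look like (the real
table and dispatcher layout may differ):

// Each syscall entry records whether it still needs the big process lock.
struct SyscallMetadata {
    SyscallHandler handler;
    bool needs_big_lock;
};

// Dispatcher: take the big lock only for syscalls still marked as needing it.
if (metadata.needs_big_lock)
    process.big_lock().lock();
auto result = (process.*metadata.handler)(regs);
if (metadata.needs_big_lock)
    process.big_lock().unlock();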
The entire Process is not needed; just require the user to pass in the
Space. Also provide a no_lock variant to use when you already have the
VM/Space lock acquired, to avoid unnecessary recursive spinlock
acquisitions.
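A sketch of the locked/no_lock pairing (names illustrative):

// The public variant takes the lock and defers to the no_lock variant;
// callers already holding the lock call the no_lock variant directly.
bool Space::is_range_valid(Range const& range) const
{
    ScopedSpinLock lock(m_lock);
    return is_range_valid_no_lock(range);
}

bool Space::is_range_valid_no_lock(Range const& range) const
{
    // Caller must already hold m_lock.
    return m_total_range.contains(range);
}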
Depending on the exact layout of the .ksyms section, the kernel would
fail to boot because the kernel_load_end variable didn't account for
the section's size.