When handling a page fault, we only need to remap the faulting region in
the current process. There's no need to traverse *all* regions that map
the same VMObject and remap them cross-process as well.
Those other regions will get remapped lazily by their own page fault
handlers eventually. Or maybe they won't and we avoided some work. :^)
- Instead of holding the VMObject lock across physical page allocation
and quick-map + copy, we now only hold it when updating the VMObject's
physical page slot.
We ensure that when we call SharedInodeVMObject::sync we lock the inode
lock before calling Inode virtual write_bytes method directly to avoid
assertion on the unlocked inode lock, as it was regressed recently. This
is not a complete fix as the need to lock from each path before calling
the write_bytes method should be avoided because it can lead to
hard-to-find bugs, and this commit only fixes the problem temporarily.
Previously, when starved for pages, *all* clean file-backed memory
would be released, which is quite excessive.
This patch instead releases just 1 page, since only 1 page is needed
to satisfy the request to `allocate_physical_page()`
Previously, we could only release *all* clean pages.
This patch makes it possible to release a specific amount of clean
pages. If the attempted number of pages to release is more than the
amount of clean pages, all clean pages will be released.
At the point at which we try to map the Region it was already added to
the Process region tree, so we have to make sure to remove it before
freeing it in the mapping failure path, otherwise the tree will contain
a dangling pointer to the free'd instance.
Until now, our only backup plan when running out of physical pages
was to try and purge volatile memory. If that didn't work out, we just
hung userspace out to dry with an ENOMEM.
This patch improves the situation by also considering clean, file-backed
pages (that we could page back in from disk).
This could be better in many ways, but it already allows us to boot to
WindowServer with 256 MiB of RAM. :^)
This deadlock was introduced with the creation of this API. The lock
order is such that we always need to take the page directory lock
before we ever take the MM lock.
This function violated that, as both Region creation and region
destruction require the pd and mm locks, but with the mm lock
already acquired we deadlocked with SMP mode enabled while other
threads were allocating regions.
With this change SMP boots to the desktop successfully for me,
(and then subsequently has other issues). :^)
There's no real value in separating physical pages to supervisor and
user types, so let's remove the concept and just let everyone to use
"user" physical pages which can be allocated from any PhysicalRegion
we want to use. Later on, we will remove the "user" prefix as this
prefix is not needed anymore.
We are limited on the amount of supervisor pages we can allocate, so
don't allocate from that pool. Supervisor pages are always below 16 MiB
barrier so using those was crucial when we used devices like the ISA
SoundBlaster 16 card, because that device required very low physical
addresses to be used.
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).
No functional changes.
Uncommitted pages (shared zero pages) can not contain any existing data
and can not be modified, so there's no point to committing a bunch of
extra pages to cover for them in the forked child.
Since both the parent process and child process hold a reference to the
COW committed set, once the child process exits, the committed COW
pages are effectively leaked, only being slowly re-claimed each time
the parent process writes to one of them, realizing it's no longer
shared, and uncommitting it.
In order to mitigate this we now hold a weak reference the parent
VMObject from which the pages are cloned, and we use it on destruction
when available to drop the reference to the committed set from it as
well.
This is basically unchanged since the beginning of 2020, which is a year
before we had proper ASLR.
Now that we have a proper ASLR implementation, we can turn this down a
bit, as it is no longer our only protection against predictable dynamic
loader addresses, and it actually obstructs the default loading address
of x86_64 quite frequently.
There's nothing stopping a userspace program from keeping a bunch of
threads around with a custom signal stack in a suspended state with
their normal thread stack mprotected to PROT_NONE.
OpenJDK seems to do this, for example.
This new type of VMObject will be used to coordinate switching safely
from graphical mode to text mode and vice-versa, by supplying a way to
remap all Regions that were created with this object, so mappings can be
changed according to the given state of system mode. This makes it quite
easy to give applications like WindowServer the feeling of having full
access to the framebuffer device from a DisplayConnector, but still keep
the Kernel in control to be able to safely switch to text console.
If we unregister from the RegionTree before unmapping, there's a race
where a new region can get inserted at the same address that we're about
to unmap. If this happens, ~Region() will then unmap the newly inserted
region, which now finds itself with cleared-out page table entries.
This had no business being in RegionTree, since RegionTree doesn't track
identity-mapped regions anyway. (We allow *any* address to be identity
mapped, not just the ones that are part of the RegionTree's range.)
This patch adds RegionTree::get_lock() which exposes the internal lock
inside RegionTree. We can then lock it from the outside when doing
lookups or traversal.
This solution is not very beautiful, we should find a way to protect
this data with SpinlockProtected or something similar. This is a stopgap
patch to try and fix the currently flaky CI.
...and remove the last remaining client of the API. It's no longer
possible to ask the RegionTree for a VM range. You can only ask it to
place your Region somewhere in available space.
This patch move AddressSpace (the per-process memory manager) to using
the new atomic "place" APIs in RegionTree as well, just like we did for
MemoryManager in the previous commit.
This required updating quite a few places where VM allocation and
actually committing a Region object to the AddressSpace were separated
by other code.
All you have to do now is call into AddressSpace once and it'll take
care of everything for you.