When we hit a COW fault and discover than no other process is sharing
the physical page, we simply remap it r/w and save ourselves the
trouble. When this happens, we can also give back (uncommit) one of our
shared committed COW pages, since we won't be needing it.
We had this optimization before, but I mistakenly removed it in
50472fd69f since I had misunderstood
it to be the reason for a panic.
We had issues with committed physical pages getting miscounted in some
situations, and instead of figuring out what was going wrong and making
sure all the commits had matching uncommits, this patch makes the
problem go away by adding an RAII class to manage this instead. :^)
MemoryManager::commit_user_physical_pages() now returns an (optional)
CommittedPhysicalPageSet. You can then allocate pages from the page set
by calling take_one() on it. Any unallocated pages are uncommitted upon
destruction of the page set.
Since we only ever add 1 super physical region, theres no reason to
add the extra redirection and allocation that comes with a dynamically
sized Vector.
To add a new per-CPU data structure, add an ID for it to the
ProcessorSpecificDataID enum.
Then call ProcessorSpecific<T>::initialize() when you are ready to
construct the per-CPU data structure on the current CPU. It can then
be accessed via ProcessorSpecific<T>::get().
This patch replaces the existing hard-coded mechanisms for Scheduler
and MemoryManager per-CPU data structure.
This allows us to specify virtual addresses for things the kernel should
access via virtual addresses later on. By doing this we can make the
kernel independent from specific physical addresses.
Previously the kernel relied on a fixed offset between virtual and
physical addresses based on the kernel's load address. This allows us
to specify an independent offset.
This patch greatly simplifies VMObject locking by doing two things:
1. Giving VMObject an IntrusiveList of all its mapping Region objects.
2. Removing VMObject::m_paging_lock in favor of VMObject::m_lock
Before (1), VMObject::for_each_region() was forced to acquire the
global MM lock (since it worked by walking MemoryManager's list of
all regions and checking for regions that pointed to itself.)
With each VMObject having its own list of Regions, VMObject's own
m_lock is all we need.
Before (2), page fault handlers used a separate mutex for preventing
overlapping work. This design required multiple temporary unlocks
and was generally extremely hard to reason about.
Instead, page fault handlers now use VMObject's own m_lock as well.
Move these to MM to simplify the flow of the syscall handler.
While here, also make sure we hold the process space lock for
the duration of the validation to avoid potential issues where
another thread attempts to modify the process space during the
validation. This will allow us to move the validation out of the
big process lock scope in a future change.
Additionally utilize the new no_lock variants of functions to avoid
unnecessary recursive process space spinlock acquisitions.
The entire process is not needed, just require the user to pass in the
Space. Also provide no_lock variant to use when you already have the
VM/Space lock acquired, to avoid unnecessary recursive spinlock
acquisitions.
This implements a simple bootloader that is capable of loading ELF64
kernel images. It does this by using QEMU/GRUB to load the kernel image
from disk and pass it to our bootloader as a Multiboot module.
The bootloader then parses the ELF image and sets it up appropriately.
The kernel's entry point is a C++ function with architecture-native
code.
Co-authored-by: Liav A <liavalb@gmail.com>
Nobody was using this API to request anythign about `PAGE_SIZE`
alignment, so let's get rid of it for now. We can reimplement it if
we end up needing it.
Also note that it wasn't actually used anywhere.
The previous allocator was very naive and kept the state of all pages
in one big bitmap. When allocating, we had to scan through the bitmap
until we found an unset bit.
This patch introduces a new binary buddy allocator that manages the
physical memory pages.
Each PhysicalRegion is divided into zones (PhysicalZone) of 16MB each.
Any extra pages at the end of physical RAM that don't fit into a 16MB
zone are turned into 15 or fewer 1MB zones.
Each zone starts out with one full-sized block, which is then
recursively subdivided into halves upon allocation, until a block of
the request size can be returned.
There are more opportunities for improvement here: the way zone objects
are allocated and stored is non-optimal. Same goes for the allocation
of buddy block state bitmaps.
Instead of each PhysicalPage knowing whether it comes from the
supervisor pages or from the user pages, we can just check in both
sets when freeing a page.
It's just a handful of pointer range checks, nothing expensive.
We had an inconsistency in valid user addresses. is_user_range() was
checking against the kernel base address, but previous changes caused
the maximum valid user addressable range to be 32 MiB below that.
This patch stops mmap(MAP_FIXED) of a range between these two bounds
from panic-ing the kernel in RangeAllocator::allocate_specific.
By making sure the PhysicalPage instance is fully destructed the
allocators will have a chance to reclaim the PhysicalPageEntry for
free-list purposes. Just pass them the physical address of the page
that was freed, which is enough to lookup the PhysicalPageEntry later.
By moving the PhysicalPage classes out of the kernel heap into a static
array, one for each physical page, we can avoid the added overhead and
easily find them by indexing into an array.
This also wraps the PhysicalPage into a PhysicalPageEntry, which allows
us to re-use each slot with information where to find the next free
page.
We already use PAE for the NX bit, but this changes the PhysicalAddress
structure to be able to hold 64 bit physical addresses. This allows us
to use all the available physical memory.
Replace the AK::String used for Region::m_name with a KString.
This seems beneficial across the board, but as a specific data point,
it reduces time spent in sys$set_mmap_name() by ~50% on test-js. :^)
Problem:
- `static` variables consume memory and sometimes are less
optimizable.
- `static const` variables can be `constexpr`, usually.
- `static` function-local variables require an initialization check
every time the function is run.
Solution:
- If a global `static` variable is only used in a single function then
move it into the function and make it non-`static` and `constexpr`.
- Make all global `static` variables `constexpr` instead of `const`.
- Change function-local `static const[expr]` variables to be just
`constexpr`.
By constraining two implementations, the compiler will select the best
fitting one. All this will require is duplicating the implementation and
simplifying for the `void` case.
This constraining also informs both the caller and compiler by passing
the callback parameter types as part of the constraint
(e.g.: `IterationFunction<int>`).
Some `for_each` functions in LibELF only take functions which return
`void`. This is a minimal correctness check, as it removes one way for a
function to incompletely do something.
There seems to be a possible idiom where inside a lambda, a `return;` is
the same as `continue;` in a for-loop.
Currently, when passing buffers into VirtIOQueues, we use scatter-gather
lists, which contain an internal vector of buffers. This vector is
allocated, filled and the destroy whenever we try to provide buffers
into a virtqueue, which would happen a lot in performance cricital code
(the main transport mechanism for certain paravirtualized devices).
This commit moves it over to using VirtIOQueueChains and building the
chain in place in the VirtIOQueue. Also included are a bunch of fixups
for the VirtIO Console device, making it use an internal VM::RingBuffer
instead.
We want to move this out of the AHCI subsystem into the VM system,
since other parts of the kernel may need to perform scatter-gather IO.
We rename the current VM::ScatterGatherList impl that's used in the
virtio subsystem to VM::ScatterGatherRefList, since its distinguishing
feature from the AHCI scatter-gather list is that it doesn't own its
buffers.
SPDX License Identifiers are a more compact / standardized
way of representing file license information.
See: https://spdx.dev/resources/use/#identifiers
This was done with the `ambr` search and replace tool.
ambr --no-parent-ignore --key-from-file --rep-from-file key.txt rep.txt *
This allows converting a single virtual buffer into its non-physically
contiguous parts, this is especially useful for DMA-based devices that
support scatter/gather-like functionality, as it eliminates the need
to clone outgoing buffers into one physically contiguous buffer.
Alot of code is shared between i386/i686/x86 and x86_64
and a lot probably will be used for compatability modes.
So we start by moving the headers into one Directory.
We will probalby be able to move some cpp files aswell.
Increase type-safety moving the MemoryManager APIs which take a
Region::Access to actually use that type instead of a `u8`.
Eventually the actually m_access can be moved there as well, but
I hit some weird bug where it wasn't using the correct operators
in `set_access_bit(..)` even though it's declared (and tested).
Something to fix-up later.
(...and ASSERT_NOT_REACHED => VERIFY_NOT_REACHED)
Since all of these checks are done in release builds as well,
let's rename them to VERIFY to prevent confusion, as everyone is
used to assertions being compiled out in release.
We can introduce a new ASSERT macro that is specifically for debug
checks, but I'm doing this wholesale conversion first since we've
accumulated thousands of these already, and it's not immediately
obvious which ones are suitable for ASSERT.
You can now declare functions with UNMAP_AFTER_INIT and they'll get
segregated into a separate kernel section that gets completely
unmapped at the end of initialization.
This can be used for anything we don't need to call once we've booted
into userspace.
There are two nice things about this mechanism:
- It allows us to free up entire pages of memory for other use.
(Note that this patch does not actually make use of the freed
pages yet, but in the future we totally could!)
- It allows us to get rid of obviously dangerous gadgets like
write-to-CR0 and write-to-CR4 which are very useful for an attacker
trying to disable SMAP/SMEP/etc.
I've also made sure to include a helpful panic message in case you
hit a kernel crash because of this protection. :^)
You can now use the READONLY_AFTER_INIT macro when declaring a variable
and we will put it in a special ".ro_after_init" section in the kernel.
Data in that section remains writable during the boot and init process,
and is then marked read-only just before launching the SystemServer.
This is based on an idea from the Linux kernel. :^)
If we try to align a number above 0xfffff000 to the next multiple of
the page size (4 KiB), it would wrap around to 0. This is most likely
never what we want, so let's assert if that happens.