I noticed this while debugging a crash in backtrace generation.
If a process crashed while temporarily inspecting another process's
address space, the crashing thread would still be using the other
process's page tables while handling the crash, causing all kinds of
confusion when trying to walk the stack of the crashing thread.
As suggested by Joshua, this commit adds the 2-clause BSD license as a
comment block to the top of every source file.
For the first pass, I've just added myself for simplicity. I encourage
everyone to add themselves as copyright holders of any file they've
added or modified in some significant way. If I've added myself in
error somewhere, feel free to replace it with the appropriate copyright
holder instead.
Going forward, all new source files should include a license header.
The kernel and its static data structures are no longer identity-mapped
in the bottom 8MB of the address space, but instead move above 3GB.
The first 8MB above 3GB are pseudo-identity-mapped to the bottom 8MB of
the physical address space. But things don't have to stay this way!
Thanks to Jesse who made an earlier attempt at this, it was really easy
to get device drivers working once the page tables were in place! :^)
Fixes #734.
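For illustration, the pseudo-identity mapping boils down to offsetting
every kernel virtual address by 3GB. A minimal sketch, where
map_kernel_page(), VirtualAddress and PhysicalAddress are hypothetical
names:

```cpp
// Hypothetical helpers: virtual 0xc0000000 + N maps to physical N
// for the first 8MB of the address space.
constexpr u32 KERNEL_BASE = 0xc0000000;
constexpr u32 PAGE_SIZE = 4096;
for (u32 offset = 0; offset < 8 * 1024 * 1024; offset += PAGE_SIZE)
    map_kernel_page(VirtualAddress(KERNEL_BASE + offset), PhysicalAddress(offset));
```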
The mmap() & mmap_region() methods are removed from the ACPI & DMI
components, and we replace them with the new MM.allocate_kernel_region()
helper. Instead of doing a raw calculation for each VM address, from
now on we can use helper functions to perform those calculations in a
neat, reusable and readable way.
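A hedged sketch of what a call site looks like after the change; the
allocate_kernel_region() signature here is from memory and should be
treated as illustrative:

```cpp
// Instead of hand-computing a VM address for the tables, ask the
// MemoryManager for a named kernel region and use its base address.
auto region = MM.allocate_kernel_region(page_round_up(table_length),
                                        "ACPI tables",
                                        Region::Access::Read);
auto* table = region->vaddr().as_ptr();
```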
It would be nice to do this in the assembly code, but we have to check
if the feature is available before doing a CLAC, so I've put this in
the C++ code for now.
Supervisor Mode Access Prevention (SMAP) is an x86 CPU feature that
prevents the kernel from accessing userspace memory. With SMAP enabled,
trying to read/write a userspace memory address while in the kernel
will now generate a page fault.
Since it's sometimes necessary to read/write userspace memory, there
are two new instructions that quickly switch the protection on/off:
STAC (disables protection) and CLAC (enables protection).
These are exposed in kernel code via the stac() and clac() helpers.
There's also a SmapDisabler RAII object that can be used to ensure
that you don't forget to re-enable protection before returning to
userspace code.
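A minimal sketch of the helpers and the RAII guard described above;
STAC/CLAC toggle EFLAGS.AC, hence the "cc" clobber:

```cpp
inline void stac() { asm volatile("stac" ::: "cc"); } // allow kernel access to user memory
inline void clac() { asm volatile("clac" ::: "cc"); } // re-enable SMAP protection

class SmapDisabler {
public:
    SmapDisabler() { stac(); }
    ~SmapDisabler() { clac(); } // protection comes back even on early return
};
```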
This patch also adds copy_to_user(), copy_from_user() and memset_user(),
which are the "correct" way of doing things. These functions allow us
to briefly disable protection for a specific purpose, and then turn it
back on immediately afterwards. Going forward, all kernel code should
be moved to using these, and all uses of SmapDisabler are to be
considered FIXMEs.
Note that we're not realizing the full potential of this feature since
I've used SmapDisabler quite liberally in this initial bring-up patch.
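To illustrate the intended pattern, here's a hedged sketch of
copy_to_user(); address validation and error handling are omitted:

```cpp
// Protection is lifted only for the duration of this one copy.
void copy_to_user(void* dest, const void* src, size_t n)
{
    stac();
    memcpy(dest, src, n);
    clac();
}
```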
We now have these APIs in <Kernel/Random.h>:
- get_fast_random_bytes(u8* buffer, size_t buffer_size)
- get_good_random_bytes(u8* buffer, size_t buffer_size)
- get_fast_random<T>()
- get_good_random<T>()
Internally, both variants use x86 RDRAND if available; otherwise they
fall back to the same LCG we had in RandomDevice all along.
The main purpose of this patch is to give kernel code a way to better
express its needs for random data.
Randomness is something that will require a lot more work, but this is
hopefully a step in the right direction.
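A hedged usage sketch of the new helpers:

```cpp
#include <Kernel/Random.h>

void example()
{
    u8 key[32];
    get_good_random_bytes(key, sizeof(key)); // for security-sensitive uses
    u32 cookie = get_fast_random<u32>();     // cheap, for non-critical uses
    (void)cookie;
}
```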
When entering the kernel from a syscall, we now insert a small bit of
stack padding after the RegisterDump. This makes kernel stacks less
deterministic across syscalls and may make some bugs harder to exploit.
Inspired by Elena Reshetova's talk on kernel stack exploitation.
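A hedged sketch of the idea, with illustrative names:

```cpp
// After the RegisterDump has been pushed, nudge the kernel stack
// pointer down by a small random, 4-byte-aligned amount.
u32 padding = get_fast_random<u8>() & ~0x3u; // 0-252 bytes
kernel_stack_top -= padding;
```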
This prevents code running outside of kernel mode from using the
following instructions:
* SGDT - Store Global Descriptor Table
* SIDT - Store Interrupt Descriptor Table
* SLDT - Store Local Descriptor Table
* SMSW - Store Machine Status Word
* STR - Store Task Register
There's no need for userspace to be able to use these instructions,
so let's just disable them to prevent information leakage.
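For reference, this matches the x86 UMIP (User-Mode Instruction
Prevention) feature, controlled by CR4 bit 11; a minimal sketch,
assuming the CPU advertises UMIP:

```cpp
static void enable_umip()
{
    u32 cr4;
    asm volatile("mov %%cr4, %0" : "=r"(cr4));
    asm volatile("mov %0, %%cr4" ::"r"(cr4 | (1u << 11))); // CR4.UMIP
}
```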
We now refuse to boot on machines that don't support PAE, since all
of our paging code depends on it.
Also, let's only enable SSE and PGE support if the CPU advertises them.
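A hedged sketch of the probe via CPUID leaf 1 (in EDX: bit 6 is PAE,
bit 13 is PGE, bit 25 is SSE); hang() is a hypothetical halt routine:

```cpp
static void verify_cpu_features()
{
    u32 eax = 1, ebx, ecx, edx;
    asm volatile("cpuid" : "+a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx));
    if (!(edx & (1 << 6)))
        hang(); // no PAE: refuse to boot
    bool has_pge = edx & (1 << 13);
    bool has_sse = edx & (1 << 25);
    // ...enable PGE/SSE only if advertised...
}
```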
Introduce one more (CPU) indirection layer in the paging code: the page
directory pointer table (PDPT). Each PageDirectory now has 4 separate
PageDirectoryEntry arrays, governing 1 GB of VM each.
A really neat side-effect of this is that we can now share the physical
page containing the >=3GB kernel-only address space metadata between
all processes, instead of lazily cloning it on page faults.
This will give us access to the NX (No eXecute) bit, allowing us to
prevent execution of memory that's not supposed to be executed.
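To make the indirection concrete, here's how a 32-bit virtual address
splits under PAE (a sketch, not the kernel's actual accessors):

```cpp
u32 pdpt_index = (vaddr >> 30) & 0x3;   // one of 4 PDPT entries, 1GB each
u32 pd_index = (vaddr >> 21) & 0x1ff;   // one of 512 PDEs, 2MB each
u32 pt_index = (vaddr >> 12) & 0x1ff;   // one of 512 PTEs, 4KB each
u32 page_offset = vaddr & 0xfff;        // offset within the page
```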
Instead of having a common entry point and looking at the PIC ISR to
figure out which IRQ we're servicing, just make a separate entryway
for each IRQ that pushes the IRQ number and jumps to a common routine.
This fixes a weird issue where incoming network packets would sometimes
cause the mouse to stop working. I didn't track it down further than
realizing we were sometimes EOI'ing the wrong IRQ.
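A hedged sketch of the per-IRQ entryways (the macro and label names are
hypothetical):

```cpp
// Each stub pushes its own IRQ number and jumps to the shared routine,
// so we always EOI exactly the IRQ we're servicing.
#define IRQ_ENTRY(n)                    \
    asm(".globl irq_" #n "_entry\n"     \
        "irq_" #n "_entry:\n"           \
        "    pushl $" #n "\n"           \
        "    jmp irq_common_entry\n");

IRQ_ENTRY(0)
IRQ_ENTRY(1)
// ...one stub per IRQ line.
```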
Now that we have proper wait queues to drive waiter wakeup, we can use
the wake actions to break out of the scheduler's idle loop when we've
got a thread to run.
There was a race window between instantiating a WaitQueueBlocker and
setting the thread state to Blocked. If a thread was preempted between
those steps, someone else might try to wake the wait queue and find an
unblocked thread in a wait queue, which is not sane.
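A hedged sketch of closing the window (guard and method names are
illustrative):

```cpp
// Enqueue and block inside one critical section, so a waker can never
// observe a thread that's queued but not yet Blocked.
{
    InterruptDisabler disabler;
    wait_queue.enqueue(*current);
    current->set_state(Thread::Blocked);
}
```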
The kernel is no longer identity-mapped in the bottom 8MiB of
memory; it is now mapped at the higher address of `0xc0000000`.
The lower ~1MiB of memory (from GRUB's mmap), however, is still
identity-mapped to provide an easy way for the kernel to get
physical pages for things such as DMA etc. These could later be
mapped to the higher address too, as I'm not too sure how to
go about doing this elegantly without a lot of address subtractions.
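A hedged sketch of what those subtractions would look like once
wrapped in helpers (assuming the fixed `0xc0000000` offset above):

```cpp
inline u32 kernel_virtual_to_physical(u32 vaddr) { return vaddr - 0xc0000000; }
inline u32 physical_to_kernel_virtual(u32 paddr) { return paddr + 0xc0000000; }
```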
VM regions can now be marked as stack regions, which is then validated
on syscall, and on page fault.
If a thread is caught with its stack pointer pointing into anything
that's *not* a Region with its stack bit set, we'll crash the whole
process with SIGSTKFLT.
Userspace must now allocate custom stacks by using mmap() with the new
MAP_STACK flag. This mechanism was first introduced in OpenBSD, and now
we have it too, yay! :^)
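A hedged usage sketch from the userspace side:

```cpp
#include <sys/mman.h>

// Allocate a custom thread stack; without MAP_STACK, pointing the
// stack pointer into the region would get the process killed.
void* allocate_thread_stack(size_t size)
{
    return mmap(nullptr, size, PROT_READ | PROT_WRITE,
                MAP_ANONYMOUS | MAP_PRIVATE | MAP_STACK, -1, 0);
}
```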
The SysV ABI says that the DF flag should be clear on function entry.
That means we have to clear it when jumping into the kernel from some
random userspace context.
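A sketch of the fix: clearing DF is a single instruction on the kernel
entry path:

```cpp
// Clear DF so string instructions move forward, as callees expect.
asm volatile("cld");
```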
Since the kernel page tables are shared between all processes, there's
no need to (implicitly) flush the TLB for them on every context switch.
Setting the G bit on kernel page tables allows the CPU to keep the
translation caches around.
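A hedged sketch (the PTE accessor is hypothetical; CR4.PGE is bit 7):

```cpp
pte.set_global(true); // kernel mapping survives CR3 reloads

u32 cr4;
asm volatile("mov %%cr4, %0" : "=r"(cr4));
asm volatile("mov %0, %%cr4" ::"r"(cr4 | (1u << 7))); // enable CR4.PGE
```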
After we clear the FPU state in a thread when it uses the FPU for the
first time, we also save the clean slate in the thread's FPU state
buffer. When we're doing that, let's write through current->fpu_state()
just to make it clear what's going on.
It was actually safe, since we'd just overwritten the g_last_fpu_thread
pointer anyway, but this patch improves the communication of intent.
Spotted by Bryan Steele, thanks!
Cloned threads (basically, forked processes) inherit the complete FPU
state of their origin thread. There was a bug in the lazy FPU state
save/restore mechanism where a cloned thread would believe it had a
buffer full of valid FPU state (because the inherited flag said so)
but the origin thread had never actually copied any FPU state into it.
This patch fixes that by forcing out an FPU state save after doing
the initial FPU initialization (FNINIT) in a thread. :^)
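A hedged sketch of the fix (method and member names are illustrative):

```cpp
void Thread::initialize_fpu()
{
    asm volatile("fninit");
    // Immediately save the clean slate so a later clone() inherits
    // valid contents; fxsave needs a 16-byte-aligned 512-byte buffer.
    asm volatile("fxsave %0" : "=m"(m_fpu_state));
}
```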
Now programs can catch the SIGSEGV signal when they segfault.
This commit also introduced the send_urgent_signal_to_self method,
which is needed to send signals to a thread when handling exceptions
caused by the same thread.
Added the exception_code field to RegisterDump, removing the need
for RegisterDumpWithExceptionCode. To accomplish this, I had to
push a dummy exception code during some interrupt entries to properly
pad out the RegisterDump. Note that we also needed to change some code
in sys$sigreturn to deal with the new RegisterDump layout.
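A hedged sketch of the dummy-code padding (the entry label is
hypothetical; the divide-error exception is one where the CPU pushes
no error code of its own):

```cpp
asm(".globl divide_error_entry\n"
    "divide_error_entry:\n"
    "    pushl $0\n" // dummy exception code to pad out RegisterDump
    "    jmp exception_common\n");
```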
If we receive an IRQ while the idle task is running, prevent it from
re-halting the CPU after the IRQ handler returns.
Instead have the idle task yield to the scheduler, so we can see if
the IRQ has unblocked something.
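A hedged sketch of the new idle loop (Scheduler::yield() as named here
is illustrative):

```cpp
[[noreturn]] static void idle_loop()
{
    for (;;) {
        // Sleep until the next IRQ...
        asm volatile("sti; hlt");
        // ...then give the scheduler a chance to run whatever the IRQ
        // handler may have unblocked, instead of halting again.
        Scheduler::yield();
    }
}
```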
Due to the changes in signal handling, m_kernel_stack_for_signal_handler_region
and m_signal_stack_user_region are no longer necessary, and so have been
removed. I've also removed the similarly redundant m_tss_to_resume_kernel.