1
Fork 0
mirror of https://github.com/RGBCube/serenity synced 2025-05-31 13:38:11 +00:00
Commit graph

340 commits

Author SHA1 Message Date
Andreas Kling
6eab7b398d Kernel: Make ProcessPagingScope restore CR3 properly
Instead of restoring CR3 to the current process's paging scope when a
ProcessPagingScope goes out of scope, we now restore exactly whatever
the CR3 value was when we created the ProcessPagingScope.

This fixes breakage in situations where a process ends up with nested
ProcessPagingScopes. This was making profiling very fragile, and with
this change it's now possible to profile g++! :^)
2020-01-19 13:44:53 +01:00
Andreas Kling
ad3f931707 Kernel: Optimize VM range deallocation a bit
Previously, when deallocating a range of VM, we would sort and merge
the range list. This was quite slow for large processes.

This patch optimizes VM deallocation in the following ways:

- Use binary search instead of linear scan to find the place to insert
  the deallocated range.

- Insert at the right place immediately, removing the need to sort.

- Merge the inserted range with any adjacent range(s) in-line instead
  of doing a separate merge pass into a list copy.

- Add Traits<Range> to inform Vector that Range objects are trivial
  and can be moved using memmove().

I've also added an assertion that deallocated ranges are actually part
of the RangeAllocator's initial address range.

I've benchmarked this using g++ to compile Kernel/Process.cpp.
With these changes, compilation goes from ~41 sec to ~35 sec.
2020-01-19 13:29:59 +01:00
Andreas Kling
f7b394e9a1 Kernel: Assert that copy_to/from_user() are called with user addresses
This will panic the kernel immediately if these functions are misused
so we can catch it and fix the misuse.

This patch fixes a couple of misuses:

    - create_signal_trampolines() writes to a user-accessible page
      above the 3GB address mark. We should really get rid of this
      page but that's a whole other thing.

    - CoW faults need to use copy_from_user rather than copy_to_user
      since it's the *source* pointer that points to user memory.

    - Inode faults need to use memcpy rather than copy_to_user since
      we're copying a kernel stack buffer into a quickmapped page.

This should make the copy_to/from_user() functions slightly less useful
for exploitation. Before this, they were essentially just glorified
memcpy() with SMAP disabled. :^)
2020-01-19 09:18:55 +01:00
Andreas Kling
2cd212e5df Kernel: Let's say that everything < 3GB is user virtual memory
Technically the bottom 2MB is still identity-mapped for the kernel and
not made available to userspace at all, but for simplicity's sake we
can just ignore that and make "address < 0xc0000000" the canonical
check for user/kernel.
2020-01-19 08:58:33 +01:00
Andreas Kling
862b3ccb4e Kernel: Enforce W^X between sys$mmap() and sys$execve()
It's now an error to sys$mmap() a file as writable if it's currently
mapped executable by anyone else.

It's also an error to sys$execve() a file that's currently mapped
writable by anyone else.

This fixes a race condition vulnerability where one program could make
modifications to an executable while another process was in the kernel,
in the middle of exec'ing the same executable.

Test: Kernel/elf-execve-mmap-race.cpp
2020-01-18 23:40:12 +01:00
Andreas Kling
6fea316611 Kernel: Move all CPU feature initialization into cpu_setup()
..and do it very very early in boot.
2020-01-18 10:11:29 +01:00
Andreas Kling
94ca55cefd Meta: Add license header to source files
As suggested by Joshua, this commit adds the 2-clause BSD license as a
comment block to the top of every source file.

For the first pass, I've just added myself for simplicity. I encourage
everyone to add themselves as copyright holders of any file they've
added or modified in some significant way. If I've added myself in
error somewhere, feel free to replace it with the appropriate copyright
holder instead.

Going forward, all new source files should include a license header.
2020-01-18 09:45:54 +01:00
Andreas Kling
19c31d1617 Kernel: Always dump kernel regions when dumping process regions 2020-01-18 08:57:18 +01:00
Andreas Kling
345f92d5ac Kernel: Remove two unused MemoryManager functions 2020-01-18 08:57:18 +01:00
Andreas Kling
3e8b60c618 Kernel: Clean up MemoryManager initialization a bit more
Move the CPU feature enabling to functions in Arch/i386/CPU.cpp.
2020-01-18 00:28:16 +01:00
Andreas Kling
a850a89c1b Kernel: Add a random offset to the base of the per-process VM allocator
This is not ASLR, but it does de-trivialize exploiting the ELF loader
which would previously always parse executables at 0x01001000 in every
single exec(). I've taken advantage of this multiple times in my own
toy exploits and it's starting to feel cheesy. :^)
2020-01-17 23:29:54 +01:00
Andreas Kling
536c0ff3ee Kernel: Only clone the bottom 2MB of mappings from kernel to processes 2020-01-17 22:34:36 +01:00
Andreas Kling
122c76d7fa Kernel: Don't allocate per-process PDPT from super pages either
The default system is now down to 3 super pages allocated on boot. :^)
2020-01-17 22:34:36 +01:00
Andreas Kling
ad1f79fb4a Kernel: Stop allocating page tables from the super pages pool
We now use the regular "user" physical pages for on-demand page table
allocations. This was by far the biggest source of super physical page
exhaustion, so that bug should be a thing of the past now. :^)

We still have super pages, but they are barely used. They remain useful
for code that requires memory with a low physical address.

Fixes #1000.
2020-01-17 22:34:36 +01:00
Andreas Kling
f71fc88393 Kernel: Re-enable protection of the kernel image in memory 2020-01-17 22:34:36 +01:00
Andreas Kling
59b584d983 Kernel: Tidy up the lowest part of the address space
After MemoryManager initialization, we now only leave the lowest 1MB
of memory identity-mapped. The very first (null) page is not present.
All other pages are RW but not X. Supervisor only.
2020-01-17 22:34:36 +01:00
Andreas Kling
545ec578b3 Kernel: Tidy up the types imported from boot.S a little bit 2020-01-17 22:34:36 +01:00
Andreas Kling
7e6f0efe7c Kernel: Move Multiboot memory map parsing to its own function 2020-01-17 22:34:36 +01:00
Andreas Kling
ba8275a48e Kernel: Clean up ensure_pte() 2020-01-17 22:34:36 +01:00
Andreas Kling
e362b56b4f Kernel: Move kernel above the 3GB virtual address mark
The kernel and its static data structures are no longer identity-mapped
in the bottom 8MB of the address space, but instead move above 3GB.

The first 8MB above 3GB are pseudo-identity-mapped to the bottom 8MB of
the physical address space. But things don't have to stay this way!

Thanks to Jesse who made an earlier attempt at this, it was really easy
to get device drivers working once the page tables were in place! :^)

Fixes #734.
2020-01-17 22:34:26 +01:00
Liav A
d2b41010c5 Kernel: Change Region allocation helpers
We now can create a cacheable Region, so when map() is called, if a
Region is cacheable then all the virtual memory space being allocated
to it will be marked as not cache disabled.

In addition to that, OS components can create a Region that will be
mapped to a specific physical address by using the appropriate helper
method.
2020-01-14 15:38:58 +01:00
Andreas Kling
5c3c2a9bac Kernel: Copy Region's "is_mmap" flag when cloning regions for fork()
Otherwise child processes will not be allowed to munmap(), madvise(),
etc. on the cloned regions!
2020-01-10 19:24:01 +01:00
Andreas Kling
62c45850e1 Kernel: Page allocation should not use memset_user() when zeroing
We're not zeroing new pages through a userspace address, so this should
not use memset_user().
2020-01-10 10:57:33 +01:00
Andreas Kling
197e73ee31 Kernel+LibELF: Enable SMAP protection during non-syscall exec()
When loading a new executable, we now map the ELF image in kernel-only
memory and parse it there. Then we use copy_to_user() when initializing
writable regions with data from the executable.

Note that the exec() syscall still disables SMAP protection and will
require additional work. This patch only affects kernel-originated
process spawns.
2020-01-10 10:57:06 +01:00
Andreas Kling
8e7420ddf2 Kernel: Harden memory mapping of the kernel image
We now map the kernel's text and rodata segments read+execute.
We also make the data and bss segments non-executable.

Thanks to q3k for the idea! :^)
2020-01-06 13:55:39 +01:00
Andreas Kling
9eef39d68a Kernel: Start implementing x86 SMAP support
Supervisor Mode Access Prevention (SMAP) is an x86 CPU feature that
prevents the kernel from accessing userspace memory. With SMAP enabled,
trying to read/write a userspace memory address while in the kernel
will now generate a page fault.

Since it's sometimes necessary to read/write userspace memory, there
are two new instructions that quickly switch the protection on/off:
STAC (disables protection) and CLAC (enables protection.)
These are exposed in kernel code via the stac() and clac() helpers.

There's also a SmapDisabler RAII object that can be used to ensure
that you don't forget to re-enable protection before returning to
userspace code.

THis patch also adds copy_to_user(), copy_from_user() and memset_user()
which are the "correct" way of doing things. These functions allow us
to briefly disable protection for a specific purpose, and then turn it
back on immediately after it's done. Going forward all kernel code
should be moved to using these and all uses of SmapDisabler are to be
considered FIXME's.

Note that we're not realizing the full potential of this feature since
I've used SmapDisabler quite liberally in this initial bring-up patch.
2020-01-05 18:14:51 +01:00
Andreas Kling
aba7829724 Kernel: InodeVMObject can't call Inode::size() with interrupts disabled
Inode::size() may try to take a lock, so we can't be calling it with
interrupts disabled.

This fixes a kernel hang when trying to execute a binary in a TmpFS.
2020-01-03 15:40:03 +01:00
Andreas Kling
0f9800ca57 Kernel: Make the loop that marks the bottom 1MB NX a little less busy 2020-01-02 22:02:29 +01:00
Andreas Kling
32ec1e5aed Kernel: Mask kernel addresses in backtraces and profiles
Addresses outside the userspace virtual range will now show up as
0xdeadc0de in backtraces and profiles generated by unprivileged users.
2020-01-02 20:51:31 +01:00
Andreas Kling
3dcec260ed Kernel: Validate the full range of user memory passed to syscalls
We now validate the full range of userspace memory passed into syscalls
instead of just checking that the first and last byte of the memory are
in process-owned regions.

This fixes an issue where it was possible to avoid rejection of invalid
addresses that sat between two valid ones, simply by passing a valid
address and a size large enough to put the end of the range at another
valid address.

I added a little test utility that tries to provoke EFAULT in various
ways to help verify this. I'm sure we can think of more ways to test
this but it's at least a start. :^)

Thanks to mozjag for pointing out that this code was still lacking!

Incidentally this also makes backtraces work again.

Fixes #989.
2020-01-02 02:17:12 +01:00
Andreas Kling
ea1911b561 Kernel: Share code between Region::map() and Region::remap_page()
These were doing mostly the same things, so let's just share the code.
2020-01-01 19:32:55 +01:00
Andreas Kling
5aeaab601e Kernel: Move CPU feature detection to Arch/x86/CPU.{cpp.h}
We now refuse to boot on machines that don't support PAE since all
of our paging code depends on it.

Also let's only enable SSE and PGE support if the CPU advertises it.
2020-01-01 12:57:00 +01:00
Andreas Kling
8602fa5b49 Kernel: Enable x86 SMEP (Supervisor Mode Execution Protection)
This prevents the kernel from jumping to code in userspace memory.
2020-01-01 01:59:52 +01:00
Andreas Kling
c9ec415e2f Kernel: Always reject never-userspace addresses before checking regions
At the moment, addresses below 8MB and above 3GB are never accessible
to userspace, so just reject them without even looking at the current
process's memory regions.
2019-12-31 03:45:54 +01:00
Andreas Kling
66d5ebafa6 Kernel: Let's also not consider kernel regions to be valid user stacks
This one is less obviously exploitable than the previous one, but still
a bug nonetheless.
2019-12-31 00:28:14 +01:00
Andreas Kling
0fc24fe256 Kernel: User pointer validation should reject kernel-only addresses
We were happily allowing syscalls with pointers into kernel-only
regions (virtual address >= 0xc0000000).

This patch fixes that by only considering user regions in the current
process, and also double-checking the Region::is_user_accessible() flag
before approving an access.

Thanks to Fire30 for finding the bug! :^)
2019-12-31 00:24:35 +01:00
Andreas Kling
1f31156173 Kernel: Add a mode flag to sys$purge and allow purging clean inodes 2019-12-29 13:16:53 +01:00
Andreas Kling
c74cde918a Kernel+SystemMonitor: Expose amount of per-process clean inode memory
This is memory that's loaded from an inode (file) but not modified in
memory, so still identical to what's on disk. This kind of memory can
be freed and reloaded transparently from disk if needed.
2019-12-29 12:45:58 +01:00
Andreas Kling
0d5e0e4cad Kernel+SystemMonitor: Expose amount of per-process dirty private memory
Dirty private memory is all memory in non-inode-backed mappings that's
process-private, meaning it's not shared with any other process.

This patch exposes that number via SystemMonitor, giving us an idea of
how much memory each process is responsible for all on its own.
2019-12-29 12:28:32 +01:00
Andreas Kling
c1f8291ce4 Kernel: When physical page allocation fails, try to purge something
Instead of panicking right away when we run out of physical pages,
we now try to find a PurgeableVMObject with some volatile pages in it.
If we find one, we purge that entire object and steal one of its pages.

This makes it possible for the kernel to keep going instead of dying.
Very cool. :^)
2019-12-26 11:45:36 +01:00
Conrad Pankoff
17aef7dc99 Kernel: Detect support for no-execute (NX) CPU features
Previously we assumed all hosts would have support for IA32_EFER.NXE.
This is mostly true for newer hardware, but older hardware will crash
and burn if you try to use this feature.

Now we check for support via CPUID.80000001[20].
2019-12-26 10:05:51 +01:00
Andreas Kling
9e55bcb7da Kernel: Make kernel memory regions be non-executable by default
From now on, you'll have to request executable memory specifically
if you want some.
2019-12-25 22:41:34 +01:00
Andreas Kling
0b7a2e0a5a Kernel: Set NX bit for virtual addresses 0-1MB and 2-8MB
This removes the ability to jump into kmalloc memory, etc.
Only the kernel image itself is allowed to exec, located between 1-2MB.
2019-12-25 22:24:28 +01:00
Andreas Kling
ce5f7f6c07 Kernel: Use the CPU's NX bit to enforce PROT_EXEC on memory mappings
Now that we have PAE support, we can ask the CPU to crash processes for
trying to execute non-executable memory. This is pretty cool! :^)
2019-12-25 13:35:57 +01:00
Andreas Kling
52deb09382 Kernel: Enable PAE (Physical Address Extension)
Introduce one more (CPU) indirection layer in the paging code: the page
directory pointer table (PDPT). Each PageDirectory now has 4 separate
PageDirectoryEntry arrays, governing 1 GB of VM each.

A really neat side-effect of this is that we can now share the physical
page containing the >=3GB kernel-only address space metadata between
all processes, instead of lazily cloning it on page faults.

This will give us access to the NX (No eXecute) bit, allowing us to
prevent execution of memory that's not supposed to be executed.
2019-12-25 13:35:57 +01:00
Andreas Kling
c087abc48d Kernel: Rename PageDirectory::find_by_pdb() => find_by_cr3()
I caught myself wondering what "pdb" stood for, so let's rename this
to something more obvious.
2019-12-25 02:58:03 +01:00
Andreas Kling
7a0088c4d2 Kernel: Clean up Region access bit setters a little 2019-12-25 02:58:03 +01:00
Andreas Kling
c9a5253ac2 Kernel: Uh, actually *actually* turn on CR4.PGE
I'm not sure how I managed to misread the location of this bit twice.
But I did! Here is finally the correct value, according to Intel:

    "Page Global Enable (bit 7 of CR4)"

Jeez! :^)
2019-12-25 02:58:03 +01:00
Andreas Kling
3623e35978 Kernel: Oops, actually enable CR4.PGE (page table global bit)
Turns out we were setting the wrong bit here. Now we will actually keep
kernel memory mappings in the TLB across context switches.
2019-12-24 22:45:27 +01:00
Andreas Kling
ae2d72377d Kernel: Enable the x86 WP bit to catch invalid memory writes in ring 0
Setting this bit will cause the CPU to generate a page fault when
writing to read-only memory, even if we're executing in the kernel.

Seemingly the only change needed to make this work was to have the
inode-backed page fault handler use a temporary mapping for writing
the read-from-disk data into the newly-allocated physical page.
2019-12-21 16:21:13 +01:00