This was using up to 12 KiB of kernel stack in the triply indirect case
and looks generally spooky. Let's just allocate a ByteBuffer for now
and take the performance hit (of heap allocation). Longer term we can
reorganize the code to eliminate most of the heap churn.
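
Roughly, the change looks like this (a sketch only, with std::vector
standing in for AK::ByteBuffer and an illustrative block size):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    void compute_block_list(size_t block_size)
    {
        // Before: u8 scratch[3 * 4096]; // ~12 KiB on the kernel stack
        // After: heap-allocated scratch, one level per indirection:
        std::vector<uint8_t> scratch(3 * block_size);
        // ... read the indirect blocks into `scratch` and walk the pointers ...
    }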
Switch to using type-safe bitwise operators for BlockFlags; this cleans
up a lot of the boilerplate casts that are otherwise necessary when the
enum is declared as `enum class`.
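
For reference, the pattern looks roughly like this (illustrative flag
values, not the actual BlockFlags members):

    #include <cstdint>
    #include <type_traits>

    enum class BlockFlags : uint32_t {
        None = 0,
        Read = 1 << 0,
        Write = 1 << 1,
    };

    // Without these operators, every combination needs casts like:
    //   static_cast<BlockFlags>(static_cast<uint32_t>(a) | static_cast<uint32_t>(b))
    constexpr BlockFlags operator|(BlockFlags a, BlockFlags b)
    {
        using T = std::underlying_type_t<BlockFlags>;
        return static_cast<BlockFlags>(static_cast<T>(a) | static_cast<T>(b));
    }

    constexpr BlockFlags operator&(BlockFlags a, BlockFlags b)
    {
        using T = std::underlying_type_t<BlockFlags>;
        return static_cast<BlockFlags>(static_cast<T>(a) & static_cast<T>(b));
    }

    constexpr bool has_flag(BlockFlags value, BlockFlags mask)
    {
        return (value & mask) != BlockFlags::None;
    }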
As it turns out, Dr. POSIX doesn't require that post-mmap() changes
to a file are reflected in the memory mappings. So we don't actually
have to care about the file size changing (or the contents.)
IIUC, as long as all the MAP_SHARED mappings that refer to the same
inode are in sync, we're good.
This means that VMObjects don't need resizing capabilities. I'm sure
there are ways we can take advantage of this fact.
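
A sketch of the semantics in question (a userspace illustration of the
reading above, not a guaranteed-portable POSIX test):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main()
    {
        int fd = open("/tmp/example", O_RDWR | O_CREAT, 0600);
        ftruncate(fd, 4096);
        auto* a = (char*)mmap(nullptr, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        auto* b = (char*)mmap(nullptr, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        a[0] = 'x'; // must be visible through `b` (same inode, both MAP_SHARED)...
        // ...but a later write(fd, ...) or ftruncate(fd, ...) need not be
        // reflected in `a` or `b`.
        return b[0] == 'x' ? 0 : 1;
    }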
The perfcore file format was previously limited to a single process
since the pid/executable/regions data was top-level in the JSON.
This patch moves the process-specific data into a top-level array
named "processes" and we now add entries for each process that has
been sampled during the profile run.
This makes it possible to see samples from multiple threads when
viewing a perfcore file with Profiler. This is extremely cool! :^)
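
A minimal sketch of the new layout (only "processes", "pid",
"executable" and "regions" are from this change; everything else here
is illustrative):

    {
        "events": [],
        "processes": [
            { "pid": 42, "executable": "/bin/SomeProgram", "regions": [] },
            { "pid": 43, "executable": "/bin/OtherProgram", "regions": [] }
        ]
    }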
The superuser can now call sys$profiling_enable() with PID -1 to enable
profiling of all running threads in the system. The perf events are
collected in a global PerformanceEventBuffer (currently 32 MiB in size.)
The events can be accessed via /proc/profile.
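
Hypothetical userspace usage (this assumes LibC wrappers
profiling_enable(pid_t)/profiling_disable(pid_t) around the syscalls):

    #include <serenity.h>
    #include <stdio.h>

    int main()
    {
        if (profiling_enable(-1) < 0) { // PID -1: profile all running threads
            perror("profiling_enable");
            return 1;
        }
        // ... let the workload run, then read the events from /proc/profile ...
        profiling_disable(-1);
        return 0;
    }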
This may seem like a no-op change; however, it shrinks the Kernel down a bit:
.text              -432
.unmap_after_init   -60
.data              -480
.debug_info        -673
.debug_aranges       +8
.debug_ranges      -232
.debug_line        -558
.debug_str         -308
.debug_frame        -40
With '= default', the compiler can do more inlining, hence the savings.
I intentionally omitted some opportunities for '= default', because they
would increase the Kernel size.
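
As a sketch of why this helps (illustrative types, not actual kernel
classes):

    // An out-of-line empty destructor is an opaque call at every call site:
    struct Widget {
        ~Widget(); // defined elsewhere as: Widget::~Widget() { }
    };

    // With '= default' visible in the header, the destructor is trivial and
    // the compiler can inline it away entirely:
    struct Gadget {
        ~Gadget() = default;
    };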
We don't need to flush the on-disk inode struct multiple times while
writing out its block list. Just mark the in-memory Inode as having
dirty metadata and the SyncTask will flush it eventually.
Since the inode is the logical owner of its block list, let's move the
code that computes the block list there, and also stop hogging the FS
lock while we compute the block list, as there is no need for it.
There are two locks in the Ext2FS implementation:
* The FS lock (Ext2FS::m_lock)
This governs access to the superblock, block group descriptors,
and the block & inode bitmap blocks. It's held while allocating
or freeing blocks/inodes.
* The inode lock (Ext2FSInode::m_lock)
This governs access to the inode metadata, including the block
list, and to the content data as well. It's held while doing
basically anything with the inode.
Once an on-disk block/inode is allocated, it logically belongs
to the in-memory Inode object, so there's no need for the FS lock
to be taken while manipulating them; the inode lock is all you need.
This dramatically reduces the impact of disk I/O on path resolution
and various other things that look at individual inodes.
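
A sketch of the resulting discipline (std::mutex standing in for the
kernel's lock type; heavily simplified):

    #include <mutex>

    struct Ext2FS {
        std::mutex m_lock; // guards superblock, group descriptors, bitmaps
    };

    struct Ext2FSInode {
        Ext2FS& fs;
        std::mutex m_lock; // guards this inode's metadata, block list, content

        void update_block_list()
        {
            // The blocks already belong to this inode, so only the inode
            // lock is needed while (re)computing the block list:
            std::scoped_lock locker(m_lock);
            // ... compute and write out the block list ...
        }

        void free_block(unsigned block_index)
        {
            // The FS lock is only taken for the bitmap/counter update itself:
            std::scoped_lock locker(fs.m_lock);
            // ... clear the bit, bump the free-block counters ...
        }
    };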
Since these filesystems operate on an underlying file descriptor
and rely on its offset for correctness, let's use the FS lock to
serialize these operations.
This also means that FS subclasses can rely on block-level read/write
operations being atomic.
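
Roughly (simplified: a raw fd standing in for the underlying file
description):

    #include <cstddef>
    #include <cstdint>
    #include <mutex>
    #include <unistd.h>

    struct FileBackedFS {
        std::mutex m_lock; // the FS lock
        int m_fd { -1 };   // underlying file descriptor

        bool read_block(uint64_t index, void* buffer, size_t block_size)
        {
            // The seek and the read must not interleave with another
            // thread's seek, hence the lock around the pair:
            std::scoped_lock locker(m_lock);
            if (lseek(m_fd, static_cast<off_t>(index * block_size), SEEK_SET) < 0)
                return false;
            return read(m_fd, buffer, block_size) == static_cast<ssize_t>(block_size);
        }
    };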
This reverts commit 1e737a5c50.
The cached block list does not include meta-blocks, so we'd end up
leaking those. There's definitely a nice way to avoid work here, but it
turns out it wasn't quite this trivial. Reverting for now.
This patch combines the scan for an available inode with the update of
the corresponding bit in the inode bitmap into a single operation.
We also exit the scan immediately when we find an inode, instead of
continuing until we've scanned all the eligible groups(!)
Finally, we stop holding the filesystem lock throughout the entire
operation, and instead only take it while actually necessary
(during inode allocation, flush, and inode cache update.)
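
The combined scan-and-set, sketched over a plain bit vector:

    #include <cstddef>
    #include <optional>
    #include <vector>

    std::optional<size_t> allocate_first_free(std::vector<bool>& bitmap)
    {
        for (size_t i = 0; i < bitmap.size(); ++i) {
            if (!bitmap[i]) {
                bitmap[i] = true; // claim the bit in the same pass
                return i;         // and stop scanning immediately
            }
        }
        return std::nullopt; // nothing free in this group, try the next one
    }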
Improve a bunch of situations where we'd previously panic the kernel
on failure. We now propagate whatever error we had instead. Usually
that'll be EIO.
Both inode and block allocation operate on bitmap blocks and update
counters in the superblock and group descriptor.
Since we're here, also add some error propagation around this code.
(...and ASSERT_NOT_REACHED => VERIFY_NOT_REACHED)
Since all of these checks are done in release builds as well,
let's rename them to VERIFY to prevent confusion, as everyone is
used to assertions being compiled out in release.
We can introduce a new ASSERT macro that is specifically for debug
checks, but I'm doing this wholesale conversion first since we've
accumulated thousands of these already, and it's not immediately
obvious which ones are suitable for ASSERT.
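
For clarity, the intended semantics, sketched (not the kernel's actual
definitions):

    #include <cstdlib>

    // VERIFY is always compiled in, in debug and release builds alike:
    #define VERIFY(expr)     \
        do {                 \
            if (!(expr))     \
                abort();     \
        } while (0)

    // A future debug-only ASSERT could then compile to nothing in release:
    #ifdef DEBUG
    #    define ASSERT(expr) VERIFY(expr)
    #else
    #    define ASSERT(expr)
    #endif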
There's no real system here; I just added it to various functions
that I don't believe we ever want to call after initialization
has finished.
With these changes, we're able to unmap 60 KiB of kernel text
after init. :^)
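
The mechanism, sketched (the exact attribute spelling here is an
assumption):

    // Tag one-time init code into a dedicated .unmap_after_init section,
    // then unmap that whole range once init has finished:
    #define UNMAP_AFTER_INIT __attribute__((section(".unmap_after_init")))

    UNMAP_AFTER_INIT void initialize_the_thing()
    {
        // ... one-time setup; calling this after init would fault ...
    }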
This enabled trivial ASLR bypass for non-dumpable programs by simply
opening /proc/PID/vm before exec'ing.
We now hold the target process's ptrace lock across the refresh/write
operations, and deny access if the process is non-dumpable. The lock
is necessary to prevent a TOCTOU race on Process::is_dumpable() while
the target is exec'ing.
Fixes #5270.
In preparation for marking BlockingResult [[nodiscard]], there are a few
places that perform infinite waits, where we never observe the result of
the wait. Instead of suppressing the result, add an alternate function
which returns void when performing an infinite wait.
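
Sketched (simplified types):

    enum class [[nodiscard]] BlockingResult { Woke, Interrupted, TimedOut };

    BlockingResult block(int timeout_ms) { /* ... */ return BlockingResult::Woke; }

    // Hypothetical void-returning variant for infinite waits, so callers
    // don't have to (void)-cast a result they can't act on:
    void block_forever() { (void)block(-1); }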
Now that we no longer need to support the signal trampolines being
user-accessible inside the kernel memory range, we can get rid of the
"kernel" and "user-accessible" flags on Region and simply use the
address of the region to determine whether it's kernel or user.
This also tightens the page table mapping code, since it can now set
user-accessibility based solely on the virtual address of a page.
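
Sketched (0xc0000000 is an illustrative split, not necessarily the
kernel's actual base):

    #include <cstdint>

    constexpr uintptr_t kernel_base = 0xc0000000;

    // User-accessibility is now a pure function of the virtual address:
    constexpr bool is_user_address(uintptr_t vaddr)
    {
        return vaddr < kernel_base;
    }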
The way we read/write directories is very inefficient, and this doesn't
solve any of that. It does however reduce memory usage of directory
entry vectors by 25% which has nice immediate benefits.