Switch on the new credentials before loading the new executable into
memory. This ensures that attempts to ptrace() the program from an
unprivileged process will fail.
This covers one bug that was exploited in the 2020 HXP CTF:
https://hxp.io/blog/79/hxp-CTF-2020-wisdom2/
Thanks to yyyyyyy for finding the bug! :^)
We need to account for how many shared lock instances the current
thread owns, so that we can properly release such references when
yielding execution.
We also need to release the process lock when donating.
LexicalPath is a big and heavy class that's really meant as a helper
for extracting parts of a path, not for storage or passing around.
Instead, pass paths around as strings and use LexicalPath locally
as needed.
When a process crashes, we generate a coredump file and write it in
/tmp/coredumps/.
The coredump file is an ELF file of type ET_CORE.
It contains a segment for every userspace memory region of the process,
and an additional PT_NOTE segment that contains the registers state for
each thread, and a additional data about memory regions
(e.g their name).
This adds an allocate_tls syscall through which a userspace process
can request the allocation of a TLS region with a given size.
This will be used by the dynamic loader to allocate TLS for the main
executable & its libraries.
When the main executable needs an interpreter, we load the requested
interpreter program, and pass to it an open file decsriptor to the main
executable via the auxiliary vector.
Note that we do not allocate a TLS region for the interpreter.
This prevents zombies created by multi-threaded applications and brings
our model back to closer to what other OSs do.
This also means that SIGSTOP needs to halt all threads, and SIGCONT needs
to resume those threads.
This is necessary because if a process changes the state to Stopped
or resumes from that state, a wait entry is created in the parent
process. So, if a child process does this before disown is called,
we need to clear those entries to avoid leaking references/zombies
that won't be cleaned up until the former parent exits.
This also should solve an even more unlikely corner case where another
thread is waiting on a pid that is being disowned by another thread.
Fix some problems with join blocks where the joining thread block
condition was added twice, which lead to a crash when trying to
unblock that condition a second time.
Deferred block condition evaluation by File objects were also not
properly keeping the File object alive, which lead to some random
crashes and corruption problems.
Other problems were caused by the fact that the Queued state didn't
handle signals/interruptions consistently. To solve these issues we
remove this state entirely, along with Thread::wait_on and change
the WaitQueue into a BlockCondition instead.
Also, deliver signals even if there isn't going to be a context switch
to another thread.
Fixes#4336 and #4330
This allows us to use blocking timeouts with either monotonic or
real time for all blockers. Which means that clock_nanosleep()
now also supports CLOCK_REALTIME.
Also, switch alarm() to use CLOCK_REALTIME as per specification.
This should catch more malformed ELF files earlier than simply
checking the ELF header alone. Also change the API of
validate_program_headers to take the interpreter_path by pointer. This
makes it less awkward to call when we don't care about the interpreter,
and just want the validation.
New Thread objects should be adopted into a RefPtr upon creation.
If creating a thread failed (e.g. out of memory), releasing the RefPtr
will destruct the partially created object, but in the successful case
the thread will add an additional reference that it keeps until it
finishes execution. Adopting will drop it to 1 when returning from
create_thread, or 0 if the thread could not be fully constructed.
This makes the Scheduler a lot leaner by not having to evaluate
block conditions every time it is invoked. Instead evaluate them as
the states change, and unblock threads at that point.
This also implements some more waitid/waitpid/wait features and
behavior. For example, WUNTRACED and WNOWAIT are now supported. And
wait will now not return EINTR when SIGCHLD is delivered at the
same time.
This adds the ability to pass a pointer to kernel thread/process.
Also add the ability to use a closure as thread function, which
allows passing information to a kernel thread more easily.
Use the TimerQueue to expire blocking operations, which is one less thing
the Scheduler needs to check on every iteration.
Also, add a BlockTimeout class that will automatically handle relative or
absolute timeouts as well as overriding timeouts (e.g. socket timeouts)
more consistently.
Also, rework the TimerQueue class to be able to fire events from
any processor, which requires Timer to be RefCounted. Also allow
creating id-less timers for use by blocking operations.
We need to not only add a record for a reference, but we need
to copy the reference count on fork as well, because the code
in the fork assumes that it has the same amount of references,
still.
Also, once all references are dropped when a process is disowned,
delete the shared buffer.
Fixes#4076
This is a new "browse" permission that lets you open (and subsequently list
contents of) directories underneath the path, but not regular files or any other
types of files.
We were leaking a ref on the executed inode in successful calls to
sys$execve(). This meant that once a binary had ever been executed,
it was impossible to remove it from the file system.
The execve system call is particularly finicky since the function
does not return normally on success, so extra care must be taken to
ensure nothing is kept alive by stack variables.
There is a big NOTE comment about this, and yet the bug still got in.
It would be nice to enforce this, but I'm unsure how.
We need to create a reference for the new PID for each shared buffer
that the process had a reference to. If the process subsequently
get replaced through exec, those references will be dropped again.
But if exec for some reason fails then other code, such as global
destructors could still expect having access to them.
Fixes#4076
The time returned by sys$clock_gettime() was not aligned with the delay
calculations in sys$clock_nanosleep(). This patch fixes that by taking
the system's ticks_per_second value into account in both functions.
This patch also removes the need for Thread::sleep_until() and uses
Thread::sleep() for both absolute and relative sleeps.
This was causing the nesalizer emulator port to sleep for a negative
amount of time at the end of each frame, making it run way too fast.
This makes most operations thread safe, especially so that they
can safely be used in the Kernel. This includes obtaining a strong
reference from a weak reference, which now requires an explicit
call to WeakPtr::strong_ref(). Another major change is that
Weakable::make_weak_ref() may require the explicit target type.
Previously we used reinterpret_cast in WeakPtr, assuming that it
can be properly converted. But WeakPtr does not necessarily have
the knowledge to be able to do this. Instead, we now ask the class
itself to deliver a WeakPtr to the type that we want.
Also, WeakLink is no longer specific to a target type. The reason
for this is that we want to be able to safely convert e.g. WeakPtr<T>
to WeakPtr<U>, and before this we just reinterpret_cast the internal
WeakLink<T> to WeakLink<U>, which is a bold assumption that it would
actually produce the correct code. Instead, WeakLink now operates
on just a raw pointer and we only make those constructors/operators
available if we can verify that it can be safely cast.
In order to guarantee thread safety, we now use the least significant
bit in the pointer for locking purposes. This also means that only
properly aligned pointers can be used.
Most systems (Linux, OpenBSD) adjust 0.5 ms per second, or 0.5 us per
1 ms tick. That is, the clock is sped up or slowed down by at most
0.05%. This means adjusting the clock by 1 s takes 2000 s, and the
clock an be adjusted by at most 1.8 s per hour.
FreeBSD adjusts 5 ms per second if the remaining time adjustment is
>= 1 s (0.5%) , else it adjusts by 0.5 ms as well. This allows adjusting
by (almost) 18 s per hour.
Since Serenity OS can lose more than 22 s per hour (#3429), this
picks an adjustment rate up to 1% for now. This allows us to
adjust up to 36s per hour, which should be sufficient to adjust
the clock fast enough to keep up with how much time the clock
currently loses. Once we have a fancier NTP implementation that can
adjust tick rate in addition to offset, we can think about reducing
this.
adjtime is a bit old-school and most current POSIX-y OSs instead
implement adjtimex/ntp_adjtime, but a) we have to start somewhere
b) ntp_adjtime() is a fairly gnarly API. OpenBSD's adjfreq looks
like it might provide similar functionality with a nicer API. But
before worrying about all this, it's probably a good idea to get
to a place where the kernel APIs are (barely) good enough so that
we can write an ntp service, and once we have that we should write
a way to automatically evaluate how well it keeps the time adjusted,
and only then should we add improvements ot the adjustment mechanism.
This addresses the issue first enountered in #3644. If a path is
first unveiled with "c" permissions, we should NOT return ENOENT
if the node does not exist on the disk, as the program will most
likely be creating it at a later time.
g_scheduler_lock cannot safely be acquired after Thread::m_lock
because another processor may already hold g_scheduler_lock and wait
for the same Thread::m_lock.
When obeying FD_CLOEXEC, we don't need to explicitly call close() on
all the FileDescriptions. We can just clear them out from the process
fd table. ~FileDescription() will call close() anyway.
This fixes an issue where TelnetServer would shut down accepted sockets
when exec'ing a shell for them. Since the parent process still has the
socket open, we should not force-close it. Just let go.
Similar to Process, we need to make Thread refcounted. This will solve
problems that will appear once we schedule threads on more than one
processor. This allows us to hold onto threads without necessarily
holding the scheduler lock for the entire duration.
The thread joining logic hadn't been updated to account for the subtle
differences introduced by software context switching. This fixes several
race conditions related to thread destruction and joining, as well as
finalization which did not properly account for detached state and the
fact that threads can be joined after termination as long as they're not
detached.
Fixes#3596
When SO_TIMESTAMP is set as an option on a SOCK_DGRAM socket, then
recvmsg() will return a SCM_TIMESTAMP control message that
contains a struct timeval with the system time that was current
when the socket was received.
Since the receiving socket isn't yet known at packet receive time,
keep timestamps for all packets.
This is useful for keeping statistics about in-kernel queue latencies
in the future, and it can be used to implement SO_TIMESTAMP.
The implementation only supports a single iovec for now.
Some might say having more than one iovec is the main point of
recvmsg() and sendmsg(), but I'm interested in the control message
bits.
There are plenty of places in the kernel that aren't
checking if they actually got their allocation.
This fixes some of them, but definitely not all.
Fixes#3390Fixes#3391
Also, let's make find_one_free_page() return nullptr
if it doesn't get a free index. This stops the kernel
crashing when out of memory and allows memory purging
to take place again.
Fixes#3487