The Storage subsystem, like the Audio and HID subsystems, exposes Unix
device files (for example, in the /dev directory). To ensure consistency
across the repository, we should make the Storage subsystem reside in
the Kernel/Devices directory, like the other two subsystems mentioned.
That's what this class really is; in fact that's what the first line of
the comment says it is.
This commit does not rename the main files, since those will contain
other time-related classes in a little bit.
This is in preparation for adding MSI(x) support to the NVMe device.
NVMeInterruptQueue needs access to the PCI device to deal with MSI(x)
interrupts. It is OK to pass the NVMeController as a reference to the
NVMeQueue, as the NVMeController is the one that owns the NVMeQueue.
This is very similar to how AHCIController passes its reference to its
interrupt handler.
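A minimal sketch of the ownership relationship being relied on here
(member and constructor shapes are illustrative, not the exact tree
layout):

```cpp
// Sketch only: NVMeController owns its queues, so the controller outlives
// every NVMeQueue and the reference below cannot dangle.
class NVMeQueue {
public:
    explicit NVMeQueue(NVMeController& controller)
        : m_controller(controller)
    {
    }

private:
    NVMeController& m_controller; // later used to reach the PCI device for MSI(x) setup
};
```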
Add an explicit QueueType enum that can be used to create a poll queue
or an interrupt queue. This is better than passing an Optional<irq>.
This refactoring is in preparation for adding MSIx support to NVMe.
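A rough sketch of the intent, with illustrative names (the real factory
takes more parameters):

```cpp
enum class QueueType {
    Polled,
    IRQ,
};

// The caller now states the queue kind explicitly instead of encoding it
// in whether an Optional interrupt line happens to be present.
static ErrorOr<NonnullRefPtr<NVMeQueue>> try_create(QueueType type, u8 irq);
```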
Storage controllers are initialized during init and are never modified.
NonnullRefPtr can safely be used instead of NonnullLockRefPtr. This
also fixes one of the UB issues that occurred when using an NVMe device
because of NonnullLockRefPtr.
We can add proper locking when we need to modify the storage controllers
after init.
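For illustration, the change boils down to member declarations like
this (the member name is a placeholder):

```cpp
// Before: locking smart pointer, even though the vector is only written
// during init:
Vector<NonnullLockRefPtr<StorageController>> m_controllers;

// After: plain AK ref-counting is sufficient for read-only post-init use:
Vector<NonnullRefPtr<StorageController>> m_controllers;
```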
This class had slightly confusing semantics and the added weirdness
doesn't seem worth it just so we can say "." instead of "->" when
iterating over a vector of NNRPs.
This patch replaces NonnullRefPtrVector<T> with Vector<NNRP<T>>.
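An illustrative call-site difference (`Device` and `enable()` are
placeholder names):

```cpp
// Before, with NonnullRefPtrVector<Device>: elements decayed to Device&.
for (auto& device : m_devices)
    device.enable();

// After, with Vector<NonnullRefPtr<Device>>: elements are the smart pointers.
for (auto& device : m_devices)
    device->enable();
```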
There are now 2 separate classes for almost the same object type:
- EnumerableDeviceIdentifier, which is used in the enumeration code for
all PCI host controller classes. This is allowed to be moved and
copied, as it doesn't support ref-counting.
- DeviceIdentifier, which inherits from EnumerableDeviceIdentifier. This
class uses ref-counting, and is not allowed to be copied. It has a
spinlock member in its structure to allow safely executing complicated
IO sequences on a PCI device and its configuration space.
There's a static method that allows a quick conversion from
EnumerableDeviceIdentifier to DeviceIdentifier while creating a
NonnullRefPtr out of it.
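A rough sketch of the resulting shape (member lists abbreviated; the
spinlock type and accessor are illustrative):

```cpp
// Plain value type: movable and copyable, used during bus enumeration.
class EnumerableDeviceIdentifier {
public:
    // ... PCI address, hardware IDs, class codes ...
};

// Ref-counted, non-copyable variant handed out to PCI::Device users.
class DeviceIdentifier
    : public EnumerableDeviceIdentifier
    , public RefCounted<DeviceIdentifier> {
public:
    static NonnullRefPtr<DeviceIdentifier> from_enumerable_identifier(EnumerableDeviceIdentifier const&);

    Spinlock& operation_lock() { return m_operation_lock; }

private:
    // Serializes multi-step config-space IO sequences on this device.
    Spinlock m_operation_lock;
};
```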
The reason for doing this is for the sake of integrity and reliability
of the system in 2 places:
- Ensure that "complicated" tasks that rely on manipulating PCI device
registers are done in a safe manner. For example, determining a PCI
BAR space size requires multiple reads and writes to the same register,
and if another CPU tries to do something else with our selected
register, then the result will be a catastrophe.
- Allow the PCI API to have a unified form around a shared object which
actually holds much more data than the PCI::Address structure. This is
fundamental if we want to do certain types of optimizations, and be
able to support more features of the PCI bus in the foreseeable
future.
This patch already has several implications:
- All PCI::Device(s) hold a reference to a DeviceIdentifier structure
originally handed out by the PCI::Access singleton. This means that
all instances of DeviceIdentifier structures are located in one place,
and all references are pointing to that location. This ensures that
locking the operation spinlock will take effect in all the appropriate
places.
- We no longer support adding PCI host controllers and then immediately
enumerating them with a lambda function. It was found that this method
is extremely broken and far too complicated to work reliably with the
new paradigm being introduced in this patch. This
means that for Volume Management Devices (Intel VMD devices), we
simply first enumerate the PCI bus for such devices in the storage
code, and if we find a device, we attach it in the PCI::Access method
which will scan for devices behind that bridge and will add new
DeviceIdentifier(s) objects to its internal Vector. Afterwards, we
just continue as usual with scanning for actual storage controllers,
so we will find the corresponding NVMe controllers if there were any
behind that VMD bridge.
A virtual method named device_name() was added to
Kernel::PCI to support logging the PCI::Device name
and address using dmesgln_pci. Previously, PCI::Device
did not store the device name.
All devices inheriting from PCI::Device now use dmesgln_pci where
they previously used dmesgln.
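Illustrative usage (the format of the final log line is an assumption):

```cpp
// Each PCI::Device subclass reports its own name...
virtual StringView device_name() const override { return "NVMeController"sv; }

// ...so call sites can switch from dmesgln to dmesgln_pci:
dmesgln_pci(*this, "Found {} namespace(s)", namespace_count);
// e.g. "NVMeController: 0000:00:05.0: Found 1 namespace(s)"
```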
Many code patterns and hardware procedures rely on reliable delays with
microsecond granularity. These are valid use cases, but they should not
rely on x86-specific code, so we now determine at compile time the
proper platform-specific code to use for invoking such delays.
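A sketch of the compile-time dispatch this enables (the non-x86 body is
a placeholder):

```cpp
#include <AK/Platform.h>

// Pick the platform-specific delay implementation at compile time.
inline void microseconds_delay(size_t microseconds)
{
#if ARCH(X86_64)
    IO::delay(microseconds); // existing port-IO based busy-wait on x86
#elif ARCH(AARCH64)
    // Placeholder: e.g. spin on the architectural counter-timer.
    TODO_AARCH64();
#endif
}
```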
Before this patch, we supported two methods to address a boot device:
1. Specifying root=/dev/hdXY, where X is a letter from a to z that
corresponds to a boot device, and Y is a number from 1 to 16 indicating
the partition number; Y can be omitted to instruct the kernel to use a
raw device rather than a partition on a raw device.
2. Specifying root=PARTUUID: with the GUID string of a GPT partition.
In the case of an existing storage device with GPT partitions, this is
most likely the safest option to ensure booting from persistent
storage.
While option 2 is more advanced and reliable, the first option has 2
caveats:
1. The string prefix "/dev/hd" doesn't mean anything besides being a
convention on Linux installations that was adopted in Serenity. In
Serenity we don't mount DevTmpFS before we mount the boot device on /,
so the kernel doesn't really access /dev anyway; this convention is only
a misleading relic that can easily make the user assume we access /dev
early on boot.
2. Although this convention resembles the simple Linux convention, it
is quite limited in specifying a correct boot device across hardware
setup changes, so option 2 was recommended to ensure the system is
always bootable.
With these caveats in mind, this commit tries to fix the problem by
adding more addressing options, as well as removing the first addressing
option mentioned above.
To sum it up, there are 4 addressing options:
1. Hardware relative address - Each instance of StorageController is
assigned an index number relative to the type of hardware it handles,
which makes it possible to address storage devices with a prefix of the
command set ("ata" for ATA, "nvme" for NVMe, "ramdisk" for plain
memory), then a number for the parent controller's relative hardware
index, another number for the LUN target_id, and a third number for the
LUN disk_id.
2. LUN address - Similar to the previous option, but instead we rely on
the parent controller's absolute index for the first number.
3. Block device major and minor numbers - by specifying the major and
minor numbers, the kernel can simply try to get the corresponding block
device and use it as the boot device.
4. GUID string - in the same fashion as before, the user uses the
"PARTUUID:" string prefix and adds the GUID of the GPT partition.
For the new address modes 1 and 2, the user can choose to also specify
a partition on the selected boot device. To do that, the user appends a
semicolon followed by the string "partX", where X is the partition
number. We start counting from 0, so the first partition number is 0 and
not 1 in the kernel boot argument, as illustrated in the examples below.
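Hedged examples of what such boot arguments could look like (exact
prefixes and separators are per the implementation; the GUID is a
placeholder):

```
root=nvme0:0:0          # mode 1: first NVMe controller, target 0, disk 0
root=lun0:0:0;part0     # mode 2, plus the first partition of that device
root=block3:0           # mode 3: block device with major 3, minor 0
root=PARTUUID:00000000-0000-0000-0000-000000000000   # mode 4
```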
I believe this to be safe, as the main thing that LockRefPtr provides
over RefPtr is safe copying from a shared LockRefPtr instance. I've
inspected the uses of RefPtr<PhysicalPage> and it seems they're all
guarded by external locking. Some of it is less obvious, but this is
an area where we're making continuous headway.
Until now, our kernel has reimplemented a number of AK classes to
provide automatic internal locking:
- RefPtr
- NonnullRefPtr
- WeakPtr
- Weakable
This patch renames the Kernel classes so that they can coexist with
the original AK classes:
- RefPtr => LockRefPtr
- NonnullRefPtr => NonnullLockRefPtr
- WeakPtr => LockWeakPtr
- Weakable => LockWeakable
The goal here is to eventually get rid of the Lock* classes in favor of
using external locking.
LUN addressing is essentially how people used to address SCSI devices
back in the day, when these devices were more widely used. However, SCSI
was adopted as an abstraction layer by many Unix and Unix-like systems,
so it is still common to see LUN addresses in use. In Serenity, we don't
really provide such an abstraction layer, and therefore until now we
didn't use LUNs either. However, this changes now, as we want to let
users address their devices under SysFS easily. LUNs make sense in that
regard, because they can be easily adapted to different interfaces
besides SCSI.
For example, for a legacy ATA hard drive connected to the first IDE
controller enumerated on the PCI bus, as the slave device on the primary
channel, the LUN address would be 0:0:1.
To make this happen, we add a unique ID number to each
StorageController, which increments by 1 for each new instance of
StorageController. Then, we adapt the ATA and NVMe devices to use these
numbers and generate a LUN at construction time.
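A minimal sketch of the ID assignment (names illustrative):

```cpp
// Monotonic controller index shared by all StorageController instances.
static Atomic<u32> s_storage_controller_id { 0 };

StorageController::StorageController()
    : m_controller_id(s_storage_controller_id.fetch_add(1))
{
}

// A device constructed under this controller can then form its LUN as
// controller_id:target_id:disk_id, e.g. 0:0:1 in the IDE example above.
```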
Each of these strings would previously rely on StringView's char const*
constructor overload, which would call __builtin_strlen on the string.
Since we now have operator ""sv, we can replace these with much simpler
versions. This opens the door to being able to remove
StringView(char const*).
No functional changes.
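A representative before/after, for illustration:

```cpp
// Before: implicitly goes through StringView(char const*), which calls
// __builtin_strlen on the literal:
StringView name = "nvme";

// After: operator""sv bakes the length in at compile time:
auto name = "nvme"sv;
```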
Function-local `static constexpr` variables can be `constexpr`. This
can reduce memory consumption, binary size, and offer additional
compiler optimizations.
These changes result in a stripped x86_64 kernel binary size reduction
of 592 bytes.
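The pattern, for illustration (the variable is a made-up example):

```cpp
// Before: static storage duration can force the constant to be emitted
// into the binary:
static constexpr StringView signature = "RSD PTR "sv;

// After: a plain constexpr local can be folded away entirely:
constexpr StringView signature = "RSD PTR "sv;
```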
Add polling support to NVMe so that it does not use interrupts to
complete an IO but instead actively polls for completion. This is
probably not very efficient in terms of CPU usage, but not using
interrupts to complete an IO is beneficial at the moment, as there is no
MSI(X) support, and it can reduce the latency of an IO on a very fast
NVMe device.
The NVMeQueue class has been made the base class for NVMeInterruptQueue
and NVMePollQueue. The factory function `NVMeQueue::try_create` will
return the appropriate queue to the controller based on the polling
boot parameter.
The polling mode can be enabled by adding an extra boot parameter:
`nvme_poll`.
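A sketch of the factory dispatch (the real signature carries more
parameters; `polling_enabled` stands in for the boot parameter check):

```cpp
// NVMeQueue is the base class; the `nvme_poll` boot parameter picks the
// concrete subclass.
ErrorOr<NonnullRefPtr<NVMeQueue>> NVMeQueue::try_create(bool polling_enabled)
{
    if (polling_enabled)
        return TRY(NVMePollQueue::try_create());
    return TRY(NVMeInterruptQueue::try_create());
}
```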
Apologies for the enormous commit, but I don't see a way to split this
up nicely. In the vast majority of cases it's a simple change. A few
extra places can use TRY instead of manual error checking though. :^)
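A representative instance of the mechanical change (`do_thing()` is a
placeholder returning an ErrorOr):

```cpp
// Before: manual error checking.
auto result = do_thing();
if (result.is_error())
    return result.release_error();
auto value = result.release_value();

// After: TRY propagates the error and unwraps the value in one step.
auto value = TRY(do_thing());
```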
Only a generic struct definition was present for NVMeSubmission. To
improve type safety and clarity, add a union of command-specific structs
that are applicable to the command being submitted.
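Abbreviated sketch of the idea (struct names illustrative; field layouts
trimmed):

```cpp
struct NVMeSubmission {
    u8 op;
    u8 flags;
    u16 cmdid;
    union {
        NVMeGenericCmd generic;   // fallback for commands without a typed struct
        NVMeRWCmd rw;             // typed fields for read/write commands
        NVMeIdentifyCmd identify;
        NVMeCreateCQCmd create_cq;
        NVMeCreateSQCmd create_sq;
    };
};
```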
CAP.TO is 0-based. Even though I don't see that mentioned explicitly in
the spec, all major OSes such as Linux and FreeBSD add 1 to CAP.TO while
calculating the timeout.
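In code terms (CAP.TO is specified in 500 ms units; the variable names
are illustrative):

```cpp
// CAP.TO is 0-based, so a raw value of 0 still means one 500 ms unit.
u32 ready_timeout_ms = (cap_timeout_raw + 1) * 500;
```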
IO::delay was added as a lazy alternative to looping with a timeout
error if the condition was not satisfied. Now that we have the
wait_for_ready function, remove the delay in the reset and start
controller functions.
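A sketch of what such a helper looks like (CSTS.RDY is per the NVMe
spec; the rest is illustrative):

```cpp
ErrorOr<void> wait_for_ready(bool expected_ready)
{
    u32 remaining_ms = ready_timeout_ms(); // derived from CAP.TO, see above
    while (remaining_ms--) {
        bool ready = (m_controller_regs->csts & CSTS_RDY) != 0;
        if (ready == expected_ready)
            return {};
        microseconds_delay(1000); // poll every 1 ms instead of one blind delay
    }
    return Error::from_errno(EBUSY);
}
```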
This mostly just moved the problem, as a lot of the callers are not
capable of propagating the errors themselves, but it's a step in the
right direction.
We need to use the volatile keyword when mapping the device registers,
or the compiler may optimize accesses, which leads to this QEMU error:
pci_nvme_ub_mmiord_toosmall in nvme_mmio_read: MMIO read smaller than
32-bits, offset=0x0
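Sketch of the fix (type and variable names illustrative):

```cpp
// Mapping the register window as volatile forces the compiler to emit one
// full-width MMIO access per read/write instead of merging or narrowing them:
auto* regs = reinterpret_cast<ControllerRegister volatile*>(mapped_base_address);
u32 csts = regs->csts; // guaranteed to be a single 32-bit load
```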
Add basic NVMe driver support to Serenity,
based on the NVMe spec 1.4.
The driver can support multiple NVMe drives (subsystems).
But within an NVMe drive, the driver supports one controller
with multiple namespaces.
Each core gets a separate NVMe queue.
As the system lacks MSI support, pin-based interrupts are
used for IO.
Tested the NVMe support by replacing the IDE driver
with the NVMe driver :^)