This commit addresses the following shortcomings of our current, simple
and elegant memset function:
- REP STOSB/STOSQ has considerable startup overhead, it's impractical to
use for smaller sizes.
- Up until very recently, AMD CPUs didn't have support for "Enhanced REP
MOVSB/STOSB", so it performed pretty poorly on them.
With this commit applied, I could measure a ~5% decrease in `test-js`'s
runtime when I used qemu's TCG backend. The implementation is based on
the following article from Microsoft:
https://msrc-blog.microsoft.com/2021/01/11/building-faster-amd64-memset-routines
Two versions of the routine are implemented: one that uses the ERMS
extension mentioned above, and one that performs plain SSE stores. The
version appropriate for the CPU is selected at load time using an IFUNC.