The reason the title surprised you is the same as the reason I can’t get proper results on Google: You’re thinking about a cast, and no matter what search terms I use I keep getting results talking about casts.
I enjoy super-low-level programming, and I have both a personal interest in and a need for creating a manual cast from a double to a floating-point type with customizable properties (number of exponent bits, number of mantissa bits, whether there is a sign bit, whether there is an implied mantissa bit, etc.).
I’ve already written something that could be called a prototype, and it correctly handles all “normal” conversions from a 64-bit double into any other floating-point format, with however many bits you want.
So as an example, let’s say I have PI as a double constant “3.1415926535897931”. I manually cast it to a 32-bit float by specifying SIGNBIT=TRUE, EXP=8, MANT=24, IMPLICITMAN=TRUE and I get this result:
Actual [1,8,24,TRUE]FLOAT Result = 3.1415927410125732
Sign = 0
Exp = 0x0000000000000080
Man = 0x0000000000490fdb
So the manual conversion gives exactly the same result as a regular cast from double to float, and it can either round or truncate (here, it rounded up). Casting manually seems useless, right? But I want to support arbitrary floating-point types that do not exist in C. How about an example with a 16-bit float?
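A quick way to cross-check results like the one above is to decompose the compiler’s own cast and compare fields. A minimal sketch (the helper name is mine, not part of my prototype):

```c
#include <stdint.h>
#include <string.h>

/* Illustrative helper: pull the raw IEEE-754 fields out of the 32-bit
   result of a regular cast, to check a manual converter against. */
static void decompose_f32(float f, uint32_t *sign, uint32_t *exp, uint32_t *man) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);     /* type-pun without undefined behavior */
    *sign = bits >> 31;                 /* 1 sign bit */
    *exp  = (bits >> 23) & 0xFF;        /* 8 biased exponent bits */
    *man  = bits & 0x7FFFFF;            /* 23 stored mantissa bits */
}
```

For `(float)3.1415926535897931` this yields Sign = 0, Exp = 0x80, Man = 0x490FDB, matching the manual [1,8,24,TRUE] result above.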
Actual [1,5,11,TRUE]FLOAT Result = 3.1406250000000000
Sign = 0
Exp = 0x0000000000000010
Man = 0x0000000000000248
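The normal path sketched above boils down to something like the following. This is a rough sketch, not my actual code: the names are illustrative, manBits counts the implied leading 1 as in the [1,5,11,TRUE] notation, rounding is round-to-nearest with ties away from zero, and subnormals, infinities, NaN, and overflow are all deliberately ignored.

```c
#include <stdint.h>
#include <string.h>

/* Rough sketch of the normal-to-normal conversion path only. */
static void to_custom(double d, int expBits, int manBits,
                      uint64_t *signOut, uint64_t *expOut, uint64_t *manOut) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);

    uint64_t sign = bits >> 63;
    int64_t  e    = (int64_t)((bits >> 52) & 0x7FF) - 1023;     /* unbias */
    uint64_t frac = (bits & 0xFFFFFFFFFFFFFULL) | (1ULL << 52); /* implied 1 */

    int shift = 53 - manBits;                /* source bits to discard */
    if (shift > 0) {
        frac += 1ULL << (shift - 1);         /* add half an ULP to round */
        if (frac >> 53) { frac >>= 1; ++e; } /* rounding carried past the top */
        frac >>= shift;
    }

    int64_t bias = (1LL << (expBits - 1)) - 1;
    *signOut = sign;
    *expOut  = (uint64_t)(e + bias);         /* rebias for the target */
    *manOut  = frac & ((1ULL << (manBits - 1)) - 1); /* drop implied bit */
}
```

Running this on pi with (expBits = 5, manBits = 11) reproduces the Exp = 0x10, Man = 0x248 fields above, and (8, 24) reproduces the 32-bit result.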
I also want to study arbitrary floating-point formats. For example, the smallest non-zero value a 32-bit float can represent is 1.4012984643248171e-45 and the max is 3.4028234663852886e+38.
How about for a 16-bit float?
[1,5,11,TRUE]FLOAT
Smallest non-0: 5.9604644775390625e-08
Max: 65504.000000000000.
How about this random format?
[1,7,31,TRUE]FLOAT
Smallest non-0: 2.0194839173657902e-28
Max: 1.8446744065119617e+19
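The range figures above follow from two closed-form expressions. With E exponent bits, M mantissa bits (counting the implied 1), and bias = 2^(E-1) - 1, the smallest non-zero (denormal) value is 2^(1 - bias - (M - 1)) and the largest finite value is (2 - 2^-(M-1)) * 2^bias. A sketch (function names are mine):

```c
#include <math.h>

/* Smallest non-zero (denormal) value of a [1,E,M,TRUE] format. */
static double smallest_nonzero(int E, int M) {
    int bias = (1 << (E - 1)) - 1;
    return ldexp(1.0, 1 - bias - (M - 1));   /* 2^(1 - bias - (M-1)) */
}

/* Largest finite value of a [1,E,M,TRUE] format. */
static double largest_finite(int E, int M) {
    int bias = (1 << (E - 1)) - 1;
    return (2.0 - ldexp(1.0, -(M - 1))) * ldexp(1.0, bias);
}
```

Plugging in (5, 11) gives 5.9604644775390625e-08 and 65504, and (7, 31) gives the [1,7,31,TRUE] figures above.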
So a lot of my converter is already working. If you are still wondering what the point is: a graphics programmer who has to work with 16-bit and 32-bit shader precision, F16 and F32 textures, D24 depth textures, and R11G11B10 float textures can really find something like this useful. There are many floating-point formats out there, but hardly any tools for investigating them.
Now to the Question
There are special cases for denormalized numbers that I am not currently handling, and I am temporarily making assumptions about sign bits, etc. Does anyone have a good link to a breakdown of manually casting one floating-point type to another? The IEEE 754 specification does not provide example implementations, nor does it really dig into the details. The write-ups I do find mostly cover what I have already implemented: a normalized number converted to another normalized number. I cannot find guides on the best way to handle the cases where either the source or the destination is denormalized.
I can implement it “my way,” but I definitely want to look at what has already been done, or at the very least go over the specifications, to ensure my approach fully complies.
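To make the gap concrete, the destination-side case I mean arises whenever the value falls below the target format’s smallest normal value, 2^(1 - bias). A small check for that condition (a sketch under my field conventions above; the helper name is hypothetical):

```c
#include <math.h>
#include <stdbool.h>

/* True when d would land in the subnormal range of a target format
   with expBits exponent bits and bias = 2^(expBits-1) - 1. */
static bool dest_is_subnormal(double d, int expBits) {
    int bias = (1 << (expBits - 1)) - 1;
    double smallestNormal = ldexp(1.0, 1 - bias);  /* 2^(1 - bias) */
    return d != 0.0 && fabs(d) < smallestNormal;
}
```

For example, 1e-40 is normal as a double but subnormal in an 8-exponent-bit (32-bit float) target; it is those conversions, and the mirror case of a subnormal source, that I am asking about.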
L. Spiro