dotnet framework : C# (float) cast is costly for speed if not used appropriately


Arnie
12/31/2007 11:46:41 AM
Folks,

We ran into a fairly significant performance penalty when casting to float.
We've identified a code workaround that we wanted to pass along, but we were also
wondering whether others have run into this and whether there is a better
solution.

-jeff


.....


I'd like to share our findings regarding the C# (float) cast.

While converting doubles to floats, we ran into several slowdowns.
We realized that the C# (float) cast can be costly if not used appropriately.

------------------------------------------------------------
Slow cases
------------------------------------------------------------
(A)
private void someMath(float[] input, float[] output)
{
    int length = input.Length;
    for (int i = 0; i < length; i++)
    {
        output[i] = (float)Math.Log10(input[i]); // <--- inline (float) cast is slow!
    }
}

(B)
private void Copy(double[] input, float[] output)
{
    int length = input.Length;
    for (int i = 0; i < length; i++)
    {
        output[i] = (float)input[i]; // <--- inline (float) cast is slow!
    }
}

In these examples, the "inline" (float) casts are executed on the same line as
another operation, such as Math.Log10() or a simple fetch from the input array.

These are slow, even in a Release build.
(A): takes 3 to 6% longer than the double[] case. ;-)
(B): takes twice(!) as long as the double[] case. ;-)

As I understand it, and according to articles on the Net, the slowdown comes from
writing the intermediate value back to memory, as shown below. The extra trips are costly.


(A)  CPU/FPU  +--> fetch --> Math.Log10 --+     +--> (float) --+
              |                           |     |              |
              |                           |     |              |
              |                           V     |              V
     memory   input                written back to heap      output
                                   Extra memory access!

(B)  CPU/FPU  +--> fetch --+          +--> (float) --+
              |            |          |              |
              |            |          |              |
              |            V          |              V
     memory   input    written back to heap        output
                       Extra memory access!

------------------------------------------------------------
Fast cases
------------------------------------------------------------

To avoid the extra memory access, we can use a temporary variable to store
the intermediate data.
The temporary variable is kept in a CPU register, which keeps the loop fast.

(C)
private void someMath(float[] input, float[] output)
{
    int length = input.Length;
    for (int i = 0; i < length; i++)
    {
        double tmp = Math.Log10(input[i]); // <-- store in a temporary variable in CPU register
        output[i] = (float)tmp;            // <-- then (float) cast. Fast!
    }
}

(D)
private void Copy(double[] input, float[] output)
{
    int length = input.Length;
    for (int i = 0; i < length; i++)
    {
        double tmp = input[i];  // <-- store in a temporary variable in CPU register
        output[i] = (float)tmp; // <-- then (float) cast. Fast!
    }
}

In these improved versions, the intermediate data is not written back to
memory.
The improved versions are actually slightly faster than the double[] case.
(C): 1% faster than the double[] case. :)
(D): 3% faster than the double[] case. :)

(C)  CPU/FPU  +--> fetch --> Math.Log10 --> stays in -----> (float) --+
              |                           CPU register                |
              |                              Fast!                    |
              |                                                       V
     memory   input                                                 output

(D)  CPU/FPU  +--> fetch --> stays in -----> (float) --+
              |            CPU register                |
              |               Fast!                    |
              |                                        V
     memory   input                                  output


OK, this is what we found from benchmarking and googling.

The same applies to ArraySegment<float> arrays as well,
because the issue relates to the float values in the array, not the
array itself.

You could say this is a .NET compiler optimization issue.
If you know of optimization flags or anything else that can fix this on
the compiler side, please let us know.
That would be a great help!
(By the way, a plain Release build does not help.)

Otherwise, we will need to optimize our code by hand using the temporary
variable technique shown in the examples.
And we have many instances of this kind of "inline" cast in our code.

Jon Skeet [C# MVP]
1/1/2008 5:21:59 PM
To be honest, it doesn't really sound that significant to me. Read
on...

<snip>

<snip>

I see no reason to believe that there's an extra value written to the
*heap* (rather than the stack), and no reason why the JIT shouldn't use
a register for the intermediate value without an explicit local
variable.

<snip>

I have included a short but complete program below which uses an array
of a million elements and iterates each method a thousand times. Here
are the results on my laptop:

Log10Fast: 64489ms
Log10Slow: 70420ms
CopyFast: 3841ms
CopySlow: 4070ms

So your optimisation improves things by about 10% for the Log10 case
and about 5% for the Copy case.
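
The program itself is not reproduced in this archive. As a rough illustration only, a benchmark of this shape could look like the sketch below. The four method names are taken from the result labels above; the class name CastBenchmark, the Time helper, the fixed random seed, and the reduced iteration count are all assumptions, not the original code.

```csharp
using System;
using System.Diagnostics;

public static class CastBenchmark
{
    // "Slow" variants: the (float) cast is applied inline.
    public static void Log10Slow(float[] input, float[] output)
    {
        int length = input.Length;
        for (int i = 0; i < length; i++)
        {
            output[i] = (float)Math.Log10(input[i]);
        }
    }

    public static void CopySlow(double[] input, float[] output)
    {
        int length = input.Length;
        for (int i = 0; i < length; i++)
        {
            output[i] = (float)input[i];
        }
    }

    // "Fast" variants: the intermediate double goes through a local first.
    public static void Log10Fast(float[] input, float[] output)
    {
        int length = input.Length;
        for (int i = 0; i < length; i++)
        {
            double tmp = Math.Log10(input[i]);
            output[i] = (float)tmp;
        }
    }

    public static void CopyFast(double[] input, float[] output)
    {
        int length = input.Length;
        for (int i = 0; i < length; i++)
        {
            double tmp = input[i];
            output[i] = (float)tmp;
        }
    }

    // Hypothetical helper: one untimed warm-up call (so JIT compilation is
    // excluded), then `iterations` timed calls. Returns elapsed milliseconds.
    public static long Time(Action action, int iterations)
    {
        action();
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }

    public static void Main()
    {
        const int size = 1000000;   // a million elements, as described above
        const int iterations = 100; // assumption: reduced from the thousand used above

        double[] doubles = new double[size];
        float[] floats = new float[size];
        float[] output = new float[size];
        Random random = new Random(12345); // fixed seed for repeatable input
        for (int i = 0; i < size; i++)
        {
            doubles[i] = random.NextDouble() + 1.0; // positive, so Log10 is defined
            floats[i] = (float)doubles[i];
        }

        Console.WriteLine("Log10Fast: {0}ms", Time(() => Log10Fast(floats, output), iterations));
        Console.WriteLine("Log10Slow: {0}ms", Time(() => Log10Slow(floats, output), iterations));
        Console.WriteLine("CopyFast: {0}ms", Time(() => CopyFast(doubles, output), iterations));
        Console.WriteLine("CopySlow: {0}ms", Time(() => CopySlow(doubles, output), iterations));
    }
}
```

For timings like these to mean anything, the sketch should be built in Release mode and run outside the debugger, since an attached debugger typically suppresses JIT optimisations.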

And have you any reason to believe that's *actually* the bottleneck in
your code? Do you regularly convert a billion floats and care about
200ms of performance loss?

I don't understand why the results are as they are (it would be worth
looking at the JITted, optimised code to find out) - but even so, I
certainly wouldn't start micro-optimising all over the place. Find out
where the *actual* bottleneck in your code is, and consider reducing
readability/simplicity for the sake of performance just in the most
significant parts. Don't start doing it all over the place, which
sounds like the course of action you're considering at the moment.

--
Jon Skeet - <skeet@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet

Arnie
1/3/2008 10:22:03 AM
Thanks for the feedback Jon.

This is a mature system where they are "wringing" out the last bit of
performance.

It is a scientific test instrument (spectrum analyzer), so they are acquiring
and converting extremely large chunks of data (waveforms). Some runs can
acquire as much as 500MB of data at a time.

So they have "progressed" to the point where they are looking at the right
optimization spots in their code. Casting from double[] is indeed 2x as slow
without the optimization, which is quite different than the 5-10% case you
demonstrated. Again, the only thing they changed was assigning to a local
variable, hence their curiosity about what the C#/JIT compiler is doing.

I think our premise is that, given this single change, there should be no
performance difference if the compiler were taking advantage of every
reasonable performance optimization.

Time to look at the IL and see what is going on.

-jeff


Jon Skeet [C# MVP]
1/3/2008 5:43:28 PM
Hmm... it still sounds dubious to me. I doubt you'll really see
performance benefits which are significant in the context of the whole
app. Mind you, it sounds like you're not seeing the same behaviour as
me to start with, so hey...

500MB isn't that much though, in the context of the tests I was doing -
it was using a billion points of data, which would be 8GB. The copy was
then only taking 4 seconds, and making the change only shaved off a
very small amount.
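
To put the two data sizes side by side, the copy timings can be scaled down to a 500MB run. This is only back-of-the-envelope arithmetic: it assumes the 500MB consists entirely of double values (8 bytes each), that 1MB is 2^20 bytes, and that copy time scales linearly with element count. The 3841ms and 4070ms inputs are the CopyFast and CopySlow timings posted earlier in the thread; the class name ScaleEstimate and its members are made up for illustration.

```csharp
using System;

public static class ScaleEstimate
{
    // The benchmark processed a million elements a thousand times over.
    public const long BenchmarkElements = 1000000L * 1000L;

    // Number of double values that fit in the given number of megabytes
    // (assumption: 1MB = 1024 * 1024 bytes).
    public static long ElementsIn(long megabytes)
    {
        return megabytes * 1024L * 1024L / sizeof(double);
    }

    // Scales a benchmark timing linearly to a different element count.
    public static double ScaleMs(double benchmarkMs, long elements)
    {
        return benchmarkMs * elements / BenchmarkElements;
    }

    public static void Main()
    {
        long elements = ElementsIn(500);
        double fastMs = ScaleMs(3841, elements); // CopyFast timing from the thread
        double slowMs = ScaleMs(4070, elements); // CopySlow timing from the thread

        Console.WriteLine("Benchmark data volume: {0} bytes", BenchmarkElements * sizeof(double));
        Console.WriteLine("Doubles in 500MB: {0}", elements);
        Console.WriteLine("Estimated CopyFast: {0:F1}ms", fastMs);
        Console.WriteLine("Estimated CopySlow: {0:F1}ms", slowMs);
        Console.WriteLine("Estimated saving: {0:F1}ms", slowMs - fastMs);
    }
}
```

Under those assumptions a 500MB run holds about 65.5 million values, and the difference between the two copy variants comes to roughly 15ms.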

So can you give a short but complete program which *does* demonstrate
the 2x difference?

Just as a thought, which CLR are you using? I'm on the 2.0, on x86. If
you're using 1.0, 1.1, or 2.0 on x64, that could account for some
differences.

The IL doesn't show much. It's the optimised assembly you need to be
looking at, really. cordbg is your friend - but don't forget to tell it
to perform JIT optimisations. SOS may help too. I don't envy you...

--
Jon Skeet - <skeet@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet

Laura T.
1/4/2008 11:40:06 AM
I did some testing with these "slow" and "fast" log10 versions.

The JITted code is almost identical. The only difference is that the "slow"
log10 version has these two FPU instructions that the "fast" version does not have:
fstp dword ptr [esp]
fld dword ptr [esp]

In my tests, the fast version did not always win.
I ran the tests with 10-million-element arrays of random values, running both versions
1000 times each, and repeated the whole run 20 times.
The fast version won only 15 of those 20 times.

LT

-- for curiosity --

The "slow" (A) version:

for (int i = 0; i < length; i++)
00000011  xor   esi,esi
00000013  test  ebp,ebp
00000015  jle   0000003F
{
    output[i] = (float)Math.Log10(input[i]); // <--- inline (float) cast is slow!
00000017  cmp   esi,dword ptr [edi+4]
0000001a  jae   00000045
0000001c  fld   dword ptr [edi+esi*4+8]
00000020  sub   esp,8
00000023  fstp  qword ptr [esp]
00000026  call  79E1756B

0000002b  fstp  dword ptr [esp]    <-- difference 1
0000002e  fld   dword ptr [esp]    <-- difference 2
00000031  cmp   esi,dword ptr [ebx+4]
00000034  jae   00000045
00000036  fstp  dword ptr [ebx+esi*4+8]
for (int i = 0; i < length; i++)
0000003a  inc   esi
0000003b  cmp   esi,ebp
0000003d  jl    00000017
0000003f  pop   ecx

The "fast" (C) version:

for (int i = 0; i < length; i++)
00000010  xor   esi,esi
00000012  test  ebp,ebp
00000014  jle   00000038
{
    double tmp = Math.Log10(input[i]); // <-- store in a temporary variable in CPU register
00000016  cmp   esi,dword ptr [edi+4]
00000019  jae   0000003D
0000001b  fld   dword ptr [edi+esi*4+8]
0000001f  sub   esp,8
00000022  fstp  qword ptr [esp]
00000025  call  79EC7513
    output[i] = (float)tmp; // <-- then (float) cast. Fast!

<missing the two FPU instructions here>

0000002a  cmp   esi,dword ptr [ebx+4]
0000002d  jae   0000003D
0000002f  fstp  dword ptr [ebx+esi*4+8]
for (int i = 0; i < length; i++)
00000033  inc   esi
00000034  cmp   esi,ebp
00000036  jl    00000016
00000038  pop   ebp

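
For context on the two extra instructions: fstp dword ptr [esp] stores the x87 register as a 32-bit value and fld dword ptr [esp] loads it back, which rounds the result to single precision before the final store into the array. That final store is itself a 32-bit fstp, so both versions should end up writing the same values. A minimal sketch that checks this from C# follows; the class name CastCheck and its method names are made up for illustration and do not come from the thread.

```csharp
using System;

public static class CastCheck
{
    // Inline cast, in the style of version (A).
    public static float InlineCast(float x)
    {
        return (float)Math.Log10(x);
    }

    // Cast through a temporary double, in the style of version (C).
    public static float TempCast(float x)
    {
        double tmp = Math.Log10(x);
        return (float)tmp;
    }

    // True when narrowing the value to float and widening it again changes it,
    // i.e. when the (float) cast actually discards precision.
    public static bool CastLosesPrecision(double value)
    {
        return (double)(float)value != value;
    }

    // True when both styles give bit-identical floats for every input.
    public static bool SameResults(float[] inputs)
    {
        foreach (float x in inputs)
        {
            int inlineBits = BitConverter.ToInt32(BitConverter.GetBytes(InlineCast(x)), 0);
            int tempBits = BitConverter.ToInt32(BitConverter.GetBytes(TempCast(x)), 0);
            if (inlineBits != tempBits)
            {
                return false;
            }
        }
        return true;
    }

    public static void Main()
    {
        float[] samples = { 0.001f, 1f, 2f, 3f, 10f, 12345.678f };
        Console.WriteLine("0.1 loses precision as float: {0}", CastLosesPrecision(0.1));
        Console.WriteLine("0.5 loses precision as float: {0}", CastLosesPrecision(0.5));
        Console.WriteLine("Inline and temp casts agree: {0}", SameResults(samples));
    }
}
```

If the two styles agree bit for bit, as expected, the temporary variable affects only how the JIT schedules the rounding, not the values written to the output array.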


"Jon Skeet [C# MVP]" <skeet@pobox.com> wrote in message
news:MPG.21e72f1e5fb47fc0776@msnews.microsoft.com...