dotnet clr group, April 2007

Managed vs Unmanaged Bare Bones Performance Test


Managed vs Unmanaged Bare Bones Performance Test adhingra
4/19/2007 1:52:01 PM
At our company we have reached the point where we must choose between managed and unmanaged code, based on performance. I have read about this on various blogs and websites, but I decided to run my own test, since at this point I am mostly concerned with basic performance.

By basic I mean the fundamental operations inside the CLR, i.e. function-call cost, for loops, variable declarations, etc. Let us leave aside GC, memory allocation costs, and so on.

To my surprise, the managed code generated from my C# test lagged considerably behind the code generated by the C++ compiler.

I was wondering if someone could take a quick look at this and tell me why this is the case. I was under the assumption that, once the JIT has compiled the code, the CLR and its JIT would give the same performance as the native C++ compiler (since we are talking about basic stuff only: no objects, just pure language constructs and primitive data types).

I created two sample console applications (one in C# and the other in C++). Both call a function, passing an int by value, from inside a for loop; nothing happens inside the function. I used the QueryPerformanceCounter and QueryPerformanceFrequency APIs for measurement. (The code is pasted at the bottom of this posting.)

Here are the results (release builds run from the console, with the default IDE settings):

C# test, for loop (50000 iterations):  0.000023931 s (23.9 microseconds)
C++ test, for loop (50000 iterations): 0.000000350 s (0.35 microseconds)

So the C++ build appears to be roughly 68 times faster than the JIT-compiled C# code (23.9 / 0.35 ≈ 68). If I also subtract the time taken by the QueryPerformanceCounter calls, the difference is even larger.

Can anyone please elaborate.

Thanks
adhingra

===========================================
C# Code PROGRAM.CS
===========================================

using System;
using System.Collections.Generic;
using System.Text;
using System.Runtime.InteropServices;

namespace ConsoleApp
{
    class Program
    {
        // API declarations for the high-resolution performance counter.
        // Both functions return a Win32 BOOL, which marshals to bool.
        [DllImport("kernel32.dll")]
        extern static bool QueryPerformanceCounter(ref long x);
        [DllImport("kernel32.dll")]
        extern static bool QueryPerformanceFrequency(ref long x);

        static long m_lStart = 0, m_lStop = 0, m_lFreq = 0;
        static long m_lOverhead = 0;
        static decimal m_mTotalTime = 0;

        static void Main(string[] args)
        {
            // get the counter frequency (ticks per second)
            QueryPerformanceFrequency(ref m_lFreq);

            // record the overhead for calling the performance counter API
            QueryPerformanceCounter(ref m_lStart);
            QueryPerformanceCounter(ref m_lStop);

            m_lOverhead = m_lStop - m_lStart;

            Console.WriteLine("Starting with a simple For Loop calling a simple function");

            QueryPerformanceCounter(ref m_lStart);
            for (int i = 0; i < 50000; i++)
            {
                Run(i);
            }
            QueryPerformanceCounter(ref m_lStop);

            long lDiff = m_lStop - m_lStart;
            Console.WriteLine(lDiff);
            // Comment or uncomment the overhead lines to see the times drop
            //
            //if (lDiff > m_lOverhead)
            //{
            //    lDiff = lDiff - m_lOverhead;
            //}

            // convert ticks to seconds
            m_mTotalTime = ((Decimal)lDiff) / ((Decimal)m_lFreq);
            Console.WriteLine(m_mTotalTime);

            Console.WriteLine("Press Enter to Continue");
            Console.ReadLine();
        }

        static void Run(int i)
        {
            //Console.WriteLine(i);
        }
    }
}


===============================================
C++ Code ConsoleApp.cpp
===============================================

// ConsoleApp.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include <windows.h>   // LARGE_INTEGER, QueryPerformanceCounter/Frequency
#include <stdio.h>     // printf, getchar

void Run(int i)
{
    //printf("%d\n",i);
}

int _tmain(int argc, _TCHAR* argv[])
{
    LARGE_INTEGER m_start, m_stop, m_freq;
    ::QueryPerformanceFrequency(&m_freq);

    // record the overhead for calling the performance counter API
    ::QueryPerformanceCounter(&m_start);
    ::QueryPerformanceCounter(&m_stop);

    LONGLONG m_overhead = m_stop.QuadPart - m_start.QuadPart;
    m_start.QuadPart = 0;
    m_stop.QuadPart = 0;

    printf("%s\n", "Starting with a simple For Loop calling a simple function");

    ::QueryPerformanceCounter(&m_start);
    for (int i = 0; i < 50000; i++)
    {
        Run(i);
    }
    ::QueryPerformanceCounter(&m_stop);

    LONGLONG lDiff = m_stop.QuadPart - m_start.QuadPart;
    printf("%I64d\n", lDiff);   // LONGLONG is 64-bit; %d would print it incorrectly
    // Comment or uncomment the overhead lines to see the times drop
    //
    //if (lDiff > m_overhead)
    //{
    //    lDiff = lDiff - m_overhead;
    //}

    // convert ticks to seconds
    double totalTime = ((double)lDiff) / ((double)m_freq.QuadPart);
    printf("%15.15f\n", totalTime);

    printf("%s", "Press Enter to Continue");

    int c = getchar();
    return 0;
}

RE: Managed vs Unmanaged Bare Bones Performance Test adhingra
4/19/2007 3:10:03 PM
Sorry, I am late with my follow-up. Shortly after posting this, I realized that there is a problem with my test: the C++ compiler is optimizing the whole thing away (I looked at the disassembly).

However, this does not make the benchmark useless; rather than measuring performance, it actually measured the smartness of the two compilers. I did some more research and talked to one of my colleagues here at work who is a C++ expert, and I even tried making the code do more so that I could fool the C++ compiler into actually calling the function. But the compiler is way too smart, and I was told the reason behind this is the "Whole Program Optimization" offered by the VS 2005 linker.

If the function were in a different compilation unit (i.e. a different .cpp file), this would not have happened in VS 2003, but VS 2005 is a different beast with its whole program optimization. The linker no longer just combines .obj files; it acts more like a compiler now and is smart enough to chop up and rearrange their contents.

But, as Ben pointed out, inlining and optimizing are in the feature set of the JIT compiler too. I think I may know why the JIT does not do it for managed code: it compiles one function at a time, and because of time constraints it does not have the luxury of checking the whole program to see whether the results of a function are used anywhere or not.

However, I still think it should have jitted away an empty function.

Thanks, all
Re: Managed vs Unmanaged Bare Bones Performance Test Ben Voigt
4/19/2007 4:38:32 PM

Inlining and optimizing away a call to an empty function is well within the
capabilities of the CLR JIT.


Re: Managed vs Unmanaged Bare Bones Performance Test Ben Voigt
4/19/2007 4:43:24 PM

Did you actually measure the time for QueryPerf? Ok, I see that you did.
Those are native Win32 APIs; C++ will call them much faster than C#, which has to go through P/Invoke.

0.35 microseconds is an extremely short time. Even 23 is too short for a useful benchmark. Run more iterations. In fact, run 50000 iterations first, ignoring the result, to force .NET to precompile everything. Then run half a billion or so iterations and compare the results.
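A minimal sketch of that procedure, written in C++ for the native side of the comparison. It assumes C++11's std::chrono::steady_clock as a portable stand-in for QueryPerformanceCounter (it postdates this thread), and the names g_sink, TimeLoop and NanosecondsPerCall are hypothetical. Run writes to a volatile variable so that the optimizer cannot delete the loop. In C++ the warm-up pass mainly warms caches; on the C# side it is what makes the JIT compile everything before the timed run.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical sink: writes to a volatile object are observable side
// effects, so the optimizer must keep every iteration of the loop.
volatile std::int64_t g_sink = 0;

// Stand-in for the poster's Run(int), made non-empty via the volatile sink.
inline void Run(int i) { g_sink = i; }

// Times `iterations` calls to Run and returns the elapsed seconds.
// steady_clock is used as a portable substitute for QueryPerformanceCounter.
double TimeLoop(std::int64_t iterations) {
    using Clock = std::chrono::steady_clock;
    const Clock::time_point start = Clock::now();
    for (std::int64_t i = 0; i < iterations; ++i) {
        Run(static_cast<int>(i));
    }
    const Clock::time_point stop = Clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Runs a warm-up pass whose result is discarded, then times a much longer
// run and returns the average cost per call in nanoseconds.
double NanosecondsPerCall(std::int64_t warmupIterations,
                          std::int64_t measuredIterations) {
    assert(measuredIterations > 0);
    TimeLoop(warmupIterations);  // warm-up: result deliberately ignored
    const double seconds = TimeLoop(measuredIterations);
    return seconds * 1e9 / static_cast<double>(measuredIterations);
}
```

With Ben's numbers this would be NanosecondsPerCall(50000, 500000000); reporting a per-call average rather than a raw total makes runs with different iteration counts comparable.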

Re: Managed vs Unmanaged Bare Bones Performance Test Ben Voigt
4/19/2007 4:48:27 PM

Oh, and if you don't want the loop optimized away, touch a volatile variable from inside the function.
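A minimal C++ sketch of that suggestion; the names g_calls and CallLoop are hypothetical. Because every access to a volatile object is observable behavior, the compiler may still inline Run, but it must perform all of the updates, so the loop itself has to stay. Here volatile serves only as an optimization barrier, not as a threading primitive. C# also accepts volatile on fields, so the same idea can be applied on the managed side.

```cpp
#include <cassert>

// Hypothetical counter. Every access to a volatile object is observable
// behavior, so the compiler must perform each update it is asked to make.
volatile int g_calls = 0;

// The poster's Run(int), except that it now touches a volatile variable:
// even if the call is inlined, each update (and therefore the loop) remains.
void Run(int /*i*/) {
    // Explicit read then write rather than ++, since C++20 deprecates
    // ++ on volatile objects.
    g_calls = g_calls + 1;
}

// The timed loop from the original test.
void CallLoop(int iterations) {
    for (int i = 0; i < iterations; i++) {
        Run(i);
    }
}
```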


Re: Managed vs Unmanaged Bare Bones Performance Test Jon Skeet [C# MVP]
4/19/2007 10:42:57 PM

That was my thought too. I suspect it'll still perform the loop
iteration, however, whereas the C++ compiler may well have removed that
loop completely, which still means it's not a good benchmark.

--
Jon Skeet - <skeet@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
Re: Managed vs Unmanaged Bare Bones Performance Test Willy Denoyette [MVP]
4/19/2007 11:22:40 PM


This kind of benchmark is meaningless.
The reason for the huge difference is that the C++ compiler hoists the loop away entirely, as it sees no sensible reason to call an empty function 50000 times. The C# compiler does not do this; it simply calls the function, which contains only a ret.
So what you are comparing is the time taken to return from one QueryPerformanceCounter call plus the time to make the next one, against the time taken to call an empty function 50000 times.

Willy.



RE: Managed vs Unmanaged Bare Bones Performance Test Jon Skeet [C# MVP]
4/20/2007 12:00:00 AM

It measures the smartness of the compilers in *one* particular
situation. Do you often run a loop which does nothing? I know I don't.

<snip>


I strongly suspect that it did, by inlining. It just didn't optimise
away the loop itself.

Re: Managed vs Unmanaged Bare Bones Performance Test Jon Skeet [C# MVP]
4/20/2007 12:00:00 AM

<snip>


It'll be interesting to see whether or not this ever happens. In Java,
it made a huge difference, because by having dynamic optimisation (and
de-optimisation) you can inline virtual methods until they're first
overridden. That's really important when the language makes methods
virtual by default, but not as important in a world which requires you
to specify that methods are virtual (which at least C# does - not sure
about VB.NET).

There are other improvements as well, of course, and it could improve
start-up time (one would hope) but the effects won't be quite as huge
as they were in the Java world.

Re: Managed vs Unmanaged Bare Bones Performance Test Willy Denoyette [MVP]
4/20/2007 12:00:00 AM

True for optimized builds: the call is hoisted (the empty function is inlined away), but the JIT does not hoist the loop, whereas the C++ compiler (optimized build) effectively hoists the whole loop.
What the JIT produces depends on the version of the CLR.

This snippet of the code:

    for (int i = 0; i < 50000; i++)
    {
        Run(i);
    }

is turned into:

xor r11d,r11d
add r11d,4
cmp r11d,0C350h
jl 00000642`80150341 (jump to add r11d, 4 if less than)


by the JIT64, while the JIT32 (both from v2 of the CLR) produces:

xor eax,eax
add eax,1
cmp eax,0C350h
jl 001e014b (jump to add eax, 1 if less than)

See the subtle difference:
add r11d, 4
and
add eax, 1

Here the JIT64 is cheating: it steps the counter by 4, i.e. it unrolls the empty loop by a factor of 4. No big deal in this case, but I would prefer more consistent behavior across JIT versions: either hoist the loop or keep it as is, but don't cheat.
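The add r11d,4 in the JIT64 output is what loop unrolling by a factor of 4 looks like when the body is empty: each trip would run four copies of the body, and with nothing to run, only the counter step of 4 remains, so 50000 iterations become 12500 trips. The C++ sketch below (function names hypothetical, not taken from either JIT) shows the general transformation. A count that is not a multiple of 4 needs a remainder loop, which plausibly explains why an odd count such as 49999 changes the picture, as noted later in the thread.

```cpp
#include <cassert>

// Rolled loop: one copy of the body per trip, counter advances by 1
// (the shape of the JIT32 code above).
template <typename Body>
void Rolled(int n, Body body) {
    for (int i = 0; i < n; i++) {
        body(i);
    }
}

// The same loop unrolled by a factor of 4. Each main-loop trip runs four
// copies of the body and advances the counter by 4 (the JIT64 shape); the
// remainder loop runs the final n % 4 iterations. With an empty body the
// four copies disappear, leaving only add/cmp/jl. Returns the number of
// main-loop trips so the effect of unrolling can be checked.
template <typename Body>
int UnrolledBy4(int n, Body body) {
    assert(n >= 0);
    int trips = 0;
    int i = 0;
    for (; n - i >= 4; i += 4) {  // n - i avoids the overflow i + 3 could hit
        body(i);
        body(i + 1);
        body(i + 2);
        body(i + 3);
        ++trips;
    }
    for (; i < n; i++) {          // remainder: 0 to 3 iterations
        body(i);
    }
    return trips;
}
```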

Willy.


Re: Managed vs Unmanaged Bare Bones Performance Test Barry Kelly
4/20/2007 12:06:14 AM

I wish you wouldn't multipost.


It measured how good the C++ compiler is at doing nothing, versus the .NET JIT compiler. I agree, C++ is good for nothing.

:)


.NET necessarily does whole program optimization because compilation happens so late, but it is constrained by the amount of time it has to work with: compilation must occur quickly. Performance will improve over time, as .NET adds techniques that are common in Java, such as recompiling with more aggressive optimization after many iterations.

-- Barry

Re: Managed vs Unmanaged Bare Bones Performance Test Willy Denoyette [MVP]
4/20/2007 12:15:10 PM

True, with the following results:

Build                          Ticks   Elapsed time
C# (JIT32)                       244   0.0000681650880209635582175947 s (about 68 microseconds)
C# (JIT64)                        86   0.0000240253998762412541258735 s (about 24 microseconds)
C++ (/O2), 32-bit and 64-bit     164   0.000045815878834 s (about 46 microseconds)

As you can see, C++ is faster than the JIT32 code but slower than the JIT64 code for this particular case (50000 iterations); however, micro-benchmarks of this kind have absolutely no value. For instance, make the loop count an odd number (e.g. 49999) and you get the same results for C# JIT64 and C++.
Note that the JIT32 is known not to be a great loop optimizer ;-).

Willy.