Compilers can optimise away functions you may want to time. Deák Ferenc explores ways to stop this happening.
Recently, I attended a course held by Andrei Alexandrescu (of Modern C++ Design and C++ Coding Standards fame) about C++ optimization. Some of the concepts suggested during the lecture opened up a series of questions, which led to some research and finally to the creation of this article, in which we explore the delicate interface between compilers and the real-life concepts they represent. We’ll look at the way a compiler digests information and some of the measures it takes to assure the highest performance of the delivered application, so vital for most systems. But mostly we will try to find loopholes in the contract covering the compiler’s interface to the real-life notions it represents.
When we programmers look at a source file, with or without exhaustive experience in certain fields, some patterns will emerge, mental maps will be charted and some conclusions will be drawn from our in-mind analysis of the few lines observed. Using the familiar building blocks of the language (such as conditions, loops, jumps, etc.) we automatically recognize the situations that certain pieces of code present us with, and we act accordingly. However, when an optimizing compiler looks at the code, it sees something completely different.
It is extremely difficult, and might not even be possible, to fully understand all the operations performed by the compiler while generating optimized code, because we take a very different view of, and approach to, the same problem. The transformation of the source code into the generated binary goes through a sequence of steps (each deserving a dedicated chapter in the big book of compiler implementation), and from the various intermediate representations, taking into consideration a series of settings, the final result may emerge in a wide variety of embodiments depending on the optimization settings, the environment and the target system.
The background
During the course, while detailing various techniques regarding C++ optimization, Alexandrescu talked about measuring the time some operations take, how your compiler might disregard all your hard work while optimizing the final code (because the only code you should benchmark and optimize is Release code), and finally how to trick the compiler into actually performing what you want it to do, regardless of optimization settings.
And let’s admit it: C++ compilers are among the most advanced compilers available nowadays. In today’s performance-oriented world they generate possibly the fastest and most efficient code, satisfying the requirements of a multitude of platforms and systems, and they have really advanced mechanisms for providing you with the fastest code. They can evaluate complex operations on known data at compile time (there go your carefully handcrafted unit tests with vigorously chosen constant data); they can see whether you ever intended to use a function and, if not, they… just don’t call it, provided they detect it does not affect other data; they… can do a lot of unimaginable things to make your application faster.
Let’s consider the code in Listing 1.
  long some_operation_you_want_to_measure(const char *a, int l)
  {
    long c, n = 1;
    for (c = 0; c < l; c++) {
      if (a[c] > '9') return 0;
      n += n * (a[c] - '0');
    }
    return n;
  }

  int main()
  {
    unsigned counter = 1000000;
    while (--counter > 0) {
      some_operation_you_want_to_measure("234", 3);
    }
  }

Listing 1
In the context above, when the code is compiled with high optimizations enabled (-O3 or -O2 for gcc/clang), the body of the function some_operation_you_want_to_measure() will be present in the generated code (unless you have told the compiler that this is a static function, in which case it will… miraculously disappear into the void), but the only thing that main will do is:
  main:                     # @main
          xor     eax, eax
          ret
This, however, might change once you change the optimization settings or the compiler. http://gcc.godbolt.org provides an excellent platform for comparing the output of various compilers applied to the same source. (As a side note: both clang and gcc are able to calculate the result of the function above as being 60 and feed it directly into the required place if we assign it to a variable which is printed.)
In order to avoid this uncertainty, Alexandrescu suggested the following solution: ‘Use I/O combined with a non-provable false test’. His solution is a short function, like Listing 2.
  template <class T>
  void doNotOptimizeAway(T&& d)
  {
    if (getpid() == 1) {
      const void *p = &d;
      putchar(*static_cast<const char*>(p));
    }
  }

Listing 2
And from this point on, we use the syntax in Listing 3.
  unsigned counter = 1000000;
  while (counter-- > 0) {
    doNotOptimizeAway(
      some_operation_you_want_to_measure("234", 3));
  }

Listing 3
As per [Alexandrescu], the code in Listing 3 will be treated by the optimizer in a fashion that guarantees the call (or the calculation, as we have observed in certain simple situations) of some_operation_you_want_to_measure, and it assures you that the code will not be ‘optimized away’. What the code above does is the following: it defines a function which acts as a placeholder and is never optimized away (due to the fact that it contains an I/O operation, and those are never optimized away), but which nevertheless does nothing (because no user-land process in a sane system can have its PID equal to 1, that value being (un)officially reserved for the init process).
And this unprovable false was the spark for this article. I started looking for ‘contractual loopholes’ in the standard libraries and functions, specifically researching situations where the compiler could have been clever enough to determine that a condition will always evaluate to false, and to use this knowledge while generating code.
The falses
The C and C++ standard libraries contain a huge number of functions providing abstractions of real-life concepts which can easily be translated into computer code. However, due to the particularities of computing systems, these translations are on certain occasions unable to fully model the real-life situations, thus providing me with the required attack surface.
Playing with time
There are several functions and structures related to time retrieval in the C++ standard library header <ctime> and its C counterpart <time.h>. One of them is the localtime function, which populates a tm structure with human-readable values of the current time. As per [n1256], Listing 4 shows the list of its fields (the declaration was taken from my time.h).
  struct tm
  {
    int tm_sec;   /* Seconds.      [0-60] (1 leap second) */
    int tm_min;   /* Minutes.      [0-59]   */
    int tm_hour;  /* Hours.        [0-23]   */
    int tm_mday;  /* Day.          [1-31]   */
    int tm_mon;   /* Month.        [0-11]   */
    int tm_year;  /* Year - 1900.           */
    int tm_wday;  /* Day of week.  [0-6]    */
    int tm_yday;  /* Days in year. [0-365]  */
    int tm_isdst; /* DST.          [-1/0/1] */
  };

Listing 4
As the comments say, all of these fields are restricted to meaningful values; however, no one can stop me writing code like Listing 5.
  time_t now;
  struct tm *tm;

  now = time(0);
  tm = localtime(&now);
  if (tm->tm_sec >= 62) {
    const void *p = &d;
    putchar(*static_cast<const char*>(p));
  }

Listing 5
Now, we humans know that in reality this will never happen (just like a check for the hour being >24); however, the compiler currently is not clever enough, and the proposed contracts feature for the C++17 standard has no coverage for this situation [CppContracts]. Thus the compiler will happily generate the following code (showing the output from clang 3.9.1, since it came up with the cleanest code; the comment is mine):
          xor     edi, edi
          call    time
          mov     qword ptr [rsp], rax
          mov     rdi, rbx
          call    localtime
          cmp     dword ptr [rax], 62  ; <- This is the unnecessary comparison
          jl      .LBB0_3
However, all the compilers I tested this piece of code with gave me the same result: a comparison which can never be true.
Wondrous world of mathematics
There is an exhaustive library of mathematical functions available in C++ and C (found in the <cmath> or the corresponding <math.h> header file) which can take almost all basic mathematical concepts and translate them into corresponding function calls providing the expected result. One of these is the sin() function, which computes the sine of its argument (given in radians). If no error occurs, the sine of the argument is returned; the value is in the range [-1, +1], as per the definition of the trigonometric function. So, from a human’s point of view it really makes no sense to check for values >1. However, again, the compiler is not clever enough to realise this, so the following call:
  if (sin( *(reinterpret_cast<float*>(&d)) ) > 1.0f) {
    const void *p = &d;
    putchar(*static_cast<const char*>(p));
  }
will happily generate code like:
          movss   xmm0, DWORD PTR .LC1[rip]
          mov     QWORD PTR [rsp+8], 60
          call    sinf
          ucomiss xmm0, DWORD PTR .LC0[rip]
          jbe     .L2
where the number 60 is the value precalculated by the compiler (as presented above); however, the call to sin and the comparison after it (ucomiss: Unordered Compare Scalar Single-Precision Floating-Point Values and Set EFLAGS) are present in the generated code.
The code above was generated by gcc 6.3; I had the same result with gcc 6.2 and 6.1. The gcc 5.x series also produces similar code (not exactly this, but a bit longer, using a different set of assembly instructions), which is identical to the code generated by the gcc 4.9.x series. However, the gcc 4.8 and 4.7 series did not generate any code for this senseless sin check, a behaviour also manifested by clang 3.9 and 3.8; basically, by all clangs back to 3.0. Microsoft’s Visual C++ compiler generated code which was very similar to that of the gcc 5.x family, with the check inside.
The interesting part came, however, when I increased the complexity of the some_operation_you_want_to_measure function. I added the line if(n > 24) n -= n * a[c]; to the body of the for loop, after the line n += n * (a[c] - '0');. Suddenly, the compilers which had not taken my senseless juggling with sine into consideration started paying attention to it, and all of them generated some code to handle it. However, if I put the new line before the existing line, nothing changed; the behaviour was the same as without the line. It seems I really stepped on the toes of some optimizers.
After several stages of experimentation with the some_operation_you_want_to_measure function, I drew the conclusion that the more complex the function I want to measure gets, the less likely it is for the call to the senseless sin to be optimized away. And when the complexity of some_operation_you_want_to_measure reached a stage where the compiler was no longer able to calculate its result at compile time, the senseless sine check was always called. I leave it to the imagination of the reader to envision what other misconduct can be achieved with the functions of the mathematical library.
The null-terminated byte strings
Not a very widely known name, but there is a header file <cctype>, which has its roots in C’s <ctype.h>, containing all kinds of functions that can be used to examine… well, null-terminated byte strings. In it you can find a wide range of functions for checking various properties of characters (in the vein of: is this an UPPERCASE character, is this a digit, etc.).
Among them is the very useful tolower(int) function which, as its name suggests, converts its argument (an int representing a character) to its lowercase equivalent, using the character conversion rules of the current C locale. The function, however, returns its argument unmodified if no lowercase version is available in the current C locale. Its counterpart is toupper, which behaves the same way, but converts the argument to an uppercase character if possible.
We can exploit this in order to create another unprovable false. Let’s consider:
  if (toupper('2') == 1) {
    const void *p = &d;
    putchar(*static_cast<const char*>(p));
  }
Again, the compiler does not know that an uppercase version of the character ‘2’ does not exist and that, per the specification, toupper will return ‘2’; this is again a false which cannot be optimized away.
Contracts
The C++17 standard was supposed to include a notion of contracts, which would act like run-time assertions allowing checks for the validity of input arguments and other dependencies, but as per the latest available draft (as of February 2017) [n4618] I have found no mention of them. Due to their nature, some of the falses above would be very easily identifiable at compile time, and the compiler could take steps so that they do not behave as erratically as they do today.
For example, let’s consider the sin function. Currently, one of its declarations is the following:

  float sin( float arg );

which says nothing to the compiler about the nature of the function, i.e. that its return value is always between -1 and +1 (inclusive).
Let’s revise the declaration of the sin function, empowered with the concept of contracts, using the syntax from [CppContracts] (please note, the code below is not valid C++):

  float sin( float arg ) [[ensures: return <= 1 && return >= -1 ]];
Due to the almost documentation-like nature of a contract, the compiler would instantly know that

  sin( *(reinterpret_cast<float*>(&d)) ) > 1.0f

will always be false, and could generate proper code in order to achieve maximum performance and speed.
There is just one tiny little problem with the code above: return is not part of the upcoming C++ contracts specification, so the declaration above is just a small dream that might come true one day.
Conclusion
The falses presented here are just a subset of all those that exist out there; if you look for them specifically, you will find many more, from various libraries, through operating-system-specific data, to miscellaneous functions. I just wanted to follow up on [Alexandrescu], list a few more currently in existence, and offer a theoretical way to mitigate the (not so serious) threat they represent. Once compilers fully support contracts, I am sure a new set of improved C++ libraries will come into existence which will help the compilers in their everyday job.
Appendix 1
The techniques presented above, however, have a minor downside: strictly speaking, the measured time will include a tiny fraction consumed by the function call itself together with the performed operations, which introduces variable distortions into the performance measurements, thus degrading the quality of the data we receive. In order to achieve constant time and minimum overhead we can use the following technique: we always call the function we want to measure via a volatile function pointer (Listing 6). This introduces only a few extra bytes into the code (see Listing 7), and by using it we do not depend on calling external functions with all their side effects, and indeed the calls are performed as we would expect them to be. Also, an interesting fact is that in this case the compiler is not able to precalculate the result of the function, nor will it inline the called function into the body of the caller.
  using ptr_fn = long(const char*, int);
  ptr_fn* volatile pfn = some_op;
  unsigned counter = 1000000;
  while (counter-- > 0) {
    pfn("234", 3);
  }

Listing 6
          mov     QWORD PTR [rsp+8], OFFSET FLAT:some_operation_you_want_to_measure(char const*, int)
  .L10:
          mov     rax, QWORD PTR [rsp+8]
          mov     esi, 3
          mov     edi, OFFSET FLAT:.LC0
          call    rax

Listing 7
Appendix 2
In order to have a basis for comparison of the variations in the generated code, I have also created the test application in two other programming languages with syntax very similar to C’s, which also compile into binary code: Go and D.
Listing 8 is the Go version of the same program. The corresponding generated assembly code (showing only the interesting parts, since Go compiled an executable file of 1.1 MB) is in Listing 9.
  package main

  func some_operation_you_want_to_measure(a string, l int) int {
    var n int = 1
    for c := 0; c < l; c++ {
      if a[c] > '9' {
        return 0
      }
      n = n + n*int(a[c]-'0')
    }
    return n
  }

  func main() {
    var counter int = 1000000
    for counter > 0 {
      some_operation_you_want_to_measure("234", 3)
      counter--
    }
  }

Listing 8
          mov     rax, 0xf4240
          mov     QWORD PTR [rsp+0x20], rax
          cmp     rax, 0x0
          jle     loc_004016ea
  loc_004010b5:
          lea     rbx, [rip+0x711c4]
          mov     QWORD PTR [rsp], rbx
          mov     QWORD PTR [rsp+0x8], 0x3
          mov     QWORD PTR [rsp+0x10], 0x3
          call    main.some_operation_you_want_to_measure
          mov     rax, QWORD PTR [rsp+0x20]
          dec     rax
          mov     QWORD PTR [rsp+0x20], rax
          cmp     rax, 0x0
          jg      loc_004010b5
  loc_004016ea:
          add     rsp, 0x28
          ret

Listing 9
The same application in D is shown in Listing 10, and the generated code, compiled with optimizations on (-O), is in Listing 11 (the final executable was 594 KB long; I took only the relevant parts).
  long some_operation_you_want_to_measure(const char *a, int l)
  {
    long c, n = 1;
    for (c = 0; c < l; c++) {
      if (a[c] > '9') return 0;
      n += n * (a[c] - '0');
    }
    return n;
  }

  int main()
  {
    int counter = 1000000;
    while (--counter > 0) {
      some_operation_you_want_to_measure("234", 3);
    }
    return 1;
  }

Listing 10
          mov     ebx, 0xf423f
  loc_00427d53:
          lea     rsi, [rip+0x26806]
          mov     edi, 0x3
          call    _DZtt34some_operation_you_want_to_measureFxPaiZL
          dec     ebx
          test    ebx, ebx
          jg      loc_00427d53

Listing 11
Although this is not strictly related to the article, it is nevertheless interesting to see how various compilers handle the same situation.
Acknowledgements
I have referenced [Alexandrescu] frequently throughout the article, and with his permission I have reused code written by him which was presented at the course. It was awesome.
Regarding Appendix 1: it came into existence during the review phase of the article, when one of the reviewers drew my attention to the distorted time measurements and kindly suggested the solution that mitigates this problem. Thank you!
References
[Alexandrescu] Andrei Alexandrescu, Fastware: The art of optimizing C++ code, Kongsberg, January 2017
[CppContracts] http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2015/n4415.pdf
[n1256] http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf
[n4618] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/n4618.pdf