Compilers can optimise away functions you may want to time. Deák Ferenc explores ways to stop this happening.
Recently, I attended a course held by Andrei Alexandrescu (of Modern C++ Design and C++ Coding Standards fame) about C++ optimization. Some of the concepts suggested during the lecture opened up a series of questions, which led to some research and finally to the creation of this article, in which we explore the delicate interface between compilers and the real-life concepts they represent. We’ll look at the way a compiler digests information and some of the measures it takes to assure the highest performance of the delivered application, so vital for most systems. But mostly we will try to find loopholes in the contract covering the compiler’s interface to the real-life notions it represents.
When we programmers look at a source file, with or without exhaustive experience in certain fields, some patterns will emerge, mental maps will be charted and some conclusions will be drawn from our in-mind analysis of the few lines observed. Using the familiar building blocks of the language (such as conditions, loops, jumps, etc.) we automatically recognize the situations that certain pieces of code present us with, and we act accordingly. However, when an optimizing compiler looks at the code, it sees something completely different.
It is extremely difficult, and might not even be possible, to fully understand all the operations performed by the compiler while generating optimized code, because we take a very different view of, and approach to, the same problem. The transformation of the source code into the generated binary goes through a sequence of steps (each deserving a dedicated chapter in the big book of compiler implementation), and from the various intermediate representations, taking into consideration a series of settings, the final result may emerge in a wide variety of embodiments depending on the optimization settings, the environment and the target system.
The background
During the course, while detailing various techniques regarding C++ optimization, Alexandrescu talked about measuring the time some operations take, how your compiler might disregard all your hard work while optimizing the final code (because the only code you should benchmark and optimize is Release code), and finally how to trick the compiler into actually performing what you want it to do, regardless of optimization settings.
And let’s admit it: C++ compilers are among the most advanced compilers available nowadays. In today’s performance-oriented world they generate possibly the fastest and most efficient code, satisfying the requirements of a multitude of platforms and systems, and they have really advanced mechanisms for providing you with the fastest code. They can evaluate complex operations on known data at compile time (there go your carefully handcrafted unit tests with vigorously chosen constant data); they can see whether you ever intended to use a function and, if not, they… just don’t call it, provided they detect it does not affect other data; they… can do a lot of unimaginable things to make your application faster.
Let’s consider the code in Listing 1.
  long some_operation_you_want_to_measure(const char *a, int l)
  {
    long c, n = 1;
    for (c = 0; c < l; c++) {
      if (a[c] > '9') return 0;
      n += n * (a[c] - '0');
    }
    return n;
  }

  int main()
  {
    unsigned counter = 1000000;
    while (--counter > 0) {
      some_operation_you_want_to_measure("234", 3);
    }
  }

Listing 1
In the context above, when the code is compiled with high optimizations enabled (-O3 or -O2 for gcc/clang), the body of the function some_operation_you_want_to_measure() will be present in the generated code (unless you have told the compiler that this is a static function, in which case it will… miraculously disappear into the void), but the only thing that main will do is:
  main:                     # @main
          xor     eax, eax
          ret
This, however, might change once you change the optimization settings or the compiler. http://gcc.godbolt.org provides an excellent platform for comparing the output of various compilers applied to the same source. (As a side note: both clang and gcc are able to calculate the result of the function above as being 60 and feed it directly into the required place if we assign it to a variable which is printed.)
In order to avoid this uncertainty, Alexandrescu suggested the following solution: ‘Use I/O combined with a non-provable false test’. His solution is a short function, like Listing 2.
  template <class T>
  void doNotOptimizeAway(T&& d)
  {
    if (getpid() == 1) {
      const void *p = &d;
      putchar(*static_cast<const char*>(p));
    }
  }

Listing 2
And from this point on, we use the syntax in Listing 3.
  unsigned counter = 1000000;
  while (counter-- > 0) {
    doNotOptimizeAway(
      some_operation_you_want_to_measure("234", 3));
  }

Listing 3
As per [Alexandrescu], the code in Listing 3 will be treated by the optimizer in a fashion that guarantees the call (or the calculation, as we have observed in certain simple situations) of some_operation_you_want_to_measure, and it assures you that the code will not be ‘optimized away’. What the code above does is the following: it defines a function which acts as a placeholder and is never optimized away (due to the fact that it contains an I/O operation, and those are never optimized away), but which nevertheless does nothing (because no user-land process in a sane system can have its PID equal to 1, that value being (un)officially reserved for the init process).
And this unprovable false was the spark for this article. I started looking for ‘contractual loopholes’ in the standard libraries and functions, specifically researching situations where the compiler could have been clever enough to determine that a condition will always evaluate to false, and to use this knowledge while generating code.
The falses
The C and C++ standard libraries contain a huge number of functions providing abstractions of real-life concepts which can easily be translated into computer code. However, due to the particularities of computing systems, these translations are on certain occasions unable to fully model the real-life situations, thus providing me with the required attack surface.
Playing with time
There are several functions and structures related to time retrieval in the C++ standard library header <ctime> and its C counterpart <time.h>. One of them is the localtime function, which populates a tm structure with human-readable values of the current time. As per [n1256], Listing 4 shows the list of its fields (the declaration was taken from my time.h).
  struct tm
  {
    int tm_sec;   /* Seconds.      [0-60] (1 leap second) */
    int tm_min;   /* Minutes.      [0-59]   */
    int tm_hour;  /* Hours.        [0-23]   */
    int tm_mday;  /* Day.          [1-31]   */
    int tm_mon;   /* Month.        [0-11]   */
    int tm_year;  /* Year - 1900.           */
    int tm_wday;  /* Day of week.  [0-6]    */
    int tm_yday;  /* Days in year. [0-365]  */
    int tm_isdst; /* DST.          [-1/0/1] */
  };

Listing 4
As the comments say, all of these fields are restricted to meaningful values; however, no one can stop me writing code like Listing 5.
  time_t now;
  struct tm *tm;

  now = time(0);
  tm = localtime(&now);
  if (tm->tm_sec >= 62) {
    const void *p = &d;
    putchar(*static_cast<const char*>(p));
  }

Listing 5
Now, we humans know that in reality this will never happen (just like a check for the hour being >24); however, the compiler currently is not clever enough, and the proposed contracts feature for the C++17 standard has no coverage for this situation [CppContracts]. Thus the compiler will happily generate the following code (showing the output from clang 3.9.1, since it came up with the cleanest code; the comment is mine):
          xor     edi, edi
          call    time
          mov     qword ptr [rsp], rax
          mov     rdi, rbx
          call    localtime
          cmp     dword ptr [rax], 62  ; <- This is the unnecessary comparison
          jl      .LBB0_3
However, all the compilers I tested this piece of code with gave me the same result: a comparison which can never be true.
Wondrous world of mathematics
There is an exhaustive library of mathematical functions available in C++ and C (found in the <cmath> or the corresponding <math.h> header file) which can take almost all basic mathematical concepts and translate them into corresponding function calls providing the expected result. One of these is the sin() function, which computes the sine of its argument (given in radians). If no error occurs, the sine of the argument is returned; the value is in the range [-1, +1], as per the definition of the trigonometric function. So, from a human’s point of view it really makes no sense to check for values >1. However, again, the compiler is not clever enough to realise this, so the following call:
  if (sin( *(reinterpret_cast<float*>(&d)) ) > 1.0f) {
    const void *p = &d;
    putchar(*static_cast<const char*>(p));
  }
will happily generate code like:
          movss   xmm0, DWORD PTR .LC1[rip]
          mov     QWORD PTR [rsp+8], 60
          call    sinf
          ucomiss xmm0, DWORD PTR .LC0[rip]
          jbe     .L2
where the number 60 is the value precalculated by the compiler (as presented above); however, the call to sin and the comparison after it (ucomiss: Unordered Compare Scalar Single-Precision Floating-Point Values and Set EFLAGS) are present in the generated code.
The code above was generated by gcc 6.3; I had the same result with gcc 6.2 and 6.1. The gcc 5.x series also produces similar code (not exactly this, but a bit longer, using a different set of assembly instructions), which is identical to the code generated by the gcc 4.9.x series. However, the gcc 4.8 and 4.7 series did not generate any code for this senseless sin check, a behaviour also manifested by clang 3.9 and 3.8; basically, by all clangs back to 3.0. Microsoft’s Visual C++ compiler generated code which was very similar to that of the gcc 5.x family, with the check inside.
The interesting part came, however, when I increased the complexity of the some_operation_you_want_to_measure function. I added the line if(n > 24) n -= n * a[c]; to the body of the for loop, after the line n += n * (a[c] - '0');. Suddenly, the compilers which had not taken my senseless juggling with sine into consideration started paying attention to it, and all of them generated some code to handle it. However, if I put the new line before the existing line, nothing changed; the behaviour was the same as without the line. It seems I really stepped on the toes of some optimizers.
After several stages of experimentation with the some_operation_you_want_to_measure function, I drew the conclusion that the more complex the function I want to measure gets, the less likely it is for the call to the senseless sin to be optimized away. And when the complexity of some_operation_you_want_to_measure reached a stage where the compiler was no longer able to calculate its result at compile time, the senseless sine check was always called. I leave it to the imagination of the reader to envision what other misconduct can be achieved with the functions of the mathematical library.
The null-terminated byte strings
Not a very widely known name, but there is a header file <cctype>, which has its roots in C’s <ctype.h>, containing all kinds of functions that can be used to examine… well, null-terminated byte strings. In it you can find a wide range of functions for checking various properties of characters (in the vein of: is this an UPPERCASE character, is this a digit, etc.).
Among them is the very useful tolower(int) function which, as its name suggests, converts its argument (an int representing a character) to its lowercase equivalent, using the character conversion rules of the current C locale. The function, however, returns its argument unmodified if no lowercase version is available in the current C locale. Its counterpart is toupper, which behaves the same way, but converts the argument to an uppercase character if possible.
We can exploit this in order to create another unprovable false. Let’s consider:
  if (toupper('2') == 1) {
    const void *p = &d;
    putchar(*static_cast<const char*>(p));
  }
Again, the compiler does not know that an uppercase version of the character ‘2’ does not exist and that, per the specification, toupper will return ‘2’; this is again a false which cannot be optimized away.
Contracts
The C++17 standard was supposed to include a notion of contracts, which would act like run-time assertions allowing checks for the validity of input arguments and other dependencies, but as per the latest available draft (as of February 2017) [n4618] I have found no mention of them. Due to their nature, some of the falses above would be very easily identifiable at compile time, and the compiler could take steps so that they do not behave as erratically as they do today.
For example, let’s consider the sin function. Currently, one of its declarations is the following:

  float sin( float arg );

which says nothing to the compiler about the nature of the function, i.e. that its return value is always between -1 and +1 (inclusive).
Let’s revise the declaration of the sin function, empowered with the concept of contracts, using the syntax from [CppContracts] (please note, the code below is not valid C++):

  float sin( float arg ) [[ensures: return <= 1 && return >= -1 ]];
Due to the almost documentation-like nature of a contract, the compiler would instantly know that

  sin( *(reinterpret_cast<float*>(&d)) ) > 1.0f

will always be false, and could generate proper code in order to achieve maximum performance and speed.
There is just one tiny little problem with the code above: return is not part of the upcoming C++ contracts specification, so the declaration above is just a small dream that might come true one day.
Conclusion
The falses presented here are just a subset of all those that exist out there; if you look for them specifically, you will find many more, from various libraries, through operating-system-specific data, to miscellaneous functions. I just wanted to follow up on [Alexandrescu], list a few more currently in existence, and offer a theoretical way to mitigate the (not so serious) threat they represent. Once compilers fully support contracts, I am sure a new set of improved C++ libraries will come into existence which will help the compilers in their everyday job.
Appendix 1
The techniques presented above, however, have a minor downside: strictly speaking, the measured time will include a tiny fraction consumed by the function call itself together with the performed operations, which introduces variable distortions into the performance measurements, thus degrading the quality of the data we receive. In order to achieve constant time and minimum overhead we can use the following technique: we always call the function we want to measure via a volatile function pointer (Listing 6). This introduces only a few extra bytes into the code (see Listing 7), and by using it we do not depend on calling external functions with all their side effects, and indeed the calls are performed as we would expect them to be. Also, an interesting fact is that in this case the compiler is not able to precalculate the result of the function, nor will it inline the called function into the body of the caller.
  using ptr_fn = long(const char*, int);
  ptr_fn* volatile pfn = some_op;
  unsigned counter = 1000000;
  while (counter-- > 0) {
    pfn("234", 3);
  }

Listing 6
          mov     QWORD PTR [rsp+8], OFFSET FLAT:some_operation_you_want_to_measure(char const*, int)
  .L10:
          mov     rax, QWORD PTR [rsp+8]
          mov     esi, 3
          mov     edi, OFFSET FLAT:.LC0
          call    rax

Listing 7
Appendix 2
In order to have a basis for comparison of the variations in the generated code, I have also created the test application in two other programming languages with syntax very similar to C’s, which also compile into binary code: Go and D.
Listing 8 is the Go version of the same program. The corresponding generated assembly code (showing only the interesting parts, since Go compiled an executable file of 1.1 MB) is in Listing 9.
  package main

  func some_operation_you_want_to_measure(a string, l int) int {
    var n int = 1
    for c := 0; c < l; c++ {
      if a[c] > '9' {
        return 0
      }
      n = n + n*int(a[c]-'0')
    }
    return n
  }

  func main() {
    var counter int = 1000000
    for counter > 0 {
      some_operation_you_want_to_measure("234", 3)
      counter--
    }
  }

Listing 8
          mov     rax, 0xf4240
          mov     QWORD PTR [rsp+0x20], rax
          cmp     rax, 0x0
          jle     loc_004016ea
  loc_004010b5:
          lea     rbx, [rip+0x711c4]
          mov     QWORD PTR [rsp], rbx
          mov     QWORD PTR [rsp+0x8], 0x3
          mov     QWORD PTR [rsp+0x10], 0x3
          call    main.some_operation_you_want_to_measure
          mov     rax, QWORD PTR [rsp+0x20]
          dec     rax
          mov     QWORD PTR [rsp+0x20], rax
          cmp     rax, 0x0
          jg      loc_004010b5
  loc_004016ea:
          add     rsp, 0x28
          ret

Listing 9
The same application in D is shown in Listing 10, and the generated code, compiled with optimizations on (-O), is in Listing 11 (the final executable was 594 KB long; I took only the relevant parts).
  long some_operation_you_want_to_measure(const char *a, int l)
  {
    long c, n = 1;
    for (c = 0; c < l; c++) {
      if (a[c] > '9') return 0;
      n += n * (a[c] - '0');
    }
    return n;
  }

  int main()
  {
    int counter = 1000000;
    while (--counter > 0) {
      some_operation_you_want_to_measure("234", 3);
    }
    return 1;
  }

Listing 10
          mov     ebx, 0xf423f
  loc_00427d53:
          lea     rsi, [rip+0x26806]
          mov     edi, 0x3
          call    _DZtt34some_operation_you_want_to_measureFxPaiZL
          dec     ebx
          test    ebx, ebx
          jg      loc_00427d53

Listing 11
Although this is not strictly related to the article, it is nevertheless interesting to see how various compilers handle the same situation.
Acknowledgements
I have referenced [Alexandrescu] frequently throughout the article, and with his permission I have reused code written by him which was presented at the course. It was awesome.
Regarding Appendix 1: it came into existence during the review phase of the article, when one of the reviewers drew my attention to the distorted time measurements and kindly suggested the solution that mitigates this problem. Thank you!
References
[Alexandrescu] Andrei Alexandrescu, Fastware: The art of optimizing C++ code, Kongsberg, January 2017
[CppContracts] http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2015/n4415.pdf
[n1256] http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf
[n4618] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/n4618.pdf