Open Source – And Still Deceiving Programmers

Malware can hijack the normal flow of your program. Deák Ferenc walks through the ELF format to avoid malicious code injection.

Computer viruses, trojan horses, rootkits and other pieces of malicious software have been around for a very long time. Since the first application that could be classified as a ‘classic’ computer virus (based on the theory presented in [Neumann66] ) appeared in 1971, countless variations of the same construct have appeared, with more or less destructive intentions, varying from harmless jokes to highly specialized pieces of malicious software targeting industrial processes and machines, while the number of them as per [BBC] has passed 1000000 (by 2008).

The threat presented by these noxious pieces of code is so significant that a new word has emerged in order to classify them: ‘malware’. This is short for malicious software. Malware is a collective category encompassing several types of harmful applications from the classic virus (ie. an application which replicates itself by modifying already existing files or data structures on a computer) through worms (applications which move through the network infecting computers) to trojan horses (applications posing as something other than they are, very often disguised as legitimate applications). Malware also covers spyware and keyloggers (these spy on your activities, frequently registering your keystrokes which then are sent to malicious parties), rootkits (very low- level applications, more often found at the hardware/OS level, hidden from user-level access) and various ransomware applications, which hold your computer hostage by encrypting your data until you pay a ‘ransom’.

Recently, the trend in the propagation of these damaging applications has changed. Due to the increase in the level of security features in Operating Systems which were more traditionally affected by viruses, the number of classical ‘viruses’ which have multiplied themselves via modifying existing system files and spread via execution is in recess; however, there is a sharp increase in the sighting of other types of malware which are propagating via email attachments, malicious downloads or by simply utilizing vulnerabilities in operating systems [Wikipedia_2] .

Most of these vicious applications have, however, one thing in common: we rarely see the original source code which led to their creation (except for some high-profile leaks such as [TheIntercept] ). Indeed, it would be kind-of silly to ask fellow programmers to “please compile and run this code, it is a virus, it will infect your computer and it will multiply itself in countless pieces before rendering your machine unusable” … this would be similar to the old joke of receiving an email with the content “Hi, this is a manual virus. Since I am not so technically proficient as to be able to write a real virus, please forward this email to everyone in your contact list and delete all your files. With kind regards, Amateur Virus Writer”.

This certainly does not mean that everything you download from the internet and compile yourself is guaranteed to be clean and harmless. The open source community focused around various free products goes to great lengths in order to provide a high quality application without backdoors (ie: unofficial ways which makes access to certain systems possible) and, considering the backdoor attempt of 2003 targeting Linux [LinuxBackdoor] , their effort invested in this direction is more than welcome.

Considering this introduction, you might envision that this article will be a tutorial on how to write viruses, Trojan horses and other maleficent pieces of code in order to achieve world domination, or just simply for fun. You couldn’t be further from the truth. On the contrary. This article will present practical ways for defending your open source application against hideous interventions by programmers with hidden intentions, who would like to hijack the normal flow of your code by various means we will discuss later.

We will see different ways of executing code, we will dig deep in the binary section of executable files, and we will have the chance to examine the source code of a true shape-shifting application. All of these can be present in real life situations when you are merging code from your contributors into the final product, and if you don’t pay attention some not so well intended modifications will end up in the final product.

Running an application

The simple and mundane task of starting an executable application compiled for your platform in fact involves a long list of processes from the operating system’s side. Since the details of this topic in itself are worth a small book, I will just provide a very high-level overview of what happens when you start an application.

Starting a process in Linux

Applications in Linux use the so called ‘ELF’ format (Executable and Linkable Format [Wikipedia_1] , [O’Neil16] ). This is a format adopted by various Unix-like operating systems, but more recently the Linux subsystem of Windows 10 also shows support for this type of executable.

A short overview of the steps taken when a new application is launched (either from a shell or from somewhere else) is as follows:

The fork system call is used to create a new process. The fork will create a ‘copy’ of the current process and set up a set of flags reflecting the state of the new process [fork] .
The execve system call is executed with the application to be executed [execve] .
Down in the Linux kernel, a linux_binprm [linux_binprm] structure is being built in order to accommodate the new process, which is passed to: static int load_elf_binary(struct linux_binprm *bprm) in fs/binfmt_elf.c [load_elf_binary]
load_elf_binary does the actual loading of the ELF executable according to the specifications and at the end it calls start_thread , which is the platform dependent way of starting the loaded executable.

For those expressing a deeper interest in this field, the excellent article [HowKernelRuns] or Robert Love’s outstanding book Linux Kernel Development [Love10] will provide all the details required.

The ELF format

The ELF binary in itself is a complex subject – a full description can be found in Learning Linux Binary Analysis [O’Neil16] , and a shorter one on Wikipedia [Wikipedia_1] – so let’s summarize it briefly:

The ELF headers

The file header of the ELF file starts with a few magic numbers for correctly identifying this as being a valid ELF file: 0x7F followed by the characters E , L and F (0x45, 0x4c, 0x46). The architecture of the file is specified (whether 32 or 64 bit) and whether the encoding of the file is big endian or little endian.

In the header, a special field denotes the target operating system’s Application Binary Interface and the instruction set of the binary. ELF files can be of different types, such as relocatable, executable, shared or core. This information is also stored in the elf header.

There are several fields in the header dealing with the length and format of the ELF sections, described below.

The file header of the ELF file is followed by a program header, which describes to the system how to create a process image, and several section headers.

ELF sections

There are several sections in an ELF file, each containing various data, vital to correctly understand and run the application. Among these sections is:

the .text section, which contains the actual code of the application,
the .rodata section, containing the constant strings from the application
the .data section, which contains for example the initialized global variables
the .bss section, which contains uninitialized global data
various other sections describing how this application handles shared libraries and also…
sections describing application startup and destruction steps in the .ctors and .dtors sections (correspondingly .init_array and .fini_array ). These sections contain function pointers to the methods, which will be called on application startup and shutdown, and we will have a more detailed look at them in later paragraphs of this article.

There is a handy Linux utility called readelf ,which displays information about a specific ELF file’s structure. We will refer to it in this article and will present the output of it frequently.

Deceiving techniques

So, after this short but necessary, background introduction, we have finally arrived at the focus point of the article. We will present here various techniques you have to carefully observe in order to keep your source healthy and free of unwanted side effects.

Mainless application

The main function in a ‘normal’ application is the place where the application starts [main_function] . However, be aware that C and C++ compilers handle main very differently. For example, Listing 1 will compile flawlessly using gcc with default compilation flags even though void main(int a) is strictly not standard compliant, but g++ , being more picky, will refuse to compile it. For gcc , you need to use -pedantic to warn you about the return type of main not being int as required by the standard.

#include <stdlib.h>
#include <stdio.h>
void main(int a)
{
  void (*fp[])(int) = {main,exit};
  printf("%d\n",a++);
  fp[a/101](a);
}

Listing 1

But what if an attacker does not wish to provide a main function (since he wants to pose the source code as being a part of a library)?

With gcc , there is always the possibility of using the -e <entrypoint> switch to specify a different entry point for your application. This is very observable in the build files, and there will be more to be dealt with, such as:

The need to specify -nostartfiles to the gcc command line in order to avoid the linking error: (.text+0x20): undefined reference to 'main' . As a short explanation to this, in the background gcc always links to some architecture specific files (such as crtstuff.c ), which provide the application with the required startup functionality, which will end up calling the main [linuxProgramStartup]
Explicitly use the function exit(<CODE>); to properly exit the application in order to avoid the segmentation fault at exit.
There is no access to the common argc and argv values passed in to your application.

With these in mind, the source file in Listing 2, compiled with the command below should work as expected, by totally avoiding the main function thus deceiving you into believing the validity of the application.

#include <stdio.h>
#include <stdlib.h>

int my_main()
{
  printf("Mainless\n");
  exit(2);
}

Listing 2

Compiled with:

  gcc mainless.c -o mainless -e my_main

    -nostartfiles

It is interesting to note that comparing the binary file produced from compiling a ‘mainless’ program with a more standard binary – with ‘main’ and the linked-in gcc startup files – gives us the (not so) surprising result that the ‘mainless’ file is smaller, with a difference of up to 2000 bytes. Also, the elf structure, analyzed with readelf , gives a much simpler layout. A few differences in the header of the ELF file can also be observed:

Header	mainless	with main
Entry point address	0x400390	0x400430
Start of section headers	5184	6624
Number of section headers	20	31

This all proves that the ‘mainless’ file is, indeed, much smaller that the corresponding one with main.

Running code before main

When your targeted compiler is a C compiler, it can be really difficult to run code before main (or as we have seen, its replacement) starts. I have to emphasize that C++ has a dynamic initialization phase where arbitrary code is executed in order to initialize non-local variables. However, that comes with the well-known static initialization order fiasco: In C++, it is unpredictable which non-local is initialized before which other, so if one depends on the other one, the application in question may work flawlessly in some situations, while other reincarnations may suffer from this dependency with an uninitialized variable.

Fortunately, for C compilers, the ELF format provides extra support for running code before application start in the so called ‘constructor’ section, and it is also possible to hijack the .init section in order to execute code we want using special assembly syntax.

The .init_array section of the ELF binary

The .init_array (and the .preinit_array ) section of the ELF binary contains a list of pointers (addresses of functions) called by the code initializing the application. This code (the one calling the functions in the .init_array section) usually resides in the .init section. The difference between .preinit_array and .init_array is that code in the .preinit_array is called before the .init_array .

The gcc compiler has a non-standard C extension to provide support for defining various dedicated elf sections with user specified code via the usage of the __attribute__ syntax. An example of is in Listing 3.

#include <stdio.h>

void my_main(int argc, char* argv[], 
  char* envp[])
{
  printf("my main: %d parameters\n", argc);
}
int main(int argc, char* argv[])
{
  printf("main: %d parameters\n", argc);
}
__attribute__((section(".init_array"))) void 
  (* p_my_main)(int,char*[],char*[]) = &my_main;

Listing 3

Analyzing a debug session of this application will reveal interesting insights on the working of libc and the application startup procedure (see Figure 1).

(linux)$ gdb ./init_array
(gdb) break my_main
Breakpoint 2 at 0x40052a
(gdb) run
Starting program: .../init_array

Breakpoint 2, 0x000000000040052a in my_main(int, char**, char**) ()
(gdb) bt
#0  0x000000000040052a in my_main(int, char**, char**) ()
#1  0x00000000004005cd in __libc_csu_init ()
#2  0x00007ffff7a2d7bf in __libc_start_main (main=0x400550 <main>, ... ) at ../csu/libc-start.c:247
#3  0x0000000000400459 in _start ()

Figure 1

For further information, evaluating this information, while combining it with the relevant section of assembly code (as the result of objdump -h -S init_array ) will give details of how the code is actually executed, and it is easily traceable based on how the .init_array section is handled on application startup.

If we don’t require access to the command line parameters, gcc also has support as an extension for specifying a function to be called before main using a much less cryptic syntax: __attribute__ ((constructor)) , which has similar consequence to the section initializer syntax:

  void __attribute__ ((constructor)) premain()
  {
    printf("premain called\n");
  }

An even more obscure way of defining a method to be called before main is to rely on the ‘.init’ section of the ELF format and do some assembly level magic (see Listing 4).

#include <stdio.h>
#include <stdlib.h>

int my_main()
{
  __asm__ (".section .init \n call my_main \n .section .text\n");
  printf("my_main\n");
}

int main()
{
  printf("main\n");
}

Listing 4

This will directly instruct the compiler to assemble the code call my_main into the section ‘.init’ by using the assembly directive .section .init .

And, last but not least, some proprietary compilers targeting commercial environment (but not gcc ) have support for a special compiler specific #pragma directive: #pragma init , which has the same effect as having __attribute__((constructor)) for the gcc .

Process tracing

Just a side note: the initialization phase of the application is a preferred place among wanna-be virus writers to place ptrace related code. ptrace is a mechanism offered by Linux making it possible for a parent process to observe and influence the execution of other processes. It is mainly used in debugging and examining the state of other processes; however, certain anti debugging features – if compiled in into an executable – will make the analysis of compiled application difficult.

You should be looking for the PTRACE_TRACEME , which detects if the current application is traced by a debugger, and will act accordingly.

Running code after main

Very similar to the above scenario where we want to run code before the main, the ELF binary comes again to our help by making it possible to run C code after main has finished its lifetime. The ‘destructors’ section of the ELF is named the .fini_array and we can get access to it via the following code construct:

  void end_app(void)
  {
    printf("after main\n");
  }
  __attribute__((section(".fini_array"))) void   (* p_end_app)(void) = &end_app;

or there is also support for the __attribute__ ((destructor)) syntax, if we find the above one very cryptic. This section comes very handy in case we are in the situation of running some ‘last minute’ cleanup jobs.

Similarly to the .init section, we can instruct the assembler to generate code into a .fini section, and also some proprietary compilers support the #pragma fini directive, to mark some identifiers as a finalization function.

A shape shifting application

It is important that you carefully observe not only your C and C++ files, but also the accompanying build instructions. Today, there are several tools which facilitate the management of build scripts (such as CMake, SCons, etc...) and these tools often use very complex files, which makes it easy to hide a few unwanted pieces of code, so be sure to check those too – most of the time that is the location a deception begins.

So, let’s consider the following situation, where you are working on a free and open source budget management application, and one of your contributors submits the code in Listing 5.

include <stdio.h>
#include <stdlib.h>
// TODO: This is still work in progress, more
// months in the test plan are required and some
// code cleanup is necessary to remove the clutter.
// Will FIX ASAP!!!
#define INTEREST void*

#ifndef DEBUG_INTEREST
  #define L (int*)
  #define TEST void
  #define OPEN_INTEREST(a) \
    printf("Opening: %s\n", #a);
  #define READ_INTEREST(a) \
    printf("Reading: %s\n", #a);
  #define CLOSE_INTEREST(a) \
    printf("Closing: %s\n", #a);
  #define INTEREST_VALUE(a,b) b
#endif

TEST interest_calculator_test(char const *period)
{
  static INTEREST array[] = {
    L(26), L(2), 0,  // January
    L(5), L(26), 0,  // February
    L(8), L(2), 0,   // March
    L(11), L(2), 0,  // April
    L(14), L(14),0,  // May
    L(14), L(17), 0, // June
    L(20), L(20),0,  // July
    L(20), L(23), 0, // August
    L(2), L(2)};     // September

  INTEREST earning = array, *entry = array;
  char interest[1024] = {0}, *q, type = 3;
     // type 3 = Recurring
  char* c = interest; const char* name 
    = "Monthly";

  OPEN_INTEREST(earning);
  if(earning == 0) READ_INTEREST(entry);
  if(entry != 0)  // Do we have an interest at
                  // the current point?
  {
    if(INTEREST_VALUE(entry,type) == 4)
      if((*INTEREST_VALUE(entry,name) != 46
                  // 11822 minutes ~ 8 days
        && *(int16_t*)
          (INTEREST_VALUE(entry,name)) != 11822)) 
        c = interest; q = (char*)period; 
  }
  if(*q) *c++ = *q++;   // move to next period
  if(*q) {
    *c++ = 47;          // 47 - a check value,
                        // comes from test data
    q = (char*)INTEREST_VALUE(entry,name);  
                        // get its value, save it
    while(*q) *c++ = *q++; // skip period
  }
  if(*q) {
    *(int16_t*)c = 0x000a;
    printf("Current value: %s", interest);
    *c = 0;
    interest_calculator_test(interest); 
                    // advance month to next one
  }
  if(*q) CLOSE_INTEREST(entry);   // Done
}
int main()
{
  const char interest_period[] = {47, 0};
  interest_calculator_test(interest_period);
  return 0;
}

Listing 5

Surely, this is a short unit test for some functions (supposedly OPEN_INTEREST , READ_INTEREST , INTEREST_VALUE being the function we ‘want’ to test), albeit a pretty poorly written one. However, it seems to be harmless for the moment, and the comment on top clearly says it needs improvement, you decide to keep it in your source code base, hoping that the developer who submitted the patch just had a bad day, and he will come back with clarification lately. The code compiles, so it represents no harm and it does not really disturb anything in the normal flow of the application.

Soon a new patch comes in from the developer (maybe a different one, just to cause some more confusion), intended to have been a fix for something in the build system of the application, concerning the unit test, it looks just like the lines of code in Listing 6.

all:
  ${CC} -UL -UOPEN_INTEREST -UREAD_INTEREST -UCLOSE_INTEREST \
    -UINTEREST_VALUE -DL\(n\)=\&\&n_label\#\#n -DD=__COUNTER__ \
    -DT\(x\,y\)=x\#\#y -include /usr/include/dirent.h -DT2\(x\,y\)=T\(x\,y\) \
    -DDO=T2\(n_label\,D\)\: \
    -DOPEN_INTEREST\(entry\)=entry\=T2\(open\,dir\)\(period\)\; \
    -Dif\(x\)=T2\(go,to\)\ \*array\[x\?D\:D\]\;\ DO -include /usr/include/sys/types.h -Dwhile=if \
    -DREAD_INTEREST\(entry\)=entry\=T2\(read\,dir\)\(\(DIR\*\)earning\)\; \
    -DCLOSE_INTEREST\(entry\)=T2\(close\,dir\)\(\(DIR\*\)earning\)\; -include /usr/include/unistd.h \
    -DINTEREST_VALUE\(INTEREST_VALUE\,t\)=\(\(struct\ T2\(dir\,ent\)\*\)INTEREST_VALUE\)\-\>d_\#\#t \
    -DTEST=void -DDEBUG_INTEREST ${SOURCE}

Listing 6

Yes, seemingly it patches something in compiling of the unit tests, but it’s highly complex, difficult to read (intentionally), and regardless this is not a significant part of the application since the unit tests are just run on your computer.

The first sign of suspicion should have come from the forced include of /usr/include/dirent.h directly from the compiler’s command line... So, this basically makes it possible for the malicious code writer to include a file into the compilation process from the command line, without appearing in the source file, thus avoiding suspicion. If we look further the malicious make entry contains some entries, which disturbingly resemble some commands used to handle directory structure in linux: opendir , readdir and closedir … (And I intentionally left it in this half-baked stage to raise awareness of this kind of issue, and the word ‘label’ was left intentionally in the defines too...)

Other signs of malevolence are the forced redefinitions of the if and while keywords. Unfortunately, there is nothing in the compiler to stop you from doing this, so all this will compile and is considered valid. Although it will only affect this file, there are numerous if s in it so let’s dig a bit more. Soon you realize there is something fishy going on, so you decide to look at the preprocessed source code of this innocent looking unit test. It is in Listing 7.

void interest_calculator_test(char const *period)
{
  static void* array[] = { 
    &&n_label26, &&n_label2, 0,
    &&n_label5,  &&n_label26, 0,
    &&n_label8,  &&n_label2, 0,
    &&n_label11, &&n_label2, 0,
    &&n_label14, &&n_label14,0,
    &&n_label14, &&n_label17, 0,
    &&n_label20, &&n_label20,0,
    &&n_label20, &&n_label23, 0,
    &&n_label2,  &&n_label2};

  void* earning = array, *entry = array;
  char interest[1024] = {0}, *q, type = 3;
  char* c = interest; 
  const char* name = "Monthly";

  earning=opendir(period);

  goto *array[earning == 0?0:1];
n_label2: 

  entry=readdir((DIR*)earning);;

  goto *array[entry != 0?3:4];
n_label5:

  goto *array[((struct dirent*)entry)->d_type
    == 4?6:7];
n_label8:

  goto *array[(*((struct dirent*)entry)->d_name
  != 46 && *(int16_t*)( ((struct dirent*)entry)->
  d_name) != 11822)?9:10]; 
n_label11:

  c = interest; q = (char*)period;

  goto *array[*q?12:13]; 
n_label14: 

  *c++ = *q++;

  goto *array[*q?15:16]; 
n_label17: 

  *c++ = 47;
  q = (char*)((struct dirent*)entry)->d_name;

  goto *array[*q?18:19]; 
n_label20: *c++ = *q++;

  goto *array[*q?21:22]; 
n_label23: 

  *(int16_t*)c = 0x000a;
  printf("Current value: %s", interest);
  *c = 0;
  interest_calculator_test(interest);
  
  goto *array[*q?24:25]; 
n_label26: 

    closedir((DIR*)earning);;
}

Listing 7

To your horror, the source code of the application has changed into something incomprehensible, full of goto s and linux system calls accessing directories, and it seems to be doing something totally different now: it browses the directory structure of your computer (I took the liberty of beautifying some of the preprocessed output, and stripped out main , which is the same) and it prints the directories found to the console.

The evil programmer has used some of the not so well known gcc extensions, such as storing the addresses of labels, and with a carefully constructed array of labels and indexes, he has been able to abuse the usage of the __COUNTER__ macro (the one which gives an increasing sequence of numbers) in order to calculate various jump locations together with generating labels in a coherent way to achieve his real intentions: traversing your filesystem and performing operations on it (for this specific scenario, just printing names).

A few very strange lines of code appear, such as *(int16_t*)( ((struct dirent*)entry)->d_name) != 11822 . However, after some thought, this is nothing but a comparison of the d_name field of the struct dirent structure entry to "..". Because 11822 = 0x2E2E = "..".

In order to facilitate a continuous sequence provided by the __COUNTER__ in the array indexes, and also the label counters, the array contains a set of unused elements, such as zeroes; however, those values can be anything.

At this point, I have stopped, since it is not the intention of this article to publish destructive code but to raise awareness of its existence and provide meaningful ways for detecting and combating them.

Just a side note, for those wanting to do experiments on the code please find below the evil defines to make your experiments easier:

  #define INTEREST void*
  #define L(n) &&n_label##n
  #define D __COUNTER__ 
  #define T(x,y) x##y
  #define T2(x,y) T(x,y) 
  #define DO T2(n_label,D): 
  #define OPEN_INTEREST(entry) \
    entry=T2(open,dir)(period); 
  #define if(x) T2(go,to) *array[x?D:D]; DO
  #define while if 
  #define READ_INTEREST(entry) \
    entry=T2(read,dir)((DIR*)earning); 
  #define CLOSE_INTEREST(entry) \
    T2(close,dir)((DIR*)earning);
  #define INTEREST_VALUE(INTEREST_VALUE,t) \
    ((struct T2(dir,ent)*)INTEREST_VALUE)->d_##t

Conclusion

As we have seen, the threats are real, and this article can by no means offer a full overview of all the software menaces that are present in our everyday life. Since we focused on an open source approach to deceiving techniques, we have tried to make the article as informative as possible without actually turning it into a ‘how to write your own virus’ essay. Please note that besides of presenting a few non-destructive scenarios, there could be several more that have not yet been identified... or that have been omitted intentionally.