More speed! Sergey Ignatchenko completes his (Re)Actor allocator series with Part III.
Start the reactor. Free Mars…
~ Kuato from Total Recall
Disclaimer: as usual, the opinions within this article are those of ‘No Bugs’ Hare, and do not necessarily coincide with the opinions of the translators and Overload editors; also, please keep in mind that translation difficulties from Lapine (like those described in [Loganberry04] ) might have prevented an exact translation. In addition, the translator and Overload expressly disclaim all responsibility from any action or inaction resulting from reading this article.
As we briefly discussed in Part I of this mini-series [NoBugs17a] , message-passing technologies such as (Re)Actors (a.k.a. Actors, Reactors, ad hoc FSMs, and event-driven programs) have numerous advantages - ranging from being debuggable (including post-factum production debugging), to providing better overall performance.
In [NoBugs17a] and [NoBugs17b] , we discussed an approach to handling allocations for (Re)Actors, and then were able to achieve a safe dialect of C++ (that is, as long as we’re following a set of well-defined local rules). Now, let’s take a look at another task which can be facilitated by per-(Re)Actor allocators - specifically, at the task of serializing (Re)Actors that are later going to be de-serialized by the same executable. While one solution for this task was provided in [Ignatchenko-Ivanchykhin16] , the proposed ‘ultra-fast’ serialization is rather cumbersome to maintain, and in most cases it can be beaten performance-wise by serializing at the allocator level.
#define (Re)Actors
To make this article self-contained and make sure that we’re all on the same page with terminology, let’s repeat the definition of our (Re)Actors from [NoBugs17a] .
Let’s name a common denominator for all our (Re)Actors a
GenericReactor
.
GenericReactor
is just an abstract class – and has a pure virtual function,
react()
:
class GenericReactor { virtual void react(const Event& ev) = 0; virtual ~GenericReactor() {} }
Let’s name the piece of code which calls
GenericReactor
’s
react()
Infrastructure Code
. Quite often this call will be within a so-called ‘event loop’ (see Listing 1).
std::unique_ptr<GenericReactor> r = reactorFactory.createReactor(...); while(true) { //event loop Event ev = get_event(); //from select(), libuv, ... r->react(ev); } |
Listing 1 |
Let’s note that the
get_event()
function can obtain events from wherever we want; from
select()
(which is quite common for servers) to libraries such as
libuv
(which is common for clients).
Also let’s note that an event loop such as the one above is by far
not
the only way to call
react()
: I’ve seen implementations of Infrastructure Code ranging from the one running multiple (Re)Actors within the same thread, to another one which deserialized the (Re)Actor from a database (DB), then called
react()
, and then serialized the (Re)Actor back to the DB. What’s important, though, is that even if
react()
can be called from different threads, it
must
be called
as if
it is one single thread. In other words, if necessary, all thread sync should be done
outside
of our (Re)Actor, so
react()
doesn’t need to bother about thread sync regardless of the Infrastructure Code in use.
Finally, let’s name any specific derivative from Generic Reactor (which actually implements our
react()
function) a
SpecificReactor
:
class SpecificReactor : public GenericReactor { void react(const Event& ev) override; };
In addition, let’s observe that whenever the (Re)Actor needs to communicate with another (Re)Actor then – adhering to the ‘Do not communicate by sharing memory; instead, share memory by communicating’ principle – it merely sends a message, and it is only this message which will be shared between (Re)Actors. In turn, this means that we can (and should ) use single-threaded allocation for all (Re)Actor purposes – except for allocation of those messages intended for inter-(Re)Actor communications.
Task: Serialization for the same executable
Now, let’s define what we’re going to do with our (Re)Actor in this article (and why ). Basically, as discussed in [Ignatchenko-Ivanchykhin16] , when dealing with (Re)Actors, we often need to serialize the current state of our (Re)Actor so that we can deserialize it in exactly the same executable (though this executable can run on a completely different computer etc.). Applications of such ‘serialization for exactly the same executable’ are numerous; in particular, it is useful for (see, for example, [NoBugs17c] ):
- Migrating our (Re)Actors (for example, to load-balance them)
- Low-Latency Fault Tolerance for (Re)Actors
- Production post-mortem debugging (serializing the state, plus all inputs after that state, of the (Re)Actor, and then transferring them to developer’s box in case of a crash)
‘Speedy Gonzales’ (Re)actor-level serialization
Now, as we have both our (Re)Actors and our task at hand more-or-less defined, let’s see how well we can do with our per-(Re)Actor allocator.
Actually, it is fairly simple:
-
We re-define global allocator (for example,
malloc()
/free()
, though depending on compiler specifics and options YMMV) so that it acts as a bunch of per-(Re)Actor allocators (or at least per-thread allocators) – more on it belowThis means that within our (Re)Actor, we do not need to specify allocators for each call to ‘new’ and for each collection <phew! />
-
Within our per-(Re)-Actor allocator, we allocate/deallocate OS pages ourselves (via calls such as
VirtualAllocEx()
ormmap()
); of course, we also keep a list of all the OS pages we’re using. - Whenever we need to serialize our (Re)Actor, we simply dump all the pages used by the allocator of this (Re)Actor (with an address of each page serialized) – that’s it!
-
When we need to deserialize our (Re)Actor, we try to allocate OS pages at exactly the same (virtual) addresses as they were originally allocated
If such allocation succeeds for all our serialized pages (which is common – though strictly speaking, not guaranteed – when we’re deserializing a (Re)Actor into a dedicated-for-this-(Re)Actor process, which in turn is common for debugging), we just need to copy pages back from the serialized form into allocator (that’s it)
If allocation at the same address isn’t possible for whatever reason, we have to use a process which is essentially similar to relocation discussed in [NoBugs17a] . Very briefly:
- We have a relocation map , which gives the mapping between ‘old’ page addresses and ‘new’ page addresses.
- At OS level, we make sure the pages with ‘old’ addresses are unreadable.
-
We run
traversing
(as discussed in
[NoBugs17a]
) over the state of our (Re)Actor. During this traversing process, we merely try to access all the elements of the state of our (Re)Actor. Whenever any pointer within our (Re)Actor state happens to point to the ‘old’ page, such access will fail (causing an
access violation
exception). We catch each such an exception, and update the pointer which caused the exception to the ‘new’ value within the corresponding ‘new’ page (calculating the ‘new’ value using the relocation map).
Note that while the traversing of our own collections can easily be handled along the same lines as above, traversing and fixing standard collections can be outright impossible without adding a few lines to them ☹. How to handle this in practice? It depends, but one way to do it is to take a cross-platform STL implementation (such as EASTL), and to add the few lines implementing traversing for each collection you require (it is NOT rocket science for any specific STL).
Bingo! After such traversing of the whole state of our (Re)Actor is completed, we can be sure that all the pointers to the heap within our (Re)Actor are updated with the new values. In turn, it means that all the pointers to the state are already updated, so all the relocations due to serialization are already handled, and we can proceed with normal (Re)Actor processing.
One very important property of such serialization is that it is extremely difficult to beat this kind of serialization performance-wise. Deserialization is a slightly different story due to potential relocation, but it is still pretty fast. Also, for several of the use cases mentioned in the ‘Task: Serialization for the same Executable’ section above, it is only the performance of
serialization
which really matters. Indeed, all we need to do is
memcpy()
for large chunks of RAM, and with
memcpy()
speeds being of the order of 5-10 gigabytes/second at least for x64 (see, for example,
[B14]
), this means that even to serialize a (Re)Actor which has 100MB state, we’re talking about times of the order of 10-20ms. Serializing the same thing using conventional serialization methods (even really fast ones, such as the one discussed in
[Ignatchenko-Ivanchykhin16]
) is going to be significantly slower. The exact numbers depend on the specifics of the organization of the data, but if we have a randomly filled 100MB
std::map<int>
, just iterating over it without any serializing is going to take the order of 100ms, almost an order of magnitude longer (!).
Parallel serialization aided by (kinda-)copy-on-write
For those apps where even 10-20ms of additional latency per 100MB of state is too much, it might be possible to reduce it further.
One of the implementations would work along the following lines (which are ideologically similar, though not identical to, classic Copy-on-Write):
When we want to serialize, we (in our ‘main’ (Re)Actor processing thread):
- create a list of pages to be serialized
- pre-allocate space in some other area of RAM where we want to serialize
-
for all the pages to be serialized, set an ‘already serialized’ parameter to
false
-
mark all the pages to be serialized as ‘no-write’ (using
VirtualAllocEx()
ormprotect()
). - start another ‘serialization’ thread (use an always-existing dedicated serialization thread, take it from thread pool, etc.)
- continue processing the (Re)Actor messages in the main thread
The ‘serialization’ thread just takes pages from the list to be serialized one by one, and for each such page:
- checks if it is already serialized: if yes, skips; and if not, marks it as ‘already serialized’ (which should be done using an atomic CAS-like operation to prevent potential races with the main thread)
-
if it wasn’t already serialized:
- serializes the page into pre-allocated space
- removes the ‘no-write’ protection from the page
Whenever we have write access to one of the ‘no-write’ pages, we catch the appropriate CPU-level exception, and within the exception handler:
Check if the page being accessed is already being serialized (this can happen due to races with the ‘serialization’ thread); this should be done in an atomic manner similar to the ‘serialization’ thread as described above
If the page isn’t serialized yet:
- serialize it into pre-allocated space
- remove the ‘no-write’ protection from the page, so future writes no longer cause any trouble.
That’s it. While with such a processing we do have to copy some of the pages within the main thread (causing some latency), for typical access patterns, this will happen relatively rarely, significantly reducing overall serialization latency observed within the ‘main’ (Re)Actor thread. For example, if out of our 100MB (~=25K pages) (Re)Actor state, only 1000 pages are modified during our 20ms serialization – then the latency cost of the serialization will drop approximately by a factor of 25x, bringing observable latency to around 1ms (which is acceptable for the vast majority of the systems out there, even for first-person-shooter games).
Per-(re)actor allocators in a usable manner
Up to now, we are merely assuming that our allocators can be made per-(Re)Actor; one obvious way of doing this is to have our (Re)Actor code specify our own allocator for each and every allocation within our (Re)Actor (we’ll need to cover both explicit calls to new, and all implicit allocations such as collections).
While such a naïve approach would work in theory, it is way too inconvenient to be used in practice. Fortunately, changing an allocator to a per-(Re)Actor one happens to be possible without any changes to the (Re)Actor code . In particular, it can be done along the following lines.
First, we replace
malloc()
/
free()
(
Important
: make sure that your global
::operator new
/
::operator delete
, and your default
std::allocator
also use the replaced functions (!). The latter might be rather tricky unless your
std
library already uses
::operator new()
/
::operator delete()
, but usually it can be take care of; in particular, for GCC, see
[GCC]
) and the
--enable-libstdcxx-allocator
option for
./configure
of
libstdc++.
)
To implement our own
malloc()
, we’re going along the lines of Listing 2. (Of course,
free()
should go along the same lines.)
thread_local OurAllocator* current_allocator = nullptr; void* malloc(size_t sz) { if( current_allocator ) return current_allocator->malloc(sz); else return non_reactor_malloc(sz); } |
Listing 2 |
The point here is that our Infrastructure Code (the one which calls our (Re)Actor) sets the
current_allocator
pointer before every call to
GenericReactor::react()
(see Listing 3).
current_allocator = create_new_allocator(); std::unique_ptr<GenericReactor> r = reactorFactory.createReactor (current_allocator,...); current_allocator = nullptr; while(true) { //event loop Event ev = get_event(); //from select(), libuv, ... current_allocator = r->allocator; r->react(ev); current_allocator = nullptr; } current_allocator = r->allocator; r.reset(); current_allocator = nullptr; |
Listing 3 |
Of course, this is a kind of trick – but it will work. Very briefly: first, we confine our
current_allocator
variable to the current thread by using
thread_local
, and then within this single thread, we can easily control which allocator is currently used by simple assignments within our Infrastructure Code. One thing to remember when using this way is to make sure that we set
current_allocator
before
each and every
method call of our (Re)Actor (including its constructor and destructor(!)).
That’s it: we’ve made our (Re)Actor use a per-(Re)Actor allocator – and without changing a single line within our (Re)Actor’s code too ☺.
Summary
To summarize this part III of the mini-series on ‘Allocators for (Re)Actors’:
-
Allocator-based serialization for (Re)Actors is both
- Easy to implement in a very generic manner, and
- Extremely fast (for x64 – around tens of ms per 100MB of state)
If necessary, parallel serialization may further reduce latencies (in some cases, down to – very roughly – a single digit ms of latency per 100MB of the state).
- The Allocators can be replaced to make them per-(Re)Actor, without any changes to the (Re)Actor code(!)
And as this part III concludes this mini-series, let’s summarize all our findings (from all three parts).
Part I
-
(Re)Actor-based allocators allow for very efficient allocation, with three potential modes:
-
‘Fast’ mode (no protection, but faster than regular
malloc()
) -
‘kinda-Safe’ mode (with protection from
some
of the memory corruptions)
Here, we introduced a potentially novel method of implementing ‘dangling’ pointer detection in runtime – the one based on ID comparison. Compared to traditional ‘tombstones’ it has better locality, and will usually outperform it.
- ‘kinda-Safe with Relocation’ mode, with added ability to relocate heap objects (this, in turn, allows to avoid dreaded external fragmentation, which tends to eat lots of RAM in long-running programs).
-
‘Fast’ mode (no protection, but faster than regular
- Simple ‘traversing’ interface is sufficient to ensure that all the pointers in the (Re)Actor state are updated
Part II
By adding a few more of easily understood guidelines, we can extend our ‘kinda-Safe’ mode from Part I into ‘really safe’ C++ dialect.
All the guidelines/rules we need to follow are local , which enables reasonable tool-aided enforcement and allows to keep code maintainable.
Part III
- Custom (Re)Actor-based allocator can be used for the all-important for (Re)Actors serialization for the same executable. It is (a) very easy to maintain for (Re)Actor code, and (b) extremely fast.
- Per-(Re)Actor allocators can be implemented without any changes within (Re)Actor itself (i.e. all the necessary changes can be confined to Infrastructure Code).
Phew. It was rather long mini-series, but I hope I have made my points about the significant advantages of allocators specialized for (Re)Actor purposes reasonably clear.
References
[B14] Thomas B, ‘Single-threaded memory performance for dual socket Xeon E5-* systems’, https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/509237
[GCC] ‘The GNU C++ Library Manual’, Chapter 6, https://gcc.gnu.org/onlinedocs/libstdc++/manual/memory.html#allocator.ext
[Ignatchenko-Ivanchykhin16] Sergey Ignatchenko and Dmytro Ivanchykhin, ‘Ultra-fast Serialization of C++ Objects’, Overload #136, Dec 2016
[Loganberry04] David ‘Loganberry’, Frithaes! – an Introduction to Colloquial Lapine!, http://bitsnbobstones.watershipdown.org/lapine/overview.html
[NoBugs17a] ‘No Bugs’ Hare, ‘Allocator for (Re)Actors with Optional Kinda-Safety and Relocation’, Overload #139, Jun 2017
[NoBugs17b] ‘No Bugs’ Hare, ‘A Usable C++ Dialect that is Safe Against Memory Corruption’, Overload #140, Aug 2017
[NoBugs17c] ‘No Bugs’ Hare, ‘Deterministic Components for Interactive Distributed Systems’, ACCU2017, available at http://ithare.com/deterministic-components-for-interactive-distributed-systems-with-transcript/