Manually keeping track of allocated memory and freeing it when it is no longer used is a tedious task for a programmer. For Zimbu we want programmers to be productive and not risk leaking memory, thus memory is managed automatically. This adds overhead, so we need to find ways to keep it as small as possible. And for time-critical pieces of code manual memory management will still be available.
Different methods have been tried and the best one was kept. Below I will first describe the current implementation and then alternatives that were tried and discarded. You can still find the code for the alternatives if you sync to 27 April 2013.
When a block of memory is allocated, space is reserved for a pointer at the start of it. A thread-local pointer "firstUsed" is the head of a linked list: it points to the just allocated block, and the pointer in the block links to the previous value of "firstUsed". The code looks like this:
This is simple and only touches the just allocated block and one thread-local variable, which is good for caching. By following the linked list all allocated memory blocks can be found. Note that a block has no pointer back to the block allocated after it. This saves space, but means that it is not possible to free an arbitrary memory block, because we cannot easily remove it from the linked list. That requires going through the list from start to end, which only happens in the sweep phase.
Garbage collection takes place while the program is not executing. This "stop-the-world" time is bad, but it cannot be avoided (see below); we can only minimize it. There are two phases, of which only the first requires "stop-the-world": the mark phase, which finds all memory blocks still in use, and the sweep phase, which goes through the linked list and frees every block that was not marked.
To find all used memory blocks Zimbu keeps references:
This is only done for variables that point to allocated memory. Those blocks may in turn contain pointers to other memory blocks. Zimbu figures this out at compile time and creates a table for each type, listing where a value of that type contains a pointer and the type of the referenced block. This way the garbage collector can recursively follow the pointers and find all memory actually in use.
For variables on the stack every function has a stack frame. Like the tables made for types, it contains a table of the local variables and their types. Every thread has a "topFrame" variable that points to the most deeply nested function's stack frame. Since the exact locations of the variables depend on the C compiler, they are computed on the first call to the function. It would be nice if this could be done at compile time, but that would require knowing exactly what the C compiler does, which is very difficult. This is the C code that Zimbu generates:
The mark is stored in the lower two bits of the pointer at the start of the block. These are zero at first, because memory is aligned to 32 or 64 bits. The mark toggles between binary "01" and "10" on each collection cycle. This means there is no need to first go through all the allocated blocks to clear the bits: an unmarked pointer either has "00", when it was never marked before, or it has the previous mark, which differs from the current one.
The multi-threading behavior is fairly simple when using a straightforward method:
A disadvantage is that the work is not spread evenly over the threads; we have to wait for the slowest thread to finish. This can be improved by building work lists of pointers still to be marked instead of marking recursively. Each thread then picks a list to work on and creates new lists as it finds more pointers. This also avoids the unbounded recursion that would otherwise occur.
Measurements show that this solution has an overhead of about 40% when using one thread, compared to not doing any memory management (leaking everything).
When using two threads the overhead is reduced to 30% (measuring total execution time). This does require using the tcmalloc library. The default malloc does not work well with threads, the time to free memory quadruples compared to the single-threaded implementation.
A future improvement is to have thread-local allocations. Then the "stop-the-world" is rarely needed, most memory management is local to the thread and only that thread needs to be stopped. This will be visible to the programmer, which creates an extra hassle. On the other hand, it makes explicit what objects are not thread safe and gives the compiler a better chance of warning for race conditions.
This method is very similar to using a linked list. The difference is that memory blocks do not contain a pointer. Instead, flag arrays are allocated that correspond to the memory areas where allocated memory blocks start. An allocated flag is set for every position where a memory block was allocated. The mark flags are kept in these arrays as well.
When a memory block is allocated, a flag is set in the array that corresponds to the memory address. If such an array does not yet exist, it is allocated. Since we expect allocations to happen close together, fairly large arrays are used, each corresponding to a megabyte of memory. Because of alignment a flag is needed only for every 4 or 8 memory locations. With two flags per position and 8 bits in a byte, this results in one byte of flags for every (4 * 8 / 2 =) 16 or (8 * 8 / 2 =) 32 bytes of allocated memory.
It turns out that setting flags in the mark phase is a bit slow, because we first need to find the array related to a pointer and then do bit shifts and a read-modify-write operation. This also means that in the multi-threaded situation locks are needed, probably one for each array. Caching behavior isn't good either: besides accessing the allocated memory to find nested pointers, the flag array is accessed, which can be in a different place every time.
The sweep phase was slow at first. It must find positions where the allocated flag is set and the mark flag is not set, which requires bit shifting. This improved quite a bit by using the __builtin_ffs() function that gcc provides.
Since the sweep phase uses the allocated flags, the "stop-the-world" must include the sweep phase. Otherwise newly allocated memory, which obviously won't be marked, would be freed. On the other hand, the sweep phase can easily be spread over multiple threads, since they are waiting anyway.
When using a single thread this method had an overhead of about 45%. It is slightly slower than the method with a linked list, and quite a bit more complicated.
This was the first method implemented. The idea was that it does not require a "stop-the-world" phase. Keeping reference counts means overhead, but it is a very predictable amount of overhead. Memory is released as soon as it is no longer used. The goal was to minimize the overhead.
Not needing a "stop-the-world" is not quite true. Circular references are possible, which means the reference count does not go down to zero, even though the memory is not actually being used. Some people say it's safe to ignore and just leak memory with circular references. However, my experience is that this makes a long running program run out of memory. We don't want to sacrifice reliability in Zimbu.
Another problem is that with a very large data structure, the moment it is no longer used the program will spend quite a bit of time going through it, following all the pointers, decrementing the reference counts and freeing all the pieces of memory. This can take a substantial amount of time and happen at an unexpected moment. It is especially wasteful when this happens as the program is done and about to exit, which is actually quite common.
It turns out that reference counting is very complicated. Here are a few examples.
When assigning a new value to a variable, the old value that is overwritten must be dereferenced. However, it may be used in the expression used to compute the new value, thus we can't do this right away. This can only be done with a temporary variable:
So now a simple assignment turns into four statements. If instead of "var" the location is complicated or has side effects we also need to keep a pointer to it. Also note the checks for null pointers.
Another complication is function arguments. It is quite possible to allocate memory in a function call without it being referred to by a variable, e.g. method([1, 2, 3]). The list must be referenced at first and dereferenced later, either by the caller or by the called method. The Zimbu implementation does this in the called method: it references arguments at the start and dereferences them before returning.
Handling exceptions in a try/catch statement means that all the variables on the stack need to be dereferenced. This can be done with stack frames, similar to what is done in the mark and sweep method above.
For handling circular references we need to "stop-the-world" and run an algorithm to find unreferenced memory. Several papers have been written on this subject. It is fairly complicated. The algorithm implemented in Zimbu has three stages (see paper "Concurrent Cycle Collection in Reference Counted Systems" by David F. Bacon and V.T. Rajan):
The list of roots is then made empty: each root is either freed or still reachable.
The paper has a proof that the algorithm is correct. Despite that, the algorithm has a bug: it accesses already freed objects. That is why the third stage first puts objects to be freed in a list. This shows how easy it is to make mistakes with reference counting.
The overhead for counting can be reduced by not doing this for variables on the stack. This was implemented but turned out not to be faster. It does add complexity.
Making all this work with threads will make it even more complicated. There are papers written about different solutions, without a clear winner.
Since the reference counting method has too much overhead (about 60% to 80%) and is causing so much complexity it was dropped.
I had the idea to use reference counting when a class has a destructor. For example, suppose that an object holds a reference to an open file. As soon as the object is no longer used the file needs to be closed. In C++ this is done by closing the file in the destructor. We should add this to Zimbu, as it is a useful feature. However, with a garbage collector that kicks in at an unexpected moment you never know when the file will be closed. Reference counting could solve this: as soon as the count goes to zero the destructor is called. Although this seems nice, it gets more complicated when a reference to the object is stored in another object, e.g. a list. Then that list would also require reference counting. That quickly gets very complicated, as any class could end up holding either a normal object or a reference counted object, depending on what is stored inside it.
The alternative is to use normal garbage collection, but handle destructors specially. The main issue to deal with is that the destructor might do anything, including adding a reference to the object itself. E.g., add it to a list of "closed objects". The memory of the object, and all objects it refers to, cannot be freed then.
Note that it will take two GC runs to actually free the objects with a destructor, and also the objects they refer to. This is not efficient, but the alternative is that the programmer needs to keep track of used objects manually, which is a pain.