
Memory management

Manually keeping track of allocated memory and freeing it when it is no longer used is a tedious task for a programmer. For Zimbu we want programmers to be productive and not risk leaking memory, thus memory is managed automatically.  This adds overhead, so we need to find ways to keep it as small as possible.  And for time-critical pieces of code manual memory management will still be available.

Different methods have been tried and the best one was kept.  Below I will first describe the current implementation and then alternatives that were tried and discarded. You can still find the code for the alternatives if you sync to 27 April 2013.

Mark and Sweep with a linked list

When a block of memory is allocated, space is reserved for a pointer at the start of it. A thread-local pointer "firstUsed" is the head of a linked list: it points to the just allocated block, and the pointer in the block links to the previous value of "firstUsed". The code looks like this:
void *p = alloc(size + sizeof(void*));   /* room for the block plus the list pointer */
((Zoh *)p)->np = e->firstUsed;           /* link to the previous head of the list */
e->firstUsed = p;                        /* the new block becomes the head */
This is simple and only touches the just allocated block and a thread-local variable, which is good for caching.  By following the linked list all allocated memory blocks can be found.  Note that there is no pointer to the previous block.  This saves space, but means that it is not possible to free an arbitrary memory block, because we cannot easily remove it from the linked list.  That would require going through the list from start to end, which only happens in the sweep phase.
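As an illustration, a minimal sketch of what walking the list in the sweep phase could look like; the Zoh struct comes from the snippet above, while isMarked() and the use of free() are placeholders for the real implementation:

Zoh *block = e->firstUsed;
Zoh *kept = NULL;              /* blocks that survive this sweep */
while (block != NULL) {
  Zoh *next = block->np;       /* remember the link before possibly freeing */
  if (isMarked(block)) {
    block->np = kept;          /* still in use: relink into the surviving list */
    kept = block;
  } else {
    free(block);               /* not reachable: release the memory */
  }
  block = next;
}
e->firstUsed = kept;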

Garbage collection takes place while the program itself is not executing.  This "stop-the-world" time is bad, but it cannot be avoided (see below).  We can minimize the time it takes.  There are two phases; only the first phase requires "stop-the-world":
  1. Mark all blocks that can still be reached.
  2. Sweep: go through the list and free the memory that was not marked.
To find all used memory blocks Zimbu starts from a set of root references:
  • All global variables. This includes variables in modules and the SHARED section of classes.
  • All variables used on the stack.  This includes the current object of a method, method arguments and local variables.
  • Variables used internally. This is the context for try-catch statements and pending exceptions.
This is only done for variables that point to allocated memory. Those memory blocks may in turn contain pointers to other memory blocks. Zimbu figures this out at compile time and creates a table for each type, listing where the type contains a pointer and what the type of the referenced block is.  This way the garbage collector can recursively go through the pointers and find all memory actually in use.
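As a sketch of the idea, the table for a type could list, for every pointer field, its offset and the table describing the referenced type; the names below are invented for illustration and do not match the generated names shown further on:

/* One entry per pointer field in a type: the offset of the field and the
 * table for the type it points to.  An entry with a NULL table ends the list. */
typedef struct FieldRef {
  long off;                       /* byte offset of the pointer field */
  struct FieldRef *refType;       /* table of the referenced type */
} FieldRef;

/* Recursively mark a block and everything reachable from it.
 * isMarked() and setMark() are placeholders. */
static void markBlock(void *block, FieldRef *type) {
  if (block == NULL || isMarked(block))
    return;
  setMark(block);
  for (FieldRef *f = type; f->refType != NULL; ++f)
    markBlock(*(void **)((char *)block + f->off), f->refType);
}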

For variables on the stack every function has a stack frame. Like what is done for global variables, it contains a table of the local variables and their types. Every thread has a "topFrame" variable that points to the most deeply nested function's stack frame. Since the exact locations of the variables depend on the compiler, the locations are computed on the first call to the function. It would be nice if this could be done at compile time, but it would require knowing exactly what the C compiler does, which is very difficult. This is the C code that Zimbu generates:
Zfo IWufo[2] = {{0,ToIUu},{0,0}}; /* the table with offsets */
void IWu(IUu *t) {
  Zsf sf;                  /* the stack frame */
  Zenv *e = ZgetEnv();     /* get the thread-local environment */
  static int sfF = 0;      /* flag indicating the table has been filled */
  if (!sfF) {
    sfF = 1;
    IWufo[0].off = (void*)&sf - (void*)&t;  /* compute offset for "t" */
  }
  sf.frof = IWufo;         /* set stack frame offset table */
  sf.prev = e->topFrame;   /* make this stack frame the top frame */
  e->topFrame = &sf;
  sf.pos = 123;            /* position in the code, used for backtrace */
  /* useful code goes here */
  e->topFrame = sf.prev;   /* remove the stack frame */
  return;
}

The mark is stored in the lower two bits of the pointer.  These are zero at first, because memory is aligned to 32 or 64 bits. The mark toggles between "01" and "10" binary. This means there is no need to first go through all the allocated blocks and clear the bits. An unmarked pointer either has "00", when it was never marked before, or it has a previous mark, which differs from the current one.
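A minimal sketch of how the mark bits could be tested and set, assuming the mark is kept in the low bits of the block's list pointer; the names and the exact bit layout are assumptions:

#include <stdint.h>

#define MARK_MASK 0x3   /* the two low bits of the list pointer hold the mark */

/* "curMark" toggles between 1 (binary 01) and 2 (binary 10) on every GC run,
 * so old marks never need to be cleared: any value other than curMark,
 * including 00 for blocks that were never marked, counts as unmarked. */
static int isMarked(Zoh *block, uintptr_t curMark) {
  return ((uintptr_t)block->np & MARK_MASK) == curMark;
}

static void setMark(Zoh *block, uintptr_t curMark) {
  block->np = (void *)(((uintptr_t)block->np & ~(uintptr_t)MARK_MASK) | curMark);
}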

The multi-threading behavior is fairly simple when using a straightforward method:
  1. There is one thread initiating the garbage collection (GC); all others are normal threads.
  2. Every normal thread is told to stop work at a safe point (when all used memory is accounted for).
  3. Every normal thread hands over its list of allocated memory to the GC thread and starts a new list.
  4. Every normal thread goes through its stack frames to mark memory as used. It is possible that two threads find the same pointer and both set the mark.  That is OK: since they will both write the same value, it doesn't matter which write comes first.  They will both recurse into the same structure, which is inefficient but does no harm.
  5. The GC thread marks global variables as used.
  6. When all threads are done marking, the normal threads continue their work.
  7. The GC thread goes through the lists of allocated memory it got from the normal threads and frees the blocks that have not been marked.
  8. The GC thread stores the list of still allocated blocks for the next round.
A disadvantage is that the work is not spread evenly between threads: we have to wait for the slowest thread to finish. This can be improved by building work lists of pointers still to be marked, instead of marking recursively.  Each thread then picks a list to work on and creates new lists as it finds more pointers. This also avoids the unbounded recursion that would otherwise occur.
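A minimal sketch of the work-list idea; all names are invented, and the synchronization needed for threads to claim lists is left out:

#define WORK_CHUNK 1024

typedef struct WorkList {
  void *items[WORK_CHUNK];    /* pointers that still need to be marked */
  int count;
  struct WorkList *next;      /* next chunk, which any thread can claim */
} WorkList;

static void markFromWorkList(WorkList *work) {
  while (work != NULL) {
    while (work->count > 0) {
      void *block = work->items[--work->count];
      /* markAndPush() is a placeholder: it sets the mark if not set yet and
       * pushes the block's outgoing pointers onto the work list, allocating
       * a new chunk when the current one is full. */
      markAndPush(block, &work);
    }
    work = work->next;
  }
}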

Measurements show that this solution has an overhead of about 40% when using one thread, compared to not doing any memory management (leaking everything).

When using two threads the overhead is reduced to 30% (measuring total execution time). This does require using the tcmalloc library.  The default malloc does not work well with threads; the time to free memory quadruples compared to the single-threaded implementation.

A future improvement is to have thread-local allocations. Then the "stop-the-world" is rarely needed, most memory management is local to the thread and only that thread needs to be stopped. This will be visible to the programmer, which creates an extra hassle.  On the other hand, it makes explicit what objects are not thread safe and gives the compiler a better chance of warning for race conditions.

Mark and sweep with flag arrays

This method is very similar to using a linked list.  What differs is that memory blocks do not contain a pointer. Instead, arrays are allocated that correspond to the memory areas where allocated memory blocks live.  An allocated flag is set for every position where a memory block starts, and mark flags are kept in these arrays as well.

When a memory block is allocated, a flag is set in the array that corresponds to the memory address. If such an array does not yet exist it is allocated.  Since we expect allocations to happen close together, fairly large arrays are used, each covering a megabyte of memory.  Because of alignment a flag is only needed for every 4 or 8 memory locations.  We can put 8 flags in a byte and need two flags per location, so this results in one byte of flags for every 4 * 8 / 2 = 16 or 8 * 8 / 2 = 32 bytes of allocated memory.
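A minimal sketch of how an address could be mapped to its flag bits, assuming 8-byte alignment, one flag array per megabyte and two flags (allocated and mark) per position; the names and the array lookup are invented:

#include <stdint.h>

#define REGION_SIZE (1024 * 1024)   /* each flag array covers one megabyte */
#define ALIGNMENT   8               /* allocations are 8-byte aligned */

static void setAllocatedFlag(void *p) {
  uintptr_t addr   = (uintptr_t)p;
  uint8_t *flags   = findFlagArray(addr / REGION_SIZE);  /* placeholder lookup */
  uintptr_t slot   = (addr % REGION_SIZE) / ALIGNMENT;   /* which aligned position */
  uintptr_t byteIx = slot / 4;        /* 4 positions of 2 bits each per byte */
  int shift = (int)(slot % 4) * 2;    /* bit pair within the byte */

  flags[byteIx] |= (uint8_t)(1 << shift);   /* the allocated flag */
  /* the mark flag for this position would be (1 << (shift + 1)) */
}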

It turns out that setting flags in the mark phase is a bit slow, because we first need to find the array related to a pointer and then do bit shifts and a read-modify-write operation. This also means that in the multi-threaded situation locks are needed, probably for each array block. Caching behavior also isn't good: besides accessing the allocated memory to find nested pointers, the flag array is accessed, which can be in a different place every time.

The sweep phase was slow at first.  It must find locations where the allocated flag is set and the mark flag is not set.  This requires bit shifting. This improved quite a bit by using the __builtin_ffs() function that gcc provides.
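A sketch of scanning one 32-bit word of flags with __builtin_ffs(), under the same assumed layout as above (even bits allocated, odd bits mark); freeBlock() and the address computation are placeholders:

#define ALLOC_BITS 0x55555555u   /* ...0101: the even (allocated) bits */

static void sweepWord(uint32_t flags, uintptr_t regionBase, uintptr_t wordIndex) {
  uint32_t allocated = flags & ALLOC_BITS;
  uint32_t marked = (flags >> 1) & ALLOC_BITS;   /* mark bits shifted down */
  uint32_t garbage = allocated & ~marked;        /* allocated but not marked */

  while (garbage != 0) {
    int bit = __builtin_ffs((int)garbage) - 1;   /* index of the lowest set bit */
    int slot = bit / 2;                          /* 2 flag bits per position */
    /* 16 positions per 32-bit word, ALIGNMENT bytes of memory per position */
    void *block = (void *)(regionBase + (wordIndex * 16 + slot) * ALIGNMENT);
    freeBlock(block);
    garbage &= garbage - 1;                      /* clear the handled bit */
  }
}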

Since the sweep phase uses the allocated flags, the "stop-the-world" must include the sweep phase.  Otherwise newly allocated memory, which obviously won't be marked, would be freed. On the other hand, the sweep phase can easily be spread over multiple threads, since they are waiting anyway.

When using a single thread this method had an overhead of about 45%. It's slightly slower than the method with a linked list.  It is quite a bit more complicated.

Reference counting

This was the first method implemented. The idea was that it does not require a "stop-the-world" phase. Keeping reference counts means overhead, but it is a very predictable amount of overhead.  Memory is released as soon as it is no longer used. The goal was to minimize the overhead.

Not needing a "stop-the-world" is not quite true. Circular references are possible, which means the reference count does not go down to zero, even though the memory is not actually being used.  Some people say it's safe to ignore and just leak memory with circular references.  However, my experience is that this makes a long running program run out of memory.  We don't want to sacrifice reliability in Zimbu.

Another problem is that when using a very large data structure, the moment it is no longer used the program will spend quite a bit of time going through it, following all the pointers, decrementing the references and freeing all the pieces of memory. This can take a substantial amount of time and happen at an unexpected moment. It is wasteful if this happens when the program is done and is about to exit, which is actually quite common.

It turns out that reference counting is very complicated. Here are a few examples.
When assigning a new value to a variable, the old value that is overwritten must be dereferenced.  However, it may be used in the expression used to compute the new value, thus we can't do this right away. This can only be done with a temporary variable:
 a = *var;                                 /* remember the old value */
 *var = compute();                         /* store the new value */
 if (*var) ++*(Ti*)(*var);                 /* reference the new value */
 if (a && --*(Ti*)(a) == 0) ZunRef((void*)a, ToBytes);   /* dereference the old value, free it when unused */
So now a simple assignment turns into four statements.  If instead of "var" the location is complicated or has side effects we also need to keep a pointer to it. Also note the checks for null pointers.

Another complication is function arguments. It is quite possible to allocate memory in a function call without it being referred to by a variable, e.g. method([1, 2, 3]).  The list must be referenced at first and dereferenced later, either by the caller or by the called method.  The Zimbu implementation does this in the called method: it references arguments at the start and dereferences them before returning.
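A sketch of what the generated code for the called method could look like, reusing the Ti cast and ZunRef() from the assignment example above; the method signature and the ToList type descriptor are invented for the example:

void method(void *arg) {
  if (arg) ++*(Ti*)arg;            /* reference the argument on entry */

  /* useful code goes here */

  if (arg && --*(Ti*)arg == 0)     /* dereference before returning and free */
    ZunRef(arg, ToList);           /*   when this was the last reference */
}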

Handling exceptions in a try/catch statement means that all the variables on the stack need to be dereferenced. This can be done with stack frames, similar to what is done in the mark and sweep method above.

For handling circular references we need to "stop-the-world" and run an algorithm to find unreferenced memory.  Several papers have been written on this subject.  It is fairly complicated. The algorithm implemented in Zimbu has three stages (see paper "Concurrent Cycle Collection in Reference Counted Systems" by David F. Bacon and V.T. Rajan):
  1. Go through all the root objects, mark the objects visited with GREY, then recursively go to linked cyclic objects, marking with GREY and decrementing reference counts.
  2. Go through all the root objects, if the object is marked GREY then check the reference count. When non-zero mark all linked cyclic objects BLACK and recursively increment the reference count. Else mark as WHITE and recurse into children.
  3. Go through all the root objects, find objects marked WHITE and not a root, visit all their children and put the object in cycleFree. Then free the objects in cycleFree.
The list of roots is then made empty: each root is either freed or still reachable.

The paper has a proof that it is correct.  Despite that, the algorithm has a bug: it accesses already freed objects.  That's why the third stage puts objects to be freed in a list first. This shows how easy it is to make mistakes with reference counting.
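As a much simplified, single-threaded sketch of the three stages (the bookkeeping of which objects are still in the root list is left out, and all names are invented):

#include <stdlib.h>

typedef enum { BLACK, GREY, WHITE } Color;

typedef struct Obj {
  int rc;                  /* reference count */
  Color color;
  int nchildren;
  struct Obj **children;   /* pointers this object holds */
  struct Obj *cycleNext;   /* link for the cycleFree list */
} Obj;

static Obj *cycleFree = NULL;

/* Stage 1: trial-delete internal references.  Everything reachable from a
 * root is marked GREY and the counts of its children are decremented. */
static void markGrey(Obj *s) {
  if (s->color == GREY) return;
  s->color = GREY;
  for (int i = 0; i < s->nchildren; ++i) {
    --s->children[i]->rc;
    markGrey(s->children[i]);
  }
}

/* Stage 2 helper: the object turned out to be externally referenced, so
 * restore the counts of its children and mark everything reachable BLACK. */
static void scanBlack(Obj *s) {
  s->color = BLACK;
  for (int i = 0; i < s->nchildren; ++i) {
    ++s->children[i]->rc;
    if (s->children[i]->color != BLACK) scanBlack(s->children[i]);
  }
}

/* Stage 2: GREY objects with a remaining count are live; the rest are part
 * of an unreachable cycle and are marked WHITE. */
static void scan(Obj *s) {
  if (s->color != GREY) return;
  if (s->rc > 0) {
    scanBlack(s);
  } else {
    s->color = WHITE;
    for (int i = 0; i < s->nchildren; ++i) scan(s->children[i]);
  }
}

/* Stage 3: collect the WHITE objects into cycleFree first and free them only
 * afterwards, so no pointer into already freed memory is ever followed. */
static void collectWhite(Obj *s) {
  if (s->color != WHITE) return;
  s->color = BLACK;                /* avoid collecting the same object twice */
  for (int i = 0; i < s->nchildren; ++i) collectWhite(s->children[i]);
  s->cycleNext = cycleFree;
  cycleFree = s;
}

static void collectCycles(Obj **roots, int nroots) {
  for (int i = 0; i < nroots; ++i) markGrey(roots[i]);
  for (int i = 0; i < nroots; ++i) scan(roots[i]);
  for (int i = 0; i < nroots; ++i) collectWhite(roots[i]);
  while (cycleFree != NULL) {      /* free only after all visiting is done */
    Obj *next = cycleFree->cycleNext;
    free(cycleFree);
    cycleFree = next;
  }
}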

The overhead for counting can be reduced by not doing this for variables on the stack. This was implemented but turned out not to be faster. It does add complexity.

Making all this work with threads will make it even more complicated.  There are papers written about different solutions, without a clear winner.

Since the reference counting method has too much overhead (about 60% to 80%) and causes so much complexity, it was dropped.

Reference counting for destructors

I had the idea to use reference counting for when a class has a destructor.  For example, suppose that an object holds a reference to an open file.  As soon as the object is no longer used the file needs to be closed.  In C++ this is done by closing the file in the destructor.  We should add this to Zimbu, as it is a useful feature.  However, when using a garbage collector that kicks in at an unexpected moment you never know when the file will be closed. Reference counting could solve this: as soon as the count goes to zero the destructor is called.  Although this seems nice, it gets more complicated when a reference to the object is stored in another object, e.g. a list.  Then that list would also require reference counting. That quickly gets very complicated, as any class could end up having both normal and reference-counted instances, depending on what is stored inside it.

The alternative is to use normal garbage collection, but handle destructors specially.  The main issue to deal with is that the destructor might do anything, including adding a reference to the object itself, e.g. adding it to a list of "closed objects".  The memory of the object, and all objects it refers to, cannot be freed then.  Therefore objects with a destructor get special treatment:
  1. Do the normal mark phase.
  2. Move objects that are not marked and have a destructor to the "to be destructed" list. Mark the objects they refer to as used, as if the object is still in use.
  3. Do the sweep phase, free memory of unmarked objects.
  4. Invoke the destructors (this can be done in parallel with the sweep phase).  A flag is set that the destructor was called, with the effect that the object is handled as if it does not have a destructor.
Note that it will take two GC runs to actually free the objects with a destructor, and also the objects they refer to.  This is not efficient, but the alternative is that the programmer needs to keep track of used objects manually, which is a pain.
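A sketch of the extra step between the mark and sweep phases; all helper names are invented, and markRecursively() stands for whatever the normal mark phase uses:

/* Unmarked objects with a destructor are not freed yet: they are collected
 * in a separate list and kept alive, together with everything they refer to. */
void *toBeDestructed = NULL;

for (void *block = e->firstUsed; block != NULL; block = nextBlock(block)) {
  if (!isMarked(block) && hasDestructor(block) && !destructorCalled(block)) {
    addToList(&toBeDestructed, block);
    markRecursively(block);   /* keeps the object and everything it refers to */
  }
}

/* After the mark phase, possibly in parallel with the sweep: run the
 * destructors and set the flag, so that the next GC run treats these
 * objects as if they do not have a destructor and frees them normally. */
for (void *block = firstInList(toBeDestructed); block != NULL; block = listNext(block)) {
  callDestructor(block);
  setDestructorCalled(block);
}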
