8 Memory Management Via Regions
8.1 Introduction
C gives programmers complete control over how memory is managed. An expert
programmer can exploit this to write very fast and/or space-efficient
programs. However, bugs that creep into memory-management code can cause
crashes and are notoriously hard to debug.
Languages like Java and ML use garbage collectors instead of leaving
memory management in the hands of ordinary programmers. This makes
memory management much safer, since the garbage collector is written
by experts, and it is used, and, therefore, debugged, by every
program. However, removing memory management from the control of the
applications programmer can make for slower programs.
Safety is the main goal of Cyclone, so we provide a garbage collector.
But, like C, we also want to give programmers as much control over
memory management as possible, without sacrificing safety. Cyclone's
region system is a way to give programmers more explicit control over
memory management.
In Cyclone, objects are placed into regions. A region is simply an
area of memory that is allocated and deallocated all at once (but not for
our two special regions; see below). So to deallocate an object, you
deallocate its region, and when you deallocate a region, you deallocate all
of the objects in the region. Regions are sometimes called ``arenas'' or
``zones.''
Cyclone has four kinds of region:
-
Stack regions
- As in C, local variables are allocated on the
runtime stack; the stack grows when a block is entered, and it
shrinks when the block exits. We call the area on the stack
allocated for the local variables of a block the stack region
of the block. A stack region has a fixed size---it is just large
enough to hold the locals of the block, and no more objects can be
placed into it. The region is deallocated when the block containing
the declarations of the local variables finishes executing. With
respect to regions, the parameters of a function are considered
locals---when a function is called, its actual parameters are placed
in the same stack region as the variables declared at the start of
the function.
- Lexical regions
- Cyclone also has lexical regions, which are so
named because, like stack regions, their lifetime is delimited by the
surrounding scope. Unlike stack regions, however, you can can add new
objects to a lexical region over time. You create a lexical region in
Cyclone with a statement,
region identifier; statement
This declares and allocates a new lexical region, named
identifier, and executes statement. After
statement finishes executing, the region is deallocated.
Within statement, objects can be added to the region, as we
will explain below.
Typically, statement is a compound statement:
{ region identifier;
statement1
...
statementn
}
- The heap region
- Cyclone has a special region called the
heap. There is only one heap, whose type is denoted `H,
and it is never deallocated. New objects can be added to the heap at any
time (the heap can grow). Cyclone uses a garbage collector to
automatically remove objects from the heap when they are no longer needed.
You can think of garbage collection as an optimization that tries to keep
the size of the heap small. (Alternatively, you can avoid garbage
collection all together by specifying the -nogc flag when
building the executable.)
- Dynamic regions
- Stack and lexical regions obey a strictly
last-in-first-out (LIFO) lifetime discipline. This is often
convenient for storing temporary data, but sometimes, the lifetime
of data cannot be statically determined. Such data can be allocated
in a dynamic region. A dynamic region supports deallocation
at (essentially) any program point. However, before the data in a
dynamic region may be accessed, the dynamic region must be
opened. The open operation fails by throwing an exception if
the dynamic region has already been freed. Note that each data
access within a dynamic region does not require a check. Rather,
you can open a given dynamic region once, access the data many times
with no additional cost, and then exit the scope of the open. Thus,
dynamic regions amortize the cost of checking whether or not data
are still live and localize failure points. We describe dynamic
regions in detail in Section 9.7.
Cyclone forbids dereferencing dangling pointers. This restriction is part of
the type system: it's a compile-time error if a dangling pointer (a pointer
into a deallocated region or to a deallocated object) might be dereferenced.
There are no run-time checks of the form, ``is this pointing into a live
region?'' As explained below, each pointer type has a region and objects of
the type may only point into that region.
8.2 Allocation
You can create a new object on the heap using one of a few kinds of
expression:
-
new expr evaluates expr, places the
result into the heap, and returns a pointer to the result. It is
roughly equivalent to
t @ temp = malloc(sizeof(t));
// where t is the type of expr
*temp = expr;
For example, new 17 allocates space for an integer on the
heap, initializes it to 17, and returns a pointer to the space. For
another example, if we have declared
struct Pair { int x; int y; };
then new Pair(7,9) allocates space for two integers on the
heap, initializes the first to 7 and the second to 9, and returns a
pointer to the first.
- new array-initializer allocates space for an
array, initializes it according to array-initializer, and
returns a pointer to the first element. For example,
let x = new { 3, 4, 5 };
declares a new array containing 3, 4, and 5, and initializes
x to point to the first element. More interestingly,
new { for identifier < expr1 : expr2 }
is roughly equivalent to
unsigned int sz = expr1;
t @ temp = malloc(sz * sizeof(t2)); // where t is the type of expr
for (int identifier = 0; identifier < sz; identifier++)
temp[ identifier] = expr2;
That is,
expr1
is evaluated first to get the size of the new array,
the array is allocated, and each element of the array is
initialized by the result of evaluating
expr2.
expr2 may use identifier, which
holds the index of the element currently being initialized.
For example, this function returns an array containing the first
n positive even numbers:
int *@fat n_evens(int n) {
return new {for next < n : 2*(next+1)};
}
Note that:
-
expr1 is evaluated exactly once, while expr2 is evaluated expr1 times.
- expr1 might evaluate to 0.
- expr1 might evaluate to a negative number.
If so, it is implicitly converted to a very large unsigned
integer; the allocation is likely to fail due to insufficient
memory. Currently, this will cause a crash!!
- Currently, for array initializers are the only way to
create an object whose size depends on run-time data.
- malloc(sizeof(type)). Returns a @notnull
pointer to an uninitialized value of type type.
- malloc(n*sizeof(type)) or
malloc(sizeof(type)*n). The type must be a bits-only
type (i.e., cannot contain pointers, tagged unions, zero-terminated
values, etc.) If n is a compile-time constant expression,
returns a @thin pointer with @numelts(n). If
n is not a compile-time constant, returns
a @fat pointer to the sequence of
n uninitialized values.
- calloc(n,sizeof(type)). Similar to
the malloc case above, but returns memory that is zero'd. Therefore,
calloc supports types that are bits-only or zero-terminated.
- malloc(e) where e is an expression not of one
of the above forms. If e is constant, returns a
char *@numelts(e)@nozeroterm otherwise
returns a char *@fat@nozeroterm.
Objects within regions can be created using the following analogous
expressions.
-
rnew(identifier) expr
- rnew(identifier) array-initializer
- rmalloc(identifier,sizeof(type))
- rmalloc(identifier,n*sizeof(type))
- rmalloc(identifier,sizeof(type)*n)
- rmalloc(identifier,e))
- rcalloc(identifier,n,sizeof(type))
Note that new, malloc, calloc,
rnew, rmalloc and rcalloc are keywords.
Here, the first argument specifies a region handle. The Cyclone
library has a global variable Core::heap_region which is a
handle for the heap region. So, for example, rnew
(heap_region) expr allocates memory in the heap region which is
initialized with expr. Moreover, new expr can be
replaced with rnew(heap_region) expr.
The only way to create an object in a stack region is declaring it as
a local variable. Cyclone does not currently support salloc;
use a lexical region instead.
8.3 Common Uses
Although the type system associated with regions is complicated, there are
some simple common idioms. If you understand these idioms, you should be
able to easily write programs using regions, and port many legacy C programs
to Cyclone. The next subsection will explain the usage patterns of unique
and reference-counted pointers, since they are substantially more
restrictive than other pointers.
Remember that every pointer points into a region, and although the
pointer can be updated, it must always point into that same region (or
a region known to outlive that region). The region that the pointer
points to is indicated in its type, but omitted regions are filled in
by the compiler according to context.
When regions are omitted from pointer types in function bodies, the
compiler tries to infer the region. However, it can sometimes be too
``eager'' and end up rejecting code. For example, in
void f1(int * x) {
int * y = new 42;
y = &x;
}
the compiler uses y's initializer to decide that y's type is
int * `H. Hence the assignment is illegal, the parameter's
region (called `f1) does not outlive the heap. On the other
hand, this function type-checks:
void f2(int x) {
int * y = &x;
y = new 42;
}
because y's type is inferred to be int * `f2 and the
assignment makes y point into a region that outlives `f2. We
can fix our first function by being more explicit:
void f1(int * x) {
int *`f1 y = new 42;
y = &x;
}
Function bodies are the only places where the compiler tries to infer
the region by how a pointer is used. In function prototypes, type
declarations, and top-level global declarations, the rules for the
meaning of omitted region annotations are fixed. This is necessary
for separate compilation: we often have no information other than the
prototype or declaration.
In the absence of region annotations, function-parameter pointers are
assumed to point into any possible region. Hence, given
void f(int * x, int * y);
we could call f with two stack pointers, a lexical-region pointer and
a heap-pointer, etc. Hence this type is the ``most useful'' type from
the caller's perspective. But the callee's body (f) may not
type-check with this type. For example, x cannot be assigned a
heap pointer because we do not know that x points into the heap. If
this is necessary, we must give x the type int *`H. Other
times, we may not care what region x and y are in so long as they are
the same region. Again, our prototype for f does not indicate
this, but we could rewrite it as
void f(int *`r x, int *`r y);
Finally, we may need to refer to the region for x or y in the function
body. If we omit the names (relying on the compiler to make up
names), then we obviously won't be able to do so.
Formally, omitted regions in function parameters are filled in by
fresh region names and the function is ``region polymorphic'' over
these names (as well as all explicit regions).
In the absence of region annotations, function-return pointers are
assumed to point into the heap. Hence the following function will not
type-check:
int * f(int * x) { return x; }
Both of these functions will type-check:
int * f(int *`H x) { return x; }
int *`r f(int *`r x) {return x; }
The second one is more useful because it can be called with any
region.
In type declarations (including typedef) and top-level variables,
omitted region annotations are assumed to point into the heap. In the
future, the meaning of typedef may depend on where the
typedef is used. In the meantime, the following code will
type-check because it is equivalent to the first function in the previous
example:
typedef int * foo_t;
foo_t f(foo_t x) { return x; }
If you want to write a function that creates new objects in a region
determined by the caller, your function should take a region handle as
one of its arguments.
The type of a handle is region_t<`r>, where `r is
the region information associated with pointers into the region. For
example, this function allocates a pair of integers into the region
whose handle is r:
$(int,int)*`r f(region_t<`r> r, int x, int y) {
return rnew(r) $(x,y);
}
Notice that we used the same `r for the handle and the return
type. We could have also passed the object back through a pointer
parameter like this:
void f2(region_t<`r> r,int x,int y,$(int,int)*`r *`s p){
*p = rnew(r) $(7,9);
}
Notice that we have been careful to indicate that the region where
*p lives (corresponding to `s) may be different from
the region for which r is the handle (corresponding to
`r). Here's how to use f2:
{ region rgn;
$(int,int) *`rgn x = NULL;
f2(rgn,3,4,&x);
}
The `s and `rgn in our example are unnecessary
because they would be inferred.
typedef, struct, and datatype
declarations can all be parameterized by regions,
just as they can be parameterized by types. For example, here is part
of the list library.
struct List<`a,`r>{`a hd; struct List<`a,`r> *`r tl;};
typedef struct List<`a,`r> *`r list_t<`a,`r>;
// return a fresh copy of the list in r2
list_t<`a,`r2> rcopy(region_t<`r2> r2, list_t<`a> x) {
list_t result, prev;
if (x == NULL) return NULL;
result = rnew(r2) List{.hd=x->hd,.tl=NULL};
prev = result;
for (x=x->tl; x != NULL; x=x->tl) {
prev->tl = rnew(r2) List(x->hd,NULL);
prev = prev->tl;
}
return result;
}
list_t<`a> copy(list_t<`a> x) {
return rcopy(heap_region, x);
}
// Return the length of a list.
int length(list_t x) {
int i = 0;
while (x != NULL) {
++i;
x = x->tl;
}
return i;
}
The type list_t<type,rgn> describes
pointers to lists whose elements have type type and whose
``spines'' are in rgn.
The functions are interesting for what they don't say.
Specifically, when types and regions are omitted from a type
instantiation, the compiler uses rules similar to those used for
omitted regions on pointer types. More explicit versions of the
functions would look like this:
list_t<`a,`r2> rcopy(region_t<`r2> r2, list_t<`a,`r1> x) {
list_t<`a,`r2> result, prev;
...
}
list_t<`a,`H> copy(list_t<`a,`r> x) { ... }
int length(list_t<`a,`r> x) { ... }
8.4 Type-Checking Regions
Because of recursive functions, there can be any number of live
regions at run time. The compiler uses the following general strategy to
ensure that only pointers into live regions are dereferenced:
-
Use compile-time region names. Syntactically these are
just type variables, but they are used differently.
- Decorate each pointer type and handle type with one or more region names.
- Decorate each program point with a (finite) set of region names.
We call the set the point's capability.
- To dereference a pointer (via *, ->, or
subscript), each region name that appears in the pointer's type name
must be in the program point's capability. Similarly, to use a
handle for allocation, the handle type's region name must be in the
capability.
- Enforce a type system such that the following is impossible: A
program point P's capability contains a region name `r that
decorates a pointer (or handle) expression expr that, at
run time, points into a region that has been deallocated and the
operation at P dereferences expr.
This strategy is probably too vague to make sense at this point, but
it may help to refer back to it as we explain specific aspects of the
type system.
Note that in the rest of the documentation (and in common parlance) we
abuse the word ``region'' to refer both to region names and to
run-time collections of objects. Similarly, we confuse a block of
declarations, its region-name, and the run-time space allocated for
the block. (With loops and recursive functions, ``the space
allocated'' for the block is really any number of distinct regions.)
But in the rest of this section, we painstakingly distinguish
region names, regions, etc.
8.4.1 Region Names
Given a function, we associate a distinct region name with each
program point that creates a region, as follows:
-
If a block (blocks create stack regions) has label L,
then the region-name for the block is `L.
- If a block has no label, the compiler makes up a fresh
region-name for the block.
- In region r <`foo> s, the region-name for the construct
is `foo.
- In region r s, the region-name for the construct is
`r.
- In region h = open(k) s, the region-name for the construct is
`r, assuming k has type region_key_t<`r,_>
(where _ is some other region name of no consequence).
The region name for the heap is `H. Region names associated
with program points within a function should be distinct from each
other, distinct from any region names appearing in the function's
prototype, and should not be `H. (So you cannot use
H as a label name, for example.) Because the function's
return type cannot mention a region name for a block or
region-construct in the function, it is impossible to return a pointer
to deallocated storage.
In region r <`r> s, region r s, and region r =
open(k) s the type of r is region_t<`r> (assuming, that
k has type region_key_t<`r,_>). In other words, the
handle is decorated with the region name for the construct. Pointer types'
region names are explicit, although you generally rely on inference to put
in the correct one for you.
8.4.2 Capabilities
In the absence of explicit effects (see below), the capability for a
program point includes exactly:
-
`H
- The effect corresponding to the function's prototype. Briefly,
any region name in the prototype (or inserted by the compiler due to
an omission) is in the corresponding effect. Furthermore, for each
type variable `a that appears (or is inserted),
``regions(`a)'' is in the corresponding effect. This latter
effect roughly means, ``I don't know what `a is, but if you
instantiate with a type mentioning some regions, then add those
regions to the effect of the instantiated prototype.'' This is
necessary for safely type-checking calls that include function pointers.
- The region names for the blocks and ``region r s''
statements that contain the program point
For each dereference or allocation operation, we simply check that the
region name for the type of the object is in the capability. It takes
extremely tricky code (such as existential region names) to make the
check fail.
8.4.3 Type Declarations
Each pointer and region handle type is decorated with a set of region
names. This set of region names is referred to as the store or effect
qualifier of a type. For instance a pointer type might be
int *`r+`H
. This indicates the type of pointer to an integer
that resides in the region named `r or the heap region, `H. Similarly,
a region handle type might be region_t<`r1+`r2>
which indicates
a handle to region named `r1 or `r2.
A struct, typedef, or datatype declaration
may be parameterized by any number of effect qualifiers. The store
qualifiers are placed in the list of type parameters. They may be
followed by their kind; i.e. ``::E''. For example, given
struct List<`a,`r>{`a hd; struct List<`a,`r> *`r tl;};
the type struct List<int,`H> is for a list of ints in the
heap, while the type struct List<int,`l> is for a list of
ints in some lexical region. Notice that all of the ``cons cells'' of
the List will be in the same region (the type of the
tl field uses the same region name `r that is used
to instantiate the recursive instance of struct List<`a,`r>).
However, we could instantiate `a with a pointer type that has
a different region name.
8.4.4 Subtyping and Effect Qualifiers
A pointer type's effect qualifier is part of the type. If e1
and e2 are pointers, then e1 = e2 is well-typed only
if all the regions names mentioned in the the effect qualifier for
e2's type also appears in the effect qualifier of
e1's type. For instance, both assignments to b below is
legal, while the assignment to a is not.
void foo(int *`r a) {
int *`r+`H b = a;
if(!a) b = new 1;
a = b;
}
The store qualifier in the type of b indicates that it can
point into the region named `r or into the heap region. Therefore,
initializing b with a pointer into `r is certainly consistent
with its store qualifier. Similarly, the second assignment to b
is legal since it is updated to point to the heap region. However, the
assignment to a is not permitted since the declared type of
a claims that it is pointer into the region named `r alone and
b may actually point to the heap.
For handles, if `r is a region name, there is at most one
value of type region_t<`r> (there are 0 if `r is a
block's name), so there is little use in creating variables of type
region_t<`r>. However, it could be useful to create
variables of type region_t<`r1+`r2>
. This is the type of
handle to either the region named `r1 or to the region named
`r2. This is illustrated by the following code:
void foo(int a, region_t<`r> h) {
region_t<`r+`H> rh = h;
if(a) {
rh = heap_region;
}
}
As always, this form of subtyping for effect qualifiers is not
permitted under a reference. Thus, the assignment in the program below
is not legal.
void foo(int *`H *`r a) {
int *`r+`H *`r b = a;
}
8.4.5 Function Calls
If a function parameter or result has type int *`r or
region_t<`r>, the function is polymorphic over the region name
`r. That is, the caller can instantiate `r with any
region in the caller's current capability as long as the region has
the correct kind. This instantiation is usually implicit, so the caller just
calls the function and the compiler uses the types of the actual arguments
to infer the instantiation of the region names (just like it infers the
instantiation of type variables).
The callee is checked knowing nothing about `r except that it is in
its capability (plus whatever can be determined from explicit outlives
assumptions), and that it has the given kind. For example, it will be
impossible to assign a parameter of type int*`r to a global
variable. Why? Because the global would have to have a type that allowed
it to point into any region. There is no such type because we could never
safely follow such a pointer (since it could point into a deallocated
region).
8.4.6 Explicit and Default Effects
If you are not using existential types, you now know everything you
need to know about Cyclone regions and memory management. Even if you
are using these types and functions over them (such as the closure
library in the Cyclone library), you probably don't need to know much more
than ``provide a region that the hidden types outlive''.
The problem with existential types is that when you ``unpack'' the
type, you no longer know that the regions into which the fields point
are allocated. We are sound because the corresponding region names
are not in the capability, but this makes the fields unusable. To
make them usable, we do not hide the capability needed to use them.
Instead, we use a region bound that is not existentially
bound.
If the contents of existential packages contain only heap pointers,
then `H is a fine choice for a region bound.
These issues are discussed in Section 13.