Disk management: Global GC Final design
This is the design of our final solution. It extends the Epoch solution ([See here](3.4. global garbage collector epochs)) and reuses some of its concepts.
We define that a node has a reference to an object if and only if at least one of the following holds:
- Alias: Any object that has an alias has 1 reference pointing to it (the alias counts as 0 or 1 references).
- Federation: Any federated object has 1 reference pointing to it (federation counts as 0 or 1 references).
- Memory: The object is present in the memory of the node.
- Other objects: The object is pointed to by another object.
- Sessions: The object is being used by a session ([More info here](3.3. global garbage collector activity counters))
Only the owner node keeps a reference counter. That is, the only node that counts the references described above is the owner of the object. Therefore, every time any other node creates a new reference to the object, the owner node must be notified. Since there is only one counter per object, the scalability problem described in the previous solution is avoided.
The garbage collector is located in the Storage Location so that it is language-independent. This also makes sense because it manages the disk.
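As a minimal sketch of the single-counter idea, the owner node could keep one table that maps each object it owns to a reference count and apply every change reported to it. The class name `OwnerReferenceCounter` and its methods are hypothetical and only illustrate the rule above; they are not dataClay's API.

```python
from collections import defaultdict


class OwnerReferenceCounter:
    """Hypothetical table kept only by the owner node: object id -> count."""

    def __init__(self):
        self._counts = defaultdict(int)

    def apply_delta(self, object_id, delta):
        """Apply a reference change reported by any node (owner included)."""
        self._counts[object_id] += delta
        return self._counts[object_id]

    def count(self, object_id):
        # .get() does not create an entry for unknown objects
        return self._counts.get(object_id, 0)


owner = OwnerReferenceCounter()
owner.apply_delta("obj-1", +1)  # node A reports a new reference
owner.apply_delta("obj-1", +1)  # node B reports a new reference
owner.apply_delta("obj-1", -1)  # node A reports that it dropped its reference
```

Because no other node keeps a counter for `obj-1`, the memory used for counting grows with the number of objects, not with the number of nodes.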
The garbage collector has three threads: one that notifies other nodes about references, one that processes reference counting, and one that removes objects. They are described in turn below.
Each node notifies the other nodes about references to objects that belong to them. To do so, we define a notification time T, so that each node notifies the other nodes every T seconds, plus latency. This alone does not yet solve the race condition problem.
We do not use a ring message because it is not worth it: it saves little communication and is difficult to implement.
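The periodic notification can be pictured as a buffer of pending reference changes, grouped by owner node and flushed once per period T. The sketch below is an assumption about how such a buffer might look: `NotificationBuffer`, `record`, `flush` and the `send` callback are hypothetical names, and a real implementation would call `flush` from a timer every T seconds.

```python
from collections import defaultdict


def _empty_pending():
    # owner node -> object id -> accumulated delta
    return defaultdict(lambda: defaultdict(int))


class NotificationBuffer:
    """Accumulates reference deltas for objects owned by other nodes."""

    def __init__(self, period_seconds):
        self.period_seconds = period_seconds  # the notification time T
        self._pending = _empty_pending()

    def record(self, owner_node, object_id, delta):
        self._pending[owner_node][object_id] += delta

    def flush(self, send):
        """Send one batched message per owner, then start a new period.

        `send(owner_node, deltas)` is a hypothetical transport callback.
        """
        pending, self._pending = self._pending, _empty_pending()
        for owner_node, deltas in pending.items():
            nonzero = {oid: d for oid, d in deltas.items() if d != 0}
            if nonzero:
                send(owner_node, nonzero)


sent = []
buf = NotificationBuffer(period_seconds=5)
buf.record("node-B", "obj-1", +1)
buf.record("node-B", "obj-1", +1)
buf.record("node-B", "obj-2", +1)
buf.record("node-B", "obj-2", -1)  # cancels out, nothing to report
buf.record("node-C", "obj-3", -1)
buf.flush(lambda owner_node, deltas: sent.append((owner_node, deltas)))
```

Batching means that changes which cancel out within one period never reach the owner, at the cost of the owner's counter being up to T plus latency behind.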
To avoid a large performance impact during GET/STORE operations in the database, a dedicated thread processes the reference counting of each object.
Along with each object, the references it points to are serialized. This information is located at the end of the object's bytes, so every time there is an UPDATE/STORE/GET operation we can easily 'extract' it and store it in a queue of reference countings to be processed.
The reference counter processor keeps processing the queued countings and updating the main reference counter. Every time an object with 0 references is found, it is added to an unaccessible candidate list.
This thread solves the performance impact problem specified before.
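The following sketch illustrates the split between the cheap work done on the database operation (extract the trailing reference list and enqueue it) and the counting done later by the processor. The trailer layout used here (a JSON list followed by a 4-byte length) is purely an assumption for illustration, as are all function and class names; the sketch also only handles newly stored objects, not updates.

```python
import json
import struct
from collections import defaultdict, deque


def append_references(payload, references):
    """Assumed layout: payload + JSON list of references + 4-byte length."""
    trailer = json.dumps(references).encode("utf-8")
    return payload + trailer + struct.pack(">I", len(trailer))


def extract_references(blob):
    """Read the reference list from the end of the serialized object."""
    (trailer_len,) = struct.unpack(">I", blob[-4:])
    return json.loads(blob[-4 - trailer_len:-4].decode("utf-8"))


class ReferenceCounterProcessor:
    def __init__(self):
        self.queue = deque()            # pending (object id, pointed ids) items
        self.counts = defaultdict(int)  # main reference counter
        self.candidates = set()         # unaccessible candidate list

    def enqueue(self, object_id, blob):
        # Called on the database operation: extract and queue only.
        self.queue.append((object_id, extract_references(blob)))

    def process_pending(self):
        # Called by the processing thread: update counters, find candidates.
        while self.queue:
            object_id, pointed = self.queue.popleft()
            self.counts.setdefault(object_id, 0)  # register without counting
            for target in pointed:
                self.counts[target] += 1
        self.candidates = {oid for oid, n in self.counts.items() if n == 0}


proc = ReferenceCounterProcessor()
proc.enqueue("a", append_references(b"payload-a", ["b"]))
proc.enqueue("b", append_references(b"payload-b", []))
proc.process_pending()
```

In this run, `a` is pointed to by nobody and becomes a candidate, while `b` has one reference coming from `a`.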
This thread is the one actually responsible for removing objects. For each unaccessible candidate, it checks whether the object is in memory or used by any session, that is, whether it is actually unaccessible.
To do so, the GC asks the Execution Environment whether an unaccessible candidate is actually accessible. Having 0 references on disk may be misleading if the object is currently in an EE's memory as a local variable (might be associated) or if some client node holds it (session). The Execution Environment knows whether the object is in memory, or whether it is being used by a session because it was serialized to be sent. Therefore, every Execution Environment must provide the function getRetainedObjects, which informs the GC which objects are currently 'retained' (that is, possibly accessible) by the Execution Environment.
If the object is not accessible, the object is put in quarantine.
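A rough sketch of this check: the GC collects the retained set from each Execution Environment and only moves to quarantine the candidates that no EE retains. `get_retained_objects` is a Python-style stand-in for the getRetainedObjects function mentioned above, and `FakeExecutionEnvironment` is a hypothetical test double, not a dataClay class.

```python
class FakeExecutionEnvironment:
    """Hypothetical stand-in for an EE that knows what it retains."""

    def __init__(self, in_memory, in_sessions):
        self._in_memory = set(in_memory)
        self._in_sessions = set(in_sessions)

    def get_retained_objects(self):
        # Retained means: in memory, or in use by some session.
        return self._in_memory | self._in_sessions


def split_candidates(candidates, execution_environments):
    """Return (still accessible, to quarantine) for the given candidates."""
    retained = set()
    for ee in execution_environments:
        retained |= ee.get_retained_objects()
    candidates = set(candidates)
    return candidates & retained, candidates - retained


ees = [
    FakeExecutionEnvironment(in_memory={"a"}, in_sessions=set()),
    FakeExecutionEnvironment(in_memory=set(), in_sessions={"b"}),
]
accessible, to_quarantine = split_candidates({"a", "b", "c"}, ees)
```

Only `c` goes to quarantine here: `a` is still in an EE's memory and `b` is still used by a session, even though all three have 0 references on disk.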
Quarantine is very important for solving the race condition of ‘bad nodes passing objects’. We know that:
- a node will always notify a reference every T + latency time
- nodes can still pass objects between them, bypassing the notification, but those objects will then be in use by a session: this is very important to note.
The session race condition problem is solved with the following new statement:
A session is not actually closed until all the actions it started have finished.
Therefore, the session race condition cannot happen. Moreover, passing objects between nodes keeps a session reference, so the first race condition is also solved. Session references will not disappear without notification.
However, quarantine is still required to make sure that all notifications have arrived (including session ones). An object must stay in quarantine for at least T + latency * number of nodes to make sure that all nodes have notified.
When an object is not accessible and its quarantine time has passed, it is actually removed from disk. Before that, it is read from the DB in order to 'decrement' all the references the object points to.
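The quarantine rule and the removal step can be sketched as follows. The minimum duration follows the formula in the text, T + latency * number of nodes; everything else (the `Quarantine` class, the explicit `now` arguments used instead of a real clock, and a plain dict standing in for the database) is an assumption made for illustration.

```python
def min_quarantine_seconds(t, latency, num_nodes):
    """Minimum quarantine: T + latency * number of nodes."""
    return t + latency * num_nodes


class Quarantine:
    def __init__(self, duration):
        self.duration = duration
        self._entered = {}  # object id -> time it entered quarantine

    def add(self, object_id, now):
        # Keep the first entry time if the object is already quarantined.
        self._entered.setdefault(object_id, now)

    def release(self, object_id):
        # A late notification arrived: the object is referenced again.
        self._entered.pop(object_id, None)

    def expired(self, now):
        return sorted(oid for oid, t0 in self._entered.items()
                      if now - t0 >= self.duration)


def remove_object(object_id, db, counts):
    """Read the object first to decrement what it points to, then delete it."""
    for target in db[object_id]:
        counts[target] -= 1
    del db[object_id]


duration = min_quarantine_seconds(t=10, latency=0.5, num_nodes=4)
quarantine = Quarantine(duration)
quarantine.add("a", now=0)
too_early = quarantine.expired(now=11.9)
ready = quarantine.expired(now=12)

db = {"a": ["b"], "b": []}  # hypothetical store: object id -> pointed ids
counts = {"a": 0, "b": 1}
for oid in ready:
    remove_object(oid, db, counts)
```

After `a` is removed, the count of `b` drops to 0, so `b` would become an unaccessible candidate in a later pass.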
To solve the alias and replica problem, we decoupled it from the GC design and included it in a synchronization design for replicas. Every time an alias is added to an object, the change is now synchronized among all replicas. The GC just asks the object whether it has an alias: “hasAlias” behaves like a normal boolean synchronized field in the replica management design.
Every time an object is serialized to be sent to the DB, we call hasAlias and set a bit to 0 or 1 depending on the value; this information arrives at the GC along with the reference counting.
dataClay can have multiple Execution Environments associated with the same Storage Location. However, an object is stored in only one database (the SL has one database per EE). So, every time we need to know whether the object is accessible, we MUST ask all associated Execution Environments (for replicas). If we remove the object, all replicas are removed. Whenever we need to read the object from the DB to decrement references, we take 'any' replica.
dataClay Federation implies that there is a new reference from another dataClay. The implementation is the same as for the alias. An object that has been federated cannot be removed until it is unfederated (see Federation design). We also use the replication mechanism here.
Every time an object is serialized to be sent to the DB, we call isFederated and set a bit to 0 or 1 depending on the value; this information arrives at the GC along with the reference counting.
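The alias and federation bits can be illustrated with a single flags byte written at serialization time. The bit positions, the one-byte layout and the function names below are assumptions; the text only states that one bit is set from hasAlias and one from isFederated.

```python
HAS_ALIAS = 0b01      # assumed bit position for the hasAlias value
IS_FEDERATED = 0b10   # assumed bit position for the isFederated value


def encode_flags(has_alias, is_federated):
    """Pack both booleans into one byte at serialization time."""
    value = (HAS_ALIAS if has_alias else 0) | (IS_FEDERATED if is_federated else 0)
    return bytes([value])


def decode_flags(flags):
    """Return (has_alias, is_federated) as read by the GC."""
    return bool(flags[0] & HAS_ALIAS), bool(flags[0] & IS_FEDERATED)


def is_pinned(flags):
    """An object with an alias or a federation cannot be removed."""
    return any(decode_flags(flags))


flags = encode_flags(has_alias=False, is_federated=True)
```

With this encoding the GC can treat each bit as the 0 or 1 reference contributed by the alias or by the federation, without loading the object itself.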
With this solution, we do not block dataClay for GC, we have no scalability problems, and we solve the Garbage Collection problem (Predicate 3, [See problem specification here](1. problem specification)).
We list here all the limits that dataClay needs, as defined by the limited solutions explained before (including Predicates 1 and 2 in the Limited solution section):
- Method time out
- Maximum concurrent requests per node
- Maximum objects with alias per user/session
- Maximum concurrent sessions per node
- Maximum objects being used by a session
- Maximum objects pending to update another reference counter
Defining the value of each limit is not easy and requires a separate analysis, along with the definition of a proper system for providing quality, which is a resource contention issue.
We should also discuss the following topics:
- Third parties: All solutions here were designed for registered classes. However, there can be leaks caused by third parties, for example a JAR that consumes the whole memory. What should we do?
- User defined mutable static fields: As we said in the first section, users are currently not able to register mutable static fields. In the future we want to allow users to register and share them. A static field is data associated with the class, so to whom does its disk quota belong? Can we have a disk leak if someone registers classes with only static fields?
- Enrichments: It is currently unclear when enrichments are applied. To apply an enrichment we must change the class, and to read the new class we must change the class loader. What should we do with live objects? Should we do a Stop-the-Universe?