Most of MATSim-T is written in Java and requires Java 1.5 to run. While Java is widely known, we stumble from time to time over certain features or specialities we'd like to highlight. So this is the place to collect interesting and informative stuff about Java which might (or might not) have some relation to our code.
Memory usage is a critical point in multi agent simulations, just because there usually are so many of them. Thus, every single byte that is used to describe a single agent enlarges the memory consumption of the whole system massively.
Our agents have several attributes (like sex, car-availability, employment status), their plans contain activities with a type and legs with a mode. Each of these attributes are stored as Strings. Considering that even an empty String takes up to 40 bytes in Java and characters in Java are always 16bit, even a small string like "f' (for Person.sex) or "car" (for Leg.mode) takes a lot of memory—and that's for every single agent, leg and activity!
It is quite obvious that it does not make sense to have more than one string with the same content multiple times, especially when every instance uses so much memory. This is where object pools are often used: a so called pool stores commonly used objects exactly once, and other objects can refer to these objects instead of holding their own, identical, instances. In our case this means that instead of having separate instances of String for every "f", "car" or other attribute, each occurring value exists only once as a String-object, and all the agents, legs and activities only reference the corresponding pool object instead of storing their own instance. Having a population of 200k agents with 5 attributes would contain 1mio strings—that would be more than 40MB of RAM alone, not yet counting the memory used by plans, leg-modes, activity-types. But when using our "object pool" we only need about 10 or 15 different Strings (depending on the number of different values in the attributes), instead of 1mio!
The class String offers already such an internal object pool, so that this optimization can be used without much additional code:
String.intern();
The documentation to String.intern() reads as follows:
Returns a canonical representation for the string object.
A pool of strings, initially empty, is maintained privately by the class String.
When the intern method is invoked, if the pool already contains a string equal to this String object as determined by the equals(Object) method, then the string from the pool is returned. Otherwise, this String object is added to the pool and a reference to this String object is returned.
So, the setters for such string-attributes can now be written like the following one:
class Act {
String type = null;
...
public void setType(String type) {
this.type = type.intern();
}
}
Using this simple trick, the memory consumption was reduced by over 25% when reading in 160k agents, while the time used for reading the agents did not change significantly (50secs vs 49secs). Reading larger populations should result in even larger memory savings, as most likely no additional strings have to be added to the pool, and all attributes can just reference to them.
Using Java 1.6.0 can evoke OutOfMemoryErrors while parsing huge xml files. For details see: bugs.sun.com/bugdatabase/view_bug.do?bug_id=6536111
HashSet/HashMap do not specify in which order elements are iterated. Testing such structures are very difficult (sorting is always necessary for comparison). Even more tricky it gets, if tests randomly fail. Debugging will be very hard. Use LinkedHashSet/LinkedHashMap instead. It can be iterated over insertion order.
Our simulations depend heavily on random numbers. But despite using random numbers, we still want the simulations to be deterministic, so that results can be reproduced by running the same scenario a second time.
Usually, random numbers in Java are generated by calling Math.random():
double d = Math.random();
While this method returns a random number, the number is indeed random such as that in a second run of the same scenario, other random numbers would result and the simulation would no longer be deterministic. To overcome this problem, an instance of java.util.Random can be generated and initialized with a custom random seed:
Random random = new Random();
random.setSeed(1234);
...
double d = random.nextDouble();
In this case, everytime the code is run, we get the same random numbers. As drawing random numbers is widely used in the code, MATSim-T offers a global instance of Random, which is automatically initialized with the seed specified in the configuration file:
import org.matsim.core.gbl.MatsimRandom;
...
double d = MatsimRandom.random.nextDouble();
What value should be used as random seed? Is a value of 1 better than the value 86294? As both numbers have the same probability to be chosen in the range from 0 to Integer.MAX_VALUE, any value is equally good for a random seed.
But this is only half of the story. We all know, that these random numbers are only pseudo-random—and they depend on the chosen seed!
PlanAlgorithms could be executed in parallel in multiple threads, e.g. during replanning. As the exact order of execution with multiple threads is not deterministic, the usage of MatsimRandom.random would lead to non-deterministic results. Instead, every instance of a PlanAlgorithm should have its own random number generator. Best way to realize that is that PlanAlgorithms that use Random numbers have a constructor where an object of type java.util.Random can be passed. When instantiating PlanAlgorithms, one can use MatsimRandom.getLocalInstance() to obtain a Random-object that can be passed to the PlanAlgorithm. A Random object received by getLocalInstance() is already correctly initialized to return useful random numbers (see below, Problems when setting the random seed).
In our code, we set the random seed at the start of every iteration, so that we can restart a simulation at any iteration. The code was similar to the example code below:
int baseSeed; // the random seed specified in the configuration file
...
for (int iteration = 0; iteration < 1000; iteration++) {
MatsimRandom.random.setSeed(baseSeed + iteration);
...
}
After running several iterations we realized that the first agent was never chosen for re-planning (remember, they get "randomly" chosen for re-planning). A little bit of research revealed, that the first random number drawn after setting the random seed depends heavily on the random seed! Only a slight change in the random seed (in our case always +1 for each iteration) resulted in only a slight change in the value of the random number. The following figure shows the distribution of the first and second drawn random number after setting different random seeds. As can be clearly seen, the first drawn number only moves in a very small range. The second drawn numbers have a better distribution when the seed is only changed a little.
To overcome this problem, we decided that after setting a random seed, we draw one random number and immediately throw it away, as it seems not enough random.
Note: I consider the following as a rather advanced topic / optimization. I stumbled upon it while testing a new Java-profiler and thought I document it, 'cause others may be interested in it. – Cheers, MarcelWhile the Collections in Java are really nice and useful, they are not always the best solution performance-wise. In a recent case, a
TreeMap<Integer, Integer> was used for travel time lookups. Each link in the network had such a map, storing a (departure) time and the corresponding travel time at that departure time. When searching for routes in the network, this map was accessed really a lot of times. As the queried time was not necessarily a key in the map, often the first entry of a tailmap (a map containing all objects whose key is equal or larger than a certain value) was requested to get the travel time at the next possible departure time.Integer.getValue() was called a crazy number of times, using a lot of time in total, even if I had this function-call nowhere explicitly in my method. Well, to find the correct entry in the map, the Integer-keys had to be compared, which was done by accessing their basic type values. That explained the many calls to this function, which made me think of how to implement the travel time lookup without using Integer-objects, but using the basic type int.int[]), one containing the keys, the other the corresponding values, allowed to use a binary search on the key-array (Arrays.binarySearch()) to find the correct array-index and thus to easily access the corresponding travel time in the value-array. With this change, no Integers were used, thus no (un-)boxing had to be done, and the accessed memory in the array should not be so scattered around then the single map-entry objects, leading to faster access. Finally the code ran in 25% less time!