Susan L. Graham, Peter B. Kessler, Marshall K. McKusick
2004 SIGPLAN notices  
We extended the UNIX ® system's profiler by gathering arcs in the call graph of a program. Here is it 20 years later and this profiler is still in daily use. Why is that? It's not because there aren't well-known areas for improvement. RETROSPECTIVE In the early 1980's, a group of us at the University of California at Berkeley were involved in a project to build compiler construction tools [1] . We were, more or less simultaneously, rewriting pieces of the UNIX operating system [2] . For many of
more » ... us, these were the largest, and most complex, programs on which we had ever worked. Of course we were interested in squeezing the last bits of performance out of these programs. The UNIX system comes with a profiling tool, prof [3], which we had found adequate up until then. The profiler consists of three parts: a kernel module that maintains a histogram of the program counter as it is observed at every clock tick; a runtime routine, a call to which is inserted by the compilers at the head of every function compiled with a profiling option; and a postprocessing program that aggregates and presents the data. The program counter histogram provides statistical sampling of where time is spent during execution. The runtime routine gathers precise call counts. These two sources of information are combined by post-processing to produce a table of each function listing the number of times it was called, the time spent in it, and the average time per call. As our programs became more complex, and as we became better at structuring them into shared, reusable pieces, we noticed that the profiles were becoming more diffuse and less useful. We observed two sources of confusion: as we partitioned operations across several functions to make them more general, the time for an operation spread across the several functions; and as the functions became more useful, they were used from many places, so it wasn't always clear why a function was being called as many times as it was. The difficulty we were having was that we wanted to understand the abstractions used in our system, but the function boundaries did not correspond to abstraction boundaries. Not being afraid to hack on the kernel and the runtime libraries, we set about building a better profiler [4] . Our ground rules were to change only what we needed and to make sure we preserved the efficiency of the tool. In fact, except for fixing a few bugs, the program counter histogram part of the profiler worked fine. Incrementing the 20 appropriate bucket of the program counter histogram had an almost negligible overhead, which allowed us to profile production systems. The space for the histogram could be controlled by getting a finer or coarser histogram. (Another thing that was happening around this time was that we were moving from 16-bit address spaces to 32-bit address spaces and felt quite expansive in the amount of memory we were willing to use.) One of us remembers an epiphany of being able to use a histogram array that was four times the size of the text segment of the program, getting a full 32-bit count for each possible program counter value! But it was in the runtime routine called from the top of each profiled function that we made the most difference. The standard routine uses a per-function data structure to count the number of times each function is called. In its place, we wrote a routine that uses the per-function data structure, and the return address of the call, to record the callers of the function and the number of times each had called this function. That is, we recorded incoming call graph arcs with counts. (We were surprised at how easily, and how dramatically, we could change the profiler with a single "late bound" function call.) We wrote a new post-processing program, i.e., gprof, to combine the call graph arcs with the program counter histogram data to show not only the time spent in each function but also the time spent in other functions called from each function. Our techniques are not without their pitfalls. For example, we have a statistical sample of the time spent in a function from the program counter histogram, and the count of the number of calls to that function. From those we derive an average time per call that need not reflect reality, e.g., if some calls take longer than others. Further, when attributing time spent in called functions to their callers, we have only single arcs in the call graph, and so distribute the "average time" to callers in proportion to how many times they called the function. Another difficulty we had was when we encountered cycles in the call graph: e.g., mutually recursive functions. We could not accumulate time from called functions into a cycle and then propagate that time towards the roots of the graph, because we would go around the cycle endlessly. First we had to identify the cycles and treat them specially. We had good graduate computer science educations, and knew of Tarjan's strongly-connectedcomponents algorithm [5] . That was fun to implement. Modern profilers solve both these problems by periodically gathering not just isolated program counter samples and isolated call graph arcs, but complete call stacks [6] . The additional overhead of gathering the call stack can be hidden by backing off the frequency with which the call stacks are sampled. Gathering complete call stacks depends on being able to find the return addresses all the way up the stack, a convention imposed in order to debug programs. ACM SIGPLAN 49 Best of PLDI 1979PLDI -1999
doi:10.1145/989393.989401 fatcat:aetwti6njfbcjlkwmcpav2ensy