Vectorization technology to improve interpreter performance

Erven Rohou, Kevin Williams, David Yuste
2013 ACM Transactions on Architecture and Code Optimization (TACO)  
In the present computing landscape, interpreters are in use in a wide range of systems. Recent trends in consumer electronics have created a new category of portable, lightweight software applications. Typically, these applications have fast development cycles and short life spans. They run on a wide range of systems and are deployed in a target-independent bytecode format over Internet and cellular networks. Their authors are untrusted third-party vendors, and they are executed in secured runtimes or virtual machines. Furthermore, due to security policies or development-time constraints, these virtual machines often lack just-in-time compilers and rely on interpreted execution. At the other end of the spectrum, interpreters are also a reality in the field of high-performance computations because of the flexibility they provide. The main performance penalty in interpreters arises from instruction dispatch: each bytecode requires a minimum number of machine instructions to be executed. In this work, we introduce a novel approach to interpreter optimization that reduces instruction dispatch thanks to vectorization technology. We extend the split-compilation paradigm to interpreters, thus guaranteeing that our approach exhibits almost no overhead at runtime. We take advantage of the vast research in vectorization and its presence in modern compilers. Complex analyses are performed ahead of time, and their results are conveyed to the executable bytecode. At runtime, the interpreter retrieves this additional information to build the SIMD IR (intermediate representation) instructions that carry the vector semantics. The bytecode language remains unmodified, making this representation compatible with legacy interpreters and previously proposed JIT compilers. We show that this approach drastically reduces the number of instructions to interpret and decreases the execution time of vectorizable applications. Moreover, we map SIMD IR instructions to hardware SIMD instructions when available, with a substantial additional improvement. Finally, we finely analyze the impact of our extension on the behavior of the caches and branch predictors.

For these reasons, mobile executables must be distributed in a secure, target-independent format, namely bytecode. These bytecodes run in secure managed environments called virtual machines.
Virtual machines ensure the integrity of the system and control the access of the deployed application to privileged resources. Traditional virtual machines for languages such as Java and .NET use interpreters for fast startup time and augment them with JIT (Just-In-Time) compilers for increased performance [Paleczny et al. 2001; Oracle 2010]. Developing a JIT compiler, however, is much more complex and time consuming than developing an interpreter, and vendors may decide to content themselves with interpretation. Some vendors of modern consumer electronics require code to be digitally signed and secured before deployment, and their policies prohibit dynamic code generation on their targets [Xamarin 2011], including JIT compilers.

At the other end of the spectrum, interpreters are also in use in fields where high-performance computations are necessary, such as physics. The added flexibility and the faster development cycles favor such environments. Scientists from both CERN and Fermilab report [Naumann and Canal 2008] that "many of LHC experiments' algorithms are both designed and used in interpreters". As another example, the need for an interpreter is also one of the three reasons motivating the choice of Jython for the data analysis software of the Herschel Space Observatory [Wieprecht et al. 2004]. Finally, many domain specialists turn to languages such as Python, Ruby, or Matlab simply because they are easier and faster to learn. These languages happen to be mostly interpreted.

Unfortunately, the performance of interpreters is lower than what static and JIT compilers achieve. The causes are inherent to the nature of interpreters. First, a program is translated into an IR used internally by the interpreter. At this point, only fast optimizations can be applied. Then, the IR is "executed": each instruction is dispatched, and the code chunk matching its semantics is executed. Instruction dispatch is an important source of inefficiency due to the overhead it introduces.
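The dispatch overhead described above can be illustrated with a minimal sketch of a stack-based bytecode interpreter. The opcodes and program below are hypothetical, not the paper's CLI interpreter; the point is that every bytecode, however cheap its semantics, pays the fetch-decode-branch cost of one dispatch:

```python
# Minimal sketch of a stack-based bytecode interpreter (hypothetical opcodes).
# Every bytecode goes through the dispatch loop: fetch, decode, branch to the
# code chunk implementing its semantics, operate on the evaluation stack.

def interpret(code, consts):
    """Execute a list of (opcode, operand) pairs; count dispatches."""
    stack = []
    pc = 0
    dispatches = 0
    while pc < len(code):
        op, arg = code[pc]          # fetch the next bytecode
        dispatches += 1             # one dispatch per bytecode, no exceptions
        if op == "PUSH":            # decode + transfer of control
            stack.append(consts[arg])
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "RET":
            return stack.pop(), dispatches
        pc += 1

# Computing consts[0]*consts[1] + consts[2] already costs 6 dispatches.
program = [("PUSH", 0), ("PUSH", 1), ("MUL", None),
           ("PUSH", 2), ("ADD", None), ("RET", None)]
result, n_dispatch = interpret(program, consts=[2, 10, 3])  # 2*10 + 3
```

In a real interpreter each of those dispatches expands into the 20 to 30 machine instructions measured in the next paragraph, which is why reducing the dispatch count is the dominant optimization lever.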
Our experiments with our own interpreter for CLI on x86 show that, on average, the execution of a single bytecode takes from 20 to 30 machine instructions. They account for loading a bytecode from memory, decoding it, and transferring the flow of control to the code that performs the appropriate function. Parameters, if any, are then read from the evaluation stack, the computation is performed, and the result is written back.

A common way of improving the performance of interpreters is to reduce the number of instructions to dispatch. To this end, common sequences of instructions are gathered into single superinstructions [Gregg 2003, 2004]. The interpreter identifies these repeated patterns and replaces them with the corresponding superinstruction. However, the search space for superinstructions is limited due to time constraints. Aggressive analysis of the code and powerful transformations are out of reach.

The spirit of our approach is similar to superinstructions, in the sense that we improve the performance of the interpreter by reducing the number of dispatches. The major differences are as follows. First, the patterns are much coarser grained than traditional superinstructions: they can encompass up to hundreds of dynamic bytecodes. Second, they are not computed by the interpreter, but generated by an offline compiler, and exposed to the interpreter as candidate superinstructions, thanks to predefined builtins. We preserve backward compatibility with legacy interpreters, while enabling very significant speedups on systems based on our proposal.
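The payoff of coarse-grained, vector-style superinstructions can be sketched as follows. This is an illustrative model, not the paper's SIMD IR: a scalar bytecode loop pays several dispatches per array element (the exact count of five is an assumption for the example), while a single vector superinstruction covering the whole loop is dispatched once:

```python
import operator

# Illustrative model of dispatch counts: scalar interpreted loop vs. one
# coarse-grained vector superinstruction covering the entire loop.

def scalar_add(a, b):
    """Mimic interpreted execution of an element-wise add loop."""
    out = [0] * len(a)
    dispatches = 0
    for i in range(len(a)):
        # per iteration: LOAD a[i], LOAD b[i], ADD, STORE out[i], BRANCH
        dispatches += 5             # assumed bytecode count per iteration
        out[i] = a[i] + b[i]
    return out, dispatches

def vec_add(a, b):
    """One vector superinstruction: the whole loop costs a single dispatch."""
    dispatches = 1
    return list(map(operator.add, a, b)), dispatches

a, b = list(range(100)), list(range(100))
scalar_out, scalar_n = scalar_add(a, b)   # 100 iterations * 5 = 500 dispatches
vec_out, vec_n = vec_add(a, b)            # 1 dispatch, same result
```

The two variants produce identical results, but the vector form replaces hundreds of dynamic dispatches with one, which is the effect the offline compiler exposes to the interpreter through candidate superinstructions.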
doi:10.1145/2400682.2400685