From MinX to MinC: semantics-driven decompilation of recursive datatypes

Ed Robbins, Andy King, Tom Schrijvers
2016 SIGPLAN Notices
Reconstructing the meaning of a program from its binary executable is known as reverse engineering; it has a wide range of applications in software security, exposing piracy, legacy systems, etc. Since reversing is ultimately a search for meaning, there is much interest in inferring a type (a meaning) for the elements of a binary in a consistent way. Unfortunately, existing approaches do not guarantee any semantic relevance for their reconstructed types. This paper presents a new and ...-founded approach that provides strong guarantees for the reconstructed types. Key to our approach is the derivation of a witness program in a high-level language alongside the reconstructed types. This witness has the same semantics as the binary, is type correct by construction, and induces a (justifiable) type assignment on the binary. Moreover, the approach effectively yields a type-directed decompiler. We formalise and implement the approach for reversing MINX, an abstraction of x86, to MINC, a type-safe dialect of C with recursive datatypes. Our evaluation compiles a range of textbook C algorithms to MINX and then recovers the original structures.

[Figure: Decompilation Relation between a MINX Program, its Recovered Types and a MINC Witness; labelled properties: Semantic Equivalence, Well Typedness]

R •_w {r → b} = R • {r → b : R_{w:4}(r)}

Heap Memory

The heap is modelled as a (partial) function H : Word ⇀ Byte and is therefore byte addressable. To read stored objects that straddle w consecutive bytes we define a function H_w : Word → Byte* that reads and amalgamates w bytes of the heap into a single vector as follows: [...] where the operations +_4 and −_4 denote addition and subtraction in 4-byte bit-vector arithmetic, and 1 denotes a 4-byte bit-vector.

Syntax

The syntax of MINX programs consists of four syntactic categories. The first, w ::= 2 | 4, denotes the width, in bytes, of the primitive data objects that are supported by the instruction set. The second, ι, defines the instructions themselves: [...] where c denotes a numeric constant and a ∈ Word is the location of the function that is to be invoked (a Harvard architecture is assumed throughout). Square brackets indicate indirection. The instructions op ⊕ w and op ⊗ w are themselves parameterised by the
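
The heap read and the register update described above can be illustrated with a minimal Python sketch. It is not the paper's definition: the formula for H_w is missing from this extract, so the byte order of the result (lowest address first) is an assumption, as is the reading of R •_w {r → b} as "overwrite the first w bytes of r and keep bytes w to 4 of its old value". The names `add4`, `read_heap`, `write_reg` and the register name `r0` are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # Word is taken to be a 4-byte bit-vector


def add4(a, b):
    """Addition in 4-byte bit-vector arithmetic (wraps modulo 2**32)."""
    return (a + b) & MASK32


def read_heap(heap, addr, w):
    """Sketch of H_w: gather w consecutive bytes starting at addr.

    heap is a dict from address to byte, standing in for the partial
    function H : Word -> Byte. Returns None if any byte is unmapped.
    Assumption: the byte at the lowest address comes first.
    """
    out = []
    for i in range(w):
        a = add4(addr, i)
        if a not in heap:
            return None  # H is partial: the read is undefined here
        out.append(heap[a])
    return out


def write_reg(regs, r, b):
    """Sketch of the w-byte register update, with w = len(b).

    Assumption: b replaces the first w bytes of register r and the
    remaining bytes (w up to 4) of the old contents are kept.
    """
    new = dict(regs)
    new[r] = list(b) + regs[r][len(b):4]
    return new


# Example data (hypothetical): a 2-byte object straddling the top of memory.
heap = {0xFFFFFFFF: 0x34, 0x00000000: 0x12}
two_bytes = read_heap(heap, 0xFFFFFFFF, 2)        # address wraps to 0
regs = write_reg({"r0": [1, 2, 3, 4]}, "r0", [9, 8])
```

Modelling the heap as a dictionary keeps the partiality of H explicit: a read that touches an unmapped address yields no value rather than a default byte.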
doi:10.1145/2914770.2837633 fatcat:ifszk325ove2bprmsqni6ge7zm