Dexpler: Converting Android Dalvik Bytecode to Jimple for Static Analysis with Soot

This paper introduces Dexpler, a software package which converts Dalvik bytecode to Jimple. Dexpler is built on top of Dedexer and Soot. As Jimple is Soot's main internal representation of code, the Dalvik bytecode can be manipulated with any Jimple-based tool, for instance to perform points-to or flow analysis.


Introduction
Android applications are mainly written in Java. However, they are not distributed as Java bytecode but rather as Dalvik bytecode. Indeed, the original Java code is first compiled into Java bytecode, which is then transformed into Dalvik bytecode by the dx tool. Dalvik bytecode is register-based and optimized to run on devices where memory and processing power are scarce.
Analyzing Android applications with Java static analysis tools requires that either the Java source code or the Java bytecode of the Android application be available. Most of the time, Android application developers do not distribute the source code of their applications. One must then analyze the bytecode, for instance for malware detection. Thus, to analyze Android applications, one is forced to use a Dalvik disassembler such as Smali [2] or Androguard [5]. The problem with disassemblers is that they generally use their own representation of the bytecode, which prevents the use of existing analysis tools.
Another possibility is to first convert Dalvik bytecode to Java bytecode using Ded [7], Dex2jar [16] or undx [17] and then use Java-tailored static analysis tools such as Soot [20], BCEL [4] or WALA [9]. Tools which generate Java bytecode can leverage existing Java bytecode analyzers. However, the conversion from Dalvik to Java bytecode could be avoided by directly converting Dalvik bytecode to the internal representation of a tool.
We introduce Dexpler, a Soot modification which allows Soot to directly read Dalvik bytecode and perform analysis and/or transformation on its internal Jimple representation. This method eliminates the intermediate Dalvik to Java bytecode conversion step and enables a faster and simpler tool chain for static analysis. Dexpler only uses a disassembler and then does the rest of the work itself or by using Soot.
The contributions of this paper are the following:
• we describe a Dalvik to Jimple converter tool
• we provide a comprehensive table which maps Dalvik bytecode instructions to Jimple statements
The remainder of this paper is organized as follows. In Section 2 we explain what Soot is, and how it has been modified to handle Dalvik bytecode. Section 3 is an overview of the Dalvik bytecode. In Section 4 we propose a Soot modification called Dexpler which enables Soot to read Dalvik bytecode. In Section 5 we evaluate Dexpler on test cases and on one Android application, and we present and discuss the results. Section 6 explains the current limitations of our tool. We present the related work in Section 7. Finally, we conclude the paper and discuss open research challenges in Section 8.

Soot
In this Section we give a brief overview of Soot and then describe how we incorporate Dexpler in Soot.

Soot Overview
Soot [11,20] was created as a Java compiler testbed at McGill University. It has evolved to become a Java static analysis and transformation tool.
Soot can be used as a code analyzer to, among others, check that certain properties hold [22] or guarantee correctness of programs [8].
Multiple tools based on Soot have been developed to perform transformations such as translation of Java to C [21], instrumentation of Java programs [23], obfuscation of Java code [18], and software watermarking [3].
Soot accepts Java source code, Java bytecode and Jimple source code as input files. Whatever the input format, it is converted into Soot's internal representation: Jimple. Jimple (Java sIMPLE) is a stackless, three-address representation which features only 15 instructions. Any method code can be viewed as a graph of Jimple statements associated with a list of Jimple local variables.

From Java Bytecode to Jimple
We now describe how Soot handles Java bytecode classes. In a typical case, Soot is launched by specifying the target directory as a parameter. This directory contains the code of the program to analyze, called Application Code (only Java bytecode in this example). First, the main() method of the Main class is executed and calls Scene.loadNecessaryClasses(). This method loads basic Java classes and then loads specific Application classes by calling loadClass(). Then, SootResolver.resolveClass() is called. The resolver calls SourceLocator.getClassSource() to fetch a reference to a ClassSource, an interface between the file containing the Java bytecode and Soot. In our case the class source is a CoffiClassSource because it is the coffi module which handles the conversion from Java bytecode to Jimple. When the resolver has a reference to a class source, it calls resolve() on it. This method in turn calls soot.coffi.Util.resolveFromClassFile(), which creates a SootClass from the corresponding Java bytecode class. The source field of each method of the Soot class is set to refer to a CoffiMethodSource object. This object is used later to get the Jimple representation of the method.
For instance, if during an analysis with Soot the analysis code calls SootMethod.getActiveBody() and the Jimple code of the method has not already been generated, getActiveBody() will call CoffiMethodSource.getBody() to compute Jimple code from the Java bytecode. The Jimple code representation of the method can then be analyzed and/or transformed.
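The on-demand body construction described above can be illustrated with a small model. This is a simplified Python sketch, not Soot's Java code: the classes Method and CountingSource are hypothetical stand-ins for SootMethod and a method source such as CoffiMethodSource, and the statements are placeholder strings.

```python
class CountingSource:
    """Hypothetical stand-in for a method source such as CoffiMethodSource."""

    def __init__(self, statements):
        self.statements = statements
        self.calls = 0  # number of times a body was actually built

    def get_body(self, method):
        self.calls += 1
        # A real source would translate bytecode here; we only copy strings.
        return list(self.statements)


class Method:
    """Hypothetical stand-in for SootMethod: the body is built on first use."""

    def __init__(self, name, source):
        self.name = name
        self.source = source
        self._active_body = None

    def get_active_body(self):
        # Build the body only if it was not generated before, then cache it.
        if self._active_body is None:
            self._active_body = self.source.get_body(self)
        return self._active_body


source = CountingSource(["r0 := @this", "return"])
method = Method("run", source)
first = method.get_active_body()
second = method.get_active_body()
```

In this model the second call returns the cached body, so the source is consulted only once per method. Swapping in a different source object is all that is needed to support another input format.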

Soot and Dalvik
Soot is missing a Dalvik to Jimple transformation module. We implemented such a module, called Dexpler, and incorporated it into Soot using the same structure as Soot's Java bytecode parser module, coffi, by adding the DalvikClassSource and DalvikMethodSource classes.

Dalvik Bytecode
In this Section we present the structure of a .dex file containing Dalvik classes and Dalvik bytecode.

Overall Structure
A single Dalvik executable is produced from N Java bytecode classes through the dx compiler. The resulting Dalvik bytecode is stored in a .dex file as represented in Figure 1b. In Java bytecode, as represented in Figure 1a, each class has a single place where literal constant values are stored (the constant pool). This pool is heterogeneous since different kinds of objects are mixed together (ex: Class, MethodRef, Integer, String, ...). A .dex file, in contrast, contains four homogeneous constant pools: for Strings, Classes, Fields and Methods. These pools are shared by all the classes. A .dex file contains multiple Class Definitions, each containing one or more Method Definitions, each of which is linked to Dalvik bytecode instructions present in the Data section.
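The difference between per-class heterogeneous pools and shared homogeneous pools can be sketched with a toy model. The tuple layout, the entry kinds and the function name build_shared_pools are assumptions made for illustration; a real .dex file uses a binary layout that is not shown here.

```python
def build_shared_pools(class_pools):
    """Merge per-class mixed pools into four homogeneous, deduplicated pools."""
    shared = {"string": [], "class": [], "field": [], "method": []}
    for pool in class_pools:
        for kind, value in pool:
            # Each entry goes to the pool of its own kind, and only once.
            if kind in shared and value not in shared[kind]:
                shared[kind].append(value)
    return shared


# Two hypothetical per-class constant pools mixing entry kinds.
pool_a = [("class", "LFoo;"), ("string", "hello"), ("method", "Foo.run()V")]
pool_b = [("string", "hello"), ("class", "LBar;"), ("field", "Bar.x:I")]
shared = build_shared_pools([pool_a, pool_b])
```

The string "hello" appears in both per-class pools but only once in the shared string pool, which shows why sharing pools across classes saves space.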

Dalvik Instruction
The Dalvik virtual machine is register-based. This means most instructions must specify the registers which they manipulate. There are 237 opcodes present in the Dalvik opcode constant list (see footnote 4). However, 12 odex (optimized dex) instructions cannot be found in the Dalvik bytecode of Android applications, as they are unsafe instructions generated within the Android system to optimize Dalvik bytecode. Moreover, 8 instructions were never found in application code [15]. According to those numbers, only 237 - 12 - 8 = 217 instructions can be found in Android PacKages (.apk) in practice.
The set of instructions can be divided between instructions which provide the type of the registers they manipulate (ex: sub-long v1, v2, v3) and those which do not (ex: const v0, 0xBEEF). Moreover, there is no distinction between NULL and 0, which are both represented as 0 (see Figure 2). As we will see in Section 4, the lack of type information and the NULL representation become problematic when translating the Dalvik bytecode to Jimple.

Dexpler
This section describes Dexpler, the Dalvik to Jimple converter tool. It leverages the dedexer [14] Dalvik bytecode disassembler and Soot's fast typing Jimple component, which implements a type inference algorithm [1] for local variables. We first give a brief overview of dedexer and of how Dexpler works in Sections 4.1 and 4.2, respectively. Then, we detail the issues we have to deal with.

Dedexer
Our tool leverages dedexer, a Dalvik bytecode parser and disassembler which generates Jasmin-like [10,12] text files containing Dalvik instructions instead of Java instructions. We generate Jimple classes, methods and statements from the information provided by dedexer's dex file parser.

Overview
Dalvik bytecode instructions are first mapped to Jimple statements, and registers are mapped to Jimple local variables. The type of local variables is set to UnknownType. Then, Soot's Jimple component, fast typing, is applied to infer the types of the local variables. The third and last step consists in applying Soot's Jimple pack jop, which features components such as the nop eliminator, to optimize the generated Jimple code.
Footnote 4: dalvik/bytecode/Opcodes.java

Instruction Mapping
Each Dalvik instruction is mapped to a corresponding Jimple statement (or group of statements). A comprehensive mapping is represented in Table 1 in Appendix A. Unused opcodes are marked as '-' and odex opcodes as 'odex'. There are five main groups of instructions: move instructions (0x01 to 0x1C), branch instructions (0x27 to 0x3D), getter and setter instructions (0x44 to 0x6D), method invoke instructions (0x6E to 0x78), and logic and arithmetic instructions (0x7B to 0xE2).
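The five opcode ranges listed above can be turned into a small lookup. This is only a sketch of the grouping given in the text, with inclusive bounds; the names GROUPS and classify are hypothetical, and opcodes outside the listed ranges are simply reported as "other".

```python
# Opcode ranges as listed in the text (inclusive bounds).
GROUPS = [
    (0x01, 0x1C, "move"),
    (0x27, 0x3D, "branch"),
    (0x44, 0x6D, "getter/setter"),
    (0x6E, 0x78, "invoke"),
    (0x7B, 0xE2, "logic/arithmetic"),
]


def classify(opcode):
    """Return the instruction group of a Dalvik opcode, or 'other'."""
    for low, high, name in GROUPS:
        if low <= opcode <= high:
            return name
    return "other"
```

A converter could use such a lookup to dispatch each instruction to the code that builds the matching Jimple statement.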

Type Inference
The type of local variables is inferred using the fast typing Soot component. However, the inference algorithm sometimes throws an exception and stops because some Dalvik instructions (such as the constant initialization instructions 0x12 to 0x19) do not provide enough information and thus confuse the inference engine.
The lack of type information appears in the following instructions:
• null initialization instructions (zero or null?)
• numeric initialization instructions (32 bits: integer or float? 64 bits: long or double?)
Null Initialization. Figure 4 illustrates the problem with a bytecode snippet generated from the Java code of Figure 3. Register v0 is initialized with 0 at 01. At this point we do not know if v0 is an integer, a float or a reference to an object. At 02 we still do not have the answer. We have to wait until the instruction at 04 to know that the type of v0 is Coordinate. At this point, the Jimple instruction generated for 01 has to be updated to use NullConstant instead of the default IntConstant(0). If this is not handled correctly, the fast typing component throws an exception and stops.
Numeric Constant Initialization. Similarly, float constant initializations cannot be distinguished from int constant initializations, nor double constant initializations from long constant initializations. Thus, we go through the graph of Jimple statements to find how constants are used and correct the initialization Jimple statements when needed. For instance, if a float/int constant (initialized by default to int in the Jimple statement) is later used in a float addition, the constant initialization changes from IntConstant(c) to FloatConstant(Float.intBitsToFloat(c)).
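The reinterpretation of raw constant bits can be sketched as follows. This Python model mirrors what Java's Float.intBitsToFloat does for 32-bit values and applies the same idea to 64-bit values; the function retype_constant and the tuple encoding of constants are assumptions made for illustration, not Dexpler's code.

```python
import struct


def int_bits_to_float(bits):
    """Reinterpret a 32-bit pattern as an IEEE 754 single-precision float."""
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFFFFFF))[0]


def long_bits_to_double(bits):
    """Reinterpret a 64-bit pattern as an IEEE 754 double-precision float."""
    return struct.unpack(">d", struct.pack(">Q", bits & 0xFFFFFFFFFFFFFFFF))[0]


def retype_constant(raw, used_as):
    """Pick the constant kind once a later use reveals the intended type."""
    if used_as == "float":
        return ("FloatConstant", int_bits_to_float(raw))
    if used_as == "double":
        return ("DoubleConstant", long_bits_to_double(raw))
    if used_as == "long":
        return ("LongConstant", raw)
    return ("IntConstant", raw)  # default when nothing says otherwise


one = int_bits_to_float(0x3F800000)
two = long_bits_to_double(0x4000000000000000)
pi_constant = retype_constant(0x40490FDB, "float")
```

The same bit pattern 0x3F800000 is the integer 1065353216 or the float 1.0 depending on how it is used, which is why the initialization statement can only be fixed after the uses are known.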
We implemented the algorithm described by Enck et al. [6]. It is based on algorithms which extract typing information for a variable by looking at how it is used in operations within which the type of the operands is known (ex: the variable is used as a parameter of a method invocation) [13,19]. For each ambiguous register declaration, the algorithm performs a depth-first search in the control flow graph of Jimple statements to find out how the declared local variable dv (registers are mapped to Jimple local variables) is used. The type of dv is exposed by the following statements: comparison with a known type, instructions operating only on specific types (ex: neg-float), non-void return instructions and method invocations. The search in a branch of the graph is terminated if either the local variable is reassigned (new declaration) or there is no statement following the current one (ex: the current statement is a return or throw statement). When the type information is found, it is forward propagated to all subsequent ambiguous uses between the target ambiguous declaration of dv and any new declaration of dv.
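The depth-first search described above can be sketched on a toy control flow graph. This is a minimal model under stated assumptions, not the implementation of Enck et al. or of Dexpler: the dictionary encoding of statements, the keys 'defines', 'exposes' and 'succs', and the function name infer_type are all hypothetical, and the propagation step is left out.

```python
def infer_type(cfg, decl, var):
    """Depth-first search from an ambiguous declaration of `var`.

    cfg maps a statement id to a dict with the keys:
      'defines' - variable (re)assigned by the statement, or None
      'exposes' - dict mapping a used variable to the type that use reveals
      'succs'   - successor statement ids (empty for return or throw)
    Returns the first type found, or None if no use exposes one.
    """
    stack = list(cfg[decl]["succs"])
    visited = set()
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        stmt = cfg[node]
        # A use is checked before a redefinition, since a statement
        # may read the old value and then overwrite the variable.
        if var in stmt["exposes"]:
            return stmt["exposes"][var]
        if stmt["defines"] == var:
            continue  # reassigned: stop exploring this branch
        stack.extend(stmt["succs"])
    return None


# Hypothetical method: v0 = 0, a branch, then v0 is used as a Coordinate.
cfg = {
    1: {"defines": "v0", "exposes": {}, "succs": [2]},
    2: {"defines": None, "exposes": {}, "succs": [3, 4]},
    3: {"defines": "v0", "exposes": {}, "succs": [5]},
    4: {"defines": None, "exposes": {"v0": "Coordinate"}, "succs": [5]},
    5: {"defines": None, "exposes": {}, "succs": []},
}
found = infer_type(cfg, 1, "v0")

# Here v0 is reassigned before any typed use, so nothing is learned.
cfg_unknown = {
    1: {"defines": "v0", "exposes": {}, "succs": [2]},
    2: {"defines": "v0", "exposes": {}, "succs": [3]},
    3: {"defines": None, "exposes": {"v0": "float"}, "succs": []},
}
not_found = infer_type(cfg_unknown, 1, "v0")
```

The visited set keeps the search from looping forever on cyclic graphs, and stopping at a reassignment prevents a type that belongs to a later declaration from leaking back to the one under inspection.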

Handling Branches
Dalvik instructions are mapped to Jimple statements. When parsing Dalvik bytecode, we keep a mapping between bytecode instruction addresses and Jimple statements. Thus, when a Dalvik branch instruction is parsed, a Jimple jump instruction is generated and its target is retrieved by fetching the Jimple statement mapped to the address of the Dalvik branch instruction's target. We add a nop instruction as the first instruction of every Jimple method. This way, if the first Dalvik instruction is a jump or if the jump's target corresponds to a not yet generated Jimple statement, we redirect it to this nop Jimple instruction. We correct those Jimple jump instructions once the whole Dalvik bytecode of the method has been processed: at this point we know the target Jimple statement mapped to the Dalvik jump's target address. The Jimple nop instruction we add is removed during the Jimple optimization step.
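The placeholder-and-fix-up scheme described above can be sketched as a two-pass translation. This is a simplified model, assuming instructions are given as (address, opcode, target address) tuples and statements are plain dictionaries; the function name translate and this encoding are hypothetical, not Dexpler's data structures.

```python
def translate(instructions):
    """Translate (address, opcode, target_or_None) tuples to statements.

    Forward jumps first point to a leading nop and are corrected once
    every address has a statement.
    """
    placeholder = {"op": "nop", "target": None}
    body = [placeholder]
    by_address = {}  # Dalvik address -> generated statement
    pending = []     # (jump statement, Dalvik target address)
    for address, opcode, target in instructions:
        stmt = {"op": opcode, "target": None}
        by_address[address] = stmt
        body.append(stmt)
        if target is not None:
            if target in by_address:
                stmt["target"] = by_address[target]  # backward jump
            else:
                stmt["target"] = placeholder         # not generated yet
                pending.append((stmt, target))
    # Second pass: every target statement now exists.
    for stmt, target in pending:
        stmt["target"] = by_address[target]
    return body


# Hypothetical method: a forward goto followed by a backward branch.
instrs = [
    (0, "goto", 4),
    (2, "const", None),
    (4, "if-eqz", 2),
    (6, "return", None),
]
body = translate(instrs)
```

After the second pass no jump refers to the leading nop any more, so a later optimization step can remove it safely.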
Branching instructions often rely on the result of a comparison of two registers. Dalvik comparisons between doubles or floats are explicit and provide typing information. However, when a register r is compared with zero, one has to check the type of r. If it is an object, we change the zero value to null since it is a comparison between objects. We make this change when the fast typing component has finished. Indeed, comparisons do not influence the type inference. For example, the Jimple statement generated from 02 in Figure 4 has to be updated to use NullConstant instead of IntConstant(0). If this is not handled correctly, the bytecode generated from the Jimple statements does not run correctly and throws an exception similar to the following one: Exception in thread "main" java.lang.VerifyError: Expecting to find integer on stack.
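The post-typing correction of comparisons with zero can be sketched as a single pass over the statements. This is a toy model: the dictionary encoding of statements, the tuple encoding of constants and the function name fix_zero_comparisons are assumptions made for illustration, and any type name outside the primitive set is treated as a reference type.

```python
PRIMITIVES = {"boolean", "byte", "char", "short", "int", "long", "float", "double"}


def fix_zero_comparisons(statements, local_types):
    """Replace IntConstant(0) by NullConstant when the compared local is an object.

    Must run after type inference, since it relies on the inferred types.
    """
    fixed = []
    for stmt in statements:
        if (stmt["kind"] == "if"
                and stmt["right"] == ("IntConstant", 0)
                and local_types[stmt["left"]] not in PRIMITIVES):
            # Copy the statement so the input list is left untouched.
            stmt = dict(stmt, right=("NullConstant", None))
        fixed.append(stmt)
    return fixed


stmts = [
    {"kind": "if", "left": "v0", "right": ("IntConstant", 0)},
    {"kind": "if", "left": "v1", "right": ("IntConstant", 0)},
]
types = {"v0": "Coordinate", "v1": "int"}
out = fix_zero_comparisons(stmts, types)
```

Only the comparison on v0, whose inferred type is a class, is rewritten; the integer comparison on v1 keeps its zero constant.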
Dexpler enables us to transform Dalvik bytecode to a Jimple representation. From there, Soot can be used as a static analysis tool to analyze the code. The next Section evaluates Dexpler.

Evaluation
We evaluate Dexpler using test cases, and one Android application: Snake.

Test Cases
The first step is to generate the Dalvik bytecode for every test case. The test cases are written in Java, then compiled into Java bytecode using javac and finally converted into Dalvik bytecode using dx. The second step is to execute Dexpler on every generated Dalvik bytecode test case. This generates .jimple and .class files. We then compare the execution results of the original Java bytecode and of the Java bytecode produced by Soot from the Dalvik bytecode. Executions of the generated .class files give the correct result.
We wrote test cases for arithmetic operations, branches, method calls, array initialization, string manipulation, null and zero usage, exceptions and casts.
Since simple test cases do not reflect a real application we also evaluated our tool on one Android application.

Android Application
The Snake application is a demonstration application developed by the Android team to showcase the Android platform. It features 11 classes and 39 methods, and was written in 550 lines of Java code. The generated Dalvik bytecode takes 14 KiB and contains 884 Dalvik instructions.
From the Dalvik bytecode of the Snake application we generate Jimple code in one second (duration for the Dalvik to Jimple conversion only). Then we ask Soot to generate Java bytecode from the Jimple representation. We convert the Java bytecode back to Dalvik, repackage an Android application and launch it on the Android emulator.
The application runs smoothly and the game is working.

Static Analysis on Snake
We use Soot to generate a call graph of the Snake application as well as a control flow graph, represented in Figure 5, in 14 seconds (measured from the launch of Soot until Soot has finished). We do this to check that the generated call graph and CFG correspond to the original code, meaning that the conversion from Dalvik to Jimple is correct for this code.
We have successfully tested our prototype tool on test cases as well as on an Android application.

Current Limitations
The current version of Dexpler is able to transform Android applications such as the Snake game.
Moreover, when inferring types for ambiguous declarations, the algorithm supposes that the Dalvik bytecode is legal in the sense that it was generated from Java source code and not hand-crafted by malicious developers. In the latter case, assumptions such as "comparisons always involve variables with the same type" may no longer hold and may cause Dexpler to infer wrong types.

Related Work
To our knowledge, no existing tool directly converts Dalvik bytecode to Jimple. We found either tools to convert Dalvik bytecode to Java bytecode or tools to disassemble and/or assemble Dalvik bytecode using an intermediate representation.
Dalvik to Java Bytecode Converter. Ded [7] is a Dalvik bytecode to Java bytecode converter. Once the Java bytecode is generated, Soot is used to optimize the code. Dex2jar [16] also generates Java bytecode from Dalvik bytecode but does not use any external tool to optimize the resulting Java bytecode. Undx [17] is also a Dalvik to Java bytecode converter but seems to be unavailable.
We, on the other hand, do not directly generate Java bytecode but Jimple code. From there, since the Jimple code is within Soot, we can generate Java bytecode as well.
Dalvik Assembler/Disassembler. Smali [2] or Androguard [5] can be used to reverse engineer Dalvik bytecode. They use their own representation of the Dalvik bytecode and therefore cannot leverage existing analysis tools.
Our tool uses Soot's internal representation, which allows existing tools to analyze/transform the Dalvik bytecode.

Conclusion
We have presented Dexpler, a Soot modification which enables Soot to analyze Dalvik bytecode and thus Android applications. This tool leverages dedexer for the parsing of Dalvik dex files and Soot's fast typing component for the type inference.
Dexpler converts every Dalvik instruction to Jimple.We are working on improving Dexpler to make it robust to yet unhandled typing issues.Once this step is done we will look at the performance of this tool compared to current Java bytecode generation and analysis tools.

Figure 2. Dalvik Representation of null and zero.

Figure 5. Control Flow Graph for addRandomApple Method Extracted from the Generated Jimple Representation.
Figure 3. Illustration of the null init problem.

Table 1: Jimple Code Representation of Dalvik Instructions
