🔬 A reaching definition engine for binary analysis built-in in angr.

Background

In this blog-post I would like to talk about some recent effort we have done to improve the Reaching Definition (RD) analysis inside angr, and how you can use it in your projects. Straight from compiler theory, a reaching definition analysis is (in very simple terms) a static program analysis technique that can help us answer which are the possible definitions for a specific variable at a particular line of code.

Let’s take, for instance, the following snippet:

[0] c = input()
[1] var a = 6  
[2] var b = a + 6  
[3] if c == 3:  
[4]   a = 4  
[5]   c = a + 1 
[6] else  
[7]   a = 2  
[8]   c = a + 2 
[9] b = a + 2

Here we can use RD by placing a -so called- observation point at [5] and ask “which are the values we can observe at [5] for variable ‘a’?”. The answer in this case will be 4. If we would have asked the same question at [8] the answer would have been 2 and, eventually, asking this question at observation point [9] would have given us both 4 and 2 as an answer. If you are confused regarding the latest answer, keep in mind that this technique is purely static: the analysis simply observes that since the var ‘c’ has no specific value we can take both the branches of the ‘if’, therefore, ‘a’ can be re-defined with 4 or 2.

Now, let’s spice this up a little bit. 🌶️

When doing binary analysis we don’t have the luxury of having the source code of the application we are analyzing, and chances are that, if you are using angr, you are exactly in this situation. However, variables are just fancy programming language abstractions and under the hood they are stored in memory and registers. Therefore, we can use the RD analysis to reason about those entities and getting nearly equivalent results.

If we think about the previous program in (totally unoptimized) assembly:

[0]  mov rcx, [rsp+0x8]   ; moving variable 'c' in rcx
[1]  mov rax, 6         
[2]  mov [rsp+0x10], rax  ; setting var 'a'   
[3]  add rbx, rax, 6      ; defining variable 'b'
[4]  mov [rsp+0x1C], rbx  
[5]  cmp rcx, 0x3         ; checking if var 'c' is 3 or not
[6]  jne label2           ; if not, go to label2
[7]  mov [rsp+0x10], 0x4
[8]  mov rax, [rsp+0x10]
[9]  add rcx, rax, 0x1
[10] mov [rsp+0x8], rcx
[11] jmp end 
[12] label2:
[13]  mov [rsp+0x10], 0x2
[14]  mov rax, [rsp+0x10]
[15]  add rcx, rax, 0x1
[16]  mov [rsp+0x8], rcx
[17] end:
[18]  mov rax, [rsp+0x10]
[19]  add rbx, rax, 0x2
[20]  mov [rsp+0x1C], rbx

Now, we need to define an observation point and understand the question we want to ask to get the result we want. For instance, if we want to ask, like before, which are the possible values of the var ‘a’ in the branch where ‘c’ is equal to 3 we would need to ask “which are the values we can observe at [9] for register rax?”.

This technique is the foundation for different analyses at the base of vulnerability hunting tools, static taint engine systems, function prototypes recovery routines, constant propagators, and so on. We extensively use this in angr to accomplish a different kind of tasks because its speed, compared with analyses based on Dynamic Symbolic Execution (DSE), it’s amazing. Personally, I’ve been able to speed up things from taking an hour (using DSE) to minutes by simply re-thinking them using this approach!

Example

Let’s go through some real code that will, hopefully, clarify how to leverage this feature and how to eventually hack into it (recently there have been many questions on our slack and I thought this blog-post might have been useful for some researchers).

For the purpose of this example, I’ll go through a real problem I had to solve for my current project. In particular, I wanted to collect all the locations in a given binary where the return value of a function was consumed as an argument for a memcpy. For instance:

[0] c = foo()
[1] ...
[2] { ... more code here ... }
[3] ...
[4] memcpy(c, a, 0x10)

For most of you this problem probably already triggered the word “taint engine” in your minds, but, how can we practically implement this in angr? A friend and colleague of mine (@badnack) already proposed a solution here. However, while this approach can be useful in some situations, it doesn’t really fit my case mainly for 3 reasons:

  1. It is fully based on DSE. Therefore, while executing code we are creating SimState(s), collecting and solving constraints, and in general, hacking the symbolic exploration (and abusing its features) to force angr to behave like a taint engine.
  2. This is a direct consequence of the previous point, but deserves a separate discussion. If we want to design an analysis not affected by false negatives (i.e., we want to spot all the locations where memcpy is consuming a value returned by a function) we need to reason about EVERY possible path that we can observe over the Control Flow Graph (CFG). This means that we would have to exhaust the symbolic exploration to be sure we checked all of them. Unfortunately, this can either lead to state explosions/path explosions or simply take an unacceptable amount of time.
    Moreover, we would like to start from the sinks (the calls to memcpy) and reverse-search to the sources (the functions' return values), but backward symbolic execution is not quite possible in angr (and honestly I’m also not aware of any tools that implement this).
  3. It doesn’t really scale if we want to make this analysis inter-functional. The first implementation of this spent more than an hour(!) to find all the locations in the code, moreover, the results were partial and some paths were not considered.

The intuition here is to try to implement the same technique we would manually use to spot these dependencies: look at the code and simply look at how variables are defined and use in the function. In the following screenshot, you can see how the variable v25 is defined by the memb_alloc and consumed by memcpy.

function_correlation

The idea here is pretty simple: we want to set an observation point at the memcpy call (address 0x37CA) and then ask for seeable definitions for the register r0 (the one that is holding the v25).

.text:000037BA STRB    R5, [R0,#0x1C]
.text:000037BC MOVS    R3, #1
.text:000037BE STRB    R3, [R0,#0x1D]
.text:000037C0 ADDS    R0, #0xC
.text:000037C2 MOV     R3, R9
.text:000037C4 ADDS    R1, R3, #2
.text:000037C6 MOVS    R2, #0x10
.text:000037C8 LDR     R3, =(memcpy+1)
.text:000037CA BLX     R3  ;memcpy     <== OBSERVATION POINT 
.text:000037CC ADDS    R0, R4, #4      ; t
.text:000037CE MOVS    R1, #0x12C00    ; interval
.text:000037D2 LDR     R3, =(timer_set+1)
.text:000037D4 BLX     R3              ; timer_set
.text:000037D6 MOVS    R1, R4          ; item
.text:000037D8 LDR     R0, =others_services_list ; list
.text:000037DA LDR     R3, =(list_add+1)
.text:000037DC BLX     R3              ; list_add

Zooming out this basic block it’s interesting to see where it is placed inside the CFG of the function under analysis (in green).

memcpy_cfg

Basically, by asking the question “which definitions for r0 can we observe at address 0x37CA?” we will collect all the possible definitions for r0 that are observable when the execution flow reaches the address 0x37CA inside the green basic block.

Let’s start to write some code:

import angr 
import autoblob # just to load the blob, more at https://github.com/subwire/autoblob
import angr.analyses.reaching_definitions.dep_graph as dep_graph

blob_path = "./blob.bin"
target_func_addr = 0x3609 # Odd addresses because ARM Thumb mode 
call_to_memcpy = 0x37CB # Odd addresses because ARM Thumb mode

print("Creating angr Project")
project = angr.Project(blob_path)

# Some standard options for the CFG
print("Creating binary CFG")
bin_cfg = project.analyses.CFG(resolve_indirect_jumps=True, 
                               cross_references=True, 
                               force_complete_scan=False, 
                               normalize=True, 
                               symbols=True)

# Getting the func object from the addres
target_func = bin_cfg.functions.get_by_addr(target_func_addr)
# Starting the ReachingDefinition analysis
rd = project.analyses.ReachingDefinitions(subject=target_func, 
                                          func_graph=target_func.graph,
                                          cc = target_func.calling_convention,
                                          observation_points= [("insn", 
                                                                call_to_memcpy,
                                                                0)],
                                          dep_graph = dep_graph.DepGraph()
                                          )

This will specifically run the reaching definition analysis over the target function 0x3609 by using as an observation point the call to the memcpy at 0x37CB (note that you can set multiple observation points if you wish).

By investigating the results object we can get the information regarding the register r0:

# VEX offset is just how the VEX IR refers to registers
reg_vex_offset = project.arch.registers.get("r0", None)[0]
reg_defs = rd.one_result.register_definitions.get_objects_by_offset(reg_vex_offset)
print(reg_defs)
# RESULT:
# {<Definition {Atom:<Reg 8<4>>, Codeloc:<0x37c1 id=0x37bb[36]>, 
#              Data:DataSet<32>: {<Undefined>}}>}

The result is telling us that the latest instruction that defines the register r0 -before calling the memcpy- is at code location 0x37c1 (for those curious: id is the basic block address and 36 is the VEX statement), but its value (in the DataSet) is Undefined. Why is that? To understand let’s take a look at the instruction at 0x37c1:

.text:000037C0 ADDS    R0, #0xC

The definition of r0 is coming from whatever value r0 was holding added to 0xc. But what was the value of r0 at this point? Well, that’s again a question for RD:

rd2 = project.analyses.ReachingDefinitions(subject=target_func, 
                                          func_graph=target_func.graph,
                                          cc = target_func.calling_convention,
                                          observation_points= [("insn", 0x37c1 , 0)],
                                          dep_graph = dep_graph.DepGraph()
                                          )
reg_defs2 = rd2.one_result.register_definitions.get_objects_by_offset(reg_vex_offset)
print(reg_defs2)
# RESULT:
# {<Definition {Tags:{
#                    <ReturnValueTag {Function: 0x3a3d, 
#                                     Metadata:{'tagged_by': 'SimEngineRDVEX._handle_function_cc'}}>
#                   }, 
#              Atom:<Reg 8<4>>, 
#              Codeloc:<0x37b5 id=0x37b1[-2]>,
#              Data:DataSet<32>: {<Undefined>}}>}

The previous definition for r0 has been recorded at the address 0x37b5, but by looking at that address we can’t see any definition for r0, right?

.text:000037B0
.text:000037B0 loc_37B0                ; m
.text:000037B0 process_pt = R8         ; pt *
.text:000037B0 LDR     R0, =registrations
.text:000037B2 LDR     R3, =(memb_alloc+1)
.text:000037B4 BLX     R3     ; memb_alloc  <==

Well, the RD analysis is smart enough that, by making use of the function’s calling convention (CC), knows that this function is going to return stuff, and under the current CC this “stuff” is returned inside r0, therefore, r0 is actually redefined here. This information is integrated inside the Definition object we got as a result. We can see that, indeed, the Definition has been tagged with a ReturnValueTag, pointing out that not only the content of r0 is the result of a function call, but also that the function that has been called is 0x3a3d. 🏷️

Now we can easily understand why the value observed at the call to memcpy was Undefined: RD didn’t know the value returned by the function 0x3a3d, but simply that r0 it’s going to be redefined. Given that, it assigns an Undefined value and when later the RD engine has to sum up Undefined + 0xC everything is merged into Undefined (for people familiar with data-flow analysis this is basically merging to Top).

The tagging mechanism for these definitions has been recently introduced and I really believe empowers a lot of smart reasoning about these chains of definitions of registers. In particular, I’m gonna leverage this exact mechanism to solve the problem we were discussing at the beginning.

Recalling the problem: we are trying to spot situations where the first argument of memcpy (held by register r0 in this architecture) is the return value of another function. The idea is pretty simple: automatically walk back the definitions chain for the registers we need (by using the dep_graph) until the definition is not Undefined anymore (in that case it would mean that the argument is a constant) OR until we hit a Definition node tagged with a ReturnValueTag.

We can picture this in the following way:

rd_grap_simple

The r0’s definition flowing inside memcpy is the one at D3, since this has the Undefined value we will walk back the dep_graph and eventually find D1 and D2 (in this case 2 different paths can bring us to ins3). D1 has no interesting value for our purpose since it is a constant, but D2 instead will be tagged as coming from function_abc. (quick note: the graph presented above is actually a simplification of what we can actually see if we would have dumped the dep_graph object with networkx).

Now that the approach is clear, it’s time to write the code to implement this analysis.

Final Code

import angr 
import autoblob
import os
import angr.analyses.reaching_definitions.dep_graph as dep_graph

from angr.engines.light import SpOffset, RegisterOffset
from angr.knowledge_plugins.key_definitions.atoms import Register, SpOffset, MemoryLocation
from angr.knowledge_plugins.key_definitions.undefined import Undefined
from angr.knowledge_plugins.key_definitions.definition import Tag
from angr.knowledge_plugins.key_definitions.tag import ReturnValueTag
from angr.knowledge_plugins.key_definitions.tag import ParameterTag

from networkx.drawing.nx_agraph import write_dot

# Utility class to walk back the definitions graph.
class DefinitionExplorer():
    def __init__(self, project, rd_ddg_graph):
        self.project = project
        self.rd_ddg_graph = rd_ddg_graph

    def resolve_use_def(self, reg_def):
        # Now we need to analyze the definition for this atom
        reg_seen_defs = set()
        defs_to_check = set()
        defs_to_check.add(reg_def)
    
        # Cache of all seen nodes (Tie the knot)
        seen_defs = set()

        while len(defs_to_check) != 0:
            current_def = defs_to_check.pop()
            seen_defs.add(current_def) 
            # Check if the current Definition has a tag 
            def_value = self.check_definition_tag(current_def)
            
            # If def_value is not None we hit a "retval" and we collect it,
            # in the other case we need to check if it is Undefined, if yes gotta walk back. 
            if def_value:
                reg_seen_defs.add(def_value)
            else:
                dataset = current_def.data 
                # Boolean guard: do we have any undefined pointers? 
                undefined_pointers = False 
                
                # A value in DataSet can be "Int" or "Undefined"
                for data in dataset:
                    if type(data) == Undefined: undefined_pointers = True  

                # If we have undefined pointers (a.k.a. Top value) we need to process the predecessors.
                if undefined_pointers:
                    for pred in self.rd_ddg_graph.graph.predecessors(current_def):
                        if pred not in seen_defs:
                            defs_to_check.add(pred)
                else:
                     # This is a constant.
                    def_value = ("int", None)
                    reg_seen_defs.add(def_value)

        return reg_seen_defs

    # Checking the tag over a definition.
    def check_definition_tag(self, definition):
        if len(definition.tags) > 0:
            curr_tag = definition.tags.pop() # Ok just take the first one as for now.
            if type(curr_tag) == ReturnValueTag:
                return ("retval",curr_tag.function) 
            else:
                print(type(curr_tag))
                return None

# Path of the blob.
blob_path = "./atmel_6lowpan_udp_rx.bin"

# Address of memcpy function.
memcpy_addr = 0xf647

print("Creating angr Project")
project = angr.Project(blob_path)

print("Creating binary CFG")
bin_cfg = project.analyses.CFG(resolve_indirect_jumps=True, cross_references=True, 
                                force_complete_scan=False, normalize=True, symbols=True)

# Get CFG node for memcpy
memcpy_node = bin_cfg.model.get_any_node(memcpy_addr)
# Get all the XRefs (predecessor of the memcpy nodes)
memcpy_node_preds = memcpy_node.predecessors
# Get the CC of memcpy
memcpy_cc =  project.kb.functions[memcpy_addr].calling_convention

# Grab all functions that have an xrefs to the basic function
memcpy_funcs_preds = list(set([x.function_address for x in memcpy_node_preds]))

# Creating a dictionary of predecessors functions and the address 
# of the xrefs to the memcpy 
FUNC_PREDECESSORS = {}
for memcpy_func_pred_addr in memcpy_funcs_preds:
    FUNC_PREDECESSORS[str(memcpy_func_pred_addr)] = []
for x in memcpy_node_preds:
    FUNC_PREDECESSORS[str(x.function_address)].append(x)

OVERALL_DEFS = set()
FUNCS = set()

for memcpy_func_pred_addr, xrefs in FUNC_PREDECESSORS.items():
    memcpy_func_pred_addr = int(memcpy_func_pred_addr)
    print("Now analyzing predecessor func at {}".format(hex(memcpy_func_pred_addr)))
    print("XRefs are {}".format((xrefs)))
    
    for xref in xrefs:
        print("-->Analyzing XRefs at {}".format(hex(xref.addr)))
        # Get the Function object of the func containing the xref to memcpy
        memcpy_func_pred = bin_cfg.functions.get_by_addr(memcpy_func_pred_addr)

        # Call to the bf function is the last instruction of the block.
        call_to_xref_address = project.factory.block(xref.addr).instruction_addrs[-1]
        
        try:
            rd = project.analyses.ReachingDefinitions(subject=memcpy_func_pred, 
                                                      func_graph=memcpy_func_pred.graph,
                                                      cc = memcpy_func_pred.calling_convention,
                                                      observation_points= [("insn", call_to_xref_address , 0)],
                                                      dep_graph = dep_graph.DepGraph()
                                                     )
        except Exception as e:
            # Sorry for this, sometimes it explodes :)
            continue

        rd_ddg_graph = rd.dep_graph
        # Instantiate the object that will walk back the dep_graph.
        def_explorer = DefinitionExplorer(project, rd_ddg_graph)
        
        # Get the VEX offset for "r0"
        reg_vex_offset = project.arch.registers.get("r0", None)[0]
        
        if rd.observed_results != {}:
            # Cycle all over the results 
            for observed_result in rd.observed_results.items():
                reg_defs = observed_result[1].register_definitions.get_objects_by_offset(reg_vex_offset)
                for reg_def in reg_defs:
                    reg_seen_defs = def_explorer.resolve_use_def(reg_def)
                    for definition in reg_seen_defs:
                        OVERALL_DEFS.add(definition)

            for definition in OVERALL_DEFS:
                if definition[0] == "retval":
                    # It's not always guaranteed that the retval tag of a definition has the
                    # func addr, in those casese we call it a day (definition[1] will be None).
                    if definition[1] != None:
                        FUNCS.add(definition[1])
print(FUNCS)

And as a final result: in FUNCS we can see all the functions that are providing values to the r0 argument of memcpy:

{0x3a3d, 0x4f5d, 0x5281, 0x5585, 0x953d, 0x988d, 0xa41d}

(the code above has been tested with angr 9, commit dc801facba8a0b127e96825da9b09298fed2e5c0. You can download the binary blob here)

Hopefully, this blog-post gave enough information to start to hack into RD, feel free to ping us on our slack for any questions, we have a pretty active and excited community over there!

UPDATE 11/4/2020: More material and information regarding this technique is provided by the current maintainer @pamplemousse.
Specifically, here you can find some slides, and here a guest lecture done for the ASU Security class.