# Background

In this blog-post I would like to talk about some recent effort we have done to improve the Reaching Definition (RD) analysis inside angr, and how you can use it in your projects. Straight from compiler theory, a reaching definition analysis is (in very simple terms) a static program analysis technique that can help us answer which are the possible definitions for a specific variable at a particular line of code.

Let’s take, for instance, the following snippet:

`````` c = input()
 var a = 6
 var b = a + 6
 if c == 3:
   a = 4
   c = a + 1
 else
   a = 2
   c = a + 2
 b = a + 2
``````

Here we can use RD by placing a -so called- observation point at  and ask “which are the values we can observe at  for variable ‘a’?”. The answer in this case will be 4. If we would have asked the same question at  the answer would have been 2 and, eventually, asking this question at observation point  would have given us both 4 and 2 as an answer. If you are confused regarding the latest answer, keep in mind that this technique is purely static: the analysis simply observes that since the var ‘c’ has no specific value we can take both the branches of the ‘if’, therefore, ‘a’ can be re-defined with 4 or 2.

Now, let’s spice this up a little bit. 🌶️

When doing binary analysis we don’t have the luxury of having the source code of the application we are analyzing, and chances are that, if you are using angr, you are exactly in this situation. However, variables are just fancy programming language abstractions and under the hood they are stored in memory and registers. Therefore, we can use the RD analysis to reason about those entities and getting nearly equivalent results.

If we think about the previous program in (totally unoptimized) assembly:

``````  mov rcx, [rsp+0x8]   ; moving variable 'c' in rcx
  mov rax, 6
  mov [rsp+0x10], rax  ; setting var 'a'
  add rbx, rax, 6      ; defining variable 'b'
  mov [rsp+0x1C], rbx
  cmp rcx, 0x3         ; checking if var 'c' is 3 or not
  jne label2           ; if not, go to label2
  mov [rsp+0x10], 0x4
  mov rax, [rsp+0x10]
 mov [rsp+0x8], rcx
 jmp end
 label2:
  mov [rsp+0x10], 0x2
  mov rax, [rsp+0x10]
  mov [rsp+0x8], rcx
 end:
  mov rax, [rsp+0x10]
  mov [rsp+0x1C], rbx
``````

Now, we need to define an observation point and understand the question we want to ask to get the result we want. For instance, if we want to ask, like before, which are the possible values of the var ‘a’ in the branch where ‘c’ is equal to 3 we would need to ask “which are the values we can observe at  for register rax?”.

This technique is the foundation for different analyses at the base of vulnerability hunting tools, static taint engine systems, function prototypes recovery routines, constant propagators, and so on. We extensively use this in angr to accomplish a different kind of tasks because its speed, compared with analyses based on Dynamic Symbolic Execution (DSE), it’s amazing. Personally, I’ve been able to speed up things from taking an hour (using DSE) to minutes by simply re-thinking them using this approach!

# Example

Let’s go through some real code that will, hopefully, clarify how to leverage this feature and how to eventually hack into it (recently there have been many questions on our slack and I thought this blog-post might have been useful for some researchers).

For the purpose of this example, I’ll go through a real problem I had to solve for my current project. In particular, I wanted to collect all the locations in a given binary where the return value of a function was consumed as an argument for a `memcpy`. For instance:

`````` c = foo()
 ...
 { ... more code here ... }
 ...
 memcpy(c, a, 0x10)
``````

For most of you this problem probably already triggered the word “taint engine” in your minds, but, how can we practically implement this in angr? A friend and colleague of mine (@badnack) already proposed a solution here. However, while this approach can be useful in some situations, it doesn’t really fit my case mainly for 3 reasons:

1. It is fully based on DSE. Therefore, while executing code we are creating SimState(s), collecting and solving constraints, and in general, hacking the symbolic exploration (and abusing its features) to force angr to behave like a taint engine.
2. This is a direct consequence of the previous point, but deserves a separate discussion. If we want to design an analysis not affected by false negatives (i.e., we want to spot all the locations where `memcpy` is consuming a value returned by a function) we need to reason about EVERY possible path that we can observe over the Control Flow Graph (CFG). This means that we would have to exhaust the symbolic exploration to be sure we checked all of them. Unfortunately, this can either lead to state explosions/path explosions or simply take an unacceptable amount of time.
Moreover, we would like to start from the sinks (the calls to memcpy) and reverse-search to the sources (the functions' return values), but backward symbolic execution is not quite possible in angr (and honestly I’m also not aware of any tools that implement this).
3. It doesn’t really scale if we want to make this analysis inter-functional. The first implementation of this spent more than an hour(!) to find all the locations in the code, moreover, the results were partial and some paths were not considered.

The intuition here is to try to implement the same technique we would manually use to spot these dependencies: look at the code and simply look at how variables are defined and use in the function. In the following screenshot, you can see how the variable `v25` is defined by the `memb_alloc` and consumed by `memcpy`. The idea here is pretty simple: we want to set an observation point at the memcpy call (address `0x37CA`) and then ask for seeable definitions for the register `r0` (the one that is holding the `v25`).

``````.text:000037BA STRB    R5, [R0,#0x1C]
.text:000037BC MOVS    R3, #1
.text:000037BE STRB    R3, [R0,#0x1D]
.text:000037C2 MOV     R3, R9
.text:000037C6 MOVS    R2, #0x10
.text:000037C8 LDR     R3, =(memcpy+1)
.text:000037CA BLX     R3  ;memcpy     <== OBSERVATION POINT
.text:000037CC ADDS    R0, R4, #4      ; t
.text:000037CE MOVS    R1, #0x12C00    ; interval
.text:000037D2 LDR     R3, =(timer_set+1)
.text:000037D4 BLX     R3              ; timer_set
.text:000037D6 MOVS    R1, R4          ; item
.text:000037D8 LDR     R0, =others_services_list ; list
``````

Zooming out this basic block it’s interesting to see where it is placed inside the CFG of the function under analysis (in green). Basically, by asking the question “which definitions for `r0` can we observe at address `0x37CA`?” we will collect all the possible definitions for `r0` that are observable when the execution flow reaches the address `0x37CA` inside the green basic block.

Let’s start to write some code:

``````import angr
import autoblob # just to load the blob, more at https://github.com/subwire/autoblob
import angr.analyses.reaching_definitions.dep_graph as dep_graph

blob_path = "./blob.bin"
call_to_memcpy = 0x37CB # Odd addresses because ARM Thumb mode

print("Creating angr Project")
project = angr.Project(blob_path)

# Some standard options for the CFG
print("Creating binary CFG")
bin_cfg = project.analyses.CFG(resolve_indirect_jumps=True,
cross_references=True,
force_complete_scan=False,
normalize=True,
symbols=True)

# Getting the func object from the addres
# Starting the ReachingDefinition analysis
rd = project.analyses.ReachingDefinitions(subject=target_func,
func_graph=target_func.graph,
cc = target_func.calling_convention,
observation_points= [("insn",
call_to_memcpy,
0)],
dep_graph = dep_graph.DepGraph()
)
``````

This will specifically run the reaching definition analysis over the target function `0x3609` by using as an observation point the call to the memcpy at `0x37CB` (note that you can set multiple observation points if you wish).

By investigating the results object we can get the information regarding the register `r0`:

``````# VEX offset is just how the VEX IR refers to registers
reg_vex_offset = project.arch.registers.get("r0", None)
reg_defs = rd.one_result.register_definitions.get_objects_by_offset(reg_vex_offset)
print(reg_defs)
# RESULT:
# {<Definition {Atom:<Reg 8<4>>, Codeloc:<0x37c1 id=0x37bb>,
#              Data:DataSet<32>: {<Undefined>}}>}
``````

The result is telling us that the latest instruction that defines the register `r0` -before calling the `memcpy`- is at code location `0x37c1` (for those curious: `id` is the basic block address and `36` is the VEX statement), but its value (in the DataSet) is `Undefined`. Why is that? To understand let’s take a look at the instruction at `0x37c1`:

``````.text:000037C0 ADDS    R0, #0xC
``````

The definition of `r0` is coming from whatever value `r0` was holding added to `0xc`. But what was the value of `r0` at this point? Well, that’s again a question for RD:

``````rd2 = project.analyses.ReachingDefinitions(subject=target_func,
func_graph=target_func.graph,
cc = target_func.calling_convention,
observation_points= [("insn", 0x37c1 , 0)],
dep_graph = dep_graph.DepGraph()
)
reg_defs2 = rd2.one_result.register_definitions.get_objects_by_offset(reg_vex_offset)
print(reg_defs2)
# RESULT:
# {<Definition {Tags:{
#                    <ReturnValueTag {Function: 0x3a3d,
#                   },
#              Atom:<Reg 8<4>>,
#              Codeloc:<0x37b5 id=0x37b1[-2]>,
#              Data:DataSet<32>: {<Undefined>}}>}
``````

The previous definition for `r0` has been recorded at the address `0x37b5`, but by looking at that address we can’t see any definition for `r0`, right?

``````.text:000037B0
.text:000037B0 loc_37B0                ; m
.text:000037B0 process_pt = R8         ; pt *
.text:000037B0 LDR     R0, =registrations
.text:000037B2 LDR     R3, =(memb_alloc+1)
.text:000037B4 BLX     R3     ; memb_alloc  <==
``````

Well, the RD analysis is smart enough that, by making use of the function’s calling convention (CC), knows that this function is going to return stuff, and under the current CC this “stuff” is returned inside `r0`, therefore, `r0` is actually redefined here. This information is integrated inside the `Definition` object we got as a result. We can see that, indeed, the `Definition` has been tagged with a `ReturnValueTag`, pointing out that not only the content of `r0` is the result of a function call, but also that the function that has been called is `0x3a3d`. 🏷️

Now we can easily understand why the value observed at the call to memcpy was `Undefined`: RD didn’t know the value returned by the function `0x3a3d`, but simply that `r0` it’s going to be redefined. Given that, it assigns an `Undefined` value and when later the RD engine has to sum up `Undefined` + `0xC` everything is merged into `Undefined` (for people familiar with data-flow analysis this is basically merging to Top).

The tagging mechanism for these definitions has been recently introduced and I really believe empowers a lot of smart reasoning about these chains of definitions of registers. In particular, I’m gonna leverage this exact mechanism to solve the problem we were discussing at the beginning.

Recalling the problem: we are trying to spot situations where the first argument of `memcpy` (held by register `r0` in this architecture) is the return value of another function. The idea is pretty simple: automatically walk back the definitions chain for the registers we need (by using the `dep_graph`) until the definition is not `Undefined` anymore (in that case it would mean that the argument is a constant) OR until we hit a `Definition` node tagged with a `ReturnValueTag`.

We can picture this in the following way: The `r0`’s definition flowing inside `memcpy` is the one at D3, since this has the `Undefined` value we will walk back the `dep_graph` and eventually find D1 and D2 (in this case 2 different paths can bring us to ins3). D1 has no interesting value for our purpose since it is a constant, but D2 instead will be tagged as coming from `function_abc`. (quick note: the graph presented above is actually a simplification of what we can actually see if we would have dumped the `dep_graph` object with networkx).

Now that the approach is clear, it’s time to write the code to implement this analysis.

# Final Code

``````import angr
import autoblob
import os
import angr.analyses.reaching_definitions.dep_graph as dep_graph

from angr.engines.light import SpOffset, RegisterOffset
from angr.knowledge_plugins.key_definitions.atoms import Register, SpOffset, MemoryLocation
from angr.knowledge_plugins.key_definitions.undefined import Undefined
from angr.knowledge_plugins.key_definitions.definition import Tag
from angr.knowledge_plugins.key_definitions.tag import ReturnValueTag
from angr.knowledge_plugins.key_definitions.tag import ParameterTag

from networkx.drawing.nx_agraph import write_dot

# Utility class to walk back the definitions graph.
class DefinitionExplorer():
def __init__(self, project, rd_ddg_graph):
self.project = project
self.rd_ddg_graph = rd_ddg_graph

def resolve_use_def(self, reg_def):
# Now we need to analyze the definition for this atom
reg_seen_defs = set()
defs_to_check = set()

# Cache of all seen nodes (Tie the knot)
seen_defs = set()

while len(defs_to_check) != 0:
current_def = defs_to_check.pop()
# Check if the current Definition has a tag
def_value = self.check_definition_tag(current_def)

# If def_value is not None we hit a "retval" and we collect it,
# in the other case we need to check if it is Undefined, if yes gotta walk back.
if def_value:
else:
dataset = current_def.data
# Boolean guard: do we have any undefined pointers?
undefined_pointers = False

# A value in DataSet can be "Int" or "Undefined"
for data in dataset:
if type(data) == Undefined: undefined_pointers = True

# If we have undefined pointers (a.k.a. Top value) we need to process the predecessors.
if undefined_pointers:
for pred in self.rd_ddg_graph.graph.predecessors(current_def):
if pred not in seen_defs:
else:
# This is a constant.
def_value = ("int", None)

return reg_seen_defs

# Checking the tag over a definition.
def check_definition_tag(self, definition):
if len(definition.tags) > 0:
curr_tag = definition.tags.pop() # Ok just take the first one as for now.
if type(curr_tag) == ReturnValueTag:
return ("retval",curr_tag.function)
else:
print(type(curr_tag))
return None

# Path of the blob.
blob_path = "./atmel_6lowpan_udp_rx.bin"

print("Creating angr Project")
project = angr.Project(blob_path)

print("Creating binary CFG")
bin_cfg = project.analyses.CFG(resolve_indirect_jumps=True, cross_references=True,
force_complete_scan=False, normalize=True, symbols=True)

# Get CFG node for memcpy
# Get all the XRefs (predecessor of the memcpy nodes)
memcpy_node_preds = memcpy_node.predecessors
# Get the CC of memcpy

# Grab all functions that have an xrefs to the basic function
memcpy_funcs_preds = list(set([x.function_address for x in memcpy_node_preds]))

# Creating a dictionary of predecessors functions and the address
# of the xrefs to the memcpy
FUNC_PREDECESSORS = {}
for x in memcpy_node_preds:

OVERALL_DEFS = set()
FUNCS = set()

print("Now analyzing predecessor func at {}".format(hex(memcpy_func_pred_addr)))
print("XRefs are {}".format((xrefs)))

for xref in xrefs:
# Get the Function object of the func containing the xref to memcpy

# Call to the bf function is the last instruction of the block.

try:
rd = project.analyses.ReachingDefinitions(subject=memcpy_func_pred,
func_graph=memcpy_func_pred.graph,
cc = memcpy_func_pred.calling_convention,
dep_graph = dep_graph.DepGraph()
)
except Exception as e:
# Sorry for this, sometimes it explodes :)
continue

rd_ddg_graph = rd.dep_graph
# Instantiate the object that will walk back the dep_graph.
def_explorer = DefinitionExplorer(project, rd_ddg_graph)

# Get the VEX offset for "r0"
reg_vex_offset = project.arch.registers.get("r0", None)

if rd.observed_results != {}:
# Cycle all over the results
for observed_result in rd.observed_results.items():
reg_defs = observed_result.register_definitions.get_objects_by_offset(reg_vex_offset)
for reg_def in reg_defs:
reg_seen_defs = def_explorer.resolve_use_def(reg_def)
for definition in reg_seen_defs:

for definition in OVERALL_DEFS:
if definition == "retval":
# It's not always guaranteed that the retval tag of a definition has the
# func addr, in those casese we call it a day (definition will be None).
if definition != None:
print(FUNCS)
``````

And as a final result: in `FUNCS` we can see all the functions that are providing values to the `r0` argument of `memcpy`:

``````{0x3a3d, 0x4f5d, 0x5281, 0x5585, 0x953d, 0x988d, 0xa41d}
``````

(the code above has been tested with angr 9, commit `dc801facba8a0b127e96825da9b09298fed2e5c0`. You can download the binary blob here)

Hopefully, this blog-post gave enough information to start to hack into RD, feel free to ping us on our slack for any questions, we have a pretty active and excited community over there!

UPDATE 11/4/2020: More material and information regarding this technique is provided by the current maintainer @pamplemousse.
Specifically, here you can find some slides, and here a guest lecture done for the ASU Security class.