The Decompiler plugin is a sophisticated transformation engine which automatically converts the binary representation of individual functions into a high-level C representation. The Decompiler presents a view of a program which is interactive and dynamically updated as the user adds or makes changes to the annotations associated with the program. A Decompiler window maintains the correspondence between the C representation and the assembly representation displayed in the Code Browser window, to the extent possible. The window allows instant visual association and navigation between C language expressions and their corresponding assembly instructions.
To display the decompiler window, position the cursor on a function in the Code Browser, then select the
icon from the tool bar, or the Decompile option from the Window menu in the tool.
Some of the primary capabilities of the decompiler include:
- Recovers Expressions: The decompiler does full dataflow analysis which allows it to perform slicing on functions. The most tangible benefit to the user is that complicated expressions, which have been split into distinct operations/instructions and then mixed together with other instructions by the compiling/optimizing process, are reconstituted into a single expression again by the decompiler.
- Recovers High-Level Scoped Variables: The decompiler understands how compilers use processor stacks and registers to implement variables with different scopes within a function. Data-flow allows it to follow what was originally a single variable as it moves from the stack, into a register, into a different register, etc. Thus it can effectively recover the original program's concept of a variable, minimizing the need to introduce artificial variables in the output.
- Recovers Function Parameters: The decompiler understands the parameter passing conventions of the compiler and can reconstruct the form of the original function call.
- Uses Data type, Name, and Signature Annotations: The decompiler automatically pulls in all the different data types and variable names that the user has applied to functions, and the C output is altered to reflect this. High-level variables are appropriately named, structure fields and array indices are calculated and displayed with correct syntax, constant char pointers are replaced with appropriate quoted strings, etc.
- Performs Local Type Propagation: In the absence of information, the decompiler does its best to fill in information from what it does know. Variables whose data type has not been explicitly labeled by the user can often be recovered by seeing how the variable is used or by allowing the known data types to propagate.
- Can be used to Automatically Recover Structure Fields: The decompiler can be leveraged to recover references to a structure.
Variables
The decompiler will attempt to combine different locations (stack, memory, register) for variables within a function. Data type information for variables is gathered automatically from several sources. Any annotated function signatures, both of the function and of any sub-functions it calls, provide type information. If the function contains references to global memory locations that have a data type applied to them, these will also be used, and any local variables of the function can be annotated directly with data types. The user can provide data-type information to the decompiler by annotating all these sources. The more information that can be provided the better the produced C-code will be.
Variables not labeled directly are assigned types by analyzing local type propagation. Typically, assigning data types to a few key variables dramatically improves the readability of the C-code, as propagation will accurately fill in all the other data types. Assigning types in function signatures and to global variables is particularly effective because of their effect across multiple functions simultaneously.
If you have C-header files for an API a program is using, there is a prototype C-Code parser that can extract the Data Type information from C-Code and create a Ghidra Data Type Archive (.gdt). The interface is currently fairly crude, but it handles most C syntax including macro expansion. The function signatures and data types extracted can be applied to the program. Just open the archive in the Data Type Manager window, select the archive, right mouse click, and select Apply Function DataTypes. Ghidra currently provides definitions for the majority of Windows API functions and data types automatically.
Parameter Variables
Specifying data types for function parameters is especially useful. A function that has data types defined for its parameters will propagate these types into the variables of any calling functions.
C variable argument conventions, or varargs, are also supported. For instance, if the user has identified the standard C library routine printf, the signature can be defined to be void printf (char *, ...). Now whenever printf() is called, the decompiler will display the correct number of variable arguments.
Function signatures can be applied from a Ghidra data type database. Windows data types and standard C library function signatures are included with the standard distribution. More definitions will be added in the future.
Internal Decompiler Functions
Occasionally, the decompiler may use one of several internal decompiler functions that don't get transformed into more 'C'-like expressions. Use of these can indicate that the pcode is incorrect or needs to be "Tuned" to make the decompiler output better. It can also mean that the decompiler needs an additional simplification rule to take care of that particular situation.
- SUB41(x,c) - truncation operation
- The 4 is the size of the input operand (x) in bytes.
- The 1 is the size of the output value in bytes.
- The x is the thing being truncated
- The c is the number of least significant bytes being truncated
SUB42(0xaabbccdd,1) = 0xbbcc
When "c" is 0, the operation is almost always a cast between integer sizes, where the decompiler didn't quite figure it out. Usually the decompiler didn't figure out that "x" was an integer type or was forced to assume otherwise.
SUB41(x,0) is usually a cast from "int" to "char".
SUB42(x,0) is a cast from "int" to "short" and so on.
SUB84(x,4) is probably part of an extended precision multiplication but also turns up in other things like division strength reduction.
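The truncation behaviour described above can be modelled in a few lines of C. This is a minimal sketch of the semantics only, not decompiler code; the helper name sub_bytes and the check function are hypothetical, and the model assumes operands of at most 8 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of SUBnm(x, c): drop the c least significant bytes
   of x, then keep the low out_size bytes of what remains.
   Assumes c < 8 and out_size <= 8. */
uint64_t sub_bytes(uint64_t x, unsigned c, unsigned out_size)
{
    uint64_t shifted = x >> (8u * c);
    uint64_t mask = (out_size >= 8) ? UINT64_MAX
                                    : ((UINT64_C(1) << (8u * out_size)) - 1);
    return shifted & mask;
}

/* Checks the examples from the text against the model. */
void check_sub_examples(void)
{
    /* SUB42(0xaabbccdd,1) = 0xbbcc */
    assert(sub_bytes(0xaabbccddu, 1, 2) == 0xbbccu);

    /* SUB41(x,0) gives the same bits as a narrowing cast to 8 bits. */
    uint32_t x = 0x12345678u;
    assert(sub_bytes(x, 0, 1) == (uint8_t)x);

    /* SUB84(x,4) selects the high 4 bytes of an 8-byte value, such as
       the upper half of a 32x32 -> 64 bit product. */
    uint64_t product = (uint64_t)0xffffffffu * 0x10u; /* 0xffffffff0 */
    assert(sub_bytes(product, 4, 4) == 0xfu);
}
```

Calling check_sub_examples() from a test program exercises each case; when c is 0 the model reduces to an ordinary narrowing cast, which is why such calls usually indicate a missed cast.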
- CONCAT31(x,y) - concatenates two operands together into a larger size object
- The "3" is the size of x in bytes.
- The "1" is the size of y in bytes.
- The result is the 4-byte concatenation of the bits in "x" with the bits in "y". The "x" forms the most significant part of the result, "y" the least.
CONCAT31(0xaabbcc,0xdd) = 0xaabbccdd
This usually crops up when a 1-byte sized (char) variable is being stored in a 4-byte register. All the basic arithmetic/logical ops on the 4-byte register give the correct result for doing the operation on a 1-byte variable; the compiler just has to make sure to ignore the 3 most significant bytes of the register. The CONCAT31 is the decompiler keeping track of these most significant bytes that the compiler was ignoring because it is mistakenly interpreting the register variable as being a 4-byte variable. In many cases the decompiler can figure this out, but especially in looping constructs, it cannot. This is really a dead code issue. The decompiler currently makes judgements about dead code for entire varnodes. A full fix of this problem would require a dead code elimination algorithm that could decide if part of a varnode were dead.
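As a rough illustration of the situation described above, the following C sketch models CONCAT31 and the common pattern of updating a 1-byte variable held in a 4-byte register. The function names are hypothetical and the code only mirrors the semantics; it is not how the decompiler is implemented.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of CONCAT31(x, y): x supplies the 3 most
   significant bytes of the result, y the least significant byte. */
uint32_t concat31(uint32_t x, uint8_t y)
{
    return ((x & 0xffffffu) << 8) | y;
}

/* A 1-byte variable living in a 4-byte register: writing the low byte
   keeps the upper 3 bytes, which is what CONCAT31 tracks. */
uint32_t update_low_byte(uint32_t reg, uint8_t new_low)
{
    return concat31(reg >> 8, new_low);
}

/* Checks the example from the text against the model. */
void check_concat_examples(void)
{
    /* CONCAT31(0xaabbcc,0xdd) = 0xaabbccdd */
    assert(concat31(0xaabbccu, 0xddu) == 0xaabbccddu);

    /* Only the low byte changes; the upper bytes are carried along. */
    assert(update_low_byte(0x11223344u, 0x99u) == 0x11223399u);
}
```

If the original program only ever reads the low byte, the upper three bytes in update_low_byte are dead, which is the partial dead code the text refers to.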
- ZEXT14(x) - zero extension
- The 1 is the size of the operand x in bytes
- The 4 is the size of the output in bytes
This is almost always a cast from small integer types to big unsigned types.
- SEXT14(x) - signed extension
- The 1 is the size of the operand x in bytes
- The 4 is the size of the output in bytes
This is probably a cast from a small signed integer into a big signed integer.
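In C terms, both extensions correspond to ordinary widening casts; the difference lies only in whether the source type is unsigned or signed. The sketch below uses hypothetical function names and fixed-width types to make the sizes explicit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of ZEXT14(x): widen 1 byte to 4, filling with zeros. */
uint32_t zext14(uint8_t x)
{
    return (uint32_t)x;
}

/* Hypothetical model of SEXT14(x): widen 1 byte to 4, copying the sign bit. */
int32_t sext14(int8_t x)
{
    return (int32_t)x;
}

/* Shows how the same bit pattern 0x80 widens under each extension. */
void check_extension_examples(void)
{
    /* Unsigned 0x80 (128) stays 128 after zero extension. */
    assert(zext14(0x80u) == 0x00000080u);

    /* Signed 0x80 (-128) stays -128; its bit pattern becomes 0xffffff80. */
    assert(sext14(-128) == -128);
    assert((uint32_t)sext14(-128) == 0xffffff80u);
}
```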
- SBORROW4(x,y) - true if subtracting the signed numbers would cause a borrow
- The 4 is the size of both x and y in bytes
Returns "true" if there is an arithmetic overflow when subtracting "y" from "x" as signed integers. These are generated particularly by signed integer comparisons. There are rules in place for recovering the original comparison, but this is one special case that was missed. These could also conceivably be generated in extended precision subtraction.
- CARRY4(x,y) - true if there would be a carry adding x to y
- SCARRY4(x,y) - true if there would be a signed overflow adding x to y
- The 4 is the size of both x and y in bytes
Returns "true" if there would be a carry adding x to y.
If these are turning up everywhere in a particular binary, it could be a missed simplification that could be easily fixed.
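The three predicates can be written out in portable C as follows. This is a minimal sketch with hypothetical function names; it computes the conditions with unsigned arithmetic to avoid undefined signed overflow. The last function shows why SBORROW4 tends to appear around signed comparisons: on many processors x < y is tested as "sign of x - y differs from the overflow flag".

```c
#include <assert.h>
#include <stdint.h>

/* CARRY4: unsigned addition wraps past 32 bits. */
int carry4(uint32_t x, uint32_t y)
{
    return (uint32_t)(x + y) < x;
}

/* SCARRY4: signed overflow on addition. Both operands share a sign
   and the sum has the opposite sign. */
int scarry4(int32_t x, int32_t y)
{
    uint32_t ux = (uint32_t)x, uy = (uint32_t)y;
    uint32_t sum = ux + uy;
    return (int)(((ux ^ sum) & (uy ^ sum)) >> 31);
}

/* SBORROW4: signed overflow on x - y. The operands differ in sign
   and the difference has the opposite sign to x. */
int sborrow4(int32_t x, int32_t y)
{
    uint32_t ux = (uint32_t)x, uy = (uint32_t)y;
    uint32_t diff = ux - uy;
    return (int)(((ux ^ uy) & (ux ^ diff)) >> 31);
}

/* Signed x < y expressed with the flag-style test a compiler may emit. */
int signed_less_via_flags(int32_t x, int32_t y)
{
    uint32_t diff = (uint32_t)x - (uint32_t)y;
    int sign = (int)(diff >> 31);
    return sign != sborrow4(x, y);
}

void check_flag_examples(void)
{
    assert(carry4(0xffffffffu, 1u) == 1);
    assert(carry4(1u, 2u) == 0);
    assert(scarry4(INT32_MAX, 1) == 1);
    assert(scarry4(-1, 1) == 0);
    assert(sborrow4(INT32_MIN, 1) == 1);
    assert(sborrow4(5, 7) == 0);
    assert(signed_less_via_flags(INT32_MIN, 1) == 1);
    assert(signed_less_via_flags(7, 5) == 0);
}
```

When the decompiler recognizes the pattern in signed_less_via_flags it can print a plain x < y; a visible SBORROW4 call suggests a variant of the pattern that was not matched.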
Register Settings
Occasionally a program will use a register to store a global constant. By using the <Set Register> function on the right mouse pop-up menu, the user can specify this value to the decompiler. The constant will be propagated automatically throughout the function, and the resulting code may be greatly simplified.
Decompiler Options
The following Decompiler Analysis Options are available ( Edit->Options Decompiler/Analysis ):
- Eliminate unreachable code - causes the decompiler to eliminate branch paths which it considers unreachable as a result of constant propagation.
- Ignore unimplemented instructions - causes the decompiler to ignore instructions whose semantics have been marked as unimplemented. Otherwise a halt_unimplemented call will appear in the decompilation for such cases.
- Infer constant pointers - allows the decompiler to infer a data-type for constants it determines are likely pointers. In the basic heuristic, each constant is treated as an address, and if that address starts a known data or function element in the program, the constant is assumed to be a pointer.
- Respect read-only flags - causes the decompiler to treat any values in memory or blocks of memory marked read-only as constant values. Normally global memory is considered public writable, meaning you cannot depend on the initial value at a location. Any global value could be changed by another function. For areas of memory that are really read-only and never change their statically initialized value, mark the memory area as read only in the Memory Manager or specific Data locations as Constant (see Data Mutability below).
Typically as part of the import process, memory blocks are marked as read-only if the memory block is tagged as such in the imported binary.
- Simplify predication - causes the decompiler to simplify code that employs conditional (predicated) instructions, merging if/else blocks of code that share the same condition.
- Simplify extended integer operations - causes the decompiler to simplify integer operations, where a single logical value is split into high and low pieces that are acted on in multiple stages. The decompiler tries to identify these constructions and replaces the multiple stages with a single operation.
- Use in-place assignment operators - causes the decompiler to employ in-place C assignment operators such as += in the decompiled syntax.
- Decompiler Timeout (seconds) - the number of seconds to allow the decompiler to run before it is terminated. This setting currently affects only background analysis that uses the decompiler; it does not affect the UI, which will run indefinitely.
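To illustrate what the Simplify extended integer operations option is aiming at, the following C sketch shows a 64-bit addition carried out in two 32-bit stages, as a 32-bit processor would do it, next to the single operation it represents. The function names are hypothetical and the code is a hand-written illustration, not decompiler output.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit add split into low and high 32-bit pieces, with the carry
   from the low stage fed into the high stage. */
void add64_staged(uint32_t a_lo, uint32_t a_hi,
                  uint32_t b_lo, uint32_t b_hi,
                  uint32_t *r_lo, uint32_t *r_hi)
{
    uint32_t lo = a_lo + b_lo;
    uint32_t carry = lo < a_lo;   /* 1 if the low stage wrapped */
    *r_lo = lo;
    *r_hi = a_hi + b_hi + carry;
}

/* The single logical operation the two stages implement. */
uint64_t add64_single(uint64_t a, uint64_t b)
{
    return a + b;
}

/* Confirms that both forms agree, including when the low half carries. */
void check_extended_add(void)
{
    uint32_t lo, hi;
    add64_staged(0xffffffffu, 0x1u, 0x1u, 0x0u, &lo, &hi);
    uint64_t staged = ((uint64_t)hi << 32) | lo;
    assert(staged == add64_single(0x1ffffffffu, 0x1u));
    assert(staged == 0x200000000u);
}
```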
Data Mutability
Decompiler output can be influenced by the mutability of data locations within memory. Supported mutability settings include:
- Read-only/Constant - indicates that a memory location's value never changes and the currently stored value can be treated as a constant.
- Volatile - indicates that a memory location's value may change asynchronously between reads. Reads and writes to such locations are never simplified by the decompiler and are wrapped with specially named function calls (e.g., volatile_read, volatile_write). The language definition and compiler specification may predefine specific volatile regions of memory and may also override the default volatile read/write function names.
Data mutability may be controlled by the user in one of two ways:
It is important to note that the decompiler is only as good as the definition of the underlying assembly language code. Each assembly instruction has an associated PCODE definition that describes what the instruction does, essentially an RTL (Register Transfer Language). For example, the following MOV instruction, which moves a value into an offset onto the stack, also has a PCODE definition.
MOV local_1c[ESP], 0x804aac8
temp1 = INT_ADD 0x4, ESP
temp2 = COPY 0x804aac8
STORE ram(temp1), temp2
Irregularities in the produced C-code can often be attributed to errors in this underlying definition. Such errors can usually be fixed quickly. Please report any problems or issues you find.
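The three operations in the MOV example can be traced with a toy model in C. This is only a sketch for following the data flow: the machine structure, the function names, the RAM size, and the 4-byte little-endian store are all assumptions made for the illustration and do not reflect how Ghidra represents or executes PCODE.

```c
#include <assert.h>
#include <stdint.h>

enum { RAM_SIZE = 0x200 };      /* arbitrary size for the toy memory */

/* Hypothetical machine state: one register and a small RAM. */
struct machine {
    uint32_t esp;
    uint8_t ram[RAM_SIZE];
};

/* INT_ADD: 32-bit addition with wraparound. */
uint32_t pcode_int_add(uint32_t a, uint32_t b)
{
    return a + b;
}

/* COPY: pass the value through unchanged. */
uint32_t pcode_copy(uint32_t v)
{
    return v;
}

/* STORE: write 4 bytes to RAM, least significant byte first (assumed). */
void pcode_store(struct machine *m, uint32_t addr, uint32_t value)
{
    for (unsigned i = 0; i < 4; i++)
        m->ram[(addr + i) % RAM_SIZE] = (uint8_t)(value >> (8u * i));
}

/* The MOV instruction expressed as the three operations from the text. */
void exec_mov(struct machine *m)
{
    uint32_t temp1 = pcode_int_add(0x4, m->esp);
    uint32_t temp2 = pcode_copy(0x804aac8u);
    pcode_store(m, temp1, temp2);
}

void check_mov_example(void)
{
    struct machine m = { .esp = 0x100 };
    exec_mov(&m);
    assert(m.ram[0x104] == 0xc8 && m.ram[0x105] == 0xaa);
    assert(m.ram[0x106] == 0x04 && m.ram[0x107] == 0x08);
}
```

If the definition of an instruction computed temp1 from the wrong offset, every later read of that stack slot would appear unrelated to this write, which is the kind of irregularity the text describes.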
- A good way to start using the decompiler is by defining the parameters to functions that are obviously "char *" string references. This allows the decompiler to discover and display any static strings referenced anywhere the function is called.
- The decompiler can work out references to fields of a data structure and figure out array indexing given enough information about data types. Building these data type definitions greatly enhances readability of the C-code and is a natural way to encapsulate reverse engineering knowledge. If you notice many offset references from a base value other than the frame or stack pointer, that value is probably pointing to a structure or an array. Notice psParm1 in the code below. There are several different references off of it. The parameter can be annotated to point to a structure. The user can create a new structure or use one from a Ghidra data type library.
Without knowing the data type, the decompiler produces the following C-code.
After applying the appropriate structure, the code becomes:
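As a hand-written C sketch of this difference (not actual decompiler output; the structure layout, field names, and function names are hypothetical), the first function below reaches fields through raw byte offsets from the pointer, the way code looks before a type is applied, and the second expresses the same computation once the parameter is typed as a structure pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structure discovered during reverse engineering. */
struct packet {
    int32_t id;      /* offset 0 */
    int32_t length;  /* offset 4 */
    int32_t flags;   /* offset 8 */
};

/* Untyped style: several references at fixed offsets from psParm1. */
int32_t sum_untyped(void *psParm1)
{
    char *base = (char *)psParm1;
    return *(int32_t *)(base + 4) + *(int32_t *)(base + 8);
}

/* Typed style: the same accesses written as field references. */
int32_t sum_typed(struct packet *pkt)
{
    return pkt->length + pkt->flags;
}

/* Confirms the layout assumption and that both forms agree. */
void check_structure_example(void)
{
    struct packet p = { .id = 7, .length = 40, .flags = 2 };
    assert(offsetof(struct packet, length) == 4);
    assert(offsetof(struct packet, flags) == 8);
    assert(sum_untyped(&p) == sum_typed(&p));
    assert(sum_typed(&p) == 42);
}
```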
- The parameters shown where a function is called may not agree with the parameters where the function is defined. This can be caused by several things:
- The function takes variable arguments.
- The parameters are not actually referenced (used) by the function.
- The decompiler does not see the parameter location being filled.
Parameters determined from the function definition are more likely to be correct.
Errors from the decompiler process are reported in the status area of the tool and sometimes
at the end of the C code in the decompiler window.
- Double-click - Navigates to the symbol that was clicked.
- Control-double-click - Navigates to the symbol that was clicked, opening the results in a new window.
- Control-shift-click - Triggers the Listing in a Snapshots view to navigate to the address denoted by the symbol that was clicked.
- Middle-mouse - If you press the middle mouse button, the decompiler will highlight every occurrence of the variable or constant under the current cursor location (the button can be changed in the tool options under Browser Field->Cursor Text Highlight).
You can navigate to the target of a goto statement by double-clicking its label (you can also double-click a brace to navigate to the matching brace).
Other actions available in the decompiler are described in the following paragraphs.
C Code from the decompiler window can be copied and pasted into any other system text window. Select the text to copy, and then choose Copy from the popup menu.
Set a comment on a line of C-Code. The comment will be
stored in the program database at the closest assembly line associated with the generated
C-Code. Any type of comment (EOL, Post, Pre, Plate) can be attached to the representative
C-Code. When this function is re-displayed at some later point, the comment will
persist.
By default, the decompiler will analyze the code to try to discover function parameters, return type, and local variables. Each time the decompiler displays C-code for a function it does this analysis again. Commit Params/Return causes any parameter names and types and return type to be saved in the program database so that next time the function is decompiled the current definitions will be used. This is useful for "syncing" the function signature with the disassembly display. This causes the names and types of parameters and returns in the disassembly to agree with the decompiler names and types.
Ghidra will do stack analysis that will recover parameters and return types, but for many programs, the analysis the decompiler does is better.
There is a
prototype plug-in that automatically pulls in the decompiler derived information and applies it
to each function as the function is created.
If a variable
displayed in the assembly window has an undefined type, the decompiler will still respect the
name of the variable.
By default, the decompiler will analyze the code to try to discover function parameters, return type, and local variables. Each time the decompiler displays C-code for a function it does this analysis again. Commit Locals causes any local variable names and types to be saved in the program database so that next time the function is decompiled the current local variable definitions will be used. This is useful for "syncing" the local variable definitions with the disassembly display. This causes the names and types of locals in the disassembly to agree with the decompiler names and types.
Ghidra will do stack analysis that will recover local variables
on the stack, but for many programs, the analysis the decompiler does is better.
There is a
prototype plug-in that automatically pulls in the decompiler derived information and applies it
to each function as the function is created. The plugin by default will not commit local
variable definitions, either stack or register locals. Committing locals automatically
can be turned on by changing the analysis options for the Decompiler Parameter ID plugin.
In most cases it is better to commit locals only for certain functions that you really care
about, or after the data type definitions (structures, etc...) have settled down for the
program you are Reverse Engineering.
If a variable
displayed in the assembly window has an undefined type, the decompiler will still respect the
name of the variable.
Automatically creates a structure definition for the pointed to structure, and fills it out based on the references found by the decompiler.
To use this, place the cursor on a function parameter variable, or any variable within a function that is a pointer to a structure. It could currently have a data type of undefined, int, void *, char *, etc... For example: func(int *this), (for a C++ this call function).
If the variable is already a structure pointer, any new references found will be added to the structure, even if the structure must grow in size. This is very useful as you find more places the structure is used. If you have already started recovering a portion of a structure and find it used in another function, retype the variable to be the structure, and then use Auto Fill in Structure to add any new fields recovered for the structure.
This feature is also available in the assembly listing when the
cursor is placed on a defined parameter or return variable.
Currently this
only recovers the structure by following the structure pointer through the current function and
any function the structure is passed into within the current function. Eventually this
will be put into a global type analyzer, but for now it is most useful interactively.
For best
results, the function should be well formed with good flow, and all the switch statements
should be recoverable.
There is also
a script called CreateStructure that you can use for
automated structure recovery. For instance if you have a set of ThisCall routines where
the first parameter to all the routines is a pointer to a shared class structure, the script
could be modified to recover the structure for each this parameter.
Highlight Def-Use
Highlights all places a value is used, starting at the place it is first written, and including all the places where that one value is read. This is usually a proper subset of all the places a variable appears in the function. Place the cursor over a variable you would like to highlight and select Highlight Def-Use from the pop-up menu.
As an example, the variable a at the top of the function is under the cursor when Highlight Def-Use is chosen.
Notice that the first three references to a are highlighted but the final use of a is not because the value might have changed in the else clause.
Highlight Forward Slice
Highlight Forward Slice highlights each variable whose value may be affected by the value in the variable under the cursor.
As an example, b, the output of max_alpha, is under the cursor when Highlight Forward Slice is chosen.
We can see that c is tainted by the value of b all the way through to the bottom of the function.
Highlight Backward Slice
Highlight Backward Slice highlights all points in the function that contain a value involved in the creation of the value in the variable under the cursor.
As an example, the final a of the function is under the cursor when Highlight Backward Slice is chosen.
We can see that the final value of a is affected by the loop and by the input parameter but never by b and c.
Highlight Forward Instruction Slice
Highlight Forward Inst Slice highlights each instruction whose value may be affected by the value in the variable under the cursor, rather than the values themselves.
Highlight Backward Instruction Slice
Highlight Backward Inst Slice highlights all instructions in the function that contribute to the creation of the value in the variable under the cursor.
Rename Variable
Any parameter or local variable can be renamed. Just place the cursor over a variable definition, or any use of the variable and choose Rename Variable from the popup menu. The name will now be saved for this function, so the next time the decompiler displays the code for the function, the same name is used.
Rename Function
A shortcut for renaming the function from within the decompiler window.
Retype Variable
The decompiler does its best to recover the type of a variable automatically but often only has limited information for analysis. Explicitly changing the type of a variable can dramatically improve the C-code produced. This is especially true for structures. Changing the type of a parameter variable will affect the display for every place the function is called.
To change a variable's data type, place the cursor over the variable definition or a use of the variable, select Retype Variable from the popup menu, and then enter the name of the type. The name of any data type known to the program can be used.
A simple code improvement is to locate any functions with obvious string parameters and re-type the parameter to be a "char *". Any references to defined memory will now display the passed parameter as a character "string".
Edit Data Type of Variable
Only structure, union, and enum data types can be edited. If a variable's data type is one of these, it can be edited. Also, if the data type is a type definition, array, or pointer based on an editable data type, then the base data type can be edited. For example, if you have a structure pointer for a variable, then you can edit the structure. To edit a variable's data type, place the cursor over the variable definition or a use of the variable and select Edit Data Type... from the popup menu. For structures and unions, the structure editor will appear, and for enums the enum editor will appear. If the data type for a variable can't be edited, the action will be disabled in the popup.
Edit Function Signature
The Edit Function Signature dialog allows you to change the function's signature, the calling convention, whether the function is inline and whether the function has no return.
The function signature includes
- function name
- return type
- number of parameters
- parameter names
- parameter type
- varargs (variable arguments)
This feature allows you to edit a function signature text string to change any of these.
For example if a function is actually printf(), instead of changing the name, return type, and parameters individually, the entire function signature can be changed all at once. To do this you could enter
void printf( char *fmt, ...)
within the Signature field and then select the OK button.
In addition, you can select the Calling Convention for this function from a list of available calling conventions as determined by the program's language. Selecting the Inline checkbox indicates that the function is in-lined. Selecting the No Return checkbox indicates that the function does not return.
The signature of the current function, or any called function can be changed.
To edit a function's signature from the Decompile window, place the cursor over any function name and select Edit Function Signature from the popup menu. The dialog will appear with the function's current information.
Override Signature
Overrides the signature of a called function at the point it is called. This allows you to set the parameter values for a particular call.
Remove Signature Override
This action allows you to remove a previously added function signature override.
Find...
Find any string of text within the currently decompiled function.
Debug Function Decompilation
For certain functions, the decompiler may produce an error message, produce incorrect code, or simply exit without producing results. Selecting Debug Function Decompilation from the decompiler provider window toolbar will run the decompiler again and save all relevant information to an XML file. Instead of submitting the entire program to be analyzed to discover the problem, only a small XML file is needed.
Graph AST Control Flow
Selecting Graph AST Control Flow from the decompiler provider window toolbar will generate an abstract syntax tree (AST) control flow graph based upon the decompiler results and render the graph within the current Graph Service.
If no Graph Service is available then this action will not be present.
Export to C
You can export the current decompiled function to a file by selecting the
icon in the local tool bar of the decompiler window. A file chooser dialog is displayed for you to select the name of the output file. If you do not specify a file extension, ".c" is appended to the filename.
Snapshot
Creates a Snapshot of the current decompiler window, which allows you to leave the current decompiled function in place while navigating to other functions.
Properties
The colors used in the decompiler window can be changed by editing the C Display Options through the Edit Options dialog. To edit the options, choose Edit->Tool Options... from the tool menu. Click on the C Display node in the Options tree. A panel shows the colors that can be customized. Click on the color bar to bring up the color chooser to change the color.
The other options allow you to change the maximum characters in a line displayed in the decompiler window, and the number of characters for indenting in the code.
These function similarly to Code Browser Mouse Hovers
Provided by: Decompiler Plugin