libudis86 is a disassembler library for the x86 architecture, including support for the newer 64-bit variants (IA32e, amd64, etc.). It lets you decode a stream of bytes as x86 instructions, inspect various pieces of information about those instructions, and even translate them to human-readable assembly language.
libudis86 is reentrant, and to maintain that property it does not use static data. All data related to the disassembly are stored in a single object, called the udis86 object ud_t.
ud_t: A structure encapsulating udis86 disassembler state.
To use libudis86 you must create an instance of this object,
ud_t ud_obj;
and initialize it,
ud_init(&ud_obj);
You can create multiple such objects and use them with the library; each one is an independent disassembler.
The decode semantics of a sequence of bytes depend on the target machine state for which they are being disassembled. In x86, this means the current effective processor mode (16, 32, or 64 bits), the current program counter (ip/eip/rip), and sometimes the processor vendor. By default, libudis86 is initialized to 32-bit disassembly mode, with the program counter at 0 and the vendor set to UD_VENDOR_ANY. The following functions allow you to override these defaults to suit your needs.
ud_set_mode(): Sets the mode of disassembly. Possible values are 16, 32, and 64. By default, the library works in 32-bit mode.
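Why the mode matters can be seen from a single byte: in 16- and 32-bit modes the opcodes 0x40 through 0x4F encode one-byte INC/DEC register instructions, while in 64-bit mode the same bytes are REX prefixes. The following is a minimal sketch of that distinction in plain C. It does not call libudis86, and the helper name is_rex_prefix is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not part of libudis86): reports whether `byte`
 * acts as a REX prefix under the given disassembly mode (16, 32 or 64). */
bool is_rex_prefix(uint8_t byte, unsigned mode_bits)
{
    assert(mode_bits == 16 || mode_bits == 32 || mode_bits == 64);
    /* REX prefixes exist only in 64-bit mode and occupy 0x40..0x4F. */
    return mode_bits == 64 && (byte & 0xF0) == 0x40;
}
```

The same byte, 0x48, is therefore a prefix when decoding for 64-bit mode and a complete instruction otherwise, which is why the mode has to be set before disassembly starts.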
ud_set_pc(): Sets the program counter (IP/EIP/RIP). This changes the offset of the assembly output generated, with direct effect on branch instructions.
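The effect on branch instructions follows from how x86 encodes relative branches: the displacement is relative to the address of the next instruction. The sketch below shows the arithmetic only. It is not the library's code, the name branch_target is hypothetical, and truncation of the result to 16 or 32 bits in the narrower modes is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: absolute target of a relative branch.
 *   insn_addr  address of the branch instruction itself
 *   insn_len   length of the instruction in bytes
 *   rel        signed displacement encoded in the instruction */
uint64_t branch_target(uint64_t insn_addr, unsigned insn_len, int64_t rel)
{
    /* An x86 instruction is at most 15 bytes long. */
    assert(insn_len >= 1 && insn_len <= 15);
    /* Relative branches are measured from the next instruction;
     * unsigned wrap-around handles negative displacements. */
    return insn_addr + insn_len + (uint64_t)rel;
}
```

For the two-byte sequence EB FE (a jump to itself, displacement -2), the target is 0 when the instruction sits at address 0 and 0x1000 when it sits at 0x1000, so the printed operand changes with the program counter.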
ud_set_vendor(): Sets the vendor whose instruction set is used for decoding. This is only useful for selecting between the VMX and SVM instruction sets, the point at which Intel and AMD have diverged significantly. At a later stage, support for a more granular selection of instruction sets may be added.
ud_set_input_buffer(): Sets the input source for the library to a buffer of fixed size.
ud_set_input_file(): Sets the input source to a file pointed to by a given standard library FILE pointer. Note that libudis86 does not perform any checks, and assumes that the file pointer is properly initialized and the file is open for reading.
ud_set_input_hook(): Sets the input source for the library to a callback. To retrieve each byte in the stream, libudis86 calls the function pointed to by hook. The hook function, defined by the client, must return a single byte of input each time it is called. To signal end of input, it must return the constant UD_EOI.
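The hook contract (one byte per call, a sentinel at the end) can be sketched without the library as follows. Everything here is a stand-in: SKETCH_EOI plays the role of UD_EOI, and struct byte_source, next_byte and count_bytes are hypothetical names. The real hook's signature should be taken from udis86.h.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_EOI (-1)  /* stand-in for the library's end-of-input constant */

/* Hypothetical byte source: a buffer plus a read cursor. */
struct byte_source {
    const unsigned char *data;
    size_t len;
    size_t pos;
};

/* Hook-style reader: returns the next byte (0..255) as an int, or
 * SKETCH_EOI once the buffer is exhausted. */
int next_byte(struct byte_source *src)
{
    assert(src != NULL);
    if (src->pos >= src->len)
        return SKETCH_EOI;
    return src->data[src->pos++];
}

/* Consumer loop: pulls bytes until end of input and counts them. */
size_t count_bytes(struct byte_source *src)
{
    size_t n = 0;
    while (next_byte(src) != SKETCH_EOI)
        n++;
    return n;
}
```

Returning int rather than unsigned char is what keeps the sentinel distinguishable from a legitimate 0xFF data byte.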
ud_input_skip(): Skips n bytes in the input stream.
libudis86 can translate the decoded instruction into one of two dialects: one that resembles Intel assembler syntax (as found in NASM, YASM, et al.), and one that resembles GNU Assembler (AT&T style) syntax. By default, this is set to the Intel-like syntax. You can override the default or specify your own translator using the following function.
ud_set_syntax(): libudis86 disassembles one instruction at a time into an intermediate form that lets you inspect the instruction and its various aspects individually. But to generate the assembly language output, this intermediate form must be translated. This function sets the translator. There are two built-in translators: UD_SYN_INTEL, for the Intel-like syntax, and UD_SYN_ATT, for the AT&T-like syntax.
If you do not want libudis86 to translate, you can pass NULL to the function; no translation is performed thereafter. This is useful when you only want to identify chunks of code and create the assembly output later if needed, or when you are only interested in examining the instructions and do not want to waste cycles generating the assembly output.
If you want to create your own translator, you can specify a pointer to your own function. This function must accept a single parameter, the udis86 object ud_t, and it will be invoked every time an instruction is decoded.
With the target state and input source set up, you can now disassemble. At the core of the libudis86 API is the function ud_disassemble(), which does this. libudis86 exposes decoded instructions in an intermediate form meant to be useful for programs that want to examine them. This intermediate form is available through the functions and fields of ud_t described below.
ud_disassemble(): Disassembles the next instruction in the input stream.
Returns: the number of bytes disassembled. A 0 indicates end of input.
Note, to restart disassembly after the end of input, you must call one of the input setting functions with a new source of input.
A common use-case pattern for this function is in a loop:
while (ud_disassemble(&ud_obj)) {
    /*
     * use or print decode info.
     */
}
For each successful invocation of ud_disassemble(), you can use the following functions to get information about the disassembled instruction.
ud_insn_off(): Returns the starting offset of the disassembled instruction relative to the program counter value specified initially.
ud_insn_hex(): Returns a pointer to the character string holding the hexadecimal representation of the disassembled bytes.
ud_insn_ptr(): Returns a pointer to the buffer holding the instruction bytes. Use ud_insn_len() to determine the length of this buffer.
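The relationship between the raw byte buffer, its length, and the hexadecimal string can be sketched as below. This is not the library's implementation: bytes_to_hex is a hypothetical helper, and the output format (lowercase, two digits per byte, no separators) is an assumption made for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: writes `len` bytes as lowercase hex digits into
 * `out` and NUL-terminates it. `out_size` must be at least 2 * len + 1.
 * Returns the number of characters written, excluding the terminator. */
size_t bytes_to_hex(const uint8_t *bytes, size_t len, char *out, size_t out_size)
{
    assert(bytes != NULL && out != NULL);
    assert(out_size >= 2 * len + 1);
    for (size_t i = 0; i < len; i++) {
        /* Each byte yields exactly two digits plus a temporary NUL. */
        snprintf(out + 2 * i, 3, "%02x", (unsigned)bytes[i]);
    }
    out[2 * len] = '\0';
    return 2 * len;
}
```

In a real program the first two arguments would come from the instruction byte pointer and the instruction length for the most recently decoded instruction.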
ud_insn_asm(): If the syntax is specified, returns a pointer to the character string holding the assembly language representation of the disassembled instruction.
ud_insn_opr(): Returns a pointer to the nth operand of the instruction. If the instruction does not have such an operand, the function returns NULL.
ud_insn_mnemonic(): Returns the instruction mnemonic in the form of an enumerated constant (enum ud_mnemonic_code). As a convention, all mnemonic constants are composed by prefixing standard instruction mnemonics with UD_I. For example, UD_Imov, UD_Ixor, UD_Ijmp, etc.
See also ud_lookup_mnemonic().
ud_lookup_mnemonic(): Returns a pointer to a character string corresponding to the given mnemonic code. Returns NULL if the code is invalid.
An intermediate representation of instruction operands is available in the form of ud_operand_t. You can retrieve the nth operand of a disassembled instruction using the function ud_insn_opr().
ud_operand_t: The operand type represents a single operand of an instruction. It contains the following fields.
size: Size of the operand in bits.
type: Type of the operand. Possible values include:
UD_OP_MEM: A memory operand. The intermediate form normalizes all memory address equations to the scale-index-base form. The address equation is available in the base register, index register, displacement size, and lval fields described below.
UD_OP_PTR: A segment:offset pointer operand. The size field can have two values, 32 (for 16:16 seg:off) and 48 (for 16:32 seg:off). The pointer value is available in lval (as lval.ptr.seg and lval.ptr.off).
base: Contains an enumerated constant of type enum ud_type representing a register operand or the base of a memory operand.
index: Contains an enumerated constant of type enum ud_type representing the index register of a memory operand.
offset: Contains the size of the displacement component of a memory address operand. The displacement itself is given by lval.
lval: A union data structure that aggregates integer fields of different sizes, storing values depending on the type and size of the operand.
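lval.sbyte: Signed Byte
lval.ubyte: Unsigned Byte
lval.sword: Signed Word
lval.uword: Unsigned Word
lval.sdword: Signed Double Word
lval.udword: Unsigned Double Word
lval.sqword: Signed Quad Word
lval.uqword: Unsigned Quad Word
lval.ptr.seg: Pointer Segment in Segment:Offset
lval.ptr.off: Pointer Offset in Segment:Offset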
Instruction Pointer:
UD_R_RIP
8-Bit Registers:
UD_NONE,
UD_R_AL, UD_R_CL, UD_R_DL, UD_R_BL,
UD_R_AH, UD_R_CH, UD_R_DH, UD_R_BH,
UD_R_SPL, UD_R_BPL, UD_R_SIL, UD_R_DIL,
UD_R_R8B, UD_R_R9B, UD_R_R10B, UD_R_R11B,
UD_R_R12B, UD_R_R13B, UD_R_R14B, UD_R_R15B,
16-Bit General Purpose Registers:
UD_R_AX, UD_R_CX, UD_R_DX, UD_R_BX,
UD_R_SP, UD_R_BP, UD_R_SI, UD_R_DI,
UD_R_R8W, UD_R_R9W, UD_R_R10W, UD_R_R11W,
UD_R_R12W, UD_R_R13W, UD_R_R14W, UD_R_R15W,
32-Bit General Purpose Registers:
UD_R_EAX, UD_R_ECX, UD_R_EDX, UD_R_EBX,
UD_R_ESP, UD_R_EBP, UD_R_ESI, UD_R_EDI,
UD_R_R8D, UD_R_R9D, UD_R_R10D, UD_R_R11D,
UD_R_R12D, UD_R_R13D, UD_R_R14D, UD_R_R15D,
64-Bit General Purpose Registers:
UD_R_RAX, UD_R_RCX, UD_R_RDX, UD_R_RBX,
UD_R_RSP, UD_R_RBP, UD_R_RSI, UD_R_RDI,
UD_R_R8, UD_R_R9, UD_R_R10, UD_R_R11,
UD_R_R12, UD_R_R13, UD_R_R14, UD_R_R15,
Segment Registers:
UD_R_ES, UD_R_CS, UD_R_SS, UD_R_DS,
UD_R_FS, UD_R_GS,
Control Registers:
UD_R_CR0, UD_R_CR1, UD_R_CR2, UD_R_CR3,
UD_R_CR4, UD_R_CR5, UD_R_CR6, UD_R_CR7,
UD_R_CR8, UD_R_CR9, UD_R_CR10, UD_R_CR11,
UD_R_CR12, UD_R_CR13, UD_R_CR14, UD_R_CR15,
Debug Registers:
UD_R_DR0, UD_R_DR1, UD_R_DR2, UD_R_DR3,
UD_R_DR4, UD_R_DR5, UD_R_DR6, UD_R_DR7,
UD_R_DR8, UD_R_DR9, UD_R_DR10, UD_R_DR11,
UD_R_DR12, UD_R_DR13, UD_R_DR14, UD_R_DR15,
MMX Registers:
UD_R_MM0, UD_R_MM1, UD_R_MM2, UD_R_MM3,
UD_R_MM4, UD_R_MM5, UD_R_MM6, UD_R_MM7,
FPU Registers:
UD_R_ST0, UD_R_ST1, UD_R_ST2, UD_R_ST3,
UD_R_ST4, UD_R_ST5, UD_R_ST6, UD_R_ST7,
SSE Registers:
UD_R_XMM0, UD_R_XMM1, UD_R_XMM2, UD_R_XMM3,
UD_R_XMM4, UD_R_XMM5, UD_R_XMM6, UD_R_XMM7,
UD_R_XMM8, UD_R_XMM9, UD_R_XMM10, UD_R_XMM11,
UD_R_XMM12, UD_R_XMM13, UD_R_XMM14, UD_R_XMM15,
Prefix bytes that affect the disassembly of the instruction are available in the following fields, each of which corresponds to a particular type or class of prefixes.
pfx_rex: 64-bit mode REX prefix
pfx_seg: Segment register prefix
pfx_opr: Operand-size prefix (66h)
pfx_adr: Address-size prefix (67h)
pfx_lock: Lock prefix
pfx_rep: Rep prefix
pfx_repe: Repe prefix
pfx_repne: Repne prefix
These fields default to UD_NONE if the respective prefixes were not found.