ASM documentation says a label represent a basic block, and it is a node in the control graph. So I test the visitLabel
method on this simple example:
public static void main(String[] args) {
int x = 3, y = 4;
if (x < y) {
x++;
}
}
For the visitLabel
method, I instrument it with a native API: setID(int id)
, where the id is incremental. In this example, a CFG should have 3 nodes: one at the beginning, and one for each branch of the if statement. So I expect setID
would be called in 3 locations. However, it is called 5 times, and there are a lot of nop
instructions. Could anybody explain for me why?
Here is the instrumented bytecode for the above program.
public static void main(java.lang.String[]);
Code:
0: iconst_2
1: invokestatic #13 // Method setId:(I)V
4: iconst_3
5: istore_1
6: iconst_3
7: invokestatic #13 // Method setId:(I)V
10: iconst_4
11: istore_2
12: iconst_4
13: invokestatic #13 // Method setId:(I)V
16: iload_1
17: iload_2
18: if_icmpge 28
21: iconst_5
22: invokestatic #13 // Method setId:(I)V
25: iinc 1, 1
28: bipush 6
30: invokestatic #13 // Method setId:(I)V
33: return
34: nop
35: nop
36: nop
37: nop
38: athrow
What I don't understand is why there is a label
before each istore
instruction. There is no branching to make it a new node in the CFG.
The primary purpose of a
Label
is to denote a position in the bytecode sequence. Since this is needed for branch targets, you can use them to identify basic blocks. But you have to be aware, that they are also used for reporting line numbers, when aLineNumberTable
attribute is present and for reporting local variable scopes when aLocalVariableTable
attribute is present as well as, for newer class files, their type annotations recorded in aRuntimeVisibleTypeAnnotations
attribute. Further, labels may mark the protected area of an exception handler. For code generated from Java source code, this protected area matches thetry
block, so its a basic block, but that doesn’t need to hold for other bytecode.See
visitLocalVariable(java.lang.String name, java.lang.String descriptor, java.lang.String signature, Label start, Label end, int index)
visitLocalVariableAnnotation(int typeRef, TypePath typePath, Label[] start, Label[] end, int[] index, java.lang.String descriptor, boolean visible)
visitTryCatchBlock(Label start, Label end, Label handler, java.lang.String type)
Since the scope of local variables may span the last
return
instruction, it is possible to encounter labels after that last instruction, which is what happens in your case. You are injecting abipush 7, invokestatic #13
after thereturn
instruction, resulting in unreachable code.Apparently, you are also using the
COMPUTE_FRAMES
options to let ASM recalculate stack map frames from scratch, but it is impossible to calculate frames for unreachable code, due to the unknown initial stack state. ASM solves this problem by replacing unreachable code withnop
instructions, followed by a singleathrow
statement. For this sequence, it’s possible to specify a valid initial stack frame and it has no impact on the execution (as the code is unreachable).As you can see, four
nop
instructions plus oneathrow
instruction span five bytes, which is the same size as the replacedbipush 7, invokestatic #13
sequence had.You can get rid of most of these reported labels, by specifying
ClassReader.SKIP_DEBUG
to itsaccept
method. Then, you get only one reported label for your example, the branch target associated with theif
statement. But you have to handle thevisitJumpInsn
to identify the start of the conditional code.So, to identify all basic blocks, you have to handle all branch instructions, i.e. via
visitJumpInsn
,visitLookupSwitchInsn
, andvisitTableSwitchInsn
, as well as all end instructions, i.e.athrow
and all variants ofreturn
. Further, you need to process allvisitTryCatchBlock
calls. If you need to identify potential targets of branch instructions in a single pass, I’d usevisitFrame
instead of labels, as frames are mandatory at all branch targets for class file version of 51 (Java 7) or higher.By the way, when all you’re injecting are these sequences of loading a constant and calling a static method (at reachable locations), I’d use
COMPUTE_MAXS
instead ofCOMPUTE_FRAMES
, as an expensive recalculation is not necessary when the general code structure does not change.