This section is a brief introduction to using Instruction-Based Sampling (IBS). It assumes that a CodeAnalyst project is already open, either created by following the directions under Creating a CodeAnalyst Project or opened from an existing CodeAnalyst project. It also assumes that session settings have been established and that CodeAnalyst is ready to profile an application.
A drop-down list of the available profile configurations is included in the CodeAnalyst toolbar.
When data collection completes, CodeAnalyst processes the IBS performance data and creates a new session under "IBS Sessions" in the session management area at the left-hand side of the CodeAnalyst window. Results are displayed in the System Data, System Graph, and System Tasks tabs. These tabs behave like their TBP and EBP counterparts. The tables and graph display the number of IBS-derived events that were sampled by the performance monitoring hardware.
CodeAnalyst reports IBS performance data as IBS-derived events. See Instruction-Based Sampling-Derived Events for descriptions of the IBS-derived events.
Although IBS derived events look similar to performance monitoring counter (PMC) events, the sampling method is quite different. PMC event samples measure the actual number of particular hardware events that occurred during the measurement period. IBS derived events report the number of IBS fetch or op samples for which a particular hardware event was observed. Consider three IBS derived events: IBS all op samples, IBS branch op, and IBS mispredicted branch op.
The IBS all op samples derived event is a count of all IBS op samples that were taken. The IBS branch op derived event is the number of IBS op samples in which the monitored macro-op was a branch; these samples are a subset of all the IBS op samples. The IBS mispredicted branch op derived event is the number of IBS branch op samples in which the branch mispredicted; these samples are a subset of the IBS branch op samples. A PMC event can count the actual number of branches (event select 0x0C2), but it would be incorrect to say that the IBS branch op derived event reports the number of branches, because the sampling basis is different.
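The subset relationship among these three derived events can be illustrated with a minimal sketch. The record layout (`OpSample`, `is_branch`, `mispredicted`) and the sample values are hypothetical and do not reflect CodeAnalyst's internal data format; the sketch only shows that each derived event is a count of samples, not a count of hardware events.

```python
from dataclasses import dataclass


@dataclass
class OpSample:
    """Hypothetical IBS op sample; the field names are illustrative only."""
    is_branch: bool
    mispredicted: bool = False


def derive_events(samples):
    """Count the samples that fall under each of the three derived events."""
    all_ops = len(samples)
    branch_ops = sum(1 for s in samples if s.is_branch)
    # Only branch samples can contribute to the mispredicted count.
    mispredicted = sum(1 for s in samples if s.is_branch and s.mispredicted)
    return {
        "all_ops": all_ops,
        "branch_ops": branch_ops,
        "mispredicted_branch_ops": mispredicted,
    }


# Made-up samples: five ops, three of them branches, one mispredicted.
samples = [
    OpSample(is_branch=False),
    OpSample(is_branch=True),
    OpSample(is_branch=True, mispredicted=True),
    OpSample(is_branch=False),
    OpSample(is_branch=True),
]
events = derive_events(samples)
```

Each count is bounded by the one before it, which mirrors the nesting described above: mispredicted branch op samples within branch op samples within all op samples.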
The "All Data" view shows the number of occurrences of all IBS derived events. Instruction-Based Sampling collects a wide range of performance data in a single run. When both IBS fetch and op data are collected, the "All Data" view displays information for over 60 IBS derived events. CodeAnalyst provides several predefined views that display IBS derived events in logically-related groups. The available views are:
Most software developers will be interested in the overall breakdown of IBS ops, branch operations, load/store operations, data cache behavior, and data translation lookaside buffer behavior. The breakdown of local/remote accesses through the Northbridge can provide information about the efficiency of memory access on non-uniform memory access (NUMA) platforms.
IBS information about instruction cache (IC) behavior is displayed. IC-related IBS derived events are shown, including the number of IBS fetch samples for attempted and completed fetch operations, the number of fetch samples that indicated an IC miss, and the total IBS fetch latency. An attempted fetch is a request to obtain instruction data from cache or system memory. A fetch attempt may be speculative. A completed fetch actually delivers instruction data to the decoder. The delivered data may go unused if a branch operation redirects the pipeline at a later time. Finally, the view also includes two computed performance measurements: the IC miss ratio (the number of IBS IC misses divided by the number of IBS attempted fetches) and the average fetch latency. Fetch latency is the number of cycles from when a fetch is initiated to when the fetch is either completed or aborted. (An aborted fetch is a fetch operation that does not complete and deliver instruction data to the decoder.)
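The two computed measurements can be sketched as follows. The function name, parameter names, and input values are hypothetical. The text does not state the denominator used for average fetch latency; this sketch assumes it is the number of attempted fetches, because latency is defined for both completed and aborted fetches.

```python
def ic_metrics(attempted_fetches, ic_misses, total_fetch_latency):
    """Return (IC miss ratio, average fetch latency in cycles).

    The miss ratio follows the definition in the text. Dividing the total
    latency by attempted fetches is an assumption, not a documented formula.
    """
    if attempted_fetches == 0:
        return 0.0, 0.0
    miss_ratio = ic_misses / attempted_fetches
    avg_latency = total_fetch_latency / attempted_fetches
    return miss_ratio, avg_latency


# Made-up counts for illustration only.
miss_ratio, avg_latency = ic_metrics(
    attempted_fetches=2000, ic_misses=50, total_fetch_latency=12000
)
```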
The "IBS All ops" view is displayed. This view is an overall summary of the collected IBS op samples. It shows the total number of IBS op samples, the number of op samples taken for branch operations, and the number of samples for ops that performed a memory load and/or store operation. Tag-to-retire time is the number of cycles from when an op was selected (tagged) for sampling to when the op retired. Completion-to-retire time is the number of cycles from when an op completed (finished execution) to when the op retired. Total and average tag-to-retire and completion-to-retire times are shown in the next view.
The "IBS MEM data cache" view is displayed. This view shows information related to data cache (DC) behavior. The number of sampled IBS load/store operations is shown along with a breakdown of the number of loads and the number of stores. The number of IBS samples where the load/store operation missed in the data cache is shown. The DC miss rate (DC misses divided by the total number of op samples) and DC miss ratio (DC misses divided by the number of load/store operations) are also displayed.
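The difference between the DC miss rate and the DC miss ratio lies only in the denominator. A minimal sketch, with hypothetical function and parameter names and made-up counts:

```python
def dc_metrics(op_samples, load_store_ops, dc_misses):
    """Return (DC miss rate, DC miss ratio) as defined in the text.

    miss rate  = DC misses / all IBS op samples
    miss ratio = DC misses / IBS load/store op samples
    """
    miss_rate = dc_misses / op_samples if op_samples else 0.0
    miss_ratio = dc_misses / load_store_ops if load_store_ops else 0.0
    return miss_rate, miss_ratio


# Made-up counts: 10,000 op samples, 4,000 of them load/store, 200 DC misses.
dc_miss_rate, dc_miss_ratio = dc_metrics(
    op_samples=10_000, load_store_ops=4_000, dc_misses=200
)
```

Because load/store samples are a subset of all op samples, the miss ratio is never smaller than the miss rate.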
The "IBS MEM data TLB" view is displayed. This view shows information related to data translation lookaside buffer (DTLB) behavior. The number of sampled IBS load/store operations is shown along with a breakdown of the number of load operations and the number of store operations. AMD processors use a two-level DTLB. Address translation may hit in the L1 DTLB, miss in the L1 DTLB and hit in the L2 DTLB ("L1M L2H"), or miss in both levels of the DTLB ("L1M L2M"). The performance penalty for a miss in both levels is relatively high. Nearly half of the sampled load/store operations incurred a miss at both levels of the DTLB. This is the performance culprit in the sample program, classic, which performs a "textbook" implementation of matrix multiplication.
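The three DTLB outcomes can be expressed as a small classification over load/store samples. This is a sketch under stated assumptions: each sample is represented as a hypothetical pair of booleans `(l1_miss, l2_miss)`, and the label for an L1 hit is invented here, since the text names only "L1M L2H" and "L1M L2M". The sample data is made up and is not taken from the classic program.

```python
from collections import Counter

LABELS = ("L1 hit", "L1M L2H", "L1M L2M")


def classify_dtlb(l1_miss, l2_miss):
    """Map one load/store sample to one of the three DTLB outcomes."""
    if not l1_miss:
        return "L1 hit"  # L2 is not consulted when the L1 DTLB hits
    return "L1M L2M" if l2_miss else "L1M L2H"


def dtlb_breakdown(samples):
    """Return the fraction of load/store samples in each DTLB outcome."""
    counts = Counter(classify_dtlb(l1, l2) for l1, l2 in samples)
    total = len(samples)
    return {label: (counts[label] / total if total else 0.0) for label in LABELS}


# Made-up samples: (l1_miss, l2_miss) per sampled load/store operation.
ls_samples = [(False, False), (True, False), (True, True), (True, True)]
breakdown = dtlb_breakdown(ls_samples)
```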
To find the source of the performance issue in the example program, we need to drill down into the classic module.
A list of functions within classic is displayed with the IBS information for each function. CodeAnalyst retains the "IBS MEM data TLB" view. The function "multiply_matrices" has the most load/store activity and incurs the bulk of the DTLB misses.
The source code for the function "multiply_matrices" is displayed with the IBS information for each source line in the function. Most load/store activity occurs at line 65, which is the statement within the nested loops. This is the statement that reads an element from each of the two operand matrices and computes the running sum of the product of the elements. The DTLB misses are caused by the large strides taken through matrix_b. With nearly every iteration the program touches a different page, thereby causing a miss in the DTLB.
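The effect of large strides on page locality can be approximated with a short sketch. The matrix dimension, element size, and page size below are assumptions, since the text does not give them, and the sketch does not model a real DTLB; it only counts how often consecutive accesses to a row-major matrix fall on a different page when the matrix is walked by row versus by column.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages
ELEM_SIZE = 8     # assumed 8-byte matrix elements


def page_changes(n, by_column):
    """Count consecutive accesses to an n x n row-major matrix that land
    on a different page than the previous access."""
    changes = 0
    prev_page = None
    for outer in range(n):
        for inner in range(n):
            # Column-wise traversal varies the row index fastest, which
            # produces a stride of one full row between accesses.
            row, col = (inner, outer) if by_column else (outer, inner)
            page = (row * n + col) * ELEM_SIZE // PAGE_SIZE
            if prev_page is not None and page != prev_page:
                changes += 1
            prev_page = page
    return changes


n = 512  # hypothetical dimension; one row then occupies exactly one page
row_walk = page_changes(n, by_column=False)
column_walk = page_changes(n, by_column=True)
```

Under these assumptions the row-wise walk changes page only when it moves to a new row, while the column-wise walk changes page on every access, which matches the behavior described for matrix_b.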
The disassembled instructions for source line 65 are displayed along with the IBS data for each instruction. IBS load/store information is attributed to each instruction that performs either a memory read or write operation. Sources of performance-robbing DTLB misses are precisely identified.
The "IBS BR branch" view is displayed. This view shows the number of IBS branch op samples and indicates if the branch operation mispredicted and/or was taken. Note that only the conditional jump instruction at the end of the innermost loop is marked as a branch instruction. This example further illustrates the precision offered by Instruction-Based Sampling.