OProfile manual


Table of Contents

1. Introduction
  1. OProfile legacy profiling mode
  2. OProfile perf_events profiling mode
  3. OProfile event counting mode
  4. Applications of OProfile
    4.1. Support for dynamically compiled (JIT) code
    4.2. No support for virtual machine guests
  5. System requirements
  6. Internet resources
  7. Installation
  8. Uninstalling OProfile
2. Overview
  1. Getting started with OProfile using operf
  2. Getting started with OProfile using ocount
  3. Specifying performance counter events
  4. Tools summary
3. Controlling the profiler
  1. Using operf
  2. Setting up the JIT profiling feature
    2.1. JVM instrumentation
  3. Configuration details
    3.1. Hardware performance counters
    3.2. OProfile timer interrupt mode
    3.3. Architecture-specific configuration notes
4. Obtaining profiling results
  1. Profile specifications
    1.1. Examples
    1.2. Profile specification parameters
    1.3. Locating and managing binary images
    1.4. What to do when you don't get any results
  2. Image summaries and symbol summaries (opreport)
    2.1. Merging separate profiles
    2.2. Side-by-side multiple results
    2.3. Callgraph output
    2.4. Differential profiles with opreport
    2.5. Anonymous executable mappings
    2.6. XML formatted output
    2.7. Options for opreport
  3. Outputting annotated source (opannotate)
    3.1. Locating source files
    3.2. Usage of opannotate
  4. OProfile results with JIT samples
  5. gprof-compatible output (opgprof)
    5.1. Usage of opgprof
  6. Analyzing profile data on another system (oparchive)
    6.1. Usage of oparchive
  7. Converting sample database files (opimport)
    7.1. Usage of opimport
5. Interpreting profiling results
  1. Profiling interrupt latency
  2. Kernel profiling
    2.1. Interrupt masking
    2.2. Idle time
    2.3. Profiling kernel modules
  3. Interpreting call-graph profiles
  4. Inaccuracies in annotated source
    4.1. Side effects of optimizations
    4.2. Prologues and epilogues
    4.3. Inlined functions
    4.4. Inaccuracy in line number information
  5. Assembly functions
  6. Overlapping symbols in JITed code
  7. Using operf to profile fork/execs
  8. Other discrepancies
6. Controlling the event counter
  1. Using ocount
7. Acknowledgments

Chapter 1. Introduction

This manual applies to OProfile version 1.2.0git. OProfile is a set of performance monitoring tools for Linux 2.6 and higher systems, available on a number of architectures. OProfile provides the following features:

  • Profiler
  • Post-processing tools for analyzing profile data
  • Event counter

OProfile is capable of monitoring native hardware events occurring in all parts of a running system, from the kernel (including modules and interrupt handlers) to shared libraries to binaries. OProfile can collect event information for the whole system in the background with very little overhead. These features make it ideal for monitoring entire real-world systems to find bottlenecks.

Many CPUs provide "performance counters", hardware registers that can count "events"; for example, cache misses or CPU cycles. OProfile can collect profiles of code based on how often these events occur: every time a certain (configurable) number of events has occurred, the PC (program counter) value is recorded. This information is aggregated into profiles for each binary image. Alternatively, OProfile's event counting tool can collect simple raw event counts.
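The sampling scheme described above can be illustrated with a toy simulation. This is only a sketch and not OProfile code: it assumes a made-up input stream with one line per hardware event, where each line names the code location (standing in for the PC value) at which the event occurred. The function name sample_events and the location names are hypothetical.

```shell
# Toy model of event-based sampling (not OProfile code).
# Reads one code location per line from stdin, one line per event,
# records the location of every Nth event, and prints the resulting
# profile as "<samples> <location>", most-sampled location first.
# $1: number of events between samples (the "count" value)
sample_events() {
    awk -v count="$1" '
        # Each time "count" events have occurred, record the location.
        NR % count == 0 { samples[$1]++ }
        END {
            for (loc in samples)
                printf "%d %s\n", samples[loc], loc
        }
    ' | sort -rn
}

# Twelve events (ten in hot_loop, two in setup), sampled every third event.
printf '%s\n' hot_loop hot_loop setup hot_loop hot_loop hot_loop \
              hot_loop hot_loop hot_loop hot_loop setup hot_loop |
    sample_events 3
# prints:
# 3 hot_loop
# 1 setup
```

Only four of the twelve events are recorded, so the profile is a statistical estimate of where events occur rather than an exact count.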

1. OProfile legacy profiling mode

Prior to release 1.0, OProfile included a profiling tool consisting of the opcontrol shell script, the oprofiled daemon, and the attendant oprofile kernel driver. This "legacy profiler" was deprecated in release 0.9.8 with the introduction of the operf profiling tool (see Section 2, “OProfile perf_events profiling mode”). Some older architectures/platforms do not support the use of operf. For those cases, OProfile users should install release 0.9.9, which is the last release to include the legacy profiler.

2. OProfile perf_events profiling mode

OProfile has the ability to profile a single process or every currently running process (i.e., system-wide) via the operf program. operf interfaces with the kernel to collect samples via the Linux Kernel Performance Events Subsystem (hereafter referred to as "perf_events"). OProfile can co-exist with other tools on your system that may also be using the perf_events kernel subsystem.

Using operf to profile a single process can be done as a normal user; however, root authority is required to run operf in system-wide profiling mode.

Note

Some older processor models are not supported by the underlying perf_events kernel and, thus, are not supported by operf. If you receive the message
  Your kernel's Performance Events Subsystem does not support your processor type
when attempting to use operf, install OProfile 0.9.9 and try profiling with opcontrol to see if your processor type may be supported by OProfile's legacy mode.

3. OProfile event counting mode

OProfile provides the ocount tool for collecting raw event counts on a per-application, per-process, per-cpu, or system-wide basis. Unlike the profiling tools, post-processing of the data collected is not necessary -- the data is displayed in the output of ocount. A common use case for event counting tools is when performance analysts want to determine the CPI (cycles per instruction) for an application. High CPI implies possible stalls, and many architectures provide events that give detailed information about the different types of stalls. The events provided are architecture-specific, so we refer the reader to the hardware manuals available for the processor type being used.

4. Applications of OProfile

OProfile is useful in a number of situations. You might want to use OProfile when you:

  • need low overhead

  • cannot use highly intrusive profiling methods

  • need to profile interrupt handlers

  • need to profile an application and its shared libraries

  • need to profile dynamically compiled code of supported virtual machines (see Section 4.1, “Support for dynamically compiled (JIT) code”)

  • need to capture the performance behaviour of an entire system

  • want to examine hardware effects such as cache misses

  • want detailed source annotation

  • want instruction-level profiles

  • want call-graph profiles

OProfile is not a panacea. OProfile might not be a complete solution when you:

  • require call graph profiles on platforms other than x86, ARM, and PowerPC

  • require 100% instruction-accurate profiles

  • need function call counts or an interstitial profiling API

  • cannot tolerate any disturbance to the system whatsoever

  • need to profile interpreted or dynamically compiled code of non-supported virtual machines

4.1. Support for dynamically compiled (JIT) code

OProfile provides a framework to support JITed code ("just-in-time (JIT) compiled code"). A development library is provided to allow developers to add support for any VM (virtual machine) that produces dynamically compiled code (see the OProfile JIT agent developer guide). In addition, built-in support is included for the following:

  • JVMTI agent library for Java (1.5 and higher)
  • JVMPI agent library for Java (1.5 and lower)

These libraries make it possible for OProfile to attribute profile samples to Java methods. Without a VM-specific agent library, OProfile will typically report samples from JITed code similar to the following example:

     anon: <tgid><address range>

For information on how to use OProfile's JIT support, see Section 2, “Setting up the JIT profiling feature”.

4.2. No support for virtual machine guests

OProfile currently does not support event-based profiling (i.e., using hardware events like cache misses, branch mispredicts) on virtual machine guests running under systems such as VMware. (Note: KVM guests are supported.) The list of supported events displayed by ophelp is based on CPU type and does not take into account whether the running system is a guest system or a real system. To use OProfile on such guest systems, you must use the legacy profiler's timer mode (see Section 3.2, “OProfile timer interrupt mode”).

5. System requirements

Linux kernel

Release 2.6.31 or higher

Supported architectures

AMD, ARM, Intel, PowerPC, Tile, MIPS

Required libraries

These libraries are required: popt, bfd, liberty (Debian users: libiberty is provided in the binutils-dev package), dl, plus the standard C++ libraries.

Required kernel headers

Either install the kernel-headers package or use the --with-kernel configure option.

Required user account

For secure processing of sample data from JIT virtual machines (e.g., Java), the special user account "oprofile" must exist on the system. The 'configure' and 'make install' operations will print warning messages if this account is not found. If you intend to profile JITed code, you must create a group account named 'oprofile' and then create the 'oprofile' user account, setting the default group to 'oprofile'. A runtime error message is printed to the oprofile log when processing JIT samples if this special user account cannot be found.

ELF

Probably not too strenuous a requirement, but older A.OUT binaries/libraries are not supported.

K&R coding style

OK, so it's not really a requirement, but I wish it was...

6. Internet resources

Web page

There is a web page (which you may be reading now) at http://oprofile.sf.net/.

Download

You can download a source tarball or check out code from the code repository at the sourceforge page, http://sf.net/projects/oprofile/.

Mailing list

There is a low-traffic OProfile-specific mailing list, details at http://sf.net/mail/?group_id=16191.

Bug tracker

There is a bug tracker for OProfile at SourceForge, http://sourceforge.net/p/oprofile/bugs/.

IRC channel

Several OProfile developers and users sometimes hang out on channel #oprofile on the OFTC network.

7. Installation

First you need to build OProfile and install it. ./configure, make, make install is often all you need, but note these arguments to ./configure:

--with-java

Use this option if you need to profile Java applications. Also, see Section 5, “System requirements”, "Required user account". This option is used to specify the location of the Java Development Kit (JDK) source tree you wish to use. This is necessary to get the interface description of the JVMPI (or JVMTI) interface to compile the JIT support code successfully.

Note

The Java Runtime Environment (JRE) does not include the development files that are required to compile the JIT support code, so the full JDK must be installed in order to use this option.

By default, the OProfile JIT support libraries will be installed in <oprof_install_dir>/lib/oprofile. To build and install OProfile and the JIT support libraries as 64-bit, you can do something like the following:

# CFLAGS="-m64" CXXFLAGS="-m64" ./configure \
    --with-java={my_jdk_installdir} \
    --libdir=/usr/local/lib64

Note

If you encounter errors building 64-bit, you should install libtool 1.5.26 or later since that release of libtool fixes known problems for certain platforms. If you install libtool into a non-standard location, you'll need to edit the invocation of 'aclocal' in OProfile's autogen.sh as follows (assume an install location of /usr/local):

aclocal -I m4 -I /usr/local/share/aclocal

--disable-werror

Development versions of OProfile build by default with -Werror. This option turns -Werror off.

--disable-optimization

Disable the -O2 compiler flag (useful if you discover an OProfile bug and want to provide a usable back-trace).

--with-kernel

This option is used to specify the location of the kernel headers include directory needed to build the perf_events-enabled operf program. By default, the OProfile build system expects to find this directory under /usr. Use this option if your kernel headers are in a non-standard location, if you are building in a cross-compile environment, or if the host system does not support perf_events but you wish to build binaries for a target system that does.

If you have a uniprocessor machine, it is recommended that you enable local APIC / IO_APIC support in your kernel (this is automatically enabled for SMP kernels). With many BIOSes (when running a uniprocessor kernel, version 2.6.9 or later), enabling the local APIC in the kernel is not sufficient -- you must also turn it on explicitly at boot time by providing the "lapic" option to the kernel. If you use the NMI watchdog, be aware that the watchdog is disabled when profiling starts and not re-enabled until profiling is stopped.

8. Uninstalling OProfile

You must have the source tree available to uninstall OProfile; a make uninstall will remove all installed files except your configuration file in the directory ~/.oprofile.

Chapter 2. Overview

1. Getting started with OProfile using operf

Profiling with operf allows you to precisely target your profiling (i.e., single process or system-wide). With operf, there is no initial setup needed -- simply invoke operf with the options you need; then run the OProfile post-processing tool(s). The operf syntax is as follows:

operf [ options ] [ --system-wide | --pid=<PID> | [ command [ args ] ] ]

A typical usage might look like this:

operf ./my_test_program my_arg

When ./my_test_program completes (or when you press Ctrl-C), profiling stops and you're ready to use opreport or other OProfile post-processing tools. By default, operf stores the sample data in <cur_dir>/oprofile_data/samples/current, and opreport and other post-processing tools will look in that location first for profile data, unless you pass the --session-dir option.
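As a small illustration of the default storage location described above, the following sketch computes where sample data would be expected for a given session directory and checks whether it exists. The helper names samples_dir and have_profile are hypothetical and not part of OProfile; the only assumption about OProfile itself is the <session_dir>/samples/current layout stated above.

```shell
# Print the directory where operf is expected to store samples.
# $1: optional session directory (the value you would pass as
#     --session-dir); defaults to oprofile_data in the current directory.
samples_dir() {
    printf '%s\n' "${1:-$PWD/oprofile_data}/samples/current"
}

# Succeed if sample data exists for the given (or default) session directory.
have_profile() {
    [ -d "$(samples_dir "$@")" ]
}

# Typical check after "operf ./my_test_program my_arg" has finished,
# before running opreport from the same directory:
if have_profile; then
    echo "profile data found in $(samples_dir)"
else
    echo "no profile data in $(samples_dir); run operf first"
fi
```

A check like this can be useful in scripts that run the post-processing tools from a different working directory than the one in which operf was started.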

2. Getting started with OProfile using ocount

ocount is an OProfile tool that can be used to count native hardware events occurring in either a specific application, a set of processes or threads, a set of active system processors, or the entire system. The data collected during a counting session is displayed to stdout by default, but may also be saved to a file. The ocount syntax is as follows:

ocount [ options ] [ --system-wide | --process-list <pids> | --thread-list <tids> | --cpu-list <cpus> [ command [ args ] ] ]

A typical usage might look like this:

ocount --events=CPU_CLK_UNHALTED,INST_RETIRED /home/user1/my_test_program my_arg

When my_test_program completes (or when you press Ctrl-C), counting stops and the results are displayed to the screen (as shown below).

Events were actively counted for 2.8 seconds.
Event counts (actual) for /home/user1/my_test_program:
	Event                   Count                    % time counted
	CPU_CLK_UNHALTED        9,408,018,070            100.00
	INST_RETIRED            16,719,918,108           100.00
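The CPI (cycles per instruction) figure mentioned in the introduction can be derived from output like the above. The following is a minimal sketch, assuming the two event names used in this example and the column layout shown (event name, count with thousands separators, % time counted); the function name cpi_from_ocount is hypothetical, and event names differ between processor types.

```shell
# Compute cycles per instruction (CPI) from ocount output read on stdin.
# Assumes the events CPU_CLK_UNHALTED and INST_RETIRED were counted.
cpi_from_ocount() {
    awk '
        # Strip the thousands separators, then force a numeric value.
        $1 == "CPU_CLK_UNHALTED" { gsub(/,/, "", $2); cycles = $2 + 0 }
        $1 == "INST_RETIRED"     { gsub(/,/, "", $2); insns  = $2 + 0 }
        END {
            if (insns > 0)
                printf "%.2f\n", cycles / insns
            else
                exit 1    # no instruction count found
        }
    '
}

cpi_from_ocount <<'EOF'
Events were actively counted for 2.8 seconds.
Event counts (actual) for /home/user1/my_test_program:
    Event                   Count                    % time counted
    CPU_CLK_UNHALTED        9,408,018,070            100.00
    INST_RETIRED            16,719,918,108           100.00
EOF
# prints 0.56
```

For the counts shown, 9,408,018,070 cycles divided by 16,719,918,108 instructions gives a CPI of about 0.56.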

3. Specifying performance counter events

Whether profiling with operf or doing simple event counting with ocount, you can collect information about one or more native hardware events using the --events option, which takes a comma-separated list of event specifications. The event specification is the means to provide details of how each hardware performance counter should be set up. For profiling, the event specification is a colon-separated string of the form name:count:unitmask:kernel:user as described in the table below. For ocount, the specification is of the form name:unitmask:kernel:user. Note the presence of the count field for profiling. The count field tells the profiler how many events should occur between profile snapshots (usually referred to as "samples"). Since ocount does not do sampling, the count field is not needed.

If no event specs are passed to operf or ocount, the default event will be used.

Note

The perf_events kernel subsystem allocates hardware counters as necessary, but some processor types have restrictions as to what hardware events may be counted simultaneously. The kernel employs a multiplexing technique when such hardware restrictions are encountered, such that events are monitored on a rotating basis.

name      The symbolic event name, e.g. CPU_CLK_UNHALTED
count     The counter reset value, e.g. 100000; use only for profiling
unitmask  The unit mask, as given in the events list: e.g. 0x0f; or a symbolic name if a name=<um_name> field is present
kernel    Enable profiling of kernel code
user      Enable profiling of userspace code

The last three values are optional; if you omit them (e.g. operf --events=DATA_MEM_REFS:30000), they will be set to the default values (i.e., the default unit mask value for the given event, and profiling (or counting) both kernel and userspace code will be enabled). Note that on some architectures, some events may require a unit mask be specified.

You can specify unit mask values using either a numerical value (hex values must begin with "0x") or a symbolic name (if the name=<um_name> field is shown in the ophelp output). For some named unit masks, the hex value is not unique; for these, the OProfile tools require that the unit mask be specified by name.
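The colon-separated form of a profiling event specification can be assembled mechanically. The helper below is a hypothetical sketch (event_spec is not an OProfile command); it simply joins the fields in the order name, count, unitmask, kernel, user and leaves off any trailing fields that are not supplied, so that the tools apply their defaults.

```shell
# Build a profiling event specification of the form
# name:count[:unitmask[:kernel[:user]]] for use with operf --events.
event_spec() {
    spec=$1:$2             # name and count are always given for profiling
    shift 2
    for field in "$@"; do  # optional unitmask, kernel, user, in that order
        spec=$spec:$field
    done
    printf '%s\n' "$spec"
}

event_spec DATA_MEM_REFS 30000             # prints DATA_MEM_REFS:30000
event_spec CPU_CLK_UNHALTED 100000 0 1 0   # prints CPU_CLK_UNHALTED:100000:0:1:0
```

Several specifications can then be joined with commas, for example operf --events="$(event_spec CPU_CLK_UNHALTED 100000),$(event_spec DATA_MEM_REFS 30000)" ./my_test_program. Whether a given event name is available depends on the processor type; use ophelp to check.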

The table below lists the default profiling event for various processor types. The same events can be used for ocount, minus the count field.

Processor            cpu_type                        Default event
Alpha EV67           alpha/ev67                      CYCLES:100000:0:1:1
ARM/XScale PMU1      arm/xscale1                     CPU_CYCLES:100000:0:1:1
ARM/XScale PMU2      arm/xscale2                     CPU_CYCLES:100000:0:1:1
ARM/MPCore           arm/mpcore                      CPU_CYCLES:100000:0:1:1
Athlon               i386/athlon                     CPU_CLK_UNHALTED:100000:0:1:1
Pentium Pro          i386/ppro                       CPU_CLK_UNHALTED:100000:0:1:1
Pentium II           i386/pii                        CPU_CLK_UNHALTED:100000:0:1:1
Pentium III          i386/piii                       CPU_CLK_UNHALTED:100000:0:1:1
Pentium M (P6 core)  i386/p6_mobile                  CPU_CLK_UNHALTED:100000:0:1:1
Pentium 4 (non-HT)   i386/p4                         GLOBAL_POWER_EVENTS:100000:1:1:1
Pentium 4 (HT)       i386/p4-ht                      GLOBAL_POWER_EVENTS:100000:1:1:1
Hammer               x86-64/hammer                   CPU_CLK_UNHALTED:100000:0:1:1
Family10h            x86-64/family10                 CPU_CLK_UNHALTED:100000:0:1:1
Family11h            x86-64/family11h                CPU_CLK_UNHALTED:100000:0:1:1
IBM pseries          ppc64/power{ 4|5|6|7|8|9|970 }  CYCLES:100000:0:1:1
IBM s390             s390/{ z10|z196|zEC12 }         HWSAMPLING:4127518:0:1:1

4. Tools summary

This section gives a brief description of the available OProfile utilities and their purpose.

ophelp

This utility lists the available events and short descriptions.

operf

This is the program for collecting profile data, discussed in Section 1, “Using operf”.

ocount

This tool is used for simple event counting, as described in Section 1, “Using ocount”.

agent libraries

Used by virtual machines (like the Java VM) to record information about JITed code being profiled. See Section 2, “Setting up the JIT profiling feature”.

opreport

This is the main tool for retrieving useful profile data, described in Section 2, “Image summaries and symbol summaries (opreport)”.

opannotate

This utility can be used to produce annotated source, assembly or mixed source/assembly. Source level annotation is available only if the application was compiled with debugging symbols. See Section 3, “Outputting annotated source (opannotate)”.

opgprof

This utility can output gprof-style data files for a binary, for use with gprof -p. See Section 5, “gprof-compatible output (opgprof)”.

oparchive

This utility can be used to collect executables, debuginfo, and sample files and copy the files into an archive. The archive is self-contained and can be moved to another machine for further analysis. See Section 6, “Analyzing profile data on another system (oparchive)”.

opimport

This utility converts sample database files from a foreign binary format (abi) to the native format. This is useful only when moving sample files between hosts for analysis on platforms other than the one used for collection. See Section 7, “Converting sample database files (opimport)”.

Chapter 3. Controlling the profiler

1. Using operf

This section describes in detail how operf is used to control profiling. Unless otherwise directed, operf will profile using the default event for your system. For most systems, the default event is some cycles-based event, assuming your processor type supports hardware performance counters. If your hardware does support performance counters, you can specify something other than the default hardware event on which to profile. The performance monitor counters can be programmed to count various hardware events, such as cache misses or MMX operations. The event chosen for each counter is reflected in the profile data collected by OProfile: functions and binaries at the top of the profiles reflect that most of the chosen events happened within that code.

Additionally, each counter is programmed with a "count" value, which corresponds to how detailed the profile is. The lower the value, the more frequently profile samples are taken. You can choose to sample only kernel code, user-space code, or both (both is the default). Finally, some events have a "unit mask" -- this is a value that further restricts the type of event being counted. You can see the event types and unit masks for your CPU using ophelp. More information on event specification can be found at Section 3, “Specifying performance counter events”.

The operf command syntax is:

operf [ options ] [ --system-wide | --pid=<PID> | [ command [ args ] ] ]

When profiling an application using either the command or --pid option of operf, forks and execs of the profiled process will also be profiled. The samples from an exec'ed process will be attributed to the executable binary run by that process. See Section 7, “Using operf to profile fork/execs”.

Following is a description of the operf options.

command [args]

The command or application to be profiled. The [args] are the input arguments that the command or application requires. Exactly one of command, --pid, or --system-wide is required; they cannot be used simultaneously.

--pid / -p [PID]

This option enables operf to profile a running application. PID should be the process ID of the process you wish to profile. When finished profiling (e.g., when the profiled process ends), press Ctrl-C to stop operf.

--system-wide / -s

This option is for performing a system-wide profile. You must have root authority to run operf in this mode. When finished profiling, press Ctrl-C to stop operf. If you run operf --system-wide as a background job (i.e., with the &), you must stop it in a controlled manner in order to process the profile data it has collected. Use kill -SIGINT <operf-PID> for this purpose. When running operf with this option, it is recommended that your current working directory be /root or a subdirectory of /root to avoid storing sample data files in locations accessible by regular users.

--vmlinux / -k [vmlinux_path]

A vmlinux file that matches the running kernel and that contains symbol and/or debug information. Kernel samples will be attributed to this binary, allowing post-processing tools (like opreport) to attribute samples to the appropriate kernel symbols. If this option is not specified, the file /proc/kallsyms is used to obtain kernel symbol addresses corresponding to sample addresses. However, the setting of /proc/sys/kernel/kptr_restrict may restrict a non-root user's access to /proc/kallsyms, in which case all kernel samples are attributed to a pseudo binary named "no-vmlinux".

--callgraph / -g

This option enables the callgraph to be saved during profiling. NOTE: The full callchain is recorded, so there is no depth limit.

--append / -a

By default, operf moves old profile data from <session_dir>/samples/current to <session_dir>/samples/previous. If a 'previous' profile already existed, it will be replaced. If the --append option is passed, old profile data in 'current' is left in place and new profile data will be added to it, and the 'previous' profile (if one existed) will remain untouched. To access the 'previous' profile, simply add a session specification to the normal invocation of oprofile post-processing tools; for example:

opreport session:previous

--events / -e [event1[,event2[,...]]]

This option is for passing a comma-separated list of event specifications for profiling. Each event spec is of the form:

name:count[:unitmask[:kernel[:user]]]

When no event specification is given, the default event for the running processor type will be used for profiling. Use ophelp to list the available events for your processor type.

--separate-thread / -t

This option categorizes samples by thread group ID (tgid) and thread ID (tid). The --separate-thread option is useful for seeing per-thread samples in multi-threaded applications. When used in conjunction with the --system-wide option, --separate-thread is also useful for seeing per-process (i.e., per-thread group) samples for the case where multiple processes are executing the same program during a profiling run.

--separate-cpu / -c

This option categorizes samples by cpu.

--session-dir / -d [path]

This option specifies the session directory to hold the sample data. If not specified, the data is saved in the oprofile_data directory on the current path.

--lazy-conversion / -l

Use this option to reduce the overhead of operf during profiling. Normally, profile data received from the kernel is converted to OProfile format during profiling time. This is typically not an issue when profiling a single application. But when using the --system-wide option, this on-the-fly conversion process can cause noticeable overhead, particularly on busy multi-processor systems. The --lazy-conversion option directs operf to wait until profiling is completed to do the conversion of profile data.

--verbose / -V [level]

A comma-separated list of debugging control values used to increase the verbosity of the output. Valid values are: debug, record, convert, misc, sfile, arcs, and the special value, 'all'.

--version / -v

Show operf version.

--help / -h

Show a help message.

2. Setting up the JIT profiling feature

To gather information about JITed code from a virtual machine, the virtual machine needs to be instrumented with an agent library. The following example uses the agent libraries for Java. To use the Java profiling feature, you must build OProfile with the "--with-java" option (see Section 7, “Installation”).

2.1. JVM instrumentation

Add this to the startup parameters of the JVM (for JVMTI):

-agentpath:<libdir>/libjvmti_oprofile.so[=<options>] 

or

-agentlib:jvmti_oprofile[=<options>] 

The JVMPI agent implementation is enabled with the command line option

-Xrunjvmpi_oprofile[:<options>] 

Currently, there is just one option available -- debug. For JVMPI, the convention for specifying an option is option_name=[yes|no]. For JVMTI, the option specification is simply the option name, implying "yes"; no option specified implies "no".

The agent library (installed in <oprof_install_dir>/lib/oprofile) needs to be in the library search path (e.g. add the library directory to LD_LIBRARY_PATH). If the command line of the JVM is not accessible, it may be buried within shell scripts or a launcher program. It may also be possible to set an environment variable to add the instrumentation. For Sun JVMs this is JAVA_TOOL_OPTIONS. Please check your JVM documentation for further information on the agent startup options.
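The differing option conventions for the two agent interfaces can be captured in a small helper. This is a sketch only: jit_agent_flag is a hypothetical function name, and it covers just the -agentlib and -Xrunjvmpi_oprofile forms and the single debug option described above.

```shell
# Print the JVM startup flag for OProfile's JIT agent.
# $1: "jvmti" or "jvmpi"
# $2: "yes" to enable the debug option (default "no")
jit_agent_flag() {
    api=$1
    debug=${2:-no}
    case "$api" in
        jvmti)
            # JVMTI: the bare option name implies "yes"; no option implies "no".
            if [ "$debug" = yes ]; then
                echo "-agentlib:jvmti_oprofile=debug"
            else
                echo "-agentlib:jvmti_oprofile"
            fi
            ;;
        jvmpi)
            # JVMPI: options are written as option_name=yes or option_name=no.
            echo "-Xrunjvmpi_oprofile:debug=$debug"
            ;;
        *)
            echo "unknown agent interface: $api" >&2
            return 1
            ;;
    esac
}

jit_agent_flag jvmti yes   # prints -agentlib:jvmti_oprofile=debug
jit_agent_flag jvmpi yes   # prints -Xrunjvmpi_oprofile:debug=yes
```

The resulting flag can be placed on the java command line or, where the command line is not accessible, in an environment variable such as JAVA_TOOL_OPTIONS if your JVM honors it.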

3. Configuration details

3.1. Hardware performance counters

Most processor models include performance monitor units that can be configured to monitor (count) various types of hardware events. This section is where you can find architecture-specific information to help you use these events for profiling. You do not really need to read this section unless you are interested in using events other than the default event chosen by OProfile.

Note

Your CPU type may not include the requisite support for hardware performance counters, in which case you must use OProfile in timer mode (see Section 3.2, “OProfile timer interrupt mode”), which is only available in OProfile releases prior to 1.0.

The Intel hardware performance counters are detailed in the Intel IA-32 Architecture Manual, Volume 3, available from http://developer.intel.com/. The AMD Athlon/Opteron/Phenom/Turion implementation is detailed in http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf. For IBM PowerPC processors, documentation is available at https://www.power.org/. For example, https://www.power.org/events/Power7 contains specific information on the performance monitor unit for the IBM POWER7.

A physical performance monitor counter (PMC) is configured by a profiling tool to count a particular type of event. When the counter overflows, an interrupt is delivered to the processor. This is the basic mechanism on which OProfile is based. The delivery mode is NMI, so blocking interrupts in the kernel does not prevent profiling. When the interrupt handler is called, the current PC (program counter) value and the current task are recorded into the profiling structure. This allows the overflow event to be attributed to a specific assembly instruction in a specific binary image. OProfile receives this data (commonly referred to as a "sample") from the kernel and writes it to the sample files.

If we use an event such as CPU_CLK_UNHALTED or INST_RETIRED (GLOBAL_POWER_EVENTS or INSTR_RETIRED, respectively, on the Pentium 4), we can use the overflow counts (samples) as an estimate of actual time spent in each part of code. Alternatively we can profile interesting data such as the cache behaviour of routines with the other available counters.

However there are several caveats. First, there are those issues listed in the Intel manual. There is a delay between the counter overflow and the interrupt delivery that can skew results on a small scale - this means you cannot rely on the profiles at the instruction level as being perfectly accurate. For example, if you are profiling an application with an event that counts L1 cache misses, a sample attributed to a particular instruction in the application doesn't necessarily mean that exact instruction is responsible for that event; instead, it means the sample was taken in the dynamic vicinity of that instruction, usually with a margin of error of a few instructions. Further details on this problem can be found in Chapter 5, Interpreting profiling results and also in the Digital paper "ProfileMe: A Hardware Performance Counter".

Each counter has several configuration parameters besides the type of event to count. First, there is the unit mask, which is used to further qualify exactly what to count. Second, there is the count field, discussed below. Third, there are parameters to specify whether to increment counts whilst in kernel or user space. You can configure these separately for each counter.

When the profiler is initially set up, a performance monitor counter is chosen for counting the event, and it is initialized using the count value. Once profiling begins, the counter increments with each event detected, and the counter overflows when the count value is reached. As described above, the counter overflow generates an interrupt, and the sample is recorded. After each overflow event, the counter is re-initialized using the count value, and counting begins anew for the next sample. Higher values for count result in samples being taken less frequently, and therefore in less-detailed (and, potentially, less accurate) profiling. Lower values mean more detail, but higher overhead. Picking a good value for this parameter is, unfortunately, somewhat of a black art. It is of course dependent on the event you have chosen. Specifying too large a value will mean not enough interrupts are generated to give a realistic profile (though this problem can be ameliorated by profiling for longer time periods). Specifying too small a value can lead to higher performance overhead.
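As a rough aid for choosing a count value, the expected sampling rate is simply the rate at which the event occurs divided by the count. The sketch below assumes a cycles-based event on a fully busy core clocked at 2 GHz, i.e. about 2,000,000,000 events per second; that figure and the function name samples_per_second are illustrative assumptions, and real event rates vary with the event and the workload.

```shell
# Estimate samples taken per second on one busy CPU.
# $1: events per second (assumed; 2000000000 for cycles on a 2 GHz core)
# $2: the count value from the event specification
samples_per_second() {
    echo $(( $1 / $2 ))
}

samples_per_second 2000000000 100000    # prints 20000
samples_per_second 2000000000 1000000   # prints 2000
```

Increasing the count tenfold cuts the number of interrupts tenfold, which lowers overhead but means a longer run is needed to collect the same number of samples.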

3.2. OProfile timer interrupt mode

Some CPU types do not provide the needed hardware support for hardware performance counters. Additionally, some older architectures are not supported by the perf_events kernel subsystem. On such machines, the operf and ocount commands will exit with a message indicating the processor type is not supported. However, you can install OProfile 0.9.9 and use the legacy opcontrol-based profiler, which will fall back to using timer interrupts for profiling. Note that in timer mode, OProfile is not able to profile code that has interrupts disabled.

Note

Timer mode is only available using the legacy opcontrol command, available in releases prior to 1.0.

3.3. Architecture-specific configuration notes

3.3.1. Pentium 4 support

The Pentium 4 / Xeon performance counters are organized around 3 types of model specific registers (MSRs): 45 event selection control registers (ESCRs), 18 counter configuration control registers (CCCRs) and 18 counters. ESCRs describe a particular set of events which are to be recorded, and CCCRs bind ESCRs to counters and configure their operation. Unfortunately the relationship between these registers is quite complex; they cannot all be used with one another at any time. There is, however, a subset of 8 counters, 8 ESCRs, and 8 CCCRs which can be used independently of one another, so OProfile only accesses those registers, treating them as a bank of 8 "normal" counters, similar to those in the P6 or Athlon/Opteron/Phenom/Turion families of CPU.

There is currently no support for Precision Event-Based Sampling (PEBS), nor any advanced uses of the Debug Store (DS). Current support is limited to the conservative extension of OProfile's existing interrupt-based model described above.

3.3.2. PowerPC64 support

The performance monitoring unit (PMU) for the IBM PowerPC 64-bit processors consists of between 4 and 8 counters (depending on the model). Advanced features such as instruction matching and thresholding are not supported by OProfile.

Chapter 4. Obtaining profiling results

After collecting profile data, the raw data must undergo special processing in order for you to perform your analysis. The analysis tools that perform this special processing are opreport, opannotate, and opgprof. Additionally, the oparchive utility is used to gather together profile data, sampled binary files, etc. for the purpose of off-line analysis. While not really an analysis tool, oparchive is put in that category for convenience since it takes many of the same options as the other analysis tools.

1. Profile specifications

All of the analysis tools take a profile specification as an input argument. This is a set of definitions that describes the specific profile data that should be examined. The simplest profile specification is empty: this will match all the available profile files for the current session.

Specification parameters are of the form name:value[,value]. For example, if I wanted to get a combined symbol summary for /bin/myprog and /bin/myprog2, I could do opreport -l image:/bin/myprog,/bin/myprog2. As a special case, you don't actually need to specify the image: part of the specification. Anything left on the command line after all other opreport options have been processed is assumed to be an image: name. Similarly, if no session: is specified, then session:current is assumed ("current" is a special name of the current (i.e., most recent) profiling session).

In addition to the comma-separated list shown above, some of the specification parameters can take glob-style values. For example, if I want to see image summaries for all binaries profiled in /usr/bin/, I could do opreport image:/usr/bin/\*. Note the necessity to escape the special character from the shell.
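The rules above (name:value[,value] parameters, a bare word meaning image:, and session:current as the default) can be sketched in a few lines of Python. This is an illustration only and not how the OProfile tools are implemented; the function names are made up, and the comma-escaping behaviour assumes the backslash is still present in the word handed to the parser.

```python
import re

# Tag names taken from the parameter list in this chapter.
KNOWN_TAGS = {
    "archive", "session", "session-exclude", "image", "image-exclude",
    "lib-image", "lib-image-exclude", "event", "count", "unit-mask",
    "cpu", "tgid", "tid",
}


def split_values(text):
    """Split on commas that are not preceded by a backslash."""
    parts = re.split(r"(?<!\\),", text)
    # Turn each escaped comma back into a literal comma.
    return [part.replace("\\,", ",") for part in parts]


def parse_spec(args):
    """Turn a list of command-line words into {tag: [values]}."""
    spec = {}
    for arg in args:
        name, sep, value = arg.partition(":")
        if not sep or name not in KNOWN_TAGS:
            # No recognised tag: treat the whole word as an image name.
            name, value = "image", arg
        spec.setdefault(name, []).extend(split_values(value))
    # An absent session: tag means the current session.
    spec.setdefault("session", ["current"])
    return spec


print(parse_spec(["image:/bin/myprog,/bin/myprog2"]))
print(parse_spec(["session:stresstest", "event:DATA_MEM_REFS"]))
```

Glob-style values are kept as plain strings here; matching them against the available sample files is a separate step.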

For opreport, profile specifications can be used to define two profiles, giving differential output. This is done by enclosing each of the two specifications within curly braces, as shown in the examples below. Any specifications outside of curly braces are shared across both.

1.1. Examples

Image summaries for all profiles with DATA_MEM_REFS samples in the saved session called "stresstest" :

# opreport session:stresstest event:DATA_MEM_REFS

Symbol summary for the application called "test_sym53c8xx,9xx". Note the escaping is necessary as image: takes a comma-separated list.

# opreport -l ./test/test_sym53c8xx\,9xx

Image summaries for all binaries in the test directory, excepting boring-test :

# opreport image:./test/\* image-exclude:./test/boring-test

Differential profile of a binary stored in two archives :

# opreport -l /bin/bash { archive:./orig } { archive:./new }

Differential profile of an archived binary with the current session :

# opreport -l /bin/bash { archive:./orig } { }

1.2. Profile specification parameters

archive: archivepath

A path to an archive made with oparchive. Absence of this tag, unlike others, means "the current system", equivalent to specifying "archive:".

session: sessionlist

A comma-separated list of session names to resolve in. Absence of this tag, unlike others, means "the current session", equivalent to specifying "session:current".

session-exclude: sessionlist

A comma-separated list of sessions to exclude.

image: imagelist

A comma-separated list of image names to resolve. Each entry may be a relative path, a glob-style name, or a full path, e.g.

opreport 'image:/usr/bin/oprofiled,*op*,./opreport'

image-exclude: imagelist

Same as image:, but the matching images are excluded.

lib-image: imagelist

Same as image:, but only for images that are for a particular primary binary image (namely, an application).

lib-image-exclude: imagelist

Same as lib-image:, but the matching images are excluded.

event: eventlist

The symbolic event name to match on, e.g. event:DATA_MEM_REFS. You can pass a list of events for side-by-side comparison with opreport.

count: eventcountlist

The event count to match on, e.g. event:DATA_MEM_REFS count:30000. Note that this value refers to the count value in the event spec you passed to operf when setting up to do a profile run. It has nothing to do with the sample counts in the profile data itself. You can pass a list of count values for side-by-side comparison with opreport.

unit-mask: masklist

The unit mask value of the event to match on, e.g. unit-mask:1. You can pass a list of unit mask values for side-by-side comparison with opreport.

cpu: cpulist

Only consider profiles for the given numbered CPU (starting from zero). This is only useful when using CPU profile separation.

tgid: pidlist

Only consider profiles for the given task groups. Unless some program is using threads, the task group ID of a process is the same as its process ID. This option corresponds to the POSIX notion of a thread group. This is only useful when using per-process profile separation.

tid: tidlist

Only consider profiles for the given threads. When using recent thread libraries, all threads in a process share the same task group ID, but have different thread IDs. You can use this option in combination with tgid: to restrict the results to particular threads within a process. This is only useful when using per-process profile separation.

1.3. Locating and managing binary images

Each session's sample files can be found in the $SESSION_DIR/samples/ directory (default for operf is <cur_dir>/oprofile_data/samples/). These are used, along with the binary image files, to produce human-readable data. In some circumstances (e.g., kernel modules), OProfile will not be able to find the binary images. All the tools have an --image-path option to which you can pass a comma-separated list of alternate paths to search. For example, I can let OProfile find my 2.6 modules by using --image-path /lib/modules/2.6.0/kernel/. It is your responsibility to ensure that the correct images are found when using this option.
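If you are unsure whether a session contains any data at all, it can help to list what is under the samples directory before running the analysis tools. The Python sketch below only walks the directory tree; it assumes the operf default of oprofile_data in the current directory and makes no assumption about how the sample files are named or laid out.

```python
import os


def list_sample_files(session_dir="oprofile_data"):
    """Return the sorted paths of all files below <session_dir>/samples/."""
    samples_root = os.path.join(session_dir, "samples")
    found = []
    for dirpath, _dirnames, filenames in os.walk(samples_root):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return sorted(found)


if __name__ == "__main__":
    files = list_sample_files()
    if not files:
        print("no sample files found under oprofile_data/samples/")
    for path in files:
        print(path)
```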

Note that if a binary image changes after the sample file was created, you won't be able to get useful symbol-based data out. This situation is detected for you. If you replace a binary, you should make sure to save the old binary if you need to do comparative profiles.

1.4. What to do when you don't get any results

When attempting to get output, you may see the error :

error: no sample files found: profile specification too strict ?

What this is saying is that the profile specification you passed in, when matched against the available sample files, resulted in no matches. There are a number of reasons this might happen:

spelling

You specified a binary name, but spelt it wrongly. Check your spelling !

profiler wasn't running

Make very sure that OProfile was actually up and running when you ran the application you wish to profile.

application didn't run long enough

Remember OProfile is a statistical profiler - you're not guaranteed to get samples for short-running programs. You can help this by using a lower count for the performance counter, so there are a lot more samples taken per second.

application spent most of its time in libraries

Similarly, if the application spends little time in the main binary image itself, with most of it spent in shared libraries it uses, you might not see any samples for the binary image (i.e., executable) itself.

specification was really too strict

For example, you specified something like tgid:3433, but no task with that group ID ever ran the code.

application didn't generate any events

If you're profiling a particular event, for example counting MMX operations, the code might simply have not generated any events in the first place. Verify the code you're profiling does what you expect it to.

you didn't specify kernel module name correctly

If you're trying to get reports for a kernel module, make sure to use the -p option, and specify the module name with the .ko extension. Check if the module is one loaded from initrd.

2. Image summaries and symbol summaries (opreport)

The opreport utility is the primary utility you will use for getting formatted data out of OProfile. It produces two types of data: image summaries and symbol summaries. An image summary lists the number of samples for individual binary images such as libraries or applications. Symbol summaries provide per-symbol profile data. In the following truncated example, we see an image summary for the whole system:

$ opreport --long-filenames
CPU: Intel Sandy Bridge microarchitecture, speed 2401 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
CPU_CLK_UNHALT...|
  samples|      %|
------------------
    22577 28.9011 /usr/bin/Xorg
        CPU_CLK_UNHALT...|
          samples|      %|
        ------------------
            16846 74.6158 /proc/kallsyms
             2126  9.4167 /usr/bin/Xorg
              763  3.3795 /usr/lib64/libpixman-1.so.0.26.2
              ...
    17402 22.2766 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.55.x86_64/jre/bin/java
        CPU_CLK_UNHALT...|
          samples|      %|
        ------------------
             5666 32.5595 anon (tgid:29664 range:0x7f3475000000-0x7f347616ffff)
             2312 13.2858 /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.55.x86_64/jre/lib/amd64/server/libjvm.so
             ...
    11554 14.7904 /home/user1/oprof-install/bin/operf
        CPU_CLK_UNHALT...|
          samples|      %|
        ------------------
             7467 64.6270 /proc/kallsyms
             1691 14.6356 /usr/bin/operf
             1324 11.4592 /lib64/libc-2.12.so
              455  3.9380 /usr/lib64/libstdc++.so.6.0.13
              315  2.7263 /ext4
              ...
    ...

If we had specified --symbols in the previous command, we would have gotten a symbol summary of all the images across the entire system. We can restrict this to only part of the system profile; for example, below is a symbol summary for the operf program used to collect the profile.

$ opreport -l -p /lib/modules/`uname -r` `which operf` 2>/dev/null | more
CPU: Intel Sandy Bridge microarchitecture, speed 2401 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        image name               symbol name
860       7.4607  kallsyms                 avtab_search_node
474       4.1121  operf                    OP_perf_utils::op_write_event(event_union*, unsigned long long)
461       3.9993  kallsyms                 avc_has_perm_noaudit
455       3.9473  libstdc++.so.6.0.13      /usr/lib64/libstdc++.so.6.0.13
412       3.5742  libc-2.12.so             _IO_vfscanf
369       3.2012  kallsyms                 __d_lookup
350       3.0363  kallsyms                 sidtab_context_to_sid
274       2.3770  operf                    OP_perf_utils::op_record_process_exec_mmaps(int, int, int, operf_record*)
232       2.0127  operf                    operf_process_info::find_mapping_for_sample(unsigned long long, bool)
222       1.9259  kallsyms                 __link_path_walk
191       1.6570  kallsyms                 pipe_read
34        0.2950  ext4.ko                  ext4_mark_iloc_dirty
...

These are the two basic ways you are most likely to use regularly, but opreport can do a lot more than that, as described below.
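If you want to post-process a symbol summary in a script, the text output is regular enough to split on whitespace. The following Python sketch is one way to do that for single-event output such as the listing above. It is not an OProfile tool, and it assumes that image names contain no spaces, so rows for anonymous mappings ("anon (tgid:...)") would need extra handling.

```python
def parse_symbol_summary(text):
    """Parse 'samples % image symbol' rows into a list of dicts.

    Lines that do not start with an integer (the CPU and event header,
    the column titles, a trailing '...') are skipped.
    """
    rows = []
    for line in text.splitlines():
        # At most four fields, so symbol names may contain spaces.
        fields = line.split(None, 3)
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        samples, percent, image, symbol = fields
        rows.append({
            "samples": int(samples),
            "percent": float(percent),
            "image": image,
            "symbol": symbol.strip(),
        })
    return rows


example = """\
samples  %        image name               symbol name
860       7.4607  kallsyms                 avtab_search_node
474       4.1121  operf                    OP_perf_utils::op_write_event(event_union*, unsigned long long)
...
"""
for row in parse_symbol_summary(example):
    print(row["samples"], row["image"], row["symbol"])
```

For anything beyond quick scripts, the --xml output described later in this chapter is likely to be a more robust input format.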

2.1. Merging separate profiles

If you have used one of the --separate[*] options whilst profiling, there can be several separate profiles for a single binary image within a session. Normally the output will keep these images separated. So, for example, if you profiled with separation on a per-cpu basis (operf --separate-cpu), you would see separate columns in the output of opreport for each CPU where samples were recorded. But it can be useful to merge these results back together to make the report more readable. The --merge option allows you to do that.

2.2. Side-by-side multiple results

If you have used multiple events when profiling, by default you get side-by-side results of each event's sample values from opreport. You can restrict which events to list by appropriate use of the event: profile specifications, etc.

2.3. Callgraph output

This section provides details on how to use the OProfile callgraph feature.

2.3.1. Callgraph details

When using the --callgraph option, you can see what functions are calling other functions in the output. Consider the following program:

#include <string.h>
#include <stdlib.h>
#include <stdio.h>

#define SIZE 500000

static int compare(const void *s1, const void *s2)
{
        return strcmp(s1, s2);
}

static void repeat(void)
{
        int i;
        char *strings[SIZE];
        char str[] = "abcdefghijklmnopqrstuvwxyz";

        for (i = 0; i < SIZE; ++i) {
                strings[i] = strdup(str);
                strfry(strings[i]);
        }

        qsort(strings, SIZE, sizeof(char *), compare);
}

int main()
{
        while (1)
                repeat();
}

When running with the call-graph option, OProfile will record the function stack every time it takes a sample. opreport --callgraph outputs an entry for each function, where each entry looks similar to:

samples  %        image name               symbol name
  197       0.1548  cg                       main
  127036   99.8452  cg                       repeat
84590    42.5084  libc-2.3.2.so            strfry
  84590    66.4838  libc-2.3.2.so            strfry [self]
  39169    30.7850  libc-2.3.2.so            random_r
  3475      2.7312  libc-2.3.2.so            __i686.get_pc_thunk.bx
-------------------------------------------------------------------------------

Here the non-indented line is the function we're focussing upon (strfry()). This line is the same as you'd get from a normal opreport output.

Above the non-indented line we find the functions that called this function (for example, repeat() calls strfry()). The samples and percentage values here refer to the number of times we took a sample where this call was found in the stack; the percentage is relative to all other callers of the function we're focussing on. Note that these values are not call counts; they only reflect the call stack every time a sample is taken; that is, if a call is found in the stack at the time of a sample, it is recorded in this count.

Below the line are functions that are called by strfry() (called callees). It's clear here that strfry() calls random_r(). We also see a special entry with a "[self]" marker. This records the normal samples for the function, but the percentage becomes relative to all callees. This allows you to compare time spent in the function itself compared to functions it calls. Note that if a function calls itself, then it will appear in the list of callees of itself, but without the "[self]" marker; so recursive calls are still clearly separable.
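The percentages in the strfry() entry can be checked by hand: within the caller group and within the callee group, each percentage is that line's share of the group's total samples. The short Python sketch below (an illustration, not OProfile code) reproduces the figures from the example.

```python
def relative_percentages(entries):
    """Map each (name, samples) pair to its share of the group total."""
    total = sum(samples for _name, samples in entries)
    if total == 0:
        return {}
    return {name: 100.0 * samples / total for name, samples in entries}


# Figures copied from the strfry() entry shown above.
callers = [("main", 197), ("repeat", 127036)]
callees = [("strfry [self]", 84590), ("random_r", 39169),
           ("__i686.get_pc_thunk.bx", 3475)]

for name, pct in relative_percentages(callers).items():
    print(f"caller {name:<24} {pct:8.4f}")
for name, pct in relative_percentages(callees).items():
    print(f"callee {name:<24} {pct:8.4f}")
```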

You may have noticed that the output lists main() as calling strfry(), but it's clear from the source that this doesn't actually happen. See Section 3, “Interpreting call-graph profiles” for an explanation.

2.3.2. Callgraph is not supported with JIT samples

Callgraph output where anonymously mapped code is in the callstack can sometimes be misleading. For all such code, the samples for the anonymously mapped code are stored in a samples subdirectory named {anon:anon}/<tgid>.<begin_addr>.<end_addr>. As stated earlier, if this anonymously mapped code is JITed code from a supported VM like Java, OProfile creates an ELF file to provide a (somewhat) permanent backing file for the code. However, when viewing callgraph output, any anonymously mapped code in the callstack will be attributed to anon (tgid:<tgid> range:<begin_addr>-<end_addr>), even if a .jo ELF file had been created for it. See the example below.

-------------------------------------------------------------------------------
  1         2.2727  libj9ute23.so            java.bin                 traceV
  2         4.5455  libj9ute23.so            java.bin                 utsTraceV
  4         9.0909  libj9trc23.so            java.bin                 fillInUTInterfaces
  37       84.0909  libj9trc23.so            java.bin                 twGetSequenceCounter
8         0.0154  libj9prt23.so            java.bin                 j9time_hires_clock
  27       61.3636  anon (tgid:10014 range:0x100000-0x103000) java.bin                 (no symbols)
  9        20.4545  libc-2.4.so              java.bin                 gettimeofday
  8        18.1818  libj9prt23.so            java.bin                 j9time_hires_clock [self]
-------------------------------------------------------------------------------

The output shows that "anon (tgid:10014 range:0x100000-0x103000)" was a callee of j9time_hires_clock, even though the ELF file 10014.jo was created for this profile run. Unfortunately, there is currently no way to correlate that anonymous callgraph entry with its corresponding .jo file.

2.4. Differential profiles with opreport

Often, we'd like to be able to compare two profiles. For example, when analysing the performance of an application, we'd like to make code changes and examine the effect of the change. This is supported in opreport by giving a profile specification that identifies two different profiles. The general form is:

$ opreport <shared-spec> { <first-profile> } { <second-profile> }

Note

We lost our Dragon book down the back of the sofa, so you have to be careful to have spaces around those braces, or things will get hopelessly confused. We can only apologise.

For each of the profiles, the shared section is prefixed, and then the specification is analysed. The usual parameters work both within the shared section, and in the sub-specification within the curly braces.
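To make the prefixing rule concrete, here is a small Python sketch that splits a list of command-line words into the two effective specifications. It illustrates the rule as described in this section and is not opreport's actual parser; the function name is made up. Note that it only recognises a brace that is a word of its own, which mirrors the need for spaces around the braces.

```python
def split_differential(args):
    """Return (first, second) specifications from a braced word list.

    Words outside the braces are shared and placed in front of each
    profile's own words.
    """
    shared, groups, current = [], [], None
    for word in args:
        if word == "{":
            if current is not None:
                raise ValueError("nested '{' is not allowed")
            current = []
        elif word == "}":
            if current is None:
                raise ValueError("'}' without matching '{'")
            groups.append(current)
            current = None
        elif current is not None:
            current.append(word)
        else:
            shared.append(word)
    if current is not None:
        raise ValueError("missing closing '}'")
    if len(groups) != 2:
        raise ValueError("expected exactly two { } groups")
    return shared + groups[0], shared + groups[1]


first, second = split_differential(
    ["/bin/bash", "{", "archive:./orig", "}", "{", "}"])
print(first)   # specification used for the first profile
print(second)  # specification used for the second profile
```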

A typical way to use this feature is with archives created with oparchive. Let's look at an example:

$ operf ./a
$ oparchive -o orig ./a
  # edit and recompile a
$ operf ./a
  # now compare the current profile of a with the archived profile
$ opreport  --session-dir=`pwd`/oprofile_data/ -xl ./a { archive:./orig } { }
CPU: PIII, speed 863.233 MHz (estimated)
Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a
unit mask of 0x00 (No unit mask) count 100000
samples  %        diff %    symbol name
92435    48.5366  +0.4999   a
54226    ---      ---       c
49222    25.8459  +++       d
48787    25.6175  -2.2e-01  b

Note that we specified an empty second profile in the curly braces, as we wanted to use the current session; alternatively, we could have specified another archive, or a tgid etc. We specified the binary a in the shared section, so we matched that in both the profiles we're diffing.

As in the normal output, the results are sorted by the number of samples, and the percentage field represents the relative percentage of the symbol's samples in the second profile.

Notice the new column in the output. This value represents the percentage change of the relative percent between the first and the second profile: roughly, "how much more important this symbol is". Looking at the symbol a(), we can see that it took roughly the same amount of the total profile in both the first and the second profile. The function c() was not in the new profile, so has been marked with ---. Note that the sample value is the number of samples in the first profile; since we're displaying results for the second profile, we don't list a percentage value for it, as it would be meaningless. d() is new in the second profile, and consequently marked with +++.

When comparing profiles between different binaries, it should be clear that functions can change in terms of VMA and size. To avoid this problem, opreport considers a symbol to be the same if the symbol name, image name, and owning application name all match; any other factors are ignored. Note that the check for application name means that trying to compare library profiles between two different applications will not work as you might expect: each symbol will be considered different.
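The matching rule and the meaning of the diff column can be sketched as follows. This is our reading of the description above, not code taken from opreport: in particular, the formula for the change (the difference between the two relative percentages, expressed as a percentage of the first) is an assumption, and the sample data is invented for the illustration.

```python
def diff_profiles(first, second):
    """Compare two {key: percent} maps.

    Each key is a (symbol, image, application) tuple, mirroring the
    matching rule described above. Returns {key: marker_or_change}.
    """
    result = {}
    for key in set(first) | set(second):
        old = first.get(key, 0.0)
        new = second.get(key, 0.0)
        if new == 0.0:
            result[key] = "---"      # only in the first profile
        elif old == 0.0:
            result[key] = "+++"      # new in the second profile
        else:
            # ASSUMPTION: change of the relative percentage, in percent.
            result[key] = 100.0 * (new - old) / old
    return result


# Invented data: relative percentages per (symbol, image, application).
old_profile = {("a", "a", "a"): 48.0, ("b", "a", "a"): 32.0,
               ("c", "a", "a"): 20.0}
new_profile = {("a", "a", "a"): 48.24, ("b", "a", "a"): 26.76,
               ("d", "a", "a"): 25.0}
for key, change in sorted(diff_profiles(old_profile, new_profile).items()):
    print(key[0], change)
```

Because the application name is part of the key, the same library symbol profiled under two different applications produces one "---" entry and one "+++" entry rather than a numeric change.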

2.5. Anonymous executable mappings

Many applications, typically ones involving dynamic compilation into machine code (just-in-time, or "JIT", compilation), have executable mappings that are not backed by an ELF file. opreport has basic support for showing the samples taken in these regions; for example:

$ opreport /usr/bin/mono -l
CPU: ppc64 POWER5, speed 1654.34 MHz (estimated)
Counted CYCLES events (Processor Cycles using continuous sampling) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        image name                                    symbol name
47       58.7500  mono                                          (no symbols)
14       17.5000  anon (tgid:3189 range:0xf72aa000-0xf72fa000)  (no symbols)
9        11.2500  anon (tgid:3189 range:0xf6cca000-0xf6dd9000)  (no symbols)
...

Note that, since such mappings are dependent upon individual invocations of a binary, these mappings are always listed as a dependent image. Equally, the results are not affected by the --merge option.

As shown in the opreport output above, OProfile is unable to attribute the samples to any symbol(s) because there is no ELF file for this code. Enhanced support for JITed code is now available for some virtual machines; e.g., the Java Virtual Machine. For details about OProfile output for JITed code, see Section 4, “OProfile results with JIT samples”.

For more information about JIT support in OProfile, see Section 4.1, “Support for dynamically compiled (JIT) code”.

2.6. XML formatted output

The --xml option can be used to generate XML instead of the usual text format. This allows opreport to eliminate some of the constraints dictated by the two dimensional text format. For example, it is possible to separate the sample data across multiple events, cpus and threads. The XML schema implemented by opreport is found in doc/opreport.xsd. It contains more detailed comments about the structure of the XML generated by opreport.

Since XML is consumed by a client program rather than a user, its structure is fairly static. In particular, the --sort option is incompatible with the --xml option. Percentages are not displayed in the XML, so the options related to percentages will have no effect. Full pathnames are always displayed in the XML, so --long-filenames is not necessary. The --details option will cause all of the individual sample data to be included in the XML as well as the instruction byte stream for each symbol (for doing disassembly) and can result in very large XML files.

2.7. Options for opreport

--accumulated / -a

Accumulate sample and percentage counts in the symbol list.

--callgraph / -c

Show callgraph information.

--debug-info / -g

Show source file and line for each symbol.

--demangle / -D none|normal|smart

none: no demangling. normal: use default demangler (default). smart: use pattern-matching to make C++ symbol demangling more readable.

--details / -d

Show per-instruction details for all selected symbols. Note that, for binaries without symbol information, the VMA values shown are raw file offsets for the image binary.

--exclude-dependent / -x

Do not include application-specific images for libraries, kernel modules and the kernel.

--exclude-symbols / -e [symbols]

Exclude all the symbols in the given comma-separated list.

--global-percent / -%

Make all percentages relative to the whole profile.

--help / -? / --usage

Show help message.

--image-path / -p [paths]

Comma-separated list of additional paths to search for binaries. This is needed to find kernel modules.

--root / -R [path]

A path to a filesystem to search for additional binaries.

--include-symbols / -i [symbols]

Only include symbols in the given comma-separated list.

--long-filenames / -f

Output full paths instead of basenames.

--merge / -m [lib,cpu,tid,tgid,unitmask,all]

Merge any profiles separated in a --separate session.

--no-header

Don't output a header detailing profiling parameters.

--output-file / -o [file]

Output to the given file instead of stdout.

--reverse-sort / -r

Reverse the sort from the default.

--session-dir=dir_path

Use sample database from the specified directory dir_path instead of the default location. If this option is not specified, then opreport will search for samples in <cur_dir>/oprofile_data first. If that directory does not exist, the standard session-dir of /var/lib/oprofile is used as the session directory.

--show-address / -w

Show the VMA address of each symbol (off by default).

--sort / -s [vma,sample,symbol,debug,image]

Sort the list of symbols by, respectively, symbol address, number of samples, symbol name, debug filename and line number, binary image filename.

--symbols / -l

List per-symbol information instead of a binary image summary.

--threshold / -t [percentage]

Only output data for symbols that have more than the given percentage of total samples. For profiles using multiple events, if the threshold is reached for any event, then all sample data for the symbol is shown.
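The multiple-event rule for --threshold amounts to an "any" test across events. The Python sketch below is illustrative only, with invented percentages and a made-up function name; it shows a symbol being kept because one of its events exceeds the threshold.

```python
def passes_threshold(percent_by_event, threshold):
    """True if the symbol exceeds the threshold for at least one event."""
    return any(pct > threshold for pct in percent_by_event.values())


# Invented percentages for two symbols profiled with two events.
hot_for_one_event = {"CPU_CLK_UNHALTED": 0.4, "DATA_MEM_REFS": 6.2}
cold_everywhere = {"CPU_CLK_UNHALTED": 0.4, "DATA_MEM_REFS": 0.9}

print(passes_threshold(hot_for_one_event, 5.0))  # kept, both columns shown
print(passes_threshold(cold_everywhere, 5.0))    # dropped
```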

--verbose / -V [options]

Give verbose debugging output.

--version / -v

Show version.

--xml / -X

Generate XML output.

3. Outputting annotated source (opannotate)

The opannotate utility generates annotated source files or assembly listings, optionally mixed with source. If you want to see the source file, the profiled application needs to have debug information, and the source must be available through this debug information. For GCC, you must use the -g option when you are compiling. If the binary doesn't contain sufficient debug information, you can still use opannotate --assembly to get annotated assembly as long as the binary has (at least) symbol information.

Note that for the reason explained in Section 3.1, “Hardware performance counters” the results can be inaccurate. The debug information itself can add other problems; for example, the line number for a symbol can be incorrect. Assembly instructions can be re-ordered and moved by the compiler, and this can lead to crediting source lines with samples not really "owned" by this line. Also see Chapter 5, Interpreting profiling results.

You can output the annotation to one single file, containing all the source found, using the --source option. You can use this in conjunction with --assembly to get combined source/assembly output.

You can also output a directory of annotated source files that maintains the structure of the original sources. Each line in the annotated source is prepended with the samples for that line. Additionally, each symbol is annotated giving details for the symbol as a whole. An example:

$ opannotate --source --output-dir=annotated /usr/local/oprofile-pp/bin/oprofiled
$ ls annotated/home/moz/src/oprofile-pp/daemon/
opd_cookie.h  opd_image.c  opd_kernel.c  opd_sample_files.c  oprofiled.c

Line numbers are maintained in the source files, but each file has a footer appended describing the profiling details. The actual annotation looks something like this :

...
               :static uint64_t pop_buffer_value(struct transient * trans)
 11510  1.9661 :{ /* pop_buffer_value total:  89901 15.3566 */
               :        uint64_t val;
               :
 10227  1.7469 :        if (!trans->remaining) {
               :                fprintf(stderr, "BUG: popping empty buffer !\n");
               :                exit(EXIT_FAILURE);
               :        }
               :
               :        val = get_buffer_value(trans->buffer, 0);
  2281  0.3896 :        trans->remaining--;
  2296  0.3922 :        trans->buffer += kernel_pointer_size;
               :        return val;
 10454  1.7857 :}
...

The first number on each line is the number of samples, whilst the second is the relative percentage of total samples.
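Should you want to extract the hottest lines from an annotated file in a script, the prefix in front of the colon is easy to pick apart. The Python sketch below assumes the single-event layout shown above (samples, percentage, colon, source text); profiles with several events have more columns and would need a different pattern. The function name is made up for this example.

```python
import re

# Optional "<samples> <percent>" prefix, then a colon, then the source text.
ANNOTATED = re.compile(r"^\s*(?:(\d+)\s+(\d+(?:\.\d+)?)\s+)?:(.*)$")


def parse_annotated_line(line):
    """Return (samples, percent, source) or None if the line has no colon."""
    match = ANNOTATED.match(line)
    if match is None:
        return None
    samples, percent, source = match.groups()
    if samples is None:
        # Line carries no samples of its own.
        return (0, 0.0, source)
    return (int(samples), float(percent), source)


print(parse_annotated_line(" 10227  1.7469 :        if (!trans->remaining) {"))
print(parse_annotated_line("               :        uint64_t val;"))
```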

3.1. Locating source files

Of course, opannotate needs to be able to locate the source files for the binary image(s) in order to produce output. Some binary images have debug information where the given source file paths are relative, not absolute. You can specify search paths to look for these files (similar to gdb's dir command) with the --search-dirs option.

Sometimes you may have a binary image which gives absolute paths for the source files, but you have the actual sources elsewhere (commonly, you've installed an SRPM for a binary on your system and you want annotation from an existing profile). You can use the --base-dirs option to redirect OProfile to look somewhere else for source files. For example, imagine we have a binary generated from a source file that is given in the debug information as /tmp/build/libfoo/foo.c, and you have the source tree matching that binary installed in /home/user/libfoo/. You can redirect OProfile to find foo.c correctly like this :

$ opannotate --source --base-dirs=/tmp/build/libfoo/ --search-dirs=/home/user/libfoo/ --output-dir=annotated/ /lib/libfoo.so

You can specify multiple (comma-separated) paths to both options.
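The interplay between the two options can be pictured as "strip a known prefix, then look for the remainder under each search directory". The following Python sketch models that idea under simplifying assumptions; it is not the algorithm opannotate actually uses, and the function name is made up.

```python
import os


def locate_source(debug_path, base_dirs, search_dirs):
    """Return the first existing candidate for debug_path, or None."""
    relative = debug_path
    for base in base_dirs:
        prefix = base.rstrip("/") + "/"
        if debug_path.startswith(prefix):
            # Strip the build-time prefix recorded in the debug info.
            relative = debug_path[len(prefix):]
            break
    if os.path.isabs(relative):
        # No prefix matched: only the path as recorded can be tried.
        return relative if os.path.isfile(relative) else None
    for directory in search_dirs:
        candidate = os.path.join(directory, relative)
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # Mirrors the libfoo example above; prints None unless the file exists.
    print(locate_source("/tmp/build/libfoo/foo.c",
                        ["/tmp/build/libfoo/"], ["/home/user/libfoo/"]))
```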

3.2. Usage of opannotate

--assembly / -a

Output annotated assembly. If this is combined with --source, then mixed source / assembly annotations are output.

--base-dirs / -b [paths]

Comma-separated list of path prefixes. This can be used to point OProfile to a different location for source files when the debug information specifies an absolute path on your system for the source that does not exist. The prefix is stripped from the debug source file paths, then searched in the search dirs specified by --search-dirs.

--demangle / -D none|normal|smart

none: no demangling. normal: use default demangler (default). smart: use pattern-matching to make C++ symbol demangling more readable.

--exclude-dependent / -x

Do not include application-specific images for libraries, kernel modules and the kernel.

--exclude-file [files]

Exclude all files in the given comma-separated list of glob patterns. This option is supported solely with the --source option. It can be used to filter out source files in the output using the following types of specifications:

  • filenames (basename -- i.e., no path)
  • filename glob specifications (all files whose base filename matches the given pattern)
  • directory segments (all source files located in the specified directory; e.g. "libio")
  • directory segment glob specifications (e.g., "libi*")
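The four kinds of specification listed above all reduce to matching a pattern against one component of the source path. The Python sketch below models that with fnmatch-style globbing; whether opannotate uses exactly these glob semantics is an assumption, and the function name is made up for this example.

```python
import fnmatch


def file_matches(path, patterns):
    """True if any pattern matches the basename or a directory segment."""
    segments = [segment for segment in path.split("/") if segment]
    return any(
        fnmatch.fnmatchcase(segment, pattern)
        for pattern in patterns
        for segment in segments
    )


path = "/usr/src/glibc/libio/fileops.c"
print(file_matches(path, ["fileops.c"]))  # basename
print(file_matches(path, ["*.c"]))        # basename glob
print(file_matches(path, ["libio"]))      # directory segment
print(file_matches(path, ["libi*"]))      # directory segment glob
```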

--exclude-symbols / -e [symbols]

Exclude all the symbols in the given comma-separated list.

--help / -? / --usage

Show help message.

--image-path / -p [paths]

Comma-separated list of additional paths to search for binaries. This is needed to find kernel modules.

--root / -R [path]

A path to a filesystem to search for additional binaries.

--include-file [files]

Only include files in the given comma-separated list of glob patterns. The same rules apply for this option as for the --exclude-file option.

--include-symbols / -i [symbols]

Only include symbols in the given comma-separated list.

--objdump-params [params]

Pass the given parameters as extra values when calling objdump. If more than one option is to be passed to objdump, the parameters must be enclosed in a quoted string.

An example of where this option is useful is when your toolchain does not automatically recognize instructions that are specific to your processor. For example, on IBM POWER7/RHEL 6, objdump must be told that a binary file may have POWER7-specific instructions. The opannotate option to show the POWER7-specific instructions is:

   --objdump-params=-Mpower7

The opannotate option to show the POWER7-specific instructions, the source code (--source) and the line numbers (-l) would be:

   --objdump-params="-Mpower7 -l --source"

--output-dir / -o [dir]

Output directory. This makes opannotate output one annotated file for each source file. This option can't be used in conjunction with --assembly.

--search-dirs / -d [paths]

Comma-separated list of paths to search for source files. This is useful to find source files when the debug information only contains relative paths.

--source / -s

Output annotated source. This requires debugging information to be available for the binaries.

--session-dir=dir_path

Use sample database from the specified directory dir_path instead of the default location. If this option is not specified, then opannotate will search for samples in <cur_dir>/oprofile_data first. If that directory does not exist, the standard session-dir of /var/lib/oprofile is used as the session directory.
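The lookup order just described is short enough to write down as a Python sketch. The function name pick_session_dir is hypothetical and the code only restates the rule above; it is not taken from the OProfile sources.

```python
import os
import tempfile

def pick_session_dir(cur_dir, session_dir=None):
    """Model of the documented order: explicit option, then
    <cur_dir>/oprofile_data, then /var/lib/oprofile."""
    if session_dir is not None:
        return session_dir
    local = os.path.join(cur_dir, "oprofile_data")
    if os.path.isdir(local):
        return local
    return "/var/lib/oprofile"

with tempfile.TemporaryDirectory() as cwd:
    print(pick_session_dir(cwd))                      # falls back to the standard location
    os.mkdir(os.path.join(cwd, "oprofile_data"))
    print(pick_session_dir(cwd).endswith("oprofile_data"))
```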

--threshold / -t [percentage]

For annotated assembly, only output data for symbols that have more than the given percentage of total samples. For profiles using multiple events, if the threshold is reached for any event, then all sample data for the symbol is shown.

For annotated source, only output data for source files that have more than the given percentage of total samples. For profiles using multiple events, if the threshold is reached for any event, then all sample data for the source file is shown.
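The multiple-event rule is easy to misread, so here is a minimal Python sketch of it. The function name and the sample counts are hypothetical; the only behaviour taken from the text is "more than the given percentage, for any event".

```python
def passes_threshold(samples, totals, threshold_pct):
    """samples: per-event counts for one symbol (or source file);
    totals: per-event counts for the whole profile.
    The item is shown if it exceeds the threshold for ANY event."""
    return any(
        total > 0 and 100.0 * count / total > threshold_pct
        for count, total in zip(samples, totals)
    )

totals = (1000, 2000)                             # hypothetical totals for two events
print(passes_threshold((50, 400), totals, 10))    # 5% and 20%: shown
print(passes_threshold((50, 100), totals, 10))    # 5% and 5%: suppressed
```

When an item passes, the data for all events is shown, including events for which it stayed below the threshold.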

--verbose / -V [options]

Give verbose debugging output.

--version / -v

Show version.

4. OProfile results with JIT samples

After profiling a Java (or other supported VM) application, the OProfile JIT support creates ELF binaries from the intermediate files that were written by the agent library. The ELF binaries are named <tgid>.jo. With the symbol information stored in these ELF files, it is possible to map samples to the appropriate symbols.

The usual analysis tools (opreport and/or opannotate) can now be used to get symbols and assembly code for the instrumented VM processes.

Below is an example of a profile report of a Java application that has been instrumented with the provided agent library.

$ opreport -l /usr/lib/jvm/jre-1.5.0-ibm/bin/java
CPU: Core Solo / Duo, speed 2167 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Unhalted clock cycles) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  %        image name               symbol name
186020   50.0523  no-vmlinux               no-vmlinux               (no symbols)
34333     9.2380  7635.jo                  java                     void test.f1()
19022     5.1182  libc-2.5.so              libc-2.5.so              _IO_file_xsputn@@GLIBC_2.1
18762     5.0483  libc-2.5.so              libc-2.5.so              vfprintf
16408     4.4149  7635.jo                  java                     void test$HelloThread.run()
16250     4.3724  7635.jo                  java                     void test$test_1.f2(int)
15303     4.1176  7635.jo                  java                     void test.f2(int, int)
13252     3.5657  7635.jo                  java                     void test.f2(int)
5165      1.3897  7635.jo                  java                     void test.f4()
955       0.2570  7635.jo                  java                     void test$HelloThread.run()~

Note

Depending on the JVM that is used, certain options of opreport and opannotate do NOT work since they rely on debug information (e.g. source code line number) that is not always available. The Sun JVM does provide the necessary debug information via the JVMTI[PI] interface, but other JVMs do not.

As you can see in the opreport output, the JIT support agent for Java generates symbols that include the class and method signature. A symbol with the suffix ~<n> (e.g. void test$HelloThread.run()~1) means that this is the <n>th occurrence of the identical name. This happens if a method is re-JITed. A symbol with the suffix %<n> means that the address space of this symbol was reused during the sample session (see Section 6, “Overlapping symbols in JITed code”). The value <n> is the percentage of time that this symbol/code was present in relation to the total lifetime of all other overlapping symbols. A symbol of the form <return_val> <class_name>$<method_sig> denotes an inner class.
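When post-processing opreport output in a script, it can be handy to separate these suffixes from the method name. The following Python sketch assumes that a symbol carries at most one trailing suffix, either ~<n> or %<n>; that assumption and the function name split_jit_symbol are not taken from OProfile itself.

```python
import re

def split_jit_symbol(symbol):
    """Split a JIT symbol into (name, kind, n).

    kind is "occurrence" for a ~<n> suffix, "lifetime_percent" for a
    %<n> suffix, and None (with n None) when there is no suffix.
    """
    match = re.fullmatch(r"(.*?)(?:([~%])(\d+))?", symbol)
    name, marker, number = match.groups()
    if marker is None:
        return name, None, None
    kind = "occurrence" if marker == "~" else "lifetime_percent"
    return name, kind, int(number)

print(split_jit_symbol("void test$HelloThread.run()~1"))
print(split_jit_symbol("void test.f2(int)%35"))   # hypothetical %<n> example
print(split_jit_symbol("void test.f1()"))
```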

5. gprof-compatible output (opgprof)

If you're familiar with the output produced by GNU gprof, you may find opgprof useful. It takes a single binary as an argument, and produces a gmon.out file for use with gprof -p. If call-graph profiling is enabled, then this is also included.

$ opgprof `which oprofiled` # generates gmon.out file
$ gprof -p `which oprofiled` | head
Flat profile:

Each sample counts as 1 samples.
  %   cumulative   self              self     total
 time   samples   samples    calls  T1/call  T1/call  name
 33.13 206237.00 206237.00                             odb_insert
 22.67 347386.00 141149.00                             pop_buffer_value
  9.56 406881.00 59495.00                             opd_put_sample
  7.34 452599.00 45718.00                             opd_find_image
  7.19 497327.00 44728.00                             opd_process_samples

5.1. Usage of opgprof

--help / -? / --usage

Show help message.

--image-path / -p [paths]

Comma-separated list of additional paths to search for binaries. This is needed to find kernel modules.

--root / -R [path]

A path to a filesystem to search for additional binaries.

--output-filename / -o [file]

Output to the given file instead of the default, gmon.out.

--threshold / -t [percentage]

Only output data for symbols that have more than the given percentage of total samples.

--verbose / -V [options]

Give verbose debugging output.

--session-dir=dir_path

Use sample database from the specified directory dir_path instead of the default location. If this option is not specified, then opgprof will search for samples in <cur_dir>/oprofile_data first. If that directory does not exist, the standard session-dir of /var/lib/oprofile is used as the session directory.

--version / -v

Show version.

6. Analyzing profile data on another system (oparchive)

The oparchive utility generates a directory populated with executable, debug, and oprofile sample files. This directory can be copied to another (host) machine and analyzed offline, with no further need to access the data collection machine (target).

The following command, executed on the target system, will collect the sample files, the executables associated with the sample files, and the debuginfo files associated with the executables and copy them into /tmp/current_data:

# oparchive -o /tmp/current_data

When transferring archived profile data to a host machine for offline analysis, you need to determine if the oprofile ABI format is compatible between the target system and the host system; if it isn't, you must run the opimport command to convert the target's sample data files to the format of your host system. See Section 7, “Converting sample database files (opimport)” for more details.

After your profile data is transferred to the host system and (if necessary) you have run the opimport command to convert the file format, you can now run the opreport and opannotate commands. However, you must provide an "archive specification" to let these post-processing tools know where to find the profile data (sample files, executables, etc.); for example:

# opreport archive:/home/user1/my_oprofile_archive --symbols

Furthermore, if your profile was collected on your target system into a session-dir other than /var/lib/oprofile, the oparchive command will display a message similar to the following:

# NOTE: The sample data in this archive is located at /home/user1/test-stuff/oprofile_data
instead of the standard location of /var/lib/oprofile.  Hence, when using opreport
and other post-processing tools on this archive, you must pass the following option:
        --session-dir=/home/user1/test-stuff/oprofile_data

Then the above opreport example would have to include that --session-dir option.

Note

In some host/target development environments, all target executables, libraries, and debuginfo files are stored in a root directory on the host to facilitate offline analysis. In such cases, the oparchive command collects more data than is necessary; so, when copying the resulting output of oparchive, you can skip all of the executables, etc, and just archive the $SESSION_DIR tree located within the output directory you specified in your oparchive command. Then, when running the opreport or opannotate commands on your host system, pass the --root option to point to the location of your target's executables, etc.

6.1. Usage of oparchive

--help / -? / --usage

Show help message.

--exclude-dependent / -x

Do not include application-specific images for libraries, kernel modules and the kernel.

--image-path / -p [paths]

Comma-separated list of additional paths to search for binaries. This is needed to find kernel modules.

--root / -R [path]

A path to a filesystem to search for additional binaries.

--output-directory / -o [directory]

Output to the given directory. There is no default. This must be specified.

--list-files / -l

Only list the files that would be archived, don't copy them.

--verbose / -V [options]

Give verbose debugging output.

--session-dir=dir_path

Use sample database from the specified directory dir_path instead of the default location. If this option is not specified, then oparchive will search for samples in <cur_dir>/oprofile_data first. If that directory does not exist, the standard session-dir of /var/lib/oprofile is used as the session directory.

--version / -v

Show version.

7. Converting sample database files (opimport)

This utility converts sample database files from a foreign binary format (abi) to the native format. This is required when moving sample files to a (host) system other than the one used for collection (target system), and the host and target systems are different architectures. The abi format of the sample files to be imported is described in a text file located in $SESSION_DIR/abi. If you are unsure if your target and host systems have compatible architectures (in regard to the OProfile ABI), simply diff a $SESSION_DIR/abi file from the target system with one from the host system. If any differences show up at all, you must run the opimport command.
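The "diff the two abi files" check can also be scripted. Below is a minimal Python sketch using the standard filecmp module; the function name needs_opimport is hypothetical, and the file contents in the demo are placeholders rather than real abi descriptions.

```python
import filecmp
import os
import tempfile

def needs_opimport(target_abi, host_abi):
    """True if the two abi description files differ in any way."""
    return not filecmp.cmp(target_abi, host_abi, shallow=False)

# Demo with placeholder files (not the real abi file format).
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "abi.target")
    host = os.path.join(tmp, "abi.host")
    with open(target, "w") as f:
        f.write("placeholder description A\n")
    with open(host, "w") as f:
        f.write("placeholder description B\n")
    print(needs_opimport(target, host))
```

Passing shallow=False forces a byte-by-byte comparison, which matches the "any differences at all" rule above.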

The oparchive command should be used on the machine where the profile was taken (target) in order to collect sample files and all other necessary information. The archive directory that is the output from oparchive should be copied to the system where you wish to perform your performance analysis (host).

The following command converts an input sample file to the specified output sample file using the given abi file as a binary description of the input file and the current platform abi as a binary description of the output file. (NOTE: The ellipses are used to make the example more compact and cannot be used in an actual command line.)

# opimport -a /tmp/foreign-abi -o /tmp/imported/.../GLOBAL_POWER_EVENTS.200000.1.all.all.all /tmp/archived/var/lib/.../mprime/GLOBAL_POWER_EVENTS.200000.1.all.all.all

Since opimport converts just one file at a time, an example shell script is provided below that will perform an import/conversion of all sample files in a samples directory collected from the target system.

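#!/bin/bash
# Usage: my-import.sh <foreign-abi-fullpathname>

# NOTE: Start from the "samples" directory containing the "current" directory
# to be imported

mkdir current-imported
# Recreate the directory layout of "current" under "current-imported"
(cd current && find . -type d ! -name . -print0) | (cd current-imported && xargs -0 mkdir -p)
cd current || exit 1
# Move "stats" out of the way so that it is not passed to opimport
mv stats ../StatsSave
# Convert each sample file, keeping its relative pathname unchanged
find . -type f | while IFS= read -r line; do
    opimport -a "$1" -o "../current-imported/$line" "$line"
done
mv ../StatsSave stats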

Example usage: Assume that on the target system, a profile was collected using a session-dir of /var/lib/oprofile, and then oparchive -o profile1 was run. Then the profile1 directory is copied to the host system for analysis. To import the sample data in profile1, you would perform the following steps:

$ cd profile1/var/lib/oprofile/samples
$ my-import.sh `pwd`/../abi

If the OProfile ABI is truly different on host and target machines, then the end result of running the above script will place the converted (i.e., imported) files into the current-imported directory. By default, opreport and other post-profiling tools will look for samples in samples/current of the specified session directory. So you should either rename current-imported to current or specify the session specification of session:current-imported when running post-profiling tools.

If the OProfile ABI is the same on the host and target machines, the my-import.sh script will print the following message for each sample file:

input abi is identical to native. no conversion necessary.

7.1. Usage of opimport

--help / -? / --usage

Show help message.

--abi / -a [filename]

Input abi file description location.

--force / -f

Force conversion even if the input and output abi are identical.

--output / -o [filename]

Specify the output filename. If the output file already exists, it is not overwritten; the data are accumulated into it. Sample filenames are meaningful to the post-profiling tools and must be kept identical: in other words, the pathname from the first path component containing a '{' must be kept as is in the output filename.
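The rule about preserving the pathname from the first component containing a '{' can be sketched in Python as follows. The helper names are hypothetical, and the directory layout in the demo path is an assumption made for illustration; only the event file name is taken from the opimport example earlier in this section.

```python
def sample_name_tail(path):
    """Return the part of path starting at the first component with '{'."""
    parts = path.split("/")
    for index, part in enumerate(parts):
        if "{" in part:
            return "/".join(parts[index:])
    raise ValueError("no path component contains '{'")

def output_path(input_path, output_root):
    """Build an output filename that keeps the significant tail intact."""
    return output_root.rstrip("/") + "/" + sample_name_tail(input_path)

# Hypothetical input path; the real layout of your samples tree may differ.
source = ("/tmp/archived/var/lib/oprofile/samples/current/"
          "{root}/usr/bin/mprime/{dep}/{root}/usr/bin/mprime/"
          "GLOBAL_POWER_EVENTS.200000.1.all.all.all")
print(output_path(source, "/tmp/imported/current"))
```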

--verbose / -V

Give verbose debugging output.

--version / -v

Show version.

Chapter 5. Interpreting profiling results

The standard caveats of profiling apply in interpreting the results from OProfile: profile realistic situations, profile different scenarios, profile for as long a time as possible, avoid system-specific artifacts, don't trust the profile data too much. Also bear in mind the comments on the performance counters above - you cannot rely on totally accurate instruction-level profiling. However, for almost all circumstances the data can be useful. Ideally a utility such as Intel's VTUNE would be available to allow careful instruction-level analysis; go hassle Intel for this, not me ;)

1. Profiling interrupt latency

This is an example of how the latency of delivery of profiling interrupts can impact the reliability of the profiling data. This is pretty much a worst-case-scenario example: these problems are fairly rare.

double fun(double a, double b, double c)
{
 double result = 0;
 for (int i = 0 ; i < 10000; ++i) {
  result += a;
  result *= b;
  result /= c;
 }
 return result;
}

Here the last instruction of the loop is very costly, and you would expect the results to reflect that - but (cutting the instructions inside the loop):

$ opannotate -a -t 10 ./a.out

     88 15.38% : 8048337:       fadd   %st(3),%st
     48 8.391% : 8048339:       fmul   %st(2),%st
     68 11.88% : 804833b:       fdiv   %st(1),%st
    368 64.33% : 804833d:       inc    %eax
               : 804833e:       cmp    $0x270f,%eax
               : 8048343:       jle    8048337

The problem comes from the x86 hardware: when the counter overflows, the IRQ is asserted, but several hardware features can delay the NMI. x86 hardware is synchronous (i.e. it cannot be interrupted during an instruction); there is also a latency when the IRQ is asserted; and the multiple execution units and the out-of-order model of modern x86 CPUs also cause problems. This is the same function, with source annotation:

$ opannotate -s -t 10 ./a.out

               :double fun(double a, double b, double c)
               :{ /* _Z3funddd total:     572 100.0% */
               : double result = 0;
    368 64.33% : for (int i = 0 ; i < 10000; ++i) {
     88 15.38% :  result += a;
     48 8.391% :  result *= b;
     68 11.88% :  result /= c;
               : }
               : return result;
               :}

The conclusion: don't trust samples coming at the end of a loop, particularly if the last instruction generated by the compiler is costly. This case can also occur for branches. Always bear in mind that a sample can be delayed by a few cycles from its real position. That's a hardware problem and OProfile can do nothing about it.
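A toy Python model makes the effect easier to see. It assumes that every sample is recorded exactly one instruction later than the instruction that really incurred the cost; real hardware delay varies, and the "true" cost figures below are invented for illustration, not measured.

```python
def attribute_with_skid(true_cost, skid=1):
    """Shift each instruction's samples 'skid' slots later in the loop.

    The loop body is treated as circular: samples that slide past the
    final branch land on the first instruction of the next iteration.
    """
    n = len(true_cost)
    observed = [0] * n
    for index, cost in enumerate(true_cost):
        observed[(index + skid) % n] += cost
    return observed

loop = ["fadd", "fmul", "fdiv", "inc", "cmp", "jle"]
true_cost = [10, 10, 70, 4, 3, 3]          # hypothetical: fdiv dominates
for name, seen in zip(loop, attribute_with_skid(true_cost)):
    print(f"{name:5s} {seen}")
```

In this model the samples that belong to fdiv show up on inc, which mirrors the shape of the annotated assembly shown earlier in this section.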

2. Kernel profiling

2.1. Interrupt masking

OProfile uses non-maskable interrupts (NMI) on the P6 generation, Pentium 4, Athlon, Opteron, Phenom, and Turion processors. These interrupts can occur even in sections of the kernel where interrupts are disabled, allowing collection of samples in virtually all executable code.

2.2. Idle time

Your kernel is likely to support halting the processor when a CPU is idle. As the typical hardware events like CPU_CLK_UNHALTED do not count when the CPU is halted, the kernel profile will not reflect the actual amount of time spent idle. You can change this behaviour by booting with the idle=poll option, which uses a different idle routine. This will appear as poll_idle() in your kernel profile.

2.3. Profiling kernel modules

OProfile profiles kernel modules by default. However, there are a couple of problems you may have when trying to get results. First, you may have booted via an initrd; this means that the actual path for the module binaries cannot be determined automatically. To get around this, you can use the -p option to the analysis tools to specify where to look for the kernel modules.

In kernel version 2.6, the information on where kernel module binaries are located was removed. This means OProfile needs guiding with the -p option to find your modules. Normally, you can just use your standard module top-level directory for this. Note that due to this problem, OProfile cannot check that the modification times match; it is your responsibility to make sure you do not modify a binary after a profile has been created.

If you have run insmod or modprobe to insert a module from a particular directory, it is important that you specify this directory with the -p option first, so that it overrides an older module binary that might exist in other directories you've specified with -p. It is up to you to make sure that these values are correct: the kernel simply does not provide enough information for OProfile to determine this itself.

3. Interpreting call-graph profiles

Sometimes the results from call-graph profiles may be different from what you expect to see. The first thing to check is whether the target binaries were compiled with frame pointers enabled (if the binary was compiled using gcc's -fomit-frame-pointer option, you will not get meaningful results). Note that as of this writing, the GCC developers plan to disable frame pointers by default. The Linux kernel is built without frame pointers by default; there is a configuration option you can use to turn them on under the "Kernel Hacking" menu.

Often you may see a caller of a function that does not actually directly call the function you're looking at (e.g. if a() calls b(), which in turn calls c(), you may see an entry for a()->c()). What's actually occurring is that we are taking samples at the very start (or the very end) of c(); at these few instructions, we haven't yet created the new function's frame, so it appears as if a() is calling directly into c(). Be careful not to be misled by these entries.
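The a()->c() artifact can be reproduced with a toy Python model of a frame-pointer walk. This is a simplification written for this manual, not OProfile's backtrace code: each established frame is assumed to record a return address inside the caller of its function, and the function name frame_pointer_backtrace is hypothetical.

```python
def frame_pointer_backtrace(call_chain, innermost_frame_ready):
    """Return the functions a frame-pointer walker would report.

    call_chain lists the real callers, outermost first, e.g.
    ["main", "a", "b", "c"]. If the innermost function has not yet set
    up its frame, the frame pointer still refers to its caller's frame,
    so the walk starts one level too far out.
    """
    current = call_chain[-1]
    framed = call_chain if innermost_frame_ready else call_chain[:-1]
    trace = [current]                      # the sampled program counter
    # Walk from the innermost established frame outward; the frame of
    # call_chain[depth] holds a return address in call_chain[depth - 1].
    for depth in range(len(framed) - 1, 0, -1):
        trace.append(call_chain[depth - 1])
    return trace

chain = ["main", "a", "b", "c"]
print(frame_pointer_backtrace(chain, innermost_frame_ready=True))
print(frame_pointer_backtrace(chain, innermost_frame_ready=False))
```

With the frame in place the walk reports c, b, a, main; when the sample lands before c() has created its frame, b() drops out and a() appears to call c() directly.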

Like the rest of OProfile, call-graph profiling uses a statistical approach; this means that sometimes a backtrace sample is truncated, or even partially wrong. Bear this in mind when examining results.

4. Inaccuracies in annotated source

4.1. Side effects of optimizations

The compiler can introduce some pitfalls in the annotated source output. The optimizer can move pieces of code in such a manner that two lines of code are interleaved (instruction scheduling). Debug info generated by the compiler can also show strange behavior. This is especially true for complex expressions, e.g. inside an if statement:

	if (a && ..
	    b && ..
	    c &&)

Here the problem comes from the position of the line number. The available debug info does not give enough detail for the if condition, so all samples are accumulated at the position of the closing parenthesis of the expression. Using opannotate -a can help to show the real samples at the assembly level.

4.2. Prologues and epilogues

The compiler generally needs to generate "glue" code across function calls, dependent on the particular function call conventions used. Additionally other things need to happen, like stack pointer adjustment for the local variables; this code is known as the function prologue. Similar code is needed at function return, and is known as the function epilogue. This will show up in annotations as samples at the very start and end of a function, where there is no apparent executable code in the source.

4.3. Inlined functions

You may see that a function is credited with a certain number of samples, but the listing does not add up to the correct total. To pick a real example:

               :internal_sk_buff_alloc_security(struct sk_buff *skb)
 353 2.342%    :{ /* internal_sk_buff_alloc_security total: 1882 12.48% */
               :
               :        sk_buff_security_t *sksec;
  15 0.0995%   :        int rc = 0;
               :
  10 0.06633%  :        sksec = skb->lsm_security;
 468 3.104%    :        if (sksec && sksec->magic == DSI_MAGIC) {
               :                goto out;
               :        }
               :
               :        sksec = (sk_buff_security_t *) get_sk_buff_memory(skb);
   3 0.0199%   :        if (!sksec) {
  38 0.2521%   :                rc = -ENOMEM;
               :                goto out;
  10 0.06633%  :        }
               :        memset(sksec, 0, sizeof (sk_buff_security_t));
  44 0.2919%   :        sksec->magic = DSI_MAGIC;
  32 0.2123%   :        sksec->skb = skb;
  45 0.2985%   :        sksec->sid = DSI_SID_NORMAL;
  31 0.2056%   :        skb->lsm_security = sksec;
               :
               :      out:
               :
 146 0.9685%   :        return rc;
               :
  98 0.6501%   :}

Here, the function is credited with 1,882 samples, but the annotations below do not account for this. This is usually because of inline functions - the compiler marks such code with debug entries for the inline function definition, and this is where opannotate annotates such samples. In the case above, memset is the most likely candidate for this problem. Examining the mixed source/assembly output can help identify such results.
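The size of the gap can be checked by adding up the per-line counts from the listing above. The following Python snippet does just that; the numbers are copied from the annotation, and nothing else is assumed.

```python
# Per-line sample counts, in the order they appear in the listing.
line_samples = [353, 15, 10, 468, 3, 38, 10, 44, 32, 45, 31, 146, 98]
function_total = 1882          # from the "total:" comment on the first line

accounted = sum(line_samples)
missing = function_total - accounted
print(accounted, missing, round(100.0 * missing / function_total, 1))
```

This gives 1293 samples on visible lines and 589 unaccounted for, i.e. roughly 31.3% of the function's samples are attributed elsewhere.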

This problem is more visible when there is no source file available. In the following example, it is easy to see that the sum of the symbols' samples differs from the number of samples for the file; the difference must be attributed to inline functions.

/*
 * Total samples for file : "arch/i386/kernel/process.c"
 *
 *    109  2.4616
 */

 /* default_idle total:     84  1.8970 */
 /* cpu_idle total:         21  0.4743 */
 /* flush_thread total:      1  0.0226 */
 /* prepare_to_copy total:   1  0.0226 */
 /* __switch_to total:      18  0.4065 */

The missing samples are not lost; they are credited to the source location where the inlined function is defined. Samples for an inlined function from all of its call sites are merged in one place in the annotated source file, so there is no way to see which call site the samples for an inlined function came from.

When running opannotate, you may get a warning "some functions compiled without debug information may have incorrect source line attributions". In some rare cases, OProfile is not able to verify that the derived source line is correct (when some parts of the binary image are compiled without debugging information). Be wary of results if this warning appears.

Furthermore, for some languages the compiler can implicitly generate functions, such as default copy constructors. Such functions are labelled by the compiler as having a line number of 0, which means the source annotation can be confusing.

4.4. Inaccuracy in line number information

Depending on your compiler you can fall into the following problem:

struct big_object { int a[500]; };

int main()
{
	big_object a, b;
	for (int i = 0 ; i != 1000 * 1000; ++i)
		b = a;
	return 0;
}

Compiled with gcc 3.0.4 the annotated source is clearly inaccurate:

               :int main()
               :{  /* main total: 7871 100% */
               :        big_object a, b;
               :        for (int i = 0 ; i != 1000 * 1000; ++i)
               :                b = a;
 7871 100%     :        return 0;
               :}

The problem here is distinct from the IRQ latency problem: the debug line number information is not precise enough. Again, looking at the output of opannotate -as can help.

               :int main()
               :{
               :        big_object a, b;
               :        for (int i = 0 ; i != 1000 * 1000; ++i)
               : 80484c0:       push   %ebp
               : 80484c1:       mov    %esp,%ebp
               : 80484c3:       sub    $0xfac,%esp
               : 80484c9:       push   %edi
               : 80484ca:       push   %esi
               : 80484cb:       push   %ebx
               :                b = a;
               : 80484cc:       lea    0xfffff060(%ebp),%edx
               : 80484d2:       lea    0xfffff830(%ebp),%eax
               : 80484d8:       mov    $0xf423f,%ebx
               : 80484dd:       lea    0x0(%esi),%esi
               :        return 0;
    3 0.03811% : 80484e0:       mov    %edx,%edi
               : 80484e2:       mov    %eax,%esi
    1 0.0127%  : 80484e4:       cld
    8 0.1016%  : 80484e5:       mov    $0x1f4,%ecx
 7850 99.73%   : 80484ea:       repz movsl %ds:(%esi),%es:(%edi)
    9 0.1143%  : 80484ec:       dec    %ebx
               : 80484ed:       jns    80484e0
               : 80484ef:       xor    %eax,%eax
               : 80484f1:       pop    %ebx
               : 80484f2:       pop    %esi
               : 80484f3:       pop    %edi
               : 80484f4:       leave
               : 80484f5:       ret

So here it's clear that copying is correctly credited with virtually all of the samples, but the line number information is misplaced. objdump -dS exposes the same problem. Note that maintaining accurate debug information for compilers when optimizing is difficult, so this problem is not surprising. The problem of debug information accuracy is also dependent on the binutils version used; some BFD library versions contain a work-around for known problems of gcc, some others do not. This is unfortunate but we must live with that, since profiling is pointless when you disable optimisation (which would give better debugging entries).

5. Assembly functions

Often the assembler cannot generate debug information automatically. This means that you cannot get a source report unless you manually define the necessary debug information; read your assembler documentation for how you might do that. The only debugging info needed currently by OProfile is the line-number/filename-VMA association. When profiling assembly without debugging info you can always get a report for symbols, and optionally for VMAs, through opreport -l or opreport -d, but this works only for symbols with the right attributes. For gas you can get this with

.globl foo
	.type	foo,@function

whilst for nasm you must use

	  GLOBAL foo:function		; [1]

Note that OProfile does not need the global attribute, only the function attribute.

6. Overlapping symbols in JITed code

Some virtual machines (e.g., Java) may re-JIT a method, resulting in previously allocated space for a piece of compiled code to be reused. This means that, at one distinct code address, multiple symbols/methods may be present during the run time of the application.

Since OProfile samples are buffered and don't have timing information, there is no way to correlate samples with the (possibly) varying address ranges in which the code for a symbol may reside. An alternative would be flushing the OProfile sampling buffer when we get an unload event, but this could result in high overhead.

To moderate the problem of overlapping symbols, OProfile tries to select the symbol that was present at this address range most of the time. Additionally, other overlapping symbols are truncated in the overlapping area. This gives reasonable results, because in reality, address reuse typically takes place during phase changes of the application -- in particular, during application startup. Thus, for optimum profiling results, start the sampling session after application startup and burn in.
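The selection rule can be illustrated with a small Python sketch. The data layout (name, start, end, lifetime), the function name pick_symbol, and the addresses and lifetimes in the demo are all hypothetical; the only behaviour taken from the text is that, where ranges overlap, the symbol that was present longest wins.

```python
def pick_symbol(address, symbols):
    """Return the name of the longest-lived symbol covering address.

    symbols is a list of (name, start, end, lifetime) tuples with an
    exclusive end address; lifetime may be in any consistent unit.
    """
    covering = [s for s in symbols if s[1] <= address < s[2]]
    if not covering:
        return None
    return max(covering, key=lambda s: s[3])[0]

symbols = [
    ("void test.f1()",    0x1000, 0x1100, 90),   # present most of the run
    ("void test.f2(int)", 0x1080, 0x1200, 10),   # reused part of f1's range
]
print(pick_symbol(0x10c0, symbols))   # overlap: the longer-lived f1 wins
print(pick_symbol(0x1180, symbols))   # only f2 covers this address
print(pick_symbol(0x2000, symbols))   # nothing mapped here
```

In effect the shorter-lived symbol is truncated to the part of its range that no longer-lived symbol claims.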

7. Using operf to profile fork/execs

When profiling an application that forks one or more new processes, operf will record samples for both the parent process and forked processes. This is true even if the forked process performs an exec of some sort (e.g., execvp). If the process does not perform an exec, you will see that opreport will attribute samples for the forked process to the main application executable. On the other hand, if the forked process does perform an exec, then opreport will attribute samples to the executable being exec'ed.

To demonstrate this, consider the following examples. When using operf to profile a single application (either with the --pid option or command option), the normal opreport summary output (i.e., invoking opreport with no options) looks something like the following:

CPU_CLK_UNHALT...|
  samples|      %|
------------------
   112342 100.000 sprintft
	CPU_CLK_UNHALT...|
	  samples|      %|
	------------------
	   104209 92.7605 libc-2.12.so
	     7273  6.4740 sprintft
	      858  0.7637 no-vmlinux
	        2  0.0018 ld-2.12.so

But if you profile an application that does a fork/exec, the opreport summary output will show samples for both the main application you profiled, as well as the exec'ed program. An example is shown below where s-m-fork is the main application being profiled, which in turn forks a process that does an execvp of the memcpyt program.

CPU_CLK_UNHALT...|
  samples|      %|
------------------
   133382 70.5031 memcpyt
	CPU_CLK_UNHALT...|
	  samples|      %|
	------------------
	   123852 92.8551 libc-2.12.so
	     8522  6.3892 memcpyt
	     1007  0.7550 no-vmlinux
	        1 7.5e-04 ld-2.12.so
    55804 29.4969 s-m-fork
	CPU_CLK_UNHALT...|
	  samples|      %|
	------------------
	    51801 92.8267 libc-2.12.so
	     3589  6.4314 s-m-fork
	      414  0.7419 no-vmlinux
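The top-level percentages in this summary are simply each executable's share of all samples in the session. A short Python check, using only the sample counts shown above (the helper name percent_by_image is hypothetical):

```python
def percent_by_image(samples):
    """Map each image name to its share of the total, in percent."""
    total = sum(samples.values())
    return {name: round(100.0 * count / total, 4)
            for name, count in samples.items()}

top_level = {"memcpyt": 133382, "s-m-fork": 55804}
print(percent_by_image(top_level))
```

The result, 70.5031 for memcpyt and 29.4969 for s-m-fork, matches the report: the exec'ed program and the forking parent are accounted for within the same profile.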

8. Other discrepancies

Another cause of apparent problems is the hidden cost of instructions. A very common example is two memory reads, one from L1 cache and the other from main memory: the second read is likely to have more samples. There are many other causes of hidden cost of instructions. A non-exhaustive list: mis-predicted branch, TLB cache miss, partial register stall, partial register dependencies, memory mismatch stall, re-executed µops. If you want to write programs at the assembly level, be sure to take a look at the Intel and AMD documentation at http://developer.intel.com/ and http://developer.amd.com/devguides.jsp.

Chapter 6. Controlling the event counter

Table of Contents

1. Using ocount

1. Using ocount

This section describes in detail how ocount is used. Unless the --events option is specified, ocount will use the default event for your system. For most systems, the default event is some cycles-based event, assuming your processor type supports hardware performance counters. The event specification used for ocount differs slightly from that required for profiling: a count value is not needed. You can see the event information for your CPU using ophelp. More information on event specification can be found in Section 3, “Specifying performance counter events”.

The ocount command syntax is:

ocount [ options ] [ --system-wide | --process-list <pids> | --thread-list <tids> | --cpu-list <cpus> [ command [ args ] ] ]

ocount has 5 run modes:

  • system-wide
  • process-list
  • thread-list
  • cpu-list
  • command

Exactly one of these 5 run modes must be specified when you run ocount. If you run ocount using a run mode other than command [args], press Ctrl-c to stop it when you have finished counting (e.g., when the monitored process ends). If you background ocount (i.e., with ’&’) while using one of these run modes, you must stop it in a controlled manner so that the data collection process can be shut down cleanly and final results can be displayed. Use kill -SIGINT <ocount-PID> for this purpose.

Following is a description of the ocount options.

command [args]

The command or application for which events are to be counted. The [args] are the input arguments that the command or application requires. The command and its arguments must be positioned at the end of the command line, after all other ocount options.

--process-list / -p [PIDs]

Use this option to count events for one or more already-running applications, specified via a comma-separated list (PIDs). Event counts will be collected for all children of the passed process(es) as well.

--thread-list / -r [TIDs]

Use this option to count events for one or more already-running threads, specified via a comma-separated list (TIDs). Event counts will not be collected for any children of the passed thread(s).

--system-wide / -s

This option is for counting events for all processes running on your system. You must have root authority to run ocount in this mode.

--cpu-list / -C [CPUs]

This option is for counting events on a subset of processors on your system. You must have root authority to run ocount in this mode. This is a comma-separated list, where each element in the list may be either a single processor number or a range of processor numbers; for example: ’-C 2,3,4-11,15’.
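
To make the list syntax concrete, the following sketch expands a specification such as the one above into individual processor numbers. It only illustrates the notation; the function name expand_cpu_list is hypothetical, and this is not how ocount itself parses the option.

```shell
# expand_cpu_list SPEC: print one processor number per line for a
# comma-separated SPEC whose elements are single numbers or "first-last" ranges.
# (Hypothetical helper; it does not validate its input.)
expand_cpu_list() {
    # Turn commas into newlines, then split each element on "-".
    printf '%s\n' "$1" | tr ',' '\n' | while IFS=- read -r first last; do
        if [ -z "$last" ]; then
            # A single processor number.
            echo "$first"
        else
            # A range: count from first to last, inclusive.
            i=$first
            while [ "$i" -le "$last" ]; do
                echo "$i"
                i=$((i + 1))
            done
        fi
    done
}

# Expand the example from the text, printing the result on one line.
expand_cpu_list 2,3,4-11,15 | tr '\n' ' '
echo
```

For the example in the text, the expansion covers processors 2 and 3, every processor from 4 through 11, and processor 15.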

--events / -e [event1[,event2[,...]]]

This option is for passing a comma-separated list of event specifications for counting. Each event spec is of the form:

name[:unitmask[:kernel[:user]]]

When no event specification is given, the default event for the running processor type will be used for counting. Use ophelp to list the available events for your processor type.
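
As an illustration of the event spec layout, the sketch below splits one spec into its colon-separated fields. The function name parse_event_spec and the event name EXAMPLE_EVENT are placeholders invented for this example; real event names and valid unit masks come from ophelp. Fields omitted from a spec are printed as empty, since this sketch makes no assumption about the defaults ocount applies.

```shell
# parse_event_spec SPEC: print the fields of one
# "name[:unitmask[:kernel[:user]]]" event spec.
# (Hypothetical helper for illustration only.)
parse_event_spec() {
    # IFS=: makes read split on colons; fields absent from SPEC stay empty.
    IFS=: read -r name unitmask kernel user <<EOF
$1
EOF
    printf 'name=%s unitmask=%s kernel=%s user=%s\n' \
        "$name" "$unitmask" "$kernel" "$user"
}

# EXAMPLE_EVENT is a placeholder, not a real event name.
parse_event_spec EXAMPLE_EVENT:0x01:0:1
parse_event_spec EXAMPLE_EVENT
```

Note that, unlike an event spec used for profiling, there is no count field to parse.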

--separate-thread / -t

This option can be used in conjunction with either the --process-list or --thread-list option to display event counts on a per-thread (per-process) basis. Without this option, all counts are aggregated.

--separate-cpu / -c

This option can be used in conjunction with either the --system-wide or --cpu-list option to display event counts on a per-cpu basis. Without this option, all counts are aggregated.

--time-interval / -i interval_length[:num_intervals]

Use this option to print results at regular intervals instead of the default of one dump of cumulative event counts at the end of the run. Results collected for each time interval are printed immediately, and counters are reset to zero at the start of each interval. Note: The interval_length is given in milliseconds. However, the current implementation only supports 100 ms granularity, so the given interval_length will be rounded to the nearest 100 ms.

If num_intervals is specified, ocount exits after the specified number of intervals occur.
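
The effect of the 100 ms granularity can be sketched with integer arithmetic. This is an approximation under stated assumptions: the function name round_interval_ms is hypothetical, and the treatment of exact ties (such as 250) and of values below 50 is an assumption of this sketch, since the text above only says "nearest 100 ms".

```shell
# round_interval_ms MS: round MS to the nearest multiple of 100.
# Assumption: exact ties (e.g. 250) round up; ocount's actual tie handling
# is not documented here. (Hypothetical helper for illustration only.)
round_interval_ms() {
    echo $(( ($1 + 50) / 100 * 100 ))
}

round_interval_ms 1040   # slightly above 1000
round_interval_ms 1060   # closer to 1100 than to 1000
round_interval_ms 249    # closer to 200 than to 300
```

In practice it is simplest to pass an interval_length that is already a multiple of 100, so that no rounding is involved.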

--brief-format / -b

Use this option to print results in the following brief format:

                  [optional cpu or thread,]<event_name>,<count>,<percent_time_enabled>
                  [        <int>         ,]<  string  >,< u64 >,<     double         >
        

If --time-interval is specified, a separate line formatted as

                  timestamp,<num_seconds_since_epoch>[.n]
        

is printed ahead of each dump of event counts. If the time interval specified is less than one second, the timestamp will have 1/10 second precision.
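
Because the brief format is comma-separated, it is straightforward to post-process. The following sketch totals the counts per event across CPUs (or threads) and intervals. It assumes the input contains only lines in the two shapes described above; the function name sum_brief_counts, the event name EXAMPLE_EVENT and the sample lines are all made up for illustration and are not real ocount output.

```shell
# sum_brief_counts: read brief-format lines on stdin and print
# "<event_name> <total_count>" for each event, sorted by name.
# (Hypothetical helper. awk stores numbers as doubles, so very large
# u64 counts may lose precision.)
sum_brief_counts() {
    awk -F, '
        $1 == "timestamp" { next }      # skip per-interval timestamp lines
        NF == 3 { total[$1] += $2 }     # <event_name>,<count>,<percent_time_enabled>
        NF == 4 { total[$2] += $3 }     # <cpu or thread>,<event_name>,<count>,<percent_time_enabled>
        END { for (e in total) print e, total[e] }
    ' | sort
}

# Made-up sample input shaped like the format above.
sum_brief_counts <<'EOF'
timestamp,1400000000.1
0,EXAMPLE_EVENT,1200,100.00
1,EXAMPLE_EVENT,800,100.00
EOF
```

The same function could read a file written with the --output-file option by redirecting that file to its standard input.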

--output-file / -f outfile_name

Results are written to outfile_name instead of interactively to the terminal.

--verbose / -V

Use this option to increase the verbosity of the output.

--version / -v

Show ocount version.

--help / -h

Show a help message.

Chapter 7. Acknowledgments

Thanks to (in no particular order): Arjan van de Ven, Rik van Riel, Juan Quintela, Philippe Elie, Phillipp Rumpf, Tigran Aivazian, Alex Brown, Alisdair Rawsthorne, Bob Montgomery, Ray Bryant, H.J. Lu, Jeff Esper, Will Cohen, Graydon Hoare, Cliff Woolley, Alex Tsariounov, Al Stone, Jason Yeh, Randolph Chung, Anton Blanchard, Richard Henderson, Andries Brouwer, Bryan Rittmeyer, Maynard P. Johnson, Richard Reich (rreich@rdrtech.com), Zwane Mwaikambo, Dave Jones, Charles Filtness; and finally Pulp, for "Intro".