DTrace and Mac OS X PDF Print E-mail
Saturday, 13 February 2010 05:43

Introduction.

The main article's goal is the complete overview of DTrace from its architecture to DTrace scripts development and libdtrace library use. This article provides examples of DTrace scripts that can be used for software optimization and collecting information about applications in real time. All examples were tested in Mac OS C 10.6.

What is DTrace?

I believe most of MacOS X developers use Instruments for applications optimization and testing very often. But not all of them know about DTrace, although it is being used as the kernel of Instruments.

DTrace is dynamic tracing tool that is built into Mac OS X. It was developed by Sun Microsystems for Solaris and later was released under free CCDL license. At this moment DTrace has been ported to FreeBSD 7.1 (as substitute for ktrace), Mac OS X 10.5 and it is being developed to support QNX.

As opposed to debugger (gdb for example) DTrace is safe to use in production systems and it does not require restarting either the system or application. Also DTrace may be helpful in case of races or another difficult-to-locate errors in production environment and as reverse engineering tool.

DTrace architecture.

DTrace consists of 4 parts: the front-end, like dtrace application or Instruments application, the libdtrace library, in-kernel DTrace framework and DTrace providers.

Instruments and dtrace applications are typical Mac OS X DTrace front-end. These applications use libdtrace for obtaining information from DTrace providers. Also it could be any users’ application, which was written with use of libdtrace library.

Libdtrace provides the interface to kernel level DTrace framework and compiles D code into DIF (DTrace Intermediate Format - a RISC-like instruction set). ioctl call and dtrace device are used for interactions between user and kernel levels.

Kernel level DTrace framework executes DIF code and manages DTrace Providers objects. DTrace providers responsible for creating probes by instrumenting of various part of the system.

A probe is a location or activity to which DTrace can bind a request to perform a set of actions, like recording a stack trace, a timestamp, or the argument to a function.

The interfaces between the libdtrace library, the in-kernel DTrace framework, and the DTrace providers are described in the block comments in header file. The internal architecture of DTrace is described in the block comments in header file.

DTrace D script language.

DTrace D language has C-like syntax and it is used for tracing actions description. The typical D script consists of compiler directives, initialization, type declarations, predicates, variables and probes clauses. Each probe includes probe description, predicates and actions.

BEGIN
{
    initialization
}

probe descriptions
/ predicates / 
{
     actions
}

END
{
    termination
}

When a D script executed, all described probes are enabled and the BEGIN probe fires before any other probes. When the probe is fired and the predicate is true, the actions are executed.

Probes.

Each probe description is made up of four fields separated by colon character:

provider:module:function:name
  • Provider - instrumented layer. For example, function calls, memory statistics, process creation and termination, etc.
  • Module - instrumented module. For example, libSystem, libc, etc.
  • Function - instrumented function.
  • Name - location in the function. Typically it is enter, return, tick-nsec.

The provider field is required; fields module, function, name are optional. Pattern matching characters such as *, ?, [...], \ can be used. The empty field is interpreted as the field which consists only of '*' wild card.

For example, the following D script will trace write* functions calls for all processes.

syscall::write*:entry
{
    printf("(%d):%s write call called", pid, execname);
}

Predicates are used for determination whether the associated actions are executed. If the predicate is omitted the actions are always executed when the probe fires. Examples of predicates:

/pid == 12345/
/execname = "bash"/

Variables

There are two types of variable in D, scalar variables and associated arrays.

BEGIN
{
    x = 100;
    arr1["hello"] = 100;
    arr2["hello", 10] = 100;
}

X is an integer scalar variable.

arr1 is an associated array variable with key "hello" and value 100.

arr2 is an associated array variable with tuple ["hello", 10] as key and value 100.

As follows from sample above, D, unlike C or Java, does not require explicit variable declaration.

Also, variables can be separated into global, thread-local and clause-local variables by storage type.

Thread-local variables

Thread-local variables are used for storing data separated by threads. Thread-local variables are accessed via the self identifier and the -> operator.

syscall::write*:entry
{
    self->write = 1;
}
syscall::write*:return
/self->write == 1/
{
    self->write = 0;
}

Thread-local variable should be assigned to zero at the end of usage, it allows DTrace to reclaim the storage.

Clause-local variables

Clause-local variables are similar to C or C++ automatic variables. Clause-local variables are accessed via the this identifier and the -> operator.

syscall::write*:entry
{
    self->i = 10;
    ...
}

Thread-local variable reclaimed and reused automatically for the next clause.

Structs

For simplifying D scripts variables can be putted together into structs. Structs syntax is similar to C struct.

struct struct_name {
    int a;
    int b[string];
};

struct struct_name arr[string];
The arr variable is associated array of struct_name objects.

Macro variables.

D compiler provides set of macro variable, which can be used in D scripts. Macro variables are prefixed with a dollar sign ($) and are expanded once by the D compiler when it’s processing an input file.

The list of frequently used macro variables:

  • $1 - first additional operand.
  • $pid - dtrace process ID.
  • $target - the process id of the process that is selected by -p option or is created using -c option.
pid$target:::entry 
{
     @[probefunc] = count();
}
...
  matchString                                                    9217
  malloc_zone_malloc                                             9712
  szone_malloc                                                   9712
  szone_malloc_should_clear                                      9743
  memmove                                                       10324
  tiny_free_list_add_ptr                                        10522
  object_getClass                                               11244
  __CFStringHash                                                13019
  CFHash                                                        13095
  ___CFBasicHashFindBucket1                                     13748
  CFAllocatorDeallocate                                         13923
  _CFRetain                                                     15354
  CFRetain                                                      15372
  __CFStringEqual                                               15861
...

The example above demonstrates using of $target macro variable. That scrip can be used to determine the frequently called functions list.

Aggregations.

Aggregations are used for collecting the output in memory tables and outputting summary information. The syntax of an aggregation is:

@name[ keys ] = aggfunc ( args );

Aggregations variables has special prefix @ and there are no thread- or clause-local aggregations. DTrace command prints aggregations results when the process terminated or when user pressed Control-C.

count.

count call returns the number of times called. The following script prints the list of calls with the number of times functions were called.

pid$1:::entry 
{
    @[strjoin(strjoin(probemod,"`"), probefunc)]=count();
}   
....
  CoreFoundation`CFBasicHashApply                                  14
  CoreFoundation`CFDictionaryApplyFunction                         14
  libSystem.B.dylib`strlen                                         26
  libSystem.B.dylib`xdr_void                                       28
  libSystem.B.dylib`mach_port_names                               577
  libSystem.B.dylib`pid_for_task                                  577
  libSystem.B.dylib`task_info                                     577
  libSystem.B.dylib`task_threads                                  577
  tasksexplorerd`build_tasks_array                                577
  tasksexplorerd`task_explorer_dyninfo_1_svc                      577
  tasksexplorerd`xdr_task_info_dynamic                            577
  tasksexplorerd`task_info_manager_find_task                      578
  tasksexplorerd`build_killed_array                               579
  libSystem.B.dylib`_authenticate                                 592
  libSystem.B.dylib`_svcauth_null                                 592
  libSystem.B.dylib`bcopy                                         592
  libSystem.B.dylib`fill_input_buf                                592
  libSystem.B.dylib`read$NOCANCEL                                 592
...

avg.

avg call returns the arithmetic average of the expression. The following script prints the average time spending for each call in microsecond.

pid$1:::entry 
{
     self->timestamp = timestamp;
}

pid$1:::return
/self->timestamp/
{
     @[strjoin(strjoin(probemod,"`"), probefunc)] = avg((timestamp - self->timestamp) / 1000);
     self->timestamp = 0;
}
...
  tasksexplorerd`build_killed_array                                14
  CoreFoundation`CFDictionaryGetCount                              15
  libSystem.B.dylib`get_input_bytes                                15
  libSystem.B.dylib`pid_for_task                                   15
  libSystem.B.dylib`xdrrec_endofrecord                             15
  libSystem.B.dylib`xdrrec_skiprecord                              15
  libSystem.B.dylib`OSSpinLockUnlock                               16
  libSystem.B.dylib`mig_get_reply_port                             16
  libSystem.B.dylib`xdr_void                                       16
  libSystem.B.dylib`read$NOCANCEL                                  17
  libSystem.B.dylib`szone_free_definite_size                       18
  libSystem.B.dylib`tiny_free_list_add_ptr                         18
...

normalize and trunc.

normalize and trunc functions are used for modifying aggregation output. trunc call returns top x records for aggregation results. In the example above this function is used for retrieving top 10 aggregation records. normalize call sets normalization factor for the specified aggregation, it is used for represented time in microsecond in example above.

pid$1:::entry 
{
     self->timestamp = timestamp;
}
pid$1:::return
/self->timestamp/
{
     @result[strjoin(strjoin(probemod,"`"), probefunc)] = avg(timestamp - self->timestamp);
     self->timestamp = 0;
}
END
{
     normalize(@result, 1000);
     trunc(@result, 10);
}
libSystem.B.dylib`tiny_free_list_add_ptr                         18
tasksexplorerd`task_info_manager_get_tasks_count                 18
libSystem.B.dylib`tiny_malloc_from_free_list                     18
libSystem.B.dylib`szone_free_definite_size                       19
libSystem.B.dylib`tiny_free_list_remove_ptr                      19
libSystem.B.dylib`mach_msg_trap                                  22
libSystem.B.dylib`write$NOCANCEL                                 22
libSystem.B.dylib`__sysctl                                       23
libSystem.B.dylib`gettimeofday                                   24
libSystem.B.dylib`select$DARWIN_EXTSN$NOCANCEL                 3802

Providers.

syscall.

The syscall provider gives information about every system call. This provider provides a pair of probes for each system calls: entry and return. The following script displays stack trace for all program system calls.

syscall:::entry 
/pid == $target/
{
    @result[ustack()] = count();
}
...
 libSystem.B.dylib`__getdirentries64+0xa
 libSystem.B.dylib`readdir$INODE64+0x43
 CoreFoundation`_CFBundleCopyDirectoryContentsAtPath+0x65b
 CoreFoundation`_CFSearchBundleDirectory+0x67
 CoreFoundation`_CFFindBundleResourcesInRawDir+0x1a4
 CoreFoundation`_CFFindBundleResourcesInResourcesDir+0x246
 CoreFoundation`_CFFindBundleResources+0x3db
 CoreFoundation`CFBundleCopyResourceURL+0x91
 CoreFoundation`CFBundleGetLocalInfoDictionary+0x55
 CoreFoundation`CFBundleGetValueForInfoDictionaryKey+0x25
 tasksexplorerd`extract_bundle_info+0x8e
 tasksexplorerd`task_explorer_base_info_1_svc+0xf8
 tasksexplorerd`task_explorer_prog_1+0x188
 libSystem.B.dylib`svc_getreqset+0x1cb
 libSystem.B.dylib`svc_run+0x81
 tasksexplorerd`main+0x1c0
 tasksexplorerd`start+0x34
 tasksexplorerd`0x1
...

objc.

The objc provider provides information about Objective-C messages. For example, this provider allows to create the script for collecting top 10 Objective-C messages with the stack traces.

objc$target:ProcessInfo::entry 
{
     @result[ustack()] = count();
}
END
{
     trunc(@result, 10);
}
...
  Tasks Explorer`-[ProcessInfo pid]
  Tasks Explorer`-[TasksInfoManager updateInfo]+0x3e5
  Foundation`__NSFireTimer+0x72
  CoreFoundation`__CFRunLoopRun+0x1958
  CoreFoundation`CFRunLoopRunSpecific+0x23f
  HIToolbox`RunCurrentEventLoopInMode+0x14d
  HIToolbox`ReceiveNextEventCommon+0x136
  HIToolbox`BlockUntilNextEventMatchingListInMode+0x3b
  AppKit`_DPSNextEvent+0x2c4
  AppKit`-[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:]+0x9b
  AppKit`-[NSApplication run]+0x18b
  AppKit`NSApplicationMain+0x16c
  Tasks Explorer`start+0x34
  Tasks Explorer`0x2
13596

io.

The io provider creates probes related to disks input output operations. There are four types of probes available.

  • start - fires when an I/O request is made.
  • done - fires when an I/O request has been fulfilled.
  • wait-start - fires immediately before a thread begins to wait pending completion of a given I/O request.
  • wait-done - fires when a thread is done waiting for pending completion of a given I/O request.

These probes take 3 arguments: struct buf* as arg[0], devinfo_t* as arg[1] and fileinfo_t* as arg[2].

typedef struct bufinfo
{ 
     int b_flags;               /* flags */
     size_t b_bcount;          /* number of bytes */
     caddr_t b_addr;               /* buffer address */
     uint64_t b_blkno;          /* expanded block # on device */
     uint64_t b_lblkno;          /* block # on device */
     size_t b_resid;               /* # of bytes not transferred */
     size_t b_bufsize;          /* size of allocated buffer */
     caddr_t b_iodone;          /* I/O completion routine */
     dev_t b_edev;               /* extended device */
} bufinfo_t;
typedef struct devinfo 
{ 
     int dev_major;               /* major number */
     int dev_minor;                /* minor number */
     int dev_instance;           /* instance number */
     string dev_name;           /* name of device */
     string dev_statname;      /* name of device + instance/minor */
     string dev_pathname;     /* pathname of device */
} devinfo_t;
typedef struct fileinfo 
{ 
     string fi_name;               /* name (basename of fi_pathname) */
     string fi_dirname;           /* directory (dirname of fi_pathname) */
     string fi_pathname;      /* full pathname */
     offset_t fi_offset;      /* offset within file */
     string fi_fs;                /* filesystem */
     string fi_mount;          /* mount point of file system */
} fileinfo_t;

The following script provides information about applications I/O activity and prints operated file name, application name and PID and operation type (read o write).

#pragma D option quiet
BEGIN
{
     printf("%30s %20s %2s\n", "FILE NAME", "APP NAME(PID)", "RW");
}

io:::start
{
     printf("%30s%15s(%5d)%2s\n", args[2]->fi_name, execname, pid, args[0]->b_flags & B_READ ? "R" : "W");
}
                   FILE NAME        APP NAME(PID) RW
                      test.d       TextMate(  562) W
                .dat088c.005     quicklookd( 2188) W
        index.sqlite-journal     quicklookd( 2188) W
        index.sqlite-journal     quicklookd( 2188) W
                index.sqlite     quicklookd( 2188) W
                index.sqlite     quicklookd( 2188) W
             thumbnails.data     quicklookd( 2188) W
             thumbnails.data     quicklookd( 2188) W

pid.

The pid provider looks like objc provider and it provides similar functionality. This provider allows to collect user's calls information. For example, it is known that application has task_explorer_update_1_svc call and it is necessary to build call list. The script below implements that functional.

pid$target::task_explorer_update_1_svc:entry 
/guard == 0/
{
     self->spec = speculation(); 
     speculate(self->spec);
     guard = 1;
}
pid$target:a.out::
/self->spec/ 
{
     speculate(self->spec);
}
pid$target::task_explorer_update_1_svc:return
{
     commit(self->spec); 
     self->spec = 0;
}
	  1  -> task_explorer_update_1_svc            
	  1   | task_explorer_update_1_svc:1          
	  1   | task_explorer_update_1_svc:4          
	  1   | task_explorer_update_1_svc:6          
	  1    -> task_info_manager_update            
	  1     | task_info_manager_update:0          
	  1      -> KeysHashFunc                      
	  1       | KeysHashFunc:0                    
	  1      <- KeysHashFunc                      
	...

proc.

The proc provider creates probes which fire when processes or threads are created or terminated. The following scrip provides list of processes executed by application with given PID. As result, the list of processes executed by XCode during project building is shown.

proc:::exec-success
/ppid == $target/
{
     @[execname] = count();
}
touch                                                             1
bash                                                              2
sh                                                                2
gcc-4.2                                                          31
xcexec                                                           33

From the next script output it is obvious that the gcc execution time is between 1/4 to 1/2 second for 14 runs and between 1/2 to 1 second for 13 runs.

proc:::start
/ppid == $target/
{
     self->start = timestamp;
}

proc:::exit 
/self->start/ 
{
     @[execname] = quantize((timestamp - self->start) / 1000000); 
     self->start = 0;
}
gcc-4.2                                           
         value  ------------- Distribution ------------- count    
            32 |                                         0        
            64 |@@@                                      2        
           128 |@                                        1        
           256 |@@@@@@@@@@@@@@@@@@                       14       
           512 |@@@@@@@@@@@@@@@@@                        13       
          1024 |                                         0        
          2048 |@                                        1        
          4096 |                                         0        

Using libdtrace.

Warning: this part of article is based on source of dtrace tool for Mac OS X 10.6.2. All functions and interfaces provided by libdtrace are unstable and can be changed at any time without notice.

libdtrace provides next programming interface:

  • dtrace_open - dtrace framework initialization.
  • dtrace_program_strcompile, dtrace_program_fcompile - D script compilation functions.
  • dtrace_handle_* - registers user's function to be called whenever dtrace framework events occurred.
  • dtrace_program_exec - enables probes and instrumenting code.
  • dtrace_setopt - sets dtrace framework option.
  • dtrace_go - runs dtrace script.
  • dtrace_sleep - waits until new data are available.
  • dtrace_work - processes new data and returns results via call-back functions. If the second parameter FILE * is not NULL, call-back function registered with dtrace_handle_buffered will not be called and output will be written by the given descriptor.
  • dtrace_aggregate_print - calls call-back functions with existing aggregation results. Usually it is the last call in the DTrace consumer life cycle.

DTrace consumer life cycle:

Where chew, chewrec and dcmdbuffered are not libdtrace calls but user call-backs.

#include "dtrace.h"
#include "stdio.h"
#include "stdlib.h"
#include "mach/mach.h"
#include "mach-o/loader.h"
#include "mach-o/dyld.h"
#include "mach-o/fat.h"
#include "sys/sysctl.h"
#include "signal.h"

static dtrace_hdl_t *g_dtp;
static int g_intr;

static const char *g_prog =
//"tick-1s"
//"{"
//"    printf(\"Found %s on CPU %d\\n\",execname,cpu);"
//"}";
//"proc:::exec-success"
//"{"
//"	printf(\"%d\\n\", pid);"
//"}";
"syscall::write:entry"
"{"
"	@counts[\"write calls count\", execname] = count();"
"}";

static int dcmdbuffered(const dtrace_bufdata_t *bufdata, void *arg)
{
	printf("Print buffer:: %s", bufdata->dtbda_buffered);
	return(DTRACE_HANDLE_OK);
}

static void dprog_compile(void)
{
    int err;
    dtrace_prog_t *prog;
    dtrace_proginfo_t info;

    if ((g_dtp = dtrace_open(DTRACE_VERSION, DTRACE_O_ILP32, &err)) == NULL){
        printf("failed to initialize dtrace: %s\n",
			   dtrace_errmsg(NULL, err));
        exit(1);
    }

    if ((prog = dtrace_program_strcompile(g_dtp, g_prog, DTRACE_PROBESPEC_NAME, 0, 0, NULL)) == NULL){
        printf("failed to compile program");
        exit(1);
    }

	if (dtrace_handle_buffered(g_dtp, dcmdbuffered, NULL) == -1) {
		printf("Couldn't add buffered handler");
		exit(1);
	}

    if (dtrace_program_exec(g_dtp, prog, &info) == -1){
        printf("failed to enable probes");
        exit(1);
    }
}

static void dprog_setopts(void)
{
	(void) dtrace_setopt(g_dtp, "strsize", "32");
	(void) dtrace_setopt(g_dtp, "bufsize", "4m");
	(void) dtrace_setopt(g_dtp, "aggsize", "256k");
	(void) dtrace_setopt(g_dtp, "stacksymbols", "enabled");
	(void) dtrace_setopt(g_dtp, "arch", "x86_64");
	(void) dtrace_setopt(g_dtp, "aggsortrev", NULL);
	(void) dtrace_setopt(g_dtp, "aggrate", "1sec");
}

static int chew(const dtrace_probedata_t *data, void *arg)
{
	dtrace_probedesc_t *pd = data->dtpda_pdesc;

	processorid_t cpu = data->dtpda_cpu;
	char name[DTRACE_FUNCNAMELEN + DTRACE_NAMELEN + 2];

	(void) snprintf(name, sizeof (name), "%s:%s",
					pd->dtpd_func, pd->dtpd_name);

	printf("%3d %6d %32s ", cpu, pd->dtpd_id, name);

    return (DTRACE_CONSUME_THIS);
}

static int chewrec(const dtrace_probedata_t *data, const dtrace_recdesc_t *rec, void *arg)
{
	if (rec == NULL) {
		printf("\n");

		return (DTRACE_CONSUME_NEXT);
	}
    return (DTRACE_CONSUME_THIS);
}

static void intr(int signo)
{
	g_intr = 1;
}

int main(int argc, char **argv)
{
    int done=0;
	struct sigaction act;

    dprog_compile();
    dprog_setopts();

    if (dtrace_go(g_dtp) != 0){
        printf("Error in dtrace_go()\n");
        exit(1);
    }

	(void) sigemptyset(&act.sa_mask);
	act.sa_flags = 0;
	act.sa_handler = intr;
	(void) sigaction(SIGINT, &act, NULL);
	(void) sigaction(SIGTERM, &act, NULL);

    do {
		if (!g_intr && !done)
			dtrace_sleep(g_dtp);

        if (done || g_intr) {
			done = 1;
            if (dtrace_stop(g_dtp) == -1)
                printf("couldn't stop tracing");
        }

        switch (dtrace_work(g_dtp, stdout, chew, chewrec, NULL)) {
			case DTRACE_WORKSTATUS_DONE:
				done = 1;
				break;
			case DTRACE_WORKSTATUS_OKAY:
				break;
			default:
				printf("processing aborted");
        }
    } while (!done);

	dtrace_aggregate_print(g_dtp, stdout, NULL);

    dtrace_close(g_dtp);

    return(0);
}