drcachesim is a DynamoRIO client that collects memory access traces and feeds them to either an online or offline tool for analysis. The default analysis tool is a CPU cache simulator, while other provided tools compute metrics such as reuse distance. The trace collector and simulator support multiple processes each with multiple threads. The analysis tool framework is extensible, supporting the creation of new tools which can operate both online and offline.

Overview

drcachesim consists of two components: a tracer and an analyzer. The tracer collects a memory access trace from each thread within each application process. The analyzer consumes the traces (online or offline) and performs customized analysis. It is designed to be extensible, allowing users to easily implement a simulator for different devices, such as CPU caches, TLBs, page caches, etc. (see Extending the Simulator), or to build arbitrary trace analysis tools (see Creating New Analysis Tools). The default analyzer simulates the architectural behavior of caching devices for a target application (or multiple applications).

Running the Simulator

To launch drcachesim, use the -t flag to drrun:

$ bin64/drrun -t drcachesim -- /path/to/target/app <args> <for> <app>

The target application will be launched under a DynamoRIO tracer client that gathers all of its memory references and passes them to the simulator via a pipe. (See Offline Traces and Analysis for how to dump a trace for offline analysis.) Any child processes will be followed into and profiled, with their memory references passed to the simulator as well.

Here is an example:

$ bin64/drrun -t drcachesim -- ~/test/pi_estimator
Estimation of pi is 3.142425985001098
---- <application exited with code 0> ----
Cache simulation results:
Core #0 (1 thread(s))
  L1I stats:
    Hits:                          258,433
    Misses:                          1,148
    Miss rate:                        0.44%
  L1D stats:
    Hits:                           93,654
    Misses:                          2,624
    Prefetch hits:                     458
    Prefetch misses:                 2,166
    Miss rate:                        2.73%
Core #1 (1 thread(s))
  L1I stats:
    Hits:                            8,895
    Misses:                             99
    Miss rate:                        1.10%
  L1D stats:
    Hits:                            3,448
    Misses:                            156
    Prefetch hits:                      26
    Prefetch misses:                   130
    Miss rate:                        4.33%
Core #2 (1 thread(s))
  L1I stats:
    Hits:                            4,150
    Misses:                            101
    Miss rate:                        2.38%
  L1D stats:
    Hits:                            1,578
    Misses:                            130
    Prefetch hits:                      25
    Prefetch misses:                   105
    Miss rate:                        7.61%
Core #3 (0 thread(s))
LL stats:
    Hits:                            1,414
    Misses:                          2,844
    Prefetch hits:                     824
    Prefetch misses:                 1,577
    Local miss rate:                 66.79%
    Child hits:                    370,667
    Total miss rate:                  0.76%

Analysis Tool Suite

In addition to a CPU cache simulator, other analysis tools are available that operate on memory address traces. The -simulator_type parameter selects which tool is used.

To simulate TLB devices instead of caches, pass TLB to -simulator_type:

$ bin64/drrun -t drcachesim -simulator_type TLB -- ~/test/pi_estimator
Estimation of pi is 3.142425985001098
---- <application exited with code 0> ----
TLB simulation results:
Core #0 (1 thread(s))
  L1I stats:
    Hits:                          252,412
    Misses:                            401
    Miss rate:                        0.16%
  L1D stats:
    Hits:                           87,132
    Misses:                          9,127
    Miss rate:                        9.48%
  LL stats:
    Hits:                            9,315
    Misses:                            213
    Local miss rate:                  2.24%
    Child hits:                    339,544
    Total miss rate:                  0.06%
Core #1 (1 thread(s))
  L1I stats:
    Hits:                            8,709
    Misses:                             20
    Miss rate:                        0.23%
  L1D stats:
    Hits:                            3,544
    Misses:                             55
    Miss rate:                        1.53%
  LL stats:
    Hits:                               15
    Misses:                             60
    Local miss rate:                 80.00%
    Child hits:                     12,253
    Total miss rate:                  0.49%
Core #2 (1 thread(s))
  L1I stats:
    Hits:                            1,622
    Misses:                             21
    Miss rate:                        1.28%
  L1D stats:
    Hits:                              689
    Misses:                             35
    Miss rate:                        4.83%
  LL stats:
    Hits:                                3
    Misses:                             53
    Local miss rate:                 94.64%
    Child hits:                      2,311
    Total miss rate:                  2.24%
Core #3 (0 thread(s))

To compute reuse distance metrics:

$ bin64/drrun -t drcachesim -simulator_type reuse_distance -reuse_distance_histogram -- ~/test/pi_estimator
Estimation of pi is 3.142425985001098
---- <application exited with code 0> ----
Reuse distance tool aggregated results:
Total accesses: 349632
Unique accesses: 196603
Unique cache lines accessed: 4235
Reuse distance mean: 14.64
Reuse distance median: 1
Reuse distance standard deviation: 104.10
Reuse distance histogram:
Distance       Count  Percent  Cumulative
       0      153029   44.36%   44.36%
       1      101294   29.37%   73.73%
       2       14116    4.09%   77.82%
       3       14248    4.13%   81.95%
       4        8894    2.58%   84.53%
       5        2733    0.79%   85.32%
...
==================================================
Reuse distance tool results for shard 29327 (thread 29327):
Total accesses: 335084
Unique accesses: 187927
Unique cache lines accessed: 4148
Reuse distance mean: 14.77
Reuse distance median: 1
Reuse distance standard deviation: 106.02
Reuse distance histogram:
Distance       Count  Percent  Cumulative
       0      147157   44.47%   44.47%
       1       96820   29.26%   73.72%
       2       13613    4.11%   77.84%
       3       13834    4.18%   82.02%
       4        8666    2.62%   84.64%
       5        2552    0.77%   85.41%
...
    3658          29    0.01%  100.00%
    3851           1    0.00%  100.00%
Reuse distance threshold = 100 cache lines
Top 10 frequently referenced cache lines
        cache line:     #references   #distant refs
    0x7f2a86b3fd80:        27980,            0
    0x7f2a86b3fdc0:        18823,            0
    0x7f2a88388fc0:        16409,          111
    0x7f2a8838abc0:        15176,            6
    0x7f2a883884c0:         9930,           20
    0x7f2a88388480:         7944,           20
    0x7f2a88388500:         7574,           20
    0x7f2a88398d00:         7390,          100
    0x7f2a86b3fd40:         6668,            0
    0x7f2a88388440:         5717,           20
Top 10 distant repeatedly referenced cache lines
        cache line:     #references   #distant refs
    0x7f2a885a4180:          246,          132
    0x7f2a87504ec0:          202,          128
    0x7f2a875044c0:          323,          126
    0x7f2a885a4480:          220,          126
    0x7f2a87504f00:          293,          124
    0x7f2a86fd7e00:          289,          124
    0x7f2a875049c0:          221,          124
    0x7f2a875053c0:          270,          122
    0x7f2a86db9c00:          269,          122
    0x7f2a875047c0:          201,          122
==================================================
Reuse distance tool results for shard 29328 (thread 29328):
Total accesses: 12216
Unique accesses: 7251
Unique cache lines accessed: 319
Reuse distance mean: 12.98
Reuse distance median: 1
Reuse distance standard deviation: 38.19
Reuse distance histogram:
Distance       Count  Percent  Cumulative
       0        4965   41.73%   41.73%
       1        3758   31.59%   73.32%
       2         411    3.45%   76.78%
       3         348    2.93%   79.70%
       4         179    1.50%   81.21%
       5         152    1.28%   82.48%
...

A reuse time tool is also provided, which counts the total number of memory accesses (without considering uniqueness) between accesses to the same address:

$ bin64/drrun -t drcachesim -simulator_type reuse_time -- ~/test/pi_estimator
Estimation of pi is 3.142425985001098
---- <application exited with code 0> ----
Reuse time tool aggregated results:
Total accesses: 88281
Total instructions: 261315
Mean reuse time: 433.47
Reuse time histogram:
Distance       Count  Percent  Cumulative
       1       27893   32.84%      32.84%
       2       10948   12.89%      45.73%
       3        5789    6.82%      52.54%
...
==================================================
Reuse time tool results for shard 29482 (thread 29482):
Total accesses: 84194
Total instructions: 250854
Mean reuse time: 450.01
Reuse time histogram:
Distance       Count  Percent  Cumulative
       1       26677   32.86%      32.86%
       2       10508   12.95%      45.81%
       3        5427    6.69%      52.50%
...
==================================================
Reuse time tool results for shard 29483 (thread 29483):
Total accesses: 3411
Total instructions: 8805
Mean reuse time: 86.36
Reuse time histogram:
Distance       Count  Percent  Cumulative
       1        1014   31.56%      31.56%
       2         363   11.30%      42.86%
       3         308    9.59%      52.44%
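
The difference between reuse distance and reuse time can be illustrated on a tiny trace. The following is a minimal Python sketch, not the tools' implementation. It assumes the conventions suggested by the histograms above: a reuse distance of 0 means no other address was touched in between, and a reuse time of 1 means the very next access hit the same address. The function name and the abstract address labels are hypothetical.

```python
def reuse_metrics(trace):
    """Return (reuse_times, reuse_distances) for every repeated access."""
    last_seen = {}  # address -> index of its previous access
    times, distances = [], []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            prev = last_seen[addr]
            # Reuse time: all accesses since the previous one, repeats included.
            times.append(i - prev)
            # Reuse distance: distinct addresses touched strictly in between.
            distances.append(len(set(trace[prev + 1:i])))
        last_seen[addr] = i
    return times, distances


# "B" is reused immediately; "A" is reused after B, B, C.
times, distances = reuse_metrics(["A", "B", "B", "C", "A"])
print(times)      # [1, 4]
print(distances)  # [0, 2]
```

The quadratic set construction keeps the sketch short; a production tool would use a more efficient structure.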

To simply see the counts of instructions and memory references broken down by thread, use the basic counts tool:

$ bin64/drrun -t drcachesim -simulator_type basic_counts -- ~/test/pi_estimator
Estimation of pi is 3.142425985001098
---- <application exited with code 0> ----
Basic counts tool results:
Total counts:
      267193 total (fetched) instructions
         345 total non-fetched instructions
           0 total prefetches
       67686 total data loads
       22503 total data stores
           3 total threads
         280 total scheduling markers
           0 total transfer markers
           0 total other markers
Thread 247451 counts:
      255009 (fetched) instructions
         345 non-fetched instructions
           0 prefetches
       64453 data loads
       21243 data stores
         258 scheduling markers
           0 transfer markers
           0 other markers
Thread 247453 counts:
        9195 (fetched) instructions
           0 non-fetched instructions
           0 prefetches
        2444 data loads
         937 data stores
          12 scheduling markers
           0 transfer markers
           0 other markers
Thread 247454 counts:
        2989 (fetched) instructions
           0 non-fetched instructions
           0 prefetches
         789 data loads
         323 data stores
          10 scheduling markers
           0 transfer markers
           0 other markers

The non-fetched instructions are x86 string loop instructions, where subsequent iterations do not incur a fetch. They are included in the trace as a different type of trace entry to support core simulators in addition to cache simulators.

The opcode_mix tool uses the non-fetched instruction information along with the preserved libraries and binaries from the traced execution to gather more information on each executed instruction than was stored in the trace. It only supports offline traces, and the modules.log file created during post-processing of the trace must be preserved. The results are broken down by the opcodes used in DR's IR, where mov is split into a separate opcode for load and store but both have the same public string "mov":

$ bin64/drrun -t drcachesim -offline -- ~/test/pi_estimator
Estimation of pi is 3.142425985001098
$ bin64/drrun -t drcachesim -simulator_type opcode_mix -indir drmemtrace.*.dir
Opcode mix tool results:
         267271 : total executed instructions
          36432 :       mov
          31075 :       mov
          24715 :       add
          22579 :      test
          22539 :       cmp
          12137 :       lea
          11136 :       jnz
          10568 :     movzx
          10243 :        jz
           9056 :       and
           8064 :       jnz
           7279 :        jz
           5659 :      push
           4528 :       sub
           4357 :       pop
           4001 :       shr
           3427 :      jnbe
           2634 :       mov
           2469 :       shl
           2344 :        jb
           2291 :       ret
           2178 :       xor
           2164 :      call
           2111 :   pcmpeqb
           1472 :    movdqa
...

The view tool prints disassembled instructions in AT&T, Intel, ARM, or DR format for offline traces. The -skip_refs and -sim_refs flags set the start and end points of the disassembled view. Note that for this tool these flags count the instructions skipped or displayed, not trace entries.

$ bin64/drrun -t drcachesim -simulator_type view -sim_refs 20 -indir drmemtrace.*.dir
  0x00007f3a5127d870  48 83 ec 48          sub    $0x48, %rsp
  0x00007f3a5127d874  0f 31                rdtsc
  0x00007f3a5127d876  48 c1 e2 20          shl    $0x20, %rdx
  0x00007f3a5127d87a  89 c0                mov    %eax, %eax
  0x00007f3a5127d87c  48 09 c2             or     %rax, %rdx
  0x00007f3a5127d87f  48 8b 05 ea 25 22 00 mov    <rel> 0x00007f3a5149fe70, %rax
  0x00007f3a5127d886  48 89 15 d3 23 22 00 mov    %rdx, <rel> 0x00007f3a5149fc60
  0x00007f3a5127d88d  48 8d 15 dc 25 22 00 lea    <rel> 0x00007f3a5149fe70, %rdx
  0x00007f3a5127d894  49 89 d6             mov    %rdx, %r14
  0x00007f3a5127d897  4c 2b 35 62 27 22 00 sub    <rel> 0x00007f3a514a0000, %r14
  0x00007f3a5127d89e  48 85 c0             test   %rax, %rax
  0x00007f3a5127d8a1  48 89 15 40 31 22 00 mov    %rdx, <rel> 0x00007f3a514a09e8
  0x00007f3a5127d8a8  4c 89 35 29 31 22 00 mov    %r14, <rel> 0x00007f3a514a09d8
  0x00007f3a5127d8af  0f 84 9b 00 00 00    jz     $0x00007f3a5127d950
  0x00007f3a5127d8b5  4c 8d 05 84 27 22 00 lea    <rel> 0x00007f3a514a0040, %r8
  0x00007f3a5127d8bc  49 b9 d8 03 00 80 03 mov    $0x00000003800003d8, %r9
                      00 00 00
  0x00007f3a5127d8c6  48 b9 78 fb ff 7f 03 mov    $0x000000037ffffb78, %rcx
                      00 00 00
  0x00007f3a5127d8d0  48 8d 35 41 31 22 00 lea    <rel> 0x00007f3a514a0a18, %rsi
  0x00007f3a5127d8d7  bf ff ff ff 6f       mov    $0x6fffffff, %edi
  0x00007f3a5127d8dc  41 bb ff fd ff 6f    mov    $0x6ffffdff, %r11d
View tool results:
             20 : total disassembled instructions

The top referenced cache lines are displayed by the histogram tool:

$ bin64/drrun -t drcachesim -simulator_type histogram -- ~/test/pi_estimator
Estimation of pi is 3.142425985001098
---- <application exited with code 0> ----
Cache line histogram tool results:
icache: 1134 unique cache lines
dcache: 3062 unique cache lines
icache top 10
    0x7facdd013780: 30929
    0x7facdb789fc0: 27664
    0x7facdb78a000: 18629
    0x7facdd003e80: 18176
    0x7facdd003500: 11121
    0x7facdd0034c0: 9763
    0x7facdd005940: 8865
    0x7facdd003480: 8277
    0x7facdb789f80: 6660
    0x7facdd003540: 5888
dcache top 10
    0x7ffcc35e7d80: 4088
    0x7ffcc35e7d40: 3497
    0x7ffcc35e7e00: 3478
    0x7ffcc35e7f40: 2919
    0x7ffcc35e7dc0: 2837
    0x7facdbe2e980: 2452
    0x7facdbe2ec80: 2273
    0x7ffcc35e7e80: 2194
    0x7facdb6625c0: 2016
    0x7ffcc35e7e40: 1997
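
The idea behind such a histogram can be sketched in a few lines of Python: round each address down to its cache line and count references per line. This is an illustrative sketch only. The function name, the 64-byte default line size, and the sample addresses are assumptions, and the real tool additionally keeps separate instruction and data histograms.

```python
from collections import Counter


def top_cache_lines(addresses, line_size=64, top_n=10):
    """Count references per cache line and return the most referenced lines."""
    counts = Counter(addr - addr % line_size for addr in addresses)
    return counts.most_common(top_n)


# Four references fall in line 0x1000, one each in 0x1040 and 0x2000.
refs = [0x1000, 0x1008, 0x103F, 0x1040, 0x2000, 0x1010]
top = top_cache_lines(refs, top_n=2)
for line, count in top:
    print(f"{line:#x}: {count}")
```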

Configuration File

drcachesim supports reconfigurable cache hierarchies defined in a configuration file. The configuration file is a text file with the following formatting rules.

A comment starts with two slashes followed by one or more spaces. Anything after the '// ' until the end of the line is considered a comment and ignored.
A parameter's name and its value are listed consecutively with white space (spaces, tabs, or a new line) between them.
Parameters must be separated by white space. Including one parameter per line helps keep the configuration file more human-readable.
A cache's parameters must be enclosed inside braces and preceded by the cache's user-chosen unique name.
Parameters can be listed in any order.
Parameters not included in the configuration file take their default values.
String values must not be enclosed in quotations.

Supported common parameters and their value types (each of these parameters sets the corresponding option with the same name described in Simulator Parameters):

num_cores <unsigned int>
line_size <unsigned int>
skip_refs <unsigned int>
warmup_refs <unsigned int>
warmup_fraction <float in [0,1]>
sim_refs <unsigned int>
cpu_scheduling <bool>
verbose <unsigned int>

Supported cache parameters and their value types:

type <string, one of "instruction", "data", or "unified">
core <unsigned int in [0, num_cores)>
size <unsigned int, power of 2>
assoc <unsigned int, power of 2>
inclusive <bool>
parent <string>
replace_policy <string, one of "LRU", "LFU", or "FIFO">
prefetcher <string, one of "nextline" or "none">
miss_file <string>

Example:

// Configuration for a single-core CPU.
// Common params.
num_cores       1
line_size       64
cpu_scheduling  true
sim_refs        8888888
warmup_fraction 0.8
// Cache params.
P0L1I {                        // P0 L1 instruction cache
  type            instruction
  core            0
  size            65536        // 64K
  assoc           8
  parent          P0L2
  replace_policy  LRU
}
P0L1D {                        // P0 L1 data cache
  type            data
  core            0
  size            65536        // 64K
  assoc           8
  parent          P0L2
  replace_policy  LRU
}
P0L2 {                         // P0 L2 unified cache
  size            512K
  assoc           16
  inclusive       true
  parent          LLC
  replace_policy  LRU
}
LLC {                          // LLC
  size            1M
  assoc           16
  inclusive       true
  parent          mem
  replace_policy  LRU
  miss_file       misses.txt
}
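
The formatting rules above are simple enough to parse with a short tokenizer. The following Python sketch is one way to read such a file under the stated rules; it is not the parser used by drcachesim. It assumes braces are separated from names by white space, as in the example, and leaves all values as strings. The function name is hypothetical.

```python
import re


def parse_config(text):
    """Return (common_params, caches) parsed from config-file text."""
    tokens = []
    for line in text.splitlines():
        # A comment is "//" followed by a space, running to the end of the line.
        tokens.extend(re.sub(r"// .*", "", line).split())
    common, caches = {}, {}
    i = 0
    while i < len(tokens):
        name = tokens[i]
        if i + 1 >= len(tokens):
            raise ValueError(f"parameter {name!r} has no value")
        if tokens[i + 1] == "{":
            # A cache: user-chosen name, then name/value pairs inside braces.
            end = tokens.index("}", i + 2)
            body = tokens[i + 2:end]
            caches[name] = dict(zip(body[0::2], body[1::2]))
            i = end + 1
        else:
            common[name] = tokens[i + 1]
            i += 2
    return common, caches


EXAMPLE = """
// Common params.
num_cores       1
line_size       64
P0L1I {                        // P0 L1 instruction cache
  type            instruction
  core            0
  size            65536        // 64K
  parent          P0L2
}
"""
common, caches = parse_config(EXAMPLE)
print(common)
print(caches["P0L1I"])
```

Parameters absent from the file are simply missing from the returned dictionaries, so a caller would fill in defaults afterwards.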

Offline Traces and Analysis

To dump a trace for future offline analysis, use the -offline parameter:

$ bin64/drrun -t drcachesim -offline -- /path/to/target/app <args> <for> <app>

The collected traces will be dumped into a newly created directory, which can be passed to drcachesim for offline cache simulation with the -indir option:

$ bin64/drrun -t drcachesim -indir drmemtrace.app.pid.xxxx.dir/

The direct results of the -offline run are raw, compacted files, stored in a raw/ subdirectory of the drmemtrace.app.pid.xxxx.dir directory. The -indir option both converts the data to a canonical trace form and passes the resulting data to the cache simulator. The canonical trace data is stored by -indir in a trace/ subdirectory inside the drmemtrace.app.pid.xxxx.dir/ directory. For both the raw and canonical data, a separate file per application thread is used. If the canonical data already exists, future runs will use that data rather than re-converting it. Either the top-level directory or the trace/ subdirectory may be pointed at with -indir:

$ bin64/drrun -t drcachesim -indir drmemtrace.app.pid.xxxx.dir/trace

The canonical trace files may be manually compressed with gzip, as the trace reader supports reading gzipped files.

Older versions of the simulator produced a single trace file containing all threads interleaved. The -infile option supports reading these legacy files:

$ gzip drmemtrace.app.pid.xxxx.dir/drmemtrace.trace

$ bin64/drrun -t drcachesim -infile drmemtrace.app.pid.xxxx.dir/drmemtrace.trace.gz

The same analysis tools used online are available for offline: the trace format is identical.

Tracing a Subset of Execution

While the cache simulator supports skipping references, for large applications the overhead of the tracing itself is too high to conveniently trace the entire execution. There are several methods of tracing only during a desired window of execution.

The -trace_after_instrs option delays tracing by the specified number of dynamic instruction executions. This can be used to skip initialization and arrive at the desired starting point. The trace's length can also be limited by the -exit_after_tracing option.

If the application can be modified, it can be linked with the drcachesim tracer and use DynamoRIO's start/stop API routines dr_app_setup_and_start() and dr_app_stop_and_cleanup() to delimit the desired trace region. As an example, see our burst_static test application.

Simulator Details

The simulator can be extended to model a variety of caching devices. Currently, CPU caches and TLBs are implemented. The type of device to simulate is specified with the -simulator_type parameter (see Simulator Parameters).

The CPU cache simulator models a configurable number of cores, each with an L1 data cache and an L1 instruction cache. Currently there is a single shared L2 unified cache, but we would like to extend support to arbitrary cache hierarchies (see Current Limitations). The cache line size and each cache's total size and associativity are user-specified (see Simulator Parameters).
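
To make the roles of total size, associativity, and line size concrete, here is a toy set-associative cache with LRU replacement in Python. It is a sketch for illustration, not drcachesim's cache model: it tracks tags only, models a single level, and the class and method names are hypothetical.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy set-associative cache with LRU replacement (tags only, no data)."""

    def __init__(self, size, assoc, line_size=64):
        self.line_size = line_size
        self.assoc = assoc
        self.num_sets = size // (assoc * line_size)
        # One OrderedDict per set; its order tracks recency of use.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Look up addr; return True on a hit, False on a miss."""
        line = addr // self.line_size
        ways = self.sets[line % self.num_sets]
        if line in ways:
            ways.move_to_end(line)  # mark as most recently used
            self.hits += 1
            return True
        if len(ways) >= self.assoc:
            ways.popitem(last=False)  # evict the least recently used line
        ways[line] = True
        self.misses += 1
        return False


# 256 bytes, 2-way, 64-byte lines: 2 sets. Addresses 0, 128, 256 share set 0.
cache = SetAssociativeCache(size=256, assoc=2, line_size=64)
results = [cache.access(a) for a in (0, 0, 128, 256, 0)]
print(results)  # [False, True, False, False, False]
```

The final access to address 0 misses because the access to 256 evicted it as the least recently used line of its set.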

The TLB simulator models a configurable number of cores, each with an L1 instruction TLB, an L1 data TLB, and an L2 unified TLB. Each TLB's entry number and associativity, and the virtual/physical page size, are user-specified (see Simulator Parameters).

Neither simulator has a simple way to know which core any particular thread executed on for each of its instructions. The tracer records which core a thread is on each time it writes out a full trace buffer, giving an approximation of the actual scheduling (at the granularity of the trace buffer size). By default, these cache and TLB simulators ignore that information and schedule threads to simulated cores in a static round-robin fashion with load balancing to fill in gaps with new threads after threads exit. The option "-cpu_scheduling" (see Simulator Parameters) can be used to instead map each physical cpu to a simulated core and use the recorded cpu that each segment of thread execution occurred on to schedule execution in a manner that more closely resembles the traced execution on the physical machine. Below is an example of the output using this option running an application with many threads on a physical machine with 8 cpus. The 8 cpus are mapped to the 4 simulated cores:

$ bin64/drrun -t drcachesim -cpu_scheduling -- ~/test/pi_estimator 20
Estimation of pi is 3.141592653798125
<Stopping application /home/bruening/dr/test/threadsig (213517)>
---- <application exited with code 0> ----
Cache simulation results:
Core #0 (2 traced CPU(s): #2, #5)
  L1I stats:
    Hits:                        2,756,429
    Misses:                          1,190
    Miss rate:                        0.04%
  L1D stats:
    Hits:                        1,747,822
    Misses:                         13,511
    Prefetch hits:                   2,354
    Prefetch misses:                11,157
    Miss rate:                        0.77%
Core #1 (2 traced CPU(s): #4, #0)
  L1I stats:
    Hits:                          472,948
    Misses:                            299
    Miss rate:                        0.06%
  L1D stats:
    Hits:                          895,099
    Misses:                          1,224
    Prefetch hits:                     253
    Prefetch misses:                   971
    Miss rate:                        0.14%
Core #2 (2 traced CPU(s): #1, #7)
  L1I stats:
    Hits:                          448,581
    Misses:                            649
    Miss rate:                        0.14%
  L1D stats:
    Hits:                          811,483
    Misses:                          1,723
    Prefetch hits:                     378
    Prefetch misses:                 1,345
    Miss rate:                        0.21%
Core #3 (2 traced CPU(s): #6, #3)
  L1I stats:
    Hits:                          275,192
    Misses:                            154
    Miss rate:                        0.06%
  L1D stats:
    Hits:                          522,655
    Misses:                            850
    Prefetch hits:                     173
    Prefetch misses:                   677
    Miss rate:                        0.16%
LL stats:
    Hits:                           12,491
    Misses:                          7,109
    Prefetch hits:                   8,922
    Prefetch misses:                 5,228
    Local miss rate:                 36.27%
    Child hits:                  7,933,367
    Total miss rate:                  0.09%
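
One plausible way to fold more physical cpus into fewer simulated cores is to assign each newly seen cpu to the least loaded core. The Python sketch below illustrates that idea only; the actual mapping policy of the simulator is not described here and may differ. The function name is hypothetical, and the order in which cpus first appear is invented so that the result matches the grouping in the sample output.

```python
def map_cpus_to_cores(cpus_in_order_seen, num_cores):
    """Assign each newly seen physical cpu to the least loaded simulated core."""
    core_of = {}
    load = [0] * num_cores  # number of cpus mapped to each core
    for cpu in cpus_in_order_seen:
        if cpu not in core_of:
            core = load.index(min(load))  # lowest-numbered least loaded core
            core_of[cpu] = core
            load[core] += 1
    return core_of


# Hypothetical first-seen order of the 8 traced cpus (cpu 2 appears twice).
mapping = map_cpus_to_cores([2, 4, 1, 6, 5, 0, 7, 3, 2], num_cores=4)
for core in range(4):
    print(core, [cpu for cpu, c in mapping.items() if c == core])
```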

The memory access traces contain some optimizations that combine references for one basic block together. This may result in not considering some thread interleavings that could occur natively. There are no other disruptions to thread ordering, however, and the application runs with all of its threads concurrently just like it would natively (although slower).

Once every process has exited, the simulator prints cache miss statistics for each cache to stderr. The simulator is designed to be extensible, allowing for different cache studies to be carried out: see Extending the Simulator.

For an L2 caching device, the L1 caching devices are considered its children. Two separate miss rates are computed: the "Local miss rate" considers only the requests that reach L2, while the "Total miss rate" also includes the child hits.
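
The two rates can be checked against the LL stats of the first sample run in Running the Simulator (1,414 hits, 2,844 misses, 370,667 child hits). The formulas in this Python sketch are inferred from that sample output rather than taken from the simulator source, and the function name is hypothetical.

```python
def miss_rates(hits, misses, child_hits=0):
    """Return (local, total) miss rates as fractions."""
    local = misses / (hits + misses)               # requests reaching this level
    total = misses / (hits + misses + child_hits)  # plus hits in the children
    return local, total


local, total = miss_rates(hits=1414, misses=2844, child_hits=370667)
print(f"Local miss rate: {local:.2%}")  # 66.79%
print(f"Total miss rate: {total:.2%}")  # 0.76%
```

Prefetch hits and misses do not enter either formula, which is consistent with their being reported separately.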

For memory requests that cross blocks, each block touched is considered separately, resulting in separate hit and miss statistics. This can be changed by implementing a custom statistics gatherer (see Extending the Simulator).
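
As an illustration of how a single request can touch more than one block, the following Python sketch lists the cache lines covered by an access. The function name and the 64-byte default line size are assumptions made for the example.

```python
def lines_touched(addr, size, line_size=64):
    """Return the base address of every cache line an access overlaps."""
    first = addr // line_size
    last = (addr + size - 1) // line_size
    return [line * line_size for line in range(first, last + 1)]


print(lines_touched(60, 8))  # [0, 64]: an 8-byte access straddling two lines
print(lines_touched(64, 8))  # [64]: an aligned access stays within one line
```

Under the behavior described above, the first access would be counted as two separate lookups, each with its own hit or miss.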

Software and hardware prefetches are combined in the prefetch hit and miss statistics, which are reported separately from regular loads and stores. To isolate software prefetch statistics, disable the hardware prefetcher by running with "-data_prefetcher none" (see Simulator Parameters). While misses from software prefetches are included in cache miss files, misses from hardware prefetches are not.

Cache Miss Analyzer

The cache simulator can be used to analyze the stream of last-level cache (LLC) miss addresses. This can be useful when looking for patterns that can be utilized in software prefetching. The current analyzer can only identify simple stride patterns, but it can be extended to search for more complex patterns. To invoke the miss analyzer, pass miss_analyzer to the -simulator_type parameter. To write the prefetching hints to a file use the -LL_miss_file parameter to specify the file's path and name.

For example, to run the analyzer on a benchmark called "my_benchmark" and store the prefetching recommendations in a file called "rec.csv", run the following:

$ bin64/drrun -t drcachesim -simulator_type miss_analyzer -LL_miss_file rec.csv -- my_benchmark
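
To give a sense of what a simple stride pattern looks like, the Python sketch below finds the dominant stride in a stream of miss addresses. It is an illustration under simplifying assumptions, not the analyzer's algorithm: it treats all misses as one stream, and the function name and the 50% threshold are hypothetical.

```python
from collections import Counter


def dominant_stride(miss_addresses, min_fraction=0.5):
    """Return the most common nonzero stride if it covers enough of the stream."""
    strides = [b - a for a, b in zip(miss_addresses, miss_addresses[1:])]
    strides = [s for s in strides if s != 0]
    if not strides:
        return None
    stride, count = Counter(strides).most_common(1)[0]
    return stride if count / len(strides) >= min_fraction else None


# Eight misses 256 bytes apart, followed by one unrelated miss.
misses = [0x1000 + 256 * i for i in range(8)] + [0x9000]
print(dominant_stride(misses))        # 256
print(dominant_stride([1, 5, 2, 9]))  # None: no stride repeats
```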

Physical Addresses

The memory access tracing client gathers virtual addresses. On Linux, if the kernel allows user-mode applications access to the /proc/self/pagemap file, physical addresses may be used instead. This can be requested via the -use_physical runtime option (see Simulator Parameters). This works on current kernels but is expected to stop working from user mode on future kernels due to recent security changes (see http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=ab676b7d6fbf4b294bf198fb27ade5b0e865c7ce).
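
For readers unfamiliar with the pagemap interface, the following Python sketch shows how a virtual address can be translated using it. It relies on the documented pagemap layout (one 64-bit entry per virtual page, the page frame number in bits 0-54, and a "present" flag in bit 63). It is not the tracer's code, and the function names are hypothetical. On kernels that restrict the interface, an unprivileged reader sees a page frame number of zero, which the sketch reports as None.

```python
import os
import struct

PFN_MASK = (1 << 55) - 1  # bits 0-54 hold the page frame number
PRESENT_BIT = 1 << 63     # bit 63: page is present in RAM


def decode_pagemap_entry(entry, vaddr, page_size):
    """Translate vaddr using one 64-bit pagemap entry, or return None."""
    pfn = entry & PFN_MASK
    if not (entry & PRESENT_BIT) or pfn == 0:
        # Not resident, or the kernel hid the PFN from an unprivileged reader.
        return None
    return pfn * page_size + vaddr % page_size


def virt_to_phys(vaddr, pagemap_path="/proc/self/pagemap"):
    """Look up the physical address for vaddr in this process (Linux only)."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    with open(pagemap_path, "rb") as f:
        f.seek((vaddr // page_size) * 8)  # one 8-byte entry per virtual page
        raw = f.read(8)
    if len(raw) != 8:
        return None
    (entry,) = struct.unpack("=Q", raw)
    return decode_pagemap_entry(entry, vaddr, page_size)


# Decode a made-up entry: present, page frame number 0x1234.
phys = decode_pagemap_entry(PRESENT_BIT | 0x1234, 0x7F0000000ABC, 4096)
print(hex(phys))  # 0x1234abc
```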

Core Simulation Support

The drcachesim trace format includes information intended for use by core simulators as well as pure cache simulators. For traces that are not filtered by an online first-level cache, each data reference is preceded by the instruction fetch entry for the instruction that issued the data request. Additionally, on x86, string loop instructions involve a single instruction fetch followed by a loop of loads and/or stores. A drcachesim trace includes a special "no-fetch" instruction entry per iteration so that core simulators have the instruction information to go along with each load and store, while cache simulators can ignore these "no-fetch" entries and avoid incorrectly inflating instruction fetch statistics.
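
The different treatment of "no-fetch" entries by the two kinds of simulator can be sketched as follows. The record kinds and tuple layout in this Python example are hypothetical stand-ins for real trace entries, and the sample trace assumes that the first iteration of a string loop carries a normal fetch entry while later iterations carry "no-fetch" entries.

```python
from collections import Counter

# Hypothetical record kinds for this sketch; real traces use memref_t entries.
FETCH, NO_FETCH, LOAD, STORE = "instr", "instr_no_fetch", "load", "store"


def count_entries(trace):
    """Summarize a trace as a cache simulator and a core simulator would."""
    kinds = Counter(kind for kind, _addr in trace)
    return {
        # A cache simulator skips no-fetch entries for the instruction cache.
        "icache_accesses": kinds[FETCH],
        # A core simulator sees one instruction per iteration.
        "core_instructions": kinds[FETCH] + kinds[NO_FETCH],
        "data_refs": kinds[LOAD] + kinds[STORE],
    }


# A string-copy instruction at 0x400 that runs for three iterations.
trace = [
    (FETCH, 0x400), (LOAD, 0x1000), (STORE, 0x2000),
    (NO_FETCH, 0x400), (LOAD, 0x1008), (STORE, 0x2008),
    (NO_FETCH, 0x400), (LOAD, 0x1010), (STORE, 0x2010),
]
summary = count_entries(trace)
print(summary)
```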

Offline traces guarantee that a branch target instruction entry in a trace must immediately follow the branch instruction with no intervening thread switch. This allows a core simulator to identify the target of a branch by looking at the subsequent trace entry.

Traces include scheduling markers providing the timestamp and hardware thread identifier on each thread transition, allowing a simulator to more closely match the actual hardware if so desired.

Traces also include markers indicating disruptions in user mode control flow such as signal handler entry and exit.

A final feature that aids core simulators is the pair of interfaces module_mapper_t::get_loaded_modules() and module_mapper_t::find_mapped_trace_address(), which facilitate reading the raw bytes for each instruction in order to obtain the opcode and full operand information.

Extending the Simulator

The drcachesim tool was designed to be extensible, allowing users to easily model different caching devices, implement different models, and gather custom statistics.

To model different caching devices, subclass the simulator_t, caching_device_t, caching_device_block_t, and caching_device_stats_t classes.

To implement a different cache model, subclass the cache_t class and override the request(), access_update(), and/or replace_which_way() method(s).

Statistics gathering is separated out into the caching_device_stats_t class. To implement custom statistics, subclass caching_device_stats_t and override the access(), child_access(), flush(), and/or print_stats() methods.

Customizing the Tracer

The tracer supports customization for special-purpose I/O via drmemtrace_replace_file_ops(), allowing traces to be written to locations not supported by simple UNIX file operations. One option for using this function is to create a new client that links with the provided drmemtrace_static library, defines its own dr_client_main() that calls drmemtrace_client_main(), and includes the drmemtrace/drmemtrace.h header via:

use_DynamoRIO_drmemtrace_tracer(mytool)

The tracer also supports storing custom data with each module (i.e., library or executable) such as a build identifier via drmemtrace_custom_module_data(). The custom data may be retrieved by creating a custom offline trace post-processor and using the module_mapper_t class.

Creating New Analysis Tools

drcachesim provides a drmemtrace analysis tool framework to make it easy to create new trace analysis tools. A new tool should subclass analysis_tool_t.

Concurrent processing of traces is supported by logically splitting a trace into "shards" which are each processed sequentially. The default shard is a traced application thread, but the tool interface can support other divisions. For tools that support concurrent processing of shards and do not need to see a single time-sorted interleaved merged trace, the interface functions with the parallel_ prefix should be overridden, and parallel_shard_supported() should return true. parallel_shard_init() will be invoked for each shard prior to invoking parallel_shard_memref() for each entry in that shard; the data structure returned from parallel_shard_init() will be passed to parallel_shard_memref() for each trace entry for that shard. The concurrency model used guarantees that all entries from any one shard are processed by the same single worker thread, so no synchronization is needed inside the parallel_ functions. A single worker thread invokes print_results() as well.

For serial operation, process_memref() operates on a trace entry in a single, sorted, interleaved stream of trace entries. In the default mode of operation, the analyzer_t class iterates over the trace and calls the process_memref() function of each tool. An alternative mode is supported which exposes the iterator and allows a separate control infrastructure to be built. This alternative mode does not support parallel operation at this time.

Both parallel and serial operation can be supported by a tool, typically by having process_memref() create shard data when it first sees a traced thread and then invoke parallel_shard_memref() to do its work.

For both parallel and serial operation, the function print_results() should be overridden. It is called just once after processing all trace data and it should present the results of the analysis. For parallel operation, any desired aggregation across the whole trace should occur here as well, while shard-specific results can be presented in parallel_shard_exit().
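As a rough illustration of the structure described above, the following self-contained sketch counts data references using the function names of the tool interface. It does not include the drmemtrace headers: memref_stub_t, shard_counts_t, and data_ref_counter_t are hypothetical stand-ins, and the signatures are simplified assumptions rather than the actual analysis_tool_t declarations.

```cpp
#include <cstdint>
#include <iostream>
#include <memory>
#include <mutex>
#include <unordered_map>

// Hypothetical stand-in for memref_t, reduced to what this sketch needs.
struct memref_stub_t {
    int64_t tid;  // Traced thread that produced this entry.
    bool is_data; // True for a data reference, false for anything else.
};

// State owned by one shard. Only one worker thread ever touches it.
struct shard_counts_t {
    uint64_t data_refs = 0;
};

// Hypothetical tool; a real one would subclass analysis_tool_t.
class data_ref_counter_t {
public:
    bool parallel_shard_supported() { return true; }

    // Invoked once per shard. The returned pointer is handed back on every
    // later call for that shard.
    void *parallel_shard_init(int shard_index)
    {
        auto counts = std::make_unique<shard_counts_t>();
        void *shard_data = counts.get();
        // The tool-wide map is shared across workers, so this sketch locks it.
        std::lock_guard<std::mutex> guard(shards_mutex_);
        shards_[shard_index] = std::move(counts);
        return shard_data;
    }

    // Per-entry work: touches only this shard's data, so no lock is taken.
    bool parallel_shard_memref(void *shard_data, const memref_stub_t &memref)
    {
        auto *counts = static_cast<shard_counts_t *>(shard_data);
        if (memref.is_data)
            ++counts->data_refs;
        return true;
    }

    // Serial entry point: create shard data for a newly seen thread, then
    // reuse the parallel code path.
    bool process_memref(const memref_stub_t &memref)
    {
        auto it = serial_shard_for_tid_.find(memref.tid);
        if (it == serial_shard_for_tid_.end()) {
            void *shard_data = parallel_shard_init(next_serial_index_++);
            it = serial_shard_for_tid_.emplace(memref.tid, shard_data).first;
        }
        return parallel_shard_memref(it->second, memref);
    }

    // Aggregation across the whole trace, used by print_results().
    uint64_t total_data_refs()
    {
        std::lock_guard<std::mutex> guard(shards_mutex_);
        uint64_t total = 0;
        for (const auto &entry : shards_)
            total += entry.second->data_refs;
        return total;
    }

    bool print_results()
    {
        std::cout << "Total data references: " << total_data_refs() << "\n";
        return true;
    }

private:
    std::mutex shards_mutex_;
    std::unordered_map<int, std::unique_ptr<shard_counts_t>> shards_;
    std::unordered_map<int64_t, void *> serial_shard_for_tid_;
    int next_serial_index_ = 0;
};
```

The sketch assumes a given tool instance is driven either serially or in parallel, not both at once, so the serial shard indices never collide with indices supplied by a parallel driver.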

Today, parallel analysis is only supported for offline traces. Support for online traces may be added in the future.

In the default mode of operation, the analyzer_t class iterates over the trace and calls the appropriate analysis_tool_t functions for each tool. An alternative mode is supported which exposes the iterator and allows a separate control infrastructure to be built.

Each trace entry is of type memref_t and represents one instruction or data reference or a metadata operation such as a thread exit or marker. There are built-in scheduling markers providing the timestamp and cpu identifier on each thread transition. Other built-in markers indicate disruptions in user mode control flow such as signal handler entry and exit.

CMake support is provided for including the headers and linking the libraries of the drmemtrace framework. A new CMake function is defined in the DynamoRIO package which sets the include directory for using the drmemtrace/ headers:

use_DynamoRIO_drmemtrace(mytool)

The drmemtrace_analyzer library exported by the DynamoRIO package is the main library to link when building a new tool. The tools described above are also exported as the libraries drmemtrace_basic_counts, drmemtrace_view, drmemtrace_opcode_mix, drmemtrace_histogram, drmemtrace_reuse_distance, drmemtrace_reuse_time, and drmemtrace_simulator. They can be created using the basic_counts_tool_create(), opcode_mix_tool_create(), histogram_tool_create(), reuse_distance_tool_create(), reuse_time_tool_create(), view_tool_create(), cache_simulator_create(), and tlb_simulator_create() functions.

Simulator Parameters

drcachesim's behavior can be controlled through options passed after the -t drcachesim but prior to the "--" delimiter on the command line:

$ bin64/drrun -t drcachesim <options> <to> <drcachesim> -- /path/to/target/app <args> <for> <app>

Boolean options can be disabled using a "-no_" prefix.

The parameters available are described below:

-offline
default value: false
By default, traces are processed online, sent over a pipe to a simulator. If this option is enabled, trace data is instead written to files in -outdir for later offline analysis. No simulator is executed.
-ipc_name
default value: drcachesimpipe
For online tracing and simulation (the default, unless -offline is requested), specifies the name of the named pipe used to communicate between the target application processes and the caching device simulator. On Linux this can include an absolute path (if it doesn't, a default temp directory will be used). A unique name must be chosen for each instance of the simulator being run at any one time. On Windows, the name is limited to 247 characters.
-outdir
default value: .
For the offline analysis mode (when -offline is requested), specifies the path to a directory where per-thread trace files will be written.
-indir
default value: ""
After a trace file is produced via -offline into -outdir, it can be passed to the simulator via this flag pointing at the subdirectory created in -outdir. The -offline tracing produces raw data files which are converted into final trace files on the first execution with -indir. The raw files can also be manually converted using the drraw2trace tool. Legacy single trace files with all threads interleaved into one are not supported with this option: use -infile instead.
-infile
default value: ""
Directs the simulator to use a single all-threads-interleaved-into-one trace file. This is a legacy file format that is no longer produced.
-jobs
default value: -1
By default, both post-processing of offline raw trace files and analysis of trace files are parallelized. This option controls the number of concurrent jobs. 0 disables concurrency and uses a single thread to perform all operations. A negative value sets the job count to the number of hardware threads, with a cap of 16.
-module_file
default value: ""
The opcode_mix tool needs the modules.log file (generated by the offline post-processing step in the raw/ subdirectory) in addition to the trace file. If the file is named modules.log and is in the same directory as the trace file, or a raw/ subdirectory below the trace file, this parameter can be omitted.
-cores
default value: 4
Specifies the number of cores to simulate.
-line_size
default value: 64
Specifies the cache line size, which is assumed to be identical for L1 and L2 caches. Must be a power of 2.
-L1I_size
default value: 32K
Specifies the total size of each L1 instruction cache. Must be a power of 2 and a multiple of -line_size.
-L1D_size
default value: 32K
Specifies the total size of each L1 data cache. Must be a power of 2 and a multiple of -line_size.
-L1I_assoc
default value: 8
Specifies the associativity of each L1 instruction cache. Must be a power of 2.
-L1D_assoc
default value: 8
Specifies the associativity of each L1 data cache. Must be a power of 2.
-LL_size
default value: 8M
Specifies the total size of the unified last-level (L2) cache. Must be a power of 2 and a multiple of -line_size.
-LL_assoc
default value: 16
Specifies the associativity of the unified last-level (L2) cache. Must be a power of 2.
-LL_miss_file
default value: ""
If non-empty, when running the cache simulator, requests that every last-level cache miss be written to a file at the specified path. Each miss is written in text format as a <program counter, address> pair. If this tool is linked with zlib, the file is written in gzip-compressed format. If non-empty, when running the cache miss analyzer, requests that prefetching hints based on the miss analysis be written to the specified file. Each hint is written in text format as a <program counter, stride, locality level> tuple.
-L0_filter
default value: false
Filters out instruction and data hits in a 'zero-level' cache during tracing itself, shrinking the final trace to only contain instruction and data accesses that miss in this initial cache. This cache is direct-mapped with sizes equal to -L0I_size and -L0D_size. It uses virtual addresses regardless of -use_physical.
-L0I_size
default value: 32K
Specifies the size of the 'zero-level' instruction cache for -L0_filter. Must be a power of 2 and a multiple of -line_size, unless it is set to 0, which disables instruction fetch entries from appearing in the trace.
-L0D_size
default value: 32K
Specifies the size of the 'zero-level' data cache for -L0_filter. Must be a power of 2 and a multiple of -line_size, unless it is set to 0, which disables data entries from appearing in the trace.
-use_physical
default value: false
If available, the default virtual addresses will be translated to physical. This is not possible from user mode on all platforms.
-virt2phys_freq
default value: 0
This option only applies if -use_physical is enabled. The virtual to physical mapping is cached for performance reasons, yet the underlying mapping can change without notice. This option controls the frequency with which the cached value is ignored in order to re-access the actual mapping and ensure accurate results. The units are the number of memory accesses per forced access. A value of 0 uses the cached values for the entire application execution.
-cpu_scheduling
default value: false
By default, the simulator schedules threads to simulated cores in a static round-robin fashion. This option causes the scheduler to instead use the recorded cpu that each thread executed on (at a granularity of the trace buffer size) for scheduling, mapping traced cpus to cores and running each segment of each thread on the core that owns the recorded cpu for that segment.
-max_trace_size
default value: 0
If non-zero, this sets a maximum size on the amount of raw trace data gathered for each thread. This is not an exact limit: it may be exceeded by the size of one internal buffer. Once reached, instrumentation continues for that thread, but no further data is recorded.
-trace_after_instrs
default value: 0
If non-zero, this causes tracing to be suppressed until this many dynamic instruction executions are observed. At that point, regular tracing is put into place. Use -max_trace_size to set a limit on the subsequent trace length.
-exit_after_tracing
default value: 0
If non-zero, after tracing the specified number of references, the process is exited with an exit code of 0. The reference count is approximate.
-online_instr_types
default value: false
By default, offline traces include some information on the types of instructions, branches in particular. For online traces, this comes at a performance cost, so it is turned off by default.
-replace_policy
default value: LRU
Specifies the replacement policy for caches. Supported policies: LRU (Least Recently Used), LFU (Least Frequently Used), FIFO (First-In-First-Out).
-data_prefetcher
default value: nextline
Specifies the hardware data prefetcher policy. The currently supported policies are 'nextline' (fetch the subsequent cache line) and 'none' (disables hardware prefetching). The prefetcher is located between the L1D and LL caches.
-page_size
default value: 4K
Specifies the virtual/physical page size.
-TLB_L1I_entries
default value: 32
Specifies the number of entries in each L1 instruction TLB. Must be a power of 2.
-TLB_L1D_entries
default value: 32
Specifies the number of entries in each L1 data TLB. Must be a power of 2.
-TLB_L1I_assoc
default value: 32
Specifies the associativity of each L1 instruction TLB. Must be a power of 2.
-TLB_L1D_assoc
default value: 32
Specifies the associativity of each L1 data TLB. Must be a power of 2.
-TLB_L2_entries
default value: 1024
Specifies the number of entries in each unified L2 TLB. Must be a power of 2.
-TLB_L2_assoc
default value: 4
Specifies the associativity of each unified L2 TLB. Must be a power of 2.
-TLB_replace_policy
default value: LFU
Specifies the replacement policy for TLBs. Supported policies: LFU (Least Frequently Used).
-simulator_type
default value: cache
Specifies the type of the simulator. Supported types: cache, miss_analyzer, TLB, reuse_distance, reuse_time, histogram, or basic_counts.
-verbose
default value: 0
Verbosity level for notifications.
-dr
default value: ""
Specifies the path of the DynamoRIO root directory.
-dr_debug
default value: false
Requests use of the debug build of DynamoRIO rather than the release build.
-dr_ops
default value: ""
Specifies the options to pass to DynamoRIO.
-tracer
default value: ""
The full path to the tracer library.
-skip_refs
default value: 0
Specifies the number of references to skip in the beginning of the application execution. These memory references are dropped instead of being simulated.
-warmup_refs
default value: 0
Specifies the number of memory references to warm up caches before simulation. The warmup references come after the skipped references and before the simulated references. This flag is incompatible with -warmup_fraction.
-warmup_fraction
default value: 0
Specifies the fraction of last-level cache blocks to be loaded such that the cache is considered to be warmed up before simulation. The warmup fraction is computed after the skipped references and before the simulated references. This flag is incompatible with -warmup_refs.
-sim_refs
default value: 8589934592G
Specifies the number of memory references to simulate. The simulated references come after the skipped and warmup references, and the references following the simulated ones are dropped.
-view_syntax
default value: att
Specifies the syntax to use when viewing disassembled offline traces. The option can be set to one of att (default), intel, dr and arm. An invalid specification falls back to the default.
-config_file
default value: ""
The full path to the cache hierarchy configuration file.
-report_top
default value: 10
Specifies the number of top results to be reported.
-reuse_distance_threshold
default value: 100
Specifies the reuse distance threshold for reporting the distant repeated references. A reference is a distant repeated reference if the distance to the previous reference on the same cache line exceeds the threshold.
-reuse_distance_histogram
default value: false
By default only the mean, median, and standard deviation of the reuse distances are reported. This option prints out the full histogram of reuse distances.
-reuse_skip_dist
default value: 500
Specifies the distance between nodes in the skip list. For optimal performance, set this to a value close to the estimated average reuse distance of the dataset.
-reuse_verify_skip
default value: false
Verifies every skip list-calculated reuse distance with a full list walk. This incurs significant additional overhead. This option is only available in debug builds.
-record_function
default value: ""
Records a trace of invocations of the function(s) specified in the option value. The value should fit this format: function_name|function_id|func_args_num (e.g., -record_function "memset|10|3"). The trace contains the function return address, the function argument value(s), and the function return value. Only pointer-sized arguments and return values are recorded. The trace is labeled with the function_id via an ID entry prior to each set of value entries. If the target function is in the dynamic symbol table, then the function_name should be a mangled name (e.g. "_Znwm" for "operator new", "_ZdlPv" for "operator delete"). Otherwise, the function_name should be a demangled name. Recording multiple functions can be achieved by using the separator "&" (e.g., -record_function "memset|10|3&memcpy|11|3"), or by specifying multiple -record_function options (e.g., -record_function "memset|10|3" -record_function "memcpy|11|3"). Note that each provided function_id should be unique, and should not collide with those of the existing heap functions (see -record_heap_value) if the -record_heap option is enabled.
-record_heap
default value: false
A convenience option that enables recording a trace for the heap functions defined in -record_heap_value. Specifying this option is equivalent to -record_function [heap_functions], where [heap_functions] is the value in -record_heap_value.
-record_heap_value
default value: malloc|0|1&free|1|1&tc_malloc|2|1&tc_free|3|1&__libc_malloc|4|1&__libc_free|5|1&calloc|6|2
Functions recorded by -record_heap. The option value should fit the same format required by -record_function. These functions will not be traced unless -record_heap is specified.
-miss_count_threshold
default value: 50000
Specifies the minimum number of LLC misses of a load for it to be eligible for analysis in search of patterns in the miss address stream.
-miss_frac_threshold
default value: 0.005
Specifies the minimum fraction of LLC misses of a load (from all misses) for it to be eligible for analysis in search of patterns in the miss address stream.
-confidence_threshold
default value: 0.75
Specifies the minimum confidence to include a discovered pattern in the output results. Confidence in a discovered pattern for a load instruction is calculated as the fraction of the load's misses with the discovered pattern over all the load's misses.
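The rule given for -jobs above (0 means a single thread, a negative value means the number of hardware threads capped at 16) can be sketched as a small helper. The function name resolve_job_count is hypothetical, and the fallback to one thread when the hardware thread count is unknown is an assumption of this sketch rather than documented behavior.

```cpp
#include <algorithm>

// Hypothetical helper: maps a -jobs value to the number of threads that
// perform the work. hardware_threads would typically come from
// std::thread::hardware_concurrency(), which may report 0 if unknown.
inline int resolve_job_count(int jobs_option, int hardware_threads)
{
    const int kMaxDefaultJobs = 16; // Cap stated in the -jobs description.
    if (jobs_option == 0)
        return 1; // Concurrency disabled: a single thread does everything.
    if (jobs_option < 0) {
        // Assumption: fall back to one thread if the count is unknown.
        return std::max(1, std::min(hardware_threads, kMaxDefaultJobs));
    }
    return jobs_option; // An explicit positive job count is used as given.
}
```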
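The constraints stated above for -line_size and the cache size and associativity options (powers of 2, with sizes being multiples of the line size) can be checked before launching a run. The sketch below applies those checks and derives the set count using the standard set-associative relation sets = size / (line_size * associativity). The helper names are hypothetical, and whether drcachesim computes its geometry in exactly this way is an assumption.

```cpp
#include <cstdint>

inline bool is_power_of_two(uint64_t value)
{
    return value != 0 && (value & (value - 1)) == 0;
}

// Hypothetical helper: returns the number of sets for a cache with the given
// total size, line size, and associativity, or 0 if the combination violates
// the documented constraints.
inline uint64_t cache_num_sets(uint64_t cache_size, uint64_t line_size,
                               uint64_t assoc)
{
    if (!is_power_of_two(cache_size) || !is_power_of_two(line_size) ||
        !is_power_of_two(assoc))
        return 0;
    if (cache_size % line_size != 0)
        return 0; // The size must be a multiple of the line size.
    const uint64_t num_lines = cache_size / line_size;
    if (assoc > num_lines)
        return 0; // Cannot have more ways than lines.
    return num_lines / assoc;
}
```

With the default values, each 32K 8-way L1 cache with 64-byte lines has 64 sets, and the 8M 16-way last-level cache has 8192 sets.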
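The ordering described for -skip_refs, -warmup_refs, and -sim_refs above (skipped references first, then warmup, then simulated, with the remainder dropped) can be expressed as a classification of each reference by its position. This is a minimal sketch with hypothetical names; it models only the count-based -warmup_refs option, not -warmup_fraction.

```cpp
#include <cstdint>

enum class ref_phase_t { SKIPPED, WARMUP, SIMULATED, DROPPED };

// Hypothetical helper: classifies the memory reference at zero-based position
// ref_index according to the three count options.
inline ref_phase_t classify_reference(uint64_t ref_index, uint64_t skip_refs,
                                      uint64_t warmup_refs, uint64_t sim_refs)
{
    if (ref_index < skip_refs)
        return ref_phase_t::SKIPPED;
    // Subtract rather than add so very large option values cannot overflow.
    const uint64_t past_skip = ref_index - skip_refs;
    if (past_skip < warmup_refs)
        return ref_phase_t::WARMUP;
    const uint64_t past_warmup = past_skip - warmup_refs;
    if (past_warmup < sim_refs)
        return ref_phase_t::SIMULATED;
    return ref_phase_t::DROPPED;
}
```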
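The value format shared by -record_function and -record_heap_value above (function_name|function_id|func_args_num, with entries joined by "&") is simple enough to validate ahead of time. The following parser is a sketch under that stated format only; recorded_func_t, parse_non_negative, and parse_record_function are hypothetical names, and the tracer's own parsing may accept or reject inputs differently.

```cpp
#include <cstddef>
#include <exception>
#include <sstream>
#include <string>
#include <vector>

struct recorded_func_t {
    std::string name;
    int id;
    int num_args;
};

// Parses a non-negative decimal integer, rejecting trailing characters.
inline bool parse_non_negative(const std::string &text, int &result)
{
    try {
        std::size_t consumed = 0;
        result = std::stoi(text, &consumed);
        return consumed == text.size() && result >= 0;
    } catch (const std::exception &) {
        return false; // Not a number, or out of range for int.
    }
}

// Splits the option value on '&' and each entry on '|'. On success, out holds
// one element per entry; on failure, out is left unchanged.
inline bool parse_record_function(const std::string &value,
                                  std::vector<recorded_func_t> &out)
{
    std::vector<recorded_func_t> parsed;
    std::istringstream entries(value);
    std::string entry;
    while (std::getline(entries, entry, '&')) {
        std::istringstream fields(entry);
        recorded_func_t func;
        std::string id_text, args_text;
        // The last field takes the rest of the entry, so any extra '|'
        // separator ends up in args_text and fails the integer parse.
        if (!std::getline(fields, func.name, '|') ||
            !std::getline(fields, id_text, '|') ||
            !std::getline(fields, args_text) || func.name.empty())
            return false;
        if (!parse_non_negative(id_text, func.id) ||
            !parse_non_negative(args_text, func.num_args))
            return false;
        parsed.push_back(func);
    }
    out.swap(parsed);
    return true;
}
```

The sketch does not check that function ids are unique across entries, which the option description asks of the user.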

Current Limitations

The drcachesim tool is a work in progress. We welcome contributions in these areas of missing functionality:

Cache coherence (https://github.com/DynamoRIO/dynamorio/issues/1726)
Multi-process online application simulation on Windows (https://github.com/DynamoRIO/dynamorio/issues/1727)
Arbitrary cache hierarchy support via an input config file (https://github.com/DynamoRIO/dynamorio/issues/1715)
Offline traces do not currently accurately record instruction fetches in dynamically generated code (https://github.com/DynamoRIO/dynamorio/issues/2062). All data references are included, but instruction fetches may be skipped. This problem is limited to offline traces.
Application phase marking is not yet implemented (https://github.com/DynamoRIO/dynamorio/issues/2478).

Comparison to Other Simulators

drcachesim is one of the few simulators to support multiple processes. This feature requires an out-of-process simulator and inter-process communication. A single-process design would incur less overhead. Thus, we expect drcachesim to pay for its multi-process support with potentially unfavorable performance versus single-process simulators.

When comparing cache hits, misses, and miss rates across simulators, the details can vary substantially. For example, some other simulators (such as cachegrind) do not split memory references that cross cache lines into multiple hits or misses, while drcachesim does split them. Instructions that reference multiple memory words on the same cache line (such as ldm on ARM) are considered to be single accesses by drcachesim, while other simulators (such as cachegrind) may split the accesses into separate pieces. A final example involves string loop instructions on x86. drcachesim considers only the first iteration to involve an instruction fetch (presenting subsequent iterations as a "non-fetched instruction" which the simulator ignores; the basic_counts tool does show these as a separate statistic), while other simulators (incorrectly) issue a fetch to the instruction cache on every iteration of the string loop.
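To make the line-crossing case concrete, the sketch below lists the cache lines touched by a single memory reference. A reference that spans two lines yields two entries, each of which a simulator that splits references would count as its own hit or miss. The function name lines_touched is hypothetical, and this is an illustration of the idea rather than drcachesim's implementation.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: returns the starting address of every cache line
// touched by an access of `size` bytes at `addr`. Returns an empty vector if
// size is 0, line_size is not a power of 2, or the access wraps around the
// address space.
inline std::vector<uint64_t> lines_touched(uint64_t addr, uint64_t size,
                                           uint64_t line_size)
{
    std::vector<uint64_t> lines;
    if (size == 0 || line_size == 0 || (line_size & (line_size - 1)) != 0)
        return lines;
    const uint64_t last_byte = addr + size - 1;
    if (last_byte < addr)
        return lines; // Address arithmetic wrapped around.
    const uint64_t mask = ~(line_size - 1);
    const uint64_t first_line = addr & mask;
    const uint64_t last_line = last_byte & mask;
    for (uint64_t line = first_line;; line += line_size) {
        lines.push_back(line);
        if (line == last_line)
            break;
    }
    return lines;
}
```

For example, with 64-byte lines an 8-byte access at 0x103c touches the lines at 0x1000 and 0x1040, whereas the same access at 0x1038 stays within the line at 0x1000.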