Simulating (Verilated-Model Runtime)

This section describes items related to simulating, that is, running a Verilated model's executable. For the runtime arguments to a simulated model, see Simulation Runtime Arguments.

Benchmarking & Optimization

For best performance, run Verilator with the -O3 --x-assign fast --x-initial fast --noassert options. The -O3 option will make Verilator itself take longer to run, and --x-assign fast --x-initial fast may increase the risk of reset bugs in exchange for performance; see the documentation of these options.
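For example, a Verilation command using these options might look like the following; this is a sketch, and "our.v" and "sim_main.cpp" are placeholder file names:

verilator --cc --exe -O3 --x-assign fast --x-initial fast --noassert our.v sim_main.cpp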

If using a multithreaded Verilated model, use numactl to ensure you are using non-conflicting hardware resources. See Multithreading.
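For example, to bind a four-threaded model to the memory of the first NUMA node and to four specific CPUs (a sketch; the CPU numbers and executable name are placeholders, and the best choice depends on your machine's topology):

numactl -m 0 -C 0-3 -- obj_dir/Vtop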

Minor Verilog code changes can also give big wins. You should not have any UNOPTFLAT warnings from Verilator. Fixing these warnings can result in huge improvements; one user fixed their one UNOPTFLAT warning by making a simple change to a clock latch used to gate clocks and gained a 60% performance improvement.

Beyond that, the performance of a Verilated model depends mostly on your C++ compiler and size of your CPU’s caches. Experience shows that large models are often limited by the size of the instruction cache, and as such reducing code size if possible can be beneficial.

The supplied $VERILATOR_ROOT/include/verilated.mk file uses the OPT, OPT_FAST, OPT_SLOW and OPT_GLOBAL variables to control optimization. You can set these when compiling the output of Verilator with Make, for example:

make OPT_FAST="-Os -march=native" -f Vour.mk Vour__ALL.a

OPT_FAST specifies optimization options for those parts of the model that are on the fast path. This is mostly code that is executed every cycle. OPT_SLOW applies to slow-path code, which executes rarely, often only once at the beginning or end of simulation. Note that OPT_SLOW is ignored if VM_PARALLEL_BUILDS is not 1, in which case all generated code will be compiled in a single compilation unit using OPT_FAST. See also the Verilator --output-split option. The OPT_GLOBAL variable applies to common code in the runtime library used by Verilated models (shipped in $VERILATOR_ROOT/include). Additional C++ files passed on the verilator command line use OPT_FAST. The OPT variable applies to all compilation units in addition to the specific “OPT” variables described above.
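For example, the fast-path, global, and catch-all variables can all be set in one build (a sketch, reusing the hypothetical "Vour" model from above):

make OPT_FAST="-Os" OPT_GLOBAL="-Os" OPT="-march=native" -f Vour.mk Vour__ALL.a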

You can also use the -CFLAGS and/or -LDFLAGS options on the verilator command line to pass arguments directly to the compiler or linker.
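For example (a sketch; the particular flags shown are only illustrative):

verilator --cc -CFLAGS "-march=native" -LDFLAGS "-static" our.v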

The default values of the "OPT" variables are chosen to yield good simulation speed with reasonable C++ compilation times. To this end, OPT_FAST is set to "-Os" by default. Higher optimization such as "-O2" or "-O3" may help (though often it provides only a very small performance benefit), but compile times may become excessively long even with medium-sized designs. Compilation times can be improved at the expense of simulation speed by reducing optimization, for example with OPT_FAST="-O0". Often good simulation speed can be achieved with OPT_FAST="-O1 -fstrict-aliasing", while improving compilation times.

Files controlled by OPT_SLOW have little effect on performance, and therefore OPT_SLOW is empty by default (equivalent to "-O0") for improved compilation speed. In common use cases there should be little benefit in changing OPT_SLOW. OPT_GLOBAL is set to "-Os" by default, and there should rarely be a need to change it. As the runtime library is small in comparison to most Verilated models, disabling optimization on the runtime library should not have a serious effect on overall compilation time, but it may have a detrimental effect on simulation speed, especially with tracing.

In addition to the above, for best results use OPT="-march=native", the latest Clang compiler (about 10% faster than GCC), and link statically.

Generally, which optimization level gives the best user experience depends on the use case, and some experimentation can pay dividends. For a speedy debug cycle during development, especially on large designs where C++ compilation time can dominate, consider using lower optimization to get to an executable faster. For throughput-oriented use cases, for example regressions, it is usually worth spending extra compilation time to reduce total CPU time.

If you will be running many simulations on a single model, you can investigate profile-guided optimization. With GCC, compiling with "-fprofile-arcs", running representative simulations, and then recompiling with "-fbranch-probabilities" will yield another 15% or so.
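A profile-guided build might look roughly like the following sketch (hypothetical file names; the exact mechanics depend on your build flow):

verilator --cc --exe -CFLAGS -fprofile-arcs -LDFLAGS -fprofile-arcs our.v sim_main.cpp
make -C obj_dir -f Vour.mk Vour
obj_dir/Vour                  # run representative simulations; GCC writes .gcda profile data
rm -f obj_dir/*.o             # force recompilation with the new flags
verilator --cc --exe -CFLAGS -fbranch-probabilities our.v sim_main.cpp
make -C obj_dir -f Vour.mk Vour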

Modern compilers also support link-time optimization (LTO), which can help especially if you link in DPI code. To enable LTO on GCC, pass “-flto” in both compilation and link. Note LTO may cause excessive compile times on large designs.
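For example (a sketch; "our_dpi.cpp" is a placeholder for user DPI code):

verilator --cc --exe -CFLAGS -flto -LDFLAGS -flto our.v sim_main.cpp our_dpi.cpp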

Unfortunately, using the optimizer with SystemC files can result in compilation taking several minutes. (The SystemC libraries have many little inlined functions that drive the compiler nuts.)

If you are using your own makefiles, you may want to compile the Verilated code with -DVL_INLINE_OPT=inline. This will inline functions; however, it requires that all cpp files be compiled in a single compiler run.

You may uncover further tuning possibilities by profiling the Verilog code. See Code Profiling.

When done optimizing, please let the author know the results. We like to keep tabs on how Verilator compares, and may be able to suggest additional improvements.

Coverage Analysis

Verilator supports adding code to the Verilated model to support SystemVerilog code coverage. With --coverage, Verilator enables all forms of coverage: functional coverage, line coverage, and toggle coverage, as described in the following subsections.

When a model with coverage is executed, it will create a coverage data file for collection and later analysis; see Coverage Collection.

Functional Coverage

With --coverage or --coverage-user, Verilator will translate functional coverage points that the user has manually inserted in the SystemVerilog design into the Verilated model.

Currently, all functional coverage points are specified using SystemVerilog assertion syntax which must be separately enabled with --assert.

For example, the following SystemVerilog statement will add a coverage point, under the coverage name “DefaultClock”:

DefaultClock: cover property (@(posedge clk) cyc==3);

Line Coverage

With --coverage or --coverage-line, Verilator will automatically add coverage analysis at each code flow change point (e.g. at branches). At each such branch a unique counter is incremented. At the end of a test, the counters along with the filename and line number corresponding to each counter are written into the coverage file.

Verilator automatically disables coverage of branches that have a $stop in them, as it is assumed $stop branches contain an error check that should not occur. A /*verilator coverage_block_off*/ metacomment will perform a similar function on any code in that block or below, or /*verilator coverage_off*/ and /*verilator coverage_on*/ will disable and enable coverage respectively around a block of code.

Verilator may over-count combinatorial (non-clocked) blocks when those blocks receive signals which have had the UNOPTFLAT warning disabled; for most accurate results do not disable this warning when using coverage.

Toggle Coverage

With --coverage or --coverage-toggle, Verilator will automatically add toggle coverage analysis into the Verilated model.

Every bit of every signal in a module has a counter inserted. The counter will increment on every edge change of the corresponding bit.

Signals that are part of tasks or begin/end blocks are considered local variables and are not covered. Signals that begin with underscores (see --coverage-underscore), are integers, or are very wide (>256 bits total storage across all dimensions, see --coverage-max-width) are also not covered.

Hierarchy is compressed, such that if a module is instantiated multiple times, coverage will be summed for that bit across all instantiations of that module with the same parameter set. A module instantiated with different parameter values is considered a different module, and will get counted separately.

Verilator makes a minimally-intelligent decision about what clock domain the signal goes to, and only looks for edges in that clock domain. This means that edges may be ignored if it is known that the edge could never be seen by the receiving logic. This algorithm may improve in the future. The net result is coverage may be lower than what would be seen by looking at traces, but the coverage is a more accurate representation of the quality of stimulus into the design.

There may be edges counted near time zero while the model stabilizes. It’s a good practice to zero all coverage just before releasing reset to prevent counting such behavior.
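For example, the user wrapper might clear the counters right before releasing reset. This is a minimal sketch that assumes the coveragep() accessor described under Coverage Collection below and a zero() method on the coverage object; the wrapper function name is hypothetical:

#include "verilated.h"
#include "verilated_cov.h"

// Sketch: discard coverage counted while the model stabilizes near time
// zero; call this just before reset is released.
void clear_startup_coverage() {
    Verilated::coveragep()->zero();  // assumed API; see Coverage Collection
}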

A /*verilator coverage_off*/ /*verilator coverage_on*/ metacomment pair can be used around signals that do not need toggle analysis, such as RAMs and register files.

Coverage Collection

When any coverage flag is used to Verilate, Verilator will insert the appropriate coverage points into the model, and the model will collect the coverage data as it runs.

To get the coverage data from the model, in the user wrapper code, typically at the end once a test passes, call Verilated::coveragep()->write with an argument of the filename to write the coverage data to (typically "logs/coverage.dat").
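A minimal sketch of such a wrapper fragment follows, using the call described above; the verilated_cov.h include and the wrapper function name are assumptions:

#include "verilated.h"
#include "verilated_cov.h"

// Sketch: call once at the end of a passing test to write the collected
// coverage data for later analysis with verilator_coverage.
void write_coverage() {
    Verilated::coveragep()->write("logs/coverage.dat");
}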

Run each of your tests in different directories, potentially in parallel. Each test will create a logs/coverage.dat file.

After running all of the tests, execute the verilator_coverage command, passing arguments pointing to the filenames of all of the individual coverage files. verilator_coverage will read the logs/coverage.dat file(s) and create an annotated source code listing showing code coverage details.
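For example, to annotate sources using the data from two test directories (a sketch; the paths are placeholders):

verilator_coverage --annotate logs/annotated test1/logs/coverage.dat test2/logs/coverage.dat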

verilator_coverage may also be used for test grading, that is, computing which tests are important to fully cover the design.
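For example, a ranking run might look like this sketch (assuming the --rank option; the file names are placeholders):

verilator_coverage --rank test1/logs/coverage.dat test2/logs/coverage.dat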

For an example, see the examples/make_tracing_c/logs directory. Grep for lines starting with ‘%’ to see what lines Verilator believes need more coverage.

Additional options of verilator_coverage allow for merging of coverage data files or other transformations.
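For example, several data files might be merged into one (a sketch assuming the --write option; the file names are placeholders):

verilator_coverage --write logs/merged.dat test1/logs/coverage.dat test2/logs/coverage.dat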

Info files can be written by verilator_coverage for import to lcov. This enables use of genhtml for HTML reports and importing reports to sites such as https://codecov.io.
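For example (a sketch assuming the --write-info option; genhtml comes from the lcov package, and the paths are placeholders):

verilator_coverage --write-info logs/coverage.info logs/coverage.dat
genhtml logs/coverage.info --output-directory logs/html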

Code Profiling

The Verilated model may be code-profiled using GCC or Clang’s C++ profiling mechanism. Verilator provides additional flags to help map the resulting C++ profiling results back to the original Verilog code responsible for the profiled C++ code functions.

To use profiling:

  1. Use Verilator’s --prof-cfuncs.

  2. Build and run the simulation model.

  3. The model will create gmon.out.

  4. Run gprof to see where in the C++ code the time is spent.

  5. Run the gprof output through the verilator_profcfunc program, and it will tell you on which Verilog line numbers most of the time is being spent.
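Concretely, these steps might look like the following sketch (hypothetical file and model names):

verilator --cc --exe --prof-cfuncs our.v sim_main.cpp
make -C obj_dir -f Vour.mk Vour
obj_dir/Vour                                    # produces gmon.out
gprof obj_dir/Vour gmon.out > gprof.log
verilator_profcfunc gprof.log > profcfunc.log   # time attributed to Verilog lines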

Thread Profiling

When using multithreaded mode (--threads), it is useful to see statistics and visualize how well the multiple CPUs are being utilized.

With the --prof-threads option, Verilator will:

  • Add code to the Verilated model to record the start and end time of each macro-task across a number of calls to eval. (What is a macro-task? See the Verilator internals documentation, docs/internals.rst, in the distribution.)

  • Add code to save profiling data in non-human-friendly form to the file specified with +verilator+prof+threads+file+<filename>.

The verilator_gantt program may then be run to transform the saved profiling file into a nicer visual format and produce some related statistics.
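Put together, a thread-profiling run might look like the following sketch (hypothetical file names; the profile file name follows the runtime option above):

verilator --cc --exe --threads 4 --prof-threads our.v sim_main.cpp
make -C obj_dir -f Vour.mk Vour
obj_dir/Vour +verilator+prof+threads+file+profile_threads.dat
verilator_gantt profile_threads.dat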

Fig. 1 Example verilator_gantt output, as viewed with GTKWave.

The parallelism section shows the number of CPUs being used at a given moment.

The cpu_thread section shows which thread is executing on each of the physical CPUs.

The thread_mtask section shows which macro-task is running on a given thread.

For more information see verilator_gantt.

Profiling ccache efficiency

The Verilator generated Makefile provides support for basic profiling of ccache behavior during the build. This can be used to track down files that might be unnecessarily rebuilt, though as of today even small code changes will usually require rebuilding a large number of files. Improving ccache efficiency during the edit/compile/test loop is an active area of development.

To get a basic report of how well ccache is doing, add the ccache-report target when invoking the generated Makefile:

make -C obj_dir -f Vout.mk Vout ccache-report

This will print a report based on all executions of ccache during this invocation of Make. The report is also written to a file, in this example obj_dir/Vout__cache_report.txt.

To use the ccache-report target, at least one other explicit build target must be specified, and OBJCACHE must be set to ‘ccache’.
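For example, combining an explicit build target with the report target:

make -C obj_dir -f Vout.mk Vout ccache-report OBJCACHE=ccache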

This feature is currently experimental and might change in subsequent releases.

Save/Restore

The intermediate state of a Verilated model may be saved, so that it may later be restored.

To enable this feature, use --savable. There are limitations in what language features are supported along with --savable; if you attempt to use an unsupported feature Verilator will throw an error.

To use save/restore, the user wrapper code must create a VerilatedSerialize or VerilatedDeserialize object, then call the << or >> operators on the generated model and any other data the process needs saved/restored. These functions are not thread-safe and are typically called only by a main thread.

For example:

#include "verilated_save.h"   // declares VerilatedSave / VerilatedRestore

void save_model(const char* filenamep) {
    VerilatedSave os;
    os.open(filenamep);
    os << main_time;  // user code must save the timestamp, etc.
    os << *topp;
}
void restore_model(const char* filenamep) {
    VerilatedRestore os;
    os.open(filenamep);
    os >> main_time;
    os >> *topp;
}