
Optimize performance of icache #1543

Open
veripoolbot opened this issue Oct 6, 2019 · 2 comments
Labels
area: performance (Issue involves performance issues) · effort: weeks (Expect this issue to require weeks or more of invested effort to resolve) · type: feature-non-IEEE (Request to add new feature, outside IEEE 1800)

Comments

@veripoolbot
Contributor


Author Name: Wilson Snyder (@wsnyder)
Original Redmine Issue: 1543 from https://www.veripool.org


Feature tracking bug.

The biggest performance cost presently is icache pressure. We should figure out techniques to improve icache performance as this has the potential for 2x performance gains.

First we need to hand-craft some experiments and possible solutions so we find a good path.

One idea is to build new functions with generic inputs/outputs, then call those functions (basically reverse-inlining). This results in more instructions executed, but a smaller code footprint, which is typically beneficial, especially if no branches are introduced. One possible method:

Build a graph of common code and the sites referencing that code. This pass will look through the existing statements and convert each statement's inputs/outputs to generic variables. The code at that point forms a potential function which references those variables generically. If the number of variables is small, hash the potential function, so we now have a function signature, and create a graph edge from the potential call site to the potential function. Once the graph is built, find frequently occurring signatures, build a function corresponding to each such signature, and replace the original logic with calls to that function.

Note that Verilator presently has a V3Combine step that finds multiple functions with identical code and combines them into a single function, but it is very limited.

@veripoolbot added the labels area: performance, effort: weeks, and type: feature-non-IEEE on Dec 22, 2019
@qrq992 (Contributor) commented Feb 18, 2021

I have a question about the performance costs. Why do you think the model's performance bottleneck is the icache rather than the dcache? For large RTL designs, Verilator can generate as many as a million member variables. Isn't the dcache a performance bottleneck?
I used the perf tool to analyze program performance, but I can't determine the model's bottleneck simply from the icache and dcache miss rates.
Have I misunderstood something here?

@wsnyder (Member) commented Feb 18, 2021

Designs may vary, but:

  1. Generally there are a lot of front-end stalls.
  2. For the million variables there are generally multiple millions of instructions.
  3. Doing dcache optimizations such as perfect packing seems to give only about 10%.
  4. Doing just a small section of icache optimizations (by hand) gives 30% or more.
  5. CPUs are generally good at prefetching and hiding data misses, but aren't optimized to hide instruction misses.

@wsnyder wsnyder changed the title Improve performance of icache Optimize performance of icache Mar 16, 2023