
Painter: Memory optimizations #50

Merged
vancluever merged 1 commit into main from downsample-in-place on Oct 23, 2024

Conversation

@vancluever commented Oct 21, 2024

This adds a number of memory optimizations to the painter:

  • Downsampling is now done in-place when using AA. Memory is freed after the downsample. This avoids an extra allocation for the downsampled surface.

  • Additionally in AA, if our pixel source is alpha8, we fast-path to just using the mask as the foreground, doing a composition with the pixel source if it's not opaque. This avoids another allocation under this scenario!

  • Finally, we have moved the edge calculations to a FixedBufferAllocator. The default allocation size (which can be tuned if you're okay calling the painter functions directly from the type function) is 1024 items, or 4096 bytes under our new Polygon.Edge packed struct; a rough sketch of this follows below the list. This has allowed us to drop the use of external allocators completely from non-AA paint. As part of this, we've dropped the X-coordinate in an edge to an i30 (this might be dropped lower as well, depending on the result of our internal numerics work).

  • We've also dropped the ArenaAllocator from AA paint as it's not bringing value anymore, and probably just getting in the way of our resize.

All of these are in advance of fully addressing #44 - lowering the allocation footprint of the painter will make the path to infallibility much easier.
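As a rough illustration of the fixed-buffer edge cache described above, here is a minimal, self-contained sketch of std.heap.FixedBufferAllocator fronting a 4096-byte buffer. The buffer size matches the default mentioned in the description, but the names and structure are assumptions rather than the actual z2d code:

```zig
const std = @import("std");

pub fn main() !void {
    // Default edge cache: 1024 edges at 4 bytes each (the packed edge layout
    // discussed above) is 4096 bytes of fixed storage.
    var edge_buf: [4096]u8 align(4) = undefined;
    var fba = std.heap.FixedBufferAllocator.init(&edge_buf);
    const alloc = fba.allocator();

    // Edge storage comes out of the fixed buffer, so typical non-AA fills
    // never reach an external allocator; overflowing the buffer returns
    // error.OutOfMemory rather than silently spilling to the heap.
    const edges = try alloc.alloc(u32, 1024);
    std.debug.print("cached {d} edges in {d} bytes\n", .{ edges.len, edge_buf.len });
}
```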

@vancluever

Currently a draft: I need to address the fact that the tunability of the FBA isn't what it says on the tin. I'm trying to figure out a good way to supply the comptime value; it looks like struct fields that are comptime are functionally equivalent to constants and can't be changed that way (i.e., we can't shuttle one in from the context).
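For context, a minimal sketch of the Zig behavior being referred to here, using a hypothetical edge_cache_size option rather than anything from the actual z2d source:

```zig
const std = @import("std");

const Opts = struct {
    // A comptime field must have a default value, and in practice every
    // instance carries that value; it behaves like a constant rather than
    // per-instance state, so it can't be "shuttled in" from a runtime context.
    comptime edge_cache_size: usize = 1024,
};

test "comptime field behaves like a constant" {
    const opts: Opts = .{};
    try std.testing.expectEqual(@as(usize, 1024), opts.edge_cache_size);

    // Initializing it to anything other than the default is a compile error
    // (roughly: "value stored in comptime field does not match the default"):
    // const bad: Opts = .{ .edge_cache_size = 2048 };
}
```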

@vancluever force-pushed the downsample-in-place branch 4 times, most recently from ba3ab00 to 905463d on October 21, 2024 18:45
@vancluever

Also TODO: compositing onto the mask surface for non-fully-opaque alpha (I've planned this too, just wanted this up to start).
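As a rough sketch of what that composition amounts to (assuming a constant, non-opaque source alpha and a simple multiply; this is not the actual z2d compositor, just an illustration):

```zig
const std = @import("std");

// Scale each coverage value in the alpha8 mask by the source alpha, so the
// mask itself can serve as the foreground and no separate RGBA surface is
// needed. Hypothetical helper for illustration only.
fn applySourceAlpha(mask: []u8, source_alpha: u8) void {
    if (source_alpha == 255) return; // fully opaque: the mask is already the result
    for (mask) |*coverage| {
        const scaled = @as(u16, coverage.*) * @as(u16, source_alpha);
        coverage.* = @intCast(scaled / 255);
    }
}

test applySourceAlpha {
    var mask = [_]u8{ 0, 128, 255 };
    applySourceAlpha(&mask, 128);
    try std.testing.expectEqual(@as(u8, 128), mask[2]);
}
```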

@vancluever

Edge cache size parameterization is fixed, but unfortunately the way you can tweak it is a bit obscure (calling painter methods directly off of a new permutation of the type function). However, this is in line with what we are eventually working on in #44; although I have not shared full details, the intention is to have the context be a managed API over a number of unmanaged components. As such, if you needed a custom value for the edge cache, you would just opt into unmanaged behavior.

Some things needed to be shuffled around to preserve some of the error checking, and will be refactored soon (really just a result of moving the path checking fully down into the painter).

Also, the edge x-coordinate limit has been updated to i30, which allows us to pack it and the direction flag into a u32, ensuring that our default of 4096 bytes equates to 1k items versus 512. This limit is fine, and I might end up decreasing it anyway if we go to fixed point as described in #26 (in that case, with 24.8, we can just make it i24).
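To illustrate the packing arithmetic (the i30 x-coordinate plus direction fitting in a u32, and the 4096-byte default holding 1024 entries), here's a small sketch; the exact field layout is an assumption, only the sizes come from the description:

```zig
const std = @import("std");

// Illustrative layout only: an i30 x-coordinate plus a small signed direction
// field packs into exactly 32 bits.
const Edge = packed struct {
    x: i30,
    dir: i2,
};

comptime {
    std.debug.assert(@bitSizeOf(Edge) == 32);
    std.debug.assert(@sizeOf(Edge) == 4);
    // The default 4096-byte edge cache now holds 1024 entries instead of 512.
    std.debug.assert(4096 / @sizeOf(Edge) == 1024);
}
```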

This adds a number of memory optimizations to the painter:

* Downsampling is now done in-place when using AA. Memory is freed after
  the downsample. This avoids an extra allocation for the downsampled
  surface.

* Additionally in AA, if our pixel source is alpha8, we fast-path to
  just using the mask as the foreground, doing a composition with the
  pixel source if it's not opaque. This avoids another allocation under
  this scenario!

* Finally, we have moved the edge calculations to a
  FixedBufferAllocator. The default allocation size (which can be tuned
  if you're okay calling the painter functions directly from the type
  function) is 1024 items, or 4096 bytes under our new Polygon.Edge
  packed struct. This has allowed us to drop the use of external
  allocators completely from non-AA paint. As part of this, we've
  dropped the X-coordinate in an edge to an i30 (this might be dropped
  lower as well, depending on the result of our internal numerics work).

* We've also dropped the ArenaAllocator from AA paint as it's not
  bringing value anymore, and probably just getting in the way of our
  resize.

All of these are in advance of fully addressing #44 - lowering the
allocation footprint of the painter will make the path to infallibility
much easier.
@vancluever marked this pull request as ready for review on October 22, 2024 19:36
@vancluever merged commit 9f5bf8b into main on Oct 23, 2024
2 checks passed
@vancluever deleted the downsample-in-place branch on October 23, 2024 01:55
@vancluever

Just appending some quick benchmarks (using poop) to this PR below.

This tested main at 5541db7, compared to 3eb8352, just running the binary generated by zig build spec. Note the 2MB drop in peak RSS in spite of the addition of extra tests.

The slight increase in wall time can likely be accounted for by the addition of the extra tests (2 more were added for this PR).

Test suite (main, no update)

Benchmark 1 (3 runs): .zig-cache/o/543d8339ae636c49585851915f8c3790/spec
  measurement          mean ± σ            min … max           outliers
  wall_time          5.31s  ± 19.3ms    5.30s  … 5.33s           0 ( 0%)
  peak_rss           17.4MB ± 63.8KB    17.4MB … 17.5MB          0 ( 0%)
  cpu_cycles         23.6G  ± 65.7M     23.5G  … 23.7G           0 ( 0%)
  instructions       59.2G  ± 13.2K     59.2G  … 59.2G           0 ( 0%)
  cache_references   79.8M  ±  522K     79.4M  … 80.4M           0 ( 0%)
  cache_misses       5.93M  ±  224K     5.68M  … 6.11M           0 ( 0%)
  branch_misses      7.65M  ± 28.4K     7.63M  … 7.68M           0 ( 0%)

Test suite (pre-memory fixes, no update)

Benchmark 1 (3 runs): .zig-cache/o/3b12e417959e3cc5eef718db74b72319/spec
  measurement          mean ± σ            min … max           outliers
  wall_time          5.25s  ± 12.8ms    5.24s  … 5.26s           0 ( 0%)
  peak_rss           19.6MB ±  194KB    19.4MB … 19.8MB          0 ( 0%)
  cpu_cycles         23.3G  ± 55.4M     23.2G  … 23.3G           0 ( 0%)
  instructions       57.7G  ± 37.2K     57.7G  … 57.7G           0 ( 0%)
  cache_references   89.1M  ±  420K     88.7M  … 89.5M           0 ( 0%)
  cache_misses       5.83M  ±  211K     5.68M  … 6.08M           0 ( 0%)
  branch_misses      8.08M  ± 19.7K     8.06M  … 8.10M           0 ( 0%)

@vancluever

Also did some optimized (ReleaseFast) runs of the individual alpha channel tests. The improvement in peak RSS is still only around 18%, far off from what I would have expected based on the savings in allocations (one less 1x alpha8 surface plus a full-size RGBA surface, which is basically another 4x; this is where the 55% estimate comes from, since ultimately we only need the initial 4x RAM for the initial AA raster of the mask, which is then downsampled in place). This is, of course, a very opaque test; I feel like we'd need instrumentation to test further. Might update the CHANGELOG entry to adjust expectations.

Note that this is still also a test binary.

zig build spec -Doptimize=ReleaseFast -Dfilter=046_fill_triangle_alpha (main)
Benchmark 1 (339 runs): .zig-cache/o/4105498d3bd52a481e7d4123db612a40/spec
  measurement          mean ± σ            min … max           outliers
  wall_time          14.7ms ±  810us    13.0ms … 17.7ms          6 ( 2%)
  peak_rss           2.25MB ± 24.4KB    1.99MB … 2.25MB          9 ( 3%)
  cpu_cycles         52.7M  ± 1.05M     51.4M  … 60.5M          16 ( 5%)
  instructions        125M  ±  232       125M  …  125M           6 ( 2%)
  cache_references    386K  ± 15.9K      354K  …  472K           6 ( 2%)
  cache_misses       56.5K  ± 4.70K     43.5K  … 73.8K           8 ( 2%)
  branch_misses      79.5K  ± 1.61K     77.0K  … 86.7K          15 ( 4%)
zig build spec -Doptimize=ReleaseFast -Dfilter=047_fill_triangle_alpha_gray (main)
Benchmark 1 (341 runs): .zig-cache/o/fb7bcd85a55d18869d734f5acf74bdf9/spec
  measurement          mean ± σ            min … max           outliers
  wall_time          14.6ms ±  756us    12.9ms … 17.3ms          9 ( 3%)
  peak_rss           2.25MB ± 27.1KB    1.99MB … 2.25MB         12 ( 4%)
  cpu_cycles         51.9M  ± 1.02M     50.7M  … 60.4M          16 ( 5%)
  instructions        123M  ±  236       123M  …  123M           2 ( 1%)
  cache_references    385K  ± 16.9K      340K  …  476K          12 ( 4%)
  cache_misses       57.0K  ± 4.99K     31.5K  … 76.2K          25 ( 7%)
  branch_misses      73.0K  ± 1.39K     70.4K  … 78.4K          17 ( 5%)
zig build spec -Doptimize=ReleaseFast -Dfilter=046_fill_triangle_alpha (backported to 3eb8352)
Benchmark 1 (318 runs): .zig-cache/o/06662136d7b3f96ba2e6d05b0cfb9123/spec
  measurement          mean ± σ            min … max           outliers
  wall_time          15.6ms ±  870us    14.0ms … 18.9ms          2 ( 1%)
  peak_rss           2.68MB ± 66.9KB    2.51MB … 2.78MB          0 ( 0%)
  cpu_cycles         56.5M  ± 1.14M     54.9M  … 64.8M          23 ( 7%)
  instructions        133M  ±  377       133M  …  133M          20 ( 6%)
  cache_references    412K  ± 26.4K      372K  …  772K          16 ( 5%)
  cache_misses       56.7K  ± 4.44K     45.4K  … 75.3K          24 ( 8%)
  branch_misses      80.6K  ± 1.76K     78.1K  … 88.8K          41 (13%)
zig build spec -Doptimize=ReleaseFast -Dfilter=047_fill_triangle_alpha_gray (backported to 3eb8352)
Benchmark 1 (321 runs): .zig-cache/o/69ce2ae9615f94ef86f0636b47ab4186/spec
  measurement          mean ± σ            min … max           outliers
  wall_time          15.5ms ±  769us    14.0ms … 18.5ms          3 ( 1%)
  peak_rss           2.67MB ± 56.8KB    2.51MB … 2.78MB         67 (21%)
  cpu_cycles         55.7M  ± 1.16M     54.5M  … 67.1M          16 ( 5%)
  instructions        130M  ±  349       130M  …  130M          16 ( 5%)
  cache_references    407K  ± 17.8K      346K  …  513K          13 ( 4%)
  cache_misses       55.9K  ± 5.07K     32.3K  … 72.0K          26 ( 8%)
  branch_misses      74.0K  ± 1.42K     71.4K  … 82.0K          56 (17%)

@vancluever commented Oct 23, 2024

Just to summarize what our estimates would be based on our allocation savings (for the sake of the CHANGELOG):

Before this change, we were allocating the following surfaces:

  • A mask at 4x the rendered size at alpha8,
  • A final downsampled version of the mask at 1x alpha8,
  • A foreground surface as RGBA at 1x.

So the estimate before, given a 100x100 polygon, would be:

100*100*4 + 100*100 + 100*100*4  # 4x mask, 1x downsampled (both alpha8, 8bpp), and 1x RGBA (32bpp)
90000 # = 90Kbytes

Now, when our source is non-alpha8 (so RGB or RGBA), we downsample in place, removing the 1x mask allocation:

100*100*4 + 100*100*4  # 4x mask (alpha8, 8bpp), and 1x RGBA (32bpp)
80000 # = 80Kbytes
80000/90000
.88888888888888888888 # ~11% savings

Additionally, we just composite to our mask itself if our source is alpha8, removing the relatively costly RGBA allocation altogether.

100*100*4 # 4x mask (alpha8, 8bpp) only
40000 # = 40Kbytes
40000/90000
.44444444444444444444 # ~55% savings

So this is where the estimates are ultimately derived from. As seen from the very opaque benchmarks above, though, YMMV. Note that in addition to the test code, there's other stuff going on here that could skew things, such as the fact that these tests still run both AA and non-AA passes, plus file comparisons (files are loaded to diff against existing buffers).

Eventually it would be nice to get some instrumentation into the library to test this stuff further, but the change in allocation profile is pretty straightforward.
