
Painter: Memory optimizations #50

Merged
vancluever merged 1 commit into main from downsample-in-place on Oct 23, 2024

Conversation

@vancluever commented Oct 21, 2024

This adds a number of memory optimizations to the painter:

  • Downsampling is now done in-place when using AA. Memory is freed after the downsample. This avoids an extra allocation for the downsampled surface.

  • Additionally in AA, if our pixel source is alpha8, we fast-path to just using the mask as the foreground, doing a composition with the pixel source if it's not opaque. This avoids another allocation under this scenario!

  • Finally, we have moved the edge calculations to a FixedBufferAllocator. The default allocation size (which can be tuned if you're okay calling the painter functions directly from the type function) is 1024 items, or 4096 bytes under our new Polygon.Edge packed struct; a rough sketch of this follows below the list. This has allowed us to drop the use of external allocators completely from non-AA paint. As part of this, we've dropped the X-coordinate in an edge to an i30 (this might be dropped lower as well, depending on the result of our internal numerics work).

  • We've also dropped the ArenaAllocator from AA paint as it's not bringing value anymore, and probably just getting in the way of our resize.

All of these are in advance of fully addressing #44 - lowering the allocation footprint of the painter will make the path to infallibility much easier.
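As a rough illustration of the fixed-buffer edge cache described above, here is a minimal, self-contained sketch of std.heap.FixedBufferAllocator fronting a 4096-byte buffer. The buffer size matches the default mentioned in the description, but the names and structure are assumptions rather than the actual z2d code:

```zig
const std = @import("std");

pub fn main() !void {
    // Default edge cache: 1024 edges at 4 bytes each (the packed edge layout
    // discussed above) is 4096 bytes of fixed storage.
    var edge_buf: [4096]u8 align(4) = undefined;
    var fba = std.heap.FixedBufferAllocator.init(&edge_buf);
    const alloc = fba.allocator();

    // Edge storage comes out of the fixed buffer, so typical non-AA fills
    // never reach an external allocator; overflowing the buffer returns
    // error.OutOfMemory rather than silently spilling to the heap.
    const edges = try alloc.alloc(u32, 1024);
    std.debug.print("cached {d} edges in {d} bytes\n", .{ edges.len, edge_buf.len });
}
```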

@vancluever

Currently a draft: I need to address the fact that the tunability of the FBA isn't what it says on the tin. I'm trying to figure out a good way to supply the comptime value; it looks like struct fields that are comptime are functionally equivalent to constants and can't be changed that way (i.e., we can't shuttle one in from the context).
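For context, a minimal sketch of the Zig behavior being referred to here, using a hypothetical edge_cache_size option rather than anything from the actual z2d source:

```zig
const std = @import("std");

const Opts = struct {
    // A comptime field must have a default value, and in practice every
    // instance carries that value; it behaves like a constant rather than
    // per-instance state, so it can't be "shuttled in" from a runtime context.
    comptime edge_cache_size: usize = 1024,
};

test "comptime field behaves like a constant" {
    const opts: Opts = .{};
    try std.testing.expectEqual(@as(usize, 1024), opts.edge_cache_size);

    // Initializing it to anything other than the default is a compile error
    // (roughly: "value stored in comptime field does not match the default"):
    // const bad: Opts = .{ .edge_cache_size = 2048 };
}
```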

@vancluever force-pushed the downsample-in-place branch 4 times, most recently from ba3ab00 to 905463d on October 21, 2024 18:45
@vancluever

Also TODO: compositing onto the mask surface for non-fully-opaque alpha (I've planned this too, just wanted this up to start).
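As a rough sketch of what that composition amounts to (assuming a constant, non-opaque source alpha and a simple multiply; this is not the actual z2d compositor, just an illustration):

```zig
const std = @import("std");

// Scale each coverage value in the alpha8 mask by the source alpha, so the
// mask itself can serve as the foreground and no separate RGBA surface is
// needed. Hypothetical helper for illustration only.
fn applySourceAlpha(mask: []u8, source_alpha: u8) void {
    if (source_alpha == 255) return; // fully opaque: the mask is already the result
    for (mask) |*coverage| {
        const scaled = @as(u16, coverage.*) * @as(u16, source_alpha);
        coverage.* = @intCast(scaled / 255);
    }
}

test applySourceAlpha {
    var mask = [_]u8{ 0, 128, 255 };
    applySourceAlpha(&mask, 128);
    try std.testing.expectEqual(@as(u8, 128), mask[2]);
}
```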

@vancluever

Edge cache size parameterization is fixed, but unfortunately the way you can tweak it is a bit obscure (calling painter methods directly off of a new permutation of the type function). However, this is in line with what we are eventually working on in #44; although I have not shared full details, the intention is to have the context be a managed API over a number of unmanaged components. As such, if you needed a custom value for the edge cache, you would just opt into unmanaged behavior.

Some things needed to be shuffled around to preserve some of the error checking, and will be refactored soon (really just a result of moving the path checking fully down into the painter).

Also, the edge x-coordinate limit has been updated to i30, which allows us to pack it and the direction flag into a u32, ensuring that our default of 4096 bytes equates to 1k items versus 512. This limit is fine, and I might end up decreasing it anyway if we go to fixed point as described in #26 (in that case, with 24.8, we can just make it i24).
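To illustrate the packing arithmetic (the i30 x-coordinate plus direction fitting in a u32, and the 4096-byte default holding 1024 entries), here's a small sketch; the exact field layout is an assumption, only the sizes come from the description:

```zig
const std = @import("std");

// Illustrative layout only: an i30 x-coordinate plus a small signed direction
// field packs into exactly 32 bits.
const Edge = packed struct {
    x: i30,
    dir: i2,
};

comptime {
    std.debug.assert(@bitSizeOf(Edge) == 32);
    std.debug.assert(@sizeOf(Edge) == 4);
    // The default 4096-byte edge cache now holds 1024 entries instead of 512.
    std.debug.assert(4096 / @sizeOf(Edge) == 1024);
}
```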

This adds a number of memory optimizations to the painter:

* Downsampling is now done in-place when using AA. Memory is freed after
  the downsample. This avoids an extra allocation for the downsampled
  surface.

* Additionally in AA, if our pixel source is alpha8, we fast-path to
  just using the mask as the foreground, doing a composition with the
  pixel source if it's not opaque. This avoids another allocation under
  this scenario!

* Finally, we have moved the edge calculations to a
  FixedBufferAllocator. The default allocation size (which can be tuned
  if you're okay calling the painter functions directly from the type
  function) is 1024 items, or 4096 bytes under our new Polygon.Edge
  packed struct. This has allowed us to drop the use of external
  allocators completely from non-AA paint. As part of this, we've
  dropped the X-coordinate in an edge to an i30 (this might be dropped
  lower as well, depending on the result of our internal numerics work).

* We've also dropped the ArenaAllocator from AA paint as it's not
  bringing value anymore, and probably just getting in the way of our
  resize.

All of these are in advance of fully addressing #44 - lowering the
allocation footprint of the painter will make the path to infallibility
much easier.
@vancluever marked this pull request as ready for review on October 22, 2024 19:36
@vancluever merged commit 9f5bf8b into main on Oct 23, 2024
2 checks passed
@vancluever deleted the downsample-in-place branch on October 23, 2024 01:55
@vancluever

Just appending some quick benchmarks (using poop) to this PR below.

This tested main at 5541db7, compared to 3eb8352, just running the binary generated by zig build spec. Note the 2MB drop in peak RSS in spite of the addition of extra tests.

The slight increase in wall time can likely be accounted for by the addition of the extra tests (2 more were added for this PR).

Test suite (main, no update)

Benchmark 1 (3 runs): .zig-cache/o/543d8339ae636c49585851915f8c3790/spec
  measurement          mean ± σ            min … max           outliers
  wall_time          5.31s  ± 19.3ms    5.30s  … 5.33s           0 ( 0%)
  peak_rss           17.4MB ± 63.8KB    17.4MB … 17.5MB          0 ( 0%)
  cpu_cycles         23.6G  ± 65.7M     23.5G  … 23.7G           0 ( 0%)
  instructions       59.2G  ± 13.2K     59.2G  … 59.2G           0 ( 0%)
  cache_references   79.8M  ±  522K     79.4M  … 80.4M           0 ( 0%)
  cache_misses       5.93M  ±  224K     5.68M  … 6.11M           0 ( 0%)
  branch_misses      7.65M  ± 28.4K     7.63M  … 7.68M           0 ( 0%)

Test suite (pre-memory fixes, no update)

Benchmark 1 (3 runs): .zig-cache/o/3b12e417959e3cc5eef718db74b72319/spec
  measurement          mean ± σ            min … max           outliers
  wall_time          5.25s  ± 12.8ms    5.24s  … 5.26s           0 ( 0%)
  peak_rss           19.6MB ±  194KB    19.4MB … 19.8MB          0 ( 0%)
  cpu_cycles         23.3G  ± 55.4M     23.2G  … 23.3G           0 ( 0%)
  instructions       57.7G  ± 37.2K     57.7G  … 57.7G           0 ( 0%)
  cache_references   89.1M  ±  420K     88.7M  … 89.5M           0 ( 0%)
  cache_misses       5.83M  ±  211K     5.68M  … 6.08M           0 ( 0%)
  branch_misses      8.08M  ± 19.7K     8.06M  … 8.10M           0 ( 0%)

@vancluever

Also did some optimized (ReleaseFast) runs of the individual alpha channel tests. The improvement in peak RSS is still only around 18%, far off from what I would have expected based on the savings in allocations (one less 1x alpha8 surface plus a full-size RGBA surface, which is basically another 4x; this is where the 55% estimate comes from, since ultimately we only need the initial 4x RAM for the initial AA raster of the mask, which is then downsampled in place). This is, of course, a very opaque test; I feel like we'd need instrumentation to test further. Might update the CHANGELOG entry to adjust expectations.

Note that this is still also a test binary.

zig build spec -Doptimize=ReleaseFast -Dfilter=046_fill_triangle_alpha (main)
Benchmark 1 (339 runs): .zig-cache/o/4105498d3bd52a481e7d4123db612a40/spec
  measurement          mean ± σ            min … max           outliers
  wall_time          14.7ms ±  810us    13.0ms … 17.7ms          6 ( 2%)
  peak_rss           2.25MB ± 24.4KB    1.99MB … 2.25MB          9 ( 3%)
  cpu_cycles         52.7M  ± 1.05M     51.4M  … 60.5M          16 ( 5%)
  instructions        125M  ±  232       125M  …  125M           6 ( 2%)
  cache_references    386K  ± 15.9K      354K  …  472K           6 ( 2%)
  cache_misses       56.5K  ± 4.70K     43.5K  … 73.8K           8 ( 2%)
  branch_misses      79.5K  ± 1.61K     77.0K  … 86.7K          15 ( 4%)
zig build spec -Doptimize=ReleaseFast -Dfilter=047_fill_triangle_alpha_gray (main)
Benchmark 1 (341 runs): .zig-cache/o/fb7bcd85a55d18869d734f5acf74bdf9/spec
  measurement          mean ± σ            min … max           outliers
  wall_time          14.6ms ±  756us    12.9ms … 17.3ms          9 ( 3%)
  peak_rss           2.25MB ± 27.1KB    1.99MB … 2.25MB         12 ( 4%)
  cpu_cycles         51.9M  ± 1.02M     50.7M  … 60.4M          16 ( 5%)
  instructions        123M  ±  236       123M  …  123M           2 ( 1%)
  cache_references    385K  ± 16.9K      340K  …  476K          12 ( 4%)
  cache_misses       57.0K  ± 4.99K     31.5K  … 76.2K          25 ( 7%)
  branch_misses      73.0K  ± 1.39K     70.4K  … 78.4K          17 ( 5%)
zig build spec -Doptimize=ReleaseFast -Dfilter=046_fill_triangle_alpha (backported to 3eb8352)
Benchmark 1 (318 runs): .zig-cache/o/06662136d7b3f96ba2e6d05b0cfb9123/spec
  measurement          mean ± σ            min … max           outliers
  wall_time          15.6ms ±  870us    14.0ms … 18.9ms          2 ( 1%)
  peak_rss           2.68MB ± 66.9KB    2.51MB … 2.78MB          0 ( 0%)
  cpu_cycles         56.5M  ± 1.14M     54.9M  … 64.8M          23 ( 7%)
  instructions        133M  ±  377       133M  …  133M          20 ( 6%)
  cache_references    412K  ± 26.4K      372K  …  772K          16 ( 5%)
  cache_misses       56.7K  ± 4.44K     45.4K  … 75.3K          24 ( 8%)
  branch_misses      80.6K  ± 1.76K     78.1K  … 88.8K          41 (13%)
zig build spec -Doptimize=ReleaseFast -Dfilter=047_fill_triangle_alpha_gray (backported to 3eb8352)
Benchmark 1 (321 runs): .zig-cache/o/69ce2ae9615f94ef86f0636b47ab4186/spec
  measurement          mean ± σ            min … max           outliers
  wall_time          15.5ms ±  769us    14.0ms … 18.5ms          3 ( 1%)
  peak_rss           2.67MB ± 56.8KB    2.51MB … 2.78MB         67 (21%)
  cpu_cycles         55.7M  ± 1.16M     54.5M  … 67.1M          16 ( 5%)
  instructions        130M  ±  349       130M  …  130M          16 ( 5%)
  cache_references    407K  ± 17.8K      346K  …  513K          13 ( 4%)
  cache_misses       55.9K  ± 5.07K     32.3K  … 72.0K          26 ( 8%)
  branch_misses      74.0K  ± 1.42K     71.4K  … 82.0K          56 (17%)

@vancluever commented Oct 23, 2024

Just to summarize what our estimates would be based on our allocation savings (for the sake of the CHANGELOG):

Before this change, we were allocating the following surfaces:

  • A mask at 4x the rendered size at alpha8,
  • A final downsampled version of the mask at 1x alpha8,
  • A foreground surface as RGBA at 1x.

So the estimate before, given a 100x100 polygon, would be:

100*100*4 + 100*100 + 100*100*4  # 4x mask, 1x downsampled (both alpha8, 8bpp), and 1x RGBA (32bpp)
90000 # = 90Kbytes

Now, when our source is non-alpha8 (so RGB or RGBA), we downsample in place, removing the 1x mask allocation:

100*100*4 + 100*100*4  # 4x mask (alpha8, 8bpp), and 1x RGBA (32bpp)
80000 # = 80Kbytes
80000/90000
.88888888888888888888 # ~11% savings

Additionally, we just composite to our mask itself if our source is alpha8, removing the relatively costly RGBA allocation altogether.

100*100*4 # 4x mask (alpha8, 8bpp) only
40000 # = 40Kbytes
40000/90000
.44444444444444444444 # ~55% savings

So this is where the estimates are ultimately derived from. As seen from the very opaque benchmarks above, though, YMMV. Note that in addition to the test code, there's other stuff going on here that could skew things, such as the fact that these tests still run both AA and non-AA passes, plus file comparisons (files are loaded to diff against existing buffers).

Eventually it would be nice to get some instrumentation into the library to test this stuff further, but the change in allocation profile is pretty straightforward.
