Today AO's official binaries only support NVIDIA GPUs and CPUs, but the resounding feedback we've gotten since our release has been to support more hardware backends.
## How to add new backends
We would love to include more backends. In the ideal case the backend is already supported via torch.compile, and testing new hardware is mostly a matter of:
1. Donating a runner we could use for our tests
2. Running our test suite on that runner and adding the necessary test skips (sketched below)
3. Opening a tracker with the skipped tests so we can start improving the hardware support
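For step 2, a minimal sketch of what those skips could look like, using plain pytest markers (the `requires_cuda` helper is hypothetical, not an existing torchao utility):

```python
import pytest
import torch

# Hypothetical marker for tests that exercise CUDA-only paths;
# on a new backend's runner these surface as skips we can track.
requires_cuda = pytest.mark.skipif(
    not torch.cuda.is_available(), reason="requires a CUDA device"
)

@requires_cuda
def test_custom_cuda_kernel():
    ...
```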
The reason we like torch.compile is that we want to avoid a giant list of if conditions in our codebase. Granted, we still have customers for both eager and ExecuTorch where working with the compiler is not realistic, so in those cases we will insist that code is implemented via device-agnostic APIs like the ones listed here: https://dev-discuss.pytorch.org/t/python-c-api-rules-for-device-generic-apis/2511
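As a minimal sketch of the difference, assuming a recent PyTorch where `torch.accelerator` is available (older releases would need a fallback):

```python
import torch

# Device-specific pattern we want to avoid:
#   if torch.cuda.is_available(): device = "cuda"
#   elif torch.xpu.is_available(): device = "xpu"
#   ...

# Device-agnostic pattern: ask PyTorch for whichever accelerator
# is present (CUDA, XPU, MPS, ...) and fall back to CPU.
acc = torch.accelerator.current_accelerator()
device = acc if acc is not None else torch.device("cpu")
x = torch.randn(8, 8, device=device)
```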
One challenge we still need to figure out is that the device-agnostic APIs are only available in more recent versions of PyTorch, whereas our CI tests many versions of PyTorch.
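The gate would presumably look something like this (the 2.6 cutoff is an assumption for illustration, not a verified boundary):

```python
import torch
from packaging.version import Version

# Only use the device-agnostic API on PyTorch versions that ship it,
# so the same module still imports on the older releases our CI tests.
_AT_LEAST_2_6 = Version(torch.__version__.split("+")[0]) >= Version("2.6")

def current_device() -> torch.device:
    if _AT_LEAST_2_6:
        acc = torch.accelerator.current_accelerator()
        if acc is not None:
            return acc
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")
```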
## Binary uploads
Note that people can always install AO from source, but this makes it inconvenient to use, and a lot of the support for more binaries has come from @atalman. The reason building AO is now hard is that it's no longer a pure Python package, and it is unlikely to revert to that state given that the ExecuTorch and PyTorch Edge teams now depend on us to ship their kernels.
## Leveraging torch.compile

For the most part our performance story is leveraging torch.compile, but we should seriously consider having a simple benchmark suite like the one in pytorch/benchmark so we can compare different hardware vendors. This is something @HDCharles has already been looking at.
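Even a tiny harness on top of `torch.utils.benchmark` would let us compare vendors on the same op; the shape and op below are placeholders, not a proposed suite:

```python
import torch
from torch.utils.benchmark import Timer

def bench_matmul(device: str, n: int = 4096) -> float:
    a = torch.randn(n, n, device=device, dtype=torch.float16)
    b = torch.randn(n, n, device=device, dtype=torch.float16)
    # Median runtime in seconds for a single fp16 matmul.
    return Timer(stmt="a @ b", globals={"a": a, "b": b}).blocked_autorange().median

# e.g. compare a new backend against the CUDA baseline:
# print(bench_matmul("cuda"), bench_matmul("xpu"))
```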
- For AMD GPUs the story is simple: we leverage torch.compile(), and since we are unlikely to port our custom ops to HIP, we can get a precise estimate of what chunk of our test suite fails.
- Intel GPUs: same story as AMD GPUs.
- ARM CPUs are the most complicated story, since there are many competing solutions we'd need to benchmark:
  1. torch.compile C++ codegen
  2. Custom low-bit matmuls like the ones in torchao/experimental
  3. The Triton ARM backend
- Metal GPUs will only work via eager instead of torch.compile (see the sketch after this list).
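Concretely, the Metal point probably means a dispatch like the following (`maybe_compile` is a hypothetical helper, not an existing torchao API):

```python
import torch

def maybe_compile(model: torch.nn.Module) -> torch.nn.Module:
    # Metal (MPS) currently only works for us in eager mode,
    # so skip torch.compile there and compile everywhere else.
    if torch.backends.mps.is_available():
        return model
    return torch.compile(model)
```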
## Test suite coverage
So finally, to really say we support hardware backend X, we should be confident in the performance. The baselines are: is our code faster than eager fp16, and, for GPUs, is it somewhat close to NVIDIA performance? We basically need to run our entire test suite per new backend, see how many tests fail or are skipped, and manually chase each one down.
Test granularity might be too small to report, so we can instead look at feature-level support like quantize_(), float8, low_bit_optim, etc. (see the sketch below).
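A throwaway sketch of feature-level reporting, assuming the current `quantize_` API (a real report would aggregate the full test suite rather than one smoke check per feature):

```python
import torch
from torchao.quantization import quantize_, int8_weight_only

def check_quantize_(device: str) -> bool:
    # Smoke check: does weight-only int8 quantization run end to end?
    try:
        m = torch.nn.Sequential(torch.nn.Linear(32, 32)).to(device)
        quantize_(m, int8_weight_only())
        m(torch.randn(2, 32, device=device))
        return True
    except Exception:
        return False

FEATURES = {"quantize_": check_quantize_}
print({name: check("cpu") for name, check in FEATURES.items()})
```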
cc @albanD @atalman @EikanWang @jithunnair-amd @supriyar @digantdesai @kimishpatel @metascroy