Graph optimization is a collection of optimization passes that convert a general network description into a network description suitable for GPU execution. It happens in the constructor of `cldnn::program`. In other words, the input of graph optimization is a `topology` (link) and the output is a `program` (link).
The transformation from the original graph into the final graph is quite complicated, so it is divided into smaller steps (passes). The purpose of this documentation is not to explain every pass in detail, but to explain the key ones.
For debugging purposes, you can dump the optimized graph after each step. See this article for details.
Note: The optimization passes run in sequence, and the number prefixed to each pass name indicates its position in that sequence. However, the numbering might change in the future.
- 00_init: The first step of the optimization. If you want to see the initial clDNN graph, check the dump after this pass. It collects network output node information and sets the node processing order.
- 08_prepare_primitive_fusing: Fuses post-operations into other primitives. For example, ReLU is fused into a convolution, and an element-wise add can usually be fused into its predecessor as well. The layout for a primitive has not been chosen at this point, so it is not yet known which kernel will execute it. However, support for post-operations depends on the chosen kernel, which is why this pass contains some logic to guess the layout. (A simplified fusing sketch is shown after this list.)
- 09_reorder_inputs: Selects the layout format for each primitive. This is done by calling the `layout_optimizer::get_preferred_format` function, which returns the preferred format for a node (or “any”, which means that the format should be propagated from adjacent nodes if possible). Then it propagates formats for the nodes whose preferred format is “any” in order to minimize local reorders, and after propagating formats it inserts the actual reorder nodes into the graph. The result of this pass is a rather complicated graph with many redundant reorders, which are removed later by `remove_redundant_reorders`. (A small sketch of format propagation and reorder insertion is given after this list.)
- 17_remove_redundant_reorders: This pass removes reorders, and it has two conceptual purposes. The first is removing redundant reorders: for example, when the network contains a pattern like `reorder - reorder - reorder`, it can be shrunk into a single `reorder`. (A sketch of this chain collapsing is shown after this list.) The second is supporting cross-layout operation of a primitive: for example, when a `convolution` needs to receive `bfyx` input and generate `b_fs_yx_fsv16` output, the initial graph from `reorder_inputs` looks as follows: `data(bfyx) --> reorder(b_fs_yx_fsv16) --> convolution(b_fs_yx_fsv16)`. This pass looks for such a pattern and removes the reorder, producing a cross-layout graph for the target convolution: `data(bfyx) --> convolution(b_fs_yx_fsv16)`.
- 19_prepare_buffer_fusing: This pass implements implicit concat and implicit crop. Implicit concat removes a `concatenation` primitive when its predecessors can write their results directly into the concat's target buffer. For example, if two convolution results are concatenated along the f-axis, the layout is `bfyx`, and `b=1`, you can simply remove the concat primitive and adjust the output addresses of the convolutions so that they point to the proper locations inside the shared buffer. (The offset arithmetic is sketched after this list.)
- 20_add_required_reorders: This pass tries to keep the graph consistent and adds a reorder if the current format is not supported by a node. It checks whether the current input format is present in `implementation_map<op_t>`, defined in the `<op_type>_gpu.cpp` file. If it is not, this pass tries to change the layout to one of the most common formats (`bfyx`, `yxfb`, `byxf`) and picks the first supported one. (See the sketch after this list.)
- 21_add_onednn_optimization_attributes: This pass generates oneDNN attributes for post-operations (link). The OpenVINO GPU plugin (clDNN) has its own set of post-operations, so some transformation is required to map them onto oneDNN post-operations.
- 22_compile_graph: This pass creates a `primitive_impl` through the kernel selector. In this pass, the kernel for each node is chosen. For oneDNN primitives, the OpenCL code is compiled at this stage; for clDNN primitives, the OpenCL code is compiled after all passes have run.
- 26_propagate_constants: This pass reorders the weights of convolution, deconvolution, and fully connected layers into the required format. Because the kernel was chosen in the `compile_graph` stage, it is now known what kind of reordering the weights require: the weights are stored in a simple planar format in the IR, but an optimized convolution (or deconvolution/FC) kernel usually requires a different format. To reorder the weights, this pass creates a simple graph that receives the weights and generates the reordered weights. The reordered weights are obtained by executing that network, and they are then inserted back into the original graph. (An example of such weight reordering is sketched after this list.)
- 31_oooq_memory_dependencies: On a GPU, device memory is a limited resource, and it is not necessary to keep all intermediate results while running inference on a network. Therefore, a buffer is reused once its contents are no longer needed. However, it must be taken into account that the `Intel_GPU` plugin uses an out-of-order queue. Since the exact execution order is not known in advance, there is an additional restriction on buffer reuse. For example, in a multi-branch structure such as an Inception block, there are no direct dependencies between the branches except for the common ancestor. In out-of-order-queue (OOOQ) execution mode, because the execution order inside the Inception module is not known, a buffer from one branch must not be reused by another branch. Such implicit dependency information is computed in this pass. (A reachability-based sketch of this rule is given after this list.)
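
To illustrate the idea behind `prepare_primitive_fusing`, here is a minimal sketch with a made-up node model (the `node` struct and `fuse_simple_activations` function below are hypothetical, not the plugin's API): an activation is folded into its producer only when that producer has exactly one user, which is the basic safety condition for post-operation fusing.

```cpp
#include <cstdio>
#include <memory>
#include <string>
#include <vector>

// Hypothetical, simplified node model; the real pass operates on the plugin's
// program graph, not on this struct.
struct node {
    std::string type;                 // e.g. "convolution", "relu"
    std::vector<node*> deps;          // inputs
    std::vector<node*> users;         // outputs
    std::vector<std::string> fused;   // post-ops fused into this node
};

// Fuse eltwise-like post-ops (here just "relu") into their single producer.
void fuse_simple_activations(std::vector<std::unique_ptr<node>>& graph) {
    for (auto& n : graph) {
        if (n->type != "relu" || n->deps.size() != 1)
            continue;
        node* producer = n->deps[0];
        // Only safe when the activation is the producer's only user.
        if (producer->users.size() != 1)
            continue;
        producer->fused.push_back(n->type);
        // Re-route the activation's users to the producer (detach the relu node).
        producer->users = n->users;
        for (node* u : n->users)
            for (auto& d : u->deps)
                if (d == n.get()) d = producer;
        n->type = "removed";
    }
}

int main() {
    std::vector<std::unique_ptr<node>> g;
    g.push_back(std::make_unique<node>(node{"convolution", {}, {}, {}}));
    g.push_back(std::make_unique<node>(node{"relu", {g[0].get()}, {}, {}}));
    g[0]->users.push_back(g[1].get());
    fuse_simple_activations(g);
    std::printf("conv fused ops: %zu\n", g[0]->fused.size());  // prints 1
    return 0;
}
```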
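
The format selection in `reorder_inputs` can be pictured with a similar toy model (again hypothetical types, not the real `layout_optimizer`): nodes whose preferred format is “any” inherit a format from a neighbor, and a reorder is then inserted on every edge whose endpoints still disagree.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical model: a linear chain of nodes, each with a preferred format.
struct node {
    std::string name;
    std::string format;  // "bfyx", "b_fs_yx_fsv16", or "any"
};

// Propagate formats into "any" nodes from the previous node to avoid local reorders.
void propagate_formats(std::vector<node>& chain) {
    for (size_t i = 1; i < chain.size(); ++i)
        if (chain[i].format == "any")
            chain[i].format = chain[i - 1].format;
}

// Insert an explicit reorder wherever two adjacent nodes still disagree.
std::vector<node> insert_reorders(const std::vector<node>& chain) {
    std::vector<node> out;
    for (size_t i = 0; i < chain.size(); ++i) {
        if (i > 0 && chain[i].format != chain[i - 1].format)
            out.push_back({"reorder", chain[i].format});
        out.push_back(chain[i]);
    }
    return out;
}

int main() {
    std::vector<node> chain = {
        {"data", "bfyx"}, {"pooling", "any"}, {"convolution", "b_fs_yx_fsv16"}};
    propagate_formats(chain);
    for (const node& n : insert_reorders(chain))
        std::printf("%s(%s)\n", n.name.c_str(), n.format.c_str());
    // data(bfyx), pooling(bfyx), reorder(b_fs_yx_fsv16), convolution(b_fs_yx_fsv16)
    return 0;
}
```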
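
A minimal sketch of the chain-collapsing part of `remove_redundant_reorders`, assuming a simple linear chain instead of the real program graph: consecutive reorders are folded so that only the last layout conversion survives.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical chain of (primitive, output format) pairs.
struct step {
    std::string type;
    std::string format;
};

// reorder -> reorder -> ... -> reorder collapses into the last reorder of the run,
// because only the final layout matters for the consumer.
std::vector<step> collapse_reorder_chains(const std::vector<step>& chain) {
    std::vector<step> out;
    for (const step& s : chain) {
        if (s.type == "reorder" && !out.empty() && out.back().type == "reorder")
            out.back() = s;   // previous reorder is redundant; keep only the latest
        else
            out.push_back(s);
    }
    return out;
}

int main() {
    std::vector<step> chain = {
        {"data", "bfyx"},
        {"reorder", "yxfb"},
        {"reorder", "bfyx"},
        {"reorder", "b_fs_yx_fsv16"},
        {"convolution", "b_fs_yx_fsv16"}};
    for (const step& s : collapse_reorder_chains(chain))
        std::printf("%s(%s) ", s.type.c_str(), s.format.c_str());
    std::printf("\n");  // data(bfyx) reorder(b_fs_yx_fsv16) convolution(b_fs_yx_fsv16)
    return 0;
}
```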
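
The address manipulation behind implicit concat in `prepare_buffer_fusing` boils down to index arithmetic. The following standalone illustration assumes a contiguous `bfyx` buffer with `b=1`; it is not plugin code, it only shows why each producer can write at a feature offset inside the concat output buffer so that no copy is needed.

```cpp
#include <cstdio>
#include <vector>

// bfyx layout with b == 1: element (f, y, x) lives at f*Y*X + y*X + x.
// If conv0 produces F0 features and conv1 produces F1 features, concatenating
// along f means conv1 can write directly at feature offset F0 of the shared buffer.
int main() {
    const int F0 = 2, F1 = 3, Y = 2, X = 2;
    std::vector<float> concat_buffer((F0 + F1) * Y * X, 0.0f);

    // Writing through an offset pointer stands in for "adjusting the output address"
    // of each convolution instead of running a real concatenation primitive.
    float* conv0_out = concat_buffer.data();               // features [0, F0)
    float* conv1_out = concat_buffer.data() + F0 * Y * X;  // features [F0, F0 + F1)

    for (int i = 0; i < F0 * Y * X; ++i) conv0_out[i] = 1.0f;  // pretend conv0 result
    for (int i = 0; i < F1 * Y * X; ++i) conv1_out[i] = 2.0f;  // pretend conv1 result

    // The concat output is already assembled: features 0..F0-1 hold 1.0, the rest 2.0.
    std::printf("first=%g last=%g\n", concat_buffer.front(), concat_buffer.back());
    return 0;
}
```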
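
The fallback in `add_required_reorders` is essentially “try the common formats in order and take the first one the implementation supports”. A tiny sketch with a hypothetical support table standing in for the real `implementation_map<op_t>` lookup:

```cpp
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

// Stand-in for querying implementation_map<op_t>: returns the formats an op supports.
// The table below is illustrative only, not the plugin's real support matrix.
std::vector<std::string> supported_formats(const std::string& op_type) {
    if (op_type == "my_custom_op") return {"yxfb", "byxf"};
    return {"bfyx", "yxfb", "byxf"};
}

// If the current format is unsupported, fall back to the first supported one
// among the most common formats; a reorder to that format would then be inserted.
std::string choose_format(const std::string& op_type, const std::string& current) {
    const std::vector<std::string> supported = supported_formats(op_type);
    auto is_supported = [&](const std::string& fmt) {
        return std::find(supported.begin(), supported.end(), fmt) != supported.end();
    };
    if (is_supported(current))
        return current;  // nothing to do, no reorder needed
    for (const std::string& fmt :
         {std::string("bfyx"), std::string("yxfb"), std::string("byxf")})
        if (is_supported(fmt))
            return fmt;  // first supported common format wins
    return current;      // no candidate found; the graph stays as-is
}

int main() {
    std::printf("%s\n", choose_format("my_custom_op", "b_fs_yx_fsv16").c_str());  // yxfb
    return 0;
}
```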
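
What `propagate_constants` achieves for the weights can be illustrated with the index math of one assumed blocked weight format. Here, planar `oiyx` weights are rearranged so that output channels are stored in blocks of 16 (similar in spirit to formats such as `os_iyx_osv16`); the exact layout required by a given kernel is decided by the kernel selector and may differ.

```cpp
#include <cstdio>
#include <vector>

// Rearrange planar oiyx weights into an output-channel-blocked layout
// (blocks of OSV output channels stored innermost), zero-padding when O is
// not a multiple of OSV. Illustrative only; real kernels may need other formats.
std::vector<float> reorder_oiyx_to_blocked(const std::vector<float>& w,
                                           int O, int I, int Y, int X, int OSV = 16) {
    const int O_padded = (O + OSV - 1) / OSV * OSV;
    std::vector<float> out(static_cast<size_t>(O_padded) * I * Y * X, 0.0f);
    for (int o = 0; o < O; ++o)
        for (int i = 0; i < I; ++i)
            for (int y = 0; y < Y; ++y)
                for (int x = 0; x < X; ++x) {
                    const size_t src = ((static_cast<size_t>(o) * I + i) * Y + y) * X + x;
                    // Block index over output channels, then the planar i/y/x part,
                    // then the position of this output channel inside its block.
                    const size_t dst =
                        (static_cast<size_t>(o / OSV) * I * Y * X +
                         (static_cast<size_t>(i) * Y + y) * X + x) * OSV + (o % OSV);
                    out[dst] = w[src];
                }
    return out;
}

int main() {
    const int O = 18, I = 3, Y = 1, X = 1;
    std::vector<float> w(static_cast<size_t>(O) * I * Y * X);
    for (size_t n = 0; n < w.size(); ++n) w[n] = static_cast<float>(n);
    std::vector<float> blocked = reorder_oiyx_to_blocked(w, O, I, Y, X);
    std::printf("planar size=%zu blocked size=%zu\n", w.size(), blocked.size());  // 54 vs 96
    return 0;
}
```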
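
Finally, the extra reuse restriction added by `oooq_memory_dependencies` can be modeled with reachability: with an out-of-order queue, the buffer of node A may be shared with node B only if one of them is (transitively) a dependency of the other, so nodes on parallel branches must be treated as if they could be live at the same time. The toy adjacency-list sketch below is not the plugin's implementation.

```cpp
#include <cstdio>
#include <vector>

// Toy DAG: adjacency list of user edges (i -> its users).
using graph = std::vector<std::vector<int>>;

// True if 'to' is reachable from 'from' by following user edges.
bool reachable(const graph& g, int from, int to) {
    std::vector<int> stack = {from};
    std::vector<bool> seen(g.size(), false);
    while (!stack.empty()) {
        int n = stack.back();
        stack.pop_back();
        if (n == to) return true;
        if (seen[n]) continue;
        seen[n] = true;
        for (int u : g[n]) stack.push_back(u);
    }
    return false;
}

// With an out-of-order queue, two nodes may share (reuse) a buffer only if they are
// ordered by an execution dependency; parallel branches may run concurrently.
bool may_reuse_buffer(const graph& g, int a, int b) {
    return reachable(g, a, b) || reachable(g, b, a);
}

int main() {
    // 0 = common ancestor, 1 and 2 = two Inception-like branches, 3 = concat.
    graph g = {{1, 2}, {3}, {3}, {}};
    std::printf("branch1 vs branch2: %s\n",
                may_reuse_buffer(g, 1, 2) ? "may reuse" : "must not reuse");
    std::printf("ancestor vs branch1: %s\n",
                may_reuse_buffer(g, 0, 1) ? "may reuse" : "must not reuse");
    return 0;
}
```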