Kothari, Akash via llvm-dev
2021-Nov-12 19:28 UTC
[llvm-dev] [RFC] Proposal for TLX: Tensor LLVM eXtensions
**** Proposal for TLX: Tensor LLVM eXtensions
==================================================================================
Authors: Akash Kothari (UIUC), Abdul Rafae Noor (UIUC), Dounia Khaldi (Intel),
Vikram Adve (UIUC), Yuanke Luo (Intel), Sudipta Sengupta (Amazon AWS),
Milind Girkar (Intel), Charith Mendis (UIUC)
------------------------------------------------------------------------------------

*** RATIONALE
==================================================================================
Diverse hardware vendors are developing new hardware support for (mostly dense) tensor computations, which have become increasingly important for machine learning applications. These include both ISA extensions on CPUs and GPUs (such as Intel AMX, Power MMA, NVIDIA's tensor cores, AMD's matrix cores, and Qualcomm's HVX vector ISA) and dedicated accelerators for compute offload (such as NVIDIA's NVDLA, Amazon's Inferentia and Trainium, and numerous ML accelerators from smaller companies). While ML workloads are the primary motivation and likely to be the dominant use cases, other tensor-intensive application domains, such as image processing, scientific computing, quantum simulations, and financial modeling, can benefit from this hardware support as well, via languages like C++, DPC++, Julia, Fortran, Halide, CUDA, and OpenCL.

LLVM can play a crucial role in making it easier for these vendors to create optimizing compiler back-ends for their emerging hardware, if the existing vector and matrix support in LLVM were generalized to support tensor operations. LLVM is already widely used today by many of the vendors that develop these tensor architectures, e.g., to target CPUs and GPUs. LLVM is highly retargetable, by design. For CPU targets, LLVM allows an integrated code generation framework for tensor operations with optimized intermixing of scalar, 1-D vector and 2-D matrix operations in the same code section (e.g., a loop body). And LLVM has front-ends for a wide range of high-level languages, including essentially all the languages widely used for the relevant application domains today.

No existing infrastructure we know of meets these needs. MLIR is likely the best option, and we believe it is entirely complementary to LLVM. MLIR provides strong support for high-level tensor operations in TOSA, relevant optimizations in Affine and Linalg, and lowering paths to accelerators, GPUs and (via the LLVM dialect) CPUs. Crucially, however, MLIR does not have a separate low-level code generation framework that is retargetable to diverse hardware: it relies on LLVM for this purpose. If LLVM could be extended with tensor operations and a corresponding retargetable tensor code generation framework, MLIR could leverage this as well. Moreover, enough vendors and languages rely heavily on LLVM (but do not use MLIR) that it seems worthwhile to have high-quality tensor code generation in both LLVM and MLIR. Ideally, both systems would largely share the same code.

The broad goal of our project is to add a retargetable tensor code generation framework to LLVM. We are currently working on a prototype implementation with our collaborators at Amazon AWS, Intel, IBM and Qualcomm. This RFC focuses on the first stage: extending the LLVM IR with tensor operations, which we refer to as TLX (Tensor LLVM eXtensions).
*** OVERALL PROJECT OBJECTIVES
==============================================================================
* A unified retargetable code generation and optimization framework for LLVM to target diverse tensor architectures with a common set of IR extensions, instead of using target-specific solutions.
* (Subject of this RFC.) A single set of target-agnostic tensor extensions in LLVM that higher-level tensor code generation frameworks such as XLA, Halide, TVM, and MLIR can target, instead of lowering to target-specific intrinsics in LLVM, while retaining the optimizations performed in those high-level frameworks.
* A pathway for LLVM-based languages such as C/C++, DPC++, Fortran, Rust, and Julia that do not have frontends for compiler systems like MLIR, TVM, or XLA to target modern tensor architectures by lowering to our tensor extensions in LLVM.
* Target-independent optimizations (e.g., peephole and generic SSA-based optimizations) and flexible code generation capabilities in LLVM that can mix instructions operating on vector and rectangular registers, including cost models that help reduce register spills and maximize usage of available hardware resources.
* Contribute our tensor extensions (this RFC) and retargetable code generation framework (as a follow-up) to the LLVM project for the community to experiment with and provide feedback.

*** RFC: INTRODUCTION OF TENSOR CONCEPT IN LLVM
=====================================================================
To achieve our objectives, we need to introduce the concept of tensors in LLVM. To do this, we add a tensor type: an N-dimensional data type generalizing 1-D vectors and 2-D matrices. We also add the crucial tensor operations that front-ends for high-level languages can target, and that represent, or can be implemented via, the ISAs of different tensor architectures.

*** IMPLEMENTATION OF TENSOR TYPE IN LLVM
======================================================================
** OVERVIEW:
----------------------
The concept of dense tensors could be implemented as a new, first-class n-dimensional vector type in LLVM. However, doing so would be extremely intrusive, requiring changes to hundreds of files in LLVM. While this may be the correct option in the long term, once the design has been properly evaluated and refined, the effort required for an initial prototype and evaluation is not justified. So we propose to implement the tensor concept as an LLVM intrinsic called llvm.tensor.typeinfo, while representing tensor data in "flattened" form as ordinary LLVM vector types. The intrinsic takes as operands a "flattened" LLVM vector, together with shape, layout and padding vectors, and returns a value of LLVM token type. By returning a token value, this intrinsic avoids being eliminated by optimizations (especially dead code elimination) while it has uses. This intrinsic is marked with the 'readnone' and 'speculatable' attributes so that it does not inhibit optimizations like redundancy elimination, dead code elimination, code motion, etc.
token llvm.tensor.typeinfo(<llvm-vector-type> %tensor, <n x i32> %shape, <n x i32> %layout, <n x i32> %padding)

** OPERANDS:
-----------------------
==============================================================================
 Operand    | Description
============|=================================================================
 %tensor    | n-dimensional tensor value represented as a "flattened" vector
------------|-----------------------------------------------------------------
 %shape     | Vector of dimension values of the tensor
------------|-----------------------------------------------------------------
 %layout    | Vector of permutation of dimension indices ranging from 0 to n-1
------------|-----------------------------------------------------------------
 %padding   | Vector of padding values along every dimension of the tensor
==============================================================================

** RESULT:
-----------------------
==============================================================================
 Result      | Description
=============|================================================================
 token value | LLVM value of token type associated with a tensor value
==============================================================================

** SEMANTICS:
-----------------------
The 'llvm.tensor.typeinfo' intrinsic produces a unique token value associated with a tensor value represented as a "flattened" vector. The layout operand of this intrinsic is expressed as a permutation of dimension indices (from 0 to n-1 for an n-dimensional tensor); this represents tensor layouts in LLVM in a generic way. The number of elements in the shape, layout and padding vectors must be the same and equal to the number of dimensions of the given tensor. Note that this intrinsic is only meant to hold information such as the shape, layout and padding of a tensor value in LLVM IR. It neither reads nor writes memory, performs no computation, and does not exhibit any kind of undefined behavior.

** EXAMPLE:
-----------------------
; The first argument (%tensor) is the tensor being modelled as a flattened
; vector. The second argument is the shape (16 x 5 x 3), the third argument is
; the layout (<0, 1, 2>) and the fourth argument is the padding (<3, 2, 1>
; along the corresponding dimensions) of the given tensor.
%input = call token @llvm.tensor.typeinfo(<240 x float> %tensor, <3 x i32> <i32 16, i32 5, i32 3>, <3 x i32> <i32 0, i32 1, i32 2>, <3 x i32> <i32 3, i32 2, i32 1>)

; The first argument is the input virtual tensor register and the second
; argument is the new permutation of the layout of the input tensor. This
; operation produces a tensor with layout <2, 0, 1>.
%output = call <240 x float> @llvm.tensor.transpose(token %input, <3 x i32> <i32 2, i32 0, i32 1>)

; The first argument (%output) is the output tensor being modelled as a
; flattened vector. The second argument is the new shape (3 x 16 x 5), the
; third argument is the new layout (<2, 0, 1>) and the fourth argument is the
; new padding (<1, 3, 2> along the corresponding dimensions) of the output
; tensor.
%typed_output = call token @llvm.tensor.typeinfo(<240 x float> %output, <3 x i32> <i32 3, i32 16, i32 5>, <3 x i32> <i32 2, i32 0, i32 1>, <3 x i32> <i32 1, i32 3, i32 2>)
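To make these fields concrete, the following C sketch (illustrative only, not part of the proposal) shows one plausible mapping from a logical index to an offset in the flattened vector. It assumes the layout vector lists dimension indices from outermost to innermost in storage order and that padding enlarges each stored dimension; the actual semantics are as defined above.

#include <stdio.h>

/* Hypothetical helper: offset of logical element idx[0..n-1] in the
 * flattened vector, under the assumptions stated above. */
static unsigned flattened_offset(unsigned n, const unsigned *shape,
                                 const unsigned *layout, const unsigned *pad,
                                 const unsigned *idx) {
  unsigned offset = 0;
  for (unsigned d = 0; d < n; ++d) {
    unsigned dim = layout[d];  /* next dimension in storage order */
    offset = offset * (shape[dim] + pad[dim]) + idx[dim];
  }
  return offset;
}

int main(void) {
  /* Shape <16 x 5 x 3>, identity layout <0, 1, 2>, no padding. */
  unsigned shape[3] = {16, 5, 3}, layout[3] = {0, 1, 2}, pad[3] = {0, 0, 0};
  unsigned idx[3] = {2, 1, 0};
  printf("%u\n", flattened_offset(3, shape, layout, pad, idx)); /* 2*15 + 1*3 + 0 = 33 */
  return 0;
}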
*** TENSOR OPERATIONS IN LLVM
================================================================================

** INTRINSIC: llvm.tensor.load
===============================
** OVERVIEW:
---------------------
This operation loads a tensor or sub-tensor with the given shape, layout and padding from memory into a register. Unlike the existing load instruction in LLVM, this operation is strided, so that sub-tensors can be loaded from memory. This intrinsic is marked with the 'speculatable' attribute to prevent it from inhibiting optimizations like redundancy elimination, dead code elimination, code motion, etc.

token llvm.tensor.load(<element_type>* %mem_ptr, <n x i32> %shape, <n x i32> %layout, <n x i32> %padding, <n x i32> %strides)

** OPERANDS:
---------------------
==============================================================================
 Operand    | Description
============|=================================================================
 %mem_ptr   | Starting address of a tensor/sub-tensor in memory
------------|-----------------------------------------------------------------
 %shape     | Vector of dimension values of the loaded tensor/sub-tensor
------------|-----------------------------------------------------------------
 %layout    | Vector of permutation of dimension indices ranging from 0 to n-1
------------|-----------------------------------------------------------------
 %padding   | Vector of padding values along every dimension of the loaded
            | tensor/sub-tensor
------------|-----------------------------------------------------------------
 %strides   | Vector of strides in memory along every dimension of the loaded
            | tensor/sub-tensor
==============================================================================

** RESULT:
------------------
==============================================================================
 Result      | Description
=============|================================================================
 token value | LLVM value of token type associated with a tensor value
==============================================================================

** SEMANTICS:
-----------------------
The 'llvm.tensor.load' intrinsic loads a tensor or sub-tensor with the given shape, layout and padding from memory into a register. Unlike the existing load instruction in LLVM, this operation is strided based on %strides, so that sub-tensors, which are not laid out contiguously in memory, can be loaded. This intrinsic reads from memory, but does not write to memory.

** EXAMPLE:
---------------------
; This loads a sub-tensor from the memory location pointed to by %mem_ptr. The
; sub-tensor has the shape <16 x 6 x 4> (second argument), layout <0, 1, 2>
; (third argument) and zero padding (fourth argument). The strides in memory
; along every dimension are <0, 0, 8>, which means that the rows of the loaded
; sub-tensor have a distance of 8 bytes in memory. This produces a unique token
; %tensor.
%tensor = call token @llvm.tensor.load(i8* %mem_ptr, <3 x i32> <i32 16, i32 6, i32 4>, <3 x i32> <i32 0, i32 1, i32 2>, <3 x i32> <i32 0, i32 0, i32 0>, <3 x i32> <i32 0, i32 0, i32 8>)
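As a point of reference for what a strided sub-tensor load computes, here is a minimal 2-D C sketch (illustrative only, not part of the proposal; strides are assumed to be in elements):

#include <stddef.h>

/* Hypothetical reference: copy a rows x cols sub-tensor, whose consecutive
 * rows lie row_stride elements apart in memory, into a contiguous buffer. */
static void subtensor_load_2d(const float *src, float *dst,
                              size_t rows, size_t cols, size_t row_stride) {
  for (size_t i = 0; i < rows; ++i)
    for (size_t j = 0; j < cols; ++j)
      dst[i * cols + j] = src[i * row_stride + j];
}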
** INTRINSIC: llvm.tensor.store
================================
** OVERVIEW:
---------------------
This operation stores a tensor or sub-tensor from a register into memory. Unlike the existing store instruction in LLVM, this operation is strided, so that sub-tensors can be stored into memory. This intrinsic is marked with the 'writeonly' attribute (it writes memory but never reads it) to prevent it from inhibiting optimizations like redundancy elimination, dead code elimination, code motion, etc.

void llvm.tensor.store(<element_type>* %mem_ptr, token %tensor, <n x i32> %strides)

** OPERANDS:
----------------------
==============================================================================
 Operand    | Description
============|=================================================================
 %mem_ptr   | Starting address of a tensor/sub-tensor in memory
------------|-----------------------------------------------------------------
 %tensor    | Stored tensor/sub-tensor
------------|-----------------------------------------------------------------
 %strides   | Vector of strides in memory along every dimension of the stored
            | tensor/sub-tensor
==============================================================================

** RESULT:
------------------
This intrinsic does not return anything.

** SEMANTICS:
-----------------------
The 'llvm.tensor.store' intrinsic stores a tensor or sub-tensor from a register into memory. Unlike the existing store instruction in LLVM, this operation is strided based on %strides, so that sub-tensors, which are not laid out contiguously in memory, can be stored. This intrinsic writes to memory, but does not read from memory.

** EXAMPLE:
---------------------
%tensor_token = call token @llvm.tensor.typeinfo(<384 x float> %tensor, <3 x i32> <i32 16, i32 6, i32 4>, <3 x i32> <i32 0, i32 1, i32 2>, <3 x i32> <i32 0, i32 0, i32 0>)

; This stores the tensor %tensor_token (second argument) to the memory
; location pointed to by %mem_ptr (first argument). The strides in memory
; along every dimension are <0, 12, 10> (third argument), which means that
; the rows of the tensor are stored 10*sizeof(float) bytes apart and its
; columns 12*sizeof(float) bytes apart in memory.
call void @llvm.tensor.store(float* %mem_ptr, token %tensor_token, <3 x i32> <i32 0, i32 12, i32 10>)

** INTRINSIC: llvm.tensor.matmul
================================
** OVERVIEW:
---------------------
This intrinsic performs batched matrix multiplication between the inner dimensions of two multidimensional tensors. This intrinsic is marked with the 'readnone' and 'speculatable' attributes to prevent it from inhibiting optimizations like redundancy elimination, dead code elimination, code motion, etc.
<vector_ty> llvm.tensor.matmul(token %input1, token %input2)

** OPERANDS:
----------------------
==============================================================================
 Operand    | Description
============|=================================================================
 %input1    | Token value representing the first input tensor
------------|-----------------------------------------------------------------
 %input2    | Token value representing the second input tensor
==============================================================================

** RESULT:
------------------
==============================================================================
 Result       | Description
==============|===============================================================
 vector value | Result tensor represented as a "flattened" vector
==============================================================================

** SEMANTICS:
-----------------------
The 'llvm.tensor.matmul' intrinsic performs batched matrix multiplication between two input tensors. The inner two dimensions of the input tensors must have valid matrix multiplication dimensions, and any further outer dimensions must have matching batch sizes. This intrinsic neither reads nor writes memory, nor does it exhibit any kind of undefined behavior.

** EXAMPLE:
---------------------
%input1 = call token @llvm.tensor.typeinfo(<12 x float> %tensor1, <2 x i32> <i32 3, i32 4>, <2 x i32> <i32 0, i32 1>, <2 x i32> <i32 0, i32 0>)
%input2 = call token @llvm.tensor.typeinfo(<12 x float> %tensor2, <2 x i32> <i32 4, i32 3>, <2 x i32> <i32 0, i32 1>, <2 x i32> <i32 0, i32 0>)

%output = call <9 x float> @llvm.tensor.matmul(token %input1, token %input2)
%typed_output = call token @llvm.tensor.typeinfo(<9 x float> %output, <2 x i32> <i32 3, i32 3>, <2 x i32> <i32 0, i32 1>, <2 x i32> <i32 0, i32 0>)
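For reference, here is a minimal C sketch of batched matrix multiplication over row-major flattened buffers (illustrative only, not part of the proposal):

#include <stddef.h>

/* Hypothetical reference: for each of the `batch` outer slices, multiply an
 * m x k matrix by a k x n matrix, producing an m x n result slice. */
static void batched_matmul(const float *a, const float *b, float *c,
                           size_t batch, size_t m, size_t k, size_t n) {
  for (size_t p = 0; p < batch; ++p)
    for (size_t i = 0; i < m; ++i)
      for (size_t j = 0; j < n; ++j) {
        float sum = 0.0f;
        for (size_t l = 0; l < k; ++l)
          sum += a[(p * m + i) * k + l] * b[(p * k + l) * n + j];
        c[(p * m + i) * n + j] = sum;
      }
}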
** INTRINSIC: llvm.tensor.contract
==================================
** OVERVIEW:
---------------------
This intrinsic performs tensor contraction on two multidimensional tensors. This intrinsic is marked with the 'readnone' and 'speculatable' attributes to prevent it from inhibiting optimizations like redundancy elimination, dead code elimination, code motion, etc.

<vector_ty> llvm.tensor.contract(token %input1, token %input2, <m x i32> %in1_contraction_axes, <m x i32> %in2_contraction_axes)

** OPERANDS:
----------------------
==============================================================================
 Operand               | Description
=======================|======================================================
 %input1               | Token value representing the first input tensor
-----------------------|------------------------------------------------------
 %input2               | Token value representing the second input tensor
-----------------------|------------------------------------------------------
 %in1_contraction_axes | Vector of m axes of the first input tensor along
                       | which contraction/reduction is performed
-----------------------|------------------------------------------------------
 %in2_contraction_axes | Vector of m axes of the second input tensor along
                       | which contraction/reduction is performed
==============================================================================

** RESULT:
------------------
==============================================================================
 Result       | Description
==============|===============================================================
 vector value | Result tensor represented as a "flattened" vector
==============================================================================

** SEMANTICS:
-----------------------
The 'llvm.tensor.contract' intrinsic multiplies and contracts two input tensors along the given axes, %in1_contraction_axes and %in2_contraction_axes. The axes vectors contain lists of dimension indices of the input tensors along which the reduction takes place. This intrinsic neither reads nor writes memory, nor does it exhibit any kind of undefined behavior.

** EXAMPLE:
---------------------
%input1 = call token @llvm.tensor.typeinfo(<8 x float> %tensor1, <3 x i32> <i32 2, i32 2, i32 2>, <3 x i32> <i32 0, i32 1, i32 2>, <3 x i32> <i32 0, i32 0, i32 0>)
%input2 = call token @llvm.tensor.typeinfo(<8 x float> %tensor2, <3 x i32> <i32 2, i32 2, i32 2>, <3 x i32> <i32 0, i32 1, i32 2>, <3 x i32> <i32 0, i32 0, i32 0>)

%output = call <8 x float> @llvm.tensor.contract(token %input1, token %input2, <2 x i32> <i32 0, i32 2>, <2 x i32> <i32 0, i32 1>)
%typed_output = call token @llvm.tensor.typeinfo(<8 x float> %output, <3 x i32> <i32 2, i32 2, i32 2>, <3 x i32> <i32 0, i32 1, i32 2>, <3 x i32> <i32 0, i32 0, i32 0>)
** INTRINSIC: llvm.tensor.umma
===============================
** OVERVIEW:
---------------------
This intrinsic performs matrix multiplication between two given input tensors and then accumulates the result into the third input tensor. This intrinsic is marked with the 'readnone' and 'speculatable' attributes to prevent it from inhibiting optimizations like redundancy elimination, dead code elimination, code motion, etc.

<vector_ty> llvm.tensor.umma(token %acc, token %op1, token %op2)

** OPERANDS:
----------------------
==============================================================================
 Operand    | Description
============|=================================================================
 %acc       | Token value representing the tensor which accumulates results
------------|-----------------------------------------------------------------
 %op1       | Token value representing the first input tensor
------------|-----------------------------------------------------------------
 %op2       | Token value representing the second input tensor
==============================================================================

** RESULT:
------------------
==============================================================================
 Result       | Description
==============|===============================================================
 vector value | Result tensor represented as a "flattened" vector
==============================================================================

** SEMANTICS:
-----------------------
The 'llvm.tensor.umma' intrinsic performs matrix-multiply-accumulation between two unsigned operands and accumulates the result into the given register:

%output = %acc + matmul(%op1, %op2)

When the full product (e.g. i8 * i8 -> i16) is needed, the input tensors must be zero-extended. Note that this intrinsic neither reads nor writes memory, nor does it exhibit any kind of undefined behavior.

** EXAMPLE:
---------------------
%acc = call token @llvm.tensor.typeinfo(<9 x i32> %acc_tensor, <2 x i32> <i32 3, i32 3>, <2 x i32> <i32 0, i32 1>, <2 x i32> <i32 0, i32 0>)
%op1 = call token @llvm.tensor.typeinfo(<12 x i32> %tensor1, <2 x i32> <i32 3, i32 4>, <2 x i32> <i32 0, i32 1>, <2 x i32> <i32 0, i32 0>)
%op2 = call token @llvm.tensor.typeinfo(<12 x i32> %tensor2, <2 x i32> <i32 4, i32 3>, <2 x i32> <i32 0, i32 1>, <2 x i32> <i32 0, i32 0>)

; The first argument is the accumulator virtual tensor register, and the
; second and third arguments are the input virtual tensor registers.
%output = call <9 x i32> @llvm.tensor.umma(token %acc, token %op1, token %op2)
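For intuition, here is a minimal 2-D C sketch of the widening multiply-accumulate this family of intrinsics performs (illustrative only, not part of the proposal). The smma, usmma and summa variants below differ only in which operands are sign- rather than zero-extended:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical reference for umma: every i8 element is zero-extended before
 * the multiply, and the products accumulate into an i32 m x n accumulator. */
static void umma_ref(const uint8_t *op1, const uint8_t *op2, int32_t *acc,
                     size_t m, size_t k, size_t n) {
  for (size_t i = 0; i < m; ++i)
    for (size_t j = 0; j < n; ++j)
      for (size_t l = 0; l < k; ++l)
        acc[i * n + j] += (int32_t)op1[i * k + l] * (int32_t)op2[l * n + j];
}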
** INTRINSIC: llvm.tensor.smma
==============================
** OVERVIEW:
---------------------
This intrinsic performs matrix multiplication between two given input tensors and then accumulates the result into the third input tensor. This intrinsic is marked with the 'readnone' and 'speculatable' attributes to prevent it from inhibiting optimizations like redundancy elimination, dead code elimination, code motion, etc.

<vector_ty> llvm.tensor.smma(token %acc, token %op1, token %op2)

** OPERANDS:
----------------------
Same as 'llvm.tensor.umma': %acc accumulates results, and %op1 and %op2 are the input tensors.

** RESULT:
------------------
Result tensor represented as a "flattened" vector, as for 'llvm.tensor.umma'.

** SEMANTICS:
---------------------
The 'llvm.tensor.smma' intrinsic performs matrix-multiply-accumulation between two signed operands and accumulates the result into the given register:

%output = %acc + matmul(%op1, %op2)

When the full product (e.g. i8 * i8 -> i16) is needed, the input tensors must be sign-extended. Note that this intrinsic neither reads nor writes memory, nor does it exhibit any kind of undefined behavior.

** EXAMPLE:
---------------------
%acc = call token @llvm.tensor.typeinfo(<9 x i32> %acc_tensor, <2 x i32> <i32 3, i32 3>, <2 x i32> <i32 0, i32 1>, <2 x i32> <i32 0, i32 0>)
%op1 = call token @llvm.tensor.typeinfo(<12 x i32> %tensor1, <2 x i32> <i32 3, i32 4>, <2 x i32> <i32 0, i32 1>, <2 x i32> <i32 0, i32 0>)
%op2 = call token @llvm.tensor.typeinfo(<12 x i32> %tensor2, <2 x i32> <i32 4, i32 3>, <2 x i32> <i32 0, i32 1>, <2 x i32> <i32 0, i32 0>)

; The first argument is the accumulator virtual tensor register,
; and the second and third arguments are the input virtual tensor registers.
%output = call <9 x i32> @llvm.tensor.smma(token %acc, token %op1, token %op2)
** INTRINSIC: llvm.tensor.usmma
================================
** OVERVIEW:
---------------------
This intrinsic performs matrix multiplication between two given input tensors and then accumulates the result into the third input tensor. This intrinsic is marked with the 'readnone' and 'speculatable' attributes to prevent it from inhibiting optimizations like redundancy elimination, dead code elimination, code motion, etc.

<vector_ty> llvm.tensor.usmma(token %acc, token %op1, token %op2)

** OPERANDS:
----------------------
Same as 'llvm.tensor.umma': %acc accumulates results, and %op1 and %op2 are the input tensors.

** RESULT:
------------------
Result tensor represented as a "flattened" vector, as for 'llvm.tensor.umma'.

** SEMANTICS:
-----------------------
The 'llvm.tensor.usmma' intrinsic performs matrix-multiply-accumulation between an unsigned first operand and a signed second operand and accumulates the result into the given register:

%output = %acc + matmul(%op1, %op2)

When the full product (e.g. i8 * i8 -> i16) is needed, %op1 must be zero-extended and %op2 must be sign-extended. Note that this intrinsic neither reads nor writes memory, nor does it exhibit any kind of undefined behavior.

** EXAMPLE:
---------------------
%acc = call token @llvm.tensor.typeinfo(<9 x i32> %acc_tensor, <2 x i32> <i32 3, i32 3>, <2 x i32> <i32 0, i32 1>, <2 x i32> <i32 0, i32 0>)
%op1 = call token @llvm.tensor.typeinfo(<12 x i32> %tensor1, <2 x i32> <i32 3, i32 4>, <2 x i32> <i32 0, i32 1>, <2 x i32> <i32 0, i32 0>)
%op2 = call token @llvm.tensor.typeinfo(<12 x i32> %tensor2, <2 x i32> <i32 4, i32 3>, <2 x i32> <i32 0, i32 1>, <2 x i32> <i32 0, i32 0>)

; The first argument is the accumulator virtual tensor register,
; and the second and third arguments are the input virtual tensor registers.
%output = call <9 x i32> @llvm.tensor.usmma(token %acc, token %op1, token %op2)
** INTRINSIC: llvm.tensor.summa
================================
** OVERVIEW:
---------------------
This intrinsic performs matrix multiplication between two given input tensors and then accumulates the result into the third input tensor. This intrinsic is marked with the 'readnone' and 'speculatable' attributes to prevent it from inhibiting optimizations like redundancy elimination, dead code elimination, code motion, etc.

<vector_ty> llvm.tensor.summa(token %acc, token %op1, token %op2)

** OPERANDS:
----------------------
Same as 'llvm.tensor.umma': %acc accumulates results, and %op1 and %op2 are the input tensors.

** RESULT:
------------------
Result tensor represented as a "flattened" vector, as for 'llvm.tensor.umma'.

** SEMANTICS:
-----------------------
The 'llvm.tensor.summa' intrinsic performs matrix-multiply-accumulation between a signed first operand and an unsigned second operand and accumulates the result into the given register:

%output = %acc + matmul(%op1, %op2)

When the full product (e.g. i8 * i8 -> i16) is needed, %op1 must be sign-extended and %op2 must be zero-extended. Note that this intrinsic neither reads nor writes memory, nor does it exhibit any kind of undefined behavior.

** EXAMPLE:
---------------------
%acc = call token @llvm.tensor.typeinfo(<9 x i32> %acc_tensor, <2 x i32> <i32 3, i32 3>, <2 x i32> <i32 0, i32 1>, <2 x i32> <i32 0, i32 0>)
%op1 = call token @llvm.tensor.typeinfo(<12 x i32> %tensor1, <2 x i32> <i32 3, i32 4>, <2 x i32> <i32 0, i32 1>, <2 x i32> <i32 0, i32 0>)
%op2 = call token @llvm.tensor.typeinfo(<12 x i32> %tensor2, <2 x i32> <i32 4, i32 3>, <2 x i32> <i32 0, i32 1>, <2 x i32> <i32 0, i32 0>)

; The first argument is the accumulator virtual tensor register, and the
; second and third arguments are the input virtual tensor registers.
%output = call <9 x i32> @llvm.tensor.summa(token %acc, token %op1, token %op2)

** INTRINSIC: llvm.tensor.convolution
=====================================
** OVERVIEW:
---------------------
This intrinsic performs a convolution between input and kernel tensors. This intrinsic is marked with the 'readnone' and 'speculatable' attributes to prevent it from inhibiting optimizations like redundancy elimination, dead code elimination, code motion, etc.
<vector_ty> llvm.tensor.convolution(token %input, token %kernel, <vector_ty> %strides, <vector_ty> %input_dilations, <vector_ty> %kernel_dilations)

** OPERANDS:
----------------------
==============================================================================
 Operand           | Description
===================|==========================================================
 %input            | Token value representing the input tensor
-------------------|----------------------------------------------------------
 %kernel           | Token value representing the kernel tensor
-------------------|----------------------------------------------------------
 %strides          | Vector containing stride values for the sliding kernel
-------------------|----------------------------------------------------------
 %input_dilations  | Vector containing dilation values for the input
-------------------|----------------------------------------------------------
 %kernel_dilations | Vector containing dilation values for the kernel
==============================================================================

** RESULT:
------------------
==============================================================================
 Result       | Description
==============|===============================================================
 vector value | Result tensor represented as a "flattened" vector
==============================================================================

** SEMANTICS:
-----------------------
The 'llvm.tensor.convolution' intrinsic performs a convolution between the input and kernel tensors. This intrinsic neither reads nor writes memory, nor does it exhibit any kind of undefined behavior.

** EXAMPLE:
---------------------
%input = call token @llvm.tensor.typeinfo(<12 x float> %tensor1, <2 x i32> <i32 3, i32 4>, <2 x i32> <i32 0, i32 1>, <2 x i32> <i32 0, i32 0>)
%kernel = call token @llvm.tensor.typeinfo(<12 x float> %tensor2, <2 x i32> <i32 4, i32 3>, <2 x i32> <i32 0, i32 1>, <2 x i32> <i32 0, i32 0>)

; The third argument is a vector of kernel stride values along every
; dimension, the fourth argument is a vector of input dilation values along
; every dimension, and the fifth argument is a vector of kernel dilation
; values along every dimension.
%output = call <9 x float> @llvm.tensor.convolution(token %input, token %kernel, <2 x i32> <i32 1, i32 2>, <2 x i32> <i32 0, i32 0>, <2 x i32> <i32 0, i32 0>)
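For reference, here is a minimal 2-D C sketch of a strided, dilated convolution (illustrative only, not part of the proposal; input dilation is omitted for brevity, and the output extents are assumed consistent with the inputs):

#include <stddef.h>

/* Hypothetical reference: 2-D convolution with kernel strides and kernel
 * dilation, where out_h = (in_h - (kh - 1) * dil_h - 1) / stride_h + 1,
 * and likewise for out_w. */
static void conv2d_ref(const float *in, const float *kern, float *out,
                       size_t in_w, size_t kh, size_t kw,
                       size_t stride_h, size_t stride_w,
                       size_t dil_h, size_t dil_w,
                       size_t out_h, size_t out_w) {
  for (size_t oy = 0; oy < out_h; ++oy)
    for (size_t ox = 0; ox < out_w; ++ox) {
      float sum = 0.0f;
      for (size_t ky = 0; ky < kh; ++ky)
        for (size_t kx = 0; kx < kw; ++kx)
          sum += in[(oy * stride_h + ky * dil_h) * in_w +
                    (ox * stride_w + kx * dil_w)] *
                 kern[ky * kw + kx];
      out[oy * out_w + ox] = sum;
    }
}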
** INTRINSIC: llvm.tensor.transpose
===================================
** OVERVIEW:
---------------------
This intrinsic changes the layout of a given tensor by permuting the indices of its dimensions. This intrinsic is marked with the 'readnone' and 'speculatable' attributes to prevent it from inhibiting optimizations like redundancy elimination, dead code elimination, code motion, etc.

<vector_ty> llvm.tensor.transpose(token %input, <n x i32> %new_layout)

** OPERANDS:
----------------------
==============================================================================
 Operand     | Description
=============|================================================================
 %input      | Token value representing the input tensor
-------------|----------------------------------------------------------------
 %new_layout | The new permutation of the tensor layout
==============================================================================

** RESULT:
------------------
==============================================================================
 Result       | Description
==============|===============================================================
 vector value | Result tensor represented as a "flattened" vector
==============================================================================

** SEMANTICS:
-----------------------
The 'llvm.tensor.transpose' intrinsic operates on the given tensor and produces an output tensor with the given layout. This operation changes the physical layout of the input tensor and consequently changes its shape and padding. Note that the operation does not change the number of dimensions, and that this intrinsic neither reads nor writes memory, nor does it exhibit any kind of undefined behavior.

** EXAMPLE:
---------------------
%input = call token @llvm.tensor.typeinfo(<240 x float> %tensor, <3 x i32> <i32 16, i32 5, i32 3>, <3 x i32> <i32 0, i32 1, i32 2>, <3 x i32> <i32 3, i32 2, i32 1>)

; The first argument is the input virtual tensor register and the second
; argument is the new permutation of the layout of the input tensor. This
; operation produces a tensor with layout <2, 0, 1>.
%output = call <240 x float> @llvm.tensor.transpose(token %input, <3 x i32> <i32 2, i32 0, i32 1>)
%typed_output = call token @llvm.tensor.typeinfo(<240 x float> %output, <3 x i32> <i32 3, i32 16, i32 5>, <3 x i32> <i32 2, i32 0, i32 1>, <3 x i32> <i32 1, i32 3, i32 2>)
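To spell out what the permutation does to the data, here is a minimal 3-D C sketch (illustrative only, not part of the proposal), where output dimension i corresponds to input dimension perm[i]:

#include <stddef.h>

/* Hypothetical reference: physically rearrange a dims[0] x dims[1] x dims[2]
 * tensor according to perm; e.g. perm = {2, 0, 1} turns a 16 x 5 x 3 tensor
 * into a 3 x 16 x 5 tensor, as in the example above. */
static void transpose3d_ref(const float *in, float *out,
                            const size_t dims[3], const size_t perm[3]) {
  size_t od[3] = {dims[perm[0]], dims[perm[1]], dims[perm[2]]};
  for (size_t i = 0; i < od[0]; ++i)
    for (size_t j = 0; j < od[1]; ++j)
      for (size_t k = 0; k < od[2]; ++k) {
        size_t iidx[3];
        iidx[perm[0]] = i; iidx[perm[1]] = j; iidx[perm[2]] = k;
        out[(i * od[1] + j) * od[2] + k] =
            in[(iidx[0] * dims[1] + iidx[1]) * dims[2] + iidx[2]];
      }
}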
*** DESIGN OF TENSOR EXTENSIONS IN LLVM
========================================================================================
The tensor extensions we have added to LLVM are described in the document here: https://docs.google.com/document/d/1A3xbrtouckRsPz94v2XttjoaTSqQlz1pSzVe80-Jmro/edit?usp=sharing

===============================================================================
 LLVM Tensor Intrinsics      | Frontend Equivalent   | Target Equivalent
=============================|=======================|=========================
 llvm.tensor.matmul          | XLA dot op            |
-----------------------------|-----------------------|-------------------------
 llvm.tensor.contract        | XLA dot general op    |
-----------------------------|-----------------------|-------------------------
 llvm.tensor.umma            |                       | Intel AMX mma
                             |                       | instruction, Power MMA
                             |                       | instruction
-----------------------------|-----------------------|-------------------------
 llvm.tensor.smma            |                       | Intel AMX mma
                             |                       | instruction, Power MMA
                             |                       | instruction
-----------------------------|-----------------------|-------------------------
 llvm.tensor.usmma           |                       | Intel AMX mma
                             |                       | instruction, Power MMA
                             |                       | instruction
-----------------------------|-----------------------|-------------------------
 llvm.tensor.summa           |                       | Intel AMX mma
                             |                       | instruction, Power MMA
                             |                       | instruction
-----------------------------|-----------------------|-------------------------
 llvm.tensor.convolution     | XLA convolution op    | NVDLA convolution
                             |                       | instruction
-----------------------------|-----------------------|-------------------------
 llvm.tensor.tanh            | XLA element-wise op   | NVDLA element-wise
                             |                       | instruction
-----------------------------|-----------------------|-------------------------
 llvm.tensor.sigmoid         |                       | NVDLA element-wise
                             |                       | instruction
-----------------------------|-----------------------|-------------------------
 llvm.tensor.relu            |                       | NVDLA element-wise
                             |                       | instruction
-----------------------------|-----------------------|-------------------------
 llvm.tensor.broadcast       | XLA broadcast op      | Intel AMX fill
                             |                       | instruction
-----------------------------|-----------------------|-------------------------
 llvm.tensor.load            |                       | Intel AMX load
                             |                       | instruction
-----------------------------|-----------------------|-------------------------
 llvm.tensor.store           |                       | Intel AMX store
                             |                       | instruction
-----------------------------|-----------------------|-------------------------
 llvm.tensor.reduce.max      | XLA reduce window op  | NVDLA pooling instruction
-----------------------------|-----------------------|-------------------------
 llvm.tensor.reduce.min      | XLA reduce window op  | NVDLA pooling instruction
-----------------------------|-----------------------|-------------------------
 llvm.tensor.reduce.add      | XLA reduce window op  |
-----------------------------|-----------------------|-------------------------
 llvm.tensor.reduce.mul      | XLA reduce window op  |
-----------------------------|-----------------------|-------------------------
 llvm.tensor.reduce.and      | XLA reduce window op  |
-----------------------------|-----------------------|-------------------------
 llvm.tensor.reduce.or       | XLA reduce window op  |
-----------------------------|-----------------------|-------------------------
 llvm.tensor.reduce.xor      | XLA reduce window op  |
-----------------------------|-----------------------|-------------------------
 llvm.tensor.reshape.block   | OneDNN layouts        |
-----------------------------|-----------------------|-------------------------
 llvm.tensor.reshape.permute | TensorFlow reshape op |
-----------------------------|-----------------------|-------------------------
 llvm.tensor.transpose       | TensorFlow transpose  | NVDLA reshape instruction
                             | op                    |
-----------------------------|-----------------------|-------------------------
 llvm.tensor.pad             | XLA pad op            |
-----------------------------|-----------------------|-------------------------
 llvm.tensor.concat          | XLA concat op         | NVDLA reshape instruction
-----------------------------|-----------------------|-------------------------
 llvm.tensor.tovector        |                       | Power unprime instruction
-----------------------------|-----------------------|-------------------------
 llvm.vector.totensor        |                       | Power prime instruction
===============================================================================
*** LOWERING STRATEGY
===============================================================================
The lowering strategy is divided into three stages:
* Lower the N-dimensional tensor operations to 2-dimensional tensor operations.
* Lower the 2-dimensional tensor operations to target-agnostic 1-D and 2-D tensor intrinsics.
* Legalize the target-agnostic 1-D and 2-D tensor intrinsics to target-specific intrinsics.

*** LOWERING EXAMPLE
------------------------------------------------------------
As an example, we want to lower the following code with the matmul intrinsic in LLVM IR:

%tensor1_token = call token @llvm.tensor.typeinfo(<6400000 x i32>* byval(<6400000 x i32>) %tensor1, <4 x i32> <i32 1, i32 10, i32 800, i32 800>, <4 x i32> <i32 0, i32 1, i32 2, i32 3>, <4 x i32> <i32 0, i32 0, i32 0, i32 0>)
%tensor2_token = call token @llvm.tensor.typeinfo(<6400000 x i32>* byval(<6400000 x i32>) %tensor2, <4 x i32> <i32 1, i32 10, i32 800, i32 800>, <4 x i32> <i32 0, i32 1, i32 2, i32 3>, <4 x i32> <i32 0, i32 0, i32 0, i32 0>)

%tensor3 = call <6400000 x i32> @llvm.tensor.matmul(token %tensor1_token, token %tensor2_token)

%tensor3_token = call token @llvm.tensor.typeinfo(<6400000 x i32> %tensor3, <4 x i32> <i32 1, i32 10, i32 800, i32 800>, <4 x i32> <i32 0, i32 1, i32 2, i32 3>, <4 x i32> <i32 0, i32 0, i32 0, i32 0>)

The code gets lowered to the following (pseudo)code in the first lowering stage:

%malloc_ptr1 = call i8* @malloc(…)
%malloc_ptr2 = call i8* @malloc(…)
%malloc_ptr3 = call i8* @malloc(…)
%tensor_ptr1 = bitcast i8* %malloc_ptr1 to i32*
%tensor_ptr2 = bitcast i8* %malloc_ptr2 to i32*
%tensor_ptr3 = bitcast i8* %malloc_ptr3 to i32*

for (unsigned I = 0; I < 1; I++)
  for (unsigned J = 0; J < 10; J++)
    // Compute the indices into %tensor1 and %tensor2 and load the feature
    // maps
    ….
    %matrix_ptr1 = getelementptr i32* %tensor_ptr1, ….
    %matrix_ptr2 = getelementptr i32* %tensor_ptr2, ….
    %matrix_ptr3 = getelementptr i32* %tensor_ptr3, ….
    %matrix1_token = call token @llvm.tensor.load(i32* %matrix_ptr1, <2 x i32> <i32 800, i32 800>, ….)
    %matrix2_token = call token @llvm.tensor.load(i32* %matrix_ptr2, <2 x i32> <i32 800, i32 800>, ….)

    %matrix3 = call <640000 x i32> @llvm.tensor.matmul(token %matrix1_token, token %matrix2_token)
    %matrix3_token = call token @llvm.tensor.typeinfo(<640000 x i32> %matrix3, <2 x i32> <i32 800, i32 800>, <2 x i32> <i32 0, i32 1>, <2 x i32> <i32 0, i32 0>)

    call void @llvm.tensor.store(i32* %tensor_ptr3, token %matrix3_token, …)
    ….

After the second lowering stage, the 2-dimensional matmul intrinsic gets lowered to 2-dimensional target-agnostic intrinsics:

%malloc_ptr1 = call i8* @malloc(…)
%malloc_ptr2 = call i8* @malloc(…)
%malloc_ptr3 = call i8* @malloc(…)
%tensor_ptr1 = bitcast i8* %malloc_ptr1 to i32*
%tensor_ptr2 = bitcast i8* %malloc_ptr2 to i32*
%tensor_ptr3 = bitcast i8* %malloc_ptr3 to i32*

for (unsigned I = 0; I < 1; I++)
  for (unsigned J = 0; J < 10; J++)
    for (unsigned M = 0; M < 800; M += 16)
      for (unsigned N = 0; N < 800; N += 16)
        for (unsigned K = 0; K < 800; K += 16)

          // Compute the indices into %tensor1 and %tensor2 and load the tiles
          ….
          %tile1_token = call token @llvm.tensor.load(i32* %tile_ptr1, <2 x i32> <i32 16, i32 16>, ….)
          %tile2_token = call token @llvm.tensor.load(i32* %tile_ptr2, <2 x i32> <i32 16, i32 16>, ….)
          %acc_token = call token @llvm.tensor.load(i32* %acc_ptr, <2 x i32> <i32 16, i32 16>, ….)

          %acc = call <256 x i32> @llvm.tensor.smma(token %acc_token, token %tile1_token, token %tile2_token)
          %new_acc_token = call token @llvm.tensor.typeinfo(<256 x i32> %acc, <2 x i32> <i32 16, i32 16>, ...)

          .…
        }
        call void @llvm.tensor.store(i32* %acc_ptr, token %new_acc_token, …)
        ….

The last stage of lowering is legalization. In this example, we lower down to Intel AMX intrinsics:

%malloc_ptr1 = call i8* @malloc(…)
%malloc_ptr2 = call i8* @malloc(…)
%malloc_ptr3 = call i8* @malloc(…)
%tensor_ptr1 = bitcast i8* %malloc_ptr1 to i32*
%tensor_ptr2 = bitcast i8* %malloc_ptr2 to i32*
%tensor_ptr3 = bitcast i8* %malloc_ptr3 to i32*

for (unsigned I = 0; I < 1; I++)
  for (unsigned J = 0; J < 10; J++)
    for (unsigned M = 0; M < 800; M += 16)
      for (unsigned N = 0; N < 800; N += 16)
        for (unsigned K = 0; K < 800; K += 16)

          // Compute the indices into %tensor1 and %tensor2 and load the tiles
          ….
          %cast_tile_ptr1 = bitcast i32* %tile_ptr1 to i8*
          %cast_tile_ptr2 = bitcast i32* %tile_ptr2 to i8*
          %cast_acc_ptr = bitcast i32* %acc_ptr to i8*
          %tile1_amx = call x86_amx @llvm.x86.tileloadd64.internal(i16 16, i16 64, i8* %cast_tile_ptr1, i64 3200)
          %tile2_amx = call x86_amx @llvm.x86.tileloadd64.internal(i16 16, i16 64, i8* %cast_tile_ptr2, i64 3200)
          %acc_amx = call x86_amx @llvm.x86.tileloadd64.internal(i16 16, i16 64, i8* %cast_acc_ptr, i64 3200)
          %mma_amx = call x86_amx @llvm.x86.tdpbssd.internal(i16 16, i16 64, i16 64, x86_amx %acc_amx, x86_amx %tile2_amx, x86_amx %tile1_amx)
          .....
        }
        call void @llvm.x86.tilestored64.internal(i16 16, i16 64, i8* %cast_acc_ptr, i64 3200, x86_amx %mma_amx)
        ….

*** COMPATIBILITY WITH AND BENEFITS OVER MATRIX EXTENSIONS
=========================================================================================
The existing matrix extensions, which model matrices as flattened vectors in LLVM, can co-exist and be used together with the tensor extensions that we propose. We argue that our tensor extensions provide an extensible and flexible long-term solution that LLVM developers can experiment with and adopt over time. We believe that our tensor extensions provide the following benefits over the existing matrix extensions:

* Our tensor extensions support an arbitrary number of dimensions for tensors. This affords LLVM developers the flexibility to use higher-dimensional tensors, rather than being confined to two-dimensional matrices only. This generality also makes the tensor extensions easier to maintain in the future.

* Currently, information about matrix shapes and layouts is encoded within the matrix intrinsics in LLVM; there is no separation between matrix properties and matrix operations. This makes the existing matrix extensions rigid and difficult to extend, because if developers decide to encode more matrix properties in the IR, they have to modify all the matrix intrinsics as well as the code that uses them.
Our tensor extensions separate the tensor concept from the tensor operations, providing the flexibility to extend the tensor properties represented in the IR without having to modify all the operations that act on tensors. This flexibility also makes it easier to support new kinds of tensors (such as sparse and ragged tensors) in the future.

* Matrix padding is modelled using vector shuffle instructions, which requires optimizations, analyses and transformations to infer padding information by carefully inspecting all the shuffle instructions and their masks. We instead encode tensor padding information as a set of tensor properties directly represented and readily available in the IR, and we use an intrinsic to represent a padding operation.

*** CURRENT STATUS OF THE IMPLEMENTATION
=====================================================================================
* Lowering of most high-level tensor operations to LLVM scalar and vector instructions is supported.
* The tensor code generation framework is capable of targeting Intel AMX. Support for targeting NVDLA and NVIDIA tensor cores is in progress.
* Lowering support to target Intel VNNI and the Hexagon Vector Extensions (HVX) is underway.
* An example of lowering from Julia to the proposed tensor extensions is in the design document (https://docs.google.com/document/d/1A3xbrtouckRsPz94v2XttjoaTSqQlz1pSzVe80-Jmro/edit#heading=h.17j13gwxto8i).

*** CURRENT TESTING SUPPORT
=======================================================================================
Currently, the tests are written in C/C++. The tensor operations are written using "dummy" functions such as tensor_typeinfo, tensor_matmul and so on, as shown in the following example:

typedef int _tensor_t __attribute__((__vector_size__(25600000)));
typedef int _shape_t __attribute__((__vector_size__(16), __aligned__(4)));
typedef int _layout_t __attribute__((__vector_size__(16), __aligned__(4)));
typedef int _padding_t __attribute__((__vector_size__(16), __aligned__(4)));
typedef int _token_t;

void example(_tensor_t tensor1, _tensor_t tensor2) {
  _shape_t shape = {1, 10, 800, 800};
  _layout_t layout = {0, 1, 2, 3};
  _padding_t padding = {0, 0, 0, 0};

  /* Define type information for the input tensors */
  _token_t tensor1_token = tensor_typeinfo(tensor1, shape, layout, padding);
  _token_t tensor2_token = tensor_typeinfo(tensor2, shape, layout, padding);

  /* Perform matmul */
  _tensor_t tensor3 = tensor_matmul(tensor1_token, tensor2_token);

  /* Define type information for the output tensor */
  _token_t tensor3_token = tensor_typeinfo(tensor3, shape, layout, padding);
}

The above code gets translated into tensor intrinsics in LLVM IR:

define void @example(<6400000 x i32>* byval(<6400000 x i32>) %tensor1, <6400000 x i32>* byval(<6400000 x i32>) %tensor2) {
  %tensor1_token = call token @llvm.tensor.typeinfo(<6400000 x i32>* byval(<6400000 x i32>) %tensor1, <4 x i32> <i32 1, i32 10, i32 800, i32 800>, <4 x i32> <i32 0, i32 1, i32 2, i32 3>, <4 x i32> <i32 0, i32 0, i32 0, i32 0>)
  %tensor2_token = call token @llvm.tensor.typeinfo(<6400000 x i32>* byval(<6400000 x i32>) %tensor2, <4 x i32> <i32 1, i32 10, i32 800, i32 800>, <4 x i32> <i32 0, i32 1, i32 2, i32 3>, <4 x i32> <i32 0, i32 0, i32 0, i32 0>)
  %tensor3 = call <6400000 x i32> @llvm.tensor.matmul(token %tensor1_token, token %tensor2_token)

  %tensor3_token = call token @llvm.tensor.typeinfo(<6400000 x i32> %tensor3, <4 x i32> <i32 1, i32 10, i32 800, i32 800>, <4 x i32> <i32 0, i32 1, i32 2, i32 3>, <4 x i32> <i32 0, i32 0, i32 0, i32 0>)
  ret void
}

Link to the Google doc for this proposal: https://docs.google.com/document/d/1r3bHerTFqloldHH-OcMNF2kuCaYOB5JEeGtb7ftLmNg/edit?usp=sharing
Kothari, Akash via llvm-dev
2021-Nov-15 18:18 UTC
[llvm-dev] [RFC] Proposal for TLX: Tensor LLVM eXtensions
For those who may have been having trouble viewing the RFC in plain text format, we have our proposal in a Google doc: https://docs.google.com/document/d/1IW6VIJ4lMYbGRTOle7S5QXP7Sb5UlucZ3gf-L-4Ccfs/edit?usp=sharing. It would be great if y'all could comment in the Google doc or respond via email.

Thanks,
Akash Kothari