Can the float16 data type save compute cycles while computing transcendental functions?

It's clear that float16 can save bandwidth, but can float16 also save compute cycles when evaluating transcendental functions, like exp()?
If your hardware has full support for FP16 arithmetic, not just conversion to float32, then yes, definitely: e.g. on a GPU, on Intel Alder Lake with AVX-512 enabled, or on Sapphire Rapids (see Half-precision floating-point arithmetic on Intel chips), or apparently on Apple M2 CPUs.
If you can do two 64-byte SIMD vectors of FMAs per clock per core, you go twice as fast when each vector holds 32 half-precision FMAs instead of 16 single-precision ones.
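For example, on a machine with AVX-512 FP16 (compile with a recent GCC/Clang and `-mavx512fp16`), the same 512-bit FMA instruction handles twice as many elements. A minimal sketch:

```c
#include <immintrin.h>

// Each function compiles to one 512-bit FMA instruction, but the
// FP16 version does 32 multiply-adds where the FP32 version does 16.
__m512h fma16(__m512h a, __m512h b, __m512h c) {
    return _mm512_fmadd_ph(a, b, c);   // 32 half-precision FMAs
}

__m512 fma32(__m512 a, __m512 b, __m512 c) {
    return _mm512_fmadd_ps(a, b, c);   // 16 single-precision FMAs
}
```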
Speed vs. precision tradeoff: only enough precision for FP16 is needed

Without hardware ALU support for FP16, you can only save by not requiring as much precision: because you know the result will eventually be rounded to fp16, you can use polynomial approximations of lower degree, and thus fewer FMA operations, even though you're computing with float32.
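To make that concrete, here's a hedged sketch: both functions evaluate a truncated Taylor polynomial for exp(x) on a narrow interval using Horner's rule via fmaf(). The degrees (3 vs. 6) and the interval are illustrative assumptions; a real implementation would use minimax coefficients and verify the error bound against the target precision.

```c
#include <math.h>

// exp(x) for x in roughly [-0.25, 0.25] (after range reduction).
// Fewer terms suffice when the result will be rounded to fp16 anyway.

// ~fp16 accuracy target: degree-3 Taylor polynomial, 3 FMAs.
float exp_poly_fp16_target(float x) {
    float p = 1.0f / 6.0f;             // 1/3!
    p = fmaf(p, x, 0.5f);              // 1/2!
    p = fmaf(p, x, 1.0f);              // 1/1!
    p = fmaf(p, x, 1.0f);              // 1/0!
    return p;
}

// ~fp32 accuracy target: degree-6 Taylor polynomial, 6 FMAs -- twice the work.
float exp_poly_fp32_target(float x) {
    float p = 1.0f / 720.0f;           // 1/6!
    p = fmaf(p, x, 1.0f / 120.0f);     // 1/5!
    p = fmaf(p, x, 1.0f / 24.0f);      // 1/4!
    p = fmaf(p, x, 1.0f / 6.0f);       // 1/3!
    p = fmaf(p, x, 0.5f);              // 1/2!
    p = fmaf(p, x, 1.0f);              // 1/1!
    p = fmaf(p, x, 1.0f);              // 1/0!
    return p;
}
```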
BTW, exp and log are interesting for floating point because the format itself is built around an exponential representation. So you can compute an exponential by converting fp->int and stuffing that integer into the exponent field of an FP bit pattern; then, with the fractional part of your FP number, you use a polynomial approximation to get the mantissa of the result. A log implementation is the reverse: extract the exponent field and use a polynomial approximation of log of the mantissa, over a range like 1.0 to 2.0. (A sketch of the exp trick follows the links below.) See:
- Efficient implementation of log2(__m256d) in AVX2
- Fastest Implementation of Exponential Function Using AVX
- Very fast approximate Logarithm (natural log) function in C++?
- vgetmantps vs andpd instructions for getting the mantissa of float
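As promised, a minimal scalar sketch of the exponent-stuffing trick for 2^x. The Taylor-style coefficients and the lack of overflow/NaN handling are simplifying assumptions; a production version would use tuned minimax coefficients and handle out-of-range inputs:

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

// Approximate 2^x: split x into integer and fractional parts, stuff the
// integer part into the exponent field of an IEEE-754 float bit pattern,
// and use a short polynomial for 2^frac on [0, 1).
float exp2_approx(float x) {
    float xi = floorf(x);              // integer part (round toward -inf)
    float xf = x - xi;                 // fractional part in [0, 1)

    // Degree-3 polynomial for 2^xf: illustrative Taylor coefficients
    // (powers of ln 2 over factorials), not a tuned minimax fit.
    float p = 0.0555041f;              // (ln 2)^3 / 3!
    p = fmaf(p, xf, 0.2402265f);       // (ln 2)^2 / 2!
    p = fmaf(p, xf, 0.6931472f);       // ln 2
    p = fmaf(p, xf, 1.0f);

    // Build the float 2^xi directly: biased exponent, zero mantissa.
    uint32_t bits = (uint32_t)((int32_t)xi + 127) << 23;
    float scale;
    memcpy(&scale, &bits, sizeof scale);

    return scale * p;                  // 2^xi * 2^xf
}
```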
Normally you do want some FP operations, so I don't think it would be worth trying to use only 16-bit integer operations to avoid unpacking to float32, even for exp or log, which are somewhat special and intimately connected with floating point's significand * 2^exponent format, unlike sin/cos/tan or other transcendental functions. So I think your best bet would normally still be to start by converting fp16 to fp32, unless you have instructions like AVX-512 FP16 that can do actual FP math on it. But you can still gain performance from not needing as much precision, since implementing these functions normally involves a speed vs. precision tradeoff.
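On x86 without AVX-512 FP16, that convert-compute-convert pattern might look like the sketch below, using the F16C conversion instructions. `exp_approx_ps` is a hypothetical placeholder for whatever vectorized fp32 exp you use (SVML, SLEEF, or a hand-rolled polynomial like the one above); it is not a real intrinsic.

```c
#include <immintrin.h>

// Hypothetical placeholder: some vectorized single-precision exp().
__m256 exp_approx_ps(__m256 x);

// exp() over 8 packed half floats: widen to fp32 with F16C, do the
// math in fp32, then narrow back. Requires -mf16c (implies AVX).
__m128i exp_ph8(__m128i h) {
    __m256 f = _mm256_cvtph_ps(h);     // 8 x fp16 -> 8 x fp32
    __m256 r = exp_approx_ps(f);       // all the real math in fp32
    return _mm256_cvtps_ph(r, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
}
```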