Counting the number of operations in the inner loop of the dot product in the above image shows that binarization can reduce the number of operations in a layer by around 43X and the size of its parameters by 32X. When numbers like these were first reported by Courbariaux et al. in 2016, they kicked off a flood of binary network research. Of course, approximating 32-bit floats with 1-bit values is lossy. Binary networks typically take a noticeable accuracy hit compared to their corresponding full-precision networks, often around a 20% relative loss in top-1 accuracy. Thus, a major focus of binary network research has been on reducing this accuracy loss.
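To make that arithmetic concrete, here is a minimal Python sketch (not the implementation counted in the image, and using a hypothetical 64-element vector) of how a binarized dot product collapses a loop of multiply-adds into a single XNOR and popcount:

```python
def float_dot(x, w):
    # Full precision: one multiply and one add per element (128 ops for 64 weights).
    return sum(xi * wi for xi, wi in zip(x, w))

def binary_dot(x_packed, w_packed, n_bits=64):
    # Binarized: activations and weights in {-1, +1} are packed into the bits
    # of a single 64-bit word, so the whole dot product becomes one XNOR plus
    # one popcount. A 64-bit word also holds 64 binary weights instead of two
    # 32-bit floats, which is where the 32X parameter shrink comes from.
    xnor = ~(x_packed ^ w_packed) & ((1 << n_bits) - 1)
    matches = bin(xnor).count("1")   # popcount: positions where x and w agree
    return 2 * matches - n_bits      # convert the match count back to a +/-1 sum
```

The exact operation-count ratio depends on how the target hardware prices popcount relative to a fused multiply-add, which is how headline figures in the tens, like the 43X above, are derived.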
Although works such as Rastegari et al., Cai et al., and Choi et al. make strides toward more accurate binary networks, none of them actually implement the networks in a way that can measure end-to-end speedups. This lack of implementation not only makes it difficult to know how fast binary networks actually are, it also prevents them from being used in many practical settings. The reason for the gap is that a simple loop-based binary matrix multiply would be dramatically slower than most full-precision multiplies despite having a lower operation count. A function's run time isn't determined purely by its number of operations; the way memory is accessed also plays a huge role. Any binary network implementation must compete against libraries like OpenBLAS and MKLDNN, which have been hand-optimized by large engineering teams over many years. It's simply not feasible for most research institutions to devote the time and effort needed to build a competitive operator library. Instead, they simulate binarization during training and assume speedups can be predicted from operation counts.
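As an illustration, here is a hedged sketch (not any published implementation) of what such a simple loop-based binary matrix multiply might look like; the raw operation count is tiny, but the scalar loops and untiled memory accesses leave it far behind a tuned float GEMM:

```python
import numpy as np

def naive_binary_gemm(A_bits, B_bits, n_bits=64):
    # A_bits: (M, Kw) and B_bits: (N, Kw) arrays of uint64, each word holding
    # n_bits packed +/-1 values. The inner loop needs only one XNOR and one
    # popcount per word, yet without cache blocking, vectorization, or register
    # tiling this runs dramatically slower than an optimized full-precision GEMM.
    M, Kw = A_bits.shape
    N, _ = B_bits.shape
    C = np.empty((M, N), dtype=np.int32)
    mask = (1 << n_bits) - 1
    for i in range(M):
        for j in range(N):
            matches = 0
            for k in range(Kw):
                xnor = ~(int(A_bits[i, k]) ^ int(B_bits[j, k])) & mask
                matches += bin(xnor).count("1")
            C[i, j] = 2 * matches - Kw * n_bits
    return C
```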
To address these challenges, we present Riptide, an approach for identifying and addressing the bottlenecks of end-to-end binary networks. Riptide builds on TVM, a compiler for deep learning systems, which enables us to automatically generate tuned, high-performance binarized operators.
To date, binary network optimizations have focused narrowly on strategies for efficiently implementing the core low-bit convolution layers. This focus is motivated by the assumption that binary network performance mirrors the behavior of higher-precision networks: if the core convolutions can be made accurate enough using as few bits as possible, then the entire network will be fast. However, no binary network is composed solely of convolutions; many critical intermediate operations sit between convolutions to massage data for the next layer. Such layers contribute negligible latency in higher-precision networks, but in binary networks, where convolutions are up to 43X faster, these intermediate “glue layers” become significant.
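To give a feel for what this glue does, here is a hedged sketch (hypothetical names, following the standard binary-network recipe rather than any specific pipeline) of the work that typically sits between two binarized convolutions:

```python
import numpy as np

def glue_between_binary_convs(popcount_out, scale, threshold=0.0):
    # Hypothetical glue stage between two binarized convolutions: rescale the
    # integer popcount output, re-binarize the activations, and bit-pack them
    # so the next layer can operate on whole words. Each step is trivial next
    # to a full-precision convolution, but against a far faster binary
    # convolution its latency can no longer be ignored.
    x = popcount_out.astype(np.float32) * scale   # undo the binarization scaling
    bits = (x > threshold).astype(np.uint8)       # quantize back down to 1 bit
    return np.packbits(bits, axis=-1)             # pack bits for the next conv
```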