CONMW Transformer: A General Vision Transformer Backbone With Merged-Window Attention
Ang Li, Jichao Jiao, Ning Li, Wangjing Qi, Wei Xu, Min Pang
SPS
Length: 00:13:48
Image recognition techniques such as object detection are useful for assisting humans in remote video surveillance tasks. However, compression algorithms used for efficient video transmission are usually tuned for low reconstruction error rather than for machine vision, leading to suboptimal recognition accuracy. In this work, we propose convolutional encoder-decoder neural networks for compressing video data intended for object detection. These networks are trained jointly for detection accuracy and bitrate, and make use of a novel stochastic quantization technique. In our experiments, we evaluate our method on publicly available datasets and show that, at identical object detection accuracy, we can substantially reduce bitrate compared with traditional codecs such as H.265 and with other deep-learning-based compression methods. Moreover, we argue that the perceived image quality of our compression method is close to that of H.265 at similar bitrates.
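The abstract does not specify the details of the stochastic quantization technique. As background, a common way to make quantization trainable in learned compression pipelines is to replace hard rounding with additive uniform noise during training, so the operation remains differentiable in expectation, and to round only at inference. The sketch below illustrates that generic idea; it is an assumption for exposition, not the authors' actual method.

```python
import numpy as np

def stochastic_quantize(latents, training=True, rng=None):
    """Quantize encoder latents to integers.

    Illustrative sketch only -- the paper's actual stochastic
    quantization technique is not described in the abstract.
    During training, rounding is replaced by additive uniform
    noise in [-0.5, 0.5) so gradients can pass through; at
    inference the latents are hard-rounded for entropy coding.
    """
    if rng is None:
        rng = np.random.default_rng()
    if training:
        # Differentiable surrogate: noisy latents approximate
        # the distribution of rounded values.
        return latents + rng.uniform(-0.5, 0.5, size=latents.shape)
    # Inference: deterministic rounding to integer symbols.
    return np.round(latents)
```

At inference the rounded symbols would be passed to an entropy coder; during training the noisy latents feed both the decoder (for the detection/accuracy loss) and a rate estimate (for the bitrate loss).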