A Foreground Inference Network for Video Surveillance Using Multi-View Receptive Field.
Foreground (FG) pixel labelling plays a vital role in video surveillance. Recent engineering solutions have attempted to exploit the efficacy of deep learning (DL) models initially targeted for image classification to deal with FG pixel labelling. One major drawback of such strategy is the lacking delineation of visual objects when training samples are limited. To grapple with this issue, we introduce a multi-view receptive field fully convolutional neural network (MV-FCN) that harness recent seminal ideas, such as, fully convolutional structure, inception modules, and residual networking. Therefrom, we implement a system in an encoder-decoder fashion that subsumes a core and two complementary feature flow paths. The model exploits inception modules at early and late stages with three different sizes of receptive fields to capture invariance at various scales. The features learned in the encoding phase are fused with appropriate feature maps in the decoding phase through residual connections for achieving enhanced spatial representation. These multi-view receptive fields and residual feature connections are expected to yield highly generalized features for an accurate pixel-wise FG region identification. It is, then, trained with database specific exemplary segmentations to predict desired FG objects.
The comparative experimental results on eleven benchmark datasets validate that the proposed model achieves very competitive performance with the prior- and state-of-the-art algorithms. We also report that how well a transfer learning approach can be useful to enhance the performance of our proposed MV-FCN.
Publisher URL: http://arxiv.org/abs/1801.06593