Learning to Fuse Monocular and Multi-view Cues for
Multi-frame Depth Estimation in Dynamic Scenes
CVPR 2023

Abstract

Multi-frame depth estimation generally achieves high accuracy by relying on multi-view geometric consistency. When applied in dynamic scenes, this consistency is usually violated in the dynamic areas, leading to corrupted estimations. Many multi-frame methods handle dynamic areas by identifying them with explicit masks and compensating the multi-view cues with monocular cues represented as local monocular depth or features. The improvements are limited due to the uncontrolled quality of the masks and the underutilized benefits of fusing the two types of cues. In this paper, we propose a novel method that learns to fuse the multi-view and monocular cues encoded as volumes without needing any masks. As unveiled in our analyses, the multi-view cues capture more accurate geometric information in static areas, while the monocular cues capture more useful contexts in dynamic areas. To let the geometric perception learned from multi-view cues in static areas propagate to the monocular representation in dynamic areas, and to let monocular cues enhance the representation of the multi-view cost volume, we propose a cross-cue fusion (CCF) module. It includes cross-cue attention (CCA), which encodes the spatially non-local relative intra-relations from each source to enhance the representation of the other. Experiments on real-world datasets prove the significant effectiveness and generalization ability of the proposed method.

Results in Static & Dynamic Scenes

Our method achieves high accuracy in both static and dynamic scenes:

Results in scenes with dynamic objects.

Results in scenes without dynamic objects.

Motivation

Motivation of the method. We aim to propagate the multi-frame depth in static areas (yellow box) to the monocular cues, and to let the monocular cues in dynamic areas (red box) enhance the multi-frame representations, yielding a final depth that surpasses each individual cue instead of being limited by either.

Model Architecture

DyMultiDepth overview. We extract multi-frame depth cues with a cost volume and monocular depth cues with a one-hot depth volume. We then fuse the two volumes with the proposed cross-cue fusion (CCF) module, letting the geometric perception learned from multi-view cues in static areas propagate to the monocular representation in dynamic areas, and letting the monocular cues enhance the representation of the multi-view cost volume. The fused depth feature is passed to the depth module for final depth estimation.
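To make the fusion idea concrete, below is a minimal PyTorch sketch of a cross-cue fusion block in the spirit described above: attention weights are derived from one cue's volume (its spatially non-local intra-cue relations) and used to re-weight the other cue's volume, in both directions. The class names (`CrossCueAttention`, `CrossCueFusion`), layer choices, and channel handling are illustrative assumptions, not the official implementation; please refer to the paper and released code for the exact design.

```python
import torch
import torch.nn as nn


class CrossCueAttention(nn.Module):
    """Sketch: attention computed from one cue ("guide") re-weights the other cue ("target")."""

    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 2, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 2, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, guide, target):
        # guide:  volume providing the intra-cue relations, shape (B, C, H, W)
        # target: volume whose representation is enhanced, shape (B, C, H, W)
        b, c, h, w = guide.shape
        q = self.query(guide).flatten(2).transpose(1, 2)        # (B, HW, C/2)
        k = self.key(guide).flatten(2)                          # (B, C/2, HW)
        attn = torch.softmax(q @ k / (c // 2) ** 0.5, dim=-1)   # (B, HW, HW), non-local relations of the guide
        v = self.value(target).flatten(2).transpose(1, 2)       # (B, HW, C)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)    # guide-driven aggregation of the target
        return out + target                                     # residual keeps the original cue


class CrossCueFusion(nn.Module):
    """Two symmetric cross-cue attention branches, merged by concatenation + conv."""

    def __init__(self, channels):
        super().__init__()
        self.mono_guides_multi = CrossCueAttention(channels)    # monocular cue enhances multi-view volume
        self.multi_guides_mono = CrossCueAttention(channels)    # multi-view cue enhances monocular volume
        self.merge = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, mono_vol, multi_vol):
        enhanced_multi = self.mono_guides_multi(mono_vol, multi_vol)
        enhanced_mono = self.multi_guides_mono(multi_vol, mono_vol)
        return self.merge(torch.cat([enhanced_multi, enhanced_mono], dim=1))
```

The fused feature returned by `CrossCueFusion` would then be fed to the depth module for the final prediction; in practice the full-resolution attention shown here is usually applied at a downsampled feature scale to keep the (HW x HW) attention map tractable.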

KITTI Results

Quantitative results on KITTI. Our method outperforms other methods on both the overall and the dynamic-region metrics.

DDAD Generalization Results

Generalization on DDAD. Using the model trained on KITTI, our method outperforms other methods on the DDAD test set.

Citation

Acknowledgements: We thank the authors of Mip-NeRF for the awesome website template.