heal.abstract |
3D scene understanding, i.e., the task of perceiving three-dimensional scenes, is
essential to the vast majority of computer vision applications. Indeed, recent
advancements in software and hardware have made 3D representations widely
available, so efficient and reliable algorithms for 3D perception are imperative.
Applications of 3D scene understanding include, but are not limited to, autonomous
driving, indoor agents, and numerous AR and VR scenarios. This thesis particularly
acknowledges the importance of extracting 3D objects from the scene. Identifying
physical 3D objects is a crucial task for robots that need to interact with them,
but also for AR/VR applications, towards a unified user experience where the
physical and digital worlds merge seamlessly.
Over the past few years, the research community has devoted tremendous effort to
3D object extraction, which can be achieved through different tasks. Among them
are 3D object-level detection and 3D instance segmentation. Indeed, by comparing
multiple observations of the same scene captured at different times, object-level
scene comparison can reveal which objects exist in the scene and how people
interact with them. To this end, when an object is defined as anything that has
changed between captures, or as any segmented instance, the aforementioned tasks
are also related to 3D object identification. On top of these tasks, 3D semantic
segmentation provides crucial information about the types of objects present in
the scene. It is thus clear that these sub-tasks provide information that is not
only directly tied to their own task definitions but also relevant to object
identification.
Motivated by the importance of object identification and by the fact that object
discovery can be achieved through different scene understanding sub-tasks, this
thesis focuses on object-level 3D change detection, 3D instance segmentation, and
3D semantic segmentation. All of these problems have been long-standing issues in
the computer vision community. Over the years, multiple solutions have been
proposed, incorporating both trained and non-trained approaches. Non-trained
solutions mainly rely on scene structure and geometry. On the other hand,
supervised solutions, ranging from hand-crafted features and traditional machine
learning techniques to the latest trends in deep learning and foundation models,
have achieved impressive results. However, many questions remain unanswered.
Given the unconstrained nature of these problems, and inspired by the remarkable
performance obtained by integrating scene priors into other 3D vision tasks, we
choose to study the impact of leveraging scene priors towards successful 3D
understanding. More specifically, the data used in this thesis are sourced from
multiple overlapping images, either RGB-D sequences or SfM/MVS data. Thus,
high-quality texture information is available in the form of 2D images. These
images can impose a set of priors on these otherwise unconstrained tasks. Such
priors typically include consistency between 2D and 3D segmentations. Moreover,
implicit 3D scene information, such as the geometric transformations induced by
objects moving between changing scenes, is also exploited as a scene prior.
In the context of this thesis, the main contributions concern demonstrating the
effectiveness of integrating scene priors into the aforementioned 3D scene
understanding sub-tasks. To this end, novel methods for 3D instance segmentation,
3D semantic segmentation, and 3D change detection are formulated. All the
proposed methods integrate constraints induced by the scene priors.
More specifically, two novel approaches for object-level 3D change detection are
proposed. The first method leverages the geometric constraints induced by moving
objects. As such, it grows the initially detected regions (obtained through
render-and-compare) to the whole object undergoing the same rigid transformation,
following the principle that everything that moves together should belong
together. The second approach exploits generic 2D segmentation masks to propagate
change from the initial regions to the whole changing object.
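As a rough illustration of the first idea, the sketch below groups points of one
capture with a detected region whenever they follow the same rigid motion; the
function name, the use of a k-d tree, and the distance threshold are assumptions
of this sketch, not the exact formulation of the thesis.

```python
import numpy as np
from scipy.spatial import cKDTree

def grow_changed_region(points_t0, points_t1, R, t, tol=0.02):
    """Illustrative sketch: keep the points of the first capture that follow the
    rigid transform (R, t) estimated for an initially detected region, i.e. points
    that, once moved, find a close neighbour in the second capture ("everything
    that moves together should belong together")."""
    moved = points_t0 @ R.T + t                # apply the candidate rigid motion
    dist, _ = cKDTree(points_t1).query(moved)  # nearest-neighbour distance in the second capture
    return dist < tol                          # boolean mask of motion-consistent points
```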
Subsequently, for 3D instance segmentation, the well-established RANSAC algorithm
used for 3D plane fitting is extended to H-RANSAC. H-RANSAC ensures that the
extracted 3D planes fulfill an additional 2D consistency check, i.e., the points
of a 3D plane should also belong to the same 2D mask in the image.
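A minimal sketch of such a 2D consistency check is given below; the per-point
mask ids and the acceptance ratio are assumptions made for illustration and do
not reproduce the exact H-RANSAC formulation.

```python
import numpy as np

def mask_consistent(inlier_idx, point_mask_ids, min_share=0.9):
    """Illustrative check: accept a RANSAC plane hypothesis only if most of its
    3D inliers project into the same 2D segmentation mask. `point_mask_ids[i]`
    holds the mask id of point i in a reference image (negative if not visible;
    an assumption of this sketch)."""
    ids = point_mask_ids[inlier_idx]
    ids = ids[ids >= 0]                    # ignore points without a valid projection
    if ids.size == 0:
        return False
    _, counts = np.unique(ids, return_counts=True)
    return counts.max() / ids.size >= min_share  # dominant mask must cover most inliers
```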
Finally, towards 3D semantic segmentation, 2D image information is also exploited.
In the SfM/MVS scenario, the 2D images, which typically offer higher-quality
visual characteristics, are often discarded. Towards a more successful 3D semantic
segmentation, we propose a method that not only leverages image texture but also
introduces a novel heuristic for optimal view selection; thus, the method
identifies the most appropriate view to provide the visual features.
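One plausible form of such a view-selection heuristic is sketched below; the
scoring of frontality against distance is an assumption chosen for illustration
and is not necessarily the heuristic proposed in the thesis.

```python
import numpy as np

def select_best_view(point, normal, cam_centers, visible):
    """Illustrative heuristic: among the cameras that see a 3D point, prefer
    nearby views that observe the surface as frontally as possible."""
    rays = cam_centers - point                                 # point-to-camera vectors
    dist = np.linalg.norm(rays, axis=1)
    frontality = np.abs(rays @ normal) / np.maximum(dist, 1e-9)
    score = frontality / (1.0 + dist)                          # trade off viewing angle vs. distance
    score[~visible] = -np.inf                                  # drop occluded / out-of-frame views
    return int(np.argmax(score))                               # index of the selected image
```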
All the proposed methods are extensively studied and rigorously evaluated using
tailored evaluation metrics on appropriate benchmarks. |
el |