Abstract

In this study, we propose a reconstruction and optimization neural network (RONN), a novel unsupervised convolutional neural network for nonrigid structure from motion. In contrast to traditional methods, which solve for the 3D structure directly, our model focuses on the depth information that is lost owing to projection. This mathematical model is realized as a convolutional neural network with three modules for integration, reconstruction, and optimization, together with two prior-free loss functions. The proposed RONN achieves competitive accuracy on several tested sequences and high visual quality on various real video sequences.

1. Introduction

Nonrigid structure from motion (NRSfM) targets the recovery of a nonrigid structure and camera matrix from given 2D point tracks in monocular views. Unlike its rigid counterpart [1], NRSfM is a highly ill-posed problem with several inherent ambiguities. Solving it therefore requires additional constraints or priors. Many methods assume that the camera moves slowly and smoothly [26]; however, this limits their applicability to real sequences. Another assumption is that the deformation of nonrigid instances can be represented as a weighted sum of basis deformations in the trajectory space [4] or shape space [7]. With these assumptions, the NRSfM problem is transformed into solving for the basis deformations and their coefficients.

Inspired by these assumptions, many researchers have used neural networks to solve the sparse NRSfM problem [8, 9], learning shape representations through unsupervised networks while maintaining good generalization to unseen data. However, these models are incapable of handling dense data.

Dense NRSfM has achieved remarkable progress over the last several years [2, 10–13]. In 2020, Sidhu et al. [13] proposed the first dense neural NRSfM (N-NRSfM) approach with mean shape and demonstrated state-of-the-art performance on widely used datasets. However, when confronted with a long sequence or drastic changes, the mean shape is unreasonable. Additionally, it requires a considerable amount of time to obtain high-performance results.

In this study, we introduce a reconstruction and optimization neural network (RONN) and two improved loss functions for dense NRSfM. RONN mainly includes a depth reconstruction module and a camera optimization module, which reconstruct the depth information lost due to projection and optimize the camera matrix, respectively. Inspired by recent advances in NRSfM [8], the proposed improved loss function is combined with the minimum singular value ratio, and experiments show that it improves the original loss function to varying degrees.

The main contributions of our study are as follows.

(1) We propose the first dense NRSfM network for reconstruction using depth information, namely, RONN. It is a convolutional neural network comprising reconstruction and optimization modules, which realize the reconstruction of the 3D structure and the optimization of the camera matrix, respectively. Its specific structure is introduced in Section 4.1. Compared with directly solving the overall 3D structure as in [13], RONN avoids the use of a mean shape and reduces the amount of theoretical computation.

(2) For the first time, we change the input of the network from every frame to every point, enabling the network to cope with datasets of different sizes. Section 5.3 shows that RONN reconstructs dense and sparse 3D structures without 3D supervision and achieves competitive accuracy on multiple test sequences.

(3) Compared with the original loss functions, the loss functions weighted using the minimal singular-value ratio (msr) can handle complex deformation and further improve the reconstruction accuracy. The weighting method is described in detail in Section 4.2. The comparative experiments in Section 5.2 show varying degrees of improvement.

2.1. NRSfM

NRSfM is inherently ill-conditioned and requires additional constraints or priors to guarantee solution uniqueness. We are concerned with the following additional constraints.

(1) Bregler et al. [14] proposed a low-rank prior, noting that the rank of a fixed rigid 3D structure is three. Dai et al. [7] rearranged the rows of the shape matrix to obtain stronger low-rank priors, demonstrating state-of-the-art performance on sparse datasets at the time. Ansari et al. [10] proposed scalable monocular surface reconstruction (SMSR) with an improved low-rank prior. Its scalability enables competitive accuracy on both sparse and dense data.

(2) Park et al. [15] proposed Procrustean regression, a regression problem based on Procrustes-aligned shapes. Their regression framework for NRSfM comprises a Procrustes-aligned shape loss and a low-rank loss; it is versatile and can reconstruct 3D structures from dense datasets. Additionally, Park et al. [16] proposed a framework for training neural networks with Procrustean regression. Although the network structure is simple, it shows superior reconstruction performance compared with state-of-the-art methods. In [16], it was proven that Procrustes alignment can determine unique motions and eliminate the rigid motion components from reconstructed shapes.

2.2. Neural NRSfM

There have been studies on combining NRSfM with neural networks. Supervised neural networks require large amounts of training data; however, only a few datasets are currently available for quantitative evaluation of NRSfM methods. In contrast, unsupervised networks are easier to implement. C3DPO [17] and deep NRSfM [9] learn basis shapes from 2D observations without 3D supervision on sparse datasets. C3DPO [17], which was proposed by Facebook’s AI lab, uses a factorization network to replace the classical factorization step. Additionally, to ensure the quality of the factorization, it works together with a canonicalization network to achieve robust 3D reconstruction. This framework achieved high-performance reconstruction results on rigid and nonrigid datasets. Kong and Lucey [9] proposed a new prior, using multilayer sparse coding to represent 3D nonrigid shapes, and designed an encoder-decoder neural network to realize an unsupervised network for NRSfM. They extended the classic sparse coding algorithm, ISTA, to block-sparse scenarios and achieved state-of-the-art performance with the proposed network. However, sparse coding also limits its application to dense datasets. Sidhu et al. [13] introduced the first dense neural NRSfM approach, namely, N-NRSfM, and achieved competitive performance on widely used dense datasets. They used the mean shape to achieve reconstruction, but this became a limitation: once the mean shape is determined, the reconstruction result is obtained. Therefore, when confronted with large-scale deformation, the reconstruction results are not as good as expected.

3. Mathematical Model

Consider a monocular camera for observing a nonrigid object with a set of feature points. Let be the 3D shape matrix of the nonrigid object at the frame and be its 2D matrix according to an orthographic projection. Specifically,

and are related by the full rotation matrix as where is already centralized; therefore, the camera matrix is reduced to pure rotation [14]. According to Formula (3), contains both the camera and 3D structure information. In this study, a reasonable network architecture was designed to separate the required information.

Formula (3) can be changed to the following form.

According to Formula (4), the full rotation matrix will have a certain effect on the reconstruction results; therefore, optimization of the full rotation matrix is necessary.
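Formulas (3) and (4) are not reproduced in this text, so the following NumPy sketch illustrates only one common reading of the model, taken here as an assumption: a centered shape is rotated by the full 3 × 3 rotation matrix, its first two rows form the 2D matrix, and the discarded third row is the depth. Under that assumption, the 3D shape is recovered by stacking the 2D matrix with the depth and applying the transposed rotation. The function names `project` and `lift` are hypothetical.

```python
import numpy as np

def random_rotation(rng):
    """Draw an orthogonal 3 x 3 matrix via QR and force det = +1."""
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q

def project(R, S):
    """Assumed orthographic model: rotate the shape, keep x/y rows as the 2D matrix."""
    X = R @ S
    return X[:2], X[2:]          # 2D matrix (2 x P), lost depth (1 x P)

def lift(R, W, z):
    """Recover the shape from the 2D matrix plus depth: S = R^T [W; z]."""
    return R.T @ np.vstack([W, z])

rng = np.random.default_rng(0)
S = rng.standard_normal((3, 50))
S -= S.mean(axis=1, keepdims=True)   # centered shape, so no translation is needed
R = random_rotation(rng)
W, z = project(R, S)
S_rec = lift(R, W, z)                # exact when both R and z are exact
```

In this reading, an error in the rotation matrix propagates directly into the lifted shape, which is consistent with the need to optimize the rotation alongside the depth.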

4. RONN Model

In this section, we introduce the structure and loss functions of the RONN.

4.1. Network Structure

As illustrated in Figure 1, inspired by the trajectory space, our RONN contains three modules: dimensional integration module , camera optimization module , and depth reconstruction module . The dimensional integration module integrates the information contained in and in the matrix, represents the optimization module for the camera matrix, and represents the reconstruction module for the depth information .

The 2D matrix first passes through a dimensional integration module, which consists of a convolutional layer with a kernel size of and a ReLU layer. The next two modules are , which has (set to 1 by default) residual blocks and a linear layer after rearranging the shape, from which we obtain, and , which has . Specifically, the residual block contains two convolutional layers of kernel size 1 × 1 and a ReLU layer.
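A minimal NumPy sketch of such a residual block follows, under stated assumptions: a 1 × 1 convolution over feature points is treated as a single linear map shared by all points, the layer order is assumed to be convolution, ReLU, convolution with a skip connection, and the names `conv1x1` and `residual_block` as well as the channel width are hypothetical. It is intended only to show why shared per-point parameters make the block independent of the number of points.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """1 x 1 convolution over points.

    x: (C_in, P) features, weight: (C_out, C_in), bias: (C_out,).
    The same weight and bias are applied to every point (column).
    """
    return weight @ x + bias[:, None]

def residual_block(x, w1, b1, w2, b2):
    """Assumed layout: conv1x1 -> ReLU -> conv1x1, added back to the input."""
    h = np.maximum(conv1x1(x, w1, b1), 0.0)
    return x + conv1x1(h, w2, b2)

rng = np.random.default_rng(0)
C = 8                                       # hypothetical channel width
w1, b1 = rng.standard_normal((C, C)), rng.standard_normal(C)
w2, b2 = rng.standard_normal((C, C)), rng.standard_normal(C)

# The same parameters process inputs with different numbers of points.
out_41 = residual_block(rng.standard_normal((C, 41)), w1, b1, w2, b2)
out_91 = residual_block(rng.standard_normal((C, 91)), w1, b1, w2, b2)
```

Because no parameter has a dimension tied to the number of points, only layers that flatten across points (such as a linear layer applied after rearranging the shape) depend on the dataset size.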

Through this network, our reconstruction result can be expressed as where is the SVD of the camera matrix.

In this study, the method for initializing is the same as that in [1] on dense datasets and [7] on sparse datasets.

4.2. Loss Function

To solve the NRSfM problem, we propose minimizing the loss functions with the initial rotation matrix as where encode the additional constraints.

The temporal smoothness term is used to constrain the similarity of the reconstruction results of adjacent frames as where denotes the Huber loss of the matrix. Weight is discussed later.
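As a rough illustration of a temporal smoothness term of this kind, the sketch below sums an elementwise Huber penalty over the differences between adjacent reconstructed frames. The threshold `delta`, the sum reduction, and the optional per-pair `weights` argument are assumptions made for the example, not values taken from the paper.

```python
import numpy as np

def huber(x, delta=1.0):
    """Elementwise Huber penalty: quadratic near zero, linear beyond delta."""
    a = np.abs(x)
    return np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta))

def temporal_smoothness(S, weights=None, delta=1.0):
    """Weighted sum of Huber penalties on adjacent-frame differences.

    S: (F, 3, P) stack of reconstructed shapes.
    weights: optional (F - 1,) array, one weight per adjacent pair.
    """
    diffs = S[1:] - S[:-1]
    per_pair = huber(diffs, delta).sum(axis=(1, 2))
    if weights is None:
        weights = np.ones(len(per_pair))
    return float(np.dot(weights, per_pair))

# A static sequence has zero smoothness loss.
static = np.ones((4, 3, 10))
loss_static = temporal_smoothness(static)
```

The Huber penalty grows only linearly for large differences, so a few frames with strong deformation are penalized less harshly than under a squared loss.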

The Procrustean alignment term can determine unique motions and eliminate the rigid motion components from reconstructed shapes [15]; the term can be expressed as follows. where is the translation matrix, centering the shape at the origin. Weight is discussed later in this paper. This function is aimed at minimizing the error between the 3D shape of each frame and the reference shape to optimize the rotation matrix.
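The alignment step behind such a term can be sketched with the standard orthogonal Procrustes solution. The example below is not the paper's loss; it assumes 3 × P shapes, removes the translation by centering each shape at the origin, and then finds the rotation that best maps one shape onto a reference shape via an SVD. The function names are hypothetical.

```python
import numpy as np

def center(S):
    """Remove translation by centering the 3 x P shape at the origin."""
    return S - S.mean(axis=1, keepdims=True)

def procrustes_rotation(S, S_ref):
    """Rotation R (det = +1) minimizing ||R @ S - S_ref||_F for centered shapes."""
    U, _, Vt = np.linalg.svd(S_ref @ S.T)
    d = 1.0 if np.linalg.det(U @ Vt) >= 0 else -1.0
    return U @ np.diag([1.0, 1.0, d]) @ Vt   # flip last axis to avoid reflections

def align(S, S_ref):
    """Return the aligning rotation and the residual Frobenius error."""
    S_c, ref_c = center(S), center(S_ref)
    R = procrustes_rotation(S_c, ref_c)
    return R, float(np.linalg.norm(R @ S_c - ref_c))

# A rotated and translated copy of a shape aligns with (near) zero residual.
rng = np.random.default_rng(1)
S = rng.standard_normal((3, 30))
t = 0.6
R0 = np.array([[np.cos(t), -np.sin(t), 0.0],
               [np.sin(t),  np.cos(t), 0.0],
               [0.0,        0.0,       1.0]])
S_ref = R0 @ S + np.array([[1.0], [2.0], [3.0]])
R_est, err = align(S, S_ref)
```

Whatever residual remains after this alignment reflects nonrigid deformation rather than rigid motion, which is the property the alignment term relies on.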

A rearranged shape matrix is expressed as with an additional constraint . In [10], they assumed that the mean 3D component was dominant in and could be removed in the temporal dimension. By combining both ideas, is defined as where is the orthogonal projection, and is a vector of ones.

When using the optimized module , must be used as the data term and as the regularization term to form a Procrustean regression.

Weights and are set using the minimal singular-value ratio [8]. Given two 2D matrices, and , let be the stacked matrix of and as follows:

Then, the ratio of the minimal singular value of is used to define the rigidity measure msr as follows: where is the -th singular value of in descending order.
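The intuition behind this measure can be checked numerically. If two frames are orthographic views of the same centered rigid shape, the stacked 4 × P matrix factors into a 4 × 3 camera block times the 3 × P shape, so its rank is at most three and its smallest singular value vanishes. The sketch below assumes, for illustration only, that msr is the smallest singular value divided by the largest; the exact normalization used in [8] is not reproduced in this text.

```python
import numpy as np

def msr(W_i, W_j):
    """Assumed form of the rigidity measure for two 2 x P matrices.

    Stacks the two 2D matrices into a 4 x P matrix and returns the ratio of
    its smallest to its largest singular value (P >= 4 is assumed).
    """
    stacked = np.vstack([W_i, W_j])
    s = np.linalg.svd(stacked, compute_uv=False)   # sorted in descending order
    return float(s[-1] / s[0])
```

A value near zero indicates that the two frames are consistent with a single rigid shape, while larger values indicate deformation between them, which makes the ratio a natural basis for weighting frame pairs.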

Then, weights and are defined as follows:

5. Experiment

In this section, the experimental results are described for several widely used benchmarks and real datasets. First, we introduce the datasets and experimental setups; then, we analyse the proposed model and compare it with state-of-the-art methods on dense and sparse datasets; finally, we conduct experiments on real data.

5.1. Datasets and Setups
5.1.1. Datasets

Three dense benchmark datasets are used in the comparison of methods: synthetic faces (two sequences with 99 frames and two different camera trajectories denoted by Traj.A and Traj.B, with 28,887 points per frame) [2], expressions (384 frames with 997 points per frame) [18], and actor mocap (100 frames with 36,349 points per frame) [19].

5.1.2. Evaluation Metrics

As the performance metric, we use the 3D error, defined as follows, where denotes the Frobenius norm and denotes the ground truth 3D structure at the frame.
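The defining equation is not reproduced in this text. A common form of this metric in the dense NRSfM literature, assumed here for illustration, is the per-frame Frobenius norm of the difference between the reconstructed and ground truth shapes, normalized by the Frobenius norm of the ground truth and averaged over frames. The function name `mean_3d_error` is hypothetical.

```python
import numpy as np

def mean_3d_error(S_pred, S_gt):
    """Assumed normalized mean 3D error.

    S_pred, S_gt: (F, 3, P) arrays of reconstructed and ground truth shapes.
    Returns the mean over frames of ||S_pred - S_gt||_F / ||S_gt||_F.
    """
    num = np.linalg.norm(S_pred - S_gt, axis=(1, 2))   # Frobenius norm per frame
    den = np.linalg.norm(S_gt, axis=(1, 2))
    return float(np.mean(num / den))
```

Under this definition, a reconstruction that is uniformly 10% too large yields an error of 0.1, so the metric is independent of the overall scale of the dataset.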

5.1.3. Training Details

The RONN was implemented in PyTorch [20]. We used the Adam optimizer with a learning rate of 0.0005 and trained for 2000 epochs. In the experiment, the weight was fixed at .

5.2. Model Analysis
5.2.1. Structure of RONN

The baseline was formed by removing the module in the RONN. These experiments show the advantages of and the necessity of ; it contains different combinations of loss functions on synthetic face sequences (Traj.A and Traj.B).

The advantages of are listed in Table 1. Because the network solves the depth information , rather than the entire 3D structure, this allows the reconstruction to be achieved only with , although the error is relatively large.

The necessity of is shown in Table 2. When using the combination of and , is reduced by 31.72% for Traj.A and 16.88% for Traj.B. When using the combination of , , and , is reduced by 33.83% for Traj.A and 22.00% for Traj.B.

5.2.2. Effectiveness of Improved Loss Functions

To understand the effectiveness of the improved loss functions, we conducted experiments with the following original loss functions.

In Table 3, in the case of RONN without , compared with , of the is reduced by 5.4% for Traj.A and 8.8% for Traj.B.

Because the combination of and must be used when using the network, an experiment without is added to show the performance of different improved functions. In that setting, the improvement in did not change the error. With the addition of , in contrast, the combination of and reduces the error by 12.11% on Traj.A and 5.01% on Traj.B, which also shows the necessity of optimizing the camera matrix.

5.3. Comparison of Methods
5.3.1. Synthetic Faces

The 3D errors for synthetic faces are listed in Table 4. Compared with jumping manifolds (JM) [21], Grassmannian manifold (GM) [11], SMSR [10], probabilistic point trajectory approach (PPTA) [22], consolidating monocular dynamic reconstruction (CMDR) [23, 24], variational approach (VA) [2], dense spatial-temporal approach (DSTA) [25], expectation-maximization finite element method (EM-FEM) [26], and N-NRSfM [13], the RONN achieves a 3D error close to that of the best method on Traj.A and an average 3D error on Traj.B. The reconstruction accuracy on Traj.B is poorer than that on Traj.A, as is the case for many methods.

5.3.2. Expressions

The 3D errors for the expressions sequence are presented in Table 5. Compared with the expectation-maximization linear dynamical system (EM-LDS) [3]; column space fitting, version 2 (CSF2) [27]; kernel shape trajectory approach (KSTA) [28]; global model with local interpretation (GMLI) [18]; and N-NRSfM [13], the RONN achieves a 3D error on par with those of GMLI and N-NRSfM, which is currently the best method for this sequence. However, the number of iterations is significantly reduced (compared to 60,000 times in N-NRSfM).

5.3.3. Actor Mocap

The 3D errors for the actor mocap sequence are listed in Table 6. The RONN achieves a lower 3D error than both CMDR [23, 24] and SMSR [10].

5.3.4. Sparse Reconstruction

Except for the linear layer in , the RONN is composed of a convolutional network, and all feature points share parameters, so the network can handle datasets with different numbers of feature points. On classic sparse data, comprising six standard sequences (drink, pickup, yoga, stretch, dance, and shark), the RONN can also realize reconstruction. The numbers of frames and points for these datasets are (1102, 41), (357, 41), (307, 41), (370, 41), (264, 75), and (240, 91), respectively. As shown in Table 7, compared with the 3D reconstruction of dense data, the correlation between points in sparse 3D reconstruction is relatively small owing to the few feature points. Compared with the classic sparse 3D reconstruction methods, the reconstruction results of the RONN are not as good as expected. However, although the reconstruction error of the RONN is not the best, it is also not the worst.

5.4. Experiments with Real Data

We also reconstructed several real image sequences, i.e., heart surgery [31], back [32], and real face [2] (see Figure 2). For the real face, owing to the large amount of noise in matrix , the final reconstruction result is not as smooth as expected. For the back and heart surgery sequences, the RONN achieved good visual reconstruction results.

6. Conclusion

This study proposes RONN and two improved loss functions. Our method can achieve reconstruction from 2D to 3D without supervision. One of the advantages of the RONN method is its scalability and consistent performance on datasets with different numbers of feature points.

As the first network to directly solve depth information to achieve reconstruction, RONN uses the depth reconstruction module , which can achieve 3D structure reconstruction with only the temporal smoothness loss. Procrustean regression is used to optimize the camera matrix and improve performance, and msr is used to weight the above loss functions, further improving the network’s ability to deal with complex deformation. The experiments demonstrate the effectiveness of the improved loss functions and the high performance of RONN.

Compared to the N-NRSfM approach, which is the first dense neural NRSfM, we do not need a mean shape and employ fewer training epochs. Compared to the classic sparse reconstruction method, the RONN shows better scalability. Transforming the input of the network from every frame to every point enables the network to better cope with the dense conditions.

Because of the direct use of , the current limitation of the proposed method is its sensitivity to the 2D matrix . As shown in Figure 2(c), the noise in is directly shown in the 3D structure, causing the result to be unsmooth. Since the original rotation matrix is required, the reconstructed structure accuracy will be affected by the original rotation matrix. Moreover, the RONN cannot handle data loss.

The N-NRSfM provides a new perspective on dense NRSfM, which we further improved, achieving competitive results. In future research, we will consider complex situations, such as denoising and data loss.

Data Availability

The Python code and data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The author declares that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

This research was supported by the Natural Science Foundation of Zhejiang Province (LZ20F020003, LY17F020034, LY17F020003, and LSZ19F010001) and the National Natural Science Foundation of China (61272311 and 61672466).