Abstract

Deep learning with specific network topologies has been successfully applied in many fields. However, it is chiefly criticized for its lack of theoretical foundations, especially for structured neural networks. This paper theoretically studies multichannel deep convolutional neural networks equipped with the downsampling operator, a structure frequently used in applications. The results show that the proposed networks have strong approximation ability for functions from the ridge class and the Sobolev space. This not only answers the open and crucial question of why multichannel deep convolutional neural networks are universal in learning theory, but also reveals the corresponding convergence rates.

1. Introduction

Deep learning [1] has made remarkable achievements in many fields. Essentially, it relies on structured neural networks, loosely modeled on the biological nervous system, to extract data features for specific learning goals. Among these structured neural networks, a particularly important family, deep convolutional neural networks (DCNNs), has achieved state-of-the-art performance in many domains [2–4]. In practice, multichannel convolution is usually employed, and the resulting multichannel deep convolutional neural networks (MDCNNs) have also achieved excellent performance in classification [5, 6], natural language processing [7], biology [8–10], and many other domains [11–13].

However, compared with the successful applications of MDCNNs, their theoretical basis remains incomplete, which is the main reason why deep learning is widely criticized. In this paper, we present approximation theorems for downsampled MDCNNs, where the downsampling operator plays the role of pooling and reduces the width of the network. Before stating the main results on downsampled MDCNNs, we briefly review the basic concepts of fully connected neural networks (FNNs) and DCNNs.

An FNN with input vector and hidden layers of neurons with widths is defined iteratively by where is a univariate activation function acting componentwise on vectors, is a weight matrix, is a bias vector in layer , and with the width . The form of (1) used to approximate functions is

Note that if , the FNNs defined by (1) degenerate into the well-known classical shallow neural networks. The most important ingredients of (2) for learning functions are the free parameters, namely, the weights and biases. It is easy to see that the form (2) involves free parameters (weights and biases) to be trained, which leads to huge computational complexity when the network is large.
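For readers who prefer code, the following minimal NumPy sketch computes the forward pass of a fully connected ReLU network in the spirit of (1); the layer widths, names, and random initialization are illustrative and not the paper's notation. It also makes visible why the parameter count, which grows with the products of consecutive widths, becomes a computational burden.

```python
import numpy as np

def relu(u):
    # Componentwise rectified linear unit.
    return np.maximum(u, 0.0)

def fnn_forward(x, weights, biases):
    """Forward pass of a fully connected ReLU network.

    weights[k] has shape (width_{k+1}, width_k) and biases[k] has shape
    (width_{k+1},), so each hidden layer computes h <- relu(W h + b).
    """
    h = x
    for W, b in zip(weights, biases):
        h = relu(W @ h + b)
    return h

# Illustrative sizes: input dimension 8, hidden widths 16 and 4.
rng = np.random.default_rng(0)
ws = [rng.standard_normal((16, 8)), rng.standard_normal((4, 16))]
bs = [np.zeros(16), np.zeros(4)]
y = fnn_forward(rng.standard_normal(8), ws, bs)
n_params = sum(W.size + b.size for W, b in zip(ws, bs))  # 16*8+16 + 4*16+4
```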

For DCNNs, we use the definition from [14]. Let , , and be positive integers. The convolution of and is mathematically defined as , where ( denotes the set ), which can be equivalently rewritten as , where is a Toeplitz-type matrix given by
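The sketch below illustrates this Toeplitz-type viewpoint, assuming the common convention in this line of work that a filter w = (w_0, ..., w_s) supported on {0, ..., s} acts on x in R^d via (w * x)_i = sum_k w_{i-k} x_k; the symbols and shapes here are our own illustration.

```python
import numpy as np

def toeplitz_conv_matrix(w, d):
    """Toeplitz-type matrix T of a filter w = (w_0, ..., w_s) acting on R^d.

    T has shape (d + s, d) with entries T[i, k] = w[i - k] (zero outside the
    filter support), so T @ x equals the 1-D convolution of w and x, and the
    output width grows from d to d + s.
    """
    s = len(w) - 1
    T = np.zeros((d + s, d))
    for i in range(d + s):
        for k in range(d):
            if 0 <= i - k <= s:
                T[i, k] = w[i - k]
    return T

w = np.array([1.0, -2.0, 0.5])            # filter of size s = 2
x = np.arange(5, dtype=float)             # input of dimension d = 5
y = toeplitz_conv_matrix(w, 5) @ x        # output lives in R^{d+s} = R^7
assert np.allclose(y, np.convolve(w, x))  # matches full 1-D convolution
```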

With the above notation, a DCNN with input vector and hidden layers of neurons is defined iteratively by where is a univariate activation function as before, denotes a filter supported on , and is a bias vector in layer , . The form of (5) used to learn functions is

Compared with FNNs, the DCNNs defined by (5) involve a sparse matrix in the -th layer, each row of which has no more than nonzero elements. The numbers of weights and biases are and , respectively, a substantial reduction in parameters.

However, this kind of DCNN has increasing width; that is, for an input signal , we have , a structure rarely used in practice. To remedy this unusual structure, downsampling, also known as a pooling operator, is applied in DCNNs to reduce the width [15, 16]. The key role of downsampling is to reduce the dimension of the features while retaining the effective information. To describe it mathematically, we adopt the general version given below.

Definition 1 (downsampling [15]). Let a downsampling set be an index set. The map is called a downsampling operator indexed by if , where denotes the vector restricted to the entries indexed by .
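A minimal sketch of the downsampling operator as index selection; the particular index set below is an illustrative choice, not one used later in the paper.

```python
import numpy as np

def downsample(v, index_set):
    """Downsampling operator: keep only the entries of v indexed by the set.

    The output width equals the cardinality of the index set.
    """
    return v[np.asarray(sorted(index_set))]

v = np.arange(10.0)
print(downsample(v, {1, 3, 5, 7, 9}))  # e.g., keep every other entry
```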

In fact, besides downsampling, multiple filters are usually used in each layer of DCNNs in real applications to obtain multichannel outputs. Each output is composed of channel combinations that provide the flexibility needed to avoid variance issues and loss of information [17], and different channels play the role of extracting multiple features of the input data [16]. Specifically, as pointed out in [18], convolution from the current layer to the next in the multichannel case is usually organized as follows: the inputs of each input channel are first convolved with all related filters to form the convolved inputs; the convolved outputs are then composed of linear combinations of the convolved inputs; finally, an activation function (usually ReLU) acts on each convolved output componentwise. With this fact in mind, the key ingredient of the MDCNNs considered in this paper is multichannel convolution, which is defined mathematically as follows.

Definition 2 (multichannel convolution). Let , , and be the input channel size, the output channel size, and the filter size, respectively. The filters form a third-order tensor. Let be the input data with channels, and let the output of channel , denoted , without bias and activation function, be defined as the sum of the convolved input data, i.e., , where the Toeplitz-type matrix is defined as in (3). Further, let be a bias matrix and the activation function; the multichannel convolution is then given by . The whole multichannel convolution structure is shown in Figure 1.

Remark 3. We remark that Definition 2 implies that there are filters in total and that each filter has the same size . For convenience, we assume that the input data have the same size for all channels, so that the corresponding outputs also have the same size . If , equation (7) degenerates into equation (3).
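A minimal NumPy sketch of Definition 2: each output channel is the sum, over input channels, of the corresponding 1-D convolutions, plus a bias, followed by ReLU. The tensor layout (input channel, output channel, filter tap) and the shapes are our own illustrative conventions.

```python
import numpy as np

def multichannel_conv(X, W, B):
    """Multichannel 1-D convolution (illustrative shapes).

    X : input of shape (c, d)          -- c input channels of width d
    W : filters of shape (c, c', s+1)  -- one filter per (input, output) pair
    B : bias of shape (c', d + s)
    Each output channel sums the convolutions over all input channels, adds
    a bias, and applies ReLU componentwise.
    """
    c, d = X.shape
    _, c_out, s_plus_1 = W.shape
    out = np.zeros((c_out, d + s_plus_1 - 1))
    for j in range(c_out):
        for i in range(c):
            out[j] += np.convolve(W[i, j], X[i])  # full convolution, width d + s
    return np.maximum(out + B, 0.0)               # ReLU activation

# Two input channels of width 5, three output channels, filter size s = 2.
rng = np.random.default_rng(1)
Y = multichannel_conv(rng.standard_normal((2, 5)),
                      rng.standard_normal((2, 3, 3)),
                      np.zeros((3, 7)))
print(Y.shape)  # (3, 7)
```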

The multichannel convolution from the current layer to the next provides the main ingredient of MDCNNs. Combined with the downsampling operator given by Definition 1, MDCNNs with downsampling are given below.

Definition 4 (MDCNNs with downsampling). Let be the channel size and filter size in layer , let the set satisfying be used to introduce the downsamplings, and let be the downsampling sets. An MDCNN with downsampling operators and input data with widths is defined iteratively by and, for , is a sequence of function vectors defined iteratively by Here, all channels in the same layer have equal size, the downsampling operators act on each channel of layer , denotes the cardinality of , and the tensor denotes the filters between layers and . Finally, the form of MDCNNs used to approximate functions is where the are coefficients. The structure of MDCNNs is shown in Figure 2.

Remark 5. The form (11) indicates that the objective form has three important ingredients, corresponding to the filters, the biases, and the channel sizes. From another perspective, it belongs to . If all layers have only one channel, MDCNNs degenerate into DCNNs. We say that an MDCNN with downsampling has uniform filter lengths if all channels in every layer have the same size; under this circumstance, we call the MDCNN with downsampling uniform. All MDCNNs with downsampling considered in our main results are uniform.
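To make the layered structure of Definition 4 concrete, the sketch below stacks a few downsampled multichannel convolutional layers and ends with a linear combination of the last-layer outputs in the spirit of (11). The channel sizes, downsampling sets, and random parameters are illustrative choices, not the ones used in the proofs.

```python
import numpy as np

def mdcnn_layer(X, W, B, index_set):
    """One layer of a downsampled MDCNN: multichannel convolution, bias,
    ReLU, then the same downsampling operator applied to every channel.

    X : (c, d), W : (c, c', s+1), B : (c', d+s), index_set : kept positions.
    """
    c, d = X.shape
    _, c_out, _ = W.shape
    conv = np.zeros((c_out, d + W.shape[2] - 1))
    for j in range(c_out):
        for i in range(c):
            conv[j] += np.convolve(W[i, j], X[i])
    act = np.maximum(conv + B, 0.0)
    keep = np.asarray(sorted(index_set))
    return act[:, keep]

# Three layers with 1 -> 3 -> 3 -> 2 channels; downsampling back to the first
# d positions after each convolution keeps every channel at width d = 6.
rng = np.random.default_rng(2)
d, s = 6, 2
X = rng.standard_normal((1, d))
channels = [1, 3, 3, 2]
for c_in, c_out in zip(channels[:-1], channels[1:]):
    W = rng.standard_normal((c_in, c_out, s + 1))
    B = rng.standard_normal((c_out, d + s))
    X = mdcnn_layer(X, W, B, range(d))
output = rng.standard_normal(X.size) @ X.ravel()  # linear combination as in (11)
```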

However, the existing theoretical studies cannot be applied to MDCNNs. For example, Zhou [14, 15, 19] only considers single-channel DCNNs whose widths increase with depth. Multichannel convolution was also used in the recent Butterfly-Net [20, 21], which is based on the butterfly algorithm. However, multichannel convolution is only one part of its network structure, and the structure of our MDCNNs, which rely solely on multichannel convolution, is different from that of Butterfly-Net. Moreover, Butterfly-Net studies the approximation of the Fourier representation of the input data, which also differs from our setting. To investigate the approximation ability of MDCNNs, we study their behavior on ridge functions and on functions from the Sobolev space . The MDCNNs considered in this paper have finite width , finite filter size , and finitely many channels in each layer. In addition, the activation function is the popular rectified linear unit (ReLU), the univariate function u ↦ max{u, 0}, which is often used to guarantee the nonlinearity of the network. As pointed out in [19, 22], linear combinations of ReLU units can express the objective functions to arbitrary accuracy. Hence, the main proof technique of our theorems is to construct structured MDCNNs that realize the ReLU approximations of the objective functions. In addition, we emphasize the benefit of multiple channels: different channels of a given layer can extract transformed data features from the previous layer. Concretely, we use channels to store the ReLU units, to generate new ReLU units, and to keep the initial data. In this way, our proposed MDCNNs achieve better results in approximating functions than structures built from DCNNs and FNNs. In summary, we make the following contributions to the approximation theory of MDCNNs:
(i) We construct MDCNNs by introducing multichannel convolution, so that different channels are used to extract different data features, and we introduce the downsampling operator into the MDCNNs, so that the width does not grow from layer to layer.
(ii) We present a theorem for approximating ridge functions by MDCNNs of the form with and , which demonstrates that, for this widely used, simple but important function family, MDCNNs have better approximation ability than FNNs and DCNNs.
(iii) We prove a theorem for approximating functions in the Sobolev space , which shows the universality of MDCNNs and the benefit of depth, and also reveals better approximation performance than FNNs and DCNNs.

This article is organized as follows: in Section 2, we present the main results on approximating functions from the ridge class and the Sobolev space and compare them with related work. Proofs of the main results are given in Section 3. Finally, we summarize the paper in Section 4.

2. Main Results

Complicated functions can often be approximated by simple families [23], such as polynomials, splines, wavelets, radial basis functions, and ridge functions. In particular, many approximation results are based on combinations of ridge functions [24, 25]. Our first main result shows the strong approximation ability of downsampled MDCNNs for the ridge class. We then provide the approximation ability of MDCNNs for functions from the Sobolev space. These two results constitute our main results. The main technique of our proofs is to first construct approximations of the objective functions by linear combinations of ReLU units and then to specify the network parameters so that the constructed MDCNNs' outputs match these linear ReLU approximations.

2.1. Approximation of Ridge Functions

Mathematically, a ridge function is a multivariate real-valued function of the form induced by an unknown vector and an unknown univariate external function . Further, let be the class of univariate Lipschitz- functions defined on with constant ; that is, for any ,
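As a concrete instance (with notation of our own choosing, since the displayed formulas are not reproduced here), a ridge function with a Lipschitz-1 external function is:

```latex
% The direction \xi lies in the unit ball of R^d and g(t) = |t| is
% Lipschitz-1, since | |t_1| - |t_2| | \le |t_1 - t_2| for all t_1, t_2.
f(x) = g(\xi \cdot x), \qquad \|\xi\|_2 \le 1, \qquad g(t) = |t|.
```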

Our first result shows the approximation ability of MDCNNs for ridge functions with external function and , where denotes the unit ball. Denote . Throughout this paper, we use to represent the number of computation units (widths or hidden units [15]) and the number of free parameters; the computation units are counted as the hidden units of the MDCNNs.

Theorem 6. Let , let be the uniform filter size, let ( denotes the ceiling function), and let . If , then there exists a downsampled MDCNN with at most 3 channels in each layer, in which the width of each channel is no more than and there are layers, such that where . The number of computation units is , and the number of free parameters is .

Remark 7. The constructed MDCNNs have finitely many channels, finite width, and finite filter sizes, and the convergence rate given by (15) is not only dimension-free but also reveals the benefit of depth. Given an arbitrary approximation accuracy , Theorem 6 shows that we need at least . Taking , we need layers, computation units, and free parameters to obtain (15).

Remark 8. A concrete example is as follows: let , , , and satisfy , so that belongs to the Lipschitz-1 class. By Theorem 6, we can construct an MDCNN with at most 3 channels, in which the width of each channel is no more than 5, and layers such that

2.2. Approximation of Functions from the Sobolev Space

How do MDCNNs behave for smooth functions? Our second theorem shows that functions in Sobolev space of order can be well approximated by a downsampled MDCNN with at most 4 channels.

Theorem 9. Let be the uniform filter size, , , and . If , then, for any and any integer , there exists a downsampled MDCNN with finite width and at most 4 channels such that where is a universal constant and denotes the Sobolev norm of , given by with being the Fourier transform of . The number of computation units is , and the number of free parameters is .

Remark 10. In fact, Theorem 9 demonstrates the universality of MDCNNs; that is, for any compact subset , any function in can be approximated by MDCNNs to arbitrary accuracy when the depth is large enough. The reason is that the set is dense in when we consider Sobolev spaces that can be embedded into the space of continuous functions on . Moreover, the proof of this theorem shows that our constructed MDCNNs have at most 4 channels in each layer and that the width of each layer equals . Given an arbitrary , we require at least . Taking , we have for small . Thus, the number of computation units is , and the number of free parameters is .

Both main results reveal the benefit of depth in approximating functions from the ridge class and the Sobolev space, and they indicate that MDCNNs can approximate these two types of functions to arbitrary accuracy if the depth is large enough. Moreover, the constructed MDCNNs have finitely many channels, finite width, and finite filter sizes, which is closer to real-world settings than [14, 15, 19].

2.3. Comparison and Discussion

Most studies on the approximation theory of neural networks focus on two aspects. The first, obtained in the late 1980s, concerns universality [26–28], meaning that any continuous function can be approximated by (2) to arbitrary accuracy; in other words, the space is dense in the objective function space. The second concerns convergence rates [24, 25, 29–31] in terms of neurons, parameters, or depth. For a fair comparison, in this part we compare our main results with other theoretical investigations of networks in the literature under approximation error . Specifically, we carry out the comparison in terms of width , filter size , depth , the number of computation units , and the number of free parameters .

Let denote the set of combinations of ridge functions with cardinality no larger than ; it was proven in [24] that the approximation error of any function from the Sobolev space in the space with behaves asymptotically as by FNNs. The advantage of Theorem 6 over [24] is the dimension-free convergence rate (15), which demonstrates the good performance of MDCNNs in approximating ridge functions. Besides, let ( denotes the floor function), where is a scaling parameter. Paper [15] constructed a DCNN with filter size in the last layer and finite depth , and obtained a convergence rate of for ridge functions with external function , requiring at most computation units and at most free parameters. However, the filter size is usually no larger than the input dimension in practice, so this structure is rarely used. By comparison, even though Remark 7 indicates that the computation units and free parameters of the MDCNNs constructed in Theorem 6 are of the same order as those of [15], our constructed network is closer to real-world applications, and the convergence rate (15) reveals the benefit of depth.

Let ; based on Taylor expansion, paper [32] showed that ReLU neural networks with depth at most , at most free parameters, and at most computation units suffice to ensure approximation accuracy . In comparison, Remark 10 states that our MDCNNs only need layers, which is dimension-free, with computation units and free parameters, to achieve approximation accuracy . Our Theorem 9 has clear advantages over [32], since the discussions in [14, 15] indicate that contains a factor that may be very large when the input dimension is large. In addition, paper [14] considered DCNNs without downsampling operators, which leads to linearly increasing network widths. In that situation, the numbers of computation units and free parameters of the DCNNs with are and , respectively. Compared with that work, the MDCNNs from Theorem 9 have finite width, at most computation units, and at most free parameters, which demonstrates better performance than [14].

3. Proofs of Main Results

There are two kinds of downsampling operators and acting on each layer of our constructed MDCNNs to ensure the finite-width property, where and ( denotes the set of all integers belonging to ). Thus, for any and , we can write the downsampled convolutions as and , where and are square matrices consisting of rows of , given by

That is, the convolution of and with downsampling operators and is equivalent to a special matrix-vector multiplication. Furthermore, these two kinds of convolution have the property that, for an input signal , we have and ; i.e., the input data and output data have equal widths.
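The sketch below illustrates this width-preserving effect: restricting the (d+s) x d Toeplitz-type matrix to d of its rows gives a square matrix, so the downsampled convolution maps R^d to R^d. The particular row index set is an arbitrary illustration, not one of the sets used in the proofs.

```python
import numpy as np

def toeplitz_conv_matrix(w, d):
    # Toeplitz-type matrix of filter w = (w_0, ..., w_s): shape (d + s, d).
    s = len(w) - 1
    T = np.zeros((d + s, d))
    for i in range(d + s):
        for k in range(max(0, i - s), min(d, i + 1)):
            T[i, k] = w[i - k]
    return T

d, w = 6, np.array([0.5, -1.0, 2.0])     # input width d = 6, filter size s = 2
T = toeplitz_conv_matrix(w, d)           # shape (d + s, d) = (8, 6)
rows = np.arange(d)                      # illustrative downsampling set of size d
T_down = T[rows, :]                      # keep d of the d + s rows: a d x d square matrix
x = np.random.default_rng(3).standard_normal(d)
y = T_down @ x                           # output width equals input width d
print(T_down.shape, y.shape)             # (6, 6) (6,)
```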

We first introduce some lemmas in Subsection 3.1 that will be used later to prove our main results. Detailed proofs of our main results will be shown in Subsection 3.2.

3.1. Auxiliary Lemmas

Lemma 11. Let , , , be the uniform filter size, , let the constant be an arbitrary upper bound of , and . Then, there exists an MDCNN with 3 channels having downsampling sets and such that with and . Moreover, , the number of computation units is , and the number of free parameters is .

Remark 12. The proof of this lemma shows that, by changing the bias in layer to ( denotes the Kronecker product), we have . Moreover, if we abandon the last channel of each layer, i.e., if the channel used to store the input data is deleted in the proof, then after layers of convolution we have (here, in layer , we choose the downsampling set ), and , the number of computation units is , and the number of free parameters is .

Remark 13. The proof of this lemma shows that our constructed MDCNN has three characteristics: first, only 3 channels are used in all layers; second, all channels have the same width ; finally, it has finitely many layers . These characteristics show that MDCNNs have great superiority in terms of computation units and free parameters.

Proof. Our MDCNN contains 3 channels in layer : the first channel is used to produce the target output, the second channel to shift the input data by units, and the third channel to store the input data; the last layer contains 2 channels. By carrying out the convolutions through the layers, we obtain the desired result. We first construct the filters and biases in the first layer. Choosing , , , and , where denotes the vector in whose entries all equal 1, we have By taking the downsampling set , we get It is noteworthy that, for , with having the form of (18). For , has the following form: and , for , . By choosing the downsampling set , we have Similarly, for , with having the form of (18). For , by choosing and the downsampling set , we have In the same way, for , with having the form of (18). From the representation (18), we easily see that the outputs of the whole constructed convolutional network in each channel have equal width . Thus, the number of computation units of the network is , and the number of free parameters is .

Our next goal is to give the convergence rates for functions from the Lipschitz- class. Before that, we introduce one more lemma, inspired by [33, 34].

Lemma 14. Let and ; there exists a piecewise linear function with breakpoints such that where

Remark 15. There are two differences between Lemma 14 and [33, 34]. On the one hand, Lemma 7.3 of [34] gives a result similar to this lemma, but it does not give a concrete expression for the function used for the approximation; on the other hand, [33] gives a concrete ReLU expression for the approximant, but only for functions from the Lipschitz- class. Besides, both [33, 34] are obtained under the assumption that the input data .

Proof. Our proof is divided into two parts. First, we give an upper bound for based on the modulus of continuity of . By Lemma 6 of [19], let , , and , ; the piecewise linear function used to approximate functions in has the form , where is a univariate function with given by . After reordering, letting and for , where and , we have Then, by Lemma 6 of [19], for any , we have which gives (26).
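The sketch below illustrates the flavor of this construction: a piecewise linear interpolant at equally spaced breakpoints, written as a linear combination of ReLU ramps. The breakpoint placement and coefficient ordering are our own and may differ from the lemma's exact expression.

```python
import numpy as np

def relu_interpolant(g, a, b, N):
    """Piecewise linear interpolant of g at N + 1 equally spaced breakpoints
    on [a, b], expressed as g(a) + sum_i c_i * relu(x - t_i)."""
    t = np.linspace(a, b, N + 1)
    gv = g(t)
    slopes = np.diff(gv) / np.diff(t)                    # slope on each subinterval
    c = np.concatenate(([slopes[0]], np.diff(slopes)))   # slope increments
    def h(x):
        x = np.atleast_1d(x)
        ramps = np.maximum(x[:, None] - t[:-1][None, :], 0.0)  # relu(x - t_i)
        return gv[0] + ramps @ c
    return h

g = np.sin                                  # a Lipschitz-1 function on [-1, 1]
h = relu_interpolant(g, -1.0, 1.0, 16)
xs = np.linspace(-1.0, 1.0, 1001)
# Uniform error is at most a constant times 1/N for Lipschitz-1 functions
# (faster for smooth g such as sin).
print(np.max(np.abs(g(xs) - h(xs))))
```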

Equation (27) indicates that any function can be well approximated by linear combinations of ReLU units. These linear combinations of ReLU units therefore inspire us to construct MDCNNs with downsampling to approximate functions. Our next lemma shows specifically how to embed (27) into downsampled MDCNNs.

Lemma 16. Let , , and ; then there exists a downsampled MDCNN with layers, all of which have only 3 channels, such that with , . The number of computation units is , and the free parameters are the weights , the biases , and .

Proof. The main technique is to embed the ReLU expression from Lemma 14 into a specific MDCNN. Different channels play the roles of storing the input data, shifting the input data, and storing the ReLU units.
Choosing , we have . For the first layer, with , , we have .
For , , where , , and . In this matrix, the columns represent the filters for different output channels, and the rows correspond to the respective input channels; that is, the row and column indices correspond to the input and output channel indices. Then, . By induction, we have Here, when , we change the bias of the second output channel to 1 so that it has zero output. The third component contains the linear ReLU units from .
For , with all elements equal to zero except the top-left and bottom-right entries, which equal 1, and , we get . Choosing , , and , we have Thus, by (26), we have with computation units and free parameters .

3.2. Proofs of Main Theorems

Proof of Theorem 6. Since and , we have . By Remark 12, if we take an upper bound , then there exists an MDCNN with at most 2 channels such that . By setting to zero in the proof of Lemma 16, we have . Then, with replaced by in Lemma 16, we get the desired results.

Proof of Theorem 9. Let ; the approximation of is based on the function from [22], which has the following form: with , , , , , , and . By Theorem 2 of [22], we have for some universal constant . We will embed into a downsampled MDCNN with at most 4 channels to obtain the target approximation. The core of our method is to use different channels to store a variety of data features. We prove the theorem by induction. For the first layers, by Lemma 11, there exists a downsampled MDCNN with 3 channels and layers such that with and , where the coefficient vector in Lemma 11 is replaced by . Moreover, the first output channel stores the linear rectifier units, and the second channel stores the input data at layer . For , by choosing and , where , and the downsampling set , we have with and .
Next, the MDCNN we constructed will contain at most 4 channels: the first channel is used to store linear combinations of linear rectification units, the second channel is used to store the next linear rectification unit, the third channel is used to shift the input data by steps, and the fourth channel is used to store raw input information. Suppose for , with and ; then, for , we choose given by Here, the column label of each matrix represents the input channel index, and the column vectors represent the corresponding filters. Further, by choosing with and the downsampling set , we have with , , , , and .
For , and are given in turn by By choosing and the downsampling set , we have with , , , , and .
For , with , , , , and ; otherwise, the elements in equal 0. By choosing and the downsampling set , we have with , , , and .
For , , and in the first channel, and , and the elements are 0 otherwise; in the second channel, , and the elements are 0 otherwise. By choosing and the downsampling set , we have with and .
By induction, for , we have , and , .
Next, for , we will use the first channel to store the linear combination of ReLU units, and there is no need to store the input data. By Lemma 11, we have for , with and where .
For , with ; the elements are 0 otherwise, . By choosing and the downsampling set , we have with and .
Thus, by choosing , and otherwise, we have Finally, choosing leads to , so the term inevitably appears. However, this causes no trouble: by using the identity map as in Lemma 16, we can always have In addition, from the concrete form of , we have , and since , we further have where we use in . Thus, we obtain . Putting these into (38), we have In a similar way as in [14], using the Cauchy-Buniakowsky-Schwarz inequality, we have where is an absolute constant. By taking , we get inequality (17). The number of computation units is , and the number of free parameters is . This completes the proof of Theorem 9.

4. Conclusion and Future Work

This paper studies the approximation properties of structured MDCNNs with downsampling. The results show that, for functions from the ridge class and the Sobolev space , our proposed MDCNNs have better approximation performance than other relevant constructions, which to some extent explains why MDCNNs are successful in applications. Note, however, that our MDCNNs only consider signal inputs and convolutions of vectors; how does matrix or higher-order convolution behave? Is there some relationship between MDCNNs and FNNs? How do MDCNNs behave in terms of generalization and expressivity? We leave these interesting questions as future work.

Data Availability

Data sharing is not applicable to this article as no datasets were generated or analysed during the current study.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

This research was funded by the Fundamental Research Funds of China West Normal University (CN) (Grant No. 20B001), in part by the Sichuan Science and Technology Program (Grant No. 2023NSFSCO060), and in part by the Initiative Projects for Ph.D. in China West Normal University (Grant No. 22kE030).