Multivariate Regression with Neural Networks: Unique, Exact and Generic Models

Michael Nielsen provides a visual demonstration in his web book Neural Networks and Deep Learning that a neural network with a single hidden layer can approximate any function y = f(x). It is just a matter of how many neurons are used: the more neurons, the closer the approximation. The Universal Approximation Theorem supplies a rigorous proof of the same. But the known issues with overfitting remain, and the obtained network model is only good over the range of the training data. That is, if the training data consisted only of inputs with x_1 < x < x_2, there would be no reason to expect the obtained network model to work outside of that range.

This series of posts is about obtaining network models that are unique, generic, and exact. That is,

  • they predict the correct output (the exact part)
  • they generalize to all inputs irrespective of the data range used to train the model (the generic part)
  • they can be obtained from any initial guess of the weights and biases (the unique part)

Once the training is done, the exact functional relationship between the inputs and outputs is completely captured by the neural network. That is a very desirable outcome indeed. One could consider using such a neural network model as a replacement for that function in a computational framework. The function represented by the neural network model could be as simple as a rotation of the input vector, or a prescribed nonlinear transformation of it, for example. We could then envision pluggable modules of such simpler neural networks being combined to build more complex functions. I am not sure whether there are any computational benefits to doing so in the context of the examples in this blog, but the possibility exists for the right applications. We will explore this in upcoming posts in this series.

But back to the point – are there situations where a network trained with limited data generalizes exactly for any and all input data, and converges for any initial guess? The answer is yes.

  • When there is a linear relationship between the inputs and outputs, multiple neural net models (depending on the initial guess) can make exact predictions for any input.
  • Further, when we do not employ hidden layers there will be a unique model that the neural net will converge to, no matter the initial guess.

A practical application of this (perhaps – if training the neural net is less expensive than inverting a large dense matrix) is multivariate linear regression, for which we have a closed form solution to compare against. If we choose the sum of squared errors as the cost function for our neural net, the model obtained should be identical to this closed form solution. We can use this known unique solution to evaluate how efficiently our neural network algorithm converges to it from training data generated with the same model. We can study the convergence rate as a function of the learning rate, the cost function, the initial guess, the size of the training data and so on, since it is neat that we have a unique solution that the network should always converge to – if it is going to converge at all without running into numerical issues caused by too large a learning rate and the associated instabilities.
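
As a concrete illustration of that baseline, here is a minimal numpy sketch – with made-up coefficients, not the data behind the tables below – that computes the closed form least squares solution a no-hidden-layer network trained on the sum of squared errors should converge to:

import numpy as np

# A minimal sketch (not the code behind the results in this post): generate
# training data from a made-up linear model y = 2*x1 + 3*x2 + 1 and compute
# the closed form least squares solution that a no-hidden-layer network,
# trained with a sum of squared errors cost, should converge to.
rng = np.random.default_rng(0)
X_raw = rng.uniform(-1.0, 1.0, size=(100, 2))        # 100 measurements of x1, x2
y_hat = 2.0 * X_raw[:, 0] + 3.0 * X_raw[:, 1] + 1.0  # assumed "true" model

X = np.column_stack([np.ones(len(X_raw)), X_raw])    # prepend x0 = 1 for the bias
w_closed_form, *_ = np.linalg.lstsq(X, y_hat, rcond=None)
print(w_closed_form)                                 # approximately [1.0, 2.0, 3.0]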

We can further look at the stability of the obtained model to noise introduced in the generated data. Covering all of this makes for a very long post, so we focus on the basics and problem formulation in this one and leave the results and implementation to a subsequent post.

1. The Many Degrees of Freedom

Neural networks are characterized by having a large number of parameters – the many degrees of freedom. So it is natural to expect that many combinations of these parameters can explain the outputs, given the inputs. But this does not always have to be the case. In some situations, even when we have many more parameters than constraints, there is only one possible solution for those parameters. Let us look at how this can happen in the context of neural networks.

1.1 Multiple exact generic models

First, the case of multiple exact models, all of which are generically valid. That is, irrespective of the training data range used to obtain these models, they predict the exact output for any input. Consider the simple neural net in Figure 1 that uses one hidden layer with one neuron. We want to see if we can train this neural net to add any two inputs x_1 and x_2. The question is what model(s) it will come up with in its attempt to minimize the error/cost.

Figure 1. Multiple Exact Models are possible even in the linear case when we have a hidden layer

The input to a neuron in any non-input layer is taken to be a linearly weighted function of the outputs (i.e. activations) of the neurons in the previous layer. We use identity as the activation function here, so the output of a neuron is the same as the input it gets. With the notation b for the bias, w for the weights, z for the input, and a for the activation, the equations shown in Figure 1 follow directly. Requiring that the output activation a_3 be equal to y, i.e. x_1 + x_2, we get:

a_3 = w_3 \left( w_1 x_1 + w_2 x_2 + b_2 \right) + b_3 = x_1 + x_2

For the above equation to be true for all inputs x_1 and x_2, we would need:

w_3 w_1 = 1, \qquad w_3 w_2 = 1, \qquad w_3 b_2 + b_3 = 0

With 5 unknowns and 3 equations we have 2 degrees of freedom, so clearly we are going to get multiple solutions. Choosing b_2 and b_3 (both \neq 0) as the independent variables, we get:

w_1 = w_2 = -\frac{b_2}{b_3}, \qquad w_3 = -\frac{b_3}{b_2}

Table 1 shows results from the neural network of Figure 1 trained with identical data but with different initial values for w and b. Each run drives the cost (sum of the squared errors) to near zero, yet yields a different final model. We see that the converged model in each case closely obeys the above solution, so the model has generic validity for any and all inputs – not just the training data range.

             b2         b3         w1         w2         w3        -b2/b3     -b3/b2
Run 1
  Guess       0.04451   -0.88378   -0.86994    0.54678    0.54207
  Converged   2.04097   -1.90086    1.07373    1.07373    0.93135    1.07371    0.93133
Run 2
  Guess      -0.52279   -1.13530    2.10237    1.42069    0.47615
  Converged   4.60196   -1.90207    2.41951    2.41950    0.41332    2.41945    0.41331
Run 3
  Guess       1.43401    0.62823    0.05015   -0.61086    0.62546
  Converged   1.71184    1.94672   -0.87936   -0.87936   -1.13721   -0.87935   -1.13719
Table 1. Multiple Exact Generic Models. Different starting guesses for the biases and weights converge to different models, all of which exactly predict x_1 + x_2 for any x_1 and x_2. The converged solutions are seen to obey the above solution in all cases.
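
As a quick sanity check on the generic validity claimed in Table 1, the Run 1 converged values can be plugged into the Figure 1 model directly (the test inputs below are arbitrary and well outside any plausible training range):

# Figure 1 with identity activations computes a3 = w3*(w1*x1 + w2*x2 + b2) + b3.
# Plugging in the Run 1 converged values from Table 1 reproduces x1 + x2.
b2, b3 = 2.04097, -1.90086
w1, w2, w3 = 1.07373, 1.07373, 0.93135

for x1, x2 in [(0.5, -0.25), (100.0, 250.0), (-3000.0, 42.0)]:
    a3 = w3 * (w1 * x1 + w2 * x2 + b2) + b3
    print(x1 + x2, a3)   # a3 tracks x1 + x2; the small error reflects the 5-digit rounding in Table 1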

1.2 Unique exact generic model

Let us now remove the hidden layer so the neural network is as shown in Figure 2.

Figure 2. Unique Exact Model. The only possible solution is obtained for any initial guess

Requiring again that the output activation a_3 be equal to x_1 + x_2 we get:

a_3 = w_1 x_1 + w_2 x_2 + b_3 = x_1 + x_2

The only possible solution that works for all x_1 and x_2 is:

w_1 = w_2 = 1, \qquad b_3 = 0

This is unlike the situation when we used the hidden layer. Given that there is only one solution, the neural net has to obtain it if it is going to converge at all. Table 2 below bears out this result from simulating the above neural network with different initial guesses for b and w. Minimizing the cost function does in fact lead to the only possible solution in all cases.

             b3             w1         w2
Run 1
  Guess      -1.04436        0.00116    1.26640
  Converged  -1.53476e-6     0.99989    1.00003
Run 2
  Guess       1.38625       -2.32841    1.24359
  Converged  -5.04049e-6     0.99979    1.00003
Run 3
  Guess       1.04213        2.20225   -0.15782
  Converged   4.10138e-7     1.00010    0.99992
Table 2. Unique Exact Generic Model. The only possible solution w_1 = w_2 = 1.0 and b_3 = 0 is obtained in all cases.
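
For completeness, here is a minimal gradient descent sketch in the spirit of Table 2 (it is not the exact code behind the table, which comes in the follow-up post). The training inputs are deliberately confined to a narrow range, yet the converged model is the generic one:

import numpy as np

# Train the Figure 2 network (no hidden layer, identity activation) to add two
# numbers, using inputs drawn only from the narrow range [2, 3]. Gradient
# descent on the squared error cost still converges to the unique generic
# model w1 = w2 = 1, b3 = 0, which is exact for any inputs whatsoever.
rng = np.random.default_rng(1)
X = rng.uniform(2.0, 3.0, size=(200, 2))      # limited training range
y = X[:, 0] + X[:, 1]

w = rng.normal(size=2)                        # arbitrary initial guess
b = rng.normal()
lr = 0.05                                     # ad hoc learning rate

for _ in range(100_000):
    err = X @ w + b - y                       # a3 - y for each training sample
    w -= lr * (X.T @ err) / len(y)            # gradient step on the mean squared error
    b -= lr * err.mean()

print(w, b)                                   # approximately [1.0, 1.0] and 0.0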

2. Nonlinear Models

The requirement that the outputs be a linear function of the inputs for obtaining exact models is limiting. But we can accommodate the cases where the outputs can reasonably be approximated as polynomials in the inputs.

2.1 Single input and response

A simple example is a single output y that is a polynomial of order r in a single input x:

y = w_0 + w_1 x + w_2 x^2 + \cdots + w_r x^r

Given n measurements of x and \hat{y} we have, in matrix form,

\underline{y} = \underline{\underline{X}} \cdot \underline{w}, \qquad \text{where the } k^{th} \text{ row of the } n \times (r+1) \text{ matrix } \underline{\underline{X}} \text{ is } \left[ 1, x_k, x_k^2, \cdots, x_k^r \right]

A least squares estimate \underline{\hat{w}}, one that minimizes \left(\underline{y} - \underline{\hat{y}}\right)^T \cdot \left(\underline{y} - \underline{\hat{y}}\right) based on these measurements, is known:1

\underline{\hat{w}} = \left( \underline{\underline{X}}^T \underline{\underline{X}} \right)^{-1} \underline{\underline{X}}^T \cdot \underline{\hat{y}}
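
As a quick numerical illustration of that estimate, here is a sketch for a made-up cubic (r = 3, chosen purely for demonstration):

import numpy as np

# Sketch of the single-predictor closed form estimate for a made-up cubic
# y = 1 + 2x - x^2 + 0.5x^3 (r = 3). The design matrix has columns {1, x, x^2, x^3}.
rng = np.random.default_rng(2)
x = rng.uniform(-2.0, 2.0, size=50)
y_hat = 1.0 + 2.0 * x - x**2 + 0.5 * x**3

X = np.column_stack([x**j for j in range(4)])
w_hat, *_ = np.linalg.lstsq(X, y_hat, rcond=None)
print(w_hat)    # approximately [1.0, 2.0, -1.0, 0.5]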

2.2 Multiple inputs and responses

Extending the above to multiple inputs and outputs (the multivariate case) is straightforward. Say we have m outputs/responses and q actual inputs/predictors. Each measurement of a response has a form like the one above, but extended to include all q predictors. It is a polynomial of degree r in each predictor, so we will have qr + 1 coefficients in the equation. In compact matrix notation:

\underline{\underline{Y}} = \underline{\underline{X}} \cdot \underline{\underline{W}}

Appealing to the single response/input case in section 2.1, the following is easy to see about the above.

  • \underline{\underline{Y}} above is simply the m response vectors (each of length n, the number of measurements) stacked side-by-side.
  • The k^{th} row of \underline{\underline{Y}} represents the k^{th} measurement of all m responses and the j^{th} column of \underline{\underline{Y}} has the n measurements for the j^{th} response.
  • Each column of \underline{\underline{W}} has length qr + 1, the number of coefficients in the polynomial expression for the corresponding response.
  • The first column of \underline{\underline{X}}, \underline{X_0} \equiv \underline{1}, is the unit column vector. The remaining qr columns are formed from the actual q predictors, each contributing r columns. That is, each predictor z contributes r columns with values \left\{z, z^2, \cdots , z^r \right\}.

Given the actual measurements \underline{\underline{\widehat{Y}}}, the least squares estimate \underline{\underline{\widehat{W}}} is similar to the single response case:

\underline{\underline{\widehat{W}}} = \left( \underline{\underline{X}}^T \underline{\underline{X}} \right)^{-1} \underline{\underline{X}}^T \cdot \underline{\underline{\widehat{Y}}}
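
A minimal sketch of this multi-response estimate (the design matrix and the "true" coefficient matrix below are random stand-ins, assumed purely for illustration):

import numpy as np

# Multi-response closed form estimate: W_hat = (X^T X)^{-1} X^T Y_hat.
# lstsq solves for all m response columns of Y_hat at once.
rng = np.random.default_rng(3)
n, p, m = 200, 4, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # unit column X_0 plus p predictor columns
W_true = rng.normal(size=(p + 1, m))                         # made-up coefficients
Y_hat = X @ W_true                                           # noise-free "measurements"

W_hat, *_ = np.linalg.lstsq(X, Y_hat, rcond=None)
print(np.allclose(W_hat, W_true))                            # True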

2.3 The neural net

Now we are ready to build a neural net that will obtain the unique exact model representing a polynomial relationship between inputs and outputs. We have to use r - 1 extra inputs \left\{ x_1^2, \cdots , x_1^r\right\} for each actual input measurement x_1, as we are targeting an r^{th} degree polynomial for the outputs in each predictor variable. This is the price we have to pay to make the outputs a linear function of the inputs, so that we can use our hidden-layer-free neural network to obtain the unique exact model.

For ease of notation, we will henceforth simply use the symbol p for the number of predictors instead of qr. The net will then have p + 1 input neurons (with input x_0 \equiv 1), m output neurons, and no hidden layers; it employs linear input summation and identity as the activation function, as shown in Figure 3.
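
Here is one way the input expansion could look in code (the helper name expand_inputs and the column ordering are illustrative choices, not taken from this post's implementation):

import numpy as np

def expand_inputs(Z, r):
    """Map q raw predictors (the columns of Z, shape (n, q)) to the p = q*r
    polynomial columns plus the leading x0 = 1 column that feed the Figure 3
    network. Helper name and column ordering are illustrative."""
    n, q = Z.shape
    cols = [np.ones(n)]                       # x0 = 1
    for j in range(q):                        # each predictor contributes ...
        for k in range(1, r + 1):             # ... the columns {z, z^2, ..., z^r}
            cols.append(Z[:, j] ** k)
    return np.column_stack(cols)              # shape (n, q*r + 1)

# Example: q = 2 predictors, r = 3 gives 2*3 + 1 = 7 input neurons including x0
Z = np.array([[0.5, 2.0],
              [1.0, 3.0]])
print(expand_inputs(Z, 3).shape)              # (2, 7)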

Figure 3. Unique and exact polynomial representation with a neural net model

The notation employed is as per Michael Nielsen’s web book Neural Networks and Deep Learning.

Using the sum of squares of differences at the output layer as the cost function, with x_j^{(k)} and \hat{y}_i^{(k)} denoting the k^{th} measurement of the inputs and responses, we have:

C = \sum_k \sum_{i=1}^{m} \left( a_i^{(k)} - \hat{y}_i^{(k)} \right)^2, \qquad a_i^{(k)} = \sum_{j=0}^{p} W_{ij} \, x_j^{(k)}

\frac{\partial C}{\partial W_{ij}} = 2 \sum_k \left( a_i^{(k)} - \hat{y}_i^{(k)} \right) x_j^{(k)}, \qquad \frac{\partial^2 C}{\partial W_{ij} \, \partial W_{il}} = 2 \sum_k x_j^{(k)} x_l^{(k)}

It follows from the second derivative above that the cost function is convex in \left\{W_{ij}\right\} for all input data \underline{x}: the matrix \sum_k x_j^{(k)} x_l^{(k)} is positive semi-definite no matter what the inputs are. So we are going to march towards a model achieving the global minimum no matter what training data we use.
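
The convexity claim can also be checked numerically with a small sketch (the inputs below are arbitrary stand-ins):

import numpy as np

# For the Figure 3 network the second derivative of the cost with respect to
# the weights feeding any one output is, up to a constant factor, X^T X.
# That matrix is positive semi-definite for any input data, so the cost is
# convex and every stationary point is a global minimum.
rng = np.random.default_rng(4)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 5))])   # arbitrary input data
H = X.T @ X
print(np.all(np.linalg.eigvalsh(H) >= -1e-10))                   # True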

3. Conclusions

We have gone over some of the basics of the problem setup with neural networks for obtaining unique, exact, and generic target models. Building and training the network, code snippets, simulations, convergence, stability and so on would make this post too long, so they will be covered in an upcoming post in this series.

  1. Applied Linear Statistical Models by Neter, Kutner, Nachtsheim and Wasserman
