The Conjugate Residual Method for Constrained Minimization Problems

The method of conjugate gradients, a particular version of the general class of conjugate direction methods, was originally developed for solving a linear system of equations 
$$ Ax = b,$$ 
where $A$ is an $n \times n$ symmetric, positive definite matrix, $x$ is an unknown $n$-vector and $b$ is a fixed $n$-vector. The conjugate gradient method can be viewed as a descent procedure for the problem: Minimize
$$(x , Ax) - 2(x, b);$$ 
and through this viewpoint the method can be generalized so as to apply to the solution of general unconstrained minimization problems. For these general problems conjugate gradient methods are among the most effective first order descent procedures.

1. Introduction. The method of conjugate gradients [1]-[5], a particular version of the general class of conjugate direction methods, was originally developed for solving a linear system of equations
$$(1.1)\qquad Ax = b,$$
where $A$ is an $n \times n$ symmetric, positive definite matrix, $x$ is an unknown $n$-vector and $b$ is a fixed $n$-vector. The conjugate gradient method can be viewed as a descent procedure for the problem: Minimize
$$(x, Ax) - 2(x, b);$$
and through this viewpoint the method can be generalized so as to apply to the solution of general unconstrained minimization problems. For these general problems conjugate gradient methods are among the most effective first order descent procedures.

Minimization problems subject to constraints cannot be treated directly by the conjugate gradient methods, although they have often been employed indirectly through the artifice of introducing penalty functions or of suitably projecting the descent directions onto the constraint surface. The major motivation for the research presented in this paper is to develop a conjugate direction procedure that is capable of handling constraints directly. This quickly leads one to consider equations of the form (1.1) with the matrix $A$ symmetric and nonsingular but not necessarily positive definite. For, consider the simple problem: Minimize
$$(1.2)\qquad (u, Qu) - 2(u, c)$$
subject to
$$(1.3)\qquad Bu = d,$$
where $Q$ is an $n \times n$ symmetric, positive definite matrix and $B$ is an $m \times n$ matrix of rank $m \le n$. The $n$-vector $u$ is unknown and $c$ and $d$ are fixed $n$- and $m$-vectors respectively. By introducing an $m$-dimensional Lagrange multiplier $\lambda$ this problem is found to be equivalent to that of solving the linear vector equation
$$(1.4)\qquad \begin{bmatrix} Q & B^{T} \\ B & 0 \end{bmatrix} \begin{bmatrix} u \\ \lambda \end{bmatrix} = \begin{bmatrix} c \\ d \end{bmatrix}.$$
Denoting the $(n + m) \times (n + m)$ matrix in (1.4) by $A$ we see that $A$ is symmetric, nonsingular, but not positive definite. Of course, (1.1) can be modified to the equivalent form
$$(1.5)\qquad A^{2}x = Ab,$$
where now $A^{2}$ is positive definite, but this stratagem requires roughly twice the computational effort of methods which could be applied directly if $A$ itself were positive definite [3].
In this paper we discuss methods requiring essentially the same computational effort as those employed if $A$ is positive definite. An alternative approach to this problem, which salvages the standard conjugate gradient algorithm, is considered in [6].
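The character of the system (1.4) is easy to observe numerically. The following sketch (the data $Q$, $B$ are invented purely for illustration) assembles the Lagrangian block matrix for a small problem of the form (1.2)-(1.3) and confirms that it is symmetric and nonsingular yet has eigenvalues of both signs.

```python
import numpy as np

# Hypothetical data for: minimize (u, Qu) - 2(u, c) subject to Bu = d
Q = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive definite, n = 2
B = np.array([[1.0, 1.0]])               # rank m = 1 <= n

# Assemble the (n + m) x (n + m) matrix of (1.4)
n, m = Q.shape[0], B.shape[0]
A = np.block([[Q, B.T], [B, np.zeros((m, m))]])

eigs = np.linalg.eigvalsh(A)
print(np.allclose(A, A.T))             # symmetric: True
print(abs(np.linalg.det(A)) > 1e-12)   # nonsingular: True
print(eigs.min() < 0 < eigs.max())     # indefinite: True
```

With a positive definite $Q$ and full-rank $B$ the saddle-point matrix always acquires $m$ negative eigenvalues, which is exactly the situation the conjugate residual method is designed to handle.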

2. The method of conjugate residuals.
DEFINITION. Given an $n \times n$ matrix $A$, a pair of $n$-vectors $x$, $y$ is said to be $A$-orthogonal if $(x, Ay) = 0$.
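As a small numerical illustration of the definition (the matrix and vectors are invented): with a symmetric $A$ the relation is symmetric in $x$ and $y$.

```python
import numpy as np

A = np.array([[2.0, 0.0], [0.0, -1.0]])  # symmetric (here indefinite), illustrative
x = np.array([1.0, 2.0])
y = np.array([1.0, 1.0])

print(np.dot(x, A @ y))  # (x, Ay) = 2*1*1 + (-1)*2*1 = 0: x and y are A-orthogonal
print(np.dot(y, A @ x))  # symmetry of A makes A-orthogonality symmetric in x, y
```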
We now develop an iterative technique for solving
$$(2.1)\qquad Ax = b$$
which has the same important characteristics as the method of conjugate gradients but which in addition is applicable in the case where $A$ is symmetric but not necessarily positive definite. The procedure is a descent procedure in that a sequence of approximate solutions $x_1, x_2, \ldots, x_k$ is generated with the property that the objective functional $E(x) = \|Ax - b\|^{2}$ decreases monotonically. Furthermore, as is common with descent procedures, the approximation $x_{k+1}$ is generated from $x_k$ by moving from $x_k$ in a certain direction $p_k$ to a minimum point of $E$ in this direction. In other words, $x_{k+1} = x_k + a_k p_k$, where $a_k$ is chosen to minimize $E$. The novel feature of the method given below is of course the method for selecting the direction vectors $p_k$.
In the procedure the direction vectors are chosen to be an $A^{2}$-orthogonalized version of the residuals $r_k = b - Ax_k$. This is similar to the method of conjugate gradients, but now, since the gradient of the underlying functional $E$ at $x_k$ is $-2Ar_k$ and not $-2r_k$, we refer to the scheme as conjugate residuals.
Here is the fundamental iteration scheme.

Conjugate residual scheme. Select $x_1$ arbitrarily. Set $p_1 = r_1 = b - Ax_1$, and repeat the following steps, omitting (2.3a, b) on the first step.
If $a_{k-1} \neq 0$:

We prove below that the process defined by (2.3) converges to the unique solution of (2.1) in $n$ steps or less. Furthermore, the residuals $r_1, r_2, \ldots$ form an $A$-orthogonal sequence and the direction vectors $p_1, p_2, \ldots$ form an $A^{2}$-orthogonal sequence.
If the matrix $A$ is positive definite, then for each $k$, $a_k > 0$ and hence (2.3b) is never used. In this case the procedure can be regarded as one of the family of general conjugate gradient methods discussed by Hestenes [2]. However, even in the positive definite case this method has not been explicitly pointed out, and in terms of total computation it is as efficient as all standard procedures.

DEFINITION. Corresponding to the symmetric matrix $A$, a vector $x$ is said to be singular if $(x, Ax) = 0$.
If $A$ is not positive definite, $a_k$ may be positive, negative or zero. The condition $a_k = 0$ corresponds, as shown below, to singularity of the residual $r_k$.
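A tiny numerical example of this phenomenon (the matrices are invented for illustration): for an indefinite $A$ a nonzero vector can satisfy $(r, Ar) = 0$, whereas for a positive definite matrix this is impossible for $r \neq \theta$.

```python
import numpy as np

A = np.array([[1.0, 0.0], [0.0, -1.0]])  # symmetric, nonsingular, indefinite
r = np.array([1.0, 1.0])
print(np.dot(r, A @ r))     # (r, Ar) = 1 - 1 = 0: r is singular although r != 0

P = np.array([[2.0, 0.0], [0.0, 1.0]])   # positive definite case for contrast
print(np.dot(r, P @ r) > 0)              # True: no nonzero singular vectors
```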
Proof. The proof is by induction. The vectors $r_1$, $p_1 = r_1$ and $r_2$ satisfy these relations since: (a) If $r_1$ is singular, $r_2 = r_1 = p_1$ and $(r_1, Ar_1) = 0$.
Now assume that these relations are true for the first $k - 1$ steps, i.e., through $r_k$, $p_{k-1}$, $a_{k-1}$, $b_{k-1}$. There are now two distinct cases of interest: $a_{k-1} \neq 0$ and $a_{k-1} = 0$. We must show that $r_{k+1}$ and $p_k$ can be added to these relations in either case. For $i < k$ the right side vanishes by (b) and (a). For $i = k$ the two terms on the right cancel.
To prove (c) we have obviously, for $j \geq 1$, and hence, setting $j = k$ and using (2.3a), we have, in view of the induction hypothesis (c) and $a_{k-1} \neq 0$, that $Ap_{k-1} \in [p_1, p_2, \ldots, p_k]$.

To prove (d) we write: For $i < k$ the first term on the right vanishes by the induction hypothesis. The second term also vanishes for $i < k$ by the second part of (c) and (a). For $i = k$ the right side becomes, by (e), But this vanishes because $(p_k, A^{2}p_k) = (r_k - b_k p_{k-1}, A^{2}p_k) = (r_k, A^{2}p_k)$.
To prove (g) suppose $p_k = \theta$. Then by (e) and (b) this implies $b_k = 0$ and hence $r_k = \theta$.
Case 2. $a_{k-1} = 0$. To prove (f) we note that, by (2.3b), the second two terms on the right are zero by the hypothesis (b), and hence if $r_k \neq \theta$ we have $a_k \neq 0$. This proves that there cannot be generated two consecutive singular points in the procedure.
To prove (d) we write:

Proof. This follows immediately from the projection theorem and the fact that $r_{k+1}$ is $A$-orthogonal to the subspace $[p_1, p_2, \ldots, p_k]$.
THEOREM 3. The method of conjugate residuals converges in $n$ steps or less to the unique solution of $Ax = b$.
Proof. From part (g) of Theorem 1 we see that the sequence $p_1, p_2, \ldots$ does not terminate until a solution is obtained. Furthermore, by the $A^{2}$-orthogonality of the $p_k$'s this sequence is linearly independent. The result then follows from Theorem 2.
To summarize the results derived in this section, the method of conjugate residuals progresses in a fashion entirely analogous to that of conjugate gradients by generating $A^{2}$-orthogonal direction vectors and minimizing $\|Ax - b\|^{2}$ over increasing subspaces. If at some stage the residual $r_k$ should be singular, then the next direction vector $p_k$ is equal to $r_k$ and the functional $E$ cannot be reduced by moving in that direction, since $x_k$ is already at the minimum along that line. The next direction $p_{k+1}$ is then generated by a special formula and this new direction will not be singular. Thus singular directions occur only as isolated members of the sequence $p_1, p_2, \ldots, p_n$. The process converges within $n$ steps.
3. Computational requirements. An important feature of the method of conjugate residuals, which sets it apart from other schemes for handling the nonpositive definite case, is that only a single multiplication of a vector by a matrix is required at each step. To see this, suppose that $p_{k-1}$, $Ap_{k-1}$, $p_k$, $Ap_k$, $r_k$ have been stored. Then we have the following.
Step 1a. Calculate $Ar_k$. This requires one matrix multiplication.
Case 2. $a_{k-1} = 0$. In this case, $r_{k-1}$ is singular and it follows from part (e) of Theorem 1 that $b_k = 0$ and hence $p_k = r_k$. Also, since $a_{k-1} = 0$, we have $r_k = r_{k-1}$. Thus $Ar_k = Ap_{k-1}$ is on hand.
Step 2a. Calculate $A^{2}r_k = A(Ap_{k-1})$. This requires one multiplication of a vector by a matrix.
Step 2d. Calculate $r_{k+1} = r_k - a_k Ap_k$. Note that the storage requirements are really not as large as indicated, since $p_{k-1}$, $Ap_{k-1}$ are only needed in the singular case, and then $p_{k-1} = r_{k-1} = r_k$, so extra storage is available.
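The single-multiplication property can be sketched in code. The recurrence below is the standard textbook conjugate-residual form for symmetric systems, written so that $Ar_k$ is the only matrix-vector product per step and $Ap_k$ is maintained by the same update that produces $p_k$; it covers only the nonsingular case, and the coefficient formulas here are the classical ones rather than a transcription of (2.3). All data are invented for illustration.

```python
import numpy as np

def conjugate_residual(A, b, x0, tol=1e-10, max_iter=None):
    """Sketch of the conjugate residual method (nonsingular case only).

    One matrix-vector product per iteration: A@r is computed once and
    A@p is maintained by the same recurrence that updates p itself.
    """
    max_iter = max_iter or len(b)
    x = x0.astype(float).copy()
    r = b - A @ x
    p = r.copy()
    Ar = A @ r
    Ap = Ar.copy()
    rAr = r @ Ar
    for _ in range(max_iter):
        denom = Ap @ Ap
        if abs(rAr) < tol or denom < tol:   # singular residual or converged
            break
        a = rAr / denom
        x += a * p
        r -= a * Ap
        Ar = A @ r                          # the single matvec of this step
        rAr_new = r @ Ar
        beta = rAr_new / rAr
        p = r + beta * p
        Ap = Ar + beta * Ap                 # keeps A@p without a second matvec
        rAr = rAr_new
    return x

# Symmetric, nonsingular, indefinite example (no singular residuals arise):
A = np.array([[3.0, 1.0], [1.0, -2.0]])
b = np.array([1.0, 1.0])
x = conjugate_residual(A, b, np.zeros(2))
print(np.allclose(A @ x, b))  # True: exact solution within n = 2 steps
```

Note that the early-exit test on $(r_k, Ar_k)$ is where the singular-case branch (2.3b) of the paper would take over; this sketch simply stops there.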

4. A triangulation method.
In this section we show that the nonsingular part of the scheme of conjugate residuals can be implemented by performing only a sequence of line searches (i.e., one-dimensional minimization problems) and residual evaluations, without explicitly computing inner products. Although this procedure is of little value for solving a linear equation, its obvious generalization to nonlinear equations is of substantial importance.
The method is more or less a straightforward extension of the method of parallel tangents (PARTAN), which implements the method of conjugate gradients when $A$ is positive definite [7], [8]. The short derivation of the method given here is of some independent interest in view of the lengthy standard derivations. Our derivation exploits the results of Theorem 1.
Triangulation procedure. Given an arbitrary $x_1$, let $p_1 = r_1$ and determine $x_2$ by minimizing $E$ along $p_1$. After that the process is defined by induction. Having determined $x_2, x_3, \ldots, x_k$, find the point $y_k$ that minimizes $E$ along the direction $r_k$ from $x_k$, and then obtain $x_{k+1}$ by minimizing $E$ along the line through $x_{k-1}$ and $y_k$.
We shall show that the triangulation procedure described above is equivalent to the method of conjugate residuals in the nonsingular case. If during the process a residual $r_k$ should become singular, the process breaks down and recovery must be made in some other fashion.

THEOREM 4. The method of triangulation, in the nonsingular case, is equivalent to the method of conjugate residuals.
Proof. The proof is by induction. The statement is obviously true for the first step. Suppose it is true for $x_1, x_2, \ldots, x_k$. For simplicity and without loss of generality we assume that $x_1 = \theta$. According to the method of conjugate residuals the point $x_{k+1}$ is chosen so as to minimize the functional $E$ over the subspace $[p_1, p_2, \ldots, p_{k-1}, r_k]$. Furthermore, by the orthogonality relations of the method this is equivalent to minimizing $E$ over the subspace determined by $p_{k-1}$ and $r_k$. This in turn is equivalent to finding a point $x_{k+1}$ in that plane such that the residual there, $r_{k+1} = b - Ax_{k+1}$, is $A$-orthogonal to both $r_k$ and $p_{k-1}$.
Consider the points $x_{k-1}$ and $y_k$. The residuals at these points, $r_{k-1}$ and $b - Ay_k$, are both $A$-orthogonal to $r_k$: the first by the induction hypothesis and part (d) of Theorem 1, the second because $y_k$ is the minimizing point of $E$ in the direction $r_k$. Therefore, by linearity, the residual at any point along the line through $x_{k-1}$ and $y_k$ is $A$-orthogonal to $r_k$.
It is therefore only necessary to find a point on this line where the residual is $A$-orthogonal to $p_{k-1}$. This clearly can be accomplished by minimizing $E$ along the line.
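The triangulation steps can be sketched as follows. This is a reconstruction under the assumptions stated in the procedure above (nonsingular case), using only residual evaluations and one-dimensional minimizations of $E(x) = \|Ax - b\|^{2}$; since $E$ is quadratic along any line, each line search is performed exactly by three-point quadratic interpolation, so no inner products with $A$ are formed explicitly. The test data are invented.

```python
import numpy as np

def line_min(E, x, d):
    """Minimize the quadratic t -> E(x + t*d) using three evaluations only."""
    e0, ep, em = E(x), E(x + d), E(x - d)
    t = (em - ep) / (2.0 * (ep + em - 2.0 * e0))
    return x + t * d

def triangulation(A, b, x1, n_steps):
    """Sketch of the triangulation procedure (nonsingular case): only
    residual evaluations and line searches of E(x) = ||Ax - b||^2."""
    E = lambda x: np.sum((A @ x - b) ** 2)
    resid = lambda x: b - A @ x
    x_prev = x1
    x = line_min(E, x1, resid(x1))          # x2: minimize E along p1 = r1
    for _ in range(n_steps - 1):
        y = line_min(E, x, resid(x))        # y_k: minimize E along r_k from x_k
        x_prev, x = x, line_min(E, x_prev, y - x_prev)  # line x_{k-1} -- y_k
    return x

A = np.array([[3.0, 1.0], [1.0, -2.0]])     # symmetric, indefinite, illustrative
b = np.array([1.0, 1.0])
x = triangulation(A, b, np.zeros(2), 2)
print(np.allclose(A @ x, b))                # True: n = 2 steps suffice
```

On this example the iterates agree with those of the conjugate residual scheme, as Theorem 4 asserts for the nonsingular case.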

5. Application to constrained minimization problems and nonlinear equations.
Consider the nonlinear equation
$$(5.1)\qquad F(x) = \theta,$$
where $F$ is a mapping from $E^n$ into $E^n$ and $\theta$ is the zero vector in $E^n$. We restrict attention to the case where the matrix of first partial derivatives $F'(x)$ is symmetric. As a primary motivating example consider the constrained minimization problem: Minimize
$$(5.2)\qquad f(u) \quad \text{subject to} \quad H(u) = \theta,$$
where $f$ is a nonlinear functional on $E^m$ and $H$ is a nonlinear mapping from $E^m$ to $E^p$, where $p \le m$. If $u_0$ is the solution of (5.2) and the rank of $H'(u_0)$ is equal to $p$, then, as is well known, there is a $\lambda_0 \in E^p$ such that $(u_0, \lambda_0)$ is a solution of the simultaneous equations
$$(5.3)\qquad \nabla f(u) + H'(u)^{T}\lambda = \theta, \qquad H(u) = \theta.$$
The system (5.3) is of the type (5.1) with $n = m + p$.
The procedures developed here also have application to minimization problems involving inequality constraints, but this topic is deferred to another paper. Solution of (5.1) is, of course, equivalent to minimizing the functional $\phi(x) = \|F(x)\|$, and indeed this is the approach taken here. Standard methods of minimization, however, require the direction vectors to be selected on the basis of the gradient. In the case of problem (5.2) this in turn would require knowledge of second derivatives of $f$ and $H$. By employing the philosophy of the conjugate residual method, the descent directions are based on the residuals $F(x)$.
We shall briefly discuss three ways in which the conjugate residual scheme can be generalized to nonlinear problems. These are identical in spirit to the standard methods for adapting the conjugate gradient procedure to nonlinear problems, and hence we shall not pursue the details.
A. If one is willing to assume the availability of $F'(x)$, then the local approximation $A \approx F'(x)$ can be used in the formulas (2.3). The conjugate gradient analogue of this procedure has been developed by Daniel [9], [10].
B. Multiplication of a vector by the matrix $A$ can be approximated by the difference of two residuals by the formula
$$Av \approx F(x + v) - F(x),$$
which is exact in the linear case (the step size can of course be adjusted for best results). Since, in the linear case, the conjugate residual scheme can be implemented with a single $A$ multiplication, so in the nonlinear case it can be approximated by taking a single extra step. We give here the procedure corresponding to Case 1 of § 3, the other case being a direct analogue. Assume $x_k$, $p_{k-1}$, $q_{k-1}$, $r_k$ are at hand, where $p_{k-1}$ is the actual direction vector used to get $x_k$, $q_{k-1}$ is an artificial vector used to approximate $Ap_{k-1}$, and $r_k = F(x_k)$. Then: Calculate an approximation $s_k$ to $Ar_k$ by
$$(5.4a)\qquad s_k = F(x_k + r_k) - F(x_k).$$

A more refined procedure would be to calculate $p_k$ from (5.4a) and (5.4b) and then search along the direction $p_k$ for a point minimizing $\|F(x)\|^{2}$, initiating the search at the point predicted by (5.4c). For details on such line searching techniques and the analogue of this procedure for conjugate gradients, see [11].
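The difference approximation underlying idea B can be checked directly. In the sketch below (all data invented), $F$ is taken affine, $F(x) = Ax - b$, so $F'(x) = A$ everywhere and the residual difference (5.4a) reproduces $Ar_k$ exactly, without ever forming a matrix-vector product.

```python
import numpy as np

# Affine test map: F(x) = Ax - b, so F'(x) = A everywhere (illustrative data).
A = np.array([[3.0, 1.0], [1.0, -2.0]])
b = np.array([1.0, 1.0])
F = lambda x: A @ x - b

x_k = np.array([0.5, -0.5])
r_k = F(x_k)                         # residual r_k = F(x_k)
s_k = F(x_k + r_k) - F(x_k)          # (5.4a): approximates A @ r_k
print(np.allclose(s_k, A @ r_k))     # exact in the affine case: True

# In the genuinely nonlinear case a step size h would be introduced,
#   s_k = (F(x_k + h * r_k) - F(x_k)) / h,
# with h "adjusted for best results" as the text notes.
```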
C. The method of triangulation can be employed directly, since it simply entails evaluation of the residual and line searching. There is a slight difficulty if a singular point is encountered. In this case a good procedure is probably to take a gradient step and then start again. As in B above, the gradient can be approximated by $F(x_k + r_k) - F(x_k)$.
There are a few general remarks that apply to the application of any of the above schemes when applied to nonlinear equations. The first consideration is the necessity for occasionally restarting the process. The methods of conjugate gradients and conjugate residuals that we have discussed are not self-correcting. Specifically, if the initial direction vector should for some reason be in error, the usual conjugate properties of the subsequently generated direction vectors will not hold and the process may not terminate in a finite number of steps. This implies, in particular, that even if a nonlinear equation is exactly linear in a region containing the solution, any of the nonlinear versions of conjugate residuals (or conjugate gradients) may not converge in a finite number of steps. For this reason, it is good practice to restart these procedures every $n$ steps or so. In view of this comment, the ploy of taking a gradient step and restarting when encountering a singular point does not seem unattractive.
The second comment relates to the detection of singular points. In a nonlinear problem, a singular direction (or a nearly singular direction) is characterized by a halt or near stoppage of the descent process. In other words, if a singular point is approached, the values of the objective functional $\|F(x)\|$ will converge to some positive value. However, since the desired value of this objective functional is known to be zero, such a singularity is easily identified, and the singularity procedure or a gradient step can be employed to move away and continue the descent.
Case 1. $a_{k-1} \neq 0$. To prove (e) we have immediately, by the induction hypothesis (b). Also, whence, by (d), which gives the formula for $b_k$. To prove (a) we write: For $i < k - 1$ the first term on the right vanishes because of the induction hypotheses (c) and (b) and the second term vanishes by (a). For $i = k - 1$ the two terms cancel. To prove (b) we write
$$(r_{k+1}, Ap_i) = (r_k, Ap_i) - a_k(Ap_k, Ap_i).$$
the terms on the right cancel. For $i < k - 2$ we have by the induction hypothesis on (c) that $A^{2}p_i \in [p_1, p_2, \ldots, p_{k-1}]$ and hence by (b) the first term vanishes. The other two terms vanish by (a). The proof of (b) is the same as for Case 1.

For $i < k$, this vanishes by the same argument as Case 1. For $i = k$ we note that $r_k = r_{k-1} = p_{k-1}$ and hence the first term vanishes by the induction hypothesis on (d) and the second term vanishes by (a). Part (g) follows immediately from (2.5). This completes the proof of Theorem 1.

We now state the fundamental minimization property of the method of conjugate residuals.

THEOREM 2. For each $k$ the point $x_{k+1}$ minimizes the functional $E(x) = \|Ax - b\|^{2}$ over the linear variety $x_1 + [p_1, p_2, \ldots, p_k]$.