MATHEMATICAL ENGINEERING TECHNICAL REPORTS

Roundoff Error Analysis of the CholeskyQR2 Algorithm

Yusaku YAMAMOTO, Yuji NAKATSUKASA, Yuka YANAGISAWA and Takeshi FUKAYA

METR 2014–36 December 2014

DEPARTMENT OF MATHEMATICAL INFORMATICS
GRADUATE SCHOOL OF INFORMATION SCIENCE AND TECHNOLOGY

THE UNIVERSITY OF TOKYO
BUNKYO-KU, TOKYO 113-8656, JAPAN

WWW page: http://www.keisu.t.u-tokyo.ac.jp/research/techrep/index.html


The METR technical reports are published as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author’s copyright. These works may not be reposted without the explicit permission of the copyright holder.


ROUNDOFF ERROR ANALYSIS OF THE CHOLESKYQR2 ALGORITHM

YUSAKU YAMAMOTO∗, YUJI NAKATSUKASA†, YUKA YANAGISAWA‡, AND TAKESHI FUKAYA§

∗The University of Electro-Communications, Tokyo, Japan / JST CREST, Tokyo, Japan ([email protected]).
†Department of Mathematical Informatics, Graduate School of Information Science and Technology, University of Tokyo, Tokyo 113-8656, Japan ([email protected]).
‡Waseda University, Tokyo, Japan.
§RIKEN Advanced Institute for Computational Science, Kobe, Japan / JST CREST, Tokyo, Japan.

Abstract. We consider the QR decomposition of an m × n matrix X with full column rank, where m ≥ n. Among the many algorithms available, the Cholesky QR algorithm is ideal from the viewpoint of high performance computing, since it consists entirely of standard level 3 BLAS operations with large matrix sizes and requires only one allreduce and broadcast in parallel environments. Unfortunately, it is well known that the algorithm is not numerically stable and the deviation from orthogonality of the computed Q factor is of O((κ2(X))2u), where κ2(X) is the 2-norm condition number of X and u is the unit roundoff. In this paper, we show that if the condition number of X is not too large, we can improve the stability greatly by iterating the Cholesky QR algorithm twice. More specifically, if κ2(X) is at most O(u−1/2), both the residual and deviation from orthogonality are shown to be of O(u). Numerical results support our theoretical analysis.

Key words. QR decomposition, Cholesky QR, communication-avoiding algorithms, roundoff error analysis

AMS subject classifications. 15A23, 65F25, 65G50

1. Introduction. Let X ∈ Rm×n be an m by n matrix with m ≥ n of full column rank. We consider computing its QR decomposition, X = QR, where Q ∈ Rm×n has orthonormal columns and R ∈ Rn×n is upper triangular. This is one of the most fundamental matrix decompositions and is used in various scientific computations. Examples include linear least squares, preprocessing for the singular value decomposition of a rectangular matrix [10], and orthogonalization of vectors arising in block Krylov methods [2, 17] or electronic structure calculations [3, 22]. Often, the matrix size is very large, so an algorithm suited for modern high performance computers is desired.

One important feature of modern high performance architectures is that communication is much slower than arithmetic. Here, communication refers to both data transfer between processors or nodes and data movement between memory hierarchies. Thus it is essential for higher performance to minimize the frequency and amount of these communications [1]. To minimize interprocessor communications, the algorithm has to have large grain parallelism. To minimize data movement between memory hierarchies, it is effective to reorganize the algorithm to use level 3 BLAS operations as much as possible [10]. Of course, the benefit of using level 3 BLAS operations increases as the size of the matrices used there becomes larger.

Conventionally, three major algorithms have been used to compute the QR decomposition: the Householder QR algorithm, the classical Gram-Schmidt (CGS) algorithm and the modified Gram-Schmidt (MGS) algorithm. The Householder QR algorithm is widely used due to its excellent numerical stability [11]. MGS, which is less stable, is often preferred when the Q factor is needed explicitly, because it requires only half as much work as the Householder QR in that case.




When the matrix X is well conditioned, CGS is also sometimes used since it provides more parallelism. Note that for matrices with the 2-norm condition number κ2(X) at most O(u−1), where u is the unit roundoff, repeating CGS or MGS twice leads to algorithms that are as stable as Householder QR [9]. They are known as CGS2 and MGS2, respectively.

For each of these algorithms, variants that can better exploit modern high performance architectures have been developed. There are block versions and recursive versions of Householder QR [6, 18], MGS [13] and CGS [12] that can perform most of the computations in the form of level 3 BLAS. There is also a variant of Householder QR called the tall-and-skinny QR (TSQR) [5], which has large grain parallelism and requires only one allreduce and broadcast in a distributed environment.

While these variants have been quite successful, they are not completely satisfactory from the viewpoint of high performance computing. In the block and recursive versions mentioned above, the sizes of the matrices appearing in the level 3 BLAS are generally smaller than that of X, and become even smaller as the level goes down in the case of recursive algorithms. For the TSQR algorithm, though only one allreduce is required throughout the algorithm, the reduction operation is a non-standard one, which corresponds to computing the QR decomposition of a 2n × n matrix formed by concatenating two upper triangular matrices [5]. Thus each reduction step requires O(n3) work and this tends to become a bottleneck in parallel environments [7]. In addition, the TSQR algorithm requires non-standard level 3 BLAS operations such as multiplication of two triangular matrices [5], for which no optimized routines are available on most machines.

There is another algorithm for the QR decomposition, namely, the Cholesky QR algorithm. In this algorithm, one first forms the Gram matrix A = X>X, computes its Cholesky factorization A = R>R, and then finds the Q factor by Q = XR−1. This algorithm is ideal from the viewpoint of high performance computing because (1) its computational cost is 2mn2 (in the case where m ≫ n), which is equivalent to the cost of CGS and MGS and half that of Householder QR, (2) it consists entirely of standard level 3 BLAS operations, (3) the first and third steps are highly parallel large-size level 3 BLAS operations in which two of the three matrices are of size m × n, (4) the second step, which is the only sequential part, requires only O(n3) work, as opposed to the O(mn2) work in the first and the third steps, and (5) it requires only one allreduce and one broadcast if X is partitioned horizontally. Unfortunately, it is well known that Cholesky QR is not stable. In fact, the deviation from orthogonality of the Q factor computed by Cholesky QR is proportional to (κ2(X))2 [19]. Accordingly, standard textbooks like [21] describe the method as “quite unstable and is to be avoided unless we know a priori that R is well conditioned”.

In this paper, we show that the Cholesky QR algorithm can be applied to matrices with a larger condition number to give a stable QR factorization if it is repeated twice. More specifically, we show that if κ2(X) is at most O(u−1/2), the Q and R factors obtained by applying Cholesky QR twice satisfy ‖Q>Q − I‖F = O(u) and ‖X − QR‖F = O(u). Furthermore, we give the coefficients of u in these bounds explicitly as simple low degree polynomials in m and n. In the following, we call this method CholeskyQR2. Of course, the arithmetic cost of CholeskyQR2 is twice that of Cholesky QR, CGS and MGS, but it is equivalent to the cost of Householder QR, CGS2 and MGS2. Given the advantages stated above, the increase in the computational work might be more than compensated in some cases. Hence, for matrices with κ2(X) ∼ O(u−1/2), CholeskyQR2 can be the method of choice in terms of both


numerical stability and efficiency on high performance architectures.

The idea of performing the QR decomposition twice to get better stability is not new. In his textbook [15], Parlett analyses Gram-Schmidt orthogonalization of two vectors and introduces the principle of “twice is enough”, which he attributes to Kahan. There is also a classical paper by Daniel, Gragg, Kaufman and Stewart [4], which deals with the effect of reorthogonalization on the update of the Gram-Schmidt QR decomposition. More recently, Giraud et al. performed a detailed error analysis of CGS2 and MGS2 and showed that they give a numerically orthogonal Q factor and a small residual for matrices with κ2(X) ∼ O(u−1) [9]. Stathopoulos et al. experimentally show that the Cholesky QR algorithm can be applied to matrices with a large condition number, if it is applied twice (or more) [19]. Rozložník et al. analyze the CholeskyQR2 algorithm in the more general setting of orthogonalization under an indefinite inner product and derive bounds on the residual and deviation from orthogonality [16]. However, their bounds are expressed in terms of the computed Q and R factors, along with the matrix B that defines the inner product, and do not constitute a priori error bounds, in contrast to the bounds derived in this paper. Also, the coefficients of u are not given explicitly.

Even though the underlying idea of repeating an unstable algorithm twice to improve stability is the same, it is worth noting an inherent disadvantage of CholeskyQR2 when compared with CGS2 and MGS2: numerical breakdown. Specifically, if κ2(X) ≫ u−1/2, then the Cholesky factorization of X>X can break down, and so does CholeskyQR2. By contrast, Gram-Schmidt type algorithms are free from such breakdowns (except for very obvious breakdowns due to division by zero in the normalization) and, as shown in [9], give stable QR factorizations for a much wider class of matrices, κ2(X) ∼ O(u−1), when repeated twice.

The rest of this paper is organized as follows. In Section 2, after giving some definitions and assumptions, we introduce the CholeskyQR2 algorithm. Detailed error analysis of CholeskyQR2 is presented in Section 3. Numerical results that support our analysis are provided in Section 4. Section 5 gives some discussion of our results. Finally, some concluding remarks are given in Section 6.

2. The CholeskyQR2 algorithm.

2.1. Notation and assumptions. In the following, we consider computing the QR decomposition of an m by n real matrix X, where m ≥ n. Throughout this paper, we assume that computations are performed according to the IEEE 754 floating point standard and denote the unit roundoff by u. Let σi(X) be the ith largest singular value of X and κ2(X) = σ1(X)/σn(X) be its condition number. We further assume that

(2.1) δ ≡ 8κ2(X)√(mnu + n(n+1)u) ≤ 1.

This means that the condition number of X is at most O(u−1/2). From this assumption

and κ2(X) ≥ 1, we also have

(2.2) mnu ≤ 1/64, n(n+1)u ≤ 1/64.

Following [11], let us define a quantity γk for a positive integer k by

γk = ku/(1 − ku).


Then it is easy to show that under the assumption (2.1)

(2.3) γm = mu/(1 − mu) ≤ 1.1mu, γn+1 = (n+1)u/(1 − (n+1)u) ≤ 1.1(n+1)u.

2.2. The algorithm. In the Cholesky QR algorithm, we compute the QR decomposition of X by the following procedure.

A = X>X,

R = chol(A),

Y = XR−1,

where chol(A) is a function that computes the (upper triangular) Cholesky factor of A. Then, X = Y R can be regarded as the QR decomposition of X.

In the CholeskyQR2 algorithm, after obtaining Y and R by the above procedure, we further compute the following.

B = Y>Y,

S = chol(B),

Z = Y S−1 (= X(SR)−1),

U = SR.

If the columns of Y are exactly orthonormal, B becomes the identity and Z = Y. However, in finite precision arithmetic, this does not hold in general and Z ≠ Y. In the CholeskyQR2 algorithm, the QR decomposition of X is given by X = ZU.
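In Matlab notation, the whole method takes only a few lines. The following is a minimal sketch of ours (the function name cholesky_qr2 is not from the original sources, and no blocked or parallel-distributed implementation is shown):

    function [Z, U] = cholesky_qr2(X)
    % First Cholesky QR step: X = Y*R.
    A = X' * X;      % Gram matrix; the only allreduce in a parallel setting
    R = chol(A);     % upper triangular Cholesky factor, A = R'*R
    Y = X / R;       % Y = X * inv(R), computed by triangular solves
    % Second Cholesky QR step applied to Y: Y = Z*S.
    B = Y' * Y;
    S = chol(B);
    Z = Y / S;
    U = S * R;       % upper triangular factor of X = Z*U
    end

Everything except the two small chol calls is a large level 3 BLAS operation, which is the source of the algorithm's efficiency discussed in the introduction.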

3. Error analysis of the CholeskyQR2 algorithm. Our objective is to show that under assumption (2.1), the CholeskyQR2 algorithm delivers an orthogonal factor Z and an upper triangular factor U for which both the orthogonality ‖Z>Z − I‖F and the residual ‖X − ZU‖F /‖X‖2 are of O(u). Here, the constants in O(u) contain lower order terms in m and n, but not in κ2(X).

This section is structured as follows. In subsection 3.1, we formulate the CholeskyQR2 algorithm in floating point arithmetic and prepare several bounds that are necessary to evaluate the orthogonality of the computed orthogonal factor. Using these bounds, the bound on the orthogonality is derived in subsection 3.2. In subsection 3.3, several bounds that are needed to evaluate the residual are provided, and they are used in subsection 3.4 to give a bound on the residual.

3.1. Preparation for evaluating the orthogonality. Let us denote the matrices A, R and Y computed using floating point arithmetic by A = fl(X>X), R = fl(chol(A)) and Y = fl(XR−1), respectively. Taking rounding errors into account, the computed quantities satisfy

A = X>X + E1,  (3.1)
R>R = A + E2 = X>X + E1 + E2,  (3.2)
y>i = x>i (R + ∆Ri)−1 (i = 1, 2, . . . , m).  (3.3)

Here, x>i and y>i are the ith row vectors of X and Y, respectively. E1 is the forward error of the matrix-matrix multiplication X>X, while E2 is the backward error of the Cholesky decomposition of A. ∆Ri denotes the backward error arising from solving


the linear simultaneous equation y>i R = x>i by forward substitution. It would be easier if we could express the backward error of the forward substitution as

Y = X(R+ ∆R)−1,

but we have to use the row-wise expression (3.3) instead because the backward error ∆R depends on the right-hand side vector x>i.

In the following, we evaluate each of E1, E2 and ∆Ri. We also give bounds on the 2-norms of R−1 and XR−1 for later use. Furthermore, we derive an alternative form of Eq. (3.3):

(3.4) y>i = (x>i + ∆x>i )R−1,

in which the backward error enters in the right-hand side vector instead of the coefficient matrix. Equivalently, ∆x>i is the residual of the linear system y>i R = x>i. Then, by letting ∆X = (∆x1, ∆x2, . . . , ∆xm)>, we can rewrite (3.3) as

(3.5) Y = (X + ∆X)R−1,

which is more convenient to use. We also evaluate the norm of ∆X.

Forward error in the matrix-matrix multiplication X>X. Let A ∈ Rm×n and B ∈ Rn×p. Then the componentwise forward error of the matrix-matrix multiplication C = AB can be evaluated as

(3.6) |Ĉ − C| ≤ γn|A||B|,

where Ĉ = fl(AB), |A| denotes the matrix whose (i, j)th element is |aij|, and the inequality means componentwise inequality [11]. The 2-norm of the ith column of X, which we denote by xi, is clearly less than or equal to ‖X‖2. Hence,

(3.7) |E1|ij = |A − X>X|ij ≤ γm(|X|>|X|)ij = γm|xi|>|xj| ≤ γm‖xi‖‖xj‖ ≤ γm‖X‖22.

(Throughout, a vector norm is always the 2-norm.) Thus we have

(3.8) ‖E1‖2 ≤ ‖E1‖F ≤ nγm‖X‖22.

Simplifying this result using (2.3) leads to

(3.9) ‖E1‖2 ≤ 1.1mnu‖X‖22.

Backward error of the Cholesky decomposition of A. Let A ∈ Rn×n be symmetric positive definite and assume that the Cholesky decomposition of A in floating point arithmetic runs to completion and the upper triangular Cholesky factor R is obtained. Then, there exists ∆A ∈ Rn×n satisfying

R>R = A+ ∆A, |∆A| ≤ γn+1|R|>|R|.

See Theorem 10.3 of [11] for details. In our case, applying this result to the computed Gram matrix A in (3.2), we obtain

(3.10) |E2| ≤ γn+1|R|>|R|.

Hence,

(3.11) ‖E2‖2 ≤ ‖E2‖F = ‖|E2|‖F ≤ γn+1‖|R|>|R|‖F ≤ γn+1‖|R|‖2F = γn+1‖R‖2F ≤ nγn+1‖R‖22.


On the other hand, we have from Eq. (3.2),

(3.12) ‖R‖22 = ‖R>R‖2 ≤ ‖A‖2 + ‖E2‖2.

Substituting Eq. (3.12) into the rightmost side of Eq. (3.11) leads to

‖E2‖2 ≤ nγn+1(‖A‖2 + ‖E2‖2),

or,

(3.13) ‖E2‖2 ≤ [nγn+1/(1 − nγn+1)] ‖A‖2.

Noting that

(3.14) ‖A‖2 ≤ ‖X>X‖2 + ‖E1‖2 ≤ ‖X‖22 + nγm‖X‖22 = (1 + nγm)‖X‖22,

from Eqs. (3.1) and (3.8), we have

(3.15) ‖E2‖2 ≤ [nγn+1(1 + nγm)/(1 − nγn+1)] ‖X‖22.

This result can be simplified using (2.2) and (2.3) as

‖E2‖2 ≤ [1.1(n+1)u · n · (1 + 1.1mnu) / (1 − 1.1n(n+1)u)] ‖X‖22
≤ [1.1(n+1)u · n · (1 + 1.1/64) / (1 − 1.1/64)] ‖X‖22 = (7161/6290) n(n+1)u ‖X‖22 ≤ 1.2n(n+1)u ‖X‖22.  (3.16)

Backward error of the forward substitution. Let U ∈ Rn×n be a nonsingular triangular matrix. Then, the solution x obtained by solving the linear simultaneous equation Ux = b by substitution in floating point arithmetic satisfies

(3.17) (U + ∆U)x = b, |∆U | ≤ γn|U |.

See Theorem 8.5 of [11]. Note that ∆U depends both on U and b, although the bound in (3.17) does not. In our case, U = R, so we have for 1 ≤ i ≤ m,

(3.18) ‖∆Ri‖2 ≤ ‖∆Ri‖F = ‖|∆Ri|‖F ≤ γn‖|R|‖F ≤ γn√n ‖R‖2.

By inserting Eq. (3.13) into Eq. (3.12) and using (3.14), we have

(3.19) ‖R‖22 ≤ ‖A‖2/(1 − nγn+1) ≤ [(1 + nγm)/(1 − nγn+1)] ‖X‖22.

Inserting this into Eq. (3.18) leads to

‖∆Ri‖2 ≤ γn√n · √[(1 + nγm)/(1 − nγn+1)] ‖X‖2.

Simplifying the right-hand side in the same way as in Eq. (3.16), we obtain

‖∆Ri‖2 ≤ 1.1nu√n · √[(1 + 1.1mnu)/(1 − 1.1n(n+1)u)] ‖X‖2
≤ 1.1nu√n · √[(1 + 1.1/64)/(1 − 1.1/64)] ‖X‖2 ≤ 1.2n√n u ‖X‖2.  (3.20)


Bounding the 2-norm of R−1. Next we evaluate the 2-norm of R−1. Noting that all the matrices appearing in Eq. (3.2) are symmetric, we can apply the Bauer-Fike theorem (or Weyl’s theorem) to obtain

(σn(X))2 − (‖E1‖2 + ‖E2‖2) ≤ (σn(R))2.

Using assumption (2.1) and Eqs. (3.9) and (3.16), we have ‖E1‖2 + ‖E2‖2 ≤ (1.2/64)(σn(X))2 ≤ (1 − 1.1−2)(σn(X))2. Hence,

1.1−2(σn(X))2 ≤ (σn(R))2,

leading to the bound on R−1 as

(3.21) ‖R−1‖2 = (σn(R))−1 ≤ 1.1(σn(X))−1.

Bounding the 2-norm of XR−1. From Eq. (3.2), we have

(3.22) R−>X>XR−1 = I − R−>(E1 + E2)R−1.

Thus,

‖XR−1‖22 ≤ 1 + ‖R−1‖22 (‖E1‖2 + ‖E2‖2).

By using ‖E1‖2 + ‖E2‖2 ≤ (1.2/64)(σn(X))2 again and inserting Eq. (3.21), we obtain

(3.23) ‖XR−1‖2 ≤ 1.1.

Evaluation of the backward error ∆X. From Eq. (3.3), we have

y>i = x>i (R+ ∆Ri)−1

= x>i (I + R−1∆Ri)−1R−1.

Now, let

(I + R−1∆Ri)−1 = I + ℛi.

Then, since ℛi = Σ_{k=1}^∞ (−R−1∆Ri)^k, we obtain the bound on ‖ℛi‖2 as

‖ℛi‖2 ≤ Σ_{k=1}^∞ (‖R−1‖2‖∆Ri‖2)^k = ‖R−1‖2‖∆Ri‖2 / (1 − ‖R−1‖2‖∆Ri‖2)
≤ [1.1(σn(X))−1 · 1.2n√n u ‖X‖2] / [1 − 1.1(σn(X))−1 · 1.2n√n u ‖X‖2],  (3.24)

where we used Eqs. (3.20) and (3.21) in the last inequality. The denominator of Eq. (3.24) can be evaluated as

1 − 1.1(σn(X))−1 · 1.2n√n u ‖X‖2 ≥ 1 − 1.1 · 1.2n√n u / [8√(mnu + n(n+1)u)]
≥ 1 − (1.32/8)√(nu) ≥ 1 − (1.32/8)√(1/11) ≥ 0.95.


Inserting this into Eq. (3.24) and evaluating the numerator using Eq. (3.20) again, we have

‖ℛi‖2 ≤ (1/0.95) · 1.1κ2(X) · 1.2n√n u ≤ 1.4κ2(X) n√n u.

Now, let

∆x>i = x>i ℛi.

Then,

(3.25) y>i = (x>i + ∆x>i )R−1.

By defining the matrix ∆X ∈ Rm×n as ∆X = (∆x1, ∆x2, . . . , ∆xm)>, we can rewrite Eq. (3.25) as

(3.26) Y = (X + ∆X)R−1.

The bound on ‖∆X‖F can be given as

‖∆X‖F = √(Σ_{i=1}^m ‖∆x>i ‖2) ≤ √(Σ_{i=1}^m ‖x>i ‖2 ‖ℛi‖22)
≤ 1.4κ2(X) n√n u · √(Σ_{i=1}^m ‖x>i ‖2) ≤ 1.4κ2(X)‖X‖2 n2u,  (3.27)

where the relationship √(Σ_{i=1}^m ‖x>i ‖2) = ‖X‖F ≤ √n ‖X‖2 is used to derive the last inequality.

3.2. Orthogonality of Y and Z. Based on the preparations given in the previous subsection, we evaluate the orthogonality of Y and Z computed by the Cholesky QR and CholeskyQR2 algorithms. The following lemma holds.

Lemma 3.1. Suppose that X ∈ Rm×n with m ≥ n satisfies Eq. (2.1). Then, the matrix Y obtained by applying the Cholesky QR algorithm in floating point arithmetic to X satisfies the following inequality. With δ as defined in (2.1),

‖Y>Y − I‖2 ≤ (5/64)δ2.

Proof. By expanding Y>Y using Eq. (3.26), we have

Y>Y = R−>(X + ∆X)>(X + ∆X)R−1
= R−>X>XR−1 + R−>X>∆XR−1 + R−>∆X>XR−1 + R−>∆X>∆XR−1
= I − R−>(E1 + E2)R−1 + (XR−1)>∆XR−1 + R−>∆X>(XR−1) + R−>∆X>∆XR−1.


Here, we used Eq. (3.22) to derive the last equality. Thus,

‖Y>Y − I‖2 ≤ ‖R−>(E1 + E2)R−1‖2 + 2‖R−>∆X>(XR−1)‖2 + ‖R−>∆X>∆XR−1‖2
≤ ‖R−1‖22(‖E1‖2 + ‖E2‖2) + 2‖R−1‖2‖XR−1‖2‖∆X‖2 + ‖R−1‖22‖∆X‖22
≤ ‖R−1‖22(‖E1‖2 + ‖E2‖2) + 2‖R−1‖2‖XR−1‖2‖∆X‖F + ‖R−1‖22‖∆X‖2F
≤ (1.1(σn(X))−1)2(1.1mnu + 1.2n(n+1)u)‖X‖22 + 2 · 1.1(σn(X))−1 · 1.1 · 1.4κ2(X)‖X‖2 n2u + (1.1(σn(X))−1 · 1.4κ2(X)‖X‖2 n2u)2
≤ (1.1^2 · 1.2/64)δ2 + (2 · 1.1^2 · 1.4/64)δ2 + ((1.1 · 1.4/64)δ2)2
≤ (5/64)δ2.  (3.28)

In the fourth inequality, we used Eqs. (3.9), (3.16), (3.21), (3.23) and (3.27). In the last inequality, we simplified the expression using the assumption δ ≤ 1.

The next corollary follows immediately from Lemma 3.1.

Corollary 3.2. The condition number of Y satisfies κ2(Y) ≤ 1.1.

Proof. By Lemma 3.1, every eigenvalue λi of Y>Y satisfies

1 − 5/64 ≤ λi ≤ 1 + 5/64.

Hence, every singular value σi(Y ) of Y satisfies

(3.29) √59/8 ≤ σi(Y) ≤ √69/8.

Thus it follows that

κ2(Y) = σ1(Y)/σn(Y) ≤ √(69/59) ≤ 1.1.

In other words, the matrix Y obtained by applying the Cholesky QR algorithm once is extremely well-conditioned, though its deviation from orthogonality, ‖Y>Y − I‖2, is still of order 0.1.

Combining Lemma 3.1 and Corollary 3.2, we obtain one of the main results of this paper.

Theorem 3.3. The matrix Z obtained by applying CholeskyQR2 in floating point arithmetic to X satisfies the following inequality.

(3.30) ‖Z>Z − I‖2 ≤ 6(mnu + n(n+ 1)u).

Proof. Noting that κ2(Y) ≤ √(69/59) from Corollary 3.2 and applying Lemma 3.1 again to Y, we have

‖Z>Z − I‖2 ≤ (5/64)δ2 ≤ (5/64) · (69/59) · 64(mnu + n(n+1)u) ≤ 6(mnu + n(n+1)u),  (3.31)

where δ now denotes the quantity defined in (2.1) with X replaced by Y, so that δ2 = 64κ2(Y)2(mnu + n(n+1)u) ≤ 64 · (69/59) · (mnu + n(n+1)u).


Orthogonality error in the Frobenius norm. Above, we derived the bound on the orthogonality error in terms of the 2-norm, because we wanted to give a bound on the 2-norm condition number of Y. However, by tracing the derivation of Eq. (3.28), we can also derive the following bound in the Frobenius norm,

‖Y>Y − I‖F ≤ ‖R−1‖22(‖E1‖F + ‖E2‖F) + 2‖R−1‖2‖XR−1‖2‖∆X‖F + ‖R−1‖22‖∆X‖2F.

As is clear from Eqs. (3.8) and (3.11), the upper bounds on ‖E1‖2 and ‖E2‖2 that were used in Eq. (3.28) are also bounds on ‖E1‖F and ‖E2‖F. Thus, the same bound given in Eq. (3.31) holds for the Frobenius norm as well. We summarize this observation as a corollary.

Corollary 3.4. The matrix Z obtained by applying CholeskyQR2 in floating point arithmetic to X satisfies the following inequality.

(3.32) ‖Z>Z − I‖F ≤ 6(mnu + n(n+ 1)u).

3.3. Preparation for evaluating the residual. Let the matrices B, S, Z and U computed by floating point arithmetic be denoted by B = fl(Y>Y), S = fl(chol(B)), Z = fl(Y S−1) and U = fl(SR), respectively. Then we have

B = Y>Y + E3,  (3.33)
S>S = B + E4 = Y>Y + E3 + E4,  (3.34)
z>i = y>i (S + ∆Si)−1 (i = 1, 2, . . . , m),  (3.35)
U = SR + E5.  (3.36)

Here, z>i is the ith row vector of Z. E3 and E5 are the forward errors of the matrix multiplications Y>Y and SR, respectively, while E4 is the backward error of the Cholesky decomposition of B. ∆Si is the backward error introduced in solving the linear simultaneous equation z>i S = y>i by forward substitution.

As a preparation for evaluating the residual, we first evaluate the norms of R, S, ∆Si, E5 and Z.

Evaluation of R. From Eq. (3.19), we have

(3.37) ‖R‖2/‖X‖2 ≤ √[(1 + nγm)/(1 − nγn+1)] ≤ √[(1 + mnu/(1 − mu)) / (1 − n(n+1)u/(1 − (n+1)u))] ≤ √[(1 + (1/64)/(1 − 1/11)) / (1 − (1/64)/(1 − 1/11))] = √(651/629) ≤ 1.1.

Evaluation of S. Noticing that ‖Y‖2 ≤ √69/8 from Eq. (3.29), we can obtain an upper bound on the norm of S by multiplying the bound of Eq. (3.37) by √69/8. Thus,

‖S‖2 ≤ √(651/629) · √69/8 ≤ 1.1.

Evaluation of ∆Si. Similarly, multiplying the bound of Eq. (3.20) by √69/8 leads to the following bound on ∆Si.

‖∆Si‖2 ≤ √69/8 · 1.2n√n u ≤ 1.3n√n u.


Evaluation of E5. By using the error bound on matrix multiplication given in Eq. (3.6), we have

|E5| ≤ γn|S||R|.

Hence,

‖E5‖2 ≤ ‖|E5|‖F ≤ γn‖|S||R|‖F ≤ γn‖|S|‖F ‖|R|‖F = γn‖S‖F ‖R‖F ≤ nγn‖S‖2‖R‖2
≤ n · 1.1nu · √(651/629) · √69/8 · √(651/629) ‖X‖2 ≤ 1.2n2u‖X‖2.

Evaluation of Z. From Eq. (3.29), we have ‖Y‖F ≤ (√69/8)√n. Multiplying this by √69/8 yields the following upper bound on Z.

‖Z‖F ≤ (69/64)√n ≤ 1.1√n.

3.4. Bounding the residual. Based on the above results, we evaluate the residual of the pair (Z, U). The following theorem holds, which is also one of our main results.

Theorem 3.5. Assume that an m × n real matrix X (m ≥ n) satisfies Eq. (2.1). Then the matrices Z and U obtained by applying the CholeskyQR2 algorithm in floating point arithmetic to X satisfy the following inequality.

(3.38) ‖ZU − X‖F /‖X‖2 ≤ 5n2√n u.

Proof. Expanding z>i U − x>i using Eqs. (3.36), (3.35) and (3.3) leads to

‖z>i U − x>i ‖ = ‖z>i (SR + E5) − z>i (S + ∆Si)(R + ∆Ri)‖
= ‖z>i SR + z>i E5 − z>i SR − z>i S∆Ri − z>i ∆SiR − z>i ∆Si∆Ri‖
≤ ‖z>i ‖(‖E5‖2 + ‖S‖2‖∆Ri‖2 + ‖∆Si‖2‖R‖2 + ‖∆Si‖2‖∆Ri‖2)
≤ ‖z>i ‖(1.2n2u + 1.1 · 1.2n√n u + 1.3n√n u · 1.1 + 1.3n√n u · 1.2n√n u)‖X‖2
≤ ‖z>i ‖ ‖X‖2 · 4n2u.  (3.39)

Hence,

‖ZU − X‖F /‖X‖2 = √(Σ_{i=1}^m ‖z>i U − x>i ‖2) / ‖X‖2 ≤ 4n2u √(Σ_{i=1}^m ‖z>i ‖2) = 4n2u ‖Z‖F ≤ 5n2√n u.

4. Numerical results. Here we evaluate the numerical stability of CholeskyQR2 and compare it with the stability of other popular QR decomposition algorithms, namely, Householder QR, classical and modified Gram-Schmidt (CGS and MGS; we also run them twice, shown as CGS2 and MGS2) and Cholesky QR. To this end, we generated test matrices with a specified condition number by X := UΣV ∈ Rm×n, where U is an m × n random orthogonal matrix, V is an n × n random orthogonal matrix and

Σ = diag(1, σ^{1/(n−1)}, . . . , σ^{(n−2)/(n−1)}, σ).


Here, 0 < σ < 1 is some constant. Thus ‖X‖2 = 1 and the 2-norm condition number of X is κ2(X) = 1/σ. We varied κ2(X), m and n and investigated the dependence of the orthogonality and residual on them. All computations were done in Matlab 2012b using IEEE standard 754 binary64 (double precision) on Mac OS X version 10.8 with a 2 GHz Intel Core i7 Duo processor, so that u = 2^−53 ≈ 1.11 × 10^−16.
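A possible Matlab sketch of this test matrix generation (the helper name test_matrix and the use of economy-size QR of Gaussian matrices to obtain the random orthogonal factors are our choices, not prescribed by the text):

    function X = test_matrix(m, n, sigma)
    % Random m-by-n matrix with norm(X) = 1 and cond(X) = 1/sigma.
    [U, ~] = qr(randn(m, n), 0);            % m-by-n random orthogonal factor
    [V, ~] = qr(randn(n));                  % n-by-n random orthogonal matrix
    S = diag(sigma .^ ((0:n-1) / (n-1)));   % singular values 1, ..., sigma
    X = U * S * V;
    end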

We show the orthogonality and residual measured by the Frobenius norm under various conditions in Figures 4.1 through 4.6. Figures 4.1 and 4.2 show the orthogonality ‖Z>Z − I‖F and residual ‖ZU − X‖F, respectively, for the case m = 10,000, n = 100 and varying κ2(X). In Figures 4.3 and 4.4, κ2(X) = 10^5, n = 100 and m was varied from 1,000 to 10,000. In Figures 4.5 and 4.6, κ2(X) = 10^5, m = 1,000 and n was varied from 100 to 1,000.
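For reference, one way to compute the two plotted measures in Matlab, with X, Z, U and n as above (e.g. with Z and U from cholesky_qr2):

    orth_err = norm(Z' * Z - eye(n), 'fro');   % deviation from orthogonality
    resid    = norm(Z * U - X, 'fro');         % residual; note norm(X) = 1 here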

It is clear from Figures 4.1 and 4.2 that both the orthogonality and residual are independent of κ2(X) and are of O(u), as long as κ2(X) is at most O(u−1/2). This is in good agreement with the theoretical prediction and is in marked contrast to the results of CGS, MGS and Cholesky QR, for which the deviation from orthogonality increases in proportion to κ2(X) (MGS) or (κ2(X))2 (CGS and Cholesky QR). As can be seen from Figures 4.3 through 4.6, the orthogonality and residual increase only mildly with m and n, which is also in agreement with the theoretical results, although they are inevitably overestimates. Compared with Householder QR, it was observed that CholeskyQR2 generally produces smaller orthogonality and residual errors. From these results, we can conclude that CholeskyQR2 is stable for matrices with condition number at most O(u−1/2). As is well known, Gram-Schmidt type algorithms perform well when repeated twice.

5. Discussion. In this section, we discuss four topics related to the stability of CholeskyQR2. First, we compare the orthogonality and residual bounds of CholeskyQR2 given in Corollary 3.4 and Theorem 3.5, respectively, with known bounds for Householder QR [11] and CGS2 [9]. Second, we consider how to examine the applicability of CholeskyQR2 for a given matrix. Third, we show that CholeskyQR2 is not only norm-wise stable, but also column-wise stable. Finally, we discuss the row-wise stability of CholeskyQR2, which cannot be proved but is nearly always observed in practice.

5.1. Comparison with the error bounds of Householder QR and CGS2.

Orthogonality. For Householder QR, the Q factor is computed by applying n Householder transformations to I1:m,1:n, an m × n matrix consisting of the first n columns of the identity matrix of order m. Hence, from Lemma 19.3 of [11], the computed Q factor satisfies

Q = P>(I1:m,1:n + ∆I),

where P is some m × m exactly orthogonal matrix and ∆I is an m × n matrix each of whose column vectors has norm bounded by nγcm, where c is a small positive constant. From this, it is easy to derive the bound

‖Q>Q − I‖F ≤ n√n γc′m ≈ c′mn√n u.

For CGS2, Giraud et al. show the following bound on the deviation from orthogonality under the assumption that κ2(X)m2n3u = O(1) [9].

‖Q>Q − I‖2 ≤ c′′mn√n u.



Fig. 4.1. Orthogonality ‖Z>Z − I‖F for test matrices with m = 10,000, n = 100, varying κ2(X).


Fig. 4.2. Residual ‖ZU − X‖F for test matrices with m = 10,000, n = 100, varying κ2(X).



Fig. 4.3. Orthogonality ‖Z>Z − I‖F for test matrices with κ2(X) = 10^5, n = 100, varying m.


Fig. 4.4. Residual ‖ZU − X‖F for test matrices with κ2(X) = 10^5, n = 100, varying m.



Fig. 4.5. Orthogonality ‖Z>Z − I‖F for test matrices with κ2(X) = 10^5, m = 1,000, varying n.


Fig. 4.6. Residual ‖ZU − X‖F for test matrices with κ2(X) = 10^5, m = 1,000, varying n.


Although this assumption is hard to satisfy for large matrices (notice that κ2(X)m2n3 is 10^19 for the largest matrix appearing in Fig. 4.3), it has been observed that CGS2 produces near-orthogonal matrices in many cases where this condition is violated [14].

Comparing these bounds with Eq. (3.32), we observe that the error bound of CholeskyQR2 is smaller by a factor of √n. This is in qualitative agreement with the results of the numerical experiments given in the previous section.

Note however that this difference should not be overemphasized, because these are merely upper bounds. In fact, the Givens QR algorithm admits an error bound that is smaller than that of Householder QR by a factor of n, but no difference in accuracy is observed in practice [11, p. 368].

Residual. According to [11, Sec. 19.3], the upper bound on the residual of the Householder QR algorithm can be evaluated as O(mn√n u). As for CGS2, it is not difficult to derive a bound of the same order using the results given in [9]. Thus, we can say that the CholeskyQR2 algorithm has a smaller bound also in terms of the residual. This is related to the fact that in the CholeskyQR2 algorithm, the computation of Y from X and of Z from Y is done by row-wise forward substitution. Thus the backward errors introduced there, or their sum of squares, which is one of the main sources of the residual, do not depend on m when ‖X‖2 is fixed. In addition, the forward error in the computation of SR, which is another source of residual, also involves only n. Thus the residual depends only on n, which is in marked contrast to Householder QR.

A few more comments are in order regarding the bound (3.38). A close examination of Eq. (3.39) shows that the highest order term in the residual comes from the forward error of the matrix multiplication SR, which we denoted by E5. This implies that if we compute this matrix multiplication using extended precision arithmetic, we can reduce the upper bound on the residual to O(n2u) with virtually no increase in the computational cost (when m ≫ n). Moreover, in a situation where only the orthogonal factor Z is needed, as in orthogonalization of vectors, we can leave the product SR uncomputed and say that the triplet (Z, S, R) has a residual of O(n2u).

5.2. Applicability of CholeskyQR2 for a given matrix. There are some cases in which the condition number of X is known in advance to be moderate. An example is orthogonalization of vectors in first-principles molecular dynamics [3]. In this application, we are interested in the time evolution of an orthogonal matrix X(t) ∈ Rm×n, whose column vectors are an orthogonal basis of the space of occupied-state wave functions. To obtain X(t + ∆t), we first compute X = X(t) − F(X)∆t, where F(X) ∈ Rm×n is some nonlinear matrix function of X, and then compute X(t + ∆t) by orthogonalizing the columns of X. Since X(t) is orthogonal, we can easily evaluate the deviation from orthogonality of X by computing the norm of F(X)∆t. Usually, the time step ∆t is small enough to ensure that κ2(X) ≪ u−1/2.

In some cases, however, the condition number of X cannot be estimated in advance and one may want to examine the applicability of CholeskyQR2 from intermediate quantities that are computed in the algorithm. This is possible if R has been computed without breakdown in the Cholesky decomposition. Given R, one can estimate its largest and smallest singular values using the power method and inverse power method on R>R, respectively. Indeed, the MATLAB condition number estimator condest first computes the LU factorization of the input matrix, and then applies a few iterations of the power method to obtain a reliable estimate of the 1-norm condition number. This should not cost too much because R is triangular and each step of both methods requires only O(n2) work. After that, one can evaluate the condition number of X by


using the relations (3.1) and (3.2), the bounds (3.9) and (3.16) on ‖E1‖2 and ‖E2‖2, respectively, and the Bauer-Fike theorem.
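The singular value estimation itself might look as follows; this is a sketch under our own assumptions (the fixed iteration count, the random starting vectors and the function name estimate_cond_R are not prescribed by the text):

    function kappa_R = estimate_cond_R(R)
    % Estimate kappa_2(R) via power / inverse power iterations on R'*R.
    % Each step costs O(n^2) because only triangular solves are used.
    n = size(R, 1);
    v = randn(n, 1);  w = randn(n, 1);
    for k = 1:20                               % fixed iteration count (assumed)
        v = R' * (R * v);   v = v / norm(v);   % power method: largest singular value
        w = R \ (R' \ w);   w = w / norm(w);   % inverse power method: smallest
    end
    sigma_max = sqrt(norm(R' * (R * v)));
    sigma_min = 1 / sqrt(norm(R \ (R' \ w)));
    kappa_R = sigma_max / sigma_min;
    end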

5.3. Column-wise stability of CholeskyQR2. Thus far we have investigated the normwise residual of CholeskyQR2. Sometimes the columns of X have widely varying norms, and one may wish to obtain the more stringent column-wise backward stability, which requires

‖xj − Qrj‖/‖xj‖ = O(u), j = 1, . . . , n.

Here, xj and rj denote the jth columns of X and R, respectively. In this subsection, we prove that CholeskyQR2 is indeed column-wise backward stable.

To see this, we first consider a single Cholesky QR and show that the computed ‖rj‖ is of the same order as ‖xj‖. Let us recall Eqs. (3.1) through (3.3). From Eq. (3.7), we have

(5.1) |E1|jj ≤ γm|xj |>|xj | = γm‖xj‖2.

By considering the (j, j)th element of Eq. (3.2) and substituting Eqs. (5.1) and (3.10), we obtain

‖rj‖2 ≤ |Ajj| + γn+1|rj|>|rj| ≤ ‖xj‖2 + γm‖xj‖2 + γn+1‖rj‖2.

Hence,

(5.2) ‖rj‖ ≤ √[(1 + γm)/(1 − γn+1)] ‖xj‖ = ‖xj‖ · O(1).

Now we demonstrate the column-wise backward stability of a single Cholesky QR. Let the jth column of Y be denoted by yj. From Eq. (3.3), we have

(5.3) Xij = y>i (rj + ∆r(i)j ).

Here, ∆r(i)j is the jth column of ∆Ri. Thus,

|Xij − y>i rj| ≤ |y>i ∆r(i)j| ≤ ‖yi‖ ‖∆r(i)j‖ ≤ γn‖yi‖ ‖rj‖.

Squaring both sides and summing over i leads to

‖xj − Y rj‖2 ≤ γn^2 ‖Y‖2F ‖rj‖2.

By using ‖Y ‖F = O(1) (see Lemma 3.1) and Eq. (5.2), we can establish the column-wise backward stability of Cholesky QR as follows.

‖xj − Y rj‖/‖xj‖ ≤ γn‖Y‖F · ‖rj‖/‖xj‖ = γn · O(1) · √[(1 + γm)/(1 − γn+1)].  (5.4)


To apply the above result to CholeskyQR2, we consider the backward errors in the second QR decomposition Y = ZS and in the product of the two upper triangular factors U = SR. These backward errors, which we denote by ∆Y and ∆S, respectively, satisfy

Y + ∆Y = ZS,

uj = (S + ∆S)rj .

Here, uj is the jth column of U. To evaluate ∆Y, we note the following inequality, which can be obtained in the same way as Eq. (5.4).

‖yj − Zsj‖2 ≤ γn^2 ‖Z‖2F ‖sj‖2.

Summing both sides over j and taking the square root gives

(5.5) ‖∆Y ‖F = ‖Y − ZS‖F ≤ γn‖Z‖F ‖S‖F = γn ·O(1).

As for ∆S, the standard result on the error analysis of the matrix-vector product, combined with ‖S‖F ≈ ‖Y‖F = O(1), leads to

(5.6) ‖∆S‖F ≤ γn‖S‖F = γn ·O(1).

On the other hand,

xj − Zuj = xj − Z(S + ∆S)rj = xj − (Y + ∆Y + Z∆S)rj = (xj − Y rj) − (∆Y + Z∆S)rj.  (5.7)

By substituting Eqs. (5.4), (5.5) and (5.6) into Eq. (5.7), we finally obtain the column-wise backward stability of CholeskyQR2 as follows.

‖xj − Zuj‖/‖xj‖ ≤ ‖xj − Y rj‖/‖xj‖ + (‖∆Y‖F + ‖Z‖F ‖∆S‖F) · ‖rj‖/‖xj‖ = γn · O(1).

5.4. Row-wise stability of CholeskyQR2. In this subsection, we investigate the row-wise stability of CholeskyQR2, which is defined as

(5.8) ‖x>i − q>i R‖/‖x>i ‖ = O(u), i = 1, . . . ,m.

Here x>i and q>i denote the ith rows of the matrices.

The requirement (5.8) is strictly more stringent than the normwise stability, and indeed the standard Householder QR factorization does not always achieve (5.8). It is known [11, Ch. 19] that when row sorting and column pivoting are used, Householder QR factorization gives row-wise stability. However, pivoting involves an increased communication cost and is best avoided in high-performance computing.

Having established the normwise and column-wise stability of CholeskyQR2, we now examine its row-wise stability. To gain some insight we first run experiments with a semi-randomly generated matrix X, whose row norms vary widely. Specifically, we generate a random m × n matrix via the Matlab command X = randn(m,n), then left-multiply a diagonal matrix, X := DX, with Djj = 2^{j/2} for j = 1, . . . , m. Here we took m = 100 and n = 50; the matrix thus has rows of exponentially growing norms and κ2(X) ≈ u−1/2. Figure 5.1 shows the row-wise residuals of three


algorithms: standard Householder QR, Householder QR employing row sorting and column pivoting, and CholeskyQR2.
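A sketch of the CholeskyQR2 part of this experiment, using the cholesky_qr2 function above (the vectorized residual computation is one possible transcription of the measured quantity):

    m = 100;  n = 50;
    X = diag(2 .^ ((1:m) / 2)) * randn(m, n);   % rows with exponentially growing norms
    [Z, U] = cholesky_qr2(X);
    rowres = sqrt(sum((X - Z * U).^2, 2)) ./ sqrt(sum(X.^2, 2));  % row-wise residuals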

We make several observations from Figure 5.1. First, we confirm the known fact that the standard Householder QR factorization is not row-wise backward stable, but this can be cured by employing row sorting and column pivoting. Second, CholeskyQR2 gave row-wise stability comparable to Householder QR with row sorting and column pivoting; this is perhaps surprising considering the fact that CholeskyQR2 employs no pivoting or sorting.


Fig. 5.1. Row-wise residual ‖x>i − q>i R‖2/‖x>i ‖2.

To illustrate the situation, we examine the first step of CholeskyQR2. Recall that q>i = fl(x>i R−1). Hence ‖x>i − q>i R‖2 = ‖x>i − fl(x>i R−1)R‖2, and by standard triangular solve analysis there exist ∆Ri for i = 1, . . . , m such that

fl(x>i R−1)(R + ∆Ri) = x>i, ‖∆Ri‖ = O(u)‖R‖.

Hence for row-wise stability we need ‖fl(x>i R−1)∆Ri‖ = O(u)‖x>i ‖. Since ‖fl(x>i R−1)∆Ri‖ ≤ O(u)‖R‖‖fl(x>i R−1)‖, a sufficient condition is

(5.9) ‖fl(x>i R−1)‖ = O(‖x>i ‖/‖R‖).

Since the general normwise bound for ‖y>R−1‖ is ‖y>R−1‖ ≤ (‖y‖/‖R‖)κ2(R), the condition (5.9) is significantly more stringent when R is ill-conditioned.

Even so, as illustrated in the example above, in all our experiments with random matrices the condition (5.9) was satisfied, with ‖x>i − q>i R‖/‖x>i ‖ < nu for all i. We suspect that this is due to the observation known to experts that triangular linear systems are usually solved to much higher accuracy than the theoretical bound suggests [11], [20]. However, as with this classical observation, counterexamples do exist in our case. For example, taking R to be the Kahan matrix [11], a family of ill-conditioned triangular matrices known to have special properties, the general bound ‖fl(y>R−1)‖ ≤ (‖y>‖/‖R‖)κ2(R) is typically tight for a randomly generated y>, which means (5.9) is significantly violated. In view of this we form X so that the Cholesky factor R of X>X is the Kahan matrix. This can be done by taking X = QR for an m × n orthogonal matrix Q. To introduce large variation in the row norms of X we construct Q as the orthogonal factor of a matrix generated as in the example above.


For every such X with varying size n, (5.9) was still satisfied. Finally, we then appended a row at the bottom of X of random elements with much smaller norm than the rest, and repeated the experiment. Now the row-wise residual for the last row was significantly larger than O(u‖x>i ‖), indicating that row-wise stability does not always hold. Employing pivoting in the Cholesky factorization did not improve the residual.
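A possible Matlab construction of such an X (gallery('kahan', n) returns the Kahan matrix with its default parameters; the row scaling reuses the earlier experiment, and the magnitude of the appended small row is our own guess, not given in the text):

    n = 50;  m = 100;
    R = gallery('kahan', n);                               % ill-conditioned Kahan matrix
    [Q, ~] = qr(diag(2 .^ ((1:m) / 2)) * randn(m, n), 0);  % orthogonal factor, varied row norms
    X = Q * R;                                             % Cholesky factor of X'*X is R
    X = [X; 1e-12 * randn(1, n)];                          % append a much smaller row (scale assumed)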

A referee has suggested more examples for which CholeskyQR2 fails to have row-wise backward stability. One example is as follows: take X to be the off-diagonal part of the 6 × 6 Hilbert matrix and set the (3, 3) element to 5e6.

Experiments nonetheless suggest that cases in which CholeskyQR2 is not row-wise stable are extremely rare.

6. Conclusion. In this paper, we performed a roundoff error analysis of the CholeskyQR2 algorithm for computing the QR decomposition of an m × n real matrix X, where m ≥ n. We showed that if X satisfies Eq. (2.1), the computed Q and R factors, which we denote by Z and U, respectively, satisfy the following error bounds.

‖Z>Z − I‖F ≤ 6(mnu + n(n+ 1)u),

‖ZU − X‖F /‖X‖2 ≤ 5n2√n u.  (6.1)

The bounds shown here are of a smaller order than the corresponding bounds for the Householder QR algorithm. Furthermore, it was shown that when only the Q factor is required, the right-hand side of Eq. (6.1) can be reduced to O(n2u). Numerical experiments support our theoretical analysis. CholeskyQR2 is also column-wise backward stable, as is Householder QR. We also observed that row-wise stability, which is a more stringent condition than the norm-wise stability shown by Eq. (6.1), nearly always holds in practice, though it cannot be guaranteed theoretically.

In this paper, we focused on the stability of CholeskyQR2. Performance results of CholeskyQR2 on large scale parallel machines, along with a comparison with other QR decomposition algorithms and a detailed performance analysis, are given in our recent paper [8].

When the matrix is nearly square, it might be more efficient to partition the matrix into panels and apply the CholeskyQR2 algorithm to each panel successively. Development of such an algorithm remains future work.

Acknowledgments. We thank Professor Nicholas J. Higham for suggesting the Kahan matrix as a possible counterexample to the row-wise backward stability of CholeskyQR2. We also thank Professor Masaaki Sugihara for valuable comments on the first version of our manuscript. We are grateful to the referees for their suggestions, especially for another counterexample to the row-wise backward stability and a comment on column-wise stability.

REFERENCES

[1] G. Ballard, J. Demmel, O. Holtz and O. Schwartz, Minimizing communication in numerical linear algebra, SIAM J. Matrix Anal. Appl., 32 (2011), pp. 866–901.
[2] Z. Bai, J. Demmel, J. Dongarra, A. Ruhe and H. van der Vorst, Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide, SIAM, 2000.
[3] R. Car and M. Parrinello, Unified approach for molecular dynamics and density-functional theory, Phys. Rev. Lett., 55 (1985), pp. 2471–2474.
[4] J. W. Daniel, W. B. Gragg, L. Kaufman and G. W. Stewart, Reorthogonalization and stable algorithms for updating the Gram-Schmidt QR factorization, Math. Comp., 30 (1976), pp. 772–795.
[5] J. Demmel, L. Grigori, M. Hoemmen and J. Langou, Communication-optimal parallel and sequential QR and LU factorizations, SIAM J. Sci. Comput., 34 (2012), pp. 206–239.
[6] E. Elmroth and F. G. Gustavson, Applying recursion to serial and parallel QR factorization leads to better performance, IBM J. Res. Dev., 44 (2000), pp. 605–624.
[7] T. Fukaya, T. Imamura and Y. Yamamoto, Performance analysis of the Householder-type parallel tall-skinny QR factorizations toward automatic algorithm selection, The Ninth International Workshop on Automatic Performance Tuning (iWAPT2014), accepted.
[8] T. Fukaya, Y. Nakatsukasa, Y. Yanagisawa and Y. Yamamoto, CholeskyQR2: a practical and communication-avoiding algorithm for tall-skinny QR factorizations, to appear in Proceedings of ScalA2014.
[9] L. Giraud, J. Langou, M. Rozložník and J. van den Eshof, Rounding error analysis of the classical Gram-Schmidt orthogonalization process, Numer. Math., 101 (2005), pp. 87–100.
[10] G. H. Golub and C. F. Van Loan, Matrix Computations, 4th ed., Johns Hopkins University Press, 2012.
[11] N. J. Higham, Accuracy and Stability of Numerical Algorithms, 2nd ed., SIAM, 2002.
[12] J. Iwata, D. Takahashi, A. Oshiyama, T. Boku, K. Shiraishi, S. Okada and K. Yabana, A massively-parallel electronic-structure calculations based on real-space density functional theory, J. Comput. Phys., 229 (2010), pp. 2339–2363.
[13] W. Jalby and B. Philippe, Stability analysis and improvement of the block Gram-Schmidt algorithm, SIAM J. Sci. Stat. Comput., 12 (1991), pp. 1058–1073.
[14] S. J. Leon, Å. Björck and W. Gander, Gram-Schmidt orthogonalization: 100 years and more, Numer. Lin. Alg. Appl., 20 (2013), pp. 492–532.
[15] B. N. Parlett, The Symmetric Eigenvalue Problem, SIAM, 1998.
[16] M. Rozložník, F. Okulicka-Dłużewska and A. Smoktunowicz, Indefinite orthogonalization with rounding errors, submitted.
[17] Y. Saad, Numerical Methods for Large Eigenvalue Problems, Manchester University Press, 1992.
[18] R. Schreiber and C. F. Van Loan, A storage-efficient WY representation for products of Householder transformations, SIAM J. Sci. Stat. Comput., 10 (1989), pp. 53–57.
[19] A. Stathopoulos and K. Wu, A block orthogonalization procedure with constant synchronization requirements, SIAM J. Sci. Comput., 23 (2002), pp. 2165–2182.
[20] G. W. Stewart, Introduction to Matrix Computations, Academic Press, 1973.
[21] G. W. Stewart, Matrix Algorithms, Vol. 1: Basic Decompositions, SIAM, 1998.
[22] S. Toledo and E. Rabani, Very large electronic structure calculations using an out-of-core filter-diagonalization method, J. Comput. Phys., 180 (2002), pp. 256–269.