Complex Magic, Part 3: Function Evaluation via Integration

Accurate evaluation of the function {f(x) = (e^x - 1)/x} is a classic problem in numerical analysis. This function can be seen as an {O(h)} approximation to the first derivative of {e^x} at {x = 0}, {(e^x)'|_{x=0} = \lim_{h \rightarrow 0} (e^h - 1)/h}. More generally, {f(x)} is the divided difference of {e^x} at {x} and {0}, and this fact, combined with our discussion of catastrophic cancellation due to roundoff in divided differences in a previous post (Complex Magic, Part 1), should trigger an alarm.

The “usual” accurate method to evaluate {f(x)} is to instead evaluate {g(x) = (e^x - 1)/\ln(e^x)}, unless the computed {e^x} equals 1, in which case the result is set to 1, given that {f(0)} is defined, by continuity, to be 1. This is the method employed in MATLAB’s expm1, whose implementation is based on an algorithm due to William Kahan.

Why do {f(x)} and {g(x)}, which are equal in exact arithmetic for all {x \neq 0}, evaluate differently in floating point arithmetic? The reason may be rigorously explained with a floating point error analysis; see Nick Higham’s Accuracy and Stability of Numerical Algorithms, for example. The difference between {g(x)} and {f(x)}, in floating point arithmetic, is that the former, by design, enjoys a cancellation of errors in the division that the latter does not. Note that neither the numerator {e^x - 1} nor the denominator {\ln(e^x)} of {g(x)} can be evaluated accurately in floating point arithmetic. For x = 9.000000000000000e-015, using double precision in MATLAB R2009a on a machine running Windows XP with an Intel Core 2 Duo E8400 CPU, the former is 9.103828801926284e-015 and the latter is 9.103828801926243e-015, while in exact arithmetic they are, up to 16 significant digits, 9.000000000000040e-015 and 8.999999999999999e-015, respectively. The ratios, however, are both equal to 1.000000000000004 up to 16 significant digits!
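To see the cancellation, and its cure, concretely, here is a minimal MATLAB comparison (the exact digits printed will vary with platform and MATLAB version):

format long
x = 9e-15;
f_naive = (exp(x) - 1)/x    % direct formula: the subtraction exp(x) - 1 loses accuracy
u = exp(x);
g_kahan = (u - 1)/log(u)    % recast form: the errors of numerator and denominator cancel in the division
f_expm1 = expm1(x)/x        % built-in expm1 evaluates e^x - 1 accurately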

From the title of this post, and the fact that the method just illustrated was referred to as “usual,” it should come as no surprise that another method is to be discussed here. The underlying idea of this other method, due to [KassamTrefethen2005], is the same as the one discussed in Complex Magic, Part 2: Cauchy’s integral formula, only this time Cauchy helps with calculating the function itself, not its derivative,

\displaystyle  a_0 = f(z_0) = \frac{1}{2\pi i} \oint_\gamma {f(z) (z-z_0)^{-1}}\, dz. \ \ \ \ \ (1)

An elegant heuristic proof of (1), based on a fluid flow interpretation from Mark Levi’s Mathematical Mechanic [Levi2009], can be given by recasting the definition of holomorphicity of the function {f(z)} at {z_0}, namely that there exists {f'(z_0)} such that

\displaystyle f(z) = f(z_0) + f'(z_0)(z - z_0) + o(z - z_0),

(in other words, that there exists a linear approximation to {f(z)} at {z_0}) as, writing {f(z_0) = a + ib}, the linear combination of a source generating function, {a(z - z_0)^{-1}}, a vortex generating function, {ib(z - z_0)^{-1}}, and an ideal (incompressible and irrotational) flow generating function, {f'(z_0)},

\displaystyle f(z)(z - z_0)^{-1} = a(z - z_0)^{-1} + ib(z - z_0)^{-1} + f'(z_0) + o(1),

noting that the contributions of the first three terms on the right to the contour integral {\oint_\gamma {f(z) (z-z_0)^{-1}}\, dz} are {2\pi i \, a}, {2\pi i \, ib}, and {0}, respectively, while the contribution of the {o(1)} term vanishes as {\gamma} shrinks to {z_0}, and finally summing up these contributions to get

\displaystyle \oint_\gamma {f(z) (z-z_0)^{-1}}\, dz = 2\pi i \, a + 2\pi i \, ib + 0 = 2\pi i (a + ib) = 2\pi i f(z_0).

– The word “generating” is important in the above “proof.” The source, vortex, and ideal flow functions arise from treating the complex conjugates of the aforementioned generating functions as vector fields.
– Donald Knuth’s “{O}” Calculus letter to the Notices of the American Mathematical Society offers a lively discussion of {O} (and {o}) notation.
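Before putting (1) to work, we can sanity-check it numerically: on a circular contour centered at {z_0}, the trapezoidal rule collapses the integral in (1) to a plain average of samples of {f}. A minimal check, with an illustrative function and contour chosen here (the poles of {f} at {\pm i} lie safely outside the contour):

f  = @(z) exp(z)./(1 + z.^2);   % holomorphic in and on the contour below
z0 = 0.3; r = 0.4; M = 64;
z  = z0 + r*exp(2i*pi*(1:M)/M); % M equispaced points on the contour
% with z - z0 = r*exp(1i*theta), dz = 1i*(z - z0)*dtheta, so (1) becomes a mean:
a0 = mean(f(z));
abs(a0 - f(z0))                 % close to machine epsilon; the error decays exponentially in M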

With only a slight modification to the diffc of Complex Magic, Part 2 (the lines handling the {n = 0} case, marked by comments below), we can get it to work for the zeroth derivative (i.e., function evaluation), too.


function [S_M,delta_M] = diffc(f,varargin)

% Defaults: center x0, derivative order n, contour radius r, number of terms M.
[x0,n,r,M] = deal(0,1,.1,3);

optArgs    = {x0,n,r,M};
emptyArgs  = cellfun(@isempty,varargin);
[optArgs{~emptyArgs}] = varargin{~emptyArgs};
[x0,n,r,M] = optArgs{:};

f_x0      = f(x0);
[G_M,S_M] = deal(0);
for m = 1:M

    if n == 0              % modified: the zeroth-derivative case ...
        mn = M;            % ... uses all M quadrature points in a single pass
        mobius_m = 1;      % ... and needs no Moebius-function weights
    else
        mn = m*n;
        mobius_m = mobius(m);
    end

    t   = linspace(1/(mn),1,mn);
    g   = real( f( x0 + r * exp(1i * 2*pi * t) ) );

    G_M = max( abs( [G_M f_x0 g] ) );
    S_M = S_M + mobius_m * ( mean(g) - f_x0 );

    if n == 0, break, end  % modified: a single pass suffices for n = 0
end

delta_M = eps(G_M) * G_M/S_M;        % roundoff-based relative error estimate
S_M     = factorial(n)/(r^n) * S_M;  % scaling; a no-op when n = 0

Apropos of nothing, two side notes:

– Because linspace(d1,d2,n)‘s implementation outputs [d1+(0:n-2)*(d2-d1)/(floor(n)-1) d2], there is always at least one point, namely d2, returned by linspace, even if n = 0 (see the snippet after these notes).
– Apparently, the method MATLAB uses for printing the values of symbolic variables in the Command Window is different from what it uses for double variables, as there is no indentation when symbolic variables are output to the Command Window. Compare x = 1, y = sym(1), z = vpa(1).
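For instance, with the R2009a-era implementation quoted above (later releases may behave differently):

linspace(0, 1, 0)   % returns 1: the endpoint d2 is always appended, even for n = 0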


f = @(x) (exp(x) - 1)./x;

x_0 = 1e-18; x_vpa = vpa(x_0,32); n = 0; r = .5; M = 16;
tic, f_vpa         = f(x_vpa),              toc
tic, [f_M,delta_M] = diffc(f,x_0,n,r,M),    toc

We see from the output of the above snippet that, using variable precision arithmetic (vpa) with 32 digits, the value of {(e^x - 1)/x} for x_0 = 1e-18 is 1.0000000000000000004999535947385, and that on a contour of radius {r = 0.5}, using only {M = 16} points for the trapezoidal quadrature, diffc achieves 1.000000000000000, which agrees with the vpa evaluation in all 16 digits displayed.

The integrand in (1) was written as {f(z)(z-z_0)^{-1}}, and not the conventional {\frac{f(z)}{z-z_0}}. Why would we want to do that?

-----BEGIN DIGRESSION-----

Let us digress a bit, and not to Neverland! According to Gilbert Strang, there are only two functions whose power series one should know by heart: {e^t} and {1/(1-t)}. Let’s not touch the former, lest we digress too much and get lost! As for the latter, recasting it as {(1-t)^{-1}} allows its power series to be written in the familiar form

\displaystyle  (1-t)^{-1} = 1 + t + t^2 + t^3 + \cdots, \ \ \ \ \ (2)

for {t}‘s that can come from structures more general than a field; in particular, {t} can be a scalar or a square matrix and (2) stays intact. As an aside, it is noted that (2) may be used, formally, to discover the formula for {(1-ab )^{-1}} for the more general case of one-sided inverses in rings. More on this may be read in How to Construct {(1-ab )^{-1}}.
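As a quick illustration that (2) survives the passage from scalars to square matrices (assuming the spectral radius of {t} is less than 1 so that the series converges; the matrix below is an arbitrary choice):

T = [0 .2; .3 .1];           % spectral radius 0.3, so the series converges fast
S = eye(2); P = eye(2);
for k = 1:60
    P = P*T;                 % T^k
    S = S + P;               % partial sum I + T + T^2 + ... + T^k
end
norm(S - inv(eye(2) - T))    % close to machine epsilon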

Another place this kind of recasting can yield a unified treatment is in projections. For {a} and {b} elements of a vector space {V} over the complex field {\mathbb{C}}, the projection of {b} onto {a} is usually written as {p = \frac{a^H b}{a^H a} a}, with {a^H} denoting the Hermitian (complex conjugate) transpose of {a}. Recasting this as {p = a (a^H a)^{-1} a^H b} makes the expression identical to the one that holds in the more general case when {a} is a matrix and the projection is onto the subspace of {V} spanned by the columns of {a} (that is, the column space of {a}).
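A small MATLAB sanity check of this unification (with random data, for illustration only):

a = randn(5,1); b = randn(5,1);
p_scalar = (a'*b)/(a'*a) * a;        % the single-vector form
p_recast = a * ((a'*a) \ (a'*b));    % the recast form
norm(p_scalar - p_recast)            % close to machine epsilon

A = randn(5,2);
p_subspace = A * ((A'*A) \ (A'*b));  % the very same recast form projects onto the column space of A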

-----END DIGRESSION-----

A similar rationale is behind recasting the integrand in (1) from the usual division to multiplication by the inverse: the latter holds in the more general case when {z_0} is a square matrix, with the slight modification of {z} to {z1}, where {1} is the multiplicative identity of the structure at hand (the identity matrix {I} when {z_0} is a matrix). The contour {\gamma} has to enclose the eigenvalues of {z_0}—if {z_0} is a scalar (that is, an element of the complex field {\mathbb{C}}) its eigenvalue is {z_0} itself, as {z_0 = [z_0]_{1\times 1}}.

Why is the operator {(.)^{-1}(e^{(.)} - 1)} so important? Because it appears in many numerical methods, one family of which is the Exponential Time Differencing (ETD) schemes; see, for example, [CoxMatthews2002]. ETD schemes are particularly useful when solving stiff PDEs. Discretizing the spatial dimension of a PDE, the result may be written as a system of ODEs of the form

\displaystyle  u_t = Au + F(u,t), \ \ \ \ \ (3)

where {A} and {F} are the discretized linear and nonlinear parts of the original PDE, respectively. (For instance, supposing that {u} denotes the state variables of a mechanical system, {Au} and {F} could be the normalized linear and nonlinear forces acting on the system, respectively. If the constitutive laws are imposed, via a matrix {C} for the linear part and a nonlinear function {C(u,t)} for the nonlinear part of (3), normalization can be dropped from the previous statement.) When the eigenvalues of {A} have very different magnitudes—either because there is a very negative eigenvalue, indicating strong damping (exponential decay) as in dissipative systems, or because there is a large imaginary eigenvalue, indicating rapid oscillations as in dispersive systems—the system (3) is categorized as stiff. (It is assumed that {A} is “normal” so that analyzing its eigenvalues gives us a true prediction of what it represents. Let’s not enter the world of pseudospectra at this time.) It is noted that a differential equation has attached to it a differential operator whose definition is not complete without the necessary initial conditions. As a result, the notion of stiffness depends not only on the bare system of differential equations but also on the initial conditions. Furthermore, many problems are not initially stiff; they become stiff only when the solution approaches a certain state.

In a stiff system of ODEs two restrictions apply to the time steps a numerical solver can take: one is imposed by stability, dictated by the largest eigenvalue (in magnitude), just as in the case of a single ODE, and the other is imposed by the desired accuracy. The former pushes the time step to be much smaller than the latter requires, and this is the dilemma. A scheme not specially designed to handle stiffness is not aware of these two different time scales; it would take very small steps and, as a result, would take an awfully long time to solve a stiff problem. It is important to note that this (time) inefficiency is the only problem such a scheme would have; other things being equal, stiffness does not render a nonstiff scheme unstable or divergent. (See Cleve Moler’s “corner” article Stiff Differential Equations and Gilbert Strang’s treasure trove “Introduction to Applied Mathematics” for more details.)
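The two time scales are easy to exhibit in MATLAB with a classic stiff scalar test problem (chosen here purely for illustration; the exact step counts depend on tolerances and version): the nonstiff solver ode45 remains stable but crawls, while the stiff solver ode15s strides.

lambda = -1e4;                            % one very negative eigenvalue: stiffness
f = @(t,u) lambda*(u - cos(t)) - sin(t);  % exact solution is u = cos(t) for u(0) = 1
tic, [t_ns,u_ns] = ode45(f,[0 1],1);  toc
tic, [t_s,u_s]   = ode15s(f,[0 1],1); toc
[numel(t_ns) numel(t_s)]                  % ode45 takes vastly more steps than ode15s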

The discussion of ETD here is based on [CoxMatthews2002], which is a very nice read. The underlying idea of ETD schemes is simple: multiply (3) by the matrix exponential {e^{-At}}, integrate over one time step, and approximate the resulting integral. ({e^{-At}} is an integrating factor for the linear part of (3), so the integral to be approximated comes from the nonlinear part.) Over a single time step of length {h} from {t_n} to {t_{n+1}} this gives

\displaystyle  u(t_{n+1}) = e^{Ah}u(t_n) + e^{Ah} \int_0^h e^{-A\tau}F( u(t_n + \tau),t_n + \tau ) \, d\tau. \ \ \ \ \ (4)

A first-order approximation (taking {F} to be the constant {F_n} over the step) already brings out the notorious {(.)^{-1}(e^{(.)} - 1)} in (4):

\displaystyle  u_{n+1} = e^{Ah}u_n + A^{-1}(e^{Ah} - I) F_n, \ \ \ \ \ (5)

where {F_n = F(u_n,t_n)} and {u_n} denotes the numerical approximation to {u(t_n)}. With higher order approximations we get higher order versions of {(.)^{-1}(e^{(.)} - 1)}, and the resulting cancellations due to roundoff are more severe. For example, here is Cox and Matthews’ ETD4RK, a fourth-order ETD scheme with Runge-Kutta time stepping ({N} denotes the nonlinear term, written {F} above):

\displaystyle  \begin{array}{rcl}  a_n & = & e^{Ah/2}u_n + A^{-1} (e^{Ah/2} - I) N(u_n,t_n),\\ b_n & = & e^{Ah/2}u_n + A^{-1} (e^{Ah/2} - I) N(a_n,t_n + h/2),\\ c_n & = & e^{Ah/2}a_n + A^{-1} (e^{Ah/2} - I)(2N(b_n,t_n + h/2) - N(u_n,t_n)),\\ u_{n+1} & = & e^{Ah}u_n + h^{-2} A^{-3}\{ [ -4 - Ah + e^{Ah} ( 4 - 3Ah + (Ah)^2 ) ] N(u_n,t_n)\\ & & {} + 2[ 2 + Ah + e^{Ah} (-2 + Ah) ]( N(a_n,t_n + h/2) + N(b_n,t_n + h/2) )\\ & & {} + [ -4 - 3Ah - (Ah)^2 + e^{Ah}(4 - Ah) ] N(c_n,t_n + h)\}. \end{array}

(The terms multiplying powers of {A^{-1}} in the above equations are prone to severe cancellation due to roundoff.)
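To make the structure of (5) concrete, here is a minimal ETD1 stepping sketch; the problem data below (the matrix {A}, the nonlinearity {F}, the step size, and the initial condition) are arbitrary placeholders, and the coefficient matrix is formed naively, which is exactly where the cancellation strikes when {\|Ah\|} is small:

A  = [-2 1; 1 -2];                 % an illustrative linear part
F  = @(u,t) [0; sin(t)] - u.^3;    % an illustrative nonlinear part
u  = [1; 0]; h = 0.1; nsteps = 100;
E  = expm(A*h);                    % e^{Ah}
M1 = A \ (E - eye(size(A)));       % A^{-1}(e^{Ah} - I): naive and cancellation-prone
t  = 0;
for k = 1:nsteps
    u = E*u + M1*F(u,t);           % one ETD1 step, cf. (5)
    t = t + h;
end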

We are now ready to write a simple function fevalc that computes (1) for the general case that {z_0} is a matrix:


function f_z0 = fevalc(f,varargin)

% Defaults: (matrix) argument z0, contour radius r, number of nodes M.
[z0,r,M]  = deal(0,.1,16);
optArgs   = {z0,r,M};
emptyArgs = cellfun(@isempty,varargin);
[optArgs{~emptyArgs}] = varargin{~emptyArgs};
[z0,r,M] = optArgs{:};

I    =   eye(size(z0));
f_z0 = zeros(size(z0));

% M midpoint nodes on the upper half of the circle |z| = r; for f real on
% the real axis and z0 real, the lower half contributes the complex
% conjugate, hence the real part and the overall factor 1/M below.
zs = r * exp( 1i*pi*( ((1:M) - .5 )/M ) );
for j = 1:M
    z = zs(j);
    zI_z0_inv = inv(z*I - z0);          % the resolvent (z1 - z_0)^{-1}
    f_z0 = f_z0 + f(z) * zI_z0_inv * z; %#ok<MINV>
end
f_z0 = real(f_z0)/M;

Note that for the special case when {z_0} is the scalar {0} (its default value), this function reduces to the one-liner zs = r * exp( 1i*pi*( ((1:M) - .5 )/M ) ); f_z0 = real(mean(f(zs))).
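Coming full circle to the opening example, a quick scalar test of fevalc (with parameters mirroring the earlier diffc run):

format long
f   = @(z) (exp(z) - 1)./z;
f_c = fevalc(f, 0, .5, 16)   % contour evaluation at z0 = 0; agrees with the limit 1 to all displayed digits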

For the sake of demonstration, let us calculate one of the “troubled” terms (or “troublesome,” depending on your viewpoint!), say {h^{-2} A^{-3} [ -4 - Ah + e^{Ah} ( 4 - 3Ah + (Ah)^2 ) ]}, of the ETD4RK scheme above, when the time step is {h = 1/10} and {A} is the second-order Chebyshev differentiation matrix (rescaled, with boundary rows and columns removed) for a problem with constant Dirichlet boundary conditions:


function ETD4RK_term_eval
N  = 5;
D  = cheb(N);                    % Chebyshev differentiation matrix (see below)
h  = 1/10;
D2 = D^2; D2 = .01*D2(2:N,2:N);  % rescaled second-derivative matrix, interior points only
Ah = h*D2;
% expm and eye (rather than exp and the scalar 4) keep f the same matrix
% function whether its argument is a scalar, as inside fevalc, or a matrix:
f  = @(z) ( h*z^(-3) * (-4*eye(size(z)) - z + expm(z)*(4*eye(size(z)) - 3*z + z^2) ) );

f_A   = f(Ah)                    % direct substitution
f_A_c = fevalc(f,Ah,20,10)       % contour integration with r = 20 and M = 10

function [D,x] = cheb(N)

if N == 0, D = 0; x = 1; return, end
x = cos(pi*(0:N)/N)';
c = [2; ones(N-1,1); 2].*(-1).^(0:N)';
X = repmat(x,1,N+1);
dX = X - X';
D  = (c*(1./c)')./(dX + (eye(N+1)));
D  = D - diag(sum(D,2));

(The border rows and columns are being chopped off from D2 to impose the constant boundary conditions. For more details about the implementation of boundary conditions and how the function cheb generates the Chebyshev differentiation matrix see [Trefethen2000].)

This is {f(A)} calculated by direct substitution, which is completely off:

\displaystyle \left( \begin{array}{rrrr} -3.0302\times 10^{7} & -3.0086\times 10^{7} & -3.0085\times 10^{7} & -3.0302\times 10^{7}\\ -9.0451\times 10^{7} & -8.9806\times 10^{7} & -8.9806\times 10^{7} & -9.0451\times 10^{7}\\ -9.0451\times 10^{7} & -8.9806\times 10^{7} & -8.9806\times 10^{7} & -9.0451\times 10^{7}\\ -3.0302\times 10^{7} & -3.0085\times 10^{7} & -3.0086\times 10^{7} & -3.0302\times 10^{7} \end{array} \right)

This is {f(A)} calculated using fevalc:

\displaystyle \left( \begin{array}{rrrr} -1.7858\times 10^{5} & -1.0354\times 10^{2} & 3.0126\times 10^{1} & -1.8017\times 10^{1}\\ -5.9729\times 10^{1} & -1.7876\times 10^{5} & -4.7287\times 10^{1} & 1.5536\times 10^{1}\\ 1.5536\times 10^{1} & -4.7287\times 10^{1} & -1.7876\times 10^{5} & -5.9729\times 10^{1}\\ -1.8017\times 10^{1} & 3.0126\times 10^{1} & -1.0354\times 10^{2} & -1.7858\times 10^{5} \end{array} \right)

Of course, as mentioned before, the calculation in ETD4RK_term_eval is for demonstration only. The “troubled” terms of ETD4RK need not, and in fact should not, be calculated one by one using fevalc, as the inverse matrix {(z1 - z_0)^{-1}} is shared among them and need be calculated only once per quadrature node. Furthermore, because these terms contain powers of {z^{-1}}, the multiplication by {z} in fevalc can be accounted for by reducing the power of {z^{-1}} by {1}, instead of actually doing the multiplication. Refer to [KassamTrefethen2005] for more details. As an aside, it is noted that this is one of the very rare cases in which the inverse matrix itself is needed; no system of linear equations is being solved, and as a result what inv is doing cannot be achieved using the backslash operator. Another case in which the inverse matrix itself is needed and cannot be replaced by the backslash operator is calculating the standard errors of the estimates, in an identification problem, based on the inverse of the Fisher information matrix.
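Here is a sketch of the first economization, sharing one resolvent per quadrature node among several coefficient functions (the matrix and the two functions below are illustrative only, not the full [KassamTrefethen2005] setup):

A  = [-2 1; 1 -2];                                           % an illustrative matrix argument
fs = { @(z) (exp(z) - 1)./z, @(z) (exp(z) - 1 - z)./z.^2 };  % coefficient functions sharing the contour
r  = 20; M = 10; I = eye(size(A));
zs = r * exp( 1i*pi*( ((1:M) - .5 )/M ) );
F  = { zeros(size(A)), zeros(size(A)) };
for j = 1:M
    R = inv( zs(j)*I - A );            %#ok<MINV> one resolvent, reused by every term
    for k = 1:numel(fs)
        F{k} = F{k} + fs{k}(zs(j)) * R * zs(j);
    end
end
F  = cellfun(@(X) real(X)/M, F, 'UniformOutput',false);      % F{k} approximates f_k(A)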
