Low-rank Matrix Completion Demo

2017-11-07 Greg Ongie, University of Michigan

Updated 2018-08-18: added comparison with POGM

2018-08-18 Julia 0.7.0 by Jeff Fessler

TOP SECRET

On 2017-11-06 10:34:12 GMT, Agent 556 obtained a photo of the illegal arms dealer, code name **Professor X-Ray**. However, Agent 556 spilled soup on their lapel camera, shorting out several CCD sensors. The image matrix has many missing entries; we were only able to recover 25% of the data! Agent 551: Your mission, should you choose to accept it, is to recover the missing entries and uncover the true identity of **Professor X-Ray**.

In [1]:
using LinearAlgebra
using DelimitedFiles: readdlm
using LaTeXStrings
using Plots; pyplot()
function jim(Y; title="") # jiffy image display 
    heatmap(Y', color=:grays, aspectratio=1,
        xtick = [1, size(Y,1)],
        ytick = [1, size(Y,2)],
        title = title,
        yflip = true,
    )
end
Out[1]:
jim (generic function with 1 method)
In [2]:
# Read in data with missing pixels set to zero
Y = readdlm("professorxray.txt")'
@show size(Y)
jim(Y, title="Y: Corrupted image matrix of Professor X-Ray\n (missing pixels set to 0)")
size(Y) = (205, 300)
Out[2]:
In [3]:
# Create binary mask Omega (true=observed, false=unobserved); assumes no observed pixel is exactly 0
Omega = Y .> 0

# Show mask and compute the percentage of observed entries
percent_nonzero = 100 * sum(Omega) / prod(size(Omega))
@show percent_nonzero
jim(Omega, title = L"$\Omega$: Locations of observed entries")
percent_nonzero = 25.0
Out[3]:

Can we use a simple low-rank approximation of the corrupted matrix?

In [4]:
r = 20
U,s,V = svd(Y)
Xr = U[:,1:r] * Diagonal(s[1:r]) * V[:,1:r]'
jim(Xr, title="low-rank approximation")
Out[4]:

No. The truncated SVD treats the zero-filled pixels as data, so the corruption remains. Instead, we will try to uncover the identity of **Professor X-Ray** using low-rank matrix completion.

The optimization problem we will solve is: $$\large\min_{\mathbf X}~\frac{1}{2}\big\|P_{\Omega}(\mathbf X)-P_{\Omega}(\mathbf Y)\big\|_2^2 + \beta\,\big\|\mathbf X\big\|_*\quad\quad(\text{NN-min})$$ where $\mathbf Y$ is the zero-filled input data matrix, $\|\cdot\|_*$ denotes the nuclear norm (the sum of the singular values), and $P_{\Omega}$ is the operator that extracts the vector of entries belonging to the index set $\Omega$.
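In Julia, $P_{\Omega}$ is simply logical indexing with a boolean mask: `X[Omega]` returns the vector of entries where the mask is true, which is exactly how the cost function below is written. A tiny illustration, using a made-up 2×3 matrix `A` and mask `M` (for illustration only, not the X-Ray data):

A = [1.0 2.0 3.0; 4.0 5.0 6.0]          # toy matrix (illustration only)
M = [true false true; false true false] # toy mask playing the role of Omega
A[M]                                    # P_Omega(A) = [1.0, 5.0, 3.0], in column-major order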

In [5]:
# Define cost function
nucnorm(X) = sum(svdvals(X))
costfun(Y, Omega, X, beta) = 0.5*norm(X[Omega]-Y[Omega])^2 + beta*nucnorm(X)
beta = 0.01
costfun(X, beta) = costfun(Y, Omega, X, beta)
#costfun(X) = costfun(Y, Omega, X, beta)
Out[5]:
costfun (generic function with 2 methods)
In [6]:
# Define singular value soft thresholding (SVST)
function SVST(X,beta)
    U,s,V = svd(X)
    sthresh = max.(s .- beta, 0)
    return U * Diagonal(sthresh) * V'
end;
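One quick way to sanity-check SVST is to verify, on a small random test matrix (the names `Xtest` and `beta_test` below are just for illustration), that its output's singular values are the soft-thresholded singular values of its input:

Xtest = randn(5, 4)       # small random test matrix
beta_test = 0.3           # illustrative threshold
s_in  = svdvals(Xtest)
s_out = svdvals(SVST(Xtest, beta_test))
@assert s_out ≈ max.(s_in .- beta_test, 0) # each singular value shrunk by beta_test, clipped at 0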

Iterative Soft-Thresholding Algorithm (ISTA)

Extension of gradient descent to convex cost functions that look like $\min_x f(x) + g(x)$ where $f(x)$ is smooth and $g(x)$ is non-smooth. Also known as proximal gradient descent.
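Concretely, for (NN-min) we take $f(\mathbf X)=\frac{1}{2}\|P_{\Omega}(\mathbf X)-P_{\Omega}(\mathbf Y)\|_2^2$ and $g(\mathbf X)=\beta\|\mathbf X\|_*$. The gradient of $f$ is zero on the unobserved entries and equals $\mathbf X - \mathbf Y$ on the observed ones, with Lipschitz constant 1, so a gradient step with unit step size simply puts the known entries back into $\mathbf X$. The proximal operator of $g$ is SVST with threshold $\beta$. Together these give the two steps of each iteration below.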

ISTA algorithm for solving (NN-min):
initialize $\mathbf X_0 = \mathbf Y$ (zero-fill missing entries)
for k=0,1,2,...
$[\hat{\mathbf X}_k]_{i,j} = \begin{cases}[\mathbf X_k]_{i,j}& \text{if}~(i,j)\not\in\Omega\\ [\mathbf Y]_{i,j}& \text{if}~(i,j)\in\Omega\end{cases}$      (Put back in known entries)

$\mathbf X_{k+1} = \text{SVST}(\hat{\mathbf X}_k,\beta)$      (Singular value soft-thresholding)
end

In [7]:
# ISTA (Iterative Soft-Thresholding Algorithm)
function ista(Y, Omega; beta=0.01, niter=250)
    X = copy(Y)
    cost_ista = zeros(niter+1)
    cost_ista[1] = costfun(X, beta)
    for k=1:niter
        X[Omega] = Y[Omega] # put back the known entries (gradient step)
        X = SVST(X, beta)   # singular value soft-thresholding (proximal step)
        cost_ista[k+1] = costfun(X, beta)
    end
    return (X, cost_ista)
end

niter = 250
(Xista, cost_ista) = ista(Y, Omega, niter=niter)

jim(Xista, title="ISTA at $niter iterations")
Out[7]:

What went wrong? Let's investigate. First, let's see if the above solution is actually low-rank.

In [8]:
sX = svdvals(Xista)
sY = svdvals(Y)
plot(sY, title="singular values", label="svd(Y) : initialization",
    line=:red, marker=(:o, :red))
plot!(sX, label="svd(X) : ISTA output",
    line=:blue, marker=(:o, :blue))
Out[8]:

Now let's check the cost function:

In [9]:
plot(cost_ista,
    title="cost vs. iteration",
    xlabel="iteration",
    ylabel="cost function value",
    marker=:o,
    label="ISTA")
Out[9]:

Fast Iterative Soft-Thresholding Algorithm (FISTA)

Modification of ISTA that adds Nesterov acceleration for faster convergence: the cost-function gap decreases as $O(1/k^2)$ instead of ISTA's $O(1/k)$.

Reference: A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM J. Imaging Sciences, 2(1):183-202, 2009.

FISTA algorithm for solving (NN-min)
initialize matrices $\mathbf Z_0 = \mathbf X_0 = \mathbf Y$
for k=0,1,2,...
$[\hat{\mathbf Z}_k]_{i,j} = \begin{cases}[\mathbf Z_k]_{i,j}& \text{if}~(i,j)\not\in\Omega\\ [\mathbf Y]_{i,j}& \text{if}~(i,j)\in\Omega\end{cases}$      (Put back in known entries)

$\mathbf X_{k+1} = \text{SVST}(\hat{\mathbf Z}_k,\beta)$

$t_{k+1} = \frac{1+\sqrt{1+4t_k^2}}{2}$     (Nesterov momentum parameter)

$\mathbf Z_{k+1} = \mathbf X_{k+1} + \frac{t_k-1}{t_{k+1}}(\mathbf X_{k+1}-\mathbf X_{k})$      (Momentum update)
end

In [10]:
# FISTA algorithm
function fista(Y, Omega; beta=0.01, niter=250)
    X = copy(Y)
    Z = copy(X)
    Xold = copy(X)
    told = 1
    cost_fista = zeros(niter+1)
    cost_fista[1] = costfun(X, beta)
    for k=1:niter
        Z[Omega] = Y[Omega]               # put back the known entries
        X = SVST(Z, beta)                 # singular value soft-thresholding
        t = (1 + sqrt(1 + 4*told^2))/2    # Nesterov momentum parameter
        Z = X + ((told-1)/t)*(X - Xold)   # momentum update
        Xold = X
        told = t
        cost_fista[k+1] = costfun(X, beta) # comment out to speed-up
    end
    return (X, cost_fista)
end

(Xfista, cost_fista) = fista(Y, Omega)
jim(Xfista, title="FISTA")
Out[10]:
In [11]:
sX = svdvals(Xfista)
sY = svdvals(Y)

effective_rank = sum(sX .> (0.01*sX[1])) # number of singular values above 1% of the largest
@show effective_rank

plot(sY, title="singular values", label="svd(Y) : initialization",
    line=:red, marker=(:o, :red))
plot!(sX, label="svd(X) : FISTA output",
    line=:green, marker=(:o, :green))
effective_rank = 19
Out[11]:
In [12]:
plot(cost_ista, label="ISTA",
    line=:red, marker=(:o, :red))
plot!(cost_fista, label="FISTA",
    line=:blue, marker=(:o, :blue),
    title="cost vs. iteration",
    xlabel="iteration",
    ylabel="cost function value")
Out[12]:

Proximal Optimized Gradient Method (POGM)

Optimized acceleration of ISTA using a proximal version of the optimized gradient method (OGM). Similar to FISTA, but with an additional term in the momentum update (the last term in the update below).

Reference: A. B. Taylor, J. M. Hendrickx, and F. Glineur, "Exact worst-case performance of first-order methods for composite convex optimization," SIAM J. Optimization, 27(3), 2017.

POGM algorithm for solving (NN-min)
initialize matrices $\mathbf Z_0 = \mathbf X_0 = \mathbf Y$
for k=0,1,2,...
$[\hat{\mathbf Z}_k]_{i,j} = \begin{cases}[\mathbf Z_k]_{i,j}& \text{if}~(i,j)\not\in\Omega\\ [\mathbf Y]_{i,j}& \text{if}~(i,j)\in\Omega\end{cases}$      (Put back in known entries)

$\mathbf X_{k+1} = \text{SVST}(\hat{\mathbf Z}_k,\beta)$

$t_{k+1} = \frac{1+\sqrt{1+4t_k^2}}{2}$     (Nesterov momentum parameter)

$\mathbf Z_{k+1} = \mathbf X_{k+1} + \frac{t_k-1}{t_{k+1}}(\mathbf X_{k+1}-\mathbf X_{k}) + \frac{t_k}{t_{k+1}}(\mathbf X_{k+1}-\mathbf Z_k)$      (Momentum update)
end

In [13]:
# POGM algorithm
function pogm(Y, Omega; beta=0.01, niter=250)
    X = copy(Y)
    Z = copy(Y)
    Xold = copy(X)
    told = 1
    cost_pogm = zeros(niter+1)
    cost_pogm[1] = costfun(X, beta)
    for k=1:niter
        Z[Omega] = Y[Omega]               # put back the known entries
        X = SVST(Z, beta)                 # singular value soft-thresholding
        t = (1 + sqrt(1 + 4*told^2))/2    # Nesterov momentum parameter
        Z = X + ((told-1)/t)*(X - Xold) + (told/t)*(X - Z) # momentum update with extra POGM term
        Xold = X
        told = t
        cost_pogm[k+1] = costfun(X, beta) # comment out to speed-up
    end
    return (X, cost_pogm)
end

(Xpogm, cost_pogm) = pogm(Y, Omega)
jim(Xpogm, title="POGM")
Out[13]:
In [14]:
plot(cost_ista, label="ISTA",
    line=:red, marker=(:o, :red))
plot!(cost_fista, label="FISTA",
    line=:blue, marker=(:o, :blue))
plot!(cost_pogm, label="POGM",
    line=:green, marker=(:o, :green),
    title="cost vs. iteration",
    xlabel="iteration",
    ylabel="cost function value")
Out[14]:

Alternating Direction Method of Multipliers (ADMM)

Another approach that uses SVST as a sub-routine, closely related to proximal gradient descent.
It often converges faster than FISTA, but requires a tuning parameter $\mu$ (here we use $\mu = \beta$).

References: S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, 3(1):1-122, 2011.
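The updates in the code below correspond to the variable splitting $$\min_{\mathbf X,\mathbf Z}~\frac{1}{2}\big\|P_{\Omega}(\mathbf X)-P_{\Omega}(\mathbf Y)\big\|_2^2 + \beta\,\big\|\mathbf Z\big\|_* \quad\text{subject to}\quad \mathbf X=\mathbf Z,$$ with scaled dual variable $\mathbf L$ and augmented-Lagrangian parameter $\mu$. Each update has a closed form: the $\mathbf Z$-update is the nuclear-norm proximal step $\text{SVST}(\mathbf X+\mathbf L,\,\beta/\mu)$; the $\mathbf X$-update divides $\mathbf Y + \mu(\mathbf Z-\mathbf L)$ element-wise by $\mu+1$ on the observed entries and by $\mu$ elsewhere (valid because $\mathbf Y$ is zero-filled off $\Omega$); and the multiplier update is $\mathbf L \leftarrow \mathbf L + \mathbf X - \mathbf Z$.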

In [15]:
# Alternating Direction Method of Multipliers (ADMM) algorithm
function admm(Y, Omega; beta=0.01, niter=250, mu=beta)
    X = copy(Y)
    Z = zero(X)
    L = zero(X)
    cost_admm = zeros(niter+1)
    cost_admm[1] = costfun(X, beta)
    for k=1:niter
        Z = SVST(X + L, beta/mu)              # Z-update: nuclear-norm proximal step
        X = (Y + mu*(Z - L)) ./ (mu .+ Omega) # X-update: element-wise least-squares solve
        L = L + X - Z                         # multiplier (dual) update
        cost_admm[k+1] = costfun(X, beta) # comment out to speed-up
    end
    return (X, cost_admm)
end
(Xadmm, cost_admm) = admm(Y, Omega)
jim(Xadmm, title="ADMM")
Out[15]:
In [16]:
plot(cost_ista, label="ISTA",
    line=:red, marker=(:o, :red))
plot!(cost_fista, label="FISTA",
    line=:blue, marker=(:o, :blue))
plot!(cost_pogm, label="POGM",
    line=:green, marker=(:o, :green),
    title="cost vs. iteration",
    xlabel="iteration",
    ylabel="cost function value")
plot!(cost_admm, label="ADMM",
    line=:magenta, marker=(:o, :magenta))
Out[16]: