function machine = CRSVM(varargin) %Latest modification date: 7/19/2007 % %Please report bugs to Jin H. Jung %jjung [at] cs (dot) umd {dot} edu %Copyright Jin. H. Jung, 2007 % %This work was supported by the US Department of Energy under %grant DEFG0204ER25655 and described in a companion paper %J. H. Jung, D. P. O'Leary, and A. L. Tits, %Adaptive constraint reduction for linear support vector machine. % %CRSVM Trains the linear support vector machine (with L1 hinge loss) % by solving a convex quadratic program with an constraint % reduced interior-point method. The algorithm uses a variant of % Mehrotra's primal-dual method. % Instead of forming the normal matrix with all data points, the % algorithm tries to form the matrix with prospective % (on-boundary) support vectors. % The number of data points involved in the matrix forming is % adaptively shrunken basedon how a current intermmediate % classifier is close to the one with maximal margin. Usually the % number can be decreased down to the number of (on-boundary) % support vectors. This results in great time saving. % %machine = CRSVM(A, d, OPTS) %INPUT %A : An m by n matrix, each row of which corresponds to a data % point. A column corresponds to an attribute. %d : A column vector containing classification labels (+1/-1) %opts : A parameter containing various options % opts.tau: Misclassification penalty [vector | scalar | {1.0}] % opts.tol: Convergence tolerance applied to residuals [scalar | {10e-8}] % opts.tol_mu: Convergence tolerance applied to the complementarity % measurement [scalar | {10e-8}] % opts.iteration_limit: Maximal number of iterations % opts.q_lower: Minimum number of constraints to keep % [integer | {'automatic'}] % opts.q_lower_rel: q_lower = opts.q_lower_rel*size(A,1) % [scalar | {opts.q_lower_rel/size(A,1)}] % opts.q_upper: maximum number of constraints to keep [integer | {size(A,1)}] % opts.q_upper_rel: q_upper = opts.q_upper_rel*size(A,1) % [scalar | {opts.q_upper_rel/size(A,1)}] % opts.reduction: reduction technique % [BALANCED | NONBALANCED | {ADAPTIVE_BALANCED} | ADAPTIVE_NONBALANCED | NONE] % BALANCED: include q/2 positive data points and q/2 negative data points % NONBALANCED: include q data points without considering class % membership % ADAPTIVE: shrink Q adaptively as the iteration approach an optimal % solution % opts.beta: A parameter controlling the rate of decrease of the % centrality % measurement in determining the size of the index set. Any positive number % greater than 1. (mu^(1/opts.beta)) will determine the ratio of % index set size to the number of data points. % [scalar | {4}] % opts.index_basis: The basis for the selection of indexes % [omega | {distance} | onesided_distance | v(tau-v)] % opts.save_history: If true, the function returns history of weight, w, gamma, % Q, and etc. % opts.initial_point: Initial point. If not given, a point is % automatically generated. This option should contain w0, gamma0, y0, % s0, v0, and u0. [structure | {Automatically assigned}] % opts.factorization_method: This option determines which factorization method % will be used. [{Chol} | {Chol_SMW} | QR_SMW]. If the data is dense, % then Chol is the default, otherwise, Chol_SMW is the default. % If the data is dense, Chol_SMW is overriden to Chol. Both QR_SMW % and Chol_SMW are only available when the data matrix is sparse. % opts.use_ordering: Use pivoting in Cholesky factorization. This option % is available only when Chol_SMW is used. [{false} | true] % opts.kernel: % opts.kernel : A structure defining kernel. If this option is active, an % incomplete Cholesky factorization is performed to the Gram matrix. % Thus, this is an approximation method. % .kernel.kernel_function : % [{'None'}|'Gaussian'|'Linear'|'Polynomial'|'Sigmoid'|function handle] % 'None': Don't use a kernel. Thus the incomplete Cholesky % factorization is never performed. % 'Gaussian': exp(-norm(x-y)^2/(2*sigma^2)) % Optional parameters % .kernel.sigma: Defines width of the kernel. Default value is % set to make 2*sigma1^2 become n, where n = size(A,2), i.e., % the number of attributes. % 'Linear': dot(x,y) % This takes no option. % 'Polynomial': (coefficient*dot(x,y)+bias)^degree % Optional parameters % .kernel.coefficient: Default value is 1/n. % .kernel.bias: Default value is 1/n. % .kernel.degree: Default value is 3. % 'Sigmoid': tanh(kappa*dot(x,y)-delta) % Optional parameters % .kernel.kappa: Default value is 1/n. % .kernel.delta: Default value is 0. % function handle: A handle to function taking two data matrices % X1 and X2, each row of which is a data point. % This can be a function to an anonymous function such as % .kernel.kernel_function = @(X1,X2) (X1*X2'+1).^2; % The provided function should support pairwise evaluation % between data points in X1 and X2. % .kernel.max_rank : maximal rank of the incomplete Cholesky factor % [{max(256,ceil(m*0.01))} | integer] % .kernel.trade_off : trade-off between approximation of K and prediction % of Y (suggested: .99) % [{.99} | scalar in (0,1)] % .kernel.centering : 1 if centering, 0 otherwise (suggested: 1) % [ {1} | 0 ] % .kernel.lookahead_columns : number of columns of cholesky performed in advance (suggested: 40) % [ {40} | integer ] % .kernel.tol : minimum gain at iteration (suggested: 1e-4) % [ {1e-4} | scalar > 0 ] % %OUTPUT %machine: This contains parameters required to build the binary classifier % If kernel is not used: % sign( sum_{i of SVs} ( d(i)*weight(i)*dot(A(i,:), x) ) - bias) % or you can use w which is stored in machine.internal_variables.w: % sign( - bias) % If kernel is used, w is useless in this case. Use the following: % sign( sum_{i of SVs} ( d(i)*weight(i)*kernel(A(i,:), x) ) - bias) % % machine.bias: The bias (gamma) of the classifier. % machine.weight: Weight of each data point in making up the classifier. % The classifier can be expressed as sign(Sum(v_i*) - gamma), % where a_i is the i'th data point and v_i is the i'th weight. Data % points with nonzero weight are the support vectors. % machine.support_vector_index: Contains the indices of support vectors % machine.on_boundary_support_vector_index: Contains the indices of on % boundary support vectors % machine.off_boundary_support_vector_index: Contains the indices of off % boundary support vectors % machine.number_of_data_point: Equivalent to size(A,1) % machine.number_of_input_attribute: Equivalent to size(A,2) % machine.dimension_of_reduced_feature_space: The number of columns of % the incomplete Cholesky factor. % machine.iterations: Total number of iterations performed. % machine.time: Total time taken to train the machine. % machine.number_of_data_points: The number of data points. % machine.number_of_features: The number of features. Dimension of the % data points. % machine.exit_flag: Exit flag for exit reason. % machine.exit_reason: Exit reason. % % machine.internal_variables: This contains internal result of optimization % problem. Stored variables are w, y, s and u. % % machine.optimization_result: This contains information of optimization % result. Many are self explanatory. % .primal_objective_function_value % .dual_objective_function_value % .mu: The complementarity measure % .mu_trace: The trace of complementarity measure % .relative_duality_gap % .relative_residual % .relative_residual_trace: The trace of relative residual % % machine.reduction_result: This contains information associated with % the constraint reduction % .number_of_data_point_used: Trace of how many data points are % used in each iteration. % .average_number_of_data_point_used: Average number of data % points used per iteration. % .total_number_of_data_point_used: Sum of the number of data % points used for all iterations. % machine.timing_result: Timing result of critical operations. % .index_set: Time consumed to make the index set % .normal_matrix: Time consumed to form the matrix for the normal % equations. % .factorization: Time consumed to factorize the normal matrix in % case Cholesky factorization is used or the modified data matrix % in case QR factorization is used. % .substitutions: Time consumed to solve the normal equations with % the Cholesky factor. % machine.history: This is returned if save_history option is turned on. % .Q: history of the index set used for the constraint reduction % .w: history of the coefficient of the classifier % .bias: history of the bias of the classifier % .weight: History of weight (v) of each pattern % .normal_matrix_density: history of the density of the normal matrix % .cholesky_factor_density: history of the density of the Cholesky % factor. % Start to measure cputime starting = cputime; if nargin < 2 error('Not enough input arguments.'); end A = varargin{1}; d = varargin{2}; if nargin == 3 opts = varargin{3}; else opts = []; end % m : number of data points % n : number of features [m,n] = size(A); machine.number_of_data_points = m; machine.number_of_input_attributes = n; use_Gram_approximation = false; if isfield(opts, 'kernel') kernel = opts.kernel; if ~isfield(kernel, 'kernel_function') error('You must provide opt.kernel.kernel_function'); end if ~strcmpi(kernel.kernel_function, 'none') if isfield(kernel, 'approximation_method') approximation_method = kernel.approximation_method; else approximation_method = 'chol'; end switch(lower(approximation_method)) case 'chol' if ~isfield(kernel, 'chol_opt') kernel.chol_opt = {}; end kchol_opt = kernel.chol_opt; if ~isfield(kchol_opt, 'rank') kchol_opt.rank = min(300,max(ceil(m/2),1)); end time_Gram_approximation = cputime; A = LowRankCholesky(full(A), kernel, kchol_opt.rank); time_Gram_approximation = cputime - time_Gram_approximation; use_Gram_approximation = true; case 'eigs' if isa(kernel.kernel_function, 'function_handle') error('User defined function is not supported for truncated eigen value decomposition'); end if ~strcmpi(kernel.kernel_function, 'gaussian') error('Kernels other than Gaussian is not supported for truncated eigen value decomposition'); end if ~isfield(kernel, 'eigs_opt') kernel.eigs_opt = {}; end eigs_opt = kernel.eigs_opt; if ~isfield(eigs_opt, 'rank') eigs_opt.rank = min(64,max(ceil(m/2),1)); end eigs_opt.method = 'ifgt'; time_Gram_approximation = cputime; if issparse(A) A = full(A); end [A, DEig] = KernelEig(A, kernel, eigs_opt); A = A*sqrt(DEig); time_Gram_approximation = cputime - time_Gram_approximation; use_Gram_approximation = true; end [m, n] = size(A); machine.dimension_of_reduced_feature_space = n; end end % Determine whether the data is sparse sw_sparse = false; if issparse(A) if nnz(A)/(m*n) < 0.33 sw_sparse = true; else A = full(A); end end % Extract penalty parameter tau if isfield(opts, 'tau') tau = opts.tau; else tau = ones(m,1); end % Extract tolerance if isfield(opts, 'tol') tol_conv = opts.tol; else tol_conv = 1e-8; end % Extract tolerance for complementarity if isfield(opts, 'tol_mu') tol_mu = opts.tol_mu; else tol_mu = 1e-8; end % Upper bound on number of iterations if isfield(opts, 'iteration_limit') iter_ubound = opts.iteration_limit; else iter_ubound = 200; end % Beta if isfield(opts, 'beta') beta = opts.beta; else beta = 4; end % Save history if isfield(opts, 'save_history') save_history = opts.save_history; else save_history = false; end % Choice of q_lower q_lower_automatic = false; if (isfield(opts,'q_lower') && strcmpi(opts.q_lower, 'automatic')) q_lower_automatic = true; q_lower = n; elseif isfield(opts,'q_lower') q_lower = opts.q_lower; elseif isfield(opts,'m_lower') q_lower = opts.m_lower; elseif isfield(opts,'q_lower_rel') q_lower = round(m*opts.q_lower_rel); elseif isfield(opts,'m_lower_rel') q_lower = round(m*opts.m_lower_rel); else q_lower_automatic = true; q_lower = n; % Default choice: automatic end % Choice of q_upper if isfield(opts, 'q_upper') if opts.q_upper > 1 q_upper = opts.q_upper; else q_upper = ceil(opts.q_upper * m); end elseif isfield(opts,'q_upper_rel') q_upper = round(m*opts.q_upper_rel); else q_upper = m; end % Reduction method : default is adaptive_balanced useReduction = true; if isfield(opts,'reduction') switch lower(opts.reduction) case {'combined', 'nondistinctive', 'nonbalanced'} sw_Q = 1; case {'separate', 'distinctive', 'balanced'} sw_Q = 2; case {'adaptive_combined', 'adaptive_nondistinctive', 'adaptive_nonbalanced'} sw_Q = 3; case {'adaptive_separate', 'adaptive_distinctive', 'adaptive_balanced'} sw_Q = 4; case {'none', 'no'} useReduction = false; sw_Q = 0; otherwise error('Unknown reduction technique. Check the option you gave.'); end else sw_Q = 4; end % Index basis if sw_Q ~= 0 && isfield(opts, 'index_basis') % v,singed_distance,vu,s+y,su+yv are experimental switch lower(opts.index_basis) case {'onesided_distance'} sw_index_basis = 1; case {'v'} sw_index_basis = 2; case {'signed_distance'} sw_index_basis = 3; case {'omega'} sw_index_basis = 4; case {'vu'} sw_index_basis = 5; case {'s+y'} sw_index_basis = 6; case {'su+yv'} sw_index_basis = 7; case {'distance'} sw_index_basis = 8; case {'v(tau-v)'} sw_index_basis = 9; otherwise error('Check the option. opts.index_basis must be one of distance, omega, v(tau-v), onesided_distance.'); end else sw_index_basis = 8; end % Unused: Depending on the initial point setting, the first selection % sw_index_basis is determined. if isfield(opts, 'initial_point') sw_first_random_selection = false; else sw_first_random_selection = true; end % Determine factorization and matrix formation methods sw_factoring = 1; sw_use_SMW = 0; if isfield(opts, 'factorization_method') switch(lower(opts.factorization_method)) case 'chol' case 'chol_smw' if sw_sparse sw_use_SMW = 1; end case 'qr_smw' if sw_sparse == 1 % if sparse option is turned off, the QR option is ignored. sw_factoring = 2; sw_use_SMW = 1; end otherwise error('No such option exists for ''factorization_method''. Must be one of chol/chol_SMW/qr_SMW'); end else if sw_sparse sw_use_SMW = 1; end end % Determine whether to use matrix ordering in factorization sw_use_ordering = false; if isfield(opts, 'use_ordering') if opts.use_ordering && sw_use_SMW == 1 sw_use_ordering = true; end end % Time consumed for definining the index set time_Q = 0; % Time consumed for assembling the matrix for the normal equations time_Mat = 0; % Time consumed for factoring the matrix for the normal equations time_factor = 0; % Time consumed for solving the normal equations with the Cholesky % factor time_solve = 0; % Determine an initial point (w, gamma, y, s, v, u) if isfield(opts, 'initial_point') w = opts.initial_point.w0; gamma = opts.initial_point.gamma0; y = opts.initial_point.y0; s = opts.initial_point.s0; v = opts.initial_point.v0; u = opts.initial_point.u0; else [w, gamma, y, s, v, u] = GenerateStartingPointJosh(m, n); end % Variables used for balanced selection switch sw_Q case { 2, 4 } IdxP = find(d > 0); IdxM = find(d < 0); m_plus = length(IdxP); m_minus = length(IdxM); [q_upper_plus, q_upper_minus] = SplitQ_mu(q_upper, m_plus, m_minus, q_upper, 0, 0); end % Save the first w and gamma value if the save_history option is turned % on if save_history w_hist{1} = w; gamma_hist(1) = gamma; v_hist{1} = v; end % Define the vectors of all ones of size m e_m = ones(m,1); itr_cnt = 0; % Let BT = A'*D BT = A'*spdiags(d, 0, m, m); clear A; % Constant setting cut_coeff = 1e2; % Used for automatic dectection of q_L % Define the maximum magnitude of the problem data data_magnitude = max([norm(BT, inf), norm(tau,inf), 1]); % Begin the iterations of the interior-point method while(true) % Set complementary measure mu = (v'*s + y'*u)/(2*m); mu_trace(itr_cnt+1) = mu; %------------------------------------------------------------------ % Residuals % r_w : w - A'*D*v % r_v : d'*v % r_u : tau - v - u % r_s : D*A*w - gamma*d + y - e - s % r_sv : S*v for predictor % -sigma*mu*e + DSaff*Dvaff for corrector % r_yu : Y*u for predictor % -sigma*mu*e + DYaff*Duaff for corrector %------------------------------------------------------------------ % Set up residuals BTv = BT*v; r_w = w - BTv; r_v = d'*v; r_u = tau - v - u; signed_dist = (BT'*w - gamma*d - e_m); onesided_dist = signed_dist + y; r_s = onesided_dist - s; % Determine termination residual = max([norm(r_w, inf), r_v, norm([r_u, r_s], inf)])/data_magnitude; residual_trace(itr_cnt+1) = residual; % Check whether termination criteria are satisfied if( residual < tol_conv && mu < tol_mu) ending = cputime; flag = 1; break; end % Check whether iteration limit count is reached if itr_cnt >= iter_ubound flag = -1; ending = cputime; warning('Iteration count reached the limit.'); break; end % Increase iteration count itr_cnt = itr_cnt + 1; % Set up frequently used inverse variables v_inv = 1./v; u_inv = 1./u; y_inv = 1./y; % Set up vectors for the diagonal matrix Omega and its inverse omega = s.*v_inv + y.*u_inv; omega_inv = 1./omega; % Start to measure the timing for making the index set starting_Q = cputime; % Start to build the index set for the constraint reduction if sw_Q ~= 0 if itr_cnt == 1 mu0 = mu; q_mu = q_upper; % At the first selection, if the q_upper is m then Q is just % equal to {1:m} if q_mu < m % If Q is randomly selected if sw_first_random_selection == true % Select Q differently depending on whether the % selection strategy is the blanced or nonblanced selection switch sw_Q case {2,4} % Balanced 2:nonadaptive 4:adaptive % Select data points from each side [q_mu_plus, q_mu_minus] = SplitQ_mu(q_mu, m_plus, m_minus, q_upper, 0, 0); Qp = randperm(m_plus); Qm = randperm(m_minus); Qp = IdxP(Qp(1:q_mu_plus)); Qm = IdxM(Qm(1:q_mu_minus)); % Q = union( Qp, Qm ); % This is slower Q_bitmask = zeros(m,1); Q_bitmask(Qp(1:q_mu_plus)) = 1; Q_bitmask(Qm(1:q_mu_minus)) = 1; Q = find(Q_bitmask); case {1,3} % Nonbalanced Q = randperm(m); Q = sort(Q(1:q_mu)); end end else Q = 1:m; % Half-and-half if sw_Q == 2 % Nonadaptive balanced [q_mu_plus, q_mu_minus] = SplitQ_mu(q_mu, m_plus, m_minus, q_upper, 0, 0); end end end end if sw_Q ~= 0 if itr_cnt > 1 || sw_first_random_selection == false if ~(q_upper >= m && (sw_Q == 1 || sw_Q == 2)) %if adaptive or q_upper is less than m % Determine the number of indexes to choose switch sw_Q case {3} % Adaptive non-balanced % q_mu is iteration dependent if q_lower_automatic % Determine q_lower automatically candidate_threshold = min(cut_coeff*sqrt(mu),2); switch sw_index_basis case {1, 3} candidate_cnt = length(find(s <= sqrt(mu)/cut_coeff)); case 2 v_sinv = v./s; candidate_cnt = length(find(v_sinv >= candidate_threshold)); case {4,5,9} candidate_cnt = length(find(omega_inv >= candidate_threshold)); otherwise candidate_cnt = length(find(s <= sqrt(mu)/cut_coeff & y <= sqrt(mu)/cut_coeff)); end q_lower = ceil(1*candidate_cnt); if mu >= 1e-5 q_lower = max(q_lower, n); end q_rate = (mu)^(1/beta); q_mu = max(ceil(q_rate*m), q_lower); q_mu = min(q_mu, q_upper); else q_rate = (mu)^(1/beta); q_mu = max([min([ceil(q_rate*m), q_upper]), q_lower]); end case {4} % Adaptive balanced % q_mu is iteration dependent if q_lower_automatic % Determine q_lower_plus and q_lower_minus automatically candidate_threshold = min(cut_coeff*sqrt(mu),2); switch sw_index_basis case {1, 3} v_sinv = v./s; candidate_cnt_plus = length(find(s(IdxP) <= sqrt(mu) | v_sinv(IdxP) >= candidate_threshold)); candidate_cnt_minus = length(find(s(IdxM) <= sqrt(mu) | v_sinv(IdxM) >= candidate_threshold)); case 2 v_sinv = v./s; candidate_cnt_plus = length(find(v_sinv(IdxP) >= candidate_threshold)); candidate_cnt_minus = length(find(v_sinv(IdxM) >= candidate_threshold)); case {4,5,9} candidate_cnt_plus = length(find(omega_inv(IdxP) >= candidate_threshold)); candidate_cnt_minus = length(find(omega_inv(IdxM) >= candidate_threshold)); otherwise candidate_cnt_plus = length(find(omega_inv(IdxP) >= candidate_threshold | s(IdxP) <= sqrt(mu)/cut_coeff & y(IdxP) <= sqrt(mu)/cut_coeff)); candidate_cnt_minus = length(find(omega_inv(IdxM) >= candidate_threshold | s(IdxM) <= sqrt(mu)/cut_coeff & y(IdxM) <= sqrt(mu)/cut_coeff)); end q_lower_plus = ceil(1*candidate_cnt_plus); q_lower_minus = ceil(1*candidate_cnt_minus); q_rate = (mu)^(1/beta); q_mu = ceil(q_rate*m); if mu >= 1e-5 q_mu = max(q_mu, n); end [q_mu_plus, q_mu_minus] = SplitQ_mu(q_mu, m_plus, m_minus, q_upper, q_lower_plus, q_lower_minus); q_mu = q_mu_plus + q_mu_minus; else q_rate = (mu)^(1/beta); q_mu = max([min([ceil(q_rate*m), q_upper]), q_lower]); [q_mu_plus, q_mu_minus] = SplitQ_mu(q_mu, m_plus, m_minus, q_upper, ceil(q_lower/2), ceil(q_lower/2)); end end % Now determine indexes switch sw_Q case 0 case {1, 3} % Nonbalanced % Select indices based on switch sw_index_basis case 1 % onsided distance [sorted, Q] = sort(onesided_dist); case 2 % v [sorted, Q] = sort(v,1,'descend'); case 3 % signed distance [sorted, Q] = sort(signed_dist); case 4 % omega [sorted, Q] = sort(omega); case 5 % Vu [sorted, Q] = sort(v.*u,'descend'); case 6 % s + y [sorted, Q] = sort(s+y); case 7 % Su + Yv [sorted, Q] = sort(s.*u+y.*v); case 8 % distance [sorted, Q] = sort(abs(signed_dist)); case 9 % V(tau*e - v) [sorted, Q] = sort(v.*(tau-v), 'descend'); end % Choose the first q indices in the sorted data Q = Q(1:q_mu); case {2, 4} % Balanced % Select indices based on switch sw_index_basis case 1 % onesided distance [sorted, Qp] = sort(onesided_dist(IdxP)); [sorted, Qm] = sort(onesided_dist(IdxM)); case 2 % v [sorted, Qp] = sort(v(IdxP),1,'descend'); [sorted, Qm] = sort(v(IdxM),1,'descend'); case 3 % signed distance [sorted, Qp] = sort(signed_dist(IdxP)); [sorted, Qm] = sort(signed_dist(IdxM)); case 4 % omega [sorted, Qp] = sort(omega(IdxP)); [sorted, Qm] = sort(omega(IdxM)); case 5 % Vu vu = v.*u; [sorted, Qp] = sort(vu(IdxP),'descend'); [sorted, Qm] = sort(vu(IdxM),'descend'); case 6 % s + y spy = s+y; [sorted, Qp] = sort(spy(IdxP)); [sorted, Qm] = sort(spy(IdxM)); case 7 % Su + Yv supyv = s.*u+y.*v; [sorted, Qp] = sort(supyv(IdxP)); [sorted, Qm] = sort(supyv(IdxM)); case 8 % signed distance dist = abs(signed_dist); [sorted, Qp] = sort(dist(IdxP)); [sorted, Qm] = sort(dist(IdxM)); case 9 % V(tau*e-v) vu = v.*(tau-v); [sorted, Qp] = sort(vu(IdxP),'descend'); [sorted, Qm] = sort(vu(IdxM),'descend'); end % Indices from the positive class Qp = IdxP(Qp(1:q_mu_plus)); % Indices from the negative class Qm = IdxM(Qm(1:q_mu_minus)); % Combine the two index sets % Q = union(Qp(1:q_mu_plus), Qm(1:q_mu_minus)); Q_bitmask = zeros(m,1); Q_bitmask(Qp(1:q_mu_plus)) = 1; Q_bitmask(Qm(1:q_mu_minus)) = 1; Q = find(Q_bitmask); end else Q = 1:m; end end %if-else end % if sw_Q ~= 0 time_Q = time_Q + (cputime-starting_Q); % Keep track of the number of data points involved in the matrix % assembly if sw_Q ~= 0 lenQ = length(Q); pts_used(itr_cnt) = lenQ; end % For predictor r_sv = s.*v; r_yu = u.*y; % Set up modified residuals br_u = r_u + r_yu.*y_inv; r_omega = r_s + v_inv.*r_sv - u_inv.*y.*br_u; br_w = r_w + BT*(omega_inv.*r_omega); br_v = r_v - d'*(omega_inv.*r_omega); omega_inv_d = omega_inv.*d; % Set up d_bar d_bar = BT*(omega_inv_d); % Let eta to be 1/(d^T * Omega^-1 * d) eta = 1/sum(omega_inv); % Form the matrix for the normal equation % Use reduction? if sw_Q ~= 0 % Reduced matrix assembly starting_Mat = cputime; dQ = d(Q); omegaQ_inv = omega_inv(Q); d_barQ = BT(:,Q)*(omega_inv_d(Q)); if lenQ ~= 0 etaQ = 1/sum(omegaQ_inv); else etaQ = 1; end % Density threshold for adaptive use of SMW formula % Experimental. % if this is greater than 1, always use the SMW formula, if 0, % always form the full matrix. density_threshold = 2; % Preallocation for normal matrix assembly. Experimental. sw_preallocation = false; if sw_factoring == 1 if sw_use_SMW if sw_preallocation && itr_cnt >= 2 %&& normal_matrix_density < density_threshold if itr_cnt == 2 [Mat_i, Mat_j] = find(Mat); end Mat = sparse(Mat_i, Mat_j, 0, n, n); % Mat(:,:) = 0; Mat = Mat + BT(:,Q)*spdiags(omegaQ_inv, 0, lenQ, lenQ)*BT(:,Q)' + speye(n,n); else Mat = speye(n,n) + BT(:,Q)*spdiags(omegaQ_inv, 0, lenQ, lenQ)*BT(:,Q)'; end normal_matrix_density = nnz(Mat)/size(Mat,1)/size(Mat,2); if normal_matrix_density >= density_threshold Mat = Mat - d_barQ*(d_barQ'*etaQ); end else Mat = speye(n,n) + (BT(:,Q)*spdiags(omegaQ_inv, 0, lenQ, lenQ))*BT(:,Q)' - d_barQ*(d_barQ'*etaQ); end end time_Mat = time_Mat + (cputime-starting_Mat); else % Nonreduced matrix assembly starting_Mat = cputime; if sw_factoring == 1 if sw_use_SMW Mat = speye(n,n) + BT*spdiags(omega_inv, 0, m, m)*BT'; if normal_matrix_density >= density_threshold Mat = Mat - d_bar*(d_bar'*eta); end else Mat = speye(n,n) + BT*spdiags(omega_inv, 0, m, m)*BT' - d_bar*(d_bar'*eta); end end time_Mat = time_Mat + (cputime-starting_Mat); end % Form the rhs for the equation rhs_w = -br_w - d_bar * br_v*eta; % Get the Cholesky factor starting_factor = cputime; if sw_factoring == 1 if sw_use_SMW && normal_matrix_density < density_threshold if sw_use_ordering [Upper, pp, ordering] = chol(Mat, 'vector'); % Sparse cholesky else Upper = chol(Mat); % Sparse cholesky end else Upper = chol(Mat); % Dense cholesky end chol_density = nnz(Upper)/size(Upper,1)/size(Upper,2); elseif sw_factoring ==2 if sw_Q ~= 0 sw_QR_placing = 2; % The identity is placed in 1: top, 2: bottom, 3: top, 4: bottom switch sw_QR_placing case 1 Upper = qr([speye(n,n), BT(:,Q)*spdiags(sqrt(omegaQ_inv), 0, lenQ, lenQ)]', 0); case 2 Upper = qr([BT(:,Q)*spdiags(sqrt(omegaQ_inv), 0, lenQ, lenQ), speye(n,n)]', 0); case 3 Upper = qr([speye(n,n); spdiags(sqrt(omegaQ_inv), 0, lenQ, lenQ)*BT(:,Q)'], 0); case 4 Upper = qr([spdiags(sqrt(omegaQ_inv), 0, lenQ, lenQ)*BT(:,Q)'; speye(n,n)], 0); end else Upper = qr([BT*spdiags(sqrt(omega_inv), 0, m, m), speye(n,n)]', 0); end end time_factor = time_factor + (cputime - starting_factor); % Get the affine scaling direction (Dw, Dgamma, Dv, Dy, Du, Ds) % Dw = Mat\rhs_w; starting_solve = cputime; if sw_use_SMW && normal_matrix_density < density_threshold if sw_use_ordering Mat_inv_rhs_w(ordering,1) = Upper\(Upper'\rhs_w(ordering)); else Mat_inv_rhs_w = Upper\(Upper'\rhs_w); end if sw_Q ~= 0 if sw_use_ordering Mat_inv_d_bar(ordering,1) = Upper\(Upper'\d_barQ(ordering)); else Mat_inv_d_bar = Upper\(Upper'\d_barQ); end eta_d_bar_Mat_inv_d_bar = etaQ*(d_barQ'*Mat_inv_d_bar); Dw = Mat_inv_rhs_w + ... (etaQ*(d_barQ'*Mat_inv_rhs_w)/(1-eta_d_bar_Mat_inv_d_bar))... * Mat_inv_d_bar; else if sw_use_ordering Mat_inv_d_bar(ordering,1) = Upper\(Upper'\d_bar(ordering)); else Mat_inv_d_bar = Upper\(Upper'\d_bar); end eta_d_bar_Mat_inv_d_bar = eta*(d_bar'*Mat_inv_d_bar); Dw = Mat_inv_rhs_w + ... (eta*(d_bar'*Mat_inv_rhs_w)/(1-eta_d_bar_Mat_inv_d_bar))... * Mat_inv_d_bar; end else Dw = Upper\(Upper'\rhs_w); end time_solve = time_solve + (cputime - starting_solve); % Dgamma = 1/(d'*Omega^-1*d) * (-br_v + d'*Omega^-1*D*A*Dw) % = eta * (-br_v + d_bar*Dw) Dgamma = eta * (-br_v + d_bar'*Dw); % Dv = -Omega^-1 * (r_omega + D*A*Dw - d*Dgamma) Dv = -omega_inv .* (r_omega + BT'*Dw - d*Dgamma); % Dy = -U^-1 * Y * (br_u - Dv) Dy = -u_inv .* y .* (br_u - Dv); % Du = -Y^-1 * (r_yu + U*Dy) Du = -(r_yu + u.*Dy).*y_inv; % Ds = -V^-1 * (r_sv + S*Dv) Ds = -v_inv .* (r_sv + s.*Dv); % Find appropriate step size so that the step doesn't go out of % the positive orthant alpha_primal_aff = DeterminePredictorStepSize([y; s], [Dy; Ds]); alpha_dual_aff = DeterminePredictorStepSize([v; u], [Dv; Du]); alpha_dual_aff = min([alpha_primal_aff alpha_dual_aff]); alpha_primal_aff = alpha_dual_aff; % Evaluate the prospective complementarity measurement mu_aff = ((v + alpha_dual_aff*Dv)'*(s + alpha_primal_aff*Ds) ... + (y + alpha_primal_aff*Dy)'*(u + alpha_dual_aff*Du))/(2*m); % Set centering parameter according to [Wright 1999] sigma = (mu_aff/mu)^3; sigma = min([sigma 1]); sigma_times_mu = sigma*mu; % Now it's time to set up right hand side for the corrector step r_sv = s.*v - sigma_times_mu + Ds.*Dv; r_yu = y.*u - sigma_times_mu + Du.*Dy; % Set up modified residuals br_u = r_u + r_yu.*y_inv; r_omega = r_s + v_inv.*r_sv - u_inv.*y.*br_u; br_w = r_w + BT*(omega_inv.*r_omega); br_v = r_v - d'*(omega_inv.*r_omega); % Form the rhs for the equation (no perturbation in RHS, yet) rhs_w = -br_w - d_bar * br_v*eta; % Get the combined predictor and corrector direction (Dw_cc, Dgamma_cc, Dv_cc, Dy_cc, Du_cc, Ds_cc) % Dw = M\rhs_w; starting_solve = cputime; if sw_use_SMW && normal_matrix_density < density_threshold if sw_use_ordering Mat_inv_rhs_w(ordering,1) = Upper\(Upper'\rhs_w(ordering)); else Mat_inv_rhs_w = Upper\(Upper'\rhs_w); end if sw_Q ~= 0 Dw = Mat_inv_rhs_w + ... (etaQ*(d_barQ'*Mat_inv_rhs_w)/(1-eta_d_bar_Mat_inv_d_bar))... * Mat_inv_d_bar; else Dw = Mat_inv_rhs_w + ... (eta*(d_bar'*Mat_inv_rhs_w)/(1-eta_d_bar_Mat_inv_d_bar))... * Mat_inv_d_bar; end else Dw = Upper\(Upper'\rhs_w); end time_solve = time_solve + (cputime - starting_solve); % Dgamma = 1/(d'*Omega^-1*d) * (-br_v + d'*Omega^-1*D*A*Dw) % = eta * (-br_v + d_bar*Dw) Dgamma = eta * (-br_v + d_bar'*Dw); % Dv = -Omega^-1 * (r_omega + D*A*Dw - d*Dgamma) Dv = -omega_inv .* (r_omega + BT'*Dw - d*Dgamma); % Dy = -U^-1 * Y * (br_u - Dv) Dy = -u_inv .* y .* (br_u - Dv); % Du = -Y^-1 * (r_yu + U*Dy) Du = -(r_yu + u.*Dy).*y_inv; % Ds = -V^-1 * (r_sv + S*Dv) Ds = -v_inv .* (r_sv + s.*Dv); % Determine next step length alpha_primal = DetermineNextStepSize([y; s], [Dy; Ds]); alpha_dual = DetermineNextStepSize([v; u], [Dv; Du]); alpha_dual = min([alpha_primal alpha_dual]); alpha_primal = alpha_dual; % Take the step w = w + alpha_primal*Dw; gamma = gamma + alpha_primal*Dgamma; y = y + alpha_primal*Dy; s = s + alpha_primal*Ds; v = v + alpha_dual*Dv; u = u + alpha_dual*Du; % Save history if save_history Q_hist{itr_cnt} = Q; w_hist{itr_cnt+1} = w; gamma_hist(itr_cnt+1) = gamma; v_hist{itr_cnt+1} = v; if sw_factoring == 1 if sw_use_SMW % normal_matrix_density_trace(itr_cnt) = nnz(Mat)/size(Mat,1)/size(Mat,2); % chol_density_trace(itr_cnt) = nnz(Upper)/size(Upper,1)/(size(Upper,2)+1)*2; normal_matrix_density_trace(itr_cnt) = normal_matrix_density; chol_density_trace(itr_cnt) = chol_density; else normal_matrix_density_trace(itr_cnt) = nnz(Mat)/size(Mat,1)/size(Mat,2); chol_density_trace(itr_cnt) = nnz(Upper)/size(Upper,1)/size(Upper,2); end elseif sw_factoring == 2 chol_density_trace(itr_cnt) = nnz(Upper)/size(Upper,1)/size(Upper,2); end end end % end of main while % These conditions are always satisfied. Not necessary. % if any(s < 0) || any(y < 0) || any(v < 0) || any(u < 0) % error('Positiveness conditions are not satisfied'); % end % Set the output variable machine.bias = gamma; machine.weight = v; % machine.support_vector_index = find(s < v); % machine.on_boundary_support_vector_index = find( (s < v) & (y < u) ); % machine.off_boundary_support_vector_index = find( (s < v) & (y > u) ); machine.support_vector_index = find( v./s > cut_coeff*sqrt(mu)/2 ); machine.on_boundary_support_vector_index = find( omega_inv > cut_coeff*sqrt(mu) ); machine.off_boundary_support_vector_index = find( (v./s > cut_coeff*sqrt(mu)/2) & ~(omega_inv > cut_coeff*sqrt(mu)) ); % machine.classifier = @(x,y) machine.internal_variables.w = w; machine.internal_variables.y = y; machine.internal_variables.s = s; machine.internal_variables.u = u; obj_primal = dot(w,w)/2 + (sum(tau.*y)); obj_dual = -1/2*dot(BTv,BTv) + sum(v); rel_duality = abs(obj_primal - obj_dual)/(1+abs(obj_dual)); machine.optimization_result.primal_objective_function_value = obj_primal; machine.optimization_result.dual_objective_function_value = obj_dual; machine.optimization_result.mu = mu; machine.optimization_result.mu_trace = mu_trace; machine.optimization_result.relative_duality_gap = rel_duality; machine.optimization_result.relative_residual = residual; machine.optimization_result.relative_residual_trace = residual_trace; % machine.optimization_result.data_magnitude = data_magnitude; machine.iterations = itr_cnt; machine.time = ending - starting; machine.exit_flag = flag; switch flag case 1 machine.exit_reason = 'Found the optimal solution'; case -1 machine.exit_reason = 'Iteration count reached the limit'; end if useReduction machine.reduction_result.number_of_data_point_used = pts_used; machine.reduction_result.average_number_of_data_point_used = sum(pts_used)/length(pts_used); machine.reduction_result.total_number_of_data_point_used = sum(pts_used); end if use_Gram_approximation machine.timing_result.Gram_matrix_approximation = time_Gram_approximation; end machine.timing_result.index_set = time_Q; machine.timing_result.normal_matrix = time_Mat; machine.timing_result.factorization = time_factor; machine.timing_result.substitutions = time_solve; if save_history machine.history.Q = Q_hist; machine.history.w = w_hist; machine.history.bias = gamma_hist; machine.history.weight = v_hist; if sw_factoring == 1 machine.history.normal_matrix_density = normal_matrix_density_trace; end machine.history.cholesky_factor_density = chol_density_trace; end end % end of function function alpha = DeterminePredictorStepSize(x, Dx) idx = find(Dx < 0); alpha = min([-x(idx)./Dx(idx); 1]); % alpha = min([alpha; 1]); % alpha = max([alpha; 0]); % redundant end function alpha = DetermineNextStepSize(x, Dx) idx = find(Dx < 0); alpha = min(-x(idx)./Dx(idx)); alpha = min([0.99 * alpha; 1]); % alpha = max([alpha; 0]); % redundant end function [q_mu_plus, q_mu_minus] = SplitQ_mu(q_mu, m_plus, m_minus, q_upper, q_lower_plus, q_lower_minus) % Half-and-half q_mu_plus = ceil(min(q_upper,ceil(q_mu))/2); q_mu_minus = ceil(min(q_upper,ceil(q_mu))/2); q_mu_plus = max(q_lower_plus, q_mu_plus); q_mu_minus = max(q_lower_minus, q_mu_minus); q_mu_plus = min(q_mu_plus, m_plus); q_mu_minus = min(q_mu_minus, m_minus); % Adjust q_mu_plus and q_mu_minus so that their % sum makes q_mu left_over = min(q_mu, q_upper) - q_mu_plus - q_mu_minus; if left_over > 0 if q_mu_plus == m_plus q_mu_minus = q_mu_minus + left_over; elseif q_mu_minus == m_minus q_mu_plus = q_mu_plus + left_over; end end end function [w, gamma, y, s, v, u] = GenerateStartingPointJosh(m, n) w = zeros(n,1); gamma = 0; y = 2*ones(m,1); s = y; v = y; u = y; end % function [G,P,Q,R,error1,error2,error,predicted_gain,true_gain] = csi(X,Y,kernel,max_rank,centering,kappa,delta,tol) % % INPUT % % X : data matrix m x l, where l is the number of attributes % % Y : target vector m x d. Side information such as data label or answer. % % kernel : A structure defining kernel % % kernel.kernel_function : % % [{'Gaussian'}|'Linear'|'Polynomial'|'Sigmoid'|A function handle] % % 'Gaussian': exp(-norm(x-y)^2/(2*sigma^2)) % % Options % % kernel.sigma: Defines width of the kernel. Default value is % % set to make 2*sigma1^2 become l. % % 'Linear': dot(x,y) % % This takes no option. % % 'Polynomial': (coefficient*dot(x,y)+bias)^degree % % Options % % kernel.coefficient: Default value is 1/l. % % kernel.bias: Default value is 1/l. % % kernel.degree: Default value is 3. % % 'Sigmoid': tanh(kappa*dot(x,y)-delta) % % Options % % kernel.kappa: Default value is 1/l. % % kernel.delta: Default value is 0. % % A function handle: A handle to function taking two data matrices % % X1 and X2, each row of which is a data point. % % This can be a function to an anonymous function such as % % kernel.kernel_function = @(X1,X2) (X1*X2'+1).^2; % % The provided function should support pairwise evaluation % % between data points in X1 and X2. % % max_rank : maximal rank % % kappa : trade-off between approximation of K and prediction of Y (suggested: .99) % % Don't be confused by the kappa option provided for the sigmoid kernel. % % centering : 1 if centering, 0 otherwise (suggested: 1) % % delta : number of columns of cholesky performed in advance (suggested: 40) % % Don't be confused by the delta option provided for the sigmoid kernel. % % tol : minimum gain at iteration (suggested: 1e-4) % % % % OUTPUT % % G : Cholesky decomposition -> K(P,P) is approximated by G*G', where K is % % the gram matrix defined by the kernel and the data points. % % P : permutation matrix % % Q,R : QR decomposition of G (or center(G) if centering) % % error1 : tr(K-G*G')/tr(K) at each step of the decomposition % % error2 : ||Y-Q*Q'*Y||_F^2 / ||Y||_F^2 at each step of the decomposition % % predicted_gain : predicted gain before adding each column % % true_gain : actual gain after adding each column % % % % Copyright (c) Francis R. Bach, 2005. % % % % Modified by Jin Hyuk Jung, July, 2007. % % This modification does not require the Gram matrix to be formed before % % this function is called. So this code does not involve storing the huge % % Gram matrix. % % n = size(X,1); % The number of data points % num_attributes = size(X,2); % d = size(Y,2); % assert(size(Y,1)==n,'The targets have the wrong number of data points'); % % if ~isfield(kernel, 'kernel_function') % kernel_function = @linear_kernel; % else % if isa(kernel.kernel_function, 'function_handle') % kernel_function = kernel.kernel_function; % else % switch(lower(kernel.kernel_function)) % case 'linear' % kernel_function = @linear_kernel; % case 'gaussian' % kernel_function = @gaussian_kernel; % if isfield(kernel, 'sigma') % gaussian_sigma_sq_doubled = 2*(kernel.sigma^2); % else % gaussian_sigma_sq_doubled = num_attributes; % end % case 'polynomial' % kernel_function = @polynomial_kernel; % if isfield(kernel, 'coefficient') % poly_coeff = kernel.coefficient; % else % poly_coeff = 1/num_attributes; % end % % if isfield(kernel, 'bias') % poly_bias = kernel.bias; % else % poly_bias = 1/num_attributes; % end % % if isfield(kernel, 'degree') % poly_degree = kernel.degree; % else % poly_degree = 3; % end % case 'sigmoid' % kernel_function = @sigmoid_kernel; % if isfield(kernel, 'kappa') % sigmoid_kappa = kernel.kappa; % else % sigmoid_kappa = 1/num_attributes; % end % % if isfield(kernel, 'delta') % sigmoid_delta = kernel.bias; % else % sigmoid_delta = 0; % end % otherwise % error(sprintf(['No such kernel, %s, is supported. User provided'... % 'kernel should be a handle to a function'], kernel.kernel_function)); % end % end % end % % G = zeros(n,min(max_rank+delta,n)); % Cholesky factor % diagK = zeros(n,1); % for i = 1:n, % diagK(i,1) = kernel_function(X(i,:),X(i,:)); % end % % P = 1:n; % % Q = zeros(n,min(max_rank+delta,n)); % Q part of the QR decomposition % R = zeros(min(max_rank+delta,n),min(max_rank+delta,n)); % R part of the QR decomposition % traceK = sum(diagK); % lambda = (1-kappa) / traceK ; % if centering, Y = Y - 1/n * repmat(sum(Y,1),n,1); end % sumY2 = sum(Y(:).^2); % mu = (kappa) / sumY2; % error1(1) = traceK; % error2(1) = sumY2; % % k=0; % current index of the Cholesky decomposition % kadv=0; % current index of the look ahead steps % Dadv = diagK; % D = diagK; % % % approximation cost cached quantities % A1 = zeros(n,1); % A2 = zeros(n,1); % A3 = zeros(n,1); % GTG = zeros(max_rank); % QTY = zeros(max_rank,d); % QTYYTQ = zeros(max_rank,max_rank); % % % makes sure that delta is smaller than n % delta = min(delta,n); % % % first performs delta steps of Cholesky and QR decomposition % for i=1:delta % kadv = kadv + 1; % % select best index % diagmax = Dadv(kadv); % jast = 1; % for j=1:n-kadv+1 % if (Dadv(j+kadv-1)>diagmax / .99) % diagmax = Dadv(j+kadv-1); % jast = j; % end % end % if diagmax<1e-12 % % all pivots are too close to zero, stops % % this can only happen if the matrix has rank less than delta % kadv = kadv - 1; % break; % else % jast=jast+kadv-1; % % % permute indices % P( [kadv jast] ) = P( [jast kadv] ); % Dadv( [kadv jast] ) = Dadv( [jast kadv] ); % D( [kadv jast] ) = D( [jast kadv] ); % A1( [kadv jast] ) = A1( [jast kadv] ); % G([kadv jast],1:kadv-1) = G([ jast kadv],1:kadv-1); % Q([kadv jast],1:kadv-1) = Q([ jast kadv],1:kadv-1); % % % compute new Cholesky column % G(kadv,kadv)=Dadv(kadv); % G(kadv,kadv)=sqrt(G(kadv,kadv)); % % newKcol = K(P(kadv+1:n),P(kadv)); % newKcol = kernel_function(X(P(kadv+1:n),:),X(P(kadv),:)); % G(kadv+1:n,kadv)=1/G(kadv,kadv)*( newKcol - G(kadv+1:n,1:kadv-1)*(G(kadv,1:kadv-1))'); % % % update diagonal % Dadv(kadv+1:n) = Dadv(kadv+1:n) - G(kadv+1:n,kadv).^2; % Dadv(kadv) = 0; % % % performs QR % if centering % Gcol = G(:,kadv) - 1/n * repmat( sum(G(:,kadv) ,1),n,1 ); % else % Gcol = G(:,kadv); % end % R(1:kadv-1,kadv) = Q(:,1:kadv-1)' * Gcol; % Q(:,kadv) = Gcol - Q(:,1:kadv-1) * R(1:kadv-1,kadv); % R(kadv,kadv) = norm(Q(:,kadv)); % Q(:,kadv) = Q(:,kadv) / R(kadv,kadv); % % % update cached quantities % if centering % GTG(1:kadv,kadv) = G(:,1:kadv)' * G(:,kadv); % else % GTG(1:kadv,kadv) = R(1:kadv,1:kadv)' * R(1:kadv,kadv); % end % GTG(kadv,1:kadv) = GTG(1:kadv,kadv)'; % QTY(kadv,:) = Q(:,kadv)' * Y(P,:); % QTYYTQ(kadv,1:kadv) = QTY(kadv,:) * QTY(1:kadv,:)'; % QTYYTQ(1:kadv,kadv) = QTYYTQ(kadv,1:kadv)'; % % % update costs % A1(kadv:n) = A1(kadv:n) + GTG(kadv,kadv) * ( G(kadv:n,kadv).^2 ); % A1(kadv:n) = A1(kadv:n) + 2 * G(kadv:n,kadv) .* ( G(kadv:n,1:kadv-1) * GTG(1:kadv-1,kadv) ); % % end % end % % % compute remaining costs for all indices % A2 = sum( ( G(:,1:kadv) * ( R(1:kadv,1:kadv)' * QTY(1:kadv,:) ) ).^2 , 2); % A3 = sum( ( G(:,1:kadv) * R(1:kadv,1:kadv)' ).^2 , 2); % % % % start main loop % while true, % if k >= max_rank % break; % end % k = k +1; % % % compute the gains in approximation for all remaining indices % dJK = zeros(n-k+1,1); % for i=1:n-k+1 % kast = k+i-1; % if D(kast)<1e-12, % % this column is already generated by already % % selected columns -> cannot be selected % dJK(i)=-1e100; % else % dJK(i) = A1(kast); % if kast > kadv, % % add eta % dJK(i) = dJK(i) + D(kast)^2 - (D(kast) - Dadv(kast))^2; % end % dJK(i) = dJK(i) / D(kast); % end % end % % dJY = zeros(n-k+1,1); % if kadv>k % for i=1:n-k+1 % kast = k+i-1; % if A3(kast) < 1e-12 % dJY(i) = 0; % else % dJY(i) = A2(kast) / A3(kast); % end % end % end % % % select the best column % dJ = lambda * dJK + mu * dJY; % diagmax = -1; % jast = 0; % for j=1:n-k+1 % if D(j+k-1)>1e-12 % if dJ(j) > diagmax / 0.9 % diagmax = dJ(j); % jast = j; % end % end % end % if jast==0, % % no more good indices, exit % k=k-1; % break; % end % jast = jast + k - 1; % predicted_gain(k) = diagmax; % % % performs one cholesky + QR step: % % if new pivot not already selected, use pivot % % otherwise, select new look ahead index that maximize Dadv % % if jast > kadv, % newpivot = jast; % jast = kadv + 1; % else % a = 1e-12; % b = 0; % for j=1:n-kadv % if Dadv(j+kadv)>a/.99 % a=Dadv(j+kadv); % b = j+kadv; % end % end % if b==0, newpivot = 0; % else newpivot = b; % end % end % % if newpivot > 0 % % performs steps % % kadv = kadv + 1; % % % permute % P( [kadv newpivot] ) = P( [newpivot kadv] ); % Dadv( [kadv newpivot] ) = Dadv( [newpivot kadv] ); % D( [kadv newpivot] ) = D( [newpivot kadv] ); % A1( [kadv newpivot] ) = A1( [newpivot kadv] ); % A2( [kadv newpivot] ) = A2( [newpivot kadv] ); % A3( [kadv newpivot] ) = A3( [newpivot kadv] ); % G([kadv newpivot],1:kadv-1)=G([ newpivot kadv],1:kadv-1); % Q([kadv newpivot],1:kadv-1)=Q([ newpivot kadv],1:kadv-1); % % % compute new Cholesky column % G(kadv,kadv)=Dadv(kadv); % G(kadv,kadv)=sqrt(G(kadv,kadv)); % %newKcol = K(P(kadv+1:n),P(kadv)); % newKcol = kernel_function(X(P(kadv+1:n),:), X(P(kadv),:)); % G(kadv+1:n,kadv)=1/G(kadv,kadv)*( newKcol - G(kadv+1:n,1:kadv-1)*(G(kadv,1:kadv-1))'); % % % update diagonal % Dadv(kadv+1:n) = Dadv(kadv+1:n) - G(kadv+1:n,kadv).^2; % Dadv(kadv) = 0; % % % performs QR % if centering % Gcol = G(:,kadv) - 1/n * repmat( sum(G(:,kadv) ,1),n,1 ); % else % Gcol = G(:,kadv); % end % R(1:kadv-1,kadv) = Q(:,1:kadv-1)' * Gcol; % Q(:,kadv) = Gcol - Q(:,1:kadv-1) * R(1:kadv-1,kadv); % R(kadv,kadv) = norm(Q(:,kadv)); % Q(:,kadv) = Q(:,kadv) / R(kadv,kadv); % % % update the cached quantities % if centering % GTG(k:kadv,kadv) = G(:,k:kadv)' * G(:,kadv); % else % GTG(k:kadv,kadv) = R(1:kadv,k:kadv)' * R(1:kadv,kadv); % end % GTG(kadv,k:kadv) = GTG(k:kadv,kadv)'; % QTY(kadv,:) = Q(:,kadv)' * Y(P,:); % QTYYTQ(kadv,k:kadv) = QTY(kadv,:) * QTY(k:kadv,:)'; % QTYYTQ(k:kadv,kadv) = QTYYTQ(kadv,k:kadv)'; % % % update costs % A1(kadv:n) = A1(kadv:n) + GTG(kadv,kadv) * ( G(kadv:n,kadv).^2 ); % A1(kadv:n) = A1(kadv:n) + 2 * G(kadv:n,kadv) .* ( G(kadv:n,k:kadv-1) * GTG(k:kadv-1,kadv) ); % A3(kadv:n) = A3(kadv:n) + G(kadv:n,kadv).^2 * sum( R(k:kadv,kadv).^2, 1 ); % temp = R(k:kadv,kadv)' * R(k:kadv,k:kadv-1); % A3(kadv:n) = A3(kadv:n) + 2 * G(kadv:n,kadv) .* ( G(kadv:n,k:kadv-1) * temp' ); % temp = R(k:kadv,kadv)' * QTYYTQ(k:kadv,k:kadv); % temp1 = temp * R(k:kadv,kadv) ; % A2(kadv:n) = A2(kadv:n) + G(kadv:n,kadv).^2 * temp1; % temp2 = temp * R(k:kadv,k:kadv-1); % A2(kadv:n) = A2(kadv:n) + 2 * G(kadv:n,kadv) .* ( G(kadv:n,k:kadv-1) * temp2' ); % % % end % % % permute pivots in the Cholesky and QR decomposition between p,q % p = k; % q = jast; % if p different % A1(k:n) = A1(k:n) - 2 * G(k:n,k) .* ( G(k:n,k:p-1) * GTG(k:p-1,k) ); % below p % A1(k:n) = A1(k:n) - 2 * G(k:n,k) .* ( G(k:n,q+1:kadv) * GTG(q+1:kadv,k) ); % above q % tempQ = tempQ(:,1); % temp = G(k:n,q+1:kadv) * R(k,q+1:kadv)'; % temp = temp + G(k:n,k:p-1) * R(k,k:p-1)'; % temp = temp + Gbef(k:n,:) * Rbef' * tempQ; % A3(k:n) = A3(k:n) - temp.^2 ; % A2(k:n) = A2(k:n) + temp.^2 * QTYYTQ(k,k); % temp2 = ( tempQ' * QTYYTQbef ) * Rbeftotal; % A2(k:n) = A2(k:n) - 2 * temp .* ( Gbeftotal(k:n,:) * temp2' ); % % else % % % update costs % A1(k:n) = A1(k:n) - 2 * G(k:n,k) .* ( G(k:n,k:kadv) * GTG(k:kadv,k) ); % A3(k:n) = A3(k:n) - ( G(k:n,k:kadv) * R(k,k:kadv)' ).^2; % temp = G(k:n,k:kadv) * R(k,k:kadv)'; % A2(k:n) = A2(k:n) + ( temp ).^2 * QTYYTQ(k,k); % temp2 = QTYYTQ(k,k:kadv) * R(k:kadv,k:kadv); % A2(k:n) = A2(k:n) - 2 * temp .* ( G(k:n,k:kadv) * temp2' ); % end % % % update diagonal and other quantities (A1,B1) % D(k+1:n) = D(k+1:n) - G(k+1:n,k).^2; % D(k) = 0 ; % A1(k:n) = A1(k:n) + GTG(k,k) * ( G(k:n,k).^2 ); % % % compute errors and true gains % temp2 = Q(:,k)' * Y(P,:); % temp2 = sum(temp2.^2); % temp1 = sum( G(:,k).^2 ); % true_gain(k) = temp1 * lambda + temp2 * mu; % error1(k+1) = error1(k) - temp1; % error2(k+1) = error2(k) - temp2; % % if true_gain(k) < tol, break; end % end % % % reduce dimensions of decomposition % G=G(:,1:k); % Q=Q(:,1:k); % R=R(1:k,1:k); % % % compute and normalize errors % error = lambda * error1 + mu * error2; % error1 = error1 / traceK; % error2 = error2 / sumY2; % % function K = linear_kernel(X, x_basis) % K = X*x_basis'; % end % % function K = polynomial_kernel(X, x_basis) % K = (poly_coeff*X*x_basis'+poly_bias).^poly_degree; % end % % function K = gaussian_kernel(X, x_basis) % K = exp( - sqdist(X', x_basis')/(gaussian_sigma_sq_doubled) ); % end % % function K = sigmoid_kernel(X, x_basis) % K = (tanh(sigmoid_kappa * (X*x_basis') - sigmoid_delta)); % end % % function d=sqdist(a,b) % % SQDIST - computes squared Euclidean distance matrix % % computes a rectangular matrix of pairwise distances % % between points in A (given in columns) and points in B % % % NB: very fast implementation taken from Roland Bunschoten % aa = sum(a.*a,1); bb = sum(b.*b,1); ab = a'*b; % d = abs(repmat(aa',[1 size(bb,2)]) + repmat(bb,[size(aa,2) 1]) - 2*ab); % end % end % % %%%%%%%%%%%%%%%%%%%%%%%%%%%%% % function [Q,R] = qr2(M); % % QR decomposition for 2x2 matrices (this is to make sure that the C and % % Matlab implementations output exactly the same matrices % Q = zeros(2,2); % R = zeros(2,2); % x = sqrt( M(1,1)^2 + M(2,1)^2 ); % R = x; % Q(:,1) = M(:,1) / x; % R(1,2) = Q(:,1)' * M(:,2); % Q(:,2) = M(:,2) - R(1,2) * Q(:,1); % R(2,2) = norm(Q(:,2)); % Q(:,2) = Q(:,2) / R(2,2); % end