API reference for casimir.optim#

Incremental First Order Oracles#

class casimir.optim.IncrementalFirstOrderOracle#

Base class for incremental first order oracles (IFO).

For a function \(f(w)\) defined as \(f(w) = \frac{1}{n} \sum_{i=1}^n f_i(w)\), this class is an interface to implement methods to compute \(f_i(w)\) and its gradient \(\nabla f_i(w)\), as well as the batch function value \(f(w)\) and the batch gradient \(\nabla f(w)\) if each \(f_i\) is smooth. If \(f_i\) is not smooth, use sub-class SmoothedIncrementalFirstOrderOracle instead.

A concrete implementation must at least override the methods function_value, gradient and __len__. Optionally, it may also override evaluation_function for domain specific evaluation metrics such as classification accuracy. The methods batch_function_value and batch_gradient are implemented as a loop that averages over the outputs of function_value and gradient respectively by default. If a more efficient implementation exists, these methods can be overridden by derived classes as well.

Input idx to function_value and gradient must not exceed \(n\). This is not explicitly checked here.
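As an illustration, here is a minimal sketch of a least-squares oracle. The LeastSquaresIFO class, its constructor and its data layout are hypothetical and not part of casimir.optim; only the three overridden methods follow the documented interface, and the base constructor is assumed to take no required arguments.

    import numpy as np

    from casimir.optim import IncrementalFirstOrderOracle


    class LeastSquaresIFO(IncrementalFirstOrderOracle):
        """Hypothetical oracle for f_i(w) = 0.5 * (x_i^T w - y_i)^2."""

        def __init__(self, features, labels):
            super().__init__()  # assumption: base constructor takes no arguments
            self.features = features  # array of shape (n, d)
            self.labels = labels      # array of shape (n,)

        def function_value(self, model, idx):
            residual = self.features[idx].dot(model) - self.labels[idx]
            return 0.5 * residual ** 2

        def gradient(self, model, idx):
            residual = self.features[idx].dot(model) - self.labels[idx]
            return residual * self.features[idx]

        def __len__(self):
            return self.features.shape[0]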

batch_function_value(model)#

Return function value \(f(w)\) where \(w\) is model.

batch_gradient(model)#

Return gradient \(\nabla f(w)\) where \(w\) is model.

evaluation_function(model)#

Return domain-specific task metrics (default None).

function_value(model, idx)#

Return function value \(f_i(w)\) where \(w\) is model and \(i\) is idx.

gradient(model, idx)#

Return gradient \(\nabla f_i(w)\) where \(w\) is model and \(i\) is idx.

class casimir.optim.SmoothedIncrementalFirstOrderOracle(smoothing_coefficient=None)#

Base class of smoothed incremental first order oracles.

For a function \(f(w)\) defined as \(f(w) = \frac{1}{n} \sum_{i=1}^n f_i(w)\), where each \(f_i\) is non-smooth but smoothable, this class is an interface to implement methods to compute \(f_{i, \mu}(w)\) and its gradient \(\nabla f_{i, \mu}(w)\), as well as the batch function value \((f(w), f_{\mu}(w))\) and the batch gradient \(\nabla f_{\mu}(w)\). Here, \(g_\mu\) denotes a smooth surrogate to a non-smooth function \(g\), parameterized by a smoothing coefficient \(\mu\).

When \(\mu\) is None (i.e., no smoothing), the implementation must serve as an IFO for the original non-smooth function \(f\), in which case \(\nabla f_i\) refers to a subgradient of \(f_i\).

Note

This class contains a field smoothing_coefficient to represent \(\mu\), which can be mutated by optimization algorithms or other functions that use adaptive smoothing schemes.

function_value(model, idx)#

Return the pair \(\big( f_i(w), f_{i, \mu}(w) \big)\). If smoothing_coefficient is None, i.e., no smoothing, simply return \(f_i(w)\), where \(i\) represents the index idx.

gradient(model, idx)#

Return the gradient \(\nabla f_{i, \mu}(w)\) if smoothing_coefficient is not None, or a subgradient \(\nabla f_i(w)\) if smoothing_coefficient is None.
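As an illustration, here is a minimal sketch of a smoothed oracle for the absolute loss \(f_i(w) = |x_i^\top w - y_i|\), whose \(\ell_2\)-smoothed surrogate is the Huber function. The AbsoluteLossIFO class and its data layout are hypothetical and not part of casimir.optim.

    import numpy as np

    from casimir.optim import SmoothedIncrementalFirstOrderOracle


    class AbsoluteLossIFO(SmoothedIncrementalFirstOrderOracle):
        """Hypothetical smoothed oracle for f_i(w) = |x_i^T w - y_i|."""

        def __init__(self, features, labels, smoothing_coefficient=None):
            super().__init__(smoothing_coefficient)
            self.features, self.labels = features, labels

        def function_value(self, model, idx):
            t = self.features[idx].dot(model) - self.labels[idx]
            if self.smoothing_coefficient is None:
                return abs(t)  # no smoothing: return f_i(w) only
            mu = self.smoothing_coefficient
            smoothed = t ** 2 / (2 * mu) if abs(t) <= mu else abs(t) - mu / 2
            return abs(t), smoothed  # the pair (f_i(w), f_{i, mu}(w))

        def gradient(self, model, idx):
            t = self.features[idx].dot(model) - self.labels[idx]
            if self.smoothing_coefficient is None:
                return np.sign(t) * self.features[idx]  # a subgradient of f_i
            # Gradient of the Huber surrogate: clip(t / mu, -1, 1) * x_i
            return np.clip(t / self.smoothing_coefficient, -1.0, 1.0) * self.features[idx]

        def __len__(self):
            return self.features.shape[0]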

Optimization Algorithms#

casimir.optim.optimize_ifo(initial_model, train_ifo, dev_ifo=None, test_ifo=None, algorithm='SGD', reg_penalties=None, num_passes=100, termination_criterion=None, seed=25, logging=True, verbose=True, optim_options=None)#

Minimize a convex function with access to a (smoothed) incremental first order oracle using a primal algorithm.

Functions that can be handled are of the form \(f(w) = \frac{1}{n} \sum_{i=1}^n f_i(w) + r(w)\), where a (smoothed) incremental first order oracle gives access to each \(f_i\) and \(r(w)\) is a (sum of) regularization penalties. The argument algorithm controls which optimization algorithm is used.

Parameters:
  • initial_model (numpy.ndarray) – Starting iterate of the optimization algorithm.

  • train_ifo (incremental_first_order_oracle.IncrementalFirstOrderOracle or incremental_first_order_oracle.SmoothedIncrementalFirstOrderOracle) – (Smoothed) Incremental First Order Oracle for the function to minimize.

  • dev_ifo (incremental_first_order_oracle.IncrementalFirstOrderOracle or incremental_first_order_oracle.SmoothedIncrementalFirstOrderOracle or None) – Incremental First Order Oracle for the development set. A value of None indicates no development set.

  • test_ifo (incremental_first_order_oracle.IncrementalFirstOrderOracle or incremental_first_order_oracle.SmoothedIncrementalFirstOrderOracle or None) – Incremental First Order Oracle for the test set. A value of None indicates no test set.

  • algorithm – Optimization algorithm to use. Should be from the list ['sgd', 'svrg', 'casimir_svrg'].

  • seed – Integer to be used as random seed. Default 25.

  • reg_penalties (List of regularization.SmoothRegularizationPenalty or None) – List of regularization penalties, where \(r(w)\) corresponds to the sum of the penalties in the list. Default None, in which case no regularization penalty is applied.

  • num_passes – Maximum allowed number of passes through the data. Default 100.

  • termination_criterion – None or Callable that takes two arguments model, train_ifo and returns True if converged. If None, the given iteration budget num_passes is used to terminate optimization. Default None. See termination_gradient_norm() for an example.

  • logging – (boolean) Log performance of optimization algorithm. Default True.

  • verbose – (boolean) Print performance of optimization algorithm, only if logging is True. Default True.

  • optim_options (dict) – Algorithm-specific options. Default None. Consult named parameters of specific optimizers for allowed options.

Returns:

The tuple (final iterate, logs), where logs is a list of lists, one per epoch. Each inner list contains the epoch number, the train function value, the dev function value, the dev evaluation metric, the test function value, the test evaluation metric, and the time taken for that epoch. The dev (resp. test) statistics are reported only if dev_ifo (resp. test_ifo) is not None.
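A minimal usage sketch, built on the hypothetical LeastSquaresIFO oracle sketched earlier; all numerical values are placeholders.

    import numpy as np

    from casimir.optim import L2Penalty, optimize_ifo

    rng = np.random.RandomState(0)
    # LeastSquaresIFO is the hypothetical sketch above, not part of casimir.optim.
    train_ifo = LeastSquaresIFO(rng.randn(100, 10), rng.randn(100))

    final_model, logs = optimize_ifo(
        initial_model=np.zeros(10),
        train_ifo=train_ifo,
        algorithm='sgd',
        reg_penalties=[L2Penalty(regularization_parameter=0.1)],
        num_passes=20,
    )
    print(logs[-1])  # statistics of the last epoch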

class casimir.optim.Optimizer(initial_model, reg_penalties=None)#

Base class for optimizers using (smoothed) incremental first order oracles.

Optimizer classes store the state of the variables to be optimized.

Any subclass must override the methods start_epoch, step, end_epoch.

end_epoch()#

Perform optional computation to end epoch and return current iterate to be used for logging.

Returns:

model, denoting optimizer state at current time.

start_epoch(train_ifo, epoch)#

Prepare for start of epoch.

Example uses are the warm start in Casimir-SVRG or the batch gradient computation in SVRG. Adaptive smoothing algorithms may update the smoothing parameter here.

Parameters:
  • train_ifo – (Smoothed) Incremental First Order Oracle for the training set. This object may be mutated by adaptive smoothing algorithms.

step(train_ifo, idx, iteration)#

Take a step based on sample indexed by idx.

Parameters:
  • train_ifo – (Smoothed) Incremental First Order Oracle for the training set.

  • idx – Index of sample to use to take a step.

  • iteration – Global iteration number. Required, e.g., by SGDOptimizer to compute learning rate.
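To illustrate the interface, here is a minimal sketch of a custom optimizer that performs constant-step stochastic gradient updates. The ConstantStepOptimizer class and its attribute names are hypothetical; no base-class attributes are assumed beyond the documented constructor signature.

    import numpy as np

    from casimir.optim import Optimizer


    class ConstantStepOptimizer(Optimizer):
        """Hypothetical optimizer: stochastic gradient steps with a fixed learning rate."""

        def __init__(self, initial_model, reg_penalties=None, learning_rate=0.01):
            super().__init__(initial_model, reg_penalties)
            self.current_model = np.copy(initial_model)
            self.penalties = reg_penalties or []
            self.learning_rate = learning_rate

        def start_epoch(self, train_ifo, epoch):
            pass  # nothing to precompute for plain stochastic gradient steps

        def step(self, train_ifo, idx, iteration):
            gradient = train_ifo.gradient(self.current_model, idx)
            for penalty in self.penalties:
                gradient = gradient + penalty.gradient(self.current_model)
            self.current_model -= self.learning_rate * gradient

        def end_epoch(self):
            return self.current_model  # current iterate, used for logging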

class casimir.optim.CasimirSVRGOptimizer(initial_model, reg_penalties, learning_rate=None, grad_lipschitz_parameter=None, initial_smoothing_coefficient=None, smoothing_coefficient_update_rule='const', initial_moreau_coefficient=None, moreau_coefficient_update_rule='const', warm_start='prox-center')#

Implement Casimir (Catalyst with smoothing) or Catalyst outer loop with SVRG as the inner loop.

This optimizer can be invoked by calling optimize_ifo() with argument algorithm='casimir_svrg' or algorithm='catalyst_svrg'.

Use of smoothing:

This class can mimic the original Catalyst algorithm if initial_smoothing_coefficient and initial_moreau_coefficient are both set to None and moreau_coefficient_update_rule is set to 'const'.

Inexactness criterion:

Each inner SVRG loop is run for one epoch over the data.

Parameters:
  • initial_model – Initial iterate of optimizer.

  • reg_penalties – List of smooth regularization penalties.

The following named parameters can be passed through optim_options of optimize_ifo(); a usage sketch follows the lists of allowed values below:

Parameters:
  • learning_rate – Learning rate to use for inner SVRG iterations. Either learning_rate or grad_lipschitz_parameter must be specified.

  • grad_lipschitz_parameter – Estimate of the Lipschitz constant of the gradient. Inner SVRG iterations then use a learning rate of \(1/(L+\lambda + \kappa)\), where \(L\) is the grad_lipschitz_parameter, \(\lambda\) is the strong convexity of the regularization and \(\kappa\) is the Moreau coefficient added by Casimir. Either learning_rate or grad_lipschitz_parameter must be specified.

  • initial_smoothing_coefficient – The initial amount of smoothing \(\mu_0\) to add to non-smooth objectives. If None, no smoothing is added.

  • smoothing_coefficient_update_rule – The update rule that determines the amount of smoothing \(\mu_t\) to add in epoch \(t\). Allowed inputs are 'const' or 'linear' or 'expo'. See below for an explanation.

  • initial_moreau_coefficient – The initial weight \(\kappa_0\) of the proximal term \(\frac{\kappa_t}{2} \|w - z_{t-1}\|^2\) added by Casimir, where \(z_t\) is the prox-center. Default None, in which case the value suggested by theory is used, provided \(L\) has been specified and 'const' is used as moreau_coefficient_update_rule.

  • moreau_coefficient_update_rule – The update rule that determines the weight \(\kappa_t\) added by Casimir in epoch \(t\). The allowed values and the corresponding update rules are 'const', which uses \(\kappa_t = \kappa_0\) (default) and 'linear', which uses \(\kappa_t = t \kappa_0\).

  • warm_start – Warm start strategy to use to find the starting iterate of the next epoch. The allowed values are 'prox-center', 'prev-iterate', 'extrapolation'. See below for a description.

Allowed values for smoothing_coefficient_update_rule:

  • 'const' (default): \(\mu_t = \mu_0\)

  • 'linear': \(\mu_t = \mu_0 / t\)

  • 'expo': \(\mu_t = \mu_0 c_t^t\), where \(c_t = \sqrt{1 - \frac{1}{2}\sqrt{\frac{\lambda}{\lambda + \kappa_t}}}\). Here, \(\lambda\) is the strong convexity of the regularization and \(\kappa_t\) is the Moreau coefficient, the weight of the proximal term added by Casimir in epoch \(t\).

Allowed values for warm_start:

  • 'prox-center' (default): use \(z_{t-1}\), the prox center of the proximal term \(\frac{\kappa_t}{2} \|w - z_{t-1}\|^2\) added by Casimir

  • 'prev-iterate': use \(w_{t-1}\), the previous iterate

  • 'extrapolation': use \(w_{t-1} + \frac{\kappa_t}{\kappa_t + \lambda}(z_{t-1} - z_{t-2})\)
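A usage sketch of these options, assuming train_ifo is a SmoothedIncrementalFirstOrderOracle (e.g. the hypothetical AbsoluteLossIFO sketched earlier); all numerical values are placeholders.

    import numpy as np

    from casimir.optim import L2Penalty, optimize_ifo

    rng = np.random.RandomState(0)
    # AbsoluteLossIFO is the hypothetical sketch above, not part of casimir.optim.
    train_ifo = AbsoluteLossIFO(rng.randn(200, 10), rng.randn(200), smoothing_coefficient=1.0)

    final_model, logs = optimize_ifo(
        initial_model=np.zeros(10),
        train_ifo=train_ifo,
        algorithm='casimir_svrg',
        reg_penalties=[L2Penalty(regularization_parameter=0.01)],
        num_passes=50,
        optim_options={
            'grad_lipschitz_parameter': 10.0,       # L; inner learning rate 1/(L + lambda + kappa)
            'initial_smoothing_coefficient': 1.0,   # mu_0
            'smoothing_coefficient_update_rule': 'expo',
            'moreau_coefficient_update_rule': 'const',
            'warm_start': 'prox-center',
        },
    )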

end_epoch()#

Remove the prox penalty, compute \(\alpha, \beta\) and update the averaged iterate.

start_epoch(train_ifo, epoch)#

Update Moreau coefficient, smoothing coefficient, learning rate, warm start and batch gradient.

step(train_ifo, idx, iteration)#

Take a single inner SVRG step and update averaged iterate.

class casimir.optim.SGDOptimizer(initial_model, reg_penalties=None, learning_rate_scheme='const', averaging='wavg', initial_learning_rate=1.0, learning_rate_time_factor=100.0)#

Implement the stochastic (sub-)gradient method with various learning rate and averaging schemes.

This optimizer can be invoked by calling optimize_ifo() with argument algorithm='sgd'.

Parameters:
  • initial_model – Initial iterate of optimizer.

  • reg_penalties – List of smooth regularization penalties.

The following named parameters can be passed through optim_options of optimize_ifo(); a usage sketch follows the lists of allowed values below:

Parameters:
  • learning_rate_scheme – Learning rate schemes. Allowed values are 'const', 'pegasos' or 'linear'. See below for a description.

  • averaging – Use parameter averaging. Allowed values are 'none', 'uavg', 'wavg'. See below for a description.

  • initial_learning_rate – The parameter \(\eta_0\) used to determine the learning rate.

  • learning_rate_time_factor – The parameter \(t_0\) used to determine the learning rate.

Allowed values of the parameter learning_rate_scheme and the corresponding learning rates \(\eta_t\) at iteration \(t\) are:

  • 'const' (default): \(\eta_t = \eta_0\),

  • 'pegasos': \(\eta_t = 1/(\lambda t)\),

  • 'linear': \(\eta_t = \eta_0 / (1 + t/t_0)\)

Here, \(\eta_0, t_0\) are the parameters described above and \(\lambda\) is the strong convexity of the regularization.

Allowed values of the parameter averaging and the corresponding averaged iterates used are:

  • 'none': no averaging, use final iterate

  • 'uavg': uniform average \(\bar w_T = \frac{1}{T+1}\sum_{t=0}^{T} w_t\)

  • 'wavg' (default): weighted average \(\bar w_T = \frac{2}{T(T+1)} \sum_{t=0}^{T} t\, w_t\)
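A usage sketch of these options, assuming train_ifo is any IncrementalFirstOrderOracle (e.g. the hypothetical LeastSquaresIFO sketched earlier); all numerical values are placeholders.

    import numpy as np

    from casimir.optim import L2Penalty, optimize_ifo

    rng = np.random.RandomState(0)
    train_ifo = LeastSquaresIFO(rng.randn(100, 10), rng.randn(100))  # hypothetical sketch above

    final_model, logs = optimize_ifo(
        initial_model=np.zeros(10),
        train_ifo=train_ifo,
        algorithm='sgd',
        reg_penalties=[L2Penalty(regularization_parameter=0.1)],
        optim_options={
            'learning_rate_scheme': 'linear',     # eta_t = eta_0 / (1 + t / t_0)
            'initial_learning_rate': 1.0,         # eta_0
            'learning_rate_time_factor': 100.0,   # t_0
            'averaging': 'wavg',                  # weighted iterate averaging
        },
    )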

end_epoch()#

Return averaged iterate.

start_epoch(train_ifo, epoch)#

Do nothing.

step(train_ifo, idx, iteration)#

Perform a single SGD step and update averaged iterates.

class casimir.optim.SVRGOptimizer(initial_model, reg_penalties, learning_rate=1.0, smoothing_coefficient=None)#

Implement Stochastic Variance Reduced Gradient (SVRG) with optional smoothing.

This optimizer can be invoked by calling optimize_ifo() with argument algorithm='svrg'. The following named parameters can be passed through optim_options of optimize_ifo(): learning_rate and smoothing_coefficient.
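A usage sketch, assuming train_ifo is a (smoothed) oracle such as the hypothetical AbsoluteLossIFO sketched earlier; the values are placeholders.

    import numpy as np

    from casimir.optim import optimize_ifo

    rng = np.random.RandomState(0)
    train_ifo = AbsoluteLossIFO(rng.randn(200, 10), rng.randn(200))  # hypothetical sketch above

    final_model, logs = optimize_ifo(
        initial_model=np.zeros(10),
        train_ifo=train_ifo,
        algorithm='svrg',
        optim_options={'learning_rate': 0.1, 'smoothing_coefficient': 1.0},
    )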

end_epoch()#

Copy averaged iterate to main iterate and return averaged iterate.

start_epoch(train_ifo, epoch)#

Update smoothing coefficient, reset averaged iterate and update batch gradient.

step(train_ifo, idx, iteration)#

Take a single SVRG step and update averaged iterate.

casimir.optim.block_coordinate_frank_wolfe_optimize(initial_model, train_ifo, dev_ifo=None, test_ifo=None, reg_penalties=None, num_passes=100, termination_criterion=None, seed=25, logging=True, verbose=True)#

Implement the Block Coordinate Frank-Wolfe (BCFW) algorithm for structured prediction.

This algorithm is not in the incremental first order oracle model. It requires the task loss in addition to first order information of the objective function. From an implementation point of view, it requires an IFO with the linear_minimization_oracle method implemented.

Implemented here is Algorithm 4 of Lacoste-Julien et al., Block-coordinate Frank-Wolfe optimization for structural SVMs (2012).

Parameters:
  • initial_model – Starting point of the optimization algorithm.

  • train_ifo – (Smoothed) Incremental First Order Oracle for the function to minimize.

  • dev_ifo – Incremental First Order Oracle for development set. A value of None indicates no development set.

  • test_ifo – Incremental First Order Oracle for test set. A value of None indicates no test set.

  • seed – Integer to be used as random seed. Default 25.

  • reg_penalties – List of regularization.SmoothRegularizationPenalty objects. Here, \(r(w)\) corresponds to the sum of the penalties in the list. BCFW requires non-zero regularization, so reg_penalties must not be None.

  • num_passes – Maximum allowed number of passes through the data. Default 100.

  • termination_criterion – None or Callable that takes two arguments model, train_ifo and returns True if converged. If None, the given iteration budget num_passes is used to terminate optimization. Default None.

  • logging – (boolean) Log performance of optimization algorithm. Default True.

  • verbose – (boolean) Print performance of optimization algorithm, only if logging is True. Default True.

Returns:

The tuple (final iterate, logs).
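A usage sketch, assuming struct_svm_ifo is an oracle that implements linear_minimization_oracle (its construction is problem-specific and not shown here); the dimension and constants are placeholders.

    import numpy as np

    from casimir.optim import L2Penalty, block_coordinate_frank_wolfe_optimize

    dim = 100  # placeholder model dimension
    final_model, logs = block_coordinate_frank_wolfe_optimize(
        initial_model=np.zeros(dim),
        train_ifo=struct_svm_ifo,  # assumed to exist and to implement linear_minimization_oracle
        reg_penalties=[L2Penalty(regularization_parameter=0.1)],  # BCFW requires non-zero regularization
        num_passes=50,
    )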

casimir.optim.termination_gradient_norm(model, train_ifo, gradient_norm_tolerance=1e-05)#

Terminate optimization after gradient norm falls below a certain tolerance.

Parameters:
  • model – Current iterate of optimizer

  • train_ifo – (Smoothed) Incremental first order oracle.

  • gradient_norm_tolerance – Tolerance.

Returns:

True if the gradient norm is smaller than the tolerance, False otherwise.
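A sketch of passing this criterion to optimize_ifo() with a custom tolerance via functools.partial; train_ifo is assumed to be an oracle such as the hypothetical LeastSquaresIFO sketched earlier, and the values are placeholders.

    import functools

    import numpy as np

    from casimir.optim import optimize_ifo, termination_gradient_norm

    rng = np.random.RandomState(0)
    train_ifo = LeastSquaresIFO(rng.randn(100, 10), rng.randn(100))  # hypothetical sketch above

    criterion = functools.partial(termination_gradient_norm, gradient_norm_tolerance=1e-3)
    final_model, logs = optimize_ifo(
        initial_model=np.zeros(10),
        train_ifo=train_ifo,
        algorithm='svrg',
        termination_criterion=criterion,  # called as criterion(model, train_ifo)
    )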

Regularization#

class casimir.optim.L2Penalty(regularization_parameter, prox_center=None)#

Class representing the L2 regularization penalty.

For a prox-center \(z\) and regularization parameter \(\lambda\), the penalty takes the form \(r(w) = \frac{\lambda}{2}\|w-z\|^2_2\).

Parameters:
  • regularization_parameter – the parameter \(\lambda\) above.

  • prox_center – numpy.ndarray representing \(z\) above. A value of None is interpreted as the zero vector.

function_value(model)#

Return function value of regularization penalty.

gradient(model)#

Return gradient of penalty at model.

strong_convexity()#

Strong convexity coefficient w.r.t. L2 norm.
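A small numerical sketch of the documented formulas, with made-up vectors. From \(r(w) = \frac{\lambda}{2}\|w-z\|^2_2\), the gradient is \(\lambda(w - z)\) and the strong convexity coefficient with respect to the L2 norm is \(\lambda\).

    import numpy as np

    from casimir.optim import L2Penalty

    w = np.array([1.0, -2.0, 0.5])
    z = np.array([0.0, 1.0, 0.0])
    penalty = L2Penalty(regularization_parameter=0.5, prox_center=z)

    value = penalty.function_value(w)   # (0.5 / 2) * ||w - z||^2 = 0.25 * 10.25
    gradient = penalty.gradient(w)      # 0.5 * (w - z)
    lam = penalty.strong_convexity()    # 0.5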