API reference for casimir.optim#
Incremental First Order Oracles#
- class casimir.optim.IncrementalFirstOrderOracle#
Base class for incremental first order oracles (IFO).
For a function \(f(w)\) defined as \(f(w) = \frac{1}{n} \sum_{i=1}^n f_i(w)\), this class is an interface to implement methods that compute \(f_i(w)\) and its gradient \(\nabla f_i(w)\), as well as the batch function value \(f(w)\) and the batch gradient \(\nabla f(w)\) when each \(f_i\) is smooth. If \(f_i\) is not smooth, use the sub-class SmoothedIncrementalFirstOrderOracle instead.
A concrete implementation must at least override the methods function_value, gradient and __len__. Optionally, it may also override evaluation_function for domain-specific evaluation metrics such as classification accuracy. By default, the methods batch_function_value and batch_gradient are implemented as a loop that averages over the outputs of function_value and gradient respectively. If a more efficient implementation exists, these methods can be overridden by derived classes as well.
The input idx to function_value and gradient must not exceed \(n\). This is not explicitly checked here.
- batch_function_value(model)#
Return the function value \(f(w)\), where \(w\) is model.
- batch_gradient(model)#
Return the gradient \(\nabla f(w)\), where \(w\) is model.
- evaluation_function(model)#
Return domain-specific task metrics (default None).
- function_value(model, idx)#
Return the function value \(f_i(w)\), where \(w\) is model and \(i\) is idx.
- gradient(model, idx)#
Return the gradient \(\nabla f_i(w)\), where \(w\) is model and \(i\) is idx.
- class casimir.optim.SmoothedIncrementalFirstOrderOracle(smoothing_coefficient=None)#
Base class of smoothed incremental first order oracles.
For a function \(f(w)\) defined as \(f(w) = \frac{1}{n} \sum_{i=1}^n f_i(w)\), where each \(f_i\) is non-smooth but smoothable, this class is an interface to implement methods to compute \(f_{i, \mu}(w)\) and its gradient \(\nabla f_{i, \mu}(w)\), as well as the batch function value \((f(w), f_{\mu}(w))\) and the batch gradient \(\nabla f_{\mu}(w)\). Here, \(g_\mu\) denotes a smooth surrogate of a non-smooth function \(g\), parameterized by a smoothing coefficient \(\mu\).
When \(\mu\) is None (i.e., no smoothing), the implementation must serve as an IFO for the original non-smooth function \(f\), in which case \(\nabla f_i\) refers to a subgradient of \(f_i\).
Note
This class contains a field smoothing_coefficient to represent \(\mu\), which can be mutated by optimization algorithms or other functions that use adaptive smoothing schemes.
- function_value(model, idx)#
Return the pair \(\big( f_i(w), f_{i, \mu}(w) \big)\). If smoothing_coefficient is None, i.e., no smoothing, simply return \(f_i(w)\). Here, \(i\) represents the index idx.
- gradient(model, idx)#
Return the gradient \(\nabla f_{i, \mu}(w)\) if smoothing_coefficient is not None, or \(\nabla f_i(w)\) if smoothing_coefficient is None.
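To sketch how smoothing fits this interface, consider the non-smooth absolute loss \(f_i(w) = |x_i^\top w - y_i|\) and its \(\ell_2\)-smoothed (Huber-like) surrogate. The oracle below is an illustrative assumption, not the library's code; only the constructor signature and the smoothing_coefficient field follow the class documented above.
```python
import numpy as np
from casimir.optim import SmoothedIncrementalFirstOrderOracle


class AbsoluteLossIFO(SmoothedIncrementalFirstOrderOracle):
    """Hypothetical smoothed IFO for f_i(w) = |x_i^T w - y_i|."""

    def __init__(self, X, y, smoothing_coefficient=None):
        super(AbsoluteLossIFO, self).__init__(smoothing_coefficient)
        self.X, self.y = X, y

    def function_value(self, model, idx):
        r = self.X[idx].dot(model) - self.y[idx]
        if self.smoothing_coefficient is None:
            return abs(r)
        mu = self.smoothing_coefficient
        # l2-smoothed surrogate (Huber): r^2 / (2*mu) near zero, |r| - mu/2 otherwise
        smoothed = r ** 2 / (2.0 * mu) if abs(r) <= mu else abs(r) - mu / 2.0
        return abs(r), smoothed

    def gradient(self, model, idx):
        r = self.X[idx].dot(model) - self.y[idx]
        if self.smoothing_coefficient is None:
            return np.sign(r) * self.X[idx]  # a subgradient of |r|
        return np.clip(r / self.smoothing_coefficient, -1.0, 1.0) * self.X[idx]

    def __len__(self):
        return self.X.shape[0]
```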
Optimization Algorithms#
- casimir.optim.optimize_ifo(initial_model, train_ifo, dev_ifo=None, test_ifo=None, algorithm='SGD', reg_penalties=None, num_passes=100, termination_criterion=None, seed=25, logging=True, verbose=True, optim_options=None)#
Minimize a convex function with access to a (smoothed) incremental first order oracle using a primal algorithm.
Functions that can be handled are of the form \(f(w) = \frac{1}{n} \sum_{i=1}^n f_i(w) + r(w)\), where a (smoothed) incremental first order oracle gives access to each \(f_i\), and \(r(w)\) is a (sum of) regularization penalties. The argument algorithm controls which optimization algorithm is used.
- Parameters:
initial_model (numpy.ndarray) – Starting iterate of the optimization algorithm.
train_ifo (incremental_first_order_oracle.IncrementalFirstOrderOracle or incremental_first_order_oracle.SmoothedIncrementalFirstOrderOracle) – (Smoothed) Incremental First Order Oracle for the function to minimize.
dev_ifo (incremental_first_order_oracle.IncrementalFirstOrderOracle or incremental_first_order_oracle.SmoothedIncrementalFirstOrderOracle or None) – Incremental First Order Oracle for the development set. A value of None indicates no development set.
test_ifo (incremental_first_order_oracle.IncrementalFirstOrderOracle or incremental_first_order_oracle.SmoothedIncrementalFirstOrderOracle or None) – Incremental First Order Oracle for the test set. A value of None indicates no test set.
algorithm – Optimization algorithm to use. Should be from the list ['sgd', 'svrg', 'casimir_svrg'].
seed – Integer to be used as random seed. Default 25.
reg_penalties (List of regularization.SmoothRegularizationPenalty or None) – List of regularization penalties, where \(r(w)\) corresponds to the sum of penalties in the list. Default None, in which case no regularization penalty is applied.
num_passes – Maximum allowed number of passes through the data. Default 100.
termination_criterion – None or Callable; takes two arguments model, train_ifo and returns True if converged. If None, the given iteration budget num_passes is used to terminate optimization. Default None. See termination_gradient_norm() for an example.
logging – (boolean) Log performance of the optimization algorithm. Default True.
verbose – (boolean) Print performance of the optimization algorithm, only if logging is True. Default True.
optim_options (dict) – Algorithm-specific options. Default None. Consult the named parameters of specific optimizers for allowed options.
- Returns:
The tuple (final iterate, logs), where logs is a list of lists, one for each epoch. Each inner list contains the epoch number, train function value, dev function value, dev evaluation metric, test function value, test evaluation metric, and the time taken to perform this epoch. The dev (resp. test) statistics are printed only if dev_ifo (resp. test_ifo) is not None.
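A hedged usage sketch of optimize_ifo(), reusing the hypothetical LeastSquaresIFO from earlier; the keyword arguments follow the signature above and the option names documented for SGDOptimizer below, while the numeric values are placeholders.
```python
import numpy as np
from casimir.optim import optimize_ifo, L2Penalty

# Illustrative data and oracle; LeastSquaresIFO is the hypothetical subclass sketched above.
X, y = np.random.randn(100, 10), np.random.randn(100)
train_ifo = LeastSquaresIFO(X, y)

final_model, logs = optimize_ifo(
    initial_model=np.zeros(10),
    train_ifo=train_ifo,
    algorithm='sgd',
    reg_penalties=[L2Penalty(regularization_parameter=0.1)],
    num_passes=20,
    optim_options={'initial_learning_rate': 0.1, 'averaging': 'wavg'},
)
```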
- class casimir.optim.Optimizer(initial_model, reg_penalties=None)#
Base class for optimizers using (smoothed) incremental first order oracles.
Optimizer classes store the state of the variables to be optimized.
Any subclass must override the methods start_epoch, step, and end_epoch.
- end_epoch()#
Perform optional computation to end epoch and return current iterate to be used for logging.
- Returns:
model, denoting optimizer state at current time.
- start_epoch(train_ifo, epoch)#
Prepare for start of epoch.
Example uses are the warm start in Casimir-SVRG or the batch gradient computation in SVRG. Adaptive smoothing algorithms may update the smoothing parameter here.
- Parameters:
train_ifo – (Smoothed) Incremental First Order Oracle for the training set. This object may be mutated by adaptive smoothing algorithms.
- step(train_ifo, idx, iteration)#
Take a step based on the sample indexed by idx.
- Parameters:
train_ifo – (Smoothed) Incremental First Order Oracle for the training set.
idx – Index of sample to use to take a step.
iteration – Global iteration number. Required, e.g., by SGDOptimizer to compute the learning rate.
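To make the start_epoch/step/end_epoch contract concrete, here is an illustrative driver loop in the spirit of optimize_ifo(); it is a sketch of how such an optimizer could be driven, not the library's actual implementation.
```python
import numpy as np


def run_epochs(optimizer, train_ifo, num_passes, seed=25):
    """Illustrative driver over the Optimizer interface (not the library's own loop)."""
    rng = np.random.RandomState(seed)
    iteration = 0
    model = None
    for epoch in range(num_passes):
        optimizer.start_epoch(train_ifo, epoch)   # e.g., warm start or batch gradient
        for _ in range(len(train_ifo)):
            idx = rng.randint(len(train_ifo))     # sample an index uniformly at random
            optimizer.step(train_ifo, idx, iteration)
            iteration += 1
        model = optimizer.end_epoch()             # current iterate, used for logging
    return model
```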
- class casimir.optim.CasimirSVRGOptimizer(initial_model, reg_penalties, learning_rate=None, grad_lipschitz_parameter=None, initial_smoothing_coefficient=None, smoothing_coefficient_update_rule='const', initial_moreau_coefficient=None, moreau_coefficient_update_rule='const', warm_start='prox-center')#
Implement Casimir (Catalyst with smoothing) or Catalyst outer loop with SVRG as the inner loop.
This optimizer can be invoked by calling optimize_ifo() with the argument algorithm='casimir_svrg' or algorithm='catalyst_svrg'.
- Use of smoothing:
This class can mimic the original Catalyst algorithm if initial_smoothing_coefficient and initial_moreau_coefficient are both set to None, and the moreau_coefficient_update_rule is set to 'const'.
- Inexactness criterion:
Each inner SVRG loop is run for one epoch over the data.
- Parameters:
initial_model – Initial iterate of optimizer.
reg_penalties – List of smooth regularization penalties.
The following named parameters can be passed through optim_options of optimize_ifo():
- Parameters:
learning_rate – Learning rate to use for inner SVRG iterations. Either learning_rate or grad_lipschitz_parameter must be specified.
grad_lipschitz_parameter – Estimate of Lipschitz constant of gradient. Inner SVRG iterations then use a learning rate of \(1/(L+\lambda + \kappa)\), where \(L\) is the grad_lipschitz_parameter, \(\lambda\) is the strong convexity of the regularization and \(\kappa\) is the Moreau coefficient added by Casimir. Either learning rate or grad_lipschitz_parameter must be specified.
initial_smoothing_coefficient – The initial amount of smoothing \(\mu_0\) to add to non-smooth objectives. If None, no smoothing is added.
smoothing_coefficient_update_rule – The update rule that determines the amount of smoothing \(\mu_t\) to add in epoch \(t\). Allowed inputs are 'const', 'linear' or 'expo'. See below for an explanation.
initial_moreau_coefficient – The initial weight \(\kappa_0\) of the proximal term \(\frac{\kappa_t}{2} \|w - z_{t-1}\|^2\) added by Casimir, where \(z_t\) is the prox-center. Default None, in which case the value suggested by theory is used, provided \(L\) has been specified and 'const' is used as moreau_coefficient_update_rule.
moreau_coefficient_update_rule – The update rule that determines the weight \(\kappa_t\) added by Casimir in epoch \(t\). The allowed values and corresponding update rules are 'const', which uses \(\kappa_t = \kappa_0\) (default), and 'linear', which uses \(\kappa_t = t \kappa_0\).
warm_start – Warm start strategy used to find the starting iterate of the next epoch. The allowed values are 'prox-center', 'prev-iterate' and 'extrapolation'. See below for a description.
- Allowed values for smoothing_coefficient_update_rule:
'const': \(\mu_t = \mu_0\) (default)
'linear': \(\mu_t = \mu_0 / t\)
'expo': \(\mu_t = \mu_0 c_t^t\), where \(c_t = \sqrt{1 - \frac{1}{2}\sqrt{\frac{\lambda}{\lambda + \kappa_t}}}\). Here, \(\lambda\) is the strong convexity of the regularization and \(\kappa_t\) is the Moreau coefficient, i.e., the weight of the proximal term added by Casimir in epoch \(t\).
- Allowed values for warm_start:
'prox-center': Use \(z_{t-1}\), the prox-center of the proximal term \(\frac{\kappa_t}{2} \|w - z_{t-1}\|^2\) added by Casimir (default)
'prev-iterate': Use \(w_{t-1}\), the previous iterate
'extrapolation': Use \(w_{t-1} + \frac{\kappa_t}{\kappa_t + \lambda}(z_{t-1} - z_{t-2})\)
- end_epoch()#
Remove the prox penalty, compute \(\alpha, \beta\) and update the averaged iterate.
- start_epoch(train_ifo, epoch)#
Update Moreau coefficient, smoothing coefficient, learning rate, warm start and batch gradient.
- step(train_ifo, idx, iteration)#
Take a single inner SVRG step and update averaged iterate.
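A hedged invocation sketch using the hypothetical AbsoluteLossIFO from earlier; the option names match the parameters documented for this class, while the numeric values are placeholders.
```python
import numpy as np
from casimir.optim import optimize_ifo, L2Penalty

# AbsoluteLossIFO is the hypothetical smoothed oracle sketched earlier.
train_ifo = AbsoluteLossIFO(np.random.randn(100, 10), np.random.randn(100))

model, logs = optimize_ifo(
    initial_model=np.zeros(10),
    train_ifo=train_ifo,
    algorithm='casimir_svrg',
    reg_penalties=[L2Penalty(regularization_parameter=0.1)],
    num_passes=50,
    optim_options={
        'grad_lipschitz_parameter': 10.0,          # L; inner learning rate is 1/(L + lambda + kappa)
        'initial_smoothing_coefficient': 1.0,      # mu_0
        'smoothing_coefficient_update_rule': 'expo',
        'warm_start': 'prox-center',
    },
)
```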
- class casimir.optim.SGDOptimizer(initial_model, reg_penalties=None, learning_rate_scheme='const', averaging='wavg', initial_learning_rate=1.0, learning_rate_time_factor=100.0)#
Implement the stochastic (sub-)gradient method with various learning rate and averaging schemes.
This optimizer can be invoked by calling optimize_ifo() with the argument algorithm='sgd'.
- Parameters:
initial_model – Initial iterate of optimizer.
reg_penalties – List of smooth regularization penalties.
The following named parameters can be passed through optim_options of optimize_ifo():
- Parameters:
learning_rate_scheme – Learning rate scheme. Allowed values are 'const', 'pegasos' or 'linear'. See below for a description.
averaging – Parameter averaging scheme. Allowed values are 'none', 'uavg' or 'wavg'. See below for a description.
initial_learning_rate – The parameter \(\eta_0\) used to determine the learning rate.
learning_rate_time_factor – The parameter \(t_0\) used to determine the learning rate.
Allowed values of the parameter learning_rate_scheme and the corresponding learning rates \(\eta_t\) at iteration \(t\) are:
'const' (default): \(\eta_t = \eta_0\)
'pegasos': \(\eta_t = 1/(\lambda t)\)
'linear': \(\eta_t = \eta_0 / (1 + t/t_0)\)
Here, \(\eta_0, t_0\) are the parameters described above and \(\lambda\) is the strong convexity.
Allowed values of the parameter averaging and the corresponding averaged iterates used are:
'none': no averaging, use the final iterate
'uavg': uniform average \(\bar w_T = \frac{1}{T+1}\sum_{t=0}^{T} w_t\)
'wavg' (default): weighted average \(\bar w_T = \frac{2}{T(T+1)} \sum_{t=0}^{T} t\, w_t\)
- end_epoch()#
Return averaged iterate.
- start_epoch(train_ifo, epoch)#
Do nothing.
- step(train_ifo, idx, iteration)#
Perform a single SGD step and update averaged iterates.
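The averaged iterates above can be maintained incrementally instead of storing the whole trajectory; the recursions below follow directly from the 'uavg' and 'wavg' formulas and are shown only as an illustration, not as the library's implementation.
```python
def update_average(avg, w, t, scheme='wavg'):
    """Incremental update of the averaged iterate after producing w = w_t, for t >= 1.

    Initialize avg = w_0 for 'uavg' and avg = 0 for 'wavg' (the t = 0 term has zero weight).
    """
    if scheme == 'none':
        return w  # no averaging: track the latest iterate
    if scheme == 'uavg':
        return avg + (w - avg) / (t + 1.0)  # uniform average of w_0, ..., w_t
    # 'wavg': weights proportional to t, i.e. bar_w_t = 2/(t(t+1)) * sum_s s * w_s
    return (t - 1.0) / (t + 1.0) * avg + 2.0 / (t + 1.0) * w
```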
- class casimir.optim.SVRGOptimizer(initial_model, reg_penalties, learning_rate=1.0, smoothing_coefficient=None)#
Implement Stochastic Variance Reduced Gradient (SVRG) with optional smoothing.
This optimizer can be invoked by calling optimize_ifo() with the argument algorithm='svrg'. The following named parameters can be passed through optim_options of optimize_ifo(): learning_rate and smoothing_coefficient.
- end_epoch()#
Copy averaged iterate to main iterate and return averaged iterate.
- start_epoch(train_ifo, epoch)#
Update smoothing coefficient, reset averaged iterate and update batch gradient.
- step(train_ifo, idx, iteration)#
Take a single SVRG step and update averaged iterate.
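For completeness, a hedged example of the corresponding optim_options; the numeric values are placeholders, not recommendations.
```python
# Passed through optim_options of optimize_ifo() with algorithm='svrg'.
svrg_options = {'learning_rate': 0.01, 'smoothing_coefficient': 1.0}
```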
- casimir.optim.block_coordinate_frank_wolfe_optimize(initial_model, train_ifo, dev_ifo=None, test_ifo=None, reg_penalties=None, num_passes=100, termination_criterion=None, seed=25, logging=True, verbose=True)#
Implement the Block Coordinate Frank-Wolfe (BCFW) algorithm for structured prediction.
This algorithm is not in the incremental first order oracle model. It requires the task loss in addition to first order information of the objective function. From an implementation point of view, it requires an IFO with the linear_minimization_oracle method implemented.
Implemented here is Algorithm 4 of Lacoste-Julien et al., Block-coordinate Frank-Wolfe optimization for structural SVMs (2012).
- Parameters:
initial_model – Starting point of the optimization algorithm.
train_ifo – (Smoothed) Incremental First Order Oracle for the function to minimize.
dev_ifo – Incremental First Order Oracle for the development set. A value of None indicates no development set.
test_ifo – Incremental First Order Oracle for the test set. A value of None indicates no test set.
seed – Integer to be used as random seed. Default 25.
reg_penalties – List of regularization.SmoothRegularizationPenalty objects. Here, \(r(w)\) corresponds to the sum of penalties in the list. BCFW requires non-zero regularization, so reg_penalties must not be None.
num_passes – Maximum allowed number of passes through the data. Default 100.
termination_criterion – None or Callable; takes two arguments model, train_ifo and returns True if converged. If None, the given iteration budget num_passes is used to terminate optimization. Default None.
logging – (boolean) Log performance of the optimization algorithm. Default True.
verbose – (boolean) Print performance of the optimization algorithm, only if logging is True. Default True.
- Returns:
The tuple (final iterate, logs).
- casimir.optim.termination_gradient_norm(model, train_ifo, gradient_norm_tolerance=1e-05)#
Terminate optimization after gradient norm falls below a certain tolerance.
- Parameters:
model – Current iterate of optimizer
train_ifo – (Smoothed) Incremental first order oracle.
gradient_norm_tolerance – Tolerance.
- Returns:
True if the gradient norm is smaller than the tolerance, False otherwise.
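Since a termination criterion is any callable taking (model, train_ifo), a variant of this function with a different tolerance can be built with functools.partial; the tolerance value below is only an example.
```python
import functools
from casimir.optim import termination_gradient_norm

# Any callable of the form criterion(model, train_ifo) -> bool can serve as a
# termination criterion; here we reuse the provided gradient-norm test with a
# looser tolerance and pass it as termination_criterion=loose_criterion to optimize_ifo().
loose_criterion = functools.partial(termination_gradient_norm, gradient_norm_tolerance=1e-3)
```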
Regularization#
- class casimir.optim.L2Penalty(regularization_parameter, prox_center=None)#
Class representing the L2 regularization penalty.
For a prox-center \(z\) and regularization parameter \(\lambda\), the penalty takes the form \(r(w) = \frac{\lambda}{2}\|w-z\|^2_2\).
- Parameters:
regularization_parameter – the parameter \(\lambda\) above.
prox_center – Ndarray representing \(z\) above. A value of None is interpreted as the zero vector.
- function_value(model)#
Return function value of regularization penalty.
- gradient(model)#
Return gradient of penalty at model.
- strong_convexity()#
Strong convexity coefficient w.r.t. L2 norm.
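The quantities exposed by this class follow directly from the definition above; a minimal NumPy sketch of the same formulas, shown for illustration rather than as the library's implementation:
```python
import numpy as np


def l2_penalty_value(model, regularization_parameter, prox_center=None):
    """r(w) = (lambda / 2) * ||w - z||_2^2, with z = 0 when prox_center is None."""
    z = np.zeros_like(model) if prox_center is None else prox_center
    return 0.5 * regularization_parameter * np.sum((model - z) ** 2)


def l2_penalty_gradient(model, regularization_parameter, prox_center=None):
    """grad r(w) = lambda * (w - z); the strong convexity w.r.t. the L2 norm is lambda."""
    z = np.zeros_like(model) if prox_center is None else prox_center
    return regularization_parameter * (model - z)
```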