In
supervised learning, we seek to learn to predict a target variable from
data. The way this
happens is by having access to a number of data vectors with their
associated targets (i.e. labels or physical measurements), defining
some functional form that maps data to targets, and a merit criterion
which is small when estimates are close to targets. One salient
question for this domain is how to choose the functional form such that
the resulting optimization problem is solvable, but sufficiently
general so as to be able to capture important relationships between
data and targets. In this thread of research, we investigate how to
manage this tradeoff for the case that estimators belong to a
Reproducing Kernel Hilbert Space (RKHS), which yields a very rich class
of estimators. However, owing to the Representer Theorem, this choice
also makes the complexity of the optimization problem comparable to the
training sample size, which is untenable for streaming applications or
large-scale problems. However, by carefully constructing subspaces onto
which we project RKHS-valued stochastic gradient algorithms, we are
able to find optimally compressed kernelized regression functions. This
is in contrast to a long history of literature trying to ameliorate the
complexity growth of kernel methods with the training sample size that
typically violate the optimality properties of the iterative sequences
to which they’re applied.

We call this method Parsimonious Online
Learning with Kernels (POLK), and illustrate a couple examples of the
decision surface defined by POLK when used with a Gaussian kernel for
two-dimensional multi-class classification. On the left we visualize
the surface defined by kernel logistic regression; on the right a
kernel support vector machine classifier is visualized. [Video]

These results are explained in detail in the following journal paper:

A. Koppel, G. Warnell, E. Stump, and A. Ribeiro, “Parsimonious Online Learning with Kernels via Sparse Projections in Function Space,” in Journal of Machine Learning Research
(under review), Nov. 2016

**Risk-Aware Learning** We have recently investigated how this framework can be extended to overcome outliers and ill-conditioned data in streaming scenarios. We propose to do so through replacing the standard supervised learning objective by one that incorporates a dispersion statistic, which can be mathematically encapsulated by coherent risk. Coherent risk is a concept from operations research/mathematical finance which quantifies the uncertainty in a decision.

To date, risk has not been used in nonparametric estimation. There are two ways to modify online supervised learning procedures to overcome outliers and become risk-aware. One would be to add the risk function as a penalty term to the objective. The motivation fo this formulation can be connected directly from the bias-variance (estimation-approximation) tradeoff in machine learning. Minimizing risk functions, however, requires addressing the fact that they have a structural form where there are two nested expectations in the objective, and hence the stochastic gradient method inside POLK must be replaced with stochastic quasi-gradient (SQG) method. The use of SQG together with the subspace projections of POLK, their convergence, and experimental gains, is the subject of an Apr. 2018 submission Mathematical Programming submission.

A. Bedi Singh, A. Koppel, and K. Rajawat. ”Nonparametric Compositional Stochastic Optimization: Algorithms for Robust Online Learning with Kernels,” in Mathematical Programming (submitted), Apr. 2018.

Alternatively, one may write the risk function as a nonlinear constraint. However, to come up with kernelized learning algorithms, one must first extend the Representer Theorem to cases with constraints. We propose this extension, and then upon this foundation we develop a primal-dual method for solving the constrained optimization problem in function space. This formulation is equally applicable to risk-aware learning and model predictive control with obstacle-avoidance constraints. These contributions are summarized in the paper:

A. Koppel, K. Zhang, H. Zhu, and T. M. Baser. ”Projected Stochastic Primal-Dual Method
for Constrained Online Learning with Kernels” in IEEE Trans. Signal Process. (submitted),
Apr. 2018.

In the following plot, we display how a canonical risk function, called conditional value at risk, changes the decision surface and model complexity of a nonparametric estimator, for the cartoon data set we display above.

Here we display how these methods work for an online kernel SVM classifier with optimal memory-reduction. On the left, we place test set accuracy for this synthetic Gaussian mixture point cloud for primal-dual method, the new innovation here, as compared with its unconstrained variant (POLK) and an approximate penalty method. In the middle and right plot, when risk-constraints are enforced, then confidence regions are only drawn far from regions of class overlap. This happens because the loss function contains more uncertainty in these regions. Therefore, by ensuring risk is constrained, classifications are still made correctly but are averse to uncertainty regions.

In
reinforcement learning, an autonomous agent observes a sequence
of rewards as it traverses its environment based upon the actions it
takes. Depending on which actions it takes, it may end up in more or
less rewarding states. The question is how to choose the policy, i.e.,
the map from states to actions, so as to ensure the long-run
(discounted) accumulation of rewards is maximized. This long-run
accumulation is called the value. Classical analysis on this topic has
shown that it’s possible to reformulate the problem of both policy
evaluation and policy learning as different function-valued fixed point
problems called Bellman equations if the state and action spaces are
continuous. Thus, many classical techniques for RL use stochastic fixed
point iterations, e.g., Temporal Difference Learning and Q Learning.
These methods break down when combined with function approximation,
however, which is intrinsically required for continuous Markov Decision
Problems. My approach to mitigating these issues is to reformulate the
fixed point problems defined by Bellman’s evaluation into compositional
optimization problems, for which it is possible to develop iterative
algorithms that can be shown to have globally stable Lyapunov
functions. This is done by allowing the reproducing Kernel Hilbert
space to define the value function parameterization. See an example of
how this works on the Mountain Car domain in the below plot, where we
should that the Bellman error converges and we obtain a
memory-efficient value function that is transparently interpretable in
the contour plot. We handle the growth of the kernel parameterization
through sparse subspace projections in a manner very similar to POLK.

These results are explained in detail in the following journal paper:

A. Koppel, G. Warnell, E. Stump, P. Stone, and A. Ribeiro. “Policy Evaluation in Continuous MDPs with Efficient Kernelized Gradient Temporal Difference,” in IEEE Trans. Automatic Control (submitted), Dec. 2017.

This approach may be adapted to solve the Bellman optimality
equation for action-value functions. In the below left contour plot we show how
this may be used to solve the Continuous Mountain car and obtain
interpretable results. On the left below, the
value function associated with the learned policy that is obtained by
maximizing the action-value function. On the right we plot the
resulting policy.

These results are explained in detail in the following journal paper:

A. Koppel, E. Tolstaya, E. Stump, and A. Ribeiro. ”Nonparametric Stochastic Compositional
Gradient Descent for Q-Learning in Continuous Markov Decision Problems” in IEEE
Trans. Automatic Control (submitted), Mar. 2018.

The main questions these works motivate is how to overcome the need for smoothness in the function approximation through the necessity of Lipschitz gradients, due to the fact that Bellman's optimality equation is non-smooth: the presence of the maximum and the fact that the reward are non-smooth are intrinsic to the problem setting. Moreover, reformulating the problem as a compositional problem yields a significantly slower learning rate than stochastic fixed point methods. However, it is a long-standing open question how to make stochastic fixed point methods stable under function approximation.

Many
machine learning and signal processing problems may be formulated as
convex optimization problems with an objective that may be written as
the sum of
predictive risks evaluated at each sample point of a training data set.
This is variously referred to as empirical risk minimization or
stochastic average approximation. When the number of sample points is
static and not too large, deterministic iterative algorithms such as
gradient descent or Newton's method provide effective tools for
solviong this class of problems. When the number of sample
points is huge-scale, possible infinite, stochastic approximation
becomes necessary. Stochastic gradient methods operate on stochastic
gradients in lieu of true gradients, which allow for processing data
points one at a time, and under suitable conditions converge to the
minimum of an expected
risk
over all data realizations. This methodology has been around since a
landmark paper of Robbins and Monroe in 1951, but is increasingly
popular in the "Big Data" era. Within this problem class, I have
worked on how to handle learning
problems where the number of predictive parameters, or features, is
comparable to the (huge-scale) number of sample points. I have
developed a
parallelized doubly stochastic approximation algorithms, which allows
for operating on random subsets of both sample points and
features. It does so by combing the merits of stochastic gradient
method with random coordinate descent.

[Video]

These results are explained in detail in the following journal paper:

A. Mokhtari, A. Koppel, and A. Ribeiro, “A Class of Doubly Random Parallel Stochastic Methods for Large Scale Learning,” in Journal of Machine Learning Research (under review), June 2016

**Linearly Constrained Problems** In many applications, one would line to solve variants of this problem class when linear equality constraints are present. These constraints arise for instance, in wireless communications, from rate capacity constraints for communications along the network links, or in parallel computing architectures when one would like to exploit the computational capability of distinct parallel nodes but then ensure that each node's local parameter estimates coincide with all others.

In this research thread, we focus on designing dual methods for linearly constrained convex problems, focused on attaining state of the art accelerated rates with weaker than existing assumptions. Specifically, given that linear convergence remains elusive for first-order dual methods for constrained problems, even with strong convexity, we ask the following question: is Nesterov’s optimal rate O(1/k^2) realizable by first-order accelerated dual methods for constrained problems without strong convexity?
The contribution of this work is an affirmative answer, based on synthesizing accelerated methods, both in continuous- and discrete-time, using Lyapunov functions and matrix inequalities. Specifically, for a desired convergence rate (which we assume it is attainable) as a design parameter, we synthesize an algorithm that can achieve that rate of convergence. For doing so, we derive a matrix inequality that couples the convergence rate to the parameters of the algorithm. By solving the matrix inequality symbolically, we obtain the parameters of the algorithm as a function of the desired convergence rate. We attain exponential convergence in continuous time for strongly convex smooth objectives, and an O(1/k^2) rate for discrete-time algorithms for objectives that are neither strongly convex nor differentiable.

For a simple model predictive control problem with a quadratic objective and linear dynamics, we observe that our accelerated variant of the method of multipliers, which is synthesized from a continuous time synthetic mechanical system, we observe state of the art convergence rates.

For more details, please see:

M. Fazlyab, A. Koppel, V. Preciado, and A. Ribeiro, ``A Variational Approach to Dual Methods for Constrained Convex Optimization," in IEEE Trans. Automatic Control (submitted), Nov. 2017.

Multi-agent systems refer to settings in which distinct computing nodes that may communicate with one another according to some network structure aim to solve a task which is global to the network. Sensor networks, computers networks, and robotic teams may all be described with this perspective. Classically, this problem class has led to the development of decentralized optimization algorithms, where distinct nodes operate on local information only. Through appropriate information mixing strategies, agents may learn a decision variable which is optimal in terms of observations aggregated globally over the network.

My
contributions to this problem domain are in developing new information
mixing strategies. Prior approaches are based upon schemes where agents
combine a weighted average of neighbors decision variables with a local
gradient step ("the consensus protocol"), whereas the method I propose
for this setting, called the saddle point algorithm, expoits duality in
convex optimization theory to have agents approximately enforce
local agreement constraints while learning based upon local
information. This saddle point method can solve more general classes of
decentralized optimization problems than consensus methods, i.e.
problems in which each agent's cost function is parameterized by a
random variable with a possibly distinct distribution function, or more
general types of network constraints. Moreover,
these primal-dual algorithms are well-suited to online or dynamic
settings where training observations become sequentially available, and
agents must repeatedly adapt to new information. [Video]

Robotics
is an exciting field for signal processing researchers because there
are so many frontiers to which we may explore and contribute.
My particular focus has been on learning-based control in single-robot
and multi-robot systems. In particular, I study information processing
strategies for streaming sensory data, such that useful intelligence
may be extracted from the robot's operating domain for adapting its
control strategies.This problem breaks down into three components (1)
how to train a statistical model online relating sensory data to some
label or output variable space, (2) how to map the predictions of this
statistical model into the control space, and (3) tailoring the robotic
control strategy to the statistics of its operating environment.

For problem (1), I have developed extensions of task-driven dictionary learning to decentralized settings. Task-driven dictionary learning is a method for learning a signal representation and statistical model relating a feature space and predictive variable jointly in a manner that is tuned for minimizing the expected prediction risk. By creating a framework for this task in multi-agent systems, I've provided the capability for robotic networks to handle streaming data and collaboratively exchange information to maximize the informativeness of the paths they've traversed for a particular prediction task.

The
task of how to map statistical predictions into the control
space of the autonomous system (2) classically has been treated with
robust control techniques, in which one adds chance constraints or
stochastic variables to the objective of an optimal control
formulation. While doing yields robust performance for linear systems,
most robotic kinematics are highly nonlinear, in which case tried and
true methods for linear systems are not appropriate. To circumvent this
issue, I am considering developing new formulations of receding horizon
control of nonlinear systems where the cost functional automatically
switches between optimal and robust control based on the level of
predicted uncertainty in the environment. By approaching task (3) in
this manner, I am developing a method such that an autonomous system
may learn online to compensate for the simplifications
made in kinematic models of robot platforms that are typically
used to generate control signals, and thus have a robot adjust in
real-time to unexpected changes in the environment in which it is
operating. By learning the uncertainty in a control action, we may use
this estimate to correct for deviations between ground truth and some
kinematic model. In the below figure, the red path
is the one that uses uncertainty estimated by simple averaging; the
green path is the ground truth path taken by the robot; and the blue
path is the result of our uncertainty estimate. On the left we picture
these paths before learning whereas on the right we picture them after
learning. After learning, they match very closely.

Prior to coming to UPenn, the unifying theme of my research projects was Monte Carlo and stochastic diffusion approximation algorithms. My undergraduate research project in the Math Dept. at Washington University in St. Louis was to study probabilistic models of predator-prey interactions in migrational settings. To do so, I made use of tools in stochastic simulation and diffusion approximations to discern under which cases population stability, explosion, and extinction occurred across different sites in a nonlinear networked dynamical system.

I spent 2011-2012 at the U.S. Army Research Laboratory ALC as part of the Science Outreach for Army Research (SOAR) program investigating different frameworks for modeling robotic networks in complex domains. The first project I worked on was to evaluate a "bio--inspired" control strategy on an asynchronous parallel computing architecture. Numerically I was able to conclude that if the amount of information processed passed a certain threshold, the system was controllable. The other project I worked on was generalizing a leader-follower framework for a robotic network to maintain communications connectivity while exploring the environment. I implemented a simulated annealing strategy to overcome the tendency of follower-robots becoming trapped at obstacles in the environment, which is a consequence of non-convexity of the feasible space.

I also worked as a biostatistical research intern at Washington University School of Medicine's Summer Institute for Training in Biostatistics (SIBS), where I performed statistical analyses and explored latent variable patterns in clinical trials via independent component analysis, and presented my findings to a team of physicians.