**2023 Fall PSU-Purdue-UMD Joint Seminar on Mathematical Data Science**

**Time: Monday 14:30 - 15:30 PM ET**

**Location: Zoom Meeting**

**Organizers:**

**John Harlim and Wenrui Hao, Department of Mathematics, Penn State University**

**Yuan Gao, Department of Mathematics, Purdue University**

**Haizhao Yang, Department of Mathematics, University of Maryland College Park **

**Zoom Meeting ID: 927 8056 1489**

**Password: 0900**

**Link: ****https://purdue-edu.zoom.us/j/92780561489?pwd=aXl3cy9Nd1Z5SnJhOW5Id2JDNzRBQT09**

**Week 1 August 28: Organization Meeting, No Seminar**

**Week 2 September 4: Labor Day, No Seminar**

**Week 3 September 11:**

**Speaker: Yizhe Zhu, UC Irvine**

**Title: **Effective Algorithms for Differentially Private Synthetic Data Generation

**Abstract: **Differentially private synthetic data provide a powerful mechanism to enable data analysis while protecting sensitive information about individuals. We first present a highly effective algorithmic approach for generating differentially private synthetic data in a bounded metric space with near-optimal utility guarantees under the Wasserstein distance. When the data lie in a high-dimensional space, the accuracy of the synthetic data suffers from the curse of dimensionality. We then propose an algorithm to generate low-dimensional private synthetic data from a high-dimensional dataset efficiently. A key step in our algorithm is a private principal component analysis (PCA) procedure with a near-optimal accuracy bound. Based on joint work with Yiyun He (UC Irvine), Thomas Strohmer (UC Davis), and Roman Vershynin (UC Irvine).

**Week 4 September 18:**

**Speaker: Tuo Zhao, Gatech**

**Title: **On Fine-Tuning Large Language Models with Less Labeling Cost

**Abstract: **Labeled data is critical to the success of deep learning across various applications, including natural language processing, computer vision, and computational biology. While recent advances like pre-training have reduced the need for labeled data in these domains, increasing the availability of labeled data remains the most effective way to improve model performance. However, human labeling of data continues to be expensive, even when leveraging cost-effective crowd-sourced labeling services. Further, in many domains, labeling requires specialized expertise, which adds to the difficulty of acquiring labeled data.

In this talk, we demonstrate how to utilize weak supervision together with efficient computational algorithms to reduce data labeling costs. Specifically, we investigate various forms of weak supervision, including external knowledge bases, auxiliary computational tools, and heuristic rule-based labeling. We showcase the application of weak supervision to both supervised learning and reinforcement learning across various tasks, including natural language understanding, molecular dynamics simulation, and code generation.

**Week 5 September 25:**

**Speaker: Patrice Koehl, UC Davis**

**Title: **Light speed computation of exact solutions to generic and to degenerate assignment problems

**Abstract: **The linear assignment problem is a fundamental problem in combinatorial optimization with a wide range of applications, from operational research to data sciences.
It consists of assigning "agents" to "tasks" on a one-to-one basis, while minimizing the total cost associated with the assignment. While many exact algorithms have been developed to identify such an optimal assignment, most of these methods are computationally prohibitive for large size problems.In this talk, I will describe a novel approach to solving the assignment problem using techniques adapted from statistical physics.In particular I will derive a strongly concave effective free energy function that captures the constraints of the assignment problem at a finite temperature. This free energy decreases monotonically as a function of beta, the inverse of temperature, to the optimal assignment cost, providing a robust framework for temperature annealing. For large enough beta values the exact solution to the generic assignment problem can be derived using a simple round-off to the nearest integer of the elements of the computed assignment matrix. I will also describe a provably convergent method to handle degenerate assignment problems. Finally, I will describe computer implementations of this framework that are optimized for parallel architectures, one based on CPU, the other based on GPU.These implementations enable solving large assignment problems (of the orders of a few 10000s) in computing clock times of the orders of minutes.

**Week 6 October 2:**

**Speaker: **Ryan Murray, NC State

**Title: **Robust learning: a tour through variational methods, PDE, and geometry

**Abstract: **A major consideration in many learning algorithms is how to appropriately ensure good generalization and robustness. There are many methods, both classical and contemporary, and ranging from statistical to computational, which address this issue. This talk will give a tour of different mathematical approaches to the problem of ensuring robustness in statistical learning problems, with a special focus on non-parametric settings. Special attention will be given to connections that these methods have with classical mathematical tools, such as partial differential equations, geometry, and variational methods. Specifically, I will discuss 1) Hamilton-Jacobi equations satisfied by classical non-parametric generalizations of medians, which have deep connections with convex geometry and control theory, and 2) Geometric and analytical results for adversarially robust classification methods. Time permitting, new computational methods based upon these analytical approaches will also be discussed.

**Week 7 October 9: Fall Break at Purdue, No Seminar**

**Week 8 October 16:**

**Speaker: Caroline Moosmueller, UNC Chapel Hill**

**Title: **TBA

**Abstract: **TBA

**Week 9 October 23:**

**Speaker:Themis Sapsis, MIT**

**Title: **TBA

**Abstract: **TBA

**Week 10 October 30:**

**Speaker: Marius Zeinhofer, Simula Research Laboratory**

**Title: **TBA

**Abstract: **TBA

**Week 11 November 6:**

**Speaker: Ju Sun, UMN**

**Title: **TBA

**Abstract: **TBA

**Week 12 November 13:**

**Speaker: Joe Kileel, UT Austin**

**Title: **TBA

**Abstract: **TBA

**Week 13 November 20: Thanksgiving Holiday at Penn State, No Seminar**

**Week 14 November 27:**

**Speaker: Yuanzhe Xi, Emory University**

**Title: **TBA

**Abstract: **TBA

**Week 15 December 4:**

**Speaker: ** Benjamin Peherstorfer, Courant Institute of Mathematical Sciences, New York University

**Title: **TBA

**Abstract: **TBA

**Past Events:**

**2023 Spring PSU-Purdue-UMD Joint Seminar on Mathematical Data Science**

**Time: Monday 10:30 - 11:30 AM ET**

**Location: Zoom Meeting**

**Organizers:**

**John Harlim and Wenrui Hao, Department of Mathematics, Penn State University**

**Yuan Gao, Department of Mathematics, Purdue University**

**Haizhao Yang, Department of Mathematics, University of Maryland College Park **

**Zoom Meeting ID: 927 8056 1489**

**Password: 0900**

**Link: ****https://purdue-edu.zoom.us/j/92780561489?pwd=aXl3cy9Nd1Z5SnJhOW5Id2JDNzRBQT09**

**Week 1 January 23:**

**Speaker: Andrew Stuart, Caltech**

**Title: The Mean-Field Ensemble Kalman Filter**

**Abstract: **Ensemble Kalman filters constitute a methodology for incorporating noisy data into complex dynamical models to enhance predictive capability. They are widely adopted in the geophysical sciences, underpinning weather forecasting for example, and are starting to be used throughout the sciences and engineering; furthermore, they have been adapted to function as a general-purpose tool for parametric inference. The strength of these methods stems from their ability to operate using complex models as a black box, together with their natural adaptation to high performance computers. In this work we provide, for the first time, theory to elucidate conditions under which this widely adopted methodology provides accurate model predictions and uncertainties for discrete time filtering. The theory rests on a mean-field formulation of the methodology and an error analysis controlling differences between probability measure propagation under the mean-field model and under the true filtering distribution.

The mean-field formulation is based on joint work with Edoardo Calvello (Caltech) and Sebastian Reich (Potsdam).

The error analysis is based on joint work with Jose Carrillo (Oxford), Franca Hoffmann (Caltech) and Urbain Vaes (Paris).

**Week 2 January 30:**

**Speaker: Tan Bui-Thanh, The University of Texas at Austin**

**Title: TNet: A Tikhonov Neural Network Approach to Inverse Problems and UQ **

**Abstract: **Deep Learning (DL) by design is purely data-driven and in general does not require physics. This is the strength of DL but also one of its key limitations when applied to science and engineering problems in which underlying physical properties (such as stability, conservation, and positivity) and desired accuracy need to be achieved. DL methods in their original forms are not capable of respecting the underlying mathematical models or achieving desired accuracy even in big-data regimes. On the other hand, many data-driven science and engineering problems, such as inverse problems, typically have limited experimental or observational data, and DL would overfit the data in this case. Leveraging information encoded in the underlying mathematical models, we argue, not only compensates missing information in low data regimes but also provides opportunities to equip DL methods with the underlying physics and hence obtaining higher accuracy. This talk introduces a Tikhonov Network (TNet) that is capable of learning Tikhonov regularized inverse problems. We present and provide intuitions for our formulations for general nonlinear problems. We rigorously show that our TNet approach can learn information encoded in the underlying mathematical models, and thus can produce consistent or equivalent inverse solutions, while naive purely data-based counterparts cannot. Furthermore, we theoretically study the error estimate between TNet and Tikhhonov inverse solutions and under which conditions they are the same. Extension to statistical inverse problems will also be presented.

**Week 3 February 6:**

**Speaker: Bao Wang, The University of Utah**

**Title: The Effects of Activation Functions on the Over-smoothing Issue of Graph Convolutional Networks**

**Abstract: **Smoothness has been shown to be crucial for the success of graph convolutional networks (GCNs); however, over-smoothing has become inevitable. In this talk, I will present a geometric characterization of how activation functions of a graph convolution layer affect the smoothness of their input leveraging the distance of graph node features to the eigenspace of the largest eigenvalue of the (augmented) normalized adjacency matrix, denoted as $\gM$. In particular, we show that 1) the input and output of ReLU or leaky ReLU activation function are related by a high-dimensional ball, 2) activation functions can increase, decrease, or preserve the smoothness of node features, and 3) adjusting the component of the input in the eigenspace $\gM$ can control the smoothness of the output of activation functions. Informed by our theory, we propose a universal smooth control term to modulate the smoothness of learned node features and improve the performance of existing graph neural networks.

**Week 4 February 13:**

**Speaker: Marina Meila, University of Washington**

**Title: Manifold Learning between Mathematics and the Sciences**

**Abstract: **Non-linear dimension reduction algorithms can recover the underlying low-dimensional parametrization of high-dimensional point clouds. This area at the frontier of machine learning, statistics, computer science and mathematics is known as Manifold Learning. This talk will extend Manifold Learning in two directions.
First, we ask if it is possible, in the case of scientific data where quantitative prior knowledge is abundant, to explain a data manifold by new coordinates, chosen from a set of scientifically meaningful functions?
Second, we ask how popular Manifold Learning tools and their applications can be recreated in the space of vector fields and flows on a manifold. For this, we need to transport advanced differential geometric and topological concepts into a data-driven framework.
Central to this approach is the order 1-Laplacian of a manifold, $\Delta_1$, whose eigen-decomposition into gradient, harmonic, and curl, known as the Helmholtz-Hodge Decomposition, provides a basis for all vector fields on a manifold. We present an estimator for $\Delta_1$, and based on it we develop a variety of applications. Among them, visualization of the principal harmonic, gradient or curl flows on a manifold, smoothing and semi-supervised learning of vector fields, 1-Laplacian regularization. In topological data analysis, we describe the 1st-order analogue of spectral clustering, which amounts to prime manifold decomposition. Furthermore, from this decomposition a new algorithm for finding shortest independent loops follows. The algorithms are illustrated on a variety of real data sets.
Joint work with Yu-Chia Chen, Samson Koelle, Weicheng Wu, Hanyu Zhang and Ioannis Kevrekidis

**Week 5 February 20:**

**Speaker: Paris Perdikaris, University of Pennsylvania**

**Title: Self-supervised learning of PDE solution manifolds in function spaces **

**Abstract: **While the great success of modern deep learning lies in its ability to approximate maps between finite-dimensional vector spaces, many tasks in science and engineering involve continuous measurements that are functional in nature. For example, in climate modeling one might wish to predict the pressure field over the earth from measurements of the surface air temperature field. The goal is then to learn an operator, between the space of temperature functions to the space of pressure functions. In recent years operator learning techniques using deep neural networks have emerged as a powerful tool for regression problems in infinite-dimensional function spaces. In this talk we present a general approximation framework for neural operators and demonstrate that their performance fundamentally depends on their ability to learn low-dimensional parameterizations of solution manifolds. This motivates new architectures which are able to capture intrinsic low-dimensional structure in the space of target output functions. Additionally, we provide a way to train these models in a self-supervised manner, even in the absence of paired labeled examples. These contributions result in neural PDE solvers which yield fast and discretization invariant predictions of spatio-temporal fields up to three orders of magnitude faster compared to classical numerical solvers. We will also discuss key open questions related to generalization, accuracy, data-efficiency and inductive bias, the resolution of which will be critical for the success of AI in science and engineering.

**Week 6 February 27:**

**Speaker: Yannis Kevrekidis, John Hopkins University**

**Title: No equations, no variables, no space, no time: Old and new results on data and the modeling of complex systems**

**Abstract: **I will start by showing how several successful NN architectures (ResNets, recurrent nets, convolutional nets, autoencoders, neural ODEs, operator learning....) had been used for nonlinear dynamical system identification (learning ODEs and PDEs) since the early 1990s.

Obtaining predictive dynamical equations from data lies at the heart of science and engineering modeling and is the linchpin of our technology. In mathematical modeling one typically progresses from observations of the world (and some serious thinking!) first to equations for a model, and then to the analysis of the model to make predictions. Good mathematical models give good predictions (and inaccurate ones do not) - but the computational tools for analyzing them are the same: algorithms that are typically based on closed form equations.

While the skeleton of the process remains the same, today we witness the development of mathematical techniques that operate directly on observations -data-, and appear to circumvent the serious thinking that goes into selecting variables and parameters and deriving accurate equations. The process then may appear to the user a little like making predictions by "looking in a crystal ball". Yet the "serious thinking" is still there and uses the same -and some new- mathematics: it goes into building algorithms that jump directly from data to the analysis of the model (which is now not available in closed form) to make predictions. Our work here presents a couple of efforts that illustrate this "new" path from data to predictions. It really is the same old path, but it is traveled by new means.

**Bio: **
Yannis Kevrekidis studied Chemical Engineering at the National Technical University in Athens. He then followed the steps of many alumni of that department to the University of Minnesota, where he studied with Rutherford Aris and Lanny Schmidt (as well as Don Aronson and Dick McGehee in Math). He was a Director's Fellow at the Center for Nonlinear Studies in Los Alamos in 1985-86 (when Soviets still existed and research funds were plentiful). He then had the good fortune of joining the faculty at Princeton, where he taught Chemical Engineering and also Applied and Computational Mathematics for 31 years; five years ago he became Emeritus and started fresh at Johns Hopkins (where he somehow is also Professor of Urology). His work always had to do with nonlinear dynamics (from instabilities and bifurcation algorithms to spatiotemporal patterns to data science in the 90s, nonlinear identification, multiscale modeling, and back to data science/ML); and he had the additional good fortune to work with several truly talented experimentalists, like G. Ertl's group in Berlin. When young and promising, he was a Packard Fellow, a Presidential Young Investigator and the Ulam Scholar at Los Alamos National Laboratory. He holds the Colburn, CAST and Wilhelm Awards of the AIChE, the Crawford and the Reid Prizes of SIAM, he is a member of the NAE, the American Academy of Arts and Sciences, and the Academy of Athens.

**Week 7 March 6:**

**Speaker: Siddhartha Mishra, ETH Zurich**

**Title: Learning Operators**

**Abstract: **Operators are mapping between infinite-dimensional spaces and arise in a variety of contexts, particularly in the solution of PDEs. The main aim of this lecture would be to introduce the audience to the rapidly emerging area of operator learning, i.e., machine learning operators from data. To this end, we will summarize existing architectures such as DeepONets and Fourier neural operators (FNOs) as well as describe the newly proposed Convolutional Neural Operators (CNOs). Theoretical error estimates for different operator learning architectures will be mentioned and numerical experiments comparing them described. Several open issues regarding operator learning will also be covered. If time permits, we will describe Neural Inverse operators (NIOs): a machine-learning architecture for the efficient learning of a class of inverse problems associated with PDEs.

**Week 8 March 13:**

**Speaker: Mikhail Belkin, University of California San Diego**

**Title: The mathematics of neural networks: recent advances, thoughts, and a path forward**

**Abstract: **The recent remarkable practical achievements of neural networks have far outpaced our theoretical understanding of their properties. Yet, it is hard to imagine that progress can continue indefinitely, without deeper understanding of their fundamental principles and limitations. In this talk I will discuss some recent advances in the mathematics of neural networks and outline what are in my opinion are promising directions for the future research.

**Week 9 March 20:**

**Speaker: Hongkai Zhao, Duke University**

**Title: How much can one learn a PDE from a single solution data?**

**Abstract: **In this presentation, we discuss a few basic questions for PDE learning from observed solution data. Using various types of PDEs as examples, we show 1) how large the data space spanned by all snapshots along a solution trajectory is, 2) if one can construct an arbitrary solution by superposition of snapshots of a single solution, and 3) identifiability of a differential operator from a single solution data on local patches.

**Week 10 March 27:**

**Speaker: Lu Zhang, Columbia University**

**Title: Coupling physics-deep learning inversion**

**Abstract: **In recent years, there is great interest in using deep learning to geophysical/medical data inversion. However, direct application of end-to-end data-driven approaches to inversion have quickly shown limitations in the practical implementation. Due to the lack of prior knowledge on the objects of interest, the trained deep learning neural networks very often have limited generalization. In this talk, we introduce a new methodology of coupling model-based inverse algorithms with deep learning for two typical types of inversion problems. In the first part, we present an offline-online computational strategy for coupling classical least-squares based computational inversion with modern deep learning based approaches for full waveform inversion (FWI) to achieve advantages that can not be achieved with only one of the components. An offline learning strategy is used to construct a robust approximation to the inverse operator and utilize it to design a new objective function for the online inversion with new datasets. In the second part, we present an integrated machine learning and model-based iterative reconstruction framework for joint inversion problems where additional data on the unknown coefficients are supplemented for better reconstructions. The proposed method couples the supplementary data with the partial differential equation (PDE) model to make the data-driven modeling process consistent with the model-based reconstruction procedure. The impact of learning uncertainty on the joint inversion results are also investigated.

**Week 11 April 3:**

**Speaker: Sui Tang**

**Title: Data-driven discovery of interacting particle systems via Gaussian process**

**Abstract: **Interacting particle systems are ubiquitous in science and engineering, exhibiting a wide range of collective behaviors such as flocking of birds, milling of fish, and self-propelled particles. Differential equations are commonly used to model these systems, providing insights into how individual behavior generates collective behaviors. However, quantitatively matching these models with observational data remains a challenge despite recent theoretical and numerical advancements.
In this talk, we present a data-driven approach for discovering interacting particle models with latent interactions. Our approach uses Gaussian processes to model latent interactions, providing an uncertainty-aware approach to modeling interacting particle systems. We demonstrate the effectiveness of our approach through numerical experiments on prototype systems and real data. Moreover, we develop an operator-theoretic framework to provide theoretical guarantees for the proposed approach. We analyze recoverability conditions and establish the statistical optimality of our approach.
This talk is based on joint works with Jinchao Feng, Charles Kulick, Fei Lu, Mauro Maggioni, and Yunxiang Ren.

**Week 12 April 10:**

**Speaker: Rongrong Wang, Michigan State**

**Title: What affects the generalization of Neural networks?**

**Abstract: **Deep neural networks (NN) have led to huge empirical successes in recent years across a wide variety of tasks. Most deep learning problems are in essence solving an over-parameterized, large-scale non-convex optimization problem. A mysterious phenomenon about NN that attracted much attention in the past few years is why NN generalizes so well.

In this talk, we will begin by reviewing existing theories that attempt to explain this generalization phenomenon when neural networks are trained using the Stochastic Gradient Descent (SGD) algorithm. Building on these results, we will present a new analysis that focuses on the widely-used heavy-ball momentum accelerated SGD (SGD+M) algorithm.

Specifically, we will derive the formula for the implicit gradient regularization (IGR) of the SGD+M algorithm and explain its relationship to generalization. We will then use this framework to shed light on previously observed but empirically unexplained behavior of the momentum-accelerated SGD algorithm.

This is joint work with Avrajit Ghosh, He Lyu, and Xitong Zhang.

**Week 13 April 17:**

**Speaker: Lexing Ying, Stanford University**

**Title: Quantum numerical analysis**

**Abstract: **Recent developments in quantum computers have inspired rapid progress in developing quantum algorithms for scientific computing, including examples in numerical linear algebra, partial differential equations, and machine learning. However, the noise of quantum devices and quantum measurements pose new questions in the area of numerical analysis of quantum algorithms. In this talk, I will discuss two of my recent works in this direction: (1) new low-depth algorithms for quantum phase estimation for early fault-tolerant quantum devices and (2) a new robust algorithm for computing phase factors in forming general functions of quantum operators.

**Week 14 April 24:**

**Speaker: Jay Kuo, University of Southern California**

**Title: On the 2nd AI Wave: Toward Interpretable, Reliable, and Sustainable AI**

**Abstract: ** Rapid advances in artificial intelligence (AI) in the last decade have been primarily attributed to the wide applications of deep learning (DL) technologies. I view these advances as the first AI wave. There are concerns with the first AI wave. DL solutions are a black box (i.e., not interpretable) and vulnerable to adversarial attacks (i.e., unreliable). Besides, the high carbon footprint yielded by large DL networks is a threat to our environment (i.e., not sustainable). Many researchers are looking for an alternative solution that is interpretable, reliable, and sustainable. This is expected to be the second AI wave. To this end, I have conducted research on green learning (GL) since 2015. GL was inspired by DL. Low carbon footprints, small model sizes, low computational complexity, and mathematical transparency characterize GL. It offers energy-effective solutions in cloud centers and mobile/edge devices. It has three main modules: 1) unsupervised representation learning, 2) supervised feature learning, and 3) decision learning. GL has been successfully applied to a few applications. My talk will present the fundamental ideas of the GL solution and highlight a couple of demonstrated examples.

**Bio: **Dr. C.-C. Jay Kuo received his Ph.D. from the Massachusetts Institute of Technology in 1987. He is now with the University of Southern California (USC) as William M. Hogue Professor, Distinguished Professor of Electrical and Computer Engineering and Computer Science, and Director of the Media Communications Laboratory. His research interests are in visual computing and communication. He is a Fellow of AAAS, ACM, IEEE, NAI, and SPIE and an Academician of Academia Sinica. Dr. Kuo has received a few awards for his research contributions, including the 2010 Electronic Imaging Scientist of the Year Award, the 2010-11 Fulbright-Nokia Distinguished Chair in Information and Communications Technologies, the 2019 IEEE Computer Society Edward J. McCluskey Technical Achievement Award, the 2019 IEEE Signal Processing Society Claude Shannon-Harry Nyquist Technical Achievement Award, the 72nd annual Technology and Engineering Emmy Award (2020), and the 2021 IEEE Circuits and Systems Society Charles A. Desoer Technical Achievement Award. Dr. Kuo was Editor-in-Chief for the IEEE Transactions on Information Forensics and Security (2012-2014) and the Journal of Visual Communication and Image Representation (1997-2011). He is currently the Editor-in-Chief for the APSIPA Trans. on Signal and Information Processing (2022-2023). He has guided 166 students to their Ph.D. degrees and supervised 31 postdoctoral research fellows.

**Week 15 May 1:**

**Speaker: Yunyue Li, Purdue University**

**Title: Machine Learning Developments and Applications in Solid-Earth Geosciences - A Review**

**Abstract: **After decades of low but continuing activity, applications of machine learning (ML) in solid Earth geoscience have exploded in popularity. Based on a special collection on "Machine Learning for Solid Earth Observation, Modeling, and Understanding", primarily organized by Journal of Geophyscial Research - Solid Earth, I will provide a snapshot of applications ranging from data processing to inversion and interpretation, for which ML appears particularly well suited. Inevitably, there are variations in the degree to which these methods have been developed. I will highlight the successes and discuss the remaining challenges, with examples from my own research group. I hope that the progress seen in some areas will inspire efforts in others. The formidable task of how geoscience can keep pace with developments in ML while ensuring the scientific rigor that our field depends on requires improvements in sensor technology, accelerating rates of data accumulation, and close collaboration among geoscientists, mathematicians, and machine learning engineers. The methods of ML seem poised to play an important role in geosciences for the foreseeable future.

**2022 Fall PSU-Purdue-UMD Joint Seminar on Mathematical Data Science**

**Time: Monday 10:30 - 11:30 AM ET**

**Location: Zoom Meeting**

**Organizers:**

**John Harlim and Wenrui Hao, Department of Mathematics, Penn State University**

**Yuan Gao, Department of Mathematics, Purdue University**

**Haizhao Yang, Department of Mathematics, University of Maryland College Park **

**Zoom Meeting ID: 927 8056 1489**

**Password: 0900**

**Link: ****https://purdue-edu.zoom.us/j/92780561489?pwd=aXl3cy9Nd1Z5SnJhOW5Id2JDNzRBQT09**

**Week 1 August 29:**

**Speaker: Yunan Yang, Cornell University**

**Title: Benefits of Weighted Training in Machine Learning and PDE-based Inverse Problems**

**Abstract: **Many models in machine learning and PDE-based inverse problems exhibit intrinsic spectral properties, which have been used to explain the generalization capacity and the ill-posedness of such problems. In this talk, we discuss weighted training for computational learning and inversion with noisy data. The highlight of the proposed framework is that we allow weighting in both the parameter space and the data space. The weighting scheme encodes both a priori knowledge of the object to be learned and a strategy to weight the contribution of training data in the loss function. We demonstrate that appropriate weighting from prior knowledge can improve the generalization capability of the learned model in both machine learning and PDE-based inverse problems.

**Week 2 September 5: **

**No seminar on Labor day**

**Week 3 September 12: (in-person at Purdue University joint with zoom)**

**Speaker: Braxton Osting, University of Utah**

**Location: ** Stanley Coulter Hall G060

**Title: Archetypal Analysis **

**Abstract: **Archetypal Analysis is an unsupervised learning method that uses a convex polytope to summarize multivariate data. For fixed k, the method finds a convex polytope with k vertices, called archetype points, such that the polytope is contained in the convex hull of the data and the mean squared distance between the data and the polytope is minimal. In this talk, I'll give an overview of Archetypal Analysis and discuss our recent results on consistency, a probabilistic method for approximate archetypal analysis, and a version of the problem using the Wasserstein metric. Parts of this work are joint with Katy Craig, Ruijian Han, Dong Wang, Yiming Xu, and Dominique Zosso.

**Week 4 September 19: **

**Speaker: Zhiqin Xu, Shanghai Jiao Tong University**

**Title: Frequency principle and cheap labels in solving PDEs by neural networks**

**Abstract:** In this talk, I would discuss two approaches for solving PDEs by neural networks. The first one is to parameterize the solution by a network. In this approach, neural network suffers from a high-frequency curse, pointed by the frequency principle, i.e., neural network learns data from low to high frequency. To overcome the high-frequency curse, a multi-scale neural network is proposed and verified. The second approach is to express the solution by the form of the Green function and parameterize the Green function by a network. We propose a model-operator-data framework. In this approach, the MOD-Net solves a family of PDEs rather than a specific one and is much more efficient than original neural operator because few expensive labels are required, which are computed on coarse grid points with cheap computation cost and significantly improves the model accuracy.

**Week 5 September 26: **

**Speaker: Jianfeng Lu, Duke University**

**Title: Convergence for score-based generative modeling with polynomial complexity**

**Abstract:** Score-based generative modeling (SGM) is a highly successful approach for learning a probability distribution from data and generating further samples. We prove the first polynomial convergence guarantees for the core mechanic behind SGM: drawing samples from a probability density p given a score estimate (an estimate of ∇lnp) that is accurate in L2(p). Compared to previous works, we do not incur error that grows exponentially in time or that suffers from a curse of dimensionality. Our guarantee works for any smooth distribution and depends polynomially on its log-Sobolev constant. Using our guarantee, we give a theoretical analysis of score-based generative modeling, which transforms white-noise input into samples from a learned data distribution given score estimates at different noise scales. Our analysis gives theoretical grounding to the observation that an annealed procedure is required in practice to generate good samples, as our proof depends essentially on using annealing to obtain a warm start at each step. Moreover, we show that a predictor-corrector algorithm gives better convergence than using either portion alone. Joint work with Holden Lee and Yixin Tan.

**Week 6 October 3:**

**Speaker: Kui Ren, Columbia University**

**Title: Some Perspectives from Inverse Problems on Learning Second-Order Elliptic PDEs from Solution Data**

**Abstract:**In recent years, there have been great interests in discovering structures of partial differential equations from given solution data. Very promising theory and computational algorithms have been proposed for such identification problems in different settings. We will try to review some recent understandings of such PDE learning problems from the perspective of inverse problems. In particularly, we will highlight a few computational and analytical understandings on learning a second-order elliptic PDE from single and multiple solutions.

**Week 7 October 10:**

**No seminar on Fall break **

**Week 8 October 17:**

**Speaker: Paul Atzberger, University California Santa-Barbara**

**Title: Geometric Variational Autoencoders for Learning Representations of Nonlinear Dynamics and Dimension Reductions**

**Abstract:** We discuss variational autoencoders for learning representations for non-linear dynamics based on manifold latent spaces incorporating prior knowledge of geometric and topological structure. These are referred to as Geometric Dynamic (GD)-VAEs. We show GD-VAEs using manifold latent spaces allow for reducing the number of needed latent dimensions and facilitate learning more interpretable and robust representations of dynamics for making long-time predictions. We discuss challenges and present results using GD-VAEs for learning parsimonious representations for non-linear dynamics of burgers equation, constrained mechanical systems, and dynamics of reaction-diffusion systems. We also make comparisons with other methods, including analytic reduction techniques, Dynamical Model Decomposition (DMD), Proper Orthogonal Decomposition (POD), and standard autoencoders (non-variational). The results indicate some of the ways that non-linear approximation combined with manifold latent spaces can be used to significantly enhance learning of dynamics and for related machine learning tasks. The GD-VAEs package is available at http://atzberger.org

**Week 9 October 24: (in-person at Purdue University joint with zoom)**

**Speaker: Nicolas Trillos, University of Wisconsin–Madison**

**Location: ** Stanley Coulter Hall G060

**Title: Analysis of adversarial robustness and of other problems in modern machine learning**

**Abstract:** Modern machine learning methods, in particular deep learning approaches, have enjoyed unparalleled success in a variety of challenging application fields like image recognition, medical image reconstruction, and natural language processing. While a vast majority of previous research in machine learning mainly focused on constructing and understanding models with high predictive power, consensus has emerged that other properties like stability and robustness of models are of equal importance and in many applications essential. This has motivated researchers to investigate the problem of adversarial training (or how to make models robust to adversarial attacks), but despite the development of several computational strategies for adversarial training and some theoretical development in the broader distributionally robust optimization literature, there are still several theoretical questions about it that remain relatively unexplored. In this talk, I will take an analytical perspective on the adversarial robustness problem and explore three questions: 1)What is the connection between adversarial robustness and inverse problems?, 2) Can we use analytical tools to find lower bounds for adversarial robustness problems?, 3) How do we use modern tools from analysis and geometry to solve adversarial robustness problems? At its heart, this talk is an invitation to view adversarial machine learning through the lens of mathematical analysis, showcasing a variety of rich connections with perimeter minimization problems, optimal transport, mean field PDEs of interacting particle systems, and min-max games in spaces of measures.
The talk is based on joint works with Leon Bungert (Bonn), Camilo A. García Trillos (UAL), Matt Jacobs (Purdue), Jakwang Kim (Wisc), and Ryan Murray (NCState).

**Week 10 October 31:**

**Speaker: Andrea Agazzi, University of Pisa**

**Title: Convergence and optimality of single-layer neural networks for reinforcement learning**

**Abstract:** Over the past few years, groundbreaking results have established a convergence theory for wide neural networks in the supervised learning setting. Depending on the scaling of parameters at initialization, the (stochastic) gradient descent dynamics of these models converge towards different deterministic limits known as the mean-field and the lazy training regimes.In this talk, we extend some of these results to examples of prototypical algorithms in reinforcement learning: Temporal-Difference (TD) learning and Policy Gradients. In the first case, we prove convergence and optimality of wide neural network training dynamics in the lazy and mean-field regime, respectively. To establish these results, we bypass the lack of gradient structure of the TD learning dynamics by leveraging Lyapunov function techniques in the lazy training regime and sufficient expressivity of the activation function in the mean-field framework. We further show that similar optimality results hold for wide, single layer neural networkstrained by entropy-regularized softmax Policy Gradients despite the nonlinear and nonconvex nature of the risk function in this setting. This is joint work with Jianfeng Lu.

**Week 11 November 7:**

**Speaker: Yuehaw Khoo, University of Chicago **

**Title: Tensor-network approach for statistical mechanics problem**

**Abstract:** We develop tensor-network approaches for solving high-dimensional partial differential equations with the goal of characterizing the transition between two states in a statistical mechanics system with high-accuracy. For this purpose we also develop novel generative modeling techniques based on tensor-networks. The proposed method is completely linear algebra based and does not require any optimization.

**Week 12 November 14:**

**Speaker: Nathan Kutz, University of Washington**

**Title: The future of governing equations**

**Abstract: ** A major challenge in the study of dynamical systems is that of model discovery: turning data into reduced order models that are not just predictive, but provide insight into the nature of the underlying dynamical system that generated the data. We introduce a number of data-driven strategies for discovering nonlinear multiscale dynamical systems and their embeddings from data. We consider two canonical cases: (i) systems for which we have full measurements of the governing variables, and (ii) systems for which we have incomplete measurements. For systems with full state measurements, we show that the recent sparse identification of nonlinear dynamical systems (SINDy) method can discover governing equations with relatively little data and introduce a sampling method that allows SINDy to scale efficiently to problems with multiple time scales, noise and parametric dependencies. For systems with incomplete observations, we show that the Hankel alternative view of Koopman (HAVOK) method, based on time-delay embedding coordinates and the dynamic mode decomposition, can be used to obtain a linear models and Koopman invariant measurement systems that nearly perfectly captures the dynamics of nonlinear quasiperiodic systems. Neural networks are used in targeted ways to aid in the model reduction process. Together, these approaches provide a suite of mathematical strategies for reducing the data required to discover and model nonlinear multiscale systems.

**Bio: ** Nathan Kutz is the Yasuko Endo and Robert Bolles Professor of Applied Mathematics and Electrical and Computer Engineering at the University of Washington, having served as chair of applied mathematics from 2007-2015. He received the BS degree in physics and mathematics from the University of Washington in 1990 and the Phd in applied mathematics from Northwestern University in 1994. He was a postdoc in the applied and computational mathematics program at Princeton University before taking his faculty position. He has a wide range of interests, including neuroscience to fluid dynamics where he integrates machine learning with dynamical systems and control.

**Week 13 November 21:**

**Speaker: Peter Koltai, Freie Universitat Berlin**

**Title: Collective variables in complex systems: from molecular dynamics to agent-based models and fluid dynamics**

**Abstract: ** The identification of persistent forecastable structures in complicated or high-dimensional dynamics is vital for a robust prediction (or manipulation) of such systems in a potentially sparse-data setting. Such structures can be intimately related to so-called collective variables known for instance from statistical physics. We have recently developed a first data-driven technique to find provably good collective variables in molecular systems. Here we will discuss how the concept generalizes to other applications as well, such as fluid dynamics and social or epidemic dynamics.

**Week 14 November 28:**

**Canceled**

**Week 15 December 5:**

**Speaker: Somdatta Goswami, Brown University**

**Title: Transfer learning in deep neural operators **

**Abstract:** A new paradigm in scientific research has been established with the integration of data-driven and physics-informed methodologies in the domain of deep learning, and it is certain to have an impact on all areas of science and engineering. This field, popularly termed "scientific machine learning," relies on a known model, some (or no) high-fidelity data, and partially known constitutive relationships or closures, to be able to close the gap between the physical models and the observational data. Despite the success of deep learning models, the predictive performance of such models is often limited by the availability of labeled data used for training. Furthermore, learning in isolation, i.e., training a single predictive model for different but related tasks, can be extremely expensive. The application of deep learning techniques within the context of operator regression and transfer learning between different but related tasks/domains will be the focus of this talk.

**2022 Spring Mathematical Data Science Seminar**

**Department of Mathematics**

**Purdue University**

**Time: Monday 10:30 - 11:30 AM ET**

**Location: Zoom Meeting**

**Zoom Meeting ID:940 8060 9694**

**Password:919442**

**Link: ****https://purdue-edu.zoom.us/j/94080609694?pwd=eldxbTR2UWRDTGthakd4NUhISGRYQT09**

**Week 1 January 10:**

**No seminar**

**Week 2 January 17:**

**No seminar on holiday**

**Week 3 January 24:**

Speaker: **Samuel Kou, Harvard University [****website****] **

Title: Statistical Inference of dynamic systems via manifold-constrained Gaussian processes

Abstract: Parameter estimation for nonlinear dynamic system models, represented by ordinary differential equations (ODEs), using noisy and sparse data is a vital task in many fields. We will introduce a fast and accurate method, MAGI (MAnifold-constrained Gaussian process Inference), in this task. MAGI uses a Gaussian process model over time-series data, explicitly conditioned on the manifold constraint that derivatives of the Gaussian process must satisfy the ODE system. By doing so, we completely bypass the need for numerical integration and achieve substantial savings in computational time. MAGI is also suitable for inference with unobserved system components, which often occur in real experiments. MAGI is distinct from existing approaches as we provide a principled statistical construction under a Bayesian framework, which incorporates the ODE system through the manifold constraint. We demonstrate the accuracy and speed of MAGI using realistic examples based on physical experiments.

**Week 4 January 31:**

Speaker: **Roman Vershynin**, **UC Irvine [****website****] **

Title: Mathematics of synthetic data and its privacy

Abstract: An emerging way to protect privacy is to replace true data by synthetic data. Medical records of artificial patients, for example, could retain meaningful statistical information while preserving privacy of the true patients. But what is synthetic data, and what is privacy? How do we define these concepts mathematically? Is it possible to make synthetic data that is both useful and private? I will tie these questions to a simple-looking problem in probability theory: how much information about a random vector X is lost when we take conditional expectation of X with respect to some sigma-algebra? This talk is based on a series of papers joint with March Boedihardjo and Thomas Strohmer, mainly this one: https://arxiv.org/abs/2107.05824

**Week 5 February 7:**

Speaker: **Deanna Needell, UCLA [****website****] **

Title: Online nonnegative matrix factorization and applications

Abstract: Online Matrix Factorization (OMF) is a fundamental tool for dictionary learning problems, giving an approximate representation of complex data sets in terms of a reduced number of extracted features. Convergence guarantees for most of the OMF algorithms in the literature assume independence between data matrices, and the case of dependent data streams remains largely unexplored. In this talk, we present results showing that a non-convex generalization of the well-known OMF algorithm for i.i.d. data converges almost surely to the set of critical points of the expected loss function, even when the data matrices are functions of some underlying Markov chain satisfying a mild mixing condition. As the main application, by combining online non-negative matrix factorization and a recent MCMC algorithm for sampling motifs from networks, we propose a novel framework of Network Dictionary Learning that extracts `network dictionary patches' from a given network in an online manner that encodes main features of the network. We demonstrate this technique on real-world data and discuss recent extensions and variations.

**Week 6 February 14:**

Speaker: **Yongxin Chen, GaTech [****website****] **

Title: Diffusion models for generative modeling and sampling

Abstract: Diffusion models are a class of generative models built on the key idea of bridging a simple distribution and a target distribution with a proper diffusion process. When samples of the target distributions are given, it can be used for density estimation. When the target distribution is given as a (unnormalized) probability density, it can be used for sampling. In this talk, I will discuss two diffusion models we developed in our recent work, one for generative modeling (diffusion normalizing flow) and one for sampling (path integral sampler). Our methods are based on diffusion processes and stochastic control. It is also closely related to the Schrodinger bridge problem.

**Week 7 February 21:**

Speaker: **Hao Liu, Hong Kong Baptist University [****website****] **

Title: Off-policy learning and classification on low-dimensional manifold by deep neural networks

Abstract: Deep neural networks have demonstrated a great success on many applications, especially on problems with high-dimensional data sets. In spite of that, most existing statistical theories are cursed by data dimension and cannot explain such a success. To bridge the gap between theories and practice, we exploit the low-dimensional structures of data set and establish theoretical guarantees with a fast rate that is only cursed by the intrinsic dimension of the data set. This presentation addresses our recent work on causal inference and binary classification by deep neural networks. For causal inference, we develop non-asymptotic regret bounds for doubly robust off-policy learning by deep neural networks. For binary classification, we focus on the convolutional residual network classifier learned by minimizing the empirical logistic loss, and develop a non-asymptotic bound on the excess risk. In both problems, we assume the data set locates on a low-dimensional manifold. Our results provide fast rates depending on the intrinsic dimension of data sets and show that deep neural networks are adaptive to low-dimensional structures of data sets.

**Week 8 February 28:**

Speaker: **Yue Yu, Lehigh University [****website****] **

Title: Learning Nonlocal Operators for Heterogeneous Material Modeling

Abstract: Constitutive modeling based on the continuum mechanics theory has been a classical approach for modeling the mechanical responses of materials. However, when constitutive laws are unknown or when defects and/or high degrees of heterogeneity present, these classical models may become inaccurate. In this talk we propose to use data-driven modeling which directly utilizes high-fidelity simulation and experimental measurements on displacement fields, to predict a material's response. Two nonlocal operator regression approaches will be presented, to obtain homogenized and heterogeneous surrogates for material modeling, respectively.

In the first approach, we propose to consider nonlocal models as coarse-grained, homogenized models to accurately capture the system's global behavior, and learn optimal kernel functions directly from data. By combining machine learning identifiability theory, known physics, and nonlocal theory, this combination guarantees that the resulting model is mathematically well-posed, physically consistent, and converging as the data resolution increases. As an application, we employ this ``linear operator regression’’ technique to model wave propagation in heterogeneous bars.

In the second approach, we further propose to leverage the linear operator regression to nonlinear models by combining it with neural networks, and model the heterogeneous material response as mapping between loading conditions and its resultant displacement fields. To this end, we develop deep integral neural operator architectures, which is resolution-independent and naturally embeds the material micromechanical properties and defects in the integrand. As an application, we learn material models directly from digital image correlation (DIC) displacement tracking measurements, and show that the learnt model substantially outperforms conventional constitutive models.

**Week 9 March 7:**

Speaker: **Hong Qian, University of Washington [****website****] **

Title: Gibbs' Theory and Statistical Physics: A third approach to understanding the world probabilistically?

How to apply the mathematical theory of probability to real world problems? Interpretations of "what is probability" have led to the standoff between Bayesian and frequentist schools. In this talk, I try to show how Gibbs' theory stitches together both thoughts, as well as the large deviations theory, the asymptotic analysis of the law of large numbers. This yields the statistical ensemble as a parametric family of probabilistic models that are specifically informed by the nature of "observables" with infinitely many observations. Two well-known entropy functions, Gibbs' and Shannon's, as well as Pitman-Koopman-Darmois theorem, figure prominently in our theory.

**Week 10 March 14:**

**No seminar in the spring break**

**Week 11 March 21:**

Speaker: **Rongjie Lai, Rensselaer Polytechnic Institute [****website****] **

Title: Geometry inspired DNNs on Manifold-structured Data.

Abstract: Deep neural networks have made great success in many problems in science and engineering. In this talk, I will discuss our recent efforts on leveraging non-trivial geometry information hidden in data to design adaptive DNNs. I will first discuss our work on advocating the use of a multi-chart latent space for better data representation. Inspired by differential geometry, we propose a Chart Auto-Encoder (CAE) and prove a universal approximation theorem on its representation capability. CAE admits desirable manifold properties that auto-encoders with a flat latent space fail to obey, predominantly proximity of data. If time permits, I will discuss our work on defining convolution on manifolds via parallel transport and its associated parallel transport convolution neural networks (PTC-nets). This provides a natural combination of geometric modeling and learning on manifolds. This talk is based on a series of joint work with a group of my students and collaborators.

**Week 12 March 28:**

Speaker:** Akil Narayan, University of Utah [****website****] **

Title: Bandit learning and budget allocation for multi-fidelity scientific computing

Abstract: Multi-fidelity algorithms operate by recognizing that not all data is created equal: Some data is more trustworthy than other data, and this inequality in data trustworthiness can be exploited if inexpensive data that reveals bulk behavior is appropriately coupled with expensive data that more faithfully adheres to some underlying behavior. We examine this tradeoff in the context of scientific computing, where multiple computational simulations give rise to data of multiple levels of trustworthiness, or "fidelity". As a crude yet clarifying example, simulations with fewer degrees of freedom in discretizing a differential equation are less trustworthy and less costly than simulations employing more degrees of freedom. The question then is how to allocate a fixed computational budget across simulations so that the generated data ensemble reveals as much predictive information as possible.

We show that ideas and techniques from budget-limited multi-armed bandit learning can successfully tackle the stochastic decision-making problem of how to allocate computational resources amongst such simulations. We propose and discuss an adaptive multi-armed bandit learning strategy that, with minimal knowledge about given simulations, successfully generates data from simulations in ways that are provably asymptotically optimal and empirically pre-asymptotically near-optimal. We show that both statistics and full probability distributions can be learned, and demonstrate that our bandit learning methods with zero a priori statistical information about simulations perform as well as or superior to existing multi-fidelity algorithms that require oracle statistics as input.

**Week 13 April 4:**

Speaker:** Matthew Thorpe, University of Manchester [****website****] **

Title: A Random Walk from Semi Supervised Learning to Neural Networks

Abstract: Semi-supervised learning is the problem of finding missing labels; more precisely one has a data set of feature vectors of which a (often small) subset are labelled. The semi-supervised learning assumption is that similar feature vectors should have similar labels, implying that one needs a geometry on the set of feature vectors. A typical way to represent this geometry is via a graph where the nodes are the feature vectors and the edges are weighted by some measure of similarity. Laplace learning is a popular graph-based method for solving the semi-supervised learning problem which essentially requires one to minimise a Dirichlet energy defined on the graph (hence the Euler-Lagrange equation is Laplace's equation). However, at low labelling rates Laplace learning typically performs poorly. This is due to the lack of regularity, or the ill-posedness, of solutions to Laplace's equation in any dimension higher (or equal to) two. The random walk interpretation allows one to characterise how close one is to entering the ill-posed regime. In particular, it allows one to give a lower bound on the number of labels required and even provides a route for correcting the bias. Correcting the bias leads to a new method, called Poisson learning. Finally, the ideas behind correcting the bias in Laplace learning have motivated a new graph neural network architecture which do not suffer from the over-smoothing phenomena. In particular, this type of neural network, which we call GRAND++ (GRAph Neural Diffusion with a source term) enables one to use deep architectures. This is joint work with Jeff Calder, Dejan Slepčev, Brendan Cook, Tan Nguyen, Hedi Xia, Thomas Strohmer, Andrea Bertozzi, Stanley Osher and Bao Wang.

**Week 14 April 11:**

Speaker:** Bin Dong, Peking University [****website****] **

Title: Learning and Learning to Solve PDEs

Abstract: Deep learning continues to dominate machine learning and has been successful in computer vision, natural language processing, etc. Its impact has now expanded to many research areas in science and engineering. In this talk, I will mainly focus on some recent impacts of deep learning on computational mathematics. I will present our recent work on bridging deep neural networks with numerical differential equations, and how it may guide us in designing new models and algorithms for some scientific computing tasks. On the one hand, I will present some of our works on the design of interpretable data-driven models for system identification and model reduction. On the other hand, I will present our recent attempts at combining wisdom from numerical PDEs and machine learning to design data-driven solvers for PDEs and their applications in electromagnetic simulation.

**Week 15 April 18:**

Speaker: **Jeff Calder, University of Minnesota [****website****] **

Title: Boundary estimation and Hamilton-Jacobi equations on point clouds.

Abstract: This talk will discuss recent work on estimating the boundary of a domain from iid samples, and on solving Hamilton-Jacobi equations on graphs. We'll present a method for boundary estimation that is scalable to large datasets in high dimensional settings, and provide statistical guarantees that the method identifies all samples within a desired distance of the boundary. We'll show that the method can be used to solve PDEs on point clouds with Dirichlet boundary conditions, which has applications to problems like data depth and machine learning. We will also present some work on robust approximations of graph distances via the solution of particular Hamilton-Jacobi equations on graphs, and discuss applications to data depth and semi-supervised learning. This is joint work with Dejan Slepcev, Sangmin Park, and Mahmood Ettehad.

**Week 16 April 25: **

Speaker:** Xiaohui Chen, University of Illinois at Urbana-Champaign, [****website****] **

Title: A diffusion perspective of manifold clustering

Abstract: We introduce the diffusion K-means clustering method on Riemannian submanifolds, which maximizes the within-cluster connectedness based on the diffusion distance. The diffusion K-means constructs a random walk on the similarity graph with vertices as data points randomly sampled on the manifolds and edges as similarities given by a kernel that captures the local geometry of manifolds. The diffusion K-means is a multi-scale clustering tool that is suitable for data with non-linear and non-Euclidean geometric features in mixed dimensions. Given the number of clusters, we propose a polynomial-time convex relaxation algorithm via the semidefinite programming (SDP) to solve the diffusion K-means. In addition, we also propose a nuclear norm regularized SDP that is adaptive to the number of clusters. In both cases, we show that exact recovery of the SDPs for diffusion K-means can be achieved under suitable between-cluster separability and within-cluster connectedness of the submanifolds, which together quantify the hardness of the manifold clustering problem. We further propose the localized diffusion K-means by using the local adaptive bandwidth estimated from the nearest neighbors. We show that exact recovery of the localized diffusion K-means is fully adaptive to the local probability density and geometric structures of the underlying submanifolds. Joint work with Yun Yang (UIUC).

**2021 Fall Mathematical Data Science Seminar**

**Department of Mathematics**

**Purdue University**

**Time: Monday 10:30 - 11:30 AM ET**

**Location: Zoom Meeting**

**Zoom Meeting ID:940 8060 9694**

**Password:919442**

**Link: ****https://purdue-edu.zoom.us/j/94080609694?pwd=eldxbTR2UWRDTGthakd4NUhISGRYQT09**

**Week 1 August 23:**

Speaker: **Stephan J. Wojtowytsch, Texas A&M University [****website****] **

Title: Stochastic gradient descent for noise with ML-type scaling

Abstract: In the literature on stochastic gradient descent, there are two types of convergence results: (1) SGD finds minimizers of convex objective functions and (2) SGD finds critical points of smooth objective functions. Classical results are obtained under the assumption that the stochastic noise is L^2-bounded and that the learning rate decays to zero at a suitable speed. We show that, if the objective landscape and noise possess certain properties which are reminiscent of deep learning problems, then we can obtain global convergence guarantees of first type under second type assumptions for a fixed (small, but positive) learning rate. The convergence is exponential, but with a large random coefficient. If the learning rate exceeds a certain threshold, we discuss minimum selection by studying the invariant distribution of a continuous time SGD model. We show that at a critical threshold, SGD prefers minimizers where the objective function is 'flat' in a precise sense.

**Week 2 August 30:**

Speaker: **Shijun Zhang, National University of Singapore [****website****] **

Title: Deep Network Approximation: Achieving Arbitrary Accuracy with Fixed Number of Neurons

Abstract: This talk discusses a new type of simple feed-forward neural network that achieves the universal approximation property for all continuous functions with a fixed finite number of neurons. This new type of neural network is simple because it is designed with a simple and computable continuous activation function $\sigma$ leveraging a triangular-wave function and a softsign function. First, we prove that $\sigma$-activated networks with width $36d(2d+1)$ and depth $11$ can approximate any continuous function on a $d$-dimensional hypercube with an arbitrarily small error. Next, we show that classification functions arising from image and signal classification can be exactly represented by $\sigma$-activated networks with width $36d(2d+1)$ and depth $12$, when there exist pairwise disjoint closed bounded subsets of $\mathbb{R}^d$ such that the samples of the same class are located in the same subset.

**Week 3 September 6:**

No talk because of Labor Day

**Week 4 September 13:**

Speaker: **Shuhao Cao, Washington University in St. Louis [****website****] **

Title: Galerkin Transformer

Abstract: Transformer in "Attention Is All You Need" is now THE ubiquitous architecture in every state-of-the-art model in Natural Language Processing (NLP), such as BERT. At its heart and soul is the "attention mechanism". We apply the attention mechanism the first time to a data-driven operator learning problem related to parametric partial differential equations. Inspired by Fourier Neural Operator which showed a state-of-the-art performance in parametric PDE evaluation, we put together an effort to explain the heuristics of, and improve the efficacy of the self-attention. We have demonstrated that the widely-accepted "indispensable" softmax normalization in the scaled dot-product attention is sufficient but not necessary. Without the softmax normalization, the approximation capacity of a linearized Transformer variant can be proved to be on par with a Petrov-Galerkin projection layer-wise. Some simple changes mimicking projections in Hilbert spaces are applied to the attention mechanism, and it helps the final model achieve remarkable accuracy in operator learning tasks with unnormalized data, surpassing the evaluation accuracy of the classical Transformer applied directly by 100 times. Meanwhile in all experiments including the viscid Burgers' equation, an interface Darcy flow, and an inverse interface coefficient identification problem, the newly proposed simple attention-based operator learner, Galerkin Transformer, shows significant improvements in both speed and evaluation accuracy over its softmax-normalized counterparts, as well as other linearizing variants such as Random Feature Attention or FAVOR+ in Performer by Google and Deepmind. The code is publicly available at https://github.com/scaomath/galerkin-transformer

**Week 5 September 20:**

Speaker: **Anna Little, University of Utah [****website****] **

Title: Clustering High-dimensional Data with Path Metrics: A Balance of Density and Geometry

Abstract: This talk discusses multiple methods for clustering high-dimensional data, and explores the delicate balance between utilizing data density and data geometry. I will first present path-based spectral clustering, a novel approach which combines a density-based metric with graph-based clustering. This density-based path metric allows for fast algorithms and strong theoretical guarantees when clusters concentrate around low-dimensional sets. However, the method suffers from a loss of geometric information, information which is preserved by simple linear dimension reduction methods such as classic multidimensional scaling (CMDS). The second part of the talk will explore when CMDS followed by a simple clustering algorithm can exactly recover all cluster labels with high probability. However, scaling conditions become increasingly restrictive as the ambient dimension increases, and the method will fail for irregularly shaped clusters. Finally, I will discuss how a more general family of path metrics, when combined with CMDS, give low-dimensional embeddings which respect both data density and data geometry. This new method exhibits promising performance on single cell RNA sequence data and can be computed efficiently by restriction to a sparse graph.

**Week 5 September 27:**

Speaker: **Christoph Ortner, University of British Columbia [****website****] **

Title: Interatomic Potentials from First Principles and Machine Learning

Abstract: Accurate molecular simulation requires computationally expensive quantum chemistry models that makes simulating complex material phenomena or large molecules intractable. However, if no chemistry is required, but only interatomic forces then it should in principle be possible to construct much cheaper surrogate models, interatomic potentials, that capture full quantum mechanical accuracy. This talk will aim to explain why this is possible, focusing on how modelling, analysis and machine learning work together. Specifically, I will explore whether we can rigorously justify the low-dimensional functional forms proposed for interatomic potentials, and whether we can construct practical approximation schemes that can, in principle at least, close the complexity gap between electronic structure density functional theory and interatomic potentials.

**Week 5 October 4:**

Speaker: **Guo-wei Wei, Michigan State University [****website****] **

Title: How Math and AI are revolutionizing biosciences

Abstract: Mathematics underpins fundamental theories in physics such as quantum mechanics, general relativity, and quantum field theory. However, its success in modern biology, namely cellular biology, molecular biology, biochemistry, and genetics, has been quite limited. Artificial intelligence (AI) has fundamentally changed the landscape of science, technology, industry, and social media in the past few years and holds a great future for discovering the rules of life. However, AI-based biological discovery encounters obstacles arising from the structural complexity of macromolecules, the high dimensionality of biological variability, the multiscale entanglement of molecules, cells, tissues, organs, and organisms, the nonlinearity of genotype, phenotype, and environment coupling, and the excessiveness of genomic, transcriptomic, proteomic, and metabolomic data. We tackle these challenges mathematically. Our work focuses on reducing the complexity, dimensionality, entanglement, and nonlinearity of biological data in AI. We have introduced evolutionary de Rham-Hodge, persistent cohomology, and persistent spectral graph theories to obtain high-level abstractions of biological systems and thus significantly enhance AI's ability to handle excessively large biological datasets. Using our mathematical AI approaches, my team has been the top winner in D3R Grand Challenges, a worldwide annual competition series in computer-aided drug design and discovery for years. Using one million genome isolates from patients, we prove for the first time that the emergency of SARS-CoV-2 variants is determined by natural selection. We also correctly predicted vaccine-escape, fast-growing, and antibody-disruptive viral mutations.

**Week 5 October 11:**

Speaker: **Sho Yaida, Facebook [****website****] **

Title: Effective Theory of Deep Neural Networks

Abstract: Large neural networks with many model parameters work extremely well in practice and form the foundation of modern machine learning. From theorists' perspective, however, it naively seems hard to understand these large models from first principles. Nonetheless, in this talk, we'll see that the statistics of neural networks -- both at initialization and after training -- drastically simplify at large width and become analytically tractable. Along the way, we'll further emphasize that the extreme infinite-width limit is too simple and misses several important aspects of deep learning; to address them, we need to include finite-width corrections through the use of perturbation theory.

**Week 5 October 18:**

Speaker: **Fei Lu, Johns Hopkins Universiy [****website****] **

Title: Nonparametric learning of interaction kernels in mean-field equations of particle systems

Abstract: Systems of interacting particles/agents arise in multiple disciplines, such as particle systems in physics, flocking birds and swarming cells in biology, and opinion dynamics in social science. We consider the learning of the distance-based interaction kernels between the particles/agents from data. A challenging case is when the system is large with millions of particles, and we can only observe the population density. We present an efficient regression algorithm to estimate the interaction kernel, along with a systematic learning theory addressing identifiability and convergence of the estimators. We demonstrate our algorithm on three typical examples: the opinion dynamics with a piecewise linear kernel, the granular media model with a quadratic kernel, and the aggregation-diffusion with a repulsive-attractive kernel.

**Week 5 October 25:**

Speaker: **Kim Chuan Toh, National University of Singapore [****website****] **

Title: Convex clustering: theoretical guarantee and efficient computations

Abstract: Convex clustering models based on sum-of-norms have been gaining more attention in recent years. The perfect recovery properties of the convex clustering model with uniformly weighted all pairwise-differences regularization have been proved by Panahi et al. in 2017. In this talk, we present sufficient conditions for the perfect recovery of general weighted convex clustering models, which include and improve existing theoretical results as special cases. We also develop a scalable sparse Newton based augmented Lagrangian method for solving large-scale convex clustering problems. Extensive numerical experiments on both simulated and real data demonstrate that our algorithm is highly efficient and robust for solving large-scale problems. Moreover, the experiments also show the superior performance and scalability of our algorithm compared to the existing first-order methods. In particular, our algorithm is able to solve a convex clustering problem with 200,000 points in 3-dimension in about 6 minutes.

**Week 5 November 1:**

Speaker: **Wuchen Li, University of South Carolina [****website****] **

Title: Transport information flows for Bayesian sampling problems

Abstract: In AI and inverse problems, the Markov chain Monte Carlo (MCMC) method is a classical model-free method for sampling target distributions. A fact is that the optimal transport first-order method (gradient flow) forms the MCMC scheme, known as Langevin dynamics. A natural question arises: Can we propose accelerated or high order optimization techniques for MCMC methods? We positively answer this question by applying optimization methods from optimal transport and information geometry, known as transport information geometry. E.g., we introduce a theoretical framework for Newton's flows in probability space w.r.t. the optimal transport metric. Several numerical examples are given to demonstrate the effectiveness of the proposed optimization-based sampling methods.

**Week 5 November 8:**

Speaker: **Rayan Saab, University of California, San Diego [****website****] **

Title: A greedy algorithm for quantizing neural networks.

Abstract: While neural networks have been remarkably successful in a wide array of application areas, miniaturizing them and implementing them in hardware remains an area of intense research. By quantizing, or replacing the weights of a neural network with quantized (e.g., 8-bit, or binary) counterparts, massive savings in cost, computation speed, memory, and power consumption can be attained. Of course, one wishes to attain these savings without compromising the performance of the network. We propose a new data-driven and computationally efficient method for quantizing the weights of already trained neural networks. Specifically, we quantize each neuron, or hidden unit, using a greedy path-following algorithm. We prove that this simple algorithm has favorable error guarantees for a single-layer neural network (or, alternatively, for quantizing the first layer of a multi-layer network) when the training data are random from an appropriate distribution. Under these assumptions, the quantization error decays with the width of the layer, i.e., its level of over-parametrization. We also provide numerical experiments, on multi-layer networks, to illustrate the performance of our methods.

**Week 5 November 15:**

Speaker: **Daniel Sanz-Alonso, University of Chicago [****website****] **

Title: Finite Element and Graphical Representations of Gaussian Processes

Abstract: Gaussian processes (GPs) are popular models for random functions in computational and applied mathematics, statistics, machine learning and data science. However, GP methodology scales poorly to large data-sets due to the need to factorize a dense covariance matrix. In spatial statistics, a standard approach to surmount this challenge is to represent Matérn GPs using finite elements, obtaining an approximation with a sparse precision matrix. The first part of the talk will give new understanding of this approach for regression and classification with large data-sets, showing that under mild smoothness assumptions the dimension of the matrices that need to be factorized can be reduced without hindering the estimation accuracy. The analysis balances finite element and statistical errors to show that there is a threshold beyond which further refining of the discretization increases the computational cost without improving the estimation accuracy. In the second part of the talk, I will introduce graphical representations of GPs to model random functions on high-dimensional point clouds, greatly expanding the important but limited scope of the finite element approach. I will show error bounds on the graphical representations, and study the associated posterior contraction in a semi-supervised learning problem. Time permitting, I will demonstrate the versatility of the graphical approach in applications to regression, classification and PDE-constrained Bayesian inverse problems where the covariates are sampled from a variety of hidden manifolds. This is joint work with Ruiyi Yang.

**Week 5 November 22:**

Speaker: **Xiaowu Dai, UC Berkeley [****website****] **

Title: Statistical Learning and Market Design

Abstract: We study the problem of decision-making in the setting of a scarcity of shared resources when the preferences of agents are unknown a priori and must be learned from data. Taking the two-sided matching market as a running example, we focus on the decentralized setting, where agents do not share their learned preferences with a central authority. Our approach is based on the representation of preferences in a reproducing kernel Hilbert space, and a learning algorithm for preferences that accounts for uncertainty due to the competition among the agents in the market. Under regularity conditions, we show that our estimator of preferences converges at a minimax optimal rate. Given this result, we derive optimal strategies that maximize agents' expected payoffs and we calibrate the uncertain state by taking opportunity costs into account. We also derive an incentive-compatibility property and show that the outcome from the learned strategies has a stability property. Finally, we prove a fairness property that asserts that there exists no justified envy according to the learned strategies. This is a joint work with Michael I. Jordan.

**Week 5 November 29:**

Speaker: **Andrea Bertozzi, University of California, Los Angeles [****website****] **

**Special time: 13:30 -14:30 PM ET**

Title: Geometric Graph-Based Methods for High Dimensional Data

Abstract: High dimensional data can be organized on a similarity graph - an undirected graph with edge weights that measure the similarity between data assigned to nodes. We consider problems in semi-supervised and unsupervised machine learning that are formulated as penalized graph cut problems. There are a wide range of problems including Cheeger cuts, modularity optimization on networks, and semi-supervised learning. We show a parallel between these modern problems and classical minimal surface problems in Euclidean space.

This analogy allows us to develop a suite of new algorithms for machine learning that are both very fast and highly accurate. These are analogues of well-known pseudo-spectral methods for partial differential equations.

**Week 5 December 6:**

Speaker: **Senwei Liang, Purdue University [****website****] **

Title: Solving PDEs on unknown manifolds with machine learning

Abstract: Solving high-dimensional PDEs on unknown manifolds is a challenging computational problem that commands a wide variety of applications. In this talk, I will introduce a mesh-free computational framework and machine learning theory for solving elliptic PDEs on unknown manifolds, identified with point clouds, based on diffusion maps and deep learning. The PDE solver is formulated as a supervised learning task to solve a least-squares regression problem that imposes an algebraic equation approximating a PDE (and boundary conditions if applicable). The resulting numerical method is to solve a highly non-convex empirical risk minimization problem subjected to a solution from a hypothesis space of neural networks. The parameterization error and global convergence analysis are established for the proposed method. Supporting numerical examples demonstrate the convergence of the solutions and the effectiveness of the proposed solver in avoiding numerical issues that hampers the traditional approach when a large data set becomes available, e.g., large matrix inversion.

**2021 Spring Mathematical Data Science Seminar**

**Department of Mathematics**

**Purdue University**

**Time: Monday 10:30 - 11:30 AM EDT**

**Location: Zoom Meeting**

**Zoom Meeting ID:940 8060 9694**

**Password:919442**

**Link: ****https://purdue-edu.zoom.us/j/94080609694?pwd=eldxbTR2UWRDTGthakd4NUhISGRYQT09**

**Week 1 January 18: **

Speaker: **Yiqi Gu, National University of Singapore **

Title: The discovery of dynamics via deep learning

Abstract: Identifying hidden dynamics from observed data is a significant and challenging task in a wide range of applications. Recently, the combination of linear multistep methods (LMMs) and deep learning has been successfully employed to discover dynamics. We put forward an error estimate for the deep network-based LMMs using the approximation property of ReLU networks. It indicates, for certain families of LMMs, that the L2 grid error is of O(h^p) where h is the time step size and p is the local truncation error order, as long as the network size is sufficiently large. Moreover, the numerical results of some physically relevant examples are consistent with our theory.

**Week 2 January 25:**

Speaker: **Alex Townsend, Cornell University [****website****] **

Title: An approximation theory perspective on deep learning

Abstract: As empirical improvements in deep learning become more challenging, we will need to understand its success and shortcomings.

Fortunately, approximation theory has a lot to say about deep learning. In this talk, we will explore how classical results in radial basis function theory, rational functions, and Green’s functions can be used to give us insights into ReLU neural networks and PDE learning. This talk will involve work with Austin Benson, Nicolas Boulle, Anil Damle, and Yuji Nakatsukasa.

**Week 3 February 1:**

Speaker: **Haomin Zhou, Georgia Institute of Technology [****website****] **

Title: Neural Parametric Fokker-Planck equation and its Wasserstein error estimates

Abstract: In this presentation, I will introduce a system of ODEs that are constructed by using generative neural networks to spatially approximate the Fokker-Planck equation. We call this system Parametric Fokker-Planck equation. We design a semi-implicit time discretized scheme to compute its solution by only using random samples in space. The resulting scheme allows us to approximate the solution of Fokker-Planck equation in high dimensions. Furthermore, we provide error bounds, in terms of Wasserstein metric, for the semi-discrete and fully discrete approximations. This presentation is based on a recent joint work with Wuchen Li (South Carolina), Shu Liu (Math GT) and Hongyuan Zha (CUHK Shenzhen).

**Week 4 February 8:**

Speaker: **Jianfeng Cai, Hong Kong University of Science and Technology [****website****] **

Title: Landscape analysis of non-convex optimizations in phase retrieval

Abstract: Non-convex optimization is a ubiquitous tool in scientific and engineering research. For many important problems, simple non-convex optimization algorithms often provide good solutions efficiently and effectively, despite possible local minima. One way to explain the success of these algorithms is through the global landscape analysis. In this talk, we present some results along with this direction for phase retrieval. The main results are, for several of non-convex optimizations in phase retrieval, a local minimum is also global and all other critical points have a negative directional curvature. The results not only will explain why simple non-convex algorithms usually find a global minimizer for phase retrieval, but also will be useful for developing new efficient algorithms with a theoretical guarantee by applying algorithms that are guaranteed to find a local minimum.

**Week 5 February 15:**

Speaker: **Yulong Lu, University of Massachusetts Amherst [****website****] **

Title: On the a priori generalization analysis of neural network methods for solving high dimensional PDEs

Abstract: Neural network-based machine learning methods, including the most notably deep learning have achieved extraordinary successes in numerous fields. In spite of the rapid development of learning algorithms based on neural networks, their mathematical analysis are far from understood. In particular, it has been a big mystery that neural network-based machine learning methods work extremely well for solving high dimensional problems.

In this talk, we will demonstrate the power of neural network methods for solving high dimensional problems PDEs. Specifically we will discuss an a priori generalization error analysis of the Deep Ritz Method for solving high dimensional elliptic problems. Assuming the exact solution lies in a low-complexity function space called spectral Barron space, we show that the convergence rate of the generalization error is independent of dimension. We also develop a new solution theory for the PDEs of consideration on the spectral Barron space, which can be viewed as an analog of the classical Sobolev regularity theory for PDEs. This is a joint work with Jianfeng Lu and Min Wang.

**Week 6 February 22:**

Speaker: **Marie-Therese Wolfram, University of Warwick [****website****]**

Title: Inverse Optimal Transport (joint with Andrew Stuart)

Abstract: Discrete optimal transportation problems arise in various contexts in engineering, the sciences and the social sciences. Examples include the marriage market in economics or international migration flows in demographics. Often the underlying cost criterion is unknown, or only partly known, and the observed optimal solutions are corrupted by noise. In this talk we discuss a systematic approach to infer unknown costs from noisy observations of optimal transportation plans. The proposed methodologies are developed within the Bayesian framework for inverse problems and require only the ability to solve the forward optimal transport problem, which is a linear program, and to generate random numbers. We illustrate our approach using the example of international migration flows. Here reported migration flow data captures (noisily) the number of individuals moving from one country to another in a given period of time. It can be interpreted as a noisy observation of an optimal transportation map, with costs related to the geographical position of countries. We use a graph-based formulation of the problem, with countries at the nodes of graphs and non-zero weighted adjacencies only on edges between countries which share a border. We use the proposed algorithm to estimate the weights, which represent cost of transition, and to quantify uncertainty in these weights.

References: Andrew M. Stuart and Marie-Therese Wolfram. Inverse optimal transport. SIAM Journal on Applied Mathematics, 80(1):599–619, 2020.

**Week 7 March 1:**

Speaker: **Song Mei, University of California, Berkeley [****website****] **

Title: The efficiency of kernel methods on structured datasets

Abstract: Inspired by the proposal of tangent kernels of neural networks (NNs), a recent research line aims to design kernels with a better generalization performance on standard datasets. Indeed, a few recent works showed that certain kernel machines perform as well as NNs on certain datasets, despite their separations in specific cases implied by theoretical results. Furthermore, it was shown that the induced kernels of convolutional neural networks perform much better than any former handcrafted kernels. These empirical results pose a theoretical challenge to understanding the performance gaps in kernel machines and NNs in different scenarios.

In this talk, we show that data structures play an essential role in inducing these performance gaps. We consider a few natural data structures, and study their effects on the performance of these learning methods. Based on a fine-grained high dimensional asymptotics framework of analyzing random features models and kernel machines, we show the following: 1) If the feature vectors are nearly isotropic, kernel methods suffer from the curse of dimensionality, while NNs can overcome it by learning the best low-dimensional representation; 2) If the feature vectors display the same low-dimensional structure as the target function (the spiked covariates model), this curse of dimensionality becomes milder, and the performance gap between kernel methods and NNs become smaller; 3) On datasets that display some invariance structure (e.g., image dataset), there is a quantitative performance gain of using invariant kernels (e.g., convolutional kernels) over inner product kernels. Beyond explaining the performance gaps, these theoretical results can further provide some intuitions towards designing kernel methods with better performance.

This talk is based on a few joint works with Theodor Misiakiewicz, Behrooz Ghorbani, and Andrea Montanari: https://arxiv.org/abs/1904.12191, https://arxiv.org/abs/2006.13409, https://arxiv.org/abs/2101.10588, and a forthcoming arXiv manuscript.

**Week 8 March 8:**

Speaker: **Dan Mikulincer, Weizmann Institute [****website****] **

Title: Memorization with two-layers neural networks

Abstract: The aim of this talk is to present the memorization capacity of neural networks with two layers. While it is known that over-parameterized networks can memorize arbitrary labels we show that it suffices that the number of network parameters is of the same order as the number of data points to fit.

We will present three constructions of networks to memorize arbitrary labels:

i. The first construction is purely combinatorial and uses a minimal number of neurons

ii. The second construction shows that, for well-separated data sets, memorization can be achieved using gradient descent like procedures.

iii. The third construction introduces complex weights in order to construct a memorizing network which enjoys some regularizing properties enjoyed by over-parameterized networks.

**Week 9 March 15:**

Speaker: **Michael Herty, RWTH Aachen University [****website****] **

Title: Recent results on filtering using meanfield theory

Abstract: Many learning and optimization tasks can be solved by modern ensemble filtering methods. A mathematical formulation as inverse problem allows to apply ensemble Kalman filter methods to obtain a gradient free, converging algorithm for general inverse problems. The analysis of the method is based on the large ensemble limit using suitable PDE theory. Stabilization of the method as well as numerical results will be shown in this talk.

**Week 10 March 22:**

Speaker: **Lin Lin, University of California, Berkeley [****website****] **

Title: Quantum numerical linear algebra

Abstract: The two "quantum supremacy" experiments (by Google in 2019 and by USTC in 2020, respectively) have brought quantum computation to public attention. In this talk, I will discuss how to use a quantum computer to solve linear algebra problems. I will start with a toy linear system Ax=b, where A is merely a 2 x 2 matrix. I will then talk about some recent progress of quantum linear system solvers, eigenvalue problems, and a proposal for the quantum LINPACK benchmark.

**Week 11 March 29:**

Speaker: **Guido F. Montufar, University of California, Los Angeles [****website****] **

Title: Implicit bias of gradient descent for mean squared error regression with wide neural networks

Abstract: We investigate gradient descent training of wide neural networks and the corresponding implicit bias in function space. For 1D regression, we show that the solution of training a width-$n$ shallow ReLU network is within $n^{- 1/2}$ of the function which fits the training data and whose difference from initialization has smallest 2-norm of the second derivative weighted by $1/\zeta$. The curvature penalty function $1/\zeta$ is expressed in terms of the probability distribution that is utilized to initialize the network parameters, and we compute it explicitly for various common initialization procedures. For instance, asymmetric initialization with a uniform distribution yields a constant curvature penalty, and hence the solution function is the natural cubic spline interpolation of the training data. While similar results have been obtained in previous works, our analysis clarifies important details and allows us to obtain significant generalizations. In particular, the result generalizes to multivariate regression and different activation functions. Moreover, we show that the training trajectories are captured by trajectories of spatially adaptive smoothing splines with decreasing regularization strength. This is joint work with Hui Jin.

**Week 12 April 5:**

Speaker: **Joan Bruna Estrach, New York University [****website****] **

Title: Mathematical aspects of neural network approximation and learning

Abstract: High-dimensional learning remains an outstanding phenomena where experimental evidence outpaces our current mathematical understanding, mostly due to the recent empirical successes of Deep Learning. Neural Networks provide a rich yet intricate class of functions with statistical abilities to break the curse of dimensionality, and where physical priors can be tightly integrated into the architecture to improve sample efficiency. Despite these advantages, an outstanding theoretical challenge in these models is computational, by providing an analysis that explains successful optimization and generalization in the face of existing worst-case computational hardness results.

In this talk, we will describe snippets of such challenge, covering respectively optimization and approximation. First, we will focus on the framework that lifts parameter optimization to an appropriate measure space. We will overview existing results that guarantee global convergence of the resulting Wasserstein gradient flows, and present our recent results that study typical fluctuations of the dynamics around their mean field evolution, as well as extensions of this framework beyond vanilla supervised learning to account for symmetries in the function and in competitive optimization. Next, we will discuss the role of depth in terms of approximation, and present novel results establishing so-called ‘depth separation’ for a broad class of functions. We will conclude by discussing consequences in terms of optimization, highlighting current and future mathematical challenges.

Joint work with: Zhengdao Chen, Grant Rotskoff, Eric Vanden-Eijnden, Luca Venturi, Samy Jelassi and Aaron Zweig.

**Week 13 April 12:**

Speaker: **Arnulf Jentzen, University of Munster [****website****] **

Title: Overcoming the curse of dimensionality: from nonlinear Monte Carlo to deep learning

Abstract: Partial differential equations (PDEs) are among the most universal tools used in modelling problems in nature and man-made complex systems. For example, stochastic PDEs are a fundamental ingredient in models for nonlinear filtering problems in chemical engineering and weather forecasting, deterministic Schroedinger PDEs describe the wave function in a quantum physical system, deterministic Hamiltonian-Jacobi-Bellman PDEs are employed in operations research to describe optimal control problems where companies aim to minimise their costs, and deterministic Black-Scholes-type PDEs are highly employed in portfolio optimization models as well as in state-of-the-art pricing and hedging models for financial derivatives. The PDEs appearing in such models are often high-dimensional as the number of dimensions, roughly speaking, corresponds to the number of all involved interacting substances, particles, resources, agents, or assets in the model. For instance, in the case of the above mentioned financial engineering models the dimensionality of the PDE often corresponds to the number of financial assets in the involved hedging portfolio. Such PDEs can typically not be solved explicitly and it is one of the most challenging tasks in applied mathematics to develop approximation algorithms which are able to approximately compute solutions of high-dimensional PDEs. Nearly all approximation algorithms for PDEs in the literature suffer from the so-called "curse of dimensionality" in the sense that the number of required computational operations of the approximation algorithm to achieve a given approximation accuracy grows exponentially in the dimension of the considered PDE. With such algorithms it is impossible to approximately compute solutions of high-dimensional PDEs even when the fastest currently available computers are used. In the case of linear parabolic PDEs and approximations at a fixed space-time point, the curse of dimensionality can be overcome by means of Monte Carlo approximation algorithms and the Feynman-Kac formula. In this talk we prove that suitable deep neural network approximations do indeed overcome the curse of dimensionality in the case of a general class of semilinear parabolic PDEs and we thereby prove, for the first time, that a general semilinear parabolic PDE can be solved approximately without the curse of dimensionality.

**Week 14 April 19:**

Speaker: **Thomas Hou, California Institute of Technology [****website****] **

Talk video record [link]

Title: Asymptotic escape of saddles and unified global analysis for low-rank matrix recovery

Abstract: Manifold optimization has found many important applications in machine learning and data science in recent years. In this talk, we introduce some new results on manifold optimization using the Riemannian gradient descent method with a focus on low-rank matrix recovery. The purpose of our analysis is twofold: to prove the asymptotic escape of strict saddles and convergence to local minima on the manifold, and to obtain the exact convergence rate of a class of low-rank matrix recovery problems using manifold optimization. On the asymptotic side, we analyze the escape of strict saddle sets using the projected gradient descent (PGD) algorithm. We show that PGD is able to escape strict saddle sets that are non-isolated provided that they have certain geometry properties. This is a general result, and as an example, we apply it to the phase retrieval problem and explain the asymptotic behavior of PGD for phase retrieval on the low-rank matrix manifold. On the non-asymptotic side, we propose a unified analysis for a class of low-rank matrix recovery problems. We use PGD with random initialization to minimize the empirical least squares loss function, and study the convergence rate under some mild assumptions. Our results can be further extended to problems such as Gaussian phase retrieval, matrix factorization and matrix sensing. This is a joint work with Dr. Zhenzhen Li and Ziyun Zhang.

**Week 15 April 26:**

Speaker: **Jack Xin, University of California, Irvine [****website****] **

Title: Relaxed Differentiable Search Methods for Deep Neural Networks

Abstract: Neural network architectures are traditionally hand designed based on a designer's knowledge, intuition and experience. In recent years, data-driven design has been studied as an alternative where a search algorithm is deployed to find an optimal architecture in addition to optimizing network weights. This led to a two-level optimization problem to solve. I shall review a few methods especially differential architecture search (DARTS), then introduce a single-level optimization problem as a relaxation approximation and the associated convergent algorithm RARTS. Through architecture/weight variable splitting and Gauss-Seidel iterations, RARTS outperforms DARTS in accuracy and search efficiency, shown in a solvable problem and the CIFAR-10 image classification based search. The gain over DARTS continues upon transfer to ImageNet (1000 image classes). RARTS is further applied for network architecture compression, such as channel pruning of overparameterized convolutional neural networks. In experiments on ResNet-18 and MobileNetV2, RARTS achieves considerable channel sparsity and learns slim deep networks with satisfactory accuracy.

**Week 16 May 3:**

Speaker: **Jonathan Siegel, Penn State University [****website****] **

Title: Approximation Rates and Metric Entropy of Shallow Neural Networks

Abstract: We consider the problem of approximating high dimensional functions using shallow neural networks, and more generally by sparse linear combinations of elements of a dictionary. We begin by introducing natural spaces of functions which can be efficiently approximated in this way. Then, we derive the metric entropy of the unit balls in these spaces, which allows us to calculate optimal approximation rates for approximation by shallow neural networks. This gives a precise measure of how large this class of functions is and how well shallow neural networks overcome the curse of dimensionality. Finally, we describe an algorithm which can be used to solve high-dimensional PDEs using this space of functions.

**2020-Fall Mathematical Data Science Seminar**

**Department of Mathematics**

**Purdue University**

**Time: Monday 10:30 - 11:30 AM EDT**

**Location: Zoom Meeting**

**Zoom Meeting ID:968 1759 0066**

**Password:389806**

**Schedule:**

**Week 1 August 24: **

Speaker:** Xin Tong, National University of Singapore [****website****] **

Title: Can Algorithms Collaborate? The Replica Exchange Method

Abstract: Gradient descent (GD) is known to converge quickly for convex objective functions, but it can be trapped at local minima. On the other hand, Langevin dynamics (LD) can explore the state space and find global minima, but in order to give accurate estimates, LD needs to run with a small discretization step size and weak stochastic force, which in general slows down its convergence. This paper shows that these two algorithms can ``collaborate" through a simple exchange mechanism, in which they swap their current positions if LD yields a lower objective function. This idea can be seen as the singular limit of the replica-exchange technique from the sampling literature. We show that this new algorithm converges to the global minimum linearly with high probability, assuming the objective function is strongly convex in a neighborhood of the unique global minimum. By replacing gradients with stochastic gradients, and adding a proper threshold to the exchange mechanism, our algorithm can also be used in online settings. We further verify our theoretical results through some numerical experiments, and observe superior performance of the proposed algorithm over running GD or LD alone.

**Week 2 August 31:**

Speaker:** Xudong Li, Fudan University [****website****] **

Title: On the Equivalence of Inexact Proximal ALM and ADMM for a Class of Convex Composite Programming

Abstract: In this talk, we show that for a class of linearly constrained convex composite optimization problems, an (inexact) symmetric Gauss-Seidel based majorized multi-block proximal alternating direction method of multipliers (ADMM) is equivalent to an inexact proximal augmented Lagrangian method (ALM). This equivalence not only provides new perspectives for understanding some ADMM-type algorithms but also supplies meaningful guidelines on implementing them to achieve better computational efficiency. A collection of illustrative examples is provided to demonstrate the breadth of applications for which our results can be used. Numerical experiments on solving semidefinite programming problems are conducted to illustrate how the theoretical results established here can lead to improvements on the corresponding practical implementations.

**Week 3 September 7: (National holiday)**

**Week 4 September 14:**

Speaker: **Wenjing Liao, Georgia Institute of Technology [****website****] **

Title: Regression of functions on low-dimensional manifolds by neural networks

Abstract: Many data in real-world applications lie in a high-dimensional space but are concentrated on or near a low-dimensional manifold. Our goal is to estimate functions on the manifold from finite samples of data. This talk focuses on an efficient approximation theory of deep ReLU networks for functions supported on low-dimensional manifolds. We construct a ReLU network for such function approximation where the size of the network grows exponentially with respect to the intrinsic dimension of the manifold. When the function is estimated from finite samples, we proved that the mean squared error for the function approximation converges as the training samples increases with a rate depending on the intrinsic dimension of the manifold instead of the ambient dimension of the space. These results demonstrate that deep neural networks are adaptive to low-dimensional geometric structures of data. This is a joint work with Minshuo Chen, Haoming Jiang, Tuo Zhao (Georgia Institute of Technology).

**Week 5 September 21:**

Speaker: **Qin Li, University of Wisconsin-Madison [****website****] **

Title: Ensemble Kalman Inversion, Ensemble Kalman Sampling, and mean-field limit

Abstract: How to sample from a target distribution function is one of the core problems in Bayesian inference. It is used widely in machine learning, data assimilation and inverse problems. During the past decade, ensemble type algorithms have been very popular, among which, Ensemble Kalman Inversion (EKI) and Ensemble Kalman Sampling (EKS) may have garnered the most attention. While numerical experiments suggest consistency and fast convergence, rigorous mathematical justification has been in lack.

To prove the validity of the two methods, we utilize the mean-field limit argument. This translates the algorithms into a set of coupled stochastic differential equations, whose mean-field limits (as the number of particles $N$ goes to infinity) are Fokker-Planck equations that reconstruct the target distribution exponentially fast. We prove the convergence rate is optimal in the Wasserstein sense, meaning the ensemble distribution converges to the target distribution in $N^{-1/2}$.

It is a joint work with Zhiyan Ding.

**Week 6 September 28:**

Speaker: **John Harlim, Penn State University [****website****] **

Title: Solving elliptic PDEs on manifolds with diffusion maps algorithm

Abstract: In this talk, I will discuss recent efforts in applying the Diffusion Maps (DM) algorithm to solve elliptic PDEs on unknown manifolds using point clouds data. The key idea rests on the fact that away from the boundary, the second-order elliptic differential operators can be approximated by integral operators defined with appropriate Gaussian kernels. The key advantage of such an approximation is that one can avoid parameterizing the manifold, which can be complicated if the manifold is embedded in a high-dimensional ambient space. On manifolds with boundary, however, such an approximation is only valid for functions that satisfy the Neumann boundary condition. Motivated by the classical ghost-point correction in the finite-difference method for solving Neumann problems, we extend the diffusion maps algorithm with ghost points such that it is a consistent estimator in the pointwise sense even near the boundary. Applying the proposed algorithm, which we called the Ghost Points Diffusion Maps, to solve the well-posed elliptic PDEs with Dirichlet, Neumann, or Robin boundary conditions, we establish the convergence of the approximate solution under appropriate smoothness assumptions. Supporting numerical examples of problems on various known and unknown manifolds will be shown. If time permits, I will discuss an application on Bayesian elliptic inverse problems

**Week 7 October 5:**

Speaker: **Jason Klusowski, Princeton University [****website****] **

Title: “Sparse Learning with CART”

Abstract: “Decision trees with binary splits are popularly constructed using Classification and Regression Trees (CART) methodology. For regression models, this approach recursively divides the data into two near-homogenous daughter nodes according to a split point that maximizes the reduction in sum of squares error (the impurity) along a particular variable. In this talk, we explore some of the statistical properties of regression trees constructed with CART. We will see that the training error is governed by the Pearson correlation between the optimal decision stump and response data in each node, which can be bounded by constructing a prior distribution on the split points and solving a nonlinear optimization problem. We leverage this connection between the training error and Pearson correlation to show that CART with cost-complexity pruning achieves an optimal complexity/goodness-of-fit tradeoff when the depth scales with the logarithm of the sample size. Data dependent quantities, which adapt to the dimensionality and latent structure of the regression model, are seen to govern the rates of convergence of the prediction error.”

**Week 8 October 12:**

Speaker: **changed to November**

**Week 9 October 19:**

Speaker: **Matthew Hirn, Michigan State University [****website****]**

Title: Understanding convolutional neural networks through signal processing: From images to manifolds to graphs

Abstract: Convolutional neural networks (CNNs) are the go-to tool for image processing tasks in machine learning. But how and why do they work so well? Using the basic guiding principles of CNNs, namely their convolutional structure, invariance properties, and multi-scale nature, we will discuss how the CNN architecture arises as a natural bi-product of these principles using the language of nonlinear signal processing. In doing so we will extract some core ideas that allow us to generalize these architectures to manifolds and graphs, while still being able to provide mathematical guarantees on the nature of the representation provided by these tools.

**Week 10 October 26:**

Speaker: **Phillip Peterson, the University of Vienna [****website****] **

Title: Stable approximation of solutions of PDEs by neural networks

We discuss the approximation of certain classes of functions via deep neural networks with ReLU activation functions. The function classes that we consider are either classical smoothness spaces, such as $k$-times differentiable functions or more involved function spaces appearing as solutions of partial differential equations on general domains. Due to potential applications in the numerical analysis of PDEs, we focus on approximation in Sobolev norms. In addition, we put an emphasis on approximation with neural network weights that are bounded in a sensible way.

**Week 11 November 2:**

Speaker: **Zhongqiang Zhang, Worcester Polytechnic Institute**

Title: Error estimates of residual minimization using neural networks for linear PDEs

Abstract: We propose an abstract framework for analyzing the convergence of least-squares methods based on residual minimization when feasible solutions are neural networks. With the norm relations and compactness arguments, we derive error estimates for both continuous and discrete formulations of residual minimization in strong and weak forms. The formulations cover recently developed physics-informed neural networks based on strong and variational formulations. This is a joint work with Yeonjong Shin and George Em Karniadakis at Brown University. The full text of our work can be found at https://arxiv.org/abs/2010.08019

**Week 12 November 9:**

Speaker: **Yifei Lou, University of Texas at Dallas**

**[****website****]**

Title: Graph Regularized Models for Blind Hyperspectral Unmixing

Abstract: Remote sensing data from hyperspectral cameras suffer from limited spatial resolution, in which a single pixel of a hyperspectral image may contain information from several materials in the field of view. Blind hyperspectral image unmixing is the process of identifying the spectra of pure materials (i.e., endmembers) and their proportions (i.e., abundances) at each pixel. In this talk, I will present two graph regularized models: graph Laplacian and graph total variation (gTV) for blind hyperspectral unmixing, both of which can be solved efficiently by the alternating direction method of multipliers (ADMM). To further alleviate the computational cost, we apply the Nystrom method to approximate a fully-connected graph by a small subset of sampled points. Furthermore, we adopt the Merriman-Bence-Osher (MBO) scheme to solve the gTV-involved subproblem in ADMM by decomposing a grayscale image into a bit-wise form. A variety of numerical experiments on synthetic and real hyperspectral images are conducted, showcasing the potential of the proposed method in terms of identification accuracy and computational efficiency.

**Week 13 November 16:**

Speaker:** Qianxiao Li, National University of Singapore [****website]**

Title: On the Curse of Memory in Recurrent Neural Networks

Abstract: In this talk, we discuss approximation theory and optimization dynamics of recurrent neural networks for learning input-output temporal relationships. The convenient setting we adopt is a continuous-time, linear one, in which this problem can be regarded as the approximation of sequences of linear functionals. In particular, we present universal approximation theorems of these linear functionals and characterize the approximation rates. Moreover, we analyze the associated fine-grained optimization dynamics. A unifying theme uncovered is the adverse effect of long memory, which makes both approximation and optimization exponentially harder. We coin this the "curse of memory", and this may serve as a basic starting point to uncover new phenomena that arises in supervised learning of temporal data.

**Week 14 November 23:**

Speaker: **Xiuyuan Cheng, Duke University [****website****] **

Title: Local and Global Fourier Approaches in Convolutional Neural Networks

Abstract: Fourier basis which diagonalizes convolution is a fundamental tool in analysis and signal processing. This talk introduces three scenarios where how local and global Fourier methods help to better analyze and design modern deep convolutional neural networks (CNN). First, by adopting the Fourier-Bessel basis in CNN and rotation-equivariant CNN, we show how a filter-decomposed convolution leads to reduced model size and computation, along with provably stable deep representation with respect to geometric deformations of input data. Second, we introduce low-rank local filters on a graph which provide a graph convolution model more expressive to represent graph signals than certain spectral graph convolutions, the latter can be viewed as by global (general) Fourier transforms on graphs. Meanwhile, resilience to (local) high-frequency input perturbation can be obtained via local graph Laplacian regularization. Finally, in the context of representing Fourier kernels, we introduce a structured and sparsified CNN model called Butterfly-net, where the network parameters can be hard-coded to implement the Butterfly algorithm. Theoretically, this provides an approximation result of CNN where the network complexity is reduced to scale with K, instead of the input dimension, when approximating K-bandlimited operators. Joint work with Qiang Qiu, Zichen Miao, Robert Calderbank, Guillermo Sapiro; Yingzhou Li, Jianfeng Lu.

**Week 15 November 30:**

Speaker: **Xiangxiong Zhang, Purdue University [****website****] **

Title: Riemannian Optimization for Low Rank Hermitian Positive Semidefinite Constraints

Abstract: Riemannian Optimization on matrix manifolds along with a few examples will be introduced. Past and some ongoing work on Riemannian Optimization for Low Rank Hermitian Positive Semidefinite Constraints will also be elaborated.

**Week 16 December 7:**

Speaker: **Alex Cloninger, University of California, San Diego [****website****] **

Title: Fast Statistical and Geometric Distances Between Families of Distributions

Abstract: Detecting differences and building classifiers between a family of distributions, given only finite samples, has had renewed interest due to data science applications in high dimensions. Applications include survey response effects, topic modeling, and various measurements of cell or gene populations per person. Recent advances have focused on kernel Maximum Mean Discrepancy and Optimal Transport. However, when the family of distributions are concentrated near a low dimensional structure, or when the family of distributions being considered is generated from a family of simple group actions, these algorithms fail to exploit the reduced complexity. In this talk, we'll discuss the theoretical and computational advancements that can be made under these assumptions, and their connections to harmonic analysis, approximation theory, and group actions. Similarly, we'll use both techniques to develop methods of provably identifying not just how much the distributions deviate, but where these differences are concentrated. We'll also focus on applications in medicine, generative modeling, and supervised learning.