Statistics and Data Science Seminars



DMS Statistics and Data Science Seminar
Feb 12, 2025 02:00 PM
250 Parker Hall



Speaker: Dr. Gaetan Bakalli (Assistant Professor of Econometrics and Data Science, Emlyon Business School, Lyon)

Title: Nonstandard Errors

 

Abstract: In statistics, samples are drawn from a population in a data-generating process (DGP). Standard errors measure the uncertainty in estimates of population parameters. In science, evidence is generated to test hypotheses in an evidence-generating process (EGP). We claim that EGP variation across researchers adds uncertainty—nonstandard errors (NSEs). We study NSEs by letting 164 teams test the same hypotheses on the same data. NSEs turn out to be sizable, but smaller for more reproducible or higher rated research. Adding peer-review stages reduces NSEs. We further find that this type of uncertainty is underestimated by participants.


DMS Statistics and Data Science Seminar
Dec 04, 2024 02:00 PM
ZOOM


Speaker: Russell J. Bowater (Independent Statistical Consultant, Oaxaca City, Mexico)

Title: The 7 hardest lessons to learn in statistics 

 

Abstract:  What is the current state of the theory of statistical inference? Is it essentially in a good state except for a relatively small number of issues that need to be tidied up? Or is what is usually presented as being the standard and accepted theory of statistical inference so full of conceptual holes that it is nothing short of an embarrassment for anyone who wishes to describe themselves as a statistician? This talk explores these questions by presenting lessons that arguably need to be learnt but have proved difficult to learn for reasons that to a great extent are not related to doing good independent and impartial science. By exposing ourselves to such an uncomfortable level of introspection, a greater understanding can be gained about what we have done, where we are at and where we should be going.

 


DMS Statistics and Data Science Seminar
Nov 20, 2024 02:00 PM
354 Parker Hall



Speaker: Dr. Li An (Solon & Martha Dixon Endowed Professor at College of Forestry, Wildlife and Environment, Auburn University)

Title: Green the ecosystems at multiple spatial & temporal scales - Complex systems & Data Science approaches

 

Abstract: Humanity stands in an unprecedented era of climate change, environmental degradation, rapid biodiversity loss, and other crises. Green initiatives, defined to be programs, funds, payments, policies, or any endeavors that aim to counter such crises and restore, sustain, or improve nature’s capacity to benefit human beings, are becoming increasingly widespread and popular across the globe. Using data from 15 sites from local to global scales (including China and the USA), the author systematically explores how specific policy, intended behaviors, and gains of a certain green initiative may interact with those of other green initiatives concurrently implemented in the same geographic area or involving the same recipients. Spatial data science methods, including remote sensing, GIS, and eigenvector spatial filtering, are employed to uncover mechanisms behind the data. The findings suggest that spillover effects were widespread and divergent: one initiative could reduce the gain of another by 22% to 100%, indicating a sign of alarming losses. In other instances, one initiative can increase the gain of another by 9% to 310%, offering substantial co-benefits. This talk also presents current efforts that infuse artificial intelligence, machine learning, and agent-based models to uncover mechanisms and rules hidden in multiple space-time data patterns.

 

Biosketch: Dr. Li An is Solon & Martha Dixon Endowed Professor at College of Forestry, Wildlife and Environment and Director of International Center for Climate and Global Change Research at Auburn University. He received his B.S. degree from Peking University (Economic Geography), China, M.S. degrees from Chinese Academy of Sciences (Systems Ecology) and from Michigan State University (Probability and Statistics), and Ph.D. degree from Michigan State University (Systems modeling; fisheries and wildlife). His research focuses on complex human-environment systems, space-time analysis and modeling, spatial data science, landscape ecology, and complex adaptive systems. He is a Fellow of The American Association for the Advancement of Science (AAAS; 2020 class), Fellow of The American Association of Geographers (AAG; 2022 class), awardee of 2023 Distinguished Scholarship Honors from AAG, and recipient of multiple other awards or recognitions. He has been leading or involved in research projects funded by multiple federal agencies, and these projects are broadly distributed in Nepal, Ghana, USA, and China. He has served on the editorial boards of Annals of the American Association of Geographers, Ecological Modelling, The Journal of Artificial Societies and Social Simulation (JASSS), Remote Sensing, International Journal of Geospatial and Environmental Research, and Geography and Sustainability. He is President of the International Association of Landscape Ecology-North America.


DMS Statistics and Data Science Seminar
Nov 13, 2024 02:00 PM
354 Parker Hall



Speaker: Dr. Min Yang (University of Illinois at Chicago)

Title: Scalable Methodologies for Big Data Analysis: Integrating Flexible Statistical Models and Optimal Designs

 

Abstract: The formidable challenge presented by the analysis of big data stems not just from its sheer volume, but also from the diversity, complexity, and the rapid pace at which it needs to be processed or delivered. A compelling approach is to analyze a sample of the data, while still preserving the comprehensive information contained in the full dataset. Although there is a considerable amount of research on this subject, the majority of it relies on classical statistical models, such as linear models and generalized linear models, etc. These models serve as powerful tools when the relationships between input and output variables are uniform. However, they may not be suitable when applied to complex datasets, as they tend to yield suboptimal results in the face of inherent complexity or heterogeneity. In this presentation, we will introduce a broadly applicable and scalable methodology designed to overcome these challenges. This is achieved through an in-depth exploration and integration of cutting-edge statistical methods, drawing particularly from neural network models and, more specifically, Mixture-of-Experts (ME) models, along with optimal designs.


DMS Statistics and Data Science Seminar
Nov 06, 2024 02:00 PM
354 Parker Hall



Speaker: Dr. Yeonjoo Park (University of Texas at San Antonio)

Title: A data-adaptive dimension reduction for functional data via penalized low-rank approximation

 

Abstract: We introduce a data-adaptive nonparametric dimension reduction tool to obtain a low-dimensional approximation of functional data contaminated by erratic measurement errors following symmetric or asymmetric distributions. We propose to apply robust submatrix completion techniques to matrices consisting of coefficients of basis functions calculated by projecting the observed trajectories onto a given orthogonal basis set. In this process, we use a composite asymmetric Huber loss function to accommodate domain-specific erratic behaviors in a data-adaptive manner. We further incorporate the \(L_1\) penalty to regularize the smoothness of latent factor curves. The proposed method can also be applied to partially observed functional data, where each trajectory contains individual-specific missing segments. Moreover, since our method does not require estimating the covariance operator, the extension to any dimensional functional data observed over a continuum is straightforward. We demonstrate the empirical performance in estimating lower-dimensional space and reconstruction of trajectories of the proposed method through simulation studies. We then apply the proposed method to two real datasets, one-dimensional Advanced Metering Infrastructure (AMI) data and two-dimensional max precipitation spatial data collected in North America and South America.

 


DMS Statistics and Data Science Seminar
Oct 30, 2024 02:00 PM
354 Parker Hall



Speaker: Dr. Stéphane Guerrier (University of Geneva)

Title: Accurate Inference for Penalized Logistic Regression

 

Abstract: Inference for high-dimensional logistic regression models using penalized methods has been a challenging research problem. As an illustration, a major difficulty is the significant bias of the Lasso estimator, which limits its direct application in inference. Although various bias corrected Lasso estimators have been proposed, they often still exhibit substantial biases in finite samples, undermining their inference performance. These finite sample biases become particularly problematic in one-sided inference problems, such as one-sided hypothesis testing. This paper proposes a novel two-step procedure for accurate inference in high-dimensional logistic regression models. In the first step, we propose a Lasso-based variable selection method to select a suitable submodel of moderate size for subsequent inference. In the second step, we introduce a bias corrected estimator to fit the selected submodel. We demonstrate that the resulting estimator from this two-step procedure has a small bias order and enables accurate inference. Numerical studies and an analysis of alcohol consumption data are included, where our proposed method is compared to alternative approaches. Our results indicate that the proposed method exhibits significantly smaller biases than alternative methods in finite samples, thereby leading to improved inference performance.

 

This is joint work with Yuming Zhang (Harvard) and Runze Li (Penn State).

 


DMS Statistics and Data Science Seminar
Oct 23, 2024 02:00 PM
ZOOM



Speaker: Dr. Rong Ma (Harvard T.H. Chan School of Public Health)

Title: Is your data alignable? A geometric view of single-cell data integration

 

Abstract: Single-cell data integration can provide a comprehensive molecular view of cells, and many algorithms have been developed to remove unwanted technical or biological variations and integrate heterogeneous single-cell datasets. Despite their wide usage, existing methods suffer from several fundamental limitations. In particular, we lack a rigorous statistical test for whether two high-dimensional datasets are in principle alignable (and therefore should even be aligned). Moreover, popular methods can substantially distort the data during alignment, making the aligned data and downstream analysis difficult to interpret. To overcome these limitations, we present a spectral manifold alignment and inference (SMAI) framework, which enables principled and interpretable alignability testing and structure-preserving integration of single-cell data with the same type of features. SMAI provides a statistical test to robustly assess the alignability between datasets to avoid misleading inference and is justified by high-dimensional statistical theory. On a diverse range of real and simulated benchmark datasets, it outperforms commonly used alignment methods. Moreover, we show that SMAI improves various downstream analyses such as identification of differentially expressed genes and imputation of single-cell spatial transcriptomics, providing further biological insights. SMAI’s interpretability also enables quantification and a deeper understanding of the sources of technical confounders in single-cell data.

 

This is joint work with Eric Sun, David Donoho, and James Zou from Stanford University.


DMS Statistics and Data Science Seminar
Oct 16, 2024 02:00 PM
ZOOM



Speaker: Dr. Jue Hou (University of Minnesota, School of Public Health) 

Title: Efficient and Robust Semi-supervised Estimation of Average Treatment Effect with Partially Annotated Treatment and Response

 

Abstract: A notable challenge of leveraging Electronic Health Records (EHR) for treatment effect assessment is the lack of precise information on important clinical variables, including the treatment received and the response. Both treatment information and response cannot be accurately captured by readily available EHR features in many studies and require labor-intensive manual chart review to precisely annotate, which limits the number of available gold standard labels on these key variables. We considered average treatment effect (ATE) estimation when 1) exact treatment and outcome variables are only observed together in a small labeled subset and 2) noisy surrogates of treatment and outcome, such as relevant prescription and diagnosis codes, along with potential confounders are observed for all subjects. We derived the efficient influence function for ATE and used it to construct a semi-supervised multiple machine learning (SMMAL) estimator. We justified that our SMMAL ATE estimator is semi-parametric efficient with B-spline regression under low-dimensional smooth models. We developed the adaptive sparsity/model doubly robust estimation under high-dimensional logistic propensity score and outcome regression models. Results from simulation studies demonstrated the validity of our SMMAL method and its superiority over supervised and unsupervised benchmarks. We applied SMMAL to the assessment of targeted therapies for metastatic colorectal cancer in comparison to chemotherapy.


DMS Statistics and Data Science Seminar
Oct 09, 2024 02:00 PM
ZOOM



Speaker: Dr. Rui Duan (Assistant Professor of Biostatistics, Harvard T.H. Chan School of Public Health)

Title: Collaborative Multi-Site Statistical Inference

 

Abstract: In response to the increasing demand for generating real-world evidence from multi-site collaborative studies, we introduce novel collaborative learning approaches designed to estimate the average treatment effect in multi-site settings with data-sharing constraints. Our methods operate in a federated manner, utilizing individual-level data from a user-defined target population while leveraging summary statistics from other source populations to construct efficient estimators for the average treatment effect on the target population. Crucially, our federated approach eliminates the need for iterative communications between sites, making it particularly well-suited for research consortia that lack the resources to implement automated data-sharing infrastructures.

Compared to existing data integration methods in causal inference, our approach accommodates distributional shifts in outcomes, treatments, and baseline covariates. Additionally, under the right conditions, it achieves the semiparametric efficiency bound. We conduct simulation studies to demonstrate the efficiency gains obtained by incorporating data from multiple sources, as well as the robustness of our methods in the face of varying levels of distributional shifts. Furthermore, we showcase several real-world case studies that estimate treatment effectiveness using data from multiple institutions.

 


DMS Statistics and Data Science Seminar
Oct 02, 2024 02:00 PM
ZOOM



Speaker: Dr. Yuting Wei (The Wharton School, University of Pennsylvania)

Title: Towards faster non-asymptotic convergence for diffusion-based generative models

Abstract: Diffusion models, which convert noise into new data instances by learning to reverse a Markov diffusion process, have become a cornerstone in contemporary generative artificial intelligence. While their practical power has now been widely recognized, the theoretical underpinnings remain far from mature.  In this work, we develop a suite of non-asymptotic theory towards understanding the data generation process of diffusion models in discrete time, assuming access to \(\ell_2\)-accurate estimates of the (Stein) score functions. For a popular deterministic sampler (based on the probability flow ODE), we establish a convergence rate proportional to \(1/T\) (with \(T\) the total number of steps), improving upon past results; for another mainstream stochastic sampler (i.e., a type of the denoising diffusion probabilistic model), we derive a convergence rate proportional to \(1/\sqrt{T}\), matching the state-of-the-art theory. Imposing only minimal assumptions on the target data distribution (e.g., no smoothness assumption is imposed), our results characterize how \(\ell_2\) score estimation errors affect the quality of the data generation processes. Further, we design two accelerated variants, improving the convergence to \(1/T^2\) for the ODE-based sampler and \(1/T\) for the DDPM-type sampler, which might be of independent theoretical and empirical interest.

 

 

Bio: Dr. Yuting Wei is currently an assistant professor in the Statistics and Data Science Department at the Wharton School, University of Pennsylvania. Prior to that, Dr. Wei spent two years at Carnegie Mellon University as an assistant professor and one year at Stanford University as a Stein Fellow. She received her Ph.D. in statistics at the University of California, Berkeley. She was the recipient of the 2023 Google Research Scholar Award, the 2022 NSF CAREER Award, and the Erich L. Lehmann Citation from the Berkeley statistics department. Her research interests include high-dimensional and non-parametric statistics, statistical machine learning, and reinforcement learning.

 

