Statistics and Data Science Seminars
Nov 06, 2024 02:00 PM
354 Parker Hall
Speaker: Dr. Yeonjoo Park (University of Texas at San Antonio)
Title: A data-adaptive dimension reduction for functional data via penalized low-rank approximation
Abstract: We introduce a data-adaptive nonparametric dimension reduction tool to obtain a low-dimensional approximation of functional data contaminated by erratic measurement errors following symmetric or asymmetric distributions. We propose to apply robust submatrix completion techniques to matrices consisting of coefficients of basis functions calculated by projecting the observed trajectories onto a given orthogonal basis set. In this process, we use a composite asymmetric Huber loss function to accommodate domain-specific erratic behaviors in a data-adaptive manner. We further incorporate the \(L_1\) penalty to regularize the smoothness of latent factor curves. The proposed method can also be applied to partially observed functional data, where each trajectory contains individual-specific missing segments. Moreover, since our method does not require estimating the covariance operator, the extension to any dimensional functional data observed over a continuum is straightforward. We demonstrate the empirical performance in estimating lower-dimensional space and reconstruction of trajectories of the proposed method through simulation studies. We then apply the proposed method to two real datasets, one-dimensional Advanced Metering Infrastructure (AMI) data and two-dimensional max precipitation spatial data collected in North America and South America.
Oct 30, 2024 02:00 PM
354 Parker Hall
Speaker: Dr. Stéphane Guerrier (University of Geneva)
Title: Accurate Inference for Penalized Logistic Regression
Abstract: Inference for high-dimensional logistic regression models using penalized methods has been a challenging research problem. As an illustration, a major difficulty is the significant bias of the Lasso estimator, which limits its direct application in inference. Although various bias corrected Lasso estimators have been proposed, they often still exhibit substantial biases in finite samples, undermining their inference performance. These finite sample biases become particularly problematic in one-sided inference problems, such as one-sided hypothesis testing. This paper proposes a novel two-step procedure for accurate inference in high-dimensional logistic regression models. In the first step, we propose a Lasso-based variable selection method to select a suitable submodel of moderate size for subsequent inference. In the second step, we introduce a bias corrected estimator to fit the selected submodel. We demonstrate that the resulting estimator from this two-step procedure has a small bias order and enables accurate inference. Numerical studies and an analysis of alcohol consumption data are included, where our proposed method is compared to alternative approaches. Our results indicate that the proposed method exhibits significantly smaller biases than alternative methods in finite samples, thereby leading to improved inference performance.
This is joint work with Yuming Zhang (Harvard) and Runze Li (Penn State).
DMS Statistics and Data Science Seminar
Oct 23, 2024 02:00 PM
ZOOM
Speaker: Dr. Rong Ma (Harvard T.H. Chan School of Public Health)
Title: Is your data alignable? A geometric view of single-cell data integration
Abstract: Single-cell data integration can provide a comprehensive molecular view of cells, and many algorithms have been developed to remove unwanted technical or biological variations and integrate heterogeneous single-cell datasets. Despite their wide usage, existing methods suffer from several fundamental limitations. In particular, we lack a rigorous statistical test for whether two high-dimensional datasets are in principle alignable (and therefore should even be aligned). Moreover, popular methods can substantially distort the data during alignment, making the aligned data and downstream analysis difficult to interpret. To overcome these limitations, we present a spectral manifold alignment and inference (SMAI) framework, which enables principled and interpretable alignability testing and structure-preserving integration of single-cell data with the same type of features. SMAI provides a statistical test to robustly assess the alignability between datasets to avoid misleading inference and is justified by high-dimensional statistical theory. On a diverse range of real and simulated benchmark datasets, it outperforms commonly used alignment methods. Moreover, we show that SMAI improves various downstream analyses such as identification of differentially expressed genes and imputation of single-cell spatial transcriptomics, providing further biological insights. SMAI’s interpretability also enables quantification and a deeper understanding of the sources of technical confounders in single-cell data.
This is joint work with Eric Sun, David Donoho, and James Zou from Stanford University.
DMS Statistics and Data Science Seminar
Oct 16, 2024 02:00 PM
ZOOM
Speaker: Dr. Jue Hou (University of Minnesota, School of Public Health)
Title: Efficient and Robust Semi-supervised Estimation of Average Treatment Effect with Partially Annotated Treatment and Response
Abstract: A notable challenge of leveraging Electronic Health Records (EHR) for treatment effect assessment is the lack of precise information on important clinical variables, including the treatment received and the response. Both treatment information and response cannot be accurately captured by readily available EHR features in many studies and require labor-intensive manual chart review to precisely annotate, which limits the number of available gold standard labels on these key variables. We considered average treatment effect (ATE) estimation when 1) exact treatment and outcome variables are only observed together in a small labeled subset and 2) noisy surrogates of treatment and outcome, such as relevant prescription and diagnosis codes, along with potential confounders are observed for all subjects. We derived the efficient influence function for ATE and used it to construct a semi-supervised multiple machine learning (SMMAL) estimator. We justified that our SMMAL ATE estimator is semi-parametric efficient with B-spline regression under low-dimensional smooth models. We developed the adaptive sparsity/model doubly robust estimation under high-dimensional logistic propensity score and outcome regression models. Results from simulation studies demonstrated the validity of our SMMAL method and its superiority over supervised and unsupervised benchmarks. We applied SMMAL to the assessment of targeted therapies for metastatic colorectal cancer in comparison to chemotherapy.
DMS Statistics and Data Science Seminar
Oct 09, 2024 02:00 PM
ZOOM
Speaker: Dr. Rui Duan (Assistant Professor of Biostatistics, Harvard T.H. Chan School of Public Health)
Title: Collaborative Multi-Site Statistical Inference
Abstract: In response to the increasing demand for generating real-world evidence from multi-site collaborative studies, we introduce novel collaborative learning approaches designed to estimate the average treatment effect in multi-site settings with data-sharing constraints. Our methods operate in a federated manner, utilizing individual-level data from a user-defined target population while leveraging summary statistics from other source populations to construct efficient estimators for the average treatment effect on the target population. Crucially, our federated approach eliminates the need for iterative communications between sites, making it particularly well-suited for research consortia that lack the resources to implement automated data-sharing infrastructures.
Compared to existing data integration methods in causal inference, our approach accommodates distributional shifts in outcomes, treatments, and baseline covariates. Additionally, under the right conditions, it achieves the semiparametric efficiency bound. We conduct simulation studies to demonstrate the efficiency gains obtained by incorporating data from multiple sources, as well as the robustness of our methods in the face of varying levels of distributional shifts. Furthermore, we showcase several real-world case studies that estimate treatment effectiveness using data from multiple institutions.
DMS Statistics and Data Science Seminar
Oct 02, 2024 02:00 PM
ZOOM
Speaker: Dr. Yuting Wei (the Wharton School, University of Pennsylvania)
Title: Towards faster non-asymptotic convergence for diffusion-based generative models
Abstract: Diffusion models, which convert noise into new data instances by learning to reverse a Markov diffusion process, have become a cornerstone in contemporary generative artificial intelligence. While their practical power has now been widely recognized, the theoretical underpinnings remain far from mature. In this work, we develop a suite of non-asymptotic theory towards understanding the data generation process of diffusion models in discrete time, assuming access to \(\ell_2\)-accurate estimates of the (Stein) score functions. For a popular deterministic sampler (based on the probability flow ODE), we establish a convergence rate proportional to \(1/T\) (with \(T\) the total number of steps), improving upon past results; for another mainstream stochastic sampler (i.e., a type of the denoising diffusion probabilistic model), we derive a convergence rate proportional to \(1/\sqrt{T}\), matching the state-of-the-art theory. Imposing only minimal assumptions on the target data distribution (e.g., no smoothness assumption is imposed), our results characterize how \(\ell_2\) score estimation errors affect the quality of the data generation processes. Further, we design two accelerated variants, improving the convergence to \(1/T^2\) for the ODE-based sampler and \(1/T\) for the DDPM-type sampler, which might be of independent theoretical and empirical interest.
Bio: Dr. Yuting Wei is currently an assistant professor in the Statistics and Data Science Department at the Wharton School, University of Pennsylvania. Prior to that, Dr. Wei spent two years at Carnegie Mellon University as an assistant professor and one year at Stanford University as a Stein Fellow. She received her Ph.D. in statistics at the University of California, Berkeley. She was the recipient of the 2023 Google Research Scholar Award, the 2022 NSF CAREER Award, and the Erich L. Lehmann Citation from the Berkeley statistics department. Her research interests include high-dimensional and non-parametric statistics, statistical machine learning, and reinforcement learning.
DMS Statistics and Data Science Seminar
Sep 25, 2024 02:00 PM
ZOOM
Speaker: Dr. Lan Luo (Assistant Professor, Department of Biostatistics and Epidemiology at Rutgers University)
Title: Online statistical inference with streaming data: renewability, dependence, and dynamics
Abstract: New data collection and storage technologies have given rise to a new field of streaming data analytics, including real-time statistical methodology for online data analyses. Streaming data refers to high-throughput recordings with large volumes of observations gathered sequentially and perpetually over time. Such a data collection scheme is pervasive not only in biomedical sciences such as mobile health, but also in other fields such as IT, finance, services, and operations. Despite a large amount of work in the field of online learning, most of it is established under strong independent and identically distributed (i.i.d.) data assumptions, and very little targets statistical inference. This talk will center around three key components in streaming data analyses: (i) renewable updating, (ii) cross-batch dependency, and (iii) time-varying effects. I will first introduce how to conduct a renewable updating procedure, in the case of independent data batches, with a particular aim of achieving similar statistical properties to the offline oracle methods while enjoying great computational efficiency. Then I will discuss how we handle the dependency structure that spans across a sequence of data batches to maintain statistical efficiency in the process of renewable updating. Lastly, a dynamic weighting scheme will be integrated into the online inference framework to account for time-varying effects. I will provide both conceptual understanding and theoretical guarantees of the proposed method and illustrate its performance via numerical examples.
DMS Statistics and Data Science Seminar
Sep 18, 2024 02:00 PM
ZOOM
Speaker: Dr. James Zou (Associate Professor of Biomedical Data Science, Computer Science, and Electrical Engineering at Stanford University)
Title: Generative AI agents for science and medicine
Abstract: This talk will explore how we can develop and use generative AI to help researchers. I will first discuss how generative AI can act as research co-advisors. We will then discuss how genAI can expand researchers' creativity by designing and experimentally validating new drugs. Finally, I will present how visual-language AI helps clinicians aggregate and interpret noisy data. I will conclude by sharing some thoughts on the future of AI agents for science.
Mini Bio: James Zou is an associate professor of Biomedical Data Science, CS and EE at Stanford University. He is also the faculty director of Stanford AI4Health. He works on advancing the foundations of ML and in-depth scientific and clinical applications. Many of his innovations are widely used in tech and biotech industries. He has received a Sloan Fellowship, an NSF CAREER Award, two Chan-Zuckerberg Investigator Awards, a Top Ten Clinical Achievement Award, several best paper awards, and faculty awards from Google, Amazon, and Adobe. His research has also been profiled in popular press including the NY Times, WSJ, and WIRED.
DMS Statistics and Data Science Seminar
Sep 11, 2024 02:00 PM
354 Parker Hall
Speaker: Dr. Mariangela Guidolin (University of Padua)
Title: Innovation Diffusion Models: Theory and Practice
Abstract: The seminar is a general overview of a class of innovation diffusion models that can be used to describe and forecast the evolution in time of sales of new products or technologies. Starting from the basic Bass model (BM), the seminar will present some of its generalizations, which account for the presence of exogenous shocks affecting the timing of the diffusion process, and for the presence of a dynamic market potential, modeled as a function of a communication process that develops over time. Moreover, some generalizations of the univariate BM are proposed to account for the presence of competition. The statistical techniques involved in model estimation combine time-series analysis with nonlinear regression techniques. The key objectives of the seminar are: to describe the main mathematical features of the models, discussing the meaning of the parameters from the economic point of view with real-data applications; to present and discuss the statistical aspects involved in model estimation and selection; to show and discuss the predictive and explanatory ability of the proposed models, highlighting the properties and limitations of each of the models described.
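For readers unfamiliar with the basic Bass model mentioned in the abstract, the following is a minimal sketch of its standard closed-form cumulative adoption curve, F(t) = (1 - e^{-(p+q)t}) / (1 + (q/p) e^{-(p+q)t}), where p is the innovation coefficient, q the imitation coefficient, and m the market potential. The function names and the parameter values used below are illustrative assumptions, not taken from the seminar, and none of the generalizations discussed in the talk are covered.

```python
import math


def bass_cumulative_fraction(t, p, q):
    """Cumulative fraction of adopters F(t) under the basic Bass model.

    p: coefficient of innovation (external influence), p > 0
    q: coefficient of imitation (internal influence), q >= 0
    """
    decay = math.exp(-(p + q) * t)
    return (1.0 - decay) / (1.0 + (q / p) * decay)


def bass_cumulative_sales(t, m, p, q):
    """Cumulative sales m * F(t), with m the (constant) market potential."""
    return m * bass_cumulative_fraction(t, p, q)


def bass_peak_time(p, q):
    """Time at which the adoption rate peaks; meaningful only when q > p."""
    return math.log(q / p) / (p + q)


if __name__ == "__main__":
    # Hypothetical parameter values chosen only for illustration.
    m, p, q = 1000.0, 0.03, 0.38
    for year in range(0, 16, 3):
        print(year, round(bass_cumulative_sales(year, m, p, q), 1))
    print("peak adoption time:", round(bass_peak_time(p, q), 2))
```

In practice p, q, and m would be estimated from observed sales, for example by nonlinear least squares, which is the kind of estimation step the abstract refers to.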
DMS Statistics and Data Science Seminar
Sep 04, 2024 02:00 PM
ZOOM
Speaker: Dr. JungWun Lee (Assistant Professor, Department of Biostatistics, Boston University School of Public Health)
Title: A latent trajectory analysis for multivariate mixed outcomes: a study on the effect of bariatric surgery via electronic health records.
Abstract: Trajectory analysis can be a statistical solution for explaining heterogeneities by partitioning patients into less heterogeneous subgroups based on similarities in outcome variables. This work proposes a novel trajectory analysis for electronic health records, a longitudinal data set containing multiple biomarkers, demographic factors of patients, and many missing values. The proposed model discovers subgroups of patients so that patients with the same trajectory group memberships are similar in their observed outcomes, while patients with different trajectories are heterogeneous. The proposed model can accommodate multivariate mixed outcomes consisting of categorical and continuous variables simultaneously. We suggest an estimation strategy using the expectation-maximization algorithm, which provides the maximum-likelihood estimates and is highly stable in the presence of many missing values. We also present an application of our methodology to the DURABLE data set, an NIH-funded study examining long-term outcomes of patients who underwent bariatric surgery between 2007 and 2011.
DMS Statistics and Data Science Seminar
Apr 24, 2024 02:00 PM
ZOOM
Speaker: Dr. Shuoyang Wang (Assistant Professor, University of Louisville)
Title: Inference on High-dimensional Mediation Analysis with Convoluted Confounding via Deep Neural Networks
Abstract: Traditional linear mediation analysis has inherent limitations when it comes to handling high-dimensional mediators. Particularly, accurately estimating and rigorously inferring mediation effects is challenging, primarily due to the intertwined nature of the mediator selection issue. Despite recent developments, the existing methods are inadequate for addressing the complex relationships introduced by confounders. To tackle these challenges, we propose a novel approach called DP2LM (Deep neural network based Penalized Partially Linear Mediation). DP2LM incorporates deep neural network techniques to account for nonlinear effects in confounders and utilizes the penalized partially linear model to accommodate high dimensionality. In addition, to address the influence of outliers on mediation effects, we present an enhanced version of DP2LM called QDP2LM (Quantile Deep Neural Network-based Penalized Partially Linear Mediation). QDP2LM builds upon DP2LM and provides a comprehensive assessment of mediation effects across various quantiles. Unlike most existing works that concentrate on mediator selection, our methods prioritize estimation and inference on mediation effects. Specifically, we develop test procedures for testing the direct and indirect mediation effects. Theoretical analysis shows that the proposed procedures control type I error rates for hypothesis testing on mediation effects. Numerical studies show that the proposed methods outperform existing approaches under a variety of settings, demonstrating their versatility and reliability as modeling tools for complex data. Our application of the proposed methods to study DNA methylation's mediation effects of childhood trauma on cortisol stress reactivity reveals previously undiscovered relationships through a comprehensive analysis.