Statistics and Data Science Seminars



Upcoming Statistics and Data Science Seminars
DMS Statistics and Data Science Seminar
Apr 09, 2025 02:00 PM
250 Parker Hall



Speaker: Marzia A. Cremona (Dept. Operations and Decision Systems; Université Laval; Québec, Canada)

Title: Local clustering and motif discovery of functional data

 

Abstract: Recent advances in data acquisition technologies have enabled the generation of high-dimensional, complex data in several research areas, in the sciences and engineering among other disciplines. Increasingly sophisticated statistical and computational methods are needed to analyze these data. Functional data analysis (FDA) can be broadly employed to analyze functional data, i.e., data that vary over a continuum and can be naturally viewed as smooth curves or surfaces, exploiting information in their shapes.

In this talk, I will present probabilistic K-means with local alignment (probKMA, [1]), an unsupervised learning method to locally cluster a set of misaligned curves and to address the problem of discovering functional motifs, i.e., typical “shapes” or “patterns” that may recur several times along and across a set of curves, capturing important local characteristics of these curves. After demonstrating the performance of the method on simulated data and showing how it generalizes other clustering methods for functional data, I will present three applications to the analysis of functional data from different fields. First, I will apply probKMA to discover functional motifs in “Omics” signals related to mutagenesis and genome dynamics. Second, I will employ probKMA as a probabilistic clustering method to group COVID-19 death curves of the different Italian regions during the first wave of the pandemic. Finally, I will present a generalization of probKMA and its application to the discovery and characterization of functional motifs in stock market prices [2].

 

[1] Cremona, Chiaromonte (2023) Probabilistic K-means with local alignment for clustering and motif discovery in functional data. Journal of Computational and Graphical Statistics 32(3): 1119-1130.

[2] Cremona, Doroshenko, Severino (2023) Functional motif discovery in stock market prices. SSRN 4642040.



Past Statistics and Data Science Seminars
DMS Statistics and Data Science Seminar
Apr 02, 2025 02:00 PM
ZOOM



Speaker: Jessie Tong (Assistant Professor, Department of Biostatistics, Johns Hopkins University)

Title: Using Electronic Health Records Data for Clinical Evidence Generation

 

Abstract: In the era of expanding real-world data (RWD) availability from distributed research networks (DRNs), it becomes essential to leverage the data to generate evidence for clinical inquiries relevant to stakeholders in the healthcare system. To answer inquiries regarding hospital and treatment options, medication queries, and others, we still face practical challenges in analyzing RWD, such as reporting bias, confounding factors, and rare events. Integrating data from multiple clinical sites within DRNs is particularly challenging given data privacy constraints, patient heterogeneity (also known as case mix), and communication costs. In my presentation, centered on the theme of real-world evidence-based health system performance assessment, I will introduce our distributed learning frameworks designed to produce actionable analytical insights for hospital profiling, with the ultimate goal of enhancing hospitals’ quality of care and patients’ clinical outcomes. The effectiveness and reliability of our proposed framework have been validated through real-world application in collaboration with 12 clinical sites across three countries within the Observational Health Data Sciences and Informatics (OHDSI) network.


DMS Statistics and Data Science Seminar
Mar 26, 2025 02:00 PM
354 Parker Hall



Speaker: Dr. Jacopo Di Iorio (Emory University) 

Title: Identifying functional motifs with funBIalign

 

Abstract: Functional data analysis is dealing with a novel challenge: the identification of functional motifs, or “shapes” that may be repeated multiple times within each functional observation or across multiple curves belonging to the same set. To address this issue, we introduce the funBIalign algorithm, a multi-step approach employing agglomerative hierarchical clustering with complete linkage and functional distances based on mean squared residue scores and virtual error, two widely used validation measures in the biclustering literature. These distances enable funBIalign to detect functional motifs that are shifted or scaled along the y-axis. To validate the effectiveness of our methodology, we present simulations and case studies that demonstrate its ability to identify functional motifs.
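Note: the mean squared residue score mentioned in the abstract can be illustrated with a short sketch. It uses the standard biclustering definition (each entry minus its row mean, minus its column mean, plus the overall mean, then squared and averaged). Treating each row as a curve portion sampled on a common grid is an assumption made here for illustration; the function name and the test curves are hypothetical and are not taken from funBIalign.

```python
import numpy as np

def mean_squared_residue(X):
    """Mean squared residue of a matrix whose rows are curve portions
    sampled on a common grid (columns are grid points)."""
    X = np.asarray(X, dtype=float)
    row_means = X.mean(axis=1, keepdims=True)   # one mean per curve
    col_means = X.mean(axis=0, keepdims=True)   # one mean per grid point
    residue = X - row_means - col_means + X.mean()
    return float(np.mean(residue ** 2))

# Hypothetical example curves on a common grid.
t = np.linspace(0.0, 1.0, 50)
shape = np.sin(2 * np.pi * t)
shifted = np.vstack([shape, shape + 2.0, shape - 1.0])  # same shape, shifted along the y-axis
mixed = np.vstack([shape, np.cos(2 * np.pi * t)])       # two different shapes

msr_shifted = mean_squared_residue(shifted)
msr_mixed = mean_squared_residue(mixed)
```

Under this definition, curves that differ only by a vertical shift give a score of zero up to floating-point error, while curves with different shapes give a strictly positive score. This is consistent with the abstract's remark that such distances can detect motifs shifted along the y-axis.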

 


DMS Statistics and Data Science Seminar
Mar 19, 2025 02:00 PM
352 Parker Hall



Speaker: Dr. Haoran Li (Auburn University)

Title: Tracy-Widom Law of High Dimensional Ridge-Regularized F-matrix

 

Abstract: In multivariate analysis, many core problems involve the eigen-analysis of an F matrix, constructed from two Wishart matrices. These so-called Double Wishart problems arise in contexts such as MANOVA, covariance matrix equality testing, and hypothesis testing in multivariate linear regression. A prominent classical approach, Roy's largest root test, relies on the largest eigenvalue of the F matrix for inference. However, in high-dimensional settings, this test becomes impractical due to the singularity or near-singularity of the Wishart matrix. To address this challenge, we propose a ridge-regularization framework by introducing a ridge term. Specifically, we develop a family of ridge-regularized largest root tests, leveraging the largest eigenvalue of the ridge-regularized F matrix. Under mild assumptions, we establish the asymptotic Tracy-Widom distribution of the regularized largest root after appropriate scaling. An efficient method for estimating the scaling parameters is proposed using the Marčenko-Pastur equation.
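Note: the idea of a ridge-regularized largest root can be sketched numerically. The abstract does not state exactly where the ridge term enters, so the form used here, W1 (W2 + lam I)^{-1} with the ridge added to the matrix being inverted, is an assumption. The function name, the simulated data, and the value of lam are all hypothetical. The centering and scaling needed for the Tracy-Widom limit are not included.

```python
import numpy as np

def ridge_largest_root(W1, W2, lam):
    """Largest eigenvalue of W1 (W2 + lam*I)^{-1} for symmetric
    positive semidefinite W1, W2 and lam > 0 (assumed form)."""
    p = W2.shape[0]
    # Cholesky factor of the regularized matrix: W2 + lam*I = L L^T
    L = np.linalg.cholesky(W2 + lam * np.eye(p))
    # L^{-1} W1 L^{-T} is symmetric and has the same eigenvalues
    # as W1 (W2 + lam*I)^{-1}
    A = np.linalg.solve(L, W1)
    M = np.linalg.solve(L, A.T).T
    # eigvalsh returns eigenvalues in ascending order
    return float(np.linalg.eigvalsh((M + M.T) / 2)[-1])

# Hypothetical high-dimensional example: p > n2, so W2 is singular
rng = np.random.default_rng(0)
p, n1, n2 = 30, 10, 20
X = rng.standard_normal((n1, p))
Y = rng.standard_normal((n2, p))
W1 = X.T @ X / n1
W2 = Y.T @ Y / n2
stat = ridge_largest_root(W1, W2, lam=0.5)
```

In this example W2 cannot be inverted directly because its rank is at most n2 < p, whereas the regularized matrix W2 + lam*I is positive definite, so the statistic is well defined.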


DMS Statistics and Data Science Seminar
Feb 12, 2025 02:00 PM
250 Parker Hall



Speaker: Dr. Gaetan Bakalli (Assistant Professor of Econometrics and Data Science, Emlyon Business School, Lyon)

Title: Nonstandard Errors

 

Abstract: In statistics, samples are drawn from a population in a data-generating process (DGP). Standard errors measure the uncertainty in estimates of population parameters. In science, evidence is generated to test hypotheses in an evidence-generating process (EGP). We claim that EGP variation across researchers adds uncertainty—nonstandard errors (NSEs). We study NSEs by letting 164 teams test the same hypotheses on the same data. NSEs turn out to be sizable, but smaller for more reproducible or higher rated research. Adding peer-review stages reduces NSEs. We further find that this type of uncertainty is underestimated by participants.


DMS Statistics and Data Science Seminar
Dec 04, 2024 02:00 PM
ZOOM


Speaker: Russell J. Bowater (Independent Statistical Consultant, Oaxaca City, Mexico)

Title: The 7 hardest lessons to learn in statistics 

 

Abstract: What is the current state of the theory of statistical inference? Is it essentially in a good state except for a relatively small number of issues that need to be tidied up? Or is what is usually presented as being the standard and accepted theory of statistical inference so full of conceptual holes that it is nothing short of an embarrassment for anyone who wishes to describe themselves as a statistician? This talk explores these questions by presenting lessons that arguably need to be learnt but have proved difficult to learn for reasons that to a great extent are not related to doing good independent and impartial science. By exposing ourselves to such an uncomfortable level of introspection, a greater understanding can be gained about what we have done, where we are and where we should be going.

 


DMS Statistics and Data Science Seminar
Nov 20, 2024 02:00 PM
354 Parker Hall



Speaker: Dr. Li An (Solon & Martha Dixon Endowed Professor at College of Forestry, Wildlife and Environment, Auburn University)

Title: Green the ecosystems at multiple spatial & temporal scales - Complex systems & Data Science approaches

 

Abstract: Humanity stands in an unprecedented era of climate change, environmental degradation, rapid biodiversity loss, and other crises. Green initiatives, defined as programs, funds, payments, policies, or any endeavors that aim to counter such crises and restore, sustain, or improve nature’s capacity to benefit human beings, are becoming increasingly widespread and popular across the globe. Using data from 15 sites from local to global scales (including China and the USA), the author systematically explores how specific policy, intended behaviors, and gains of a certain green initiative may interact with those of other green initiatives concurrently implemented in the same geographic area or involving the same recipients. Spatial data science methods, including remote sensing, GIS, and eigenvector spatial filtering, are employed to uncover mechanisms behind the data. The findings suggest that spillover effects were widespread and divergent: one initiative could reduce the gain of another by 22% to 100%, indicating a sign of alarming losses. In other instances, one initiative can increase the gain of another by 9% to 310%, offering substantial co-benefits. This talk also presents current efforts that infuse artificial intelligence, machine learning, and agent-based models to uncover mechanisms and rules hidden in multiple space-time data patterns.

 

Biosketch: Dr. Li An is Solon & Martha Dixon Endowed Professor at College of Forestry, Wildlife and Environment and Director of International Center for Climate and Global Change Research at Auburn University. He received his B.S. degree from Peking University (Economic Geography), China, M.S. degree from Chinese Academy of Sciences (Systems Ecology) and from Michigan State University (Probability and Statistics), and Ph.D. degree from Michigan State University (Systems modeling; fisheries and wildlife). His research focuses on complex human-environment systems, space-time analysis and modeling, spatial data science, landscape ecology, and complex adaptive systems. He is a Fellow of The American Association for the Advancement of Science (AAAS; 2020 class), Fellow of The American Association of Geographers (AAG; 2022 class), awardee of 2023 Distinguished Scholarship Honors from AAG, and recipient of multiple other awards or recognitions. He has been leading or involved in research projects funded by multiple federal agencies, and these projects are broadly distributed in Nepal, Ghana, USA, and China. He has served on the editorial board of Annals of the American Association of Geographers, Ecological Modelling, The Journal of Artificial Societies and Social Simulation (JASSS), Remote Sensing, International Journal of Geospatial and Environmental Research, and Geography and Sustainability. He is President of the International Association of Landscape Ecology-North America.


DMS Statistics and Data Science Seminar
Nov 13, 2024 02:00 PM
354 Parker Hall



Speaker: Dr. Min Yang (University of Illinois at Chicago)

Title: Scalable Methodologies for Big Data Analysis: Integrating Flexible Statistical Models and Optimal Designs

 

Abstract: The formidable challenge presented by the analysis of big data stems not just from its sheer volume, but also from the diversity, complexity, and the rapid pace at which it needs to be processed or delivered. A compelling approach is to analyze a sample of the data, while still preserving the comprehensive information contained in the full dataset. Although there is a considerable amount of research on this subject, the majority of it relies on classical statistical models, such as linear models and generalized linear models. These models serve as powerful tools when the relationships between input and output variables are uniform. However, they may not be suitable when applied to complex datasets, as they tend to yield suboptimal results in the face of inherent complexity or heterogeneity. In this presentation, we will introduce a broadly applicable and scalable methodology designed to overcome these challenges. This is achieved through an in-depth exploration and integration of cutting-edge statistical methods, drawing particularly from neural network models and, more specifically, Mixture-of-Experts (ME) models, along with optimal designs.


DMS Statistics and Data Science Seminar
Nov 06, 2024 02:00 PM
354 Parker Hall



Speaker: Dr. Yeonjoo Park (University of Texas at San Antonio)

Title: A data-adaptive dimension reduction for functional data via penalized low-rank approximation

 

Abstract: We introduce a data-adaptive nonparametric dimension reduction tool to obtain a low-dimensional approximation of functional data contaminated by erratic measurement errors following symmetric or asymmetric distributions. We propose to apply robust submatrix completion techniques to matrices consisting of coefficients of basis functions calculated by projecting the observed trajectories onto a given orthogonal basis set. In this process, we use a composite asymmetric Huber loss function to accommodate domain-specific erratic behaviors in a data-adaptive manner. We further incorporate the \(L_1\) penalty to regularize the smoothness of latent factor curves. The proposed method can also be applied to partially observed functional data, where each trajectory contains individual-specific missing segments. Moreover, since our method does not require estimating the covariance operator, the extension to functional data of any dimension observed over a continuum is straightforward. Through simulation studies, we demonstrate the empirical performance of the proposed method in estimating the lower-dimensional space and reconstructing trajectories. We then apply the proposed method to two real datasets: one-dimensional Advanced Metering Infrastructure (AMI) data and two-dimensional max precipitation spatial data collected in North America and South America.

 


DMS Statistics and Data Science Seminar
Oct 30, 2024 02:00 PM
354 Parker Hall



Speaker: Dr. Stéphane Guerrier (University of Geneva)

Title: Accurate Inference for Penalized Logistic Regression

 

Abstract: Inference for high-dimensional logistic regression models using penalized methods has been a challenging research problem. As an illustration, a major difficulty is the significant bias of the Lasso estimator, which limits its direct application in inference. Although various bias-corrected Lasso estimators have been proposed, they often still exhibit substantial biases in finite samples, undermining their inference performance. These finite sample biases become particularly problematic in one-sided inference problems, such as one-sided hypothesis testing. This paper proposes a novel two-step procedure for accurate inference in high-dimensional logistic regression models. In the first step, we propose a Lasso-based variable selection method to select a suitable submodel of moderate size for subsequent inference. In the second step, we introduce a bias-corrected estimator to fit the selected submodel. We demonstrate that the resulting estimator from this two-step procedure has a small bias order and enables accurate inference. Numerical studies and an analysis of alcohol consumption data are included, where our proposed method is compared to alternative approaches. Our results indicate that the proposed method exhibits significantly smaller biases than alternative methods in finite samples, thereby leading to improved inference performance.

 

This is joint work with Yuming Zhang (Harvard) and Runze Li (Penn State).

 


DMS Statistics and Data Science Seminar
Oct 23, 2024 02:00 PM
ZOOM



Speaker: Dr. Rong Ma (Harvard T.H. Chan School of Public Health)

Title: Is your data alignable? A geometric view of single-cell data integration

 

Abstract: Single-cell data integration can provide a comprehensive molecular view of cells, and many algorithms have been developed to remove unwanted technical or biological variations and integrate heterogeneous single-cell datasets. Despite their wide usage, existing methods suffer from several fundamental limitations. In particular, we lack a rigorous statistical test for whether two high-dimensional datasets are in principle alignable (and therefore should even be aligned). Moreover, popular methods can substantially distort the data during alignment, making the aligned data and downstream analysis difficult to interpret. To overcome these limitations, we present a spectral manifold alignment and inference (SMAI) framework, which enables principled and interpretable alignability testing and structure-preserving integration of single-cell data with the same type of features. SMAI provides a statistical test to robustly assess the alignability between datasets to avoid misleading inference and is justified by high-dimensional statistical theory. On a diverse range of real and simulated benchmark datasets, it outperforms commonly used alignment methods. Moreover, we show that SMAI improves various downstream analyses such as identification of differentially expressed genes and imputation of single-cell spatial transcriptomics, providing further biological insights. SMAI’s interpretability also enables quantification and a deeper understanding of the sources of technical confounders in single-cell data.

 

This is joint work with Eric Sun, David Donoho, and James Zou from Stanford University.

