"Doubly Sparse Estimator for High-Dimensional Covariance Matrices" with Seregina, Econometrics & Statistics, 2024

The classical sample covariance estimator lacks desirable properties such as consistency and suffers from eigenvalue spreading in high-dimensional settings. Improved estimators have been proposed that shrink sample eigenvalues but retain the eigenvectors of the sample covariance estimator. In high dimensions, however, sample eigenvectors are generally strongly inconsistent, rendering eigenvalue shrinkage estimators suboptimal. A Doubly Sparse Covariance Estimator (DSCE) is developed that goes beyond mere eigenvalue shrinkage: a covariance matrix is decomposed into a signal part, where sparse eigenvectors are estimated via truncation, and an idiosyncratic part, estimated via thresholding. It is shown that accurate estimation is possible if the leading eigenvectors are sufficiently sparse, affecting proportionately less than $\sqrt{p}$ of the variables. DSCE fills the gap for empirical applications that fall between fully sparse and conditionally sparse settings: it takes advantage of the conditional sparsity implied by factor models while allowing only a subset of variables to load on factors, which relaxes the pervasiveness assumption of traditional factor models. An empirical application to the constituents of the S&P 1500 illustrates that DSCE-based portfolios outperform competing methods in terms of Sharpe ratio, maximum drawdown, and cumulative return for monthly and daily data.

Keywords: Sparse recovery; Rotation equivariance; Random matrix theory; Large-dimensional asymptotics; Principal components
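
The signal-plus-idiosyncratic decomposition described in the abstract can be illustrated with a minimal sketch. This is not the estimator from the paper: the function names, the hard truncation rule (keep the `keep` largest loadings), the soft-thresholding level `tau`, and the number of factors `n_factors` are all assumptions chosen for illustration, and no claim is made about the tuning or theory used by the authors.

```python
import numpy as np


def truncate_vector(v, keep):
    """Zero all but the `keep` largest-magnitude entries of v, then renormalize."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-keep:]
    out[idx] = v[idx]
    return out / np.linalg.norm(out)


def soft_threshold_offdiag(m, tau):
    """Soft-threshold the off-diagonal entries of m; leave the diagonal untouched."""
    out = np.sign(m) * np.maximum(np.abs(m) - tau, 0.0)
    np.fill_diagonal(out, np.diag(m))
    return out


def doubly_sparse_sketch(x, n_factors, keep, tau):
    """Illustrative two-part covariance estimate (hypothetical helper).

    x is an (observations x variables) data matrix.
    Returns (estimate, signal_part, idiosyncratic_part).
    """
    s = np.cov(x, rowvar=False)
    _, eigvecs = np.linalg.eigh(s)            # eigenvalues in ascending order
    lead = eigvecs[:, ::-1][:, :n_factors]    # leading sample eigenvectors

    signal = np.zeros_like(s)
    for k in range(n_factors):
        v = truncate_vector(lead[:, k], keep)  # sparse eigenvector via truncation
        signal += (v @ s @ v) * np.outer(v, v)  # variance of S along v

    idio = soft_threshold_offdiag(s - signal, tau)  # sparse idiosyncratic part
    return signal + idio, signal, idio
```

With one factor and `keep=5`, the signal part has at most 25 nonzero entries, which shows how only a subset of variables loads on the factor.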

"The Kernel Trick for Nonlinear Factor Modeling"
International Journal of Forecasting, 2021

Factor modeling is a powerful statistical technique that captures the common dynamics in a large panel of data with a few latent variables, or factors, thus alleviating the curse of dimensionality. Despite its popularity and widespread use in applications ranging from genomics to finance, this methodology has predominantly remained linear. This study estimates factors nonlinearly through the kernel method, which allows flexible nonlinearities while still avoiding the curse of dimensionality. We focus on factor-augmented forecasting of a single time series in a high-dimensional setting, known as diffusion index forecasting in the macroeconomics literature. Our main contribution is twofold. First, we show that the proposed estimator is consistent and nests the linear PCA estimator, as well as some nonlinear estimators introduced in the literature, as special cases. Second, our empirical application to a classical macroeconomic dataset demonstrates that this approach can offer substantial advantages over mainstream methods.

Keywords: Forecasting; Latent factor model; Nonlinear time series; Kernel PCA; Neural networks; Econometric models
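
A rough sketch of kernel-based diffusion index forecasting, as described in the abstract, is given below. It is a generic kernel PCA followed by an OLS forecasting regression, not the paper's exact procedure: the RBF kernel, the bandwidth `gamma`, the number of factors, and all function names are assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(x, gamma):
    """Gaussian (RBF) kernel matrix between the rows (time periods) of x."""
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    return np.exp(-gamma * np.maximum(d2, 0.0))


def kernel_factors(x, n_factors, gamma):
    """In-sample kernel PCA scores, used here as nonlinear factor estimates."""
    t = x.shape[0]
    k = rbf_kernel(x, gamma)
    h = np.eye(t) - np.ones((t, t)) / t       # centering matrix
    kc = h @ k @ h                            # kernel centered in feature space
    vals, vecs = np.linalg.eigh(kc)           # ascending eigenvalues
    vals = vals[::-1][:n_factors]
    vecs = vecs[:, ::-1][:, :n_factors]
    return vecs * np.sqrt(np.maximum(vals, 0.0))


def diffusion_index_forecast(x, y, n_factors, gamma, horizon=1):
    """Regress y[t + horizon] on factors at time t, then forecast from the last period."""
    f = kernel_factors(x, n_factors, gamma)
    z = np.column_stack([np.ones(len(f)), f])  # intercept plus factors
    beta, *_ = np.linalg.lstsq(z[:-horizon], y[horizon:], rcond=None)
    return z[-1] @ beta
```

Replacing `rbf_kernel` with the linear kernel `x @ x.T` would turn the factor step into ordinary PCA scores, which is one way to see the nesting of linear PCA mentioned above.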

"Fast and Efficient Data Science Techniques for COVID-19 Group Testing" with Seregina, Journal of Data Science, 2021

Researchers and public officials tend to agree that until a vaccine is developed, stopping SARS-CoV-2 transmission is the name of the game. Testing is the key to preventing the spread, especially by asymptomatic individuals. With testing capacity restricted, group testing is an appealing alternative for comprehensive screening and has recently received FDA emergency authorization. This technique tests pools of individual samples, thereby often requiring fewer testing resources while potentially providing a severalfold speedup. We approach group testing from a data science perspective and offer two contributions. First, we provide an extensive empirical comparison of modern group testing techniques based on simulated data. Second, we propose a simple one-round method based on $\ell_1$-norm sparse recovery, which outperforms current state-of-the-art approaches at certain disease prevalence rates.

Keywords: Pooled Testing; Compressed Sensing; Sparse Recovery; Lasso; Sensing Matrix; SARS-CoV-2
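
The idea of one-round group testing via $\ell_1$-norm sparse recovery can be sketched as a nonnegative Lasso problem. The code below is an illustrative toy, not the method evaluated in the paper. It assumes that each pooled test returns a quantitative reading equal to the sum of the individual values in the pool; the random pooling design, the penalty `lam`, and the solver (proximal gradient descent, ISTA) are also assumptions.

```python
import numpy as np


def objective(a, y, x, lam):
    """Nonnegative Lasso objective: 0.5 * ||a x - y||^2 + lam * sum(x)."""
    r = a @ x - y
    return 0.5 * r @ r + lam * np.sum(x)


def nonneg_lasso_ista(a, y, lam, n_iter=2000):
    """Minimize the objective over x >= 0 with proximal gradient steps (ISTA)."""
    step = 1.0 / np.linalg.norm(a, 2) ** 2    # 1 / Lipschitz constant of the gradient
    x = np.zeros(a.shape[1])
    for _ in range(n_iter):
        grad = a.T @ (a @ x - y)
        # Gradient step, then the prox of lam * sum(x) restricted to x >= 0.
        x = np.maximum(x - step * (grad + lam), 0.0)
    return x


def random_pooling_matrix(n_pools, n_samples, density, rng):
    """Hypothetical design: each sample joins each pool with probability `density`."""
    return (rng.random((n_pools, n_samples)) < density).astype(float)
```

A typical use would build `a` with `random_pooling_matrix`, observe the pooled readings `y = a @ x_true`, and flag as positive the individuals whose recovered entry in `nonneg_lasso_ista(a, y, lam)` exceeds a chosen cutoff. Whether the infected set is recovered exactly depends on the prevalence, the number of pools, and the design.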