One of the key issues in finance, especially in empirical asset pricing, is the trade-off between the return and the risk of a portfolio. One important way to quantify this trade-off is the Sharpe Ratio.
We contribute to this literature by studying the case when the number of assets, namely p, grows with the time span of the portfolio, n. To obtain the Sharpe Ratio, and also its maximum, we make use of the asset return’s precision matrix. In order to get an estimate of the precision matrix for asset returns in a large portfolio, we propose that an approximate factor model governs the dynamics of excess returns. Hence, asset returns (excess returns over a risk-free asset) can be explained by an increasing but known number of factors with unknown idiosyncratic errors entering the linear relation in an additive way. One major difference with the previous literature is that, in our case, the precision matrix has to be sparse. Therefore, this is a hybrid method that combines factor models with high-dimensional econometrics.
The first step in obtaining the Sharpe Ratio and its maximum involves the estimation of the precision matrix of the idiosyncratic terms (errors). Estimating such a precision matrix is not an easy task, and the simple nodewise regression idea of Meinshausen and Bühlmann (2006) is not feasible. Therefore, we provide a simple, feasible residual-based nodewise regression method to estimate the precision matrix of errors in a factor model setup even if p > n. This feasible residual-based nodewise regression is a new idea, and we show that it consistently estimates the precision matrix of the errors, which is our first contribution. Next, we obtain consistent estimators of the precision matrix of asset returns, even if p > n, which is our second technical contribution. Although we focus on factor models in asset pricing, our methodology can be applied to any situation where the object of interest is the precision matrix of the errors of a linear regression model.
Next, by using the precision matrix estimator for returns, we link our technical analysis to the financial econometrics literature. We make three contributions to Sharpe Ratio analysis. First, we consider the Sharpe Ratios of the global minimum-variance portfolio and the Markowitz mean-variance portfolio. We develop consistent estimators even if p > n and both dimensions diverge. Second, we consider the rate of convergence and consistency of the maximum Sharpe Ratio when the portfolio weights are normalized to one. Maller and Turkington (2002) and Maller et al. (2016) analyze the limit with a fixed number of assets and extend that approach to a large number of assets, but a number less than the time span of the portfolio. Their papers make a key discovery: in the case of weight constraints (summing to one), the formula for the maximum Sharpe Ratio depends on a technical term, unlike the unconstrained maximum Sharpe Ratio case. Practitioners could obtain the minimum Sharpe Ratio instead of the maximum if they use the unconstrained formula. Our paper extends theirs in two directions: first, by covering the case p > n, with both quantities growing to infinity, and second, by handling the uncertainty created by this technical term, which we can estimate and use to obtain a new constrained and consistent maximum Sharpe Ratio. The assumption of constant loadings in the factor model is clearly a constraint for portfolio analysis over longer horizons. However, the setup where p > n provides the statistical tools to analyze portfolios over short horizons and in small samples, as high-dimensional asymptotics can be seen as a good approximation for situations where n is small but p is large compared to n. Third, only in the case of p << n, we obtain the consistency of our nodewise-based maximum out-of-sample Sharpe Ratio estimate, with both p and n growing to infinity and $p/n \to 0$.
We also provide an analysis of the Sharpe Ratio in which only the portfolio weights are estimated in the formula. In that way, we can see the effect of the estimated portfolio weights on obtaining the optimal Sharpe Ratio. Our analysis shows that this is possible only when p < n.
1.1 The Sparsity of the Precision Matrix
There are several reasons motivating the assumption of sparsity of the precision matrix of the errors from the factor model. In technical terms, sparsity is a convenient and widely used asymptotic tool for high-dimensional problems with p > n. The sparsity assumption on the precision matrix of errors gives rise to a direct way of estimating the precision matrix of the returns via the Sherman-Morrison-Woodbury formula. We solve two technical issues with this assumption. First, consistent estimation of the precision matrix of returns is possible, yielding consistent estimation of the Sharpe Ratio and its maximum, even in the constrained case. Second, as far as we know, for p > n there are no other consistent estimation results in the literature for the global minimum-variance and Markowitz portfolios, or for the constrained maximum Sharpe Ratio.
The sparsity assumption on the precision matrix of the errors from a factor model can also be justified in situations of interest in the empirical finance literature. First, even though we do not assume normality of the errors here, under normality the conditional independence of two errors, given all the other errors, is represented by a zero entry in the precision matrix of errors. This is explained on p.1436-1439 of Meinshausen and Bühlmann (2006). So, in the case of normally distributed data, sparsity can be thought of as a conditional independence restriction. When the errors follow an elliptical distribution, conditional uncorrelatedness of two errors amounts to a zero cell in the precision matrix, as discussed in Section 2.4 of Fan et al. (2018). The authors argue that a sparse precision matrix may be more useful when we estimate a network of stocks, by taking out common factors from returns and analyzing the conditional independence among idiosyncratic components (errors). Finally, a number of recent papers show that, after removing common factors, the covariance matrix of the errors is “almost” block diagonal, yielding a sparse precision matrix; see, for example, Fan et al. (2016) and Brito et al. (2018). When the covariance matrix is block-diagonal, the precision matrix can be computed by inverting the estimated covariance matrix, which in turn can be consistently estimated by several different methods. However, even in this case, there are potential benefits of estimating the precision matrix directly, as shown in our simulations and empirical exercise; see also Senneret et al. (2016).
1.2 A Brief Review of the Literature and Main Takeaways
In terms of the literature on nodewise regression and related methods, the most relevant papers are as follows. Meinshausen and Bühlmann (2006) establish the nodewise regression approach and provide an optimality result when data are normally distributed. Chang et al. (2018) extend the nodewise regression method to time-series data and build confidence intervals for the elements of the precision matrix. However, Chang et al. (2018) focus only on the elements of the precision matrix, and there is no connection to factor models. Furthermore, their results are based on the precision matrix of observed data and not on the residuals of a first-stage estimator. Finally, the authors do not consider the maximum Sharpe Ratio, and it is not clear whether their results are directly applicable to financial applications. Caner and Kock (2018) establish uniform confidence intervals in the case of high-dimensional parameters in heteroskedastic setups using nodewise regression, but, as in the previous paper, there is no connection to factor models in empirical finance. Callot et al. (2021) provide the variance, the risk, and the weight estimation of a portfolio via nodewise regression. They take the nodewise regression directly from Meinshausen and Bühlmann (2006) and apply it to returns. However, they assume that the precision matrix of returns is sparse, which is more restrictive and less realistic than the method we propose. We combine factor models with the sparsity of the precision matrix of errors. As a consequence, our method is much more closely connected to typical empirical asset pricing models. Furthermore, we do not impose any sparsity on the precision matrix of returns. Callot et al. (2021) also provide no proofs about the estimation of the Sharpe Ratio.
In terms of recent contributions to the literature on factor models and sparse regression, we highlight Fan et al. (2021). The authors consider the combination of factor models and sparse regression in a very general setting. More specifically, they analyze a panel data model with a factor structure and idiosyncratic terms that are sparsely related. They also provide an inference procedure designed to test hypotheses on the entries of the covariance matrix of the residuals of pre-estimated models, including principal component regressions. Our paper differs from theirs in several directions. First, Fan et al. (2021) consider only the covariance matrix and not the precision matrix. Second, their approach is not based on nodewise regressions. Finally, Sharpe Ratio estimation and portfolio allocation are not considered. A seminal paper is Gagliardini et al. (2016), who analyze time-varying risk premia in large portfolios with factor models. They develop a structural model, tie it to factor models, and then estimate time-varying risk premia. One of their main assumptions is that the maximum eigenvalue of the covariance matrix of errors in the factor structure can diverge. They also assume sparsity of the covariance matrix of errors and observed factors in the factor model. We also use a diverging eigenvalue assumption, Assumption 7(i) in our paper, as well as an increasing number of factors, but with the assumption of sparsity on the precision matrix of errors. Gagliardini et al. (2019) develop a diagnostic test for omitted factors in factor models. They rely on residuals rather than errors for their tests. As is clear from their analysis, working with residuals poses major difficulties. We face a similar difficulty in our paper. Then, Gagliardini et al. (2020) analyze large conditional factor models. They analyze conditional risk premia even when the number of assets dominates the time span of the portfolio.
In a recent paper, Fan et al. (2018) use sparse precision matrix estimation with hidden factors. Their approach uses a Dantzig-based constrained estimator for the precision matrix. The main difference is that their estimator depends on the magnitude of the coefficients in the precision matrix, so that with larger coefficients the rate of estimation slows down considerably, as seen in their equation (2.12)-result 2. Also, they assume a bounded matrix norm, which is restrictive, whereas we allow a diverging matrix norm. Finally, they do not apply their results to Sharpe Ratio analysis in high dimensions as we do.
Recently, important contributions have been obtained in this area by using shrinkage and factor models.
Ledoit and Wolf (2017) propose a nonlinear shrinkage estimator in which small eigenvalues of the sample covariance matrix are increased and large eigenvalues are decreased by a shrinkage formula. Their main contribution is the optimal shrinkage function, which they find by minimizing a loss function. The maximum out-of-sample Sharpe Ratio is an inverse function of this loss. Their results cover the independent and identically distributed case and when $p/n \to c \in (0, 1) \cup (1, +\infty)$. For the analysis of mean-variance efficiency, Ao et al. (2019) make a novel contribution in which they take a constrained optimization, maximizing returns subject to the risk of the portfolio, and show that it is equivalent to an unconstrained objective function, where they minimize a scaled return of the portfolio error by choosing optimal weights. To obtain these weights, they use lasso regression and assume a sparse number of nonzero weights of the portfolio, and they analyze $p/n \to c \in (0, 1)$. They show that their method maximizes the expected return of the portfolio and satisfies the risk constraint. Their paper is an important result on its own. One key paper in the literature is Fan et al. (2011), which assumes an approximate factor model but, on the other hand, also assumes conditional sparsity-diagonality of the covariance matrix of errors. Fan et al. (2011) show for the first time how to build a precision matrix of returns in a large portfolio via factor models. Therefore, it is a key paper in the high-dimensional econometrics literature.
Regarding other papers, Ledoit and Wolf (2003, 2004) propose a linear shrinkage estimator of the covariance matrix and apply it to portfolio optimization. Ledoit and Wolf (2017) show that nonlinear shrinkage performs better in out-of-sample forecasts. Lai et al. (2011) and Garlappi et al. (2007) approach the same problem from a Bayesian perspective by aiming to maximize a utility function tied to portfolio optimization. Another avenue of the literature improves the performance of the portfolios by introducing constraints on the weights. This strand of the literature focuses on the global minimum-variance portfolio. Examples of works investigating this problem include Jagannathan and Ma (2003) and Fan et al. (2012). We also see a combination of different portfolios proposed by Kan and Zhou (2007) and Tu and Zhou (2011). Very recently, Ding et al. (2021) extended factor models to assumptions that are more consistent with principal components analysis. They provide consistent estimation of the risk of the portfolio under the sparsity of the covariance of errors with a fixed number of factors. Barras et al. (2021), Brodie et al. (2009), Chamberlain and Rothschild (1983), DeMiguel et al. (2009), and Fan et al. (2015) analyze the mutual fund industry, sparsely constructed Markowitz portfolios, arbitrage and factor models in large portfolios, sparsely constructed mean-variance portfolios, and risks of large portfolios, respectively.
1.3 Organization of the Paper
This paper is organized as follows. Section 2 considers our assumptions and feasible precision matrix estimation for errors. Section 3 provides the feasible precision matrix estimate for asset returns. Section 4 analyzes consistency of the Sharpe Ratio in a portfolio with a large number of assets in three different scenarios. Section 5 provides simulations that compare several methods. Section 6 presents an out-of-sample forecasting exercise. The main proofs are in Supplement A, common proofs used for Theorems 3-8 are in Supplement B, Supplement C contains proofs related to Section 4.4, and Supplement D has a proof of mean-variance efficiency of a large portfolio in the out-of-sample context, as well as some extra simulation results.
1.4 Notation
Let $\|v\|_1$, $\|v\|_2$, $\|v\|_\infty$ be the $l_1$, $l_2$, $l_\infty$ norms of a generic vector $v$. Let $\|v\|_n^2 := v'v/n$, which is the prediction norm for an $n \times 1$ vector v. Let Eigmin(A) represent the minimum eigenvalue of a matrix A, and Eigmax(A) represent the maximum eigenvalue of the matrix A. For a generic matrix A, let $\|A\|_{l_1}$, $\|A\|_{l_\infty}$, $\|A\|_{l_2}$ be the $l_1$ induced matrix norm (i.e. maximum absolute column sum norm), the $l_\infty$ induced matrix norm (i.e. maximum absolute row sum norm), and the spectral matrix norm, respectively. $\|A\|_\infty$ is the maximum absolute value of the elements of a matrix, and also a norm (but not a matrix norm). Matrix norms have the additional desirable property of submultiplicativity. For further information on matrix norms, see p.341 of Horn and Johnson (2013).
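As a small illustration of the norms listed above, the following sketch computes them for a toy matrix in plain Python. It assumes the usual convention that the squared prediction norm is the average of the squared entries of the vector; the function names are illustrative and do not come from the paper.

```python
# Illustrative helpers for the norms in the Notation subsection (pure Python).

def prediction_norm_sq(v):
    """Squared prediction norm: average of the squared entries of an n-vector."""
    return sum(x * x for x in v) / len(v)

def l1_induced_norm(A):
    """Maximum absolute column sum of a matrix stored as a list of rows."""
    return max(sum(abs(row[c]) for row in A) for c in range(len(A[0])))

def linf_induced_norm(A):
    """Maximum absolute row sum."""
    return max(sum(abs(x) for x in row) for row in A)

def max_abs_entry(A):
    """Largest absolute entry: a norm, but not submultiplicative in general."""
    return max(abs(x) for row in A for x in row)

def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

A = [[1.0, -2.0], [3.0, 4.0]]
ones = [[1.0, 1.0], [1.0, 1.0]]

print(l1_induced_norm(A), linf_induced_norm(A), max_abs_entry(A))
# The max-abs-entry norm of a product can exceed the product of the norms:
print(max_abs_entry(matmul(ones, ones)), max_abs_entry(ones) ** 2)
```

The last line shows why the maximum absolute entry is "not a matrix norm" in the submultiplicative sense, while the induced norms satisfy the inequality for the same inputs.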
We start with the following model for the jth asset return (excess asset return) at time t, $y_{j,t}$, for $j = 1, \dots, p$ and time periods $t = 1, \dots, n$, such that
$$ y_{j,t} = b_j' f_t + u_{j,t}, \tag{1} $$
where $b_j$ is a $K \times 1$ vector of factor loadings, $f_t$ is the $K \times 1$ vector of common factors to all assets' returns, and $u_{j,t}$ is the scalar error (idiosyncratic) term for asset return j at time t. All the factors are assumed to be observed. This model is used by Fan et al. (2011). From this point on, when asset return is mentioned, it should be understood as excess asset return.
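For the jth asset return we can rewrite (1) in vector form, for $j = 1, \dots, p$:
$$ y_j = X' b_j + u_j, \tag{2} $$
where $X = (f_1, \dots, f_n)$ is a $K \times n$ matrix, and $y_j = (y_{j,1}, \dots, y_{j,n})'$ is an $n \times 1$ vector of returns of the jth asset. We can also express the same relation in matrix form as follows:
$$ Y = BX + U, \tag{3} $$
where Y is a $p \times n$ matrix, B is a $p \times K$ matrix, and U is a $p \times n$ matrix.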
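To make the dimensions concrete, the sketch below builds a toy version of the matrix form and checks that each row agrees with the asset-by-asset vector form. The sizes and numbers are hypothetical, and the layout (assets in rows, time periods in columns, factors stacked as a K x n matrix) is an assumption consistent with the description of the row-deleted matrices later in this section.

```python
# Toy check that the matrix form and the asset-by-asset vector form agree.

def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Hypothetical sizes: p = 2 assets, K = 2 factors, n = 3 time periods.
B = [[1.0, 0.5], [0.8, -0.2]]                    # p x K loadings, row j holds b_j'
X = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]         # K x n factors, column t holds f_t
U = [[0.01, -0.02, 0.00], [0.03, 0.00, -0.01]]   # p x n idiosyncratic errors

BX = matmul(B, X)
Y = [[bx + u for bx, u in zip(row_bx, row_u)] for row_bx, row_u in zip(BX, U)]

def vector_form(j):
    """Returns of asset j computed period by period as b_j' f_t + u_{j,t}."""
    n, K = len(X[0]), len(X)
    return [sum(B[j][k] * X[k][t] for k in range(K)) + U[j][t] for t in range(n)]

for j in range(len(Y)):
    assert all(abs(a - b) < 1e-12 for a, b in zip(Y[j], vector_form(j)))
```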
Define the covariance matrix of the $p \times 1$ vector of errors $u_t := (u_{1,t}, \dots, u_{p,t})'$ as $\Sigma_n := E[u_t u_t']$.
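We take $\{f_t\}$ and $\{u_t\}$ to be strictly stationary, ergodic, and strong mixing sequences of random variables. Also, let $\mathcal{F}_{-\infty}^{0}$ and $\mathcal{F}_{k}^{\infty}$ be the $\sigma$-algebras generated by $\{(f_t, u_t)\}$, for $-\infty < t \le 0$, and $k \le t < \infty$, respectively. Denote the strong mixing coefficient as
$$ \alpha(k) := \sup_{A \in \mathcal{F}_{-\infty}^{0},\, B \in \mathcal{F}_{k}^{\infty}} |P(A)P(B) - P(A \cap B)|. $$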
In Assumption 7 below, we assume that the maximum eigenvalue of $\Sigma_n$ can grow with the sample size; this is due to $\Sigma_n$ being a $p \times p$ matrix where p may grow with n. We will assume sparsity for the precision matrix of errors $\Omega := \Sigma_n^{-1}$, but we do not subscript $\Omega$ with n to avoid cumbersome notation. Each row of $\Omega$ will be denoted by $\Omega_j'$, a $1 \times p$ vector, for $j = 1, \dots, p$, where $\Omega_{j,l}$ represents the lth element in the jth row of $\Omega$. Let $S_j$ represent the index set of all nonzero elements in the jth row of $\Omega$. Define the cardinality of the nonzero cells in the jth row of the precision matrix as $s_j := |S_j|$, which can be nondecreasing in n, but we do not subscript that with n. Denote the maximum number of nonzero elements across all rows $j = 1, \dots, p$ of the precision matrix $\Omega$ as $\bar{s} := \max_{1 \le j \le p} s_j$, which is nondecreasing in n.
This last definition plays a key role in the analysis of the rate of convergence of estimation errors. Note that, just to be clear, when $n \to \infty$, we allow $p \to \infty$, $K \to \infty$, and $\bar{s} \to \infty$. As in the literature, we do not subscript them by n. Also, we allow for p > n, when $n \to \infty$ and $p \to \infty$, in our analysis in Theorems 1-7, which can be considered ultra-high dimensional portfolio analysis. For future reference, we denote all of the asset returns except the jth one as
$$ Y_{-j} = B_{-j} X + U_{-j}, \tag{4} $$
where $Y_{-j}$, of dimension $(p-1) \times n$, is the Y matrix without the jth row, $B_{-j}$ is the $(p-1) \times K$ matrix which is B without the jth row, and $U_{-j}$ is the $(p-1) \times n$ matrix given by the U matrix without the jth row.
It has been well established in the literature that in case of known , which is essential input in nodewise regression, can be recovered with the following lasso problem, with a sequence
0, for all
The main issue with (5) is, unlike nodewise regression in Caner and Kock (2018), it is infeasible due to error terms regressed on each other. We now show how to turn this to feasible regression and still consistently estimate .
By equation (2) we can define the OLS residual as
where X is a matrix and
Define the residuals by transposing (4) such that
Note that is a
1) matrix
is a
matrix, and
is a
1) matrix. Next, use (7) and (9):
where
of the residuals affect the consistent estimation of . We define a feasible nodewise estimator
Now, to form the jth row of , set the jth element in the jth row as
We want to show that for each j = 1,
is consistent. We can write
=
with
being an 1
matrix of ones in jth cell and
in the other cells.
2.1 Assumptions and a Key Result
In this part, we provide the assumptions that will be needed for consistency of the jth row of the precision matrix estimator. Let $u_{j,t}$ be the jth element of the $p \times 1$ vector $u_t$. Similarly, $u_{-j,t}$ is the $(p-1) \times 1$ vector of errors in the tth time period, except the jth term in $u_t$. Define
:=
.
Assumption 1. (i). are sequences of (strictly) stationary and ergodic random variables. Furthermore,
are independent.
is a (
1) zero mean random vector with covariance matrix
). Eigmin(
0, with c a positive constant, and max
. (ii). For the strong mixing variables
exp(
), for a positive constant
0.
Assumption 2. There exists positive constants 0 and another set of positive constants
,
3
Assumption 4. (i). Eigmin[cov(0, with cov(
) being the covariance matrix of the factors
, t = 1
. (ii). max
, min
0. (iii). max
Note that Assumptions 1-3 are standard assumptions and are used in Fan et al. (2011) as well. Also, we note that stationary GARCH models with finite second moments and continuous error distributions, as well as causal ARMA processes with continuous error distributions and a certain class of stationary Markov chains, satisfy our Assumptions 1-2; they are discussed on p.61 of Chang et al. (2018), who also use similar assumptions.
Assumption 4(i)-(ii) is also used in Fan et al. (2011), and the nodewise error assumption 4(iii) is used in Caner and Kock (2018). Assumption 5 shows the interaction of sparsity of the precision matrix with factors. They both contribute negatively to biases that our analysis will show below.
Before the next theorem, we define formally. Let C > 0 be a generic positive constant, then
where we specify tuning parameter in Lemma A.5 in Supplement, and the asymptotic negligibility is by Assumption 5. Note that in tuning parameter , the first term involving
is due to nodewise regression
via factor models. In Callot et al. (2021), without factor models, only the second term is present. We now provide one of the main theorems of the paper. Theorem 1 provides consistent estimates for the rows of the precision matrix of errors.
1. Note that ˆare not columns of ˆ
respectively. ˆ
are column representation of row vectors
2. As long as Assumption 5 is maintained, the rate of approximation error in Theorem 1 matches the
Assuming orthogonality between factors and the idiosyncratic errors, the ($p \times p$) covariance matrix of the asset returns is defined as:
$$ \Sigma_y := B\,\mathrm{cov}(f_t)\,B' + \Sigma_n. $$
We start with the precision matrix formula for the asset returns, based on the factor model that we use. Using the Sherman-Morrison-Woodbury formula, as on p.13 of Horn and Johnson (2013), $\Gamma := \Sigma_y^{-1}$ is defined as:
$$ \Gamma := \Omega - \Omega B \left[ \{\mathrm{cov}(f_t)\}^{-1} + B' \Omega B \right]^{-1} B' \Omega, $$
and the precision matrix estimator for the returns is
$$ \hat{\Gamma} := \hat{\Omega} - \hat{\Omega} \hat{B} \left[ \{\widehat{\mathrm{cov}}(f_t)\}^{-1} + \hat{B}' \hat{\Omega}_{sym} \hat{B} \right]^{-1} \hat{B}' \hat{\Omega}, \tag{19} $$
where $\hat{\Omega}_{sym} := (\hat{\Omega} + \hat{\Omega}')/2$ is the symmetrized version of our feasible nodewise regression estimator for the precision matrix of errors. $\widehat{\mathrm{cov}}(f_t) = n^{-1} X X' - n^{-2} X 1_n 1_n' X'$ is the estimator for the covariance matrix of the factors, and it is given on p.3327 of Fan et al. (2011), with $1_n$ representing an ($n \times 1$) vector of ones. Also, $\hat{B} = (Y X')(X X')^{-1}$ is the least-squares estimator for the factor model in (3). In addition, $\hat{B}$ is a ($p \times K$) matrix, and $\widehat{\mathrm{cov}}(f_t)$ is a $K \times K$ matrix. Note that we use a symmetric version of our precision matrix estimator for errors in the term in square brackets in equation (19). There is a technical reason behind that. The proofs depend on the symmetry of the matrix in the square brackets in (19), but the other parts of the proof do not need symmetry of the precision matrix estimator. Hence, we use both the symmetrized version, $\hat{\Omega}_{sym}$, and the standard (non-symmetric) version, $\hat{\Omega}$, of the precision matrix estimator. We want to rewrite the precision matrix and its estimator so that they are convenient to analyze technically. In this respect, define
and
As a consequence,
We need to find maxwhere
and
are the 1
dimensional rows of the precision matrix of the returns and its estimator, respectively.
and
are simply transposes of these rows which
are 1. In this respect, using (20) we have that
Our aim is to simplify and get rates of convergence for the right side term in (21). To get consistency and rate of convergence results for the precision matrix for returns, rather than the errors as in Theorem 1 above, we need the following assumption on factor loadings.
(i). maxmax
Also, a strengthened assumption on sparsity compared to Assumption 5 is provided.
Specifically, the rate is the rate of estimation error for
as in Lemma A.13 in Supplement A. Note that Assumption 6 is used in Fan et al. (2011). Assumption 7(i) is used in Gagliardini et al. (2016). Assumption 7(i) allows for the maximal eigenvalue of $\Sigma_n$ to grow with n. In the special case of a diagonal $\Sigma_n$, due to Assumption 1(i), the maximum eigenvalue of $\Sigma_n$ is finite. However, the case of a diagonal covariance matrix of errors is empirically less relevant and less realistic. We expect the errors to be correlated across assets. As an example of where the maximum eigenvalue of $\Sigma_n$ may diverge, we show that this may be the case for a block diagonal structure of $\Sigma_n$ in (24). Note that Shanken (1992) criticizes the standard Arbitrage Pricing Theory since the eigenvalues of the residual covariances must be bounded even when the number of assets diverges. Our Assumption 7(i) moves away from the bounded maximum eigenvalue assumption. Our residual covariances approximate error covariances very well, and this can be seen in (A.40) and (A.41) in Supplement A.
Assumption 7(ii) is a sparsity assumption which tradeoffs between maximal eigenvalue and the sparsity of the precision matrix. This assumption is needed to analyze the precision matrix for the asset returns. To give an example, ignoring constants, we can have ¯s = ln(n), K = ln(n), p = 2n, and =
O[max(ln(ln(n)/n]. Then, Assumption 7(ii) is satisfied
main results, which is the consistent estimation of the precision matrix for asset returns. Since the precision
matrix of asset returns is in the formula of the Sharpe Ratio, as will be shown in Section 4, this theorem is crucial for subsequent analysis.
1. This theorem merges two key concepts: factor models and nodewise regression in high dimensional
2. Although we focus on factor models in empirical asset pricing, the vector can be seen as any set of
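The Sherman-Morrison-Woodbury step used in this section can be checked numerically. The sketch below is a minimal illustration under stated assumptions: the toy error covariance, loadings and factor covariance are made-up numbers, the helper names are hypothetical, and the return covariance is taken to be the loadings times the factor covariance times the transposed loadings, plus the error covariance. It applies the population formula with a single error precision matrix, whereas the paper's estimator uses the symmetrized version inside the square brackets.

```python
# Numerical check of the Woodbury form of the return precision matrix (pure Python).

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(A, B, sign=1.0):
    """Entrywise A + sign * B."""
    return [[a + sign * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def inverse(A):
    """Gauss-Jordan inverse with partial pivoting; adequate for small toy matrices."""
    n = len(A)
    M = [list(map(float, row)) + [float(i == j) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        d = M[c][c]
        M[c] = [x / d for x in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [row[n:] for row in M]

def return_precision(omega, B, cov_f):
    """omega - omega B [cov_f^{-1} + B' omega B]^{-1} B' omega."""
    Bt = transpose(B)
    inner = inverse(add(inverse(cov_f), matmul(matmul(Bt, omega), B)))
    correction = matmul(matmul(matmul(matmul(omega, B), inner), Bt), omega)
    return add(omega, correction, sign=-1.0)

# Hypothetical inputs: p = 3 assets, K = 2 factors.
sigma_u = [[1.0, 0.3, 0.0], [0.3, 1.0, 0.3], [0.0, 0.3, 1.0]]
B = [[1.0, 0.5], [0.8, -0.2], [1.2, 0.3]]
cov_f = [[0.04, 0.01], [0.01, 0.09]]

gamma = return_precision(inverse(sigma_u), B, cov_f)
sigma_y = add(matmul(matmul(B, cov_f), transpose(B)), sigma_u)
check = matmul(gamma, sigma_y)   # should be close to the identity matrix
```

The point of the construction is that only a K x K matrix and the error precision matrix are needed, so the p x p return covariance never has to be inverted directly.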
3.1 Two examples relating precision matrix restrictions to covariance matrix
We now illustrate how specific structures of the covariance matrix are compatible with the sparsity assumption for the precision matrix. We provide two examples for the errors: a block-diagonal covariance matrix and a Toeplitz covariance matrix. We then show how they affect the precision matrix and Assumption 7(i).
3.1.1 Block Diagonal Covariance Matrix for Errors
Suppose that there are m = 1blocks in a
covariance matrix of the errors.
The sparsity assumption – Assumption 1 – for can be translated into
as max
max
= ¯s, where this is the maximum number of nonzero cells in a given row of a block, across all blocks. For Assumption 7 we need the following inequality from Corollary 6.1.5 of Horn and Johnson (2013), by seeing that spectral radius of a matrix is larger than or equal to absolute value of any eigenvalue for any square matrix
A. Therefore,
For the same inequality also see Theorem 5.6.9a of Horn and Johnson (2013). Relating to Assumption 7(i)
where Σis the
element of covariance matrix of errors. By (23), this last inequality becomes
It is easy to see that using Assumption 1(i), and under sufficient conditions for Assumption 7(i), with
we get Eigmax(0. This allows the size of the blocks to be increasing with p, but the ratio of the maximum block size to total number of parameters should be small.
3.1.2 Toeplitz Analysis
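In this case, the correlations among errors are $E[u_{i,t} u_{j,t}] = \rho^{|i-j|}$, with $|\rho| < 1$. Then
$$ \Omega = \frac{1}{1-\rho^2} \begin{pmatrix} 1 & -\rho & 0 & \cdots & 0 \\ -\rho & 1+\rho^2 & -\rho & \cdots & 0 \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ 0 & \cdots & -\rho & 1+\rho^2 & -\rho \\ 0 & \cdots & 0 & -\rho & 1 \end{pmatrix}. $$
We have the tri-diagonal inverse, with all other cells being zero except the main and two adjacent diagonals.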
Clearly Assumption 7(i) is satisfied since the sum on the right side converges to a constant.
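The Toeplitz example can be verified numerically. The sketch below assumes the familiar geometric-decay structure, in which the (i, j) entry of the error covariance is a fixed number of absolute value less than one raised to the power |i - j|, and uses the known closed-form tridiagonal inverse of that matrix. It then checks that each row of the inverse has at most three nonzero cells and that the maximum absolute row sum of the covariance, which bounds its largest eigenvalue, stays below a constant that does not depend on p. The names `toeplitz_cov` and `tridiagonal_precision` are illustrative.

```python
# Toeplitz covariance with geometric decay and its tridiagonal inverse (pure Python).

def toeplitz_cov(p, rho):
    """p x p matrix with (i, j) entry rho ** |i - j|."""
    return [[rho ** abs(i - j) for j in range(p)] for i in range(p)]

def tridiagonal_precision(p, rho):
    """Closed-form inverse of toeplitz_cov(p, rho), valid for p >= 2 and |rho| < 1."""
    scale = 1.0 / (1.0 - rho * rho)
    omega = [[0.0] * p for _ in range(p)]
    for i in range(p):
        corner = i in (0, p - 1)
        omega[i][i] = scale * (1.0 if corner else 1.0 + rho * rho)
        if i + 1 < p:
            omega[i][i + 1] = omega[i + 1][i] = -rho * scale
    return omega

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

p, rho = 6, 0.5
sigma = toeplitz_cov(p, rho)
omega = tridiagonal_precision(p, rho)
product = matmul(omega, sigma)                        # close to the identity
nonzeros_per_row = [sum(1 for x in row if x != 0.0) for row in omega]
max_row_sum = max(sum(abs(x) for x in row) for row in sigma)
row_sum_bound = (1 + rho) / (1 - rho)                 # geometric-series bound, free of p
```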
3.2 Algorithm For Asset Return Based Precision Matrix Estimation
Here we provide a practical algorithm to get the precision matrix estimator for asset returns, , and it will depend on the residual-based nodewise regression estimator
, and its symmetric version
.
1. Use equation (7) to set up the residual from a least squares based regression via known factors with
2. Form the transpose matrix of residuals for all asset returns except jth one, , which is a
1
4. Use equation (13) to get .
5. Now form which is a row in the precision matrix estimate for the errors with 1
as jth element of that jth row, and put all other elements of the jth row, as
.
6. Run steps 1-5 for all j = 1. Stack all rows j = 1
to form
matrix:
. Form symmetric version by
:=
.
7. Form
8. Now form the precision matrix estimate for all asset returns by (19) and steps 6-7:
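The residual-based nodewise steps of the algorithm above can be sketched as follows. This is a minimal illustration, not the authors' implementation, and it rests on several assumptions that are labeled here and in the comments: a single observed factor (K = 1) with no intercept, a lasso objective scaled as the average squared residual plus twice the tuning parameter times the l1 norm, a plain coordinate-descent solver, and the standard nodewise construction in which the diagonal cell of row j is the reciprocal of a scale term and the other cells are the negated lasso coefficients divided by that term. All function and variable names, the toy data and the tuning parameter value are hypothetical.

```python
# Sketch: residual-based nodewise regression for the error precision matrix.

def ols_residuals(Y, f):
    """Residuals from regressing each asset on one factor (assumption: K = 1, no intercept)."""
    ff = sum(x * x for x in f)
    residuals = []
    for y in Y:                      # y holds the n returns of one asset
        b_hat = sum(x * r for x, r in zip(f, y)) / ff
        residuals.append([r - b_hat * x for x, r in zip(f, y)])
    return residuals                 # p rows, n columns

def soft_threshold(z, lam):
    return z - lam if z > lam else z + lam if z < -lam else 0.0

def lasso_cd(target, regressors, lam, sweeps=200):
    """Coordinate descent for (1/n)*||target - fit||^2 + 2*lam*||gamma||_1 (assumed scaling)."""
    n = len(target)
    gamma = [0.0] * len(regressors)
    resid = list(target)             # current residual: target minus fitted values
    for _ in range(sweeps):
        for k, z in enumerate(regressors):
            zz = sum(v * v for v in z) / n
            if zz == 0.0:
                continue
            corr = sum(v * (r + gamma[k] * v) for v, r in zip(z, resid)) / n
            new = soft_threshold(corr, lam) / zz
            resid = [r - (new - gamma[k]) * v for r, v in zip(resid, z)]
            gamma[k] = new
    return gamma

def nodewise_precision(U_hat, lam):
    """Row-by-row precision estimate from residuals, plus its symmetrized version."""
    p, n = len(U_hat), len(U_hat[0])
    omega = [[0.0] * p for _ in range(p)]
    for j in range(p):
        idx = [l for l in range(p) if l != j]
        others = [U_hat[l] for l in idx]
        gamma = lasso_cd(U_hat[j], others, lam)
        fitted = [sum(g * z[t] for g, z in zip(gamma, others)) for t in range(n)]
        tau2 = sum(u * (u - fit) for u, fit in zip(U_hat[j], fitted)) / n
        omega[j][j] = 1.0 / tau2
        for g, l in zip(gamma, idx):
            omega[j][l] = -g / tau2
    sym = [[(omega[i][j] + omega[j][i]) / 2 for j in range(p)] for i in range(p)]
    return omega, sym

# Hypothetical toy data: p = 3 assets, n = 6 periods, one factor.
f = [1.0, -1.0, 2.0, 0.5, -0.5, 1.5]
Y = [[1.1, -0.9, 2.3, 0.4, -0.6, 1.4],
     [0.5, -0.7, 1.2, 0.1, -0.1, 0.9],
     [2.0, -2.2, 3.9, 1.2, -0.8, 3.1]]
U_hat = ols_residuals(Y, f)
omega_hat, omega_sym = nodewise_precision(U_hat, lam=0.001)
```

The remaining steps of the algorithm then combine the non-symmetric and symmetrized error precision estimates with the estimated loadings and factor covariance through the Sherman-Morrison-Woodbury formula. In practice the tuning parameter would be chosen by a data-driven rule rather than fixed as in this toy example.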
In this section, we apply the results, mainly the estimation of the precision matrix of returns, to the analysis of the Sharpe Ratio with a large number of assets. Specifically, we allow p > n, when $n \to \infty$. There are four themes, one in each subsection below, and all of them relate to the analysis of consistency of the Sharpe Ratio in portfolios with a large number of assets. All our theoretical analysis is without transaction costs; however, in the simulations and in the empirical exercise we consider the presence of transaction costs.
The first subsection analyzes the Sharpe Ratio of the Global Minimum Variance (GMV) portfolio and the Markowitz Mean-Variance (MMV) portfolio. In the GMV portfolio, we choose the weights to minimize the variance of the portfolio, subject to the weights summing to one. Short sales are allowed. The Sharpe Ratio is then constructed by dividing the mean portfolio return by its standard deviation. In the MMV portfolio, weights are chosen exactly as in the GMV portfolio, but we also impose a target for the portfolio mean return.
The second subsection considers choosing the weights of the portfolio so as to maximize the Sharpe Ratio, subject to the weights of the portfolio adding up to one. Short sales are allowed. The main difference between the GMV portfolio in Section 4.1.1 and the Constrained Maximum Sharpe Ratio in Section 4.2 is that in the GMV portfolio weights are chosen to minimize the variance and then the Sharpe Ratio is computed, whereas in the case of the Constrained Maximum Sharpe Ratio weights are chosen to maximize the Sharpe Ratio directly. Both methods use the same constraint that the weights of the portfolio should add up to one. In the case of the MMV portfolio in Section 4.1.2, weights are first chosen to minimize the portfolio variance under the conditions described earlier, and then the Sharpe Ratio is computed. The constraint of weights adding up to one is helpful in visualizing assets in percentage terms.
In the third subsection, we analyze the maximum out-of-sample Sharpe Ratio. Here, we do not have a constraint that all weights of the portfolio should add up to one as in Sections 4.1.1, 4.1.2, and 4.2. The analysis is out-of-sample, unlike the GMV, MMV, and Constrained Maximum Sharpe Ratio portfolios. Weights are chosen to maximize the portfolio returns subject to a constraint of a given variance. But the maximum out-of-sample Sharpe Ratio uses estimated weights, with the population out-of-sample mean return vector and the out-of-sample covariance matrix of returns in the formula. Since the maximum eigenvalue of the out-of-sample covariance matrix of returns is growing, this affects the estimation error rate. Specifically, Sections 4.1.1, 4.1.2, and 4.2 allow p > n and we still get consistency, when $n \to \infty$. With the maximum out-of-sample Sharpe Ratio we get consistency only when p < n and $p/n \to 0$.
In the fourth subsection, we consider the effect of estimated portfolio weights on obtaining the optimal Sharpe Ratio in large samples. Specifically, we estimate the weights and substitute this into the Sharpe Ratio formula, with keeping intact, and then try to show that this estimate is consistent. We show that it is possible only in the case of p < n, and this includes diverging number of assets and time span.
Before we state the theorems, we need the following sparsity assumption. Assumption 8(i) below replaces Assumption 7(ii). In Assumption 8(ii), the first term shows square of the maximum Sharpe Ratio is lower bounded, (scaled by p), to be positive. Scaling by p is needed since the numerator is summed over p terms. In a similar way, the second term in Assumption 8(ii) imposes that the variance of the GMV portfolio (scaled)
4.1 Commonly Used Portfolios with a Large Number of Assets
Here, we provide consistent estimates of the Sharpe Ratio of the GMV and MMV portfolios when p > n.
4.1.1 Global Minimum-Variance (GMV) Portfolio
In this part, we analyze the Sharpe Ratio that we can infer from the GMV portfolio. This is the portfolio in which weights are chosen to minimize the variance of the portfolio subject to the weights summing to one.
Specifically,
The solution to the above problem is well known and is given by
Next, substitute these weights into the Sharpe Ratio formula, normalized by the number of assets
We estimate (26) by nodewise regression, noting that :=
,
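To make the GMV construction above concrete, the following is a minimal numerical sketch rather than the estimator used in this paper. It assumes the standard closed form in which the weights are proportional to the precision matrix of returns times a vector of ones, and it uses the exact inverse of a toy covariance matrix where the paper would plug in the nodewise estimate. The function names and the toy inputs are hypothetical.

```python
import numpy as np

def gmv_weights(precision):
    """GMV weights: precision @ 1_p, rescaled so that the weights sum to one."""
    ones = np.ones(precision.shape[0])
    gamma_ones = precision @ ones
    return gamma_ones / (ones @ gamma_ones)

def gmv_sharpe_ratio(precision, mu):
    """Sharpe Ratio of the GMV portfolio, (1' G mu) / sqrt(1' G 1).

    This equals sqrt(p) * (1' G mu / p) * (1' G 1 / p) ** (-1/2), i.e. the same
    quantity written with each term normalized by the number of assets p.
    """
    ones = np.ones(precision.shape[0])
    return (ones @ precision @ mu) / np.sqrt(ones @ precision @ ones)

# Toy inputs (assumed values, for illustration only).
sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])
mu = np.array([0.01, 0.02, 0.03])
gamma = np.linalg.inv(sigma)  # the paper would use an estimated precision matrix

w = gmv_weights(gamma)
sr_closed_form = gmv_sharpe_ratio(gamma, mu)
sr_direct = (w @ mu) / np.sqrt(w @ sigma @ w)  # w'mu / sqrt(w' Sigma w)
```

The closed form and the direct evaluation agree when the precision matrix is the exact, symmetric inverse of the covariance matrix; with an estimated, possibly non-symmetric precision matrix they need not coincide.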
The following theorem is also valid when p > n and establishes both consistency and rate of convergence in the case of the Sharpe Ratio in the global minimum-variance portfolio.
1. We see that a large p only affects the error by a logarithmic factor as in the definition of in (22).
2. In the case of non-sparse precision matrix, we can only get consistency when p << n. To show this, in
4.1.2 Markowitz Mean-Variance (MMV) Portfolio
Markowitz (1952) portfolio selection is defined as finding the smallest variance given a desired expected return. The decision problem is
The formula for optimal weight is
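where $A := 1_p' \Gamma 1_p / p$, $F := 1_p' \Gamma \mu / p$, and $D := \mu' \Gamma \mu / p$, with $\Gamma := \Sigma_y^{-1}$ the precision matrix of returns and $\mu$ their mean vector. We define the estimators $\hat{A}$, $\hat{F}$, and $\hat{D}$ of these terms by replacing $\Gamma$ and $\mu$ with their estimates. The optimal variance of the portfolio in this scenario is normalized by the number of assets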
The estimate of that variance is
By our constraint, we obtain
Using the variance V above
The estimate of the Sharpe Ratio under the MMV portfolio is
We provide the maximum Sharpe Ratio (squared) consistency in this framework when the number of assets is larger than the sample size. This is a novel result in the literature.
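As a hedged illustration of the Markowitz construction above, the sketch below computes mean-variance weights from three scalar summaries. It assumes the standard definitions A = 1'G1/p, F = 1'Gmu/p, and D = mu'Gmu/p, with G the precision matrix of returns, and a target return `target_return`; these definitions, the function name, and the toy inputs are our assumptions, and the exact inverse stands in for the nodewise estimate.

```python
import numpy as np

def markowitz_weights(precision, mu, target_return):
    """Minimum-variance weights subject to w'1 = 1 and w'mu = target_return."""
    p = precision.shape[0]
    ones = np.ones(p)
    A = ones @ precision @ ones / p
    F = ones @ precision @ mu / p
    D = mu @ precision @ mu / p
    denom = A * D - F ** 2  # positive unless mu is proportional to 1_p
    w = ((D - target_return * F) / denom) * (precision @ ones) / p \
        + ((target_return * A - F) / denom) * (precision @ mu) / p
    # Portfolio variance implied by the two constraints (not rescaled by p).
    variance = (A * target_return ** 2 - 2 * F * target_return + D) / (p * denom)
    return w, variance

# Toy inputs (assumed values, for illustration only).
sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])
mu = np.array([0.01, 0.02, 0.03])
gamma = np.linalg.inv(sigma)

w_mmv, var_mmv = markowitz_weights(gamma, mu, target_return=0.015)
sr_mmv = 0.015 / np.sqrt(var_mmv)  # Sharpe Ratio: target return over std. dev.
```

Because the return constraint pins down the numerator, the Sharpe Ratio of this portfolio depends on the estimated quantities only through the variance term.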
1. Condition 0 shows that the variance is bounded away from infinity, and
2. We provide the rate of convergence of the estimators, which increases with p in a logarithmic way as
3. To get consistency when there is non-sparse precision matrix, the same analysis in Remark 2 of Theorem
4. Number of factors slows the rate of convergence of estimation error to zero here. This is due to the fact
4.2 Maximum Sharpe Ratio: Portfolio Weights Normalized to One
In this section, we define the maximum Sharpe Ratio when the portfolio weights are normalized to one. This, in turn, will depend on a critical term that determines which formula below applies. The maximum Sharpe Ratio is defined as follows, with w as the p × 1 vector of portfolio weights:
where is a vector of ones. This maximum Sharpe Ratio is constrained to have portfolio weights that sum to one. Maller et al. (2016) shows that depending on a scalar, it has two solutions. When
0, with
:=
, we have the square of the maximum Sharpe Ratio:
On the other hand, when 0, we have
This is equation (6.1) of Maller et al. (2016). Equation (32) is used in the literature, and this is the formula when the weights do not necessarily sum to one given a return constraint as in Ao et al. (2019). In case of 0, in equations (2.7)-(2.10) of Maller and Turkington (2002), there is an approximation to optimal portfolio weights. To be specific, with a positive
0, optimal portfolio weights, which is (
1) vector:
where
is a (1)
1 matrix with
:= (
:
1 matrix, with 1
a (
1) column vector of
ones, and
is of dimension 1.
When , the weights can provide the maximum Sharpe Ratio:
, as discussed in p.504 of Maller and Turkington (2002).
These equations can be estimated by their sample counterparts, but in the case of p > n, the sample covariance matrix is not invertible, so we need to use new tools from high-dimensional statistics. We use the nodewise regression precision matrix estimate of Meinshausen and Bühlmann (2006). This estimate is denoted by
is incorporated into the precision matrix of returns ˆ
.
We will also introduce the maximum Sharpe Ratio, which addresses the uncertainty regarding whether
we should analyze MSR or . This is
Note also that with = 0,
. The estimators for
will be introduced in the next subsection.
4.2.1 Consistency and Rate of Convergence of Constrained Maximum Sharpe Ratio Estimators
First, when 0, we have the square of the maximum Sharpe Ratio as in (32). Namely, the estimate of the square of the maximum Sharpe Ratio is:
1. We allow p > n and p can grow exponentially in n. We also allow for time-series data and establish
2. When there is no sparsity of the precision matrix, i.e. ¯s = p, we can still get consistency but for
If 0, the Sharpe Ratio is minimized, as shown on p.503 of Maller and Turkington (2002). The new maximum Sharpe Ratio in the case when
0 is in Theorem 2.1 of Maller and Turkington (2002). The square of the maximum Sharpe Ratio when
0 is given in (33).
An estimator in this case is
The optimal portfolio allocation for such a case is given in (2.10) of Maller and Turkington (2002) and is shown here in Section 4.2. The limit for such estimators when the number of assets is fixed (p fixed) is given in Theorems 3.1b-c of Maller et al. (2016).
1. In Theorem 6, we allow p > n, and time-series data are allowed, unlike the iid or normal return cases
2. Case of non-sparse precision matrix proceeds in the same way as Remark 2 of Theorem 5. To have
We provide an estimate that takes into account uncertainties about the term . Note that the term can be consistently estimated, as shown in Lemma B.3 in Supplement B. A practical estimate for a maximum Sharpe Ratio that will be consistent is:
where we excluded the case of = 0 in the estimator. That specific scenario is very restrictive in terms of returns and variance. Note that under a mild assumption, when
0, we have
0, and when
0, we have
0 with probability approaching one in the proof of Theorem 7. Note that
:=
.
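The case distinction above can be sketched numerically as follows. This is an illustration under stated assumptions, not the estimator of this paper: we assume the switching scalar is 1'G mu (with G the precision matrix of returns), that the squared maximum Sharpe Ratio is mu'G mu when this scalar is positive, and that it is mu'G mu - (1'G mu)^2 / (1'G 1) when the scalar is negative, which is our reading of the result attributed to Maller et al. (2016). The function name and toy inputs are hypothetical.

```python
import numpy as np

def constrained_max_sharpe_sq(precision, mu):
    """Squared maximum Sharpe Ratio when the weights must sum to one."""
    ones = np.ones(len(mu))
    a = ones @ precision @ ones
    f = ones @ precision @ mu   # the sign of this scalar selects the case
    d = mu @ precision @ mu
    if f > 0:
        return d
    if f < 0:
        return d - f ** 2 / a
    # The boundary case is excluded, as in the text.
    raise ValueError("the case 1'G mu = 0 is excluded")

# Toy inputs (assumed values, for illustration only).
sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])
gamma = np.linalg.inv(sigma)
mu_pos = np.array([0.01, 0.02, 0.03])   # gives 1'G mu > 0
mu_neg = -mu_pos                        # gives 1'G mu < 0

msr_sq_pos = constrained_max_sharpe_sq(gamma, mu_pos)
msr_sq_neg = constrained_max_sharpe_sq(gamma, mu_neg)
```

In practice the sign of the scalar is itself estimated, which is why the practical estimator above combines both cases through the estimated sign.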
Theorem 7. Under Assumptions 1-4,6,7(i), 8, with 0, where
is a positive constant, and assuming
0, with a sufficiently small positive
0, and C being a positive
1. In the case of p > n, we only consider consistency since standard central limit theorems (apart from
2. The case of non-sparse precision matrix with ¯s = p proceeds in the same way as in Remark 2 after
4.3 Maximum Out-of-Sample Sharpe Ratio
This section analyzes the maximum out-of-sample Sharpe Ratio that is considered in Ao et al. (2019). To obtain that formula, we need the optimal portfolio weights. The optimization of the portfolio weights is formulated as
where we maximize the return subject to a specified positive and finite risk constraint, 0. Equation (A.2) of Ao et al. (2019) defines the estimated maximum out-of-sample ratio when p < n, with the inverse of the sample covariance matrix,
= [
used as an estimator for the precision matrix estimate:
The theoretical version is written as, by definition of :=
,
Then, equation (1.1) of Ao et al. (2019) shows that when (0, 1), the above plug-in maximum out-of-sample ratio cannot consistently estimate the theoretical version. The optimal weights of a portfolio are given in (2.3) of Ao et al. (2019) in an out-of-sample context given a risk level. This comes from maximizing the expected portfolio return subject to its variance being constrained by the square of the risk, where this
The estimates that we will use
Our maximum out-of-sample Sharpe Ratio estimate using the nodewise estimate is:
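A minimal sketch of the out-of-sample construction described in this section follows. It assumes the standard solution of maximizing expected return under a variance cap, namely weights proportional to the precision matrix times the mean vector and rescaled to the risk level, and it evaluates those weights with the population mean and covariance. The function names, the risk level 0.04, and the toy inputs are assumptions for illustration; the exact inverse stands in for the nodewise estimate.

```python
import numpy as np

def max_return_weights(precision_hat, mu_hat, risk):
    """Weights maximizing w'mu subject to w' Sigma w <= risk**2 (plug-in form).

    Requires mu_hat' precision_hat mu_hat > 0, which holds when the
    precision estimate is positive definite.
    """
    gm = precision_hat @ mu_hat
    return risk * gm / np.sqrt(mu_hat @ gm)

def out_of_sample_sharpe(weights, mu, sigma):
    """Sharpe Ratio of given weights under the population mean and covariance."""
    return (weights @ mu) / np.sqrt(weights @ sigma @ weights)

# Toy population quantities (assumed values, for illustration only).
sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])
mu = np.array([0.01, 0.02, 0.03])
gamma = np.linalg.inv(sigma)

# With the true inputs, the ratio attains its theoretical maximum.
w_true = max_return_weights(gamma, mu, risk=0.04)
sr_max = out_of_sample_sharpe(w_true, mu, sigma)

# With a perturbed mean estimate, the out-of-sample ratio can only be lower.
mu_hat = mu + np.array([0.005, -0.005, 0.0])
w_hat = max_return_weights(gamma, mu_hat, risk=0.04)
sr_hat = out_of_sample_sharpe(w_hat, mu, sigma)
```

The gap between `sr_max` and `sr_hat` is the kind of estimation effect that the consistency results in this section are concerned with.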
Below we provide a sparsity assumption for the case of the maximum out-of-sample Sharpe Ratio.
1. Note that p.4353 of Ledoit and Wolf (2017) shows that the maximum out-of-sample Sharpe Ratio is
2. We cannot have p > n in this theorem, due to Assumption 9, this shows the difficulty of maximum out
3. (1) can be also obtained in non-sparse precision matrix, although the conditions will be more
4. The case of large non-negative weights can be handled with our analysis. This is the case of growing
4.4 Portfolio Estimation Based Sharpe Ratio Analysis
In this section, for the scenarios considered in Sections 4.1-4.2, we form the estimate of the portfolio weights and substitute it into the Sharpe Ratio. To isolate the effect of portfolio weight estimation on the consistent estimation of the Sharpe Ratio, we keep the population mean and covariance matrix of returns fixed in the Sharpe Ratio estimates. We start with the GMV portfolio. The estimated portfolio weights are
The Sharpe Ratio estimate of this portfolio is:
The optimized-target population Sharpe Ratio is given in (26).
Now we consider the Sharpe Ratio based on Markowitz portfolio. The estimated portfolio weights are
These are estimates by plugging in terms in equation (28). Denote the Sharpe Ratio based on portfolio
weight estimates
The optimal Sharpe Ratio is in (31) in this case.
In case of constrained maximum Sharpe Ratio in section 4.2, when 0, we can establish the
portfolio weight estimates
Constrained maximum Sharpe Ratio estimate when 0 is:
The optimal Sharpe Ratio in this case is in (32).
The constrained maximum Sharpe Ratio weights when 0 are more complicated as seen in
in Section 4.2. The estimate is:
with
Note that maximum Sharpe Ratio in this second constrained case is:
Using ˆposes several challenges. Taking
to reach the optimal Sharpe Ratio is key but the rate may play a role and also the weights depend on ˆ
term which depends on ˆ
that depends on precision matrix estimate ˆ
, mean estimate ˆ
, and estimate
from section 4.2. So, given Theorems 2 and 6, we think that consistency is plausible. However, given the lengthy material in this paper, this is beyond the scope of our theoretical analysis. Hence, similar corollaries for Theorems 6-7 cannot be handled in this paper.
An important fact that applies to all corollaries here is that they cover only the p < n case, as discussed in Remark 3 of Theorem 8.
5.1 Models and Implementation Details
In this section, we compare the nodewise regression with several models in a simulation exercise. The two aims of the exercise are to determine whether our method achieves consistency and how our method performs compared to others in the estimation of the constrained maximum Sharpe Ratio, the out-of-sample maximum Sharpe Ratio, and the Sharpe Ratio in global minimum-variance and Markowitz mean-variance portfolios.
The other methods that are widely used in the literature and benefit from high-dimensional techniques are the principal orthogonal complement thresholding (POET) from Fan et al. (2013), the nonlinear shrinkage (NL-LW) and the single factor nonlinear shrinkage (SF-NL-LW) from Ledoit and Wolf (2017), and the maximum Sharpe Ratio estimated and sparse regression (MAXSER) from Ao et al. (2019). All models except for the MAXSER are plug-in estimators, where the first step is to estimate the precision/covariance matrix, and the second step is to plug the estimate into the desired equation.
The POET uses principal components to estimate the covariance matrix and allows some eigenvalues of the covariance matrix to be spiked and grow at a rate O(p). This allows common and idiosyncratic components to be identified via principal components analysis, and the POET can consistently estimate the space spanned by the eigenvectors of the covariance matrix. However, Fan et al. (2013) point out that the absolute convergence rate of the model is not satisfactory for estimating the covariance matrix, and consistency can only be achieved in terms of the relative error matrix.
Nonlinear shrinkage is a method that individually determines the amount of shrinkage of each eigenvalue in the covariance matrix for a particular loss function. The main aim is to raise the smallest eigenvalues and lower the largest ones to stabilize the high-dimensional covariance matrix. This nonlinear method is a novel and excellent idea. Ledoit and Wolf (2017) propose a loss function that captures the objective of an investor using portfolio selection. As a result, they obtain an optimal estimator of the covariance matrix for portfolio selection with many assets. The SF-NL-LW method first extracts a single-factor structure from the data, where the factor is simply an equal-weighted portfolio of all assets, and then estimates the covariance matrix.
Finally, the MAXSER starts with estimating the adjusted squared maximum Sharpe Ratio used in a penalized regression to obtain the portfolio weights. Of all the discussed models, the MAXSER is the only one that does not estimate the precision matrix in a plug-in estimator of the maximum Sharpe Ratio.
Regarding implementation, the POET and both models from Ledoit and Wolf (2017) are available in the R packages POET (Fan et al., 2016) and nlshrink (Ramprasad, 2016). The SF-NL-LW needs some minor adjustments following the procedures described in Ledoit and Wolf (2017). For the MAXSER, we follow the steps for the non-factor case in Ao et al. (2019), and we use the package lars (Hastie and Efron, 2013) for the penalized regression estimation. We estimate the nodewise regression following the steps in Section 3.2 using the glmnet package (Friedman et al., 2010) for penalized regressions. We use two alternatives to select the regularization parameter: 10-fold cross-validation (CV) and the generalized information criterion (GIC) of Zhang et al. (2010).
The GIC procedure starts by fitting in (12) for a range of
that goes from the intercept-only model to the largest feasible model. This is automatically done by the glmnet package. Then, for the GIC procedure,
we calculate the information criterion for a given among the ranges of all possible tuning parameters
where ) is the sum squared error for a given
) is the number of variables, given
in the model that is nonzero, and p is the number of assets. The last step is to select the model with the smallest GIC. Once this is done for all assets j = 1, . . . , p, we can proceed to obtain
.
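The selection step of the GIC procedure above can be sketched as follows. The sketch takes as given, for one asset, the sum of squared errors and the number of nonzero coefficients along a grid of regularization parameters (for example, as produced by a lasso path), and returns the grid point with the smallest criterion. The penalty form used here, SSE/n plus the number of nonzero coefficients times log(p) log(log(n)) / n, is an assumption on our part, as are the function names and the toy numbers.

```python
import math

def gic(sse, n_nonzero, n, p):
    """Assumed GIC form: fit term SSE/n plus a penalty on model size."""
    return sse / n + n_nonzero * math.log(p) * math.log(math.log(n)) / n

def select_lambda_by_gic(lambdas, sse_path, nonzero_path, n, p):
    """Return the regularization parameter with the smallest GIC, and all scores."""
    scores = [gic(s, q, n, p) for s, q in zip(sse_path, nonzero_path)]
    best = min(range(len(scores)), key=scores.__getitem__)
    return lambdas[best], scores

# Hypothetical path for one asset: larger penalties give sparser, worse fits.
lambdas = [1.0, 0.5, 0.1, 0.01]
sse_path = [50.0, 20.0, 18.0, 17.9]
nonzero_path = [0, 2, 5, 20]

best_lambda, gic_scores = select_lambda_by_gic(
    lambdas, sse_path, nonzero_path, n=100, p=50
)
```

In this toy path the criterion favors the second grid point: moving from two to five nonzero coefficients lowers the fit term by less than it raises the penalty.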
For the CV procedure, we split the sample into k subsamples and fit the model for a range of values of the regularization parameter, as in the GIC procedure, except that the models are fit on subsamples. We always estimate the models on k − 1 subsamples, leaving one subsample as a test sample, on which we compute the mean squared error (MSE). After repeating the procedure with each of the k subsamples as the test sample, we compute the average MSE across all subsamples and select, for each asset j, the regularization parameter that yields the smallest average MSE. We can then use the selected regularization parameters to obtain the nodewise precision matrix estimate.
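The k-fold logic above can be sketched as follows for a single asset. To keep the example self-contained, a closed-form ridge regression is used as a stand-in for the lasso fits that the paper obtains from glmnet; this substitution, the random fold assignment, the function names, and the simulated data are all assumptions for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Stand-in penalized regression (ridge, not the lasso used in the paper)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def kfold_select(X, y, lambdas, fit, k=10, seed=0):
    """Pick the regularization parameter with the smallest average test MSE."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    avg_mse = []
    for lam in lambdas:
        fold_mse = []
        for test_idx in folds:
            train = np.ones(n, dtype=bool)
            train[test_idx] = False          # fit on the other k - 1 folds
            coef = fit(X[train], y[train], lam)
            resid = y[test_idx] - X[test_idx] @ coef
            fold_mse.append(np.mean(resid ** 2))
        avg_mse.append(float(np.mean(fold_mse)))
    return lambdas[int(np.argmin(avg_mse))], avg_mse

# Simulated data (assumed values, for illustration only).
rng = np.random.default_rng(1)
X = rng.standard_normal((60, 5))
beta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta + 0.1 * rng.standard_normal(60)

best_lam, cv_mse = kfold_select(X, y, [1e-3, 1.0, 1e4], ridge_fit, k=10)
```

In the nodewise setting this selection would be repeated for each asset j, with the other assets' residuals as regressors.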
5.2 Data Generation Process and Results
The DGP is based on a simplified version of the factor DGP in Ao et al. (2019), for j = 1:
where and
are the monthly asset returns of asset j, factor returns of factor k respectively,
are the individual stock sensitivities to the factors, and
represent the idiosyncratic component of each stock. We start with two specifications that correspond to two tables. Table 1 corresponds to 1 factor: excess return of the market portfolio, hence K = 1, and Table 2 corresponds to 3 factors from the Fama & French three factors, K = 3.
Let
and
be the factors’ sample mean and covariance matrix. The
, and
and covariance matrix of residuals:
are estimated using a simple least-squares regression using returns from the S&P500 stocks that were part of the index in the entire period from 2008 to 2017. In each simulation, we randomly select p stocks from the pool with replacement because our simulations require more than the total number of available stocks. We then used the selected stocks to generate individual returns with covariance matrix of errors:
=
), where
) is the
matrix of
the form, for (i,j)th element
with = 0.25, 0.5, 0.75.
represents element by element multiplication (Hadamard product) of two square matrices A, B of the same dimensions.
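The construction of the simulated error covariance can be sketched as follows. We assume that the Toeplitz matrix has (i, j) element equal to the Toeplitz parameter raised to the power |i - j|, and that it is combined with an estimated residual covariance matrix through the Hadamard product; both the functional form and the names below are assumptions for illustration.

```python
import numpy as np

def toeplitz_corr(p, rho):
    """p x p matrix with (i, j) element rho ** |i - j| (assumed form)."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def simulated_error_cov(resid_cov, rho):
    """Hadamard (element-by-element) product of resid_cov and the Toeplitz matrix."""
    return resid_cov * toeplitz_corr(resid_cov.shape[0], rho)

# Toy residual covariance (assumed values): B B' is positive semidefinite.
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
resid_cov = B @ B.T

cov_sim = simulated_error_cov(resid_cov, rho=0.5)
```

The diagonal of the residual covariance is unchanged because the Toeplitz matrix has ones on its diagonal, and by the Schur product theorem the Hadamard product of two positive semidefinite matrices remains positive semidefinite, so the result is a valid covariance matrix.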
Tables 1-2 show the results. The values in each cell show the average absolute estimation error for estimating the square of the Sharpe Ratio. Each eight-column block in the table shows the results for a different sample size. In each of these blocks, the first four columns are for p = n/2, and the last four columns are for p = 3n/2. MSR, MSR-OOS, GMV-SR, and MKW-SR are the constrained maximum Sharpe Ratio, the out-of-sample maximum Sharpe Ratio, the Sharpe Ratio from the global minimum-variance portfolio, and the Sharpe Ratio from the Markowitz portfolio with target returns set to 1%, respectively. Therefore, there are four categories to evaluate the different estimates. The MAXSER risk constraint was set to 0.04 following Ao et al. (2019). We ran 100 iterations in each simulation setup. All bold-face entries in tables show category champions.
Both tables indicate that the estimation error of our method shrinks as n and p grow jointly, in line with our theorems. Consider K = 3 (Table 2) with the Toeplitz parameter equal to 0.50, OOS-MSR (the out-of-sample maximum Sharpe Ratio), and GIC tuning parameter selection: the estimation error at p = n/2 with n = 100 is 1.244, declines to 0.585 at p = n/2, n = 200, and declines further to 0.321 at p = n/2, n = 400. The main reason is that the errors grow with ln(p) but decline with n, so the number of assets in a large portfolio affects the error only logarithmically. To give another example from Table 2, with the Toeplitz parameter equal to 0.50, GMV-SR (the Sharpe Ratio of the global minimum-variance portfolio), and cross-validation tuning parameter selection, the estimation error of our method is 0.352 with p = 3n/2, n = 100, declines to 0.213 with p = 3n/2, n = 200, and declines further to 0.143 with p = 3n/2, n = 400.
Next, we consider which method achieves the smallest estimation error. Table 1 favors SF-NL-LW (Single Factor Non-Linear Shrinkage of Ledoit-Wolf) since it has a single factor built into this subset of their technique. We get better results in Table 2 (K = 3) for our methods. We have 4 categories: MSR, OOS-MSR, GMV-SR, MKW-SR corresponding to our Theorems 3-9. There are nine possibilities in each category (given we are either at p = n/2 or p = 3n/2), representing three choices of sample sizes paired with 3 choices of different Toeplitz structures.
We analyze each category, starting with Table 1. With p = 3n/2 in OOS-MSR, our NW-GIC method has the smallest error in 8 out of 9 cases. When p = n/2, the MAXSER method dominates all others since it is specifically designed to handle OOS-MSR with p < n. In GMV-SR with p = n/2, our NW-GIC dominates in 3 out of 9 cases. In the other categories in Table 1, the nonlinear shrinkage method of Ledoit and Wolf (2017) does the best, but our methods come a very close second.
In Table 2, with K = 3, our methods perform better than in Table 1. In the GMV-SR category with p = 3n/2, our methods have the smallest error in 7 out of 9 possible configurations. In the same category with p = n/2, our methods dominate in 5 out of 9 configurations. In the MKW-SR (Markowitz Sharpe Ratio) category, our theorems predict that our methods may suffer as the number of factors increases. We see that nonlinear shrinkage methods are the best and our methods are the second best in this category. In the constrained maximum Sharpe Ratio (MSR) category, nonlinear shrinkage methods perform the best.
For the empirical application, we use two subsamples. The first subsample uses data from January 1995 to December 2019 with an out-of-sample period from January 2005 to December 2019. We select all stocks that were in the S&P 500 index for at least one month in the out-of-sample period and that have data for the entire 1995-2019 period, resulting in 382 stocks. The second subsample starts in January 1990 and ends in December 2019 with an out-of-sample period from January 2000 to December 2019. Using the same criterion as in the first subsample, the number of stocks is 321, which is around 16% fewer than in the first subsample. The objective is to have an out-of-sample competition between models, and we only estimate GMV and Markowitz portfolios for the plug-in estimators. The first out-of-sample period includes only the recession of 2008. The second out-of-sample period includes the recessions of 2000 and 2008, and the out-of-sample periods reflect recent history.
The Markowitz return constraint is 0.8% per month, and the MAXSER risk constraint is 4%. In the low-dimensional experiment, we randomly select 50 stocks from the pool and estimate the models with the same stocks for all windows. We also experimented with 25 stocks but do not report those results; that table is available from the authors on request. In the high-dimensional case, we use all available stocks.
We use a rolling window setup for the out-of-sample estimation of the Sharpe Ratio following Callot et al. (2021). Specifically, samples of size n are divided into in-sample (1 : ) and out-of-sample (
+ 1 : n). We start by estimating the portfolio
in the in-sample period and the out-of-sample portfolio returns
. Then, we roll the window by one element (2 :
+ 1) and form a new in-sample portfolio
and out-of-sample portfolio returns
. This procedure is repeated until the end of the sample.
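The rolling-window scheme above can be sketched as follows. The sketch assumes a fixed in-sample length `n_in`, a user-supplied rule `weight_fn` that maps an in-sample window of returns to portfolio weights, and one-step-ahead evaluation; these names and the equal-weight rule used in the example are hypothetical stand-ins for the estimators compared in the paper.

```python
import numpy as np

def rolling_oos_returns(returns, n_in, weight_fn):
    """One-step-ahead portfolio returns from a rolling window of length n_in.

    returns: (n, p) array of excess returns, one row per period.
    """
    n = returns.shape[0]
    oos = []
    for start in range(n - n_in):
        window = returns[start:start + n_in]    # in-sample observations
        w = weight_fn(window)                   # weights estimated in-sample
        oos.append(w @ returns[start + n_in])   # return in the next period
    return np.array(oos)

def equal_weights(window):
    """Equally weighted portfolio (EW), used here only as a placeholder rule."""
    p = window.shape[1]
    return np.full(p, 1.0 / p)

# Simulated returns (assumed values, for illustration only).
rng = np.random.default_rng(0)
returns = 0.01 + 0.05 * rng.standard_normal((30, 4))

oos_returns = rolling_oos_returns(returns, n_in=20, weight_fn=equal_weights)
```

With n observations and an in-sample window of length `n_in`, the loop produces n - `n_in` out-of-sample returns, which matches, for example, 180 windows from 300 monthly observations with a 120-month estimation window.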
The out-of-sample average return and variance without transaction costs are
We estimate the Sharpe Ratios with and without transaction costs. The transaction cost, c, is defined as 50 basis points following DeMiguel et al. (2007). Let be the return of the portfolio in
period t + 1; in the presence of transaction costs, the returns will be defined as
where =
(1 +
(1 +
) and
and
are the excess returns of asset j and the portfolio P added to the risk-free rate. The adjustment made in
is because the portfolio at the end of the period has changed compared to the portfolio at the beginning of the period.
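The transaction-cost adjustment above can be sketched as follows. We assume that the net return equals the portfolio excess return minus the cost rate times one plus that return times the sum of absolute differences between the new weights and the drifted old weights, where the drifted weights rescale each old weight by the asset's gross return relative to the portfolio's gross return. This functional form, the function name, and the toy numbers are assumptions for illustration.

```python
import numpy as np

def net_portfolio_return(w_prev, w_new, asset_excess, rf, cost=0.005):
    """Portfolio excess return net of proportional transaction costs.

    w_prev: weights held during the period; w_new: weights chosen at its end.
    asset_excess: asset excess returns over the period; rf: risk-free rate.
    cost: proportional cost (0.005 corresponds to 50 basis points).
    """
    port_excess = w_prev @ asset_excess
    gross_assets = 1.0 + asset_excess + rf
    gross_port = 1.0 + port_excess + rf
    # Weights at the end of the period, before rebalancing.
    w_drift = w_prev * gross_assets / gross_port
    traded = np.abs(w_new - w_drift).sum()
    return port_excess - cost * (1.0 + port_excess) * traded, traded

# Toy example: rebalancing back to equal weights after returns of +10% and -10%.
w_prev = np.array([0.5, 0.5])
asset_excess = np.array([0.10, -0.10])
net_ret, traded = net_portfolio_return(w_prev, w_prev, asset_excess, rf=0.0)
```

In the toy example the portfolio return is zero, the weights drift to 0.55 and 0.45, and restoring equal weights trades 0.10 in total, so the net return is slightly negative.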
The Sharpe Ratio is calculated from the average return and the variance of the portfolio in the out-of-sample period
The portfolio returns are replaced by the returns with transaction costs when we calculate the Sharpe Ratio with transaction costs.
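As a hedged illustration, the out-of-sample statistics described above can be written as follows. The symbols $n_I$ (in-sample length), $\hat{w}_t$ (weights estimated from the window ending at $t$), and $y_{t+1}$ (vector of excess returns in period $t+1$) are our notational assumptions, as is the degrees-of-freedom correction in the variance.

```latex
\hat{\mu}_{os} = \frac{1}{n - n_I} \sum_{t = n_I}^{n-1} \hat{w}_t' y_{t+1},
\qquad
\hat{\sigma}^2_{os} = \frac{1}{n - n_I - 1} \sum_{t = n_I}^{n-1}
  \left( \hat{w}_t' y_{t+1} - \hat{\mu}_{os} \right)^2,
\qquad
SR = \frac{\hat{\mu}_{os}}{\hat{\sigma}_{os}} .
```

With transaction costs, $\hat{w}_t' y_{t+1}$ is replaced by the corresponding net return in both the mean and the variance.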
We use the same test as Ao et al. (2019) to compare the models. Specifically,
where is the Sharpe Ratio of our feasible nodewise model, which is tested against all remaining models. This is the Jobson and Korkie (1981) test with Memmel (2003) correction. We also considered the method of Ledoit and Wolf (2008) for testing the significance of the winner and using the equally weighted portfolio as a benchmark; the results were very similar and hence are not reported.
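A sketch of the Jobson and Korkie (1981) statistic with the Memmel (2003) correction follows. It is based on our recollection of the standard form of that statistic for testing equality of two Sharpe Ratios; the function name, the two-sided p-value, and the simulated return series are assumptions, and the paper's exact implementation may differ.

```python
import math
import numpy as np

def jobson_korkie_memmel(r1, r2):
    """z-statistic and two-sided p-value for H0: the two Sharpe Ratios are equal."""
    r1, r2 = np.asarray(r1, float), np.asarray(r2, float)
    T = len(r1)
    m1, m2 = r1.mean(), r2.mean()
    v1, v2 = r1.var(ddof=1), r2.var(ddof=1)
    s1, s2 = math.sqrt(v1), math.sqrt(v2)
    c12 = np.cov(r1, r2, ddof=1)[0, 1]
    # Asymptotic variance of the numerator (Memmel's corrected expression).
    theta = (2 * v1 * v2 - 2 * s1 * s2 * c12
             + 0.5 * m1 ** 2 * v2 + 0.5 * m2 ** 2 * v1
             - (m1 * m2 / (s1 * s2)) * c12 ** 2) / T
    z = (s2 * m1 - s1 * m2) / math.sqrt(theta)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # 2 * (1 - Phi(|z|))
    return z, p_value

# Simulated return series (assumed values, for illustration only).
rng = np.random.default_rng(0)
r_a = 0.01 + 0.05 * rng.standard_normal(200)
r_b = 0.5 * r_a + 0.002 + 0.03 * rng.standard_normal(200)

z_ab, p_ab = jobson_korkie_memmel(r_a, r_b)
z_ba, p_ba = jobson_korkie_memmel(r_b, r_a)
```

Swapping the two series flips the sign of the statistic and leaves the two-sided p-value unchanged, which is a quick sanity check on an implementation.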
We also include an equally weighted portfolio (EW). GMV-NW-GIC and GMV-NW-CV denote the nodewise method with GIC and cross validation tuning parameter choices, respectively, in the global minimum-variance portfolio (GMV).
In each of our feasible nodewise models with GIC, CV, we either use a single-factor model (market as the only factor) or three-factor model. They are denoted GMV-NW-GIC-SF, GMV-NW-GIC-3F for the global minimum variance portfolio analyzed with feasible nodewise method and GIC criterion for tuning parameter choice and single and three-factor models, respectively. In the same way, we define GMV-NW-CV-SF, GMV-NW-CV-3F. We take GMV-NW-GIC-SF as the benchmark to test against all other methods since it generally does well in different preliminary forecasts.
GMV-POET, GMV-NL-LW, and GMV-SF-NL-LW denote the POET, nonlinear shrinkage, and single-factor nonlinear shrinkage methods, respectively, which are described in the simulation section and also used in the global minimum-variance portfolio. The MAXSER is also used and explained in the simulation section. MW denotes the Markowitz mean-variance portfolio, and MW-NW-GIC-SF denotes the feasible nodewise method with GIC tuning parameter selection in the Markowitz portfolio with a single factor. All the other methods with MW headers are analogous and thus self-explanatory.
The results are presented in Tables 3 and 4. Table 3 shows the results for the 2005-2019 out-of-sample period. Feasible nodewise methods do well in terms of the Sharpe Ratio in Table 3. For example, with transaction costs in the low-dimensional portfolio category, GMV-NW-GIC-SF is the best model in terms of the Sharpe Ratio (SR, averaged over the out-of-sample period), with an SR of 0.210. In the high-dimensional case with transaction costs in the same table, GMV-POET and our GMV-NW-GIC-SF virtually tie at an SR of 0.214 (the difference, in favor of POET, is in the fourth decimal).
If we analyze only the Markowitz portfolio in Table 3, with transaction costs in high dimensions, MW-NW-GIC-SF has the highest SR, 0.211. Thus, within the Markowitz subcategory as well, the feasible nodewise method does best. Although statistical significance is not established, it is unclear whether these significance tests have high power in our high-dimensional cases.
Table 4 shows the results for the January 2000 to December 2019 out-of-sample period. We see that feasible nodewise methods dominate in all scenarios except for the low-dimensional case with transaction costs. In high dimensions with transaction costs, GMV-NW-GIC-SF (Markowitz-nodewise-GIC) has an SR of 0.225, and the closest is GMV-POET with 0.204. We also experimented with two other out-of-sample periods, 2005-2017 and 2000-2017; the results are slightly better for our methods and are available on request.
Table 3: Empirical Results – Out-of-Sample Period from Jan. 2005 to Dec. 2019
The table shows the Sharpe Ratio (SR), average returns (Avg), standard deviation (SD) and p-value of the Jobson and Korkie (1981) test with Memmel (2003) correction. We also applied the Ledoit and Wolf (2008) test with circular bootstrap, and the results were very similar; therefore we only report those of the first test in this table. The statistics were calculated from 180 rolling windows covering the period from Jan. 2005 to Dec. 2019, and the size of the estimation window was 120 observations.
In Table 5, we analyze turnover, leverage and maximum leverage (equations (40), (41) and (42), respectively) of the portfolios in Tables 3-4.
The definitions are as follows for turnover:
and leverage
and maximum leverage
Table 5 shows that, in terms of turnover, leverage, and maximum leverage, GMV-POET and GMV-NW-GIC-SF do well: setting aside the EW portfolios, they are the best and close to the best, respectively.
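The three portfolio diagnostics above can be sketched as follows. We assume that turnover is the time average of the sum of absolute differences between the new weights and the drifted previous weights, that leverage is the time average of the total short position, and that maximum leverage is the largest total short position over time. These definitions, the function name, and the toy weights are assumptions for illustration.

```python
import numpy as np

def turnover_and_leverage(weights, drifted_weights):
    """Average turnover, average leverage, and maximum leverage.

    weights: (T, p) array, row t holds the weights chosen in period t.
    drifted_weights: (T - 1, p) array, row t holds the period-t weights after
        they have drifted with returns, just before the next rebalancing.
    """
    turnover = np.abs(weights[1:] - drifted_weights).sum(axis=1).mean()
    short = np.abs(np.minimum(weights, 0.0)).sum(axis=1)  # total short position
    return turnover, short.mean(), short.max()

# Toy weights (assumed values, for illustration only).
weights = np.array([[0.5, 0.5],
                    [1.2, -0.2],
                    [0.6, 0.4]])
drifted = np.array([[0.5, 0.5],
                    [1.0, 0.0]])

avg_turnover, avg_leverage, max_leverage = turnover_and_leverage(weights, drifted)
```

An equally weighted long-only portfolio has zero leverage by construction, which is one reason the EW portfolios are set aside when ranking methods on these measures.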
6.1 Time Series of Sharpe Ratios and Turnover
Figures 1 and 2 show the global minimum-variance results of the NW-GIC-SF, the POET, and the SF-NL-LW models with transaction costs. The results were obtained through a 24-month rolling window with the out-of-sample returns from the 2000-2019 experiment, which yields time series that start in 2002 and end in 2019 for the Sharpe Ratio and the turnover. The main conclusion from the figures is that Nodewise works better in terms of the Sharpe Ratio in deep recessions like the 2008 crisis, but Nonlinear Shrinkage and POET are superior when we have long periods of normality in the markets. Nodewise also delivers better Sharpe Ratios during the recovery from the crisis. On the turnover side, Nodewise and POET consistently have lower turnover than Nonlinear Shrinkage, with POET being the overall lowest. However, during the 2008 crisis, especially in the high-dimensional setup, POET had a higher turnover than Nodewise.
Table 4: Empirical Results – Out-of-Sample Period from Jan. 2000 to Dec. 2019
The table shows the Sharpe Ratio (SR), average returns (Avg), standard deviation (SD) and p-value of the Jobson and Korkie (1981) test with Memmel (2003) correction. We also applied the Ledoit and Wolf (2008) test with circular bootstrap, and the results were very similar; therefore we only report those of the first test in this table. The statistics were calculated from 240 rolling windows covering the period from Jan. 2000 to Dec. 2019, and the size of the estimation window was 120 observations.
Table 5: Turnover and Leverage
Figure 1: 24 months rolling Sharpe Ratio and turnover - Low Dimension with transaction costs
We provide a hybrid method that combines a factor model with nodewise regression and that can control for risk and obtain the maximum expected return of a large portfolio. Our result is novel and holds even when p > n. We allow for an increasing number of factors, with a possibly unbounded largest eigenvalue of the covariance matrix of errors. Sparsity is assumed on the precision matrix of errors rather than the covariance matrix of errors. We also show that the maximum out-of-sample Sharpe Ratio can be estimated consistently. Furthermore, we develop a formula for the maximum Sharpe Ratio when the portfolio weights sum to one, and we provide a consistent estimate for this constrained case. We then extend our results to the consistent estimation of the Sharpe Ratios of two widely used portfolios in the literature. An important next step is to extend our results to portfolios with further restrictions.
Abadir, K. and J. Magnus (2005). Matrix Algebra. Cambridge University Press.
Ao, M., Y. Li, and X. Zheng (2019). Approaching mean-variance efficiency for large portfolios. Review of Financial Studies 32, 2499–2540.
Barras, L., P. Gagliardini, and O. Scaillet (2021+). Skill, scale, and value creation in the mutual fund industry. Journal of Finance. forthcoming.
Figure 2: 24 months rolling Sharpe Ratio and turnover - High Dimension with transaction costs
Brito, D., M. Medeiros, and R. Ribeiro (2018). Forecasting large realized covariance matrices: The benefits
Brodie, J., I. Daubechies, C. D. Mol, D. Giannone, and I. Loris (2009). Sparse and stable Markowitz
Callot, L., M. Caner, O. Onder, and E. Ulasan (2021). A nodewise regression approach to estimating large
Caner, M. and A. Kock (2018). Asymptotically honest confidence regions for high dimensional parameters
Chamberlain, G. and M. Rothschild (1983). Arbitrage, factor structure, and mean-variance analysis on large
Chang, J., Y. Qiu, Q. Yao, and T. Zou (2018). Confidence regions for entries of a large precision matrix.
DeMiguel, V., L. Garlappi, F. Nogales, and R. Uppal (2009). A generalized approach to portfolio optimiza-
DeMiguel, V., L. Garlappi, and R. Uppal (2007). Optimal versus naive diversification: How inefficient is the
Ding, Y., Y. Li, and X. Zheng (2021). High-dimensional minimum variance portfolio estimation under
Fan, J., Y. Fan, and J. Lv (2008). High-dimensional covariance matrix estimation using a factor model.
Fan, J., A. Furger, and D. Xiu (2016). Incorporating global industrial classification standard into portfolio
Fan, J., Y. Li, and K. Yu (2012). Vast volatility matrix estimation using high frequency data for portfolio
Fan, J., Y. Liao, and M. Mincheva (2011). High-dimensional covariance matrix estimation in approximate
Fan, J., Y. Liao, and M. Mincheva (2013). Large covariance estimation by thresholding principal orthogonal
Fan, J., Y. Liao, and M. Mincheva (2016). POET: Principal Orthogonal Complement Thresholding (POET)
Fan, J., Y. Liao, and X. Shi (2015). Risks of large portfolios. Journal of Econometrics 186, 367–387.
Fan, J., H. Liu, and W. Wang (2018). Large covariance estimation through elliptical factor models. Annals
Fan, J., R. Masini, and M. Medeiros (2021). Bridging factor and sparse models. arxiv:2102.11341, arXiv.
Friedman, J., T. Hastie, and R. Tibshirani (2010). Regularization paths for generalized linear models via
Gagliardini, P., E. Ossola, and O. Scaillet (2016). Time-varying risk premium in large cross-sectional equity
Gagliardini, P., E. Ossola, and O. Scaillet (2019). A diagnostic criterion for approximate factor structure.
Gagliardini, P., E. Ossola, and O. Scaillet (2020). Estimation of large dimensional conditional factor models
Garlappi, L., R. Uppal, and T. Wang (2007). Portfolio selection with parameter and model uncertainty: A
Hastie, T. and B. Efron (2013). lars: Least Angle Regression, Lasso and Forward Stagewise. R package
Horn, R. and C. Johnson (2013). Matrix Analysis. Cambridge University Press.
Jagannathan, R. and T. Ma (2003). Risk reduction in large portfolios: Why imposing the wrong constraints
Jobson, J. D. and B. M. Korkie (1981). Performance hypothesis testing with the Sharpe and Treynor measures.
Kan, R. and G. Zhou (2007). Optimal portfolio choice with parameter uncertainty. Journal of Financial
Lai, T., H. Xing, and Z. Chen (2011). Mean-variance portfolio optimization when means and covariances
Ledoit, O. and M. Wolf (2003). Improved estimation of the covariance matrix of stock returns with an
Ledoit, O. and M. Wolf (2004). A well conditioned estimator for large dimensional covariance matrices.
Ledoit, O. and M. Wolf (2017). Nonlinear shrinkage of the covariance matrix for portfolio selection:
Ledoit, O. and M. Wolf (2008). Robust performance hypothesis testing with the Sharpe ratio. Journal of
Maller, R., S. Roberts, and R. Tourky (2016). The large sample distribution of the maximum Sharpe ratio
Maller, R. and D. Turkington (2002). New light on portfolio allocation problem. Mathematical Methods of
Markowitz, H. (1952). Portfolio selection. Journal of Finance 7, 77–91.
Meinshausen, N. and P. Bühlmann (2006). High-dimensional graphs and variable selection with the lasso.
Memmel, C. (2003). Performance hypothesis testing with the Sharpe ratio. Finance Letters 1(1).
Merlevede, F., M. Peligrad, and E. Rio (2011). A Bernstein type inequality and moderate deviations for
Ramprasad, P. (2016). nlshrink: Non-Linear Shrinkage Estimation of Population Eigenvalues and Covari-
Senneret, M., Y. Malevergne, P. Abry, G. Perrin, and L. Jaffrès (2016). Covariance versus precision matrix
Shanken, J. (1992). The current state of the arbitrage pricing theory. Journal of Finance 47, 1569–1574.
Tu, J. and G. Zhou (2011). Markowitz meets Talmud: A combination of sophisticated and naive diversification
van de Geer, S. (2016). Estimation and testing under sparsity. Springer-Verlag.
Zhang, Y., R. Li, and C.-L. Tsai (2010). Regularization parameter selections via generalized information
Sharpe Ratio Analysis in High Dimensions: Residual-Based Nodewise Regression in Factor Models
Supplement A is divided into four parts. Part 1 contains preliminary proofs, norm inequalities, definitions, and a maximal inequality that is a minor extension of results in the existing literature. Part 2 contains the proofs of the lemmata that lead to the proof of Theorem 1. Parts 1 and 2 relate only to the proof of Theorem 1, and Part 3 relates only to the proof of Theorem 2. Part 4 covers all the remaining proofs of the theorems in this paper.
Part 1
We start with a lemma that provides norm inequalities. Let matrices and
1 vector.
Proof of Lemma A.1. (i). Set , and let
be the 1
row vector of
where we use Hölder’s inequality for the first inequality and the relation between norms for the second inequality; to get the last inequality, we repeat the first two steps.
where is the maximum absolute column sum norm of
matrix (i.e.
induced matrix norm). Let
’ be 1
row vector of
, and
is the jth column of
matrix.
where we use Hölder’s inequality for the first inequality, and the norm relation for the other inequalities.
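As a numerical sanity check on the type of argument used in the proof above, the following minimal sketch verifies two generic facts on random draws: Hölder's inequality $|x'y| \le \|x\|_1 \|y\|_\infty$, and the bound $\|Ax\|_1 \le \|A\|_{l_1} \|x\|_1$, where $\|A\|_{l_1}$ is the maximum absolute column sum. The function name `check_norm_inequalities`, the dimension and the seed are illustrative assumptions; these are standard inequalities and not the exact statement of Lemma A.1.

```python
import numpy as np

def check_norm_inequalities(p=50, seed=0):
    """Check Hoelder's inequality and an induced l1 norm bound on random inputs."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((p, p))
    x = rng.standard_normal(p)
    y = rng.standard_normal(p)

    # Hoelder: |x'y| <= ||x||_1 * ||y||_inf
    holder_ok = abs(x @ y) <= np.abs(x).sum() * np.abs(y).max()

    # Induced l1 norm = maximum absolute column sum of A
    col_sum_norm = np.abs(A).sum(axis=0).max()
    induced_ok = np.abs(A @ x).sum() <= col_sum_norm * np.abs(x).sum()
    return holder_ok, induced_ok

holder_ok, induced_ok = check_norm_inequalities()
```

Both flags should be true for any input, since the inequalities hold deterministically.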
Next we provide a lemma that is directly from Lemma A.2 of Fan et al. (2011).
Lemma A.2. (Fan et al. (2011)). Suppose that two random variables $Z_1$, $Z_2$ satisfy the following exponential-type tail condition: there exist $r_1, r_2 \in (0, 1)$ and constants $b_1, b_2 > 0$ such that for all s > 0
We provide now the following maximal inequality due to Theorem 1 of Merlevede et al. (2011), and used in the proof of Lemma A.3(i) and proof of Lemma B.1(ii) in Fan et al. (2011). To that effect, we provide a general assumption on data, and then show the theorem and its proof.
Assumption L1. (i). are vectors of dimension
and
, respectively, for t = 1
. They are both stationary and ergodic. Also
are strong mixing with strong mixing coefficients are satisfying
with t, a positive integer, and 0 a positive constant. (ii). We also let
satisfy the exponential tail condition for
= 1
= 1
Proof of Theorem A.1. The result follows by applying Theorem 1 of Merlevede et al. (2011) under Assumption L1, together with Lemma A.2 above and the Bonferroni union bound.
Part 2
We start with an important maximal inequality applied to factor models in the nodewise regression setting. Some of the results are already in Lemma A.3 and Lemma B.1 of Fan et al. (2011). We restate them so that readers can see all results without referring to other literature. We also provide two new results, Lemma A.3(ii) and (v), which arise from the interaction of nodewise regression with factor models.
Lemma A.3. Under Assumptions 1-3, for 0, with m = 1, 2, 3, 4, 5 with
that is used in Theorem A.1. (i).
(ii). Denote as the (
1)
matrix in (4), and let the l th row and t th column element
and
as
1 vector, and the t th element as
Proof of Lemma A.3. (i). This is Lemma A.3(i) of Fan et al. (2011).
(ii). The proof follows from Theorem A.1, with Assumption 3 providing the tail probability, through the same algebra as on p. 3346 of Fan et al. (2011).
(v). The proof will involve several steps and this is due to interaction of factor models () and nodewise error (
). Start with the definition of
where :=
where we use (A.7) for the first equality and Hölder’s inequality for the first inequality. Consider
where we use definition. Noting that
is
1 submatrix of
consisting all rows and columns of
except the jth one. See that
Then,
for the second inequality in (A.10), given our Assumption max. Hence,
by (A.10). Clearly, by (A.9)-(A.11)
Then since :=
and by (A.61)
Next, use Lemma A.3(iii) and (A.12) in (A.8) to show
This also implies that, since X := () :
matrix, and
:= (
:
1
We now define two events and condition the next lemma, which bounds the nodewise regression estimates, on these two events. We then relax this restriction and show an unconditional result for the $l_1$ norm of the nodewise regression estimates, after establishing that these two events hold with probability approaching one. Define
and define the population adaptive restricted eigenvalue condition, as in Caner and Kock (2018), for j =
1, and let
represent the vector with
indices in
, and all the other elements than
indices in
set to zero
and the empirical version of the adaptive restricted eigenvalue condition is as follows, with matrix
and the event is for each j = 1
We have the following bound result.
Use (10) to have =
) and this last equation can be substituted into first left side term and first right side term in (A.17) to have
Simplify the first term on the left and the first term on the right side of (A.18),
Since we use and then H¨older’s inequality
Use , on the second term on the left side of (A.20) (
represents the indices of nonzero cells in row j of the precision matrix, and
represents the indices of zero cells in row j of the precision matrix).
Use the norm inequality
Now ignoring the first term above and dividing the rest by 0, provides the restricted set condition
(cone condition) in adaptive restricted eigenvalue condition
Set in the empirical adaptive restricted set condition in (A.16), then use the empirical adaptive restricted eigenvalue condition in (A.24)
Then use 32 + 9
2 with
,
.
Use in the first term on the right side and simplify
This implies
Now to get bound, ignore the first term in (A.24) and add both sides
Use the norm inequality for the first term on the right side of (A.27)
and can use the empirical adaptive restricted eigenvalue condition in (A.16)
Next, use (A.26) and to have
Last inequality above is true by noticing by ¯s definition, and then by definition of population adaptive restricted eigenvalue condition
).
We now evaluate the two events in the next two lemmata.
Proof of Lemma A.5. Start with definition in (A.14). Use (9)-(11) and
:=
is idempotent such that
Note that U is a matrix and
is the
submatrix, which is U without the jth row. As a
consequence,
with probability at least 1 ) by Lemma A.3(ii). Next, for the second right side term in (A.29) we
have that
by Lemma A.1(i). We evaluate each term in (A.31). Note that X = () :
Then, by Lemma A.3(iii),
with probability at least 1 ). Next, since
1, and
is the tth element
Then, by Lemma A.3(v),
Combine (A.30)-(A.35) in (A.29) in order to form
with probability at least 1 ). Now use (A.14) to get
.
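The proof above relies on the residual-maker matrix being idempotent. A minimal numerical sketch of that step follows, assuming (as in Fan et al. (2011)) that the factors are stacked in a $K \times n$ matrix X, the returns in a $p \times n$ matrix Y, and that the matrix in question is $M_X = I_n - X'(XX')^{-1}X$. The data, dimensions and the helper name `annihilator` are hypothetical and only serve the illustration.

```python
import numpy as np

def annihilator(X):
    """Return M_X = I_n - X'(XX')^{-1}X for a K x n factor matrix X."""
    n = X.shape[1]
    return np.eye(n) - X.T @ np.linalg.solve(X @ X.T, X)

rng = np.random.default_rng(1)
K, p, n = 3, 8, 60
X = rng.standard_normal((K, n))        # factors, K x n
B = rng.standard_normal((p, K))        # factor loadings, p x K
U = 0.5 * rng.standard_normal((p, n))  # idiosyncratic errors, p x n
Y = B @ X + U                          # excess returns, p x n

M = annihilator(X)
U_hat = Y @ M                          # OLS residuals of each asset on the factors
```

Because $X M_X = 0$, the residuals satisfy $\hat{U} = U M_X$, which is why the true errors appear only through $M_X$ in the bounds.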
Proof of Lemma A.6. For each j = 1, add and subtract
Note that the second right-side term in absolute value in (A.37) can be bounded by using Hölder’s inequality
By the same analysis applied to the first right side term with absolute value in (A.37) and simplifying
with probability at least 1 ). Next, in (A.38) see that
is a submatrix of
, and
is a submatrix of U as described above and
with probability at least 1 ) by Lemma A.3(i). We need to provide some simplification for
term in (A.38). Next, since the cone condition in adaptive restricted eigenvalue condition is satisfied in
Then, add to the left side and right side and use the norm inequality that puts an upper bound on the
norm in terms of the
norm. Hence,
So, we have that
Next, using the empirical and population adaptive restricted eigenvalue definitions and minimizing over we have that
Note that, if we have with probability approaching one (wpa1 from now on)
Thus, we need to show that the following probability goes to zero
Set := 16
ln(
ln(
. Clearly, by (A.40) and (A.41) we have that
Since 0 by Assumption 5, by (A.46)(A.47)
0.
One crucial point is that we need to get a lower bound for . In that respect, from (A.45)
Clearly by the definitions of and
and population adaptive restricted eigenvalue condition, we have that
Next, by (A.33) and (A.41), via Lemma A.3(iii), we have that
We now provide the main consistency result for residual-based nodewise regression.
Proof of Lemma A.7. Use Lemmata A.5-A.6 and (A.49) to have
Then, combine above with Lemma A.4 to have the desired result via Assumption 5 and Lemma A.5 to have (1).
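To make the object of this consistency result concrete, here is a minimal sketch of residual-based nodewise regression under stated assumptions: OLS residuals are obtained from the factor regression, each residual series is lasso-regressed on the others, and row j of the precision matrix estimate is $\hat{C}_j / \hat{\tau}_j^2$ with $\hat{\tau}_j^2 = \hat{u}_j'(\hat{u}_j - \hat{U}_{-j}'\hat{\gamma}_j)/n$, following the standard nodewise construction. The coordinate-descent solver, the penalty normalization, the value of `lam` and all function names are assumptions of this sketch, not the authors' implementation.

```python
import numpy as np

def lasso_cd(Z, y, lam, n_iter=500, tol=1e-10):
    """Coordinate descent for min_b ||y - Z b||^2 / n + 2 * lam * ||b||_1."""
    n, q = Z.shape
    b = np.zeros(q)
    col_ss = (Z ** 2).sum(axis=0) / n
    r = y.copy()                                   # current residual y - Z b
    for _ in range(n_iter):
        max_change = 0.0
        for k in range(q):
            if col_ss[k] == 0.0:
                continue
            rho = Z[:, k] @ r / n + col_ss[k] * b[k]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[k]
            if new != b[k]:
                r -= Z[:, k] * (new - b[k])
                max_change = max(max_change, abs(new - b[k]))
                b[k] = new
        if max_change < tol:
            break
    return b

def nodewise_precision(U_hat, lam):
    """Nodewise estimate of the error precision matrix from p x n residuals."""
    p, n = U_hat.shape
    Theta = np.zeros((p, p))
    for j in range(p):
        others = [k for k in range(p) if k != j]
        y = U_hat[j]
        Z = U_hat[others].T                        # n x (p - 1)
        g = lasso_cd(Z, y, lam)
        tau2 = y @ (y - Z @ g) / n                 # tau_j^2
        Theta[j, j] = 1.0 / tau2
        Theta[j, others] = -g / tau2
    return Theta

# Hypothetical factor-model data with identity error covariance
rng = np.random.default_rng(0)
K, p, n = 2, 6, 400
X = rng.standard_normal((K, n))
B = rng.standard_normal((p, K))
Y = B @ X + rng.standard_normal((p, n))
M_X = np.eye(n) - X.T @ np.linalg.solve(X @ X.T, X)
U_hat = Y @ M_X                                    # feasible residuals
Theta_hat = nodewise_precision(U_hat, lam=0.1)
```

In this toy design the true error precision matrix is the identity, so the diagonal of `Theta_hat` should be close to one and the off-diagonal entries close to zero.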
Next, we provide proof of consistency for the estimates of the reciprocal of the main diagonal elements of the precision matrix.
and :=
, with
:=
, and
:= (
1 vector
=
. Using (10) for
in
definition we have
By the triangle inequality we get
Consider each term in (A.50) carefully. Start with definition; :=
, and
being idempotent.
First, exactly as in Lemma A.3(i) with Assumption 2(ii)(iv), 31 we have by Theorem A.1 that
Then note that 1 vector, and
:
matrix. Therefore,
where we use Hölder’s inequality for the first inequality, and (A.1) and (A.2) for the second inequality, and the norm inequality between and
norms for the third inequality (i.e.
dim(
dim(x) :
dimension of the vector x). Next by (A.33), (A.34), and (A.35), we have by (A.53) that
Combine (A.52) and (A.54) in (A.51) to have the first term on the right side of (A.50) by Assumption 5 to
get the last equality in (A.55)
In (A.50) consider the second term on the right side by (A.56), Lemma A.7
Consider the third term on the right side of (A.50), where we use Hölder’s inequality to get
for the rates we use (A.11), (A.56). Last we consider the fourth term on the right side of (A.50). To get a better rate, we start with the Karush-Kuhn-Tucker (KKT) conditions in (12). The following 1 equations
form the KKT
where the sub-differential, which replaces the gradient for non-differentiable penalties, is explained in more detail on p. 160 of Caner and Kock (2018). Also for all j = 1
Use (10) for
and rewrite KKT as
Then the fourth term on the right side of (A.50)
where we use Hölder’s inequality, (A.11) and (A.59). Clearly, (A.58) and (A.60) are the slowest among the four terms on the right side of (A.50), and we use Assumption 5 to get the desired result.
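The KKT conditions used above can be illustrated in a case where the lasso has a closed form. The sketch below assumes an orthonormal design ($Z'Z/n = I$) and the objective $\|y - Zb\|_2^2/n + 2\lambda\|b\|_1$, for which the solution is coordinate-wise soft-thresholding. It then recovers the implied sub-gradient $\hat{\kappa} = Z'(y - Z\hat{b})/(n\lambda)$ and checks that it equals the sign of each nonzero coefficient and lies in $[-1, 1]$ otherwise. The design, the coefficients and the normalization are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, q, lam = 200, 5, 0.3

# Orthonormal design: Z'Z / n = I_q (an assumption that yields a closed form)
Q, _ = np.linalg.qr(rng.standard_normal((n, q)))
Z = np.sqrt(n) * Q
beta = np.array([1.0, -0.8, 0.0, 0.0, 0.0])
y = Z @ beta + 0.5 * rng.standard_normal(n)

# Closed-form lasso solution: soft-threshold the OLS coefficients
c = Z.T @ y / n
b_hat = np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)

# Implied sub-gradient from the KKT conditions
kappa = Z.T @ (y - Z @ b_hat) / (n * lam)
active = b_hat != 0
```

For active coordinates the sub-gradient is exactly the sign of the coefficient, which is the property exploited when the KKT conditions are substituted into the fourth term.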
Proof of Theorem 1. First, we derive some of the key results. By definition of , for j = 1
, and since
:=
, with Assumption 1
is bounded away from zero wpa1 by Lemma A.8. Then
by Lemma A.8, (A.61), and (A.62). Now we complete the proof by using the formula for .
where we use (A.63), Lemma A.7, (A.11) for the rates, and the last equality is by Assumption 5.
Part 3
After the proof of Theorem 1 we provide lemmata that lead to proof of Theorem 2. We start with a lemma that is related to norm inequalities. First define generic matrices, :
, also define a row vector
: 1
, and also define
matrices
.
(i).
where we use submultiplicativity of matrix norms for the first inequality, and submultiplicativity of matrix norms and the following for the second inequality,
where and
are the kth row of
, and k, j element of
respectively. Then, for the last inequality, we use a matrix norm inequality that provides an upper bound for
matrix norm in terms of spectral norm in p.365 of Horn and Johnson (2013).
(ii).
where we use section 4.3 of van de Geer (2016) for the first inequality, and the second inequality can be seen by defining as the jth row of
, and
as the kth column of
and using Hölder’s inequality
(iii).
where we use p.345 of Horn and Johnson (2013) for the first inequality, and matrix norm submultiplicativity for the second inequality, and the last equality is by seeing that transpose of
matrix norm is
matrix norm.
(iv).
where we use p.44 van de Geer (2016) dual norm inequality for the first inequality, then for the second inequality we use submultiplicativity property of matrix norms, and for the last inequality we use := max
, where
is the j, k th cell in
.
where we use norm definition for the first equality, and for the first inequality we use p.345 of Horn and Johnson (2013), which is
for a generic matrix A, and generic vector x, for the third inequality we use the upper bound of
induced matrix norm in terms of spectral norm, as in p.365 of Horn and Johnson
by Assumption 6 that for a positive constant C and uniformly over j = 1
= 1
.
Next, using the results above with Assumption 7, we have
(iii). This is proved in (A.64).
(iv). The proof of (iv) is the same as in (ii) above except, with as the kth column of matrix B.
by Assumption 6.
Before the next lemma, we extend the following two results, which form Lemma B.4 in Fan et al. (2011), to the case of an increasing maximal eigenvalue of the errors.
Lemma A.11. Under Assumptions 4,6, and 7(i), with c > 0, C > 0, and positive finite constants (i). Eigmin(cp
.
Proof of Lemma A.11. We follow the proof of Lemma B.4 in Fan et al. (2011). (i). Since :=
,
(ii). Using Assumption 4
We have the desired result by (A.65), and since for an invertible matrix A, Eigmax($A^{-1}$) = 1/Eigmin(A).
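The eigenvalue identity used in the last step is easy to confirm numerically for a symmetric positive definite matrix. The following short sketch uses a randomly generated matrix, which is a hypothetical example and not one of the matrices in the lemma.

```python
import numpy as np

rng = np.random.default_rng(3)
G = rng.standard_normal((6, 6))
A = G @ G.T + 6.0 * np.eye(6)               # symmetric positive definite

eig_A = np.linalg.eigvalsh(A)               # eigenvalues of A
eig_A_inv = np.linalg.eigvalsh(np.linalg.inv(A))

largest_of_inverse = eig_A_inv.max()        # Eigmax(A^{-1})
reciprocal_of_smallest = 1.0 / eig_A.min()  # 1 / Eigmin(A)
```

The two quantities agree up to floating-point error, so a lower bound on Eigmin(A) translates directly into an upper bound on Eigmax($A^{-1}$).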
As described above in the main text, we form the symmetrized version of our feasible nodewise regression estimator for this part of the paper: :=
.
:=
[
cov(
+
Proof of Lemma A.12. (i). We start with simple adding and subtracting( = (
) + B),
= (
) +
) and the triangle inequality. Hence,
Analyze each term in (A.66), and by Lemma A.9(ii)(iv)
where we use (B.14) of Fan et al. (2011) which is:
since norm of transpose of
involves rows of
(hence columns of
).
For the second term in (A.66)
where we use, , Lemma A.9(ii)(iv) for the first-second inequalities, (B.14) of Fan et al. (2011), Assumption 6, and Theorem 1 and (A.68) for the rates. Now consider the third term in (A.66)
where we use Lemma A.9(ii) for the first inequality, (B.14) of Fan et al. (2011), and := max
= max
as in (A.12). We consider the fourth term in (A.66)
where we use symmetry of , Lemma A.9(ii)(iv) for the first and second inequality, and Assumption 6, and Theorem 1 (A.68) for the rates. Also analyze the fifth term in (A.66)
where we use Lemma A.9(ii) for the inequality, and the rates are by (B.14) of Fan et al. (2011), Assumption 6, and := max
= max
as in (A.12). The slowest rate is the
maximum of the rates (A.71) and (A.72) above. So,
Then, by norm inequality tying spectral norm to norm in p.365 of Horn and Johnson (2013), and since
is
matrix
(ii). Since cov(
(cov(
does not involve the precision matrix estimator, we proceed as in Fan et al. (2011), Lemma B.5(ii). Specifically, (B.20) of Fan et al. (2011) provides
Using (A.74) and the equation above we develop a larger bound
([
cov(
+
([cov(
Note that
Then using Lemma A.1(i) of Fan et al. (2011), with (A.65) and (A.76)
wpa1 with as in Assumption 7. By (A.77), and seeing that for an invertible matrix A, Eigmax($A^{-1}$) = 1/Eigmin(A),
We restate the definitions of major terms that are used.
and
We have the next lemma which will be instrumental in proving Theorem 2.
Proof of Lemma A.13. Start with, by adding and subtracting and triangle inequality
Consider the first term in (A.82)
where we use Lemma A.9(i) for the first inequality, Lemma A.10-A.12, and (B.14) of Fan et al. (2011):
where we use Lemma A.9(i) for the first inequality, and for the rates use Lemma A.10-A.12, and Assumption 6 which shows that factor loadings are uniformly bounded away from infinity. Analyze the third term in (A.82).
where we use Lemma A.9(i) for the first inequality, Lemma A.10-A.12, and (B.14) of Fan et al. (2011):
where we use Lemma A.9(i). We have from (A.78)(A.79) and by submultiplicativity of matrix norm
(spectral norm)
where we use Lemma A.12, and ) by Lemma A.11, (A.75). Substitute (A.87) into (A.86) via Lemma A.10
Since the last rate is the slowest among all on the right side of (A.82) we have the desired result.
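The terms handled in Part 3 arise when the precision matrix of returns is built from the precision matrix of the errors. A minimal sketch follows, assuming the Sherman-Morrison-Woodbury form $\Gamma = \Omega - \Omega B[\mathrm{cov}(f)^{-1} + B'\Omega B]^{-1} B'\Omega$ for the inverse of $B\,\mathrm{cov}(f)\,B' + \Sigma_u$, and assuming the symmetrized error precision matrix is the simple average of the estimate and its transpose. Both are assumptions of this sketch, and the function name and data are hypothetical.

```python
import numpy as np

def precision_of_returns(Omega, B, cov_f):
    """Woodbury inverse of B cov_f B' + Sigma_u, given Omega close to Sigma_u^{-1}."""
    Omega_sym = 0.5 * (Omega + Omega.T)                  # symmetrize the estimate
    inner = np.linalg.inv(cov_f) + B.T @ Omega_sym @ B   # K x K matrix
    correction = Omega_sym @ B @ np.linalg.solve(inner, B.T @ Omega_sym)
    return Omega_sym - correction

rng = np.random.default_rng(4)
p, K = 10, 3
B = rng.standard_normal((p, K))                          # factor loadings
F_half = rng.standard_normal((K, K))
cov_f = F_half @ F_half.T + np.eye(K)                    # factor covariance
Sigma_u = np.diag(rng.uniform(0.5, 1.5, size=p))         # error covariance
Omega = np.linalg.inv(Sigma_u)                           # exact error precision

Gamma = precision_of_returns(Omega, B, cov_f)
Sigma_y = B @ cov_f @ B.T + Sigma_u
```

With the exact error precision matrix as input, `Gamma` coincides with the direct inverse of `Sigma_y`; only a $K \times K$ system is solved, which is what keeps the construction feasible when p > n.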
Proof of Theorem 2. From (21), and using triangle inequality
We consider second right side term in (A.89). Add and subtract via triangle inequality
We analyze the first term on the right side of (A.90) and try to simplify by adding and subtracting (
, and triangle inequality
Then on the first right side term in (A.91) add and subtract () via triangle inequality
Now for the second right side term in (A.91) add and subtract (via triangle inequality
Substitute the last two inequalities into (A.91)
Now in (A.90) we consider the second term on the right side, add and subtract via triangle inequality
Combine (A.92)(A.94) into (A.90) right side to have
To consider all the terms in (A.95) we need to find some rates about terms. In that respect,
where we use definition of L for the first equality in (A.80), G is defined in (A.79), and we use submultiplicativity of norm for the first inequality, and the relation between spectral norm and
norm from p.365 of Horn and Johnson (2013) for the second inequality, and the rates are from (A.64), Lemma A.10, Lemma
A.11 and the definition of G. Next we need the following, obtained by the same analysis as in (B.55) of Caner and Kock (2018) via strict stationarity of the data, or (A.12) here
We consider each term on the right side of (A.95).
[ max
(A.98) = [
(¯
(A.99)
where we use Lemma A.9(iii), and
for the inequality in (A.98) and use Lemma A.13, and Theorem 1 for the rates.
We consider the second term on the right side of (A.95).
where we use Lemma A.9(iii), and (A.100) for the inequality in (A.101) and use (A.96), and Theorem 1 for
the rates. We analyze the third term on the right side of (A.95)
where we use Lemma A.9(iii) for the first inequality, and the rates are by (A.97), Lemma A.13, Theorem 1.
Now consider the fourth term on the right side of (A.95)
where we use Lemma A.9(iii) for the inequality, and Theorem 1, (A.96)(A.97) for the rate. Now consider the fifth term on the right side of (A.95).
where we use Lemma A.9(iii) for the first inequality, and Theorem 1, Lemma A.13, (A.97) for the rates. Consider the sixth term on the right side of (A.95)
where we use Lemma A.9(iii) for the inequality, and use (A.97), and Lemma A.13 for the rates. Now analyze
the seventh term on the right side of (A.95)
where we use Lemma A.9(iii) for the inequality, and for the rates we use (A.96)(A.97) Theorem 1. Note that among all (A.99)-(A.106), the slowest rate is by (A.105) by the definition of in (A.81) and by Assumption
This completes the proof of (i), using Theorem 1 and (A.107) in (A.89).
as in Fan et al. (2011) with being
1 vector of asset returns, and errors respectively at time t = 1
.
by Assumption 1. Consider
Clearly, by the proof of Lemma A.1(i) here we have for a generic vector x, and a matrix A. Then, by Lemma A.10(iii) and Theorem A.1 we get the rate.
Part 4
First, we start with a maximal eigenvalue bound which will be used in the proof of Theorem 8. Here, we
provide the rate for maximal eigenvalue of covariance matrix of returns . See that
Eigmax(Eigmax[Bcov(
] + Eigmax(
Eigmax(cov(f))Eigmax(
) + Eigmax(
Since by Assumption 7, 0, and Eigmax(
, with the above inequality and specifically by
This is true for cov(f) = in Fan et al. (2013). The result holds for general cov(f) as discussed in section
Proof of Theorem 3. First, we start with definitions of :=
:= 1
:=
, F :=
.
Now consider the numerator in (A.109):
Analyze the first term on the right side of (A.110):
Then, by Lemma B.3 in Supplement B, via Assumption 8
Then,
where we use (A.112) and Lemma B.5 in Supplement B. By (A.112)(A.113) and Lemma B.5 in (A.111), we have =
(A.114)
Then, by Lemma B.2 in Supplement B and (A.114),
Then, the second term on the right side of (A.110) is
by (A.112)(A.113) and Lemma B.2, Lemma B.5 in Supplement B, and the last equality is by Assumption 8. Use (A.115)(A.116) in (A.110) with Assumption 8
Now consider the denominator in (A.109). Note that
So by Assumption 8(ii)
Next
by (A.115) and Assumption 8. Combine (A.117) with (A.118)(A.119) in (A.109) to obtain the desired result.
Proof of Theorem 4. To ease the notation in the proofs, set =
+ D = v. The estimates will be
=
=
. Then,
First, analyze the denominator of (A.120).
Then, by Lemma B.2-B.4 in Supplement B, triangle inequality and being bounded away from zero and
finite, by Assumption 8,
We also know that by the conditions in theorem statement 0, and
0. Then, see that by Lemma B.5 in Supplement B
Thus, by (A.122)(A.123) and 0 with Assumption 8:
0 in (A.121), we have
Consider the numerator in (A.120):
By Lemma B.6 in Supplement B, and Assumption 8
Clearly, by Lemma B.5 in Supplement B and triangle inequality with being finite,
Then, use (A.122)(A.123)(A.126)(A.127) in (A.125) by Assumption 8
Use (A.124)(A.128) in (A.120) to obtain the desired result.
Lemma B.4 in Supplement B shows that
Proof of Theorem 6. Note that by the definition of in (C.2) and A, F, D terms,
by Lemma B.2 in Supplement B. Then by Assumption 8
Thus, clearly we obtain, since ,
which implies for the denominator
where the rate is the slowest among the three right-hand-side terms.
Proof of Theorem 7. Note that we define . We need to start with
Define the event , where
0. We condition the proof on event
, then at the end of the proof we show that
1. Start with the condition
0;
where we use in the second inequality and the condition for the third inequality. This clearly shows that at event
, when the condition
0 holds, we have
0. So
Then, in (A.140), using the condition 0 (note that this also implies
0)
which implies that, with , adding
to all sides above yields
as in the maximum Sharpe Ratios in Theorem 6. Clearly under event with
0, (A.137) is rewritten as
where we use Theorem 5. Under event , with 1
0, (A.137) is rewritten as
(
where we use Theorem 6.
Note that we can rewrite the event :=
, with
). Note that event
occurs with probability approaching one by Lemma B.3 in Supplement B, so we have proven the desired result.
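The case distinction in the proof above can be sketched numerically. The code below assumes the definitions $A = 1_p'\Gamma 1_p/p$, $F = 1_p'\Gamma\mu/p$ and $D = \mu'\Gamma\mu/p$, and assumes the squared maximum Sharpe Ratio under weights summing to one takes the form in Maller and Turkington (2002): $\mu'\Gamma\mu$ when $1_p'\Gamma\mu \ge 0$, and $\mu'\Gamma\mu - (1_p'\Gamma\mu)^2/(1_p'\Gamma 1_p)$ otherwise. The scaling by p, the function name and the inputs are illustrative assumptions and may differ from the exact definitions in the main text.

```python
import numpy as np

def max_sharpe_ratio_sq(mu, Gamma):
    """Squared maximum Sharpe Ratio with weights summing to one (assumed form)."""
    p = mu.shape[0]
    ones = np.ones(p)
    A = ones @ Gamma @ ones / p
    F = ones @ Gamma @ mu / p
    D = mu @ Gamma @ mu / p
    if F >= 0:
        return p * D                     # equals mu' Gamma mu
    return p * (D - F ** 2 / A)          # constrained case, 1' Gamma mu < 0

Gamma = np.eye(3)                        # hypothetical precision matrix
msr_pos = max_sharpe_ratio_sq(np.array([1.0, 2.0, 3.0]), Gamma)
msr_neg = max_sharpe_ratio_sq(np.array([-1.0, -2.0, -3.0]), Gamma)
```

In practice the estimates of the mean vector and the precision matrix would be plugged in, and the sign of the estimated F decides which branch applies, mirroring the event used in the proof.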
Proof of Theorem 8. (A.2) of Ao et al. (2019) shows that the squared ratio of the estimated maximum out-of-sample Sharpe Ratio to the theoretical ratio can be written as
The proof will consider the numerator and the denominator of the squared maximum out-of-sample
Sharpe Ratio. We start with the numerator using the definition, :=
Consider the fraction on the right-hand side. Start with the numerator in (A.145).
where we use (B.18), (B.19), and (B.20) for the rates and the dominant rate in the last equality is by
Assumption 8 and definition (22). By Assumption 8(ii)
Then, by (A.146)(A.147) in (A.145)
We now attempt to show that the denominator in (A.144)
In that respect, bearing in mind that is symmetric
Using (A.151)
First, we consider (A.152).
where we use Hölder’s inequality for the third inequality and Theorem 2 and (A.108), (B.8) for the rate. Now, consider (A.153), and by definition :=
.
by (B.16)(B.19) for the second equality, and the dominant rate in third equality can be seen from Assumption
8. Next, consider (A.154), and recall that :=
where we use (B.19)(B.20) for the second equality, and the dominant rate in the third equality can be seen
from Assumption 8. Consider now (A.155) by the symmetry of
by (B.17). Next, analyze (A.156) by the symmetry of
by (B.18). Combine the rates and terms (A.157)-(A.161) in (A.152)-(A.156) to obtain
by the dominant rate in (A.159), as seen in Assumption 9: 0 in (A.157), and
definition in Assumption 7.
Combine (A.162)(A.163), in the second right side term in (A.150) via Assumption 8
Therefore, we show (A.149) via (A.150). Then, combine (A.148)(A.149) in (A.144) to obtain the desired result.
Here, we provide results that are used in the proofs of Section 4. We start with a matrix norm inequality. Let x be a generic $p \times 1$ vector and M a square matrix of dimension p, where $M_j'$ is the jth row of M, of dimension $1 \times p$, and $M_j$ is the transpose of this row vector.
where we use Hölder’s inequality to obtain each inequality.
Recall the definition of A := and
:=
, and ¯
is the rate of convergence in Theorem 2 in main text, and defined in Assumption 7 with the property ¯
0.
where Hölder’s inequality is used in the first inequality, Lemma B.1 is used for the second inequality, and the last equality is obtained by using Theorem 2 and imposing Assumption 7.
Before the next Lemma, we define :=
, and F :=
.
Proof of Lemma B.3. We can decompose by simple addition and subtraction into
Now, we analyze each of the terms above.
where we use Hölder’s inequality in the first inequality and Lemma B.1 in the second inequality above, and
where for the rates we use (A.96)(A.97) and since K is nondecreasing in n. Note that ] =
].
So with representing j, kth element of
matrix, and
] representing kth element of
1 vector
) we have that
where the rate is by Assumption 4, 6. Therefore, we consider (B.4) above.
where we use the same analysis that leads to (B.6), and the rate is from Theorem 2, (B.8). Now consider (B.5).
where we use Hölder’s inequality in the first inequality and Lemma B.1 in the second inequality above, and the rate is from Theorem 2, (B.7). Combine (B.6), (B.9), and (B.10) in (B.3)-(B.5), and note that the largest rate comes from (B.9) by the definition of the rate in Assumption 7.
Note that D := , and its estimator is
:=
.
Proof of Lemma B.4. By simple addition and subtraction,
Consider the first right side term above
where Hölder’s inequality is used for the first inequality above, Lemma B.1 for the second inequality, and Theorem 2 for the rates. We continue with (B.12).
where Hölder’s inequality is used for the first inequality above, Lemma B.1 for the second inequality, and Theorem 2 and (B.7) for the rates. Then, we consider (B.13)
where Hölder’s inequality is used for the first inequality above, Lemma B.1 for the second inequality, and Theorem 2 and (B.8) for the rates. Then, we consider (B.14).
where Hölder’s inequality is used for the first inequality above, Lemma B.1 for the second inequality, (B.8) for the third inequality, and Theorem 2 for the rates. Then, we consider (B.15):
where Hölder’s inequality is used for the first inequality above, Lemma B.1 for the second inequality, (B.8) for the third inequality, and Theorem 2 for the rate. Note that in (B.11)-(B.15) the rate in (B.20) is the slowest due to definition in (22) to obtain
The following lemma establishes orders for the terms in the optimal weight, A, B, D. Note that both A, D are positive by Assumption 2 and uniformly bounded away from zero.
Proof of Lemma B.5. Note that Eigmax(
). Then by p.221 of Abadir and Magnus (2005), (Exercise 8.27.b in Abadir and Magnus (2005)),
[(cov(
+
is positive semidefinite,
by (6.3) of Fan et al. (2008), = O(p) under Assumption 6, and by Assumption 4,
= O(K), since
1 vector of factors. By (B.22)(B.23)
For the term F, the proof can be obtained by using the Cauchy-Schwarz inequality first and the same analysis as for terms A and D.
Next, we need the following technical lemma, which provides the limit and the rate for the denominator in the optimal portfolio.
Proof of Lemma B.6. Note that by simple addition and subtraction,
Then, using this last expression and simplifying, A, D being both positive,
where we use (B.2), Lemma B.3, (B.21), Lemma B.5, and Assumption 8.
This part covers the proofs for Corollaries 1-3 in the main text.
Proof of Corollary 1. Rewrite the ratio of the Sharpe Ratio estimate to its target in the following way
Consider the numerator in (C.1).
Then by Hölder’s inequality and Lemma B.1
and the rates are by (B.8), Theorem 2. Since 0 by Assumption, using (C.2) we have
Analyze the denominator in (C.1),
Next, see that by adding and subtracting and via triangle inequality in (C.4) numerator
Consider the first term in right side of (C.5)
by Theorem 2, (A.108), and Assumption 9. Then in (C.5), take the second right side term, with :=
,
by Lemma B.2, Assumption 9. Use (C.6)(C.7) in (C.4)(C.5) by Assumption 8(ii)
and
The estimate of the portfolio return
The target portfolio return
The estimate of variance of the portfolio, with constant, is
Target variance is
Start with the estimate of square of the Sharpe Ratio:
Then the target Sharpe Ratio is:
Take the ratio of the estimate to the target Sharpe Ratio, and scaling variances by p
Start with the terms in numerator in (C.15) which will be upper bounded by
First term on the right side of (C.16), and is symmetric
Take the first term on the right side of (C.17)
Then by Lemma B.3-B.6
Next by (C.2), Lemma B.3, F :=
where the last equality is by Assumption 8, 0. Combine (C.19)(C.20) in (C.18)
Then consider the second term on right side of (C.17)
and (C.2). So use (C.21)(C.22) in (C.17)
Consider the second term in (C.16)
Take the first term on the right side of (C.25)
Analyze (C.26) in the same way as in (C.19) use, (C.10), Lemma B.2-B.3,
Then use D := , and Lemma B.5 D = O(K) with (A.145)(A.146), by Assumption 8
Next use the last two rates in (C.26)
Then take the second term on the right side in (C.25)
by Lemma B.5, and definition, and we use (A.146) for the other rate in (C.29). Combine (C.28)(C.29) for (C.25)
Use (C.24)(C.31) in (C.16), by Assumption 8
Next numerator in (C.15) can be written (without squaring)
We analyze the terms in the denominator of (C.15)
In (C.33) consider the first term on the right side by adding and subtracting
In (C.34) the first right side term will be considered by adding and subtracting
In (C.35), by (C.19), and by Lemma B.5 ), and Assumption 8
Using (C.5)-(C.7) with (C.36) in the first term on the right side of (C.35),
Next use Lemma B.5 with (C.36) on the second right side term in (C.35)
In (C.34) the first right side term, use (C.37)(C.38), and since ¯(1) by Assumption 9
Analyze the second term on the right side of (C.34), by (C.5)-(C.7) with Lemma B.5 ), and
Clearly by (C.39)(C.40) in (C.34), by Assumption 8
In (C.33) consider the second term on the right side which will be upper bounded by adding and subtracting and triangle inequality
In (C.42) the first term on the right side will be analyzed by adding and subtracting
In (C.43) consider by (C.27)(C.30) and Assumption 8
Then by (C.44)(A.162) on the first right side term (C.43)
Then using Lemma B.5 for second term on the right side of (C.43) in combination with (C.44)(C.45) in
since (1) by Assumption 8. Next consider the second term on right side of (C.42), with
:=
,
by Assumption 8. Use (C.46)(C.47) in (C.42) to have
by Assumption 8.
Consider the third right side term in (C.33), by adding and subtracting and triangle inequality
Consider the first right side term in (C.49)
In (C.50), by (C.19)(C.23)(C.27)(C.30) and Assumption 8
Consider the first term on the right side of (C.52) via the Cauchy-Schwarz inequality
by the same analysis in (B.9). Fourth term on the right side of (C.52) can use the same analysis in (B.6), :=
Then fifth term on the right side of (C.52) is
by (B.6).
by (B.10). Among all right side terms in (C.52), the slowest rates are (C.55)(C.58), as can be seen by
Now, combine (C.50)(C.51)(C.60) in the first right side term (C.49), with :=
, Lemma B.5 (i.e.
))
where the last rate is by Assumption 8. Consider (C.60) and ) by (C.23)(C.30),
substituted into second term in right side of (C.49)
So by (C.61)(C.62) in left side term in (C.49), by Assumption 8
Next clearly by (C.41)(C.48)(C.63) in (C.33)
Next the denominator in (C.15) can be written as
The ratio in (C.65) is greater than or equal to the following term
1
By definitions, and Assumption
0
See that by :=
and symmetric Γ
By (A.148) in the numerator above
By (A.150)(A.164) with Assumption 8(ii) in the denominator of (C.67)
In this part we consider the mean-variance efficiency of a large portfolio in an out-of-sample context, and we add a simulation to show the effects of sparsity on our method and on other methods.
Mean-Variance Efficiency
This Supplement formally shows that we can obtain mean-variance efficiency in an out-of-sample context. Ao et al. (2019) show that this is possible when , when both p, and n are large. That article is a significant contribution since they also demonstrate that other methods before theirs could not obtain that result, and it is a difficult issue to address. We are interested in maximized out-of-sample expected return
and its estimate
. Additionally, we are interested in the out-of-sample variance of the portfolio returns
and its estimate
. Note also that by the formula for weights
=
, given
:=
.
Below, we show that our estimates based on nodewise regression are consistent, and furthermore, we also provide the rate of convergence results.
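As a rough illustration of the two out-of-sample quantities discussed above, the following sketch computes the expected return and the variance of a portfolio from a given weight vector. The displayed weight formula is not recoverable in this extract, so the sketch assumes the standard risk-constrained mean-variance weights, w = sigma * Theta mu / sqrt(mu' Theta mu), with Theta the precision matrix of returns and sigma a risk target. The function names and the toy inputs are hypothetical and are not taken from the paper.

```python
import numpy as np

def mv_weights(mu, theta, sigma):
    """Assumed risk-constrained mean-variance weights:
    sigma * Theta mu / sqrt(mu' Theta mu)."""
    mu = np.asarray(mu, dtype=float)
    theta = np.asarray(theta, dtype=float)
    return sigma * (theta @ mu) / np.sqrt(mu @ theta @ mu)

def oos_mean(weights, mu):
    """Out-of-sample expected return w' mu for fixed weights."""
    return float(np.asarray(weights) @ np.asarray(mu, dtype=float))

def oos_variance(weights, cov):
    """Out-of-sample variance w' Sigma w for fixed weights."""
    w = np.asarray(weights, dtype=float)
    return float(w @ np.asarray(cov, dtype=float) @ w)

# Toy example (hypothetical numbers): with the true precision matrix,
# the variance equals sigma^2 and the mean equals sigma * sqrt(mu' Theta mu).
cov = np.diag([1.0, 4.0])
theta = np.linalg.inv(cov)
mu = np.array([1.0, 2.0])
sigma = 0.1
w = mv_weights(mu, theta, sigma)
```

Replacing `theta` and `mu` by estimates gives estimated weights; evaluating `oos_mean` and `oos_variance` at those weights with the true `mu` and `cov` gives the quantities whose consistency the theorem addresses.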
Proof of Theorem D.1. (i). Start with the definition of the weights and their estimators
Next, we have
where we divided both the numerator and denominator by p, and
By (A.147), (D.3), Lemma B.4 in Supplement B, and (1) via Assumption 8 in the denominator below in (D.4)
Now, use Assumption 8 in (D.2)(D.5) and (D.1) to obtain the desired result.
(ii). Now, we analyze the risk. See that
where we multiplied and divided by , which is positive by (A.147). By (A.164), since
:=
,
Additionally, by Lemma B.4 in Supplement B and (A.147)
By (D.6), (D.7) and Assumption 8,
Effects of Sparsity
This section of the Supplement shows a small simulation with a block diagonal covariance matrix for the idiosyncratic part of the dgp. The dgp is the same as in Section 5, but with the covariance matrix of the idiosyncratic errors equal to BLDiag(b), where BLDiag(b) is the block diagonal matrix with b blocks of ones. Moreover, this simulation was only performed for n = 200 and for the plug-in models with block sizes of 5, 15 and 50. The objective is to look at the behavior of Nodewise Regression at different sparsity levels in the covariance matrix. We analyze two questions: first, whether our methods do well compared with others when the model is less sparse; second, whether the effects of sparsity are uniform across the various Sharpe Ratio cases analyzed in Section 4.
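To make the block diagonal design concrete, the sketch below builds a matrix with all-ones square blocks along the diagonal and reports its share of nonzero entries. It assumes that the argument is the block size (5, 15 or 50, as in the simulation) and that a final, smaller block is allowed when the dimension is not a multiple of the block size. Both are assumptions, and the function names are hypothetical.

```python
import numpy as np

def block_diag_ones(p, block_size):
    """Return a p x p matrix with all-ones square blocks on the diagonal.
    Assumption: block_size is the side length of each block."""
    mat = np.zeros((p, p))
    for start in range(0, p, block_size):
        stop = min(start + block_size, p)  # last block may be smaller
        mat[start:stop, start:stop] = 1.0
    return mat

def nonzero_share(mat):
    """Fraction of nonzero entries; larger values mean less sparsity."""
    return np.count_nonzero(mat) / mat.size

# High-dimensional case of the simulation: n = 200, p = 1.5 n = 300.
shares = {b: nonzero_share(block_diag_ones(300, b)) for b in (5, 15, 50)}
```

With p = 300 the nonzero share is block_size / p, so it rises from 5/300 to 50/300 as the block size grows, which is the sense in which larger blocks correspond to a less sparse covariance matrix.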
First, Table 6 shows that our methods do well in high-dimensional cases: our method has the smallest error in 5 out of 12 cases, whereas POET has high errors in all cases. In low dimensions, non-linear shrinkage is the best method, and POET again does poorly. In the less sparse case with blocks of size 50, in high dimensions, we get the smallest error in 2 cases, and non-linear shrinkage gets the smallest error in the other 2 cases. Regarding our method across the various Sharpe Ratio cases, for the constrained maximum Sharpe Ratio (MSR), our errors become smaller as the block size increases. For example, NW-GIC has an error of 0.379 in the high-dimensional case with block size 5, and this decreases to 0.100 with block size 50. For the Markowitz portfolio, increasing the block size does not change our errors much. Our method is affected by non-sparsity in the maximum out-of-sample Sharpe Ratio, as predicted by our Theorem 8.
Table 6: Simulation Results – Block Diagonal DGP with Real Factors
The table shows the simulation results for the block DGP. Each simulation was done with 100 iterations. We used a single sample size of n = 200 and the number of stocks was either n/2 or 1.5n for the low-dimensional and the high-dimensional case, respectively. Each block of rows shows the results for a different block size (5, 15, 50) in the block diagonal DGP. The values in each cell show the average absolute estimation error for estimating the square of the Sharpe Ratio.
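The error measure in the caption can be sketched as follows. The example takes the unconstrained maximum squared Sharpe Ratio, mu' Theta mu, as the target; this is an illustrative assumption, since the table covers several Sharpe Ratio cases. The function names and the numbers in the usage lines are hypothetical and are not results from the paper.

```python
import numpy as np

def squared_sharpe(mu, theta):
    """Squared maximum Sharpe Ratio mu' Theta mu (assumed target)."""
    mu = np.asarray(mu, dtype=float)
    return float(mu @ np.asarray(theta, dtype=float) @ mu)

def mean_abs_error(estimates, truth):
    """Average absolute estimation error across simulation iterations."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.mean(np.abs(estimates - truth)))

# Toy usage: true value from known mu and Theta, two made-up estimates.
truth = squared_sharpe([1.0, 2.0], np.eye(2))
error = mean_abs_error([4.0, 7.0], truth)
```

In the simulation described by the caption, the list of estimates would contain one value per iteration (100 in total) for each method and block size.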