Machine learning includes various classification, regression, clustering and dimensionality reduction methods. In classification models, the target is a discrete class, while in regression, the target is typically a continuous variable. Ordinal regression, also known as ordinal clas-sification, is regression with a target that is discrete and ordered. Ordinal classification can be thus considered to be a hybrid of classification and regression.
Typical applications of ordinal classification include age estimation with an integer-valued target, advertising systems, recommender systems, and movie ratings on a scale from 1 to 5. For further insight into recent developments in machine learning, including ordinal regression, the reader is kindly directed to (Gutierrez et al., 2016; Cao et al., 2019; Vargas et al., 2019; Wang and Zhu, 2019; Gambella et al., 2019) and the references therein.
In this work, we perform a case study of ordinal regression on the ranks of a sorted sum of random variables corresponding to the duration of an orienteering relay viewed as a random process with a large number of subsamples of changeover times. We compare three widelyused regression schemes to our original method that is based on manipulations of certain properties of ordered random variables, otherwise known as order statistics.
We study whether expectations of order statistics in conjunction with statistical inference can forecast final team places when certain intricate, educated guesses are made about the underlying random process. Specifically, we assume lognormality of both individual leg times and, more importantly, team changeover times.
Generally, in competitive sports, the final place is the hard, quantitative result that both individuals and teams want to minimize. For our case study, both plots and numerical prediction error values show that ordinal regression based on lognormal order statistics of time duration can provide an accurate fit to real-world relay race data.
2.1 Relay Race System Model
Consider an orienteering relay race. Let n denote the number of finishing teams as we ignore disqualified and retired teams. Each team has m runners and each runner runs one leg. Especially note that m is given, whereas n is estimated, as will be discussed later.
We say that leg time results correspond to random variables with
. We define the changeover time
after leg
as
The final team result list of the relay race is a length-n sample of sorted in ascending order.
Let denote the cdf of
, and
de- note the
order statistic of a length-n sample of
. These order statistics satisfy
. We refer to
as the (final) place of the corresponding team, where
corresponds to the winning team.
Let c denote the number of training observations. We are given a changeover time training vector and a final place training vector y =
. Hence,
consists of realizations of
and y includes the corresponding observed places. We wish to find regression functions that satisfy
as accurately as possible. In (2), we use vector notation to emphasize that we find the parameters of that minimize a loss function with a set of observation pairs rather than with a single observation pair.
2.2 The Four Regression Models
1) Linear regression refers here to Ordinary Least Squares (OLS) regression rounded to the nearest integer. OLS finds the intercept and slope that minimize the residual sum of squares between the observed targets and the targets predicted by the linear approximation.
2) Gaussian Process (GP) regression is a nonparametric model that can manage exact regression up to a million data points on commodity hardware (Wang et al., 2019).
For a pair of training vectors , a GP is defined by its kernel function
, a
kernel matrix
with covariance values for all training point pairs, and a c-dimensional vector
with evaluations of the kernel function between training points
and a given point x. A GP predicts an unknown function
. For kernel matrix
with additive Gaussian noise with zero mean and variance
, the expected value of the zero mean GP predictive posterior distribution with a Gaussian likelihood is (Rasmussen and Williams, 2006)
We use (3) rounded to the nearest integer as the GP place predictor. The GP regression function thus becomes
For practical, numerical implementation of exact GP, we utilize the readily available GPyTorch library with a radial basis function (RBF) kernel as in the “GPyTorch Regression Tutorial” on (GPyTorch, 2019).
3) Ordinal regression refers here to the rounded to the nearest integer regression-based model from the readily available Python mord package for ordered ordinal ridge regression. This model overwrites the ridge regression function from the scikit-learn library and uses the (minus) absolute error as its score function (Mord, 2019; Pedregosa-Izquierdo, 2015).
4) Fenton-Wilkinson Order Statistics (FWOS) regression is our original regression model. For this model we make the following two well-educated assumptions.
Assumption 1: Individual leg time is lognormal.
Assumption 2: Changeover time is lognormal.
The lognormal distribution often appears in sciences (Limpert et al., 2001). Assumption 1 is based on the lognormality of vehicle travel time (Chen et al., 2018).
Assumption 2 paraphrases what in the literature is known as the Fenton-Wilkinson approximation (Wilkin- son, 1934; Fenton, 1960; Cobb, 2012). The FentonWilkinson approximation method is the method of approximating the distribution of the sum of lognormal random variables with another lognormal distribution.
Note that and
are both lognormal and independent but not identically distributed. With this in mind, we can now derive the FWOS regression prediction function
. For this purpose we use the following two well- known preliminary tools in probability theory.
Tool 1: Let W follow the standard uniform distribution U(0, 1) and follow distribution F. Let
denote the
order statistic of a length-n sample of W. The
order statistic of a length-n sample of
has the same distribution as the inverse cdf of F at
Tool 2: The standard uniform order statistic follows Beta(
). Therefore,
.
The inverse cdf of F is known as the quantile function . Tool 1 can be therefore expressed as
where “d=” reads “has the same distribution as”, and applying Tool 2 to (5) hence yields
Let F be the lognormal distribution with cdf . Now (6) directly implies
and, further, most interestingly for our purposes, that
which resembles (2) as r is associated with y.
For large n, as in our case study, it is reasonable to assume that such that
The well-known lognormal cdf is given by
where is the standard normal cdf,
and
are the lognormal parameters and
is the natural logarithm.
We combine (7), (8) and (9), and define the FWOS place prediction function for leg l as
where again denotes rounding to the nearest integer.
Maximum likelihood estimation (MLE) for the normal distribution yields lognormal estimators for and
when
are replaced by
. Hence,
What remains to be done is finding an estimate for the total number of teams n. We assume that there are no ties, which is equivalent to stating that the elements in y are unique. Thus, y is a sample, without replacement, of the discrete uniform distribution U[1, n]. Let D denote a random variable that follows this distribution. Now recall that y is a random c-subset of {1, 2, . . . , n}.
Estimating the parameter n of U[1, n] with a sample drawn without replacement is in the literature known as the German tank problem (Ruggles and Brodie, 1947), a solution to which is a uniformly minimum-variance unbiased estimator (UMVUE) (Goodman, 1952)
is the realization of the order statistic (maximum) of a length-c sample of D.
We replace n and in (10) with the
of (12) and
, respectively, and note that numerical values for
are obtained through (11). We obtain the following proposition.
Proposition 1. For a training dataset of c observations, lognormal parameter estimates that are obtained through the maximum likelihood estimators of changeover time observations
, and
, the maximum of final place observations y, the FWOS ordinal regression function
predicts final place with changeover time at changeover l.
Loosely speaking, Proposition 1 states that the FWOS method approximates final place with expected changeover place. We anticipate that this approximation holds to a satisfactory degree and that it improves with l.
We note that c, the dimension of the training vectors, is constant for all changeovers l as is , the estimate of n, while the pair
is separately estimated for each changeover l with changeover time training vector
.
Further note that FWOS regression only requires an easily acquirable pairof dimension 2 as opposed to the other three regression methods that require the whole place training vector y of dimension c.
Real-world data are acquired from the results of Jukola 2019 (Jukola, 2019) with n = 1653 teams with m = 7 runners per team. Training of the regression models is conducted for two cases: for and also for
training observations. The rest of the data are used for testing the models.
Figure 1: Place predictions (orange) at changeover l = 4 with training set size 80% and test set (blue) size 20%.
Figure 2: Place predictions (orange) at changeover l = 4 with training set size 5% and test set (blue) size 95%.
Figure 3: Place prediction root-mean-square errors (RMSEs) after each leg.
3.1 Regression Curves
Figure 1 plots place against team changeover time after leg l = 4 for the predictions (orange) and the test set (blue) for all the four regression models. The size of the training set is 80% and the size of the test set is 20%.
We see that linear regression and GP regression provide accurate fits for a large portion of the test set points, but behave poorly when time is large. For average teams, with approximately 350 to 550 minutes of elapsed time after four legs, place seems to grow linearly with time, though overall linearity is clearly an oversimplification.
We further notice that ordinal regression captures the effect where place saturates for large values of time, but fails to provide a smooth transition. FWOS regression, unlike the other models, does indeed exhibit the smooth sigmoidal behavior of the data.
In the setting of Fig. 2, there are significantly fewer training data compared to the setting of Fig. 1, namely, 5% compared to 80%. Interestingly, linear regression, ordinal regression and FWOS regression maintain high performance, whereas GP regression greatly suffers from the lack of training data.
Overall, we make the following two key remarks.
Remark 1: The “place against changeover time” curve resembles a scaled lognormal cumulative distribution function. This further suggests that elite teams “pull away” from the rest of the teams, while extremely slow teams fall far behind the rest.
Remark 2: A lognormal cumulative distribution function, alongside a German tank estimator, effectively predicts places with changeover times even when the number of training observations is small.
3.2 Root-Mean-Square Errors
Here we illustrate the error between place prediction and the corresponding true value
with label
, where v is the size of the test set with n = c + v. We use the root-mean-square error (RMSE)
Fig. 3a plots the RMSEs after each of the m = 7 changeovers for a random test set of v = 331 points (as in Fig. 1), while Fig. 3b plots the RMSEs for a random test set of v = 1571 points (
as in Fig. 2).
When comparing Fig. 3a with Fig. 3b, we notice similar RMSEs for both training set sizes, except for GP regression. It is clear that GP regression requires more training data than the other regression models to achieve comparable prediction error performance.
In every case, FWOS exhibits the best RMSE performance. However, the RMSEs are not zero even for the changeover, i.e., after the anchor leg when the team fin-ishes. This is due to the imperfections of the regression models and the random fluctuations of the data.
For our case study of an orienteering relay race, we have shown that rankings of ordered sums, here referred to as places, can be relatively accurately predicted by intermediate sums by first taking an educated guess that leg times are lognormal, and then by scaling the cumulative distribution function of another lognormal distribution corresponding to observed changeover times. The latter distribution can be approximated by the Fenton-Wilkinson method, while the scaling factor can be estimated with a well-known solution to the German tank problem.
The use of order statistics for ordinal regression provides a powerful approximation for our case study. However, intricate prior insight into the underlying random process is assumed, which is unnecessary for nonparametric models such as the Gaussian process. Similarly, linear regression and standard ordinal regression require no knowledge about the underlying distributions.
An event that may complicate leg time distribution fit-ting and erode our prediction performance is the restart, where the runners that have not started before a specific cut-off time start simultaneously. Restarts may skew the leg time distributions in the same way as the first leg runners that often run in packs. Running in packs may decrease our prediction accuracy.
Fenton-Wilkinson order statistics, alongside German tank estimators, could be applied whenever a longer random process can be modeled as a sum of shorter random processes with random durations and when the outcomes of the longer process are uniquely ordered by the corresponding total time duration. Such processes are plentiful in sports – those sources of inspiration for ordinal classi-fication order statistics research.
Cao, W., Mirjalili, V., and Raschka, S. (2019). Rank- consistent ordinal regression for neural networks. [online] Available: arXiv:1901.07884.
Chen, P., Tong, R., Lu, G., and Wang, Y. (2018). Ex- ploring travel time distribution and variability patterns using probe vehicle data: Case study in beijing. J. Adv. Transp.
Cobb, R. R. B. (2012). Approximating the distribution of a sum of log-normal random variables. In Proc. 6th Eur. Workshop Probab. Graph. Models.
Fenton, L. F. (1960). The sum of log-normal probabil- ity distibutions in scattered transmission systems. IRE Trans. Commun. Syst., 8:57–67.
Gambella, C., Ghaddar, B., and Naoum-Sawaya, J. (2019). Optimization models for machine learning: A survey. [online] Available: arXiv:1901.05331.
Goodman, L. A. (1952). Serial number analysis. J. Amer. Statist. Assoc., 47(270):622–634.
GPyTorch (2019). [online] https://gpytorch.ai.
Gutierrez, P. A., Perez-Ortiz, M., Sanchez-Monedero, J., Fernandez-Navarro, F., and Hervas-Martinez, C.
(2016). Ordinal regression methods: survey and experimental study. IEEE Trans. Knowl. and Data Eng., 28(1):127–146.
Jukola (2019). [online] https://www.jukola.com/2019/.
Limpert, E., Stahel, W. A., and Abbt, M. (2001). Log- normal distributions across the sciences: Keys and clues. Bioscience, 51:341–352.
Mord (2019). [online] https://pythonhosted.org/mord/.
P¨a¨akk¨onen, J., Dharmawansa, P., Freij-Hollanti, R., Hol- lanti, C., and Tirkkonen, O. (2019). File size distributions and caching for offloading. In Proc. IEEE Signal Process. Advances in Wireless Commun. (SPAWC).
Pedregosa-Izquierdo, F. (2015). Feature extraction and supervised learning on fMRI: from practice to theory. PhD thesis, Universit´e Pierre et Marie Curie.
Rasmussen, C. E. and Williams, C. K. I. (2006). Gaus- sian processes for machine learning. The MIT Press.
Ruggles, R. and Brodie, H. (1947). An empirical ap- proach to economic intelligence in world war ii. J. Amer. Statist. Assoc., 42(237):72–91.
Vargas, V.-M., Guti´errez, P.-A., and Herv´as-Mart´ınez, C. (2019). Cumulative link models for deep ordinal clas-sification. [online] Available: arXiv:1905.13392.
Wang, K. A., Pleiss, G., Gardner, J. R., Tyree, S., Wein- berger, K. Q., and Wilson, A. G. (2019). Exact gaussian processes on a million data points. [online] Available: arXiv:1903.08114.
Wang, L. and Zhu, D. (2019). Tackling multiple ordinal regression problems: Sparse and deep multi-task learning approaches. [online] Available: arXiv:1907.12508.
Wilkinson, R. I. (1934). Unpublished, cited in 1967. Bell Telephone Labs.