Solar flares are the most energetic events in the solar system. Over a typical duration of $\sim (100 - 1000)$ s, they can release up to $10^{32}$
erg of energy, stored in stressed active-region magnetic fields, into directed mass motions, heating, and acceleration of supra-thermal charged particles, including electrons, protons and heavier ions (Kontar et al. 2011). Solar flares, together with Coronal Mass Ejections (CMEs), are the main drivers of space weather at Earth and can sometimes significantly affect Earth- and space-based technology systems such as power grids, flight navigation, and satellite communications. Predicting solar flares requires, first of all, the determination of parameters, such as properties of sunspot groups or of the coronal magnetic field configuration, that are thought to be important for the understanding of fundamental processes in solar plasma physics. Second, at a more technological level, these parameters are used as input values for algorithms that produce predictions providing, for example (but not exclusively), a binary flare/no-flare outcome (Bloomfield et al. 2012).
Most recent flare prediction algorithms belong to the machine learning framework (Bobra & Couvidat 2015; Colak & Qahwaji 2009; Li et al. 2007; Yu et al. 2009; Yuan et al. 2010). In this setting, the data properties utilized for prediction are named features. In the case of supervised learning, a set of historical data is available in which features are tagged by means of labels representing the observation outcome, and the prediction task consists in determining the label associated with an incoming set of features. Unsupervised methods, on the other hand, do not use any training set: data are clustered into different groups according to similarity criteria involving the data features.
A crucial aspect of flare prediction, with notable physical implications, is to provide hints on which data features correlate most with the labels. In statistical learning theory this practice is known as feature selection, although applications often refer to it as feature importance, which better points out the fact that at the end of the process features are ranked according to their importance in the prediction task (in the following we will use the two terms equivalently). Feature selection can be realized by using advanced implementations of standard neural network approaches (Olden et al. 2004; Garson 1991) or by means of regularization methods. This second approach aims to optimize a functional made of two terms: the discrepancy term measures the distance between prediction and data in the training set, while the penalty term imposes a constraint on the number of features that significantly contribute to the prediction itself. Two examples of regularization methods for feature selection are LASSO (Tibshirani 1996) and l1-penalized logit (l1-logit in the following) (Wu et al. 2009). Both approaches utilize an l1-norm penalty term to reduce the complexity; however, LASSO measures the discrepancy assuming that the noise on the data is Gaussian, while l1-logit relies on a maximum likelihood procedure in which the probability function to maximize is the binomial distribution. In the framework of flare prediction, each of these two methods presents a specific limitation. LASSO is intrinsically a regression method and therefore is not naturally appropriate for applications like flare prediction that may require a binary yes/no response. l1-logit, on the other hand, is a classification method, but it predicts the binary condition by applying a fixed threshold on the flare occurrence probability, i.e., the applied threshold is the same whatever dataset is used for training.
The present paper introduces a novel approach to flare prediction with feature importance, whose aim is to overcome both of the previous limitations. The perspective of this approach is hybrid and rather general. First, a regularization method for regression is applied to the training set with an l1 penalty term that promotes sparsity, thus realizing feature importance (more specifically, this regularization step reconstructs the vector of weights with which each feature contributes to the prediction in the training set). Then, the set of real values obtained by multiplying the weights by the feature values in the training set is automatically partitioned into two classes by means of a clustering technique. Clustering is an unsupervised learning approach that organizes a set of samples into meaningful clusters based on data similarity. The data partition is obtained through the minimization of a cost function involving distances between data and cluster prototypes. Optimal partitions are obtained through iterative optimization: starting from a random initial partition, samples are moved from one cluster to another until no further improvement in the cost function is observed. Therefore, in the second step of the hybrid approach, clustering performs an automatic thresholding, which depends on the historical set used for the training phase and is, therefore, intrinsically data dependent. The resulting algorithm presents several advantages with respect to standard one-step approaches: it selects the most significant features, since, in the first step, it relies on a regularization technique that promotes sparsity; it is a classification method, since at the end it produces two clusters, each one corresponding to a specific outcome of the prediction; and it performs classification in a flexible, data-adaptive way, which makes it effective in providing good performances with respect to standard skill scores.
The hybrid approach in this paper utilizes LASSO in the regularization step and Fuzzy C-means (Bezdek 1981) to cluster the LASSO outcome, although other feature selection and clustering algorithms can be applied.
In order to corroborate the effectiveness of this hybrid approach we utilized a set of data from the National Oceanic and Atmospheric Administration (NOAA) Space Weather Prediction Center (SWPC) and compared our results with the ones provided by l1-logit, as far as both the classification and feature importance abilities are concerned, and with some of the most used machine learning approaches in flare forecasting, as far as just the prediction effectiveness is concerned.
The plan of the paper is as follows. Section 2 illustrates the kind of data the prediction algorithms deal with. Section 3 introduces our hybrid approach for flare prediction with feature importance. Section 4 applies the hybrid approach to the set of SWPC data described in Section 2 and compares its performances to the ones obtained by l1-logit and by other machine learning methods. Our conclusions are offered in Section 5.
Solar Active Regions (ARs) are classified according to magnetic field complexity indicators. For example, ARs tracked by the National Oceanic and Atmospheric Administration (NOAA) Space Weather Prediction Center (SWPC) are typically classified by using the following 5 indicators (features): the area, the three McIntosh indices (McIntosh 1990), and the Mount Wilson index (Hale et al. 1919). The area index is computed in fractions (millionths) of a solar hemisphere. The McIntosh scheme uses white light emissions to represent sunspot structure and is composed of three independent variables: the Zurich class Z of leading/trailing spot size and separation, which may assume 7 categorical values; the penumbral class p of primary spot regularity, which may assume 6 categorical values; the compactness class c of internal spot distribution, which may assume 4 categorical values. Finally, the Mount Wilson scheme groups sunspots into classes based on the complexity of magnetic flux distribution in associated active regions, according to rules set by the Mount Wilson Observatory in California; this feature may assume 8 categorical values.
In order to apply machine learning algorithms, either supervised or unsupervised, we need to transform the categorical information contained in the above sunspot classifications (specifically, the McIntosh and Mount Wilson indices) into numerical data. This can be done either by transforming the categorical variables into dummy variables (Hardy 1993) or by computing occurrence frequencies in a historical database. In this paper we used the second approach, which preserves the dimension of the space in which the data analysis is performed. Specifically, we have considered the SWPC database covering the December 1988 to June 1996 time range and we have computed the frequency with which a sunspot classified by a specific value of a fixed indicator produces a flare greater than a given class. In any case, we have verified that the use of dummy variables does not improve the effectiveness of the prediction for any of the methods considered in this paper. On the other hand, we are also aware that the use of frequencies requires the availability of a labelled dataset whose content, in principle, may grow as new data become available.
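More formally, and focusing on the specific case of the value A for the Zurich class in the McIntosh classification, we denoted by $N_A^{C}$, $N_A^{M}$ and $N_A^{X}$ the occurrences of flaring events of class C, M and X, respectively, and by $N_A$ the total number of occurrences of the value A, and computed the frequencies associated with flaring events of class greater than or equal to a specific class as

$$f_A^{\geq C1} = \frac{N_A^{C} + N_A^{M} + N_A^{X}}{N_A} \qquad (1)$$

(with the corresponding no-flare-event frequency defined as $1 - f_A^{\geq C1}$),

$$f_A^{\geq M1} = \frac{N_A^{M} + N_A^{X}}{N_A} \qquad (2)$$

(with the corresponding no-flare-event frequency defined as $1 - f_A^{\geq M1}$), and

$$f_A^{\geq X1} = \frac{N_A^{X}}{N_A} \qquad (3)$$

(with the corresponding no-flare-event frequency defined as $1 - f_A^{\geq X1}$). Similar formulas can be written for each one of the other categorical predictors.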
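The frequency encoding described above can be illustrated with a minimal Python sketch. The record layout, the `"N"` label for a no-flare record, and the function name `flare_frequency` are assumptions made for this example, and the toy records are made up; they are not SWPC data.

```python
from collections import defaultdict

# Flare classes in increasing order; "N" is a hypothetical label for "no flare".
ORDER = {"N": 0, "C": 1, "M": 2, "X": 3}

def flare_frequency(records, min_class):
    """Map each categorical value to the fraction of its records
    whose flare class is at least `min_class`."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for value, flare_class in records:
        totals[value] += 1
        if ORDER[flare_class] >= ORDER[min_class]:
            hits[value] += 1
    return {value: hits[value] / totals[value] for value in totals}

# Toy records (made up): (Zurich class, flare class produced).
records = [("A", "N"), ("A", "N"), ("A", "C"), ("A", "M"),
           ("D", "C"), ("D", "M"), ("D", "X"), ("D", "N")]

freq_c1 = flare_frequency(records, "C")  # frequencies of flares of class C or above
freq_m1 = flare_frequency(records, "M")  # frequencies of flares of class M or above
print(freq_c1, freq_m1)
```

Each categorical value is replaced by a single number in [0, 1], so the feature space keeps one dimension per indicator, whereas dummy variables would add one dimension per categorical value.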
Finally, we note that the same dataset used for computing these frequencies has been used as training set for the supervised machine learning algorithms utilized in the following. On the other hand, the database of SWPC indicators covering the time range between August 1996 and December 2010 has been used as test set for both the supervised and unsupervised machine learning methods.
We denote with $X$ the matrix with dimension $N$ (number of active regions) $\times$ $F$ (number of features) whose columns contain the feature values for each specific active region in the training set; $\beta$ is the $F \times 1$ vector containing the $F$ model parameters to determine and $y$ is the $N \times 1$ data vector used in the training set and made of 0 and 1 values. l1-logit has been designed ad hoc to perform classification with feature importance (Wu et al. 2009). This is a constrained maximum-likelihood method that allows the estimation of the model parameters while best-fitting the data. The logit parameter estimation method solves the minimum problem

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{N} \left[ \log\left(1 + e^{x_i \beta}\right) - y_i \, x_i \beta \right] \quad \text{subject to} \quad \|\beta\|_1 \leq c , \qquad (4)$$

i.e., it searches for the maximum likelihood of the model parameter vector when one assumes that each component $y_i$ of the vector $y$ is the realization of a random variable described by the Bernoulli distribution. In equation (4) $x_i$ is the $i$-th row of matrix $X$ and $c$ is a positive constant. In order to realize feature selection, l1-logit adds the condition that $\|\beta\|_1$ is small, which mathematically points out the few parameters that most significantly contribute to the classification. When a new active region $x$ is available ($x$ is a vector of $F$ components), then $\hat{y} = x^{T} \hat{\beta}$ is computed and its sign denotes the outcome of the prediction. This implies that the classification threshold here is fixed and equal to zero independently of the dataset used for training.
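As an illustration of how an l1 condition drives weights to zero in a logit model, the following is a minimal sketch, not the implementation used in the paper. It assumes the penalized form of the problem (Bernoulli negative log-likelihood plus `lam` times the l1 norm) solved by proximal gradient descent; the names `l1_logit`, `lam`, `step`, `iters` and the toy data are hypothetical.

```python
import math

def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def l1_logit(X, y, lam, step=0.1, iters=2000):
    """Minimize the mean Bernoulli negative log-likelihood plus lam * ||beta||_1
    by proximal gradient descent (no intercept term)."""
    n, f = len(X), len(X[0])
    beta = [0.0] * f
    for _ in range(iters):
        grad = [0.0] * f
        for xi, yi in zip(X, y):
            z = sum(b * v for b, v in zip(beta, xi))
            p = 1.0 / (1.0 + math.exp(-z))  # modelled flare probability
            for j in range(f):
                grad[j] += (p - yi) * xi[j] / n
        beta = [soft_threshold(b - step * g, step * lam)
                for b, g in zip(beta, grad)]
    return beta

def predict(beta, x):
    """Fixed threshold: the sign of the linear predictor decides the class."""
    return 1 if sum(b * v for b, v in zip(beta, x)) > 0 else 0

# Toy centred data (made up): only the first feature carries information.
X = [[2.0, 0.1], [1.5, -0.2], [1.0, 0.3],
     [-1.0, 0.2], [-1.5, -0.1], [-2.0, -0.3]]
y = [1, 1, 1, 0, 0, 0]
beta = l1_logit(X, y, lam=0.1)
print(beta, predict(beta, [1.2, 0.0]), predict(beta, [-0.7, 0.4]))
```

In this toy case the weight of the uninformative second feature is set exactly to zero, while the decision rule remains the fixed sign test on the linear predictor regardless of the training data.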
We now introduce an approach to flare prediction with feature selection which, differently from l1-logit, is hybrid and data-dependent. The first step of this two-step approach utilizes LASSO (Tibshirani 1996) to perform feature selection. Specifically, we look for the solution of the minimum problem

$$\hat{\beta} = \arg\min_{\beta} \left\{ \| y - X \beta \|_2^2 + \lambda \|\beta\|_1 \right\} ,$$

where the regularization parameter $\lambda$ is optimized by means of a cross-validation procedure (Stone 1974). Then, in the second step, we apply a clustering method for partitioning the output $\hat{y} = X \hat{\beta}$ into two classes.
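The LASSO step can be sketched with a few lines of cyclic coordinate descent. This is only an illustrative solver under stated assumptions: the objective is scaled as $(1/2N)\|y - X\beta\|_2^2 + \texttt{lam}\,\|\beta\|_1$ (a rescaling of the regularization parameter), `lam` is fixed by hand rather than chosen by cross validation, and the names `lasso_cd` and `sweeps` as well as the toy data are hypothetical.

```python
def soft_threshold(v, t):
    """Shrink v toward zero by t (the l1 proximal operator)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def lasso_cd(X, y, lam, sweeps=200):
    """Cyclic coordinate descent for (1/2N)||y - X beta||^2 + lam * ||beta||_1."""
    n, f = len(X), len(X[0])
    beta = [0.0] * f
    for _ in range(sweeps):
        for j in range(f):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for xi, yi in zip(X, y):
                partial = sum(xi[k] * beta[k] for k in range(f) if k != j)
                rho += xi[j] * (yi - partial)
                z += xi[j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (z / n)
    return beta

# Toy data (made up): the labels depend on the first feature only.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [1.0, 1.0, 0.0, 0.0]
beta_hat = lasso_cd(X, y, lam=0.1)

# Regression outcomes that the second (clustering) step would partition.
y_hat = [sum(b * v for b, v in zip(beta_hat, row)) for row in X]
print(beta_hat, y_hat)
```

The weight of the second feature is exactly zero, which is the sparsity effect that the hybrid approach reads as feature importance; the values in `y_hat` are real numbers, not class labels, which is why a second step is needed.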
In a classical clustering approach like Hard C-Means (HCM) (Jain et al. 1999), each sample may belong to a unique cluster, while in a fuzzy clustering formulation a different degree of membership is assigned to each sample with respect to each cluster, which implies a much higher flexibility in accounting for data characteristics. Therefore, in the second step of our hybrid approach, we used Fuzzy C-Means (FCM) (Bezdek 1981), which is the fuzzy extension of HCM. In this framework, the FCM functional is given by

$$J_m(U, \hat{z}) = \sum_{j=1}^{C} \sum_{k=1}^{N} u_{jk}^{m} \, d_{jk}^{2} ,$$

where $\hat{z}$ is the set of the $C$ centroids of the clusters, the component $u_{jk}$ of the matrix $U$ represents the membership of the $k$-th sample to the $j$-th cluster, $d_{jk}$ is the distance between the $j$-th centroid and the $k$-th sample, and $m$ is the fuzzifier parameter. The FCM optimization problem consists in (iteratively) determining the components of the matrix $U$ and of the vector $\hat{z}$ given the components of the vector $\hat{y}$.
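A minimal one-dimensional, two-cluster Fuzzy C-Means can be sketched as follows, using the standard alternating updates of memberships and centroids. Several choices here are assumptions made for the example: the deterministic initialization at the minimum and maximum of the data (instead of a random one), the fuzzifier `m = 2`, the stopping tolerance, the reading of the threshold as the midpoint between the two centroids, and the toy values.

```python
def fuzzy_c_means_1d(values, m=2.0, iters=100, tol=1e-9):
    """Two-cluster FCM on scalars; returns centroids, memberships, hard labels."""
    centroids = [min(values), max(values)]  # deterministic start (assumption)
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(iters):
        # Membership update: u_j = 1 / sum_l (d_j / d_l)^(2 / (m - 1)).
        memberships = []
        for v in values:
            d = [abs(v - c) for c in centroids]
            if 0.0 in d:  # sample coincides with a centroid
                u = [1.0 if dj == 0.0 else 0.0 for dj in d]
                total = sum(u)
                u = [x / total for x in u]
            else:
                u = [1.0 / sum((dj / dl) ** exponent for dl in d) for dj in d]
            memberships.append(u)
        # Centroid update: mean of the samples weighted by u^m.
        updated = []
        for j in range(2):
            w = [u[j] ** m for u in memberships]
            updated.append(sum(wi * v for wi, v in zip(w, values)) / sum(w))
        shift = max(abs(a - b) for a, b in zip(updated, centroids))
        centroids = updated
        if shift < tol:
            break
    labels = [0 if u[0] >= u[1] else 1 for u in memberships]
    return centroids, memberships, labels

# Toy regression outcomes (made up), e.g. values produced by a LASSO step.
y_hat = [0.05, 0.1, 0.15, 0.2, 0.7, 0.8, 0.9]
centroids, memberships, labels = fuzzy_c_means_1d(y_hat)
threshold = sum(centroids) / 2.0  # equal-membership point between the centroids
print(centroids, labels, threshold)
```

With two clusters on a line, the two memberships are equal halfway between the centroids, so the implied threshold moves with the data instead of being fixed in advance.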
In this Section we compare the performances of l1-logit and of our hybrid approach in the analysis of the SWPC test set covering the time range between August 1996 and December 2010 (the cardinality of this set is 22222); for both methods we used the data collected between December 1988 and June 1996 as training set (the cardinality of this second set is 17600). Further, we have also analyzed the same test set by means of four other classical machine learning methods: the (unsupervised) clustering HCM and FCM algorithms, a standard Multi Layer Perceptron (MLP) (Rumelhart et al. 1986) and a Support Vector Machine (SVM) (Cortes & Vapnik 1995). For the latter two methods, which are supervised, we used the same training set as in the case of l1-logit and the hybrid method. All these prediction algorithms have been applied to predict flares with class above C1 and M1, respectively. From now on, for the sake of brevity, we will indicate with $\geq$C1 and $\geq$M1 all flares with class above C1 and M1, respectively. We have not considered flares with class above X1 since they are rare in this dataset (less than 1% in the training set and around 0.5% in the test set).
By means of the frequency matching process described in Section 2, each sample is transformed into a 5-dimensional vector. Note that the first four components range from 0 to 1, while the fifth one, i.e., the sunspot area, goes from 0 up to 10. Since the differences between component variances can affect the flare prediction performances, a standardization step preceded the application of the machine learning algorithms. We also note that frequency matching must be performed for each case of interest, i.e., separately for the $\geq$C1 and $\geq$M1 flare predictions; therefore, for both the training set and the test set, we have constructed two subsets: the first subset, indicated with #1, is constructed using the frequencies of flares of class at least C1 (i.e., by applying (1) and analogous); the second subset, indicated with #2, is constructed using the frequencies of flares of class at least M1 (i.e., by applying (2) and analogous).
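The standardization step mentioned above can be sketched as a per-feature z-score. The text does not say which statistics were used; this example assumes that means and standard deviations are estimated on the training set and then reused for the test set. The function names and the toy rows are hypothetical.

```python
from statistics import mean, pstdev

def fit_standardizer(rows):
    """Per-feature mean and (population) standard deviation of the training rows."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]  # guard against constant features
    return means, stds

def standardize(rows, means, stds):
    """Rescale every feature to zero mean and unit variance (on the fitted set)."""
    return [[(v - mu) / sd for v, mu, sd in zip(row, means, stds)]
            for row in rows]

# Toy rows (made up): a frequency-like feature and an area-like feature.
train = [[0.1, 100.0], [0.3, 300.0], [0.5, 500.0]]
test = [[0.2, 400.0]]

means, stds = fit_standardizer(train)
train_std = standardize(train, means, stds)
test_std = standardize(test, means, stds)  # reuse the training statistics
print(train_std, test_std)
```

After this rescaling the two features have comparable variances, so neither the regularization penalty nor the clustering distances are dominated by the feature with the largest numerical range.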
As explained in the previous section, the main advantage of the hybrid approach lies in the fact that the way it partitions the set of LASSO outcomes is driven by the input data. This is illustrated in Figure 1, which shows how FCM automatically identifies the probability threshold. It is interesting to note that this threshold depends on the flare class under consideration and in any case is different from the fixed value provided by l1-logit, which splits the regression values at 0.5.
The threshold value determines the prediction, whose performance can be measured by means of specific scores. Many skill scores can be found in the literature for the assessment of flare prediction performances (Bloomfield et al. 2012). All these scores are linked to the forecast contingency table, which is made up of four elements:
• The number of flares predicted and observed (true positives, TP).
• The number of flares not predicted but observed (false negatives, FN).
• The number of flares predicted but not observed (false positives, FP).
• The number of flares neither predicted nor observed (true negatives, TN).
We have validated the six flare prediction algorithms by means of the following skill scores defined in terms of the above elements. Specifically, the probability of detection

$$\mathrm{POD} = \frac{TP}{TP + FN} ,$$

the accuracy

$$\mathrm{ACC} = \frac{TP + TN}{TP + FP + FN + TN} ,$$

and the false alarm ratio

$$\mathrm{FAR} = \frac{FP}{TP + FP} .$$

These scores range from 0 to 1 and the best predictions correspond to small FAR values and high values for the other scores. We also utilized two scores with values ranging from $-1$ to 1: the Heidke skill score

$$\mathrm{HSS} = \frac{2 \, (TP \cdot TN - FN \cdot FP)}{(TP + FN)(FN + TN) + (TP + FP)(FP + TN)} ,$$

and the true skill statistic

$$\mathrm{TSS} = \frac{TP}{TP + FN} - \frac{FP}{FP + TN} .$$

Also in this case good prediction performances correspond to high values of the scores. Figure 2 and Figure 3 present the values of all five skill scores for the $\geq$C1 flare prediction and the $\geq$M1 flare prediction, respectively. Moreover, Table 1 and Table 2 provide the results of the feature selection processes performed by l1-logit and the hybrid technique. Specifically, the tables contain the weights with which the sunspot area, the McIntosh indices, and the Mount Wilson index contribute to the flare prediction process for the two methods.
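The five skill scores can be computed directly from the four entries of a contingency table, as in the short sketch below. It uses the standard two-class definitions of POD, ACC, FAR, HSS and TSS; the function name `skill_scores` and the example counts are made up, and no guard against empty rows or columns of the table is included.

```python
def skill_scores(tp, fn, fp, tn):
    """Standard two-class skill scores from a forecast contingency table."""
    pod = tp / (tp + fn)                   # probability of detection
    acc = (tp + tn) / (tp + fn + fp + tn)  # accuracy
    far = fp / (tp + fp)                   # false alarm ratio
    hss = (2.0 * (tp * tn - fn * fp)
           / ((tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)))  # Heidke skill score
    tss = tp / (tp + fn) - fp / (fp + tn)  # true skill statistic
    return {"POD": pod, "ACC": acc, "FAR": far, "HSS": hss, "TSS": tss}

# Made-up contingency table: 30 hits, 10 misses, 20 false alarms, 140 correct nulls.
scores = skill_scores(tp=30, fn=10, fp=20, tn=140)
print(scores)
```

Note that TSS is the difference between the hit rate and the false-positive rate, so it does not change when the ratio between flaring and non-flaring samples changes, unlike ACC.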
Table 1. Feature importance in $\geq$C1 class flare prediction computed from the
Table 2. Feature importance in $\geq$M1 class flare prediction computed from
This paper introduces a novel approach to flare prediction, which utilizes indices associated with AR data and which is also able to automatically indicate which of these features contribute most to the prediction. The approach is intrinsically hybrid, in the sense that it combines the ability of regularization to perform feature selection with the ability of clustering to classify in a data-adaptive fashion. In the present implementation we have used LASSO in the feature selection step and FCM in the clustering step: LASSO guarantees a notable degree of generality in regularization, while FCM guarantees a notable degree of flexibility in data adaptation.
Figure 1. (a) $\geq$C1 class flare prediction. Split of the LASSO regression output by means of the Fuzzy C-means algorithm. The x-axis shows the values of the regression outcomes provided by the cross-validated LASSO algorithm. Blue and green colors represent the two clusters identified by the Fuzzy C-means algorithm. The blue (resp. green) cluster is the set of all the events for which the hybrid method returns a no-flare (resp. flare) prediction. (b) The same as in (a) but for $\geq$M1 class flares.
In any case, we have tested the hybrid approach using different combinations of feature selection and clustering methods involving l1-logit and HCM: the results of both feature importance ranking and prediction were comparable.
Figure 2. Comparison of performance between the six flare prediction algorithms in terms of skill scores. The bar plots represent the skill score values obtained by applying each method to the test set for the prediction of $\geq$C1 flares.
Figure 3. The same as in Figure 2 but for the prediction of $\geq$M1 flares.
We validated the approach against a NOAA SWPC dataset and by comparing the results with the ones provided by l1-logit and other standard machine learning flare prediction algorithms. This comparison showed that the hybrid approach outperforms l1-logit in the case of HSS and TSS, which are often considered (Bloomfield et al. 2012) the most reliable skill scores (for example, ACC tends to reach its maximum when the threshold is 0.5, which is not fully appropriate in the case of infrequent events such as M and X class flares). This is particularly true for TSS and for the prediction of flares belonging to class M or higher. More generally, the hybrid method predicts with a performance rate that is very similar to the one of the other two unsupervised clustering algorithms, while, coherently, l1-logit works similarly to the other two supervised regularization methods. The higher forecasting effectiveness of the hybrid approach with respect to l1-logit is due to the fact that it performs classification with a thresholding procedure that is data adaptive, while l1-logit utilizes a fixed negative/positive threshold. We note that the threshold in l1-logit could be tuned heuristically, searching 'a posteriori' for the values that provide the maximum for TSS and HSS, and that the advantage of fuzzy clustering is that it realizes such a search 'a priori' and in an automatic way.
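The a posteriori tuning mentioned above can be sketched as a simple sweep over candidate thresholds that keeps the one with the highest TSS. This is a hypothetical illustration of the heuristic, not a procedure taken from the paper: the candidate set (midpoints between consecutive distinct outcomes), the function names, and the toy outcomes and labels are all assumptions.

```python
def tss_at(outcomes, labels, threshold):
    """True skill statistic when outcomes above `threshold` are predicted as flares."""
    tp = sum(1 for s, l in zip(outcomes, labels) if s > threshold and l == 1)
    fn = sum(1 for s, l in zip(outcomes, labels) if s <= threshold and l == 1)
    fp = sum(1 for s, l in zip(outcomes, labels) if s > threshold and l == 0)
    tn = sum(1 for s, l in zip(outcomes, labels) if s <= threshold and l == 0)
    return tp / (tp + fn) - fp / (fp + tn)

def best_threshold(outcomes, labels):
    """Scan midpoints between consecutive distinct outcomes, keep the best TSS."""
    distinct = sorted(set(outcomes))
    candidates = [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]
    return max(candidates, key=lambda t: tss_at(outcomes, labels, t))

# Made-up model outcomes and observed labels (1 = flare, 0 = no flare).
outcomes = [0.05, 0.1, 0.2, 0.3, 0.35, 0.6]
labels = [0, 0, 0, 1, 0, 1]

tuned = best_threshold(outcomes, labels)
print(tuned, tss_at(outcomes, labels, tuned), tss_at(outcomes, labels, 0.5))
```

Such a sweep requires the observed labels for the data it is tuned on, whereas the clustering step of the hybrid approach derives its threshold from the regression outcomes alone.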
The hybrid approach and l1-logit can also be compared as far as their feature selection power is concerned. Table 1 clearly shows that, in forecasting $\geq$C1 flares, the two methods indicate the same features as the ones that contribute most to the prediction. Results are different when predicting $\geq$M1 flares, since LASSO gives the highest emphasis to the AR area, while l1-logit points to two of the three McIntosh indices as the most significant. A clarification of this contradictory outcome should be obtained by means of a systematic application of these two methods to either several SWPC datasets or features extracted from SDO/HMI images; this activity is part of the tasks currently addressed by the H2020 project FLARECAST, which will provide a technological platform for the testing of flare prediction algorithms and for the validation of the forecasting and feature selection results.
The authors have been supported by the H2020 grant Flare Likelihood And Region Eruption foreCASTing (FLARECAST), project number 640216. The authors kindly thank Prof. Shaun Bloomfield for providing the SWPC data and Dr. Annalisa Perasso for useful discussion.
Bezdek, J. C. 1981, Pattern Recognition with Fuzzy Objective Function Algorithms (Norwell, MA, USA: Kluwer Academic Publishers)
Bloomfield, D. S., Higgins, P. A., McAteer, R. T. J., & Gallagher, P. T. 2012, Astrophysical Journal Letters, 747, L41
Bobra, M. G., & Couvidat, S. 2015, Astrophysical Journal, 798, 135
Colak, T., & Qahwaji, R. 2009, Space Weather, 7, S06001
Cortes, C., & Vapnik, V. 1995, Machine Learning, 20, 273. http://www.springerlink.com/index/K238JX04HM87J80G.pdf
Gallagher, P. T., Moon, Y. J., & Wang, H. 2002, Solar Physics, 209, 171
Garson, G. D. 1991, Artificial Intelligence Expert, 6, 47
Hale, G. E., Ellerman, F., Nicholson, S. B., & Joy, A. H. 1919, The Astrophysical Journal, 49, 153. http://adsabs.harvard.edu/full/1919ApJ....49..153H
Hardy, M. A. 1993, Regression with Dummy Variables (SAGE)
Jain, A. K., Murty, N. M., & Flynn, P. J. 1999, ACM Computing Surveys, 31, 264
Kontar, E. P., Brown, J. C., Emslie, A. G., et al. 2011, Space Science Reviews, 159, 301
Li, R., Wang, H. N., He, H., Cui, Y. M., & Du, Z. L. 2007, Chinese Journal of Astronomy and Astrophysics, 7, 441