Modified swarm-based metaheuristics enhance Gradient Descent initialization performance: Application for EEG spatial filtering

2019·Arxiv

ABSTRACT

ABSTRACT

Gradient Descent (GD) approximators often fail in the solution space with multiple scales of convexities, i.e., in subspace learning and neural network scenarios. To handle that, one solution is to run GD multiple times from different randomized initial states and select the best solution over all experiments. However, this idea is proved impractical in plenty of cases. Even Swarm-based optimizers like Particle Swarm Optimization (PSO) or Imperialistic Competitive Algorithm (ICA), as commonly used GD initializers, have failed to find optimal solutions in some applications. In this paper, Swarm-based optimizers like ICA and PSO are modified by a new optimization framework to improve GD optimization performance. This improvement is for applications with high number of convex localities in multiple scales. Performance of the proposed method is analyzed in a nonlinear subspace filtering objective function over EEG data. The proposed metaheuristic outperforms commonly used baseline optimizers as GD initializers in both the EEG classification accuracy and EEG loss function fitness. The optimizers have been also compared to each other in some of CEC 2014 benchmark functions, where again our method outperforms other algorithms.

Keywords

Soft Computing; Meta-Heuristics; Gradient Descent; Artificial Intelligence; Engineering designed optimizations; exploration and exploitation

1 Introduction

In engineering applications, using approximation-based search methods like GD is appropriate as their search space is convex around operating points [24]. However, the combination of a locally convex optimizer with a metaheuristic can lead to better initial seeds and is worth improving. In this paper, the GD performance is improved by initializer that searches in multiple locality scales. The purpose of this work is to optimize differentiable objective functions with fractal behavior or layers of monotonic nonlinearities. Some examples are perceptron neural nets or linear subspace methods. Other important functions are those with recursive objective functions, e.g., the power series eigen-decomposition1. Most of these functions represent nests of hills that are scattered in multiple scales and are eventually multi-scale. They are multi-scale in the sense of having local optimas scattered in varieties of scales of function-values in terms of overshoot or smoothness of their tangent per locality. To find a suitable metaheuristic as GD initializer, we have to consider algorithm complexity more seriously. Highly complex optimizers are not suitable for engineering applications, adding to their computational burdens.

For algorithms that handle complexity-accuracy trade-off one can name Bat Search, which simulates echolocation behavior of bats [3], Ant Colony that mimics pheromone impact in communications [4], Gravitational Search which is inspired by the law of the gravity [5], and artificial immune system (AIS) algorithms [6] to imitate behavior of animal immune system. These methods are capable of intensifying in localities as time goes on. The mentioned algorithms as well as simplistic optimizers [1, 2], despite

1. Develop hybrid optimizer of GD with the proposed framework as its initializer. This offers GD a randomizer that searches a wider span of multiple scales to choose better seeds for GD convex regions. In such scenario, the initializer randomizes new instances after training batches during the GD update process. 2. Compare the new optimizers’ performance among four types of single-modal, multi-modal, perturbed single-modal, and perturbed multi-modal benchmark series versus baselines commonly used in GD-hybrid applications. This part suggests the function group that suits better for the proposed framework and confirms the necessity of choosing differentiable multi-scale objective functions.

3. Compare cooperative version of OHM (PSO based) with its competitive version (ICA based) to gain more insights about framework role on PSO and ICA.

2 Required Concepts

2.1 Particle swarm optimization

PSO is by default inspired by swarming behavior of particles in a search space of multiple dimensions. Swarms like fishes and birds have a sort of social behavior making them act and react in a common way and not deviate that much from normality. On the other hand, they have their own individualistic behavior. Kennedy and Eberhart have inspired this behavior and set up update formula (1) for unconstrained optimization problems [9,11]. The formula has two terms; the first is particle movement through the best particle’s solution and the second one is its movement towards the best global solution so far found. The update formula and newly produced solution is the form below:

where Vi = [ ] is the velocity for particle i; Xi= [] is the position of particle i; φ1are uniformly distributed random number and are generated independently for each dimension; ] is the best position found by particle i; ] is the best position found by the entire population; N is the dimension of the search space and k is the iteration number. After each update, the possibility of the betterment of best current G and also best P of the updated particle has to be checked out so as to change those values if necessary.

Adding inertial weight to the PSO update formula is reported to motivate exploration and hinders algorithm from getting trapped in local optima [11]. By having a big inertial weight w(k) as shown in (2), the exploration rate increases. While in contrast, by setting that small, the algorithm returns to exploitation mode due to ignorance of subsequent update values.

The significant characteristic of PSO lies in simultaneously updating social and cognitive search component. Higher weights of social component lead algorithm to search locally and higher weights of cognitive component make algorithm do global search. The situation in which the particle initials are selected far from each other makes PSO ready for better local searching or global searching when necessary. In PSO, each hierarchy level has its own elite.

After proposing the main framework, it will be shown that PSO update formula can be resulted from setting the hierarchy level parameter to 2 with a special update type. The increasing level of the hierarchy and the same time using the cooperative approach of PSO in solution update has one advantage. When regions of particles do not overlap in search space, the algorithm can test out best hierarchy levels for considering their organizations as particles and even change it adaptively.

2.2 Imperialist competitive algorithm

Generally, ICA is taken from sociopolitical behavior of imperialistic competition. First of all, the algorithm defines multiple sets of solutions as countries. Like the behavior of real countries, algorithm regards some countries as imperialists and major remaining as colonies of them. The countries are assigned to the most powerful imperialists. Power of each imperialist is inversely related to best solution cost of the country. The failed countries in the competition, i.e. those having low power with respect to others, are condemned to removal. Powerful empires attract colonies of weak empires making them weaker [12].

The process of ICA starts from a segmented series of solutions. After initializing countries as groups of variables, either country is tagged as empires or as imperialists. Imperialist location can be set as the best solution among countries the empire entails. The assimilation is achieved by propelling colonies through imperialist. This movement is formulated in Eq. 5.

Where U is a uniform random number generator with mean 0 and length β, is the direction towards imperialist and β is a custom variable. Algorithm of general ICA is as below. While stopping condition is not met;

• Exchange phase: Replacement of better colony (with lower cost) with the existing empire.

• Imperialistic competition: Competition of imperialists to take possession of colonies of each other.

• Elimination of weakest empires.

3 The proposed framework and algorithms

3.1 Organized hierarchical metaheuristic (OHM), the proposed framework

Fig. 1: OHM update process of convergence and reduction of hierarchies.

Table 1: Solution selection types

Table 2: Organization center Metrics. The location of search space in each organization

Table 3: Selectors for selecting organization

Table 4: Parameter specifications and best-tested metrics for each proposed

Table 5: Optimizers’ preferences

3.1.1 Proposed framework’s pseudo-code

To simulate ICA approach in the framework, the hierarchy level selector (i.e. organization) in Table 3 should be MinCostExcludingOrg, because for assimilating the elite, it excludes the swarms in which that solution resides. PSO selector, on the contrary, is set to EntailingOrg. It moves towards each of its swarms in elites containers that reside in the desired scale. For a detailed demonstration, please refer to Fig. 2. These selectors are remained fixed after initialization. The framework is summarized in the following algorithm:

ALGORITHM 1: Proposed framework’s pseudo-code

• Select hierarchy level based on the effectiveness of levels

• Select an organization in that hierarchy level based on metric II Table 3

• If rand < Update_Randomly_likelihood_Threshold

• Move the solution toward the considered location

• Update center of a changed organization in all hierarchies by metric III suggested in Table 2

• Find the cost of the new solution

• Remove low costs solutions

• Modify or prune hierarchies

• Update effectiveness

3.2 Proposed algorithms for the framework

3.2.2 Proposed organized hierarchical version of Imperialistic Competitive Algorithm (OHMICA)

Comparing ICA with PSO leads to two differences. First, instead of a social update phase of PSO, there exists imperialist competition attracting individuals towards other colonies. Second, there exists no cognitive update phase in ICA. Cognitive update term of PSO helps it get a global taste in solution updates. Proposed selectors in Table 3, MinCostExcludingOrg, makes OHMICA work in a competitive way, while selects a hierarchy level randomly for scale specifications. Afterwards, the starting point moves towards the center of organization which has the best solution. This approach simulates competition phase of ICA. Although the organization selector is fixed due to the structure of the algorithm, OHMICA can also get used with other organization center update metrics. Fig. 2 shows schematics of OHMPSO and OHMICA in parts (a) and (b) respectively. The blue star shown in each part is the selected solution for the update process. In part (a), solution moves towards the organization center of its own organization, leading to cooperation and intensification. Part (b) shows the solution moves towards centers of other hierarchy, giving a sense of competition to the algorithm, making it more explorative.

Most suitable parameters for the algorithm are selected and shown. Among all proposed selectors/metrics of Tables 1-3, parameter preferences have been chosen and shown in Table 4. Parameters performance are evaluated on average over absolute errors from 20 benchmark functions shown in Table 6, 7 in 10 runs.

4 Experimental Analysis and discussion

Three types of evaluations are approached to assess the capability of the proposed OHM framework. First and the main one is a hybrid version of Gradient Descent (GD) in which OHM is used as a random initializer. The objective function belongs to EEG subspace learning problem with sum of convex functions. As each convex region has its own width and scale, our approach is needed to search for initial seeds of GD more efficiently. The second evaluation finds global minima error of selected benchmark functions in a single setting with no approximators or hybrid modes. This test confirms the positive role of organized hierarchy in both competition and cooperation in two exploration and exploitation schemes. At last, the third evaluation compares different variants of OHM with each other to analyze the errors. The detailed descriptions of results are shown in the next three subsections.

Optimizer preferences are cleared up in Table 4. Proposed algorithms have been compared to set of algorithms which were either ICA based, or PSO based and also self-tuning-mode of them in which hierarchy level selector uses fitness distribution of currently sought solutions by a Roulette Wheel Selector to select hierarchy level.

In order to make OHM ready for global optimization of learning based models, OHM has to be implemented on simple ICA or PSO models. Equipping more complex variations of ICA or PSO with our proposed method, makes optimization time-consuming and intractable. As the objective of this study is to improve gradient-based initializers in context of swarm-based methods, a thorough fine-tuning has not been conducted on hyper-parameters. Swarm-based methods are easy to use, simple to implement, have high flexibility to modify and tune, and are the only algorithms hybridized with GD most frequently in the literature. Therefore, the authors preferred to use basic swarm-based context rather than baseline optimizers from CEC SOTA.

All the experiments are carried out in the same computer with Intel Core I 5 2.7GHz CPU, 3.00GB memory, and a Win7 32bit operation system. For both single and hybrid phases, the algorithm has been run 20 different times and the evaluation results of every run are recorded.

4.1 Optimization of subspace filtering objective function using hybrid OHM-GD method and comparison with other hybrid methods of GD

The hybrid optimizer in this case-study consists of two algorithms, GD, and its initializer as our proposed OHM method. It is evaluated on a subspace-learning problem named as weighted Common Spatial Patterns over trials. In order to define the main objective function, first, the reader has to get familiar with the Common Spatial Pattern (CSP).

4.1.1 Common spatial pattern (CSP)

CSP is a feature extraction method for classifying brain data in Brain Computer Interface. Due to its success in many neuroscientific case studies, it has turned into one of the most commonly used approaches for EEG features dimensionality reduction. The goal of CSP is to find projections of data to lower dimensions that have maximal data-variance per one label while having minimal variance per other labels [27]. For a single

4.1.2 Devlaminck’s work

Figure 3: Detailed flowchart of OHM framework.

Figure 2: Schematics of OHMPSO and OHMICA algorithms for the solution update process. Star in each color, being elite among

Applying a single optimization framework to learn both types of filters

This method tries to optimize CSP filter by a weighted average of covariance matrices over trials for finding appropriate weights. However, due to binary weights, the weights cannot be approximated using gradient methods. Decomposing the filter to two parts of subject dependent and subject independent part makes this method a powerful baseline in the context of our proposed objective function in 4.1.3.

4.1.3 Lotte and Guan’s work

The method proposed by Lotte and Guan regularizes the estimated covariance matrix by taking the mean covariance matrix of other subjects [32]. This kind of regularization may largely improve the estimation quality of high dimensional covariance matrix for scarce data. The estimation for subject ican be written as

Where is the covariance matrix of class c for the subject of interest, are the covariance matrices of the other i = 1 . . . n, subjects and λ [0 1] is a regularization parameter controlling the amount of information incorporated from other users. This method is based on a restrictive assumption, namely the similarity between covariance matrices of different subjects.

This method is designed only for subjects and relationships between them. In contrast, the complexity of this model puts it in the same class with our proposed subspace model. As a result, it can be regarded as another suitable comparison baseline.

4.1.4 The objective function for evaluation, Weighted Common Spatial Pattern (Wgt-CSP)

To evaluate the proposed hybrid optimizer versus previous hybrids like GPSO, a new nonconvex objective function has been proposed. This new objective function is a modification to the CSP function (section 3.2.2.1), to not only improve the accuracy of speech imagery classification but also to assert the capability of the proposed hybrid optimizer OHM-GD over single GD and GPSO.

The objective is to maximize the following objective function in CSP framework:

Where is the selected epoch’s covariance matrix of the class c which is NxN square matrix with N as the number of channels. w is one row of CSP projector. Each epoch could be ERP epoch. Depending of different ai and bi, the objective function will have different hessians for the localized convex functions. Therefore, the search space comprises convex regions in various widths and scales. As a result OHM is a necessary initializer to generate initializing seeds (solutions) in multiple scales to speed up finding the global optima. Obviously, PSO (or ICA) cannot generate seed as easy as OHMPSO, due to its single-scale behavior. It either misses initializing in low-width convex regions, or wastes individual seeds all scattered over wide convex regions. ERP is the average of a distinct number of epochs. This process is to lower the total number of epochs either for the sake of increasing precision or for passing a smaller number of covariance matrices to CSP algorithm. Negative of (11) is passed to the hybrid optimizers to perform minimization.

Using Rayleigh Quotient, both of the following equations when having the weights ai and bi can be simplified to a generalized eigenvalue problem and the projectors be optimized completely in a certain way. That makes the only uncertain part of optimization the process of finding ai and bi s. Hence, unlike Devlaminck et. al method which has the possibility of getting dumped into local optimal in all its parameters, this form helps the process of optimization be done in a more efficient and robust way while remaining more certain about the result.

4.1.5 The optimization process of the objective function

Due to the non-convexity of objective function over weights, the mere usage of GD without powerful random initializer cannot find the global minima and eventually causes the gradient method to undergo premature convergence.

The optimization method alternates between computing CSP using SVD power method with 7 iterations and OHM-GD for updating weights with 6 iterations. The OHM-GD algorithm is mentioned as below:

Input: : weights in (11) . N : number of GD iterations as 5. F: Objective function in (11) . Parameter settings for OHM. Output: Optimized

According to a number of independent components, randomly draw a positively valued matrix. by Nesterov method [26]

4.1.6 EEG dataset , assessment method, CSP usage

EEG data belongs to speech-imagery BCI experiments which are developed by Rostami et.al. [30]. Data is logged using a 16 channeled EEG recorder extracted from 6 subjects, aged between 23 and 30 who performed imagination of vowel sounds. Each subject has taken 180 trials which were approximately 36 trials for the imagination of five class each as a vowel. The sampling rate was 512 Hz for 4 seconds lasted imagery. For evaluating the proposed objective function, the main 5-folded cross-validation process described as follows: ALGORITHM 3: The 5-folded cross-validation pseudo-code

o Decimate data by the rate of 8; then bandpass filter data using a fifth order Chebyshev filter with band-pass of [3,30] Hz.

o Pass results in train and test data to Linear Discriminant Analysis (LDA) or Support Vector Machine (SVM) classifier with linear kernel and LibSVM library [33].

4.1.7 Values used for searching and tuning hyper-parameter

For the GD method, Adam is used due to its adaptive momentum and the weight-decay effect. Weight-decay stops jumping off the local minima by restricting movements towards current direction. The fine-tuning settings the tuner has searched over were {0.2,0.02,0.002} for learning-rate , {0.8,0.9,099} for momentum1 and momentum2 , and also the proposed OHM initializers’ reoccurrence occasion are fine-tuned for each

5,10 or 15 GD runs The best resulted parameters settings are 0.02 for learning-rate, 0.9 for momentum1, 0.8 for momentum2 and 5 for initializer rerun occasion. OHM initializer is tested out in two proposed scenarios OHMPSO and OHMICA with both bases PSO and ICA.

Error-bar in Fig. 4 shows that proposed hybrid optimizer of weighted CSP outperforms other similar averaging approaches like Devlaminck and Lotte (4.1.2 and 4.1.3). All accuracies in the error-bar are validation data averaged over 20 independent runs in first three subjects of speech imagery dataset [30]. All methods except Weighted CSP are evaluated with only single GD without metaheuristic for initialization due to their convex formulation and the fact that all CSP computations have been performed in the same way using SVD power method.

4.1.8 Evaluation results and analysis

After designing a suitable objective function for feature learning, the comparison baselines are defined for the OHM. Table 6 shows elapsed time, cost and standard-deviation averaged over 20 independent runs and also best accuracy evaluated on validation data. Two best results per each column are shown in bold. The comparison baselines are single GD [26], Gradient-based PSO (GPSO) [25], Comprehensive Learning PSO (CLPSO) [14] with GD, ICA [12] with GD, CICA [13] with GD and each of which has GD as its main algorithm. GPSO is a combination of GD optimizer with standard PSO which lacks multi-scaled search capability. In GPSO, initialization frequency is controlled by the parameter NG [25] and it is tuned to 5. CICA, which is a chaotic adjustment of colony direction angle, is selected due to its relevance and performance in chaotic fractal functions and their multi-scale local optimas. Setting PSO/ICA state of the art subsets as the baselines is inappropriate; because due to implementation, runtime, and complexity issues of combining with GD, they are not useful for gradient initializers. Yet, the standard benchmarks' SOTA methods are not necessarily suitable for comparing hybrid-GD approaches; because there are neither multi-modal differentiable benchmark standards to assess existing methods, nor general assessment of hybrid-GD-methods performance in the literature. The selected baselines are most common in machine learning applications, being used as initializers more frequently in the literature [13, 14, 23].

The analysis results show outperformance of both OHMPSO and OHMICA versus other hybrid algorithms especially G-PSO [25]. Also by using standard ICA, CICA, and CLPSO as GD initializer with the same mechanism as of OHMICA-GD, they have failed to outperform our hybrid method.

Due to the formulation of CSP subspace filtering problem, the GD version used in the proposed study was not stochastic like ones used in deep learning; but averaging over sub-batches and saving average cost per batch, may also help big-data models initialization in such case studies.

The validation accuracies also suggest that the decrease in the cost of the objective function over hybrid-GD mode is meaningful and improves the best GD evaluation accuracy mentioned in the 4th row of Table 6.

Table 6. Comparison of hybrid optimizers in terms of time, cost, and validation accuracy

Furthermore, Fig. 4 describes the Comparison of generalization accuracy in our Wgt-CSP method among previous subspace filtering approaches in EEG two-class classification study. Wgt-CSP outperformed other methods mostly when OHMPSO-GD is used as optimizer. Only in one case the result nearly equates normal CSP approach but averaged accuracies of methods described in 4.1.2 and 4.1.3 are less than our objective function. In Fig. 4, standard-deviations in three out of 4 Wgt-CSP modes are relatively low. However, one

of them are high and suggests some uncertainties yet have to be tackled by more thorough parameter tuning and preprocessing.

Fig. (4): Comparison of EEG classification accuracy in weighted-CSPs-over-trials (Wgt-CSP) among previous subspace approaches. Wgt-CSP outperformed other methods mostly when used OHMPSO-GD as the optimizer. In the legend, terms ‘Comp’, ‘SVM’, ‘LDA’, ‘Raw’ mean ‘number of projected subspaces’, ‘SVM classifier’, ‘Linear Discriminant Analysis’, and ‘without CSP feature extraction as raw features’ respectively.

4.2 Comparison of OHM versus ICA/PSO variants on benchmarks in a single setting

The benchmark functions used in this comparison are explained in Tables 7 and 8. Table 7 evaluates single-modal and multi-modal functions while Table 8 lists the stability functions in shifted and rotated modes. PValues for 20 independent runs over Table 7 and 8 benchmark functions are computed and all p-values are below 0.05 for failing to reject the null-hypothesis of OHM indifference versus other optimizers’ results. A set of 20 benchmark function has been used in which their specifications are brought in Tables 4 and 5. Algorithms have been run with NFE (number of function evaluations) destined dependent to corresponding benchmark function. All functions with NFE=30000 had 3-dimensional input vectors. Dimensions for 180000 and 500000 cases were 10 and 30 respectively.

The proposed OHM framework is not fully evaluated on CEC benchmarks over state of the art optimizers for three reasons. First of all, the optimizers baselines in this section are selected based on their usage frequency in gradient-based applications. Secondly, OHMPSO and OHMICA outperform CLPSO and CICA, respectively as elite in CEC competition 2014 [14] and trustworthy in previous neural network case studies [23]. Finally, due to the complexity and time-consuming run-time of state of the art optimizers, the authors refrained to compare the proposed method with them as they are not applicable while being hybridized with GD. The optimizers’ baselines used in Table 9 are twofold, ICA and CICA as ICA group, and PSO and CLPSO as PSO group.

The OHMICA and OHMPSO have been compared to proposed improvements of ICA and PSO which are CICA and CLPSO respectively due to more relative simplicity and plausibility for machine learning [13, 14, 23]. Although the OHM is not compared with earlier SOTA methods and they may generally outperform, but they are more computationally intensive, have less literature usage in GD-hybridized settings, and have undergone a more comprehensive hyperparameter tuning. Furthermore, this section also suggests the search space distributions in which the OHM acts more plausibly (multi-modal functions group in this case) and such assessment does not necessarily need SOTA baselines. Analysis results from Table 9 show that multimodal functions Ackley, Rastrigin, Schwefel, and Griewank have overcome these algorithms in OHM in a better way than single-modal counterparts. This verifies the results sought in multi-convex setting of

Section 4.1. Its reason is outstanding power of framework to bypass local optimas while its updater chooses a higher level of hierarchy during the search in a sub-hierarchy. This simultaneous searching in levels of hierarchy hinders from getting trapped in local optima.

This analysis is based on algorithms’ rank counted per each row of Table 9. Although results show acceptable performance in shifted and rotated functions over robust algorithms like CLPSO, PSO-W, and CICA; but OHMPSO manifested unsatisfactory results in Easom and Quartic. On the other hand, OHMPSO preformed well in multimodal functions like Schwefel. These findings correlate with the aforementioned ideas about cooperation and competition. Cooperative nature of OHMPSO helps it deal better with local search, while competitive nature of OHMICA makes it perform a more promising global search, resulting in better performance in multimodal functions.

Current parameter specifications of the algorithm is unable to handle Beale and Sphere. The more an algorithm acts cautiously for difficult situations, the less it is capable of acting fast in simple functions.

As Table 9 suggests, functions Ackley, Weierestrass, and Griewank have higher relative error improvement in both OHMICA and OHMPSO compared to Rastrigin, Easom, and Schwefel. Weirestrass is a self-similar function that contains the highest density of multi-scale local optimas compared to other benchmarks. Better OHM results in this fractal multi-scale function confirm the multi-scale search capability of our methods.

Moreover, Griewank and Ackley’s values are scattered in more scales than Rastrigin, Easom, and Schwefel. This happens because exponential and cosine values are mostly produced by each other in Ackley, Weierestrass, and Griewank functions; while in Rastrigin, Easom, and Schwefel, cosines and exponential functions are summed over each other and that makes high growth rate with multtiple varieties of tones nonexistent in the 2nd group . Results suggest there is extra capability in OHM for seeking solutions of fractal multi-scale functions and that once again confirms the necessity for hierarchical searching in nested swarms. This helps the neural nets and highly nonlinear subspace learning search spaces (as in Section 4.1) initialize in a better way.

Table 7: Benchmark functions to evaluate optimizers, Part 1.

Table 8- Benchmark functions used to assess optimizer’s stability. F12 to F16 are rotated functions and F17 to F20 are shifted rotated functions which are both for stability testing

Table 9: Optimization results. Comparisons of error from global minima in proposed OHMPSO and OHMICA algorithms versus known optimizer, those of more similar category. All functions

4.3 Comparison between OHM variants over selected benchmark functions in single mode

As mentioned in the beginning of Section 4, in self-tuning-mode of OHM, hierarchy level selector uses fitness distribution of currently sought solutions by a Roulette Wheel Selector to select hierarchy level. When this mode is active, the algorithm adaptively changes the significance of each hierarchy level. Performance comparison out of Table 10 shows outperformance of self-tuning case of OHM framework with normal OHM. In both cases of ICA based hierarchy and PSO based hierarchy, the function ”Beale” is seen to have its performance improved. Beale function is multimodal with sharp peaks which faster finding of its global optima has been achieved through automatic adaptation of the significance of hierarchy levels.

The self-tuning process worked fairly efficient for OHMPSO and led to negative results for the case of OHMICA. One reason could be that constant changing of hierarchy level effectiveness misguide competition process of OHMICA and combines and finally fades out borders among organizations. Another reason could be because of inappropriate initialization of level effectiveness. Because the initialization parameter for that was not tuned for the best contrarily to other parameters.

5 Conclusion

This paper introduces a new metaheuristic framework for basic PSO and ICA. The main objective is to improve GD initialization power using nested sub-hierarchies and super-hierarchies search used by PSO and ICA generalizations respectively. This framework has many applications in multi-convex and gradient-based models. Proposed algorithms out of framework have been tested out on a hybrid subspace learning study and also a single study with 20 benchmark functions over optimizers in a similar context. The results are satisfactory in improving GD, GPSO, and other hybrid algorithms. Insights from a single mode of benchmark evaluations unravel OHMICA’s performance for multi-modal cases while suggests that OHMPSO acts relatively more acceptable in stability and also exploration with exploitation.

In upcoming works, a more thorough parameter tuning will be approached for OHM to compare it with SOTA methods by CEC competition benchmarks for non-hybrid case studies used for tuning. OHM will be used in batched large-scale data in deep learning scenarios which need random initializer for Stochastic GD.

References

Busetti, F. (2003). Simulated annealing overview. World Wide Web URL www.geocities. com/francorbusetti/saweb. pdf.).

Hu, X., Eberhart, R. C., & Shi, Y. (2003, April). Particle swarm with extended memory for multiobjective optimization. In Proceedings of the 2003 IEEE Swarm Intelligence Symposium. SIS'03 (Cat. No. 03EX706) (pp. 193-197). IEEE.

Yang, X. S. (2010). A new metaheuristic bat-inspired algorithm. In Nature inspired cooperative strategies for optimization (NICSO 2010) (pp. 65-74). Springer, Berlin, Heidelberg.

Dorigo, M., & Blum, C. (2005). Ant colony optimization theory: A survey. Theoretical computer science, 344(2-3), 243-278.

Rashedi, E., Nezamabadi-Pour, H., & Saryazdi, S. (2009). GSA: a gravitational search algorithm. Information sciences, 179(13), 2232-2248.

Coello, C. A. C., & Cortés, N. C. (2005). Solving multiobjective optimization problems using an artificial immune system. Genetic Programming and Evolvable Machines, 6(2), 163-190.

Janson, S., & Middendorf, M. (2005). A hierarchical particle swarm optimizer and its adaptive variant. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 35(6), 1272-1282.

Chaturvedi, K. T., Pandit, M., & Srivastava, L. (2008). Self-organizing hierarchical particle swarm optimization for nonconvex economic dispatch. IEEE transactions on power systems, 23(3), 1079-1087.

R.C. Eberhart. (2001). Swarm Intelligence. Morgan Kaufmann Academic Press.

.CEberhart, R., & Kennedy, J. (1995, October). A new optimizer using particle swarm theory. In MHS'95. Proceedings of the Sixth International Symposium on Micro Machine and Human Science (pp. 39-43). Ieee.

Shi, Y. (2001). Particle swarm optimization: developments, applications and resources. In Proceedings of the 2001 congress on evolutionary computation (IEEE Cat. No. 01TH8546) (Vol. 1, pp. 81-86). IEEE.

Atashpaz-Gargari, E., & Lucas, C. (2007, September). Imperialist competitive algorithm: an algorithm for optimization inspired by imperialistic competition. In 2007 IEEE congress on evolutionary computation (pp. 4661-4667). IEEE.

Talatahari, S., Azar, B. F., Sheikholeslami, R., & Gandomi, A. H. (2012). Imperialist competitive algorithm combined with chaos for global optimization. Communications in Nonlinear Science and Numerical Simulation, 17(3), 1312-1319.

Liang, J. J., Qin, A. K., Suganthan, P. N., & Baskar, S. (2006). Comprehensive learning particle swarm optimizer for global optimization of multimodal functions. IEEE transactions on evolutionary computation, 10(3), 281-295.

Wolpert, D. H., & Macready, W. G. (1997). No free lunch theorems for optimization. IEEE transactions on evolutionary computation, 1(1), 67-82.

Lagarias, J. C., Reeds, J. A., Wright, M. H., & Wright, P. E. (1998). Convergence properties of the Nelder--Mead simplex method in low dimensions. SIAM Journal on optimization, 9(1), 112-147.

Chang, C. C., & Lin, C. J. (2011). LIBSVM: A library for support vector machines. ACM transactions on intelligent systems and technology (TIST), 2(3), 27.

Sapolsky, R. M. (2005). The influence of social hierarchy on primate health. Science, 308(5722), 648-652.

Roshandel, E., & Moattari, M. (2015, May). Novel line search based parameter optimization of multimachnie power system stabilizer enhanced by teaching learning based optimization. In 2015 23rd Iranian Conference on Electrical Engineering (pp. 1428-1433). IEEE.

Yang, X. S. (2010). A new metaheuristic bat-inspired algorithm. In Nature inspired cooperative strategies for optimization (NICSO 2010) (pp. 65-74). Springer, Berlin, Heidelberg.

Moattari, M., Parnianpour, P., & Moradi, M. H. (2017, May). Independent component analysis approach using higher orders of Non-Gaussianity. In 2017 Iranian Conference on Electrical Engineering (ICEE) (pp. 49-54). IEEE.

Bahrami, H., Faez, K., & Abdechiri, M. (2010, March). Imperialist competitive algorithm using chaos

theory for optimization (CICA). In 2010 12th International Conference on Computer Modelling and Simulation (pp. 98-103). IEEE.

Zhang, X. (2012). The application of imperialist competitive algorithm based on chaos theory in

Robbins, H., & Monro, S. (1951). A stochastic approximation method. The annals of mathematical

Noel, M. M. (2012). A new gradient based particle swarm optimization algorithm for accurate

Jakovetić, D., Xavier, J. M. F., & Moura, J. M. (2013). Convergence rates of distributed Nesterov-like

Wang, Y., Gao, S., & Gao, X. (2006, January). Common spatial pattern method for channel selelction

in motor imagery based brain-computer interface. In 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference (pp. 5392-5395). IEEE.

Parlett, B. N. (1974). The Rayleigh quotient iteration and some generalizations for nonnormal

Müller-Gerking, J., Pfurtscheller, G., & Flyvbjerg, H. (1999). Designing optimal spatial filters for

Rostami, M., & Moradi, M. H. (2015, November). Evidential multi-band common spatial pattern in brain

computer interface. In 2015 22nd Iranian Conference on Biomedical Engineering (ICBME) (pp. 16-20). IEEE..

Devlaminck, D., Wyns, B., Grosse-Wentrup, M., Otte, G., & Santens, P. (2011). Multisubject learning

for common spatial patterns in motor-imagery BCI. Computational intelligence and neuroscience, 2011, 8.

Lotte, F., & Guan, C. (2010). Regularizing common spatial patterns to improve BCI designs: unified

designed for accessibility and to further open science