Combining particle swarm optimization and genetic algorithms to improve software effort estimation

ABSTRACT


INTRODUCTION
Software effort estimation predicts the work, time, and staff needed to build a system. Estimating software development effort is a fundamental problem of project management, as many software projects fail owing to erroneous cost estimates and poor planning and scheduling. Unlike home construction or material manufacturing, software is a service and its products are intangible, which makes effort estimation harder. This intangibility, together with varied and unreliable datasets, complicates cost estimation, and it is difficult to build a single model that suits all kinds of software and datasets. In the worst case, poor estimation can cause a project to fail, so a precise approach for predicting costs is vital [1].
Cost estimation models that predict the cost of building a system in the early phases of a project are valuable and essential, since appropriate cost estimation keeps construction time and cost under control [2]. Boehm states that early-stage estimates of system construction effort vary between 25 and 40% of the actual effort [3]. In other words, the preliminary estimate made at the beginning of the construction process is often inaccurate because little knowledge of the project is available at that point, a view also shared by Heemstra [4].
Several methods have been proposed to estimate software project effort. They are either algorithmic or non-algorithmic. Algorithmic methods use mathematical models to estimate project cost: each algorithm defines a cost function, and the techniques in this category differ in which cost components they select and how the cost is calculated. The cost components are investigated first, and the cost is then computed from them by the chosen algorithm [5]. The flexibility of algorithmic techniques is poor, and they cannot provide adequate estimates for large and complicated projects. Popular algorithmic techniques include the constructive cost model (COCOMO) and software life cycle management (SLIM). Non-algorithmic methods, by contrast, are analytical: they require knowledge of similar past projects, and the estimation process depends on analyzing existing databases. In non-algorithmic techniques there are no explicit relations or equations; inference is used instead to estimate software cost.
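For illustration only, the following sketch shows the kind of mathematical model an algorithmic technique relies on: the basic COCOMO equation, Effort = a * KLOC^b, with the standard coefficients for the organic project class (a = 2.4, b = 1.05). These values are textbook defaults and are not taken from this paper.

```python
# Illustrative sketch of basic COCOMO, one of the algorithmic models mentioned above.
# The coefficients are the standard "organic" class values and are not from this paper.

def cocomo_basic_effort(kloc: float, a: float = 2.4, b: float = 1.05) -> float:
    """Estimate effort in person-months from size in thousands of lines of code (KLOC)."""
    return a * kloc ** b

if __name__ == "__main__":
    print(f"Estimated effort for 32 KLOC: {cocomo_basic_effort(32):.1f} person-months")
```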
Analogy-based estimation is the most prevalent non-algorithmic way of estimating software development effort. This technique forecasts the work necessary for a new project based on the resemblance of its attributes to those of completed ones [6]. The analogy-based technique consists of four parts: a historical dataset, a similarity function, retrieval criteria, and a solution function. The effort estimation procedure contains the following phases [7]: i) Gather information from past projects; ii) Choose measurement parameters such as function points (FP) and lines of code (LOC); iii) Retrieve prior projects and calculate project similarity; and iv) Estimate the target project effort. A sketch of this pipeline is given below.
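The following minimal sketch, written for illustration rather than taken from [6] or [7], shows the four parts of the analogy-based technique working together; the feature values and the choice of Euclidean similarity are hypothetical.

```python
# A minimal sketch of the analogy-based pipeline described above (historical
# dataset, similarity function, retrieval, solution function). Feature names
# and values are hypothetical; this is not the paper's implementation.
import math

history = [  # i) information from past projects: (feature vector, actual effort)
    ([2.0, 310.0, 5.0], 3400.0),
    ([1.5, 120.0, 3.0], 1100.0),
    ([3.0, 450.0, 8.0], 5200.0),
]

def similarity(a, b):
    """iii) Similarity as the inverse of Euclidean distance between feature vectors."""
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (dist + 1e-6)

def estimate(target_features, k=2):
    """iv) Solution function: mean effort of the k most similar past projects."""
    ranked = sorted(history, key=lambda p: similarity(target_features, p[0]), reverse=True)
    nearest = ranked[:k]
    return sum(effort for _, effort in nearest) / len(nearest)

print(estimate([2.2, 300.0, 6.0]))
```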
Case-based reasoning is one analogy-based method, and meta-heuristic algorithms can be used to improve such methods [8]. This study presents a new case-based method for estimating software effort that employs the particle swarm optimization (PSO) algorithm to boost efficiency. Analogy-based models perform better with weighted features, but PSO does not guarantee the global optimum, so the weights it finds are not necessarily ideal. Introducing an additional optimization method can improve this weighting, as hybrid algorithms search more thoroughly for the global optimum. By integrating PSO with a genetic algorithm we aim for more precise weights and better estimates. The rest of this paper is structured as follows. Section 2 reviews the most important related works. Section 3 explains the proposed approach in detail. Section 4 evaluates the effectiveness of this approach, and Section 5 concludes the paper and suggests future work.

RELATED WORKS
The Delphi model was established by Dalkey and Helmer [9] to estimate software effort. This approach is non-algorithmic; such solutions were offered because algorithmic approaches cannot handle the dynamic behavior of software projects in the early phases. In this strategy, experts exchange their estimates of the amount of work and iterate toward a final consensus so that all experts converge on a joint estimate. Boehm [10] proposed the notion of algorithmic approaches based on LOC. He created a novel model called COCOMO for evaluating software development effort that employed empirical equations. COCOMO is an experimental model created using data from many software projects. These data are examined to derive formulas that best fit the observations, and the formulas connect system and product size, team, and project parameters to the amount of effort necessary to construct the system. Other models, such as SLIM and SEER-SEM, followed the COCOMO approach. Albrecht and Gaffney [11] presented one of the most important innovations in software measurement, dubbed FP, which allowed measuring in the early phases of the project and largely avoided the drawbacks of the earlier technique, LOC. In prior approaches it was essential to first estimate the LOC, which cannot be known accurately until the end of implementation, causing substantial estimation error; FP solves this problem to a significant extent. Walkerden and Jeffery [12] sought to forecast effort using comparable projects. They excluded characteristics that did not fit the project from the dataset and then selected only projects in the project region. They compared the results of their model with univariate linear regression on FPs and found that analogy-based estimates can outperform FPs and algorithmic models, but they did not weight the project characteristics.
Features are often rated linguistically, for example as low, complex, or significant. Such linguistic values can be derived from numeric ones; when generated from numbers they are commonly represented as classical (crisp) intervals. However, crisp linguistic values do not reproduce human interpretation, so errors and uncertainty cannot be avoided. Idri et al. [13] presented the fuzzy analogy approach to overcome this problem. In fuzzy analogy, fuzzy sets are employed instead of classical intervals, and fuzzy scales convert numbers to linguistic values. Azzeh et al. [14] compared the numerical and categorical properties of software projects and estimated the similarity of two projects via fuzzy clustering and fuzzy logic. To test this software similarity technique, it was compared against weighted Euclidean distance, and the findings suggest it is as valid as case-based reasoning methods. When the property values of two projects fall in the same fuzzy set, the projects are regarded as similar. This technique uses the Gaussian function and requires many project features. Attarzadeh and Ow [15] used machine learning and pattern recognition to estimate software costs. Artificial neural networks learn from past data and establish relationships between variables, and the authors employed a neural network to estimate cost. The neural network approach can learn complicated functions and is more accurate than previous methods such as COCOMO. However, the neural network method cannot operate until the characteristics are translated to quantitative attributes, and it is not well suited for datasets with missing values. Azzeh et al. [16] improved software estimates using analogy and fuzzy numbers. Their technique is based on experimental data and is compared to case-based reasoning and stepwise regression, and the findings demonstrate that their method works better. Their focus is on measurement uncertainty in the model, which is connected to analogy-based software estimation. Amazal et al. [17] demonstrated that analogy-based estimation approaches are useful. They employed fuzzy analogy for non-numerical categorical data; the fuzzy method is used to cluster large datasets with batch values. Based on the clustering results, software project similarity was analyzed and the effort of a new project was computed using the closest scales. The correctness of this structure improves the estimation accuracy of the suggested method. Kumari and Pushkar [18] improved the analogy-based technique with a multi-objective genetic algorithm. They argued that project selection in the analogy-based method has a large influence and that interactions between projects might affect software cost estimates. Their approach was implemented using the COCOMO and NASA datasets and achieved improved results; however, because they utilized a meta-heuristic method, the model may become trapped in a local optimum. Idri and Abnane [19] constructed a fuzzy analogy-based model and compared it with six other effort-estimation strategies on multiple datasets. Fuzzy analogy models perform better than the other approaches, although in this work the method must employ numerical characteristics.
Wu et al. [8] improved the case-based reasoning technique. By merging this method with the particle swarm meta-heuristic algorithm, they presented a novel hybrid methodology, applied it to the Maxwell and Desharnais datasets, and obtained good results. Ezghari and Zahi [20] devised a fuzzy analogy-based technique to estimate software development effort. The technique is evaluated on 13 software datasets and the results are compared with similar studies; the authors report that their approach has superior performance and accuracy. Mustafa and Abdelwahed [21] constructed a random forest model that was empirically tuned by adjusting its main parameters, introduced to improve software project effort estimation. In that study, the performance of the optimized random forest model is compared to a traditional regression tree, and the authors conclude that the optimized random forest model performs better than the regression tree model in all assessment criteria.
Shah et al. [22] used the artificial bee colony methodology to increase software development estimation accuracy. In the training step, the most relevant parameters of the artificial bee colony algorithm are computed; in testing, the accuracy of the effort estimation is assessed. Shahpar et al. [23] suggested an evolutionary ensemble analogy-based cost estimation approach that combines a genetic algorithm with analogy-based methodologies. It has been tested on the Maxwell, Albrecht, Kemerer, and Desharnais datasets using MRE variants and PRED (0.25). Samavatian and Mohebbi [24] employed cuckoo search to estimate effort and used particle swarm optimization to analyze the outcomes. Applying these algorithms sequentially leads to a more precise search of the problem space, increasing the chance of obtaining the global optimum, i.e., the best features. Their technique is examined using the COCOMO 81 and COCOMO NASA datasets.
Shahpar et al. [25] used particle swarm optimization and simulated annealing to estimate project effort with analogy-based models. To find the ideal model, they optimized the analogy-based estimation parameters, such as feature weights and similar projects, and a polynomial equation then maps the improved model's effort to the final estimate. The method was tested on Maxwell, Albrecht, COCOMO 81, Desharnais, and Kemerer. Dashti et al. [26] provided a method to estimate software cost by optimizing feature weights with the learnable evolution model, testing their system on Desharnais and Maxwell using MMRE, PRED (0.25), and MdMRE. A review of current publications demonstrates that many researchers are using meta-heuristic and optimization methods to improve software effort estimation.

RESEARCH METHOD
The analogy-based strategy is more accurate when software features are weighted precisely. In this research, we improve feature weights using PSO and a genetic algorithm. PSO is a meta-heuristic algorithm that solves problems well [27], and the genetic algorithm is a popular nature-inspired meta-heuristic [28]. Combining the two techniques reduces the likelihood of being trapped at a local optimum and improves the chance of finding globally optimal feature weights. Figure 1 depicts the proposed method.
Using this approach, a population of particles with a random distribution is generated. These particles have the following attributes: i) Position: the location of each particle in each algorithm iteration; ii) Best position: the best location that each particle has held during the algorithm's execution; iii) Velocity: the displacement of each particle in every iteration, updated at each iteration; iv) Cost: the value of the cost function at the particle's current position; and v) Best cost: the lowest cost the particle has attained during the algorithm's execution. To generate weights that lead to more accurate estimates, an appropriate cost function is required to assess the performance of the generated weights. The following assessment criteria, which are commonly used to measure the accuracy of software cost estimates, were used to construct this function.
The cost function is defined in (1) as

TotalCost = MMRE + MdMRE - PRED(0.25)     (1)

where MMRE is the mean magnitude of relative error, MdMRE is the median magnitude of relative error, and PRED is the percentage of relative error deviation [29]; Section 4 defines these criteria. MMRE and MdMRE are summed because the goal is always to drive them toward zero, while PRED, whose ideal value is one, is subtracted. As a consequence, the cost function assigns lower (better) values to weights that have the potential to produce more accurate estimates. Before running the algorithm, some inputs must be specified, such as the number of features to be weighted (Sizefeature), the initial population of particles (Sizepopulation) (in this study, particles encode feature weights), the lower (Limitlow) and upper (Limitup) limits for each feature weight, and the number of algorithm iterations (Itermax). After these variables have been initialized, the algorithm constructs the starting population: the particles are given initial positions chosen at random, their costs are computed using the cost function, and they are sorted in ascending order of cost (this gives easier access to good genes in the genetic algorithm). The best position and cost are then saved into a variable as the best global record for the whole population. A sketch of this initialization is given below.
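The sketch below illustrates the cost function of (1) and the random initialization of the particle population; the parameter names mirror the text (Sizepopulation, Limitlow, Limitup), but the evaluate callback stands in for the analogy-based estimator and the details are assumptions rather than the authors' code.

```python
# Illustrative sketch of the cost function (1) and particle initialization.
# `evaluate` is a hypothetical callback that runs the analogy-based estimator
# with a candidate weight vector and returns its TotalCost.
import random
import statistics

def total_cost(actual, estimated):
    """TotalCost = MMRE + MdMRE - PRED(0.25); lower is better."""
    mre = [abs(a - e) / a for a, e in zip(actual, estimated)]
    mmre = sum(mre) / len(mre)
    mdmre = statistics.median(mre)
    pred_25 = sum(1 for r in mre if r <= 0.25) / len(mre)
    return mmre + mdmre - pred_25

def init_population(size_population, size_feature, limit_low, limit_up, evaluate):
    """Create random particles (feature-weight vectors) and sort them by cost."""
    particles = []
    for _ in range(size_population):
        position = [random.uniform(limit_low, limit_up) for _ in range(size_feature)]
        cost = evaluate(position)                # cost at the current position
        particles.append({
            "position": position,
            "velocity": [0.0] * size_feature,
            "cost": cost,
            "best_position": list(position),     # personal best so far
            "best_cost": cost,
        })
    particles.sort(key=lambda p: p["cost"])      # easier access to good genes later
    return particles
```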

Analogy-based Estimation
In the proposed method there are three loops: i) Main loop: at this level, a condition determines whether the algorithm is caught in a local optimum. If the best cost does not improve after four iterations of the main loop, a revolution happens and the positions of all particles are re-initialized randomly, except for one particle that is handed to the next iteration as the superior member of the current generation. After this check, the PSO and genetic loops are executed; ii) PSO loop: here, each particle's velocity is updated and its movement is checked against the allowed range of the features (otherwise the velocity is reversed). Position and cost are recalculated once particles have moved at the determined velocity. If a particle reaches a better position than its previous best, that personal best is replaced, and if the cost is lower than the best cost experienced by the whole population, the best global record is updated; and iii) Genetic loop: first, particles are crossed over at a defined rate (Ratecrossover) to create a new population, with each feature kept between Limitlow and Limitup. Then particles are mutated at a specific rate (Ratemutation), with the range controlled in the same way as for crossover. Because particle velocity is used in the PSO part of the proposed technique, this attribute must also be recombined during crossover and mutation. The original, crossed-over, and mutated populations are merged into a new population; the particles are sorted by cost and only the first Sizepopulation of them are kept for the next iteration. Finally, the best global record is updated if needed. A sketch of the hybrid loop appears below.
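The following sketch outlines the hybrid loop under stated assumptions: the PSO velocity update uses standard inertia and acceleration coefficients (w, c1, c2), and the arithmetic crossover and Gaussian mutation are illustrative choices, since the paper does not specify these operators.

```python
# A sketch of the hybrid PSO-GA loop described above. The velocity-update
# coefficients and the crossover/mutation operators are assumptions; only the
# overall loop structure (revolution, PSO loop, genetic loop) follows the text.
import random

def hybrid_pso_ga(particles, evaluate, limit_low, limit_up, iter_max,
                  rate_crossover=0.7, rate_mutation=0.2,
                  w=0.7, c1=1.5, c2=1.5, stall_limit=4):
    n = len(particles[0]["position"])
    global_best = min(particles, key=lambda p: p["best_cost"]).copy()
    stall = 0
    for _ in range(iter_max):
        # Main loop check: "revolution" if no improvement for stall_limit iterations.
        if stall >= stall_limit:
            for p in particles[1:]:              # keep one superior particle unchanged
                p["position"] = [random.uniform(limit_low, limit_up) for _ in range(n)]
                p["cost"] = evaluate(p["position"])
            stall = 0
        # PSO loop: update velocity and position, clamp to the feature range.
        for p in particles:
            for i in range(n):
                p["velocity"][i] = (w * p["velocity"][i]
                                    + c1 * random.random() * (p["best_position"][i] - p["position"][i])
                                    + c2 * random.random() * (global_best["position"][i] - p["position"][i]))
                new_pos = p["position"][i] + p["velocity"][i]
                if not limit_low <= new_pos <= limit_up:
                    p["velocity"][i] = -p["velocity"][i]      # reverse velocity at the boundary
                    new_pos = min(max(new_pos, limit_low), limit_up)
                p["position"][i] = new_pos
            p["cost"] = evaluate(p["position"])
            if p["cost"] < p["best_cost"]:
                p["best_cost"], p["best_position"] = p["cost"], list(p["position"])
        # Genetic loop: arithmetic crossover and Gaussian mutation on positions and velocities.
        offspring = []
        for _ in range(int(rate_crossover * len(particles)) // 2):
            a, b = random.sample(particles, 2)
            alpha = random.random()
            pos = [alpha * x + (1 - alpha) * y for x, y in zip(a["position"], b["position"])]
            vel = [alpha * x + (1 - alpha) * y for x, y in zip(a["velocity"], b["velocity"])]
            cost = evaluate(pos)
            offspring.append({"position": pos, "velocity": vel, "cost": cost,
                              "best_position": list(pos), "best_cost": cost})
        for p in random.sample(particles, int(rate_mutation * len(particles))):
            pos = [min(max(x + random.gauss(0, 0.1), limit_low), limit_up) for x in p["position"]]
            cost = evaluate(pos)
            offspring.append({"position": pos, "velocity": list(p["velocity"]), "cost": cost,
                              "best_position": list(pos), "best_cost": cost})
        # Merge, sort by cost, and keep only the first Sizepopulation particles.
        particles = sorted(particles + offspring, key=lambda p: p["cost"])[:len(particles)]
        # Update the best global record and the stagnation counter.
        if particles[0]["cost"] < global_best["best_cost"]:
            global_best = {"position": list(particles[0]["position"]),
                           "best_cost": particles[0]["cost"]}
            stall = 0
        else:
            stall += 1
    return global_best
```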
After weights have been generated for the project features according to the required effort, they are employed in the similarity function of the analogy-based method to locate the project closest to the current one. Similarity compares characteristics using distance metrics, and the similarity function in this study uses the following distance measures: i) Euclidean distance: the ordinary distance between the coordinates of two points, widely used for comparing distances in optimization tasks. It is computed as

d(p, p') = sqrt( Σ_{i=1..n} W_i (f_i - f_i')^2 )     (2)

where p and p' represent projects, W_i is the weight assigned to each feature, f_i and f_i' are the features of each project, and n is the number of features. A small constant δ is also used to obtain non-zero results. ii) Manhattan distance: the distance between two points measured along axes at right angles, i.e., the weighted sum of absolute feature differences, with the same parameter description as for the Euclidean distance. iii) Grey relational grade: calculated via the following steps: a. Formation of the decision matrix, consisting of criteria and options (rows are options, columns are criteria). b. Normalization of the decision matrix: because the indicators in the decision matrix have different types and scales, they must be rescaled to facilitate evaluation and comparison; normalization formulas are used for this stage.
In the normalized matrix all values lie between zero and one, and values close to one are preferable. c. Calculation of the grey relational coefficient, in which the distinguishing coefficient r lies between zero and one and is usually set to 0.5. The grey relational coefficient expresses the degree of desirability, i.e., the proximity of each option x_i to the reference option x_0; the larger the coefficient, the closer x_i is to x_0. d. Ranking of options: in this step, the final score of each option is calculated and the options are ranked based on it.
The grey relational score compares each option to the ideal option, and this score determines precedence. iv) Minkowski distance: if A and B are p-dimensional points, their distance of order d is computed as

dist(A, B) = ( Σ_{i=1..p} |a_i - b_i|^d )^(1/d).

After the difference in each characteristic is found, the result is multiplied by the corresponding feature weight to obtain the distance. Once the most similar projects have been retrieved, the solution function of analogy-based estimation is applied. As solution functions, this research uses the average (mean), median, weighted mean, and inverse distance weighted mean of the efforts of similar projects; the weighted mean and inverse distance weighted mean are given in (12) and (13), in which the inputs are the development efforts of the k similar projects and the output is the estimated effort [30], [31]. Table 1 shows the parameters of the approach.
The Maxwell and Desharnais datasets were used to validate the suggested method. The Maxwell dataset comprises 62 projects from a Finnish bank [32]; each project has 25 characteristics, and effort is defined as the number of working hours from specification to delivery. Desharnais contains 81 Canadian software projects [33]; every project has 10 features, and effort is measured in person-hours. Normalization with the Min-Max approach was employed for data preparation and preprocessing. In this procedure, each set of data is mapped to an arbitrary interval whose minimum and maximum values are known, so any interval can be mapped to a new one with a simple conversion. Suppose feature A, lying between min_A and max_A, is to be transferred to the interval [new_Min, new_Max]; any original value v is then translated to the new value v' via

v' = ((v - min_A) / (max_A - min_A)) * (new_Max - new_Min) + new_Min.

A sketch of the similarity and solution functions, together with this normalization, follows.
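The helpers below sketch the distance measures, solution functions, and Min-Max normalization just described; the exact weighting used in (12) and (13) and the simplified grey relational computation are assumptions, not the paper's definitions.

```python
# Illustrative helpers for the similarity and solution functions above.
# The weighted-mean and inverse-distance forms are standard formulations and
# may differ in detail from equations (12) and (13); the grey relational grade
# is a simplified, per-row version of the standard formulation.
import statistics

def minmax(values, new_min=0.0, new_max=1.0):
    """Min-Max normalization: map each value of a feature to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min for _ in values]
    return [((v - lo) / (hi - lo)) * (new_max - new_min) + new_min for v in values]

def euclidean(f, g, w, delta=1e-4):
    """Weighted Euclidean distance of (2); delta keeps the result non-zero."""
    return sum(wi * (a - b) ** 2 for wi, a, b in zip(w, f, g)) ** 0.5 + delta

def manhattan(f, g, w):
    """Weighted Manhattan distance (sum of absolute feature differences)."""
    return sum(wi * abs(a - b) for wi, a, b in zip(w, f, g))

def minkowski(f, g, w, d=3):
    """Weighted Minkowski distance of order d."""
    return sum(wi * abs(a - b) ** d for wi, a, b in zip(w, f, g)) ** (1.0 / d)

def grey_relational_grade(reference, option, r=0.5):
    """Simplified grey relational grade: delta_min/delta_max are taken over this
    row only, not over the whole decision matrix (an assumed simplification)."""
    deltas = [abs(a - b) for a, b in zip(reference, option)]
    d_min, d_max = min(deltas), max(deltas)
    coeffs = [(d_min + r * d_max) / (d + r * d_max) for d in deltas]
    return sum(coeffs) / len(coeffs)

# Solution functions: combine the efforts of the k most similar projects.
def mean_effort(efforts):
    return sum(efforts) / len(efforts)

def median_effort(efforts):
    return statistics.median(efforts)

def weighted_mean_effort(efforts, weights):
    """Weighted mean of similar-project efforts (weights assumed to be similarity scores)."""
    return sum(w * e for w, e in zip(weights, efforts)) / sum(weights)

def idw_mean_effort(efforts, distances):
    """Inverse distance weighted mean: closer projects contribute more."""
    inv = [1.0 / d for d in distances]
    return sum(i * e for i, e in zip(inv, efforts)) / sum(inv)
```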

RESULTS AND DISCUSSION
There are many different criteria that may be used to evaluate software development effort estimates; Table 2 presents the criteria considered relevant for this research [34]. Previous research, in particular [35], has emphasized that methods such as n-fold cross-validation generally produce results with high variance, and that to obtain more accurate results in effort estimation, methods such as leave-one-out cross-validation (LOOCV) or hold-out should be applied [36]. The LOOCV approach has been used to validate the algorithm in this investigation, since the data volume is not very large and every sample can thereby serve as test data. First, the MRE illustrates the divergence between the predicted and actual effort; the best case is zero, and the value rises as forecasts deviate from reality. Figure 2 shows the MRE on the two datasets. The chart shows that a few projects from both datasets have large errors because their effort levels differ markedly from the rest. In some fields of study such examples could be excluded as outliers or normalized away, but since both the Desharnais and Maxwell datasets are drawn from genuine projects collected with great accuracy, a recording error is improbable and we cannot treat these cases as outliers. Such examples are rare, however, and because no similar projects can be located for them with the analogy-based strategy, it is expected that these instances have a high error rate. A sketch of the LOOCV evaluation is given below.
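A minimal LOOCV sketch, assuming a hypothetical estimate(train, features) helper that performs the weighted analogy-based estimation:

```python
# Minimal LOOCV sketch. `estimate` is a hypothetical helper standing in for the
# weighted analogy-based estimator; `projects` is a list of
# (feature_vector, actual_effort) pairs.
import statistics

def loocv(projects, estimate):
    mre = []
    for i, (features, actual) in enumerate(projects):
        train = projects[:i] + projects[i + 1:]        # leave one project out
        predicted = estimate(train, features)
        mre.append(abs(actual - predicted) / actual)   # MRE for this project
    return {
        "MMRE": sum(mre) / len(mre),
        "MdMRE": statistics.median(mre),
        "PRED(0.25)": sum(1 for r in mre if r <= 0.25) / len(mre),
    }
```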
We then calculated MMRE, MdMRE, and PRED (0.25). For comparison, the values of these criteria are also taken from [8], a well-known related study; the outcome is shown in Figure 3. We defined TotalCost as an aggregate of the aforementioned criteria, calculated by equation (1). As can be observed, this criterion has also improved in comparison to previous studies of a similar nature. For Desharnais the total cost is negative, which is to be expected given that, in the best case, both MMRE and MdMRE are 0 and PRED is 1, which results in a TotalCost of -1.
To facilitate a deeper level of comprehension, the percentage improvement in the assessment criteria achieved by the proposed method, relative to previous work of a similar nature, has been determined for each dataset and is presented in Table 3. The evaluation results allow us to conclude that combining PSO and genetic algorithms reaches more suitable optimal points and reduces the likelihood of getting trapped in a local optimum compared with using either algorithm separately. Consequently, once a global optimum for the feature weights of the projects has been found and these weights are applied in the analogy-based technique, more accurate estimations with a lower margin of error can be carried out.
As part of another evaluation, we carried out a number of trials to determine which combination of solution and similarity functions yields the best fit. To do this, a grid search was conducted across all of the potential scenarios, covering: i) Number of projects: the number of selected projects with the nearest similarity to the target project, between 1 and 4; and ii) Similarity functions: six distance functions, Euclidean among them. The grid search results for the Maxwell and Desharnais datasets are shown in Tables 4 and 5, respectively. As observed, the cost varies across the different similarity and solution functions, but it is the number of comparable projects used to estimate a project's cost that most affects the error rate. In both Maxwell and Desharnais the lowest error is attained with 4 comparable projects, and changing the similarity or solution functions does not further reduce the error. If the number of selected projects is smaller than 4, the error rate rises, and adjusting the similarity or solution functions has minimal effect. In the analogy-based technique, the number of comparable projects selected to estimate the cost of a software project is therefore more important than the choice of similarity and solution functions. A sketch of this grid search is given below.
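For illustration, a grid search over the configurations described above might look as follows; evaluate_config is a hypothetical helper that runs LOOCV with a given configuration and returns the TotalCost of (1).

```python
# Illustrative grid search over the analogy-based configuration.
# `similarity_functions` and `solution_functions` are dictionaries of candidate
# callables, and `evaluate_config` is a hypothetical helper that runs LOOCV with
# the chosen configuration and returns its TotalCost.
from itertools import product

def grid_search(projects, similarity_functions, solution_functions, evaluate_config):
    best = None
    for k, sim_name, sol_name in product(range(1, 5),          # number of analogies (1..4)
                                         similarity_functions,
                                         solution_functions):
        cost = evaluate_config(projects,
                               k=k,
                               similarity=similarity_functions[sim_name],
                               solution=solution_functions[sol_name])
        if best is None or cost < best["cost"]:
            best = {"k": k, "similarity": sim_name, "solution": sol_name, "cost": cost}
    return best
```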