As software grows in complexity and our dependency on it increases, software products may suffer from low quality, high cost, and poor maintainability. Software defects usually produce incorrect or unexpected results and behaviors. Accordingly, software defect prediction (SDP) is one of the most active research fields in software engineering and plays an important role in software quality assurance. Based on the results of SDP analyses, developers can subsequently conduct defect localization and repair with reasonable resource allocation, which helps to reduce maintenance costs.
This book offers a comprehensive picture of the current state of SDP research. More specifically, it introduces a range of machine-learning-based SDP approaches proposed for different scenarios, i.e., within-project defect prediction (WPDP), cross-project defect prediction (CPDP), and heterogeneous defect prediction (HDP). In addition, the book shares in-depth insights into the performance of current SDP approaches and lessons learned for future SDP research efforts.
We believe these theoretical analyses and emerging challenges will be of considerable interest to researchers, graduate students, and practitioners who want to gain deeper insights into SDP and/or find new research directions in it. The book offers a comprehensive introduction to the current state of SDP together with detailed descriptions of representative SDP approaches.
With the increasing pressure to expedite software projects that are constantly growing in size and complexity to meet rapidly changing business needs, quality assurance activities such as fault prediction have become extremely important. The main purpose of a fault prediction model is the effective allocation or prioritization of quality assurance effort (test effort and code inspection effort). Construction of these prediction models is mostly dependent on historical software project data, referred to as a dataset.
However, a prevalent problem in data mining is the skewness of a dataset, and fault prediction datasets are not excluded from this phenomenon. In most datasets, the majority of instances are clean (not faulty), while conventional learning methods are primarily designed for balanced datasets. Common classifiers such as neural networks (NN), support vector machines (SVM), and decision trees work by optimizing their objective functions toward maximum overall accuracy, the ratio of correctly predicted instances to the total number of instances. Training a classifier on an imbalanced dataset will therefore most likely produce a model that over-predicts the majority class and assigns a lower probability to the minority (faulty) modules; when the model does predict the minority class, it often has a higher error rate than for the majority class. This degrades the prediction performance of classifiers, and in machine learning the issue is known as learning from imbalanced datasets. Several methods have been proposed in machine learning for dealing with the class imbalance issue, such as random over- and under-sampling, creating synthetic data, applying cleaning techniques to sampled data, and cluster-based sampling. Despite a significant amount of machine learning literature on imbalanced datasets, very few studies have tackled the issue in the area of fault prediction.
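To make the sampling idea concrete, the following is a minimal sketch of random over-sampling, one of the methods mentioned above. It is not taken from the book; the feature matrix, labels, and classifier are synthetic placeholders, assuming a binary defect label where 1 marks a faulty module.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_oversample(X, y, seed=0):
    """Duplicate minority-class instances until both classes are the same size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    minority_idx = np.flatnonzero(y == minority)
    # Sample minority indices with replacement to match the majority count.
    extra = rng.choice(minority_idx, size=counts.max() - len(minority_idx), replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

# Toy example: 20 clean modules (label 0) and 4 faulty ones (label 1).
rng = np.random.default_rng(42)
X = rng.normal(size=(24, 5))          # 5 synthetic static code metrics per module
y = np.array([0] * 20 + [1] * 4)

X_bal, y_bal = random_oversample(X, y)
clf = DecisionTreeClassifier(random_state=0).fit(X_bal, y_bal)
print(np.bincount(y_bal))             # both classes now have 20 instances
```

Under-sampling, synthetic data generation (e.g., SMOTE-style interpolation), and cluster-based sampling follow the same pattern: rebalance the training set before fitting the classifier, leaving the test set untouched.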
In Chapter 2, several common learning algorithms and their applications in software defect prediction are briefly introduced, including deep learning, transfer learning, dictionary learning, semi-supervised learning, and multi-view learning. In many real-world applications, it is expensive or impossible to re-collect the needed training data and rebuild the models, so it would be desirable to reduce the effort of re-collecting training data. In such cases, transfer learning (TL) between task domains is attractive. Transfer learning exploits the knowledge gained from a previous task to improve generalization on another, related task; it is useful when there is not enough labeled data for the new problem or when the computational cost of training a model from scratch is too high. Traditional data mining and machine learning algorithms make predictions on future data using statistical models trained on previously collected labeled or unlabeled data, and most of them assume that the training and test data follow the same distribution. Transfer learning, in contrast, allows the domains, tasks, and distributions used in training and testing to be different.
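As an illustration only (not one of the approaches described in Chapter 2), the following sketch shows a simple instance-weighting form of transfer for a cross-project setting: a well-labeled source project is pooled with a handful of labeled target modules, and the target instances are up-weighted so the model leans toward the target distribution. All data and the weight of 10.0 are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: X_src/y_src come from a well-labeled source project;
# X_tgt_few/y_tgt_few are a small labeled sample from the target project.
rng = np.random.default_rng(0)
X_src, y_src = rng.normal(size=(200, 5)), rng.integers(0, 2, 200)
X_tgt_few, y_tgt_few = rng.normal(loc=0.5, size=(10, 5)), rng.integers(0, 2, 10)

# Pool source and target data; up-weight the scarce target instances so the
# classifier is biased toward the target distribution (instance-based transfer).
X_pool = np.vstack([X_src, X_tgt_few])
y_pool = np.concatenate([y_src, y_tgt_few])
weights = np.concatenate([np.ones(len(y_src)), 10.0 * np.ones(len(y_tgt_few))])

clf = LogisticRegression(max_iter=1000)
clf.fit(X_pool, y_pool, sample_weight=weights)

X_tgt_new = rng.normal(loc=0.5, size=(5, 5))   # unlabeled target modules
print(clf.predict(X_tgt_new))
```

More elaborate TL methods for CPDP and HDP reweight or transform instances and features automatically rather than using a fixed weight, but the underlying goal is the same: reuse source-project knowledge while adapting to the target project's distribution.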