publications
publications by category, in reverse chronological order. generated by jekyll-scholar.
2026
- [ICSE] Misbehaviour Forecasting for Focused Autonomous Driving Systems Testing. Molla Mohammad Abid Naziri, Stefano Carlo Lambertenghi, Andrea Stocco, and Marcelo d’Amorim. In Proceedings of the 51st International Conference on Software Engineering, 2026.
Simulation-based testing is the standard practice for assessing the reliability of self-driving cars’ software before deployment. Existing bug-finding techniques are either unreliable or expensive. We build on the insight that near misses observed during simulations may point to potential failures. We propose Foresee, a technique that identifies near misses using a misbehavior forecaster that computes possible future states of the ego-vehicle under test. Foresee performs local fuzzing in the neighborhood of each candidate near miss to surface previously unknown failures. In our empirical study, we evaluate the effectiveness of different configurations of Foresee using several scenarios provided in the CARLA simulator on both end-to-end and modular self-driving systems and examine its complementarity with the state-of-the-art fuzzer DriveFuzz. Our results show that Foresee is both more effective and more efficient than the baselines. Foresee exposes 128.70% and 38.09% more failures than a random approach and a state-of-the-art failure predictor while being 2.49× and 1.42× faster, respectively. Moreover, when used in combination with DriveFuzz, Foresee enhances failure detection by up to 93.94%.
@inproceedings{2026-Naziri-ICSE, title = {Misbehaviour Forecasting for Focused Autonomous Driving Systems Testing}, author = {Naziri, Molla Mohammad Abid and Lambertenghi, Stefano Carlo and Stocco, Andrea and d'Amorim, Marcelo}, booktitle = {Proceedings of the 51st International Conference on Software Engineering}, series = {ICSE '26}, year = {2026}, publisher = {ACM/IEEE}, pages = {12 pages}, }
- [pre-print] Feature-Aware Test Generation for Deep Learning Models. Xingcheng Chen, Oliver Weissl, and Andrea Stocco. 2026.
As deep learning models are widely used in software systems, test generation plays a crucial role in assessing the quality of such models before deployment. To date, the most advanced test generators rely on generative AI to synthesize inputs; however, these approaches remain limited in providing semantic insight into the causes of misbehaviours and in offering fine-grained semantic controllability over the generated inputs. In this paper, we introduce Detect, a feature-aware test generation framework for vision-based deep learning (DL) models that systematically generates inputs by perturbing disentangled semantic attributes within the latent space. Detect perturbs individual latent features in a controlled way and observes how these changes affect the model’s output. Through this process, it identifies which features lead to behavior shifts and uses a vision-language model for semantic attribution. By distinguishing between task-relevant and irrelevant features, Detect applies feature-aware perturbations targeted for both generalization and robustness. Empirical results across image classification and detection tasks show that Detect generates high-quality test cases with fine-grained control, reveals distinct shortcut behaviors across model architectures (convolutional and transformer-based), and bugs that are not captured by accuracy metrics. Specifically, Detect outperforms a state-of-the-art test generator in decision boundary discovery and a leading spurious feature localization method in identifying robustness failures. Our findings show that fully fine-tuned convolutional models are prone to overfitting on localized cues, such as co-occurring visual traits, while weakly supervised transformers tend to rely on global features, such as environmental variances. These findings highlight the value of interpretable and feature-aware testing in improving DL model reliability.
@misc{2026-Chen-arXiv, title = {Feature-Aware Test Generation for Deep Learning Models}, author = {Chen, Xingcheng and Weissl, Oliver and Stocco, Andrea}, year = {2026}, eprint = {}, archiveprefix = {arXiv}, primaryclass = {cs.SE}, url = {}, }
- [SANER] STELLAR: A Search-Based Testing Framework for Large Language Model Applications. Lev Sorokin, Ivan Vasilev, Ken Friedl, and Andrea Stocco. In Proceedings of the 33rd IEEE International Conference on Software Analysis, Evolution and Reengineering, 2026.
Large Language Model (LLM)-based applications are increasingly deployed across various domains, including customer service, education, and mobility. However, these systems are prone to inaccurate, fictitious, or harmful responses, and their vast, high-dimensional input space makes systematic testing particularly challenging. To address this, we present STELLAR, an automated search-based testing framework for LLM-based applications that systematically uncovers text inputs leading to inappropriate system responses. Our framework models test generation as an optimization problem and discretizes the input space into stylistic, content-related, and perturbation features. Unlike prior work that focuses on prompt optimization or coverage heuristics, our work employs evolutionary optimization to dynamically explore feature combinations that are more likely to expose failures. We evaluate STELLAR on three LLM-based conversational question-answering systems. The first focuses on safety, benchmarking both public and proprietary LLMs against malicious or unsafe prompts. The second and third target navigation, using an open-source and an industrial retrieval-augmented system for in-vehicle venue recommendations. Overall, STELLAR exposes up to 4.3× (average 2.5×) more failures than the existing baseline approaches.
@inproceedings{2026-Sorokin-SANER, title = {STELLAR: A Search-Based Testing Framework for Large Language Model Applications}, author = {Sorokin, Lev and Vasilev, Ivan and Friedl, Ken and Stocco, Andrea}, year = {2026}, booktitle = {Proceedings of the 33rd IEEE International Conference on Software Analysis, Evolution and Reengineering}, series = {SANER '26}, publisher = {IEEE}, pages = {10 pages}, }
- [SANER] Coverage-Guided Road Selection and Prioritization for Efficient Testing in Autonomous Driving Systems. Qurban Ali, Andrea Stocco, Leonardo Mariani, and Oliviero Riganelli. In Proceedings of the 33rd IEEE International Conference on Software Analysis, Evolution and Reengineering, 2026.
Autonomous Driving Assistance Systems (ADAS) rely on extensive testing to ensure safety and reliability, yet road scenario datasets often contain redundant cases that slow down the testing process without improving fault detection. We present a novel test prioritization framework that reduces redundancy while preserving geometric and behavioral diversity. Road scenarios are segmented into representative sections, which are compared using similarity scores based on dynamic time warping and enriched with dynamic features of the ADAS driving behavior. These features guide clustering to identify groups of similar scenarios, from which representative cases are selected to guarantee coverage. Finally, we introduce a prioritization mechanism that ranks roads based on geometric complexity, driving difficulty, and historical failures, ensuring that the most critical and challenging tests are executed first. We evaluate our framework on the OPENCAT dataset and the Udacity self-driving car simulator using two ADAS models. On average, our approach achieves an 89% reduction in test suite size while retaining an average of 79% of failed road scenarios. The prioritization strategy improves early failure detection by up to 95× compared to random baselines. These results demonstrate that our framework significantly improves test efficiency and fault detection capability, while maintaining scenario diversity and generalizing across different ADAS.
@inproceedings{2026-Ali-SANER, title = {Coverage-Guided Road Selection and Prioritization for Efficient Testing in Autonomous Driving Systems}, author = {Ali, Qurban and Stocco, Andrea and Mariani, Leonardo and Riganelli, Oliviero}, year = {2026}, booktitle = {Proceedings of the 33rd IEEE International Conference on Software Analysis, Evolution and Reengineering}, series = {SANER '26}, publisher = {IEEE}, pages = {10 pages}, }
- [ICSEW] Large Language Models for Secure Code Assessment: A Multi-Language Empirical Study. Kohei Dozono, Tiago Espinha Gasiba, and Andrea Stocco. In Proceedings of the 48th International Conference on Software Engineering Workshops, 2026.
Most vulnerability detection studies focus on datasets of vulnerabilities in C/C++ code, offering limited language diversity. Thus, the effectiveness of deep learning methods, including large language models (LLMs), in detecting software vulnerabilities beyond these languages is still largely unexplored. In this paper, we evaluate the effectiveness of LLMs in detecting and classifying Common Weakness Enumerations (CWE) using different prompt and role strategies. Our experimental study targets six state-of-the-art pre-trained LLMs (GPT-3.5-Turbo, GPT-4 Turbo, GPT-4o, CodeLLama-7B, CodeLLama-13B, and Gemini 1.5 Pro) and five programming languages: Python, C, C++, Java, and JavaScript. We compiled a multi-language vulnerability dataset from different sources to ensure representativeness. Our results showed that GPT-4o achieves the highest vulnerability detection and CWE classification scores using a few-shot setting. Aside from the quantitative results of our study, we developed a library called CODEGUARDIAN integrated with VSCode which enables developers to perform LLM-assisted real-time vulnerability analysis in real-world security scenarios. We have evaluated CODEGUARDIAN with a user study involving 22 developers from the industry. Our study showed that, by using CODEGUARDIAN, developers are more accurate and faster at detecting vulnerabilities.
@inproceedings{2026-Dozono-ICSEW, title = {Large Language Models for Secure Code Assessment: A Multi-Language Empirical Study}, author = {Dozono, Kohei and Gasiba, Tiago Espinha and Stocco, Andrea}, booktitle = {Proceedings of the 48th International Conference on Software Engineering Workshops}, year = {2026}, url = {https://arxiv.org/abs/2408.06428}, }
- [ICSEW] Latent Regularization in Generative Test Input Generation. Giorgi Merabishvili, Oliver Weissl, and Andrea Stocco. In Proceedings of the 48th International Conference on Software Engineering Workshops, 2026.
This study examines how regularization of latent spaces through truncation affects the quality of generated test inputs for deep learning classifiers. We evaluate this effect using style-based GANs, a state-of-the-art generative approach, and assess quality along three dimensions: validity, diversity, and fault detection. We evaluate our approach on the boundary testing of deep learning image classifiers across three datasets: MNIST, Fashion-MNIST, and CIFAR-10. We compare two truncation strategies: latent code mixing with binary search optimization and random latent truncation for generative exploration. Our experiments show that the latent code-mixing approach achieves a higher fault detection rate than random truncation, while also improving both diversity and validity.
@inproceedings{2026-Merabishvili-ICSEW, title = {Latent Regularization in Generative Test Input Generation}, author = {Merabishvili, Giorgi and Weissl, Oliver and Stocco, Andrea}, booktitle = {Proceedings of the 48th International Conference on Software Engineering Workshops}, year = {2026}, url = {}, }
2025
- [EMSE] XMutant: XAI-based Fuzzing for Deep Learning Systems. Xingcheng Chen, Matteo Biagiola, Vincenzo Riccio, Marcelo d’Amorim, and 1 more author. Empirical Software Engineering, 2025.
Semantic-based test generators are widely used to produce failure-inducing inputs for Deep Learning (DL) systems. They typically generate challenging test inputs by applying random perturbations to input semantic concepts until a failure is found or a timeout is reached. However, such randomness may hinder them from efficiently achieving their goal. This paper proposes XMutant, a technique that leverages explainable artificial intelligence (XAI) techniques to generate challenging test inputs. XMutant uses the local explanation of the input to inform the fuzz testing process and effectively guide it toward failures of the DL system under test. We evaluated different configurations of XMutant in triggering failures for different DL systems both for model-level (sentiment analysis, digit recognition) and system-level testing (advanced driving assistance). Our studies showed that XMutant enables more effective and efficient test generation by focusing on the most impactful parts of the input. XMutant generates up to 125% more failure-inducing inputs compared to an existing baseline, up to 7X faster. We also assessed the validity of these inputs, maintaining a validation rate above 89%, according to automated and human validators.
@article{2025-Chen-EMSE, title = {{XMutant: XAI-based Fuzzing for Deep Learning Systems}}, author = {Chen, Xingcheng and Biagiola, Matteo and Riccio, Vincenzo and d'Amorim, Marcelo and Stocco, Andrea}, year = {2025}, journal = {Empirical Software Engineering}, publisher = {Springer}, url = {https://arxiv.org/abs/2503.07222}, }
- [ASE] A Multi-Modality Evaluation of the Reality Gap in Autonomous Driving Systems. Stefano Carlo Lambertenghi, Mirena Flores Valdez, and Andrea Stocco. In Proceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering, 2025.
Simulation-based testing is a cornerstone of Autonomous Driving System (ADS) development, offering safe and scalable evaluation across diverse driving scenarios. However, discrepancies between simulated and real-world behavior, known as the reality gap, challenge the transferability of test results to deployed systems. In this paper, we present a comprehensive empirical study comparing four representative testing modalities: Software-in-the-Loop (SiL), Vehicle-in-the-Loop (ViL), Mixed-Reality (MR), and full real-world testing. Using a small-scale physical vehicle equipped with real sensors (camera and LiDAR), and its digital twin, we implement each setup and evaluate two ADS architectures (modular and end-to-end) across diverse indoor driving scenarios involving real obstacles, road topologies, and indoor environments. We systematically assess the impact of each testing modality along three dimensions of the reality gap: actuation, perception, and behavioral fidelity. Our results show that while SiL and ViL setups simplify critical aspects of real-world dynamics and sensing, MR testing improves perceptual realism without compromising safety or control. Importantly, we identify the conditions under which failures do not transfer across testing modalities and isolate the underlying dimensions of the gap responsible for these discrepancies. Our findings offer actionable insights into the respective strengths and limitations of each modality and outline a path toward more robust and transferable validation of autonomous driving systems.
@inproceedings{2025-Lambertenghi-ASE, author = {Lambertenghi, Stefano Carlo and Valdez, Mirena Flores and Stocco, Andrea}, title = {A Multi-Modality Evaluation of the Reality Gap in Autonomous Driving Systems}, booktitle = {Proceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering}, series = {ASE '25}, publisher = {IEEE}, pages = {12 pages}, year = {2025}, }
- [pre-print] GIFTbench: Generative Image Fuzz Testing Benchmark. Maryam, Matteo Biagiola, Andrea Stocco, and Vincenzo Riccio. 2025.
GIFTbench is a modular framework for testing Deep Learning image classifiers that combines Generative AI with genetic algorithms. Its architecture integrates pretrained generative models with a user-friendly Gradio interface, enabling automated, reproducible, and interpretable robustness testing. Supporting VAE, GAN, and Diffusion models, GIFTbench generates test inputs by perturbing latent representations to expose misbehaviors of the classifier under test. By automating test input generation and reducing the need for manual coding, GIFTbench accelerates experimentation and facilitates comparative evaluation of both classifiers and generative models. Designed for researchers and practitioners, it enables reproducible assessment of image classifiers, while supporting studies on classifier vulnerabilities, mutation strategies, and the role of generative models in robustness testing.
@misc{2025-Maryam-SCP, title = {GIFTbench: Generative Image Fuzz Testing Benchmark}, author = {Maryam and Biagiola, Matteo and Stocco, Andrea and Riccio, Vincenzo}, year = {2025}, eprint = {}, archiveprefix = {}, primaryclass = {}, url = {}, }
- [pre-print] PerturbationDrive: A Framework for Perturbation-Based Testing of ADAS. Hannes Leonhard, Stefano Carlo Lambertenghi, and Andrea Stocco. 2025.
Advanced driver assistance systems (ADAS) often rely on deep neural networks to interpret driving images and support vehicle control. Although reliable under nominal conditions, these systems remain vulnerable to input variations and out-of-distribution data, which can lead to unsafe behavior. We present PerturbationDrive, a testing framework to perform robustness and generalization testing of ADAS. The framework features more than 30 image perturbations from the literature that mimic changes in weather, lighting, or sensor quality and extends them with dynamic and attention-based variants. PerturbationDrive supports both offline evaluation on static datasets and online closed-loop testing in different simulators. Additionally, the framework integrates with procedural road generation and search-based testing, enabling systematic exploration of diverse road topologies combined with image perturbations. Together, these features allow PerturbationDrive to evaluate robustness and generalization capabilities of ADAS across varying scenarios, making it a reproducible and extensible framework for systematic system-level testing.
@misc{2025-Leonhard-SCP, title = {PerturbationDrive: A Framework for Perturbation-Based Testing of ADAS}, author = {Leonhard, Hannes and Lambertenghi, Stefano Carlo and Stocco, Andrea}, year = {2025}, eprint = {}, archiveprefix = {}, primaryclass = {}, url = {}, }
- [pre-print] Benchmarking Contextual Understanding for In-Car Conversational Systems. Philipp Habicht, Lev Sorokin, Abdullah Saydemir, Ken E. Friedl, and 1 more author. 2025.
In-Car Conversational Question Answering (ConvQA) systems significantly enhance user experience by enabling seamless voice interactions. However, assessing their accuracy and reliability remains a challenge. This paper explores the use of Large Language Models (LLMs) alongside advanced prompting techniques and agent-based methods to evaluate the extent to which ConvQA system responses adhere to user utterances. The focus lies on contextual understanding, the ability to provide accurate venue recommendations considering the user constraints and situational context. To evaluate the utterance/response coherence using an LLM, we synthetically generate user utterances accompanied by correct but also modified failure-containing system responses. We use input-output, chain of thought, self-consistency prompting, as well as multi-agent prompting techniques, with 13 reasoning and non-reasoning LLMs, varying in model size and providers, from OpenAI, DeepSeek, Mistral AI, and Meta. We evaluate our approach on a case study that involves a user asking for restaurant recommendations. The most substantial improvements are observed for non-reasoning models when applying advanced prompting techniques, in particular, when applying multi-agent prompting. However, non-reasoning models are significantly surpassed by reasoning models, where the best result is achieved with single-agent prompting incorporating self-consistency. Notably, the DeepSeek-R1 model achieves the highest F1-score of 0.990 at a cost of 0.002 USD per request. Overall, the best tradeoff between effectiveness and cost/time efficiency is achieved with the non-reasoning model DeepSeek-V3.
@misc{2025-Habicht-arxiv, title = {Benchmarking Contextual Understanding for In-Car Conversational Systems}, author = {Habicht, Philipp and Sorokin, Lev and Saydemir, Abdullah and Friedl, Ken E. and Stocco, Andrea}, year = {2025}, eprint = {}, archiveprefix = {}, primaryclass = {}, url = {}, }
- [pre-print] Foundation Models in Autonomous Driving: A Survey on Scenario Generation and Scenario Analysis. Yuan Gao, Mattia Piccinini, Yuchen Zhang, Dingrui Wang, and 11 more authors. 2025.
For autonomous vehicles, safe navigation in complex environments depends on handling a broad range of diverse and rare driving scenarios. Simulation- and scenario-based testing have emerged as key approaches to development and validation of autonomous driving systems. Traditional scenario generation relies on rule-based systems, knowledge-driven models, and data-driven synthesis, often producing limited diversity and unrealistic safety-critical cases. With the emergence of foundation models, which represent a new generation of pre-trained, general-purpose AI models, developers can process heterogeneous inputs (e.g., natural language, sensor data, HD maps, and control actions), enabling the synthesis and interpretation of complex driving scenarios. In this paper, we conduct a survey about the application of foundation models for scenario generation and scenario analysis in autonomous driving (as of May 2025). Our survey presents a unified taxonomy that includes large language models, vision-language models, multimodal large language models, diffusion models, and world models for the generation and analysis of autonomous driving scenarios. In addition, we review the methodologies, open-source datasets, simulation platforms, and benchmark challenges, and we examine the evaluation metrics tailored explicitly to scenario generation and analysis. Finally, the survey concludes by highlighting the open challenges and research questions, and outlining promising future research directions. All reviewed papers are listed in a continuously maintained repository, which contains supplementary materials and is available at GitHub.com/TUM-AVS/FM-for-Scenario-Generation-Analysis.
@misc{2025-Guo-arxiv, title = {Foundation Models in Autonomous Driving: A Survey on Scenario Generation and Scenario Analysis}, author = {Gao, Yuan and Piccinini, Mattia and Zhang, Yuchen and Wang, Dingrui and Moller, Korbinian and Brusnicki, Roberto and Zarrouki, Baha and Gambi, Alessio and Totz, Jan Frederik and Storms, Kai and Peters, Steven and Stocco, Andrea and Alrifaee, Bassam and Pavone, Marco and Betz, Johannes}, year = {2025}, eprint = {2506.11526}, archiveprefix = {arXiv}, primaryclass = {cs.RO}, url = {https://arxiv.org/abs/2506.11526}, }
- [pre-print] Web Element Relocalization in Evolving Web Applications: A Comparative Analysis and Extension Study. Anton Kluge and Andrea Stocco. 2025.
Fragile web tests, primarily caused by locator breakages, are a persistent challenge in web development. Hence, researchers have proposed techniques for web-element re-identification in which algorithms utilize a range of element properties to relocate elements on updated versions of websites based on similarity scoring. In this paper, we replicate the original studies of the most recent propositions in the literature, namely the Similo algorithm and its successor, VON Similo. We also acknowledge and reconsider assumptions related to threats to validity in the original studies, which prompted additional analysis and the development of mitigation techniques. Our analysis revealed that VON Similo, despite its novel approach, tends to produce more false positives than Similo. We mitigated these issues through algorithmic refinements and optimization algorithms that enhance parameters and comparison methods across all Similo variants, improving the accuracy of Similo on its original benchmark by 5.62%. Moreover, we extend the replicated studies by proposing a larger evaluation benchmark (23x bigger than the original study) as well as a novel approach that combines the strengths of both Similo and VON Similo, called HybridSimilo. The combined approach achieved a gain comparable to the improved Similo alone. Results on the extended benchmark show that HybridSimilo locates 98.8% of elements with broken locators in realistic testing scenarios.
@misc{2025-Kluge-arxiv, title = {Web Element Relocalization in Evolving Web Applications: A Comparative Analysis and Extension Study}, author = {Kluge, Anton and Stocco, Andrea}, year = {2025}, eprint = {2505.16424}, archiveprefix = {arXiv}, primaryclass = {cs.SE}, url = {https://arxiv.org/abs/2505.16424}, }
- [IV] Automated Factual Benchmarking for In-Car Conversational Systems using Large Language Models. Rafael Giebisch, Ken E. Friedl, Lev Sorokin, and Andrea Stocco. In Proceedings of the 36th IEEE Intelligent Vehicles Symposium, 2025.
In-car conversational systems promise to improve the in-vehicle user experience. Modern conversational systems are based on Large Language Models (LLMs), which makes them prone to errors such as hallucinations, i.e., inaccurate, fictitious, and therefore factually incorrect information. In this paper, we present an LLM-based methodology for the automatic factual benchmarking of in-car conversational systems. We instantiate our methodology with five LLM-based methods, leveraging ensembling techniques and diverse personae to enhance agreement and minimize hallucinations. We use our methodology to evaluate CarExpert, an in-car retrieval-augmented conversational question answering system, with respect to factual correctness against a vehicle’s manual. We produced a novel dataset specifically created for the in-car domain, and tested our methodology against an expert evaluation. Our results show that the combination of GPT-4 with the Input Output Prompting achieves over 90% factual correctness agreement rate with expert evaluations, while also being the most efficient approach, yielding an average response time of 4.5s. Our findings suggest that LLM-based testing constitutes a viable approach for the validation of conversational systems regarding their factual correctness.
@inproceedings{2025-Giebisch-IV, author = {Giebisch, Rafael and Friedl, Ken E. and Sorokin, Lev and Stocco, Andrea}, title = {Automated Factual Benchmarking for In-Car Conversational Systems using Large Language Models}, booktitle = {Proceedings of the 36th IEEE Intelligent Vehicles Symposium}, series = {IV '25}, publisher = {IEEE}, pages = {8 pages}, year = {2025}, }
- [pre-print] Latent Space Class Dispersion: Effective Test Data Quality Assessment for DNNs. Vivek Vekariya, Mojdeh Golagha, Andrea Stocco, and Alexander Pretschner. 2025.
High-quality test datasets are crucial for assessing the reliability of Deep Neural Networks (DNNs). Mutation testing evaluates test dataset quality based on their ability to uncover injected faults in DNNs as measured by mutation score (MS). At the same time, its high computational cost motivates researchers to seek alternative test adequacy criteria. We propose Latent Space Class Dispersion (LSCD), a novel metric to quantify the quality of test datasets for DNNs. It measures the degree of dispersion within a test dataset as observed in the latent space of a DNN. Our empirical study shows that LSCD reveals and quantifies deficiencies in the test dataset of three popular benchmarks pertaining to image classification tasks using DNNs. Corner cases generated using automated fuzzing were found to help enhance fault detection and improve the overall quality of the original test sets calculated by MS and LSCD. Our experiments revealed a high positive correlation (0.87) between LSCD and MS, significantly higher than the one achieved by the well-studied Distance-based Surprise Coverage (0.25). These results were obtained from 129 mutants generated through pre-training mutation operators, with statistical significance and a high validity of corner cases. These observations suggest that LSCD can serve as a cost-effective alternative to expensive mutation testing, eliminating the need to generate mutant models while offering comparably valuable insights into test dataset quality for DNNs.
@misc{2025-Vekariya-arxiv, title = {Latent Space Class Dispersion: Effective Test Data Quality Assessment for DNNs}, author = {Vekariya, Vivek and Golagha, Mojdeh and Stocco, Andrea and Pretschner, Alexander}, year = {2025}, eprint = {2503.18799}, archiveprefix = {arXiv}, primaryclass = {cs.SE}, url = {https://arxiv.org/abs/2503.18799}, }
- [pre-print] Simulator Ensembles for Trustworthy Autonomous Driving Testing. Lev Sorokin, Matteo Biagiola, and Andrea Stocco. 2025.
Scenario-based testing with driving simulators is extensively used to identify failing conditions of automated driving assistance systems (ADAS) and reduce the amount of in-field road testing. However, existing studies have shown that repeated test execution in the same as well as in distinct simulators can yield different outcomes, which can be attributed to sources of flakiness or different implementations of the physics, among other factors. In this paper, we present MultiSim, a novel approach to multi-simulation ADAS testing based on a search-based testing approach that leverages an ensemble of simulators to identify failure-inducing, simulator-agnostic test scenarios. During the search, each scenario is evaluated jointly on multiple simulators. Scenarios that produce consistent results across simulators are prioritized for further exploration, while those that fail on only a subset of simulators are given less priority, as they may reflect simulator-specific issues rather than generalizable failures. Our case study, which involves testing a deep neural network-based ADAS on different pairs of three widely used simulators, demonstrates that MultiSim outperforms single-simulator testing by achieving on average a higher rate of simulator-agnostic failures by 51%. Compared to a state-of-the-art multi-simulator approach that combines the outcome of independent test generation campaigns obtained in different simulators, MultiSim identifies 54% more simulator-agnostic failing tests while showing a comparable validity rate. An enhancement of MultiSim that leverages surrogate models to predict simulator disagreements and bypass executions does not only increase the average number of valid failures but also improves efficiency in finding the first valid failure.
@misc{2025-Sorokin-arxiv, title = {Simulator Ensembles for Trustworthy Autonomous Driving Testing}, author = {Sorokin, Lev and Biagiola, Matteo and Stocco, Andrea}, year = {2025}, eprint = {2503.08936}, archiveprefix = {arXiv}, primaryclass = {cs.SE}, url = {https://arxiv.org/abs/2503.08936}, }
- [ICSE] Efficient Domain Augmentation for Autonomous Driving Testing Using Diffusion Models. Luciano Baresi, Davide Yi Xian Hu, Andrea Stocco, and Paolo Tonella. In Proceedings of the 47th International Conference on Software Engineering, 2025.
Simulation-based testing is widely used to assess the reliability of Autonomous Driving Systems (ADS), but its effectiveness is limited by the operational design domain (ODD) conditions available in such simulators. To address this limitation, in this work, we explore the integration of generative artificial intelligence techniques with physics-based simulators to enhance ADS system-level testing. Our study evaluates the effectiveness and computational overhead of three generative strategies based on diffusion models, namely instruction-editing, inpainting, and inpainting with refinement. Specifically, we assess these techniques’ capabilities to produce augmented simulator-generated images of driving scenarios representing new ODDs. We employ a novel automated detector for invalid inputs based on semantic segmentation to ensure semantic preservation and realism of the neural generated images. We then perform system-level testing to evaluate the ADS’s generalization ability to newly synthesized ODDs. Our findings show that diffusion models help increase the ODD coverage for system-level testing of ADS. Our automated semantic validator achieved a percentage of false positives as low as 3%, retaining the correctness and quality of the generated images for testing. Our approach successfully identified new ADS system failures before real-world testing.
@inproceedings{2025-Baresi-ICSE, author = {Baresi, Luciano and Hu, Davide Yi Xian and Stocco, Andrea and Tonella, Paolo}, title = {Efficient Domain Augmentation for Autonomous Driving Testing Using Diffusion Models}, booktitle = {Proceedings of the 47th International Conference on Software Engineering}, series = {ICSE '25}, publisher = {IEEE}, pages = {12 pages}, year = {2025}, }
- [ICST] Benchmarking Image Perturbations for Testing Automated Driving Assistance Systems. Stefano Carlo Lambertenghi, Hannes Leonhard, and Andrea Stocco. In Proceedings of the 18th IEEE International Conference on Software Testing, Verification and Validation, 2025.
[Distinguished Paper Award]
@inproceedings{2025-Lambertenghi-ICST, author = {Lambertenghi, Stefano Carlo and Leonhard, Hannes and Stocco, Andrea}, title = {Benchmarking Image Perturbations for Testing Automated Driving Assistance Systems}, booktitle = {Proceedings of the 18th IEEE International Conference on Software Testing, Verification and Validation}, series = {ICST '25}, publisher = {IEEE}, pages = {12 pages}, year = {2025}, }
- [ICST] Benchmarking Generative AI Models for Deep Learning Test Input Generation. Maryam, Matteo Biagiola, Andrea Stocco, and Vincenzo Riccio. In Proceedings of the 18th IEEE International Conference on Software Testing, Verification and Validation, 2025.
[Distinguished Paper Award]
Test Input Generators (TIGs) are crucial to assess the ability of Deep Learning (DL) image classifiers to provide correct predictions for inputs beyond their training and test sets. Recent advancements in Generative AI (GenAI) models have made them a powerful tool for creating and manipulating synthetic images, although these advancements also imply increased complexity and resource demands for training. In this work, we benchmark and combine different GenAI models with TIGs, assessing their effectiveness, efficiency, and quality of the generated test images, in terms of domain validity and label preservation. We conduct an empirical study involving three different GenAI architectures (VAEs, GANs, Diffusion Models), five classification tasks of increasing complexity, and 364 human evaluations. Our results show that simpler architectures, such as VAEs, are sufficient for less complex datasets like MNIST. However, when dealing with feature-rich datasets, such as ImageNet, more sophisticated architectures like Diffusion Models achieve superior performance by generating a higher number of valid, misclassification-inducing inputs.
@inproceedings{2025-Maryam-ICST, author = {Maryam and Biagiola, Matteo and Stocco, Andrea and Riccio, Vincenzo}, title = {Benchmarking Generative AI Models for Deep Learning Test Input Generation}, booktitle = {Proceedings of the 18th IEEE International Conference on Software Testing, Verification and Validation}, series = {ICST '25}, publisher = {IEEE}, pages = {12 pages}, year = {2025}, }
- [ICSEW] OpenCat: Improving Interoperability of ADS Testing. Qurban Ali, Andrea Stocco, Leonardo Mariani, and Oliviero Riganelli. In Proceedings of the 47th International Conference on Software Engineering Workshops, 2025.
Testing Advanced Driving Assistance Systems (ADAS), such as lane-keeping functions, requires creating road topologies or using predefined benchmarks. However, the test cases in existing ADAS benchmarks are often designed in specific formats (e.g., OpenDRIVE) and tailored to specific ADAS models. This limits their reusability and interoperability with other simulators and models, making it challenging to assess ADAS functionalities independently of the platform-specific details used to create the test cases. This paper evaluates the interoperability of SensoDat, a benchmark developed for ADAS regression testing. We introduce OpenCat, a converter that transforms OpenDRIVE test cases into the Catmull-Rom spline format, which is widely supported by many current test generators. By applying OpenCat to the SensoDat dataset, we achieved high accuracy in converting test cases into reusable road scenarios. To validate the converted scenarios, we used them to evaluate a lane-keeping ADAS model using the Udacity simulator. Both the simulator and the ADAS model operate independently of the technologies underlying SensoDat, ensuring an unbiased evaluation of the original test cases. Our findings reveal that benchmarks built with specific ADAS models hinder their effective usage for regression testing. We conclude by offering insights and recommendations to enhance the reusability and transferability of ADAS benchmarks for more extensive applications.
@inproceedings{2025-Ali-ICSEW, title = {OpenCat: Improving Interoperability of ADS Testing}, author = {Ali, Qurban and Stocco, Andrea and Mariani, Leonardo and Riganelli, Oliviero}, year = {2025}, booktitle = {Proceedings of the 47th International Conference on Software Engineering Workshops}, series = {ICSEW '25}, publisher = {IEEE}, pages = {10 pages}, }
- [TOSEM] Targeted Deep Learning System Boundary Testing. Oliver Weißl, Amr Abdellatif, Xingcheng Chen, Giorgi Merabishvili, and 3 more authors. ACM Transactions on Software Engineering and Methodology, 2025.
Evaluating the behavioral boundaries of deep learning (DL) systems is crucial for understanding their reliability across diverse, unseen inputs. Existing solutions fall short as they rely on untargeted random, model- or latent-based perturbations, due to difficulties in generating controlled input variations. In this work, we introduce Mimicry, a novel black-box test generator for fine-grained, targeted exploration of DL system boundaries. Mimicry performs boundary testing by leveraging the probabilistic nature of DL outputs to identify promising directions for exploration. It uses style-based GANs to disentangle input representations into content and style components, enabling controlled feature mixing to approximate the decision boundary. We evaluated Mimicry’s effectiveness in generating boundary inputs for five widely used DL image classification systems of increasing complexity, comparing it to two baseline approaches. Our results show that Mimicry consistently identifies inputs closer to the decision boundary. It generates semantically meaningful boundary test cases that reveal new functional (mis)behaviors, while the baselines produce mainly corrupted or invalid inputs. Thanks to its enhanced control over latent space manipulations, Mimicry remains effective as dataset complexity increases, maintaining competitive diversity and higher validity rates, confirmed by human assessors.
@article{2025-Weissl-TOSEM, title = {Targeted Deep Learning System Boundary Testing}, author = {Weißl, Oliver and Abdellatif, Amr and Chen, Xingcheng and Merabishvili, Giorgi and Riccio, Vincenzo and Kacianka, Severin and Stocco, Andrea}, journal = {ACM Transactions on Software Engineering and Methodology}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, year = {2025}, url = {https://arxiv.org/abs/2408.06258}, }
- [IST] A Multi-Year Grey Literature Review on AI-assisted Test Automation. Filippo Ricca, Alessandro Marchetto, and Andrea Stocco. Information and Software Technology, 2025.
Context: Test Automation (TA) techniques are crucial for quality assurance in software engineering but face limitations such as high test suite maintenance costs and the need for extensive programming skills. Artificial Intelligence (AI) offers new opportunities to address these issues through automation and improved practices. Objectives: Given the prevalent usage of AI in industry, sources of truth are held in grey literature as well as the minds of professionals, stakeholders, developers, and end-users. This study surveys grey literature to explore how AI is adopted in TA, focusing on the problems it solves, its solutions, and the available tools. Additionally, the study gathers expert insights to understand AI’s current and future role in TA. Methods: We reviewed over 3,600 grey literature sources over five years, including blogs, white papers, and user manuals, and finally filtered 342 documents to develop taxonomies of TA problems and AI solutions. We also cataloged 100 AI-driven TA tools and interviewed five expert software testers to gain insights into AI’s current and future role in TA. Results: The study found that manual test code development and maintenance are the main challenges in TA. In contrast, automated test generation and self-healing test scripts are the most common AI solutions. We identified 100 AI-based TA tools, with Applitools, Testim, Functionize, AccelQ, and Mabl being the most adopted in practice. Conclusion: This paper offers a detailed overview of AI’s impact on TA through grey literature analysis and expert interviews. It presents new taxonomies of TA problems and AI solutions, provides a catalog of AI-driven tools, and relates solutions to problems and tools to solutions. Interview insights further revealed the state and future potential of AI in TA. Our findings support practitioners in selecting TA tools and guide future research directions.
@article{2025-Ricca-IST, title = {A Multi-Year Grey Literature Review on AI-assisted Test Automation}, author = {Ricca, Filippo and Marchetto, Alessandro and Stocco, Andrea}, journal = {Information and Software Technology}, year = {2025}, url = {https://arxiv.org/abs/2408.06224}, }
- [TOSEM] System Safety Monitoring of Learned Components Using Temporal Metric Forecasting. Sepehr Sharifi, Andrea Stocco, and Lionel C. Briand. ACM Transactions on Software Engineering and Methodology, Jan 2025.
In learning-enabled autonomous systems, safety monitoring of learned components is crucial to ensure their outputs do not lead to system safety violations, given the operational context of the system. However, developing a safety monitor for practical deployment in real-world applications is challenging. This is due to limited access to internal workings and training data of the learned component. Furthermore, safety monitors should predict safety violations with low latency, while consuming a reasonable amount of computational resources. To address these challenges, we propose a safety monitoring method based on probabilistic time series forecasting. Given the learned component outputs and an operational context, we empirically investigate different Deep Learning (DL)-based probabilistic forecasting methods to predict the objective measure capturing the satisfaction or violation of a safety requirement (safety metric). We empirically evaluate the safety metric and violation prediction accuracy, as well as the inference latency and resource usage, of four state-of-the-art models, with varying horizons, using autonomous aviation and autonomous driving case studies. Our results suggest that probabilistic forecasting of safety metrics, given learned component outputs and scenarios, is effective for safety monitoring. Furthermore, for both case studies, the Temporal Fusion Transformer (TFT) was the most accurate model for predicting imminent safety violations, with acceptable latency and resource consumption.
@article{2025-Sharifi-TOSEM, author = {Sharifi, Sepehr and Stocco, Andrea and Briand, Lionel C.}, title = {System Safety Monitoring of Learned Components Using Temporal Metric Forecasting}, journal = {ACM Transactions on Software Engineering and Methodology}, publisher = {Association for Computing Machinery}, year = {2025}, url = {https://doi.org/10.1145/3712196}, doi = {10.1145/3712196}, month = jan, address = {New York, NY, USA}, issn = {1049-331X}, }
2024
- [ICST] Assessing Quality Metrics for Neural Reality Gap Input Mitigation in Autonomous Driving Testing. Stefano Carlo Lambertenghi and Andrea Stocco. In Proceedings of the 17th IEEE International Conference on Software Testing, Verification and Validation, 2024.
@inproceedings{2024-Lambertenghi-ICST, author = {Lambertenghi, Stefano Carlo and Stocco, Andrea}, title = {Assessing Quality Metrics for Neural Reality Gap Input Mitigation in Autonomous Driving Testing}, booktitle = {Proceedings of the 17th IEEE International Conference on Software Testing, Verification and Validation}, series = {ICST '24}, publisher = {IEEE}, pages = {12 pages}, year = {2024}, abstract = {Simulation-based testing of automated driving systems (ADS) is the industry standard, being a controlled, safe, and cost-effective alternative to real-world testing. Despite these advantages, virtual simulations often fail to accurately replicate real-world conditions like image fidelity, texture representation, and environmental accuracy. This can lead to significant differences in ADS behavior between simulated and real-world domains, a phenomenon known as the sim2real gap. Researchers have used Image-to-Image (I2I) neural translation to mitigate the sim2real gap, enhancing the realism of simulated environments by transforming synthetic data into more authentic representations of real-world conditions. However, while promising, these techniques may potentially introduce artifacts, distortions, or inconsistencies in the generated data that can affect the effectiveness of ADS testing. In our empirical study, we investigated how the quality of image-to-image (I2I) techniques influences the mitigation of the sim2real gap, using a set of established metrics from the literature. We evaluated two popular generative I2I architectures, pix2pix and CycleGAN, across two ADS perception tasks at a model level, namely vehicle detection and end-to-end lane keeping, using paired simulated and real-world datasets. Our findings reveal that the effectiveness of I2I architectures varies across different ADS tasks, and existing evaluation metrics do not consistently align with the ADS behavior. Thus, we conducted task-specific fine-tuning of perception metrics, which yielded a stronger correlation. Our findings indicate that a perception metric that incorporates semantic elements, tailored to each task, can facilitate selecting the most appropriate I2I technique for a reliable assessment of the sim2real gap mitigation.} }
- [ICST] Predicting Safety Misbehaviours in Autonomous Driving Systems using Uncertainty Quantification. Ruben Grewal, Paolo Tonella, and Andrea Stocco. In Proceedings of the 17th IEEE International Conference on Software Testing, Verification and Validation, 2024.
The automated real-time recognition of unexpected situations plays a crucial role in the safety of autonomous vehicles, especially in unsupported and unpredictable scenarios. This paper evaluates different Bayesian uncertainty quantification methods from the deep learning domain for the anticipatory testing of safety-critical misbehaviours during system-level simulation-based testing. Specifically, we compute uncertainty scores as the vehicle executes, following the intuition that high uncertainty scores are indicative of unsupported runtime conditions that can be used to distinguish safe from failure-inducing driving behaviors. In our study, we conducted an evaluation of the effectiveness and computational overhead associated with two Bayesian uncertainty quantification methods, namely MC-Dropout and Deep Ensembles, for misbehaviour avoidance. Overall, for three benchmarks from the Udacity simulator comprising both out-of-distribution and unsafe conditions introduced via mutation testing, both methods successfully detected a high number of out-of-bounds episodes providing early warnings several seconds in advance, outperforming two state-of-the-art misbehaviour prediction methods based on autoencoders and attention maps in terms of effectiveness and efficiency. Notably, Deep Ensembles detected most misbehaviours without any false alarms and did so even when employing a relatively small number of models, making them computationally feasible for real-time detection. Our findings suggest that incorporating uncertainty quantification methods is a viable approach for building fail-safe mechanisms in deep neural network-based autonomous vehicles.
@inproceedings{2024-Grewal-ICST, author = {Grewal, Ruben and Tonella, Paolo and Stocco, Andrea}, title = {Predicting Safety Misbehaviours in Autonomous Driving Systems using Uncertainty Quantification}, booktitle = {Proceedings of the 17th IEEE International Conference on Software Testing, Verification and Validation}, series = {ICST '24}, publisher = {IEEE}, pages = {12 pages}, year = {2024}, }
- [EMSE] Two is Better Than One: Digital Siblings to Improve Autonomous Driving Testing. Matteo Biagiola, Andrea Stocco, Vincenzo Riccio, and Paolo Tonella. Empirical Software Engineering, 2024. [Invited Journal-first track at ICSE 2025]
Simulation-based testing represents an important step to ensure the reliability of autonomous driving software. In practice, when companies rely on third-party general-purpose simulators, either for in-house or outsourced testing, the generalizability of testing results to real autonomous vehicles is at stake. In this paper, we enhance simulation-based testing by introducing the notion of digital siblings—a multi-simulator approach that tests a given autonomous vehicle on multiple general-purpose simulators built with different technologies, that operate collectively as an ensemble in the testing process. We exemplify our approach on a case study focused on testing the lane-keeping component of an autonomous vehicle. We use two open-source simulators as digital siblings, and we empirically compare such a multi-simulator approach against a digital twin of a physical scaled autonomous vehicle on a large set of test cases. Our approach requires generating and running test cases for each individual simulator, in the form of sequences of road points. Then, test cases are migrated between simulators, using feature maps to characterize the exercised driving conditions. Finally, the joint predicted failure probability is computed, and a failure is reported only in cases of agreement among the siblings. Our empirical evaluation shows that the ensemble failure predictor by the digital siblings is superior to each individual simulator at predicting the failures of the digital twin. We discuss the findings of our case study and detail how our approach can help researchers interested in automated testing of autonomous driving software.
@article{2024-Biagiola-EMSE, author = {Biagiola, Matteo and Stocco, Andrea and Riccio, Vincenzo and Tonella, Paolo}, title = {Two is Better Than One: Digital Siblings to Improve Autonomous Driving Testing}, journal = {Empirical Software Engineering}, publisher = {Springer}, year = {2024}, note = {[Invited Journal-first track at ICSE 2025]}, }
2023
- [pre-print] Neural Embeddings for Web Testing. Andrea Stocco, Alexandra Willi, Luigi Libero Lucio Starace, Matteo Biagiola, and 1 more author. 2023.
Web test automation techniques employ web crawlers to automatically produce a web app model that is used for test generation. Existing crawlers rely on app-specific, threshold-based, algorithms to assess state equivalence. Such algorithms are hard to tune in the general case and cannot accurately identify and remove near-duplicate web pages from crawl models. Failing to retrieve an accurate web app model results in automated test generation solutions that produce redundant test cases and inadequate test suites that do not cover the web app functionalities adequately. In this paper, we propose WEBEMBED, a novel abstraction function based on neural network embeddings and threshold-free classifiers that can be used to produce accurate web app models during model-based test generation. Our evaluation on nine web apps shows that WEBEMBED outperforms state-of-the-art techniques by detecting near-duplicates more accurately, inferring better web app models that exhibit 22% more precision, and 24% more recall on average. Consequently, the test suites generated from these models achieve higher code coverage, with improvements ranging from 2% to 59% on an app-wise basis and averaging at 23%.
@article{2023-Stocco-arXiv, author = {Stocco, Andrea and Willi, Alexandra and Starace, Luigi Libero Lucio and Biagiola, Matteo and Tonella, Paolo}, title = {Neural Embeddings for Web Testing}, year = {2023}, eprint = {2306.07400}, archiveprefix = {arXiv}, primaryclass = {cs.SE}, }
- [QUATIC] A Retrospective Analysis of Grey Literature for AI-supported Test Automation. Filippo Ricca, Alessandro Marchetto, and Andrea Stocco. In Proceedings of the 16th International Conference on the Quality of Information and Communications Technology, 2023.
[Best Paper Award]
This paper provides the results of a retrospective analysis conducted on a survey of the grey literature about the perception of practitioners on the integration of artificial intelligence (AI) algorithms into Test Automation (TA) practices. Our study involved the examination of 231 sources, including blogs, user manuals, and posts. Our primary goals were to: (a) assess the generalizability of existing taxonomies about the usage of AI for TA, (b) investigate and understand the relationships between TA problems and AI-based solutions, and (c) systematically map out the existing AI-based tools that offer AI-enhanced solutions. Our analysis yielded several interesting results. Firstly, we assessed a high degree of generalization of the existing taxonomies. Secondly, we identified TA problems that can be addressed using AI-enhanced solutions integrated into existing tools. Thirdly, we found that some TA problems require broader solutions that involve multiple software testing phases simultaneously, such as test generation and maintenance. Fourthly, we discovered that certain solutions are being investigated but are not supported by existing AI-based tools. Finally, we observed that there are tools that support different phases of TA and may have a broader reach.
@inproceedings{2023-Ricca-QUATIC, author = {Ricca, Filippo and Marchetto, Alessandro and Stocco, Andrea}, title = {A Retrospective Analysis of Grey Literature for AI-supported Test Automation}, booktitle = {Proceedings of the 16th International Conference on the Quality of Information and Communications Technology}, publisher = {Springer}, series = {QUATIC 2023}, year = {2023}, }
- [EMSE] Model vs System Level Testing of Autonomous Driving Systems: A Replication and Extension Study. Andrea Stocco, Brian Pulfer, and Paolo Tonella. Empirical Software Engineering, 2023. [Invited Journal-first track at ICSE 2024]
Offline model-level testing of autonomous driving software is much cheaper, faster, and diversified than in-field, online system-level testing. Hence, researchers have compared empirically model-level vs system-level testing using driving simulators. They reported the general usefulness of simulators at reproducing the same conditions experienced in-field, but also some inadequacy of model-level testing at exposing failures that are observable only in online mode. In this work, we replicate the reference study on model vs system-level testing of autonomous vehicles while acknowledging several assumptions that we had reconsidered. These assumptions are related to several threats to validity affecting the original study that motivated additional analysis and the development of techniques to mitigate them. Moreover, we also extend the replicated study by evaluating the original findings when considering a physical, radio-controlled autonomous vehicle. Our results show that simulator-based testing of autonomous driving systems yields predictions that are close to the ones of real-world datasets when using neural-based translation to mitigate the reality gap induced by the simulation platform. On the other hand, model-level testing failures are in line with those experienced at the system level, both in simulated and physical environments, when considering the pre-failure site, similar-looking images, and accurate labels.
@article{2023-Stocco-EMSE, author = {Stocco, Andrea and Pulfer, Brian and Tonella, Paolo}, title = {{Model vs System Level Testing of Autonomous Driving Systems: A Replication and Extension Study}}, journal = {Empirical Software Engineering}, publisher = {Springer}, volume = {}, year = {2023}, note = {Invited journal first track at ICSE 2024}, }
2022
- [ASE] ThirdEye: Attention Maps for Safe Autonomous Driving Systems. Andrea Stocco, Paulo J. Nunes, Marcelo d’Amorim, and Paolo Tonella. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, 2022.
Automated online recognition of unexpected conditions is an indispensable component of autonomous vehicles to ensure safety even in unknown and uncertain situations. In this paper we propose a runtime monitoring technique rooted in the attention maps computed by explainable artificial intelligence techniques. Our approach, implemented in a tool called ThirdEye, turns attention maps into confidence scores that are used to discriminate safe from unsafe driving behaviours. The intuition is that uncommon attention maps are associated with unexpected runtime conditions. In our empirical study, we evaluated the effectiveness of different configurations of ThirdEye at predicting simulation-based injected failures induced by both unknown conditions (adverse weather and lighting) and unsafe/uncertain conditions created with mutation testing. Results show that, overall, ThirdEye can predict 98% of misbehaviours, up to three seconds in advance, outperforming a state-of-the-art failure predictor for autonomous vehicles.
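As a rough illustrative sketch (not the authors' implementation), the attention-map-to-confidence idea can be pictured as scoring the current map against maps recorded under nominal driving; the function names, the cosine-similarity scoring rule, and the threshold are all assumptions:

import numpy as np

def attention_confidence(current_map, nominal_maps):
    # Cosine similarity between the current attention map and the closest
    # map recorded under nominal driving; low similarity suggests an
    # unexpected runtime condition.
    cur = current_map.ravel()
    cur = cur / (np.linalg.norm(cur) + 1e-12)
    best = 0.0
    for m in nominal_maps:
        ref = m.ravel()
        ref = ref / (np.linalg.norm(ref) + 1e-12)
        best = max(best, float(cur @ ref))
    return best

def is_unsafe(current_map, nominal_maps, threshold=0.6):
    # The threshold would be calibrated on nominal data in practice.
    return attention_confidence(current_map, nominal_maps) < threshold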
@inproceedings{2022-Stocco-ASE, author = {Stocco, Andrea and Nunes, Paulo J. and d'Amorim, Marcelo and Tonella, Paolo}, title = {{ThirdEye}: Attention Maps for Safe Autonomous Driving Systems}, booktitle = {Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering}, publisher = {IEEE/ACM}, series = {ASE '22}, year = {2022}, doi = {10.1145/3551349.3556968}, } - TSEMind the Gap! A Study on the Transferability of Virtual vs Physical-world Testing of Autonomous Driving SystemsAndrea Stocco, Brian Pulfer, and Paolo TonellaIEEE Transactions on Software Engineering, Jan 2022[Invited journal First track at ICSE 2023]
Safe deployment of self-driving cars (SDC) necessitates thorough simulated and in-field testing. Most testing techniques consider virtualized SDCs within a simulation environment, whereas less effort has been directed towards assessing whether such techniques transfer to and are effective with a physical real-world vehicle. In this paper, we shed light on the problem of generalizing testing results obtained in a driving simulator to a physical platform and provide a characterization and quantification of the sim2real gap affecting SDC testing. In our empirical study, we compare SDC testing when deployed on a physical small-scale vehicle vs its digital twin. Due to the unavailability of driving quality indicators from the physical platform, we use neural rendering to estimate them through visual odometry, hence allowing full comparability with the digital twin. Then, we investigate the transferability of behavior and failure exposure between virtual and real-world environments, targeting both unintended abnormal test data and intended adversarial examples. Our study shows that, despite the usage of a faithful digital twin, there are still critical shortcomings that contribute to the reality gap between the virtual and physical world, threatening existing testing solutions that only consider virtual SDCs. On the positive side, our results present the test configurations for which physical testing can be avoided, either because their outcome does transfer between virtual and physical environments, or because the uncertainty profiles in the simulator can help predict their outcome in the real world.
@article{2022-Stocco-TSE, author = {Stocco, Andrea and Pulfer, Brian and Tonella, Paolo}, title = {{Mind the Gap! A Study on the Transferability of Virtual vs Physical-world Testing of Autonomous Driving Systems}}, journal = {IEEE Transactions on Software Engineering}, year = {2022}, url = {https://arxiv.org/abs/2112.11255}, publisher = {IEEE}, note = {[Invited journal First track at ICSE 2023]}, }
2021
- JSEPConfidence-driven Weighted Retraining for Predicting Safety-Critical Failures in Autonomous Driving SystemsAndrea Stocco and Paolo TonellaJournal of Software: Evolution and Process, Jan 2021
Safe handling of hazardous driving situations is a task of high practical relevance for building reliable and trustworthy cyber-physical systems such as autonomous driving systems. This task necessitates an accurate prediction system of the vehicle’s confidence to prevent potentially harmful system failures on the occurrence of unpredictable conditions that make it less safe to drive. In this paper, we discuss the challenges of adapting a misbehavior predictor with knowledge mined during the execution of the main system. Then, we present a framework for the continual learning of misbehavior predictors, which records in-field behavioral data to determine what data are appropriate for adaptation. Our framework guides adaptive retraining using a novel combination of in-field confidence metric selection and reconstruction error-based weighing. We evaluate our framework to improve a misbehavior predictor from the literature on the Udacity simulator for self-driving cars. Our results show that our framework can reduce the false positive rate by a large margin and can adapt to nominal behavior drifts while maintaining the original capability to predict failures up to several seconds in advance.
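To make the reconstruction error-based weighing concrete, here is a minimal Python sketch, assuming frames recorded in-field and per-frame autoencoder reconstruction errors; the sampling scheme is illustrative, not the paper's exact procedure:

import numpy as np

def retraining_weights(reconstruction_errors):
    # Frames the predictor reconstructs poorly get proportionally more weight.
    e = np.asarray(reconstruction_errors, dtype=float)
    w = e - e.min() + 1e-6  # strictly positive so the probabilities are valid
    return w / w.sum()

def sample_for_retraining(frames, reconstruction_errors, k, seed=0):
    # Draw k frames for adaptive retraining, biased towards high-error frames.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(frames), size=k, replace=False,
                     p=retraining_weights(reconstruction_errors))
    return [frames[i] for i in idx]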
@article{2021-Stocco-JSEP, author = {Stocco, Andrea and Tonella, Paolo}, title = {Confidence-driven Weighted Retraining for Predicting Safety-Critical Failures in Autonomous Driving Systems}, journal = {Journal of Software: Evolution and Process}, pages = {e2386}, volume = {34}, number = {10}, publisher = {John Wiley & Sons}, url = {https://doi.org/10.1002/smr.2386}, doi = {10.1002/smr.2386}, year = {2021}, } - ICSTWAI-based Test Automation: A Grey Literature AnalysisFilippo Ricca, Alessandro Marchetto, and Andrea StoccoIn Proceedings of the 14th IEEE International Conference on Software Testing, Verification and Validation Workshops, Jan 2021[Best Presentation Award]
This paper provides the results of a survey of the grey literature concerning the use of artificial intelligence to improve test automation practices. We surveyed more than 1,200 sources of grey literature (e.g., blogs, white-papers, user manuals, StackOverflow posts) looking for highlights by professionals on how AI is adopted to aid the development and evolution of test code. Ultimately, we filtered 136 relevant documents from which we extracted a taxonomy of problems that AI aims to tackle, along with a taxonomy of AI-enabled solutions to such problems. Manual code development and automated test generation are the most cited problem and solution, respectively. The paper concludes by distilling the six most prevalent tools on the market, along with think-aloud reflections about the current and future status of artificial intelligence for test automation.
@inproceedings{2021-Ricca-ICSTW, author = {Ricca, Filippo and Marchetto, Alessandro and Stocco, Andrea}, title = {AI-based Test Automation: A Grey Literature Analysis}, booktitle = {Proceedings of the 14th IEEE International Conference on Software Testing, Verification and Validation Workshops}, publisher = {IEEE}, series = {ICSTW 2021}, note = {[Best Presentation Award]}, year = {2021}, } - ICSTQuality Metrics and Oracles for Autonomous Vehicles TestingGunel Jahangirova, Andrea Stocco, and Paolo TonellaIn Proceedings of the 14th IEEE International Conference on Software Testing, Verification and Validation, Jan 2021
The race for deploying AI-enabled autonomous vehicles (AVs) on public roads is based on the promise that such self-driving cars will be as safe as or safer than human drivers. Numerous techniques have been proposed to test AVs; however, they lack oracle definitions that account for the quality of driving, due to the absence of a commonly used set of metrics. Towards filling this gap, we first performed a systematic analysis of the literature concerning the assessment of the quality of driving of human drivers and extracted 126 metrics. Then, we measured the correlation between such metrics and the human perception of driving quality when AVs are driving. Lastly, we performed a study based on mutation analysis to assess whether the 26 metrics that best capture the quality of AV driving according to the human study can be used as functional oracles. Our results, targeting the Udacity platform, indicate that our automated oracles can kill a high proportion of mutants at a zero or very low false alarm rate, and therefore can be used as effective functional oracles for the quality of driving of AVs.
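A functional oracle built on a driving-quality metric can be pictured as a band check around nominal behaviour; this is a hedged sketch of the general idea, not the paper's oracle definition (the band width k is an assumption):

def quality_oracle(metric_values, nominal_mean, nominal_std, k=3.0):
    # Flag the run (e.g., kill the mutant) when the driving-quality metric
    # leaves the band observed under nominal, unmutated driving.
    return any(abs(v - nominal_mean) > k * nominal_std for v in metric_values)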
@inproceedings{2021-Jahangirova-ICST, author = {Jahangirova, Gunel and Stocco, Andrea and Tonella, Paolo}, title = {Quality Metrics and Oracles for Autonomous Vehicles Testing}, booktitle = {Proceedings of the 14th IEEE International Conference on Software Testing, Verification and Validation}, series = {ICST '21}, publisher = {IEEE}, pages = {12 pages}, year = {2021}, } - SOFSEMWeb Test Automation: Insights from the Grey LiteratureFilippo Ricca and Andrea StoccoIn Proceedings of the 47th International Conference on Current Trends in Theory and Practice of Computer Science, Jan 2021[Best Paper Nominee]
This paper provides the results of a survey of the grey literature concerning best practices for end-to-end web test automation. We analyzed more than 2,400 sources (e.g., blog posts, white-papers, user manuals, GitHub repositories) looking for guidelines by IT professionals on how to develop and maintain web test code. Ultimately, we filtered 142 relevant documents from which we extracted a taxonomy of guidelines divided into technical tips (i.e., concerning the development, maintenance, and execution of web tests), and business-level tips (i.e., concerning the planning and management of testing teams, design, and process). The paper concludes by distilling the ten most cited best practices for developing good quality automated web tests.
@inproceedings{2021-Ricca-SOFSEM, author = {Ricca, Filippo and Stocco, Andrea}, title = {Web Test Automation: Insights from the Grey Literature}, booktitle = {Proceedings of the 47th International Conference on Current Trends in Theory and Practice of Computer Science}, publisher = {Springer}, series = {SOFSEM 2021}, note = {[Best Paper Nominee]}, year = {2021}, }
2020
- TSEA Survey on the Use of Computer Vision to Improve Software Engineering TasksMohammad Bajammal, Andrea Stocco, Davood Mazinanian, and Ali MesbahIEEE Transactions on Software Engineering, Oct 2020
Software engineering (SE) research has traditionally revolved around engineering the source code. However, novel approaches that analyze software through computer vision have been increasingly adopted in SE. These approaches allow analyzing the software from a different complementary perspective other than the source code, and they are used to either complement existing source code-based methods, or to overcome their limitations. The goal of this manuscript is to survey the use of computer vision techniques in SE with the aim of assessing their potential in advancing the field of SE research. We examined an extensive body of literature from top-tier SE venues, as well as venues from closely related fields (machine learning, computer vision, and human-computer interaction). Our inclusion criteria targeted papers applying computer vision techniques that address problems related to any area of SE. We collected an initial pool of 2,716 papers, from which we obtained 66 final relevant papers covering a variety of SE areas. We analyzed what computer vision techniques have been adopted or designed, for what reasons, how they are used, what benefits they provide, and how they are evaluated. Our findings highlight that visual approaches have been adopted in a wide variety of SE tasks, predominantly for effectively tackling software analysis and testing challenges in the web and mobile domains. The results also show a rapid growth trend of the use of computer vision techniques in SE research.
@article{2020-Bajammal-TSE, author = {Bajammal, Mohammad and Stocco, Andrea and Mazinanian, Davood and Mesbah, Ali}, title = {{A Survey on the Use of Computer Vision to Improve Software Engineering Tasks}}, journal = {IEEE Transactions on Software Engineering}, publisher = {IEEE}, year = {2020}, month = oct, volume = {48}, number = {5}, doi = {10.1109/TSE.2020.3032986}, } - ISSREWTowards Anomaly Detectors that Learn ContinuouslyAndrea Stocco and Paolo TonellaIn Proceedings of the 31st International Symposium on Software Reliability Engineering Workshops, Oct 2020
In this paper, we first discuss the challenges of adapting an already trained DNN-based anomaly detector with knowledge mined during the execution of the main system. Then, we present a framework for the continual learning of anomaly detectors, which records in-field behavioural data to determine what data are appropriate for adaptation. We evaluated our framework to improve an anomaly detector taken from the literature, in the context of misbehavior prediction for self-driving cars. Our results show that our solution can reduce the false positive rate by a large margin and adapt to nominal behaviour changes while maintaining the original anomaly detection capability.
@inproceedings{2020-Stocco-GAUSS, author = {Stocco, Andrea and Tonella, Paolo}, title = {Towards Anomaly Detectors that Learn Continuously}, booktitle = {Proceedings of the 31st International Symposium on Software Reliability Engineering Workshops}, publisher = {IEEE}, series = {ISSREW 2020}, year = {2020}, month = oct, doi = {10.1109/ISSREW51248.2020.00073}, } - EMSETesting Machine Learning based Systems: A Systematic MappingVincenzo Riccio, Gunel Jahangirova, Andrea Stocco, Nargiz Humbatova, and 2 more authorsEmpirical Software Engineering, Nov 2020
Context: A Machine Learning based System (MLS) is a software system including one or more components that learn how to perform a task from a given data set. The increasing adoption of MLSs in safety critical domains such as autonomous driving, healthcare, and finance has fostered much attention towards the quality assurance of such systems. Despite the advances in software testing, MLSs bring novel and unprecedented challenges, since their behaviour is defined jointly by the code that implements them and the data used for training them. Objective: To identify the existing solutions for functional testing of MLSs, and classify them from three different perspectives: (1) the context of the problem they address, (2) their features, and (3) their empirical evaluation. To report demographic information about the ongoing research. To identify open challenges for future research. Method: We conducted a systematic mapping study about testing techniques for MLSs driven by 33 research questions. We followed existing guidelines when defining our research protocol so as to increase the repeatability and reliability of our results. Results: We identified 70 relevant primary studies, mostly published in recent years. We identified 11 problems addressed in the literature. We investigated multiple aspects of the testing approaches, such as the used/proposed adequacy criteria, the algorithms for test input generation, and the test oracles. Conclusions: The most active research areas in MLS testing address automated scenario/input generation and test oracle creation. MLS testing is a rapidly growing and developing research area, with many open challenges, such as the generation of realistic inputs and the definition of reliable evaluation metrics and benchmarks.
@article{2020-Riccio-EMSE, author = {Riccio, Vincenzo and Jahangirova, Gunel and Stocco, Andrea and Humbatova, Nargiz and Weiss, Michael and Tonella, Paolo}, title = {{Testing Machine Learning based Systems: A Systematic Mapping}}, journal = {Empirical Software Engineering}, publisher = {Springer}, year = {2020}, volume = {25}, number = {6}, doi = {10.1007/s10664-020-09881-0}, month = nov, pages = {5193--5254}, } - STVRBugsJS: A Benchmark and Taxonomy of JavaScript BugsPéter Gyimesi, Béla Vancsics, Andrea Stocco, Davood Mazinanian, and 3 more authorsSoftware Testing, Verification And Reliability, Oct 2020
JavaScript is a popular programming language that is also error-prone due to its asynchronous, dynamic, and loosely typed nature. In recent years, numerous techniques have been proposed for analyzing and testing JavaScript applications. However, our survey of the literature in this area revealed that the proposed techniques are often evaluated on different datasets of programs and bugs. The lack of a commonly used benchmark limits the ability to perform fair and unbiased comparisons for assessing the efficacy of new techniques. To fill this gap, we propose BugsJS, a benchmark of 453 real, manually validated JavaScript bugs from 10 popular JavaScript server-side programs, comprising 444k lines of code (LOC) in total. Each bug is accompanied by its bug report, the test cases that expose it, as well as the patch that fixes it. We extended BugsJS with a rich web interface for visualizing and dissecting the bugs’ information, as well as a programmable API to access the faulty and fixed versions of the programs and to execute the corresponding test cases, which facilitates conducting highly reproducible empirical studies and comparisons of JavaScript analysis and testing tools. Moreover, following a rigorous procedure, we performed a classification of the bugs according to their nature. Our internal validation shows that our taxonomy is adequate for characterizing the bugs in BugsJS. We discuss several ways in which the resulting taxonomy and the benchmark can help direct researchers interested in automated testing of JavaScript applications.
@article{2020-Gyimesi-STVR, author = {Gyimesi, P\'{e}ter and Vancsics, B\'{e}la and Stocco, Andrea and Mazinanian, Davood and \'{A}rp\'{a}d Besz\'{e}des and Ferenc, Rudolf and Mesbah, Ali}, title = {{BugsJS}: A Benchmark and Taxonomy of JavaScript Bugs}, journal = {Software Testing, Verification And Reliability}, publisher = {John Wiley & Sons}, year = {2020}, volume = {31}, number = {4}, month = oct, doi = {10.1002/stvr.1751}, } - ICSEMisbehaviour Prediction for Autonomous Driving SystemsAndrea Stocco, Michael Weiss, Marco Calzana, and Paolo TonellaIn Proceedings of the 42nd International Conference on Software Engineering, Jun 2020
Deep Neural Networks (DNNs) are the core component of modern autonomous driving systems. To date, it is still unrealistic that a DNN will generalize correctly to all driving conditions. Current testing techniques consist of offline solutions that identify adversarial or corner cases for improving the training phase. In this paper, we address the problem of estimating the confidence of DNNs in response to unexpected execution contexts with the purpose of predicting potential safety-critical misbehaviours and enabling online healing of DNN-based vehicles. Our approach SelfOracle is based on a novel concept of self-assessment oracle, which monitors the DNN confidence at runtime, to predict unsupported driving scenarios in advance. SelfOracle uses autoencoder- and time series-based anomaly detection to reconstruct the driving scenarios seen by the car, and to determine the confidence boundary between normal and unsupported conditions. In our empirical assessment, we evaluated the effectiveness of different variants of SelfOracle at predicting injected anomalous driving contexts, using DNN models and simulation environment from Udacity. Results show that, overall, SelfOracle can predict 77% of misbehaviours, up to six seconds in advance, outperforming the online input validation approach of DeepRoad.
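The confidence-boundary idea can be sketched as follows; the quantile rule is a simplification (the paper fits a probability distribution to the reconstruction errors instead), and all names are illustrative:

import numpy as np

def fit_confidence_boundary(nominal_errors, expected_false_alarm_rate=0.05):
    # A high quantile of the reconstruction errors observed in nominal
    # conditions separates normal from unsupported driving scenarios.
    return float(np.quantile(nominal_errors, 1.0 - expected_false_alarm_rate))

def predicts_misbehaviour(reconstruction_error, boundary):
    # At runtime, an error above the boundary flags an unsupported scenario.
    return reconstruction_error > boundary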
@inproceedings{2020-Stocco-ICSE, author = {Stocco, Andrea and Weiss, Michael and Calzana, Marco and Tonella, Paolo}, title = {Misbehaviour Prediction for Autonomous Driving Systems}, booktitle = {Proceedings of the 42nd International Conference on Software Engineering}, series = {ICSE '20}, publisher = {ACM}, pages = {12 pages}, year = {2020}, month = jun, doi = {10.1145/3377811.3380353} } - ICSETaxonomy of Real Faults in Deep Learning SystemsNargiz Humbatova, Gunel Jahangirova, Gabriele Bavota, Vincenzo Riccio, and 2 more authorsIn Proceedings of the 42nd International Conference on Software Engineering, Jun 2020
[Best Artifact Award]
The growing application of deep neural networks in safety-critical domains makes the analysis of faults that occur in such systems of enormous importance. In this paper we introduce a large taxonomy of faults in deep learning (DL) systems. We have manually analysed 1059 artefacts gathered from GitHub commits and issues of projects that use the most popular DL frameworks (TensorFlow, Keras and PyTorch) and from related Stack Overflow posts. Structured interviews with 20 researchers and practitioners describing the problems they have encountered in their experience have enriched our taxonomy with a variety of additional faults that did not emerge from the other two sources. Our final taxonomy was validated with a survey involving an additional set of 21 developers, confirming that almost all fault categories (13/15) were experienced by at least 50% of the survey participants.
@inproceedings{2020-Humbatova-ICSE, author = {Humbatova, Nargiz and Jahangirova, Gunel and Bavota, Gabriele and Riccio, Vincenzo and Stocco, Andrea and Tonella, Paolo}, title = {Taxonomy of Real Faults in Deep Learning Systems}, booktitle = {Proceedings of the 42nd International Conference on Software Engineering}, series = {ICSE '20}, publisher = {ACM}, pages = {12 pages}, year = {2020}, month = jun, doi = {10.1145/3377811.3380395} } - ICSENear-Duplicate Detection in Web App Model InferenceRahulkrishna Yandrapally, Andrea Stocco, and Ali MesbahIn Proceedings of the 42nd International Conference on Software Engineering, Jun 2020
Automated web testing techniques infer models from a given web app, which are used for test generation. From a testing viewpoint, such an inferred model should contain the minimal set of states that are distinct yet adequately cover the app’s main functionalities. In practice, models inferred automatically are affected by near-duplicates, i.e., replicas of the same functional webpage differing only by small insignificant changes. We present the first study of near-duplicate detection algorithms used in within-app model inference. We first characterize functional near-duplicates by classifying a random sample of state-pairs, from 493k pairs of webpages obtained from over 6,000 websites, into three categories, namely clone, near-duplicate, and distinct. We systematically compute thresholds that define the boundaries of these categories for each detection technique. We then use these thresholds to evaluate 10 near-duplicate detection techniques from three different domains, namely, information retrieval, web testing, and computer vision on nine open-source web apps. Our study highlights the challenges posed in automatically inferring a model for any given web app. Our findings show that even with the best thresholds, no algorithm is able to accurately detect all functional near-duplicates within apps, without sacrificing coverage.
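The threshold-based classification described above can be pictured with a small sketch; the concrete threshold values are placeholders, since the study derives them per detection technique:

def classify_state_pair(similarity, t_clone=0.99, t_near=0.85):
    # Two thresholds delimit the clone / near-duplicate / distinct categories
    # for a given similarity score in [0, 1].
    if similarity >= t_clone:
        return "clone"
    if similarity >= t_near:
        return "near-duplicate"
    return "distinct"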
@inproceedings{2020-Yandrapally-ICSE, author = {Yandrapally, Rahulkrishna and Stocco, Andrea and Mesbah, Ali}, title = {Near-Duplicate Detection in Web App Model Inference}, booktitle = {Proceedings of the 42nd International Conference on Software Engineering}, series = {ICSE '20}, publisher = {ACM}, pages = {12 pages}, year = {2020}, month = jun, doi = {10.1145/3377811.3380416}, } - ICSTDependency-Aware Web Test GenerationMatteo Biagiola, Andrea Stocco, Filippo Ricca, and Paolo TonellaIn Proceedings of the 13th IEEE International Conference on Software Testing, Verification and Validation, Oct 2020
Web crawlers can perform long-running in-depth explorations of a web application, achieving high coverage of the navigational structure. However, a crawling trace cannot be easily turned into a minimal test suite that achieves the same coverage. In fact, when the crawling trace is segmented into test cases, two problems arise: (1) test cases are dependent on each other and may therefore raise errors when executed in isolation, and (2) test cases are redundant, since the same targets are covered multiple times by different test cases. In this paper, we propose DANTE, a novel web test generator that computes the test dependencies associated with the test cases obtained from a crawling session, and uses them to eliminate redundant tests and produce executable test schedules. DANTE can effectively turn a web crawler into a test case generator that produces minimal test suites, composed only of feasible tests that contribute to achieving the final coverage. Experimental results show that DANTE, on average, (1) reduces the error rate of the test cases obtained by crawling traces from 85% to zero, (2) produces minimized test suites that are 84% smaller than the initial ones, and (3) outperforms two competing crawling-based and model-based techniques in terms of coverage and breakage rate.
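As a simplified sketch of dependency-aware scheduling (not DANTE's actual algorithm), one can order tests topologically by their dependencies and keep only those that add coverage; a real scheduler must also retain tests that kept tests depend on:

from graphlib import TopologicalSorter

def schedule_tests(dependencies, coverage):
    # dependencies: {test: set of tests it depends on}
    # coverage:     {test: set of targets it covers}
    kept, covered = [], set()
    for test in TopologicalSorter(dependencies).static_order():
        gained = coverage.get(test, set()) - covered
        if gained:  # keep only tests contributing new coverage
            kept.append(test)
            covered |= gained
    return kept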
@inproceedings{2020-Biagiola-ICST, author = {Biagiola, Matteo and Stocco, Andrea and Ricca, Filippo and Tonella, Paolo}, title = {Dependency-Aware Web Test Generation}, booktitle = {Proceedings of the 13th IEEE International Conference on Software Testing, Verification and Validation}, series = {ICST '20}, publisher = {IEEE}, pages = {12 pages}, year = {2020}, month = oct, doi = {10.1109/ICST46399.2020.00027}, }
2019
- ProWebHow Artificial Intelligence Can Improve Web Development and TestingAndrea StoccoIn Companion of the 3rd International Conference on Art, Science, and Engineering of Programming, Genova, Italy, Apr 2019
The Artificial Intelligence (AI) revolution in software development is just around the corner. With the rise of AI, developers are expected to play a different role from the traditional role of programmers, as they will need to adapt their know-how and skillsets to complement and apply AI-based tools and techniques into their traditional web development workflow. In this extended abstract, some of the current trends on how AI is being leveraged to enhance web development and testing are discussed, along with some of the main opportunities and challenges for researchers.
@inproceedings{2019-Stocco-Proweb, author = {Stocco, Andrea}, title = {How Artificial Intelligence Can Improve Web Development and Testing}, booktitle = {Companion of the 3rd International Conference on Art, Science, and Engineering of Programming}, series = {Programming '19}, year = {2019}, month = apr, location = {Genova, Italy}, pages = {1--13}, articleno = {13}, numpages = {4}, url = {http://doi.acm.org/10.1145/3328433.3328447}, doi = {10.1145/3328433.3328447}, publisher = {ACM}, address = {New York, NY, USA}, keywords = {artificial intelligence, web development, web testing}, } - ESEC/FSEDiversity-based Web Test GenerationMatteo Biagiola, Andrea Stocco, Filippo Ricca, and Paolo TonellaIn Proceedings of the 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Aug 2019
Existing web test generators derive test paths from a navigational model of the web application, completed with either manually or randomly generated input values. However, manual test data selection is costly, while random generation often results in infeasible input sequences, which are rejected by the application under test. Random and search-based generation can achieve the desired level of model coverage only after a large number of test execution attempts, each slowed down by the need to interact with the browser during test execution. In this work, we present a novel web test generation algorithm that pre-selects the most promising candidate test cases based on their diversity from previously generated tests. As such, only the test cases that explore diverse behaviours of the application are considered for in-browser execution. We have implemented our approach in a tool called DIG. Our empirical evaluation on six real-world web applications shows that DIG achieves higher coverage and fault detection rates significantly earlier than crawling-based and search-based web test generators.
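The diversity-based pre-selection can be pictured as a greedy novelty choice; distance stands in for whatever test-representation distance is used, and the whole snippet is an illustration rather than DIG's implementation:

def pick_most_diverse(candidates, executed, distance):
    # Prefer the candidate whose minimum distance to the already executed
    # tests is largest, so only diverse behaviours reach the browser.
    def novelty(candidate):
        return min((distance(candidate, e) for e in executed),
                   default=float("inf"))
    return max(candidates, key=novelty)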
@inproceedings{2019-Biagiola-FSE-Diversity, author = {Biagiola, Matteo and Stocco, Andrea and Ricca, Filippo and Tonella, Paolo}, title = {Diversity-based Web Test Generation}, booktitle = {Proceedings of the 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering}, series = {ESEC/FSE 2019}, publisher = {ACM}, pages = {12 pages}, year = {2019}, month = aug, doi = {10.1145/3338906.3338970}, } - ESEC/FSEWeb Test Dependency DetectionMatteo Biagiola, Andrea Stocco, Ali Mesbah, Filippo Ricca, and 1 more authorIn Proceedings of the 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Aug 2019
E2E web test suites are prone to test dependencies due to the heterogeneous multi-tiered nature of modern web apps, which makes it difficult for developers to create isolated program states for each test case. In this paper, we present the first approach for detecting and validating test dependencies present in E2E web test suites. Our approach employs string analysis to extract an approximated set of dependencies from the test code. It then filters potential false dependencies through natural language processing of test names. Finally, it validates all dependencies, and uses a novel recovery algorithm to ensure no true dependencies are missed in the final test dependency graph. Our approach is implemented in a tool called TEDD and evaluated on the test suites of six open-source web apps. Our results show that TEDD can correctly detect and validate test dependencies up to 72% faster than the baseline with the original test ordering in which the graph contains all possible dependencies. The test dependency graphs produced by TEDD enable test execution parallelization, with a speed-up factor of up to 7×.
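A toy version of the string-analysis step might look as follows; the regexes and the write/read heuristic are invented for illustration and are far cruder than TEDD's analysis:

import re

def approximate_dependencies(test_sources):
    # test_sources: {test name: test source code as a string}.
    # A test that enters a value which another test later asserts on is a
    # candidate dependency; pair (a, b) means b depends on a.
    writes = {t: set(re.findall(r'sendKeys\("([^"]+)"\)', src))
              for t, src in test_sources.items()}
    reads = {t: set(re.findall(r'assert\w*\(.*?"([^"]+)"', src))
             for t, src in test_sources.items()}
    return {(a, b) for a in test_sources for b in test_sources
            if a != b and writes[a] & reads[b]}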
@inproceedings{2019-Biagiola-FSE-Dependencies, author = {Biagiola, Matteo and Stocco, Andrea and Mesbah, Ali and Ricca, Filippo and Tonella, Paolo}, title = {Web Test Dependency Detection}, booktitle = {Proceedings of the 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering}, series = {ESEC/FSE 2019}, publisher = {ACM}, pages = {12 pages}, year = {2019}, doi = {10.1145/3338906.3338948}, month = aug, } - ICSTBugsJS: A Benchmark of JavaScript BugsPéter Gyimesi, Béla Vancsics, Andrea Stocco, Davood Mazinanian, and 3 more authorsIn Proceedings of the 12th IEEE International Conference on Software Testing, Verification and Validation, Apr 2019
JavaScript is a popular programming language that is also error-prone due to its asynchronous, dynamic, and loosely-typed nature. In recent years, numerous techniques have been proposed for analyzing and testing JavaScript applications. However, our survey of the literature in this area revealed that the proposed techniques are often evaluated on different datasets of programs and bugs. The lack of a commonly used benchmark limits the ability to perform fair and unbiased comparisons for assessing the efficacy of new techniques. To fill this gap, we propose BugsJS, a benchmark of 453 real, manually validated JavaScript bugs from 10 popular JavaScript server-side programs, comprising 444k LOC in total. Each bug is accompanied by its bug report, the test cases that detect it, as well as the patch that fixes it. BugsJS features a rich interface for accessing the faulty and fixed versions of the programs and executing the corresponding test cases, which facilitates conducting highly-reproducible empirical studies and comparisons of JavaScript analysis and testing tools.
@inproceedings{2019-Gyimesi-ICST, author = {Gyimesi, P\'{e}ter and Vancsics, B\'{e}la and Stocco, Andrea and Mazinanian, Davood and \'{A}rp\'{a}d Besz\'{e}des and Ferenc, Rudolf and Mesbah, Ali}, title = {{BugsJS}: A Benchmark of JavaScript Bugs}, booktitle = {Proceedings of the 12th IEEE International Conference on Software Testing, Verification and Validation}, series = {ICST 2019}, publisher = {IEEE}, pages = {90--101}, year = {2019}, month = apr, doi = {10.1109/ICST.2019.00019} } - Adv.
Comp.Three Open Problems in the Context of E2E Web Testing and a Vision: NEONATEFilippo Ricca, Maurizio Leotta, and Andrea StoccoAdvances in Computers, Jan 2019Web applications are critical assets of our society and thus assuring their quality is of undeniable importance. Despite the advances in software testing, the ever-increasing technological complexity of these applications makes it difficult to prevent errors. In this work, we provide a thorough description of the three open problems hindering web test automation: fragility problem, strong coupling and low cohesion problem, and incompleteness problem. We conjecture that a major breakthrough in test automation is needed, because the problems are closely correlated, and hence need to be attacked together rather than separately. To this aim, we describe Neonate, a novel integrated testing environment specifically designed to empower the web tester. Our utmost purpose is to make the research community aware of the existence of the three problems and their correlation, so that more research effort can be directed in providing solutions and tools to advance the state of the art of web test automation.
@article{2019-Ricca-Advances, author = {Ricca, Filippo and Leotta, Maurizio and Stocco, Andrea}, title = {{Three Open Problems in the Context of E2E Web Testing and a Vision: NEONATE}}, journal = {Advances in Computers}, publisher = {Elsevier}, volume = {113}, pages = {89-133}, year = {2019}, issn = {0065-2458}, doi = {10.1016/bs.adcom.2018.10.005}, month = jan, }
2018
- ESEC/FSE
Demo TrackVISTA: Web Test Repair Using Computer VisionAndrea Stocco, Rahulkrishna Yandrapally, and Ali MesbahIn Proceedings of the 26th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Nov 2018Repairing broken web element locators represents the major maintenance cost of web test cases. To detect possible repairs, testers typically inspect the tests’ interactions with the application under test through the GUI. Existing automated test repair techniques focus instead on the code and ignore visual aspects of the application. In this demo paper, we give an overview of Vista, a novel test repair technique that leverages computer vision and local crawling to automatically suggest and apply repairs to broken web tests. URL: https://github.com/saltlab/Vista
@inproceedings{2018-Stocco-FSE-demo, author = {Stocco, Andrea and Yandrapally, Rahulkrishna and Mesbah, Ali}, title = {{VISTA}: Web Test Repair Using Computer Vision}, booktitle = {Proceedings of the 26th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering}, series = {ESEC/FSE 2018 - Demonstration Track}, publisher = {ACM}, pages = {876--879}, year = {2018}, month = nov, doi = {10.1145/3236024.3264592}, } - ESEC/FSEVisual Web Test RepairAndrea Stocco, Rahulkrishna Yandrapally, and Ali MesbahIn Proceedings of the 26th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Nov 2018
Web tests are prone to break frequently as the application under test evolves, causing much maintenance effort in practice. To detect the root causes of a test breakage, developers typically inspect the test’s interactions with the application through the GUI. Existing automated test repair techniques focus instead on the code and entirely ignore visual aspects of the application. We propose a test repair technique that is informed by a visual analysis of the application. Our approach captures relevant visual information from test execution and analyzes it through a fast image processing pipeline to visually validate test cases as they are re-executed for regression purposes. Then, it reports the occurrences of breakages and potential fixes to the testers. Our approach is also equipped with a local crawling mechanism to handle non-trivial breakage scenarios such as the ones that require repairing the test’s workflow. We implemented our approach in a tool called Vista. Our empirical evaluation on 2,672 test cases spanning 86 releases of four web applications shows that Vista is able to repair, on average, 81% of the breakages, a 41% increment with respect to existing techniques.
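The visual relocation step can be approximated with off-the-shelf template matching; OpenCV is used here purely as a stand-in for Vista's image-processing pipeline, and the score threshold is an assumption:

import cv2

def locate_visually(screenshot_path, element_image_path, min_score=0.8):
    # Template-match the element's saved image on the current page: a low
    # score signals a candidate breakage, a high score yields its new position.
    page = cv2.imread(screenshot_path)
    template = cv2.imread(element_image_path)
    scores = cv2.matchTemplate(page, template, cv2.TM_CCOEFF_NORMED)
    _, best_score, _, top_left = cv2.minMaxLoc(scores)
    return top_left if best_score >= min_score else None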
@inproceedings{2018-Stocco-FSE, author = {Stocco, Andrea and Yandrapally, Rahulkrishna and Mesbah, Ali}, title = {Visual Web Test Repair}, booktitle = {Proceedings of the 26th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering}, series = {ESEC/FSE 2018}, publisher = {ACM}, pages = {503--514}, year = {2018}, month = nov, doi = {10.1145/3236024.3236063}, } - ICSEFine-Grained Test MinimizationArash Vahabzadeh, Andrea Stocco, and Ali MesbahIn Proceedings of the 40th ACM/IEEE International Conference on Software Engineering, May 2018
As a software system evolves, its test suite can accumulate redundancies over time. Test minimization aims at removing redundant test cases. However, current techniques remove whole test cases from the test suite using test adequacy criteria, such as code coverage. This has two limitations, namely (1) by removing a whole test case the corresponding test assertions are also lost, which can inhibit test suite effectiveness, and (2) the issue of partly redundant test cases, i.e., tests with redundant test statements, is ignored. We propose a novel approach for fine-grained test case minimization. Our analysis is based on the inference of a test suite model that enables automated test reorganization within test cases. It enables removing redundancies at the test statement level, while preserving the coverage and test assertions of the test suite. We evaluated our approach, implemented in a tool called Testler, on the test suites of 15 open source projects. Our analysis shows that over 4,639 (24%) of the tests in these test suites are partly redundant, with over 11,819 redundant test statements in total. Our results show that Testler removes 43% of the redundant test statements, reducing the number of partly redundant tests by 52%. As a result, test suite execution time is reduced by up to 37% (20% on average), while maintaining the original statement coverage, branch coverage, test assertions, and fault detection capability.
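Statement-level minimization can be sketched as a greedy pass that keeps assertions and any statement contributing new coverage; Testler additionally models data-flow between statements, which this illustration omits:

def minimize_test_statements(statements, coverage, is_assertion):
    # statements: statement ids in execution order;
    # coverage: {statement: set of covered entities}.
    kept, covered = [], set()
    for s in statements:
        if is_assertion(s) or coverage[s] - covered:
            kept.append(s)
            covered |= coverage[s]
    return kept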
@inproceedings{2018-Arash-ICSE, author = {Vahabzadeh, Arash and Stocco, Andrea and Mesbah, Ali}, title = {Fine-Grained Test Minimization}, booktitle = {Proceedings of the 40th ACM/IEEE International Conference on Software Engineering}, series = {ICSE 2018}, publisher = {ACM}, pages = {210--221}, year = {2018}, month = may, doi = {10.1145/3180155.3180203}, } - STVRPESTO: Automated Migration of DOM-based Web Tests towards the Visual ApproachMaurizio Leotta, Andrea Stocco, Filippo Ricca, and Paolo TonellaSoftware Testing, Verification And Reliability, Mar 2018
Automated test scripts are used with success in many web development projects, so as to automatically verify key functionalities of the web application under test, reveal possible regressions and run a large number of tests in short time. However, the adoption of automated web testing brings advantages but also novel problems, among which the test code fragility problem. During the evolution of the web application, existing test code may easily break and testers have to correct it. In the context of automated DOM-based web testing, one of the major costs for evolving the test code is the manual effort necessary to repair broken web page element locators – lines of source code identifying the web elements (e.g., form fields, buttons) to interact with. In this work, we present ROBULA+, a novel algorithm able to generate robust XPath-based locators – locators that are likely to work correctly on new releases of the web application. We compared ROBULA+ with several state of the practice/art XPath locator generator tools/algorithms. Results show that XPath locators produced by ROBULA+ are by far the most robust. Indeed, ROBULA+ reduces the locators fragility on average by 90% w.r.t. absolute locators and by 63% w.r.t. Selenium IDE locators.
@article{2018-Leotta-STVR, author = {Leotta, Maurizio and Stocco, Andrea and Ricca, Filippo and Tonella, Paolo}, journal = {Software Testing, Verification And Reliability}, publisher = {John Wiley & Sons}, title = {{PESTO}: Automated Migration of {DOM}-based Web Tests towards the Visual Approach}, year = {2018}, month = mar, doi = {10.1002/stvr.1665}, }
2017
- Ph.D.
ThesisAutomatic page object generation to support E2E testing of web applicationsAndrea StoccoUniversità degli Studi di Genova, Mar 2017 - SQJAPOGEN: Automatic Page Object Generator for Web TestingAndrea Stocco, Maurizio Leotta, Filippo Ricca, and Paolo TonellaSoftware Quality journal, Sep 2017
Modern web applications are characterized by ultra-rapid development cycles, and web testers tend to pay scant attention to the quality of their automated end-to-end test suites. Indeed, these quickly become hard to maintain, as the application under test evolves. As a result, end-to-end automated test suites are abandoned, despite their great potential for catching regressions. The use of the Page Object pattern has proven to be very effective in end-to-end web testing. Page objects are façade classes abstracting the internals of web pages into high-level business functions that can be invoked by the test cases. By decoupling test code from web page details, web test cases are more readable and maintainable. However, the manual development of such page objects requires substantial coding effort, which is paid off only later, during software evolution. In this paper, we describe a novel approach for the automatic generation of page objects for web applications. Our approach is implemented in the tool Apogen, which automatically derives a testing model by reverse engineering the target web application. It combines clustering and static analysis to identify meaningful page abstractions that are automatically turned into Java page objects for Selenium WebDriver. Our evaluation on an open-source web application shows that our approach is highly promising: Automatically generated page object methods cover most of the application functionalities and result in readable and meaningful code, which can be very useful to support the creation of more maintainable web test suites.
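For readers unfamiliar with the pattern, here is a hand-written example of the kind of abstraction Apogen generates (the tool emits Java page objects for Selenium WebDriver; Python and the element ids are used here only for illustration):

from selenium.webdriver.common.by import By

class HomePage:
    def __init__(self, driver):
        self.driver = driver

class LoginPage:
    # Exposes the page's business function and hides the locator details.
    def __init__(self, driver):
        self.driver = driver

    def login(self, username, password):
        self.driver.find_element(By.ID, "username").send_keys(username)
        self.driver.find_element(By.ID, "password").send_keys(password)
        self.driver.find_element(By.ID, "login").click()
        return HomePage(self.driver)  # the page reached by the action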
@article{2017-Stocco-SQJ, author = {Stocco, Andrea and Leotta, Maurizio and Ricca, Filippo and Tonella, Paolo}, title = {{APOGEN: Automatic Page Object Generator for Web Testing}}, journal = {Software Quality Journal}, volume = {25}, number = {3}, month = sep, year = {2017}, issn = {0963-9314}, pages = {1007--1039}, numpages = {33}, doi = {10.1007/s11219-016-9331-9}, acmid = {3129059}, publisher = {Kluwer Academic Publishers}, }
2016
- JSEPROBULA+: An Algorithm for Generating Robust XPath Locators for Web TestingMaurizio Leotta, Andrea Stocco, Filippo Ricca, and Paolo TonellaJournal of Software: Evolution and Process, Mar 2016[Invited journal First track at ICSME 2016]
Automated test scripts are used with success in many web development projects, so as to automatically verify key functionalities of the web application under test, reveal possible regressions and run a large number of tests in short time. However, the adoption of automated web testing brings advantages but also novel problems, among which the test code fragility problem. During the evolution of the web application, existing test code may easily break and testers have to correct it. In the context of automated DOM-based web testing, one of the major costs for evolving the test code is the manual effort necessary to repair broken web page element locators – lines of source code identifying the web elements (e.g. form fields and buttons) to interact with. In this work, we present Robula+, a novel algorithm able to generate robust XPath-based locators – locators that are likely to work correctly on new releases of the web application. We compared Robula+ with several state of the practice/art XPath locator generator tools/algorithms. Results show that XPath locators produced by Robula+ are by far the most robust. Indeed, Robula+ reduces the locators’ fragility on average by 90% w.r.t. absolute locators and by 63% w.r.t. Selenium IDE locators.
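The core idea, start generic and then specialise until the locator is unique, can be sketched as below; the candidate-generation order and the count_matches callback are simplifications of the actual algorithm:

def generate_robust_xpath(tag, attributes, count_matches):
    # count_matches(xpath) -> number of nodes the expression matches.
    # Short, attribute-based locators are preferred over absolute paths.
    candidates = [f"//{tag}"]
    candidates += [f"//{tag}[@{name}='{value}']"
                   for name, value in attributes.items()]
    for xpath in candidates:
        if count_matches(xpath) == 1:
            return xpath
    return None  # caller falls back to a less robust locator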
@article{2016-Leotta-JSEP, author = {Leotta, Maurizio and Stocco, Andrea and Ricca, Filippo and Tonella, Paolo}, journal = {Journal of Software: Evolution and Process}, pages = {177--204}, volume = {28}, publisher = {John Wiley & Sons}, url = {http://dx.doi.org/10.1002/smr.1771}, doi = {10.1002/smr.1771}, title = {{ROBULA+}: An Algorithm for Generating Robust {XPath} Locators for Web Testing}, note = {[Invited journal First track at ICSME 2016]}, year = {2016}, month = mar, } - ICWEClustering-Aided Page Object Generation for Web TestingAndrea Stocco, Maurizio Leotta, Filippo Ricca, and Paolo TonellaIn Proceedings of the 16th International Conference on Web Engineering, Jun 2016[Best Student Paper Award]
To decouple test code from web page details, web testers adopt the Page Object design pattern. Page objects are facade classes abstracting the internals of web pages (e.g., form fields) into high-level business functions that can be invoked by test cases (e.g., user authentication). However, writing such page objects requires substantial effort, which is paid off only later, during software evolution. In this paper we propose a clustering-based approach for the identification of meaningful abstractions that are automatically turned into Java page objects. Our clustering approach to page object identification has been integrated into our tool for automated page object generation, APOGEN. Experimental results indicate that the clustering approach provides clusters of web pages close to those manually produced by a human (with, on average, only 3 differences per web application). 75% of the code generated by APOGEN can be used as-is by web testers, breaking down the manual effort for page object creation. Moreover, a large portion (84%) of the page object methods created automatically to support assertion definition corresponds to useful behavioural abstractions.
@inproceedings{2016-Stocco-ICWE, author = {Stocco, Andrea and Leotta, Maurizio and Ricca, Filippo and Tonella, Paolo}, booktitle = {Proceedings of the 16th International Conference on Web Engineering}, pages = {132--151}, series = {ICWE 2016}, publisher = {Springer}, title = {Clustering-Aided Page Object Generation for Web Testing}, year = {2016}, month = jun, note = {[Best Student Paper Award]}, doi = {10.1007/978-3-319-38791-8_8}, } - ICWE
Demo TrackAutomatic Page Object Generation with APOGENAndrea Stocco, Maurizio Leotta, Filippo Ricca, and Paolo TonellaIn Proceedings of the 16th International Conference on Web Engineering, Jun 2016Page objects are used in web test automation to decouple the test case logic from its concrete implementation. Despite the undeniable advantages they bring, such as decreasing the maintenance effort of a test suite, the burden of their manual development limits their wide adoption. In this demo paper, we give an overview of APOGEN, a tool that leverages reverse engineering, clustering and static analysis, to automatically generate Java page objects for web applications.
@inproceedings{2016-Stocco-ICWE-demo, author = {Stocco, Andrea and Leotta, Maurizio and Ricca, Filippo and Tonella, Paolo}, booktitle = {Proceedings of the 16th International Conference on Web Engineering}, pages = {533--537}, series = {ICWE 2016 - Demo Track}, publisher = {Springer}, title = {Automatic Page Object Generation with APOGEN}, doi = {10.1007/978-3-319-38791-8_42}, year = {2016}, month = jun, } - FSEWATERFALL: An Incremental Approach for Repairing Record-Replay Tests of Web ApplicationsMouna Hammoudi, Gregg Rothermel, and Andrea StoccoIn Proceedings of the 24th ACM SIGSOFT International Symposium on the Foundations of Software Engineering, Nov 2016
Software engineers use record/replay tools to capture use case scenarios that can serve as regression tests for web applications. Such tests, however, can be brittle in the face of code changes. Thus, researchers have sought automated approaches for repairing broken record/replay tests. To date, such approaches have operated by directly analyzing differences between the releases of web applications. Often, however, intermediate versions or commits exist between releases, and these represent finer-grained sequences of changes by which new releases evolve. In this paper, we present WATERFALL, an incremental test repair approach that applies test repair techniques iteratively across a sequence of fine-grained versions of a web application. The results of an empirical study on seven web applications show that our approach is substantially more effective than a coarse-grained approach (209% overall), while maintaining an acceptable level of overhead.
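The incremental idea reduces to a loop over consecutive fine-grained versions; repair stands in for any single-version test-repair technique, so this is only a schematic view:

def waterfall_repair(test, versions, repair):
    # Repair against every intermediate commit instead of jumping
    # straight from the old release to the new one.
    for older, newer in zip(versions, versions[1:]):
        test = repair(test, older, newer)
    return test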
@inproceedings{2016-Hammoudi-FSE, author = {Hammoudi, Mouna and Rothermel, Gregg and Stocco, Andrea}, booktitle = {Proceedings of the 24th ACM SIGSOFT International Symposium on the Foundations of Software Engineering}, pages = {751--762}, series = {FSE 2016}, title = {{WATERFALL}: An Incremental Approach for Repairing Record-Replay Tests of Web Applications}, doi = {10.1145/2950290.2950294}, year = {2016}, month = nov, }
2015
- ASTWhy Creating Web Page Objects Manually If It Can Be Done Automatically?Andrea Stocco, Maurizio Leotta, Filippo Ricca, and Paolo TonellaIn Proceedings of the 10th IEEE/ACM International Workshop on Automation of Software Test, May 2015
Page Object is a design pattern aimed at making web test scripts more readable, robust and maintainable. The effort to manually create the page objects needed for a web application may be substantial, and unfortunately existing tools do not help web developers in such a task. In this paper we present APOGEN, a tool for the automatic generation of page objects for web applications. Our tool automatically derives a testing model by reverse engineering the target web application and uses a combination of dynamic and static analysis to generate Java page objects for the popular Selenium WebDriver framework. Our preliminary evaluation shows that it is possible to use around 3/4 of the automatic page object methods as they are, while the remaining 1/4 need only minor modifications.
@inproceedings{2015-Stocco-AST, author = {Stocco, Andrea and Leotta, Maurizio and Ricca, Filippo and Tonella, Paolo}, booktitle = {Proceedings of the 10th IEEE/ACM International Workshop on Automation of Software Test}, pages = {70--74}, publisher = {IEEE/ACM}, series = {AST 2015}, doi = {10.1109/AST.2015.26}, title = {Why Creating Web Page Objects Manually If It Can Be Done Automatically?}, year = {2015}, month = may, } - SBSTMeta-Heuristic Generation of Robust XPath Locators for Web TestingMaurizio Leotta, Andrea Stocco, Filippo Ricca, and Paolo TonellaIn Proceedings of the 8th International Workshop on Search-Based Software Testing, May 2015
Test scripts used for web testing rely on DOM locators, often expressed as XPaths, to identify the active web page elements and the web page data to be used in assertions. When the web application evolves, the major cost incurred for the evolution of the test scripts is due to broken locators, which fail to locate the target element in the new version of the software. We formulate the problem of automatically generating robust XPath locators as a graph exploration problem, for which we provide an optimal, greedy algorithm. Since such an algorithm has exponential time and space complexity, we also present a genetic algorithm.
@inproceedings{2015-Leotta-SBST, author = {Leotta, Maurizio and Stocco, Andrea and Ricca, Filippo and Tonella, Paolo}, booktitle = {Proceedings of the 8th International Workshop on Search-Based Software Testing}, pages = {36--39}, publisher = {ACM}, series = {SBST 2015}, doi = {10.1109/SBST.2015.16}, title = {Meta-Heuristic Generation of Robust {XP}ath Locators for Web Testing}, year = {2015}, month = may, } - ICSTUsing Multi-Locators to Increase the Robustness of Web Test CasesMaurizio Leotta, Andrea Stocco, Filippo Ricca, and Paolo TonellaIn Proceedings of the 8th IEEE International Conference on Software Testing, Verification and Validation, Apr 2015
The main reason for the fragility of web test cases is the inability of web element locators to work correctly when the web page DOM evolves. Web element locators are used in web test cases to identify all the GUI objects to operate upon and eventually to retrieve web page content that is compared against some oracle in order to decide whether the test case has passed or not. Hence, web element locators play an extremely important role in web testing and when a web element locator gets broken developers have to spend substantial time and effort to repair it. While algorithms exist to produce robust web element locators to be used in web test scripts, no algorithm is perfect and different algorithms are exposed to different fragilities when the software evolves. Based on this observation, we propose a new type of locator, named multi-locator, which selects the best locator among a candidate set of locators produced by different algorithms. Such selection is based on a voting procedure that assigns different voting weights to different locator generation algorithms. Experimental results obtained on six web applications, for which a subsequent release was available, show that the multi-locator is more robust than the single locators (about –30% of broken locators w.r.t. the most robust kind of single locator) and that the execution overhead required by the multiple queries done with different locators is negligible (2-3% at most).
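Weighted voting among locator algorithms can be pictured as follows (an illustrative sketch; the paper's weight assignment is more elaborate):

from collections import defaultdict

def multi_locator(located, weights):
    # located: {algorithm: element found by that algorithm's locator};
    # weights: {algorithm: voting weight}.
    votes = defaultdict(float)
    for algorithm, element in located.items():
        if element is not None:
            votes[element] += weights.get(algorithm, 1.0)
    return max(votes, key=votes.get) if votes else None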
@inproceedings{2015-Leotta-ICST, author = {Leotta, Maurizio and Stocco, Andrea and Ricca, Filippo and Tonella, Paolo}, booktitle = {Proceedings of the 8th IEEE International Conference on Software Testing, Verification and Validation}, pages = {1--10}, publisher = {IEEE}, series = {ICST 2015}, doi = {10.1109/ICST.2015.7102611}, title = {Using Multi-Locators to Increase the Robustness of Web Test Cases}, year = {2015}, month = apr } - SACAutomated Generation of Visual Web Tests from DOM-based Web TestsMaurizio Leotta, Andrea Stocco, Filippo Ricca, and Paolo TonellaIn Proceedings of the 30th ACM/SIGAPP Symposium on Applied Computing, Apr 2015
Functional test automation is increasingly adopted by web applications developers. In particular, 2nd generation tools overcome the limitations of 1st generation tools, based on screen coordinates, by providing APIs for easy selection and interaction with Document Object Model (DOM) elements. On the other hand, a new, 3rd generation of web testing tools, based on visual image recognition, brings the promise of wider applicability and simplicity. In this paper, we consider the problem of the automated creation of 3rd generation visual web tests from 2nd generation test suites. This transformation affects mostly the way in which test cases locate web page elements to interact with or to assert the expected test case outcome. Our tool PESTO determines automatically the screen position of a web element located in the DOM by a DOM-based test case. It then determines a rectangle image centred around the web element so as to ensure unique visual matching. Based on such automatically extracted images, the original, 2nd generation test suite is rewritten into a 3rd generation, visual test suite. Experimental results show that our approach is accurate, hence potentially saving substantial human effort in the creation of visual web tests from DOM-based ones.
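The key step, mapping a DOM locator to a visual template, can be sketched with Selenium's Python bindings and Pillow; the padding value and names are assumptions, and Pesto's actual extraction additionally ensures the cropped image is visually unique:

import io
from PIL import Image
from selenium.webdriver.common.by import By

def element_template(driver, xpath, padding=10):
    # Resolve the DOM locator, then crop a rectangle around the element from
    # a full-page screenshot: the image a visual test will later match.
    element = driver.find_element(By.XPATH, xpath)
    shot = Image.open(io.BytesIO(driver.get_screenshot_as_png()))
    x, y = int(element.location["x"]), int(element.location["y"])
    w, h = int(element.size["width"]), int(element.size["height"])
    return shot.crop((x - padding, y - padding,
                      x + w + padding, y + h + padding))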
@inproceedings{2015-Leotta-SAC, author = {Leotta, Maurizio and Stocco, Andrea and Ricca, Filippo and Tonella, Paolo}, booktitle = {Proceedings of the 30th ACM/SIGAPP Symposium on Applied Computing}, pages = {775–782}, publisher = {ACM}, series = {SAC 2015}, doi = {10.1145/2695664.2695847}, title = {{Automated Generation of Visual Web Tests from DOM-based Web Tests}}, year = {2015}, month = apr, }
2014
- SCAM
Demo TrackPESTO: A Tool for Migrating DOM-based to Visual Web TestsAndrea Stocco, Maurizio Leotta, Filippo Ricca, and Paolo TonellaIn Proceedings of the 14th International Working Conference on Source Code Analysis and Manipulation, Sep 2014Test automation tools are widely adopted for testing complex Web applications. Three generations of tools exist: first, based on screen coordinates; second, based on DOM-based commands; and third, based on visual image recognition. In our previous work, we proposed Pesto, a tool able to migrate second-generation Selenium WebDriver test suites towards third-generation Sikuli ones. In this work, we extend Pesto to manage Web elements having (1) complex visual interactions and (2) multiple visual appearances. Pesto relies on aspect-oriented programming, computer vision, and code transformations. Our new improved tool has been evaluated on two Web test suites developed by an independent tester. Experimental results show that Pesto manages and transforms correctly test suites with Web elements having complex visual interactions and multistate elements. By using Pesto, the migration of existing DOM-based test suites to the visual approach requires a low manual effort, since our approach proved to be very accurate.
@inproceedings{2014-Stocco-SCAM-demo, author = {Stocco, Andrea and Leotta, Maurizio and Ricca, Filippo and Tonella, Paolo}, booktitle = {Proceedings of the 14th International Working Conference on Source Code Analysis and Manipulation}, pages = {65--70}, publisher = {IEEE}, series = {SCAM 2014 - Demonstration Track}, doi = {10.1109/SCAM.2014.36}, title = {PESTO: A Tool for Migrating {DOM}-based to Visual Web Tests}, year = {2014}, month = sep, }
- ISSREW: Reducing Web Test Cases Aging by means of Robust XPath Locators. Maurizio Leotta, Andrea Stocco, Filippo Ricca, and Paolo Tonella. In Proceedings of the 25th International Symposium on Software Reliability Engineering Workshops, Nov 2014
In the context of web regression testing, the main aging factor for a test suite is the continuous evolution of the underlying web application, which breaks the test cases. This rapid decay forces quality experts to evolve the testware. One of the major costs of test case evolution is the manual effort necessary to repair broken web page element locators. Locators are lines of source code identifying the web elements the test cases interact with. Web test cases rely heavily on locators, for instance to identify and fill the input portions of a web page (e.g., the form fields), to execute some computations (e.g., by locating and clicking on buttons), and to verify the correctness of the output (by locating the web page elements showing the results). In this paper we present ROBULA (ROBUst Locator Algorithm), a novel algorithm that partially prevents and thus reduces the aging of web test cases by automatically generating robust XPath-based locators that are likely to keep working when new releases of the web application are created. Preliminary results show that XPath locators produced by ROBULA are substantially more robust than absolute and relative locators generated by state-of-the-practice tools such as FirePath. Test suite fragility is reduced on average by 56% for absolute locators and by 41% for relative locators.
@inproceedings{2014-Leotta-WoSAR, author = {Leotta, Maurizio and Stocco, Andrea and Ricca, Filippo and Tonella, Paolo}, booktitle = {Proceedings of the 25th International Symposium on Software Reliability Engineering Workshops}, pages = {449--454}, publisher = {IEEE}, series = {ISSREW 2014}, doi = {10.1109/ISSREW.2014.17}, title = {Reducing Web Test Cases Aging by means of Robust {XP}ath Locators}, year = {2014}, month = nov, }
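To illustrate the fragility problem that ROBULA addresses, the following Selenium WebDriver sketch in Java contrasts an absolute XPath locator with a relative, attribute-anchored one. The page and both XPath expressions are hypothetical examples, not output of the algorithm.

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

public class LocatorRobustnessExample {

    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();
        driver.get("https://example.org/login"); // hypothetical page

        // Absolute locator: encodes the full path from the document root,
        // so any structural change along that path (a new <div>, a moved
        // form) breaks the test case.
        WebElement fragile = driver.findElement(
                By.xpath("/html/body/div[2]/form/table/tbody/tr[1]/td[2]/input"));

        // Relative, ROBULA-style locator: anchored to a stable attribute,
        // it keeps working across purely structural page changes.
        WebElement robust = driver.findElement(
                By.xpath("//input[@name='username']"));

        driver.quit();
    }
}

The absolute locator breaks as soon as any ancestor along the encoded path changes, whereas the relative locator survives layout-only changes; ROBULA derives such locators automatically by starting from the generic XPath "//*" and specializing it until it matches only the target element.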
2013
- WSE: Web Testware Evolution. Filippo Ricca, Maurizio Leotta, Andrea Stocco, Diego Clerissi, and Paolo Tonella. In Proceedings of the 15th International Symposium on Web Systems Evolution, Sep 2013
Web applications evolve at a very fast rate to accommodate new functionalities, presentation styles, and interaction modes. The test artefacts developed during web testing must be evolved accordingly. Among other causes, one critical reason why test cases need maintenance during web evolution is that the locators used to uniquely identify the page elements under test may fail or behave incorrectly. The robustness of the web page locators used in test cases is thus critical to reduce the test maintenance effort. We present an algorithm that generates robust web page locators for the elements under test, and we describe the design of an empirical study that we plan to execute to validate such robust locators.
@inproceedings{2013-Ricca-WSE, author = {Ricca, Filippo and Leotta, Maurizio and Stocco, Andrea and Clerissi, Diego and Tonella, Paolo}, booktitle = {Proceedings of the 15th International Symposium on Web Systems Evolution}, pages = {39--44}, publisher = {IEEE}, series = {WSE 2013}, doi = {10.1109/WSE.2013.6642415}, title = {Web Testware Evolution}, year = {2013}, issn = {2160-6153}, month = sep }