Abstract
Large language models (LLMs) are transforming natural language processing, including the generation of counterfactuals: hypothetical variants of a text that explore alternative possibilities. Despite this potential, the ability of LLMs to produce high-quality, logically consistent, and minimally edited counterfactuals remains underexplored, and no comprehensive evaluation framework exists. This work addresses that gap by systematically evaluating Llama 3.2, Gemma 3, and Mistral on counterfactual generation for sentiment analysis (IMDB and Financial) and natural language inference (SNLI) datasets. We employ Zero-shot, Few-shot, Chain-of-Thought (CoT), and Tree-of-Thought (ToT) prompting to assess how well LLMs generate counterfactual text and to evaluate its logical consistency. Flip rate, textual similarity, and perplexity quantify label flipping, textual coherence, and fluency, respectively. Mistral excels at sentiment analysis, producing fluent outputs. Llama 3.2 performs strongly on natural language inference, making logical changes with minimal edits and high textual similarity. Gemma 3 performs well on both tasks but struggles with fluency. This research provides a benchmark for evaluating LLMs on counterfactual text generation and offers practical guidance for selecting models and prompts for specific tasks.