Evaluation of Large Language Models for Understanding Counterfactual Reasoning in Texts

S. I. M. Adnan, Abrar Hameem, Shikha Anirban, Md. Saiful Islam and Md. Musfique Anwar
Conference Proceedings. BIGDATA 2025
2025 IEEE International Conference on Big Data (Big Data) (Macau, China, 08/12/2025–11/12/2025)
12/2025

Abstract

Large language models (LLMs) are reshaping natural language processing, including the generation of counterfactuals, hypothetical statements that examine alternative possibilities. Despite their potential, the effectiveness of LLMs at producing high-quality, logically consistent, and minimally edited counterfactuals remains underexplored, and no comprehensive evaluation framework exists. This work addresses that gap by systematically evaluating Llama 3.2, Gemma 3, and Mistral on counterfactual generation for sentiment analysis (IMDB and Financial) and natural language inference (SNLI) datasets. We employ zero-shot, few-shot, Chain-of-Thought (CoT), and Tree-of-Thought (ToT) prompting to assess the ability of LLMs to generate counterfactual text and to evaluate their logical consistency. Flip rate, textual similarity, and perplexity are used to measure label flipping, textual coherence, and fluency, respectively. Mistral excels in sentiment analysis with fluent outputs. Llama 3.2 performs strongly on natural language inference, making logical changes with minimal edits and high textual similarity. Gemma 3 performs well on both tasks but struggles with fluency. This research provides a benchmark for evaluating LLMs on counterfactual text generation and offers practical guidance for selecting models and prompts for specific tasks.
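The flip-rate metric named in the abstract, the fraction of generated counterfactuals whose predicted label differs from the original text's, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `predict` callable stands in for whatever sentiment or NLI classifier is used, and the toy keyword classifier exists only to make the example runnable.

```python
def flip_rate(originals, counterfactuals, predict):
    """Fraction of (original, counterfactual) pairs whose predicted label changed.

    `predict` is any text -> label callable (e.g. a sentiment or NLI classifier);
    the specific classifier used in the paper is not given here.
    """
    assert len(originals) == len(counterfactuals)
    if not originals:
        return 0.0
    flips = sum(
        predict(orig) != predict(cf)
        for orig, cf in zip(originals, counterfactuals)
    )
    return flips / len(originals)

# Toy demonstration with a trivial keyword "classifier" (illustrative only).
def toy_predict(text):
    return "positive" if "good" in text else "negative"

originals = ["the movie was good", "the plot was good"]
counterfactuals = ["the movie was bad", "the plot was good"]
print(flip_rate(originals, counterfactuals, toy_predict))  # → 0.5
```

Only the first counterfactual flips the toy classifier's label, so the flip rate is 1/2; the paper's other metrics (textual similarity, perplexity) would be computed over the same pairs with an embedding model and a language model, respectively.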
