Abstract
Large language models (LLMs) are transforming natural language processing, including the generation of counterfactuals: hypothetical variants of a text that explore alternative possibilities. Despite this potential, the ability of LLMs to produce high-quality, logically consistent, and minimally edited counterfactuals remains underexplored, and no comprehensive evaluation framework exists. This work addresses that gap by systematically evaluating Llama 3.2, Gemma 3, and Mistral on counterfactual generation for sentiment analysis (IMDB and Financial) and natural language inference (SNLI) datasets. We employ Zero-shot, Few-shot, Chain-of-Thought (CoT), and Tree-of-Thought (ToT) prompting to assess how well LLMs generate counterfactual text and to evaluate its logical consistency. Flip rate, textual similarity, and perplexity quantify label flipping, textual coherence, and fluency, respectively. Mistral excels at sentiment analysis, producing fluent outputs. Llama 3.2 performs strongly on natural language inference, making logical changes with minimal edits and high textual similarity. Gemma 3 performs well on both tasks but struggles with fluency. This research provides a benchmark for evaluating LLMs on counterfactual text generation and offers practical guidance for selecting models and prompts for specific tasks.