Abstract
Generative AI models have advanced rapidly, with notable gains in data comprehension and reasoning. Vision-Language Models (VLMs) open new opportunities for dermatology screening through zero-shot classification. This study provides a comparative benchmark of nine modern VLMs for skin disease classification and highlights key limitations that must be addressed before reliable clinical deployment. Three prompt-engineering strategies and two dataset variants (original and brightness-augmented images) were used to assess the models' robustness to context and visual noise. Results show that Gemini 2.5 Pro consistently outperforms all other models, while smaller and GPT-based models exhibit reduced accuracy and heightened sensitivity to lighting variations. Prompt refinement improves prediction stability across multiple models, confirming the importance of well-structured instructions in zero-shot dermatology tasks.