Multimodal conversational agents are highly desirable because they offer natural and human-like interaction. However, there is a lack of comprehensive end-to-end solutions to support collaborative development and benchmarking. While proprietary systems like GPT-4o and Gemini demonstrate impressive integration of audio, video, and text with response times of 200-250 ms, challenges remain in balancing latency, accuracy, cost, and data privacy. To better understand and quantify these issues, we developed OpenOmni, an open-source, end-to-end pipeline benchmarking tool that integrates advanced technologies such as Speech-to-Text, Emotion Detection, Retrieval Augmented Generation, and Large Language Models, along with the ability to integrate customized models. OpenOmni supports local and cloud deployment, ensuring data privacy and supporting latency and accuracy benchmarking. This flexible framework allows researchers to customize the pipeline, focusing on real bottlenecks and facilitating rapid proof-of-concept development. OpenOmni can significantly enhance applications like indoor assistance for visually impaired individuals, advancing human-computer interaction. Our demonstration video is available at https://www.youtube.com/watch?v=zaSiT3clWqY, the demo is available via https://openomni.ai4wa.com, and the code is available via https://github.com/AI4WA/OpenOmniFramework.
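To make the staged-pipeline idea concrete, below is a minimal Python sketch of how such a modular pipeline with per-stage latency measurement might be wired together. All class, function, and component names here are illustrative assumptions, not the actual OpenOmni API; the real implementation is in the GitHub repository linked above.

import time
from typing import Callable

class Stage:
    """Wraps one pipeline component (e.g., STT, emotion detector, RAG, LLM).
    Hypothetical structure for illustration only."""
    def __init__(self, name: str, fn: Callable):
        self.name = name
        self.fn = fn

def run_pipeline(stages: list[Stage], payload):
    """Run each stage in order, recording wall-clock latency per stage."""
    latencies = {}
    for stage in stages:
        start = time.perf_counter()
        payload = stage.fn(payload)
        latencies[stage.name] = time.perf_counter() - start
    return payload, latencies

# Placeholder components; a real deployment would swap in local or
# cloud-backed models for each stage.
stages = [
    Stage("speech_to_text", lambda audio: "transcribed text"),
    Stage("emotion_detection", lambda text: (text, "neutral")),
    Stage("rag", lambda inp: inp + ("retrieved context",)),
    Stage("llm", lambda inp: "generated response"),
]

response, latencies = run_pipeline(stages, b"raw audio bytes")
print(response, latencies)  # per-stage timings expose the real bottlenecks

Recording latency per stage, rather than end to end only, is what lets a researcher identify which component dominates the 200-250 ms response budget discussed above.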
Details
Title
OpenOmni: A Collaborative Open Source Tool for Building Future-Ready Multimodal Conversational Agents
Authors/Creators
Qiang Sun - The University of Western Australia
Yuanyi Luo - Harbin Institute of Technology
Sirui Li - Murdoch University, School of Information Technology
Wenxiao Zhang - The University of Western Australia
Wei Liu - The University of Western Australia
Publication Details
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 46–52
Conference
Conference on Empirical Methods in Natural Language Processing (EMNLP 2024), Miami, FL, 12–16 November 2024