Abstract
This paper presents a comparative analysis of lightweight voice biometric authentication methods designed for real-time deployment in telecommunication environments. The study evaluates two approaches: a traditional pipeline that combines Mel-Frequency Cepstral Coefficients (MFCC) with a Gaussian Mixture Model-Universal Background Model (GMM-UBM), and a Double-Branch Siamese Neural Network (DB-SNN) trained on log-Mel spectrograms. Both models were assessed on the VoxCeleb1 dataset, resampled to 8 kHz to reflect typical telecom audio conditions, and tested on utterance durations ranging from 4 to 7 seconds. Experimental results show that the GMM-UBM model was highly efficient, with an average inference time of 10 ms and a compact model size of 8 KB, and that its performance remained stable on short utterances. Conversely, the DB-SNN achieved higher accuracy (78.53%) and a lower Equal Error Rate (EER) of 21.47% on longer inputs; however, it required substantially more computational resources, including an 8 MB model size and inference times of up to 26 ms. The findings reveal a clear trade-off between speed and accuracy in resource-constrained environments. While GMM-UBM remains preferable for latency-critical telecom systems, the Siamese approach offers superior verification strength when resources permit. The paper concludes by recommending future work on optimizing deep learning models through refined loss functions, adaptive architectures, and enhanced noise robustness for real-world telecom applications.
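For readers unfamiliar with the Equal Error Rate reported above, the following is a minimal sketch of how an EER can be estimated from verification scores. It is a generic illustration, not the evaluation code used in this study: the function name `equal_error_rate` and the sample scores are hypothetical, and it assumes that higher scores indicate a same-speaker match.

```python
def equal_error_rate(genuine, impostor):
    """Estimate the EER from two lists of similarity scores.

    genuine  -- scores of same-speaker trials (hypothetical input)
    impostor -- scores of different-speaker trials (hypothetical input)
    Assumes a trial is accepted when its score is >= the threshold.
    Returns (eer, threshold) at the point where the false acceptance
    and false rejection rates are closest.
    """
    thresholds = sorted(set(genuine) | set(impostor))
    best = None
    for t in thresholds:
        # False acceptance rate: impostor trials that pass the threshold.
        far = sum(s >= t for s in impostor) / len(impostor)
        # False rejection rate: genuine trials that fail the threshold.
        frr = sum(s < t for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


if __name__ == "__main__":
    # Toy scores chosen only for illustration.
    eer, threshold = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
    print(f"EER = {eer:.2%} at threshold {threshold}")
```

With a finite set of trials the two error rates rarely cross exactly, so the sketch averages them at the threshold where they are closest; other interpolation schemes are equally common.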