Voice communication channels are a significant vector for social engineering attacks and the spread of disinformation. Existing cloud-based countermeasures suffer from high latency, dependence on network connectivity, and privacy risks, which makes them unsuitable for real-time applications. This paper proposes a resource-efficient, modular keyword spotting model designed for autonomous operation on resource-constrained edge devices. The model transforms sequences of Mel-frequency cepstral coefficients (MFCCs) into compact string "fingerprints" using differentiated weighting of informative features, and then classifies these fingerprints by Levenshtein distance. Experimental validation on a Ukrainian-language command corpus yielded an F1-score of 0.92 in ideal conditions and 0.78 at a signal-to-noise ratio of 5 dB. The proposed model outperforms baseline and classical counterparts in the balance of accuracy, speed, and resource efficiency, which supports its suitability for building autonomous systems for proactive detection of auditory threats.
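To make the fingerprint-and-match idea concrete, the following is a minimal sketch and not the model evaluated in this paper. It assumes MFCC frames have already been extracted and normalized to roughly [-1, 1]. The 8-symbol alphabet, the weighted-average quantization rule, the run-collapsing step, the rejection threshold, and the names `fingerprint`, `levenshtein`, and `classify` are all illustrative assumptions, since the abstract does not specify them.

```python
from typing import Dict, Optional, Sequence

# Hypothetical 8-symbol alphabet; the actual alphabet size is not stated here.
ALPHABET = "abcdefgh"


def fingerprint(frames: Sequence[Sequence[float]],
                weights: Sequence[float],
                lo: float = -1.0,
                hi: float = 1.0) -> str:
    """Map each MFCC frame to one symbol and collapse repeated symbols.

    Assumption: "differentiated weighting" is approximated by a weighted
    average of the coefficients, quantized uniformly over [lo, hi].
    """
    total = sum(weights)
    symbols = []
    for frame in frames:
        score = sum(w * c for w, c in zip(weights, frame)) / total
        # Clamp to [lo, hi], then scale to an index into ALPHABET.
        ratio = (min(max(score, lo), hi) - lo) / (hi - lo)
        index = min(int(ratio * len(ALPHABET)), len(ALPHABET) - 1)
        symbol = ALPHABET[index]
        # Collapse runs so the string stays compact.
        if not symbols or symbols[-1] != symbol:
            symbols.append(symbol)
    return "".join(symbols)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance using two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def classify(fp: str,
             templates: Dict[str, str],
             max_distance: int) -> Optional[str]:
    """Return the label of the nearest template, or None if none is close."""
    best = min(templates, key=lambda label: levenshtein(fp, templates[label]))
    if levenshtein(fp, templates[best]) <= max_distance:
        return best
    return None
```

In this sketch, matching costs O(mn) integer operations per template for fingerprints of lengths m and n and needs no floating-point model weights at inference time, which illustrates why a string-based approach can suit resource-constrained edge devices. The rejection threshold `max_distance` is a hypothetical parameter for leaving non-keyword audio unlabeled.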