Keyboard acoustic emanations revisited
Li Zhuang, Feng Zhou, J. D. Tygar
2009
ACM Transactions on Information and System Security
We examine the problem of keyboard acoustic emanations. We present a novel attack taking as input a 10-minute sound recording of a user typing English text using a keyboard and recovering up to 96% of typed characters. There is no need for training recordings labeled with the corresponding clear text. A recognizer bootstrapped from a 10-minute sound recording can even recognize random text such as passwords: In our experiments, 90% of 5-character random passwords using only letters can be
generated in fewer than 20 attempts by an adversary; 80% of 10-character passwords can be generated in fewer than 75 attempts by an adversary. In the attack, we use the statistical constraints of the underlying content (English text) to reconstruct text from sound recordings without knowing the corresponding clear text. The attack incorporates a combination of standard machine learning and speech recognition techniques, including cepstrum features, Hidden Markov Models, linear classification, and feedback-based incremental learning.

Emanations from electronic devices have long served as a source of data for attacks. For example, Kuhn [2002, 2003] was able to recover the display on CRT and LCD monitors using indirectly reflected optical emanations. Acoustic emanations are another source of data for attacks. Researchers have shown that acoustic emanations of matrix printers carry substantial information about the printed text [Briol 1991]. Some researchers suggest it may be possible to discover CPU operations from acoustic emanations [Shamir and Tromer 2004]. In ground-breaking research, Asonov and Agrawal [2004] showed that it is possible to recover text from the acoustic emanations of typing on a keyboard.

Most emanations, including acoustic keyboard emanations, are not uniform across different instances, even when the same device model is used, and they are affected by the environment. Different users on a single keyboard, or different keyboards (even of the same model), emit different sounds, making reliable recognition hard [Asonov and Agrawal 2004]. Asonov and Agrawal achieved a relatively high recognition rate (approximately 80%) when they trained neural networks with text-labeled sound samples of the same user typing on the same keyboard. Their attack is analogous to a known-plaintext attack on a cipher: the cryptanalyst has a sample of plaintext (the keys typed) and the corresponding ciphertext (the recording of acoustic emanations). This labeled-training-sample requirement suggests a limited attack, because the attacker needs to obtain training samples of significant length. Presumably, these could be obtained from video surveillance or network sniffing. However, video surveillance should in most cases render the acoustic attack irrelevant: even if passwords are masked on the screen, a video shot of the keyboard could directly reveal the keys being typed.

In this article, we argue that a labeled training sample is unnecessary for an attacker, which implies that keyboard emanation attacks are more serious than previous work suggests. The key insight in our work is that typed text is often not random. When one types English text, the finite number of commonly used English words limits the possible temporal combinations of keys, and English grammar limits word combinations. One can first cluster (using unsupervised methods) keystrokes into a number of acoustic classes based on their sound. Given sufficient (unlabeled) training samples, a most-likely mapping between these acoustic classes and the actual typed characters can be established using the language constraints. This task is not trivial. Challenges include: (i) How can one mathematically model language constraints and mechanically apply them? (ii) In the first, sound-based clustering step, how can one address the problem of different keys clustered into the same acoustic class and a single key clustered into multiple acoustic classes? (iii) Can the accuracy of the algorithm's guesses match the level achieved with labeled samples?
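To make the unsupervised first step concrete, the following is a minimal sketch in Python of clustering keystrokes into acoustic classes by their cepstrum features. It is illustrative only: the keystroke onsets are assumed to be already detected (e.g., by energy thresholding), the helper names `keystroke_features` and `acoustic_classes` are hypothetical, and the paper's exact feature extraction and clustering choices may differ.

```python
# Sketch: cluster keystroke sounds into acoustic classes (unsupervised).
# Assumes pre-segmented keystrokes; names and parameters are illustrative.
import numpy as np
import librosa
from sklearn.cluster import KMeans

def keystroke_features(audio, sr, onsets, win=0.04):
    """Return one cepstrum feature vector per detected keystroke.

    audio  : 1-D waveform of the typing recording
    sr     : sample rate in Hz
    onsets : keystroke start times in seconds (assumed already detected)
    win    : analysis window length in seconds around each key press
    """
    feats = []
    n = int(win * sr)
    for t in onsets:
        start = int(t * sr)
        segment = audio[start:start + n]
        if len(segment) < n:              # skip a truncated final keystroke
            continue
        mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=13)
        feats.append(mfcc.mean(axis=1))   # average frames -> fixed-size vector
    return np.array(feats)

def acoustic_classes(features, k=50, seed=0):
    """Assign each keystroke to one of k acoustic classes via k-means.

    k need not equal the number of keys: as noted above, one key can fall
    into several classes, and several keys can share a class.
    """
    return KMeans(n_clusters=k, random_state=seed).fit_predict(features)
```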
Our work answers these challenges using a combination of machine learning and speech recognition techniques. We show how to build a keystroke recognizer with a better recognition rate than the labeled-sample recognizers of Asonov and Agrawal [2004], using only a sound recording of the user typing. Our method can be viewed as a machine learning version of classic attacks on simple substitution ciphers. Assuming the ideal case in which a key produces exactly the same sound each time it is pressed, each keystroke could be assigned a distinct acoustic class, and recovering the typed text would reduce to solving a simple substitution cipher.
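The substitution-cipher analogy can be illustrated with a small, self-contained Viterbi decoder: hidden states are characters, observations are the acoustic class labels from the clustering step, and a character-bigram language model supplies the transition probabilities. This is a sketch, not the paper's code; in the paper the emission probabilities are themselves learned without labels (via EM over an HMM), which this sketch takes as given.

```python
# Sketch: decode a sequence of acoustic class labels into the most likely
# character sequence under a character-bigram language model (Viterbi).
import numpy as np

def viterbi(obs, log_init, log_trans, log_emit):
    """Most likely hidden character sequence for observed class labels.

    obs       : sequence of acoustic class indices, length T
    log_init  : (C,)   log P(first char)
    log_trans : (C, C) log P(next char | char), a character bigram model
    log_emit  : (C, K) log P(acoustic class | char), assumed given here
    """
    T, C = len(obs), log_init.shape[0]
    score = np.full((T, C), -np.inf)      # best log-score ending in each char
    back = np.zeros((T, C), dtype=int)    # backpointers for path recovery
    score[0] = log_init + log_emit[:, obs[0]]
    for t in range(1, T):
        cand = score[t - 1][:, None] + log_trans   # cand[prev, next]
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[:, obs[t]]
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):          # walk backpointers to the start
        path.append(int(back[t][path[-1]]))
    return path[::-1]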
doi:10.1145/1609956.1609959