As a part of my thesis in Keystroke Dynamics, I collected quite a bit of data on users typing a handful of passwords. I’ve decided to make this data available publicly for two main reasons:

  1. It could lower the entry barrier to the field. Individuals interested in developing an algorithm but without the means to collect a significant amount of data will be able to use this data to experiment. Also, most researchers don’t have access to pressure-sensitive keyboards when developing their algorithms. This data will allow researchers to test their algorithms and consider this extra dimension of typing data.
  2. I hope that this dataset could become a standard for the field of keystroke dynamics. One thing I found substantially lacking during my literature review was any common basis of comparison between different approaches. My hope is that researchers can, after developing an algorithm, measure their results on this dataset as a way to objectively compare performance.
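
As a rough illustration of what such a comparison could look like, the sketch below computes an approximate equal error rate (EER), a metric commonly reported for keystroke dynamics algorithms. It is only a sketch under stated assumptions: the function names are hypothetical, the scores are made up, and it assumes your algorithm produces a distance-like score per entry where lower means "more like the genuine user".

```python
def error_rates(genuine, impostor, threshold):
    """Return (FAR, FRR) when entries scoring <= threshold are accepted.

    Assumes distance-like scores: lower means closer to the genuine user.
    """
    far = sum(s <= threshold for s in impostor) / len(impostor)  # impostors accepted
    frr = sum(s > threshold for s in genuine) / len(genuine)     # genuine rejected
    return far, frr


def equal_error_rate(genuine, impostor):
    """Approximate the EER by trying every observed score as a threshold.

    Returns (eer, threshold) at the point where FAR and FRR are closest.
    """
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = error_rates(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Made-up scores for illustration only; these are not taken from the dataset.
genuine_scores = [0.1, 0.2, 0.3, 0.6]
impostor_scores = [0.4, 0.7, 0.8, 0.9]
eer, threshold = equal_error_rate(genuine_scores, impostor_scores)
print(eer, threshold)
```

Scanning only the observed scores keeps the example short; a finer threshold sweep or interpolation between points would give a smoother estimate on real data.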

The data is currently available in two formats, with a third in the works.

A MySQL database dump is available here: Public Keystroke Dynamics SQL. The structure of the database is depicted in the figure to the left.

The same data is also available as a series of Comma-Separated-Value (CSV) files and can be downloaded here: Public Keystroke Dynamics CSV. The same structure and relationships apply, though they’re obviously not enforced in CSV files.

The data was collected over a period of a few months in 2009-2010 from over 104 different users. “Extensive” data was collected from 7 of these users, each of whom submitted between 89 and 504 entries in total.

The rest of the users just entered each password between 3 and 15 times to provide a substantial amount of “impostor” data on each password.
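
To make the genuine/impostor distinction concrete, here is a minimal Python sketch of how entries for one password might be split between a target user and everyone else. The column names (`entry_id`, `user_id`, `password`) and the sample rows are hypothetical and used only for illustration; the actual CSV files may be organized differently, so check them against the database structure before adapting this.

```python
import csv
import io

# Hypothetical sample rows; the real CSV files may use different columns.
SAMPLE_CSV = """entry_id,user_id,password
1,7,drizzle
2,7,drizzle
3,12,drizzle
4,31,drizzle
5,7,pr7q1z
"""


def split_entries(rows, target_user, password):
    """Split the entries for one password into genuine and impostor lists.

    An entry is "genuine" if it was typed by target_user; any other
    user's entry for the same password counts as "impostor" data.
    """
    genuine, impostor = [], []
    for row in rows:
        if row["password"] != password:
            continue  # skip entries for the other passwords
        if row["user_id"] == target_user:
            genuine.append(row)
        else:
            impostor.append(row)
    return genuine, impostor


# csv.DictReader yields every value as a string, so the user ID is "7", not 7.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
genuine, impostor = split_entries(rows, target_user="7", password="drizzle")
print(len(genuine), len(impostor))  # prints "2 2"
```

With the real files, the same function could be applied to rows read from disk with `csv.DictReader(open(path, newline=""))`, provided the column names are adjusted to match.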

Three different passwords were tested to try to cover the range of different approaches in keystroke dynamics (KD):

  1. pr7q1z – a password of gibberish meant to test the performance of Pressure-Sensitive KD on modern “strong” passwords.
  2. jeffrey allen – the use of personal information (such as a name) has been recommended in KD before. By using my own name and recording my typing habits on it, the performance of using personal information can be measured.
  3. drizzle – a word meant to test the use of normal dictionary words which likely have no personal significance to any user.

In total, 2,739 entries were collected, with over 900 on each password.

Please feel free to use this dataset for any non-commercial purposes. I do ask that you let me know if you use it, and that you cite it as follows (for now):
Allen, Jeffrey D., An Analysis of Pressure-Based Keystroke Dynamics Algorithms, Computer Science and Engineering, Southern Methodist University, 2010.

If you’re interested in contributing more data (pressure-sensitive or not), please contact me as well; I’d love to incorporate more datasets into this one.

Another dataset is available at http://www.cs.cmu.edu/~keystroke/