Pressure-Sensitive Keystroke Dynamics Dataset
- April 16th, 2010
- Posted in Professional
As a part of my thesis in Keystroke Dynamics, I collected quite a bit of data on users typing a handful of passwords. I’ve decided to make this data available publicly for two main reasons:
- It could lower the entry barrier to the field. Individuals interested in developing an algorithm but without the means to collect a significant amount of data will be able to use this data to experiment. Also, most researchers don’t have access to pressure-sensitive keyboards when developing their algorithms. This data will allow researchers to test their algorithms and consider this extra dimension of typing data.
- I hope that this dataset could become a standard for the field of keystroke dynamics. One thing which I felt was substantially lacking in my literature review was any basis of comparison when analyzing different approaches. My hope is that researchers can, after developing an algorithm, measure their results on this dataset as a way to objectively compare performance.
The data is currently available in two formats, with a third in the works.

A MySQL database dump is available here: Public Keystroke Dynamics SQL (18615). The structure of the database is depicted in the figure to the left.
The same data is also available as a series of Comma-Separated-Value (CSV) files and can be downloaded here: Public Keystroke Dynamics CSV (3520). The same structure and relationships apply, though they’re obviously not enforced in CSV files.
The data was collected over a period of a few months in 2009-2010 from over 104 different users. “Extensive” data was collected on 7 of these users, who each entered between 89 and 504 entries in total.
The rest of the users just entered each password between 3 and 15 times to provide a substantial amount of “impostor” data on each password.
Three different passwords were tested to try to cover the range of different approaches in KD:
- pr7q1z – a password of gibberish meant to test the performance of Pressure-Sensitive KD on modern “strong” passwords.
- jeffrey allen – the use of personal information (such as a name) has been recommended in KD before. By using my own name and recording my typing habits on it, the performance of using personal information can be measured.
- drizzle – a word meant to test the use of normal dictionary words which likely have no personal significance to any user.
In total, 2,739 entries were collected — over 900 on each password.
Please feel free to use this dataset for any non-commercial purposes. I do ask that you let me know if you use this dataset and please use the following citation (for now):
Allen, Jeffrey D., An Analysis of Pressure-Based Keystroke Dynamics Algorithms, Computer Science and Engineering, Southern Methodist University, 2010.
If you’re interested in contributing more data (pressure-sensitive or not), please contact me as well; I’d love to incorporate more datasets into this one.
Another dataset is available at http://www.cs.cmu.edu/~keystroke/
A reference in BibTeX format is always welcome. Is this okay?
@unpublished{Allen.1,
author = {Allen, Jeffrey D.},
title = {An Analysis of Pressure-Based Keystroke Dynamics Algorithms},
affiliation = {Computer Science and Engineering, Southern Methodist},
year = {2010},
}
More info about the method used in the collection would be appreciated too.
Thanks for your hard work.
Luciano,
Thanks for your kind words, and I agree that a BibTeX citation may be helpful. Maybe something along the following lines would be even more specific:
@mastersthesis{JDAllen.1,
author = "Allen, Jeffrey D.",
title = "An Analysis of Pressure-Based Keystroke Dynamics Algorithms",
school = "Southern Methodist University",
address = "Dallas, TX",
month = "May",
year = "2010"
}
Hi
Thanks for contributing the keystrokes. I have tested all three passwords with a password strength checker (https://www.microsoft.com/protect/fraud/passwords/checker.aspx). It rates pr7q1z and drizzle as weak passwords and jeffrey allen as strong. I need keystroke data for a medium-rated password. Kindly help me get this.
Hi Jeffrey
Thanks for putting this up, really useful.
I am looking at recording keystroke dynamics in a web page, and was wondering whether you know of any JavaScript code (or similar) that I could use or adapt? I would appreciate any help or thoughts you might have on the matter.
Daniel
Unfortunately, I don’t have anything in JavaScript that I could give you; this whole project was done in Java.
It should be possible in JavaScript using keydown and keyup listeners on a text box or text area. I remember seeing a KD company that was using Flash to accomplish a similar effect, but I’d be interested to see a JavaScript KD biometric.
The only tip I have is to watch out for overlapping keystrokes, as they’re more common in some typists than you might expect.
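To illustrate the overlap issue, here is a minimal sketch in Python (not the Java code used for this project). It assumes a hypothetical event format of time-ordered (kind, key, time_ms) tuples and pairs each key-down with the next key-up of the same key, rather than assuming that one key is released before the next is pressed. The names Keystroke, pair_events and release_to_press are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Keystroke:
    key: str
    down_ms: float
    up_ms: float

    @property
    def hold_ms(self) -> float:
        # How long the key was held down.
        return self.up_ms - self.down_ms


def pair_events(events):
    """Pair each key-down with the next key-up of the same key.

    `events` is a time-ordered list of (kind, key, time_ms) tuples,
    where kind is "down" or "up" (an assumed, hypothetical format).
    """
    pending = {}  # key -> time of its unmatched key-down
    strokes = []
    for kind, key, t in events:
        if kind == "down":
            # Keep the first down; later downs while held are auto-repeat.
            pending.setdefault(key, t)
        elif kind == "up" and key in pending:
            strokes.append(Keystroke(key, pending.pop(key), t))
    # Report keystrokes in the order the keys were pressed.
    strokes.sort(key=lambda s: s.down_ms)
    return strokes


def release_to_press(strokes):
    """Gap between releasing one key and pressing the next.

    A negative value means the two keystrokes overlapped.
    """
    return [b.down_ms - a.up_ms for a, b in zip(strokes, strokes[1:])]


# "r" is pressed 30 ms before "d" is released: an overlapping pair.
events = [("down", "d", 0), ("down", "r", 80), ("up", "d", 110), ("up", "r", 170)]
strokes = pair_events(events)
print([s.hold_ms for s in strokes])  # [110, 90]
print(release_to_press(strokes))     # [-30]
```

Matching by key instead of by position is what keeps the hold times correct here; a collector that assumed strict down/up alternation would mis-pair these four events.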
Good luck! Sorry I couldn’t be of more help.
Hi Jeffrey,
This is already a big help. Now I know that I wouldn’t be doing work that is already done in some library that all KD researchers know about. Thanks!
daniel
Hello Jeffrey,
Thank you for publishing the database. Do you have any summary of performance against which I can compare my classification algorithm (a benchmark)? What exact operating point did others use, e.g. the number of keystrokes used for training? I actually get really good results when trying to defend your keystrokes but terrible results when I try to defend the other, less-trained individuals. Moreover, I disregarded the pressure data, since in most cases it is unavailable. Do you have performance reports without the use of pressure data? How did you collect the pressure data?
Thanks a lot,
Hanan
Hanan,
Lots of questions there. If you’re interested in the intricate details of the data and my analysis of it, you may want to look at the thesis itself (here). I don’t currently have any records of other classification algorithms, but you can see the AUC and pAUC of mine in the thesis. As I discuss there, I’d recommend comparing the AUC or pAUC rather than the TPR or FPR at some threshold.
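For readers who want a threshold-free number to compare, here is a minimal Python sketch of the AUC in its rank-based form: the probability that a randomly chosen genuine attempt scores higher than a randomly chosen impostor attempt, with ties counting half. It assumes that a higher score means "more likely genuine"; the function name auc and the example scores are made up, and this is not necessarily how the thesis computes it. The pAUC, which restricts the area to part of the ROC curve, is not shown.

```python
def auc(genuine_scores, impostor_scores):
    """Area under the ROC curve via pairwise comparison.

    Assumes higher scores mean "more likely the genuine user".
    Runs in O(n * m), which is fine for a few thousand entries.
    """
    if not genuine_scores or not impostor_scores:
        raise ValueError("need at least one score from each class")
    wins = 0.0
    for g in genuine_scores:
        for i in impostor_scores:
            if g > i:
                wins += 1.0   # genuine attempt ranked above impostor
            elif g == i:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(genuine_scores) * len(impostor_scores))


# Made-up scores: 5 of the 6 genuine/impostor pairs are ranked correctly.
print(auc([0.9, 0.8, 0.4], [0.5, 0.3]))
```

An AUC of 1.0 means every genuine attempt outranks every impostor attempt, and 0.5 is what random scoring would give, so the value does not depend on where a decision threshold is placed.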
The performance you observed is expected. With only 3 entries of a password from a typist, I doubt there is enough data to classify him/her meaningfully. The dataset was really designed to have a handful of users on which you could train and a myriad of users against which you could test. Of course, you’re free to use it however you’d like, but I think it will work best in this framework.
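As a rough sketch of that train-on-few, test-against-many framework, the Python below fits a timing template on one extensively sampled user and scores other entries against it. The detector is a generic scaled Manhattan distance chosen purely for illustration, not one of the algorithms from the thesis, and the feature vectors (for example per-key hold times in milliseconds) and the names fit_template and score are assumptions.

```python
from statistics import mean, pstdev


def fit_template(training_vectors):
    """Per-feature mean and spread from one user's training entries."""
    columns = list(zip(*training_vectors))
    means = [mean(c) for c in columns]
    # Fall back to 1.0 when a feature never varies, to avoid dividing by zero.
    spreads = [pstdev(c) or 1.0 for c in columns]
    return means, spreads


def score(template, vector):
    """Negated scaled Manhattan distance: higher means closer to the template."""
    means, spreads = template
    distance = sum(abs(x - m) / s for x, m, s in zip(vector, means, spreads))
    return -distance


# Made-up timing vectors, for illustration only.
template = fit_template([[100, 90], [110, 100], [120, 110]])
genuine = score(template, [112, 98])    # a later entry by the same user
impostor = score(template, [160, 60])   # an entry by one of the other users
print(genuine > impostor)
```

With this setup, the users who typed each password only a handful of times never need a template of their own; their entries serve purely as impostor attempts against the extensively sampled users.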
Jeff