the cups blog


SOAPS: An improved audio CAPTCHA

Jennifer Tam, Jiri Simsa, David Huggins-Daines, Luis von Ahn and Manuel Blum

CAPTCHA: a test to determine whether the user is human.
Some existing audio CAPTCHAs have only a 70% human passing rate, because the additional noise injected into the audio makes discerning the digits difficult.
Additional concern: task time is much greater than with visual CAPTCHAs.

Are current CAPTCHAs secure?

  • CAPTCHAs are considered insecure if they can be beaten 5% of the time.
  • Because of the limited vocabulary, a trained system can beat them 45% of the time.

Testing targeted: Google, reCAPTCHA, Digg
Sampled 1,000 audio CAPTCHAs from each

Algorithm to break them:

  • Segment audio
  • Features: classify as digit/letter, noise, or voice


  • Training data was manually segmented and labelled.
  • Testing used an automatic segmenting algorithm.
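
The talk notes do not say which automatic segmenting algorithm was used. As a rough illustration only, here is a minimal sketch that assumes a simple energy threshold: windows of samples whose average energy exceeds the threshold are grouped into candidate segments. The function name, window size, and threshold are all hypothetical.

```python
from typing import List, Tuple


def segment_by_energy(samples: List[float], window: int = 4,
                      threshold: float = 0.1) -> List[Tuple[int, int]]:
    """Return (start, end) index ranges whose windowed energy exceeds threshold.

    Assumption: a plain energy threshold; the real attack may have used
    something more sophisticated.
    """
    segments = []
    start = None  # index where the current loud region began, if any
    for i in range(0, len(samples), window):
        frame = samples[i:i + window]
        energy = sum(x * x for x in frame) / len(frame)
        if energy > threshold and start is None:
            start = i                      # a loud region begins
        elif energy <= threshold and start is not None:
            segments.append((start, i))    # the loud region ends
            start = None
    if start is not None:                  # audio ended mid-segment
        segments.append((start, len(samples)))
    return segments
```

Each returned range would then be handed to the feature extraction and classification steps.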

Feature algorithms:

  • Mel-frequency cepstral coefficients (MFCC)
  • Perceptual linear prediction (PLP)
  • Relative spectral transform with PLP (RASTA-PLP)

Trained with AdaBoost, SVM, k-NN
Algorithm: segment, recognize (features -> labels), repeat until all segments are processed or a maximum solution size is reached
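
The loop described above can be sketched roughly as follows. This is not the authors' code: the segmenter, feature extractor, and classifier are passed in as hypothetical callables, and the label names "noise" and "voice" are assumptions based on the three classes listed earlier.

```python
from typing import Callable, List, Sequence, Tuple

Segment = Tuple[int, int]


def solve_captcha(samples: Sequence[float],
                  segmenter: Callable[[Sequence[float]], List[Segment]],
                  extract: Callable[[Sequence[float]], Sequence[float]],
                  classify: Callable[[Sequence[float]], str],
                  max_len: int = 10) -> str:
    """Segment the audio, label each segment, and keep digit/letter labels.

    Stops when all segments are processed or the answer reaches max_len.
    """
    answer = []
    for start, end in segmenter(samples):
        if len(answer) >= max_len:
            break                              # maximum solution size reached
        label = classify(extract(samples[start:end]))
        if label not in ("noise", "voice"):    # keep only digits/letters
            answer.append(label)
    return "".join(answer)
```

In a real attack, `extract` would compute something like MFCC or PLP features and `classify` would be a trained AdaBoost, SVM, or k-NN model.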

  • Success rates: 66% Google, 45% reCAPTCHA, 71% Digg. These are for exact matches; rates are higher if errors are allowed (at least Google permits 1 error in the response).
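
To illustrate why allowing errors raises the attack's pass rate, here is a small sketch that compares a response to the expected answer with a tolerance. It assumes an "error" means one character insertion, deletion, or substitution (Levenshtein distance); how Google actually counts an error is not stated in the talk.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def passes(answer: str, expected: str, allowed_errors: int = 0) -> bool:
    """True if the answer is within allowed_errors edits of the expected text."""
    return edit_distance(answer, expected) <= allowed_errors
```

With `allowed_errors=0` this is the exact-match case reported above; with `allowed_errors=1` a response that gets one digit wrong still passes.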

How to build a better audio CAPTCHA?

  • Apply reCAPTCHA’s visual approach to audio.
  • Similar to visual reCAPTCHA, have users transcribe audio that failed Automatic Speech Recognition (ASR), but use audio that is spoken clearly.

How will it work?

  • Start with phrases with known transcriptions.
  • The user is given adjacent phrases to transcribe.
  • The user's transcription of the un-transcribed phrase is recorded once their transcription of the known phrase is matched.
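
A minimal sketch of that flow, assuming a challenge pairs one known phrase with one un-transcribed clip. The class, field names, and the simple case-insensitive comparison are all hypothetical and only meant to show the control flow.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Challenge:
    known_text: str                 # phrase with a trusted transcription
    unknown_id: str                 # identifier of the un-transcribed clip
    collected: List[str] = field(default_factory=list)


def grade(challenge: Challenge, known_answer: str, unknown_answer: str) -> bool:
    """Pass the user only if the known phrase matches.

    On a pass, record the user's guess for the unknown clip.
    """
    if known_answer.strip().lower() != challenge.known_text.strip().lower():
        return False                # failed the known phrase: discard both
    challenge.collected.append(unknown_answer.strip())
    return True
```

The user cannot tell which phrase is the known one, so a human who passes is likely to have made an honest attempt at both.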

Security Analysis:

  • Speaker independent recognition and open vocabularies are difficult for ASR systems.
  • AM broadcast and MP3 coding degrade the audio, which also reduces ASR performance.


  • Improved accessibility for reCAPTCHA
  • Provide transcriptions for non-transcribed audio

Q: How will the bad guys respond to this new technique?
A: Data will be collected as the system runs, so weak bits can be detected and removed from the system. It should be possible to stay ahead of the bad guys (by having more complete data). Different radio show sources will also provide different background patterns, which would need to be segmented in the bad guys’ training data as well.
Q: Radio shows pushed the development of “widely understandable” accents. Does this make them particularly vulnerable to computer attack?
A: Not clear this will be a problem, as there were various accents encouraged in the shows, among other reasons.
Q: What about language barriers?
A: Eventually include audio sources from other languages, perhaps chosen by location or menu selection.

Q: How do you plan to clean the data for spelling, punctuation, etc.?
A: Currently, ignore case and punctuation (for comparison).
Also planning to deploy a dictionary for cleanup/comparison.
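
As a sketch of the "ignore case and punctuation" comparison, assuming ASCII punctuation only (typographic quotes and other Unicode marks would need extra handling) and hypothetical function names:

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    return " ".join(text.lower().translate(_PUNCT).split())


def same_transcription(a: str, b: str) -> bool:
    """Compare two transcriptions while ignoring case and punctuation."""
    return normalize(a) == normalize(b)
```

The planned dictionary step for spelling cleanup is not covered here.
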
Q: Can users poison the system?
A: There are statistical tests for ultimate acceptance of a given transcription, as well as other techniques.
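
The statistical tests were not described in the talk. One simple possibility, shown purely as an assumption, is to accept a transcription only once enough independent users agree on it; the thresholds and the function name below are hypothetical.

```python
from collections import Counter
from typing import List, Optional


def accept_transcription(submissions: List[str], min_votes: int = 3,
                         min_share: float = 0.6) -> Optional[str]:
    """Return the most common submission if enough users agree, else None.

    Assumption: submissions are already normalized (case, punctuation).
    """
    if not submissions:
        return None
    text, count = Counter(submissions).most_common(1)[0]
    if count >= min_votes and count / len(submissions) >= min_share:
        return text
    return None
```

A single user submitting junk cannot reach the thresholds alone, which limits how easily the collected transcriptions can be poisoned.
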
Q: What about deaf-blind users?
What techniques for computer use are available?
A: ASCII output/keyboard input.
A: Would it be possible to use a haptic device with a waveform?
Q: To deal with dyslexia, perhaps combine the audio with a visual representation?
A: Runs the risk of giving the computer more leverage as well, and there are no transcriptions for the audio to use to generate the visual representation.