Is that you, Alice? A Usability Study of the Authentication Ceremony of Secure Messaging Applications

Citation

Elham Vaziripour, Justin Wu, Mark O'Neill, Ray Clinton, Jordan Whitehead, Scott Heidbrink, Kent Seamons, Daniel Zappala, Is that you, Alice? A Usability Study of the Authentication Ceremony of Secure Messaging Applications, USENIX Symposium on Usable Privacy and Security (SOUPS), July 2017

Abstract

The effective security provided by secure messaging applications depends heavily on users completing an authentication ceremony—a sequence of manual operations enabling users to verify they are indeed communicating with one another. Unfortunately, evidence to date suggests users are unable to do this. Accordingly, we study in detail how well users can locate and complete the authentication ceremony when they are aware of the need for authentication. We execute a two-phase study involving 36 pairs of participants, using three popular messaging applications with support for secure messaging functionality: WhatsApp, Viber, and Facebook Messenger. The first phase included instruction about potential threats, while the second phase also included instructions about the importance of the authentication ceremony. We find that, across the three apps, the average success rates of finding and completing the authentication ceremony increases from 14% to 79% from the first to second phase, with second-phase success rates as high as 96% for Viber. However, the time required to find and complete the ceremony is undesirably long from a usability standpoint, and our data is inconclusive on whether users make the connection between this ceremony and the security guarantees it brings. We discuss in detail the success rates, task timings, and user feedback for each application, as well as common mistakes and user grievances. We conclude by exploring user threat models, finding significant gaps in user awareness and understanding.

Data

The participant data from this study was collected using a Qualtrics survey. The survey data is listed in alice-qualtrics.xslx. Data for the first phase consists of the following columns:

qualtrics-data.xslx: Phase 1
First App
the first application used.
Second App
the second application used.
Third App
the third application used.
Role
the role played in the scenario.
Gender
gender of the participant.
Age
age of the participant.
Education
educational attainment.
Major
college major.
Computer Knowledge
multiple selections possible, chosen from list shown in paper.
Prior Experience
list of applications used previously.
Status WhatsApp
success or failure using the authentication ceremony in WhatsApp.
Status Viber
success or failure using the authentication ceremony in Viber.
Status FBM
success or failure using the authentication ceremony in Facebook Messenger.
Mechanism for key verification WhatsApp
the mechanism the participant used to verify the authenticity of the partner using WhatsApp.
Mechanism for key verification Viber
the mechanism the participant used to verify the authenticity of the partner using Viber.
Mechanism for key verification FBM
the mechanism the participant used to verify the authenticity of the partner using FBM.
Trust Score WhatsApp
trust scored with a Likert scale for WhatsApp.
Trust Score Viber
trust scored with a Likert scale for Viber.
Trust Score FBM
trust scored with a Likert scale for Facebook Messenger.
Favorite
favorite app.
Favorite reason
reason for ranking the above app as the favorite.

Data for the second phase also includes the following columns:

qualtrics-data.xslx: Phase 2
Self rate security knowledge
Participant's assessment of their computer security knowledge, using scale shown in paper.
Test score
Score from 0-6 based on answering a series of questions testing computer security knowledge.
Familiar with encryption
How familiar participant was with encryption prior to the study, open response.
Mistakes WhatsApp
Mistakes participant made with WhatsApp.
Mistakes Viber
Mistakes participant made with Viber.
Mistakes FBM
Mistakes participant made with Facebook Messenger.

Statistics

Using our data set, we calculated a variety of statistics with SPSS. The results are collected in alice-statistics.zip, which contains four groups of statistical calculations:

Success and Failure Rates

We measured whether the participants were successful in using the authentication ceremony for each application in the second phase of the study. This data is stored in SuccessFailStatus.sav and contains four columns:

SuccessFailStatus.sav
Case
A numeric identifier for the participant pair.
WhatsApp
The pair's success (1) or failure (0) for WhatsApp.
Viber
The pair's success (1) or failure (0) for Viber.
FBM
The pair's success (1) or failure (0) for Facebook Messenger.

For this data we want to test whether there are any differences between the applications. Because the data is dichotomous we used Cochran’s Q Test, followed by McNemar’s test to find the significant differences among the pairs of applications. These tests are found in Success-Fail-Tests.spv.
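
As an illustration of the two tests named above, here is a minimal sketch in plain Python. It is not the SPSS procedure stored in the .spv files. The rows are made-up values laid out like SuccessFailStatus.sav (WhatsApp, Viber, FBM), and the function names are hypothetical. The p-value shortcut exp(-Q/2) holds only for exactly three applications, where Q is compared to a chi-square distribution with 2 degrees of freedom.

```python
import math

def cochran_q(rows):
    """Cochran's Q for equal-length rows of 0/1 outcomes (one row per pair)."""
    k = len(rows[0])
    col_totals = [sum(r[j] for r in rows) for j in range(k)]
    row_totals = [sum(r) for r in rows]
    n = sum(row_totals)
    denom = k * n - sum(t * t for t in row_totals)
    if denom == 0:
        raise ValueError("Q is undefined when every row is all 0s or all 1s")
    q = (k - 1) * (k * sum(c * c for c in col_totals) - n * n) / denom
    return q, k - 1  # statistic and degrees of freedom

def chi2_sf_df2(x):
    """Upper tail of the chi-square distribution with exactly 2 degrees of freedom."""
    return math.exp(-x / 2)

def mcnemar_exact(a, b):
    """Two-sided exact McNemar p-value for two paired 0/1 sequences."""
    only_a = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    only_b = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    tail = sum(math.comb(n, i) for i in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical data: (WhatsApp, Viber, FBM) success flags for six pairs.
rows = [(1, 1, 1), (0, 1, 1), (0, 1, 0), (1, 1, 0), (0, 1, 1), (0, 0, 0)]
q, df = cochran_q(rows)
p_overall = chi2_sf_df2(q)

# Follow-up pairwise comparison, here WhatsApp against Viber.
whatsapp = [r[0] for r in rows]
viber = [r[1] for r in rows]
p_pair = mcnemar_exact(whatsapp, viber)
```

With several pairwise follow-up tests, a multiple-comparison correction is commonly applied; whether and how that was done here is recorded in the SPSS output, not in this sketch.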

Task Completion Times

In the second phase of the study we measured the time taken by participants to (a) find the authentication ceremony and (b) complete the authentication ceremony. This data is stored in SecondPhase.sav and consists of the following columns:

SecondPhase.sav
RawCompletionTaskTime
The time it took to complete the task in minutes. This starts when the participant launched the application and ends when they sent the credit card number after authenticating properly.
RawKeyFindingTime
The time it took to find the authentication ceremony. This starts when the participant launched the application and ends when they saw the screen for the authentication ceremony.
Status
Whether the participant succeeded or failed at completing the authentication ceremony. We mark the status as a failure if the participant took longer than 10 minutes to find the authentication ceremony or if they decided to send the credit card number to their partner without authenticating properly. We mark the status as a success otherwise. The value of 0 = fail and 1 = success.
LogCompletionTaskTime
The log of the RawCompletionTaskTime field.
LogKeyFindingTime
The log of the RawKeyFindingTime field.
Role
The role of the participant in the scenario. 0 or A means the participant played the person who left his credit card at home; 1 or B means the participant played the person who is at home and is supposed to send the credit card number to his friend.
SUS
The participant's SUS score for the application, using the standard SUS questionnaire.
TestScore
A score from 0-6 based on answering a series of questions testing computer security knowledge.
TrustScore
Answer to "I trust this application to be secure," on a Likert scale ranging from 1 = Strongly disagree to 5 = Strongly agree.
Order
The order in which this participant used this application.
Application
The application name or number (1 = WhatsApp, 2 = Viber, 3 = Facebook Messenger).
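
The Status rule and the two log columns described above can be sketched as follows. The function names and arguments are hypothetical, and the natural logarithm is an assumption, since the column descriptions say only "the log".

```python
import math

def status(find_minutes, authenticated_before_sending):
    """Return 1 (success) or 0 (failure) following the Status column's rule."""
    # Failure: more than 10 minutes to find the ceremony, or the credit card
    # number was sent without authenticating properly.
    if find_minutes > 10 or not authenticated_before_sending:
        return 0
    return 1

def log_time(raw_minutes):
    """Log-transform a raw time in minutes (natural log is an assumption)."""
    return math.log(raw_minutes)

# Hypothetical participant: found the ceremony in 3.5 minutes and authenticated.
example_status = status(3.5, True)
example_log = log_time(3.5)
```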

For this data we want to know if there is a significant difference in the time to complete these tasks among the three different applications tested—WhatsApp, Viber, and Facebook Messenger. We first tested for normality using the Shapiro-Wilk test. This test is found in Timing-Shapiro-Wilk.spv. We then ran the Kruskal-Wallis test, found in Timing-Krukal-Wallis.spv.
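
For readers without SPSS, the Kruskal-Wallis statistic can be sketched in plain Python as below. The times are made-up minutes, not values from SecondPhase.sav, and the function names are hypothetical. The p-value line assumes exactly three groups, so that H is compared to a chi-square distribution with 2 degrees of freedom.

```python
import math

def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # positions i..j are 0-based, ranks are 1-based
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes

def kruskal_wallis(groups):
    """Tie-corrected Kruskal-Wallis H and its degrees of freedom."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks, tie_sizes = average_ranks(pooled)
    total = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum * rank_sum / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    correction = 1 - sum(t ** 3 - t for t in tie_sizes) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("H is undefined when all values are identical")
    return h / correction, len(groups) - 1

# Hypothetical completion times in minutes for WhatsApp, Viber, and FBM.
times = [[2.1, 3.4, 2.8], [4.0, 5.2, 4.6], [6.3, 7.1, 8.0]]
h, df = kruskal_wallis(times)
p_value = math.exp(-h / 2)  # chi-square upper tail, valid for df == 2 only
```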

Favorite Rankings

In the second phase of the study we asked participants which of the systems was their favorite. This data is stored in Favorites.sav, which contains the following columns:

Favorites.sav
FavoriteApp
The application chosen as the favorite (WhatsApp = 1, Viber = 2, Facebook Messenger = 3).
Phase
The phase of the study. The values are either 1 or 2.
choseFBM
Whether Facebook Messenger was chosen as the favorite application (0 or 1).
chosenViber
Whether Viber was chosen as the favorite application (0 or 1).
chosenWhatsApp
Whether WhatsApp was chosen as the favorite application (0 or 1).

We want to test whether there are any differences between the favorite rankings for each application between the two phases. We ran a Chi-Square test, which is found in FavoriteApp-Chi-Square.spv.
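
The comparison above amounts to a chi-square test of independence on a phase-by-application table of counts. A minimal sketch, assuming records shaped like (Phase, FavoriteApp) pairs from Favorites.sav; the counts are invented and the function names are hypothetical:

```python
import math

def crosstab(records):
    """Count (phase, favorite) pairs into a 2 x 3 table (phases 1-2, apps 1-3)."""
    table = [[0, 0, 0], [0, 0, 0]]
    for phase, favorite in records:
        table[phase - 1][favorite - 1] += 1
    return table

def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a count table.

    Every row and column total must be nonzero, otherwise an expected
    count is zero and the statistic is undefined.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Hypothetical counts: rows are phases 1 and 2, columns are WhatsApp, Viber, FBM.
counts = [[10, 20, 30], [20, 20, 20]]
stat, df = chi_square_independence(counts)
p_value = math.exp(-stat / 2)  # chi-square upper tail, valid for df == 2 only
```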

Trust Scores

In both phases we asked participants the following question:

I trust this application to be secure.

  • Strongly agree
  • Somewhat agree
  • Neither agree nor disagree
  • Somewhat disagree
  • Strongly disagree

This data is stored in Trust.sav and contains four columns:

Trust.sav
Phase
The phase of the study. The values are either 1 or 2.
TrustWhatsApp
Answer to "I trust this application to be secure," for WhatsApp, on a Likert scale ranging from 1 = Strongly disagree to 5 = Strongly agree.
TrustViber
Answer to "I trust this application to be secure," for Viber, on a Likert scale ranging from 1 = Strongly disagree to 5 = Strongly agree.
TrustFBM
Answer to "I trust this application to be secure," for Facebook Messenger, on a Likert scale ranging from 1 = Strongly disagree to 5 = Strongly agree.

For this data we ran a mixed-model ANOVA because we are interested in the interaction between two independent variables (application and phase). This test is found in Trust-Mixed-Model-Anova.spv.

To determine whether there was a simple main effect for the application, we ran a repeated measures ANOVA on each phase. This test is found in Trust-GLM-Phases.spv.

To determine whether there was a simple main effect for the study phase, we ran a one-way ANOVA on each application to compare the trust between the two phases. These tests are found in Trust-Univariate-WhatsApp-ByPhase.spv, Trust-Univariate-Viber-ByPhase.spv, and Trust-Univariate-Facebook-ByPhase.spv.
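
To make the last step concrete, the F statistic of a one-way ANOVA comparing one application's trust scores between the two phases can be sketched as below. The scores are invented, the function name is hypothetical, and the sketch stops at F and its degrees of freedom; the p-values are in the SPSS output files.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: group size times squared mean offset.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: squared deviations from each group's mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    if ss_within == 0:
        raise ValueError("F is undefined when there is no within-group variance")
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical Likert trust scores (1-5) for one application in each phase.
phase1 = [4, 5, 3, 4]
phase2 = [3, 2, 4, 3]
f_stat, df_b, df_w = one_way_anova([phase1, phase2])
```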