reCAPTCHA: How Much Will We Give Up for Security?

By Eliza Schuh

April 30, 2021

As so eloquently put by John Mulaney, “The world is run by computers, the world is run by robots and we spend most of our day telling them that we’re not a robot just to log on and look at our own stuff.” Mr. Mulaney is joking about Google reCAPTCHA’s “I am not a robot” checkbox, a type of CAPTCHA. Originally invented in 1997, CAPTCHAs, or “Completely Automated Public Turing tests to tell Computers and Humans Apart,” have exploded in popularity in recent years. As their name suggests, CAPTCHAs are automated Turing tests: tests intended to differentiate between humans and machines, in this case between human users and bots. Ideally, 100% of human users pass a CAPTCHA and 0% of bots do. But bots are becoming increasingly good at passing CAPTCHAs, which means that CAPTCHAs are becoming more advanced, thorough, and holistic. This escalation, however, comes with increased privacy and ethics concerns, especially in the case of Google’s reCAPTCHA.

Traditional, text-based CAPTCHAs asked the user to enter a word or series of characters displayed in a way that is difficult for bots to interpret. Eventually, bots became good at solving those text-based CAPTCHAs, with a 2016 Google machine learning algorithm passing an astounding 99.8% of text-based CAPTCHAs so distorted that humans had only a 33% success rate. Other basic CAPTCHAs have included image and audio tests, such as selecting stop signs and transcribing words. A 2010 Stanford study showed that a group of three humans could only agree on the answer to an image CAPTCHA 71% of the time. Six years later, researcher Jason Polakis created an algorithm that could beat image CAPTCHAs 71% of the time – the exact same success rate. The very nature of CAPTCHAs proves to be their undoing: they are essentially ready-made training exercises for machine learning and artificial intelligence programs, teaching bots to be more human in the very process of trying to detect them. CAPTCHAs must therefore continuously evolve to remain effective. As a result, they have become quite difficult in recent years, and user experience has suffered, particularly for people with learning disabilities.

reCAPTCHA was originally developed at Carnegie Mellon University and launched in 2007; Google acquired it in 2009. The “re” reflects how it reversed the conception of CAPTCHAs as wasted effort: early reCAPTCHAs put users’ answers to work digitizing books. Google’s famous (or infamous) “I am not a robot” checkbox arrived with v2 in 2014 and was dubbed the “no CAPTCHA reCAPTCHA” because no puzzle is administered directly to the user. The most recent versions are v3 and Enterprise, released in 2018 and 2020, respectively; v3 drops even the checkbox, silently scoring visitors in the background. Due to a lack of public data on Enterprise, we will be talking mostly about the checkbox and v3. Today, 97.7% of the top 1 million websites (by traffic) embedded with CAPTCHAs use Google’s reCAPTCHA. Over 1 million websites use reCAPTCHA v3, and over 6 million use some version of reCAPTCHA.

If you’re like me, you were initially confused about how the checkbox could verify humanity. How does a simple checkbox perform the same job as the squiggly letters and pictures of stop lights we had needed to identify for so long? As a matter of fact, Google’s No CAPTCHA reCAPTCHA does not test whether you can check the box – a bot can do that easily – but how you check the box. Google has not released its exact metrics for security reasons, but they are known to include measures of human randomness, such as the not-quite-straight path a cursor takes toward the box. reCAPTCHA then uses these metrics to give each user a score indicating how likely it is that they are human.
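
With v3, that score is exposed directly to website owners through Google’s documented verification endpoint: the reCAPTCHA script issues a token in the browser, the site’s backend forwards the token to Google, and Google answers with a score between 0.0 (almost certainly a bot) and 1.0 (almost certainly human). The sketch below shows roughly what that server-side check looks like; the secret key and the 0.5 cutoff are placeholders chosen for illustration, not a definitive integration.

```typescript
// Minimal sketch of verifying a reCAPTCHA v3 token server-side (Node 18+, global fetch).
// RECAPTCHA_SECRET and the 0.5 cutoff are placeholders, not recommended values.
interface SiteVerifyResponse {
  success: boolean;
  score?: number; // v3 only: 0.0 (likely bot) to 1.0 (likely human)
  action?: string;
  "error-codes"?: string[];
}

async function isLikelyHuman(token: string, remoteIp?: string): Promise<boolean> {
  const params = new URLSearchParams({
    secret: process.env.RECAPTCHA_SECRET ?? "",
    response: token,
  });
  if (remoteIp) params.set("remoteip", remoteIp);

  const res = await fetch("https://www.google.com/recaptcha/api/siteverify", {
    method: "POST",
    body: params,
  });
  const data = (await res.json()) as SiteVerifyResponse;

  // The website owner only ever sees the final number, never the signals behind it.
  return data.success && (data.score ?? 0) >= 0.5;
}
```

Notably, the site owner only receives the number; the behavioral signals that produced it stay with Google.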

Here, reCAPTCHA becomes worrisome in two main areas: the privacy concerns that come with its data collection and the ethical concerns raised by its scoring system. As bots learned to imitate basic signs of human randomness, reCAPTCHA had to become even more invasive. To make reCAPTCHA’s behavior-based bot detection as accurate as possible, Google recommends that clients embed it on every page of their website, rather than just the page where the checkbox is located. This does improve detection accuracy, but it also means Google can collect data on a user’s activity across the entire site, disclosing the collection only through a small reCAPTCHA logo somewhere on the page – a 21st century version of fine print.
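
For context, the whole-site embedding Google recommends looks roughly like the sketch below, following its public v3 integration docs; the site key and the backend endpoint are placeholders. The key point is that merely loading the script on a page is enough for reCAPTCHA to begin observing the visitor there, whether or not a form is ever involved.

```typescript
// Runs in the browser on any page that has loaded Google's v3 script, e.g.:
//   <script src="https://www.google.com/recaptcha/api.js?render=YOUR_SITE_KEY"></script>
// YOUR_SITE_KEY and /internal/recaptcha-score are placeholders.
declare const grecaptcha: {
  ready(callback: () => void): void;
  execute(siteKey: string, options: { action: string }): Promise<string>;
};

grecaptcha.ready(async () => {
  // The token encodes Google's risk assessment of this visit; the site forwards it
  // to its own backend, which asks Google for the score (see the sketch above).
  const token = await grecaptcha.execute("YOUR_SITE_KEY", { action: "page_view" });
  await fetch("/internal/recaptcha-score", { method: "POST", body: token });
});
```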

Google refuses to disclose the types of data collected by v3, but research by Marcos Perona, lead engineer at AdTruth, determined that (as of 2015) No CAPTCHA reCAPTCHA is a redesigned version of Google’s Botguard technology, a program originally intended to detect bots on Gmail. And while we don’t know much about how No CAPTCHA reCAPTCHA works, we do know how Botguard works. Botguard checks your browser for Google cookies and, if there are none, inserts one. Then, according to Perona as quoted in Insider, Botguard takes a “pixel-by-pixel fingerprint of the user’s browser window,” collecting data such as IP address, browser type and plug-ins, and mouse and touch motion information. There is good reason to believe reCAPTCHA captures all of this and possibly more. Because the widget is served from Google’s own domain, it can also read any Google cookies already sitting in your browser, tying the visit back to your existing Google identity. Jeremy Gillula, staff technologist at the Electronic Frontier Foundation, told Insider that Google is essentially “identifying whether or not someone is a human by figuring out precisely which human they are.” If nothing else, the presence of No CAPTCHA reCAPTCHA on a web page tells Google that you visited that page, allowing it to add that visit to its profile of you.
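
Google has never published what its scripts actually read, so none of the following should be taken as reCAPTCHA’s implementation. Purely as an illustration, the sketch below shows the kinds of signals that standard browser APIs hand to any script embedded on a page: the raw material a fingerprinting approach like the one Perona describes would have to work with.

```typescript
// Illustration only: signals that ordinary browser APIs expose to any embedded script.
// This is not Google's code; it simply shows what is technically available.
interface BrowserSignals {
  userAgent: string;
  language: string;
  screenSize: string;
  timezone: string;
  pluginCount: number;
  cookiesEnabled: boolean;
}

function collectSignals(): BrowserSignals {
  return {
    userAgent: navigator.userAgent,
    language: navigator.language,
    screenSize: `${screen.width}x${screen.height}`,
    timezone: Intl.DateTimeFormat().resolvedOptions().timeZone,
    pluginCount: navigator.plugins.length,
    cookiesEnabled: navigator.cookieEnabled,
  };
}

// Mouse movement is just as easy to observe: every pointer position, with timestamps.
const trail: Array<{ x: number; y: number; t: number }> = [];
document.addEventListener("mousemove", (event) => {
  trail.push({ x: event.clientX, y: event.clientY, t: performance.now() });
});
```

Collected across every page a user visits, signals like these go a long way toward a stable per-person fingerprint, which is precisely Gillula’s concern.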

Google insists it does not use data gathered by reCAPTCHA to target ads, and that the data is used only to improve reCAPTCHA and the general security of its products. Legally, however, there is little accountability behind this claim: reCAPTCHA is covered by Google’s global Privacy Policy, which states, “We also allow specific partners to collect information from your browser or device for advertising and measurement purposes using their own cookies or similar technologies.” Beyond that, there is a significant lack of information about how reCAPTCHA data is used. Google has also set a precedent of going back on promises to keep acquisitions independent, as in the case of Nest, which it integrated in 2019 despite a 2014 promise to maintain the company’s independence.

Since Google’s promise is not legally binding, there is nothing stopping the company from quietly walking it back in the future. We, as users, are essentially putting our trust in Google.

This brings us to the ethical implications of allowing a for-profit company to determine what makes an internet user “trustworthy.” By assigning a score to each user, Google claims the right to assess users however it sees fit. Researchers at the University of Toronto found that users’ scores improved if they had Google cookies in their browser or were logged into a Google Account, privileging those who use Google’s products and penalizing those who wish to remain anonymous. The same research found that privacy measures such as using Tor or a VPN, or clearing a browser’s cookies, lowered a user’s score. Users have also reported that reCAPTCHA challenges take longer on competing browsers such as Firefox. While more research is needed, this exemplifies the conflict of interest inherent in allowing Google to validate humanity.

With reCAPTCHA’s overwhelming dominance of the online CAPTCHA industry, Google has cemented itself as the issuer of what are effectively online credit scores, with almost zero transparency about the metrics by which those scores are determined or whom those metrics serve. Bots are an undeniable security risk, but an increase in security cannot come at such a devastating cost to personal privacy and to the rights of individuals who opt out of using Google products.

Unlike Google’s reCAPTCHA, humanID provides a simple and secure alternative. Instead of a reCAPTCHA box asking users to select images or type in codes, a user only needs to provide their phone number for verification. From there, humanID’s secure system sends a verification code to the user’s phone. The best part? We do not harvest data; even the phone number is deleted immediately after the verification code is sent. As a nonprofit organization, humanID prioritizes providing our clients with the highest level of security and convenience. Rather than taking on the risks of Google’s reCAPTCHA, why not partner with us?
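
As a rough sketch only (humanID’s actual implementation is not published here, and the function and field names below are hypothetical), a one-time-code flow of the kind described above can be built so that the phone number is used once to deliver the code and never written to storage:

```typescript
// Hypothetical sketch of a phone-based one-time-code flow; function names and
// storage choices are illustrative, not humanID's actual implementation.
import { randomInt, createHash } from "node:crypto";

// Stand-in for a real SMS gateway; a deployment would call a provider here.
async function sendSms(phoneNumber: string, message: string): Promise<void> {
  /* deliver `message` to `phoneNumber` */
}

// Only a hash of the code, keyed by an anonymous session id, is kept in memory.
const pendingCodes = new Map<string, string>();
const hash = (value: string) => createHash("sha256").update(value).digest("hex");

async function startVerification(sessionId: string, phoneNumber: string): Promise<void> {
  const code = String(randomInt(100000, 1000000)); // six-digit one-time code
  pendingCodes.set(sessionId, hash(code));
  await sendSms(phoneNumber, `Your verification code is ${code}`);
  // The phone number goes out of scope here and is never written to storage.
}

function confirmVerification(sessionId: string, submittedCode: string): boolean {
  const expected = pendingCodes.get(sessionId);
  pendingCodes.delete(sessionId); // single use
  return expected !== undefined && expected === hash(submittedCode);
}
```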