Unit 42 CTR: Sensitive Data Exposed in GitHub


Category: Unit 42

Tags: , , , , ,

Executive Summary

The Unit 42 Cloud Threat Report: Spring 2020 focused on the practices of DevOps to determine where misconfigurations are happening in the cloud. This blog offers a detailed analysis of GitHub repositories and the immediate need to “shift-left” security checks to enable all teams – DevOps, engineering and security – to discover and remediate issues earlier.

Few data repositories are more widely leveraged for fast-tracking code from development into production than GitHub. However, as the old adage states ‘with greater speed, comes higher risk’.

Researchers found that an organization’s usage of public GitHub accounts combined with the high potential for DevOps to leak sensitive information, there was an increased risk of data loss or sustained compromising events. However, with proper implementation of DevSecOps and the usage of GitHub Event API scanners, organizations can greatly decrease their risk of exposing compromising information publicly.

Key Findings

Unit 42 researchers analyzed more than 24,000 public GitHub data uploads via the GitHubs Event API and found thousands of files containing potentially sensitive information, which included:

GitHub Exposed Data

GitHub’s Event API

GitHub provides developer access to an Events API search feature. This allows for the near-real-time listing of files and code posted to GitHub servers. With a limit of 5,000 requests per hour, per account, the event API allows researchers to view and scan any file pushed to Github that is available within the public domain, e.g. publicly shared. There are several tools that take advantage of this functionality. On the commercial side, GitHub itself operates and maintains the GitHub Token Scanner. This inspects files for token strings to attempt to prevent fraud and abuse. The AWS’ git-secrets is used by AWS to scan for usernames and passwords, as well as other critical strings to prevent their exposure. There are also Open Source GitHub Event API scanners like gitrob and trugglehog which have been used by red teams, pen testers, and malicious users alike to identify potentially sensitive information.


ShhGit Live

Image 1. shhgit logo
Image 1. shhgit logo


Unit 42 researchers used eth0izzle’s shhgit to read near-real time GitHub events and attempted to answer three questions.

    1. Is potentially sensitive data found within the files?
    2. Could they be traced to an organization?
    3. Could security precautions have prevented unwanted exposure of potentially sensitive data?

Simply put, the answer to all three was yes. Researchers found and analyzed more than 24,000 unique GitHub files which triggered upons shhgit’s premade 120 signature rules as well as custom Unit 42 signature rules. Researchers found a number of potentially sensitive data entries including:

    • 4109 Configuration files
    • 2464 API keys
    • 2328 Hardcoded username and passwords
    • 2144 Private key files
    • 1089 OAuth tokens

Upon analysis of these findings, Unit 42 researchers confirmed the validity of the findings and were able to identify file owners, project names, and in some cases the names of commercial companies posting this information.

Breaking down the results

Hardcoded Passwords

Researchers determined the most critical finding to be the presence of hardcoded passwords. In total, 2328 username and password entries, consisting of 880 unique passwords were found including 797 unique usernames. These passwords were found within service URL API entries and SSH configuration files. Researchers noted that only 18% of total password entries were associated with the top 10 most common passwords. The password ‘password’ taking the lead with 72 total entries, and the password ‘secret’ came in second with 51, see table 1 for breakdown the top 10 most common passwords identified.

Password Count
password 72
secret 51
admin 46
1234 41
db_password 40
*** 38
pass456 35
root 34
pass-word-pass-word-pass-word 29
token 26

Table 1: Top 10 most common identified passwords

Perhaps what is more interesting is that 817 of the 880 unique password entries occurred 3 or fewer times, with 665 passwords only occurring 1 time. What makes these statistics very interesting is the uniqueness of the passwords themselves. For example, the following is a sampling of ten password entries:

    • p4ssW0rde
    • P@##w0rd
    • Password!
    • qwerty123456789
    • simplepass123
    • sqluser2019
    • supersecret
    • wilson1234567
    • xxxxxxxxxxxxxxxxxxxx
    • Z*NsqgS5$@jHsF2

From a subjective viewpoint, researchers noted some of the passwords are thought to be common among corporations. Most meet minimum password requirements and they are easy to remember, e.g. the first two “password” passwords. However, these passwords can be easily guessed by malicious actors and are often present within most bruteforce dictionary listings. Other entries within the provided sample are very simple passwords with only lowercase and number combinations, or even simply the letter ‘x’ repeated 20 times. Researchers give these passwords a ‘high likelihood of being legitimate’ due to the pseudo complexity patterns they exhibit as well as the uniqueness of the entries. Meaning these are likely passwords that engineers are using within their own production environments. Additionally, due to the high occurrence of these passwords appearing within URL API requests to common cloud services, like Redis, PostgreSQL, MongoDB, and AMPQ, it is highly likely these same pseudo complex passwords are used within cloud environments themselves.

In contrast, researchers only found 27 unique instances where a variable password entry was used, indicating temporary or dynamic password usage. For example, $password, {password}, or %Password%. These 27 unique passwords instances only accounted for a total of 67 out of the 2328 passwords identified, or less than 3%.

Hardcoded API Keys and OAuth Tokens

Unit 42 researchers identified 2464 API keys and 1998 OAuth tokens within the 24,000+ triggered GitHub files. These elements were found to be unique with only 15 keys or tokens repeating more than 4 times, and only one repeating the most at 12 times across all of the triggered GitHub files, see table 2.

API Key Count
AIzaSyxxxxxxxxxxy2TVWhfo1VUmARKlG4suk 12
AIzaSyxxxxxxxxxx4kQPLP1tjJNxBTbfCAdgg 9
AIzaSyxxxxxxxxxxYmhQgOjt_HhlWO31cYfic 8
AIzaSyxxxxxxxxxxq_V7x_JXkz2llt6jhI5mI 6
AIzaSyxxxxxxxxxxnpxfNuPBrOxGhXskfebXs 6
AIzaSyxxxxxxxxxxRJ4c2tVRJxu8hZzcWA1fE 5
AIzaSyxxxxxxxxxxfUutD2aeWE1WFdnTBd_Jc 5
AIzaSyxxxxxxxxxxPdQUkRUerdohx28Fuv4wE 5
AIzaSyxxxxxxxxxxc2127s6TkcilVngyfFves 4
AIzaSyxxxxxxxxxx7SALQhZKfGBN3sFDs27Ps 4
AIzaSyxxxxxxxxxxJFdebWiX5KqyLHdakBOUU 4
AIzaSyxxxxxxxxxxVtidHrO1LXtfT3TFZuEOA 4
AIzaSyxxxxxxxxxxrzdcL6rvxywDDOvfAu6eE 4
AIzaSyxxxxxxxxxxEnTs4EfSxFIdFIdowigCs 4
AIzaSyxxxxxxxxxx4o82um7Gj1rY7R9W0apWg 4

Table 2. Top 15 API Keys

Due to the nature of API keys and OAuth keys, these elements provide direct user access to a designated cloud environment. Should an API key or OAuth token fall into the wrong hands, the malicious actor could impersonate the engineer and gain control of the environment. In a worst case scenario, if an API key was created with administrative privileges within a cloud environment, anyone who used that API key would have full access to the cloud account. Legitimate API key exposure does occur. Take for example an incident reported by UpGuard. Nearly 1GB worth of data including AWS API keys, log files, and IaC templates were exposed via GitHub. This incident detailed the exposure of legitimate API keys contained within service and infrastructure configuration files to the public. The credentials and API keys were found to be associated with root accounts held by a commercial organization.

Keys and Token, much like passwords, must be guarded and controlled to ensure they are only known to legitimate users. Any lost or potentially leaked API key or OAuth token should be immediately revoked and reissued. Table 3 displays the identified 2464 API keys and 1098 OAuth tokens and the environment in which they were associated.

Environment Count
Google Cloud API Key 1998
Google OAuth Token 1098
Miscellaneous – API_KEY 358
SonarQube Docs API Key 84
MailGun API Key 19
NuGet API Key 4
Twilo API Key 1

Table 3. Identified environments with associated API and OAuth keys

Configuration and Private key files

Unit 42 researchers completed their cloud deep-dive by analyzing configuration and private key files. Configuration files represented the highest single category of files identified by sshgit and Unit 42 signature rules with nearly 17% of the 24,000 files. The most common configuration file type was a Django configuration file which contained more than a 3rd of all configuration file types identified, see table 4. Django is a python based web framework that promotes fast development and design. PHP, which is also a very common scripting language used in web design, came in third. These web based configuration files could expose an organization’s cloud infrastructure allowing attackers to easily access cloud server internals. This would make for easier exploitation or post exploitation operations as well.

Configuration File Count
Django configuration file 1473
Environment configuration file 601
PHP configuration file 587
Shell configuration file (.bashrc, .zshrc, .cshrc) 328
Potential Ruby On Rails database configuration file 266
NPM configuration file 233
Shell profile configuration file 211
Shell command alias configuration file 130
Git configuration file 113
SSH configuration file 35

Table 4. Top 10 configuration file types

Researchers identified an elevated presence of environmental and shell configuration files. These types of configuration files establish the environmental controls for a system or service. Shell, SSH, profile, and Git configuration files are also present within the top ten listing of identified configuration files. In relation to the previous sections, it is not uncommon for services to demand username and password, API key, or token placeholders within configurations files. Researchers found nearly 80% of the configuration files contained some aspect of a username or password, API key, or OAuth token.


Unit 42 researchers found evidence that users upload sensitive data to GitHub. This sensitive data contained:

    • Hardcoded username and passwords
    • Hardcoded API keys
    • Hardcoded OAuth tokens
    • Internal service and environmental configurations

As was noted in our recent DevOps focused Cloud Threat Report, Unit 42 researchers strongly recommend that every IaC template pulled from a public repository, such as GitHub, be thoroughly scanned for vulnerabilities as part of the CI/CD pipeline. The evidence points to nearly half of every scanned CloudFormation template (CFT) containing a potentially vulnerable configuration. The odds of deploying a vulnerable cloud template are significant. Additionally, organizations should make use of GitHub Event API scanners to assist in the prevention of exposing sensitive internal information via GitHub code publications.


Researchers recommend that users and organizations who publish code to a GitHub repository use the following mitigations to ensure configuration files do not leak sensitive information publicly:

    • Implement variable and CLI argument-based code writing practices to remove hardcoded username and passwords, API keys, and OAuth tokens from code samples.
    • Implement a password security policy mandating the usage of complex passwords
    • Implement a publication policy to regulate and prevent the sharing of internal sensitive data via external sources.
    • Use GitHub’s Enterprise account capabilities to ensure that public sharing practices are more closely scrutinized.
    • Use tools like AWS-git secrets, GitHub’s TokenScanner, gitrob, or trugglehog to identify and remove tokens from being published publicly

Download the Unit 42 Cloud Threat Report: Spring 2020 for more research and best practices to implement in your organization.

Additional Resources