[HackerNotes Ep. 56] Using Data Science to Win Bug Bounty - Mayonaise (aka Jon Colston)

Hacker TLDR;

  • Mayo’s Data Science Methodology: Jon (Mayonaise) has a pretty unique approach, applying data science throughout his hunting methodology. Some takeaways from this approach:

    • Measure all the data you can throughout hunting - including what led you to the bug, and what data source was used.

    • Data analysis - every source, input, and keyword is weighted, and metrics such as keyword frequency are measured. Keywords that commonly show up can be used to identify more of the same target, or similar targets running the same technologies.

    • Focus on and refine the techniques that worked. Try permutations of the keywords that led to the result.

  • How to Get Started with a Data-Driven Approach: Mayo’s tips on getting started with a data-driven approach to hunting:

    • Enumerate all domains applicable to your target.

    • Use a big wordlist against all of your targets in an iterative loop.

    • Dump the results and save the output.

    • Analyze results - this is the basis of the hit rate, and we can use this to provide weight to each endpoint or word.

    • Keep this workflow running on a server in the background continuously. Use permutations of your good hits for a continuous source of fresh data.

    • Extrapolate - what have I done that's succeeded, why has it succeeded and how can I replicate that?

  • Measuring Bug Bounty Data: Mayonaise measures almost every relevant data point possible when hunting. This data helps refine his process for finding bugs. Some of his insights from this process include:

    • Writing wrappers for every tool in the process to ensure all relevant data is captured - response codes, success rates on keywords used, and more.

    • Effective data sources for tooling - having paid for and battle-tested most of them, he found the highest-performing sources to be SecurityTrails and Shodan.

  • Measuring Bounty KPIs: KPIs are measured and used to further refine his workflow to find more bugs. Some of the KPIs he measures are:

    • Which inputs were used: Inputs being things like fuzzing a file, content discovery, and so on.

    • Timing: A given input took X amount of time and gave back X number of hits.

    • Hit rate: This can then be used to calculate an X percent success rate on a per-wordlist basis.

    • Refining: Successful keywords are given a higher weight for the target, which further drives wordlist creation.

Mayo’s Unconventional Bounty Journey

Jon Colston (Mayonaise) didn’t follow the typical beginnings of many bug hunters. Instead, he approached his early days of hunting from a completely different perspective, putting his rich experience in data science and digital marketing to good use.

This background developed into a unique hunting style with a data science-led approach. It also lent itself well to targets such as Yahoo and big advertisers, where he used his marketing background to target marketing applications and APIs, climbing to and maintaining 1st place on the leaderboard.

Throughout the process of climbing the leaderboards, Mayo not only documented details about the targets but also meticulously recorded numerous data points at every stage of bug identification.

Even if certain aspects weren't immediately exploitable, he made thorough notes and conducted in-depth research on the target, ensuring valuable insights for future encounters.

With that being said, over time, Mayo’s relationship with bug bounty has evolved and so has his methodology. He had some pretty unique perspectives and takeaways from hunting, so let’s dive into them.

Mayo’s Data Science Methodology

“What is not measured is not managed… you want to measure everything”

Mayo approaches hunting differently, measuring numerous data points throughout the process of identifying a bug. Every tool in his workflow has a wrapper to make sure all output is captured and can be retrieved at a later date, eventually trickling down into his ‘conversion funnel’ (more on this later). A minimal sketch of what such a wrapper could look like follows the list below.

Combining these data points with data science techniques creates an advantage when approaching a target. Data points Mayo tracks include:

  • All subdomain discovery

    • Which tool and which technique discovered it

  • All fuzz history

    • Including endpoints fuzzed, parameters fuzzed, status codes, timings

  • All domains and subdomains collected

    • Both resolved and unresolved

  • Proxy files when using manual techniques
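
To make the wrapper idea concrete, here’s a minimal Bash sketch. The tool (subfinder), the CSV format, and the log file name are all assumptions for illustration, not details of Mayo’s actual setup:

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: run a discovery tool and log every result
# with metadata (timestamp, tool, target) for later analysis.
target="$1"
logfile="recon.log"

# subfinder is just an example; wrap every tool in the workflow the same way
subfinder -d "$target" -silent | while read -r subdomain; do
    # CSV columns: timestamp, tool, target, result
    printf '%s,subfinder,%s,%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$target" "$subdomain" >> "$logfile"
done
```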

This data is then analyzed for results. Key performance indicators measured off the back of this include:

  • Which inputs were used: Inputs being things like fuzzing a file, content discovery, and so on

  • Timing: A given input took X amount of time and gave back X number of hits

  • Hit rate: This can then be used to calculate an X percent success rate on a per-wordlist basis

If the hit rate on a program is low, this could suggest ineffective fuzzing, wordlists, or content discovery methods. This then leads to the questions: Why is it low? How could it be improved? Is the data/wordlist for the target still applicable?
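
As a rough sketch of how a per-wordlist hit rate could be computed, here’s an awk one-liner. The flat CSV schema (wordlist, word, status code) is a made-up assumption for illustration, not Mayo’s schema:

```bash
#!/usr/bin/env bash
# Hypothetical per-wordlist hit rate from a fuzz log.
# Assumed CSV columns: wordlist,word,status_code
awk -F, '
    { total[$1]++ }
    $3 ~ /^(200|301|302|401|403)$/ { hits[$1]++ }  # treat these codes as hits
    END {
        for (wl in total)
            printf "%s: %d/%d hits (%.1f%%)\n", wl, hits[wl], total[wl], 100 * hits[wl] / total[wl]
    }
' fuzz_results.csv
```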

All of this data is used by Mayo to create a concept of recipes and ingredients, which can essentially be used as blueprints for a target, recon, or fuzzing.

Ingredients And Recipes

Mayo applies the idea of ingredients and recipes to targets. An ingredient is a single step in the process of vulnerability identification - a data source or enumeration method, for example Shodan, crawling, or a specific endpoint.

The recipe is then the combination of these ingredients (methods) to identify a bug, which can be used as a blueprint for that target or technology.

Let’s dive into some recommended ingredients.

Data Sources

Fortunately, Mayo has paid for most sources and battle-tested them for us. Paid data sources take the crown when it comes to successful hit rates, and the most effective ones he’s found from his workflow are:

  • SecurityTrails

  • Shodan

  • BinaryEdge

  • Crawling (free)

  • xnLinkFinder (free)

All of these sources generate a tonne of data. Once we have it, frequency analysis is performed on the words appearing for the target - that is, which terms, such as ‘api’ and ‘admin’, keep showing up.

If there’s a high frequency of ‘API’ or variations of API, such as ‘api-admin’, this would be classed as an API ‘ingredient’ for that specific target or technology. Eventually, with enough data, you’ll have many ingredients that can be combined to become a recipe.
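
A minimal shell sketch of this kind of frequency analysis, assuming your discovered subdomains sit one per line in a subdomains.txt file (a stand-in name):

```bash
# Split subdomains into component words and count occurrences;
# high-frequency words are candidate 'ingredients' for the target.
tr '.-' '\n' < subdomains.txt \
    | tr '[:upper:]' '[:lower:]' \
    | sort | uniq -c | sort -rn \
    | head -n 20
```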

These recipes start to build a picture of the program. If you wanted to identify more APIs, you could use the previously identified ingredients for further enumeration against a newly identified scope, domain, etc.

Mayo then looks at relevant words: domain words, technology words, and words in enumerated directory paths. If, for example, an API is identified from its naming structure and domain name, there’s a good probability there will be Swagger files or API documentation on the host.

The analysis of relevant words helps make content discovery more targeted. This then becomes a recursive cycle, which can be broken down into:

  • A second refined wordlist is created dynamically based on the recipe and ingredient idea

    • From things such as domain words, crawled and parsed content, and so on

  • Any new hits are then identified - where did these new hits come from?

  • Another targeted wordlist is used to determine further interesting directories

    • If, for example, he has determined from the previous step that target.com/v1/internal exists, he goes back to his previous data - in this instance his endpoint database, a massive curated list of seen-in-the-wild endpoints - and extracts anything containing v1/internal to fuzz it. A minimal sketch of this step follows the list.
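
Here’s what that extraction-and-fuzz step might look like, assuming the endpoint database is a flat endpoints.txt file and ffuf is the fuzzer - both stand-ins, not Mayo’s confirmed stack:

```bash
# Pull every known endpoint containing the interesting fragment
# from the curated endpoint list into a targeted wordlist...
grep 'v1/internal' endpoints.txt | sort -u > targeted_wordlist.txt

# ...then fuzz the host with it, keeping only interesting status codes
ffuf -u 'https://target.com/FUZZ' -w targeted_wordlist.txt -mc 200,301,302,401,403
```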

Going beyond the already robust data-driven approach, Mayo elevates this target monitoring by conducting additional research when faced with a flagged subdomain or keyword that is unfamiliar. This involves investigating the word to determine if it is associated with a specific technology or platform, which then restarts the discovery cycle.

Recipes

The workflow doesn’t stop there, though.

If a keyword or technique is working well, the ‘recipe’ used for it - whether that’s the directory structure, the recon method, or the keyword - is revisited, and variations of that recipe are tested to measure their effectiveness and see if any extra hits can be produced.

Find your own recipes and ingredients

If you want to do this for yourself for content discovery and you’re starting from scratch, Mayo had a few takeaways for the listeners.

You can start to build the weighting of words in your wordlist as follows:

  • Enumerate all domains applicable to your target

  • Use a big wordlist against all of your targets in an iterative loop

  • Dump the results and save the output

  • Analyze results - this is the basis of our hit rate, and we can provide weight to each endpoint or word

  • Keep this workflow running on a server in the background continuously. Use permutations of your good hits for a continuous source of fresh data.

  • Extrapolate - what have I done that's succeeded, why has it succeeded and how can I replicate that?

The best part about all of this? It’s all written in Bash, so don’t worry if you aren’t comfortable with a specific programming language.
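
As a rough sketch of what that loop might look like in Bash - assuming ffuf for content discovery, jq for parsing, and a domains.txt of enumerated targets, none of which are confirmed details of Mayo’s setup:

```bash
#!/usr/bin/env bash
# Hypothetical iterative content-discovery loop: fuzz every target,
# save raw output, then aggregate hits per word to build weights.
mkdir -p results

while true; do
    while read -r domain; do
        out="results/${domain}_$(date +%s).json"
        ffuf -u "https://${domain}/FUZZ" -w big-wordlist.txt \
             -mc 200,301,302,401,403 -o "$out" -of json
    done < domains.txt

    # Crude weighting: how many times has each word ever hit?
    jq -r '.results[].input.FUZZ' results/*.json \
        | sort | uniq -c | sort -rn > word_weights.txt

    sleep 86400   # re-run daily; permute top hits between runs
done
```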

If you find yourself wondering why you didn't save your previous scan data after reading this, rest assured, you're not alone.

Conversion Funnels and Bug Bounty

All of the previous steps produce a tonne of data. To filter the data down into actionable items to hunt on, Mayo uses the idea of conversion funnels.

In marketing, a conversion funnel is used to filter down opportunities - Mayo applies these exact principles when it comes to bug bounty.

Using the data and the idea of ingredients and recipes, he breaks it down: ‘...You start with x total targets in, and you will filter down and process the list down to only work on the targets which look vulnerable’

So, for example, the top of the funnel in bug hunting could be all of the hosts found today. The second stage of the conversion funnel may be to look at hosts that haven’t been scanned before. The next might be to identify functionality that is usually exploitable based on previous data.

Layering traceability into all these steps helps optimize the conversion funnel. For example: which data sources are providing the highest concentration of new targets? Areas that are productive and producing results get the focus.
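
As a loose illustration of the funnel idea (file names and stage criteria below are assumptions for the sketch), each stage is just a filter over the previous stage’s output, and the counts give you the traceability:

```bash
# Stage 1: everything found today
sort -u hosts_today.txt > stage1.txt

# Stage 2: only hosts we have never scanned before
comm -23 stage1.txt <(sort -u scanned_hosts.txt) > stage2.txt

# Stage 3: only hosts hinting at historically fruitful functionality
grep -Ei 'api|admin|internal' stage2.txt > stage3.txt

# Traceability: how many candidates survived each stage?
wc -l stage1.txt stage2.txt stage3.txt
```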

Identifying Vulnerabilities and Metadata Analysis

“If I have discovered a vulnerability on a comparable platform, I'm trying to replicate the same attack first.” - HackerOne Spotlight

When Mayo identifies a vulnerability, a lot of work goes into analyzing the steps that got him there. The data generated off the back of this feeds into further refining his process and identifying any holes that can be improved on.

The first thing he looks at is how his content discovery worked. Questions around the data are asked:

  • What did I find, what led me here?

  • When did it enter the content discovery process, and how long did it take to get from discovery of the ingredient to the wordlist?

  • Was this a one-off vulnerability or could another recipe be made from the flow that got us here?

  • Which data source did this lead come from? Crawling, Amass, SecurityTrails, etc.

  • Which areas of the process took the longest to make it through the whole pipeline? How can I reduce this to ensure I’m the first to identify it?

  • Does another server need to be spun up if some of these processes take too long?

  • If the finding was a dupe, why wasn't I the first to find it?

Mayo’s goal is that within 18 hours of discovery (of the asset, endpoint, or whatever it may be), it has made it through the pipeline and into the scanning output. This goal varies depending on whether it’s a holiday period, a weekend, a Monday, and so on.

This might not seem intuitive at first, but Mayo sets these timelines to match when dev teams are online and pushing new things out. This reduces the likelihood of dupes.

Now, Mayo did say he saw a much bigger return when going incredibly deep on a target; this newer methodology is simply better suited to his current lifestyle and approach to hunting.

Hacking Marketing platforms

It’s clear that Mayo’s vertical is digital marketing, so hacking marketing platforms comes as second nature to him.

The principles behind his expertise in marketing can be applied to threat model other programs. He breaks it down below:

  • You need to know the terminology, what’s important to the users and the business, and where the value is in the platform.

  • Look for documentation. When targeting marketing platforms, there’s a LOT of documentation due to the complexity of the processes. This documentation can help identify a variety of areas to focus on.

  • Once you have this context around your target you can start threat modeling.

He breaks it down by applying this concept to marketing applications: “..as an advertiser on one of these platforms, you compete with everyone. Data such as which keywords your competitors are bidding on, and how much they are paying for each keyword.”

This is a perfect example of knowing your target. As hackers, this might not be intuitive to us initially as it doesn’t fall into the standard buckets of vulnerability types, but this is key information when it's applied to the context of marketing.

Gaining credentials for an app

As you can imagine, a lot of functionality on applications is only accessible to authenticated users. Having gained so much experience attacking marketing apps, Mayo shared a few tricks for getting credentialed access:

  • If you’re hacking a B2B system, demo and test accounts are going to exist

  • If you can identify an account that is meant to be shared in some capacity, such as a generic demo or team email address, it will often have a simple password. This provides a trivial means of a successful brute force.

  • If you can identify which companies use the platform, monitor their domains; if one expires, buy it and set up email addresses on it, then perform password resets on the application to gain access.

The ‘Mother of all Bugs’

Mayo has been known to throw the term ‘Mother of all Bugs’ around. He’s even identified a few MOABs at his live hacking events. He classifies a MOAB as:

  • No one else knows to look for it

  • It can be found on different hosts, functionality, and endpoints making it ‘ceasefire’ resistant

  • It has a rating of high or critical impact

Mayo shared intriguing insights about a 'mini MOAB' he discovered during a Live Hacking Event (LHE).

While conducting reconnaissance on a YQL API on the target, he identified API documentation on one of approximately 30 YQL hosts. Surprisingly, this API documentation applied to all hosts, which helped him identify an inherent design flaw.

This flaw was applicable across all YQL APIs, resulting in two medium-severity bugs for each host - around 60 bugs on its own. Further testing revealed additional vulnerabilities on some hosts, bringing the total count to somewhere between 70 and 80 bugs.

This seems more like a mega MOAB than a mini one to me! Arguably the best part about this finding is that he created a custom Nmap NSE script to identify the vulnerability on the other APIs. Who needs Nuclei?

Mayo’s ranking across various leaderboards speaks for itself when it comes to hunting. His unique blend of expertise in marketing and data science adds an intriguing dimension to his approach when tackling targets.

I definitely suggest giving this episode a listen if you haven’t already - Mayo’s approach and perspective on hunting were completely new to me. A few important takeaways from this episode:

  • If you have existing knowledge in products or sectors, find a target that caters to this and hack there. You’ll likely have an edge over other hackers.

  • Take notes on ALL aspects when hunting, not just exploits.

  • Extrapolate - what have I done that's succeeded, why has it succeeded and how can I replicate that?

  • Measure your results. Don’t be afraid to keep data!

  • Collecting and measuring relevant data points on a target can help to refine your hunting further down the line.

  • Don’t reinvent the wheel. You don’t necessarily need fancy tooling to find bugs - Mayo's entire workflow is in Bash.

You can stay up to date and check him out on HackerOne and Twitter.

As always, keep hacking, keep learning, and keep pushing the boundaries!