Remembering the ‘Stakkato’ Hacks

Philip Gabriel Pettersson, best known by the pseudonym "Stakkato", reached near-legendary status within the computer security community through his numerous successful breaches of high-level targets between 2003 and 2005. Then a 16-year-old hacker from Uppsala, Sweden, he successfully infiltrated systems of large universities, the United States military, NASA and various companies, building a worldwide network of compromised hosts within which he operated for around two years before being caught in 2005 and prosecuted by Swedish authorities. This post revisits the story of Stakkato by reviewing his motivation, techniques and exploits, and tries to unearth some lessons learned from these events.

Bored Teenagers

Uppsala is the fourth largest city of Sweden and is situated around 70km north of the capital. In 2003, one of its curious and smart teenagers went on to challenge himself by exploring – illegally – the digital environment surrounding the city. Some of us might remember the old definition of a "hacker", as defined by The Mentor's manifesto [1]. Back in 2003, owning a computer was still not totally commonplace, although it was far more common than in 1995. Only teenagers with a certain sense of interest and curiosity about technology would consider spending most of their time on their machines. In my corner of the world, in the 90s, computer science classes were nothing more than learning to type, using word processors and creating spreadsheets. I am sure I was not the only one in this situation, and some readers may remember the frustration of not being able to pursue their hobby in depth while in school. So we spent most of the classes programming VBA games or spamming other students using WinPopup to make them call out the teacher, who would struggle to explain the innocuous messages on the screen. Only at night could we connect to the net and log into our favorite BBSes, IRC channels or forums to finally learn more. Virtualization was not a thing back in the early 2000s, internet connections were still slow and owning more than one computer was a luxury most couldn't afford. One solution was dumpster diving around computer shops – which were plentiful compared to nowadays – or browsing eBay for scraps. Another was to poke around systems connected to the internet. Universities were of course perfect targets – open, poorly secured (in order to be open) and rich with systems, software and data.

Why am I rambling about the past? Because in many ways, Stakkato may have been the same kind of teenager many of us were back then, but his cockiness eventually got the better of him and caused his demise. Some have even proposed that by 2005, he may have ventured into criminal activity by selling stolen intellectual property. In any case, let's briefly explore his story, because I believe many who now head IT security companies, or who are experts and researchers in the field, shared the same starting point, but fortunately took a different path at some point.

The Stakkato Hacks

The first suspicions of wrongdoing surfaced in 2004. Berkeley researcher Wren Montgomery started receiving emails from Stakkato [2], claiming that not only had he infiltrated her university, but that he had also accessed the network of White Sands Missile Range in New Mexico, stolen F-18 blueprints from Patuxent River Naval Air Station and infiltrated NASA's Jet Propulsion Laboratory (JPL) – which, to be honest, has been hacked by many in the past decade [3][4][5], almost making it an initiation rite for debuting hackers. These claims were later confirmed by spokesmen from both organizations. They however downplayed the importance of these breaches, claiming that they were low-level compromises and that only weather information was exfiltrated. Later during the year, several laboratories hosting supercomputers connected via the high-speed TeraGrid network reported breaches. It was however the 2005 intrusion into networking company Cisco Systems that triggered alerts from authorities and proved to be a bridge too far. Having established a foothold within Cisco, Stakkato was able to locate and download around 800MB of source code of the Internetwork Operating System (IOS) version 12.3 [6]. IOS runs on every Cisco router and other networking devices, which are often key components of not only large commercial and governmental networks, but also of the worldwide telecommunication infrastructure. Samples of the code were released on IRC as proof and reported by a Russian security site. The theft of the code caused a stir, many believing that individuals or groups would comb the code and craft zero-day exploits that could be leveraged against critical systems.

This activity would prove to be the last Stakkato and his team could brag about, as the Federal Bureau of Investigation (FBI) and the Swedish authorities started to investigate the leaks. In 2007, he was convicted of breaching the networks of Swedish universities and paid $25,000 USD in damages. He was further interviewed by U.S. officials [7] and in May 2009, he was formally indicted in California for intrusions into Cisco Systems, NASA's Ames Research Center and NASA's Advanced Supercomputing Division [8]. In 2010 his prosecution was transferred to the Swedish authorities.

The Tactics

The core strategy of Stakkato revolved around a trojanized SSH client he uploaded to systems he compromised. The malicious client would intercept users' credentials and send them to a third location, where Stakkato and his group would retrieve them to access additional systems. Once inside, they used Linux kernel exploits for privilege escalation on the local system, created privileged accounts and repeated their main tactic, eventually building a wide network of proxies from which to launch their attacks. The attack on the National Supercomputer Centre [9] provides insight into the tactics used and the size of the compromises. The methodology was not innovative by any means, but it was applied effectively and certainly leveraged human errors to their full extent. The process can be summarized as follows:

  1. Infiltrate a system via a kernel vulnerability or stolen credentials;
  2. Disable command history, i.e. prevent the system from logging your commands;
  3. Attempt privilege escalation;
  4. Setup trojanized SSH clients, backdoors and rootkits;
  5. Extract known hosts from current machine;
  6. Attempt to infiltrate extracted hosts as per step 1.

The analysts of the NSC documented logins from universities in the United States, Israel and Sweden and referenced the SuckIt rootkit [10] as being installed on one of the target machines. Unfortunately for the administrators, the rootkit was discovered only after a new root password had been assigned to all machines, allowing the attackers to re-infiltrate the newly cleaned systems. This time however the Swedish teenager was a lot less subtle and vandalized the systems, attempting a web defacement and modifying logon messages. The IT specialists then took down the network, inspected and reconfigured every machine before putting the system back online. Despite the defensive operation, recurring login attempts and smaller-scale compromises originating from more than 50 compromised organizations were noted between 2003 and 2005.

Lessons Learned

This story follows a pattern observed throughout the ages, from sprawling ancient empires whose rulers' overconfidence led them to ruin, to growing organizations that stretched into markets that proved more difficult than expected. Stakkato's network of compromised systems grew too large, he became overconfident and he poked the sleeping bears. In other words, patience might have led him down a very different path. Or maybe his arrest was for the best after all: there is little news about him past 2010, but coincidentally there is a security researcher working at Samsung bearing the same name, credited with multiple vulnerabilities in the Linux kernel [11][12]. While I have no idea if this is the same individual, I would be glad to hear that he now uses his skills fruitfully.

Arguably another lesson is that simple tricks still work if applied efficiently. All things considered, security hasn't changed dramatically in the past 10-15 years: it has evolved, but in the end, we still rely on usernames and passwords, users' awareness and administrators properly maintaining their networks and hosts. The humans using these systems haven't changed much either; we take the simplest approach to achieve our goals. Hence we select the easiest password that passes the complexity filters in place and reuse it [13] so we don't have to remember 100 variations of the same password. The large database compromises of the past few years appear to confirm this behavior. We could have many passwords and store them in password managers, but password managers can still be trojanized or exploited [14], enabling tactics similar to those used by Stakkato. And most people would probably not bother to run an additional program to retrieve their password in order to log into the service they need; it simply adds an extra step.

Conclusion

Studying the past of computer security is sometimes quickly dismissed, often seen as irrelevant given the change in technologies, but one can easily find inspiration in the stories of hackers, malware writers and the analysts who battled to gain and maintain control of systems. Much like studying the battles of Alexander the Great or Patton, there is much to be learned from studying the techniques used and wargaming their application in modern organizations. Would today's administrators blindly enter their passwords if a window suddenly popped up requesting their credentials for some update? Users still get fooled by fake login web pages [15] and end up with their bank accounts plundered or their Twitter accounts spewing nonsense to all their followers. It still works.

References

[1]    “Phrack Magazine” [Online]. Available: http://phrack.org/issues/7/3.html. [Accessed: 05-Nov-2016].

[2]    J. M. L. Bergman, “Internet Attack Called Broad and Long Lasting by Investigators,” The New York Times, 10-May-2005. [Online]. Available: http://www.nytimes.com/2005/05/10/technology/internet-attack-called-broad-and-long-lasting-by-investigators.html. [Accessed: 02-Nov-2016].

[3]    K. Zetter, “Report: Hackers Seized Control of Computers in NASA’s Jet Propulsion Lab,” WIRED, 01-Mar-2012. [Online]. Available: https://www.wired.com/2012/03/jet-propulsion-lab-hacked/. [Accessed: 04-Nov-2016].

[4]    “Hacker Sentenced in New York City for Hacking into Two NASA Jet Propulsion Lab Computers Located in Pasadena, California (September 5, 2001).” [Online]. Available: https://www.justice.gov/archive/criminal/cybercrime/press-releases/2005/gascaConviction.htm. [Accessed: 04-Nov-2016].

[5]    “Hackers penetrated NASA computers 13 times last year,” USATODAY.COM, 02-Mar-2012. [Online]. Available: http://content.usatoday.com/communities/ondeadline/post/2012/03/hackers-penetrated-nasa-computers-13-times-last-year/1. [Accessed: 04-Nov-2016].

[6]    “Sweden to prosecute alleged Cisco, NASA hacker.” [Online]. Available: http://www.theregister.co.uk/2010/02/08/swedish_hacker_prosecution/. [Accessed: 04-Nov-2016].

[7]    D. Kravets, “Swede Indicted for NASA, Cisco Hacks,” WIRED, 05-May-2009. [Online]. Available: https://www.wired.com/2009/05/swede-indicted-for-nasa-cisco-hacks/. [Accessed: 03-Nov-2016].

[8]    United States of America v. Philip Gabriel Pettersson aka “Stakkato.” 2009.

[9]    L. Nixon, “The Stakkato Intrusions: What happened and what have we learned?,” presented at the CCGrid06, Singapore, Singapore, 17-May-2006.

[10]    sd and devik, "Linux on-the-fly kernel patching without LKM," Phrack, no. 58, Dec. 2001.

[11]    P. Pettersson, “oss-sec: CVE-2015-1328: incorrect permission checks in overlayfs, ubuntu local root.” [Online]. Available: http://seclists.org/oss-sec/2015/q2/717. [Accessed: 05-Nov-2016].

[12]    “Linux Kernel ’crypto/asymmetric_keys/public_key.c ‘ Local Denial of Service Vulnerability.” [Online]. Available: http://www.securityfocus.com/bid/81694. [Accessed: 05-Nov-2016].

[13]    T. Spring and M. Mimoso, “No Simple Fix for Password Reuse,” Threatpost | The first stop for security news, 08-Jun-2016. [Online]. Available: https://threatpost.com/no-simple-fix-for-password-reuse/118536/. [Accessed: 04-Nov-2016].

[14]    “How I made LastPass give me all your passwords.” [Online]. Available: https://labs.detectify.com/2016/07/27/how-i-made-lastpass-give-me-all-your-passwords/. [Accessed: 05-Nov-2016].

[15]    Bursztein, Elie, Borbala Benko, Daniel Margolis, Tadek Pietraszek, Andy Archer, Allan Aquino, Andreas Pitsillidis, and Stefan Savage, “Handcrafted fraud and extortion: Manual account hijacking in the wild,” in Proceedings of the 2014 Conference on Internet Measurement Conference, Vancouver, Canada, 2014, pp. 347–358.

The Byte-BOS Real-Time Multitasking Operating System

Introduction

Compared to operating systems in general computing, the world of embedded devices remains a world to be discovered by security analysts and hackers alike, and it offers much to explore. While reverse engineering of newer appliances often reveals well-known operating systems such as Windows or Linux, others remain fairly unexplored, such as TinyOS or μOS. Legacy devices, still in use across many industries, also provide a rich variety of unknown software and architectures. One of them is the Byte-BOS Real-Time Multitasking Operating System (RTMOS). This short article attempts, based on limited information, to detail this little-known system and explore its history, usage and basic composition.

History

The Byte-BOS RTMOS was initially developed in 1994 by Numerical Services Inc. The company, defunct since 2004, sold the full C source code of the RTMOS to customers for $7,495 USD. Possession of the source code allowed the buyer full customization of the operating system according to the device being designed. The system supported a restricted set of microprocessors (see table 1), including the Intel x86 and Motorola M68000. Other lesser-known supported microprocessors included architectures from Hitachi, Mitsubishi and Texas Instruments, which were used in embedded devices within the medical, industrial and telecommunications sectors. Due to the long life cycle of these devices, it may still be possible to identify devices leveraging the Byte-BOS RTMOS. Development and support for the RTMOS ceased around 2004. As such, documentation is very scarce, and remnants of information are available only via the Internet Archive [1]. At the time of writing, the domain "www.bytebos.com" resolves to "Scheumacher Engineering", owned by the original developer of Byte-BOS. However, the website is nothing more than a single page which was last updated in 2013.

Table 1 – Microprocessors supported by the Byte-BOS RTMOS.

  • Intel: 80x86, 80x86 (32-bit), 80188/86, 8096, i960
  • Motorola: 68000, Coldfire, 68HC11, 68HC16
  • Mitsubishi: M16C, M32D, M37700
  • Hitachi: H8300, H8300H, SHX
  • Texas Instruments: C2x/C5x DSP, C3x/C4x DSP
  • ARM/THUMB

Description

The Byte-BOS RTMOS is a minimal operating system providing task scheduling for user applications, along with typical OS services such as interprocess messaging, memory allocation and prioritization. It performs pre-emptive and non-pre-emptive scheduling of an unlimited number of queued tasks. The system enforces scheduling via interrupt service routines (ISRs) which can suspend and/or resume tasks based on events and their priority. When multiple tasks of the same priority request resources, the RTMOS assigns the resources via a round-robin selection method.
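As an illustration of that selection policy (a model for this article, not Byte-BOS code), the ready tasks can be kept in per-priority queues that rotate on every pick:

```python
from collections import defaultdict, deque
from typing import Optional

# Illustrative model of the described policy: pick the highest-priority
# ready task; tasks of equal priority take turns (round-robin).
ready = defaultdict(deque)  # priority -> queue of ready task names

def make_ready(task: str, priority: int) -> None:
    ready[priority].append(task)

def schedule() -> Optional[str]:
    for priority in sorted(ready, reverse=True):  # assumes a higher number means higher priority
        queue = ready[priority]
        if queue:
            task = queue.popleft()
            queue.append(task)  # rotate so equal-priority tasks share the CPU
            return task
    return None  # no task ready: idle

make_ready("telemetry", 1)
make_ready("logger", 1)
print(schedule(), schedule(), schedule())  # telemetry logger telemetry
```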

As in other operating systems, memory allocation appears to follow a standard process, keeping track of used and free memory chunks via a heap implemented as a doubly-linked list, the data structure used across the RTMOS to store unlimited numbers of objects. Tasks can dynamically allocate and free memory as needed. The system also manages events and interaction with devices on behalf of user applications by abstracting the underlying hardware via a Device object (see figure 1). Byte-BOS also provides inter-task communication mechanisms for synchronization, such as messages and semaphores. In terms of data structures, the doubly-linked list and buffer objects appear to be the main structures used across the RTOS. Memory chunks, tasks and most of the other objects are all managed via the List object, which also backs the queue structure used to manage tasks. A software stack is also provided to each task to run on.

Security-wise, Byte-BOS does not have any authentication or specific security mechanisms restricting access to memory or special "kernel" functions; all tasks have the same level of authority as the RTOS. Basically, Byte-BOS is completely flat.

Architecture

The RTMOS is written in C, or in a specific C variant depending on the targeted microprocessor. The 10 main components of the operating system are represented in figure 1 and provide an excellent overview of the functions it accomplishes. A simulated "this" pointer to the various structs is made available to the developer to give an Object-Oriented Programming (OOP) feel. This section briefly describes the main components of Byte-BOS, along with some of the features marketed by the original developer.

Figure 1 – UML Representation of the different objects defined in the Byte-BOS RTMOS

The Task Object

The main object of the RTMOS is the Task object, which is very similar to the equivalent concept in modern software development. Just as in any C/C++ object, a constructor runs prior to the body of the task, which is defined by the developer to conduct a specific activity. The task's data is allocated and accessible via the this_data pointer. Tasks also allocate sub-objects, open required devices, initialize variables and synchronize state machines. Once constructed, the task runs its main function until it returns or is removed. Access to the task itself is done via the this_task pointer. Both pointers are always in scope and usable within the task. Once the main function completes, the destructor of the structure is called, which frees the memory and pointers of sub-objects. It also terminates any internal state machine.

The Memory Object

This object manages the memory heap and stores the pointer to the heap, the block size and the number of memory blocks. It also exposes various functions for memory management, mainly alloc_fmem and free_fmem. The heap is implemented using a doubly-linked list in which each node contains metadata about its memory block. While details are not available, it can be assumed that the size of the block, its status (free or allocated) and a pointer to the next block are included.
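Since the real structures are undocumented, the following is only a guess at the bookkeeping involved: a minimal Python model of the assumed block metadata, and of what a first-fit alloc_fmem walk over such a list might look like.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MemBlock:
    """Assumed per-block metadata: size, free/used status, neighbour links."""
    size: int
    free: bool
    prev: Optional["MemBlock"] = None
    next: Optional["MemBlock"] = None

def alloc_fmem(head: MemBlock, size: int) -> Optional[MemBlock]:
    """First-fit walk of the doubly-linked block list (speculative)."""
    block = head
    while block is not None:
        if block.free and block.size >= size:
            block.free = False  # mark the chunk as used
            return block
        block = block.next
    return None  # no free chunk is large enough

def free_fmem(block: MemBlock) -> None:
    block.free = True  # a real allocator would likely also coalesce neighbours
```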

The Semaphore Object

As its name implies, the Semaphore is used for signaling and control between multiple tasks by being in either the Up or Down state. A timeout can be defined as needed.

The MessageBox Object

The MessageBox acts as a central repository for tasks to exchange data. Tasks and ISRs can create messages to initiate other tasks or exchange data. The MessageBox basically provides two functions: put and get. When a new message is created, memory is allocated; similarly, the memory is freed when the message is retrieved from the box.

The Timer Object

Similar to standard timers in other systems, this object is used to create timeouts and schedule tasks and events.

The Event Object

Events are created by tasks and ISRs to schedule other tasks.

The Device Object

This object abstracts interaction between the user application and the hardware device. It does so by pointing to the memory blocks used by the device and managing input/output to that area.

The List, Queue and Buffer Objects

These objects, as their names imply, are implementations of the linked list, queue and buffer data structures.

Features

Byte-BOS boasted the following additional features on its website circa 2000, providing some extra insight into the internals of the system.

Interrupt Handling

ISRs can make operating system calls and schedule tasks by including the ENTER_ISR and EXIT_ISR macros at the beginning and end of the routine. Most services are available to ISRs, giving them considerable reach across the entire OS.

Critical Section Handling

Byte-BOS provides critical section handling by protecting non-reentrant code sections from both interrupt and task pre-emption. Protection of critical sections appears to be done via the following mechanisms:

  • Disabling interrupts
  • Locking and unlocking code sections
  • Prioritization of tasks

Task Execution Trace

Tasks are provided with a trace buffer configurable at both compile time and run time. The trace buffer contains information about the sequence of calls, the parameters passed, and the return value of the called function. The tracing option is extremely useful for debugging purposes.

System Requirements

The Byte-BOS RTMOS requires very little in terms of resources. While the requirements vary depending on the underlying processor, the size of the kernel is a few kilobytes, while the RAM requirements are less than 100 bytes of globally accessible memory and less than 100 bytes per task and ISR. The detailed figures are listed in table 2.

Table 2 – Memory requirements for the Byte-BOS RTMOS according to microprocessor.

Microprocessor        Min kernel (KB)   Max kernel (KB)   RAM for globals   RAM per task
Intel 80x86*          4                 10                28 bytes          30 bytes
Intel 80x86 (32bit)*  6                 15                48 bytes          58 bytes
Intel 80188/86*       2                 10                28 bytes          30 bytes
Intel 8096            2                 8                 20 bytes          20 bytes
Motorola 68000        1.5               12                50 bytes          70 bytes
Motorola 68HC11       1.2               8                 28 bytes          20 bytes
Motorola 68HC16       2                 8                 28 bytes          20 bytes
Mitsubishi M37700     2                 10                28 bytes          20 bytes
Mitsubishi M16C       2                 12                64 bytes          48 bytes
Mitsubishi M32D       N/A               N/A               N/A               N/A
Hitachi H8300         1.2               8                 28 bytes          20 bytes
Hitachi H8300H        2                 10                38 bytes          40 bytes
Hitachi SHX           7.8               19.5              72 bytes          60 bytes
TI C2x/C5x DSP        1.2               8                 28 bytes          20 bytes
TI C3x/C4x DSP        2.3               19.5              28 words          20 words
ARM/THUMB             4                 15                50 bytes          70 bytes

* For Intel-based architectures, the memory footprint reported above is based on usage of the Borland C++ compiler.

Additional volatile memory is required for objects created dynamically. As such, the numbers provided above do not represent the total amount of memory consumed by Byte-BOS.

Compilers Supported

In general, Byte-BOS supports the development tools and compilers provided by the semiconductor manufacturer. For Intel-based processors, the Borland C/C++ compiler was supported. The complete list of supported compilers is provided in table 3.

Table 3 – Compilers supported by the Byte-BOS RTMOS.

Microprocessor        Supported compilers
Intel 80x86           Microsoft C/C++ (large model), Borland C/C++, Watcom C/C++
Intel 80x86 (32bit)   Watcom, Metaware and most other C compilers
Intel 80188/86        Microsoft C/C++ (large model), Borland C/C++
Intel 8096            IAR, BSO Tasking compiler
Motorola 68000        Cross-Code, Intermetrics, Introl, Microtec
Motorola 68HC11       IAR, Introl, Cosmic
Motorola 68HC16       Introl, Cosmic
Mitsubishi M37700     IAR, Microtec
Mitsubishi M16C       IAR, Mitsubishi compiler
Mitsubishi M32D       N/A
Hitachi H8300         IAR, Microtec
Hitachi H8300H        IAR
Hitachi SHX           GNU, Green Hills
TI C2x/C5x DSP        Code Composer / Texas Instruments compiler
TI C3x/C4x DSP        Code Composer / Texas Instruments compiler
ARM/THUMB             ARM

Identification

Devices using the Byte-BOS RTMOS can be identified by looking at the ASCII and Unicode strings in their firmware. In almost all cases, if not all, the firmware will contain a copyright notice identifying the OS as well as the target microcontroller it was compiled for. For example, by extracting the strings from the firmware of the Seagate Cheetah 10K.6 disc drive, either using the simple strings utility (strings.exe on Windows, strings on Linux) or IDA, the copyright notice "Byte-BOS 8096 Multitasking Operating System — Copyright 1990 Numerical Services Inc." can be observed (see figure 2).

Figure 2 – Byte-BOS copyright notice in the firmware of the Seagate 10K.6 disc drive.
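This check is easy to automate on any firmware image; the short sketch below scans for the banner (the regular expression and the printable-ASCII assumption are mine, based on the notice in figure 2).

```python
import re
import sys

# Match 'Byte-BOS' followed by a run of printable ASCII, to capture the
# full copyright banner and the microcontroller it names.
BANNER = re.compile(rb"Byte-BOS[\x20-\x7e]{0,100}")

with open(sys.argv[1], "rb") as f:  # e.g. a dumped firmware image
    data = f.read()

for match in BANNER.finditer(data):
    print(f"0x{match.start():08x}: {match.group().decode('ascii')}")
```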

While appliances designed after 2004 will likely use a modern OS such as Linux, Byte-BOS can still be found on legacy devices. One such example is the Baxter IPump pain management medical device, whose manual refers to its usage [2]. Of note, the manual mentions the detection of stack overflows within the Byte-BOS RTMOS. Industrial controls, aircraft, telecommunications systems and other systems designed prior to the 2000s may still harbour this little-known RTMOS.

Conclusion

The Byte-BOS RTMOS will likely disappear as legacy embedded devices are cycled out for newer systems with more processing power and extended memory. Until then, for developers of these systems, or simple hobbyists, information remains scarce and limited. While we provided a brief overview of its architecture and features, details about creating user applications remain obscured by the lack of online information such as an API description, code examples or, more importantly, the source code. Such data would greatly ease development and engineering efforts around legacy systems.

References

[1] “Bytebos Home Page.” Bytebos Home Page. Internet Archive, 20 Oct. 2000. Web. 01 Nov. 2015. <https://web.archive.org/web/19990208011926/http://www.bytebos.com/>.

[2] Baxter IPump Pain Management System – Service Manual, Baxter Healthcare Corporation, Chapter 4 – Troubleshooting, p.4-8. 2007. Web. 01 Nov. 2015. <Link>

The Syrian Civil Conflict in the Cyber Environment

Introduction

This is an article I wrote a while ago and never got published. It’s a bit outdated now, but I still think it can be useful for historical purposes, so I’ll post a link to it below.

Abstract

This document analyzes the use of the cyber environment in the Syrian civil war by both the population and the government, in order to characterize the online tactics and strategies developed and used by each belligerent. This overview allows for generalization of online behavior by hacktivists and nation-state sponsored actors on communication networks in the region, which will continue to see online attacks from various parties during similar conflicts in the foreseeable future. In Syria, because of poor infrastructure, a low rate of Internet penetration and early adoption of control mechanisms by the current government, the authorities had dominance over their information environment early in the conflict, enabling rapid gathering of intelligence on dissidents. While social media was leveraged by the population for coordination, as in many other uprisings, it was also the theater of multiple offensive cyber operations by internal and external groups, mostly for information operations purposes. Despite the high level of activity, none appeared to have a definitive impact on the ground. While the events recorded in this space have not reached the level of intensity of other conflicts, they provide a useful model for similar conflicts in the Middle East region.

Reference:

Racicot, J., "The Syrian Civil Conflict in the Cyber Environment." [Online]. Available: https://www.academia.edu/15182402/The_Syrian_Civil_Conflict_in_the_Cyber_Environment. [Accessed: 03-Sep-2015].

(Bad) Amazon Phishing Email

Introduction

Fortunately, my wife is a smart cookie and always suspicious of weird-looking email. Maybe it's due to the fact she lives with a paranoid guy. In any case, she caught this phishing email, which appears to be from Amazon and leads to a fake login page.

Contents

The phishing email comes from "amazon@iservice.co.org.il" with the terribly spelled subject "your accounnt information need to be updated", and the content is a screenshot of an authentic Amazon email, thus bypassing text-based filters. However, the attacker succeeded in misspelling the only field he had to fill.

A fake Amazon account confirmation email which contains a single image.

Clicking anywhere on the image redirects the target to "http://bestofferz.biz/service/support/wp-admin/support/support/", which hosts a fake login page as shown below:

The attacker is hosting a fake Amazon login page on HostGator.

Looking under the hood, we can see that the entire page is actually a single JavaScript function call that decrypts a long Base64-encoded string.

The encryption key used is stored in the hea2p variable within the HTML code, and the page decrypts itself with an AES JavaScript implementation. If the target enters his email and password, he is then forwarded to a fake account creation page asking for his address.
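Assuming the page relies on CryptoJS's passphrase mode, which emits OpenSSL-compatible "Salted__" blobs and derives the key and IV with the MD5-based EVP_BytesToKey scheme, the blob can be decoded offline with a few lines of Python (function names are mine):

```python
import base64
import hashlib
from Crypto.Cipher import AES  # pycryptodome

def evp_bytes_to_key(password: bytes, salt: bytes, key_len: int = 32, iv_len: int = 16):
    """OpenSSL's MD5-based key derivation, as used by CryptoJS passphrase mode."""
    derived = b""
    block = b""
    while len(derived) < key_len + iv_len:
        block = hashlib.md5(block + password + salt).digest()
        derived += block
    return derived[:key_len], derived[key_len:key_len + iv_len]

def decrypt_blob(b64_blob: str, passphrase: str) -> bytes:
    raw = base64.b64decode(b64_blob)
    assert raw[:8] == b"Salted__", "not an OpenSSL-compatible blob"
    key, iv = evp_bytes_to_key(passphrase.encode(), raw[8:16])
    plain = AES.new(key, AES.MODE_CBC, iv).decrypt(raw[16:])
    return plain[:-plain[-1]]  # strip PKCS#7 padding
```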

Fake Amazon account creation page.

And of course, it will then ask you for your credit card information, which is possibly the end goal of the phisher.

Fake credit card information request page.

All the pages are encrypted using the same key. Only after entering all this information does the target get redirected to the real Amazon website.

Successful phishing operation page.

Conclusion

Remember to always check the URL and the From email address!

Repost: Stack-based Buffer Overflow Vulnerabilities in Embedded Systems

I have not written or contributed to the enclosed research paper. I'm simply reposting it because it's interesting and, for some reason, appears to be available only via Google cache. So before it disappears from results entirely, here it is.

This paper discusses a technique to conduct buffer overflows on processors using the Harvard architecture. In this architecture, the stack starts at the beginning of memory and grows upward, versus von Neumann architectures in which it typically grows downward.

Abstract:

Most small embedded devices are built on Harvard class microprocessor architectures that are tasked with controlling physical events and, subsequently, critical infrastructures. The Harvard architecture separates data and program memory into independent address spaces, as opposed to the von Neumann architecture that uses a unified memory system with a single address space for both data and program code. The buffer overflow attack vector is well documented in desktop and server class machines using the von Neumann memory model, but little has been published on buffer overflow vulnerabilities in Harvard architectures. In this paper we show that stack-based buffer overflow vulnerabilities exist in embedded control devices based on the Harvard class architecture. We demonstrate how a reversal of stack growth direction can greatly simplify the attack and allow for easier access to critical execution controls. We also examine popular defense techniques employed in server and desktop environments and the applicability of those defenses toward Harvard class machines.
Link: Kristopher Watts & Paul Oman, University of Idaho, Stack-based Buffer Overflow Vulnerabilities in Embedded Systems

The Variable Message Format (VMF) Protocol – A Data Protocol for Radios (Part 1)

Introduction

Software-Defined Radio (SDR) is a fast-growing market, expanding into a wide array of industries. Growth in this sector alone is expected to reach $27.29 billion USD by 2020 [1]. Considering that most smartphones are equipped with SDRs, one can quickly understand the active research conducted in the field. SDRs are also in wide use across militaries and law enforcement, given the added flexibility of conducting multiple types of operations using the same hardware, often with reduced maintenance costs. Nowadays, military tactical radios and law enforcement equipment are all software-defined to some degree. Within the industry, 93% of all mobile systems leverage SDR for wireless communications [2]. The added agility introduced by software also comes with additional threats, previously absent from hardware-based radios. Within the military, SDRs form the cornerstone of network-centric warfare, as they are used to establish networks between different units in order to rapidly exchange data such as position, imagery and target information between ground, maritime, air and space elements. Radios no longer only process voice; the internal software also manages various networks and data protocols, further increasing the complexity of the application layer of the communication device. The VMF protocol is one such SDR data protocol, created to exchange information between many different systems by providing a rich and flexible specification. Yet VMF is little known even amongst avionics engineers, and knowledge of it remains the speciality of few. As such, in this post we explore this protocol in detail to better understand its inner workings and its uses.

Software-Defined Radios

An overview of the architecture of an SDR is required to better appreciate the threats against them, and to eventually understand some of their unique characteristics when looking for vulnerabilities in their software. We do so by describing the general software model used by SDRs, which have multiple layers, each using different programs to process waveforms and the data contained in them. The Wireless Innovation Forum (WIF) defines an SDR simply as a radio in which some or all of the physical layer functions are software-defined, contrasting with previous radios in which all configurable properties were designed and hardcoded into the hardware. Within the WIF definition, the physical layers refer to the four lowest layers of the OSI model. This innovation allowed for additional agility in the functionality of radios, as the integrated general purpose processors (GPP) or Field Programmable Gate Array (FPGA) units and the flexible radio-frequency (RF) modulators – all reconfigurable through software – permit the device to operate on a wider range of the spectrum depending on current operations. These components are all software controlled, allowing fast reconfiguration as needed by the operator (see figure 1).

Figure 1. Simplified schema of the major hardware and software components of modern SDRs.

Within the SDR, the Software Communications Architecture (SCA) is the open framework generally used to specify how a radio designer should integrate hardware and software components in order to interact efficiently and maximize software reuse [3]. The SCA was developed by the U.S. military as part of the now-defunct Joint Tactical Radio System (JTRS). Despite its military roots, the SCA is used in the industrial sector as well. The framework divides the radio into 4 layers, each composed of multiple components. These layers sit below the user applications (figure 2) and are responsible for processing inbound and outbound data:

Figure 2. The Software Communications Architecture (SCA) is an open architecture framework abstracting interactions between hardware and software within a software defined radio.

Programmable Radio Hardware; this layer groups the software-reconfigurable elements of the radio, such as the RF modulation units, the modems and the link interfaces.

Operating System and Middleware; the operating system, along with the Common Object Request Broker Architecture (CORBA) middleware, plays the critical role of managing communications between the software of the core framework and the hardware. CORBA is a message-passing technique widely used in cross-platform frameworks. All core interfaces are defined using the Interface Definition Language (IDL), which can be compiled into different languages such as C/C++ and Ada. The operating systems included in SDRs are Real-Time Operating Systems (RTOS) such as μOS, TinyOS or VxWorks. These provide multithreading support and are required to be POSIX compliant. The RTOS also interfaces with the underlying network interfaces and serial ports.

Core Framework; the core framework is the abstraction layer between the software developers and the underlying hardware. It contains the interfaces and services required for applications to use the devices of the radio, describing them using the eXtensible Markup Language (XML);

Waveforms; the software at this layer specifies the parameters required to form the waveform needed to communicate with networked devices.

By reviewing the framework above, one quickly realizes that the world of radios is now mainly driven by software development and thus exposes itself to the same issues that plague programs in the wider computing world. The terms "operating system", "drivers", "middleware" and "API" are well known to cyber operators and computer security analysts alike, making radios a target in the same way as any other host on any other network.

Tactical Radios

Software radios have significant advantages for the military as they provide a wide variety of dynamic radio protocols in real time. Since SDRs can be reprogrammed remotely for multiple purposes without any hardware changes, maintenance costs and time have decreased. Because of these characteristics, SDRs are one of the cornerstones of network-centric warfare, the currently predominant Western military doctrine. It is therefore not surprising that all tactical military radios in modern Western militaries are SDRs, along with much of the equipment used in civil aviation, law enforcement and potentially drones. Modern military tactical radios (MTRs) are more akin to routers than radios: they enable the creation of instant networks and the exchange of operational data between connected nodes over Tactical Data Links (TDL) such as Link-16. Data protocols such as VMF and the Joint Range Extension Applications Protocol (JREAP) enable the exchange of voice, data and imagery by connecting a network of various platforms and units, both vertically and horizontally. Amongst the data exchanged are positions, trajectories, maps, navigation data, mission data and intelligence. This implies the presence of user applications managing the data protocol, and additional applications processing the data received. Other data link protocols include the Type 483D, the Chinese equivalent to Link 4C. This transfer of information between platforms is central to the network-centric warfare doctrine, which drives the research and development of networked radios and increases the complexity of the software and network protocols.

In many aspects, the radio can be considered the gateway device between the internal networks of a platform and the external nodes requesting to connect to it. Internally, within aircraft and ships, many MTRs provide multiple interfaces to communicate with other elements of the platform, such as mission computers, display terminals and external software displaying geoposition data, for example. On a military aircraft, the interfaces of the radio usually include the MIL-STD-1553 bus (or its civilian counterpart, the ARINC 429) to link with the avionics, as well as Ethernet and serial ports such as RS-232 or RS-485, often used by Remote Control Units (RCUs) for management purposes via protocols such as the Simple Network Management Protocol (SNMP).

Externally, the radio networks the platform with external units by forming Mobile Ad hoc Networks (MANETs) and exchanges tactical data using data protocols over the tactical data links. These networks are wireless, infrastructureless, multi-hop and highly fluid. In military usage these are typically low-bandwidth links, often managed by devices with little computational power. The Ad hoc On-Demand Distance Vector (AODV) protocol is typically the underlying routing protocol, managing routing in a peer-to-peer fashion. Routing information is updated constantly, as nodes within a MANET can also act as proxies for far-away nodes.

The Variable Message Format Protocol

The VMF standard was developed by the U.S. Department of Defense to allow messages of variable lengths to be sent over TDLs. A VMF message is bit-oriented and attempts to minimize the use of TDLs by sending only the required data. The objective is to be flexible enough to communicate with both legacy and new hosts, by allowing additional header fields to be added without modifying the underlying specification. The core of the VMF protocol is the Protocol Data Unit (PDU), which contains the header and the user data (figure 3), much like a typical TCP/IP packet. The PDU is processed at the application layer and is composed of the application header and the user data, which can take multiple formats as we will see later on. The size of the header is always a multiple of 8 bits, padded with null bits as needed.

Figure 3. Schema of the application header of a VMF message, which is composed of fields and groups.

Structure of the Application Header

The first 4 bits of the header always specify the version of the VMF protocol in use by the PDU. As of 2015, 5 revisions of the protocol have been created (table 1):

Code (binary) Revision
0000 MIL-STD-2045-47001A
0001 MIL-STD-2045-47001B
0010 MIL-STD-2045-47001C
0011 MIL-STD-2045-47001D
0100 MIL-STD-2045-47001D w/ CHANGE1
0101-1110 Undefined
1111 Version Sent Not Implemented
Table 1. Version numbers currently defined in revision D w/ CHANGE1

The "Version Sent Not Implemented" value is used to specify that the current implementation of the VMF protocol in the system is not backward compatible with the incoming VMF message. For example, a system using revision D of the protocol sending a message to a system using revision C or earlier will receive an answer with the version field set to 15, specifying that the source system must use an earlier revision (figure 4). If the version is not implemented but the destination system is still able to process the message, it will do so nonetheless.

Figure 4. When a previous version of the VMF receives a message from a later revision, it returns a message with the version field set to the "Version Sent Not Implemented" value.
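As a rough sketch of this negotiation (the bit ordering is simplified and the function names are mine, not taken from the specification), the version field could be read and answered like this:

```python
VERSIONS = {
    0b0000: "MIL-STD-2045-47001A",
    0b0001: "MIL-STD-2045-47001B",
    0b0010: "MIL-STD-2045-47001C",
    0b0011: "MIL-STD-2045-47001D",
    0b0100: "MIL-STD-2045-47001D w/ CHANGE 1",
}
VERSION_NOT_IMPLEMENTED = 0b1111

def read_version(header: bytes) -> int:
    # Assumes the 4-bit version sits in the high nibble of the first
    # header byte; the on-the-wire bit ordering depends on the transport.
    return header[0] >> 4

def answer_version(local: int, received: int) -> int:
    """Reply value for the version field, mimicking figure 4."""
    if received not in VERSIONS or received > local:
        return VERSION_NOT_IMPLEMENTED  # cannot guarantee compatibility
    return local

print(answer_version(local=0b0010, received=0b0011))  # prints 15
```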

Fields and Groups

Elements following the version number are either "fields" or "groups", as shown in figure 3. A field can generally be thought of as a single piece of data along with an overhead of 1 or 2 bits: a Field Presence Indicator (FPI) and a Field Recurrence Indicator (FRI). The FPI is a flag indicating whether a value for the field has been specified or not, i.e. if the FPI is set to zero, the following bit is the start of a new field or group; if set to one, the following bits represent the value of the field. Some fields can occur more than once. These fields include the FRI flag right after the FPI; when set to one, the field is repeated right after, until an FRI is set to zero, as illustrated in figure 5. Note that the FPI is included only once, in the first occurrence of the field.

Figure 5. Example of the Field Recurrence Indicator (FRI) within a VMF message, used here for the Release Marking field.

Groups, as their name suggests, are sets of related fields or sub-groups. Groups can be nested down to 6 levels and possess the same presence and recurrence flags as fields, known as the Group Presence Indicator (GPI) and Group Recurrence Indicator (GRI). The order of the fields and groups of the header is fixed, much like in any TCP/IP packet.

While most of the fields contain numeric values, some are 7-bit ASCII coded strings, where the value 127 (the DELETE character) is used as the string terminator, i.e. the equivalent of the null byte '\0' in C/C++ character arrays. The terminator is added to the string only if its bit-length is smaller than the maximum length of the field. For example, the "unitname" field is 448 bits long, i.e. 64 characters. If the unitname contains fewer than 64 characters, the terminator is appended to the value. A complete list of groups and fields, along with their descriptions and constraints, can be found in the specification (and may be discussed in future parts of this article).
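To make the FPI/FRI mechanics concrete, here is a hedged decoding sketch; the MSB-first bit order and the field widths are illustrative assumptions, not values taken from the standard's tables:

```python
from typing import List

class BitReader:
    """Minimal MSB-first bit reader over a byte string."""
    def __init__(self, data: bytes):
        self.bits = "".join(f"{b:08b}" for b in data)
        self.pos = 0

    def take(self, n: int) -> int:
        value = int(self.bits[self.pos:self.pos + n], 2)
        self.pos += n
        return value

def read_field(r: BitReader, width: int, recurring: bool = False):
    if r.take(1) == 0:              # FPI = 0: no value follows
        return None
    values = []
    while True:
        more = r.take(1) if recurring else 0  # FRI precedes each occurrence
        values.append(r.take(width))
        if more == 0:
            break
    return values if recurring else values[0]

def decode_string(codepoints: List[int]) -> str:
    chars = []
    for cp in codepoints:
        if cp == 127:               # DEL terminates strings shorter than the field
            break
        chars.append(chr(cp))
    return "".join(chars)
```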

Assembly of a VMF Message

The creation of a VMF message is quite similar to the creation of an HTTP request or a TCP/IP packet: it is done via encapsulation of user data through the multiple layers of the SDR. User data is first entered into the radio, usually by the pilot or crew via a Remote Control Unit (RCU), or by a subsystem via the internal MIL-STD-1553 or ARINC 629 data bus. The radio receives the MIL-STD-1553 message from the bus controller and extracts its data. It then determines which units on its MANET to send the data to. Note that data can be broadcast to all units on the network via a broadcast address; the default broadcast address is 16777215. Within the SDR, the data is then transferred to the VMF messaging service of the device. Based on its configuration and the properties of the data provided, the VMF layer constructs the proper application header and appends the user data to form the application PDU. The PDU is then sent to the lower layers, transformed into waveforms, and the digital data is converted to analog radio waves to be sent via the antenna of the aircraft. The receiving unit then does the reverse process and dispatches the received data to the internal subsystems of the destination aircraft. The process is illustrated in figure 6.

Figure 6. Encapsulation process of user data into a VMF message.

User Data

The type of user data contained within the application PDU is defined by the User Message Format (UMF) field. This field indicates the format of the message contained in the user data field and is associated with the Functional Area Designator (FAD), Message Number, Message Subtype, CANTCO Reason and CANTPRO Reason fields. More about these fields in part 2 of this article.

Binary Files

VMF can be used to transfer files between systems. This is done by setting the UMF field to the value 1 and using the "Filename" field to specify the name of the file. Furthermore, to indicate that the message is a file transfer, the GPI of the VMF Message Identification Group needs to be set to zero.

Redistributed Messages

When the UMF field is set to 0100 (4), the content of the user data is another VMF message, much like a forwarded email. Both the application header and user data sections of the redistributed message are included in the user data portion of the forwarding message. The "Operation Indicator", "Security Classification" and "Release Marking" fields are required to be similar in both messages. Both messages are to be processed by the receiving systems.

Addressing

Nodes (hosts) on the network can be identified using one of the following fields: the URN (Unit Reference Number) or the unit name. Normally, both cannot be used at the same time. The URN is, as its name implies, a number given to a specific unit to identify it; it acts much like a MAC address. Each URN is unique, and only one unit bears a given number. Distribution of URNs is managed by the U.S. Department of Defense. The URN is the addressing scheme of the application layer; VMF itself can be carried over TCP/IP (or UDP), MIL-STD-1553 or any other lower layer. Note that other addressing schemes can be used, but these will be explored later on.

Conclusion

This concludes part 1 of the introduction to the VMF protocol. In part 2, we will explore further the fields included in the latest revision of the protocol and include actual examples of VMF messages. In the meantime, you're welcome to play with Vmfcat, a Python script I'm working on (still incomplete) to generate VMF messages from the command line or via an interactive shell.

CTF Code : Javascript Code to Quickly Process Data on a Webpage

Introduction

In quite a few Capture the Flag (CTF) exercises, I've seen this challenge – a coding one – which asks the participant to process some information on a webpage in less than X seconds, short enough that no manual processing can be done. One way to do so is using JavaScript. In one case, the participant is asked to crack a SHA1-hashed word.

Method

By first manually searching for the hash using a search engine, the plaintext will be found, and after doing it 2 or 3 times for different hashes, a pattern emerges in the random plaintexts. In this case, random integers between 1000 and 9999 are generated. As such, to solve the challenge, one needs to hash all integers from 1000 to 9999 and compare each digest with the one on the page. Once we have a match, we use the plaintext with a provided URL to get the flag. So in summary:

    1. Figure out the plaintext used to generate the hash. Use Google.
    2. Find a pattern in the plaintext, or boundaries [x,y] in the values used.
    3. Extract the SHA1 from the HTML
    4. Generate a SHA1 hash for each possible value in [x,y]
    5. Compare the 2 hashes. If they match, go to the flag retrieval page.

Code Explanation

For this particular CTF, I used JavaScript to avoid managing all the cookies and session state, and CryptoJS for SHA1 hashing.

The script first extracts the SHA1 hash from the page, then iterates over all integers between 1000 and 9999, computing the SHA1 digest of each with the CryptoJS.SHA1() function. Once a match is found, it relocates to the flag retrieval URL.
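For readers who prefer to script it outside the browser, the same brute force can be sketched in Python; the host, paths and page markup below are hypothetical:

```python
import hashlib
import re
import requests

BASE = "http://ctf.example.com"  # hypothetical challenge host

session = requests.Session()           # keeps the challenge session cookie
page = session.get(f"{BASE}/challenge").text
target = re.search(r"[0-9a-f]{40}", page).group(0)  # the SHA1 shown on the page

for candidate in range(1000, 10000):   # the observed plaintext space
    if hashlib.sha1(str(candidate).encode()).hexdigest() == target:
        # Submit the recovered plaintext to the (assumed) flag URL
        print(session.get(f"{BASE}/flag", params={"answer": candidate}).text)
        break
```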

Conclusion

This code can be used with GreaseMonkey or simply via the developer tools included in Google Chrome or Mozilla Firefox.

#TheGreatFTPHunt – 2% to 9% of files scanned potentially containing confidential information

Introduction

In this post, we continue our collection and evaluation of files stored on removable media publicly accessible on the Internet. The collection of filenames from 6,500 hosts is ongoing; we therefore focus here on evaluating the sensitivity of a file based only on its filename. Based on current results, 2 to 9% of the 3,000 files reviewed were sensitive or potentially sensitive. Most of the sensitive files are concentrated on a few hosts, and often include financial information or project data from businesses. So far, 773 hosts containing around 4.5 million files have been scanned.

Discussion

The number of filenames collected is quite large and we cannot manually evaluate each one for its probable sensitivity. As such, we need to devise a procedure to automatically assess sensitivity. First, we list some definitions and restrictions to clarify what a sensitive file is, along with the limitations of our evaluation criteria.

In this document, "sensitive file" refers to user-generated or software-generated files, based on user input, that contain information that should probably not be publicly accessible and which can be leveraged against an individual or organization. This includes:

  • Personal identification documents; passport, driver’s license, visas, government forms…
  • Personal finance documents; income tax files, insurance forms, credit card statements, mortgage, pay stubs, banking information
  • Personal medical documents; prescriptions, medical records
  • Work-related files; emails, proprietary source code, password lists
  • Business finances; customer lists, sales data, project costs, business deals, investments, payrolls
  • Intellectual property; blueprints, schema, patents, research
  • Network configuration; passwords files, configurations files, network diagrams, user databases
  • Large databases of emails, addresses and other personal information.

Files not included in our analysis:

  • Copyrighted / illegally downloaded files. However, we considered text files containing licensing keys to be sensitive.
  • Inappropriate contents (nude selfies, personal politics, group affiliations, etc.)
  • Personal pictures and letters.
  • Addresses and emails, which were not considered personal; however, databases of addresses and emails are considered sensitive.

Because of the volume, we cannot download and manually verify each file to confirm its contents. As such, our main restriction is that our assessment must be done solely based on the absolute filename recorded. To evaluate sensitivity, we used three categories: positive, neutral and negative, i.e. a file is either very likely sensitive, potentially sensitive or clearly not sensitive at all. Of course, there is always a possibility that a file labeled as sensitive is not. For example, a file called social security numbers.xls may contain only formulas or an empty form. Ideally, files identified as positive or neutral should be manually vetted.

The procedure to automatically assess the sensitivity of a file based on its path and name starts with manual assessment of random samples. Using PostgreSQL's ORDER BY random() clause (performance is not an issue in this experiment, though whether this function is random enough for our purposes remains to be verified), multiple random samples of 100 filenames are retrieved from the database. Each file is shown to the evaluator, who assesses its sensitivity as 'positive', 'neutral' or 'negative' based on the path, filename and extension. For each run, we log the count of hits in each category.

Listing 1: Example of a run in which a script asks an evaluator to assess the sensitivity of files based on their absolute paths.
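To make the procedure concrete, below is a minimal sketch of what such an evaluation run could look like; the files table, its path column, the ftphunt database name and the use of the psycopg2 driver are assumptions for illustration, not the actual script.

    # Minimal sketch of one evaluation run: pull 100 random filenames and
    # ask the evaluator to label each as positive, neutral or negative.
    import psycopg2

    conn = psycopg2.connect("dbname=ftphunt")  # hypothetical database name
    cur = conn.cursor()

    # Performance is not a concern here, so ORDER BY random() is acceptable.
    cur.execute("SELECT path FROM files ORDER BY random() LIMIT 100;")

    counts = {"positive": 0, "neutral": 0, "negative": 0}
    labels = {"p": "positive", "u": "neutral", "n": "negative"}
    for (path,) in cur.fetchall():
        answer = ""
        while answer not in labels:
            answer = input("%s [p/u/n]> " % path).strip().lower()
        counts[labels[answer]] += 1

    print(counts)  # the per-run counts logged for each sample of 100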

The evaluator assesses the filename based on keywords that may indicate the contents of the file. As such, a file containing a word, or as we call it in this document, a token, such as sales, passport or passwords, will be assumed to contain information about sales, a passport scan or a list of passwords. In many cases the filename is too obscure, but the path and extension may still indicate the contents of the file. For example, a path containing the tokens project and finances combined with a Microsoft Excel extension will be considered neutral despite a filename of axe189212_c.xls, as the file may contain information about a project. Examples of both scenarios are shown in listings 2 and 3:

Listing 2: Examples of files that were deemed 'positive' hits based on keywords in their absolute paths.

Listing 3: Examples of files that were deemed 'neutral' (or 'unknown') hits based on keywords in their absolute paths.

Filenames in foreign languages are roughly translated using Google Translate; as such, many of them end up labeled as 'neutral'.

A Python script then divides the filename into tokens, and each token is stored in the database along with the number of times it was found in a positive, neutral or negative hit. Tokens are created slightly differently depending on whether they are located in the path, the filename or the extension. For the extension, a single token is created containing the extension itself; if the file has no extension, or the extension is not one usually associated with known software, no token is created. For the filename, tokens are created by splitting on characters usually known to separate words, such as the underscore, dash, period or space. Lastly, for the path, each directory is used as a token and, unlike filenames, is not split further. An example of this process is shown in listing 4:

Listing 4: Example of the tokenization of a filename.
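As a rough illustration of these rules, a tokenizer could look like the following sketch; the exact separator set and the whitelist of known extensions are assumptions.

    import os
    import re

    # Extensions usually associated with known software (assumed subset).
    KNOWN_EXTENSIONS = {"pdf", "doc", "docx", "xls", "xlsx", "txt", "pst",
                        "jpg", "mp3"}

    def tokenize(abs_path):
        """Split an absolute filename into (location, token) pairs."""
        tokens = []
        directory, filename = os.path.split(abs_path)
        # Path: each directory is a single token and is not split further.
        for d in directory.split("/"):
            if d:
                tokens.append(("path", d.lower()))
        # Filename: split on common word separators (underscore, dash,
        # period, space).
        base, ext = os.path.splitext(filename)
        for word in re.split(r"[_\-. ]+", base):
            if word:
                tokens.append(("filename", word.lower()))
        # Extension: a single token, created only for known extensions.
        ext = ext.lstrip(".").lower()
        if ext in KNOWN_EXTENSIONS:
            tokens.append(("extension", ext))
        return tokens

    print(tokenize("/My_Passport/backup/Outlook emails backup.pst"))
    # [('path', 'my_passport'), ('path', 'backup'), ('filename', 'outlook'),
    #  ('filename', 'emails'), ('filename', 'backup'), ('extension', 'pst')]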

Once the tokens are created, the script either adds each token to the database or updates its counts based on the evaluator's choice. After each update, a score is given to the token, which is simply the ratio between positive hits and the total count of appearances: p / hits. Note that tokens are considered distinct depending on their location in the filename. As such, a filename such as /My_Passport/backup/Outlook emails backup.pst will generate two distinct 'backup' tokens: one from the path and one from the filename. We explain this decision in the next paragraphs.

Listing 5: Scores of the tokens extracted from the file in listing 4.
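The bookkeeping behind these scores can be sketched in a few lines; the in-memory counters below stand in for the database table the actual script updates.

    from collections import defaultdict

    # counts[(location, token)] = [positive, neutral, negative]
    counts = defaultdict(lambda: [0, 0, 0])
    LABELS = {"positive": 0, "neutral": 1, "negative": 2}

    def update_token(location, token, label):
        """Record one evaluator decision for a token."""
        counts[(location, token)][LABELS[label]] += 1

    def score(location, token):
        """score = positive hits / total appearances (p / hits)."""
        p, u, n = counts[(location, token)]
        hits = p + u + n
        return p / hits if hits else 0.0

    # The same word yields distinct tokens depending on its location:
    update_token("path", "backup", "negative")
    update_token("filename", "backup", "positive")
    print(score("path", "backup"), score("filename", "backup"))  # 0.0 1.0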

By using this procedure, we believe that tokens appearing often in both positive and negative hits will cancel each other out, while tokens strongly associated with either positive or negative hits will remain clearly divided. A more formal mathematical treatment should follow later (I hope... I need to review my discrete maths, I guess). Preliminary results appear to confirm this approach as valid: extensions strongly associated with sensitive content have higher scores, while media files have null scores.

However, there is a need to further refine this process by associating a value, or weight, with the location of the token. Tokens in the path are not as indicative of the sensitivity of the file as tokens in the filename or extension. Even within the path, the highest level is generally less indicative than the lowest one, e.g. /documents/finances 2012/sales/company sales.xls. Therefore, when assessing a new filename, we need to give a score to the path, the filename and the extension. For the path, we retrieve the score of each token and multiply it by a weight corresponding to its location in the directory structure; tokens that are not found are given a default value of 0. We then take the average over all tokens as the score of the path. For the filename, we do not consider the position of tokens. Finally, the stored score of the extension is retrieved from the database; if the extension is not found, a score of 0 is used. This transforms a filename into a set of three real values ranging between 0 and 1. To determine the weights for each location, we will use a supervised neural network. More research will be conducted to determine how to use this approach.
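As an illustration of this weighting scheme, the sketch below reduces a filename to its three scores, reusing the tokenize and score sketches above; the depth weights and the plain average over filename tokens are placeholders, since the actual weights are precisely what the neural network is meant to learn.

    # Assumed depth weights, shallowest directory first; deeper directories
    # are considered more indicative of the file's sensitivity.
    PATH_WEIGHTS = [0.2, 0.4, 0.6, 0.8, 1.0]

    def features(abs_path):
        """Reduce a filename to (path, filename, extension) scores in [0, 1]."""
        toks = tokenize(abs_path)
        path_toks = [t for loc, t in toks if loc == "path"]
        name_toks = [t for loc, t in toks if loc == "filename"]
        ext_toks = [t for loc, t in toks if loc == "extension"]

        # Path: weight each token's score by its depth, then average.
        path_score = 0.0
        if path_toks:
            weighted = [score("path", t) * PATH_WEIGHTS[min(i, len(PATH_WEIGHTS) - 1)]
                        for i, t in enumerate(path_toks)]
            path_score = sum(weighted) / len(weighted)

        # Filename: position is ignored, unknown tokens score 0.
        name_score = 0.0
        if name_toks:
            name_score = sum(score("filename", t) for t in name_toks) / len(name_toks)

        # Extension: stored score, 0 if the extension is unknown.
        ext_score = score("extension", ext_toks[0]) if ext_toks else 0.0

        return (path_score, name_score, ext_score)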

Results

As of 16 July 2015, 4,568,738 files have been recorded from 773 hosts.

Country             Hosts
United States       258
Russian Federation  91
Sweden              69
Canada              66
Ukraine             27
Norway              24
United Kingdom      24
Australia           19
Netherlands         18
Hong Kong           18
Taiwan              16
Poland              15
Germany             11
Romania             11
Finland             10
Switzerland         8
Korea               8
Singapore           7
Czech Republic      7
Japan               6
Table 1. Location of the 773 hosts scanned as of 16 July 2015, ordered by number of hosts.

MP3 and JPEG image files remain the most common. For a change, we therefore focus our statistics on document-type files, i.e. office documents. Adobe PDF files and Microsoft Word documents are the most common document types in our current data, as shown in figure 1.

Figure 1. Most common file types scanned as of 16 July 2015 for office-related documents.

At the moment, around 3,000 files have been assessed (30 runs of 100 samples). For each run, we recorded the number of positive, neutral and negative hits and found them roughly constant across runs (see figure 2). However, more details about the random() function are needed to ensure the randomness of the samples; this part may need to be redone. So far, between 2% and 9% of files scanned are considered sensitive or potentially sensitive (see figure 3). However, we need to consider the concentration of these files to put this information into perspective: the 278 files identified as sensitive or potentially sensitive were located on 59 hosts, with one host accounting for 101 of these files. This indicates that files of interest to an attacker are likely to be concentrated on a few hosts.

Figure 2. Assessed sensitivity of 30 randomly selected samples of 100 filenames.
Figure 3. Percentage of files according to their sensitivity, based on manual assessment of 3,000 randomly selected files.

As for tokens, we will have to consider the entire collection of filenames in order to have samples from multiple sources; as such, we will keep manually assessing samples of 100 filenames as more data is collected, after which we should have an excellent training set for the neural network. Some high-recurring and high-scoring tokens are shown in tables 2 and 3.

Token                             Hits  Score
attach                            7     0.9285714286
txn                               7     0.9285714286
planning                          6     0.9166666667
archived                          6     1
recpt                             6     0.9166666667
2010taxreturns                    5     1
person~2                          4     1
purchase                          3     1
order                             2     1
Паспорт (Russian for "passport")  2     1
Table 2. Sample of high-scoring tokens, sorted by the number of times observed.
Token                      Hits  Score
jpg                        938   0.013326226
mp3                        460   0
music                      448   0
seagate_backup_plus_drive  382   0.1452879581
asusware                   348   0
pictures                   309   0.0048543689
sda1                       285   0.0649122807
bigdaddy                   279   0
elements                   278   0.0485611511
transcend                  247   0.0222672065
my_book                    234   0.0106837607
Table 3. Sample of high-recurring tokens, sorted by the number of times observed.

Conclusion

While these results are preliminary, they nevertheless seem to provide a solid indication of what one can find on publicly accessible removable drives. Additional work and fine-tuning of both code and processes are required to provide more accurate data. The next step, while the scan is still ongoing, is to develop a methodology to assess the sensitivity of all files, likely using a neural network for classification based on the method presented above.