The Byte-BOS Real-Time Multitasking Operating System

Introduction

Compared to operating systems in general computing, the world of embedded devices remains largely uncharted by security analysts and hackers alike, and it offers much to explore. While reverse engineering of many newer appliances reveals well-known operating systems such as Windows or Linux, others, such as TinyOS or μOS, remain fairly unexplored. Legacy devices, still in use across many industries, also provide a rich variety of unknown software and architectures. One of them is the Byte-BOS Real-Time Multitasking Operating System (RTMOS). This short article attempts, based on limited information, to describe this little-known system and to explore its history, usage and basic composition.

History

The Byte-BOS RTMOS was initially developed in 1994 by Numerical Services Inc. The company, defunct since 2004, sold the full C source code of the RTMOS to customers for USD 7,495. Possession of the source code allowed the buyer to fully customize the operating system for the device being designed. The system supported a restricted set of microcontrollers (see table 1), including the Intel x86 and Motorola M68000. Other, lesser-known supported microprocessors included architectures from Hitachi, Mitsubishi and Texas Instruments, which were used in embedded devices within the medical, industrial and telecommunications sectors. Due to the long life-cycle of these devices, it may still be possible to identify devices leveraging the Byte-BOS RTMOS. Development and support for the RTMOS ceased around 2004. As such, documentation is very scarce, and remnants of information are available only via the Internet Archive [1]. At the time of writing, the domain “www.bytebos.com” resolves to “Scheumacher Engineering”, owned by the original developer of Byte-BOS. However, the website is nothing more than a single page, which was last updated in 2013.

Table 1 – Microprocessors supported by the Byte-BOS RTMOS.
Intel 80×86 | Coldfire | Mitsubishi M16C | Texas Instruments C2x/C5x DSP
Intel 80188/86 | Motorola 68000 | Mitsubishi M32D | Texas Instruments C3x/C4x DSP
Intel 80×86 (32bit) | Motorola 68HC11 | Hitachi H8300 | ARM/THUMB
Intel 8096 | Motorola 68HC16 | Hitachi H8300H |
Intel i960 | Mitsubishi M37700 | Hitachi SHX |

Description

The Byte-BOS RTMOS is a minimal operating system providing task scheduling management for user applications, along with typical OS activities such as interprocess messaging, memory allocation and prioritization. It performs pre-emptive and non-pre-emptive scheduling of an unlimited number of queued tasks. The system enforces scheduling via interrupt service routines (ISRs), which can suspend and/or resume tasks based on events and their priority. When multiple tasks with the same priority are requesting resources, the RTMOS assigns the resources via a round-robin selection method.
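The selection rule described above can be illustrated with a minimal Python sketch. It is not Byte-BOS code (the RTMOS itself is written in C and its scheduler internals are not documented); the `ReadyQueue` class, its method names and the convention that a larger number means a higher priority are all assumptions made for illustration.

```python
from collections import deque


class ReadyQueue:
    """Hypothetical ready queue: highest priority first, round-robin within a level."""

    def __init__(self):
        self._queues = {}  # priority -> deque of task names

    def add(self, task, priority):
        self._queues.setdefault(priority, deque()).append(task)

    def remove(self, task, priority):
        self._queues[priority].remove(task)

    def next_task(self):
        """Return the next task to run, or None when nothing is ready."""
        ready = [p for p, q in self._queues.items() if q]
        if not ready:
            return None
        queue = self._queues[max(ready)]  # assumption: larger number = higher priority
        task = queue.popleft()
        queue.append(task)  # rotate: the task goes to the back of its priority level
        return task


rq = ReadyQueue()
rq.add("logger", priority=1)
rq.add("sensor_a", priority=5)
rq.add("sensor_b", priority=5)
order = [rq.next_task() for _ in range(4)]
print(order)  # ['sensor_a', 'sensor_b', 'sensor_a', 'sensor_b']
```

The two tasks sharing priority 5 alternate, while the lower-priority task only runs once they have been removed from the queue.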

As in other operating systems, memory allocation appears to follow a standard process, keeping track of used and free memory chunks via a heap implemented as a doubly linked list, which is the data structure used across the RTMOS to store unlimited numbers of objects. Tasks can dynamically allocate and free memory as needed. The RTMOS also manages events and interaction with devices on behalf of user applications by abstracting the underlying hardware via a Device object (see figure 1). Byte-BOS also provides inter-task communication mechanisms, such as messages and semaphores, for synchronization amongst tasks. In terms of data structures, the doubly linked list and buffer objects appear to be the main structures used across the RTMOS. Memory chunks, tasks and most of the other objects are all managed via the List object. The system also supports a queue structure used to manage tasks. A software stack is also provided and used by tasks to run.

Security-wise, Byte-BOS does not have any authentication or specific security methods for access to memory or special “kernel” functions; all tasks have the same level of authority as the RTMOS itself. Basically, Byte-BOS is completely flat.

Architecture

The RTMOS is written in C, or in a specific dialect of C depending on the targeted microprocessor. The 10 main components of the operating system are represented in figure 1 and provide an excellent overview of the functions it performs. A simulated “this” pointer to the various “struct”s is made available to the developer to give an Object-Oriented Programming (OOP) feel. This section briefly describes the main components of Byte-BOS, along with some of the features marketed by the original developer.

Figure 1 – UML Representation of the different objects defined in the Byte-BOS RTMOS

The Task Object

The main object of the RTMOS is the Task object, which is very similar to the equivalent concept in modern software development. Just as with any C++ object, a constructor is called prior to executing the body of the task, which is defined by the developer to conduct a specific job. The data of the task is allocated and accessible via the this_data pointer. The constructor also allocates sub-objects, opens required devices, initializes variables and synchronizes state machines. Once constructed, the task runs its main function until it returns or is removed. Access to a task is done via the this_task pointer. Both pointers are always in scope and usable within the task. Once the main function completes, the destructor of the structure is called, which frees the memory and pointers of sub-objects. It also terminates any internal state machine.
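The life cycle described above (constructor, main body, destructor) can be modelled with a short Python sketch. This is only an illustration of the sequence: the real RTMOS is written in C, and the `Task` class, the callback signatures and the dictionary standing in for the this_data pointer are assumptions, not the Byte-BOS API.

```python
class Task:
    """Hypothetical model of a task: constructor, main body, destructor."""

    def __init__(self, name, construct, body, destruct):
        self.name = name
        self._construct = construct
        self._body = body
        self._destruct = destruct

    def run(self):
        this_data = {}                       # stands in for the this_data pointer
        self._construct(self, this_data)     # allocate sub-objects, open devices
        try:
            self._body(self, this_data)      # developer-defined work
        finally:
            self._destruct(self, this_data)  # always release what was allocated


log = []


def construct(this_task, this_data):
    this_data["samples"] = []
    log.append(f"{this_task.name}: constructed")


def body(this_task, this_data):
    this_data["samples"].extend([1, 2, 3])
    log.append(f"{this_task.name}: ran with {len(this_data['samples'])} samples")


def destruct(this_task, this_data):
    this_data.clear()
    log.append(f"{this_task.name}: destroyed")


Task("sampler", construct, body, destruct).run()
print(log)
```

Passing the task and its data explicitly to each callback mirrors the idea of the simulated “this” pointers being in scope throughout the task.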

The Memory Object

This object manages the memory heap and stores the pointer to the heap, the block size and the number of memory blocks. It also exposes various functions for memory management, mainly alloc_fmem and free_fmem. The heap is implemented using a doubly linked list in which each node contains metadata about a memory block. While details are not available, it can be assumed that the size of the block, its status (free or allocated) and pointers to the neighbouring blocks are included.
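Since the internals are undocumented, the following Python sketch only illustrates how a heap tracked by a doubly linked list of block descriptors could behave. The `Block` and `Heap` classes, the first-fit policy, and the splitting and merging of blocks are assumptions; they are not taken from Byte-BOS, and the actual signatures of alloc_fmem and free_fmem are unknown.

```python
class Block:
    """One node of the hypothetical heap list: metadata only, no real storage."""

    def __init__(self, offset, size, free=True):
        self.offset = offset
        self.size = size
        self.free = free
        self.prev = None
        self.next = None


class Heap:
    def __init__(self, size):
        self.head = Block(0, size)  # one large free block to start with

    def alloc(self, size):
        """First-fit allocation; returns the block offset, or None if no block fits."""
        node = self.head
        while node is not None:
            if node.free and node.size >= size:
                if node.size > size:  # split off the unused tail as a new free block
                    rest = Block(node.offset + size, node.size - size)
                    rest.prev, rest.next = node, node.next
                    if node.next is not None:
                        node.next.prev = rest
                    node.next = rest
                    node.size = size
                node.free = False
                return node.offset
            node = node.next
        return None

    def free(self, offset):
        """Mark a block as free and merge it with free neighbours."""
        node = self.head
        while node is not None and node.offset != offset:
            node = node.next
        if node is None or node.free:
            raise ValueError("not an allocated block")
        node.free = True
        if node.next is not None and node.next.free:
            self._absorb_next(node)
        if node.prev is not None and node.prev.free:
            self._absorb_next(node.prev)

    def _absorb_next(self, node):
        nxt = node.next
        node.size += nxt.size
        node.next = nxt.next
        if nxt.next is not None:
            nxt.next.prev = node

    def layout(self):
        """Return (offset, size, free) for every block, in address order."""
        out, node = [], self.head
        while node is not None:
            out.append((node.offset, node.size, node.free))
            node = node.next
        return out


heap = Heap(64)
a = heap.alloc(16)
b = heap.alloc(8)
print(heap.layout())  # [(0, 16, False), (16, 8, False), (24, 40, True)]
heap.free(a)
heap.free(b)
print(heap.layout())  # [(0, 64, True)]
```

The backward pointer is what makes merging with the previous block cheap, which is one plausible reason for choosing a doubly linked list over a singly linked one.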

The Semaphore Object

As its name implies, the Semaphore is used for signaling and control between multiple tasks by either being in the Up or Down status. A timeout can be defined as needed.

The MessageBox Object

The MessageBox acts as a central repository for tasks to exchange data. Tasks and ISRs can create messages to initiate other tasks or exchange data. The MessageBox basically provides two functions: put and get. When a new message is created, memory is allocated. Similarly, the memory is freed when the message is retrieved from the box.
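As a rough illustration of the put/get behaviour, here is a minimal Python sketch. The `MessageBox` class below is hypothetical: the first-in, first-out ordering, the `pending` helper and the copy made on `put` (standing in for the memory allocation) are assumptions, not documented Byte-BOS behaviour.

```python
from collections import deque


class MessageBox:
    """Hypothetical mailbox shared between tasks (FIFO order is an assumption)."""

    def __init__(self):
        self._messages = deque()

    def put(self, payload):
        # Copying the payload stands in for allocating memory for the message.
        self._messages.append(bytes(payload))

    def get(self):
        """Return the oldest message, or None if the box is empty."""
        if not self._messages:
            return None
        # Once popped, the box no longer holds the message (memory "freed").
        return self._messages.popleft()

    def pending(self):
        return len(self._messages)


box = MessageBox()
box.put(b"start-pump")
box.put(b"stop-pump")
first = box.get()
print(first, box.pending())  # b'start-pump' 1
```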

The Timer Object

Similar to any standard timer in other systems, this object is used to create timeouts and to schedule tasks and events.

The Event Object

Events are created by tasks and ISRs to schedule other tasks.

The Device Object

This item abstracts interaction between the user application and the hardware device. It does so by pointing to the memory blocks used by the device and managing input/output to the area.

The List, Queue and Buffer Objects

These objects, as their names imply, are implementations of the linked list, queue and buffer data structures.

Features

The Byte-BOS website, as it stood in 2000, advertised the following additional features, which provide some extra insight into the internals of the system.

Interrupt Handling

ISRs can make operating system calls and schedule tasks by including the ENTER_ISR and EXIT_ISR macros at the beginning and end of the routine. Most services are available to ISRs, giving them considerable access to the entire OS.

Critical Section Handling

Byte-BOS provides critical section handling by protecting non-reentrant code sections from both interrupt and task pre-emption. Protection of critical sections appears to be done via the following mechanisms:

  • Disabling interrupts
  • Locking and unlocking code sections
  • Prioritization of tasks

Task Execution Trace

Tasks are provided with a trace buffer configurable at both compile time and run time. The trace buffer contains information about the sequence of calls, the parameters passed, and the return value of the called function. The tracing option is extremely useful for debugging purposes.

System Requirements

The Byte-BOS RTMOS requires very little in terms of resources. While the requirements vary depending on the underlying processor, the size of the kernel is a few kilobytes, while the RAM requirements are less than 100 bytes of globally accessible memory and less than 100 bytes per task and ISR. The detailed figures are listed in table 2.

Table 2 – Memory requirements for the Byte-BOS RTMOS according to microprocessor
Microprocessor | Minimum Kernel Size (KB) | Maximum Kernel Size (KB) | RAM Requirement for Global Variables (bytes) | RAM Requirement per Task (bytes)
Intel 80×86* | 4 | 10 | 28 | 30
Intel 80×86 (32bit)* | 6 | 15 | 48 | 58
Intel 80188/86* | 2 | 10 | 28 | 30
Intel 8096 | 2 | 8 | 20 | 20
Motorola 68000 | 1.5 | 12 | 50 | 70
Motorola 68HC11 | 1.2 | 8 | 28 | 20
Motorola 68HC16 | 2 | 8 | 28 | 20
Mitsubishi M37700 | 2 | 10 | 28 | 20
Mitsubishi M16C | 2 | 12 | 64 | 48
Mitsubishi M32D | N/A | N/A | N/A | N/A
Hitachi H8300 | 1.2 | 8 | 28 | 20
Hitachi H8300H | 2 | 10 | 38 | 40
Hitachi SHX | 7.8 | 19.5 | 72 | 60
Texas Instruments C2x/C5x DSP | 1.2 | 8 | 28 | 20
Texas Instruments C3x/C4x DSP | 2.3 | 19.5 | 28 words | 20 words
ARM/THUMB | 4 | 15 | 50 | 70

* For Intel-based architectures, the memory footprint reported above is based on usage of the Borland C++ compiler.

Additional volatile memory is required for objects created dynamically. As such, the numbers provided above do not represent the total amount of memory consumed by Byte-BOS.

Compilers Supported

In general, Byte-BOS supports the development tools and compilers provided by the semiconductor manufacturer. For Intel-based processors, the Borland C/C++ compiler was supported. The complete list of supported compilers is provided in table 3.

Table 3 – Compilers supported by the Byte-BOS RTMOS.
Microprocessor | Supported Compilers
Intel 80×86 | Microsoft C/C++ (large model), Borland C/C++, Watcom C/C++
Intel 80×86 (32bit) | Watcom, Metaware and most C compilers
Intel 80188/86 | Microsoft C/C++ (large model), Borland C/C++
Intel 8096 | IAR, BSO Tasking Compiler
Motorola 68000 | Cross-Code, Intermetrics, Introl, Microtec
Motorola 68HC11 | IAR, Introl, Cosmic
Motorola 68HC16 | Introl, Cosmic
Mitsubishi M37700 | IAR, Microtec
Mitsubishi M16C | IAR, Mitsubishi Compiler
Mitsubishi M32D | N/A
Hitachi H8300 | IAR, Microtec
Hitachi H8300H | IAR
Hitachi SHX | GNU, Green Hills
Texas Instruments C2x/C5x DSP | Code Composer / Texas Instruments Compiler
Texas Instruments C3x/C4x DSP | Code Composer / Texas Instruments Compiler
ARM/THUMB | ARM

Identification

Devices using the Byte-BOS RTMOS can be identified by looking at ASCII and Unicode strings in their firmware. In almost all cases, if not all, the firmware will contain a copyright notice identifying it as well as the target microcontroller it was compiled for. For example, by extracting the strings from the firmware for the Seagate Cheetah 10K.6 Disc Drive, either using the simple strings.exe (or strings in Linux) or IDA, the copyright notice string “Byte-BOS 8096 Multitasking Operating System — Copyright 1990 Numerical Services Inc.” can be observed (see figure 2).

Figure 2 – Byte-BOS Copyright Notice in the Firmware of the Seagate 10K.6 Disc Drive
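For readers without the strings utility or IDA at hand, the same check can be approximated with a few lines of Python. This is a minimal sketch: it only looks for runs of printable ASCII (not UTF-16 strings), the function names are made up for this example, and the sample bytes are a stand-in rather than real firmware.

```python
import re


def ascii_strings(data, min_len=4):
    """Yield (offset, text) for each run of printable ASCII of at least min_len bytes."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    for match in pattern.finditer(data):
        yield match.start(), match.group().decode("ascii")


def find_bytebos(data):
    """Return the extracted strings that mention Byte-BOS, with their offsets."""
    return [(off, s) for off, s in ascii_strings(data) if "byte-bos" in s.lower()]


# Stand-in blob; a real check would read the firmware image from disk instead.
sample = (
    b"\x00\x01\xff"
    + b"Byte-BOS 8096 Multitasking Operating System"
    + b"\x00\x13abc\x00"
)
hits = find_bytebos(sample)
print(hits)  # [(3, 'Byte-BOS 8096 Multitasking Operating System')]
```

To scan an actual image, replace `sample` with the contents of the file, for example `open("firmware.bin", "rb").read()`, where the file name is hypothetical.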

While appliances designed after 2004 will likely use more modern operating systems such as Linux, Byte-BOS can still be found on legacy devices. One such example is the Baxter IPump pain management medical device, whose service manual refers to its usage [2]. Of note, the manual refers to the detection of stack overflows within the Byte-BOS RTMOS. Industrial controls, aircraft, telecommunications systems and other systems designed prior to the 2000s may still harbour this little-known RTMOS.

Conclusion

The Byte-BOS RTMOS will likely disappear as legacy embedded devices are replaced by newer systems with more processing power and extended memory. Until then, for developers of these systems, or simple hobbyists, information remains scarce and limited. While we provided a brief overview of its architecture and features, details about creating user applications remain obscured by the lack of online information such as an API description, code examples or, more importantly, the source code. Such data would greatly ease development and engineering efforts on legacy systems.

References

[1] “Bytebos Home Page.” Bytebos Home Page. Internet Archive, 20 Oct. 2000. Web. 01 Nov. 2015. <https://web.archive.org/web/19990208011926/http://www.bytebos.com/>.

[2] Baxter IPump Pain Management System – Service Manual, Baxter Healthcare Corporation, Chapter 4 – Troubleshooting, p.4-8. 2007. Web. 01 Nov. 2015. <Link>

The Syrian Civil Conflict in the Cyber Environment

Introduction

This is an article I wrote a while ago and never got published. It’s a bit outdated now, but I still think it can be useful for historical purposes, so I’ll post a link to it below.

Abstract

This document analyzes the use of the cyber environment in the Syrian civil war by both the population and the government in order to characterize online tactics and strategies developed and used by each belligerent. This overview allows for generalization of online behavior by hacktivists and nation-state sponsored actors on communication networks in the region, which will continue to see online attacks from various parties in the foreseeable future during similar conflicts. In Syria, because of poor infrastructure, a low rate of Internet penetration and early adoption of control mechanisms by the current government, the authorities had dominance over their information environment early in the conflict, enabling rapid gathering of intelligence on dissidents. While social media was leveraged by the population for coordination, as in many other uprisings, it was also the theater of multiple offensive cyber operations by internal and external groups, mostly for information operations purposes. Despite the high level of activity, none appeared to have a definitive impact on the ground. While events recorded in this space have not reached the level of intensity of other conflicts, they provide a useful model for similar conflicts in the Middle East region.

Reference:

Racicot, Jonathan, The Syrian Civil Conflict in the Cyber Environment, https://www.academia.edu/15182402/The_Syrian_Civil_Conflict_in_the_Cyber_Environment, last accessed 2015-09-03

(Bad) Amazon Phishing Email

Introduction

Fortunately, my wife is a smart cookie and always suspicious of weird-looking emails. Maybe it's due to the fact she lives with a paranoid guy. In any case, she caught this phishing email, which appears to be from Amazon and leads to a fake login page.

Contents

The phishing email comes from “amazon@iservice.co.org.il” with the terribly spelled subject “your accounnt information need to be updated”, and the content is a screenshot of an authentic Amazon email, thus bypassing filters. However, the attacker succeeded in misspelling the only field he had to fill.

A fake Amazon account confirmation received which contains a single image.

Clicking anywhere on the image will redirect the target to “http://bestofferz.biz/service/support/wp-admin/support/support/”, which hosts a fake login page as shown below:

Fake Amazon Login Page
The attacker is hosting a fake Amazon login page on HostGator

Looking under the hood, we can see that the entire page is actually a single JavaScript function call that decrypts a long Base64-encoded string.

The encryption key used is stored in the hea2p variable and the HTML code. The entire code can be analyzed here, using the AES JavaScript code here. If the target enters his email and password, he will then be forwarded to a fake account creation page asking for his address.

Fake Amazon account creation page.

And of course, it will then ask you for your credit card information, which is possibly the end goal of the phisher.

Fake Credit Card Information Request Page

All the pages are encrypted using the same key. Only after entering this information does the target get redirected to the real Amazon website.

Successful Phishing Operation Page

Conclusion

Remember to always check the URL and the sender's email address!
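The same check can be scripted. The following Python sketch compares the host of a link and the domain of a sender address against an expected domain; the helper names are made up for this example, and using “amazon.com” as the expected domain is an assumption (regional Amazon domains would need to be added).

```python
from urllib.parse import urlsplit


def host_matches(url, expected_domain):
    """True if the URL's host is the expected domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    expected = expected_domain.lower()
    return host == expected or host.endswith("." + expected)


def sender_matches(address, expected_domain):
    """True if the sender's domain is the expected domain or one of its subdomains."""
    domain = address.rsplit("@", 1)[-1].lower()
    expected = expected_domain.lower()
    return domain == expected or domain.endswith("." + expected)


link = "http://bestofferz.biz/service/support/wp-admin/support/support/"
sender = "amazon@iservice.co.org.il"
print(host_matches(link, "amazon.com"))      # False
print(sender_matches(sender, "amazon.com"))  # False
```

Matching on the end of the host name, with a leading dot, avoids being fooled by look-alike hosts that merely start with the expected domain.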

Repost: Stack-based Buffer Overflow Vulnerabilities in Embedded Systems

I have not written or contributed to the enclosed research paper. I’m simply reposting it here because it’s interesting and, for some reason, appears to be available only via Google cache. So before it disappears from the results, I’m reposting it here.

This paper discusses a technique to conduct buffer overflows on processors using the Harvard architecture. On the devices studied, the stack starts at the beginning of the memory and grows upward, whereas on typical von Neumann desktop and server machines it grows downward.

Abstract:

Most small embedded devices are built on Harvard class microprocessor architectures that are tasked with controlling physical events and, subsequently, critical infrastructures. The Harvard architecture separates data and program memory into independent address spaces, as opposed to the von Neumann architecture that uses a unified memory system with a single address space for both data and program code. The buffer overflow attack vector is well documented in desktop and server class machines using the von Neumann memory model, but little has been published on buffer overflow vulnerabilities in Harvard architectures. In this paper we show that stack-based buffer overflow vulnerabilities exist in embedded control devices based on the Harvard class architecture. We demonstrate how a reversal of stack growth direction can greatly simplify the attack and allow for easier access to critical execution controls. We also examine popular defense techniques employed in server and desktop environments and the applicability of those defenses toward Harvard class machines.
Link: Kristopher Watts & Paul Oman, University of Idaho, Stack-based Buffer Overflow Vulnerabilities in Embedded Systems

The Variable Message Format (VMF) Protocol – A Data Protocol for Radios (Part 1)

Introduction

Software-Defined Radio (SDR) is a fast-growing market, expanding into a wide array of industries. The sector alone is expected to reach USD 27.29 billion by 2020 [1]. Considering that most smartphones are equipped with SDRs, one can quickly understand why research in the field is so active. SDRs are also in high usage across militaries and law enforcement, given the added flexibility to conduct multiple types of operations using the same hardware, often with reduced maintenance costs. Nowadays, military tactical radios and law enforcement equipment are all software-defined to some degree. Within the industry, 93% of all mobile systems leverage SDR for wireless communications [2]. The added agility introduced by software also comes with additional threats, previously absent from hardware-based radios. Within the military, SDRs form the cornerstone of network-centric warfare, as they are used to establish networks between different units in order to exchange data such as position, imagery and target information rapidly between ground, maritime, air and space elements. Radios not only process voice; the internal software also manages various networks and data protocols, further increasing the complexity of the application layer of the communication device. The VMF protocol is one such SDR data protocol, created to exchange information between multiple different systems by providing a rich and flexible specification. Yet the VMF is little known even amongst avionics engineers, and its knowledge remains the speciality of a few. As such, in this post we explore this protocol in detail to better understand its inner workings and its uses.

Software-Defined Radios

An overview of the architecture of the SDR is required to better appreciate the threats against it and to eventually understand some of the unique characteristics to consider when looking for vulnerabilities in its software. We do so by describing the general software model used by SDRs, which has multiple layers, each of which uses different programs to process waveforms and the data contained in them. The Wireless Innovation Forum (WIF) simply defines an SDR as a radio in which some or all of the physical layer functions are software defined, in contrast with previous radios in which all configurable properties were designed and hardcoded into the hardware. Within the definition of the WIF, the physical layers refer to the four lowest layers of the OSI model. This innovation added agility to the functionalities of radios, as the integrated general purpose processors (GPP) or Field Programmable Gate Array (FPGA) units and the flexible radio-frequency (RF) modulators, all reconfigurable through software, permit the device to operate on a wider range of the spectrum depending on the current operations. These new components are all software controlled, allowing for fast reconfiguration as needed by the operator (see figure 1).

Figure 1. Simplified schema of the major hardware and software components of modern SDRs.

Within the SDR, the Software Communications Architecture (SCA) is the open framework generally used to specify how a radio designer should integrate hardware and software components in order to interact efficiently and maximize software reuse [3]. The SCA was developed by the U.S. military as part of the now defunct Joint Tactical Radio System (JTRS). Despite its military roots, the SCA is used in the industrial sector as well. The framework divides the radio into 4 layers, each composed of multiple components. These layers sit below the user applications (figure 2) and are responsible for processing the inbound and outbound data:

Figure 2. The Software Communications Architecture (SCA) is an open architecture framework abstracting interactions between hardware and software within a software defined radio.

Programmable Radio Hardware: this layer regroups the software-reconfigurable elements of the radio, such as the RF modulation units, the modems and the link interfaces.

Operating System and Middleware: the operating system, along with the Common Object Request Broker Architecture (CORBA) middleware, plays the critical role of managing the communications between the software of the core framework and the hardware. CORBA is a message passing technique that is widely used for cross-platform frameworks. All core interfaces are defined using the Interface Definition Language (IDL), which can be compiled into different languages such as C/C++ and Ada. The operating systems included in SDRs are Real-Time Operating Systems (RTOS) such as uOS, TinyOS or VxWorks. These provide multithread support and are required to be POSIX compliant. The RTOS also interfaces with the underlying network interfaces and serial ports.

Core Framework: the core framework is the abstraction layer between the software developers and the underlying hardware. It contains the interfaces and services required for the applications to use the devices of the radio, which are described using the eXtensible Markup Language (XML).

Waveforms: the software at this layer specifies the parameters required to form the waveform needed to communicate with networked devices.

By reviewing the framework above, one quickly realizes that the world of radios is now mainly driven by software development and is thus exposed to the same issues that plague programs in the wider computing world. “Operating system”, “drivers”, “middleware” and “APIs” are terms well known to cyber operators and computer security analysts alike, which makes radios a target in the same way as any other host on any other network.

Tactical Radios

Software radios have significant advantages for the military, as they provide a wide variety of dynamic radio protocols in real time. Since SDRs can be reprogrammed remotely for multiple purposes without any hardware changes, maintenance costs and time have decreased. Because of these characteristics, SDRs are one of the cornerstones of network-centric warfare, the currently predominant Western military doctrine. It is therefore not surprising that all tactical military radios in modern Western militaries are SDRs, as are most radios used in civil aviation for law enforcement and potentially those on drones.

Modern military tactical radios (MTRs) are more akin to routers than radios and enable the creation of instant networks and the exchange of operational data between the connected nodes over Tactical Data Links (TDL) such as Link-16. Data protocols such as VMF and the Joint Range Extension Applications Protocol (JREAP) enable the exchange of voice, data and imagery by connecting to a network of various platforms and units, both vertically and horizontally. Amongst the data exchanged are positions, trajectories, maps, navigation data, mission data and intelligence. This implies the presence of user applications managing the data protocol, and additional applications processing the data received. Other data link protocols include the Type 483D, the Chinese equivalent to Link 4C. This transfer of information between platforms is central to the network-centric warfare doctrine, which drives the research and development of network radios and increases the complexity of the software and network protocols.

In many aspects, the radio can be considered the gateway device between the internal networks of a platform and external nodes requesting to connect to it. Internally, within aircraft and ships, many MTRs provide multiple interfaces to communicate with other elements of the platform, such as mission computers, display terminals and external software used, for example, to display geoposition data. On a military aircraft, the interfaces of the radio usually include the MIL-STD-1553 bus (or its civilian counterpart, the ARINC 429) to link with the avionics, as well as Ethernet and serial ports such as RS-232 or RS-485, often used by Remote Control Units (RCUs) for management purposes via protocols such as the Simple Network Management Protocol (SNMP).

Externally, the radio networks the platform with external units using different systems by forming Mobile Ad hoc Networks (MANETs), and exchanges tactical data using data protocols over the tactical data links. These networks are wireless, infrastructureless, multi-hop and highly fluid. In military usage these are typically low-bandwidth links and are often managed by devices with little computational power. The Ad hoc On-Demand Distance Vector (AODV) protocol is typically the underlying routing protocol, which manages routing in a peer-to-peer network. Routing information is updated constantly, as nodes within a MANET can also act as proxies for far-away nodes.

The Variable Message Format Protocol

The VMF standard was developed by the U.S. Department of Defense to allow messages of variable lengths to be sent over TDLs. A VMF message is bit-oriented and attempts to minimize the load on TDLs by sending only the required data. The objective is to be flexible enough to communicate with any legacy or new host, so additional header fields can be added without modifying the underlying specification. The core of the VMF protocol is the Protocol Data Unit (PDU), which is processed at the application layer and is composed of the application header (figure 3) and the user data, much like a typical TCP/IP packet. The user data can take multiple formats, as we will see later on. The size of the application header is always a multiple of 8 bits and is padded with null bits as needed.

Figure 3. Schema of the Application Header of a VMF message, which is composed of fields and groups.

Structure of the Application Header

The first 4 bits of the header always specify the version of the VMF protocol in use by the PDU. As of 2015, 5 revisions of the protocol have been created (table 1):

Code (binary) | Revision
0000 | MIL-STD-2045-47001A
0001 | MIL-STD-2045-47001B
0010 | MIL-STD-2045-47001C
0011 | MIL-STD-2045-47001D
0100 | MIL-STD-2045-47001D w/ CHANGE1
0101-1110 | Undefined
1111 | Version Sent Not Implemented
Table 1. Version numbers currently defined in revision D w/ CHANGE1

The “Version Sent Not Implemented” value is used to specify that the current implementation of the VMF protocol in the system is not backward compatible with the incoming VMF messages. For example, a system using revision D of the protocol that sends a message to another system using revision C or earlier will receive an answer with the version field set to “15”, specifying that the source system must use an earlier revision (figure 4). If the version is not implemented but the destination system is still able to process the message, it will process it nonetheless.

Figure 4. When a system using a previous revision of VMF receives a message from a later revision, it returns a message with the version field set to the “Version Sent Not Implemented” value.
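
As a rough illustration of table 1, the following Python sketch maps the 4-bit version code to its revision. It assumes the header is available as a string of '0' and '1' characters written in the same order as the codes in the table; the on-the-wire bit ordering is defined by the specification and is not handled here. The names `VMF_VERSIONS` and `decode_version` are hypothetical.

```python
# Codes and revisions copied from table 1.
VMF_VERSIONS = {
    0b0000: "MIL-STD-2045-47001A",
    0b0001: "MIL-STD-2045-47001B",
    0b0010: "MIL-STD-2045-47001C",
    0b0011: "MIL-STD-2045-47001D",
    0b0100: "MIL-STD-2045-47001D w/ CHANGE1",
    0b1111: "Version Sent Not Implemented",
}


def decode_version(header_bits: str) -> str:
    """Return the revision named by the first 4 bits of the header."""
    if len(header_bits) < 4:
        raise ValueError("the version field needs at least 4 bits")
    code = int(header_bits[:4], 2)
    # Codes 0101 to 1110 are undefined in table 1.
    return VMF_VERSIONS.get(code, "Undefined")


print(decode_version("0011" + "0"))  # revision D
print(decode_version("1111"))        # reply from a system on an older revision
```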

Fields and Groups

Elements following the version number are either “fields” or “groups” as shown in figure 3. A field can generally be thought of as a single piece of data along with an overhead of 1 or 2 bits: a Field Presence Indicator (FPI) and a Field Recurrence Indicator (FRI). The FPI is a flag which indicates whether a value for the field has been specified. If the FPI is set to zero, the following bit is the start of a new field or group. If it is set to one, the following bits represent the value of the field. Some fields occur more than once. These fields include the FRI flag, which comes right after the FPI. When the FRI is set to one, the field is repeated right after, until an occurrence with the FRI set to zero is reached, as illustrated in figure 5. Note that the FPI is included only once, in the first occurrence of the field.

Figure 5. Example of the Field Recurrence Indicator (FRI) within a VMF message used for the Release Marking field.
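
The FPI and FRI logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: bits are given as a string of '0' and '1' characters, the field width is passed in by the caller (real widths come from the specification), and the layout of a recurring field is taken to be FPI, then FRI and value for each occurrence. The function name `read_field` and the 4-bit width are hypothetical.

```python
from typing import List, Tuple


def read_field(bits: str, pos: int, width: int,
               recurring: bool = False) -> Tuple[List[int], int]:
    """Read one field starting at bits[pos].

    Returns the list of values found (empty if the FPI is 0) and the
    position of the first bit after the field.
    """
    fpi = bits[pos]
    pos += 1
    values: List[int] = []
    if fpi == "0":
        # Field absent: the next bit already belongs to the next element.
        return values, pos
    while True:
        fri = "0"
        if recurring:
            fri = bits[pos]  # 1 = another occurrence follows this one
            pos += 1
        values.append(int(bits[pos:pos + width], 2))
        pos += width
        if fri == "0":
            return values, pos


# Hypothetical recurring 4-bit field with three occurrences (1, 2, 3),
# followed by an absent field (FPI = 0).
bits = "1" + "1" + "0001" + "1" + "0010" + "0" + "0011" + "0"
values, pos = read_field(bits, 0, 4, recurring=True)
print(values, pos)
print(read_field(bits, pos, 4))
```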

Groups, as their name suggests, are sets of related fields or sub-groups. Groups can be nested down to 6 levels and possess the same presence and recurrence flags as fields, which are known as the Group Presence Indicator (GPI) and Group Recurrence Indicator (GRI). The order of fields and groups in the header is fixed, much like in any TCP/IP packet.

While most of the fields contain numeric values, some are 7-bit ASCII coded strings, where the value 127 (the DELETE character) is used as the string terminator, i.e. the equivalent of the null byte ‘\0’ in C/C++ character arrays. The terminator is added to the string only if its bit-length is smaller than the maximum length of the field. For example, the “unitname” field is 448 bits long, i.e. 64 characters of 7 bits. If the unit name contains fewer than 64 characters, the terminator will be added to the value. A complete list of groups and fields, along with their descriptions and constraints, can be found in the specification (and may be discussed in future parts of this article).
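
The string encoding can be sketched as follows in Python. The sketch assumes each character is written as 7 bits, most significant bit first, into a string of '0' and '1' characters; the actual bit ordering is defined by the specification. The function names are hypothetical.

```python
from typing import Tuple

DEL = 127  # terminator used instead of the C null byte


def encode_text(text: str, max_bits: int) -> str:
    """Encode text as 7-bit codes, adding DEL only if the field is not full."""
    codes = [ord(c) for c in text]
    if any(code >= DEL for code in codes):
        raise ValueError("only 7-bit ASCII below DEL can be encoded")
    if len(codes) * 7 > max_bits:
        raise ValueError("text is longer than the field")
    if len(codes) * 7 < max_bits:
        codes.append(DEL)
    return "".join(format(code, "07b") for code in codes)


def decode_text(bits: str, max_bits: int) -> Tuple[str, int]:
    """Decode until DEL or until max_bits are consumed; return text and bits read."""
    chars = []
    pos = 0
    while pos + 7 <= max_bits and pos + 7 <= len(bits):
        code = int(bits[pos:pos + 7], 2)
        pos += 7
        if code == DEL:
            break
        chars.append(chr(code))
    return "".join(chars), pos


encoded = encode_text("UNIT1", 448)  # 448 bits = 64 characters
print(len(encoded))                  # 5 characters plus the terminator
print(decode_text(encoded, 448))
```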

Assembly of a VMF Message

The creation of a VMF message is quite similar to the creation of an HTTP request or a TCP/IP packet: it is done via encapsulation of user data through the multiple layers of the SDR. User data is first entered into the radio, usually by the pilot or crew via a Remote Control Unit (RCU), or by a subsystem via the internal MIL-STD-1553 bus or ARINC 629 data bus. The radio receives the MIL-STD-1553 message from the bus controller and extracts its data. It then determines to which units on its MANET the data should be sent. Note that data can be broadcast to all units on the network via a broadcast address. The default broadcast address is 16777215. Within the SDR, the data is then transferred to the VMF messaging service of the device. Based on its configuration and the properties of the data provided, the VMF layer constructs the proper application header and appends the user data to form the application PDU. The PDU is then sent to the lower layers, where the digital data is converted into analog radio waveforms to be sent via the antenna of the aircraft. The receiving unit then performs the reverse process and dispatches the received data to the internal subsystems of the destination aircraft. The process is illustrated in figure 6.

Figure 6. Encapsulation process of user data into a VMF message.

User Data

The type of user data that is contained within the application PDU is defined by the User Message Format (UMF) field. This field indicates the format of the message contained in the user data field and is associated with the Functional Area Designator (FAD), the Message Number, the Message Subtype, CANTCO reason and CANTPRO reason fields. More about these fields in part 2 of this article.

Binary Files

VMF can be used to transfer files between systems. This is done by setting the UMF field to the value 1 and using the “Filename” field to specify the name of the file. Furthermore, to indicate that the message is a file transfer, the GPI of the VMF Message Identification Group needs to be set to zero.

Redistributed Messages

When the UMF field is set to 0100 (4), the content of the user data is another VMF message, much like a forwarded email. Both the application header and user data sections of the redistributed message are included in the user data portion of the message forwarding it. The “Operation Indicator”, “Security Classification” and “Release Marking” fields are required to be similar in both messages. Both messages are to be processed by the receiving systems.

Addressing

Nodes (hosts) on the network can be identified using one of the following fields: the URN or the unit name. Normally, both cannot be used at the same time. The URN is, as its name implies, a number given to a specific unit to identify it. It acts as a MAC address would. Each URN is unique and only one unit bears a given number. Distribution of these URNs is handled by the U.S. Department of Defense. The URN is the addressing scheme at the application layer. VMF can be carried via TCP/IP (or UDP), MIL-STD-1553 or any other lower layer. Note that other addressing schemes can be used, but these will be explored later on.

Conclusion

This concludes part 1 of the introduction to the VMF protocol. In part 2, we will further explore the fields included in the latest revision of the protocol and include actual examples of VMF messages. In the meantime, you’re welcome to play with Vmfcat, which is a Python script I’m working on (still incomplete) to generate VMF messages from the command line or via an interactive shell.

CTF Code : Javascript Code to Quickly Process Data on a Webpage

Introduction

In quite a few Capture the Flag (CTF) exercises, I’ve seen this coding challenge, which asks the participant to process some information on a webpage in less than X seconds, a delay short enough that no manual processing can be done. One way to do so is using JavaScript. In one case, the participant is asked to crack a SHA1-hashed word.

Method

By first manually searching for the hash using a search engine, the plaintext will be found, and after doing it 2 or 3 times for different hashes, a pattern will emerge in the random plaintext. In this case, random integers between 1000 and 9999 are generated. As such, to solve the challenge, one needs to hash all integers from 1000 to 9999 and compare each digest with the one on the page. Once we have a match, we use the plaintext with a provided URL to get the flag. So in summary:

    1. Figure out the plaintext used to generate the hash. Use Google.
    2. Find a pattern in the plaintext, or boundaries [x,y] in the values used.
    3. Extract the SHA1 from the HTML
    4. Generate a SHA1 hash for each possible value in [x,y]
    5. Compare the 2 hashes. If they match, go to the flag retrieval page.
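
Steps 4 and 5 can be sketched in Python with the standard hashlib module. Note that this swaps in Python for the JavaScript used in the CTF, and that the function name `crack_sha1` and the sample value are hypothetical; extracting the hash from the page and submitting the answer are left out.

```python
import hashlib
from typing import Optional


def crack_sha1(target_hex: str, low: int = 1000, high: int = 9999) -> Optional[int]:
    """Return the integer in [low, high] whose SHA1 hex digest matches target_hex."""
    target = target_hex.strip().lower()
    for candidate in range(low, high + 1):
        digest = hashlib.sha1(str(candidate).encode("ascii")).hexdigest()
        if digest == target:
            return candidate
    return None  # the plaintext is outside the assumed boundaries


# Stand-in for the hash that would be extracted from the challenge page.
challenge = hashlib.sha1(b"4242").hexdigest()
print(crack_sha1(challenge))  # prints 4242
```

With only 9000 candidates, the loop finishes well within any time limit a challenge of this kind is likely to impose.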

Code Explanation

Below is the code used for the particular CTF. I used JavaScript to avoid having to manage cookies and session state. I also used CryptoJS for SHA1 hashing.

Then we iterate all integers between 1000 and 9999. For each, we calculate the SHA1 digest using the Crypto.SHA1() function. Once we find a match, we relocate to the flag retrieval URL.

Full Code

Conclusion

This code can be used with GreaseMonkey or simply by using the developer tools included in Google Chrome or Mozilla Firefox.

#TheGreatFTPHunt – 2% to 9% of files scanned potentially containing confidential information

Introduction

In this post, we continue our data collection and evaluation of files stored on removable media publicly accessible from the Internet. The collection of filenames from 6,500 hosts is ongoing, therefore we’re going to focus on evaluating the sensitivity of a file based only on its filename. Based on the current results, 2 to 9% of the 3000 files reviewed were sensitive or potentially sensitive. Most of the sensitive files are concentrated on a few hosts. These files often include financial information or project data from businesses. So far, 773 hosts containing around 4.5 million files have been scanned.

Discussion

The number of filenames collected is quite large and we cannot manually evaluate each filename for its probable sensitivity. As such, we need to devise a procedure to automatically assess its sensitivity. We first list some definitions and restrictions to clarify what a sensitive file is and the limitations of our evaluation criteria.

In this document, a sensitive file refers to a user-generated file, or a software-generated file based on user input, that contains information which should probably not be publicly accessible and which can be leveraged against an individual or organization. This includes:

  • Personal identification documents; passport, driver’s license, visas, government forms…
  • Personal finance documents; income tax files, insurance forms, credit card statements, mortgage, pay stubs, banking information
  • Personal medical documents; prescriptions, medical records
  • Work-related files; emails, proprietary source code, password lists
  • Business finances; customer lists, sales data, project costs, business deals, investments, payrolls
  • Intellectual property; blueprints, schema, patents, research
  • Network configuration; passwords files, configurations files, network diagrams, user databases
  • Large databases of emails, addresses and other personal information.

Some types of files are not included in our analysis:

  • Copyrighted / illegally downloaded files. However, we considered text files containing licensing keys to be sensitive.
  • Inappropriate contents (nude selfies, personal politics, group affiliations etc…)
  • Personal pictures, letters.
  • Individual addresses and emails, which were not considered personal. However, databases of addresses and emails are considered sensitive.

Because of the volume, we cannot download and manually verify each file to confirm its contents. Our main restriction is therefore that our assessment must be done solely based on the absolute filename recorded. To evaluate the sensitivity, we used three categories: positive, neutral and negative, i.e. a file is either very likely to be sensitive, potentially sensitive or clearly not sensitive at all. Of course, there is always a possibility that a file labeled as sensitive may not be. For example, a file called social security numbers.xls may contain only formulas or an empty form. Ideally, files identified as positive or neutral should be manually vetted.

The procedure to automatically assess the sensitivity of a file based on its path and name starts with the manual assessment of random samples. Using the ORDER BY RANDOM() clause of the PostgreSQL database (performance is not an issue in this experiment), multiple random samples of 100 filenames are retrieved from the database. Note that there will be a need to review whether this function is truly random, which I doubt. Each file is shown to the evaluator who, based on the path, filename and extension, assesses the sensitivity of the file as ‘positive’, ‘neutral’ or ‘negative’. For each run, we log the count of hits for all categories.

Listing 1 : Example of a run in which a script asks an evaluator to assess the sensitivity of files based on its absolute path.

The evaluator assesses the filename based on keywords that may indicate the contents of the file. As such, a file containing a word, or as we call it in this document, a token, such as sales, passport or passwords will be assumed to contain information about sales, a passport scan or a list of passwords. In many cases, the filename is too obscure, but the path and extension may indicate the contents of the file. For example, a path containing the tokens project and finances, combined with a Microsoft Excel extension, will be considered as neutral despite a filename of axe189212_c.xls, as the file may contain information about a project. Examples of both scenarios are shown in listings 2 and 3:

Listing 2 : Examples of files that were deemed ‘positive’ hits based on keywords in their absolute path.

Listing 3 : Examples of files that were deemed ‘neutral’ (or ‘unknown’) hits based on keywords in their absolute path.

Filenames in foreign languages are roughly translated using Google Translate. As such, many of them are labeled as unsure.

A Python script then divides the filename into tokens, and each token is stored in the database along with the number of times it was found in a positive, neutral and negative hit. Tokens are created slightly differently depending on whether they are located in the path, the filename or the extension. For the extension, a single token is created which contains the extension itself. If the file does not have an extension, or if the extension is not one usually associated with known software, no token is created. For the filename, tokens are created by splitting words on characters usually known to separate them, such as the underscore, dash, period or space. Lastly, for the path, directories are used as tokens and, unlike filenames, are not split further. An example of this process is shown in listing 4:

Listing 4 : Example of the tokenization of a filename.
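
The tokenization rules described above can be approximated with the following Python sketch. It is not the script used for the experiment: the list of known extensions is a hypothetical placeholder, and the lowercasing of tokens is an assumption based on the tokens shown in the result tables.

```python
import re
from typing import List, Tuple

# Hypothetical list of extensions associated with known software.
KNOWN_EXTENSIONS = {"pst", "xls", "doc", "pdf", "jpg", "mp3"}


def tokenize(path: str) -> List[Tuple[str, str]]:
    """Split an absolute filename into (location, token) pairs."""
    parts = [p for p in path.split("/") if p]
    if not parts:
        return []
    *directories, filename = parts
    # Directories are kept whole; they are not split further.
    tokens = [("path", d.lower()) for d in directories]
    stem, dot, ext = filename.rpartition(".")
    if dot and ext.lower() in KNOWN_EXTENSIONS:
        name, ext_token = stem, ext.lower()
    else:
        name, ext_token = filename, None  # no extension token is created
    # Filenames are split on underscore, dash, period and space.
    tokens += [("filename", w.lower()) for w in re.split(r"[_\-. ]+", name) if w]
    if ext_token:
        tokens.append(("extension", ext_token))
    return tokens


for token in tokenize("/My_Passport/backup/Outlook emails backup.pst"):
    print(token)
```

Keeping the location alongside the token text is what makes the two ‘backup’ tokens of this example distinct.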

Once the tokens are created, the script will either add the token to the database or update its count based on the evaluator’s choice. After each update, a score is given to the token, which is simply the ratio between positive hits and the total count of appearances (p / hits). Note that tokens are considered different depending on their location in the filename. As such, a filename such as /My_Passport/backup/Outlook emails backup.pst will generate 2 distinct ‘backup’ tokens: the one from the path and the one from the filename. We explain this decision in the next paragraphs.

Listing 5 : Scores of the tokens extracted from the file in listing 4.
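
The bookkeeping behind these scores can be sketched in Python as below. The sketch keeps the counts in memory instead of a database, and the class name `TokenStats` is hypothetical. It implements the ratio as stated in the text (p / hits). As a side observation, some scores reported later (for example 0.9286 for 7 hits, which equals 6.5 / 7) suggest that neutral hits may have been counted as half a positive in the actual script; this is only an inference from the published numbers.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Token = Tuple[str, str]  # (location, text), e.g. ("filename", "backup")


class TokenStats:
    """In-memory stand-in for the token table of the database."""

    def __init__(self) -> None:
        self.counts: Dict[Token, Dict[str, int]] = defaultdict(
            lambda: {"positive": 0, "neutral": 0, "negative": 0}
        )

    def record(self, tokens: Iterable[Token], verdict: str) -> None:
        """Add one evaluator verdict to every token of a filename."""
        for token in tokens:
            self.counts[token][verdict] += 1

    def score(self, token: Token) -> float:
        """Ratio of positive hits to total hits; 0.0 for unseen tokens."""
        counts = self.counts.get(token)
        if counts is None:
            return 0.0
        return counts["positive"] / sum(counts.values())


stats = TokenStats()
stats.record([("filename", "backup"), ("extension", "pst")], "positive")
stats.record([("path", "backup"), ("extension", "jpg")], "negative")
stats.record([("filename", "backup")], "negative")
print(stats.score(("filename", "backup")))  # 1 positive out of 2 hits
print(stats.score(("path", "backup")))      # counted separately from the filename token
```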

By using this procedure, we believe that tokens appearing often in both positive and negative hits will cancel each other out, while the tokens strongly associated with positive or negative hits will remain clearly divided. Some sort of mathematical justification should follow later on (I hope…need to review discrete maths I guess). Some preliminary results appear to confirm this approach as valid. Extensions strongly associated with sensitive content have higher scores, while media files have null scores.

However, there is a need to further refine this process by associating a value, or weight, with the location of the token. Tokens in the path are not as indicative of the sensitivity of the file as tokens in the filename or extension. Even within the path, the highest level is generally less indicative than the lowest one, e.g. /documents/finances 2012/sales/company sales.xls. Therefore, when assessing a new filename, we need to give a score to the path, the filename and the extension. For the path, we will get the score of each token and multiply it by a weight that corresponds to its location in the structure. For tokens that are not found, the default value of 0 will be given. Then we will take the average over all tokens as the score of the path. As for the filename, we will not consider the position. Finally, the stored score of the extension will be retrieved from the database. If the extension is not found, then a score of 0 will be used. This will transform a filename into a set of three real values which range between 0 and 1. To determine the weights needed for each location, we will use a supervised neural network. More research will be conducted to determine how to use this approach.
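
To make the planned computation more concrete, here is a minimal Python sketch of how a filename could be turned into three values. The weights are purely hypothetical (a linear ramp from the top directory to the deepest one), since the text leaves them to be learned later, and averaging the filename tokens is also an assumption. The token scores used in the example are made up for illustration.

```python
from typing import Dict, List, Optional, Tuple

Token = Tuple[str, str]  # (location, text)


def path_weights(depth: int) -> List[float]:
    """Hypothetical weights: 1/depth for the top directory, 1.0 for the deepest."""
    return [(level + 1) / depth for level in range(depth)]


def score_file(directories: List[str], filename_tokens: List[str],
               extension: Optional[str],
               scores: Dict[Token, float]) -> Tuple[float, float, float]:
    """Return (path score, filename score, extension score), each in [0, 1]."""
    weights = path_weights(len(directories))
    weighted = [w * scores.get(("path", d), 0.0)
                for w, d in zip(weights, directories)]
    path_score = sum(weighted) / len(weighted) if weighted else 0.0
    # Position is ignored for filename tokens; unknown tokens score 0.
    name_scores = [scores.get(("filename", t), 0.0) for t in filename_tokens]
    name_score = sum(name_scores) / len(name_scores) if name_scores else 0.0
    ext_score = scores.get(("extension", extension), 0.0) if extension else 0.0
    return path_score, name_score, ext_score


# Made-up token scores for /documents/finances 2012/sales/company sales.xls
example_scores = {
    ("path", "documents"): 0.1,
    ("path", "finances 2012"): 0.6,
    ("path", "sales"): 0.9,
    ("filename", "sales"): 0.8,
    ("extension", "xls"): 0.5,
}
print(score_file(["documents", "finances 2012", "sales"],
                 ["company", "sales"], "xls", example_scores))
```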

Results

As of 16 July 2015, 4,568,738 files have been recorded from 773 hosts.

Country Hosts
United States 258
Russian Federation 91
Sweden 69
Canada 66
Ukraine 27
Norway 24
United Kingdom 24
Australia 19
Netherlands 18
Hong Kong 18
Taiwan 16
Poland 15
Germany 11
Romania 11
Finland 10
Switzerland 8
Korea 8
Singapore 7
Czech Republic 7
Japan 6
Table 1. Location of the 773 hosts scanned as of 16 July 2015, ordered by number of hosts per country.

MP3 and JPEG image files remain the most common. As such, we focus our statistics on document-type files for a change, i.e. office documents. Adobe PDF files and Microsoft Word documents are the most common file types based on our current data, as shown in figure 1.

Figure 1. Most common file types scanned as of 16 July 2015 for office-related documents

At the moment, around 3000 files have been assessed (30 runs of 100 samples). For each run, we recorded the number of positive, neutral and negative hits and found them roughly constant across runs (see figure 2). However, more details about the RANDOM function are needed to ensure the randomness of the sample, and this part may need to be redone. So far, between 2% and 9% of files scanned are considered sensitive or potentially sensitive (see figure 3). However, we need to consider the concentration of these files to put this information into perspective. The 278 files identified as sensitive or potentially sensitive were located on 59 hosts, with one host accounting for 101 of these files. This indicates that files of interest to an attacker are likely to be concentrated on a few hosts.

Figure 2. Assessed sensitivity of randomly selected 30 samples of 100 filenames.
Figure 3. Percentage of files according to their sensitivity based on manual assessment of 3000 randomly selected files.

As for tokens, we will have to consider the entire collection of filenames in order to have samples from multiple sources. As such, we will continue manually assessing samples of 100 filenames as more data is collected, after which we should have an excellent training set for the neural network. Some high-recurring and high-scoring tokens are shown in tables 2 and 3.

Token Hits Score
attach 7 0.9285714286
txn 7 0.9285714286
planning 6 0.9166666667
archived 6 1
recpt 6 0.9166666667
2010taxreturns 5 1
person~2 4 1
purchase 3 1
order 2 1
Паспорт 2 1
Table 2. Sample of high-scoring tokens sorted on the number of times observed.

Token Hits Score
jpg 938 0.013326226
mp3 460 0
music 448 0
seagate_backup_plus_drive 382 0.1452879581
asusware 348 0
pictures 309 0.0048543689
sda1 285 0.0649122807
bigdaddy 279 0
elements 278 0.0485611511
transcend 247 0.0222672065
my_book 234 0.0106837607
Table 3. Sample of high-recurring tokens sorted on the number of times observed.

Conclusion

While these results are preliminary, they nevertheless seem to provide a solid indication of what one can find on publicly-available removable drives. Additional work and fine tuning of both code and processes are required to provide more accurate data. The next step, while the scan is still ongoing, is to develop a methodology to assess the sensitivity of all files, likely using a neural network for classification based on the method presented above.

Removing Debugging Information from Visual C++/C# Projects

Introduction

It’s often surprising how many malware programmers forget to do the simplest things, mostly because many are so concerned with functionality, stealthiness and other production concerns that details easily slip their minds, which is a clear advantage for forensics. One of these details is the Program DataBase (PDB) information added by Visual Studio, which most malware authors use for Windows development. While it may seem innocuous, this string reveals a lot about the operating system used by the author, their user name and, most notably, symbols that can be used by IDA to ease understanding of the disassembly. This information makes it possible to link multiple pieces of malware together, by using the username for example. Of course, this also allows for the creation of signatures. Thus, removing this information will add a hurdle for the analysts.

Contents

The Program Database File

The Program Database (PDB) is a binary file used to store debugging information about DLL and EXE files. The PDB file is created when you build your project and stores a list of symbols and their addresses, along with the name of the file and the line number on which each symbol was declared. PDB files are also used by services that collect crash data and send it to developers for resolution.

Debugging Information

In Visual Studio, you can choose to build your project in Debug or Release mode. In Debug mode, VS will include debugging information with your executable. In Release mode, no debug information is included by default, but in some cases it is enabled so that if the program crashes, information can be retrieved and sent to the author for fixing. However, for some reason, some developers don’t really bother to use the Release mode, and simply use the executable generated by the Debug mode. Generally, you don’t want that if you are making malware (or any program really!). If left within the executable, a path to the PDB file will be included and can be extracted:

Path to the Program Database (PDB) file used by Visual Studio for debugging purposes, extracted using the “strings” program.

Within the strings, you can determine that:

  1. The program was developed on Windows Vista or later (because of the C:\Users folder),
  2. The username of the developer is SUPPORT_23e45RT
  3. The source, or part of it, can be found on Github
  4. The original name of the program is CaitSithTest
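
As a rough illustration of how such indicators can be pulled out automatically, the Python sketch below scans a binary blob for printable strings that look like a Windows path ending in .pdb, much like the “strings” program would reveal, and then guesses the user name from a C:\Users\ prefix. The function names, the regular expressions and the sample path are all hypothetical; the sample blob is synthetic and is not taken from the program shown above.

```python
import re
from typing import List, Optional

# Drive letter, backslash, printable ASCII, ending at the first ".pdb".
PDB_PATH = re.compile(rb"[A-Za-z]:\\[\x20-\x7e]{1,259}?\.pdb", re.IGNORECASE)


def find_pdb_paths(blob: bytes) -> List[str]:
    """Return every PDB-looking path found in the raw bytes of a binary."""
    return [m.group(0).decode("ascii") for m in PDB_PATH.finditer(blob)]


def guess_username(pdb_path: str) -> Optional[str]:
    """Extract the profile name when the path starts with <drive>:\\Users\\."""
    match = re.match(r"[A-Za-z]:\\Users\\([^\\]+)\\", pdb_path, re.IGNORECASE)
    return match.group(1) if match else None


# Synthetic blob standing in for the contents of an executable.
blob = (b"\x00MZ\x90" + bytes(16)
        + rb"C:\Users\alice\Documents\Projects\Demo\Release\Demo.pdb"
        + b"\x00")
for path in find_pdb_paths(blob):
    print(path, "->", guess_username(path))
```

In practice the bytes would come from reading the executable in binary mode, for example with `open(path, "rb").read()`.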

These indicators can be useful to link this specific program with others and provide a common link between multiple malware samples. Additionally, the username could potentially be used to conduct open source research and find linked accounts or forum posts. But wait, there’s more…

If you leave the debugging information, an analyst may be able to restore all the original names of variables and functions of the source code using IDA. IDA will first detect debugging information and ask the analyst whether they want to retrieve it, either via Microsoft (http://msdl.microsoft.com/download/symbols, not browseable) or by looking locally.

IDA detected that debugging information is available and asks if the user wishes to retrieve it.

If, for some reason, the user is able to retrieve the information, they will have access to the names of the original symbols, which will make reverse engineering much easier.

Since symbol information is available, the original names of the variables are displayed.

Compare the information from the figure above to the figure below, in which debugging information has been stripped at build time:

IDA could not find any debugging information and thus used its own labelling system to identify variables.

You can see that the variables defined in the first figure, such as ClipboardData, isProcessElevated and isDebugged, have been preserved. By keeping information about the symbols, reverse engineering is much easier compared to the second figure, in which information about the code is lost.

Disabling Debugging Information

To prevent Visual Studio from including this information in your executable, right-click on your project and go to the Project Properties > Linker > Debugging configuration menu. Select No in the Generate Debug Info option.

Removing debugging information in Visual Studio.

After doing this, rebuild your project and rerun the string extraction program against your binary. The path to the PDB file should not be present in the executable anymore.

The path to the PDB file is not included in the executable once debugging information has been omitted.

Doing so makes it a bit more difficult to fingerprint the malware and hides information about the author’s system.

Conclusion

This is a simple tactic that is often omitted not only by malware authors, but also by penetration testers, who are often Google programmers, i.e. copy-pasting code snippets from Stack Overflow or googling functions 😉 If you attempt to hide your malware in the System32 folder, looking for this information in the EXE or DLL files will quickly reveal which files are bad, since legitimate files will rarely have this information, or will have a legitimate-looking one. As such, if you want to make sure, create a legitimate-Microsoft-looking user (Bill.Gates) on your machine and put your code into a Microsoft-looking project and path (C:\users\Bill Gates\Documents\HTA\Release\).

Gears of (Cyber)War – C++ Code to Detect Version of Windows

Introduction

A common routine in malware, and also in much legitimate software, is one that recognizes the underlying operating system (OS). Depending on its version, the virus may want to take a different route to execute its activities. This functionality can be reused across different programs and is therefore useful to create and optimize. In this post, we code a function to detect the version of the Windows OS on which the program is currently running.

Windows Versions

The Windows OS is known to have a confusing version system, which is often the result of marketing pressure. As such, while most software follows a version number specifying the major version, the minor version and the revision (and sometimes the build), Windows swings between years of release (e.g. Windows 2000), fancy names (Windows Vista) or some designated number (Windows 7). However, the underlying machinery of Windows does define standard version numbers, e.g. 5.1. Below is the correspondence between Windows operating systems and standard versions.

When programming in C/C++, the version number of Windows can be obtained via the OSVERSIONINFO structure (or OSVERSIONINFOEX) and the GetVersionEx Windows API function:

The version information described in the table above is stored in the dwMajorVersion and dwMinorVersion members of the OSVERSIONINFO structure. All versions of Windows in the table above will store the value VER_PLATFORM_WIN32_NT (0x2) in the dwPlatformId member of the structure.

Branching C++ Code Based on Version

Based on the description above, we can create a reusable function to detect the version of the Windows OS:
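
The branching logic can be sketched as follows. This is a Python illustration rather than the C/C++ function of the post: the major and minor numbers are passed in as plain integers, standing in for the dwMajorVersion and dwMinorVersion members filled by GetVersionEx, and the names `WINDOWS_VERSIONS` and `windows_name` are hypothetical. The version pairs listed are the commonly documented ones for each release.

```python
from typing import Dict, Tuple

VER_PLATFORM_WIN32_NT = 2

# (major, minor) pairs and the releases that report them.
WINDOWS_VERSIONS: Dict[Tuple[int, int], str] = {
    (5, 0): "Windows 2000",
    (5, 1): "Windows XP",
    (5, 2): "Windows XP 64-bit / Windows Server 2003",
    (6, 0): "Windows Vista / Windows Server 2008",
    (6, 1): "Windows 7 / Windows Server 2008 R2",
    (6, 2): "Windows 8 / Windows Server 2012",
    (6, 3): "Windows 8.1 / Windows Server 2012 R2",
    (10, 0): "Windows 10",
}


def windows_name(major: int, minor: int,
                 platform_id: int = VER_PLATFORM_WIN32_NT) -> str:
    """Translate the version numbers into a marketing name."""
    if platform_id != VER_PLATFORM_WIN32_NT:
        return "Non-NT Windows"
    return WINDOWS_VERSIONS.get((major, minor),
                                f"Unknown Windows NT {major}.{minor}")


print(windows_name(5, 1))
print(windows_name(6, 1))
```

One caveat worth keeping in mind: on Windows 8.1 and later, GetVersionEx reports 6.2 to programs that do not declare compatibility in their manifest, so a 6.2 result does not always mean Windows 8.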

Note that in many cases, we won’t know if the OS is the workstation version of Windows or the server version of it. Also, there are no details for other versions of Windows, i.e. Windows CE or Windows 95 (…should it still be used somewhere…). That’s where OSVERSIONINFOEX is useful. The wProductType member will be set to VER_NT_WORKSTATION if the current OS is the workstation version. Furthermore, you can retrieve service pack numbers and edition information with wServicePackMajor, wServicePackMinor and wSuiteMask. So if you need more details about the Windows OS, you can include these as well:
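
The extra members can be interpreted along these lines. Again, this is a Python sketch and not the C/C++ code of the post: the values are passed in as integers standing in for wProductType, wSuiteMask, wServicePackMajor and wServicePackMinor, only a few suite flags are listed, and the function name `describe_edition` is hypothetical.

```python
# Product type values of wProductType.
VER_NT_WORKSTATION = 0x1
VER_NT_DOMAIN_CONTROLLER = 0x2
VER_NT_SERVER = 0x3

# A few of the flags that can be set in wSuiteMask.
VER_SUITE_ENTERPRISE = 0x0002
VER_SUITE_DATACENTER = 0x0080
VER_SUITE_PERSONAL = 0x0200

ROLES = {
    VER_NT_WORKSTATION: "workstation",
    VER_NT_DOMAIN_CONTROLLER: "domain controller",
    VER_NT_SERVER: "server",
}
SUITES = [
    (VER_SUITE_ENTERPRISE, "Enterprise"),
    (VER_SUITE_DATACENTER, "Datacenter"),
    (VER_SUITE_PERSONAL, "Home"),
]


def describe_edition(product_type: int, suite_mask: int,
                     sp_major: int = 0, sp_minor: int = 0) -> str:
    """Summarize the role, suites and service pack as one string."""
    parts = [ROLES.get(product_type, "unknown product type")]
    # wSuiteMask is a bit field, so several suites can be set at once.
    parts += [name for flag, name in SUITES if suite_mask & flag]
    if sp_major or sp_minor:
        parts.append(f"SP{sp_major}.{sp_minor}")
    return ", ".join(parts)


print(describe_edition(VER_NT_WORKSTATION, VER_SUITE_PERSONAL, 1, 0))
print(describe_edition(VER_NT_SERVER, VER_SUITE_ENTERPRISE | VER_SUITE_DATACENTER))
```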

For a full version of the procedure, visit this GitHub page, where you’ll find the C/C++ code.

Conclusion

Practically any malware will at some point need to check the version of the operating system being infected in order to enable specific functions or exploit certain vulnerabilities. Rarely will this check go as far as getting the suite, but Remote Access Tools (RATs) and bots will report the operating system, the version and the service pack. If you are a malware analyst, expect this function to be present in whatever piece of code you’re analyzing.