This is a short post to streamline opening accounts on sina.com, which is a free Chinese webmail account provider.
This is a short post to streamline opening accounts on sina.com, which is a free Chinese webmail account provider. Why would you want an account on Sina? Well, I use it for counter-spam/phishing/counter-phishing operations, or simply as a recurring disposable email account when services like 10 minute mail or Mailinator are blocked, or should I need something less public. Obviously, no sensitive information should be used with any email account used on this service.
Opening an Account
Browse to Sina.com to access the main login page, where you can also register. On the login page, you are greeted with a typical login page. To register account, click on the second button, i.e. ‘注册’, as shown in figure 1.
The registration page should appear. Little information is needed to register an account and no mobile phone number verification is required. Actually, all you need is a password and a username! If you ever need more details, you can generate a complete persona using the FakeIdentityGenerator. For this account, I generated a persona called ‘Mulan Fu’ by selecting Chinese names. Let’s see if Disney sues for using the name ‘Mulan’. Just fill the form with your email prefix, a password, the captcha and click on the blue button.
You may be brought directly to your new inbox. Otherwise, just login and make sure the ‘SSL’ option is selected. Enter your email address and password an click on the blue button (figure 3).
You can finally reach your inbox, which is organized like any other webmail services (figure 4). The ‘New Email’ button is the one on the top left corner, which reads ‘写信’. To access your inbox, simply click on the first item in the folder list on the left (‘收件夹’). Finally, to consult sent email, use the forth item from the top in the folder list (‘已发送’).
Composing a New Email
To write a new email, click the button detailed in the previous paragraph (see figure 4). If you compose a new email for the first time, you will be prompt to create a new signature (figure 5). Enter any information you want and save the information. You should be familiar with the composing interface (figure 6). You can use attachment up to 50M.
Click on the first button above the email, the one with the paper plane (‘发送’), to send the email.
Obviously, there are many other features that can be used. Many of them, if not most, are similar to any other webmail services on the Internet. One of the advantage of using this particular service is that, unlike Google or Outlook, it asks for very little information and doesn’t require mobile phone verification. Also, unlike some temporary email services, it’s ‘private’, at least, less public and permanent. Finally, since Sina.com is still a legit service, but not well known in the Western world, it can be used for most online services without too much suspicion. This make it ideal for social engineering operations. There are other similar services around the world, with various level of intrusiveness when it comes to registration.
Compared to operating systems in general computing, the world of embedded devices remains a world to be discovered by security analysts and hackers alike, and it offers much to explore. While reverse engineering of many newer appliances is revealing well known operating systems such as Windows or Linux, others remains fairly unexplored, such as TinyOS or μOS. Legacy devices, still in used across many industries, also provide a rich variety of unknown software and architectures. One of them is the Byte-BOS Real-Time Multitasking Operating System (RTMOS).
Compared to operating systems in general computing, the world of embedded devices remains a world to be discovered by security analysts and hackers alike, and it offers much to explore. While reverse engineering of many newer appliances is revealing well known operating systems such as Windows or Linux, others remain fairly unexplored, such as TinyOS or μOS. Legacy devices, still in use across many industries, also provide a rich variety of unknown software and architectures. One of them is the Byte-BOS Real-Time Multitasking Operating System (RTMOS). This short article attempts, based on limited information, to detail this little known system, explore its history, usage and basic composition.
The Byte-BOS RTMOS was initially developed in 1994 by Numerical Services Inc. The company, defunct since 2004, sold the full C source code of the RTMOS to customers for 7495 $USD. Possession of the source code allowed the buyer full customization of the operating system according to the device being designed. The system supported a restricted set of microcontrollers (see table 1) including the Intel x86 and Motorola M68000. Other lesser known microprocessors supported included architectures from Hitachi, Mitsubishi and Texas Instruments, which were used in embedded devices within the medical, industrial and telecommunications sectors. Due to the long life-cycle of these devices, it may still be possible to identify devices leveraging the Byte-BOS RTMOS. Development and support for the RTMOS ceased around 2004. As such, documentation is very scarce, and remnants of information are available only via the Internet Archive . At the time of writing, the domain “www.bytebos.com” resolves to “Scheumacher Engineering”, owned by the original developer of Byte-BOS. However the website is nothing more than a single page which was last updated in 2013.
Table 1 – Microprocessors supported by the Byte-BOS RTMOS.
The Byte-BOS RTMOS is a minimal operating system providing task scheduling management for user applications, along with typical OS activities such as interprocess messaging, memory allocation and prioritization. It performs pre-emptive and non-pre-emptive scheduling of an unlimited number of queued tasks. The system enforces scheduling via interrupt service routines (ISRs) which can suspend and/or resume tasks based on events and their priority. When multiple tasks with a similar priority are requesting resources, the RTMOS assign the resources via a round-robin selection method.
As other operating systems, memory allocation appears to follow a standard process by keeping track of used and free memory chunks via a heap implemented by a double-linked list, which is the data structure used across the RTMOS to store unlimited numbers of objects. Tasks can dynamically allocate and free memory as needed. It also manages events and interaction with devices on behalf of user applications by abstracting the underlying hardware via a Device object (see figure 1). Byte-BOS also provides inter-tasks communications mechanisms for synchronization amongst tasks such as messages and semaphores. In terms of data structures the double linked list and buffer objects appear to be the main structures used across the RTOS. Memory chunks, tasks and most of the other objects are all managed via the List object. It also supports the queue structure used to manage tasks. A software stack is also provided and used by tasks to run.
Security-wise, Byte-BOS does not have any authentication or specific security methods for access to memory or special “kernel” functions; all tasks have the same level of authority than the RTOS. Basically, Byte-BOS is completely flat.
The RTMOS is written in C or specific variety of C specification depending on the targeted microprocessor. The 10 main components of the operating systems are represented in figure 1 and provide an excellent overview of the functions accomplished. A simulated “this” pointer to the various “struct”s is made available to the developer to give an Oriented-Object Programming (OOP) feel. This section describes briefly the main components of Byte-BOS, along with some of the features marketed by the original developer.
The Task Object
The main object of the RTMOS is the Task object, which is very similar to the similar concept modern software development. Just as in any C/C++ object, a constructor is initiated prior to executing the body of the task, which is defined by the developer to conduct a specific task. Data of the task is allocated and accessible via the this_data pointer. Tasks also allocate sub-objects, opens required devices, initialize variables and synchronize state machines. Once constructed, the task runs its main function until it returns or is removed. Access to a task is done via the this_task pointer. Both pointers described are always in scope and usable within the task. Once the main task completes, the destructor of the structure is called, which frees memory and pointers of sub-objects. It also terminates any internal state machine.
The Memory Object
This object manages the memory heap and stores the pointer to the heap, the block size and the number of memory blocks. It also exposes various functions for memory management, mainly alloc_fmem and free_fmem. The heap is implemented using a double-linked list in which each node contains metadata about the memory blocks. While details are not available, it can be assumed that the size of the block, its status (free or allocated) and a pointer to the next block is included.
The Semaphore Object
As its name implies, the Semaphore is used for signaling and control between multiple tasks by either being in the Up or Down status. A timeout can be defined as needed.
The MessageBox Object
The MessageBox acts as a central repository for tasks to exchange data. Tasks and ISRs can create messages to initiate other tasks or exchange . The MessageBox basically provides two functions: put and get. When a new message is created, memory is allocated. Similarly, the memory is freed when retrieved from the box.
The Timer Object
Similar to any standard timer in other system, this object is used to create timeouts and schedules of tasks and events.
The Event Object
Events are created by tasks and ISRs to schedule other tasks.
The Device Object
This item abstracts interaction between the user application and the hardware device. It does so by pointing to the memory blocks used by the device and managing input/output to the area.
The List, Queue and Buffer Objects
These objects, as their name implies, are implementation of a linked list, a queue and buffer data structures.
Byte-BOS boasted the following additional features on their website from 2000, providing some extra insight on the internals of the system.
ISRs can make operating systems calls and schedule tasks by including the ENTER_ISR and EXIT_ISR macros at the beginning and end of the routine. Most services are available to ISRs, giving them considerable access over the entire OS.
Critical Section Handling
Byte-BOS provides critical section handling by protecting non-reentrant code sections from both interrupt and task pre-emption. Protection of critical sections appears to be done via the following mechanisms:
Locking and unlocking code sections
Prioritization of tasks
Task Execution Trace
Tasks are provided with a trace buffer configurable at both compile time and run time. The trace buffer contains information about the sequence of calls, the parameters passed, and the return value of the called function. The tracing option is extremely useful for debugging purposes.
The Byte-BOS RTMOS requires very little in terms of resources. While the requirements varies depending on the underlying processor, the size of the kernel is a few kilobytes, while the RAM requirements are less than 100 bytes of globally accessible memory and less than 100 bytes per task and ISR. The detail figures are listed in table 2.
Table 2 – Memory requirements for the Byte-BOS RTMOS according to microprocessor
for Global Variables
per Task (bytes)
* For Intel-based architecture, the memory footprint reported above is based on usage of the Borland C++ compiler.
Additional volatile memory is required for objects created dynamically. As such, the numbers provided above do not represent the total amount of memory consumed by Byte-BOS.
In general, Byte-BOS supports the development tools and compilers provided by the semi-conductor manufacturer. For Intel-based processors, the Borland C/C++ compiler was supported. The complete list of supported compilers is provided in table 3.
Table 3 – Supported compilers by the Byte-BOS RTMOS.
Devices using the Byte-BOS RTMOS can be identified by looking at ASCII and Unicode strings in their firmware. In almost all cases, if not all, the firmware will contain a copyright notice identifying it as well as the target microcontroller it was compiled for. For example, by extracting the strings from the firmware for the Seagate Cheetah 10K.6 Disc Drive, either using the simple strings.exe (or strings in Linux) or IDA, the copyright notice string “Byte-BOS 8096 Multitasking Operating System — Copyright 1990 Numerical Services Inc.” can be observed (see figure 2).
While appliances posterior to 2004 will likely use modern RTMOS such as Linux , Byte-BOS can still be found on legacy devices. One such example is the BaxterIPump pain management medical device, which refers to its usage in the manual . Of note, the manual refers to the detection of stack overflows within the Byte-BOS RTMOS. Industrial controls, aircraft, telecommunications systems and other systems designed prior to the 2000s may still harbour this little known RTMOS.
The Byte-BOS RTMOS will likely disappear as legacy embedded devices are life-cycled for newer systems with more processing power and extended memory. Until that moment, for developers of these systems, or simple hobbyists, information remains scarce and limited. While we provided a brief overview of its architecture and features, details about creating user applications remains obscured by the lack of online information such as an API description, code examples or more importantly, the source code. Such data would greatly ease development and engineering efforts of legacy systems.
This document analyzes the use of the cyber environment in the Syrian civil war by both the population and the government in order to characterize online tactics and strategies developed and used by each belligerent.
This is an article I wrote a while ago and never got published. It’s a bit outdated now, but I still think it can be useful for historical purposes, so I’ll post a link to it below.
This document analyzes the use of the cyber environment in the Syrian civil war by both the population and the government in order to characterize online tactics and strategies developed and used by each belligerent. This overview allows for generalization of online behavior by hacktivists and nation-state sponsored actors on communication networks in the region, which will continue to see online attacks from various parties in the foreseeable future during similar conflict. In Syria, because of poor infrastructure, low rate of Internet penetration and early adoption of control mechanisms by the current government, the authorities had dominance over their information environment early in the conflict, enabling rapid gathering of intelligence on dissidents. While social medias were leveraged by the population as in many other uprisings for coordination, it was also the theater of multiple offensive cyber operations by internal and external groups, mostly for information operations purposes. Despite the high level of activity, none appeared to have a definitive impact on the ground. While events recorded in this space have not reached the level of intensity of other conflicts, it proves a useful model for similar conflicts in the Middle East region.
The buffer overflow attack vector is well documented in desktop and server class machines using the von Neumann memory model, but little has been published on buffer overflow vulnerabilities in Harvard architectures.
I have not written or contributed to the enclosed research paper. I’m simply reposting it here because it’s interesting and for some reason, appears available only via Google cache. So before it disappear from results, I’m reposting it here.
This paper discusses a technique to conduct buffer overflows on processors using the Harvard architecture. In this architecture, the stack starts at the beginning of the memory and grows up, versus Von Neumann architectures in which it grows down.
Most small embedded devices are built on Harvard class microprocessor architectures
that are tasked with controlling physical events and, subsequently, critical infrastructures. The Harvard architecture separates data and program memory into independent address spaces, as opposed to the von Neumann architecture that uses a unified memory system with a single address space for both data and program code. The buffer overflow attack vector is well documented in desktop and server class machines using the von Neumann memory model, but little has been published on buffer overflow vulnerabilities in Harvard architectures. In this paper we show that stack-based buffer overflow vulnerabilities exist in embedded control devices based on the Harvard class architecture. We demonstrate how a reversal of stack growth direction can greatly simplify the attack and allow for easier access to critical execution controls. We also examine popular defense techniques employed in server and desktop environments and the applicability of those defenses toward Harvard class machines.
The VMF protocol is a SDR data protocol created to exchange information between multiple different systems by providing a rich and flexible specification. Yet, the VMF is little known even amongst avionics engineering and its knowledge remains the speciality of few. As such, in this post we explore this protocol in details to better understand its inner workings and its uses
Software-Defined Radio (SDR) is a fast-growing market, expanding in a wide array of industries. Growth in this sector alone is expected to reach $USD27.29 billion by 2020 . When only considering that most smartphones are equipped with SDRs, one can quickly understand the active research conducted in the field. SDRs are also in high usage across militaries and law enforcement given the added flexibility to conduct multiple types of operations using the same hardware, often with reduced maintenance costs. Nowadays, military tactical radios and law enforcement equipment are all software-defined to some degree. Within the industry, 93% of all mobile systems leverage SDR for wireless communications . The added agility introduced by software also comes with additional threats, previously absent from hardware-based radios. Within the military, SDRs form the cornerstone of network-centric warfare as they are used to establish networks between different units in order to exchange data such as position, imagery and target information rapidly between ground, maritime, air and space elements. Radios not only process voice; the internal software also manages various networks and data protocols, further increasing the complexity of the application layer of the communication device. The VMF protocol is one such SDR data protocol created to exchange information between multiple different systems by providing a rich and flexible specification. Yet, the VMF is little known even amongst avionics engineering and its knowledge remains the speciality of few. As such, in this post we explore this protocol in details to better understand its inner workings and its uses
An overview of the architecture of the SDR is required to better appreciate the threats against them and to eventually understand some of the unique characteristics when looking for vulnerabilities in their software. We do so by describing the general software model used by SDRs which have multiple layers, each of which uses different programs to process waveforms and the data contained in them. The Wireless Innovation Forum (WIF) simply defines a SDR as a radio in which some or all of the physical layers functions are software defined, contrasting with previous radios in which all configurable properties were designed and hardcoded into the hardware. Within the definition of WIF, the physical layers refer to the four lowest layers of the OSI model. This innovation allowed for additional agility into the functionalities of radios, as the integrated general purpose processors (GPP) or Field Programmable Gate Array (FPGA) units and the flexible radio-frequency (RF) modulators – all reconfigurable through software – permit the device to operate on a wider range of the spectrum depending on the current operations. These new components are all software controlled, allowing for fast reconfiguration as needed by the operator (see figure 1).
Within the SDR the Software Communications Architecture (SCA) is the open framework generally used to specify how a radio designer should integrate hardware and software components in order to interact efficiently and maximize software reuse . The SCA was developed by the U.S. military as part of the now defunct Joint Tactical Radio System (JTRS). Despite it’s military roots, the SCA is used in the industrial sector as well. The framework divides the radio into 4 layers, each composed of multiple components. These are layers below the user applications (figure 2) and are responsible for processing the inbound and outbound data:
Programmable Radio Hardware; the Programmable Radio Hardware is the layer which regroups the software-reconfigurable elements of the radios such as the RF modulation units, the modems and the link interfaces.
Operating System and Middleware; the operating system, along with the Common Object Request Broker Architecture (CORBA) middleware play the critical role of managing the communications between the software of the core framework and the hardware. CORBA is a message passing technique that is widely used for cross-platform frameworks. All core interfaces are defined using the Interface Definition Language (IDL) which can be compiled in different languages such as C/C++ and ADA. The operating systems included in SDRs are Real-Time Operating Systems (RTOS) such as uOS, TinyOS or VxWorks. These provide multithread support and are required to be POSIX compliant. The RTOS also interfaces with the underlying network interfaces and serial ports.
Core Framework; the core framework is the abstraction layer between the software developers and the underlying hardware. It contains the interfaces and services required for the applications to use the devices of the radio by describing them using the eXtensible Markup Language (XML);
Waveforms; the software at this layer specifies the required parameters to form the needed waveform to communicate with networked devices. By reviewing the framework above, one quickly realizes that the world of radios is now mainly driven by software development and thus exposes itself to the same issues that plagues programs in the wider computing world. The terms ”operating system”, ”drivers”, ”middleware” and APIs are terms well-known to cyber operators and computer security analysts alike, thus making them a target the same way as any other host on any other network.
Software radios have significant advantages for the military as they provide a wide variety of dynamic radio protocols in real time. Since SDRs can be reprogrammed remotely for multiple purposes without any hardware changes, maintenance costs and time have decreased. Because of these characteristics, SDRs are one of the cornerstones of network-centric warfare, the currently predominant Western military doctrine. It is therefore not surprising that all tactical military radios in modern western militaries are SDRs, including most of civil aviation for law enforcement and potentially drones. Modern MTRs are more akin to routers than radios and enable the creation of instant networks and the exchange of operational data between the connected nodes over Tactical Data Links (TDL) such as Link-16. Data protocols such as VMF and the Joint Range Extension Applications Protocol (JREAP) enable the exchange of voice, data and imagery by connecting to a network of various platforms and units, both vertically and horizontally. Amongst the data exchanged are: positions, trajectories, maps, navigation data, mission data and intelligence. This implies the presence of user applications managing the data protocol, and additional applications processing the data received. Other data link protocols include the Type 483D, the Chinese equivalent to Link 4C. This transfer of information between platforms is central to the network-centric warfare doctrine which drives the research and development of network radios and increases the complexity of the software and network protocols. In many aspects, the radio can be considered the gateway device between the internal networks of a platform and external nodes requesting to connect to it. Internally within aircraft and ships many MTRs provide multiple interfaces to communicate with other elements of the platform, such as mission computers, display terminals and external software to display geoposition data for example. On a military aircraft, the interfaces of the radio usually include the MIL-STD-1553 bus (or its civilian counterpart, the ARINC429) to link with the avionics as well as Ethernet and serial ports such as RS-232 or RS-485, often used by Remote Control Units (RCUs) for management purposes via protocols such as the Simple Network Management Protocol (SNMP). Externally the radio networks the platform with external units using different systems by forming Mobile Ad hoc Networks (MANETs) and exchanges tactical data using data protocols over the tactical data links. These networks are wireless, infrastructureless, multi-hop and highly fluid. In military usage these are typically low-bandwidth links and are often managed by devices with little computational power. The Ad hoc On-Demand Distance Vector (AODV) is typically the underlying routing protocol, which manages routing in a peer-to-peer network. Routing information is updated constantly, as nodes within a MANET can also act as proxies for far-away nodes.
The Variable Message Format Protocol
The VMF standard was developed by the U.S. Department of Defense to allow messages of variable lengths to be sent over TDLs. A VMF message is bit-oriented and attempts to minimize the use of TDLs by sending only the required data. The objective is to be flexible enough to be able to communicate with any legacy and new host requiring that additional header fields can be added without modifying the underlying specification. The core of the VMF protocol is the Protocol Data Unit (PDU) which contains the header and the user data (figure 3), much like a typical TCP/IP packet. The PDU is processed at the application layer and is composed of the application header and the user data, which can be multiple format as we will see later on. The size of the former is always a multiple of 8 and is padded with null bits as needed.
Structure of the Application Header
The first 4 bits of the header always specify the version of the VMF protocol in used by the PDU. As of 2015, 5 revisions of the protocol have been created (table 1):
Table 1. Version numbers currently defined in revision D w/ CHANGE1
The “Version Sent Not Implemented” value is used to specify that the current implementation of the VMF protocol in the system is not backward compatible with the incoming VMF messages. For example, a system using revision D of the protocol sending a message to another system using revision C or earlier will receive an answer with the version field set to “15”, specifying that the source system must use an earlier revision (figure 4). If the version is not implemented, but the destination system is still able to process the message, it will nonetheless.
Fields and Groups
Elements following the version number are either “fields” or “groups” as shown in figure 3. A field can generally be thought as single piece of data along with an overhead of 1 or 2 bits; a Field Presence Indicator (FPI) and a Field Recurrence Indicator (FRI). The FPI is a flag which indicate if a value for the field has been specified or not, e.g. if the FPI is set to zero, the following bit is the start of new field or group. If set to one, the following bits represents the value for the field. Some fields occur more than once. These fields include the FRI flag which is right after the FPI. When set to one, the field is repeated right after until the FRI is set to zero as illustrated in figure 5. Note that the FPI is included only once in the first field.
Groups, as their name suggest, are sets of related fields or sub-groups. A group have a depth down to 6 levels and possess the same presence and indicator flags as fields, which are known at the Group Presence Indicator (GPI) and Group Recurrence Indicator (GRI). The order of fields and groups of the header is fixed, much like any TCP/IP packet.
While most of the fields contain numeric values, some are 7-bit ASCII coded strings, where the value 127 (the DELETE character) is used as the string terminator, i.e. equivalent to the null byte ‘’ in C/C++ character arrays. The terminator is added to the string only if its bit-length is smaller than the maximum length of the field. For example, the “unitname” field is 448-bit long, i.e. 64 characters. If the unitname contains less than 64 characters, the terminator will be added to the value. A complete list of groups and fields, along with their description and constraints can be found in the specification (and maybe discussed in future parts of this article)
Assembly of a VMF Message
The creation of a VMF message is quite similar to the creation of a HTTP request or a TCP/IP packet: it is done via encapsulation of user data through the multiple layers of the SDR. User data is first entered into the radio, usually by the pilot or crew via a Remote Control Unit (RCU) or a subsystem via the internal MIL-STD-1553 bus or ARINC 629 data bus. The radio receives the MIL-STD-1553 message from the bus controller and extracts its data. It then determine to which units on its MANET to send the data. Note that data can be broadcasted to all units on the network via a broadcast address. The default broadcast address is 16777215. Within the SDR, the data is then transferred to the VMF messaging service of the device. Based on its configuration and properties of the data provided, the VMF layer will construct the proper application header and append the user data to form the application PDU. The PDU is then send to lower layers, transformed into waveforms and digital data converted to analog radio waves to be sent via the antenna of the aircraft. The receiving unit then do the reverse process and dispatch the received data to the internal subsystems of the destination aircraft. The process is illustrated in figure 6.
The type of user data that is contained within the application PDU is defined by the User Message Format (UMF) field. This field indicates the format of the message contained in the user data field and is associated with the Functional Area Designator (FAD), the Message Number, the Message Subtype, CANTCO reason and CANTPRO reason fields. More about these fields in part 2 of this article.
VMF can be used to transfer files between systems and this is done by setting the UMF field with the value 1 and using the “Filename” field to specify the name of the file. Furthermore, to indicate that the message is a file transfer the GPI of the VMF Message Identification Group needs to be set to zero.
When the UMF field is set to 0100 (4), the content of the user data is another VMF message, much like a forwarded email. Both the application header and user data sections of the redistributed message are included in the user data portion of the message forwarding it. The “Operation Indicator“, “Security Classification” and “Release Marking” fields are required to be similar in both messages. Both messages are to be processed by the receiving systems.
Nodes (hosts) on the network can be identified using one of the following fields: the URN or the Unit name. Both cannot be used at the same time normally. The URN is, as its name implies, a number given to a specific unit to identify it. It acts as a MAC address would. Each URN is unique and only as one unit bearing the number. Distribution of these URN is made by the U.S. Department of Defense. This URN is the addressing scheme at the application layer. VMF can be carried via TCP/IP (or UDP), MIL-STD-1553 or any other lower layer. Note that other addressing scheme can be used, but these will be explored later on.
This concludes part 1 of the introduction to the VMF protocol. In part 2, we will explore further the fields included in the latest revision of the protocol and include actual example of VMF messages. In the meantime, you’re welcome to play with Vmfcat, which is a Python script I’m working on (still incomplete) to generate VMF messages from the command line or via an interactive shell.
By first doing manually searching for the hash using a search engine, the plaintext will be found and after doing it 2 or 3 times for different hash, a pattern will emerge in the random plaintext. In this case, random integers between 1000 and 9999 are generated. As such, to solve the challenge, one need to hash all integers from 1000 to 9999 and compare it with the one on the page. Once we have a match, we use the plaintext with a provided URL to get the flag. So in summary:
Figure out the plaintext used to generate the hash. Use Google.
Find a pattern in the plaintext, or boundaries [x,y] in the values used.
Extract the SHA1 from the HTML
Generate a SHA1 hash for each possible value in [x,y]
Compare the 2 hashes. If they match, go to the flag retrieval page.
In this post, we continue our data collection and evaluation of files stored on removable medias publicly accessible to the Internet. The collection of filenames from 6,500 hosts is ongoing, therefore we’re going to focus on evaluation of sensitivity of a file based only on its filename. We also present the latest statistics collected in our database.
In this post, we continue our data collection and evaluation of files stored on removable medias publicly accessible to the Internet. The collection of filenames from 6,500 hosts is ongoing, therefore we’re going to focus on evaluation of sensitivity of a file based only on its filename. Based on the current result, 2 to 9% of the 3000 files reviewed were sensitive or potentially sensitive. Most of the sensitive files are concentrated on a few hosts. These files often include financial information or project data from businesses. So far, 773 hosts containing around 4.5 million files have been scanned.
The amount of filenames collected is quite large and we cannot evaluate manually each filename for its probable sensitivity. As such, we need to devise a procedure to automatically assess its sensitivity. We have some definitions and restrictions to list first to clarify what a sensitive file is and limitation to our evaluation criteria.
In this document, sensitive file refers to user-generated or software-generated files based on user input that contains information that should probably not be publicly accessible and which can be leveraged against an individual or organization. This includes:
Personal identification documents; passport, driver’s license, visas, government forms…
Personal finance documents; income tax files, insurance forms, credit card statements, mortgage, pay stubs, banking information
Personal medical documents; prescriptions, medical records
Business finances; customer lists, sales data, project costs, business deals, investments, payrolls
Intellectual property; blueprints, schema, patents, research
Network configuration; passwords files, configurations files, network diagrams, user databases
Large databases of emails, addresses and other personal information.
Some of the files not included in our analysis that includes;
Copyrighted / Illegally downloaded files. However we considered text file containing licensing keys to be sensitive.
Inappropriate contents (nude selfies, personal politics, group affiliations etc…)
Personal pictures, letters.
Addresses and emails were not considered personal, however databases of addresses and emails are considered sensitive
Because of the volume, we cannot download and manually verify each file to confirm its contents, as such our main restriction is that our assessment must be done solely based on the absolute filename recorded. As such, to evaluate the sensitivity, we used three categories; positive, negative and neutral, i.e. either a file is very likely to sensitive, potentially sensitive or clearly not sensitive at all. Of course, there is always a possibility that a file labeled as sensitive may not be. For example, a file called social security numbers.xls may contain only formulas or an empty form. Ideally, files identified as positive or neutral should be manually vetted.
The procedure to automatically assess the sensitivity of a file based on its path and name is first done by assessing a random sample manually. Using the ORDER BY RANDOM (note: there will be a need to review if this function is truly random, which I doubt) function (performance is not an issue in this experiment) of the Postgresql database, multiple random samples of 100 filenames are retrieved from the database. Each file is shown to the evaluator which based on the path, filename and extension assess the sensitivity of the file as ‘positive‘, ‘neutral‘, ‘negative‘. For each run, we log the count of hits for all categories.
Listing 1 : Example of a run in which a script asks an evaluator to assess the sensitivity of files based on its absolute path.
The evaluator is assessing the filename based on keywords that may indicate the contents of the file. As such, a file containing the word, or as we call it in this document, a token such as sales, passport or passwords will be assume to contain information about sales, a passport scan or a list of passwords. In many cases, the filename is too obscure, but the path and extension may indicate the contents of the file. For example, a path containing the tokens project, finances and a Microsoft Excel extension despite a filename of axe189212_c.xls will be considered as neutral, as the file may contents information about a project. Examples of both scenarios are shown in listings 2 and 3:
Listing 3 : Examples of files that were deemed ‘neutral’ (or ‘unknown’) hits based on keywords in their absolute path.
Filenames in foreign languages are roughly translated using Google Translate, as such, many of them are labeled as unsure.
A Python script then divide the filename in tokens, and each token is stored in the database along with the number of times it was found in a positive, neutral and negative hit. Tokens are created slightly differently based if they are located in the path, the filename or in the extension. For the extension, a single token is created which contains the extension itself. If the file does not have an extension or is not an extension usually associated with known software, no token is created. For the filename, tokens are created by splitting each word using characters usually known to separate words such as the underscore, dash, period or spaces. Lastly, for the path, directories are used as token and unlike filenames, are not split further. An example of this process is shown in listing 4:
File:/My_Passport(1)/Name of Company/Owner/ASUSLAPTOP/Data/C/Users/Owner/Documents/Forms/Letter of Understanding forSub-contractors(2015_03_2416_00_01UTC).docx
Path=>/My_Passport(1)/Name of Company/Owner/ASUSLAPTOP/Data/C/Users/Owner/Documents/Forms/
File=>Letter of Understanding forSub-contractors(2015_03_2416_00_01UTC)
Tokens(Path)=["My_Passport(1)","Name of Company","Owner","ASUSLAPTOP","Data","C","Users","Documents","Forms"]
Listing 4 : Example of the tokenization of a filename.
Once the tokens are created, the script will either add the token in the database or update its count based on the evaluator choice. After each update, a score is given to the token, which is simply the ratio between positive hits and the total count of appearances: p / hits). Note that tokens are considered different depending their location in the filename. As such, a filename such as /My_Passport/backup/Outlook emails backup.pst, will generate 2 distinct ‘backup’ tokens; the one from the path and the one from the filename. We explain this decision in the next paragraphs.
Listing 5 : Scores of the tokens extracted from the file in listing 4.
By using this procedure, we believe that tokens appearing often in both positive and negative hits will cancel each other, while the tokens strongly associated with positive and negative hits will remain clearly divided. Some sort of mathematical should follow later one (I hope…need to review discrete maths I guess). Some preliminary results appears to confirm this approach as valid. Extensions strongly associated with sensitive contains higher scores while media files have null scores.
However, there is a need to further refine this process by associating a value, or weight, to the location of the token. Tokens in the path are not as indicative of the sensitivity of the file as much as a token in the filename or extension. Even within the path, the highest level is generally less indicative than the lowest one, i.e. /documents/finances 2012/sales/company sales.xls. Therefore when assessing a new filename, we need to give a score to the path, the filename and the extension. For the path, we will get the score of each token and multiply it with a weight that correspond to its location in the structure. For token that are not found the default value of 0 will be given. Then we will take the average of all token for the score of the path. As for the filename, we will not consider the position. Finally the stored score of the extension will be retrieved from the database. If the extension is not found, then a score of 0 will be used. This will transform a filename into a set of three real values which we can range between 0 and 1. To determine the weights needed for each location, we will used a supervised neural network. More research will be conducted to determine how to use this approach.
As of 16 July 2015, 4,568,738 files have been recorded from 773 hosts.
Table 1. Location of the 773 hosts scanned as of 16 July 2015 order by country.
Mp3 and JPEG image files remains the most common. As such, we focus our statistics on document-type of files for a change, i.e. Office documents. Adobe PDF files and Microsoft Word documents are the most common file types based on our current data as shown in figure 1.
At the moment, around 3000 files have been assessed (30 runs of 100 samples). For each run, we recorded the number of positives, neutral and negative hits and found them overall constant at each run. (see figure 2) However more details about the RANDOM function is needed to insure the randomness of the sample. This part may need to be redone. So far, between 2% and 9% of files scanned are considered sensitive or potentially sensitive (see figure 3). However we need to consider the concentration of these files to put this information into perspective. The 278 files identified as sensitive or potentially sensitive were located on 59 hosts, with one host accounting for 101 of these file. This indicates that files of interests for an attacker are likely to be concentrated on a few hosts.
As for tokens, we will have to consider the entire collection of filenames in order to have sample from multiple sources, as such, we will pursue manually assessing samples of 100 filenames as more data is collected. After which we should have an excellent training set for the neural network. Some high-recurring and high-scoring tokens are shown in tables 2 and 3.
Table 2. Sample of high-scoring tokens sorted on the number of times observed.
Table 3. Sample of high-recurring tokens sorted on the number of times observed.
While these results are preliminary, they nevertheless seems to provide a solid indication of what one can find on publicly-available removable drives. Additional work and fine tuning of both code and processes is required to provide more accurate data and the next step while the scan is still on going it to develop a methodology to assess the sensitivity of all files, likely using a neural network for classification based on the method presented above.
It’s often surprising how many malware programmers forget to do the simplest things. Mostly because many are so concerned with functionality, stealthiness and other production concerns, that details slip easily of their minds – a clear advantage to forensics. One of these details is the Program DataBase (PDB) information added by Visual Studio, which most malware authors used for Windows development. While it may seem innocuous, this string reveals a lot about the operating system used by the author, its user name and most notably, symbols that can be used by IDA and ease understanding of the disassembly.
It’s often surprising how many malware programmers forget to do the simplest things. Mostly because many are so concerned with functionality, stealthiness and other production concerns, that details slip easily of their minds – a clear advantage to forensics. One of these details is the Program DataBase (PDB) information added by Visual Studio, which most malware authors used for Windows development. While it may seem innocuous, this string reveals a lot about the operating system used by the author, its user name and most notably, symbols that can be used by IDA and ease understanding of the disassembly. This information allows to link multiple pieces of malware together, by using the username for example. Of course, this also allows for the creation of signatures. Thus, removing this information will add a hurdle to the analysts.
The Program Database File
The Program Database (PDB) is a binary file used to store debugging information about DLL and EXE files. The PDB file is created when you build your project and stores a list of symbols their addresses along with the name of the file and the line number on which the symbol was declared. PDB files is also used for services collecting crash data to send it to developers for resolution.
In Visual Studio, you can select to build your project in Debug or Release mode. In Debug mode, VS will include debugging information with your executable. In Release mode, no debug information is included by default, but in some cases is enabled so that if the program crashes, information can be retrieved and sent to the author for fixing. However for some reasons, some developers don’t really bother to use the Release mode, and simply use the executable generated by the Debug mode. Generally, you don’t want that if you are making malware (or any program really!). If left within the executable, a path to the PDB file will be included and can be extracted:
Within the strings, you can determine that:
The program was developed on Windows 7+ (because of C:\Users folder),
These indicators can be useful to link this specific program with others and provide a common link between multiple malware. Additionally, the username could potentially be used to conduct open source research and find linked accounts or forum posts. But wait, there’s more…
If you leave the debugging information, you may be able to restore all the original names of variables and functions of the source code using IDA. IDA will first detect debugging information and ask the analyst if he wants to retrieve it, either via Microsoft – http://msdl.microsoft.com/download/symbols (not browseable)- or by looking locally.
If for some reason, the user is able to retrieve the information, he will have access to the names of the original symbols, which will make reverse engineering much more easier.
Compare the information from the figure above to the figure below, in which debugging information has been stripped at build time:
You can see that the variables defined in the first figure, such as ClipboardData, isProcessElevated and isDebugged have been preserved. By keeping information about the symbols, reverse engineering is much more easier compared to figure 2, in which information about the code is lost.
Disabling Debugging Information
To prevent VC from including this information in your executable, right click on your project, go to Project Properties > Linker > Debugging configuration menu. Select No in the Generate Debug Info option.
After doing this, rebuild your project and rerun the string extraction program against your binary, the path to the PDB file should not be present in the executable anymore.
Doing so makes it a bit more difficult to fingerprint the malware and hides information about the author’s system.
This is a simple tactic that is often omitted not only by malware author, but penetration testers, which are often Google programmers, i.e. copy-pasting code snippets from Stack Overflow or googling functions 😉 If you attempt to hide your malware into the System32 folder, looking for this information in the EXE or DLL files will quickly tell you which files are bad, since legitimate files will rarely have this info, or have legitimate looking one. As such, if you want to make sure, create a legitimate-Microsoft-looking user (Bill.Gates) on your machine and put your code into a Microsoft-looking project and path (C:\users\Bill Gates\Documents\HTA\Release\).
A common structure in malware, and also in many legitimate software is to recognize what is the underlying operating system (OS). Depending on the version of it, the virus may want to take different route to execute its activities. This functionality can be reused across different programs and is therefore useful to create and optimize. In this post, we design a function to detect the version of the Windows OS on which the program is currently running.
A common structure in malware, and also in many legitimate software is to recognize what is the underlying operating system (OS). Depending on the version of it, the virus may want to take different route to execute its activities. This functionality can be reused across different programs and is therefore useful to create and optimize. In this post, we coded a function to detect the version of the Windows OS on which the program is currently running.
The Windows OS is known to have a confusing version system, which is often the result of marketing pressure. As such, while all software usually follow a version number specifying the major version, the minor version and the revision (and sometimes the build), Windows swings between years of release (ex. Windows 2000), fancy names (Windows Vista) or some designated number (Windows 7). However, in the underlying machinery of Windows are defined standard version numbers, ex. 5.1. Below is the correspondance between Windows operating systems and standard versions.
The version information described in the table above is stored in the dwVersionMajor and dwVersionMinor of the OSVERSIONINFOstructure. All version of Windows in the table above will store the value VER_PLATFORM_WIN32_NT (0x2) in the variable dwPlatformId of the structure.
Branching C++ Code Based on Version
Based on the description above, we can create a reusable function to detect the version of the Windows OS:
//Windows XP (64bit), Windows Server 2003 or Windows Server 2003 R2
}// end function
Note that in many cases, we won’t know if the OS is the workstation version of Windows or the server version of it. Also, there is no details for other versions of Windows, i.e. Windows CE, Windows 95 (…should it still be used somewhere..). That’s where OSVERSIONINFOEX is useful. The wProductType variable will be set to VER_NT_WORKSTATION if the current OS is the workstation version. Furthermore, you can retrieve service packs numbers and edition information with wServicePackMajor, wServicePackMinor and wSuiteMask. So if you need more details about the Windows OS, you can include these as well:
//If there is a suite, add it after the version
strcat_s(WindowsDescription,BufferSize," Small Business");
For a full version of the procedure, visit this GitHub page, you’ll find the C/C++ code.
Practically any malware at some point will need to check the version of the operating system being infected in order to enable specific functions or exploit certain vulnerability. Rarely will this check go as far as getting the suite, but Remote Access Tool (RAT) and bots will report the operating system, the version and the service pack. If you are a malware analyst, expeect this function to be present in whatever piece of code you’re analyzing.
T-Shark is practically the command-line version of Wireshark. It has the same basic capabilities but with the added flexibility offered by using the command-line to process outputs and send them to other applications. Below I’ve enclosed some of the commands that I have found myself reusing over and over again.
T-Shark is practically the command-line version of Wireshark. It has the same basic capabilities but with the added flexibility offered by using the command-line to process outputs and send them to other applications. Furthermore, T-shark is ideal for large PCAP files which Wireshark may have difficulty digesting, especially since it has to load the entire contents of the file prior to any kind of filtering. As such, T-Shark is my main tool for analyzing PCAPs. One of my main use is to extract specific information from the network for investigation. Below I’ve enclosed some of the commands that I have found myself reusing over and over again.
When analyzing PCAPs, I’m mostly concern to locate anomalies based on intelligence on various current actors. I’m especially interested in analyzing covert channels for specific indicators that are usually indicative of malicious traffic. These often include typos in popular URLs, weird looking domain names, emails with suspicious attachments or certificates with random fields. To extract the data, I’ve used the following T-shark commands:
In the example above (and the ones below), I’m reading network traffic from a offline PCAP file, which is why the -r lab1.http.pcap parameters is present. The -T fields specify to output the value of the fields specified with the -e parameter. For each field you want to output, you specify it using the -e <field> option. The -R <filter> is the filter to apply to the traffic. For example in this case we filter only HTTP traffic. Sort and Uniq are Linux applications used to sort the output of a program and remove duplicate entries.
The output file will contain text data that have been sorted and in
which the doubles will have been removed using ‘uniq -i‘, i.e.
we ignore the case of the items.
The example above will extract the URLs of all the DNS queries found
in the file dns.pcap and will output a list of URLs in dns.queries.txt:
Learning to use T-shark has many advantages that can increase efficiency, security and flexibility. It allows for scripting the extraction of data and storage into databases from which further analysis can quickly be done for anomalies. In this short post, we have listed some examples of how T-shark can be used, but it barely scraps the surface.