#TheGreatFTPHunt – 2% to 9% of files scanned potentially containing confidential information

In this post, we continue our data collection and evaluation of files stored on removable medias publicly accessible to the Internet. The collection of filenames from 6,500 hosts is ongoing, therefore we’re going to focus on evaluation of sensitivity of a file based only on its filename. We also present the latest statistics collected in our database.

Introduction

In this post, we continue our data collection and evaluation of files stored on removable medias publicly accessible to the Internet. The collection of filenames from 6,500 hosts is ongoing, therefore we’re going to focus on evaluation of sensitivity of a file based only on its filename. Based on the current result, 2 to 9% of the 3000 files reviewed were sensitive or potentially sensitive. Most of the sensitive files are concentrated on a few hosts. These files often include financial information or project data from businesses. So far, 773 hosts containing around 4.5 million files have been scanned.

Discussion

The amount of filenames collected is quite large and we cannot evaluate manually each filename for its probable sensitivity. As such, we need to devise a procedure to automatically assess its sensitivity. We have some definitions and restrictions to list first to clarify what a sensitive file is and limitation to our evaluation criteria.

In this document, sensitive file refers to user-generated or software-generated files based on user input that contains information that should probably not be publicly accessible and which can be leveraged against an individual or organization. This includes:

  • Personal identification documents; passport, driver’s license, visas, government forms…
  • Personal finance documents; income tax files, insurance forms, credit card statements, mortgage, pay stubs, banking information
  • Personal medical documents; prescriptions, medical records
  • Work-related files; emails, proprietary source code, password lists
  • Business finances; customer lists, sales data, project costs, business deals, investments, payrolls
  • Intellectual property; blueprints, schema, patents, research
  • Network configuration; passwords files, configurations files, network diagrams, user databases
  • Large databases of emails, addresses and other personal information.

Some of the files not included in our analysis that includes;

  • Copyrighted / Illegally downloaded files. However we considered text file containing licensing keys to be sensitive.
  • Inappropriate contents (nude selfies, personal politics, group affiliations etc…)
  • Personal pictures, letters.
  • Addresses and emails were not considered personal, however databases of addresses and emails are considered sensitive

Because of the volume, we cannot download and manually verify each file to confirm its contents, as such our main restriction is that our assessment must be done solely based on the absolute filename recorded. As such, to evaluate the sensitivity, we used three categories; positive, negative and neutral, i.e. either a file is very likely to sensitive, potentially sensitive or clearly not sensitive at all. Of course, there is always a possibility that a file labeled as sensitive may not be. For example, a file called social security numbers.xls may contain only formulas or an empty form. Ideally, files identified as positive or neutral should be manually vetted.

The procedure to automatically assess the sensitivity of a file based on its path and name is first done by assessing a random sample manually. Using the ORDER BY RANDOM (note: there will be a need to review if this function is truly random, which I doubt) function (performance is not an issue in this experiment) of the Postgresql database, multiple  random samples of 100 filenames are retrieved from the database. Each file is shown to the evaluator which based on the path, filename and extension assess the sensitivity of the file as ‘positive‘, ‘neutral‘, ‘negative‘. For each run, we log the count of hits for all categories.

Listing 1 : Example of a run in which a script asks an evaluator to assess the sensitivity of files based on its absolute path.

The evaluator is assessing the filename based on keywords that may indicate the contents of the file. As such, a file containing the word, or as we call it in this document, a token such as sales, passport or passwords will be assume to contain information about sales, a passport scan or a list of passwords. In many cases, the filename is too obscure, but the path and extension may indicate the contents of the file. For example, a path containing the tokens project, finances and a Microsoft Excel extension despite a filename of axe189212_c.xls will be considered as neutral, as the file may contents information about a project. Examples of both scenarios are shown in listings 2 and 3:

Listing 2 : Examples of files that were deemed ‘positive’ hits based on keywords in their absolute path.

Listing 3 : Examples of files that were deemed ‘neutral’ (or ‘unknown’) hits based on keywords in their absolute path.

Filenames in foreign languages are roughly translated using Google Translate, as such, many of them are labeled as unsure.

A Python script then divide the filename in tokens, and each token is stored in the database along with the number of times it was found in a positive, neutral and negative hit. Tokens are created slightly differently based if they are located in the path, the filename or in the extension. For the extension, a single token is created which contains the extension itself. If the file does not have an extension or is not an extension usually associated with known software, no token is created. For the filename, tokens are created by splitting each word using characters usually known to separate words such as the underscore, dash, period or spaces. Lastly, for the path, directories are used as token and unlike filenames, are not split further. An example of this process is shown in listing 4:

Listing 4 : Example of the tokenization of a filename.

Once the tokens are created, the script will either add the token in the database or update its count based on the evaluator choice. After each update, a score is given to the token, which is simply the ratio between positive hits and the total count of appearances: p / hits). Note that tokens are considered different depending their location in the filename. As such, a filename such as /My_Passport/backup/Outlook emails backup.pst, will generate 2 distinct ‘backup’  tokens; the one from the path and the one from the filename. We explain this decision in the next paragraphs.

Listing 5 : Scores of the tokens extracted from the file in listing 4.

By using this procedure, we believe that tokens appearing often in both positive and negative hits will cancel each other, while the tokens strongly associated with positive and negative hits will remain clearly divided. Some sort of mathematical should follow later one (I hope…need to review discrete maths I guess). Some preliminary results  appears to confirm this approach as valid. Extensions strongly associated with sensitive contains higher scores while media files have null scores.

However, there is a need to further refine this process by associating a value, or weight, to the location of the token. Tokens in the path are not as indicative of the sensitivity of the file as much as a token in the filename or extension. Even within the path, the highest level is generally less indicative than the lowest one, i.e. /documents/finances 2012/sales/company sales.xls. Therefore when assessing a new filename, we need to give a score to the path, the filename and the extension. For the path, we will get the score of each token and multiply it with a weight that correspond to its location in the structure. For token that are not found the default value of 0 will be given. Then we will take the average of all token for the score of the path. As for the filename, we will not consider the position. Finally the stored score of the extension will be retrieved from the database. If the extension is not found, then a score of 0 will be used. This will transform a filename into a set of three real values which we can range between 0 and 1. To determine the weights needed for each location, we will used a supervised neural network. More research will be conducted to determine how to use this approach.

Results

As of 16 July 2015, 4,568,738 files have been recorded from 773 hosts.

Country Hosts
United States 258
Russian Federation 91
Sweden 69
Canada 66
Ukraine 27
Norway 24
United Kingdom 24
Australia 19
Netherlands 18
Hong 18
Taiwan 16
Poland 15
Germany 11
Romania 11
Finland 10
Switzerland 8
Korea 8
Singapore 7
Czech Republic 7
Japan 6
Table 1. Location of the 773 hosts scanned as of 16 July 2015 order by country.

Mp3 and JPEG image files remains the most common. As such, we focus our statistics on document-type of files for a change, i.e. Office documents. Adobe PDF files and Microsoft Word documents are the most common file types based on our current data as shown in figure 1.

Most common file types scanned as of 16 July 2015 for office-related documents
Figure 1. Most common file types scanned as of 16 July 2015 for office-related documents

At the moment, around 3000 files have been assessed (30 runs of 100 samples). For each run, we recorded the number of positives, neutral and negative hits and found them overall constant at each run. (see figure 2) However more details about the RANDOM function is needed to insure the randomness of the sample. This part may need to be redone. So far, between 2% and 9% of files scanned are considered sensitive or potentially sensitive (see figure 3). However we need to consider the concentration of these files to put this information into perspective. The 278 files identified as sensitive or potentially sensitive were located on 59 hosts, with one host accounting for 101 of these file. This indicates that files of interests for an attacker are likely to be concentrated on a few hosts.

Chart of assessed sensitivity of randomly selected 30 samples of 100 filenames.
Figure 2. Assessed sensitivity of randomly selected 30 samples of 100 filenames.
Chart of percentage of files according to their sensitivity based on manual assessment of 3000 randomly selected files.
Figure 3. Percentage of files according to their sensitivity based on manual assessment of 3000 randomly selected files.

As for tokens, we will have to consider the entire collection of filenames in order to have sample from multiple sources, as such, we will pursue manually assessing samples of 100 filenames as more data is collected. After which we should have an excellent training set for the neural network. Some high-recurring and high-scoring tokens are shown in tables 2 and 3.

Token Hits Score
attach 7 0.9285714286
txn 7 0.9285714286
planning 6 0.9166666667
archived 6 1
recpt 6 0.9166666667
2010taxreturns 5 1
person~2 4 1
purchase 3 1
order 2 1
Паспорт 2 1
Table 2. Sample of high-scoring tokens sorted on the number of times observed.
Token Hits Score
jpg 938 0.013326226
mp3 460 0
music 448 0
seagate_backup_plus_drive 382 0.1452879581
asusware 348 0
pictures 309 0.0048543689
sda1 285 0.0649122807
bigdaddy 279 0
elements 278 0.0485611511
transcend 247 0.0222672065
my_book 234 0.0106837607
Table 3. Sample of high-recurring tokens sorted on the number of times observed.

Conclusion

While these results are preliminary, they nevertheless seems to provide a solid indication of what one can find on publicly-available removable drives. Additional work and fine tuning of both code and processes is required to provide more accurate data and the next step while the scan is still on going it to develop a methodology to assess the sensitivity of all files, likely using a neural network for classification based on the method presented above.

Using Infogreffe – France’s National Registry of Corporations and Companies

Any operator knows that prior to any penetration testing activity requires a solid recon phase. The more information you collect about your target, the wider your attack surface becomes and thus, increased chances in a successful infiltration. In this post, we browse to the France’s registry, i.e. the “Registre du Commerce et des Societes” to extract information from the company and use this information to expand our attack surface

Introduction

Any operator knows that prior to any penetration testing activity requires a solid recon phase. The more information you collect about your target, the wider your attack surface becomes and thus, increased chances in a successful infiltration. When your target is a company, you have a wide array of tools to extract information from and pretty much every country has some sort of registry where financials and board information is kept up to date. Sometime, it also include additional information such as incorporation documents, trademark renewals and so one.

In this post, we browse to the France’s registry, i.e. the “Registre du Commerce et des Societes” to extract information from the company and use this information to expand our attack surface. In this post, we used Huawei as an example, only because they are widespread and maintain multiple locations. All information extracted is publicly available and should not be used to conduct fraudulent or illegal activities.

20130903 - Screenshot - Inforgreffe
Homepage of the ‘Registre du Commerce et des Societes’

Contents

The Système d’Identification du Répertoire des ÉTablissements (SIRET) Number

The SIRET stands for “système d’identification du répertoire des établissements” which is a unique number given to a physical commercial location (building, store, apartment etc…). The SIRET is 14 digits long. The first 9 digits is the SIREN, i.e. the “Système d’Identification du Répertoire des ENtreprises”, which uniquely identify the company owning the unit. The next four digits are the unit number and the last number is a checksum. The checksum is done using the Luhn algorithm. For example, let’s analyze the SIRET for Huawei Technologies France;

Search Results for Keyword 'Huawei' using the Infogreffe Search Engine
Search Results for Keyword ‘Huawei’ using the Infogreffe Search Engine

As you can see above, the SIRET for the Huawei Technologies France headquarters is 451 063 739 00119. What this number says is that Huawei Technologies France has been assigned SIREN 451 063 739 and that the headquarters is the 11th building Huawei Technologies France occupied in France. By observing the SIRET of each location, you can see the building numbers goes from 3 to 11, meaning the company had 3 buildings prior that they moved out from.

That being said, we could now purchase the KBIS report and get quite a lot of information about the board of Huawei Technologies France. But since we are cheap bastards, we’ll look somewhere else for this information. After all, CEOs and executive usually want to tell the world about their position. A quick search for “SIRET ‘451 063 739 00119′” on Google will yield three results only, but that’s all we need for now. The first one on verif.com will provide the list of the executives.

Names of the executives of Huawei Technologies France found on Verif.com
Names of the executives of Huawei Technologies France found on Verif.com

In all honesty thought, just searching for “Huawei Technologies France” would have return another website with the same information. However, Infogrette, like every other national registry can be a valuable startign point to retrieve additional details and expand your recce.

Additional information about Huawei Technologies France found on Societe.com
Additional information about Huawei Technologies France found on Societe.com

And actually, this site is even more generous, as it gives the month and year of birth of every executive in the company:

Birth information about the president of Huawei Technologies France
Birth information about the president of Huawei Technologies France
Birth information about the director general of Huawei Technologies France
Birth information about the director general of Huawei Technologies France

We can validate this information using the advanced search engine of the Infogreffe web site. There is a functionality included to search companies using names of executives including their birth information. For example, let’s make sure Mr. Wang is still on the board of the company. From the front page, click on “Recherche Avancée” (Advanced Search) and then select the second tab: “Recherche par Dirigeant” (Search by Director). Type the name using the “Lastname, Firstname” format and the birth year of the person. Then click the “Rechercher” (Search) button;

Advanced search using "Wang, Yeming" and "1974" to find Huawei Technologies France
Advanced search using “Wang, Yeming” and “1974” to find Huawei Technologies France

Once the results appear, we can see that Mr. Wang is still registered as being the director general of the company. Since any change must be registered to the tribunals, we can be confident that this information is valid.

Search results confirm that Mr. Wang is still at the headquarters of Huawei Technologies France
Search results confirm that Mr. Wang is still at the headquarters of Huawei Technologies France

Conclusion (So What?)

So what you may ask. What can I do with this. We found out the following about Huawei Technologies France:

  • SIREN (ID)
  • Physical locations in France
  • The name of the president and birth information
  • Names and birth information of most executives
  • Financial data of the company

You have now 7 addresses you can physically recon, i.e. see if you can dumpster dive, gather information about physical security (HID cards, fences, cameras etc…) or people working at these locations, which can then lead to additional recce on individual targets. Recceing the individuals found will likely lead to information about their relations, employment and responsibilities, possibly even to email addresses.

This simple guide was meant to provide a quick and dirty “howto” guide to one of the many, many tools available online to conducting research on companies or individuals for any law enforcement purposes. The Infogreffe is a drop in the ocean to locate business information for a very specific region. In the upcoming weeks and months, we will develop on other tools for other regions, as well as techniques you can use to track down targets. Don’t forget that the most important part of the operation is the information gathering phase. The more you know about your target, the easier the later phases will become. Keep in mind that other countries also keep similar registries, although the quality of their website may differ greatly.

Powerpoint presentation from Huawei found online detailing employment in Huawei Technologies France
Powerpoint presentation from Huawei found online detailing employment in Huawei Technologies France found online using Google.

References:

[1] Conseil national des greffiers des Tribunaux de commerce. “Registre du commerce et des sociétés.” Infogreffe – Greffe du tribunal de commerce. https://www.infogreffe.fr/ (accessed September 3, 2013).

Facebook Profile from Image Filename

This is not new and I’ve been knowing this for a while (and it’s all over the Internet anyway), but I keep forgetting the details. So basically to find a Facebook profile from the file name of an image, follow these steps:

  1. Images downloaded from Facebook always ends with “_n.jpg” and has 3 numbers divided by an underscore (“_”). For example: 1505501_1378549085743944_745448926_n
  2. The middle number is the Facebook Photo ID, this is the one you’ll need to locate the profile, in this example “1378549085743944”
  3. Then you plug your photo ID at the end of this link: https://www.facebook.com/photo.php?fbid=[photo-id]
  4. If the picture is not shared, then you’ll get a “Content Unavailable” message, otherwise you’ll land on the profile.
File "1505501_1378549085743944_745448926_n.jpg"
Image file “1505501_1378549085743944_745448926_n.jpg”

So by going to https://www.facebook.com/photo.php?fbid=1378549085743944, you’ll land on the profile where the picture was downloaded from:

Fake Facebook Profile of Chloe Moretz
Facebook profile corresponding to image id “1378549085743944”