Due Diligence: mining adverse media with Python and scraping.

Lorenzo Romani
8 min read · Feb 2, 2021


A relevant part of my daily routine involves searching for adverse media about the targets of an investigation, due diligence or background check. Every professional in this market knows that such a task can be long and time-consuming. Typically, in a preliminary phase, it involves running Google queries with a standard set of keywords to find out whether the target has been exposed to adverse media in some way. Example:

TargetName AND (“money laundering” OR “fraud” OR “trial”).

Within the scope of a preliminary assessment of a target, many such queries need to be executed. To reduce the strain, I have set my Google results page to show 100 results. Even so, it can be a boring job.

Then I came up with a solution aimed at automating the preliminary search and evaluation of online media. Paid news aggregators and compliance databases simply miss many interesting things. While they are usually reliable when it comes to identifying Politically Exposed Persons and State Owned Enterprises (but only at a high/national level), they overlook many adverse media results. Perhaps there are legitimate and fair legal reasons why they do so.

The solution I am presenting here has enabled me to significantly cut down the time spent on the preliminary search routine. The tool is written in Python and has gone through several improvements in recent years.

CONCEPT

It works based on a key assumption: “if it is out there, you should search it on Google first”.

So, I intentionally do not execute any query on other search engines (Yandex, Bing, etc.), nor do I search for PDF, Excel or XML files. Only plain text/HTML.

The tool, which I simply named “Due Diligence”, executes the following tasks:

  1. Based on a predefined (and fully customizable) set of adverse keywords (“money laundering”, “crime”, “arraigned”, etc.), it crawls Google via an external API and retrieves contents. A CSV with the scraped URLs is saved automatically.
  2. For each fetched url, it extracts only the text of the article. Everything that is on the page but is not part of the article is just discarded.
  3. It searches the text for the “target entity” name. If the entity and at least one adverse keyword (picked from the standard set of adverse keywords) are both mentioned, it continues with the loop:
  4. …so it tokenizes the extracted text. This means that most special characters, extra spaces, commas, etc. are eliminated. Text tokenizing is used in machine learning and natural language processing; basically, it means “cleaning” the text of unwanted characters. Next, it converts the target entity name into a token, too. For example, if we are searching for a company named “Dubai Energy Investments”, the name will be replaced, within the tokenized text, with DUBAI_ENERGY_INVESTMENTS.
    for example, let’s consider a case where the extracted text is:
    three managers of Dubai Energy Investments have been indicted over accusations of money laundering (three of them were detained)
    In this case, the tokenized text will be:
    three managers of DUBAI_ENERGY_INVESTMENTS have been indicted over accusations of MONEY_LAUNDERING three of them were detained
  5. Since, in the above example, the target entity is present and four adverse keywords were detected too (“indicted”, “accusations”, “money laundering”, “detained”), the text is converted into a LIST of tokens:
    [‘three’, ‘managers’, ‘of’, ‘DUBAI_ENERGY_INVESTMENTS’, ‘have’, ‘been’, ‘indicted’, ‘over’, ‘accusations’, ‘of’, ‘MONEY_LAUNDERING’, ‘three’, ‘of’, ‘them’, ‘were’, ‘detained’]
  6. At this point, since the software believes it has detected an adverse media article, it will compute a “risk index” for it. This is an important part of the job, as we will see later, because it enables the user to index each retrieved adverse media by its specific risk score value, within a nifty CSV output (more info to come) and, in my most recent upgrade, on a Flask interactive report (a web page that can be easily sorted and examined on the browser). The risk score is computed by calculating the mean distance between — in this case — “DUBAI_ENERGY_INVESTMENTS” and the adverse keywords. For instance, the entity token is 3 words away from “indicted”, 5 from “accusations”, 7 from “MONEY_LAUNDERING”, 12 from “detained”. Note that “money laundering” was tokenized too, since it is an adverse keyword/concept composed of two words.
  7. When all the URLs retrieved from Google have been evaluated, only those where the entity token and at least one adverse keyword are found are saved, along with their respective “risk index”, into the final output (a CSV file saved in the script’s directory). A minimal sketch of steps 4 to 7 follows this list.
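To make the steps above concrete, here is a simplified reconstruction of the tokenizing and scoring logic. This is my own sketch, not the actual code of the tool; the function names and the exact scoring formula are assumptions.

```python
import re
import statistics

# customizable set of adverse keywords (a tiny excerpt for the example)
ADVERSE_KEYWORDS = ["money laundering", "indicted", "accusations", "detained"]

def to_token(phrase):
    """'money laundering' -> 'MONEY_LAUNDERING'"""
    return phrase.upper().replace(" ", "_")

def tokenize(text, entity, keywords):
    """Steps 4-5: lowercase the text, collapse the entity and multi-word
    keywords into single tokens, strip special characters, split into tokens."""
    text = text.lower()
    text = text.replace(entity.lower(), to_token(entity))
    for kw in (k for k in keywords if " " in k):   # multi-word concepts only
        text = text.replace(kw, to_token(kw))
    text = re.sub(r"[^\w\s]", " ", text)           # drop punctuation and special chars
    return text.split()

def risk_index(tokens, entity, keywords):
    """Step 6: mean token distance between the entity and every adverse keyword found."""
    targets = {to_token(k) if " " in k else k for k in keywords}
    if to_token(entity) not in tokens:
        return None                                # entity not mentioned: skip the URL
    pos = tokens.index(to_token(entity))
    distances = [abs(i - pos) for i, t in enumerate(tokens) if t in targets]
    return statistics.mean(distances) if distances else None

text = ("three managers of Dubai Energy Investments have been indicted "
        "over accusations of money laundering (three of them were detained)")
tokens = tokenize(text, "Dubai Energy Investments", ADVERSE_KEYWORDS)
print(risk_index(tokens, "Dubai Energy Investments", ADVERSE_KEYWORDS))  # (3+5+7+12)/4 = 6.75
```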

Now that we have outlined the key workflow, let’s dive deeper into some technicalities.

In a previous version of the tool, I had built my own Google scraper. But I soon came to the conclusion that, even with timeouts and proxies, I could not solve the Captcha problem in a reliable and consistent way.

So, I prefer to “outsource” the scraping job to an external service. There are many unofficial Google APIs that scrape Google results for a dime. They usually do an awesome job at managing proxies on their own, so I don’t have to. A free subscription to one of these services usually allows 50 to 100 queries per month.

The queries are executed through the unofficial Google API and the results are then piped into the tokenizer.
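As an illustration, the call to such a service might look roughly like this. The endpoint URL, parameter names and JSON shape below are placeholders, not a specific vendor’s API; adapt them to whichever provider you choose.

```python
import csv
import requests

API_KEY = "YOUR_TOKEN"
ENDPOINT = "https://serp-provider.example.com/search"   # placeholder, not a real service

def google_results(query, num=100, lang="en"):
    """Ask the external scraping service for Google results and return the URLs."""
    resp = requests.get(ENDPOINT, params={
        "api_key": API_KEY, "q": query, "num": num, "hl": lang,
    }, timeout=30)
    resp.raise_for_status()
    # the response shape below is an assumption; adapt it to your provider
    return [item["link"] for item in resp.json().get("organic_results", [])]

urls = google_results('"benson & hedges" AND ("money laundering" OR "fraud")')
with open("scraped_urls.csv", "w", newline="") as f:
    csv.writer(f).writerows([[u] for u in urls])
```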

In an earlier version of the tool I used to download the whole HTML of each link. This was inefficient, since adverse keywords could be present on the page without being related to the article text (for example in advertising, side boxes, suggested content, etc.). So I switched to newspaper3k, an awesome library which does a great job at identifying only the relevant text within a web page. There are other, actively maintained libraries out there as well (newspaper3k is not maintained anymore, or at least that is my feeling).
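For reference, extracting only the article body with newspaper3k takes just a few lines; this is the library’s standard usage, wrapped in a helper of my own.

```python
from newspaper import Article  # pip install newspaper3k

def extract_text(url):
    """Download the page and return only the article body, discarding menus,
    ads and side boxes; returns None if the extraction fails."""
    try:
        article = Article(url)
        article.download()
        article.parse()
        return article.text
    except Exception:
        return None
```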

Until a few months ago the tool was published on GitHub and available for everybody to use, but I have recently improved it a lot and it has become part of my personal professional toolkit, so I cannot share the code anymore. However, you can still code your own version if you have some Python skills (intermediate-level coding is necessary).

When launched, the tool enables searching and evaluating adverse media in multiple languages (currently Italian, English and Spanish).

In this article, I will pick the name of a big tobacco company, “Benson & Hedges”. There is a reason why I’m choosing it (nothing personal, no allegations made here!): the company is also mentioned as “Benson and Hedges”.

In fact, I can specify the “official” target name as well as alternative names. Specifying alternative names (if any) is important because of how the tokenizing is done. If you search Google for “benson & hedges” but the returned article mentions it as “benson and hedges”, and you did not specify that alternative name, the software would be unable to detect the entity in the text, the tokenizing would not be performed, and you would probably miss important news.

After specifying the alternative names, separated by a pipe and without spaces (benson and hedges|benson&hedges), you need to define the language the search should be performed in. In this case, I specify English (“en”).
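One plausible way to handle the aliases before tokenizing is to fold every alternative spelling back into the canonical name; this is my guess at the mechanics, not the tool’s actual code.

```python
def normalize_aliases(text, canonical, aliases):
    """Replace every alias of the target with its canonical spelling, so the
    tokenizer only has to look for one entity token."""
    text = text.lower()
    for alias in aliases.split("|"):        # e.g. "benson and hedges|benson&hedges"
        text = text.replace(alias.lower(), canonical.lower())
    return text

print(normalize_aliases("Benson and Hedges was fined", "benson & hedges",
                        "benson and hedges|benson&hedges"))
# -> benson & hedges was fined
```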

Next, the user is asked to specify an (optional) proximity interval. The proximity interval tells the tool to skip news where the minimum distance between the target entity and the closest adverse keyword exceeds the specified threshold. I usually set a high number (10,000 or more) to be sure I won’t miss anything, but if I am searching for news where the adverse keywords are likely related to the target, I may choose a low threshold (e.g. 200). The key assumption here is that, if an adverse keyword is close to the target entity, there may be a relationship between the target and some allegations.
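In code terms the filter can be as simple as a comparison on the minimum distance; again a sketch, with names of my choosing.

```python
def within_proximity(tokens, entity_token, keyword_tokens, threshold=10000):
    """Keep the article only if the closest adverse keyword falls within
    `threshold` tokens of the target entity."""
    if entity_token not in tokens:
        return False
    pos = tokens.index(entity_token)
    distances = [abs(i - pos) for i, t in enumerate(tokens) if t in keyword_tokens]
    return bool(distances) and min(distances) <= threshold
```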

Finally, the software asks the user whether to refine the search query with a specific term or even a boolean operator. Let’s say the user is interested in finding adverse media about Benson & Hedges operations in a specific country. Then the user can specify the country name and the Google query will be built accordingly:

“benson & hedges” countryName AND (“money laundering” OR “corruption” OR ….)

Users can also specify other boolean operators. For instance, if you want to filter only news on a specific country’s ccTLD, you may type, for example, the “site:uk” operator. The query would then be executed as:

“benson & hedges” site:uk AND (“money laundering” OR “corruption” OR …)
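Splicing the refinement term into the query is straightforward; here is a small sketch of how it might be composed (the helper below is hypothetical).

```python
def build_query(entity, adverse_block, refinement=""):
    """Compose the Google query: quoted entity, optional refinement
    (a country name, a site: operator, ...), then the adverse-keyword block."""
    parts = [f'"{entity}"']
    if refinement:
        parts.append(refinement)
    parts.append(f"AND ({adverse_block})")
    return " ".join(parts)

print(build_query("benson & hedges", '"money laundering" OR "corruption"', "site:uk"))
# -> "benson & hedges" site:uk AND ("money laundering" OR "corruption")
```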

At this point, the script starts searching Google, through the external API, based on the user-predefined set of keywords and queries.

When the Google-searching is done, the tool starts opening urls, reading text, tokenizing entities and keywords, and doing the text-mining magic:

In this case, you can see that the software has started analyzing results. It opens each page and, if likely adverse media is detected, the URL is printed to the screen along with the detected adverse keyword and the distance between the keyword and the tokenized entity.

When the analysis is done, the user finds a CSV file named COMPANY_NAME_REPORT.csv in the script’s directory.

Something looking like this:

Here you can appreciate the final results. At the top of the list, the sources with the highest Risk Index are those most likely to contain adverse media concerning the target. Column F holds the total count of adverse keywords detected within the text, so you can also sort on a second column in Excel.
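If you prefer to slice the report programmatically rather than in Excel, a few lines of pandas are enough; the column names below are assumptions about the CSV layout, so adjust them to your own output.

```python
import pandas as pd

report = pd.read_csv("COMPANY_NAME_REPORT.csv")
# highest Risk Index first, then by the number of adverse keywords detected
report = report.sort_values(["risk_index", "keyword_count"], ascending=[False, False])
print(report.head(20))
```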

CUSTOMIZATION

As I pointed out earlier, the software is fully customizable. To understand its pros and cons, let’s open the KW.py file, which contains the default queries and the default adverse keywords:

Here you can see that the standard English query routine is based on 5 distinct queries such as:

kw_en1 = ENTITY_NAME AND (arrested OR imprisoned OR indicted OR investigated OR jailed OR sentenced OR detention OR probation OR bail OR criminal) -filetype:pdf -filetype:doc -filetype:xls -filetype:docx

The default queries, based on my personal experience, work pretty well but can surely be improved! This initial setup is designed to cope with limited external API query availability (50/month for each free token), but if I had an infinite budget I would split each query in a more specific way, such as:

kw_en1 = ENTITY_NAME AND (bribes OR bribery OR bribing OR bribed)
kw_en2 = ENTITY_NAME AND (corruption OR corrupted)
kw_en3 = ENTITY_NAME AND (“money laundering”)

So, I would split the queries to get more results, and probably better-quality results, possibly using hundreds of adverse keywords and dozens of queries. This really depends on your budget, time, and effort.
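With a bigger query budget, the narrower queries could even be generated programmatically from thematic buckets of keywords; this is my own sketch, not how KW.py actually works.

```python
# group adverse concepts into small thematic buckets, one query per bucket
KEYWORD_BUCKETS = [
    ["bribes", "bribery", "bribing", "bribed"],
    ["corruption", "corrupted"],
    ['"money laundering"'],
]

def narrow_queries(entity):
    """Build one Google query per thematic bucket of adverse keywords."""
    return [f'"{entity}" AND ({" OR ".join(bucket)})' for bucket in KEYWORD_BUCKETS]

for q in narrow_queries("benson & hedges"):
    print(q)
```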

Over time, this tool has turned out to be really useful for compliance professionals, due diligence analysts and journalists alike.

While this tool is aimed at “adverse media” (specifically financial adverse media), it is also possible to adapt the queries and keywords to specific needs and industries. So it is not necessarily limited to the compliance/AML world.

If you are interested in this solution, just drop me a message in my LinkedIn page (I bet you can find it).
