In the UK, the requirements to register a new company are few. Practically anyone over the age of 16 can own and manage a UK limited company, and it takes only a few minutes to register one on Companies House. This has brought about a rise in serial company formations and dissolutions by individuals.
One example is a 92-year-old woman named Barbara Kahan, who has 22,777 company appointments to her name, all filed under an inconspicuous London address - 2 Woodberry Grove, London, N12 0DR. Few would suspect that this leafy, suburban building was being used to register shell companies involved in fraud, money laundering and political corruption.
With company formation agents advertising their services for as little as the price of two McDonald's Big Macs, the three-hour company formation service has become an open door to criminal activity. Amidst the influx of new companies registering in the UK, it is hardly surprising that companies formed for criminal purposes have gone unnoticed. More needs to be done to tackle this issue in the UK.
Fortunately, Raphtory, a powerful analytics tool for large-scale graph analysis, can be used to catch the bad eggs among the hundreds of thousands of companies registering on UK soil. With Raphtory, it takes only a few seconds to turn Companies House data into insights into fishy company behaviour.
In this blog, we will scrape information about all the companies Barbara Kahan has been a director of and use Raphtory to analyse the data. Follow along in your Python notebook of choice as we unveil the dark secrets lying within the UK's company registry.
Follow along with our Jupyter Notebook
We have uploaded the full Jupyter notebook for this tutorial blog on our GitHub, which you can find by clicking here. Feel free to pull this example from GitHub or write up a fresh notebook on your local machine.
How to collect Companies House Data
We are in luck, as Companies House provides a REST API. At Pometry, we have built several crawlers that scrape the Companies House website, giving us direct access to the data we want. Currently, we have three crawlers: one made specifically to scrape Barbara Kahan's companies for this blog post and tutorial, another for grabbing Persons with Significant Control information, and a third for grabbing Company Director information. All our crawlers output JSON data, ready to be loaded into a Raphtory graph for analysis. We have made this public via pip install and explain how to use it below.
How to use the Companies House crawler
Getting your Companies House API key
Before scraping the Companies House website, you will need to create an account on the Companies House Developer Hub:
After logging into your account, create an application where your API keys will be stored:
Once created, go into your application and create a new REST API key. This key will be used to authenticate your scrape requests:
Make sure you select REST when creating your application:
Copy your API key which will be used to scrape Companies House website:
You are now ready to install the crawler and start scraping the Companies House website.
Installing and running our Companies House crawler
Install the crawler using pip:
Go into a Python terminal and run the following code:
Our crawler will start to scrape the Companies House API, finding all of Barbara's company data. Once finished, all your data can be found in the data/aqWJlHS4_rJSJ7rLgTK49iO4gAg folder in your root directory. We can now start the analysis using Raphtory.
Analysing the data with Raphtory
Install Raphtory via pip:
Open a Python terminal of your choice; we use a Jupyter Notebook for this example. Import all the dependencies needed for this example:
import os, json
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
from raphtory import Graph
from raphtory import vis
We use the Python JSON library to parse the JSON files output by the crawler. Through this, we can create a Raphtory graph and add our values to the graph via the add_edge() function.
Enter the directory path to your JSON files in the path_to_json variable. It should look something like this: ~/companies_house_scraper/tutorial/data/aqWJlHS4_rJSJ7rLgTK49iO4gAg:
path_to_json = ''
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]
Create a Raphtory graph:
g = Graph()
Iterate through all the JSON files (there are many, since the crawler works page by page) and add values to your Raphtory graph via the add_edge() function:
for index, js in enumerate(json_files):
    with open(os.path.join(path_to_json, js)) as json_file:
        json_text = json.load(json_file)
        try:
            for item in json_text['items']:
                appointed_on = item['appointed_on']
                resigned_on = datetime.strptime(item['resigned_on'], "%Y-%m-%d")
                resigned_on_string = str(resigned_on)
                epoch_resigned = int(resigned_on.timestamp()) * 1000
                company_name = item['appointed_to']['company_name']
                director_name = item['name']
                g.add_edge(appointed_on, director_name, company_name, {'resigned_on': resigned_on_string, 'epoch_resigned': epoch_resigned})
        except KeyError as e:
            print(f"key {e} not found in json block")
        except Exception as e:
            print(f"{e}")
Quick overview of Barbara’s companies using Raphtory
With the Raphtory API, we can quickly find statistics from our data about Barbara’s company ownership history.
Create a list of director names to see how many different names the director goes by:
list_of_src = []
for e in g.edges():
    list_of_src.append(e.src().name())
print(f"List of director names: {set(list_of_src)}")
List of director names: {'Barbara Z KAHAN', 'Barbara KAHAN'}
Finding the number of companies formed by the director:
print(f"Number of companies director assigned to: {g.num_edges()}")
Number of companies director assigned to: 22305
Seeing the earliest and latest company formations this director has made:
earliest_date = datetime.fromtimestamp(g.earliest_time()/1000)
latest_date = datetime.fromtimestamp(g.latest_time()/1000)
print(f"Earliest date director was assigned to company: {earliest_date}")
print(f"Latest date director was assigned to company: {latest_date}")
Earliest date director was assigned to company: 2002-01-14 00:00:00
Latest date director was assigned to company: 2016-02-16 00:00:00
There is a plethora of methods in the Raphtory API that give you an overview of your graph data. These are just a few, to demonstrate how easy it is to access this information with Raphtory.
Using Raphtory properties to filter suspicious companies
The date that the director resigned from the company can be accessed via the edge property. This is the API for adding properties to edges in Raphtory:
It is possible to attach any number of properties to edges and vertices in Raphtory to store extra information; however, we have kept it simple for this example. Properties have enabled us to store the resignation date in two formats - datetime format and epoch timestamp format.
It would be unusual for a company formation agent to help a client set up a company and stay on as director, rather than immediately handing the director title to the client. This can be an indication of a criminal using the company formation agent as a front for their dishonest activities.
sus_companies = []
for edge in g.vertex('Barbara KAHAN').edges():
    # 31557600000 ms is one year (365.25 days)
    if (edge.property("epoch_resigned") - edge.earliest_time()) > 31557600000:
        sus_companies.append(edge)
print(len(sus_companies))
859
As the code snippet above shows, Barbara stayed on at 859 companies for longer than one year. Let's delve deeper into when these companies were created and exactly who they belonged to.
Create a line plot visualisation over time with Raphtory
We can use Raphtory's .rolling() function with a window size of 10000000000 milliseconds (around four months). This enables us to "roll" through all the windows/views, counting the number of companies the director was assigned to over time.
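As a quick sanity check on that figure, converting the window size from milliseconds to days confirms it is roughly four months:

```python
window_ms = 10_000_000_000
# milliseconds -> seconds -> minutes -> hours -> days
window_days = window_ms / 1000 / 60 / 60 / 24
print(round(window_days, 1))  # 115.7 days, i.e. just under four months
```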
Roll through the graph with a window of 10000000000 milliseconds:
views = g.rolling(10000000000)
For each view, count the number of edges:
timestamps = []
edge_count = []
for view in views:
    time = datetime.fromtimestamp(view.latest_time()/1000)
    timestamps.append(time)
    edge_count.append(view.num_edges())
Create the line plot visualisation with the Seaborn library:
sns.set_context()
ax = plt.gca()
plt.xticks(rotation=45)
ax.set_xlabel("Time")
ax.set_ylabel("Companies Created")
sns.lineplot(x=timestamps, y=edge_count, ax=ax)
<Axes: xlabel='Time', ylabel='Companies Created'>
The plot reveals several spikes, especially between 2012 and 2016, so we can investigate this window of time further.
Using windows to filter particular timepoints of interest
One of the spikes in the line plot above is in 2014. To investigate this further, we use the .window() function, which takes a start and end time. We will look at a window of 01-01-2014 to 01-01-2015.
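Both arguments to .window() are epoch timestamps in milliseconds. For the date range above, they can be derived like this (using UTC so the numbers are reproducible):

```python
from datetime import datetime, timezone

def to_epoch_ms(year, month, day):
    # convert a UTC calendar date to epoch milliseconds
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp()) * 1000

start = to_epoch_ms(2014, 1, 1)
end = to_epoch_ms(2015, 1, 1)
print(start, end)  # 1388534400000 1420070400000
```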
filtered_view = g.window(1388534400000, 1420070400000)  # 2014-01-01 to 2015-01-01 in epoch ms
filtered_views = filtered_view.rolling(100000000)  # windows of ~1.2 days
timestamps = []
edge_count = []
for filtered_view in filtered_views:
    time = datetime.fromtimestamp(filtered_view.latest_time()/1000)
    timestamps.append(time)
    edge_count.append(filtered_view.num_edges())
sns.set_context()
ax = plt.gca()
plt.xticks(rotation=45)
ax.set_xlabel("Time")
ax.set_ylabel("Companies Created")
sns.lineplot(x=timestamps, y=edge_count, ax=ax)
<Axes: xlabel='Time', ylabel='Companies Created'>
There seems to be a spike between 2014-02-01 and 2014-02-15. We create a window for this spike to investigate further.
filtered_view2 = g.window(1391212800000, 1392422400000)  # 2014-02-01 to 2014-02-15 in epoch ms
filtered_views2 = filtered_view2.rolling(window=10000000)  # windows of ~2.8 hours
timestamps = []
edge_count = []
for filtered_view2 in filtered_views2:
    time = datetime.fromtimestamp(filtered_view2.latest_time()/1000)
    timestamps.append(time)
    edge_count.append(filtered_view2.num_edges())
sns.set_context()
ax = plt.gca()
plt.xticks(rotation=45)
ax.set_xlabel("Time")
ax.set_ylabel("Companies Created")
sns.lineplot(x=timestamps, y=edge_count, ax=ax)
<Axes: xlabel='Time', ylabel='Companies Created'>
There is a big spike on 2014-02-11 and 2014-02-12. Now that we have specific time points, we can find the names of the companies in these spikes by visualising the edges of the graph at these points.
Dynamic visualisation of your graph in Raphtory
To visualise specific dates, we first create a window around the time point we want; below, we use a window covering the spike (2014-02-01 to 2014-02-15, which includes 2014-02-12). We then filter for edges where the company formation agent (Barbara) stayed on as director for an unusually long time - here, more than five years (157784630000 milliseconds is roughly five years). Lastly, we use Raphtory's .to_pyvis() function to create a dynamic visualisation of the edges. In this way, we can clearly see the companies where the company formation agent stayed on as director far longer than expected.
sus_companies = []
twelfth_of_feb = g.window(1391212800000, 1392422400000)
for edge in twelfth_of_feb.vertex('Barbara KAHAN').edges():
    # 157784630000 ms is roughly five years
    if (edge.property("epoch_resigned") - edge.earliest_time()) > 157784630000:
        sus_companies.append(edge)
g2 = Graph()
for edge in sus_companies:
    g2.add_edge(1, edge.src().name(), edge.dst().name())
print(len(sus_companies))
263
vis.to_pyvis(graph=g2, edge_color='#F6E1D3',shape="image")
The visualisation will appear in a file called nx.html, which can be opened in a web browser.
A screenshot of the dynamic visualisation
If you would like your graph as a list of vertices and edges, you can call methods such as .vertices() and .edges().
twelfth_of_feb.vertices()
Vertices(Vertex(name=Barbara KAHAN, properties={_id : Barbara KAHAN}), Vertex(name=JYNUX SYSTEMS LIMITED, properties={_id : JYNUX SYSTEMS LIMITED}), Vertex(name=SIRTRECH LIMITED, properties={_id : SIRTRECH LIMITED}), Vertex(name=HOTSOUND LIMITED, properties={_id : HOTSOUND LIMITED}), Vertex(name=HAXMED LIMITED, properties={_id : HAXMED LIMITED}), Vertex(name=NITESYS LIMITED, properties={_id : NITESYS LIMITED}), Vertex(name=HYPERMANAGE LIMITED, properties={_id : HYPERMANAGE LIMITED}), Vertex(name=LYNXMECH LIMITED, properties={_id : LYNXMECH LIMITED}), Vertex(name=SISTEMON LIMITED, properties={_id : SISTEMON LIMITED}), Vertex(name=GINDOLA TRADERS LIMITED, properties={_id : GINDOLA TRADERS LIMITED}), ...)
twelfth_of_feb.edges()
Edges(Edge(source=Barbara KAHAN, target=JYNUX SYSTEMS LIMITED, earliest_time=1392163200, latest_time=1392163200, properties={resigned_on : 1499122800}), Edge(source=Barbara KAHAN, target=SIRTRECH LIMITED, earliest_time=1392163200, latest_time=1392163200, properties={resigned_on : 1423785600}), Edge(source=Barbara KAHAN, target=HOTSOUND LIMITED, earliest_time=1392163200, latest_time=1392163200, properties={resigned_on : 1525215600}), Edge(source=Barbara KAHAN, target=HAXMED LIMITED, earliest_time=1392163200, latest_time=1392163200, properties={resigned_on : 1411686000}), Edge(source=Barbara KAHAN, target=NITESYS LIMITED, earliest_time=1392163200, latest_time=1392163200, properties={resigned_on : 1585522800}), Edge(source=Barbara KAHAN, target=HYPERMANAGE LIMITED, earliest_time=1392163200, latest_time=1392163200, properties={resigned_on : 1457308800}), Edge(source=Barbara KAHAN, target=LYNXMECH LIMITED, earliest_time=1392163200, latest_time=1392163200, properties={resigned_on : 1457308800}), Edge(source=Barbara KAHAN, target=SISTEMON LIMITED, earliest_time=1392163200, latest_time=1392163200, properties={resigned_on : 1445817600}), Edge(source=Barbara KAHAN, target=GINDOLA TRADERS LIMITED, earliest_time=1392163200, latest_time=1392163200, properties={resigned_on : 1565132400}), Edge(source=Barbara KAHAN, target=GINDENE LIMITED, earliest_time=1392163200, latest_time=1392163200, properties={resigned_on : 1410908400}), ...)
In just a few minutes, we have transformed tens of thousands of company records into interesting insights on potentially fishy behaviour.
Analysing data temporally has been difficult in the past, but Raphtory makes it incredibly easy. Rather than looking at data in a static manner, taking time into account can give a better understanding of our data that may benefit your unique use cases.
If you would like to run further analysis and algorithms at scale in a production environment, drop the team at Pometry a message, and they will be more than happy to help.