Digging into the Panama Papers data

This column appeared in the spring 2016 edition of Media magazine.

By Fred Vallance-Jones

If you’ve been fascinated by the revelations in the Panama Papers leak, you can now delve into some of the data used by the international team of hundreds of journalists who spent a year on the project before going public in early April.

Just as it did with its earlier Offshore Leaks investigation, the Washington, D.C.-based International Consortium of Investigative Journalists has released both searchable data and a downloadable dataset derived from the Panama Papers. The data combines information from both investigations, making it possible for anyone to explore companies and individuals connected to offshore tax havens.

The ICIJ is at pains to point out that while many uses of such offshore accounts are perfectly legitimate, others are not, and with that warning has unleashed the data for anyone to explore.

The release is quite brilliant in many ways. Not only does it allow anyone to explore some of the data from the two investigations, but it permits a kind of crowd-sourced journalism whereby people who may be familiar with particular individuals, organizations or local circumstances can use their own knowledge to find patterns and stories that the international team has not identified.

The original Panama Papers leak was mammoth: more than 2.6 terabytes of documents and data from the Panamanian law firm Mossack Fonseca, slipped to the German newspaper Süddeutsche Zeitung by an anonymous source publicly known only as John Doe. There was so much data, and so many documents, that Süddeutsche Zeitung turned to the ICIJ for help making sense of it. You can read more about how the team extracted all of the information here.

The Panama Papers investigation rocked the world and had immediate impacts, such as the swift resignation of Icelandic Prime Minister Sigmundur Davíð Gunnlaugsson over revelations about his family’s use of offshore tax havens. British Prime Minister David Cameron also faced questions about how he had benefited from an offshore trust fund set up by his late father.

As it did with the Offshore Leaks data, the ICIJ employed a graph database, a type of program specialized for making connections between disparate entities. But most potential users won’t want to be bothered with the rather steep learning curve involved, so the ICIJ has made the data available both in an interactive, searchable application and as a series of CSV files that can be downloaded from its site and opened in a spreadsheet or conventional relational database. For most people, these two options will be more than sufficient.

The easiest and quickest way to explore the data is through the interactive application, which allows you to search for particular names, or for all of the corporate entities associated with a particular country. It lets you dig into the various relationships, and generates visualizations that show the nodes (companies, individuals, etc.) and the connections between them. The basic search functionality may be all you need; it’s slick, and it doesn’t require you to have any extra software. The ICIJ is careful to warn that a great deal of the information it used in its original investigations is absent from the publicly available data, meaning extra caution is necessary to avoid mistaken identities and other errors.

Anyone exploring the data on their own would probably be wise to keep in mind that the investigations have involved hundreds of reporters, and some of the most sophisticated data analysis yet done by journalists. You probably won’t be able to duplicate it using this data (he says with some considerable understatement). But you may be able to run down some specific, individual connections that could lead to stories with further research.

The CSV files available from ICIJ were extracted from the graph database. Graph databases have a different logic from relational databases, connecting what are called “nodes” (for example, the address of a company or the name of an officer of a company) with each other using connections that are called “edges” or “relationships.” The CSV data groups the nodes into several tables, and also includes a table with all of the edges used to relate nodes to each other. It’s a bit of a hack, but as the ICIJ notes, it makes the data available to a much wider range of folks who might simply want to look at the tables individually in a spreadsheet.
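To make the node-and-edge layout concrete, here is a minimal sketch in Python of how an edges table can stitch node tables back together. Everything in it is an assumption for illustration: the column names (node_id, name, node_1, rel_type, node_2) and the sample rows are made up, so check them against the headers in the files you actually download.

```python
import csv
import io
from collections import defaultdict

# Tiny made-up stand-ins for the downloaded CSV files.
# Column names are assumptions; check the real headers before relying on them.
ENTITIES_CSV = """node_id,name
1,Example Holdings Ltd.
2,Sample Ventures Inc.
"""

OFFICERS_CSV = """node_id,name
10,Jane Placeholder
"""

EDGES_CSV = """node_1,rel_type,node_2
10,shareholder of,1
10,director of,2
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def index_nodes(*tables):
    """Build one lookup from node id to name across all node tables."""
    nodes = {}
    for table in tables:
        for row in table:
            nodes[row["node_id"]] = row["name"]
    return nodes


def connections(nodes, edges):
    """Group each edge under the name of the node it starts from."""
    links = defaultdict(list)
    for edge in edges:
        source = nodes.get(edge["node_1"], "?")
        target = nodes.get(edge["node_2"], "?")
        links[source].append((edge["rel_type"], target))
    return dict(links)


nodes = index_nodes(read_rows(ENTITIES_CSV), read_rows(OFFICERS_CSV))
links = connections(nodes, read_rows(EDGES_CSV))

for source, outgoing in links.items():
    for rel_type, target in outgoing:
        print(f"{source} -> {rel_type} -> {target}")
```

With real files, you would swap the inline strings for open() calls on the downloaded CSVs. The lookup-by-id step is the whole trick: each row of the edges table is just two node ids and a label for the relationship between them.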

It didn’t take me long exploring the CSV data in MySQL to discover some interesting patterns, such as numerous offshore entities associated with the same Canadian addresses and individuals. Far more research would be needed to determine the significance of these, so I’m not going to elaborate here.
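For readers who want to try this kind of query, here is a rough sketch of the general idea, not the exact queries I ran. It uses Python's built-in sqlite3 module in place of MySQL so that it runs without a database server. The table names, column names, edge direction and sample rows are all hypothetical placeholders, not real records.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL.
conn = sqlite3.connect(":memory:")

# Hypothetical schema mirroring the node and edge tables; the real
# column names may differ, so treat these as assumptions.
conn.executescript("""
CREATE TABLE entities  (node_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE addresses (node_id INTEGER PRIMARY KEY, address TEXT, country TEXT);
CREATE TABLE edges     (node_1 INTEGER, rel_type TEXT, node_2 INTEGER);
""")

# Made-up placeholder rows, not real data.
conn.executemany("INSERT INTO entities VALUES (?, ?)", [
    (1, "Alpha Placeholder Ltd."),
    (2, "Beta Placeholder Inc."),
    (3, "Gamma Placeholder Corp."),
])
conn.executemany("INSERT INTO addresses VALUES (?, ?, ?)", [
    (100, "1 Example Street, Halifax", "Canada"),
    (101, "2 Sample Road, Toronto", "Canada"),
])
# Assumed direction: the edge runs from the entity to its address.
conn.executemany("INSERT INTO edges VALUES (?, ?, ?)", [
    (1, "registered address", 100),
    (2, "registered address", 100),
    (3, "registered address", 101),
])

# Canadian addresses linked to more than one offshore entity.
SHARED_ADDRESSES = """
SELECT a.address, COUNT(DISTINCT e.node_id) AS entity_count
FROM edges AS r
JOIN entities  AS e ON e.node_id = r.node_1
JOIN addresses AS a ON a.node_id = r.node_2
WHERE a.country = 'Canada'
GROUP BY a.node_id, a.address
HAVING COUNT(DISTINCT e.node_id) > 1
ORDER BY entity_count DESC
"""

shared = conn.execute(SHARED_ADDRESSES).fetchall()
for address, entity_count in shared:
    print(f"{address}: {entity_count} entities")
```

The same SELECT statement should work in MySQL with little or no change. A shared address is only a lead, of course; it says nothing on its own about who is behind the entities or why.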

For those who have the greatest technical skill, or want to learn a new way of looking at data, ICIJ has provided the data in a download that includes a customized distribution of the Neo4j database and tutorials on how to use it. Be forewarned though: there’s a steep learning curve here if you don’t already understand how to use these specialized tools.

It can be kind of exciting to walk in the footsteps of such a large, important and collaborative investigation. In a way, we are in a new era of investigative work that takes on datasets so large, and subjects so expansive, that no single person or even single news organization has the resources to tackle them alone, be they financial resources or human ones. By making some of the data available, the team becomes even larger.

Happy hunting.