Mining for the best tools

This is a slightly updated version of the column that originally appeared in the Winter 2013 edition of Media magazine.

Oh, how times have changed. When I first started working with data in 1995, there were really just a few tools available for what was then the emerging field of computer-assisted reporting.
I remember going down to the Future Shop on St. James Street in Winnipeg to buy a copy of Microsoft FoxPro, a database management program that was at the head of its class at the time.
But there were very few tools to go with it, beyond the perennial favourite, Microsoft Excel, then in version 5.0. You could manipulate text files in a word processor, but that was about it. ArcView 1.0 was released that year, but it would be a while before mapping would become a mainstream data skill.
Fast forward to 2013, and the available options for the data journalist seem as limitless as the Prairie sky. In fact, while in 1995, you had to be ready to shell out a few hundred dollars for commercial software, today there is a free, open-source version for just about any task you might imagine. The open source tools aren’t always as easy or convenient to use as the commercial products, but because they are constantly being developed, they evolve and get better surprisingly quickly.
So what would the core toolset be for someone wanting to get started with analytical data journalism? By analytical data journalism, I mean data reporting that seeks to find patterns and stories in data before creating visualizations and other final products.
It has come a long way from the 5.0 days, but Excel is still pretty much a must- have for anyone who regularly works with data and wants to find the numbers that help define a story. It’s the standard tool for importing, sorting, filtering and summarizing data in a single table. These days, Excel can accommodate a million rows, and with the PowerPivot for Excel plugin, you can do some seriously heavy duty data crunching.
Because it is part of the standard Microsoft Office suite, Excel is also pretty affordable, especially if you buy it by itself or as part of a home or student edition. If you are at a university, you can also take advantage of academic prices that provide the complete Office suite for about $100. If you can’t or don’t want to pay for Excel, the free LibreOffice suite offers much of the same functionality at infinitely less cost (math joke!).
In actual practice, I find LibreOffice a bit crash-prone when you start to have a lot of windows open, or when you work with very large data files. However, for most routine tasks, it does a pretty good job. It even has its own version of Excel’s popular pivot tables feature.
Both Excel and LibreOffice are available for PC and Mac, while LibreOffice can also be installed on a Linux machine.
If you want a platform-independent alternative, there is always Google Drive. If your needs are basic and you don’t mind entrusting your data to Google, then the spreadsheets in Drive may be for you. For ease of use, they can’t be beaten, and you have access to your work from any computer connected to the Internet. Google spreadsheets are particularly useful for collaborative work because many users can share and update the data.
You can also create online forms that can be used to populate a Google spreadsheet. This is particularly useful if you want to gather data from a great many people, such as in crowdsourcing.
If you want to work with multiple, related tables of data, or if you simply prefer working with a database manager, there are many options available in relational database managers. In my own opinion, the current champ in terms of cost and relative ease of use is MySQL . Like LibreOffice it’s free. There is no compromise in terms of power. It can slice through queries summarizing millions and millions of rows of data, and if it’s on a reasonably powerful machine, it does it with ease. It’s also available for multiple platforms, including Linux, Windows and Mac, so you can use it no matter which hardware and OS you prefer.
While some had feared that Oracle Corp. would want to eliminate its open- source rival when it purchased MySQL, that hasn’t happened. If anything, it is easier to use and install than ever. A Mac install, which used to be tricky, to say the least, now goes smoothly on OS X. Of course, MySQL is a server database, and that makes it a little trickier to work with than a desktop program. You’ll probably have to learn some basic SQL, but this is not an enormous barrier for the technically minded.
MySQL provides its own front end called MySQL Workbench that is a pretty powerful tool for managing databases and querying data. Many other front ends are also available for MySQL, including the popular phpMyAdmin, Navicat (commercial software), and Sequel Pro, the latter being for Macs only.
Each database program has advantages and disadvantages.
Other free options include PostgreSQL and SQLlite. SQLLite is what is called an “embedded database engine” in that it does not run a server process on a computer. It definitely requires a higher degree of technical knowledge, but is renowned for its small size and adaptability, and is used a lot on mobile devices. But that’s another column.
There is, of course, the old standby, Microsoft Access. It’s part of the pro-version of the Office suite. Access has an easy- to-learn graphical querying interface, and handles relatively small datasets (under a million records if the records aren’t too long) relatively easily. But it still maxes out pretty quickly, so is better for working with smaller amounts of data than enormous datasets of millions of records. It’s also relatively cheap if bought on its own, though infinitely more expensive than the free options mentioned above. Access can also be used as a front end to server databases, using an ODBC connection. So, you can have the familiar user interface of access, but run your queries on a powerful MySQL server.
Back in ’95, mapping was in its infancy, but today there are lots of options available, and the open-source QGIS (for Quantum GIS and available for free at http://www.qgis.org/) is rapidly gaining on that old standby ArcView. Arc is now into version 10.2 and is the standard for pro-level GIS. It has a well-refined interface, and even the standard versions have a huge array of geodata processing tools. If you want to try it for non-commercial, individual use you can get a year-long license from ESRI for $100 plus tax. Otherwise, for commercial or ongoing use, it can become an expensive proposition, even for the most basic edition. That’s one reason why QGIS is becoming increasingly popular.
Even a couple of years ago, QGIS really wasn’t a viable alternative; too many key, core mapping features were missing. But while it still lacks Arc’s refinement and some of its power features, it has a lot of tricks up its sleeve with no expensive outlay. It also has an active user community, so answers to most problems are easy to find online.
There are also cloud-based GIS options. If your needs are limited to geocoding (determining latitudes and longitudes) for a few hundred points, or creating a colour- coded map to display geo-related data online, your needs may well be served by the features available in Google Fusion Tables and Tableau Public. More advanced users can take advantage of the Google Fusion Tables API, which like the Google Maps API, allows much greater control over the look and functionality of your online maps. Keep in mind that sites that generate a lot of traffic could hit Google’s usage limits, and you’ll have to pay Google for usage beyond the limits if you consistently exceed them. The limits are generous however, and you are unlikely to hit them except for the most popular sites. ESRI has also vastly improved its ArcGIS Online product, and the free public accounts can now be used by journalists without restriction.
Beyond the basic tools I have outlined here, there are many others available to data journalists. They include utilities for extracting data from PDF files, browser plugins such as iMacros for Firefox that can be used to automate data downloads, scripting languages such as Python, Ruby and Perl that can be used to “scrape” data from the web, and much more. These days, you are more likely limited by your technical prowess than by the lack of available tools. It is a very different world from 1995, and for those wanting to make a start in data journalism, the horizons are almost as limitless as the skies I used to enjoy on the Prairie around Winnipeg.

Related links:

MySQL: www.mysql.org
phpMyAdmin: http://www.phpmyadmin.net/home_page/ PostgreSQL: http://www.postgresql.org/
PowerPivot for Excel plugin: http://office.microsoft.com/en-ca/excel/powerpivot-for-excel-2010-features-and-benefits- HA101810445.aspx
LibreOffice suite: http://www.libreoffice.org/download
SQLite: http://sqlite.org/download.html
QGIS: http://www.qgis.org/ ARCGIS: http://www.esri.ca/en/content/ arcgis-home-use-program
Tableau Public: http://www.tableausoftware.com/public/
Google Fusion: http://www.google.com
ArcGIS Online: www.arcgis.com/home