When an out-of-control train carrying crude oil hurtled down a hill and exploded in the middle of Lac-Mégantic, Que., it was a wake-up call for Canadians living near railways. Forty-seven people died that day as much of the downtown of the picturesque town was incinerated.
For CBC Winnipeg data reporter Jacques Marcoux, there was an obvious and simple question: How many properties in his city are too close for comfort to the railway lines that converge on the city from every direction? To come up with an answer, he used a mapping program to see how many existing properties fell within recommended safety buffer zones between rails and people.
The safety zones were developed by the Canadian railway industry and Canada’s municipalities and made public just before the Lac-Mégantic tragedy. They were later adopted by Winnipeg officials in a November 2015 report.
The answer to Marcoux’s simple “how many” question was eye-popping: “close to 15,000 parcels of land in the city are either partially or completely within the minimum recommended distance from rail lines,” according to his story published in January 2016.
While, strictly speaking, the guidelines were intended for new development, because of the difficulty of retrofitting existing neighbourhoods, the story gave people information about potential dangers that they had never had before.
For Marcoux, it was one of the favourite pieces he has done since becoming a full-time data journalist last fall.
“It’s a perfect example of how multiple sources of data that were never intended to interact with one another can be skillfully manipulated to tell a whole other story in a simple way,” he said in response to emailed questions.
The data for the story came from two sources, both freely available on open-data sites. One was a map file of rail lines, and the other a similar file of property parcels in Winnipeg. By overlaying the two maps in a mapping program, Marcoux was able to determine how many properties fell within the recommended safety zones.
The first step was to add a new field to the rail line layer’s attribute table (a shapefile is made up of several files, one of which is a data table) to contain the minimum recommended distance for development for each type of track, such as yards, main lines and spur lines. This done, Marcoux created buffers along each section of rail, based on the new field he had created, using the free GIS program QGIS. Finally, he used QGIS’s Intersect geoprocessing tool to identify properties located within or partially within the safety buffers.
Marcoux says this was not a long-term story. In fact, “three entirely foreign data sources came together to produce this story within a half day.” He says the more complex part was the intellectual exercise. “The methods are fairly advanced, only because they require a very good sense of the ‘available data environment’ and to understand how they can come together and be compatible on some level.”
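For readers who want to see the logic of a buffer-and-intersect analysis outside a GIS program, here is a minimal sketch in plain Python. It is not Marcoux’s workflow, which was done in QGIS. The setback distances, track-type labels, coordinates and function names are all hypothetical placeholders, and the sketch assumes planar coordinates in metres. Rather than drawing buffer polygons, it flags a parcel when any of its edges comes within the setback distance of a rail segment, which gives the same answer for “within or partially within” as long as a rail line does not sit wholly inside a single parcel.

```python
import math

# Hypothetical setback distances in metres per track type.
# These are placeholders for illustration, not the guideline values.
SETBACK_M = {"yard": 300.0, "main": 30.0, "spur": 15.0}


def point_segment_distance(p, a, b):
    """Shortest distance from point p to the segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length_sq = dx * dx + dy * dy
    if length_sq == 0:  # degenerate segment: a and b are the same point
        return math.hypot(p[0] - a[0], p[1] - a[1])
    # Project p onto the line through a and b, clamped to the segment.
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))


def _orient(a, b, c):
    """Signed area test: positive if c is to the left of a->b."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def segments_cross(a, b, c, d):
    """True if segments a-b and c-d cross at an interior point."""
    return (_orient(a, b, c) * _orient(a, b, d) < 0
            and _orient(c, d, a) * _orient(c, d, b) < 0)


def segment_distance(a, b, c, d):
    """Shortest distance between segments a-b and c-d."""
    if segments_cross(a, b, c, d):
        return 0.0
    # Touching or overlapping cases are caught here with a distance of 0.
    return min(point_segment_distance(a, c, d),
               point_segment_distance(b, c, d),
               point_segment_distance(c, a, b),
               point_segment_distance(d, a, b))


def parcel_in_buffer(parcel, rail):
    """True if any parcel edge lies within the rail line's setback."""
    limit = SETBACK_M[rail["kind"]]
    ring = parcel + parcel[:1]  # close the polygon
    return any(segment_distance(a, b, c, d) <= limit
               for a, b in zip(rail["line"], rail["line"][1:])
               for c, d in zip(ring, ring[1:]))


def count_affected(parcels, rails):
    """Number of parcels inside or partly inside at least one buffer."""
    return sum(any(parcel_in_buffer(p, r) for r in rails) for p in parcels)


# Made-up data: one main line, one spur line and four square parcels.
rails = [
    {"kind": "main", "line": [(0, 0), (1000, 0)]},
    {"kind": "spur", "line": [(0, 500), (1000, 500)]},
]
parcels = [
    [(100, 20), (140, 20), (140, 60), (100, 60)],      # 20 m from main line
    [(100, 40), (140, 40), (140, 80), (100, 80)],      # 40 m from main line
    [(200, 480), (240, 480), (240, 490), (200, 490)],  # 10 m from spur line
    [(300, 520), (340, 520), (340, 560), (300, 560)],  # 20 m from spur line
]
affected = count_affected(parcels, rails)
```

The per-track lookup table plays the role of the new attribute field described above: each rail segment carries its own distance, so a yard can cast a much wider buffer than a spur.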
Marcoux has a pretty sweet gig at the CBC, one of the very first journalists to work with data full time. But he doesn’t come from a journalistic background. He has a business degree from the University of Manitoba, and worked in the finance industry before taking communications jobs with the Royal Winnipeg Ballet and the former Canadian Wheat Board. Only after that did he start working at CBC Winnipeg, first as a researcher, then two and a half years as a news reporter with the local Radio-Canada outlet, before finally landing the data journalism job on the English side.
Getting there required what he describes as “many hundreds of hours of unpaid, self-imposed overtime.”
“My experience has been that, in the context of budget cuts and increased workloads on multi-platform reporters, it can be hard for producers or editors to justify giving young reporters the kind of freedom required to ‘go fishing’ and explore potential data leads.”
When Marcoux told me that, it reminded me of my own formative years as a data journalist, also with CBC Manitoba. I had opened a storefront bureau in Brandon for CBC Radio and had learned how to use SQL at a NICAR bootcamp in Missouri in 1995. Hours and hours of my own time, much as Jacques described it, led to a story on political patronage in Manitoba’s version of Jean Chrétien’s federal infrastructure program. That story helped me land a job as a full-time investigative reporter in Winnipeg. I continued to pursue data stories on such things as contaminated properties and slum houses. This of course was long before the appearance of open data sites; even the public Internet was embryonic.
CBC Manitoba has a long history of supporting this kind of innovative, hard-hitting journalism and Jacques is continuing that tradition. It’s a busy job.
“My time is split three ways primarily: working within CBC Manitoba’s investigative unit that focuses on delivering national stories that can also be applied in various regions (we call them ‘pan cans’), assisting reporters from across the country with all kinds of data-related requests such as scraping, data base structuring, ATIP writing, many many PDF to machine-readable conversions, data analyses and spatial analyses. The last third is spent on developing my own stories over which I have full ownership.”
Marcoux has had a string of successes with his own pieces, including another enterprising piece mapping Winnipeg’s food deserts, areas of town without easy access to full-service grocery stores. To do that one, he drew circular buffers around the stores to show areas of town that were more than one kilometre from an outlet, the distance generally used by food safety experts as the outer limit of accessibility.
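The circular-buffer idea reduces to a distance check: a location is in a food desert if it lies beyond the cut-off distance from every store. The following is a rough sketch of that check, not the method used for the CBC story. The store coordinates, function names and the spherical-earth radius are assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius, assuming a spherical earth


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def in_food_desert(point, stores, radius_km=1.0):
    """True if the point is farther than radius_km from every store."""
    return all(haversine_km(point, store) > radius_km for store in stores)


# Made-up store locations, roughly in the Winnipeg area.
stores = [(49.90, -97.14), (49.88, -97.20)]

near_a_store = in_food_desert((49.905, -97.14), stores)   # about 0.56 km away
far_from_all = in_food_desert((49.93, -97.14), stores)    # over 3 km away
```

As the story’s near miss shows, the result is only as good as the store list: one unlisted grocery store can erase an apparent desert.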
“I simply had a hunch that if I took the time to geo-locate all the main grocery stores in Winnipeg I could probably come up with some interesting conclusions.”
But there was also controversy, particularly about the exclusion of some small stores from the analysis, as well as outlets such as Shoppers Drug Mart that sell some food items. “We came up with a methodology after discussing with food safety experts, so we could establish a cut-off. We included this methodology in our story. This was the most challenging part.”
The story underlined the crucial principle that stories must check out in the real world; you can’t rely on data alone. “I ended up finding a grocery store that was unlisted from all websites within what we almost declared as our ‘food desert,’ days before publication…I was so nervous that I had missed a store, that I hopped into my car at night and drove every main street in the entire 30-square km food desert and stumbled upon it.”
A huge credibility hit for the story, narrowly avoided.
Marcoux often works closely with his colleagues, including on a large team project on boil water advisories in First Nations communities that showed two-thirds of First Nations communities in Canada had been under at least one drinking water advisory in the last decade. Unlike when he did his railway story, there was no easy access to open data, though there is a limited amount of information on boil water advisories on Health Canada’s website.
“The…story was actually mostly the result of endless phone battles with communications people from various levels of government over five months,” he said. “One thing I’ve noticed as a data journalist is that a significant amount of my time is spent simply negotiating for data. Governments don’t mind parting with reports and other documents, but they do not like giving away raw, unfiltered data, and that’s because they often contain the truth.”
From there, Manitoba’s investigative unit had to do extensive data wrangling to bring together datasets from many jurisdictions into one with a standardized format, he said. First Nations were also listed with inconsistent spellings, and often with multiple names for the same community, which required more cleaning.
Once the data was obtained and cleaned, the analysis was mostly straightforward in a spreadsheet, though “to answer one question in particular I had to reach out to one of our programmers in Toronto to write a script in Python in order to help calculate the total number of consecutive days each First Nation was under an advisory.”
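A common way to handle that kind of cleaning is to reduce each name to a canonical key and then map known alternate names onto one preferred spelling. The sketch below illustrates the idea only. The community names and the alias table are invented, and it makes no claim about how the CBC team actually cleaned its data.

```python
import re
import unicodedata

# Hypothetical alias table: alternate name -> preferred name.
# Keys and values are already in normalized form.
ALIASES = {
    "example lake band": "example lake first nation",
    "example lake indian reserve": "example lake first nation",
}


def normalize_name(raw):
    """Reduce a community name to a canonical lowercase key."""
    # Split accented characters into base letter plus accent, then drop accents.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase, and collapse punctuation and runs of whitespace to one space.
    text = re.sub(r"[^a-z0-9]+", " ", text.casefold()).strip()
    # Map known alternate names onto the preferred one.
    return ALIASES.get(text, text)


variants = ["  Example Lake  Band ", "Exámple-Lake First Nation", "EXAMPLE LAKE FIRST NATION"]
keys = {normalize_name(v) for v in variants}  # all three collapse to one key
```

The automated step only catches spelling and punctuation differences. The alias table still has to be built by hand, by someone who knows which names refer to the same community.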
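The CBC script itself is not public, so what follows is only a guess at the kind of calculation involved. It assumes each record is a community name with an advisory start and end date, merges advisories that overlap or run back to back, and reports the longest unbroken stretch in days for each community. The record layout, names and dates are all hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def longest_advisory_streaks(records):
    """Map each community to its longest unbroken run of advisory days.

    records: iterable of (community, start_date, end_date), dates inclusive.
    """
    by_community = defaultdict(list)
    for community, start, end in records:
        by_community[community].append((start, end))

    streaks = {}
    for community, spans in by_community.items():
        spans.sort()  # order by start date
        longest = 0
        run_start, run_end = spans[0]
        for start, end in spans[1:]:
            if start <= run_end + timedelta(days=1):
                # Overlapping or back-to-back advisory: extend the current run.
                run_end = max(run_end, end)
            else:
                # Gap of at least one clear day: close the run, start a new one.
                longest = max(longest, (run_end - run_start).days + 1)
                run_start, run_end = start, end
        streaks[community] = max(longest, (run_end - run_start).days + 1)
    return streaks


# Made-up advisory records.
records = [
    ("Community A", date(2010, 1, 1), date(2010, 1, 10)),
    ("Community A", date(2010, 1, 8), date(2010, 1, 20)),   # overlaps the first
    ("Community A", date(2010, 3, 1), date(2010, 3, 5)),    # separate, shorter run
    ("Community B", date(2012, 6, 1), date(2012, 6, 1)),    # single-day advisory
    ("Community C", date(2011, 1, 1), date(2011, 1, 5)),
    ("Community C", date(2011, 1, 6), date(2011, 1, 10)),   # back to back
]
streaks = longest_advisory_streaks(records)
```

Merging before counting matters because a community can be under several advisories at once, and adding their lengths together would count the same days twice.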
It’s a lot of work that Marcoux says a lot of reporters would pass on, but which is worthwhile to get the big story. He is quick to point out that the water story was far from his alone and that many others were involved in the hard work of assembling the data and landing the story.
Marcoux uses tools that will be familiar to most journalists doing data analysis, with Excel “the only way to go” for straightforward, daily stories. For larger and more complex analysis that goes beyond Excel’s capabilities, he uses a relational database, either Microsoft Access or the open-source MySQL. Marcoux’s go-to tool for cleaning up datasets is OpenRefine. He relies on CBC’s interactives team in Toronto for more advanced presentations, but for basic maps uses either Google Maps or CartoDB.
It all makes for what he describes as “the best job I’ve ever had.”
Recently, he had some fun with race results for the annual Manitoba Marathon, discovering that over time, performance has actually worsened. It’s a great example of enterprising reporting, drawing on a data source that nobody had previously thought to aggregate.
“I was a competitive athlete for nearly a decade and this job allows me to aggressively pursue other goals with as much vigor and passion as I did when I was racing. I have somewhat of an all-or-nothing approach to things in general, so I have to check myself once in a while, otherwise I can easily become consumed by the endless story leads data journalism creates.”
Editor’s note: The text of this story has been altered since publication to clarify the role Marcoux played in the boil water advisory story.