Earlier this week I posted an item about stagnation at NYC Data Mine as well as my thoughts more generally on the city’s #opendata policies and practices. Today I discuss another challenge regarding open data: data quality and poor metadata.
We recently updated the OASIS community mapping website with several data sets: community gardens, subways, bus routes, bike lanes, and more. We also updated the map layer representing New York City park areas. That might seem straightforward for a website focused on open space. And we were using information from New York’s Data Mine website, which is intended to promote ease of access and use of the city’s publicly available data sets.
But adding the latest parks data to OASIS was much more complicated than it needed to be. This post describes why, explains how we got around the complications, and offers suggestions for improvement going forward. I also include links below to my updated versions of the parks (and playgrounds) data.
Data can be messy, no question about it, but the hassles with the parks data provide an example of the challenges that remain for cities to embrace #opendata, and for the public (or even app developers, for that matter) to seamlessly make use of public data sets.
2006 was the last year that we requested a GIS data set of park properties from the NYC Dept of Parks and Recreation (DPR) for use on the OASIS website. The Parks Department is a partner in the OASIS project, and was one of the project’s founding organizations. The agency sees the value in mapping open space resources beyond just city data, integrating (as OASIS does) a wealth of data layers to provide a comprehensive picture of open space issues in any given neighborhood and citywide.
For various reasons (project transitions, mapping application updates, other priorities) our team at the CUNY Graduate Center hadn’t implemented major data upgrades to the OASIS website till recently. Even last year (2009) when the city’s Data Mine was launched, we didn’t update the parks data on OASIS – the earlier parks data we were using seemed to be more comprehensive. As a result, I never looked closely at the parks data from Data Mine until this summer, when we started planning for a major data update on OASIS. When I finally did, I was frustrated and disappointed.
It’s important to note that I don’t fault the Parks Department, per se, for the difficulties I encountered. I think the problem has to do with a disconnect that often exists between data creators and data users, with little being done by the city itself to mediate. The Data Mine concept is a good start. But for a meaningful open data effort, data should be vetted before it’s published, helpful metadata has to be included with each data file, and a dialogue should be fostered to help agencies understand how others seek to use their data in order to create an opportunity to learn from each other. The parks data just happens to be one set of files that I’ve focused on, but these problems aren’t unique to parks, as others have pointed out (including me — I feel like I’ve been on a tear lately, blogging about data sets with great potential but that need lots of work before they’re re-used).
Data Mine disappointments
Data Mine has three types of “parks” data. One is the geographic data from the Parks Department (in particular, the “Map of Parks” file, downloaded as “DPR_Parks_001”). The second is the “raw” data from DPR. The third is a separate “OPEN_SPACE” geographic dataset from the city’s Dept of Information Technology and Telecommunications (DoITT).
With all those options, how can you go wrong? Here’s how:
The DPR geographic data is a great visual depiction of the park areas. But, in GIS parlance, the dataset contains almost no attributes. In other words, each park area in the dataset is identified only by its park ID (such as “M010”, which happens to be the code for Central Park – see www.nycgovparks.org/parks/M010). No name or other information is provided except for an undefined “category code”.
Even the park ID is hard to identify – there’s a “GIS_Propnu” field and an “OMP_PropID” field that each contain values in the “M010” format. For the most part the values in both fields are identical. But there are 43 records where these fields don’t match (see below). I have no idea why they don’t match, but it turns out that the “OMP_PropID” values work in the DPR URL scheme and the “GIS_Propnu” values don’t. So I went with the OMP data.
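A check like this can be sketched in a few lines of Python. This is only an illustration, not necessarily how the 43 mismatches were found: it assumes the shapefile's attribute table has already been read into a list of dictionaries (for example, from a CSV export), and the sample rows and the function name are made up for the example.

```python
def find_id_mismatches(records, field_a="GIS_Propnu", field_b="OMP_PropID"):
    """Return the records whose two ID fields disagree after trimming whitespace."""
    mismatches = []
    for rec in records:
        a = (rec.get(field_a) or "").strip()
        b = (rec.get(field_b) or "").strip()
        if a != b:
            mismatches.append(rec)
    return mismatches


# Hypothetical rows; the second value pair is invented for illustration.
sample = [
    {"GIS_Propnu": "M010", "OMP_PropID": "M010"},
    {"GIS_Propnu": "X008A", "OMP_PropID": "X008"},
]

for rec in find_id_mismatches(sample):
    print(rec["GIS_Propnu"], "!=", rec["OMP_PropID"])
```

Listing the mismatched rows side by side makes it easier to test each value against the DPR URL scheme and decide which field to trust.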
And the values in the “cat_code” field (1, 2, and 4) are not explained. I’ve even talked with DPR employees who use GIS data, and they weren’t familiar with the details of the category codes.
Then I looked at the “raw” data. I assumed the raw data would include a file to link park IDs with park names. Nope. The DPR “raw” data includes lists of many different types of park features (directories of barbecuing areas, beaches, dog runs, nature centers, playgrounds, etc). But there’s no overall list of actual parks. And of the 21 “raw” data files related to park properties from DPR, none of them provides park IDs. They all include names and other attributes (generalized address info, website URLs, etc), but no IDs.
The closest we get from DPR’s “raw” data is a list of “capital projects” which includes names and park IDs, but they look like this:
So you need to do some text parsing to extract just the park IDs. And this wouldn’t even provide a complete list. The capital projects file includes 580 unique park names – far short of the 1,956 features included in the park geography file – as well as more than 330 entries with blank names.
After exhausting the options with the DPR data, I turned to the DoITT data. Aha, the “OPEN SPACE” geographic data has park names and park IDs! (It’s described as a “Planimetric basemap polygon layer containing open space features, such as parks, courts, tracks, cemetery outlines, etc.”)
But looking a bit closer, here’s why the DoITT file isn’t very helpful:
- According to its metadata, it hasn’t been updated since 2006 – no better than the data that we already had on OASIS.
- This file includes 1,600 unique names, but that count includes areas not covered by DPR, such as cemeteries, so it’s not a complete list that will match the “DPR_Parks_001” geographic file.
- Also, the naming conventions don’t really follow conventions. For example, “Greenstreets” is spelled 8 different ways, and there’s a mix of abbreviations and annotations (qualifiers in parentheses, etc.), extra (leading) spaces, misspellings, inconsistent spellings, etc.
- There are more than 2,700 unique park numbers, but this includes park IDs that are blank (304 times) or IDs that don’t match the DPR list (for example, 158 records have “unset” listed in the Park ID field).
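Problems like the ones in this list can be surfaced with a quick audit script. The sketch below is an assumption-laden example rather than the procedure used here: it assumes the name and ID columns have been pulled into plain Python lists, the function names are hypothetical, and the sample values are invented. Note that the name grouping only catches differences in case, spacing, and punctuation; true misspellings would need fuzzy matching.

```python
from collections import Counter, defaultdict


def audit_ids(ids):
    """Count blank, 'unset', and other values in a park ID column."""
    counts = Counter()
    for raw in ids:
        value = (raw or "").strip()
        if not value:
            counts["blank"] += 1
        elif value.lower() == "unset":
            counts["unset"] += 1
        else:
            counts["other"] += 1
    return counts


def group_name_variants(names):
    """Group names that differ only by case, spacing, or punctuation."""
    groups = defaultdict(set)
    for name in names:
        key = "".join(ch for ch in name.lower() if ch.isalnum())
        groups[key].add(name)
    # Keep only the keys that have more than one distinct spelling.
    return {key: variants for key, variants in groups.items() if len(variants) > 1}


# Invented sample values for illustration only.
print(audit_ids(["M010", "", "unset", None, "  "]))
print(group_name_variants(["Greenstreets", " GreenStreets", "Green Streets", "Central Park"]))
```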
So as far as I can tell, it’s impossible to use the Data Mine data alone to link park names to map geography. Maybe someone could match the data by hand (creating a map that labels the park areas by ID, and then comparing that with a DPR map with park name labels, and then manually entering those park names in DPR’s GIS file of park IDs). But this would be so prone to error it wouldn’t be worth the trouble, and it also undermines the idea of providing “machine readable data” from Data Mine in order to automate how we access and analyze the information.
Since I didn’t find what I needed on Data Mine, I reached out directly to DPR for a file that links park IDs and park names. The response was: a) wait for Data Mine to be updated; or b) if I can’t wait, then DPR needs to check with its public relations office before giving me the file. Well, so much for openness and transparency. Sigh.
How did BigApps developers handle this?
This made me wonder how the BigApps competitors could have created their applications. Several of them submitted apps that displayed maps of park locations showing park names. I asked a couple of them how they did it. One of them didn’t answer me directly, but instead suggested that I could “hire some free student labor to go through by hand for two days” to link the IDs/names manually. Not very helpful. Another BigApps project used the DoITT list of IDs/park names. But for the reasons discussed above, that’s inadequate for our purposes.
On OASIS, a current list of park IDs and names is essential. First, we want to display the latest and most accurate information for our visitors. Second, we not only display the park names on the map, but we use the park IDs to create a park-specific URL that sends an OASIS visitor to the DPR website to access the wealth of info DPR maintains about each park.
I figured the Parks Department must have better data itself, so I looked online to see what I could find. In the “Explore Your Park” section of DPR’s website, they have lists of parks by borough. Each park is displayed by name as a link, and the underlying URL includes the park ID (for example, the URL for Claremont Park in the Bronx is http://www.nycgovparks.org/parks/X008).
Here’s the root URL for the parks lists: http://www.nycgovparks.org/sub_your_park/park_list/full_park_list.html?boro=X (just change the last letter for each borough – X is the Bronx, B is Brooklyn, M is Manhattan, Q is Queens, and R is Staten Island).
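The URL patterns above are simple enough to generate in code. Here is a minimal Python sketch that uses only the URLs and borough letters quoted in this post; the function and constant names are my own for the example, and the addresses reflect the DPR site as described here, so they may have changed since.

```python
BASE_LIST_URL = "http://www.nycgovparks.org/sub_your_park/park_list/full_park_list.html"
PARK_PAGE_URL = "http://www.nycgovparks.org/parks/"

# Borough letters as described above.
BOROUGH_CODES = {
    "X": "Bronx",
    "B": "Brooklyn",
    "M": "Manhattan",
    "Q": "Queens",
    "R": "Staten Island",
}


def park_list_urls():
    """Map each borough name to the URL of its park list page."""
    return {name: f"{BASE_LIST_URL}?boro={code}" for code, name in BOROUGH_CODES.items()}


def park_page_url(park_id):
    """Build the park-specific URL from a park ID such as 'X008'."""
    return PARK_PAGE_URL + park_id.strip()


for borough, url in park_list_urls().items():
    print(borough, url)
print(park_page_url("X008"))
```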
So I scraped these pages and stripped out the extra HTML code, leaving just the park names and IDs in order to create my own crosswalk table. Then I joined the names with the DPR geography file using the IDs. Not pretty, but more comprehensive, accurate, and up-to-date than relying on the problematic Data Mine data.
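For anyone who wants to repeat this, the scrape-and-join step might look something like the following Python sketch. It is not the actual script used for OASIS. It assumes each park appears as a link whose URL ends in /parks/ followed by the ID, that the page HTML has already been downloaded into a string, and that the geography records are available as dictionaries keyed by “OMP_PropID”. The class and function names and the sample HTML are hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumption: the park ID is the last path segment after /parks/.
PARK_HREF = re.compile(r"/parks/([A-Za-z0-9-]+)/?$")


class ParkLinkParser(HTMLParser):
    """Collect {park ID: park name} from links that point at park pages."""

    def __init__(self):
        super().__init__()
        self.crosswalk = {}
        self._current_id = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            match = PARK_HREF.search(href)
            if match:
                self._current_id = match.group(1)
                self._text = []

    def handle_data(self, data):
        if self._current_id is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_id is not None:
            # Collapse runs of whitespace inside the link text.
            name = " ".join("".join(self._text).split())
            if name:
                self.crosswalk[self._current_id] = name
            self._current_id = None


def build_crosswalk(html):
    """Parse one park list page and return its ID-to-name table."""
    parser = ParkLinkParser()
    parser.feed(html)
    parser.close()
    return parser.crosswalk


def join_names(records, crosswalk, id_field="OMP_PropID"):
    """Copy each record and add PARKNAME (None when the ID has no match)."""
    joined = []
    for rec in records:
        row = dict(rec)
        row["PARKNAME"] = crosswalk.get(row.get(id_field))
        joined.append(row)
    return joined


# Made-up HTML fragment in the assumed shape of a park list page.
sample_html = (
    '<ul><li><a href="http://www.nycgovparks.org/parks/X008">Claremont Park</a></li>'
    '<li><a href="http://www.nycgovparks.org/parks/M010"> Central\n Park </a></li>'
    '<li><a href="/about">About</a></li></ul>'
)
crosswalk = build_crosswalk(sample_html)
print(join_names([{"OMP_PropID": "M010"}, {"OMP_PropID": "Z999"}], crosswalk))
```

Records that come back with no PARKNAME are worth reviewing by hand, since they are the IDs that exist in the geography file but not on the website.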
The DPR website with park names and IDs also includes playgrounds. So I also joined the scraped list to the Data Mine layer for playgrounds (“DPR_playgrounds_001”).
I added the following fields to each file based on the DPR website data: PARKID, PARKNAME, NameMain, NameSuffix (some names had text in parentheses that I separated out to this field), and Borough. The other fields were in the original shapefile from Data Mine.
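Splitting a name into NameMain and NameSuffix can be done with a short regular expression. The sketch below is one possible approach, assuming the qualifier is a single trailing pair of parentheses; the function name and the example names are invented for illustration.

```python
import re

# Assumption: the suffix is one trailing parenthesized qualifier.
TRAILING_PARENS = re.compile(r"^(?P<main>.*?)\s*\((?P<suffix>[^()]*)\)\s*$")


def split_park_name(name):
    """Return (NameMain, NameSuffix); the suffix is '' when there is none."""
    cleaned = " ".join(name.split())
    match = TRAILING_PARENS.match(cleaned)
    if match:
        return match.group("main"), match.group("suffix").strip()
    return cleaned, ""


# Hypothetical names, not taken from the DPR list.
print(split_park_name("Example Park (North)"))
print(split_park_name("  Example   Playground "))
```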
Here are the GIS files (ESRI shapefiles) for parks and playgrounds; use them as you wish:
Data Mine improvements?
After I did my screenscraping work, I found a couple of tools that were created to streamline access to Data Mine files. One developer created a service that converts files in Excel (XLS) format or some other format not easily “consumable” by applications or web services. His tool is called elev.at — more info here and here. But even this effort to fix one of Data Mine’s problems wouldn’t have helped with the parks data — converting from Excel to XML would’ve improved the format but not the data quality itself.
The Data Mine files may be good enough for someone throwing together a quick mobile app to enter a competition. But the city’s data – and apps created with the city’s data – should be better than that. We should expect that city data is reliable, current, and easily accessible. My experience with parks data from Data Mine reminds me that the city still has work to do to meet this goal.
Presumably the Parks Department itself has a better system. But this obviously didn’t make it into Data Mine. Hopefully this will be fixed the next time Data Mine is updated (if that ever happens – more on that in my earlier post).