This evening I received a call from NYC DoITT. They were mainly calling to tell me that they changed the official rules for BigApps 3.0. Yesterday the rules said that no new data would be added to the OpenData site until after the BigApps competition. As I said in my blog, why wait? But DoITT saw that and agreed. So now that clause has been removed from the rules (see section D.1). DoITT says that they agree the data should be accessible whether there’s a competition in effect or not. That’s great news! I’m looking forward to more dialogue on the other issues I’ve raised below.
New York City yesterday announced its new version of what had been called its “Datamine” website, a single online point of entry to access the city’s digital data holdings.
NYC’s Datamine was an improvement in some ways over earlier opendata efforts in New York. Now that it’s been around for two years, I think it’s fair to say that Datamine is clunky at best. For me, I can’t wait for it to be replaced by something better. I’m looking forward to the NYC/Socrata roll out.
But is “NYC OpenData” any better than Datamine?
After digging into the site for several hours last night and today, I’d have to say yes and no. It has some great stuff with great promise, but it still falls flat in some key areas. I look forward to using it for the APIs, but for the raw data I’ll go back to the individual agencies that in many cases are doing a better job of providing access to the data. Overall the city has come a long way with open data, but I still think the city’s concept of data-as-economic-engine is misguided. More on that below.
Socrata’s platform is impressive. I’ve blogged about it before, but it’s worth summarizing some of the high points:
- You can immediately preview the data in your browser (no downloading needed just to see what it contains). And you can view more details about each row in the file — very helpful if you’re interested in one particular aspect of the data.
- You can visualize the data in multiple ways — using an interactive map option built into the platform or using one of 9 different chart options.
- If you want to download/export a data set, they give you at least 8 formats for extracting/exporting.
- Short links and “perma” links are available to each data set.
- There’s a “Discuss” option where anyone can attach notes and commentary for each data set. It’s user-generated metadata — you can immediately see, for example, if anyone else has commented about the data’s quality, or completeness, or how up-to-date it is.
The big news with this new approach is the availability of an API for programmatic access for each data set in the Socrata system. On its face, the APIs look great, and the city deserves kudos for implementing them. Socrata has developed a template for developers to hook into the data — either row by row, selected queries, or to view metadata — and the template also provides data publishers with guidance on how to structure their data for automated consumption. And, it seems that DoITT has created web services for the mapped data sets, which is a big step forward.
There are other improvements with specific data sets, such as:
- It looks like the map data for NYC park boundaries is fixed — I posted a detailed review last year about how the parks data via Datamine was basically impossible to use. I had to scrape the NYC Parks website to convert it to a useful format. But now the park names are included with the park IDs in the same file. (However, this improvement is tempered by the fact that I can view the map of parks on NYC’s Socrata website, but I can’t download the data in a mapped format. I discuss that in more detail below.)
There are some interesting new data sets. Two things that caught my eye are:
- School zones are included in the data, which is something I had urged the city to include [PDF] when the BigApps competition was first announced in 2009. (School zones are the key determinant as to where your child can attend public elementary school, rather than the administrative school districts.) But the earlier version of Datamine included school zone boundaries, so this isn’t really new.
- HPD Registrations. Unfortunately the data dictionary accompanying this file can be cryptic, so I couldn’t easily decipher exactly what the file includes. But it seems to be a list of almost 140,000 buildings in the city registered as “multiple dwellings” along with each building’s landlord/owner, managing agent(s), and building details. Should come in pretty handy for anyone interested in the landlord landscape in New York.
Here’s an example of why the data dictionary is not very helpful – the excerpt below is trying to tell us what the “REG-INDV-HM-UNIT-NO” field means:
I thought it was also intriguing (in an insider baseball kind of way) that the interactive maps used at the NYC Socrata site to show mapped views of the data are from ESRI. And the API/web services provided for the mapped data files are ESRI-based. DoITT’s GIS unit has made a point of using non-ESRI technology for its interactive maps (Citymap, Scout, ZoLA, etc). But the GIS web services for Socrata all come from DoITT. Wonder what’s happening there.
The Mayor’s news release about the new Socrata site proclaims that more than 230 new data sets are included. We don’t get any details about which ones; the release simply says that:
Examples of this new data include a directory of HHC Facilities; electricity, gas and steam consumption available by zip code; and school attendance and report statistics.
But I looked pretty closely at what new data sets I could find, and I was hard pressed to identify more than a few dozen.
Examples of old data masquerading as new simply because it’s available through the new Socrata site include many of the files from NYC’s Dept of Finance, such as:
- Condominium comparable rental income listings (38 individual datasets);
- Cooperative comparable rental income listings (40 datasets); and
- Summary of Neighborhood (Property) Sales (21 datasets).
That’s almost 100 data sets right there, close to half the number the city says are newly available. But each of these has been online, for free download, at Finance’s website for several years. This page notes that coop sales information has been available since 2006, and Finance started making the data available for batch download a couple of years after that. The Neighborhood Sales data was put online a couple of years ago. And Finance’s website has more thorough information about the data sets and how to use them than the Socrata site.
Other not-so-new examples include:
- Street centerlines. These are from DoITT circa 2009. In contrast, the City Planning Department “LION” file at DCP’s website is from September 2010, and is updated regularly.
- Building perimeters. From DoITT circa 2010. But DoITT has a more recent file at their website for direct download (click through the online agreement and you’ll find building footprints from March 2011).
- Coastal boundaries. From City Planning, but this was posted on the Bytes of the Big Apple site last month. Great data set, but not new.
- Campaign contributions. From the NYC Campaign Finance Board. The data is current (covering the 2013 election cycle), but the files are already available in batch format and via a searchable website from CFB.
- Landmarks data. There are multiple, conflicting data sets at the NYC Socrata site regarding landmarks. For example, one data set of “NYC Landmarks” is from 2009, another (called “LPC Landmark Points”) is from 2010. Either way, there have been several new landmarks and historic districts designated since then by the Landmarks Commission.
Even if there was only one new data set in the new Socrata site, that’s better than nothing. But there’s so much data maintained by city agencies that is still not easily, publicly accessible. My blog post when BigApps was first announced in 2009 has a listing of some key data files that still haven’t seen the light of day.
The city should be doing a better job, especially since there’s been so much pressure on them to improve their open data policies, they have an avowed policy of doing so, and they’re also required to do so under state law (FOIL). Frustrating.
One of my biggest and longest standing gripes is about property data. There are a number of property-related files on the NYC Socrata website. But nothing that allows us to come close to the City Planning Department’s “MapPLUTO” dataset. The city still charges a fee (up to $3,000 per year) with a restrictive license agreement in order to access the PLUTO data: a mapped file of all properties in NYC with a wealth of information about each one (zoning, ownership, building heights, land use categories, assessed value, etc). It’s an essential data set for anyone trying to understand real estate, urban planning, neighborhood change, and more in the city.
When will City Planning get it? They’ve done such a great job of making other data sets available — files they used to charge for but now provide for free, and in better formats, with great metadata, and updated frequently. The agency obviously spends a lot of time preparing these other data sets that are freely available, so I don’t buy the argument that the PLUTO fee covers their “costs” of doing extra work to put PLUTO together. I just don’t understand. And property data is so incredibly useful in NYC — certainly to the big real estate players, but I’m not concerned about them. If it were free for everyone, at least we’d have a chance at a level playing field — helping “the little guy” do property analysis and mapping so he/she can analyze land use, understand policy implications, etc.
Data for people, not just machines
Data access — at least in this first iteration of the new Socrata site — seems to be weighted toward APIs, and therefore app developers. I understand the value of the API approach — I’ve developed apps myself, and at CUNY we have online sites that can definitely make use of the APIs. And I was kind of amazed that DoITT opened these up. So the APIs are good, and perhaps they’re worth the effort to create and maintain a one-stop-shop like NYC Socrata.
But for the average user (someone at a Community Board, or a local media outlet, or a City Councilmember’s office) the city’s implementation of the Socrata system seems stacked against them.
For example, with one or two exceptions I wasn’t able to download any mapped data sets from NYC Socrata. Many files (45 by my count) are described as “GIS datasets”, and they’re obviously in ESRI’s “shapefile” format to begin with, but the “Export” option only provides flat files (CSV, JSON, XLS, XML for example), and not even the now-ubiquitous KML format (used by Google and many others).
If I click the API link for these data sets, this enables me to view the data as map layers in my desktop GIS application. But I can’t extract any actual data from these links in order to work with it on my own. The screenshot below (from ESRI’s ArcCatalog application) seems promising, but the inability to download the mapped data itself is very limiting.
It’d be easy enough (I’m assuming) to just add shapefiles to the list of Socrata’s data export formats. The shapefile format (.SHP) is already basically an open one (all the major open source GIS packages read it), so why force GIS users to do extra work to access GIS data? And why have DoITT go through extra work converting from SHP to something else, just to have the user convert it back again? For “point” locations this isn’t a big deal; it’s easy enough to convert latitude/longitude coordinates into a mapped data set. But this isn’t straightforward at all for polygons (district boundaries, for example) or lines (streets, transit routes, etc). I’m not saying don’t provide the data in the other formats, just add SHP to the list where appropriate. (Some GIS datasets are available as GIS downloads: school zones, for example. But this is an exception, as far as I can tell.)
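To show why point data is the easy case, here is a minimal sketch in Python of turning a flat CSV export into a mapped layer. It assumes the export has columns named "latitude" and "longitude" (the real column names will vary by data set), and the sample rows are made up for illustration. It writes GeoJSON rather than a shapefile, simply because that needs nothing beyond the standard library.

```python
import csv
import io
import json


def rows_to_geojson(csv_text, lat_field="latitude", lon_field="longitude"):
    """Turn CSV rows with lat/lon columns into a GeoJSON FeatureCollection.

    The default column names are assumptions; pass the names that the
    actual export uses.
    """
    features = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            lat = float(row[lat_field])
            lon = float(row[lon_field])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without usable coordinates
        # Everything except the coordinate columns becomes an attribute.
        props = {k: v for k, v in row.items() if k not in (lat_field, lon_field)}
        features.append({
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": props,
        })
    return {"type": "FeatureCollection", "features": features}


# Hypothetical sample rows, not taken from any real NYC data set.
SAMPLE_CSV = """name,latitude,longitude
Site A,40.7128,-74.0060
Site B,not available,-73.9500
"""

if __name__ == "__main__":
    print(json.dumps(rows_to_geojson(SAMPLE_CSV), indent=2))
```

Nothing comparable works for polygons or lines, because a flat export has nowhere obvious to carry the list of vertices that defines each boundary or street segment. That is the gap a native GIS download would close.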
Indeed, not having GIS-ready formats is a step backward. If I visit the City Planning Department’s “Bytes of the Big Apple” website, I can download a wealth of files in GIS format, and several of them are updated regularly. It’s great. Hopefully the NYC OpenData site doesn’t supplant the individual agency sites. For now, they’re better for me, and I’d imagine they’re better for many other users.
And having the raw data, rather than just API access, gives users more flexibility. For example, during the preparation for Hurricane Irene, several organizations downloaded NYC Datamine files in GIS format to create interactive maps of evacuation zones and evacuation sites. (And these groups helped the city in a big way because the city’s own maps and website were down, making it difficult if not impossible to get essential information from NYC.gov.) But the city changed several of the evacuation sites just a day or two before the storm was going to hit. If the outside organizations didn’t have the raw data that we could update ourselves, our presentation of the evacuation sites would’ve been incorrect and misleading. I wouldn’t want to rely on the city updating its API in a crisis situation like that, given how rocky the city’s digital response was to the storm itself.
Tying open data to app competitions & economic growth is the wrong approach
(Note: my concern here still stands, but the city has modified its position a bit, which is great. See the 10/13 Update above.)
I think the real issue here is that the city’s open data efforts are being driven more by the desire to use data access as a way to leverage economic development, and less by true government transparency.
For example, as with the first two BigApps competitions, no new data files will be added to the Socrata site until the latest BigApps competition is over (see section D.1 at the official rules). Why wait? Why should app developers get preference? What about the rest of us? Is NYC providing data just so app developers can do free work for the city, and so the city can make a news splash about open data? Open data should be open 24/7 — and should be updated on a regular basis — not just when it’s convenient for the city and for developers.
I understand that the new NYC Socrata site is a work in progress, and will almost certainly be improved going forward. But for now, although it includes lots of data, much of this has already been available elsewhere. The APIs are intriguing, but I hope they don’t preclude other ways for people rather than machines or apps to access the data.
At this point, with few exceptions I still would prefer to go to the individual agency websites (or even talk to agency staff and request the files via email, or even via disks & snail mail!) to get the data — from what I’ve seen so far, chances are it’ll be more timely, in better quality, and I’ll have better access to metadata/explanations of the files.
I’m even wondering if instead of a Socrata-like site, it might not be better to encourage the agencies directly responsible for creating the data to continue efforts to provide public access, and to have them engage with people using the data so they’d see the benefits of open data (and/or realize that it’s not so bad to provide access to their files to the broad public in easily accessible ways). At the least, the new NYC Socrata site shouldn’t preclude this agency-specific work from being done.
I’ve already had a good, late-night exchange on Twitter with DoITT on some of these issues. I’ll be submitting feedback directly at the Socrata website. And hopefully the dialogue will continue.