• SR_spatial tweets

Mapping Hurricane Irene in NYC (plus some thoughts on the city’s digital response to the storm)

A disaster, natural or otherwise, always creates an opportunity to demonstrate the power of maps. Hurricane Irene did not disappoint. In New York City, which hadn’t seen a hurricane of this magnitude in decades, there were at least a half dozen websites with interactive maps related to the storm (plus at least one PDF map – more on that below) that were used extensively and were tweeted about extensively. My team at the CUNY Graduate Center was in the mix with our OASISnyc.net site, and I was watching with keen interest as more maps kept coming online as Irene kept coming closer. I thought I’d share some observations below about how Irene was mapped in New York.

I think I kept good track of the various maps that were deployed, but I’m sure my list and descriptions are incomplete so please chime in if I’ve missed anyone or mischaracterized any of the efforts.

The Context

Hurricane maps are nothing new, but usually the maps show the path of a hurricane while it’s happening or analyze its impact after the storm has passed. This time, for New York City, the more interesting and useful maps were focused primarily on the possibility of evacuation, and the potential impact of the storm on New York’s shores.

(That said, the damage from Irene continues north of NYC, and several important mapping efforts are helping with the recovery effort there. For example, follow tweets from @DonMeltz and @watershedpost in upstate New York, or @jarlathond in Vermont.)

The interest in these maps was also perhaps more intense than in earlier situations. First, New Yorkers almost never evacuate for anything (at least on a scale of hundreds of thousands of residents), so the idea that so many people from only certain areas of the city needed to move to higher ground meant that everyone wanted/needed to know: am I in the evacuation zone? And that meant maps.

Second, online interest in this storm in particular was high. Other storms have hit since Twitter and Facebook have been around, but not in the New York area and not at this scale. One writer for GigaOm who had lived through hurricanes on the Gulf Coast wrote that she was “overwhelmed” by the “overall hoopla surrounding Irene online.” For her, it replaced TV as a key source of news. (I agree; I barely checked TV news throughout the storm. Twitter and weather-related websites provided all the information I needed, and the news from these sources was more up-to-date.) And because so many New Yorkers were online and hungry for information about evacuations and storm impacts, online maps were critically important.

Will I Need to Evacuate?

Mayor Bloomberg and other officials started talking about the possibility of evacuation on Wednesday (8/24). That night, my wife reminded me that our flagship mapping site OASISnyc.net included a layer of “coastal storm impact zones”.

Actually, we’ve had that data online since 2007, when it was a Map of the Day on Gothamist. It shows areas at greatest risk of storm surges from a hurricane (and, as it turns out, those areas closely match the boundaries of the city’s evacuation zones – see screenshot below). I had also received a couple of emails that night from other groups that wanted to map the evacuation zones and were worried that the city’s mapping resources weren’t up to the task.

So I wrote a blog post about how using OASIS could help people see if they were in harm’s way if the storm hit the city. I published the post the next morning (Thursday, 8/25).

That same morning, Gothamist posted an item about the potential for evacuation, and they embedded the city’s evacuation zone map. I was the first to add a comment on the Gothamist piece (via our @oasisnycmaps Twitter account), and I included a link to my blog post and to the maps.

PDF maps: blessing, curse, or both?

Let’s look at the city’s evacuation zone map [PDF – see image at right]. It shows all the city’s streets in black ink, in an 8.5″ x 11″ layout, overlaid on color-shaded areas (muted green, yellow, and brown) corresponding to the three evacuation zones A, B, and C. And it has the evacuation center locations labeled on the map.

So it puts a lot of information into one map, which is challenging on its own. But trying to view that as a PDF online can be especially problematic. People who expected something better complained — it was described (perhaps too harshly) as “terrible” and “useless” in the Gothamist comments. People said it was hard to read, took too long to download, didn’t work well on mobile phones, etc. And soon after I posted my comment at Gothamist, several people said they were thankful that they could access OASIS as an alternative to the city’s map.

Distributing a PDF map in a situation like this has pros and cons. On the one hand, it has flexibility. The PDF format can be viewed in any web browser, downloaded to your computer and viewed there, or printed out to share with someone who doesn’t have Internet access. And lots of people on Twitter were appreciative. On the other hand, it’s not something that can be easily updated, and it’s not what the growing population of digitally savvy New Yorkers would expect or desire. NYC has been touting itself as the most digital city on the planet, and all it could do was put out a PDF? People were underwhelmed.

To be fair, the city also had an online “Hurricane Evacuation Zone Finder.” You’d type in an address and it would display a zoomed-in zone map of your location. But that provided little context, and it wasn’t as user-friendly as the public was expecting. For a long time this type of web service would’ve been considered state of the art. But these days, I think a lot of people were wondering whether New York could do better.

Luckily the city had posted a dataset in GIS format representing the city’s evacuation zone boundaries. It was available on Datamine, and anyone could download it for free and use it without restriction. So when people asked me if I had the evacuation zones in a format that could be mapped, I just pointed them to Datamine.

(In a mix of optimism and revisionist history, New York City’s Panglossian chief digital officer was quoted saying that “As always, we support and encourage developers to develop civic applications using public data” (emphasis added), in reference to other groups that were using the evacuation zone map in their websites. I chuckled when I read this. If you’ve been around this business for more than just a year or two, you’d know that it hasn’t “always” been this way. It’s terrific that at least some of the city’s data is openly available now. But let’s keep it in perspective, and also remember that there are still important public datasets the city is not making easily available to developers or others.)

NYC.gov goes down

By Thursday afternoon, online interest in NYC’s impending evacuation announcement was so intense that not only did the city’s zone finder application go down, but even the city’s website — in particular, its homepage — was inaccessible.

Although the city will certainly congratulate itself for using social media to get the word out (and I agree they did a good job in this area), it’s not good that a city striving to be the nation’s premier digital city could not even serve up its homepage at the exact moment when everyone was relying on that page for information on what was happening next. And with a situation as complex as an approaching hurricane, 140-character tweets are just not enough. I can’t imagine it’s easy to withstand several million hits in a day, but I, and a lot of others, expected better.

(After complaints from the community, at least one civic activist posted links to hurricane resources on his own site and shared them via Twitter.)

The city was left apologizing for no web access and pointing people to its PDF map (at this point hosted on Tumblr and elsewhere). Mayor Bloomberg posted the PDF on his website, but that’s the least he could do. Simply taking a PDF and putting it on another website? Doesn’t take much to pull that off.

More Maps Come Online

In the meantime, more maps appeared. WNYC was next. John Keefe, the public radio station’s Senior Executive News Producer, mashed up the city’s evacuation zone data with Google Maps and put together a simple, easy-to-use interface. The map didn’t include evacuation centers at first, but it was clean, effective, and … in the absence of the city’s online resources … it worked.

In fact, several people noted the irony. When @nycgov tweeted that the city’s hurricane zone finder was down “due to high traffic”, a Google representative quickly tweeted back that “WNYC’s map is based on NYC OEM data and is running fine.”

John has developed a successful system for creating news-oriented maps in short order, and his hurricane map was the latest example. And it was embeddable, so sites such as Gothamist, which originally embedded the city’s PDF map, quickly replaced the PDF with WNYC’s interactive map. People were happy.

By Friday morning, the city was still having difficulty providing online access to its web page and its hurricane evacuation zone finder app, so more mapping sites stepped up. ESRI published an interactive map of the evacuation zones and evacuation centers using their relatively new ArcGIS.com online platform. The map looked great, and included the evacuation centers that WNYC’s map was missing.

But the ESRI map didn’t have the slimmed down, focused look and feel of WNYC’s site. It included ArcGIS.com options such as geographic feature editing that maybe weren’t needed for this situation. (That’s just a quibble. Though at one point I clicked “Edit” and it seemed like I was about to delete all of Zone A!)

One nice thing about the WNYC site is that it uses Google’s Fusion Tables service on the backend, which makes it easy to set up geographic data and then overlay that data on a Google map or any other modern, online mapping site. At the CUNY Graduate Center we’ve started to use Fusion Tables to integrate community-oriented mapped information into the OASISnyc site. By Friday morning we were able to use Fusion Tables to display the city’s evacuation centers on OASIS’s maps. The OASIS site provides a wealth of information such as subway and bus routes, schools, public housing sites, etc., so it provided a way (hopefully an easy way) to locate evacuation sites in relation to these other locations.
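Fusion Tables has since been retired, but the core step it handled for us is easy to picture: turning a plain list of evacuation centers into a point layer that a web map can overlay. The sketch below is illustrative only, not the code we ran; the input file name and its columns (name, address, lat, lon) are assumptions.

```python
import csv
import json

# Hypothetical input: one evacuation center per row, with name, address, lat, lon columns.
features = []
with open("evacuation_centers.csv", newline="") as f:
    for row in csv.DictReader(f):
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "Point",
                "coordinates": [float(row["lon"]), float(row["lat"])],
            },
            "properties": {"name": row["name"], "address": row["address"]},
        })

# A GeoJSON point layer that OASIS, Google Maps, OpenLayers, etc. can overlay.
with open("evacuation_centers.geojson", "w") as out:
    json.dump({"type": "FeatureCollection", "features": features}, out)
```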

By Friday, Google had also stepped in with a mapping service of its own, a customized version of its crisis mapping application.

Originally Google’s map omitted the city’s evacuation zones and centers, but it did include several other layers of data related to potential storm impacts (like the storm surge map at OASIS). The federal weather and environmental agencies such as NOAA and FEMA have consistently done a great job of providing free, online access to observation and modeling data about storms, and Google put this information to use.

Regional Maps

On Friday our team at the CUNY Graduate Center also made two enhancements to our mapping applications to make it easy for a wide range of people to find out if they might be hardest hit by Irene. First, we reconfigured the OASIS maps so the storm surge layer could load quickly. We created a pre-cached tiled layer instead of a dynamic layer and also set up the map page so that most of the dynamic layers were turned off by default. This made the map page load faster, and made the storm surge layer load instantaneously (our site had bogged down a bit on Thursday due to increased traffic — site usage almost tripled to 9,000 pageviews, almost solely from my comment at Gothamist with a link to OASISnyc.net — so quick loading was key).
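Pre-caching a tile layer essentially means rendering every z/x/y tile over your area of interest ahead of time, so the web server only has to hand back static PNGs. Here is a rough sketch of the standard “slippy map” tile math used to enumerate those tiles; the bounding box and zoom range are illustrative, and the actual OASIS cache was built with its own tooling.

```python
import math

def deg2tile(lat, lon, zoom):
    """Convert latitude/longitude to slippy-map tile indices at a given zoom."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    return x, y

# Approximate NYC bounding box (illustrative values).
west, south, east, north = -74.26, 40.49, -73.70, 40.92

for zoom in range(10, 15):
    x_min, y_min = deg2tile(north, west, zoom)   # tile y grows southward
    x_max, y_max = deg2tile(south, east, zoom)
    count = (x_max - x_min + 1) * (y_max - y_min + 1)
    print(f"zoom {zoom}: {count} tiles to render into cache/{zoom}/{{x}}/{{y}}.png")
```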

We also incorporated the storm surge layer into an interactive mapping site we maintain with the Long Island Index focused on Nassau and Suffolk counties. It seemed that the storm might have a greater impact on Long Island. The storm surge data we used for OASIS was statewide in scope (it was created by NY SEMO), so we coordinated with our partners at the Index and updated the site Friday afternoon.

Newsday included a link to the LI Index mapping site, and usage soared over the weekend.

Understandably, an organization such as WNYC would limit its map to the city’s 5 boroughs. But there weren’t similar maps for any other part of the tri-state region.

Even though mandatory evacuations had been called for much of Long Island’s south shore, the best data available on those areas were lists (some in PDF format) of affected addresses and affected streets. Given the surge in usage of the LI Index mapping site, I like to think that we helped meet a key need.

Mandatory Evacuation and More Maps

During the day on Friday, Mayor Bloomberg announced the city’s mandatory evacuation plans. The scramble was on to see if you were in Zone A!

Not to be outdone by WNYC, Google, or anyone else, the New York Times launched its version of an interactive evacuation zone map late in the day Friday.

Like WNYC’s version, the NY Times map was focused and easy to use. But it was also limited to NYC, despite the Times’s readership outside the five boroughs, many of whom had also been required to evacuate.

By then, WNYC and Google had also added the locations of evacuation centers to their maps.

Lessons Learned?

So what to make of all these maps?

I think the first thing is that they were all generally helpful. When the nation’s premier digital city was incapable of providing digital information in a timely, useful way, others stepped in and succeeded.

These efforts, however, suffered to some extent from inconsistencies and lack of coordination.

For example, different mapping sites displayed different kinds of information in ways that may have been confusing to the person on the street.

Google and OASIS posted storm surge zones and the city (and WNYC, ESRI, and the Times – and eventually Google too) posted evacuation zones. Ultimately what most people wanted to know was if they lived in evacuation Zone A. The storm surge areas were important in terms of anticipating where the storm would do the most damage, but perhaps a more pressing issue was the evacuation.

But this difference in approaches underscores the lack of coordination among the various mapping entities. It was as if everyone just wanted to get *their* map online.

We’re as guilty of that as anyone. I know top staff at OEM and I easily could’ve contacted them to coordinate the OASIS layer with theirs. But it was somewhat frantic at the time, and the communication didn’t happen. I’d say WNYC was the most earnest in this regard, since they probably just saw a hole that needed to be filled – the city was talking about evacuation, but the city’s evacuation map was sorely lacking or not online.

But once WNYC went online, as far as I know there was little coordination among them, us, ESRI, the NY Times, Google, etc. I think you could reasonably ask — since WNYC’s map worked perfectly well, and provided the information about evacuation zones — why have essentially the same map from ESRI, Google, and the NY Times? Were these groups talking with each other? For the media outlets (WNYC and the Times), was it just a competition thing?

I do know that when the city’s GIS community was more cohesive, this probably would’ve been coordinated a bit more, perhaps through GISMO. Not that the lack of cohesion is a bad thing necessarily. And not to fault GISMO or other coordinating groups. But I wonder if better information could’ve been provided to the public in a better way if all of us making the maps were in communication.

For example, for at least a day WNYC’s map lacked the evacuation center locations. I added the locations to OASIS using Fusion Tables. Then WNYC added the locations to its map, also using Fusion Tables. We easily could’ve shared the backend data, but WNYC never contacted us to discuss it. I sent a tweet to @jkeefe about it, but didn’t hear back. It was important to keep the evacuation center data up-to-date and consistent because the city changed the locations of 4 centers before Irene hit. Keeping the maps in sync would’ve minimized any confusion for the public.

Overall, I think the biggest takeaway is that the Mayor’s office and NYC agencies – especially DoITT (since they’re responsible for coordinating the city’s technology resources) – need to engage better with mapping/data/online communities in a much more open, collaborative way.

Despite the city’s talk of apps and open data, there’s still very much a closed approach on the city’s part when it comes to public/private partnerships. True, the city has developed partnerships with local startup tech companies. But the city’s nonprofit and academic communities, along with established private entities, have much to share and have proven they have the technological resources to do as good a job as the city, if not better, at providing essential information online.

In terms of mapping Hurricane Irene in NYC, NGOs filled a big void. The city should not only recognize that effort, but cultivate it and help sustain it so that it works more smoothly and effectively next time.

Innovative map comparisons – Census change in 15 cities

Our team at the Center for Urban Research (at the CUNY Graduate Center) has updated our interactive maps showing race/ethnicity patterns from 2000 and 2010 in major cities across the US. We’ve enhanced the maps in several ways:

  1. Added more cities. We now have 15 major urban regions mapped across the US (Atlanta, Baltimore, Boston, Charlotte, Chicago, Detroit, Houston, Los Angeles, Miami, New York, Orlando, Philadelphia, Phoenix, San Francisco, and Washington D.C.).
  2. The maps now offer three ways of comparing 2000 and 2010 racial patterns (described in more detail below).
  3. We color-coded the population change data in the popup window. Population increase is shown in green; decrease is shown in red. See image below.

Here’s our news release with more info.

Btw, we’ve also updated our static maps to show New York City Council districts, to begin to get a sense of how demographic changes will shape upcoming redistricting efforts at the local level.  Here’s the link: www.urbanresearchmaps.org/plurality/nyccouncil.htm (For the static maps, you can view 2000-2010 demographic change with the vertical slider bar, but you can’t zoom in/out, etc.)

An initial version of the maps launched in June with the vertical bar technique, integrating it with interactive, online maps for the first time. Our Center crafted the maps so you could not only drag the bar left and right but also zoom in and out, click on the map to obtain detailed block-level population counts, and change the underlying basemap from a street view to an aerial image (via OpenLayers’ use of Microsoft’s Bing Maps tiles), while also changing the transparency of the thematic Census patterns.

The latest iteration of CUNY’s Census maps continues to use the vertical slider but now incorporates this technique with two more comparison options. Each approach serves different purposes:

  1. The vertical slider bar provides a “before (2000) and after (2010)” visualization of change, either regionally or at the scale of a city neighborhood.
  2. The side-by-side comparison is ideal for lingering over a given area, especially at the local level, taking the time to absorb the differences in demographic patterns mapped with 2000 Census data on the left and 2010 on the right. We incorporated this approach specifically at the suggestion of the great interactive team at the Chicago Tribune, who have created some similar Census maps.
  3. The single-map 2010/2000 overlay is especially helpful for revealing the increase in diversity over a given area.

For example, you can zoom to Atlanta, GA on the single-map overlay and see the city’s predominantly Black population in 2000 surrounded by suburban Census blocks shaded dark blue, denoting a White population of 90% or more (see images below). As you transition the map from 2000 to 2010, the dark blue in the suburbs fades to a lighter shade (indicating a more mixed population demographically) coupled with more Census blocks shaded green, purple, and orange – each corresponding to communities that are now predominantly (even if only by a few percentage points) Hispanic, Asian, or Black respectively. This pattern is replicated in many of the urban regions featured at the website.

Atlanta & suburbs in 2000

Race/ethnicity change in Atlanta by 2010

Eventually we’ll be moving all this from pre-rendered tiles to vector tiles. CUR’s application architect Dave Burgoon contributed code to TileStache that enables it to produce AMF-based output for use in Flash-based interactive mapping applications. This will give us the flexibility to map as many Census variables as needed, and to provide complete geographic coverage (hopefully down to the block level) nationwide. That’s the plan, anyway! Stay tuned.

Credits

Funding for much of the Center’s recent work on Census issues has been provided by the Building Resilient Regions Project of the John D. and Catherine T. MacArthur Foundation, the Hagedorn Foundation, as well as support from the CUNY Graduate Center and the City University of New York.

Several people provided feedback and helpful editorial suggestions on earlier versions of the maps and narrative. Though the materials at this site were prepared by the Center for Urban Research, those individuals improved our work. We greatly appreciate their contributions.

Slippy maps, meet before-and-after jQuery slider (introductions by OpenLayers)

Our team at the Center for Urban Research (at the CUNY Graduate Center) has launched a set of maps showing race/ethnicity patterns from 2000 and 2010 in major cities across the US.  The maps combine several mapping/web technologies that offer a new way of visualizing population change.  This post explains how we did it.

(And by popular demand, we’ve also included a map of Congressman Anthony Weiner’s district in relation to demographic change — you may have heard of him and his Twitter travails recently?)

Race/Ethnicity Change

Briefly, the maps show race/ethnicity change from 2000 to 2010 at the local level throughout major urban regions across the U.S.  So far we include New York City, Los Angeles, Boston, Chicago, Houston, and San Francisco.  (Others are coming soon.)

Our methodology and data analysis (and static maps) are provided here.  For the mapping and web techniques, see below.

Reactions

So far we’ve received a pretty good response to our maps.  Here are some tweets posted recently:

  • @dancow (web journalist for ProPublica): Cool before/after map from CUNY’s urban research center showing NYC ethnic changes at the block level, from 2000-10.
  • @mericson (deputy graphics editor at NY Times): Nice block-level maps by @SR_spatial & CUNY Urban Research Center showing racial/ethnic change in NYC from 2000 to 2010.
  • @kelsosCorner (former Washington Post cartographer): Digging new 2010 Census plurality maps of NYC.
  • @albertsun (graphics editor at Wall St Journal): Coolest census map I’ve seen yet.
  • @PJoice (HUD employee; tweets are his own): This is the coolest map I have ever seen. Nice work by @SR_spatial and CUNY!
  • @MapLarge: I like how you can use the slider or move the map! Great Visualization!

Technical overview

The map uses the “before and after” technique that media websites have used for images of natural disasters.  We enhanced this technique by integrating it with interactive maps using OpenLayers, the open source mapping framework.  Now the slider works with two sets of overlapping, but perfectly aligned, maps from 2000 and 2010.

As it turns out, we didn’t set out to create an interactive version of these maps. In fact, we originally created static maps, but everyone we showed them to for feedback wanted the ability to zoom in/out and click on the map for more info.  So we developed the OpenLayers version. (And when I say “we”, that mainly means David Burgoon, CUR’s application architect, who I can’t say enough good things about.  I made the maps, and CUR’s Joe Pereira of the CUNY Data Service created the data sets, but Dave brought it all to life.)

OpenLayers enables us to introduce interactivity into the before-and-after images. Maps like these (to our knowledge) have not been available before — where you can move a slider back and forth while also zooming in/out and clicking on individual Census blocks for detailed information. You can also change the transparency of the thematic map layer, and switch between a street view and aerial view basemap.

It involved a good amount of work to integrate the slider technique with OpenLayers and to have two overlapping map instances working in tandem. The two maps need to appear as one, which requires painstaking effort to ensure that the pixels on your screen are translated accurately to latitude/longitude coordinates in each of the separate but related map instances, and that the maps pan together seamlessly as you drag the slider left or right or move the map across the slider.
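The crux of keeping the two instances in lockstep is that both must share exactly the same pixel-to-coordinate transform, so a pixel under the slider resolves to the same longitude/latitude in either map. Here is a minimal sketch of that transform in the standard Web Mercator tile scheme (256-pixel tiles); this is the general math, not our production JavaScript.

```python
import math

TILE = 256  # pixels per tile in the Google/Bing scheme

def lonlat_to_pixel(lon, lat, zoom):
    """World pixel coordinates for a lon/lat at a given zoom level (Web Mercator)."""
    scale = TILE * (2 ** zoom)
    x = (lon + 180.0) / 360.0 * scale
    siny = math.sin(math.radians(lat))
    y = (0.5 - math.log((1 + siny) / (1 - siny)) / (4 * math.pi)) * scale
    return x, y

def pixel_to_lonlat(x, y, zoom):
    """Inverse transform: world pixel coordinates back to lon/lat."""
    scale = TILE * (2 ** zoom)
    lon = x / scale * 360.0 - 180.0
    lat = math.degrees(math.atan(math.sinh(math.pi * (1.0 - 2.0 * y / scale))))
    return lon, lat

# Both map instances apply the identical transform, so a given screen pixel
# corresponds to the same ground coordinate in the 2000 map and the 2010 map.
px = lonlat_to_pixel(-73.9857, 40.7484, 12)
print(px, pixel_to_lonlat(px[0], px[1], 12))
```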

Mashup

In order to create the application, we used a mix of software applications, technologies, and techniques, summarized below:

  • We used the statistical software package SPSS to extract the Census block-level data for both years (see our methodology), allocate the 2000 data to 2010 blocks using the Census Bureau’s block equivalency files, and calculate the race/ethnicity plurality for each block.
  • We exported these SPSS files in DBF format and used ESRI’s ArcGIS Desktop to join the DBFs with 2010 TIGER Census block shapefiles. (A rough sketch of this join step appears after this list.)
  • ArcGIS Desktop was also used to create the choropleth maps (based on color schemes from ColorBrewer.org);
  • The map layouts were published as temporary web map services using ESRI’s ArcGIS Server. We used these to create pre-cached tiles (.PNG files) for the 2000 and 2010 maps, corresponding to zoom levels 4 through 10 using the now-standard Google-Microsoft map scales for online web maps. (Our application accesses the choropleth tiles as PNGs directly from the cache created by ArcGIS Server, rather than accessing the ArcGIS web map service in order to assemble the tiles. The latter approach would be too slow and would undermine the transition as you dragged the slider across the map.)
  • The slider technique was adapted from the jQuery plugin by www.catchmyfame.com.
  • OpenLayers handles all the map navigation and serves the maps themselves, modified with customized JavaScript code.
  • The basemap shown beneath the color-shaded map tiles is provided by Microsoft’s Bing map service. The street map and aerial image tiles from Bing are accessed directly via OpenLayers, rather than using the Bing API. This is a key reason we used Bing for these maps; had we used Google Maps as the basemap, we would have been limited to accessing it via Google’s API, which would have slowed map drawing and undermined the slider effect.
  • For geocoding we use the Yahoo! Placefinder API.
  • Some browsers are not able to handle the before/after slider effect smoothly. In particular, Firefox and Safari perform poorly; the slider transition between one map to the other is not smooth. Microsoft’s Internet Explorer is adequate, but Google’s Chrome browser is best.
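The join mentioned in the second bullet above was done in ArcGIS Desktop. As a rough open-source equivalent for anyone retracing that step, the same operation can be sketched with pandas/geopandas, assuming the SPSS output has been exported to CSV and that both tables carry the standard block identifier (the file names and the GEOID10 column are illustrative):

```python
import pandas as pd
import geopandas as gpd

# 2010 TIGER census block geography and the block-level race/ethnicity table
# (assumed here to be a CSV export of the SPSS output).
blocks = gpd.read_file("tl_2010_36_tabblock10.shp")
attrs = pd.read_csv("plurality_by_block.csv", dtype={"GEOID10": str})

# Join attributes onto geography using the shared block identifier.
joined = blocks.merge(attrs, on="GEOID10", how="left")

# Write out a shapefile ready for choropleth symbolization.
joined.to_file("blocks_with_plurality.shp")
```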

Data sources/issues

We used block-level data from the Census Bureau’s 100% population counts from the 2000 and 2010 decennial censuses (from Table P2 in the “PL-94-171” files for 2000 and 2010).

The Census Bureau’s block geography changed between 2000 and 2010 — new blocks were created, blocks were merged, and block boundaries were modified in many places. In order to compare population data from 2000 and 2010 using a common set of blocks, we used the Census Bureau’s block relationship file to allocate the 2000 population counts to 2010 geography.
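Conceptually, the allocation splits each 2000 block’s population across the 2010 blocks it overlaps, in proportion to a weight derived from the relationship file, and then sums by 2010 block. A hedged pandas sketch follows; the file names, column names, and the pre-computed WEIGHT column are assumptions rather than the Bureau’s actual field names.

```python
import pandas as pd

# Assumed columns: GEOID00, GEOID10, and WEIGHT (the share of each 2000 block
# that falls within a given 2010 block, derived from the relationship file).
rel = pd.read_csv("block_relationship.csv", dtype={"GEOID00": str, "GEOID10": str})
pop2000 = pd.read_csv("pop_2000_by_block.csv", dtype={"GEOID00": str})

merged = rel.merge(pop2000, on="GEOID00", how="left")
merged["alloc_pop"] = merged["POP2000"] * merged["WEIGHT"]

# The 2000 population re-expressed on 2010 block geography.
pop2000_on_2010 = (
    merged.groupby("GEOID10")["alloc_pop"].sum().round().reset_index(name="POP2000_ALLOC")
)
pop2000_on_2010.to_csv("pop_2000_on_2010_blocks.csv", index=False)
```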

When you’re viewing the map, it is best to use the block-level data to understand trends over a larger area, even just over several blocks. Be careful when viewing a specific block on its own: each block covers a small area, and the Census Bureau’s counts for individual blocks may contain errors.

Credits

Funding for much of the Center’s recent work on Census issues has been provided by the Building Resilient Regions Project of the John D. and Catherine T. MacArthur Foundation, the Hagedorn Foundation, as well as support from the CUNY Graduate Center and the City University of New York.

Several people provided feedback and helpful editorial suggestions on earlier versions of the maps and narrative. Though the materials at this site were prepared by the Center for Urban Research, those individuals improved our work. We greatly appreciate their contributions.

Some good opendata news for NYC

The “Socratic Method” of publishing city data?

I was encouraged at the OpenGov Camp this past Sunday by an announcement from NYC DoITT  that the city will be using Socrata to provide online access to its data.  It’s a great platform.  It doesn’t ensure that the city will actually provide good data, or update it in a timely way, or expand its available data sets — but it’s a good step forward and hopefully a harbinger of better things to come.

The city is seeking feedback here.  They’ve indicated that an “end of summer launch” is planned for a NYC/Socrata rollout.  Here’s an example of what the site might look like.

OpenBaltimore opened my eyes

Earlier this year I had tweeted that a new municipal data portal — OpenBaltimore — blew away sites like NYC’s Datamine.

I was asked by Alex Howard (@digiphile) for my thoughts on OpenBaltimore and other, similar portals.  At the time I didn’t realize OpenBaltimore was using Socrata, but after I looked into it further, I came away impressed.  The platform is visually appealing, easy to search, and offers multiple ways of accessing/extracting data.

(I don’t want to endorse the Socrata product/service, but it seems to me to be a good choice for NYC.)

Useful features, and lots of them

One nice aspect of the platform is the ability to immediately preview the data, in your browser (no downloading needed just to see what it contains).  You can also view more details about each row in the file.  And you can visualize  the data in multiple ways — using an interactive map option built into the platform (if the dataset has a location component) or using one of 9 different chart options.

And if you want to download/export  a data set, they give you at least 8 formats for extracting/exporting, as well as an API for programmatic access.  NYC says that “all datasets will now be available as APIs” once they replace Datamine with NYC/Socrata.
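Socrata’s API (the SODA endpoints) exposes each dataset as JSON at a predictable URL, so “available as an API” really does mean a single HTTP GET. A minimal sketch; the dataset identifier below is made up, and query parameters such as $limit come from the SODA query language.

```python
import requests

# Hypothetical dataset identifier; every Socrata dataset gets a short id like this.
url = "https://data.cityofnewyork.us/resource/abcd-1234.json"
params = {"$limit": 100}  # SODA paging/query parameter

resp = requests.get(url, params=params, timeout=30)
resp.raise_for_status()
rows = resp.json()  # a list of dicts, one per record

for row in rows[:5]:
    print(row)
```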

Short links and “perma” links are available to each data set.  And there’s a “Discuss”  option where anyone can attach notes and commentary for each data set.  It’s user-generated metadata — you can immediately see, for example, if anyone else has commented about the data’s quality, or completeness, or how up-to-date it is.  I didn’t notice too many comments at the OpenBaltimore site, but there were some, and they were helpful (including responses from that city’s data team).

The mapping option includes an interactive map, but it didn’t seem to offer real-time geocoding.  So even if a list has street addresses, it can’t be mapped through Socrata on the fly.  Each list needs a “location column”, which presumably means lat/lon.  (It’s easy to submit feature requests to the Socrata team, though, so hopefully we’ll be seeing this addition soon.)

All in all it’s really great.  Other cities use Socrata, including Chicago, Seattle, and even smaller municipalities such as Manor, Texas (pop. 6,500).

However, not a silver bullet

Even though OpenBaltimore’s portal has been online for just a few months, already there are criticisms (for example, data hasn’t been updated since February, some data sets have quality problems, etc).  Many people (including me) have leveled these same criticisms at NYC’s Datamine effort.  So simply having a better portal won’t solve these issues.

But at least a platform like Socrata will make it easier to deploy data sets, it’ll certainly make it easier for the public to access those data sets, and it’ll make it easier to suggest improvements to the substance and the process.

NYC’s Datamine was an improvement in some ways over earlier opendata efforts in New York.  Now that it’s been around for two years, I think it’s fair to say that Datamine is clunky at best.  For me, I can’t wait for it to be replaced by something better.  I’m looking forward to the NYC/Socrata roll out.

What do you think?

On the lookout for ‘open data fatigue’ in NYC

I watched today’s news event by New York City’s Mayor Bloomberg and his colleagues about the city’s new “Digital Road Map” [PDF]. Impressive effort, including the livestream webcast.

But I thought the Twitter stream during the Mayor’s webcast was especially interesting. Seemed to me that there were just as many tweets about real-world problems (potholes, cops on the beat, subway service, etc) as there were about the technology announcements themselves. The technology is cool, and I agree it’s critically important for the city’s competitiveness, but it needs to be considered in the context of the substantive issues of improving city services, quality of life, engaging real people, maintaining a robust economy, etc.

I always worry when I see the city touting its technology efforts without also including local Community Boards, neighborhood groups, business advocates, urban planners, other elected officials, etc. who rely on access to public data so they can hold government accountable and do their jobs better. In my view, these groups need the data more than app developers do. That is why open data efforts and policies are so important.

[Editorial update: I realized that in the preceding paragraph I omitted a critically important constituency regarding open data: the media.  I was thinking back on the many FOIL requests I’ve made, the various lawsuits I’ve been party to, and the hundreds of data requests I’ve filed over the past two decades in an ongoing effort to pry loose public data sets from government agencies.  But I realized that even my longstanding involvement in data access efforts pales in comparison to the work done day in and day out by reporters, editors, and journalists to not only further the open data cause, but just to do their jobs.

Media organizations absolutely rely on unfettered access to public data so they can shine a light onto government activities and educate us all about what our public officials are doing, perhaps especially when those officials don’t want us to know.  So when we think about improving city (and state and federal) government by developing a “digital road map”, the Foursquares and Tumblrs of the world are just distractions.  Provide unprecedented access to government data for the press — and bloggers and tweeters — and that will do more for better government than any number of Facebook pages, Foursquare check-ins, or officially-sanctioned NYC hackathons.]

But the city seems more focused on apps than on community. I understand the economic development appeal of fostering startups. But the open data movement long predated apps.  I highlighted this in my post last year (see the “Misplaced Priorities” section).

Apps are great (I use them constantly, and I’ve even developed one myself). And kudos to the city and its agencies for responding to app developers and making data more open so the developers can do great things with the data (things even the city might not do).

I just hope the latest announcements by the city will result in more real and lasting efforts to make data easier to access than the latest check-in craze. The Mayor already expressed some hesitation about making data accessible when a reporter asked him about CrashStat. CrashStat is a great example of my point — it wasn’t created to be an “app” per se; it’s an effort by a local nonprofit group to use public data to educate the public and hold government agencies more accountable about traffic injuries and fatalities. But the Mayor said he didn’t even know what CrashStat was, while making excuses about not making data available if it’s not in electronic format, or needs to be vetted, or is “sensitive”.  Blah blah blah – we’ve heard all that before and it undermines my confidence in the city’s pronouncements that more data will really be made open.  (I’d link to the city’s webcast at nyc.gov but it stops right when the Q&A begins.)

(In the livestream video, the Crashstat question comes at 27:00, and the Mayor acknowledges he doesn’t know what it is at 27:10. Thanks to Joly MacFie for the video link.)

So who knows, if the Mayor starts actually using Foursquare more and experiences ‘check-in fatigue‘, maybe he’ll eventually get ‘open data fatigue’ too. Let’s hope he stays as vigorous about public data access as he and his agencies say they will be.

(photo via TechCrunch from IntangibleArts)

NYC Data Mine data: an object lesson in #opendata challenges

Earlier this week I posted an item about stagnation at NYC Data Mine as well as my thoughts more generally on the city’s #opendata policies and practices.  Today I discuss another challenge regarding open data: data quality and poor metadata.

Background

We recently updated the OASIS community mapping website with several data sets: community gardens, subways, bus routes, bike lanes, and more.  We also updated the map layer representing New York City park areas.  That might seem straightforward for a website focused on open space.  And we were using information from New York’s Data Mine website, which is intended to promote ease of access and use of the city’s publicly available data sets.

But adding the latest parks data to OASIS was much more complicated than it needed to be.  This post describes why, how we got around the complications, and offers suggestions for improvement going forward.  I also include links below to my updated versions of the parks (and playgrounds) data.

Data can be messy, no question about it, but the hassles with the parks data provide an example of the challenges that remain for cities to embrace #opendata, and for the public (or even app developers, for that matter) to seamlessly make use of public data sets.

The context

2006 was the last year that we requested a GIS data set of park properties from the NYC Dept of Parks and Recreation (DPR) for use on the OASIS website.  The Parks Department is a partner in the OASIS project, and was one of the project’s founding organizations.  The agency sees the value in mapping open space resources beyond just city data, integrating (as OASIS does) a wealth of data layers to provide a comprehensive picture of open space issues in any given neighborhood and citywide.

For various reasons (project transitions, mapping application updates, other priorities) our team at the CUNY Graduate Center hadn’t implemented major data upgrades to the OASIS website till recently.  Even last year (2009) when the city’s Data Mine was launched, we didn’t update the parks data on OASIS – the earlier parks data we were using seemed to be more comprehensive.

Therefore, I never looked closely at the parks data from Data Mine until this summer, when we started planning for a major data update on OASIS. When I took a close look at the parks data, I was frustrated and disappointed.

It’s important to note that I don’t fault the Parks Department, per se, for the difficulties I encountered.  I think the problem has to do with a disconnect that often exists between data creators and data users, with little being done by the city itself to mediate.  The Data Mine concept is a good start.  But for a meaningful open data effort, data should be vetted before it’s published, helpful metadata has to be included with each data file, and a dialogue should be fostered to help agencies understand how others seek to use their data in order to create an opportunity to learn from each other.  The parks data just happens to be one set of files that I’ve focused on, but these problems aren’t unique to parks, as others have pointed out (including me — I feel like I’ve been on a tear lately, blogging about data sets with great potential but that need lots of work before they’re re-used).

Data Mine disappointments

Data Mine has three types of “parks” data.  One is the geographic data from the Parks Department (in particular, the “Map of Parks” file, downloaded as “DPR_Parks_001”).  The second is the “raw” data from DPR.  The third is another “OPEN_SPACE” geographic dataset from the city’s Dept of Information Technology and Telecommunications (DoITT).

With all those options, how can you go wrong?  Here’s how:

The DPR geographic data is a great visual depiction of the park areas.  But, in GIS parlance, the dataset contains almost no attributes.  In other words, each park area in the dataset is identified only by its park ID (such as “M010”, which happens to be the code for Central Park – see www.nycgovparks.org/parks/M010).  No name or other information is provided except for an undefined “category code”.

Even the park ID is hard to identify – there’s a “GIS_Propnu” field and an “OMP_PropID” field that each contain values in the “M010” format. For the most part the values in both fields are identical.  But there are 43 records where these fields don’t match — see below.  I have no idea why they don’t match, but it turns out that the “OMP_PropID” values work in the DPR URL scheme, while the “GIS_Propnu” values don’t. So I went with the OMP data.
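Checking which ID field actually works doesn’t take much: compare the two columns in the shapefile’s attribute table and spot-check a few IDs against the DPR URL pattern mentioned earlier. Here is a sketch with geopandas and requests; the field names come from the Data Mine file, and everything else is illustrative.

```python
import geopandas as gpd
import requests

parks = gpd.read_file("DPR_Parks_001.shp")

# Records where the two ID fields disagree.
mismatch = parks[parks["GIS_Propnu"] != parks["OMP_PropID"]]
print(len(mismatch), "records with differing IDs")

# Spot-check a few IDs against DPR's URL scheme (www.nycgovparks.org/parks/<ID>).
for pid in mismatch["OMP_PropID"].head(5):
    r = requests.head(f"http://www.nycgovparks.org/parks/{pid}",
                      allow_redirects=True, timeout=30)
    print(pid, r.status_code)
```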

And the values in the “cat_code” field (1, 2, and 4) are not explained.  I’ve even talked with DPR employees who use GIS data, and they weren’t familiar with the details of the category codes.

Then I looked at the “raw” data.  I assumed the raw data would include a file to link park IDs with park names.  Nope.  The DPR “raw” data includes lists of many different types of park features (directories of barbecuing areas, beaches, dog runs, nature centers, playgrounds, etc).  But there’s no overall list of actual parks.  And of the 21 “raw” data files related to park properties from DPR, none of them provides park IDs.  They all include names and other attributes (generalized address info, website URLs, etc), but no IDs.

The closest we get from DPR’s “raw” data is a list of “capital projects” which includes names and park IDs, but they look like this:

So you need to do some text parsing to extract just the park IDs.  And this wouldn’t even provide a complete list.  The capital projects file includes 580 unique park names – far short of the 1,956 features included in the park geography file – as well as more than 330 entries with blank names.
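Extracting the IDs from those capital-project names is a simple pattern match. Since the exact formatting of the entries isn’t reproduced here, the regular expression below is only an assumption about what a park ID looks like (a borough letter or two followed by a zero-padded number, such as “M010” or “X008”), and the input file name is hypothetical.

```python
import csv
import re

# Assumed ID pattern: one or two capital letters followed by three digits,
# optionally with a trailing letter (e.g. M010, X008).
PARK_ID = re.compile(r"\b[A-Z]{1,2}\d{3}[A-Z]?\b")

ids = set()
with open("dpr_capital_projects.csv", newline="") as f:
    for row in csv.DictReader(f):
        ids.update(PARK_ID.findall(row.get("name", "") or ""))

print(len(ids), "distinct park IDs recovered from the capital projects list")
```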

After exhausting the options with the DPR data, I turned to the DoITT data.  Aha, the “OPEN SPACE” geographic data has park names and park IDs!  (It’s described as a “Planimetric basemap polygon layer containing open space features, such as parks, courts, tracks, cemetery outlines, etc.”)

But looking a bit closer, here’s why the DoITT file isn’t very helpful:

  • According to its metadata, it hasn’t been updated since 2006 – no better than the data that we already had on OASIS.
  • The file includes 1,600 unique names, but that count includes areas not covered by DPR, such as cemeteries, so it’s not a complete list that will match the “DPR_Parks_001” geographic file.
  • Also, the naming conventions don’t really follow conventions.  For example, “Greenstreets” is spelled 8 different ways, and there’s a mix of abbreviations and annotation (qualifiers in parentheses, etc.), extra (leading) spaces, misspellings, inconsistent spellings, and so on. (A small cleanup sketch appears after this list.)

  • There are more than 2,700 unique park numbers, but this includes park IDs that are blank (304 times) or IDs that don’t match the DPR list (for example, 158 records have “unset” listed in the Park ID field).
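Cleaning up names like these is mostly a matter of trimming stray whitespace and collapsing the obvious variants before attempting any matching. Here is a small, hedged sketch of that kind of normalization; the specific substitutions are illustrative, not an exhaustive list of what the DoITT file needs.

```python
import re

def normalize_park_name(raw):
    """Trim stray whitespace and collapse common spelling variants."""
    name = re.sub(r"\s+", " ", raw).strip()  # leading/trailing/doubled spaces
    # Collapse the many "Greenstreets" spellings (Green Street, Greenstreet, etc.).
    name = re.sub(r"\bgreen\s*streets?\b", "Greenstreet", name, flags=re.IGNORECASE)
    return name

samples = ["  Green Streets ", "GREENSTREET", "Greenstreets (triangle)"]
print([normalize_park_name(s) for s in samples])
```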

So as far as I can tell, it’s impossible to use the Data Mine data alone to link park names to map geography.  Maybe someone could match the data by hand (creating a map that labels the park areas by ID, and then comparing that with a DPR map with park name labels, and then manually entering those park names in DPR’s GIS file of park IDs).  But this would be so prone to error it wouldn’t be worth the trouble, and it also undermines the idea of providing “machine readable data” from Data Mine in order to automate how we access and analyze the information.

Since I didn’t find what I needed on Data Mine, I reached out directly to DPR for a file that links park IDs and park names.  The response was: a) wait for Data Mine to be updated; or b) if I can’t wait, then DPR needs to check with its public relations office before giving me the file.  Well, so much for openness and transparency.  Sigh.

How did BigApps developers handle this?

This made me wonder how the BigApps competitors could have created their applications, several of whom submitted apps that displayed maps of park locations showing park names.  I asked a couple of them how they did it.  One of them didn’t answer me directly, but instead suggested that I could “hire some free student labor to go through by hand for two days” to link the IDs/names manually.  Not very helpful.  Another BigApps project used the DoITT list of IDs/park names.  But for the reasons discussed above, that’s inadequate for our purposes.

On OASIS, a current list of park IDs and names is essential.  First, we want to display the latest and most accurate information for our visitors. Second, we not only display the park names on the map, but we use the park IDs to create a park-specific URL that sends an OASIS visitor to the DPR website to access the wealth of info DPR maintains about each park.

The workaround

I figured the Parks Department must have better data itself, so I looked online to see what I could find.  In the “Explore Your Park” section of DPR’s website, they have lists of parks by borough.  Each park is displayed by name as a link, and the underlying URL includes the park ID (for example, the URL for Claremont Park in the Bronx is http://www.nycgovparks.org/parks/X008).

Here’s the root URL for the parks lists: http://www.nycgovparks.org/sub_your_park/park_list/full_park_list.html?boro=X (just change the last letter for each borough – X is the Bronx, B is Brooklyn, M is Manhattan, Q is Queens, and R is Staten Island).

So I scraped these pages and stripped out the extra HTML code, leaving just the park names and IDs in order to create my own crosswalk table.  Then I joined the names with the DPR geography file using the IDs.  Not pretty, but more comprehensive, accurate, and up-to-date than relying on the problematic Data Mine data.
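For the record, the scrape itself is short. The sketch below shows the general idea using the borough URLs given above, pulling park IDs and names out of the links on each page. It assumes the links look roughly like <a href="/parks/X008">Claremont Park</a>, which matched what the site served at the time, but treat the pattern (and the whole script) as illustrative.

```python
import csv
import re
import requests

LIST_URL = ("http://www.nycgovparks.org/sub_your_park/park_list/"
            "full_park_list.html?boro={boro}")
LINK = re.compile(r'href="/parks/([A-Z]+\d+[A-Z]?)"[^>]*>([^<]+)</a>')

rows = []
for boro in "XBMQR":  # Bronx, Brooklyn, Manhattan, Queens, Staten Island
    html = requests.get(LIST_URL.format(boro=boro), timeout=30).text
    for park_id, name in LINK.findall(html):
        rows.append({"PARKID": park_id, "PARKNAME": name.strip(), "Borough": boro})

with open("park_id_name_crosswalk.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["PARKID", "PARKNAME", "Borough"])
    writer.writeheader()
    writer.writerows(rows)
```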

The DPR website with park names and IDs also includes playgrounds.  So I also joined the scraped list to the Data Mine layer for playgrounds (“DPR_playgrounds_001”).

I added the following fields to each file based on the DPR website data: PARKID, PARKNAME, NameMain, NameSuffix (some names had text in parentheses that I separated out to this field), and Borough.  The other fields were in the original shapefile from Data Mine.

Here are the GIS files (ESRI shapefiles) for parks and playgrounds; use them as you wish:

Data Mine improvements?

After I did my screenscraping work, I found a couple of tools that were created to streamline access to Data Mine files.  One developer created a service that converts files published in Excel (XLS) format (a format not easily “consumable” by applications or web services) into XML.  His tool is called elev.at — more info here and here.  But even this effort to fix one of Data Mine’s problems wouldn’t have helped with the parks data — converting from Excel to XML would’ve improved the format but not the data quality itself.

The Data Mine files may be good enough for someone throwing together a quick mobile app to enter a competition.  But the city’s data – and apps created with the city’s data – should be better than that.  We should expect that city data is reliable, current, and easily accessible.  My experience with parks data from Data Mine reminds me that the city still has work to do to meet this goal.

Presumably the Parks Department itself has a better system.  But this obviously didn’t make it into Data Mine.  Hopefully this will be fixed the next time Data Mine is updated (if that ever happens – more on that in my earlier post).

Open data in NYC? That’s so 2009.

Last fall I had high hopes that New York City would loosen the shackles that agencies too often held tightly around “their” data sets.  The city’s BigApps competition had just been announced, the new Data Mine website was launched with many data sets I never imagined would see the light of day, and the city (i.e., the Mayor’s office and his agencies) seemed to be jumping on the open data bandwagon.

These days, I’m less optimistic about NYC’s #opendata efforts.  Sure, there are bright spots (DOT, MTA, some aspects of City Planning’s Bytes of the Big Apple).  But for the past several months I’ve been hearing rumors that Data Mine will be updated “soon” and “any week now”.  So far, nothing new on the site — data is still from 2009.  I’ve also been hearing that Data Mine will be updated when the next BigApps competition is announced.  Maybe that’ll happen, but even if a new BigApps prompts the city to update Data Mine, that’s problematic – I explain below.

Words …

When Mayor Bloomberg announced BigApps, he made a big deal of how the city would be “providing information to New Yorkers as fast and in as many ways as possible” and of helping entrepreneurs use city data to “increase accessibility and transparency in City government, generate jobs, and improve the quality of life for New Yorkers.”

And since then, the city’s own information technology agency has indicated it would usher in a sea change in how city agencies make data publicly accessible.  This was somewhat buried in DoITT’s “30 Day Report” (issued Feb. 2010), but page 29 featured a section titled “Open Data/Transparent Information Architecture”.  It said [PDF],

In 2010, DoITT will work to establish citywide policies around “open data.” These efforts will align with Mayoral initiatives of openness and transparency, and further improve access to information by creating citywide standards that are practical and feasible. As a start, City agencies should be required to make available, to the greatest extent possible, all public‐facing data in usable electronic formats for publication in the NYC DataMine. This mandate would apply to all public data that is not subject to a valid restriction, such as public safety or personal information.  City data is by and large the property of the people it serves, and DoITT will be at the forefront of continuing to make it available in as many ways as possible. [Emphasis added.]

Note that this policy was meant to be only a beginning, and that DoITT would be “at the forefront” of aggressively making public data widely available.

… vs. action

The Data Mine website was launched in October 2009.  Most of the data sets at the site had a vintage of 2009 (and some were substantially older — for example, NYC Economic Development Corporation provides geographic data sets that are “based on PLUTO 2005” [PLUTO is the city’s tax parcel data]).

The Data Mine website itself claims that it will be “… refreshed when new data becomes available.”  The data update frequency for many data sets on Data Mine is listed as daily (such as detailed school information from Dept of Education and traffic and parking data from Dept of Transportation), monthly (recycling rates from Dept of Sanitation), or quarterly (most of the geographic data from the Parks Department).  Others are listed as “annually” or “as required”, but the “as required” data sets include NYC landmarks and historic districts (several of which have been updated since Fall 2009) and 311 data.

Even though some of these data updates are already publicly available directly from the individual agencies, Data Mine — as the city’s portal to public data access — hasn’t kept up.  And it appears that Data Mine is really just an adjunct to the city’s BigApps competition, which is focused primarily on application development (and the resulting economic development from these apps), not so much on transparency and open data access.

For example, testimony from DoITT’s commissioner at a recent City Council hearing for Intro 029 (a bill requiring city agencies to provide formalized open access to their data) was revealing.  Among other things, she explained that the Mayor’s office would wait till the next iteration of the BigApps competition before updating the Data Mine website with new data sets.  (Note that this is the same Commissioner who issued DoITT’s 30-day report cited above.)

The commissioner’s presentation starts about 9 minutes into the clip below.  Here’s her testimony [PDF].  Another disconcerting point she made in her comments was that the Mayor’s office wanted to put a priority on data that they believed had value to the public (rather than posting data regardless of how the public might use it or value it).


Misplaced priorities

Linking Data Mine to BigApps has at least two problems.  The first is: Why wait?  Some agencies are already taking steps on their own to publish data and update it regularly (such as City Planning and Transportation).  I don’t see any reason to delay updates to Data Mine.  Otherwise the site is stale, and sends the wrong message.

In this era, it’s a no-brainer to make data widely and easily available, given all the amazing things people are doing with public data (helping reduce costs, promote economic development, enhance quality of life, improve government efficiency, etc).  As one blogger put it, “there’s really no reason for the city to spend the time to ‘discuss’ when the city could spend the time to ‘do’.”

The other problem is that we shouldn’t have to rely on a competition to make data publicly available.

Remember that when the state’s Freedom of Information Law (FOIL) was first enacted (in the mid-1970s), “apps” didn’t exist. It was all about accountability. The public had a right to know what its government knew — and to have easy access to that information so we could evaluate legislative, executive, and regulatory decision making.

Actually, the “legislative declaration” to FOIL in New York State makes a bolder statement: that public awareness of government actions is essential to maintain “a free society”. FOIL also emphasizes that people will (hopefully) understand and participate more fully in government when they know fully what their government is up to.

So apps are cool and powerful, but open government and open data go much deeper than the latest iPhone app to find the best parking spots.  The more the city ties public data access to app development and competitions like BigApps, the more it veers away from facilitating the public’s fundamental right to know.

App competitions also take attention away from the vibrant community of nonprofits, neighborhood planning groups, Community Boards, and others who want to improve quality of life in the city and steer a progressive course when it comes to local development and citywide policies.  Not to mention the mainstream media and bloggers.  These players may not be developing apps, but they’re doing good work in other ways.  Information access for these groups and individuals is vital.  Some city agencies are smart and know how to work strategically with these groups to move good policies forward.  But too often agencies hunker down and get defensive, and don’t want anyone to have access to data.

Clay Johnson, former director of Sunlight Labs, also makes this point at his “InfoVegan” blog.  And even some BigApps competitors noted the downsides of relying on a competition to make public data accessible.

A better approach

I think the city’s open data efforts would be greatly enhanced by:

  • passing the City Council’s Intro 029;
  • opening up more data (things like property data that are still restricted by a license and access fees);
  • redesigning Data Mine as a pointer to existing agency data repositories; and
  • ensuring that public data sets are refreshed as often as practical.

Of course, we can’t place our faith in just putting the data out there.  It still takes people making policies and actually improving things. It still takes an educated public to take action, etc. But having more data, as long as it’s not in closed formats and is widely accessible, is a good thing.

(Disclaimer: my viewpoints on this blog are my own, not necessarily my employer’s.)

OASISnyc.net map updates

Today our team at the CUNY Graduate Center updated the www.OASISnyc.net mapping site with lots of new data.  There’s more to come by summer’s end, but here’s the latest:

  • The biggest change is that we’ve added the latest community garden inventory from GrowNYC.  Over a year in the making, it comes just in time to provide valuable context for the proposed new rules for gardens being considered by the city.  More info at the OASIS wiki.
  • We’ve integrated the latest subway and bus data that I’ve blogged about earlier (here and here), and also added bike routes via the NYC Dept of Transportation (one of the latest city agencies embracing #opendata).

(Lots of bike routes in Brooklyn.)

  • A side note to the subway routes is that we’ve responded to user requests and added the ability to turn on/off the subway routes separately from the streets and rail lines.  (May not seem like a lot, but hopefully this will make it easier to view an already busy map.)

  • We also have gotten around to updating New York City parks and playgrounds data based on the Datamine release from last year.  I’ll blog more about this separately, since it was a much more involved process than it should have been, given what I now realize are some major limitations to the city’s effort via Datamine and BigApps to provide open access to data.
  • The latest zoning districts, scenic districts, and historic houses are also added to OASIS.
  • The search options for community gardens, Stew-Map turfs, and the NY/NJ Harbor’s Comprehensive Restoration Plan data have been updated.
  • We’re starting to use the OASIS wiki again for tutorials and examples of how OASIS can benefit local groups. For examples, see http://oasisnyc.gc.cuny.edu/index.php/Featured_Maps
  • Finally, we’ve started to incorporate historic maps of Manhattan in partnership with the NY Public Library.  I blogged about this earlier, and we have a post on our wiki with more info.

We’ll have more updates soon.  Historic land use parcel by parcel citywide (with an accompanying statistical analysis), more data (such as parcels & open space) for northern New Jersey, more Google Earth/KML links, etc.

Stay tuned!

@NYPLMaps & OASIS provide context for 18th century ship find

The www.OASISnyc.net mapping team has been working with the great folks at New York Public Library’s Map Division to integrate digitized historic maps aligned to the city’s current street grid.  But as we were working with Map Division staff to incorporate their maps, an amazing find at the World Trade Center construction site prompted us to speed up our work — earlier this month, construction workers unearthed an 18th century ship, largely intact, that likely hadn’t been disturbed for over 200 years.

Now you can display some key maps of lower Manhattan from the 18th and 19th centuries, view them in relation to the current street grid, and compare them to each other using OASIS’s dynamic transparency tool.  We added a brief tutorial at the OASIS wiki.

Now you can fade between current property maps …

… the 1775 Montresor map …

… the 1817 Poppleton map …

… and more.

We’ve also added the Viele map from 1874, and more are on their way.  This is all due to the groundbreaking NYPL “Map Rectifier” project.

More MTA data in GIS format

My previous post was on subway routes; this time I tackle subway stations. (Apologies for another long one!)

When I planned this post, it seemed pretty straightforward. My goal: create a GIS “point” file of subway stations based on MTA’s latest GTFS release (easy enough) that included an attribute field with a list of subway lines stopping at each station — something like “1, 2, 3” for each station.

This format would be useful for adding labels to a map layer of subway stations (like Google has, or like the MTA map – station name plus list of trains stopping at that station).

But the GTFS “stops.txt” file only includes station names, not a list of routes. The “stop_times.txt” file includes trip IDs that can be joined with the “trips.txt” file to identify routes as well as stops.  But this represents more than 500,000 records (one for each trip by each route stopping at each station).  If you dissolved those on the stop_id field (using standard grouping or dissolve functions in Excel or Access, for example), you’d only get the first route ID per station, not a comma-separated inline list like I wanted.
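To illustrate the limitation, here’s a quick sketch (assuming the GTFS text files have already been loaded into SQL Server tables I’ll call dbo.stop_times and dbo.trips — the table names are mine):

-- A plain GROUP BY can only return a single aggregated route per stop,
-- not the full "1, 2, 3" list I was after
SELECT st.stop_id, MIN(t.route_id) AS one_route_only
FROM dbo.stop_times AS st
INNER JOIN dbo.trips AS t
ON st.trip_id = t.trip_id
GROUP BY st.stop_id;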

I figured there must be a script out there to extract the routes per stop and write them to a field in the “1, 2, 3” format.  I found one, but that was the least of my worries.

The script I found (via my colleague Dave Burgoon) uses SQL’s “COALESCE” function as follows (numerous sites explain how it works, such as this one):

GO
CREATE FUNCTION dbo.RouteList (@StopID VARCHAR(100))
RETURNS VARCHAR(1000)
AS
BEGIN
-- Build a comma-separated list of the routes serving this stop
DECLARE @Routes VARCHAR(1000)
SELECT @Routes = COALESCE(@Routes + ', ', '') + route_id
FROM dbo.stops_route_list
WHERE stop_id = @StopID
ORDER BY route_id ASC
RETURN @Routes
END
GO

In order to make use of this, first I joined the trips.txt file to stop_times.txt (on the trip_id field), created a new field representing a concatenation of stop_id and route_id, then grouped on that new concatenated field.  This gives me a unique list of all the stop – route combinations.  I called it “stops_route_list”.  I used that with the COALESCE function above, then ran a SELECT statement on the results of the function [SELECT Distinct stop_id, dbo.RouteList (stop_id) AS RouteList FROM dbo.stops_route_list] to give me the comma-delimited inline result.  Several steps, but it works.
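For reference, here’s roughly what those preparation steps look like in T-SQL, using the same assumed table names as above (I’ve used SELECT DISTINCT in place of a separate grouping step, and an underscore as the separator in the concatenated field, so treat this as a sketch rather than my exact script):

-- Unique stop/route combinations, with a concatenated key field
SELECT DISTINCT st.stop_id,
t.route_id,
st.stop_id + '_' + t.route_id AS stop_route
INTO dbo.stops_route_list
FROM dbo.stop_times AS st
INNER JOIN dbo.trips AS t
ON st.trip_id = t.trip_id;

-- Collapse the combinations into one comma-delimited list per stop
SELECT DISTINCT stop_id, dbo.RouteList(stop_id) AS RouteList
FROM dbo.stops_route_list;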

Data Complications

I quickly figured out that there were two problems with this effort.  One was the complex nature of the New York City subway system itself.  The other had to do with data quality problems with the MTA’s stops.txt file.

Subway system complexities

When I joined my comma-delimited result with a geocoded map layer of subway stations, I saw what looked like substantial errors.  For example, the stations east of Utica Ave in Brooklyn, which the MTA map shows as served only by the 3 line, were labeled on my map with the 2, 3, 4, and 5 trains stopping between Utica Ave and New Lots.  The 7th Ave local stations in Manhattan (such as 18th and 23rd streets) showed the 1 *and* 2 lines stopping there.  And another route was shown stopping at Jamaica Ave in Queens that the MTA map doesn’t show stopping there.

Then I looked at other applications that used MTA’s GTFS data.  The Google Maps basemap matches the MTA’s map in terms of station labels (for example, the 23rd St stop on 7th Ave is labeled as “23rd St [1]”).  But clicking on Google’s individual station icons opens a popup window with the “additional” lines I mention above (see example below).  OpenTripPlanner.com (which just launched last week) showed the same thing.

Then I read the fine print on the MTA subway map.  It only “depicts weekday service.”  So presumably the 2, 4, and 5 trains at New Lots in Brooklyn, for example, must represent weekend trips.  (And the 2 train, in the images above, would only be on the weekend.)  To test that, I filtered the stop_times.txt file for only weekday trips (using all the trip_ids containing “WKD”) and ran the COALESCE script against that filtered list.  The 2 at the 7th Ave local stations dropped out (it does indeed stop at local Manhattan stations only on the weekend).  But other anomalies remained.  For example, the 2, 4, and 5 trains were still shown stopping at New Lots.
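(For the record, that weekday test filter was just something along these lines, again using my assumed table names; the RouteList function above would then need to read from this filtered table rather than dbo.stops_route_list:)

-- Rebuild the stop/route combinations from weekday trips only
SELECT DISTINCT st.stop_id, t.route_id
INTO dbo.stops_route_list_wkd_test
FROM dbo.stop_times AS st
INNER JOIN dbo.trips AS t
ON st.trip_id = t.trip_id
WHERE st.trip_id LIKE '%WKD%';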

Then I looked further at the MTA’s individual line maps (which were mentioned in the fine print on MTA’s subway map), as well as the agency’s summary service guide [PDF] for subways.  The fine print on *these* documents tells the story — using the New Lots example, the 2, 4, and 5 all have “some rush hour trips to/from New Lots Av, Brooklyn”.

Last I checked, rush hour is during the week.  So when MTA’s subway map shows that “weekday service” to New Lots Av is only provided by the 3 train (see below), it contradicts the MTA’s own more detailed service guide and line maps.

As it turns out, these 2, 4, and 5 train runs are a small part of weekday service to/from New Lots.  The combined stop_times.txt and trips.txt file reveals that those three routes account for only 8, 58, and 6 weekday trips stopping at New Lots, compared with 262 trips on the 3.

I’m not being critical here. This is what I mean by the complexity of the subway system.  There are many exceptions like this, where scheduling or routing needs dictate that some atypical trains stop at unexpected stations.  (For example, despite what the MTA map shows, one route makes some weekday stops in lower Manhattan below Canal St, another makes some weekday stops at Steinway St and 46th St in Queens, and another makes local stops at some point during the week between 59th St and 125th St in Manhattan.)*

———————————-

*NB: I missed this in my earlier post — I made some observations about the GTFS data showing routes that didn’t seem to exist, but I had missed some of these complexities.  I’ve updated my earlier post explaining the situation.

———————————-

These are not mistakes in the GTFS data, but they’re a very small portion of overall weekday service.  The MTA map handles these exceptions by generalizing for the purposes of clarity.  What is interesting to me is that other applications incorporate the exceptions at the risk of seeming like they’re showing a mistake.  So when OpenTripPlanner or Google displays the 2 making local stops on 7th Ave in Manhattan without qualifying it, I’d imagine most subway riders (at least those familiar with MTA’s map) would do a double-take.  Again, I’m not being critical, but to me this raises questions about using data “feeds” without a greater level of manual intervention to make the data more meaningful and present it in a way that’s more like what the riding public expects.

I thought I’d be able to easily omit the “anomaly” weekday trips/routes by keeping only weekday service above a certain frequency threshold.  That works for most instances, but setting the threshold too high (even as high as 25 weekday trips) omits trips that should be included, such as the Z stops along the Jamaica line and the A train trips along the Rockaway Shuttle line.

So I implemented a bit of a hack, as follows (a rough SQL sketch of the filtering steps appears after the list):

  • The MTA service guide shows rush hour service starting at 6:30am, and “evening” service extending to midnight.  So I queried out all weekday trips (service_id ending in 'WKD') with arrival times between '06:30:00' and '23:59:00'.
  • After I concatenated stop_id and route_id from this selection, I grouped on this concatenated field and selected all entries where the record count was greater than 20 (this threshold removes the “…some rush hour trips to/from New Lots Av, Brooklyn” issue as well as the other weekday anomalies) OR where the concatenated stop_route field ends in 'Z_WKD' OR where the stop_route field begins with 'H' and ends with 'A_WKD'.  I think this got them all.  If anyone goes through this crazy process independently and finds something different, please let me know (!).  I saved the result as a “stops_route_list_wkd” file.
  • Then I selected all others, and saved this as a “stops_route_list_offhours” file.
  • Then I dropped the filters altogether and created a “stops_route_list_all” file.
  • I ran the COALESCE script against each of these three files and ran the SELECT statement I mentioned above [SELECT Distinct stop_id, dbo.RouteList (stop_id) AS RouteList FROM dbo.stops_route_list_*] to give me three separate lists of routes per stop.
  • I joined these with the geocoded “stops.txt” to create three separate route attribute fields that can be used for labeling (depending on what type of map you wanted to create — predominant weekday service, offhours service, or all service).

This gives me the following table (excerpt):

Typos and more

Amazingly enough, the data hassles don’t stop there.

I found one geographic error in the stops.txt file, and numerous naming inconsistencies (and at least two misspellings) in the stop_name field.

The geographic error has to do with the two Cortlandt Street stations in lower Manhattan.  It appears that the stop IDs were switched in the GTFS data.  Stop ID 138 has the name “Cortland St – IRT”, but has lat/lon coordinates that place it on the BMT/Broadway line.  Stop ID R25 has the name “Cortlandt St – World Trade Center”, but has lat/lon coordinates that place it on the IRT/7th Ave line.  Here’s what it looks like when I map it in ArcMap:

Here’s how it’s shown on OpenTripPlanner:

… and:

For now I’ve switched the attributes for these two stops in the shapefile I’ve linked to at the end of this post, but hopefully MTA will correct this soon.

The naming inconsistencies were more perplexing.  Station names in the stops.txt file are all over the place — parentheses are sometimes included, sometimes not; dashes are used arbitrarily; 16 stops have leading spaces in the name; and there’s a confusing mix of UPPER/Proper/lower case text.

What’s worse, the naming “convention” (if you can call it that) in stops.txt is also inconsistent with MTA’s subway map, MTA’s file of station entrances/exits, and other applications such as Google Maps.  Most of the transit apps I’ve seen simply use the stops.txt station names verbatim, but below I summarize my methodology for cleaning this up.  Hopefully MTA will update its next iteration of GTFS data with something more consistent.

Here are some examples of these issues:

  • Stop IDs 626 and 627 (86th St and 77th St on the Lexington line) each have a leading space in the name, but the adjacent stops on the Lexington line are fine.
  • Stop ID B12 (“NINTH AVE (WEST END)-9 ave”) includes both AVE and ave.
  • All four of the 110th St stops in Manhattan (IDs 118, 227, 623, and A17) are listed as follows – these examples really take the cake:
    • 110TH STREET – BWAY – Cathedral Pkwy
    • 110 STREET – LENOX (CENTRAL PARK NORTH)
    • 110TH STREET – LEXINGTON
    • 110TH STREET – CATHEDRAL PKWY
    • Here’s how different these stops are named on MTA’s own map:

  • Sometimes street types use a mix of spellings, such as stop ID 112 (168TH STREET – BWAY- WASHINGTON HGTS) and A09 (168TH STREET – IND – WASHINGTON HEIGHTS).
  • I thought the two Dyckman St stops in upper Manhattan were good examples: stop ID 109, listed as “DYCKMAN ST. – 200 STREET”, and stop ID A03, listed as “DYCKMAN STREET (200 ST)”.

The misspellings I noticed were:

  • stop ID A32: “WEST 4 ST – UPPER LEVEL – WASHINTON SQ” (i.e., Washington is missing the “g”); and
  • stop ID 706: “103RD STREET – CORAON PLAZA” should be “Corona Plaza”.

As data problems go, this isn’t too bad, per se.  But it’s odd to me that there’s such a mix of different naming types, and that it’s so different from the MTA’s own map.  If the differences followed some set of rules or were otherwise there for a reason, I’d be more comfortable with it. But when I see data inconsistencies like this, I worry that larger issues are at play – such as data entry problems that make the whole thing suspect (or at least the whole list of station names).  For example, I can’t imagine how misspellings crept into the station names, except if the names were actually typed manually into MTA’s GTFS file.  So much for a data “feed” that supposedly mirrors what MTA uses itself.

Regardless of why the problems exist, it would be good if MTA fixed them in the next iteration (or at least explained why they’re there).

Here’s what I did to fix the problems for now (a rough SQL sketch follows the list):

  • removed leading spaces;
  • converted all the station_name values to UPPER CASE;
  • removed periods;
  • removed parentheses (and replaced each leading paren with a dash);
  • removed suffixes such as BWAY, LEXINGTON, LENOX, IND, IRT, 7 AV; and
  • fixed typos (‘BAIN BRIDGE’, ’9 ave’, ‘L. I.CITY’) and the misspellings.
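If it’s useful, here’s roughly what that cleanup looks like in T-SQL, assuming the station records were loaded into a table I’ll call dbo.stops with a stop_name field.  The suffix removal and the one-off typos aren’t shown since they’re station-specific; only the general steps and the two clear misspellings are included:

-- Normalize station names (run against a working copy of stops.txt)
UPDATE dbo.stops SET stop_name = LTRIM(RTRIM(stop_name));      -- strip leading spaces
UPDATE dbo.stops SET stop_name = UPPER(stop_name);             -- consistent case
UPDATE dbo.stops SET stop_name = REPLACE(stop_name, '.', '');  -- remove periods
UPDATE dbo.stops SET stop_name = REPLACE(stop_name, '(', '- ');  -- leading paren becomes a dash
UPDATE dbo.stops SET stop_name = REPLACE(stop_name, ')', '');    -- drop closing paren
-- Misspellings noted above
UPDATE dbo.stops SET stop_name = REPLACE(stop_name, 'WASHINTON', 'WASHINGTON');
UPDATE dbo.stops SET stop_name = REPLACE(stop_name, 'CORAON', 'CORONA');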

Results

My GIS file of subway stations adds some enhancements to the raw GTFS data that I hope will be useful to GIS practitioners and app developers – fields listing the route IDs per station (based on predominant weekday service and “off hours” service), and station names cleaned of the inconsistencies and typos described above.  It’s still not perfect, but I think it’s a good first step. Hopefully you can use it for your apps and projects.  Here is a link to a zipped version of the shapefile:

Note that I’ve left the route IDs in this file unchanged from the GTFS routes.txt file.  So my file includes routes such as “6X” and “FS” and “H”.  I thought it would be better to leave these as-is, and let you change them (or not) in your own application.

I guess any standardized data system like GTFS that tries to make sense of a subway network as complicated as New York’s will have issues.  But I think for New York’s implementation of “GTFS” to really become a “feed”, there’s lots more work to be done.  Hopefully this post helps shine some light on ways to improve the data.

Btw, thanks to everyone for their comments and feedback on my earlier posts – at my blog and sent separately via email and Twitter.  I’m glad my efforts are helpful.