MTA data opened up; provided here in GIS format

I wasn’t able to attend this week’s “MTA Unconference for Developers,” but it sounds like it was a great event.  My colleague Dave Burgoon sat in, and I followed the Twitter stream and read several of the follow-up posts.

The shift in attitude and action at MTA to open up access to their data and invite developers and others to work with them to use it is heartening.  I hope it spurs other agencies to do the same.   (In the wake of MTA’s new-found openness, it’s especially mind-boggling that, for example, the NYC Dept of City Planning still requires a license fee for its real property data, and Nassau County and the Suffolk County Real Property office do the same.)

The raw data

The first thing I did as I was reading the conference tweets was to look at the new data the MTA has released.  (MTA will keep you updated on data changes via email if you subscribe here.)  Most of it is in the GTFS format (formerly “Google Transit Feed Specification”, now the “General TFS” to get away from any corporation-specific connotations).  At first that was disconcerting — where were the GIS files?  I wanted pre-set shapefiles or KML files, but nothing was listed.

Of course, someone had already thought of that :).  Data in GTFS format includes latitude/longitude, so that was encouraging.  And after digging a bit further, I found open source tools for exporting GTFS data into KML format, which can then be imported into other programs (such as ArcGIS) for mapping and spatial analysis.  Now that I’ve worked with the data a bit, it all makes sense: the GTFS format gives you the flexibility to import the data and analyze it however you’d like, with whatever software you’d like.  And of course it’s structured to facilitate mapping.  Silly me, Google wouldn’t create a data format that couldn’t be easily integrated with Google Maps, etc.

But I’m more familiar with ArcGIS than the TransitFeedDistribution tools that convert to KML etc.  So instead I’ve created shapefiles in ArcGIS of some of MTA’s key data sets.  I’ve posted links to the shapefiles below — feel free to use them however you’d like.  I’ve added some notes on that process. And here’s a map of one of those shapefiles — bus routes in Brooklyn:

The context

To put this in some context, it’s amazing to me that this data is now publicly and easily available.  I’ve been using GIS professionally for almost 20 years, and I think it’s safe to say that those of us working with GIS in New York have grown weary of fighting to obtain data that you’d think would be commonplace, such as bus routes, subway routes, commuter rail lines, and related usage and performance statistics.  When I directed the Community Mapping Assistance Project at NYPIRG, or more recently with the CUNY Graduate Center, clients and project partners would ask us to add bus routes to their maps, or to analyze bus transit options, and we’d always have the same answer: the MTA refuses to provide access to the data, so you’re out of luck.  (Or, maybe I was able to find someone years ago who “unofficially” slipped me a floppy disk with bus routes in TransCad format, but now it’s out of date and I can’t get a newer version.)  Of course I’m not the only one who wants to map bus routes and other transit data, so the MTA’s new data access is great news for many people and institutions, not to mention the riding public.

My methodology

To create the shapefiles, here’s what I did:

  • downloaded the .txt files from MTA’s website;
  • opened these in Notepad (or Excel or SPSS, for the larger files) to get a sense of the file content and relational structure; and
  • then added them to an ArcGIS data frame. 

Geographically speaking, the data are either points (i.e., subway, train, or bus stops) or lines (routes).

  • For the points I used the “Display XY Data” function to create a point representation of the stops (see screenshot).
  • I assigned the “North American Datum of 1983” (NAD83) for each file’s spatial reference, but did not project the data.  That way anyone accessing the shapefiles can project them as needed.  (One exception is the NYCT bus data — I projected the stops and routes files using the New York State Plane Long Island (feet) coordinate system.  If anyone needs these files unprojected, just let me know.)
  • The “stops.txt” files do not include route information, only “stop IDs” that can be associated with other MTA files to obtain route names and descriptions.  Any given stop can be associated with multiple routes, so I decided not to join the route data to the stops – that can be done in your application as needed. 
  • To finish, I exported each stops file and renamed it with the category and date (such as “nycbusstops_100401”). 
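For readers without ArcGIS, the point step can be approximated in plain Python. This is only a rough sketch, not the ArcGIS workflow described above: it assumes a stops file with the standard GTFS columns (stop_id, stop_name, stop_lat, stop_lon), writes GeoJSON rather than a shapefile, and uses made-up sample rows in place of a real download. The function name is hypothetical.

```python
import csv
import io
import json


def stops_to_geojson(stops_file):
    """Turn GTFS stops.txt rows into a GeoJSON FeatureCollection of points."""
    features = []
    for row in csv.DictReader(stops_file):
        try:
            lon = float(row["stop_lon"])
            lat = float(row["stop_lat"])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with missing or malformed coordinates
        features.append({
            "type": "Feature",
            # GeoJSON stores coordinates as [longitude, latitude]
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {
                "stop_id": row.get("stop_id"),
                "stop_name": row.get("stop_name"),
            },
        })
    return {"type": "FeatureCollection", "features": features}


# Made-up sample rows standing in for a real stops.txt (assumption).
SAMPLE_STOPS = """stop_id,stop_name,stop_lat,stop_lon
1001,EXAMPLE AV/1 ST,40.650000,-73.950000
1002,EXAMPLE AV/2 ST,40.651000,-73.951000
1003,NO COORDINATES,,
"""

collection = stops_to_geojson(io.StringIO(SAMPLE_STOPS))
print(json.dumps(collection["features"][0]["geometry"]))
```

With a real feed, the same function could be fed an open file instead, for example `open("stops.txt", newline="", encoding="utf-8-sig")`, where the encoding choice is a guess intended to tolerate a byte-order mark.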

Here are the shapefiles for the stops (provided in zip file format):

For the routes, the methodology was slightly different:

  • Although the GTFS includes a “routes” file, this just includes the route ID, route name, and other descriptive information — no geometry.  Instead, there’s a separate “shapes” file that includes the latitude/longitude for each point, or node, along the route.  In ArcGIS, the trick is to create a point representation of these nodes and then literally “connect the dots” to create the corresponding line representations.
  • I used the “Display XY Data” function to create the points, and then used the nifty “ET GeoWizards” toolkit to convert the points to polylines.  (ET GeoWizards is a “collection of powerful data manipulation and topology creation functions for ArcGIS”.  It’s really great, and many of the tools are free, with additional functionality for a modest fee.)
  • In the conversion process, ET GeoWizards uses the “shape ID” field, the lat/lon fields, and a “shape_pt_sequence” field to determine which points are connected together to draw the lines properly. 
  • I exported each “shapes” file to native shapefile format, but I wasn’t done.  The “shapes” files don’t include any route information.  The “trips” files contain “shape ID” and “route ID” fields, which provide the linkage between the “shapes” files and the “routes” files.  (For the LI Bus routes, I also copied the route info and URLs from the MTA website and reformatted that to join to the LI Bus shapes file.)
  • After joining the data, a last step was needed.  The shapes file provides a separate shape, or line, for each type of route (such as “inbound” and “outbound”) for each actual route (the B3, the M103, etc).  In order to create a GIS layer that includes individual features for each route, I needed to collapse the data.  (The LI Bus file, for example, includes 2,048 discrete shapes but only 104 routes, and even this includes duplicates; when it’s finally pared down, there are only 59 actual LI Bus routes.)
  • I used the “Dissolve” tool in ArcToolbox to collapse the shapes files with the joined route information.  Then I exported these files and renamed them with the category and date (such as “libusroutes_100308”).
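The “connect the dots,” join, and collapse steps above can also be sketched outside ArcGIS. The following is a minimal illustration, not a reimplementation of ET GeoWizards or the Dissolve tool: it assumes the standard GTFS columns in shapes.txt (shape_id, shape_pt_lat, shape_pt_lon, shape_pt_sequence) and trips.txt (route_id, shape_id), the function names are hypothetical, and the sample rows are made up. Grouping lines per route is only loosely analogous to a dissolve, since no geometry is actually merged.

```python
import csv
import io
from collections import defaultdict


def build_shape_lines(shapes_file):
    """Group shapes.txt points by shape_id, ordered by shape_pt_sequence."""
    points = defaultdict(list)
    for row in csv.DictReader(shapes_file):
        points[row["shape_id"]].append((
            int(row["shape_pt_sequence"]),
            float(row["shape_pt_lon"]),
            float(row["shape_pt_lat"]),
        ))
    # Sorting the tuples orders each shape's nodes by sequence number.
    return {sid: [(lon, lat) for _, lon, lat in sorted(pts)]
            for sid, pts in points.items()}


def shapes_by_route(trips_file):
    """Use trips.txt to find which shape_ids belong to each route_id."""
    routes = defaultdict(set)
    for row in csv.DictReader(trips_file):
        if row.get("shape_id"):  # ignore trips with no shape reference
            routes[row["route_id"]].add(row["shape_id"])
    return dict(routes)


def collapse_routes(lines, route_shapes):
    """Collect every line for a route into one multi-part feature."""
    return {rid: [lines[sid] for sid in sorted(sids) if sid in lines]
            for rid, sids in route_shapes.items()}


# Made-up sample data (assumption); note the out-of-order sequence numbers.
SAMPLE_SHAPES = """shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence
X1_out,40.60,-73.90,2
X1_out,40.59,-73.91,1
X1_in,40.60,-73.90,1
X1_in,40.59,-73.91,2
"""
SAMPLE_TRIPS = """route_id,service_id,trip_id,shape_id
X1,WKD,t1,X1_out
X1,WKD,t2,X1_out
X1,WKD,t3,X1_in
X1,WKD,t4,
"""

lines = build_shape_lines(io.StringIO(SAMPLE_SHAPES))
route_shapes = shapes_by_route(io.StringIO(SAMPLE_TRIPS))
route_lines = collapse_routes(lines, route_shapes)
print(len(lines), "shapes collapsed into", len(route_lines), "route feature(s)")
```

Because many trips share the same shape, collecting shape IDs into a set per route is what removes the duplicates before the lines are grouped.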

Perhaps there’s an easier way of doing all this with the TransitFeedDistribution tools, but for my purposes the ArcGIS tools worked just fine.  Here are the shapefiles of the routes (also in zip file format):

  • Long Island Bus
  • LIRR (Important Note: the MTA didn’t include a “shapes.txt” file for the Long Island Rail Road, so I couldn’t create a shapefile of train routes. But our mapping service at CUNY already has that.  Therefore, the LIRR link is not from the MTA data, but from an earlier shapefile directly from LIRR.)
  • Metro North
  • NYCT Bus (I provide two files for NYCT bus routes – one called “grouped” which includes 248 features, one for each bus route; and a second called “tripheadinfo” which includes “trip headsign” text — this includes duplicate line features [for a total of 733 features] because buses on a given route may be travelling to different end points.)


Note that I have not provided route files for NYCT subways or “bus company” routes.  The shape_ids in the trips.txt file for the subways were mostly NULL, so I wasn’t able to link the subway shapes.txt file via the trips.txt and routes.txt files to add the route names.  But, I already have a shapefile of subway routes, which you can download here.
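A quick way to spot this kind of gap before attempting a join is to count, per route, how many trips actually carry a shape reference. This is a small sketch under assumptions: trips.txt uses the standard GTFS columns route_id and shape_id, a missing value may appear either as an empty field or as the literal text NULL, and the sample rows and function name are made up.

```python
import csv
import io
from collections import Counter


def shape_id_coverage(trips_file):
    """Count trips with and without a usable shape_id for each route_id."""
    with_shape, without_shape = Counter(), Counter()
    for row in csv.DictReader(trips_file):
        shape_id = (row.get("shape_id") or "").strip()
        # Treat both an empty field and the text "NULL" as missing.
        if shape_id and shape_id.upper() != "NULL":
            with_shape[row["route_id"]] += 1
        else:
            without_shape[row["route_id"]] += 1
    return with_shape, without_shape


# Made-up sample rows standing in for a real trips.txt (assumption).
SAMPLE_TRIPS = """route_id,service_id,trip_id,shape_id
R1,WKD,t1,
R1,WKD,t2,NULL
R1,WKD,t3,R1_north
R2,WKD,t4,R2_south
"""

with_shape, without_shape = shape_id_coverage(io.StringIO(SAMPLE_TRIPS))
for route_id in sorted(set(with_shape) | set(without_shape)):
    print(route_id, "linked:", with_shape[route_id],
          "unlinked:", without_shape[route_id])
```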

Note also that the shapefiles of routes only include the route geometry with some basic attributes (route names, and maybe MTA URLs).  I did not attach any of the scheduling or performance data that MTA has also provided.  That data is part of the rest of the MTA’s GTFS data feed if you want to link it yourself to the route shapefiles.
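As one illustration of that kind of linking, the stops (which carry only stop IDs) can be tied to routes by following the chain from stop_times.txt to trips.txt. The sketch below is an assumption-laden example rather than something done for the shapefiles here: it relies on the standard GTFS columns (trip_id and stop_id in stop_times.txt, trip_id and route_id in trips.txt), and the function name and sample rows are made up.

```python
import csv
import io
from collections import defaultdict


def routes_by_stop(trips_file, stop_times_file):
    """Map each stop_id to the set of route_ids whose trips call there."""
    trip_route = {row["trip_id"]: row["route_id"]
                  for row in csv.DictReader(trips_file)}
    stop_routes = defaultdict(set)
    for row in csv.DictReader(stop_times_file):
        route_id = trip_route.get(row["trip_id"])
        if route_id is not None:  # skip stop times for unknown trips
            stop_routes[row["stop_id"]].add(route_id)
    return dict(stop_routes)


# Made-up sample data (assumption).
SAMPLE_TRIPS = """route_id,service_id,trip_id
X1,WKD,t1
X2,WKD,t2
"""
SAMPLE_STOP_TIMES = """trip_id,arrival_time,departure_time,stop_id,stop_sequence
t1,08:00:00,08:00:00,S1,1
t1,08:05:00,08:05:00,S2,2
t2,09:00:00,09:00:00,S2,1
t9,09:00:00,09:00:00,S3,1
"""

stop_routes = routes_by_stop(io.StringIO(SAMPLE_TRIPS),
                             io.StringIO(SAMPLE_STOP_TIMES))
for stop_id in sorted(stop_routes):
    print(stop_id, sorted(stop_routes[stop_id]))
```

A stop shared by several routes simply ends up with several route IDs in its set, which is why the result is one-to-many rather than a single route attribute per stop.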

For our own part, we’ll be adding the bus routes shortly to the OASISnyc.net and Long Island Index mapping sites, and using them for other mapping work here at the CUNY Graduate Center.

Some other issues I encountered

Long Island bus systems: as far as I can tell, MTA’s data does not include bus stop or route information for buses in the City of Long Beach (Nassau County), the Huntington Area Rapid Transit (HART) system in Suffolk County, or the Suffolk County bus system itself.  So the LI Bus files above do not provide a comprehensive set of GIS files for bus stops and routes throughout Long Island.  In the past we’ve cobbled this together from various sources, but if anyone has up-to-date files for these areas, I’d love to hear about them.

“Landmarks”.  The MTA’s “Bus Company” files include a “landmarks.txt” file.  Presumably this represents easily recognized local features, but it’s impossible to tell for sure without a description from MTA.  Also, I’m not sure why the file is included with the “bus company” data and not the other categories (or on its own).  The file includes a “Type” field but no description of what the type codes mean.  Some of them seem obvious (ES=elementary school?), but others are opaque (the landmark called “2 Bay Club Dr” has a type code of “AH” – what does that mean?).  But in case you want to use it, here it is (at your own risk!).

Lack of metadata.  Although the GTFS website provides general descriptions of field names and data types, I wish there were better metadata from MTA directly — for example, it would be helpful to know how the lat/lon data were generated (i.e., what basemap was used, what scale is the data best viewed at).  And what’s the difference between the stops in the “Bus Company” file and the “New York City Transit – Bus” file?

But these are minor things.  Overall, it’s a huge step that MTA has opened its data doors.  Kudos to the new MTA leadership and everyone else who nudged (or aggressively pushed) them along the way.  I’m looking forward to the great apps that will come out of this, and to other agencies following suit.

17 Responses

  1. “And what’s the difference between the stops in the “Bus Company” file and the “New York City Transit – Bus” file?”

    The MTA actually has multiple bus divisions that are as I understand it largely operated independently. MTA NYCT operates most of the buses in New York, but MTA Bus Company also operates a fair number.

    This also explains why the MTA Bus Company has a “landmarks.txt” file and the others don’t: It’s a separate agency, with a different computer system. The data for the MTA Bus Company isn’t actually GTFS, it’s some other format.

    Also, you might want to send a message about your post to the official MTA dev mailing list: http://groups.google.com/group/mtadeveloperresources

    Great post!

    • Thanks Nicholas. I’ll definitely send a note to the MTA dev mailing list.

      Thanks also for clarifying the “Bus Company” data. Hopefully the MTA will provide better metadata so this will all be more transparent to people with varying levels of familiarity with transit data in New York.

  2. Great work Steve, really nice, and quite an amazing policy change that I very much welcome. And I agree, hopefully other agencies continue this trend. Just a quick observation – I think I discovered what the “AH” means at 2 Bay Club Dr, from Google Map….

    Animal Holistic Care, Dr Mark Haimann‎
    2 Bay Club Drive
    Flushing, NY 11360-2917

  3. Thanks Steve. I also was amazed at the MTA’s sudden turnaround on open data, and my first thoughts were how I could map it too. I am (intermittently) kicking around a solution (using a python script) to convert these to GIS. You can find it here: http://www.sendspace.com/pro/dl/8efipc.

    If you copy the script files into your folder that contains the GTFS text files and execute it, it will construct the point/polyline geometry and load all the data into a geodatabase.

    My idea is to create something that would work with anything in the GTFS format. In truth, it will require some modification as each agency will have its own quirks. (Watch out for the routes file that has some text values in “quotes” and others not.)

    Disclaimer: It’s still a little half-baked. I haven’t yet tested it with larger datasets and it doesn’t yet build any of the connections between the datasets, but it is a nice one-click solution for what it does.

  4. […] Steve Romalewski of CUNY Mapping Service at the Center for Urban Research has converted several of the MTA data sets that were released as GTFS (General Transit Feed Specification) to shapefile format. Thanks Steve! Links to data available in this blog post […]

  5. Great little Python script that avoids having to use ETools to join the shapefile into lines.

    One little bug: gp.Append( should be replaced with gp.Append_management( to resolve duplicate tools called Append.

    This worked well for the google_transit.zip file made available by the Auckland transport agency although with heavy restrictions on building your own websites with it.

  6. Steve, this is great work.

    Am I crazy, or are there a bunch of NYCT Queens bus stops and routes missing from the dataset? I ride the bus every day and I can see the Q66 and Q49 are missing from the dataset. The whole Rockaways seems to be missing.

    It seems like the stops are included in the ‘bus company stops’ dataset, but with no attribute data to distinguish routes, so it’s not much help.

    We’ve been able to recreate what Steve did, so it seems to me that it’s missing from the MTA side.

    • Tim, thanks for your comment and for checking the data. You might also want to post your question at http://groups.google.com/group/mtadeveloperresources – several MTA staff follow that group and have responded to other questions about the data.

    • Also, as I recall, the format for the routes in the “bus company” data was more complex than the other types of files. As Nick points out in a comment above, “The data for the MTA Bus Company isn’t actually GTFS, it’s some other format.”

      There’s a PDF that explains the format (filename “MDI_Export.pdf”), available from the MTA’s website.

  7. […] an earlier post I talked about how great it was for MTA to publish their data and make it easily accessible for […]

  8. […] converted the latest MTA GTFS data to GIS format for NYC Transit bus routes, following up on my earlier post this spring.)  Scroll to the end of this post for the […]

  9. Steve,

    I think that Tim touched on some form of this in a previous post, but I was trying to determine which stops are attributable to each bus route. Specifically the Q12 and Q79 lines. Is there any way of determining specific stops along routes without capturing competitor stops on common intersections?

  10. […] data wrangled by Steve Romalewski: mta-data-in-gis-format […]

  11. […] of the fatalities from 2008-2010 occurred within a quarter-mile of a bus stop.  The group used my GIS version of MTA’s bus GTFS data for their […]

Comments are closed.
