
More MTA data in GIS format

My previous post was on subway routes; this time I tackle subway stations. (Apologies for another long one!)

When I planned this post, it seemed pretty straightforward. My goal: create a GIS “point” file of subway stations based on MTA’s latest GTFS release (easy enough) that included an attribute field with a list of subway lines stopping at each station. The attributes would look like this:

This format would be useful for adding labels to a map layer of subway stations (like Google has, or like the MTA map – station name plus list of trains stopping at that station).

But the GTFS “stops.txt” file only includes station names, not a list of routes. The “stop_times.txt” file includes trip IDs that can be joined with the “trips.txt” file to identify routes as well as stops.  But this represents more than 500,000 records (one for each trip by each route stopping at each station).  If you dissolved those on the stop_id field (using standard grouping or dissolve functions in Excel or Access, for example), you’d only get the first route ID per station, not a comma-separated inline list like I wanted.

I figured there must be a script out there to extract the routes per stop and write them to a field in the “1, 2, 3” format.  I found one, but that was the least of my worries.

The script I found (via my colleague Dave Burgoon) uses SQL’s “COALESCE” function as follows (numerous sites explain how it works, such as this one):

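ALTER FUNCTION RouteList (@StopID varchar(100))
RETURNS varchar(500)  -- length is an assumption; this clause was missing above
AS
BEGIN
    DECLARE @Routes varchar(500)
    -- Append each route_id for this stop to @Routes, separated by commas
    SELECT @Routes = COALESCE(@Routes + ', ', '') + route_id
    FROM dbo.stops_route_list
    WHERE stop_id = @StopID
    ORDER BY route_id ASC
    RETURN @Routes
END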

In order to make use of this, first I joined the trips.txt file to stop_times.txt (on the trip_id field), created a new field representing a concatenation of stop_id and route_id, then grouped on that new concatenated field.  This gives me a unique list of all the stop – route combinations.  I called it “stops_route_list”.  I used that with the COALESCE function above, then ran a SELECT statement on the results of the function [SELECT Distinct stop_id, dbo.RouteList (stop_id) AS RouteList FROM dbo.stops_route_list] to give me the comma-delimited inline result.  Several steps, but it works.
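If you don't have SQL Server handy, the same join-and-collapse steps can be sketched in Python using only the standard library. This is a minimal sketch rather than the script described above: the function name `routes_per_stop` and the tiny in-memory sample rows are made up for illustration, and it assumes the standard GTFS column names (`trip_id`, `route_id`, `stop_id`).

```python
import csv
import io
from collections import defaultdict

def routes_per_stop(trips_file, stop_times_file):
    """Return {stop_id: "1, 2, 3"} from GTFS trips and stop_times rows."""
    # trip_id -> route_id lookup (this is the "join" on trip_id)
    trip_to_route = {row["trip_id"]: row["route_id"]
                     for row in csv.DictReader(trips_file)}

    # A set per stop keeps only the unique stop/route combinations
    routes = defaultdict(set)
    for row in csv.DictReader(stop_times_file):
        route_id = trip_to_route.get(row["trip_id"])
        if route_id is not None:
            routes[row["stop_id"]].add(route_id)

    # Collapse each set into a sorted, comma-separated string
    return {stop: ", ".join(sorted(r)) for stop, r in routes.items()}

# Hypothetical sample rows, not real MTA data
trips = io.StringIO("route_id,service_id,trip_id\n"
                    "1,WKD,T1\n2,SAT,T2\n3,WKD,T3\n")
stop_times = io.StringIO("trip_id,arrival_time,stop_id\n"
                         "T1,07:00:00,S1\nT2,08:00:00,S1\n"
                         "T1,07:05:00,S2\nT3,07:10:00,S1\nT1,09:00:00,S1\n")

route_lists = routes_per_stop(trips, stop_times)
print(route_lists)
```

With the real files you would pass open file handles for trips.txt and stop_times.txt instead of the `StringIO` objects. Note that the sort is a plain text sort, so a route ID like "10" would sort before "2".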

Data Complications

I quickly figured out that there were two problems with this effort.  One was the complex nature of the New York City subway system itself.  The other had to do with data quality problems with the MTA’s stops.txt file.

Subway system complexities

When I joined my comma-delimited result with a geocoded map layer of subway stations, I saw what looked like substantial errors.  For example, the stations east of Utica Ave in Brooklyn on what the MTA map shows as the 3 line were labeled on my map with the 2, 3, 4, 5 trains stopping between Utica Ave and New Lots.  The 7th Ave local stations in Manhattan (such as 18th and 23rd streets) showed the 1 *and* 2 lines stopping there.  The was shown stopping at Jamaica Ave in Queens (but the MTA map only shows the stopping there).

Then I looked at other applications that used MTA’s GTFS data.  The Google Maps basemap matches the MTA’s map in terms of station labels (for example, the 23rd St stop on 7th Ave is labeled as “23rd St [1]”).  But clicking on Google’s individual station icons opens a popup window with the “additional” lines I mention above (see example below).  OpenTripPlanner.com (which just launched last week) showed the same thing.

Then I read the fine print on the MTA subway map.  It only “depicts weekday service.”  So presumably the 2, 4, and 5 trains at New Lots in Brooklyn, for example, must represent weekend trips.  (And the 2 train, in the images above, would only be on the weekend.) To test that, I filtered the stop_times.txt file for only weekday trips (using all the trip_ids containing “WKD”) and ran the COALESCE script against that filtered list.  The 2 at the 7th Ave local stations dropped out (it does indeed stop at local Manhattan stations only on the weekend).  But other anomalies remained.  For example, the 2, 4, and 5 trains were still shown stopping at New Lots.
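The weekday filter itself is just a substring test on trip_id. A minimal Python sketch, assuming the weekday code shows up somewhere in each trip_id as “WKD”; the function name and the sample IDs are hypothetical, and real MTA trip IDs are longer.

```python
def weekday_trip_ids(trip_ids, marker="WKD"):
    """Keep only the trip_ids whose text contains the weekday marker."""
    return [t for t in trip_ids if marker in t]

# Hypothetical trip_id values for illustration only
sample_ids = ["A_WKD_001", "A_SAT_002", "B_SUN_003", "B_WKD_004"]
weekday_ids = weekday_trip_ids(sample_ids)
print(weekday_ids)
```

The filtered IDs can then be used to subset stop_times.txt before building the route list per stop.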

Then I looked further at the MTA’s individual line maps (which were mentioned in the fine print on MTA’s subway map), as well as the agency’s summary service guide [PDF] for subways.  The fine print on *these* documents tells the story — using the New Lots example, the 2, 4, and 5 all have “some rush hour trips to/from New Lots Av, Brooklyn”.

Last I checked, rush hour is during the week.  So when MTA’s subway map shows that “weekday service” to New Lots Av is only provided by the 3 train (see below), it contradicts the MTA’s own more detailed service guide and line maps.

As it turns out, these 2, 4, and 5 train runs are a small part of weekday service to/from New Lots.  The combined stop_times.txt and trips.txt file reveals that there are 8 weekday 2-train trips stopping at New Lots, 58 4-train trips, and 6 5-train trips (compared with 262 3-train trips).

I’m not being critical here. This is what I mean by the complexity of the subway system.  There are many exceptions like this, where scheduling or routing needs dictate that some atypical trains stop at unexpected stations.  (For example, despite what the MTA map shows, the  makes some weekday stops in lower Manhattan below Canal St, the  makes some weekday stops at Steinway St and 46th St in Queens, and the  makes local stops at some point during the week between 59th St and 125 St in Manhattan.)*


*NB: I missed this in my earlier post — I made some observations about the lines/GTFS data showing routes that didn’t exist, but I had missed some of these complexities.  I’ve updated my earlier post explaining the situation.


These are not mistakes in the GTFS data, but they’re a very small portion of overall weekday service.  The MTA map handles these exceptions by generalizing for the purposes of clarity.  What is interesting to me is that other applications incorporate the exceptions at the risk of seeming like they’re showing a mistake.  So when OpenTripPlanner or Google displays the 2 making local stops on 7th Ave in Manhattan without qualifying it, I’d imagine most subway riders (at least those familiar with MTA’s map) would do a double-take.  Again, I’m not being critical, but to me this raises questions about using data “feeds” without a greater level of manual intervention to make the data more meaningful and present it in a way that’s more like what the riding public expects.

I thought I’d be able to easily omit the “anomaly” weekday trips/routes by selecting out weekday service with greater than a certain threshold of frequency.  That works for most instances, but setting it too high (even as high as 25 weekday trips) omits trips that should be included, such as the stops along the line and the train along the Rockaway Shuttle line.

So I implemented a bit of a hack, as follows:

  • The MTA service guide shows rush hour service starting at 6:30am, and “evening” service extending to midnight.  So I queried out all weekday trips (“service_id” ending in ‘WKD’) with arrival times between ‘06:30:00’ and ‘23:59:00’.
  • After I concatenated stop_id and route_id from this selection, I grouped on this concatenated field and selected all entries where the record count was greater than 20 (this threshold removes the “…some rush hour trips to/from New Lots Av, Brooklyn” issue as well as the other weekday anomalies) OR where the concatenated stop_route field ends in ‘Z_WKD’ OR where the stop_route field begins with ‘H’ and ends with ‘A_WKD’.  I think this got them all.  If anyone goes through this crazy process independently and finds different, please let me know (!).  I saved the result as a “stops_route_list_wkd” file.
  • Then I selected all others, and saved this as a “stops_route_list_offhours” file.
  • Then I dropped the filters altogether and created a “stops_route_list_all” file.
  • I ran the COALESCE script against each of these three files and ran the SELECT statement I mentioned above [SELECT Distinct stop_id, dbo.RouteList (stop_id) AS RouteList FROM dbo.stops_route_list_*] to give me three separate lists of routes per stop.
  • I joined these with the geocoded “stops.txt” to create three separate route attribute fields that can be used for labeling (depending on what type of map you wanted to create — predominant weekday service, offhours service, or all service).
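The selection logic in the first two bullets can be sketched in Python as follows. Treat this as an illustration under stated assumptions, not the queries I actually ran: the function name is hypothetical, the input is assumed to be (stop_id, route_id, service_id, arrival_time) tuples, and the layout of the concatenated key (stop_id, route_id, then a WKD suffix) is a guess chosen so that the 'Z_WKD' and 'H…A_WKD' tests make sense.

```python
from collections import Counter

def predominant_weekday_keys(records, min_trips=20):
    """Return the stop/route keys that count as predominant weekday service.

    records: iterable of (stop_id, route_id, service_id, arrival_time)
    tuples, with arrival_time as a zero-padded "HH:MM:SS" string.
    """
    counts = Counter()
    for stop_id, route_id, service_id, arrival in records:
        # Weekday trips arriving between 06:30:00 and 23:59:00 only;
        # zero-padded times compare correctly as plain strings
        if service_id.endswith("WKD") and "06:30:00" <= arrival <= "23:59:00":
            # Assumed key layout: stop_id, route_id, then a WKD suffix
            counts[f"{stop_id}_{route_id}_WKD"] += 1

    keep = set()
    for key, n in counts.items():
        frequent = n > min_trips                      # more than 20 trips
        z_train = key.endswith("Z_WKD")               # first exception
        rockaway_a = key.startswith("H") and key.endswith("A_WKD")  # second
        if frequent or z_train or rockaway_a:
            keep.add(key)
    return keep

# Hypothetical records (made-up stop and service IDs, not real MTA data)
sample = (
    [("S1", "3", "X_WKD", "08:00:00")] * 25    # frequent: kept
    + [("S1", "2", "X_WKD", "08:30:00")] * 2   # too few trips: dropped
    + [("M11", "Z", "X_WKD", "17:00:00")]      # kept by the Z exception
    + [("H04", "A", "X_WKD", "10:00:00")]      # kept by the H...A exception
    + [("S1", "4", "X_WKD", "05:00:00")] * 30  # before 06:30: ignored
    + [("S1", "5", "X_SAT", "12:00:00")] * 30  # not weekday: ignored
)
kept = predominant_weekday_keys(sample)
print(sorted(kept))
```

Everything not in the returned set would go into the “offhours” list, and skipping the filters entirely gives the “all” list.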

This gives me the following table (excerpt):

Typos and more

Amazingly enough, the data hassles don’t stop there.

I found one geographic error in the stops.txt file, and numerous naming inconsistencies (and at least two misspellings) in the stop_name field.

The geographic error has to do with the two Cortlandt Street stations in lower Manhattan.  It appears that the stop IDs were switched in the GTFS data.  Stop ID 138 has the name “Cortland St – IRT”, but has lat/lon coordinates that place it on the BMT/Broadway line.  Stop ID R25 has the name “Cortlandt St – World Trade Center”, but has lat/lon coordinates that place it on the IRT/7th Ave line.  Here’s what it looks like when I map it in ArcMap:

Here’s how it’s shown on OpenTripPlanner:

… and:

For now I’ve switched the attributes for these two stops in the shapefile I’ve linked to at the end of this post, but hopefully MTA will correct this soon.

The naming inconsistencies were more perplexing.  Station names in the stops.txt file are all over the place — parentheses are sometimes included, sometimes not; dashes are used arbitrarily, 16 stops have leading spaces in the name, and there’s a confusing mix of UPPER/Proper/lower case text.

What’s worse, the naming “convention” (if you can call it that) in stops.txt is also inconsistent with MTA’s subway map, MTA’s file of station entrances/exits, and other applications such as Google Maps.  Most of the transit apps I’ve seen simply use the stops.txt station names verbatim, but below I summarize my methodology for cleaning this up.  Hopefully MTA will update its next iteration of GTFS data with something more consistent.

Here are some examples of these issues:

  • Stop IDs 626 and 627 (86th St and 77th St on the Lexington line) each have a leading space in the name, but the adjacent stops on the Lexington line are fine.
  • Stop ID B12 (“NINTH AVE (WEST END)-9 ave”) includes both AVE and ave.
  • All four of the 110th St stops in Manhattan (IDs 118, 227, 623, and A17) are listed as follows – these examples really take the cake:
    • 110TH STREET – BWAY – Cathedral Pkwy
    • Here’s how different these stops are named on MTA’s own map:

  • Sometimes street types use a mix of spellings, such as stop ID 112 (168TH STREET – BWAY- WASHINGTON HGTS) and A09 (168TH STREET – IND – WASHINGTON HEIGHTS).
  • I thought the two Dyckman St stops in upper Manhattan were good: stop ID 109, listed as “DYCKMAN ST. – 200 STREET”, and stop ID A03, listed as “DYCKMAN STREET (200 ST)”.

The misspellings I noticed were:

  • stop ID A32: “WEST 4 ST – UPPER LEVEL – WASHINTON SQ” (i.e., Washington is missing the “g”); and
  • stop ID 706: “103RD STREET – CORAON PLAZA” should be “Corona Plaza”.

As data problems go, this isn’t too bad, per se.  But it’s odd to me that there’s such a mix of different naming types, and that it’s so different from the MTA’s own map.  If the differences followed some set of rules or were otherwise there for a reason, I’d be more comfortable with it. But when I see data inconsistencies like this, I worry that larger issues are at play – such as data entry problems that make the whole thing suspect (or at least the whole list of station names).  For example, I can’t imagine how misspellings crept into the station names, unless the names were typed manually into MTA’s GTFS file.  So much for a data “feed” that supposedly mirrors what MTA uses itself.

Regardless of why the problems exist, it would be good if MTA fixed them in the next iteration (or at least explained why they’re there).

Here’s what I did to fix the problems for now:

  • removed leading spaces;
  • converted all the station_name values to UPPER CASE;
  • removed periods;
  • removed parentheses (and replaced each leading paren with a dash);
  • removed suffixes such as BWAY, LEXINGTON, LENOX, IND, IRT, 7 AV; and
  • fixed typos (‘BAIN BRIDGE’, ‘9 ave’, ‘L. I.CITY’) and the misspellings.
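For anyone who wants to repeat the cleanup, the steps above can be sketched as a small Python function. This is a rough sketch under my own assumptions, not the exact procedure I ran: the function name, the suffix list as a Python constant, and the two-entry typo table are illustrative, and it only covers the two misspellings called out earlier. It also splits on every dash, so a name with a hyphen inside a word would be broken into separate parts.

```python
import re

# Line/division suffixes named in the list above
SUFFIXES = ["BWAY", "LEXINGTON", "LENOX", "IND", "IRT", "7 AV"]

# Only the two misspellings mentioned in this post
TYPO_FIXES = {"WASHINTON": "WASHINGTON", "CORAON": "CORONA"}

def clean_stop_name(name):
    """Apply the cleanup steps above to one stop_name value."""
    name = name.strip().upper()              # drop leading spaces, upper case
    name = name.replace(".", "")             # remove periods
    name = re.sub(r"\s*\(\s*", " - ", name)  # leading paren becomes a dash
    name = name.replace(")", "")             # closing paren is dropped
    for wrong, right in TYPO_FIXES.items():
        name = name.replace(wrong, right)
    # Split on hyphens or en dashes, trim, and drop suffix-only parts
    parts = [p.strip() for p in re.split(r"\s*[-–]\s*", name)]
    parts = [p for p in parts if p and p not in SUFFIXES]
    return " - ".join(parts)

for raw in ["DYCKMAN STREET (200 ST)",
            "110TH STREET – BWAY – Cathedral Pkwy",
            "WEST 4 ST – UPPER LEVEL – WASHINTON SQ"]:
    print(clean_stop_name(raw))
```

This does not reconcile ST versus STREET or HGTS versus HEIGHTS; those would need their own lookup table.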


I hope my GIS file of subway stations includes some enhancements over the raw GTFS data that will be useful to GIS practitioners and app developers – it includes fields that provide route IDs (based on predominant weekday service and “off hours” service), and cleans up all sorts of inconsistencies and typos in the station names.  It’s still not perfect, but I think it’s a good first step. Hopefully you can use it for your apps and projects.  Here is a link to a zipped version of the shapefile:

Note that I’ve left the route IDs in this file unchanged from the GTFS routes.txt file.  So my file includes routes such as “6X” and “FS” and “H”.  I thought it would be better to leave these as-is, and let users change them (or not) in their own applications.

I guess any standardized data system like GTFS that tries to make sense of a subway network as complicated as New York’s will have issues.  But I think for New York’s implementation of “GTFS” to really become a “feed”, there’s lots more work to be done.  Hopefully this post helps shine some light on ways to improve the data.

Btw, thanks to everyone for their comments and feedback on my earlier posts – at my blog and sent separately via email and Twitter.  I’m glad my efforts are helpful.
