New Subway Station in NYC: Hudson Yards

In September 2015, the MTA opened its first new subway station in NYC in decades. I’ve added the new station and extension of the 7 line to the Center for Urban Research’s (CUR’s) maps and underlying GIS data, and we’re making this updated data freely available.

Here’s the post at CUR’s website, and here are the links below with the data:

If you use the data (which I hope you do), please let me know how it works out.  If you use the files, please reference the “Center for Urban Research at the Graduate Center/CUNY” especially if you use the layer symbology in any printed maps or online applications.  Thanks!


Putting transit GIS data to use


I was reminded recently that Albert Sun’s terrific Wall St Journal interactive about the spatial patterns of Metrocard usage uses the subway routes in GIS format that I created.  It’s not a major part of the map; the routes are used as a backdrop more than anything. But I was glad the Journal was able to use the data.  (Per the notes from the map, the subway data was “from the MTA. Demographic data from the U.S. Census Bureau. Additional work refining subway line shapes from the CUNY Mapping Service at the City University of NY Graduate Center.”)  Here’s a screen shot:


Recently I’ve come across several examples of people being able to use the MTA subway and bus data that I had converted to GIS format a couple of years ago.  I know that I’ve been able to put the data to good use.  But I’m especially glad to see others benefiting from my efforts.

So I thought I’d share some maps and links below.  Hopefully this will inspire others to use the data, and to let us know about other examples.  If you’ve been able to use the subway or bus GIS data, please drop me a line by email or add a comment to this post.  Thanks!

Distance Cartograms

Zach Nichols wrote a week ago that he incorporated my GIS version of NYC subway routes into a blog post about “re-scaling NYC based on MTA transit time.”  Here’s one of his maps (a “distance cartogram”); very cool!

Mobile apps

One of the entrants in last year’s MTA AppQuest contest used the subway route GIS data as a layer on their map for reference.  The app — Dead Escalators — is being updated for distribution in the iTunes App Store.  Look for it there soon!  In the meantime, here are a couple of screen shots:


GIS data for student projects

  1. Liz Barry’s students at the New School are incorporating the data into their projects.  Glad to be of help, and thanks Liz for your kind words!
  2. Christopher Bride, a GIS student at CUNY’s Lehman College, used the data for his Capstone project this year examining the intersection of food deserts and the likely route home from subway/bus stations.  The project’s goal is to pinpoint fresh food-critical neighborhoods in New York City.  Here are two sample maps, focused on the Bronx:

  3. Lauren Singleton-Meyers at NYU’s Steinhardt School of Culture, Education and Human Development used the subway routes for a project with the New York Center for Alcohol Policy Solutions, for a campaign she’s launched to stop alcohol advertising on public transportation in the city.  As a start, she’s mapped schools and subway routes and stations.  Next steps will be to link pictures of alcohol ads to the subway route lines as part of an educational effort showing what types of ads are being displayed on each route.

Here’s her map (a work in progress) via ArcGISOnline and ArcGIS Explorer:

  Here are some example photos via her Flickr stream.  If anyone has suggestions on helping her with the next steps for her map, please get in touch (their Twitter handle is @EMTAA).

Inspiring similar efforts in other cities

Soon after I wrote my blog post with the MTA’s data in GIS format, it had an impact not only here in New York but in at least one other city: Chicago.  Blogger and urban planning advocate Steve Vance adapted my methodology to transform the GTFS data from the Chicago Transit Authority into GIS format.  Here’s his post: , plus a more in-depth discussion of his technique:

Proximity of bus stops to pedestrian accidents

This week the Tri-State Transportation Campaign published an analysis of pedestrian fatalities in Nassau County and several towns in Connecticut, and noted that in Nassau, for example, 83% of the fatalities from 2008-2010 occurred within a quarter-mile of a bus stop.  The group used my GIS version of MTA’s bus GTFS data for their analysis.

I haven’t examined TSTC’s report closely, so I’m not sure how strong of a causal relationship exists between bus stops, per se, and the fatalities (an anonymous commenter at TSTC’s blog argues that “Of course the most pedestrian deaths occur near bus stops, they’re located in the only places in the county where anyone actually walks”).

But one observer on Twitter, @capntransit, wondered if buses are so ubiquitous that the relationship would be a non-issue (they wrote “Isn’t 85% of Nassau County within a quarter-mile of a bus stop?”)  I thought I’d try to answer, and came up with the following by mapping the bus stops and block-level population data from the 2010 Census:

  • Nassau County’s land area is 285 square miles.  The area within 1/4 mile of all LI Bus stops is 119 square miles (42% of the county area); and
  • Nassau’s population in 2010 was 1.34 million people.  The population within 1/4 mile of all LI Bus stops in 2010 was 838,524 people (63% of the county population).
  • So on the face of it, the concentration of fatalities near bus stops seems disproportionately higher than the overall nearby population.  The map below highlights the bus stop coverage:
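For anyone who wants to check the arithmetic behind those percentages, here is a minimal Python sketch. It only uses the figures quoted above; the helper name `share` is mine, and the 1.34 million population is the rounded figure as quoted, so the last digit of the ratio should not be read too closely.

```python
# Figures quoted in the list above (LI Bus stops and 2010 Census blocks).
county_area_sq_mi = 285
buffer_area_sq_mi = 119
county_population = 1_340_000   # "1.34 million", rounded as quoted
buffer_population = 838_524
fatality_share = 0.83           # TSTC figure for 2008-2010

def share(part, whole):
    """Return part/whole as a fraction (hypothetical helper)."""
    return part / whole

area_share = share(buffer_area_sq_mi, county_area_sq_mi)
pop_share = share(buffer_population, county_population)

print(f"area within 1/4 mile of a stop: {area_share:.0%}")
print(f"population within 1/4 mile of a stop: {pop_share:.0%}")
# How over-represented fatalities are relative to nearby population
print(f"fatality share / population share: {fatality_share / pop_share:.2f}")
```

A ratio above 1 is what "disproportionately higher" means here. It says nothing about causation, which is the commenter's point.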

I’m glad my data conversion efforts have been helpful.  It’s only possible due to the MTA’s ongoing effort to provide easy public access to their data sets.  This enables me and many others to help improve life in and around the city by integrating their data into maps, applications, government accountability efforts, and more.  Please send more examples of how you’ve been able to use the data; highlighting these projects helps us all.

More MTA data in GIS format

My previous post was on subway routes; this time I tackle subway stations. (Apologies for another long one!)

When I planned this post, it seemed pretty straightforward. My goal: create a GIS “point” file of subway stations based on MTA’s latest GTFS release (easy enough) that included an attribute field with a list of subway lines stopping at each station. The attributes would look like this:

This format would be useful for adding labels to a map layer of subway stations (like Google has, or like the MTA map – station name plus list of trains stopping at that station).

But the GTFS “stops.txt” file only includes station names, not a list of routes. The “stop_times.txt” file includes trip IDs that can be joined with the “trips.txt” file to identify routes as well as stops.  But this represents more than 500,000 records (one for each trip by each route stopping at each station).  If you dissolved those on the stop_id field (using standard grouping or dissolve functions in Excel or Access, for example), you’d only get the first route ID per station, not a comma-separated inline list like I wanted.

I figured there must be a script out there to extract the routes per stop and write them to a field in the “1, 2, 3” format.  I found one, but that was the least of my worries.

The script I found (via my colleague Dave Burgoon) uses SQL’s “COALESCE” function as follows (numerous sites explain how it works, such as this one):

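ALTER FUNCTION dbo.RouteList (@StopID varchar(100))
RETURNS varchar(1000)  -- return length is an assumption
AS
BEGIN
    DECLARE @Routes varchar(1000)
    -- Build the comma-separated list; COALESCE handles the initial NULL
    SELECT @Routes = COALESCE(@Routes + ', ', '') + route_id
    FROM dbo.stops_route_list
    WHERE stop_id = @StopID
    ORDER BY route_id ASC
    RETURN @Routes
END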

In order to make use of this, first I joined the trips.txt file to stop_times.txt (on the trip_id field), created a new field representing a concatenation of stop_id and route_id, then grouped on that new concatenated field.  This gives me a unique list of all the stop – route combinations.  I called it “stops_route_list”.  I used that with the COALESCE function above, then ran a SELECT statement on the results of the function [SELECT Distinct stop_id, dbo.RouteList (stop_id) AS RouteList FROM dbo.stops_route_list] to give me the comma-delimited inline result.  Several steps, but it works.
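For readers without SQL Server, the same join-then-concatenate idea can be sketched in Python with only the standard library. This is not the script I ran, just an illustration: the function name `routes_per_stop` is made up, the column names follow the GTFS spec, and the sample rows are invented rather than real MTA records.

```python
import csv
import io
from collections import defaultdict

def routes_per_stop(trips_file, stop_times_file):
    """Return {stop_id: "route, route, ..."} from GTFS trips and stop_times."""
    # trip_id -> route_id lookup (the join on trip_id)
    route_of_trip = {row["trip_id"]: row["route_id"]
                     for row in csv.DictReader(trips_file)}
    # A set per stop keeps each stop/route combination only once
    routes = defaultdict(set)
    for row in csv.DictReader(stop_times_file):
        route = route_of_trip.get(row["trip_id"])
        if route is not None:
            routes[row["stop_id"]].add(route)
    # Sort so the list reads "1, 2, 3" regardless of input order
    return {stop: ", ".join(sorted(r)) for stop, r in routes.items()}

# Tiny invented sample in GTFS column layout
trips = io.StringIO(
    "route_id,service_id,trip_id\n1,WKD,t1\n2,WKD,t2\n3,WKD,t3\n")
stop_times = io.StringIO(
    "trip_id,arrival_time,stop_id\n"
    "t1,08:00:00,S1\nt2,08:05:00,S1\nt1,08:10:00,S2\n"
    "t3,08:15:00,S1\nt2,09:00:00,S1\n")

result = routes_per_stop(trips, stop_times)
print(result)
```

With the real files you would pass open file handles for trips.txt and stop_times.txt instead of the `io.StringIO` samples. Reading half a million stop_times rows this way is slow but workable.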

Data Complications

I quickly figured out that there were two problems with this effort.  One was the complex nature of the New York City subway system itself.  The other had to do with data quality problems with the MTA’s stops.txt file.

Subway system complexities

When I joined my comma-delimited result with a geocoded map layer of subway stations, I saw what looked like substantial errors.  For example, the stations east of Utica Ave in Brooklyn on what the MTA map shows as the 3 line were labeled on my map with the 2, 3, 4, 5 trains stopping between Utica Ave and New Lots.  The 7th Ave local stations in Manhattan (such as 18th and 23rd streets) showed the 1 *and* 2 lines stopping there.  The was shown stopping at Jamaica Ave in Queens (but the MTA map only shows the stopping there).

Then I looked at other applications that used MTA’s GTFS data.  The Google Maps basemap matches the MTA’s map in terms of station labels (for example, the 23rd St stop on 7th Ave is labeled as “23rd St [1]”).  But clicking on Google’s individual station icons opens a popup window with the “additional” lines I mention above (see example below). OpenTripPlanner (which just launched last week) showed the same thing.

Then I read the fine print on the MTA subway map.   It only “depicts weekday service.”  So presumably the 2, 4, and 5 trains at New Lots in Brooklyn, for example, must represent weekend trips.  (And the 2 train, in the images above, would only be on the weekend.) To test that, I filtered the stop_times.txt file for only weekday trips (using all the trip_id’s containing “WKD”) and ran the COALESCE script against that filtered list.  The 2 at the 7th Ave local stations dropped out (it does indeed stop at local Manhattan stations only on the weekend).  But other anomalies remained.  For example, the 2, 4, and 5 trains were still shown stopping at New Lots.

Then I looked further at the MTA’s individual line maps (which were mentioned in the fine print on MTA’s subway map), as well as the agency’s summary service guide [PDF] for subways.  The fine print on *these* documents tells the story — using the New Lots example, the 2, 4, and 5 all have “some rush hour trips to/from New Lots Av, Brooklyn”.

Last I checked, rush hour is during the week.  So when MTA’s subway map shows that “weekday service” to New Lots Av is only provided by the 3 train (see below), it contradicts the MTA’s own more detailed service guide and line maps.

As it turns out, these 2, 4, and 5 train runs are a small part of weekday service to/from New Lots.  The combined stop_times.txt and trips.txt file reveals that there are 8 weekday 2 train trips stopping at New Lots, 58 4 train trips, and 6 5 train trips (compared with 262 3 train trips).

I’m not being critical here. This is what I mean by the complexity of the subway system.  There are many exceptions like this, where scheduling or routing needs dictate that some atypical trains stop at unexpected stations.  (For example, despite what the MTA map shows, the  makes some weekday stops in lower Manhattan below Canal St, the  makes some weekday stops at Steinway St and 46th St in Queens, and the  makes local stops at some point during the week between 59th St and 125 St in Manhattan.)*


*NB: I missed this in my earlier post — I made some observations about the lines/GTFS data showing routes that didn’t exist, but I had missed some of these complexities.  I’ve updated my earlier post explaining the situation.


These are not mistakes in the GTFS data, but they’re a very small portion of overall weekday service.  The MTA map handles these exceptions by generalizing for the purposes of clarity.  What is interesting to me is that other applications incorporate the exceptions at the risk of seeming like they’re showing a mistake.  So when OpenTripPlanner or Google displays the 2 making local stops on 7th Ave in Manhattan without qualifying it, I’d imagine most subway riders (at least those familiar with MTA’s map) would do a double-take.  Again, I’m not being critical, but to me this raises questions about using data “feeds” without a greater level of manual intervention to make the data more meaningful and present it in a way that’s more like what the riding public expects.

I thought I’d be able to easily omit the “anomaly” weekday trips/routes by selecting out weekday service with greater than a certain threshold of frequency.  That works for most instances, but setting it too high (even as high as 25 weekday trips) omits trips that should be included, such as the stops along the Z line and the A train along the Rockaway Shuttle line.

So I implemented a bit of a hack, as follows:

  • The MTA service guide shows rush hour service starting at 6:30am, and “evening” service extending to midnight.  So I queried out all weekday trips (“service_id” ending in ‘WKD’) with arrival times between ’06:30:00′ and ’23:59:00′.
  • After I concatenated stop_id and route_id from this selection, I grouped on this concatenated field and selected all entries where the record count was greater than 20 (this threshold removes the “…some rush hour trips to/from New Lots Av, Brooklyn” issue as well as the other weekday anomalies) OR where the concatenated stop_route field ends in ‘Z_WKD’ OR where the stop_route field begins with ‘H’ and ends with ‘A_WKD’.  I think this got them all.  If anyone goes through this crazy process independently and finds different, please let me know (!).  I saved the result as a “stops_route_list_wkd” file.
  • Then I selected all others, and saved this as a “stops_route_list_offhours” file.
  • Then I dropped the filters altogether and created a “stops_route_list_all” file.
  • I ran the COALESCE script against each of these three files and ran the SELECT statement I mentioned above [SELECT Distinct stop_id, dbo.RouteList (stop_id) AS RouteList FROM dbo.stops_route_list_*] to give me three separate lists of routes per stop.
  • I joined these with the geocoded “stops.txt” to create three separate route attribute fields that can be used for labeling (depending on what type of map you wanted to create — predominant weekday service, offhours service, or all service).
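The weekday selection in the steps above can be sketched roughly as follows in Python. This is an approximation, not the query I ran: I'm assuming the Z and H/A exceptions can be expressed directly on route_id and stop_id rather than on the concatenated field, that arrival times are zero-padded HH:MM:SS text, and the function name and sample stop IDs are invented.

```python
from collections import Counter

def predominant_weekday_pairs(rows, min_trips=20):
    """rows: iterable of (stop_id, route_id, service_id, arrival_time).

    Returns the set of (stop_id, route_id) pairs treated as predominant
    weekday service.
    """
    counts = Counter()
    for stop_id, route_id, service_id, arrival in rows:
        # Zero-padded HH:MM:SS strings compare correctly as plain text
        if service_id.endswith("WKD") and "06:30:00" <= arrival <= "23:59:00":
            counts[(stop_id, route_id)] += 1
    keep = set()
    for (stop_id, route_id), n in counts.items():
        # Assumed equivalents of the 'Z_WKD' and 'H...A_WKD' exceptions
        always = route_id == "Z" or (stop_id.startswith("H") and route_id == "A")
        if n > min_trips or always:
            keep.add((stop_id, route_id))
    return keep

# Invented sample rows; stop IDs here are placeholders, not checked
# against stops.txt
rows = (
    [("257", "3", "B_WKD", "08:00:00")] * 25    # frequent weekday service
    + [("257", "2", "B_WKD", "17:30:00")] * 8   # a few rush hour trips
    + [("H01", "A", "B_WKD", "09:00:00")] * 5   # kept by the H/A exception
    + [("M01", "Z", "B_WKD", "07:15:00")] * 3   # kept by the Z exception
    + [("257", "4", "B_SAT", "10:00:00")] * 40  # not weekday, ignored
)
kept = predominant_weekday_pairs(rows)
print(sorted(kept))
```

Everything in the full stop/route list that is not in `kept` would then go into the "offhours" list.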

This gives me the following table (excerpt):

Typos and more

Amazingly enough, the data hassles don’t stop there.

I found one geographic error in the stops.txt file, and numerous naming inconsistencies (and at least two misspellings) in the stop_name field.

The geographic error has to do with the two Cortlandt Street stations in lower Manhattan.  It appears that the stop IDs were switched in the GTFS data.  Stop ID 138 has the name “Cortland St – IRT”, but has lat/lon coordinates that place it on the BMT/Broadway line.  Stop ID R25 has the name “Cortlandt St – World Trade Center”, but has lat/lon coordinates that place it on the IRT/7th Ave line.  Here’s what it looks like when I map it in ArcMap:

Here’s how it’s shown on OpenTripPlanner:

… and:

For now I’ve switched the attributes for these two stops in the shapefile I’ve linked to at the end of this post, but hopefully MTA will correct this soon.

The naming inconsistencies were more perplexing.  Station names in the stops.txt file are all over the place — parentheses are sometimes included, sometimes not; dashes are used arbitrarily, 16 stops have leading spaces in the name, and there’s a confusing mix of UPPER/Proper/lower case text.

What’s worse, the naming “convention” (if you can call it that) in stops.txt is also inconsistent with MTA’s subway map, MTA’s file of station entrances/exits, and other applications such as Google Maps.  Most of the transit apps I’ve seen simply use the stops.txt station names verbatim, but below I summarize my methodology for cleaning this up.  Hopefully MTA will update its next iteration of GTFS data with something more consistent.

Here are some examples of these issues:

  • Stop IDs 626 and 627 (86th St and 77th St on the Lexington line) each have a leading space in the name, but the adjacent stops on the Lexington line are fine.
  • Stop ID B12 (“NINTH AVE (WEST END)-9 ave”) includes both AVE and ave.
  • All four of the 110th St stops in Manhattan (IDs 118, 227, 623, and A17) are listed as follows – these examples really take the cake:
    • 110TH STREET – BWAY – Cathedral Pkwy
    • Here’s how different these stops are named on MTA’s own map:

  • Sometimes street types use a mix of spellings, such as stop ID 112 (168TH STREET – BWAY- WASHINGTON HGTS) and A09 (168TH STREET – IND – WASHINGTON HEIGHTS).
  • I thought the two Dyckman St stops in upper Manhattan were good: stop ID 109, listed as “DYCKMAN ST. – 200 STREET”, and stop ID A03, listed as “DYCKMAN STREET (200 ST)”.

The misspellings I noticed were:

  • stop ID A32: “WEST 4 ST – UPPER LEVEL – WASHINTON SQ” (i.e., Washington is missing the “g”); and
  • stop ID 706: “103RD STREET – CORAON PLAZA” should be “Corona Plaza”.

As data problems go, this isn’t too bad, per se.  But it’s odd to me that there’s such a mix of different naming types, and that it’s so different from the MTA’s own map.   If the differences followed some set of rules or were otherwise there for a reason, I’d be more comfortable with it. But when I see data inconsistencies like this, I worry that larger issues are at play – such as data entry problems that make the whole thing suspect (or at least the whole list of station names).  For example, I can’t imagine how misspellings crept into the station names, except if the names were actually typed in manually into MTA’s GTFS file.  So much for a data “feed” that supposedly mirrors what MTA uses itself.

Regardless of why the problems exist, it would be good if MTA fixed them in the next iteration (or at least explained why they’re there).

Here’s what I did to fix the problems for now:

  • removed leading spaces;
  • converted all the stop_name values to UPPER CASE;
  • removed periods;
  • removed parentheses (and replaced each leading paren with a dash);
  • removed suffixes such as BWAY, LEXINGTON, LENOX, IND, IRT, 7 AV; and
  • fixed typos (‘BAIN BRIDGE’, ’9 ave’, ‘L. I.CITY’) and the misspellings.
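Here is a rough Python sketch of that kind of cleanup, in case it helps anyone doing the same thing with a later GTFS release. It is not the exact procedure I used (I did much of this by hand). The function name, the decision to treat every dash as a separator, and the rule of dropping a suffix only when it stands alone between dashes are all assumptions, and the typo table covers only the two misspellings listed above.

```python
import re

# Suffixes from the list above, dropped only when they stand alone
# between dashes (an assumption about where they appear)
SUFFIXES = {"BWAY", "LEXINGTON", "LENOX", "IND", "IRT", "7 AV"}
# Only the two misspellings with known corrections
TYPOS = {"WASHINTON": "WASHINGTON", "CORAON": "CORONA"}

def clean_stop_name(name):
    """Normalize one stop_name value (hypothetical helper)."""
    name = name.strip().upper()                        # spaces, mixed case
    name = name.replace(".", "")                       # periods
    name = name.replace("(", " - ").replace(")", " ")  # parens -> dash
    name = name.replace("\u2013", "-")                 # en dash -> hyphen
    # Split on dashes, tidy whitespace, drop empty parts and suffixes
    parts = [" ".join(p.split()) for p in name.split("-")]
    parts = [p for p in parts if p and p not in SUFFIXES]
    name = " - ".join(parts)
    for wrong, right in TYPOS.items():
        name = re.sub(rf"\b{wrong}\b", right, name)
    return name

# Examples taken from the stop names quoted in this post
# (\u2013 is the en dash used in those names)
examples = [
    " 86th St",
    "DYCKMAN STREET (200 ST)",
    "168TH STREET \u2013 BWAY- WASHINGTON HGTS",
    "WEST 4 ST \u2013 UPPER LEVEL \u2013 WASHINTON SQ",
    "103RD STREET \u2013 CORAON PLAZA",
]
cleaned = [clean_stop_name(n) for n in examples]
for before, after in zip(examples, cleaned):
    print(repr(before), "->", repr(after))
```

One side effect to be aware of: a name with a hyphen inside it would also be split and rejoined with spaces around the dash, which may or may not be what you want.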


I hope my GIS file of subway stations includes some enhancements over the raw GTFS data that will be useful to GIS practitioners and app developers – it includes fields that provide route IDs (based on predominant weekday service and “off hours” service), and cleans up all sorts of inconsistencies and typos in the station names.  It’s still not perfect, but I think it’s a good first step. Hopefully you can use it for your apps and projects.  Here is a link to a zipped version of the shapefile:

Note that I’ve left the route IDs in this file unchanged from the GTFS routes.txt file.  So my file includes routes such as “6X” and “FS” and “H”.  I thought it would be better to leave these as-is, and let users change them (or not) in their own applications.

I guess any standardized data system like GTFS that tries to make sense of a subway network as complicated as New York’s will have issues.  But I think for New York’s implementation of “GTFS” to really become a “feed”, there’s lots more work to be done.  Hopefully this post helps shine some light on ways to improve the data.

Btw, thanks to everyone for their comments and feedback on my earlier posts – at my blog and sent separately via email and Twitter.  I’m glad my efforts are helpful.

MTA subway data in GIS format

UPDATE Oct. 5, 2015: See this post for updates to the GIS files described below, including the new Hudson Yards subway station and 7 line extension.

As promised, I’ve created an updated GIS data set of subway routes in New York based on MTA’s latest GTFS data, which I’ve posted below for anyone to use.  (I’ve also converted the latest MTA GTFS data to GIS format for NYC Transit bus routes, following up on my earlier post this spring.)  Scroll to the end of this post for the links.

I hope my effort provides a template for creating a map layer for apps and/or printed maps that approximates the line symbology on MTA’s map, but improves on this map in several ways — the GIS version is spatially precise, scalable, and may even look better than what Google uses for its transit layer in New York City.  The images below show the map zoomed out and zoomed in, and the post below explains how I did it.

Going forward, hopefully the MTA itself will provide subway route GIS data in the format I’ve described below (or something similar) alongside the GTFS data.  Any feedback or suggestions for improvement will be much appreciated.


After reviewing the GTFS data files in detail, it became clear that GTFS is not necessarily applicable for displaying transit lines on a map and/or analyzing the spatial patterns with GIS.  At least that seems a fair characterization regarding the GTFS version of the city’s subway system.

Even though the GTFS files include a “shapes.txt” file for subways (the spec says this is for “drawing lines on a map to represent a transit organization’s routes”), this is only helpful for basic line representations.  This seems to work fine for bus routes.  But for subways, either the GTFS structure or MTA’s implementation of it poses challenges for creating a map layer of subway lines.

New York’s subway system includes local and express routes that are composed of inbound and outbound trips along the same line, some of which may start or end at different stations.  And there are “skip-stop” trains and “trunk” lines (where multiple routes run on the same set of tracks, such as the E, F, M, and R in Queens from Jackson Heights to Forest Hills).  But the GTFS “shapes” data from MTA only provide a partial representation of this complexity.

To attach route IDs to the “shapes” file, shapes.txt needs to be linked with trips.txt based on shape_id.  But doing so causes the 1, G, and latest version of the M line to drop out, because the trips.txt file does not include any trip entries for these routes.  Also, for some reason, it results in the inclusion of lines that aren’t used anymore (at least for passenger trips, as far as I know).

This is just an early iteration of MTA publishing its GTFS data, so I’m not surprised these limitations exist.  Until these issues are fixed, we have to rely on workarounds.  For example, the MTA has provided a separate shapefile for the 1 and the G (see earlier discussions at the MTA Developer Resources listserv).  According to MTA,

We [MTA] do not have shape data for these lines because of changes in their station configurations have occurred since we lost the staff member who had created the data. We have not had funding to replace him and update the data from 2008. We can provide the data next week with a hand-done solution, and/or better data at some later time, when we are able to acquire the staffing to do so.

This undermines the idea of using GTFS as a “feed” (as its name implies) for automatically displaying subway lines on a map, but hopefully the process will be more seamless as the issues are worked out.

But the 1 and G lines are not really missing from the GTFS data.  The GTFS “shapes” file on its own (without filtering it based on the “trips” file) includes line segments for virtually the entire subway system.  It’s just a question of being creative with combining the trip_id and shape_id fields from the trips.txt file to extract the appropriate geometry for the routes in question.  For example, the correct G shape is certainly there; it’s denoted by the “G..N05R” or “G..S05R” shape IDs.  It just so happens that there are no records in the trips.txt file with these shape_id values.  But it’s easy enough to create a new “route” field in the shapes file and populate it with a combination of values from the “routes.txt” file and manual entries for the lines that don’t seem to exist (such as the 1 and G).
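To find which shapes have no matching trips, something like the following Python sketch would do. It's an illustration rather than what I actually ran in ArcGIS. The function names are made up, the sample rows are invented (only the “G..N05R” ID comes from the real data), and reading the route from the text before the first dot in shape_id is my assumption about MTA's naming pattern, not something the GTFS spec promises.

```python
import csv
import io

def shapes_without_trips(shapes_file, trips_file):
    """Return sorted shape_ids in shapes.txt that no trip refers to."""
    used = {row["shape_id"] for row in csv.DictReader(trips_file)}
    all_shapes = {row["shape_id"] for row in csv.DictReader(shapes_file)}
    return sorted(all_shapes - used)

def route_from_shape_id(shape_id):
    """Guess a route from a shape_id such as 'G..N05R' (assumed pattern)."""
    return shape_id.split(".", 1)[0]

# Invented sample rows in GTFS column layout
shapes = io.StringIO(
    "shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence\n"
    "G..N05R,40.0,-73.0,0\nG..N05R,40.1,-73.0,1\n"
    "A..N04R,40.0,-74.0,0\n")
trips = io.StringIO(
    "route_id,service_id,trip_id,shape_id\n"
    "A,WKD,t1,A..N04R\n")

orphans = shapes_without_trips(shapes, trips)
manual_routes = {s: route_from_shape_id(s) for s in orphans}
print(manual_routes)
```

The resulting dictionary is the kind of manual “route” entry described above, to be reviewed by eye before it goes into the shapes layer.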

Other issues with the shapes.txt file after filtering with the trips.txt file are that the line terminates at 57th Street/7th Avenue and the GTFS data includes shapes that show the running to Jamaica-179th St, the running to New Lots Ave on the 3/4 line, the extending to New Lots, and the running to New Lots.  As far as I can tell, these routes do not exist for the riding public.  Perhaps these are artifacts of older routing schemes, but it makes for an inadequate solution for mapping.  I’m curious how the automated routing and scheduling apps deal with this.


UPDATE 7/20/10

After I wrote the preceding paragraph, I did quite a bit more digging into the subway GTFS data for a post at my blog about subway stations.  I realized that my points above about the and lines were wrong.  There are, in fact, stops that these trains make at the stations I mention above.  There aren’t many of them, but they exist. My post about station data explains this more fully.  So for these routes at least, the shapes.txt file is ok.


A bigger problem, though, is that the latest version of the M line is missing from the shapes data, and can’t be created from the approach described above for the 1 and G.  The images below highlight the challenge – the area in question is circled in blue on both images.  In the old map, the M runs along what is now the J-Z line, and the orange F-V and B-D lines cross the J-M-Z line.  Unfortunately this old routing is what the latest GTFS geometry follows.

But with the latest service changes, the new M line comes in from Brooklyn and then goes north to meet the B-D line. In the latest shapes.txt file, there is no such geometry for the M.  The geometry follows the old J-M-Z line with no obvious shape that follows the M’s new northward jog to meet the B-D line.

Therefore, I created a new line segment for the M, combining segments from the old M and V lines (shape_id values of “M..N89R” and “V..N01R”), along with an arc connecting the two, using ESRI’s ArcGIS editing tools.


Once I had the updated set of “shapes” from GTFS, my goal was to somehow convert this data into a GIS version of the MTA’s subway lines in a way that could be replicated (and perhaps integrated back into GTFS format) and also easily symbolized to show separate lines along trunk routes.

As far as I know, to the extent anyone had a GIS dataset of subway routes prior to GTFS (such as this one we had created for the OASIS website by digitizing the MTA’s subway map), the only way to display separate trunk lines was to manually edit the geometry of the GIS line segments along a trunk route by clipping the line and moving it parallel to the trunk line, so it would show up as a distinct line symbol.  Obviously this has problems — the manual work involved is tedious, imprecise, hard to replicate, and it doesn’t scale well — it might work at a certain zoom level, but then zooming in would show the parallel lines farther apart and zooming out would show them merged together — as illustrated by the images below from the NYC Citymap website, going from a wide zoom to a closer zoom:

(The Citymap site is just one example; you can see a similar situation on — as you zoom in on the map at this link, you’ll see the and lines become farther apart.)

Divisions and Lines

I remembered that NYC Transit uses “division” and “line” designations that might be helpful in distinguishing the segments. The divisions are a throwback to when the subway system was really three separate systems — the IRT, IND, and BMT. But the line designations are based on more or less current track arrangements (and you can see some of these on the current subway map – see excerpt below).

For example, the movie “The Taking of Pelham One Two Three” refers to the train running on the Pelham Line, leaving the Pelham Bay Park station at 1:23.  Wikipedia has lots of information about the line designations, such as the IND 6th Avenue line or the BMT Nassau Street line.

But how to assign these to the shapes.txt file? The line IDs/names are not included as part of GTFS, and I’ve not seen this information provided anywhere else (publicly anyway).

Station entrance/exit data provides the missing link

Then on July 1 the MTA released a file listing subway entrances and exits with latitude/longitude for each one (the file was updated July 7 to fix some issues in the earlier data). Useful in its own right, the file includes the station name for each entrance/exit along with its division and line. Neat! The entrance/exit points don’t necessarily overlap or intersect the line shapes, so I wouldn’t be able to automatically assign the divisions and lines to the shapes using GIS, but there are only 37 unique lines based on the entrance/exit data so it wouldn’t be that hard or time-consuming to do it manually.

My approach was to create a thematic map of the entrances color-coded by line designation, overlay the GTFS shapes file of subway routes, and then edit the shapes file by splitting the segments where each set of color-coded entrances ended and adding the corresponding line attribute to these new segments.  The image below illustrates the approach.

In other words, instead of a single shape representing the 2 line, I created six non-overlapping segments to represent the entire 2 train route (along the 7th Ave-Bway, Clark Street, Eastern Parkway, Lenox, Nostrand, and White Plains lines).  I used the ArcGIS “Split Tool” quite a bit, and ended up with a shapefile with 80 unique shapes (including the AirTrain, which is included in GTFS but isn’t managed by MTA so likely doesn’t have an MTA “line” designation).  The attributes from the new file look like this:

This was a manual process based on visual inspection of the line segments, so I’m sure error has crept in.  Also, the way I did it, I allowed for some exceptions.  I didn’t rigorously create new segments, for example, along what appears to be a trunk line in Manhattan where the IND 6th Avenue and 8th Avenue lines meet at the West 4th Street station. And I probably didn’t handle lines travelling over bridges or through tunnels as well as I could have.  And the 5 route along the IRT White Plains line extends from Nereid Avenue to 138th Street/Grand Concourse, but the #5 in the Bronx that just runs during rush hour goes from East 180th Street to Nereid Ave (so on my map the dashed line symbology extends too far south).

Overall, though, I think it works well — it’s pretty good for a first pass.

ArcGIS caused import hassles …

Btw, I should point out that though ArcGIS’s editing tools were great for splitting and re-combining the line segments, ArcGIS misinterpreted important fields when importing the GTFS text files.  Fields that were text (such as “route_id” in the trips.txt file) were imported as numeric, preventing an accurate join.  I needed to use another program (I used SPSS) to save the trips.txt file as a DBF which preserved the text format of the field.  (I had tried using Excel to convert from TXT to CSV and also to XLS, but that also forced the text field to convert to numeric.)
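If you'd rather avoid the type-guessing problem altogether, one option is to read the text files with a tool that doesn't guess. As a small illustration (not what I did at the time), Python's standard csv module returns every field as text, so route IDs like “1”, “6X”, and “FS” all stay strings and join cleanly. The sample rows below are invented.

```python
import csv
import io

# Invented sample in trips.txt column layout
sample = io.StringIO(
    "route_id,service_id,trip_id\n"
    "1,WKD,t1\n"
    "6X,WKD,t2\n"
    "FS,WKD,t3\n")

rows = list(csv.DictReader(sample))
route_ids = [row["route_id"] for row in rows]

# csv never converts values, so "1" stays text just like "6X"
print(route_ids)
print([type(r).__name__ for r in route_ids])
```

You would still need to get the result into a format your GIS reads with the field typed as text, which is the step I used SPSS and DBF for.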

… but ArcGIS provided invaluable cartography tools

Next step was to create the symbology.  I relied on two ArcGIS features to display multiple lines along a trunk route as discrete line symbols: the “cartographic line symbol” feature, and “symbol levels”.  The cartographic line symbol component of ArcGIS’s Symbol Property Editor, among other things, enables you to attach an offset value to the line symbol.  See screen shot below.  The great thing here is that the offset is relative to the zoom level of the map: as you zoom in or out, the line symbols do not merge together or move further apart, thereby solving the problem of parallel copies of line segments.

The Cartographic Line Symbol tool also allows you to create a dashed line symbol, which I used for the Rockaway Park Shuttle and the rush hour extension of the 5 train in the Bronx.

The “symbol levels” feature enabled me to ensure that line segments weren’t inadvertently masked by others along the same geometry. For example, simply offsetting the routes of one color from the routes of another color running along the IND 8th Avenue line may result in two parallel orange lines, rather than a blue and an orange line. Setting a priority symbol level ensures that the blue and orange lines will run in parallel.

The resulting trunk line symbology looks good whether you’re zoomed out …

… or zoomed in close:

Soon we’ll have the updated lines and symbology on the OASIS mapping site.

In order to recreate the map symbology, I’ve preserved the color scheme in an ESRI layer (.lyr) file, linked below along with the actual shapefile.  If you’re using a GIS that doesn’t use layer files you’ll need to redo the symbology, but at least you can use the attributes to do so.

(The layer file includes subway route labels that were inspired by an approach provided by ESRI’s New York City office that we first applied for the OASIS site — using the subway route icons a la the MTA subway map. I’ve streamlined it a bit here, and it’s easy to modify further either with ArcGIS or another GIS package. I’ve included a basic MXD file that preserves the labeling.  The MXD uses ESRI’s Maplex labeling engine, but the labels will work with ESRI’s standard labeling engine as well.)

The one missing component to this data is a layer of transfers between subway stations.  I know this has been discussed on the MTA Developers Resource list, but this will have to wait till a more robust data set is available (or I or others have the time to put one together).

Links to the data

Here’s the GIS subway data in shapefile format (zipped):

If you use the data and layer file (which I hope you do), please let me know how it works out.  I’m not including any kind of Creative Commons licensing, but I’d appreciate it if you could reference the “CUNY Mapping Service at the Center for Urban Research” if you use the data and especially the layer symbology in any printed maps or online applications.  Thanks!

Also, here are the post-June 27 service change bus routes in shapefile format (zipped):


Better than Google Maps cartography?

I definitely wanted to compare my GIS version of the subway GTFS data with Google Maps, which presumably uses the GTFS data not only for transit directions but also for the basemap itself.   Two things surprised me.  One was that, as of today (July 7) almost two weeks after the MTA’s service changes took effect, Google Maps still shows old subway routes and station information.

The map below, for example, still displays an old route (see the 23rd St/6th Avenue station) and a discontinued route (see the 23rd St and 28th St Broadway line stations).

Also, the subway lines on Google Maps were choppy and not as smooth as the GTFS-derived GIS lines. The images below compare the two in lower Manhattan.

I don’t think it’s nitpicking to point out the difference. One important aspect of the MTA’s GTFS data from a cartographic perspective is the high-quality route geometry.  It makes it that much more useful not only for good map development, but also for spatial analysis and alignment with other NYC GIS data layers.  Kudos to MTA for providing it.  I’m surprised Google apparently doesn’t use GTFS for their basemap (hopefully they’ll correct me if I’m wrong).

Going forward

For my purposes (and I think I’m far from alone here), I’m more interested in displaying the subway lines in a map layout than developing an application that provides routing and scheduling. Whether or not I use the data for spatial analysis, I’d like to have a subway layer for use in a GIS or any other application that needs the symbology of MTA’s printed map but is more spatially precise than MTA’s map and not as fine-grained as individual trips.

The GTFS format is great for all the web and mobile applications that are being developed. But for local planning work by Community Boards, students, the media, public officials, etc., we want to see the subway lines on a map and analyze them spatially: visualizing and understanding the relationships of nearby land uses, demographics, etc., as well as being able to monitor maintenance and operations trends, determine who represents each line when service changes are being proposed, and more. So hopefully MTA will see fit to provide subway route data in a systematic way so we can integrate it easily into our maps.

It’s likely that NYC Transit maintains its subway line/route data in a structure similar to the one I’ve described above in GIS format, either for planning/modeling purposes or for other mapping needs. Ideally it’s in a format that allows for an automated, rules-driven way of displaying the routes by division/line so changes are handled as seamlessly as possible. In other words, it would be great if MTA could provide the subway data in a way that doesn’t require the additional staff resources that are involved in converting the scheduling/routing data to GTFS format. I’m not expecting anything as simple as “just hitting the export button,” but hopefully something close :). And since subway routing doesn’t change very often (certainly not as frequently as schedules), this should be much less of a burden on the agency than the work involved in providing the GTFS data.

I look forward to continuing the dialog.

MTA data opened up; provided here in GIS format

I wasn’t able to attend this week’s “MTA Unconference for Developers,” but it sounds like it was a great event. My colleague Dave Burgoon sat in, and I followed the Twitter stream and read several of the follow-up posts.

The shift in attitude and action at MTA to open up access to their data and invite developers and others to work with them to use it is heartening.  I hope it spurs other agencies to do the same.   (In the wake of MTA’s new-found openness, it’s especially mind-boggling that, for example, the NYC Dept of City Planning still requires a license fee for its real property data, and Nassau County and the Suffolk County Real Property office do the same.)

The raw data

The first thing I did as I was reading the conference tweets was to look at the new data the MTA has released.  (MTA will keep you updated on data changes via email if you subscribe here.)  Most of it is in the GTFS format (formerly “Google Transit Feed Specification”, now the “General TFS” to get away from any corporation-specific connotations).  At first that was disconcerting — where were the GIS files?  I wanted pre-set shapefiles or KML files, but nothing was listed.

Of course, someone had already thought of that :).  Data in GTFS format includes latitude/longitude, so that was encouraging.  And after digging a bit further, there are open source tools for exporting GTFS data into KML format, and then importing into other programs (such as ArcGIS) for mapping and spatial analysis.  Now that I’ve worked with the data a bit, it all makes sense — the GTFS format gives you flexibility to import the data and analyze it however you’d like, with whatever software you’d like.  And of course it’s structured to facilitate mapping — silly me, Google wouldn’t create a data format that couldn’t be easily integrated with Google Maps, etc.

But I’m more familiar with ArcGIS than the TransitFeedDistribution tools that convert to KML etc.  So instead I’ve created shapefiles in ArcGIS of some of MTA’s key data sets.  I’ve posted links to the shapefiles below — feel free to use them however you’d like.  I’ve added some notes on that process. And here’s a map of one of those shapefiles — bus routes in Brooklyn:

The context

To put this in some context, it’s amazing to me that this data is now publicly and easily available. I’ve been using GIS professionally for almost 20 years, and I think it’s safe to say that those of us working with GIS in New York have grown weary of fighting to obtain data that you’d think would be commonplace, such as bus routes, subway routes, commuter rail lines, and related usage and performance statistics. When I directed the Community Mapping Assistance Project at NYPIRG, or more recently with the CUNY Graduate Center, clients and project partners would ask us to add bus routes to their maps, or to analyze bus transit options, and we’d always have the same answer: the MTA refuses to provide access to the data, so you’re out of luck. (Or, maybe I was able to find someone years ago who “unofficially” slipped me a floppy disk with bus routes in TransCad format, but now it’s out of date and I can’t get a newer version.) Of course I’m not the only one who wants to map bus routes and other transit data, so the MTA’s new data access is great news for many people and institutions, not to mention the riding public.

My methodology

To create the shapefiles, here’s what I did:

  • downloaded the .txt files from MTA’s website;
  • opened these in Notepad (or Excel or SPSS, for the larger files) to get a sense of the file content and relational structure; and
  • then added them to an ArcGIS data frame. 

Geographically speaking, the data are either points (i.e., subway, train, or bus stops) or lines (routes).

  • For the points I used the “Display XY Data” function to create a point representation of the stops (see screenshot).
  • I assigned the “North American Datum of 1983” (NAD83) for each file’s spatial reference, but did not project the data.  That way anyone accessing the shapefiles can project them as needed.  (One exception is the NYCT bus data — I projected the stops and routes files using the New York State Plane Long Island (feet) coordinate system.  If anyone needs these files unprojected, just let me know.)
  • The “stops.txt” files do not include route information, only “stop IDs” that can be associated with other MTA files to obtain route names and descriptions.  Any given stop can be associated with multiple routes, so I decided not to join the route data to the stops – that can be done in your application as needed. 
  • To finish, I exported each stops file and renamed it with the category and date (such as “nycbusstops_100401”). 
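
For readers without ArcGIS, the “Display XY Data” step can be approximated in a few lines of Python. This is only a sketch under stated assumptions: the column names follow the GTFS stops.txt layout (stop_id, stop_name, stop_lat, stop_lon), the sample rows are made up, and the output is a list of GeoJSON-style point features rather than a shapefile.

```python
import csv
import io

# Made-up rows in the layout of a GTFS stops.txt file (illustrative only).
SAMPLE_STOPS = """stop_id,stop_name,stop_lat,stop_lon
101,Example St,40.7000,-73.9900
102,Sample Av,40.7100,-73.9800
"""

def stops_to_points(handle):
    """Build one GeoJSON-style point feature per stop; stop_id stays text."""
    features = []
    for row in csv.DictReader(handle):
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "Point",
                # GeoJSON order is (longitude, latitude), i.e. x then y.
                "coordinates": [float(row["stop_lon"]), float(row["stop_lat"])],
            },
            "properties": {"stop_id": row["stop_id"], "stop_name": row["stop_name"]},
        })
    return features

points = stops_to_points(io.StringIO(SAMPLE_STOPS))
print(len(points), points[0]["geometry"]["coordinates"])
```

As in the steps above, no route information is attached to the points; the stop IDs are kept so that routes can be joined later as needed.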

Here are the shapefiles for the stops (provided in zip file format):

For the routes, the methodology was slightly different:

  • Although the GTFS includes a “routes” file, this just includes the route ID, route name, and other descriptive information — no geometry.  Instead, there’s a separate “shapes” file that includes the latitude/longitude for each point, or node, along the route.  In ArcGIS, the trick is to create a point representation of these nodes and then literally “connect the dots” to create the corresponding line representations.
  • I used the “Display XY Data” function to create the points, and then used the nifty “ET GeoWizards” toolkit to convert the points to polylines.  (ET GeoWizards is a “collection of powerful data manipulation and topology creation functions for ArcGIS”.  It’s really great, and many of the tools are free, with additional functionality for a modest fee.)
  • In the conversion process, ET GeoWizards uses the “shape ID” field, the lat/lon fields, and a “shape_pt_sequence” field to determine which points are connected together to draw the lines properly. 
  • I exported each “shapes” file to native shapefile format, but I wasn’t done.  The “shapes” files don’t include any route information.  The “trips” files contain “shape ID” and “route ID” fields, which provide the linkage between the “shapes” files and the “routes” files.  (For the LI Bus routes, I also copied the route info and URLs from the MTA website and reformatted that to join to the LI Bus shapes file.)
  • After joining the data, a last step was needed. The shapes file provides a separate shape, or line, for each type of route (such as “inbound” and “outbound”) for each actual route (the B3, the M103, etc). In order to create a GIS layer that includes individual features for each route, I needed to collapse the data. (The LI Bus file, for example, includes 2,048 discrete shapes but only 104 routes, and even this includes duplicates; when it’s finally pared down, there are only 59 actual LI Bus routes.)
  • I used the “Dissolve” tool in ArcToolbox to collapse the shapes files with the joined route information. Then I exported these files and renamed them with the category and date (such as “libusroutes_100308”).
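
The route steps above can be sketched in plain Python as well. This is an illustration, not a stand-in for ET GeoWizards or the Dissolve tool: it assumes the GTFS column names for shapes.txt and trips.txt, uses made-up sample rows, assumes each shape belongs to a single route, and “dissolves” simply by grouping every line of a route into one multi-line list.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the layout of GTFS shapes.txt and trips.txt (illustrative only).
SAMPLE_SHAPES = """shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence
s1,40.70,-73.99,2
s1,40.69,-74.00,1
s2,40.70,-73.99,1
s2,40.71,-73.98,2
"""
SAMPLE_TRIPS = """route_id,trip_id,shape_id
B3,t1,s1
B3,t2,s2
B3,t3,s1
"""

def build_polylines(shapes_handle):
    """Group nodes by shape_id and order them by shape_pt_sequence (as a number)."""
    nodes = defaultdict(list)
    for row in csv.DictReader(shapes_handle):
        nodes[row["shape_id"]].append(
            (int(row["shape_pt_sequence"]),
             float(row["shape_pt_lon"]), float(row["shape_pt_lat"]))
        )
    return {sid: [(lon, lat) for _, lon, lat in sorted(pts)]
            for sid, pts in nodes.items()}

def shape_to_route(trips_handle):
    """Use trips.txt as the link table from shape_id to route_id."""
    return {row["shape_id"]: row["route_id"]
            for row in csv.DictReader(trips_handle) if row["shape_id"]}

def dissolve_by_route(polylines, links):
    """Collect every shape of a route into one multi-line feature."""
    routes = defaultdict(list)
    for sid, line in sorted(polylines.items()):
        if sid in links:
            routes[links[sid]].append(line)
    return dict(routes)

polylines = build_polylines(io.StringIO(SAMPLE_SHAPES))
links = shape_to_route(io.StringIO(SAMPLE_TRIPS))
routes = dissolve_by_route(polylines, links)
print({route: len(lines) for route, lines in routes.items()})  # {'B3': 2}
```

Note that the sequence field is converted to an integer before sorting; sorted as text, node 10 would come before node 2 and the line would zigzag.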

Perhaps there’s an easier way of doing all this with the TransitFeedDistribution tools, but for my purposes the ArcGIS tools worked just fine.  Here are the shapefiles of the routes (also in zip file format):

  • Long Island Bus
  • LIRR (Important Note: the MTA didn’t include a “shapes.txt” file for the Long Island Rail Road, so I couldn’t create a shapefile of train routes. But our mapping service at CUNY already has that.  Therefore, the LIRR link is not from the MTA data, but from an earlier shapefile directly from LIRR.)
  • Metro North
  • NYCT Bus (I provide two files for NYCT bus routes – one called “grouped” which includes 248 features, one for each bus route; and a second called “tripheadinfo” which includes “trip headsign” text — this includes duplicate line features [for a total of 733 features] because buses on a given route may be travelling to different end points.)


Note that I have not provided route files for NYCT subways or “bus company” routes.  The shape_ids in the trips.txt file for the subways were mostly NULL, so I wasn’t able to link the subway shapes.txt file via the trips.txt and routes.txt files to add the route names.  But, I already have a shapefile of subway routes, which you can download here.
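
If you want to check a feed for this problem before attempting the join, a small Python sketch like the one below counts the trips that lack a usable shape ID. The sample rows are made up, and treating both an empty value and the literal text “NULL” as missing is an assumption about how the gaps are encoded.

```python
import csv
import io

# Made-up rows in the layout of a GTFS trips.txt file (illustrative only).
SAMPLE_TRIPS = """route_id,trip_id,shape_id
A,t1,
A,t2,NULL
C,t3,shape_c
"""

def count_missing_shape_ids(handle, null_markers=("", "NULL")):
    """Return (missing, total) counts of trips with no usable shape_id."""
    missing = total = 0
    for row in csv.DictReader(handle):
        total += 1
        if (row.get("shape_id") or "").strip() in null_markers:
            missing += 1
    return missing, total

missing, total = count_missing_shape_ids(io.StringIO(SAMPLE_TRIPS))
print(f"{missing} of {total} trips have no shape_id")  # 2 of 3 trips have no shape_id
```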

Note also that the shapefiles of routes include only the route geometry with some basic attributes (route names, and maybe MTA URLs). I did not attach any of the scheduling or performance data that MTA has also provided. That data is part of the rest of the MTA’s GTFS data feed if you want to link it yourself to the route shapefiles.

For our own part, we’ll be adding the bus routes shortly to our mapping sites, including the Long Island Index site, and using them for other mapping work here at the CUNY Graduate Center.

Some other issues I encountered

Long Island bus systems: as far as I can tell, MTA’s data does not include bus stop or route information for buses in the City of Long Beach (Nassau County), the Huntington Area Rapid Transit (HART) system in Suffolk County, or the Suffolk County bus system itself.  So the LI Bus files above do not provide a comprehensive set of GIS files for bus stops and routes throughout Long Island.  In the past we’ve cobbled this together from various sources, but if anyone has up-to-date files for these areas, I’d love to hear about them.

“Landmarks”. The MTA’s “Bus Company” files include a “landmarks.txt” file. Presumably this represents easily recognized local features, but it’s impossible to tell for sure without a description from MTA. Also, I’m not sure why the file is included with the “bus company” data and not the other categories (or on its own). The file includes a “Type” field but no description of what the type codes mean. Some of them seem obvious (ES=elementary school?), but others are opaque (the landmark called “2 Bay Club Dr” has a type code of “AH” – what does that mean?). But in case you want to use it, here it is (at your own risk!).

Lack of metadata.  Although the GTFS website provides general descriptions of field names and data types, I wish there were better metadata from MTA directly — for example, it would be helpful to know how the lat/lon data were generated (i.e., what basemap was used, what scale is the data best viewed at).  And what’s the difference between the stops in the “Bus Company” file and the “New York City Transit – Bus” file?

But these are minor things. Overall, it’s a huge step that MTA has opened its data doors. Kudos to the new MTA leadership and everyone else who nudged (or aggressively pushed) them along the way. I’m looking forward to great apps coming out of this, and to other agencies following suit.