OpenStreetMap

At long last, I’ve done a complete pass over municipal and CDP boundaries in New York State. Barring errors and omissions (I daresay there must be some) every incorporated community and every CDP in the state has had its border checked against NYSGIS Civil Division Boundaries and TIGER/Line 2021 respectively, almost always resolving conflicts in favour of the former. All have place=* nodes representing them, with the node a label member of the boundary relation.

Populations are updated as of the 2020 Census. GNIS, FIPS, NYS SWIS, Wikipedia and Wikidata links are provided.

Most of the remaining work that I’d have to do before I consider the job to be done has to do with the tagging on the place=* nodes. Right now, they’re a hodgepodge. Most of them came in from the TIGER import of 2008 with place=* representing their form of government. This is NOT an indication of the significance of the place. Brentwood, Long Island, a bustling community of over 60,000 souls, is tagged place=hamlet because it does not have home rule. Geneva, a sleepy lakefront village of 3400 inhabitants or so, is tagged place=city because it has a city charter.

For a first stratification, I’d propose simple thresholding on population:

  • City: At least 50000 inhabitants.

    This would encompass New York, Buffalo, Yonkers, Rochester, Syracuse, Albany, Schenectady, Utica, White Plains and Troy. The largest communities not to make the cut would be Niagara Falls and Binghamton. The ‘city’ tag would also fall on the suburban communities of Ramapo, Amherst, New Rochelle, Cheektowaga, Mount Vernon, Brentwood, Clay, Hempstead, Town of Tonawanda, Levittown, and Irondequoit.

  • Town: >4800 inhabitants.

    This was a somewhat arbitrary cutoff. I wanted it to include Saranac Lake (pop. 4887) because that community has the only hospital for many miles around, and has an airport with scheduled, albeit infrequent, service. The threshold could be set higher if the manual work of identifying the sites of such facilities as hospitals, universities, airports, major markets, and so on were to be attempted, but I’d consider that to be Out of Scope.

  • Village: >1000 inhabitants.

    Totally arbitrary; there’s a long tail and you have to cut it off somewhere.

  • Hamlet: Smaller.
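
The thresholds above can be sketched as a small function. This is only an illustration of the proposed cutoffs; the name place_from_population is made up for this example and is not part of JOSM or any other OSM tool.

```python
def place_from_population(population: int) -> str:
    """Map a census population to the place=* value proposed above."""
    if population >= 50000:   # City: at least 50000 inhabitants
        return "city"
    if population > 4800:     # Town: >4800 inhabitants
        return "town"
    if population > 1000:     # Village: >1000 inhabitants
        return "village"
    return "hamlet"           # Hamlet: smaller


# Figures quoted in this post
print(place_from_population(60000))  # Brentwood, "over 60,000" -> city
print(place_from_population(4887))   # Saranac Lake -> town
```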

There are some tagging anomalies that also need attention.

  1. For townships that didn’t have an identifiable population center with the same name as the township, I reimported label nodes from GNIS. I tagged these with not:place=town place=region to indicate the fact. I selected region because it was available as a JOSM preset, but I now realize that the Wiki mentions place=municipality, and that seems to be a better fit. I’ll make this change as well.

  2. The only correct use of place=suburb among the objects I’ve examined is on the five boroughs of New York City, which fit the OSM definition. There are other communities that are mistagged place=suburb because they are near to a major city, but that’s not correct tagging.

  3. CDP’s that don’t correspond to identifiable unincorporated communities (for instance, the ones that represent resident university campuses) are tagged place=locality and this should most likely be left alone. CDP’s that represent portions of a city or surround subdivisions, I’ve retagged place=neighbourhood and these too should be left alone.

  4. Somewhat controversially, I’ve left boundaries of most CDP’s as boundary=administrative. I know for certain that the ones in Nassau County, at the very least, actually are administrative subdivisions without home rule - the towns of Hempstead, North Hempstead, and Oyster Bay all designate hamlets, and often promulgate things like parking regulations and zoning ordinances by calling out the hamlets by name rather than repeating the boundaries in each piece of legislation. I figured that in doubtful cases, it’s better to show the boundaries than to hide them.

  5. Even more controversially, most incorporated communities have an office=government node taking the administrative role, and showing the location of the town administration (the town hall or equivalent) and contact information for general inquiries (usually the town clerk’s office). This is a total abuse of the tag - it’s supposed to identify the capitAl, not the capitOl. Nevertheless, it provides useful information, and I believe that instead of deleting the relation members wholesale, it would probably be better to rename the role.
    Does anyone think it would be worthwhile to work up a proposal for a seat role (or something similar - the Naming of Names is an area that I try very hard to steer clear of)?

Discussion

Comment from Minh Nguyen on 24 August 2022 at 10:32

The thought you’re putting into this boundary mapping and cleanup effort is setting a great example for us to follow in other states that have their own vagaries.

This was a somewhat arbitrary cutoff. I wanted it to include Saranac Lake (pop. 4887) because that community has the only hospital for many miles around, and has an airport with scheduled, albeit infrequent, service. The threshold could be set higher if the manual work of identifying the sites of such facilities as hospitals, universities, airports, major markets, and so on were to be attempted, but I’d consider that to be Out of Scope.

You’ve just justified a one-off exception for Saranac Lake, which would allow you to set a rounder overall threshold that doesn’t sound so arbitrary. Some mappers may be inclined to second-guess or ignore arbitrary-sounding rules.

Somewhat controversially, I’ve left boundaries of most CDP’s as boundary=administrative. I know for certain that the ones in Nassau County, at the very least, actually are administrative subdivisions without home rule - the towns of Hempstead, North Hempstead, and Oyster Bay all designate hamlets, and often promulgate things like parking regulations and zoning ordinances by calling out the hamlets by name rather than repeating the boundaries in each piece of legislation. I figured that in doubtful cases, it’s better to show the boundaries than to hide them.

It sounds like these particular imported CDPs happen to coincide with real places that should have been mapped as administrative areas but, like minor civil divisions, were omitted from the TIGER boundary import. You may want to add border_type=* so that someone doesn’t come along, see “CDP” inside tiger:NAMELSAD, and think it only represents a CDP and therefore should be retagged as boundary=census.

This is a total abuse of the tag - it’s supposed to identify the capitAl, not the capitOl. Nevertheless, it provides useful information, and I believe that instead of deleting the relation members wholesale, it would probably be better to rename the role.

I’ve been using operator and operator:wikidata to associate a government’s headquarters with the boundary relation representing the government’s jurisdictional area. The operator:wikidata tag of the amenity=townhall or office=government would match the wikidata tag of the boundary.[1] I find this approach to be more flexible in cases where a government’s offices aren’t centralized in a single building, typical of county governments in some states. It’s also consistent with tagging for company headquarters, park offices, university administrative offices, etc.

If the office must be a member of the boundary relation, then a seat role would be an improvement. But this comes uncomfortably close to site relation semantics, for something that isn’t as compact as a site.

  1. More precisely, there would be separate Wikidata items for the place versus its government, linked by the authority and applies to jurisdiction properties. Data consumers would need to consult the Wikidata API or a database extract or query the Wikidata Query Service to determine the relationship between the office and the boundary. But so far I’ve yet to come across a compelling articulation of why a data consumer would need to automatically associate these things anyways. 
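
As a rough sketch of the simple case described above, where the office’s operator:wikidata equals the boundary’s wikidata, a data consumer could group offices under boundaries like this. The function name, the plain tag dictionaries and the QIDs are all placeholders invented for illustration; the footnote’s more precise case (separate items for the place and its government) would need a Wikidata lookup that is not shown here.

```python
from collections import defaultdict


def offices_by_boundary(boundaries, offices):
    """Group office tag dicts by the boundary wikidata ID they point at.

    Both arguments are lists of plain OSM tag dictionaries (a made-up
    layout for this sketch, not the output format of any OSM library).
    """
    known = {b["wikidata"] for b in boundaries if "wikidata" in b}
    grouped = defaultdict(list)
    for office in offices:
        qid = office.get("operator:wikidata")
        if qid in known:
            grouped[qid].append(office)
    return dict(grouped)


# Placeholder data: these QIDs are fake and do not refer to real items.
boundaries = [{"boundary": "administrative", "wikidata": "Q0000001"}]
offices = [
    {"amenity": "townhall", "operator:wikidata": "Q0000001"},
    {"office": "government", "operator:wikidata": "Q0000001"},
    {"office": "government", "operator:wikidata": "Q0000002"},
]
print(len(offices_by_boundary(boundaries, offices)["Q0000001"]))  # 2
```

Because several offices can share one operator:wikidata value, this handles governments whose offices are spread over more than one building.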

Comment from gdt on 24 August 2022 at 13:02

I have long thought that place=foo and admin boundaries are not definitely related even though in most cases they match. The first is the hierarchy of “settlements” which one can determine by ignoring government and seeing where people live, and the second is government. Granted, typically governments are organized around where people live, so in New England there are town centers and town boundaries. But there are also secondary villages within towns, that historically were somewhat separate culturally.

So when putting place= and associating population, is that the population of some admin thing, or does one essentially tile with place= and count population in each polygon?

And then there is quarter/neighborhood as place, which are meant to be sub-parts of city, vs town/village/hamlet which aren’t. So in Acton, MA, it is a town (I think, <50K) in osm-speak, and there is South Acton and West Acton which are not separate by government but which have old town centers. In counting their population, is it removed from Acton’s? I think this situation needs a sub-part of town vs village, as in the modern world admin and locality are messily intertwined.

Also there is place=locality for places that have names but it’s not about people living there, but that can be avoided.

Comment from gdt on 24 August 2022 at 13:18

Also, I thought that OSM had population cutoffs for these terms, and I’d rather see an exception than a tweaked threshold. If you look at the one you want to promote, is the underlying reality that the number of people who consider that they sort of live there, even if outside some admin polygon, is higher?

Also, city and town are relative. In sparse regions, a place with a hospital and airport is a big deal as you say. That level of population, adjacent to a big city, is not worthy of promotion.

Comment from ke9tv on 24 August 2022 at 15:44

@Minh Nguyen:

I’m ok with a rounder arbitrary threshold with an exception for Saranac Lake. It turns out that “about 5000” seemed like the right level anyway, but I’m equally good with saying “5000, but make sure Saranac Lake is included because of its very high regional importance.”
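
A rounder threshold with a named exception might look something like the following. The names are invented for this sketch, and reading “5000” as “at least 5000” is my assumption; the thread doesn’t say whether the cutoff is inclusive.

```python
TOWN_THRESHOLD = 5000               # assumed inclusive: "at least 5000"
TOWN_EXCEPTIONS = {"Saranac Lake"}  # promoted for regional importance


def meets_town_cutoff(name: str, population: int) -> bool:
    """True if a place clears the town cutoff, by size or by exception."""
    return population >= TOWN_THRESHOLD or name in TOWN_EXCEPTIONS


print(meets_town_cutoff("Saranac Lake", 4887))  # True, via the exception
```

Keeping the exceptions in an explicit set makes each one visible and reviewable, rather than hiding it inside an odd-looking number.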

Your suggestion of border_type to label the type of government that the border bounds is a good one. So good, in fact, that it’s already there. I didn’t trouble labeling Cities, Towns and Villages with CDP since they’re all CDP’s, but you’ll see a handful of just hamlet (mostly dissolved Villages that still have legal boundaries but have devolved to the Town government), CDP (for CDP’s that I know are not governmental at all, like university campuses or unnamed suburban regions like ‘Ithaca Northwest’ or cases where the CDP is misaligned to a political unit and I can’t fix it - as with my home town of Niskayuna), and hamlet:CDP for both the cases that I know are civil subdivisions and the cases where I’m not sure.

I retained City of XXX, Town of XXX and Village of XXX consistently in name=* simply because border rendering looks too weird without it. It’s quite common in New York to have City of Plattsburgh be the chartered city, and Town of Plattsburgh be the remainder of the township, which has its own government. The two governments are independent of each other, and both slot in at admin_level=7, so that’s no guide. It looks really strange to see a rendered political boundary with ‘Plattsburgh’ on both sides!

I did NOT put City of, Town of or Village of on place nodes with only a handful of exceptions. Town of Tonawanda is NOT an accident; it’s actually a name that’s in common use to distinguish it from the adjoining cities of Tonawanda and North Tonawanda. I’m blanking on the name, but there’s also a township in the Finger Lakes that’s named for a village OUTSIDE the town boundaries (the township got subdivided). (That’s one of the ones that got an artificial place=region (to be place=municipality soon) from the (previously non-imported) GNIS entry for the township.) Clearly, the Towns that were remainders when Cities were chartered also get this treatment. Town of Hempstead also got handled carefully because of its high significance: Town of Hempstead (admin_level=7) has about 800,000 inhabitants. It likes to bill itself as “America’s Largest Township.” But none of the inhabitants associates with it as a “home town”. If they say that they live in “Hempstead”, they’re talking about the Village of Hempstead within the town, itself a significant suburban city of about 60,000.

In general, it’s hard to deal with place nodes that represent different levels of the administrative hierarchy: the Village of Schoharie within the Town of Schoharie within the County of Schoharie. In general, I retained the smallest place because that’s what people identify with, and kept it as the label of all the same-named objects. It didn’t seem to make sense to have multiple nodes at the same location (and if they’re exactly at the same location, it gives a lot of the electronic tools heartburn). In all the cases I spotted where town and village are named alike (a couple of hundred), residents of the town outside the village generally identify as “coming from” the village or hamlet that they live in - unless they really are in the hinterland, but even then, they’ll often qualify their answer to say that they live outside the village.

If there were tiger:NAMELSAD tags on any of these things, they were gone before I got here. I knew that it was a TIGER import because so many had TIGER in the source tag, but there were no tiger:* tags on any of them. There were a lot of gnis:* tags on the places, and I got rid of them, saving only the feature ID. Nobody needs the redundant information of which state and county the features are in, or the fact that the node represents a “Populated Place”. Most particularly, nobody needs to have every place node state that New York is the 36th state in alphabetical order or that Wyoming County is the 61st county of the state.

It’s an easy enough operation to push operator and operator:wikidata down the admin_centre link. One use case I had in mind for going the other way is, “I just moved to town; where do I go for voter registration, dog licensing, property tax information, etc?” In all cases in New York, the town or city clerk’s office is a starting point. The clerk (an elected position) is the official custodian of records (and the boss of pretty much all the local bureaucrats). I don’t know how easy it would be to get that from the OSM data if we were to turn the relationship upside down the way you suggest.

Comment from ke9tv on 24 August 2022 at 16:02

@gdt:

Arbitrary population cutoffs were removed from the Wiki for place=* quite some time ago. Mappers weren’t following them at all, and instead using something like the Christaller model that you suggest.

As far as fitting the OSM terms into a Christaller model, I don’t have any source of data for ‘number of people who say they live in a place’. What I have - what all of us have - is census figures, tied to arbitrary polygons. Generally, of course, these polygons follow the political boundaries.

In all these cases of administrative regions/CDP’s, the population - and the source for it - is tagged on the boundary, so that’s unambiguous information about what population was counted, who counted it, and when it was tabulated. (And in all cases in New York at admin_level>6 that’s now the 2020 Census. It appears someone else already did the counties.) That’s really the best any of us can do.

As for the population on place nodes, that pretty much is tagging for the renderer, but it’s information that at least some renderers use. At least, it’s not “lying to the renderer.” population=* on a place node is asserting “there is an enumeration region of this name containing this point which has the given population.”

It’s not ideal, it’s a starting point. It’s surely better than the mess we have, in which Geneva (pop. <4000) is a city because it has a city charter, while Brentwood (pop. >60000) is a hamlet because its only local government is the Town of Islip. I don’t think we’re going to do significantly better than a somewhat arbitrary framework without a lot more effort than I’ve put into this, and I’ve been working on this project for over half a year at this point.

Comment from Minh Nguyen on 24 August 2022 at 21:12

population=* on a place node is asserting “there is an enumeration region of this name containing this point which has the given population.” […] It’s not ideal, it’s a starting point.

Last year, there was a proposal to use the Census Bureau’s urbanized areas as the basis for population tags on place nodes. Urbanized areas ignore jurisdictional boundaries in favor of population distribution, which theoretically would line up better with what the place nodes represent, but the messy reality is that populated places are also a function of commerce and industry, which the urbanized area definitions don’t consider, and sometimes downright arbitrariness.

Ultimately, there’s no purely data-driven method for correctly sizing every place label on a map without some degree of human judgment. As you say, the population tags are just a starting point. It may be good enough for the “long tail” of places that a data consumer wouldn’t know how to classify manually.

It’s an easy enough operation to push operator and operator:wikidata down the admin_centre link. One use case I had in mind for going the other way is, “I just moved to town; where do I go for voter registration, dog licensing, property tax information, etc?” In all cases in New York, the town or city clerk’s office is a starting point. The clerk (an elected position) is the official custodian of records (and the boss of pretty much all the local bureaucrats). I don’t know how easy it would be to get that from the OSM data if we were to turn the relationship upside down the way you suggest.

To me, this use case doesn’t sound fundamentally different than searching for your state legislator’s constituent service office, police precinct, school board office, or power utility office. In general, we aren’t mapping service areas as boundaries, but some government offices happen to have service areas that conform to an administrative boundary. Even so, it’s up to the user to do their homework about which local office can help them.

In some states, things get too complicated to express in tags. For example, San José’s water utility – a bona fide part of city government – serves only 12% of the city, not including where I live. For most purposes, the county sheriff’s office serves unincorporated areas but not cities and towns. In a neighboring county, the county’s public health department doesn’t serve one city that has their own public health department. There’s a contract city nearby that contracts with other governments to provide basic services and generally doesn’t provide services “in house”.

As long as there’s a distinct item for the government as opposed to the place, then both the boundary and office could be tagged with the same operator:wikidata, making that a little easier. But I don’t think there’s very much a data consumer should infer based on that relationship.

Comment from ke9tv on 26 August 2022 at 02:17

As promised, place=region is dead; long live place=municipality.

I’m still working through the details of how to do a mechanically-aided edit to push operator and operator:wikidata down from the boundary onto the seat of government, and then break the admin_centre link. (My usual automate-via-JOSM tricks won’t quite work, because the API I’m using doesn’t edit relation memberships.) I do not want to do this as two thousand more changesets!
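
For what it’s worth, the transformation itself can be sketched on plain dictionaries, independent of any editor or API. Everything here is an assumption made for illustration: the dictionary layout, the function name, and the choice to fill operator from the boundary’s name=* and operator:wikidata from its wikidata=*. It does not call JOSM or the OSM API, and it says nothing about how to batch the changesets.

```python
def push_operator_down(relation, nodes):
    """Copy operator tags onto admin_centre nodes and drop those members.

    relation: {"tags": {...}, "members": [{"type", "ref", "role"}, ...]}
    nodes:    {node_id: {"tags": {...}}}
    Returns new (relation, nodes); the inputs are not modified.
    """
    name = relation["tags"].get("name")
    qid = relation["tags"].get("wikidata")
    if not qid:
        # Nothing to push down, so leave the admin_centre link in place.
        return relation, nodes

    new_nodes = {ref: {"tags": dict(n["tags"])} for ref, n in nodes.items()}
    kept = []
    for member in relation["members"]:
        is_seat = (
            member["role"] == "admin_centre"
            and member["type"] == "node"
            and member["ref"] in new_nodes
        )
        if not is_seat:
            kept.append(member)
            continue
        tags = new_nodes[member["ref"]]["tags"]
        # setdefault keeps any operator tags a mapper has already set.
        if name:
            tags.setdefault("operator", name)
        tags.setdefault("operator:wikidata", qid)

    return {"tags": dict(relation["tags"]), "members": kept}, new_nodes
```

The link is only broken for members whose node is actually available to receive the tags, so no association is lost silently.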

Comment from they on 23 December 2023 at 16:31

While not based strictly off population, the US Census Bureau’s statistical areas seem to align well with common usage of the place=city tag in OSM - an important regional hub. Statistical areas were used for identifying regional centers in New Mexico’s highway=trunk network using this guidance: “a city is considered the urban core of any Metropolitan Statistical Area (MSA) or Micropolitan Statistical Area (μSA) that is not part of a Combined Statistical Area (CSA) that contains a MSA. If the μSA in question is part of a CSA that does not contain a MSA, then the largest μSA in that CSA shall be considered a city.” Most cities in New Mexico follow these criteria. The two exceptions - Deming and Grants - are towns of about 10,000 people located within a commutable distance to a much larger metropolitan area. A little more under the radar, when someone decided to tag every incorporated place in Wyoming as place=city, I used these same guidelines to clean that up, leaving only the Census’ statistical areas as cities.
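
One possible reading of that guidance, written out as a sketch. The wording is ambiguous, so the interpretation is an assumption: every MSA core qualifies, a μSA outside any CSA qualifies, a μSA inside a CSA that contains an MSA does not, and in a CSA with no MSA only the largest μSA qualifies. The record layout and the sample areas are made up; they are not Census data.

```python
def city_cores(areas):
    """Return the names of statistical areas whose core would be a city.

    Each area is a dict with keys: name, kind ("METRO" or "MICRO"),
    population, and csa (a CSA name, or None if not part of one).
    """
    csas_with_metro = {a["csa"] for a in areas if a["kind"] == "METRO" and a["csa"]}

    # For each CSA without a metro area, remember its largest micro area.
    largest_micro = {}
    for a in areas:
        if a["kind"] == "MICRO" and a["csa"] and a["csa"] not in csas_with_metro:
            best = largest_micro.get(a["csa"])
            if best is None or a["population"] > best["population"]:
                largest_micro[a["csa"]] = a

    cities = set()
    for a in areas:
        if a["kind"] == "METRO":
            cities.add(a["name"])            # assumed: metro cores always count
        elif a["csa"] is None:
            cities.add(a["name"])            # stand-alone micro area
        elif largest_micro.get(a["csa"]) is a:
            cities.add(a["name"])            # largest micro in a metro-less CSA
    return cities


# Invented sample areas, for illustration only.
sample = [
    {"name": "Metro A", "kind": "METRO", "population": 900000, "csa": "CSA 1"},
    {"name": "Micro B", "kind": "MICRO", "population": 30000, "csa": "CSA 1"},
    {"name": "Micro C", "kind": "MICRO", "population": 40000, "csa": None},
    {"name": "Micro D", "kind": "MICRO", "population": 25000, "csa": "CSA 2"},
    {"name": "Micro E", "kind": "MICRO", "population": 35000, "csa": "CSA 2"},
]
print(sorted(city_cores(sample)))  # ['Metro A', 'Micro C', 'Micro E']
```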

For New York, it might make sense to only consider metropolitan areas as cities. That would align with your threshold of 50,000 inhabitants, except that the count is not restricted to the city limits but instead covers the entire metropolitan statistical area. This would result in places like Binghamton, Ithaca, and Watertown retaining their city status while preventing places like Clay, a suburb of Syracuse with just over 50,000 inhabitants, from getting tagged as cities.

With that said, I think this mechanical edit will be a big improvement overall, and the cities will be easy to clean up after the fact.
