OpenStreetMap

New host for OSM data: archive.org

Posted by h4ck3rm1k3 on 20 December 2009 in English.

I have been looking for a place to share OSM layer files, and have found a perfect spot for them: archive.org.

Archive.org supports files of unlimited size and even lets you upload videos. They accept all Creative Commons licensed data.

I have started to upload my OSM conversion of the US Census Bureau ZIP code database. Anyone who wants the data can get it, and those who don't want it don't need to see it. The only problem is that people won't find this data when they are looking for it. Hopefully they will find this post, but otherwise that is the smallest problem to start with.
http://www.archive.org/details/OpenstreetmapZipcodes

I will be using the ZIP codes to split up the EPA files so that we have a handy mechanism for selecting which files you are interested in. But first, I am downloading all of the individual data records and processing them into something usable. I would like to use that data to double-check the containing ZIP code and county data and flag any problems.
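As a rough sketch of that splitting step (the field names below are made up for illustration; the real EPA records will have different attributes), grouping records by a ZIP code attribute could look like:

```python
from collections import defaultdict

def split_by_zipcode(rows, zip_field="zipcode"):
    """Group records by their ZIP code attribute so each group
    can be written out as its own small, selectable file."""
    groups = defaultdict(list)
    for row in rows:
        groups[row.get(zip_field, "unknown")].append(row)
    return groups

# Toy records standing in for EPA facility rows (hypothetical fields).
records = [
    {"facility": "Plant A", "zipcode": "08540"},
    {"facility": "Plant B", "zipcode": "08540"},
    {"facility": "Plant C", "zipcode": "07030"},
]

groups = split_by_zipcode(records)
for zipcode, rows in sorted(groups.items()):
    print(zipcode, len(rows))  # 07030 1, then 08540 2
```

Each group could then be written to its own file (one per ZIP code) before uploading.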

That way OSM will not be flooded with "junk" data, yet we will still have a place to share it with each other.

The deletion of the EPA data is finished; sorry about the problems that caused.

In the end, we should be able to extract all the ZIP code regions and everything they contain, so that we have a tool to find which streets belong to which ZIP code.

With that data we would have a great tool for processing and verifying address data, making OSM even more valuable.

mike

Discussion

Comment from JohnSmith on 20 December 2009 at 12:48

If you have (approximate) boundary areas for the zip codes, why not import them as admin_level=8,boundary=administrative?

Comment from h4ck3rm1k3 on 20 December 2009 at 12:56

Yes, we have two levels: 3-digit ones and 5-digit ones.
The 3-digit ones contain the 5-digit ones.

Of course I can import them... but I will first send a mail to the list.
mike

Comment from JohnSmith on 20 December 2009 at 12:58

I should point out I'm not implying a batch/automatic import; we are (slowly) importing the Australian postcodes by hand. OSM files were generated from shapefiles, and people can then load these in JOSM and create postcode boundary relations using that information.

Although the main reason we're doing it manually is that other similar boundaries exist from the same organisation that released the postcode data, so boundaries for various administrative areas are shared, e.g. postcodes, suburbs, states, etc.

Comment from h4ck3rm1k3 on 20 December 2009 at 13:04

Yes, I have been looking at the data. There are cases where the boundaries do not exactly match; this will all have to be reviewed.

My idea is to write a program that looks for containment hierarchies in the data (this region contains that one) and flags errors...
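A minimal sketch of such a containment check, assuming regions are simple polygons given as coordinate lists (real OSM ways and relations would first need to be assembled into closed rings), could flag any vertex of a supposedly contained region that falls outside its parent:

```python
def point_in_polygon(x, y, poly):
    """Ray-casting point-in-polygon test; poly is a list of (x, y) vertices."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Edge straddles the horizontal ray at height y?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def flag_containment_errors(outer, inner):
    """Return the inner region's vertices that fall outside the
    supposedly containing outer region."""
    return [p for p in inner if not point_in_polygon(p[0], p[1], outer)]

# Toy data: a square "county" and two candidate ZIP code regions.
county = [(0, 0), (10, 0), (10, 10), (0, 10)]
zip_ok = [(2, 2), (5, 2), (5, 5), (2, 5)]        # fully inside
zip_bad = [(8, 8), (12, 8), (12, 12), (8, 12)]   # spills over the border

print(flag_containment_errors(county, zip_ok))   # []
print(flag_containment_errors(county, zip_bad))  # [(12, 8), (12, 12), (8, 12)]
```

Any region with flagged vertices would go on a list for manual review rather than being imported blindly.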

mike

Comment from balrog-kun on 20 December 2009 at 14:37

I don't think postcode/zipcode areas are administrative divisions like regions/states/provinces/counties/municipalities etc.; they're an aid to the postal service and are not part of the administrative hierarchy of the different admin_levels.

I'd suggest boundary=postcode or some such.

Comment from aude on 20 December 2009 at 18:10

For splitting up OSM data, I suggest splitting data up by counties or other geographic units (e.g. census tracts).

Zip code boundaries are problematic and should not be imported. In reality, zip codes are merely attributes assigned to addresses and street segments, and not geographic areas.

People have attempted to create zip code boundaries, though there is no standard way of creating them. The Census Bureau's zip code boundaries differ from other sources (e.g. county GIS departments). Also, zip code "boundaries" change all the time, as new addresses/buildings are added for mail delivery.

The only official zip code "boundaries" would come from the USPS; however, the USPS does not publish them and keeps them for internal use only. Changes in zip code assignments and boundaries are made by a GIS person at the regional postal facilities (e.g. in Dulles, VA) who adjusts them in ArcView (or MapInfo), without keeping a history of the changes.

For more about issues with zip code polygons, see: http://www.biomedcentral.com/content/pdf/1476-072x-5-58.pdf

If we end up deciding to import zip codes, it should be done only after much discussion and with a lot of care. They should be assigned to address points, and not as boundaries.

Comment from aude on 20 December 2009 at 18:16

For the Census zip code boundaries, there are the 5-Digit ZIP Code Tabulation Areas (2002), linked from http://www2.census.gov/cgi-bin/shapefiles2009/state-files?state=34.

There are other versions of Census zip code boundaries aside from that one. From the archive.org link and the data linked there, I can't tell which it is.

Comment from aude on 20 December 2009 at 18:18

According to the Census TIGER technical documentation:

"Data users should not use ZCTAs to identify the official USPS ZIP Code for mail delivery. The U.S. Postal Service (USPS) makes periodic changes to ZIP Codes to support more efficient mail delivery. As a result, the original Census 2000 and 2002 ZCTAs may no longer match current ZIP Codes."

http://www.census.gov/geo/www/tiger/tgrshp2009/TGRSHP09.pdf

Comment from h4ck3rm1k3 on 20 December 2009 at 19:09

Yes, well, we will be able to check them all out.
My plan is to create a hierarchy of data, where each region (state) contains another region (county) and so forth (relations and ways that contain each other).

If we find data that does not match or crosses a border, then it can be split up or marked for manual fixing.

Given a hierarchy of data, we would then match it based on the attributes of the EPA datapoints. Does the county match the county from TIGER? Does the zipcode match the zipcode from the Census?

The Census said that they will not update this data, but we can. Given enough test data (zip code attributes) we can find all the ones that break the model and fix them.
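That attribute-matching step could be sketched like this, with made-up field names and values standing in for real EPA/TIGER/Census attributes; the check simply reports the fields where a record's claimed values and the boundary-derived values disagree:

```python
def cross_check(record, derived):
    """Compare a datapoint's claimed attributes against values derived
    from the boundary hierarchy; return the fields that disagree as
    {field: (claimed, derived)}."""
    mismatches = {}
    for field in ("county", "zipcode"):
        if record.get(field) != derived.get(field):
            mismatches[field] = (record.get(field), derived.get(field))
    return mismatches

# Hypothetical EPA datapoint vs. values computed from TIGER/Census boundaries.
epa_point = {"id": "NJ0001", "county": "Mercer", "zipcode": "08540"}
derived = {"county": "Mercer", "zipcode": "08544"}

print(cross_check(epa_point, derived))  # {'zipcode': ('08540', '08544')}
```

Datapoints with a non-empty mismatch dict are the ones that "break the model" and need fixing.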

Anyway, there is a huge market for this type of processing and I think that OSM or something like it is the right way to go.

I will not commit this data to OSM, but will keep the OSM files on archive.org.

If we get enough updates, we can put them into a git repository...

I am starting to think that the monster database idea is not a very good one anyway...

mike

If it turns out that the zip code from the zcta produces bad data,

Comment from JohnSmith on 21 December 2009 at 10:33

@balrog-kun why aren't they administrative exactly?

@aude why does it matter if the data isn't perfect to begin with? Fix it up as you get people with local knowledge. If you shouldn't import imperfect data, why don't you just delete all the TIGER data?

As for a standard way to create them, perhaps you should look at what others are doing:

http://wiki.openstreetmap.org/wiki/Import/Catalogue/ABS_Data

Here's where we're up to regarding the import of Australian postcodes:

http://maps.bigtincan.com/?layer=00B000000FF

The ABS is the Aussie equivalent of the Census Bureau in the US.

Australia Post hasn't published a set of shape files either, but that doesn't mean others haven't made approximate areas that are good enough.

Also, if you assign them as address points, you would have to assign town, state and country information to locations too; but then you might as well add is_in tags as well, or better, use a boundary area for all of the above so that is_in can be calculated.

@h4ck3rm1k3 Yes there is a huge market for this kind of information, almost as much as for street data itself.

Also, our postcodes in some cases cross state borders and so on; I don't see why they'd need to be split up. It's easy to use boundary relations to figure out what country, state, postcode and local government area a point or small area is in.

Comment from h4ck3rm1k3 on 21 December 2009 at 10:48

I have hacked osm2pgsql so that it imports the data from my feeds:
http://fmtyewtk.blogspot.com/2009/12/osm2pgsql-hack-for-importing-id-ways.html

The data is loaded in qgis.

I will be creating some Postgres queries to split up the data and process it. That is at least my plan.

I don't care whether the monolithic OSM database stores this data or not. In fact, I think it would be better to keep it separate until we find a better way to add in layers.

Ideally the chunks of data will be usable directly from some Git repository, and we will split them into very small but useful pieces.

mike

Comment from JohnSmith on 21 December 2009 at 10:59

@h4ck3rm1k3 OSM is already a monolithic DB considering people are tagging individual trees complete with botanical names. Some find that useful, like the person making a diary entry on orienteering the other day, but it is virtually useless for me.

On the other hand postcodes are very useful to me and I'm spending considerable effort finishing what Franc started by adding the missing postcodes into the system.

Comment from h4ck3rm1k3 on 21 December 2009 at 11:03

Yes, of course. In Germany I found power lines, security cameras and trees.
But we still need a better staging system. Why should we throw it all into a single database? We could have many databases for various layers. This is a design issue. In fact, why do we need a monster database at all? Couldn't we deal with lots of small files and a smart editor that commits them in the right way, so that we don't need anything more than a smart distributed version control system?

Comment from JohnSmith on 21 December 2009 at 11:19

While it's possible to distribute code like you suggest, I don't think the same can be done for databases; most people just throw bigger hardware at the problem when it becomes one.

That's not to say we should keep doing things the same way, but at the same time I can't think of a better way to do it. I don't think storing files in a code repository is the answer either; databases exist for a reason, and they're good at what they do.

Comment from HannesHH on 21 December 2009 at 17:20

In Hamburg, Germany we tagged the streets with postal_code:number, wouldn't that work for you?

Comment from h4ck3rm1k3 on 21 December 2009 at 17:35

I was just following the wiki; personally I would prefer zipcode, but see here: http://wiki.openstreetmap.org/wiki/Key:postal_code

Comment from JohnSmith on 22 December 2009 at 06:55

@HannesHH the US and Australia use big areas for postcodes, so it makes more sense to use a polygon (relation) in OSM than to tag each property, since you can then find out the postcode of a point using the polygon. For more information on Australian postcodes see my post above.

Comment from h4ck3rm1k3 on 22 December 2009 at 07:53

I have been playing with qgis, and it looks like there is a feature to create a convex hull based on an attribute value.

So you could take these attribute values (postcodes), create a convex hull, and then compare it to the ZCTA. That would give you a good start, because you could review the areas with the biggest differences first.
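Outside of QGIS, the same idea can be sketched in a few lines. This toy example (the coordinates are invented) builds a convex hull with Andrew's monotone chain algorithm from points sharing a postcode, then compares its area to a stand-in ZCTA polygon using the shoelace formula:

```python
def convex_hull(points):
    """Andrew's monotone chain convex hull; returns hull vertices in CCW order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        # Cross product of vectors o->a and o->b; > 0 means a CCW turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def polygon_area(poly):
    """Shoelace formula for the area of a simple polygon."""
    n = len(poly)
    return abs(sum(poly[i][0] * poly[(i + 1) % n][1]
                   - poly[(i + 1) % n][0] * poly[i][1]
                   for i in range(n))) / 2

# Toy points all tagged with one postcode, vs. a toy ZCTA polygon.
tagged_points = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2)]
zcta = [(0, 0), (5, 0), (5, 5), (0, 5)]

hull = convex_hull(tagged_points)
diff = abs(polygon_area(zcta) - polygon_area(hull))
print(polygon_area(hull), polygon_area(zcta), diff)  # 16.0 25.0 9.0
```

Sorting postcodes by that area difference would surface the worst mismatches for manual review first.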

The other thing is that you can flag the nodes and ways that are outside of the ZCTA; that is what I was doing to check them. Maybe other states have more problems with their zipcodes, but NJ looks very stable.

mike
