But it turns out, there can actually be too much of a good thing. One of my first encounters with such a situation was while doing some fixing up of imported TIGER data somewhere out west. I came across this:
Notice the scale. We are looking at a little over 100 meters of country highway and there are over 300 nodes in this image. This is obviously beyond ridiculous and I can assure you that this particular instance was cleaned up long ago and those nodes met their appropriate end in a special level of /dev/null.
But this is not an isolated incident. I keep coming across such over-noding. So I decided to use my local planet database (which I previously blogged about setting up) to try and find out how much this really happens and where such things come from.
Well, the "ways" table in the database is 85 GB in size and has over 165 million entries in it. Doing anything with the whole data set tends to be slow. So I decided to filter out some things. While discussing it on IRC, Andrew Buck suggested starting by ignoring ways with fewer than 10 nodes in them. After some consideration this actually seemed reasonable. It is hard to over-node something with 10 nodes. So I made a new table with only these ways in it and with only a subset of columns to make it smaller and more manageable. This actually ended up eliminating a lot more ways than I anticipated. It turns out, 75% of the ways in the OSM database have fewer than 10 nodes in them. This left me with about 40.5 million ways to investigate.
The next question is what does it mean to be "over-noded" anyway? Since this is a geospatial database, I can use functions to determine the length of a given way. And I also know how many nodes each way has in it. So a logical unit is nodes per meter.
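I did this calculation inside the database, but the idea is easy to sketch outside of it. The following Python is only an illustration, not the query I ran: the function names are made up for this example, and it assumes a way is available as a list of (lat, lon) pairs and that a spherical-Earth haversine distance is a good enough length measure.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; spherical approximation (assumption)


def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def way_length_m(coords):
    """Sum of the segment lengths between consecutive nodes of a way."""
    return sum(haversine_m(p, q) for p, q in zip(coords, coords[1:]))


def nodes_per_meter(coords):
    """Node count divided by way length; None for a zero-length way."""
    length = way_length_m(coords)
    if length == 0:
        return None
    return len(coords) / length


# Hypothetical way: 11 nodes spaced 0.0001 degrees apart along the equator.
demo_way = [(0.0, 0.0001 * i) for i in range(11)]
demo_ratio = nodes_per_meter(demo_way)
```

For the hypothetical way above, each segment is roughly 11 meters, so the ratio comes out a little under 0.1 nodes/meter.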
Next up is to determine what is "normal" for typical mapping. Taking the roughly 40 million ways with 10 or more nodes, I ran some basic statistical functions and found the following numbers:
Minimum nodes/meter: 0.00000282
Median nodes/meter: 0.043
Average nodes/meter: 0.088
Maximum nodes/meter: 65.23
Well that is quite a range. Let's see what is going on here. The way with the fewest nodes/meter turns out to be way 105818922 which is part of the border of Indonesia. It has 14 nodes and stretches almost 5,000 kilometers, averaging one node every 350 kilometers. It could maybe use a little more refinement but it's a national border in the middle of the ocean... it is probably not too bad.
On the other end of the spectrum is way 44342937. It supposedly represents a round "building" with a diameter of about 1.5 meters. It has 339 nodes. What the... WHAT?! What is this, the world's most accurately mapped dog house? Since it is likely that this way is going to be deleted in the very near future, I include here a screen shot of it loaded up in JOSM:
Yes, that is a tiny circle in the middle of someone's back yard. All that red around it is JOSM trying to draw its normal way direction arrows on such a ridiculous object.
Now that we have examined both ends of the spectrum, what's in the middle? Well, there is way 101521179. 174 meters long with 14 nodes which yields a nodes/meter reading of 0.08000019. That is... well... entirely reasonable. The fact that the median is only half of the average would seem to indicate that the average is being pulled towards the high end by a comparatively small number of ways with a very high number of nodes - like that one with 65 per meter.
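To see why a median well below the average points at a few extreme values, here is a small Python sketch. The sample ratios are invented purely for illustration and are not taken from my database; only the shape of the result matters.

```python
import statistics


def summarize(ratios):
    """Return the same four summary numbers used above for a list of ratios."""
    return {
        "min": min(ratios),
        "median": statistics.median(ratios),
        "mean": statistics.fmean(ratios),
        "max": max(ratios),
    }


# Made-up sample: eight ordinary ways and one extreme outlier.
sample = [0.02, 0.03, 0.04, 0.04, 0.05, 0.06, 0.08, 0.10, 65.0]
summary = summarize(sample)
```

In this toy sample the single outlier drags the mean far above the median, while the median stays at a perfectly ordinary value. The real data shows the same effect in a much milder form.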
So if 65 nodes/meter is the extreme upper limit and most people map at less than 0.1 then what exactly is "too much"? Well I don't really know but I'm just going to throw out a number for further analysis. Let's make it... 1 node per meter. This limits the number of ways down to about 105 thousand.
Since my database has geographic knowledge about these ways and I have the number down to a reasonable level, let's hook up QGIS to it and make a map of the ways with more than 1 node per meter. Let's start with the US:
Well that's actually not as dense as I was expecting. It looks like some of the TIGER over-noding might actually come in at under 1 node/meter. Which kind of makes sense if you think about it. A lot of TIGER ways are really long and the over-noding is sometimes clumped into one section so on average the nodes/meter ratio will be less than 1. But for fun let's look at what other interesting things might be in this data set. How about Europe?
Well now that's different. What's with the outbreak in France?! I have my suspicions. Let's see how this works out...
There is a "source" tag that is often used in OSM data to indicate where a certain map object came from. It is used especially often during imports. Technically it usually makes more sense to put the source tag on the changeset instead of the map object itself, especially for large imports, but a lot of people still put it on the map objects anyway. So let's see what these over-noded ways reveal in their source tags. Here are the results of a query that groups things by source tag and counts how many ways with a nodes/meter ratio of over 1 have that source tag. I truncated the source information at 70 characters for display purposes.
  WAYS | SOURCE
-------|-----------------------------------------------------------------------
 62995 | cadastre-dgi-fr source : Direction Générale des Impôts - Cadastre. Mis
 22713 |
  3535 | extraction vectorielle v1 cadastre-dgi-fr source : Direction Générale
  1313 | Bing
  1045 | 3dShapes
   744 | bing
   687 | Kolding Kommune
   628 | NHD
   608 | dcgis
   602 | WakeGIS
   585 | WroclawGIS
   526 | Planimetria de Vitoria
   482 | MassGIS Buildings (http://www.mass.gov/mgis/lidarbuildingfp2d.htm)
   454 | http://www.bakersfieldcity.us/gis/downloads/gis_spatial_data.htm
   393 | NextView
   390 | Regione Emilia Romagna
   369 | cadastre-dgi-fr source : Direction Générale des Impôts - Cadastre ; mi
   349 | kapor2
   336 | Bing Sat
   324 | SO!GIS Import
   260 | geoimage.at
   244 | Regione_del_Veneto_LR28_16.7.1976_Formazione_CTR_auth_39164-5700-1100_
   230 | http://www.sogis.ch SO!GIS Import
   225 | MGC
   214 | Kreis_Viersen_Katasteramt_2012_06
   214 | lukr
   206 | Ajuntament de Girona
   184 | CCH
   183 | NRCan-CanVec-10.0
   175 | Orthophotos 2011 du SITG (Système d'Information du Territoire Genevois
   133 | DEP Wetlands (1:12,000) - April 2007 (http://www.mass.gov/mgis/wetdep.
   128 | vuv:dibavod:a05
   125 | http://www.roseville.ca.us/services/maps_n_data/data_clearinghouse.asp
   115 | City of Kamloops
   112 | cadastre-dgi-fr source : PaysDeBrest - 20100331
   105 | OS_OpenData_VectorMapDistrict
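For anyone who wants to reproduce this kind of tally without a database, the grouping can be sketched in a few lines of Python. This is not the query I ran: the function name and the dictionary layout of a way (a "nodes_per_meter" value plus a "tags" mapping) are assumptions made up for the example. It counts on the full source tag and only truncates for display.

```python
from collections import Counter


def count_by_source(ways, threshold=1.0, width=70):
    """Count ways above the nodes/meter threshold, grouped by their source tag.

    Returns (count, source) pairs, most frequent first, with the source
    text cut to `width` characters for display only.
    """
    counts = Counter()
    for way in ways:
        if way["nodes_per_meter"] > threshold:
            # A missing source tag is counted under the empty string.
            counts[way["tags"].get("source", "")] += 1
    return [(n, source[:width]) for source, n in counts.most_common()]


# Hypothetical input, just to show the shape of the data.
example_ways = [
    {"nodes_per_meter": 1.2, "tags": {"source": "Bing"}},
    {"nodes_per_meter": 1.5, "tags": {"source": "Bing"}},
    {"nodes_per_meter": 0.5, "tags": {"source": "Bing"}},
    {"nodes_per_meter": 2.0, "tags": {"building": "yes"}},
]
example_counts = count_by_source(example_ways)
```

Note that "Bing" and "bing" stay separate in a tally like this, just as they do in the table; whether to fold such variants together is a judgment call.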
Well then. I recognize that top source tag as the ongoing import of French building outlines and some other features from the cadastre data that was made available over there. A spot check shows that some of this isn't actually because of excessive nodes being used but rather the odd way in which the import is creating building geometries. For example, here is way 67157454.
Note the wall=no tag. I believe this implies that it is some kind of porch or veranda. It has a rounded front which is where all the nodes are that make it have a nodes/meter ratio of just over 1.0. If a human mapper had mapped this building it would have likely been a single way that included the porch as well as the rest of the house instead of the 8 individual areas that the import created. This would have resulted in a single 70 meter long way with 47 nodes which comes out to 0.67 nodes/meter.
Checking a few more of these ways in France, it does look like some care was taken to prevent over-noding. Most of them are just barely over 1.0 nodes/meter. And at this point it should also be noted that closed ways are actually having their first/last node counted twice because of the way the database stores the node membership information. Since I was originally looking for over-noded highways I didn't take this into account. So technically a lot of these French buildings might be just under 1 node/meter. But the fact that there are so many of them right on this (arbitrary) limit is still interesting.
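The double counting of the first/last node on closed ways is easy to correct for. Here is a minimal Python sketch, assuming a way's membership is available as an ordered list of node IDs; the function name and the example numbers are hypothetical and only meant to show how a way can slip from just over the limit to exactly on it.

```python
def effective_node_count(node_ids):
    """Number of nodes in a way, counting the shared first/last node
    of a closed way only once."""
    if len(node_ids) > 1 and node_ids[0] == node_ids[-1]:
        return len(node_ids) - 1
    return len(node_ids)


# Hypothetical closed way: 20 distinct nodes stored as 21 references,
# with an assumed perimeter of 20 meters.
closed_way = list(range(1, 21)) + [1]
perimeter_m = 20.0

raw_ratio = len(closed_way) / perimeter_m                      # 21 / 20
corrected_ratio = effective_node_count(closed_way) / perimeter_m  # 20 / 20
```

With these made-up numbers the raw ratio is 1.05 and the corrected one is 1.0, which is the kind of small shift that would move many of the French buildings off my list.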
There are a few other source tags I recognize as well. CanVec is imported from Canadian government data. NHD is the National Hydrography Dataset here in the US from which some people have imported rivers and lakes. I know several of these water feature imports have had problems with ridiculous over-noding. Some of it has been fixed but a lot remains. MassGIS is the Massachusetts GIS office which has been used for some local imports. In particular, the most over-noded way that I showcased above is from this MassGIS building import.
Of course the second most common source tag is blank, which doesn't tell us much. A tiny random sample shows there are a couple of imports that didn't use any source tag and some are just very detailed manual mapping.
I think finding what I was originally looking for (over-noded TIGER ways) is going to take some more digging. This post kind of got hijacked by the big red blob in France but I think I'll call it quits for now and do some more poking to find the TIGERs I'm after. If there is a lesson so far, I think it can be summed up as:
- Sanity check your imports! Tiny objects with a large number of nodes are an obvious sign that there is something weird going on in your conversion to .osm format. It is possible that the object was represented as a parametric curve in the source and the conversion to OSM format tried to recreate that as closely as possible within the constraints of our x/y coordinate system.
- Imports are not human data. Even in France where this building import is generally viewed as a good thing and a lot of checking and verifying of the data has taken place, the data is still very distinctly different from most of the OSM database that has been created by hand. This may not be a bad thing in every case but it is definitely something to think about when proposing and executing an import.
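As a starting point for the sanity check in the first lesson, here is a rough Python sketch of a pre-upload filter. Everything about it is an assumption for illustration: the function name, the (way id, node count, length in meters) tuple layout, and the default limits, which simply reuse the arbitrary 1 node/meter threshold and the 10 node cutoff from this post.

```python
def suspicious_ways(ways, max_ratio=1.0, min_nodes=10):
    """Return (way_id, nodes_per_meter) for ways that look over-noded.

    `ways` is an iterable of (way_id, node_count, length_m) tuples.
    Ways with fewer than `min_nodes` nodes or a non-positive length are
    skipped here; a real check may want to report zero-length ways too.
    """
    flagged = []
    for way_id, node_count, length_m in ways:
        if node_count < min_nodes or length_m <= 0:
            continue
        ratio = node_count / length_m
        if ratio > max_ratio:
            flagged.append((way_id, round(ratio, 2)))
    # Worst offenders first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical candidates: a tiny round building with hundreds of nodes,
# an ordinary road segment, and a short way below the node cutoff.
candidates = [
    ("tiny-round-building", 339, 5.2),
    ("ordinary-road", 14, 175.0),
    ("short-way", 5, 1.0),
]
report = suspicious_ways(candidates)
```

Only the tiny round building is reported in this example. Anything such a filter flags still deserves a look by a human before it is fixed or dropped, since some very detailed objects are legitimate.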