OpenStreetMap

Extreme SEO on osm.org kills both Spam & OSM

Posted by alexkemp on 5 June 2019 in English (English)

Oh Dear, Oh Dear, Oh Dear

tl;dr:
No-one can hear you weep at the bottom of the Abyssal Plain

The OSM Admin have succeeded in killing off the recent floods of spam from within these Diary pages. They have killed that disease by killing the patient.

On 18 May 2019 at 10:05 TomH said:–

“Personally at this point I am inclined to just close down the diaries - they are not core to our service and should probably never have been implemented”

TomH is the Admin on this site and is thus given the ability to implement site-wide changes.

On the same day (18 May) TomH unilaterally implemented a broken commit to robots.txt (fixed 19 May) which intentionally killed the Diaries by preventing any Search Engine from listing a single page. You can write anything that you like here now, because no-one will ever read it.

Notice also (below) that all nodes, ways & relations in the map have been blocked, so no PoI will ever be found via a Google search. Frankly, there seems little reason to add to the map, either.

I pointed out a broken link within robots.txt here on 5 June and, in the same unilateral manner, that was immediately fixed by gravitystorm. Another update was made to the same file on Thu, 06 Jun 2019 17:07:41 GMT (that update was retrospectively assigned to a 4-year-old issue). Below is the current file:–

$ wget -S osm.org/robots.txt
…
$ cat robots.txt
User-agent: *
Disallow: /user/*/diary
Disallow: /user/*/traces/
Allow: /user/
Disallow: /traces/tag/
Disallow: /traces/page/
Disallow: /api/
Disallow: /edit
Disallow: /browse
Disallow: /diary
Disallow: /login
Disallow: /geocoder
Disallow: /history
Disallow: /message
Disallow: /trace/
Disallow: /*lat=
Disallow: /*node=
Disallow: /*way=
Disallow: /*relation=

Host: www.openstreetmap.org

The “Disallow: /user/*/diary” line above means that in the approx. 1,100,000 current results for “a site:openstreetmap.org” on Google there is not a single result for a Diary page; the results are almost all wiki pages, with a sprinkling of help, blog, etc. pages.

The OSM robots.txt deliberately causes most of openstreetmap.org to be de-listed in Google, including all of the Map. The number of results shown is also bouncing about in an erratic manner (I first ran the search above ~6 hours ago and there were 600,000 pages available).
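For anyone who wants to test individual URLs against the file above, here is a minimal sketch in Python. It assumes the matching rules that Google documents for its own crawler: `*` matches any run of characters, the longest matching rule wins, and Allow wins a tie. Other crawlers may read wildcards differently. The function names and the sample paths are my own illustrations, not anything taken from OSM.

```python
import re

# Rules copied from the robots.txt shown above, in file order.
RULES = [
    ("disallow", "/user/*/diary"), ("disallow", "/user/*/traces/"),
    ("allow", "/user/"), ("disallow", "/traces/tag/"),
    ("disallow", "/traces/page/"), ("disallow", "/api/"),
    ("disallow", "/edit"), ("disallow", "/browse"), ("disallow", "/diary"),
    ("disallow", "/login"), ("disallow", "/geocoder"),
    ("disallow", "/history"), ("disallow", "/message"),
    ("disallow", "/trace/"), ("disallow", "/*lat="), ("disallow", "/*node="),
    ("disallow", "/*way="), ("disallow", "/*relation="),
]


def pattern_to_regex(pattern):
    """Turn a robots.txt path pattern into a regex anchored at the start."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(path, rules=RULES):
    """Longest matching rule wins; Allow wins a tie; no match means allowed."""
    best = None  # (pattern length, is_allow)
    for kind, pattern in rules:
        if not pattern:
            continue  # an empty rule restricts nothing
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), kind == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


# Sample paths are illustrative only.
for path in ["/user/alexkemp/diary", "/user/alexkemp", "/diary",
             "/browse/node/1", "/?node=1", "/copyright"]:
    print(path, "allowed" if is_allowed(path) else "blocked")
```

Under those assumptions the diary path is blocked even though “Allow: /user/” also matches it, because the Disallow pattern is the longer of the two.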

So remember, the next time that you upload some PoI or important map updates to this site, that OSM will immediately drop them the 7 miles to rest, unseen & unloved, upon the vastness of the Abyssal Plain. There they will rest for computer eternity, unseen by humans or bots, unless you make a special point to go and see them. And do do that, because no-one else will, because no-one else can ever find them, no matter how hard they seek for them, because they no longer exist within the Google nor Bing nor any other computer universe other than Nominatim. And who uses that?

(The information above was discovered during writing an ordinary post for these Diaries. I have rewritten this post as the info above is far more important than the original post. The rest of the original post now continues, although with the info above removed)

• SEO == “Search Engine Optimisation”
  (this is almost always a euphemism for “SPAM”)

    Between 25 Apr & 23 May
    (29 days):
    -----------------------
              Diary Posts
            Human      Spam
            -----   -------
    Total  :  124   320,272
    Per day:    4    11,044
    -----------------------
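The per-day figures in the table can be re-checked with a few lines of Python. This is only a sketch of the arithmetic: the totals are copied from the table, the year is taken from the date of this post, and the variable names are my own.

```python
from datetime import date

# Totals copied from the table above; both end dates are counted.
start, end = date(2019, 4, 25), date(2019, 5, 23)
human_posts, spam_posts = 124, 320_272

days = (end - start).days + 1
human_per_day = round(human_posts / days)
spam_per_day = round(spam_posts / days)
spam_share = spam_posts / (human_posts + spam_posts)

print(f"{days} days: {human_per_day} human and {spam_per_day:,} spam posts per day")
print(f"spam share of all diary posts: {spam_share:.2%}")
```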

2 things have prompted this diary post:

  1. A whole series of posts from me about recent Spam Attacks
    OSM is now within an iteration of spam-bot software (such as XRumer)
    How to Stop the Spam-Storm
    Recent Spam Attacks
    Behold Cassandra
     
  2. A conversation yesterday with the lady at AST Auto Centre
    I could not be certain either from my original visit or from our Bing imagery whether some central buildings were occupied by the Ashley St Auto Centre or by a business on the opposite side of the block on Handel Street, so I phoned the number on the business card that the lady at AST had given me on 17 May. That call was yesterday (4 June), so I did not expect her to remember me, but she did, and mentioned the little leaflet that I had given her.
     
    I quickly sorted the original reason for the call (AST occupies those buildings), but as soon as I mentioned OSM she immediately responded about the number of enquiries that the business gets via Google maps (I had never mentioned Google). I brightly informed her that the updated mapping should be available within the next hour.

AST Auto Centre

In a former life as a website owner & operator I spent my time fighting to keep spam out of my site forums + trying to keep the site firmly within Google’s gaze. Whilst many spammers are profoundly stupid, it was not lost on me that there is some fierce intelligence at play behind the entire spam business, else they would have long ago dissolved into the void. I was not too proud to learn from these experts where I could find it, at the same time as I did my best to kill them dead on my site. I attempt to put all the SEO intelligence that I have gathered into practical application with these diary posts, as one example, making them as attractive as possible firstly to the human readers but also to the bots operated by the SEs. If both find them attractive then OSM wins again.

I was very interested in the response from the lady at AST. It was the first time that I had had such positive feedback on the idea of OSM, and it seemed a good endorsement of both the promotional leaflets and also of OSM.

I’ve spent 30-odd years of my life as a professional salesman, and therefore it often colours my approach to things. With OSM it was immediately obvious that the map is a boon for every business — what a marketing opportunity! I therefore work tirelessly to promote every business that I meet during my surveys. (For newer mappers: in OSM-speak these businesses are called ‘PoI’, which is to say “Points of Interest”) I collect Business Cards/Compliment Slips/Brochures whilst mapping so that back at home all of the business’s most important contact information can be put up on the map. If the OSM map is an easily-referenced source of information for most businesses, then it will become very interesting for the common customer, which is good for customers, good for businesses, and good for OSM. The kind of virtuous circle that we need.

Public Service Information:– Business Cards, Compliment Slips, Brochures

For the benefit of younger mappers, since I’ve often had folks telling me that they do not know what I’m talking about when I’ve asked for one of these items:–

Business Cards
These are dead trees processed into stiff paper (ask your teachers if you are unfamiliar with the idea of paper) and then printed with the business, personal name of the business owners + address & contact numbers. They are intended to be given to other people during personal meetings to facilitate marketing promotion of the business. Please approach your health professional if you have been triggered by the idea of a one-on-one personal meeting, but this is actually quite normal during business life.

Compliment Slips
These are dead trees processed into (typically) a ⅓ of an A4 sheet of paper (ask your teachers if you are unfamiliar with the idea of paper) and then pre-printed — normally on just one side — with the name, address & other contact details of the business. They are intended to be placed into the same envelope as a letter. Ask your teachers if you are unfamiliar with the idea of an envelope or a postal-letter.

Brochures/Booklets
These are dead trees processed into (typically) a series of A5 sheets of paper (ask your teachers if you are unfamiliar with the idea of paper) and folded (brochures) or bound (booklets) after pre-printing with information about the business & its activities.

I can hear you grumbling about the lack of SEO info on https://osm.org so far, so let’s get to that.

(removed material)

The -S parameter to wget gives us a readout of the server headers sent together with the file. Thankfully static files like robots.txt get a Last-Modified header in addition to a strong ETag directly via the Apache web-server, which means that conditional requests (cache validation) will work well. These Diary pages get neither on supply, which means that you may miss updated pages and/or have to re-download unchanged pages. I was fixing this in PHP 15 years ago (update to Conteg v0.13.13 here), so there is little reason this far into the new millennium not to get this right.
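To show what the Last-Modified and ETag headers buy you, here is a minimal sketch of the check a server can make when a client sends back the validators it cached. This is neither the Conteg code nor whatever the OSM servers run. The function names, the ETag value and the exact-case header dictionary are assumptions for illustration, and the parsing is simplified (it would mis-split an ETag that contains a comma).

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def _opaque(tag):
    """Drop the weak-validator prefix so W/"x" compares equal to "x"."""
    return tag[2:] if tag.startswith("W/") else tag


def conditional_status(request_headers, etag, last_modified):
    """Return 304 if the client's cached copy is still current, else 200.

    request_headers: dict with exact-case header names (sketch assumption).
    etag: current ETag of the page, including its double quotes.
    last_modified: timezone-aware datetime of the last change.
    """
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # If-None-Match takes precedence over If-Modified-Since.
        tags = [_opaque(tag.strip()) for tag in if_none_match.split(",")]
        return 304 if "*" in tags or _opaque(etag) in tags else 200

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
            return 304 if last_modified <= since else 200
        except (TypeError, ValueError):
            return 200  # unparseable date: send the full page
    return 200


# Hypothetical page state; the timestamp reuses the date quoted earlier.
page_etag = '"example-etag"'
page_changed = datetime(2019, 6, 6, 17, 7, 41, tzinfo=timezone.utc)

print(conditional_status({"If-None-Match": '"example-etag"'}, page_etag, page_changed))
print(conditional_status({"If-Modified-Since": "Wed, 05 Jun 2019 00:00:00 GMT"},
                         page_etag, page_changed))
```

A 304 reply carries no body, so the client keeps its cached copy and nothing is re-downloaded. A page sent with neither header gives the client nothing to send back, so every fetch is a full fetch.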

Here is the current file; there are some odd Extreme SEO decisions here (see code for robots.txt above)

The OSM server setup is a work of genius, and the Admin & Mods clearly know their business, and yet, and yet … too often I look at some of their decisions and think to myself “have you lost your marbles?”. The conclave that has set itself up under GitHub comments too often sounds like an echo-chamber, in which the participants brook little argument and get hostile when questioned by outsiders. That is a sure sign of narcissism, but what are we supposed to do? Shine a light on the issue & write critical diaries is, I guess, all that we can do.

The main problem comes with the sitemap, which has a whoopsie
(a 404 Not Found fixed today not by fixing the dead-link, but by removing that link from robots.txt)

That single mistake will have caused this entire site to plunge in the Google rankings (been there, done that - it always takes twice as long to get back as it does to drop; Google is as unforgiving as the OSM admin).

(removed material)

In OSM We have an Inside-Out Microsoft

I always thought that Micro$oft was a company with incredibly good programmers, but ruined by being led by the Marketing Department. Now sure, they made some truly stupid decisions — their Windows’ command-line is the best example of that — but on balance the company was excellent. It is just that they always deserved that $ in their name.

OSM cannot be blamed for making the Micro$oft mistake, no sir. The OSM affair is led by geeks who have zero understanding of what their customers want & need. Indeed, they do not WANT to know or ever have to give such a notion headspace.

This all does my head in; it is difficult to carry on when involved with folks with such odd ideas. “Hiding your light under a bushel” is one thing, but burying it at the bottom of the Abyssal Plain is another notion altogether.

Location: St Ann's, City of Nottingham, East Midlands, England, NG3 4QP, United Kingdom

Comment from ianlopez1115 on 5 June 2019 at 13:17

In addition to business cards, brochures and booklets, I also collect receipts which are useful when getting needed information such as address, contact information like phone numbers (mobile and landline), websites (usually Facebook pages), and tax identification numbers. In some instances, I explain to business owners what OpenStreetMap is and encourage them to use it over “the leading brand”.

Comment from Glassman on 5 June 2019 at 14:48

Alex, can you explain, for us un-enlightened, what the whoopsie is in “The main problem comes with the sitemap, which has a whoopsie”?

If it causes us a drop in ranking, does it really matter?

Clifford

Comment from alexkemp on 5 June 2019 at 17:23

Hiya Glassman

The actual whoopsie is that the Sitemap is inaccessible.

When certain crucial site-files cannot be accessed, then Google downgrades the entire site. robots.txt is one of those (not applicable here) but sitemap.xml.gz is another, and there are 2 problems:

  1. It is robots.txt on an HTTPS URL - therefore, it only applies to other HTTPS URLs, and NOT to an HTTP URL (the Sitemap directive gives an HTTP URL)
  2. sitemap.xml.gz appears within robots.txt which, because it gives a 404, also reflects back on robots.txt (one of the easiest ways to de-list your entire site is to make robots.txt inaccessible).

“If it causes us a drop in ranking, does it really matter?”

I’ll put the answer to this one back to you as a question: does it matter if no-one ever looks at the Map (I’m being serious)? Put together in your mind the countless thousands of hours put into surveying & editing the map by uncounted thousands of people & bots, and then answer — if no-one ever uses it (and/or if they are unable to see it because the SE’s say that it is not worth looking at) — does that really matter?

Comment from mmd on 5 June 2019 at 17:44

I think we can stop the discussion about sitemaps right here, as it got removed today: https://github.com/openstreetmap/openstreetmap-website/commit/31e1204dfab9b81346a2881ac0943ba37a76a323

Comment from mmd on 5 June 2019 at 17:57

Regarding “The robots.txt instructions Disallow: /user/*/diary and Disallow: /diary tell the Search Engine’s crawlers to ignore all Diary pages.”:

If you had in fact followed the respective GitHub issue, you would have known that this was a short-term immediate action until more sophisticated approaches are in place - which required some extra development effort. Don’t expect this to be a permanent solution either.

Comment from alexkemp on 5 June 2019 at 18:00

Yes. Like Income Tax, perhaps? (a temporary measure to pay for the war with Napoleon).
