OpenStreetMap

HTTPS All The Things (https_all_the_things)

Posted by b-jazz-bot on 18 February 2019 in English

Beep boop. I’m working on a project to update website tags (mostly in the U.S.) that use the http protocol instead of the https protocol when the website is already forcing you to use the https protocol. You can find more information at https://wiki.openstreetmap.org/wiki/Automated_Edits/b-jazz

Comment from Nmxosm on 18 February 2019 at 20:07

Thanks for these edits. I’ve tried to do that manually on objects I touched anyway.

Great work.

Nmxosm

Comment from rorym 🏳️‍🌈 on 19 February 2019 at 15:15

Thanks for some of the details. Can you go into more detail? Can you include the script you’re using? What tags are you changing? How are you comparing the URLs? What are you doing about protocol-less domains (e.g. website=www.example.com)? Why not try it on the whole world, not just the USA?

Comment from Wynndale on 19 February 2019 at 17:36

The Slack page linked from the wiki isn’t publicly readable. Do you distinguish between permanent and temporary moves, or check whether HSTS is set?

Comment from b-jazz on 19 February 2019 at 18:59

@rorym: You can find the Python code at https://gitlab.com/b-jazz/https_all_the_things/. It’s not meant for others to run just yet, but it is there for review and comments.

I’m currently just touching the “website” tag, but will likely add “url” and “contact:website” for the next go-around. I’m not sure I’ll do more than those, as they make up the vast majority of http URLs that are tagged. I’m happy to hear arguments for others that should really be included.

When comparing the URLs, I’m currently doing four checks. For http://example.com, I’m looking for a redirect to https://example.com, https://example.com/, https://www.example.com, or https://www.example.com/. Those are the most common variations when specifying redirect URLs.

At this point, I’m not tackling protocol-less URLs, but I certainly could. I should do some research and find out how common it is to leave off the http://.

As for the U.S. vs. the entire planet, I’m open to running it on the rest of the world. I just started with the U.S. because I know that community better than the rest of the world, and I only posted there looking for feedback. I could build up the script a little more, document how to run it, and let others do their own countries. What I worry about most is getting buy-in from the larger community across the globe. If someone gives me the go-ahead, I’ll happily run it worldwide.
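
A minimal sketch of the four-way comparison described in this comment, not the code from the linked repository. The function names are hypothetical, and how URLs with a path are handled is an assumption (the comment only gives the bare-domain case).

```python
def https_candidates(http_url):
    """Return the https targets accepted as 'the same site' for an http URL.

    For http://example.com this yields the four variants from the comment:
    with and without 'www.', each with and without a trailing slash.
    """
    prefix = "http://"
    if not http_url.lower().startswith(prefix):
        return set()
    rest = http_url[len(prefix):].rstrip("/")
    if not rest:
        return set()
    # Assumption: do not prepend 'www.' when the value already has it.
    host_prefixes = ("",) if rest.lower().startswith("www.") else ("", "www.")
    return {
        f"https://{host_prefix}{rest}{suffix}"
        for host_prefix in host_prefixes
        for suffix in ("", "/")
    }


def is_simple_https_upgrade(http_url, redirect_location):
    """True if the redirect target is one of the accepted https variants."""
    return redirect_location in https_candidates(http_url)
```

Under these assumptions, a redirect from http://www.example.com to https://example.com is not accepted, which matches the gap raised later in the thread.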

Comment from b-jazz on 19 February 2019 at 19:08

@Wynndale: Thanks for pointing that out. I’ll redact the names from the Slack thread and post the rest of the content in the wiki so that people not on the US Slack server can see the comments.

I am currently only rewriting 301 (Moved Permanently) and 302 (Found). As you probably know, 302 has been known at times as Moved Temporarily. So it’s arguable that I shouldn’t be rewriting any of the 302 redirects, but IMO most website operators are using 302 when they really should be using 301. That is the reason, though, that I’m avoiding anything where the redirect target is much different from the original URL. I’ve seen a bunch of domains redirecting to a Facebook page or a Google site temporarily. Those remain untouched.

As for HSTS, I wasn’t familiar with that, but did a little reading. I’m not sure how you think that could be incorporated into what I’m doing. Can you explain?
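
The redirect rule in this comment could be sketched roughly as below. This is an illustration, not the script's actual logic: the names are hypothetical, and reading "not much different" as "same host ignoring a leading www., same path ignoring a trailing slash" is an assumption.

```python
from http import HTTPStatus
from urllib.parse import urlsplit

# Only the two redirect codes named in the comment are considered.
REWRITE_STATUSES = {HTTPStatus.MOVED_PERMANENTLY, HTTPStatus.FOUND}  # 301, 302


def _site_key(url):
    """Host without a leading 'www.' plus path without a trailing slash."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host, parts.path.rstrip("/")


def should_rewrite(status, original_url, location):
    """True if a redirect looks like a plain http-to-https upgrade of the same page."""
    if status not in REWRITE_STATUSES:
        return False
    if urlsplit(location).scheme != "https":
        return False
    host, _ = _site_key(location)
    return bool(host) and _site_key(original_url) == _site_key(location)
```

With this check, a 302 from a business's own domain to a Facebook page is left alone because the hosts differ.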

Comment from rorym 🏳️‍🌈 on 19 February 2019 at 19:56

As for HSTS, I wasn’t familiar with that, but did a little reading. I’m not sure how you think that could be incorporated into what I’m doing. Can you explain?

HSTS is where a website says “Always contact this website over HTTPS”. If an OSM object’s website tag URL returns that, then you can be much more confident that you should change the OSM object: the RFC (RFC 6797) says that clients should always use HTTPS from then on.
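
A site declares HSTS with the `Strict-Transport-Security` response header, for example `max-age=31536000; includeSubDomains`. RFC 6797 tells clients to ignore that header when it arrives over plain HTTP, so a script would read it from the HTTPS response. The sketch below only parses a header value that has already been fetched; the function names are hypothetical.

```python
def hsts_max_age(header_value):
    """Return the max-age (in seconds) from an HSTS header value, or None."""
    if not header_value:
        return None
    for directive in header_value.split(";"):
        name, sep, value = directive.strip().partition("=")
        # Directive names are case-insensitive; the value may be quoted.
        if sep and name.strip().lower() == "max-age":
            value = value.strip().strip('"')
            if value.isascii() and value.isdigit():
                return int(value)
            return None
    return None


def declares_hsts(header_value):
    """True if the header asks clients to keep using HTTPS (max-age > 0)."""
    age = hsts_max_age(header_value)
    return age is not None and age > 0
```

A `max-age` of 0 tells clients to drop the policy, so it is treated here as "no HSTS".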

Comment from b-jazz on 19 February 2019 at 20:00

I agree that it would be pretty clear at that point that you can use HTTPS, but I think a simple HTTPS redirect is pretty convincing. Especially in this day and age when more and more websites are getting clued in about the importance of secure transmissions.

Comment from Nakaner on 20 February 2019 at 20:44

@b-jazz I recommend that you discuss this edit on a public mailing list with a proper archive (Talk-us in your case). Otherwise people who do not participate in proprietary communication channels can complain that the edit was not discussed, i.e. that it violates the Automated Edits Code of Conduct.

Comment from b-jazz on 21 February 2019 at 07:01

Thanks for the feedback @Nakaner. I’ll make sure I mention it in both the talk-us mailing list and the Slack channel in the future. Do you want to edit the AECoC page to point out that discussions shouldn’t take place solely on “proprietary communication channels”? Maybe we can prevent someone else from interpreting the page as I did in the future.

Comment from rorym 🏳️‍🌈 on 21 February 2019 at 13:00

Do you want to edit the AECoC page to point out that discussions shouldn’t take place solely on “proprietary communication channels”?

I think the AECoC is relatively clear that you should at least always post to a mailing list:

If you plan to make any automated edit, you should discuss and document your plans beforehand. Documentation should be placed on the wiki and the proposal should then be discussed on a suitable mailing lists: [list of mailing list options]

Comment from b-jazz on 21 February 2019 at 17:36

As clear as mud. ;-)

if your edit affects only one country or territory then the national-language mailing lists, forums, or other standard communication methods for the territory affected by the change

My argument is that osmus.slack.com is a national-language forum for the U.S. with excellent representation. If that isn’t good enough for one reason or another, the wiki should call that out.

Comment from rorym 🏳️‍🌈 on 21 February 2019 at 17:45

Upon closer reading, you are technically correct, the best kind of correct! I was influenced by the Organised Editing Guidelines which have more explicit rules. But that’s a different document. Perhaps the community should think on this.

Regardless, posting to mailing lists would be helpful.

Comment from rorym 🏳️‍🌈 on 21 February 2019 at 17:53

Comments on your script.

  • Have you considered setting a User-Agent when you make a request? Be a better web citizen!
  • You compare against the URL with www added. What if http://www.example.com redirects to https://example.com (because they think www prefixes aren’t cool)? Am I right in thinking your script doesn’t handle that? Should it?
  • You’re only looking for websites which start with http://. What about website=www.example.com (i.e. no protocol defined)? You could check if it answers on https, and add that protocol, so more people will default to the secure version.
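
For the first point, a request with an identifying User-Agent can be built with the standard library as sketched below. The User-Agent string and the `build_probe` name are hypothetical; the wiki URL is the one from the original post. Using HEAD rather than GET is also an assumption.

```python
import urllib.request

# Hypothetical identifier: a tool name plus a page where operators can read more.
USER_AGENT = (
    "https_all_the_things-sketch/0.1 "
    "(+https://wiki.openstreetmap.org/wiki/Automated_Edits/b-jazz)"
)


def build_probe(url):
    """Build (but do not send) a HEAD request that identifies the bot."""
    return urllib.request.Request(
        url,
        headers={"User-Agent": USER_AGENT},
        method="HEAD",
    )
```

The request is only constructed here. Sending it, and deciding whether to follow redirects, is left to the caller.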

Comment from b-jazz on 21 February 2019 at 18:07

Three excellent questions/suggestions. Thanks!

  1. You’re quite right. I’ll add that right away.
  2. That’s correct, it doesn’t handle it, and it should. (I’m a big advocate of ridding the world of the scourge of having to say “double you double you double you”.) Now I’m curious and I’ll dig through the logs and see if there were any cases of that occurring.
  3. I’m planning on adding that today, though I’m not going to assume https is available, thinking that maybe some crazy/lazy website owners don’t have their https site matching their http site. I’ll hit up the http version, and if it redirects, then I’ll update the value.
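
The approach in point 3 could look roughly like this: add http:// to a protocol-less value, probe that URL, and only change the tag if the site itself redirects to https. This is a sketch with hypothetical names. The probing step is left out, and `redirect_location` stands for whatever Location the http request returned (or None).

```python
import re

# A value counts as having a protocol if it starts with "<scheme>://".
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def ensure_scheme(value, default="http"):
    """Prefix a protocol-less value such as 'www.example.com' with http://."""
    value = value.strip()
    if not value or _SCHEME.match(value):
        return value
    return f"{default}://{value}"


def updated_value(tag_value, redirect_location):
    """Return the new tag value, or the old one if no https redirect was seen."""
    if redirect_location and redirect_location.startswith("https://"):
        return redirect_location
    return tag_value
```

Probing the http version first avoids assuming that an https site exists or matches the http one.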

Comment from b-jazz on 21 February 2019 at 19:13

I’ve found about 3000 instances of http://www.example.com redirecting to https://example.com in the lower 48. This makes me happy (because I abhor ‘www’). I’ll put in a fix and run the batches again as soon as I also implement the www.example.com to http://www.example.com handling. Great find @rorym.

Comment from escada on 22 February 2019 at 07:20

The tags “heritage:website” and “image” can also contain URLs. Might be worth looking at them too in a future version of your script.

Comment from b-jazz on 23 February 2019 at 02:08

Thanks @escada. I only thought about “website”, “:website”, “url”, and “:url”. I wasn’t aware of “image”. Looks like there are over 100,000 image tags. I’ll look into it and see if they are predominantly URLs.
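
Matching the four key patterns named in this comment (“website”, “:website”, “url”, “:url”) could be sketched as follows. The helper names are hypothetical, and treating “:website” as “any key ending in :website” is an assumption.

```python
URL_TAG_BASES = ("website", "url")


def is_url_tag(key):
    """True for 'website', 'url', and namespaced keys such as 'contact:website'."""
    return any(key == base or key.endswith(":" + base) for base in URL_TAG_BASES)


def http_valued_tags(tags):
    """Pick the URL-like tags of one OSM object whose value starts with http://."""
    return {
        key: value
        for key, value in tags.items()
        if is_url_tag(key) and value.startswith("http://")
    }
```

A key such as “image” would need its own entry, since it matches none of these patterns.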
