
In this post we are excited to share some of our recent research into organized editing on OSM. This relatively new line of research was motivated by two observations. First, the OSM community has seen a dramatic rise in organized editing over the last several years. Second, this new presence has renewed the long-running debate about the role organized editing should play in OSM. We became interested in studying how the editing habits of these new actors differ from those of the community as a whole, but were surprised by the lack of accessible data. We therefore decided to use public tools to develop computational methods for understanding different editing behaviours and for classifying editors as either organized or volunteer.

What follows is an attempt to share some of our initial results. We’ll begin by outlining a new method we created for classifying users based on their profiles. Then we’ll describe some extracted features that may indicate whether a user is organized or a volunteer. Finally, we’ll show initial prediction results.

Extracting a list of users

Our first objective was to create an accurate, up-to-date list of organized editors on OSM. In the past, researchers have compiled such lists by searching individual web pages like the Organized Editing List. We decided instead to develop a dataset based on scraped user profiles. There are two possible approaches to extracting corporate mappers from user profiles. The first clusters the keywords within the profiles. Though effective at uncovering semantic relations between users (such as students, programmers, Garmin editors, or Colorado mappers), this method failed to properly capture all known corporate groups. Instead, we performed a keyword search for corporations listed on the Organized Editing List and grouped similar users together. This method is more scalable: as the number of corporate editors grows, it becomes increasingly difficult to compile a list by going through individual web pages. It works because corporate mappers often have consistent profiles mandated by the company they work for. For example, the profiles of Amazon Logistics mappers tended to read:

“I work for Amazon Logistics. At Amazon Logistics, we’ve been utilizing OSM in some cases related to our delivery programs….”

We then divided this list into corporate and not-corporate. This simplification was done to align with past research into corporate editing. A future step in our analysis will be to compare the Organized Editing List with the scraped lists of editors involved in organized editing (e.g. HOT, corporate editors) to identify any discrepancies.
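The profile-based classification described above can be sketched as a simple keyword search. This is an illustrative sketch, not our actual pipeline; the keyword list here is a hypothetical subset, and the real list is drawn from the full Organized Editing List.

```python
# Hypothetical subset of company names drawn from the Organized Editing
# List; the list used in the actual study is considerably longer.
CORPORATE_KEYWORDS = ["amazon logistics", "apple", "kaart", "grab", "uber"]

def classify_profile(profile_text: str) -> str:
    """Label a scraped user profile as 'corporate' or 'not-corporate'
    via a case-insensitive keyword search over the profile text."""
    text = profile_text.lower()
    if any(keyword in text for keyword in CORPORATE_KEYWORDS):
        return "corporate"
    return "not-corporate"

print(classify_profile("I work for Amazon Logistics. At Amazon Logistics, "
                       "we've been utilizing OSM..."))   # corporate
print(classify_profile("I map my local neighbourhood"))  # not-corporate
```

Because corporate profiles are often templated by the employer, even this crude matching recovers most of the known corporate groups.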

Exploratory Analysis

Using this extracted list, we aimed to discern features that could act as “signals” for organized editing. In particular, we wanted to understand which features from the changesets can indicate whether an editor is part of an organization or a volunteer. Do organized editors edit specific types of objects? Do their time-series signatures differ? Again, in order to build on previous work on corporate editing, we focused on corporate editors, but the method generalizes to all organized editors.

For the creation of these features, we relied significantly on some of Jennings Anderson’s past work on corporate editing. We found four main features that act as signals. The first is the metadata from the edits themselves, which includes the type of editor used, the hashtags, and the source. We find that most organizations use editors like JOSM and iD. Next, we attempted to model which objects corporations edit. We observed that most corporations focus on services and roads, as opposed to buildings, which tend to be dominated by volunteer mappers (shown in Figure 1).
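Extracting the metadata feature amounts to reading a few tags off each changeset. A minimal sketch, using toy changeset records in place of the real changeset dump (the `created_by`, `hashtags`, and `source` tags are standard OSM changeset tags, but the records here are invented):

```python
from collections import Counter

# Toy changeset records; in practice these come from the OSM changeset
# dump, where 'created_by' names the editor software and the
# 'hashtags'/'source' tags are optional.
changesets = [
    {"created_by": "JOSM/1.5 (17428 en)", "hashtags": "#adt", "source": "survey"},
    {"created_by": "iD 2.19.5", "hashtags": "", "source": ""},
    {"created_by": "JOSM/1.5 (17428 en)", "hashtags": "#adt", "source": "imagery"},
]

def editor_family(created_by: str) -> str:
    """Normalize a 'created_by' tag to the editor's family name,
    dropping version strings (e.g. 'JOSM/1.5 ...' -> 'JOSM')."""
    return created_by.split("/")[0].split(" ")[0]

editor_counts = Counter(editor_family(c["created_by"]) for c in changesets)
print(editor_counts)  # Counter({'JOSM': 2, 'iD': 1})
```

Aggregating these counts per user gives the editor-type and hashtag signals used below.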

types of edits

Figure 1: Three bar graphs comparing how organizations and regular users edit. Notice that organized editors tend to make edits featuring “service” and “road” more often than volunteer users.

As the interests of a corporation change, the editing of its mapping team can change too. This has led to the well-documented phenomenon of corporate mappers having a geographically dispersed editing pattern: some days they edit in India, other days in Brazil. This is markedly different from many volunteer mappers, who often begin by mapping their local neighbourhoods. Methods exist to quantify this geographic dispersion using fancy cosines and sines, which we won’t cover in this diary post. What’s important to recognize is that the observation holds empirically. In Figure 2 we show the histogram of the typical geo-score (our term for geographic dispersion) for corporate and non-corporate mappers.
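For the curious, here is one plausible way to build such a dispersion score with those cosines and sines: map each edit location to a unit vector on the sphere and measure how short the mean vector is. This is a sketch of the general idea, not necessarily the exact geo-score used in the study.

```python
import math

def geo_score(locations):
    """Geographic dispersion of a user's edit locations.

    Each (lat, lon) pair in degrees is mapped to a 3D unit vector; the
    score is 1 minus the length of the mean vector, so 0 means all edits
    in one place and values near 1 mean edits spread across the globe.
    One plausible construction, not necessarily the study's exact metric.
    """
    x = y = z = 0.0
    for lat, lon in locations:
        phi, lam = math.radians(lat), math.radians(lon)
        x += math.cos(phi) * math.cos(lam)
        y += math.cos(phi) * math.sin(lam)
        z += math.sin(phi)
    n = len(locations)
    return 1.0 - math.sqrt(x * x + y * y + z * z) / n

local_edits = [(43.65, -79.38), (43.66, -79.40)]   # both around Toronto
dispersed_edits = [(19.07, 72.87), (-23.55, -46.63)]  # Mumbai, São Paulo
print(geo_score(local_edits))      # near 0
print(geo_score(dispersed_edits))  # much larger
```

A mapper who stays in one neighbourhood scores near zero, while a corporate team hopping between continents scores much higher.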

geo-score

Figure 2: Difference in geographic dispersion between corporations and volunteers.

The metric we found most effective was the time-series signature. Corporations have a very traditional 9–5 mapping schedule, whereas non-corporate mappers tend to map far more haphazardly, including significant mapping on the weekend. When attempting to convert this time-series signature into a usable metric, we came across a problem: timezones. All changesets in OSM are normalized to UTC, which means that someone editing at 8am in Toronto, Canada and someone editing at 8pm in Beijing, China would appear to be editing at the same time in OSM. Using longitude and latitude data won’t fix this problem either, since more and more editing is taking place remotely. We were thus tasked with coming up with an effective method of adjusting a user’s time-series signature to normalize for their location.

For reference, let’s look at Figure 3. In this figure we see six plots of the weekly editing time signatures of the employees of six corporations (Apple, Amazon, Facebook, Kaart, Grab, and Uber). Each tick on the x-axis represents one hour, so there are 168 ticks, one for each hour in a week. As we can tell from the visual, there are a couple of underlying patterns in the editing behaviours. Looking at Facebook, we see two such patterns, each displaced by around 8 hours. This motivated us to create a “traditional corporate editing signature” and translate each user’s signature to find the minimal distance between the two. After applying this adjustment, we were able to significantly improve the alignment of the time series; in other words, we were able to recover the local time zone of most of these corporate editors. Figure 4 presents the same plot, now realigned using our method. As we can see, the two distinct editing patterns from before turn out to be one.
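The translation step can be sketched as a brute-force circular alignment over the 168 hourly bins: try every shift of the user’s weekly histogram and keep the one closest to the reference signature. A minimal sketch of the idea, not our exact implementation; the reference pattern here is an invented 9-to-5 weekday template.

```python
import numpy as np

HOURS = 168  # one week at hourly resolution

def align_to_reference(user_hist, reference_hist):
    """Circularly shift a user's weekly editing histogram to minimize
    the Euclidean distance to a reference 'corporate' signature.
    Returns the aligned histogram and the shift in hours, which can be
    read as an estimate of the user's offset from the reference zone."""
    best_shift, best_dist = 0, float("inf")
    for shift in range(HOURS):
        dist = np.linalg.norm(np.roll(user_hist, shift) - reference_hist)
        if dist < best_dist:
            best_shift, best_dist = shift, dist
    return np.roll(user_hist, best_shift), best_shift

# Toy reference: a 9-to-5 weekday pattern, plus the same pattern
# displaced by 8 hours (an office several time zones away).
reference = np.zeros(HOURS)
for day in range(5):                              # Monday..Friday
    reference[day * 24 + 9 : day * 24 + 17] = 1.0
displaced = np.roll(reference, 8)

aligned, shift = align_to_reference(displaced, reference)
print(shift)  # 160, i.e. -8 hours modulo 168
```

The recovered shift, taken modulo 24, is effectively the user’s time-zone offset relative to the reference signature.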

adjustment

Figure 3: This plot features the editing habits of employees from Apple. Each tick on the x-axis represents one hour in a week. The y-axis shows the proportion of edits that were done during that hour. It’s interesting to observe that each company has more than one typical editing sequence, which is caused by differences in timezone.

Once we realigned each user using this method, we calculated the distance between the user’s adjusted time signature and the “corporate signature” we created before. While not perfect, we were very impressed with how well this one feature was able to capture corporate editors. In fact, the top 100 editors (those with the smallest distance to the corporate signature) all belonged to corporations.

Predicting Corporate Editing

Now, with these features in hand, we predict whether an editor is organized or not. We tried several different classifiers, including logistic regression, k-nearest neighbours, support vector machines, and neural networks, all of which provided comparable results, offering a high recall of 96%+ and predicting anywhere between 1,300 and 3,000 additional corporate mappers. The four most important features in the prediction task, ordered by impact on the model, were the geo-score, the time-series score, the first edit date, and the editor type. Examining the newly predicted mappers reveals users who map for humanitarian groups like HOT, corporate mappers that the initial scrape didn’t pick up, corporate mappers who reveal their association only in their hashtags, users who are likely corporate mappers but cannot be confirmed, and volunteers. We remove any “predicted mappers” with known humanitarian associations because these users are beyond the scope of this paper. We are now entering the stage of further validating the different models using a manually annotated set of users that any of the models predicted to be corporate. We aim to find the model that predicts the most corporate mappers and the fewest volunteer mappers. Choosing the conservative neural model, we present a few summary statistics below that reveal how the OSM landscape has been evolving since 2016.
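The shape of the prediction task can be illustrated with one of the simpler classifiers we mention, logistic regression. The sketch below uses synthetic stand-in data (not our dataset) for the four most important features, so the separations and the resulting recall are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for the four top features: geo-score, time-series
# distance to the corporate signature, first-edit year, and a flag for
# using a "corporate-style" editor. The class separations are invented.
n = 1000
is_corporate = rng.integers(0, 2, n)
geo = rng.normal(np.where(is_corporate, 0.7, 0.2), 0.1)
ts_dist = rng.normal(np.where(is_corporate, 0.2, 0.8), 0.1)
first_edit_year = rng.normal(np.where(is_corporate, 2018, 2012), 2.0)
editor_flag = (rng.random(n) < np.where(is_corporate, 0.9, 0.3)).astype(float)

X = np.column_stack([geo, ts_dist, first_edit_year, editor_flag])
y = is_corporate

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("recall:", recall_score(y_te, clf.predict(X_te)))
```

Swapping in k-nearest neighbours, an SVM, or a small neural network requires only changing the `clf` line, which is why comparing several classifiers on the same feature matrix is cheap.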

Some of these initial findings will be presented at this year’s State of the Map.

This research is a collaboration between Veniamin Veselovsky, Dipto Sarkar, Jennings Anderson, and Robert Soden.

Location: Cornwall, Town of Cornwall, Queens County, Prince Edward Island, Canada

Discussion

Comment from Glassman on 24 June 2021 at 15:54

Figure 2 looks like it is mislabeled.

Comment from Veniamin Veselovsky on 24 June 2021 at 19:01

@Glassman, thanks! Fixed.

Comment from gileri on 26 June 2021 at 07:59

Nice study, thank you for sharing your results even before the SotM !
