Wikipedia — the free encyclopedia that anyone can edit — faces many challenges in maintaining the quality of its articles and sustaining the volunteer community of editors. The people behind the hundreds of different language versions of Wikipedia have long relied on automation, bots, expert systems, recommender systems, human-in-the-loop assisted tools, and machine learning to help moderate and manage content at massive scales. The issues around artificial intelligence in
∗The majority of this work was authored when Halfaker was affiliated with the Wikimedia Foundation. †The majority of this work was authored when Geiger was affiliated with the Berkeley Institute for Data Science at the University of California, Berkeley.
Authors’ addresses: Aaron Halfaker, Microsoft, 1 Microsoft Way, Redmond, WA, 98052, USA, aaron.halfaker@gmail.com; R. Stuart Geiger, Department of Communication, Halıcıoğlu Data Science Institute, University of California, San Diego, 9500 Gilman Drive, San Diego, CA, 92093, USA, stuart@stuartgeiger.com.
148:2 Halfaker & Geiger
Wikipedia are as complex as those facing other large-scale user-generated content platforms like Facebook, Twitter, or YouTube, as well as traditional corporate and governmental organizations that must make and manage decisions at scale. And like in those organizations, Wikipedia’s automated classifiers are raising new and old issues about truth, power, responsibility, openness, and representation.
Yet Wikipedia’s approach to AI has long been different than in corporate or governmental contexts typically discussed in emerging fields like Fairness, Accountability, and Transparency in Machine Learning (FAccTML) or Critical Algorithms Studies (CAS). The volunteer community of editors has strong ideological principles of openness, decentralization, and consensus-based decision-making. The paid staff at the non-profit Wikimedia Foundation — which legally owns and operates the servers — are not tasked with making editorial decisions about content1. Content review and moderation, in either its manual or automated form, is instead the responsibility of the volunteer editing community. A self-selected set of volunteer developers build tools, bots, and advanced technologies, generally with some consultation with the community. Even though Wikipedia’s prior socio-technical systems of algorithmic governance have generally been more open, transparent, and accountable than most platforms operating at Wikipedia’s scale, ORES2, the system we present in this paper, pushes even further on the crucial issue of who is able to participate in the development and use of advanced technologies.
ORES represents several innovations in openness in machine learning, particularly in seeing openness as a socio-technical challenge that is as much about scaffolding support as it is about open-sourcing code and data [95]. With ORES, volunteers can curate labeled training data from a variety of sources for a particular purpose, commission the production of a machine classifier based on particular approaches and parameters, and make this classifier available via an API which anyone can query to score any edit to a page — operating in real time on the Wikimedia Foundation’s servers. Currently, 110 classifiers have been produced for 44 languages, classifying edits in real-time based on criteria like “damaging / not damaging,” “good-faith / bad-faith,” language-specific article quality scales, and topic classifications like “Biography" and “Medicine". ORES intentionally does not seek to produce a single classifier to enforce a gold standard of quality, nor does it prescribe particular ways in which scores and classifications will be incorporated into fully automated bots, semi-automated editing interfaces, or analyses of Wikipedian activity. As we describe in section 3, ORES was built as a kind of cultural probe [56] to support an open-ended set of community efforts to re-imagine what machine learning in Wikipedia is and who it is for.
Open participation in machine learning is widely relevant to both researchers of user-generated content platforms and those working across open collaboration, social computing, machine learning, and critical algorithms studies. ORES implements several of the dominant recommendations for algorithmic system builders around transparency and community consent [19, 22, 89]. We discuss practical socio-technical considerations for what openness, accountability, and transparency mean in a large-scale, real-world, user-generated content platform. Wikipedia is also an excellent space for work on participatory governance of algorithms, as the broader Wikimedia community and the non-profit Wikimedia Foundation are founded on ideals of open, public participation. All of the work presented in this paper is publicly-accessible and open sourced, from the source code and training data to the community discussions about ORES. Unlike in other nominally ‘public’ platforms where users often do not know their data is used for research purposes, Wikipedians have extensive discussions about using their archived activity for research, with established guidelines
ORES 148:3
we followed3. This project is part of a longstanding engagement with the volunteer communities which involves extensive community consultation, and the case studies research has been approved by a university IRB.
In discussing ORES, we could present a traditional HCI “systems paper” and focus on the architectural details of ORES, a software system we constructed to help expand participation around ML in Wikipedia. However, this technology only advances these goals when it is used in particular ways by our4 engineering team at the Wikimedia Foundation and the volunteers who work with us on that team to design and deploy new models. As such, we have written a new genre of socio-technical systems paper. We have not just developed a technical system with intended affordances; ORES as a technical system is the ‘kernel’ [86] of a socio-technical system we operate and maintain in certain ways. In section 4, we focus on the technical capacities of the ORES service, which our needs analysis made apparent. Then in section 5, we expand to focus on the larger socio-technical system built around ORES.
We detail how this system enables new kinds of activities, relationships, and governance models between people with capacities to develop new machine learning classifiers and people who want to train, use, or direct the design of those classifiers. We discuss many demonstrative cases — stories from our work with members of Wikipedia’s various language communities in collaboratively designing training data, engineering features, building models, and expanding the technical affordances of ORES to meet new needs. These demonstrative cases provide a brief exploration of the complex, community-governed response to the deployment of novel AIs in their spaces. The cases illustrate how crowd-sourced and crowd-managed auditing can be an important community governance activity. Finally, we conclude with a discussion of the issues raised by this work beyond the case of Wikipedia and identify future directions.
2.1 The politics of algorithms
Algorithmic systems [94] play increasingly crucial roles in the governance of social processes [7, 26, 40]. Software algorithms are increasingly used in answering questions that have no single right answer and where using prior human decisions as training data can be problematic [6]. Algorithms designed to support work change people’s work practices, shifting how, where, and by whom work is accomplished [19, 113]. Software algorithms gain political relevance on par with other process-mediating artifacts (e.g. laws & norms [65]).
There are repeated calls to address power dynamics and bias through transparency and accountability of the algorithms that govern public life and access to resources [12, 23, 89]. The field around effective transparency, explainability, and accountability mechanisms is growing. We cannot fully address the scale of concerns in this rapidly shifting literature, but we find inspiration in Kroll et al’s discussion of the limitations of auditing and transparency [62], Mulligan et al’s shift towards the term “contestability” [78], Geiger’s call to go “beyond opening up the black box” [34], and Selbst et al.’s call to explore the socio-technical/societal context and include social actors of algorithmic systems when considering issues of fairness [95].
In this paper, we discuss a specific socio-political context — Wikipedia’s algorithmic quality control and socialization practices — and the development of novel algorithmic systems for support of these processes. We implement a meta-algorithmic intervention aligned with Wikipedians’
principles and practices: deploying a service for building and deploying prediction algorithms, where many decisions are delegated to the volunteer community. Instead of training the single best classifier and implementing it in our own designs, we embrace having multiple potentially contradictory classifiers, with our intended and desired outcome involving a public process of training, auditing, re-interpreting, appropriating, contesting, and negotiating both these models themselves and how they are deployed in content moderation interfaces and automated agents. Often, work on technical and social ways to achieve fairness and accountability does not discuss this broader kind of socio-infrastructural intervention on communities of practice, instead staying at the level of models themselves.
However, CSCW and HCI scholarship has often addressed issues of algorithmic governance in a broader socio-technical lens, which is in line with the field’s longstanding orientation to participatory design, value-sensitive design, values in design, and collaborative work [29, 30, 55, 91, 92]. Decision support systems and expert systems are classic cases which raise many of the same lessons as contemporary machine learning [8, 67, 98]. Recent work has focused on more participatory [63], value-sensitive [112], and experiential design [3] approaches to machine learning. CSCW and HCI have also long studied human-in-the-loop systems [14, 37, 111], as well as explored implications of “algorithm-in-the-loop” socio-technical systems [44]. Furthermore, the field has explored various adjacent issues as they play out in citizen science and crowdsourcing, which is often used as a way to collect, process, and verify training data sets at scale, to be used for a variety of purposes [52, 60, 107].
2.2 Machine learning in support of open production
Open peer production systems, like most user-generated content platforms, use machine learning for content moderation and task management. For Wikipedia and related Wikimedia projects, quality control for removing “vandalism”5 and other inappropriate edits to articles is a major goal for practitioners and researchers. Article quality prediction models have also been explored and applied to help Wikipedians focus their work in the most beneficial places and explore coverage gaps in article content.
Quality control and vandalism detection. The damage detection problem in Wikipedia is one of great scale. English Wikipedia is edited about 142k times each day, and these edits immediately go live without review. Wikipedians embrace this risk, but work tirelessly to maintain quality. Damaging, offensive, and/or fictitious edits can cause harms to readers, the articles’ subjects, and the credibility of all of Wikipedia, so all edits must be reviewed as soon as possible [37]. As an information overload problem, filtering strategies using machine learning models have been developed to support the work of Wikipedia’s patrollers (see [1] for an overview). Some researchers have integrated their prediction models into purpose-designed tools for Wikipedians to use (e.g. STiki [106], a classifier-supported moderation tool). Through these machine learning models and constant patrolling, most damaging edits are reverted within seconds of when they are saved [35].
Task routing and recommendation. Machine learning plays a major role in how Wikipedians decide what articles to work on. Wikipedia has many well-known content coverage biases (e.g. for a long period of time, coverage of women scientists lagged far behind [47]). Past work has explored collaborative recommender-based task routing strategies (see SuggestBot [18]), in which contributors are sent articles that need improvement in their areas of expertise. Such systems show strong promise to address content coverage biases, but could also inadvertently reinforce biases.
2.3 Community values applied to software and process
Wikipedia has a large community of “volunteer tool developers” who build bots, third party tools, browser extensions, and Javascript-based gadgets, which add features and create new workflows. This “bespoke” code can be developed and deployed without seeking approval or major actions from those who own and operate Wikipedia’s servers [33]. The tools play an oversized role in structuring and governing the actions of Wikipedia editors [32, 101]. Wikipedia has long managed and regulated such software, having built formalized structures to mitigate the potential harms that may arise from automation. English Wikipedia and many other wikis have formal processes for the approval of fully-automated bots [32], which are effective in ensuring that robots do not often get into conflict with each other [36].
Some divergences between the bespoke software tools that Wikipedians use to maintain Wikipedia and their values have been less apparent. A line of critical research has studied the unintended consequences of this complex socio-technical system, particularly on newcomer socialization [48, 49, 76]. In summary, Wikipedians struggled with the issues of scaling when the popularity of Wikipedia grew exponentially between 2005 and 2007 [48]. In response, they developed quality control processes and technologies that prioritized efficiency by using machine prediction models [49] and templated warning messages [48]. This transformed newcomer socialization from a human and welcoming activity to one far more dismissive and impersonal [76], which has caused a steady decline in Wikipedia’s editing population [48]. The efficiency of quality control work and the elimination of damage/vandalism was considered extremely politically important, while the positive experience of a diverse cohort of newcomers was less so.
After the research about these systemic issues came out, the political importance of newcomer experiences in general and underrepresented groups specifically was raised substantially. But despite targeted efforts and shifts in perception among some members of the Wikipedia community [76, 80]6, the often-hostile quality control processes that were designed over a decade ago remain largely unchanged [49]. Yet recent work by Smith et al. has confirmed that “positive engagement” is one of the major convergent values expressed by Wikipedia editors with regard to algorithmic tools. Smith et al. discuss conflicts between supporting efficient patrolling practices and maintaining positive newcomer experiences [97]. Smith et al. and Selbst et al. highlight how it is crucial to examine not only how the algorithm behind a machine learning model functions, but also how predictions are surfaced to end-users and how work processes are formed around the predictions [95, 97].
In this section, we discuss mechanisms behind Wikipedia’s socio-technical problems and how we as socio-technical system builders designed ORES to have impact within Wikipedia. Past work has demonstrated how Wikipedia’s problems are systemic and caused in part by inherent biases in the quality control system. To responsibly use machine learning in addressing these problems, we examined how Wikipedia functions as a distributed system, focusing on how processes, policies, power, and software come together to make Wikipedia happen.
3.1 Wikipedia uses decentralized software to structure work practices
In any online community or platform, software code has major social impact and significance, enabling certain features and supporting (semi-)automated enforcement of rules and procedures; or, “code is law" [65]. However, in most online communities and platforms, such software code is
built directly into the server-side code base. Whoever has root access to the server has jurisdiction over this sociality of software.
For Wikipedia’s almost twenty-year history, the volunteer editing community has taken a more decentralized approach to software governance. While there are many features integrated into server-side code — some of which have been fiercely contested [82]7 — the community has a strong tradition of users developing third-party tools, gadgets, browser extensions, scripts, bots, and other external software. The MediaWiki software platform has a fully-featured Application Programming Interface (API) providing multiple output formats, including JSON8. This has let developers write software that adds or modifies features for Wikipedians who download or enable them.
As Geiger describes, this model of “bespoke code" is deeply linked to Wikipedians’ conceptions of governance, fostering more decentralized decision-making [33]. Wikipedians develop and install third-party software without generally needing approvals from those who own and operate Wikipedia’s servers. There are some centralized governance processes, especially for fully-automated bots, which give developer-operators massive editorial capacity. Yet even bot approval decisions are made through local governance mechanisms of each language version, rather than top-down decisions by Wikimedia Foundation staff.
In many other online communities and platforms, users who want to implement an addition or change to user interfaces or ML models would need to convince the owners of the servers. In contrast, the major blocker in Wikipedia is typically finding someone with the software engineering and design expertise to develop such software, as well as the resources and free time to do this work. This is where issues of equity and participation become particularly relevant: Ford & Wajcman discuss how Wikipedia’s infrastructural choices have added additional barriers to participation, where programming expertise is important in influencing encyclopedic decision-making [27]. Expertise, resources, and free time to overcome such barriers are not equitably distributed, especially around technology [93]. This is a common pattern of inequities in “do-ocracies," including the content of Wikipedia articles.
The people with the skills, resources, time, and inclination to develop such software tools and ML models have held a large amount of power in deciding what types of work will and will not be supported [32, 66, 77, 83, 101]. Almost all of the early third-party tools (including the first tools to use ML) were developed by and/or for so-called “vandal fighters," who prioritized the quick and efficient removal of potentially damaging content. Tools that supported tasks like mentoring new users, edit-a-thons, editing as a classroom exercise, or identifying content gaps lagged significantly behind, even though these kinds of activities were long recognized as key strategies to help build the community, including for underrepresented groups. Most tools were also written for the English-language Wikipedia, with tools for other language versions lagging behind.
3.2 Using machine learning to scale Wikipedia
Despite the massive size of Wikipedia’s community (66k avg. monthly active editors in English Wikipedia in 2019⁹) — or precisely because of this size — the labor of Wikipedia’s curation and support processes is massive. More edits from more contributors require more content moderation work, which has historically resulted in labor shortages. For example, if patrollers in Wikipedia review 10 revisions per minute for vandalism (an aggressive estimate, for those seeking to catch only blatant issues) it would require 483 labor hours per day just to review the 290k edits saved to all the various language editions of Wikipedia.¹⁰ In some cases, the labor shortage has become so
extreme that Wikipedians have chosen to shut down routes of contribution in an effort to minimize the workload.11 Content moderation work can also be exhausting and even traumatic [87].
When labor shortages threatened the function of the open encyclopedia, ML was a breakthrough technology. A reasonably fit vandalism detection model can filter the set of edits that need to be reviewed down by 90%, with high levels of confidence that the remaining 10% of edits contains almost all of the most blatant kinds of vandalism – the kind that will cause credibility problems if readers see these edits on articles. The use of such a model turns a 483 labor hour problem into a 48.3 labor hour problem. This is the difference between needing 240 coordinated volunteers to work 2 hours per day to patrol for vandalism and needing 24 coordinated volunteers to work for 2 hours per day. For smaller wikis, it means that vandalism can be tackled by just 1 or 2 part-time volunteers, who can spend time on other tasks. Beyond vandalism, there are many other cases where sorting, routing, and filtering make problems of scale manageable in Wikipedia.
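The labor estimate above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration of the calculation in the text; the function name and its parameters are our own, and the 90% filter rate is the figure quoted for a reasonably fit vandalism detection model.

```python
def review_hours(edits_per_day: int, reviews_per_minute: float,
                 keep_fraction: float = 1.0) -> float:
    """Labor hours per day needed to review the edits left after filtering."""
    edits_to_review = edits_per_day * keep_fraction
    return edits_to_review / reviews_per_minute / 60


# Figures taken from the text: 290k edits per day, 10 reviews per minute.
unfiltered = review_hours(290_000, 10)        # every edit is reviewed
filtered = review_hours(290_000, 10, 0.10)    # a model filters out 90% of edits

# Volunteers needed if each one patrols for 2 hours per day.
print(f"unfiltered: {unfiltered:.1f} hours, {unfiltered / 2:.0f} volunteers")
print(f"filtered:   {filtered:.1f} hours, {filtered / 2:.0f} volunteers")
```

The exact quotients are slightly above the rounded figures of 240 and 24 volunteers used in the text, which does not change the order-of-magnitude argument.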
3.3 Machine learning resource distribution
If these crucial and powerful third-party workflow support tools only get built if someone with expertise, resources, and time freely decides to build them, then the barriers to and resulting inequities in developing and implementing ML at Wikipedia’s scale are even more stark. Those with advanced skills in ML and data engineering can have day jobs that prevent them from investing the time necessary to maintain these systems [93]. Diversity and inclusion are also major issues in computer science education and the tech industry, including gender, race/ethnicity, and national origin [68, 69]. Given Wikipedia’s open participation model but continual issues with diversity and inclusion, it is also important to note that free time is not equitably distributed in society [10].
To the best of our knowledge, in the 15 years of Wikipedia’s history prior to ORES, there were only three ML classifier projects that successfully built and served real-time predictions at the scale of at least one entire language version of a Wikimedia project. All of these also began on the English language Wikipedia, with only one ML-assisted content moderation tool 12 supporting other language versions. Two of these three were developed and hosted at university research labs. These projects also focused on supporting a particular user experience in an interface the model builders also designed, which did not easily support re-use of their models in other tools and interfaces. One notable exception was the public API that STiki’s13 developer made available for a period of time. While limited in functionality,14 in past work, we appropriated this API in a new interface designed to support mentoring [49]. We found this experience deeply inspirational about how publicly hosting models with APIs could support new and different uses of ML.
Beyond expertise and time, there are financial costs in hosting real-time ML classifier services at Wikipedia’s scale. These require continuously operating servers, unlike third-party tools, extensions, and gadgets that run on a user’s computer. The classifier service must keep in sync with the new edits, prepared to classify any edit or version of an article. This speed is crucial for tasks that involve reviewing new edits, articles, or editors, as ML-assisted patrollers respond within 5 seconds of an edit being made [35]. Keeping this service running at scale, at high reliability, and in sync with Wikimedia’s servers is a computationally intensive task requiring costly server resources.15
Given this background context, we intended to foster collaborative model development practices through a model hosting system that would support shared use of ML as a common infrastructure. We initially built a pattern of consulting with Wikipedians about how they made sense of their important concepts — like damage, vandalism, good-faith vs. bad-faith, spam, quality, topic categorization, and whatever else they brought to us and wanted to model. We saw an opportunity to build technical infrastructure that would serve as the base of a more open and participatory socio-technical system around ML in Wikipedia. Our goals were twofold: First, we wanted to have a much wider range of people from the Wikipedian community substantially involved in the development and refinement of new ML models. Second, we wanted those trained ML models to be broadly available for appropriation and re-use for volunteer tool developers, who have the capacity to develop third-party tools and browser extensions using Wikipedia’s JSON-based API, but less capacity to build and serve ML models at Wikipedia’s scale.
We built ORES as a technical system to be multi-purpose in its support of ML in Wikimedia projects. Our goal is to support as broad of use-cases as possible, including ones that we could not imagine. As Figure 1 illustrates, ORES is a machine learning as a service platform that connects sources of labeled training data, human model builders, and live data to host trained models and serve predictions to users on demand. As an endpoint, we implemented ORES as an API-based web service, where once a project-specific classifier had been trained and deployed, predictions from a classifier for any specific item on that wiki (e.g. an edit or a version of a page) could be requested via an HTTP request, with responses returned in JSON format. For Wikipedia’s existing communities of bot and tool developers, this API-based endpoint is the default way of engaging with the platform.
Fig. 1. ORES conceptual overview. Model builders design process for training ScoringModels from training data. ORES hosts ScoringModels and makes them available to researchers and tool developers.
From a user’s point of view (who is often a developer or a researcher), ORES is a collection of models that can be applied to Wikipedia content on demand. A user submits a query to ORES like “is the edit identified by 123456 in English Wikipedia damaging?” (https://ores.wikimedia.org/v3/scores/enwiki/123456/damaging). ORES responds with a JSON score document that includes a prediction (false) and a confidence level (0.90), which suggests the edit is probably not damaging. Similarly, the user could ask for the quality level of the version of the article created by the same edit (https://ores.wikimedia.org/v3/scores/enwiki/123456/articlequality). ORES responds with a score document that includes a prediction (Stub – the lowest quality class) and a confidence
level (0.91), which suggests that the page needs a lot of work to attain high quality. It is at this API interface that a user can decide what to do with the prediction: put it in a user interface, build a bot around it, use it as part of an analysis, or audit the performance of the models.
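As a rough illustration of this request pattern, the sketch below builds the score URL given in the text and pulls the prediction and its confidence out of a response. The helper names are hypothetical, the sample response is hand-written rather than captured from the service, and the nesting of the response around the "score" object (wiki, then "scores", then revision ID, then model) is an assumption that should be checked against the live API.

```python
import json

ORES_BASE = "https://ores.wikimedia.org/v3/scores"


def score_url(wiki: str, rev_id: int, model: str) -> str:
    """Build the URL pattern shown in the text: /v3/scores/<wiki>/<rev_id>/<model>."""
    return f"{ORES_BASE}/{wiki}/{rev_id}/{model}"


def extract_score(response_text: str, wiki: str, rev_id: int, model: str):
    """Return (prediction, confidence) from a JSON response body."""
    doc = json.loads(response_text)
    # ASSUMPTION: the "score" object is nested under wiki / "scores" / rev_id / model.
    score = doc[wiki]["scores"][str(rev_id)][model]["score"]
    prediction = score["prediction"]
    # Probability keys are JSON strings, so a boolean prediction maps to "true"/"false".
    key = json.dumps(prediction) if isinstance(prediction, bool) else prediction
    return prediction, score["probability"][key]


# Hand-written sample mirroring the values quoted in the text (not real output).
SAMPLE = json.dumps({"enwiki": {"scores": {"123456": {"damaging": {"score": {
    "prediction": False, "probability": {"false": 0.90, "true": 0.10}}}}}}})

print(score_url("enwiki", 123456, "damaging"))
print(extract_score(SAMPLE, "enwiki", 123456, "damaging"))
```

A tool developer would fetch the URL with any HTTP client and pass the body to a parser like this one; what happens with the prediction afterwards is left to the tool.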
In the rest of this section, we discuss the strictly technical components of the ORES service. See also Appendix A for details about how we manage open source, self-documenting model pipelines, how we maintain a version history of the models deployed in production, and measurements of ORES usage patterns. In section 5, we discuss socio-technical affordances we designed in collaboration with ORES’ users.
4.1 Score documents
The predictions made by ORES are human- and machine-readable.16 In general, our classifiers will report a specific prediction along with a probability (likelihood) for each class. By providing detailed information about a prediction, we allow users to re-purpose the prediction for their own use. Consider the article quality prediction output in Figure 2.
"score": { "prediction": "Start", "probability": { "FA": 0.00, "GA": 0.01, "B": 0.06, "C": 0.02, "Start": 0.75, "Stub": 0.16 } }
Fig. 2. A score document – the result of https://ores.wikimedia.org/v3/scores/enwiki/34234210/articlequality
A developer making use of a prediction like this in a user-facing tool may choose to present the raw prediction “Start” (one of the lower quality classes) to users or to implement some visualization of the probability distribution across predicted classes (75% Start, 16% Stub, etc.). They might even choose to build an aggregate metric that weights the quality classes by their prediction weight (e.g. Ross’s student support interface [88] discussed in section 5.2.3 or the weighted sum metric from [47]).
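One way to build such an aggregate metric is a weighted sum over the class probabilities. The sketch below applies it to the score document from Figure 2. The integer weights (Stub = 0 up to FA = 5) are an assumption made for illustration and are not necessarily the weights used in [47] or [88].

```python
# ASSUMPTION: evenly spaced weights from the lowest class (Stub) to the highest (FA).
QUALITY_WEIGHTS = {"Stub": 0, "Start": 1, "C": 2, "B": 3, "GA": 4, "FA": 5}

# The score document shown in Figure 2.
score = {
    "prediction": "Start",
    "probability": {"FA": 0.00, "GA": 0.01, "B": 0.06,
                    "C": 0.02, "Start": 0.75, "Stub": 0.16},
}


def weighted_quality(probability: dict) -> float:
    """Collapse a class probability distribution into one number on the weight scale."""
    return sum(QUALITY_WEIGHTS[label] * p for label, p in probability.items())


print(round(weighted_quality(score["probability"]), 2))  # 1.01 with these weights
```

Under these weights the article lands almost exactly on "Start", but unlike the raw prediction the number moves continuously as an article improves, which is useful for tracking progress over time.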
4.2 Model information
In order to use a model effectively in practice, a user needs to know what to expect from model performance. For example, a precision metric helps users think about how often an edit that is predicted to be “damaging” actually is damaging. Alternatively, a recall metric helps users think about what proportion of damaging edits they should expect will be caught by the model. The target metric of an operational concern depends strongly on the intended use of the model. Given that our goal with ORES is to allow people to experiment with the use of prediction models and to appropriate them in novel ways, we sought to build a general model information strategy.
The output captured in Figure 3 shows a heavily trimmed (with ellipses) JSON output of model_info for the “damaging” model in English Wikipedia. What remains gives a taste of what information is available. There is structured data about what kind of algorithm is being used to construct the estimator, how it is parameterized, the computing environment used for training, the size of the train/test set, the basic set of fitness metrics, and a version number for secondary caches. A developer or researcher using an ORES model in their tools or analyses can use these fitness
"damaging": { "type": "GradientBoosting", "version": "0.4.0", "environment": {"machine": "x86_64", ...}, "params": {"labels": [true, false], "learning_rate": 0.01, "min_samples_leaf": 1...}, "statistics": { "counts": { "labels": {"false": 18702, "true": 743}, "n": 19445, "predictions": { "false": {"false": 17989, "true": 713}, "true": {"false": 331, "true": 412}}}, "precision": {"macro": 0.662, "micro": 0.962, "labels": {"false": 0.984, "true": 0.34}}, "recall": {"macro": 0.758, "micro": 0.948, "labels": {"false": 0.962, "true": 0.555}}, "pr_auc": {"macro": 0.721, "micro": 0.978, "labels": {"false": 0.997, "true": 0.445}}, "roc_auc": {"macro": 0.923, "micro": 0.923, "labels": {"false": 0.923, "true": 0.923}}, ... }}
Fig. 3. Model information for an English Wikipedia damage detection model – the result of https://ores.
metrics to make decisions about whether or not a model is appropriate and to report to users what fitness they might expect at a given confidence threshold.
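To make the relationship between the raw counts and the reported metrics concrete, the sketch below recomputes per-label and macro recall from the "predictions" counts in Figure 3. It reads the outer key as the actual label and the inner key as the predicted label, which is consistent with the row sums matching the "labels" counts (18702 and 743). Only recall is recomputed here; the reported precision and micro-averaged values are not reproduced by naive ratios of these counts, so we do not attempt them.

```python
# "predictions" counts from Figure 3: outer key = actual label, inner key = predicted label.
predictions = {
    "false": {"false": 17989, "true": 713},
    "true": {"false": 331, "true": 412},
}


def recall_by_label(counts: dict) -> dict:
    """Recall per label: correctly predicted items divided by all items with that label."""
    return {
        actual: predicted[actual] / sum(predicted.values())
        for actual, predicted in counts.items()
    }


recall = recall_by_label(predictions)
macro_recall = sum(recall.values()) / len(recall)

for label, value in recall.items():
    print(f"recall[{label}] = {value:.3f}")
print(f"macro recall = {macro_recall:.3f}")
```

The recomputed values agree with the "recall" entry in the figure (0.962, 0.555, and a macro average of 0.758), which is a quick sanity check a tool developer can run before relying on the published statistics.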
4.3 Scaling and robustness
To be useful for Wikipedians and tool developers, ORES uses distributed computation strategies to provide a robust, fast, high-availability service. Reliability is a critical concern, as many tasks are time sensitive. Interruptions in Wikipedia’s algorithmic systems for patrolling have historically led to increased burdens for human workers and a higher likelihood that readers will see vandalism [35]. ORES also needs to scale to be able to be used in multiple different tools across different language Wikipedias.
This horizontal scalability17 is achieved in two ways: input-output (IO) workers (uwsgi18) and the computation (CPU) workers (celery19). Requests are split across available IO workers, and all necessary data is gathered using external APIs (e.g. the MediaWiki API20). The data is then split into a job queue managed by celery for the CPU-intensive work. This efficiently uses available resources and can dynamically scale, adding and removing new IO and CPU workers in multiple datacenters as needed. This is also fault-tolerant, as servers can fail without taking down the service as a whole.
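The split between IO workers and CPU workers can be pictured as a two-stage pipeline with a separate pool for each stage. The sketch below is a single-process stand-in that uses Python thread pools in place of uwsgi and celery; the function names, the pool sizes, and the toy "model" are all hypothetical and only illustrate the shape of the data flow, not the production implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def gather_data(rev_id: int) -> dict:
    """IO-bound stage: stand-in for fetching revision data from an external API."""
    return {"rev_id": rev_id, "text": f"text of revision {rev_id}"}


def compute_score(data: dict) -> dict:
    """CPU-bound stage: stand-in for feature extraction and model application."""
    features = {"chars": len(data["text"])}
    return {"rev_id": data["rev_id"], "damaging": features["chars"] > 1000}


def score_revisions(rev_ids, io_workers: int = 4, cpu_workers: int = 2) -> list:
    """Fan requests out to the IO pool, then hand the gathered data to the CPU pool."""
    with ThreadPoolExecutor(io_workers) as io_pool, \
         ThreadPoolExecutor(cpu_workers) as cpu_pool:
        gathered = io_pool.map(gather_data, rev_ids)          # stage 1: gather
        return list(cpu_pool.map(compute_score, gathered))    # stage 2: score


print(score_revisions([101, 102, 103]))
```

Because each stage has its own pool, the number of workers per stage can be tuned independently, which mirrors the idea of adding or removing IO and CPU workers as load changes.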
4.3.1 Real-time processing. The most common use case of ORES is real-time processing of edits to Wikipedia immediately after they are made. Those using patrolling tools like Huggle to monitor edits in real-time need scores available within seconds of when the edit is saved. We implement several strategies to optimize different aspects of this request pattern.
Single score speed. In the worst case scenario, ORES generates a score from scratch, as for scores requested by real-time patrolling tools. We work to ensure the median score duration is around 1 second, so counter-vandalism efforts are not substantially delayed (cf. [35]). Our metrics show that for the week of April 6-13, 2018, our median, 75th percentile, and 95th percentile response timings were 1.1, 1.2, and 1.9 seconds respectively. This includes the time to process the request, gather data, process data into features, apply the model, and return a score document. We achieve this through computational optimizations to ensure data processing is fast, and by choosing not to adopt new features and modeling strategies that might make score responses too slow (e.g., the RNN-LSTM strategy used in [21] is more accurate than our article quality models, but takes far longer to compute).
Caching and precaching. In order to take advantage of our users’ overlapping interests in scoring recent activity, we maintain a least-recently-used (LRU) cache21 using a deterministic score naming scheme (e.g. enwiki:123456:damaging would represent a score needed for the English Wikipedia damaging model for the edit identified by 123456). This allows requests for scores that have recently been generated to be returned within about 50ms – a 20X speedup. To ensure scores for all recent edits are available in the cache for real-time use cases, we implement a “precaching” strategy that listens to a high-speed stream of recent activity in Wikipedia and automatically requests scores for the subset of actions that are relevant to a model. With our LRU and precaching strategy, we consistently attain a cache hit rate of about 80%.
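The caching idea can be illustrated with a minimal sketch. This is not the ORES implementation: the class `LRUScoreCache`, the helper `score_key`, and the in-memory `OrderedDict` storage are assumptions made for illustration. Only the key format (`enwiki:123456:damaging`) comes from the text above.

```python
from collections import OrderedDict


def score_key(wiki, rev_id, model):
    """Build a deterministic key such as 'enwiki:123456:damaging'."""
    return f"{wiki}:{rev_id}:{model}"


class LRUScoreCache:
    """Hypothetical in-memory stand-in for a shared LRU score cache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # least recently used entry comes first

    def get(self, key):
        if key not in self._items:
            return None  # cache miss: the caller has to generate the score
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, score):
        self._items[key] = score
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used


# A precaching process would call put() for each relevant new edit,
# so that later get() calls from patrolling tools are cache hits.
cache = LRUScoreCache(capacity=2)
cache.put(score_key("enwiki", 123456, "damaging"), {"true": 0.03, "false": 0.97})
```

Because the key is derived only from the wiki, the revision, and the model, any two clients asking for the same score arrive at the same cache entry without coordinating with each other.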
De-duplication. In real-time ORES use cases, it is common to receive many requests to score the same edit/article right after it was saved. We use the same deterministic score naming scheme from the cache to identify scoring tasks, and ensure that simultaneous requests for that same score are de-duplicated. This allows our service to trivially scale to support many different robots and tools requesting scores simultaneously on the same wiki.
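One way to picture this de-duplication is a table of in-flight scoring tasks keyed by the deterministic score name. The sketch below is an assumption-laden illustration using threads and `concurrent.futures.Future`; the class `DedupScorer` and its `compute` callback are hypothetical and do not describe how ORES or celery implement it.

```python
import threading
from concurrent.futures import Future


class DedupScorer:
    """Share one computation among simultaneous requests for the same key."""

    def __init__(self, compute):
        self._compute = compute      # hypothetical function: key -> score
        self._lock = threading.Lock()
        self._in_flight = {}         # key -> Future for a running task

    def score(self, key):
        with self._lock:
            future = self._in_flight.get(key)
            owner = future is None
            if owner:
                # First request for this key: register a task for others to join.
                future = Future()
                self._in_flight[key] = future
        if owner:
            try:
                future.set_result(self._compute(key))
            except Exception as exc:
                future.set_exception(exc)
            finally:
                with self._lock:
                    del self._in_flight[key]
        # Owners and joiners alike read the shared result (or re-raise its error).
        return future.result()
```

In this sketch a finished task is forgotten immediately, so repeated requests over time would still recompute; in practice the cache described earlier would serve those.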
4.3.2 Batch processing. In our logs, we have observed bots submitting large batch processing jobs to ORES once per day. Many different types of Wikipedia’s bots rely on periodic, batch processing strategies to support Wikipedian work [32]. Many bots build daily or weekly worklists for Wikipedia editors (e.g. [18]). Many of these tools have adopted ORES to include an article quality prediction for use in prioritization22. Typically, work lists are either built from all articles in a Wikipedia language version (>5m in English) or from some subset of articles specific to a single WikiProject (e.g. WikiProject Women Scientists claims about 6k articles23). Also, many researchers are using ORES for analyses, which presents as a similar large burst of requests.
In order to most efficiently support this type of querying activity, we implemented batch optimizations by splitting IO and CPU operations into distinct stages. During the IO stage, all data is gathered for all relevant scoring jobs in batch queries. During the CPU stage, scoring jobs are split across our distributed processing system discussed above. This batch processing enables up to a 5X increase in speed of response for large requests[90]. At this rate, a user can request tens of millions of scores in less than 24 hours in the worst case scenario (no scores were cached), without substantially affecting the service for others.
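The two-stage pattern can be sketched as follows. The function `score_batch` and its callbacks `fetch_many` and `extract_and_score` are hypothetical names, and a thread pool stands in for the distributed CPU workers purely to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def score_batch(rev_ids, fetch_many, extract_and_score, max_workers=4):
    """Score many revisions with one batched IO stage and a parallel CPU stage."""
    # IO stage: gather the data for every scoring job in a single batched call,
    # instead of one round trip per revision.
    data_by_rev = fetch_many(rev_ids)

    # CPU stage: spread feature extraction and model application across workers.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = pool.map(lambda rev_id: extract_and_score(data_by_rev[rev_id]), rev_ids)
        return dict(zip(rev_ids, scores))
```

Separating the stages means the number of external API calls grows with the number of batches rather than the number of revisions, which is where most of the speedup for large requests would be expected to come from.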
While ORES can be considered as a purely technical information system (e.g. as bounded in Fig. 1), as discussed in the prior section, it is useful to take a broader view of ORES as a socio-technical system that is maintained and operated by people in certain ways and not others. Such a move
is in line with recent literature encouraging a shift from focusing on “algorithms" specifically to “algorithmic systems" [94]. In this broader view, the role of our engineering team is particularly relevant, as ORES is not a fully-automated, fully-decoupled “make your own model" system, as in the kind of workflow exemplified by Google’s Teachable Machine.24 In the design of ORES, this could have been a possible configuration, in which Wikipedians would have been able to upload their own training datasets, tweak and tune various parameters themselves, then hit a button that would deploy the model on the Wikimedia Foundation’s servers to be served publicly at scale via the API — all without needing approval or actions by us.
Instead, we have played a somewhat stronger gatekeeping role than we initially envisioned, as no new models can be deployed without our express approval. However, our team has long operated with a particular orientation and set of values, most principally committing to collaboratively build ML models that contributors at local Wikimedia projects believe may be useful in supporting their work. Our team — which includes paid staff at the Wikimedia Foundation, as well as some part-time volunteers — has long acted more like a service or consulting unit than a traditional product team. We provide customized resources after extended dialogue with representatives from various wiki communities who have expressed interest in using ML for various purposes.
5.1 Collaborative model design
From the perspective of Wikipedians who want to add a new model/classifier to ORES, the first step is to request a model. They message the team and outline what kind of model/classifier they would like and/or what kinds of problems they would like to help tackle with ML. Requesting a classifier and discussing the request take place on the same communication channels used by active contributors of Wikimedia projects. The team has also performed outreach, including giving presentations to community members virtually and at meetups25. Wikimedia contributors have also spread news about ORES and our team’s services to others through word-of-mouth. For example, the Portuguese Wikipedia demonstrative case we describe in section 5.1.2 was started because of a presentation that someone outside of our team gave about ORES at WikiCon Portugal.26
Even though our team has a more collaborative, participatory, and consulting-style approach, we still play a strong role in the design, scoping, training, and deployment of ORES’ models. Our team assesses the feasibility of each request, works with requester(s) to define and scope the proposed model, helps collect and curate labeled training data, and helps engineer features. We often go through multiple rounds of iteration based on the end users’ experience with the model and we help community members think about how they might integrate ORES scores within existing or new tools to support their work. ORES could become a rather different kind of socio-technical system if our team performed this exact same kind of work in the exact same technical system, but had different values, approaches, and procedures.
5.1.1 Trace extraction and manual labeling. Labeled data is the heart of any model, as ML can only be as good as the data used to train it [39]. It is through focusing on labeled observations that we help requesters understand and negotiate the meaning to be modeled [57] by an ORES classifier. We employ two strategies around labeling: found trace data and community labeling campaigns.
Found trace data from wikis. Wikipedians have long used the wiki platform to record digital traces about decisions they make [38]. These traces can sometimes be assumed to reflect useful labeled data for modeling, although like all found data, this must be done with much care. One of the most commonly requested kinds of model is a classifier for the quality of an article at a point in time. The quality of articles is an important concept across language versions, which are independently written according to their own norms, policies, procedures, and standards. Yet most Wikipedia language versions have a more-or-less standardized process for reviewing articles. Many of these processes began as ways to select high quality articles to feature on their wiki’s home page as the “article of the day.” Articles that do not pass are given feedback; like in academic peer review, there can be many rounds of review and revision. In many wikis, an article can be given a range of scores, with specific criteria defining each level.
Many language versions have developed quality scales that formalize their concepts of article quality [99]. Each local language version can have their own quality scale, with their own standards and assessment processes (e.g. English Wikipedia has a 7-level scale, Italian Wikipedia has a 5-level scale, Tamil Wikipedia has a 3-level scale). Even wikis that use the same scales can have different standards for what each level means and that meaning can change over time [47].
Most wikis also have standard “templates” for leaving assessment trace data. In English Wikipedia and many others, these templates are placed by WikiProjects (subject-focused working groups) on the “Talk pages” that are primarily used for discussing the article’s content. For wikis that have these standardized processes, scales, and trace data templates, our team asks the requesters of article quality models to provide some information and links about the assessment process. The team uses this to build scripts that scrape these traces into training datasets. This process is highly iterative, as processing mistakes and misunderstandings about the meaning and historical use of a template often need to be worked out in consultations with Wikipedians who are better versed in their own community’s history and processes. These are reasons why it is crucial to involve community members in ML throughout the process.
Fig. 4. The Wiki labels interface embedded in Wikipedia
Manual labeling campaigns with Wiki Labels. In many cases, we do not have any suitable trace data we can extract as labels. For these cases, we ask our Wikipedian collaborators to perform a community labeling exercise. This can have high cost: for some models, we need tens of thousands of observations in order to achieve high fitness and adequately test performance. To minimize that cost, we developed a high-speed, collaborative labeling interface called “Wiki Labels.”27 We work with model requesters to design an appropriate sampling strategy for items to be labeled and appropriate labeling interfaces, load the sample of items into the system, and help requesters
recruit collaborators from their local wiki to generate labels. Labelers request “worksets” of observations to label, which are presented in a customizable interface. Different campaigns require different “views” of the observation to be labeled (e.g., a whole page, a specific sentence within a page, a sequence of edits by an editor, a single edit, a discussion post, etc.) and “forms” to capture the label data (e.g., “Is this edit damaging?”, “Does this sentence need a citation?”, “What quality level is this article?”, etc.). Unlike with most Wikipedian edit review interfaces, Wiki Labels does not show information about the user who made the edit, to help mitigate implicit bias against certain kinds of editors (although this does not mitigate biases against certain kinds of content).
Labeling campaigns may be labor-intensive, but we have found they often prompt us to reflect and specify what precisely it is the requesters want to predict. Many issues can arise in ML from overly broad or poorly operationalized theoretical constructs [57], which can play out in human labeling [39]. For example, when we first started discussing modeling patrolling work with Wikipedians, it became clear that these patrollers wanted to give vastly different responses when they encountered different kinds of “damaging” edits. Some of these “damaging” edits were clearly seen as “vandalism” to patrollers, where the editor in question appears to only be interested in causing harm. In other cases, patrollers encountered “damaging” edits that they felt certainly lowered the quality of the article, but were made in a way that they felt was more indicative of a mistake or misunderstanding.
Patrollers felt this second kind of damaging edit came from people who were generally trying to contribute productively, but who were still violating some rule or norm of Wikipedia. Wikipedians have long referred to this second type of “damage” as “good-faith,” which is common among new editors and requires a carefully constructive response. “Good-faith” is a well-established term in Wikipedian culture28, with specific local meanings that are different than its broader colloquial use, similar to how Wikipedians define “consensus” or “neutrality”29. We used this understanding and the distinction it provided to build a form for Wiki Labels that allowed Wikipedians to distinguish between these cases. That allowed us to build two separate models which allow users to filter for edits that are likely to be good-faith mistakes [46], to just focus on vandalism, or to apply themselves broadly to all damaging edits.
5.1.2 Demonstrative case: Article quality in Portuguese vs Basque Wikipedia. In this section, we discuss how we worked with community members from Portuguese Wikipedia to develop article quality models based on trace data, then we contrast with what we did for Basque Wikipedia where no trace data was available. These demonstrative cases illustrate a broader pattern we follow when collaboratively building models. We obtained consent from all editors mentioned to share their story with their usernames, and they have reviewed this section to verify our understanding of how and why they worked with us.
GoEThe, a volunteer editor from the Portuguese Wikipedia, attended a talk at WikiCon Portugal where a Wikimedia Foundation staff member (who was not on our team) mentioned ORES and the potential of article quality models to support work. GoEThe then found our documentation page for requesting new models 30. He clicked on a blue button titled “Request article quality model”. This led to a template for creating a task in our ticketing system.31 The article quality template asks questions like “How do Wikipedians label articles by their quality level?" and “What levels are there and what processes do they follow when labeling articles for quality?" These are good starting points for a conversation about what article quality is and how to measure it at scale.
Our engineers responded by asking follow-up questions about the digital traces Portuguese Wikipedians have long used to label articles by quality, which include the history of the bots, templates, automation, and peer review used to manage the labeling process. As previously discussed, existing trace data from decisions made by humans can appear to be a quick and easy way to get labels for training data, but it can also be problematic to rely on these traces without understanding the conditions of their production. After an investigation, we felt more confident that there was consistent application of a well-defined quality scale32 and that labels were consistently applied using a common template named “Marca de projeto”. Meanwhile, we encountered another volunteer (Chtnnh) who was not a Portuguese Wikipedian, but was interested in ORES and machine learning. Chtnnh worked with our team to iterate on scripts for extracting the traces with increasing precision. At the same time, a contributor to Portuguese Wikipedia named He7d3r joined, adding suggestions about our data extraction methods and contributing code. One of He7d3r’s major contributions was a set of features based on a Portuguese “words_to_watch” list, a collection of imprecise or exaggerated words, like visionário (visionary) and brilhante (brilliant).33 We even gave He7d3r access to one of our servers, so that he could more easily experiment with extracting labels, engineering new features, and building new models.
In Portuguese Wikipedia, there was already trace data available that was suitable for training and testing an article quality model, which is not always the case. For example, we were approached about an article quality model by another volunteer from Basque Wikipedia – a much smaller language community – which had no history of labeling articles and thus no trace data we could use as observations. However, they had drafted a quality scale and were hoping that we could help them get started by building a model for them. In this case, we worked with a volunteer from Basque Wikipedia to develop heuristics for quality (e.g. article length) and used those heuristics to build a stratified sample, which we loaded into the Wiki Labels tool with a form that matched their quality scale. Our Basque collaborator also gave direct recommendations on our feature engineering. For example, after a year of use, they noticed that the model seemed to put too much weight on the presence of in the articles. They requested we remove the features and retrain the model. While this produced a small, measurable drop in our formal fitness statistics, editors reported that the new model matched their understanding of quality better in practice.
These two cases demonstrate the process by which volunteers (many of whom had little to no experience with software development and ML) work with us to encode their understanding of quality and the needs of their work practices into a modeling pipeline. These cases also show how there is far more socio-technical work involved in the design, training, and optimization of machine learning models than just labeling training data. Volunteers play a critical role in deciding that a model is necessary, in defining the output classes the model should predict, in associating labels with observations (either manually or by interpreting historic trace data), and in engineering model features. In some cases, volunteers rely on our team to do most of the software and model engineering work, then they mainly give us feedback about model performance, but more commonly, the roles that volunteers take on overlap significantly with those of our researchers and engineers. The result is a process that is collaborative and is similar in many ways to the open collaboration practice of Wikipedia editors.
5.1.3 Demonstrative case: Italian Wikipedia thematic analysis. Italian Wikipedia was one of the first wikis outside of English Wikipedia where we deployed models for helping editors detect vandalism. After we deployed the initial version of the model, we asked Rotpunkt — our local
collaborator who originally requested we build the model and helped us develop language-specific features — to help us gather feedback about how the model was performing in practice. He put together a page on Italian Wikipedia35 and encouraged patrollers to note the mistakes that the model was making there. He created a section for reporting “falsi positivi” (false-positives). Within several hours, Rotpunkt and others noticed trends in edits that ORES was getting wrong. They sorted false positives under different headers, representing themes they were seeing — effectively performing an audit of ORES through an inductive, grounded theory-esque thematic coding process.
One of the themes they identified was “correzioni verbo avere” (“corrections to the verb for have”). The word “ha” in Italian is a conjugation of the verb “avere” (“to have”). In English and many other languages, “ha” signifies laughing, which is not usually found in encyclopedic prose. Most non-English Wikipedias receive at least some English vandalism like this, so we had built a common feature in all patrolling support models called “informal words” to capture these patterns. Yet in this case, “ha” should not carry signal of damaging edits in Italian, while “hahaha” still should. Because of Rotpunkt and his collaborators in Italian Wikipedia, we were able to recognize the source of this issue, remove “ha” from that informal list for Italian Wikipedia, and deploy a model that showed clear improvements.
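To make the “ha” versus “hahaha” distinction concrete, here is a minimal sketch of a per-language “informal words” counter. The pattern lists, the dictionary `INFORMAL_PATTERNS`, and the function `count_informal` are invented for illustration and are not the word lists or feature code used in ORES.

```python
import re

# Hypothetical pattern lists. The Italian list requires at least two
# repetitions of "ha", so the verb form "ha" is no longer counted.
INFORMAL_PATTERNS = {
    "en": [r"(?:ha)+", r"lo+l"],
    "it": [r"(?:ha){2,}", r"lo+l"],
}


def count_informal(text, lang):
    """Count whole-word matches of the informal patterns for one language."""
    pattern = re.compile(
        r"\b(?:" + "|".join(INFORMAL_PATTERNS[lang]) + r")\b", re.IGNORECASE
    )
    return len(pattern.findall(text))


sentence = "Lui ha scritto un libro"
print(count_informal(sentence, "en"))  # 1: "ha" looks like laughter
print(count_informal(sentence, "it"))  # 0: "ha" is ordinary Italian
print(count_informal("hahaha", "it"))  # 1: repeated laughter still counts
```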
This case demonstrates the innovative way our Wikipedian collaborators have advanced their own processes for working with us. It was their idea to group false positives by theme and characteristics, which made for powerful communication. Each theme identified by the Wikipedians was a potential bug somewhere in our model. We may have never found such a bug without the specific feedback and observations that end-users of the model were able to give us.
5.2 Technical affordances for adoption and appropriation
Our engineering team values being responsive to community needs. To this end, we designed ORES as a technical system in a way that we can more easily re-architect or extend the system in response to initially unanticipated needs. In some cases, these needs only became apparent after the base system (described in section 4) was in production — where communities we supported were able to actively explore what might be possible with ML in their projects, then raised questions or asked for features that we had not considered. Through maintaining close relationships with the communities we support, we have been able to extend and adapt the technical systems in novel ways to support their use of ORES. We identified and implemented two novel affordances that support the adoption and re-appropriation of ORES models: dependency injection and threshold optimization.
5.2.1 Dependency injection. When we originally developed ORES, we designed our feature engineering strategy based on a dependency injection framework36. A specific feature used in prediction (e.g., number of references) depends on one or more datasources (e.g. article text). Many different features can depend on the same datasource. A model uses a sampling of features in order to make predictions. A dependency solver allowed us to efficiently and flexibly gather and process the data necessary for generating the features for a model — initially a purely technical decision.
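The relationship between datasources, features, and a dependency solver can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the ORES code: the class `Dependent`, the function `solve`, and the example features are hypothetical names.

```python
class Dependent:
    """A datasource or feature; the names and structure here are illustrative."""

    def __init__(self, name, process=None, depends_on=()):
        self.name = name
        self.process = process            # function of the dependencies' values
        self.depends_on = tuple(depends_on)


def solve(dependent, cache=None):
    """Return the value of `dependent`, computing dependencies as needed.

    Anything already present in `cache` is used as-is, which is what
    makes it possible to inject a datasource or a feature value.
    """
    cache = {} if cache is None else cache
    if dependent in cache:
        return cache[dependent]
    if dependent.process is None:
        raise ValueError(f"{dependent.name} has no process and was not injected")
    args = [solve(dep, cache) for dep in dependent.depends_on]
    cache[dependent] = dependent.process(*args)
    return cache[dependent]


# One datasource shared by two features, and a feature built from features.
article_text = Dependent("article_text")
ref_count = Dependent("ref_count", lambda text: text.count("<ref"), [article_text])
char_count = Dependent("char_count", len, [article_text])
refs_per_char = Dependent(
    "refs_per_char", lambda refs, chars: refs / max(chars, 1), [ref_count, char_count]
)

injected = {article_text: "Born in 1912.<ref>A</ref> Died in 1954.<ref>B</ref>"}
print(solve(ref_count, injected))  # 2
```

Because the solver consults the cache before computing anything, the shared datasource is processed once, and a caller can override any node in the graph simply by placing a value in the cache.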
After working with ORES’ users, we received requests for ORES to generate scores for edits before they were saved, as well as to help explore the reasons behind some of the predictions. After a long consultation, we realized we could provide our users with direct access to the features that ORES used for making predictions and let those users inject features and even the datasources they depend on. A user can gather a score for an edit or article in Wikipedia, then request a new scoring job with one of those features or underlying datasources modified to see how the prediction would
change. For example, how does ORES differently judge edits from unregistered (anon) vs registered editors? Figure 5 demonstrates two prediction requests to ORES with features injected.
Fig. 5. Two “damaging” predictions about the same edit are listed for ORES. In one case, ORES scores the prediction assuming the editor is unregistered (anon) and in the other, ORES assumes the editor is registered.
Figure 5a shows that ORES’ “damaging” model concludes the edit is not damaging with 93.9% confidence. Figure 5b shows the prediction if the edit were saved by an anonymous editor. ORES would still conclude that the edit was not damaging, but with less confidence (91.2%). By following a pattern like this, we better understand how ORES prediction models account for anonymity with practical examples. End users of ORES can inject raw text of an edit to see the features extracted and the prediction, without making an edit at all.
5.2.2 Threshold optimization. When we first started developing ORES, we realized that operational concerns of Wikipedia’s curators need to be translated into confidence thresholds for the prediction models. For example, counter-vandalism patrollers seek to catch all (or almost all) vandalism quickly. That means they have an operational concern around the recall of a damage prediction model. They would also like to review as few edits as possible in order to catch that vandalism. So they have an operational concern around the filter-rate—the proportion of edits that are not flagged for review by the model [45]. By finding the threshold of prediction confidence that optimizes the filter-rate at a high level of recall, we can provide patrollers with an effective trade-off for supporting their work. We refer to these optimizations as threshold optimizations and ORES provides information about these thresholds in a machine-readable format so tool developers can write code that automatically detects the relevant thresholds for their wiki/model context.
Originally when we developed ORES, we defined these threshold optimizations in our deployment configuration, which meant we would need to re-deploy the service any time a new threshold was needed. We soon learned users wanted to be able to search through fitness metrics to choose thresholds that matched their own operational concerns on demand. Adding new optimizations and redeploying became a burden on us and a delay for our users, so we developed a syntax for requesting an optimization from ORES in real-time using fitness statistics from the model’s test data. For example, maximum recall @ precision >= 0.9 gets a useful threshold for a patrolling auto-revert bot or maximum filter_rate @ recall >= 0.75 gets a useful threshold for patrollers who are filtering edits for human review.
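The kind of search behind a request such as maximum filter_rate @ recall >= 0.75 can be sketched as a scan over candidate thresholds on scored test data. The functions `stats_at` and `optimize` and the tiny `TEST_SCORES` dataset are hypothetical and only illustrate the idea; they are not the ORES implementation or its data.

```python
def stats_at(scored, threshold):
    """Fitness statistics when edits with probability >= threshold are flagged."""
    flagged = [is_damaging for prob, is_damaging in scored if prob >= threshold]
    true_pos = sum(flagged)
    positives = sum(is_damaging for _, is_damaging in scored)
    return {
        "threshold": threshold,
        "recall": true_pos / positives if positives else 0.0,
        "precision": true_pos / len(flagged) if flagged else 0.0,
        "filter_rate": 1 - len(flagged) / len(scored),
    }


def optimize(scored, maximize, constrained, at_least):
    """Find the threshold that maximizes one statistic while another stays >= a bound."""
    best = None
    for threshold in sorted({prob for prob, _ in scored}):
        stats = stats_at(scored, threshold)
        if stats[constrained] < at_least:
            continue  # this threshold violates the operational constraint
        if best is None or stats[maximize] > best[maximize]:
            best = stats
    return best


# Invented test data: (predicted probability of damaging, true label).
TEST_SCORES = [
    (0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.4, False),
    (0.3, True), (0.2, False), (0.1, False), (0.05, False), (0.02, False),
]

# "maximum filter_rate @ recall >= 0.75" for patrollers filtering edits
patrolling = optimize(TEST_SCORES, "filter_rate", "recall", 0.75)
# "maximum recall @ precision >= 0.9" for an auto-revert bot
auto_revert = optimize(TEST_SCORES, "recall", "precision", 0.9)
```

On this toy data the patrolling query selects a threshold of 0.6 and the auto-revert query selects 0.8, showing how the same test statistics yield different thresholds for different operational concerns.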
Figure 6 shows that, when a threshold is set on 0.299 likelihood of damaging=true, a user can expect to get a recall of 0.751, precision of 0.215, and a filter-rate of 0.88. While the precision is low, this threshold reduces the overall workload of patrollers by 88% while still catching 75% of (the most egregious) damaging edits.
{"threshold": 0.32, ..., "filter_rate": 0.89, "fpr": 0.087, "precision": 0.23, "recall": 0.75}
Fig. 6. A threshold optimization – the result of https://ores.wikimedia.org/v3/scores/enwiki/?models=
One of the most noteworthy (and initially unanticipated) applications of ORES to support new editors is the suite of tools developed by Sage Ross to support the Wiki Education Foundation’s37 activities. Their organization supports classroom activities that involve editing Wikipedia. They develop tools and dashboards that help students contribute successfully and help teachers monitor their students’ work. Ross published about how he interprets meaning from ORES article quality models [88] (an example of re-appropriation) and how he has used the article quality model in their student support dashboard38 in a novel way. Ross’s tool uses our dependency injection system to suggest work to new editors. This system asks ORES to score a student’s draft article, then asks ORES to reconsider the predicted quality level of the article with one more header, one more image, one more citation, etc., tracking changes in the prediction and suggesting the largest positive change to the student. In doing so, Ross built an intelligent user interface that can expose the internal structure of a model in order to recommend the most productive change to a student’s article: the change that will most likely bring it to a higher quality level.
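The “one more header, one more image, one more citation” strategy can be sketched as a loop over single-feature increments. Everything in this example is an assumption made for illustration: `suggest_improvement`, the feature names, and the linear `toy_quality` function are hypothetical and stand in for requests to a real quality model with injected features.

```python
def suggest_improvement(features, predict_quality,
                        candidates=("headers", "images", "citations")):
    """Return the single +1 feature change with the largest predicted gain."""
    baseline = predict_quality(features)
    best_name, best_gain = None, 0.0
    for name in candidates:
        modified = dict(features)
        modified[name] = modified.get(name, 0) + 1  # inject "one more" of this
        gain = predict_quality(modified) - baseline
        if gain > best_gain:
            best_name, best_gain = name, gain
    return best_name, best_gain


def toy_quality(features):
    """Hypothetical stand-in for a model's predicted quality score."""
    return (0.5 * features.get("headers", 0)
            + 0.25 * features.get("images", 0)
            + 1.0 * features.get("citations", 0))


draft = {"headers": 2, "images": 0, "citations": 1}
print(suggest_improvement(draft, toy_quality))  # ('citations', 1.0)
```

The loop treats the model as a black box: it only needs a way to re-score the same article with one input changed, which is what feature injection provides.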
This adoption pattern leverages ML to support an under-supported user class, as well as balances concerns around quality control efficiency and newcomer socialization with a completely novel strategy. By helping student editors learn to structure articles to match Wikipedians’ expectations, Ross’s tool has the potential to both improve newcomer socialization and minimize the burden for Wikipedia’s patrollers — a dynamic that has historically been at odds [48, 100]. By making it easy to surface how a model’s predictions vary based on changes to input, Ross was able to provide a novel functionality we had never considered.
5.2.4 Demonstrative case: Optimizing thresholds for RecentChanges Filters. In October of 2016, the Global Collaboration Team at the Wikimedia Foundation began work to redesign Wikipedia’s RecentChanges feed,39 a tool used by Wikipedia editors to track edits to articles in near real-time, which is used for patrolling edits for damage, to welcome new editors, to review new articles for their suitability, and to otherwise stay up to date with what is happening in Wikipedia. The most prominent feature of the redesign they proposed would bring in ORES’s “damaging” and “good-faith” models as flexible filters that could be mixed and matched with other basic filters. This would (among other use-cases) allow patrollers to more easily focus on the edits that ORES predicts as likely to be damaging. However, a question quickly arose from the developers and designers of the tool: at what confidence level should edits be flagged for review?
We consulted with the design team about the operational concerns of patrollers: that recall and filter-rate need to be balanced for patrolling to be effective [45]. The designers on the team identified the value of having a high precision threshold as well. After working with them to define thresholds and making multiple new deployments of the software, we realized that there was an opportunity to automate the threshold discovery process and allow the client (in this case, the MediaWiki software’s RecentChanges functionality) to use an algorithm to select appropriate thresholds. This was very advantageous for both us and the product team. When we deploy an
updated version of a model, we increment a version number. MediaWiki checks for a change in that version number and re-queries for an updated threshold optimization as necessary.
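The version check described here amounts to a small client-side cache keyed on the model version. The sketch below is a hypothetical illustration of that pattern; `ThresholdClient` and its two callbacks are invented names and do not describe MediaWiki’s actual code.

```python
class ThresholdClient:
    """Re-query a threshold optimization only when the model version changes."""

    def __init__(self, fetch_version, fetch_threshold):
        self._fetch_version = fetch_version      # hypothetical: returns e.g. "0.4.0"
        self._fetch_threshold = fetch_threshold  # hypothetical: runs the optimization query
        self._cached_version = None
        self._cached_threshold = None

    def threshold(self):
        version = self._fetch_version()
        if version != self._cached_version:
            # A new model was deployed: the old threshold may no longer
            # satisfy the client's precision or recall requirement.
            self._cached_threshold = self._fetch_threshold()
            self._cached_version = version
        return self._cached_threshold
```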
This case shows how we were able to formalize a relationship between model developers and model users. Model users needed to be able to select appropriate confidence thresholds for the UX that they have targeted, and they did not want to redo that work every time we make a change to a model. Similarly, we did not want to engage in heavy consultation and iterative deployments in order to help end users select new thresholds for their purposes. By encoding this process in the ORES API and model testing practices, we give both ourselves and ORES users a powerful tool for minimizing the labor involved in appropriating ORES’ models.
Fig. 7. The distributions of the probability of a single edit being scored as “damaging” based on injected features for the target user-class is presented. Note that when injecting user-class features (anon, newcomer), all other features are held constant.
5.2.5 Demonstrative case: Anonymous users and Tor users. Shortly after we deployed ORES, we received reports that ORES’s damage detection models were overly biased against anonymous editors. At the time, we were using Linear SVM40 estimators to build classifiers, and we were considering making the transition towards ensemble strategies like GradientBoosting and RandomForest estimators.41 We took the opportunity to look for bias in the error of estimation between anonymous editors and registered editors. By using our dependency injection strategy, we could ask our current prediction models how they would change their predictions if the exact same edit were made by a different editor.
Figure 7 shows the probability density of the likelihood of “damaging” given three different passes over the exact same test set, using two of our modeling strategies. Figure 7a shows that, when we leave the features to their natural values, it appears that both algorithms are able to learn models that differentiate effectively between damaging edits (high-damaging probability) and non-damaging edits (low-damaging probability) with the odd exception of a large amount of non-damaging edits with a relatively high-damaging probability around 0.8 in the case of the Linear SVM model. Figures 7b and 7c show a stark difference. For the scores that go into these plots, characteristics of anonymous editors and newly registered editors were injected for all of the test edits. We can see that the GradientBoosting model can still differentiate damage from non-damage while the Linear SVM model flags nearly all edits as damage in both cases.
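The shape of this analysis, scoring the same test set several times with different user-class features injected, can be sketched as follows. This is not the analysis behind Figure 7: `mean_score`, the feature names, and `toy_model` are hypothetical, and the toy model is deliberately constructed to lean on anonymity.

```python
def mean_score(model, test_set, overrides=None):
    """Average damaging score over a test set, with optional injected features."""
    overrides = overrides or {}
    # Injected values replace the natural ones; all other features are unchanged.
    scores = [model({**features, **overrides}) for features in test_set]
    return sum(scores) / len(scores)


def toy_model(features):
    """Hypothetical scorer that adds a fixed penalty for anonymous editors."""
    penalty = 0.5 if features["is_anon"] else 0.0
    return min(1.0, 0.25 * features["bad_words"] + penalty)


test_set = [
    {"is_anon": False, "bad_words": 0},
    {"is_anon": True, "bad_words": 2},
]

natural = mean_score(toy_model, test_set)
as_anon = mean_score(toy_model, test_set, {"is_anon": True})
as_registered = mean_score(toy_model, test_set, {"is_anon": False})
print(natural, as_anon, as_registered)  # 0.5 0.75 0.25
```

A large gap between the injected passes on identical edits indicates how heavily a model relies on who made the edit rather than on what the edit changed.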
Through the reporting of this issue and our analysis, we identified the weakness of our estimator and mitigated the problem. Without a tight feedback loop, we likely would not have noticed how poorly ORES’s damage detection models were performing in practice. It might have caused vandal fighters to be increasingly (and inappropriately) skeptical of contributions by anonymous editors and newly registered editors—two groups that are already met with unnecessary hostility42[48].
This case demonstrates how the dependency injection strategy can be used in practice to empirically and analytically explore the bias of a model. Since we have started talking about this case publicly, other researchers have begun using this strategy to explore ORES predictions for other user groups. Notably, Tran et al. used ORES’ feature injection system to explore the likely quality of Tor users’ edits43 as though they were edits saved by newly registered users [102]. We likely would not have thought to implement this functionality if it were not for our collaborative relationship with developers and researchers using ORES, who raised these issues.
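The injection strategy used in this case can be illustrated with a toy model. The sketch below is an assumption-laden stand-in, not ORES code: the feature names, weights, and test edits are hypothetical, and a simple logistic scorer replaces the real estimators. It shows the core idea of re-scoring the exact same edits with user-class features overridden while all other features are held constant.

```python
import math
from statistics import mean

# Hypothetical weights for a toy linear model; not taken from any ORES model.
WEIGHTS = {"is_anon": 2.5, "is_newcomer": 1.5, "bad_words": 1.2, "chars_added": -0.001}
BIAS = -3.0


def damaging_probability(features):
    """Logistic score for one edit described by a feature dictionary."""
    z = BIAS + sum(WEIGHTS[name] * float(value) for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def inject(features, **overrides):
    """Copy the features with some values replaced; all others are held constant."""
    unknown = sorted(set(overrides) - set(features))
    if unknown:
        raise KeyError(f"unknown feature(s): {unknown}")
    return {**features, **overrides}


# Hypothetical test edits, invented for illustration.
TEST_EDITS = [
    {"is_anon": False, "is_newcomer": False, "bad_words": 0, "chars_added": 400},
    {"is_anon": False, "is_newcomer": True, "bad_words": 0, "chars_added": 120},
    {"is_anon": True, "is_newcomer": False, "bad_words": 2, "chars_added": 15},
]


def mean_score(edits, **overrides):
    """Mean damaging probability over the edits after injecting the overrides."""
    return mean(damaging_probability(inject(edit, **overrides)) for edit in edits)


if __name__ == "__main__":
    print("natural values:", round(mean_score(TEST_EDITS), 3))
    print("all as anonymous:", round(mean_score(TEST_EDITS, is_anon=True), 3))
```

Comparing the score distributions from the natural pass and the injected pass, as Figure 7 does for the real models, indicates how much of a model's prediction depends on who made the edit rather than on what the edit changed.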
5.3 Governance and decoupling: delegating decision-making
If “code is law” [65], then internal decision-making processes of technical teams constitute governance systems. Decisions about what models should be built and how they should be used play major roles in how Wikipedia functions. ORES as a socio-technical system is variously coupled and decoupled, which, as we use it, is a distinction about how much vertical integration and top-down control there is over different kinds of labor involved in machine learning. Do the same people have control and/or responsibility over deciding what the model will classify, labeling training data, curating and cleaning labeled training data, engineering features, building models, evaluating or auditing models, deploying models, and developing interfaces and/or agents that use scores from models? A highly-coupled ML socio-technical system has the same people in control of this work, while a highly-decoupled system involves distributing responsibility for decision-making — providing entrypoints for broader participation, auditing, and contestability [78].
Our take on decoupling draws on feminist standpoint epistemology, which often critiques a single monolithic “God’s eye view” or “view from nowhere” and emphasizes accounts from multiple positionalities [53, 54]. Similar calls drawing on this work have been made in texts like the Feminist Data Set Manifesto [see 96], as more tightly coupled approaches enforce a particular singular top-down worldview. Christin [15] also uses “decoupling” to discuss algorithms in an organizational context, in an allied but somewhat different use. Christin draws on classic sociology of organizations work [73] to describe decoupling as when gaps emerge between how algorithms are used in practice by workers and how the organization’s leadership assumes they are used, like when workers ignore or subvert algorithmic systems imposed on them by management. We could see such cases as unintentionally decoupled ML socio-technical systems in our use of the term, whereas ORES is intentionally decoupled. There is also some use of “decoupling” in ML to refer to training separate models for members of different protected classes [25], which is not the sense we mean, although this is in the same spirit of going beyond a ‘one model to rule them all’ strategy.
As discussed in section 5.1, we collaboratively consult and work with Wikipedians to design models that represent their “emic” concepts, which have meaning and significance in their communities. Our team works more as stewards of ML for a community, making decisions to support wikis based on our best judgment. In one sense, this can be considered a more coupled approach to a socio-technical ML system, as the people who legally own the servers do have unitary control over what models can be developed — even if our team has a commitment to exercise that control in a more participatory way. Yet because of this commitment, the earlier stages of the ML process are still
more decoupled than in most content moderation models for commercial social media platforms, where corporate managers hire contractors to label data using detailed instructions [43, 87].
While ORES’s patterns around the tasks of model training, development, and deployment are somewhat-but-not-fully decoupled, tasks around model selection, use, and appropriation are highly decoupled. Given the open API, there is no barrier where we can selectively decide who can request a score from a classifier. We could have decided to make this part of ORES more strongly coupled, requiring users to apply for revocable API keys to get ORES scores. The open design has important implications for how the models are regulated, limiting the direct power our team has over how models are used. It requires that Wikipedians and volunteer tool developers govern and regulate the use of these ML models using their own structures, strategies, values, and processes.
5.3.1 Decoupled model appropriation. ORES’s open API sees a wide variety of uses. The majority of the use is by volunteer tool developers who incorporate ORES’s predictions into the user experiences they target with their user interfaces (e.g. a score in an edit review tool) and into decisions made by fully-automated bots (e.g. all edits above a threshold are auto-reverted). We have also seen appropriation by professional product teams at the Wikimedia Foundation (e.g. see section 5.2.4) and other researchers (e.g., see section 5.2.5). In these cases, we play at most a supporting role helping developers and researchers understand how to use ORES, but we do not direct these uses beyond explaining the API and how to use various features of ORES.
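To make this kind of appropriation concrete, the sketch below shows how a volunteer-built review tool might turn a batch of scores into a review queue. It is an illustration under stated assumptions: the JSON shape is modeled loosely on a score-per-revision response, but the exact field names, the wiki key `examplewiki`, the revision ids, and the probabilities are all hypothetical and should be checked against the service's own documentation before use.

```python
import json

# Hypothetical response for three revisions; field names and values are
# assumptions made for illustration, not a recorded API response.
SAMPLE_RESPONSE = json.loads("""
{
  "examplewiki": {
    "scores": {
      "1001": {"damaging": {"score": {"prediction": false,
                                      "probability": {"false": 0.97, "true": 0.03}}}},
      "1002": {"damaging": {"score": {"prediction": true,
                                      "probability": {"false": 0.04, "true": 0.96}}}},
      "1003": {"damaging": {"error": {"type": "RevisionNotFound",
                                      "message": "revision not found"}}}
    }
  }
}
""")


def damaging_probabilities(response, wiki, model="damaging"):
    """Map revision id -> P(damaging), skipping revisions that were not scored."""
    result = {}
    for rev_id, models in response[wiki]["scores"].items():
        score = models.get(model, {}).get("score")
        if score is not None:
            result[int(rev_id)] = score["probability"]["true"]
    return result


def review_queue(probabilities, threshold):
    """Revisions a review tool might surface, highest probability first."""
    flagged = [(rev, p) for rev, p in probabilities.items() if p >= threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    probabilities = damaging_probabilities(SAMPLE_RESPONSE, "examplewiki")
    print(review_queue(probabilities, threshold=0.5))
```

The threshold and what happens to a flagged revision (highlighting it for a human, or reverting it automatically) are choices made entirely in the tool developer's code, which is the sense in which model use is decoupled from model maintenance.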
This is relatively unusual for a Wikimedia Foundation-supported or even a volunteer-led engineering project, much less a traditional corporate product team. Often, products are deployed as changes and extensions of user-facing functionality on the site. Historically, there have been disagreements between the Wikimedia Foundation and a Wikipedia community about what kind of changes are welcome (e.g., the deployment of “Media Viewer” resulted in a standoff between the German Wikipedia community and the Wikimedia Foundation44[82]). This tension between the values of the owners of the platform and the community of volunteers exists at least in part due to how software changes are deployed. Indeed, we might have implemented ORES as a system that connected models to the specific user experiences that we valued. Under this pattern, we might design UIs for patrolling or task routing based on our values — as we did with an ML-assisted tool to support mentoring [49].
ORES represents a shift towards a far more decoupled approach to ML, which gives more agency to Wikipedian volunteers, in line with Wikipedia’s long history of more decentralized governance structures [28]. For example, English Wikipedia and others have a Bot Approvals Group (BAG) [32, 50] regulating which bots are allowed to edit articles in specific ways. This has generally been effective in ensuring that bot developers do not get into extended conflicts with the rest of the community or each other [36]. If a bot were programmed to act against the consensus of the community — with or without using ORES — the BAG could shut the bot down without our help.
Beyond governance, the more decoupled and participatory approach to ML that ORES affords allows innovation by people who hold different values, points of view, and ideas than we do. For example, student contributors to Wikipedia were not our main focus, but Ross was able to appropriate those predictions to support the student contributors he valued and whose concerns were important (see section 5.2.3). Similarly, Tor users were not our main focus, but Tran et al. explored the quality of Tor user contributions to argue for their high quality [102].
5.3.2 Demonstrative case: PatruBOT and Spanish Wikipedia. Soon after we released a “damaging” edit model for Spanish Wikipedia, a volunteer developer designed PatruBOT, a bot that automatically reverted any new edit to Spanish Wikipedia where ORES returned a score above a certain threshold.
Our discussion spaces were soon bombarded with confused Spanish-speaking editors asking us why ORES did not like their edits. We struggled to understand the complaints until someone told us about PatruBOT and showed us where its behavior was being discussed on Spanish Wikipedia.
After inspecting the bot’s code, we learned this case was about tradeoffs between precision/recall and false positives/negatives — a common ML issue. PatruBOT’s threshold for reverting was far too sensitive. ORES reports a prediction and a probability of confidence, but it is up to the local developers to decide if the bot will auto-revert edits classified as damage with a .90, .95, .99, or higher confidence. Higher thresholds minimize the chance a good edit will be mistakenly auto-reverted (false-positive), but also increase the chance that a bad edit will not be auto-reverted (false-negative). Ultimately, we hold that each volunteer community should decide where to draw the line between false positives and false negatives, but we could help inform these decisions.
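The tradeoff can be made concrete by counting both kinds of error at each candidate threshold. The following sketch uses a small, hypothetical set of audited edits (the probabilities and labels are invented for illustration, and this is not PatruBOT's code) to show how raising the auto-revert threshold trades false positives for false negatives.

```python
def confusion_at(scored_edits, threshold):
    """Count auto-revert errors for edits given as (probability, is_damaging) pairs."""
    false_pos = sum(1 for p, damaging in scored_edits if p >= threshold and not damaging)
    false_neg = sum(1 for p, damaging in scored_edits if p < threshold and damaging)
    return {"false_positives": false_pos, "false_negatives": false_neg}


# Hypothetical audited edits: (model probability of damage, human judgment).
AUDITED_EDITS = [
    (0.99, True), (0.97, True), (0.96, False), (0.93, True),
    (0.91, False), (0.88, True), (0.60, False), (0.10, False),
]

if __name__ == "__main__":
    for threshold in (0.90, 0.95, 0.99):
        print(threshold, confusion_at(AUDITED_EDITS, threshold))
```

On this toy data, moving from .90 to .99 removes every mistaken revert but lets more damaging edits through, which is the kind of line-drawing decision we hold should rest with each community.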
The Spanish Wikipedians held a discussion about PatruBOT’s many false-positives.45 Using wiki pages, they crowdsourced an audit of PatruBOT’s behavior.46 They came to a consensus that PatruBOT was making too many mistakes and it should stop until it could be fixed. A volunteer administrator was quickly able to stop PatruBOT’s activities by blocking its user account, which is a common way bots are regulated. This was entirely a community governed activity that required no intervention of our team or the Wikimedia Foundation staff.
This case shows how Wikipedian stakeholders do not need to have an advanced understanding of ML evaluation to meaningfully participate in a sophisticated discussion about how, when, why, and under what conditions such classifiers should be used. Because of the API-based design of the ORES system, no actions are needed on our end once they make a decision, as the fully-automated bot is developed and governed by Spanish Wikipedians and their processes. In fact, upon review of this case, we were pleased to see that SeroBOT took the place of PatruBOT in 2018 and continues to auto-revert vandalism today — albeit with a higher confidence threshold.
5.3.3 Stewarding model design/deployment. As maintainers of the ORES servers, our team does retain primary jurisdiction over every model: we must approve each request and can change or remove models as we see fit. We have rejected some requests that community members have made, specifically rejecting multiple requests for an automated plagiarism detector. However, we rejected those requests more because of technical infeasibility than principled objections. Such a detector would require a complete-enough database of existing copyrighted material to compare against, as well as immense computational resources that other content scoring models do not require.
In the future, it is possible that our team will have to make difficult and controversial decisions about what models to build. First, demand for new models could grow such that the team could not feasibly accept every request, because of a lack of human labor and/or computational resources. We would have to implement a strategy for prioritizing requests, which like all systems for allocating scarce resources, would have embedded values that benefit some more than others. Our team has been quite small: we have never had more than 3 paid staff and 3 volunteers at any time, and no more than 2 requests typically in progress simultaneously. This would become a different enterprise if the team grew to dozens or hundreds of people — as is the size of ML teams at major social media platforms — or received hundreds of requests for new models a day.
Our team could also receive requests for models in fundamental contradiction with our values: for example, classifying demographic profiles of an anonymous editor from their editing behavior, which would raise privacy issues and could be used in discriminatory ways. We have not formally specified our policies on these kinds of issues, which we leave to future work. Wikipedia’s own governance systems started out informal and slowly began to formalize as was needed [28], with
periodic controversies prompting new rules and governing procedures. Similarly, we expect that as issues like these arise, we will respond to them iteratively, building to when formally specified policies and structures will be necessary. However, we note ‘just-in-time’ approaches to governance have led to user-generated content platforms deferring decisions on critical issues [104].
ORES as a technical and socio-technical system can be more or less tightly coupled onto existing social systems, which can dramatically change their operation and governance. For example, the team could decide that any new classifier for a local language wiki would only be deployed if the entire community held a discussion and reached a consensus in favor of it. Alternatively, the team could decide that any request would be accepted and implemented by default, but if a local language wiki reached a consensus against a particular classifier, they would take it down. These hypothetical decisions by our team would implement two rather different governance models around ML, which would both differ from the current way the team operates.
5.3.4 Demonstrative case: Article quality modeling for Dutch Wikipedia. After meeting our team at an in-person Wikimedia outreach event, two Dutch Wikipedians (Ciell and RonnieV) reached out to us to build an article quality model for Dutch Wikipedia. After working with them to understand how Dutch Wikipedians thought about article quality and what kind of digital traces we might be able to extract, Ciell felt it was important to reach out to the Dutch community to discuss the project. She made a posting on — a central local discussion space.
At first, the community response to the idea of an article quality model was very negative, on the grounds that an algorithm could not measure the aspects of quality they cared about. They also noted that while ORES may work well for English Wikipedia, this model could not be used on Dutch Wikipedia due to differences in language and quality standards. After discussing that the model could be designed to work with training data from their wiki and using their specific quality scale, there was some support. However, the crucial third point was that adding a model to the ORES service would not necessarily change the user interface for everyone, if the community did not want to use it in that way. It would be a service that anyone could query on their own, which could be implemented as an opt-in feature via bespoke code, which in this case was a JavaScript gadget that the community could manage — as it currently manages many other opt-in add-on extensions and gadgets. The tone of the conversation immediately changed, and the idea of running a trial to explore the usefulness of ORES article quality predictions was approved.
This case illustrates two key issues. First, the Dutch Wikipedians were hesitant to approve a new ML-based feature that would affect all users’ experience of the entire wiki, but they were much more interested in an optional opt-in ML feature. Second, they rejected adopting models originally tailored for English Wikipedia, explicitly discussing how they conceptualized quality differently. As ORES operates as a service they can use on their own terms, they can direct the design and use of the models. This would have been far more controversial if an ML-based article quality tool had been designed in a more standard coupled product engineering model, like if we added a feature that surfaced article quality predictions to everyone.
6.1 Participatory machine learning
In a world increasingly dominated by for-profit user-generated content platforms — often marketed by their corporate owners as “communities” [41] — Wikipedia is an anomaly. While the non-profit Wikimedia Foundation has only a fraction of the resources of Facebook or Google, the unique principles and practices in the broad Wikipedia/Wikimedia movement are a generative constraint.
ORES emerged out of this context, operating at the intersection of a pressing need to deploy efficient machine learning at scale for content moderation, but to do so in ways that enable volunteers to develop and deploy advanced technologies on their own terms. Our approach is in stark contrast to the norm in machine learning research and practice, which involves a more tightly-coupled, top-down mode of developing the most precise classifiers for a known ground truth, then wrapping those classifiers in a complete technology for end-users, who must treat them as black boxes.
The more wiki-inspired approach to what we call “participatory machine learning” imagines classifiers to be just as provisional and open to criticism, revision, and skeptical reinterpretation as the content of Wikipedia’s encyclopedia articles. And like Wikipedia articles, we suspect some classifiers will be far better than others based on how volunteers develop and curate them, for various definitions of “better” that are already being actively debated. Our demonstrative cases and exploratory work by Smith et al. based on ORES [97] briefly indicate how volunteers have collectively engaged in sophisticated discussions about how they ought to use machine learning. ORES’s fully open, reproducible/auditable code and data pipeline — from training data to models and scored predictions — enables a wide range of new collaborative practices. ORES is a more socio-technical and CSCW-oriented approach to issues in the FAccTML space, where attention is often placed on mathematical and technical solutions, like interactive visualizations for model interpretability or formal guarantees of operationalized definitions of fairness [79, 95].
ORES also represents an innovation in openness in that it decouples several activities that have typically all been performed by managers/engineers or those under their direct supervision and control: deciding what will be modeled, labeling training data, choosing or curating training data, engineering features, building models to serve predictions, auditing predictions for false positives/negatives, and developing interfaces or automated agents that act on those predictions. As our cases have shown, people with extensive contextual and domain expertise in an area can make well-informed decisions about curating training data, identifying false positives/negatives, setting thresholds, and designing interfaces that use scores from a classifier. In decoupling these actions, ORES helps delegate these responsibilities more broadly, opening up the structure of the socio-technical system and expanding who can participate in it. In the next section, we introduce this concept of decoupling more formally through a comparison to a quite different ML system, then in later sections show how ORES has been designed to these ends.
6.2 What is decoupling in machine learning?
A 2016 ProPublica investigation [4] raised serious allegations of racial biases in an ML-based tool sold to criminal courts across the US. The COMPAS system by Northpointe, Inc. produced risk scores for defendants charged with a crime, to be used to assist judges in determining if defendants should be released on bail or held in jail until their trial. This exposé began a wave of academic research, legal challenges, journalism, and organizing about a range of similar commercial software tools that have saturated the criminal justice system. Academic debates followed over what it meant for such a system to be “fair” or “biased” [2, 9, 17, 24, 110]. As Mulligan et al. [79] discuss, debates over these “essentially contested concepts” often focused on competing mathematically-defined criteria, like equality of false positives between groups.
When we examine COMPAS, we must admit that we feel an uneasy comparison between how it operates and how ORES is used for content moderation in Wikipedia. Of course, decisions about what is kept or removed from Wikipedia are of a different kind of social consequence than decisions about who is jailed by the state. However, just as ORES gives Wikipedia’s human patrollers a score intended to influence their gatekeeping decisions, so does COMPAS give judges a similarly-functioning score. Both are trained on data that assumes a knowable ground truth for the question to be answered by the classifier. Often this data is taken from prior decisions, heavily relying
on found traces produced by a multitude of different individuals, who brought quite different assumptions and frameworks to bear when originally making those decisions.
Yet comparing the COMPAS suite with ORES as socio-technical systems, one of the more striking differences to us — beyond inherent issues in using ML in criminal justice systems, see [5] for a review — is how tightly coupled COMPAS is. This is less-often discussed, particularly as a kind of meta-value embeddable in design. COMPAS is a standalone turnkey vendor solution, developed end-to-end by Northpointe, based on their particular top-down values. The system is a deployable set of pretrained models trained on data from nationwide “norm groups” that can be dropped into any criminal justice system’s IT stack. If a client wants retraining of the “norm groups” (or other modifications) this additional service must be purchased from Northpointe.48
COMPAS scores are also only to be accessed through an interface provided by Northpointe.49 This interface shows such scores as part of workflows within the company’s flagship product Northpointe Suite — an enterprise management system for criminal justice organizations. Many debates about COMPAS and related systems have placed less attention on the models as an element of a broader socio-technical software system within organizations, although there are notable exceptions like Christin’s ethnographic work on algorithmic decision making in criminal justice, which raises similar organizational issues [15, 16]. Several of our critiques of tightly coupled ML systems are also similar to the ‘traps’ of abstraction that Selbst et al. [95] see ML engineers fall into.
Given such issues, many have called for public policy solutions, such as requiring public sector ML systems to have source code and training data released, or at least a structured transparency report (e.g. [31, 75]). We agree with this, but it is not simply that ORES is open source and open data, while COMPAS is not. It is just as important to have an open socio-technical system that flexibly accommodates the kinds of decoupled decision-making around data, model building and tuning, and decisions about how scores will be presented and used. A more decoupled approach does not mitigate potential harms, but can provide better entrypoints for auditing and criticism.
Extensive literature has discussed problems in taking found trace data from prior decisions as ground truth, particularly in institutions with histories of systemic bias [11, 42]. This has long been our concern with the standard approach in ML for content moderation in Wikipedia, which often uses the entire set of past edit revert decisions as ground truth. These concerns are rising around the adoption of ML in other domains where institutions may have an easily-parsable dataset of past decisions, from social services to hiring to finance [26]. However, we see a gap in the literature for a uniquely CSCW-oriented view of ML systems as software-enabled organizations — with resonances to classic work on Enterprise Resource Planning systems [e.g. 85]. In taking this broader view, we argue that instead of attempting to de-bias problematic datasets for a single model, we seek to broaden participation and offer multiple contradictory models trained on different datasets.
In a highly coupled socio-technical machine learning system like COMPAS, a single set of people are responsible for all aspects of the ML process: labeling training data, engineering features, building models with various algorithms and parameterizations, choosing between these models, scoring items using models, and building interfaces or agents using those scores. As COMPAS is designed (and marketed), there is little capacity for people outside Northpointe, Inc. to intervene at any stage. Nor is there much of a built-in capacity to scaffold on new elements to this workflow at various points, such as introducing auditing or competing models at any point. This is a common theme in the auditing literature in and around FAccT [89], which often emphasizes ways to reverse engineer how models operate. In the next section, we discuss how ORES is more coupled in
some aspects, while less coupled in others, drawing on the demonstrative cases to describe the architectural and organizational decisions that produced such a configuration.
6.3 Cases of decoupling and appropriation of machine learning models
Multiple independent classifiers trained on multiple independent data sets. Many ML-based workflows presume a single canonical training data set approximating a ground truth, producing a model to be deployed at scale. ORES as a socio-technical system is designed to support many co-existing and even contradictory training data sets and classifiers. This was initially a priority because of the hundreds of different wiki projects across languages that are edited independently. Yet this design constraint also pushed us to develop a system where multiple independent classifiers can be trained on different datasets within the same wiki project, based on different ideas about what ‘quality’ or ‘damage’ are.
Having independent co-existing classifiers is a key sign of a more decoupled system, as different people and publics can be involved in the labeling and curation of training data and in model design and auditing (which we discuss later). This decoupling is supported both in the architectural decisions outlined in our sections on scaling and API design, as well as in how our team has somewhat formalized processes for wiki contributors to request their own models. As we discussed, this is not a fully decoupled approach where anyone can build their own model with a few clicks, as our team still plays a role in responding to requests, but it is a far more decoupled approach to training data than in prior uses of ML in Wikipedia and in many other real-world ML systems.
Supporting latent user-needs. While the ORES socio-technical system is implemented in a somewhat-but-not-fully decoupled mode for model design and iteration, appropriation of models by developers and researchers is highly decoupled. Decoupling the maintenance of a model and the adoption of the model seems to have led to adoption of ORES as a commons commodity — a reusable raw component that can be incorporated into specific products and that is owned and regulated as a common resource [84]. The “damage” detection models in ORES allow damage detection or the reverse (good edit detection) to be much more easily appropriated into end-user products and analyses. In section 5.2.5, we showed how related research did this for exploring Tor users’ edits. We also showed in section 5.2.3 how Ross re-imagined the article quality models as a strategy for building work suggestions. We had no idea that this model would be useful for this, and the developers and researchers did not need to ask us for permission to use ORES this way.
When we chose to provide an open API—to decouple model adoption—we stopped designing only for cases we imagined, moving to support latent user-needs [74]. By providing infrastructure for others to use as they saw fit, we are enabling needs that we cannot anticipate. This is a less common approach in ML, where it is more common for a team to design a specific ML-supported interface for users end-to-end. Indeed, this pattern was the state of ML in Wikipedia before ORES, and as we argue in section 3.3, this led to many users’ needs being unaddressed, with substantial systemic consequences.
Enabling community governance authority. Those who develop and maintain technical infrastructure for software-mediated social systems often gain an incidental jurisdiction over the entire social system, able to make technical decisions with major social impacts [65]. In the case of Wikipedia, decoupling the maintenance of ML resources from their application pushes back against this pattern. ORES has enabled the communities of Wikipedia to more easily make governance decisions around ML by not requiring them to negotiate as much with us, the model maintainers. In section 5.3.2, we discuss how Spanish Wikipedians decided to approve PatruBOT to run in their wiki, then shut down the bot again without petitioning us to do so. Similarly, in section 5.3.4, we
discuss how Dutch Wikipedians were much more interested in experimenting with ORES when they learned how they would maintain control of their user experience.
The powerful role of auditing in decoupled ML systems. A final example of the importance of decoupled ML systems is in auditing. In our work with ORES, we recruit models’ users to perform an audit after every deployment or substantial model change. We have found misclassification reports to be an effective boundary object [64] for communicating about the meaning a model captures and does not capture. In section 5.1.3, we showed how a community of non-subject matter expert volunteers helped us identify a specific issue related to a bug in our modeling pipeline via an audit and thematic analysis. Wikipedians imagined the behavior they desired from a model and contrasted that to actual behavior, and in response, our engineering team imagined the technical mechanisms for making ORES behavior align with their expectations.
Wikipedians were also able to use auditing to make value-driven governance decisions. In section 5.3.2, we showed evidence of critical reflection on the current processes and the role of algorithms in quality control processes. While the participants were discussing issues that ML experts would refer to as model fitness or precision, they did not use that language or a formal understanding of model fitness. Yet they were able to effectively determine both the fitness of PatruBOT and make decisions about what criteria were important to allow the continued functioning of the bot. They did this using their own notions of how the bot should and should not behave and by looking at specific examples of the bot’s behavior in context.
Eliciting this type of critical reflection and empowering users to engage in their own choices about the roles of algorithmic systems in their social spaces has typically been more of a focus from the Critical Algorithms Studies literature, which comes from a more humanistic and interpretivist social science perspective (e.g. [6, 59]). This literature also emphasizes a need to see algorithmic systems as dynamic and constantly under revision by developers [94] — work that is invisible in most platforms, but is intentionally foregrounded in ORES. We see great potential for building new systems for supporting the crowd-sourced collection and interpretation of this type of auditing data. Future work should explore strategies for supporting auditing as a normal part of model design, deployment, and maintenance.
6.4 Design implications
In many user-generated content platforms, the technologies that mediate social spaces are controlled top-down by a single organization. Many social media users express frustration over their lack of agency around various understandings of “the algorithm” [13]. We have shown how if such an organization seeks to involve its users and stakeholders more around ML, they can employ a decoupling strategy like that of ORES, where professionals with infrastructural expertise build and serve ML models at scale, while other stakeholders curate training data, audit model performance, and decide where and how the ML models will be used. The demonstrative cases show the feasibility and benefits of decoupling the ML modeling service from the curation of training data and the implementation of ML scores in interfaces and tools, as well as in moving away from a single “one classifier to rule them all” and towards giving users agency to train and serve models on their own.
With such a service, a wide range of people play critical roles in the governance of ML in Wikipedia, roles that go beyond what they would be capable of if ORES were simply another ML classifier hidden behind a single-purpose UI, albeit with open-source code and training data, as prior ML classifiers in Wikipedia were. Since ORES has been in service, more than twenty-five times more ML models have been built and served at Wikipedia’s scale than in the entire 15-year history of Wikipedia prior. Open sourcing the data and model code behind these original pre-ORES models did not lead to a proliferation of alternatives, while ORES as a socio-infrastructural service did. This is because ORES as a technical and socio-technical system reduces “incidental complexities” [71] involved in developing the systems necessary for deploying ML in production and at scale.
Our design implication for organizations and platforms is to take this literally: to run open ML as a service so that users can build their own models with training datasets they provide, models that serve predictions through open APIs and support activities like dependency injection and threshold optimization for auditing and re-appropriation. Together with a common set of discussion spaces like the ones Wikipedians used, these can enable the re-use of models by a broader audience and make space for reflective practices such as model auditing, decision-making about thresholds, or the choice between different classifiers trained on different training datasets. People with varying backgrounds and expertise in programming and ML (at least in our field site) have an interest in participating in such governance activities and can effectively coordinate common understandings of what an ML model is doing and whether or not that is acceptable to them.
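As an illustration of the threshold optimization mentioned above, the following is a minimal sketch rather than ORES’ actual implementation. It assumes a community has a small set of labeled examples with model scores and wants the threshold that flags as many true positives as possible (recall) while keeping precision above a floor the community chooses. The function names `precision_recall` and `best_threshold` are hypothetical and do not correspond to any ORES API.

```python
from typing import List, Optional, Tuple


def precision_recall(scores: List[float], labels: List[bool],
                     threshold: float) -> Tuple[float, float]:
    """Precision and recall when every score >= threshold is flagged."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    # Convention (an assumption of this sketch): precision is 1.0 when
    # nothing is flagged, and recall is 0.0 when there are no positives.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def best_threshold(scores: List[float], labels: List[bool],
                   min_precision: float
                   ) -> Optional[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) with the highest recall among
    thresholds whose precision is at least min_precision, or None if no
    candidate threshold satisfies the constraint."""
    best = None
    # Only observed scores need to be tried: precision and recall change
    # only when the threshold crosses one of them.
    for t in sorted(set(scores)):
        p, r = precision_recall(scores, labels, t)
        if p >= min_precision and (best is None or r > best[2]):
            best = (t, p, r)
    return best
```

A patrolling tool might call `best_threshold(scores, labels, min_precision=0.9)` when its users care most about avoiding false accusations, while a tool that routes edits to human reviewers might accept a much lower precision floor in exchange for higher recall. Making that trade-off an explicit parameter is what lets the choice be discussed and revised by the people affected by it.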
6.5 Limitations and future work
Observing ORES in practice suggests avenues of future work toward crowd-based auditing tools. As our cases demonstrate, auditing of ORES’ predictions and mistakes has become a popular activity both during quality assurance checks after deploying a model (see section 5.1.3) and during community discussions about how a model should be used (see section 5.3.2). Even though we did not design interfaces for discussion and auditing, some Wikipedians have used unintended affordances of wiki pages and MediaWiki’s template system to organize processes for flagging false positives and calling attention to them. This process has proved invaluable for improving model fitness and addressing critical issues of bias against disempowered contributors (see section 5.2.5).
To better facilitate this process, future system builders should implement structured means to refute, support, discuss, and critique the predictions of machine models. With a structured way to report what machine prediction gets right and wrong, the process of reporting mistakes could be streamlined, making the negotiation of meaning in a machine learning model part of its everyday use. This could also make it easier to perform the thematic audits we saw in section 5.1.3. For example, a structured database of ORES mistakes could be queried in order to discover groups of misclassifications with a common theme. By supporting such an activity, we are working to transfer more power from ourselves (the system owners) to our users. Should one of our models develop a nasty bias, our users will be more empowered to coordinate with each other, show that the bias exists and where it causes problems, and either get the modeling pipeline fixed or even shut down a problematic usage pattern, as Spanish Wikipedians did with PatruBOT.
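To make the idea of a queryable record of mistakes more concrete, here is a minimal sketch of what such a structure could look like. It is an assumption-laden illustration, not a description of any existing ORES or MediaWiki feature: the `MistakeReport` fields, the free-text `theme` tag, and the functions `group_by_theme` and `common_themes` are all hypothetical names introduced for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class MistakeReport:
    """One editor's structured judgment of a single prediction (hypothetical schema)."""
    rev_id: int        # revision the prediction was made for
    model: str         # which model produced it, e.g. "damaging"
    predicted: bool    # what the model predicted
    reported_as: bool  # what the reporting editor considers correct
    theme: str         # free-text tag chosen by the reporter


def group_by_theme(reports: List[MistakeReport],
                   model: str) -> Dict[str, List[MistakeReport]]:
    """Group one model's disputed predictions by normalized theme tag."""
    groups: Dict[str, List[MistakeReport]] = defaultdict(list)
    for report in reports:
        # Keep only reports where the editor disagrees with the model.
        if report.model == model and report.predicted != report.reported_as:
            groups[report.theme.strip().lower()].append(report)
    return dict(groups)


def common_themes(reports: List[MistakeReport], model: str,
                  min_reports: int = 2) -> List[Tuple[str, int]]:
    """List themes with at least min_reports disputed predictions,
    most frequently reported first (ties broken alphabetically)."""
    groups = group_by_theme(reports, model)
    counts = [(theme, len(items)) for theme, items in groups.items()
              if len(items) >= min_reports]
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))
```

Free-text tags are used here because the themes Wikipedians identified in their audits emerged from discussion rather than from a fixed taxonomy; a real system would need community input on whether tags should be open, curated, or both.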
ORES has become a platform for supporting researchers, who use ORES both in support of and in comparison to their own analytical or modeling work. For example, Smith et al. used ORES as a focal point of discussion about values applied to machine learning models and their use [97]. ORES is useful for this because it is not only an interesting example for exploration, but also because it has been incorporated as part of the essential infrastructure of Wikipedia [105]. Dang et al. [20, 21] and Joshi et al. [58] use ORES as a baseline from which to compare their novel modeling work. Halfaker et al. [47] and Tran et al. [102] use ORES scores directly as part of their analytical strategies. And even Yang et al. used Wiki labels and some of our feature extraction systems to build their own models [109]. We look forward to what socio-technical explorations, model comparisons, and analytical strategies will be built on this research platform by future work.
We also look forward to what those from the fields around Fairness, Accountability, and Transparency in ML and Critical Algorithms Studies can ask, do, and question about ORES. Most of the studies and critiques of subjective algorithms [103] focus on for-profit or governmental organizations that are resistant to external interrogation. Wikipedia is one of the largest and arguably most influential information resources in the world, and decisions about what is and is not represented have impacts across all sectors of society. The algorithms that ORES makes available are part of the decision process that leads to some people’s contributions remaining and others being removed. This is a context where algorithms have massive social consequence, and we are openly exploring transparent and open processes to help address potential issues and harms.
There is a large body of work exploring how biases manifest and how unfairness can play out in algorithmically mediated social contexts. ORES would be an excellent place to expand the literature within a specific and important field site. Notably, Smith et al. have used ORES as a focus for studying Value-Sensitive Algorithm Design and highlighting convergent and conflicting values [97]. We see great potential for research exploring strategies for more effectively encoding these values in both ML models and the tools/processes that use them on top of open machine learning services like ORES. We are also very open to the likelihood that our more decoupled approaches could still be reinforcing some structural inequalities, with ORES solving some issues but raising new issues. In particular, the decentralization and delegation to a different set of self-selected Wikipedians on local language versions may raise new issues, as past work has explored the hostility and harassment that can be common in some Wikipedia communities [72]. Any approach that seeks to expand access to a potentially harmful technology should also be mindful about unintended uses and abuses, as adversaries and harms in the content moderation space are numerous [70, 81, 108].
Finally, we also see potential in allowing Wikipedians to freely train, test, and use their own prediction models without our engineering team involved in the process. Currently, ORES is only suited to deploy models that are trained and tested by someone with a strong modeling and programming background, and we currently do that work for those who come to us with a training dataset and ideas about what kind of classifier they want to build. That does not necessarily need to be the case. We have been experimenting with demonstrating ORES model building processes using Jupyter Notebooks [61] and have found that new programmers can understand the work involved. This is still not the fully-realized accessible approach to crowd-developed machine prediction, where all of the incidental complexities involved in programming are removed from the process of model development and evaluation. Future work exploring strategies for allowing end-users to build models that are deployed by ORES would surface the relevant HCI issues involved and the changes to the technological conversations that such a margin-opening intervention might provide, and should also be mindful of potential abuses and new governance issues.
In this ‘socio-technical systems paper,’ we first discussed ORES as a technical system: an open API for providing access to machine learning models for Wikipedians. We then discussed the socio-technical system we have developed around ORES that allows us to encode communities’ emic concepts in their models, collaboratively and iteratively audit their performance, and support broad appropriation of the models both within Wikipedia’s editing community and in the broader research community. We have also shown, through a series of demonstrative cases, how these concepts are negotiated, audits are performed, and appropriation has taken place. This system, the observations, and the cases show a deep view of a technical system and the social structures around it. In particular, we analyze this arrangement as a more decoupled approach to machine learning in organizations, which we see as a more CSCW-inspired approach to many issues being raised around the fairness, accountability, and transparency of machine learning.
This work was funded in part by the Gordon & Betty Moore Foundation (Grant GBMF3834) and Alfred P. Sloan Foundation (Grant 2013-10-27), as part of the Moore-Sloan Data Science Environments grant to UC-Berkeley, and directly by the Wikimedia Foundation. Thanks to our Wikipedian collaborators for sharing their insights and reviewing the descriptions of our collaborations: Chaitanya Mittal (chtnnh), Helder Geovane (He7d3r), Goncalo Themudo (GoEThe), Ciell, Ronnie Velgersdijk (RonnieV), and Rotpunkt.
[1] B Thomas Adler, Luca De Alfaro, Santiago M Mola-Velasco, Paolo Rosso, and Andrew G West. 2011. Wikipedia vandalism detection: Combining natural language, metadata, and reputation features. In International Conference on Intelligent Text Processing and Computational Linguistics. Springer, 277–288.
[2] Philip Adler, Casey Falk, Sorelle A. Friedler, Tionney Nix, Gabriel Rybeck, Carlos Scheidegger, Brandon Smith, and Suresh Venkatasubramanian. 2018. Auditing black-box models for indirect influence. 54, 1 (2018), 95–122. https://doi.org/10.1007/s10115-017-1116-3
[3] Oscar Alvarado and Annika Waern. 2018. Towards algorithmic experience: Initial efforts for social media contexts. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 286.
[4] Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. Machine Bias. ProPublica 23 (2016), 2016. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
[5] Chelsea Barabas, Audrey Beard, Theodora Dryer, Beth Semel, and Sonja Solomun. 2020. Abolish the #TechToPrisonPipeline. https://medium.com/@CoalitionForCriticalTechnology/abolish-the-techtoprisonpipeline-9b5b14366b16
[6] Solon Barocas, Sophie Hood, and Malte Ziewitz. 2013. Governing algorithms: A provocation piece. SSRN. Paper presented at Governing Algorithms conference. (2013). https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2245322
[7] Ruha Benjamin. 2019. Race After Technology: Abolitionist Tools for the New Jim Code. John Wiley & Sons, Hoboken, New Jersey.
[8] R. Bentley, J. A. Hughes, D. Randall, T. Rodden, P. Sawyer, D. Shapiro, and I. Sommerville. 1992. Ethnographically-Informed Systems Design for Air Traffic Control. In Proceedings of the 1992 ACM Conference on Computer-Supported . Association for Computing Machinery, New York, NY, USA, 123–129. https://doi.org/10.1145/143457.143470
[9] Richard Berk, Hoda Heidari, Shahin Jabbari, Michael Kearns, and Aaron Roth. 2018. Fairness in Criminal Justice Risk Assessments: The State of the Art. (2018), 0049124118782533. https://doi.org/10.1177/0049124118782533 Publisher: SAGE Publications Inc.
[10] Suzanne M. Bianchi and Melissa A. Milkie. 2010. Work and Family Research in the First Decade of the 21st Century. Journal of Marriage and Family 72, 3 (2010), 705–725. http://doi.org/10.1111/j.1741-3737.2010.00726.x
[11] danah boyd and Kate Crawford. 2011. Six Provocations for Big Data. In A decade in internet time: Symposium on the dynamics of the internet and society. Oxford Internet Institute, Oxford, UK. https://dx.doi.org/10.2139/ssrn.1926431
[12] Joy Buolamwini and Timnit Gebru. 2018. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency (Proceedings of Machine Learning Research), Sorelle A. Friedler and Christo Wilson (Eds.), Vol. 81. PMLR, New York, NY, USA, 77–91. http://proceedings.mlr.press/v81/buolamwini18a.html
[13] Jenna Burrell, Zoe Kahn, Anne Jonas, and Daniel Griffin. 2019. When Users Control the Algorithms: Values Expressed in Practices on Twitter. Proc. ACM Hum.-Comput. Interact. 3, CSCW, Article 138 (Nov. 2019), 20 pages. https://doi.org/10.1145/3359240
[14] Eun Kyoung Choe, Nicole B. Lee, Bongshin Lee, Wanda Pratt, and Julie A. Kientz. 2014. Understanding Quantified-Selfers’ Practices in Collecting and Exploring Personal Data. In Proceedings of the SIGCHI Conference on Human . Association for Computing Machinery, New York, NY, USA, 1143–1152. https://doi.org/10.1145/2556288.2557372
[15] Angèle Christin. 2017. Algorithms in practice: Comparing web journalism and criminal justice. 4, 2 (2017), 2053951717718855. https://doi.org/10.1177/2053951717718855
[16] Angèle Christin. 2018. Predictive Algorithms and Criminal Sentencing. In The Decisionist Imagination: Sovereignty, Social Science and Democracy in the 20th Century. Berghahn Books, New York, 272.
[17] Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. 2017. Algorithmic Decision Making and the Cost of Fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2017-08-04). Association for Computing Machinery, 797–806. https://doi.org/10.1145/3097983.3098095
[18] Dan Cosley, Dan Frankowski, Loren Terveen, and John Riedl. 2007. SuggestBot: using intelligent task routing to help people find work in wikipedia. In Proceedings of the 12th international conference on Intelligent user interfaces. ACM, 32–41.
[19] Kate Crawford. 2016. Can an algorithm be agonistic? Ten scenes from life in calculated publics. Science, Technology, & Human Values 41, 1 (2016), 77–92.
[20] Quang-Vinh Dang and Claudia-Lavinia Ignat. 2016. Quality assessment of wikipedia articles without feature engineering. In Digital Libraries (JCDL), 2016 IEEE/ACM Joint Conference on. IEEE, 27–30.
[21] Quang-Vinh Dang and Claudia-Lavinia Ignat. 2017. An end-to-end learning solution for assessing the quality of Wikipedia articles. In Proceedings of the 13th International Symposium on Open Collaboration. 1–10. https://doi.org/10.1145/3125433.3125448
[22] Nicholas Diakopoulos. 2015. Algorithmic accountability: Journalistic investigation of computational power structures. Digital Journalism 3, 3 (2015), 398–415.
[23] Nicholas Diakopoulos and Michael Koliska. 2017. Algorithmic Transparency in the News Media. Digital Journalism 5, 7 (2017), 809–828. https://doi.org/10.1080/21670811.2016.1208053
[24] Julia Dressel and Hany Farid. 2018. The accuracy, fairness, and limits of predicting recidivism. Science Advances 4, 1 (2018). https://doi.org/10.1126/sciadv.aao5580
[25] Cynthia Dwork, Nicole Immorlica, Adam Tauman Kalai, and Max Leiserson. 2018. Decoupled classifiers for group-fair and efficient machine learning. In ACM Conference on Fairness, Accountability and Transparency. 119–133.
[26] Virginia Eubanks. 2018. Automating inequality: How high-tech tools profile, police, and punish the poor. St. Martin’s Press.
[27] Heather Ford and Judy Wajcman. 2017. ’Anyone can edit’, not everyone does: Wikipedia’s infrastructure and the gender gap. Social Studies of Science 47, 4 (2017), 511–527. https://doi.org/10.1177/0306312717692172
[28] Andrea Forte, Vanesa Larco, and Amy Bruckman. 2009. Decentralization in Wikipedia governance. Journal of Management Information Systems 26, 1 (2009), 49–72.
[29] Batya Friedman. 1996. Value-sensitive design. interactions 3, 6 (1996), 16–23.
[30] Batya Friedman and Helen Nissenbaum. 1996. Bias in computer systems. ACM Transactions on Information Systems (TOIS) 14, 3 (1996), 330–347.
[31] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2018. Datasheets for datasets. arXiv preprint arXiv:1803.09010 (2018).
[32] R Stuart Geiger. 2011. The lives of bots. In Critical Point of View: A Wikipedia Reader. Institute of Network Cultures, Amsterdam, 78–93. http://stuartgeiger.com/lives-of-bots-wikipedia-cpov.pdf
[33] R Stuart Geiger. 2014. Bots, bespoke, code and the materiality of software platforms. Information, Communication & Society 17, 3 (2014), 342–356.
[34] R. Stuart Geiger. 2017. Beyond opening up the black box: Investigating the role of algorithmic systems in Wikipedian organizational culture. Big Data & Society 4, 2 (2017), 2053951717730735. https://doi.org/10.1177/2053951717730735
[35] R Stuart Geiger and Aaron Halfaker. 2013. When the levee breaks: without bots, what happens to Wikipedia’s quality control processes?. In Proceedings of the 9th International Symposium on Open Collaboration. ACM, 6.
[36] R Stuart Geiger and Aaron Halfaker. 2017. Operationalizing conflict and cooperation between automated software agents in wikipedia: A replication and expansion of ‘even good bots fight’. Proceedings of the ACM on Human-Computer Interaction 1, CSCW (2017), 1–33.
[37] R Stuart Geiger and David Ribes. 2010. The work of sustaining order in wikipedia: the banning of a vandal. In Proceedings of the 2010 ACM conference on Computer supported cooperative work. ACM, 117–126.
[38] R Stuart Geiger and David Ribes. 2011. Trace ethnography: Following coordination through documentary practices. In Proceedings of the 2011 Hawaii International Conference on System Sciences. IEEE, 1–10. https://doi.org/10.1109/HICSS.2011.455
[39] R Stuart Geiger, Kevin Yu, Yanlai Yang, Mindy Dai, Jie Qiu, Rebekah Tang, and Jenny Huang. 2020. Garbage in, garbage out? do machine learning application papers in social computing report where human-labeled training data comes from?. In Proceedings of the ACM 2020 Conference on Fairness, Accountability, and Transparency. 325–336.
[40] Tarleton Gillespie. 2014. The relevance of algorithms. Media technologies: Essays on communication, materiality, and society 167 (2014).
[41] Tarleton Gillespie. 2018. Custodians of the internet : platforms, content moderation, and the hidden decisions that shape social media. Yale University Press, New Haven.
[42] Lisa Gitelman. 2013. Raw data is an oxymoron. The MIT Press, Cambridge, MA.
[43] Mary L Gray and Siddharth Suri. 2019. Ghost work: how to stop Silicon Valley from building a new global underclass. Eamon Dolan Books.
[44] Ben Green and Yiling Chen. 2019. The principles and limits of algorithm-in-the-loop decision making. Proceedings of the ACM on Human-Computer Interaction 3, CSCW (2019), 1–24.
[45] Aaron Halfaker. 2016. Notes on writing a Vandalism Detection paper. Socio-Technologist blog. http://socio-technologist.blogspot.com/2016/01/notes-on-writing-wikipedia-vandalism.html
[46] Aaron Halfaker. 2017. Automated classification of edit quality (worklog, 2017-05-04). Wikimedia Research. https://meta.wikimedia.org/wiki/Research_talk:Automated_classification_of_edit_quality/Work_log/2017-05-04
[47] Aaron Halfaker. 2017. Interpolating Quality Dynamics in Wikipedia and Demonstrating the Keilana Effect. In Proceedings of the 13th International Symposium on Open Collaboration. ACM, 19.
[48] Aaron Halfaker, R Stuart Geiger, Jonathan T Morgan, and John Riedl. 2013. The rise and decline of an open collaboration system: How Wikipedia’s reaction to popularity is causing its decline. American Behavioral Scientist 57, 5 (2013), 664–688.
[49] Aaron Halfaker, R Stuart Geiger, and Loren G Terveen. 2014. Snuggle: Designing for efficient socialization and ideological critique. In Proceedings of the SIGCHI conference on human factors in computing systems. ACM, 311–320.
[50] Aaron Halfaker and John Riedl. 2012. Bots and cyborgs: Wikipedia’s immune system. Computer 45, 3 (2012), 79–82.
[51] Aaron Halfaker and Dario Taraborelli. 2015. Artificial Intelligence Service “ORES” Gives Wikipedians XRay Specs to See Through Bad Edits. Wikimedia Foundation blog. https://blog.wikimedia.org/2015/11/30/artificial-intelligence-x-ray-specs/
[52] Mahboobeh Harandi, Corey Brian Jackson, Carsten Osterlund, and Kevin Crowston. 2018. Talking the Talk in Citizen Science. In Companion of the 2018 ACM Conference on Computer Supported Cooperative Work and Social Computing. Association for Computing Machinery, New York, NY, USA, 309–312. https://doi.org/10.1145/3272973.3274084
[53] Donna Haraway. 1988. Situated knowledges: The science question in feminism and the privilege of partial perspective. Feminist Studies 14, 3 (1988), 575–599.
[54] Sandra G Harding. 1987. Feminism and methodology: Social science issues. Indiana University Press, Bloomington, IN.
[55] Steve Harrison, Deborah Tatar, and Phoebe Sengers. 2007. The three paradigms of HCI. In SIGCHI Conference on Human Factors in Computing Systems.
[56] Hilary Hutchinson, Wendy Mackay, Bo Westerlund, Benjamin B Bederson, Allison Druin, Catherine Plaisant, Michel Beaudouin-Lafon, Stéphane Conversy, Helen Evans, Heiko Hansen, et al. 2003. Technology probes: inspiring design for and with families. In Proceedings of the SIGCHI conference on Human factors in computing systems. ACM, 17–24.
[57] Abigail Z Jacobs, Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. The meaning and measurement of bias: lessons from natural language processing. In Proceedings of the 2020 ACM Conference on Fairness, Accountability, and Transparency. 706–706. https://doi.org/10.1145/3351095.3375671
[58] Nikesh Joshi, Francesca Spezzano, Mayson Green, and Elijah Hill. 2020. Detecting Undisclosed Paid Editing in Wikipedia. In Proceedings of The Web Conference 2020. 2899–2905.
[59] Rob Kitchin. 2017. Thinking critically about and researching algorithms. Information, Communication & Society 20, 1 (2017), 14–29. https://doi.org/10.1080/1369118X.2016.1154087
[60] Aniket Kittur, Jeffrey V Nickerson, Michael Bernstein, Elizabeth Gerber, Aaron Shaw, John Zimmerman, Matt Lease, and John Horton. 2013. The future of crowd work. In Proceedings of the 2013 conference on Computer supported cooperative work. ACM, 1301–1318.
[61] Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Brian Granger, Matthias Bussonnier, Jonathan Frederic, Kyle Kelley, Jessica Hamrick, Jason Grout, Sylvain Corlay, et al. 2016. Jupyter Notebooks—a publishing format for reproducible computational workflows. In Positioning and Power in Academic Publishing: Players, Agents and Agendas: Proceedings of the 20th International Conference on Electronic Publishing. IOS Press, 87.
[62] Joshua A Kroll, Solon Barocas, Edward W Felten, Joel R Reidenberg, David G Robinson, and Harlan Yu. 2016. Accountable algorithms. U. Pa. L. Rev. 165 (2016), 633.
[63] Min Kyung Lee, Daniel Kusbit, Anson Kahng, Ji Tae Kim, Xinran Yuan, Allissa Chan, Daniel See, Ritesh Noothigattu, Siheon Lee, Alexandros Psomas, et al. 2019. WeBuildAI: Participatory framework for algorithmic governance. Proceedings of the ACM on Human-Computer Interaction 3, CSCW (2019), 1–35.
[64] Susan Leigh Star. 2010. This is not a boundary object: Reflections on the origin of a concept. Science, Technology, & Human Values 35, 5 (2010), 601–617.
[65] Lawrence Lessig. 1999. Code: And other laws of cyberspace. Basic Books.
[66] Randall M. Livingstone. 2016. Population automation: An interview with Wikipedia bot pioneer Ram-Man. First Monday 21, 1 (2016). https://doi.org/10.5210/fm.v21i1.6027
[67] Teresa Lynch and Shirley Gregor. 2004. User participation in decision support systems development: influencing system outcomes. European Journal of Information Systems 13, 4 (2004), 286–301.
[68] Jane Margolis, Rachel Estrella, Joanna Goode, Holme Jellison, and Kimberly Nao. 2017. Stuck in the shallow end: Education, race, and computing. MIT Press, Cambridge, MA.
[69] Jane Margolis and Allan Fisher. 2002. Unlocking the clubhouse: Women in computing. MIT Press, Cambridge, MA.
[70] Adrienne Massanari. 2017. #Gamergate and The Fappening: How Reddit’s algorithm, governance, and culture support toxic technocultures. New Media & Society 19, 3 (2017), 329–346.
[71] Robert G. Mays. 1994. Forging a silver bullet from the essence of software. IBM Systems Journal 33, 1 (1994), 20–45.
[72] Amanda Menking and Ingrid Erickson. 2015. The heart work of Wikipedia: Gendered, emotional labor in the world’s largest online encyclopedia. In Proceedings of the 33rd annual ACM conference on human factors in computing systems. 207–210.
[73] John W. Meyer and Brian Rowan. 1977. Institutionalized Organizations: Formal Structure as Myth and Ceremony. 83, 2 (1977), 340–363. https://www.jstor.org/stable/2778293
[74] Marc H Meyer and Arthur DeTore. 2001. Perspective: creating a platform-based approach for developing new services. Journal of Product Innovation Management 18, 3 (2001), 188–204.
[75] Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. Model cards for model reporting. In Proceedings of the conference on fairness, accountability, and transparency. 220–229.
[76] Jonathan T Morgan, Siko Bouterse, Heather Walls, and Sarah Stierch. 2013. Tea and sympathy: crafting positive new user experiences on wikipedia. In Proceedings of the 2013 conference on Computer supported cooperative work. ACM, 839–848.
[77] Claudia Muller-Birn, Leonhard Dobusch, and James D. Herbsleb. 2013. Work-to-rule: The Emergence of Algorithmic Governance in Wikipedia. In Proceedings of the 6th International Conference on Communities and Technologies (C&T . ACM, New York, NY, USA, 80–89. https://doi.org/10.1145/2482991.2482999 event-place: Munich, Germany.
[78] Deirdre K. Mulligan, Daniel Kluttz, and Nitin Kohli. 2019. Shaping Our Tools: Contestability as a Means to Promote Responsible Algorithmic Decision Making in the Professions. SSRN Scholarly Paper ID 3311894. Social Science Research Network, Rochester, NY. https://papers.ssrn.com/abstract=3311894
[79] Deirdre K. Mulligan, Joshua A. Kroll, Nitin Kohli, and Richmond Y. Wong. 2019. This Thing Called Fairness: Disciplinary Confusion Realizing a Value in Technology. Proc. ACM Hum.-Comput. Interact. 3, CSCW, Article 119 (Nov. 2019), 36 pages. https://doi.org/10.1145/3359221
[80] Sneha Narayan, Jake Orlowitz, Jonathan T Morgan, and Aaron Shaw. 2015. Effects of a Wikipedia Orientation Game on New User Edits. In Proceedings of the 18th ACM Conference Companion on Computer Supported Cooperative Work & Social Computing. ACM, 263–266.
[81] Gina Neff and Peter Nagy. 2016. Talking to Bots: Symbiotic agency and the case of Tay. International Journal of Communication 10 (2016), 17.
[82] Neotarf. 2014. Media Viewer controversy spreads to German Wikipedia. Wikipedia Signpost. https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2014-08-13/News_and_notes
[83] Sabine Niederer and Jose van Dijck. 2010. Wisdom of the crowd or technicity of content? Wikipedia as a sociotechnical system. New Media & Society 12, 8 (Dec. 2010), 1368–1387. https://doi.org/10.1177/1461444810365297
[84] Elinor Ostrom. 1990. Governing the commons: The evolution of institutions for collective action. Cambridge university press.
[85] Neil Pollock and Robin Williams. 2008. Software and organisations: The biography of the enterprise-wide system or how SAP conquered the world. Routledge.
[86] David Ribes. 2014. The kernel of a research infrastructure. In Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing. 574–587.
[87] Sarah T Roberts. 2019. Behind the screen: Content moderation in the shadows of social media. Yale University Press.
[88] Sage Ross. 2016. Visualizing article history with Structural Completeness. Wiki Education Foundation blog. https://wikiedu.org/blog/2016/09/16/visualizing-article-history-with-structural-completeness/
[89] Christian Sandvig, Kevin Hamilton, Karrie Karahalios, and Cedric Langbort. 2014. Auditing algorithms: Research methods for detecting discrimination on internet platforms. Data and discrimination: converting critical concerns into productive inquiry (2014), 1–23.
[90] Amir Sarabadani, Aaron Halfaker, and Dario Taraborelli. 2017. Building automated vandalism detection tools for Wikidata. In Proceedings of the 26th International Conference on World Wide Web Companion. International World Wide Web Conferences Steering Committee, 1647–1654.
[91] Kjeld Schmidt and Liam Bannon. 1992. Taking CSCW seriously. Computer Supported Cooperative Work (CSCW) 1, 1-2 (1992), 7–40.
[92] Douglas Schuler and Aki Namioka (Eds.). 1993. Participatory Design: Principles and Practices. L. Erlbaum Associates, Publishers, Hillsdale, NJ.
[93] David Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison. 2015. Hidden technical debt in machine learning systems. In Advances in neural information processing systems. 2503–2511.
[94] Nick Seaver. 2017. Algorithms as culture: Some tactics for the ethnography of algorithmic systems. Big Data & Society 4, 2 (2017). https://doi.org/10.1177/2053951717738104
[95] Andrew D. Selbst, Danah Boyd, Sorelle A. Friedler, Suresh Venkatasubramanian, and Janet Vertesi. 2019. Fairness and Abstraction in Sociotechnical Systems. In Proceedings of the Conference on Fairness, Accountability, and Transparency (2019-01-29). Association for Computing Machinery, 59–68. https://doi.org/10.1145/3287560.3287598
[96] Caroline Sinders. 2019. Making Critical Ethical Software. The Critical Makers Reader:(Un)Learning Technology (2019), 86. https://ualresearchonline.arts.ac.uk/id/eprint/14218/3/CriticalMakersReader.pdf
[97] C Estelle Smith, Bowen Yu, Anjali Srivastava, Aaron Halfaker, Loren Terveen, and Haiyi Zhu. 2020. Keeping Community in the Loop: Understanding Wikipedia Stakeholder Values for Machine Learning-Based Systems. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 1–14.
[98] Susan Leigh Star and Karen Ruhleder. 1994. Steps towards an ecology of infrastructure: complex problems in design and access for large-scale collaborative systems. In Proceedings of the 1994 ACM conference on Computer supported cooperative work. ACM, 253–264.
[99] Besiki Stvilia, Abdullah Al-Faraj, and Yong Jeong Yi. 2009. Issues of cross-contextual information quality evaluation - The case of Arabic, English, and Korean Wikipedias. Library & information science research 31, 4 (2009), 232–239.
[100] Nathan TeBlunthuis, Aaron Shaw, and Benjamin Mako Hill. 2018. Revisiting The Rise and Decline in a Population of Peer Production Projects. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, 355.
[101] Nathaniel Tkacz. 2014. Wikipedia and the Politics of Openness. University of Chicago Press, Chicago.
[102] Chau Tran, Kaylea Champion, Andrea Forte, Benjamin Mako Hill, and Rachel Greenstadt. 2019. Tor Users Contributing to Wikipedia: Just Like Everybody Else? arXiv preprint arXiv:1904.04324 (2019).
[103] Zeynep Tufekci. 2015. Algorithms in our midst: Information, power and choice when software is everywhere. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing. ACM, 1918–1918.
[104] Siva Vaidhyanathan. 2018. Antisocial media: How Facebook disconnects us and undermines democracy. Oxford University Press, Oxford, UK.
[105] Lyudmila Vaseva. 2019. You shall not publish: Edit filters on English Wikipedia. Ph.D. Dissertation. Freie Universität Berlin. https://www.mi.fu-berlin.de/en/inf/groups/hcc/theses/ressources/2019_MA_Vaseva.pdf
[106] Andrew G West, Sampath Kannan, and Insup Lee. 2010. STiki: an anti-vandalism tool for Wikipedia using spatiotemporal analysis of revision metadata. In Proceedings of the 6th International Symposium on Wikis and Open Collaboration. ACM, 32.
[107] Andrea Wiggins and Kevin Crowston. 2011. From conservation to crowdsourcing: A typology of citizen science. In 2011 44th Hawaii international conference on system sciences. IEEE, 1–10.
[108] Marty J Wolf, Keith W Miller, and Frances S Grodzinsky. 2017. Why we should have seen that coming: comments on Microsoft’s Tay ‘experiment,’ and wider implications. The ORBIT Journal 1, 2 (2017), 1–12.
[109] Diyi Yang, Aaron Halfaker, Robert Kraut, and Eduard Hovy. 2017. Identifying semantic edit intentions from revisions in wikipedia. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2000–2010.
[110] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P. Gummadi. 2017. Fairness Beyond Disparate Treatment & Disparate Impact: Learning Classification without Disparate Mistreatment. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1171–1180. https://doi.org/10.1145/3038912.3052660
[111] Amy X. Zhang, Lea Verou, and David Karger. 2017. Wikum: Bridging Discussion Forums and Wikis Using Recursive Summarization. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing. Association for Computing Machinery, New York, NY, USA, 2082–2096. https://doi.org/10.1145/2998181.2998235
[112] Haiyi Zhu, Bowen Yu, Aaron Halfaker, and Loren Terveen. 2018. Value-sensitive algorithm design: method, case study, and lessons. Proceedings of the ACM on Human-Computer Interaction 2, CSCW (2018), 194.
[113] Shoshana Zuboff. 1988. In the age of the smart machine: The future of work and power. Vol. 186. Basic Books, New York.
A.1 Empirical access patterns
The ORES service has been online since July 2015 [51]. Since then, usage has steadily risen as we’ve developed and deployed new models and as tool developers and researchers have built additional integrations. Currently, ORES supports 78 different models and 37 different language-specific wikis.
Generally, we see 50 to 125 requests per minute from external tools that use ORES’ predictions (excluding the MediaWiki extension, which is more difficult to track). Sometimes these external requests burst to 400-500 requests per second. Figure 8a shows the periodic and “bursty” nature of scoring requests received by the ORES service. For example, every day at about 11:40 UTC, the request rate jumps, most likely because of a batch scoring job such as a bot.
Figure 8b shows the rate of precaching requests coming from our own systems. This graph roughly reflects the rate of edits that are happening to all of the wikis that we support since we’ll start a scoring job for nearly every edit as it happens. Note that the number of precaching requests is about an order of magnitude higher than our known external score request rate. This is expected, since Wikipedia editors and the tools they use will not request a score for every single revision. This is a computational price we pay to attain a high cache hit rate and to ensure that our users get the quickest possible response for the scores that they do need.
Taken together, these strategies allow us to optimize the real-time quality control workflows and batch processing jobs of Wikipedians and their tools. Without serious effort to make sure that ORES is practically fast and highly available to real-time use cases, ORES would become irrelevant to the target audience and thus irrelevant as a boundary-lowering intervention. By engineering a system that conforms to the work-process needs of Wikipedians and their tools, we’ve built a systems intervention that has the potential to gain wide adoption in Wikipedia’s technical ecology.
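To make the precaching trade-off described above concrete, the following is a minimal sketch rather than ORES code: a toy simulation in which every edit may trigger a scoring job and a fixed fraction of revisions is later requested by a tool. The function name, the request_fraction parameter, and the idealized cache (no evictions, no latency, at most one request per revision) are assumptions made purely for illustration.

```python
import random


def simulate_precaching(n_edits, request_fraction, precache, seed=0):
    """Return (jobs_run, cache_hit_rate) for a toy score cache.

    Hypothetical model: each edit produces one revision, and a random
    fraction of revisions is later requested by an external tool.
    """
    rng = random.Random(seed)
    cache = set()      # revision IDs whose score is already computed
    jobs_run = 0       # scoring jobs executed (the computational price)
    requests = 0
    hits = 0
    for rev_id in range(n_edits):
        if precache:
            # Precaching: score the revision as soon as the edit happens.
            cache.add(rev_id)
            jobs_run += 1
        if rng.random() < request_fraction:
            requests += 1
            if rev_id in cache:
                hits += 1              # served immediately from the cache
            else:
                cache.add(rev_id)      # scored on demand; the caller waits
                jobs_run += 1
    hit_rate = hits / requests if requests else 0.0
    return jobs_run, hit_rate
```

In this idealized setting, precaching runs one job per edit and every request is a cache hit, whereas without precaching only the requested revisions are scored and every first request misses. With request_fraction set to 0.1, the precaching run performs roughly ten times as many scoring jobs, which mirrors the order-of-magnitude gap between precaching and external requests noted above.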
A.2 Explicit pipelines
Within the ORES system, each group of similar models has an explicit model training pipeline defined in a repository. Currently, we support 4 general classes of models:
• [...] “good-faith”
• Models that predict the quality of an article on a scale
• Models that predict whether new articles are spam or vandalism
• Models that predict the general topic space of articles and new article drafts
Within each of these model repositories is a collection of facilities for making the modeling process explicit and replayable. Consider the code shown in Figure 9, which represents a common pattern from our model-building Makefiles.
Essentially, this code helps someone determine where the labeled data comes from (manually labeled via the Wiki Labels system). It makes it clear how features are extracted (using the revscoring extract utility and the feature_lists.enwiki.damaging feature set). Finally, this dataset of extracted features is used to cross-validate and train a model predicting the “damaging” label, and a serialized version of that model is written to a file. A user could clone this repository, install the set of requirements, run make enwiki_models, and expect that the entire data pipeline would be reproduced and an equivalent model obtained.
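The stages described above (labeled data, feature extraction, cross-validation, training, serialization) can be sketched in a few lines of Python. This is an illustrative stand-in, not the revscoring utilities or the Makefile rules themselves: the two features, the nearest-centroid classifier, the label strings, and every function name below are hypothetical and chosen only to keep the example self-contained.

```python
import json
import statistics


def extract_features(observation):
    """Hypothetical feature extractor: uppercase fraction and '!' count."""
    text = observation["text"]
    upper_fraction = sum(ch.isupper() for ch in text) / max(len(text), 1)
    return [upper_fraction, text.count("!")]


class CentroidModel:
    """Toy nearest-centroid classifier (a stand-in, not an ORES model type)."""

    def __init__(self, centroids=None):
        self.centroids = centroids or {}

    def fit(self, rows, labels):
        # One centroid per label: the column-wise mean of its feature rows.
        for label in sorted(set(labels)):
            members = [r for r, l in zip(rows, labels) if l == label]
            self.centroids[label] = [statistics.mean(c) for c in zip(*members)]
        return self

    def predict(self, row):
        def sq_dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(row, centroid))
        return min(self.centroids, key=lambda lb: sq_dist(self.centroids[lb]))

    def to_json(self):
        return json.dumps({"centroids": self.centroids}, sort_keys=True)

    @classmethod
    def from_json(cls, text):
        return cls(json.loads(text)["centroids"])


def cross_validate(rows, labels, folds=2):
    """Interleaved k-fold cross-validation; returns overall accuracy."""
    correct = 0
    for k in range(folds):
        train = [i for i in range(len(rows)) if i % folds != k]
        test = [i for i in range(len(rows)) if i % folds == k]
        model = CentroidModel().fit([rows[i] for i in train],
                                    [labels[i] for i in train])
        correct += sum(model.predict(rows[i]) == labels[i] for i in test)
    return correct / len(rows)


def build_model(labeled, path=None):
    """Replay every stage in order; return (accuracy, serialized model)."""
    rows = [extract_features(obs) for obs in labeled]   # extract features
    labels = [obs["label"] for obs in labeled]          # manual labels
    accuracy = cross_validate(rows, labels)             # evaluate
    serialized = CentroidModel().fit(rows, labels).to_json()  # train + dump
    if path is not None:
        with open(path, "w") as f:                      # write model file
            f.write(serialized)
    return accuracy, serialized
```

Because each stage is an explicit function called from one entry point, rerunning build_model on the same labeled data reproduces the same serialized model, which is the property the Makefile-based pipeline is meant to provide.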
Fig. 8. Request rates to the ORES service for the week ending on April 13th, 2018
By explicitly using public resources and releasing our utilities and Makefile source code under an open license (MIT), we have essentially implemented a turn-key process for replicating our model building and evaluation pipeline. A developer can review this pipeline for issues knowing that they are not missing a step of the process because all steps are captured in the Makefile. They can also build on the process (e.g. add new features) incrementally and restart the pipeline. In our own experience, this explicit pipeline is extremely useful for identifying the origin of our own model-building bugs and for making incremental improvements to ORES’ models.
At the very base of our Makefile, a user can run make models to rebuild all of the models of a certain type. We regularly perform this process ourselves to ensure that the Makefile is an accurate representation of the data flow pipeline. Performing a complete rebuild is essential when a breaking change is made to one of our libraries or a major improvement is made to our feature extraction code. The resulting serialized models are saved to the source code repository so that a developer can review the history of any specific model and even experiment with generating scores using old model versions. This historical record of past models has already come in handy for audits of past model behavior.
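As a rough illustration of why a historical record of serialized models helps with audits, the sketch below scores the same input under every recorded model version. It is a simplification under stated assumptions: ORES keeps model files in a source code repository, whereas this example uses an in-memory record, and the ModelHistory class, the threshold-based scorer, and the version strings are all hypothetical.

```python
import json


class ModelHistory:
    """Hypothetical append-only record of serialized models, by version."""

    def __init__(self):
        self._versions = {}   # version string -> serialized model (JSON)

    def commit(self, version, model_params):
        # Past versions are never overwritten, so old behavior stays auditable.
        if version in self._versions:
            raise ValueError(f"version {version} already recorded")
        self._versions[version] = json.dumps(model_params, sort_keys=True)

    def load(self, version):
        return json.loads(self._versions[version])

    def versions(self):
        return list(self._versions)


def score(model_params, feature_value):
    """Toy scorer: flag the edit when the feature exceeds the threshold."""
    return feature_value > model_params["threshold"]


def audit(history, feature_value):
    """Score one input under every recorded model version."""
    return {v: score(history.load(v), feature_value)
            for v in history.versions()}
```

For example, after committing a version with threshold 0.5 and a later one with threshold 0.8, audit(history, 0.6) shows that the older model would have flagged the edit while the newer one would not.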
Fig. 9. Makefile rules for the English damage detection model from https://github.com/wiki-ai/editquality