
Audit of Commercial Content Moderation APIs: Over- and Under-Moderation of Group-Targeted Hate Speech

Research auditing five commercial content moderation APIs, revealing systematic biases in over- and under-moderation of hate speech targeting specific social groups.

At a glance: 5M+ queries analyzed, 5 APIs audited, 4 datasets used.

1 Introduction

Commercial content moderation APIs are marketed as scalable solutions to combat online hate speech, but they risk both silencing legitimate speech (over-moderation) and failing to protect users from harmful content (under-moderation). This paper introduces a comprehensive framework for auditing black-box NLP systems used in content moderation.

2 Methodology

2.1 Audit Framework

Our black-box audit framework evaluates commercial content moderation APIs through multiple approaches: performance evaluation, SHAP explainability analysis, and perturbation analysis. The framework analyzes five million queries across four datasets to systematically assess bias patterns.

2.2 Datasets

The study uses four diverse datasets: HateXplain for general hate speech, Civil Comments for longer texts, ToxiGen for implicit hate speech, and SBIC (the Social Bias Inference Corpus) for stereotypes and implicit bias. This diversity ensures comprehensive evaluation across different manifestations of hate speech.
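For readers who want to reproduce the setup, here is a minimal sketch of assembling these corpora. It assumes the datasets are pulled from the Hugging Face hub; the dataset identifiers (and any required configuration names) are assumptions that may differ from what the authors used, and the paper's own sampling and label mapping are not reproduced here.

# Minimal sketch of loading small samples of the four audit corpora.
# The hub identifiers below are assumptions; some datasets may require a
# specific configuration name or an accepted license on the Hugging Face hub.
from datasets import load_dataset

DATASETS = {
    "hatexplain": "hatexplain",            # general hate speech with rationales
    "civil_comments": "civil_comments",    # longer, real-world comments
    "toxigen": "toxigen/toxigen-data",     # machine-generated implicit hate
    "sbic": "social_bias_frames",          # stereotypes and implied bias (SBIC)
}

def load_audit_corpora(max_examples=1000):
    """Load a small sample from each corpus for a smoke-test audit run."""
    corpora = {}
    for name, hub_id in DATASETS.items():
        ds = load_dataset(hub_id, split="train")
        corpora[name] = ds.select(range(min(max_examples, len(ds))))
    return corpora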

2.3 APIs Evaluated

Five commercial APIs were evaluated: the Google Cloud Natural Language API, Microsoft Azure Content Moderator, the OpenAI Moderation API, Perspective API, and Amazon Comprehend. These represent the major providers in the commercial content moderation market.
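To make the querying step concrete, below is a minimal sketch of sending a single text to one of these services, the OpenAI Moderation API. The request and response fields follow OpenAI's public documentation; error handling and rate limiting are omitted, and the other providers use different request formats and authentication schemes.

# Minimal sketch: sending one text to the OpenAI Moderation API.
# Requires an API key in the OPENAI_API_KEY environment variable.
import os
import requests

def moderate_with_openai(text):
    response = requests.post(
        "https://api.openai.com/v1/moderations",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={"input": text},
        timeout=30,
    )
    response.raise_for_status()
    result = response.json()["results"][0]
    # 'flagged' is the binary decision; 'category_scores' holds per-category scores.
    return result["flagged"], result["category_scores"]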

3 Technical Framework

3.1 SHAP Analysis

SHAP (SHapley Additive exPlanations) values are used to explain the output of machine learning models. The SHAP value for feature $i$ is calculated as:

$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!}\,\bigl[f(S \cup \{i\}) - f(S)\bigr]$

where $N$ is the set of all features, $S$ is a subset of features, and $f$ is the model prediction function.
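For a handful of tokens, this sum can be evaluated exactly by enumerating all subsets, as in the sketch below. Here score_fn is an illustrative stand-in for a black-box API toxicity score, and dropping absent tokens is just one possible masking convention; an audit at the scale of this paper would rely on an approximation such as KernelSHAP, since exact enumeration is exponential in the number of features.

# Exact Shapley values for a black-box scoring function over a small token set.
# score_fn(tokens) -> float must accept any subset of tokens, including the empty list.
from itertools import combinations
from math import factorial

def shapley_values(tokens, score_fn):
    n = len(tokens)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for subset in combinations(others, size):
                # Weight |S|!(|N|-|S|-1)!/|N|! from the formula above.
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                without_i = [tokens[j] for j in subset]
                with_i = [tokens[j] for j in sorted(subset + (i,))]
                phi[i] += weight * (score_fn(with_i) - score_fn(without_i))
    return list(zip(tokens, phi))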

3.2 Perturbation Analysis

Counterfactual token fairness scores are computed by systematically perturbing input text and measuring changes in moderation decisions. This helps identify which tokens disproportionately influence moderation outcomes.
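A minimal sketch of such a perturbation check appears below; the identity-term pairs and the score_fn placeholder are illustrative assumptions, not the paper's exact protocol.

# Sketch of a counterfactual perturbation check: swap an identity term and
# measure how much the moderation score moves. score_fn(text) -> float stands
# in for an API toxicity score; the term pairs are examples only.
import re

IDENTITY_SWAPS = [("black", "white"), ("muslim", "christian"), ("gay", "straight")]

def counterfactual_score_gaps(texts, score_fn):
    """Return the average score change per identity-term pair when the term is swapped."""
    gaps = {pair: [] for pair in IDENTITY_SWAPS}
    for text in texts:
        for original, replacement in IDENTITY_SWAPS:
            if re.search(rf"\b{original}\b", text, flags=re.IGNORECASE):
                perturbed = re.sub(rf"\b{original}\b", replacement, text,
                                   flags=re.IGNORECASE)
                gaps[(original, replacement)].append(score_fn(text) - score_fn(perturbed))
    return {pair: sum(v) / len(v) for pair, v in gaps.items() if v}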

4 Results

4.1 Performance Metrics

The study found significant variation in API performance. OpenAI and Amazon performed best, with F1 scores of 0.83 and 0.81 respectively, while the remaining APIs scored markedly lower (Microsoft: 0.74, Perspective: 0.62, Google: 0.59).

4.2 Bias Patterns

All APIs demonstrated systematic biases: over-moderation of counter-speech, reclaimed slurs, and content mentioning Black, LGBTQIA+, Jewish, and Muslim people. Simultaneously, they under-moderated implicit hate speech, especially against LGBTQIA+ individuals.

Key Insights

  • APIs frequently rely on group identity terms (e.g., "black") to predict hate speech
  • Implicit hate speech using codified messages is consistently under-moderated
  • Counter-speech and reclaimed slurs are systematically over-moderated
  • Performance varies significantly across different demographic groups

5 Code Implementation

Below is a simplified Python implementation of the audit framework:

import requests
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

class ContentModerationAudit:
    def __init__(self, api_endpoints):
        self.apis = api_endpoints
        
    def query_api(self, text, api_config):
        """Query content moderation API"""
        headers = {'Authorization': f"Bearer {api_config['key']}"}
        payload = {'text': text, 'threshold': api_config.get('threshold', 0.5)}
        response = requests.post(api_config['url'], json=payload,
                                 headers=headers, timeout=30)
        return response.json()
    
    def calculate_bias_metrics(self, predictions, ground_truth, protected_groups):
        """Calculate bias metrics across protected groups"""
        metrics = {}
        for group in protected_groups:
            group_mask = protected_groups[group]
            precision, recall, f1, _ = precision_recall_fscore_support(
                ground_truth[group_mask], predictions[group_mask], average='binary'
            )
            metrics[group] = {'precision': precision, 'recall': recall, 'f1': f1}
        return metrics

# Example usage
api_configs = {
    'openai': {'url': 'https://api.openai.com/v1/moderations', 'key': 'YOUR_KEY'},
    'amazon': {'url': 'https://comprehend.amazonaws.com', 'key': 'YOUR_KEY'}
}

audit = ContentModerationAudit(api_configs)
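A possible continuation of the example, using toy labels and group masks to show how the pieces fit together; in an actual audit the predictions would come from query_api over each dataset.

# Toy continuation: the labels, predictions, and group masks below are
# placeholders; real predictions would come from audit.query_api over each dataset.
ground_truth = pd.Series([1, 0, 1, 0])
predictions = pd.Series([1, 1, 1, 0])
groups = {
    'group_a': pd.Series([True, True, False, False]),
    'group_b': pd.Series([False, False, True, True]),
}
print(audit.calculate_bias_metrics(predictions, ground_truth, groups))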

6 Future Applications

The findings have significant implications for future content moderation systems. Future research should focus on developing more nuanced models that can distinguish between harmful hate speech and legitimate discussions of identity. As noted in the CycleGAN paper (Zhu et al., 2017), domain adaptation techniques could help address distribution shifts across different demographic groups. Additionally, following the approach of the Perspective API team (Lees et al., 2022), future systems should incorporate community-specific norms and context-aware processing.

Emerging directions include:

  • Multi-modal content moderation combining text, image, and context analysis
  • Federated learning approaches to preserve privacy while improving model performance
  • Explainable AI techniques to provide transparent moderation decisions
  • Cross-cultural adaptation of moderation systems for global platforms

Original Analysis: The Double-Edged Sword of Automated Content Moderation

This research provides crucial insights into the operational realities of commercial content moderation APIs, revealing a troubling pattern of systematic bias that disproportionately affects vulnerable communities. The finding that APIs frequently rely on group identity terms like "black" to predict hate speech echoes similar issues identified in other NLP systems, such as the racial bias found in hate speech detection systems by Sap et al. (2019). What makes this study particularly significant is its scale—analyzing five million queries across multiple datasets—and its comprehensive framework that combines performance metrics with explainability techniques.

The technical approach using SHAP values and perturbation analysis represents a sophisticated methodology for auditing black-box systems. This aligns with growing calls for algorithmic transparency, similar to requirements in other high-stakes AI applications like healthcare diagnostics (Topol, 2019). The systematic under-moderation of implicit hate speech against LGBTQIA+ individuals is especially concerning, as it suggests that current systems fail to recognize sophisticated forms of discrimination that don't rely on explicit slurs.

Compared to open-source models audited in previous research (Röttger et al., 2021), commercial APIs show similar bias patterns but with potentially greater real-world impact due to their widespread deployment. The recommendation for better guidance on threshold setting is particularly important, as threshold optimization represents a key intervention point for reducing both over- and under-moderation. Future work should explore adaptive thresholds that consider context and community norms, similar to approaches discussed in the Fairness and Machine Learning literature (Barocas et al., 2019).
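To make the threshold point concrete, the sketch below sweeps candidate cutoffs over held-out scores and labels and keeps the one that minimizes the worse of the false-positive rate (over-moderation) and the false-negative rate (under-moderation). The balance criterion is an illustrative choice, not a recommendation from the paper.

# Illustrative threshold sweep: trade off over-moderation (false positives)
# against under-moderation (false negatives) on held-out scores and labels.
import numpy as np

def pick_threshold(scores, labels, candidates=np.linspace(0.05, 0.95, 19)):
    scores, labels = np.asarray(scores), np.asarray(labels)
    best, best_cost = None, float("inf")
    for t in candidates:
        flagged = scores >= t
        fpr = np.mean(flagged[labels == 0]) if (labels == 0).any() else 0.0
        fnr = np.mean(~flagged[labels == 1]) if (labels == 1).any() else 0.0
        cost = max(fpr, fnr)  # penalize whichever failure mode is currently worse
        if cost < best_cost:
            best, best_cost = t, cost
    return best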

The study's limitations, including its focus on English-language content and specific demographic groups, point to important directions for future research. As platforms become increasingly global, developing moderation systems that work across languages and cultural contexts will be essential. The framework established in this paper provides a valuable foundation for such cross-cultural audits.

7 References

  1. Hartmann, D., Oueslati, A., Staufer, D., Pohlmann, L., Munzert, S., & Heuer, H. (2025). Lost in Moderation: How Commercial Content Moderation APIs Over- and Under-Moderate Group-Targeted Hate Speech and Linguistic Variations. arXiv:2503.01623
  2. Sap, M., Card, D., Gabriel, S., Choi, Y., & Smith, N. A. (2019). The Risk of Racial Bias in Hate Speech Detection. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
  3. Röttger, P., Vidgen, B., Nguyen, D., Waseem, Z., Margetts, H., & Pierrehumbert, J. (2021). HateCheck: Functional Tests for Hate Speech Detection Models. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics.
  4. Zhu, J. Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. IEEE International Conference on Computer Vision.
  5. Lees, A., Tran, V. Q., Tay, Y., Sorensen, J., Gupta, A., Metzler, D., & Vasserman, L. (2022). A New Generation of Perspective API: Efficient Multilingual Character-level Transformers. Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
  6. Barocas, S., Hardt, M., & Narayanan, A. (2019). Fairness and Machine Learning. fairmlbook.org.
  7. Topol, E. J. (2019). High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine, 25(1), 44-56.