Unbowed, unbent, unbroken? Examining the validity of the responsibility to protect

How has the sentiment around the “responsibility to protect” (R2P) changed over time? Scholars have debated far and wide whether the political norm enjoys widespread discursive acceptance or is on the brink of decline. This article contends that we can use sentiment analysis as an important indicator for norm validity. My analysis provides three crucial insights. First, despite the well-known fear of some scholars, R2P is still frequently invoked in Security Council deliberations on issues of international peace and security. Second, overall levels of affirmative language have remained remarkably stable over time. This finding indicates that R2P is far from being obliterated. Out of 130 states, 4 international organizations (IOs), and 2 non-governmental organizations (NGOs) invoking the norm, 65% maintain a positive net-sentiment. Third, zooming into Libya as a case illustration of a critical juncture, we see some minor tonal shifts from some pivotal member states. Adding the fact that interest constellations within the Permanent Five are heterogeneous concerning the third pillar of R2P, future military interventions, sanctioned under the norm, seem unlikely.


Introduction
How has the sentiment around the "responsibility to protect" (R2P) changed over time? Since its official conception in 2005, scholars have debated far and wide whether the political norm enjoys widespread discursive acceptance or is on the brink of decline. By adopting the outcome document of the UN world summit, all participating nations committed themselves to the R2P (Badescu and Weiss, 2010: 356). The norm asserts that each state has a moral obligation to protect its citizens from mass atrocities. Should a state manifestly fail to secure its population from ethnic cleansing, crimes against humanity, war crimes, or even genocide, other states have the responsibility to intervene, using force if necessary (United Nations, 2005). Some scholars lauded the norm as "the most dramatic normative development of our time" (Thakur and Weiss, 2009: 22). Since then, the norm has been applied in numerous documents, speeches, and resolutions. However, it has been used only once to justify military intervention against the wishes of a host government. In 2011, a military intervention was carried out in Libya under the premise of the norm, averting an ensuing humanitarian catastrophe but eventually leaving the country with ongoing civil strife. Observing the aftermath of the Security Council resolution, many scholars wondered again what the intervention would mean for the norm's legacy (Deitelhoff and Zimmermann, 2020; Gholiagha, 2015; Glanville, 2016; Hehir, 2012; Hehir, 2013; Morris, 2013; Thakur, 2013). While Alex J. Bellamy and Jess Gifkins viewed R2P as unharmed (Bellamy, 2015; Gifkins, 2016), many argued that R2P had suffered existential reputational costs (Evans, 2014; Cronogue, 2012; Mamdani, 2011; Pape, 2012; Thakur, 2013: 72), leaving the norm diminished if not obliterated (Hehir, 2013; Hehir, 2019). Some scholars concluded that future policymakers would refrain from casting issues in a language of R2P due to its contentious nature (Bellamy, 2012: 13; Welsh, 2019: 55).
Thus, to render a verdict on the norm's validity, scholars are looking for and testing usable heuristics that could highlight whether a norm is in decline or actually consolidating (Crossley, 2018;Deitelhoff and Zimmermann, 2020;Girard, 2021).
Contributing to the same end, this article contends that we can use sentiment analysis as an important indicator for norm validity. In line with Deitelhoff and Zimmermann, I understand norm validity as discursive acceptance which can be operationalized as verbal support or affirmatory rhetoric concerning a norm or its content (Deitelhoff and Zimmermann, 2019: 6). I argue that automated sentiment analysis can be used to assess the extent of positive or negative framing around the R2P to gauge actors' tonality toward the norm. While positive sentiment cannot be directly equated with discursive acceptance, sentiment that is stable in terms of its positive tonality might be a meaningful indicator of norm validity. As such, positive tonality can be understood as a necessary but not sufficient condition for norm validity. 1 I carry out my analysis in the United Nations Security Council 2 because the Council represents a forum in which we can investigate both the tonality of individual member states and the consensual sentiment of the most powerful organization in world politics. Furthermore, the Security Council-tasked with preserving international peace and security-was also responsible for authorizing resolution 1973, leading to regime change in Libya. Norm scholars typically assume that critical junctures-such as the one in Libya-are significant because they shed light on the degree of norm consolidation. Therefore, the case of Libya, discussed in the Security Council, is also apt for this analysis because we can trace learning effects concerning the norm by looking at the speeches of each voting member.
My analysis provides three crucial insights. First, despite the well-known concerns of some scholars for the significance of the principle (Hehir, 2010; Hehir, 2013; Rieff, 2011), R2P is frequently invoked in Security Council deliberations on issues of international peace and security. Therefore, Edward Luck's "risk of relevance" has not materialized since the Libyan intervention. Second, overall levels of affirmative language have remained remarkably stable over time. This finding indicates that R2P is far from being obliterated and instead is likely to enjoy discursive validity. Out of 130 states, four international organizations (IOs), and two non-governmental organizations (NGOs) invoking the norm, 65% maintain a positive net-sentiment (compared to the institution's benchmark). 3 Third, zooming into Libya as a case illustration of a critical juncture, we see some minor tonal shifts from some pivotal member states. Because these shifts concern the handling of the Libyan intervention, they might be emblematic of applicatory contestation. While this specific type of contestation does not necessarily spell doom for the norm's robustness (Deitelhoff and Zimmermann, 2020: 64), persistent applicatory contestation could spill over into validity contestation in the long run. Adding the fact that interest constellations within the P5 are heterogeneous concerning the third pillar of R2P, future military interventions, sanctioned under the norm, seem unlikely.
The article unfolds in five steps. First, I build upon existing theory to analyze the wide range of scholarly positions on the validity of R2P, its presumed contestation, and their interplay. Second, I introduce augmented corpus data by Schönfeld et al. (2019), explaining each step I employed to arrive at a comprehensive set of statements relating to R2P in the Security Council. Third, I offer descriptive statistics on the invocation of the norm, showing top contributors and overall mentioning trends. Fourth, I run an automated sentiment analysis on the Security Council's consensual level and each state that has referred to the norm, estimating their individual sentiment scores. Fifth, as a case illustration of learning effects from a critical juncture, I look in detail at the case of the Libyan intervention, exemplified by the sentiment positions of each voting member before and after the intervention. Finally, I close the article by discussing the implications of the findings for the future of the norm and its unlikely application in future military endeavors.

Prior research and theory
International Relations (IR) scholars have debated for a while whether, and if so how much, the R2P has been consolidating or weakening over the years (Bellamy and Dunne, 2016; Bellamy, 2015; Crossley, 2018; Deitelhoff and Zimmermann, 2020; Thakur, 2013). While earlier debate entries questioned whether the norm played any role in the decision-making processes of international security organizations (Hehir, 2010), newer entries demonstrated the considerable extent to which the norm has shaped policy responses in the face of gruesome atrocities (Glanville, 2016; Nahlawi, 2019; Welsh, 2021). However, some of these policy responses raised significant controversy themselves. The peak of controversy concerned the aftermath of the Libyan intervention, authorized under the premise of R2P by the Security Council. Because the intervention ended in regime change, something that was not specifically called for in the UN mandate (Bellamy, 2012), some scholars believed that the norm's validity could have suffered (Morris, 2013; Pape, 2012; Thakur, 2013). Moreover, authors argued that the intervention underscored R2P's inherent potential to be abused by Western powers to achieve parochial ends (Brockmeier et al., 2016: 131). 4 As a consequence, prominent scholarship assumed that the Libya intervention would have a chilling effect, preventing states from casting future actions in the language of R2P (Welsh, 2019: 55). These concerns were summarized by Edward Luck, the Secretary General's first advisor on R2P, as the "risk of relevance" (Bellamy, 2012: 12). At its essence, the assumption was that because early support for the norm had been relatively weak (Gareth Evans spoke of "buyer's remorse" soon after the unanimous adoption of the framework in 2005; Bellamy, 2012: 17), the newly invigorated fears of ulterior motives would lead to a decline in invocations of the norm.
Since the Security Council had been quite vocal in affirming R2P (see, for example, UN Security Council, 2006) and had authorized the controversial intervention in Libya, this chilling effect was said to affect the Council itself prominently. Contrary to norm scholars who assumed that a low level of invocation might also signal internalization (Finnemore and Sikkink, 1998), 5 this line of research argues that, to stay relevant, norms rely on application and invocation in concrete situations (Deitelhoff and Zimmermann, 2020: 71). While there is limited research on whether Council resolutions continued to refer to the norm (Bellamy, 2013), there is no systematic evaluation of whether "the risk of relevance" materialized and states invoked the norm less frequently after the intervention. 6 Therefore, we can probe Edward Luck's risk of relevance as a first stab at the legacy of R2P.

H1. After the Libyan intervention, R2P should be less frequently invoked in the Security Council than before.
Furthermore, scholars discussed whether contestation arising from critical junctures, such as the one in Libya, would lead to norm decline (Panke and Petersohn, 2016; Rieff, 2011), norm hollowing (Hehir, 2019), or even norm death. Other prominent scholarship emphasized that contestation per se does not necessarily lead to norm decay (Wiener, 2014) and instead might be productive for a norm in that it triggers a process of debate and creates joint meanings (Deitelhoff and Zimmermann, 2019: 10; Wiener, 2014: 30). Much in this line of thinking, Badescu and Weiss argued that even intentional misrepresentation and widespread contestation could be beneficial to a norm (Badescu and Weiss, 2010: 355). Interestingly enough, both sides (the one arguing that contestation could improve acceptance and the one assuming otherwise) pointed, at times, to the Libyan intervention. Aidan Hehir, for example, argued that R2P was seriously weakened, perhaps even irrelevant (Hehir, 2013). For Hehir, R2P is simply "a slogan employed for differing purposes shorn of any real meaning or utility" (Hehir, 2010: 218-219). Alex J. Bellamy, on the contrary, assumed that R2P still enjoyed strong support (Bellamy, 2012). Jess Gifkins went as far as saying that R2P remained untouched and improved in terms of acceptance (Gifkins, 2016). The sheer variety of assumptions concerning R2P's assumed legacy (Donovan, 2018; Jacob, 2018) makes uniform theoretical expectations hard to come by. Instead, scholars can be roughly distinguished into two camps: those who think that the validity of the norm has improved over time and those who think the validity has worsened over time.
In addition, individual countries' rhetoric toward the norm is not well known and is often only present in the literature if it is profoundly negative (Welsh, 2019: 59). Although other scholars have tried to illuminate the legacy of R2P (Docherty et al., 2020; Dunne and Gifkins, 2011; Gholiagha and Loges, 2020; Gifkins, 2016; Pattison, 2021), their accounts either use single case studies or observe individual resolutions to form a verdict. Instead, I propose to use a country's tonality and overall levels of affirmatory or derogatory framing concerning the R2P to indicate discursive validity. 7 Therefore, I submit that we can approximate the validity of the norm by taking country-level sentiment and overall sentiment of the UNSC as discursive indicators. Overall, positive sentiment could be an indicator that R2P is still a valid norm, while negative sentiment could indicate the erosion of the norm. To qualify scholarship (and potentially expose faulty assumptions), the following theoretical assumption needs to be validated.

H2. Sentiment around the R2P should be positive rather than negative.
Newer studies have tried to connect the type of contestation with the robustness of a norm (Deitelhoff and Zimmermann, 2019; Deitelhoff and Zimmermann, 2020; Welsh, 2019). According to these approaches, robustness can be described as a composite property of a norm that itself derives from its validity and its facticity (Deitelhoff and Zimmermann, 2019: 8). Put in other words, robustness describes a particular characteristic of norms. For some scholars, this characteristic is composed of a verbal element (discursive acceptance) and a factual element (its facticity, that is, the extent to which it guides action). Taken together, these two elements are said to form the robustness of a norm (Sandholtz, 2019: 142). A series of qualitative case studies evaluated this framework against empirical evidence. 8 Most recently, an approach using item-response theory tried to harmonize quantitative and qualitative research in regard to norm robustness (Girard, 2021). While applicatory contestation is theorized to arise from specific actions taken in the name of the principle, its effects are not assumed to derail the robustness of the norm (Deitelhoff and Zimmermann, 2020: 70-71). Still, if applicatory contestations arise from concrete actions, it is reasonable to assume that the most controversial of all applications, the Libyan intervention, should cause a change in attitude, and therefore rhetoric, among the states authorizing the action.
H3. After the intervention, sentiment among authorizing member states should have worsened.

Data and research design
To answer my research question, I used the seminal dataset on UN Security Council speeches provided by Schönfeld et al. (2019). These data represented the empirical foothold for my analysis. The phrase "responsibility to protect" is a rather artificial one; therefore, false-positive hits are extremely unlikely. This made automated sentiment analysis an ideal method to assess the tonality around the norm.
I began by using string detection 9 to identify each document that contained the words "responsibility to protect" or "R2P." After that, I loaded an existing sentiment dictionary to assess the valence around the terms. 10 Next, I used keyword-in-context functions with varying word windows to increase the robustness of my findings. 11 This is crucial for my undertaking, as sentiment analysis is predicated upon the idea that you can estimate a particular word's sentiment by looking at the valence of words surrounding it. For the following analyses, I display word windows of 8, 15, and 30 words around "responsibility to protect" or "R2P" in the main text and provide a replication on a sentence level in the Supplemental Appendix. 12 To illustrate this point, one can think of sentiment analysis as an automated process that counts the occurrence of positively connotated words versus negatively connotated words (and their negations) around a word or pattern of interest. For example, if one wants to find out how diplomats talk about the R2P in the everyday language of UNSC speeches, one could select five speeches that mention the norm and read eight words before and after to make a snap judgment on its sentiment (Table 1).
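The string-detection and keyword-in-context steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions and not the article's actual pipeline: the function name `kwic_windows`, the simple regular-expression tokenizer, and the example sentence are all invented for this sketch.

```python
import re

# Patterns of interest named in the text; matching is case-insensitive.
PATTERN = re.compile(r"responsibility to protect|\bR2P\b", re.IGNORECASE)
# Hypothetical tokenizer: runs of letters, digits, and apostrophes.
TOKEN = re.compile(r"[A-Za-z0-9']+")


def kwic_windows(text, window=8):
    """Return (left_tokens, match, right_tokens) for every pattern hit."""
    hits = []
    for m in PATTERN.finditer(text):
        before = TOKEN.findall(text[:m.start()])
        after = TOKEN.findall(text[m.end():])
        left = before[-window:] if window > 0 else []
        right = after[:window]
        hits.append((left, m.group(0), right))
    return hits


# Invented example sentence, not a quotation from any real speech.
speech = (
    "We call on all members not to reduce the responsibility to protect "
    "to a slogan and to act early."
)
for left, match, right in kwic_windows(speech, window=4):
    print(" ".join(left), "|", match, "|", " ".join(right))
# prints: not to reduce the | responsibility to protect | to a slogan and
```

Calling the function with `window=8`, `window=15`, and `window=30` would yield the three window sizes reported in the main text.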
Automated sentiment analysis does the same thing, but instead of giving snap judgments on five speeches, it relies on the frequencies of positive and negative words surrounding our pattern of interest (on all references to R2P). Intuitively, human coders might actually behave in a similar manner, perhaps without being aware of the process that leads to their final coding decision. It is also noteworthy that sentiment analysis is capable (like human coders) of detecting litotes (double negatives), such as in the speech given by Slovenia stating "not to reduce" the principle. Another crucial takeaway from the example is that sentiment analysis already performs well when used on small word windows. Even eight-word windows around a term are often sufficient to understand the sentiment of a pattern of interest.
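A minimal sketch of such dictionary-based counting with negation handling might look as follows. The word lists are toy examples rather than the sentiment dictionary used in the article, and both the scoring rule (positive minus negative hits, divided by the number of tokens in the window) and the two-token negation span are assumptions made for illustration.

```python
# Toy dictionaries for illustration only; not the dictionary used in the article.
POSITIVE = {"support", "welcome", "protect", "strengthen", "commend"}
NEGATIVE = {"reduce", "fail", "abuse", "weaken", "reject"}
NEGATORS = {"not", "no", "never"}


def net_sentiment(tokens, negation_span=2):
    """Score a window as (positive - negative) / number of tokens.

    A valenced word flips its polarity when a negator appears within
    `negation_span` tokens before it, as in "not to reduce".
    """
    if not tokens:
        return 0.0
    score = 0
    for i, raw in enumerate(tokens):
        word = raw.lower()
        polarity = (word in POSITIVE) - (word in NEGATIVE)
        if polarity == 0:
            continue
        preceding = [t.lower() for t in tokens[max(0, i - negation_span):i]]
        if any(t in NEGATORS for t in preceding):
            polarity = -polarity
        score += polarity
    return score / len(tokens)


# "reduce" is negative, but the preceding "not" flips it to positive.
print(net_sentiment("we urge members not to reduce the principle".split()))  # 0.125
print(net_sentiment("states fail to act".split()))  # -0.25
```

Under this scoring rule, a value of 0.125 means that the window contains one net positive term per eight tokens.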
Because I want to estimate country sentiment and sentiment on the consensual level of the UNSC, I transform the keyword patterns into an ordinary data frame. Then, I collapse each public statement containing R2P and the word windows surrounding it, on country and year. This leaves me with 598 country-year speeches on R2P, nested in 130 unique countries, four IOs, and two NGOs.
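The collapsing step can be illustrated with a short sketch that pools all window tokens belonging to the same country and year. The record layout (country, year, tokens) and the example rows are hypothetical; the article's own data frame may be organized differently.

```python
from collections import defaultdict


def collapse_country_year(records):
    """Pool window tokens per (country, year) pair."""
    pooled = defaultdict(list)
    for country, year, tokens in records:
        pooled[(country, year)].extend(tokens)
    return dict(pooled)


# Invented example rows: one row per statement window.
records = [
    ("France", 2011, ["we", "support", "the", "norm"]),
    ("France", 2011, ["states", "must", "act"]),
    ("France", 2012, ["we", "welcome", "the", "report"]),
    ("Brazil", 2011, ["concerns", "remain"]),
]
country_years = collapse_country_year(records)
print(len(country_years))  # 3
```

Each pooled entry then corresponds to one country-year speech in the sense used above, so that four statement windows become three country-year observations in this toy example.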
I believe that a target's sentiment is much more informative when compared to its institution's benchmark (Rauh, 2018). 13 For example, because we have reason to assume that there will be significantly negatively connotated language occurring during speeches related to R2P (as many of them will feature gruesome atrocities), only the relative comparison to the institution's benchmark (and the benchmark of the R2P debate) will jointly inform us how positive or negative rhetoric surrounding the norm truly is. Therefore, I estimate an institution's benchmark as a mean sentiment for all 77,815 UNSC speeches given during the same time frame and use bootstrapped 99% confidence intervals to show variation over time. 14 In addition, I calculate another benchmark for the entire R2P debate. 15 This leaves us with a comprehensive dataset, which details the Security Council sentiment on R2P (as an aggregate measure and for each member state) compared to the UNSC's average sentiment on any given speech from 1995 to 2019.
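A bootstrapped confidence interval for a mean benchmark could be computed along the following lines. The percentile method, the number of resamples, and the example scores are assumptions for this sketch; the article does not specify which bootstrap variant it uses.

```python
import random
import statistics


def bootstrap_mean_ci(scores, level=0.99, n_boot=2000, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_boot)
    )
    alpha = (1.0 - level) / 2.0
    lower = means[int(alpha * n_boot)]
    upper = means[min(n_boot - 1, int((1.0 - alpha) * n_boot))]
    return statistics.fmean(scores), lower, upper


# Invented speech-level sentiment scores for a single year.
yearly_scores = [0.12, 0.05, 0.09, 0.15, -0.02, 0.11, 0.08, 0.10]
mean, lower, upper = bootstrap_mean_ci(yearly_scores)
print(f"mean={mean:.3f}, 99% CI=[{lower:.3f}, {upper:.3f}]")
```

Repeating this per year would give the kind of band around the benchmark line that shows variation over time; intervals widen when a year contains few speeches.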

Empirical findings
Let us begin by illustrating some crucial quantities of interest. An intuitive question is as follows: When was R2P first discussed in the Security Council of the United Nations? In July 1996, France was the first nation to reference the norm, setting a pattern that would repeat itself throughout the years. In fact, France is also the permanent member that invoked the norm the most (39 speeches). Following as a close second is the United Kingdom, with 31 speeches, and then the United States with 16 speeches. Next comes China, with 15 speeches, and finally, the Russian Federation with 13 speeches on the matter. The main contributor, however, is the United Nations itself, with 50 speeches invoking the norm. These speeches stem from the Secretary General or senior officials such as High Representatives.
Interestingly, early invocations of R2P predate the groundbreaking International Commission on Intervention and State Sovereignty (ICISS) report as well as the world summit outcome. In line with constructivist thinking (Sandholtz, 2007; Wiener, 2009), there is some evidence that prior to the adoption of these documents, there was a period of norm campaigning and contestation. There is also evidence that the meaning of the norm was, at least in its earliest invocations, understood by some actors partly differently from the well-known three pillar structure distributed by Secretary General Ban Ki-Moon (2009). 16 Some elected members are, in the absolute number of speeches, more active contributors to the debate than some powerful members of the P5. Looking at the top contributors to the discussion, we see that, while not being entirely representative in terms of global reach, four out of five regions, according to the United Nations' regional group scheme, are represented with top contributions (with Asia missing) (Table 2).

Table 1. Sentiment examples illustrated on eight-word windows around R2P on selected country-speeches. (Columns include Country and Year.)

Regarding raw (numerical) contribution, the debate seems to be quite inclusive, with 130 participating states, four IOs, 17 and two NGOs 18 giving at least one public speech concerning the norm before the Council. In total, 598 aggregated country-year speeches revolve around the R2P. The norm itself is mentioned 1487 times. On average, states devote seven speeches to the norm (with a standard deviation of eight speeches). In all, 38 entities have discussed R2P more than the average of the debate. If we take Organisation for Economic Co-operation and Development (OECD) membership as a proxy for the historical category of "the West" for these 38 entities, then the debate has a slight Western bias, with 52.6% of contributions stemming from OECD members and 42.1% from non-members. The remaining 5.3% come from the United Nations and the Vatican. Five countries have not referred to the R2P during the time of this study: Bhutan, Grenada, North Korea, St. Kitts and Nevis, and Vanuatu. The top five agenda items where R2P is invoked are civilians in conflict (with 360 invocations), the situation in the Middle East (with 81 invocations), maintenance of international peace and security (with 71 invocations), children and armed conflict (with 48 invocations), and women, peace, and security (with 45 invocations). To arrive at a better understanding of the debate's duration, I plot below each reference to the norm from 1995 to 2019 (Figure 1).
In 2014, R2P mentions peak with 126 references to the norm. Surprisingly, the year 2011 (the year in which the UNSC authorized the intervention in Libya) is not when the discussions culminated. In other words, the high point of deliberation on the norm did not coincide with the fall of the Gaddafi regime. In fact, the norm was also discussed in public sessions that focused on later crises such as Syria, Mali, Democratic Republic of the Congo (DRC), Cote d'Ivoire, Somalia, Yemen, the Central African Republic (CAR), and South Sudan. 19 The line plot also illustrates that interest in R2P has not vanished from the Security Council's public debates. Although recent mentions of the norm have not reached peak levels, the term was referred to 56 times in 2019 alone. This demonstrates that the Security Council still sees the need to discuss R2P concerning international peace and security matters. Thus, H1 must be rejected. States have not shied away from casting matters into the language of R2P and still frequently invoke the norm in international security deliberations, therefore mitigating Edward Luck's fear of "the risk of relevance" (Bellamy, 2012: 13). The sheer frequency alone, however, cannot tell us sufficiently how member states talk about the norm. Let us, therefore, consider the sentiment around R2P.
Figure 2 provides at least two important insights. First, framing around the R2P is moderately positive throughout the years. 20 Compared to the institution's benchmark, however, language surrounding R2P has substantial overlap with ordinary UNSC talk (indicated by the dotted purple line). All three depicted measures show affirmative sentiment over time, and two of them show statistically significant positive net-sentiment continuously after 2005. 21 Admittedly, confidence intervals are rather large in the early years of the political debate, indicating substantial variance in contributing states' mean sentiment.
In addition, early years of the debate saw fewer speeches devoted to the norm. Hence, the variance in tonality before 2005 is also a function of limited data. Compared to fully neutral vocabulary (a mean of zero), the debate on R2P is most often positive in its tonality. Compared to the mean sentiment of any given Security Council debate (indicated by the dotted purple line), however, the overall mean sentiment on R2P is-statistically speaking-not different from the mean sentiment of other UNSC debates. Across all years and discussions, the mean sentiment of a Security Council speech is roughly at 0.094. As a result, the usual Security Council debate is slightly positive, with approximately 10% more positive (and negated negative terms) than negative words. The discussion on R2P arrives at a mean sentiment of 0.086, illustrating that the tonality around the norm is remarkably close to its institution's benchmark. Therefore, despite critics' insistence, sentiment around R2P is far from being overwhelmingly negative and is rather comparable to the tonality of an ordinary UNSC debate.
In essence, the R2P exemplifies shifts in sentiment we would expect from a consolidating norm. Concerning the aggregate level of the UNSC, these shifts are likely to reflect ordinary forms of political opposition and deliberation, rather than pointing to fundamental opposition. Norms are often described as volatile entities that transform through regional or even local interaction (Acharya, 2013: 471). Changes in sentiment might, therefore, stem from ordinary dissemination or translation processes. Even slumps might be disputes in meaning-contestation in the sense of some norm scholars-and might not signal pervasive pushback. 22 If the opposite was the case, and the norm was wildly detested or there was even backlash 23 forming around the norm, the mean sentiment around R2P should have been much lower, at least significantly more negative than its institution's benchmark. The higher variance in early years is also in line with scholarship that assumes that the norm faced more validity contestation in the beginning (Deitelhoff and Zimmermann, 2020) and later on gained more acceptance among the international community. Second, trend lines of more than 20 years underscore that sentiment around the R2P seems to be rather stable. From the perspective of norm supporters, a point of concern should be that, although slightly positive, the level of affirmative sentiment is not exceedingly high. So far, however, we have only observed the consensual level of the UNSC. To render a fully informed verdict on the sentiment of the norm, and thus evaluate H2, we also need to look at individual states' sentiment.
Crucially, Figure 3 underlines the moderately positive sentiment within the Council. Out of 136 speaking entities, 89, or roughly 65%, maintain a positive sentiment toward the norm, compared to the institutional benchmark. 24 While there are few states with strongly positively connotated rhetoric-like Senegal, Jamaica, or Georgia-where more than every third word around the R2P is a positive one, most states arrive at a less pronounced but still positive mean sentiment. This suggests that the framing around the norm is-for a majority of speakers-rather affirmatory. In fact, very few states hold negative sentiment scores. Confirming qualitative scholarship, Cuba, Venezuela, and Nicaragua have low or even negative sentiment scores, which indicates substantial criticism of the norm (Welsh, 2019: 59).
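The headline share can be reproduced as simple arithmetic, and the comparison against a benchmark can be sketched as a small helper. The entity names and scores below are hypothetical, and using 0.094 (the mean sentiment of an ordinary Security Council speech reported above) as the cut-off is an assumption about how the comparison is drawn.

```python
def share_above_benchmark(entity_scores, benchmark):
    """Return the share and names of entities scoring above `benchmark`."""
    if not entity_scores:
        raise ValueError("no entities to compare")
    above = sorted(
        name for name, score in entity_scores.items() if score > benchmark
    )
    return len(above) / len(entity_scores), above


# Hypothetical mean sentiment scores per speaking entity.
toy_scores = {"Entity A": 0.35, "Entity B": 0.12, "Entity C": 0.02, "Entity D": -0.05}
share, above = share_above_benchmark(toy_scores, benchmark=0.094)
print(share, above)  # 0.5 ['Entity A', 'Entity B']

# The reported figure: 89 of 136 speaking entities.
print(round(89 / 136 * 100, 1))  # 65.4
```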
Depending on the measure, China appears to have the lowest sentiment among the P5, arriving in most models in an area of neutrality (a score of around zero). The United States maintains the highest sentiment score within this prestigious group. There are a small number of entities, for example, Afghanistan or Nepal, that flip their tonality from one word window to another, but these are very few indeed. This is likely the result of measurement error and due to the small number of speeches they gave on the norm. While automated sentiment analysis works well on small word windows, it needs a collection of these windows to arrive at robust results. For these few cases, measurement at the sentence level might be more informative and is available in the Supplemental Appendix.
Taking the evidence of Figures 2 and 3 together, we can confirm H2. All in all, sentiment by country-speakers as well as on the level of the UNSC is rather positive. Over time, affirmatory rhetoric appears to be stable as well. However, few states, judging from sentiment scores, frame the norm in exceptionally supportive terms. The fact that not even well-known supporters like the United States have exceedingly positive sentiment scores (>0.5) is telling. This could indicate that while the norm generally enjoys praise, few states are outspoken advocates of the principle. For a while now, the United States has followed the idea that others should push the principle and that the United States should lead "from behind." Yet it seems that not many states have embraced a forerunner role in terms of positive norm framing. Furthermore, because sentiment scores cannot indicate whether statements are principled or strategic in nature (Deitelhoff and Müller, 2005), we cannot discern how much of this supportive framing is truly an expression of states' interests. Nevertheless, the presented sentiment scores are at least one sound indicator that speaks for a positive discursive validity of the norm, albeit on a moderate level.
As a last measure, I want to trace the impact of critical junctures on norm framing and verbal contestation. The Libyan intervention was a watershed moment for the dynamic of the norm and remains the most controversial application, as well as the only application where a military intervention was justified under the R2P, against the wishes of a host state (Bellamy, 2015; Brockmeier et al., 2016; Deitelhoff and Zimmermann, 2020; Dunne and Gifkins, 2011). If the norm had suffered reputational costs stemming from this intervention, the framing around the norm might display a negative rhetoric. In the Security Council, 15 members were tasked with the objective to preserve peace and security in Libya. Having authorized resolution 1973, which imposed a no-fly zone and ultimately led to regime change, these members form an intuitive sample to observe learning effects concerning the norm. Thus, Figure 4 plots their sentiment before and after the intervention.
The evidence presented in Figure 4 calls for a rejection of H3. The sentiment around the norm has not worsened (or improved considerably for that matter). While there are some shifts in sentiment around the framing of R2P, coming from, for example, the United States or Germany, neither of these changes is statistically significant. In statistical terms, the only significant change comes from the wording of Brazil toward the norm. This is in line with qualitative scholarship that has argued that Brazil, although an early critic of the norm, has advanced productive feedback on how to improve it.
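One way to check whether a member's sentiment differs before and after a critical juncture is a permutation test on the difference in means. This is a swapped-in technique chosen for illustration, since the text does not state which significance test underlies Figure 4, and the scores below are invented.

```python
import random
import statistics


def permutation_test(before, after, n_perm=5000, seed=0):
    """Two-sided permutation test for a difference in mean sentiment."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = statistics.fmean(after) - statistics.fmean(before)
    pooled = list(before) + list(after)
    n_before = len(before)
    extreme = 0
    for _ in range(n_perm):
        # Randomly reassign scores to the "before" and "after" groups.
        rng.shuffle(pooled)
        diff = (
            statistics.fmean(pooled[n_before:])
            - statistics.fmean(pooled[:n_before])
        )
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the p-value strictly above zero.
    return observed, (extreme + 1) / (n_perm + 1)


# Invented window-level sentiment scores for one member state.
before = [0.10, 0.05, 0.12, 0.08, 0.11]
after = [0.07, 0.09, 0.04, 0.10, 0.06]
observed, p_value = permutation_test(before, after)
print(f"difference={observed:.3f}, p={p_value:.3f}")
```

A small p-value would suggest that a shift of the observed size is unlikely under random relabeling of speeches; with only a handful of speeches per state, such a test has little power, which fits the wide intervals discussed in the text.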
Furthermore, there are some interesting changes, after the intervention, in terms of variance around sentiment positions. China and Nigeria show stronger variance around their mean tonality. This is either driven by fewer valenced terms around the norm and more neutral vocabulary or by speeches which strongly differ in their sentiment. By manually reading some of these speeches, we see that China gives veiled and reluctant criticism toward the third pillar of R2P (Ki-Moon, 2009). In doing so, China advances an understanding that upholds classical or Westphalian conceptions of state sovereignty (Foot, 2020), which are incompatible with the international community taking forceful measures inside a host country to prevent atrocity crimes (Welsh, 2019: 61). These rhetorical actions by China can be read as a strategy to keep application issues relevant and, thus, narrow the space for forceful measures of the third pillar of the R2P. The next section briefly raises awareness for the limitations of quantitative approaches to text analysis, before I close the article by situating the findings in a broader context.

Limitations of this study
Some words of caution are necessary to contextualize the previous findings. Sentiment estimations can detect framing and tonality and perhaps even infer political attitudes, but they are not expressions of political positions. Sentiment estimates count frequencies of positive or negative words surrounding an item of interest. Thus, estimates necessarily investigate the framing around patterns of interest. Such framing may be contextually negative due to the nature of the topic. For example, a debate focusing on atrocity crimes will feature a lot of negatively connoted vocabulary, simply because of the graphic description of the crimes committed. One way to overcome contextual measurement error, which I have used in this study, is to relate sentiment estimates to institution and debate benchmarks. 26 Through such comparisons, we can arrive at more meaningful results because we can see more clearly whether measurements are truly negative or mainly negative compared to the topic's nature. Of course, such benchmarks present averages and might fail to reflect the discrepancies within single speeches. If a given state uses many more negative terms to paint a picture of a gruesome situation than the average of such a debate, sentiment estimates might be too low for this country's speech. This means that benchmarks remain an imperfect but necessary assessment of usual parlance and still represent an improvement over absolute estimates without comparison.
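The benchmark comparison can be illustrated with a minimal Python sketch. It follows the scoring described in note 14 and one possible reading of the classification rule in note 24; it is not the code used for this study. The function names, the tuple layout, and all numbers in the usage example are hypothetical, and only the benchmark value of 0.094 is taken from the notes.

```python
# Illustrative sketch only; not the author's replication code.
BENCHMARK = 0.094  # institutional benchmark reported in note 24


def weighted_sentiment(pos, negated_neg, neg, negated_pos, term_length):
    """Note 14: raw sentiment divided by document length (stopwords removed)."""
    raw = (pos + negated_neg) - (neg + negated_pos)
    return raw / term_length


def is_positive(estimates, benchmark=BENCHMARK):
    """One reading of note 24.

    `estimates` holds one (mean, ci_low, ci_high) tuple per word window.
    An entity counts as positive if at least two of the three means exceed
    the benchmark, or if all three confidence intervals include it.
    """
    above = sum(mean > benchmark for mean, _, _ in estimates)
    all_touch = all(low <= benchmark <= high for _, low, high in estimates)
    return above >= 2 or all_touch


# Hypothetical entity: two of three window estimates lie above the benchmark.
example = [(0.12, 0.10, 0.14), (0.11, 0.09, 0.13), (0.08, 0.05, 0.11)]
print(is_positive(example))
```

Comparing against the benchmark rather than against zero is what keeps a debate on atrocity crimes from being read as hostile to the norm merely because of its grim vocabulary.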
Furthermore, a weakness of automated sentiment analysis lies in the fact that its findings depend heavily on the chosen word windows and on input text quality. Scholars using the method must perform a tough balancing act: choosing small windows increases precision but also increases the danger of excluding meaningful grammatical negations. Choosing large word windows, by contrast, might increase recall but leads to a loss of precision. In the presented analysis, I opted to present three different word windows to give a more informed view on sentiment estimates. As an additional robustness measure, the Supplemental Appendix features a replication of the entire study using sentences on R2P as the unit of analysis. Crucially, the findings are fairly close to the results presented in the main text. It is noteworthy, however, that word windows form a conservative sentiment measure within the Security Council. Calculated on a sentence level, average Security Council expressions are slightly more positive; also, roughly 10% (7 percentage points) more countries have affirmative sentiment toward the norm. To boost the validity of findings, such replication steps continue to be necessary (Benoit et al., 2009). In essence, automated sentiment analysis, like most text-as-data approaches, relies on extensive validation to produce meaningful results. Even then, sentiment analysis generates models and estimations that are an abstraction of policymakers' tonality. Results should not be regarded as true or false but rather as useful or not useful. At a bare minimum, such results still give valuable insights into broader relational comparisons around a pattern of interest.
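The window trade-off can be made concrete with a small Python sketch. It is not the quanteda pipeline used in this study: the toy word lists stand in for the Lexicoder Sentiment Dictionary, the example sentence is invented, the function names are hypothetical, and negation is handled by a crude one-token look-back.

```python
# Illustrative only: toy lexicon, not the Lexicoder Sentiment Dictionary.
POSITIVE = {"support", "welcome", "protect"}
NEGATIVE = {"fail", "abuse", "threat"}
NEGATORS = {"not", "no", "never"}


def window(tokens, keyword, size):
    """Return the tokens within `size` positions of the first keyword hit."""
    idx = tokens.index(keyword)
    return tokens[max(0, idx - size): idx + size + 1]


def raw_sentiment(tokens):
    """(positive + negated negative) - (negative + negated positive)."""
    score = 0
    for i, tok in enumerate(tokens):
        # A term counts as negated if the preceding token is a negator.
        negated = i > 0 and tokens[i - 1] in NEGATORS
        if tok in POSITIVE:
            score += -1 if negated else 1
        elif tok in NEGATIVE:
            score += 1 if negated else -1
    return score


# Invented sentence; "not" sits three tokens before the keyword.
speech = "we do not support the r2p as a pretext for regime change".split()
for size in (2, 3):
    print(size, raw_sentiment(window(speech, "r2p", size)))
```

With a window of two tokens the negator falls outside the span and "support" is scored as positive; widening the window to three tokens captures "not" and flips the sign. This is the reason for reporting several window sizes side by side.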

Conclusion
Bearing the presented evidence in mind, I can expose some myths regarding the contested discursive validity of the R2P. Contrary to scholarship that has argued that the norm was largely defeated, the presented analysis still shows overall positive tonality toward the norm. The fact that roughly 65% of speaking entities in the Security Council expressed positive sentiment toward R2P, compared to the institutional benchmark, serves as a strong indicator that the discursive validity of the norm remains intact. Furthermore, Edward Luck's fear of the "risk of relevance" has not materialized. The Security Council still frequently invokes the norm when deliberating on issues of international peace and security. However, advocates of the political principle should exercise caution when observing these findings. Very few states maintain an exceedingly positive framing around the norm. This could indicate that the norm is generally well received, but few states perform the role of outspoken norm advocates. Moreover, it remains questionable whether the US approach of "leading from behind" (Bellamy, 2015) has taken root as, at least concerning raw tonality, there are not many strong norm advocates.
Zooming into the authorization of force in Libya, as a case illustration of a critical juncture, we saw that some pivotal states shifted their tonality toward the norm. While these shifts are indeed minor, their changed sentiment can be read in light of qualitative scholarship (Deitelhoff and Zimmermann, 2020; Welsh, 2019, 2021) suggesting that the specific application of the norm remains debated. Such debate does not necessarily spell doom for the norm (Brazil even increased its positive framing), but sustained and persistent applicatory contestation might lead to an inability to enforce the principle. Since R2P was contrived, inter alia, to prevent the most heinous crimes, failing to act in such instances might translate applicatory contestation into validity contestation. Given that China has begun to further an understanding of R2P that is compatible with classical notions of sovereignty, future military applications of the norm seem unlikely. 27 In turn, this means that preventive measures to tackle R2P-related crimes can only rely on tools such as crisis diplomacy, arms embargoes, or targeted sanctions. Whether these measures can effectively prevent war crimes, or even genocide, remains to be seen. To be sure, this analysis has only underscored one indicator for the discursive validity of the norm. Yet, as Deitelhoff and Zimmermann have convincingly argued, the facticity of a norm (inasmuch as it guides actions) is also a relevant criterion when assessing its robustness (Deitelhoff and Zimmermann, 2020). Therefore, future research could try to elucidate, quantitatively as well as qualitatively, the extent to which specific types of Security Council actions are justified under the R2P framework and whether or not such invocations increase support for one measure and derail support for another.
Finally, scholars' overwhelming focus on the R2P has had the unintended consequence of taking away much-needed attention from the equal sovereignty of states. 28 Sometimes theorized to stand in a diametrical relationship with the enforcement of R2P, norms on sovereignty protection have rarely been studied in relation to Security Council action. For one thing, it is surprisingly hard to find any data on the robustness, facticity, or discursive validity of the equal sovereignty of states. Furthermore, if these two norms really stand in conflict with one another, we should be witnessing an inverse relationship; states should not only invoke R2P when they act but also call more often for sovereignty of nation states when non-action is justified. Analyzing such cross-cutting interaction between rhetorical invocations of norms would not only further empirical analysis on the power of norms but also do justice to the theoretical argument that norms do not exist in an isolated space. In other words, if the R2P remains unbowed, unbent, and unbroken, we should start asking questions about the sovereignty of states.

Notes
7. [...] states in the Security Council. However, 98% of the time, states vote with "yes" on resolutions, from 1990 to 2020. Therefore, estimating country positions from voting data seems impractical as there are too few contested votes. Scholars interested in the empirical facticity of R2P should try to connect rhetoric with other types of actions.
8. For a summary, see Sandholtz (2019).
9. Which matches each reference to R2P and keeps a tally of them.
10. I rely for the entire analysis on Young and Soroka's (2012) Lexicoder Sentiment Dictionary (LSD). The dictionary is often seen as the gold standard in sentiment estimation techniques for public and political communication. Quantitative text analysis scholars, particularly the ones working with dictionaries, often point to specificities of each dictionary and are sometimes skeptical whether they can be applied in different contexts (Grimmer and Stewart, 2013). In this instance, however, the application is quite logical. Young and Soroka's dictionary was designed, inter alia, to meet the requirements of legislative speeches. While the Security Council is not a legislature, its practice of public justification before ensuing votes is very similar in its functionality.
11. I use quanteda's excellent kwic function to that end (Benoit et al., 2018). Contextual usage is important to examine because it validates that the searched terms or phrases are used in the way the scholar has anticipated. For example, the word race could be used in the context of an election (as the running of two competitors) or in the context of a social construct within societies. Depending on the aim of the research, scholars should validate that they are using references that are in line with their research interest.
12. These are available in the Supplemental Appendix under item 1 to item 4.
13. The average sentiment of institutional language.
14. I do this by counting the sum of all Lexicoder positive terms plus all negated negative words minus the sum of all negative terms plus negated positive words in Security Council speeches related to R2P: Sentiment Raw = (Positive Terms + Negated Negative Terms) - (Negative Terms + Negated Positive Terms). To arrive at a relative comparison, I calculate each document's length, which is nothing other than the sum of each row in the data feature matrix (DFM) without stopwords. Then, I remove punctuation, numbers, and symbols. Finally, I divide the raw sentiment by the term length of each document (Term Length). In this way, we obtain a scale from −1 to +1. R2P Sentiment Weighted = Sentiment Raw / Term Length (number of words in speeches without stopwords). I repeat these two steps with all speeches given in the UNSC to arrive at an institutional benchmark (average UNSC speech sentiment).
15. Surprisingly, the average R2P debate sentiment and the sentiment of an average Security Council discussion are not so different from each other. This suggests that, in terms of vernacular, a debate on the R2P is not particularly different from other issues discussed at the Security Council.
16. There are 12 speeches that feature the norm before the publication of the International Commission on Intervention and State Sovereignty (ICISS) report. While some actors emphasize that the state has a special responsibility to protect children from any harm (and thereby have a much narrower applicatory frame), others seem to think that this responsibility also applies to foreign nationals within one's country (such as peacekeepers). By 1999, references to the norm are very much in line with today's first-pillar and second-pillar invocations. In a speech given by Portugal in the year 2000, the speaker argues that the R2P is already "a well-established principle under international humanitarian law." In the Supplemental Appendix, I provide a list of relevant paragraphs of each speech which referred to the norm before the ICISS report. For reasons of space, full speeches are available as an RData frame upon publication.
17. These international organizations (IOs) are the European Union, League of Arab States, North Atlantic Treaty Organization (NATO), and the United Nations itself.
18. These are Families for Freedom and Physicians for Human Rights.
19. Jess Gifkins arrives at an identical list of target countries in R2P debates. See Gifkins (2016: 157).
20. Replicated sentiment on a sentence level is actually slightly more positive than already indicated by the three window sizes; see Supplemental Appendix, item 2.
21. Compared to neutral sentiment with an average of zero.
22. For pushback against norms and institutions, see Börzel and Zürn (2020).
23. For a conceptualization of backlash, see Alter and Zürn (2020). For backlash within the United Nations, see Cupać and Ebetürk (2021).
24. To take the varying word window sentiment scores into account, I counted an entity as having a positive sentiment when at least two out of three measures were more positive than the institution's benchmark. If the confidence intervals of all three measures centered on, or touched, the benchmark line, I also counted the observation as a positive sentiment because the benchmark is already slightly positive with a score of 0.094, setting a higher standard for the norm. In addition, word windows are a conservative measure that tends to understate rather than overstate sentiment scores. Estimated on a sentence level, 98 entities, or 72%, have a positive sentiment score against the institution's benchmark. The latter finding is available in the Supplemental Appendix.
25. Gabon held the presidency during the month of intervention and refrained from giving a speech on the matter (and is, thus, excluded from the plot). Lebanon only spoke about the R2P before intervention. Shown data represent the aggregated mean of all country-speeches on R2P given before and after 2011. Replication data on a sentence level are available in the Supplemental Appendix.
26. Average sentiment of institutional language versus debate sentiment.
27. Due to its de facto veto power as a member of the P5.
28. Some laudable exceptions include Altman (2020) and Tourinho (2021).

Supplemental material
Supplemental material for this article is available online.