The Math Behind the Metrics - Coral by Vox Media

By Francis Tseng

As part of the Coral Project, we’re trying to come up with some interesting and useful metrics about community members and discussion on news sites for our first product.

It’s an interesting exercise to develop metrics which embody an organization’s principles. For instance – perhaps we see our content as the catalyst for conversations, so we’d measure an article’s success by how much discussion it generates.

Generally, there are two groups of metrics that I have been focusing on:

Asset-level metrics, computed for individual articles or whatever else may be commented on
User-level metrics, computed for individual users

For the past couple of weeks I’ve been sketching out a few ideas for these metrics:

For assets, the principles that these metrics aspire to capture are around quantity and diversity of discussion.
For users, I look at organizational approval, community approval, how much discussion this user tends to generate, and how likely they are to be moderated.

Here I’ll walk through my thought process for these initial ideas.

Asset-level metrics

For assets, I wanted to value not only the amount of discussion generated but also the diversity of those discussions. A good discussion is one in which there’s a lot of high-quality exchange (something else to be measured, but not captured in this first iteration) from many different people.

There are two scores to start:

a discussion score, which quantifies how much discussion an asset generated. This looks at how much people are talking to each other as opposed to just counting up the number of comments. For instance, a comments section in which all comments are top-level should not have a high discussion score. A comments section in which there are some really deep back-and-forths should have a higher discussion score.
a diversity score, which quantifies how many different people are involved in the discussions. Again, we don’t want to look at diversity in the comments section as a whole because we are looking for diversity within discussions, i.e. within threads.

The current sketch for computing the discussion score is via two values:

maximum thread depth: how long is the longest thread?
maximum thread width: what is the highest number of replies for a comment?

These are pretty rough approximations of “how much discussion” there is. The idea is that for sites which only allow one level of replies, a lot of replies to a comment can signal a discussion, and that a very deep thread signals the same for sites which allow more nesting.

The discussion score of a top-level thread is the product of these two intermediary metrics:

$$
\text{discussion score}_{\text{thread}} = \max(\text{thread}_{\text{depth}}) \times \max(\text{thread}_{\text{width}})
$$
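As a rough sketch of how the two intermediary values could be computed, suppose each comment is a nested dictionary with a `replies` list. That structure, the `thread_discussion_score` name, and the choice to count the top-level comment as depth 1 are assumptions made for illustration. The names `max_thread_depth` and `max_thread_width` match the helpers used in the scoring function later on, but their bodies here are guesses.

```python
def max_thread_depth(comment):
    """Length of the longest reply chain, counting the top-level comment as 1."""
    replies = comment.get('replies', [])
    if not replies:
        return 1
    return 1 + max(max_thread_depth(r) for r in replies)


def max_thread_width(comment):
    """Largest number of direct replies received by any comment in the thread."""
    replies = comment.get('replies', [])
    return max([len(replies)] + [max_thread_width(r) for r in replies])


def thread_discussion_score(comment):
    """Product of the two intermediary values for one top-level thread."""
    return max_thread_depth(comment) * max_thread_width(comment)


# Hypothetical thread: 'a' gets two replies, and replies once more to 'b'.
thread = {'user': 'a', 'replies': [
    {'user': 'b', 'replies': [{'user': 'a', 'replies': []}]},
    {'user': 'c', 'replies': []},
]}

print(thread_discussion_score(thread))  # depth 3 * width 2 = 6
```

Under these assumptions a lone top-level comment has width 0 and so scores 0, which fits the intent that a section made up only of top-level comments should not score highly.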

The discussion score for the entire asset is the value that answers this question: if a new thread were to start in this asset, what discussion score would it have?

The idea is that if a section is generating a lot of discussion, a new thread would likely also involve a lot of discussion.

The nice thing about this approach (which is similar to the one used throughout all these sketches) is that we can capture uncertainty. When a new article is posted, we have no idea how good a discussion a new thread might produce. When we have one or two threads – maybe one is long and one is short – we’re still not too sure, so we keep a fairly conservative score. But as more and more people comment, we begin to narrow in on the “true” score for the article.

More concretely (skip ahead to be spared the gory details), we assume that a thread’s discussion score is drawn from a Poisson distribution. This makes things a bit easier to model because we can use the gamma distribution as a conjugate prior.

By default, the gamma prior is parameterized with shape $k=1$ and scale $\theta=2$, which is a fairly conservative starting estimate. That is, we begin with the assumption that any new thread is unlikely to generate a lot of discussion, so it will take a lot of discussion to really convince us otherwise.

Since this gamma-Poisson model will be reused elsewhere, it is defined as its own function:

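import numpy as np
from scipy import stats

def gamma_poission_model(X, n, k, theta, quantile):
    # Posterior of a gamma(k, theta) prior after n Poisson observations X
    k = np.sum(X) + k
    t = theta/(theta*n + 1)
    return stats.gamma.ppf(quantile, k, scale=t)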

Since the gamma is a conjugate prior here, the posterior is also a gamma distribution with easily-computed parameters based on the observed data (i.e. the “actual” top-level threads in the discussion).
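Spelled out, the update performed by the function looks like this, writing $x_1, \dots, x_n$ for the observed thread scores and $\lambda$ for the unknown Poisson rate (both symbols are introduced here for illustration), with the gamma distribution given by shape and scale:

$$
\lambda \sim \text{Gamma}(k, \theta), \quad x_i \mid \lambda \sim \text{Poisson}(\lambda)
\quad \Longrightarrow \quad
\lambda \mid x_1, \dots, x_n \sim \text{Gamma}\left(k + \sum_{i=1}^{n} x_i, \; \frac{\theta}{n\theta + 1}\right)
$$

For example, with the default prior ($k=1, \theta=2$) and two hypothetical threads scoring 6 and 0, the posterior has shape $1 + 6 = 7$ and scale $2 / (2 \cdot 2 + 1) = 0.4$, so its mean is $7 \times 0.4 = 2.8$.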

We need an actual value to work with, though, so we need some point estimate of the expected discussion score. We don’t want to go with the mean, since that may be too optimistic a value, especially if we only have a couple of top-level threads to look at. So instead, we look at the lower bound of the 90% credible interval (the 0.05 quantile) to make a more conservative estimate.
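To see how the lower bound behaves compared with the mean, here is a small sketch. The `posterior_summary` helper and the thread scores are hypothetical, chosen only to show the same mix of long and short threads observed a few times and then many times.

```python
import numpy as np
from scipy import stats


def posterior_summary(X, k=1, theta=2, quantile=0.05):
    """Return (posterior mean, lower quantile) for the gamma-Poisson model."""
    n = len(X)
    shape = np.sum(X) + k
    scale = theta / (theta * n + 1)
    mean = shape * scale
    lower = stats.gamma.ppf(quantile, shape, scale=scale)
    return float(mean), float(lower)


# Hypothetical thread scores: 2 observed threads versus 20.
few = [6, 0]
many = [6, 0] * 10

mean_few, lower_few = posterior_summary(few)
mean_many, lower_many = posterior_summary(many)

print(f"2 threads:  mean={mean_few:.2f}, 0.05 quantile={lower_few:.2f}")
print(f"20 threads: mean={mean_many:.2f}, 0.05 quantile={lower_many:.2f}")
```

In both cases the 0.05 quantile sits below the mean, and the gap between the two shrinks as more threads are observed.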

So the final function for computing an asset’s discussion score is:

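def asset_discussion_score(threads, k=1, theta=2):
    # One observation per top-level thread: max width times max depth
    X = np.array([max_thread_width(t) * max_thread_depth(t) for t in threads])
    n = len(X)

    # Pass the prior parameters as-is; the model function applies the update,
    # so updating k here as well would count the data twice
    return {'discussion_score': gamma_poission_model(X, n, k, theta, 0.05)}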

A similar approach is used for an asset’s diversity score. Here we ask the question: if a new comment is posted, how likely is it to be posted by someone new to the discussion?

We can model this with a beta-binomial model; again, the beta distribution is a conjugate prior for the binomial distribution, so we can compute the posterior’s parameters very easily:

def beta_binomial_model(y, n, alpha, beta, quantile):
    # Posterior of a beta(alpha, beta) prior after y successes in n trials
    alpha_ = y + alpha
    beta_ = n - y + beta
    return stats.beta.ppf(quantile, alpha_, beta_)
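Written out, with $p$ standing for the unknown success probability (here, the chance that a new comment comes from someone new to the discussion; the symbol is introduced for illustration), the update is:

$$
p \sim \text{Beta}(\alpha, \beta), \quad y \mid p \sim \text{Binomial}(n, p)
\quad \Longrightarrow \quad
p \mid y \sim \text{Beta}(\alpha + y, \; \beta + n - y)
$$

For instance, taking $\alpha = \beta = 2$ and a hypothetical $y = 4$ successes out of $n = 5$ trials gives a $\text{Beta}(6, 3)$ posterior with mean $6 / 9 \approx 0.67$.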

Again we start with conservative parameters for the prior, $\alpha=2, \beta=2$, and then compute using threads as evidence:

def asset_diversity_score(threads, alpha=2, beta=2):
    X = set()
    n = 0
    for t in threads:
        users, n_comments = unique_participants(t)
        X = X | users
        n += n_comments
    y = len(X)

    return {'diversity_score': beta_binomial_model(y, n, alpha, beta, 0.05)}
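The `unique_participants` helper is not shown above. A minimal sketch, assuming each thread is a nested dictionary with `user` and `replies` keys (a hypothetical structure, not necessarily the one the project uses), might look like this:

```python
def unique_participants(thread):
    """Return (set of users in the thread, total number of comments in it)."""
    users = {thread['user']}
    n_comments = 1
    for reply in thread.get('replies', []):
        reply_users, reply_count = unique_participants(reply)
        users |= reply_users
        n_comments += reply_count
    return users, n_comments


# Hypothetical data: one thread with some back-and-forth, one lone comment.
threads = [
    {'user': 'ana', 'replies': [
        {'user': 'ben', 'replies': [{'user': 'ana', 'replies': []}]},
        {'user': 'cy', 'replies': []},
    ]},
    {'user': 'dee', 'replies': []},
]

# Same accumulation as in asset_diversity_score.
all_users = set()
total_comments = 0
for t in threads:
    users, n_comments = unique_participants(t)
    all_users |= users
    total_comments += n_comments

print(len(all_users), total_comments)  # 4 5
```

Here four distinct users wrote five comments, so the diversity score would be computed with y = 4 and n = 5.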

Then averages for these scores are computed across the entire sample of assets in order to give some context as to what good and bad scores are.

User-level metrics

User-level metrics are computed in a similar fashion. For each user, four metrics are computed:

a community score, which quantifies how much the community approves of them. This is computed by trying to predict the number of likes a new post by this user will get.
an organization score, which quantifies how much the organization approves of them. This is the probability that a post by this user will get “editor’s pick” or some equivalent (in the case of Reddit, “gilded”, which isn’t “organizational” but holds a similar revered status).
a discussion score, which quantifies how much discussion this user tends to generate. This answers the question: if this user starts a new thread, how many replies do we expect it to have?
a moderation probability, which is the probability that a post by this user will be moderated.

The community score and discussion score are both modeled as gamma-Poisson models using the same function as above. The organization score and moderation probability are both modeled as beta-binomial models using the same function as above.
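To make that concrete, here is a rough sketch that computes all four user-level metrics from a list of posts. The post fields (`likes`, `n_replies`, `picked`, `moderated`), the `user_metrics` name, and the reuse of the asset-level priors and 0.05 quantile are all assumptions for illustration. The two model functions are restated so the block runs on its own.

```python
import numpy as np
from scipy import stats


def gamma_poission_model(X, n, k, theta, quantile):
    k = np.sum(X) + k
    t = theta / (theta * n + 1)
    return stats.gamma.ppf(quantile, k, scale=t)


def beta_binomial_model(y, n, alpha, beta, quantile):
    alpha_ = y + alpha
    beta_ = n - y + beta
    return stats.beta.ppf(quantile, alpha_, beta_)


def user_metrics(posts, k=1, theta=2, alpha=2, beta=2, quantile=0.05):
    """Compute the four user-level metrics from a user's posts.

    Simplification (an assumption): every post is treated as a thread
    starter when counting replies for the discussion score.
    """
    n = len(posts)
    likes = [p['likes'] for p in posts]
    replies = [p['n_replies'] for p in posts]
    n_picked = sum(p['picked'] for p in posts)
    n_moderated = sum(p['moderated'] for p in posts)
    return {
        'community_score': float(gamma_poission_model(likes, n, k, theta, quantile)),
        'discussion_score': float(gamma_poission_model(replies, n, k, theta, quantile)),
        'organization_score': float(beta_binomial_model(n_picked, n, alpha, beta, quantile)),
        'moderation_probability': float(beta_binomial_model(n_moderated, n, alpha, beta, quantile)),
    }


# Hypothetical post history for one user.
posts = [
    {'likes': 10, 'n_replies': 3, 'picked': True, 'moderated': False},
    {'likes': 12, 'n_replies': 0, 'picked': False, 'moderated': False},
    {'likes': 8, 'n_replies': 5, 'picked': False, 'moderated': False},
]

metrics = user_metrics(posts)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because the moderation probability here also uses the 0.05 quantile, it is a low-end estimate. A cautious moderation tool might prefer an upper quantile for that metric instead; that choice is left open in this sketch.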

Time for more refinement

These metrics are just a few starting points to shape into more sophisticated and nuanced scoring systems. There are some desirable properties missing, and of course, every organization has different principles and values, and so the ideas presented here are not one-size-fits-all, by any means. The challenge is to create some more general framework that allows people to easily define these metrics according to what they value.

******************************

By Tara Adiseshan

As we start thinking about metrics and trust analytics at The Coral Project, I’ve been thinking about what we can measure and how that might be different from what we should measure. Luckily, there are lots of folks in this space who have done interesting research. I’m going to start posting some of the questions that I’ve been thinking about and some of the work I’m inspired by.

How do we make sure that the metrics we collect don’t penalize newcomers?

It can be easier to figure out if you trust someone after you’ve known them for a long time. And the same is frequently true when it comes to trust scores or reputation scores. The problem is, when metrics are based on long user histories on a site, it can be difficult to make decisions about newcomers. I’ve really enjoyed reading Aaron Halfaker’s work on the treatment of good-faith newcomers in the Wikipedia community. In particular, Aaron’s work described the ways in which the tools used to maintain the quality of Wikipedia contributions also contributed to the community’s decline.

How do we best respect the privacy and safety of the community members as we build these metrics?

As we build our first product, we’ll be using data that publishers are already gathering. As someone who gets very excited about data science and anti-surveillance efforts, I’ve been thinking a lot about how we can build metrics that respect the privacy and safety of community members. It was helpful to learn about Tor’s usability research and the guidelines they use as they try to make their browser more user-friendly.

How do we make sure that the metrics we collect are inclusive?

This is something I’m looking for more research on, but something that I’ve had many conversations about with folks working in the comments / moderation space. For example, what does it mean to use vocabulary or adherence to punctuation and grammar rules as a way to decide quality? It is very possible to write comments that are perfectly punctuated, while still attacking individuals or community members. It is also very possible that a comment with spelling and grammar mistakes could be an incredibly thoughtful and meaningful perspective. It’s one example, but points to a broader question. How could our metrics be used for unintentionally exclusionary or intentionally malicious purposes? The code we write is political, and I think it’s an important part of the design / development process to acknowledge that. Emma Pierson’s work on gender parity in The New York Times’ comments sections was a great read.

What kinds of feedback loops are involved in the metrics we collect?

Moderation, community response, and reputation scores can all be important parts of feedback loops that shape online spaces. I’ve heard a lot about Riot Games’ rehabilitative moderation approach in the past. I’ve also been reading Justin Cheng’s research on how community feedback shapes behavior.

If there’s any work that you’ve also found interesting, let me know on Twitter!

Thank you to all the folks who pointed me towards some of these questions and work, including the Coral team and Nate Matias.

Image by DarwinPeacock, Maklaan, CC-BY
