This post is part of a series about evaluating user generated content, titled We don’t all think alike: Evaluating user generated content.

Recently a friend of mine posted a frustrated question to Facebook asking if anyone knew how to create a linked table of contents in Microsoft Word. She knew it was possible, but she couldn’t figure it out and was losing patience. In a past life, I was a technical writer, so my Word skills are freakishly good. I quickly typed her the answer (you have to select Insert > Index and Tables in case you’re wondering), and she was able to move on with her document.

Crowdsourcing to the rescue! But can the crowds always be trusted to give you the right answer? In a new report from Yahoo, “When the Crowd is Not Enough: Improving User Experience with Social Media through Automatic Quality Analysis,” researchers explored whether an algorithm could predict high quality answers better than live humans could. They also wanted to know whether content quality affects user behavior [1].

The results may surprise you (spoiler: the algorithm wins). To develop and test their algorithm, the researchers experimented with Yahoo Answers, which is a large question and answer site similar to Quora or Stack Overflow. Users submit questions, the crowd submits answers, and everyone votes each answer up or down based on how well it answers the question.

Displaying better quality answers means users don’t just find an answer; they find the best answer, faster. Currently, many of these sites rely on voting (e.g., thumbs up or down) to determine which answers are higher quality. This design has a major impact on users, and voting alone may not be enough to highlight the best answer.

For example, Quora not only looks at votes, but also the reputation of both the author and the voter when considering the quality of an answer. They suggest that they’re using some programmatic methods too, but they don’t outline their secret sauce, writing that they look at “other signals, including ones that help us prevent gaming of ranking through votes.” [2]

In this article, we’ll look at the other features the Yahoo researchers considered and how their algorithm was able to select the best answers programmatically.

Beyond user voting: better quality features to consider

For most question and answer sites, user voting (e.g., thumbs up or down) determines which answers are considered higher quality. However, there are problems with relying on only this system.

  • Voting is subjective. People may vote down a well-written, relevant argument purely because they disagree with it. Alternatively, people may vote up a poorly written argument because they agree with the opinion or sentiment (regardless of whether the answer is helpful).
  • Trolls and user wars wreak havoc. “Trolls” are more interested in creating chaos than being helpful, and user wars may break out among fanboys and people who have been slighted by another user previously. These groups often attack via voting or by writing irrelevant, hateful, or nonsense comments, detracting from the main goal of the website - helping people find answers.
  • People are influenced by the crowd. The researchers note that the “rich get richer” is also true for judging answers. Higher voted answers may receive even more votes due to “social influence bias.” Further, “Voting with thumbs up or down may also capture a notion of viewer agreement and not necessarily of quality.” This phenomenon has been verified by other research studies. Specifically, position bias (where the answer appears on the page) and appearance bias (how attractive the answer is due to longer text or images) have been cited as influences on voting [3].

When developing their algorithm to determine quality of answers, the researchers considered the following factors:

  • Text style addresses writing style in terms of word selection, misspellings, abusive words, etc. Short phrases like “yes” or “idk” were often removed from the data set of answers because they were seen as incomplete answers to the question.
  • Text statistics focuses on the length of answer, punctuation, etc.
  • Best answer language model helps define an expected style of a “good” answer based on prior answers that were selected by users.
  • User feedback includes number of thumbs up or down votes and how many times the answerer edited their answer.
  • Answerer reputation measures the quality of the person writing the answer based on votes for previous answers they’ve written, number of comments and replies to those answers, etc.
  • Surface word question similarity measures how similar the answer is to the question. Greater word overlap denotes a higher quality answer, because the answer is likely more related to the question. Note: the researchers took measures to address language differences and other factors that affect this measure.
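Two of these features lend themselves to a quick illustration. The sketch below is my own toy Python code, not the researchers’ implementation: it computes crude text statistics (length and punctuation counts) and uses Jaccard word overlap as a stand-in for surface word question similarity.

```python
import re
import string

def tokenize(text):
    # Lowercase and split on runs of non-letter characters.
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def text_statistics(answer):
    # Length- and punctuation-based features, in the spirit of "text statistics."
    tokens = tokenize(answer)
    punct = sum(answer.count(c) for c in string.punctuation)
    return {"num_words": len(tokens), "num_chars": len(answer), "num_punct": punct}

def surface_word_similarity(question, answer):
    # Jaccard overlap between the question's and answer's vocabularies:
    # a crude proxy for "surface word question similarity."
    q, a = set(tokenize(question)), set(tokenize(answer))
    if not q or not a:
        return 0.0
    return len(q & a) / len(q | a)

question = "How do I create a linked table of contents in Word?"
answer = "Select Insert > Index and Tables to create a table of contents."
print(text_statistics(answer))
print(round(surface_word_similarity(question, answer), 3))  # → 0.294
```

A real system would also normalize for answer length and stop words; this is only meant to make the feature categories concrete.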

Additionally, the researchers considered four quality scoring features:

  • ESA-Based Question Similarity - addresses differences in language between the question and answer using the text’s Explicit Semantic Analysis vector.
  • Answer Similarity - looks for repeated recommendations of the answer, which represent a more relevant or common view.
  • Query Performance Predictor - measures how often an answer is returned in a search result for the topic of the question.
  • Sentiment Analysis - assesses whether the answer’s language is positive, neutral, or negative. This is important, because “empathic answers are appealing, while ‘flaming’ text in an answer alienates the reader.”
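The paper trains a model over features like these; as a toy stand-in, the sketch below ranks answers by a hand-weighted linear combination of hypothetical, pre-normalized feature scores. The weights and the numbers are invented for illustration and are not taken from the study.

```python
def quality_score(features, weights):
    # Weighted sum of normalized (0-1) feature values.
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical weights, chosen by hand for this example only.
weights = {"similarity": 0.5, "sentiment": 0.2, "reputation": 0.3}

answers = [
    {"text": "idk",
     "features": {"similarity": 0.05, "sentiment": 0.5, "reputation": 0.1}},
    {"text": "Use Insert > Index and Tables, then update the table after edits.",
     "features": {"similarity": 0.6, "sentiment": 0.7, "reputation": 0.8}},
]

# Order answers best-first by their combined quality score.
ranked = sorted(answers, key=lambda a: quality_score(a["features"], weights),
                reverse=True)
for a in ranked:
    print(round(quality_score(a["features"], weights), 2), a["text"])
```

The actual research used a learned model rather than hand-picked weights, but the ranking idea (score each answer on many signals, then sort) is the same.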

Algorithm versus humans

Using these factors, the researchers developed an algorithm to predict which answers were high quality through programmatic means (i.e., without the help of a human). They asked 40 human annotators to rank a data set of questions and answers from Yahoo Answers. Then they used that data to compare their algorithm’s picks for best answers with the humans’ picks for best answers. They ran three tests.

Test 1: Can the algorithm choose better answers more often than users can?

In this test, a group of people who had written questions for Yahoo Answers previously was asked to review a set of questions and answers, choosing the answer they considered to be the best. Then the algorithm was used to find the best answers. 

  • 63% of the time, the algorithm and people chose the same answer. 
  • 37% of the time, they chose different answers.

For the answers that differed, the researchers ran a separate test in which human judges compared the answers chosen by people with those chosen by the algorithm. In most cases, the algorithm had chosen an answer of better or equal quality.

Best answers by asker versus algorithm (referred to as AQS).


Next the researchers tested their algorithm against user feedback (i.e., answers that had been voted up or down). In this case, the algorithm chose a different answer 71% of the time. Because the difference was so high, the researchers dug deeper into that 71% and found again that their algorithm had usually chosen an answer that was better or equal quality to the answer chosen by people.

Best answers by crowd versus algorithm (referred to as AQS).


They also learned that the algorithm performed better among questions that had fewer than 20 votes, which make up 99% of all questions on the Yahoo Answers website.

Test 2: Can the algorithm affect click through rates?

In this test, researchers set up an A/B test of their question pages. Each question page included a list of answers from users. One version ordered the answers by quality based on users’ votes, while the other ordered the answers by quality based on the algorithm’s choices. In both versions, low quality answers were removed.

Each answer was truncated to only show two lines, with a “show more” link. Click through rates were measured by how many times users clicked the “show more” link. Believing that users would click “show more” if the answer seemed high quality enough to read further, the researchers used this metric to determine whether votes or the algorithm produced better answers.

The algorithm version achieved a 9.2% higher click through rate, indicating that it surfaced a higher quality list of answers than user voting did.
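For clarity, click through rate here is just “show more” clicks divided by answer impressions. The sketch below uses hypothetical counts (not from the paper) and assumes the reported 9.2% is a relative lift:

```python
def ctr(clicks, impressions):
    # Click through rate: clicks on "show more" per impression.
    return clicks / impressions

# Hypothetical counts, invented for illustration.
baseline = ctr(1000, 20000)   # voting-ordered pages
variant = ctr(1092, 20000)    # algorithm-ordered pages

# Relative lift of the algorithm variant over the voting baseline.
lift = (variant - baseline) / baseline
print(f"{lift:.1%}")  # → 9.2%
```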

Test 3: Can the algorithm impact user engagement with the website?

In this test the researchers wanted to answer the following questions: 

  • “Do users spend more time reading higher quality answers? 
  • Do they view more answers if they are of higher quality? 
  • How deeply do they explore the content, and does the depth of exploration depend on the quality?”

To answer these questions, researchers created question and answer pages that included high and low quality answers. For some questions, the top answer was a low-quality answer. Then they tracked users’ time on page and how far they scrolled (i.e., to expose more content).

The results:

  • People spend a long time even on low quality answers, but they spend more time on a page when the answers are high quality.
  • People scroll more when there are more answers to look at, but they scrolled further down the page when the answers were high quality than when they were low quality.

Therefore, content quality was more important than content quantity.

So what?

It’s not enough to simply vote whether an answer is a good one, because what’s good for one person may not be for another. 

My friend who needed help with Word couldn’t pick up the phone to call Microsoft. She had to find the answer another way - product help, support forums, Google search, social media. When I told her to select Insert > Index and Tables, that seemed like an OK answer. I soon realized, though, that she might not understand that the table of contents is built from the heading styles in the document, or that she would have to update the table of contents after making changes to the document.

My first answer was a “good” answer and would probably have received votes on a forum, but my follow-up answers gave more information and would have actually helped my friend through the entire process of creating the table of contents. Voting is not enough to determine quality. When developing a question and answer site, consider other features (either user-based or programmatic) that can help your website display the best answers first, thus getting your users to answers faster.

Read more from this series at We don’t all think alike: Evaluating user generated content.

References

  1. Dan Pelleg, et al. “When the Crowd is Not Enough: Improving User Experience with Social Media through Automatic Quality Analysis.” (March 2016), Presented at The 19th ACM Conference On Computer-Supported Cooperative Work And Social Computing (CSCW 2016). https://labs.yahoo.com/publications/8514/when-crowd-not-enough-improving-user-experience-social-media-through-automatic 
  2. “How does the ranking of answers on Quora work?” (June 3, 2015), Quora. https://www.quora.com/How-does-the-ranking-of-answers-on-Quora-work 
  3. Xiaochi Wei, et al. “Re-Ranking Voting-Based Answers by Discarding User Behavior Biases.” (2015), Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015).  http://ijcai.org/papers15/Papers/IJCAI15-337.pdf 
