ICTIR '21: Proceedings of the 2021 ACM SIGIR International Conference on Theory of Information Retrieval

SESSION: Keynote Talk

Session details: Keynote Talk

Fairness and Control of Exposure in Two-sided Markets

A large number of two-sided markets are now mediated by search and recommender systems, ranging from online retail and streaming entertainment to employment and romantic-partner matching. In this talk, I will discuss how the design decisions that go into these search and recommender systems carry substantial power in shaping markets and allocating opportunity to the participants. This raises not only legal and fairness questions, but also questions about how these systems shape incentives and the long-term effectiveness of the market.

At the core of these questions lies the problem of where to rank each item, and how this affects both sides of the market. While it is well understood how to maximize the utility to the users, this talk focuses on how rankings affect the items that are being ranked. From the items' perspective, the ranking system is an arbiter of exposure and thus economic opportunity. I will discuss how machine learning algorithms that follow the conventional Probability Ranking Principle [1] can lead to undesirable and unfair exposure allocation for both exogenous and endogenous reasons. Exogenous reasons often manifest themselves as biases in the training data, which then get reflected in the learned ranking policy. But even when trained with unbiased data, reasons endogenous to the system can lead to unfair or undesirable allocation of opportunity. To overcome these challenges, I will present new machine learning algorithms [2,3,4] that directly address both endogenous and exogenous factors, allowing the designer to tailor the ranking policy to the specific two-sided market.

SESSION: Session 1A - Ranking (1)

Session details: Session 1A - Ranking (1)

Effective and Privacy-preserving Federated Online Learning to Rank

Online Learning to Rank (OLTR) has been primarily studied in the centralised setting, where a central server is responsible for indexing the searchable data, collecting the users' queries and search interactions, and optimising ranking models. A drawback of such a centralised OLTR paradigm is that it cannot guarantee users' privacy, as all data (both the searchable data and the data related to user interactions) is collected by the server.

In this paper, we propose a Federated OLTR method, called FPDGD, which leverages the state-of-the-art Pairwise Differentiable Gradient Descent (PDGD) and adapts it to the Federated Averaging framework. For a strong privacy guarantee, we further introduce a noise-adding clipping technique based on the theory of differential privacy to be used in combination with FPDGD.

Empirical evaluation shows that FPDGD significantly outperforms the only other federated OLTR method. In addition, FPDGD is more robust across different privacy guarantee requirements than the existing method: our method is therefore more reliable for real-life applications.
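The noise-adding clipping technique is grounded in differential privacy. As a rough illustration (not the paper's exact mechanism; the clipping norm and privacy parameters below are assumptions), the general "clip, then add noise" pattern for a gradient update looks like this:

```python
# A minimal sketch of differentially private gradient perturbation, assuming
# a Gaussian mechanism; FPDGD's actual mechanism may differ in its details.
import numpy as np

def privatize_gradient(grad, clip_norm=1.0, epsilon=1.0, delta=1e-5):
    # Clip the update to a bounded L2 norm, limiting how much any single
    # user's interactions can influence the shared model.
    norm = np.linalg.norm(grad)
    clipped = grad * min(1.0, clip_norm / (norm + 1e-12))
    # Add Gaussian noise calibrated to the clipped sensitivity, following
    # the standard (epsilon, delta)-DP Gaussian mechanism.
    sigma = clip_norm * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
    return clipped + np.random.normal(0.0, sigma, size=grad.shape)

# Each client would privatize its local PDGD update in this manner before
# sending it to the server for federated averaging.
noisy_update = privatize_gradient(np.random.randn(128))
```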

Towards Axiomatic Explanations for Neural Ranking Models

Recently, neural networks have been successfully employed to improve upon state-of-the-art effectiveness in ad-hoc retrieval tasks via machine-learned ranking functions. While neural retrieval models grow in complexity and impact, little is understood about their correspondence with well-studied IR principles. Recent work on interpretability in machine learning has provided tools and techniques to understand neural models in general, yet there has been little progress towards explaining ranking models. We investigate whether one can explain the behavior of neural ranking models in terms of their congruence with well-understood principles of document ranking, using established theories from axiomatic IR. Axiomatic analysis of information retrieval models has formalized a set of constraints on ranking decisions that reasonable retrieval models should fulfill. We operationalize this axiomatic thinking to reproduce rankings based on combinations of elementary constraints. This allows us to investigate to what extent the ranking decisions of neural rankers can be explained in terms of the existing retrieval axioms, and which axioms apply in which situations. Our experimental study considers a comprehensive set of axioms over several representative neural rankers. While the existing axioms can already explain the particularly confident ranking decisions rather well, future work should extend the axiom set to also cover the still "unexplainable" neural ranking decisions.
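To make the idea of operationalizing axioms concrete, here is a toy illustration (not the paper's implementation) of checking a black-box ranker against one classic constraint, TFC1: all else being equal, a document with more occurrences of a query term should score higher.

```python
# Hypothetical sketch: measure how often an arbitrary scoring function
# agrees with the TFC1 axiom over controlled document pairs.
def tfc1_agreement(score_fn, pairs):
    """pairs: (query, doc_more_tf, doc_less_tf) triples in which the two
    documents differ only in query-term frequency; score_fn(query, doc)
    returns a relevance score from any ranker, neural or otherwise."""
    agree = sum(score_fn(q, d_hi) >= score_fn(q, d_lo) for q, d_hi, d_lo in pairs)
    return agree / len(pairs)
```

Aggregating such agreement rates over many axioms and many controlled pairs indicates which constraints "explain" a neural ranker's decisions, and in which situations.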

User Fairness, Item Fairness, and Diversity for Rankings in Two-Sided Markets

Ranking items by their probability of relevance has long been the goal of conventional ranking systems. While this maximizes traditional criteria of ranking performance, there is a growing understanding that it is an oversimplification in online platforms that serve not only a diverse user population, but also the producers of the items. In particular, ranking algorithms are expected to be fair in how they serve all groups of users --- not just the majority group --- and they also need to be fair in how they divide exposure among the items. These fairness considerations can partially be met by adding diversity to the rankings, as done in several recent works. However, we show in this paper that user fairness, item fairness and diversity are fundamentally different concepts. In particular, we find that algorithms that consider only one of the three desiderata can fail to satisfy and even harm the other two. To overcome this shortcoming, we present the first ranking algorithm that explicitly enforces all three desiderata. The algorithm optimizes user and item fairness as a convex optimization problem which can be solved optimally. From its solution, a ranking policy can be derived via a novel Birkhoff-von Neumann decomposition algorithm that optimizes diversity. Beyond the theoretical analysis, we investigate empirically on a new benchmark dataset how effectively the proposed ranking algorithm can control user fairness, item fairness and diversity, as well as the trade-offs between them.
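The Birkhoff-von Neumann step can be pictured with the classic decomposition: a doubly stochastic matrix of marginal item-position probabilities is peeled into permutation matrices (deterministic rankings) with associated weights. The sketch below shows only the standard algorithm, not the paper's novel diversity-optimizing variant:

```python
# A minimal sketch of a standard Birkhoff-von Neumann decomposition.
import numpy as np
from scipy.optimize import linear_sum_assignment

def birkhoff_decompose(P, tol=1e-9):
    """P: doubly stochastic matrix of marginal ranking probabilities."""
    P = P.astype(float).copy()
    decomposition = []                      # (weight, permutation) pairs
    while P.sum() > tol:
        # Find a perfect matching inside the positive support of P.
        cost = np.where(P > tol, 0.0, 1.0)
        rows, cols = linear_sum_assignment(cost)
        theta = P[rows, cols].min()         # mass this permutation can carry
        if theta <= tol:
            break                           # numerical safety
        decomposition.append((theta, cols.copy()))
        P[rows, cols] -= theta              # peel the permutation off
    return decomposition

# Sampling permutation `cols` with probability `theta` yields a stochastic
# ranking policy whose expected exposure matches the marginals in P.
P = np.array([[0.5, 0.5], [0.5, 0.5]])
for theta, perm in birkhoff_decompose(P):
    print(theta, perm)   # two permutations, each with weight 0.5
```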

Pareto Solutions vs Dataset Optima: Concepts and Methods for Optimizing Competing Objectives with Constraints in Retrieval

When ranking search results for multiple objectives, such as maximizing the relevance and diversity of retrieved documents, competing objectives can induce a space of optimal solutions, each reflecting a different optimal trade-off over objectives. This raises several important questions. Firstly, what Pareto optimal solution set is induced by the objective functions? Secondly, what are the best solutions achievable on a given dataset? Finally, how closely do the best dataset solutions approach the true Pareto optimal solutions? We present a clear conceptual framing of these questions, with supporting terminology and visualizations, distinguishing Pareto vs. "dataset-best" solutions and providing strong intuition about how and why different optimization problems lead to different shapes and forms of solutions (regardless of optimization technique). We also provide benchmark problems for verifying the correctness of any Pareto machinery and show how existing multi-objective optimization (MOO) and filter methods can be used to provide accurate and interpretable answers to the above questions. Finally, we show how user-defined constraints imposed on the solution space can be effectively handled. In sum, we provide conceptual, translational, and practical contributions to solving MOO problems in retrieval.
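For intuition, a "dataset-best" (non-dominated) solution set can be extracted with a simple filter; the two objectives below are illustrative stand-ins for relevance and diversity:

```python
# A toy non-dominated filter over two objectives to be maximized; the
# objective names (relevance, diversity) are illustrative assumptions.
def pareto_front(points):
    """points: list of (relevance, diversity) tuples; returns the
    non-dominated ("dataset-best") subset."""
    front = []
    for p in points:
        dominated = any(q[0] >= p[0] and q[1] >= p[1] and q != p for q in points)
        if not dominated:
            front.append(p)
    return front

candidates = [(0.8, 0.2), (0.6, 0.6), (0.5, 0.4), (0.3, 0.9)]
print(pareto_front(candidates))  # [(0.8, 0.2), (0.6, 0.6), (0.3, 0.9)]
```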

SESSION: Session 1B - User Interaction

Session details: Session 1B - User Interaction

Incorporating Widget Positioning in Interaction Models of Search Behaviour

Models developed to simulate user interactions with search interfaces typically do not consider the visual layout and presentation of a Search Engine Results Page (SERP). In particular, the position and size of interface widgets --- such as entity cards and query suggestions --- are usually considered a negligible constant. In contrast, in this work, we investigate the impact of widget positioning on user behaviour. To this end, we focus on one specific widget: the Query History Widget (QHW). It allows users to see (and thus reflect on) their recently issued queries. We build a novel simulation model based on Search Economic Theory (SET) that considers how users behave when faced with such a widget by incorporating its positioning on the SERP. We derive five hypotheses from our model and experimentally validate them based on user interaction data gathered for an ad-hoc search task, run across five different placements of the QHW on the SERP. We find partial support for three of the five hypotheses, and indeed observe that a widget's location has a significant impact on search behaviour.

The Impact of Entity Cards on Learning-Oriented Search Tasks

Entity cards are a common occurrence in today's web Search Engine Results Pages (SERPs). They provide information on a complex information object in a structured manner, typically combining data from several search verticals. Entity cards have been shown to: (i) increase users' engagement with the SERP; and (ii) improve decision making for certain types of searches (such as health searches). In this paper, we investigate whether the benefits of showing entity cards also extend to the Search as Learning (SAL) domain. Do learners learn more when entity cards are present on the SERP? To answer this question, we designed a series of learning-oriented search tasks (with a minimum search time of 15 minutes), and conducted a crowdsourced Interactive Information Retrieval (IIR) user study (N=144) with four interface conditions: (i) a control with no entity cards; (ii) displaying relevant entity cards; (iii) displaying somewhat relevant entity cards; and (iv) displaying non-relevant entity cards. Our results show that (i) entity cards do not have an effect on participants' learning, but (ii) they do significantly impact participants' search behaviours across a range of dimensions (such as dwell time and search session duration).

Untangling the Concept of Task in Information Seeking and Retrieval

Many researchers have pointed to tasks as the driving force for Information Seeking and Retrieval (ISR). Unfortunately, researchers do not agree on what is meant by tasks and related concepts such as activities and search tasks. Researchers often use different terminologies to describe the same concepts, or use the same terms to refer to different concepts. Moreover, previous research has largely ignored work roles and their effect on tasks. The purpose of this paper is therefore to present some of the main competing task hierarchies by conducting a literature analysis across the fields of ISR, management, work structure, and human resources. A specific contribution is to synthesize key existing (explicit or implicit) hierarchies and frameworks of tasks into an integrated taxonomy of tasks and related concepts for use by researchers in ISR.

An Integrated Model of Task, Information Needs, Sources and Uncertainty to Design Task-Aware Search Systems

The varieties of information seeking behavior encompass a range of practices and constructs, such as the realization of an information need, selecting the nature of information, and choosing information sources. Most past work has studied the various constructs of the information seeking process --- information, information need, and information sources --- individually. However, a person forms and re-forms his or her information seeking strategy based on continually shifting values of these dimensions. This preliminary study conducted a survey of 15 search scenarios with multiple-choice characteristics, completed by 114 Amazon Mechanical Turk workers, to find out more about how these constructs shape people's preferences regarding information seeking strategies. The study took an exploratory and inferential research approach, investigating how different forms of information and information needs might lead to different information sources by building binary classification models. The results show that the choice of sources can be predicted (with 80% accuracy) if the information need, representation, and form of information are apparent.

SESSION: Session 1C - Content Analysis

Session details: Session 1C - Content Analysis

Sentiment Intensity Prediction using Neural Word Embeddings

Sentiment analysis is central to the process of mining opinions and attitudes from online texts. While much attention has been paid to the sentiment classification problem, much less work has tried to tackle the problem of predicting the intensity of the sentiment. The go-to method is VADER --- an unsupervised, lexicon-based approach to scoring sentiment. However, such approaches are limited by the vocabulary mismatch problem. In this paper, we present in detail and evaluate our AWESSOME framework (A Word Embedding Sentiment Scorer Of Many Emotions) for sentiment intensity scoring, which capitalizes on pre-existing lexicons, does not require training, and provides fine-grained and accurate sentiment intensity scores for words, phrases and text. In our experiments, we used seven sentiment collections to evaluate the proposed approach against lexicon-based approaches (e.g., VADER) and supervised methods such as deep learning based approaches (e.g., SentiBERT). The results show that, despite not surpassing supervised approaches, the AWESSOME unsupervised approach significantly outperforms existing lexicon approaches and therefore provides a simple and effective approach to sentiment analysis. The AWESSOME framework can be flexibly adapted to cater for different seed lexicons and different neural word embedding models in order to produce corpus-specific lexicons --- without the need for extensive supervised learning or retraining.
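The underlying intuition can be sketched as follows (this illustrates the general lexicon-propagation idea rather than the AWESSOME implementation; `embeddings` and `seed_lexicon` are assumed inputs):

```python
# Illustrative sketch of lexicon propagation via word embeddings; not the
# AWESSOME implementation. `embeddings` maps words to vectors (e.g. GloVe),
# `seed_lexicon` maps words to known intensity scores (e.g. from VADER).
import numpy as np

def intensity(word, embeddings, seed_lexicon, k=10):
    if word in seed_lexicon:              # in-vocabulary: use the seed score
        return seed_lexicon[word]
    v = embeddings[word]
    # Cosine similarity of the unseen word to every seed-lexicon word.
    sims = {
        w: float(v @ embeddings[w] /
                 (np.linalg.norm(v) * np.linalg.norm(embeddings[w])))
        for w in seed_lexicon if w in embeddings
    }
    top = sorted(sims.items(), key=lambda kv: -kv[1])[:k]
    # Similarity-weighted average of the k nearest seeds' intensity scores.
    return sum(s * seed_lexicon[w] for w, s in top) / sum(s for _, s in top)
```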

News Article Retrieval in Context for Event-centric Narrative Creation

Writers such as journalists often use automatic tools to find relevant content to include in their narratives. In this paper, we focus on supporting writers in the news domain to develop event-centric narratives. Given an incomplete narrative that specifies a main event and a context, we aim to retrieve news articles that discuss relevant events that would enable the continuation of the narrative. We formally define this task and propose a retrieval dataset construction procedure that relies on existing news articles to simulate incomplete narratives and relevant articles. Experiments on two datasets derived from this procedure show that state-of-the-art lexical and semantic rankers are not sufficient for this task. We show that combining those with a ranker that ranks articles by reverse chronological order outperforms those rankers alone. We also perform an in-depth quantitative and qualitative analysis of the results that sheds light on the characteristics of this task.

Can Language Models Identify Wikipedia Articles with Readability and Style Issues?

Wikipedia is frequently criticised for having poor readability and style issues. In this article, we investigate using GPT-2, a neural language model, to identify poorly written text in Wikipedia by ranking documents by their perplexity. We evaluated the properties of this ranking using human assessments of text quality, including readability, narrativity and language use. We demonstrate that GPT-2 perplexity scores correlate moderately to strongly with narrativity, but only weakly with reading comprehension scores. Importantly, the model reflects even small improvements to text as would be seen in Wikipedia edits. We conclude by highlighting that Wikipedia's featured articles counter-intuitively contain text with the highest perplexity scores. However, these examples highlight many of the complexities that need to be resolved for such an approach to be used in practice.
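A minimal sketch of such perplexity-based ranking with the Hugging Face transformers library (the paper's exact scoring setup may differ):

```python
# Rank passages by GPT-2 perplexity: higher perplexity is taken as a signal
# of less fluent, potentially poorly written text.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean token-level
        # cross-entropy loss; exp(loss) is the perplexity.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

passages = ["A well formed sentence reads naturally.",
            "sentence formed reads well A naturally."]
print(sorted(passages, key=perplexity))  # more fluent text first
```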

Multi-label Classification of Short Texts with Label Correlated Recurrent Neural Networks

Short texts are commonly seen nowadays on the Internet in various forms such as tweets, queries, comments, status updates, snippets of search results, and reviews from social platforms. Accurate categorization of these short texts is critical for enhancing information services as it provides the foundation for better search and recommendation. In many real-world applications, a short text is often associated with multiple categories. Due to the sparsity of context information, traditional multi-label classification methods do not perform well on short texts. In this paper, we propose a novel Label Correlated Recurrent Neural Network (LC-RNN) for multi-label classification of short texts by exploiting correlations between categories. We utilize a tree structure to represent the relationships among labels and consequently an efficient max-product algorithm can be developed for exact inference of label prediction. We conduct experiments on four testbeds and the results demonstrate the effectiveness of the proposed model.

The Simplest Thing That Can Possibly Work: (Pseudo-)Relevance Feedback via Text Classification

Motivated by recent commentary that has questioned today's pursuit of ever-more complex models and mathematical formalisms in applied machine learning and whether meaningful empirical progress is actually being made, this paper tackles the decades-old problem of pseudo-relevance feedback with "the simplest thing that can possibly work". We present a technique based on training a document relevance classifier for each information need using pseudo-labels from an initial ranked list and then applying the classifier to rerank the retrieved documents. Experiments demonstrate significant improvements across a number of standard newswire collections, with initial rankings supplied by bag-of-words BM25 as well as from query expansion. Further evaluations in the TREC-COVID challenge using human relevance judgments verify the effectiveness and robustness of our proposed technique. While this simple idea draws elements from several well-known threads in the literature, to our knowledge this exact combination has not previously been proposed and rigorously evaluated.
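In the same spirit of simplicity, the core idea can be sketched in a few lines of scikit-learn; the cut-offs for pseudo-positives and pseudo-negatives are illustrative assumptions, not the paper's tuned values:

```python
# Sketch of pseudo-relevance feedback via text classification: treat the top
# of an initial ranked list as pseudo-positives, the bottom as pseudo-
# negatives, train a per-query classifier, and rerank by classifier score.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def prf_rerank(ranked_docs, n_pos=10, n_neg=100):
    """ranked_docs: list of document texts, best first (e.g. from BM25)."""
    pos, neg = ranked_docs[:n_pos], ranked_docs[-n_neg:]
    vec = TfidfVectorizer()
    X = vec.fit_transform(pos + neg)
    y = [1] * len(pos) + [0] * len(neg)
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    scores = clf.predict_proba(vec.transform(ranked_docs))[:, 1]
    return [d for _, d in sorted(zip(scores, ranked_docs), reverse=True)]
```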

SESSION: Session 2A - Ranking (2)

Session details: Session 2A - Ranking (2)

Ensemble Distillation for BERT-Based Ranking Models

Over the past two years, large pretrained language models such as BERT have been applied to text ranking problems and have shown superior performance on multiple public benchmark data sets. Prior work demonstrated that an ensemble of multiple BERT-based ranking models can not only boost the performance, but also reduce the performance variance. However, an ensemble of models is more costly because it needs computing resources and/or inference time proportional to the number of models. In this paper, we study how to retain the performance of an ensemble of models at the inference cost of a single model by distilling the ensemble into a single BERT-based student ranking model. Specifically, we study different designs of teacher labels, various distillation strategies, as well as multiple distillation losses tailored for ranking problems. We conduct experiments on the MS MARCO passage ranking and the TREC-COVID data sets. Our results show that even with these simple distillation techniques, the distilled model can effectively retain the performance gain of the ensemble of multiple models. More interestingly, the performance of distilled models is also more stable than that of models fine-tuned on the original labeled data. The results reveal a promising direction to capitalize on the gains achieved by an ensemble of BERT-based ranking models.

Ranking by Language Similarity for Resource Scarce Southern Bantu Languages

Resource Scarce Languages (RSLs) lack sufficient resources to use Cross-Lingual Information Retrieval (CLIR) techniques and tools such as machine translation. Consequently, searching using RSLs is frustrating and often ends in an unsuccessful search struggle. In such search tasks, search engines return low-quality results; relevant documents are either limited and lowly ranked, or non-existent. Previous work has shown that alternative relevant results written in similar languages, including dialects, neighbouring and genetically related languages, can assist multilingual RSL speakers to complete their search tasks. To improve the quality of search results in this context, we propose the re-ranking of documents based on the similarity between the language of the document and the language of the query. Accordingly, we created a dataset of four Southern Bantu languages that includes documents, topics, topical relevance and intelligibility features, and document utility annotations. To understand the intelligibility dimension of the studied languages, we conducted online intelligibility test experiments and used the data for feature selection and intelligibility prediction. We performed re-ranking of search results using offline evaluation, exploring Learning To Rank (LTR). Our results show that integrating topical relevance and intelligibility in ranking slightly improves retrieval effectiveness. Further, results on intelligibility prediction show that classification of intelligibility is feasible with fair accuracy.

Understanding the Effectiveness of Reviews in E-commerce Top-N Recommendation

Modern E-commerce websites contain heterogeneous sources of information, such as numerical ratings, textual reviews and images, which can be utilized to assist recommendation. Through textual reviews, a user explicitly expresses her affinity towards an item. Previous researchers found that by using the information extracted from these reviews, we can better profile users' explicit preferences as well as item features, leading to improved recommendation performance. However, most previous algorithms only utilized review information for the explicit-feedback problem, i.e., rating prediction; for implicit-feedback ranking problems such as top-N recommendation, the usage of review information has not been fully explored. Seeing this gap, in this work we investigate the effectiveness of textual review information for top-N recommendation under E-commerce settings. We adapt several state-of-the-art review-based rating prediction models to top-N recommendation tasks and compare them to existing top-N recommendation models in terms of both performance and efficiency. We find that models utilizing only review information cannot achieve better performance than the vanilla implicit-feedback matrix factorization method. When utilizing review information as a regularizer or as auxiliary information, however, the performance of implicit-feedback matrix factorization can be further improved. The optimal model structure for utilizing textual reviews in E-commerce top-N recommendation is yet to be determined.

SESSION: Session 2B - Conversational Search

Session details: Session 2B - Conversational Search

Asking Clarifying Questions Based on Negative Feedback in Conversational Search

Users often need to look through multiple search result pages or reformulate queries when they have complex information-seeking needs. Conversational search systems make it possible to improve user satisfaction by asking questions to clarify users' search intents. Answering a series of open-ended questions starting with "what/why/how", however, can take significant user effort. To quickly identify user intent and reduce effort during interactions, we propose an intent clarification task based on yes/no questions, where the system needs to ask the correct question about intents within the fewest conversation turns. In this task, it is essential to use negative feedback about the previous questions in the conversation history. To this end, we propose a Maximum-Marginal-Relevance (MMR) based BERT model (MMR-BERT) to leverage negative feedback based on the MMR principle for selecting the next clarifying question. Experiments on the Qulac dataset show that MMR-BERT significantly outperforms state-of-the-art baselines on the intent identification task, and the selected questions also achieve significantly better performance in the associated document retrieval tasks.
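The MMR principle behind the model trades off relevance to the query against similarity to already-asked (and rejected) questions. A minimal sketch, with `rel` and `sim` standing in for the model's learned scoring functions:

```python
# Sketch of MMR-style selection of the next clarifying question: prefer a
# question relevant to the intent but dissimilar to rejected questions.
def mmr_select(candidates, asked, rel, sim, lam=0.5):
    """rel(q) scores relevance; sim(q, a) scores similarity to an already-
    asked question; lam trades relevance off against redundancy."""
    def mmr_score(q):
        redundancy = max((sim(q, a) for a in asked), default=0.0)
        return lam * rel(q) - (1 - lam) * redundancy
    return max(candidates, key=mmr_score)
```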

Towards Facet-Driven Generation of Clarifying Questions for Conversational Search

Clarifying an underlying user information need is an important aspect of a modern-day IR system. The importance of clarification is even higher in limited-bandwidth scenarios, such as conversational or mobile search, where a user is unable to easily browse through a long list of retrieved results. Thus, asking clarifying questions about users' potentially ambiguous queries arises as one of the main tasks of conversational search. Recent approaches, while making significant progress in the field, have remained limited to selecting a clarifying question from a predefined set or prompting the user with vague or template-based questions. However, with the recent advances in text generation through large-scale language models, an ideal system should generate the next clarifying question. The challenge of generating an appropriate clarifying question is twofold: 1) to produce the question in coherent natural language; 2) to ask a question that is relevant to the user's query. In this paper, we propose a model that generates clarifying questions with respect to the user's query and query facets. We fine-tune the GPT-2 language model to generate questions related to the query and one of the extracted query facets. Compared to competitive baselines, results show that our proposed method is both natural and useful, as judged by human annotators. Moreover, we discuss the potential theoretical framework this approach would fit in. We release the code for future work and reproducibility purposes.

Effective Query Formulation in Conversation Contextualization: A Query Specificity-based Approach

Proactively retrieving relevant information to contextualize conversations has potential applications in better understanding the conversational content between communicating parties. Since in contrast to traditional IR, there is no explicitly formulated user-query, an important research challenge is to first identify the candidate segments of text that may require contextualization for a better comprehension of their content, and then make use of these identified segments to formulate a query and eventually retrieve the potentially relevant information to augment a conversation. In this paper, we propose a generic unsupervised framework that involves shifting overlapping windows of terms through a conversation and estimating scores indicating the likelihood of the existence of an information need within these segments. Within our proposed framework, we investigate a query performance prediction (QPP) based approach for scoring these candidate term windows with the hypothesis that a term window that indicates a higher specificity is likely to be indicative of a potential information need requiring contextualization. Our experiments revealed that the QPP approaches of scoring the term windows provide better contextualization than other term extraction approaches. A post-retrieval QPP approach was observed to yield better results than a pre-retrieval one.

SESSION: Session 3A - Queries

Session details: Session 3A - Queries

A Modern Perspective on Query Likelihood with Deep Generative Retrieval Models

Existing neural ranking models follow the text matching paradigm, where document-to-query relevance is estimated through predicting the matching score. Drawing from the rich literature of classical generative retrieval models, we introduce and formalize the paradigm of deep generative retrieval models defined via the cumulative probabilities of generating query terms. This paradigm offers a grounded probabilistic view on relevance estimation while still enabling the use of modern neural architectures. In contrast to the matching paradigm, the probabilistic nature of generative rankers readily offers a fine-grained measure of uncertainty. We adopt several current neural generative models in our framework and introduce a novel generative ranker T-PGN, which combines the encoding capacity of Transformers with the Pointer Generator Network model. We conduct an extensive set of evaluation experiments on passage retrieval, leveraging the MS MARCO Passage Re-ranking and TREC Deep Learning 2019 Passage Re-ranking collections. Our results show the significantly higher performance of the T-PGN model when compared with other generative models. Lastly, we demonstrate that exploiting the uncertainty information of deep generative rankers opens new perspectives to query/collection understanding, and significantly improves the cut-off prediction task.

Modality Effects When Simulating User Querying Tasks

When faced with an information need, different users are likely to issue different queries. A renewed interest in these user query variations has resulted in a number of collections, tasks, and studies which utilize multiple queries for each topic. The most commonly applied technique for generating query variations is to show a short backstory to a pool of crowdworkers, and ask each of them to provide a keyword query that they would expect to provide more information pertaining to the backstory. In this short paper we explore whether the length of the backstory and the mode in which the backstory is conveyed to crowdworkers affect the resulting queries. Our experiments demonstrate that both the length of the backstory and the mode in which the backstory is delivered influence the resultant query variations; and that there is a consequent downstream implication in terms of forming the judgment pools necessary to assess systems in the presence of query variations.

Extracting per Query Valid Explanations for Blackbox Learning-to-Rank Models

Learning-to-rank (LTR) is a class of supervised learning techniques that apply to ranking problems dealing with a large number of features. The popularity and widespread application of LTR models in prioritizing information in a variety of domains makes their scrutability vital in today's landscape of fair and transparent learning systems. However, limited work exists that deals with interpreting the decisions of learning systems that output rankings. In this paper we propose a model agnostic local explanation method that seeks to identify a small subset of input features as explanation to the ranked output for a given query. We introduce new notions of validity and completeness of explanations specifically for rankings, based on the presence or absence of selected features, as a way of measuring goodness. We devise a novel optimization problem to maximize validity directly and propose greedy algorithms as solutions. In extensive quantitative experiments we show that our approach outperforms other model agnostic explanation approaches across pointwise, pairwise and listwise LTR models in validity while not compromising on completeness.

Recommending Search Queries in Documents Using Inter N-Gram Similarities

Reading a document can often trigger a need for additional information. For example, a reader of a news article might be interested in information about the persons and events mentioned in the article. Accordingly, there is a line of work on recommending search-engine queries given a document read by a user. Often, the recommended queries are selected from a query log independently of each other, and are presented to the user without any context. We address a novel query recommendation task where the recommended queries must be n-grams (sequences of consecutive terms) in the document. Furthermore, inspired by work on using inter-document similarities for document retrieval, we explore the merits of using inter n-gram similarities for query recommendation. Specifically, we use a supervised approach to learn an inter n-gram similarity measure where the goal is that n-grams that are likely to serve as queries will be deemed more similar to each other than to other n-grams. We use the similarity measure in a wide variety of query recommendation approaches which we devise as adaptations of ad hoc document retrieval techniques. Empirical evaluation performed using data gathered from Yahoo!'s search engine logs attests to the effectiveness of the resultant recommendation methods.

SESSION: Session 3B - Evaluation

Session details: Session 3B - Evaluation

Decomposition and Interleaving for Variance Reduction of Post-click Metrics

In this study, we propose an efficient method for comparing the post-click metrics (e.g., dwell time and conversion rate) of multiple rankings in online experiments. The proposed method involves (1) the decomposition of the post-click metric measurement of a ranking into a click model estimation and a post-click metric measurement of each item in the ranking, and (2) the interleaving of multiple rankings to produce a single ranking that preferentially exposes items possessing a high population variance. The decomposition of the post-click metric measurement enables the free layout of items in a ranking and focuses the measurement on the post-click metric of each item in the multiple rankings. The interleaving of multiple rankings reduces the sample variance of the items possessing a high population variance by optimizing the ranking presented to the users so that those items receive more samples of the post-click metric. In addition, we prove that the proposed method leads to the minimization of the evaluation error in the ranking comparison, and we propose two practical techniques to stabilize the online experiment. We performed a comprehensive simulation experiment and a real-service experiment. The experimental results revealed that (1) the proposed method outperforms existing methods in terms of efficiency and accuracy, and the performance is especially remarkable when the input rankings share many items, and (2) the two stabilization techniques successfully improve the evaluation accuracy and efficiency. The real service data from our experiments can be used to validate our experiments as well as to investigate user modeling and bandit algorithms.

ERR is not C/W/L: Exploring the Relationship Between Expected Reciprocal Rank and Other Metrics

We explore the relationship between expected reciprocal rank (ERR) and the metrics that are available under the C/W/L framework. On the surface, it appears that the user browsing model associated with ERR can be directly injected into a C/W/L arrangement, to produce system measurements equivalent to those generated from ERR. That assumption is now known to be invalid, and demonstration of the impossibility of ERR being described via C/W/L choices forms the first part of our work. Given that ERR cannot be accommodated within the C/W/L framework, we then explore the extent to which practical use of ERR correlates with metrics that do fit within the C/W/L user browsing model. In this part of the investigation we present a range of shallow-evaluation C/W/L variants that have very high correlation with ERR when compared in experiments involving a large number of TREC runs. That is, while ERR itself is not a C/W/L metric, there are other weighted-precision computations that fit with the user model assumed by C/W/L, and yield system comparisons almost indistinguishable from those generated via the use of ERR.
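For reference, ERR is defined by a cascade user model: the user scans down the ranking, stops at rank r with a probability R_r derived from the graded relevance, and contributes utility 1/r on stopping. A minimal computation:

```python
# ERR = sum over ranks r of (1/r) * R_r * product over i < r of (1 - R_i),
# with stopping probability R derived from graded relevance judgments.
def err(relevance_grades, max_grade=3):
    # Map graded relevance g to stopping probability (2^g - 1) / 2^max_grade.
    R = [(2 ** g - 1) / 2 ** max_grade for g in relevance_grades]
    score, p_reach = 0.0, 1.0
    for r, R_r in enumerate(R, start=1):
        score += p_reach * R_r / r    # user stops here, gaining 1/r
        p_reach *= 1 - R_r            # user continues past rank r
    return score

print(err([3, 0, 1]))  # graded judgments for a ranked list, best rank first
```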

A Fast and Exact Randomisation Test for Comparing Two Systems with Paired Data

The randomisation test with B trials has been used in the information retrieval (IR) field for comparing two systems with paired data (i.e., a common set of topics). It approximates the exact randomisation test, whose time complexity is O(2^n) for n topics. In this paper, we first show that the counting operation for obtaining the exact p-value in a randomisation test can be reduced to a subsequence sum enumeration problem that can be solved by dynamic programming. By taking advantage of this observation, along with the fact that we only require a small number of significant digits in IR evaluation measure scores, we demonstrate that the time complexity of the exact randomisation test can be reduced to O(mn), where m is the maximum subsequence sum of the paired score differences. Hence, researchers can utilise our test if they want to avoid random sampling and/or normality assumptions and want a fast and reliable test. In addition, we utilise Vandermonde's convolution to theoretically explain a known fact, namely, that the sign test is a special case of the exact randomisation test.
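A minimal sketch of the dynamic program implied by this observation (the integerizing scale factor below is an illustrative stand-in for the limited number of significant digits):

```python
# Exact paired randomisation test: instead of enumerating all 2^n sign
# flips, count how many sign assignments reach each achievable sum.
from collections import defaultdict

def exact_randomisation_p(diffs, scale=10000):
    d = [round(x * scale) for x in diffs]   # integerized score differences
    observed = abs(sum(d))
    counts = defaultdict(int)   # signed-difference sum -> #assignments
    counts[0] = 1
    for di in d:
        nxt = defaultdict(int)
        for s, c in counts.items():
            nxt[s + di] += c    # topic difference kept as-is
            nxt[s - di] += c    # topic difference sign-flipped
        counts = nxt
    total = 2 ** len(d)
    extreme = sum(c for s, c in counts.items() if abs(s) >= observed)
    return extreme / total      # two-sided exact p-value

print(exact_randomisation_p([0.02, -0.01, 0.03, 0.01, 0.02]))
```

Because the number of distinct sums is bounded by the range of achievable subsequence sums, the table stays small in practice, which is the source of the O(mn) bound.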

How do Metric Score Distributions affect the Type I Error Rate of Statistical Significance Tests in Information Retrieval?

Statistical significance tests are the main tool that IR practitioners use to determine the reliability of their experimental evaluation results. The question of which test behaves best with IR evaluation data has been around for decades, and has seen all kinds of results and recommendations. A definitive answer to this question has recently been attempted via stochastic simulation of IR evaluation data, which allows researchers to compute actual Type I error rates because they can control the null hypothesis. One such research line simulates metric scores for a fixed set of systems on random topics, and concluded that the t-test behaves best. Another line simulates retrieval runs by random systems on a fixed set of topics, and concluded that the Wilcoxon test behaves best. Interestingly, two recent surveys of the IR literature have shown that the community has a clear preference for precisely these two tests, so further investigation is critical to understand why the above simulation studies reach opposite conclusions. It has recently been postulated that one reason for the disagreement is the distributions of metric scores used by one of these simulation methods. In this paper we investigate this issue and extend the argument to another key aspect of the simulation, namely the dependence between systems. Following a principled approach, we analyze the robustness of statistical tests to different factors, thus identifying under what conditions they behave well or not with respect to the Type I error rate. Our results suggest that differences between the Wilcoxon and t-test may be explained by the skewness of score differences.
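The simulation logic behind such studies can be sketched as follows: draw paired score differences under a true null and count false rejections. The skewed null distribution here is an illustrative choice, not either paper's exact simulation method:

```python
# Estimate Type I error rates of the t-test and Wilcoxon test by simulating
# paired score differences with zero mean (a true null) but skewed shape.
import numpy as np
from scipy.stats import ttest_1samp, wilcoxon

rng = np.random.default_rng(0)
alpha, trials, n_topics = 0.05, 5000, 50
rejections = {"t-test": 0, "wilcoxon": 0}
for _ in range(trials):
    # Skewed differences with zero mean: a shifted exponential.
    d = rng.exponential(1.0, n_topics) - 1.0
    if ttest_1samp(d, 0.0).pvalue < alpha:
        rejections["t-test"] += 1
    if wilcoxon(d).pvalue < alpha:
        rejections["wilcoxon"] += 1
# A well-behaved test should reject close to alpha = 5% of the time.
print({k: v / trials for k, v in rejections.items()})
```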

SESSION: Session 4A - Question Answering

Session details: Session 4A - Question Answering

A Discriminative Semantic Ranker for Question Retrieval

Similar question retrieval is a core task in community-based question answering (CQA) services. To balance effectiveness and efficiency, the question retrieval system is typically implemented as a multi-stage ranker: the first-stage ranker aims to recall potentially relevant questions from a large repository, and the later stages attempt to re-rank the retrieved results. Most existing work on question retrieval has focused on the re-ranking stages, leaving the first-stage ranker to traditional term-based methods. However, term-based methods often suffer from the vocabulary mismatch problem, especially on short texts, which may block the re-rankers from relevant questions at the very beginning. An alternative is to employ embedding-based methods for the first-stage ranker, which compress texts into dense vectors to enhance semantic matching. However, these methods often lack the discriminative power of term-based methods, thus introducing noise during retrieval and hurting recall performance. In this work, we aim to tackle this dilemma of the first-stage ranker, and propose a discriminative semantic ranker, namely DenseTrans, for high-recall retrieval. Specifically, DenseTrans is a densely connected Transformer, which learns semantic embeddings for texts based on Transformer layers. Meanwhile, DenseTrans promotes low-level features through dense connections to preserve the discriminative power of the learned representations. DenseTrans is inspired by DenseNet in computer vision (CV), but poses a new way to use dense connectivity, totally different from its original design purpose. Experimental results on two question retrieval benchmark datasets show that our model obtains significant gains in recall over strong term-based methods as well as state-of-the-art embedding-based methods.

Evaluating BERT-based Rewards for Question Generation with Reinforcement Learning

Question generation systems aim to generate natural language questions that are relevant to a given piece of text, and that can usually be answered by considering just that text. Prior work has identified a range of shortcomings (including semantic drift and exposure bias) and has thus turned to the reinforcement learning paradigm to improve the effectiveness of question generation. As part of this, different reward functions have been proposed. Since these reward functions have typically been investigated empirically in different experimental settings (different datasets, models and parameters), we lack a common framework to compare them fairly. In this paper, we first categorize existing rewards systematically. We then provide a fair empirical evaluation of different reward functions (including three we propose here for QG) in a common framework. We find rewards that model answerability to be the most effective.

Passage Similarity and Diversification in Non-factoid Question Answering

The rise in popularity of mobile and voice search has led to a shift in focus from document retrieval to short answer passage retrieval for non-factoid questions. Some questions have multiple answers, and the aim is to retrieve a set of relevant answer passages that covers all these alternatives. Compared to documents, answers are more specific and typically form more defined types or groups. Grouping answer passages based on strong similarity measures may provide a means of identifying these types. Typically, kNN clustering in combination with term-based representations has been used in Information Retrieval (IR) scenarios. An alternative is to use pre-trained distributional representations such as GloVe and BERT, which capture additional semantic relationships. The recent success of trained neural models on various tasks provides the motivation for generating more task-specific representations. However, due to the absence of large datasets incorporating passage-level similarity information, a more feasible alternative is to use weak-supervision-based training. This information can then be used to generate a final ranked list of diversified answers using standard diversification algorithms.

In this paper, we introduce a new dataset, NFPassageQA_Sim, with human-annotated similarity labels for pairs of answer passages corresponding to each question. These similarity labels are then processed to generate another dataset, NFPassageQA_Div, which consists of answer types for these questions. Using the similarity labels, we demonstrate the effectiveness of using weak supervision signals derived from GloVe, fine-tuned and trained using a BERT model, for the task of answer passage clustering. Finally, we introduce a model which incorporates these clusters into an MMR (Maximal Marginal Relevance) model, and which significantly beats other diversification baselines on both diversity and relevance metrics.

A Novel Approach for Mining Team Leaders in Community Question Answering

Expert finding is the task of ranking experts for a given topic, and can be utilized for finding suitable people for a job. The problem of finding a team leader for coaching a team is a practical, industrial issue that can be addressed with the expert finding task. It is a challenging problem because, for the position of team leader, knowledge of the required skills alone is not enough: a suitable leader should also have abilities including communication, mentoring, and problem solving. In this paper, we propose a model to address this problem. Our proposed model ranks candidates for a given team; in this model, team members vote for candidates based on centrality metrics. Our experiments on two real-world, large datasets collected from Stackoverflow indicate that our proposed method outperforms the baseline.

SESSION: Session 4B - Semantic Retrieval

Session details: Session 4B - Semantic Retrieval

More Robust Dense Retrieval with Contrastive Dual Learning

Dense retrieval conducts text retrieval in the embedding space and has shown many advantages compared to sparse retrieval. Existing dense retrievers optimize representations of queries and documents with contrastive training and map them to the embedding space. The embedding space is optimized by aligning the matched query-document pairs and pushing the negative documents away from the query. However, in such a training paradigm, the queries are only optimized to align to the documents and are coarsely positioned, leading to an anisotropic query embedding space. In this paper, we analyze the embedding space distributions and propose an effective training paradigm, Contrastive Dual Learning for Approximate Nearest Neighbor (DANCE), to learn fine-grained query representations for dense retrieval. DANCE incorporates an additional dual training objective of query retrieval, inspired by the classic information retrieval training axiom, query likelihood. With contrastive learning, the dual training objective of DANCE learns more tailored representations for queries and documents, keeping the embedding space smooth and uniform and improving the ranking performance of DANCE on the MS MARCO document retrieval task. Unlike ANCE, which is optimized only with the document retrieval task, DANCE concentrates the query embeddings closer to document representations while making the document distribution more discriminative. Such a concentrated query embedding distribution assigns more uniform negative sampling probabilities to queries and helps to sufficiently optimize query representations in the query retrieval task. Our code is released at https://github.com/thunlp/DANCE.

Pseudo-Relevance Feedback for Multiple Representation Dense Retrieval

Pseudo-relevance feedback mechanisms, from Rocchio to the relevance models, have shown the usefulness of expanding and reweighting the users' initial queries using information occurring in an initial set of retrieved documents, known as the pseudo-relevant set. Recently, dense retrieval -- through the use of neural contextual language models such as BERT for analysing the documents' and queries' contents and computing their relevance scores -- has shown promising performance on several information retrieval tasks, while not relying on the traditional inverted index for identifying documents relevant to a query. Two different dense retrieval families have emerged: the use of single embedded representations for each passage and query (e.g. using BERT's [CLS] token), or via multiple representations (e.g. using an embedding for each token of the query and document). In this work, we conduct the first study into the potential for multiple representation dense retrieval to be enhanced using pseudo-relevance feedback. In particular, based on the pseudo-relevant set of documents identified using a first-pass dense retrieval, we extract representative feedback embeddings (using KMeans clustering) -- while ensuring that these embeddings discriminate among passages (based on IDF) -- which are then added to the query representation. These additional feedback embeddings are shown to enhance the effectiveness of both a reranking and an additional dense retrieval operation. Indeed, experiments on the MS MARCO passage ranking dataset show that MAP can be improved by up to 26% on the TREC 2019 query set and 10% on the TREC 2020 query set by applying our proposed ColBERT-PRF method to a ColBERT dense retrieval approach.
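A rough sketch of this feedback-embedding extraction (shapes, parameter values and the `idf` lookup below are assumptions; the released ColBERT-PRF implementation differs in its details):

```python
# Sketch of the ColBERT-PRF idea: cluster the token embeddings of the
# pseudo-relevant passages with KMeans, keep the most discriminative
# centroids (highest IDF of the nearest feedback token), and append them
# to the query's embedding matrix.
import numpy as np
from sklearn.cluster import KMeans

def colbert_prf_expand(query_embs, feedback_embs, feedback_tokens,
                       idf, n_clusters=24, n_keep=10, beta=1.0):
    """query_embs: (q, d); feedback_embs: (t, d) token embeddings of the
    pseudo-relevant set; feedback_tokens: length-t list of token strings."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(feedback_embs)
    centroids = km.cluster_centers_
    # Score each centroid by the IDF of its closest feedback token.
    nearest = np.argmax(feedback_embs @ centroids.T, axis=0)
    scores = [idf.get(feedback_tokens[i], 0.0) for i in nearest]
    keep = np.argsort(scores)[::-1][:n_keep]
    # Weight the expansion embeddings by beta and append to the query matrix.
    return np.vstack([query_embs, beta * centroids[keep]])
```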

Semantic Hilbert Space for Interactive Image Retrieval

The paper introduces a model for interactive image retrieval utilising the geometrical framework of information retrieval (IR). We tackle the problem of image retrieval based on an expressive user information need in the form of a textual-visual query, where a user is attempting to find an image similar to the picture in their mind while querying. The user information need is expressed using guided visual feedback based on Information Foraging, which lets the user's perception be embedded within the model via a semantic Hilbert space (SHS). This framework is based on the mathematical formalism of quantum probabilities and aims to understand the relationship between user textual and image input, where the image in the input is considered a form of visual feedback. We propose SHS, a quantum-inspired approach where the textual-visual query is regarded analogously to a physical system, allowing us to model different system states and their dynamic changes based on observations (such as queries and relevance judgements). We learn the input multimodal representation and the relationships between textual-image queries for retrieving images. Our experiments, conducted on the MIT States and Fashion200k datasets, demonstrate the effectiveness of finding particular images automatically when the user inputs are semantically expressive.

BERT-based Dense Retrievers Require Interpolation with BM25 for Effective Passage Retrieval

The integration of pre-trained deep language models, such as BERT, into retrieval and ranking pipelines has been shown to provide large effectiveness gains over traditional bag-of-words models in the passage retrieval task. However, the best setup for integrating such deep language models is still unclear. When BERT is used to re-rank passages (i.e., the BERT re-ranker), previous work has empirically shown that, while in practice the BERT re-ranker cannot act as the initial retriever due to BERT's high query time costs (and thus a bag-of-words model such as BM25 is required), it is not necessary to interpolate the BERT re-ranker and bag-of-words scores to generate the final ranking: the BERT re-ranker scores alone appear to already capture the relevance signal provided by BM25. In this paper, we further investigate the topic of interpolating BM25 and BERT-based rankers. Unlike previous work that considered the BERT re-ranker, however, here we consider BERT-based dense retrievers (RepBERT and ANCE). Dense retrievers encode queries and documents into low-dimensional BERT-based embeddings. These methods overcome BERT's high computational costs at query time, and can thus be feasibly used in practice as whole-collection retrievers, rather than just as re-rankers. Our novel empirical findings suggest that, unlike for the BERT re-ranker, interpolation with BM25 is necessary for BERT-based dense retrievers to perform effectively, and the gains provided by the interpolation are significant. Further analysis reveals why this is so: dense retrievers are very effective at encoding strong relevance signals, but they fail to identify weaker relevance signals -- a task that the interpolation with BM25 is able to make up for.
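The interpolation itself is straightforward. A minimal sketch, assuming min-max score normalisation and a tunable mixing weight (the paper's exact normalisation and weighting may differ):

```python
# Linearly interpolate BM25 and dense retriever scores for the same
# candidate list after bringing both score ranges onto a common scale.
def interpolate(bm25_scores, dense_scores, alpha=0.5):
    def minmax(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo + 1e-12) for x in xs]
    b, d = minmax(bm25_scores), minmax(dense_scores)
    # alpha = 1.0 recovers pure BM25; alpha = 0.0 recovers the pure
    # dense retriever; intermediate values mix the two relevance signals.
    return [alpha * bi + (1 - alpha) * di for bi, di in zip(b, d)]
```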