ACL Search Tool

Welcome to the ACL Search Tool

You can use this tool to search all the papers in the ACL Rolling Review website.

Title Abstract PDF Month Year
UniK-QA: Unified Representations of Structured and Unstructured Knowledge for Open-Domain Question Answering We study open-domain question answering with structured, unstructured, and semi-structured knowledge sources, including text, tables, lists and knowledge bases. Departing from prior work, we propose a unifying approach that homogenizes all sources by reducing them to text and applies the retriever-reader model which has so far been limited to text sources only. Our approach greatly improves the results on knowledge-base QA tasks by 11 points, compared to the latest graph-based methods. More importantly, we demonstrate that our unified knowledge (UniK-QA) model is a simple and yet effective way to combine heterogeneous sources of knowledge, advancing the state-of-the-art results on two popular question answering benchmarks, NaturalQuestions and WebQuestions, by 3.5 and 2.6 points, respectively. PDF 5 2021
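A minimal sketch of the homogenization step described in the UniK-QA abstract above: flattening a table into plain text so a standard retriever-reader can consume it. The linearization format here is an illustrative assumption, not the authors' exact scheme.

    def linearize_table(title, header, rows):
        # Flatten a table into text: one sentence per row, pairing each
        # column header with its cell value.
        sentences = []
        for row in rows:
            pairs = ", ".join(f"{h} is {v}" for h, v in zip(header, row))
            sentences.append(f"{title}: {pairs}.")
        return " ".join(sentences)

    # Example: a one-row table becomes retriever-ready text.
    text = linearize_table(
        "1996 Summer Olympics",
        ["City", "Country", "Year"],
        [["Atlanta", "United States", "1996"]],
    )

Lists and knowledge-base triples can be reduced to text in the same spirit, after which off-the-shelf dense retrieval applies unchanged.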
A Statistical Typology of (Textual) Language in Finer Granularity We propose a character-level perspective for a new understanding and visualization of language, in its textual representation in computing, using relative line length and character vocabulary size from parallel corpora as parameters. We discover an emergent pattern with a natural, continuous order to languages. We highlight some of the outlier languages and discuss the opportunities and challenges in line for character- and byte-level development in language technology. PDF 5 2021
GeDi: Generative Discriminator Guided Sequence Generation While large-scale language models (LMs) are able to imitate the distribution of natural language well enough to generate realistic text, it is difficult to control which regions of the distribution they generate. This is especially problematic because datasets used for training large LMs usually contain significant toxicity, hate, bias, and negativity. One promising approach to address this is to use discriminators to guide decoding from LMs, but existing methods for this are too slow to be useful in practice for many applications. We present GeDi as a significantly more efficient discriminator-based approach for guiding decoding. GeDi guides generation at each step by computing classification probabilities for all possible next tokens via Bayes rule, normalizing over two class-conditional distributions: one conditioned on the desired attribute, or control code, and another conditioned on the undesired attribute, or anti-control code. We find that GeDi gives controllability on par with or better than previous controllable generation methods. GeDi results in significantly faster generation speeds than the only previous method that achieved comparable controllability in our experiments. We also show that GeDi can make GPT-2 and GPT-3 significantly less toxic while maintaining linguistic fluency, without sacrificing significantly on generation speed. Lastly, we find training GeDi on only three topics allows us to controllably generate new topics zero-shot from just a keyword. PDF 5 2021
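A toy sketch of the Bayes-rule reweighting the GeDi abstract describes, assuming two class-conditional next-token log-probability vectors are already available; the function and the weight omega are illustrative, not the released GeDi code.

    import torch

    def gedi_reweight(base_logprobs, pos_logprobs, neg_logprobs, omega=30.0):
        # base_logprobs: next-token log-probs from the generator LM.
        # pos_logprobs / neg_logprobs: log-probs from a class-conditional
        # LM under the desired (control) and undesired (anti-control) codes.
        # Bayes rule with a uniform class prior: P(class | token) is the
        # positive likelihood normalized over the two class-conditional ones.
        log_posterior = pos_logprobs - torch.logaddexp(pos_logprobs, neg_logprobs)
        # Bias the generator toward tokens the discriminator deems on-attribute.
        return base_logprobs + omega * log_posterior

The actual method aggregates class probabilities over the whole prefix and adds further filtering heuristics; this sketch only shows the core reweighting.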
Analysis and Prediction of NLP models via Task Embeddings Relatedness between tasks, which is key to transfer learning, is often characterized by measuring the influence of tasks on one another during sequential or simultaneous training, with tasks being treated as black boxes. In this paper, we propose MetaEval, a set of 101 NLP tasks. We fit a single transformer to all MetaEval tasks jointly while conditioning it on low-dimensional task embeddings. The resulting task embeddings enable a novel analysis of the relatedness among tasks. We also show that task aspects can be used to predict task embeddings for new tasks without using any annotated examples. Predicted embeddings can modulate the encoder for zero-shot inference and outperform a zero-shot baseline on GLUE tasks. The provided multitask setup can function as a benchmark for future transfer learning research. PDF 5 2021
FinQA: A Dataset of Numerical Reasoning over Financial Data The sheer volume of financial statements makes it difficult for humans to access and analyze a business's financials. Robust numerical reasoning likewise faces unique challenges in this domain. In this work, we focus on answering deep questions over financial data, aiming to automate the analysis of a large corpus of financial documents. In contrast to existing tasks on the general domain, the finance domain includes complex numerical reasoning and understanding of heterogeneous representations. To facilitate analytical progress, we propose a new large-scale dataset, FinQA, with question-answering pairs over financial reports, written by financial experts. We also annotate the gold reasoning programs to ensure full explainability. We further introduce baselines and conduct comprehensive experiments on our dataset. The results demonstrate that popular, large, pre-trained models fall far short of expert humans in acquiring finance knowledge and in complex multi-step numerical reasoning on that knowledge. Our dataset, the first of its kind, should therefore enable significant new community research into complex application domains. The dataset and code are publicly available at https://github.com/czyssrs/FinQA. PDF 5 2021
Data and Parameter Scaling Laws for Neural Machine Translation We observe that the development cross-entropy loss of supervised neural machine translation models scales like a power law with the amount of training data and the number of non-embedding parameters in the model. We discuss some practical implications of these results, such as predicting BLEU achieved by large scale models and predicting the ROI of labeling data in low-resource language pairs. PDF 5 2021
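The power-law relationship described in the abstract above can be fit directly; a minimal sketch with scipy on made-up (dataset size, loss) pairs, purely for illustration.

    import numpy as np
    from scipy.optimize import curve_fit

    def power_law(d, a, b, c):
        # L(D) = a * D^(-b) + c: loss decays as a power of dataset size D,
        # approaching an irreducible floor c.
        return a * np.power(d, -b) + c

    # Hypothetical dev cross-entropy values at increasing training-set sizes.
    sizes = np.array([1e5, 1e6, 1e7, 1e8])
    losses = np.array([5.1, 4.0, 3.2, 2.7])
    params, _ = curve_fit(power_law, sizes, losses, p0=[10.0, 0.1, 1.0])
    print("fitted exponent b =", params[1])

The same functional form, with the number of non-embedding parameters in place of D, covers the parameter-scaling side of the observation.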
How to Query Language Models? Large pre-trained language models (LMs) are capable of not only recovering linguistic but also factual and commonsense knowledge. To access the knowledge stored in mask-based LMs, we can use cloze-style questions and let the model fill in the blank. The flexibility advantage over structured knowledge bases comes with the drawback of finding the right query for a certain information need. Inspired by human behavior to disambiguate a question, we propose to query LMs by example. To clarify the ambivalent question "Who does Neuer play for?", a successful strategy is to demonstrate the relation using another subject, e.g., "Ronaldo plays for Portugal. Who does Neuer play for?". We apply this approach of querying by example to the LAMA probe and obtain substantial improvements of up to 37.8% for BERT-large on the T-REx data when providing only 10 demonstrations, even outperforming a baseline that queries the model with up to 40 paraphrases of the question. The examples are provided through the model's context and thus require neither fine-tuning nor an additional forward pass. This suggests that LMs contain more factual and commonsense knowledge than previously assumed, provided we query the model in the right way. PDF 5 2021
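A minimal sketch of query-by-example with a fill-mask pipeline, following the abstract's Neuer/Ronaldo example; the model choice and the exact phrasing are assumptions for illustration.

    from transformers import pipeline

    # Cloze-style querying: prepend a demonstration of the relation,
    # then let the masked LM fill in the blank for the real query.
    fill = pipeline("fill-mask", model="bert-large-uncased")
    demo = "Ronaldo plays for Portugal."
    query = f"{demo} Neuer plays for {fill.tokenizer.mask_token}."
    for candidate in fill(query, top_k=3):
        print(candidate["token_str"], candidate["score"])

Because the demonstration lives entirely in the context window, no fine-tuning or extra forward pass is needed, exactly as the abstract notes.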
BdLAN: BERTdoc Label Attention Networks for Multi-label Text Classification Multi-label text classification (MLTC) poses a new challenge in Natural Language Processing (NLP): assigning multiple labels to a given document. Many real-world tasks can be viewed as MLTC, such as tag recommendation and information retrieval. However, researchers face several thorny problems, such as how to establish connections between labels or distinguish similar sub-labels, which current work has not solved thoroughly. We therefore propose a novel framework named BdLAN (BERTdoc Label Attention Networks), consisting of a BERTdoc layer, a label embeddings layer, a doc encoder layer, a doc-label attention layer, and a prediction layer. We apply BERT to pretrain documents to capture their deep semantic features and encode them via Bi-LSTM to obtain a bidirectional contextual representation of uniform length. We then create label embeddings and feed them, together with the encoded pretrained documents, into a doc-label attention mechanism to obtain interactive information between documents and their corresponding labels, finally using an MLP to make predictions. We carry out experiments on three real-world datasets; the empirical results indicate that our proposed model outperforms all state-of-the-art MLTC benchmarks. Moreover, we conduct a case study that vividly visualizes a real application of our BdLAN model. PDF 5 2021
SINA-BERT: A Pre-Trained Language Model for Analysis of Medical Texts in Persian We have released SINA-BERT, a language model pre-trained on BERT to address the lack of a high-quality Persian language model in the medical domain. SINA-BERT is pre-trained on a large-scale corpus of medical content, including formal and informal texts collected from various online resources, in order to improve performance on health-care related tasks. We employ SINA-BERT on the following representative tasks: categorization of medical questions, medical sentiment analysis, medical named entity recognition, and medical question retrieval. For each task, we have developed Persian annotated data sets for training and evaluation and learned a representation for the data of each task, especially complex and long medical questions. With the same architecture used in each task, SINA-BERT outperforms BERT-based models previously made available in Persian. PDF 5 2021
A Survey on Geocoding: Algorithms and Datasets for Toponym Resolution Geocoding, the task of converting unstructured text to structured spatial data, has recently seen progress thanks to a variety of new datasets, evaluation metrics, and machine-learning algorithms. We provide a comprehensive survey to review, organize and analyze recent work on geocoding (also known as toponym resolution) where the text is matched to geospatial coordinates and/or ontologies. We summarize the findings of this research and suggest some promising directions for future work. PDF 5 2021
Opinion-based Relational Pivoting for Cross-domain Aspect Term Extraction Domain adaptation methods often exploit domain-transferable input features, a.k.a. pivots. The task of Aspect and Opinion Term Extraction presents a special challenge for domain transfer: while opinion terms largely transfer across domains, aspects change drastically from one domain to another (e.g. from restaurants to laptops). In this paper, we investigate and establish empirically a prior conjecture, which suggests that the linguistic relations connecting opinion terms to their aspects transfer well across domains and therefore can be leveraged for cross-domain aspect term extraction. We present several analyses supporting this conjecture, via experiments with four linguistic dependency formalisms to represent relation patterns. We then present an aspect term extraction method that drives models to consider opinion-aspect relations via explicit multitask objectives. This method provides significant performance gains, even on top of a prior state-of-the-art linguistically-informed model, which our analysis shows stem from the relational pivoting signal. PDF 5 2021
Faithful and Plausible Explanations of Medical Code Predictions Machine learning models that offer excellent predictive performance often lack the interpretability necessary to support integrated human-machine decision-making. In clinical medicine and other high-risk settings, domain experts may be unwilling to trust model predictions without explanations. Work in explainable AI must balance competing objectives along two different axes: 1) models should ideally be both accurate and simple; 2) explanations must balance faithfulness to the model's decision-making with their plausibility to a domain expert. We propose to train a proxy model that mimics the behavior of a trained model and provides control over these trade-offs. We evaluate our approach on the task of assigning ICD codes to clinical notes to demonstrate that the proxy model is faithful to the trained model's behavior and produces quality explanations. PDF 5 2021
Causal Augmentation for Causal Sentence Classification Scarcity of corpora with annotated causal texts can lead to poor robustness when training state-of-the-art language models for causal sentence classification. In particular, we find that these models misclassify augmented sentences that have been negated or strengthened in terms of their causal meaning. This is worrying because minor linguistic changes in causal sentences can have disparate meanings. To resolve such issues, we propose to generate counterfactual causal sentences by creating contrast sets (Gardner et al., 2020). However, we find that simply introducing edits is not sufficient to train models with counterfactuals. We thus introduce heuristics, like sentence shortening or multiplying key causal terms, to emphasize semantically important keywords to the model. We demonstrate these findings on different training setups and across two out-of-domain corpora. Our proposed mixture of augmented edits consistently improves performance over the baseline across two models, both within and outside the corpus domain, suggesting that our proposed augmentation also helps the model generalize. PDF 5 2021
Is Knowledge Embedding Fully Exploited in Language Understanding? An Empirical Study The recent development of knowledge embedding (KE) enables machines to represent knowledge graphs (KGs) with low-dimensional embeddings, which facilitates utilizing KGs for various downstream natural language understanding (NLU) tasks. However, less work has been done on systematically evaluating the impact of KE on NLU. In this work, we conduct a comprehensive analysis of utilizing KE on four downstream knowledge-driven NLU tasks using two representative knowledge-guided frameworks: knowledge augmentation and knowledge attention. From the experimental results, we find that: (1) KE models that perform better on knowledge graph completion do not necessarily help knowledge-driven NLU tasks more in the knowledge-guided frameworks; (2) KE can effectively benefit NLU tasks in two respects: entity similarity and entity relation information; (3) KE can further benefit pre-trained language models that have already learned rich knowledge from pre-training. We hope these results can help and guide future studies utilizing KE in NLU tasks. Our source code will be released to support further exploration. PDF 5 2021
Breaking Down Multilingual Machine Translation While multilingual training is now an essential ingredient in machine translation (MT) systems, recent work has demonstrated that it has different effects in different multilingual settings, such as many-to-one, one-to-many, and many-to-many learning. These training settings expose the encoder and the decoder in a machine translation model to different data distributions. In this paper, we examine how different varieties of multilingual training contribute to learning these two components of the MT model. Specifically, we compare bilingual models with encoders and/or decoders initialized by multilingual training. We show that multilingual training is beneficial to encoders in general, while it only benefits decoders for low-resource languages (LRLs). We further find the important attention heads for each language pair and compare their correlations during inference. Our analysis sheds light on how multilingual translation models work and also enables us to propose methods to improve performance by training with highly related languages. Our many-to-one models for high-resource languages and one-to-many models for LRLs outperform the best results reported by Aharoni et al. (2019). PDF 5 2021
Update Frequently, Update Fast: Retraining Semantic Parsing Systems in a Fraction of Time Currently used semantic parsing systems deployed in voice assistants can require weeks to train. Datasets for these models often receive small and frequent updates, called data patches. Each patch requires training a new model. To reduce training time, one can fine-tune the previously trained model on each patch, but naive fine-tuning exhibits catastrophic forgetting: degradation of the model performance on the data not represented in the data patch. In this work, we propose a simple method that alleviates catastrophic forgetting and show that it is possible to match the performance of a model trained from scratch in less than 10% of the time via fine-tuning. The keys to achieving this are supersampling and EWC regularization. We demonstrate the effectiveness of our method on multiple splits of the Facebook TOP and SNIPS datasets. PDF 5 2021
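A minimal sketch of the EWC regularizer named above, assuming a diagonal Fisher information estimate has already been computed on the old data; the names and the weight lam are illustrative.

    import torch

    def ewc_penalty(model, old_params, fisher, lam=1000.0):
        # Elastic Weight Consolidation: a quadratic penalty that anchors
        # parameters that were important on the old data (high Fisher
        # value) near their previously learned values.
        loss = 0.0
        for name, p in model.named_parameters():
            loss = loss + (fisher[name] * (p - old_params[name]) ** 2).sum()
        return (lam / 2.0) * loss

    # Fine-tuning loss on a data patch would then be, schematically:
    # total_loss = task_loss + ewc_penalty(model, old_params, fisher)

Supersampling, the other ingredient, amounts to oversampling the small data patch during fine-tuning so its signal is not drowned out.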
Coming to its senses: Lessons learned from Approximating Retrofitted BERT representations for Word Sense information Retrofitting static vector space word representations using external knowledge bases has yielded substantial improvements in their lexical-semantic capacities but is non-trivial to apply to contextual word embeddings (CWE). In this paper, we propose MAKESENSE, a method that 'approximates' retrofitting in CWEs to better infer word sense knowledge from word contexts. We specifically analyze BERT and MAKESENSE-transformed BERT representations over a diverse set of experiments encompassing sense-sensitive similarities, alignment with human-elicited similarity judgments, and probing tasks focusing on sense distinctions and hypernymy. Our findings indicate that MAKESENSE imparts substantial improvements in word sense information over vanilla CWEs but largely preserves more complex usage of sense and directionally sensitive information such as hypernymy. PDF 5 2021
Detecting Adversarial Text Attacks via SHapley Additive exPlanations State-of-the-art machine learning models are prone to adversarial attacks: maliciously crafted inputs that fool the model into making a wrong prediction, often with high confidence. While defense strategies have been extensively explored in the computer vision domain, research in natural language processing still lacks techniques to make models resilient to adversarial text inputs. We propose an adversarial detector leveraging Shapley additive explanations against text attacks. Our approach outperforms the current state-of-the-art detector by around 19% F1-score on the IMDb dataset and 14% on SST-2, while also showing competitive performance on AG_News and Yelp Polarity. Furthermore, we show that the detector requires only a small number of training samples and, in some cases, generalizes to different datasets without needing to be retrained. PDF 5 2021
QUASER: Question Answering with Scalable Extractive Rationalization Designing NLP models that produce predictions by first extracting a set of relevant input sentences (i.e., rationales) is gaining importance as a means of improving model interpretability and producing supporting evidence for users. Current unsupervised approaches are trained to extract rationales that maximize prediction accuracy, which is invariably obtained by exploiting spurious correlations in datasets, and leads to unconvincing rationales. In this paper, we introduce unsupervised generative models to extract dual-purpose rationales, which must not only support a subsequent answer prediction but also support a reproduction of the input query. We show that such models can produce more meaningful rationales that are less influenced by dataset artifacts and, as a result, also achieve the state of the art on rationale extraction metrics on four datasets from the ERASER benchmark, significantly improving upon previous unsupervised methods. PDF 5 2021
A Legal Approach to Hate Speech – Operationalizing the EU’s Legal Framework against the Expression of Hatred as an NLP Task We propose a 'legal approach' to hate speech detection by operationalizing the decision as to whether a post is subject to criminal law as an NLP task. Comparing existing regulatory regimes for hate speech, we base our investigation on the European Union's framework, as it provides a widely applicable legal minimum standard. Accurately judging whether a post is punishable or not usually requires legal training. We show that, by breaking the legal assessment down into a series of simpler sub-decisions, even laypersons can annotate consistently. Based on a newly annotated dataset, our experiments show that directly learning an automated model of punishable content is challenging. However, learning the two sub-tasks of 'target group' and 'targeting conduct' instead of an end-to-end approach to punishability yields better results. Overall, our method also provides better explainability and higher transparency, which is a crucial point in legal decision-making. PDF 5 2021
Contrastive Conditioning for Assessing Disambiguation in MT: A Case Study of Distilled Bias Lexical disambiguation is a major challenge for machine translation systems, especially if some senses of a word are trained less often than others. Identifying patterns of overgeneralization requires evaluation methods that are both reliable and scalable. We propose contrastive conditioning as a reference-free black-box method for detecting disambiguation errors. Specifically, we score the quality of a translation by conditioning on variants of the source that provide contrastive disambiguation cues. After validating our method, we apply it in a case study to perform a targeted evaluation of sequence-level knowledge distillation. By probing word sense disambiguation and translation of gendered occupation names, we show that distillation-trained models tend to overgeneralize more than other models with a comparable BLEU score. Contrastive conditioning thus highlights a side effect of distillation that is not fully captured by standard evaluation metrics. Code and data to reproduce our findings are publicly available. PDF 5 2021
Fine-Tuned Transformers Show Clusters of Similar Representations Across Layers Despite the success of fine-tuning pretrained language encoders like BERT for downstream natural language understanding (NLU) tasks, it is still poorly understood how neural networks change after fine-tuning. In this work, we use centered kernel alignment (CKA), a method for comparing learned representations, to measure the similarity of representations in task-tuned models across layers. In experiments across twelve NLU tasks, we discover a consistent block diagonal structure in the similarity of representations within fine-tuned RoBERTa and ALBERT models, with strong similarity within clusters of earlier and later layers, but not between them. The similarity of later layer representations implies that later layers only marginally contribute to task performance, and we verify in experiments that the top few layers of fine-tuned Transformers can be discarded without hurting performance, even with no further tuning. PDF 5 2021
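Linear CKA, the similarity measure used above, has a compact closed form; a minimal NumPy sketch over two activation matrices (an illustration, not the paper's code).

    import numpy as np

    def linear_cka(x, y):
        # x, y: (n_examples, dim) activations from two layers.
        x = x - x.mean(axis=0)  # center each feature
        y = y - y.mean(axis=0)
        # CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
        num = np.linalg.norm(y.T @ x, "fro") ** 2
        den = np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro")
        return num / den

    # A layer compared with itself scores 1.0; the block-diagonal structure
    # appears when the score is computed for every pair of layers.
    a = np.random.randn(100, 64)
    print(linear_cka(a, a))

Computing this for all layer pairs of a fine-tuned model yields the similarity matrix in which the paper observes clusters of earlier and later layers.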
Subtopic Clustering with a Query-Specific Siamese Similarity Metric We propose a Query-Specific Siamese Similarity Metric (QS3M) for query-specific clustering of text documents. It uses fine-tuned BERT embeddings and trains a non-linear projection into a query-specific similarity space. We build on the idea of Siamese networks but include a third component, a representation of the query. The empirical evaluation for clustering employs two TREC datasets with two different clustering benchmarks each. When used to obtain query-relevant clusters, QS3M achieves a 12% performance improvement over a recently published BERT-based reference method and significantly outperforms other unsupervised baselines. PDF 5 2021
A Study on Summarizing and Evaluating Long Documents Text summarization has been a key language generation task for over 60 years. The field has advanced considerably during the past two years, benefiting from the proliferation of pre-trained Language Models (LMs). However, the field is constrained by two factors: 1) the absence of an effective automatic evaluation metric and 2) a lack of effective architectures for long document summarization. Our first contribution is to demonstrate that a set of semantic evaluation metrics (BERTScore, MoverScore and our novel metric, BARTScore) consistently and significantly outperform ROUGE. Using these metrics, we then show that combining transformers with sparse self-attention is a successful method for long document summarization and is very competitive with the state of the art. Finally, we show that sparsifying self-attention does not degrade model performance when using transformers for summarization. PDF 5 2021
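The semantic metrics named above are available off the shelf; a minimal sketch with the bert-score package (the package choice is an assumption, and the paper's BARTScore variant is not shown).

    from bert_score import score

    cands = ["the cat sat on the mat"]
    refs = ["a cat was sitting on the mat"]
    # P, R, F1 are tensors with one entry per candidate-reference pair.
    P, R, F1 = score(cands, refs, lang="en")
    print(F1.mean().item())

Unlike ROUGE's n-gram overlap, the score is computed from contextual token embeddings, which is what lets such metrics reward paraphrases.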
LiSTra, Automatic Speech Translation: English to Lingala case study In recent years there has been great interest in addressing the low-resource status of African languages and providing baseline models for different Natural Language Processing tasks. Several initiatives on the continent use the Bible as a data source to provide proofs of concept for some NLP tasks. In this work, we present the Lingala Speech Translation (LiSTra) dataset, release a full pipeline for the construction of such datasets in other languages, and report baselines using both the traditional cascade approach (Automatic Speech Recognition -> Machine Translation) and a transformer-based end-to-end architecture with a custom interactive attention that allows information sharing between the recognition decoder and the translation decoder. PDF 5 2021
Privacy-Preserving Graph Convolutional Networks for Text Classification Graph convolutional networks (GCNs) are a powerful architecture for representation learning on documents that naturally occur as graphs, e.g., citation or social networks. However, sensitive personal information, such as documents with people's profiles or relationships as edges, is prone to privacy leaks, as the trained model might reveal the original input. Although differential privacy (DP) offers a well-founded privacy-preserving framework, GCNs pose theoretical and practical challenges due to their training specifics. We address these challenges by adapting differentially private gradient-based training to GCNs and conduct experiments using two optimizers on five NLP datasets in two languages. We propose a simple yet efficient method based on random graph splits that not only improves the baseline privacy bounds by a factor of 2.7 while retaining competitive F1 scores, but also provides strong privacy guarantees of ε = 1.0. We show that, under certain modeling choices, privacy-preserving GCNs achieve up to 90% of the performance of their non-private variants, while formally guaranteeing strong privacy measures. PDF 5 2021
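A minimal sketch of differentially private gradient-based training in the spirit described above: bound each example's gradient and add calibrated Gaussian noise before the update. This is generic DP-SGD, not the paper's graph-split method.

    import torch

    def dp_sgd_step(model, per_example_grads, lr=0.1, clip=1.0, sigma=1.0):
        # per_example_grads: one flattened gradient tensor per example.
        clipped = []
        for g in per_example_grads:
            factor = (clip / (g.norm() + 1e-12)).clamp(max=1.0)
            clipped.append(g * factor)  # bound each example's contribution
        grad = torch.stack(clipped).mean(0)
        # Gaussian noise scaled to the clipping bound masks any single example.
        grad = grad + torch.randn_like(grad) * (sigma * clip / len(clipped))
        with torch.no_grad():
            flat = torch.nn.utils.parameters_to_vector(model.parameters())
            torch.nn.utils.vector_to_parameters(flat - lr * grad, model.parameters())

The per-example decomposition is exactly what is awkward for GCNs, since graph convolutions mix neighboring nodes; the paper's random graph splits are one way to restore disjoint examples.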
Multilingual Domain Adaptation for NMT: Decoupling Language and Domain Information with Adapters Adapter layers are lightweight, learnable units inserted between transformer layers. Recent work explores using such layers for neural machine translation (NMT), to adapt pre-trained models to new domains or language pairs. We propose strategies to compose language and domain adapters. Our goals are both parameter-efficient adaptation to multiple domains and languages simultaneously, and cross-lingual transfer in domains where parallel data is unavailable for certain language pairs. We find that a naive combination of domain-specific and language-specific adapters often results in translations into the wrong language. We study other ways to combine the adapters to alleviate this issue and maximize cross-lingual transfer. With our best adapter combinations, we obtain improvements of 3-4 BLEU on average for source languages that do not have in-domain data. For target languages without in-domain data, we achieve a similar improvement by combining adapters with back-translation. PDF 5 2021
Neural Predictive Text for Grammatical Error Prevention In this paper we study the potential of two neural language models, an LSTM and the autoregressive language model GPT-2, to predict possible correction tokens in erroneous sentences and to predict the next token in randomly sliced correct sentences, with the aim of establishing a new Grammatical Error Correction (GEC) subarea, for which we coin the term Grammatical Error Prevention (GEP). Systems that could assist in GEP, such as language models, are expected to predict elements and therefore prevent grammatical errors in advance. Our findings show that GPT-2 can predict 29% of the correct tokens with one prediction. Accuracy rises to 44% when the top 3 predictions are considered. To test the pedagogical capacity of such a model, we also experimented with real English as a second language (ESL) learners. By equipping GPT-2 to generate text that functions as a potential continuation of the learners' sentences, we created a small corpus of the learners' writings and analyzed their errors along with their frequencies. PDF 5 2021
Contrastive Learning of Natural Language and Code Representations for Semantic Code Search Retrieving semantically relevant code functions given a natural language (NL) or programming language (PL) query is a task of great practical value for building productivity-enhancing tools for software developers. Recent approaches to this task leverage transformer-based masked language models that are pre-trained on NL and PL and fine-tuned for code search using a contrastive learning objective. However, these approaches suffer from uninformative in-batch negative samples. We propose DyHardCode: a contrastive learning framework that leverages hard negative examples, mined globally from the entire training corpus, to improve the quality of code and natural language representations. We experiment with different hard negative mining strategies, and provide explanations for the effectiveness of our method from the perspectives of optimization and adversarial learning. We show that DyHardCode leads to improvements in multiple code search tasks. Our approach achieves an average (across 6 programming languages) mean reciprocal ranking (MRR) score of 0.750, as opposed to the previous state-of-the-art result of 0.713 MRR on the CodeSearchNet benchmark. PDF 6 2021
So Different Yet So Alike! Constrained Unsupervised Text Style Transfer Transferring text from one domain to another has seen tremendous progress in the recent past. However, these methods do not aim to explicitly maintain constraints, such as similar text length or descriptiveness, between the source and the transferred text. To this end, we introduce two complementary cooperative losses to the generative adversarial network family. Here, both the generator and the critic reduce the contrastive and/or the classification loss aiming to satisfy the constraints. These losses allow lexical, syntactic, and domain-specific consistencies to persist across domains. We demonstrate the effectiveness of our method over multiple benchmark datasets, both with single and multi-attribute transfers. The complementary cooperative losses also improve text quality across datasets as judged by current automated generation metrics and human evaluation. PDF 6 2021
The AI Doctor Is In: A Survey of Task-Oriented Dialogue Systems for Healthcare Applications Task-oriented dialogue systems in healthcare are attracting increased attention, and have been characterized by a diverse range of architectures and objectives. However, although these systems have been surveyed in the medical community from a non-technical perspective, a systematic review from a rigorous computational perspective remains noticeably absent. As a result, many important implementation details of healthcare-oriented dialogue systems remain limited or under-specified, slowing the pace of innovation in this area. To fill this gap, we investigated an initial pool of 4070 papers from well-known computer science, natural language processing, and artificial intelligence venues, identifying 70 papers that satisfied our defined inclusion criteria. We conducted a comprehensive technical review of the included papers, and present our findings along with identified trends and intriguing directions for future research. PDF 6 2021
On Evaluation and Improvement of Tail Label Performance for Multi-label Text Classification Extreme multi-label text classification (XMTC) is the task of tagging each document with the most relevant subset of labels from an extremely large label set. The most challenging part for machine learning methods is the skewed label distribution, in which a majority of labels receive very few training instances (known as tail labels). Benchmark evaluations so far have focused on micro-averaging metrics, where the performance on tail labels can be easily overshadowed by high-frequency labels (known as head labels), and hence they are insufficient for evaluating the true success of methods in XMTC. This paper presents a re-evaluation of state-of-the-art (SOTA) methods based on binned macro-averaged F1 instead, which reveals new insights into the strengths and weaknesses of representative methods. Based on this evaluation, we conduct in-depth analysis and experiments on Transformer models with various depths and attention mechanisms to improve tail label performance. We show that a shallow Transformer model with word-label attention can effectively leverage word-level features and outperforms previous Transformers on tail labels. PDF 6 2021
Augmented Neural Story Generation with Commonsense Inference Transformer-based language model approaches to automated story generation currently provide state-of-the-art results. However, they still suffer from plot incoherence when generating narratives over time, and critically lack basic commonsense reasoning. Furthermore, existing methods generally focus only on single-character stories, or fail to track characters at all. To improve the coherence of generated narratives and to expand the scope of character-centric narrative generation, we introduce Commonsense-inference Augmented neural StoryTelling (CAST), a framework for introducing commonsense reasoning into the generation process while modeling the interaction between multiple characters. We find that our CAST method produces significantly more coherent and on-topic two-character stories, outperforming baselines in dimensions including plot plausibility and staying on topic. We also show how the CAST method can be used to further train language models that generate more coherent stories and reduce computation cost. PDF 6 2021
Continuation is a Sub-Task of Fill in the Blank: Why Not Train for Both? The task of inserting text into a specified position in a passage, known as fill in the blank, is useful for a variety of applications where writers interact with a natural language generation (NLG) system to craft text. However, NLG research has mostly focused on continuation models that append text to the end of a passage. Since continuation is in fact a sub-task of fill in the blank, one where the blank is placed at the sequence's end, we propose the training of a single model which can effectively handle both these tasks. The result is improved efficiency---as only one model needs to be maintained---without any negative impact on performance at either task. PDF 6 2021
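A sketch of the shared data format the abstract above implies: continuation is simply a fill-in-the-blank example whose blank sits at the end of the passage. The [BLANK] marker is an invented convention for illustration.

    def make_example(prefix, target, suffix=""):
        # The model sees the passage with a [BLANK] marker and learns
        # to produce the missing span as the target.
        source = f"{prefix} [BLANK] {suffix}".strip()
        return source, target

    # Infilling: the blank falls in the middle of the passage.
    print(make_example("She opened the door and", "saw a cat", "on the rug."))
    # Continuation: an empty suffix places the blank at the sequence's end.
    print(make_example("She opened the door and", "saw a cat on the rug."))

Training one model on a mixture of both blank placements is what lets it serve both tasks without maintaining two models.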
Active Evaluation: Efficient NLG Evaluation with Few Pairwise Comparisons Recent studies have shown the advantages of evaluating NLG systems using pairwise comparisons as opposed to direct assessment. Given k systems, a naive approach for identifying the top-ranked system would be to uniformly obtain pairwise comparisons from all k(k-1)/2 pairs of systems. However, this can be very expensive, as the number of human annotations required grows quadratically with k. In this work, we introduce Active Evaluation, a framework to efficiently identify the top-ranked system by actively choosing system pairs for comparison using dueling bandit algorithms. We perform extensive experiments with 13 dueling bandit algorithms on 13 NLG evaluation datasets spanning 5 tasks and show that the number of human annotations can be reduced by 80%. To further reduce the number of human annotations, we propose model-based dueling bandit algorithms which combine automatic evaluation metrics with human evaluations. Specifically, we eliminate sub-optimal systems even before the human annotation process and perform human evaluations only on test examples where the automatic metric is highly uncertain. This reduces the number of human annotations required by a further 89%. In effect, we show that identifying the top-ranked system requires only a few hundred human annotations, which grow linearly with k. Lastly, we provide practical recommendations and best practices to identify the top-ranked system efficiently. PDF 6 2021
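A toy sketch of the active-comparison idea: rather than querying all k(k-1)/2 pairs uniformly, repeatedly compare the pair whose outcome is currently least certain. The uncertainty heuristic below is a simple stand-in, not one of the paper's 13 bandit algorithms.

    def win_rate(wins, a, b):
        total = wins[a][b] + wins[b][a]
        return wins[a][b] / total if total else 0.5

    def active_top1(systems, human_judge, budget=500):
        # wins[a][b] counts how often system a beat system b.
        wins = {a: {b: 0 for b in systems if b != a} for a in systems}
        for _ in range(budget):
            # Query the pair whose empirical win rate is closest to 50/50.
            a, b = min(
                ((x, y) for x in systems for y in systems if x < y),
                key=lambda p: abs(win_rate(wins, *p) - 0.5),
            )
            winner = human_judge(a, b)  # one pairwise human annotation
            loser = b if winner == a else a
            wins[winner][loser] += 1
        return max(systems, key=lambda s: sum(wins[s].values()))

Here human_judge stands in for a crowdsourced comparison; the model-based variant in the paper would answer it with an automatic metric whenever that metric is confident.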
Topic-independent Detection of Dissonance in Short Stance Text We address dissonance detection, the task of detecting conflicting stance between two input statements. Computational models for stance detection have typically been trained for a given target topic (e.g. gun control). In this paper, we aim for building a computational model for dissonance detection without using training data from the topic of test data. We first build a large-scale dataset of topic-controlled arguments from two sources: (i) an online debate platform, consisting of 15k pairs of statements with support, attack, or no relation from 20 diverse topics, and (ii) Twitter, consisting of 5k pairs of statements from 5 topics. We then evaluate a BERT-based dissonance detection model on this dataset in a topic-controlled manner. Our experiments suggest that dissonance detection models learn the topic-independent patterns of language for detecting dissonance and generalize largely to other arguments in unseen topics. PDF 6 2021
Restoring Hebrew Diacritics Without a Dictionary We demonstrate that it is feasible to diacritize Hebrew script without any human-curated resources other than plain diacritized text. We present NAKDIMON, a two-layer character level LSTM, that performs on par with much more complicated curation-dependent systems, across a diverse array of modern Hebrew sources. PDF 6 2021
A Recipe For Arbitrary Text Style Transfer with Large Language Models In this paper, we leverage large language models (LLMs) to perform zero-shot text style transfer. We present a prompting method that we call augmented zero-shot learning, which frames style transfer as a sentence rewriting task and requires only a natural language instruction, without model fine-tuning or exemplars in the target style. Augmented zero-shot learning is simple and demonstrates promising results not just on standard style transfer tasks such as sentiment, but also on arbitrary transformations such as 'make this melodramatic' or 'insert a metaphor.' PDF 6 2021
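The augmented zero-shot prompt described above is easy to reproduce in spirit; a sketch of one plausible template (the exact wording and exemplars in the paper may differ).

    def style_transfer_prompt(sentence, instruction):
        # Frame style transfer as sentence rewriting guided by a natural
        # language instruction; no fine-tuning, no target-style exemplars.
        return (
            f"Here is some text: {{{sentence}}} "
            f"Here is a rewrite of the text, which is {instruction}: {{"
        )

    prompt = style_transfer_prompt("I ate the cake.", "more melodramatic")
    # Send the prompt to an LLM and read the completion up to the closing brace.

The "augmented" part of augmented zero-shot learning comes from prepending a handful of rewriting demonstrations on unrelated sentences and styles, which this sketch omits.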
Predicting Visual Futures with Image Captioning and Pre-Trained Language Models The task of visual forecasting deals with predicting future events from a sequence of input images. Purely pixel-based approaches find this challenging due to the presence of abstract concepts and temporal events at different timescales. In this paper, we present an approach that combines image captioning with pre-trained language models to predict visual futures. By leveraging language as an intermediate medium, our model is able to perform more effective temporal reasoning on two different tasks: visual story cloze and action forecasting. Despite making the final predictions using only the generated captions, our approach outperforms state-of-the-art systems by 4% and 6% respectively on the two tasks. We find that our model consistently picks images/actions that are semantically relevant to the given image sequence instead of simply relying on visual similarity. PDF 6 2021
What is Missing in Existing Multi-hop Datasets? Toward Deeper Multi-hop Reasoning Task Multi-hop machine reading comprehension (MRC) is a task that requires models to read and perform multi-hop reasoning over multiple paragraphs to answer a question. The task can be used to evaluate reasoning skills, as well as to check the explainability of models, and is useful in applications (e.g., QA systems). However, the current definition of a hop (alias: step) in multi-hop MRC is ambiguous; moreover, previous studies demonstrated that many multi-hop examples contain reasoning shortcuts where the questions can be solved without performing multi-hop reasoning. In this opinion paper, we redefine multi-hop MRC to resolve the ambiguity of its current definition by providing three different definitions of the steps. Inspired by the assessment of student learning in education, we introduce the term in-depth multi-hop reasoning task, with three additional evaluations: step evaluation, coreference evaluation, and entity linking evaluation. In addition, we examine the existing multi-hop datasets in light of our proposed definitions. We observe that there is potential to extend the existing multi-hop datasets by adding more intermediate evaluations to the task. To prevent reasoning shortcuts, multi-hop MRC datasets should focus more on providing a clear definition of the steps in the reasoning process and preparing gold data to evaluate them. PDF 6 2021
Best Practices for Noise-Based Augmentation to Improve the Performance of Deployable Speech-Based Emotion Recognition Systems Emotion recognition models are a key component of several downstream applications, such as mental health assessments. These models are usually trained on small, clean, and synthetically controlled datasets, which leads to high failure rates in the presence of 'unseen' background noises and invites noise-overlay based adversarial attacks. Noisy data augmentation has aided the robustness of speech recognition and classification models, where the ground truth label remains consistent even in the presence of noise; this isn't always true for subjectively perceived emotion labels. In this work, we create realistic noisy samples of IEMOCAP using multiple categories of environmental and synthetic noise. We evaluate how ground truth labels (human) and predicted labels (model) change as a function of these noise source introductions. We show that some commonly used noisy augmentation techniques impact human perception of emotion, thus falsifying the 'clean' ground truth label. Our experiments show that the performance of both baseline and even denoised emotion recognition models significantly declines on noisy samples compared to the clean set. This performance degradation persists when the model is trained on a combination of clean samples and noisy samples mismatched with the test set. We investigate how using the 'human-perceptible' noise overlays found above can lead to inaccurate metrics when testing the model for robustness or vulnerability to adversarial attacks. Finally, we present a set of recommendations for noise-based augmentation of speech emotion datasets and for deploying the models trained using those datasets. PDF 6 2021
How to Do Human Evaluation: Best Practices for User Studies in NLP Many research topics in natural language processing (NLP), such as explanation generation, dialog modeling or machine translation, require evaluation that goes beyond standard metrics like accuracy or F1 score toward a more human-centered approach. Therefore, understanding how to design user studies becomes increasingly important. However, few comprehensive resources exist on planning, conducting and evaluating user studies for NLP, making it hard to get started for researchers without prior experience in the field of human evaluation. In this paper, we summarize the most important aspects of user studies and their design and evaluation, providing direct links to NLP tasks and NLP specific challenges where appropriate. We (i) outline general study design, ethical considerations, and factors to consider for crowdsourcing, (ii) discuss the particularities of user studies in NLP and provide starting points to select questionnaires, experimental designs and evaluation methods that are tailored to the specific NLP tasks. Additionally, we offer examples with accompanying statistical evaluation code in R throughout, to bridge the gap between theoretical guidelines and practical applications. PDF 7 2021
EIDER: Evidence-enhanced Document-level Relation Extraction Document-level relation extraction (DocRE) aims at extracting the semantic relations among entity pairs in a document. In DocRE, a subset of the sentences in a document, called the evidence sentences, might be sufficient for predicting the relation between a specific entity pair. To make better use of the evidence sentences, in this paper, we propose a three-stage evidence-enhanced DocRE framework called Eider, consisting of joint relation and evidence extraction, evidence-centered relation extraction (RE), and fusion of extraction results. We first jointly train an RE model with a simple and memory-efficient evidence extraction model. Then, we construct pseudo documents based on the extracted evidence sentences and run the RE model again. Finally, we fuse the extraction results of the first two stages using a blending layer and make a final prediction. Extensive experiments show that our proposed framework achieves state-of-the-art performance on the DocRED dataset, outperforming the second-best method by 1.37/1.26 Ign F1/F1. In particular, Eider-RoBERTa-large significantly improves the performance on entity pairs requiring co-reference and multi-hop reasoning by 1.98/2.08 F1, respectively, which cover around 75% of the cross-sentence samples. PDF 7 2021
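A minimal sketch of the pseudo-document step described above: keep only the predicted evidence sentences for a given entity pair, in document order, and rerun the RE model on that focused context. The names are illustrative.

    def build_pseudo_document(sentences, evidence_ids):
        # Preserve document order and drop everything except the
        # predicted evidence sentences for this entity pair.
        return " ".join(sentences[i] for i in sorted(set(evidence_ids)))

    doc = ["A was founded in B.", "It rained.", "A's founder lives in B."]
    print(build_pseudo_document(doc, [2, 0]))
    # -> "A was founded in B. A's founder lives in B."

The final blending stage then reconciles predictions made on the full document with those made on this pseudo document.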
Using Language Models on Low-end Hardware This paper evaluates the viability of using fixed language models for training text classification networks on low-end hardware. We combine language models with a CNN architecture and put together a comprehensive benchmark with 8 datasets covering single- and multi-label classification of topic, sentiment, and genre. Our observations are distilled into a list of trade-offs, concluding that there are scenarios where not fine-tuning a language model yields competitive effectiveness at faster training, requiring only a quarter of the memory compared to fine-tuning. PDF 7 2021
Synthetic Data Generation and Joint Learning for Robust Code-Mixed Translation The widespread online communication in a modern multilingual world has provided opportunities to blend more than one language (aka code-mixed language) in a single utterance. This has resulted in a formidable challenge for computational models due to the scarcity of annotated data and the presence of noise. A potential solution to mitigate the data scarcity problem in low-resource setups is to leverage existing data in a resource-rich language through translation. In this paper, we tackle the problem of code-mixed (Hinglish and Bengalish) to English machine translation. First, we synthetically develop HINMIX, a parallel corpus of Hinglish to English, with ~5M sentence pairs. Subsequently, we propose JAMT, a robust perturbation-based joint-training model that learns to handle noise in real-world code-mixed text by parameter sharing across clean and noisy words. Further, we show the adaptability of JAMT in a zero-shot setup for Bengalish to English translation. Our evaluation and comprehensive analyses qualitatively and quantitatively demonstrate the superiority of JAMT over state-of-the-art code-mixed and robust translation methods. PDF 7 2021
AlephBERT: Pre-training and End-to-End Language Models Evaluation from Sub-Word to Sentence Level Large Pre-trained Language Models (PLMs) have become ubiquitous in the development of language understanding technology and lie at the heart of many artificial intelligence advances. While advances reported for English using PLMs are unprecedented, reported advances using PLMs in Hebrew are few and far between. The problem is twofold. First, Hebrew resources for training large language models are not at the same order of magnitude as their English counterparts. Second, there are no accepted tasks and benchmarks to evaluate the progress of Hebrew PLMs on, and in particular, evaluation on sub-word (morphological) tasks. We aim to remedy both aspects. We present AlephBERT, a large PLM for Modern Hebrew, trained on larger vocabulary and a larger dataset than any Hebrew PLM before. Moreover, we introduce a novel language-agnostic architecture that extracts all of the sub-word morphological segments encoded in contextualized word embedding vectors. Utilizing this new morphological component we offer a new PLM evaluation pipeline of multiple Hebrew tasks and benchmarks, that cover word-level, sub-word level and sentence level tasks. With AlephBERT we achieve state-of-the-art results compared against contemporary baselines. We make our AlephBERT model and evaluation pipeline publicly available, providing a single point of entry for evaluating and comparing Hebrew PLMs. PDF 7 2021
A Vector-Based Approach to Few-Shot Veracity Classification for Automated Fact-Checking As progress on automated fact-checking continues to be called for, veracity classification has gained more attention. It is the task of predicting the veracity of a given claim by comparing it with retrieved pieces of evidence. One of the challenges for this task is obtaining manual annotations for large datasets, especially for new domains where labelled data is unavailable in the first instance. In this paper, we describe a vector-based approach that achieves significant performance improvements on veracity classification in few-shot settings. Performance is compared with two competitive baselines: (1) fine-tuning BERT / RoBERTa, and (2) the state-of-the-art few-shot veracity classification approach leveraging language model perplexity with thresholds. Our approach first utilises sentence-BERT to get sentence vectors of claims and evidence. We then create a relation vector for each claim-evidence pair by applying an absolute-value operation to their vector offsets. Experiments show significant improvements over the baselines. PDF 7 2021
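A minimal sketch of the relation-vector construction described above, using the sentence-transformers package; the model name and the downstream classifier are illustrative choices.

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def relation_vector(claim, evidence):
        # Encode both texts, then take the element-wise absolute offset
        # |u - v| as the feature vector for the claim-evidence pair.
        u, v = model.encode([claim, evidence])
        return np.abs(u - v)

    feat = relation_vector(
        "The Eiffel Tower is in Berlin.",
        "The Eiffel Tower is a landmark in Paris, France.",
    )
    # feat can now feed a lightweight classifier trained on few examples.

Because the offset vector is cheap to compute and low-dimensional, very few labelled pairs suffice to train the classifier on top, which is what makes the approach suitable for few-shot settings.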
Rethinking Offensive Text Detection as a Multi-Hop Reasoning Problem We introduce the task of implicit offensive language detection in dialogues, where a statement may have either an offensive or unoffensive interpretation, depending on the listener and context. We argue that inference is crucial for understanding this broader set of offensive utterances, and create a dataset featuring chains of reasoning to describe how an offensive interpretation may be reached. Experiments show that state-of-the-art methods of offense classification perform poorly on this task, achieving less than 0.12 average accuracy. We explore the use of pre-trained entailment models to score links as part of a multi-hop approach to the problem, showing improved accuracy in most situations. We discuss the feasibility of our approach and the types of external knowledge necessary to support it. PDF 7 2021
The Role of Context in Detecting Previously Fact-Checked Claims Recent years have seen the proliferation of disinformation and misinformation online, thanks to the freedom of expression on the Internet and to the rise of social media. Two solutions have been proposed to address the problem: (i) manual fact-checking, which is accurate and credible, but slow and non-scalable, and (ii) automatic fact-checking, which is fast and scalable, but lacks explainability and credibility. With the accumulation of enough manually fact-checked claims, a middle-ground approach has emerged: checking whether a given claim has previously been fact-checked. This can be done automatically, and thus fast, while also offering credibility and explainability, thanks to the human fact-checking and the explanations in the associated fact-checking article. This is a relatively new and understudied research direction, and here we focus on claims made in a political debate, where context really matters. Thus, we study the impact of modeling the context of the claim: both on the source side, i.e., in the debate, and on the target side, i.e., in the fact-checking explanation document. We do this by modeling the local context and the global context, as well as by means of co-reference resolution, and by reasoning over the target text using Transformer-XH. The experimental results show that each of these represents a valuable information source, but that modeling the source-side context is more important, and can yield 10+ points of absolute improvement. PDF 7 2021
ALLWAS: Active Learning on Language Models in WASserstein Space Active learning has emerged as a standard paradigm in areas with a scarcity of labeled training data, such as the medical domain. Language models have become the prevalent choice for several natural language tasks due to the performance boost they offer. However, in several domains, such as medicine, the scarcity of labeled training data is a common issue, and these models may not work well where class imbalance is prevalent. Active learning may prove helpful in these cases to boost performance with a limited label budget. To this end, we propose ALLWAS, a novel method for active learning in language models using sampling techniques based on submodular optimization and optimal transport. We construct a sampling strategy based on submodular optimization of the designed objective in the gradient domain. Furthermore, to enable learning from few samples, we propose a novel strategy for sampling from the Wasserstein barycenters. Our empirical evaluations on standard benchmark datasets for text classification show that our methods perform significantly better (> 20% relative increase in some cases) than existing approaches for active learning on language models. PDF 7 2021
Reinforce Attack: Adversarial Attack against BERT with Reinforcement Learning Adversarial attacks against textual data have been drawing increasing attention in both the NLP and security domains. Current successful attack methods for text typically consist of two stages: word importance ranking and word replacement. The first stage is usually achieved by masking each word in the sentence one at a time and obtaining the resulting output probability of the target model. The second stage involves finding synonyms to replace “vulnerable” words in the order of the ranking. In this paper, we first explore the effects of employing the model explanation tool LIME to generate word importance rankings, which has the advantage of taking the local information around the word into account when computing word importance scores. We then propose Reinforce Attack, a reinforcement learning (RL) based framework to generate adversarial text. Notably, the attack process is controlled by a reward function rather than heuristics, as in previous methods, to encourage higher semantic similarity and lower query costs. Through automatic and human evaluations, we show that our LIME + Reinforce Attack method achieves a better or comparable attack success rate relative to other state-of-the-art attack frameworks, while the generated samples preserve significantly higher semantic similarity. PDF 8 2021
A Comparison of Strategies for Source-Free Domain Adaptation Data sharing restrictions are common in NLP, especially in the clinical domain, but there is limited research on adapting models to new domains without access to the original training data, a setting known as source-free domain adaptation. We take algorithms that traditionally assume access to the source-domain training data—active learning, self-training, and data augmentation—and adapt them for source-free domain adaptation. Then we systematically compare these different strategies across multiple tasks and domains. We find that active learning yields consistent gains across all SemEval 2021 Task 10 tasks and domains, but although the shared task saw successful self-trained and data-augmented models, our systematic comparison finds these strategies to be unreliable for source-free domain adaptation. PDF 8 2021
An Empirical Survey of Data Augmentation for Limited Data Learning in NLP NLP has achieved great progress in the past decade through the use of neural models and large labeled datasets. The dependence on abundant data prevents NLP models from being applied to low-resource settings or novel tasks where significant time, money, or expertise is required to label massive amounts of textual data. Recently, data augmentation methods have been explored as a means of improving data efficiency in NLP. To date, there has been no systematic empirical overview of data augmentation for NLP in the limited labeled data setting, making it difficult to understand which methods work in which settings. In this paper, we provide an empirical survey of recent progress on data augmentation for NLP in the limited labeled data setting, summarizing the landscape of methods (including token-level augmentations, sentence-level augmentations, adversarial augmentations, and hidden-space augmentations) and carrying out experiments on 11 datasets covering topics/news classification, inference tasks, paraphrasing tasks, and single-sentence tasks. Based on the results, we draw several conclusions to help practitioners choose appropriate augmentations in different settings and discuss the current challenges and future directions for limited data learning in NLP. PDF 8 2021
An Investigation of the (In)effectiveness of Counterfactually Augmented Data While pretrained language models achieve excellent performance on natural language understanding benchmarks, they tend to rely on spurious correlations and generalize poorly to out-of-distribution (OOD) data. Recent work has explored using counterfactually-augmented data (CAD)---data generated by minimally perturbing examples to flip the ground-truth label---to identify robust features that are invariant under distribution shift. However, empirical results using CAD for OOD generalization have been mixed. To explain this discrepancy, we draw insights from a linear Gaussian model and demonstrate the pitfalls of CAD. Specifically, we show that (a) while CAD is effective at identifying robust features, it may prevent the model from learning unperturbed robust features; and (b) CAD may exacerbate existing spurious correlations in the data. On two crowdsourced CAD datasets, our results show that the lack of perturbation diversity limits their effectiveness on OOD generalization, calling for innovative crowdsourcing procedures to elicit diverse perturbation of examples. PDF 8 2021
A Latent-Variable Model for Intrinsic Probing The success of pre-trained contextualized representations has prompted researchers to analyze them for the presence of linguistic information. Indeed, it is natural to assume that these pre-trained representations do encode some level of linguistic knowledge as they have brought about large empirical improvements on a wide variety of NLP tasks, which suggests they are learning true linguistic generalization. In this work, we focus on intrinsic probing, an analysis technique where the goal is not only to identify whether a representation encodes a linguistic attribute, but also to pinpoint where this attribute is encoded. We propose a novel latent-variable formulation for constructing intrinsic probes and derive a tractable variational approximation to the log-likelihood. Our results show that our model is versatile and outperforms two intrinsic probes previously proposed in the literature. Finally, we find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax. PDF 8 2021
CTRLsum: Towards Generic Controllable Text Summarization Current summarization systems yield generic summaries that are disconnected from users' preferences and expectations. To address this limitation, we present CTRLsum, a generic framework to control generated summaries through a set of keywords. During training, keywords are extracted automatically without requiring additional human annotations. At test time, CTRLsum features a control function to map control signals to keywords; by engineering the control function, the same trained model can be applied to control summaries along various dimensions, without affecting the model training process or the pretrained models. We additionally explore the combination of keywords and text prompts for more control tasks. Experiments demonstrate the effectiveness of CTRLsum on three domains of summarization datasets and five control tasks: (1) entity-centric and (2) length-controllable summarization, (3) contribution summarization on scientific papers, (4) invention purpose summarization on patent filings, and (5) question-guided summarization on news articles. Moreover, when used in a standard, unconstrained summarization setting, CTRLsum is comparable to or better than the state-of-the-art systems. PDF 8 2021
Automatic Multi-Label Prompting: Simple and Interpretable Few-Shot Classification Prompt-based learning (i.e., prompting) is an emerging paradigm for exploiting knowledge learned by a pretrained language model. In this paper, we propose Automatic Multi-Label Prompting (AMuLaP), a simple yet effective method to automatically select label mappings for few-shot text classification with prompting. Our method exploits one-to-many label mappings and a statistics-based algorithm to select label mappings given a prompt template. Our experiments demonstrate that AMuLaP achieves competitive performance on the GLUE benchmark without human effort or external resources. PDF 8 2021
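As an illustration of the statistics-based selection, here is a minimal sketch of one plausible reading of the idea, assuming the masked language model's probabilities at the [MASK] position are already computed for each few-shot example; the function names and the plain top-k rule are illustrative, and AMuLaP's full algorithm also resolves tokens claimed by multiple classes:

```python
import numpy as np

def select_label_mappings(mask_probs: np.ndarray, labels: np.ndarray, k: int = 3):
    """mask_probs: (n_examples, vocab_size) MLM probabilities at [MASK].
    labels: (n_examples,) gold class ids.
    For each class, pick the k vocabulary tokens with the highest mean
    probability over that class's examples."""
    return {int(c): np.argsort(mask_probs[labels == c].mean(axis=0))[::-1][:k].tolist()
            for c in np.unique(labels)}

def classify(mask_probs_row: np.ndarray, mapping: dict) -> int:
    """Score each class by summing the probabilities of its label tokens."""
    return max(mapping, key=lambda c: mask_probs_row[mapping[c]].sum())
```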
EncT5: Fine-tuning T5 Encoder for Discriminative Tasks Encoder-decoder transformer architectures have become popular recently with the advent of T5 models. While they demonstrate impressive performance on benchmarks such as GLUE (Wang et al., 2019), it is not clear whether the proposed encoder-decoder architecture is the most efficient for fine-tuning on downstream discriminative tasks. In this work, we study fine-tuning pre-trained encoder-decoder models such as T5. In particular, we propose EncT5 as a way to efficiently fine-tune pre-trained encoder-decoder T5 models for classification and regression tasks by using only the encoder layers. Our experimental results show that EncT5, with less than half of the parameters of T5, performs similarly to T5 models on the GLUE benchmark. We believe our proposed approach can be easily applied to any pre-trained encoder-decoder model. PDF 8 2021
On the Diversity and Limits of Human Explanations A growing effort in NLP aims to build datasets of human explanations. However, the term explanation encompasses a broad range of notions, each with different properties and ramifications. Our goal is to provide an overview of diverse types of explanations and human limitations, and discuss implications for collecting and using explanations in NLP. Inspired by prior work in psychology and cognitive sciences, we group existing human explanations in NLP into three categories: proximal mechanism, evidence, and procedure. These three types differ in nature and have implications for the resultant explanations. For instance, procedures are not considered explanations in psychology, and connect with a rich body of work on learning from instructions. The diversity of explanations is further evidenced by the proxy questions that are needed for annotators to interpret and answer open-ended why questions. Finally, explanations may require different, often deeper, understandings than predictions, which casts doubt on whether humans can provide useful explanations in some tasks. PDF 8 2021
BERT Learns to Teach: Knowledge Distillation with Meta Learning We present Knowledge Distillation with Meta Learning (MetaDistil), a simple yet effective alternative to traditional knowledge distillation (KD) methods where the teacher model is fixed during training. We show that the teacher network can learn to better transfer knowledge to the student network (i.e., learning to teach) with feedback from the performance of the distilled student network in a meta learning framework. Moreover, we introduce a pilot update mechanism to improve the alignment between the inner-learner and meta-learner in meta learning algorithms that focus on an improved inner-learner. Experiments on various benchmarks show that MetaDistil can yield significant improvements compared with traditional KD algorithms and is less sensitive to the choice of student capacity and hyperparameters, facilitating the use of KD on different tasks and models. PDF 8 2021
Semantics-aware Attention Improves Neural Machine Translation The integration of syntactic structure into Transformer machine translation has shown positive results, but to our knowledge, no work has attempted to do so with semantic structures. In this work we propose two novel parameter-free methods for injecting semantic information into Transformers, both of which rely on semantics-aware masking of (some of) the attention heads. One method operates on the encoder, through a Scene-Aware Self-Attention (SASA) head; the other operates on the decoder, through a Scene-Aware Cross-Attention (SACrA) head. We show a consistent improvement over the vanilla Transformer and syntax-aware models for four language pairs. We further show an additional gain when using both semantic and syntactic structures in some language pairs. PDF 8 2021
QAConv: Question Answering on Informative Conversations This paper introduces QAConv, a new question answering (QA) dataset that uses conversations as a knowledge source. We focus on informative conversations, including business emails, panel discussions, and work channels. Unlike open-domain and task-oriented dialogues, these conversations are usually long, complex, asynchronous, and involve strong domain knowledge. In total, we collect 34,608 QA pairs, including span-based and unanswerable questions, from 10,259 selected conversations with both human-written and machine-generated questions. We use a question generator and a dialogue summarizer as auxiliary tools to collect multi-hop questions. The dataset has two testing scenarios, chunk mode and full mode, depending on whether the grounded partial conversation is provided or retrieved. Experimental results show that state-of-the-art pretrained QA systems have limited zero-shot performance and tend to predict our questions as unanswerable. Our dataset provides a new training and evaluation testbed to facilitate research on QA over conversations. PDF 8 2021
Deduplicating Training Data Makes Language Models Better We find that existing language modeling datasets contain many near-duplicate examples and long repetitive substrings. As a result, over $1\%$ of the unprompted output of language models trained on these datasets is copied verbatim from the training data. We develop two tools that allow us to deduplicate training datasets---for example removing from C4 a single 61 word English sentence that is repeated over $60{,}000$ times. Deduplication allows us to train models that emit memorized text ten times less frequently and require fewer training steps to achieve the same or better accuracy. We can also reduce train-test overlap, which affects over $4\%$ of the validation set of standard datasets, thus allowing for more accurate evaluation. PDF 8 2021
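The paper's tools are built on exact substring matching (suffix arrays) and approximate matching (MinHash); the toy sketch below approximates both with hashed n-gram fingerprints, with window size and threshold chosen purely for illustration:

```python
import hashlib

def ngram_fingerprints(text: str, n: int = 50) -> set:
    """Hash every n-token window of a document; hashes shared across
    documents flag long repeated substrings."""
    toks = text.split()
    return {hashlib.md5(" ".join(toks[i:i + n]).encode()).hexdigest()
            for i in range(max(1, len(toks) - n + 1))}

def near_duplicates(docs: list, n: int = 50, threshold: float = 0.5) -> list:
    """Return index pairs of documents whose fingerprint overlap exceeds
    the threshold (relative to the smaller fingerprint set)."""
    fps = [ngram_fingerprints(d, n) for d in docs]
    pairs = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            overlap = len(fps[i] & fps[j]) / max(1, min(len(fps[i]), len(fps[j])))
            if overlap >= threshold:
                pairs.append((i, j))
    return pairs
```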
Reliability and Robustness of Transformers for Automated Short-Answer Grading Short-Answer Grading (SAG) is an application of NLP in education where student answers to open questions are graded. This task places high demands both on the reliability (accuracy and fairness) of label predictions and on model robustness against strategic, "adversarial" input. Neural approaches are powerful tools for many problems in NLP, and transfer learning for Transformer-based models specifically promises to support data-poor tasks such as this one. We analyse the performance of a Transformer-based SOTA model, zooming in on class- and item-type-specific behavior in order to gauge reliability; we use adversarial testing to analyze the model's robustness towards strategic answers. We find a strong dependence on the specifics of training and test data, and recommend that model performance be verified for each individual use case. PDF 8 2021
Quantifying the Task-Specific Information in Text-Based Classifications Recently, neural natural language models have attained state-of-the-art performance on a wide variety of tasks, but the high performance can result from superficial, surface-level cues (Bender and Koller, 2020; Niven and Kao, 2020). These surface cues, as the ``shortcuts'' inherent in the datasets, do not contribute to the task-specific information (TSI) of the classification tasks. While it is essential to look at model performance, it is also important to understand the datasets. In this paper, we consider this question: apart from the information introduced by the shortcut features, how much task-specific information is required to classify a dataset? We formulate this quantity in an information-theoretic framework. While this quantity is hard to compute, we approximate it with a fast and stable method. TSI quantifies the amount of linguistic knowledge -- modulo a set of predefined shortcuts -- that contributes to classifying a sample from each dataset. This framework allows us to compare across datasets, saying that, apart from a set of ``shortcut features'', classifying the Multi-NLI task involves around 0.4 nats more TSI than the Quora Question Pairs task. PDF 8 2021
VALSE: A Task-Independent Benchmark for Vision and Language Models centered on Linguistic Phenomena We propose VALSE (Vision And Language Structured Evaluation), a novel benchmark designed for testing general-purpose pretrained vision and language (V&L) models for specific visio-linguistic grounding capabilities. Currently, V&L models are evaluated on tasks such as visual question answering or visual reasoning, which do not address their fine-grained linguistic capabilities. VALSE addresses this gap by offering a suite of six tests targeting specific linguistic phenomena. Solving these tests requires models to ground these phenomena in the visual modality, allowing more fine-grained evaluations than hitherto possible. We build VALSE using methods that support the construction of reliable foils, and report results from evaluating five widely-used V&L models. Our experiments suggest that current models have considerable difficulty addressing most phenomena. Hence, we expect VALSE to serve as an important benchmark to measure future progress of pretrained V&L models from a linguistic perspective, complementing the canonical task-centred V&L evaluations. PDF 8 2021
Multimodal Knowledge Learning for Named Entity Disambiguation With the popularity of online social media in recent years, massive-scale multimodal information has brought new challenges to traditional Named Entity Disambiguation (NED) tasks. Recently, Multimodal Named Entity Disambiguation (MNED) has been proposed to link ambiguous mentions with their textual and visual contexts to a predefined knowledge graph. Recent attempts handle these issues mainly by annotating multimodal mentions and adding multimodal features to traditional NED models. These methods still suffer from 1) a lack of multimodal annotation data against the huge scale of unlabeled corpora and 2) a failure to model multimodal information at the knowledge level. In this paper, we explore a pioneering study on leveraging multimodal knowledge learning to address the MNED task. Specifically, we propose a knowledge-guided transfer learning strategy to extract unified representations from different modalities and enrich multimodal knowledge in a meta-learning way, which is much easier than collecting an ambiguous mention corpus. Then we propose an Interactive Multimodal Learning Network (IMN), which is capable of fully utilizing the multimodal information on both the mention and knowledge sides. To verify the validity of the proposed method, we conduct comparisons on a public large-scale MNED dataset based on a Twitter KB. Experimental results show that our method is superior to the state-of-the-art multimodal methods. PDF 8 2021
Adapting Multilingual Models for Code-Mixed Translation using Back-to-Back Translation In this paper, we explore the problem of translating code-mixed sentences to an equivalent monolingual form. The scarcity of gold standard code-mixed to pure language parallel data makes it difficult to train a translation model that can perform this task reliably. Prior work has addressed the paucity of parallel data with data augmentation techniques. Such techniques rely heavily on external resources, which make the systems difficult to train and scale effectively for multiple languages. We present a simple yet highly effective training scheme for adapting multilingual models to the task of code-mixed translation. Our method eliminates the dependence on external resources by creating synthetic data from a novel two-stage back-translation approach that we propose. We show substantial improvement in translation quality (measured through BLEU), beating existing prior work by up to +3.8 BLEU on code-mixed Hi$\rightarrow$En, Mr$\rightarrow$En, and Bn$\rightarrow$En tasks. On the LinCE Machine Translation leader board, we achieve the highest score for code-mixed Es$\rightarrow$En, beating the existing best baseline by +6.5 BLEU, and our own stronger baseline by +1.1 BLEU. PDF 8 2021
Tailor: Generating and Perturbing Text with Semantic Controls Making controlled perturbations is essential for various tasks (e.g., data augmentation), but building task-specific generators can be expensive. We introduce Tailor, a task-agnostic generation system that perturbs text in a semantically-controlled way. With unlikelihood training, Tailor's generator is designed to follow a series of control codes derived from semantic roles. Through modifications of these control codes, Tailor can produce fine-grained perturbations. We implement a set of operations on control codes that can be composed into complex perturbation strategies, and demonstrate their effectiveness in three applications. First, Tailor facilitates the construction of high-quality contrast sets that are lexically diverse and less biased than original task test data. Second, paired with automated labeling heuristics, Tailor helps improve model generalization through data augmentation: we obtain an average gain of 1.73 on a natural language inference (NLI) challenge set by perturbing just $\sim5\%$ of training data. Third, without any finetuning overhead, Tailor's perturbations effectively improve compositionality in fine-grained style transfer, outperforming fine-tuned baselines on 5 transfers. PDF 8 2021
Digging Errors in NMT: Evaluating and Understanding Model Errors from Hypothesis Distribution Sound evaluation of a neural machine translation (NMT) model is key to its understanding and improvement. Current evaluation of an NMT system is usually built upon a heuristic decoding algorithm (e.g., beam search) and an evaluation metric assessing similarity between the translation and a golden reference (e.g., BLEU). However, this system-level evaluation framework is limited in that it evaluates only a single best hypothesis and is prone to search errors introduced by heuristic decoding algorithms. To better understand NMT models, we propose a novel evaluation protocol, which defines model errors with the hypothesis distribution. In particular, we first propose an exact top-$k$ decoding algorithm, which finds the top-ranked hypotheses in the whole hypothesis space and avoids search errors. Then, we evaluate NMT model errors with the distance between the hypothesis distribution and the ideal distribution, aiming for a comprehensive interpretation. We apply our evaluation to various NMT benchmarks and model architectures to provide an in-depth understanding of how NMT models work. We show that state-of-the-art Transformer models face serious ranking errors and do not even outperform the random chance level. We further provide several interesting findings on data-augmentation techniques, dropout, and deep/wide models. Additionally, we analyze beam search's lucky biases and regularization terms. Interestingly, we find these lucky biases decrease when increasing model capacity. PDF 9 2021
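The exact top-$k$ idea can be illustrated with a toy best-first enumeration; this is a simplification under the assumption of a small vocabulary and a short length limit, not the paper's algorithm for full-scale NMT search spaces:

```python
import heapq

def exact_top_k(next_logprobs, k: int, eos, max_len: int) -> list:
    """Best-first enumeration of the k highest-probability sequences.
    `next_logprobs(prefix)` returns {token: logprob} for the next step.
    Since log-probabilities are <= 0, a prefix's score can only drop as
    it grows, so the first k finished sequences popped from the heap are
    exactly the global top k (no beam-search pruning errors)."""
    heap = [(0.0, ())]            # (negative cumulative logprob, prefix)
    finished = []
    while heap and len(finished) < k:
        neg_score, prefix = heapq.heappop(heap)
        if prefix and (prefix[-1] == eos or len(prefix) == max_len):
            finished.append((-neg_score, prefix))
            continue
        for tok, lp in next_logprobs(prefix).items():
            heapq.heappush(heap, (neg_score - lp, prefix + (tok,)))
    return finished
```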
Improving Generalizability in Implicitly Abusive Language Detection with Concept Activation Vectors Robustness of machine learning models on ever-changing real-world data is critical, especially for applications affecting human well-being such as content moderation. New kinds of abusive language continually emerge in online discussions in response to current events (e.g., COVID-19), and the deployed abuse detection systems should be updated regularly to remain accurate. General abusive language classifiers tend to be fairly reliable in detecting out-of-domain explicitly abusive utterances but often fail to detect new types of more subtle, implicit abuse. We propose an interpretability technique, based on the Testing Concept Activation Vector (TCAV) method from computer vision, to quantify the sensitivity of a trained model to the human-defined concepts of explicit and implicit abusive language, and use that to explain the generalizability of the model on new data, in this case, COVID-related anti-Asian hate speech. Extending this technique, we introduce a novel metric, Degree of Explicitness, for a single instance and show that the new metric is beneficial in suggesting out-of-domain unlabeled examples to effectively enrich the training data with informative, implicitly abusive texts. PDF 9 2021
Hansel: A Chinese Few-Shot and Zero-Shot Entity Linking Benchmark Modern Entity Linking (EL) systems entrench a popularity bias. However, there is no dataset focusing on tail and emerging entities in languages other than English. We present Hansel, a new benchmark in Chinese that fills the vacancy of non-English few-shot and zero-shot EL challenges. Hansel is human-annotated and reviewed, with a novel method for collecting zero-shot EL datasets. It is a diverse dataset covering 8.2K documents in news, social media posts and other web articles, with Wikidata as its target Knowledge Base. We demonstrate that an existing state-of-the-art EL system performs poorly on Hansel (R@1 of 35.8% on Few-Shot). We then establish a strong baseline that scores an R@1 of 43.2% on Few-Shot and 76.6% on Zero-Shot on our dataset. We also show that our baseline achieves competitive results on the TAC-KBP2015 Chinese Entity Linking task. PDF 9 2021
Synthetic Data Generation and Joint Learning for Robust Code-Mixed Translation The widespread online communication in a modern multilingual world has provided opportunities to blend more than one language (aka code-mixed language) in a single utterance. This has created a formidable challenge for computational models due to the scarcity of annotated data and the presence of noise. A potential solution to mitigate the data scarcity problem in a low-resource setup is to leverage existing data in a resource-rich language through translation. In this paper, we tackle the problem of code-mixed (Hinglish and Bengalish) to English machine translation. First, we synthetically develop HINMIX, a parallel corpus of Hinglish to English, with ~5M sentence pairs. Subsequently, we propose JAMT, a robust perturbation-based joint-training model that learns to handle noise in real-world code-mixed text by parameter sharing across clean and noisy words. Further, we show the adaptability of JAMT in a zero-shot setup for Bengalish to English translation. Our evaluation and comprehensive analyses qualitatively and quantitatively demonstrate the superiority of JAMT over state-of-the-art code-mixed and robust translation methods. PDF 9 2021
The Past Mistake is the Future Wisdom: Error-driven Contrastive Probability Optimization for Chinese Spell Checking Chinese Spell Checking (CSC) aims to detect and correct Chinese spelling errors, which are mainly caused by phonological or visual similarity. Recently, thanks to the development of various pre-trained language models (PLMs), many CSC methods have achieved great progress. However, PLMs pay more attention to common characters because of their pre-training settings. Therefore, there exists a gap between the learned knowledge of PLMs and the essence of the CSC task. To address this issue, we propose an Error-driven COntrastive Probability Optimization (ECOPO) framework to refine the knowledge representation of PLMs for CSC. In particular, ECOPO guides the model to avoid predicting common but improper characters in an error-driven way. Besides, ECOPO is model-agnostic, so it can be easily combined with existing CSC methods to achieve better performance. Extensive experiments and detailed analysis on three standard benchmarks demonstrate that ECOPO is simple yet effective. PDF 9 2021
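As a hedged illustration of the error-driven idea, one plausible margin-style objective (not necessarily ECOPO's exact formulation) penalizes cases where the model's own top prediction beats the gold character:

```python
import torch
import torch.nn.functional as F

def error_driven_contrastive_loss(logits: torch.Tensor,
                                  gold: torch.Tensor,
                                  margin: float = 0.1) -> torch.Tensor:
    """Wherever the model's top prediction is a common-but-wrong
    character, push the gold character's probability above it by a
    margin. logits: (batch, vocab); gold: (batch,) gold character ids."""
    probs = F.softmax(logits, dim=-1)
    p_gold = probs.gather(1, gold.unsqueeze(1)).squeeze(1)
    p_top, top_ids = probs.max(dim=-1)
    is_error = (top_ids != gold).float()     # only optimize actual mistakes
    return (is_error * F.relu(margin + p_top - p_gold)).mean()
```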
Headed-Span-Based Projective Dependency Parsing We propose a new paradigm for projective dependency parsing based on headed spans. In a projective dependency tree, the subtree rooted at each word covers a contiguous sequence (i.e., a span) in the surface order. We call a span marked with a root word a headed span. A projective dependency tree can be represented as a collection of headed spans. We decompose the score of a dependency tree into the scores of the headed spans and design a novel $O(n^3)$ dynamic programming algorithm to enable global training and exact inference. The advantages of our headed-span-based dependency parsing include that it captures subtree information more adequately than first-order graph-based methods and performs global optimization in decoding (in contrast to transition-based methods). We evaluate our method on PTB, CTB, and UD, and it achieves competitive results in comparison with previous methods. PDF 9 2021
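The headed-span representation itself is easy to compute from a projective tree; the sketch below derives the (head, span) pairs over which the tree score decomposes, with the $O(n^3)$ parsing DP itself omitted:

```python
def headed_spans(heads: list) -> list:
    """heads[i] is the parent of word i; the root has parent -1.
    Returns, for each word, the half-open span (start, end) covered by
    its subtree; in a projective tree every subtree is contiguous."""
    n = len(heads)
    start, end = list(range(n)), [i + 1 for i in range(n)]
    children = [[] for _ in range(n)]
    for i, h in enumerate(heads):
        if h >= 0:
            children[h].append(i)
    order = []                       # post-order: children before parents
    def visit(i):
        for c in children[i]:
            visit(c)
        order.append(i)
    visit(heads.index(-1))
    for i in order:                  # fold each child's span into its parent
        for c in children[i]:
            start[i] = min(start[i], start[c])
            end[i] = max(end[i], end[c])
    return list(zip(start, end))

# e.g. "She reads long books": heads = [1, -1, 3, 1]
# -> [(0, 1), (0, 4), (2, 3), (2, 4)]
```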
Knowledgeable Prompt-tuning: Incorporating Knowledge into Prompt Verbalizer for Text Classification Tuning pre-trained language models (PLMs) with task-specific prompts has been a promising approach for text classification. In particular, previous studies suggest that prompt-tuning has remarkable superiority in the low-data scenario over generic fine-tuning methods with extra classifiers. The core idea of prompt-tuning is to insert text pieces, i.e., a template, into the input and transform a classification problem into a masked language modeling problem, where a crucial step is to construct a projection, i.e., a verbalizer, between a label space and a label word space. A verbalizer is usually handcrafted or searched by gradient descent, which may lack coverage and bring considerable bias and high variance to the results. In this work, we focus on incorporating external knowledge into the verbalizer, forming a knowledgeable prompt-tuning (KPT), to improve and stabilize prompt-tuning. Specifically, we expand the label word space of the verbalizer using external knowledge bases (KBs) and refine the expanded label word space with the PLM itself before predicting with it. Extensive experiments on zero- and few-shot text classification tasks demonstrate the effectiveness of knowledgeable prompt-tuning. PDF 9 2021
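A minimal sketch of the expanded-verbalizer prediction step, assuming the KB-expanded label word sets are given and skipping KPT's refinement and calibration of those sets:

```python
import numpy as np

def knowledgeable_verbalizer_scores(mask_probs: np.ndarray,
                                    label_words: dict) -> dict:
    """mask_probs: (vocab_size,) MLM probabilities at the [MASK] position.
    label_words: {class: [vocab ids expanded from a knowledge base]}.
    Scores each class by averaging probability mass over its expanded
    word set; the predicted class is the argmax of these scores."""
    return {c: float(np.mean(mask_probs[ids])) for c, ids in label_words.items()}
```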
SeaD: End-to-end Text-to-SQL Generation with Schema-aware Denoising On the WikiSQL benchmark, most methods tackle the challenge of text-to-SQL with predefined sketch slots and build sophisticated sub-tasks to fill these slots. Though achieving promising results, these methods suffer from over-complex model structures. In this paper, we present a simple yet effective approach that enables an auto-regressive sequence-to-sequence model to perform robust text-to-SQL generation. Instead of formulating the task of text-to-SQL as slot-filling, we propose to train a sequence-to-sequence model with Schema-aware Denoising (SeaD), which consists of two denoising objectives that train the model to either recover the input or predict the output from two novel erosion and shuffle noises. These model-agnostic denoising objectives act as auxiliary tasks for structural data modeling during sequence-to-sequence generation. In addition, we propose a clause-sensitive execution-guided (EG) decoding strategy to overcome the limitation of EG decoding for generative models. The experiments show that the proposed method improves the performance of the sequence-to-sequence model in both schema linking and grammar correctness and establishes a new state-of-the-art on the WikiSQL benchmark. Our work indicates that the capacity of the sequence-to-sequence model for text-to-SQL may have been under-estimated and could be enhanced by specialized denoising tasks. PDF 9 2021
Hierarchical Recurrent Aggregative Generation for Few-Shot NLG Large pretrained models enable transfer learning to low-resource domains for language generation tasks. However, previous end-to-end approaches do not account for the fact that some generation sub-tasks, specifically aggregation and lexicalisation, can benefit from transfer learning to different extents. To exploit these varying potentials for transfer learning, we propose a new hierarchical approach for few-shot and zero-shot generation. Our approach consists of a jointly trained three-module architecture: the first module independently lexicalises the distinct units of information in the input as sentence sub-units (e.g. phrases), the second module recurrently aggregates these sub-units to generate a unified intermediate output, while the third module subsequently post-edits it to generate a coherent and fluent final text. We perform extensive empirical analysis and ablation studies on few-shot and zero-shot settings across 4 datasets. Automatic and human evaluation shows that the proposed hierarchical approach is consistently capable of achieving state-of-the-art results when compared to previous work. PDF 9 2021
Widely Interpretable Semantic Representation: Frameless Meaning Representation for Broader Applicability This paper presents a semantic representation called WISeR that overcomes challenges for Abstract Meaning Representation (AMR). Despite its richness and expandability, AMR is not easily applied to languages or domains without predefined semantic frames, and its use of numbered arguments results in semantic role labels which are not directly interpretable and are semantically overloaded for parsers. We examine the numbered arguments of predicates in AMR and convert them to thematic roles which do not require reference to semantic frames. We create a new corpus of 1K dialogue sentences annotated in both WISeR and AMR. WISeR shows stronger inter-annotator agreement for beginner and experienced annotators, with beginners becoming proficient in WISeR annotation sooner. Finally, we train two state-of-the-art parsers on the AMR 3.0 corpus and a WISeR corpus converted from AMR 3.0. The parsers are evaluated on these corpora and our dialogue corpus. WISeR models exhibit higher accuracy than their AMR counterparts across the board, demonstrating that WISeR is easier for parsers to learn. PDF 9 2021
Attention Mechanism with Energy-Friendly Operations The attention mechanism has become the dominant module in natural language processing models. It is computationally intensive and depends on massive power-hungry multiplications. In this paper, we rethink variants of the attention mechanism from an energy consumption perspective. After reaching the conclusion that the energy costs of several energy-friendly operations are far lower than those of their multiplication counterparts, we build a novel attention model by completely replacing multiplications with either selective operations or additions. Empirical results on three machine translation tasks demonstrate that the proposed method, compared with the vanilla one, achieves comparable accuracy while consuming only half the energy. Our code will be released upon acceptance. PDF 9 2021
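One plausible "addition-only" scoring variant consistent with this description (the paper's exact operations may differ) replaces the dot product with a negative L1 distance, which needs only subtractions, absolute values, and additions:

```python
import numpy as np

def l1_attention_scores(Q: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Multiplication-free attention scores: the negative L1 distance
    between each query and key. Q: (n, d), K: (m, d) -> (n, m) scores,
    to be normalized (e.g., by softmax) as in standard attention."""
    return -np.abs(Q[:, None, :] - K[None, :, :]).sum(axis=-1)
```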
Relevance in Dialogue: An empirical comparison of existing metrics, and a novel simple metric In this work, we evaluate various existing dialogue relevance metrics and find strong dependencies on the dataset, often with poor correlation with human scores of relevance; we propose modifications to reduce data requirements while improving correlation. With these changes, our metric achieves a new state-of-the-art on the HUMOD dataset (Merdivan et al., 2020). We achieve this without fine-tuning, using only 3750 unannotated human dialogues and a single negative example. Despite these limitations, we demonstrate competitive performance on three datasets from different domains. Our code, including our metric and data processing, is open-sourced. PDF 9 2021
Denoising Large-Scale Image Captioning from Alt-text Data using Content Selection Models Training large-scale image captioning (IC) models demands access to a rich and diverse set of training examples that are expensive to curate both in terms of time and man-power. Instead, alt-text-based captions gathered from the web are a far cheaper alternative to scale with, with the downside of being noisy. Recent modeling approaches to IC often fall short in leveraging these noisy datasets in favor of clean annotations. We address this problem by breaking down the task into two simpler, more controllable tasks -- skeleton prediction and skeleton-based caption generation. Specifically, we show that sub-selecting content words as skeletons helps in generating improved and denoised captions when leveraging rich yet noisy alt-text-based uncurated datasets. We also show that the predicted English skeletons can further be leveraged cross-lingually to generate non-English captions, and present experimental results covering caption generation in French, Italian, German, Spanish and Hindi. We also show that skeleton-based prediction allows for better control of certain caption properties, such as length, content, and gender expression, providing a handle to perform human-in-the-loop interpretable semi-automatic corrections. PDF 9 2021
Towards Better Characterization of Paraphrases To effectively characterize the nature of paraphrase pairs without expert human annotation, we propose two new metrics: word position deviation (WPD) and lexical deviation (LD). WPD measures the degree of structural alteration, while LD measures the difference in vocabulary used. We apply these metrics to better understand the commonly-used MRPC dataset and study how it differs from PAWS, another paraphrase identification dataset. We also perform a detailed study on MRPC and propose improvements to the dataset, showing that they improve the generalizability of models trained on the dataset. Lastly, we apply our metrics to filter the output of a paraphrase generation model and show how they can be used to generate specific forms of paraphrases for data augmentation or robustness testing of NLP models. PDF 9 2021
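The abstract does not give the formulas, so the sketch below is only one plausible instantiation of WPD and LD, using whitespace tokenization and first-occurrence positions purely for illustration:

```python
def lexical_deviation(s1: str, s2: str) -> float:
    """LD read as 1 minus the Jaccard overlap of the two sentences'
    vocabularies (the paper's exact definition may differ)."""
    v1, v2 = set(s1.split()), set(s2.split())
    return 1.0 - len(v1 & v2) / len(v1 | v2)

def word_position_deviation(s1: str, s2: str) -> float:
    """WPD read as the mean difference in the normalized positions of
    words shared by both sentences."""
    t1, t2 = s1.split(), s2.split()
    shared = set(t1) & set(t2)
    if not shared:
        return 1.0
    pos = lambda toks, w: toks.index(w) / max(1, len(toks) - 1)
    return sum(abs(pos(t1, w) - pos(t2, w)) for w in shared) / len(shared)
```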
Constructing Multilingual CCG Treebanks from Universal Dependencies This paper introduces an algorithm to convert Universal Dependencies (UD) treebanks to Combinatory Categorial Grammar (CCG) treebanks. As CCG encodes almost all grammatical information into the lexicon, obtaining a high quality CCG derivation from a dependency tree is a challenging task. Our algorithm contains four main steps: binarization of dependency trees, functor/argument identification, category assignment through hand-crafted rules, and category inference for unassigned constituents. To evaluate our converted treebanks, we perform lexical, sentential, and syntactic rule coverage analysis, as well as CCG parsing experiments. We achieve a conversion rate of over 80% on 68 treebanks of 44 languages, and over 90% lexical coverage on 81 treebanks of 52 languages. PDF 9 2021
Multilingual Generative Language Models for Zero-Shot Cross-Lingual Event Argument Extraction We present a pioneering study on leveraging multilingual pre-trained generative language models for zero-shot cross-lingual event argument extraction (EAE) by formulating EAE as a language generation task. Compared to previous classification-based EAE models that build classifiers on top of pre-trained masked language models, our generative model effectively encodes the event structures and better captures the dependencies between arguments. To achieve cross-lingual transfer, we design language-agnostic templates to encode argument roles and train our models on source languages to "generate" arguments in the source languages to fill in the language-agnostic template. The trained model can then be directly applied to target languages to "generate" arguments in the target languages to fill in the template. Our experimental results demonstrate that the proposed model outperforms the current state-of-the-art results on zero-shot cross-lingual EAE. Comprehensive ablation study and error analysis are presented to better understand the advantages and the current limitations of using multilingual generative language models for cross-lingual transfer. PDF 9 2021
Multi-Narrative Semantic Intersection Task: Evaluation and Benchmark In this paper, we introduce an important yet relatively unexplored NLP task called Multi-Narrative Semantic Intersection (MNSI), which entails generating a Semantic Intersection of multiple alternate narratives. As no benchmark dataset is readily available for this task, we created one by crawling 2,925 alternative narrative pairs from the web and then went through the tedious process of manually creating 411 different ground-truth semantic intersections by engaging human annotators. As a way to evaluate this novel task, we first conducted a systematic study by borrowing the popular ROUGE metric from the text summarization literature and discovered that ROUGE is not suitable for our task. Subsequently, we conducted further human annotations/validations to create 200 document-level and 1,518 sentence-level ground-truth labels which helped us formulate a new precision-recall style evaluation metric, called SEM F1 (semantic F1), based on presence, partial presence and absence of information. Experimental results show that the proposed SEM F1 metric yields higher correlation with human judgement as well as higher inter-rater agreement compared to the ROUGE metric, and thus we recommend that the community use this metric for evaluating future research on this topic. PDF 9 2021
Focus on the Target's Vocabulary: Masked Label Smoothing for Machine Translation Label smoothing and vocabulary sharing are two widely used techniques in neural machine translation models. However, we argue that jointly adopting these two techniques can be conflicting and even lead to sub-optimal performance, since the soft label produced by label smoothing still assigns probability to source-side words that would never appear on the target side. To address this issue, we propose Masked Label Smoothing (MLS), a new mechanism that masks the soft label probability of source-side words to zero. Simple yet effective, MLS manages to better integrate label smoothing with vocabulary sharing and hence improves the quality of the translation. Our extensive experiments show that MLS consistently yields improvements over original label smoothing on different datasets, including bilingual and multilingual translation, in both BLEU and calibration scores. PDF 9 2021
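A minimal sketch of the masking mechanism as described, assuming we already know which vocabulary ids can occur on the target side (how that set is built is left out):

```python
import numpy as np

def masked_label_smoothing(gold_id: int, vocab_size: int,
                           target_ids: list, eps: float = 0.1) -> np.ndarray:
    """Standard label smoothing spreads eps uniformly over the whole
    shared vocabulary; MLS restricts the smoothed mass to tokens that
    can actually occur on the target side, so source-only tokens get
    exactly zero probability. Assumes gold_id is in target_ids."""
    dist = np.zeros(vocab_size)
    dist[target_ids] = eps / len(target_ids)   # smooth over target tokens only
    dist[gold_id] += 1.0 - eps                 # most mass on the gold token
    return dist                                # sums to 1.0
```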
Draft, Command, and Edit: Controllable Text Editing in E-Commerce Product description generation is a challenging and under-explored task. Most such work takes a set of product attributes as input and then generates a description from scratch in a single pass. However, this widespread paradigm may be limited when facing users' dynamic wishes to constrain the description, such as deleting or adding the content of a user-specified attribute based on the previous version. To address this challenge, we explore a new draft-command-edit manner of description generation, leading to the proposed new task --- controllable text editing in E-commerce. More specifically, we allow systems to receive a command (deleting or adding) from the user and then generate a description by flexibly modifying the content based on the previous version. It is easier and more practical to meet new needs by modifying previous versions than by generating from scratch. To accompany this new task, we present a human-written draft-command-edit dataset called E-cEdits. Furthermore, we design a data augmentation method to remedy the low-resource challenge in this task, which contains a model-based and a rule-based strategy to imitate edits by humans. Our experimental results show that using the new data augmentation method outperforms baselines to a greater extent in both automatic and human evaluations. PDF 9 2021
Accurate Online Posterior Alignments for Principled Lexically-Constrained Decoding Online alignment in machine translation refers to the task of aligning a target word to a source word when the target sequence has only been partially decoded. Good online alignments facilitate important applications such as lexically constrained translation, where user-defined dictionaries are used to inject lexical constraints into the translation model. We propose a novel posterior alignment technique that is truly online in its execution and superior in terms of alignment error rates compared to existing methods. Our proposed inference technique jointly considers alignment and token probabilities in a principled manner and can be seamlessly integrated within existing constrained beam-search decoding algorithms. On five language pairs, including two distant language pairs, we achieve a consistent drop in alignment error rates. When deployed on three lexically constrained translation tasks, we achieve significant improvements in BLEU, specifically around the constrained positions. We show that our alignment-guided constrained inference yields additional benefits of fluency with negligible additional computational costs. PDF 9 2021
Learning Low-frequency Patterns with A Pre-trained Document-Grounded Conversation Model Owing to its perceived capability of recognizing the high-frequency patterns that appear in large corpora, the Generative Pre-trained Transformer model (GPT-2) has demonstrated remarkable performance in document-grounded dialogue generation. Capturing low-frequency patterns, however, remains a challenging task. Here we consider a possible extension of the GPT-2 model with an improved capability of grasping low-frequency patterns, especially for task-specific dialogues. The extension consists of a semantic-oriented encoder and a GPT-2 decoder, the latter equipped with knowledge-aware classification. The proposed encoder-decoder framework strengthens the GPT-2 in two task-specific aspects: one is a suitable way to select, on a semantic level, the crucial information of the dialogue context and the corresponding history knowledge from the documents; the other is the determination of the suitable time to generate a response with the knowledge from documents. With the enhanced capability to learn not only high-frequency but also low-frequency patterns, the proposed extension is shown to outperform the state-of-the-art generative models. PDF 9 2021
Meta-Learning with Sparse Experience Replay for Lifelong Language Learning Lifelong learning requires models that can continuously learn from sequential streams of data without suffering catastrophic forgetting due to shifts in data distributions. Deep learning models have thrived in the non-sequential learning paradigm; however, when used to learn a sequence of tasks, they fail to retain past knowledge and learn incrementally. We propose a novel approach to lifelong learning of language tasks based on meta-learning with sparse experience replay that directly optimizes to prevent forgetting. We show that under the realistic setting of performing a single pass on a stream of tasks and without any task identifiers, our method obtains state-of-the-art results on lifelong text classification and relation extraction. We analyze the effectiveness of our approach and further demonstrate its low computational and space complexity. PDF 9 2021
Towards Full Utilization on Mask Task for Distilling PLMs into NMT Owing to their strong performance in many natural language processing tasks, the application of pre-trained language models (PLMs) in neural machine translation (NMT) has attracted wide attention. Knowledge distillation (KD) is one of the mainstream methods, as it can yield considerable improvements for NMT models without extra computational costs. However, previous methods in NMT always distill knowledge at the hidden-state level and cannot make full use of the teacher models. To solve this issue, we propose KD based on the mask task as a more effective method for NMT, which includes encoder input conversion, mask task distillation, and a gradient optimization mechanism. We evaluate our translation systems on English→German and Chinese→English tasks, and our methods clearly outperform baseline methods. Besides, our framework achieves strong performance with different PLMs. PDF 9 2021
Machine Reading Comprehension: Generative or Extractive Reader? While both extractive and generative readers have been successfully applied to the Question Answering (QA) task, little attention has been paid to the comparison of these two readers. Which reader performs better? What are the reasons for the performance differences? In this paper, we aim to answer these questions in the setting of extractive QA tasks. We design multiple transformer-based models and different scenarios to systematically compare the two readers. Our findings characterize the differences between the two readers and their pros and cons, which can instruct the optimal selection between them and open up new research avenues to improve each reader. Our major findings are: 1) generative readers perform better when the input context is long, whereas extractive readers are better when the context is short; 2) extractive readers generalize better than generative ones under out-of-domain settings, in both single- and multi-task learning scenarios. Our experiments also suggest that, although an encoder-only pre-trained language model (PrLM) is an intuitive choice for extractive readers, the encoder from an encoder-decoder PrLM is a strong alternative that performs competitively. PDF 9 2021
Continual Few-shot Relation Learning via Embedding Space Regularization and Data Augmentation Existing continual relation learning (CRL) methods rely on plenty of labeled training data for learning a new task, which can be hard to acquire in real scenarios, as obtaining large and representative labeled data is often expensive and time-consuming. It is therefore necessary for the model to learn novel relational patterns with very few labeled data while avoiding catastrophic forgetting of previous task knowledge. In this paper, we formulate this challenging yet practical problem as continual few-shot relation learning (CFRL). Based on the finding that learning for new emerging few-shot tasks often results in feature distributions that are incompatible with previous tasks' learned distributions, we propose a novel method based on embedding space regularization and data augmentation. Our method generalizes to new few-shot tasks and avoids catastrophic forgetting of previous tasks by enforcing extra constraints on the relational embeddings and by adding extra relevant data in a self-supervised manner. With extensive experiments we demonstrate that our method can significantly outperform previous state-of-the-art methods in CFRL task settings. PDF 9 2021
ElitePLM: An Empirical Study on General Language Ability Evaluation of Pretrained Language Models Pretrained language models (PLMs), such as BERT and GPT-3, have dominated the majority of NLP tasks. However, relatively little work has been conducted on systematically evaluating the language abilities of PLMs. In this paper, we present a large-scale empirical study on gen\underline{E}ral \underline{l}anguage ab\underline{i}li\underline{t}y \underline{e}valuation of PLMs (ElitePLM). We first design four evaluation dimensions in ElitePLM, including memory, comprehension, reasoning, and composition, and further measure ten widely-used PLMs within five categories. Our empirical results demonstrate that: (1) the pretraining objectives and strategies have significant impacts on PLMs' performance in downstream tasks; (2) fine-tuning PLMs in downstream tasks is usually sensitive to the data size and distribution; (3) PLMs have excellent transferability between similar tasks. Our experimental results summarize several important findings, which can guide future work in choosing, applying, and designing PLMs for specific tasks. We have made all the details of experiments publicly available at https://anonymous.4open.science/r/Paper-for-ACL-4FD1. PDF 9 2021
Transformers Can Compose Skills To Solve Novel Problems Without Finetuning It is possible to achieve improved prediction performance with Transformers on unseen datasets by adding disparate new training tasks to an existing multitask training regime. We demonstrate that this can be attributed to a compositional mechanism rather than memorisation. Performance on the DROP, DROP-CS and ROPES datasets can be improved by over 26 percent without finetuning through the application of numerical reasoning tasks, while performance on seven other question-answering datasets that would not be expected to improve remains essentially unchanged. By filtering our evaluation datasets to only those samples that have no answer overlap with similar training samples, and then further restricting to those samples which have the least semantic similarity with the training set, we show that the improved performance after adding numerical reasoning tasks was not attributable to direct lookup. Our code and filtered datasets are available at https://github.com/anonymised. PDF 9 2021
Utterance Rewriting with Contrastive Learning in Multi-turn Dialogue Context modeling plays a significant role in building multi-turn dialogue systems. In order to make full use of context information, systems can use Incomplete Utterance Rewriting (IUR) methods to simplify a multi-turn dialogue into a single turn by merging the current utterance and context information into a self-contained utterance. However, previous approaches ignore the intent consistency between the original query and the rewritten query. The detection of omitted or coreferred locations in the original query can also be further improved. In this paper, we introduce contrastive learning and multi-task learning to jointly model the problem. Our method benefits from carefully designed self-supervised objectives, which act as auxiliary tasks to capture semantics at both the sentence level and the token level. The experiments show that our proposed model achieves state-of-the-art performance on several public datasets. PDF 9 2021
ASSIST: Towards Label Noise-Robust Dialogue State Tracking The MultiWOZ 2.0 dataset has greatly boosted the research on dialogue state tracking (DST). However, substantial noise has been discovered in its state annotations. Such noise brings about huge challenges for training DST models robustly. Although several refined versions, including MultiWOZ 2.1-2.4, have been published recently, there are still lots of noisy labels, especially in the training set. Besides, it is costly to rectify all the problematic annotations. In this paper, instead of improving the annotation quality further, we propose a general framework, named ASSIST (lAbel noiSe-robuSt dIalogue State Tracking), to train DST models robustly from noisy labels. ASSIST first generates pseudo labels for each sample in the training set by using an auxiliary model trained on a small clean dataset, then puts the generated pseudo labels and vanilla noisy labels together to train the primary model. We show the validity of ASSIST theoretically. Experimental results also demonstrate that ASSIST improves the joint goal accuracy of DST by up to $28.16\%$ on the initial version MultiWOZ 2.0 and $8.41\%$ on the latest version MultiWOZ 2.4, respectively. PDF 9 2021
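A minimal sketch of the label-combination step as described, with a hypothetical mixing weight; ASSIST's full framework also covers how the auxiliary model and its pseudo labels are produced:

```python
import numpy as np

def assist_target(pseudo_probs: np.ndarray, noisy_onehot: np.ndarray,
                  alpha: float = 0.6) -> np.ndarray:
    """Training target for the primary DST model: a convex combination
    of the auxiliary model's pseudo label distribution and the vanilla
    (possibly noisy) one-hot label. `alpha` is a hypothetical weight."""
    return alpha * pseudo_probs + (1.0 - alpha) * noisy_onehot
```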
A Variational Hierarchical Model for Neural Cross-Lingual Summarization The goal of cross-lingual summarization (CLS) is to convert a document in one language (e.g., English) into a summary in another one (e.g., Chinese), which is essentially the combination of machine translation (MT) and monolingual summarization (MS). Existing studies on CLS mainly focus on utilizing pipeline methods or jointly training an end-to-end model through an auxiliary MT or MS objective. However, it is very challenging for the model to directly conduct CLS as it requires both the abilities to translate and summarize. Besides, the processes of MT and MS have a hierarchical relationship with CLS. Therefore, we propose a hierarchical model for the CLS task, based on the conditional variational auto-encoder. The hierarchical model contains two kinds of latent variables at the local and global levels, respectively. At the local level, there are two latent variables, one for translation and the other for summarization. As for the global level, there is another latent variable for cross-lingual summarization conditioned on the two local-level variables. Experiments on two language directions (English-Chinese) verify the effectiveness and superiority of the proposed approach, yielding state-of-the-art performance. In addition, we show that our model is able to generate better cross-lingual summaries than comparison models in the few-shot setting. PDF 9 2021
Benchmarking Answer Verification Methods for Question Answering-Based Summarization Evaluation Metrics Question answering-based summarization evaluation metrics must automatically determine whether the QA model's prediction is correct or not, a task known as answer verification. In this work, we benchmark the lexical answer verification methods which have been used by current QA-based metrics as well as two more sophisticated text comparison methods, BERTScore and LERC. We find that LERC outperforms the other methods in some settings while remaining statistically indistinguishable from lexical overlap in others. However, our experiments reveal that improved verification performance does not necessarily translate to overall QA-based metric quality: In some scenarios, using a worse verification method -- or using none at all -- has comparable performance to using the best verification method, a result that we attribute to properties of the datasets. PDF 9 2021
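For reference, the lexical answer verification that the benchmarked QA-based metrics rely on is essentially SQuAD-style token F1; a standard sketch (omitting SQuAD's article and punctuation normalization):

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """SQuAD-style lexical overlap between a predicted and a gold answer."""
    p, g = prediction.lower().split(), gold.lower().split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)
```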
Towards Overcoming Practical Obstacles to Deploying Deep Active Learning Active learning (AL) is a prominent technique for reducing the annotation effort required for training machine learning models. Deep learning offers a solution for several essential obstacles to deploying AL in practice but introduces many others. One such problem is the excessive computational resources required to train an acquisition model and estimate its uncertainty on instances in the unlabeled pool. We propose two techniques that tackle this issue for text classification and tagging tasks, offering a substantial reduction of AL iteration duration and the computational overhead introduced by deep acquisition models in AL. We also demonstrate that our algorithm, which leverages pseudo-labeling and distilled models, overcomes one of the obstacles revealed previously in the literature. Namely, it was shown that due to differences between an acquisition model used to select instances during AL and a successor model trained on the labeled data, the benefits of AL can diminish. We show that our algorithm, despite using a smaller and faster acquisition model, is capable of training a more expressive successor with higher performance. PDF 9 2021
Extractive Topical Summarization With Aspects Extractive summarization is the task of highlighting the most important parts of a text. We introduce a new approach to the extractive summarization task using hidden topical structure and information about aspects of the text. Experimental results on CNN/DailyMail demonstrate that our approach generates more accurate summaries than baseline methods, achieving state-of-the-art results in terms of the ROUGE metric. Additionally, we show that aspect information is extremely important in the extractive summarization scenario. PDF 9 2021
Pseudo-Error Generation for Grammatical Error Correction Based on Learner’s First Language We propose to adapt grammatical error correction (GEC) systems to the learners' first language (L1) by generating artificial errors that reflect the L1 influence. To this end, we employ two simple approaches: fine-tuning a back-translation model on L1-annotated data; and controlling the output of a back-translation model to generate artificial errors that follow the L1-dependent error type distribution. We demonstrate that, despite the simplicity of the model and the paucity of L1-annotated data, our methods succeed in adapting GEC models to some languages. We also show that generating L1-adapted artificial errors is orthogonal to the existing method that directly adapts the GEC model to each L1. Lastly, we present an analysis of the pseudo errors generated by our models and show that they approximately capture the L1-specific error patterns. PDF 9 2021
Learning Functional Distributional Semantics with Visual Data Functional Distributional Semantics is a recently proposed framework for learning distributional semantics that provides linguistic interpretability. It models the meaning of a word as a binary classifier rather than a numerical vector. In this work, we propose a method to train a Functional Distributional Semantics model with grounded visual data. We train it on the Visual Genome dataset, which is closer to the kind of data encountered in human language acquisition than a large text corpus. On four external evaluation datasets, our model outperforms previous work on learning semantics from Visual Genome. PDF 9 2021
Personalized News Recommendation with Candidate-aware User Modeling News recommendation aims to match news with personalized user interests. Existing methods for news recommendation usually model user interest from historically clicked news without considering the candidate news. However, each user usually has multiple interests, and it is difficult for these methods to accurately match a candidate news article with a specific user interest. In this paper, we present a candidate-aware user modeling method for personalized news recommendation, which can incorporate candidate news into user modeling for better matching between candidate news and user interest. More specifically, we propose a candidate-aware self-attention network that uses candidate news as guidance to model candidate-aware global user interest. In addition, we propose a candidate-aware CNN network to incorporate candidate news into local behavior context modeling to learn candidate-aware short-term user interest. Finally, we use a candidate-aware attention network to aggregate previously clicked news, weighted by their relevance to the candidate news, to build a candidate-aware user representation. Experiments on real-world datasets show the effectiveness of our approach in improving news recommendation performance. PDF 9 2021
Knowledge Neurons in Pretrained Transformers Large-scale pretrained language models are surprisingly good at recalling factual knowledge presented in the training corpus. In this paper, we present preliminary studies on how factual knowledge is stored in pretrained Transformers by introducing the concept of knowledge neurons. Specifically, we examine the fill-in-the-blank cloze task for BERT. Given a relational fact, we propose a knowledge attribution method to identify the neurons that express the fact. We find that the activation of such knowledge neurons is positively correlated with the expression of their corresponding facts. In our case studies, we attempt to leverage knowledge neurons to edit (such as updating and erasing) specific factual knowledge without fine-tuning. Our results shed light on understanding the storage of knowledge within pretrained Transformers. PDF 9 2021
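To picture the attribution idea, the following is a minimal PyTorch sketch of integrated-gradients-style neuron attribution on a toy feed-forward layer. The model, the per-neuron scaling scheme, and the step count are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Toy stand-in for a Transformer FFN block; real knowledge neurons live
# in the intermediate layers of a pretrained model such as BERT.
class TinyFFN(nn.Module):
    def __init__(self, d_in=16, d_hid=32, n_out=10):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hid)
        self.fc2 = nn.Linear(d_hid, n_out)

    def forward(self, x, idx=None, scale=None):
        h = torch.relu(self.fc1(x))
        if idx is not None:
            # Differentiably scale one hidden neuron's activation.
            onehot = torch.zeros(h.shape[-1])
            onehot[idx] = 1.0
            h = h * (1.0 + (scale - 1.0) * onehot)
        return self.fc2(h)

def neuron_attribution(model, x, idx, target, steps=20):
    """Riemann-sum approximation of the integrated gradient of the
    target logit as the neuron's activation is scaled from 0 to 1."""
    attr = 0.0
    for alpha in torch.linspace(1.0 / steps, 1.0, steps):
        scale = alpha.clone().requires_grad_(True)
        logits = model(x, idx=idx, scale=scale)
        (g,) = torch.autograd.grad(logits[0, target], scale)
        attr += g.item() / steps
    return attr

model = TinyFFN()
x = torch.randn(1, 16)
scores = [neuron_attribution(model, x, i, target=3) for i in range(32)]
print(max(range(32), key=lambda i: scores[i]))  # most "knowledgeable" neuron
```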
A Survey on Geocoding: Algorithms and Datasets for Toponym Resolution Geocoding, the task of converting unstructured text to structured spatial data, has recently seen progress thanks to a variety of new datasets, evaluation metrics, and machine-learning algorithms. We provide a survey to review, organize and analyze recent work on geocoding (also known as toponym resolution) where the text is matched to geospatial coordinates and/or ontologies. We summarize the findings of this research and suggest some promising directions for future work. PDF 9 2021
Should a Bot be Sarcastic? Understanding User Preferences Towards Sarcasm Generation Previous sarcasm generation research has focused on how to generate text that people perceive as sarcastic to create more human-like interactions. In this paper, we argue that we should first turn our attention to the question of when sarcasm should be generated, finding that human annotators consider many inputs to be unfit for sarcastic responses. Next, we introduce a theory-driven framework for sarcasm generation which allows us to better control the linguistic devices used during the generation process in order to measure their impact on sarcasm perception, finding that pragmatic insincerity and emotional markers are crucial elements in generating recognizable sarcasm. PDF 9 2021
Combining (Second-Order) Graph-Based and Headed-Span-Based Projective Dependency Parsing Graph-based methods have been popular in dependency parsing for decades; they decompose the score of a dependency tree into the scores of its dependency arcs. Recently, Yang and Tu (2021) proposed a headed-span-based method that decomposes the score of a dependency tree into the scores of headed spans. In this paper, we combine the two types of methods by considering both arc scores and headed-span scores, designing three scoring methods and the corresponding dynamic programming algorithms for joint inference. Experiments show the effectiveness of our proposed methods. PDF 9 2021
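To make the combined factorization concrete, here is a small sketch that scores a fixed projective tree as the sum of its arc scores and headed-span scores. The joint dynamic programming algorithms are the paper's contribution and are not reproduced here; the score tensors and toy tree are assumptions.

```python
import numpy as np

def subtree_spans(heads):
    """heads[m] = index of m's head (root has head -1).
    Returns, for every word, the (left, right) boundary of its subtree."""
    n = len(heads)
    left, right = list(range(n)), list(range(n))
    # Propagate boundaries upward; n passes suffice for a tree of depth <= n.
    for _ in range(n):
        for m, h in enumerate(heads):
            if h >= 0:
                left[h] = min(left[h], left[m])
                right[h] = max(right[h], right[m])
    return list(zip(left, right))

def tree_score(heads, arc_score, span_score):
    """arc_score[h, m]: score of arc h -> m.
    span_score[l, r, h]: score of the headed span (l, r) with head h."""
    s = sum(arc_score[h, m] for m, h in enumerate(heads) if h >= 0)
    s += sum(span_score[l, r, h] for h, (l, r) in enumerate(subtree_spans(heads)))
    return s

n = 4
heads = [-1, 0, 1, 0]                     # a toy projective tree
arc = np.random.randn(n, n)
span = np.random.randn(n, n, n)
print(tree_score(heads, arc, span))
```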
SOS: Systematic Offensive Stereotyping Bias in Word Embeddings Hate speech detection models aim to provide a safe environment for marginalised social groups to express themselves. However, the bias in these models could lead to silencing those groups. In this paper, we introduce the systematic offensive stereotyping (SOS) bias metric. We propose a method to measure the SOS bias in different word embeddings and also investigate its influence on the downstream task of hate speech detection. Our results show that SOS bias against various groups exists in widely used word embeddings and that, in most cases, our SOS bias metric correlates positively with the bias statistics of published surveys on online abuse and hate. However, we found that it is not easy to prove that bias in word embeddings influences downstream task performance. Finally, we show that our SOS bias metric is more indicative of sexism and racism in the inspected word embeddings when used for sexism and racism detection than the stereotypical social biases. PDF 9 2021
Graph Neural Networks for Multiparallel Word Alignment After a period of decline, interest in word alignments is increasing again owing to their usefulness in domains such as typological research, cross-lingual annotation projection and machine translation. Generally, alignment algorithms only use bitext and do not make use of the fact that many parallel corpora are multiparallel. We propose to use graph neural networks (GNNs) and community detection algorithms to exploit the graph structure of multiparallel word alignments. Our GNN approach (i) utilizes information about the meaning, position and language of the input words, (ii) incorporates information from multiple parallel sentences, (iii) can remove edges from the initial alignments, and (iv) provides a prediction model that can generalize beyond the sentences it is trained on. We show that community detection algorithms can provide valuable information for multiparallel word alignment. We show on three word alignment datasets and on a downstream task that our method outperforms previous work. PDF 9 2021
Learning Emotion-Aware Contextual Representations for Emotion Cause Analysis Emotion Cause Analysis has been a key topic in natural language processing. Previous works focus on Emotion Cause Extraction (ECE), a clause-level classification task aimed at extracting the causes of a given emotion in text. The task has been expanded to Emotion Cause Pair Extraction (ECPE), which focuses on extracting both emotions and their corresponding causes in context. Most existing methods for the ECPE task implement a joint model that performs extraction and matching of emotion and cause clauses simultaneously. However, we argue that different input features are needed for the two subtasks, and thus sharing contextual representations may be suboptimal. In this work, we propose a pipelined approach that builds on two independent pre-trained encoders, in which the emotion extraction model only provides input features for the cause extraction model. Based on a series of careful experiments, we validate that our model can create distinct contextual representations according to specific emotional texts, and thus achieves state-of-the-art performance on both the ECE and ECPE tasks, with absolute F1 improvements of 1.5% and 4.72% over the best previous works, respectively. Besides, we apply a set of simple clause selection rules to extract multiple pairs in a document, strengthening the applicability of our approach in real-world scenarios. PDF 9 2021
AMR-to-text Generation with Graph Structure Reconstruction and Coverage Generating text from semantic representations such as AMR is a challenging task. Previous research formalizes this task as a graph-to-sequence learning problem and uses various graph neural networks to model the graph structure. Recently, methods based on pre-trained models have improved performance significantly thanks to pre-training on large text corpora. However, these pre-trained model-based methods take linearized AMR graphs as input and may lose graph structure information. In addition, these methods do not consider the coverage of the AMR graph, so some nodes in the graph may be lost or repeated in the generated text. To address these problems, we propose a graph structure and coverage enhanced model for this task. To enhance the graph structure information, we design two auxiliary objectives: relationship prediction and distance prediction for nodes in AMR graphs. To account for the coverage of AMR graphs, we design a coverage mechanism that addresses information under-translation and over-translation in AMR-to-text generation. Experimental results on three standard datasets show that our proposed method significantly outperforms existing methods. PDF 9 2021
Rebuild and Ensemble: Exploring Defense Against Text Adversaries Adversarial attacks can mislead strong neural models; in NLP tasks, substitution-based attacks are particularly difficult to defend against. Current defense methods usually assume that the substitution candidates are accessible, so they cannot be widely applied against substitution-agnostic attacks. In this paper, we propose a \textbf{Rebuild and Ensemble} framework to defend against adversarial attacks on text without knowing the candidates. We propose a rebuild mechanism to train a robust model and ensemble the rebuilt texts during inference to achieve good adversarial defense results. Experiments show that our method can improve accuracy under the current strong attack methods. PDF 9 2021
Enhancing the Nonlinear Mutual Dependencies in Transformers with Mutual Information Predictive uncertainty is a real problem in Transformers. We show that pre-trained Transformers can be further regularized with mutual information to alleviate this issue in Neural Machine Translation (NMT). In this paper, we explicitly capture the nonlinear mutual dependencies in decoder self-attentions to reduce the model's uncertainty about token-token interactions. Specifically, we adopt an unsupervised objective of mutual information maximization on self-attentions following the contrastive learning methodology, and construct the mutual information estimate using InfoNCE. Experimental results on WMT'14 En$\rightarrow$De and WMT'14 En$\rightarrow$Fr demonstrate the consistent effectiveness and evident improvements of our model over strong baselines. Quantifying the model uncertainty again verifies our hypothesis. The proposed plug-and-play approach can be easily incorporated and deployed into pre-trained Transformer models. Code will be released soon. PDF 9 2021
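For readers unfamiliar with InfoNCE, the estimator itself is compact. The sketch below assumes that paired views of the same self-attention representation serve as positives and other in-batch items as negatives; the shapes and temperature are illustrative, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def info_nce(query, positive, temperature=0.1):
    """InfoNCE lower bound on the mutual information between two views.
    query, positive: (batch, dim); other in-batch items act as negatives."""
    q = F.normalize(query, dim=-1)
    k = F.normalize(positive, dim=-1)
    logits = q @ k.t() / temperature      # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0))      # the diagonal holds the positives
    return F.cross_entropy(logits, labels)

# e.g., two noisy views of the same decoder self-attention output
a = torch.randn(8, 64)
b = a + 0.1 * torch.randn(8, 64)
print(info_nce(a, b).item())
```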
Achieving Reliable Human Assessment of Open-Domain Dialogue Systems Evaluation of open-domain dialogue systems is highly challenging and development of better techniques is highlighted time and again as desperately needed. Despite substantial efforts to carry out reliable live evaluation of systems in recent competitions, annotations have been abandoned and reported as too unreliable to yield sensible results. This is a serious problem since automatic metrics are not known to provide a good indication of what may or may not be a high-quality conversation. Answering the distress call of competitions that have emphasized the urgent need for better evaluation techniques in dialogue, we present the successful development of human evaluation that is highly reliable while still remaining feasible and low cost. Self-replication experiments reveal almost perfectly repeatable results with a correlation of $r=0.969$. Furthermore, due to the lack of appropriate methods of statistical significance testing, the likelihood of potential improvements to systems occurring due to chance is rarely taken into account in dialogue evaluation, and the evaluation we propose facilitates application of standard tests. Since we have developed a highly reliable evaluation method, new insights into system performance can be revealed. We therefore include a comparison of state-of-the-art models (i) with and without personas, to measure the contribution of personas to conversation quality, as well as (ii) prescribed versus freely chosen topics. Interestingly with respect to personas, results indicate that personas do not positively contribute to conversation quality as expected. PDF 9 2021
DocEE: A Large-Scale Dataset for Document-level Event Extraction Event extraction (EE) is the task of identifying events and their types, along with the involved arguments. Despite the great success in sentence-level event extraction, events are more naturally presented in the form of documents, with event arguments scattered across multiple sentences. However, a major barrier to promoting document-level event extraction has been the lack of large-scale and practical training and evaluation datasets. In this paper, we present DocEE, a new document-level EE dataset including 20,000+ events and 100,000+ arguments. We highlight three features: large-scale annotations, fine-grained event arguments and application-oriented settings. Experiments show that even SOTA models show inferior performance on DocEE, especially in cross-domain settings, indicating that DocEE is still a challenging task. We will publish DocEE upon acceptance. PDF 9 2021
Improving Personalized Dialogue Generation Models with Data-level Distillation and Diversification Personalized dialogue generation is a challenging task in which a persona-consistent response must be generated conditioned on both persona texts and dialogue utterances, making it more complex than conventional dialogue generation. Multiple persona texts and utterances exist in one sample, and some of them can be distractors for generation. Even strong models therefore have difficulty attending to the suitable personas and end up generating persona-irrelevant responses. Besides, the limited data scale and diversity further affect performance. We therefore start from the data and propose to boost the model with data-level distillation and diversification (D$^3$). We first distill the original training samples into simplified persona-consistent ones, lowering the difficulty by removing redundant information in personas and dialogue history. In the diversification step, we then increase both the amount and the diversity of the distilled data to mitigate its scarcity. A model is trained via a curriculum, first on easier augmented samples and then on harder original ones. Experiments on the PersonaChat benchmark dataset illustrate the superiority of our method when paired with two strong base dialogue models (Transformer and GPT2) on various automatic metrics and in human evaluation. PDF 9 2021
Towards Comprehensive Patent Approval Predictions: Beyond Traditional Document Classification Predicting the approval chance of a patent application is a challenging problem involving multiple facets. The most crucial facet is arguably the novelty --- \emph{35 U.S. Code § 102} rejects more recent applications that have very similar prior arts. Such novelty evaluations distinguish patent approval prediction from conventional document classification --- successful patent applications may share similar writing patterns; however, too-similar newer applications would receive the opposite label, thus confusing standard document classifiers (e.g., BERT). To address this issue, we propose a novel framework \our that unifies the document classifier with handcrafted features, particularly time-dependent novelty scores. Specifically, we formulate the novelty scores by comparing each application with millions of prior arts using a hybrid of efficient filters and a neural bi-encoder. Moreover, we impose a new regularization term on the classification objective to enforce a monotonic change of the approval prediction w.r.t. the novelty scores. From extensive experiments on the large-scale USPTO dataset, we find that our time-dependent novelty features offer a boost on top of the document classifier. Also, our monotonic regularization, while shrinking the search space, can drive the optimizer to better local optima, yielding empirical performance gains. Ex-post analysis of the prediction scores further confirms that the document classifier and the handcrafted features capture distinct sets of learning information. PDF 9 2021
EIDER: Evidence-enhanced Document-level Relation Extraction Document-level relation extraction (DocRE) aims to extract the semantic relations among entity pairs in a document. In DocRE, we observe that (1) a subset of the sentences in a document, denoted the evidence sentences, are often sufficient for predicting the relation between a specific entity pair; and (2) these evidence sentences can be extracted in an effective and lightweight manner, by multi-task learning along with the RE model or by heuristic rules. In this paper, we propose a novel DocRE framework called Eider that automatically extracts and makes use of evidence. Eider enhances a DocRE model by combining the inference results from the evidence sentences and the original document through a blending layer. The performance can be further improved by jointly training an RE model with an evidence extraction model via multi-task learning. If human-annotated evidence is not available, we can use the evidence extracted by this joint model or by several heuristic rules. Extensive experiments show that Eider achieves state-of-the-art performance on the DocRED, CDR, and GDA datasets. Remarkably, Eider outperforms the runner-up by 1.37/1.26 Ign F1/F1 on DocRED. In particular, Eider-RoBERTa$_\text{large}$ significantly improves the performance on entity pairs requiring co-reference/multi-hop reasoning by 1.98/2.08 F1, respectively. PDF 9 2021
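The blending idea can be pictured as fusing per-relation logits computed from the full document with logits from an evidence-only pseudo-document. The logsumexp aggregation below is a hypothetical simplification, not necessarily the paper's exact blending layer.

```python
import torch

def blend_logits(doc_logits, evid_logits, tau=0.0):
    """Union-style fusion: a relation is predicted if either view
    supports it. Logsumexp acts as a smooth maximum over the two views."""
    stacked = torch.stack([doc_logits, evid_logits])  # (2, n_pairs, n_rels)
    blended = torch.logsumexp(stacked, dim=0)
    return blended > tau                              # thresholded predictions

# e.g., scores for 5 entity pairs over 97 relation types (DocRED-sized)
preds = blend_logits(torch.randn(5, 97), torch.randn(5, 97))
```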
GraphPrompt: Biomedical Entity Normalization Using Graph-based Prompt Templates Biomedical entity normalization unifies the language across biomedical experiments and studies, and further enables us to obtain a holistic view of life sciences. Current approaches mainly study the normalization of more standardized entities such as diseases and drugs, while disregarding the more ambiguous but crucial entities such as pathways, functions and cell types, hindering their real-world applications. To achieve biomedical entity normalization on these under-explored entities, we first introduce an expert-curated dataset OBO-syn encompassing 70 different types of entities and 2 million curated entity-synonym pairs. To utilize the unique graph structure in this dataset, we propose GraphPrompt, a prompt-based learning approach that creates prompt templates according to the graphs. GraphPrompt obtained 41.0% and 29.9% improvement on zero-shot and few-shot settings respectively, indicating the effectiveness of these graph-based prompt templates. We envision that our method GraphPrompt and OBO-syn dataset can be broadly applied to graph-based NLP tasks, and serve as the basis for analyzing diverse and accumulating biomedical data. PDF 9 2021
Modeling Multi-granularity Segmentation for Rare Words in Neural Machine Translation Segmenting rare words into subwords has become a common and effective way to alleviate the open-vocabulary problem in Neural Machine Translation (NMT). The existing dominant segmentation methods give rare words either a single segmentation or a fixed segmentation, which leads to a lack of morphological diversity in representing words. For rare words, we first obtain segmentations of different granularities through Byte Pair Encoding (BPE) and BPE-Dropout, and then propose the \textsc{BPEatt} model to dynamically mix the BPE subwords and BPE-Dropout subwords, which enhances the encoder's ability to represent rich morphological information. Experiments on six translation benchmarks of different scales show that our proposed method significantly outperforms the baseline model and has obvious advantages over related methods. PDF 9 2021
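BPE-Dropout, the source of the multi-granularity segmentations here, is easy to illustrate: standard BPE applies an ordered merge table deterministically, while BPE-Dropout skips each merge with some probability. A toy re-implementation follows; the merge table and word are illustrative only.

```python
import random

def bpe_segment(word, merges, p_drop=0.0):
    """Apply an ordered BPE merge table to a word; with probability
    p_drop, skip a merge (BPE-Dropout), yielding varied segmentations."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b and random.random() >= p_drop:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]
print(bpe_segment("lower", merges))           # deterministic: ['lower']
for _ in range(3):                            # stochastic granularities
    print(bpe_segment("lower", merges, p_drop=0.3))
```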
An Empirical Study of Document-to-document Neural Machine Translation This paper does not aim at introducing a novel method for document NMT. Instead, we head back to the original transformer model with document-level training and hope to answer the following question: Is the capacity of current models strong enough for document-level NMT? Interestingly, we observe that the original transformer with appropriate training techniques can achieve strong results for document translation, even with a length of 2000 words. We evaluate this model and several recent approaches on nine document-level datasets and two sentence-level datasets across six languages. Experiments show that the original Transformer model outperforms sentence-level models and many previous methods in a comprehensive set of metrics, including BLEU, four lexical indices, three newly proposed assistant linguistic indicators, and human evaluation. PDF 9 2021
Consistent Crosslingual Data Transfer for Open Information Extraction Progress with supervised Open Information Extraction (OpenIE) has been primarily limited to English due to the scarcity of training data in other languages. In this paper, we explore techniques to automatically convert English text for training OpenIE systems in other languages. We introduce the Alignment Augmented Constrained Translation (AACTrans) model to translate English sentences and their corresponding extractions consistently with each other --- with no changes to vocabulary or semantic meaning which may result from independent translations. Using the data generated with AACTrans, we train a novel two-stage generative OpenIE model, which we call Gen2OIE, that outputs for each sentence: 1) relations in the first stage and 2) all extractions containing the relation in the second stage. Gen2OIE increases relation coverage using a training data transformation technique that is generalizable to multiple languages, in contrast to existing models that use an English-specific training loss. Evaluations on 5 languages --- Spanish, Portuguese, Chinese, Hindi and Telugu --- show that the Gen2OIE with AACTrans data outperforms prior systems by a margin of 6-40% in F1. PDF 9 2021
Probing, Generalization and Application of Metaphorical Knowledge in Pre-trained Language Models Human languages are full of metaphorical expressions. Metaphors help people understand the world by connecting new concepts and domains to more familiar ones. Large pre-trained language models (PLMs) are therefore assumed to encode metaphorical knowledge useful for NLP systems when processing language. In this paper, we investigate this hypothesis for PLMs by probing the metaphoricity knowledge in their encodings, by measuring the cross-lingual and cross-dataset generalization of this knowledge, and by analyzing the application of this knowledge when generating metaphorical expressions. We present studies in multiple metaphoricity detection datasets and four languages (i.e., English, Spanish, Russian, and Farsi). Our extensive experiments suggest that contextual representations in PLMs do encode metaphoricity information, and mostly in their middle layers, and the knowledge is transferrable between languages and datasets in most cases. Finally, we show that PLMs face more challenges in generating metaphors, especially as their novelty increases. Our findings give helpful insights for both cognitive and NLP scientists. PDF 9 2021
Cross-Lingual Event Detection via Optimized Adversarial Training In this work, we focus on Cross-Lingual Event Detection (CLED) where a model is trained on data from a source language but its performance is evaluated on data from a second, target, language. Most recent works in this area have harnessed the language-invariant qualities displayed by pre-trained Multi-lingual Language Models (MLM). Their performance, however, reveals there is room for improvement as they mishandle delicate cross-lingual instances. We leverage the use of unlabeled data to train a Language Discriminator (LD) to discern between the source and target languages. The LD is trained in an adversarial manner so that our encoder learns to produce refined, language-invariant representations that lead to improved CLED performance. More importantly, we optimize the adversarial training by only presenting the LD with the most \textit{informative} samples. We base our intuition about \textit{what} makes a sample informative on two disparate metrics: sample similarity and event presence. Thus, we propose using Optimal Transport (OT) as a solution to naturally combine these two distinct information sources into the selection process. Extensive experiments on 8 different language pairs, using 4 languages from unrelated families, show the flexibility and effectiveness of our model that achieves new state-of-the-art results. PDF 9 2021
Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? An Extensive Empirical Study on Language Tasks There has been a lot of interest in the scaling properties of Transformer models. However, little has been done to investigate how inductive biases and model architectures affect these scaling properties. Do model architectures scale differently? If so, how does inductive bias affect scaling behaviour? How does this influence upstream (pretraining) and downstream (transfer) performance? This paper conducts a systematic study of the scaling behaviour of ten diverse model architectures, such as Transformers, Switch Transformers, Universal Transformers, Dynamic convolutions, Performers, and recently proposed MLP-Mixers. Via extensive experiments, we show that (1) architecture is indeed an important consideration when scaling and (2) the best performing model can fluctuate at different scales. We believe that the findings outlined in this work have significant implications for how model architectures are currently evaluated in the community. PDF 9 2021
Contrastive Word Embedding Learning for Neural Machine Translation Seq2seq models have shone in the field of Neural Machine Translation (NMT). However, word embeddings learned by NMT models tend to degenerate and be distributed into a narrow cone, known as the {\em{representation degeneration problem}}, which limits the representation capacity of word embeddings. In this paper, we propose a Contrastive Word Embedding Learning (CWEL) method to address this problem. CWEL combines the ideas of contrastive representation learning with embedding regularization, and adaptively minimizes the cosine similarity of word embeddings on the target side according to their semantic similarity. Experiments on multiple translation benchmark datasets show that CWEL significantly improves translation quality. Additional analysis shows that the improvements mainly come from the well-learned word embeddings. PDF 9 2021
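A minimal sketch of a CWEL-style regularizer, under the assumption that the adaptive weighting simply scales each pair's cosine similarity by its semantic dissimilarity; the paper's exact scheme may differ.

```python
import torch
import torch.nn.functional as F

def cwel_regularizer(embeddings, semantic_sim):
    """Push target-side word embeddings apart in proportion to how
    semantically dissimilar the corresponding words are."""
    e = F.normalize(embeddings, dim=-1)
    cos = e @ e.t()                           # (V, V) cosine similarities
    weights = 1.0 - semantic_sim              # dissimilar pairs weigh more
    off_diag = 1.0 - torch.eye(e.size(0))     # ignore self-similarity
    return (weights * cos * off_diag).mean()

V, d = 100, 32
emb = torch.randn(V, d, requires_grad=True)
sim = torch.rand(V, V)                        # e.g., from a similarity lexicon
loss_reg = cwel_regularizer(emb, sim)
loss_reg.backward()                           # added to the NMT loss in training
```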
Unsupervised Personality-Aware Language Identification Recognizing the language of ambiguous texts remains a main challenge in language identification (LID). When using multilingual applications, users have their own language preferences, which can be regarded as external knowledge for LID. Nevertheless, current studies barely consider such inter-personal variation due to the lack of user-annotated training data. To fill this gap, we introduce personality-aware LID and propose a novel unsupervised learning strategy. Concretely, we extract training samples for each user from a standard LID corpus according to his/her language preference. Furthermore, we contribute the first user-labeled LID test set, called "U-LID". Experimental results reveal that the proposed model can capture user traits and significantly outperforms existing LID systems in handling ambiguous texts. Our code and dataset will be released upon acceptance. PDF 9 2021
A Simple and Effective Model for Multi-Hop Question Generation Previous research on automated question generation has almost exclusively focused on generating factoid questions whose answers can be extracted from a single document. However, there is an increasing interest in developing systems that are capable of more complex multi-hop question generation (QG), where answering the question requires reasoning over multiple documents. In this work, we propose a simple and effective approach based on the transformer model for multi-hop QG. Our approach consists of specialized input representations, a supporting sentence classification objective, and training data weighting. Prior work on multi-hop QG considers the simplified setting of shorter documents and also advocates the use of entity-based graph structures as essential ingredients in model design. On the contrary, we showcase that our model can scale to the challenging setting of longer documents as input, does not rely on graph structures, and substantially outperforms the state-of-the-art approaches as measured by automated metrics and human evaluation. PDF 9 2021
A Few-Shot Semantic Parser for Wizard-of-Oz Dialogues with the Precise ThingTalk+ Representation Previous attempts to build effective semantic parsers for Wizard-of-Oz (WOZ) conversations suffer from the difficulty in acquiring a high-quality, manually annotated training set. Approaches based only on dialogue synthesis are insufficient as dialogues generated from state-machine based models are poor approximations of real-life conversations. Furthermore, previously proposed dialogue state representations are ambiguous and lack the precision necessary for building an effective agent. This paper proposes a new dialogue representation and a sample-efficient methodology that can predict precise dialogue states in WOZ conversations. We propose a precise, complete, and executable dialogue representation called ThingTalk+, which captures all information an agent needs to respond properly. Our training strategy is sample-efficient: we combine (1) few-shot data sparsely sampling the full dialogue space and (2) synthesized data covering a subset space of dialogues generated by a succinct state-based dialogue model. The completeness of the ThingTalk+ language is demonstrated with a fully operational agent, which is also used in training data synthesis. We demonstrate the effectiveness of our methodology on MultiWOZ 3.0, a reannotation of the MultiWOZ 2.1 dataset in ThingTalk+. ThingTalk+ can represent 98% of the test turns, while the simulator can emulate 85% of the validation set. We train a contextual semantic parser using our strategy, and obtain 79% turn-by-turn exact match accuracy on the reannotated test set. PDF 9 2021
Divide and Rule: Effective Pre-Training for Context-Aware Multi-Encoder Translation Models Multi-encoder models are a broad family of context-aware neural machine translation systems that aim to improve translation quality by encoding document-level contextual information alongside the current sentence. The context encoding is undertaken by contextual parameters, trained on document-level data. In this work, we discuss the difficulty of training these parameters effectively, due to the sparsity of the words in need of context (i.e., the training signal) and of their relevant context. We propose to pre-train the contextual parameters over split sentence pairs, which makes efficient use of the available data for two reasons. Firstly, it increases the contextual training signal by breaking intra-sentential syntactic relations, thus pushing the model to search the context for disambiguating clues more frequently. Secondly, it eases the retrieval of relevant context, since context segments become shorter. We propose four different splitting methods and evaluate our approach with BLEU and contrastive test sets. Results show that it consistently improves learning of contextual parameters, both in low- and high-resource settings. PDF 9 2021
FlipDA: Effective and Robust Data Augmentation for Few-Shot Learning Most previous methods for text data augmentation are limited to simple tasks and weak baselines. We explore data augmentation on hard tasks (i.e., few-shot natural language understanding) and strong baselines (i.e., pretrained models with over one billion parameters). Under this setting, we reproduced a large number of previous augmentation methods and found that these methods bring marginal gains at best and sometimes substantially degrade performance. To address this challenge, we propose a novel data augmentation method, FlipDA, that jointly uses a generative model and a classifier to generate label-flipped data. Central to the idea of FlipDA is the discovery that generating label-flipped data is more crucial to performance than generating label-preserved data. Experiments show that FlipDA achieves a good tradeoff between effectiveness and robustness---it substantially improves many tasks while not negatively affecting the others. PDF 9 2021
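The core FlipDA loop reduces to generate-then-filter. A schematic sketch with placeholder functions generate_fn (e.g., a T5-style span-infilling model) and classify_fn (the task classifier); both names are hypothetical, not the released code.

```python
def flipda_augment(text, label, generate_fn, classify_fn, n_candidates=8):
    """Keep only augmented candidates whose predicted label flips.
    generate_fn(text, n) -> list of perturbed texts
    classify_fn(text) -> (predicted_label, confidence)"""
    kept = []
    for cand in generate_fn(text, n_candidates):
        pred, conf = classify_fn(cand)
        if pred != label:                        # label-flipped data is what helps
            kept.append((cand, pred, conf))
    kept.sort(key=lambda t: t[2], reverse=True)  # most confident flips first
    return [(c, p) for c, p, _ in kept[:2]]
```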
BART-light: One Decoder Layer Is Enough BART (Lewis et al., 2020), an encoder-decoder transformer language model (LM), has reached state-of-the-art results on several tasks in natural language generation and understanding. Similar to other pretrained encoder-decoder LMs, it uses the same number of hidden layers in the encoder and the decoder. In this paper, we show that one can easily remove all but one or two decoder layers for text generation tasks, and even remove the whole decoder for classification tasks, with little to no compromise in performance. Our study shows that a shallow decoder is sufficient for most tasks when a deep encoder is used. PDF 9 2021
Multimodal Audio-textual Architecture for Robust Spoken Language Understanding Tandem spoken language understanding (SLU) systems suffer from the so-called automatic speech recognition (ASR) error propagation problem. Additionally, as the ASR is not optimized to extract semantics, but solely the linguistic content, relevant semantic cues might be left out of its transcripts. In this work, we propose a multimodal language understanding (MLU) architecture to mitigate these problems. Our solution is based on two compact unidirectional long short-term memory (LSTM) models that encode speech and text information. A fusion layer is also used to fuse audio and text embeddings. Two fusion strategies are explored: a simple concatenation of these embeddings and a cross-modal attention mechanism that learns the contribution of each modality. The first approach proved to be the optimal solution for robustly extracting semantic information from audio-textual data. We found that attention is less effective at test time when the text modality is corrupted. Our model is evaluated on three SLU datasets and robustness is tested using ASR outputs from three off-the-shelf ASR engines. Results show that the proposed approach effectively mitigates the ASR error propagation problem for all datasets. PDF 9 2021
CoDA21: Evaluating Language Understanding Capabilities of NLP Models With Context-Definition Alignment Pretrained language models (PLMs) have achieved superhuman performance on many benchmarks, creating a need for harder tasks. We introduce CoDA21 (Context Definition Alignment), a challenging benchmark that measures natural language understanding (NLU) capabilities of PLMs: Given a definition and a context each for k words, but not the words themselves, the task is to align the k definitions with the k contexts. CoDA21 requires a deep understanding of contexts and definitions, including complex inference and world knowledge. We find that there is a large gap between human and PLM performance, suggesting that CoDA21 measures an aspect of NLU that is not sufficiently covered in existing benchmarks. PDF 9 2021
Remixers: A Mixer-Transformer Architecture with Compositional Operators for Natural Language Understanding Recent work such as MLP-Mixers (Tolstikhin et al.) has demonstrated the promise of All-MLP architectures. While All-MLP architectures have demonstrated reasonable performance in computer vision and garnered recent interest, we argue that making them effective in NLP applications is still an uphill battle. Hence, there may be no solid reason to drop the self-attention modules altogether. In this paper, we propose a new Mixer-Transformer architecture, showing that Transformers and Mixer models can be quite complementary indeed. Fundamentally, we show that Mixer models are capable of acting as persistent global memory (in a similar vein to standard MLPs) while being imbued with global receptive fields at the same time. Hence, interleaving sample-dependent and input-local self-attention with persistent Mixer modules can be an effective strategy. Additionally, we propose compositional remixing, a new way of baking compositional operators (multiplicative and subtractive composition) into the mixing process to improve the expressiveness of the model. This allows us to effectively model relationships between unmixed and mixed representations -- an inductive bias that we postulate is powerful for NLU applications. Via extensive experiments on 14 challenging NLU datasets (e.g., SuperGLUE, entailment and compositional generalization), we show that the proposed architecture consistently outperforms a strong T5 baseline (Raffel et al.). We believe this work paves the way for more effective synergies between the two families of models. PDF 9 2021
Comprehension of Subtitles from Re-Translating Simultaneous Speech Translation In simultaneous speech translation, one can vary the size of the output window, system latency and sometimes the allowed level of rewriting. The effect of these properties on readability and comprehensibility has not been tested with modern neural translation systems. In this work, we propose an evaluation method and investigate the effects on comprehension and user preferences. It is a pilot study with 14 users on 2 hours of German documentaries or speeches with online translations into Czech. We collect continuous feedback and answers on factual questions. Our results show that the subtitling layout or flicker have little effect on comprehension, in contrast to machine translation itself and individual competence. Other results show that users with a limited knowledge of the source language have different preferences for stability and latency than users with zero knowledge. The results are statistically insignificant; however, we show that our method works and can be reproduced in larger volume. PDF 9 2021
Top-Down Influence? Predicting CEO Personality and Risk Impact from Speech Transcripts How much does a CEO’s personality influence the performance of their company? Past literature has contested the possibility of predicting the Myers–Briggs Type Indicator (MBTI) from purely textual data. However, we use Transformers to create the first supervised model to regress the MBTI personality of CEOs. We show that moderate to strong predictions can be obtained for three out of four MBTI dimensions. Finally, providing empirical evidence for the upper echelons theory, we demonstrate that the predicted CEO personalities have explanatory power of financial risk. PDF 9 2021
Placing (Historical) Events on a Timeline: A Classification cum Co-ref Resolution Approach The event timeline provides one of the most effective ways to visualize the important historical events that occurred over a period of time, presenting insights that may not be so apparent from reading the equivalent information in textual form. By leveraging generative adversarial learning for important event classification and by assimilating knowledge-based tags to improve the performance of event coreference resolution, we introduce a two-staged system for event timeline generation from multiple (historical) text documents. In addition, we propose a vis-timeline based visualization technique to portray the event timeline. We demonstrate our results on two very well known historical documents -- the Collected Works of Mahatma Gandhi (CWMG) and the Collected Works of Abraham Lincoln (CWAL). Our results can be extremely helpful for historians, in advancing research in history and in understanding the socio-political landscape of a country as reflected in the writings of political leaders/scholars. Our work has some parallels with timeline summarization (TLS) tasks and we therefore use these as baselines. Rigorous experiments demonstrate that prior event detection, which was hitherto absent in TLS methods, can improve summarization performance. To show that our methods are very generic, we reuse our method to visualize the evolution of coronavirus-related events in India from a collection of various COVID-19 articles. PDF 9 2021
Exploring the Effectiveness of Student Behavior in Prerequisite Relation Discovery for Concepts What knowledge should a student grasp before beginning a new MOOC course? This question can be answered by discovering prerequisite relations among knowledge concepts. In recent years, researchers have devoted intensive efforts to detecting such relations by analyzing various types of information. However, there have been few explorations of utilizing student behaviors in this task. In this paper, we investigate the effectiveness of student behaviors in prerequisite relation discovery for course concepts. Specifically, we first construct a novel MOOC dataset to support the study. We then verify the effectiveness of student behaviors by using them as additional features for existing prerequisite relation discovery models. Moreover, we explore how to better utilize student behaviors via graph-based modeling. We hope our study will call more attention and effort to exploring student behavior for prerequisite relation discovery. PDF 9 2021
We need to talk about random seeds Modern neural network libraries all take as a hyperparameter a random seed, typically used to determine the initial state of the model parameters. In this position piece, I argue that there are some appropriate uses for random seeds: as part of the hyperparameter search to select a good model, creating an ensemble of several variants of a model, or measuring the sensitivity of the training algorithm to the random seed hyperparameter. I argue against some inappropriate uses for random seeds: using a fixed random seed for "replicability" and creating score distributions for performance comparison. I review 85 recent publications from the ACL Anthology and find that more than 50% are using random seeds inappropriately. PDF 9 2021
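The appropriate uses named above translate directly into code. A sketch of seed-sensitivity measurement follows, with train_and_eval as a placeholder training routine supplied by the caller; the seed-setting calls are the standard ones for each library.

```python
import random
import statistics

import numpy as np
import torch

def set_seed(seed):
    """Fix all common sources of randomness for one training run."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

def seed_sensitivity(train_and_eval, seeds=range(5)):
    """Appropriate use: measure how sensitive training is to the seed.
    train_and_eval trains a model and returns an evaluation score."""
    scores = []
    for s in seeds:
        set_seed(s)
        scores.append(train_and_eval())
    return statistics.mean(scores), statistics.stdev(scores)
```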
Measuring Factual Consistency of Abstractive Summaries Recent abstractive summarization systems fail to generate factual consistent -- faithful -- summaries, which heavily limits their practical application. Commonly, these models tend to mix concepts from the source or hallucinate new content, completely ignoring the source.Addressing the faithfulness problem is perhaps the most critical challenge for current abstractive summarization systems.First automatic faithfulness metrics were proposed, but we argue that existing methods do not yet utilize all "machinery" that this field has to offer and introduce new approaches to assess factual correctness.We evaluate existing and our proposed methods by correlating them with human judgements and find that BERTScore works well.Next, we conduct a data analysis, which reveals common problems, ways to further improve the metrics and indicates that combining multiple metrics is promising. Finally, we exploit faithfulness metrics in pre- and post-processing steps to decrease factual errors made by state-of-the-art summarization systems.We find that simple techniques like filtering training data and re-ranking generated summaries can increase the faithfulness by a substantial margin. PDF 9 2021
What Makes Reading Comprehension Questions Difficult? Investigating Variation in Passage Sources and Question Types In order for a natural language understanding benchmark to be useful in research, it has to consist of examples that are diverse and difficult enough to discriminate among current and near-future state-of-the-art systems. However, we do not yet know what kinds of passages and their sources help us collect a variety of challenging examples. In this study, we crowdsource multiple-choice reading comprehension questions for passages taken from seven qualitatively distinct sources, analyzing what attributes of passages contribute to the difficulty and question types of the collected examples. We find that passage source, length, and readability measures do not significantly affect question difficulty. Among seven question types we manually annotate, questions that require numerical reasoning and logical reasoning are relatively difficult but their frequencies depend on the passage sources. These results suggest that when creating a new benchmark dataset, we do not have to use difficult passages but select passage sources carefully so that it has questions that involve linguistic phenomena we are interested in. PDF 9 2021
Making Transformers Solve Compositional Tasks Several studies have reported the inability of Transformer models to generalize compositionally, a key type of generalization in many NLP tasks such as semantic parsing. In this paper we explore the design space of Transformer models showing that the inductive biases given to the model by several design decisions significantly impact compositional generalization. We identified Transformer configurations that generalize compositionally significantly better than previously reported in the literature in a diverse set of compositional tasks, and that achieve state-of-the-art results in a semantic parsing compositional generalization benchmark (COGS), and a string edit operation composition benchmark (PCFG). PDF 9 2021
Investigating Logic Tensor Networks for Neural-Symbolic Argument Mining We present an application of neural-symbolic learning to argument mining.We use Logic Tensor Networks to train neural models to jointly fit the data and satisfy specific domain rules.Our experiments on a corpus of scientific abstracts indicate that including symbolic rules during the training process improves classification performance, compliance with the rules, and robustness of the results. PDF 9 2021
Modeling Future for Neural Machine Translation by Fusing Target Information Sequence-to-sequence Neural Machine Translation (NMT) models have achieved excellent performance. However, the NMT decoder makes predictions based only on the source and the target-side history, ignoring target future information completely when making decisions. To alleviate this problem, we propose a simple and effective {\bf Fu}ture-fused {\bf NMT} model called \textsc{FuNMT}, which introduces a reverse decoder to explicitly model the target future information, and then adopts an agreement mechanism to enable the forward decoder to learn this future information. Empirical studies on multiple benchmarks show that our proposed model significantly improves translation quality. PDF 9 2021
Supervised Relation Classification as Two-way Span-Prediction Most of the current supervised relation classification (RC) algorithms use a single embedding to represent the relation between a pair of entities. We argue that a better approach is to treat the RC task as a Span-Prediction (SP) problem, similar to Question Answering (QA). We present an SP-based system for RC and evaluate its performance compared to the embedding-based system. We demonstrate that by adding a few improvements, the supervised SP objective works significantly better than the standard classification-based objective. We achieve state-of-the-art results on the TACRED, SemEval task 8, and the CRE datasets. PDF 9 2021
A Meta-framework for Spatiotemporal Quantity Extraction from Text News events are often associated with quantities (e.g., the number of COVID-19 patients or the number of arrests in a protest), and it is often important to extract their type, time, and location from unstructured text in order to analyze these quantity events. This paper thus formulates the NLP problem of spatiotemporal quantity extraction, and proposes the first meta-framework for solving it. This meta-framework contains a formalism that decomposes the problem into several information extraction tasks, a shareable crowdsourcing pipeline, and transformer-based baseline models. We demonstrate the meta-framework in three domains---the COVID-19 pandemic, Black Lives Matter protests, and 2020 California wildfires---to show that the formalism is general and extensible, the crowdsourcing pipeline facilitates fast and high-quality data annotation, and the baseline system can handle spatiotemporal quantity extraction well enough to be practically useful. All resources of this paper will be released for future research on this topic. PDF 9 2021
Gender Roles from Word Embeddings in a Century of Children’s Books When presenting content to children, educators and parents not only want to know whether characters of different backgrounds are represented; they also want to understand how these characters are depicted. In this paper, we measure the gender portrayal of central domains of social life as depicted in highly influential children's books using word co-occurrence and word embeddings. We find that females are more likely than males to be associated with words related both to family and appearance, while males are more associated with business-related words. The gender associations with appearance and business have endured over time, whereas family word associations have become more gender-neutral. We make two main contributions: one, we create a word embeddings data set, StoryWords 1.0, of 100 years of award-winning children's literature, and two, we show inequality in the portrayal of gender in this literature, which in turn may convey messages to children about differential roles in society. We include our code and models as supplemental data associated with this manuscript. PDF 9 2021
Commonsense Knowledge-Augmented Pretrained Language Models for Causal Reasoning Classification Commonsense knowledge can be leveraged for identifying causal relations in text. In this work, we convert triples in ATOMIC2020, a wide coverage commonsense reasoning knowledge graph, to natural language text and continually pretrain a BERT pretrained language model. We evaluate the resulting model on answering commonsense reasoning questions. Our results show that a continually pretrained language model augmented with commonsense reasoning knowledge outperforms our baseline on two commonsense causal reasoning benchmarks, COPA and BCOPA-CE, without additional improvement on the base model or using quality-enhanced data for fine-tuning. PDF 9 2021
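Verbalizing knowledge-graph triples for continued pretraining can be as simple as template filling. The templates below are hypothetical stand-ins for a few ATOMIC-style relations, not the paper's actual template set.

```python
# Hypothetical templates for verbalizing ATOMIC-style relations.
TEMPLATES = {
    "xNeed":   "{head}. Before that, PersonX needed {tail}.",
    "xIntent": "{head}. PersonX did this because {tail}.",
    "xEffect": "{head}. As a result, {tail}.",
}

def triple_to_text(head, relation, tail):
    """Convert a (head, relation, tail) commonsense triple into a
    sentence usable for continued pretraining of a masked LM."""
    return TEMPLATES[relation].format(head=head, tail=tail)

print(triple_to_text("PersonX makes a cake", "xNeed", "to buy ingredients"))
```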
Does BERT really agree ? Fine-grained Analysis of Lexical Dependence on a Syntactic Task Although transformer-based Neural Language Models obtain impressive results on a wide variety of tasks, their generalization abilities are not well understood. They have been shown to perform strongly on subject-verb number agreement in a wide array of settings, suggesting that they learned to capture syntactic dependencies during their training even without explicit supervision. In this paper, we examine the extent to which BERT relies on lexical content to solve the number agreement (NA) task. To do so, we disrupt the lexical patterns found in naturally occurring stimuli in a novel fine-grained analysis of BERT's behavior. Our results on nonce sentences suggest that the model generalizes well for simple structures, but fails to perform lexically-independent syntactic generalization when as little as one attractor is present. PDF 9 2021
Defending Textual Neural Networks against Black-Box Adversarial Attacks with Stochastic Multi-Expert Patcher Even though several methods have been proposed to defend textual neural network (NN) models against black-box adversarial attacks, they often defend against a specific text perturbation strategy and/or require re-training the models from scratch. This leads to a lack of generalization in practice and redundant computation. In particular, the state-of-the-art transformer models (e.g., BERT, RoBERTa) require great time and computation resources. To address these limitations, we borrow an idea from software engineering and propose a novel algorithm, SHIELD, which modifies and re-trains only the last layer of a textual NN, and thus it "patches" and "transforms" the NN into a stochastic weighted ensemble of multi-expert prediction heads. Considering that most current black-box attacks rely on iterative search mechanisms to optimize their adversarial perturbations, SHIELD confuses the attackers by automatically utilizing different weighted ensembles of predictors depending on the input. In other words, SHIELD breaks a fundamental assumption of the attack: that the victim NN model remains constant during an attack. By conducting comprehensive experiments, we demonstrate that all of CNN, RNN, BERT, and RoBERTa-based textual NNs, once patched by SHIELD, exhibit a relative enhancement of 15%--70% in accuracy on average against 14 different black-box attacks, outperforming 6 defensive baselines across 3 public datasets. All codes are to be released. PDF 9 2021
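A patched last layer of this kind might look as follows. This is a schematic sketch, not the released SHIELD code: the expert count, gating network, and Gumbel-softmax sampling are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StochasticMultiExpertHead(nn.Module):
    """Several prediction heads whose mixture weights are sampled per
    input, so the effective model an attacker queries keeps changing."""
    def __init__(self, d_model, n_classes, n_experts=5, tau=1.0):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Linear(d_model, n_classes) for _ in range(n_experts))
        self.gate = nn.Linear(d_model, n_experts)
        self.tau = tau

    def forward(self, h):                                   # h: (batch, d_model)
        # Gumbel-softmax keeps the stochastic weighting differentiable.
        w = F.gumbel_softmax(self.gate(h), tau=self.tau)    # (batch, n_experts)
        logits = torch.stack([e(h) for e in self.experts], dim=-1)  # (B, C, E)
        return (logits * w.unsqueeze(1)).sum(dim=-1)

head = StochasticMultiExpertHead(d_model=768, n_classes=2)
print(head(torch.randn(4, 768)).shape)                      # torch.Size([4, 2])
```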
Measuring Fairness of Text Classifiers via Prediction Sensitivity With the rapid growth in language processing applications, fairness has emerged as an important consideration in data-driven solutions. Although various fairness definitions have been explored in the recent literature, there is a lack of consensus on which metrics most accurately reflect the fairness of a system. In this work, we propose a new formulation -- accumulated prediction sensitivity -- which measures fairness in machine learning models based on the model's prediction sensitivity to perturbations in input features. The metric attempts to quantify the extent to which a single prediction depends on a protected attribute, where the protected attribute encodes the membership status of an individual in a protected group. We show that the metric can be theoretically linked with a specific notion of group fairness (statistical parity) and individual fairness. It also correlates well with humans' perception of fairness. We conduct experiments on two text classification datasets -- Jigsaw Toxicity and Bias in Bios -- and evaluate the correlations between metrics and manual annotations on whether the model produced a fair outcome. We observe that the proposed fairness metric based on prediction sensitivity is statistically significantly more correlated with human annotation than the existing counterfactual fairness metric. PDF 9 2021
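Read at its simplest, prediction sensitivity is a gradient magnitude. The sketch below assumes the protected attribute is a single input feature and that the accumulated metric averages this quantity over a dataset; the paper's full formulation may weight the perturbations differently.

```python
import torch

def prediction_sensitivity(model, x, protected_idx):
    """Gradient-based proxy for how much the top prediction of model(x)
    depends on the protected input feature at protected_idx.
    x: a single input example with feature dimension last."""
    x = x.clone().detach().requires_grad_(True)
    prob = model(x).softmax(dim=-1).max()     # confidence of the top class
    prob.backward()
    return x.grad[..., protected_idx].abs().mean().item()
```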
DiS-ReX: A Multilingual Dataset for Distantly Supervised Relation Extraction Our goal is to study the novel task of distant supervision for multilingual relation extraction (Multi DS-RE). Research in Multi DS-RE has remained limited due to the absence of a reliable benchmarking dataset. The only available dataset for this task, RELX-Distant (Köksal and Özgür, 2020), displays several unrealistic characteristics, leading to a systematic overestimation of model performance. To alleviate these concerns, we release a new benchmark dataset for the task, named DiS-ReX. We also modify the widely-used bag attention models using an mBERT encoder and provide the first baseline results on the proposed task. We show that DiS-ReX serves as a more challenging dataset than RELX-Distant, leaving ample room for future research in this domain. PDF 9 2021
GradMask: Gradient-Guided Token Masking for Textual Adversarial Example Detection We present a simple model-agnostic textual adversarial example detection scheme called GradMask. It uses gradient signals to detect adversarially perturbed tokens in an input sequence and occludes such tokens via a masking process. GradMask provides several advantages over existing methods, including lower computational cost, improved detection performance, and a weak form of interpretability for its decisions. Extensive evaluations on widely adopted natural language processing benchmark datasets demonstrate the efficiency and effectiveness of GradMask. PDF 9 2021
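A gradient-guided token scoring step of this kind can be sketched in a few lines; the loss choice and top-k selection here are illustrative assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def gradmask_candidates(model, embeddings, label, k=3):
    """Rank tokens by the gradient norm of the loss w.r.t. their
    embeddings; the top-k positions are candidates to occlude with a
    mask token. embeddings: (1, seq_len, dim) input word embeddings."""
    embeddings = embeddings.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(embeddings), label)
    (grad,) = torch.autograd.grad(loss, embeddings)
    saliency = grad.norm(dim=-1).squeeze(0)   # per-token gradient magnitude
    return saliency.topk(k).indices           # likely perturbed positions
```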
Curriculum Data Augmentation for Low-Resource Slides Summarization Data augmentation is commonly used for training in low-resource scenarios. However, there is sometimes a large discrepancy between the distributions of the augmented data and the target data. How can we bridge the gap between the augmented and target data, especially when the target data is harder to learn? In this paper, we study improved data augmentation strategies in the scenario of scientific slides text summarization, where we generate a textual summary based on the text of presentation slides. Since slides are messy and difficult for current models to understand, we introduce an easier form of data, i.e., articles in natural language. The basic idea is that we generate transition data between slides and articles, and all three of them form a curriculum for neural models to learn the distribution transition from article data to slides data. We find that our approach achieves consistent improvements over different backbone summarization models. The curriculum-oriented data augmentation method can generate data that fill the gap between the easy-to-obtain data and the low-resource task data. We show that curriculum learning and data augmentation can be combined to help NLP models learn from otherwise hard-to-learn data. PDF 10 2021
Multi-Task End-to-End Training Improves Conversational Recommendation In this paper, we analyze the performance of a multitask end-to-end transformer model on the task of conversational recommendations, which aim to provide recommendations based on a user’s explicit preferences expressed in dialogue. While previous works in this area adopt complex multi-component approaches where the dialogue generation and entity recommendation tasks are handled by separate components, we show that a unified transformer model, based on the T5 text-to-text transformer model, can perform competitively in both recommending relevant items and generating conversation dialogue. We fine-tune our model on the ReDIAL conversational movie recommendation dataset, and create additional training tasks derived from MovieLens (such as the prediction of movie attributes and related movies based on an input movie), in a multitask learning setting. Using a series of probe studies, we demonstrate that the learned knowledge in the additional tasks is transferred to the conversational setting, where each task leads to a $9\% - 52\%$ increase in its related probe score. PDF 10 2021
SPE: Symmetrical Prompt Enhancement for Factual Knowledge Retrieval Pretrained language models (PLMs) have been shown to accumulate factual knowledge from their unsupervised pretraining procedures (Petroni et al., 2019). Prompting is an effective way to query such knowledge from PLMs. Recently, continuous prompt methods have been shown to have a larger potential than discrete prompt methods in generating effective queries (Liu et al., 2021a). However, these methods do not consider symmetry of the task. In this work, we propose Symmetrical Prompt Enhancement (SPE), a continuous prompt-based method for fact retrieval that leverages the symmetry of the task. Our results on LAMA, a popular fact retrieval dataset, show significant improvement of SPE over previous prompt methods. PDF 10 2021
Dataset Geography: Mapping Language Data to Language Users As language technologies become more ubiquitous, there are increasing efforts towards expanding the language diversity and coverage of natural language processing (NLP) systems. Arguably, the most important factor influencing the quality of modern NLP systems is data availability. In this work, we study the geographical representativeness of NLP datasets, aiming to quantify if and by how much NLP datasets match the expected needs of the language speakers. In doing so, we use entity recognition and linking systems, also making important observations about their cross-lingual consistency and giving suggestions for more robust evaluation. Last, we explore some geographical and economic factors that may explain the observed dataset distributions. PDF 10 2021
GCPG: A General Framework for Controllable Paraphrase Generation Controllable paraphrase generation (CPG) incorporates various external conditions to obtain desirable paraphrases. However, existing works each highlight only a single condition within the two indispensable aspects of CPG (i.e., lexical and syntactical CPG), lacking a unified setting in which to explore and analyze their effectiveness. In this paper, we propose a general controllable paraphrase generation framework (GCPG), which represents both lexical and syntactical conditions as text sequences and uniformly processes them in an encoder-decoder paradigm. Under GCPG, we reconstruct the commonly adopted lexical condition (i.e., Keywords) and syntactical conditions (i.e., Part-Of-Speech sequence, Constituent Tree, Masked Template and Sentential Exemplar) and study the combination of the two types. In particular, for the Sentential Exemplar condition, we propose a novel exemplar construction method --- Syntax-Similarity based Exemplar (SSE). SSE retrieves a syntactically similar but lexically different sentence as the exemplar for each target sentence, avoiding the problem of copying words from the exemplar. Extensive experiments demonstrate that GCPG with SSE achieves state-of-the-art performance on two popular benchmarks. In addition, the combination of lexical and syntactical conditions demonstrates significant controllability in paraphrase generation, and these empirical results could provide novel insight into user-oriented paraphrasing. PDF 10 2021
About Time: Do Transformers Learn Temporal Verbal Aspect? Aspect is a linguistic concept that describes how an action, event, or state of a verb phrase is situated in time. In this paper, we explore whether different transformer models are capable of identifying aspectual features. We focus on two specific aspectual features: telicity and duration. Telicity marks whether the verb's action or state has an endpoint or not (telic/atelic), and duration denotes whether a verb expresses an action (dynamic) or a state (stative). These features are integral to the interpretation of natural language, but also hard to annotate and identify with NLP methods. Our results show that transformer models adequately capture information on telicity and duration in their vectors, even in their pretrained forms, but are somewhat biased with regard to verb tense and word order. PDF 10 2021
Semantic Search as Extractive Paraphrase Span Detection In this paper, we approach the problem of semantic search by framing the search task as paraphrase span detection, i.e. given a segment of text as a query phrase, the task is to identify its paraphrase in a given document, the same modelling setup as typically used in extractive question answering. On the Turku Paraphrase Corpus of 100,000 manually extracted Finnish paraphrase pairs including their original document context, we find that our paraphrase span detection model outperforms two strong retrieval baselines (lexical similarity and BERT sentence embeddings) by 31.9pp and 22.4pp respectively in terms of exact match, and by 22.3pp and 12.9pp in terms of token-level F-score. This demonstrates a strong advantage of modelling the task in terms of span retrieval, rather than sentence similarity. Additionally, we introduce a method for creating artificial paraphrase data through back-translation, suitable for languages where manually annotated paraphrase resources for training the span detection model are not available. PDF 10 2021
A Novel Metric for Evaluating Semantics Preservation In this paper, we leverage pre-trained language models (PLMs) to precisely evaluate how well sentence-editing processes preserve semantics. Our metric, Neighboring Distribution Divergence (NDD), evaluates the disturbance to the predicted distributions of neighboring words under a masked language model (MLM). NDD is capable of detecting precise changes in semantics that are easily ignored by text similarity measures. By exploiting this property of NDD, we implement an unsupervised and even training-free algorithm for extractive sentence compression. We show that the NDD-based algorithm outperforms the previous perplexity-based unsupervised algorithm by a large margin. For further exploration of interpretability, we evaluate NDD by pruning on syntactic dependency treebanks and apply NDD to predicate detection as well. PDF 10 2021
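A sketch of how an NDD-style score could be computed with an off-the-shelf masked LM: mask each neighboring position in turn, and sum the KL divergences between the predicted distributions before and after the edit. The choice of model, positions, and aggregation are assumptions, not the paper's exact recipe.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

name = "bert-base-uncased"
tok = AutoTokenizer.from_pretrained(name)
mlm = AutoModelForMaskedLM.from_pretrained(name).eval()

def mlm_log_dist(input_ids, pos):
    # Log-distribution the MLM predicts when position `pos` is masked.
    ids = input_ids.clone()
    ids[0, pos] = tok.mask_token_id
    with torch.no_grad():
        logits = mlm(ids).logits[0, pos]
    return torch.log_softmax(logits, dim=-1)

def ndd(sent_a, sent_b, neighbor_positions):
    # Sum of KL(P_a || P_b) over positions shared by both sentences.
    a = tok(sent_a, return_tensors="pt").input_ids
    b = tok(sent_b, return_tensors="pt").input_ids
    total = 0.0
    for p in neighbor_positions:
        la, lb = mlm_log_dist(a, p), mlm_log_dist(b, p)
        total += torch.sum(la.exp() * (la - lb)).item()
    return total

# Edit replaces "sat" with "slept"; score the disturbance at its neighbors.
print(ndd("the cat sat on the mat", "the cat slept on the mat", [1, 2, 4, 5]))
```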
Measuring and Improving Compositional Generalization in Text-to-SQL via Component Alignment Recently, the challenge of compositional generalization in NLP has attracted more and more attention. Specifically, many prior works show that neural networks struggle with compositional generalization where training and testing distributions differ. However, most of these works are based on word-level synthetic data or a specific data split method to generate compositional biases. In this work, we propose a clause-level compositional example generation method, and we focus on text-to-SQL tasks. We start by splitting the sentences in the Spider text-to-SQL dataset into several sub-sentences, annotating each sub-sentence with its corresponding SQL clause, resulting in a new dataset, Spider-SS. Building upon Spider-SS, we further construct a new dataset named Spider-CG, by substituting and appending Spider-SS sub-sentences to test the ability of models to generalize compositionally. Experiments show that previous models suffer significant performance degradation when evaluated on Spider-CG, even though every sub-sentence has been seen during training. To deal with this problem, we modify the RATSQL+GAP model to fit the segmented data of Spider-SS, and results show that this method can improve generalization performance. PDF 10 2021
Joint Content-Context Analysis of Scientific Publications: Identifying Opportunities for Collaboration in Cognitive Science This work studies publications in cognitive science and utilizes natural language processing and graph theoretical techniques to connect the analysis of the papers' content (abstracts) to the context (citation, journals). We apply hierarchical topic modeling on the abstracts and community detection algorithms on the citation network, and measure content-context discrepancy to find academic fields that study similar topics but do not cite each other or publish in the same venues. These results show a promising, systemic framework to identify opportunities for scientific collaboration in highly interdisciplinary fields such as cognitive science and machine learning. PDF 10 2021
TableFormer: Robust Transformer Modeling for Table-Text Encoding Understanding tables is an important aspect of natural language understanding. Existing models for table understanding require linearization of the table structure, where row or column order is encoded as an unwanted bias. Such spurious biases make the model vulnerable to row and column order perturbations. Additionally, prior work has not explicitly modeled the table structure, hindering the table-text modeling ability. In this work, we propose a robust and structurally aware table-text encoding architecture, TableFormer, where tabular structural biases are incorporated completely through learnable attention biases. TableFormer is strictly invariant to row and column orders, and can understand tables better due to its tabular inductive biases. Our evaluations show that TableFormer outperforms strong baselines in all settings on the SQA, WTQ and TabFact table reasoning datasets, and achieves state-of-the-art performance on SQA, especially under answer-invariant row and column order perturbations (a 6% improvement over the best baseline): previous SOTA models' performance drops by 4% - 6% under such perturbations, while TableFormer is unaffected. PDF 10 2021
N-Shot Learning for Augmenting Task-Oriented Dialogue State Tracking We introduce an augmentation framework that utilizes belief state annotations to match turns from various dialogues and forms new synthetic dialogues in a bottom-up manner. Unlike other augmentation strategies, it operates with as few as five examples. Our augmentation strategy yields significant improvements both when adapting a DST model to a new domain and when adapting a language model to the DST task, in evaluations with the TRADE and TOD-BERT models. Further analysis shows that our model performs better on seen values during training, and it is also more robust to unseen values, even though we do not use any external dataset for augmentation. We conclude that exploiting belief state annotations enhances dialogue augmentation and results in improved models in $n$-shot training scenarios. PDF 10 2021
Few-Shot Learning with Siamese Networks and Label Tuning We study the problem of building text classifiers with little or no training data, commonly known as zero- and few-shot text classification. In recent years, an approach based on neural textual entailment models has been found to give strong results on a diverse range of tasks. In this work, we show that with proper pre-training, Siamese networks that embed texts and labels are a competitive alternative. These models allow for a large reduction in inference cost: constant in the number of labels rather than linear. Furthermore, we introduce label tuning, a simple and computationally efficient approach that adapts the models in a few-shot setup by only changing the label embeddings. While giving lower performance than model fine-tuning, this approach has the architectural advantage that a single encoder can be shared by many different tasks. PDF 10 2021
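A minimal sketch of label tuning under stated assumptions: a frozen stand-in encoder and trainable label embeddings scored by cosine similarity. In the real setup `label_emb` would be initialized from a pretrained Siamese encoder's embeddings of the label descriptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
dim, num_labels = 64, 3

encoder = nn.Linear(100, dim)        # stand-in for a frozen Siamese encoder
for p in encoder.parameters():
    p.requires_grad = False

# Only the label embeddings are tuned; the encoder stays shared and frozen.
label_emb = nn.Parameter(torch.randn(num_labels, dim))
opt = torch.optim.Adam([label_emb], lr=1e-2)

def step(x, y, temperature=0.1):
    # Cosine similarity between text and label embeddings as logits.
    logits = F.normalize(encoder(x), dim=-1) @ F.normalize(label_emb, dim=-1).T
    loss = F.cross_entropy(logits / temperature, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

x, y = torch.randn(8, 100), torch.randint(0, num_labels, (8,))
for _ in range(5):
    print(step(x, y))
```

Because texts are embedded independently of the labels, inference cost stays constant in the number of labels, matching the efficiency argument in the abstract.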
D2U: Distance-to-Uniform Learning for Out-of-Scope Detection Supervised models trained for single-label classification tasks with cross-entropy loss are implicitly enforced to produce probability distributions that follow a discrete delta distribution during training. Model predictions at test time are expected to be similar to delta distributions, given that the classifier determines the class of an input correctly. However, the shape of the predicted probability distribution becomes similar to the uniform distribution when the model cannot infer properly. We exploit this observation for detecting out-of-scope (OOS) utterances in conversational systems. Specifically, we propose a zero-shot post-processing step, called Distance-to-Uniform (D2U), exploiting not only the classification confidence score, but the shape of the entire output distribution. We also introduce a learning procedure that uses D2U for loss calculation in the supervised setup. We conduct experiments using six publicly available datasets. Experimental results show that the performance of out-of-scope detection is improved by our post-processing when there is no OOS training data, as well as by the D2U learning procedure when OOS training data is available. PDF 10 2021
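The post-processing step is simple enough to sketch directly; here D2U is illustrated as a KL divergence to the uniform distribution (the paper considers the general distance-to-uniform idea, so this particular distance function is an assumption):

```python
import numpy as np

def d2u_score(probs, eps=1e-12):
    # KL divergence from the predicted distribution to the uniform one:
    # sum p * log(p * k). Near-uniform (low-score) predictions suggest
    # an out-of-scope input.
    k = len(probs)
    return float(np.sum(probs * np.log((probs + eps) * k)))

in_scope = np.array([0.90, 0.05, 0.03, 0.02])
out_of_scope = np.array([0.30, 0.26, 0.24, 0.20])
print(d2u_score(in_scope), d2u_score(out_of_scope))
# The in-scope prediction is far from uniform; the OOS one is near it.
# Flag as OOS when the score falls below a threshold tuned on validation data.
```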
On the Sensitivity and Stability of Model Interpretations Recent years have witnessed the emergence of a variety of post-hoc interpretations that aim to uncover how natural language processing (NLP) models make predictions. Despite the surge of new interpretation methods, it remains an open problem how to define and quantitatively measure the faithfulness of interpretations, i.e., to what extent interpretations reflect the reasoning process of a model. We propose two new criteria, sensitivity and stability, that provide complementary notions of faithfulness to the existing removal-based criteria. Our results show that conclusions about how faithful interpretations are can vary substantially across these different notions. Motivated by the desiderata of sensitivity and stability, we introduce a new class of interpretation methods that adopt techniques from adversarial robustness. Empirical results show that our proposed methods are effective under the new criteria and overcome limitations of gradient-based methods on removal-based criteria. Besides text classification, we also apply interpretation methods and metrics to dependency parsing. Our results shed light on understanding the diverse set of interpretations. PDF 10 2021
Multi-Label Text Classification by Graph Neural Network with Mixing Operations Multi-label text classification is one of the fundamental tasks in natural language processing. Recently, graph convolution networks (GCN) have been leveraged to boost the performance of this task. However, the best way to model label correlations and to learn features with awareness of the label system is still unclear. This paper proposes Mix-GCN, a graph network with two mixing operations, which improves the conventional GCN framework for multi-label text classification in the following two steps. Firstly, we model the label correlations by mixing the graph built from statistical co-occurrence information and the graph constructed from prior knowledge. Secondly, we propose a mixing operation to continuously inject GCN embeddings into LSTM representation learning for better label-aware representation. Experimental results on four benchmarks demonstrate that Mix-GCN significantly outperforms the state-of-the-art models and performs better in long-tail label cases. PDF 10 2021
PPT: Pre-trained Prompt Tuning for Few-shot Learning Prompts for pre-trained language models (PLMs) have shown remarkable performance by bridging the gap between pre-training tasks and various downstream tasks. Among these methods, prompt tuning, which freezes PLMs and only tunes soft prompts, provides an efficient and effective solution for adapting large-scale PLMs to downstream tasks. However, prompt tuning is yet to be fully explored. In our pilot experiments, we find that prompt tuning performs comparably with conventional full-model tuning when downstream data are sufficient, whereas it is much worse under few-shot learning settings, which may hinder the application of prompt tuning. We attribute this low performance to the manner of initializing soft prompts. Therefore, in this work, we propose to pre-train prompts by adding soft prompts into the pre-training stage to obtain a better initialization. We name this Pre-trained Prompt Tuning framework "PPT". To ensure the generalization of PPT, we formulate similar classification tasks into a unified task form and pre-train soft prompts for this unified task. Extensive experiments show that tuning pre-trained prompts for downstream tasks can reach or even outperform full-model fine-tuning under both full-data and few-shot settings. Our approach is effective and efficient for using large-scale PLMs in practice. PDF 10 2021
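A sketch of the underlying soft-prompt mechanics, with a toy frozen encoder standing in for the PLM: only the prepended prompt vectors are trainable. PPT's actual contribution, pre-training this prompt on a unified task before downstream adaptation, is noted in the comments but not reproduced here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim, prompt_len, vocab = 64, 8, 1000

# Stand-ins for a frozen PLM's embedding table and encoder body.
embed = nn.Embedding(vocab, dim)
body = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
for module in (embed, body):
    for p in module.parameters():
        p.requires_grad = False

# Only the soft prompt is trainable. PPT's idea is to pre-train these
# vectors on a unified task, then use them to initialize prompt tuning
# for each downstream task instead of a random initialization.
soft_prompt = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)

def forward(token_ids):
    x = embed(token_ids)                                    # (batch, seq, dim)
    prompt = soft_prompt.unsqueeze(0).expand(x.size(0), -1, -1)
    return body(torch.cat([prompt, x], dim=1))              # prompt prepended

out = forward(torch.randint(0, vocab, (2, 16)))
print(out.shape)  # torch.Size([2, 24, 64])
```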
Re-evaluating Extreme Multi-label Text Classification Methods in Tail Label Prediction Extreme multi-label text classification (XMTC) is the task of tagging each document with the relevant labels in a very large set of predefined category labels. The most challenging part of the problem is due to a highly skewed label distribution where the majority of the categories (namely the tail labels) have very few training instances. Recent benchmark evaluations have focused on micro-averaging metrics, where the performance on tail labels can be easily overshadowed by that on the high-frequency labels (namely the head labels). This paper presents a re-evaluation of state-of-the-art (SOTA) methods based on the binned macro-averaging F1 instead, revealing new insights into the strengths and weaknesses of representative methods, especially in tail label prediction. PDF 10 2021
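A minimal sketch of binned macro-averaging F1, assuming a multilabel indicator format and a hypothetical label-frequency mapping; the paper's exact bin boundaries may differ.

```python
import numpy as np
from sklearn.metrics import f1_score

def binned_macro_f1(Y_true, Y_pred, label_freq,
                    bins=((1, 10), (11, 100), (101, 10**9))):
    # Macro-F1 computed separately per frequency bin, so tail labels are
    # not drowned out by head labels; bin boundaries here are illustrative.
    out = {}
    for lo, hi in bins:
        cols = [c for c, f in label_freq.items() if lo <= f <= hi]
        if cols:
            out[(lo, hi)] = f1_score(Y_true, Y_pred, labels=cols,
                                     average="macro", zero_division=0)
    return out

Y_true = np.array([[1, 0, 1], [0, 1, 0]])   # multilabel indicator format
Y_pred = np.array([[1, 0, 0], [0, 1, 0]])
# Hypothetical training frequencies: label 0 is head, label 2 is tail.
print(binned_macro_f1(Y_true, Y_pred, label_freq={0: 500, 1: 50, 2: 3}))
```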
Exploring the Impact of Negative Samples of Contrastive Learning: A Case Study of Sentence Embedding Contrastive learning is emerging as a powerful self-supervised technique for extracting knowledge from unlabeled image and text data. This technique requires a balanced mixture of two ingredients: positive (similar) and negative (dissimilar) samples. This is typically achieved by maintaining a queue of negative samples during training. Prior works in the area typically use a fixed-length negative sample queue, but how the negative sample size affects model performance remains unclear. This opaque impact of the number of negative samples on performance motivated our in-depth exploration. This paper presents a momentum contrastive learning model with a negative sample queue for sentence embedding, namely MoCoSE. We add a prediction layer to the online branch to make the model asymmetric, which, together with the EMA update mechanism of the target branch, prevents the model from collapsing. We define a maximum traceable distance metric, through which we learn to what extent text contrastive learning benefits from the historical information of negative samples. Our experiments find that the best results are obtained when the maximum traceable distance is within a certain range, demonstrating that there is an optimal range of historical information for a negative sample queue. We evaluate the proposed unsupervised MoCoSE on the semantic text similarity (STS) task and obtain an average Spearman's correlation of $77.27\%$. Source code is available at https://anonymous.4open.science/r/mocose-3E3C. PDF 10 2021
Ditch the Gold Standard: Re-evaluating Conversational Question Answering Conversational question answering (CQA) systems aim to provide natural-language answers to users in information-seeking conversations. Existing benchmarks compare CQA models on pre-collected human-human conversations, with ground-truth answers provided in conversational history. It remains unclear whether we can rely on this static evaluation for model development, or current systems can well generalize to real-world human-machine conversations. In this work, we conduct the first large-scale human evaluation of state-of-the-art CQA systems, where human evaluators converse with models and judge the correctness of their answers. We find that the distribution of human-machine conversations drastically differs from that of human-human conversations, and evaluations using gold answers are inconsistent with human evaluations. We further investigate how to improve automatic evaluations and propose a question rewriting mechanism based on predicted history, which better correlates with human judgments. Finally, we analyze the impact of various modeling strategies. We hope that our findings can shed light on how to develop better CQA systems in the future. PDF 10 2021
BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models We show that with small-to-medium training data, fine-tuning only the bias terms (or a subset of the bias terms) of pre-trained BERT models is competitive with (and sometimes better than) fine-tuning the entire model. For larger data, bias-only fine-tuning is competitive with other sparse fine-tuning methods.Besides their practical utility, these findings are relevant for the question of understanding the commonly-used process of finetuning: they support the hypothesis that finetuning is mainly about exposing knowledge induced by language-modeling training, rather than learning new task-specific linguistic knowledge. PDF 10 2021
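BitFit reduces to a few lines in PyTorch; the sketch below freezes everything except bias terms (and, as is common in practice, the task head). The model choice and learning rate are illustrative, not the paper's exact configuration.

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# BitFit: unfreeze only bias terms (plus the newly initialized task head).
for name, param in model.named_parameters():
    param.requires_grad = ("bias" in name) or name.startswith("classifier")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```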
A Federated Approach to Predict Emojis in Hindi Tweets The use of emojis provides a way to add a visual modality to textual communication. The task of predicting emojis, however, poses a challenge for computational approaches, as emoji use tends to cluster into the frequently used and the rarely used emojis. Much of the research on emoji use has focused on high-resource languages and conceptualised the task of predicting emojis around traditional server-side machine learning approaches, which can introduce privacy concerns, as user data is transmitted to central storage. In this paper, we provide a benchmark dataset of $118$k tweets for emoji prediction in Hindi. Specifically, we show that a privacy-preserving approach, Federated Learning, exhibits comparable performance to traditional server-side transformer models. PDF 10 2021
Safety Bench: Identifying Safety-Sensitive Situations for Open-domain Conversational Systems The social impact of natural language processing and its applications has received increasing attention. Here, we focus on the problem of safety for end-to-end conversational AI. We survey the problem landscape therein, introducing a taxonomy of three observed phenomena: the Instigator, Yea-Sayer, and Impostor effects. To help researchers better understand the impact of their conversational models with respect to these scenarios, we present Safety Bench, a set of open-source tooling for quickly assessing safety issues. Finally, we provide extensive analysis of these tools using five popular models and make recommendations for future use. PDF 10 2021
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena We propose VALSE (Vision And Language Structured Evaluation), a novel benchmark designed for testing general-purpose pretrained vision and language (V&L) models for their visio-linguistic grounding capabilities on specific linguistic phenomena. VALSE offers a suite of six tests covering various linguistic constructs. Solving these requires models to ground linguistic phenomena in the visual modality, allowing more fine-grained evaluations than hitherto possible. We build VALSE using methods that support the construction of valid foils, and report results from evaluating five widely-used V&L models. Our experiments suggest that current models have considerable difficulty addressing most phenomena. Hence, we expect VALSE to serve as an important benchmark to measure future progress of pretrained V&L models from a linguistic perspective, complementing the canonical task-centred V&L evaluations. PDF 10 2021
Know your tools well: Better $\textit{and}$ faster QA with synthetic examples Synthetic training data---commonly used to augment human-labeled examples in supervised learning---are often noisy, but can be generated in very large quantities and diversity. This paper proposes to leverage these unique attributes in a targeted manner to maximize the utility of synthetic examples. Via two novel applications that utilize synthetic data for targeted pre-training and knowledge distillation, we demonstrate the feasibility of this idea for machine reading comprehension (MRC). Using our proposed methods, we are able to train simultaneously $\textbf{\textit{smaller}}$, $\textbf{\textit{faster}}$ and $\textbf{\textit{more accurate}}$ MRC models than existing synthetic augmentation methods. Our methods are generic in nature and can be applied to any task for which synthetic data can be generated. PDF 10 2021
3M: Multi-document Summarization Considering Main and Minor Relationship The multi-document summarization task is an important branch of the information aggregation task. Compared with single-document summarization, the input of multi-document summarization is much longer and its logic is more complicated. This article proposes a hypothesis: taking the content of one document as the main body and the content of the other documents as auxiliary information, a summary that combines all the information in the document collection can be generated. Based on this assumption, a multi-document summarization model can select one main document and then combine the information of the other documents for summary generation. This paper combines CopyTransformer and Maximal Marginal Relevance (MMR) to design the Multi-document summarization considering Main and Minor relationship model (3M). Empirical results on the Multi-News and DUC 2004 datasets show that 3M brings substantial improvements over several strong baselines, and manual evaluation shows that the generated abstracts are fluent and better express the content of the main document. In addition, by selecting different main documents, 3M can generate multiple abstracts with different styles for a set of documents. PDF 10 2021
Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics How reliably an automatic summarization evaluation metric replicates human judgments of summary quality is quantified by system-level correlations. We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice and propose changes to rectify this disconnect. First, we calculate the system score for an automatic metric using the full test set instead of the subset of summaries annotated by humans, which is currently standard practice. We demonstrate how this small change leads to more precise estimates of system-level correlations. Second, we propose to calculate correlations only on pairs of systems which are separated by differences in automatic scores that are commonly used to argue one system is of higher quality. This allows us to demonstrate that our best estimate of the correlation of ROUGE to human judgments is near 0 in realistic scenarios. Finally, the results from both analyses point to the need for future research to focus on developing more consistent and reliable human evaluations of summaries. PDF 10 2021
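A sketch of the first proposed change, using hypothetical scores: average the automatic metric over the full test set rather than only the human-annotated subset before correlating system ranks.

```python
import numpy as np
from scipy.stats import kendalltau

def system_level_correlation(metric_scores, human_scores):
    # One score per system: the automatic metric is averaged over the FULL
    # test set, while human judgments cover only the annotated subset.
    metric_sys = [np.mean(s) for s in metric_scores]
    human_sys = [np.mean(s) for s in human_scores]
    return kendalltau(metric_sys, human_sys).correlation

rng = np.random.default_rng(0)
# Hypothetical per-example scores for 4 systems of roughly increasing quality:
# 500 metric-scored test examples vs. 50 human-annotated ones per system.
metric_scores = [rng.random(500) + b for b in (0.00, 0.10, 0.12, 0.30)]
human_scores = [rng.random(50) + b for b in (0.00, 0.15, 0.10, 0.30)]
print(system_level_correlation(metric_scores, human_scores))
```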
Fully Hyperbolic Neural Networks Hyperbolic neural networks have shown great potential for modeling complex data. However, existing hyperbolic networks are not completely hyperbolic, as they encode features in the hyperbolic space yet formalize most of their operations in the tangent space (a Euclidean subspace) at the origin of the hyperbolic model. This hybrid method greatly limits the modeling ability of networks. In this paper, we propose a fully hyperbolic framework to build hyperbolic networks based on the Lorentz model by adapting the Lorentz transformations (including boost and rotation) to formalize essential operations of neural networks. Moreover, we also prove that linear transformation in tangent spaces used by existing hyperbolic networks is a relaxation of the Lorentz rotation and does not include the boost, implicitly limiting the capabilities of existing hyperbolic networks. The experimental results on four NLP tasks show that our method has better performance for building both shallow and deep networks. Our code will be released to facilitate follow-up research. PDF 10 2021
Knowledge Inheritance for Pre-trained Language Models Recent explorations of large-scale pre-trained language models (PLMs) such as GPT-3 have revealed the power of PLMs with huge amounts of parameters, setting off a wave of training ever-larger PLMs. However, training a large-scale PLM requires tremendous amounts of computational resources, which is time-consuming and expensive. In addition, existing large-scale PLMs are mainly trained from scratch individually, ignoring the availability of many existing well-trained PLMs. To this end, we explore the question of how previously trained PLMs can benefit the training of larger PLMs in the future. Specifically, we introduce a novel pre-training framework named "knowledge inheritance" (KI), which combines both self-learning and teacher-guided learning to efficiently train larger PLMs. Experimental results demonstrate the superiority of our KI framework. We also conduct empirical analyses to explore the effects of teacher PLMs' pre-training settings, including model architecture, pre-training data, etc. Finally, we show that KI can well support lifelong learning and knowledge transfer. All source code and model parameters will be made available to advance further research explorations. PDF 10 2021
Echo-Attention: Attend Once and Get $N$ Attentions for Free This paper proposes echo-attention layers, an efficient method for improving the expressiveness of self-attention layers without incurring significant parameter or training time costs. The key idea is to iteratively refine the attentional activations via stateful repeated computation, i.e., we compute the activations once and get $N$ refinements (echo-attentions) at a relatively cheap cost. To this end, we introduce an update and state transition function that operates over these attentional activations. Via a set of extensive experiments, we show that the proposed Echoformer model demonstrates widespread benefits across 21 datasets including language modeling, machine translation, language understanding and question answering. PDF 10 2021
Finding the Right Recipe for Low Resource Domain Adaptation in Neural Machine Translation Despite the considerable amount of parallel data used to train neural machine translation models, they can still struggle to generate fluent translations in technical domains. In-domain parallel data is often very low resource, and synthetic domain data generated via back-translation is frequently of lower quality. To guide machine translation practitioners and characterize the effectiveness of domain adaptation methods under different data availability scenarios, we conduct an in-depth empirical exploration of monolingual and parallel data approaches to domain adaptation. We compare mixed domain fine-tuning, traditional back-translation, tagged back-translation, and shallow fusion with domain-specific language models, in isolation and in combination. We study method effectiveness in very low resource (8k parallel examples) and moderately low resource (46k parallel examples) conditions. We demonstrate the advantages of augmenting clean in-domain parallel data with noisy mined in-domain parallel data and propose an ensemble approach to alleviate reductions in original domain translation quality. Our work covers three domains: consumer electronics, clinical, and biomedical, and spans four language pairs: Zh-En, Ja-En, Es-En, and Ru-En. We make concrete recommendations for achieving high in-domain performance. We release our consumer electronics and clinical domain datasets for all languages and make our code publicly available. PDF 10 2021
On the current state of reproducibility and reporting of uncertainty for Aspect-based Sentiment Analysis For the latter part of the past decade, Aspect-Based Sentiment Analysis has been a field of great interest within Natural Language Processing. Supported by the Semantic Evaluation Conferences in 2014 -- 2016, a variety of methods have been developed, competing to improve performance on benchmark data sets. Exploiting the transformer architecture behind BERT, results improved rapidly, and efforts in this direction still continue today. Our contribution to this body of research is a holistic comparison of six different architectures which achieved (near) state-of-the-art results at some point in time. We utilize a broad spectrum of five benchmark data sets and introduce a fixed setting with respect to the pre-processing, the train/validation splits, the performance measures and the quantification of uncertainty. Overall, our findings are two-fold: First, we find that the results reported in the scientific articles are hardly reproducible, since in our experiments the observed performance (most of the time) fell short of the reported one. Second, the results are burdened with notable uncertainty (depending on the data splits), which is why reporting uncertainty measures is crucial. PDF 10 2021
A Recipe for Arbitrary Text Style Transfer with Large Language Models In this paper, we leverage large language models (LLMs) to perform zero-shot text style transfer. We present a prompting method that we call augmented zero-shot learning, which frames style transfer as a sentence rewriting task and requires only a natural language instruction, without model fine-tuning or exemplars in the target style. Augmented zero-shot learning is simple and demonstrates promising results not just on standard style transfer tasks such as sentiment, but also on arbitrary transformations such as 'make this melodramatic' or 'insert a metaphor.' PDF 10 2021
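A hypothetical prompt builder in the spirit of augmented zero-shot learning; the template wording and the in-context rewrite demonstration are illustrative, not the paper's exact prompt.

```python
def augmented_zero_shot_prompt(text, instruction="more melodramatic"):
    # One generic rewrite demonstration on an unrelated sentence, followed
    # by the actual request; the LLM continues after the opening brace.
    return (
        "Here is some text: {I was surprised.} "
        "Here is a rewrite of the text, which is more formal: "
        "{I found this outcome rather unexpected.}\n"
        f"Here is some text: {{{text}}} "
        f"Here is a rewrite of the text, which is {instruction}: {{"
    )

print(augmented_zero_shot_prompt("The food was delicious."))
```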
Decomposing Natural Logic Inferences in Neural NLI In the interest of interpreting neural NLI models and their reasoning strategies, we carry out a systematic probing study which investigates whether these models capture the crucial semantic features central to natural logic: \emph{monotonicity} and \emph{concept inclusion}. Correctly identifying valid inferences in \emph{downward-monotone contexts} is a known stumbling block for NLI performance, subsuming linguistic phenomena such as negation scope and generalized quantifiers. To understand this difficulty, we emphasize monotonicity as a property of a \emph{context} and examine the extent to which models capture monotonicity information in the contextual embeddings which are intermediate to their decision making process. Drawing on the recent advancement of the probing paradigm, we compare the presence of monotonicity features across various models. We find that monotonicity information is notably weak in the representations of popular NLI models which achieve high scores on benchmarks, and observe that previous improvements to these models based on fine-tuning strategies have introduced stronger monotonicity features together with their improved performance on challenge sets. PDF 10 2021
Hallucinated but Factual! Inspecting the Factuality of Hallucinations in Abstractive Summarization State-of-the-art abstractive summarization systems often generate hallucinations; i.e., content that is not directly inferable from the source text. Despite being assumed to be incorrect, we find that much hallucinated content is actually consistent with world knowledge, which we call factual hallucinations. Including these factual hallucinations in a summary can be beneficial because they provide useful background information. In this work, we propose a novel detection approach that separates factual from non-factual hallucinations of entities. Our method is based on an entity's prior and posterior probabilities according to pre-trained and finetuned masked language models, respectively. Empirical results suggest that our method vastly outperforms two baselines in both accuracy and F1 scores and has a strong correlation with human judgments on factuality classification tasks. Furthermore, we use our method as a reward signal to train a summarization system using an off-line reinforcement learning (RL) algorithm that can significantly improve the factuality of generated summaries while maintaining the level of abstractiveness. PDF 10 2021
xGQA: Cross-Lingual Visual Question Answering Recent advances in multimodal vision and language modeling have predominantly focused on the English language, mostly due to the lack of multilingual multimodal datasets to steer modeling efforts. In this work, we address this gap and provide xGQA, a new multilingual evaluation benchmark for the visual question answering task. We extend the established English GQA dataset to 7 typologically diverse languages, enabling us to detect and explore crucial challenges in cross-lingual visual question answering. We further propose new adapter-based approaches to adapt multimodal transformer-based models to become multilingual, and---vice versa---multilingual models to become multimodal. Our proposed methods outperform current state-of-the-art multilingual multimodal models (e.g., M3P) in zero-shot cross-lingual settings, but the accuracy remains low across the board; a performance drop of around 38 accuracy points in target languages showcases the difficulty of zero-shot cross-lingual transfer for this task. Our results suggest that simple cross-lingual transfer of multimodal models yields latent multilingual multimodal misalignment, calling for more sophisticated methods for vision and multilingual language modeling. The xGQA dataset is available online at: URL PDF 10 2021
CONTaiNER: Few-Shot Named Entity Recognition via Contrastive Learning Named Entity Recognition (NER) in Few-Shot setting is imperative for entity tagging in low resource domains. Existing approaches only learn class-specific semantic features and intermediate representations from source domains. This affects generalizability to unseen target domains, resulting in suboptimal performances. To this end, we present CONTaiNER, a novel contrastive learning technique that optimizes the inter-token distribution distance for Few-Shot NER. Instead of optimizing class-specific attributes, CONTaiNER optimizes a generalized objective of differentiating between token categories based on their Gaussian-distributed embeddings. This effectively alleviates overfitting issues originating from training domains. Our experiments in several traditional test domains (OntoNotes, CoNLL'03, WNUT '17, GUM) and a new large scale Few-Shot NER dataset (Few-NERD) demonstrate that on average, CONTaiNER outperforms previous methods by 3%-13% absolute F1 points while showing consistent performance trends, even in challenging scenarios where previous approaches could not achieve appreciable performance. PDF 10 2021
Attending to Visual Differences for Situated Language Generation in Changing Scenes We investigate the problem of generating utterances from pairs of images showing a before and an after state of a change in a visual scene. We present a transformer model with difference attention heads that learns to attend to visual changes in consecutive images via a difference key. We test our approach on instruction generation, change captioning, and difference spotting, and compare these tasks in terms of their linguistic phenomena and reasoning abilities. Our model outperforms the state-of-the-art for instruction generation on the BLOCKS dataset and for difference spotting on the Spot-the-diff dataset, and generates accurate referential and compositional spatial expressions. Finally, we identify linguistic phenomena that pose challenges for generation in changing scenes. PDF 10 2021
Towards Using Diachronic Distributed Word Representations as Models of Lexical Development Recent work has shown that distributed word representations can encode abstract information from child-directed speech. In this paper, we use diachronic distributed word representations to perform temporal modeling and analysis of lexical development in children. Unlike all previous work, we use a temporally sliced corpus to learn distributed word representations of child speech and child-directed speech under a curriculum-learning setting. In our experiments, we perform a lexical categorization task to plot the semantic and syntactic knowledge acquisition trajectories in children. Next, we perform linear mixed-effects modeling over the diachronic representational changes to study the role of input word frequencies in the rate of word acquisition in children. We also perform a fine-grained analysis of lexical knowledge transfer from adults to children using Representational Similarity Analysis. Finally, we perform a qualitative analysis of the diachronic representations from our model, which reveals the grounding and word associations in the mental lexicon of children. Our experiments demonstrate the ease of use and effectiveness of diachronic distributed word representations in modeling lexical development. PDF 10 2021
A Word is Worth A Thousand Dollars: Adversarial Attack on Tweets Fools Meme Stock Prediction More and more investors and machine learning models rely on social media (e.g., Twitter and Reddit) to gather information and predict the prices of certain stocks (meme stocks). However, text-based models are known to be vulnerable to adversarial attacks, and whether stock prediction models have a similar adversarial vulnerability is underexplored. In this paper, we experiment with a variety of adversarial attack configurations to fool three stock prediction victim models (StockNet, FinGRU, FinLSTM). We address the task of adversarial generation by solving combinatorial optimization problems with semantics and budget constraints. Our results show that the proposed attack method can achieve consistent success rates, with the capability of causing thousands of dollars of loss (under a Long-Only Buy-Hold-Sell investing strategy) by simply concatenating a perturbed but semantically similar tweet. PDF 10 2021
Event Detection for Suicide Understanding Suicide is a serious problem in every society. Understanding life events of a potential patient is essential for successful suicide-risk assessment and prevention. In this work, we focus on the Event Detection (ED) task to identify event trigger words of suicide-related events in public posts of discussion forums. In particular, we introduce a new dataset for ED (called SuicideED) that features seven suicidal event types to comprehensively capture suicide actions and ideation, and general risk and protective factors. Our experiments with current state-of-the-art ED systems suggest that there is still room for improvement of ED models in this domain. We will publicly release SuicideED to support future research in this important area. PDF 10 2021
C-MORE: Pretraining to Answer Open-Domain Questions by Consulting Millions of References We consider the problem of pretraining a two-stage open-domain question answering (QA) system (retriever + reader) with strong transfer capabilities. The key challenge is how to construct a large amount of high-quality question-answer-context triplets without task-specific annotations. Specifically, the triplets should align well with downstream tasks by: (i) covering a wide range of domains (for open-domain applications), (ii) linking a question to its semantically relevant context with supporting evidence (for training the retriever), and (iii) identifying the correct answer in the context (for training the reader). Previous pretraining approaches generally fall short of one or more of these requirements. In this work, we automatically construct a large-scale corpus that meets all three criteria by consulting millions of references cited within Wikipedia. The well-aligned pretraining signals benefit both the retriever and the reader significantly. Our pretrained retriever leads to 2%-10% absolute gains in top-20 accuracy. And with our pretrained reader, the entire system improves by up to 4% in exact match. PDF 10 2021
DoTAT: A Domain-oriented Text Annotation Tool We propose DoTAT, a domain-oriented text annotation tool. The tool designs and implements functions heavily needed in domain-oriented information extraction. Firstly, the tool supports a multi-person collaborative process with automatic merging and review, which can greatly improve annotation accuracy. Secondly, the tool provides annotation of events, nested events, and nested entities, which are frequently required in domain-related text structuring tasks. Finally, DoTAT provides visualized annotation specification definition, automatic batch annotation, and iterative annotation to improve annotation efficiency. Experiments on the ACE2005 dataset show that DoTAT can reduce the event annotation time by 19.7% compared with existing annotation tools. The accuracy without review is 84.09%, 1.35% higher than Brat and 2.59% higher than Webanno. The accuracy of DoTAT even reaches 93.76% with review. The demonstration video can be accessed from https://ecust-nlp-docker.oss-cn-shanghai.aliyuncs.com/dotat_demo.mp4. A live demo website is available at https://github.com/FXLP/MarkTool. PDF 10 2021
Attention Temperature Matters in Abstractive Summarization Distillation Recent progress in abstractive text summarization largely relies on large pre-trained sequence-to-sequence Transformer models, which are computationally expensive. This paper aims to distill these large models into smaller ones for faster inference with minimal performance loss. Pseudo-labeling based methods are popular in sequence-to-sequence model distillation. In this paper, we find that simply manipulating attention temperatures in Transformers can make pseudo labels easier for student models to learn. Our experiments on three summarization datasets show that our proposed method consistently improves vanilla pseudo-labeling based methods. Further empirical analysis shows that both pseudo labels and summaries produced by our students are shorter and more abstractive. PDF 10 2021
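A sketch of the manipulation, assuming the temperature is applied inside scaled dot-product attention when the teacher generates pseudo labels; the paper's exact placement and schedule may differ.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, temperature=1.0):
    # Scaled dot-product attention with an extra temperature knob; a value
    # above 1 flattens the attention weights, which (per the abstract) can
    # make the teacher's pseudo labels easier for a student to learn.
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / (d ** 0.5 * temperature)
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 5, 16)
flattened = attention(q, k, v, temperature=2.0)  # teacher-side pseudo labeling
standard = attention(q, k, v, temperature=1.0)
print(flattened.shape, standard.shape)
```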
Slangvolution: A Causal Analysis of Semantic Change and Frequency Dynamics in Slang Words are not static in their usage and meaning, but evolve over time. An interesting phenomenon in languages is slang, which is an informal language that is considered ephemeral and is often associated with contemporary trends. In this work, we study the semantic change and relative frequency shift of slang words and compare this change with standard, nonslang words. To measure semantic change, we obtain contextualized representations of words, reduce their dimensionality and propose a metric to measure their average pairwise distances between two time periods. We apply causal discovery algorithms and causal inference to uncover the dynamics of language evolution and measure the effect that word type (slang/nonslang) has on both semantic change and frequency shift, as well as its relationship to absolute frequency and polysemy. Our causal analysis shows that slang words undergo less semantic change even though they have larger frequency shifts over time. PDF 10 2021
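A sketch of the semantic-change measurement under simplifying assumptions: random vectors stand in for contextualized representations, and the paper's dimensionality-reduction step is skipped.

```python
import numpy as np
from scipy.spatial.distance import cdist

def semantic_change(reps_t1, reps_t2):
    # Average pairwise cosine distance between a word's contextualized
    # representations drawn from two time periods.
    return float(cdist(reps_t1, reps_t2, metric="cosine").mean())

rng = np.random.default_rng(0)
reps_2010 = rng.normal(size=(50, 768))  # stand-ins for e.g. BERT vectors
reps_2020 = rng.normal(size=(60, 768))  # of one word's usages per period
print(semantic_change(reps_2010, reps_2020))
```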
Learning Meta Word Embeddings by Unsupervised Weighted Concatenation of Source Embeddings We propose a method to protect the privacy of search engine users by decomposing the queries using semantically \emph{related} and unrelated \emph{distractor} terms. Instead of a single query, the search engine receives multiple decomposed query terms. Next, we reconstruct the search results relevant to the original query term by aggregating the search results retrieved for the decomposed query terms. We show that the word embeddings learnt using a distributed representation learning method can be used to find semantically related and distractor query terms. We derive the relationship between the \emph{obfuscity} achieved through the proposed query anonymisation method and the \emph{reconstructability} of the original search results using the decomposed queries. We analytically study the risk of discovering the search engine users' information intents under the proposed query obfuscation method, and empirically evaluate its robustness against clustering-based attacks. Our experimental results show that the proposed method can accurately reconstruct the search results for user queries, without compromising the privacy of the search engine users. PDF 10 2021
BotsTalk: Machine-Sourced Framework for Automatic Curation of Large-scale Multi-skill Dialogue Datasets Previous work in open-domain chatbots has introduced dialogue corpora and tasks that aim to inject different communicative skills into dialogue systems, such as being personable, knowledgeable and empathetic. With the advent of conversational agents grounded in specific skills, a new challenge in open-domain chatbots has been posed: a good open-domain chatbot should retain a well-rounded set of skills and seamlessly blend them into a conversation. To this end, a new dialogue dataset, Blended Skill Talk, was collected via crowdsourcing and is commonly used as a benchmark for multi-skill dialogue generation. However, this data construction approach requires labor-intensive manual annotation, which severely limits its utility for large-scale learning. In this work, we propose BotsTalk, a novel machine-sourced framework, where several agents participate in a conversation to automatically annotate multi-skill dialogues. We then present Blended Skill BotsTalk (BS$\mathbb{B}$T), a large-scale multi-skill dialogue dataset of 200K conversations. Experimental results show that our dataset can be effectively used as training data for multi-skill dialogue systems which require an understanding of both skill blending and grounding. We also demonstrate that the dataset is orthogonally applicable to diverse learning schemes such as fine-tuning and multi-task learning. PDF 10 2021
Learning to Prioritize: Precision-Driven Sentence Filtering for Long Text Summarization Neural text summarization has shown great potential in recent years. However, current state-of-the-art summarization models are limited by their maximum input length, posing a challenge to summarize longer texts comprehensively. As part of a layered summarization architecture, we introduce PureText, a simple yet effective precision-driven sentence filtering layer that learns to remove low-quality sentences in texts to improve existing summarization models. When evaluated on popular datasets like WikiHow and Reddit TIFU, we show up to 3 and 8 point Rouge-1 absolute improvement on the full test set and the long article subset, respectively, for state-of-the-art summarization models such as BertSum and Bart. Our approach provides downstream models with higher-quality sentences for summarization, improving overall model performance, especially on long text articles. PDF 10 2021
Zero-shot Cross-Language Transfer of Monolingual Entity Linking Models Most entity linking systems, whether mono or multilingual, link mentions to a single English knowledge base. Few have considered linking non-English text to a non-English KB, and therefore, transferring an English entity linking model to both a new document and new KB language. We consider the task of zero-shot cross-lingual transfer of entity linking systems to a new language and KB. We find that a system trained with multilingual representations does reasonably well, and propose improvements to system training that lead to improved recall in most datasets, often matching the in-language performance. We further conduct a detailed evaluation to elucidate the challenges of this setting. PDF 10 2021
CTRLsum: Towards Generic Controllable Text Summarization Current summarization systems yield generic summaries that are disconnected from users' preferences and expectations. To address this limitation, we present CTRLsum, a generic framework to control generated summaries through a set of keywords. During training, keywords are extracted automatically without requiring additional human annotations. At test time, CTRLsum features a control function to map control signals to keywords; through engineering the control function, the same trained model can be applied to control summaries on various dimensions, while neither affecting the model training process nor the pretrained models. We additionally explore the combination of keywords and text prompts for more control tasks. Experiments demonstrate the effectiveness of CTRLsum on three domains of summarization datasets and five control tasks: (1) entity-centric and (2) length-controllable summarization, (3) contribution summarization on scientific papers, (4) invention purpose summarization on patent filings, and (5) question-guided summarization on news articles. Moreover, when used in a standard, unconstrained summarization setting, CTRLsum is comparable to or better than the state-of-the-art systems. PDF 10 2021
FarFetched: An Entity-centric Approach for Reasoning on Textually Represented Environments We address the problem of automatically acquiring knowledge from news articles and leverage it to estimate the veracity of a user's claim based on the supporting or refuting content within the accumulated evidence. We present FarFetched, an entity-centric approach for reasoning based on news, where latent connections between events, actions or statements are discovered via their identified entity mentions and are represented with the help of a knowledge graph. We propose a way of selecting specific subsets from the accumulated wealth of information based on the user hypothesis and construct relevant premises relying on the semantic similarity between them. We leverage textual entailment recognition to provide a measurable way for assessing whether the user claim is plausible based on the selected evidence. Our work is demonstrated on the less-resourced Greek language and supported by the training of state-of-the-art models for STS and NLI that are evaluated on benchmark datasets. PDF 10 2021
Multimodal Entity Tagging with Multimodal Knowledge Base To enhance research on multimodal knowledge bases and multimodal information processing, we propose a new task called multimodal entity tagging (MET) with a multimodal knowledge base (MKB). We also develop a dataset for the problem using an existing MKB. In an MKB, there are entities and their associated texts and images. In MET, given a text-image pair, one uses the information in the MKB to automatically identify the related entity in the text-image pair. We solve the task using the information retrieval paradigm and implement several baselines using state-of-the-art methods in NLP and CV. We conduct extensive experiments and analyze the results. The results show that the task is challenging, but current technologies can achieve relatively high performance. We will release the dataset, code, and models for future research. PDF 10 2021
Should We Trust This Summary? Bayesian Abstractive Summarization to The Rescue We explore the notion of uncertainty in the context of modern abstractive summarization models, using the tools of Bayesian Deep Learning. Our approach approximates Bayesian inference by first extending state-of-the-art summarization models with Monte Carlo dropout and then using them to perform multiple stochastic forward passes. Based on Bayesian inference we are able to effectively quantify uncertainty at prediction time. Having a reliable uncertainty measure, we can improve the experience of the end user by filtering out generated summaries of high uncertainty. Furthermore, uncertainty estimation could be used as a criterion for selecting samples for annotation, and can be paired nicely with active learning and human-in-the-loop approaches. Finally, Bayesian inference enables us to find a Bayesian summary which performs better than a deterministic one and is more robust to uncertainty. In practice, we show that our Variational Bayesian equivalents of BART and PEGASUS can outperform their deterministic counterparts on multiple benchmark datasets. PDF 10 2021
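To make the mechanism above concrete, here is a minimal sketch of Monte Carlo dropout at prediction time, assuming a Hugging Face BART checkpoint; the token-overlap disagreement measure at the end is an illustrative uncertainty proxy, not necessarily the paper's exact metric.

# A minimal sketch of Monte Carlo dropout for abstractive summarization,
# assuming a Hugging Face BART checkpoint; the disagreement measure below
# is an illustrative proxy, not necessarily the paper's exact metric.
import itertools
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
model.eval()

def enable_mc_dropout(model):
    # Keep everything in eval mode except dropout, which stays stochastic.
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.train()

def mc_summaries(text, n_passes=8, max_length=80):
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    enable_mc_dropout(model)
    summaries = []
    with torch.no_grad():
        for _ in range(n_passes):  # one stochastic forward pass per dropout mask
            ids = model.generate(**inputs, max_length=max_length, num_beams=1)
            summaries.append(tokenizer.decode(ids[0], skip_special_tokens=True))
    return summaries

def disagreement(summaries):
    # Mean pairwise Jaccard distance over token sets: higher = more uncertain.
    def jaccard(a, b):
        a, b = set(a.split()), set(b.split())
        return len(a & b) / max(len(a | b), 1)
    pairs = list(itertools.combinations(summaries, 2))
    return 1.0 - sum(jaccard(a, b) for a, b in pairs) / len(pairs)

Summaries whose stochastic passes disagree heavily would then be flagged or filtered, mirroring the filtering use case described above.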
Explainable Assessment of Healthcare Articles with QA The healthcare domain suffers from the spread of poor quality articles on the Internet. While manual efforts exist, they are not sufficient to assess the volume of articles in circulation. The task can be automated as text classification; however, explanations for the labels are necessary for the users. While current explainable systems tackle explanation generation as summarization, we propose a new approach based on Question-Answering that allows us to generate explanations for multiple criteria. We show that QA-based models are competitive with current state-of-the-art systems and complement summarization-based models for explainable quality assessment. PDF 10 2021
DEMix Layers: Disentangling Domains for Modular Language Modeling We introduce a new domain expert mixture (DEMix) layer that enables conditioning a language model (LM) on the domain of the input text. A DEMix layer is a collection of expert feedforward networks, each specialized to a domain, that makes the LM modular: experts can be mixed, added, or removed after initial training. Extensive experiments with autoregressive transformer LMs (up to 1.3B parameters) show that DEMix layers reduce perplexity, increase training efficiency, and enable rapid adaptation. Mixing experts during inference, using a parameter-free weighted ensemble, enables better generalization to heterogeneous or unseen domains. Adding experts incorporates new domains without forgetting older ones, and removing experts restricts access to unwanted domains without additional training. Overall, these results demonstrate benefits of explicitly conditioning on textual domains during language modeling. PDF 10 2021
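The modularity described above can be pictured with a small sketch: a bank of expert feedforward networks indexed by a domain id during training, averaged at inference. The uniform mixture below stands in for the paper's parameter-free weighted ensemble, and the routing interface is an assumption.

# A sketch of a DEMix-style expert feedforward layer, assuming each training
# batch carries a domain id; the uniform inference-time mixture below stands
# in for the paper's parameter-free weighted ensemble.
import torch
import torch.nn as nn

class DEMixLayer(nn.Module):
    def __init__(self, d_model, d_ff, n_domains):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_domains)
        )

    def forward(self, x, domain_id=None):
        if domain_id is not None:
            # Training: route the whole batch through its domain's expert.
            return self.experts[domain_id](x)
        # Inference on unknown domains: average the experts' outputs.
        outs = torch.stack([expert(x) for expert in self.experts])
        return outs.mean(dim=0)

layer = DEMixLayer(d_model=512, d_ff=2048, n_domains=4)
hidden = torch.randn(2, 16, 512)           # (batch, seq, d_model)
out_train = layer(hidden, domain_id=1)     # expert for domain 1
out_infer = layer(hidden)                  # mixture over all experts

Because the experts live in a plain module list, adding a domain amounts to appending an expert and removing one amounts to deleting it, without retraining the rest.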
Cross-lingual Inference with A Chinese Entailment Graph Predicate entailment detection is a crucial task for question-answering from text, where previous work has explored unsupervised learning of entailment graphs from typed open relation triples. In this paper, we present the first pipeline for building Chinese entailment graphs. In this pipeline, we present a novel high-recall open relation extraction (ORE) method and the first Chinese fine-grained entity typing dataset following the FIGER type ontology. Through experiments on the popular Levy-Holt dataset, translated into Chinese, we show that our Chinese entailment graph outperforms a range of strong baselines by large margins. Moreover, an ensemble of Chinese and English entailment graphs sets a new unsupervised SOTA on the original Levy-Holt dataset, surpassing previous SOTA by more than 4 AUC points. PDF 10 2021
Can Language Models Take A Hint? Prompting for Controllable Contextualized Commonsense Inference Generating commonsense assertions, given a certain story context, is a tough challenge even for modern language models. One of the reasons for this may be that the model has to "guess" what topic or entity in a story to generate an assertion about. Prior work has tackled part of the problem, by providing techniques to align commonsense inferences with stories and training language generation models on these. However, none of the prior work provides means to control the parts of a generated assertion. In this work, we present "hinting", a data augmentation technique for improving inference of contextualized commonsense assertions. Hinting is a prefix prompting strategy that uses both hard and soft prompts. We demonstrate the effectiveness of hinting by showcasing its effect on two contextual commonsense inference frameworks: ParaCOMET and GLUCOSE, for both general and context-specific inference. PDF 10 2021
Use of a Taxonomy of Empathetic Response Intents to Control and Interpret Empathy in Neural Chatbots A recent trend in the domain of open-domain conversational agents is enabling them to converse empathetically to emotional prompts. Current approaches either follow an end-to-end approach or condition the responses on similar emotion labels to generate empathetic responses. However, empathy is a broad concept that refers to the cognitive and emotional reactions of an individual to the observed experiences of another, and it is more complex than mere mimicry of emotion. Hence, it requires identifying complex human conversational strategies and dynamics in addition to generic emotions to control and interpret the empathetic responding capabilities of chatbots. In this work, we make use of a taxonomy of eight empathetic response intents in addition to generic emotion categories in building a dialogue response generation model capable of generating empathetic responses in a controllable and interpretable manner. It consists of two modules: 1) a response emotion/intent prediction module; and 2) a response generation module. We propose several rule-based and neural approaches to predict the next response's emotion/intent and generate responses conditioned on these predicted emotions/intents. Automatic and human evaluation results emphasize the importance of the taxonomy of empathetic response intents in producing more diverse and more empathetically appropriate responses than end-to-end models. PDF 10 2021
A Unified Abstractive Model for Generating Question-Answer Pairs Large-scale question-answer pairs (QAPs) are valuable for many applications, such as knowledge base construction and machine reading comprehension. Although their importance has been widely recognized, existing approaches still face critical challenges. On the one hand, QAPs are typically obtained by selecting spans from original texts as their answers, whereas abstractive answer generation is more suitable and natural for complex QA applications. On the other hand, the interaction between the sub-tasks of answer generation and question generation should be well captured so that they enhance each other mutually. To this end, we propose a Unified Abstractive model for Question-Answer Pairs generation (UA-QAP). Specifically, we devise a joint model with a query-guided gate to model the two sub-tasks simultaneously and capture the interaction information between them. Therefore, our model can generate semantically comprehensive question-answer pairs. We conduct extensive experiments on three large-scale datasets. The experimental results demonstrate that our model achieves state-of-the-art performance. PDF 10 2021
Modeling Context With Linear Attention for Scalable Document-Level Translation Document-level neural machine translation allows models to leverage dependencies beyond sentence-internal context to produce more coherent and consistent translations. However, these models, predominantly based on transformers, are difficult to scale to long documents due to the quadratic time and space complexity of their self-attention layers. Recent efforts on efficient attention variants improve scalability, but it is yet unclear if and to what extent their inductive biases are suitable for document translation. In this paper, we explore the efficacy of a recent linear attention model by Peng et al. (2021) on document-level translation and augment it with a sentential gating mechanism. We evaluate the model on the IWSLT 2015 and OpenSubtitles 2018 datasets against a strong transformer baseline and achieve up to 40% decoding speedup with similar or improved BLEU scores. We show that the sentential gate further improves translation quality on IWSLT, a dataset with long sequences. PDF 10 2021
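One way to picture the sentential gating idea is a causal linear-attention recurrence whose running state is decayed at sentence boundaries. The elu+1 feature map and scalar gate below are assumptions for illustration, not the exact formulation of Peng et al. (2021) or of the paper's gate.

# An illustrative causal linear-attention step with a sentential gate,
# assuming an elu+1 feature map and a scalar gate applied to the running
# state at sentence boundaries; both choices are assumptions, not the
# paper's exact formulation.
import torch
import torch.nn.functional as F

def phi(x):
    return F.elu(x) + 1.0  # positive feature map for linear attention

def gated_linear_attention(q, k, v, gates):
    # q, k: (seq, d_k); v: (seq, d_v); gates: (seq,) in [0, 1],
    # with gates[t] < 1 typically only at sentence boundaries.
    d_k, d_v = q.size(-1), v.size(-1)
    S = torch.zeros(d_k, d_v)  # running sum of phi(k) v^T
    z = torch.zeros(d_k)       # running sum of phi(k), for normalization
    outputs = []
    for t in range(q.size(0)):
        g = gates[t]
        S = g * S + torch.outer(phi(k[t]), v[t])  # decay old context at boundaries
        z = g * z + phi(k[t])
        qt = phi(q[t])
        outputs.append((qt @ S) / (qt @ z).clamp(min=1e-6))
    return torch.stack(outputs)

seq, d = 12, 8
out = gated_linear_attention(torch.randn(seq, d), torch.randn(seq, d),
                             torch.randn(seq, d), torch.ones(seq))

Because the state is a fixed-size matrix rather than a growing attention map, this recurrence runs in linear time over the document, which is what makes the approach attractive for long inputs.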
IMPLI: Investigating NLI Models' Performance on Figurative Language Natural language inference (NLI) has been widely used as a task to train and evaluate models for language understanding. However, the ability of NLI models to perform inferences that require understanding of figurative language such as idioms and metaphors remains understudied. We introduce the IMPLI (Idiomatic and Metaphoric Paired Language Inference) dataset consisting of over 25K semi-automatically generated and 1.5K hand-written English sentence pairs based on idiomatic and metaphoric phrases. We use IMPLI to evaluate NLI models based on RoBERTa fine-tuned on the MNLI dataset, and show that while they can reliably detect the entailment relationship between figurative phrases and their literal definitions, they perform poorly on examples where the phrases are designed not to entail the paired definition. This dataset suggests the limits of current NLI models with regard to understanding figurative language and provides a benchmark for future improvements in this direction. PDF 10 2021
Compositional Generalization Requires Compositional Parsers A growing body of research has focused on the task of \textit{compositional generalization}, the ability of a semantic parser to dynamically combine known linguistic elements in novel structures. We analyze the accuracy of different parsers on the recent COGS corpus (Kim and Linzen, 2020). While lexical generalization tasks are solvable by almost all existing models, tasks involving changes to the linguistic structure are hard for even the best sequence-to-sequence models. Structural generalization tasks can be solved with models that have compositionality built in; we present new results confirming this from the AM parser (Groschwitz et al., 2021). We further analyze the role of syntactic generalization in compositional generalization, and we discuss ramifications for the design of both semantic parsers and compositional generalization datasets. PDF 10 2021
Better Sample Efficiency Does Not Imply Out-of-Distribution Robustness We study the relationship between sample efficiency and out-of-distribution performance---if two models have the same in-distribution performance, does the model trained on fewer labeled training examples (higher sample efficiency) perform better out-of-distribution? First, we find that models with higher sample efficiency can have worse out-of-distribution robustness than models that are less sample-efficient. We then empirically study the correlation between sample efficiency and out-of-distribution robustness across three tasks, 23 total ID-OOD settings, and four broadly-applicable methods that change sample efficiency: (1) changing the pre-training data source; (2) using natural language prompts; (3) increasing model size; and (4) increasing the amount of pre-training data. Given that better sample efficiency does not necessarily give rise to robust models, our results underscore the importance of developing interventions and evaluating whether they jointly improve both. PDF 10 2021
InducT-GCN: Inductive Graph Convolutional Networks for Text Classification Text classification aims to assign labels to textual units by making use of global information. Recent studies have applied graph neural network (GNN) techniques to capture global word co-occurrence in a corpus. Most existing approaches require that all the nodes (training and test) in a graph are present during training, which makes them transductive and unable to naturally generalise to unseen nodes. To make such models \textit{inductive}, previous works use extra resources, like pretrained word embeddings. However, high-quality resources are not always available and can be hard to train. Under an extreme setting with no extra resources and a limited amount of training data, can we still learn an inductive graph-based text classification model? In this paper, we introduce a novel inductive graph-based text classification framework, namely InducT-GCN (InducTive Graph Convolutional Networks for Text classification). Compared to transductive models that require test documents in training, we construct a graph based on the statistics of training documents only and represent document vectors with a weighted sum of word vectors. We then conduct one-directional GCN propagation during testing. Across five text classification benchmarks, our InducT-GCN outperforms state-of-the-art methods that are either transductive in nature or rely on additional pretrained resources. We also conducted scalability testing by gradually increasing the data size and revealed that InducT-GCN reduces time and space complexity. PDF 10 2021
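The inductive construction can be sketched as follows: word vectors are learned from a graph built over training documents only, and any document, including an unseen test one, is represented as a TF-IDF-weighted sum of those word vectors. The random word vectors below stand in for representations learned by GCN training, and the propagation step itself is omitted, so this is a simplification of the full model.

# A sketch of the inductive idea: word vectors come from training documents
# only, and any document (including an unseen test one) is represented as a
# TF-IDF-weighted sum of those word vectors. Random vectors stand in for the
# embeddings the GCN would actually learn.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["cheap flights to rome", "rome travel guide", "python graph library"]
test_docs = ["flights and travel"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)       # (n_train_docs, vocab)
X_test = vectorizer.transform(test_docs)             # unseen docs, same vocab;
                                                     # out-of-vocabulary words drop out

rng = np.random.default_rng(0)
word_vecs = rng.normal(size=(X_train.shape[1], 16))  # stand-in for trained embeddings

# Inductive document vectors: weighted sums of word vectors,
# so no test node ever needs to exist in the training graph.
train_doc_vecs = X_train @ word_vecs
test_doc_vecs = X_test @ word_vecs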
A Two-Stage Curriculum Training Framework for NMT Neural Machine Translation (NMT) models are typically trained on heterogeneous data that are concatenated and randomly shuffled. Curriculum training aims to present the data to the NMT systems in a meaningful order. In this work, we introduce a two-stage curriculum training framework for NMT in which we fine-tune a base NMT model on subsets of data selected by both deterministic scoring using pre-trained methods and online scoring that considers prediction scores of the emerging NMT model. Through extensive experiments on six language pairs comprising low- and high-resource languages from WMT'21, we show that our curriculum strategies consistently deliver better quality (up to +2.2 BLEU improvement) and faster convergence (approximately 50% fewer updates). PDF 10 2021
Attention as Grounding: Exploring Textual and Cross-Modal Attention on Entities and Relations in Language-and-Vision Transformer We explore how a multi-modal transformer trained for generation of longer image descriptions learns syntactic and semantic representations about entities and relations grounded in objects, at the level of masked self-attention (text generation) and cross-modal attention (information fusion). We observe that cross-attention learns the visual grounding of noun phrases into objects and high-level semantic information about spatial relations, while text-to-text attention captures low-level syntactic knowledge between words. We conclude that language models in a multi-modal task learn different semantic information about objects and relations cross-modally and uni-modally (text-only). Our code is available here: [the GitHub link placeholder]. PDF 10 2021
Turkish Named Entity Recognition: A Survey and Comparative Analysis Named entity recognition is a challenging task that has been widely studied in English. Although there are some efforts for named entity recognition in Turkish, the reported results are limited to particular datasets and models. Moreover, there is a lack of comparative analysis for named entity recognition in Turkish. In this study, we contribute to the literature in three ways. First, we provide an up-to-date short survey of Turkish named entity recognition studies. Second, we compare state-of-the-art named entity recognition models on the various Turkish datasets that we can access. Lastly, we analyze a set of linguistic processing steps that affect the performance of Turkish named entity recognition. PDF 10 2021
Unmasking the Trade-off: Measuring Gender Bias Mitigation and Over-debiasing Effects in Pretrained Language Models Pretrained language models (PLMs) have demonstrated success across many natural language processing tasks. However, evidence suggests that they encode gender bias present in the corpora they are trained on. Existing bias mitigation methods are usually devised to remove all associations related to gender. This can hurt the performance of PLMs, because it can lead to a loss of genuine and factual associations (e.g., not associating the word "mother" with females over males). To measure the extent of undesirable loss of gender associations (i.e. over-debiasing), we introduce the Desirable Associations evaluation corpus for Gender (DA-Gender). We find that three popular debiasing methods result in substantial undesirable loss of gender associations. Our results highlight the importance of mitigating bias without removing genuine gender association, and our dataset constitutes the first benchmark to evaluate over-debiasing. PDF 10 2021
Generating Scientific Definitions with Controllable Complexity Unfamiliar terminology and complex language can present barriers to understanding science. Natural language processing stands to help address these issues by automatically defining unfamiliar terms. We introduce a new task and dataset for defining scientific terms and controlling the complexity of generated definitions as a way of adapting to a specific reader's background knowledge. We test four definition generation methods for this new task, finding that a sequence-to-sequence approach is most successful. We then explore the version of the task in which definitions are generated at a target complexity level. We introduce a novel reranking approach and find in human evaluations that it offers superior fluency while also controlling complexity, compared to several controllable generation baselines. PDF 10 2021
Faithful or Extractive? On Mitigating the Faithfulness-Abstractiveness Trade-off in Abstractive Summarization Despite recent progress in abstractive summarization, systems still suffer from faithfulness errors. While prior work has proposed models that improve faithfulness, it is unclear whether the improvement comes from an increased level of extractiveness of the model outputs as one naive way to improve faithfulness is to make summarization models more extractive. In this work, we present a framework for evaluating the effective faithfulness of summarization systems, by generating a faithfulness-abstractiveness trade-off curve that serves as a control at different operating points on the abstractiveness spectrum. We then show that the Maximum Likelihood Estimation (MLE) baseline as well as recently proposed methods for improving faithfulness, fail to consistently improve over the control at the same level of abstractiveness. Finally, we learn a selector to identify the most faithful and abstractive summary for a given document, and show that this system can attain higher faithfulness scores in human evaluations while being more abstractive than the baseline system on two datasets. Moreover, we show that our system is able to achieve a better faithfulness-abstractiveness trade-off than the control at the same level of abstractiveness. PDF 10 2021
ME-GCN: Multi-dimensional Edge-Embedded Graph Convolutional Networks for Semi-supervised Text Classification Compared to sequential learning models, graph-based neural networks exhibit an excellent ability to capture global information and have been used for semi-supervised learning tasks, including citation network analysis and text classification. However, most GCNs are designed with single-dimensional edge features and neglect the rich edge information of graphs. In this paper, we introduce ME-GCN (Multi-dimensional Edge-enhanced Graph Convolutional Networks) for semi-supervised text classification. A text graph for an entire corpus is first constructed to describe the undirected and multi-dimensional relationships between word and word, document and document, and word and document. The graph is initialised with corpus-trained multi-dimensional word and document node representations, and the relations are represented according to the distance between those word/document nodes. The generated graph is then trained with ME-GCN, which treats the edge features as multi-stream signals, with each stream performing a separate graph convolutional operation. ME-GCN can thus integrate the rich graph edge information of the entire text corpus. The results demonstrate that our proposed model significantly outperforms state-of-the-art methods across eight benchmark datasets. PDF 10 2021
WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models Recently, large pretrained language models (LMs) have gained popularity. Training these models requires ever more computational resources, and most of the existing models are trained on English text only. It is exceedingly expensive to train these models in other languages. To alleviate this problem, we introduce a method – called WECHSEL – to transfer English models to new languages. We exchange the tokenizer of the English model for a tokenizer in the target language and initialize token embeddings such that they are close to semantically similar English tokens, by utilizing multilingual static word embeddings covering English and the target language. We use WECHSEL to transfer GPT-2 and RoBERTa models to 4 other languages (French, German, Chinese and Swahili). WECHSEL improves over a previously proposed method for cross-lingual parameter transfer and outperforms models of comparable size trained from scratch in the target language with up to 64x less training effort. Our method makes training large language models for new languages more accessible and less damaging to the environment. We make our code and models publicly available. PDF 10 2021
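The initialization step can be sketched directly: each target-language subword embedding starts as a similarity-weighted average of English subword embeddings, with similarities taken from aligned static word vectors. The top-k softmax weighting below is an assumption about the exact scheme.

# A sketch of WECHSEL-style embedding initialization: each target subword
# embedding starts as a similarity-weighted average of English subword
# embeddings, with similarities taken from aligned static word vectors.
# The top-k softmax weighting is an assumption about the exact scheme.
import numpy as np

def init_target_embeddings(eng_emb, eng_static, tgt_static, k=10):
    # eng_emb:    (V_en, d_model)  trained English subword embeddings
    # eng_static: (V_en, d_static) static vectors for English subwords
    # tgt_static: (V_tgt, d_static) static vectors for target subwords,
    #             in the same cross-lingual space as eng_static
    def normalize(m):
        return m / np.linalg.norm(m, axis=1, keepdims=True).clip(min=1e-9)
    sims = normalize(tgt_static) @ normalize(eng_static).T   # cosine similarities
    tgt_emb = np.empty((tgt_static.shape[0], eng_emb.shape[1]))
    for i, row in enumerate(sims):
        top = np.argpartition(row, -k)[-k:]                  # k nearest English tokens
        w = np.exp(row[top]); w /= w.sum()                   # softmax over similarities
        tgt_emb[i] = w @ eng_emb[top]
    return tgt_emb

rng = np.random.default_rng(0)
emb = init_target_embeddings(rng.normal(size=(100, 32)),
                             rng.normal(size=(100, 8)),
                             rng.normal(size=(50, 8)))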
Can Synthetic Translations Improve Bitext Quality? Synthetic translations have been used for a wide range of NLP tasks primarily as a means of data augmentation. This work explores instead, how we can use synthetic translations to selectively replace potentially imperfect reference translations in mined bitext. We find that synthetic samples can improve bitext quality without any additional bilingual supervision, when they replace the originals based on a semantic equivalence classifier that helps mitigate NMT noise. The improved quality of the revised bitext is confirmed intrinsically via human evaluation and extrinsically through bilingual induction and MT tasks. PDF 10 2021
Hierarchical Transformer Networks for Long-sequence and Multiple Clinical Documents Classification We present a Hierarchical Transformer Network for modeling long-term dependencies across clinical notes for the purpose of patient-level prediction. The network is equipped with three levels of Transformer-based encoders to learn progressively from words to sentences, sentences to notes, and finally notes to patients. The first level, from words to sentences, directly applies a pre-trained BERT model as a fully trainable component, while the second and third levels each implement a stack of transformer-based encoders before the final patient representation is fed into a classification layer for clinical predictions. Compared to conventional BERT models, our model increases the maximum input length from 512 tokens to much longer sequences that are appropriate for modeling large numbers of clinical notes. We empirically examine different hyper-parameters to identify an optimal trade-off given computational resource limits. Our experiment results on the MIMIC-III dataset for different prediction tasks demonstrate that the proposed Hierarchical Transformer Network outperforms previous state-of-the-art models, including but not limited to BigBird. PDF 10 2021
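A structural sketch of the three-level hierarchy: BERT pools words into sentence vectors, a transformer encoder pools sentences into note vectors, and another pools notes into a patient vector for classification. The pooling choices (BERT pooler output, mean pooling) are assumptions, not necessarily the paper's exact design.

# A structural sketch of the three-level hierarchy: BERT pools words into
# sentence vectors, a transformer encoder pools sentences into note vectors,
# and another pools notes into a patient vector for classification. The
# pooling choices here are assumptions.
import torch
import torch.nn as nn
from transformers import BertModel

class HierarchicalClinicalEncoder(nn.Module):
    def __init__(self, n_labels, d_model=768):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.sent_to_note = nn.TransformerEncoder(layer, num_layers=2)
        self.note_to_patient = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(d_model, n_labels)

    def forward(self, input_ids, attention_mask):
        # input_ids: (n_notes, n_sents, seq_len) for one patient
        n_notes, n_sents, seq_len = input_ids.shape
        flat_ids = input_ids.view(-1, seq_len)
        flat_mask = attention_mask.view(-1, seq_len)
        sent_vecs = self.bert(flat_ids, attention_mask=flat_mask).pooler_output
        sent_vecs = sent_vecs.view(n_notes, n_sents, -1)      # sentences per note
        note_vecs = self.sent_to_note(sent_vecs).mean(dim=1)  # (n_notes, d_model)
        patient = self.note_to_patient(note_vecs.unsqueeze(0)).mean(dim=1)
        return self.classifier(patient)                       # (1, n_labels)

Because only the bottom level ever sees a 512-token window, the effective input length grows with the number of sentences and notes rather than being capped by BERT itself.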
Measuring the Language of Self-Disclosure across Corpora Being able to reliably estimate self-disclosure -- a key component of friendship and intimacy -- from language is important for many psychology studies. We build single-task models on five self-disclosure corpora, but find that these models generalize poorly; the within-domain accuracy of predicted message-level self-disclosure of the best-performing model (mean Pearson's r=0.69) is much higher than the respective across data set accuracy (mean Pearson's r=0.32), due to both variations in the corpora (e.g., medical vs. general topics) and labeling instructions (target variables: self-disclosure, emotional disclosure, intimacy). However, some lexical features, such as expression of negative emotions and use of first person personal pronouns such as 'I' reliably predict self-disclosure across corpora. We develop a multi-task model that yields better results, with an average Pearson's r of 0.37 for out-of-corpora prediction. PDF 10 2021
Zero-shot Cross-lingual Transfer is Under-specified Optimization Pretrained multilingual encoders enable zero-shot cross-lingual transfer, but often produce unreliable models that exhibit high performance variance on the target language. We postulate that this high variance results from zero-shot cross-lingual transfer solving an under-specified optimization problem. We show that the source-language monolingual model and the source + target bilingual model are linearly connected via model interpolation, suggesting that the model struggles to identify good solutions for both source and target languages using the source language alone. PDF 10 2021
On the Complementarity of Data Selection and Fine Tuning for Domain Adaptation Domain adaptation of neural networks commonly relies on three training phases: pretraining, selected data training and then fine tuning. Data selection improves target domain generalization by training further on pretraining data identified by relying on a small sample of target domain data. This work examines the benefit of data selection for language modeling and machine translation. Our experiments assess the complementarity of selection with fine tuning and result in practical recommendations: (i) selected data must be similar to the fine-tuning domain, but not so much as to erode the complementary effect of fine-tuning; (ii) there is a trade-off between selecting little data for fast but limited progress or much data for slow but long-lasting progress; (iii) data selection can be applied early during pretraining, with performance gains comparable to a long pretraining session; (iv) data selection from domain classifiers is often more effective than the popular contrastive data selection method. PDF 10 2021
Comparing Apples and Oranges: Recognizing Political Heterogeneity on Reddit and Its Implications for Behavioral Analysis Reddit is home to a broad spectrum of political activity, and users signal their political affiliations in multiple ways—from self-declarations to community participation. Commonly, political studies have assumed political users are a single bloc, both in developing models to infer political leaning and in studying political behavior. Here, we test this assumption about political users. We show that a variety of commonly-used political-inference approaches do not generalize, indicating heterogeneous types of political users, and remain imprecise at best for most users, regardless of which sources of data or methods are used. Across a 14-year longitudinal analysis, we demonstrate that the choice of definition of a political user has significant implications for behavioral analysis. Controlling for multiple factors, political users are more toxic on the platform and inter-party interactions are even more toxic---but not all political users behave this way. Last, we identify a subset of political users who repeatedly flip affiliations, showing that these users are the most controversial of all, acting as provocateurs by more frequently bringing up politics, and are more likely to be banned, suspended, or deleted. PDF 10 2021
Lexicon Creation for Interpretable NLP Models Lexica--words and associated scores--are widely used as simple, interpretable, generalizable language features to predict sentiment, emotions, mental health, and personality traits. Applying different feature importance methods to different predictive models yields lexica of varying quality. In this paper, we train diverse sequence classification models, including context-oblivious (SVMs, feed-forward neural networks) and context-sensitive (RoBERTa, DistilBERT) models, and generate lexica based on different feature importance measurements, including attention, masking, and SHAP (SHapley Additive exPlanations) values. We evaluate the generated lexica on their predictive performance on test sets within the same corpus domain and on their generalization to different but similar domains. We find that simple context-oblivious models produce lexica of similar within-domain accuracy and better across-domain accuracy than those from complex context-sensitive models. Based on human evaluator ratings of these lexica, we also find that context-oblivious models generate lexica that are more aligned with human judgments. PDF 10 2021
ODE Transformer: An Ordinary Differential Equation-Inspired Model for Sequence Generation Residual networks are an Euler discretization of solutions to Ordinary Differential Equations (ODE). This paper explores a deeper relationship between Transformer and numerical ODE methods. We first show that a residual block of layers in Transformer can be described as a higher-order solution to ODE. Inspired by this, we design a new architecture, {\it ODE Transformer}, which is analogous to the Runge-Kutta method that is well motivated in ODE. As a natural extension to Transformer, ODE Transformer is easy to implement and efficient to use. Experimental results on the large-scale machine translation, abstractive summarization, and grammar error correction tasks demonstrate the high genericity of ODE Transformer. It can gain large improvements in model performance over strong baselines (e.g., 30.77 and 44.11 BLEU scores on the WMT'14 English-German and English-French benchmarks) at a slight cost in inference efficiency. PDF 10 2021
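The relationship to numerical ODE solvers can be made concrete by contrasting a standard residual (Euler) update with a second-order Runge-Kutta update of the same sublayer. Take the RK2 block below as one illustrative instance under that reading, not the paper's exact architecture.

# A sketch contrasting the standard residual (Euler) update with a
# second-order Runge-Kutta update of the same sublayer F; the ODE
# Transformer generalizes this idea, so take RK2 as one illustrative
# instance rather than the exact architecture.
import torch
import torch.nn as nn

class RK2Block(nn.Module):
    def __init__(self, d_model, d_ff=2048):
        super().__init__()
        self.F = nn.Sequential(nn.LayerNorm(d_model),
                               nn.Linear(d_model, d_ff), nn.ReLU(),
                               nn.Linear(d_ff, d_model))

    def euler(self, y):
        # Standard residual block: y_{n+1} = y_n + F(y_n)
        return y + self.F(y)

    def forward(self, y):
        # RK2 (Heun's method): apply F twice for a higher-order update.
        k1 = self.F(y)
        k2 = self.F(y + k1)
        return y + 0.5 * (k1 + k2)

block = RK2Block(d_model=512)
x = torch.randn(4, 10, 512)
y_euler, y_rk2 = block.euler(x), block(x)

Note that the RK2 update reuses the same parameters for both function evaluations, so the higher-order step costs extra computation but no extra parameters.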
HiTab: A Hierarchical Table Dataset for Question Answering and Natural Language Generation Tables are often created with hierarchies, but existing works on table reasoning mainly focus on flat tables and neglect hierarchical tables. Hierarchical tables challenge numerical reasoning through complex hierarchical indexing, as well as implicit relationships of calculation and semantics. We present a new dataset, HiTab, to study question answering (QA) and natural language generation (NLG) over hierarchical tables. HiTab is a cross-domain dataset constructed from a wealth of statistical reports and Wikipedia pages, and has unique characteristics: (1) nearly all tables are hierarchical; (2) QA pairs are not proposed by annotators from scratch, but are revised from real and meaningful sentences authored by analysts; and (3) to reveal complex numerical reasoning in statistical reports, we provide fine-grained annotations of quantity and entity alignment. Experiments suggest that HiTab presents a strong challenge for existing baselines and a valuable benchmark for future research. Targeting hierarchical structure, we devise a hierarchy-aware logical form for symbolic reasoning over tables, which shows high effectiveness. Targeting table reasoning, we leverage entity and quantity alignment to explore partially supervised training in QA and conditional generation in NLG, and largely reduce spurious predictions in QA and produce better descriptions in NLG. PDF 10 2021
Old BERT, New Tricks: Artificial Language Learning for Pre-Trained Language Models We extend the artificial language learning experimental paradigm from psycholinguistics and apply it to pre-trained language models -- specifically, BERT (Devlin et al., 2019). We treat a pretrained model as a subject in an artificial language learning experimental setting: in order to learn the relation between two linguistic properties A and B, we introduce a set of new, non-existent, linguistic items, give the model information about their variation along property A, then measure to what extent the model learns property B for these items as a result of training. We show this method at work for degree modifiers (expressions like *slightly*, *very*, *rather*, *extremely*) and test the hypothesis that the degree expressed by the modifier (low, medium or high degree) is related to its sensitivity to sentence polarity (whether it shows preference for affirmative or negative sentences or neither). Our experimental results are compatible with existing linguistic observations that relate degree semantics to polarity-sensitivity, including the main one: low degree semantics leads to positive polarity sensitivity (that is, to preference towards affirmative contexts). The method can be used in linguistic theory to elaborate on hypotheses and interpret experimental results, as well as for more insightful evaluation of linguistic representations in language models. PDF 10 2021
Seeing things or seeing scenes: Investigating the capabilities of V&L models to align scene descriptions to images Images can be described in terms of the objects they contain, or in terms of the types of scene or place that they instantiate. In this paper we address to what extent pretrained Vision and Language models can learn to align descriptions of both types with images. We compare 3 state-of-the-art models, VisualBERT, LXMERT and CLIP. We find that (i) V\&L models are susceptible to stylistic biases acquired during pretraining; (ii) only CLIP performs consistently well on both object- and scene-level descriptions. A follow-up ablation study shows that CLIP uses object-level information in the visual modality to align with scene-level textual descriptions. PDF 10 2021
DialogueScript: Using Dialogue Agents to Produce a Script We present a novel approach to generating scripts by using agents with different personality types. To manage character interaction in the script, we employ simulated dramatic networks. Automatic and human evaluation on multiple criteria shows that our approach outperforms a vanilla-GPT2-based baseline. We further introduce a new metric to evaluate dialogue consistency based on natural language inference and demonstrate its validity. PDF 10 2021
Alleviating the Inequality of Attention Heads for Neural Machine Translation Recent studies show that the attention heads in Transformer are not equal. We relate this phenomenon to the imbalanced training of multi-head attention and the model's dependence on specific heads. To tackle this problem, we propose a simple masking method, HeadMask, in two specific variants. Experiments show that translation improvements are achieved on multiple language pairs. Subsequent empirical analyses also support our assumption and confirm the effectiveness of the method. PDF 10 2021
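A minimal sketch of the head-masking idea: during training, a subset of attention heads is zeroed out and the survivors rescaled, pushing the model to spread work across heads. Random selection below is just the simplest stand-in; the paper's two specific selection schemes are not reproduced here.

# A sketch of head masking: during training, zero out a random subset of
# attention heads and rescale the rest, pushing the model to spread work
# across heads rather than depending on a few. Random selection is a
# stand-in for the paper's two specific schemes.
import torch

def head_mask(head_outputs, p_mask=0.25, training=True):
    # head_outputs: (batch, n_heads, seq, d_head)
    if not training or p_mask == 0.0:
        return head_outputs
    n_heads = head_outputs.size(1)
    keep = (torch.rand(n_heads) >= p_mask).float()      # 1 = keep, 0 = mask
    if keep.sum() == 0:
        keep[torch.randint(n_heads, (1,))] = 1.0        # always keep one head
    scale = n_heads / keep.sum()                        # rescale like dropout
    return head_outputs * keep.view(1, -1, 1, 1) * scale

masked = head_mask(torch.randn(2, 8, 16, 64))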
Emotion Style Transfer with a Specified Intensity Using Deep Reinforcement Learning Text style transfer is a widely explored task in natural language generation which aims to change the stylistic properties of the text while retaining its style-independent content. In this work, we propose the task of emotion style transfer with a specified intensity in an unsupervised setting. The aim is to rewrite a given sentence, in any emotion, to a target emotion while also controlling the intensity of the target emotion. Emotions are gradient in nature, some words/phrases represent higher emotional intensity, while others represent lower intensity. In this task, we want to control this gradient nature of the emotion in the output. Additionally, we explore the issues with the existing datasets and address them. A novel BART-based model is proposed that is trained for the task by direct rewards. Unlike existing work, we bootstrap the BART model by training it to generate paraphrases so that it can explore lexical and syntactic diversity required for the output. Extensive automatic and human evaluations show the efficacy of our model in solving the problem. PDF 10 2021
Rare but Severe Errors Induced by Minimal Deletions in English-Chinese Neural Machine Translation We examine the inducement of rare but severe errors in English-Chinese and Chinese-English Transformer-based neural machine translation by minimal deletion in the source text. We also examine the effect of training data size on the number and types of pathological cases induced by these perturbations, finding significant variation. We find that one type of hallucination can be remedied through data preprocessing and that deleting words hurts more than deleting characters in a character-based model, even though deleting characters introduces nonsense words. PDF 10 2021
It's my Job to be Repetitive! My Job! My Job! -- Linking Repetitions to In-Context Learning in Language Models Recent studies have shown that large language models can display surprising accuracy at learning tasks from few examples presented in the input context, which goes under the name of in-context learning. Other studies have shown that language models can sometimes display the undesirable behavior of falling back into loops in which an utterance is repeated infinitely often. Here, we observe that the model's capacity to produce repetitions goes well beyond frequent or well-formed utterances, and generalizes to repeating completely arbitrary sequences of tokens. Construing this as a simple form of in-context learning, we hypothesize that these two phenomena are linked through shared processing steps. With controlled experiments, we show that impairing the network from producing repetitions severely affects in-context learning, without reducing its overall predictive performance, thus supporting the proposed hypothesis. PDF 10 2021
Cooperative Semi-Supervised Transfer Learning of Machine Reading Comprehension Pretrained language models have significantly improved the performance of down-stream language understanding tasks, including extractive question answering, by providing high-quality contextualized word embeddings. However, training question answering models still requires large amounts of annotated data for specific domains. In this work, we propose a cooperative, self-play learning framework, REGEX, for automatically generating more non-trivial question-answer pairs to improve model performance. REGEX is built upon a masked answer extraction task with an interactive learning environment containing an answer entity REcognizer, a question Generator, and an answer EXtractor. Given a passage with a masked entity, the generator generates a question around the entity, and the extractor is trained to extract the masked entity with the generated question and raw texts. The framework allows the training of question generation and answering models on any text corpora without annotation. We further leverage a reinforcement learning technique to reward generating high-quality questions and to improve the answer extraction model's performance. Experiment results show that REGEX outperforms the state-of-the-art (SOTA) pretrained language models and transfer learning approaches on standard question-answering benchmarks, and yields the new SOTA performance under given model size and transfer learning settings. PDF 10 2021
Evaluation of Transfer Learning for Polish with a text-to-text model We present polT - a general purpose text-to-text model for Polish that can be fine-tuned on a variety of Natural Language Processing (NLP) tasks with a single training objective. Unsupervised denoising pre-training is performed efficiently by initializing the model weights with their multi-lingual T5 (mT5) counterparts. We evaluate the performance of polT, mT5, Polish BART (plBART) and Polish GPT-2 (papuGaPT2) on diverse downstream tasks: the text-to-text KLEJ benchmark, en-pl machine translation, question answering and summarization. polT scores best on all of these tasks except summarization, where plBART is best. In general (except for summarization), the larger the model, the better the results, and encoder-decoder architectures prove to be better than their decoder-only equivalents. Additionally, since summarization and question answering lack benchmark datasets for Polish, we describe their construction in detail and will make them publicly available. PDF 10 2021
Challenges in Generalization in Open Domain Question Answering Recent work on Open Domain Question Answering has shown that there is a large discrepancy in model performance between novel test questions and those that largely overlap with training questions. However, it is as yet unclear which aspects of novel questions make them challenging. Drawing upon studies on systematic generalization, we introduce and annotate questions according to three categories that measure different levels and kinds of generalization: training set overlap, compositional generalization (comp-gen), and novel entity generalization (novel-entity). When evaluating six popular parametric and non-parametric models, we find that for the established Natural Questions and TriviaQA datasets, even the strongest model performance for comp-gen/novel-entity is 13.1/5.4% and 9.6/1.5% lower compared to that for the full test set – indicating the challenge posed by these types of questions. Furthermore, we show that whilst non-parametric models can handle questions containing novel entities, they struggle with those requiring compositional generalization. Through thorough analysis, we find that key question difficulty factors are: cascading errors from the retrieval component, frequency of question pattern, and frequency of the entity. PDF 10 2021
Making Pre-trained Language Models End-to-end Few-shot Learners with Contrastive Prompt Tuning Prompt-based learning for Pre-trained Language Models (PLMs) has achieved remarkable performance in few-shot learning by exploiting prompts as task guidance and turning downstream tasks into masked language problems. In most existing approaches, the high performance of prompt-based learning heavily relies on handcrafted prompts and verbalizers, which may limit the application of such approaches in real-world scenarios. To solve this issue, we present CP-Tuning, the first end-to-end Contrastive Prompt Tuning framework for PLMs that requires no manual engineering of task-specific prompts and verbalizers. It integrates a task-invariant continuous prompt encoding technique with fully trainable prompt parameters. We further propose a pair-wise cost-sensitive contrastive loss to optimize the model in order to achieve verbalizer-free class mapping and enhance the task-invariance of prompts. Experiments over a variety of NLP tasks show that CP-Tuning consistently outperforms state-of-the-art methods. PDF 10 2021
Learning to Acquire Knowledge from a Search Engine for Dialogue Response Generation Knowledge-aided dialogue response generation aims at augmenting chatbots with relevant external knowledge in the hope of generating more informative responses. The majority of previous work assumes that the relevant knowledge is given as input or retrieved from a static pool of knowledge. However, this assumption does not reflect the real-world situation, where knowledge is continually updated and a chatbot has to \emph{dynamically} retrieve useful knowledge. In this paper, we propose a dialogue model that can access the vast and dynamic information from any search engine for response generation. To this end, we design a query producer that generates queries from a dialogue context to interact with a search engine. The query producer is trained without any human annotation of gold queries, making it easily transferable to other domains and search engines. More specifically, we design a reinforcement learning algorithm to train the query producer, where rewards are obtained by comparing retrieved articles and gold responses. Experiments show that our query producer can achieve R@1 and R@5 rates of 62.4% and 74.8% for retrieving gold knowledge, and the overall model generates better responses than a strong BART (Lewis et al., 2020) model and other typical baselines. PDF 10 2021
Gendered Language in Resumes Despite growing concerns around gender bias in NLP models used in algorithmic hiring, there is little empirical work studying the extent and nature of gendered language in resumes. Using a corpus of 709k resumes from IT firms, we train a series of models to classify the gender of the applicant, thereby measuring the extent of gendered information encoded in resumes. We also investigate whether it is possible to obfuscate gender from resumes by removing gender identifiers, removing the gender sub-space in embedding models, etc. We find that there is a significant amount of gendered information in resumes even after obfuscation. A simple Tf-Idf model can learn to classify gender with AUROC=0.75, and more sophisticated transformer-based models achieve AUROC=0.8. We further find that gender predictive values have little correlation with the gender direction of embeddings -- meaning that what is predictive of gender is not necessarily "gendered" in the masculine/feminine sense. We discuss the implications of these findings in the algorithmic hiring context. PDF 10 2021
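The simple baseline mentioned above can be reproduced in a few lines: TF-IDF features plus a linear classifier, scored with AUROC. The toy strings below merely stand in for resume text, and the exact classifier behind the paper's Tf-Idf model is an assumption.

# A sketch of the simple baseline: TF-IDF features plus a linear classifier,
# scored with AUROC; the toy strings stand in for resume text, and logistic
# regression is an assumed choice of linear model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

resumes = ["led softball team outreach", "captain of football club",
           "organized sorority fundraiser", "fraternity chapter president"] * 25
labels = [1, 0, 1, 0] * 25  # 1 = female, 0 = male (toy labels)

X_tr, X_te, y_tr, y_te = train_test_split(resumes, labels, test_size=0.3,
                                          random_state=0, stratify=labels)
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X_tr, y_tr)
print("AUROC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))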
Large-Scale Hate Speech Detection with Cross-Domain Transfer Hate speech towards people with different backgrounds is a major problem observed in social media. Although there are various attempts to detect hate speech automatically via supervised learning models, the performance of such models relies on the limited datasets on which they are trained. In this study, we construct large-scale tweet datasets for supervised hate speech detection in English and Turkish, including 100k human-labeled tweets for each language. Our datasets are designed to have an equal number of tweets distributed over five domains: religion, gender, race, politics, and sports. We analyze the performance of state-of-the-art language models on large-scale hate speech detection with a special focus on model scalability. We also examine the cross-domain transfer ability of hate speech detection. PDF 10 2021
Models In a Spelling Bee: Language Models Implicitly Learn the Character Composition of Tokens Standard pretrained language models operate on sequences of subword tokens without direct access to the characters that compose each token’s string representation. We probe the embedding layer of pretrained language models and show that models learn the internal character composition of whole word and subword tokens to a surprising extent, without ever seeing the characters coupled with the tokens. Our results show that the embedding layer of RoBERTa holds enough information to accurately spell up to a third of the vocabulary and reach high average character ngram overlap on all token types. We further test whether enriching subword models with additional character information can improve language modeling, and observe that this method has a near-identical learning curve as training without spelling-based enrichment. Overall, our results suggest that language modeling objectives incentivize the model to implicitly learn some notion of spelling, and that explicitly teaching the model how to spell does not enhance its performance on such tasks. PDF 10 2021
CoMPM: Context Modeling with Speaker's Pre-trained Memory Tracking for Emotion Recognition in Conversation As the use of interactive machines grows, the task of Emotion Recognition in Conversation (ERC) has become more important. If machine-generated sentences reflect emotion, more human-like, sympathetic conversations are possible. Since emotion recognition in conversation is inaccurate if the previous utterances are not taken into account, many studies exploit the dialogue context to improve performance. We introduce CoMPM, a context embedding module (CoM) combined with a pre-trained memory module (PM) that tracks the speaker's previous utterances within the context, and show that the pre-trained memory significantly improves the final accuracy of emotion recognition. We achieve competitive performance with previous methods on English datasets (MELD, EmoryNLP, IEMOCAP, DailyDialog), and achieve good performance with small datasets. In addition, our method can be extended to other languages because, unlike existing methods, it does not require structured knowledge. PDF 10 2021
Graph-to-Graph Annotation Conversion Based on Pretrained Models Annotation conversion is an effective way to construct datasets under new annotation guidelines based on existing datasets with little human labour. Previous work has been limited to conversion between tree-structured datasets and has mainly focused on feature-based models, which are not easily applicable to new conversions. In this paper, we propose two pretrained model-based graph-to-graph annotation conversion approaches, namely Label Switching and Graph2Graph Linear Transfer, which are able to deal with conversion between graph-structured annotations and require no manually designed features. We manually construct a graph-structured parallel annotated dataset and evaluate the proposed approaches on it as well as on four existing parallel annotated datasets. Experimental results show that the proposed approaches outperform two strong baselines across all the datasets. Furthermore, the combination of the two models yields even better results. PDF 10 2021
Benchmarking Biomedical Nested NER and Relation Extraction Models The Open EPPI corpus comprises 151 full-text papers annotated by domain experts for entity mentions, protein-protein interactions (PPIs), and normalisation of entities to publicly available ontologies. The corpus is publicly available at [ANON]. We benchmark recent nested NER and relation extraction models. Results show that, although existing nested NER models achieve good performance on outermost and innermost entity mentions, they struggle with other types of nested mentions. Benchmark results for relation extraction show substantial room for improvement, with precision under 70 and recall around 40 to 52. PDF 10 2021
Subword-based Cross-lingual Transfer of Embeddings from Hindi to Marathi Word embeddings are growing to be a crucial resource in the field of NLP for any language. This work focuses on static subword embedding transfer for Indian languages from a relatively higher-resource language to a genealogically related low-resource language. We work with Hindi-Marathi as our language pair, simulating a low-resource scenario for Marathi. We demonstrate the consistent benefits of unsupervised morphemic segmentation on both source and target sides over the treatment performed by FastText. We show that a trivial "copy-and-paste" embeddings transfer based on even perfect bilingual lexicons is inadequate in capturing language-specific relationships. Our best-performing method uses an EM-style approach to learning bilingual subword embeddings; the resulting embeddings are evaluated using the publicly available Marathi Word Similarity task as well as WordNet-Based Synonymy Tests. We find that our approach significantly outperforms the FastText baseline on both tasks; on the former task, its performance is close to that of pretrained FastText Marathi embeddings that use two orders of magnitude more Marathi data. PDF 10 2021
On the Importance of Effectively Adapting Pretrained Language Models for Active Learning Recent active learning (AL) approaches in Natural Language Processing (NLP) have proposed using off-the-shelf pretrained language models (LMs). In this paper, we argue that these LMs are not adapted effectively to the downstream task during AL, and we explore ways to address this issue. We suggest first adapting the pretrained LM to the target task by continuing training with all the available unlabeled data, and then using it for AL. We also propose a simple yet effective fine-tuning method to ensure that the adapted LM is properly trained in both low- and high-resource scenarios during AL. Our experiments demonstrate that our approach provides substantial data efficiency improvements compared to the standard fine-tuning approach, suggesting that a poor training strategy can be catastrophic for AL. PDF 10 2021
Cross-lingual Constituency Parsing with Linguistic Typology Knowledge Cross-lingual transfer learning (CLT) has been successfully applied to the dependency parsing task. This is the first work to evaluate a CLT-based approach to the constituency parsing task. Furthermore, we utilize the linguistic typology knowledge in the WALS database to improve the cross-lingual transfer ability of our proposed parser. PDF 10 2021
A Double-Graph Based Framework for Frame Semantic Parsing Frame semantic parsing is a fundamental NLP task, which consists of three subtasks: frame identification, argument identification and role classification. Most previous studies tend to neglect relations between different subtasks and arguments and pay little attention to ontological frame knowledge defined in FrameNet. In this paper, we propose a Knowledge-guided Incremental semantic parser with Double-graph (KID). We first introduce Frame Knowledge Graph (FKG), a heterogeneous graph containing both frames and FEs (Frame Elements) built on the frame knowledge so that we can derive knowledge-enhanced representations for frames and FEs. Besides, we propose Frame Semantic Graph (FSG) to represent frame semantic structures extracted from the text with graph structures. In this way, we can transform frame semantic parsing into an incremental graph construction problem to strengthen interactions between subtasks and relations between arguments. Our experiments show that KID outperforms the previous state-of-the-art method by up to 1.7 F1-score on two FrameNet datasets. PDF 10 2021
A Copy-Augmented Generative Model for Open-Domain Question Answering Open-domain question answering is a challenging task with a wide variety of practical applications. Existing modern approaches mostly follow a standard two-stage paradigm: retriever then reader. In this article, we focus on improving the effectiveness of the reader module and propose a novel copy-augmented generative approach that integrates the merits of both extractive and generative readers. In particular, our model is built upon the powerful generative model FiD (Izacard and Grave, 2021). We enhance the original generative reader by incorporating a pointer network to encourage the model to directly copy words from the retrieved passages. We conduct experiments on two benchmark datasets, Natural Questions and TriviaQA, and the empirical results demonstrate the performance gains of our proposed approach. PDF 10 2021
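The pointer-network enhancement follows the generic pointer-generator recipe: a gate blends the decoder's vocabulary distribution with a copy distribution obtained by scattering attention mass over source token ids. The sketch below shows that mixture; how it is wired into FiD specifically is the paper's contribution, not this code.

# A sketch of the generic pointer-generator mixture: a gate p_gen blends the
# decoder's vocabulary distribution with a copy distribution obtained by
# scattering attention over source token ids.
import torch
import torch.nn.functional as F

def copy_augmented_dist(vocab_logits, attn, src_ids, p_gen):
    # vocab_logits: (batch, vocab); attn: (batch, src_len) over source tokens
    # src_ids: (batch, src_len) vocabulary ids; p_gen: (batch, 1) in [0, 1]
    vocab_dist = F.softmax(vocab_logits, dim=-1)
    copy_dist = torch.zeros_like(vocab_dist)
    copy_dist.scatter_add_(1, src_ids, attn)        # route attention mass to ids
    return p_gen * vocab_dist + (1.0 - p_gen) * copy_dist

batch, vocab, src_len = 2, 100, 7
dist = copy_augmented_dist(torch.randn(batch, vocab),
                           F.softmax(torch.randn(batch, src_len), dim=-1),
                           torch.randint(vocab, (batch, src_len)),
                           torch.sigmoid(torch.randn(batch, 1)))
assert torch.allclose(dist.sum(-1), torch.ones(batch))

Because both component distributions sum to one and the gate weights sum to one, the mixture is itself a valid distribution, which the final assertion checks.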
A Zero-Resource Approach to Cross-Lingual Query-Focused Abstractive Summarization We present a novel approach for cross-lingual query-focused abstractive summarization (QFAS) that leverages the translate-then-summarize paradigm. We approach cross-lingual QFAS as a zero-resource problem and introduce a framework to create a synthetic QFAS corpus from a standard summarization corpus using a novel query-generation strategy. Our model summarizes documents in foreign languages for which translation quality is poor. It learns not only to identify and condense salient information relevant to a query, but also to appropriately rephrase grammatical errors and disfluencies that may occur in the noisy translations. Our technique enhances a pre-trained encoder-decoder transformer by introducing query focus to the encoder. We show that our method for creating synthetic QFAS data leads to more robust models that not only achieve state-of-the-art performance on our corpus, but also perform better on out-of-distribution data as compared to prior work. PDF 11 2021
Combating Spurious Features by Distribution Difference Regularization Prior studies show that spurious features are unavoidable in the data collection process. These spurious features create shortcuts that cause a model to make poor predictions on real-world test data because it ignores the genuine features. In this work, we focus on designing a learning scheme that hinders the model from leveraging spurious features. To achieve this, prior studies usually make strong assumptions about the spurious features and identify them purely by manipulating the training data. In contrast, we make weaker assumptions and propose a new framework for combating spurious features by observing the distribution shift between training and auxiliary data. In particular, with the help of unlabeled auxiliary data, we design a regularization technique based on the embedding distribution difference between training and auxiliary data to mitigate the effect of spurious features. Experimental results on NLI and coreference resolution tasks demonstrate that we improve models on out-of-domain test data and reduce the contribution of spurious features to model predictions. PDF 11 2021
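One plausible instantiation of such a regularizer, assuming an RBF-kernel maximum mean discrepancy (MMD) as the distribution-difference penalty between training and auxiliary embeddings; the paper's exact distance measure may differ.

# One plausible instantiation, assuming an RBF-kernel maximum mean
# discrepancy (MMD) as the distribution-difference penalty between training
# and auxiliary embeddings; the paper's exact measure may differ.
import torch

def rbf_mmd(x, y, sigma=1.0):
    # x: (n, d) training embeddings; y: (m, d) auxiliary embeddings
    def kernel(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

def regularized_loss(task_loss, train_emb, aux_emb, lam=0.1):
    # Penalize embedding-distribution shift alongside the task objective.
    return task_loss + lam * rbf_mmd(train_emb, aux_emb)

loss = regularized_loss(torch.tensor(0.5), torch.randn(32, 128), torch.randn(32, 128))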
Seq2rel: A sequence-to-sequence-based approach for document-level relation extraction Motivated by the fact that many relations cross the sentence boundary, there has been increasing interest in document-level relation extraction (RE). Document-level RE requires integrating information within and across sentences, capturing complex interactions between mentions of interacting entities. Most document-level RE methods proposed to date are pipeline-based, requiring entities as input. However, previous work has demonstrated that jointly learning to extract entities and relations can improve performance and be more efficient due to shared parameters and training steps. In this paper, we develop a sequence-to-sequence-based approach that can learn the sub-tasks of document-level RE --- entity extraction, coreference resolution and relation extraction --- in an end-to-end fashion. We evaluate our approach on several datasets, in some cases exceeding the performance of existing methods. Finally, we demonstrate that, under our model, the end-to-end approach outperforms a pipeline-based approach. Our code and models will be made publicly available. PDF 11 2021
What Works and Doesn't Work, A Deep Decoder for Neural Machine Translation Deep learning has demonstrated performance advantages in a wide range of natural language processing tasks, including neural machine translation (NMT). Transformer NMT models are typically strengthened by deeper encoder layers, but deepening their decoder layers usually results in failure. In this paper, we first identify the cause of the failure of the deep decoder in the Transformer model. Inspired by this discovery, we then propose approaches to improving it, with respect to model structure and model training, to make the deep decoder practical in NMT. Specifically, with respect to model structure, we propose a cross-attention drop mechanism to allow the decoder layers to perform their own different roles, to reduce the difficulty of deep-decoder learning. For model training, we propose a collapse reducing training approach to improve the stability and effectiveness of deep-decoder training. We experimentally evaluated our proposed Transformer NMT model structure modification and novel training methods on several popular machine translation benchmarks. The results showed that deepening the NMT model by increasing the number of decoder layers successfully prevented the deepened decoder from degrading to an unconditional language model. In contrast to prior work on deepening an NMT model on the encoder, our method can deepen the model on both the encoder and decoder at the same time, resulting in a deeper model and improved performance. PDF 11 2021
Interpretability on clinical analysis from Pattern Disentanglement Insight Automated analysis of clinical conditions can help medical professionals save time in making a clinical diagnosis and prevent risks from being overlooked. We therefore explore the problem using free-text medical notes recorded in electronic health records (EHR). MIMIC III is a de-identified EHR database containing observations from over 40,000 patients in critical care units. Because the text corpus is unstructured, unlike a database table, existing machine learning models may be unable to produce interpretable results, which are often desirable for clinical diagnosis. Hence, in this paper, we propose a text mining and pattern discovery solution to discover strong association patterns between patient discharge summaries and codes of the international classification of diseases (ICD9 codes). The proposed approach offers a straightforward interpretation of the underlying relations of patient characteristics in an unsupervised machine learning setting. The clustering results outperform the baseline clustering algorithm and are comparable to baseline supervised methods. PDF 11 2021
How do we get there? Evaluating transformer neural networks as cognitive models for English past tense inflection Neural network models have achieved good performance on morphological inflection tasks, including English past tense inflection. However, whether they can represent human cognitive mechanisms is still under debate. In this work, we examine transformer models with different training sizes and show that: 1) neural models correlate with both human behaviors and cognitive theories' predictions on nonce verbs, and the model trained on small-size data that matches parents' input distribution has the highest correlation; 2) neural models make different types of errors on regular and irregular verbs, exhibiting a clear distinction between regulars and irregulars. Therefore, we conclude that neural networks have the potential to be good cognitive models for English past tense. PDF 11 2021
Retrieval Data Augmentation Informed by Downstream Question Answering Performance Training retrieval models to fetch contexts for Question Answering (QA) over large corpora requires labeling relevant passages in those corpora. Since obtaining exhaustive manual annotations of all relevant passages is not feasible, prior work uses text overlap heuristics to find passages that are likely to contain the answer, but such heuristics break down when the task requires deeper reasoning and answers are not extractable spans (e.g.: multi-hop, discrete reasoning). We address this issue by identifying relevant passages based on whether they are useful for a trained QA model to arrive at the correct answers, and develop a search process guided by the QA model's loss. Our experiments show that this approach enables identifying relevant context for unseen data more than 90% of the time on the IIRC dataset, and that the resulting models generalize better to the end QA task than those trained on just the gold retrieval data on the IIRC and QASC datasets. PDF 11 2021
Modeling Multi-hop Question Answering as Single Sequence Prediction Fusion-in-decoder (Fid) (Izacard and Grave, 2020) is a generative question answering (QA) model that leverages passage retrieval with a pre-trained transformer and pushed the state of the art on single-hop QA. However, the complexity of multi-hop QA hinders the effectiveness of the generative QA approach. In this work, we propose a simple generative approach (PathFid) that extends the task beyond just answer generation by explicitly modeling the reasoning process to resolve the answer for multi-hop questions. By linearizing the hierarchical reasoning path of supporting passages, their key sentences, and finally the factoid answer, we cast the problem as a single sequence prediction task. To facilitate complex reasoning with multiple clues, we further extend the unified flat representation of multiple input documents by encoding cross-passage interactions. Our extensive experiments demonstrate that PathFid leads to strong performance gains on two multi-hop QA datasets: HotpotQA and IIRC. Besides the performance gains, PathFid is more interpretable, which in turn yields answers that are more faithfully grounded to the supporting passages and facts compared to the baseline Fid model. PDF 11 2021
Analyzing Dynamic Adversarial Training Data in the Limit To create models that are robust across a wide range of test inputs, training datasets should include diverse examples that span numerous phenomena. Dynamic adversarial data collection (DADC), where annotators craft examples that challenge continually improving models, holds promise as an approach for generating such diverse training sets. Prior work has shown that running DADC over 1-3 rounds can help models fix some error types, but it does not necessarily lead to better generalization beyond adversarial test data. We argue that running DADC over many rounds maximizes its training-time benefits, as the different rounds can together cover many of the task-relevant phenomena. We present the first study of longer-term DADC, where we collect 20 rounds of NLI examples for a small set of premise paragraphs, with both adversarial and non-adversarial approaches. Models trained on DADC examples make 26% fewer errors on our expert-curated test set compared to models trained on non-adversarial data. Our analysis shows that DADC yields examples that are more difficult, more lexically and syntactically diverse, and contain fewer annotation artifacts compared to non-adversarial examples. PDF 11 2021
Fast and Accurate Transformer-based Translation with Character-Level Encoding and Subword-Level Decoding The Transformer translation model is fast to train and achieves state-of-the-art results for various translation tasks. However, unknown input words at test time remain a challenge for the Transformer, especially when unknown words are segmented into inappropriate subword sequences that break morpheme boundaries. This paper improves the Transformer model to learn more accurate source representations via character-level encoding. We simply adopt character sequences instead of subword sequences as input of the standard Transformer encoder and propose contextualized character embedding (CCEmb) to help character-level encoding. Our CCEmb contains information about the current character and its context by adding the embeddings of its contextual character $n$-grams. The CCEmb causes little extra computational cost and we show that our model with a character-level encoder and a standard subword-level Transformer decoder can outperform the original pure subword-level Transformer, especially for translating source sentences that contain unknown (or rare) words. PDF 11 2021
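A minimal sketch of how a contextualized character embedding of this flavor could be assembled, adding hashed contextual n-gram embeddings to each character embedding; the bucketing scheme and table sizes are our assumptions, not the paper's configuration:

    import zlib
    import torch
    import torch.nn as nn

    class CCEmb(nn.Module):
        """Character embedding plus summed embeddings of hashed contextual n-grams (sketch)."""
        def __init__(self, n_buckets=100000, dim=64, max_n=3):
            super().__init__()
            self.emb = nn.Embedding(n_buckets, dim)   # one table for hashed chars and n-grams
            self.n_buckets, self.max_n = n_buckets, max_n

        def bucket(self, s):
            return zlib.crc32(s.encode()) % self.n_buckets   # stable hash to an embedding row

        def forward(self, chars):
            vecs = []
            for i, c in enumerate(chars):
                v = self.emb(torch.tensor(self.bucket(c)))
                for n in range(2, self.max_n + 1):           # every n-gram covering position i
                    for s in range(max(0, i - n + 1), min(i, len(chars) - n) + 1):
                        v = v + self.emb(torch.tensor(self.bucket("".join(chars[s:s + n]))))
                vecs.append(v)
            return torch.stack(vecs)                          # (seq_len, dim) encoder input

    x = CCEmb()(list("translation"))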
How do we answer complex questions: Discourse structure of long form answers Long form answers, consisting of multiple sentences, can provide nuanced and comprehensive answers to a broader set of questions. However, little prior work exists on this task. To better understand this complex task, we study the functional structure of long form answers on two datasets, Natural Questions~\cite{kwiatkowski2019natural} and ELI5~\cite{Fan2019ELI5LF}. Our main goal is to understand how humans organize information to craft complex answers. We develop an ontology of sentence-level functional roles for long form answers, and annotate 3.3k sentences in 542 examples. Our annotated data enables training a reliable role classifier that can be used for automatic analysis and thus reveals machine generated answers are structured worse than human written answers. Our data further yields an extractive summarization dataset for long form answers, giving models the ability to identify a concise answer to a complex query. PDF 11 2021
Towards Community-Driven NLP: Measuring Geographic Performance Disparities of Offensive Language Classifiers Text classifiers are applied at scale in the form of one-size-fits-all solutions. Nevertheless, many studies show that classifiers are biased with respect to different languages and dialects. Both the style and the content of language change depending on the location where it is posted. For example, states that border Mexico may be more likely to discuss issues regarding immigration from Latin America. However, several questions remain, such as ``Do changes in the style and content of text across geographic regions impact model performance?''. We introduce a novel dataset called GeoOLID with more than 13 thousand examples across 15 geographically and demographically diverse cities to address this question. Furthermore, we perform a comprehensive analysis of geographical content and stylistic differences and their interaction in causing performance disparities of Offensive Language Detection models. Overall, we find that current models do not generalize across locations. Likewise, we show that understanding broad dialects (e.g., African American English) is not the only predictive factor of model performance when applied to cities with large minority populations. Hence, community-specific evaluation is vital for real-world applications. Warning: This paper contains offensive language. PDF 11 2021
Probing Difficulty and Discrimination of Natural Language Questions With Item Response Theory Item Response Theory (IRT) has been extensively used to characterize question difficulty for human subjects in domains including cognitive psychology and education (Primi et al., 2014; Downing, 2003). In this work, we explore IRT to characterize the difficulty and discrimination of natural language questions in Question-Answering datasets. We use HotPotQA for illustration. Our analysis reveals significant variations along these traits, as well as interdependence between them. Additionally, we explore predictive models for directly estimating these traits from the text of the questions and answers. Our experiments show that it is possible to predict both difficulty and discrimination parameters for new questions, and these traits are correlated with features of questions, answers, and associated contexts. Our findings can have significant implications for the creation of new datasets and tests on the one hand and strategies such as active learning and curriculum learning on the other. PDF 11 2021
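For reference, the two-parameter logistic (2PL) IRT model underlying the difficulty and discrimination traits can be written as (a textbook formulation, not taken from the paper):

$$P(y_{ij} = 1 \mid \theta_j) = \frac{1}{1 + \exp\bigl(-a_i(\theta_j - b_i)\bigr)}$$

where $\theta_j$ is the ability of respondent $j$, $b_i$ is the difficulty of question $i$, and $a_i$ is its discrimination, i.e., how sharply the question separates respondents whose ability lies above versus below its difficulty level.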
Fill-in-the-Blank: A Challenging Video Understanding Evaluation Framework We propose fill-in-the-blanks as a video understanding evaluation framework. The task tests a model's understanding of a video by requiring the model to predict a masked noun phrase in the caption of the video, given the video and the surrounding text. To this end, we introduce a novel dataset consisting of 28,000 videos and fill-in-the-blank tests with multiple correct answers. The task and the dataset are challenging for the current state-of-the-art systems to solve. This task also does not share the weaknesses of the current state of the art language-informed video understanding tasks, namely: (1) video question answering using multiple-choice questions, where models perform relatively well because they exploit linguistic biases in the task formulation; and (2) video captioning, which relies on an open-ended evaluation framework that is often inaccurate because system answers may be perceived as incorrect if they differ in form from the ground truth. PDF 11 2021
Uncertainty-Based Joint Training For Semi-Supervised Math Word Problem Math word problems (MWPs) require converting natural-language math text into structured equation forms. Data sparsity is one of the main obstacles for math word problem understanding due to the high cost of human annotation. However, existing work mainly starts from the supervised learning perspective, leaving the low-resource scenario underexplored. In this paper, we are the first to incorporate a semi-supervised learning (SSL) framework into MWPs. We propose an uncertainty-aware unlabeled data selection strategy, which provides access to reliable samples and increases the model capacity gradually. Besides, to improve the quality of pseudo equations, we incorporate two indirect supervision signals that account for the semantic consistency property and grammar format constraints of generated equations. Experimental results on two benchmark MWP datasets across different ratios of unlabeled data verify the effectiveness and generalization ability of our proposed method. PDF 11 2021
Same Neurons, Different Languages: Probing Morphosyntax in Multilingual Pre-trained Models The success of multilingual pre-trained models in transferring knowledge cross-lingually is underpinned by their ability to learn representations shared by multiple languages even in absence of any explicit supervision. However, it remains unclear how. In this work, we conjecture that multilingual pre-trained models can derive language-universal abstractions about grammar. In particular, we investigate whether morphosyntactic information is encoded in the same subset of neurons in different languages. We conduct the first large-scale empirical study over 43 typologically diverse languages and 14 morphosyntactic categories with a state-of-the-art neuron-level probe. Our findings show that the cross-lingual overlap between neurons is significant, but its extent may vary across categories and depends on language proximity and pre-training data size. PDF 11 2021
Pair-Based Joint Learning with Relational Graph Convolutional Networks for Emotion-Cause Pair Extraction Emotion-cause pair extraction (ECPE), which has recently received increasing attention, aims to extract emotion clauses and their corresponding cause clauses. Previous methods sequentially encode features in a specified order: they first encode the emotion and cause features for clause extraction and then combine them for pair extraction, leading to an imbalance in inter-task feature interaction where features extracted later have no direct contact with the earlier ones. To this end, we propose a novel joint encoding network, which generates pair and clause features simultaneously in a joint feature learning manner to model the causal relationships between clauses. Specifically, from a multi-relational perspective, we construct a heterogeneous undirected graph and apply a Relational Graph Convolutional Network (RGCN) to capture the complex relationships between clauses and between pairs and clauses. Experimental results show that our model achieves state-of-the-art performance on the Chinese benchmark corpus. PDF 11 2021
CONJR: Conjunctive Sentence Splitter without Parsing In this paper, we observe and address the challenges of splitting conjunctive sentences around each group of conjuncts. Most existing methods rely on parsers to identify the conjuncts in a sentence and detect the coordination boundaries. However, state-of-the-art syntactic parsers are slow and suffer from errors, especially for long and complicated sentences. In order to better solve the problems, we formulate coordination boundary detection as a sequence tagging task and propose a specialized model CONJR without using syntactic parsers. We introduce both semantic and syntactic features and a specially designed attention mechanism to capture the symmetry among the potential conjuncts. The experimental results on datasets from various domains demonstrate the effectiveness of our proposed methods. PDF 11 2021
Remove Noise and Keep Truth: A Noisy Channel Model for Semantic Role Labeling Semantic role labeling usually models structures using sequences, trees, or graphs. Past work has focused on novel modeling methods and neural structures and on integrating more features. In this paper, we re-examine the noise in neural semantic role labeling models, a problem that has long been ignored. By proposing a noisy channel model structure, we effectively eliminate the noise in the labeling flow and thus improve performance. Without relying on additional features, our proposed novel model significantly outperforms a strong baseline on multiple popular semantic role labeling benchmarks, which demonstrates the effectiveness and robustness of our proposed model. PDF 11 2021
Enhancing Neural Machine Translation with Syntactic Ambiguities Benefiting from its data-driven end-to-end architecture, neural machine translation has obvious performance advantages over statistical machine translation, but its demand for data, both monolingual and parallel corpora, is also significantly greater. Most past studies have focused on reducing the demand for parallel corpora or making more effective use of limited parallel corpora. In this work, we study a method that uses the ambiguity of syntactic structure to make more effective use of monolingual corpora. Experiments conducted on multiple benchmarks for various languages show that our method yields larger improvements than using back-translation alone, demonstrating the effectiveness of our proposed method. PDF 11 2021
All Birds with One Stone: Multi-task Learning for Inference with One Forward Pass Task-specific fine-tuning of pre-trained language models like Transformers has shown its effectiveness in various NLP tasks. To achieve better storage efficiency and model performance, Multi-task Learning (MTL) has been studied to share model parameters and utilize knowledge transfer between tasks. However, in real applications where enormous numbers of tasks (e.g., large sets of labels to be classified) need to be conducted on a large corpus, the inference efficiency is still hindered by the number of tasks. For a document with N sets of labels to be predicted, recent MTL methods with adaptive modules or prompts need to encode the input data N times to extract the hidden representations needed for the tasks. Note that the hidden representation is not sharable between tasks, as task-specific features are extracted at the very bottom layers of the Transformer. In this paper, we seek to maintain the computational efficiency of requiring only one forward pass for a document to obtain a generalized feature for all N tasks, without sacrificing overall model performance. We design a prompt-sharing module to let the model take all tasks into consideration and output N heads simultaneously. We also design a dynamic task scheduling module to sample tasks according to their training progress. In our evaluation, we show that our method is able to outperform the previous MTL state of the art and single-task fine-tuning by 0.4-1.5% on the GLUE benchmark. We also perform a comprehensive module analysis to demonstrate the effectiveness and robustness of our method. PDF 11 2021
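The core efficiency idea, one encoder pass serving N task heads, can be sketched in a few lines of PyTorch; the encoder and head shapes below are placeholders, not the paper's architecture:

    import torch
    import torch.nn as nn

    class OneForwardMultiTask(nn.Module):
        """One encoder pass, N task heads on the shared feature (sketch of the setup)."""
        def __init__(self, encoder, hidden, label_counts):
            super().__init__()
            self.encoder = encoder
            self.heads = nn.ModuleList(nn.Linear(hidden, c) for c in label_counts)

        def forward(self, x):
            h = self.encoder(x)                       # encode the document once
            return [head(h) for head in self.heads]   # N predictions, no re-encoding

    enc = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
    model = OneForwardMultiTask(enc, 64, [2, 5, 3])   # three tasks served by one pass
    outs = model(torch.randn(4, 128))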
Feature-rich Open-vocabulary Interpretable Neural Representations for All of the World’s 7000 Languages Modern NLP research is firmly predicated on two assumptions: that very large corpora are available, and that the word, rather than the morpheme, is the primary meaning-bearing unit of language. For the vast majority of the world's languages, these assumptions fail to hold, and as a result existing state-of-the-art neural representations such as BERT fail to meet the needs of thousands of languages. In this paper, we present a novel general-purpose neural representation using Tensor Product Representations that is designed from the beginning to be both linguistically interpretable and fully capable of handling the broad variety found in the world's diverse set of 7000 languages, regardless of corpus size or morphological characteristics. We demonstrate the applicability of our representation through examples drawn from a typologically diverse set of languages whose morphology includes prefixes, suffixes, infixes, circumfixes, templatic morphemes, derivational morphemes, inflectional morphemes, and reduplication. PDF 11 2021
Improving Controllable Text Generation with Position-Aware Weighted Decoding Weighted decoding methods composed of a pretrained language model (LM) and a controller have achieved promising results for controllable text generation. However, these models often suffer from a control strength/fluency trade-off, as higher control strength is more likely to generate incoherent and repetitive text. In this paper, we show that this trade-off arises from the controller imposing the target attribute on the LM at improper positions. We propose a novel framework built on existing weighted decoding methods, called CAT-PAW, which introduces a lightweight regulator to adjust bias signals from the controller at different decoding positions. Experiments on positive sentiment control, topic control, and language detoxification show the effectiveness of CAT-PAW on top of 4 SOTA models. PDF 11 2021
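A minimal sketch of position-aware weighted decoding: the controller's bias on the LM logits is scaled by a position-dependent weight. The decay schedule below is a hand-written stand-in for the learned regulator described in the abstract:

    import torch

    def regulated_step(lm_logits, ctrl_logits, t, base_w=2.0, decay=0.05):
        """Bias LM logits with a controller signal whose weight depends on position t.
        The decay schedule is a stand-in for the learned lightweight regulator."""
        w_t = base_w / (1.0 + decay * t)     # e.g., relax control at later positions
        return lm_logits + w_t * ctrl_logits

    logits = regulated_step(torch.randn(50257), torch.randn(50257), t=12)
    next_token = int(logits.argmax())        # or sample from softmax(logits)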
NSP-BERT: A Prompt-based Zero-Shot Learner Through an Original Pre-training Task —— Next Sentence Prediction Using prompts to make language models perform various downstream tasks, also known as prompt-based learning or prompt-learning, has lately achieved significant success in comparison to the pre-train and fine-tune paradigm. Nonetheless, virtually all prompt-based methods are token-level, meaning they all utilize GPT's left-to-right language model or BERT's masked language model to perform cloze-style tasks. In this paper, we attempt to accomplish several NLP tasks in the zero-shot scenario using an original BERT pre-training task abandoned by RoBERTa and other models—Next Sentence Prediction (NSP). Unlike token-level techniques, our sentence-level prompt-based method NSP-BERT does not need to fix the length of the prompt or the position to be predicted, allowing it to handle tasks such as entity linking with ease. Based on the characteristics of NSP-BERT, we offer several quick building templates for various downstream tasks. In particular, we suggest a two-stage prompt method for word sense disambiguation tasks. Our samples-contrast method for mapping the labels significantly enhances the model's performance on sentence-pair tasks. On the Chinese benchmark FewCLUE, NSP-BERT outperforms other zero-shot methods on most of these tasks and comes close to the few-shot methods; on GLUE and other English datasets, NSP-BERT remains competitive. Our code will be available on GitHub. PDF 11 2021
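A rough sketch of zero-shot classification with the NSP head, using HuggingFace Transformers; the English model and the templates are our illustrative choices (the paper evaluates mainly on Chinese FewCLUE), not the authors' implementation:

    import torch
    from transformers import BertForNextSentencePrediction, BertTokenizer

    tok = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

    text = "The striker scored twice in the second half."
    prompts = {"sports": "This text is about sports.",
               "finance": "This text is about finance."}

    scores = {}
    for label, prompt in prompts.items():
        enc = tok(text, prompt, return_tensors="pt")
        with torch.no_grad():
            logits = model(**enc).logits     # column 0 scores "sentence B follows A"
        scores[label] = logits[0, 0].item()
    print(max(scores, key=scores.get))       # predicted label, no task-specific training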
Counting What Deserves to be Counted for Graph Parsing Graph parsers rely on scoring every subgraph when building a complete graph. In real syntactic or semantic parsing, the different types of subgraphs, in terms of syntactic or semantic roles, may follow a quite unbalanced distribution, which does not seem well captured by current graph parsing models. We therefore propose an enhanced model design that lets the parser explicitly capture this kind of unbalanced distribution. In detail, we introduce an Accumulative Operation-based Induction (AOI) attention mechanism to assign accumulative scores to words. The AOI scorer successfully approximates the word-level unbalanced distribution. With a conceptually simple but general-purpose design, our proposed AOI attention enhancement indeed leads to better parsing performance on a wide range of datasets for different parsing tasks, which verifies the scalability and robustness of capturing diverse subgraph distributions. PDF 11 2021
A Probabilistic Framework for Analyzing Moral Perspectives in the COVID-19 Vaccine Debate The Covid-19 pandemic has led to an infodemic of low-quality information, leading to poor health decisions. Combating the outcomes of this infodemic is not only a question of identifying false claims; it requires understanding the reasoning behind the decisions individuals make. In this work we propose a holistic analysis framework connecting stance and reason analysis with fine-grained entity-level moral sentiment analysis. We study how to model the dependencies between the different levels of analysis and incorporate human insights into the learning process. Our experiments show that our framework can provide reliable predictions even in low-supervision settings. PDF 11 2021
Non-Linear Relational Information Probing in Word Embeddings Pre-trained word embeddings such as SkipGram and GloVe are known to contain a myriad of useful information about words. In this work, we use multilayer perceptrons (MLPs) to probe the relational information contained in these word embeddings. Previous studies that use linear models on the analogy and relation induction tasks have shown that SkipGram generally outperforms GloVe, suggesting that SkipGram embeddings contain more relational information than GloVe embeddings. However, by using a non-linear probe such as an MLP, our results instead suggest that GloVe embeddings contain more relational information than SkipGram embeddings, but a good amount of it is stored in a non-linear form and thus previous linear models failed to reveal it. Interpreting our relation probes using post-hoc analysis provides us with an explanation for this difference. PDF 11 2021
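A minimal sketch of such a non-linear relation probe using scikit-learn; the random vectors below stand in for real SkipGram or GloVe embeddings of word pairs:

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    # Stand-ins for (head, tail) word vectors; the paper would use SkipGram/GloVe here.
    head, tail = rng.normal(size=(300, 50)), rng.normal(size=(300, 50))
    X = np.concatenate([head, tail], axis=1)   # pair representation fed to the probe
    y = rng.integers(0, 4, size=300)           # relation label, e.g., capital-of

    probe = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=0)
    probe.fit(X, y)
    print(probe.score(X, y))   # a non-linear probe can recover relations a linear one misses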
N-grammer: Augmenting Transformers with latent n-grams Transformer models have recently emerged as one of the foundational models in natural language processing, and as a byproduct, there has been significant recent interest and investment in scaling these models. However, the training and inference costs of these large Transformer language models are prohibitive, thus necessitating more research in identifying more efficient variants. In this work, we propose a simple yet effective modification to the Transformer architecture inspired by the literature in statistical language modeling, by augmenting the model with n-grams constructed from a discrete latent representation of the text sequence. We evaluate our model, the N-grammer on language modeling on the C4 data-set, and find that it outperforms several strong baselines such as the Transformer and the Primer. We will open-source our model for reproducibility purposes upon acceptance. PDF 11 2021
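A toy PyTorch sketch of the latent n-gram idea for bigrams: token embeddings are quantized against learned centroids, adjacent code ids are fused into bigram ids, and a bigram embedding is added back in. Sizes and details are our assumptions, not the paper's configuration:

    import torch
    import torch.nn as nn

    class LatentBigram(nn.Module):
        """Augment token embeddings with embeddings of latent bigram ids (sketch)."""
        def __init__(self, n_clusters=256, dim=64):
            super().__init__()
            self.centroids = nn.Parameter(torch.randn(n_clusters, dim))
            self.bigram_emb = nn.Embedding(n_clusters * n_clusters, dim)
            self.n_clusters = n_clusters

        def forward(self, tok_emb):                                  # (batch, seq, dim)
            d = ((tok_emb.unsqueeze(-2) - self.centroids) ** 2).sum(-1)
            ids = d.argmin(-1)                                       # discrete latent codes
            bigram_ids = ids[:, :-1] * self.n_clusters + ids[:, 1:]  # fuse adjacent codes
            out = tok_emb.clone()
            out[:, 1:] = out[:, 1:] + self.bigram_emb(bigram_ids)    # inject n-gram signal
            return out

    aug = LatentBigram()(torch.randn(2, 16, 64))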
Discontinuous Constituency and BERT: A Case Study of Dutch In this paper, we set out to quantify the syntactic capacity of BERT in the evaluation regime of non-context free patterns, as occurring in Dutch. We devise a test suite based on a mildly context-sensitive formalism, from which we derive grammars that capture the linguistic phenomena of control verb nesting and verb raising. The grammars, paired with a small lexicon, provide us with a large collection of naturalistic utterances, annotated with verb-subject pairings, that serve as the evaluation test bed for an attention-based span selection probe. Our results, backed by extensive analysis, suggest that the models investigated fail in the implicit acquisition of the dependencies examined. PDF 11 2021
The Power of Prompt Tuning for Low-Resource Semantic Parsing Prompt tuning has recently emerged as an effective method for adapting pre-trained language models to a number of language understanding and generation tasks. In this paper, we investigate prompt tuning for semantic parsing—the task of mapping natural language utterances onto formal meaning representations. On the low-resource splits of Overnight and TOPv2, we find that a prompt tuned T5-xl significantly outperforms its fine-tuned counterpart, as well as strong GPT-3 and BART baselines. We also conduct ablation studies across different model scales and target representations, finding that, with increasing model scale, prompt tuned T5 models improve at generating target representations that are far from the pre-training distribution. PDF 11 2021
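For context, prompt tuning trains only a small matrix of prompt vectors prepended to the input embeddings of a frozen model; a minimal PyTorch sketch (dimensions are illustrative):

    import torch
    import torch.nn as nn

    class SoftPrompt(nn.Module):
        """Trainable prompt vectors prepended to the input embeddings of a frozen LM (sketch)."""
        def __init__(self, n_prompt=20, dim=768):
            super().__init__()
            self.prompt = nn.Parameter(torch.randn(n_prompt, dim) * 0.02)

        def forward(self, input_embeds):                       # (batch, seq, dim)
            p = self.prompt.unsqueeze(0).expand(input_embeds.shape[0], -1, -1)
            return torch.cat([p, input_embeds], dim=1)         # only self.prompt gets gradients

    # With a HuggingFace T5, the result would be passed as `inputs_embeds`
    # while all pre-trained parameters stay frozen.
    out = SoftPrompt()(torch.randn(2, 10, 768))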
Combining static and contextualised multilingual embeddings Static and contextual multilingual embeddings have complementary strengths. Static embeddings, while less expressive than contextual language models, can be more straightforwardly aligned across multiple languages. Contextual language models are more powerful. We combine the strengths of static and contextual models to improve multilingual representations. We extract static embeddings for 40 languages from XLM-R, validate those embeddings with cross-lingual word retrieval, and then align them using VecMap. This results in high-quality, highly multilingual static embeddings. Then we apply a novel continued pre-training approach to XLM-R, leveraging the high quality alignment of our static embeddings to better align the representation space of XLM-R. We show positive results for multiple complex semantic tasks. We will release the static embeddings and the continued pre-training code. PDF 11 2021
Efficient Speech Translation with Pre-trained models When building state-of-the-art speech translation models, the need for large computational resources is a significant obstacle due to the large training data size and complex models. The availability of pre-trained models is a promising opportunity to build strong speech translation systems efficiently. In a first step, we investigate efficient strategies to build cascaded and end-to-end speech translation systems based on pre-trained models. Using this strategy, we can train and apply the models on a single GPU. While the end-to-end models show superior translation performance to cascaded ones, the application of this technology is limited by the need for additional end-to-end training data. In a second step, we propose an additional similarity loss to encourage the model to generate similar hidden representations for speech and transcript. Using this technique, we can increase the data efficiency and improve the translation quality by 6 BLEU points in scenarios with limited end-to-end training data. PDF 11 2021
Extracting and Inferring Personal Attributes from Dialogue Personal attributes represent structured information about a person, such as their hobbies, pets, family, likes and dislikes. We introduce the tasks of extracting and inferring personal attributes from human-human dialogue, and analyze the linguistic demands of these tasks. To meet these challenges, we introduce a simple and extensible model that combines an autoregressive language model utilizing constrained attribute generation with a discriminative reranker. Our model outperforms strong baselines both on extracting personal attributes and on inferring personal attributes that are not contained verbatim in utterances and instead require commonsense reasoning and lexical inferences, which occur frequently in everyday conversation. Finally, we demonstrate the benefit of incorporating personal attributes in social chit-chat and task-oriented dialogue settings. PDF 11 2021
Divide and Denoise: Learning from Noisy Labels in Fine-grained Entity Typing with Cluster-wise Loss Correction Fine-grained Entity Typing (FET) has witnessed great progress since distant supervision was introduced, but it still suffers from label noise. Existing noise control methods applied to FET rely on the predicted distribution and deal with instances in isolation, and thus suffer from confirmation bias. In this work, we propose to tackle these two limitations with a cluster-based loss correction framework named Feature Cluster Loss Correction (FCLC). FCLC first trains a coarse backbone model as a feature extractor and noise estimator, and then performs loss correction on each cluster to learn directly from the noisy labels. Experimental results on three public datasets show that FCLC achieves the best performance over existing competitive systems. Auxiliary experiments further show that FCLC is stable to hyperparameters and even works in extreme scenarios where no clean data is available. PDF 11 2021
Learn to Adapt for Generalized Zero-Shot Text Classification Generalized zero-shot text classification aims to classify textual instances from both previously seen classes and incrementally emerging unseen classes. Most existing methods generalize poorly because the learned parameters are optimal only for seen classes rather than for both kinds of classes, and the parameters remain stationary during prediction. To address these challenges, we propose a novel Learn to Adapt (LTA) network using a variant meta-learning framework. Specifically, LTA trains multiple meta-learners by using both seen classes and virtual unseen classes to simulate a generalized zero-shot learning (GZSL) scenario that matches the test-time setting, and simultaneously learns to calibrate the class prototypes and sample representations to make the learned parameters adaptive to incoming unseen classes. We claim that the proposed model is capable of mapping all prototypes and samples from both kinds of classes to a more consistent distribution in the global space. Extensive experiments on five text classification datasets show that our model outperforms several competitive previous approaches by large margins. The code and the whole datasets will be available after paper publication. PDF 11 2021
XLM-E: Cross-lingual Language Model Pre-training via ELECTRA In this paper, we introduce ELECTRA-style tasks to cross-lingual language model pre-training. Specifically, we present two pre-training tasks, namely multilingual replaced token detection and translation replaced token detection. Besides, we pretrain the model, named XLM-E, on both multilingual and parallel corpora. Our model outperforms the baseline models on various cross-lingual understanding tasks with much less computation cost. Moreover, analysis shows that XLM-E tends to obtain better cross-lingual transferability. PDF 11 2021
Active Dialogue Simulation in Conversational Systems Semantic parsing helps conversational systems satisfy users' requests through dialogues. Collecting annotated dialogues to train these models is a very expensive and time-consuming process. In this paper, our goal is to utilize large language models and active learning to replace Wizard-of-Oz (WoZ) collection via crowdsourcing for bootstrapping training data for task-driven semantic parsers. We first demonstrate the utility of utterances generated by GPT-3 when seeded with prior training dialogues, as evaluated by human judges. We then explore the use of parser uncertainty on generated outputs as a selection criterion for annotation and contrast this with a strategy based on Core-sets. Our pipeline leads to more useful examples on average, motivating future work on active generation for bootstrapping semantic parsers. PDF 11 2021
DISAPERE: A Dataset for Discourse Structure in Peer Review Discussions At the foundation of scientific evaluation is the labor-intensive process of peer review. This critical task requires participants to consume vast amounts of highly technical text. Prior work has annotated different aspects of review argumentation, but discourse relations between reviews and rebuttals have yet to be examined. We present DISAPERE, a labeled dataset of 20k sentences contained in 506 review-rebuttal pairs in English, annotated by experts. DISAPERE synthesizes label sets from prior work and extends them to include fine-grained annotation of the rebuttal sentences, characterizing the authors' stance towards review arguments, and their context in the review. Further, we annotate \textit{every} review and rebuttal sentence. We show that discourse cues from rebuttals can shed light on the quality and interpretation of reviews. Further, an understanding of the argumentative strategies employed by the reviewers and authors provides useful signal for area chairs and other decision makers. PDF 11 2021
A Feasibility Study of Answer-Unaware Question Generation for Education We conduct a feasibility study into the applicability of \textit{answer-unaware} question generation models to textbook passages. We show that a significant portion of errors in such systems arise from asking irrelevant or un-interpretable questions and that such errors can be ameliorated by providing summarized input. We find that giving these models human-written summaries instead of the original text results in a significant increase in acceptability of generated questions (33\% -> 83\%) as determined by expert annotators. We also find that, in the absence of human-written summaries, automatic summarization can serve as a good middle ground. PDF 11 2021
Simulating Bandit Learning from User Feedback for Extractive Question Answering We study learning from user feedback for extractive question answering by simulating feedback using supervised data. We cast the problem as contextual bandit learning, and analyze the characteristics of several learning scenarios with focus on reducing data annotation. We show that systems initially trained on few examples can dramatically improve given feedback from users on model-predicted answers, and that one can use existing datasets to deploy systems in new domains without any annotation effort, but instead improving the system on-the-fly via user feedback. PDF 11 2021
Self-supervised Semantic-driven Phoneme Discovery for Zero-resource Speech Recognition Phonemes are defined by their relationship to words: changing a phoneme changes the word. Learning a phoneme inventory with little supervision has been a longstanding challenge with important applications to under-resourced speech technology. In this paper, we bridge the gap between the linguistic and statistical definition of phonemes and propose a novel neural discrete representation learning model for self-supervised learning of phoneme inventory with raw speech and word labels. Under mild assumptions, we prove that the phoneme inventory learned by our approach converges to the true one with an exponentially low error rate. Moreover, in experiments on TIMIT and Mboshi benchmarks, our approach consistently learns a better phoneme-level representation and achieves a lower error rate in a zero-resource phoneme recognition task than previous state-of-the-art self-supervised representation learning algorithms. PDF 11 2021
Entity-Conditioned Question Generation for Robust Attention Distribution in Neural Information Retrieval We show that supervised neural information retrieval (IR) models are prone to learning sparse attention patterns over passage tokens, which can result in key phrases including named entities receiving low attention weights, eventually leading to model under-performance. Using a novel targeted synthetic data generation method that identifies poorly attended entities and conditions the generation episodes on those, we teach neural IR to attend more uniformly and robustly to all entities in a given passage. On three public IR benchmarks, we empirically show that the proposed method helps improve both the model's attention patterns and retrieval performance, including in zero-shot settings. PDF 11 2021
Zero-shot Learning for Grapheme to Phoneme Conversion with Language Ensemble Grapheme-to-Phoneme (G2P) has many applications in NLP and speech fields. Most existing work focuses heavily on languages with abundant training datasets, which limit the scope of target languages to less than 100 languages. This work attempts to apply zero-shot learning to propose G2P models for all low-resource and endangered languages in Glottolog (about 8k languages). For any unseen target language, we first build the phylogenetic tree (i.e. language family tree) to identify top-$k$ nearest languages for which we have training sets. Then we run models of those languages to obtain a hypothesis set, which we combine into a confusion network to propose a most likely hypothesis as an approximation to the target language. We test our approach on over 600 unseen languages and demonstrate it significantly outperforms baselines. PDF 11 2021
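A toy sketch of combining hypotheses from the nearest languages; a real confusion network first aligns the hypotheses, so the position-wise vote below is a deliberate simplification:

    from collections import Counter

    def vote_phonemes(hypotheses, weights=None):
        """Position-wise weighted vote over phoneme hypotheses from neighbouring languages.
        A crude stand-in for the confusion-network combination described in the paper,
        which aligns hypotheses before voting."""
        weights = weights or [1.0] * len(hypotheses)
        combined = []
        for i in range(max(len(h) for h in hypotheses)):
            votes = Counter()
            for hyp, w in zip(hypotheses, weights):
                if i < len(hyp):
                    votes[hyp[i]] += w
            combined.append(votes.most_common(1)[0][0])
        return combined

    # Hypotheses for an unseen language from its 3 phylogenetically nearest neighbours:
    print(vote_phonemes([["k", "a", "t"], ["k", "a", "d"], ["g", "a", "t"]]))  # ['k', 'a', 't']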
Stable Natural Language Understanding via Invariant Causal Constraint Natural Language Understanding (NLU) tasks require a model to understand the underlying semantics of input text. However, recent analyses demonstrate that NLU models tend to exploit dataset biases to achieve high dataset-specific performance, which often leads to performance degradation on out-of-distribution (OOD) samples. To increase performance stability, previous debiasing methods \emph{empirically} capture bias features from data to prevent the model from relying on the corresponding biases. However, we argue that the semantic information forms a \emph{causal} relationship with the target labels of the NLU task, while the bias information is merely \emph{correlated} with the target labels. This difference between semantic information and dataset biases remains not fully addressed, which limits the effectiveness of debiasing. To address this issue, we analyze the debiasing process from a \emph{causal perspective}, and present a causal-invariance-based stable NLU framework (CI-sNLU). PDF 11 2021
Direct parsing to sentiment graphs This paper demonstrates how a graph-based semantic parser can be applied to the task of structured sentiment analysis, directly predicting sentiment graphs from text. We advance the state of the art on 4 out of 5 standard benchmark sets. We release the source code, models and predictions with the camera-ready version. PDF 11 2021
Dual Architecture for Named Entity Extraction and Relation Extraction with Applications in Medical Corpora There is a growing interest in automatic knowledge discovery in plain text documents. Automation enables the analysis of massive collections of information. Such efforts are especially relevant in the health domain, as advancements could leverage the large volume of available resources to transform areas important to society by addressing various health research challenges. However, knowledge discovery is usually aided by annotated corpora, which are scarce resources in the literature. This situation is particularly critical for the Spanish language, for which training resources are even less widespread. This work uses a health-oriented Spanish dataset, and it also creates an English variant using the same tagging system. Furthermore, we design and analyze two separate architectures for Entity Extraction and Relation Recognition that outperform previous works on the Spanish dataset. With such promising results, we also evaluate their performance on the English version. Finally, we perform a use case experiment to evaluate the utility of the output of these two architectures in Information Retrieval systems. PDF 11 2021
Mixture-of-Graphs: Zero-shot Relational Learning for Knowledge Graph by Fusing Ontology and Textual Experts Knowledge Graph Embedding (KGE) models have been proposed and successfully applied to Knowledge Graph Completion (KGC). However, dominant KGE models often fail at zero-shot relational learning because they cannot learn effective representations for unseen relations. Previous studies mainly utilize the textual description of a relation and its neighbor relations separately to represent unseen relations. In fact, the semantics of a relation can be expressed by three kinds of graphs: the factual graph, the ontology graph and the textual description graph, and they can complement and enhance each other. Therefore, to obtain more accurate representations of relations in zero-shot learning, we propose mixture-of-graphs (MoG) experts to improve the performance of current KGE methods on unseen relations. We build multi-aspect associations between seen and unseen relations, which directly guide previous KGE methods such as TransE and RotatE in zero-shot relational learning. Experiments on multiple public datasets verify the effectiveness of the proposed method, which improves the state-of-the-art zero-shot relational learning method by 12.84% in Hits@10 on average. PDF 11 2021
Relevance in Dialogue: Is Less More? An Empirical Comparison of Existing Metrics, and a Novel Simple Metric In this work, we evaluate various existing dialogue relevance metrics and find strong dependencies on the dataset, often with poor correlation with human scores of relevance, and we propose modifications to reduce data requirements and domain sensitivity while improving correlation. With these changes, our metric achieves state-of-the-art performance on the HUMOD dataset (Merdivan et al., 2020) while reducing measured sensitivity to the dataset by 50%. We achieve this without fine-tuning, using only 3750 unannotated human dialogues and a single negative example. Despite these limitations, we demonstrate competitive performance on four datasets from different domains. Our code, including our metric and experiments, is open sourced. PDF 11 2021
On the Use of Entity Embeddings from Pre-Trained Language Models for Knowledge Graph Completion Recent work has found that entity representations can be extracted from pre-trained language models to develop knowledge graph completion models that are more robust to the naturally occurring sparsity found in knowledge graphs. In this work, we explore how to best extract and incorporate those embeddings. We explore the suitability of the extracted embeddings for direct use in entity ranking and introduce both unsupervised and supervised processing methods that can lead to improved downstream performance. We then introduce supervised embedding extraction methods and demonstrate that we can extract more informative representations. We also examine the effect of language model selection and find that the choice of model can have a significant impact. We then synthesize our findings and develop a knowledge graph completion model that significantly outperforms recent neural models. PDF 11 2021
Cross-Utterance Conditioned VAE for Non-Autoregressive Text-to-Speech Modelling prosody variation is critical for synthesizing natural and expressive speech in end-to-end text-to-speech (TTS) systems. In this paper, a cross-utterance conditional VAE (CUC-VAE) is proposed to estimate a posterior probability distribution of the latent prosody features for each phoneme by conditioning on acoustic features, speaker information, and text features obtained from both past and future sentences. At inference time, instead of the standard Gaussian distribution used by the VAE, CUC-VAE allows sampling from an utterance-specific prior distribution conditioned on cross-utterance information, which makes the prosody generated by the TTS system related to the context and more similar to how humans naturally produce prosody. The performance of CUC-VAE is evaluated via a qualitative listening test for naturalness and intelligibility, as well as quantitative measurements, including word error rates and the standard deviation of prosody attributes. Experimental results on LJ-Speech and LibriTTS data show that the proposed CUC-VAE TTS system improves naturalness and prosody diversity by clear margins. PDF 11 2021
ATOGAN: Adaptive Training Objective Generative Adversarial Network for Cross-lingual Word Alignment in Non-Isomorphic Embedding Spaces Cross-lingual word alignment is the task of translating words across the monolingual word embedding spaces of two languages. Recent works are mostly based on supervised approaches, which need specific bilingual seed dictionaries. Unsupervised adversarial approaches, which utilize generative adversarial networks (GANs) to map the whole monolingual space, do not need any aligned data. However, these approaches pay no attention to the problems of mode collapse and vanishing gradients in GANs. We propose an adaptive training objective generative adversarial network (ATOGAN). We combine particle swarm optimization (PSO) with the GAN to select the training objective during training, which alleviates mode collapse and vanishing gradients. Moreover, we improve the word alignment with bi-directional mapping and a consistency loss. Experimental results demonstrate that our approach is better than several state-of-the-art approaches on distant language pairs (non-isomorphic embedding spaces). PDF 11 2021
GenRE: A Generative Model for Relation Extraction Relation extraction (RE) is an important information extraction task which provides essential information to many NLP applications such as knowledge base population and question answering. In this paper, we present a novel generative model for relation extraction (which we call GenRE), where RE is modeled as a sequence-to-sequence generation task. We explore various encoding schemes for the source and target sequences, and design effective schemes that enable GenRE to achieve state-of-the-art performance on three benchmark RE datasets. In addition, we introduce negative sampling and decoding scaling techniques which provide a flexible tool to tune the precision and recall performance of our GenRE model. Our approach can be extended to extract all relation triples from a sentence in one pass. Although the one-pass approach incurs certain performance loss, it is much more computationally efficient. PDF 11 2021
Textual Backdoor Attacks Can Be More Harmful via Two Simple Tricks Backdoor attacks are a kind of emergent security threat in deep learning. Once a backdoor is injected, a deep neural model will behave normally on standard inputs but give adversary-specified predictions whenever the input contains specific backdoor triggers. Current textual backdoor attacks have poor attack performance in some tough situations. In this paper, we find two simple tricks that can make existing textual backdoor attacks much more harmful. The first trick is to add an extra training task to distinguish poisoned and clean data during the training of the victim model, and the second one is to use all the clean training data rather than removing the original clean data corresponding to the poisoned data. These two tricks are universally applicable to different attack models. We conduct experiments in three tough situations, including clean-data fine-tuning, low-poisoning-rate, and label-consistent attacks. Experimental results show that the two tricks can significantly improve attack performance. This paper exhibits the great potential harmfulness of backdoor attacks. All the code and data will be made public to facilitate further research. PDF 11 2021
Control False Negative Instances In Contrastive Learning To Improve Long-tailed Item Categorization Item categorization (IC) is an important core technology in e-commerce natural language processing (NLP). Given the long-tailed distribution of category labels, IC performance on tail labels tends to be poor due to sporadic supervision. To address the long-tail issue in classification, an increasing number of methods have been proposed in the computer vision domain. In this paper, we adapt to IC tasks a new method that decouples the entire classification task into (a) learning representations in a k-positive contrastive learning (KCL) way and (b) training a classifier on a balanced data set. Using SimCSE as our self-supervised backbone, we demonstrate that the proposed method works on the IC text classification task. In addition, we spot a shortcoming in KCL: false negative (FN) instances may harm the representation learning step. After eliminating FN instances, IC performance (measured by macro-F1) is further improved. PDF 11 2021
Can Language Models Be Specific? How? A good speaker needs not only to be correct but also to be specific, and so do language models. In this paper, we propose to measure how specific the language of pre-trained language models (PLMs) is. To achieve this, we introduce a novel approach to build a benchmark for specificity testing by forming masked token prediction tasks with prompts. For instance, given ``J. K. Rowling was born in [MASK].'', we want to test whether a more specific answer will be better filled in by PLMs, e.g., Yate instead of England. From our evaluations, we show that existing PLMs have only a slight preference for more specific answers, indicating that PLMs are weak in specificity. We identify underlying factors affecting the specificity and design two prompt-based methods to improve the specificity. Results show that the specificity of the models can be improved by the proposed methods without additional training. We believe this work can provide a new insight for language modeling and encourage the research community to further explore this important but understudied problem. PDF 11 2021
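A probe in this spirit is easy to run with the HuggingFace fill-mask pipeline; the model choice below is ours, and a real specificity test would compare the probability mass assigned to the specific answer against the generic one:

    from transformers import pipeline

    fill = pipeline("fill-mask", model="bert-base-uncased")
    preds = fill("J. K. Rowling was born in [MASK].")
    for p in preds[:5]:
        print(p["token_str"], round(p["score"], 4))
    # A specificity probe would check whether probability mass concentrates on the
    # specific answer ("Yate") rather than a generic but still correct one ("England").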
Prompt-Guided Few-Shot Event Detection Practical applications of event extraction systems have long been hindered by their need for heavy human annotation. In order to scale up to new domains and event types, models must learn to cope with limited supervision, as in few-shot learning settings. To this end, the major challenge is to let the model master the semantics of event types without requiring abundant event mention annotations. In our study, we employ cloze prompts to elicit event-related knowledge from pre-trained language models and further use event definitions and keywords to pinpoint the trigger word. By formulating the event detection task as an "identify-then-localize" procedure, we minimize the number of type-specific parameters, enabling our model to quickly adapt to event detection tasks for new types. Experiments on three event detection benchmark datasets (ACE, FewEvent, MAVEN) show that our proposed method performs favorably under fully supervised settings and surpasses existing few-shot methods by 16% F1 on the FewEvent dataset and 23% on the MAVEN dataset when only 5 examples are provided for each event type. PDF 11 2021
SiSP: Japanese Situation-dependent Sentiment Polarity Dictionary In order to deal with the variety of meanings and contexts of words, we created a Japanese Situation-dependent Sentiment Polarity Dictionary (SiSP) of sentiment values labeled for 20 different situations. The dictionary covers 25,520 Japanese words annotated by crowdworkers, with 10 responses for each situation of each word. Using our SiSP, we predicted the polarity of each word in the dictionary and that of dictionary words in sentences, taking the context into account. In both experiments, situation-dependent prediction showed superior results in determining emotional polarity. PDF 11 2021
Text-to-Table: A New Way of Information Extraction We study a new problem setting of information extraction (IE), referred to as text-to-table. In text-to-table, given a text, one creates a table or several tables expressing the main content of the text, while the model is learned from text-table pair data. The problem setting differs from those of the existing methods for IE. First, the extraction can be carried out from long texts to large tables with complex structures. Second, the extraction is entirely data-driven, and there is no need to explicitly define the schemas. As far as we know, there has been no previous work that studies the problem. In this work, we formalize text-to-table as a sequence-to-sequence (seq2seq) problem. We first employ a seq2seq model fine-tuned from a pre-trained language model to perform the task. We also develop a new method within the seq2seq approach, exploiting two additional techniques in table generation: table constraint and table relation embeddings. We consider text-to-table as an inverse problem of the well-studied table-to-text, and make use of four existing table-to-text datasets in our experiments on text-to-table. Experimental results show that the vanilla seq2seq model can outperform the baseline methods of using relation extraction and named entity extraction. The results also show that our method can further boost the performances of the vanilla seq2seq model. We further discuss the main challenges of the proposed task. The code and data will be made publicly available. PDF 11 2021
Exploring and Adapting Chinese GPT to Pinyin Input Method While GPT has become the de-facto method for text generation tasks, its application to the pinyin input method remains unexplored. In this work, we make the first exploration of leveraging Chinese GPT for the pinyin input method. We find that a frozen GPT achieves state-of-the-art performance on perfect pinyin. However, the performance drops dramatically when the input includes abbreviated pinyin. One reason is that an abbreviated pinyin can be mapped to many perfect pinyin sequences, which in turn correspond to an even larger number of Chinese characters. We mitigate this issue with two strategies, including enriching the context with pinyin and optimizing the training process to help distinguish homophones. To further facilitate the evaluation of pinyin input methods, we create a dataset consisting of 270K instances from 15 domains. Results show that our approach improves the performance on abbreviated pinyin across all domains. Model analysis demonstrates that both strategies contribute to the performance boost. PDF 11 2021
Multi-head or Single-head? An Empirical Comparison for Transformer Training Multi-head attention plays a crucial role in the recent success of the Transformer, leading to consistent performance improvements over conventional attention in various applications. The popular belief is that its effectiveness stems from the ability to attend to multiple positions jointly. In this paper, we first demonstrate that jointly attending to multiple positions is not a unique feature of multi-head attention, as multi-layer single-head attention also attends to multiple positions. Then, we suggest that the main advantage of multi-head attention is training stability, since it has fewer layers than single-head attention when attending the same number of positions. Meanwhile, we show that, with recent advances in deep learning, we can successfully stabilize the training of deep single-head Transformers. As the training difficulty is no longer a bottleneck, substantially deeper single-head Transformers achieve consistent performance improvements. PDF 11 2021
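The paper's core comparison can be sketched in a few lines: one wide attention layer versus a deeper stack of single-head layers. This is only an illustration of the two configurations, not the training setup studied in the paper.

```python
import torch
import torch.nn as nn

d_model, seq_len = 64, 10
x = torch.randn(seq_len, 1, d_model)  # (seq, batch, dim)

# One layer with 8 heads: 8 attention distributions computed jointly.
multi_head = nn.MultiheadAttention(d_model, num_heads=8)
out, _ = multi_head(x, x, x)

# Eight stacked single-head layers: the same number of attention
# distributions, spread across depth instead of width; deeper stacks
# are harder to train without extra stabilization.
single_heads = nn.ModuleList(
    nn.MultiheadAttention(d_model, num_heads=1) for _ in range(8)
)
for layer in single_heads:
    x, _ = layer(x, x, x)
```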
How Distributed are Distributed Representations? An Observation on the Locality of Syntactic Information in Verb Agreement Tasks This work addresses the question of where syntactic information is encoded in Transformer representations. We tackle this question from two perspectives, considering object-past participle agreement in French, by identifying, first, in which part of the sentence and, second, in which part of the representation syntactic information is encoded. The results of our experiments, using probing, causal analysis, and feature selection methods, show that syntactic information is encoded locally, in a way consistent with French grammar. PDF 11 2021
S$^2$SQL: Injecting Syntax to Question-Schema Interaction Graph Encoder for Text-to-SQL Parsers The task of converting a natural language question into an executable SQL query, known as text-to-SQL, is an important branch of semantic parsing. The state-of-the-art graph-based encoder has been successfully used in this task but does not model question syntax well. In this paper, we propose S$^2$SQL, which injects syntax into the question-schema graph encoder for text-to-SQL parsers, effectively leveraging the syntactic dependency information of questions to improve performance. We also employ a decoupling constraint to induce diverse relational edge embeddings, which further improves the network's performance. Experiments on Spider and the robustness benchmark Spider-Syn demonstrate that the proposed approach outperforms all existing methods when pre-trained models are used, resulting in performance that ranks first on the Spider leaderboard. PDF 11 2021
Zero-Shot Dependency Parsing with Worst-Case Aware Automated Curriculum Learning Large multilingual pretrained language models such as mBERT and XLM-RoBERTa have been found to be surprisingly effective for cross-lingual transfer of syntactic parsing models (Wu and Dredze, 2019), but only between related languages. However, when parsing truly low-resource languages, the source and training languages are rarely related. To close this gap, we adopt a method from multi-task learning, which relies on automated curriculum learning, to dynamically optimize for parsing performance on {\em outlier} languages. We show that this approach is significantly better than uniform and size-proportional sampling in the zero-shot setting. PDF 11 2021
Cost-Effective Training in Low-Resource Neural Machine Translation While Active Learning (AL) techniques are explored in Neural Machine Translation (NMT), only a few works focus on tackling low annotation budgets where a limited number of sentences can be translated. Such situations are especially challenging and can occur for endangered languages with few human annotators or under cost constraints that prevent labeling large amounts of data. Although AL is shown to be helpful with large budgets, it is not enough to build high-quality translation systems in these low-resource conditions. In this work, we propose a cost-effective training procedure to increase the performance of NMT models utilizing a small number of annotated sentences and dictionary entries. Our method leverages monolingual data with self-supervised objectives and a small-scale, inexpensive dictionary for additional supervision to initialize the NMT model before applying AL. We show that improving the model using a combination of these knowledge sources is essential to exploit AL strategies and increase gains in low-resource conditions. We also present a novel AL strategy inspired by domain adaptation for NMT and show that it is effective for low budgets. We propose a new hybrid data-driven approach, which samples sentences that are diverse from the labelled data as well as similar to unlabelled data. Finally, we show that initializing the NMT model and further using our AL strategy can achieve gains of up to $13$ BLEU compared to conventional AL methods. PDF 11 2021
Sharper Reasons: Argument Mining Leveraged with Confluent Knowledge Relevant to all application domains where it is important to get at the reasons underlying decisions and sentiments, argument mining seeks to obtain structured arguments from unstructured text and has recently been addressed by approaches typically involving some feature and/or neural architecture engineering. By embracing a transfer learning viewpoint, the aim of this paper is to empirically assess the potential of transferring knowledge learned with confluent tasks to argument mining, by means of a systematic study with a wide range of sources of related knowledge possibly suitable to leverage argument mining. This permitted us to gain new empirically based insights into the argument mining task while also establishing new state-of-the-art levels of performance for its three main sub-tasks, viz. identification of argument components, classification of the components, and determination of the relation among them, with a leaner approach that dispenses with heavier feature and model engineering. PDF 11 2021
e-CARE: a New Dataset for Exploring Explainable Causal Reasoning Understanding causality is of vital importance for various Natural Language Processing (NLP) applications. Beyond the labeled instances, conceptual explanations of the causality can provide a deep understanding of the causal facts to facilitate the causal reasoning process. However, such explanation information is still absent from existing causal reasoning resources. In this paper, we fill this gap by presenting a human-annotated explainable CAusal REasoning dataset (e-CARE), which contains over 20K causal reasoning questions, together with natural language explanations of the causal questions. Experimental results show that generating valid explanations for causal facts remains especially challenging for state-of-the-art models, and that the explanation information can be helpful for promoting the accuracy and stability of causal reasoning models. PDF 11 2021
Improving Paraphrase Generation models with machine translation generated pre-training Paraphrase generation is a fundamental and longstanding problem in the Natural Language Processing field. With the huge success of pre-trained transformers, the pre-train–fine-tune approach has become a standard choice. At the same time, popular task-agnostic pre-training usually requires terabyte-scale datasets and hundreds of GPUs, while the available pre-trained models are limited in architecture and size. We propose a simple and efficient pre-training approach specifically for paraphrase generation, which noticeably boosts model quality and doesn't require significant computing power. We also investigate how this procedure influences the scores across different architectures and show that it helps them all. PDF 11 2021
A Graph Enhanced BERT Model for Event Prediction Predicting the subsequent event for an existing event context is an important but challenging task, as it requires understanding the underlying relationship between events. Previous methods propose to retrieve relational features from an event graph to enhance the modeling of event correlation. However, the sparsity of the event graph may restrict the acquisition of relevant graph information and hence hurt model performance. To address this issue, we consider automatically building the event graph using a BERT model. To this end, we incorporate an additional structured variable into BERT to learn to predict event connections during training. Hence, at test time, the connection relationship for unseen events can be predicted by the structured variable. Results on two event prediction tasks, script event prediction and story ending prediction, show that our approach outperforms state-of-the-art baseline methods. PDF 11 2021
Self-Competitive Learning for Solving Math Word Problem Math word problem (MWP) solving aims to automatically solve mathematical questions given in texts. Most previous MWP models tend to fit the sole ground-truth solution provided by the dataset, without considering the diverse but equivalent solution expressions. To mitigate this issue, we propose a self-competitive learning framework (called SCL), which attempts to obtain different predictions and improve the generalization ability of the model by cooperatively learning a source network and a pruned competitor network. The competitor network is created by pruning the source network, which perturbs the source network's structure and is conducive to generating diverse solutions. The source network and the competitor network learn collaboratively and teach each other throughout the training process. Extensive experiments on two large-scale benchmarks demonstrate that our model substantially outperforms strong baseline methods. In particular, our method improves the best performance (accuracy) by 8.4% (78.4% $\rightarrow$ 86.8%) for Math23k and 6.2% (70.5% $\rightarrow$ 76.7%) for Ape210K. PDF 11 2021
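A minimal sketch of how the pruned competitor might be derived from the source network, assuming magnitude pruning via `torch.nn.utils.prune`; the collaborative training loop and mutual teaching loss from the paper are omitted.

```python
import copy
import torch.nn as nn
import torch.nn.utils.prune as prune

source = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 64))

# Create the competitor by pruning a copy of the source network;
# 30% L1-magnitude pruning per linear layer is an illustrative choice.
competitor = copy.deepcopy(source)
for module in competitor.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)

# The two networks would then be trained collaboratively, each providing
# soft targets for the other; this sketch leaves that part out.
```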
BACN: Bi-direction Attention Capsule-based Network for Multimodal Sentiment Analysis Capsule-based networks have recently demonstrated their effectiveness in addressing the heterogeneity issue of multimodal sentiment analysis. However, existing methods can only exploit the spatial relation between the representation and output layers via down-top attention, which fails to effectively explore both inter-modality and intra-modality context. In this paper, during the preprocessing period, we first present a multimodal dynamic enhanced module to facilitate intra-modality context, which significantly boosts learning efficiency in dealing with the multimodal heterogeneity issue. Furthermore, the bi-direction attention capsule-based network (BACN) is proposed to capture dynamic inter-modality context via a novel bi-direction dynamic routing mechanism. Specifically, BACN first highlights static and low-level inter-modality context based on top-down attention. Then, the static multimodal context is passed to the dynamic routing procedure, naturally allowing us to investigate dynamic and high-level inter-modality context. This unleashes the expressive power and provides a superior capability to bridge the modality gap among all the modalities. The experiments demonstrate that BACN can achieve state-of-the-art performance. PDF 11 2021
Aggregating Pairwise Semantic Differences for Few-Shot Claim Veracity Classification As part of an automated fact-checking pipeline, the claim veracity classification task consists in determining whether a claim is supported by an associated piece of evidence. The complexity of gathering labelled claim-evidence pairs leads to a scarcity of datasets, particularly when dealing with new domains. In this paper, we introduce SEED, a novel vector-based method for few-shot claim veracity classification that aggregates pairwise semantic differences for claim-evidence pairs. We build on the hypothesis that we can find class-representative vectors that capture the average semantic differences for claim-evidence pairs in a class, which can then be used for classification of new instances. We compare the performance of our method with competitive baselines including fine-tuned BERT/RoBERTa models, as well as the state-of-the-art few-shot veracity classification method that leverages language model perplexity. Experiments conducted on the FEVER and SCIFACT datasets show consistent improvements over competitive baselines in few-shot settings. Our code will be made publicly available upon publication. PDF 11 2021
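A minimal sketch of the difference-vector idea under simple assumptions: embeddings come from any sentence encoder, and Euclidean distance stands in for whatever metric the paper actually uses.

```python
import numpy as np

def class_vector(claim_embs, evidence_embs):
    # Average pairwise semantic difference for one veracity class.
    return np.mean(claim_embs - evidence_embs, axis=0)

def classify(claim_emb, evidence_emb, class_vectors):
    # Assign the class whose representative vector is closest to the
    # new pair's difference vector (Euclidean distance is an assumption).
    diff = claim_emb - evidence_emb
    return min(class_vectors, key=lambda c: np.linalg.norm(diff - class_vectors[c]))

rng = np.random.default_rng(0)
vectors = {"SUPPORTED": class_vector(rng.normal(size=(8, 64)), rng.normal(size=(8, 64))),
           "REFUTED": class_vector(rng.normal(size=(8, 64)), rng.normal(size=(8, 64)))}
print(classify(rng.normal(size=64), rng.normal(size=64), vectors))
```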
Autoregressive Language Model for Zero-shot Constrained Keyphrase Generation Recently, most state-of-the-art keyphrase prediction models have been based on supervised generative models and perform significantly better than earlier approaches. Nevertheless, they still face problems with domain robustness and the cost of building high-resource datasets. To overcome these limitations, unsupervised methods have also been studied, but we find that they too have a defect: they require a preliminary step that extracts candidates before selecting keyphrases, and since this step does not cover all possible forms of phrases, unsupervised methods cannot guarantee recovering the oracle keyphrases. In this paper, we present zero-shot constrained keyphrase generation that leverages a large-scale language model. To generate diverse keyphrases, we explore controlling phrases during generation. Finally, we evaluate on benchmark datasets from the scholarly domain. Our method outperforms unsupervised methods on several datasets without going through a candidate extraction stage. For domain robustness, we evaluate on the out-of-domain DUC dataset in comparison with NUS. Since our method is not fine-tuned on a corpus from a specific domain, it outperforms supervised methods based on Sequence-to-Sequence models in this setting. PDF 11 2021
DoCoGen: Domain Counterfactual Generation for Low Resource Domain Adaptation Natural language processing (NLP) algorithms have become very successful, but they still struggle when applied to out-of-distribution examples. In this paper we propose a controllable generation approach in order to deal with this domain adaptation (DA) challenge. Given an input text example, our DoCoGen algorithm generates a domain-counterfactual textual example (D-CON) – that is similar to the original in all aspects, including the task label, but its domain is changed to a desired one. Importantly, DoCoGen is trained using only unlabeled examples from multiple domains – no NLP task labels or pairs of textual examples and their domain-counterfactuals are required. We use the D-CONs generated by DoCoGen to augment a sentiment classifier in 20 DA setups, where source-domain labeled data is scarce. Our model outperforms strong baselines and improves the accuracy of a state-of-the-art unsupervised DA algorithm. PDF 11 2021
Multi-layer Biaffine Model for Neural Dependency Parsing The biaffine model is a strong and efficient model for graph-based dependency parsing. However, previous work has only used the biaffine method in single-layer form. In this paper, we propose a multi-layer biaffine model for neural dependency parsing, in which we modify the biaffine method so that it can be utilized in multi-layer form. We evaluate our model on PTB and CTB and show that it achieves state-of-the-art results on both datasets. Further experiments show the benefits of introducing the multi-layer form into the biaffine method, with little efficiency loss. PDF 11 2021
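For reference, a single biaffine scoring layer in the style of Dozat and Manning's widely used biaffine parser can be written as below; how the paper stacks such layers into its multi-layer form is not shown here.

```python
import torch
import torch.nn as nn

class Biaffine(nn.Module):
    """One biaffine arc scorer: score(i, j) = [h_i^head; 1] W h_j^dep."""
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Parameter(torch.empty(dim + 1, dim))  # extra row acts as a bias term
        nn.init.xavier_uniform_(self.W)

    def forward(self, head, dep):  # both: (batch, seq, dim)
        ones = torch.ones(*head.shape[:2], 1)
        head = torch.cat([head, ones], dim=-1)      # append the bias feature
        return head @ self.W @ dep.transpose(1, 2)  # (batch, seq, seq) arc scores

scores = Biaffine(128)(torch.randn(2, 10, 128), torch.randn(2, 10, 128))
```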
From Rewriting to Remembering: Common Ground for Conversational QA Models In conversational QA, models have to leverage information in previous turns to answer upcoming questions. Current approaches, such as Question Rewriting, struggle to extract relevant information as the conversation unfolds. We introduce the Common Ground (CG), an approach to accumulate conversational information as it emerges and select the relevant information at every turn. We show that CG offers a more efficient and human-like way to exploit conversational information compared to existing approaches, leading to improvements on Open Domain Conversational QA. PDF 11 2021
Emotion analysis and detection during COVID-19 Understanding the emotions that people express during large-scale crises helps inform policy makers and first responders about the emotional states of the population as well as provide emotional support to those who need it. We present CovidEmo, a dataset of 3,000 English tweets labeled with emotions and temporally distributed across 18 months. Our analyses reveal the emotional toll caused by COVID-19, and changes in the social narrative and associated emotions over time. Motivated by the time-sensitive nature of crises and the cost of large-scale annotation efforts, we examine how well large pre-trained language models generalize across domains and time in the task of perceived emotion prediction in the context of COVID-19. Our analyses suggest that cross-domain information transfer occurs, yet there are still significant gaps. We propose semi-supervised learning as a way to bridge this gap, obtaining significantly better performance using unlabeled data from the target domain. PDF 11 2021
HIE-SQL: History Information Enhanced Network for Context-Dependent Text-to-SQL Semantic Parsing Recently, context-dependent text-to-SQL semantic parsing, which translates natural language into SQL in an interaction process, has attracted a lot of attention. Previous works leverage context-dependence information either from interaction history utterances or previously predicted queries, but fail to take advantage of both because of the mismatch between natural language and logic-form SQL. In this work, we propose a History Information Enhanced text-to-SQL model (HIE-SQL) to exploit context-dependence information from both history utterances and the last predicted SQL query. In view of the mismatch, we treat natural language and SQL as two modalities and propose a bimodal pre-trained model to bridge the gap between them. Besides, we design a schema-linking graph to enhance connections from utterances and the SQL query to the database schema. We show our history information enhanced methods improve the performance of HIE-SQL by a significant margin, achieving new state-of-the-art results on two context-dependent text-to-SQL benchmarks, the SparC and CoSQL datasets, at the time of writing. PDF 11 2021
Boundary Smoothing for Named Entity Recognition Neural named entity recognition (NER) models may easily encounter the over-confidence issue, which degrades the performance and calibration. Inspired by label smoothing and driven by the ambiguity of boundary annotation in NER engineering, we propose boundary smoothing as a regularization technique for span-based neural NER models. It re-assigns entity probabilities from annotated spans to the surrounding ones. Built on a simple but strong baseline, our model achieves results better than or competitive with previous state-of-the-art systems on eight well-known NER benchmarks. Further empirical analysis suggests that boundary smoothing effectively mitigates over-confidence, improves model calibration, and brings flatter neural minima and more smoothed loss landscapes. PDF 11 2021
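A minimal sketch of the re-assignment idea behind boundary smoothing: probability mass moves from the annotated span to spans whose boundaries sit within a small distance. The uniform allocation scheme here is an illustrative assumption, not necessarily the paper's exact formulation.

```python
import torch

def boundary_smooth(start, end, seq_len, eps=0.1, d=1):
    # target[i, j] holds the soft probability that the entity spans tokens i..j.
    target = torch.zeros(seq_len, seq_len)
    neighbors = [(s, e)
                 for s in range(max(0, start - d), min(seq_len, start + d + 1))
                 for e in range(max(0, end - d), min(seq_len, end + d + 1))
                 if (s, e) != (start, end) and s <= e]
    target[start, end] = 1.0 - eps if neighbors else 1.0
    for s, e in neighbors:
        target[s, e] = eps / len(neighbors)
    return target

print(boundary_smooth(start=2, end=4, seq_len=8).nonzero())
```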
UniPELT: A Unified Framework for Parameter-Efficient Language Model Tuning Recent parameter-efficient language model tuning (PELT) methods manage to match the performance of fine-tuning with far fewer trainable parameters and perform especially well when training data is limited. However, different PELT methods may perform rather differently on the same task, making it nontrivial to select the most appropriate method for a specific task, especially considering the fast-growing number of new PELT methods and tasks. In light of model diversity and the difficulty of model selection, we propose a unified framework, UniPELT, which incorporates different PELT methods as submodules and learns to activate the ones that best suit the current data or task setup via a gating mechanism. On the GLUE benchmark, UniPELT consistently achieves 1~4% gains compared to the best individual PELT method that it incorporates and even outperforms fine-tuning under different setups. Moreover, UniPELT generally surpasses the upper bound that takes the best performance of all its submodules used individually on each task, indicating that a mixture of multiple PELT methods may be inherently more effective than single methods. PDF 11 2021
ANNA: Enhanced Language Representation for Question Answering Pre-trained language models have brought significant performance improvements in a variety of natural language processing tasks. Most existing models achieving state-of-the-art results approach the problem from the separate perspectives of data processing, pre-training tasks, neural network modeling, or fine-tuning. In this paper, we demonstrate how each of these approaches affects performance individually, and show that a language model achieves the best results on a specific question answering task when those approaches are jointly considered in pre-training. In particular, we propose an extended pre-training task and a new neighbor-aware mechanism that attends more to neighboring tokens, to capture the richness of context for pre-training language modeling. Our best model achieves new state-of-the-art results of 95.7\% F1 and 90.6\% EM on SQuAD 1.1 and also outperforms existing pre-trained language models such as RoBERTa, ALBERT, ELECTRA, and XLNet on the SQuAD 2.0 benchmark. PDF 11 2021
Improving the Faithfulness of Abstractive Summarization via Entity Coverage Control Abstractive summarization systems leveraging pre-trained language models have achieved superior results on benchmark datasets. However, such models have been shown to be more prone to hallucinate facts that are unfaithful to the input context. In this paper, we propose a method to remedy entity-level extrinsic hallucinations with Entity Coverage Control (ECC). We first compute entity coverage precision and prepend the corresponding control code for each training example, which implicitly guides the model to recognize faithful content in the training phase. We further extend our method via intermediate fine-tuning on large but noisy data extracted from Wikipedia to unlock zero-shot summarization. We show that the proposed method leads to more faithful and salient abstractive summarization in supervised fine-tuning and zero-shot settings, according to our experimental results on three benchmark datasets, XSum, Pubmed, and SAMSum, of very different domains and styles. PDF 11 2021
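A small sketch of the control-code step under stated assumptions: entity sets are produced by some external NER or entity-linking step, and the bucket thresholds and code names are illustrative, not the paper's.

```python
def coverage_control_code(summary_entities, source_entities, buckets=(0.5, 0.9)):
    """Entity coverage precision = share of summary entities that also occur
    in the source; mapped to a coarse control code (thresholds are assumed)."""
    if not summary_entities:
        return "<cov-high>"
    precision = sum(e in source_entities for e in summary_entities) / len(summary_entities)
    if precision >= buckets[1]:
        return "<cov-high>"
    return "<cov-mid>" if precision >= buckets[0] else "<cov-low>"

# During training, the code would be prepended to the source document:
code = coverage_control_code({"Yate", "Rowling"}, {"Rowling", "London"})
example = code + " " + "source document text ..."
```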
AdaPrune: Pruning Transformer with Sparse Regularization The key components of the transformer architecture are the multi-head self-attention (MHA) and the feed-forward network (FFN). In this paper, we reveal that, across many applications, the MHA component is nonsymmetric and the FFN component is sparse. Leveraging this observation, we propose a new method, AdaPrune, which uses sparse regularization to conduct structured pruning in MHA and FFN modules. This method selects task-specific valuable heads in multi-head attention modules and effective blocks in feed-forward layers during the fine-tuning stage, while maintaining the original performance of the full transformer model. Extensive experiments show that AdaPrune can achieve competitive performance on these tasks while significantly reducing the computation cost. PDF 11 2021
Re-thinking Supertags in Linear Context-free Rewriting Systems for Constituency Parsing Recently, a supertagging-based approach for parsing discontinuous constituent trees with linear context-free rewriting systems (LCFRS) was introduced. We reformulate the algorithm for the extraction of supertags from treebanks to be more concise. Moreover, we add extensions that give us control over the extraction process in terms of supertag granularity and which terminal symbols are associated with supertags. Our additions lead to an increase in parsing quality with LCFRS supertagging in all three compared treebanks. The scores are among the state of the art in discontinuous constituent parsing. PDF 11 2021
Reasoning Like Program Executors Reasoning over natural language is a long-standing goal for the research community. However, studies have shown that existing language models are inadequate at reasoning. To address the issue, we present POET, a new pre-training paradigm. By pre-training language models with programs and their execution results, POET empowers language models to harvest the reasoning knowledge possessed by program executors via a data-driven approach. POET is conceptually simple and can be instantiated by different kinds of programs. In this paper, we show three empirically powerful instances, i.e., POET-Math, POET-Logic, and POET-SQL. Experimental results on six benchmarks demonstrate that POET can significantly boost model performance on natural language reasoning, such as numerical reasoning, logical reasoning, and multi-hop reasoning. Taking the DROP benchmark as a representative example, POET improves the F1 metric of BART from 69.2% to 80.6%. Furthermore, POET shines in giant language models, pushing the F1 metric of T5-11B to 87.6% and achieving new state-of-the-art performance on DROP. POET opens a new door for reasoning-enhanced pre-training, and we will make our code, models, and data publicly available to facilitate future research. PDF 11 2021
A Parameter Aggregation Strategy on Personalized Federated Learning We investigate the parameter aggregation weights of federated learning (FL), simulate a variety of data access scenarios for experiments, and propose a model parameter weight self-learning strategy for horizontal FL. For practical application of this study, we design a personalized FL network structure model based on edge computing. PDF 11 2021
Training a Turn-level User Engagingness Predictor for Dialogues with Weak Supervision The standard approach to evaluating dialogue engagingness is to measure conversation turns per session (CTPS), which implies that the dialogue length is the main predictor of user engagement with a dialogue system. The main limitation of CTPS is that it can be measured only at the session level, i.e., once the dialogue is already over. However, it is crucial for a dialogue system to continuously monitor user engagement throughout the dialogue session as well. Existing approaches to measuring turn-level engagingness require human annotations for training and lack interpretability of their scores. We pioneer an alternative approach, Remaining Depth as Engagingness Predictor (RDEP), which uses the remaining depth (RD) for each turn as a heuristic weak label for engagingness. RDEP does not require human annotations and also relates closely to CTPS, thus serving as a good learning proxy for this metric. In our experiments, we show that RDEP achieves new state-of-the-art results on the fine-grained evaluation of dialog (FED) dataset (0.38 Spearman) and the Daily-Dialog dataset (0.62 Spearman). PDF 11 2021
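The weak labeling step is simple enough to sketch directly: each turn's label is the number of turns the dialogue continued after it. The exact offset convention is an assumption.

```python
def remaining_depth_labels(session):
    # Weak engagingness label: how many turns followed this one in the session.
    n = len(session)
    return [(turn, n - i - 1) for i, turn in enumerate(session)]

print(remaining_depth_labels(["hi", "tell me a joke", "another one", "bye"]))
# [('hi', 3), ('tell me a joke', 2), ('another one', 1), ('bye', 0)]
```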
On the Safety of Conversational Models: Taxonomy, Dataset, and Benchmark Dialogue safety problems severely limit the real-world deployment of neural conversational models and have attracted great research interest recently. However, dialogue safety problems remain under-defined and the corresponding datasets are scarce. We propose a taxonomy for dialogue safety specifically designed to capture unsafe behaviors in human-bot dialogue settings, with a focus on context-sensitive unsafety, which is under-explored in prior works. To spur research in this direction, we compile DiaSafety, a dataset with rich context-sensitive unsafe examples. Experiments show that existing safety guarding tools fail severely on our dataset. As a remedy, we train a dialogue safety classifier to provide a strong baseline for context-sensitive dialogue unsafety detection. With our classifier, we perform safety evaluations on popular conversational models and show that existing dialogue systems still exhibit concerning context-sensitive safety problems. PDF 11 2021
KALA: Knowledge-Augmented Language Model Adaptation Pre-trained language models (PLMs) have achieved remarkable success on various natural language understanding tasks. Simple fine-tuning of PLMs, on the other hand, might be suboptimal for domain-specific tasks because they cannot possibly cover knowledge from all domains. While adaptive pre-training of PLMs can help them obtain domain-specific knowledge, it requires a large training cost. Moreover, adaptive pre-training can harm the PLM's performance on the downstream task by causing catastrophic forgetting of its general knowledge. To overcome such limitations of adaptive pre-training for PLM adaptation, we propose a novel domain adaptation framework for PLMs coined Knowledge-Augmented Language model Adaptation (KALA), which modulates the intermediate hidden representations of PLMs with domain knowledge, consisting of entities and their relational facts. We validate the performance of our KALA on question answering and named entity recognition tasks on multiple datasets across various domains. The results show that, despite being computationally efficient, our KALA largely outperforms adaptive pre-training. PDF 11 2021
DAQE: Exploring the Direct Assessment on Word-Level Quality Estimation in Machine Translation Word-level Quality Estimation (QE) of Machine Translation (MT) helps to find potential translation errors in translated sentences without a reference. Current QE datasets are typically built on the exact matching between the words of MT sentences and post-edited sentences through the Translation Error Rate (TER) toolkit. However, we find that the data generated by TER cannot faithfully reflect human judgment, which can make research deviate from the correct direction. To overcome this limitation, we collect, for the first time, a direct assessment (DA) dataset for the word-level QE task, namely DAQE, which contains a golden corpus annotated by expert translators on two language pairs. Furthermore, we propose two tag correcting strategies, namely the tag refinement strategy and the tree-based annotation strategy, to make the TER-based artificial QE tags closer to human judgment, so that the corrected TER-based data can be used to improve QE performance during pre-training. We conduct detailed experiments on our collected DAQE dataset, as well as a comparison with the TER-based QE dataset MLQE-PE. The results not only show that our proposed dataset DAQE is more consistent with human judgment but also confirm the effectiveness of the pre-training approach with the tag correcting strategies. PDF 11 2021
Towards Robustness of Text-to-SQL Models Against Natural and Realistic Adversarial Table Perturbation The robustness of Text-to-SQL parsers against adversarial perturbations plays a crucial role in delivering highly reliable applications. Previous studies along this line primarily focused on perturbations on the natural language question side, neglecting the variability of tables. Motivated by this, we propose Adversarial Table Perturbation (ATP) as a new attacking paradigm to measure the robustness of Text-to-SQL models. Following this proposition, we curate ADVETA, the first robustness evaluation benchmark featuring natural and realistic ATPs. All tested state-of-the-art models experience dramatic performance drops on ADVETA, revealing significant room for improvement. To defend against ATP, we build a systematic adversarial training example generation framework tailored for better contextualization of tabular data. Experiments show that our approach brings models the best robustness improvement against ATP, while also substantially boosting model robustness against NL-side perturbations. We will release ADVETA and code to facilitate future research. PDF 11 2021
Unsupervised Chinese Word Segmentation with BERT Oriented Probing and Transformation Word segmentation is a fundamental step for understanding the Chinese language. Previous neural approaches for unsupervised Chinese Word Segmentation (CWS) only exploit shallow semantic information, which can miss important context. Large-scale pre-trained language models (PLMs) have achieved great success in many areas because of their ability to capture deep contextual semantic relations. In this paper, we propose to take advantage of the deep semantic information embedded in PLMs (e.g., BERT) in a self-training manner, which iteratively probes and transforms the semantic information in the PLM into explicit word segmentation ability. Extensive experimental results show that our proposed approach achieves state-of-the-art F1 scores on two CWS benchmark datasets. PDF 11 2021
A Survey on Geocoding: Algorithms and Datasets for Toponym Resolution Geocoding, the task of converting unstructured text to structured spatial data, has recently seen progress thanks to a variety of new datasets, evaluation metrics, and machine-learning algorithms. We provide a survey to review, organize and analyze recent work on geocoding (also known as toponym resolution) where the text is matched to geospatial coordinates and/or ontologies. We summarize the findings of this research and suggest some promising directions for future work. PDF 11 2021
A Re-examination of Neural Selective Prediction for Natural Language Processing We provide a survey and careful empirical comparison of the state-of-the-art in neural selective classification for NLP tasks. Across multiple trials on multiple datasets, only one of the surveyed techniques -- Monte Carlo Dropout -- significantly outperforms the simple baseline of using the maximum softmax probability as an indicator of prediction confidence. Our results provide a counterpoint to recent claims made on the basis of single-trial experiments on a small number of datasets. We also provide a blueprint and open-source code to support the future evaluation of selective prediction techniques. PDF 11 2021
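For readers unfamiliar with the one technique that stood out in this comparison, here is a minimal sketch of Monte Carlo Dropout confidence estimation; keeping the model in train mode is what leaves dropout active, and it would also affect batch-norm layers if the model had any.

```python
import torch

@torch.no_grad()
def mc_dropout_confidence(model, x, passes=20):
    # Keep dropout active at inference and average softmax outputs
    # over several stochastic forward passes.
    model.train()
    probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(passes)])
    mean_probs = probs.mean(dim=0)
    confidence, prediction = mean_probs.max(dim=-1)
    return prediction, confidence  # abstain when confidence falls below a threshold
```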
Enhancing Text Generation with Inductive Event Reasoning How to generate informative, coherent natural language is a very important task. Previous studies mainly focus on leveraging commonsense knowledge into generative models, which can improve the informativeness of generated texts. However, these models pay little attention to discourse coherence. Instead, we propose to utilize event chains to improve the coherence of text generation. In addition, we devise an inductive encoding module to reduce the sparsity of introduced event chains and learn the useful event evolution patterns. Specifically, we first extract event chains for the input text and then connect them as a graph. The inductive graph encoding module is then used to learn the inductive and generalized event embeddings. The event reasoning flow module follows and produces the event sketch, i.e., the reasonable events conditioned by the input text. Finally, we generate the text based on the input context and the event sketch. Experimental results indicate the effectiveness of this framework in terms of coherence and informativeness of text generation. PDF 11 2021
Entity-based Neural Local Coherence Modeling In this paper, we propose an entity-based neural local coherence model which is linguistically more sound than previously proposed neural coherence models. Recent neural coherence models encode the input document using large-scale pretrained language models. Hence their basis for computing local coherence is words and even sub-words. The analysis of their output shows that these models frequently compute coherence on the basis of connections between (sub-)words which, from a linguistic perspective, should not play a role. Still, these models achieve state-of-the-art performance in several end applications. In contrast to these models, we compute coherence on the basis of entities by constraining the input to noun phrases and proper names. This provides us with an explicit representation of the most important items in sentences, leading to the notion of focus. This brings our model linguistically in line with pre-neural models of computing coherence. It also gives us better insight into the behaviour of the model, thus leading to better explainability. Our approach is also in accord with a recent study (O'Connor and Andreas, 2021), which shows that most usable information is captured by nouns and verbs in transformer-based language models. We evaluate our model on three downstream tasks, showing that it is not only linguistically more sound than previous models but also that it outperforms them in end applications. PDF 11 2021
Disentangled Sequence to Sequence Learning for Compositional Generalization There is mounting evidence that existing neural network models, in particular the very popular sequence-to-sequence architecture, struggle to systematically generalize to unseen compositions of seen components. We demonstrate that one of the reasons hindering compositional generalization relates to representations being entangled. We propose an extension to sequence-to-sequence models which encourages disentanglement by adaptively re-encoding (at each time step) the source input. Specifically, we condition the source representations on the newly decoded target context, which makes it easier for the encoder to exploit specialized information for each prediction rather than capturing it all in a single forward pass. Experimental results on semantic parsing and machine translation empirically show that our proposal delivers more disentangled representations and better generalization. PDF 11 2021
More Than Just Attention: Improving Cross-Modal Attentions with Contrastive Constraints for Image-Text Matching Cross-modal attention mechanisms have been widely applied to the image-text matching task and have achieved remarkable improvements thanks to their capability of learning fine-grained relevance across different modalities. However, the cross-modal attention models of existing methods could be sub-optimal and inaccurate because there is no direct supervision provided during the training process. In this work, we propose two novel training strategies, namely Contrastive Content Re-sourcing (CCR) and Contrastive Content Swapping (CCS) constraints, to address such limitations. These constraints supervise the training of cross-modal attention models in a contrastive learning manner without requiring explicit attention annotations. They are plug-in training strategies and can be easily integrated into existing cross-modal attention models. Additionally, we introduce three metrics, including Attention Precision, Recall, and F1-Score, to quantitatively measure the quality of learned attention models. We evaluate the proposed constraints by incorporating them into four state-of-the-art cross-modal attention-based image-text matching models. Experimental results on both Flickr30k and MS-COCO datasets demonstrate that integrating these constraints improves the model performance in terms of both retrieval performance and attention metrics. PDF 11 2021
A Survey of Networking Cipher Algorithms and How Natural Language Can Be Used to Enhance Them This paper provides a survey of several of the networking cipher algorithms and proposes a method for integrating natural language processing (NLP) as a protective agent for them. Two main proposals are covered for the use of NLP in networking. First, NLP is considered as the weakest link in a networking encryption model; and, second, as a hefty deterrent when combined as an extra layer over what could be considered a strong type of encryption – the stream cipher. This paper summarizes how languages can be integrated into symmetric encryption as a way to assist in the encryption of vulnerable streams that may be found under attack due to the natural frequency distribution of letters or words in a local language stream. PDF 11 2021
First the worst: Finding better gender translations during beam search Generating machine translations via beam search seeks the most likely output under a model. However, beam search has been shown to amplify demographic biases exhibited by a model. We aim to address this, focusing on gender bias resulting from systematic errors in grammatical gender translation. Almost all prior work on this problem adjusts the training data or the model itself. By contrast, our approach changes only the inference procedure. We explore two techniques: applying constraints during inference to improve gender diversity in n-best lists, and reranking n-best lists using gender features obtained from the source sentence. Combining these methods gives large gains in gender translation accuracy for three language pairs without requiring additional bilingual data or retraining. PDF 11 2021
A Comparative Study of Pre-trained Encoders for Low-Resource Named Entity Recognition Pre-trained language models (PLM) are effective components of few-shot named entity recognition (NER) approaches when augmented with continued pre-training on task-specific out-of-domain data or fine-tuning on in-domain data. However, their performance in low-resource scenarios, where such data is not available, remains an open question. We introduce an encoder evaluation framework, and use it to systematically compare the performance of state-of-the-art pre-trained representations on the task of low-resource NER. We analyze a wide range of encoders pre-trained with different strategies, model architectures, intermediate-task fine-tuning, and contrastive learning. Our experimental results across ten benchmark NER datasets in English and German show that encoder performance varies significantly, suggesting that the choice of encoder for a specific low-resource scenario needs to be carefully evaluated. PDF 11 2021
Reasoning over Hybrid Chain for Table-and-Text Open Domain Question Answering Tabular and textual question answering requires systems to perform reasoning over heterogeneous information, considering table structure and the connections between table and text. In this paper, we propose a ChAin-centric Reasoning and Pre-training framework (CARP). CARP utilizes a hybrid chain to model the explicit intermediate reasoning process across table and text for question answering. We also propose a novel chain-centric pre-training method to enhance the pre-trained model in identifying the cross-modality reasoning process and alleviating the data sparsity problem. This method constructs a large-scale reasoning corpus by synthesizing pseudo heterogeneous reasoning paths from Wikipedia and generating corresponding questions. We evaluate our system on OTT-QA, a large-scale table-and-text open-domain question answering benchmark, and our system achieves state-of-the-art performance. Further analyses illustrate that the explicit hybrid chain offers substantial performance improvement and interpretability of the intermediate reasoning process, and the chain-centric pre-training boosts performance on chain extraction. PDF 11 2021
KenMeSH: Knowledge-enhanced End-to-end Biomedical Text Labelling Currently, Medical Subject Headings (MeSH) are manually assigned to every biomedical article published and subsequently recorded in the PubMed database to facilitate retrieving relevant information. With the rapid growth of the PubMed database, large-scale biomedical document indexing becomes increasingly important. MeSH indexing is a challenging task for machine learning, as it needs to assign multiple labels to each article from an extremely large, hierarchically organized collection. To address this challenge, we propose KenMeSH, an end-to-end model that combines new text features and a dynamic knowledge-enhanced mask attention that integrates document features with the MeSH label hierarchy and journal correlation features to index MeSH terms. Experimental results show the proposed method achieves state-of-the-art performance on a number of measures. PDF 11 2021
When classifying grammatical role, BERT doesn't care about word order... except when it matters Because meaning can often be inferred from lexical semantics alone, word order is often a redundant cue in natural language. For example, the words cut, chef, and onion are more likely used to convey "The chef cut the onion," not "The onion cut the chef." Recent work has shown large language models to be surprisingly word order invariant, but crucially has largely considered natural prototypical inputs, where compositional meaning mostly matches lexical expectations. To overcome this confound, we probe grammatical role representation in BERT and GPT-2 on non-prototypical instances. Such instances are naturally occurring sentences with inanimate subjects or animate objects, or sentences where we systematically swap the arguments to make sentences like "The onion cut the chef". We find that, while early layer embeddings are largely lexical, word order is in fact crucial in defining the later-layer representations of words in semantically non-prototypical positions. Our experiments isolate the effect of word order on the contextualization process, and highlight how models use context in the uncommon, but critical, instances where it matters. PDF 11 2021
Contrastive Demonstration Tuning for Pre-trained Language Models Pretrained language models can be effectively stimulated by textual prompts or demonstrations, especially in low-data scenarios. Recent works have focused on automatically searching discrete or continuous prompts or optimized verbalizers, yet studies of demonstrations are still limited. Concretely, the demonstration examples are crucial for an excellent final performance of prompt-tuning. In this paper, we propose a novel pluggable, extensible, and efficient approach named contrastive demonstration tuning, which is free of demonstration sampling. Furthermore, the proposed approach can be: (i) plugged into any previous prompt-tuning approaches; (ii) extended to widespread classification tasks with a large number of categories. Experimental results on 16 datasets illustrate that our method, integrated with the previous approaches LM-BFF and P-tuning, can yield better performance. PDF 11 2021
A Multi-Document Coverage Reward for RELAXed Multi-Document Summarization Multi-document summarization (MDS) has made significant progress in recent years, in part facilitated by the availability of new, dedicated datasets and capacious language models. However, a standing limitation of these models is that they are trained against limited references and with plain maximum-likelihood objectives. As for many other generative tasks, reinforcement learning (RL) offers the potential to improve the training of MDS models; yet, it requires a carefully-designed reward that can ensure appropriate leverage of both the reference summaries and the input documents. For this reason, in this paper we propose fine-tuning an MDS baseline with a reward that balances a reference-based metric such as ROUGE with coverage of the input documents. To implement the approach, we utilize RELAX (Grathwohl et al., 2018), a contemporary gradient estimator which is both low-variance and unbiased, and we fine-tune the baseline in a few-shot style for both stability and computational efficiency. Experimental results over the Multi-News and WCEP MDS datasets show significant improvements of up to +0.95 pp average ROUGE score and +3.17 pp METEOR score over the baseline, and competitive results with the literature. In addition, they show that the coverage of the input documents is increased, and evenly across all documents. PDF 11 2021
Enhancing the Nonlinear Mutual Dependencies in Transformers with Mutual Information The predictive uncertainty problem exists in Transformers. We show that pre-trained Transformers can be further regularized by mutual information to alleviate this issue in Neural Machine Translation (NMT). In this paper, we explicitly capture the nonlinear mutual dependencies existing in the two types of attention in the decoder to reduce the model's uncertainty concerning token-token interactions. Specifically, we adopt an unsupervised objective of mutual information maximization on self-attentions with the contrastive learning methodology and construct the estimate of mutual information using InfoNCE. Experimental results on WMT'14 En$\rightarrow$De and WMT'14 En$\rightarrow$Fr demonstrate the consistent effectiveness and evident improvements of our model over strong baselines. Quantifying the model uncertainty again verifies our hypothesis. The proposed plug-and-play approach can be easily incorporated and deployed into pre-trained Transformer models. Code will be released soon. PDF 11 2021
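A generic InfoNCE estimator looks as follows; how the paper pairs queries and keys from the two decoder attentions is not reproduced here, so this is only a sketch of the objective itself.

```python
import torch
import torch.nn.functional as F

def info_nce(queries, keys, temperature=0.1):
    # Each query's positive key sits on the diagonal; the other keys in the
    # batch act as negatives. Minimizing this loss maximizes a lower bound
    # on the mutual information between the two views.
    logits = queries @ keys.t() / temperature
    labels = torch.arange(queries.size(0))
    return F.cross_entropy(logits, labels)

q = F.normalize(torch.randn(32, 256), dim=-1)
k = F.normalize(torch.randn(32, 256), dim=-1)
loss = info_nce(q, k)
```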
Gradient Sparsification For \emph{Masked Fine-Tuning} of Transformers Fine-tuning masked language models is widely adopted for transfer learning to downstream tasks and can be achieved by (1) freezing gradients of the pretrained network and only updating gradients of a newly added classification layer or (2) performing gradient updates on all parameters. Gradual unfreezing trades off between the two by gradually unfreezing gradients of whole layers during training. We propose to extend this to {\em stochastic gradient masking} to regularize pretrained language models for improved fine-tuning performance. We introduce \emph{GradDrop} and variants thereof, a class of gradient sparsification methods that mask gradients prior to gradient descent. Unlike gradual unfreezing, which is non-sparse and deterministic, GradDrop is sparse and stochastic. Experiments on the multilingual XGLUE benchmark with XLM-R$_{\text{Large}}$ show that \emph{GradDrop} outperforms standard fine-tuning and gradual unfreezing, while being competitive against methods that use additional translated data and intermediate pretraining. Lastly, we identify cases where the largest zero-shot performance gains are on less-resourced languages. PDF 11 2021
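One step of stochastic gradient masking can be sketched as below; the per-entry Bernoulli mask is one simple variant, and the paper studies several others.

```python
import torch

def grad_drop_step(model, loss, optimizer, keep_prob=0.9):
    # Sample a Bernoulli mask per gradient entry and zero the dropped
    # gradients before the parameter update (illustrative variant).
    optimizer.zero_grad()
    loss.backward()
    for param in model.parameters():
        if param.grad is not None:
            mask = torch.bernoulli(torch.full_like(param.grad, keep_prob))
            param.grad.mul_(mask)
    optimizer.step()
```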
MarkBERT: Marking Word Boundaries Improves Chinese BERT We present a Chinese BERT model dubbed MarkBERT that uses word information. Existing word-based BERT models regard words as basic units; however, due to the vocabulary limit of BERT, they only cover high-frequency words and fall back to the character level when encountering out-of-vocabulary (OOV) words. Different from existing works, MarkBERT keeps the vocabulary as Chinese characters and inserts boundary markers between contiguous words. Such a design enables the model to handle all words in the same way, whether or not they are OOV words. Besides, our model has two additional benefits: first, it is convenient to add word-level learning objectives over the markers, which is complementary to traditional character- and sentence-level pretraining tasks; second, it can easily incorporate richer semantics such as POS tags of words by replacing generic markers with POS tag-specific markers. MarkBERT pushes the state of the art of Chinese named entity recognition from 95.4\% to 96.5\% on the MSRA dataset and from 82.8\% to 84.2\% on the OntoNotes dataset, respectively. Compared to previous word-based BERT models, MarkBERT achieves better accuracy on text classification, keyword recognition, and semantic similarity tasks.\footnote{All the codes and models will be made publicly available at \url{https://github.com/}} PDF 11 2021
Error-Correcting Codes For Approximate Neural Sequence Prediction We propose a novel neural sequence prediction method based on \textit{error-correcting codes} that avoids exact softmax normalization and allows for a tradeoff between speed and performance. Error-correcting codes represent predictions and targets as a binary code where each bit is represented by a logit. The codebook is arranged such that similar tokens are close to each other using word embedding similarity, ensuring that incorrect predictions are at least semantically close to the target. We also address the well-established problem of compounding errors by mixing the latent codes of past predictions and past targets in one of two ways: (1) according to a predefined sampling schedule or (2) a differentiable sampling procedure that replaces the argmax operation. Low dimensional codes show similar performance to models that use the full softmax and outperform alternative approximate methods for language modeling and text generation, while generation further benefits from our mixture sampling. PDF 11 2021
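Decoding from per-bit logits can be sketched with a nearest-codeword search; the three-token codebook below is a toy stand-in for the embedding-similarity-arranged codebook the paper describes.

```python
import numpy as np

def nearest_codeword(bit_logits, codebook):
    # Score each codeword by agreement with the logits: a codeword earns
    # +logit where its bit is 1 and -logit where its bit is 0.
    signs = 2 * codebook - 1          # {0,1} bits -> {-1,+1}
    scores = signs @ bit_logits       # one agreement score per token
    return int(np.argmax(scores))

codebook = np.array([[0, 0, 1], [0, 1, 0], [1, 1, 1]])  # toy 3-token, 3-bit codes
print(nearest_codeword(np.array([2.0, -1.0, 0.5]), codebook))  # -> 2
```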
Synthetic Data Generation and Joint Learning for Robust Code-Mixed Translation The widespread online communication in a modern multilingual world has provided opportunities to blend more than one language (aka code-mixed language) in a single utterance. This has created a formidable challenge for computational models due to the scarcity of annotated data and the presence of noise. A potential solution to mitigate the data scarcity problem in low-resource setups is to leverage existing data in a resource-rich language through translation. In this paper, we tackle the problem of code-mixed (Hinglish and Bengalish) to English machine translation. First, we synthetically develop HINMIX, a parallel corpus of Hinglish to English with ~5M sentence pairs. Subsequently, we propose JAMT, a robust perturbation-based joint-training model that learns to handle noise in real-world code-mixed text by parameter sharing across clean and noisy words. Further, we show the adaptability of JAMT in a zero-shot setup for Bengalish to English translation. Our evaluation and comprehensive analyses qualitatively and quantitatively demonstrate the superiority of JAMT over state-of-the-art code-mixed and robust translation methods. PDF 11 2021
SagDRE: Sequence-Aware Graph-Based Document-Level Relation Extraction with Adaptive Margin Loss Relation extraction (RE) is an important task for many natural language processing applications. Document-level relation extraction aims to extract the relations within a document and poses many challenges to the RE task, as it requires reasoning across sentences and handling multiple relations expressed in the same document. Existing state-of-the-art document-level RE models use the graph structure to better connect long-distance correlations. In this work, we propose the SagDRE model, which further considers and captures the original sequential information from the text. The proposed model learns sentence-level directional edges to capture the information flow in the document and uses the token-level sequential information to encode the shortest path from one entity to the other. In addition, we propose an adaptive margin loss to maximize the margins that separate positive and negative classes. The experimental results on datasets from various domains demonstrate the effectiveness of our proposed methods. PDF 11 2021
CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark Artificial Intelligence (AI), along with recent progress in biomedical language understanding, is gradually offering great promise for medical practice. With the development of biomedical language understanding benchmarks, AI applications are widely used in the medical field. However, most benchmarks are limited to English, which makes it challenging to replicate many of the successes in English for other languages. To facilitate research in this direction, we collect real-world biomedical data and present the first Chinese Biomedical Language Understanding Evaluation (CBLUE) benchmark: a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, and single-sentence/sentence-pair classification, with an associated online platform for model evaluation, comparison, and analysis. To establish evaluation on these tasks, we report empirical results with 11 current pre-trained Chinese models, and the experimental results show that state-of-the-art neural models perform far worse than the human ceiling. PDF 11 2021
ViQA-COVID: COVID-19 Machine Reading Comprehension Dataset for Vietnamese Two years after its appearance, COVID-19 has negatively affected people and normal life around the world. As of November 2021, there are more than 250 million cases and five million deaths worldwide (including nearly one million cases and over twenty-two thousand deaths in Vietnam). Economy and society are both severely affected. The Delta variant of COVID-19 has broken through countries' disease prevention measures and rapidly increased the number of infections. Resource overload in treatment and epidemic prevention is happening all over the world. Applying artificial intelligence (AI) to support people at this time is therefore extremely necessary. There have been many useful studies applying AI to COVID-19 prevention, including studies on machine reading comprehension (MRC). Recognizing this, we created ViQA-COVID, the first MRC dataset about COVID-19 for Vietnamese, which can be used to build models and systems that contribute to disease prevention. ViQA-COVID is also the first multi-span extraction MRC dataset for Vietnamese; we hope that it can contribute to promoting MRC studies in Vietnamese and multilingual settings. PDF 11 2021
Skill Induction and Planning with Latent Language We present a framework for learning hierarchical policies from demonstrations, using sparse natural language annotations to guide the discovery of reusable skills for autonomous decision-making. We formulate a generative model of action sequences in which goals generate sequences of high-level subtask descriptions, and these descriptions generate sequences of low-level actions. We describe how to train this model using primarily unannotated demonstrations by parsing demonstrations into sequences of named high-level subtasks, using only a small number of seed annotations to ground language in action. In trained models, the space of natural language commands indexes a combinatorial library of skills; agents can use these skills to plan by generating high-level instruction sequences tailored to novel goals. We evaluate this approach in the ALFRED household simulation environment, providing natural language annotations for only 10% of demonstrations. It completes more than twice as many tasks as a standard approach to learning from demonstrations, matching the performance of instruction following models with access to ground-truth plans during both training and evaluation. PDF 11 2021
Softmax Bottleneck Makes Language Models Unable to Represent Multi-mode Word Distributions Neural language models (LMs) such as GPT-2 estimate the probability distribution over the next word via a softmax over the vocabulary. The softmax layer produces the distribution from the dot products of a single hidden state with the embeddings of words in the vocabulary. However, we discover that this single hidden state cannot produce all probability distributions, regardless of LM size or training data size, because the single hidden state embedding cannot be close to the embeddings of all possible next words simultaneously when there are other interfering word embeddings between them. In this work, we demonstrate the importance of this limitation both theoretically and practically. Our work not only deepens our understanding of the softmax bottleneck and mixture of softmax (MoS) but also inspires us to propose multi-facet softmax (MFS) to address the limitations of MoS. Extensive empirical analyses confirm our findings and show that, compared with MoS, the proposed MFS achieves two-fold improvements in the perplexity of GPT-2 and BERT. PDF 11 2021
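To make the bottleneck concrete, the following numpy sketch contrasts a single-softmax head with a two-component mixture of softmax (MoS); the proposed multi-facet softmax is not reproduced here, and all shapes are illustrative:

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    rng = np.random.default_rng(0)
    d, vocab = 32, 100
    E = rng.normal(size=(vocab, d))      # output word embeddings

    h = rng.normal(size=d)               # a single hidden state
    single = softmax(E @ h)              # one dot product per word: one "mode"

    # MoS: several facet vectors whose softmaxes are mixed, allowing
    # multi-mode next-word distributions a single softmax cannot express.
    facets = rng.normal(size=(2, d))
    pi = softmax(rng.normal(size=2))     # mixture weights
    mos = sum(w * softmax(E @ f) for w, f in zip(pi, facets))

    print(single.argmax(), mos.argmax())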
CodeRetriever: Unimodal and Bimodal Contrastive Learning for Code Search In this paper, we propose the CodeRetriever model, which combines unimodal and bimodal contrastive learning to train function-level code semantic representations, specifically for the code search task. For unimodal contrastive learning, we design a semantic-guided method to build positive code pairs based on the documentation and function name. For bimodal contrastive learning, we leverage the documentation and in-line comments of code to build text-code pairs. Both contrastive objectives can fully leverage the large-scale code corpus for pre-training. Experimental results on several public benchmarks (e.g., CodeSearch and CoSQA) demonstrate the effectiveness of CodeRetriever in the zero-shot setting. By fine-tuning with domain- or language-specific downstream data, CodeRetriever achieves new state-of-the-art performance with significant improvements over existing code pre-trained models. We will make the code, model checkpoint, and constructed datasets publicly available. PDF 11 2021
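Both objectives are variants of contrastive learning with in-batch negatives; below is a generic InfoNCE sketch in numpy, with random vectors standing in for the code and text encoders. This illustrates the loss family, not the paper's implementation:

    import numpy as np

    def info_nce(queries, keys, temperature=0.05):
        """In-batch contrastive loss: row i of queries matches row i of keys."""
        q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
        k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
        logits = q @ k.T / temperature                 # cosine similarity matrix
        logits -= logits.max(axis=1, keepdims=True)    # numerical stability
        log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -float(np.mean(np.diag(log_prob)))      # NLL of matching pairs

    rng = np.random.default_rng(0)
    code_vecs = rng.normal(size=(8, 64))                    # function embeddings
    text_vecs = code_vecs + 0.1 * rng.normal(size=(8, 64))  # paired docstrings
    print(info_nce(text_vecs, code_vecs))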
That Slepen Al the Nyght with Open Ye! Cross-era Sequence Segmentation with Switch-memory Language evolution follows the rule of gradual change. Shifts in grammar, vocabulary, and lexical semantics take place over time, resulting in a diachronic linguistic gap. A considerable amount of text is written in languages of different eras, which poses obstacles for natural language processing tasks such as word segmentation and machine translation. Chinese is a language with a long history, but previous Chinese natural language processing work has mainly focused on tasks in a specific era. Therefore, in this paper, we propose a cross-era learning framework for Chinese word segmentation (CWS), CROSSWISE, which uses a Switch-memory (SM) module to incorporate era-specific linguistic knowledge. Experiments on four corpora from different eras show that performance on each corpus improves significantly. Further analyses also demonstrate that the SM can effectively integrate era-specific knowledge into the neural network. PDF 11 2021
Probing the Robustness of Trained Metrics for Conversational Dialogue Systems This paper introduces an adversarial method to stress-test trained metrics for the evaluation of conversational dialogue systems. The method leverages Reinforcement Learning to find response strategies that elicit optimal scores from the trained metrics. We apply our method to test recently proposed trained metrics. We find that they all are susceptible to giving high scores to responses generated by rather simple and obviously flawed strategies that our method converges on. For instance, simply copying parts of the conversation context to form a response yields competitive scores or even outperforms responses written by humans. PDF 11 2021
Multilingual offensive lexicon annotated with contextual information Detecting online hate speech and offensive comments is not a trivial research problem, since pragmatic (contextual) factors influence what is considered offensive. Moreover, offensive terms are hardly found in classical lexical resources such as wordnets and sentiment and emotion lexicons. In this paper, we embrace the challenges and opportunities of the area and introduce the first multilingual offensive lexicon (MOL), which is composed of 1,000 explicit and implicit pejorative terms and expressions annotated with contextual information. The terms and expressions were manually extracted by a specialist from Instagram abusive comments originally written in Portuguese and manually translated by American English, Latin American Spanish, African French, and German native speakers. Each expression was annotated by three different annotators, producing high human inter-annotator agreement. Accordingly, this resource provides a new perspective for exploring abusive language detection. PDF 11 2021
HateBR: Large expert annotated corpus of Brazilian Instagram comments for abusive language detection Due to the severity of abusive comments on social media in Brazil, and the lack of research in Portuguese, this paper provides the first large-scale annotated corpus of Brazilian Instagram comments for hate speech and offensive language detection on the web and social media. The HateBR corpus was collected from Brazilian Instagram comments on political personalities and manually annotated by specialists. It comprises 7,000 documents annotated according to three different layers: a binary classification (offensive versus non-offensive comments), offense-level classes (highly, moderately, and slightly offensive messages), and nine hate speech targets (xenophobia, racism, homophobia, sexism, religious intolerance, partyism, apology for the dictatorship, antisemitism, and fatphobia). Each comment was annotated by three different annotators and achieved high inter-annotator agreement. PDF 11 2021
Multi-Task Pre-Training for Plug-and-Play Task-Oriented Dialogue System Pre-trained language models have been recently shown to benefit task-oriented dialogue (TOD) systems. Despite their success, existing methods often formulate this task as a cascaded generation problem which can lead to error accumulation across different sub-tasks and greater data annotation overhead. In this study, we present PPTOD, a unified plug-and-play model for task-oriented dialogue. In addition, we introduce a new dialogue multi-task pre-training strategy that allows the model to learn the primary TOD task completion skills from heterogeneous dialog corpora. We extensively test our model on three benchmark TOD tasks, including end-to-end dialogue modelling, dialogue state tracking, and intent classification. Experimental results show that PPTOD achieves new state of the art on all evaluated tasks in both high-resource and low-resource scenarios. Furthermore, comparisons against previous SOTA methods show that the responses generated by PPTOD are more factually correct and semantically coherent as judged by human annotators. PDF 11 2021
Learning Universal Sentence Embeddings with Large-scale Parallel Translation Datasets Although contrastive learning has greatly improved sentence representation, its performance is still limited by the size of monolingual sentence-pair datasets. Meanwhile, there exist large-scale parallel translation pairs (100x larger than monolingual pairs) that are semantically highly correlated but have not been utilized for learning universal sentence representations. Furthermore, given parallel translation pairs, previous contrastive learning frameworks cannot balance well the alignment and uniformity of monolingual embeddings, which together characterize embedding quality. In this paper, we build on top of the dual-encoder architecture and propose to freeze the source-language encoder, utilizing its consistent embeddings to supervise the target-language encoder via contrastive learning, where source-target translation pairs are regarded as positives. We provide the first exploration of utilizing parallel translation sentence pairs to learn universal sentence embeddings and show superior performance in balancing alignment and uniformity. We achieve new state-of-the-art performance on the average score of standard semantic textual similarity (STS), outperforming both SimCSE and Sentence-T5, and the best performance in corresponding tracks on transfer tasks. PDF 11 2021
CRAFT: A Benchmark for Causal Reasoning About Forces and inTeractions Humans are able to perceive, understand and reason about causal events. Developing models with similar physical and causal understanding capabilities is a long-standing goal of artificial intelligence. As a step towards this direction, we introduce CRAFT, a new video question answering dataset that requires causal reasoning about physical forces and object interactions. It contains 58K video and question pairs that are generated from 10K videos from 20 different virtual environments, containing various objects in motion that interact with each other and the scene. Two question categories in CRAFT include previously studied descriptive and counterfactual questions. Additionally, inspired by the Force Dynamics Theory in cognitive linguistics, we introduce a new causal question category that involves understanding the causal interactions between objects through notions like cause, enable, and prevent. Our results show that even though the questions in CRAFT are easy for humans, the tested baseline models, including existing state-of-the-art methods, do not yet deal with the challenges posed in our benchmark. PDF 11 2021
SAMBERT: Improve Aspect Sentiment Triplet Extraction by Segmenting the Attention Maps of BERT Aspect Sentiment Triplet Extraction (ASTE) performs fine-grained sentiment analysis in a unified way by extracting sentiment triplets comprised of aspect terms, opinion spans, and their sentiment relations in sentences. Previous works show that adopting BERT, simply leveraging its last-layer output as the word representation, is beneficial for recognizing triplet elements. However, their methods limit the potential of the pretrained knowledge in BERT, since different layers capture multi-level linguistic information in sentences, which is useful for ASTE as well. In this work, we explore accessing this rich pretrained knowledge by fully leveraging the attention maps of different layers. To this end, we propose to Segment the Attention Maps of BERT (SAMBERT), borrowing the merits of semantic segmentation, which can effectively discriminate desired objects from others in an image. In this procedure, we can further reason over the knowledge of different levels in these attention maps to distinguish aspect terms, opinion spans, and their sentiment relations from other parts, which results in a same-shape tagging matrix of word pairs for deriving sentiment triplets. Through extensive experiments on four benchmarks, we demonstrate that our method achieves a new state of the art. PDF 11 2021
Rebuild and Ensemble: Exploring Defense Against Text Adversaries Adversarial attacks can mislead strong neural models; in NLP tasks, substitution-based attacks are particularly difficult to defend against. Current defense methods usually assume that the substitution candidates are accessible, so they cannot be applied widely against adversarial attacks unless the mechanism of the attack is known. In this paper, we propose a \textbf{Rebuild and Ensemble} framework to defend against adversarial attacks on text without knowing the candidates. We propose a rebuild mechanism to train a robust model and ensemble the rebuilt texts during inference to achieve good adversarial defense results. Experiments show that our method can improve accuracy under current strong attack methods. PDF 11 2021
Detection of Adversarial Examples in NLP: Benchmark and Baseline via Robust Density Estimation Word-level adversarial attacks have shown success against NLP models, drastically decreasing the performance of transformer-based models in recent years. As a countermeasure, adversarial defense has been explored, but relatively little effort has been made to detect adversarial examples. However, detecting adversarial examples in NLP may be crucial for automated tasks (e.g., review sentiment analysis) that aim to amass information about a certain population, and is additionally a step towards a robust defense system. To this end, we release a dataset for four popular attack methods on three datasets and four NLP models to encourage further research in this field. Along with it, we propose a competitive baseline based on density estimation that has the highest \textsc{auc} on 21 out of 22 dataset-attack-model combinations.\footnote{https://github.com/anoymous92874838/text-adv-detection} PDF 11 2021
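A density-estimation detector of this family can be sketched as fitting a Gaussian to features of clean examples and flagging low-density (high Mahalanobis distance) inputs; the features and threshold below are hypothetical stand-ins, not the paper's exact estimator:

    import numpy as np

    rng = np.random.default_rng(0)
    clean_feats = rng.normal(size=(500, 16))   # stand-in sentence features

    # Fit a single multivariate Gaussian to clean-example features.
    mu = clean_feats.mean(axis=0)
    cov = np.cov(clean_feats, rowvar=False) + 1e-6 * np.eye(16)
    cov_inv = np.linalg.inv(cov)

    def mahalanobis(x):
        d = x - mu
        return float(d @ cov_inv @ d)

    # Inputs beyond a validation-chosen quantile are flagged as adversarial.
    threshold = np.quantile([mahalanobis(f) for f in clean_feats], 0.99)
    suspicious = rng.normal(loc=3.0, size=16)   # shifted "adversarial" input
    print(mahalanobis(suspicious) > threshold)  # True: flagged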
TREND: Trigger-Enhanced Relation-Extraction Network for Dialogues The goal of dialogue relation extraction (DRE) is to identify the relation between two entities in a given dialogue. During conversations, speakers may expose their relations to certain entities through clues; such evidence is called ''triggers''. However, none of the existing work on DRE has tried to detect triggers and leverage this information to enhance performance. This paper proposes TREND, a multi-tasking BERT-based model that learns to identify triggers for improving relation extraction. The experimental results show that the proposed method achieves the state of the art on the benchmark datasets. PDF 11 2021
Towards a Fine-Grained Multi-Domain Neural Machine Translation Using Inter-Domain Relationships While research on the domain adaptation task in neural machine translation has become popular recently, there is no agreement on what constitutes a domain, and most previous studies focus only on coarse-grained domain adaptation, with methods that do not generalize when the number of domains is large. In this work, we argue for the necessity of studying a fine-grained domain adaptation problem. We build a new multilingual dataset from web sources that focuses on fine-grained domains and inter-domain attributes and relationships. We also propose a simple but effective adaptation method to incorporate domain knowledge, leveraging models in information networks. PDF 11 2021
Deep Reinforcement Learning for Entity Alignment Embedding-based methods have attracted increasing attention in recent entity alignment (EA) studies. Despite the great promise they offer, several limitations remain. The most notable is that they identify aligned entities based on cosine similarity, ignoring the semantics underlying the embeddings themselves. Furthermore, these methods are shortsighted, heuristically selecting the closest entity as the target and allowing multiple entities to match the same candidate. To address these limitations, we model entity alignment as a sequential decision-making task, in which an agent sequentially decides whether two entities are matched or mismatched based on their representation vectors. The proposed reinforcement learning (RL)-based entity alignment framework can be flexibly adapted to most embedding-based EA methods. Our experiments demonstrate that it consistently advances the performance of several state-of-the-art methods, with a maximum improvement of 31.1% on Hits@1. PDF 11 2021
A System to Filter out Unwanted Social Media Content in Real-time on iPhones Social media users are often harassed. This paper presents a patented system to filter out harassing content before it reaches the recipient. Our first version is for the iPhone. To detect harassment, we adopted sentiment analysis with a supervised learning approach that combines Machine Learning (ML) text classifiers with a lexicon approach that provides a feedback loop to retrain the ML model with unknown terms. Because good data is essential to obtain the best output of any system, we focused on validating our labeled data. Our results on static and real-time data have an accuracy of, respectively, 90% and 94%. Our labeled data validation allows us to correct labels; we also realized the need to increase the number of sets in our lexicons. Our prototype demonstrates that we are able to build an AI infrastructure to filter out harassment on an iPhone in real-time with good results. PDF 11 2021
We need to talk about random seeds Modern neural network libraries all take as a hyperparameter a random seed, typically used to determine the initial state of the model parameters. This position piece argues that there are some safe uses for random seeds: as part of the hyperparameter search to select a good model, creating an ensemble of several models, or measuring the sensitivity of the training algorithm to the random seed hyperparameter. It argues that some uses for random seeds are risky: using a fixed random seed for ``replicability'' and varying only the random seed to create score distributions for performance comparison. An analysis of 85 recent publications from the ACL Anthology shows that more than 50% contain risky uses of random seeds. PDF 11 2021
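The safe use the paper recommends, measuring seed sensitivity, amounts to reporting a score distribution over several seeds rather than a single run; a toy sketch, where the training function is a stand-in for a real training loop:

    import numpy as np

    def train_and_evaluate(seed):
        """Stand-in for a full training run; returns a seed-dependent score."""
        rng = np.random.default_rng(seed)
        return 0.85 + 0.02 * rng.normal()

    scores = [train_and_evaluate(s) for s in range(10)]
    print(f"accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f} (10 seeds)")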
CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training We propose a novel open-domain question-answering dataset based on the Common Crawl project. With a previously unseen number of around 130 million multilingual question-answer pairs (including about 60 million English data-points), we use our large-scale, natural, diverse and high-quality corpus to in-domain pre-train popular language models for the task of question-answering. In our experiments, we find that our Common Crawl Question Answering dataset (CCQA) achieves promising results in zero-shot, low resource and fine-tuned settings across multiple tasks, models and benchmarks. PDF 11 2021
On Isotropy Calibration of Transformer Models Different studies of the embedding space of transformer models suggest that the distribution of contextual representations is highly anisotropic - the embeddings are distributed in a narrow cone. Meanwhile, static word representations (e.g., Word2Vec or GloVe) have been shown to benefit from isotropic spaces. Therefore, previous work has developed methods to calibrate the embedding space of transformers in order to ensure isotropy. However, a recent study (Cai et al. 2021) shows that the embedding space of transformers is locally isotropic, which suggests that these models are already capable of exploiting the expressive capacity of their embedding space. In this work, we conduct an empirical evaluation of state-of-the-art methods for isotropy calibration on transformers and find that they do not provide consistent improvements across models and tasks. These results support the thesis that, given the local isotropy, transformers do not benefit from additional isotropy calibration. PDF 11 2021
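Anisotropy is often quantified as the expected cosine similarity between randomly drawn embeddings, near 0 for an isotropic space and near 1 for a narrow cone; a numpy sketch of that common diagnostic, with random vectors in place of contextual embeddings:

    import numpy as np

    def mean_pairwise_cosine(X, n_pairs=10_000, seed=0):
        """Estimate E[cos(x_i, x_j)] over random pairs of embeddings."""
        rng = np.random.default_rng(seed)
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
        i = rng.integers(0, len(X), n_pairs)
        j = rng.integers(0, len(X), n_pairs)
        return float(np.mean(np.sum(X[i] * X[j], axis=1)))

    rng = np.random.default_rng(0)
    isotropic = rng.normal(size=(1000, 64))
    cone = isotropic + 5.0                    # shared offset -> narrow cone
    print(mean_pairwise_cosine(isotropic))    # close to 0
    print(mean_pairwise_cosine(cone))         # close to 1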
Exploring the Influence of Dialog Input Format for Unsupervised Clinical Questionnaire Filling In the medical field, we have seen the emergence of health-bots that interact with patients to gather data and track their state. One of the downstream applications is automatic questionnaire filling, where the content of the dialog is used to automatically fill a pre-defined medical questionnaire. Answering questions from the dialog context can be cast as a Natural Language Inference (NLI) task and can therefore benefit from current pre-trained NLI models. However, these models have generally not been trained on dialog-formatted input, which may influence their performance. In this paper, we study the influence of dialog input format on the task. Our results demonstrate that dialog pre-processing and content selection can significantly improve the performance of zero-shot models. PDF 11 2021
A Multilingual Bag-of-Entities Model for Zero-Shot Cross-Lingual Text Classification We present a multilingual bag-of-entities model that effectively boosts the performance of zero-shot cross-lingual text classification by extending a multilingual pre-trained language model (e.g., M-BERT). It leverages the multilingual nature of Wikidata: entities in multiple languages representing the same concept are defined with a unique identifier. This enables entities described in multiple languages to be represented using shared embeddings. A model trained on entity features in a resource-rich language can thus be directly applied to other languages. Our experimental results on cross-lingual topic classification (using the MLDoc and TED-CLDC datasets) and entity typing (using the SHINRA2020-ML dataset) show that the proposed model consistently outperforms state-of-the-art models. PDF 11 2021
Learning to Teach with Student Feedback Knowledge distillation (KD) has gained much attention due to its effectiveness in compressing large-scale pre-trained models. In typical KD methods, the small student model is trained to match the soft targets generated by the big teacher model. However, the interaction between student and teacher is one-way: the teacher is usually fixed once trained, resulting in static soft targets to be distilled. This one-way interaction means the teacher cannot perceive the characteristics of the student and its training progress. To address this issue, we propose Interactive Knowledge Distillation (IKD), which also allows the teacher to learn to teach from the feedback of the student. In particular, IKD trains the teacher model to generate specific soft targets at each training step for a given student. Joint optimization for both teacher and student is achieved via two iterative steps: a course step to optimize the student with the soft targets of the teacher, and an exam step to optimize the teacher with the feedback of the student. IKD is a general framework that is orthogonal to most existing knowledge distillation methods. Experimental results show that IKD outperforms traditional KD methods on various NLP tasks. PDF 11 2021
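The one-way setup that IKD extends is ordinary soft-target distillation: a KL divergence between temperature-softened teacher and student distributions. A generic numpy sketch of standard KD (not the paper's interactive variant):

    import numpy as np

    def softmax(x, T=1.0):
        e = np.exp((x - x.max()) / T)
        return e / e.sum()

    def kd_loss(student_logits, teacher_logits, T=2.0):
        """KL(teacher || student) on temperature-softened distributions."""
        p_t = softmax(teacher_logits, T)
        p_s = softmax(student_logits, T)
        return float(np.sum(p_t * (np.log(p_t) - np.log(p_s))) * T * T)

    rng = np.random.default_rng(0)
    teacher = rng.normal(size=10)                    # fixed teacher logits
    student = teacher + 0.5 * rng.normal(size=10)    # imperfect student
    print(kd_loss(student, teacher))

IKD's exam step would additionally update the teacher using the student's feedback, rather than keeping the teacher logits fixed as above.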
Accurate Online Posterior Alignments for Principled Lexically-Constrained Decoding Online alignment in machine translation refers to the task of aligning a target word to a source word when the target sequence has only been partially decoded. Good online alignments facilitate important applications such as lexically constrained translation, where user-defined dictionaries are used to inject lexical constraints into the translation model. We propose a novel posterior alignment technique that is truly online in its execution and superior in terms of alignment error rates compared to existing methods. Our proposed inference technique jointly considers alignment and token probabilities in a principled manner and can be seamlessly integrated within existing constrained beam-search decoding algorithms. On five language pairs, including two distant language pairs, we achieve a consistent drop in alignment error rates. When deployed on seven lexically constrained translation tasks, we achieve significant improvements in BLEU, specifically around the constrained positions. PDF 11 2021
Salient Phrase Aware Dense Retrieval: Can a Dense Retriever Imitate a Sparse One? Despite their recent popularity and well known advantages, dense retrievers still lag behind sparse methods such as BM25 in their ability to reliably match salient phrases and rare entities in the query. It has been argued that this is an inherent limitation of dense models. We disprove this claim by introducing the Salient Phrase Aware Retriever (SPAR), a dense retriever with the lexical matching capacity of a sparse model. In particular, we show that a dense retriever Λ can be trained to imitate a sparse one, and SPAR is built by augmenting a standard dense retriever with Λ. When evaluated on five open-domain question answering datasets and the MS MARCO passage retrieval task, SPAR sets a new state of the art for dense and sparse retrievers and can match or exceed the performance of more complicated dense-sparse hybrid systems. PDF 11 2021
Speciesist Language and Nonhuman Animal Bias in English Masked Language Models Various existing studies have analyzed what social biases are inherited by NLP models. These biases may directly or indirectly harm people; therefore, previous studies have focused only on human attributes. However, if social biases in NLP models can indirectly harm the humans involved, the models can also indirectly harm nonhuman animals. No research on social biases in NLP regarding nonhumans exists. In this paper, we analyze bias toward nonhuman animals, i.e., speciesist bias, inherent in English Masked Language Models. We analyze this bias using template-based and corpus-extracted sentences that contain speciesist (or non-speciesist) language, and show that these models tend to associate harmful words with nonhuman animals. Our code for reproducing the experiments will be made available on GitHub. PDF 11 2021
Relation Extraction from Tables using Artificially Generated Metadata Relation Extraction (RE) from tables is the task of identifying relations between pairs of columns of a table. Generally, RE models for this task require labelled tables for training. These labelled tables can also be generated artificially from a Knowledge Graph (KG), which makes them much cheaper to acquire than manual annotations. However, unlike real tables, these synthetic tables lack associated metadata, such as column headers and captions; this is because synthetic tables are created from KGs that do not store such metadata. Meanwhile, previous works have shown that metadata is important for accurate RE from tables. To address this issue, we propose methods to artificially create some of this metadata for synthetic tables. Afterward, we experiment with a BERT-based model, in line with recently published works, that takes as input a combination of the proposed artificial metadata and table content. Our empirical results show that this leads to improvements of 9\%-45\% in F1 score, in absolute terms, on two tabular datasets. PDF 11 2021
Multi-Stage Framework with Refinement based Point Set Registration for Unsupervised Bi-Lingual Word Alignment Cross-lingual alignment of word embeddings plays an important role in knowledge transfer across languages, improving machine translation and other multi-lingual applications. Current unsupervised approaches rely on learning structure-preserving linear transformations using adversarial networks and refinement strategies. However, such techniques tend to suffer from instability and convergence issues, requiring tedious fine-tuning of parameter settings. This paper proposes BioSpere, a novel multi-stage framework for unsupervised mapping of bi-lingual word embeddings onto a shared vector space, combining adversarial initialization, a refinement procedure, and a point set registration algorithm. We show that our framework alleviates the above shortcomings and is robust against variable adversarial learning performance and parameter choices. Experiments on parallel dictionary induction, sentence translation, and word similarity demonstrate state-of-the-art results for BioSpere on diverse language pairs. PDF 11 2021
A Generative Approach for Mitigating Structural Biases in Natural Language Inference Many natural language inference (NLI) datasets contain biases that allow models to perform well by using only a biased subset of the input, without considering the remaining features. For instance, models are able to classify samples using only the hypothesis, without learning the true relationship between it and the premise. These structural biases lead discriminative models to learn unintended superficial features and to generalize poorly out of the training distribution. In this work, we reformulate the NLI task as a generative task, where a model is conditioned on the biased subset of the input and the label, and generates the remaining subset of the input. We show that by imposing a uniform prior, we obtain a provably unbiased model. Through synthetic experiments, we find that this approach is highly robust to large amounts of bias. We then demonstrate empirically on two types of natural bias that this approach leads to fully unbiased models in practice. However, we find that generative models are difficult to train and generally perform worse than discriminative baselines. We highlight the difficulty of the generative modeling task in the context of NLI as a cause of this worse performance. Finally, by fine-tuning the generative model with a discriminative objective, we reduce the performance gap between the generative model and the discriminative baseline, while allowing for a small amount of bias. PDF 11 2021
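With a uniform label prior, classification under a generative reformulation of this kind reduces to an argmax over generative scores p(premise | hypothesis, label); a schematic sketch, with a random stand-in where a seq2seq language model would score the generation:

    import numpy as np

    LABELS = ["entailment", "neutral", "contradiction"]

    def log_p_premise(premise, hypothesis, label):
        """Stand-in for log p(premise | hypothesis, label) from a seq2seq LM."""
        rng = np.random.default_rng(abs(hash((premise, hypothesis, label))) % 2**32)
        return float(rng.normal())

    def classify(premise, hypothesis):
        # Uniform prior over labels, so Bayes' rule reduces to:
        # argmax_y p(y | x) = argmax_y p(premise | hypothesis, y)
        scores = [log_p_premise(premise, hypothesis, y) for y in LABELS]
        return LABELS[int(np.argmax(scores))]

    print(classify("A man plays guitar.", "A person makes music."))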
Perturbation-based Self-supervised Attention for Text Classification For text classification, the traditional attention mechanism usually pays too much attention to frequent words and needs a lot of labeled data to learn a good attention distribution. Introducing human attention is a classical remedy, but it incurs a high manual-labeling cost. This paper proposes a perturbation-based self-supervised attention approach to guide attention learning without any annotation overhead. Specifically, we add as much noise as possible to all the words in a sentence simultaneously, without changing their semantics or the model's predictions. Since words that tolerate more noise are presumably less significant, we can use noise tolerance to derive attention supervision information and refine the attention distribution. Experimental results on three text classification tasks show that our approach can significantly promote the performance of current attention-based models and is more effective than existing self-supervised methods. We also provide a visualization analysis to verify the effectiveness of our approach. PDF 11 2021
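The core loop can be sketched as: for each word, find the largest noise scale that never flips the prediction, then turn tolerances into inverted attention targets. The linear classifier, scales, and normalization below are all hypothetical:

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=16)                    # stand-in linear classifier

    def predict(word_vecs):
        return int(word_vecs.mean(axis=0) @ W > 0)

    def noise_tolerance(word_vecs, i, scales=np.linspace(0.1, 2.0, 20), trials=20):
        """Largest noise scale on word i that never flips the prediction."""
        base, best = predict(word_vecs), 0.0
        for s in scales:
            for _ in range(trials):
                noisy = word_vecs.copy()
                noisy[i] += s * rng.normal(size=word_vecs.shape[1])
                if predict(noisy) != base:
                    return best
            best = s
        return best

    sent = rng.normal(size=(5, 16))            # 5 words, 16-dim embeddings
    tol = np.array([noise_tolerance(sent, i) for i in range(5)])
    # Words tolerating more noise matter less, so invert and normalize.
    target = (tol.max() - tol) / (tol.max() - tol.min() + 1e-9)
    print(target)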
Sparsity Regularization for Chinese Spelling Check The objective of Chinese Spelling Check (CSC) research is to detect and correct spelling errors in the input. Generally, the number of incorrect characters in the input is far smaller than that of correct ones, so the error-probability sequence predicted by the detection module should be sparse and sharp. However, existing work has ignored this property. In this paper, we add a sparsity regularization term to the objective function to make the output of the detection module sparse and sharp. We study two kinds of regularization: L1 regularization and minimum-entropy regularization. Extensive experiments on the SIGHAN benchmarks show that the sparsity regularization proposed in this paper can effectively improve the performance of the CSC model without increasing computational complexity. In addition, robustness experiments show that our method is robust. PDF 11 2021
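The two regularizers act on the detector's per-character error probabilities and are simply added to the base loss; a numpy sketch of the penalty terms, where the suggested 0.1 weights are hypothetical:

    import numpy as np

    def sparsity_penalties(p, eps=1e-9):
        """L1 and minimum-entropy penalties on error probabilities p in [0, 1]."""
        l1 = np.sum(np.abs(p))                              # push probs toward 0
        entropy = -np.sum(p * np.log(p + eps)
                          + (1 - p) * np.log(1 - p + eps))  # push probs to 0 or 1
        return float(l1), float(entropy)

    p = np.array([0.05, 0.9, 0.1, 0.02, 0.03])  # sparse, sharp detector output
    l1, ent = sparsity_penalties(p)
    print(f"L1={l1:.3f}  entropy={ent:.3f}")    # e.g. total = base + 0.1*(l1+ent)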
Multimodal Learning: Are Captions All You Need? In today's digital world, it is increasingly common for information to be multimodal: images or videos often accompany text. Sophisticated multimodal architectures such as ViLBERT, VisualBERT, and LXMERT have achieved state-of-the-art performance in vision-and-language tasks. However, existing vision models cannot represent contextual information and semantics like transformer-based language models can. Fusing the semantic-rich information coming from text becomes a challenge. In this work, we study the alternative of first transforming images into text using image captioning. We then use transformer-based methods to combine the two modalities in a simple but effective way. We perform an empirical analysis on different multimodal tasks, describing the benefits, limitations, and situations where this simple approach can replace large and expensive handcrafted multimodal models. PDF 11 2021
AutoTriggER: Named Entity Recognition with Auxiliary Trigger Extraction Deep neural models for low-resource named entity recognition (NER) have shown impressive results by leveraging distant supervision or other meta-level information (e.g., explanations). However, the costs of acquiring such additional information are generally prohibitive, especially in domains where existing resources (e.g., databases to be used for distant supervision) may not exist. In this paper, we present a novel two-stage framework (AutoTriggER) to improve NER performance by automatically generating and leveraging "entity triggers", which are essentially human-readable clues in the text that can help guide the model to make better decisions. Thus, the framework is able to both create and leverage auxiliary supervision by itself. Through experiments on three well-studied NER datasets, we show that our automatically extracted triggers are well-matched to human triggers, and that AutoTriggER improves performance over a RoBERTa-CRF architecture by nearly 0.5 F1 points on average, and by much more in a low-resource setting. PDF 11 2021
RE: A Study for Restorable Embeddings As the number of model parameters has increased, large language models have achieved linguistic fluency and exhibited high performance on various natural language tasks without gradient updates, because the models can retain more knowledge. However, the large model size makes it difficult to apply the model to a task requiring domain knowledge not included in the training corpus, because knowledge stored in model parameters is not controllable during generation and model parameter updates are costly. To tackle this problem, we suggest separating the language model and knowledge, dividing the end-to-end language model into three parts: 1) encoding knowledge, 2) processing the encoded knowledge, and 3) restoring the processed knowledge embedding to natural language. In this paper, we propose a model for learning restorable embeddings as a first step toward separating the language model and knowledge. The experimental results show that the proposed model can restore most knowledge in 1-2 sentences by encoding knowledge in sentence-level embeddings and then restoring the embeddings back to the original sentence. We also verify that the embeddings generated through our method significantly improve performance on the passage retrieval task. PDF 11 2021
Question Answering Infused Pre-training of General-Purpose Contextualized Representations We propose a pre-training objective based on question answering (QA) for learning general-purpose contextual representations, motivated by the intuition that the representation of a phrase in a passage should encode all questions that the phrase can answer in context. To this end, we train a bi-encoder QA model, which independently encodes passages and questions, to match the predictions of a more accurate cross-encoder model on 80 million synthesized QA pairs. By encoding QA-relevant information, the bi-encoder's token-level representations are useful for non-QA downstream tasks without extensive (or in some cases, any) fine-tuning. We show large improvements over both RoBERTa-large and previous state-of-the-art results on zero-shot and few-shot paraphrase detection on four datasets, few-shot named entity recognition on two datasets, and zero-shot sentiment analysis on three datasets. PDF 11 2021
FLAP: Table-to-Text Generation with Feature Indication and Numerical Reasoning Pretraining Recent neural models have shown success in table-to-text generation. However, the performance of content selection and content planning is still unsatisfactory. In this paper, we propose an effective framework with Feature indication and numericaL reAsoning Pretraining (FLAP) to help the neural generation model with content selection and planning. First, rather than treating the table as a sequence of token embeddings, we map each table into a numerical vector to utilize the real-number information. We further propose a feature indication mechanism that introduces a combination-invariant bias to reduce the exposure bias problem in our generation system. Second, we propose a numerical reasoning pretraining task to help the model perform numerical reasoning over the selected subset of the table. Experiments show that our framework outperforms strong baselines on metrics of both content selection and planning on ROTOWIRE and RW-FG. PDF 11 2021
How Good is a Recommender in Machine-Assisted Cross Document Event Coreference Resolution Annotation? Annotating cross-document event coreference links is a tedious task that requires annotators to have near-oracle knowledge of a document collection. The heavy cognitive load of this task decreases overall annotation quality while inevitably increasing latency. To support annotation efforts, machine-assisted recommenders can sample likely coreferent events for a given target event, thus eliminating the burden of examining large numbers of true negative pairs. However, there has been little to no work on evaluating the effectiveness of recommender approaches, particularly for the task of event coreference. To this end, we first create a simulated version of recommender-based annotation for cross-document event coreference resolution. Then, we adapt an existing method as the model governing recommendations. Finally, we introduce a novel method to assess the simulated recommender by evaluating an annotator-centric recall/annotation-effort tradeoff. PDF 11 2021
Morphology Informed Selections for Subword Vocabulary Size Currently, guidance around selection of an optimal or appropriate subword vocabulary size is incomplete and confusing at best. Using a measure of subword-morpheme overlap, our analysis shows that one can find a "sweet spot" for a morphology informed subword vocabulary size. This sweet spot exhibits some variation with respect to text complexity and the morphological characteristics of a language. However, it is relatively constant with respect to corpus size. PDF 11 2021
There’s a Time and Place for Reasoning Beyond the Image Images often mean more to human eyes than their pixels alone, as we can infer, associate, and reason with contextual information from other sources to establish a more complete picture. For example, in Figure 1, we can find a way to identify the news articles related to the picture through segment-wise understanding of the signs, the buildings, the crowds, and more. This tells us the time when and the location where the image was taken, which helps in subsequent tasks such as evidence retrieval for criminal activities, automatic storyline construction, and upstream processing such as image clustering. In this work, we formulate this problem and introduce TARA: a dataset of 16k images with their associated news, time, and location automatically extracted from the New York Times (NYT), and an additional 61k examples as distant supervision from WIT. On top of the extractions, we present a crowdsourced subset in which images are believed to be feasible to locate in time and space, for evaluation purposes. We show that there exists a $70\%$ gap between a state-of-the-art joint model and human performance, which is slightly narrowed by our proposed model that uses segment-wise reasoning, motivating higher-level vision-language joint models that can conduct open-ended reasoning with world knowledge. PDF 11 2021
On the Effect of Isotropy on VAE Representations of Text Injecting desired geometric properties into text representations has attracted a lot of attention. A property that has been argued for, due to its better utilisation of representation space, is isotropy. In parallel, VAEs have been successful in areas of NLP, but are known for their sub-optimal utilisation of the representation space. To address an aspect of this, we investigate the impact of injecting isotropy during training of VAEs. We achieve this by using an isotropic Gaussian posterior (IGP) instead of the ellipsoidal Gaussian posterior. We illustrate that IGP effectively encourages isotropy in the representations, inducing a more discriminative latent space. Compared to vanilla VAE, this translates into a much better classification performance, robustness to input perturbation, and generative behavior. Additionally, we offer insights about the representational properties encouraged by IGP. PDF 11 2021
Understanding and Improving Sequence-to-Sequence Pretraining for Neural Machine Translation In this paper, we present a substantial step in better understanding the SOTA sequence-to-sequence (Seq2Seq) pretraining for neural machine translation~(NMT). We focus on studying the impact of the jointly pretrained decoder, which is the main difference between Seq2Seq pretraining and previous encoder-based pretraining approaches for NMT. By carefully designing experiments on three language pairs, we find that Seq2Seq pretraining is a double-edged sword: On one hand, it helps NMT models to produce more diverse translations and reduce adequacy-related translation errors. On the other hand, the discrepancies between Seq2Seq pretraining and NMT finetuning limit the translation quality (i.e., domain discrepancy) and induce the over-estimation issue (i.e., objective discrepancy). Based on these observations, we further propose simple and effective strategies, named in-domain pretraining and input adaptation to remedy the domain and objective discrepancies, respectively. Experimental results on several language pairs show that our approach can consistently improve both translation performance and model robustness upon Seq2Seq pretraining. PDF 11 2021
Quality-Aware Decoding for Neural Machine Translation Despite the progress in machine translation quality estimation and evaluation in the last years, decoding in neural machine translation (NMT) is mostly oblivious to this and centers around finding the most probable translation according to the model (MAP decoding), approximated with beam search. In this paper, we bring together these two lines of research and propose \emph{quality-aware decoding} for NMT, by leveraging recent breakthroughs in reference-free and reference-based MT evaluation through various inference methods like $N$-best reranking and minimum Bayes risk decoding. We perform an extensive comparison of various possible candidate generation and ranking methods across four datasets and two model classes and find that quality-aware decoding consistently outperforms MAP-based decoding according both to state-of-the-art automatic metrics (COMET and BLEURT) and to human assessments. PDF 11 2021
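Minimum Bayes risk decoding, one of the quality-aware methods compared, picks the candidate with the highest expected utility against the other candidates; a sketch with token-overlap F1 as a toy stand-in for a learned metric such as COMET or BLEURT:

    import numpy as np

    def utility(hyp, ref):
        """Toy stand-in for a learned metric: token-overlap F1."""
        h, r = set(hyp.split()), set(ref.split())
        if not h or not r or not h & r:
            return 0.0
        p, rec = len(h & r) / len(h), len(h & r) / len(r)
        return 2 * p * rec / (p + rec)

    def mbr_decode(candidates):
        """Return the candidate with the highest mean utility vs. the others."""
        scores = [np.mean([utility(c, r) for r in candidates if r is not c])
                  for c in candidates]
        return candidates[int(np.argmax(scores))]

    cands = ["the cat sat on the mat", "a cat sat on a mat", "the dog ran away"]
    print(mbr_decode(cands))  # the hypothesis most similar to the others wins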
MReD: A Meta-Review Dataset for Structure-Controllable Text Generation When directly using existing text generation datasets for controllable generation, we face the problem of lacking domain knowledge, and thus the aspects that can be controlled are limited. A typical example is that when using the CNN/Daily Mail dataset for controllable text summarization, there is no guiding information on the emphasis of summary sentences. A more useful text generator should leverage both the input text and the control signal to guide the generation, which can only be built with a deep understanding of the domain knowledge. Motivated by this vision, our paper introduces a new text generation dataset, named MReD. Our new dataset consists of 7,089 meta-reviews, and all of its 45k meta-review sentences are manually annotated with one of 9 carefully defined categories, including abstract, strength, decision, etc. We present experimental results on state-of-the-art summarization models, and propose methods for structure-controlled generation with both extractive and abstractive models using our annotated data. By exploring various settings and analyzing model behavior with respect to the control signal, we demonstrate the challenges of our proposed task and the value of our dataset MReD. Meanwhile, MReD also allows us to gain a better understanding of the meta-review domain. PDF 11 2021
Research on the Evaluation of Token Imbalance Degree of NMT Corpus As a kind of classifier, neural machine translation (NMT) is known to perform better with balanced tokens during training. Studying the token distribution in NMT corpora is therefore of guiding significance for improving corpus quality and translation quality. Because existing research on token imbalance has deficiencies in algorithm performance and word-segmentation scope, we propose the Dispersion of Token Distribution (DTD) algorithm and use it to evaluate corpora at three segmentation levels: character, subword, and word. Our experiments show that this algorithm improves accuracy, effectiveness, and robustness. Meanwhile, we find that the token imbalance degree of an NMT corpus varies greatly across segmentation levels: character is the highest, word the lowest, and subword in between. In addition, we also find regularities in the token imbalance degree of German (DE), English (EN), French (FR), and Russian (RU). PDF 11 2021
Hierarchical Attention Decoder for Solving Math Word Problems To answer math word problems (MWPs), models need to formalize equations from the source text of math problems. Recently, tree-structured decoders have significantly improved model performance on this task by generating the target equation in a tree format. However, current decoders usually ignore the hierarchical relationships between tree nodes and their parents, which hinders further improvement. Thus, we propose a structure called the hierarchical attention tree to aid the generation procedure of the decoder. As our decoder follows a graph-based encoder, our full model is named Graph to Hierarchical Attention Tree (G2HAT). We show that a tree-structured decoder with hierarchical accumulative multi-head attention leads to significant performance improvements and reaches a new state of the art (SOTA) on both the English MAWPS and Chinese Math23k MWP benchmarks. For further study, we also apply pre-trained language models to G2HAT, which yields even higher performance. PDF 11 2021
Improving Word Translation via Two-Stage Contrastive Learning Word translation or bilingual lexicon induction (BLI) is a key cross-lingual task, aiming to bridge the lexical gap between different languages. In this work, we propose a robust and effective two-stage contrastive learning framework for the BLI task. As Stage C1, we propose to refine standard cross-lingual linear maps between static word embeddings (WEs) via a contrastive learning objective; we also show how to integrate it into the self-learning procedure for even more refined cross-lingual maps. In Stage C2, we conduct BLI-oriented contrastive fine-tuning of mBERT, unlocking its word translation capability. We also show that static WEs induced from the 'C2-tuned' mBERT complement static WEs from Stage C1. Comprehensive experiments on standard BLI datasets for diverse languages and different experimental setups demonstrate substantial gains achieved by our framework. While the BLI method from Stage C1 already yields substantial gains over all state-of-the-art BLI methods in our comparison, even stronger improvements are met with the full two-stage framework: e.g., we report gains for 112/112 BLI setups, spanning 28 language pairs. PDF 11 2021
Summ$^N$: A Multi-Stage Summarization Framework for Long Input Dialogues and Documents Text summarization helps readers capture salient information from documents, news, interviews, and meetings. However, most state-of-the-art pretrained language models (LMs) are unable to efficiently process long text for many summarization tasks. In this paper, we propose Summ$^N$, a simple, flexible, and effective multi-stage framework for input texts that are longer than the maximum context length of typical pretrained LMs. Summ$^N$ first splits the data samples and generates a coarse summary in multiple stages, and then produces the final fine-grained summary based on it. Our framework can process input text of arbitrary length by adjusting the number of stages while keeping the LM input size fixed. Moreover, it can deal with both single-source documents and dialogues, and it can be used on top of different backbone abstractive summarization models. To the best of our knowledge, Summ$^N$ is the first multi-stage split-then-summarize framework for long input summarization. Our experiments demonstrate that Summ$^N$ outperforms previous state-of-the-art methods by improving ROUGE scores on three long meeting summarization datasets (AMI, ICSI, and QMSum), two long TV series datasets from SummScreen, and a long document summarization dataset, GovReport. Our data and code are available at https://github.com/ANONYMOUS/Summ-N. PDF 11 2021
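The split-then-summarize recursion can be sketched as below, with a trivial truncation stand-in where the real framework would call a pretrained abstractive summarizer:

    def summarize(text, max_words=30):
        """Trivial stand-in: truncate. A pretrained abstractive LM goes here."""
        return " ".join(text.split()[:max_words])

    def multi_stage_summarize(document, chunk_words=100, target_words=30):
        words = document.split()
        while len(words) > target_words:            # add stages until it fits
            chunks = [" ".join(words[i:i + chunk_words])
                      for i in range(0, len(words), chunk_words)]
            coarse = " ".join(summarize(c) for c in chunks)  # one coarse stage
            if len(coarse.split()) >= len(words):
                break                                # no progress; stop early
            words = coarse.split()
        return " ".join(words[:target_words])        # final fine-grained pass

    print(multi_stage_summarize("tok " * 500))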
Discriminative Models Still Outperform Generative Models in Aspect Based Sentiment Analysis In Cross-Domain and Cross-Lingual Settings Aspect-based Sentiment Analysis (ABSA) helps to explain customers' opinions towards products and services. In the past, ABSA models were discriminative, but more recently generative models have been used to generate aspects and polarities directly from text. In contrast, discriminative models first select aspects from the text and then classify the aspect's polarity. Previous results showed that generative models outperform discriminative models on several English ABSA datasets. Here, we rigorously contrast discriminative and generative models in several settings. We compare both model types in cross-lingual, cross-domain, and combined cross-lingual and cross-domain settings, to understand generalizability in settings other than monolingual English in-domain. Our more thorough evaluation shows that, contrary to previous studies, discriminative models still clearly outperform generative models in almost all settings. PDF 11 2021
Question Generation for Reading Comprehension Assessment by Modeling How and What to Ask Reading is integral to everyday life, and yet learning to read is a struggle for many young learners. During lessons, teachers can use comprehension questions to increase engagement, test reading skills, and improve retention. Historically such questions were written by skilled teachers, but recently language models have been used to generate comprehension questions. However, many existing Question Generation (QG) systems focus on generating extractive questions from the text and have no way to control the type of the generated question. In this paper, we study QG for reading comprehension, where inferential questions are critical and extractive techniques cannot be used. We propose a two-step model (HTA-WTA) that takes advantage of previous datasets and can generate questions for a specific targeted comprehension skill. We propose a new reading comprehension dataset that contains questions annotated with story-based reading comprehension skills (SBRCS), allowing for a more complete reader assessment. Across several experiments, our results show that HTA-WTA outperforms multiple strong baselines on this new dataset. We show that the HTA-WTA model tests for strong SBRCS by asking deep inferential questions. PDF 11 2021
How Well Do Multi-hop Reading Comprehension Models Understand Date Information? Many previous works demonstrated that existing multi-hop reading comprehension datasets (e.g., HotpotQA) contain reasoning shortcuts, where the questions can be answered without performing multi-hop reasoning. Recently, several multi-hop datasets have been proposed to solve the reasoning shortcut problem or to evaluate the internal reasoning process. However, the design of the reasoning chain for comparison questions in R4C and 2WikiMultiHopQA does not fully explain the answer; meanwhile, MuSiQue focuses only on bridge questions. Therefore, it is unclear whether a model can perform step-by-step reasoning when finding an answer to a comparison question that requires comparison and numerical reasoning skills. To evaluate models completely and hierarchically, we first propose a dataset, HieraDate, created by reusing and enhancing two previous multi-hop datasets, HotpotQA and 2WikiMultiHopQA. Our dataset focuses on comparison questions about date information that require multi-hop reasoning to solve. We then evaluate the ability of existing models to understand date information at three levels: extraction, reasoning, and robustness. Our experimental results reveal that multi-hop models fail at the reasoning level. Comparison reasoning and numerical reasoning (e.g., subtraction) are key challenges to be addressed in future work. PDF 11 2021
AutoLEX: An Automatic Framework for Linguistic Exploration Each language has its own complex systems of word, phrase, and sentence construction, the guiding principles of which are often summarized in grammatical descriptions for the consumption of linguists or language learners. However, manual creation of such descriptions across many languages is a fraught process, as creating language descriptions which describe the language in "its own terms" without bias or error requires both a deep understanding of the language at hand and linguistics as a whole. We propose an automatic framework AutoLEX that aims to ease linguists' discovery and extraction of concise descriptions of linguistic phenomena. Specifically, we apply this framework to extract descriptions for three linguistic phenomena: morphological agreement, case marking, and word order, across several languages. We evaluate the extracted descriptions with the help of language experts and propose a method for automated evaluation when human evaluation is infeasible. PDF 11 2021
Under the Morphosyntactic Lens: A Multifaceted Evaluation of Gender Bias in Speech Translation Gender bias is largely recognized as a problematic phenomenon affecting language technologies, with recent studies underscoring that it might surface differently across languages. However, most evaluation practices adopt a word-level focus on a narrow set of occupational nouns under synthetic conditions. Such protocols overlook key features of grammatical gender languages, which are characterized by morphosyntactic chains of gender agreement, marked on a variety of lexical items and parts-of-speech (POS). To overcome this limitation, we enrich the natural, gender-sensitive MuST-SHE corpus with two new annotation layers: POS and agreement chains. On this basis, we conduct multifaceted automatic and manual evaluations for three speech translation models, trained on varying amounts of data and different word segmentation techniques. Our work sheds light on model behaviours, gender bias, and its detection at several levels of granularity for English-French/Italian/Spanish. PDF 11 2021
Building a Role Specified Open-Domain Dialogue System Leveraging Large-Scale Language Models Recent open-domain dialogue models have brought numerous breakthroughs. However, building a chat system is not scalable since it often requires a considerable volume of human-human dialogue data, especially when enforcing features such as persona, style, or safety. In this work, we study the challenge of imposing roles on open-domain dialogue systems, with the goal of making the systems maintain consistent roles while conversing naturally with humans. To accomplish this, the system must satisfy a role specification that includes certain conditions on the stated features as well as a system policy on whether or not certain types of utterances are allowed. For this, we propose an efficient data collection framework that leverages in-context few-shot learning of large-scale language models to build a role-satisfying dialogue dataset from scratch. We then compare various architectures for open-domain dialogue systems in terms of meeting role specifications while maintaining conversational abilities. Automatic and human evaluations show that our models return few out-of-bounds utterances while keeping competitive performance on general metrics. We release a Korean dialogue dataset we built for further research. PDF 11 2021
Adaptive Testing and Debugging of NLP Models Current approaches to testing and debugging NLP models rely on highly variable human creativity and extensive labor, or only work for a very restrictive class of bugs. We present AdaTest, a process for adaptive testing and debugging of NLP models inspired by the test-debug cycle in traditional software engineering. AdaTest encourages a partnership between the user and a large language model (LM): the LM proposes tests that are validated and organized by the user, who in turn gives feedback and steers the LM towards better tests. Once enough bugs are discovered, these are fixed (e.g. finetuning), and the user resumes testing. In experiments with expert and non-expert users and commercial / research models for 8 different tasks, AdaTest makes users 5-10x more effective at finding bugs than current approaches, and helps users effectively fix bugs without adding new bugs. PDF 11 2021
GLM: General Language Model Pretraining with Autoregressive Blank Infilling There have been various types of pretraining architectures including autoencoding models (e.g., BERT), autoregressive models (e.g., GPT), and encoder-decoder models (e.g., T5). However, none of the pretraining frameworks performs best for all tasks in the three main categories of natural language understanding (NLU), unconditional generation, and conditional generation. We propose a General Language Model (GLM) based on autoregressive blank infilling to address this challenge. GLM improves blank-infilling pretraining by adding 2D positional encodings and allowing spans to be predicted in an arbitrary order, which results in performance gains over BERT and T5 on NLU tasks. Meanwhile, GLM can be pretrained for different types of tasks by varying the number and lengths of blanks. On a wide range of tasks across NLU, conditional and unconditional generation, GLM outperforms BERT, T5, and GPT given the same model sizes and data, and achieves the best performance from a single pretrained model with 1.25× the parameters of BERT-Large, demonstrating its generalizability to different downstream tasks. PDF 11 2021
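As a rough illustration of the blank-infilling setup described in the GLM abstract above (a sketch under assumed details, not the authors' code): spans are cut out of the input to form Part A, then appended in shuffled order as Part B, and every token carries two position ids, one for its slot in the corrupted text and one for its offset within a span. The [MASK]/[SOS] token names and all implementation choices here are assumptions.

```python
import random

def glm_instance(tokens, spans):
    """Build one GLM-style blank-infilling training instance.

    tokens: list of token strings
    spans:  non-overlapping (start, end) index pairs to blank out
    Returns the combined token sequence plus the two position-id lists
    that would feed a 2D positional encoding.
    """
    spans = sorted(spans)
    # Part A: the corrupted text, each span replaced by a [MASK].
    part_a, mask_positions, cursor = [], [], 0
    for start, end in spans:
        part_a.extend(tokens[cursor:start])
        mask_positions.append(len(part_a))
        part_a.append("[MASK]")
        cursor = end
    part_a.extend(tokens[cursor:])
    pos1_a = list(range(len(part_a)))   # position in the corrupted text
    pos2_a = [0] * len(part_a)          # intra-span offset is 0 in Part A

    # Part B: the blanked spans, predicted autoregressively in random order.
    order = list(range(len(spans)))
    random.shuffle(order)
    part_b, pos1_b, pos2_b = [], [], []
    for i in order:
        start, end = spans[i]
        span = ["[SOS]"] + tokens[start:end]
        part_b.extend(span)
        pos1_b.extend([mask_positions[i]] * len(span))  # points at the blank
        pos2_b.extend(range(1, len(span) + 1))          # counts inside the span
    return part_a + part_b, pos1_a + pos1_b, pos2_a + pos2_b

toks = "the quick brown fox jumps over the lazy dog".split()
print(glm_instance(toks, [(1, 3), (6, 8)]))
```

Feeding the two position-id lists through two separate embedding tables would give the 2D positional encoding the abstract refers to.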
Categorial Grammar Induction as a Compositionality Measure for Understanding the Structure of Emergent Languages This paper proposes a method for investigating the syntactic structure of emergent languages using categorial grammar induction. Although the structural properties of emergent languages are an important topic, little has been done on syntax and its relation to semantics. Inspired by previous work on CCG induction for natural languages, we propose to induce categorial grammars from the sentence-meaning pairs of emergent languages. Since an emergent language born in a common environment called a signaling game is represented as pairs of a message and a meaning, it is straightforward to extract sentence-meaning pairs to feed to categorial grammar induction. We also propose two compositionality measures that are based on the information obtained from induced grammars. Our experimental results reveal that our measures can recognize compositionality. While correlating with the existing measure TopSim, our measures gain more insight into the compositional structure of emergent languages from the induced grammars. PDF 11 2021
Do We Need to Differentiate Negative Candidates Before Training a Neural Ranker? Retrieval-based Question Answering (ReQA) requires a system to find candidates (e.g., sentences or short passages) containing the answer to a given question from a large corpus. A promising way to solve this task is a two-stage pipeline, where the first stage retrieves a set of candidates, and the second stage uses a neural network to rank the retrieved candidates. There are three standard methods to train neural rankers: Binary Cross-Entropy loss, Mean Squared Error loss, and Hinge loss. While all these training strategies assign the same label to all the negative candidates, we argue that negativeness is not binary but exists on a spectrum, i.e., some candidates may be more negative than others and should thus be treated differently. We present SCONER (scoring negative candidates before training a neural ranker), a model trained to differentiate negative candidates. Our approach includes 1) semantic textual similarity-based scoring together with data augmentation to generate scores for negative candidates, and 2) a neural ranker trained on data using the generated scores as labels. We then systematically compare the three standard training methods and our proposed method on a range of ReQA datasets under multiple settings (i.e., single-domain and multi-domain). Our findings suggest that training neural rankers with more negative candidates is better than with fewer in both single- and multi-domain settings; SCONER performs best in the single-domain setting and Hinge loss in the multi-domain setting. PDF 11 2021
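A minimal sketch of the graded-negatives idea as I read it from the abstract (not the actual SCONER pipeline): embed the gold candidate and each negative with any sentence encoder, then turn similarity to the gold into a soft regression target instead of a flat zero label. The rescaling into [0, 0.9) is an arbitrary illustrative choice that keeps every negative below the gold's label of 1.0.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def grade_negatives(gold_vec, negative_vecs):
    """Map each negative candidate to a soft label in [0, 0.9) according
    to its semantic similarity to the gold candidate."""
    sims = np.array([cosine(gold_vec, v) for v in negative_vecs])
    lo, hi = sims.min(), sims.max()
    return 0.9 * (sims - lo) / (hi - lo + 1e-8)

rng = np.random.default_rng(0)
gold = rng.normal(size=128)            # stand-in for a sentence embedding
negs = rng.normal(size=(5, 128))
print(grade_negatives(gold, negs))     # regression targets for an MSE ranker
```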
CausalR: Causal Reasoning over Natural Language Rulebases Transformers have been shown to perform deductive reasoning on a logical rulebase containing rules and statements written in natural language. Recent works show that such models can also produce the reasoning steps (i.e., the proof graph) that emulate the model's logical reasoning process. But these models behave as a black-box unit that emulates the reasoning process without any causal constraints on the reasoning steps, thus calling their faithfulness into question. In this work, we frame the deductive logical reasoning task as a causal process by defining three modular components: rule selection, fact selection, and knowledge composition. The rule and fact selection steps select the candidate rule and facts to be used, and the knowledge composition step then combines them to generate new inferences. This ensures model faithfulness by establishing a causal relation from the proof steps to the resulting inferences. To test our causal reasoning framework, we propose CausalR, where the above three components are independently modeled by transformers. We observe that CausalR is robust to novel language perturbations and performs on par with previous works on existing reasoning datasets. Furthermore, the errors made by CausalR are more interpretable than those of black-box generative models, owing to our multi-modular approach. PDF 11 2021
Power Norm Based Lifelong Learning for Paraphrase Generations Seq2seq language generation models are trained on multiple domains in a continual learning manner, with the data from each domain observed in an online fashion. However, continual learning usually suffers from catastrophic forgetting, a persistent challenge for lifelong learning. To handle this problem, existing work has leveraged experience replay or dynamic architectures to consolidate past knowledge, which, however, incurs growing memory requirements or high computational cost. In this work, we propose an innovative framework, PNLLL, that remedies catastrophic forgetting by applying power normalization to NLP transformer models. Specifically, PNLLL leverages the power norm to achieve a better balance between past-experience rehearsal and new-knowledge acquisition. Our experiments on paraphrase generation show that PNLLL outperforms SOTA models by a considerable margin and greatly remedies forgetting. PDF 11 2021
Evaluating the Text-to-SQL Capabilities of Large Language Models We perform an empirical evaluation of Text-to-SQL capabilities of the Codex language model. We find that, without any finetuning, Codex is a strong baseline on the Spider benchmark; we also analyze the failure modes of Codex in this setting. Furthermore, we demonstrate on the GeoQuery and Scholar benchmarks that a small number of in-domain examples provided in the prompt enables Codex to perform better than state-of-the-art models finetuned on such few-shot examples. PDF 11 2021
FeelsGoodMan: Inferring Semantics of Twitch Neologisms Twitch chat messages pose a unique problem in natural language understanding due to a large presence of neologisms, specifically emotes. There are a total of 8.06 million emotes, over 400k of which were observed during the study period. There is virtually no information on the meaning or sentiment of emotes, and with a constant influx of new emotes and drift in both their frequencies and their perceived meanings, it becomes impossible to maintain an updated manually-labeled dataset. Our paper makes a two-fold contribution. First, we establish a new baseline for sentiment analysis on Twitch data, outperforming the previous benchmark by 7.36 percentage points. Second, we introduce a simple but powerful unsupervised framework based on word embeddings and k-NN to enrich existing models with out-of-vocabulary knowledge. This framework allows us to auto-generate an emote pseudo-dictionary, and we show that we can nearly match the supervised benchmark above, even when injecting such emote knowledge into sentiment classifiers trained on extraneous datasets such as movie reviews or Twitter. PDF 11 2021
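The unsupervised enrichment step can be sketched as follows (a guess at the spirit of the method, with made-up data): an out-of-vocabulary emote inherits the average sentiment of its k nearest neighbours among items whose polarity is already known.

```python
import numpy as np

def knn_sentiment(emote_vec, lexicon, k=3):
    """Estimate an unknown emote's sentiment as the mean sentiment of its
    k nearest neighbours in embedding space (cosine similarity).

    lexicon: list of (embedding, sentiment) pairs with sentiment in [-1, 1].
    """
    scored = []
    for vec, sentiment in lexicon:
        sim = float(vec @ emote_vec /
                    (np.linalg.norm(vec) * np.linalg.norm(emote_vec)))
        scored.append((sim, sentiment))
    top = sorted(scored, key=lambda x: x[0], reverse=True)[:k]
    return sum(s for _, s in top) / len(top)

rng = np.random.default_rng(0)
lexicon = [(rng.normal(size=50), s) for s in (-1.0, -0.5, 0.0, 0.5, 1.0)]
print(knn_sentiment(rng.normal(size=50), lexicon))
```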
Greek Forced Alignment: Assessing the Accuracy of the Montreal Forced Aligner Forced alignment has allowed for the rapid creation and annotation of corpora. In this study we examine the Montreal Forced Aligner (MFA) and its accuracy in aligning Greek data. Using a conversational Greek corpus, we train a small grapheme-to-phoneme model and use this model to align the entire corpus. We compare our results to various previous studies of the MFA and other forced alignment software and conclude that forced alignment greatly increases the ability to create new corpora for low-resource and understudied languages. PDF 11 2021
Who Are We Talking About? Handling Person Names in Speech Translation Recent work has shown that systems for speech translation (ST) -- similarly to automatic speech recognition (ASR) -- poorly handle person names. This shortcoming not only leads to errors that can seriously distort the meaning of the input, but also hinders the adoption of such systems in application scenarios (like computer-assisted interpreting) where the translation of named entities, like person names, is crucial. In this paper, we first analyse the outputs of ASR/ST systems to identify the reasons for failures in person name transcription/translation. Besides frequency in the training data, we pinpoint the nationality of the referred person as a key factor. We then mitigate the problem by creating multilingual models, and further improve our ST systems by forcing them to jointly generate transcripts and translations, prioritising the former over the latter. Overall, our solutions result in a relative improvement in token-level person name accuracy of 47.8% on average for three language pairs (en$\rightarrow$es,fr,it). PDF 11 2021
Generating Data to Mitigate Spurious Correlations in Natural Language Inference Datasets Natural language processing models often exploit spurious correlations between task-independent features and labels in datasets to perform well only within the distributions they are trained on, while not generalising to different task distributions. We propose to tackle this problem by generating a debiased version of a dataset, which can then be used to train a debiased, off-the-shelf model, by simply replacing its training data. Our approach consists of 1) a method for training data generators to generate high-quality, label-consistent data samples; and 2) a filtering mechanism for removing data points that contribute to spurious correlations, measured in terms of z-statistics. We generate debiased versions of the SNLI and MNLI datasets, and we evaluate on a large suite of debiased, out-of-distribution, and adversarial test sets. Results show that models trained on our debiased datasets generalise significantly better than those trained on the original datasets in all settings. On the majority of the datasets, our method outperforms or performs comparably to previous state-of-the-art debiasing strategies, and when combined with an orthogonal technique, product-of-experts, the performance improves further and achieves state-of-the-art results on SNLI-hard and MNLI-hard. PDF 11 2021
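The abstract does not spell out its z-statistic filter, but a standard one-proportion z-test captures the idea: compare how often a surface feature co-occurs with a label against that label's overall base rate, and drop data points carrying features whose |z| is large. The formulation below is therefore an assumption, not the paper's exact criterion.

```python
import math

def z_statistic(n_feature_with_label, n_feature, base_rate):
    """One-proportion z-test for how strongly a feature co-occurs with a
    label relative to the label's base rate across the dataset."""
    p_hat = n_feature_with_label / n_feature
    return (p_hat - base_rate) / math.sqrt(
        base_rate * (1 - base_rate) / n_feature)

# Toy NLI example: the word "not" occurs in 1000 hypotheses, 700 of them
# labelled contradiction, while contradiction's overall rate is 1/3.
z = z_statistic(700, 1000, 1 / 3)
print(z)  # a large |z| flags "not" as spuriously predictive
```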
Combining Feature and Instance Attribution to Detect Artifacts Training the deep neural networks that dominate NLP requires large datasets. These are often collected automatically or via crowdsourcing, and may exhibit systematic biases or annotation artifacts. By the latter we mean spurious correlations between inputs and outputs that do not represent a generally held causal relationship between features and classes; models that exploit such correlations may appear to perform a given task well, but fail on out-of-sample data. In this paper we evaluate the use of different attribution methods for identifying training data artifacts. We propose new hybrid approaches that combine saliency maps (which highlight "important" input features) with instance attribution methods (which retrieve training samples "influential" to a given prediction). We show that this proposed training-feature attribution can be used to efficiently uncover artifacts in training data when a challenging validation set is available. We also carry out a small user study to evaluate whether these methods are useful to NLP researchers in practice, with promising results. PDF 11 2021
Sentence-level Privacy for Document Embeddings User language data can contain highly sensitive personal content. As such, it is imperative to offer users a strong and interpretable privacy guarantee when learning from their data. In this work we propose SentDP, pure local differential privacy at the sentence level for a single user document. We propose a novel technique, DeepCandidate, that combines concepts from robust statistics and language modeling to produce high (768) dimensional, general $\epsilon$-SentDP document embeddings. This guarantees that any single sentence in a document can be substituted with any other sentence while keeping the embedding $\epsilon$-indistinguishable. Our experiments indicate that these private document embeddings are useful for downstream tasks like sentiment analysis and topic classification and even outperform baseline methods with weaker guarantees like word-level Metric DP. PDF 11 2021
Speech-to-SQL Parsing: Error Correction with Multi-modal Representations We study the task of spoken natural language to SQL parsing (speech-to-SQL), where the goal is to map a spoken utterance to the corresponding SQL. Existing work on SQL parsing has focused on text as input (text-to-SQL). To develop a speech-to-SQL parser, we harness progress in text-to-SQL parsing and automatic speech recognition (ASR). However, ASR is still error-prone; we therefore propose an error correction method that fixes ASR errors in the context of a DB schema. We present a novel multi-modal representation of text, audio, and DB schema with audio attention and a phoneme prediction auxiliary task. Our experiments show that our method yields better performance, is much faster to train, has greater transparency, and is parser-agnostic compared to baselines that seek to adapt to ASR errors. PDF 11 2021
A Graph Fusion Approach to Cross-Lingual Machine Reading Comprehension Although great progress has been made for Machine Reading Comprehension (MRC) in English, scaling out to a large number of languages remains a huge challenge due to the lack of large amounts of annotated training data in non-English languages. To address this challenge, some recent efforts in cross-lingual MRC employ machine translation to transfer knowledge from English to other languages, through either explicit alignment or implicit attention. For effective knowledge transfer, it is beneficial to leverage both semantic and syntactic information. However, the existing methods fail to explicitly incorporate syntactic information in model learning. Consequently, the models are not robust to errors in alignment and noise in attention. In this work, we propose a novel approach, named GraFusionMRC, which jointly models the cross-lingual alignment information and the mono-lingual syntax information using a graph. We develop a series of algorithms including graph construction, learning, and pre-training. Experiments on two benchmark datasets for cross-lingual MRC show that our approach outperforms all strong baselines, which verifies the effectiveness of syntactic information for cross-lingual MRC. The code will be open-sourced on GitHub. PDF 11 2021
Mental Health Assessment for the Chatbots Previous research on dialogue system assessment has usually focused on the quality evaluation (e.g. fluency, relevance, etc.) of responses generated by chatbots, which are local and technical metrics. For a chatbot that responds to millions of online users, including minors, we argue that it should have a healthy mental tendency in order to avoid negative psychological impact on them. In this paper, we establish several mental health assessment dimensions for chatbots (depression, anxiety, alcohol addiction, empathy) and introduce questionnaire-based mental health assessment methods. We conduct assessments of some well-known open-domain chatbots and find severe mental health issues in all of them. We attribute this to the neglect of mental health risks during dataset building and model training. We hope to draw researchers' attention to the serious mental health problems of chatbots and to improve chatbots' ability in positive emotional interaction. PDF 11 2021
Synthetic Question Value Estimation for Domain Adaptation of Question Answering Synthesizing QA pairs with a question generator (QG) on the target domain has become a popular approach for domain adaptation of question answering (QA) models. Since synthetic questions are often noisy in practice, existing work adapts scores from a pretrained QA (or QG) model as criteria to select high-quality questions. However, these scores do not directly serve the ultimate goal of improving QA performance on the target domain. In this paper, we introduce a novel idea of training a question value estimator (QVE) that directly estimates the usefulness of synthetic questions for improving the target-domain QA performance. By conducting comprehensive experiments, we show that the synthetic questions selected by QVE can help achieve better target-domain QA performance, in comparison with existing techniques. We additionally show that by using such questions and only around 15% of the human annotations on the target domain, we can achieve comparable performance to the fully-supervised baselines. PDF 11 2021
MAD for Robust Reinforcement Learning in Machine Translation We introduce a new distributed policy gradient algorithm and show that it outperforms existing reward-aware training procedures such as REINFORCE, minimum risk training (MRT) and proximal policy optimization (PPO) in terms of convergence speed and stability, and overall performance at optimising machine translation models. Our algorithm, which we call MAD (on account of using the mean absolute deviation in the importance weighting calculation), has distributed data generators sampling multiple candidates per source sentence on worker nodes, while a central learner updates the policy. MAD depends crucially on two variance reduction strategies: (1) a new robust importance weighting scheme that encourages learning from examples that are not too likely or unlikely relative to the current policy, and (2) learning from balanced numbers of high- and low-reward training examples. Finally, our algorithm has few hyperparameters, making it easy to use on new tasks with little or no adaptation. Experiments on a variety of tasks show that the translation policies learned with MAD perform very well with both greedy decoding and beam search, and that the learned policies are sensitive to the specific reward used during training. PDF 11 2021
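The abstract names the key quantity (the mean absolute deviation of the importance ratios) but not the exact rule, so the following is only one plausible reading, offered as a sketch: clip each candidate's importance ratio to a band around the median whose width is set by the MAD, so that samples far too likely or unlikely under the current policy contribute bounded weight.

```python
import numpy as np

def mad_importance_weights(logp_current, logp_behavior):
    """Clip importance ratios to [median - MAD, median + MAD].
    A speculative reconstruction of the robust weighting scheme."""
    ratios = np.exp(logp_current - logp_behavior)
    med = np.median(ratios)
    mad = np.mean(np.abs(ratios - med))
    return np.clip(ratios, med - mad, med + mad)

rng = np.random.default_rng(1)
lp_cur, lp_beh = rng.normal(size=8), rng.normal(size=8)
print(mad_importance_weights(lp_cur, lp_beh))
```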
Self-supervised Schema Induction for Task-oriented Dialog Hand-crafted schemas describing how to collect and annotate dialog corpora are a prerequisite for building task-oriented dialog systems. In practical applications, manually designing schemas can be error-prone, laborious, iterative, and slow, especially when the schema is complicated. To automate this process, we propose a self-supervised approach for schema induction from unlabeled dialog corpora. Our approach utilizes representations provided by in-domain language models constrained on unsupervised structures, followed by multi-step coarse-to-fine clustering. We compare our method against several strong supervised baselines, and show significant performance improvement in schema induction on the MultiWoz and SGD datasets. We also demonstrate the effectiveness of the induced schemas on downstream tasks including dialog state tracking and response generation. PDF 11 2021
HiStruct+: Improving Extractive Text Summarization with Hierarchical Structure Information Transformer-based language models usually treat texts as linear sequences. However, most texts also have an inherent hierarchical structure, i.e., parts of a text can be identified using their position in this hierarchy. In addition, section titles usually indicate the common topic of their respective sentences. We propose a novel approach to extract, encode and inject hierarchical structure (HiStruct) information into an extractive summarization model (the HiStruct+ model) based on a pre-trained, encoder-only language model. Our HiStruct+ model achieves SOTA extractive ROUGE scores on three public summarization datasets (CNN/DailyMail, PubMed, arXiv); the improvement is especially substantial on PubMed and arXiv. Across various experimental settings, our HiStruct+ model outperforms a strong baseline that differs from our model only in that the HiStruct information is not injected. The ablation study demonstrates that the hierarchical position information is the main contributor to our model's SOTA performance. PDF 11 2021
Meta-Adapter: Parameter Efficient Few-Shot Learning through Meta-Learning With consistent improvements in the representational capacity of large pre-trained transformers, it has become increasingly viable to serve these models as shared backbones that enable modeling a large number of tasks simultaneously. However, fine-tuning the entire model for every task of interest makes a copy of all the model parameters, rendering such scenarios highly impractical. Recently introduced Adapter methods propose a promising alternative, where only a small number of additional parameters are introduced per task specifically for fine-tuning. However, Adapters often require large amounts of task-specific data for good performance and do not work well in data-scarce few-shot scenarios. In this paper, we take a meta-learning viewpoint on parameter-efficient fine-tuning in few-shot settings. We introduce Meta-Adapters, small blocks of meta-learned adapter layers inserted into a pre-trained model that re-purpose a frozen pre-trained model into a parameter-efficient few-shot learner. Meta-Adapters perform competitively with state-of-the-art few-shot learning methods that require full fine-tuning, while fine-tuning only 0.6% of the parameters. We evaluate Meta-Adapters along with multiple transfer learning baselines on an evaluation suite of 17 classification tasks and find that they improve few-shot learning accuracy by a large margin over competitive parameter-efficient methods while requiring significantly fewer parameters for fine-tuning. PDF 11 2021
Data Augmentation for Intent Classification with Generic Large Language Models Data augmentation alleviates the problem of data scarcity when training language models (LMs) by generating new examples based on the existing data. A successful approach to generating new samples is to fine-tune a pretrained LM on the task-specific data and then sample from the label-conditioned LM. However, fine-tuning can be difficult when task-specific data is scarce. In this work, we explore whether large pretrained LMs can be used to generate new useful samples without fine-tuning. For a given class, we propose concatenating a few examples and prompting GPT-3 with them to generate new examples. We evaluate this method for few-shot intent classification on CLINC150 and SNIPS and find that data generated by GPT-3 greatly improves the performance of the intent classifiers. Importantly, we find that, without any LM fine-tuning, the gains brought by data augmentation with GPT-3 are similar to those reported in prior work on LM-based data augmentation. Experiments with models of different sizes show that larger LMs generate higher quality samples that yield higher accuracy gains. PDF 11 2021
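The prompting step is easy to picture (a sketch; the exact prompt template used with GPT-3 is not given in the abstract, so this format, the intent name, and the seed utterances are invented):

```python
def augmentation_prompt(intent, seed_utterances):
    """Concatenate a few seed utterances for one intent into a prompt
    that invites the LM to continue the numbered list."""
    lines = [f'The following sentences all express the intent "{intent}":']
    lines += [f"{i + 1}. {u}" for i, u in enumerate(seed_utterances)]
    lines.append(f"{len(seed_utterances) + 1}.")  # the LM completes from here
    return "\n".join(lines)

prompt = augmentation_prompt(
    "transfer money",
    ["move $100 to my savings account",
     "can you send money to my checking",
     "i want to wire funds to another account"],
)
print(prompt)
# The prompt would be sent to GPT-3 and the sampled continuations kept
# as new labelled utterances for the intent classifier.
```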
Prix-LM: Pretraining for Multilingual Knowledge Base Construction Knowledge bases (KBs) contain plenty of structured world and commonsense knowledge. As such, they often complement distributional text-based information and facilitate various downstream tasks. Since their manual construction is resource- and time-intensive, recent efforts have tried leveraging large pretrained language models (PLMs) to generate additional monolingual knowledge facts for KBs. However, such methods have not been attempted for building and enriching multilingual KBs. Besides wider application, such multilingual KBs can provide richer combined knowledge than monolingual (e.g., English) KBs. Knowledge expressed in different languages may be complementary and unequally distributed: this implies that the knowledge available in high-resource languages can be transferred to low-resource ones. To achieve this, it is crucial to represent multilingual knowledge in a shared/unified space. To this end, we propose a unified representation model, Prix-LM, for multilingual KB construction and completion. We leverage two types of knowledge, monolingual triples and cross-lingual links, extracted from existing multilingual KBs, and tune a multilingual language encoder XLM-R via a causal language modeling objective. Prix-LM integrates useful multilingual and KB-based factual knowledge into a single model. Experiments on standard entity-related tasks, such as link prediction in multiple languages, cross-lingual entity linking and bilingual lexicon induction, demonstrate its effectiveness, with gains reported over strong task-specialised baselines. PDF 11 2021
Good Night at 4 pm?! Time Expressions in Different Cultures We propose the task of culture-specific time expression grounding, i.e. mapping from expressions such as "morning" in English or "Manhã" in Portuguese to specific hours in the day. We propose 3 language-agnostic methods, one of which achieves promising results on gold standard annotations that we collected for a small number of languages. We then apply this method to 28 languages and analyze the similarities across languages in the grounding of time expressions. PDF 11 2021
Language-Family Adapters for Multilingual Neural Machine Translation Massively multilingual pretrained models yield state-of-the-art results in a wide range of cross-lingual natural language processing tasks. For machine translation, the de facto way to leverage knowledge of pretrained models is fine-tuning on parallel data from one or multiple language pairs. Multilingual fine-tuning improves performance on medium- and low-resource languages but requires modifying the entire model and can be prohibitively expensive. Training either language-pair specific or language-agnostic adapters while keeping most of the pretrained model's parameters frozen has been proposed as a lightweight alternative. However, the former do not learn useful cross-lingual representations for multiple language pairs, while the latter share parameters for all languages and potentially have to deal with negative interference. In this paper, we propose training language-family adapters on top of a pretrained multilingual model to facilitate cross-lingual transfer. Using language families, our model consistently outperforms other adapter-based approaches and is on par with multilingual fine-tuning, while being more efficient. We also demonstrate that language-family adapters provide an effective method to translate to languages unseen during pretraining and substantially outperform the baselines. PDF 11 2021
Predicting Attention Sparsity in Transformers Transformers' quadratic complexity with respect to the input sequence length has motivated a body of work on efficient sparse approximations to softmax. An alternative path, used by entmax transformers, consists of having built-in exact sparse attention; however, this approach still requires quadratic computation. In this paper, we propose Sparsefinder, a simple model trained to identify the sparsity pattern of entmax attention before computing it. We experiment with three variants of our method, based on distances, quantization, and clustering, on two tasks: machine translation (attention in the decoder) and masked language modeling (encoder-only). Our work provides a new angle for studying model efficiency through an extensive analysis of the tradeoff between the sparsity and recall of the predicted attention graph. This allows for detailed comparison between different models along their Pareto curves, which is important for guiding future benchmarks for sparse attention models. PDF 11 2021
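A toy version of the clustering variant, under assumed details (the real Sparsefinder learns projections so that clustered points mimic entmax's sparsity pattern; here plain k-means stands in): queries and keys are bucketed jointly, and attention scores are only computed inside a bucket.

```python
import numpy as np
from sklearn.cluster import KMeans

def predicted_attention_mask(queries, keys, n_buckets=4, seed=0):
    """Cluster queries and keys jointly; allow attention only within a
    bucket, yielding a boolean (n_queries, n_keys) sparsity pattern."""
    km = KMeans(n_clusters=n_buckets, n_init=10, random_state=seed)
    km.fit(np.concatenate([queries, keys]))
    q_bucket = km.predict(queries)
    k_bucket = km.predict(keys)
    return q_bucket[:, None] == k_bucket[None, :]

rng = np.random.default_rng(0)
Q, K = rng.normal(size=(6, 16)), rng.normal(size=(6, 16))
mask = predicted_attention_mask(Q, K)
print(mask.mean())  # fraction of query-key pairs actually scored
```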
Teaching BERT to Wait: Balancing Accuracy and Latency for Streaming Disfluency Detection In modern interactive speech-based systems, speech is consumed and transcribed incrementally prior to having disfluencies removed. While this post-processing step is crucial for producing clean transcripts and high performance on downstream tasks (e.g. machine translation), most current state-of-the-art NLP models such as the Transformer operate non-incrementally, potentially causing unacceptable delays for the user. In this work we propose a streaming BERT-based sequence tagging model that, combined with a novel training objective, is capable of detecting disfluencies in real-time while balancing accuracy and latency. This is accomplished by training the model to decide whether to immediately output a prediction for the current input or to wait for further context, in essence learning to dynamically size the lookahead window. Our results demonstrate that our model produces comparably accurate predictions and does so sooner than our baselines, with lower flicker. Furthermore, the model attains state-of-the-art latency and stability scores when compared with recent work on incremental disfluency detection. PDF 11 2021
TABi: Type-Aware Bi-encoders for End-to-End Entity Retrieval Entity retrieval---retrieving information about entities in a query---is a core step in open-domain tasks, such as question answering or fact checking. However, state-of-the-art entity retrievers struggle to retrieve rare entities in queries. There are two key challenges: (1) most retrievers are trained on unstructured text about entities and ignore structured data about entities that can be challenging to learn from text, such as entity types, and (2) methods that leverage structured types are not designed for end-to-end retrieval, which is necessary for open-domain tasks. In this work, we introduce a method, TABi, to jointly train bi-encoders on unstructured text and structured types for end-to-end retrieval. TABi uses a type-enforced contrastive loss to encode type information in the embedding space and trains over datasets from multiple open-domain tasks to learn to retrieve entities. We demonstrate that this simple method can improve retrieval of rare entities on the AmbER sets, while maintaining strong overall performance on retrieval for open-domain tasks when compared to state-of-the-art retrievers. We also find that TABi produces embeddings that better capture types on a nearest neighbor type classification and an entity similarity task. PDF 11 2021
Addressing Resource and Privacy Constraints in Semantic Parsing Through Data Augmentation We introduce a novel setup for low-resource task-oriented semantic parsing which incorporates several constraints that may arise in real-world scenarios: (1) lack of similar datasets/models from a related domain, (2) inability to sample useful logical forms directly from a grammar, and (3) privacy requirements for unlabeled natural utterances. Our goal is to improve a low-resource semantic parser using utterances collected through user interactions. In this highly challenging but realistic setting, we investigate data augmentation approaches involving generating a set of structured canonical utterances corresponding to logical forms, before simulating corresponding natural language and filtering the resulting pairs. We find that such approaches are effective despite our restrictive setup: in a low-resource setting on the complex SMCalFlow calendaring dataset (Andreas et al. 2020), we observe 33% relative improvement over a non-data-augmented baseline in top-1 match. PDF 11 2021
Composable Sparse Fine-Tuning for Cross-Lingual Transfer Fine-tuning the entire set of parameters of a large pretrained model has become the mainstream approach for transfer learning. To increase its efficiency and prevent catastrophic forgetting and interference, techniques like adapters and sparse fine-tuning have been developed. Adapters are modular, as they can be combined to adapt a model towards different facets of knowledge (e.g., dedicated language and/or task adapters). Sparse fine-tuning is expressive, as it controls the behavior of all model components. In this work, we introduce a new fine-tuning method with both these desirable properties. In particular, we learn sparse, real-valued masks based on a simple variant of the Lottery Ticket Hypothesis. Task-specific masks are obtained from annotated data in a source language, and language-specific masks from masked language modeling in a target language. Both these masks can then be composed with the pretrained model. Unlike adapter-based fine-tuning, this method neither increases the number of parameters at inference time nor alters the original model architecture. Most importantly, it outperforms adapters in zero-shot cross-lingual transfer by a large margin in a series of multilingual benchmarks, including Universal Dependencies, MasakhaNER, and AmericasNLI. Based on an in-depth analysis, we additionally find that sparsity is crucial to prevent both 1) interference between the fine-tunings to be composed and 2) overfitting. We release the code and models at [ANONYMOUS-URL]. PDF 11 2021
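The composition step of the sparse fine-tuning method above lends itself to a compact sketch (my reading of the abstract; the Lottery-Ticket-style procedure that learns the masks is the paper's contribution and is not shown): a task-specific and a language-specific sparse parameter difference are simply added on top of the frozen pretrained weights, leaving the architecture and parameter count unchanged at inference.

```python
import torch

def compose(pretrained, task_diff, task_mask, lang_diff, lang_mask):
    """Compose two sparse fine-tunings with the pretrained weights.
    Each mask zeroes out all but a small fraction of the update."""
    return pretrained + task_mask * task_diff + lang_mask * lang_diff

torch.manual_seed(0)
w0 = torch.randn(4, 4)
task_diff, lang_diff = torch.randn(4, 4), torch.randn(4, 4)
task_mask = (torch.rand(4, 4) < 0.1).float()   # ~10% of entries survive
lang_mask = (torch.rand(4, 4) < 0.1).float()
w = compose(w0, task_diff, task_mask, lang_diff, lang_mask)
print((w != w0).float().mean())  # fraction of parameters actually changed
```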
Your fairness may vary: Pretrained language model fairness in toxic text classification Warning: This paper contains samples of offensive text. The popularity of pretrained language models in natural language processing systems calls for a careful evaluation of such models in downstream tasks, which have a higher potential for societal impact. The evaluation of such systems usually focuses on accuracy measures. Our findings in this paper call for attention to be paid to fairness measures as well. Through the analysis of more than a dozen pretrained language models of varying sizes on two toxic text classification tasks, we demonstrate that focusing on accuracy measures alone can lead to models with wide variation in fairness characteristics. Specifically, we observe that fairness can vary even more than accuracy with increasing training data size and different random initializations. At the same time, we find that little of the fairness variation is explained by model size, despite claims in the literature. To improve model fairness without retraining, we show that two post-processing methods developed for structured, tabular data can be successfully applied to a range of pretrained language models. PDF 11 2021
Voxel-informed Language Grounding Even when applied to 2D images, natural language describes a fundamentally 3D world. We present the Voxel-informed Language Grounder (VLG), a language grounding model that leverages 3D geometric information in the form of voxel maps derived from the visual input using a volumetric reconstruction model. We show that VLG significantly improves grounding accuracy on SNARE, an object reference game task. At the time of writing, VLG holds the top place (anonymized) on the SNARE leaderboard, achieving SOTA results with a 1.9% absolute improvement on grounding geometric descriptions and 1.7% overall improvement on all descriptions. PDF 11 2021
On Systematic Style Differences between Unsupervised and Supervised MT and an Application for High-Resource Machine Translation Modern unsupervised machine translation (MT) systems reach reasonable translation quality under clean and controlled data conditions. As the performance gap between supervised and unsupervised MT narrows, it is interesting to ask whether the different training methods result in systematically different output beyond what is visible via quality metrics like adequacy or BLEU. We compare translations from supervised and unsupervised MT systems of similar quality, finding that unsupervised output is more fluent and more structurally different in comparison to human translation than is supervised MT. We then demonstrate a way to combine the benefits of both methods into a single system which results in improved adequacy and fluency as rated by human evaluators. Our results open the door to interesting discussions about how supervised and unsupervised MT might be different yet mutually-beneficial. PDF 11 2021
Learning Tokenization in Private Federated Learning with Sub-Word Model Sampling Federated learning with differential privacy, i.e. private federated learning (PFL), makes it possible to train models on private data distributed across users' devices without harming privacy. However, it is only known how to do this for models, such as neural networks, that have a fixed number of parameters, and thus a fixed-dimensional gradient vector. Such models include neural-net language models, but not n-gram language models or, indeed, tokenizers, the topic of this work. Training a tokenizer normally requires access to the training data. An alternative is to train the tokenizer on publicly available data, but this, we show, degrades accuracy for a next-word prediction task by 10-20% across different datasets and models. We propose to take a tokenizer built on public data, use it to train a language model with PFL, and sample from the language model to find a new tokenizer. Retraining with the new tokenizer brings performance to within 2% of the oracle tokenizer, without expending additional privacy budget. Finally, we build a new federated pipeline to update the tokenizer during model training by modifying affected model embeddings. PDF 11 2021
Leveraging Visual Knowledge in Language Tasks: An Empirical Study on Intermediate Pre-training for Cross-Modal Knowledge Transfer Pre-trained language models are still far from human performance in tasks that need understanding of properties (e.g. appearance, measurable quantity) and affordances of everyday objects in the real world, since text lacks such information due to reporting bias. In this work, we study whether integrating visual knowledge into a language model can fill the gap. We investigate two types of knowledge transfer: (1) text knowledge transfer, using image captions that may contain enriched visual knowledge, and (2) cross-modal knowledge transfer, using both images and captions with vision-language training objectives. On 5 downstream tasks that may need visual knowledge to solve the problem, we perform extensive empirical comparisons over the presented objectives. Our experiments show that visual knowledge transfer can improve performance in both low-resource and fully supervised settings. PDF 11 2021
Zero-Shot Visual Grounding of Referring Utterances in Dialogue This work explores whether current pretrained multimodal models, which are optimized to align images and captions, can be applied to the rather different domain of referring expressions. In particular, we test whether one such model, CLIP, is effective in capturing two main trends observed for referential chains uttered within a multimodal dialogue, i.e., that utterances become less descriptive over time while their discriminativeness remains unchanged. We show that CLIP captures both, which opens up the possibility to use these models for reference resolution and generation. Moreover, our analysis indicates a possible role for these architectures toward discovering the mechanisms employed by humans when referring to visual entities. PDF 11 2021
A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models Large pre-trained vision-language (VL) models can learn a new task with a handful of examples and generalize to a new task without fine-tuning. However, these VL models are hard to deploy for real-world applications due to their impractically huge sizes and slow inference speed. To address this limitation, we study prompt-based low-resource learning of VL tasks with our proposed method, FewVLM, which is relatively smaller than recent few-shot learners. For FewVLM, we pre-train a sequence-to-sequence transformer model with prefix language modeling (PrefixLM) and masked language modeling (MaskedLM). Furthermore, we analyze the effect of diverse prompts on few-shot tasks. Experimental results on VQA show that FewVLM with prompt-based learning outperforms Frozen, which is 31x larger than FewVLM, by 18.2 percentage points on zero-shot VQAv2, and achieves comparable results to a 246x larger model, PICa. In our analysis, we observe that (1) prompts significantly affect zero-shot performance but only marginally affect few-shot performance, (2) models with noisy prompts learn as quickly as those with hand-crafted prompts given larger training data, and (3) MaskedLM helps with VQA tasks while PrefixLM boosts captioning performance. PDF 11 2021
DialFact: A Benchmark for Fact-Checking in Dialogue Fact-checking is an essential tool to mitigate the spread of misinformation and disinformation. We introduce the task of fact-checking in dialogue, which is a relatively unexplored area. We construct DialFact, a testing benchmark dataset of 22,123 annotated conversational claims, paired with pieces of evidence from Wikipedia. There are three sub-tasks in DialFact: 1) the verifiable claim detection task distinguishes whether a response carries verifiable factual information; 2) the evidence retrieval task retrieves the most relevant Wikipedia snippets as evidence; 3) the claim verification task predicts whether a dialogue response is supported, refuted, or has not enough information. We find that existing fact-checking models trained on non-dialogue data like FEVER fail to perform well on our task, and thus we propose a simple yet data-efficient solution to effectively improve fact-checking performance in dialogue. In the error analysis, we point out unique challenges in DialFact, such as handling colloquialisms, coreferences, and retrieval ambiguities, to shed light on future research in this direction. PDF 11 2021
Primum Non Nocere: Before working with Indigenous data, the ACL must confront ongoing colonialism In this paper, we challenge the ACL community to reckon with historical and ongoing colonialism by adopting a set of ethical obligations and best practices drawn from the Indigenous studies literature. While the vast majority of NLP research focuses on a very small number of very high resource languages (English, Chinese, etc.), some work has begun to engage with Indigenous languages. No research involving Indigenous language data can be considered ethical without first acknowledging that Indigenous languages are not merely very low resource languages. The toxic legacy of colonialism permeates every aspect of interaction between Indigenous communities and outside academic researchers. Ethical research must actively challenge this colonial legacy by explicitly acknowledging and centering Indigenous community goals and Indigenous ways of knowing. To this end, we propose that the ACL draft and adopt an ethical framework for NLP researchers and computational linguists wishing to engage in research involving Indigenous languages. PDF 11 2021
Multimodal Semi-supervised Learning for Disaster Tweet Classification During natural disasters, people often use social media platforms, such as Twitter, to post information about casualties and damage produced by disasters. This information can help relief authorities gain situational awareness in nearly real time, and enable them to quickly distribute resources where most needed. However, annotating data for this purpose can be burdensome, subjective and expensive. In this paper, we investigate how to leverage the copious amounts of unlabeled data generated by disaster eyewitnesses and affected individuals during disaster events. To this end, we propose a semi-supervised learning approach to improve the performance of neural models on several multimodal disaster tweet classification tasks. Our approach shows significant improvements, obtaining up to $3.5\%$ F1 performance gain at no additional annotation cost. PDF 11 2021
Training Data is More Valuable than You Think: A Simple and Effective Method by Retrieving from Training Data Retrieval-based methods have been shown to be effective in NLP tasks via introducing external knowledge. However, the indexing and retrieving of large-scale corpora bring considerable computational cost. Surprisingly, we found that REtrieving from the traINing datA (REINA) alone can lead to significant gains on multiple NLG and NLU tasks. We retrieve the labeled training instances most similar to the input text and then concatenate them with the input to feed into the model to generate the output. Experimental results show that this simple method can achieve significantly better performance on a variety of NLU and NLG tasks, including summarization, machine translation, language modeling, and question answering tasks. For instance, our proposed method achieved state-of-the-art results on XSum, BigPatent, and CommonsenseQA. PDF 11 2021
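The recipe is simple enough to sketch end-to-end. In this minimal version the retriever is BM25 via the rank-bm25 package, and the [SEP] concatenation format and toy data are assumptions; the paper's exact retrieval and formatting choices may differ.

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

train_inputs = [
    "a man is playing a guitar on stage",
    "the stock market fell sharply on monday",
    "scientists discovered a new species of frog",
]
train_labels = ["music", "finance", "science"]

bm25 = BM25Okapi([t.split() for t in train_inputs])

def reina_augment(text, k=1):
    """Concatenate the k most similar labelled training instances to the
    input before feeding it to the model."""
    scores = bm25.get_scores(text.split())
    top = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    retrieved = " ".join(
        f"{train_inputs[i]} => {train_labels[i]}" for i in top)
    return f"{text} [SEP] {retrieved}"

print(reina_augment("the market dropped after the announcement"))
```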
RoMe: A Robust Metric for Evaluating Natural Language Generation Evaluating Natural Language Generation (NLG) systems is a challenging task. Firstly, the metric should ensure that the generated hypothesis reflects the reference's semantics. Secondly, it should consider the grammatical quality of the generated sentence. Thirdly, it should be robust enough to handle various surface forms of the generated sentence. Thus, an effective evaluation metric has to be multifaceted. In this paper, we propose an automatic evaluation metric incorporating several core aspects of natural language understanding (language competence, syntactic and semantic variation). Our proposed metric, RoMe, is trained on language features such as semantic similarity combined with tree edit distance and grammatical acceptability, using a self-supervised neural network to assess the overall quality of the generated sentence. Moreover, we perform an extensive robustness analysis of the state-of-the-art methods and RoMe. Empirical results suggest that RoMe has a stronger correlation to human judgment over state-of-the-art metrics in evaluating system-generated sentences across several NLG tasks. PDF 11 2021
Query and Extract: Refining Event Extraction as Type-oriented Binary Decoding Event extraction is typically modeled as a multi-class classification problem where both event types and argument roles are treated as atomic symbols. These approaches are usually limited to a set of pre-defined types. We propose a novel event extraction framework that takes both event types and argument roles as natural language queries to extract candidate triggers and arguments from the input text. With the rich semantics in the queries, our framework benefits from the attention mechanisms to better capture the semantic correlation between the event types or argument roles and the input text. Furthermore, the query-and-extract formulation allows our approach to leverage all available event annotations from various ontologies as a unified model. Experiments on two public benchmark datasets, ACE and ERE, demonstrate that our approach achieves the state-of-the-art performance on each dataset and significantly outperforms existing methods on zero-shot event extraction. We will make all the programs publicly available once the paper is accepted. PDF 11 2021
Learning and Evaluating Character Representations in Novels We address the problem of learning fixed-length vector representations of characters in novels. Recent advances in word embeddings have proven successful in learning entity representations from short texts, but fall short on longer documents because they do not capture full book-level information. To overcome the weakness of such text-based embeddings, we propose two novel methods for representing characters: (i) graph neural network-based embeddings from a full corpus-based character network; and (ii) low-dimensional embeddings constructed from the occurrence pattern of characters in each novel. We test the quality of these character embeddings using a new benchmark suite to evaluate character representations, encompassing 12 different tasks. We show that our representation techniques combined with text-based embeddings lead to the best character representations, outperforming text-based embeddings in four tasks. Our dataset and evaluation script will be made publicly available to stimulate additional work in this area. PDF 11 2021
WeaNF: Weak Supervision with Normalizing Flows A popular approach to decreasing the need for costly manual annotation of large data sets is weak supervision, which introduces problems of noisy labels, coverage and bias. Methods for overcoming these problems have relied either on discriminative models trained with cost functions specific to weak supervision, or, more recently, on generative models that try to model the output of the automatic annotation process. In this work, we explore a novel direction of generative modeling for weak supervision: instead of modeling the output of the annotation process (the labeling function matches), we generatively model the input-side data distributions (the feature space) covered by labeling functions. Specifically, we estimate a density for each weak labeling source, or labeling function, by using normalizing flows. An integral part of our method is the flow-based modeling of multiple simultaneously matching labeling functions, so that phenomena such as labeling function overlap and correlations are captured. We analyze the effectiveness and modeling capabilities on various commonly used weak supervision data sets, and show that weakly supervised normalizing flows compare favorably to standard weak supervision baselines. PDF 11 2021
Investigating Selective Prediction Approaches Across Several Tasks in IID, OOD, and Adversarial Settings Humans usually choose not to answer questions on which they are likely to be incorrect. In order to equip NLP systems with this selective answering capability, several task-specific approaches have been proposed. However, which approaches work best across tasks, or even whether they consistently outperform the simplest baseline, MaxProb, remains to be explored. To this end, we systematically study selective prediction in a large-scale setup of 17 datasets across several NLP tasks. Through comprehensive experiments under in-domain (IID), out-of-domain (OOD), and adversarial (ADV) settings, we show that despite leveraging additional resources (held-out data/computation), none of the existing approaches consistently and considerably outperforms MaxProb in all three settings. Furthermore, their performance does not translate well across tasks. For instance, Monte-Carlo Dropout outperforms all other approaches on Duplicate Detection datasets but does not fare well on NLI datasets, especially in the OOD setting. Thus, we recommend that future selective prediction approaches be evaluated across tasks and settings for reliable estimation of their capabilities. PDF 11 2021
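For reference, the MaxProb baseline the study compares against fits in a few lines: answer whenever the model's top softmax probability clears a threshold, abstain otherwise. The threshold value below is illustrative.

```python
import numpy as np

def maxprob_selective(probs, threshold=0.8):
    """Return the argmax prediction when the max softmax probability is
    at least `threshold`, otherwise abstain (None)."""
    preds = probs.argmax(axis=1)
    confident = probs.max(axis=1) >= threshold
    return [int(p) if ok else None for p, ok in zip(preds, confident)]

probs = np.array([[0.95, 0.03, 0.02],
                  [0.40, 0.35, 0.25]])
print(maxprob_selective(probs))  # [0, None]
```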
Revisiting the Compositional Generalization Abilities of Neural Sequence Models Compositional generalization is a fundamental trait in humans, allowing us to effortlessly combine known phrases to form novel sentences. Recent works have claimed that standard seq-to-seq models severely lack the ability to compositionally generalize. In this paper, we focus on one-shot primitive generalization as introduced by the popular SCAN benchmark. We demonstrate that modifying the training distribution in simple and intuitive ways enables standard seq-to-seq models to achieve near-perfect generalization performance, thereby showing that their compositional generalization abilities were previously underestimated. We perform detailed empirical analysis of this phenomenon. Our results indicate that the generalization performance of models is highly sensitive to the characteristics of the training data which should be carefully considered while designing such benchmarks in future. PDF 11 2021
A Deep Generative XAI Framework for Natural Language Inference Explanations Generation Explainable artificial intelligence with natural language explanations (Natural-XAI) aims to produce human-readable explanations as evidence for AI decision-making. This evidence can enhance human trust and understanding of AI systems and contribute to AI explainability and transparency. However, current approaches focus on generating a single explanation only. In this paper, we conduct experiments with the state-of-the-art Transformer architecture and explore multiple-explanation generation using a public benchmark dataset, e-SNLI (Camburu et al., 2018). We propose a novel deep generative Natural-XAI framework, INITIATIVE (explaIn aNd predIcT wIth contextuAl condiTIonal Variational autoEncoder), for generating natural language explanations and making a prediction at the same time. Our method achieves competitive or better performance against the state-of-the-art baseline models on the generation (4.7% improvement in BLEU score) and prediction (4.4% improvement in accuracy) tasks. Our work can serve as a solid deep generative model baseline for future Natural-XAI research. Our code will be publicly available on GitHub upon paper acceptance. PDF 11 2021
Learning to execute or ask clarification questions Collaborative tasks are ubiquitous activities in which a form of communication is required to reach a joint goal. Collaborative building is one such task. To this end, we wish to develop an intelligent builder agent in a simulated building environment (Minecraft) that can build whatever users wish by just talking to the agent. To achieve this goal, such agents need to be able to take the initiative by asking clarification questions when further information is needed. Existing work on the Minecraft Corpus Dataset has only learned to execute instructions, neglecting the importance of asking for clarification. In this paper, we extend the Minecraft Corpus Dataset by annotating all builder utterances into eight types, including clarification questions, and propose a new builder agent model capable of determining when to ask for clarification or execute instructions. Experimental results show that our model achieves state-of-the-art performance on the collaborative building task with a substantial improvement. We also provide baselines for the new tasks: learning to ask, and the joint task, which consists of solving both the collaborative building and learning-to-ask tasks jointly. PDF 11 2021
Controllable Natural Language Generation with Contrastive Prefixes To guide the generation of large pretrained language models (LMs), previous work has focused on directly fine-tuning the language model or utilizing an attribute discriminator. In this work, we propose a novel lightweight framework for controllable GPT2 generation, which utilizes a set of small attribute-specific vectors, called prefixes (Li and Liang, 2021), to steer natural language generation. Different from Li and Liang (2021), where each prefix is trained independently, we take the relationship among prefixes into consideration and train multiple prefixes simultaneously. We propose a novel supervised method and also an unsupervised method to train the prefixes for single-aspect control, while the combination of these two methods can achieve multi-aspect control. Experimental results on both single-aspect and multi-aspect control show that our methods can guide generation towards the desired attributes while maintaining high linguistic quality. PDF 11 2021
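To make the prefix idea concrete, here is a minimal sketch of attribute-specific prefix steering at the embedding level. The attribute names and hyperparameters are invented for illustration, and Li and Liang (2021) inject prefixes as past key-values in every attention layer rather than only at the input; this simplification just shows how trainable prefixes can steer a frozen LM.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# A minimal sketch: one trainable prefix per attribute, prepended at the
# embedding level. The real method trains prefixes at every attention layer.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.requires_grad_(False)  # the LM itself stays frozen

PREFIX_LEN, EMB = 10, model.config.n_embd
attributes = ["positive", "negative"]  # hypothetical single-aspect labels
prefixes = torch.nn.ParameterDict({
    a: torch.nn.Parameter(torch.randn(PREFIX_LEN, EMB) * 0.02)
    for a in attributes
})

def loss_for(attribute: str, text: str) -> torch.Tensor:
    ids = tokenizer(text, return_tensors="pt").input_ids
    tok_emb = model.get_input_embeddings()(ids)               # (1, T, E)
    prefix = prefixes[attribute].unsqueeze(0)                 # (1, P, E)
    inputs = torch.cat([prefix, tok_emb], dim=1)
    # Ignore the prefix positions in the LM loss.
    labels = torch.cat(
        [torch.full((1, PREFIX_LEN), -100, dtype=torch.long), ids], dim=1)
    return model(inputs_embeds=inputs, labels=labels).loss

# Training multiple prefixes together would let a contrastive term relate
# same-attribute and opposite-attribute prefixes, as the paper proposes.
opt = torch.optim.Adam(prefixes.parameters(), lr=5e-4)
opt.zero_grad()
loss = loss_for("positive", "What a wonderful movie!")
loss.backward()
opt.step()
```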
Investigating the Use of BERT Anchors for Bilingual Lexicon Induction with Minimal Supervision This paper investigates the use of static anchors from transformer architectures for the task of Bilingual Lexicon Induction. We revisit an existing approach built around the ELMo architecture and explore the use of the methodology on the BERT family of language models. Experiments are performed and analysed for three language pairs, combining English with three target languages from very different language families: Hindi, Dutch, and Russian. Although the contextualised approach is not able to outperform the SOTA VecMap method, we find that it is easily adaptable to newer transformer models and can compete with the MUSE approach. An error analysis reveals interesting trends across languages and shows how the method could be further improved by building on the basic hypothesis that transformer embeddings can indeed be decomposed into a static anchor and a dynamic context component. We make the code, the extracted anchors (before and after alignment) and the modified train and test sets available for use. PDF 11 2021
Neural Pipeline for Zero-Shot Data-to-Text Generation In data-to-text (D2T) generation, training on in-domain data leads to overfitting to the data representation and repeating training data noise. We examine how to avoid finetuning the pretrained language models (PLMs) on D2T generation datasets while still taking advantage of surface realization capabilities of PLMs. Inspired by pipeline approaches, we propose to generate text by rephrasing single-item templates using a sequence of modules trained on general-domain text-based operations—ordering, aggregation, and paragraph compression. We train PLMs for performing these operations on a synthetic corpus WikiFluent which we build from English Wikipedia. Our experiments on two major triple-to-text datasets—WebNLG and E2E—show that our approach enables D2T generation from RDF triples in zero-shot settings. PDF 11 2021
Strategies in subword tokenization: humans vs. algorithms The output of subword tokenization can be very different depending on what algorithm is used. It is typically judged as more or less plausible, depending on how much it corresponds to human intuition. A subword vocabulary overlap between manual and automatic segmentation is an indicator of plausibility, but it does not reveal much about how the process of segmentation compares with human analysis. In this study, we propose a new method to analyze subword segmentation strategies relying on a spatial analysis of the distribution of subwords' lengths. Our experiments on English, Finnish and Turkish show that humans tend to balance creativity and consistency, while algorithms tend to be either strongly biased or inconsistent. To imitate humans better, algorithms need to produce subword segments of moderately uneven length, which can be achieved by combining complementary strategies. PDF 11 2021
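As a toy illustration of the kind of analysis described, the sketch below compares two segmentations through the distribution of subword lengths; the example segmentations are invented, not taken from the paper's data.

```python
from statistics import mean, pstdev

# Invented segmentations of the same words by a "human" and an algorithm.
human = {"unhappiness": ["un", "happi", "ness"], "cats": ["cat", "s"]}
bpe_like = {"unhappiness": ["unh", "app", "iness"], "cats": ["cats"]}

def length_profile(segmentation):
    """Mean and spread of subword lengths across all segmented words."""
    lengths = [len(s) for subs in segmentation.values() for s in subs]
    return mean(lengths), pstdev(lengths)

for name, seg in [("human", human), ("algorithm", bpe_like)]:
    m, sd = length_profile(seg)
    print(f"{name}: mean subword length {m:.2f}, std {sd:.2f}")
# Moderately uneven lengths (non-zero but bounded spread) are the pattern
# the paper associates with human segmentation.
```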
Local Structure Matters Most: Perturbation Study in NLU Recent research analyzing the sensitivity of natural language understanding models to word-order perturbations has shown that neural models are surprisingly insensitive to the order of words. In this paper, we investigate this phenomenon by developing perturbations that alter the order of words, subwords, and characters to analyze their effect on neural models' performance on language understanding tasks. We measure the impact of perturbations on the local neighborhood of characters and on the global position of characters in the perturbed texts, and observe that perturbation functions found in prior literature only affect the global ordering while the local ordering remains relatively unperturbed. We empirically show that neural models, regardless of their inductive biases, pretraining scheme, or choice of tokenization, mostly rely on the local structure of text to build understanding and make limited use of the global structure. PDF 11 2021
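The distinction between local and global order can be made concrete with two toy perturbation functions: one destroys local neighborhoods while keeping each window roughly in place, the other shuffles whole windows so local structure survives but global order is scrambled (the regime the paper finds prior perturbations occupy). This is an illustrative sketch, not the paper's exact perturbation suite.

```python
import random

def local_shuffle(words, window=3, seed=0):
    """Shuffle words only inside fixed-size windows: global positions of
    the windows are preserved, local order inside each window is destroyed."""
    rng = random.Random(seed)
    out = []
    for i in range(0, len(words), window):
        chunk = words[i:i + window]
        rng.shuffle(chunk)
        out.extend(chunk)
    return out

def window_shuffle(words, window=3, seed=0):
    """Shuffle whole windows as units: local neighborhoods survive,
    but the global order is scrambled."""
    rng = random.Random(seed)
    chunks = [words[i:i + window] for i in range(0, len(words), window)]
    rng.shuffle(chunks)
    return [w for c in chunks for w in c]

sent = "the quick brown fox jumps over the lazy dog".split()
print(" ".join(local_shuffle(sent)))   # local order destroyed
print(" ".join(window_shuffle(sent)))  # global order destroyed
```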
Neural Dynamic Focused Topic Model Topic models and all their variants analyse text by learning meaningful representations through word co-occurrences. As pointed out by Williamson et al. (2010), such models implicitly assume that the probability of a topic being active and its proportion within each document are positively correlated. This correlation can be strongly detrimental in the case of documents created over time, simply because recent documents are likely better described by new and hence rare topics. In this work we leverage recent advances in neural variational inference and present an alternative neural approach to the Focused Topic Model and its dynamic extensions. Indeed, we develop a neural model for topic evolution which exploits a compound Bernoulli structure in order to track the appearances of topics, thereby decoupling their activities from their proportions. On three different corpora, namely the UN general debates, the collection of NeurIPS papers, and the ACL Anthology dataset, our model outperforms competing neural variational topic models. PDF 11 2021
Generating Summaries for Scientific Paper Review The review process is essential to ensure the quality of publications. Recently, the increase in submissions to top venues in machine learning and NLP has placed an excessive burden on reviewers and raised concerns that this overload may affect the quality of the reviews. An automatic system for assisting with the reviewing process could be a solution for ameliorating the problem. In this paper, we explore automatic review summary generation for scientific papers. We posit that neural language models have the potential to be valuable candidates for this task. In order to test this hypothesis, we release a new dataset of scientific papers and their reviews, collected from papers published in the NeurIPS conference from 2013 to 2020. We evaluate state-of-the-art neural summarization models, present initial results on the feasibility of automatic review summary generation, and propose directions for future work. PDF 11 2021
Prompting as Multimodal Fusing Tsimpoukelli et al. (2021) devise Frozen, empowering a language model to solve multimodal tasks by pretraining a vision encoder whose outputs are prompts fed to the language model. The vision encoder has a dual objective: extracting image features and aligning image/text representation spaces. We propose to disentangle the objectives by using prompt vectors to align the spaces; this lets the vision encoder focus on extracting image features. We show that this disentangled approach is modular and parameter-efficient for processing tasks that involve two or more modalities. PDF 11 2021
Bi-Matching Mechanism to Combat the Long Tail of Word Sense Disambiguation The long-tail phenomenon of word sense distribution in linguistics causes the Word Sense Disambiguation (WSD) task to face a serious polarization of word sense distribution, that is, Most Frequent Senses (MFSs) with huge sample sizes and Long Tail Senses (LTSs) with small sample sizes. A single matching mechanism that does not distinguish between the two kinds of senses causes LTSs to be ignored, since LTSs are in a weak position. Conversely, few-shot learning methods that focus mainly on LTSs fail to exploit the easy identifiability of MFSs. This paper proposes a bi-matching mechanism that allows the WSD model to handle the two kinds of senses in a targeted manner, namely definition matching and collocation feature matching. Experiments carried out under the evaluation framework of English all-words WSD show that our model outperforms the baseline models. Moreover, state-of-the-art performance is achieved through data enhancement. PDF 11 2021
Deep-to-bottom Weights Decay: A Systemic Knowledge Review Learning Technique for Transformer Layers in Knowledge Distillation Millions of parameters and huge computational power consumption lie behind the outstanding performance of pre-trained language models in natural language processing tasks. Knowledge distillation is considered a compression strategy to address this problem. However, previous works either (i) distill only partial transformer layers of the teacher model, ignoring the importance of bottom-layer base information, or (ii) neglect the differences in difficulty of knowledge from deep to shallow layers, which correspond to different levels of information in the teacher model. We introduce a deep-to-bottom weights decay review mechanism for knowledge distillation, which fuses teacher-side information while taking each layer's difficulty level into consideration. To validate our claims, we distill a 12-layer BERT into a 6-layer model and evaluate it on the GLUE dataset. Experimental results show that our review approach is able to outperform other existing techniques. PDF 11 2021
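A hedged sketch of what a layer-weighted review loss could look like is given below: each matched student/teacher layer pair contributes an MSE term, with weights decaying from the deepest layer toward the bottom. The decay schedule, its direction, and the hyperparameter values are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def review_kd_loss(student_h, teacher_h, decay=0.8):
    """Layer-matching distillation loss with deep-to-bottom weights decay.

    student_h / teacher_h: lists of hidden states, one per matched layer,
    each of shape (batch, seq_len, hidden). The deepest layer gets weight
    1.0 and the weight decays toward the bottom (an assumed schedule)."""
    n = len(student_h)
    weights = torch.tensor([decay ** (n - 1 - i) for i in range(n)])
    weights = weights / weights.sum()  # normalize so the terms sum to 1
    return sum(w * F.mse_loss(s, t)
               for w, s, t in zip(weights, student_h, teacher_h))

# 6-layer student matched against six selected layers of a 12-layer teacher
student = [torch.randn(2, 16, 768) for _ in range(6)]
teacher = [torch.randn(2, 16, 768) for _ in range(6)]
print(review_kd_loss(student, teacher).item())
```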
Detecting Unassimilated Borrowings in Spanish: An Annotated Corpus and Approaches to Modeling This work presents a new resource for borrowing identification and analyzes the performance and errors of several models on this task. We introduce a new annotated corpus of Spanish newswire rich in unassimilated lexical borrowings---words from one language that are introduced into another without orthographic adaptation---and use it to evaluate how several sequence labeling models (CRF, BiLSTM-CRF, and Transformer-based models) perform. The corpus contains 370,000 tokens and is larger, more borrowing-dense, OOV-rich, and topic-varied than previous corpora available for this task. Our results show that a BiLSTM-CRF model fed with either Transformer-based embeddings pretrained on codeswitched data or a combination of contextualized word embeddings (along with character embeddings and Spanish and English subword embeddings) outperforms results obtained by a multilingual BERT-based model. PDF 11 2021
Causal Transformers: Improving the Robustness on Spurious Correlations The fully-connected dependencies in self-attention overfit spurious correlations and limit generalization on out-of-distribution data. Pre-trained language models (PLMs) alleviate this problem by benefiting from the appreciable counterexamples in large-scale pre-training corpora. However, no study has tried to resolve this problem by improving the model structure. We enforce a causal independence mechanism in the self-attention network, which constrains attention mapping topologies (AMGs) to be causal structures. To implement it, we define a smooth loss on the Markov-boundary-constrained directed acyclic graph (DAG) via Lagrange duality, and use it to optimize the AMGs towards causal structures. This causal attention network is then applied to the Transformer (Causal Transformer). Empirical results on two spurious-correlation-challenging (SCC) datasets, covering neural machine translation (NMT) and natural language inference (NLI) tasks, demonstrate that our Causal Transformer outperforms the state-of-the-art model and improves out-of-distribution prediction. PDF 11 2021
OrderSum: Reading Order-Aware Unsupervised Opinion Summarization Opinion summarization aims to create a concise summary reflecting the subjective information conveyed by multiple user reviews about the same product. To avoid the high expense of curating golden summaries for training, many unsupervised methods have been developed recently. Most state-of-the-art methods utilize extracted segments, following their salience ranking, as pseudo labels to train a summary generator. However, the extracted salient segments can be verbose, and their reading order has long been overlooked. In this paper, we propose a reading order-aware framework, OrderSum, aiming to generate concise and logical summaries. Specifically, we first formulate the segment ordering problem in pseudo labels as path-choosing and solve it using reinforcement learning. Moreover, to generate a more concise summary, we propose to encourage the generative model to skip useless words based on token link information derived from concise sentences, which can be collected easily from massive raw reviews by considering the ratio of sentiment/aspect words. Extensive experiments demonstrate that OrderSum benefits from the awareness of reading order and the conciseness modeling, thus being more effective than existing unsupervised methods and achieving state-of-the-art performance. PDF 11 2021
A Comprehensive and Large-Scale Dataset for Integrated Argument Mining Tasks Traditionally, a debate usually requires a manual preparation process, including reading plenty of articles, selecting the claims, identifying the stances of the claims, seeking evidence for the claims, etc. As AI debate attracts more attention these years, it is worth exploring methods to automate the tedious process involved in the debating system. In this work, we introduce a comprehensive and large-scale dataset that can be applied to a series of argument mining tasks, including claim extraction, stance classification, evidence extraction, etc. Our dataset is collected from over 1k articles related to 123 topics. Nearly 70k sentences in the dataset are fully annotated based on their argument properties (e.g., claims, stances, evidence). We further propose two new integrated argument mining tasks associated with the debate preparation process: (1) claim extraction with stance classification (CESC) and (2) claim-evidence pair extraction (CEPE). We adopt a pipeline approach and an end-to-end method for each integrated task separately. Promising experimental results are reported to show the values and challenges of our proposed tasks, and motivate future research on argument mining. PDF 11 2021
Unsupervised Domain Adaptation for Event Detection via Meta Self-Paced Learning As important events in textual data are usually highly specific in terms of tasks and domains, a change in data distribution can have a significant impact on detection performance. Recent methods addressing unsupervised domain adaptation for the event detection task typically extract domain-invariant representations by combining and balancing various objectives to align the feature space between source and target domains. While effective, these methods are impractical as large-scale language models grow drastically bigger to achieve optimal performance. To this end, we propose the Meta Self-Paced Domain Adaptation framework (MSP-DA), which effectively and efficiently alleviates the need for domain-specific hyperparameter tuning. By imitating the train-test dataset split based on the difficulties of the source domain's samples, the model is trained through a meta-learning process that simultaneously learns to weigh the importance of each labeled instance and to balance every alignment objective. Extensive experiments demonstrate that our framework substantially improves performance on target domains, surpassing state-of-the-art approaches. Furthermore, we present detailed analyses to validate our method and provide insight into how each domain affects the learned hyperparameters. PDF 11 2021
Empirical Evaluation of Topic Zero- and Few-Shot Learning for Stance Dissonance Detection We address stance dissonance detection, the task of detecting conflicting stances between two input statements. Computational models for traditional stance detection have typically been trained to indicate pro/con stances for a given target topic (e.g., gun control) and thus do not generalize well to new topics. In this paper, we systematically evaluate the generalizability of this task to situations where examples of the topic have not been seen at all (zero-shot) or only a few times (few-shot). We first build a large-scale dataset for stance dissonance detection from an online debate platform, consisting of 23.8k pairs of statements from 34 diverse topics. We show that stance dissonance detection models trained only on a small number of non-target topics already perform as well as those trained on a target topic. We also show that adding more non-target topics further boosts performance, indicating the generalizability of non-target topics to a target topic in the stance dissonance detection task. PDF 11 2021
Do Pre-trained Models Benefit Knowledge Graph Completion? A Reliable Evaluation and a Reasonable Approach In recent years, pre-trained language models (PLMs) have been shown to capture factual knowledge from massive texts, which has encouraged the proposal of PLM-based knowledge graph completion (KGC) models. However, these models still lag behind the SOTA KGC models in terms of performance. In this work, we find two main reasons for the weak performance: (1) Inaccurate evaluation setting. The evaluation setting under the closed-world assumption (CWA) may underestimate the PLM-based KGC models since they introduce more external knowledge; (2) Inappropriate utilization of PLMs. Most PLM-based KGC models simply splice the labels of entities and relations as inputs, leading to incoherent sentences that do not take full advantage of the implicit knowledge in PLMs. To alleviate these problems, we highlight a more accurate evaluation setting under the open-world assumption (OWA), which manually checks the correctness of knowledge that is not in KGs. Moreover, motivated by prompt tuning, we propose a novel PLM-based KGC model named PKGC. The basic idea is to convert each triple and its support information into natural prompt sentences, which are then fed into PLMs for classification. Experimental results on two KGC datasets demonstrate that OWA is more reliable for evaluating KGC, especially on link prediction, and confirm the effectiveness of our PKGC model in both the CWA and OWA settings. PDF 11 2021
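The triple-to-prompt conversion can be illustrated with a small sketch; the template, the definitions, and the example triple are invented stand-ins for the paper's support information.

```python
# A sketch of converting a KG triple plus support information into a
# natural prompt sentence for PLM classification, in the spirit of PKGC.
# The template and the example triple are illustrative assumptions.
TEMPLATES = {
    "place_of_birth": "{head}, {head_def}, was born in {tail}, {tail_def}.",
}

def triple_to_prompt(head, relation, tail, definitions):
    """Render a (head, relation, tail) triple as a coherent sentence."""
    template = TEMPLATES[relation]
    return template.format(
        head=head, tail=tail,
        head_def=definitions.get(head, "an entity"),
        tail_def=definitions.get(tail, "a place"),
    )

defs = {"Marie Curie": "a physicist and chemist",
        "Warsaw": "a city in Poland"}
print(triple_to_prompt("Marie Curie", "place_of_birth", "Warsaw", defs))
# The resulting sentence would then be fed to a PLM with a binary
# (plausible / implausible) classification head.
```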
TwittIrish: A Universal Dependencies Treebank of Tweets in Modern Irish Modern Irish is a minority language lacking sufficient linguistic resources for the task of accurate automatic syntactic parsing of user-generated content. As with other languages, the linguistic style observed in Irish tweets differs, in terms of orthography, lexicon and syntax, from that of the standard texts more commonly used in Natural Language Processing (NLP) for the development of language models and parsers. This paper reports on the development of TwittIrish, the first Irish Universal Dependencies Twitter Treebank. We describe our bootstrapping method and report on preliminary parsing experiments. PDF 11 2021
Generated Knowledge Prompting for Commonsense Reasoning It remains an open question whether incorporating external knowledge benefits commonsense reasoning while maintaining the flexibility of pretrained sequence models. To investigate this question, we develop generated knowledge prompting, which consists of generating knowledge from a language model, then providing the knowledge as additional input when answering a question. Our method does not require task-specific supervision for knowledge integration, or access to a structured knowledge base, yet it improves performance of large-scale, state-of-the-art models on four commonsense reasoning tasks, achieving state-of-the-art results on numerical commonsense (NumerSense), general commonsense (CommonsenseQA 2.0), and scientific commonsense (QASC) benchmarks. Generated knowledge prompting highlights large-scale language models as flexible sources of external knowledge for improving commonsense reasoning. Our code is available at \url{github.com/anonymous_repo}. PDF 11 2021
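The two-stage recipe lends itself to a compact sketch. Here `generate` and `score` are hypothetical callables standing in for a knowledge-generation LM and an answer-scoring model respectively; the prompt wording and the best-over-statements aggregation are assumptions.

```python
# Sketch of generated knowledge prompting: (1) prompt an LM for
# question-related knowledge statements, (2) prepend each statement to the
# question and let the answering model score the candidate answers.
def generated_knowledge_answer(question, choices, generate, score, k=5):
    knowledge_prompt = (
        "Generate a fact relevant to the question.\n"
        f"Question: {question}\nFact:"
    )
    # Stage 1: sample k knowledge statements from the language model.
    statements = [generate(knowledge_prompt) for _ in range(k)]
    # Stage 2: each choice keeps its best score over all statements.
    best = {
        choice: max(score(f"{stmt} {question}", choice)
                    for stmt in statements)
        for choice in choices
    }
    return max(best, key=best.get)

# Usage would look like:
#   generated_knowledge_answer("How many legs does a spider have?",
#                              ["six", "eight"], my_lm, my_scorer)
# where my_lm and my_scorer wrap real models.
```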
LOPS: Learning Order Inspired Pseudo-Label Selection for Weakly Supervised Text Classification Iterative self-training is a popular framework in weakly supervised text classification that involves bootstrapping a deep neural classifier from heuristic pseudo-labels. The quality of pseudo-labels, especially the initial ones, is crucial to final performance, but they are inevitably noisy due to their heuristic nature, so selecting the correct ones offers great potential for a performance boost. One straightforward solution is to select samples based on the softmax probability scores corresponding to their pseudo-labels. However, we show through our experiments that such methods are ineffective and unstable due to the erroneously high-confidence predictions from poorly calibrated models. Recent studies on the memorization effects of deep neural models suggest that these models first memorize training samples with clean labels and then those with noisy labels. Inspired by this observation, we propose LOPS, a novel pseudo-label selection method that takes the learning order of samples into consideration. We hypothesize that the learning order reflects the probability of wrong annotation in terms of ranking, and therefore select the top samples that are learned earliest. LOPS can be viewed as a strong performance-boost plug-in for most existing weakly supervised text classification methods, as confirmed in extensive experiments on six real-world datasets. PDF 11 2021
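The learning-order idea can be sketched as follows, assuming we have recorded the model's predictions for every sample across epochs; the selection criterion and `keep_ratio` are simplifying assumptions rather than the paper's exact procedure.

```python
from collections import defaultdict

def select_by_learning_order(histories, pseudo_labels, keep_ratio=0.5):
    """Keep the samples the classifier fit earliest, per pseudo-class.

    histories[i]  : list of predicted labels for sample i, one per epoch
    pseudo_labels : the (noisy) pseudo-label of each sample
    """
    # Epoch at which each sample's prediction first matches its pseudo-label
    # (samples never matched are pushed to the end of the ranking).
    first_learned = []
    for i, preds in enumerate(histories):
        matches = [e for e, p in enumerate(preds) if p == pseudo_labels[i]]
        first_learned.append(matches[0] if matches else len(preds))

    by_class = defaultdict(list)
    for i, label in enumerate(pseudo_labels):
        by_class[label].append(i)

    selected = []
    for label, idxs in by_class.items():
        idxs.sort(key=lambda i: first_learned[i])  # earliest-learned first
        selected.extend(idxs[: max(1, int(len(idxs) * keep_ratio))])
    return sorted(selected)

# Two classes, three epochs of predictions per sample:
histories = [["A", "A", "A"], ["B", "A", "A"], ["B", "B", "B"], ["A", "A", "B"]]
print(select_by_learning_order(histories, ["A", "A", "B", "B"]))
```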
Improving Neural Models for Radiology Report Retrieval with Lexicon-based Automated Annotation Many clinical informatics tasks that are based on electronic health records need relevant patient cohorts to be selected based on findings, symptoms, and diseases. Frequently, these conditions are described in radiology reports which can be retrieved using information retrieval (IR) methods. The latest of these techniques utilize neural IR models such as BERT trained on clinical text. However, these methods still lack semantic understanding of the underlying clinical conditions as well as ruled out findings, resulting in poor precision during retrieval. In this paper we combine clinical finding detection with supervised query match learning. Specifically, we use lexicon-driven concept detection to detect relevant findings in sentences. These findings are used as queries to train a Sentence-BERT (SBERT) model using triplet loss on matched and unmatched query-sentence pairs. We show that the proposed supervised training task remarkably improves the retrieval performance of SBERT. The trained model generalizes well to unseen queries and reports from different collections. PDF 11 2021
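A minimal sketch of the triplet training signal is shown below, with random tensors standing in for Sentence-BERT embeddings; in practice the loss backpropagates through the SBERT encoder, and the margin value here is an assumption.

```python
import torch

# Anchor: a lexicon-detected finding used as the query; positive: a report
# sentence matching that finding; negative: a sentence without it.
# Random tensors stand in for SBERT embeddings; the margin is assumed.
triplet_loss = torch.nn.TripletMarginLoss(margin=1.0)

anchor = torch.randn(8, 384, requires_grad=True)    # finding-query embeddings
positive = torch.randn(8, 384, requires_grad=True)  # matched sentences
negative = torch.randn(8, 384, requires_grad=True)  # unmatched sentences

loss = triplet_loss(anchor, positive, negative)
loss.backward()  # in practice, gradients flow into the SBERT encoder
print(loss.item())
```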
Extreme Multi-label Text Classification with Multi-layer Experts Extreme multi-label text classification (XMTC) is the task of tagging each document with the relevant labels from a very large space of predefined categories, which presents an open challenge in the recent development of neural classifiers. Popular Transformer-based XMTC methods typically use the last-layer features to represent the document and to match it against candidate labels. We argue that the last-layer features may not be sufficient for predicting labels at different levels of semantic granularity, and that multi-layer features may offer a better choice instead. Based on this insight we propose a novel multi-expert model, namely ME-XML (Multiple Experts for XMTC), which combines multi-layer embeddings in Transformer for improving the prediction power of the model. PDF 11 2021
Debiased Contrastive Learning of Unsupervised Sentence Representations Recently, contrastive learning has shown effectiveness in fine-tuning pre-trained language models (PLMs) to derive sentence representations: it pulls augmented positive examples together to improve alignment while pushing apart irrelevant negatives for the uniformity of the whole representation space. However, previous works mostly sample negatives from the batch or training data at random. This may cause a sampling bias in which improper negatives (e.g., false negatives and anisotropic representations) are learned by the sentence representations, hurting the uniformity of the representation space. To address this, we present a new framework, \textbf{DCLR}, to alleviate the influence of sampling bias. In DCLR, we design an instance weighting method to punish false negatives and generate noise-based negatives to guarantee the uniformity of the representation space. Experiments on 7 semantic textual similarity tasks show that our approach is more effective than competitive baselines. Our code and data will be released to reproduce all the experiments. PDF 11 2021
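A rough sketch of the two ingredients (down-weighting suspected false negatives and adding noise-based negatives) is below. Thresholding on cosine similarity is a simplification of the paper's instance weighting, and all hyperparameter values are assumptions.

```python
import torch
import torch.nn.functional as F

def dclr_style_loss(z1, z2, tau=0.05, false_neg_thresh=0.9, n_noise=16):
    """Debiased InfoNCE sketch over two views z1, z2 of shape (B, D):
    zero out in-batch negatives that look like false negatives, and add
    random noise vectors as extra negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    B, D = z1.shape
    sim = z1 @ z2.T / tau                                  # (B, B)

    # weight 1 on the diagonal (positives), 0 for negatives whose cosine
    # similarity to the anchor is suspiciously high, 1 elsewhere
    weights = (z1 @ z2.T < false_neg_thresh).float()
    weights.fill_diagonal_(1.0)

    noise = F.normalize(torch.randn(n_noise, D), dim=-1)   # noise negatives
    sim_noise = z1 @ noise.T / tau                         # (B, n_noise)

    logits = torch.cat([sim, sim_noise], dim=1)
    # adding log-weights removes zero-weight terms from the denominator
    logw = torch.log(
        torch.cat([weights, torch.ones(B, n_noise)], dim=1) + 1e-12)
    labels = torch.arange(B)
    return F.cross_entropy(logits + logw, labels)

print(dclr_style_loss(torch.randn(4, 32), torch.randn(4, 32)).item())
```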
Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? An Extensive Empirical Study on Language Tasks There has been a lot of interest in the scaling properties of Transformer models. However, little has been done to investigate how inductive bias and model architecture affect scaling. Do model architectures scale differently? If so, how does inductive bias affect scaling behaviour? How does it influence upstream (pretraining) and downstream (transfer) performance? This paper conducts a systematic study of the scaling behaviour of ten diverse model architectures such as Transformers, Switch Transformers, Universal Transformers, dynamic convolutions, Performers, and recently proposed MLP-Mixers. Via extensive experiments, we show that (1) architecture is indeed an important consideration when performing scaling and (2) the best-performing model can fluctuate at different scales. We believe that the findings outlined in this work have significant implications for how model architectures are currently evaluated in the community. PDF 11 2021
Word2Box: Capturing Set-Theoretic Semantics of Words using Box Embeddings Learning representations of words in a continuous space is perhaps the most fundamental task in NLP, a prerequisite for nearly all modern machine-learning techniques. Often the objective is to capture distributional similarity via vector dot product; however, this is just one relation between word meanings we may wish to capture. If it is natural to consider words as (soft) equivalence classes based on similarity, it is also natural to expect the ability to perform set-theoretic operations (intersection, union, difference) on these representations. This is particularly relevant for words which are homographs: for example, “tongue”∩“body” should be similar to “mouth”, while “tongue”∩“language” should be similar to “dialect”. Box embeddings are a novel region-based representation which provides the capability to perform these set-theoretic operations. In this work, we provide a fuzzy-set interpretation of box embeddings, and train box embeddings with a CBOW objective where contexts are represented using intersection. We demonstrate improved performance on various word similarity tasks, particularly on less common words, and perform a quantitative and qualitative analysis exploring the additional unique expressivity provided by Word2Box. PDF 11 2021
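Box intersection and a smooth volume are simple to write down; the sketch below uses invented dimensions and values purely to illustrate the set-theoretic operations.

```python
import torch
import torch.nn.functional as F

# A word is an axis-aligned box, a (min, max) pair per dimension.
def intersect(box_a, box_b):
    """Intersection: elementwise max of mins, min of maxes."""
    (min_a, max_a), (min_b, max_b) = box_a, box_b
    return torch.maximum(min_a, min_b), torch.minimum(max_a, max_b)

def soft_volume(box, beta=1.0):
    """Smooth (fuzzy) volume: softplus keeps the volume differentiable
    even when the sides are empty or nearly empty."""
    lo, hi = box
    return F.softplus(hi - lo, beta=beta).prod(dim=-1)

d = 4  # invented dimensionality
tongue = (torch.zeros(d), torch.ones(d))
body = (torch.full((d,), 0.5), torch.full((d,), 1.5))
overlap = intersect(tongue, body)
print(soft_volume(overlap))  # volume of “tongue” ∩ “body”
```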
Distantly Supervised Named Entity Recognition with Category-Oriented Confidence Calibration In this work, we study noisy-labeled named entity recognition under the distant supervision setting. Considering that most confidence-estimation-based NER systems deal with noisy labels while ignoring the fact that the model has different levels of confidence towards different categories, we propose a category-oriented confidence calibration (Coca) strategy with a module that automatically calculates confidence thresholds. We integrate our method into a teacher-student framework to improve model performance. Our proposed approach outperforms advanced baseline models, setting new state-of-the-art performance on three existing distantly supervised NER benchmarks. PDF 11 2021
Robust and Effective Grammatical Error Correction with Simple Cycle Self-Augmenting Recent studies have revealed that grammatical error correction methods in the sequence-to-sequence paradigm are vulnerable to adversarial attack, and simply utilizing adversarial examples in the pre-training or post-training process can significantly enhance the robustness of GEC models to certain types of attack without suffering too much performance loss on clean data. In this paper, we further conduct a thorough robustness evaluation of cutting-edge GEC methods to four different types of adversarial attacks and propose a simple yet very effective Cycle Self-Augmenting (CSA) method accordingly. By leveraging the augmenting data from the GEC models themselves in the post-training process and introducing regularization data for cycle training, our proposed method can effectively improve model robustness of well-trained GEC models with only a few more training epochs as the extra cost. Experiments on four benchmark datasets and seven strong models indicate that our proposed training method can significantly enhance the robustness to four types of attacks without using purposely built adversarial examples in training. Evaluation results on clean data further confirm that our proposed CSA method significantly improves the performance of four baselines and yields nearly comparable results with other state-of-the-art models. Our code is available in the supplementary .zip file, which will be released after the anonymous period. PDF 11 2021
When Chosen Wisely, More Data Is What You Need: A Universal Sample-Efficient Strategy For Data Augmentation Data Augmentation (DA) is vital in deep learning for improving the generalizability of neural networks. Most existing DA techniques in NLP naively add a certain number of augmented samples without paying attention to the quality and added computational cost of these samples. Furthermore, state-of-the-art DA techniques in the literature usually learn to generate or re-weight augmented samples specific to the main task; however, these learning-based DA techniques are not sample-efficient and are computationally expensive. In this work, we propose a universal DA technique for NLP, called Glitter, which aims at efficiency and performance at the same time. In other words, Glitter can be applied to any existing DA technique to improve its training efficiency and sample efficiency while maintaining competitive performance. We evaluate Glitter on several downstream tasks such as the GLUE benchmark, SQuAD, and HellaSwag in a variety of scenarios including general single-network, consistency training, self-distillation and knowledge distillation (KD) setups. PDF 11 2021
KART: Parameterization of Privacy Leakage Scenarios from Pre-trained Language Models For the safe sharing of pre-trained language models, no guidelines currently exist, owing to the difficulty in estimating the upper bound of the risk of privacy leakage. One problem is that previous studies have assessed the risk for different real-world privacy leakage scenarios and attack methods, which reduces the portability of the findings. To tackle this problem, we represent complex real-world privacy leakage scenarios under a universal parameterization, \textit{Knowledge, Anonymization, Resource, and Target} (KART). The KART parameterization has two merits: (i) it clarifies the definition of privacy leakage in each experiment and (ii) it improves the comparability of the findings of risk assessments. We show that previous studies can be simply reviewed by parameterizing their scenarios with KART. We also demonstrate privacy risk assessments in different scenarios under the same attack method, which suggests that KART helps approximate the upper bound of risk under a specific attack or scenario. We believe that KART helps integrate past and future findings on privacy risk and will contribute to a standard for sharing language models. PDF 11 2021
Probing and Generalization of Metaphorical Knowledge in Pre-Trained Language Models Human languages are full of metaphorical expressions. Metaphors help people understand the world by connecting new concepts and domains to more familiar ones. Large pre-trained language models (PLMs) are therefore assumed to encode metaphorical knowledge useful for NLP systems. In this paper, we investigate this hypothesis for PLMs, by probing metaphoricity information in their encodings, and by measuring the cross-lingual and cross-dataset generalization of this information. We present studies in multiple metaphor detection datasets and in four languages (i.e., English, Spanish, Russian, and Farsi). Our extensive experiments suggest that contextual representations in PLMs do encode metaphorical knowledge, and mostly in their middle layers. The knowledge is transferable between languages and datasets, especially when the annotation is consistent across training and testing sets. Our findings give helpful insights for both cognitive and NLP scientists. PDF 11 2021
ARCNN: A Semantic Enhanced Relation Detection Model for Knowledge Base Question Answering Relation detection plays an important role in knowledge base question answering (KBQA), and it is critical for the final performance of KBQA systems. Previous works mainly focused on enriching the information representations of questions and relations, and neglected the interactions between questions and relations, as well as among different tokens within a relation. In this paper, we propose a semantic enhanced relation detection model called ARCNN, which is carefully designed by combining BiGRU, a multi-scale semantic extraction CNN, and different attention mechanisms in a seamless way. Moreover, we combine four levels of relation abstraction to ensure the integrity of relation information and hence to enrich the relation representation. Experimental results on two benchmarks show that our ARCNN model achieves new state-of-the-art accuracies of 96.42% on SimpleQuestions and 90.4% on WebQuestions. Moreover, it helps our KBQA system yield an accuracy of 81.5% and an F1 score of 72.0% on the two benchmarks, respectively. PDF 11 2021
DialogVED: A Pre-trained Latent Variable Encoder-Decoder Model for Dialog Response Generation Dialog response generation in the open domain is an important research topic where the main challenge is to generate relevant and diverse responses. In this paper, we propose a new dialog pre-training framework called DialogVED, which introduces continuous latent variables into the enhanced encoder-decoder pre-training framework to increase the relevance and diversity of responses. With the help of a large dialog corpus (Reddit), we pre-train the model using the following four tasks drawn from the literature on training language models (LMs) and variational autoencoders (VAEs): 1) masked language modeling; 2) response generation; 3) bag-of-words prediction; and 4) KL divergence reduction. We also add parameters to model the turn structure in dialogs to improve the performance of the pre-trained model. We conduct experiments on the PersonaChat, DailyDialog, and DSTC7-AVSD benchmarks for response generation. Experimental results show that our model achieves new state-of-the-art results on all these datasets. PDF 11 2021
Eye Gaze and Self-attention: How Humans and Transformers Attend Words in Sentences Attention mechanisms are used both to describe human reading processes and in natural language processing by transformer neural networks. On the surface, attention appears to be very different in these two contexts. However, this paper presents evidence that there are links between the two during reading tasks. During reading, the dwell times of human eye movements were strongly correlated with the attention patterns occurring in the early layers of pre-trained transformers such as BERT. Furthermore, we explored what factors lead to variations in these correlations and observed that data were more correlated when humans read for comprehension than when they searched for specific information. Additionally, the strength of a correlation was not related to the number of parameters in a transformer. PDF 11 2021
Contextual Fine-to-Coarse Distillation for Coarse-grained Response Selection in Open-Domain Conversations We study the problem of coarse-grained response selection in retrieval-based dialogue systems. The problem is equally important as fine-grained response selection, but is less explored in the existing literature. In this paper, we propose a Contextual Fine-to-Coarse (CFC) distilled model for coarse-grained response selection in open-domain conversations. In our CFC model, dense representations of queries, candidate contexts and responses are learned based on a multi-tower architecture using contextual matching, and richer knowledge learned by the one-tower architecture (fine-grained) is distilled into the multi-tower architecture (coarse-grained) to enhance the performance of the retriever. To evaluate the performance of the proposed model, we construct two new datasets based on the Reddit comments dump and the Twitter corpus. Extensive experimental results on the two datasets show that the proposed method achieves substantial improvements on all evaluation metrics compared with traditional baseline methods. PDF 11 2021
Event Detection via Derangement Question Answering Event detection (ED), which aims to detect events in texts and categorize them, is vital to understanding the messages they convey. Recently, ED without triggers has been proposed and has gained traction, since it relieves the tedious effort of data labeling. However, it still suffers from several formidable challenges: multi-label classification, insufficient clues, and imbalanced event types. We therefore propose a novel Derangement Question-Answering (DQA) framework on top of BERT to tackle the above challenges. More specifically, we treat the input text as a {\em question} and directly concatenate it with all event types, which are deemed the {\em answers}. Thus, by utilizing the original information, we can exploit the power of self-attention in BERT to absorb the semantic relation between the input text and the event types. Moreover, we design a simple yet effective {\em derangement} mechanism to relieve the issue of imbalanced event types. By including such perturbation, we can train a model more robust than the vanilla QA framework, promoting the semantic information of the major events while preserving the position of the minor events. The empirical results show that: (1) our proposed DQA framework attains state-of-the-art performance over previous competitive models; (2) our model can automatically link triggers with event types while identifying the corresponding arguments. PDF 11 2021
Distantly Supervised Named Entity Recognition via Confidence-Based Multi-Class Positive and Unlabeled Learning In this paper, we study the named entity recognition (NER) problem under distant supervision. Due to the incompleteness of the external dictionaries and/or knowledge bases, such distantly annotated training data usually suffer from a high false negative rate. To this end, we formulate the Distantly Supervised NER (DS-NER) problem via Multi-class Positive and Unlabeled (MPU) learning and propose a theoretically and practically novel CONFidence-based MPU (Conf-MPU) approach. To handle the incomplete annotations, Conf-MPU consists of two steps. First, for each token, a confidence score of being an entity token is estimated. Then, the proposed Conf-MPU risk estimation is applied to train a multi-class classifier for the NER task. Thorough experiments on two benchmark datasets labeled with various sources of external knowledge demonstrate the superiority of the proposed Conf-MPU over existing DS-NER methods. PDF 11 2021
Simple yet Powerful: An Overlooked Architecture for Nested Named Entity Recognition Named Entity Recognition (NER) is an important task in Natural Language Processing that aims to identify text spans belonging to predefined categories. Traditional NER research ignores nested entities, which are entities contained in other entity mentions. Although several methods have been proposed to address this case, most of them rely on complex task-specific structures and ignore potentially useful baselines for the task. We argue that this creates an overly optimistic impression of their performance. This paper revisits the Multiple LSTM-CRF (MLC) model, a simple, overlooked, yet powerful approach based on training independent sequence labeling models for each entity type. Extensive experiments with three nested NER corpora show that, despite the simplicity of this model, its performance is better than or at least as good as that of more sophisticated methods. Furthermore, we show that the MLC architecture achieves state-of-the-art results on the Chilean Waiting List corpus when pre-trained language models are included. In addition, we propose new task-specific metrics that adequately measure the ability of models to detect nestings. The results show that standard NER metrics do not adequately measure the ability of a model to detect nested entities, while our task-specific metrics provide new evidence on how existing approaches handle the task. PDF 11 2021
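The architecture's output-merging step can be sketched in a few lines: each per-type tagger emits spans independently, and overlaying them recovers nested mentions. The tagger outputs below are invented; in the paper each tagger is an LSTM-CRF.

```python
# A sketch of the MLC idea: one independent sequence labeler per entity
# type, whose span outputs are overlaid to recover nested mentions.
def merge_per_type_predictions(tokens, per_type_spans):
    """per_type_spans: {entity_type: [(start, end), ...]}, one entry per
    independently trained tagger. Returns all (possibly nested) mentions."""
    mentions = []
    for etype, spans in per_type_spans.items():
        for start, end in spans:
            mentions.append((etype, start, end, " ".join(tokens[start:end])))
    # sort by start position, longer (outer) spans before inner ones
    return sorted(mentions, key=lambda m: (m[1], -m[2]))

tokens = "the University of Chile hospital waiting list".split()
predictions = {                   # invented outputs of two taggers
    "Organization": [(1, 4)],     # "University of Chile"
    "Location": [(3, 4)],         # "Chile", nested inside the organization
}
for m in merge_per_type_predictions(tokens, predictions):
    print(m)
```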
KinyaBERT: a Morphology-aware Kinyarwanda Language Model Pre-trained language models such as BERT have been successful at tackling many natural language processing tasks. However, the unsupervised sub-word tokenization methods commonly used in these models (e.g., byte-pair encoding - BPE) are sub-optimal at handling morphologically rich languages. Even given a morphological analyzer, naive sequencing of morphemes into a standard BERT architecture is inefficient at capturing morphological compositionality and expressing word-relative syntactic regularities. We address these challenges by proposing a simple two-tier BERT architecture that leverages a morphological analyzer and explicitly represents morphological compositionality. Despite the success of BERT, most of its evaluations have been conducted on high-resource languages, obscuring its applicability to low-resource languages. We evaluate our proposed method on the low-resource, morphologically rich Kinyarwanda language, naming the proposed model architecture KinyaBERT. A robust set of experimental results reveals that KinyaBERT outperforms solid baselines by 2% in F1 score on a named entity recognition task and by 4.3% in average score on a machine-translated GLUE benchmark. KinyaBERT fine-tuning has better convergence and achieves more robust results on multiple tasks even in the presence of translation noise. PDF 11 2021
Efficient Long Sequence Encoding via Synchronization Pre-trained Transformer models have achieved success in a wide range of NLP tasks, but are inefficient when dealing with long input sequences. Existing studies try to overcome this challenge by segmenting the long sequence and then applying hierarchical encoding or post-hoc aggregation. We propose a synchronization mechanism for hierarchical encoding. Our approach first identifies anchor tokens across segments and groups them by their roles in the original input sequence. Then, inside each Transformer layer, anchor embeddings are synchronized within their group via a self-attention module. Our approach is a general framework with sufficient flexibility -- when adapted to a new task, it can easily be enhanced with task-specific anchor definitions. Experiments on two representative tasks with different types of long input texts, the NarrativeQA summary setting and wild multi-hop reasoning from HotpotQA, demonstrate that our approach is able to improve the global information exchange among segments while maintaining efficiency. PDF 11 2021
A New Dataset for Summarizing Radiology Reports Radiology report summarization is an important technology in smart healthcare. Compared with medical image processing and disease recognition, which have been studied comprehensively, research on radiology report summarization is much more limited, mainly due to the lack of a high-quality benchmark dataset. In this paper, we present a dataset called CRRsum for radiology report summarization, constructed from over 10K real radiology reports that contain diagnostic findings and diagnostic opinions. We perform an extensive evaluation of the current state-of-the-art methods for radiology report summarization on the proposed dataset. Our experiments reveal the challenges of radiology report summarization and provide many opportunities for future research. We also show that CRRsum can be used for medical classification to facilitate research on this task. PDF 11 2021
Modular Domain Adaptation Off-the-shelf models are widely used by computational social science researchers to measure properties of text, such as sentiment. However, without access to source data it is difficult to account for domain shift, which presents a threat to validity. Here, we treat domain adaptation as a modular process that involves separate model producers and model consumers, and show how they can independently cooperate to facilitate more accurate measurements of text. We introduce two lightweight techniques for this scenario, and demonstrate that they reliably increase out-of-domain accuracy on four multi-domain text classification datasets when used with linear and contextual embedding models. We conclude with recommendations for model producers and consumers, and release models and replication code to accompany this paper. PDF 11 2021
Multimodal Sentiment Analysis with Common-sense Modulation Our world is inherently multimodal, and recent work highlights the importance of machine learning models leveraging multiple streams of information in making decisions. Multimodal sentiment analysis has been an active area of research that requires models to take advantage of the linguistic, acoustic, and visual signals available in an utterance. However, most current models do not take into account any social common-sense knowledge, which is crucial in how we perceive sentiment in a conversation. To address this, in this paper we aim to influence or modulate modality representations with common-sense knowledge obtained from a generative social common-sense knowledge base. We provide a novel way to modulate the linguistic, acoustic, and visual features corresponding to an utterance by scaling and shifting these representations. We use the knowledge base to obtain latent knowledge representations for an utterance corresponding to different states of the speaker, such as the intent and the reaction, and we use them to shift and scale the three modalities. Our experiments on popular multimodal sentiment analysis benchmark datasets show that our proposed method is on par with, and often surpasses, the current state-of-the-art models. PDF 11 2021
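The scale-and-shift modulation resembles feature-wise affine conditioning; a hedged sketch with invented dimensions follows, where the knowledge vector would come from latent representations of speaker states such as intent or reaction.

```python
import torch
import torch.nn as nn

class KnowledgeModulation(nn.Module):
    """Scale-and-shift modulation of one modality's features by a
    common-sense knowledge vector: a sketch of the idea with assumed
    dimensions, not the paper's exact architecture."""
    def __init__(self, feat_dim, know_dim):
        super().__init__()
        self.to_gamma = nn.Linear(know_dim, feat_dim)  # produces the scale
        self.to_beta = nn.Linear(know_dim, feat_dim)   # produces the shift

    def forward(self, features, knowledge):
        gamma = self.to_gamma(knowledge)
        beta = self.to_beta(knowledge)
        return gamma * features + beta   # modulated representation

mod = KnowledgeModulation(feat_dim=768, know_dim=256)
text_feats = torch.randn(4, 768)   # linguistic features of 4 utterances
knowledge = torch.randn(4, 256)    # latent common-sense representations
print(mod(text_feats, knowledge).shape)
# The same module, with separate weights, would be applied to the
# acoustic and visual streams as well.
```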
On a Benefit of Masked Language Model Pretraining: Robustness to Simplicity Bias Despite the success of pretrained masked language models (MLMs), why MLM pretraining is useful is still a question that has not been fully answered. In this work we show theoretically and empirically that MLM pretraining makes models robust to lexicon-level spurious features, partly answering the question. Our explanation is that MLM pretraining may alleviate problems brought about by simplicity bias (Shah et al., 2020), which refers to the phenomenon that a deep model tends to rely excessively on simple features. In NLP tasks, those simple features could be token-level features whose spurious association with the label can be learned easily. We show that MLM pretraining makes learning from the context easier. Thus, pretrained models are less likely to rely excessively on a single token. We also explore theoretical explanations of MLM's efficacy in causal settings. Compared with Wei et al. (2021), we achieve similar results under milder assumptions. Finally, we close the gap between our theories and real-world practice by conducting experiments on real-world tasks. PDF 11 2021
The impact of lexical and grammatical processing on generating code from natural language Considering the seq2seq architecture of Yin and Neubig (2018) for natural language to code translation, we identify four key components: grammatical constraints, lexical preprocessing, input representations, and copy mechanisms. To study the impact of these components, we use a state-of-the-art architecture that relies on a BERT encoder and a grammar-based decoder, for which a formalization is provided. The paper highlights the importance of the lexical substitution component in current natural language to code systems. PDF 11 2021
FedParsing: a Semi-Supervised Federated Learning Model on Semantic Parsing Although many semantic parsing models have been proven to work effectively on "NL-to-SQL", the limited availability of annotated datasets remains a great challenge. Many semi-supervised models use unlabeled data to greatly improve model accuracy, but fail to take the data privacy of users into account. In this work, we focus on improving the performance of the semantic parsing model and protecting users' data privacy without increasing the size of the labeled dataset. Our new model, named FedParsing, is a semi-supervised Federated Learning model. To address the convergence difficulty of traditional semi-supervised Federated Learning models, we incorporate the Mean Teacher algorithm and apply the Exponential Moving Average algorithm to update model parameters. Experiments on WikiSQL show that, with extra unlabeled data, our model performs better than the supervised training model and the traditional semi-supervised Federated Learning model, which proves the effectiveness of FedParsing. PDF 11 2021
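The Mean Teacher ingredient can be sketched compactly: after each student update, the teacher's parameters track the student's via an exponential moving average. The decay value below is an assumption.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, decay=0.99):
    """Exponential Moving Average update of the teacher's parameters from
    the student's, as in Mean Teacher. The decay value is assumed."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)

student = torch.nn.Linear(16, 4)
teacher = torch.nn.Linear(16, 4)
teacher.load_state_dict(student.state_dict())  # start from the same weights

# ... after each student optimization step on a client:
ema_update(teacher, student)
```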
How to be Helpful on Online Support Forums? Internet forums such as Reddit offer people a platform to ask for advice when they encounter various issues at work, at school or in relationships. Telling helpful comments apart from unhelpful comments to these advice-seeking posts can help people and dialogue agents become more helpful in offering advice. We propose a dataset that contains both helpful and unhelpful comments in response to such requests. We then relate helpfulness to the closely related construct of empathy. Finally, ours is the first study to analyze the language features that are associated with helpful and unhelpful comments. PDF 11 2021
Making Document-Level Information Extraction Right for the Right Reasons Document-level information extraction is a flexible framework compatible with applications where information is not necessarily localized in a single sentence. For example, key features of a diagnosis in a radiology report may not be explicitly stated, but can nevertheless be inferred from the report's text. However, document-level neural models can easily learn spurious correlations from irrelevant information. This work studies how to ensure that these models make correct inferences from complex text and make those inferences in an auditable way: beyond just being right, are these models "right for the right reasons?" We experiment with post-hoc evidence extraction in a predict-select-verify framework using feature attribution techniques. While this basic approach can extract reasonable evidence, it can be regularized with small amounts of evidence supervision during training, which substantially improves the quality of extracted evidence. We evaluate on two domains: a small-scale labeled dataset of brain MRI reports and a large-scale modified version of DocRED (Yao et al., 2019), and show that models' plausibility can be improved with no loss in accuracy. PDF 11 2021
Heterogeneous Language Model Optimization in Automatic Speech Recognition Rising data privacy risks make it difficult for automatic speech recognition (ASR) systems to acquire complete training data in practical applications. Recently, the merge paradigm for acoustic models has been proposed to solve this issue. However, ASR still suffers from another salient issue on the language model side. Current efforts mainly focus on isomorphic neural network models, while language model optimization is characterized by merging and matching heterogeneous models, including $n$-gram and neural network models. In this paper, we propose a novel Match-and-Merge paradigm to fill this vacuum in language model optimization. Based on different training datasets, we train multiple language model pairs. In order to merge them into a target pair with the best performance, we first propose a Genetic Match-and-Merge (GMM) method that is specifically adapted to optimizing heterogeneous models. To improve algorithmic efficiency, we further propose a Reinforced Match-and-Merge (RMM) method, which maintains superior recognition accuracy while reducing convergence time. Extensive experiments demonstrate the effectiveness and generalization of our proposed methods, which establish a new state-of-the-art. PDF 11 2021
A Flexible Multi-Task Model for BERT Serving We present an efficient BERT-based multi-task (MT) framework that is particularly suitable for iterative and incremental development of tasks. The proposed framework is based on the idea of partial fine-tuning, i.e. only fine-tuning some top layers of BERT while keeping the other layers frozen. For each task, we independently train a single-task (ST) model using partial fine-tuning. Then we compress the task-specific layers in each ST model using knowledge distillation. Those compressed ST models are finally merged into one MT model so that the frozen layers of the former are shared across the tasks. We exemplify our approach on eight GLUE tasks, demonstrating that it is able to achieve 99.6\% of the performance of the full fine-tuning method, while reducing up to two thirds of its overhead. PDF 11 2021
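Partial fine-tuning reduces to freezing the embeddings and bottom encoder layers; a sketch with HuggingFace BERT follows, where the 4-layer trainable top is an assumed split, not necessarily the paper's.

```python
from transformers import BertModel

# A sketch of partial fine-tuning: freeze the embeddings and the bottom
# layers, leaving only the top layers trainable for each single-task model.
model = BertModel.from_pretrained("bert-base-uncased")
N_TRAINABLE_TOP = 4  # assumed split
n_layers = model.config.num_hidden_layers

for param in model.embeddings.parameters():
    param.requires_grad = False
for i, layer in enumerate(model.encoder.layer):
    trainable = i >= n_layers - N_TRAINABLE_TOP
    for param in layer.parameters():
        param.requires_grad = trainable

n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {n_trainable:,}")
# The frozen bottom layers are bit-identical across tasks, so the merged
# multi-task model can share them and keep only small task-specific tops.
```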
Predictive text for agglutinative and polysynthetic languages This paper presents a set of experiments in the area of morphological modelling and prediction. We examine the tasks of segmentation and predictive text entry for two under-resourced and indigenous languages, K'iche' and Chukchi. We use different segmentation methods to build datasets for language modelling and then train models of different types: single-way segmented models, trained using data from one segmentor; two-way segmented models, trained using concatenated data from two segmentors; and finetuned models, trained on two datasets from different segmentors. We measure word- and character-level perplexities of the language models and find that single-way segmented models trained on morphologically segmented data and finetuned models work best. Finally, we test the language models on the task of predictive text entry using gold standard data and measure the average number of clicks per character and the keystroke savings rate. We find that the models trained on morphologically segmented data work better, although with substantial room for improvement. Lastly, we propose using morphological segmentation to improve the end-user experience of predictive text, and we plan to test this assumption by training other models and experimenting on more languages. PDF 11 2021
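The keystroke savings rate (KSR) metric can be simulated with a small sketch: type characters one by one and accept a suggestion whenever the target word appears in the top-n list. The completer, the toy lexicon, and the click-accounting conventions below are illustrative assumptions.

```python
def keystroke_savings(text, predict, n_suggestions=3):
    """Simulate predictive entry: after each typed character, accept a
    suggestion (one 'click') if the target word is in the top-n list.
    `predict(prefix)` -> candidate words; a stand-in for a language-model
    completer."""
    keystrokes, baseline = 0, 0
    for word in text.split():
        baseline += len(word) + 1            # typing it out, plus a space
        typed = 0
        while typed < len(word):
            if word in predict(word[:typed])[:n_suggestions]:
                typed = len(word)
                keystrokes += 1              # one click to accept
            else:
                typed += 1
                keystrokes += 1              # one keypress
        keystrokes += 1                      # space / confirm
    return 1.0 - keystrokes / baseline       # keystroke savings rate

toy_lexicon = ["kinaq", "kinb'ij", "kinwar"]  # invented K'iche'-like forms
ksr = keystroke_savings(
    "kinwar kinb'ij",
    lambda prefix: [w for w in toy_lexicon if w.startswith(prefix)])
print(f"KSR: {ksr:.2f}")
```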
Graph-based Fine-grained Multimodal Attention Mechanism for Sentiment Analysis Multimodal sentiment analysis is a popular research area in natural language processing. Mainstream multimodal learning models barely consider that visual and acoustic behaviors often have a much higher temporal frequency than words. Therefore, these models lack the representation capability to accurately model multimodal interactions. In this paper, we propose an attachment called the Graph-based Fine-grained Multimodal Attention Mechanism (GFMAM), which can utilize multimodal information from different subspaces to achieve accurate multimodal interactions. First, the attachment splits the information of every modality into multiple subspaces. Then, the fine-grained multimodal information from different subspaces is converted into multimodal interaction graphs dominated by the language modality. The multimodal interaction graph can capture significant interactions among multiple modalities at the subspace level. Finally, the information of the nonverbal modalities is added back to compensate for the loss of continuity caused by the splitting operation. Embedding GFMAM into BERT, we propose a new model called GFMAM-BERT that can directly accept nonverbal modalities in addition to the language modality. We conducted experiments on two publicly available multimodal sentiment analysis datasets, CMU-MOSI and CMU-MOSEI. The experimental results demonstrate that GFMAM-BERT exceeds the state-of-the-art models. Moreover, the proposed model outperforms humans on most metrics on the CMU-MOSI dataset. PDF 11 2021
On Event Detection in Scientific Papers: A Multi-Domain Dataset Given the growing number of scientific papers, automatic information extraction in scientific documents is important for efficient knowledge update and discovery. A key component of scientific papers involves rhetorical activities/events that convey new knowledge and convince readers of its correctness. This work explores a new information extraction problem for scientific documents, aiming to identify event trigger words of rhetorical events/activities, i.e., event detection (ED). To promote future research in this area, we present SciEvent, the first dataset for event detection in scientific documents. SciEvent annotates scientific papers from four different domains (i.e., computer science, biology, physics, and mathematics) using 8 popular event types. Our experiments on SciEvent demonstrate the challenges of scientific ED for existing models and call for further research effort in this area. We will publicly release SciEvent to facilitate future research. PDF 11 2021
Semantic Role Labeling as Dependency Parsing: Exploring Latent Tree Structures Inside Arguments Semantic role labeling (SRL) is a fundamental yet challenging task in the NLP community. Recent works on SRL mainly fall into two lines: 1) BIO-based and 2) span-based. Despite their ubiquity, they share the intrinsic drawback of not explicitly considering internal argument structures, which may potentially hinder the model's expressiveness. To remedy this, we propose to reduce SRL to a dependency parsing task and regard the flat argument spans as latent subtrees. In particular, we equip our formulation with a novel span-constrained TreeCRF to make tree structures span-aware, and further extend it to the second-order case. Experiments on the CoNLL05 and CoNLL12 benchmarks reveal that our methods outperform all previous works and achieve state-of-the-art results. PDF 11 2021
Comprehensive Multi-Modal Interactions for Referring Image Segmentation We investigate Referring Image Segmentation (RIS), which outputs a segmentation map corresponding to the natural language description. Addressing RIS efficiently requires considering the interactions happening across visual and linguistic modalities and the interactions within each modality. Existing methods are limited because they either compute different forms of interactions sequentially (leading to error propagation) or ignore intra-modal interactions. We address this limitation by performing all three interactions simultaneously through a Synchronous Multi-Modal Fusion Module (SFM). Moreover, to produce refined segmentation masks, we propose a novel Hierarchical Cross-Modal Aggregation Module (HCAM), where linguistic features facilitate the exchange of contextual information across the visual hierarchy. We present thorough ablation studies and validate our approach's performance on four benchmark datasets, showing considerable performance gains over the existing state-of-the-art (SOTA) methods. PDF 11 2021
Persian Natural Language Inference: A Meta-learning approach Incorporating information from other languages can improve the results of tasks in low-resource languages. A powerful method of building functional natural language processing systems for low-resource languages is to combine multilingual pre-trained representations with cross-lingual transfer learning. In general, however, shared representations are learned separately, either across tasks or across languages. This paper proposes a meta-learning approach for natural language inference in Persian. The meta-learner alternately uses information from a different task (such as QA in Persian) or from another language (such as natural language inference in English). We also investigate the role of a task augmentation strategy for forming additional high-quality tasks. We evaluate the proposed method using four languages and an auxiliary task. The proposed model consistently outperforms the baseline approach, improving accuracy by roughly six percent. We also examine the effect of finding appropriate initial parameters using zero-shot evaluation and CCA similarity. PDF 11 2021
KETOD: Knowledge-Enriched Task-Oriented Dialogue Existing studies in dialogue system research mostly treat task-oriented dialogue and chit-chat separately. Towards building a human-like assistant that can converse naturally and seamlessly with users, the system needs to be able to conduct both types of conversations effectively. In this work, we investigate how task-oriented dialogue and knowledge-grounded chit-chat can be effectively integrated into a single model. To this end, we create a new dataset, KETOD (Knowledge-Enriched Task-Oriented Dialogue), where we naturally enrich task-oriented dialogue with chit-chats based on relevant entity knowledge. We also propose two new models, SimpleToDPlus and Combiner, for the proposed task. Experimental results on both automatic and human evaluations show that the proposed methods can significantly improve the performance in knowledge-enriched response generation while maintaining a competitive task-oriented dialog performance. We believe our new dataset will be a valuable resource for future studies. We will make the code and the dataset publicly available upon acceptance. PDF 11 2021
A Novel Efficient and Effective Preprocessing Strategy for Text Classification Text classification is an essential task of natural language processing. Preprocessing, which determines the representation of text features, is one of the key steps of a text classification architecture. This paper proposes a novel, efficient and effective preprocessing strategy with three methods for text classification, using the OMP algorithm to complete the classification. The main idea of our new preprocessing strategy is to combine regular filtering and/or stopword removal with tokenization and lowercase conversion, which can effectively reduce the feature dimension and improve the quality of the text feature matrix to some extent. Simulation tests on the 20Newsgroups dataset show that, compared with the existing state-of-the-art method, our best new method reduces the number of features by 19.85\%, 34.35\%, 26.25\%, and 38.67\%, and increases the speed of text classification by 17.38\%, 25.64\%, 23.76\%, and 33.38\%, with similar classification accuracy on religion, computer, science and sport data, respectively. PDF 11 2021
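Assuming that "regular filtering" refers to regular-expression filtering, one combined preprocessing method could be sketched as follows; the stopword list and the pattern are toy stand-ins for the paper's resources.

```python
# Combine regex filtering and stopword removal with tokenization and
# lowercase conversion, in the spirit of the proposed preprocessing strategy.
import re

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}  # toy list

def preprocess(text: str) -> list[str]:
    text = text.lower()                           # lowercase conversion
    text = re.sub(r"[^a-z0-9\s]", " ", text)      # regular(-expression) filtering
    tokens = text.split()                         # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]  # stopword removal

print(preprocess("The OMP algorithm, applied to 20Newsgroups!"))
# ['omp', 'algorithm', 'applied', '20newsgroups']
```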
On Spoken Language Understanding Systems for Low Resourced Languages Spoken dialog systems are slowly becoming an integral part of the human experience due to their various advantages over textual interfaces. Spoken language understanding (SLU) systems are fundamental building blocks of spoken dialog systems. But creating SLU systems for low-resourced languages is still a challenge. In a large number of low-resourced settings, we do not have access to enough data to build automatic speech recognition (ASR) technologies, which are fundamental to any SLU system. Also, ASR-based SLU systems do not generalize to unwritten languages. In this paper, we present a series of experiments to explore an extremely low-resourced setting, something we refer to as a true k-shot setting, where we perform intent classification with systems trained on different values of k. We test our system on English and Flemish and find that even in such granular settings, with no language-specific ASR technology, we can create SLU systems that can be deployed in the real world. PDF 11 2021
Speaking Rationally by Gestures: Information Theoretic Insights from Multi-Modal Language Models The multi-modal nature of human communication can be utilized to enhance the performance of computational language models. However, few studies have explored the non-verbal channels through a finer theoretical lens. We use multi-modal language models trained on monologue video data to study how non-verbal expression contributes to communication, examining two aspects: first, whether incorporating gesture representations can improve the language model's performance (perplexity); and second, whether the gesture channel demonstrates a pattern similar to the entropy rate constancy (ERC) found in verbal language, which is governed by Information Theory. We find positive results supporting both hypotheses. We conclude that speakers indeed use simple gestures to convey information that enhances verbal communication, and that how this information is organized is a rational process. PDF 11 2021
Improving Generalizability in Implicitly Abusive Language Detection with Concept Activation Vectors Robustness of machine learning models on ever-changing real-world data is critical, especially for applications affecting human well-being such as content moderation. New kinds of abusive language continually emerge in online discussions in response to current events (e.g., COVID-19), and the deployed abuse detection systems should be updated regularly to remain accurate. In this paper, we show that general abusive language classifiers tend to be fairly reliable in detecting out-of-domain explicitly abusive utterances but fail to detect new types of more subtle, implicit abuse. Next, we propose an interpretability technique, based on the Testing Concept Activation Vector (TCAV) method from computer vision, to quantify the sensitivity of a trained model to the human-defined concepts of explicit and implicit abusive language, and use that to explain the generalizability of the model on new data, in this case, COVID-related anti-Asian hate speech. Extending this technique, we introduce a novel metric, Degree of Explicitness, for a single instance and show that the new metric is beneficial in suggesting out-of-domain unlabeled examples to effectively enrich the training data with informative, implicitly abusive texts. PDF 11 2021
Overcoming a Theoretical Limitation of Self-Attention Although transformers are remarkably effective for many tasks, there are some surprisingly easy-looking regular languages that they struggle with. Hahn shows that for languages where acceptance depends on a single input symbol, a transformer's classification decisions get closer and closer to random guessing (that is, a cross-entropy of 1) as input strings get longer and longer. We examine this limitation using two languages: PARITY, the language of bit strings with an odd number of 1s, and FIRST, the language of bit strings starting with a 1. We demonstrate three ways of overcoming the limitation implied by Hahn's lemma. First, we settle an open question by constructing a transformer that recognizes PARITY with perfect accuracy, and similarly for FIRST. Second, we use layer normalization to bring the cross-entropy of both models arbitrarily close to zero. Third, when transformers need to focus on a single position, as for FIRST, we find that they can fail to generalize to longer strings; we offer a simple remedy to this problem that also improves length generalization in machine translation. PDF 11 2021
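For concreteness, the two languages studied above can be written as plain membership tests; this is a reference implementation of the tasks themselves, not of the transformer constructions in the paper.

```python
# PARITY and FIRST as membership tests over bit strings.
def in_parity(bits: str) -> bool:
    """PARITY: bit strings containing an odd number of 1s."""
    return bits.count("1") % 2 == 1

def in_first(bits: str) -> bool:
    """FIRST: bit strings whose first symbol is 1."""
    return bits.startswith("1")

assert in_parity("10110") and not in_parity("1001")
assert in_first("100") and not in_first("010")
```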
Improving Abstractive Dialogue Summarization with Speaker-Aware Supervised Contrastive Learning Pre-trained models have brought remarkable success to the text summarization task. For dialogue summarization, a subdomain of text summarization, utterances are concatenated into flat text before being processed. As a result, existing summarization systems based on pre-trained models are unable to recognize the unique format of the speaker-utterance pair in the dialogue. To investigate this issue, we conduct probing tests and manual analysis, and find that the powerful pre-trained model cannot identify different speakers well in the conversation, which leads to various factual errors. Motivated by this, we propose three speaker-aware supervised contrastive learning (SCL) tasks: Token-level SCL, Turn-level SCL, and Global-level SCL. Comprehensive experiments demonstrate that our methods achieve significant performance improvements on two mainstream dialogue summarization datasets. According to detailed human evaluations, pre-trained models equipped with the SCL tasks effectively generate summaries with better factual consistency. PDF 11 2021
Incorporating Multiple Knowledge Sources for Targeted Aspect-based Financial Sentiment Analysis Combining symbolic and subsymbolic methods has become a promising strategy as research tasks in AI grow increasingly complicated and require higher levels of understanding. Targeted Aspect-based Financial Sentiment Analysis (TABFSA) is one such complicated task, as it involves information extraction, specification, and domain adaptation. External knowledge has been proven useful for general-purpose sentiment analysis, but not yet for the finance domain; current state-of-the-art Financial Sentiment Analysis (FSA) models have overlooked the importance of external knowledge. To fill this gap, we propose using attentive CNNs and LSTMs to strategically integrate multiple external knowledge sources into the pre-trained language model fine-tuning process for TABFSA. Experiments on the FiQA Task 1 and SemEval 2017 Task 5 datasets show that the knowledge-enabled models systematically improve upon their plain deep learning counterparts, and some outperform the reported state-of-the-art results in terms of aspect sentiment analysis error. PDF 11 2021
Multi-Stage Prompting for Knowledgeable Dialogue Generation Existing knowledge-grounded dialogue systems typically use finetuned versions of a pretrained language model (LM) and large-scale knowledge bases. These models typically fail to generalize on topics outside of the knowledge base, and require maintaining separate, potentially large checkpoints each time finetuning is needed. In this paper, we aim to address these limitations by leveraging the inherent knowledge stored in the pretrained LM as well as its powerful generation ability. We propose a multi-stage prompting approach to generate knowledgeable responses from a single pretrained LM. We first prompt the LM to generate knowledge based on the dialogue context. Then, we further prompt it to generate responses based on the dialogue context and the previously generated knowledge. Results show that our knowledge generator outperforms the state-of-the-art retrieval-based model by 5.8\% when combining knowledge relevance and correctness. In addition, our multi-stage prompting outperforms the finetuning-based dialogue model in terms of response knowledgeability and engagement by up to 10% and 5%, respectively. Furthermore, we scale our model up to 530 billion parameters and demonstrate that larger LMs improve the generation correctness score by up to 10%, and response relevance, knowledgeability and engagement by up to 10%. PDF 11 2021
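A minimal sketch of the two-stage prompting flow, using a small causal LM from the Hugging Face transformers library as a stand-in for the large pretrained LM; the prompt wording, model choice, and decoding settings are illustrative assumptions.

```python
# Stage 1 prompts the LM for knowledge; stage 2 prompts it again for a
# response conditioned on the dialogue context plus the generated knowledge.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def generate(prompt: str) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    out = lm.generate(ids, max_new_tokens=30, do_sample=False,
                      pad_token_id=tok.eos_token_id)
    return tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

context = "User: Who wrote Hamlet?"
knowledge = generate(f"{context}\nRelevant knowledge:")             # stage 1
response = generate(f"{context}\nKnowledge: {knowledge}\nSystem:")  # stage 2
```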
A Primer in NMTology: What we have understood about NMT Neural Machine Translation (NMT) has been through great revolutions in recent years. Accompanying the improvements in translation quality are works that attempt to understand the working mechanisms of various aspects of the NMT framework. In this paper, we survey those efforts to unveil the \textit{black box} of the standard NMT framework. To begin with, we briefly introduce the three critical components of the holistic NMT framework; next, we deliver a clear \textit{component-centric} categorization and clean summary of these specific works, \textit{guided} by \textit{frequently-asked} questions (FAQs) that aim at addressing the \textit{lack} of understanding; finally, we discuss several limitations, future directions and inspirations. We believe this paper could help the community weave a holistic and clear picture of our current understanding of the standard NMT framework and shed light on its future improvements and developments. Please check the website https://nmtology.github.io/ for a visual guide to the FAQs. PDF 11 2021
Pre-Training with Syntactic Structure Prediction for Chinese Semantic Error Recognition Existing Chinese text error detection mainly focuses on spelling errors and simple grammatical errors. These errors have been studied extensively and are relatively simple for humans. Chinese Semantic Error Recognition (CSER) pays attention to more complex semantic errors that, compared with those targeted by general Chinese text error detection, humans cannot easily recognize. Considering the complex syntactic relations between words, we find that syntactic structure from the syntax tree can help identify semantic errors. In this paper, we adopt pre-trained models to solve the task of CSER. To make the model learn syntactic structure in the pre-training stage, we design a novel pre-training task to predict the syntactic structure between different words from the syntax tree. Due to the lack of a published dataset for CSER, we build the first high-quality dataset for CSER, named the Corpus of Chinese Linguistic Semantic Acceptability (CoCLSA), which is extracted from high school examinations. The experimental results on CoCLSA show that our pre-trained model based on the new pre-training task outperforms existing pre-trained models. PDF 11 2021
ED2LM: Encoder-Decoder to Language Model for Faster Document Re-ranking Inference State-of-the-art neural models typically encode document-query pairs using cross-attention for re-ranking. To this end, models generally utilize an encoder-only (like BERT) paradigm or an encoder-decoder (like T5) approach. These paradigms, however, are not without flaws, i.e., running the model on all query-document pairs at inference-time incurs a significant computational cost. This paper proposes a new training and inference paradigm for re-ranking. We propose to finetune a pretrained encoder-decoder model in the form of document-to-query generation. Subsequently, we show that this encoder-decoder architecture can be decomposed into a decoder-only language model during inference. This results in significant inference-time speedups, since the decoder-only architecture only needs to interpret static encoder embeddings during inference. Our experiments show that this new paradigm achieves results that are comparable to the more expensive cross-attention ranking approaches while being up to 6.8X faster. We believe this work paves the way for more efficient neural rankers that leverage large pretrained models. PDF 11 2021
Impact of Tokenization on Language Models: An Analysis for Turkish Tokenization is an important text preprocessing step that prepares input tokens for language models. WordPiece and BPE are the de-facto methods employed by large language models, such as BERT and GPT. However, the impact of tokenization can be different for agglutinative languages, such as Turkic languages, whose words carry prefixes and suffixes. We compare five tokenization methods, including a morphological-level tokenization that takes the agglutinative language structure into account. We train tokenizers and pre-train mini language models using the RoBERTa pre-training procedure on the Turkish OSCAR corpus. We then fine-tune our models on six downstream tasks. There are two main outcomes: (i) morphological and word-level tokenizers outperform the de-facto tokenizers in particular cases, and (ii) mini models can be competitive with larger state-of-the-art models, such that a 14-times smaller model can recover 94\% of the performance of a larger model. PDF 11 2021
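A minimal sketch of training two of the compared de-facto tokenizers, assuming the Hugging Face tokenizers library; the corpus file, vocabulary size, and special tokens are placeholders rather than the paper's settings.

```python
# Train BPE and WordPiece tokenizers on the same corpus and compare how
# each segments an agglutinative Turkish word.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

corpus_files = ["turkish_oscar.txt"]  # hypothetical corpus file

bpe = Tokenizer(models.BPE(unk_token="[UNK]"))
bpe.pre_tokenizer = pre_tokenizers.Whitespace()
bpe.train(corpus_files,
          trainers.BpeTrainer(vocab_size=32_000, special_tokens=["[UNK]"]))

wp = Tokenizer(models.WordPiece(unk_token="[UNK]"))
wp.pre_tokenizer = pre_tokenizers.Whitespace()
wp.train(corpus_files,
         trainers.WordPieceTrainer(vocab_size=32_000, special_tokens=["[UNK]"]))

for name, tokenizer in [("BPE", bpe), ("WordPiece", wp)]:
    print(name, tokenizer.encode("kitaplarımızdan").tokens)  # "from our books"
```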
Dual-space Hierarchical Learning for Goal-guided Conversational Recommendation Proactively and naturally guiding the dialog from a non-recommendation context~(e.g., chit-chat) to the recommendation scenario is crucial for a Conversational Recommender System~(CRS). Prior studies mainly focus on planning the next dialog goal~(e.g., chatting about a movie star) conditioned on the previous dialog. However, we find the dialog goals can be simultaneously observed at different levels, which can be utilized to improve CRS. In this paper, we propose \textit{\textbf{D}ual-space \textbf{H}ierarchical \textbf{L}earning}~(\textbf{DHL}) to leverage multi-level goal sequences and their hierarchical relationships for conversational recommendation. Specifically, we exploit multi-level goal sequences in both the representation space and the optimization space. In the representation space, we propose hierarchical representation learning, where a cross-attention module derives mutually enhanced multi-level goal representations. In the optimization space, we propose a soft labeling strategy to gradually guide the optimization direction. Experiments on two real-world datasets verify the effectiveness of our approach. PDF 11 2021
Breaking Down Questions for Outside-Knowledge Visual Question Answering There is a recent trend towards Knowledge-Based VQA (KB-VQA) where different aspects of the question require different sources of knowledge including the image's visual content and external knowledge such as commonsense concepts and factual information. To address this issue, we propose a novel approach that passes knowledge from various sources between different pieces of semantic content in the question. Questions are first segmented into several chunks, and each segment is used to generate queries to retrieve knowledge from ConceptNet and Wikipedia. Then, a graph neural network, taking advantage of the question's syntactic structure, integrates the knowledge for different segments to jointly predict the answer. Our experiments on the OK-VQA dataset show that our approach achieves new state-of-the-art results. PDF 11 2021
Visual-Language Navigation Pretraining via Prompt-based Environmental Self-exploration Vision-language navigation (VLN) is a challenging task due to its large search space in the environment. To address this problem, previous works have proposed methods that fine-tune a large model pretrained on large-scale datasets. However, conventional fine-tuning methods require extra human-labeled navigation data and lack self-exploration capabilities in environments, which hinders their generalization to unseen scenes. To improve the ability of fast cross-domain adaptation, we propose Prompt-based Environmental Self-exploration (ProbES), which can self-explore environments by sampling trajectories and automatically generate structured instructions via a large-scale cross-modal pretrained model (CLIP). Our method fully utilizes the knowledge learned from CLIP to build an in-domain dataset by self-exploration without human labeling. Unlike conventional fine-tuning, we introduce prompt tuning to achieve fast adaptation of language embeddings, which substantially improves learning efficiency by leveraging prior knowledge. By automatically synthesizing trajectory-instruction pairs in any environment without human supervision, together with instruction prompt tuning, our model can adapt to diverse vision-language navigation tasks, including VLN and REVERIE. Both qualitative and quantitative results show that ProbES significantly improves the generalization ability of the navigation model. PDF 11 2021
Learning to Ignore Adversarial Attacks Despite the strong performance of current NLP models, they can be brittle against adversarial attacks. To enable effective learning against adversarial inputs, we introduce the use of rationale models that can explicitly learn to ignore attack tokens. We find that the rationale models can ignore over 90\% of attack tokens. This approach leads to consistent sizable improvements ($\sim$8\%) over baseline models in robustness, for both BERT and RoBERTa, on MultiRC and FEVER, and also reliably outperforms data augmentation with adversarial examples alone. In many cases, we find that our method is able to close the gap between model performance on a clean test set and an attacked test set, eliminating the effect of adversarial attacks. PDF 11 2021
Do Current Natural Language Inference Models Truly Understand Sentences? Insights from Simple Sentences Natural language inference (NLI) is a task to infer the relationship between a premise and a hypothesis (e.g. entailment, neutral, or contradiction), and transformer-based models perform well on current NLI datasets such as MNLI and SNLI. Nevertheless, given the complexity of the task, especially the complexity of the sentences used for model evaluations, it remains controversial whether these models can truly infer the meaning of sentences or simply guess the answer via non-humanlike heuristics. Here, we reduce the complexity of the task using two approaches. The first approach simplifies the relationship between the premise and hypothesis by making them unrelated. A test set, referred to as Random Pair, is constructed by randomly pairing premises and hypotheses in MNLI/SNLI. Models fine-tuned on MNLI/SNLI identify a large proportion (up to 77.6%) of these unrelated statements as being contradictory. Models fine-tuned on SICK, a dataset that includes unrelated premise-hypothesis pairs, perform well on Random Pair. The second approach simplifies the task by constraining the premises/hypotheses to be syntactically/semantically simple sentences. A new test set, referred to as Simple Pair, is constructed using simple sentences, such as short SVO sentences and basic conjunction sentences. We find that models fine-tuned on MNLI/SNLI generally fail to understand these simple sentences, but their performance can be boosted by re-fine-tuning the models using only a few hundred samples from SICK. All models tested here, however, fail to understand the fundamental compositional binding relation between a subject and a predicate (up to ~100% error rate) for basic conjunction sentences. Taken together, the results show that models achieving high accuracy on mainstream datasets can still lack basic sentence comprehension capacity, and that datasets discouraging non-humanlike heuristics are required to build more robust NLI models. PDF 11 2021
When Does Translation Require Context? A Data-driven, Multilingual Exploration Although proper handling of discourse phenomena significantly contributes to the quality of machine translation (MT), improvements on these phenomena are not adequately measured in common translation quality metrics. Recent works in context-aware MT attempt to target a small set of these phenomena during evaluation. In this paper, we propose a methodology to identify translations that require context systematically, and use this methodology to both confirm the difficulty of previously studied phenomena as well as uncover new ones that have not been addressed in previous work. We then develop the \textbf{Mu}ltilingual \textbf{D}iscourse-\textbf{A}ware (MuDA) benchmark, a series of taggers for these phenomena in 14 different language pairs, which we use to evaluate context-aware MT. We find that state-of-the-art context-aware MT models make marginal improvements over context-agnostic models, which suggests current models do not handle these ambiguities effectively. We will release code and data to invite the MT research community to increase efforts on context-aware translation on discourse phenomena and languages that are currently overlooked. PDF 11 2021
Tracking Satisfaction States for Customer Satisfaction Prediction in E-commerce Service Chatbots Due to the increasing use of service chatbots in E-commerce platforms in recent years, customer satisfaction prediction (CSP) is gaining more and more attention. CSP is dedicated to evaluating subjective customer satisfaction in conversational service and thus helps improve customer service experience. However, previous methods focus on modeling customer-chatbot interaction at different single turns, neglecting the important dynamic satisfaction states throughout the customer journey. In this work, we investigate the problem of satisfaction states tracking and its effects on CSP in E-commerce service chatbots. To this end, we propose a dialogue-level classification model named DialogueCSP to track satisfaction states for CSP. In particular, we explore a novel two-step interaction module to represent the dynamic satisfaction states at each turn. In order to capture dialogue-level satisfaction states for CSP, we further introduce dialogue-aware attentions to integrate historical informative cues into the interaction module. To evaluate the proposed approach, we also build a Chinese E-commerce dataset for CSP. Experiment results demonstrate that our model significantly outperforms multiple baselines, illustrating the benefits of satisfaction states tracking on CSP. PDF 11 2021
Better Language Model with Hypernym Class Prediction Class-based language models (LMs) have long been devised to address context sparsity in $n$-gram LMs. In this study, we revisit this approach in the context of neural LMs. We hypothesize that class-based prediction leads to an implicit context aggregation for similar words and thus can improve generalization for rare words. We map words that have a common WordNet hypernym to the same class and train large neural LMs by gradually annealing from class prediction to token prediction during training. Empirically, this curriculum-learning strategy consistently improves perplexity over various large, highly performant state-of-the-art Transformer-based models on two datasets, WikiText-103 and Arxiv. Our analysis shows that the performance improvement is achieved without sacrificing performance on rare words. Finally, we document other attempts that failed to yield empirical gains, and discuss future directions for the adoption of class-based LMs on a larger scale. PDF 11 2021
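A minimal sketch of the word-to-class mapping, assuming NLTK with the WordNet data installed; picking the first noun synset and its first hypernym is a simplification of whatever disambiguation the paper uses.

```python
# Map a word to the name of a WordNet hypernym, which serves as its class id.
from nltk.corpus import wordnet as wn

def hypernym_class(word: str) -> str:
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return word  # no synset: fall back to the token itself
    hypernyms = synsets[0].hypernyms()
    return hypernyms[0].name() if hypernyms else synsets[0].name()

for w in ["dog", "cat", "sparrow"]:
    print(w, "->", hypernym_class(w))
# e.g. dog -> canine.n.02, so rare and frequent canines share one class target
```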
Topic-Aware Response Generation in Task-Oriented Dialogue with Unstructured Knowledge Access To alleviate the problem of structured databases' limited coverage, recent task-oriented dialogue systems incorporate external unstructured knowledge to guide the generation of system responses. However, these usually use word or sentence level similarities to detect the relevant knowledge context, which only partially captures the topical level relevance. In this paper, we examine how to better integrate topical information in knowledge grounded task-oriented dialogue and propose ``Topic-Aware Response Generation'' (TARG), an end-to-end response generation model. TARG incorporates multiple topic-aware attention mechanisms to derive the importance weighting scheme over dialogue utterances and external knowledge sources towards a better understanding of the dialogue history. Experimental results indicate that TARG achieves state-of-the-art performance in knowledge selection and response generation, outperforming previous state-of-the-art by 3.2, 3.6, and 4.2 points in EM, F1 and BLEU-4 respectively on Doc2Dial, and performing comparably with previous work on DSTC9; both being knowledge-grounded task-oriented dialogue datasets.\footnote{Code will be made public on the paper's acceptance.} PDF 11 2021
On the Origin of Hallucinations in Conversational Models: Is it the Datasets or the Models? Although knowledge-grounded conversational models are able to generate fluent responses that are indistinguishable from human-generated ones, they are known to suffer from producing factually invalid statements, a phenomenon commonly called hallucination. In this work, we investigate the underlying causes of this phenomenon: is hallucination due to the training data, or to the models? We conduct a comprehensive human study on both existing knowledge-grounded dialogue datasets and several state-of-the-art models. Our study reveals that the standard benchmarks consist of more than 60% hallucinated responses, leading to models that not only hallucinate but even amplify hallucinations. Moreover, we qualitatively analyze the nature of hallucinations, and identify key response strategies used by humans and models that lead to hallucinations. We hope these insights will show the way forward towards building hallucination-free conversational models. PDF 11 2021
Enhancing Robustness of Pre-trained Language Model with Lexical Simplification For both human readers and pre-trained language models (PrLMs), lexical diversity may lead to confusion and inaccuracy when understanding the underlying semantic meanings of given sentences. By substituting complex words with simple alternatives, lexical simplification (LS) is a recognized method to reduce such lexical diversity. In this paper, we leverage a novel improved LS approach which can enhance robustness of PrLMs, resulting in improved performances in downstream tasks. A rule-based simplification process is applied to a given sentence. PrLMs are encouraged to predict the real label of the given sentence with auxiliary inputs from the simplified version. Using strong PrLMs (BERT and ELECTRA) as baselines, our approach can still further improve the performance in various text classification tasks. PDF 11 2021
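A minimal sketch of a rule-based simplification step that produces the auxiliary input; the substitution table is a toy stand-in for the paper's simplification resources.

```python
# Substitute complex words with simpler alternatives to build the auxiliary
# input that accompanies the original sentence during prediction.
SIMPLE_ALTERNATIVES = {
    "utilize": "use",
    "commence": "start",
    "terminate": "end",
    "approximately": "about",
}

def simplify(sentence: str) -> str:
    tokens = sentence.split()
    return " ".join(SIMPLE_ALTERNATIVES.get(t.lower(), t) for t in tokens)

original = "we utilize the model and terminate training approximately here"
print(simplify(original))  # "we use the model and end training about here"
# The PrLM is then asked to predict the label of `original` with the
# simplified version as an auxiliary input.
```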
Coreference Resolution as Span Boundary Alignment We propose a new fast and accurate solution for coreference resolution, in which the task is formulated as a span boundary alignment problem. In this solution, a mention is linked to another one via two edges modeling how likely two linked mentions point to the same entity. Specifically, for each mention, its head word (left boundary) needs to be well aligned with the head words of all other mentions that refer to the same entity, so does its tail word (right boundary). Such a ``head-to-head'' and ``tail-to-tail'' alignment strategy greatly reduces the computational complexity of coreference decisions on any pair of mentions, mitigates the error propagation problem caused by mention pruning, and encourages the sharing of features across all mentions that refer to the same entity. Experimental results show that our solution achieves close to state-of-the-art performance on the CoNLL-2012 and GAP benchmarks with much less computational cost. PDF 11 2021
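A minimal sketch of scoring mention pairs by their boundary alignment; the bilinear scorers are an illustrative choice, and the head/tail boundary representations are assumed to come from an encoder.

```python
# Score every mention pair by how well its left boundaries ("head-to-head")
# and right boundaries ("tail-to-tail") align.
import torch
import torch.nn as nn

class BoundaryAligner(nn.Module):
    def __init__(self, hidden: int):
        super().__init__()
        self.head_bilinear = nn.Bilinear(hidden, hidden, 1)
        self.tail_bilinear = nn.Bilinear(hidden, hidden, 1)

    def forward(self, heads: torch.Tensor, tails: torch.Tensor) -> torch.Tensor:
        n = heads.shape[0]
        idx_i = torch.arange(n).repeat_interleave(n)  # left mention of each pair
        idx_j = torch.arange(n).repeat(n)             # right mention of each pair
        # A pair is coreferent when both of its boundary edges align.
        head_scores = self.head_bilinear(heads[idx_i], heads[idx_j])
        tail_scores = self.tail_bilinear(tails[idx_i], tails[idx_j])
        return (head_scores + tail_scores).view(n, n)

aligner = BoundaryAligner(hidden=64)
scores = aligner(torch.randn(5, 64), torch.randn(5, 64))  # 5 mentions -> 5x5 scores
```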
LogInsights: Understanding and Extracting Information from Logs for Fault Classification at run-time Software monitoring is the most critical part of any software management life cycle. One way to assess the health of a program is to monitor its logs efficiently. In this paper, we describe a method to process a stream of logs and identify any fault mentioned in a log at runtime. First, we extract meaningful features for detecting the erroneous entries in the stream of logs. Next, we categorize the erroneous logs into pre-defined categories of commonly occurring faults using the proposed two-step framework. We propose efficient, fast and intelligent rule-based systems in which domain knowledge is incorporated through a word embedding model. We have built a domain-specific corpus and trained a word embedding model for this purpose. The methods described here have shown improved results in the existing product pipeline. Experiments on logs obtained from various applications also show the efficacy of our proposed method. PDF 11 2021
So Different Yet So Alike! Constrained Unsupervised Text Style Transfer Automatic transfer of text between domains has become popular in recent times. One of its aims is to preserve the semantic content while adapting to the target domain. However, it does not explicitly maintain other attributes between the source and translated text: e.g., text length and descriptiveness. Maintaining constraints in transfer has several downstream applications, including data augmentation and debiasing. We introduce a method for such constrained unsupervised text style transfer by introducing two complementary losses to the generative adversarial network (GAN) family of models. Unlike the competing losses used in GANs, we introduce cooperative losses where the discriminator and the generator cooperate and reduce the same loss. The first is a contrastive loss and the second is a classification loss --- aiming to regularize the latent space further and bring similar sentences closer together. We demonstrate that such training retains lexical, syntactic and domain-specific constraints between domains for multiple benchmark datasets, including ones where more than one attribute change. We show that the complementary cooperative losses improve text quality, according to both automated and human evaluation measures. PDF 11 2021
Focus on the Target’s Vocabulary: Masked Label Smoothing for Machine Translation Label smoothing and vocabulary sharing are two widely used techniques in neural machine translation models. However, we argue that jointly adopting these two techniques can be conflicting and even lead to sub-optimal performance, since the soft label produced by label smoothing still assigns probability to source-side words that would never appear on the target side. To address this issue, we propose Masked Label Smoothing (MLS), a new mechanism that masks the soft label probability of source-side words to zero. Simple yet effective, MLS manages to better integrate label smoothing with vocabulary sharing and hence improves the quality of the translation. Our extensive experiments show that MLS consistently yields improvement over original label smoothing on different datasets, including bilingual and multilingual translation, in both BLEU and calibration scores. PDF 11 2021
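A minimal sketch of the masking idea, assuming a shared vocabulary and a boolean mask marking target-side tokens; details such as exact renormalization may differ from the paper's formulation.

```python
# Build a smoothed target distribution whose smoothing mass is spread only
# over target-side tokens; source-only tokens receive zero probability.
import torch
import torch.nn.functional as F

def masked_label_smoothing(gold: torch.Tensor, target_mask: torch.Tensor,
                           eps: float = 0.1) -> torch.Tensor:
    """gold: (batch,) gold token ids; target_mask: (vocab,) True for target tokens."""
    vocab = target_mask.shape[0]
    smooth = target_mask.float() * (eps / target_mask.sum())  # mass on target side
    dist = smooth.unsqueeze(0).repeat(gold.shape[0], 1)
    dist.scatter_(1, gold.unsqueeze(1), 0.0)                  # clear the gold slot
    dist += F.one_hot(gold, vocab) * (1 - eps)                # gold keeps 1 - eps
    return dist  # rows sum to (almost) 1, with zero mass on source-only tokens
```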
Progressive Down-Sampling for Acoustic Encoding In acoustic encoding, the fine-grained frame-level features are not suited for capturing global dependencies. But condensing them into a semantically complete representation by stacked down-sampling does not work well. We find that the condensation leads to the degraded correlation of the representations in adjacent positions, which poses the risk of information loss in the stacked method. In this work, we propose a new method, progressive down-sampling (PDS), for encoding the context sufficiently before each condensation. Also, we develop a representation fusion method to alleviate information loss by combining the multi-scale representations. Experimental results on the 960h LibriSpeech automatic speech recognition task show that, for a strong Conformer-based system, our method down-samples the input speech features to 1/32 of the initial length, while yielding an improvement of 0.47 WER with a speedup of 1.42$\times$. It also achieves the state-of-the-art BLEU score (25.8) on the MuST-C En-De speech translation benchmark with no additional training data. PDF 11 2021
Prompt-Learning for Fine-Grained Entity Typing As an effective approach to tune pre-trained language models (PLMs) for specific tasks, prompt-learning has recently attracted much attention from researchers. By using cloze-style language prompts to stimulate the versatile knowledge of PLMs, prompt-learning can achieve promising results on a series of NLP tasks, such as natural language inference, sentiment classification, and knowledge probing. In this work, we investigate the application of prompt-learning on fine-grained entity typing in fully supervised, few-shot, and zero-shot scenarios. We first develop a simple and effective prompt-learning pipeline by constructing entity-oriented verbalizer and templates and conducting masked language modeling. Further, to tackle the zero-shot regime, we propose a self-supervised strategy that carries out distribution-level optimization in prompt-learning to automatically summarize the information of entity types. Extensive experiments on three fine-grained entity typing benchmarks (with up to 86 classes) under fully supervised, few-shot and zero-shot settings show that prompt-learning methods significantly outperform fine-tuning baselines, especially when the training data is insufficient. PDF 11 2021
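A minimal sketch of the cloze-style pipeline with an entity-oriented template and verbalizer, assuming the Hugging Face transformers library; the template, verbalizer, and model are illustrative.

```python
# Score entity types by the masked-LM logits of their verbalizer words at the
# [MASK] position of an entity-oriented cloze template.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

VERBALIZER = {"person": "person", "location": "place", "organization": "company"}

sentence, entity = "Steve Jobs founded Apple in 1976.", "Steve Jobs"
prompt = f"{sentence} {entity} is a {tok.mask_token}."

inputs = tok(prompt, return_tensors="pt")
mask_pos = (inputs.input_ids == tok.mask_token_id).nonzero()[0, 1]
with torch.no_grad():
    logits = mlm(**inputs).logits[0, mask_pos]

scores = {t: logits[tok.convert_tokens_to_ids(w)].item()
          for t, w in VERBALIZER.items()}
print(max(scores, key=scores.get))  # expected: "person"
```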
Dynamic Schema Graph Fusion Network for Multi-Domain Dialogue State Tracking Dialogue State Tracking (DST) aims to keep track of users' intentions during the course of a conversation. In DST, modelling the relations among domains and slots is still an under-studied problem. Existing approaches that have considered such relations generally fall short in: (1) fusing prior slot-domain membership relations and dialogue-aware dynamic slot relations explicitly, and (2) generalizing to unseen domains. To address these issues, we propose a novel \textbf{D}ynamic \textbf{S}chema \textbf{G}raph \textbf{F}usion \textbf{Net}work (\textbf{DSGFNet}), which generates a dynamic schema graph to explicitly fuse the prior slot-domain membership relations and dialogue-aware dynamic slot relations. It also uses the schemata to facilitate knowledge transfer to new domains. DSGFNet consists of a dialogue utterance encoder, a schema graph encoder, a dialogue-aware schema graph evolving network, and a schema graph enhanced dialogue state decoder. Empirical results on benchmark datasets, including SGD, MultiWOZ2.1, and MultiWOZ2.2, show that DSGFNet outperforms the existing methods. PDF 11 2021
Structured Pruning Learns Compact and Accurate Models The growing size of neural language models has led to increased attention in model compression. Pruning methods start from a large model and gradually remove model weights---they can significantly reduce the model size but hardly achieve impressive runtime efficiency.On the other hand, distillation methods start from a shallower, compact model and can obtain large speedups---however, they are costly to train on large amounts of unlabeled data. In this work, we show that structured pruning can match the distillation counterparts in both latency ($>$10$\times$) and accuracy ($>$92\%) and result in highly compact and efficient subnetworks. Unlike distillation, our task-specific pruning approach, {\ours}, does not need to pre-specify the model architecture nor rely on unlabeled data. Our solution is to jointly prune layers and sub-modules such as heads and hidden units in Transformer models through $l_0$ regularization while ensuring that the resulting model is parallelizable. We also propose a layerwise distillation approach to further guide pruning. Finally, our pruned structures reveal interesting patterns---for example, more than 70\% of feed-forward and 50\% of self-attention layers can be easily pruned, while the first and last 1-2 layers are likely to remain for highly compressed models. PDF 11 2021
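The $l_0$ objective above needs differentiable gates over structures such as heads and hidden units. Below is a minimal sketch of a hard-concrete gate in the style of Louizos et al.; the hyperparameters and penalty weight are common defaults, not necessarily this paper's.

```python
# A hard-concrete gate: a stochastic, differentiable relaxation of a binary
# keep/prune decision, with a closed-form expected-L0 penalty.
import torch
import torch.nn as nn

class HardConcreteGate(nn.Module):
    def __init__(self, beta: float = 2 / 3, gamma: float = -0.1, zeta: float = 1.1):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(1))
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self) -> torch.Tensor:
        u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
        s = torch.sigmoid((u.log() - (1 - u).log() + self.log_alpha) / self.beta)
        return (s * (self.zeta - self.gamma) + self.gamma).clamp(0.0, 1.0)

    def l0_penalty(self) -> torch.Tensor:
        # Probability that the gate is non-zero, in expectation.
        return torch.sigmoid(
            self.log_alpha
            - self.beta * torch.log(torch.tensor(-self.gamma / self.zeta)))

gate = HardConcreteGate()
head_output = gate() * torch.randn(4, 8)            # gate one attention head
loss = head_output.sum() + 0.1 * gate.l0_penalty()  # stand-in task term + sparsity
```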
Can Rationalization Improve Robustness? A growing line of work has investigated the development of neural NLP models that can produce rationales---subsets of input that can explain their model predictions. In this paper, we ask whether such rationale models can also provide robustness to adversarial attacks in addition to their interpretable nature. Since these models need to first generate rationales (``rationalizer'') before making predictions (``predictor''), they have the potential to ignore noise or adversarially added text by simply masking it out of the generated rationale. To this end, we systematically generate various types of `AddText' attacks for both token and sentence-level rationalization tasks, and perform an extensive empirical evaluation of state-of-the-art rationale models across five different tasks. Our experiments reveal that the rationale models show the promise to improve robustness, while they struggle in certain scenarios---when the rationalizer is sensitive to positional bias or lexical choices of attack text. Further, leveraging human rationale as supervision does not always translate to better performance. Our study is a first step towards exploring the interplay between interpretability and robustness in the rationalize-then-predict framework. PDF 11 2021
Mitigating Gender Bias in Machine Translation through Adversarial Learning Machine translation and other NLP systems often contain significant biases regarding sensitive attributes, such as gender or race, that worsen system performance and perpetuate harmful stereotypes. Recent preliminary research suggests that adversarial learning can be used as part of a model-agnostic bias mitigation method that requires no data modifications. However, adapting this strategy for machine translation and other modern NLP domains requires (1) restructuring training objectives in the context of fine-tuning pretrained large language models and (2) developing measures for gender or other protected variables for tasks in which these attributes must be deduced from the data itself. We present an adversarial learning framework that addresses these challenges to mitigate gender bias in seq2seq machine translation. Our framework reduces the disparity in translation quality for sentences with male vs. female entities by 86% for English-German translation and 91% for English-French translation, with minimal effect on translation quality. The results suggest that adversarial learning is a promising technique for mitigating gender bias in machine translation. PDF 11 2021
One-Shot Learning from a Demonstration with Hierarchical Latent Language Humans have the capability, aided by the expressive compositionality of their language, to learn quickly by demonstration. They are able to describe unseen task-performing procedures and generalize their execution to other contexts. In this work, we introduce DescribeWorld, an environment designed to test this sort of generalization skill in grounded agents, where tasks are linguistically and procedurally composed of elementary concepts. The agent observes a single task demonstration in a Minecraft-like grid world, and is then asked to carry out the same task in a new map.To enable such a level of generalization, we propose a neural agent infused with hierarchical latent language—both at the level of task inference and subtask planning. Our agent first generates a textual description of the demonstrated unseen task, then leverages this description to replicate it. Through multiple evaluation scenarios and a suite of generalization tests, we find that agents that perform text-based inference are better equipped for the challenge under a random split of tasks. PDF 11 2021
Combining Paraphrase Pre-trained Model and Controllable Rules for Unsupervised Sentence Simplification Although neural sequence-to-sequence models for sentence simplification have achieved some progress, they still suffer from the data sparsity problem and lack controllability. This paper proposes a two-stage approach for text simplification. First, considering that text simplification is closely related to text summarization and paraphrasing, we fine-tune the pre-trained model on summarization and paraphrase datasets. Further, in order to achieve interpretability and controllability, we design controllable scorers to evaluate the simplified sentence from three aspects: adequacy, fluency and simplicity, which are applied to rank the generated sentences and output the best one. Experiments show that our approach improves the previous best performance of the unsupervised model by a considerable margin of 5.53 points, achieving a new state-of-the-art result. Our method even performs competitively with supervised models in both automatic metrics and human evaluation. PDF 11 2021
EIDER: Evidence-enhanced Document-level Relation Extraction Document-level relation extraction (DocRE) aims to extract the semantic relations among entity pairs in a document. Typical DocRE methods blindly take the full document as input, while a subset of the sentences in a document, referred to as the evidence, is often sufficient for humans to predict the relation of an entity pair. In this paper, we propose an evidence-enhanced DocRE framework called Eider that automatically extracts and leverages evidence. We first train an evidence extraction model together with relation extraction via multi-task learning, which allows the two tasks to benefit from shared representations and improve each other. Experiments show that even if human annotation of evidence is unavailable, using silver evidence labels extracted by heuristic rules still leads to better RE performance. We further design a simple yet effective evidence-enhanced inference process that makes RE predictions on both the extracted evidence and the full document and fuses the predictions through a blending layer. This allows Eider to focus on the important context while still having access to all the information in the document. Extensive experiments show that Eider outperforms state-of-the-art methods on three benchmark datasets, e.g., by 1.37/1.26 Ign F1/F1 on DocRED. PDF 11 2021
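A minimal sketch of the fusion step described above: predictions from the evidence-only pass and the full-document pass are combined through a learned blending weight; the sigmoid-gated form is an illustrative assumption, not necessarily Eider's exact blending layer.

```python
# Blend relation logits from the evidence pass and the full-document pass.
import torch

def blend(evidence_logits: torch.Tensor, doc_logits: torch.Tensor,
          tau: torch.Tensor) -> torch.Tensor:
    alpha = torch.sigmoid(tau)  # learned blending weight in (0, 1)
    return alpha * evidence_logits + (1 - alpha) * doc_logits

tau = torch.nn.Parameter(torch.zeros(1))
fused = blend(torch.randn(2, 97), torch.randn(2, 97), tau)  # 97 DocRED relations
```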
Towards Transparent Interactive Semantic Parsing via Step-by-Step Correction Existing studies on semantic parsing focus on mapping a natural-language utterance to a logical form (LF) in one turn. However, because natural language may contain ambiguity and variability, this is a difficult challenge. In this work, we investigate an interactive semantic parsing framework that explains the predicted LF step by step in natural language and enables the user to make corrections through natural-language feedback for individual steps. We focus on question answering over knowledge bases (KBQA) as an instantiation of our framework, aiming to increase the transparency of the parsing process and help the user trust the final answer. We construct INSPIRED, a crowdsourced dialogue dataset derived from the ComplexWebQuestions dataset. Our experiments show that this framework has the potential to greatly improve overall parse accuracy. Furthermore, we develop a pipeline for dialogue simulation to evaluate our framework w.r.t. a variety of state-of-the-art KBQA models without further crowdsourcing effort. The results demonstrate that our framework promises to be effective across such models. PDF 11 2021
Cross-cultural Emotion Classification: the Effect of Emotional Intensity and Acoustic Features Cross-cultural emotion recognition is attracting increasing research attention; robustness to such differences in emotional expression is important for speech modality emotion recognition. In this work we quantify the accuracy loss when classifying cross-culturally for multiple emotional intensities, and investigate the effect of feature sets, including feature importance. We find that different emotional intensities yield a similar decrease in cross-culture accuracy relative to within-culture, and different acoustic feature sets also yield similar relative cross-culture accuracy. The top 10 important eGeMAPS features for within-cultural and cross-cultural classification share only one common feature, which partially explains differences in accuracy. PDF 11 2021
Achieving Conversational Goals with Unsupervised Post-hoc Knowledge Injection A limitation of current neural dialog models is that they tend to suffer from a lack of specificity and informativeness in generated responses, primarily due to dependence on training data that covers a limited variety of scenarios and conveys limited knowledge. One way to alleviate this issue is to extract relevant knowledge from external sources at decoding time and incorporate it into the dialog response. In this paper, we propose a post-hoc knowledge-injection technique where we first retrieve a diverse set of relevant knowledge snippets conditioned on both the dialog history and an initial response from an existing dialog model. We construct multiple candidate responses, individually injecting each retrieved snippet into the initial response using a gradient-based decoding method, and then select the final response with an unsupervised ranking step. Our experiments in goal-oriented and knowledge-grounded dialog settings demonstrate that human annotators judge the outputs from the proposed method to be more engaging and informative compared to responses from prior dialog systems. We further show that knowledge-augmentation promotes success in achieving conversational goals in both experimental settings. PDF 11 2021
Towards Building Automatic Medical Consultation System: Framework, Task and Dataset In this paper, we propose two frameworks to support automatic medical consultation, namely doctor-patient dialogue understanding and diagnosis-oriented interaction. A new medical dialogue dataset with multi-level fine-grained annotations is introduced and five evaluation tasks are established, including medical named entity recognition, dialogue act classification, symptom recognition, medical report generation and diagnosis-oriented dialogue system. We report a set of benchmark results for each track, which shows the usability of the dataset and sets a baseline for future studies. PDF 11 2021
Fair comparison of knowledge graphs for question answering Knowledge graphs are commonly used as sources of information in question answering. Models often combine pre-trained text encoders with a graph encoder to use this information to increase accuracy. However, the way that these two types of model interact is not clear. Here we show that, when provided with graph information for a random question, two recent models exhibit no significant change in performance. These models cannot therefore be used to obtain graph-structured explanations, or to compare the relevance of a particular knowledge graph to a dataset. We perform two model ablations and show that the resulting model is more responsive to variation in graph input, and so can be used for gathering explanations and measuring KG-dataset fit. We also show that uncontrollable nondeterminism can cause significant changes in results, and highlight the importance of statistical testing of these models. PDF 11 2021
Inverse is Better! Fast and Accurate Prompt for Slot Tagging Prompting methods have recently achieved impressive success in few-shot learning. These methods embed input samples with prompt sentence pieces and decode label-related tokens to map samples to labels. However, such a paradigm is very inefficient for the task of slot tagging: because slot tagging samples are multiple consecutive words in a sentence, prompting methods have to enumerate all n-gram token spans to find all the possible slots, which greatly slows down prediction. To tackle this, we introduce an inverse paradigm for prompting. Unlike classic prompts, which map tokens to labels, we inversely predict slot values given slot types. Such inverse prompting only requires a one-turn prediction for each slot type and greatly speeds up prediction. Besides, we propose a novel Iterative Prediction Strategy, from which the model learns to refine predictions by considering the relations between different slot types. We find, somewhat surprisingly, that the proposed method not only predicts faster but also significantly improves effectiveness (an improvement of over 6.1 F1 points in the 10-shot setting), achieving new state-of-the-art performance. PDF 11 2021
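To see why the inverse paradigm is faster, compare how many prompts each paradigm must score for a single utterance; the templates below are illustrative, not the paper's exact ones.

```python
# Classic prompting enumerates a prompt per n-gram span; inverse prompting
# needs only one prompt per slot type.
def classic_prompts(tokens: list[str], max_span: int = 3) -> list[str]:
    prompts = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 1 + max_span, len(tokens) + 1)):
            span = " ".join(tokens[i:j])
            prompts.append(f'"{span}" is a [MASK] entity')
    return prompts

def inverse_prompts(slot_types: list[str]) -> list[str]:
    return [f'the {t} is "[MASK]"' for t in slot_types]

tokens = "book a flight from boston to denver".split()
print(len(classic_prompts(tokens)))                     # 18 prompts to score
print(len(inverse_prompts(["origin", "destination"])))  # 2 prompts total
```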
Learning to learn STEM courses We curate a new dataset of course questions from MIT EECS (Course 6), Physics (Course 8), Economics (Course 14), Mathematics (Course 18), Harvard Statistics, and Columbia Computer Science, transform the questions into programming tasks using OpenAI Codex, and solve them by executing programs. We curate, transform, and solve ten courses: (i) MIT EECS 6.003 Signal Processing, (ii) MIT EECS 6.036 Introduction to Machine Learning, (iii) MIT EECS 6.042 Mathematics for Computer Science, (iv) MIT Physics 8.282 Introduction to Astronomy, (v) MIT Economics 14.01 Principles of Microeconomics, (vi) MIT Mathematics 18.05 Introduction to Probability and Statistics, (vii) MIT Mathematics 18.06 Linear Algebra, (viii) MIT Mathematics 18.781 Theory of Numbers, (ix) Harvard Statistics STATS110 Probability, and (x) Columbia University COMS3251 Computational Linear Algebra. Our approach works surprisingly well because question solutions and programs share an underlying tree representation. We are able to use Codex to correctly solve all questions by specifying both question and programming contexts, such as which mathematical rules to use or which programming packages to load. In addition to solving the problems, the generated code produces plots that are useful for understanding the solutions. We interactively transform the original course questions until they are solved correctly and measure the similarity between the original and transformed questions. Finally, we automatically generate novel questions for each course, providing a way to rapidly synthesize new course content. Our approach is the first scalable solution towards automatically learning to learn all university STEM courses by machine. PDF 11 2021
Diversifying Neural Dialogue Generation via Negative Distillation Generative dialogue models suffer from serious generic response problems, limiting their applications to a few toy scenarios. Recently, an interesting approach, namely negative training, has been proposed to alleviate this problem by reminding the model not to generate high-frequency responses during training. However, its performance is hindered by two issues: ignoring low-frequency but generic responses, and introducing low-frequency but meaningless responses. In this paper, we propose a novel negative training paradigm, called negative distillation, to keep the model away from undesirable generic responses while avoiding the above problems. First, we introduce a negative teacher model that can produce query-wise generic responses, and then the student model is required to maximize its distance from this multi-level negative knowledge. Empirical results show that our method outperforms previous negative training methods significantly. PDF 11 2021
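A minimal sketch of what a negative-distillation-style objective could look like: the student keeps its usual likelihood term but is pushed away from the negative teacher's query-wise distribution; the hinged KL form and the margin are illustrative assumptions, not the paper's exact loss.

```python
# Combine the usual NLL with a hinge that rewards divergence from the
# negative teacher's distribution over next tokens.
import torch
import torch.nn.functional as F

def negative_distillation_loss(student_logits: torch.Tensor,
                               teacher_logits: torch.Tensor,
                               gold: torch.Tensor,
                               margin: float = 5.0) -> torch.Tensor:
    nll = F.cross_entropy(student_logits, gold)  # standard likelihood term
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1), reduction="batchmean")
    # Encourage the student to stay at least `margin` away from the teacher.
    return nll + torch.clamp(margin - kl, min=0.0)

loss = negative_distillation_loss(torch.randn(8, 100), torch.randn(8, 100),
                                  torch.randint(0, 100, (8,)))
```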
Mukayese: Turkish NLP Strikes Back Having sufficient resources for a language X lifts it from the $\textit{under-resourced}$ languages class, but does not necessarily lift it from the $\textit{under-researched}$ class. In this paper, we address the problem of the absence of organized benchmarks in the Turkish language. We demonstrate that languages such as Turkish lag behind the state of the art in NLP applications. As a solution, we present Mukayese, a set of NLP benchmarks for the Turkish language covering several NLP tasks. For each benchmark, we work on one or more datasets and present two or more baselines. Moreover, we present four new benchmarking datasets in Turkish for language modeling, sentence segmentation, and spellchecking and correction. PDF 11 2021
Towards Asking Clarification Questions in Task-Oriented Dialogue Task-oriented dialogues aim at providing users with task-specific services. To provide satisfactory services, two major challenges exist: 1) users are not able to fully describe their complex needs due to a lack of task knowledge; and 2) systems need to personalize the service to their users, since different users have different profiles and preferences. To address these challenges, systems need to be able to ask questions so as to clarify the user's profile and needs. However, existing task-oriented dialogue systems ignore this aspect. In this paper, we formulate the problem of asking clarification questions in task-oriented dialogue systems. To this end, we propose a dialogue-based user simulator to collect a dataset, called TaskClariQ\footnote{To foster research in this area, the dataset and code will be made public upon the paper's acceptance.}. We further propose a new System Ask paradigm and a Multi-Attention Seq2Seq Network (MAS2S) that implements it. Experimental results on TaskClariQ show that MAS2S outperforms competitive baselines. PDF 11 2021
MINER: Improving Out-of-Vocabulary Named Entity Recognition from an Information Theoretic Perspective NER models have achieved promising performance on standard NER benchmarks. However, recent studies show that previous approaches may over-rely on entity mention information, resulting in poor performance on out-of-vocabulary (OOV) entity recognition. In this work, we propose MINER, a novel NER learning framework, to remedy this issue from an information-theoretic perspective. The proposed approach contains two mutual-information-based training objectives: i) generalizing information maximization, which enhances the representation via a deep understanding of context and entity surface forms; ii) superfluous information minimization, which discourages the representation from rote memorization of entity names or exploiting biased cues in the data. Experiments on various settings and datasets demonstrate that it achieves better performance in predicting OOV entities. PDF 11 2021
Hierarchical Inductive Transfer for Continual Dialogue Learning Pre-trained models have achieved excellent performance on dialogue tasks. However, as online chit-chat scenarios continually increase, directly fine-tuning these models for each new task not only explodes the capacity of the dialogue system on embedded devices but also causes knowledge forgetting in pre-trained models and knowledge interference between diverse dialogue tasks. In this work, we propose a hierarchical inductive transfer framework to learn and deploy dialogue skills continually and efficiently. First, we introduce the adapter module into pre-trained models for learning new dialogue tasks. As the only trainable module, it allows the dialogue system on embedded devices to acquire new dialogue skills with negligible additional parameters. Then, to alleviate knowledge interference between tasks while still benefiting from the regularization between them, we further design hierarchical inductive transfer, which enables new tasks to use general knowledge in the base adapter without being misled by diverse knowledge in task-specific adapters. Empirical evaluation and analysis indicate that our framework obtains comparable performance under deployment-friendly model capacity. PDF 11 2021
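For readers unfamiliar with adapters, the sketch below shows the standard bottleneck-adapter layout the abstract alludes to; the hidden and bottleneck sizes are illustrative assumptions, not the paper's values. With the backbone frozen, only this small module would be trained per new dialogue task.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Standard bottleneck adapter inserted into a frozen pre-trained model."""
    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)  # project down
        self.up = nn.Linear(bottleneck, hidden_size)    # project back up
        self.act = nn.GELU()

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(self.act(self.down(h)))      # residual connection

h = torch.randn(2, 16, 768)   # (batch, sequence, hidden)
print(Adapter()(h).shape)     # torch.Size([2, 16, 768]); ~100k new parameters
```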
Modeling Hierarchical Syntax Structure with Triplet Position for Source Code Summarization Automatic code summarization, which aims to describe source code in natural language, has become an essential task in software maintenance. Researchers have attempted to achieve this purpose through various machine learning-based approaches. One key challenge keeping these approaches from being practical is their failure to retain the semantic structure of source code, which has unfortunately been overlooked by the state of the art. Existing approaches resort to representing the syntax structure of code by modeling Abstract Syntax Trees (ASTs). However, the hierarchical structures of ASTs have not been well explored. In this paper, we propose CODESCRIBE to model the hierarchical syntax structure of code by introducing a novel triplet position for code summarization. Specifically, CODESCRIBE leverages a graph neural network and a Transformer to preserve the structural and sequential information of code, respectively. In addition, we propose a pointer-generator network that attends to both the structure and the sequential tokens of code for better summary generation. Experiments on two real-world datasets in Java and Python demonstrate the effectiveness of our proposed approach compared with several state-of-the-art baselines. PDF 11 2021
Lexical Gender Made Simple: A Scalable Methodology for Gender Detection with Online Lexical Databases The evaluation of gender bias in Natural Language Processing relies on the use of gendered expressions, such as pronouns and words with lexical gender. Up until this point, researchers have manually compiled lists that record lexical gender for individual words. However, manual compilation leads to static information if lists are not periodically updated, and categorization requires value judgements by annotators and researchers. Moreover, words that are not covered by the list fall outside the range of analysis. To address these issues, we devised a dictionary-based method to automatically detect lexical gender that can provide a dynamic, up-to-date analysis with high coverage. Our approach reaches 90% accuracy in determining the lexical gender of words retrieved randomly from a Wikipedia sample, as well as when testing on a manually compiled list that the method aims to replace. PDF 11 2021
Toward More Meaningful Resources for Lower-resourced Languages In this paper, we describe our perspective on how meaningful resources for lower-resourced languages can be developed in connection with the speakers of those languages. We examine two massively multilingual resources in detail. We explore the contents of the names stored in Wikidata for a few lower-resourced languages and find that many of them are not in fact in the languages they claim to be and require non-trivial effort to correct. We discuss quality issues present in WikiAnn and evaluate whether it is a useful supplement to hand annotated data. We then discuss the importance of creating annotation for lower-resourced languages in a thoughtful and ethical way that includes the languages' speakers as part of the development process. We conclude with recommended guidelines for resource development. PDF 11 2021
WLASL-LEX: a Dataset for Recognising Phonological Properties in American Sign Language Signed Language Processing (SLP) concerns the automated processing of signed languages, the main means of communication of Deaf and hearing impaired individuals. SLP features many different tasks, ranging from sign recognition to translation and production of signed speech, but has been overlooked by the NLP community thus far. In this paper, we bring attention to the task of modelling the phonology of sign languages. We leverage existing resources to construct a large-scale dataset of American Sign Language signs annotated with six different phonological properties. We then conduct an extensive empirical study to investigate whether data-driven end-to-end and feature-based approaches can be optimised to automatically recognise these properties. We find that, despite the inherent challenges of the task, graph-based neural networks that operate over skeleton features extracted from raw videos are able to succeed at the task to a varying degree. Most importantly, we show that this performance holds even for signs unobserved during training. PDF 11 2021
Can Pre-trained Models Really Generate Single-Step Textual Entailment? We investigate the task of generating textual entailment (GTE). Different from prior work on recognizing textual entailment, also known as NLI, GTE requires models to have deeper reasoning capabilities: generating an entailment from premises rather than making a prediction on given premises and entailment. We argue that existing adapted datasets are limited and inadequate for training and evaluating human-like reasoning in GTE. In this paper, we propose a new large-scale benchmark, named \mydataset, targeted at learning and evaluating models' capabilities in GTE. \mydataset consists of 15k instances, each containing a pair of premise statements and a human-annotated entailment. It is constructed by first retrieving instances from a knowledge base, and then augmenting each instance with several complementary instances via 7 manually crafted transformations. We demonstrate that even extensively fine-tuned pre-trained models perform poorly on \mydataset. The best generator models can only generate valid textual entailments 59.1\% of the time. Further, to motivate future advances, we provide a detailed analysis showing significant gaps between baselines and human performance. PDF 11 2021
Rare Tokens Degenerate All Tokens: Improving Neural Text Generation via Adaptive Gradient Gating for Rare Token Embeddings Recent studies have determined that the learned token embeddings of large-scale neural language models degenerate to be anisotropic with a narrow-cone shape. This phenomenon, called the representation degeneration problem, increases the overall similarity between token embeddings, which negatively affects the performance of the models. Although existing methods that address the degeneration problem based on observations of the phenomena it triggers improve the performance of text generation, the training dynamics of token embeddings behind the degeneration problem remain unexplored. In this study, we analyze the training dynamics of token embeddings, focusing on rare token embeddings. We demonstrate that a specific part of the gradient for rare token embeddings is the key cause of the degeneration problem for all tokens during the training stage. Based on this analysis, we propose a novel method called adaptive gradient gating (AGG). AGG addresses the degeneration problem by gating the specific part of the gradient for rare token embeddings. Experimental results from language modeling, word similarity, and machine translation tasks quantitatively and qualitatively verify the effectiveness of AGG. PDF 11 2021
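The following is a loose sketch of the general idea of gating the gradient that reaches rare-token embedding rows. The frequency mask, the gate value, and gating the whole gradient (rather than the specific component the paper isolates) are simplifying assumptions for illustration.

```python
import torch
import torch.nn as nn

vocab_size, dim = 1000, 32
emb = nn.Embedding(vocab_size, dim)
token_freq = torch.randint(1, 100, (vocab_size,))   # dummy corpus frequencies
rare = (token_freq < 5).float().unsqueeze(1)        # 1.0 for rare-token rows

def gate_rare_grads(grad: torch.Tensor) -> torch.Tensor:
    # Attenuate the gradient reaching rare-token rows; frequent rows pass through.
    return grad * (1.0 - 0.9 * rare)

emb.weight.register_hook(gate_rare_grads)

ids = torch.randint(0, vocab_size, (4, 8))
emb(ids).pow(2).mean().backward()   # the hook gates emb.weight.grad here
```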
Towards Improving Topic Models with the BERT-based Neural Topic Encoder Neural Topic Models (NTMs) have been popular for mining sets of topics from collections of corpora. Recently, there is an emerging direction of combining NTMs with pre-trained language models such as BERT, which aims to use the contextual information of BERT to help train better NTMs. However, existing works in this direction either use the contextual information of pre-trained language models as the input of NTMs or align the outputs of the two kinds of models. In this paper, we study how to build deeper interactions between NTMs and pre-trained language models and propose a BERT-based neural topic encoder, which deeply integrates with the transformer layers of BERT. Our proposed encoder encodes both the BoW data and the word sequence of a document, which can be complementary to each other for learning a better topic distribution for the document. The proposed encoder is a better alternative to the ones used in existing NTMs. Thanks to the in-depth integration with BERT, extensive experiments show that the proposed model achieves state-of-the-art performance in comparisons with many advanced models. PDF 11 2021
Towards Automated Real-time Evaluation in Text-based Counseling Automated real-time evaluation of counselor-client interaction is important for ensuring quality counseling, but the rules are difficult to articulate. Recent advancements in machine learning methods show the possibility of learning such rules automatically. However, these methods often demand large-scale, high-quality counseling data, which are difficult to collect. To address this issue, we build an online counseling platform, which allows professional psychotherapists to provide free counseling services to those in need. In exchange, we collect the counseling transcripts. Within a year of its operation, we have collected one of the largest sets of counseling-session transcripts (675). To further leverage this valuable data, we label our dataset using both coarse- and fine-grained labels and use a set of pretraining techniques. In the end, we achieve practically useful accuracy in both labeling systems. PDF 11 2021
On the Importance of Data Size in Probing Fine-tuned Models Several studies have investigated the reasons behind the effectiveness of fine-tuning, usually through the lens of probing. However, these studies often neglect the role of the size of the dataset on which the model is fine-tuned. In this paper, we highlight the importance of this factor and its undeniable role in probing performance. We show that the extent of encoded linguistic knowledge depends on the number of fine-tuning samples, specifically the number of iterations for which the model is updated. The analysis also reveals that larger training data mainly affects higher layers, and that the extent of this change depends on the number of fine-tuning iterations rather than on the diversity of the training samples. Finally, we show through a set of experiments that fine-tuning introduces shallow and recoverable changes to the model's representations. PDF 11 2021
Should a Bot be Sarcastic? Understanding User Preferences Towards Sarcasm Generation Previous sarcasm generation research has focused on \emph{how} to generate text that people perceive as sarcastic to create more human-like interactions. In this paper, we argue that we should first turn our attention to the question of \emph{when} sarcasm should be generated, finding that humans consider sarcastic responses inappropriate for many input utterances. Next, we use a theory-driven framework for generating sarcastic responses, which allows us to control the linguistic devices included during generation. For each device, we investigate how much humans associate it with sarcasm, finding that pragmatic insincerity and emotional markers are devices crucial for making sarcasm recognisable. PDF 11 2021
Unified NMT models for the Indian subcontinent transcending script-barriers Highly accurate machine translation systems are very important in societies and countries where multilinguality is very common, and where English often does not suffice. The Indian subcontinent is such a region, with all the Indic languages currently being under-represented in the NLP ecosystem. It is essential to advance the state of the art for such low-resource languages at least by using whatever data is available in open source, which itself is little explored in the Indic ecosystem. In our work, we focus on improving the performance of very-low-resource Indic languages, especially those of countries beyond India. Specifically, we propose how unified models can be built that exploit the data from comparatively resource-rich languages of the same region. We propose strategies to unify different types of unexplored scripts, especially Perso-Arabic scripts and Indic scripts, to build multilingual models for all the Indic languages despite the script barrier. We also study how augmentation techniques like back-translation can be used to build unified models that achieve state-of-the-art results among open-source models, using only openly available raw data. PDF 11 2021
FrugalScore: Learning Cheaper, Lighter and Faster Evaluation Metrics for Automatic Text Generation Fast and reliable evaluation metrics are key to R&D progress. While traditional natural language generation metrics are fast, they are not very reliable. Conversely, new metrics based on large pretrained language models are much more reliable, but require significant computational resources. In this paper, we propose FrugalScore, an approach to learn a fixed, low-cost version of any expensive NLG metric while retaining most of its original performance. Experiments with BERTScore and MoverScore on summarization and translation show that FrugalScore is on par with the original metrics (and sometimes better), while having several orders of magnitude fewer parameters and running several times faster. On average over all learned metrics, tasks, and variants, FrugalScore retains 96.8% of the performance, runs 24 times faster, and has 35 times fewer parameters than the original metrics. We make our trained metrics publicly available, to benefit the entire NLP community and in particular researchers and practitioners with limited resources. PDF 11 2021
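Schematically, the approach is knowledge distillation for a metric: precompute the expensive metric's scores on text pairs, then regress a small model onto them. The sketch below uses a dummy featurizer and an invented teacher score purely to show the shape of the training loop; FrugalScore itself fine-tunes a small pretrained LM rather than the toy encoder here.

```python
import torch
import torch.nn as nn

# Small student that will learn to mimic an expensive metric's scores.
student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

def encode_pair(cand: str, ref: str) -> torch.Tensor:
    # Dummy stand-in for a cheap encoder of (candidate, reference) pairs.
    torch.manual_seed(hash((cand, ref)) % (2 ** 31))
    return torch.randn(128)

pairs = [("the cat sat", "a cat was sitting")]
teacher_scores = torch.tensor([[0.83]])   # e.g., precomputed BERTScore values

for _ in range(200):
    x = torch.stack([encode_pair(c, r) for c, r in pairs])
    loss = nn.functional.mse_loss(student(x), teacher_scores)  # distillation loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```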
Predicate-Argument Based Bi-Encoder for Paraphrase Identification Paraphrase identification involves identifying whether a pair of sentences express the same or similar meanings. While cross-encoders have achieved high performance across several benchmarks, bi-encoders such as SBERT have been widely applied to sentence pair tasks. They exhibit substantially lower computational complexity and are better suited to symmetric tasks. In this work, we adopt a bi-encoder approach to the paraphrase identification task and investigate the impact of explicitly incorporating predicate-argument information into SBERT through weighted aggregation. Experiments on six paraphrase identification datasets demonstrate that, with a minimal increase in parameters, the proposed model is able to outperform SBERT/SRoBERTa significantly. Further, ablation studies reveal that the predicate-argument based component plays a significant role in the performance gain. PDF 11 2021
Repo4QA: Answering Complex Coding Questions via Dense Retrieval on GitHub Repositories Open-source platforms such as GitHub and Stack Overflow both play important roles in our software ecosystem. It is crucial but time-consuming for programmers to raise their specific programming questions on coding forums such as Stack Overflow, which then guide them to actual solutions in GitHub repositories. We are interested in accelerating this process and find that traditional Information Retrieval based methods fail to handle the long and complex questions in coding forums and thus cannot find suitable coding repositories. In order to bridge the semantic gap between repositories and real-world coding questions effectively and efficiently, we introduce a specialized dataset named Repo4QA, which includes over 12,000 question-repository pairs constructed from Stack Overflow and GitHub. Furthermore, we propose QuReCL, a contrastive learning model based on CodeBERT, to jointly learn the representation of both questions and repositories. Experimental results demonstrate that our model can simultaneously capture the semantic features of both questions and repositories through joint embedding, and outperforms existing state-of-the-art methods. PDF 11 2021
Automatic Mining of Salient Events from Multiple Documents This paper studies a new event knowledge extraction task, Event Chain Mining. Given multiple documents on a super event, it aims to mine a series of salient events in a temporal order. For example, the event chain of super event Mexico Earthquake in 2017 is {earthquake hit Mexico, destroy houses, kill people, block roads}. This task can help readers capture the gist of texts quickly, thereby improving reading efficiency and deepening text comprehension. To address this task, we regard an event as a cluster of different mentions of similar meanings. In this way, we can identify the different expressions of events, enrich their semantic knowledge and enhance order information among them. Taking events as the basic unit, we propose a novel and flexible unsupervised framework, EMiner. Specifically, we extract event mentions from texts and merge those of similar meanings into a cluster as an event. Then, essential events are selected and arranged into a chain in the order of their occurrences. We then develop a testbed for the proposed task, including a human-annotated benchmark and comprehensive evaluation metrics. Extensive experiments are conducted to verify the effectiveness of EMiner in terms of both automatic and human evaluations. PDF 11 2021
Fair NLP Models with Differentially Private Text Encoders Encoded text representations often capture sensitive attributes about individuals (e.g., gender, race, or age), which can raise privacy concerns and contribute to making downstream models unfair to certain groups. In this work, we propose FEDERATE, an approach that combines ideas from differential privacy and adversarial learning to learn private text representations that also induce fairer models. We empirically evaluate the trade-off between the privacy of the representations and the fairness and accuracy of the downstream model on two challenging NLP tasks. Our results show that FEDERATE consistently improves upon previous methods. PDF 11 2021
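A bare-bones sketch of the two ingredients being combined, under their usual textbook formulations (noise on the representation for privacy, an adversary predicting the sensitive attribute for fairness); the actual FEDERATE objective and architecture may differ, and all sizes and coefficients here are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyEncoder(nn.Module):
    def __init__(self, d_in=300, d_rep=64, sigma=0.5):
        super().__init__()
        self.proj, self.sigma = nn.Linear(d_in, d_rep), sigma

    def forward(self, x):
        z = torch.tanh(self.proj(x))
        return z + self.sigma * torch.randn_like(z)  # DP-style noise on the representation

encoder, task_head, adversary = NoisyEncoder(), nn.Linear(64, 2), nn.Linear(64, 2)
x = torch.randn(8, 300)                # input features
y = torch.randint(0, 2, (8,))          # task labels
s = torch.randint(0, 2, (8,))          # sensitive attribute

z = encoder(x)
adv_loss = F.cross_entropy(adversary(z.detach()), s)   # train adversary to recover s
enc_loss = (F.cross_entropy(task_head(z), y)
            - 0.1 * F.cross_entropy(adversary(z), s))  # encoder: solve task, hide s
# In practice the two losses are minimized by separate optimizers in alternation.
```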
Detecting Rumor Veracity with Only Textual Information by Double-Channel Structure We develop a double-channel classifier to detect the veracity of social media rumors, relying only on the most basic textual information. Our model first assigns each thread into a “certain” or “uncertain” category. Since authors with a proprietary source of information are likely to post threads with a certain textual tone, we apply lie detection algorithms to certain texts. In contrast, as uncertain threads are arbitrary, we examine whether the replies are in accordance with the threads instead of applying the lie detection algorithms. This approach yields a macro-F1 score of 0.4027, outperforming all the baseline models and the second-place winner of SemEval 2019 Task 7. Further, we show that dividing the sample into two subgroups significantly improves the classification accuracy, reinforcing our claim that applying appropriate classifiers is crucial in rumor veracity detection. PDF 11 2021
AdapLeR: Speeding up Inference by Adaptive Length Reduction Pre-trained language models have shown stellar performance in various downstream tasks. But this usually comes at the cost of high latency and computation, hindering their use in resource-limited settings. In this work, we propose a novel approach for reducing the computational cost of BERT with minimal loss in downstream performance. Our model dynamically eliminates less contributing tokens through the layers, resulting in shorter lengths and consequently lower computational cost. To determine the importance of each token representation, we train a Contribution Predictor for each layer using a gradient-based saliency method. Our experiments on several diverse classification tasks show speedups of up to 17x during inference. We also validate the quality of the selected tokens in our method using human annotations in the ERASER benchmark. In comparison to other widely used strategies for selecting important tokens, such as saliency and attention, our proposed method has a significantly lower false-positive rate in generating rationales. PDF 11 2021
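The core operation, dropping low-contribution tokens between layers, can be sketched as below. The fixed keep ratio and random scores are placeholders: AdapLeR instead learns a per-layer Contribution Predictor from gradient-based saliency and reduces lengths adaptively.

```python
import torch

def reduce_tokens(hidden, contrib_scores, keep_ratio=0.5):
    """Keep the top-scoring tokens (in their original order) and drop the rest."""
    k = max(1, int(hidden.size(1) * keep_ratio))
    idx = contrib_scores.topk(k, dim=1).indices.sort(dim=1).values
    return torch.gather(hidden, 1, idx.unsqueeze(-1).expand(-1, -1, hidden.size(-1)))

h = torch.randn(2, 32, 768)            # hidden states after some layer
scores = torch.rand(2, 32)             # placeholder contribution scores
print(reduce_tokens(h, scores).shape)  # torch.Size([2, 16, 768])
```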
Universal Phone Recognition for Language Agnostic Keyword Search Recently, significant advances have been made in universal phone recognition. Some of these methods allow researchers to recognize phones in thousands of languages. In this paper, we explore the use of such universal phone recognition for phonetic keyword search (KWS). That is, we apply these methods to search for specific sequences of phones, corresponding to keywords, in a set of audio files. We find that truly universal phone recognition might not be viable for KWS, but phone recognition systems can be fine-tuned with small amounts of data (3-5 hours of recordings) to produce useful results. PDF 11 2021
High Interpretable Transfer Network for Aspect Level Sentiment Classification Aspect-level affective classification (ASC) aims to detect the affective polarity of a given opinion target in a sentence. In neural-network-based ASC methods, most work uses the attention mechanism to capture the sentiment words corresponding to the opinion target, and then gathers them as evidence to infer the sentiment of the target. However, due to the complexity of annotation, aspect-level datasets are relatively small. Data scarcity means the attention mechanism is sometimes unable to attend to the sentiment words corresponding to the target, which ultimately weakens the performance of the neural model. To solve this problem, this paper proposes a complete High Interpretable Transfer Network (HITN) transfer learning framework, which adopts methods such as data augmentation, attention adjustment, and transfer to effectively improve the performance of ASC models. Extensive experimental results show that our method consistently outperforms all previous transfer methods in this field, even compared with some complex models. PDF 11 2021
Improving Neural Topic Models by Contrastive Learning with BERT We present a general plug-and-play contrastive learning framework that improves existing neural topic models (NTMs) by incorporating knowledge distilled from pre-trained language models. Recent NTMs have been applied to many applications and have shown promising improvements in text analysis. However, they mainly focus on word occurrences and are often optimized by maximizing a likelihood-based objective, which can lead to suboptimal topic coherence and document representations. To overcome this bottleneck, we introduce an additional contrastive loss that pushes the topical representation of a document learned by an NTM close to the semantic representation of the document obtained from pre-trained language models. In this way, the prior knowledge of the pre-trained language models can enrich the contextual information of the target corpus for NTMs. Comprehensive experiments show that the proposed framework achieves state-of-the-art performance. Importantly, our framework is a general approach for improving most existing NTMs. PDF 11 2021
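The extra loss can be read as a standard InfoNCE term with in-batch negatives, pulling each document's topic vector toward its own pre-trained-LM embedding and away from the other documents' embeddings; the dimensions and the projection layer below are assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

proj = nn.Linear(768, 50)   # map BERT document embeddings into the topic space

def topic_bert_contrastive(topic_vecs, bert_vecs, tau=0.1):
    t = F.normalize(topic_vecs, dim=-1)
    b = F.normalize(proj(bert_vecs), dim=-1)
    logits = t @ b.T / tau                  # (batch, batch) similarity matrix
    labels = torch.arange(t.size(0))        # each document's own view is the positive
    return F.cross_entropy(logits, labels)

topics = torch.softmax(torch.randn(16, 50), dim=-1)   # NTM topic distributions
bert = torch.randn(16, 768)                           # frozen BERT document embeddings
print(topic_bert_contrastive(topics, bert))
```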
Isomorphic Cross-lingual Embeddings for Low-Resource Languages Recent research in cross-lingual representation learning has focused on offline mapping approaches due to their simplicity, computational efficiency, and ability to work with minimal parallel resources. However, they crucially depend on the assumption that embedding spaces are approximately isomorphic, which does not hold in practice, leading to poorer performance on low-resource and distant language pairs. In this paper, we introduce a framework to learn cross-lingual word embeddings, without assuming isometry, for low-resource pairs via joint exploitation of a related higher-resource language. Both the source and target monolingual embeddings are independently aligned to the related language, enabling the use of offline methods. We show that this approach successfully outperforms other methods on several low-resource language pairs in both bilingual lexicon induction and eigenvalue similarity. PDF 11 2021
MetaPrompting: Learning to Learn Better Prompts Prompting is regarded as one of the crucial advances in few-shot natural language processing. Recent research on prompting has moved from discrete-token-based "hard prompts" to continuous "soft prompts", which employ learnable vectors as pseudo prompts and achieve better performance. Though showing promising prospects, these soft-prompting methods are observed to rely heavily on good initialization to take effect. Unfortunately, obtaining a perfect initialization for soft prompts requires an understanding of the language model's inner workings as well as elaborate design, which is no easy task and has to be restarted from scratch for each new task. To remedy this, we propose a generalized soft-prompting method called MetaPrompting, which adopts the well-recognized model-agnostic meta-learning algorithm to automatically find a better prompt initialization that facilitates fast adaptation to new prompting tasks. Experiments show MetaPrompting brings significant improvements on three different datasets (over 6.5 points of improvement in the 1-shot setting) and achieves new state-of-the-art performance. PDF 11 2021
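In MAML terms, the soft prompt itself is the meta-parameter: adapt it on a task's support set, then update the shared initialization from the query-set loss. The toy objective below only stands in for the prompted model's loss; prompt length, learning rates, and the task data are all illustrative assumptions.

```python
import torch

prompt = torch.randn(10, 768, requires_grad=True)   # shared soft-prompt initialization
meta_opt = torch.optim.Adam([prompt], lr=1e-3)

def task_loss(p, batch):
    return (p.sum() - batch) ** 2   # placeholder for the prompted LM's loss

for support, query in [(1.0, 1.2), (0.4, 0.5)]:     # two toy prompting tasks
    # Inner loop: one adaptation step on the support set.
    g = torch.autograd.grad(task_loss(prompt, support), prompt, create_graph=True)[0]
    adapted = prompt - 0.01 * g
    # Outer loop: meta-update the initialization from the query-set loss.
    meta_opt.zero_grad()
    task_loss(adapted, query).backward()
    meta_opt.step()
```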
Context-Paraphrase Enhanced Commonsense Question Answering Commonsense question answering (CQA) generally requires a machine to use mastered commonsense to answer questions without relevant background material, which is a challenging task in natural language processing. Many prior methods mainly retrieve question-related evidence from a structured knowledge base as the background material of the question, but the extracted evidence is generally described through entities and the relationships between them, making it difficult for the machine to fully understand the meaning of the evidence. In this paper, we integrate the paraphrases in WordNet and Wiktionary into the evidence extraction process and the machine reading comprehension (MRC) model, and propose a context-paraphrase enhanced commonsense question answering method. Specifically, the context paraphrases obtained from WordNet and Wiktionary are first incorporated into the construction of a heterogeneous graph; question-related triples are extracted based on this graph, and each triple is converted to triple-text using a relational template. Then, the triple-text is used as the context of the question to establish an association graph containing the relationships between the context entities and the paraphrases. We further integrate the association graph into the MRC model to better guide the model's answers. Experimental results on CommonsenseQA and OpenBookQA show that context paraphrases are effective in improving the answer accuracy of the MRC model. PDF 11 2021
Meta Learning for Code Summarization Source code summarization is the task of generating a high-level natural language description for a segment of programming language code. Current neural models for the task differ in their architecture and the aspects of code they consider. In this paper, we show that three SOTA models for code summarization work well on largely disjoint subsets of a large code-base. This complementarity motivates model combination: We propose three meta-models that select the best candidate summary for a given code segment. The two neural models improve significantly over the performance of the best individual model, obtaining an improvement of 2.1 BLEU points on a dataset of code segments where at least one of the individual models obtains a non-zero BLEU. PDF 11 2021
UNICON: Unsupervised Intent Discovery via Semantic-level Contrastive Learning Discovering new intents is crucial for expanding domains in dialogue systems or natural language understanding (NLU) systems. A typical approach is to leverage unsupervised and semi-supervised learning to train a neural encoder to produce representations of utterances that are adequate for clustering, and then to perform clustering on the representations to detect unseen clusters of intents. Recently, instance-level contrastive learning has been proposed to improve representation quality for better clustering. However, this method suffers from semantic distortion in text augmentation and even from representation inadequacy due to the limitations of using representations of pre-trained language models, typically BERT. Neural encoders can be powerful representation learners, but the initial parameters of pre-trained language models do not reliably produce representations that are suitable for capturing semantic distances. To eliminate the necessity of data augmentation and reduce the negative impact of pre-trained language models as encoders, we propose UNICON, a novel contrastive learning method that utilizes auxiliary external representations to provide powerful guidance for the encoder. PDF 11 2021
SumHiS: Extractive Summarization Exploiting Hidden Structure Extractive summarization is the task of highlighting the most important parts of a text. We introduce a new approach to the extractive summarization task that uses the hidden clustering structure of the text. Experimental results on CNN/DailyMail demonstrate that our approach generates more accurate summaries than both extractive and abstractive methods, achieving state-of-the-art results in terms of the ROUGE-2 metric and exceeding previous approaches by 10\%. Additionally, we show that the hidden structure of the text can be interpreted as aspects. PDF 11 2021
A Meta-Learning Approach for Few-Shot (Dis)Agreement Identification in Online Discussions Online discussions are abundant with different opinions on a common topic, and identifying agreement and disagreement between online posts enables many opinion mining applications. Recognizing the increasing need to analyze opinions on emergent new topics (e.g., from "mask mandate" to "COVID vaccination"), which tend to lack annotations, we present the first meta-learning approach for few-shot (dis)agreement identification on a new topic with few labeled instances. We further design a lexicon-based regularization loss and propose domain-aware task augmentation for meta-training to enable the meta-learner to learn both domain-invariant cues and domain-specific expressions for (dis)agreement identification. Extensive experiments on two benchmark datasets and evaluation on three topic domains demonstrate the effectiveness of the meta-learning approach, which consistently and noticeably outperforms the conventional transfer learning approach based on fine-tuning. PDF 11 2021
HighIE: High-Order Inference for Entity Recognition, Relation Extraction, and Event Extraction Most prior work on Information Extraction typically predicts labels of individual instances (e.g., event triggers, relations, entities) independently regardless of their interactions. We propose a novel framework, HighIE, that aims to integrate high-order cross-subtask and cross-instance dependencies in both learning and inference. High-order inference on label variables is an NP-hard problem. To address it, we propose a high-order decoder that is unfolded from an approximate inference algorithm. The experimental results show that our approach achieves consistent improvement compared with prior work. PDF 11 2021
MOROCCO: Model Resource Comparison Framework A new generation of pre-trained transformer language models has established new state-of-the-art results on many tasks, even exceeding the human level on standard NLU benchmarks. Despite the rapid progress, benchmark-based evaluation has generally relied on downstream performance as the primary metric, which limits the scope of model comparison in terms of practical use. This paper presents MOdel ResOurCe COmparison (MOROCCO), a framework for assessing models with respect to their downstream quality combined with two computational efficiency metrics, memory consumption and throughput, during the inference stage. The framework allows for flexible integration with popular leaderboards compatible with the jiant environment, which supports over 50 downstream tasks. We demonstrate the applicability of MOROCCO by evaluating 10 transformer models on two multi-task GLUE-style benchmarks in English and Russian and provide a model analysis. PDF 11 2021
Calibration of Machine Reading Systems at Scale In typical machine learning systems, an estimate of the probability of the prediction is used to assess the system's confidence in the prediction. This confidence measure is usually uncalibrated; i.e., the system's confidence in the prediction does not match the true probability of the predicted output. In this paper, we present an investigation into calibrating open-setting machine reading systems such as open-domain question answering and claim verification systems. We show that calibrating such complex systems, which contain discrete retrieval and deep reading components, is challenging, and that current calibration techniques fail to scale to these settings. We propose simple extensions to existing calibration approaches that allow us to adapt them to these settings. Our experimental results reveal that the approach works well and can be useful for selectively predicting answers when question answering systems are posed with unanswerable or out-of-training-distribution questions. PDF 11 2021
Input-specific Attention Subnetworks for Adversarial Detection Self-attention heads are characteristic of Transformer models and have been well studied for interpretability and pruning. In this work, we demonstrate an altogether different utility of attention heads, namely for adversarial detection. Specifically, we propose a method to construct input-specific attention subnetworks (IAS) from which we extract three features to discriminate between authentic and adversarial inputs. The resultant detector significantly improves (by over 7.5%) the state-of-the-art adversarial detection accuracy for the BERT encoder on 10 NLU datasets with 11 different adversarial attack types. We also demonstrate that our method (a) is more accurate for larger models, which are likely to have more spurious correlations and thus be more vulnerable to adversarial attack, and (b) performs well even with modest training sets of adversarial examples. PDF 11 2021
Cross-lingual Word Embeddings in Hyperbolic Space Cross-lingual word embeddings can be applied to several natural language processing applications across multiple languages. Unlike prior works that use word embeddings based on Euclidean space, this short paper presents a simple and effective cross-lingual Word2Vec model that adapts the Poincaré ball model of hyperbolic space to learn unsupervised cross-lingual word representations from a German-English parallel corpus. It has been shown that hyperbolic embeddings can capture and preserve hierarchical relationships. We evaluate the model on both hypernymy and analogy tasks. The proposed model achieves performance comparable to the vanilla Word2Vec model on the cross-lingual analogy task, while the hypernymy task shows that the cross-lingual Poincaré Word2Vec model can capture latent hierarchical structure from free text across languages, which is absent from the Euclidean-based Word2Vec representations. Our results show that, by preserving latent hierarchical information, hyperbolic spaces can offer better representations for cross-lingual embeddings. PDF 11 2021
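For reference, the Poincaré-ball distance that hyperbolic embedding models of this kind optimize is the standard textbook formula below (not code from the paper):

```python
import torch

def poincare_distance(u, v, eps=1e-5):
    # d(u, v) = arcosh(1 + 2||u - v||^2 / ((1 - ||u||^2)(1 - ||v||^2)))
    sq = ((u - v) ** 2).sum(-1)
    alpha = (1 - (u ** 2).sum(-1)).clamp_min(eps)
    beta = (1 - (v ** 2).sum(-1)).clamp_min(eps)
    return torch.acosh(1 + 2 * sq / (alpha * beta))

u = torch.tensor([0.1, 0.2])    # points must lie inside the unit ball
v = torch.tensor([0.5, -0.3])
print(poincare_distance(u, v))
```

Distances grow rapidly as points approach the boundary of the ball, which is what lets hyperbolic space pack tree-like (hierarchical) structure into few dimensions.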
Multilingual pre-training with Language and Task Adaptation for Multilingual Text Style Transfer We exploit the pre-trained seq2seq model mBART for multilingual text style transfer. Using machine-translated data as well as gold-aligned English sentences yields state-of-the-art results in the three target languages we consider. Moreover, in view of the general scarcity of parallel data, we propose a modular approach for multilingual formality transfer, which consists of two training strategies that target adaptation to both language and task. Our approach achieves competitive performance without monolingual task-specific parallel data and can be applied to other style transfer tasks as well as to other languages. PDF 11 2021
Contextual Representation Learning beyond Masked Language Modeling Currently, masked language modeling (e.g., BERT) is the prime choice for learning contextualized representations. Due to this pervasiveness, it naturally raises an interesting question: how do masked language models (MLMs) learn contextual representations? In this work, we analyze the learning dynamics of MLMs and find that they adopt sampled embeddings as anchors to estimate and inject contextual semantics into representations, which limits the efficiency and effectiveness of MLMs. To address these problems, we propose TACO, a simple yet effective representation learning approach that directly models global semantics. To be specific, TACO extracts and aligns contextual semantics hidden in contextualized representations to encourage models to attend to global semantics when generating contextualized representations. Experiments on the GLUE benchmark show that TACO achieves up to 5x speedup and up to 1.2 points average improvement over MLM. PDF 11 2021
How do people talk about images? A study on open-domain conversation on images. Open-domain conversation on images requires a model to consider the relation and balance between utterances and images in order to generate proper responses. This paper explores how humans conduct conversations about images by investigating a well-constructed open-domain image conversation dataset, ImageChat. We examine the conversations on images from three perspectives: $\textit{image relevancy}$, $\textit{image information}$ and $\textit{utterance style}$. We show that objects in the image are indeed the most important element in conversations on images; they can be discussed directly or serve as a bait leading to other off-image conversations. Thus, being able to accurately detect objects in the image and knowing their attributes is essential for chatting about images. Understanding the scenario of the image, beyond extracting the image objects, is also a key factor in conversations on images. Based on our analysis, we propose enriching the image information with image captions and object tags, increasing the diversity and image-relevancy of generated responses. We believe that our analysis provides useful insights and directions that facilitate future research on open-domain conversation on images. PDF 11 2021
An Unsupervised Multiple-Task and Multiple-Teacher Model for Cross-lingual Named Entity Recognition The cross-lingual named entity recognition task is one of the critical problems for evaluating potential transfer learning techniques on low-resource languages. Knowledge distillation between source and target languages using pre-trained multilingual language models has shown its superiority. However, existing cross-lingual distillation models merely consider the potential transferability between two identical single tasks across both domains. Other possible auxiliary tasks for improving learning performance have not been fully investigated. In this study, based on the knowledge distillation framework and multi-task learning, we introduce a similarity metric model as an auxiliary task to improve cross-lingual NER performance on the target domain. Specifically, an entity-recognizer teacher and a similarity-evaluator teacher are first trained in parallel on the source domain. Then, the two tasks in the student model are supervised by the two teachers simultaneously. Empirical studies on datasets across 7 different languages confirm the effectiveness of the proposed model. PDF 11 2021
A Multilingual Corpus for Socio-political Event Coreference Resolution We propose a dataset for event coreference resolution, which is based on random samples drawn from multiple sources, languages, and countries. Early scholarship on event information collection has not quantified the contribution of event coreference resolution. We prepared and analyzed a representative multilingual corpus and measured the performance and contribution of state-of-the-art event coreference resolution approaches. We found that almost half of the event mentions in documents co-occur with other event mentions, which makes it inevitable to obtain erroneous or partial event information. We showed that event coreference resolution can help improve this situation. Our contribution sheds light on a challenge that has been overlooked or difficult to study to date. Future event information collection studies can be designed based on the results we present in this report. PDF 11 2021
Generic Dependency Modeling in Multi-Party Conversation Modeling the dependency between utterances in a multi-party conversation facilitates a more precise and holistic understanding of the conversation. In this paper, we propose a simple and generic framework for this purpose, in which the dependency is built on discourse parsing of utterances. In particular, we present two approaches to encoding the dependency, namely absolute dependency encoding and relative dependency encoding, and combine them in Transformers by modifying the computation of self-attention. To enhance the understanding of utterance dependency, we further introduce a span distance prediction pre-training task for the proposed model. Experimental results on four multi-party conversation benchmarks for different tasks show that this model successfully boosts the generic performance of Transformer-based language models. Systematic studies are conducted to investigate why utterance dependencies are essential for multi-party conversation tasks and how they are learned in a simple and effective framework. PDF 11 2021
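One common way to fold such dependencies into self-attention, which the relative variant plausibly resembles, is a learned bias added to the attention logits, indexed by a (bucketed) dependency distance taken from the discourse parse; the bucket count and all sizes here are assumptions for illustration.

```python
import torch

n, d_model, n_buckets = 6, 64, 4                   # 6 utterances, toy sizes
q, k = torch.randn(1, n, d_model), torch.randn(1, n, d_model)
dep_dist = torch.randint(0, n_buckets, (n, n))     # bucketed distances from the parse
bias = torch.nn.Parameter(torch.zeros(n_buckets))  # one learned bias per bucket

logits = q @ k.transpose(-1, -2) / d_model ** 0.5 + bias[dep_dist]
attn = torch.softmax(logits, dim=-1)               # dependency-aware attention
```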
C$^3$KG: A Chinese Commonsense Conversation Knowledge Graph Existing commonsense knowledge bases often organize tuples in an isolated manner, which is deficient for commonsense conversational models to plan the next steps. To fill the gap, we curate a large-scale multi-turn human-written conversation corpus, and create the first Chinese commonsense conversation knowledge graph which incorporates both social commonsense knowledge and dialog flow information. To show the potential of our graph, we develop a graph-conversation matching approach, and benchmark two graph-grounded conversational tasks. All the resources in this work will be released to foster future research. PDF 11 2021
Explicit Modeling the Context for Chinese NER Named entity recognition (NER) is the foundation of many natural language processing tasks. Current NER models have achieved promising results, but, as pointed out by several studies, they fail at a high rate on generalization tests such as the invariance test because they rely heavily on name information. So, we propose a context module to explicitly model the contextual information, and a trainable balance factor is designed to incorporate the result of the context module. To learn this factor, we propose several tailored data augmentation strategies to generate synthetic labels for it. These approaches help the model learn whether it should focus on the context. Our method achieves on average a 1.2\% absolute improvement in F1 over BERT-CRF on three datasets. Moreover, our method performs on par with the best solutions, which rely heavily on external features besides BERT. We also conduct an invariance test to analyse the effect of the context information. The source code of our model and augmentation strategies will be available at anonymous.url. PDF 11 2021
Uncertainty in the Social World and its Interdependence This research investigates the variety of social behaviours that we engage in on a daily basis. There are several unknown factors in each scenario, reflecting the many sources of uncertainty inherent in social judgement. We illustrate how uncertainty emerges in social situations (the thoughts and intentions of others are generally hidden, making predicting a person's behaviour difficult) and why people are driven to reduce the aversive feelings created by uncertainty. We propose a model in which social uncertainty is mitigated first through automatic modes of inference (such as impression generation), before more control-demanding modes of inference (such as perspective-taking) are used to narrow one's expectations even further. Finally, social uncertainty is reduced further by allocating resources to update these predictions based on newer inputs. We propose a novel quantitative framework to provide an account of the mechanisms underlying social cognition and action, by integrating studies from multiple disciplines. PDF 11 2021
PARE: A Simple and Strong Baseline for Monolingual and Multilingual Distantly Supervised Relation Extraction Neural models for distantly supervised relation extraction (DS-RE) encode each sentence in an entity-pair bag separately. These are then aggregated for bag-level relation prediction. Since, at encoding time, these approaches do not allow information to flow from other sentences in the bag, we believe that they do not utilize the available bag data to the fullest. In response, we explore a simple baseline approach (PARE) in which all sentences of a bag are concatenated into a passage of sentences and encoded jointly using BERT. The contextual embeddings of tokens are aggregated using attention with the candidate relation as the query; this summary of the whole passage predicts the candidate relation. We find that our simple baseline solution outperforms existing state-of-the-art DS-RE models on both monolingual and multilingual DS-RE datasets. PDF 11 2021
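The aggregation step reads as ordinary dot-product attention with the candidate relation's embedding as the query over the jointly encoded passage; the sizes and the final scorer below are illustrative assumptions.

```python
import torch
import torch.nn as nn

d = 256
rel_emb = nn.Embedding(40, d)        # one query vector per candidate relation
scorer = nn.Linear(d, 1)

tokens = torch.randn(1, 120, d)      # contextual embeddings of the concatenated bag
q = rel_emb(torch.tensor([7]))       # candidate relation as the attention query
attn = torch.softmax((tokens @ q.unsqueeze(-1)).squeeze(-1) / d ** 0.5, dim=-1)
summary = (attn.unsqueeze(-1) * tokens).sum(dim=1)  # passage summary for relation 7
logit = scorer(summary)              # does the bag express this relation?
```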
Graph Neural Networks for Multiparallel Word Alignment After a period of decrease, interest in word alignments is increasing again for their usefulness in domains such as typological research, cross-lingual annotation projection and machine translation. Generally, alignment algorithms only use bitext and do not make use of the fact that many parallel corpora are multiparallel. Here, we compute high-quality word alignments between multiple language pairs by considering all language pairs together. First, we create a multiparallel word alignment graph, joining all bilingual word alignment pairs in one graph. Next, we use graph neural networks (GNNs) and community detection algorithms to exploit the graph structure. Our GNN approach (i) utilizes information about the meaning, position and language of the input words, (ii) incorporates information from multiple parallel sentences, (iii) adds and removes edges from the initial alignments, and (iv) provides a prediction model that can generalize beyond the sentences it is trained on. We show that community detection provides valuable information for multiparallel word alignment. Our method outperforms previous work on three word alignment datasets and on a downstream task. PDF 11 2021
Flagging Comprehensibility Issues in Hindi Text with Question Answering There is a critical need to check the quality of translations when localizing important content across industries. This paper presents question-answering based techniques to check the comprehensibility of a text translation. The viability of the method is evaluated using text translated from English to Hindi, where we see comprehensibility issues identified with up to 87\% accuracy. PDF 11 2021
Top-Down Influence? Predicting CEO Personality and Risk Impact from Speech Transcripts How much does a CEO's personality impact the performance of their company? Management theory posits a great influence, but it is difficult to show empirically---there is a lack of publicly available self-reported personality data for top managers. Instead, we propose a text-based personality regressor based on crowd-sourced MBTI assessments. The ratings have high internal and external validity and can be predicted with moderate to strong correlations for three out of four dimensions. Providing evidence for the upper echelons theory, we demonstrate that the predicted CEO personalities have explanatory power over financial risk. PDF 11 2021
ArchivalQA: A Large-scale Benchmark Dataset for Open Domain Question Answering over Archival News Collections In the last few years, open-domain question answering (ODQA) has advanced rapidly due to the development of deep learning techniques and the availability of large-scale QA datasets. However, current datasets are essentially designed for synchronic document collections (e.g., Wikipedia). Temporal news collections, such as long-term news archives spanning several decades, are rarely used in training the models even though they are quite valuable for our society. To foster research in the field of ODQA on such historical collections, we present ArchivalQA, a large question answering dataset consisting of 532,444 question-answer pairs, which is designed for temporal news QA. We divide our dataset into four subparts based on the question difficulty levels and the containment of temporal expressions, which we believe are useful for training and testing ODQA systems characterized by different strengths and abilities. The novel QA dataset-construction framework that we introduce can also be applied to create datasets over other types of collections. PDF 11 2021
Hierarchical Recurrent Aggregative Generation for Few-Shot NLG Large pretrained models enable transfer learning to low-resource domains for language generation tasks. However, previous end-to-end approaches do not account for the fact that some generation sub-tasks, specifically aggregation and lexicalisation, can benefit from transfer learning to different extents. To exploit these varying potentials for transfer learning, we propose a new hierarchical approach for few-shot and zero-shot generation. Our approach consists of a jointly trained three-module architecture: the first module independently lexicalises the distinct units of information in the input as sentence sub-units (e.g. phrases), the second module recurrently aggregates these sub-units to generate a unified intermediate output, while the third module subsequently post-edits it to generate a coherent and fluent final text. We perform extensive empirical analysis and ablation studies on few-shot and zero-shot settings across 4 datasets. Automatic and human evaluation shows that the proposed hierarchical approach consistently achieves state-of-the-art results when compared to previous work. PDF 11 2021
Contrastive Learning for Low Resource Machine Translation Representation learning plays a vital role in natural language processing tasks. More recent works study the geometry of the representation space in each layer of pre-trained language models. They find that the context representations of words are not isotropic in any layer of the pre-trained language model. However, how contextual are the contextualized representations produced by transformer-based machine translation models? In this paper, we find that the contextualized representations of the same word in different contexts have a greater cosine similarity than those of two different words, but this self-similarity is still relatively low between occurrences of the same word. This suggests that machine translation models produce more context-specific representations. In this work, we present a contrastive framework for machine translation that adopts contrastive learning to train the model in a supervised way. By making use of data augmentation, our supervised contrastive learning method addresses the issue of representation learning for low-resource machine translation. Experimental results on the IWSLT14 and WMT14 datasets show our method can outperform competitive baselines significantly. PDF 11 2021
Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions A long-term goal of AI research is to build intelligent agents that can communicate with humans in natural language, perceive the environment, and perform real-world tasks. Vision-and-Language Navigation (VLN) is a fundamental and interdisciplinary research topic towards this goal, and receives increasing attention from the natural language processing, computer vision, and machine learning communities. In this paper, we review contemporary studies in the emerging field of VLN, covering tasks, evaluation metrics, methods, etc. Through structured analysis of current progress and challenges, we also highlight the limitations of current VLN and opportunities for future work. This paper serves as a thorough reference for the VLN research community. PDF 11 2021
Neighbour Contrastive Learning with Heterogeneous Graph Attention Networks on Short Text Classification Graph neural networks (GNNs) have attracted extensive research interest in text classification tasks due to their superiority in representation learning. However, most existing studies adopt the same semi-supervised learning setting as the vanilla Graph Convolutional Network (GCN), which requires a large amount of labelled data during training and is thus less robust when dealing with large-scale graph data with few labels. Additionally, graph structure information is normally captured by direct information aggregation via the network schema, and missing adjacency knowledge may hinder performance. To address these problems, this paper proposes a novel method that learns graph structure by applying simple neighbour contrastive learning to an existing self-supervised heterogeneous graph neural network model (NC-HGAT). It incorporates graph structure information from heterogeneous graphs via multi-layer perceptrons (MLPs) and delivers consistent results despite corrupted neighbouring connections. Extensive experiments conducted on four benchmark short-text datasets demonstrate that our proposed model NC-HGAT outperforms the state-of-the-art methods on three datasets and achieves a competitive result on the remaining one. PDF 11 2021
ST-SQL: Semi-Supervised Self-Training for Text-to-SQL via Column Specificity Meta-Learning The few-shot problem is an urgent challenge for the generalization capability of the single-table text-to-SQL task. Current few-shot methods neglect the potential information in unlabeled data and exhibit domain bias because all samples are weighted equally. Motivated by this, this paper proposes a Self-Training text-to-SQL (ST-SQL) method that tackles the problem from both the data and algorithm views. At the data level, ST-SQL performs data expansion through an iterative framework that attaches pseudo-labels to unlabeled data; the expanded data are then sampled to retrain the model. At the algorithm level, ST-SQL defines a column specificity to perform a more fine-grained gradient update during meta-training, attaching more weight to common samples to eliminate the domain bias. ST-SQL achieves state-of-the-art results on both open-domain and domain-specific benchmarks and brings more significant improvements on few-shot tests. PDF 11 2021
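A generic sketch of the iterative pseudo-labeling loop at the data level (ST-SQL's column-specificity meta-learning is omitted; the `model` interface with fit/predict_with_confidence methods and the confidence threshold are illustrative assumptions):

    def self_training(model, labeled, unlabeled, rounds=3, threshold=0.9):
        """Iteratively attach pseudo-labels to confident unlabeled examples,
        then retrain on the expanded data (a generic self-training skeleton)."""
        data = list(labeled)
        for _ in range(rounds):
            model.fit(data)                # retrain on the expanded data
            remaining = []
            for x in unlabeled:
                label, confidence = model.predict_with_confidence(x)
                if confidence >= threshold:
                    data.append((x, label))  # accept as pseudo-labeled
                else:
                    remaining.append(x)
            unlabeled = remaining
        return model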
"You might think about slightly revising the title": identifying hedges in peer-tutoring interactions Hedges have an important role in the management of rapport. In peer-tutoring, they are notably used by tutors in dyads experiencing low rapport to tone down the impact of instructions and negative feedback.Pursuing the objective of building a tutoring agent that manages rapport with teenagers in order to improve learning, we used a multimodal peer-tutoring dataset to construct a computational framework for identifying hedges. We compared approaches relying on pre-trained resources with others that integrate insights from the social science literature. Our best performance involved a hybrid approach that outperforms the existing baseline while being easier to interpret. We employ a model explainability tool to explore the features that characterize hedges in peer-tutoring conversations, and we identify some novel features, and the benefits of a such a hybrid model approach. PDF 11 2021
A Multi-task Event and Argument Trigger Detection in Hindi using POS Tagging as an Auxiliary Task Event and argument trigger detection are essential sub-tasks of an event extraction system. Much effort has been devoted to improving the performance of trigger detection systems, but the effect of low-level tasks like Parts-of-Speech (POS) tagging as an auxiliary task in multi-task learning of event and argument trigger detection is not well understood in the literature. In this work, we propose a BERT-based multi-task architecture that learns a shared representation from two sequence labeling tasks, trigger detection (both event and argument) and POS tagging, using the latter as an auxiliary task. We show that our proposed approach achieves a significant performance boost compared to single-task models. Unlike previously proposed works, we conduct our experiments in Hindi. PDF 11 2021
Computer Science Articles Named Entity Recognition Datasets: Survey and Our Recent Development Domain-specific named entity recognition on Computer Science (CS) scholarly articles is an information extraction task that is arguably more challenging and less studied than named entity recognition (NER) for the general domain. Given that significant progress has been made on NER, we believe that scholarly domain-specific NER will receive increasing attention in the NLP community. Nevertheless, progress on the task is currently hampered in part by its recency and the lack of standardized concept types for scientific entities/terms. This paper presents a survey of the current state of research on scholarly domain-specific NER with a focus on language resources; further, it creates a novel dataset and model for CS NER. PDF 11 2021
Table-based Fact Verification with Self-adaptive Mixture of Experts The table-based fact verification task has recently gained widespread attention and yet remains a very challenging problem. It inherently requires informative reasoning over natural language together with different numerical and logical reasoning on tables (e.g., count, superlative, comparative). In this paper, we present a Self-adaptive Mixture-of-Experts Network (SaMoE), a novel framework built on this fundamental property. Specifically, we develop a mixture-of-experts neural network to recognize and execute different types of reasoning: the network is composed of multiple experts, each handling a specific part of the semantics for reasoning, while a management module decides the contribution of each expert network to the verification result. A self-adaptive method is developed to teach the management module to combine the results of different experts more efficiently without external knowledge. Experimental results show that our framework achieves 85.1% accuracy on the benchmark dataset TabFact, comparable with the previous state-of-the-art models. We hope our framework can serve as a new baseline for table-based verification. Our code will be available at (URL to be released here). PDF 11 2021
Topic Sentence Named Entity Recognition: A New Task with Its Dataset and Benchmarks In this paper, we focus on a new type of named entity recognition (NER) task called topic sentence NER. A topic sentence means a short and compact sentence that acts as a summary of a long document. For example, a title can be seen as a topic sentence of its article. Topic sentence NER aims to extract named entities in a topic sentence given the corresponding unlabeled document as a reference. This task represents real-world scenarios where full-document NER is too expensive and obtaining the entities only in topic sentences is enough for downstream tasks. To achieve this, we construct a large-scale human-annotated Topic Sentence NER dataset, named TSNER. The dataset contains 12,000 annotated sentences accompanied by their unlabeled document. Based on TSNER, we propose a family of representative and strong baseline models, which can utilize both single-sentence and document-level features. We will make the dataset public in the hope of advancing the research on the topic sentence NER task. PDF 11 2021
CQARE: Contrastive Question-Answering for Few-shot Relation Extraction with Prompt Tuning Prompt tuning with pre-trained language models (PLMs) has exhibited outstanding performance by closing the gap between pre-training tasks and various downstream applications without introducing newly initialized parameters. However, prompt tuning requires vast amounts of prompt engineering and predefined label word mappings, which obstructs its use in practice. Besides, the large label space makes prompt tuning more arduous and challenging when it comes to relation extraction (RE). To tackle these issues, we propose a Contrastive Question-Answering method with prompt tuning for few-shot RE (CQARE). CQARE carries out RE task-specific pre-training with four entity-relation-aware pre-training objectives, including prompt pre-training that automatically generates continuous prompts. The proposed pre-training provides more robust initialization for prompt tuning while maintaining semantic consistency with the PLM. Furthermore, CQARE effectively avoids label word mapping by reformulating RE as contrastive question answering. The results indicate that CQARE raises average accuracy by 5.11\% on a cross-domain few-shot dataset, demonstrating that robust initialization and effective contrastive question answering are crucial for prompt tuning. PDF 11 2021
A Graph Enhanced Label Attention Model for ICD Coding from Clinical Text Medical code assignment from clinical texts is a crucial task in the healthcare industry. Clinical texts are typically very long sequences and the number of possible labels is large, making this task quite challenging. Recent work applies deep neural network models to encode medical notes and assign medical codes to clinical documents. Some works use effective attention mechanisms to construct label-specific document representations and show promising results. In this paper, we propose a new attention mechanism, GE-LAAT (graph enhanced label attention), which utilizes code graphs to learn robust representation vectors for medical codes and improves upon state-of-the-art models. Experiments on the MIMIC-III dataset demonstrate the effectiveness of our proposed model. PDF 11 2021
StableMoE: Stable Routing Strategy for Mixture of Experts The Mixture-of-Experts (MoE) technique can scale up the model size of Transformers with an affordable computational overhead. We point out that existing learning-to-route MoE methods suffer from the routing fluctuation issue, i.e., the target expert of the same input may change along with training, but only one expert will be activated for the input during inference. The routing fluctuation tends to harm sample efficiency because the same input updates different experts but only one is finally used. In this paper, we propose StableMoE with two training stages to address the routing fluctuation problem. In the first training stage, we learn a balanced and cohesive routing strategy and distill it into a lightweight router decoupled from the backbone model. In the second training stage, we utilize the distilled router to determine the token-to-expert assignment and freeze it for a stable routing strategy. We validate our method on language modeling and multilingual machine translation. The results show that StableMoE outperforms existing MoE methods in terms of both convergence speed and performance. PDF 11 2021
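A sketch of the second-stage behavior described above: a lightweight router is frozen, so each token's expert assignment stays stable for the rest of training. The linear router, expert shapes, and hard top-1 routing below are illustrative assumptions, not the paper's exact configuration.

    import torch
    import torch.nn as nn

    class FrozenRoutedMoE(nn.Module):
        def __init__(self, d_model, n_experts):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts)  # stand-in for the distilled router
            for p in self.router.parameters():
                p.requires_grad = False                  # frozen: routing no longer fluctuates
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                              nn.Linear(4 * d_model, d_model))
                for _ in range(n_experts))

        def forward(self, x):                            # x: (tokens, d_model)
            assignment = self.router(x).argmax(dim=-1)   # stable top-1 token-to-expert map
            out = torch.zeros_like(x)
            for i, expert in enumerate(self.experts):
                mask = assignment == i
                if mask.any():
                    out[mask] = expert(x[mask])
            return out

    moe = FrozenRoutedMoE(d_model=64, n_experts=4)
    y = moe(torch.randn(10, 64))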
Using Structured Content Plans for Fine-grained Syntactic Control in Pretrained Language Model Generation Large pretrained language models offer powerful generation capabilities, but suffer from a lack of interpretability and fine-grained control. We propose an approach to fine-grained control in generating text directly from a semantic representation, Abstract Meaning Representation (AMR), by augmenting its nodes with syntactic tags. We experiment with English-language generation of three modes of syntax relevant to the framing of a sentence - verb voice (active or passive), verb tense, and realization of human entities - and demonstrate that they can be reliably controlled. Controlling how information is framed is important for applications such as summarization, which aim to highlight salient information. PDF 11 2021
D$^4$: A Psychiatrist-proofread Dialogue Dataset for Depression Diagnosis Depression has affected large populations and become a significant threat to life expectancy globally. Automatic depression diagnosis methods have become a new research focus. In particular, automatic dialogue-based diagnosis systems are desirable since depression diagnosis relies heavily on clinical consultation. Based on clinical diagnosis criteria, doctors initiate a conversation with ample emotional support that guides the patients to expose their symptoms. Such a dialogue combines task-oriented interaction and chitchat, differing from traditional single-purpose human-machine dialogue systems. However, due to the social stigma associated with mental illness, dialogue data related to the diagnosis of actual patients are rarely disclosed, and this lack of data has become one of the major factors restricting research on consultation dialogue systems for depression. Based on the clinical depression diagnostic criteria ICD-11 and DSM-5, we construct a Psychiatrist-proofread Dialogue Dataset for Depression Diagnosis, which simulates the dialogue between doctor and patient during the diagnosis of depression and provides diagnosis results and symptom summaries given by professional psychiatrists for each dialogue. Finally, we fine-tune state-of-the-art pre-trained models and provide baselines on our dataset for response generation, topic prediction, dialogue summarization, and severity classification of depression and suicide risk. PDF 11 2021
Dynamic Entity Memory Network for Dialogue Relational Triplet Extraction Relational triplet extraction (RTE) is a crucial task in information extraction and has attracted extensive attention. Although advanced studies on RTE have made great progress, they are still insufficient for supporting practical applications such as dialogue systems and information retrieval. In this paper, we focus on relational triplet extraction in dialogue scenarios and introduce a new task named dialogue relational triplet extraction (DRTE). Instead of being treated as static texts like sentences or documents, dialogues should be regarded as dynamic texts generated as conversations progress. This poses three important challenges: extracting triplets in real time with incomplete dialogue context, discovering cross-utterance relational triplets, and perceiving the transition of dialogue topics. To tackle these challenges, we propose a Dynamic Entity Memory Network (DEMN). The key components of our approach are an attentional context encoder and an entity memory network. The attentional context encoder learns dialogue semantics utterance by utterance and dynamically captures salient contexts for each utterance. The entity memory network is devised to store entities extracted from previous utterances and to support cross-utterance triplet extraction; meanwhile, it tracks topic transitions in real time and forgets the semantics of trivial entities. To verify the effectiveness of our model, we manually build three datasets based on the KdConv benchmark. Extensive experimental results demonstrate that our model achieves state-of-the-art performance. PDF 11 2021
E-KAR: A Benchmark for Rationalizing Natural Language Analogical Reasoning The ability to recognize analogies is fundamental to human cognition. Existing benchmarks for testing word analogy do not reveal the underlying process of analogical reasoning in neural models. Holding the belief that models capable of reasoning should be right for the right reasons, we propose a first-of-its-kind Explainable Knowledge-intensive Analogical Reasoning benchmark (E-KAR). Our benchmark consists of 1,665 problems sourced from the Civil Service Exams, which require intensive background knowledge to solve. Besides, we design a free-text explanation scheme to explain how an analogy is drawn, and manually annotate E-KAR with 8,325 knowledge-rich sentences of such explanations. Empirical results suggest that this benchmark is very challenging for some state-of-the-art models on both the explanation generation and analogical question answering tasks, which invites further research in this area. PDF 11 2021
Metaphor Detection for Low Resource Languages: From Zero-Shot to Few-Shot Learning in Middle High German In this work, we present a novel unsupervised method for adjective-noun metaphor detection on low resource languages. We propose two new approaches: first, a way of artificially generating metaphor training examples and second, a novel way to find metaphors relying only on word embeddings. The latter enables application to low resource languages. Our method is based on a transformation of word embedding vectors into another vector space, in which the distance between the adjective word vector and the noun word vector represents the metaphoricity of the word pair. We train this method in a zero-shot pseudo-supervised manner by generating artificial metaphor examples and show that our approach can be used to generate a metaphor dataset with low annotation cost. It can then be used to finetune the system in a few-shot manner. In our experiments we show the capabilities of the method in both its unsupervised and its supervised version. Additionally, we test it against a comparable unsupervised baseline method and a supervised variation of it. PDF 11 2021
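A toy illustration of the core geometric idea, under loudly stated assumptions: random vectors stand in for pre-trained embeddings and an identity matrix stands in for the learned transformation, so the scores below are meaningless; the point is only to show how metaphoricity would be read off as a distance.

    import numpy as np

    rng = np.random.default_rng(0)
    dim = 50
    # placeholder embeddings; in practice these are pre-trained word vectors
    emb = {w: rng.normal(size=dim) for w in ["bright", "idea", "lamp"]}

    # stand-in for the learned linear map of the adjective vector; the method
    # trains this transformation so that literal pairs end up close and
    # metaphorical pairs far apart
    W = np.eye(dim)

    def metaphoricity(adjective, noun):
        return float(np.linalg.norm(W @ emb[adjective] - emb[noun]))

    # compare a literal pair ("bright lamp") with a metaphorical one ("bright idea")
    print(metaphoricity("bright", "lamp"), metaphoricity("bright", "idea"))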
Bootstrapping Text Anonymization Models with Distant Supervision We propose a novel method to bootstrap text anonymization models based on distant supervision. Instead of requiring manually labeled training data, the approach relies on a knowledge graph expressing the background information assumed to be publicly available about various individuals. This knowledge graph is employed to automatically annotate text documents including personal data about a subset of those individuals. More precisely, the method determines which text spans ought to be masked in order to guarantee $k$-anonymity, assuming an adversary with access to both the text documents and the background information expressed in the knowledge graph. The resulting collection of labeled documents is then used as training data to fine-tune a pre-trained language model for text anonymization. We illustrate this approach using a knowledge graph extracted from Wikidata and short biographical texts from Wikipedia. Evaluation results with a BERT-based model and a manually annotated collection of 553 summaries showcase the potential of the approach, but also unveil a number of issues that may arise when the knowledge graph is noisy or incomplete. The results also illustrate that, contrary to most sequence labeling problems, the text anonymization task may admit several alternative solutions. PDF 11 2021
DS-TOD: Efficient Domain Specialization for Task-Oriented Dialog Recent work has shown that self-supervised dialog-specific pretraining on large conversational datasets yields substantial gains over traditional language modeling (LM) pretraining in downstream task-oriented dialog (TOD). These approaches, however, exploit general dialogic corpora (e.g., Reddit) and thus presumably fail to reliably embed domain-specific knowledge useful for concrete downstream TOD domains. In this work, we investigate the effects of domain specialization of pretrained language models (PLMs) for TOD. Within our DS-TOD framework, we first automatically extract salient domain-specific terms, and then use them to construct DomainCC and DomainReddit -- resources that we leverage for domain-specific pretraining, based on (i) masked language modeling (MLM) and (ii) response selection (RS) objectives, respectively. We further propose a resource-efficient and modular domain specialization by means of domain adapters -- additional parameter-light layers in which we encode the domain knowledge. Our experiments with prominent TOD tasks -- dialog state tracking (DST) and response retrieval (RR) -- encompassing five domains from the MultiWOZ benchmark demonstrate the effectiveness of DS-TOD. Moreover, we show that the light-weight adapter-based specialization (1) performs comparably to full fine-tuning in single domain setups and (2) is particularly suitable for multi-domain specialization, where besides advantageous computational footprint, it can offer better downstream performance. PDF 11 2021
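A minimal sketch of the kind of parameter-light domain adapter the abstract describes: a bottleneck layer with a residual connection, trained while the surrounding pre-trained model stays frozen. The dimensions and placement are illustrative assumptions.

    import torch
    import torch.nn as nn

    class DomainAdapter(nn.Module):
        """Bottleneck adapter: down-project, nonlinearity, up-project,
        with a residual connection around the whole block."""
        def __init__(self, d_model=768, bottleneck=64):
            super().__init__()
            self.down = nn.Linear(d_model, bottleneck)
            self.up = nn.Linear(bottleneck, d_model)
            self.act = nn.ReLU()

        def forward(self, hidden):
            return hidden + self.up(self.act(self.down(hidden)))

    # only the adapter's small parameter set is updated for a new domain
    adapter = DomainAdapter()
    out = adapter(torch.randn(2, 16, 768))  # (batch, sequence, hidden)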
Sentence-Level Discourse Parsing as Text-to-Text Generation Previous studies have made great advances in RST discourse parsing through neural frameworks or efficient features, but they split the parsing process into two subtasks and depend heavily on gold segmentation. In this paper, we introduce an end-to-end method for sentence-level RST discourse parsing that transforms it into a text-to-text generation task. Our method unifies the traditional two-stage parsing and generates the parse tree directly from the input text without requiring a complicated model. Moreover, the EDU segmentation can be simultaneously generated and extracted from the parse tree. Experimental results on the RST Discourse Treebank demonstrate that our proposed method outperforms existing methods on both sentence-level RST parsing and discourse segmentation. Considering the lack of annotated data for RST parsing, we also create high-quality augmented data based on several filtering strategies, which further improves performance. PDF 11 2021
ElitePLM: An Empirical Study on General Language Ability Evaluation of Pretrained Language Models Pretrained language models (PLMs), such as BERT and GPT-3, have dominated the majority of NLP tasks. However, relatively little work has been conducted on systematically evaluating the language abilities of PLMs. In this paper, we present ElitePLM, a large-scale empirical study on general language ability evaluation of PLMs. We first design four evaluation dimensions in ElitePLM, namely memory, comprehension, reasoning, and composition, and then measure ten widely-used PLMs within five categories. Our empirical results demonstrate that: (1) pretraining objectives and strategies have significant impacts on PLM performance in downstream tasks; (2) fine-tuning PLMs in downstream tasks is usually sensitive to data size and distribution; (3) PLMs have excellent transferability between similar tasks. Our experimental results summarize several important findings, which can guide future work in choosing, applying, and designing PLMs for specific tasks. We have made all experimental details publicly available at https://anonymous.4open.science/r/Paper-for-ACL-4FD1. PDF 11 2021
Can Pre-trained Language Models Interpret Similes as Smart as Human? Simile interpretation is a crucial task in natural language processing. Nowadays, pre-trained language models (PLMs) have achieved state-of-the-art performance on many tasks. However, it remains under-explored whether PLMs can interpret similes or not. In this paper, we investigate the ability of PLMs in simile interpretation by designing a novel task named Simile Property Probing, i.e., to let the PLMs infer the shared properties of similes. We construct our simile property probing datasets from both general textual corpus and human-designed questions, which contain a total of 1,633 examples covering seven main categories. Our empirical study based on the constructed datasets shows that PLMs exhibit the ability to infer shared properties of similes, while they still underperform humans. To bridge the gap with human performance, we additionally design a knowledge-enhanced training objective by incorporating the simile knowledge into PLMs via knowledge embedding methods. Our method brings up to an 8.58% gain in the probing task, and up to a 1.37% gain in the downstream task of sentiment classification. The datasets and code will be publicly available soon. PDF 11 2021
DrugEHRQA: A Question Answering Dataset on Structured and Unstructured Electronic Health Records For Medicine Related Queries This paper develops the first question answering dataset (DrugEHRQA) containing question-answer pairs from both structured tables and unstructured notes from a publicly available Electronic Health Record (EHR). EHRs contain patient records, stored in structured tables as well as unstructured clinical notes. The information in structured and unstructured EHR records is not strictly disjoint: information may be duplicated, contradictory, or provide additional context between these sources. This presents a rich opportunity to study question answering (QA) models that combine reasoning over both structured and unstructured data. Additionally, we propose a novel methodology that automatically generates a large QA dataset by retrieving answers from both structured and unstructured EHR records. The automatically-generated dataset has medication-related queries, containing over 70,000 question-answer pairs. Our dataset is validated for both individual modalities using state-of-the-art QA models. In order to address the problem arising from complex, nested queries, this is the first time Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers (RAT-SQL) has been used for EHR data. Finally, we introduce a rule-based method to obtain multi-modal answers, combining the answers from the different modalities. Our goal is to provide a benchmark dataset for multi-modal QA systems, and to open up new avenues of research in improving question answering over EHR structured data by using context from unstructured clinical data. PDF 11 2021
An Empirical Survey of the Effectiveness of Debiasing Techniques for Pre-trained Language Models Recent work has shown that pre-trained language models capture social biases from the text corpora they are trained on. This has attracted attention to developing techniques that mitigate such biases. In this work, we perform an empirical survey of five recently proposed bias mitigation techniques: Counterfactual Data Augmentation (CDA), Dropout, Iterative Nullspace Projection, Self-Debias, and SentenceDebias. We quantify the effectiveness of each technique using three intrinsic bias benchmarks while also measuring the impact of these techniques on a model's language modeling ability, as well as its performance on downstream NLU tasks. We experimentally find that: (1) Self-Debias is the strongest debiasing technique, obtaining improved scores on all bias benchmarks; (2) current debiasing techniques perform less consistently when mitigating non-gender biases; and (3) improvements on bias benchmarks such as StereoSet and CrowS-Pairs achieved by debiasing strategies are often accompanied by a decrease in language modeling ability, making it difficult to determine whether the bias mitigation was effective. PDF 11 2021
Hansel: A Chinese Few-Shot and Zero-Shot Entity Linking Benchmark Modern Entity Linking (EL) systems entrench a popularity bias, yet there is no dataset focusing on tail and emerging entities in languages other than English. We present Hansel, a new benchmark in Chinese that fills the gap in non-English few-shot and zero-shot EL challenges. Hansel is human-annotated and reviewed, created with a novel method for collecting zero-shot EL datasets. It is a diverse dataset covering 8.2K documents from news, social media posts and other web articles, with Wikidata as its target Knowledge Base. We demonstrate that the existing state-of-the-art EL system performs poorly on Hansel (R@1 of 35.8% on Few-Shot). We then establish a strong baseline that scores an R@1 of 43.2% on Few-Shot and 76.6% on Zero-Shot on our dataset. We also show that our baseline achieves competitive results on the TAC-KBP2015 Chinese Entity Linking task. PDF 11 2021
On the Robustness of Reading Comprehension Models to Entity Renaming We study the robustness of machine reading comprehension (MRC) models to entity renaming---do models make more wrong predictions when answer entities have different names? Such failures imply that models overly rely on entity information to answer questions, and thus may generalize poorly when facts about the world change or questions are asked about novel entities. To systematically audit this issue, we present a general and scalable pipeline to replace entity names with names from a variety of sources, ranging from common English names to names from other languages to arbitrary strings. Across five datasets and three pretrained model architectures, MRC models consistently perform worse when entities are renamed, with particularly large accuracy drops on datasets constructed via distant supervision. We also find large differences between models: SpanBERT, which is pretrained with span-level masking, is more robust than RoBERTa, despite having similar accuracy on unperturbed test data. We further experiment with different masking strategies as the continual pretraining objective and find that entity-based masking can improve the robustness of MRC models. PDF 11 2021
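A toy version of the renaming perturbation, assuming plain string substitution; the actual pipeline described above additionally handles mention detection and keeps all answer aliases aligned.

    import re

    def rename_entities(passage, answer, mapping):
        """Replace each entity name in both the passage and the gold answer
        so the perturbed QA example stays internally consistent."""
        for old, new in mapping.items():  # list longer names first to avoid partial hits
            pattern = re.compile(r"\b" + re.escape(old) + r"\b")
            passage = pattern.sub(new, passage)
            answer = pattern.sub(new, answer)
        return passage, answer

    passage = "Marie Curie won the Nobel Prize in 1903. Curie was born in Warsaw."
    print(rename_entities(passage, "Marie Curie",
                          {"Marie Curie": "Ana Silva", "Curie": "Silva"}))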
ReadE: Learning Relation-Dependent Entity Representation for Knowledge Graph Completion Conventional knowledge graph embedding methods learn semantic representations for entities, capturing their intrinsic interactions through powerful graph neural networks. However, previous methods represent each node solely with a coarse-grained unique representation, regardless of how different relations vary in the entity semantics they emphasize. To tackle this problem, we propose ReadE, a method to learn relation-dependent entity representations whose semantic information is emphasized by varied relation types. First, we propose a relation-controlled gating mechanism that utilizes the relation to control the information flow in the aggregation step of the graph neural network. Second, we propose a contrastive learning method that mixes relation-level and entity-level negative samples to enhance the semantics preserved in relation-dependent entity representations. Experiments on three benchmarks show that our proposed model outperforms all strong baselines. The code will be made open-source on GitHub. PDF 11 2021
Evidence Decomposition Graph Network for Fact Verification Fact verification is the task of verifying a given claim according to extracted evidence sentences. Most existing works use whole evidence sentences or break them into phrases to perform evidence interaction, treating evidence either too coarsely or in an over-fragmented way. We also find that many models suffer from exposure bias, which ultimately leads them to attend only to the evidence ranked higher by previous steps while failing to recognize crucial pieces among all candidates. In this paper, we propose an Evidence Decomposition Graph Network (EDGN), which decomposes each evidence sentence, especially complex ones, into several simple sentences, highlighting the required key information without losing sentence structure and meaning. EDGN also incorporates a simple but effective evidence shuffling method to mitigate exposure bias. Experiments on the FEVER benchmark show our model can take all evidence candidates into account, distill the necessary key information from complex evidence, and outperform existing methods in the literature. We will release our code to the community for further exploration. PDF 11 2021
Augmenting Memory Networks for Rich and Efficient Retrieval in Grounded Dialogue Grounded dialogue consists of conditioning a conversation on additional latent inputs ("factoids") beyond the dialogue context, such as Wikipedia articles, IMDB reviews, personas, and images. Due to a scarcity of <context, factoid> labels, it is common practice to jointly learn the knowledge-selection and grounded response generation tasks end-to-end. When conditioning the response on these factoids, previous work has either treated the factoids as a weighted average vector, or separately computed probabilities for each <context, factoid> pair. However, the former creates a bottleneck whilst the latter prevents factoids from being considered jointly. Our new method, PolyMemNet, learns a matrix representation of the context and factoids, allowing multiple factoids to be jointly considered in response selection without imposing a bottleneck. We show how this achieves up to a $17\%$ boost in knowledge-selection accuracy and $13\%$ in response-selection accuracy versus memory networks. PDF 11 2021
gaBERT — an Irish Language Model The BERT family of neural language models has become highly popular due to its ability to provide sequences of text with rich context-sensitive token encodings that generalise well to many Natural Language Processing tasks. We introduce gaBERT, a monolingual BERT model for the Irish language. We compare our gaBERT model to multilingual BERT and monolingual WikiBERT, and show that gaBERT provides better representations for a downstream parsing task. We also show how different filtering criteria, vocabulary sizes and choices of subword tokenisation model affect downstream performance. We release gaBERT and related code to the community. PDF 11 2021
Repetition Facilitates Processing: The Processing Advantage of Construction Repetition in Dialogue Repetitions occur frequently in dialogue. This study focuses on the repetition of lexicalised constructions—i.e., recurring multi-word units—in English open domain spoken dialogues. We hypothesise that construction repetition is an efficient communication strategy that reduces processing effort, and make three predictions based on this hypothesis. Our three predictions are confirmed: repetitions facilitate the processing of constructions and of their linguistic context; facilitating effects are higher when repetitions accumulate, and lower when repetitions are less locally distributed. We measure reduction in processing effort using two surprisal-based measures and estimate surprisal with an adaptive neural language model. Our findings suggest that human-like patterns of repetitions can be learned implicitly by utterance generation models equipped with psycholinguistically motivated surprisal-based objectives and adaptation mechanisms. PDF 11 2021
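The surprisal-based measures mentioned above reduce to a simple quantity: the negative log-probability a language model assigns to a token given its context. A tiny worked example (the probabilities are made up for illustration):

    import math

    def surprisal(prob):
        """Surprisal in bits: -log2 p(token | context).
        Lower surprisal indicates easier processing."""
        return -math.log2(prob)

    # if repetition makes a construction more predictable, its surprisal drops:
    print(surprisal(0.02))  # first occurrence:   ~5.64 bits
    print(surprisal(0.20))  # after repetitions:  ~2.32 bits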
Exploring Meaning Encoded in Random Character Sequences with Character-Aware Language Models Natural language processing models learn word representations based on the distributional hypothesis, which asserts that word context (e.g., co-occurrence) correlates with semantic meaning. We propose that n-grams composed of random character sequences, or garble, provide a novel context for studying word meaning both within and beyond extant language. In particular, randomly-generated character n-grams lack semantic meaning but contain primitive information based on the distribution of characters they contain. By studying the embeddings of a large corpus of garble, extant language, and pseudowords using CharacterBERT, we identify an axis in the model's high-dimensional embedding space that separates these classes of n-grams. Furthermore, we show that this axis relates to structure within extant language, including word part of speech, morphology, and concreteness. Thus, in contrast to studies that are mainly limited to extant language, our work reveals that semantic meaning and primitive information are intrinsically linked. PDF 11 2021
The Sensitivity of Annotator Bias to Task Definitions NLP models are biased by the data they are trained on, including how it is annotated, yet NLP research increasingly examines the social biases of models, often in the light of their training data. This paper is the first to examine to what extent social bias is sensitive to how data is annotated. We do so by collecting annotations of arguments in the same documents following four different guidelines and from four different demographic annotator backgrounds. We show that annotations exhibit widely different levels of group disparity depending on which guidelines annotators follow. The differences are not explained by task complexity, but rather by characteristics of these groups, as previously identified by sociological studies. PDF 11 2021
Incremental Topic Modeling for Scientific Trend Topics Extraction Owing to the exponential growth of scientific research and the resulting volume of scientific publications and reports, one of the most urgent and challenging tasks today is the early detection of trending topics. In this paper, we investigate recent topic modeling approaches for accurately extracting trending topics at an early stage. We suggest an incremental training technique so that the model can operate on data in real time. For validation, we propose a novel dataset that contains a collection of early-stage articles and a set of key collocations for each trend. The proposed metric estimates the delay in days when determining a trend, and the developed matching method suffices to calculate it automatically. The conducted experiments demonstrate that the topic model with regularization, namely ARTM, is superior to the base PLSA model. Apart from that, the best ARTM-based model is able to extract most of the labeled trends during the first year of their evolution. PDF 11 2021
Good Examples Make A Faster Learner: Simple Demonstration-based Learning for Low-resource NER Recent advances in prompt-based learning have shown strong results on few-shot text classification by using cloze-style templates. Similar attempts have been made on named entity recognition (NER), manually designing templates to predict entity types for every text span in a sentence. However, such methods may suffer from error propagation induced by entity span detection, high cost due to enumeration of all possible text spans, and omission of inter-dependencies among token labels in a sentence. Here we present a simple demonstration-based learning method for NER, which prefaces the input with task demonstrations for in-context learning. We perform a systematic study of demonstration strategy regarding what to include (entity examples, with or without surrounding context), how to select the examples, and what templates to use. Results on in-domain learning and domain adaptation show that the model's performance in low-resource settings can be largely improved with a suitable demonstration strategy (e.g., a 4-17% improvement on 25 training instances). We also find that good demonstrations can save many labeled examples and that consistency in demonstrations contributes to better performance. PDF 11 2021
Improving GPT-3 after deployment with a dynamic memory of feedback Large LMs such as GPT-3, while powerful, are not immune to mistakes, but are prohibitively costly to retrain. One failure mode is misinterpreting a user's instruction (e.g., GPT-3 interpreting "What word is similar to `good'?" to mean a homonym, while the user intended a synonym). Our goal is to allow users to correct such errors directly through interaction, without retraining. Our approach is to pair GPT-3 with a growing memory of cases where the model misunderstood the user's intent and was provided with feedback clarifying the instruction. Given a new query, our memory-enhanced GPT-3 uses feedback from similar prior queries to enrich the prompt. Through simple proof-of-concept experiments, we demonstrate how a user can interactively teach a deployed GPT-3, doubling its accuracy on basic lexical tasks (e.g., generate a synonym) where users query in different, novel (often misunderstood) ways. In such scenarios, memory helps avoid repeating similar past mistakes. Our simple idea is a first step towards strengthening deployed models, potentially broadening their utility. PDF 11 2021
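A proof-of-concept sketch of the memory mechanism: store (misunderstood query, clarifying feedback) pairs, and for a new query prepend the feedback of the most similar stored case. The similarity function and prompt format here are assumptions for illustration, not the paper's exact design.

    from difflib import SequenceMatcher

    memory = []  # grows after deployment: (misunderstood query, feedback)

    def remember(query, feedback):
        memory.append((query, feedback))

    def enrich_prompt(query, threshold=0.6):
        """Prepend feedback from the most similar past query, if similar
        enough, so the model can avoid repeating a past misunderstanding."""
        score = lambda m: SequenceMatcher(None, query, m[0]).ratio()
        best = max(memory, key=score, default=None)
        if best and score(best) >= threshold:
            return "Clarification: " + best[1] + "\n" + query
        return query

    remember("What word is similar to 'good'?",
             "'Similar to' means a synonym, not a homonym.")
    print(enrich_prompt("What word is similar to 'happy'?"))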
Evaluating Inclusivity, Equity, and Accessibility of NLP Technology: A Case Study for Indian Languages In order for NLP technology to be widely applicable and useful, it needs to be inclusive of users across the world's languages, equitable, i.e., not unduly biased towards any particular language, and accessible to users, particularly in low-resource settings where compute constraints are common. In this paper, we propose an evaluation paradigm that assesses NLP technologies across all three dimensions, hence quantifying the diversity of users they can serve. While inclusion and accessibility have received attention in recent literature, equity is currently unexplored. We propose to address this gap using the Gini coefficient, a well-established metric used for estimating societal wealth inequality. Using our paradigm, we highlight the distressed state of diversity of current technologies for Indian (IN) languages. Our focus on IN is motivated by their linguistic diversity and their large, varied speaker population. To improve upon these metrics, we demonstrate the importance of region-specific choices in model building and dataset creation and also propose a novel approach to optimal resource allocation during fine-tuning. Finally, we discuss steps that must be taken to mitigate these biases and call upon the community to incorporate our evaluation paradigm when building linguistically diverse technologies. PDF 11 2021
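The Gini coefficient itself is straightforward to compute. A minimal sketch applied to hypothetical per-language utility scores (the numbers are invented; 0 means perfect equity, values approaching 1 mean extreme inequality):

    def gini(values):
        """Gini coefficient over a list of non-negative scores."""
        xs = sorted(values)
        n = len(xs)
        weighted_sum = sum((i + 1) * x for i, x in enumerate(xs))
        return (2 * weighted_sum) / (n * sum(xs)) - (n + 1) / n

    # hypothetical per-language utility scores
    print(gini([0.9, 0.9, 0.9, 0.9]))    # 0.0: perfectly equitable
    print(gini([0.9, 0.4, 0.2, 0.05]))   # ~0.44: highly unequal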
Image Retrieval from Contextual Descriptions The ability to integrate context, including perceptual and temporal cues, plays a pivotal role in grounding the meaning of a linguistic utterance. In order to measure to what extent current vision-and-language models master this ability, we devise a new multimodal challenge, Image Retrieval from Contextual Descriptions (ImageCoDe). In particular, models are tasked with retrieving the correct image from a set of 10 minimally contrastive candidates based on a contextual description. As such, each description contains only the details that help distinguish between images. Because of this, descriptions tend to be complex in terms of syntax and discourse and require drawing pragmatic inferences. Images are sourced from both static pictures and video frames. We benchmark several state-of-the-art models, including both cross-encoders such as ViLBERT and bi-encoders such as CLIP, on ImageCoDe. Our results reveal that these models dramatically lag behind human performance: the best variant achieves an accuracy of 20.9 on video frames and 59.4 on static pictures, compared with 90.8 for humans. Furthermore, we experiment with new model variants that are better equipped to incorporate visual and temporal context into their representations, which achieve modest gains. Our hope is that ImageCoDe will foster progress in grounded language understanding by encouraging models to focus on fine-grained visual differences. PDF 11 2021
Training Dynamics for Text Summarization Models Pre-trained language models (e.g. BART) have shown impressive results when fine-tuned on large summarization datasets. However, little is understood about this fine-tuning process, including what knowledge is retained from pre-training models or how content selection and generation strategies are learnt across iterations. In this work, we analyze the training dynamics for generation models, focusing on news summarization. Across different datasets (CNN/DM, XSum, MediaSum) and model behaviors (content selection, abstractiveness, hallucination), we study what the model learns at different stages of its fine-tuning process. We find that properties such as copy behavior and content selection are learnt earlier in the training process and these observations are robust across domains. On the other hand, factual errors, such as hallucination of unsupported facts, are learnt in the later stages, and this behavior is more varied across domains. Based on these observations, we demonstrate two techniques for modifying training: first, disregarding high-loss tokens that are challenging to learn and second, disregarding low-loss tokens that are learnt very quickly. We show that these simple modifications can help achieve different goals, such as improving factuality or improving abstractiveness. PDF 11 2021
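A sketch of the two loss-masking modifications described above: compute per-token cross-entropy, then drop the hardest (or easiest) tokens before averaging. The quantile cutoff is an illustrative choice.

    import torch
    import torch.nn.functional as F

    def filtered_loss(logits, targets, quantile=0.9, drop="high"):
        """Cross-entropy that disregards high-loss tokens (hard to learn)
        or low-loss tokens (learnt very quickly)."""
        per_token = F.cross_entropy(logits, targets, reduction="none")
        cutoff = torch.quantile(per_token.detach(),
                                quantile if drop == "high" else 1 - quantile)
        keep = per_token <= cutoff if drop == "high" else per_token >= cutoff
        return per_token[keep].mean()

    logits = torch.randn(100, 50257, requires_grad=True)  # (tokens, vocab)
    targets = torch.randint(0, 50257, (100,))
    filtered_loss(logits, targets, drop="high").backward()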
SHIELD: Defending Textual Neural Networks against Black-Box Adversarial Attacks with Stochastic Multi-Expert Patcher Even though several methods have been proposed to defend textual neural network (NN) models against black-box adversarial attacks, they often defend against a specific text perturbation strategy and/or require re-training the models from scratch. This leads to a lack of generalization in practice and redundant computation, particularly since state-of-the-art transformer models (e.g., BERT, RoBERTa) require substantial time and computational resources. Borrowing an idea from software engineering, we propose a novel algorithm, SHIELD, which modifies and re-trains only the last layer of a textual NN, thus "patching" and "transforming" the NN into a stochastic weighted ensemble of multi-expert prediction heads. Considering that most current black-box attacks rely on iterative search mechanisms to optimize their adversarial perturbations, SHIELD confuses the attackers by automatically utilizing different weighted ensembles of predictors depending on the input. In other words, SHIELD breaks a fundamental assumption of the attack: that the victim NN model remains constant during an attack. Through comprehensive experiments, we demonstrate that CNN, RNN, BERT, and RoBERTa-based textual NNs, once patched by SHIELD, exhibit a relative enhancement of 15%--70% in accuracy on average against 14 different black-box attacks, outperforming 6 defensive baselines across 3 public datasets. All code is to be released. PDF 11 2021
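A sketch of the patched last layer: several prediction heads whose outputs are mixed with freshly sampled random weights on every forward pass, so the effective model an attacker probes changes between queries. The head count and Dirichlet sampling are illustrative assumptions.

    import torch
    import torch.nn as nn

    class StochasticMultiHead(nn.Module):
        def __init__(self, d_model, n_classes, n_heads=5):
            super().__init__()
            self.heads = nn.ModuleList(nn.Linear(d_model, n_classes)
                                       for _ in range(n_heads))

        def forward(self, features):
            # resample ensemble weights per query to confuse iterative attacks
            weights = torch.distributions.Dirichlet(
                torch.ones(len(self.heads))).sample()
            logits = torch.stack([h(features) for h in self.heads])
            return (weights.view(-1, 1, 1) * logits).sum(dim=0)

    head = StochasticMultiHead(d_model=256, n_classes=2)
    print(head(torch.randn(4, 256)))  # two successive calls generally differ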
Transductive Learning for Abstractive News Summarization Pre-trained and fine-tuned news summarizers are expected to generalize to news articles unseen in the fine-tuning (training) phase. However, these articles often contain specifics, such as events and people, that a summarizer could not learn about in training. This applies to scenarios such as a news publisher training a summarizer on dated news and wanting to summarize incoming recent news. In this work, we explore the first application of transductive learning to summarization, where we further fine-tune models on the test set's input. Specifically, we construct references for learning from articles' salient sentences and condition on the randomly masked articles. We show that this approach is also beneficial in the fine-tuning phase when extractive references are jointly predicted with abstractive ones in the training set. In general, extractive references are inexpensive to produce as they are created automatically without human effort. We show that our approach yields state-of-the-art results on the CNN/DM and NYT datasets, for instance, more than a 1 ROUGE-L point improvement on the former. Moreover, we show the benefits of transduction from dated to more recent CNN news. Finally, through human and automatic evaluation, we demonstrate improvements in summary abstractiveness and coherence. PDF 11 2021
Fine-Grained Controllable Text Generation Using Non-Residual Prompting The introduction of immensely large Causal Language Models (CLMs) has rejuvenated interest in open-ended text generation. However, controlling the generative process for these Transformer-based models remains largely an unsolved problem. Earlier work has explored either plug-and-play decoding strategies, or more powerful but blunt approaches such as prompting. There hence currently exists a trade-off between fine-grained control and the capability for more expressive high-level instructions. To alleviate this trade-off, we propose an encoder-decoder architecture that enables intermediate text prompts at arbitrary time steps. We propose a resource-efficient method for converting a pre-trained CLM into this architecture, and demonstrate its potential on various experiments, including the novel task of contextualized word inclusion. Our method provides strong results in multiple experimental settings, proving itself to be both expressive and versatile. PDF 11 2021
Perturbations in the Wild: Leveraging Human-Written Text Perturbations for Realistic Adversarial Attack and Defense We propose a novel algorithm, ANTHRO, that inductively extracts over 600K human-written text perturbations in the wild and leverages them for realistic adversarial attack. Unlike existing character-based attacks, which often deductively hypothesize a set of manipulation strategies, our work is grounded in actual observations from real-world texts. We find that adversarial texts generated by ANTHRO achieve the best trade-off between (1) attack success rate, (2) semantic preservation of the original text, and (3) stealthiness, i.e. being indistinguishable from human writing and hence harder to flag as suspicious. Specifically, our attacks accomplished around 83% and 91% attack success rates on BERT and RoBERTa, respectively. Moreover, ANTHRO outperformed the TextBugger baseline with a 50% and 40% increase in semantic preservation and stealthiness, respectively, when evaluated by both layperson and professional human workers. ANTHRO can further enhance a BERT classifier's performance in understanding different variations of human-written toxic texts via adversarial training when compared to the Perspective API. All source code will be released. PDF 11 2021
Compact Token Representations with Contextual Quantization for Efficient Document Re-ranking Transformer-based re-ranking models can achieve high search relevance through context-aware soft matching of query tokens with document tokens. To alleviate the runtime complexity of such inference, previous work has adopted a late interaction architecture with pre-computed contextual token representations, at the cost of large online storage. This paper proposes contextual quantization of token embeddings by decoupling document-specific and document-independent ranking contributions during codebook-based compression, which allows effective online decompression and embedding composition for better search relevance. This paper presents an evaluation of the above compact token representation model in terms of relevance and space efficiency. PDF 11 2021
Are We NER Yet? Measuring the Impact of ASR Errors on Named Entity Recognition in Spontaneous Conversation Transcripts Transcriptions of spontaneous human conversations present a significant obstacle for traditional NER models trained on prescriptive written language. The lack of grammatical structure of spoken utterances, combined with word errors introduced by the ASR, makes downstream NLP tasks challenging. In this paper, we examine the impact of ASR errors on the ability of NER models to recover entity mentions from transcripts of spontaneous human conversations in English. We experimentally compare several commercial ASR systems paired with state-of-the-art NER models. We use both publicly available benchmark datasets (Switchboard Named Entity Corpus, SWNE), and the proprietary, real-life dataset of gold (human-transcribed) phone conversation transcripts. To measure the performance of NER models on ASR transcripts, we introduce a new method of token alignment between transcripts. Our findings unequivocally show that NER models trained on the written language struggle when processing transcripts of spontaneous human conversations. The presence of ASR errors only exacerbates the problem. PDF 11 2021
From the Detection of Toxic Spans in Online Discussions to the Analysis of Toxic-to-Civil Transfer We study the task of toxic spans detection, which concerns the detection of the spans that make a text toxic, when detecting such spans is possible. We introduce a dataset for this task, ToxicSpans, which we release publicly. By experimenting with several methods, we show that sequence labeling models perform best, but methods that add generic rationale extraction mechanisms on top of classifiers trained to predict if a post is toxic or not are also surprisingly promising. Finally, we use ToxicSpans and systems trained on it, to provide further analysis of state-of-the-art toxic to non-toxic transfer systems, as well as human performance on that latter task. Our work highlights challenges in finer toxicity detection and mitigation. PDF 11 2021
An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs Large knowledge graphs have been shown to benefit zero-shot evaluation of downstream tasks, through continual pre-training of language models. Yet, little is known about how to optimally learn from this knowledge, and what is the impact of the resulting models on different task partitions. This paper studies the effect of model architectures, loss functions, and knowledge subsets on the generalization of zero-shot models across task partitions. Our experiments show that data size, model size, model architecture, and loss function all play an important role in the accuracy and generalizability of the models. Most of the improvement occurs on questions with short answers and dissimilar answer candidates, which corresponds to the characteristics of the data used for pre-training. These findings inform future work that uses self-supervision with large knowledge graphs in order to create generalizable commonsense reasoning agents. PDF 11 2021
A Simple Unsupervised Approach for Coreference Resolution using Rule-based Weak Supervision Labeled data for the task of Coreference Resolution is a scarce resource, requiring significant human effort. While state-of-the-art coreference models rely on such data, we propose an approach that leverages an end-to-end neural model in settings where labeled data is unavailable. Specifically, using weak supervision, we transfer the linguistic knowledge encoded by Stanford’s rule-based coreference system to the end-to-end model, which jointly learns rich, contextualized span representations and coreference chains. Our experiments on the English OntoNotes corpus demonstrate that our approach effectively benefits from the noisy coreference supervision, producing an improvement over Stanford’s rule-based system (+3.7 F$_1$) and outperforming the previous best unsupervised model (+0.9 F$_1$). Additionally, we validate the efficacy of our method on two other datasets: PreCo and Litbank (+2.5 and +4 F$_1$ on Stanford's system, respectively). PDF 11 2021
FairLex: A Multilingual Benchmark for Evaluating Fairness in Legal Text Processing We present a benchmark suite of four datasets for evaluating the fairness of pre-trained legal language models and the techniques used to fine-tune them for downstream tasks. Our benchmarks cover four jurisdictions (European Council, USA, Switzerland, and China), five languages (English, German, French, Italian and Chinese) and fairness across five attributes (gender, age, nationality/region, language, and legal area). In our experiments, we evaluate pre-trained language models using several group-robust fine-tuning techniques and show that none of these combinations guarantees fairness, nor consistently mitigates group disparities. Furthermore, we analyze what causes performance differences across groups, and how group-robust fine-tuning techniques fail to mitigate group disparities under both representation inequality and temporal distribution shift. PDF 11 2021
Identifying Moments of Change from Longitudinal User Text Identifying changes in individuals' behaviour and mood, as observed via content shared on online platforms, is increasingly gaining importance. Most research to date on this topic focuses on either: (a) identifying individuals at risk or with a certain mental health condition given a batch of posts, or (b) providing equivalent labels at the post level. A disadvantage of such work is the lack of a strong temporal component and the inability to make longitudinal assessments following an individual's trajectory and allowing timely interventions. Here we define a new task: identifying moments of change in individuals on the basis of the content they share online. The changes we consider are sudden shifts in mood (switches) or gradual mood progression (escalations). We have created detailed guidelines for capturing moments of change and a corpus of 500 manually annotated user timelines (18.7K posts). We have developed a variety of baseline models drawing inspiration from related tasks and show that the best performance is obtained through context-aware sequential modelling. We also introduce new metrics for capturing rare events in temporal windows. PDF 11 2021
A Natural Diet: Towards Improving Naturalness of Machine Translation Output Machine translation (MT) evaluation often focuses on accuracy and fluency, without paying much attention to translation style. This means that, even when considered accurate and fluent, MT output can still sound less natural than high quality human translations or text originally written in the target language. Machine translation output notably exhibits lower lexical diversity, and employs constructs that mirror those in the source sentence. In this work we propose a method for training MT systems to achieve a more natural style, i.e. mirroring the style of text originally written in the target language. Our method tags parallel training data according to the naturalness of the target side by contrasting language models trained on natural and translated data. Tagging data allows us to put greater emphasis on target sentences originally written in the target language. Automatic metrics show that the resulting models achieve lexical richness on par with human translations, mimicking a style much closer to sentences originally written in the target language. Furthermore, we find that their output is preferred by human experts when compared to the baseline translations. PDF 11 2021
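A toy version of the tagging step, contrasting two language models to decide whether a target sentence reads as natural text or as translationese. Simple add-one-smoothed unigram models stand in for the paper's language models, and the tiny corpora are invented.

    import math
    from collections import Counter

    def unigram_lm(corpus):
        counts = Counter(w for sent in corpus for w in sent.split())
        total, vocab = sum(counts.values()), len(counts) + 1
        return lambda sent: sum(math.log((counts[w] + 1) / (total + vocab))
                                for w in sent.split())

    natural_lm = unigram_lm(["the plan fell through",
                             "she nailed the interview"])
    translated_lm = unigram_lm(["the plan did not succeed",
                                "she passed the interview well"])

    def tag(sentence):
        """Prefix the sentence with a tag indicating which LM scores it higher."""
        label = ("<natural>" if natural_lm(sentence) > translated_lm(sentence)
                 else "<translationese>")
        return label + " " + sentence

    print(tag("she nailed the interview"))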
Making Transformers Solve Compositional Tasks Several studies have reported the inability of Transformer models to generalize compositionally, a key type of generalization in many NLP tasks such as semantic parsing. In this paper we explore the design space of Transformer models, showing that the inductive biases given to the model by several design decisions significantly impact compositional generalization. We identify Transformer configurations that generalize compositionally significantly better than previously reported in the literature on many compositional tasks. We achieve state-of-the-art results on a semantic parsing compositional generalization benchmark (COGS), and a string edit operation composition benchmark (PCFG). PDF 11 2021
Sequence-to-sequence AMR Parsing with Ancestor Information AMR parsing is the task of automatically mapping a sentence to an AMR semantic graph. The difficulty comes from generating the complex graph structure. The previous state-of-the-art method translates the AMR graph into a sequence, then directly fine-tunes a pretrained sequence-to-sequence Transformer model (BART). However, purely treating the graph as a sequence does not take advantage of structural information about the graph. In this paper, we design several strategies to add the important \textit{ancestor information} into the Transformer Decoder. Our experiments show that we can improve performance on both the AMR 2.0 and AMR 3.0 datasets and achieve new state-of-the-art results. PDF 11 2021
Improved grammatical error correction by ranking elementary edits We offer a rescoring method for grammatical error correction which is based on a two-stage procedure: the first-stage model extracts local edits and the second classifies them as correct or false. We show how to use an encoder-decoder or sequence labeling approach as the first stage of our model. We achieve state-of-the-art quality on the BEA 2019 English dataset even with a weak BERT-GEC basic model. When using a state-of-the-art GECToR edit generator and the combined scorer, our model beats GECToR on BEA 2019 by $2-3\%$. Our model also beats the previous state of the art on Russian, despite using smaller models and less data than the previous approaches. PDF 11 2021
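The two-stage procedure lends itself to a compact sketch; here `difflib` stands in for the paper's edit extractor, and the trivial scorer is a placeholder for the trained edit classifier:

```python
import difflib

def extract_edits(source_tokens, hypothesis_tokens):
    """Stage 1: extract local edits as (start, end, replacement) spans
    from a token-level diff between the source and a corrected hypothesis."""
    matcher = difflib.SequenceMatcher(a=source_tokens, b=hypothesis_tokens)
    return [(i1, i2, hypothesis_tokens[j1:j2])
            for op, i1, i2, j1, j2 in matcher.get_opcodes() if op != "equal"]

def rescore(source_tokens, edits, edit_scorer, threshold=0.5):
    """Stage 2: keep only edits the classifier deems correct, then apply
    them right-to-left so earlier spans keep their indices."""
    kept = [e for e in edits if edit_scorer(source_tokens, e) >= threshold]
    out = list(source_tokens)
    for start, end, repl in sorted(kept, reverse=True):
        out[start:end] = repl
    return out

src = "he go to school yesterday".split()
hyp = "he went to school yesterday".split()
print(rescore(src, extract_edits(src, hyp), edit_scorer=lambda s, e: 1.0))
```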
Invariant Language Modeling Modern pretrained language models are critical components of NLP pipelines. Yet, they suffer from spurious correlations, poor out-of-domain generalization, and biases. Inspired by recent progress in causal machine learning, in particular the invariant risk minimization (IRM) paradigm, we propose invariant language modeling, a framework for learning invariant representations that generalize better across multiple environments. In particular, we adapt a game-theoretic implementation of IRM (IRM-games) to language models, where the invariance emerges from a specific training schedule in which all the environments compete to optimize their own environment-specific loss by updating subsets of the model in a round-robin fashion. In a series of controlled experiments, we demonstrate the ability of our method to (i) remove structured noise, (ii) ignore specific spurious correlations without affecting global performance, and (iii) achieve better out-of-domain generalization. These benefits come with a negligible computational overhead compared to standard training, do not require changing the local loss, and can be applied to any language model architecture. We believe this framework is promising to help mitigate spurious correlations and biases in language models. PDF 11 2021
Modeling Multidimensional Language Matrices to Learn Predictive Text The predictive text in the tray bed of the Chinese typewriter exhibits the “radiating style” and other important patterns, which reflect the main properties of the Chinese language. To let a machine grasp these patterns the way a human does (“once glanced, never forgotten”), we construct multidimensional language matrices (MLM) to represent the characters and/or words of predictive text for Chinese Natural Language Processing (NLP). Using a 2D LM, our approach identifies the core character as the prefix of radiating-outward words and as the suffix of radiating-inward words, yielding the best distribution of the characters in a nine-grid. Using a 3D LM, our approach recognizes, as a human would, the meaning and location of the words in a nine-grid via a “once-learning” mechanism. Although these approaches are proposed for the Chinese language, the methods are extendable to other languages. PDF 11 2021
The Case of Imperfect Negation Cues: A Two-Step Approach for Automatic Negation Scope Resolution Neural network-based methods are the state of the art in negation scope resolution. However, they often use the unrealistic assumption that cue information is completely accurate. Even if this assumption holds, there remains a dependency on engineered features from state-of-the-art machine learning methods. The current study adopted a two-step negation resolving approach to assess whether a bidirectional long short-term memory-based method can be used for cue detection as well, and how inaccurate cue predictions would affect the scope resolution performance. Results suggest that the scope resolution performance is most robust against inaccurate information for models with a recurrent layer only, compared to extensions with a conditional random field layer or a post-processing algorithm. We advocate for more research into the application of automated deep learning on negation cue detection and the effect of imperfect information on scope resolution. PDF 11 2021
Compositional Data Augmentation for Abstractive Conversation Summarization Recent abstractive conversation summarization systems generally rely on large-scale annotated summaries. However, collecting conversations and annotating their corresponding summaries can be time-consuming and labor-intensive. To alleviate the data scarcity issue, in this work, we present a simple yet effective compositional data augmentation method, Compo, for generating diverse and high-quality pairs of conversations and summaries. Specifically, we generate novel conversation and summary pairs by first extracting conversation snippets and summary sentences based on conversation stages and then randomly composing them, constrained by temporal relations and semantic similarities. To deal with the noise in the augmented data, we further utilize knowledge distillation to learn concise representations from a teacher model trained on high-quality data. Extensive experiments on benchmark datasets demonstrate that Compo significantly outperforms prior state-of-the-art baselines in terms of both quantitative and qualitative evaluation, and exhibits a reasonable level of interpretability. PDF 11 2021
Make the Best of Cross-lingual Transfer: Evidence from POS Tagging with over 100 Languages Cross-lingual transfer learning with large multilingual pre-trained models can be an effective approach for low-resource languages with no labeled training data. Existing evaluations of the cross-lingual generalisability of large pre-trained models use datasets with English training data and test data in a selection of target languages. We explore a more extensive transfer learning setup with 65 different source languages and 105 target languages for part-of-speech tagging. Through our analysis, we show that pre-training on both the source and target language, as well as matching language families, writing systems, word order systems, and lexical-phonetic distance, significantly impacts cross-lingual performance. PDF 11 2021
MILLIE: Modular & Iterative Multilingual Open Information Extraction Open Information Extraction (OpenIE) is the task of extracting $(subject, predicate, object)$ triples from natural language sentences. Current OpenIE systems extract all triple slots independently. In contrast, we investigate the hypothesis that it may be beneficial to extract triple slots iteratively: first extract easy slots, followed by the difficult ones by conditioning on the easy slots, and therefore achieve a better overall extraction. Based on this hypothesis, we propose a neural OpenIE system, MILLIE, that operates in an iterative fashion. Due to the iterative nature, the system is also modular: it is possible to seamlessly integrate rule-based extraction systems with a neural end-to-end system, thereby allowing rule-based systems to supply extraction slots which MILLIE can leverage for extracting the remaining slots. We confirm our hypothesis empirically: MILLIE outperforms SOTA systems on multiple languages ranging from Chinese to Arabic. Additionally, we are the first to provide an OpenIE test dataset for Arabic. PDF 11 2021
Comparative Opinion Summarization via Collaborative Decoding Opinion summarization focuses on generating summaries that reflect popular opinions of multiple reviews for a single entity (e.g., a hotel or a product). While generated summaries offer general and concise information about a particular entity, the information may be insufficient to help the user compare multiple entities. Thus, the user may still struggle with the question ``Which one should I pick?'' In this paper, we propose a {\em comparative opinion summarization} task, which is to generate two contrastive summaries and one common summary from two given sets of reviews from different entities. We develop a comparative summarization framework, CoCoSum, which consists of two few-shot summarization models that jointly generate contrastive and common summaries. Experimental results on a newly created benchmark CoCoTrip show that CoCoSum can produce higher-quality contrastive and common summaries than state-of-the-art opinion summarization models. PDF 11 2021
PRIMERA: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization We introduce PRIMERA, a pre-trained model for multi-document representation with a focus on summarization that reduces the need for dataset-specific architectures and large amounts of labeled fine-tuning data. PRIMERA uses our newly proposed pre-training objective designed to teach the model to connect and aggregate information across documents. It also uses efficient encoder-decoder transformers to simplify the processing of concatenated input documents. With extensive experiments on 6 multi-document summarization datasets from 3 different domains in zero-shot, few-shot and fully supervised settings, PRIMERA outperforms current state-of-the-art dataset-specific and pre-trained models in most of these settings by large margins. PDF 11 2021
E-MMAD: Multimodal Advertising Caption Generation Based on Structured Information As multimodal tasks have grown increasingly popular in recent years, large-scale datasets of reliable authenticity are in urgent demand. Therefore, we present an e-commercial multimodal advertising dataset, E-MMAD, which contains 120 thousand valid examples elaborately picked out from 1.3 million real product examples in both Chinese and English. Noticeably, it is one of the largest video captioning datasets in this field, in which each example has its product video (around 30 seconds), title, caption and a structured information table that is observed to play a vital role in practice. We also introduce a fresh task for vision-language research based on E-MMAD: e-commercial multimodal advertising generation, which requires using the aforementioned multimodal product information to generate a textual advertisement. Accordingly, we propose a baseline method that draws on structured information reasoning to address this real-world demand on our dataset. PDF 11 2021
Sequence-to-Sequence Knowledge Graph Completion and Question Answering Knowledge graph embedding (KGE) models represent each entity and relation of a knowledge graph (KG) with low-dimensional embedding vectors. These methods have recently been applied to KG link prediction and question answering over incomplete KGs (KGQA). KGEs typically create an embedding for each entity in the graph, which results in large model sizes on real-world graphs with millions of entities. Their atomic entity representation also necessitates a multi-stage approach to downstream tasks, which limits their utility. We show that an off-the-shelf encoder-decoder Transformer model can serve as a scalable and versatile KGE model obtaining state-of-the-art results for KG link prediction and KGQA. We achieve this by posing KG link prediction as a sequence-to-sequence task and replacing the triple-scoring approach taken by prior KGE methods with a generative decoding approach. Such a simple but powerful method reduces model size by up to 90% compared to conventional KGE models and attains the best performance among small-sized models. An ensemble with a traditional KGE model even sets a new state-of-the-art. After finetuning this model on the task of KGQA over incomplete KGs, our approach outperforms baselines on multiple large-scale datasets without extensive hyperparameter tuning. PDF 11 2021
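A rough illustration of posing link prediction as sequence-to-sequence generation; the verbalization template and separator below are assumptions, not the paper's exact input format:

```python
def verbalize_query(entity, relation, direction="tail"):
    """Turn a KG link-prediction query into a text-to-text input; the
    missing entity becomes the target sequence, decoded token by token."""
    return f"predict {direction}: {entity} | {relation}"

# Each (head, relation, tail) triple yields one training example per direction.
head, rel, tail = ("Barack Obama", "born in", "Honolulu")
examples = [
    (verbalize_query(head, rel, "tail"), tail),
    (verbalize_query(tail, rel, "head"), head),
]
for src, tgt in examples:
    print(f"{src!r} -> {tgt!r}")
```

Because the decoder generates entity names rather than scoring every entity, no per-entity embedding table is needed, which is where the size reduction comes from.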
"That Is a Suspicious Reaction!": Interpreting Logits Variation to Detect NLP Adversarial Attacks Adversarial attacks are a major challenge faced by current machine learning research. These purposely crafted inputs fool even the most advanced models, precluding their deployment in safety-critical applications. Extensive research in computer vision has been carried to develop reliable defense strategies. However, the same issue remains less explored in natural language processing. Our work presents a model-agnostic detector of adversarial text examples. The approach identifies patterns in the logits of the target classifier when perturbing the input text. The proposed detector improves the current state-of-the-art performance in recognizing adversarial inputs and exhibits strong generalization capabilities across different NLP models, datasets, and word-level attacks. PDF 11 2021
Ensembling and Knowledge Distilling of Large Sequence Taggers for Grammatical Error Correction In this paper, we investigate the GEC sequence tagging architecture, focusing on ensembling recent cutting-edge Transformer encoders in their Large configurations. We encourage ensembling models by majority votes on span-level edits because this approach is tolerant of differences in model architecture and vocabulary size. Our best ensemble achieves a new SOTA result, an F$_{0.5}$ score of 76.05 on BEA-2019 (test), even without pre-training on synthetic datasets. Also, we perform model distillation of a trained ensemble to generate new synthetic training datasets, "Troy-Blogs" and "Troy-1BW". Our best single sequence tagging model, pre-trained on the generated Troy- datasets in combination with the publicly available synthetic PIE dataset, achieves a near-SOTA result with an F$_{0.5}$ score of 73.21 on BEA-2019 (test). The code, datasets, and trained models are publicly available. PDF 11 2021
Interpreting character embeddings with perceptual representations: the case of shape, sound, and color Character-level information is included in many NLP models, but evaluating the information encoded in character representations is an open issue. We leverage perceptual representations in the form of shape, sound, and color embeddings to investigate their correlation to textual representations in five languages. This cross-lingual analysis shows that textual character representations correlate strongly with sound representations for languages using an alphabetic script, while shape correlates with featural scripts. We further develop a set of probing classifiers to intrinsically evaluate what phonological information is encoded in character embeddings. Our results suggest that information on features such as voicedness is embedded in both LSTM and transformer-based representations. PDF 11 2021
SOS: Systematic Offensive Stereotyping Bias in Word Embeddings Hate speech detection models aim to provide a safe environment for marginalised social groups to express themselves. However, the bias in these models could lead to silencing those groups. In this paper, we introduce the systematic offensive stereotyping (SOS) bias. We propose a method to measure the SOS bias in different word embeddings and also investigate its influence on the downstream task of hate speech detection. Our results show that SOS bias against various groups exists in widely used word embeddings and that our SOS bias metric correlates positively with the statistics of published surveys on online abuse and extremism. However, we found that it is not easy to prove that bias in word embeddings influences downstream task performance. Finally, we show that SOS bias is more indicative of sexism and racism in the inspected word embeddings when used for sexism and racism detection than social biases. PDF 11 2021
Contrastive Conditional Masked Language Model for Non-autoregressive Neural Machine Translation Inspired by the success of contrastive learning in natural language processing, we incorporate contrastive learning into the conditional masked language model that is extensively used in non-autoregressive neural machine translation (NAT), yielding what we term the Contrastive Conditional Masked Language Model (CCMLM). CCMLM optimizes the similarity of several different representations of the same token in the same sentence, resulting in a richer and more robust representation. We propose two methods to obtain various representations: Contrastive Common Mask and Contrastive Dropout. Positive pairs are different representations of the same token, while negative pairs are representations of different tokens. In the feature space, the model with contrastive loss pulls positive pairs together and pushes negative pairs away. We conduct extensive experiments on four translation directions with different data sizes. The results demonstrate that CCMLM yields consistent and significant improvements, with gains ranging from 0.80 to 1.04 BLEU, and is state-of-the-art on WMT'16 Ro-En (34.18 BLEU). PDF 11 2021
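The pull/push behaviour described above can be sketched as an InfoNCE-style objective; the temperature and the way the two views are produced (two masks or two dropout passes) are assumptions here:

```python
import torch
import torch.nn.functional as F

def contrastive_token_loss(view_a, view_b, temperature=0.1):
    """Representation i in view_a is pulled toward representation i in
    view_b (the same token under a different mask/dropout) and pushed
    away from every other token in the batch."""
    a = F.normalize(view_a, dim=-1)   # (n_tokens, dim)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.t() / temperature  # pairwise similarities
    targets = torch.arange(a.size(0)) # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage: 8 tokens, 16-dim representations from two stochastic passes.
x = torch.randn(8, 16)
print(contrastive_token_loss(x + 0.01 * torch.randn(8, 16), x).item())
```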
Effective Unsupervised Constrained Text Generation based on Perturbed Masking Unsupervised constrained text generation aims to generate text under a given set of constraints without any supervised data. Current state-of-the-art methods stochastically sample edit positions, which may cause unnecessary search steps. In this paper, we propose PMCTG to improve effectiveness by searching for the best position and action in each step. Specifically, PMCTG extends the perturbed masking technique to effectively search for the best edit position. Then it uses the proposed multi-aspect scoring functions to select the edit action, further reducing search difficulty. Since PMCTG does not require supervised data, it can be extended to different generation tasks. We show PMCTG achieves state-of-the-art results in keywords-to-sentence generation and paraphrasing. PDF 11 2021
BORT: Back and Denoising Reconstruction for End-to-End Task-Oriented Dialog A typical end-to-end task-oriented dialog system transforms the context into a dialog state and then generates a response based on it; this process usually faces the problem of error propagation from previously generated inaccurate dialog states and responses, especially in low-resource scenarios. To alleviate these issues, we propose BORT, a back and denoising reconstruction approach for end-to-end task-oriented dialog systems. To improve the accuracy of the dialog state, which is essential for task completion, back reconstruction is used to reconstruct the original input context from the generated dialog state, since an inaccurate dialog state cannot recover its corresponding input context. To enhance the anti-noise capability of the model, denoising reconstruction is used to reconstruct the corrupted dialog state and response. Extensive experiments conducted on MultiWOZ 2.0 and CamRest676 show the effectiveness of BORT, which achieves state-of-the-art performance. Furthermore, BORT demonstrates its advanced capabilities in zero-shot domain scenarios and in low-resource scenarios. PDF 11 2021
The Change that Matters in Discourse Parsing: Estimating the Impact of Domain Shift on Parser Error Discourse analysis allows us to attain high-level inferences of a text document beyond the sentence level. However, the performance of current discourse models is very low on texts outside of the training distribution's coverage. There is a need for a measure that can inform us to what extent our model generalizes from the training to the test sample when these samples may be drawn from distinct distributions. While this can be estimated via distribution shift, we argue that this does not directly correlate with change in the observed error of a classifier (i.e. error-gap). Thus, we propose to use a statistic from the theoretical domain adaptation literature which can be directly tied to error-gap. We study the bias of this statistic as an estimator of error-gap both theoretically and through a large-scale empirical study of over 2400 experiments on 6 discourse datasets from domains including, but not limited to: news, biomedical texts, TED talks, Reddit posts, and fiction. Our results not only motivate our proposal and help us to understand its limitations, but also provide insight on the properties of discourse models and datasets which improve performance in domain adaptation. For instance, we find that non-news datasets are slightly easier to transfer to than news datasets when the training and test sets are very different. We plan to release our code as a Python package to allow practitioners to make more informed model and dataset choices. PDF 11 2021
Evaluating Extreme Hierarchical Multi-label Classification Several natural language processing (NLP) tasks are defined as a classification problem in its most complex form: multi-label hierarchical extreme classification, in which items may be associated with multiple classes from a set of thousands of possible classes organized in a hierarchy, with a highly unbalanced distribution both in terms of class frequency and the number of labels per item. We analyze the state of the art of evaluation metrics based on a set of formal properties, and we define an information-theoretic metric inspired by the Information Contrast Model (ICM). Experiments on synthetic data and a case study on real data show the suitability of the ICM for such scenarios. PDF 11 2021
Tell me who you are and I'll tell you what to do: A Persona Grounded Task Oriented Dialogue Generation System Modern dialogue agents can broadly be categorized as either chit-chat or task-oriented systems. While the purpose of a chit-chat agent is to entertain and engage the user (lubricating the conversation, so to speak), the task-oriented chat-bot is dedicated to fulfilling specific requests (e.g., ticket booking). Current task-oriented agents produce precise but bland and uninteresting responses. While using such agents, a user may interpose personal remarks, and the agent's failure to process and respond to such statements can put the user off. In this paper we propose a system that is persona-specific, can handle chit-chat utterances, and produces responses that add a human element to the conversation, while always remaining grounded on the task. Since current task-oriented datasets do not have persona profiles and do not contain personalized remarks in utterances, we modify an existing dataset (MultiWOZ 2.1) to suit our needs. We give a semi-automated dataset creation method that uses a GPT-2 model trained on the PERSONA-CHAT dataset. A small subset of the obtained data is also manually crafted to obtain gold-standard data. Our framework is based on GPT-2, a Graph Convolution Network (GCN) and a Memory Network, trained on this dataset to generate persona-grounded task-oriented responses. Both automatic and manual evaluation show the effectiveness of our model and dataset. Our proposed system achieves a BLEU score of 12.12 on this new dataset. PDF 11 2021
Rethinking Negative Sampling for Handling Missing Entity Annotations Negative sampling is highly effective in handling missing annotations for named entity recognition (NER). One of our contributions is an analysis of why it works, introducing two insightful concepts: missampling and uncertainty. Empirical studies show that a low missampling rate and high uncertainty are both essential for achieving promising performance with negative sampling. Based on the sparsity of named entities, we also theoretically derive a lower bound for the probability of a zero missampling rate, which depends only on sentence length. The other contribution is an adaptive and weighted sampling distribution that further improves negative sampling via our former analysis. Experiments on synthetic datasets and well-annotated datasets (e.g., CoNLL-2003) show that our proposed approach benefits negative sampling in terms of F1 score and loss convergence. Besides, models with improved negative sampling have achieved new state-of-the-art results on real-world datasets (e.g., EC). PDF 11 2021
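A simplified sketch of span-level negative sampling for NER; this uniform sampler illustrates the baseline idea, whereas the paper's contribution is an adaptive, weighted distribution over such candidates:

```python
import random

def sample_negative_spans(tokens, entity_spans, n_negatives, max_len=4, seed=0):
    """Sample spans with no entity annotation as 'O' training signal;
    spans overlapping a known entity are excluded, which keeps the
    missampling rate low when annotations are incomplete."""
    rng = random.Random(seed)
    def overlaps(s, e):
        return any(not (e <= es or s >= ee) for es, ee in entity_spans)
    candidates = [(s, e)
                  for s in range(len(tokens))
                  for e in range(s + 1, min(s + 1 + max_len, len(tokens) + 1))
                  if not overlaps(s, e)]
    return rng.sample(candidates, min(n_negatives, len(candidates)))

tokens = "John lives in New York City".split()
gold = [(0, 1), (3, 6)]  # "John", "New York City"
print(sample_negative_spans(tokens, gold, n_negatives=3))
```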
Probing Position-Aware Attention Mechanism in Long Document Understanding Long document understanding is a challenging problem in natural language understanding. Most current transformer-based models only employ textual information for attention calculation due to high computational cost. To address these issues for long document understanding, we explore new approaches using different position-aware attention masks and investigate their performance on different benchmarks. Experimental results show that our models hold an advantage in long document understanding across various evaluation metrics. Furthermore, our approach makes changes only to the attention module in the transformer and thus can be flexibly detached and plugged into any other transformer-based solutions with ease. PDF 11 2021
Self-Distilled Pruning of Neural Networks Pruning aims to reduce the number of parameters while maintaining performance close to the original network. This work proposes a novel \emph{self-distillation} based pruning strategy, whereby the representational similarity between the pruned and unpruned versions of the same network is maximized. Unlike previous approaches that treat distillation and pruning separately, we use distillation to inform the pruning criteria, without requiring a separate student network as in knowledge distillation. We show that the proposed {\em cross-correlation objective for self-distilled pruning} implicitly encourages sparse solutions, naturally complementing magnitude-based pruning criteria. Experiments on the GLUE and XGLUE benchmarks show that self-distilled pruning increases mono- and cross-lingual language model performance. Self-distilled pruned models also outperform smaller Transformers with an equal number of parameters and are competitive against (6 times) larger distilled networks. We also observe that self-distillation (1) maximizes class separability, (2) increases the signal-to-noise ratio, and (3) converges faster after pruning steps, providing further insights into why self-distilled pruning improves generalization. PDF 11 2021
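The cross-correlation objective can be sketched as a Barlow Twins-style loss between representations from the pruned and unpruned network; the normalization and off-diagonal weight below are assumptions, not the authors' exact formulation:

```python
import torch

def self_distilled_pruning_loss(pruned_repr, unpruned_repr, off_diag_weight=0.005):
    """Push the diagonal of the cross-correlation matrix toward 1
    (representational similarity between pruned and unpruned networks)
    and the off-diagonal toward 0 (decorrelated, implicitly sparse features)."""
    n, _ = pruned_repr.shape
    a = (pruned_repr - pruned_repr.mean(0)) / (pruned_repr.std(0) + 1e-6)
    b = (unpruned_repr - unpruned_repr.mean(0)) / (unpruned_repr.std(0) + 1e-6)
    c = (a.t() @ b) / n  # (dim, dim) cross-correlation over the batch
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + off_diag_weight * off_diag

reps = torch.randn(32, 64)
print(self_distilled_pruning_loss(reps * 0.9, reps).item())
```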
On Vision Features in Multimodal Machine Translation Previous work on multimodal machine translation (MMT) has focused on ways of incorporating vision features into translation, but little attention has been paid to the quality of the vision models. In this work, we investigate the impact of vision models on MMT. Given that Transformers are becoming popular in computer vision, we experiment with various strong models (such as the Vision Transformer) and enhanced features (such as object detection and image captioning). We develop a selective attention model to study the patch-level contribution of an image in MMT. On detailed probing tasks, we find that stronger vision models are helpful for learning translation from the vision modality. Our results also suggest the need for carefully examining MMT models, especially when current benchmarks are small-scale and biased. PDF 11 2021
CTM - A Model for Large-Scale Multi-View Tweet Topic Classification Automatically associating social media posts with topics is an important prerequisite for effective search and recommendation on many social media platforms. However, topic classification of such posts is quite challenging because of (a) a large topic space, (b) short text with weak topical cues, and (c) multiple topic associations per post. In contrast to most prior work, which only focuses on post classification into a small number of topics ($10-20$), we consider the task of large-scale topic classification in the context of Twitter, where the topic space is $10$ times larger with potentially multiple topic associations per Tweet. We address the challenges above and propose a novel neural model, CTM, that (a) associates tweets with topics from a large topic space of $300$ topics and (b) takes a holistic approach to tweet content modeling -- leveraging multi-modal content, author context, and deeper semantic cues in the tweet. We evaluate CTM quantitatively and show that our method offers an effective way to classify Tweets into topics at scale and is superior in performance to other approaches, yielding a significant relative lift of $\mathbf{20}\%$. PDF 11 2021
Contextualized Sensorimotor Norms: multi-dimensional measures of sensorimotor strength for ambiguous English words, in context Most large language models are trained on linguistic input alone, yet humans appear to ground their understanding of words in sensorimotor experience. A natural solution is to augment LM representations with human judgments of a word's sensorimotor associations (e.g., the Lancaster Sensorimotor Norms), but this raises another challenge: most words are ambiguous, and judgments of words in isolation fail to account for this multiplicity of meaning (e.g., "wooden table" vs. "data table"). We attempted to address this problem by building a new lexical resource of contextualized sensorimotor judgments for 112 English words, each rated in four different contexts (448 sentences total). We show that these ratings encode overlapping but distinct information from the Lancaster Sensorimotor Norms, and that they also predict other measures of interest (e.g., relatedness), above and beyond measures derived from BERT. PDF 11 2021
A Sentence is Worth 128 Pseudo Tokens: A Semantic-Aware Contrastive Learning Framework for Sentence Embeddings Contrastive learning has shown great potential in unsupervised sentence embedding tasks, e.g., SimCSE \citep{gao2021simcse}. However, these existing solutions are heavily affected by superficial features like the length of sentences or syntactic structures. In this paper, we propose a semantic-aware contrastive learning framework for sentence embeddings, termed Pseudo-Token BERT (PT-BERT), which is able to explore the pseudo-token space (i.e., latent semantic space) representation of a sentence while eliminating the impact of superficial features such as sentence length and syntax. Specifically, we introduce an additional pseudo-token embedding layer, independent of the BERT encoder, to map each sentence into a sequence of pseudo tokens of a fixed length. Leveraging these pseudo sequences, we are able to construct same-length positive and negative pairs based on the attention mechanism to perform contrastive learning. In addition, we utilize both gradient-updating and momentum-updating encoders to encode instances while dynamically maintaining an additional queue to store the representations of sentence embeddings, enhancing the encoder's learning performance for negative examples. Experiments show that our model outperforms the state-of-the-art baselines on six standard semantic textual similarity (STS) tasks. Furthermore, experiments on alignment and uniformity losses, as well as hard examples with different sentence lengths and syntax, consistently verify the effectiveness of our method. PDF 11 2021
Temporal Knowledge-Aware Image Captioning Contextualized image captioning is a task that extends beyond generating a purely visual description of the image content and aims to produce a caption that is influenced by the context and informed by real-world knowledge. In this paper, we present an approach to knowledge-aware image captioning, with a specific focus on the temporal domain. We propose a way to identify relevant information in external data sources, such as geographic databases and common knowledge bases, and then encode it in a way that is most useful for the captioning network. We develop an end-to-end caption generation system that incorporates external knowledge into the captioning process at several stages. The system is trained and tested on our novel temporal knowledge-aware captioning dataset, achieving significant improvements over multiple baselines across commonly used metrics. We demonstrate that our approach is effective for generating highly contextualized captions with both relevant and accurate temporal facts. PDF 11 2021
A Generative Language Model for Few-shot Aspect-Based Sentiment Analysis Sentiment analysis is an important task in natural language processing. In recent works, pre-trained language models are often used to achieve state-of-the-art results, especially when training data is scarce. It is common to fine-tune on the downstream task, usually by adding task-specific layers on top of the model. In this paper, we focus on aspect-based sentiment analysis, which involves extracting aspect terms and categories and predicting their corresponding polarities. In particular, we are interested in few-shot settings. We propose to reformulate the extraction and prediction tasks as a sequence generation task, using a generative language model with unidirectional attention (GPT2 is used unless stated otherwise). This way, the model learns to accomplish the tasks via language generation without the need to train task-specific layers. Our evaluation results on single-task polarity prediction show that our approach outperforms the previous state-of-the-art (based on BERT) in average performance by a large margin in few-shot and full-shot settings. More importantly, our generative approach significantly reduces the model variance caused by low-resource data. We further demonstrate that the proposed generative language model can handle joint and multi-tasking settings, unlike previous work. We observe that the proposed sequence generation method achieves further improved performance on polarity prediction when the model is trained via joint and multi-tasking settings. Further evaluation on similar sentiment analysis datasets, SST-2, SST-5 and OOS intent detection validates the superiority and noise robustness of the generative language model in few-shot settings. PDF 11 2021
Language Level Classification on German Texts using a Neural Approach Studies on language level classification (LLC) for German are scarce. Of the few that exist, most use a feature-engineered approach. To the best of our knowledge, there is no deep learning approach on German texts yet. This paper shows that LLC can also be successfully applied to German texts by exploiting different pre-existing neural network architectures. Seven diverse corpora represent the data basis for training the networks: a web-scraped corpus, a corpus created from newspaper articles, three second-language learner corpora, a corpus created by a company that translates complex texts into incrementally simplified versions, and a corpus created from a collection of written examinations covering the whole CEFR spectrum (A1-C2). An approach based on the BERT architecture yielded the best results. The highest F1 scores achieved were 1.0 and 0.83 at the document and sentence level, respectively. PDF 11 2021
UserIdentifier: Implicit User Representations for Simple and Effective Personalized Sentiment Analysis Conventionally trained classification models are trained to be as generalizable as possible, with user invariance considered desirable since the models are shared across multitudes of users. As such, these models are often unable to produce personalized responses for individual users, based on their data. Contrary to widely-used personalization techniques based on few-shot and meta learning, we propose UserIdentifier, a novel scheme for training a single shared model for all users. Our approach produces personalized responses by prepending a fixed, user-specific non-trainable string (called ``user identifier'') to each user's input text. Unlike prior work, this method doesn't need any additional model parameters, any extra rounds of personal few-shot learning or any change made to the vocabulary. We empirically study different types of user identifiers (numeric, alphanumeric and also randomly generated) and demonstrate that, surprisingly, randomly generated user identifiers outperform the prefix-tuning based state-of-the-art approach by up to 13%, on a suite of sentiment analysis datasets. PDF 11 2021
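Because the method only prepends a fixed string, it can be sketched end to end in a few lines; the identifier length and token format are illustrative:

```python
import random
import string

def make_user_identifier(n_tokens=10, seed=None):
    """Generate a random, fixed, non-trainable user identifier string."""
    rng = random.Random(seed)
    return " ".join(
        "".join(rng.choices(string.ascii_lowercase + string.digits, k=4))
        for _ in range(n_tokens)
    )

def personalize(user_id_string, text):
    """Prepend the user's fixed identifier to every input; the shared
    model needs no extra parameters or vocabulary changes."""
    return f"{user_id_string} {text}"

alice = make_user_identifier(seed=1)
print(personalize(alice, "the battery life is disappointing"))
```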
Graph Refinement for Coreference Resolution The state-of-the-art models for coreference resolution are based on independent mention pair-wise decisions. We propose a modelling approach that learns coreference at the document level and takes global decisions. For this purpose, we model coreference links in a graph structure where the nodes are tokens in the text, and the edges represent the relationship between them. Our model predicts the graph in a non-autoregressive manner, then iteratively refines it based on previous predictions, allowing global dependencies between decisions. The experimental results show improvements over various baselines, reinforcing the hypothesis that document-level information improves coreference resolution. PDF 11 2021
Boosting coherence of language models Naturality of long-term information structure -- coherence -- remains a challenge in language generation. Large language models have insufficiently learned such structure, as their long-form generations differ from natural text in measures of coherence. To alleviate this divergence, we propose coherence boosting, an inference procedure that increases the effect of distant context on next-token prediction. We show the benefits of coherence boosting with pretrained models by distributional analyses of generated ordinary text and dialog responses. We also find that coherence boosting with state-of-the-art models for various zero-shot NLP tasks yields performance gains with no additional training. PDF 11 2021
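A hedged sketch of the inference-time idea: contrast predictions from the full context with predictions from a short, recent-only context, so that tokens supported only locally are down-weighted. The exact mixture form and the value of alpha are assumptions:

```python
import numpy as np

def coherence_boosted_logits(logits_full, logits_short, alpha=0.5):
    """Log-linear contrast of full-context and short-context predictions:
    (1 + alpha) * log p_full - alpha * log p_short."""
    return (1 + alpha) * logits_full - alpha * logits_short

# Toy next-token distributions over a 5-token vocabulary.
full = np.log(np.array([0.4, 0.3, 0.1, 0.1, 0.1]))
short = np.log(np.array([0.1, 0.5, 0.2, 0.1, 0.1]))
boosted = coherence_boosted_logits(full, short)
probs = np.exp(boosted) / np.exp(boosted).sum()
print(probs.round(3))  # token 0, favored by the distant context, is boosted
```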
Divide and Rule: Effective Pre-Training for Context-Aware Multi-Encoder Translation Models Multi-encoder models are a broad family of context-aware neural machine translation systems that aim to improve translation quality by encoding document-level contextual information alongside the current sentence. The context encoding is undertaken by contextual parameters, trained on document-level data. In this work, we discuss the difficulty of training these parameters effectively, due to the sparsity of the words in need of context (i.e., the training signal) and their relevant context. We propose to pre-train the contextual parameters over split sentence pairs, which makes efficient use of the available data for two reasons. Firstly, it increases the contextual training signal by breaking intra-sentential syntactic relations, and thus pushing the model to search the context for disambiguating clues more frequently. Secondly, it eases the retrieval of relevant context, since context segments become shorter. We propose four different splitting methods, and evaluate our approach with BLEU and contrastive test sets. Results show that it consistently improves learning of contextual parameters, both in low and high resource settings. PDF 11 2021
Negative-aware Entity Set Expansion Entity Set Expansion (ESE) aims to find all entities of one target semantic class given a few seed entities describing it. However, existing ESE methods cannot express what entities we explicitly dislike, which hinders their application in real-world scenarios. In this paper, to endow models with the capability of understanding the ``dislike'' relationship among seed entities, we express the target semantic class with both positive and negative seed entities. To this end, we propose an efficient and learnable negative-aware entity set expansion framework, which is essentially a retrieval model. To facilitate this study, a large-scale Negative-aware ESE Dataset, NED, with more than 1M entities is further collected and annotated. Extensive experiments on NED show that the proposed framework can effectively understand the dislike relations expressed by the negative seeds and expands fewer disliked entities than baseline methods. PDF 11 2021
Letters from the past: modeling historical sound change through diachronic character embeddings While a great deal of work has been done on NLP approaches to Lexical Semantic Change detection, other aspects of language change have received less attention from the NLP community. In this paper, we address the detection of sound change through historical spelling. We propose that a sound change, a→b / c, can be captured by comparing the relative distance through time between the distributions of the corresponding characters, a and b. We model these distributions using PPMI character embeddings. We verify this hypothesis in synthetic data and then test the method’s ability to trace the well-known historical change of lenition of plosives in Danish historical sources. We show that the models are able to identify several of the changes under consideration and to uncover meaningful contexts in which they appeared. The methodology has the potential to contribute to the study of open questions such as the relative chronology of sound shifts and their geographical distribution. PDF 11 2021
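A small, self-contained sketch of PPMI character embeddings for one time slice; comparing the rows for characters a and b across slices would then trace their relative distance through time. The window size and toy strings are illustrative:

```python
import numpy as np

def ppmi_char_embeddings(corpus, window=2):
    """Represent each character by its positive pointwise mutual
    information with the characters co-occurring within a fixed window."""
    chars = sorted(set("".join(corpus)))
    idx = {c: i for i, c in enumerate(chars)}
    counts = np.zeros((len(chars), len(chars)))
    for text in corpus:
        for i, c in enumerate(text):
            for j in range(max(0, i - window), min(len(text), i + window + 1)):
                if j != i:
                    counts[idx[c], idx[text[j]]] += 1
    total = counts.sum()
    p_row = counts.sum(1, keepdims=True) / total
    p_col = counts.sum(0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.where(counts > 0, np.log((counts / total) / (p_row * p_col)), 0.0)
    return chars, np.maximum(pmi, 0.0)  # PPMI: clip negative PMI to zero

chars, emb = ppmi_char_embeddings(["gade", "gabe", "kage"])  # toy strings
print(chars, emb.shape)
```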
Unsupervised multiple-choice question generation for out-of-domain Q\&A fine-tuning Pre-trained models have shown very good performances on a number of question answering benchmarks especially when fine-tuned on multiple question answering datasets at once. In this work, we propose an approach for generating a fine-tuning dataset thanks to a rule-based algorithm that generates questions and answers from unannotated sentences. We show that the state-of-the-art model UnifiedQA can greatly benefit from such a system on a multiple-choice benchmark about physics, biology and chemistry it has never been trained on. We further show that improved performances may be obtained by selecting the most challenging distractors (wrong answers), with a dedicated ranker based on a pretrained RoBERTa model. PDF 11 2021
Multilingual Syntax-aware Language Modeling through Dependency Tree Conversion Incorporating stronger syntactic biases into neural language models (LMs) is a long-standing goal, but research in this area often focuses on modeling English text, where constituent treebanks are readily available. Extending constituent tree-based LMs to the multilingual setting, where dependency treebanks are more common, is possible via dependency-to-constituency conversion methods. However, this raises the question of which tree formats are best for learning the model, and for which languages. We investigate this question by training recurrent neural network grammars (RNNGs) using various conversion methods, and evaluating them empirically in a multilingual setting. We examine the effect on LM performance across nine conversion methods and five languages through seven types of syntactic tests. On average, the performance of our best model represents a 19% increase in accuracy over the worst choice across all languages. Our best model shows an advantage over sequential/overparameterized LMs, suggesting the positive effect of syntax injection in a multilingual setting. Our experiments highlight the importance of choosing the right tree formalism, and provide insights into making an informed decision. PDF 11 2021
Challenges for Open-domain Targeted Sentiment Analysis Since previous studies on open-domain targeted sentiment analysis are limited in dataset domain variety and restricted to the sentence level, we propose a novel dataset consisting of 6,013 human-labeled examples that extends the data to topics of interest and to the document level. Furthermore, we offer a nested target annotation schema to extract the complete sentiment information in documents, boosting the practicality and effectiveness of open-domain targeted sentiment analysis. Moreover, we leverage the pre-trained model BART in a sequence-to-sequence generation method for the task. Benchmark results show that there is large room for improvement in open-domain targeted sentiment analysis. Meanwhile, experiments have shown that challenges remain in the effective use of open-domain data, long documents, the complexity of target structure, and domain gaps. PDF 11 2021
Aligned Weight Regularizers for Pruning Pretrained Neural Networks Pruning aims to reduce the number of parameters while maintaining performance close to the original network. This work proposes a novel \emph{self-distillation} based pruning strategy, whereby the representational similarity between the pruned and unpruned versions of the same network is maximized. Unlike previous approaches that treat distillation and pruning separately, we use distillation to inform the pruning criteria, without requiring a separate student network as in knowledge distillation. We show that the proposed {\em cross-correlation objective for self-distilled pruning} implicitly encourages sparse solutions, naturally complementing magnitude-based pruning criteria. Experiments on the GLUE and XGLUE benchmarks show that self-distilled pruning increases mono- and cross-lingual language model performance. Self-distilled pruned models also outperform smaller Transformers with an equal number of parameters and are competitive against (6 times) larger distilled networks. We also observe that self-distillation (1) maximizes class separability, (2) increases the signal-to-noise ratio, and (3) converges faster after pruning steps, providing further insights into why self-distilled pruning improves generalization. PDF 11 2021
Clause Attention based on Signal Words Division Clause Attention (CA) is very important for processing long sentences. We build and label datasets for signal-word training. According to the positions of signal words, long sentences are divided into clauses, which are assigned additional block attention. The original sentence is mapped and fed into the shared encoder to learn the representations of words in its clauses. We use attention with a prior to balance global attention against local attention. This improves the quality of long-sentence processing in NER and NMT tasks. PDF 11 2021
QuoteR: A Benchmark of Quote Recommendation for Writing It is very common to use quotations (quotes) to make our writings more elegant or convincing. To help people find appropriate quotes more efficiently, the task of quote recommendation is presented, aiming to recommend quotes that fit the current context of writing. There have been various quote recommendation approaches, but they are evaluated on different unpublished datasets. To facilitate the research on this task, we build a large and fully open quote recommendation dataset called QuoteR, which comprises three parts including English, standard Chinese and classical Chinese. Any part of it is larger than previous unpublished counterparts. We conduct an extensive evaluation of existing quote recommendation methods on QuoteR. Furthermore, we propose a new quote recommendation model that significantly outperforms previous methods on all three parts of QuoteR. All the code and data of this paper will be released. PDF 11 2021
Meet Your Favorite Character: Open-domain Chatbot Mimicking Fictional Characters with only a Few Utterances In this paper, we consider mimicking fictional characters as a promising direction for building engaging conversation models. To this end, we present a new practical task where only a few utterances of each fictional character are available to generate responses mimicking them. Furthermore, we propose a new method named Pseudo Dialog Prompting (PDP) that generates responses by leveraging the power of large-scale language models with prompts containing the target character's utterances. To better reflect the style of the character, PDP builds the prompts in the form of dialog that includes the character's utterances as dialog history. Since only utterances of the characters are available in the proposed task, PDP matches each utterance with an appropriate pseudo-context from a predefined set of context candidates using a retrieval model. Through human and automatic evaluation, we show that PDP generates responses that better reflect the style of fictional characters than baseline methods. PDF 11 2021
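A sketch of the prompt construction; the canned lookup stands in for the retrieval model, and the speaker labels and formatting are assumptions:

```python
def build_pdp_prompt(character, utterances, retrieve_context, user_message):
    """Pair each of the character's few utterances with a retrieved
    pseudo-context, lay them out as a dialog history, and let a large
    LM continue in the character's style."""
    lines = []
    for utt in utterances:
        lines.append(f"User: {retrieve_context(utt)}")
        lines.append(f"{character}: {utt}")
    lines.append(f"User: {user_message}")
    lines.append(f"{character}:")
    return "\n".join(lines)

# Canned pseudo-contexts standing in for a trained retrieval model.
canned = {
    "Elementary, my dear Watson.": "How did you figure that out so fast?",
    "The game is afoot!": "Something is happening, isn't it?",
}
print(build_pdp_prompt("Holmes", list(canned), lambda u: canned[u],
                       "What should we do next?"))
```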
Global Responses to the COVID-19 Pandemic: A Case Study of Spatiotemporal Evidence Finding and Verification This paper explores methods for adapting fact verification models to real-world scenarios that require spatial and temporal inference. As a case study, we search for evidence on governments’ responses to the COVID-19 pandemic. We demonstrate that existing fact verification models perform poorly when the verification requires reasoning about spatiotemporal information. The suggested techniques lead to great improvements, and we recommend implementing them for such uses. PDF 11 2021
Data-adaptive Transfer Learning for Low-resource Translation: A Case Study in Haitian Multilingual transfer techniques often improve low-resource machine translation (MT). Many of these techniques are applied without considering data characteristics. We show in the context of Haitian-to-English translation that transfer effectiveness is correlated with amount of training data and relationships between knowledge-sharing languages. Our experiments suggest that beyond a threshold of authentic data, back-translation augmentation methods are counterproductive, while cross-lingual transfer during training is preferred. We complement this finding by contributing a rule-based French-Haitian orthographic and syntactic engine and a novel method for phonological embedding. When used with multilingual techniques, orthographic transformation significantly improves performance over conventional methods, and phonological transfer greatly improves performance in Jamaican MT. PDF 11 2021
Learning Scalable Representation for Source Code This paper presents a scalable distributed code representation (SDCR) learning technique, which addresses the most common sparsity and out-of-vocabulary (OoV) concerns simultaneously. We introduce the abstract syntax tree (AST) to reflect the structural information of a code snippet and adopt the well-recognized 'bag of AST paths' as its intermediate representation, so that the unique structural and syntactic information of programs can be captured. Our proposed SDCR is supported by two core pillars. First, we provide a comprehensive empirical study showing that only 1% of the AST paths can account for approximately 75% of the AST path occurrences. That is, dropping most of the unnecessary AST paths still allows SDCR to perform well. Second, all AST paths (without leaf nodes in the AST) are made up of a limited number of descriptive path elements, for which a lightweight encoder can produce a good embedding of any AST path. Incorporating these two pillars enables us to represent code snippets with better generalizability and scalability. Based on extensive experiments on two real-world datasets, we show that SDCR has superior performance against the state of the art with a nearly 40% reduction in the number of model parameters. PDF 11 2021
Improved Multi-label Classification under Temporal Concept Drift: Rethinking Group-Robust Algorithms in a Label-Wise Setting In document classification for, e.g., legal and biomedical text, we often deal with hundreds of classes, including very infrequent ones, as well as temporal concept drift caused by the influence of real-world events, e.g., policy changes, conflicts, or pandemics. Both class imbalance and drift are often approached by resampling the training data to simulate (or compensate for) a known target distribution, but what if the target distribution is determined by unknown future events? Instead of resampling uniformly to hedge our bets, we focus on the underlying optimization algorithms used to train such document classifiers and evaluate several group-robust optimization algorithms, initially proposed to mitigate group-level disparities. Reframing group-robust algorithms as adaptation algorithms under concept drift, we find that Invariant Risk Minimization and Spectral Decoupling outperform sampling-based approaches to class imbalance and concept drift, and lead to much better performance on minority classes. The effect is more pronounced the larger the label set. PDF 11 2021
Self-conditioning pre-trained language models We present a method to condition pre-trained Transformer-based Language Models without fine-tuning or using additional parameters. Our approach leverages the presence of existing \emph{expert units} in the model that can be used to steer text generation. We describe how to identify such expert units and propose an inference-time intervention upon them that allows conditioning. Results show that our method is effective for conditioning, even on fine-grained homograph concepts. Furthermore, we use a large corpus of contexts that highlights the presence of inherited gender bias in the output generated by an unconditioned model. Our experiments show that our method can be used to correct this behaviour and to achieve gender parity for all of the contexts. We compare our method with PPLM-BoW (Dathathri et al., 2020), and show that our approach is able to achieve parity at a much lower perplexity. The proposed method is accessible to a wide audience thanks to its simplicity and minimal compute needs. PDF 11 2021
Textomics: A Dataset for Genomics Data Summary Generation Summarizing biomedical discoveries from genomics data using natural language is an essential step in biomedical research but is mostly done manually. Here, we introduce Textomics, a novel dataset of genomics data descriptions, which contains 22,273 pairs of a genomics data matrix and its summary. Each summary is written by the researchers who generated the data and is associated with a scientific paper. Based on this dataset, we study two novel tasks: generating a textual summary from a genomics data matrix and vice versa. Inspired by the successful applications of $k$ nearest neighbors in modeling genomics data, we propose a $k$NN-Vec2Text model to address these tasks and observe substantial improvement on our dataset. We further illustrate how Textomics can be used to advance other applications, including evaluating scientific paper embeddings and generating masked templates for scientific paper understanding. Textomics serves as the first benchmark for generating textual summaries for genomics data and we envision it will be broadly applied to other biomedical and natural language processing applications. PDF 11 2021
Co-training an Unsupervised Constituency Parser with Weak Supervision We introduce a method for unsupervised parsing that relies on bootstrapping classifiers to identify if a node dominates a specific span in a sentence. There are two types of classifiers, an inside classifier that acts on a span, and an outside classifier that acts on everything outside of a given span. Through self-training and co-training with the two classifiers, we show that the interplay between them helps improve the accuracy of both, and as a result, effectively parse. A seed bootstrapping technique prepares the data to train these classifiers. Our analyses further validate that such an approach in conjunction with weak supervision using prior branching knowledge of a known language (left/right-branching) and minimal heuristics injects strong inductive bias into the parser, achieving 63.1 F$_1$ on the English (PTB) test set. In addition, we show the effectiveness of our architecture by evaluating on treebanks for Chinese (CTB) and Japanese (KTB) and achieve new state-of-the-art results.\footnote{For code or data, please contact the authors.} PDF 11 2021
Structural Characterization for Dialogue Disentanglement Tangled multi-party dialogue contexts lead to challenges for dialogue reading comprehension, where multiple dialogue threads flow simultaneously within a common dialogue history, increasing the difficulty of understanding a dialogue history for both humans and machines. Previous studies mainly focus on utterance encoding methods with carefully designed features and pay inadequate attention to characteristic features of the structure of dialogues. We specifically take dialogue structure factors into account and design a novel model for dialogue disentanglement. Based on the fact that dialogues are constructed through the successive participation of speakers and interactions between users of interest, we extract clues of speaker properties and references to users to model the structural information of dialogues. The proposed method achieves a new state of the art on the benchmark dataset and contributes to dialogue-related comprehension. PDF 11 2021
Composing Structure-Aware Batches for Pairwise Sentence Classification Identifying the relation between two sentences requires datasets with pairwise annotations. In many cases, these datasets contain instances that are annotated multiple times as part of different pairs. They constitute a structure that carries additional helpful information about the inter-relatedness of the text instances based on the annotations. This paper investigates how this kind of structural dataset information can be exploited during training. We propose three batch composition strategies to incorporate such information and measure their performance over 14 heterogeneous pairwise sentence classification tasks. Our results show statistically significant improvements (up to 3.9%), independent of the pre-trained language model, for most tasks compared to baselines that follow a standard training procedure. Further, we see that even this baseline procedure can profit from having such structural information in a low-resource setting. PDF 11 2021
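One plausible instance of such a batch composition strategy is to group pairs that share a sentence into the same batch; the paper's three strategies may differ from this toy version.

```python
# Toy sketch: batch together pairs that share an instance.
from collections import defaultdict

pairs = [("s1", "s2", 1), ("s1", "s3", 0), ("s4", "s5", 1), ("s3", "s6", 0)]

by_sentence = defaultdict(list)
for k, (a, b, _) in enumerate(pairs):
    by_sentence[a].append(k)
    by_sentence[b].append(k)

batches, seen = [], set()
for anchor, members in by_sentence.items():
    group = [k for k in members if k not in seen]  # unbatched pairs touching anchor
    if group:
        seen.update(group)
        batches.append([pairs[k] for k in group])
print(batches)
```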
Parameter-Efficient Abstractive Question Answering over Tables and over Text A long-term ambition of information-seeking question answering (QA) systems is to reason over multi-modal contexts and generate natural answers to user queries. Today, memory-intensive pre-trained language models are adapted to downstream tasks such as QA by fine-tuning the model on QA data in a specific modality, like unstructured text or structured tables. To avoid training such memory-hungry models while utilizing a uniform architecture for each modality, parameter-efficient transfer learning techniques such as adapters add and train small task-specific bottleneck layers between transformer layers. However, modality-specific adapter layers infused in a pre-trained transformer also require uniformity in the input sequence, which contradicts existing work that trains structure-specific layers on multi-modal data. In this work, we study parameter-efficient abstractive QA in encoder-decoder models over structured tabular data and unstructured textual data, using only 1.5% additional parameters for each modality. We retain table structure information through a hierarchy-preserving transformation of complex hierarchical tables into 1-dimensional sequences, thus maintaining uniformity in the model input. We also ablate over adapter layers in both the encoder and decoder modules, study the efficiency-performance trade-off, and demonstrate that reducing additional trainable parameters down to 0.7%–1.0% leads to comparable results. Our models outperform current state-of-the-art models on tabular QA datasets such as Tablesum and FeTaQA, and achieve comparable performance on a text QA dataset such as NarrativeQA using significantly fewer trainable parameters. PDF 11 2021
Toward Fine-grained Causality Reasoning and CausalQA Understanding causality is key to the success of NLP applications, especially in high-stakes domains. Causality comes in various forms, such as enabling and preventing, that, despite their importance, have been largely ignored in the literature. This paper introduces a first-of-its-kind, fine-grained causal reasoning dataset that contains seven causal relations and defines a series of NLP tasks, from causality detection to event causality extraction and causal reasoning. Our dataset contains human annotations of 25K cause-effect event pairs and 24K question-answering pairs within multi-sentence samples, each of which can contain multiple causal relationships. Through extensive experiments and analysis, we show that the complex relations in our dataset bring unique challenges to state-of-the-art methods across all three tasks and highlight potential research opportunities, especially in developing ''causal-thinking'' methods. PDF 11 2021
SegMix: A Simple Structure-Aware Data Augmentation Method Many Natural Language Processing tasks involve predicting structures, such as Syntax Parsing and Relation Extraction (RE). One central challenge in supervised structured prediction is the lack of high-quality annotated data. Recently proposed interpolation-based data augmentation (DA) algorithms (e.g., mixup) augment the training set by making convex interpolations between training data points. However, current algorithms (e.g., SeqMix, LADA) that apply mixup to structured prediction tasks in language are not aware of the syntactic or output structures of the tasks, making their performance unstable and requiring additional heuristic constraints. Furthermore, SeqMix-like algorithms expect a linear encoding scheme of the output structure, such as the BIO scheme for Named Entity Recognition (NER), restricting their applicability. To this end, we propose SegMix, a simple framework of interpolation-based algorithms that can adapt to both the syntactic and output structures, making it robust to hyper-parameters and applicable to different tasks. We empirically show that SegMix consistently improves performance over several strong baseline models on two structured prediction tasks (NER and RE). SegMix is a flexible framework that unifies existing rule-based language DA methods, creating interesting mixtures of DA techniques. Furthermore, the method is easy to implement and adds negligible overhead to training and inference. PDF 11 2021
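A minimal sketch of the segment-level interpolation idea, assuming mixing happens in embedding space with soft labels (SegMix's exact segment alignment rules are in the paper):

```python
# Hedged sketch: interpolate two aligned segment embeddings and mix their labels.
import torch

def seg_mix(seg_a: torch.Tensor, seg_b: torch.Tensor,
            y_a: torch.Tensor, y_b: torch.Tensor, alpha: float = 0.5):
    lam = torch.distributions.Beta(alpha, alpha).sample()
    mixed_seg = lam * seg_a + (1 - lam) * seg_b   # convex embedding interpolation
    mixed_y = lam * y_a + (1 - lam) * y_b         # soft label
    return mixed_seg, mixed_y

a, b = torch.randn(4, 768), torch.randn(4, 768)   # two 4-token segments
ya, yb = torch.eye(5)[1], torch.eye(5)[3]         # one-hot entity labels
seg, y = seg_mix(a, b, ya, yb)
print(seg.shape, y)
```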
Zero-Shot Aspect-Based Scientific Document Summarization using Self-Supervised Pre-training We study the zero-shot setting for the aspect-based scientific document summarization task. Summarizing scientific documents with respect to an aspect can remarkably improve document assistance systems and the reader's experience. However, existing large-scale datasets contain a limited variety of aspects, causing summarization models to over-fit to a small set of aspects. We establish baseline results for zero-shot performance (over unseen aspects and in the presence of domain shift), paraphrasing, leave-one-out, and limited-supervision experimental setups. We propose a self-supervised pre-training approach to enhance zero-shot performance. Experimental results on the FacetSum and PubMed aspect-based datasets show promising performance when the model is pre-trained using unlabeled in-domain data. PDF 11 2021
Lifting the Curse of Multilinguality by Pre-training Modular Transformers Multilingual pre-trained models are known to suffer from the curse of multilinguality, which causes per-language performance to drop as they cover more languages. We address this issue by introducing language-specific modules, which allows us to grow the total capacity of the model without any additional cost in training and inference FLOPs. In contrast to prior work which learns language-specific components post-hoc, we pre-train the modules of our Cross-lingual Modular (X-Mod) models from the start. Our experiments on natural language inference, named entity recognition and question answering show that our approach not only mitigates the negative interference between languages, but also enables positive transfer, resulting in improved monolingual and cross-lingual performance. Furthermore, our approach enables adding languages post-hoc with no measurable drop in performance, no longer limiting the model usage to the set of pre-trained languages. PDF 11 2021
A Novel Framework Based on Medical Concept Driven Attention for Explainable Medical Code Prediction via External Knowledge Medical code prediction from clinical notes aims at automatically associating medical codes with clinical notes. The rare-code problem, i.e., medical codes with low occurrence, is prominent in medical code prediction. Recent studies employ deep neural networks and external knowledge to tackle it. However, such approaches lack interpretability, which is a vital issue in medical applications. Moreover, due to the lengthy and noisy clinical notes, such approaches fail to achieve satisfactory results. Therefore, in this paper, we propose a novel framework based on medical concept driven attention to incorporate external knowledge for explainable medical code prediction. Specifically, both the clinical notes and Wikipedia documents are aligned into a topic space to extract medical concepts using topic modeling. Then, the medical concept-driven attention mechanism is applied to uncover the concepts related to each medical code, which provide explanations for medical code prediction. Experimental results on the benchmark dataset show the superiority of the proposed framework over several state-of-the-art baselines. PDF 11 2021
Improve Discourse Dependency Parsing with Contextualized Representations Previous works show that discourse analysis benefits from modeling intra- and inter-sentential levels separately, where proper representations for text units of different granularities are needed to capture both the information of the text units and their relation to the context. In this paper, we propose to take advantage of transformers to encode different contextualized representations of units at different levels, dynamically capturing the information required for discourse dependency analysis at the intra- and inter-sentential levels. Motivated by the observation that writing patterns shared across articles can improve discourse analysis, we design sequence labeling methods that take advantage of such structural information from the context and substantially outperform traditional direct classification methods. Experiments show that our model achieves state-of-the-art results on both English and Chinese datasets. PDF 11 2021
Interpreting the Robustness of Neural NLP Models to Textual Perturbations Modern Natural Language Processing (NLP) models are known to be sensitive to input perturbations and their performance can decrease when applied to real-world, noisy data. However, it is still unclear why models are less robust to some perturbations than others. In this work, we test the hypothesis that the extent to which a model is affected by an unseen textual perturbation (robustness) can be explained by the learnability of the perturbation (defined as how well the model learns to identify the perturbation with a small amount of evidence). We further give a causal justification for the learnability metric. We conduct extensive experiments with four prominent NLP models --- TextRNN, BERT, RoBERTa and XLNet --- over eight types of textual perturbations on three datasets. We show that a model which is better at identifying a perturbation (higher learnability) becomes worse at ignoring such a perturbation at test time (lower robustness), providing empirical support for our hypothesis. PDF 11 2021
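The learnability measurement can be pictured as follows: fit a small classifier on a handful of original/perturbed examples and read off held-out accuracy. The bag-of-words features and toy casing perturbation here are stand-ins for the paper's models and perturbation types.

```python
# Sketch: learnability as held-out accuracy of a simple perturbation detector.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

orig = ["the movie was great", "I liked the plot", "a fine performance"]
pert = [s.upper() for s in orig]            # toy perturbation: casing change
X_text, y = orig + pert, [0] * len(orig) + [1] * len(pert)

X = CountVectorizer(lowercase=False).fit_transform(X_text)
learnability = cross_val_score(LogisticRegression(), X, y, cv=2).mean()
print(f"learnability ~ {learnability:.2f}")  # high => easy to detect, low robustness
```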
SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities Transfer learning has proven to be crucial in advancing the state of speech and natural language processing research in recent years. In speech, a model pre-trained by self-supervised learning transfers remarkably well on multiple tasks. However, the lack of a consistent evaluation methodology limits a holistic understanding of the efficacy of such models. SUPERB was a step towards introducing a common benchmark to evaluate pre-trained models across various speech tasks. In this paper, we introduce SUPERB-SG, a new benchmark focusing on evaluating the semantic and generative capabilities of pre-trained models by increasing task diversity and difficulty over SUPERB. We use a lightweight methodology to test the robustness of representations learned by pre-trained models under shifts in data domain and quality across different types of tasks. It entails freezing the pre-trained model's parameters and using only simple task-specific trainable heads. The goal is to be inclusive of all researchers and to encourage efficient use of computational resources. We also show that the task diversity of SUPERB-SG, coupled with limited task supervision, is an effective recipe for evaluating the generalizability of model representations. PDF 11 2021
A Cueing Strategy for Prompt Tuning in Relation Extraction Traditional relation extraction models predict confidence scores for each relation type based on a condensed sentence representation. In prompt tuning, prompt templates are used to tune pretrained language models (PLMs), which output relation types as verbalized type tokens. This strategy shows great potential for relation extraction because it makes full use of the rich knowledge in PLMs. However, current prompt tuning models are implemented directly on the raw input, which makes them weak at encoding the contextual features and semantic dependencies of a relation instance. In this paper, we design a cueing strategy that implants task-specific cues into the input. It controls the attention of prompt tuning, enabling PLMs to learn the task-specific contextual features and semantic dependencies of a relation instance. We evaluate our method on two public datasets. Experiments show substantial improvements: our method exceeds state-of-the-art performance by more than 4.8% and 1.4% in terms of F1-score on the SemEval and ReTACRED corpora, respectively. PDF 11 2021
Inducing Global and Local Knowledge Attention in Multi-turn Dialog Understanding In multi-turn dialog understanding, semantic frames are constructed by detecting intents and slots within each user utterance. However, recent works lack the capability to model multi-turn dynamics within a dialog, where contexts are mostly used to update dialog states rather than to capture the overall flow of intent semantics in spoken language understanding (SLU). Moreover, external knowledge related to dialogs may be beneficial for exploring deep semantic information across dialog turns, yet many works consider it only for end-to-end response generation. In this paper, we propose to equip a BERT-based joint framework with a context attention module and a knowledge attention module to introduce knowledge attention with contexts between two SLU tasks. We propose three attention mechanisms to induce both global and local attention on knowledge triples. Experimental results on two complicated multi-turn dialog datasets demonstrate significant improvements of our proposed framework from mutually modeling two SLU tasks with filtered knowledge and dialog contexts. Attention visualization also provides good interpretability of how our modules leverage knowledge across utterances. PDF 11 2021
STaR: Knowledge Graph Embedding by Scaling, Translation and Rotation The bilinear method is mainstream in Knowledge Graph Embedding (KGE), aiming to learn low-dimensional representations for entities and relations in a Knowledge Graph (KG) and to complete missing links. Most existing works attempt to find patterns between relations and model them effectively to accomplish this task. Previous works have mainly discovered six important patterns, such as non-commutativity. Although some bilinear methods succeed in modeling these patterns, they neglect to handle 1-to-N, N-to-1, and N-to-N relations (or complex relations) concurrently, which hurts their expressiveness. To this end, we integrate scaling, which handles complex relations, with the combination of translation and rotation, which models relation patterns; scaling can be viewed as a simplification of projection. We thus propose a corresponding bilinear model, Scaling Translation and Rotation (STaR), consisting of these two parts. Besides, since translation cannot be incorporated into a bilinear model directly, we introduce a translation matrix as its equivalent. Theoretical analysis proves that STaR is capable of modeling all patterns and handling complex relations simultaneously, and experiments demonstrate its effectiveness on commonly used benchmarks for link prediction. PDF 11 2021
FewNLU: Benchmarking State-of-the-Art Methods for Few-Shot Natural Language Understanding The few-shot natural language understanding (NLU) task has attracted much recent attention. However, prior methods have been evaluated under a disparate set of protocols, which hinders fair comparison and the measurement of the field's progress. To address this issue, we introduce an evaluation framework that improves previous evaluation procedures in three key aspects, i.e., test performance, dev-test correlation, and stability. Under this new evaluation framework, we re-evaluate several state-of-the-art few-shot methods for NLU tasks. Our framework reveals new insights: (1) both the absolute performance and relative gap of the methods were not accurately estimated in prior literature; (2) no single method dominates most tasks with consistent performance; (3) improvements of some methods diminish with a larger pretrained model; and (4) gains from different methods are often complementary, and the best combined model performs close to a strong fully-supervised baseline. We open-source our toolkit, FewNLU, which implements our evaluation framework along with a number of state-of-the-art methods. PDF 11 2021
Is Attention Explanation? An Introduction to the Debate The performance of deep learning models in NLP and other fields of machine learning has led to a rise in their popularity, and so the need for explanations of these models becomes paramount. Attention has been seen as a solution to increase performance, while providing some explanations. However, a debate has started to cast doubt on the explanatory power of attention in neural networks. Although the debate has created a vast literature thanks to contributions from various areas, the lack of communication is becoming more and more tangible. In this paper, we provide a clear overview of the insights on the debate by critically confronting works from these different areas. This holistic vision can be of great interest for future works in all the communities concerned by this debate. We sum up the main challenges spotted in these areas, and we conclude by discussing the most promising future avenues on attention as an explanation. PDF 11 2021
Building Chinese Biomedical Language Models via Multi-Level Text Discrimination Pre-trained language models (PLMs), such as BERT and GPT, have revolutionized the field of NLP, not only in the general domain but also in the biomedical domain. Most prior efforts in building biomedical PLMs have resorted simply to domain adaptation and focused mainly on English. In this work we introduce eHealth, a Chinese biomedical PLM built from scratch with a new pre-training framework. This new framework pre-trains eHealth as a discriminator through both token- and sequence-level discrimination. The former is to detect input tokens corrupted by a generator and recover their original identities from plausible candidates, while the latter is to further distinguish corruptions of the same original sequence from those of others. As such, eHealth can learn language semantics at both the token and sequence levels. Extensive experiments on 11 Chinese biomedical language understanding tasks of various forms verify the effectiveness and superiority of our approach. We release the pre-trained model to the public,\footnote{\url{Anonymous URL}} and will also release the code later. PDF 11 2021
An Empirical Study of Document-to-document Neural Machine Translation This paper does not aim to introduce a novel method for document NMT. Instead, we head back to the original transformer model with document-level training and hope to answer the following question: Is the capacity of current models strong enough for document-level NMT? Interestingly, we observe that the original transformer with appropriate training techniques can achieve strong results for document translation, even with a length of 2000 words. We evaluate this model and several recent approaches on nine document-level datasets and two sentence-level datasets across six languages. Experiments show that the original Transformer model outperforms sentence-level models and many previous methods on a comprehensive set of metrics, including BLEU, four lexical indices, three newly proposed assistant linguistic indicators, and human evaluation. PDF 11 2021
Exploiting Data Characteristics for Document-level Event Extraction Document-level event extraction (DEE) extracts structured information about events from a document. Previous studies focus on improving the model architecture. We propose instead to exploit data characteristics: 1) we utilize more coreference information to obtain better document-level entity representations; 2) we manually identify the core roles of each event type and propose hybrid extraction to reduce the memory burden and alleviate error propagation. Experiments on a large dataset demonstrate that our methods significantly improve model performance on both role-level and record-level metrics. Our code is available at https://github.com/coszeros/CAB. PDF 11 2021
To Know by the Company Words Keep and What Else Lies in the Vicinity The development of state-of-the-art (SOTA) Natural Language Processing (NLP) systems has steadily been establishing new techniques to absorb the statistics of linguistic data. These techniques often trace well-known constructs from traditional theories, and we study these connections to close gaps around key NLP methods as a means to orient future work. For this, we introduce an analytic model of the statistics learned by seminal algorithms (including GloVe and Word2Vec), and derive insights for systems that use these algorithms and the statistics of co-occurrence, in general. In this work, we derive—to the best of our knowledge—the first known solution to Word2Vec's softmax-optimized, skip-gram algorithm. This result presents exciting potential for future development as a direct solution to a deep learning (DL) language model's (LM's) matrix factorization. However, we use the solution to demonstrate a seemingly-universal existence of a property that word vectors exhibit and which allows for the prophylactic discernment of biases in data—prior to their absorption by DL models. To qualify our work, we conduct an analysis of independence, i.e., on the density of statistical dependencies in co-occurrence models, which in turn renders insights on the distributional hypothesis' partial fulfillment by co-occurrence statistics. PDF 11 2021
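For background, the softmax-optimized skip-gram objective the abstract refers to is the standard one below; the paper's own closed-form solution to it is not reproduced here. The second identity is the previously known result for the negative-sampling variant (Levy and Goldberg, 2014), which such a softmax solution would complement.

```latex
% Softmax skip-gram: for center word $w$ with vector $\mathbf{v}_w$ and
% context word $c$ with vector $\mathbf{u}_c$, over corpus pair counts $\#(w,c)$:
\[
P(c \mid w) \;=\; \frac{\exp(\mathbf{u}_c^{\top}\mathbf{v}_w)}
                        {\sum_{c'} \exp(\mathbf{u}_{c'}^{\top}\mathbf{v}_w)},
\qquad
\mathcal{L} \;=\; \sum_{(w,c)} \#(w,c)\,\log P(c \mid w).
\]
% Known result for the negative-sampling variant with $k$ negatives
% (Levy and Goldberg, 2014): the optimum factorizes shifted PMI,
\[
\mathbf{u}_c^{\top}\mathbf{v}_w \;=\; \mathrm{PMI}(w,c) - \log k.
\]
```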
Rewire-then-Probe: A Contrastive Recipe for Probing Biomedical Knowledge of Pre-trained Language Models Knowledge probing is crucial for understanding the knowledge transfer mechanism behind pre-trained language models (PLMs). Despite the growing progress of probing knowledge for PLMs in the general domain, specialised areas such as the biomedical domain remain vastly under-explored. To facilitate this, we release a well-curated biomedical knowledge probing benchmark, MedLAMA, constructed from the Unified Medical Language System (UMLS) Metathesaurus. We test a wide spectrum of state-of-the-art PLMs and probing approaches on our benchmark, reaching at most $3\%$ acc@10. While highlighting various sources of domain-specific challenges that account for this underwhelming performance, we illustrate that the underlying PLMs have a higher potential for probing tasks. To realize this, we propose Contrastive-Probe, a novel self-supervised contrastive probing approach that adjusts the underlying PLMs without using any probing data. While Contrastive-Probe pushes acc@10 to $28\%$, the performance gap remains notable. Our human expert evaluation suggests that the probing performance of Contrastive-Probe is still under-estimated, as UMLS does not include the full spectrum of factual knowledge. We hope MedLAMA and Contrastive-Probe facilitate further development of better-suited probing techniques for this domain. PDF 11 2021
EmRel: Joint Representation of Entities and Embedded Relations for Multi-triple Extraction Multi-triple extraction is a challenging task due to the existence of informative inter-triple correlations and, consequently, rich interactions across the constituent entities and relations. While existing works only explore cross-entity interactions, we propose to explicitly introduce a relation representation, jointly represent it with entities, and align the two in a novel way to identify valid triples. We perform comprehensive experiments on document-level relation extraction and joint entity and relation extraction, along with detailed ablations, to demonstrate the advantage of the proposed method. PDF 11 2021
BigFive: A Dataset of Coarse- and Fine-Grained Personality Characteristics Inferring the personalities that users convey through their published short texts has a wide and important range of applications, from detecting abnormal behavior of online users to accurate, customized recommendation. Advances in this area can be supported by large-scale datasets with coarse- and fine-grained typologies, adaptable to multiple downstream tasks. Therefore, this paper introduces $BigFive$, a large, high-quality dataset manually annotated by experts. $BigFive$ contains 13,478 Chinese phrases that belong to five categories (coarse-grained) and 30 categories (fine-grained). The reliability of the five categories grouped by personality level and the 30 categories grouped by dimension level is demonstrated via a detailed data analysis. In addition, a strong baseline is built by fine-tuning a BERT model. Our BERT-based model achieves an average F1-score of .33 (std=.24) over the 30 categories and an average F1-score of .66 (std=.05) over the five categories. The experimental results suggest that there is much room for improvement. PDF 11 2021
Structure and Features Fusion with Evidential Graph Convolutional Neural Network for Node Classification Recently, text-enhanced network representation learning has achieved great success by taking advantage of rich text information and network structure information. However, content-rich network representation learning and quantifying classification uncertainty are challenging when it comes to integrating complex structural dependencies and rich content features at the evidence level. In this paper, we propose an evidential graph representation learning model (EGCN), which can not only fuse network structure and content information into a more complete and powerful representation for each node, but can also assess the quality of graph node features to improve classification accuracy. To achieve better fusion, we integrate the node's feature representation into a structure-aware representation through a delivery operator. Besides, to overcome the difficulty of predicting node classification confidence, we employ a novel module based on the Dirichlet distribution, evidence theory, and subjective opinion learning to collect evidence for the class probabilities. Experimental results on three real-world networks show that our model improves both node classification accuracy and robustness compared to all baselines. PDF 11 2021
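The "collect evidence for class probabilities" step is commonly implemented with standard Dirichlet/subjective-logic bookkeeping; whether EGCN uses exactly this parameterization is an assumption, but a minimal version looks like the following.

```python
# Sketch of a Dirichlet-evidence output head (standard subjective-logic form;
# assumed, not confirmed, to match the paper's module).
import torch

def evidential(evidence: torch.Tensor):
    # evidence: non-negative per-class evidence, e.g. softplus of logits.
    alpha = evidence + 1.0            # Dirichlet concentration parameters
    S = alpha.sum()                   # Dirichlet strength
    belief = evidence / S             # per-class belief mass
    uncertainty = evidence.numel() / S  # leftover mass = classification uncertainty
    prob = alpha / S                  # expected class probabilities
    return belief, uncertainty, prob

print(evidential(torch.tensor([4.0, 1.0, 0.0])))
```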
Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings Learning scientific document representations can be substantially improved through contrastive learning objectives, where the challenge lies in creating positive and negative training samples that encode the desired similarity semantics. Prior work relies on discrete citation relations to generate contrast samples. However, discrete citations enforce a hard cut-off on similarity. This is counter-intuitive to similarity-based learning and ignores that scientific papers can be very similar despite lacking a direct citation, which is a core problem in finding related research. Instead, we use controlled nearest neighbor sampling over citation graph embeddings for contrastive learning. This control allows us to learn continuous similarity, to sample hard-to-learn negatives and positives, and to avoid collisions between negative and positive samples by controlling the sampling margin between them. The resulting method, SciNCL, outperforms the state-of-the-art on the SciDocs benchmark. Furthermore, we demonstrate that it can train (or tune) models sample-efficiently, which improves compute efficiency, and that it can be combined with recent training-efficient methods. Perhaps surprisingly, even training a general-domain language model this way outperforms baselines pretrained in-domain. PDF 11 2021
Subword Information for Authorship Attribution: A Deep Learning Approach Authorship attribution is the process of unveiling the hidden identity of authors from a corpus of literary data. Many previous works on authorship attribution employed word-based models to capture an author's distinctive writing style. The vocabulary of the training corpus is heavily dependent on the pre-trained word vectors, which limits the performance of these models. Alternative character-based models, proposed to overcome the rare-word problems arising from diverse linguistic features, fail to capture the sequential relationships among words inherent in texts. The question we address in this paper is whether it is possible to tackle the ambiguity of hidden writing styles by introducing Gaussian noise while preserving the sequential context of the text, so as to improve authorship-related tasks. In this work, we propose a bidirectional long short-term memory (BLSTM) network combined with a 2D convolutional neural network (CNN) over a two-dimensional pooling operation to capture sequential writing styles for distinguishing different authors. To determine the appropriate writing style representation, we use the BLSTM to obtain the sequential relationships among characteristics using subword information, and adopt the 2D CNN to capture the local syntactic position of the style from unlabelled input text. We extensively evaluate the model that leverages subword embeddings and compare it against state-of-the-art methods for an extensive range of authors. Our methods improve results by 2.42\%, 0.96\% and 0.97\% on CCAT50, Blog50 and Twitter, respectively, and produce comparable results on the remaining dataset. PDF 11 2021
Blackbird's language matrices (BLMs): a new benchmark to investigate disentangled generalisation in neural networks Current successes of machine learning architectures are based on computationally expensive algorithms and prohibitively large amounts of data. We need to develop tasks and data to train networks to reach more complex and more compositional skills. In this paper, we illustrate Blackbird's language matrices (BLMs), a novel grammatical dataset developed to test a linguistic variant of Raven's progressive matrices, an intelligence test usually based on visual stimuli. The dataset consists of roughly 48000 sentences, generatively constructed to support investigations of current models' linguistic mastery of grammatical rules and their ability to generalize them. We present the logic of the dataset, the method to automatically construct data on a large scale, and the architecture to learn them. Through error analysis and several experiments on variations of the dataset, we demonstrate that this language task and the data that instantiate it provide a new challenging testbed to understand generalization and abstraction. PDF 11 2021
CluSent – Combining Semantic Expansion and De-Noising for Dataset-Oriented Sentiment Analysis of Short Texts The lack of sufficient information, mainly in short texts, is a major challenge to building effective sentiment models. Short texts can be enriched with more complex semantic relationships that better capture affective information, with the potential undesired side effect of introducing noise into the data. In this work, we propose a new strategy for customized, dataset-oriented sentiment analysis, CluSent, which exploits a powerful, recently proposed concept for representing semantically related words: CluWords. CluSent tackles the aforementioned issues of information shortage and noise by: (i) exploiting the semantic neighborhood of a given pre-trained word embedding to enrich the document representation, and (ii) introducing dataset-oriented filtering and weighting mechanisms to cope with noise, which take advantage of the polarity and intensity information from lexicons. In our experimental evaluation, considering 19 datasets, 5 state-of-the-art baselines (including modern transformer architectures) and two metrics, CluSent was the best method in 30 out of 38 possibilities, with significant gains over the strongest baselines (over 14%). PDF 11 2021
Self-Supervised Contrastive Learning with Adversarial Perturbations for Robust Pretrained Language Models In this paper, we present an approach to improve the robustness of BERT language models against word substitution-based adversarial attacks by leveraging adversarial perturbations for self-supervised contrastive learning. We create an efficient word-level adversarial attack, and use it to finetune BERT on adversarial examples generated \textit{on the fly} during training. In contrast with previous works, our method improves model robustness without using any labeled data. Experimental results show that our method improves robustness of BERT against four different word substitution-based adversarial attacks, and combining our method with adversarial training gives higher robustness than adversarial training alone. As our method improves the robustness of BERT purely with unlabeled data, it opens up the possibility of using large text datasets to train robust language models. PDF 11 2021
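The training signal can be sketched as an InfoNCE-style loss that treats the adversarially perturbed encoding as a positive view of the clean sentence; the attack itself (word substitution) is abstracted into a noise stand-in here.

```python
# Sketch: contrastive loss between clean and adversarial sentence embeddings.
import torch
import torch.nn.functional as F

def info_nce(clean: torch.Tensor, adv: torch.Tensor, tau: float = 0.1):
    # clean, adv: (batch, dim); row i of each is a view of the same sentence.
    z1, z2 = F.normalize(clean, dim=1), F.normalize(adv, dim=1)
    logits = z1 @ z2.t() / tau              # (batch, batch) cosine similarities
    labels = torch.arange(z1.size(0))       # diagonal entries are the positives
    return F.cross_entropy(logits, labels)

clean = torch.randn(8, 768, requires_grad=True)
adv = clean + 0.05 * torch.randn(8, 768)    # stand-in for the word-level attack
print(info_nce(clean, adv).item())
```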
Detect Low-Resource Rumors in Microblog Posts via Adversarial Contrastive Learning Massive false rumors emerging along with breaking news or trending topics severely obscure the truth. Existing rumor detection approaches achieve promising performance on yesterday's news, since enough corpora can be collected from the same domain for model training. However, they are poor at detecting rumors about unforeseen events such as COVID-19, due to the lack of training data and prior knowledge (i.e., low-resource rumors). In this paper, we propose an adversarial contrastive learning framework to detect low-resource rumors by adapting features learned from well-resourced rumor data to the low-resourced setting. Our model explicitly overcomes the restrictions of both domain and language usage via language alignment and contrastive training. Moreover, we develop an adversarial augmentation mechanism to further enhance the robustness of low-resource rumor representations. Extensive experiments conducted on two low-resource datasets collected from real-world microblog platforms demonstrate that our framework achieves much better performance than state-of-the-art methods and exhibits a superior capacity for detecting rumors at early stages. PDF 11 2021
Few-Shot Named Entity Recognition with Biaffine Span Representation While Named Entity Recognition (NER) is a widely studied task, recognizing entities with only a few labeled examples (i.e., few-shot NER) remains challenging. Correspondingly, the N-way K-shot NER task is proposed to recognize entities in N given categories with only K labeled samples per category. Existing methods treat this task as a sequence labeling problem, while this paper regards it as an entity span classification problem and designs a Biaffine Span Representation (BSR) method that learns a contextual span dependency representation to fit the classification algorithm. BSR applies a biaffine pooling module to establish the dependencies of each word on the whole sentence and to reduce the dimension of word features; thus, the span representation gains contextual dependency information that helps improve recognition accuracy. An experimental study on four standard NER datasets shows that our proposed BSR method outperforms pre-trained language models and existing N-way K-shot NER algorithms in two types of adaptation (i.e., Intra-Domain Cross-Type Adaptation and Cross-Domain Cross-Type Adaptation). Notably, the F$_1$ score increases by an average of 13.77% and 18.30% on the 5-way 1-shot task and the 5-way 5-shot task, respectively. PDF 11 2021
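A minimal biaffine span scorer, assuming spans are scored from their start and end token vectors with a bilinear plus a linear term (BSR adds pooling and the few-shot machinery on top of such a scorer):

```python
# Sketch: biaffine scoring of every (start, end) span over the label set.
import torch
import torch.nn as nn

class BiaffineSpanScorer(nn.Module):
    def __init__(self, dim: int, n_labels: int):
        super().__init__()
        self.U = nn.Parameter(torch.randn(n_labels, dim, dim) * 0.02)
        self.W = nn.Linear(2 * dim, n_labels)

    def forward(self, h: torch.Tensor):                # h: (seq, dim)
        start, end = h.unsqueeze(1), h.unsqueeze(0)    # broadcast to (seq, seq)
        bilinear = torch.einsum("id,ldk,jk->ijl", h, self.U, h)
        linear = self.W(torch.cat(torch.broadcast_tensors(start, end), -1))
        return bilinear + linear                       # (seq, seq, n_labels)

h = torch.randn(6, 128)
print(BiaffineSpanScorer(128, 5)(h).shape)  # torch.Size([6, 6, 5])
```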
A Bit Bayesian Facilitates Efficient Training in Token Classification Token classification is a fundamental task in computational linguistics. Token classification models, like other modern deep neural network models, are usually trained on the entire training set in each epoch, while research has found that not all of the training data may be needed in late epochs of training. Inspired by human pedagogy, we propose a teacher-aware structure to accelerate the training of token classification models. After each epoch of training, the teacher samples data it is uncertain about and data on which its predictions differ from the student's, which are passed into the structure for training in the next epoch. As a proof of concept, we use a Bayesian linear classifier as the teacher and two commonly used backbone models as the student. Experiments show that our method reduces the number of training iterations, speeding up training without affecting the model's performance. PDF 11 2021
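The teacher's selection rule might look like the following, keeping examples with high predictive entropy or teacher-student disagreement; the threshold and the probability arrays are illustrative stand-ins for real model outputs.

```python
# Sketch: select next-epoch examples by teacher uncertainty or disagreement.
import numpy as np

def select(teacher_probs, student_preds, entropy_thresh=0.6):
    ent = -(teacher_probs * np.log(teacher_probs + 1e-9)).sum(-1)
    teacher_preds = teacher_probs.argmax(-1)
    keep = (ent > entropy_thresh) | (teacher_preds != student_preds)
    return np.where(keep)[0]

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=10)   # teacher per-class probabilities
students = rng.integers(0, 3, size=10)       # student's hard predictions
print(select(probs, students))               # indices kept for the next epoch
```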
SummaReranker: A Multi-Task Mixture-of-Experts Re-ranking Framework for Abstractive Summarization Sequence-to-sequence neural networks have recently achieved great success in abstractive summarization, especially with the trend of fine-tuning large pre-trained language models on the downstream dataset. These models are typically decoded with beam search to generate a unique summary. However, the search space is very large, and due to exposure bias, such decoding is not optimal. In this paper, we show that it is possible to directly train a second-stage model performing re-ranking on a set of summary candidates. Our mixture-of-experts SummaReranker learns to select a better candidate and systematically improves the performance of the base model. With a base PEGASUS, we push ROUGE scores by 5.44% on CNN-DailyMail (47.16 ROUGE-1), 1.31% on XSum (48.12 ROUGE-1) and 9.34% on Reddit TIFU (29.83 ROUGE-1), reaching a new state-of-the-art. PDF 11 2021
Coloring the Blank Slate: Pre-training Imparts a Hierarchical Inductive Bias to Sequence-to-sequence Models Relations between words are governed by hierarchical structure rather than linear ordering. Sequence-to-sequence (seq2seq) models, despite their success in downstream NLP applications, often fail to generalize in a hierarchy-sensitive manner when performing syntactic transformations—for example, transforming declarative sentences into questions—instead generalizing linearly using positional surface heuristics. However, syntactic evaluations of seq2seq models have only observed models that were not pre-trained on natural language data before being trained to perform syntactic transformations, in spite of the fact that pre-training has been found to induce hierarchical linguistic generalizations in language models; in other words, the syntactic capabilities of seq2seq models may have been greatly understated. Here, we make use of the pre-trained seq2seq model T5 (and its multilingual variant mT5) and evaluate whether they generalize hierarchically on two syntactic transformations in two languages: question formation and passivization in English and German. We find that T5 and mT5 generalize hierarchically when performing syntactic transformations, whereas non-pre-trained baseline models do not. This result presents additional evidence for the learnability of hierarchical syntactic information from non-annotated natural language text while also demonstrating that seq2seq models are capable of syntactic generalization. PDF 11 2021
SlotGAN: Detecting Mentions in Text via Adversarial Distant Learning We present SlotGAN, a framework for training a mention detection model that only requires unlabeled text and a gazetteer. It consists of a generator trained to extract spans from an input sentence, and a discriminator trained to determine whether a span comes from the generator or from the gazetteer. We evaluate the method on English newswire data and compare it against supervised, weakly-supervised, and unsupervised methods. We find that the performance of the method is lower than these baselines, because it tends to generate more and longer spans, and in some cases it relies only on capitalization. In other cases, it generates spans that are valid but differ from the benchmark. When evaluated with metrics based on overlap, we find that SlotGAN performs within 95% of the precision of a supervised method, and 84% of its recall. Our results suggest that the model can generate spans that overlap well, but an additional filtering mechanism is required. PDF 11 2021
SuMe: A Dataset Towards Summarizing Biomedical Mechanisms Can language models read biomedical texts and explain the biomedical mechanisms discussed? In this work we introduce a biomedical mechanism summarization task. Biomedical studies often investigate the mechanisms behind how one entity (e.g., a protein or a chemical) affects another in a biological context. The abstracts of these publications often include a focused set of sentences that present relevant supporting statements regarding such relationships, associated experimental evidence, and a concluding sentence that summarizes the mechanism underlying the relationship. We leverage this structure and create a summarization task, where the input is a collection of sentences in an abstract and the output includes the main relationships and a natural language sentence that summarizes the mechanism. Using a small amount of manually labeled mechanism sentences, we train a mechanism sentence classifier to filter a large biomedical abstract collection and create a summarization dataset with 22k instances. We also introduce a pretraining conclusion generation task with 611k samples. Our benchmarking experiments with large language models show that the pretraining is helpful for the original task, but model performance is still not satisfactory, and this task presents significant challenges in biomedical language understanding and summarization. PDF 11 2021
Target-Guided Dialogue Response Generation Using Commonsense and Data Augmentation Target-guided response generation enables dialogue systems to smoothly guide a conversation from a dialogue context toward a target sentence. Such control is useful for designing dialogue systems that direct a conversation toward specific goals, such as providing counselling and creating non-obtrusive recommendations. In this paper, we introduce a new technique for target-guided response generation, which first finds a bridging path of commonsense knowledge concepts between the source and the target, and then uses the identified bridging path to generate transition responses. Additionally, we propose techniques to re-purpose existing dialogue datasets for target-guided generation. Finally, we demonstrate the shortcomings of existing automated metrics for this task, and propose a novel evaluation metric that we show is more effective for target-guided response evaluation. Our experiments show that our proposed evaluation metric is reliable and that our techniques outperform baselines on the generation task. Our work generally enables dialog system designers to exercise more control over the conversations that their systems produce. PDF 11 2021
Generating Diverse and High-Quality Abstractive Summaries with Variational Transformers Existing works on abstractive summarization mainly focus on boosting summary quality (informativeness, contextual similarity). To generate summaries of both high diversity and high quality, we propose the Transformer+CVAE model, which integrates the CVAE framework into the Transformer by introducing prior/recognition networks that bridge the Transformer encoder and decoder. We utilize the latent variables generated in the global receptive field of the Transformer by fusing them into the start-of-sequence ([SOS]) token of the decoder inputs. To better tune the weights of the latent variables in the sequence, we design a gated unit to blend the latent representation and the [SOS] token. Evaluated on the Gigaword dataset, our model outperforms state-of-the-art seq-to-seq models and the base Transformer on diversity and quality metrics. After scrutinizing the pre-training and gating mechanisms we apply, we find that both schemes help improve the quality of generated summaries in the CVAE framework. PDF 11 2021
Alleviating the Sparsity of Open Knowledge Graphs with Pretrained Contrastive Learning Due to the sparsity of formal knowledge and the roughness of non-ontological construction methods, relevant facts are often missing in Open Knowledge Graphs (OpenKGs). Although existing completion methods have achieved promising performance, they do not alleviate the sparsity problem of OpenKGs. Owing to the fewer training opportunities caused by sparse links, many few-shot and zero-shot entities cannot fully learn high-dimensional features. In this paper, we propose a new OpenKG Contrastive Learning (OKGCL) model to alleviate the sparsity with contrastive entities and relations. OKGCL designs (a) negative entities to discriminate between different entities with the same relation, (b) negative relations to discriminate between different relations with the same entity pair, and (c) \emph{self} positive samples to give zero-shot and few-shot entities the chance to learn discriminative representations. Extensive experiments on benchmark datasets show the superiority of OKGCL over state-of-the-art models. PDF 11 2021
Generative Prompt Tuning for Relation Classification Prompt tuning is proposed to better tune pre-trained language models by filling the objective gap between the pre-training process and downstream tasks. Current methods mainly convert downstream tasks into masked language modeling (MLM) problems, which have proven effective for tasks with simple label sets. However, when applied to relation classification tasks, which often exhibit a complex label space, vanilla prompt tuning methods designed for MLM may struggle to handle complex label verbalizations of variable length, since in such methods the locations and number of masked tokens are typically fixed. Inspired by the text infilling task for pre-training generative models, which can flexibly predict missing spans, we propose a novel generative prompt tuning method that reformulates relation classification as an infilling problem, eliminating the rigid prompt restrictions. This allows our method to process label verbalizations of varying lengths at multiple predicted positions and thus to fully leverage the rich semantics of entity and relation labels. In addition, we design entity-guided decoding and discriminative relation scoring to predict relations effectively and efficiently at inference time. Extensive experiments under low-resource and fully supervised settings demonstrate the effectiveness of our approach. PDF 11 2021
Learning Functional Distributional Semantics with Visual Data Functional Distributional Semantics is a recently proposed framework for learning distributional semantics that provides linguistic interpretability. It models the meaning of a word as a binary classifier rather than a numerical vector. In this work, we propose a method to train a Functional Distributional Semantics model with grounded visual data. We train it on the Visual Genome dataset, which is closer to the kind of data encountered in human language acquisition than a large text corpus. On four external evaluation datasets, our model outperforms previous work on learning semantics from Visual Genome. PDF 11 2021
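The core modeling choice, a word's meaning as a binary classifier over entity representations rather than a point vector, fits in a few lines; below is a toy logistic truth-function, whereas the full framework uses richer graphical structure.

```python
# Sketch: truth-conditional word meaning as a classifier over entities.
import torch
import torch.nn as nn

class WordClassifier(nn.Module):
    """P(word is true of entity x) rather than a vector for the word."""
    def __init__(self, dim: int):
        super().__init__()
        self.lin = nn.Linear(dim, 1)

    def forward(self, entity: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.lin(entity))

red = WordClassifier(64)          # one classifier per word
entity = torch.randn(64)          # e.g., a Visual Genome object representation
print(float(red(entity)))         # degree to which "red" holds of the entity
```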
Models can use keywords to answer questions that human cannot Recent studies have shown that reading comprehension (RC) models learn to exploit biases and annotation artifacts in current Machine Reading Comprehension (MRC) datasets to achieve impressive performance. This hinders the community from measuring sophisticated understanding in RC systems. MRC questions whose answers can be correctly predicted without understanding their contexts are defined as biased ones. Previous research aiming to isolate unintended biases and determine their influence has some limitations: some methods use partial test data to extract biases and lack holistic consideration of the question-context-option tuple, while others rely on artificial statistical features and are limited by question type. In this paper, we employ two simple heuristics to identify biased questions in current MRC datasets through human-annotated keywords. We implement three neural networks on the biased data and find that they have outstanding abilities to capture the biases, and we further study the superficial features of the biased data that models exploit as shortcuts, in terms of lexical choice and paragraphs. Experiments show that (i) models can answer some questions using only a few keywords, questions which are unanswerable or difficult for humans; (ii) lexical choice preferences in options create biases utilized by models; and (iii) fewer paragraphs are more likely to introduce biases in MRC datasets. PDF 11 2021
A Weak Self-supervision with Transition-Based Modeling for Reference Resolution Reference resolution is the task of finding the link between an entity and its source action in the same recipe. In this study, we introduce a weak self-supervision method with a transition-based model for reference resolution in recipes, where the aim is to exploit the syntax of the instructions for reference resolution via self-annotation. The results show that our approach outperforms previous unsupervised methods by 8% F1. In particular, our models reach 82% accuracy for pronoun resolution and 85% for null-entity resolution. PDF 11 2021
A Scalable Holistic approach for Age and Gender inference of Twitter Users Numerous studies have focused on inference of age and gender. We consider a new approach that takes advantage of contrastive learning methods by using both text and image content for this prediction task. We also consider the case where only text or image data is available. Under both of these conditions, we show that our model achieves better performance than the state-of-the-art ones, and still performs well with text/images only. Moreover, because demographic datasets can be small, we also consider combining different datasets to understand when augmentation is valuable and when it is not. PDF 11 2021
REDTABS: A Collection of Report Document Datasets for Long Text and Multi-Table Summarization Automatic document summarization aims to produce a concise summary covering the input document's salient content. Within a report document, both the textual and non-textual content (e.g., tables and figures) can be important information sources for the summary. However, most available document summarization datasets focus on the text and filter out the non-textual content. Missing tabular data can limit the informativeness of produced summaries, especially when target summaries need to cover quantitative descriptions of critical metrics, whose numerical information is usually kept in tables. In this paper, we address this issue by introducing REDTABS, the first collection of large-scale datasets for long text and multi-table summarization. Built on companies' annual reports, it includes three large-scale datasets for summarizing these companies' business, results of operations, and overall conditions, respectively. We also present the Segment-Alignment-based long Text and multi-Table summarization (SATT) method, which incorporates textual and tabular data into the summarization process. Besides, we propose a set of automatic evaluation metrics to assess the numerical information in summaries produced by summarization models. Dataset analyses and experimental results reveal the importance of incorporating textual and tabular data into report document summarization. We will release our data and code to facilitate advances in summarization and text generation research. PDF 11 2021
How does the pre-training objective affect what large language models learn about linguistic properties? Several pre-training objectives, such as masked language modeling (MLM), have been proposed to pre-train language models (e.g. BERT) with the aim of learning better language representations. However, to the best of our knowledge, no previous work so far has investigated how different pre-training objectives affect what BERT learns about linguistic properties. We hypothesize that linguistically motivated objectives (e.g. MLM) should help BERT acquire better linguistic knowledge than non-linguistically motivated objectives, i.e., objectives for which it is hard for humans to guess the association between the input and the label to be predicted. To this end, we pre-train BERT with two linguistically motivated objectives and three non-linguistically motivated ones. We then probe for linguistic characteristics encoded in the representations of the resulting models. We find strong evidence that there are no actual differences in probing performance between the representations learned with the two different types of objectives. These surprising results question the dominant narrative of linguistically informed pre-training. PDF 11 2021
Towards Job-Transition-Tag Graph for a Better Job Title Representation Learning Work on learning job title representations is mainly based on the Job-Transition Graph, built from the working histories of talents. However, since the records are usually messy, this graph is very sparse, which affects the quality of the learned representations and hinders further analysis. To address this specific issue, we propose to enrich the graph with additional nodes that improve the quality of job title representations. Specifically, we construct the Job-Transition-Tag Graph, a heterogeneous graph containing two types of nodes, i.e., job titles and tags (i.e., words related to job responsibilities or functions). Along this line, we reformulate job title representation learning as the task of learning node embeddings on the Job-Transition-Tag Graph. Experiments on a public CareerBuilder12 dataset and a private Randstad dataset show the merit of our approach. PDF 11 2021
Delving Deep into Extractive Question Answering Data The impact of large-scale pre-trained language models on Question Answering in recent times is undeniably positive. However, few prior works have attempted to provide detailed insight into how such models learn from the component parts of QA datasets. For example, what specific kinds of examples are most important for models to learn from? In this paper, we examine two English QA datasets, namely SQuAD1.1 and NewsQA, and report findings on the internal characteristics of these widely employed extractive QA datasets. Experimental results reveal: (i) models learn relatively independently of examples from outside a given question type (the performance on each question type mainly comes from the data belonging to that same question type); (ii) increased difficulty in the training data results in better performance; and (iii) learning from QA data approximates the process of learning question-answer matches. PDF 11 2021
Maximum Proxy-Likelihood Estimation for Non-autoregressive Machine Translation Maximum Likelihood Estimation (MLE) is commonly used in machine translation, where models with higher likelihood are assumed to perform better in translation. However, this assumption does not hold in non-autoregressive Transformers (NATs), a new family of translation models. In this paper, we present both theoretical and empirical analyses of why simply maximizing the likelihood does not produce a good NAT model. Based on the theoretical analysis, we propose Maximum Proxy-Likelihood Estimation (MPLE), a novel method to address the training issue in MLE. Additionally, MPLE provides a novel perspective for understanding existing successes in training NATs; namely, much previous work can be regarded as implicitly optimizing our objective. PDF 11 2021
Hey AI, Can You Solve Complex Tasks by Talking to Agents? Training giant models from scratch for each complex task is resource- and data-inefficient. To help develop models that can leverage existing systems, we propose a new challenge: Learning to solve complex tasks by communicating with existing agents (or models) in natural language. We design a synthetic benchmark, CommaQA, with three complex reasoning tasks (explicit, implicit, numeric) designed to be solved by communicating with existing QA agents. For instance, using text and table QA agents to answer questions such as "Who had the longest javelin throw from USA?". We show that black-box models struggle to learn this task from scratch (accuracy under 50\%) even with access to each agent's knowledge and gold facts supervision. In contrast, models that learn to communicate with agents outperform black-box models, reaching scores of 100\% when given gold decomposition supervision. However, we show that the challenge of learning to solve complex tasks by communicating with existing agents \emph{without relying on any auxiliary supervision or data} still remains highly elusive. We will release CommaQA, along with a compositional generalization test split, to advance research in this direction. PDF 11 2021
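The decomposition idea can be illustrated with two placeholder agents answering the abstract's example question; real CommaQA agents are trained QA models, not the lookup tables sketched here.

```python
# Toy sketch: compose a table agent and a text agent to answer a complex question.
def table_qa(question):   # hypothetical agent over structured data
    return ["athlete_7", "athlete_9"] if "USA" in question else []

def text_qa(question):    # hypothetical agent over text passages
    distances = {"athlete_7": 84.2, "athlete_9": 88.1}
    return next((d for name, d in distances.items() if name in question), 0.0)

# "Who had the longest javelin throw from USA?" decomposed into two agent calls:
candidates = table_qa("Which javelin throwers are from USA?")
best = max(candidates, key=lambda a: text_qa(f"How far did {a} throw?"))
print(best)  # athlete_9
```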
Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation The recent large-scale vision-language pre-training (VLP) of dual-stream architectures (e.g., CLIP) on tremendous amounts of image-text pair data has shown its superiority on various multimodal alignment tasks. Despite this success, the resulting models are not capable of generative multimodal tasks due to the weak text encoder. To tackle this problem, we propose to augment the dual-stream VLP model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD), enabling the capability for multimodal generation. VLKD is highly data- and computation-efficient compared to pre-training from scratch. Experimental results show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning. For example, it achieves 39.7% zero-shot accuracy on the VQA 2.0 dataset, surpassing the previous state-of-the-art zero-shot model with 14x fewer parameters. Furthermore, the original text processing ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks. PDF 11 2021
Exploring Topic-Metadata Relationships with the STM: A Bayesian Approach The initial purpose of topic models was to identify latent topical clusters within unstructured text. Meanwhile, the focus of advanced studies has shifted primarily to estimating the relationship between the discovered topical structure and theoretically relevant metadata. Methods used to estimate such relationships must take into account that the topical structure is not directly observed, but is instead itself estimated in an unsupervised fashion. In the Structural Topic Model (STM; Roberts et al., 2016), for instance, multiple repeated linear regressions of sampled topic proportions on metadata covariates are performed, using a Monte Carlo sampling technique known as the \textit{method of composition}. In this paper, we propose two modifications of this approach: first, we implement a substantial correction to the model by replacing linear regression with the more appropriate Beta regression; second, we provide a fundamental enhancement of the entire estimation framework by substituting the current blend of frequentist and Bayesian methods with a fully Bayesian approach, which allows for a more appropriate quantification of uncertainty. We illustrate our improved methodology by investigating relationships between Twitter posts by German parliamentarians and different metadata covariates related to their electoral districts. PDF 11 2021
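Why Beta regression? Topic proportions live in (0, 1), so a Beta likelihood with a logit mean link is the natural replacement for a Gaussian linear model. A minimal sketch of such a fit, using the common mean/precision parameterization with a fixed precision phi (the toy data and the fixed phi are assumptions, not the paper's specification):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, gammaln

def neg_log_lik(beta, X, y, phi=10.0):
    """Negative Beta log-likelihood with a logit mean link.

    Mean/precision parameterization: y ~ Beta(mu*phi, (1-mu)*phi).
    Holding phi fixed is a simplifying assumption for this sketch.
    """
    mu = expit(X @ beta)                              # mean in (0, 1)
    a, b = mu * phi, (1.0 - mu) * phi
    return -np.sum(gammaln(phi) - gammaln(a) - gammaln(b)
                   + (a - 1.0) * np.log(y) + (b - 1.0) * np.log(1.0 - y))

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])  # toy covariates
y = np.clip(rng.beta(2.0, 5.0, size=200), 1e-4, 1 - 1e-4)  # toy proportions
fit = minimize(neg_log_lik, x0=np.zeros(2), args=(X, y))
print(fit.x)  # regression coefficients on the logit scale
```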
Effective Token Graph Modeling using a Novel Labeling Strategy for Structured Sentiment Analysis The state-of-the-art model for structured sentiment analysis casts the task as a dependency parsing problem, which has some limitations: (1) the label proportions for span prediction and span relation prediction are imbalanced; (2) two nodes in a dependency graph cannot have multiple arcs, which are necessary for this task; (3) the losses for predicting the imbalanced labels are applied directly in the prediction layer, which further exacerbates the imbalance problem. In this work, we propose niche-targeting solutions for these issues. First, we introduce a novel labeling strategy, which contains two sets of token pair labels, namely essential labels and whole labels. The essential label set consists of the minimum labels for this task, which are relatively balanced and applied in the prediction layer. The whole label set includes rich labels that help our model capture various token relations; it is imbalanced but applied only in the hidden layer to softly influence our model. Moreover, we propose an effective model to collaborate with our labeling strategy, which is equipped with a graph attention network to iteratively refine token representations and adaptive multi-label classification to dynamically predict multiple relations between token pairs. We perform extensive experiments on 5 benchmark datasets in four languages. Experimental results show that our model outperforms previous SOTA models by a large margin. We believe that our labeling strategy and model can be readily extended to other structured prediction tasks. PDF 11 2021
Self-training with Modeling Ambiguous Data for Low-Resource Relation Extraction We present a simple yet effective approach to improve the performance of self-training for relation extraction in a low-resource scenario. The approach first classifies the auto-annotated instances into two groups, confident instances and uncertain instances, according to the probabilities predicted by a teacher model. In contrast to most previous studies, which use only the confident instances for self-training, we also make use of the uncertain instances. We propose a method to identify ambiguous but useful instances among the uncertain instances. Then, we apply negative training to the ambiguous instances and positive training to the confident instances. Finally, they are combined in a joint-training manner to build a relation extraction system. Experimental results on two widely used datasets with low-resource settings demonstrate that this new approach achieves significant and consistent improvements over several competitive self-training systems. PDF 11 2021
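The abstract does not spell out the objectives, but negative training is commonly formulated as minimizing the probability of a label the instance is believed not to have, while positive training is ordinary cross-entropy. A minimal PyTorch sketch under that assumption (the instance-selection thresholds and the paper's exact loss are not given here):

```python
import torch
import torch.nn.functional as F

def positive_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # standard cross-entropy for confident, likely-correct instances
    return F.cross_entropy(logits, labels)

def negative_loss(logits: torch.Tensor, avoid_labels: torch.Tensor) -> torch.Tensor:
    # negative training for ambiguous instances: push down the probability
    # of the relation label the teacher is likely wrong about
    probs = F.softmax(logits, dim=-1)
    p_avoid = probs.gather(1, avoid_labels.unsqueeze(1)).squeeze(1)
    return -torch.log(1.0 - p_avoid + 1e-8).mean()
```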
Automatic Generation of Electromyogram Diagnosis Report: Task and Dataset Writing electromyogram reports can be problematic for less experienced physicians and time-consuming for experienced ones. In this paper, we explore generating textual reports from tabular electromyogram diagnostic records. We construct the first dataset for this task and report results for several baseline approaches. PDF 11 2021
DAML-ST5: Low Resource Style Transfer via Domain Adaptive Meta Learning Text style transfer (TST) without parallel data has achieved some practical success. However, most existing unsupervised text style transfer methods suffer from (i) requiring massive amounts of nonparallel data to guide the transfer of different text styles, and (ii) severe performance degradation when fine-tuning the model on new domains. In this work, we propose DAML-ST5, which consists of two parts, DAML and ST5. DAML is a domain adaptive meta-learning approach that refines general knowledge across multiple heterogeneous source domains and is capable of adapting to new unseen domains with a small amount of data. Moreover, we propose a new unsupervised TST model, Style-T5 (ST5), which is built on the sequence-to-sequence pre-trained language model T5 and uses style adversarial training for better content preservation and style transfer. Results on multi-domain datasets demonstrate that our approach generalizes well to unseen low-resource domains, achieving state-of-the-art results against ten strong baselines. PDF 11 2021
AutoMin: A Novel Dataset for Automatic Minuting from Multi-Party Meetings in English and Czech Taking minutes is an essential component of every meeting, although the goals, style, and procedure of this activity (``minuting'' for short) can vary. Minuting is a rather unstructured writing activity and is affected by who is taking the minutes and for whom the minutes are intended. With the rise of online meetings, automatic minuting would be an important benefit for meeting participants as well as for those who might have missed the meeting. However, automatically generating meeting minutes is a challenging problem due to a variety of factors, including the quality of automatic speech recognition (ASR) systems, the availability of public meeting data, the subjective knowledge of the minuter, etc. In this work, we present the first dataset of its kind for automatic minuting. We develop a dataset of English and Czech technical project meetings which consists of transcripts generated by ASR systems, manually corrected, and minuted by several annotators. Our dataset, AutoMin, consists of 113 (English) and 53 (Czech) meetings, covering more than 160 hours of meeting content. The meeting sessions are recorded, automatically transcribed, corrected, equipped with human-generated minutes, and finally de-identified. We will publicly release (aaa.bbb.ccc) the dataset as a set of meeting transcripts and minutes, excluding the recordings for privacy reasons. A unique feature of our dataset is that most meetings are equipped with more than one set of minutes, each created independently. Our corpus thus allows studying differences in what people find important while taking minutes. We also provide baseline experiments for the community to explore this novel problem further. To the best of our knowledge, AutoMin is the first resource on minuting in English as well as in a language other than English (Czech). PDF 11 2021
SalesBot: Transitioning from Chit-Chat to Task-Oriented Dialogues Dialogue systems are usually categorized into two types, open-domain and task-oriented. The former focuses on chatting with users and keeping them engaged in the conversation, where selecting a topic that fits the dialogue context is essential for a successful dialogue. The latter focuses on a specific task instead of casual talk, e.g., finding a movie on Friday night or playing a song. These two directions have been studied separately due to their different purposes. However, smoothly transitioning from social chatting to task-oriented dialogues is important for triggering business opportunities, and there is no public data focusing on such scenarios. Hence, this paper investigates conversations that start from open-domain social chatting and gradually transition to task-oriented purposes, and releases a large-scale dataset with detailed annotations to encourage this research direction. To achieve this goal, this paper proposes a framework to automatically generate many dialogues without human involvement, in which any powerful open-domain dialogue generation model can be easily leveraged. Human evaluation shows that our generated dialogue data has a natural flow and reasonable quality, suggesting that the released data has great potential to guide future research directions and commercial activities. Furthermore, the released models allow researchers to automatically generate unlimited dialogues in the target scenarios, which can greatly benefit semi-supervised and unsupervised approaches. PDF 11 2021
Graph Pre-training for AMR Parsing and Generation Abstract meaning representation (AMR) highlights the core semantic information of text in a graph structure. Recently, pre-trained language models (PLMs) have advanced tasks of AMR parsing and AMR-to-text generation. However, PLMs are typically pre-trained on textual data, thus are sub-optimal for modeling structural knowledge. To this end, we investigate graph self-supervised training to improve the structure awareness of PLMs over AMR graphs. In particular, we introduce two graph auto-encoding strategies for graph-to-graph pre-training and four tasks to integrate text and graph information during pre-training. We further design a unified framework to bridge the gap between pre-training and fine-tuning tasks. Experimental results on both AMR parsing and AMR-to-text generation tasks show the superiority of our model. To our knowledge, we are the first to consider pre-training on AMR graphs. PDF 11 2021
Improving Document-level Relation Extraction via Context Guided Mention Integration and Inter-pair Reasoning Document-level Relation Extraction (DRE) aims to recognize the relations between two entities, where an entity may correspond to multiple mentions that span beyond sentence boundaries. Few previous studies have investigated mention integration, which may be problematic because coreferential mentions do not contribute equally to a specific relation. Moreover, prior efforts mainly focus on reasoning at the entity level rather than capturing the global interactions between entity pairs. In this paper, we propose two novel techniques, Context Guided Mention Integration and Inter-pair Reasoning (CGM2IR), to improve DRE. Instead of simply applying average pooling, the contexts are used to guide the integration of coreferential mentions in a weighted-sum manner. Additionally, inter-pair reasoning executes an iterative algorithm on the entity pair graph to model the interdependency of relations. We evaluate our CGM2IR model on three widely used benchmark datasets, namely DocRED, CDR, and GDA. Experimental results show that our model outperforms previous state-of-the-art models. PDF 11 2021
Post-processing Networks: A Method for Optimizing Pipeline Task-oriented Dialogue Systems using Reinforcement Learning Many studies have proposed methods for optimizing the dialogue performance of an entire pipeline system by jointly training its modules using reinforcement learning. However, these methods are limited in that they can only be applied to modules implemented with trainable neural-based methods. To address this problem, we propose a method for optimizing the dialogue performance of a pipeline system composed of modules implemented with arbitrary methods. In our method, neural-based components called post-processing networks (PPNs) are installed inside the system to post-process the output of each module. All PPNs are updated with reinforcement learning to improve the overall dialogue performance of the system, without requiring each module itself to be updated. Through dialogue simulation experiments on the MultiWOZ dataset, we show that PPNs can improve the dialogue performance of pipeline systems consisting of various modules. PDF 11 2021
Probing the Prompting of CLIP on Human Faces Large-scale multimodal models such as CLIP have attracted great attention due to their generalization capability. CLIP can take free-form text prompts, but its performance varies with different text prompt manipulations, which is considered unpredictable. In this paper, we conduct a controlled study to understand how CLIP perceives images given different forms of text prompts, particularly for human facial attributes. We find that (1) using the prompt starter "a photo of" can guide the model to allocate higher attention weights to human faces, leading to better classification performance; (2) the CLIP model is better at aligning information from shorter text prompts, as additional textual details shift attention away from key words; (3) properly adding punctuation or removing stop words in the text prompt can shift attention to target information. Our practice on facial attributes sheds light on the design of reliable text prompts for CLIP in other tasks. PDF 11 2021
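For readers who want to try this kind of probe, a minimal sketch with the Hugging Face CLIP interface, comparing the same attribute phrased with and without the "a photo of" starter (the image path and the attribute wording are hypothetical; the paper's exact probing setup is not reproduced here):

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("face.jpg")  # hypothetical face image
prompts = ["a photo of a smiling person", "smiling person"]
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

sims = model(**inputs).logits_per_image  # image-text similarities (scaled)
print(sims)  # compare the match score with vs. without the prompt starter
```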
Distill and Calibrate: Denoising Inconsistent Labeling Instances for Chinese Named Entity Recognition Data-driven supervised models for named entity recognition (NER) have made significant improvements on standard benchmarks. However, such models often suffer severe performance degradation on large-scale noisy data. Thus, a practical and challenging question arises: can we leverage only a small amount of relatively clean data to guide an NER model learning from large-scale noisy data? To answer this question, we focus on the problem of inconsistently labeled instances. We observe that inconsistently labeled instances can be classified into five types of noise, each of which largely hinders model performance in our experiments. Based on this observation, we propose a simple yet effective denoising framework named Distillation and Calibration for Chinese NER (DCNER). DCNER consists of: (1) a Dual-stream Label Distillation mechanism for distilling the five types of inconsistently labeled instances from the noisy data; and (2) a Consistency-aware Label Calibration network for calibrating inconsistently labeled instances based on the relatively clean data. Additionally, we propose the first benchmark for validating the ability of Chinese NER models to resist inconsistent labeling. Finally, detailed experiments show that our method consistently and significantly outperforms previous methods on the proposed benchmark. PDF 11 2021
End-to-end Task-oriented Dialog Policy Learning based on Pre-trained Language Model This paper presents our approach to dialog policy learning (DPL), which aims to determine the system's next action based on the current dialog state maintained by a dialog state tracking module. Different from previous stage-wise DPL, we propose an end-to-end DPL system to avoid error accumulation between dialogue turns. The DPL system is developed from two perspectives. First, we consider turn-level DPL, which selects the best dialog action from a predefined action set. Specifically, we propose a dialog action-oriented BERT (DA-BERT), which integrates a new pre-training procedure named the masked last action task (MLA) that encourages BERT to be dialog-aware and to distill action-specific features. Second, we propose word-level DPL, which directly generates the dialog action. We model DPL as a sequence generation task conditioned on the dialog action structure. GPT-2 equipped with an action structure parser module (termed DA-GPT-2) is then applied to learn word-level DPL. The effectiveness and different characteristics of the proposed models are demonstrated on in-domain tasks and domain adaptation tasks on MultiWOZ, with both simulator evaluation and human evaluation. PDF 11 2021
Redistributing Low-Frequency Words: Making the Most of Monolingual Data in Non-Autoregressive Translation Knowledge distillation (KD) is the preliminary step for training non-autoregressive translation (NAT) models, which eases the training of NAT models at the cost of losing important information for translating low-frequency words. In this work, we provide an appealing alternative for NAT -- monolingual KD, which trains the NAT student on external monolingual data with an AT teacher trained on the original bilingual data. Monolingual KD is able to transfer both the knowledge of the original bilingual data (implicitly encoded in the trained AT teacher model) and that of the new monolingual data to the NAT student model. Extensive experiments on eight WMT benchmarks over two advanced NAT models show that monolingual KD consistently outperforms standard KD by improving low-frequency word translation, without introducing any extra computational cost. Monolingual KD enjoys desirable expandability and can be further enhanced (given more computational budget) by combining it with standard KD, a reverse monolingual KD, or by enlarging the scale of monolingual data. Extensive analyses demonstrate that these techniques can be used together profitably to further recall the useful information lost in standard KD. Encouragingly, combined with standard KD, our approach achieves 30.4 and 34.1 BLEU points on the WMT14 English-German and German-English datasets, respectively. Code, data, and models will be released. PDF 11 2021
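Stated as data flow, the recipe is short. A sketch under stated assumptions (the `translate` callable stands in for the trained AT teacher's decoder; no specific toolkit is implied, and the commented names are hypothetical):

```python
from typing import Callable, Iterable, List, Tuple

def build_kd_data(
    translate: Callable[[str], str],   # AT teacher's decode function (assumed)
    sources: Iterable[str],
) -> List[Tuple[str, str]]:
    """Distill a source corpus into (source, teacher translation) pairs."""
    return [(src, translate(src)) for src in sources]

# Standard KD distills the bilingual sources themselves; monolingual KD
# distills *external* monolingual text, exposing the NAT student to new
# contexts for low-frequency words. Both feed the same NAT training step.
# standard_kd    = build_kd_data(teacher.translate, [s for s, _ in bilingual_pairs])
# monolingual_kd = build_kd_data(teacher.translate, monolingual_corpus)
```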
Identifying Corporate Credit Risk Sentiments from Financial News Credit risk management is a major practice for financial institutions that helps them measure and understand the inherent risk within their portfolios. Historically, they have relied on the assessment of default probabilities (via structural or default intensity models) and used the press as one tool to gather insights on the latest credit event developments of an entity. However, because current news volume and coverage for companies is generally heavy, analyzing news manually is a highly laborious task for financial experts. To this end, we propose a novel deep learning-powered approach to automate news analysis and the detection of adverse credit events, with the aim of scoring the credit sentiment associated with a company in order to assist credit risk management efficiently. The result is a complete system leveraging news extraction and data enrichment (with targeted sentiment entity recognition to detect companies and text classification to identify credit events), as well as a custom scoring mechanism designed to provide the company's credit sentiment, called the Credit Sentiment Score™ (CSS). Additionally, studies are presented to illustrate how CSS helps to gain knowledge about a company's credit profile and also discriminates between defaulters and non-defaulters. PDF 11 2021
Tokenization on the Number Line is All You Need Despite recent breakthroughs in language modeling, the ability of language models to represent numbers is insufficient. Subword tokenization, the standard choice for number representation, breaks a number into arbitrary chunks, thereby failing to explicitly capture the relationship between two numbers on the number line. To alleviate this shortcoming, alternate approaches have been proposed that modify numbers at various stages of the language modeling pipeline. These methods can be broadly classified into three categories that make changes to (a) the notation (\eg scientific vs decimal), (b) the vocabulary (\eg introducing a new token for numbers in the range $10-100$), and (c) the architecture, to directly regress to a desired number. The contributions of this work are threefold: first, we propose vocabulary-level changes in the decoding stage and study their behavior. Next, we study the performance of both the proposed approach and existing number representation schemes in the context of masked number prediction. We find that a carefully designed tokenization scheme is both the simplest to implement and sufficient, \ie with performance similar to the state-of-the-art approach that requires significant architectural changes. Finally, we evaluate the various number representation schemes on the downstream task of numerical fact estimation (for Fermi problems) in a zero-shot setting and find similar trends, \ie changes at the tokenization level achieve near state-of-the-art results while requiring minimal resources compared to other number representation schemes. PDF 11 2021
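To make the notation- and tokenization-level options concrete, a small illustrative sketch; the two schemes below are common choices in this literature, not necessarily the paper's exact ones:

```python
def tokenize_number(num_str: str, scheme: str = "digit"):
    """Two illustrative number tokenizations (assumed, for exposition)."""
    if scheme == "digit":
        # expose place value by emitting one token per digit
        return list(num_str)
    if scheme == "scientific":
        # normalize to mantissa/exponent tokens
        mantissa, _, exponent = f"{float(num_str):e}".partition("e")
        return [f"{float(mantissa):.3g}", "e", str(int(exponent))]
    raise ValueError(f"unknown scheme: {scheme}")

print(tokenize_number("1234"))                 # ['1', '2', '3', '4']
print(tokenize_number("1234", "scientific"))   # ['1.23', 'e', '3']
```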
Unsupervised Domain Adaptation with Contrastive Learning for Cross-domain Chinese NER Understanding and recognizing the entities in Chinese articles relies heavily on fully supervised learning over a domain-specific annotated corpus. However, this paradigm fails to generalize to unlabeled data from other domains, which involve different entity semantics and domain knowledge. To address this domain shift issue, we propose a framework for unsupervised Domain Adaptation with Contrastive learning for Chinese NER (DAC-NER). We follow the Domain Separation Network (DSN) framework, leveraging a private-share pattern to capture domain-specific and domain-invariant knowledge. Specifically, we enhance Chinese word representations by injecting an external lexical knowledge base into the context-aware word embeddings, and then combine them with sentence-level semantics to represent the domain knowledge. To learn domain-invariant knowledge, we replace the conventional adversarial method with a novel contrastive regularization to further improve generalization. Extensive experiments conducted on the labeled source domain MSRA and the unlabeled target domains Social Media and News show that our approach outperforms the state of the art, improving the F1 score by 8.7% over the baseline. PDF 11 2021
CREATE: A Benchmark for Chinese Short Video Retrieval and Title Generation Previous work on video captioning aims to objectively describe a video's actual content, lacking the subjective and attractive expression needed in practical application scenarios. Video titling is intended to achieve this goal, but a proper benchmark has been lacking. In this paper, we propose CREATE, the first large-scale Chinese shoRt vidEo retrievAl and Title gEneration benchmark, to facilitate research and application in video titling and video retrieval in Chinese. CREATE consists of a high-quality labeled 210K dataset and two large-scale 3M/10M pre-training datasets, covering 51 categories, 50K+ tags, 537K manually annotated titles and captions, and 10M+ short videos. Based on CREATE, we propose a novel model, ALWIG, which combines the video retrieval and video titling tasks to achieve multi-modal ALignment WIth Generation with the help of video tags and a GPT pre-trained model. CREATE opens new directions for future research and applications in video titling and video retrieval for Chinese short videos. PDF 11 2021
EventBERT Pre-trained language models (PrLMs) have shown impressive performance in natural language understanding. However, they mainly rest on extracting context-sensitive statistical patterns without explicit modeling of linguistic information such as the semantic relationships entailed in natural language. In this work, we propose EventBERT, an event-based semantic representation model that takes BERT as the backbone and refines it with event-based structural semantics via a graph convolutional network. EventBERT benefits simultaneously from the rich event-based structures embodied in the graph and the contextual semantics learned in the pre-trained BERT model. Experimental results on the GLUE benchmark demonstrate its effectiveness. PDF 11 2021
Integrating Vectorized Lexical Constraints for Neural Machine Translation Lexically constrained neural machine translation (NMT), which controls the generation of NMT models with pre-specified constraints, is important in many practical scenarios. Due to the representation gap between discrete constraints and continuous vectors of NMT models, most existing works propose to construct synthetic data or modify the decoding algorithm to impose lexical constraints, treating the NMT model as a black box. In this work, we directly integrate the constraints into NMT models through vectorizing discrete constraints into continuous keys and values that can be utilized by the attention modules of NMT models. The proposed integration method is based on the assumption that the correspondence between the keys and values in attention modules is naturally suitable for modeling constraint pairs. Experimental results show that our method consistently outperforms several representative baselines on four language pairs, demonstrating the necessity of integrating vectorized lexical constraints. PDF 11 2021
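The core mechanism, appending vectorized constraint pairs onto the ordinary attention memory, can be sketched in a few lines of PyTorch (single query, single head; the paper's integration into the full NMT attention stack is more involved than this):

```python
import torch

def attend_with_constraints(q: torch.Tensor, K: torch.Tensor, V: torch.Tensor,
                            cK: torch.Tensor, cV: torch.Tensor) -> torch.Tensor:
    """q: [d]; K, V: [n, d] regular memory; cK, cV: [m, d] constraint pairs.

    Constraint keys/values are concatenated to the regular memory so the
    attention module can 'look up' a constrained translation directly,
    rather than having constraints imposed at decoding time.
    """
    K_all = torch.cat([K, cK], dim=0)
    V_all = torch.cat([V, cV], dim=0)
    scores = (K_all @ q) / (q.shape[-1] ** 0.5)   # scaled dot-product
    return torch.softmax(scores, dim=0) @ V_all
```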
Towards Interpretable Math Word Problem Solving with Grounded Linguistic Logic Reasoning Automatic math word problem (MWP) solving is a challenging artificial intelligence task, since a machine should be able not only to understand the problem comprehensively in linguistic terms but also to grasp the grounded math logic it entails. Recently, many deep learning models have made great progress in MWP solving in terms of answer accuracy, but they rely on shallow heuristics to achieve high performance and lack grounded math logic reasoning, which makes them uninterpretable. To address this issue and push the research boundary of MWPs towards interpretable MWP solving, we construct a large-scale, high-quality MWP dataset named InterMWP, which consists of 11,507 MWPs, annotates interpretable algebraic knowledge formulas as the grounded linguistic logic of each solving equation, and asks a solver to output the corresponding formula whenever it decides that the currently predicted node is an inner node (operator) during expression reasoning. We further propose a strong baseline called InterSolver to show the effectiveness of our constructed dataset and to demonstrate how to harvest this logical knowledge by fusing it with semantic representations to improve problem solving and take a step towards interpretability. Experimental results show that our InterSolver has strong logical formula-based interpretability while achieving high answer accuracy simultaneously. PDF 11 2021
Multi-Granularity Contrastive Knowledge Distillation for Multimodal Named Entity Recognition Recognizing named entities in short and informal multimodal posts is very valuable in this age of information explosion. Despite the success of existing methods in multi-modal named entity recognition (MNER), they rely on well-aligned text and image pairs, while the datasets contain a lot of noise. Moreover, it is difficult to establish a deep connection between the representations of text and images with internal correlations, because of the mismatched semantic levels of the text encoder and image encoder. In this paper, we propose multi-granularity contrastive knowledge distillation (MGC) to build a unified joint representation space for the two modalities. By leveraging a multi-granularity contrastive loss, our approach pulls the representations of matched image-text pairs or image-entity pairs together while pushing unrelated image-text or image-entity pairs apart. By using the CLIP model for knowledge distillation, we obtain more fine-grained visual concepts. Experimental results on two benchmark datasets prove the effectiveness of our method. PDF 11 2021
OneAligner: Zero-shot Cross-lingual Transfer with One Rich-Resource Language Pair for Low-Resource Sentence Retrieval Aligning parallel sentences in multilingual corpora is essential for curating data for downstream applications such as machine translation. In this work, we present OneAligner, an alignment model specially designed for sentence retrieval tasks. This model can be trained on only one language pair and transfers, in a cross-lingual fashion, to low-resource language pairs with negligible degradation in performance. When trained with all language pairs of a large-scale parallel multilingual corpus (OPUS-100), this model achieves the state-of-the-art result on the Tatoeba dataset, outperforming an equally-sized previous model by $8.0$ points in accuracy while using less than $0.6\%$ of their parallel data. When finetuned on a single rich-resource language pair, be it English-centered or not, our model is able to match the performance of models finetuned on all language pairs under the same data budget with less than a $2.0$-point decrease in accuracy. Furthermore, with the same setup, scaling up the number of rich-resource language pairs monotonically improves the performance, reaching a minimum discrepancy of $0.4$ points in accuracy, essentially obviating the need to collect any low-resource parallel data. Finally, we conclude through empirical results and analyses that performance on the retrieval tasks depends mostly on the monolingual and parallel data sizes, up to a certain size threshold, rather than on which language pairs are used for training or evaluation. PDF 11 2021
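At inference time, sentence retrieval of this kind typically reduces to nearest-neighbor search over normalized sentence embeddings; a minimal sketch (the encoder producing the embeddings is assumed, not OneAligner's own weights):

```python
import torch
import torch.nn.functional as F

def align(src_embs: torch.Tensor, tgt_embs: torch.Tensor) -> torch.Tensor:
    """Return, for each source sentence, the index of its best target match.

    src_embs: [n_src, d], tgt_embs: [n_tgt, d] -- embeddings from any
    multilingual sentence encoder (assumed for this sketch).
    """
    src = F.normalize(src_embs, dim=-1)
    tgt = F.normalize(tgt_embs, dim=-1)
    return (src @ tgt.T).argmax(dim=-1)   # cosine-similarity retrieval
```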
Real-Time Visual Feedback to Guide Benchmark Creation: A Human-and-Metric-in-the-Loop Workflow Recent research has shown that language models exploit 'artifacts' in benchmarks to solve tasks, rather than truly learning them, leading to inflated model performance. In pursuit of creating better benchmarks, we propose VAIDA, a novel benchmark creation paradigm for NLP that focuses on guiding crowdworkers, an under-explored facet of addressing benchmark idiosyncrasies. VAIDA facilitates sample correction by providing real-time visual feedback and recommendations to improve sample quality. Our approach is domain-, model-, task-, and metric-agnostic, and constitutes a paradigm shift towards robust, validated, and dynamic benchmark creation via human-and-metric-in-the-loop workflows. We evaluate via expert review and a user study with NASA TLX. We find that VAIDA decreases the effort, frustration, and mental and temporal demands of crowdworkers and analysts, while simultaneously increasing the performance of both user groups, with a 45.8% decrease in the level of artifacts in created samples. As a by-product of our user study, we observe that the created samples are adversarial across models, leading to performance decreases of 31.3% (BERT), 22.5% (RoBERTa), and 14.98% (GPT-3 few-shot). PDF 11 2021
Uncertainty-based Visual Question Answering: Estimating Semantic Inconsistency between Image and Knowledge Base The knowledge-based visual question answering (KVQA) task aims to answer questions that require additional external knowledge as well as an understanding of images and questions. Recent studies on KVQA inject external knowledge in a multi-modal form, but as more knowledge is used, irrelevant information may be added and confuse the question answering. In order to use the knowledge properly, this study proposes the following: 1) we introduce a novel semantic inconsistency measure using caption uncertainty and semantic similarity; 2) we suggest a new external knowledge assimilation method based on the semantic inconsistency measure and apply it to integrate explicit and implicit knowledge for KVQA; 3) the proposed method is evaluated on the OK-VQA dataset and achieves state-of-the-art performance. PDF 11 2021
Linguistic Diversity Scores for NLP Data Sets Quantifying linguistic diversity in multilingual data sets is important for improving the cross-linguistic coverage of NLP models. However, current linguistic diversity scores rely mostly on measures such as the number of languages in the sample, which are not very informative about the structural properties of languages. In this paper, we propose a score derived from the distribution of a text statistic (mean word length) as a linguistic attribute suitable for cross-linguistic comparison. We compare NLP data sets (UD, Bible100, mBERT, XTREME, XGLUE, XNLI, XCOPA, TyDiQA, XQuAD) to a new data set designed specifically to be typologically representative (WALS-SC). To do so, we apply a version of the Jaccard index ($J_{mm}$) suitable for comparing sets of measures. This diversity score can identify the types of languages that need to be included in multilingual data sets in order to reach broad linguistic coverage. We find, for example, that (poly)synthetic languages are missing from almost all data sets. PDF 11 2021
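The usual way to extend the Jaccard index from sets to vectors of non-negative measures (e.g., binned distributions of mean word length) is the ratio of element-wise minima to element-wise maxima; whether this matches the paper's exact $J_{mm}$ definition is an assumption. A sketch:

```python
import numpy as np

def jaccard_measures(p: np.ndarray, q: np.ndarray) -> float:
    """Jaccard index generalized to non-negative measure vectors:
    sum of element-wise minima over sum of element-wise maxima."""
    return float(np.minimum(p, q).sum() / np.maximum(p, q).sum())

# e.g., histograms of mean word length over two data sets (toy numbers)
p = np.array([0.1, 0.4, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.3, 0.2])
print(jaccard_measures(p, q))  # 1.0 would mean identical distributions
```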
Exploiting Dialogue Act for Knowledge Selection and Response Generation A dialogue act (DA) describes the intention or function of a dialogue utterance. In document-grounded dialogue, correctly understanding the dialogue context is crucial for models to select knowledge and inject it into responses. Leveraging dialogue acts can help models understand the dialogue context and consequently assist the utilization of document information. In this paper, we propose a novel framework leveraging two different kinds of DAs (model-annotated and human-annotated) for \textbf{Knowledge Selection} (KS) and \textbf{Response Generation} (RG). The framework consists of two modules: the prediction module is trained with multi-task learning and learns to select knowledge and predict the next DA; the generation module uses the selected knowledge and the predicted DA for RG. Our model achieves new state-of-the-art performance on three public datasets, and the results verify that leveraging DAs can help KS and RG. Our code and data will be released on github.com. PDF 11 2021
A Well-Composed Text is Half Done! Semantic Composition Sampling for Diverse Conditional Generation We propose Composition Sampling, a simple but effective method to generate higher quality diverse outputs for conditional generation tasks, compared to previous stochastic decoding strategies. It builds on recently proposed planning-based neural generation models that are trained to first create a composition of the output using an entity chain and then continue to generate conditioned on the entity chain and the input \cite{frost}. Our approach avoids text degeneration by first sampling a composition in the form of an entity chain and then using beam search to generate the best possible text grounded to the entity chain. Experiments on CNN/DailyMail and XSum using a variety of automatic metrics and human-based evaluation demonstrate that Composition Sampling is currently the best available decoding strategy for generating diverse meaningful summaries. We further outperform state-of-the-art approaches for question generation in terms of BLEU. PDF 11 2021
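In decoding terms, the recipe is two generate calls. A sketch with the Hugging Face generation API, assuming (hypothetically) a checkpoint fine-tuned to emit an entity-chain plan before the text, as in the planning-based models the abstract cites; end-of-sequence handling of the sampled plan is glossed over:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "frost-style-summarizer"  # hypothetical plan-then-summarize checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

inputs = tok("document text ...", return_tensors="pt")
# Step 1: nucleus-sample only the composition (entity chain) for diversity.
plan = model.generate(**inputs, do_sample=True, top_p=0.95, max_new_tokens=30)
# Step 2: deterministic beam search for the text, grounded to the sampled
# composition by forcing it as the decoder prefix (EOS stripping omitted).
summary = model.generate(**inputs, decoder_input_ids=plan,
                         num_beams=4, max_new_tokens=120)
print(tok.decode(summary[0], skip_special_tokens=True))
```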
Two Front-Ends, One Model : Fusing Heterogeneous Speech Features for Low Resource ASR with Multilingual Pre-Training Transfer learning is widely applied in various deep learning-based speech tasks, especially those with a limited amount of data. Recent studies of transfer learning have focused on either supervised or self-supervised perspectives. This work, however, seeks to incorporate the two schemes together for low-resource automatic speech recognition (ASR) in minority and endangered language (EL) communities. We propose a general framework that uses learned transformations to resolve time-resolution differences between any speech features, allowing for the fusion of any self-supervised representations or spectral features used in multilingual pre-training. Our experiments over two low-resource languages and three ELs demonstrate that the proposed framework can significantly improve the absolute average word error rate from 45.4% to 35.5%. PDF 11 2021
BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla In this paper, we introduce 'BanglaBERT', a BERT-based Natural Language Understanding (NLU) model pretrained in Bangla, a widely spoken yet low-resource language in the NLP literature. To pretrain BanglaBERT, we collect 27.5 GB of Bangla pretraining data (dubbed 'Bangla2B+') by crawling 110 popular Bangla sites. We introduce a new downstream task dataset on Natural Language Inference (NLI) and benchmark on four diverse NLU tasks covering text classification, sequence labeling, and span prediction. In the process, we bring them under the first-ever Bangla Language Understanding Evaluation (BLUE) benchmark. BanglaBERT achieves state-of-the-art results outperforming multilingual and monolingual models. We will make the BanglaBERT model, the new datasets, and a leaderboard publicly available to advance Bangla NLP. PDF 11 2021
ShrinkNAS : Single-Path One-Shot Operator Exploratory Training for Transformer with Dynamic Space Shrinking Neural Architecture Search (NAS) for Transformers has shown growing capability in exploiting the benefits of various Transformer architecture configurations. Recent studies envision the diverse potential of introducing unprecedented Transformer operators (OPs, such as convolution) into the structure, yet the existing methods for doing so are all time-consuming. Traditionally, Single-Path One-Shot (SPOS) models enable efficient search over a vast set of OPs. However, existing SPOS methods for Transformers focus only on dimensional configurations of the vanilla Transformer OPs (e.g., Multi-head Attention) and do not consider introducing other OPs. This paper explores the possibility of including new OPs in Transformer-based SPOS architecture search, to discover better Transformer structures with the high efficiency characteristic of the SPOS category. To achieve this, we propose Dynamic Space Shrinking (DSS), a novel method that resolves the problems introduced by newly added OPs by dynamically keeping the current sample space restricted to subnets with good configurations and performance. We implement DSS in ShrinkNAS, the first SPOS one-shot inter-OP model for Transformers. Our evaluation shows that ShrinkNAS has much higher elasticity, finding better structures that beat human-designed ones under a tight constraint (<10M parameters), while existing intra-OP SPOS methods are not even close. PDF 11 2021
Nested Named Entity Recognition as Latent Lexicalized Constituency Parsing Nested named entity recognition (NER) has been receiving increasing attention. Recently, Fu et al. (2020) adapted a span-based constituency parser to tackle nested NER. They treat nested entities as partially-observed constituency trees and propose the masked inside algorithm for partial marginalization. However, their method cannot leverage entity heads, which have been shown to be useful in entity mention detection and entity typing. In this work, we resort to more expressive structures, lexicalized constituency trees, in which constituents are annotated with headwords, to model nested entities. We leverage the Eisner-Satta algorithm to perform partial marginalization and inference efficiently. In addition, we propose to use (1) a two-stage strategy, (2) a head regularization loss, and (3) a head-aware labeling loss to enhance performance. We conduct a thorough ablation study to investigate the functionality of each component. Experimentally, our method achieves state-of-the-art performance on ACE2004, ACE2005 and NNE, and competitive performance on GENIA, while maintaining a fast inference speed. PDF 11 2021
Answering Open-Domain Multi-Answer Questions via a Recall-then-Verify Framework Open-domain questions are likely to be open-ended and ambiguous, leading to multiple valid answers. Existing approaches typically adopt the rerank-then-read framework, where a reader reads top-ranking evidence to predict answers. According to our empirical analysis, this framework faces three problems: first, to leverage a large reader under a memory constraint, the reranker should select only a few relevant passages to cover diverse answers, while balancing relevance and diversity is non-trivial; second, the small reading budget prevents the reader from accessing valuable retrieved evidence filtered out by the reranker; third, when using a generative reader to predict answers all at once based on all selected evidence, whether a valid answer will be predicted also pathologically depends on evidence of some other valid answer(s). To address these issues, we propose to answer open-domain multi-answer questions with a recall-then-verify framework, which separates the reasoning process of each answer so that we can make better use of retrieved evidence while also leveraging large models under the same memory constraint. Our framework achieves state-of-the-art results on two multi-answer datasets, and predicts significantly more gold answers than a rerank-then-read system that uses an oracle reranker. PDF 11 2021
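The framework's control flow is easy to sketch. Everything below (`recall_candidates`, `retrieve_for`, `verify`, and the threshold) is a hypothetical stand-in for the paper's actual components:

```python
from typing import Callable, Iterable, List

def recall_then_verify(
    question: str,
    passages: List[str],
    recall_candidates: Callable[[str, List[str]], Iterable[str]],  # high-recall proposer
    retrieve_for: Callable[[str, List[str]], List[str]],           # per-answer evidence
    verify: Callable[[str, str, List[str]], float],                # answer-specific verifier
    threshold: float = 0.5,
) -> List[str]:
    # Unlike rerank-then-read, each candidate answer gets its own
    # verification pass, so valid answers no longer compete for a single
    # shared evidence budget.
    answers = []
    for cand in recall_candidates(question, passages):
        evidence = retrieve_for(cand, passages)
        if verify(question, cand, evidence) > threshold:
            answers.append(cand)
    return answers
```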
Bidirectional Modeling for Simultaneous Neural Machine Translation Simultaneous Neural Machine Translation (SimulNMT) generates the output before the entire input sentence is available and uses only unidirectional, left-to-right attention, so its decoding relies heavily on forecasting future content according to word-ordering rules. However, it is unrealistic to assume that word order strictly obeys the grammar rules of a language, especially in spoken language. To address the mismatch between SimulNMT's expectation of strict word order and the free word order of real scenarios, we propose bidirectional modeling. In detail, we train an additional backward model in which the input sentence is read from right to left while the target sentence is kept left-to-right. We then join this backward model with the standard forward SimulNMT model during decoding. This strategy enhances the robustness of SimulNMT and makes the model more adaptable to variable word order. Experiments show that our method brings improvements over strong baselines. PDF 11 2021
MDERank: A Masked Document Embedding Rank Approach for Unsupervised Keyphrase Extraction Keyphrase extraction (KPE) automatically extracts phrases in a document that provide a concise summary of its core content, benefiting downstream information retrieval and NLP tasks. Previous state-of-the-art methods select candidate keyphrases based on the similarity between learned representations of the candidates and the document. They suffer performance degradation on long documents due to the discrepancy between sequence lengths, which causes a mismatch between the representations of keyphrase candidates and the document. In this work, we propose a novel unsupervised embedding-based KPE approach, Masked Document Embedding Rank (MDERank), which addresses this problem by leveraging a masking strategy and ranking candidates by the similarity between the embeddings of the source document and the masked document. We further develop a KPE-oriented BERT (KPEBERT) model via a novel self-supervised contrastive learning method, which is more compatible with MDERank than vanilla BERT. Comprehensive evaluations on six KPE benchmarks demonstrate that MDERank outperforms the state-of-the-art unsupervised KPE approach by an average of 1.80 $F1@15$. MDERank further benefits from KPEBERT and overall achieves an average improvement of 3.53 $F1@15$ over SIFRank. PDF 11 2021
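A minimal sketch of the masking-and-ranking idea with a generic sentence encoder; the encoder choice and the string-level masking are simplifications (the paper works with BERT-style embeddings and proper tokenization). Note both things being compared are full-length documents, which is exactly what avoids the length mismatch:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in encoder

def mderank(document: str, candidates: list, top_k: int = 5) -> list:
    doc_emb = encoder.encode(document)
    scored = []
    for phrase in candidates:
        masked_emb = encoder.encode(document.replace(phrase, "[MASK]"))
        cos = float(np.dot(doc_emb, masked_emb) /
                    (np.linalg.norm(doc_emb) * np.linalg.norm(masked_emb)))
        scored.append((cos, phrase))
    # the more the embedding moves when a phrase is masked out,
    # the more central that phrase is to the document
    return [phrase for cos, phrase in sorted(scored)[:top_k]]
```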
Break it - Message it - Fix it : Learning to Repair Python Programs using Error Messages without Labelled Data In recent years there has been an increasing demand to reduce the gap between development and deployment. It has been estimated that developers spend almost 20\% of their time fixing problems with their code, so tools that can automatically repair code can help accelerate DevOps cycles. In this work we build upon the recent success of neuro-symbolic approaches to automatic code repair. In our approach, we use a dataset of Python code, viz. CodeNet, which represents the data distribution of human-generated code. We train two neural modules, a Breaker and a Fixer, iteratively, along with a symbolic module, Pylint. The Breaker learns to introduce errors into the code; the symbolic module acts as a critic and localizes the error by identifying the line, as well as providing the error type with a specific exception message. The Fixer utilizes the exception message to repair the erroneous line of code. We are able to cover 32 different syntax errors, and iterative training based on back-translation helps improve the performance of the Fixer. PDF 11 2021
Explicit Object Relation Alignment for Vision and Language Navigation We propose a neural agent to solve the navigation instruction following problem in a photo-realistic environment. We explicitly align the spatial information in both instruction and the visual environment, including landmarks and spatial relationships between the agent and landmarks. Our method significantly improves the baseline and is competitive with the SOTA in unseen environments. The qualitative analysis shows that explicitly modeled spatial reasoning improves the explainability of the action decisions and the generalizability of the model. PDF 11 2021
Text Smoothing: Enhance Various Data Augmentation Methods on Text Classification Tasks Before entering a neural network, a token needs to be converted to its one-hot representation, a discrete distribution over the vocabulary. A smoothed representation, by contrast, is the probability distribution over candidate tokens obtained from a pre-trained masked language model, which can be seen as a more informative augmented substitute for the one-hot representation. We propose an efficient data augmentation method, dubbed text smoothing, which converts a sentence from its one-hot representation to a controllable smoothed representation. We evaluate text smoothing on different datasets in a low-resource regime. Experimental results show that text smoothing outperforms various mainstream data augmentation methods by a substantial margin. Moreover, text smoothing can be combined with these data augmentation methods to achieve better performance. PDF 11 2021
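A minimal sketch of the idea with a Hugging Face masked language model: replace each one-hot token vector with the MLM's predicted distribution and interpolate between the two. The interpolation weight, and interpolating at all rather than some other controllability mechanism, are assumptions of this sketch:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def smooth(sentence: str, lam: float = 0.1) -> torch.Tensor:
    ids = tok(sentence, return_tensors="pt").input_ids       # [1, seq]
    with torch.no_grad():
        probs = mlm(ids).logits.softmax(-1)                  # [1, seq, vocab]
    onehot = torch.nn.functional.one_hot(ids, probs.size(-1)).float()
    # lam controls how much of the original one-hot identity is kept
    return lam * onehot + (1.0 - lam) * probs                # smoothed reps
```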
Unsupervised Multi-Granularity Summarization Text summarization is a user-preference-based task: for a single document, users often have different priorities for its summary, and the granularity of the summary is a core component of these preferences. However, most existing studies focus solely on single-granularity scenarios, resulting in models that are limited to producing summaries with similar semantic coverage and are not customizable. In this paper, we propose the first unsupervised multi-granularity summarization framework, GranuSum. We regard events as the basic semantic units of the original text and design a model that can take these events as anchors when generating summaries. Meanwhile, by ranking these hint events and controlling their number, GranuSum is capable of generating summaries at different granularities in an unsupervised manner. We develop a testbed for the multi-granularity summarization task, including a new human-annotated benchmark, GranuDUC, in which each document is paired with multiple summaries of different granularities. Extensive experiments on this benchmark and other large-scale datasets show that GranuSum substantially outperforms previous baselines. We also find that GranuSum exhibits impressive performance on conventional unsupervised abstractive summarization tasks by exploiting the event information, achieving new state-of-the-art results on three summarization datasets. PDF 11 2021
Speaker Information Can Guide Models to Better Inductive Biases: A Case Study On Predicting Code-Switching Natural language processing (NLP) models trained on people-generated data can be unreliable because, without any constraints, they can learn from spurious correlations or propagate dangerous biases about personal identities. We hypothesize that enriching models with speaker information in a controlled, educated way can guide them to pick up on relevant inductive biases. For the speaker-driven task of predicting code-switching points in English--Spanish bilingual dialogues, we show that adding sociolinguistically-grounded speaker features as prepended prompts significantly helps to improve accuracy. We find that by adding influential phrases to the input, speaker-informed models learn useful and explainable linguistic information. To our knowledge, we are the first to incorporate speaker characteristics in the code-switching setup, and more generally, take a step towards developing transparent models that control for biases in person-centric tasks. PDF 11 2021
EveMRC: A Two-stage Evidence Modeling For Multi-choice Machine Reading Comprehension Many impressive works have been proposed to improve the performance of Machine Reading Comprehension (MRC) systems in recent years. However, it is still difficult to interpret the predictions of existing MRC models, which makes them unconvincing. In this work, we propose a two-stage explainable framework for multi-choice MRC that models not only the correlation between answers and evidence, but also the competition among pieces of evidence. In stage 1, we select evidence sentences for both the right answer and the wrong answers using a semi-supervised evidence selector. In stage 2, we employ an evidence discriminator to compare the competing evidence and make final judgments. Moreover, we propose an evidence-enabled data augmentation method. Experiments on four multi-choice MRC datasets show that stage 1 provides strong explainability for MRC systems, while stage 2 improves both their performance and robustness. PDF 11 2021
EmoWOZ: A Large-Scale Corpus and Labelling Scheme for Emotion Recognition in Task-Oriented Dialogue Systems The ability to recognise emotions lends a conversational artificial intelligence a human touch. While emotions in chit-chat dialogues have received substantial attention, emotions in task-oriented dialogues have been largely overlooked despite having an equally important role, such as to signal failure or success. Existing emotion-annotated task-oriented corpora are limited in size, label richness, and public availability, creating a bottleneck for downstream tasks. To lay a foundation for studies on emotions in task-oriented dialogues, we introduce EmoWOZ, a large-scale manually emotion-annotated corpus of task-oriented dialogues. EmoWOZ is based on MultiWOZ, a multi-domain task-oriented dialogue dataset. It contains more than 11K dialogues with more than 83K emotion annotations of user utterances. In addition to Wizard-of-Oz dialogues from MultiWOZ, we collect human-machine dialogues within the same set of domains to sufficiently cover the space of various emotions that can happen during the lifetime of a data-driven dialogue system. To the best of our knowledge, this is the first large-scale open-source corpus of its kind. We propose a novel emotion labelling scheme, which is tailored to task-oriented dialogues. We report a set of experimental results to show the usability of this corpus for emotion recognition and state tracking in task-oriented dialogues. PDF 11 2021
MSP: Multi-Stage Prompting for Making Pre-trained Language Models Better Translators Prompting has recently been shown to be a promising approach for applying pre-trained language models to downstream tasks. We present Multi-Stage Prompting (MSP), a simple and automatic approach for applying pre-trained language models to translation tasks. To better mitigate the discrepancy between pre-training and translation, MSP divides the translation process via pre-trained language models into three separate stages: the encoding stage, the re-encoding stage, and the decoding stage. During each stage, we independently apply different continuous prompts, allowing pre-trained language models to better shift to translation tasks. We conduct extensive experiments on three translation tasks. Experiments show that our method can significantly improve the translation performance of pre-trained language models. PDF 11 2021
Moving the Eiffel Tower to ROME: Tracing and Editing Facts in GPT We investigate the mechanisms underlying factual knowledge recall in auto-regressive transformer language models. To this end, we develop a method for identifying neuron activations that are capable of altering a model's factual predictions. Within GPT-2, this reveals two distinct sets of neurons that we hypothesize correspond to knowing an abstract fact and saying a concrete word, respectively. Based on this insight, we propose ROME, a simple and efficient rank-one model editing method for rewriting abstract facts in auto-regressive language models. For validation, we introduce CounterFact, a dataset of over twenty thousand rewritable facts, as well as tools to facilitate sensitive measurements of edit quality. Compared to previously-published knowledge editing methods, ROME achieves superior generalization and specificity. PDF 11 2021
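The flavor of a rank-one edit is easy to demonstrate: nudge a weight matrix W so that a chosen key vector maps to a new value vector while changing W as little as possible. The sketch below is the generic least-change update, not ROME's actual closed form, which additionally weights the update by a covariance statistic estimated over many inputs:

```python
import numpy as np

def rank_one_edit(W: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Minimal-norm rank-one update so that the edited W maps k to v."""
    residual = v - W @ k                  # what's missing to make W k = v
    return W + np.outer(residual, k) / (k @ k)

W = np.random.randn(8, 4)                 # toy "fact storage" matrix
k, v = np.random.randn(4), np.random.randn(8)
W_new = rank_one_edit(W, k, v)
assert np.allclose(W_new @ k, v)          # the edited association now holds
```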
Dataset for N-ary Relation Extraction of Drug Combinations Combination therapies have become the standard of care for diseases such as cancer, tuberculosis, malaria and HIV. However, the combinatorial set of available multi-drug treatments creates a challenge, particularly in the presence of antagonistic drug combinations that may lead to negative patient outcomes. To assist medical professionals in identifying beneficial drug-combinations, we construct an expert-annotated dataset for extracting information about the efficacy of drug combinations from the scientific literature. Beyond its practical utility, the dataset also presents a unique NLP challenge, as it is the first relation extraction dataset consisting of variable-length relations. Furthermore, the relations in this dataset predominantly require language understanding beyond the sentence level, adding to the challenge of this task. We provide a strong baseline model and identify clear areas for further improvement. We release our dataset and code (https://anonymous.4open.science/r/drug-synergy-models--C8B7/README.md) publicly to encourage the NLP community to participate in this task. PDF 11 2021
Life after BERT: What do Other Muppets Understand about Language? Pre-trained transformers are at the core of natural language processing today. However, our understanding of what a model learns during pre-training is still limited. Existing model analysis works usually focus on only one or two model families at a time, overlooking the variety of existing architectures and pre-training objectives. In our work, we utilize the oLMpics benchmark and psycholinguistic probing datasets for a diverse set of 28 models, including T5, BART, and ALBERT. Additionally, we adapt the oLMpics zero-shot setup for autoregressive models and evaluate GPT networks of different sizes. Our findings show that none of these models can resolve compositional questions in a zero-shot fashion, suggesting that this skill is not learnable using existing pre-training objectives. Additionally, we find that global model decisions, such as architecture, directionality, size of the dataset, and pre-training objective, are not predictive of a model's linguistic capabilities. PDF 11 2021
A Model-agnostic Data Manipulation Method for Persona-based Dialogue Generation Towards building intelligent dialogue agents, there has been a growing interest in introducing explicit personas into generation models. However, with limited persona-based dialogue data at hand, it may be difficult to train a dialogue generation model well. We point out that the data challenges of this generation task lie in two aspects: first, it is expensive to scale up current persona-based dialogue datasets; second, each data sample in this task is more complex to learn with than conventional dialogue data. To alleviate these data issues, we propose a data manipulation method that is model-agnostic and can be combined with any persona-based dialogue generation model to improve its performance. The original training samples are first distilled and are thus expected to be easier to fit. Next, we show various effective ways to diversify such easier distilled data. A given base model is then trained via the constructed data curricula, i.e., first on augmented distilled samples and then on original ones. Experiments illustrate the superiority of our method with two strong base dialogue models (a Transformer encoder-decoder and GPT2). PDF 11 2021
Exploring Human-judged and Automatically-induced Correction Difficulty for Grammatical Error Correction While grammatical error correction (GEC) has improved in its correction performance, one of the key challenges in GEC research still remains in evaluation. Specifically, all errors are treated equally in conventional performance measures despite the fact that some errors are more difficult to correct than others. Ideally, difficult errors should be regarded as more important than easy ones in evaluation. This leads to the following ultimate research question --- Can even human experts estimate correction difficulty well? In this paper, we explore questions about correction difficulty centering on this research question. For this purpose, we first introduce a method for estimating agreement rates in correction difficulty judgements based on pairwise comparison. With the annotation of 2,025 instances using this method, we show that human experts exhibit a moderate agreement rate of 66.39\% (Cohen's-$\kappa$: 0.42) in judging correction difficulty. We also show that the agreement between this human-based difficulty and an automatically induced difficulty is comparable (64.50\% and $\kappa=0.35$ on average). We further look into the annotation results to reveal insights into the human-judged and machine-judged correction difficulties, reporting the following three findings: (i) where the human-judged and machine-judged difficulties are strong and weak; (ii) based on (i), correction difficulty can be GEC-algorithm- and training-corpus-dependent; (iii) human-judged and machine-judged correction difficulties complement each other. PDF 11 2021
Fine-grained video paragraph captioning via exploring object-centered internal and external knowledge The video paragraph captioning task aims at generating a fine-grained, coherent and relevant paragraph for a video. Existing works often treat objects (the potential main components of a sentence) in isolation from the whole video content, and rarely explore the latent semantic relation between a certain object and the current video concepts, making the generated descriptions dull and even incorrect. Besides, unlike in images where objects are static, the temporal states of objects change in videos; this dynamic information can contribute to a better understanding of the whole video content. Towards generating a more detailed and on-topic paragraph, we propose a novel framework that focuses on exploring the rich semantic and temporal meaning of objects, by constructing a concept graph from external commonsense knowledge and a state graph from the internal video frames. Extensive experiments on ActivityNet Captions and YouCook2 demonstrate the effectiveness of our method compared to state-of-the-art works. We will release our code on GitHub. PDF 11 2021
Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries Current pre-trained models applied for summarization are prone to factual inconsistencies which misrepresent the source text. Thus, evaluating the factual consistency of summaries is necessary to develop better models. However, the optimal human evaluation setup for factual consistency has not been standardized. To address this issue, we crowdsourced evaluations for factual consistency using the rating-based Likert Scale and ranking-based Best-Worst Scaling to determine the factors that affect the reliability of the human evaluation. Our crowdsourced evaluations are conducted on the summaries of CNN-Daily Mail and XSum datasets generated by four state-of-the-art models. Ranking-based Best-Worst Scaling offers a more reliable measure of summary quality across datasets, and the reliability of Likert ratings highly depends on the target dataset and the evaluation design. To improve the reliability, we extend the scale of the Likert rating to make it more flexible and we present a scoring algorithm for Best-Worst Scaling, called value learning. Our crowdsourcing guidelines and evaluation protocols will be publicly available to facilitate future research on factual consistency in summarization. PDF 11 2021
Active Relation Discovery: Towards General and Label-aware OpenRE Open Relation Extraction (OpenRE) aims to discover and label novel relations from open domains. Previous methods mainly suffer from two problems: (1) Insufficient capacity to discriminate between known and novel relations. When extending conventional test settings to a more general setting where test data might also come from seen classes, existing OpenRE approaches suffer a significant performance decline. (2) Secondary labeling must be performed before practical application. Existing methods cannot assign human-readable and meaningful types to novel relations, which downstream tasks urgently require. To address these issues, we propose the Active Relation Discovery (ARD) framework, which utilizes relational outlier detection for discriminating known and novel relations and involves active learning for labeling novel relations. Extensive experiments\footnote{The source code will be available for reproducibility.} on three real-world datasets show that ARD significantly outperforms state-of-the-art methods on both conventional and our proposed general OpenRE settings. PDF 11 2021
Structure Representation Learning by Jointly Learning to Pool and Represent Structure representation learning is the task of providing an overall representation for a given structure (e.g., sequential text, a non-sequential graph); this representation characterizes the properties of that structure. Previous methods decompose the task into an element representation learning phase and a pooling phase that aggregates element representations. Their pooling phase considers only the final representation of each element; the relationships between elements are used solely to construct those representations and are ignored during pooling. In this paper, we conjecture that classification performance suffers from this lack of relation exploitation while pooling, and propose Self-Attention Pooling, which dynamically provides centrality scores for pooling based on the self-attention scores from the element representation learning. Simply applying Self-Attention Pooling improves model performance on $3$ sentence classification tasks ({$\boldsymbol{\uparrow 2.9}$}) and $5$ graph classification tasks ({$\boldsymbol{\uparrow 2.1}$}) on average. PDF 11 2021
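A rough sketch of the pooling idea described above, assuming NumPy and a single-head dot-product attention over the element representations; the paper's exact score computation may differ.

import numpy as np

def self_attention_pooling(H):
    # H: (n, d) element representations from the representation-learning phase.
    scores = H @ H.T / np.sqrt(H.shape[1])       # pairwise attention logits
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)     # row-wise softmax
    centrality = attn.mean(axis=0)               # how much attention each element receives
    return centrality @ H                        # centrality-weighted structure vector

pooled = self_attention_pooling(np.random.default_rng(1).normal(size=(6, 16)))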
Cross-Document Temporal Relation Extraction with Temporal Anchoring Events Automatically extracting a timeline on a certain topic from multiple documents has been a challenge in natural language processing, partly due to the difficulty of collecting large amounts of training data. In this work, we collect a dataset for cross-document timeline extraction from online news that gives access to metadata such as hyperlinks and publication dates. The metadata allows us to define a set of important events while linking them to time anchors, which opens the opportunity to scale up data collection. Furthermore, with this set of linked news articles, we propose a method to enhance the inference process of temporal relation prediction, by utilizing a model to link events to a set of anchoring events that are added to the inference program. We report the performance of common neural models and show that our method can boost the performance of all baseline models. PDF 11 2021
Empathetic Persuasion: Reinforcing Empathy and Persuasiveness in Dialogue Systems Persuasion is an intricate process involving empathetic connection between two individuals. Plain persuasive responses may make a conversation non-engaging. Even the most well-intended and reasoned persuasive conversations can fall through in the absence of empathetic connection between the speaker and listener. In this paper, we propose the novel task of incorporating empathy when generating persuasive responses. We develop an empathetic persuasive dialogue system by fine-tuning a maximum likelihood estimation (MLE)-based language model in a reinforcement learning (RL) framework. To design feedback for our RL agent, we define an effective and efficient reward function that combines consistency, repetitiveness, emotion and persuasion rewards to ensure consistency, non-repetitiveness, empathy and persuasiveness in the generated responses. Due to the lack of emotion-annotated persuasive data, we first annotate the existing PersuasionForGood dataset with emotions, then build transformer-based classifiers to provide emotion-based feedback to our RL agent. Our experimental results confirm that our proposed model increases the rate of generating persuasive responses compared to available state-of-the-art dialogue models while making the dialogues empathetically more engaging and retaining the language quality of the responses. PDF 11 2021
Generative Pretraining for Paraphrase Evaluation We introduce ParaBLEU, a paraphrase representation learning model and evaluation metric for text generation. Unlike previous approaches, ParaBLEU learns to understand paraphrasis using generative conditioning as a pretraining objective. ParaBLEU correlates more strongly with human judgements than existing metrics, obtaining new state-of-the-art results on the 2017 WMT Metrics Shared Task. We show that our model is robust to data scarcity, exceeding previous state-of-the-art performance using only $50\%$ of the available training data and surpassing BLEU, ROUGE and METEOR with only $40$ labelled examples. Finally, we demonstrate that ParaBLEU can be used to conditionally generate novel paraphrases from a single demonstration, which we use to confirm our hypothesis that it learns abstract, generalized paraphrase representations. PDF 11 2021
Towards Focused and Connected Document-Level Event Extraction Document-level event extraction (DEE) is indispensable when events are naturally described in the form of a document. Although previous methods have achieved great success on DEE, they are limited by two bottlenecks: losing focus and losing the connection. In this paper, to break through the above bottlenecks, we annotated a new dataset, named WIKIEVENT++, towards focused and connected DEE. Besides, we propose two different models to approach this task: an extractive model and a generative model. Experimental results verify the effectiveness of our proposed methods. We further present a promising case study to explore the performance bottleneck for this task. Data and code will be released at \url{http://anonymized} to advance the research on document-level event extraction. PDF 11 2021
Leveraging Uni-Modal Self-Supervised Learning for Multimodal Audio-visual Speech Recognition Training Transformer-based models demands a large amount of data, while obtaining parallel aligned and labelled multimodal data is rather costly, especially for audio-visual speech recognition (AVSR). It therefore makes sense to exploit unlabelled uni-modal data. On the other hand, although the effectiveness of large-scale self-supervised learning is well established in both the audio and visual modalities, how to integrate those pre-trained models into a multimodal scenario remains underexplored. In this work, we successfully leverage uni-modal self-supervised learning to promote multimodal AVSR. In particular, we first train audio and visual encoders on a large-scale uni-modal dataset, then we integrate components of both encoders into a larger multimodal framework which learns to transcribe paired audio-visual data into characters through a combination of CTC and seq2seq decoding. We show that both components inherited from uni-modal self-supervised learning cooperate well, and the resulting multimodal framework yields competitive results through fine-tuning. Our model is experimentally validated on both word-level and sentence-level AVSR tasks. Notably, even without an external language model, our proposed model improves on the state-of-the-art performance on the widely accepted Lip Reading Sentences 2 (LRS2) dataset by a large margin, with a relative improvement of 30%. PDF 11 2021
Plug-Tagger: A Pluggable Sequence Labeling Framework with Pre-trained Language Models Fine-tuning pre-trained language models (PLMs) on downstream tasks is the de-facto paradigm in NLP. Despite its superior performance on sequence labeling, fine-tuning requires large-scale parameters and time-consuming deployment for each task, which limits its application in real-world scenarios. To alleviate these problems, we propose a pluggable sequence labeling framework, plug-tagger. By switching the task-specific plugin on the input, plug-tagger allows a frozen PLM to perform different sequence labeling tasks without redeployment. Specifically, the plugin consists of a few continuous vectors prepended to the input, which manipulate the PLM without modifying its parameters, so each task only needs to store the lightweight vectors rather than a full copy of the PLM. To avoid redeployment, we propose the label word mechanism, which reuses the language model head to prevent task-specific classifiers from modifying the model structure. Experimental results on three sequence labeling tasks show that the proposed method achieves performance comparable to fine-tuning while using only 0.1% task-specific parameters. Experiments also show that our method is faster than other lightweight methods under limited computational resources. PDF 11 2021
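A minimal sketch of the plugin-vector and label-word ideas just described, assuming NumPy; the frozen PLM body and head are replaced by toy callables, and every name below is illustrative rather than the paper's implementation.

import numpy as np

class PlugTagger:
    # Only self.plugin is trainable; the encoder and LM head stay frozen.
    def __init__(self, frozen_encode, frozen_lm_head, num_plugin_vecs, dim, label_word_ids):
        self.encode = frozen_encode                       # frozen PLM body
        self.lm_head = frozen_lm_head                     # frozen LM head, reused as the classifier
        self.plugin = np.random.randn(num_plugin_vecs, dim) * 0.02
        self.label_word_ids = label_word_ids              # one vocabulary id per tag label

    def predict(self, token_embeds):
        x = np.concatenate([self.plugin, token_embeds])   # prepend the task plugin
        h = self.encode(x)[len(self.plugin):]             # drop plugin positions
        logits = self.lm_head(h)
        return logits[:, self.label_word_ids].argmax(-1)  # label-word mechanism

# Toy stand-ins for the frozen PLM pieces.
rng = np.random.default_rng(0)
dim, vocab = 8, 100
W_head = rng.normal(size=(dim, vocab))
tagger = PlugTagger(lambda x: x, lambda h: h @ W_head, 4, dim, [7, 42, 99])
tags = tagger.predict(rng.normal(size=(5, dim)))          # one tag index per token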
Offensive Text Detection Across Languages and Datasets Using Rule-based and Hybrid Methods We investigate the potential of rule-based systems for the task of offensive text detection in English and German, and demonstrate their effectiveness in low-resource settings, as an alternative or addition to transfer learning across tasks and languages. Task definitions and annotation guidelines used by existing datasets show great variety, hence state-of-the-art machine learning models do not transfer well across datasets or languages. Furthermore, such systems lack explainability and pose a critical risk of unintended bias. We present simple rule systems based on semantic graphs for classifying offensive text in two languages and provide both quantitative and qualitative comparison of their performance with deep learning models on 5 datasets across multiple languages and shared tasks. PDF 11 2021
SuperShaper: Task-Agnostic Super Pre-training of BERT Models with Variable Hidden Dimensions Task-agnostic pre-training followed by task-specific fine-tuning is a default approach to train NLU models which need to be deployed on devices with varying resource and accuracy constraints. However, repeating pre-training and fine-tuning across tens of devices is prohibitively expensive. To address this, we propose SuperShaper, a task-agnostic pre-training approach wherein we pre-train a single model which subsumes a large number of Transformer models by varying shapes, i.e., by varying the hidden dimensions across layers. This is enabled by a backbone network with linear bottleneck matrices around each Transformer layer which are sliced to generate differently shaped sub-networks. Despite its simple design space and efficient implementation, SuperShaper radically simplifies NAS for language models and discovers networks that effectively trade off accuracy and model size: discovered networks are more accurate than a range of hand-crafted and automatically searched networks on GLUE benchmarks. Further, we find two critical advantages of shape as a design variable for Neural Architecture Search (NAS): (a) networks found with heuristics derived from good shapes match and even improve on carefully searched networks across a range of parameter counts, and (b) the latency of networks across multiple CPUs and GPUs is insensitive to shape, thus enabling device-agnostic search. PDF 11 2021
The Past Mistake is the Future Wisdom: Error-driven Contrastive Probability Optimization for Chinese Spell Checking Chinese Spell Checking (CSC) aims to detect and correct Chinese spelling errors, which are mainly caused by phonological or visual similarity. Recently, pre-trained language models (PLMs) have advanced the CSC task. However, there is a gap between the knowledge learned by PLMs and the goal of the CSC task. PLMs focus on the semantics of text and tend to correct erroneous characters to semantically proper or commonly used ones, but these aren't the ground-truth corrections. To address this issue, we propose an Error-driven COntrastive Probability Optimization (ECOPO) framework for the CSC task. ECOPO refines the knowledge representations of PLMs and guides the model to avoid predicting these common characters in an error-driven way. Notably, ECOPO is model-agnostic and can be combined with existing CSC methods to achieve better performance. Extensive experiments and detailed analyses on the SIGHAN datasets demonstrate that ECOPO is simple yet effective. PDF 11 2021
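One hedged reading of the error-driven contrastive idea above is a margin loss that pushes the gold character's probability above the probabilities of characters the model commonly confuses it with; the exact objective in the paper may differ, and the negatives here are assumed to come from the model's own past mistakes.

import numpy as np

def contrastive_probability_loss(probs, gold_id, negative_ids, margin=0.1):
    # The gold character's probability should exceed that of each commonly
    # confused (negative) character by at least `margin`.
    gold_p = probs[gold_id]
    losses = [max(0.0, margin - (gold_p - probs[n])) for n in negative_ids]
    return sum(losses) / len(losses)

probs = np.array([0.5, 0.3, 0.15, 0.05])            # model's output distribution (toy)
loss = contrastive_probability_loss(probs, gold_id=1, negative_ids=[0, 2])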
Co-VQA: Answering by Interactive Sub Question Sequence Most existing approaches to Visual Question Answering (VQA) answer questions directly; however, people usually decompose a complex question into a sequence of simple sub-questions and finally obtain the answer to the original question after answering the sub-question sequence (SQS). By simulating this process, this paper proposes a conversation-based VQA (Co-VQA) framework, which consists of three components: Questioner, Oracle, and Answerer. Questioner raises the sub-questions using an extended HRED model, and Oracle answers them one by one. An Adaptive Chain Visual Reasoning Model (ACVRM) for Answerer is also proposed, where the question-answer pair is used to update the visual representation sequentially. To perform supervised learning for each model, we introduce a well-designed method to build an SQS for each question on the VQA 2.0 and VQA-CP v2 datasets. Experimental results show that our method achieves state-of-the-art performance on VQA-CP v2. Further analyses show that SQSs help build direct semantic connections between questions and images, provide question-adaptive variable-length reasoning chains, and offer explicit interpretability as well as error traceability. PDF 11 2021
Systematicity, Compositionality and Transitivity of Deep NLP Models: a Metamorphic Testing Perspective Metamorphic testing has recently been used to check the safety of neural NLP models. Its main advantage is that it does not rely on a ground truth to generate test cases. However, existing studies are mostly concerned with robustness-like metamorphic relations, limiting the scope of linguistic properties they can test. We propose three new classes of metamorphic relations, which address the properties of systematicity, compositionality and transitivity. Unlike robustness, our relations are defined over multiple source inputs, thus increasing the number of test cases that we can produce by a polynomial factor. With them, we test the internal consistency of state-of-the-art NLP models, and show that they do not always behave according to their expected linguistic properties. Lastly, we introduce a novel graphical notation that efficiently summarises the inner structure of metamorphic relations. PDF 11 2021
A Relation-Attentive 3D Matrix Framework for Relational Triple Extraction Extracting relational triples from unstructured text is crucial for information extraction. Recent methods achieve considerable performance, but due to the insufficient consideration of triple global information, there is an obvious performance gap between triple (E1, R, E2) and E1/R/E2, that is, some extracted entities or relations fail to form a valid relational triple. To break this bottleneck, we propose a relation-attentive 3D matrix framework (RA3D) composed of an encoder module, a fusion module, and a 3D matrix module. Instead of using a 2D table to align the subject and object, we integrate clearly encoded relation information to convert the 2D table into a 3D matrix, so that the entries of the 3D matrix can capture the interaction in subjects, objects, and relations completely. To extract relation and entity information required for the 3D matrix reasonably, we design a transformer-decoder-based fusion module that updates the representation of relations and entities iteratively. Our model achieves state-of-the-art performance with F1 score up to 93.5\% and 94.3\% on two public datasets and delivers consistent performance gain on complex scenarios of overlapping triples. PDF 11 2021
Logic Traps in Evaluating Attribution Scores Modern deep learning models are notoriously opaque, which has motivated the development of methods for interpreting how deep models predict. This goal is usually approached with attribution methods, which assess the influence of features on model predictions. As with any explanation method, the evaluation criterion for attribution methods is how accurately they reflect the actual reasoning process of the model (faithfulness). Meanwhile, since the reasoning process of deep models is inaccessible, researchers design various evaluation methods to demonstrate their arguments. However, some crucial logic traps in these evaluation methods are ignored in most works, causing inaccurate evaluation and unfair comparison. This paper systematically reviews existing methods for evaluating attribution scores and summarizes the logic traps in these methods. We further conduct experiments to demonstrate the existence of each logic trap. Through both theoretical and experimental analysis, we hope to draw attention to the inaccurate evaluation of attribution scores. Moreover, with this paper, we suggest shifting effort away from improving performance under unreliable evaluation systems and toward reducing the impact of the identified logic traps. PDF 11 2021
Measuring Fairness of Text Classifiers via Prediction Sensitivity With the rapid growth in language processing applications, fairness has emerged as an important consideration in data-driven solutions. Although various fairness definitions have been explored in the recent literature, there is a lack of consensus on which metrics most accurately reflect the fairness of a system. In this work, we propose a new formulation -- accumulated prediction sensitivity, which measures fairness in machine learning models based on the model's prediction sensitivity to perturbations in input features. The metric attempts to quantify the extent to which a single prediction depends on a protected attribute, where the protected attribute encodes the membership status of an individual in a protected group. We show that the metric can be theoretically linked with a specific notion of group fairness (statistical parity) and individual fairness. It also correlates well with humans' perception of fairness. We conduct experiments on two text classification datasets -- Jigsaw Toxicity, and Bias in Bios, and evaluate the correlations between metrics and manual annotations on whether the model produced a fair outcome. We observe that the proposed fairness metric based on prediction sensitivity is statistically significantly more correlated with human annotation than the existing counterfactual fairness metric. PDF 11 2021
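A minimal sketch of the prediction-sensitivity idea above, assuming a finite-difference perturbation of a single protected-attribute feature; the paper's accumulated formulation and weighting are not spelled out in the abstract, so the aggregation below is an illustrative assumption.

import numpy as np

def prediction_sensitivity(predict_fn, x, protected_idx, eps=1e-3):
    # Finite-difference estimate of how strongly the prediction reacts to a
    # perturbation of the protected-attribute feature, summed over outputs.
    x_pert = x.copy()
    x_pert[protected_idx] += eps
    return np.abs(np.asarray(predict_fn(x_pert)) - np.asarray(predict_fn(x))).sum() / eps

# Toy linear "classifier": the sensitivity recovers the protected feature's weight.
w = np.array([0.2, -1.5, 0.4])
score = prediction_sensitivity(lambda x: w @ x, np.zeros(3), protected_idx=1)  # ~1.5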
Searching for fingerspelled content in American Sign Language Natural language processing for sign language video---including tasks like recognition, translation, and search---is crucial for making artificial intelligence technologies accessible to deaf individuals, and has been gaining research interest in recent years. In this paper, we address the problem of searching for fingerspelled keywords or key phrases in raw sign language videos. This is an important task since significant content in sign language is often conveyed via fingerspelling, and to our knowledge the task has not been studied before. We propose an end-to-end model for this task, FSS-Net, that jointly detects fingerspelling and matches it to a text sequence. Our experiments, done on a large public dataset of ASL fingerspelling in the wild, show the importance of fingerspelling detection as a component of a search and retrieval model. Our model significantly outperforms baseline methods adapted from prior work on related tasks. PDF 11 2021
What does it take to bake a cake? The RecipeRef corpus and anaphora resolution in procedural text Procedural text contains rich anaphoric phenomena yet has not received much attention in NLP. To fill this gap, we investigate the textual properties of two types of procedural text, recipes and chemical patents, and generalize an anaphora annotation framework developed for the chemical domain for modelling anaphoric phenomena in recipes. We apply this framework to annotate the RecipeRef corpus with both bridging and coreference relations. Through comparison to chemical patents, we show the complexity of anaphora resolution in recipes. We demonstrate empirically that transfer learning from the chemical domain improves resolution of anaphora in recipes, suggesting transferability of general procedural knowledge. The corpus is made available at \url{withheld\_for\_review}. PDF 11 2021
Metadata Shaping: Natural Language Annotations for the Long Tail Language models (LMs) struggle to capture knowledge about rare entities. To better capture entity knowledge, a common procedure in prior work is to start with a base LM such as BERT and to modify the LM architecture or objective function to produce a knowledge-aware LM. Proposed knowledge-aware LMs perform well compared to base LMs on entity-rich tasks; however, deploying, understanding, and maintaining many different specialized architectures is challenging, and they also often introduce additional computational costs. Thus we ask to what extent we can match the quality of these architectures using a base LM and only changing the data. We propose metadata shaping, a method which inserts readily available entity metadata, such as descriptions and categorical tags, into examples at train and inference time based on mutual information. Intuitively, if metadata corresponding to popular entities overlap with metadata for rare entities, the LM may be able to better reason about the rare entities using patterns learned from similar popular entities. On standard entity-rich tasks (TACRED, FewRel, OpenEntity), metadata shaping exceeds the BERT baseline by an average of 4.3 F1 points and achieves state-of-the-art results. We further show the gains are on average 4.4x larger for the slice of examples containing tail vs. popular entities. PDF 11 2021
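A hedged sketch of the shaping step described above: select metadata whose (precomputed) mutual information with the task labels clears a threshold, and append it to the input text. The separator token, scoring, and threshold are illustrative assumptions, not the paper's exact format.

def shape_example(text, entity_metadata, mi_scores, threshold=0.1):
    # Keep only metadata pieces that are informative about the labels.
    selected = [m for m in entity_metadata if mi_scores.get(m, 0.0) > threshold]
    return text + " [META] " + " ; ".join(selected) if selected else text

shaped = shape_example(
    "Barack Obama visited Japan.",
    ["44th U.S. president", "politician"],
    {"44th U.S. president": 0.3, "politician": 0.05},
)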
Packed Levitated Marker for Entity and Relation Extraction Recent entity and relation extraction works focus on investigating how to obtain a better span representation from the pre-trained encoder. However, a major limitation of existing works is that they ignore the interrelation between spans (pairs). In this work, we propose a novel span representation approach, named Packed Levitated Markers (PL-Marker), to consider the interrelation between the spans (pairs) by strategically packing the markers in the encoder. In particular, we propose a neighborhood-oriented packing strategy, which considers the neighbor spans integrally to better model the entity boundary information. Furthermore, for those more complicated span pair classification tasks, we design a subject-oriented packing strategy, which packs each subject and all its objects to model the interrelation between the same-subject span pairs. The experimental results show that, with the enhanced marker feature, our model advances baselines on six NER benchmarks, and obtains a 3.5%-3.6% strict relation F1 improvement with higher speed over previous state-of-the-art models on ACE04 and ACE05. All the code and data of this paper will be made publicly available. PDF 11 2021
Description-Driven Task-Oriented Dialog Modeling Task-oriented dialogue (TOD) systems are required to identify key information from conversations for the completion of given tasks. Such information is conventionally specified in terms of intents and slots contained in task-specific ontology or schemata. Since these schemata are designed by system developers, the naming convention for slots and intents is not uniform across tasks, and may not convey their semantics effectively. This can lead to models memorizing arbitrary patterns in data, resulting in suboptimal performance and generalization. In this paper, we propose that schemata should be modified by replacing names or notations entirely with natural language descriptions. We show that a language description-driven system exhibits better understanding of task specifications, higher performance on state tracking, improved data efficiency, and effective zero-shot transfer to unseen tasks. Following this paradigm, we present a simple yet effective Description-Driven Dialog State Tracking (D3ST) model, which relies purely on schema descriptions and an "index-picking'' mechanism. We demonstrate the superiority in quality, data efficiency and robustness of our approach as measured on the MultiWOZ (Budzianowski et al.,2018), SGD (Rastogi et al., 2020), and the recent SGD-X (Lee et al., 2021) benchmarks. PDF 11 2021
CaM-Gen: Causally-aware Guided Text Generation Content is created for a well-defined purpose, often described by a metric or signal represented in the form of structured information. The relationship between the goal (metrics) of target content and the content itself is non-trivial. While large-scale language models show promising text generation capabilities, guiding the generated text with external metrics is challenging. These metrics and content tend to have inherent relationships and not all of them may be of consequence. We introduce CaM-Gen: Causally-aware Generative Networks guided by user-defined target metrics incorporating the causal relationships between the metric and content features. We leverage causal inference techniques to identify causally significant aspects of a text that lead to the target metric and then explicitly guide generative models towards these by a feedback mechanism. We propose this mechanism for variational autoencoder and Transformer-based generative models. The proposed models beat baselines in terms of the target metric control while maintaining fluency and language quality of the generated text. To the best of our knowledge, this is one of the early attempts at controlled generation incorporating a metric guide using causal inference. PDF 11 2021
A Simple Hash-Based Early Exiting Approach For Language Understanding and Generation Early exiting allows instances to exit at different layers according to an estimate of their difficulty. Previous works usually adopt heuristic metrics such as the entropy of internal outputs to measure instance difficulty, which suffers from poor generalization and requires threshold tuning. In contrast, learning to exit, i.e., learning to predict instance difficulty, is a more appealing approach. Though some effort has been devoted to employing such "learn-to-exit" modules, it is still unknown whether and how well instance difficulty can be learned. As a response, we first conduct experiments on the learnability of instance difficulty, which demonstrate that modern neural models perform poorly at predicting instance difficulty. Based on this observation, we propose a simple-yet-effective Hash-based Early Exiting approach (HashEE) that replaces the learn-to-exit modules with hash functions that assign each token to a fixed exiting layer. Unlike previous methods, HashEE requires no internal classifiers nor extra parameters, and is therefore more efficient. HashEE can be used in various tasks (including language understanding and generation) and model architectures such as seq2seq models. Experimental results on classification, regression, and generation tasks demonstrate that HashEE achieves higher performance with fewer FLOPs and less inference time compared with previous state-of-the-art early exiting methods. PDF 11 2021
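The core mechanism above is compact enough to sketch directly: a fixed hash maps each token id to an exiting layer, so no exit predictor is learned and no parameters are added. The particular hash function below is an illustrative choice, not the paper's.

def exit_layer(token_id, num_layers):
    # A fixed hash decides where each token exits: no learned exit predictor,
    # no internal classifiers, no extra parameters.
    return (token_id * 2654435761) % num_layers   # Knuth-style multiplicative hash

# Every occurrence of the same token always exits at the same layer.
layers = [exit_layer(t, num_layers=12) for t in (101, 2023, 2003, 102)]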
PREME: Preference-based Meeting Exploration through an Interactive Questionnaire The recent increase in the volume of online meetings necessitates automated tools for managing and organizing the material, especially when an attendee has missed the discussion and needs assistance in quickly exploring it. In this work, we propose a novel end-to-end framework for generating interactive questionnaires for preference-based meeting exploration. As a result, users are supplied with a list of suggested questions reflecting their preferences. Since the task is new, we introduce an automatic evaluation strategy. Namely, it measures the extent to which the generated questions are answerable (to ensure factual correctness) and how well they cover the source meeting (the depth of possible exploration). PDF 11 2021
DialogConv: A Lightweight Fully Convolutional Network for Multi-view Response Selection Current end-to-end retrieval-based dialogue systems are primarily based on Recurrent Neural Networks or Transformers with attention mechanisms. Although promising results have been achieved, these models usually suffer from slow inference speed or an enormous number of parameters. In this paper, we propose a novel lightweight fully convolutional architecture called DialogConv for response selection. DialogConv is built exclusively on convolutions for distilling the matching features of context and response. The dialogue is modeled in a 3D view, where DialogConv conducts convolution operations on the embedding dimension, word dimension and utterance dimension iteratively to capture richer semantic information from multiple views of the context. On four benchmark datasets, DialogConv is approximately 4.0x smaller and up to 27x faster in inference compared with strong baselines. Moreover, DialogConv achieves competitive performance on the four public datasets. PDF 11 2021
HADE: Hierarchical Affective Dialog Encoder for Personality Recognition in Conversation Personality recognition in conversation aims to determine the personality traits of speakers through the dialogue content, which is of great importance in designing personalized conversational AI. Existing methods that use only linguistic patterns in utterances limit their performance. To fill in the gap, we investigate the effectiveness of incorporating affective information and modeling the interactions among speakers in conversations for personality recognition. However, corpora with personality and explicit affective annotations are rare. Besides, modeling the dialog flow with multiple speakers is difficult. Faced with these issues, we propose the Hierarchical Affective Dialog Encoder (HADE) for effective personality recognition in conversation. HADE utilizes manually annotated Valence-Arousal-Dominance (VAD) vectors of single words and implicitly extracts affective information from utterances. Then, it introduces a hierarchical architecture with dialog state embeddings to identify the speakers and encode the whole dialog flow. Finally, the affective information is integrated by an auxiliary VAD regression task to enhance personality recognition. Extensive experiments on a well-known dataset, \textbf{FriendsPersona}, demonstrate the effectiveness of our method compared with state-of-the-art models. Besides, we conduct an ablation study to discuss different approaches for integrating affective information and dialog flow modeling; the design of both parts of HADE is also verified to be effective for personality recognition in conversation. PDF 11 2021
Is Cross-lingual Evaluation Only About Cross-lingual? Multilingual pre-trained language models (mPLMs) have achieved great success on various cross-lingual tasks. However, we find that higher performance on these tasks cannot be equated with better cross-lingual ability, because models' task-specific abilities also influence the performance. In this work, we conduct a comprehensive study of two representative cross-lingual evaluation protocols: sentence retrieval and zero-shot transfer. We find that current cross-lingual evaluation results depend strongly on mPLMs' task-specific abilities, so that performance can be improved without any improvement in a model's cross-lingual ability. To enable more accurate comparisons of cross-lingual ability between mPLMs, we propose two new indexes based on the two evaluation protocols: calibrated sentence retrieval performance and transfer rate, and we experimentally show that the proposed indexes effectively eliminate the effects of task-specific abilities on cross-lingual evaluation. PDF 11 2021
Multi-Party Empathetic Dialogue Generation: A New Task for Dialog Systems Empathetic dialogue assembles emotion understanding, feeling projection, and appropriate response generation. Existing work on empathetic dialogue generation concentrates on the two-party conversation scenario. Multi-party dialogues, however, are pervasive in reality. Furthermore, emotion and sensibility are typically confused; a refined empathy analysis is needed for comprehending fragile and nuanced human feelings. We address these issues by proposing a novel task called Multi-Party Empathetic Dialogue Generation in this study. A new dataset MPED with 130k multi-party dialogues is correspondingly presented for this task, which makes up for the absence of a large-scale benchmark in this field. Additionally, a Static-Dynamic model for Multi-Party Empathetic Dialogue Generation, SDMPED, is introduced as a baseline; it explores static sensibility and dynamic emotion for multi-party empathetic dialogue learning, which helps SDMPED achieve state-of-the-art performance on MPED. PDF 11 2021
An Isotropy Analysis in the Multilingual BERT Embedding Space Several studies have explored various advantages of multilingual pre-trained models (e.g., multilingual BERT) in capturing shared linguistic knowledge. However, their limitations have not received enough attention. In this paper, we investigate the representation degeneration problem and outlier dimensions in multilingual contextual word representations (CWRs) of BERT. We show that though mBERT exhibits no outliers among its representations, its multilingual embedding space is highly anisotropic. Furthermore, our experimental results demonstrate that, similarly to their monolingual counterparts, increasing the isotropy of multilingual embedding spaces can significantly improve their representation power and performance. Our analysis indicates that, although the degenerated directions vary in different languages, they encode similar linguistic knowledge, suggesting a shared linguistic space among languages. PDF 11 2021
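The abstract does not spell out the paper's exact isotropy-enhancement procedure; the sketch below shows one standard way to increase the isotropy of an embedding space (removing dominant directions, in the spirit of "all-but-the-top" post-processing), assuming NumPy, as a stand-in for the idea.

import numpy as np

def increase_isotropy(E, num_directions=3):
    # Mean-center the embeddings, then project out their dominant
    # principal components, the "degenerated directions".
    E = E - E.mean(axis=0)
    _, _, Vt = np.linalg.svd(E, full_matrices=False)
    for v in Vt[:num_directions]:
        E = E - np.outer(E @ v, v)
    return E

E_iso = increase_isotropy(np.random.default_rng(0).normal(size=(1000, 64)))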
What Makes Machine Reading Comprehension Questions Difficult? Investigating Variation in Passage Sources and Question Types For a natural language understanding benchmark to be useful in research, it has to consist of examples that are diverse and difficult enough to discriminate among current and near-future state-of-the-art systems. However, we do not yet know how best to select passages to collect a variety of challenging examples. In this study, we crowdsource multiple-choice reading comprehension questions for passages taken from seven qualitatively distinct sources, analyzing what attributes of passages contribute to the difficulty and question types of the collected examples. To our surprise, we find that passage source, length, and readability measures do not significantly affect question difficulty. Through our manual annotation of seven reasoning types, we observe several trends between passage sources and reasoning types, e.g., logical reasoning is more often required in questions written for technical passages. These results suggest that when creating a new benchmark dataset, selecting a diverse set of passages can help ensure a diverse range of question types, but that passage difficulty need not be a priority. PDF 11 2021
Square One Bias in NLP: Towards a Multi-Dimensional Exploration of the Research Manifold The first NLP experiment many researchers performed in their careers likely involved training a standard architecture on labeled English data and optimizing for accuracy, without accounting for other dimensions such as fairness, interpretability, or computational efficiency. We show through surveys that this is indeed the case and refer to it as the square one experimental setup. NLP research often goes beyond the square one setup, e.g., focusing not only on accuracy, but also on fairness or interpretability, but typically along a single dimension. Most work focused on multilinguality, for example, considers only accuracy; most work on fairness or interpretability considers only English; and so on. We show this through manual classification of recent NLP research papers and ACL Test-of-Time award recipients. Such one-dimensionality of most research means we are only exploring a fraction of the NLP research search space. We provide historical and recent examples of how the square one bias has led researchers to draw false conclusions or make unwise choices, point to promising yet unexplored directions on the research manifold, and make practical recommendations to enable more multi-dimensional research. PDF 11 2021
A Slot Is Not Built in One Utterance: Spoken Language Dialogs with Sub-Slots A slot value might be provided segment by segment over multiple-turn interactions in a dialog, especially for important information such as phone numbers and names. This is a common phenomenon in daily life, but little attention has been paid to it in previous work. To fill the gap, this paper defines a new task named Sub-Slot based Task-Oriented Dialog (SSTOD) and builds a Chinese dialog dataset SSD to boost research on SSTOD. The dataset includes a total of 40K dialogs and 500K utterances from four different domains: Chinese names, phone numbers, ID numbers and license plate numbers. The data is well annotated with sub-slot values, slot values, dialog states and actions. We find new linguistic phenomena and interaction patterns in SSTOD, which raise new challenges for building agents for the task. We test three state-of-the-art models on SSTOD and find they cannot handle the new task well on any of the four domains. We also investigate an improved model by incorporating slot knowledge in a plug-in manner. More work should be done to meet the new challenges raised by SSTOD, which is widespread in real-life applications. PDF 11 2021
Distilling Causal Metaknowledge from Massive Knowledge Graph In recent years, growing volumes of information have made billions of relational facts accessible, usually integrated into all manner of knowledge graphs. Metaknowledge, defined as knowledge about knowledge, reveals the underlying principles from which such factual knowledge arises, and its discovery is therefore vital for understanding, exploiting and completing knowledge. In this paper, we focus on capturing the causal component of metaknowledge, that is, metarules with causal semantics. For this purpose, we devise an efficient causal rule discovery algorithm called CaRules that distills causal rules between two knowledge graph schemata abstracted from instances in massive knowledge graphs. Extensive experiments demonstrate that the quality and interpretability of the causation-based rules surpass those of correlation-based rules, especially on out-of-distribution tasks. PDF 11 2021
EigenNoise: A Contrastive Prior to Warm-Start Representations In this work, we present a naïve initialization scheme for word vectors based on a dense, independent co-occurrence model and provide preliminary results that suggest it is competitive, and warrants further investigation. Specifically, we demonstrate through information-theoretic minimum description length (MDL) probing that our model, EigenNoise, can approach the performance of empirically trained GloVe despite the lack of any pre-training data (in the case of EigenNoise). We present these preliminary results with interest to set the stage for further investigations into how this competitive initialization works without pre-training data, as well as to invite the exploration of more intelligent initialization schemes informed by the theory of harmonic linguistic structure. Our application of this theory likewise contributes a novel (and effective) interpretation of recent discoveries which have elucidated the underlying distributional information that linguistic representations capture from data and contrast distributions. PDF 11 2021
Pre-training and Fine-tuning Neural Topic Model: A Simple yet Effective Approach to Incorporating External Knowledge Recent years have witnessed growing interest in incorporating external knowledge such as pre-trained word embeddings (PWEs) or pre-trained language models (PLMs) into neural topic modeling. However, we found that employing PWEs and PLMs for topic modeling achieves only limited performance improvements while incurring huge computational overhead. In this paper, we propose a novel strategy to incorporate external knowledge into neural topic modeling where the neural topic model is pre-trained on a large corpus and then fine-tuned on the target dataset. Experiments have been conducted on three datasets and the results show that the proposed approach significantly outperforms both current state-of-the-art neural topic models and some topic modeling approaches enhanced with PWEs or PLMs. Moreover, further study shows that the proposed approach greatly reduces the need for a huge amount of training data. PDF 11 2021
Understanding Multimodal Procedural Knowledge by Sequencing Multimodal Instructional Manuals The ability to sequence unordered events is evidence of comprehension and reasoning about real world tasks/procedures, and is essential for applications such as task planning and multi-source instruction summarization. It often requires thorough understanding of temporal common sense and multimodal information, since these procedures are often conveyed by a combination of texts and images. While humans are capable of reasoning about and sequencing unordered procedural instructions, the extent to which the current machine learning methods possess such a capability is still an open question. In this work, we benchmark models' capability of reasoning over and sequencing unordered multimodal instructions by curating datasets from online instructional manuals and collecting comprehensive human annotations. We find current state-of-the-art models not only perform significantly worse than humans but also seem incapable of efficiently utilizing multimodal information. To improve machines' performance on multimodal event sequencing, we propose sequence-aware pretraining techniques exploiting the sequential alignment properties of both texts and images, resulting in > 5% improvements on perfect match ratio. PDF 11 2021
MoEfication: Conditional Computation of Transformer Models for Efficient Inference Transformer-based pre-trained language models achieve superior performance on most NLP tasks due to their large parameter capacity, but also incur huge computation cost. Fortunately, we observe that most inputs activate only a tiny fraction of the neurons of large Transformer-based pre-trained models during inference. Hence, we propose to convert a model into its mixture-of-experts (MoE) version with the same parameters, namely MoEfication, which accelerates large-model inference by conditional computation based on this sparse activation phenomenon. Specifically, MoEfication consists of two phases: (1) splitting the parameters of feed-forward neural networks (FFNs) into multiple parts as experts, and (2) building expert routers to decide which experts will be used for each input. Experimental results show that MoEfication can save $80\%$ of the computation cost of FFNs while maintaining over $95\%$ of the original performance for different models, including models with different sizes (up to 3 billion parameters) and distilled models, on various downstream tasks. Moreover, we find that the MoEfied model achieves better performance than an MoE model pre-trained from scratch with the same model size. We will release all the code and models of this paper. PDF 11 2021
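A hedged sketch of the two phases just described, assuming NumPy and a ReLU FFN. Real MoEfication clusters co-activated neurons and trains a router; the contiguous slicing and the placeholder router scores below are simplifying assumptions.

import numpy as np

def moefy_ffn(W_in, W_out, num_experts):
    # Phase 1: partition the FFN's hidden neurons into expert groups.
    groups = np.array_split(np.arange(W_in.shape[0]), num_experts)
    return [(W_in[g], W_out[:, g]) for g in groups]

def moe_ffn_forward(x, experts, router_scores, top_k=2):
    # Phase 2: run only the top-k experts selected by the router scores.
    out = np.zeros(experts[0][1].shape[0])
    for e in np.argsort(router_scores)[-top_k:]:
        W_in, W_out = experts[e]
        out += W_out @ np.maximum(W_in @ x, 0.0)   # one ReLU-FFN slice
    return out

rng = np.random.default_rng(0)
d_model, hidden = 8, 32
experts = moefy_ffn(rng.normal(size=(hidden, d_model)), rng.normal(size=(d_model, hidden)), 4)
y = moe_ffn_forward(rng.normal(size=d_model), experts, router_scores=rng.random(4))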
The Algorithmic Inflection and Morphological Variability of Russian We present a set of deterministic algorithms for Russian inflection and automated text synthesis. These algorithms are implemented in a publicly available web-service www.passare.ru. This service provides functions for inflection of single words, word matching and synthesis of grammatically correct Russian text. The inflectional functions have been tested against the annotated corpus of Russian language OpenCorpora and used for estimating the morphological variability and complexity of different parts of speech in Russian. PDF 11 2021
BARCOR: Towards A Unified Framework for Conversational Recommendation Recommendation systems focus on helping users find items of interest in situations of information overload, where users' preferences are typically estimated from past observed behaviors. In contrast, conversational recommendation systems (CRS) aim to understand users' preferences via interactions in conversation flows. CRS is a complex problem that consists of two main tasks: (1) recommendation and (2) response generation. Previous work often tried to solve the problem in a modular manner, where recommenders and response generators are separate neural models. Such modular architectures often come with a complicated and unintuitive connection between the modules, leading to inefficient learning and other issues. In this work, we propose a unified framework based on BART for conversational recommendation, which tackles the two tasks in a single model. Furthermore, we also design and collect a lightweight knowledge graph for CRS in the movie domain. The experimental results show that the proposed methods achieve state-of-the-art performance. PDF 11 2021
Exploring Low-dimensional Intrinsic Task Subspace via Prompt Tuning Why can pre-trained language models (PLMs) learn universal representations and effectively adapt to broad NLP tasks that differ greatly on the surface? In this work, we empirically find evidence indicating that the adaptations of PLMs to various few-shot tasks can be reparameterized as optimizing only a few free parameters in a unified low-dimensional intrinsic task subspace, which may help us understand why PLMs can easily adapt to various NLP tasks with small-scale data. To find such a subspace and examine its universality, we propose an analysis pipeline called intrinsic prompt tuning (IPT). Specifically, we resort to the recent success of prompt tuning and decompose the soft prompts of multiple NLP tasks into the same low-dimensional nonlinear subspace; we then learn to adapt the PLM to unseen data or tasks by tuning only the parameters in this subspace. In the experiments, we study diverse few-shot NLP tasks and surprisingly find that, in a 5-dimensional subspace found with 100 tasks, tuning only 5 free parameters recovers 87% and 65% of the full prompt tuning performance for 100 seen tasks (using different training data) and 20 unseen tasks, respectively, showing the strong generalization ability of the found intrinsic task subspace. Besides being an analysis tool, IPT can bring further practical benefits, such as improving prompt tuning stability. PDF 11 2021
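A minimal sketch of the reparameterization idea above, assuming NumPy: a shared decoder maps a low-dimensional task vector z to a full soft prompt, and per-task adaptation tunes only z. The paper's decoder is nonlinear and learned across tasks; the linear map here is an illustrative simplification.

import numpy as np

class IntrinsicPromptTuning:
    # Soft prompts of many tasks are assumed to share a low-dimensional
    # subspace; adapting to a new task tunes only the vector z.
    def __init__(self, intrinsic_dim, prompt_len, embed_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.decoder = rng.normal(size=(intrinsic_dim, prompt_len * embed_dim)) * 0.02
        self.shape = (prompt_len, embed_dim)

    def prompt(self, z):
        # z: (intrinsic_dim,) -- the only task-specific trainable parameters.
        return (z @ self.decoder).reshape(self.shape)

ipt = IntrinsicPromptTuning(intrinsic_dim=5, prompt_len=20, embed_dim=768)
soft_prompt = ipt.prompt(np.zeros(5))   # 5 free parameters -> a full soft prompt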
Improving Syntactic Parsing with Consistency Learning In this paper, we propose using \emph{consistency learning} to improve constituency and dependency parsing performance in a multi-task setting. It utilizes a consistency constraint between the predictions. While multi-task learning implicitly learns shared representations for multiple sub-tasks, our method introduces an explicit consistency objective, which encourages shared representations that result in consistent predictions. Our intuition is that correct predictions are more likely to be consistent ones. To introduce consistency constraints, we propose a general method for incorporating consistency objectives, as well as other prior knowledge, into existing neural models. This method only requires a boolean function that tells whether or not the multiple predictions are consistent, which does not need to be differentiable. We demonstrate the efficacy of our method by showing that it outperforms a state-of-the-art joint dependency and constituency parser on CTB. PDF 11 2021
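The abstract only requires a boolean consistency function; one hedged way to turn such a function into a training signal, which may differ from the paper's construction, is to penalize the probability mass that the two task heads jointly place on inconsistent prediction pairs, assuming the heads factorize independently.

def expected_inconsistency(const_probs, dep_probs, is_consistent):
    # const_probs / dep_probs: probabilities over candidate constituency and
    # dependency predictions; is_consistent: the non-differentiable boolean check.
    penalty = 0.0
    for i, p_c in enumerate(const_probs):
        for j, p_d in enumerate(dep_probs):
            if not is_consistent(i, j):
                penalty += p_c * p_d   # mass on inconsistent joint predictions
    return penalty

# Toy example: the two heads are "consistent" only when they agree on the index.
loss_extra = expected_inconsistency([0.7, 0.3], [0.6, 0.4], lambda i, j: i == j)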
Efficient Hyper-parameter Search for Knowledge Graph Embedding While hyper-parameters (HPs) are important for knowledge graph (KG) embedding, existing methods fail to search for them efficiently. To solve this problem, we first analyze the properties of different HPs and quantify their transferability from a small subgraph to the large full graph. Based on this analysis, we propose an efficient two-stage search algorithm, which explores HP configurations on a small subgraph in the first stage and transfers the top configurations to the whole large graph for fine-tuning in the second stage. Experiments show that our method consistently finds better HPs than the baseline algorithms with the same time budget. We achieve an average relative improvement of 10.8% for four embedding models on the large-scale KGs in the Open Graph Benchmark. PDF 11 2021
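The two-stage scheme above reduces to a short sketch: cheaply score every configuration on a small subgraph, then evaluate only the top survivors on the whole graph. The scorer callables below are placeholders standing in for actually training a KG-embedding model.

def two_stage_search(configs, eval_on_subgraph, eval_on_full_graph, top_k=5):
    # Stage 1: cheap proxy evaluation on a small subgraph.
    survivors = sorted(configs, key=eval_on_subgraph, reverse=True)[:top_k]
    # Stage 2: expensive evaluation of the survivors on the full graph.
    return max(survivors, key=eval_on_full_graph)

best = two_stage_search(
    [{"lr": lr, "dim": d} for lr in (1e-3, 1e-2) for d in (200, 500)],
    eval_on_subgraph=lambda c: -c["lr"],     # placeholder scorers; real ones would
    eval_on_full_graph=lambda c: c["dim"],   # train and validate an embedding model
)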
Listen to Both Sides and be Enlightened! -- Hierarchical Modality Fusion Network for Entity and Relation Extraction Multimodal named entity recognition and relation extraction (MNER and MRE) are fundamental and crucial tasks in multimodal learning. However, existing approaches for MNER and MRE mainly suffer from 1) error sensitivity when images contain irrelevant concepts not mentioned in the text; and 2) a large modality gap between image and text features, especially for hierarchical visual features. To deal with these issues, we propose a novel Hierarchical Modality fusion NeTwork (HMNeT) for visual-enhanced entity and relation extraction, aiming to reduce the modality gap and achieve more effective and robust performance. Specifically, we innovatively leverage hierarchical pyramidal visual features to conduct multi-layer internal integration in the Transformer. We further present a dynamic gated aggregation strategy to decide modality integration according to different images. Extensive experiments on three benchmark datasets demonstrate the effectiveness of our method, which achieves state-of-the-art performance. PDF 11 2021
Knowledge Enhanced Embedding: Improve Model Generalization Through Knowledge Graphs Pre-trained language models have achieved excellent results in NLP and NLI, and since the birth of BERT, many BERT variants have emerged. They grasp ubiquitous linguistic representational information from large-scale corpora in different ways, but when reading text, it is difficult for them to combine and use external knowledge to infer additional meanings the text may carry, as people do. To this end, we propose a language model (K2E-BERT) capable of simply incorporating external knowledge, which fuses information from a knowledge graph (triples) with the entity information in the original text. To better integrate external knowledge into the original text without letting it deviate from the original meaning of the sentence, we propose a method called EaKA (Entity and Knowledge Align), which better aligns and combines entities and knowledge so that the model can accept new external knowledge without losing the meaning of the original sentence. Moreover, our approach requires no changes to the internal structure of BERT. In our experiments, K2E-BERT achieves good results on several selected NLP tasks and easily surpasses BERT in generalization ability, demonstrating its effectiveness. PDF 11 2021
ELLE: Efficient Lifelong Pre-training for Emerging Data Current pre-trained language models (PLMs) are typically trained with static data, ignoring that in real-world scenarios, streaming data from various sources may continuously grow. This requires PLMs to integrate the information from all the sources in a lifelong manner. Although this goal could be achieved by exhaustive pre-training on all the existing data, such a process is known to be computationally expensive. To this end, we propose ELLE, aiming at efficient lifelong pre-training for emerging data. Specifically, ELLE consists of (1) function-preserving model expansion, which flexibly expands an existing PLM's width and depth to improve the efficiency of knowledge acquisition; and (2) pre-trained domain prompts, which disentangle the versatile knowledge learned during pre-training and stimulate the proper knowledge for downstream tasks. We evaluate ELLE with streaming data from $5$ domains on BERT and GPT. The results show the superiority of ELLE over various lifelong learning baselines in both pre-training efficiency and downstream performance. All the data, model parameters and code used will be available upon publication. PDF 11 2021
CSL: A Large-scale Chinese Scientific Literature Dataset for Cross-task Evaluation Scientific literature serves as a high-quality corpus that can provide naturally annotated data for much natural language processing (NLP) research. In this work, we introduce a Chinese Scientific Literature dataset – CSL, which contains the titles, abstracts, keywords and academic fields of 400,000 papers. The rich semantic information in this scientific literature enables extensive NLP tasks and provides a natural cross-task scenario. Based on this, we present a cross-task few-shot benchmark. To evaluate the cross-task transferability of models, we design scenarios with different aspects and difficulties. Compared with previous cross-task benchmarks, these tasks are constructed from a homogeneous corpus, allowing researchers to investigate the relationships between tasks without being disturbed by heterogeneous data sources, annotation, and other factors. We analyze the behavior of existing text-to-text models on the proposed benchmark and reveal the challenges for cross-task generalization, which provides a valuable reference for future research. Code and data are publicly available at https://github.com/CSL-Dataset/CSL_Dataset. PDF 11 2021
The Dark Side of the Language: Pre-trained Transformers in the DarkNet Pre-trained Transformers are challenging human performance in many natural language processing tasks. The gigantic datasets used for pre-training seem to be the key to their success on existing tasks. In this paper, we explore how a range of pre-trained natural language understanding models perform on truly novel and unexplored data, provided by classification tasks over a DarkNet corpus. Surprisingly, results show that syntactic and lexical neural networks largely outperform pre-trained Transformers. This seems to suggest that pre-trained Transformers have serious difficulties in adapting to radically novel texts. PDF 11 2021
How to Translate Your Samples and Choose Your Shots? Analyzing Translate-train & Few-shot Cross-lingual Transfer Translate-train or few-shot cross-lingual transfer can be used to improve the zero-shot performance of multilingual pretrained language models. Few-shot utilizes high-quality low-quantity samples (often manually translated from the English corpus to the target language). Translate-train employs a machine translation of the English corpus, resulting in samples with lower quality that could be scaled to high quantity. Given the lower cost and higher availability of machine translation compared to manual professional translation, it is important to systematically compare few-shot and translate-train, understand when few-shot is beneficial, and whether choosing the shots to translate increases the few-shot gain. This work aims to fill this gap: we compare and quantify the performance gain of few-shot vs. translate-train using a varying number of samples for three tasks/datasets (XNLI, PAWS-X, XQuAD) spanning 17 languages. We show that scaling up the training data using machine translation gives a larger gain compared to using the small-scale (higher-quality) few-shot data. When few-shot is beneficial, we show that there are random sets of samples that perform better across languages and that the performance on English and on the machine-translation of the samples can both be used to choose the shots to manually translate for an increased few-shot gain. PDF 11 2021
One Agent To Rule Them All: Towards Multi-agent Conversational AI The increasing volume of commercially available conversational agents (CAs) on the market has resulted in users being burdened with learning and adopting multiple agents to accomplish their tasks. Though prior work has explored supporting a multitude of domains within the design of a single agent, the interaction experience suffers due to the large action space of desired capabilities. To address these problems, we introduce a new task BBAI: Black-Box Agent Integration, focusing on combining the capabilities of multiple black-box CAs at scale. We explore two techniques: question agent pairing and question response pairing aimed at resolving this task. Leveraging these techniques, we design One For All (OFA), a scalable system that provides a unified interface to interact with multiple CAs. Additionally, we introduce MARS: Multi-Agent Response Selection, a new encoder model for question response pairing that jointly encodes user question and agent response pairs. We demonstrate that the OFA system is able to automatically and accurately integrate an ensemble of commercially available CAs spanning disparate domains. Specifically, using the MARS encoder we achieve 88.5% accuracy on our BBAI task, outperforming strong baselines. PDF 11 2021
Multi-stage Distillation Framework for Cross-Lingual Semantic Similarity Matching Previous studies have shown that cross-lingual knowledge distillation can significantly improve the performance of pre-trained models for cross-lingual similarity matching tasks. However, the student model needs to be large in this operation. Otherwise, its performance will drop sharply, making it impractical to deploy to memory-limited devices. To address this issue, we delve into cross-lingual knowledge distillation and propose a multi-stage distillation framework for constructing a small-size but high-performance cross-lingual model. In our framework, contrastive learning and an assistant model are introduced to prevent performance from being compromised during the compression process. The experimental results demonstrate that our method can compress the size of XLM-R and MiniLM by more than 50%, while the performance is only reduced by about 1%. In addition, our framework is model-independent and applicable to all transformer-based models. PDF 11 2021
Finding the Dominant Winning Ticket in Pre-Trained Language Models The Lottery Ticket Hypothesis suggests that for any over-parameterized model, a small subnetwork exists that achieves competitive performance compared to the backbone architecture. In this paper, we study whether there is a winning lottery ticket for pre-trained language models that allows practitioners to fine-tune only the parameters in the ticket while still achieving good downstream performance. To achieve this, we regularize the fine-tuning process with L1 distance and explore the resulting subnetwork structure (what we refer to as the "dominant winning ticket"). Empirically, we show that (a) the dominant winning ticket can achieve performance that is comparable with that of the full-parameter model, (b) the dominant winning ticket is transferable across different tasks, and (c) the dominant winning ticket has a natural structure within each parameter matrix. Strikingly, we find that a dominant winning ticket that takes up 0.05% of the parameters can already achieve satisfactory performance, indicating that the PLM is significantly reducible during fine-tuning. PDF 11 2021
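A minimal sketch of the L1-regularized fine-tuning idea described above, assuming a PyTorch setup; the helper names (l1_regularized_loss, extract_ticket) and the toy torch.nn.Linear model are illustrative stand-ins, not the authors' code.

    import torch

    # Hypothetical toy model standing in for a PLM; the snapshot of the
    # pre-trained weights is what the L1 penalty pulls updates back towards.
    model = torch.nn.Linear(4, 2)
    pretrained = {n: p.detach().clone() for n, p in model.named_parameters()}

    def l1_regularized_loss(task_loss, model, pretrained, lam=1e-4):
        # Penalize the L1 distance to the pre-trained weights so that most
        # fine-tuning updates are driven to exactly zero, leaving a small
        # "dominant" subnetwork that carries the task-specific change.
        l1 = sum((p - pretrained[n]).abs().sum()
                 for n, p in model.named_parameters())
        return task_loss + lam * l1

    def extract_ticket(model, pretrained, tol=1e-8):
        # The ticket is simply the set of parameters that actually moved.
        return {n: (p - pretrained[n]).abs() > tol
                for n, p in model.named_parameters()}

    loss = l1_regularized_loss(torch.tensor(0.7), model, pretrained)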
CLGP: Multi-Feature Embedding based Cross-Attention for Chinese NER Previous works fuse lexicon information while ignoring two important characteristics of the Chinese language: glyphs and pinyin, which carry significant syntactic and semantic information for sequence tagging tasks. This paper proposes CLGP, which utilizes three specific extractors to obtain the embeddings of the glyph, pinyin, and lexicon, and further uses a network based on cross-attention to perform multi-feature embedding fusion. Specifically, we introduce an embedding scheme to preserve the lexicon matching results, and design two specific CNN architectures to extract glyph and pinyin embeddings. Moreover, we fuse the four embeddings with the cross-attention-based network to enhance Chinese NER. The experimental results on four well-known datasets show that CLGP achieves SOTA performance. PDF 11 2021
Prototypical Verbalizer for Prompt-based Few-shot Tuning Prompt-based tuning for pre-trained language models (PLMs) has shown its effectiveness in few-shot learning. Typically, prompt-based tuning wraps the input text into a cloze question. To make predictions, the model maps the output words to labels via a verbalizer, which is either manually designed or automatically built. However, manual verbalizers heavily depend on domain-specific prior knowledge and human efforts, while finding appropriate label words automatically still remains challenging. In this work, we propose the prototypical verbalizer (ProtoVerb), which is built directly from training data. Specifically, ProtoVerb learns prototype vectors as verbalizers by contrastive learning. In this way, the prototypes summarize training instances and are able to enclose rich class-level semantics. We conduct experiments on both topic classification and entity typing tasks, and the results demonstrate that ProtoVerb significantly outperforms current automatic verbalizers, especially when training data is extremely scarce. More surprisingly, ProtoVerb consistently boosts prompt-based tuning even on untuned PLMs, indicating an elegant non-tuning way to utilize PLMs. PDF 11 2021
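The prototype-as-verbalizer idea can be sketched as follows; for brevity this uses per-class mean [MASK] representations in place of the paper's contrastively learned prototypes, and all names plus the random toy features are illustrative.

    import torch
    import torch.nn.functional as F

    def build_prototypes(mask_states, labels, num_classes):
        # mask_states: [N, H] hidden states at the [MASK] position of prompted
        # training instances. A per-class mean is a simple stand-in for the
        # contrastively learned prototypes in the paper.
        protos = torch.stack([mask_states[labels == c].mean(dim=0)
                              for c in range(num_classes)])
        return F.normalize(protos, dim=-1)

    def predict(mask_state, prototypes):
        # Score each class by cosine similarity between an instance's [MASK]
        # representation and that class's prototype, replacing label words.
        return (F.normalize(mask_state, dim=-1) @ prototypes.T).argmax(dim=-1)

    # Toy usage with random features in place of PLM hidden states.
    states, labels = torch.randn(32, 16), torch.randint(0, 3, (32,))
    protos = build_prototypes(states, labels, num_classes=3)
    print(predict(torch.randn(5, 16), protos))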
A New Framework for Fast Automated Phonological Reconstruction Using Trimmed Alignments and Sound Correspondence Patterns Computational approaches in historical linguistics have been increasingly applied during the past decade, and many new methods that implement parts of the traditional comparative method have been proposed. Despite these increased efforts, there are not many easy-to-use and fast approaches for the task of phonological reconstruction. Here we present a new framework that combines state-of-the-art techniques for automated sequence comparison with novel techniques for phonetic alignment analysis and sound correspondence pattern detection to allow for the supervised reconstruction of word forms in ancestral languages. We test the method on a new dataset covering six groups from three different language families. The results show that our method yields promising results while being not only fast but also easy to apply and extend. PDF 11 2021
A Structured Semantic Reinforcement method for Task-Oriented Dialogue Recently, many BERT-based approaches have been proposed for the task-oriented dialogue (TOD) task. Despite their impressive performance, the insufficient utilization of deep semantic information and long-distance context understanding makes it difficult for these methods to digest complex dialogue scenarios, because they cannot obtain sufficient evidence from dialogue data to support dialogue decision-making. In this work, we propose a novel structured semantics reinforcement (SSR) method to handle these issues. SSR reorganizes the end-to-end TOD structure and mainly includes two key components: 1. a dialogue symbolic memory, which caches the objects mentioned in the dialogue and structures them according to their semantic relationships; 2. a semantic projection (understanding) module, which, based on the previously structured results, determines the source of the slot extraction required for the current task. Our approach achieves state-of-the-art results on the MultiWOZ 2.1 dataset, where we obtain a joint goal accuracy beyond 60\% and also achieve a significant improvement on the DSTC8 dataset. PDF 11 2021
Bottom Up Parsing via Sequence Labeling We translate the sequence labeling framework, first introduced for top-down discourse parsing by Koto et al. (2021), to bottom-up discourse parsing. We introduce a novel parser that is not constrained by parsing direction (left-to-right or otherwise), and is conditioned on previous parsing decisions. We describe the unique training requirements of a (directionally) unconstrained parser and explore two different training procedures. Additionally, we introduce a novel dynamic oracle for unconstrained bottom-up parsing. Our proposed parser achieves state-of-the-art performance amongst bottom-up RST parsers. PDF 11 2021
VQN: Variable Quantization Noise for Neural Network Compression Quantization refers to a set of methods that compress a neural network by representing its parameters with fewer bits. However, applying quantization to a neural network after training often leads to severe performance regressions. Quantization Aware Training (QAT) addresses this problem by applying simulated training-time quantization so that the model learns robustness to inference-time quantization. One key drawback of this approach is that quantization functions induce biased gradient flow through the network during backpropagation, thus preventing the network from best fitting the learning task. Fan et al. addressed this issue by proposing Quant-Noise, in which simulated quantization is applied to a fixed proportion, called the quantization noise rate, of parameters during training. Our study, Variable Quantization Noise (VQN), builds upon their technique by exploring a variable quantization noise rate instead of a fixed one. We craft three candidate functions to vary the noise rate during training and evaluate the variants with 3 datasets and 3 quantization schemes for each dataset. First, we report negative results on our hand-crafted candidate functions. Second, we observe somewhat positive results with a method, originally intended as an ablation study, of randomly varying the noise rate during training. This method outperforms Quant-Noise on two out of three quantization schemes for all three tested datasets. Moreover, on two of the datasets, this method at 4x compression matches or exceeds the performance of even the uncompressed model. Future work should determine whether these unexpected results hold for more datasets and quantization schemes, and investigate other schemes for varying the noise rate during training. PDF 11 2021
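A rough sketch of the randomly-varying noise-rate variant the abstract reports positive results for, assuming Quant-Noise-style fake quantization with a straight-through estimator; the function name, the uniform sampling of the rate, and the int8-style rounding are all illustrative assumptions.

    import torch

    def variable_quant_noise(weight, max_rate=0.5, bits=8):
        # Each training step draws a fresh noise rate (uniformly at random,
        # mirroring the randomly-varying variant in the paper) and passes
        # only that proportion of weights through simulated quantization.
        rate = torch.rand(()).item() * max_rate
        mask = torch.rand_like(weight) < rate
        scale = (weight.abs().max() / (2 ** (bits - 1) - 1)).clamp(min=1e-8)
        fake_quant = torch.round(weight / scale) * scale
        noisy = torch.where(mask, fake_quant, weight)
        # Straight-through estimator: quantized forward, identity backward.
        return weight + (noisy - weight).detach()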
DocEE: A Large-Scale and Fine-grained Benchmark for Document-level Event Extraction Event extraction aims to identify an event and then extract the arguments participating in the event. Despite the great success in sentence-level event extraction, events are more naturally presented in the form of documents, with event arguments scattered across multiple sentences. However, a major barrier to promoting document-level event extraction has been the lack of large-scale and practical training and evaluation datasets. In this paper, we present DocEE, a new document-level event extraction dataset including 20,000+ events and 100,000+ arguments. We highlight three features: large-scale manual annotations, fine-grained argument types and application-oriented settings. Experiments show that there is still a big gap between state-of-the-art models and human beings (43\% vs. 85\% in F1 score), indicating that DocEE is an open issue. We will publish DocEE upon acceptance. PDF 11 2021
A Structure-Aware Argument Encoder for Literature Discourse Analysis Existing research for argument representation learning mainly treats tokens in the sentence equally and ignores the implied structure information of argumentative context. In this paper, we propose to separate tokens into two groups, namely framing tokens and topic ones, to capture structural information of arguments. In addition, we consider high-level structure by incorporating paragraph-level position information. A novel structure-aware argument encoder is proposed for literature discourse analysis. Experimental results on both a self-constructed corpus and a public corpus show the effectiveness of our model. PDF 11 2021
Fine-grained Category Discovery under Coarse-grained supervision with Hierarchical Weighted Self-contrastive Learning In this paper, we propose a new task named Fine-grained Category Discovery under Coarse-grained supervision (FCDC). Without asking for any fine-grained knowledge, FCDC aims at discovering fine-grained categories with only coarse-grained labeled data, which can not only reduce significant labeling costs, but also adapt to novel fine-grained categories. It is also a challenging task since performing FCDC requires models to ensure fine-grained sample separability with only coarse-grained supervision and can easily make models overfit on the training set. Considering most current methods cannot transfer knowledge from coarse-grained level to fine-grained level, we propose a novel hierarchical weighted self-contrastive network to approach the FCDC task. Inspired by the hierarchy of pre-trained models (e.g. BERT), we combine supervised learning and contrastive learning to learn fine-grained knowledge from shallow to deep. Specifically, we use coarse-grained labels to train bottom layers of our model to learn surface knowledge, then we build a novel weighted self-contrastive module to train top layers of our model to learn more fine-grained knowledge. Extensive experiments on two public datasets show both effectiveness and efficiency of our model over state-of-the-art methods. PDF 11 2021
An Empirical Study on Explanations in Out-of-Domain Settings Recent work in Natural Language Processing has focused on developing approaches that extract faithful explanations, either via identifying the most important tokens in the input (i.e. post-hoc explanations) or by designing inherently faithful models that first select the most important tokens and then use them to predict the correct label (i.e. select-then-predict models). Currently, these approaches are largely evaluated on in-domain settings. Yet, little is known about how post-hoc explanations and inherently faithful models perform in out-of-domain settings. In this paper, we conduct an extensive empirical study that examines: (1) the out-of-domain faithfulness of post-hoc explanations, generated by five feature attribution methods; and (2) the out-of-domain performance of two inherently faithful models over six datasets. Contrary to our expectations, results show that in many cases out-of-domain post-hoc explanation faithfulness measured by sufficiency and comprehensiveness is higher compared to in-domain. We find this misleading and suggest using a random baseline as a yardstick for evaluating post-hoc explanation faithfulness. Our findings also show that select-then-predict models demonstrate comparable predictive performance in out-of-domain settings to full-text trained models. PDF 11 2021
RepAL: A Simple and Plug-and-play Method for Improving Unsupervised Sentence Representations Unsupervised sentence representation learning is a fundamental problem in natural language processing and has been studied extensively in recent years. This paper presents Representation ALchemy (RepAL), an extremely simple post-processing method that enhances unsupervised sentence representations. The basic idea of RepAL is to extract the redundant information from the representation of a sentence generated by existing models and then refine the representation through an embedding refinement operation that filters out such redundant information. We analyze the redundant information at two levels, sentence-level and corpus-level, and also conduct a theoretical analysis for the latter. We point out that RepAL is free of training and is a plug-and-play method that can be combined with most existing unsupervised sentence learning models. Extensive experiments demonstrate RepAL’s effectiveness and show that RepAL is a model-agnostic method for unsupervised sentence embedding enhancement. We also design detailed ablation studies to understand why RepAL works and provide in-depth analysis and understanding of the redundant information. PDF 11 2021
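The abstract does not spell out the refinement operation, so the sketch below substitutes a well-known corpus-level redundancy removal (subtracting the mean and projecting out the top principal components, as in the "all-but-the-top" post-processing); treat it as one plausible instantiation of the idea, not RepAL itself.

    import numpy as np

    def refine(embeddings, n_components=1):
        # Treat the corpus mean and the dominant principal directions as
        # redundant information shared across all sentences, and project
        # them out of every embedding.
        X = embeddings - embeddings.mean(axis=0)
        _, _, vt = np.linalg.svd(X, full_matrices=False)
        for v in vt[:n_components]:
            X = X - np.outer(X @ v, v)
        return X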
A Comparison of the Validity of Measurement Methods for the General English Proficiency through Dictation and Read-Aloud Performances This paper compares three measurement methods for the general English proficiency (GEP) of learners of English as a second language. If students’ GEP can be measured on course materials frequently, for instance at the beginning and end of a semester, English teachers can confirm students’ levels of learning achievement. So far, English teachers have had two options for GEP measurement: calculating scores from read-aloud performance or from dictation performance. This study adds a third option: measuring GEP using both dictation and read-aloud performances. When comparing the three types of measurement methods, the experimental results suggest that GEP should be measured by combining dictation and read-aloud performances. PDF 11 2021
Increasing Entity Linking upper bound through a more effective Candidate Generation System Entity Linking (EL) aligns entity mentions in text to entries in a knowledge base. It usually comprises two phases: candidate generation and candidate ranking. While most methods focus on the latter phase, it is candidate generation that sets the upper bound for both the time and accuracy of an EL system. We propose a simple approach for improving candidate generation by efficiently embedding mention-entity pairs in dense space through a BERT-based bi-encoder. Specifically, we introduce a new pooling function and incorporate entity type side-information. We achieve a new state-of-the-art 84.28% recall of the gold entity on the Zero-shot EL dataset with just 50 candidates, compared to the previous 82.06% with 64 candidates. We report the results from extensive experimentation using our proposed model on both seen and unseen entity datasets. Our results suggest that our approach could be a useful complement to existing EL methods. PDF 11 2021
Multilingual Document-Level Translation Enables Zero-Shot Transfer From Sentences to Documents Document-level neural machine translation (DocNMT) achieves coherent translations by incorporating cross-sentence context. However, for most language pairs there is a shortage of parallel documents, although parallel sentences are readily available. In this paper, we study whether and how contextual modeling in DocNMT is transferable via multilingual modeling. We focus on the scenario of zero-shot transfer from teacher languages with document level data to student languages with no documents but sentence level data, and for the first time treat document-level translation as a transfer learning problem. Using simple concatenation-based DocNMT, we explore the effect of 3 factors on the transfer: the number of teacher languages with document level data, the balance between document and sentence level data at training, and the data type of document level data (genuine vs. back-translated). Our experiments on Europarl-7 and IWSLT-10 show the feasibility of multilingual transfer for DocNMT, particularly on document-specific metrics. We observe that more teacher languages and adequate data schedule both contribute to better transfer quality. Surprisingly, the transfer is less sensitive to the data type, where multilingual DocNMT delivers decent performance with either back-translated or genuine document pairs. PDF 11 2021
More Informative Dialogue Generation via Multiple Knowledge Selection Knowledge-grounded dialogue generation is the task of generating a fluent and informative response based on both the dialogue context and a collection of external knowledge. There is a lot of noise in the knowledge pool, and appropriate knowledge selection plays an important role. Existing methods can only select one piece of knowledge to participate in the generation of the response, which inevitably loses useful clues contained in the discarded candidates. In this work, we propose MSEL, a novel knowledge selector that can select multiple pieces of useful knowledge. MSEL takes the dialogue context and knowledge pool as inputs and predicts a subset of the knowledge pool in a sequence-to-sequence manner. MSEL is easy to implement and can benefit from generative pre-trained language models. Empirical results on the Wizard-of-Wikipedia dataset indicate that our model significantly outperforms state-of-the-art approaches in both automatic and human evaluation. PDF 11 2021
Sense Embeddings are also Biased -- Evaluating Social Biases in Static and Contextualised Sense Embeddings Sense embedding learning methods learn different embeddings for the different senses of an ambiguous word. One sense of an ambiguous word might be socially biased while its other senses remain unbiased. In comparison to the numerous prior works evaluating the social biases in pretrained word embeddings, the biases in sense embeddings have been relatively understudied. In this paper, we create a benchmark dataset for evaluating the social biases in sense embeddings and propose novel sense-specific bias evaluation measures. We conduct an extensive evaluation of multiple static and contextualised sense embeddings for various types of social biases using the proposed measures. Our experimental results show that even in cases where no biases are found at the word level, there still exist worrying levels of social biases at the sense level, which are often ignored by word-level bias evaluation measures. PDF 11 2021
Is Neural Topic Modelling Better than Clustering? An Empirical Study on Clustering with Contextual Embeddings for Topics Recent work incorporates pre-trained word embeddings such as BERT embeddings into Neural Topic Models (NTMs), generating highly coherent topics. However, with high-quality contextualized document representations, do we really need sophisticated neural models to obtain coherent and interpretable topics? In this paper, we conduct thorough experiments showing that directly clustering high-quality sentence embeddings with an appropriate word selecting method can generate more coherent and diverse topics than NTMs, achieving also higher efficiency and simplicity. PDF 11 2021
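A minimal sketch of the clustering-based alternative to neural topic models, assuming document embeddings are supplied externally (e.g. from a contextual sentence encoder); the TF-IDF word selection is one simple choice of the "appropriate word selecting method" the abstract mentions, and all names are illustrative.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    def cluster_topics(docs, embeddings, k=10, top_n=10):
        # Cluster contextual document embeddings directly, then label each
        # cluster with its highest-TF-IDF words instead of fitting an NTM.
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(embeddings)
        merged = [" ".join(docs[i] for i in np.where(labels == c)[0])
                  for c in range(k)]
        vec = TfidfVectorizer(stop_words="english")
        tfidf = vec.fit_transform(merged).toarray()
        vocab = np.array(vec.get_feature_names_out())
        return [list(vocab[row.argsort()[::-1][:top_n]]) for row in tfidf]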
Probing BERT’s priors with serial reproduction chains We can learn as much about language models from what they say as we learn from their performance on targeted benchmarks. Sampling is a promising bottom-up method for probing, but generating samples from successful models like BERT remains challenging. Taking inspiration from theories of iterated learning in cognitive science, we explore the use of serial reproduction chains to probe BERT's priors. Although the masked language modeling objective does not guarantee a consistent joint distribution, we observe that a unique and consistent estimator of the ground-truth joint distribution may be obtained by a GSN sampler, which randomly selects which word to mask and reconstruct on each step. We compare the lexical and syntactic statistics of sentences from the resulting prior distribution against those of the ground-truth corpus distribution and elicit a large empirical sample of naturalness judgments to investigate how, exactly, the model deviates from human speakers. Our findings suggest the need to move beyond top-down evaluation methods toward bottom-up probing to capture the full richness of what has been learned about language. PDF 11 2021
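The GSN sampler described above is straightforward to sketch with Hugging Face transformers: repeatedly mask one random position and resample it from BERT's conditional. The step count and temperature are illustrative knobs, not values from the paper.

    import torch
    from transformers import BertForMaskedLM, BertTokenizer

    tok = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

    @torch.no_grad()
    def gsn_chain(sentence, steps=200, temperature=1.0):
        # At each step, mask one uniformly chosen token and resample it from
        # BERT's conditional; the chain's stationary distribution serves as
        # an estimate of the model's implicit joint over sentences.
        ids = tok(sentence, return_tensors="pt")["input_ids"]
        for _ in range(steps):
            pos = torch.randint(1, ids.shape[1] - 1, (1,)).item()  # keep [CLS]/[SEP]
            ids[0, pos] = tok.mask_token_id
            probs = (model(input_ids=ids).logits[0, pos] / temperature).softmax(-1)
            ids[0, pos] = torch.multinomial(probs, 1).item()
        return tok.decode(ids[0], skip_special_tokens=True)

    print(gsn_chain("The quick brown fox jumps over the lazy dog."))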
Learning from Missing Relations: Contrastive Learning with Commonsense Knowledge Graphs for Commonsense Inference Commonsense inference poses a unique challenge to reason and generate the physical, social, and causal conditions of a given event. Existing approaches to commonsense inference utilize commonsense transformers, which are large-scale language models that learn commonsense knowledge graphs. However, they suffer from a lack of coverage and expressive diversity of the graphs, resulting in a degradation of the representation quality. In this paper, we focus on addressing missing relations in commonsense knowledge graphs, and propose a novel contrastive learning framework called SOLAR. Our framework contrasts sets of semantically similar and dissimilar events, learning richer inferential knowledge compared to existing approaches. Empirical results demonstrate the efficacy of SOLAR in commonsense inference of diverse commonsense knowledge graphs. Specifically, SOLAR outperforms the state-of-the-art commonsense transformer on commonsense inference with ConceptNet by 1.84% on average among 8 automatic evaluation metrics. In-depth analysis of SOLAR sheds light on the effects of the missing relations utilized in learning commonsense knowledge graphs. PDF 11 2021
As Little as Possible, as Much as Necessary: Detecting Over- and Undertranslations with Contrastive Conditioning Omission and addition of content is a typical issue in neural machine translation. We propose a method for detecting such phenomena with off-the-shelf translation models. Using contrastive conditioning, we compare the likelihood of a full sequence under a translation model to the likelihood of its parts, given the corresponding source or target sequence. This allows us to pinpoint superfluous words in the translation and untranslated words in the source even in the absence of a reference translation. The accuracy of our method is comparable to a supervised method that requires a custom quality estimation model. PDF 11 2021
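Contrastive conditioning reduces to comparing summed token negative log-likelihoods under an off-the-shelf translation model. The sketch below uses a Marian model as an example scorer, and the deletion-based check is a simplified reading of the method, not the paper's exact procedure.

    import torch
    from transformers import MarianMTModel, MarianTokenizer

    name = "Helsinki-NLP/opus-mt-de-en"  # any seq2seq MT model works here
    tok = MarianTokenizer.from_pretrained(name)
    model = MarianMTModel.from_pretrained(name).eval()

    @torch.no_grad()
    def nll(src, tgt):
        # Summed negative log-likelihood of the target given the source.
        batch = tok(src, return_tensors="pt")
        labels = tok(text_target=tgt, return_tensors="pt")["input_ids"]
        return model(**batch, labels=labels).loss.item() * labels.shape[1]

    def looks_superfluous(src, tgt, tgt_minus_word):
        # If the hypothesis scores better once a target word is deleted,
        # that word is a likely addition; the symmetric check on source
        # words (given the target) flags untranslated content.
        return nll(src, tgt_minus_word) < nll(src, tgt)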
Crossword: Estimating Unknown Embeddings using Cross Attention and Alignment Strategies Word embedding methods like word2vec and GloVe have been shown to learn strong representations of words. However, these methods only learn representations for words in the training corpus. This is problematic, as models using these representations need ways to handle unknown and new words, known as out-of-vocabulary (OOV) words. As a result, there have been multiple attempts to learn OOV word representations in a similar fashion to how humans learn new words, using surrounding words ("context clues") and word roots/subwords. However, most current approaches suffer from two problems. First, these models calculate context clue estimates and subword estimates separately and then combine them shallowly for a final estimate, ignoring potentially important information each type can learn from the other. Second, although subword embeddings are trained to estimate word vectors, we find these embeddings do not occupy the same space as word embeddings. Current models do not take this into account and do not align the spaces before combining them. In response to this, we propose Crossword, a transformer-based OOV estimation model that combines context and subwords at the attention level, allowing each type to influence the other for a stronger final estimate. Crossword successfully combines these different sources of information using cross attention, along with strategies to align subword and context spaces. PDF 11 2021
Hyperlink-induced Pre-training for Passage Retrieval of Open-domain Question Answering To alleviate the data scarcity problem in training question answering systems, recent works propose additional intermediate pre-training for dense passage retrieval (DPR). However, there still remains a large discrepancy between the provided upstream signals and the downstream question-passage relevance, which leads to less improvement. To bridge this gap, we propose the Hyperlink-induced Pre-training (HLP), a method to pre-train the dense retriever with the text relevance induced by hyperlink-based topology within Web documents. We demonstrate that the hyperlink-based structures of dual-link and co-mention can provide effective relevance signals for large-scale pre-training that better facilitate downstream passage retrieval. We investigate the effectiveness of our approach across a wide range of open-domain QA datasets under zero-shot, few-shot, multi-hop, and out-of-domain scenarios. The experiments show our HLP outperforms the BM25 by up to 7 points as well as other pre-training methods by up to 30 points in terms of top-20 retrieval accuracy under the zero-shot scenario. Furthermore, HLP significantly outperforms other pre-training methods under the other scenarios. PDF 11 2021
A Comparative Study of Faithfulness Metrics for Model Interpretability Methods Interpretable methods to reveal the internal reasoning processes behind machine learning models have attracted increasing attention in recent years. To quantify the extent to which the identified interpretations truly reflect the intrinsic decision-making mechanisms, various faithfulness evaluation metrics have been proposed. However, we find that different faithfulness metrics show conflicting preferences when comparing different interpretations. Motivated by this observation, we aim to conduct a comprehensive and comparative study of the widely adopted faithfulness metrics. In particular, we introduce two assessment dimensions, namely diagnosticity and complexity. Diagnosticity refers to the degree to which the faithfulness metric favors relatively faithful interpretations over randomly generated ones, and complexity is measured by the average number of model forward passes. According to the experimental results, we find that sufficiency and comprehensiveness metrics have higher diagnosticity and lower complexity than the other faithfulness metrics. PDF 11 2021
Measuring HLT Research Equality of European Languages This work quantitatively explores the equality of the languages of the European Union in the field of HLT. Our ultimate goal is to investigate European language diversity and identify low-resource and endangered languages, taking into account the research papers of the main HLT conferences. This framework has been selected with the goal of identifying potential inequalities among theoretically similarly capable languages in terms of available social and economic resources as well as political status. We have identified several groups of EU languages in terms of HLT research equality, each group comprising languages with widely varying numbers of speakers. We have discovered relative equality among surprisingly different languages in terms of speaker base, and also relevant inequalities within the most spoken languages. All data and code will be released upon acceptance. PDF 11 2021
More Than Words: Collocation Retokenization for Latent Dirichlet Allocation Models Traditionally, Latent Dirichlet Allocation (LDA) ingests words in a collection of documents to discover their latent topics using word-document co-occurrences. Previous studies show that representing bigram collocations in the input can improve topic coherence in English. However, it is unclear how to achieve the best results for languages without marked word boundaries such as Chinese and Thai. Here, we explore the use of retokenization based on chi-squared measures, $t$-statistics, and raw frequency to merge frequent token ngrams into collocations when preparing input to the LDA model. Based on the goodness of fit and the coherence metric, we show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models. PDF 11 2021
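A small, self-contained sketch of t-statistic-based retokenization over pre-tokenized documents; the minimum count, the 2.576 threshold, and the underscore-joining of merged tokens are illustrative choices, not parameters from the paper.

    import math
    from collections import Counter

    def retokenize(docs, min_count=5, t_threshold=2.576):
        # Merge a bigram into one collocation token when its t-statistic
        # t = (f(ab) - f(a)f(b)/N) / sqrt(f(ab)) clears a significance cutoff.
        uni, bi, n = Counter(), Counter(), 0
        for doc in docs:
            uni.update(doc)
            bi.update(zip(doc, doc[1:]))
            n += len(doc)
        merges = {p for p, f in bi.items()
                  if f >= min_count
                  and (f - uni[p[0]] * uni[p[1]] / n) / math.sqrt(f) > t_threshold}
        out = []
        for doc in docs:
            merged, i = [], 0
            while i < len(doc):
                if i + 1 < len(doc) and (doc[i], doc[i + 1]) in merges:
                    merged.append(doc[i] + "_" + doc[i + 1]); i += 2
                else:
                    merged.append(doc[i]); i += 1
            out.append(merged)
        return out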
A simple log-based loss function for ordinal text classification The cross-entropy loss function is widely used and generally considered the default loss function for text classification. When it comes to ordinal text classification, where there is an ordinal relationship between labels, cross-entropy is not optimal as it does not incorporate the ordinal character into its feedback. In this paper, we propose a new simple loss function called ordinal log-loss (OLL). We show that this loss function outperforms previously introduced state-of-the-art losses on four benchmark text classification datasets. PDF 11 2021
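The abstract does not give the formula, so the sketch below shows one natural distance-weighted log loss consistent with the name "ordinal log-loss"; the exact formulation in the paper may differ.

    import torch

    def ordinal_log_loss(probs, target, alpha=1.0):
        # Weight the log-probability of missing each class by its ordinal
        # distance from the gold label: L = -sum_i |i - y|^alpha * log(1 - p_i).
        # The true class contributes zero, so mass placed far from y on the
        # label scale is punished the most.
        classes = torch.arange(probs.shape[-1], device=probs.device)
        dist = (classes - target.unsqueeze(-1)).abs().float() ** alpha
        return -(dist * torch.log(1 - probs + 1e-12)).sum(dim=-1).mean()

    # Toy usage: 5 ordered classes, batch of 2.
    probs = torch.softmax(torch.randn(2, 5), dim=-1)
    print(ordinal_log_loss(probs, torch.tensor([0, 4])))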
A New Search Paradigm for Natural Language Code Search Code search can accelerate the efficiency of software development by finding code snippets for a given query. The dominant code search paradigm is to learn the semantic matching between code snippets and queries with neural networks. However, this paradigm introduces and widens a gap between code snippets and queries, because researchers train their models on pairs of code snippets and code descriptions (e.g., comments and documentation) but evaluate the trained models on queries, which differ from code descriptions in writing style and application scenario. To remedy this issue, we propose a new simple but effective search paradigm, Query2Desc, which depends entirely on natural language and conducts code search by performing semantic matching between code descriptions and queries. Experimental results on the CoSQA dataset show that the state-of-the-art model CodeBERT improves by 17.48\% in terms of average MRR when applied to Query2Desc. Moreover, baseline models on Query2Desc can return the right results in the top-$10$ search results for at least 95\% of queries in the test set of CoSQA. PDF 11 2021
A Novel End-to-End CAPT System for L2 Children Learners Recently, Conformer-based models have shown promising results in the automatic speech recognition (ASR) task, but there is still a dearth of research on Conformer-based models for computer-assisted pronunciation training (CAPT) systems. In this paper, a Conformer-based CAPT system is introduced to provide mispronunciation detection and diagnosis. We apply the Conformer as the main pronunciation error detection model at the phoneme level owing to its superior phoneme recognition performance. Then, features extracted from the Conformer decoder, including the Log Phone Posterior (LPP), the Log Posterior Ratio (LPR) and some other features, are fed to an XGBoost model trained to predict phoneme- and sentence-level scores labeled by experts. Results on both open datasets and our internal Chinese children data demonstrate that the Conformer-based system, which has a smaller model size and provides detailed diagnosis, achieves better performance compared with neural network (NN)-based systems. PDF 11 2021
MarCQAp: Effective Context Modeling for Conversational Question Answering State-of-the-art models for Document-grounded Conversational Question Answering (DCQA) are based on the Transformer architecture. This raises two open issues: (a) Is it sufficient to concatenate the dialog history and the grounding document and perform cross-attention via a Transformer in order to capture the document/dialogue relationships? and (b) What is the best way to cope with the Transformers’ quadratic complexity, given the long inputs in DCQA? We address these issues in two dimensions. First, we introduce MarCQAp, a new modeling approach which encodes the historic answers by adding textual markups in the grounding document text, and then answers the question conditioned on the marked document. Second, we show that sparse self-attention architectures, such as the Longformer, can replace the Transformer, resolving the input length limitation. Our results demonstrate the effectiveness of each approach and their combination for explicit representation of dialogue/document relationships, significantly improving over state-of-the-art DCQA models. PDF 11 2021
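The markup idea behind MarCQAp can be sketched in a few lines; the numbered-tag format below is an assumption for illustration, not necessarily the paper's exact markers.

    def mark_history(document, history_answers):
        # Represent the dialogue history by wrapping each previous answer
        # span inside the grounding document with numbered markers, so the
        # reader sees the history in place rather than as a long prefix.
        for turn, answer in enumerate(history_answers, start=1):
            document = document.replace(answer, f"<{turn}> {answer} </{turn}>", 1)
        return document

    doc = "Tickets cost 12 euros. The museum opens at nine and closes at five."
    print(mark_history(doc, ["opens at nine", "12 euros"]))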
Can Neural Networks Understand Programs like Humans? Program understanding is a fundamental task in program language processing. Despite the success, existing works fail to take human minds as reference in understanding programs. In this paper, we incorporate human minds and propose the PGNN-EK model that consists of two main components. On the one hand, inspired by the “divide-and-conquer” reading behaviours of humans, we present a partitioning-based graph neural network model PGNN on the upgraded AST of codes. On the other hand, to characterize human minds of resorting to other resources to help code comprehension, we transform raw codes with external knowledge and apply pre-training techniques for information extraction. Finally, we combine the two embeddings generated from the two components to output code embeddings. We conduct extensive experiments to show the superior performance of PGNN-EK on the code summarization and code clone detection tasks. In particular, to show the generalization ability of our model, we release a new dataset that is more challenging for code clone detection and could advance the development of the community. Our codes and data are publicly available at https://github.com/anonymousforpaper1997/PGNN-EK. PDF 11 2021
PALBERT: Teaching ALBERT to Ponder Currently, pre-trained models can be considered the default choice for a wide range of NLP tasks. Despite their SoTA results, there is practical evidence that these models may require a different number of computing layers for different input sequences, since evaluating all layers leads to overconfidence on wrong predictions (namely overthinking). This problem can potentially be solved by implementing adaptive computation time approaches, which were first designed to improve inference speed. Recently proposed PonderNet may be a promising solution for performing an early exit by treating the exit layer's index as a latent variable. However, the originally proposed exit criterion, relying on sampling from trained posterior distribution on the probability of exiting from $i$-th layer, introduces major variance in model outputs, significantly reducing the resulting model's performance. In this paper, we propose Ponder ALBERT (PALBERT) – an improvement to PonderNet with a novel deterministic Q-exit criterion and a revisited model architecture. We compared PALBERT with recent methods for performing an early exit. We observed that the proposed changes can be considered significant improvements on the original PonderNet architecture and outperform PABEE on a wide range of GLUE tasks. In addition, we also performed an in-depth ablation study of the proposed architecture to further understand Lambda layers and their performance. PDF 11 2021
Understanding Gender Bias in Knowledge Base Embeddings Knowledge base (KB) embeddings have been shown to contain gender biases. In this paper, we study two questions regarding these biases: how to quantify them, and how to trace their origins in the KB. Specifically, first, we develop two novel bias measures, respectively for a group of person entities and an individual person entity. Evidence of their validity is observed by comparison with real-world census data. Second, we use the influence function to inspect the contribution of each triple in the KB to the overall group bias. To exemplify the potential applications of our study, we also present two strategies (adding and removing KB triples) to mitigate gender biases in KB embeddings. PDF 11 2021
Unsupervised Dependency Graph Network Recent work has identified properties of pretrained self-attention models that mirror those of dependency parse structures. In particular, some self-attention heads correspond well to individual dependency types. Inspired by these developments, we propose a new competitive mechanism that encourages these attention heads to model different dependency relations. We introduce a new model, the Unsupervised Dependency Graph Network (UDGN), that can induce dependency structures from raw corpora and the masked language modeling task. Experiment results show that UDGN achieves very strong unsupervised dependency parsing performance without gold POS tags and any other external information. The competitive gated heads show a strong correlation with human-annotated dependency types. Furthermore, the UDGN can also achieve competitive performance on masked language modeling and sentence textual similarity tasks. PDF 11 2021
MIMICause: Representation and automatic extraction of causal relation types from clinical notes Understanding causal narratives communicated in clinical notes can help make strides towards personalized healthcare. Extracted causal information from clinical notes can be combined with structured EHR data such as patients' demographics, diagnoses, and medications. This will enhance healthcare providers' ability to identify aspects of a patient's story communicated in the clinical notes and help make more informed decisions. In this work, we propose annotation guidelines, develop an annotated corpus and provide baseline scores to identify types and direction of causal relations between a pair of biomedical concepts in clinical notes; communicated implicitly or explicitly, identified either in a single sentence or across multiple sentences. We annotate a total of 2714 de-identified examples sampled from the 2018 n2c2 shared task dataset and train four different language model based architectures. Annotation based on our guidelines achieved a high inter-annotator agreement i.e. Fleiss' kappa ($\kappa$) score of 0.72, and our model for identification of causal relations achieved a macro F1 score of 0.56 on the test data. The high inter-annotator agreement for clinical text shows the quality of our annotation guidelines while the provided baseline F1 score sets the direction for future research towards understanding narratives in clinical texts. PDF 11 2021
Generating a Temporally Coherent Image Sequence for a Story by Multimodal Recurrent Transformers Story visualization is a challenging text-to-image generation task due to the difficulty of rendering visual details from abstract text descriptions. Beyond the difficulty of image generation, the generator also needs to conform to the narrative of a multi-sentence story input. While prior work in this domain has focused on improving semantic relevance between generated images and input text, controlling the generated images to be temporally consistent remains a challenge. To generate a semantically coherent image sequence, we propose an explicit memory controller that augments the temporal coherence of images in a multi-modal autoregressive transformer; we call this approach Story visualization by MultimodAl Recurrent Transformers, or SMART for short. Our method generates high-resolution, high-quality images, outperforming prior works by a significant margin across multiple evaluation metrics on the PororoSV dataset. PDF 11 2021
Differentiable Learning of Rules with Constants in Knowledge Graph Knowledge reasoning, which helps overcome the incompleteness issue of knowledge graphs (KGs), contributes significantly to the development of large KGs, which consist of relations and constants. Rule mining studies the problem of capturing interpretable patterns over a KG, which is one of the key tasks of knowledge reasoning. However, previous works mainly focus on the combination of different relations and are limited by ignoring the importance of constants. In this paper, we propose that constants should be considered in the rule mining process, and introduce an Elegant Differentiable rUle learning with Constant mEthod (EduCe). Based on a soft constant operator and dynamic weights, the proposed model can mine more diverse and accurate logical rules while controlling the number of parameters, which is also a great challenge for this problem. Experimental results on several benchmark datasets demonstrate the effectiveness and accuracy of rules with constants. PDF 11 2021
MoFE: Mixture of Factual Experts for Controlling Hallucinations in Abstractive Summarization Neural abstractive summarization models are susceptible to generating factually inconsistent content, a phenomenon known as hallucination. This limits the usability and adoption of these systems in real-world applications. To reduce the presence of hallucination, we propose the Mixture of Factual Experts (MoFE) model, which combines multiple summarization experts that each target a specific type of factual error. We construct MoFE by combining the experts using weights and logits ensembling strategies and find that the MoFE provides a modular approach to control different factual errors while maintaining performance on standard ROUGE metrics. PDF 11 2021
Non-Parametric Domain Adaptation for End-to-End Speech Translation End-to-end speech translation (E2E-ST) has received increasing attention due to its potential for less error propagation, lower latency, and fewer parameters. However, the effectiveness of neural-based approaches to this task is severely limited by the available training corpus, especially for domain adaptation where in-domain triplet training data is scarce or nonexistent. In this paper, we propose a novel non-parametric method that leverages a domain-specific text translation corpus to achieve domain adaptation for the E2E-ST system. To this end, we first incorporate an additional encoder into the pre-trained E2E-ST model to realize text translation modeling and then unify the decoder's output representation for text and speech translation tasks by reducing the correspondent representation mismatch in available triplet training data. During domain adaptation, a $k$-nearest-neighbor ($k$NN) classifier is introduced to produce the final translation distribution using the external datastore built from the domain-specific text translation corpus, while the universal output representation is adopted to perform a similarity search. Experiments on the Europarl-ST benchmark demonstrate that when in-domain text translation data is used only, our proposed approach significantly improves the baseline by 12.82 BLEU on average in all translation directions. PDF 11 2021
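A minimal sketch of the kNN step at decoding time, assuming a datastore of (decoder state, target token) pairs built from the in-domain text translation corpus; the interpolation with a fixed weight follows the common kNN-MT recipe, and all names and hyperparameters are illustrative.

    import torch

    def knn_interpolate(model_probs, query, keys, values, vocab_size,
                        k=8, temperature=10.0, lam=0.5):
        # query: [H] current decoder state; keys: [N, H] datastore states;
        # values: [N] LongTensor of the target tokens stored with each key.
        dists = torch.cdist(query.unsqueeze(0), keys).squeeze(0)
        knn_d, idx = dists.topk(k, largest=False)
        weights = torch.softmax(-knn_d / temperature, dim=-1)
        # Scatter the neighbour weights onto the vocabulary and blend with
        # the translation model's own distribution.
        knn_probs = torch.zeros(vocab_size).index_add_(0, values[idx], weights)
        return lam * knn_probs + (1 - lam) * model_probs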
Placing (Historical) Events on a Timeline: A Classification cum Co-ref Resolution Approach The event timeline provides one of the most effective ways to visualize the important historical events that occurred over a period of time, presenting insights that may not be so apparent from reading the equivalent information in textual form. By leveraging generative adversarial learning for important event classification and by assimilating knowledge-based tags for improving the performance of event coreference resolution, we introduce a two-staged system for event timeline generation from multiple (historical) text documents. In addition, we propose a vis-timeline based visualization technique to portray the event timeline. We demonstrate our results on two very well-known historical documents -- the Collected Works of Mahatma Gandhi (CWMG) and the Collected Works of Abraham Lincoln (CWAL). Our results can be extremely helpful for historians, in advancing research in history and in understanding the socio-political landscape of a country as reflected in the writings of political leaders/scholars. Our work has some parallels with timeline summarization (TLS) tasks and therefore we use these as baselines. Rigorous experiments demonstrate that prior event detection, which was hitherto absent in the TLS methods, can improve summarization performance. In order to show that our methods are very generic, we reuse our method to visualize the evolution of coronavirus related events in India from a collection of various COVID-19 articles. PDF 11 2021
Are Shortest Rationales the Best Explanations For Human Understanding? Existing self-explaining models typically favor extracting the shortest rationales possible (“shortest yet coherent subset of input to predict the same label”), with the assumption that short rationales are more intuitive to humans, even though short rationales lead to lower accuracy. However, there is a lack of human studies on validating the effect of rationale length on human understanding. Is the shortest rationale indeed the most understandable for humans? To answer this question, we design a self-explaining model that can take control of rationale length. Our model incorporates contextual information and supports flexibly extracting rationales at any target length. Through quantitative evaluation of model performance, we further verify that our method LIMITEDINK outperforms existing self-explaining baselines on both end-task prediction and human-annotated rationale agreement. We use it to generate rationales at 5 length levels, and conduct user studies to understand how much rationale would be sufficient for humans to confidently make predictions. We show that while most prior work extracts 10%-30% of the text to be the rationale, human accuracy tends to stabilize after seeing 40% of the full text. Our result suggests the need for more careful design of the best human rationales. PDF 11 2021
Unified Speech-Text Pre-training for Speech Translation and Recognition In this work, we describe a method to jointly pre-train speech and text in an encoder-decoder modeling framework for speech translation and recognition. The proposed method utilizes multi-task learning to integrate four self-supervised and supervised subtasks for cross modality learning. A self-supervised speech subtask, which leverages unlabelled speech data, and a (self-)supervised text to text subtask, which makes use of abundant text training data, take up the majority of the pre-training time. Two auxiliary supervised speech tasks are included to unify speech and text modeling space. Detailed analysis reveals learning interference among subtasks. In order to alleviate the subtask interference, two pre-training configurations are proposed for speech translation and speech recognition respectively. Our experiments show the proposed method can effectively fuse speech and text information into one model. It achieves between 1.7 and 2.3 BLEU improvement above the state of the art on the MuST-C speech translation dataset and comparable WERs to wav2vec 2.0 on the Librispeech speech recognition task. PDF 11 2021
Are We Evaluating Paraphrase Generation Accurately? A paraphrase is a restatement of a text that conveys the same meaning using different expressions. The evaluation of paraphrase generation (PG) is a complex task and currently lacks a complete picture of its criteria and metrics. In this paper, we survey the automatic evaluation metrics and human evaluation criteria for PG. Based on the survey results, we propose a reference-free automatic toolkit and list clear human evaluation criteria. Moreover, we consider paraphrase selection in downstream tasks and propose a simple but effective evaluation filter model, which can fuse multiple automatic metrics to approximate human evaluation without any references. PDF 11 2021
Freezing the Pivot for Triangular Machine Translation Triangular machine translation is a special case of low-resource machine translation where the language pair of interest has limited parallel data, but both languages have abundant parallel data with a pivot language. Naturally, the key to triangular machine translation is the successful exploitation of such auxiliary data. In this work, we propose a transfer-learning-based approach that utilizes all types of auxiliary data. As we train auxiliary source-pivot and pivot-target translation models, we initialize some parameters of the pivot side with a pre-trained language model and freeze them to encourage both translation models to work in the same pivot language space, so that they can be smoothly transferred to the source-target translation model. Experiments show that our approach can outperform previous ones. PDF 11 2021
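The freezing strategy for triangular translation can be sketched as follows, assuming the pivot-side parameters live in a single module; pivot_module and the state-dict handoff are illustrative stand-ins for the paper's setup.

    def init_and_freeze_pivot(pivot_module, pretrained_lm_state):
        # Initialize the pivot-side parameters (e.g. the decoder of the
        # source-pivot model and the encoder of the pivot-target model)
        # from a pre-trained LM, then freeze them so both auxiliary models
        # are trained in one shared pivot-language representation space.
        pivot_module.load_state_dict(pretrained_lm_state, strict=False)
        for p in pivot_module.parameters():
            p.requires_grad = False

    # Only the unfrozen (non-pivot) parameters are handed to the optimizer:
    # optimizer = torch.optim.Adam(
    #     [p for p in model.parameters() if p.requires_grad], lr=1e-4)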
Towards Collaborative Neural-Symbolic Graph Semantic Parsing via Uncertainty Recent work in task-independent graph semantic parsing has shifted from grammar-based symbolic approaches to data-intensive neural approaches, and has shown strong performance on different types of meaning representations. However, it is still unclear what the limitations of these neural parsers are, and whether these limitations can be compensated for by collaborating with symbolic parsers. In this paper, we attempt to answer these questions by taking English Resource Grammar (ERG) parsing as a case study. Specifically, we first develop a state-of-the-art neural ERG parser, and then conduct detailed analyses on fine-grained linguistic phenomena. The results suggest that the neural parser's performance degrades significantly on long-tail examples, while the symbolic parser performs more robustly. To address this, we further propose a collaborative neural-symbolic semantic parsing framework. Specifically, we improve the beam search strategy by designing a decision criterion that incorporates both the model uncertainty about the testing data distribution and the prior knowledge from a symbolic parser. Experimental results show that this collaborative parsing framework can outperform the single neural parser and concretely improve the model's performance on long-tail examples. PDF 11 2021
Unsupervised Preference-Aware Language Identification Recognizing the language of ambiguous texts has become a main challenge in language identification (LID). When using multilingual applications, users have their own language preferences, which can be regarded as external knowledge for LID. Nevertheless, current studies do not consider these inter-personal variations due to the lack of user-annotated training data. To fill this gap, we introduce preference-aware LID and propose a novel unsupervised learning strategy. Concretely, we construct a pseudo training set for each user by extracting training samples from a standard LID corpus according to his/her historical language distribution. Besides, we contribute the first user-labeled LID test set, called "U-LID". Experimental results reveal that our model can capture user traits and significantly outperforms existing LID systems in handling ambiguous texts. Our code and dataset are released at XXX. PDF 11 2021
Accurate, yet Inconsistent? Consistency Analysis on Language Models Consistency, which refers to generating the same predictions for semantically similar contexts, is highly desirable for a sound language model. Although recent pre-trained language models (PLMs) deliver outstanding performance on various downstream tasks, they should also exhibit consistent behaviour if they truly understand language. In this paper, we propose a simple framework, called consistency analysis on language models (CALM), to evaluate a model's lower-bound consistency ability. Via experiments, we confirm that current PLMs are prone to generating inconsistent predictions, even for semantically identical inputs, with high confidence. We also observe that multi-task training benefits consistency, increasing the value by 17% on average. PDF 11 2021
Entailment Graph Learning with Textual Entailment and Soft Transitivity Typed entailment graphs try to learn the entailment relations between predicates from text and model them as edges between predicate nodes. The construction of entailment graphs usually suffers from severe sparsity and unreliability of distributional similarity. We propose a two-stage method, Entailment Graph with Textual Entailment and Transitivity (EGT2). EGT2 learns the local entailment relations by recognizing the textual entailment between template sentences formed by typed CCG-parsed predicates. Based on the generated local graph, EGT2 then uses three novel soft transitivity constraints to consider the logical transitivity in entailment structures. Experiments on benchmark datasets show that EGT2 models transitivity in the entailment graph well, alleviating sparsity, and leads to significant improvement over current state-of-the-art methods. PDF 11 2021
Tackling Situated Multi-Modal Task-Oriented Dialogs with a Single Transformer Model The Situated Interactive Multi-Modal Conversations (SIMMC) 2.0 challenge aims to create virtual shopping assistants that can accept complex multi-modal inputs, i.e. visual appearances of objects and user utterances. It consists of four subtasks: multi-modal disambiguation (MM-Disamb), multi-modal coreference resolution (MM-Coref), multi-modal dialog state tracking (MM-DST), and response retrieval and generation. While many task-oriented dialog systems tackle each subtask separately, we propose a jointly learned encoder-decoder that performs all four subtasks at once for efficiency. Moreover, we handle the multi-modality of the challenge by representing visual objects as special tokens whose joint embedding is learned via auxiliary tasks. This approach won the MM-Coref and response retrieval subtasks and was runner-up for the remaining subtasks using a single unified model. In particular, our model achieved 81.5\% MRR, 71.2\% R@1, 95.0\% R@5, 98.2\% R@10, and 1.9 mean rank in the response retrieval task, setting a high bar for the state-of-the-art result in the SIMMC 2.0 track of the Dialog Systems Technology Challenge 10 (DSTC10). PDF 11 2021
Chinese Word Attention based on Valid Division of Sentence Chinese word attention (CWA), which exploits word-level information, is very important for natural language processing; the goal is to determine how to attend to words in a sentence. We first obtain the valid divisions of a sentence using word segmentation tools. We use BERT for character and word pre-training. Each character embedding, together with its word in a given division, is encoded with block local attention. We use attention with priors to assign attention weights to each segmentation result, and finally combine a global attention mechanism to obtain the optimal recognition result in Chinese NER. PDF 11 2021
Pairwise proximity metrics for topic modelling evaluation based on BERT embeddings The use of topic modelling methods is a popular way to describe natural language text with a representative set of words. In order to evaluate such methods, objective metrics such as coherence and silhouette scores are commonly used. However, it has been shown that topic assessment based on such metrics does not align well with human judgment for classical document corpora such as articles, books and server logs, and, at the same time, it is still unclear how appropriate they are for dialog data. In this paper, we investigate the most commonly used topic modelling evaluation scores in terms of their alignment with human judgment in the specific area of dialog speech. We show that there is still room for improvement in the objective evaluation of topic modelling, and propose a new group of metrics, called Pairwise Proximity metrics, that are shown to align better with human judgment than coherence and silhouette scores. PDF 11 2021
Diversifying Neural Text Generation with Part-of-Speech Guided Softmax and Sampling Neural text generation models often suffer from the low-diversity problem. Various decoding strategies and training-based methods have been proposed to promote diversity by exploiting contextual features alone, but they rarely consider incorporating syntactic structure clues. In this work, we propose using linguistic annotation, i.e., part-of-speech (POS), to guide the text generation. In detail, we introduce POS Guided Softmax to explicitly model two posterior probabilities: (i) the next POS, and (ii) the next token from the vocabulary of the target POS. A POS Guided Sampling strategy is further proposed to address the low-diversity problem by enriching the diversity of POS. Extensive experiments and human evaluations demonstrate that, compared with existing state-of-the-art methods, our POS Guided Softmax and Sampling (POSG) can generate more diverse text while maintaining comparable quality. PDF 11 2021
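To make the two-factor decomposition above concrete, here is a minimal sketch of POS-guided sampling: first sample the next POS tag, then sample a token restricted to that tag's vocabulary. The vocabulary partition, the two logit heads, and all names are illustrative assumptions rather than the POSG implementation.

```python
import torch

vocab_by_pos = {                      # hypothetical POS -> token-id partition
    "NOUN": [0, 1, 2],
    "VERB": [3, 4],
    "ADJ":  [5, 6, 7],
}
pos_tags = list(vocab_by_pos)

def pos_guided_sample(pos_logits, token_logits, temperature=1.0):
    """Sample the next POS first, then a token from that POS's sub-vocabulary."""
    pos_probs = torch.softmax(pos_logits / temperature, dim=-1)          # p(next-POS)
    pos_idx = torch.multinomial(pos_probs, 1).item()
    ids = torch.tensor(vocab_by_pos[pos_tags[pos_idx]])
    sub_probs = torch.softmax(token_logits[ids] / temperature, dim=-1)   # p(token | POS)
    token = ids[torch.multinomial(sub_probs, 1)].item()
    return pos_tags[pos_idx], token

# toy usage: in practice the two logit vectors would come from the model's heads
pos_tag, token_id = pos_guided_sample(torch.randn(len(pos_tags)), torch.randn(8))
print(pos_tag, token_id)
```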
A Light Label Denoising Method with the Internal Data Guidance Samples with incorrect labels are common in datasets, even those annotated by humans. Some approaches have been proposed to alleviate the negative impact of mislabeling on the training process by removing erroneous data or reducing their weights. Unlike previous works, this paper introduces a light yet effective denoising method based on the relationships between the samples within the dataset, namely internal guidance. We examine the method on five datasets with mainstream models. The results demonstrate that this light denoising approach obtains consistent improvements across all datasets and models. PDF 11 2021
A Graph-to-Sequence Model for Joint Intent Detection and Slot Filling in Task-Oriented Dialogue Systems Effectively decoding semantic frames in task-oriented dialogue systems, which typically includes intent detection and slot filling, remains a challenge. Although RNN-based neural models show promising results by jointly learning these two tasks, dominant RNN approaches focus primarily on modeling sequential dependencies, and the rich graph structure information hidden in the dialogue context is seldom explored. In this paper, we propose a novel Graph-to-Sequence model to tackle the spoken language understanding problem by modeling both temporal dependencies and structural information in a conversation. We introduce a new Graph Convolutional LSTM (GC-LSTM) encoder to learn the semantics contained in the dialogue dependency graph by incorporating a powerful graph convolutional operator. Our proposed GC-LSTM can not only capture the spatio-temporal semantic features in a dialogue, but also learn the co-occurrence relationship between intent detection and slot filling. Furthermore, an LSTM decoder is utilized to perform final decoding of both slot filling and intent detection, which mutually improves both tasks through global optimization. Experiments on the benchmark ATIS and Snips datasets show that our model achieves state-of-the-art performance and outperforms existing models. PDF 11 2021
Compressing Sentence Representation via Homomorphic Projective Distillation How can we learn highly compact yet effective sentence representations? Pre-trained language models have been effective in many NLP tasks. However, these models are often huge and produce large sentence embeddings. Moreover, there is a big performance gap between large and small models. In this paper, we propose Homomorphic Projective Distillation (HPD) to learn compressed sentence embeddings. Our method augments a small Transformer encoder model with learnable projection layers to produce compact representations while mimicking a large pre-trained language model to retain the sentence representation quality. We evaluate our method with different model sizes on both semantic textual similarity (STS) and semantic retrieval (SR) tasks. Experiments show that our method achieves a 2.7-4.5 point performance gain on STS tasks compared with previous best representations of the same size. In SR tasks, our method improves retrieval speed (8.2×) and memory usage (8.0×) compared with state-of-the-art large models. PDF 11 2021
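As a rough illustration of the projection idea, the sketch below trains a small encoder plus a learnable projection head to mimic teacher sentence embeddings with an MSE loss. The reduced teacher targets, module shapes, and the tiny MLP standing in for a small Transformer are all assumptions, not the HPD recipe.

```python
import torch
import torch.nn as nn

COMPACT_DIM = 128  # assumed compact target dimension

class ProjectedStudent(nn.Module):
    def __init__(self, in_dim=300, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.proj = nn.Linear(hidden, COMPACT_DIM)   # learnable projection head

    def forward(self, x):
        return self.proj(self.encoder(x))            # compact sentence embedding

student = ProjectedStudent()
features = torch.randn(32, 300)                      # toy input features
teacher_target = torch.randn(32, COMPACT_DIM)        # teacher embeddings, assumed
                                                     # pre-reduced to COMPACT_DIM
loss = nn.functional.mse_loss(student(features), teacher_target)
loss.backward()                                      # train the student to mimic
```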
Towards Better Citation Intent Classification Accurate classification of citation intents in a scientific article provides deeper contextual understanding of, and better quantifies, the contributions of cited articles. This improves scientific literature platform capabilities such as search relevance, ranking, and more. To our knowledge, we present the most comprehensive survey of Transformer-based language models' performance on the citation intent classification task using the SciCite dataset. Here, we make three recommendations. Firstly, we propose reporting model performance as a distribution rather than a single averaged performance value. This arises from our observation that model performance is sensitive to the random seed choice, resulting in wide performance variations across multiple finetuning runs. Secondly, this provides practical insights for model selection, showing the model's best possible performance. Thus, we propose that practitioners perform multiple finetuning runs before selecting the best performing model. Thirdly, we propose a simple data augmentation to improve the distribution of model performance overall. Moving forward, we suggest exploring improvements to the finetuning and model selection process as promising future directions. PDF 11 2021
Augmenting Document Representations for Dense Retrieval with Interpolation and Perturbation Dense retrieval models, which aim at retrieving the most relevant document for an input query on a dense representation space, have gained considerable attention for their remarkable success. Yet, dense models require a vast amount of labeled training data for notable performance, whereas it is often challenging to acquire query-document pairs annotated by humans. To tackle this problem, we propose a simple but effective Document Augmentation for dense Retrieval (DAR) framework, which augments the representations of documents with their interpolation and perturbation. We validate the performance of DAR on retrieval tasks with two benchmark datasets, showing that the proposed DAR significantly outperforms relevant baselines on the dense retrieval of both the labeled and unlabeled documents. PDF 11 2021
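One plausible reading of interpolation and perturbation on the representation space is a mixup-style combination of document embeddings plus Gaussian noise, sketched below; the mixing scheme and noise scale are assumptions, not the exact DAR procedure.

```python
import torch

def augment_documents(doc_embs, n_interp=4, noise_std=0.01):
    """Return extra pseudo-document embeddings derived from an existing batch."""
    idx = torch.randperm(doc_embs.size(0))
    lam = torch.rand(n_interp, 1)                            # mixing coefficients
    interp = lam * doc_embs[:n_interp] + (1 - lam) * doc_embs[idx[:n_interp]]
    perturbed = doc_embs + noise_std * torch.randn_like(doc_embs)
    return torch.cat([interp, perturbed], dim=0)

docs = torch.randn(16, 768)           # e.g., outputs of a dense document encoder
extra = augment_documents(docs)
print(extra.shape)                    # torch.Size([20, 768])
```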
Open Domain Question Answering over Virtual Documents: A Unified Approach for Data and Text Due to its potential for a universal interface over both data and text, data-to-text generation is becoming increasingly popular. However, little prior work has focused on its application to downstream tasks, e.g. using the converted data for grounding or reasoning. In this work, we bridge this gap and use the data-to-text method as a means for encoding structured knowledge for knowledge-intensive applications, i.e. open-domain question answering (ODQA). Specifically, we propose a verbalizer-retriever-reader framework for ODQA over data and text where verbalized tables from Wikipedia and graphs from Wikidata are used as augmented knowledge sources. We show that our Unified Data and Text QA, UDT-QA, can effectively benefit from the expanded knowledge index, leading to large gains over text-only baselines. Notably, our approach sets the single-model state-of-the-art on Natural Questions. Furthermore, our analyses indicate that verbalized knowledge is preferred for answer reasoning in both the adapted and hot-swap settings. PDF 11 2021
S$^4$-Tuning: A Simple Cross-lingual Sub-network Tuning Method The emergence of multilingual pre-trained language models makes it possible to adapt to target languages with only a few labeled examples. However, vanilla fine-tuning tends to achieve degenerated and unstable results, owing to Language Interference among different languages and Parameter Overload under few-sample transfer learning scenarios. To address these two problems, we propose S$^4$-Tuning, a Simple Cross-lingual Sub-network Tuning method. S$^4$-Tuning first detects the most essential sub-network for each target language, and only updates it during fine-tuning. In this way, the language sub-networks lower the scale of trainable parameters, and hence better suit low-resource scenarios. Meanwhile, the commonality and characteristics across languages are modeled by the overlapping and non-overlapping parts to ease the interference among languages. Simple but effective, S$^4$-Tuning gains consistent improvements over vanilla fine-tuning on three multi-lingual tasks involving 37 different languages in total (XNLI, PAWS-X, and Tatoeba). PDF 11 2021
Interpretable embeddings to understand computing careers We propose an approach for analyzing and comparing curricula of study programs in higher education. Pre-trained word embeddings are fine-tuned in a study program classification task, where each curriculum is represented by the names and content of its courses. By combining metric learning with a novel course-guided attention mechanism, our method obtains more accurate curriculum representations than strong baselines. Experiments on a new dataset containing curricula of computing programs demonstrate the interpretability power of our approach via attention weights, topic modeling, and embedding visualizations. We also present a use case that compares computing study programs in the US and Latin America and showcase the capabilities of our method for identifying similarities and differences in topics of study in curricula from different countries. PDF 11 2021
Detection, Disambiguation, Re-ranking: Autoregressive Entity Linking as a Multi-Task Problem We propose an autoregressive entity linking model that is trained with two auxiliary tasks and learns to re-rank generated samples at inference time. Our proposed novelties address two weaknesses in the literature. First, as recent improvements in entity linking suggest that learning mention detection explicitly could increase performance, we train mention detection as an auxiliary task. Second, previous work suggests that re-ranking could help correct prediction errors. We add a new auxiliary task, match prediction, to learn re-ranking. Without the use of a knowledge base or candidate sets, our model sets a new state of the art on two benchmark datasets for entity linking: COMETA in the biomedical domain, and AIDA-CoNLL in the news domain. We show through ablation studies that each of the two auxiliary tasks increases performance, and that re-ranking is an important factor in the increase. Finally, our low-resource experimental results suggest that performance on the main task benefits from the knowledge learned by the auxiliary tasks, and not just from the additional training data. PDF 11 2021
DAML: Chinese Named Entity Recognition with a fusion method of data-augmentation and meta-learning Overfitting is still a common problem in NER with insufficient data. The latest methods include transfer learning, which focuses on storing knowledge gained while solving one task and applying it to a different but related task, and Model-Agnostic Meta-Learning (MAML), which learns a model parameter initialization that generalizes better to similar tasks. However, these methods still need rich resources for pre-training. In this work, we present new perspectives on how to make the most of in-domain and out-of-domain information. By introducing a fusion method of data augmentation and MAML, we first use data augmentation to mine more information. With the augmented resources, we directly utilize out-of-domain and in-domain data with MAML, while avoiding performance degradation after domain transfer. To further improve the model's generalization ability, we propose a new data augmentation method based on a generative approach. We conduct experiments on six open Chinese NER datasets (MSRANER, PeopleDailyNER, CLUENER, WeiboNER, Resume NER, and BOSONNER). The results show that our method significantly reduces the impact of insufficient data and outperforms the state-of-the-art. PDF 11 2021
DGMED: A Novel Document-Level Graph Convolution Network for Multi-Event Detection Online news documents can contain thousands of characters and tens of events. To detect events in these documents, it is important to construct long-range context information. Such information, however, is not effectively created by existing event detection methods, including DMBERT and MOGANED. As a result, these methods show poor event detection accuracy in production, where long documents are common. To address this, this paper proposes a Document-level Graph convolution network for Multi-event Detection (DGMED). DGMED represents each sentence in a long document as a graph, and it interconnects these graphs using novel cross-sentence global neural network nodes. These nodes allow DGMED to further construct accurate document-level contextual information, thus accurately extracting multiple events as required. We evaluate DGMED using a public event extraction dataset (i.e., Maven) and a large-scale production dataset (named AML). Evaluation results show that DGMED can outperform the state-of-the-art methods BERT+CRF and BiLSTM+CRF by up to 0.7% on Maven and 5.7% on AML. PDF 11 2021
Towards Few-shot Entity Recognition in Document Images: A Label-aware Sequence-to-Sequence Framework Entity recognition is a fundamental task in understanding document images. The traditional sequence labeling framework requires extensive datasets and high-quality annotations, which are typically expensive in practice. In this paper, we aim to build an entity recognition model based on only a few shots of annotated document images. To overcome the data limitation, we propose to leverage the label surface names to better inform the model of the target entity semantics. Specifically, we go beyond sequence labeling and develop a novel label-aware seq2seq framework, LASER. We design a new labeling scheme that generates the label surface names word-by-word explicitly after generating the entities. Moreover, we design special layout identifiers to capture the spatial correspondence between regions and labels. During training, LASER refines the label semantics by updating the label surface name representations and also strengthens the label-region correlation. In this way, LASER recognizes the entities from document images through both semantic and layout correspondence. Extensive experiments on two benchmark datasets demonstrate the superiority of LASER under the few-shot setting. PDF 11 2021
Spot the Difference: A Cooperative Object-Referring Game in Non-Perfectly Co-Observable Scene Visual dialogue has witnessed great progress since various vision-oriented goals were introduced into the conversation, notably GuessWhich and GuessWhat, where the single image is visible to only one or to both of the questioner and the answerer, respectively. Research has explored visual dialogue tasks in such single- or perfectly co-observable visual scenes, while somewhat neglecting tasks in non-perfectly co-observable visual scenes, where the images accessed by the two agents may not be exactly the same, as often occurs in practice. Although building common ground in a non-perfectly co-observable visual scene through conversation is significant for advanced dialogue agents, the lack of such a dialogue task and a corresponding large-scale dataset makes it impossible to carry out in-depth research. To break this limitation, we propose an object-referring game in a non-perfectly co-observable visual scene, where the goal is to spot the difference between similar visual scenes by conversing in natural language. The task addresses the challenges of dialogue strategy in a non-perfectly co-observable visual scene and the ability to categorize objects. Correspondingly, we construct a large-scale multi-modal dataset, named \textit{SpotDiff}, which contains 49k Virtual Reality images and 97k dialogues generated by self-play. Finally, we provide benchmark models for this task, and conduct extensive experiments to evaluate their performance as well as analyze the main challenges. PDF 11 2021
Legal Fairness Analysis via Treatment Effect Estimation Legal fairness is one of the most important principles pursued by modern legal systems. Unfortunately, unfairness may be inevitably introduced in real-world cases due to both objective and subjective uncertainty, such as ambiguity in the law or practical bias in judgments. Existing works on fairness analysis mainly rely on labor-intensive element annotation for cases, which suffers from limited generalization ability. To address this issue, we propose to utilize large-scale textual data to perform quantitative legal fairness analysis via our Causal-based Legal Fairness Measuring Framework (CaLF). To verify its effectiveness, we construct a legal-fairness dataset, and experimental results show that CaLF can accurately characterize unfairness. Further, we apply CaLF to a large-scale real-world dataset and make several interesting experimental observations from the perspectives of gender, age, and region. PDF 11 2021
Divide-and-Conquer Text Simplification by Scalable Data Enhancement Text simplification, whose aim is to reduce reading difficulty, can be decomposed into four discrete rewriting operations: substitution, deletion, reordering, and splitting. However, due to a large distribution discrepancy between existing training data and human-annotated data, models may learn improper operations, thus leading to poor generalization capabilities. In order to bridge this gap, we propose a novel data enhancement method, Simsim, that generates training pairs by simulating specific simplification operations. Experiments show that the models trained with Simsim outperform multiple strong baselines and achieve better SARI scores on the Turk and Asset datasets. The newly constructed dataset Simsim is available at *. PDF 11 2021
Personality Prediction of Narrative Characters from Movie Scripts An NLP model that understands stories should be able to understand the characters in them. To support the development of neural models for this purpose, we construct a benchmark, Story2Personality. The task is to predict a movie character's personality based on the narratives of the character in the movie script. Experiments show that our task is challenging for existing text classification models, as none is able to substantially outperform random guessing. We further propose a multi-view model that uses both verbal and non-verbal descriptions for personality prediction, which yields improvements over using only verbal descriptions. The uniqueness and challenges of our dataset call for the development of narrative comprehension techniques from the perspective of understanding characters. PDF 11 2021
Situated Dialogue Learning through Procedural Environment Generation We teach goal-driven agents to interactively act and speak in situated environments by training on generated curriculums. Our agents operate in LIGHT (Urbanek et al. 2019)---a large-scale crowd-sourced fantasy text adventure game wherein an agent perceives and interacts with the world through textual natural language. Goals in this environment take the form of character-based quests, consisting of personas and motivations. We augment LIGHT by learning to procedurally generate additional novel textual worlds and quests to create a curriculum of steadily increasing difficulty for training agents to achieve such goals. In particular, we measure curriculum difficulty in terms of the rarity of the quest in the original training distribution---an easier environment is one that is more likely to have been found in the unaugmented dataset. An ablation study shows that this method of learning from the tail of a distribution results in significantly higher generalization abilities as measured by zero-shot performance on never-before-seen quests. PDF 11 2021
CINO: A Chinese Minority Pre-trained Language Model Multilingual pre-trained language models have shown impressive performance on cross-lingual tasks. This greatly facilitates applications of natural language processing to low-resource languages. However, there are still some languages on which existing multilingual models do not perform well. In this paper, we propose CINO (Chinese Minority Pre-trained Language Model), a multilingual pre-trained language model for Chinese minority languages. It covers Standard Chinese, Cantonese, and six other Chinese minority languages. To evaluate the cross-lingual ability of multilingual models on the minority languages, we collect documents from Wikipedia and build a text classification dataset, WCM (Wiki-Chinese-Minority). We test CINO on WCM and two other text classification tasks. Experiments show that CINO outperforms the baselines notably. The CINO model and the WCM dataset will be made publicly available. PDF 11 2021
IDPG: An Instance-Dependent Prompt Generation Method Prompt tuning is a new, efficient NLP transfer learning paradigm that adds a task-specific prompt in each input instance during the model training stage. It freezes the pre-trained language model and only optimizes a few task-specific prompts. In this paper, we propose a conditional prompt generation method to generate prompts for each input instance, referred to as the Instance-Dependent Prompt Generation (IDPG). Unlike traditional prompt tuning methods that use a fixed prompt, IDPG introduces a lightweight and trainable component to generate prompts based on each input sentence. Empirical experiments on ten natural language understanding (NLU) tasks show that our proposed method consistently outperforms various prompt tuning methods and other efficient transfer learning methods such as Compacter while tuning far fewer model parameters. PDF 11 2021
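A minimal sketch of the instance-dependent idea might look like the following: a small bottleneck network maps each input's sentence encoding to a few prompt vectors that are prepended to the frozen PLM's input embeddings. The dimensions and bottleneck design are illustrative assumptions, not the IDPG architecture.

```python
import torch
import torch.nn as nn

class PromptGenerator(nn.Module):
    def __init__(self, hidden=768, bottleneck=64, prompt_len=5):
        super().__init__()
        self.prompt_len, self.hidden = prompt_len, hidden
        self.net = nn.Sequential(                    # the only trainable component
            nn.Linear(hidden, bottleneck), nn.Tanh(),
            nn.Linear(bottleneck, prompt_len * hidden))

    def forward(self, sent_repr):                    # sent_repr: (batch, hidden)
        prompts = self.net(sent_repr)
        return prompts.view(-1, self.prompt_len, self.hidden)

gen = PromptGenerator()
sent_repr = torch.randn(8, 768)                      # e.g., mean-pooled input embeddings
input_embs = torch.randn(8, 32, 768)                 # frozen PLM input embeddings
prompted = torch.cat([gen(sent_repr), input_embs], dim=1)   # prepend generated prompts
print(prompted.shape)                                # torch.Size([8, 37, 768])
```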
Semi-supervised New Event Type Induction and Description via Contrastive Loss-Enforced Batch Attention Most event extraction methods have traditionally relied on an annotated set of event types. However, creating event ontologies and annotating supervised training data are expensive and time-consuming. Previous work has proposed semi-supervised approaches which leverage seen (annotated) types to learn how to automatically discover new event types. State-of-the-art methods, both semi-supervised and fully unsupervised, use a form of reconstruction loss on specific tokens in a context. In contrast, we present a novel approach to semi-supervised new event type induction using a masked contrastive loss which learns similarities between event mentions by enforcing an attention mechanism over the data minibatch. We further disentangle the discovered clusters by approximating the underlying manifolds in the data, which allows us to increase normalized mutual information and Fowlkes-Mallows scores by over 20% absolute. Building on these clustering results, we extend our approach to two new tasks: predicting the type name of the discovered clusters and linking them to FrameNet frames. PDF 11 2021
Code Synonyms Do Matter: Multiple Synonyms Matching Network for Automatic ICD Coding Automatic ICD coding is defined as assigning disease codes to electronic medical records (EMRs). Existing methods apply label attention with code representations to match related text snippets for coding. Unlike these works that model the label with the code hierarchy or description, we argue that the code synonyms can provide more comprehensive knowledge based on the observation that the code expressions in EMRs vary from their descriptions in ICD. By aligning codes to concepts in UMLS, we collect synonyms of every code in ICD. Then, we propose a multiple synonyms matching network to leverage synonyms for better code representation learning, and finally help the code classification. Experiments on two settings of the MIMIC-III dataset show that our proposed method outperforms previous state-of-the-art methods. PDF 11 2021
Fusing Heterogeneous Factors with Triaffine Mechanism for Nested Named Entity Recognition Nested entities are observed in many domains due to their compositionality, which cannot be easily recognized by the widely-used sequence labeling framework. A natural solution is to treat the task as a span classification problem. To learn better span representations and increase classification performance, it is crucial to effectively integrate heterogeneous factors including inside tokens, boundaries, labels, and related spans which could contribute to nested entity recognition. To fuse these heterogeneous factors, we propose a novel triaffine mechanism including triaffine attention and scoring. Triaffine attention uses boundaries and labels as queries, and uses inside tokens and related spans as keys and values for span representations. Triaffine scoring interacts with boundaries and span representations for classification. Experiments show that our proposed method achieves the state-of-the-art $F_1$ scores on four nested NER datasets: ACE2004, ACE2005, GENIA, and KBP2017. PDF 11 2021
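For readers unfamiliar with the operation, a generic triaffine form scores three vectors jointly through a third-order weight tensor, as in the sketch below; bias terms and the exact choice of factors in the paper may differ from this illustration.

```python
import torch

d = 16
W = torch.randn(d, d, d)                      # third-order triaffine weight tensor

def triaffine(u, v, w):
    """s = sum_{a,b,c} W[a,b,c] * u[a] * v[b] * w[c]"""
    return torch.einsum("abc,a,b,c->", W, u, v, w)

# toy usage: e.g., two boundary representations and a label embedding
start, end, label = torch.randn(d), torch.randn(d), torch.randn(d)
print(triaffine(start, end, label).item())    # scalar span-label score
```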
XLTime: A Cross-Lingual Knowledge Transfer Framework for Zero-Shot Low-Resource Language Temporal Expression Extraction Temporal Expression Extraction (TEE) is essential for understanding time in natural language. It has applications in Natural Language Processing (NLP) tasks such as question answering, information retrieval, and causal inference. To date, work in this area has mostly focused on English, as TEE for low-resource languages is hindered by a scarcity of training data. We propose XLTime, a novel framework for zero-shot low-resource language TEE. XLTime works on top of pre-trained language models and leverages multi-task learning to prompt cross-language knowledge transfer both from English and within the low-resource languages. It alleviates the problems caused by the shortage of low-resource language training data. We apply XLTime with different language models and show that it outperforms the previous automatic SOTA methods on four low-resource languages, i.e., French, Spanish, Portuguese, and Basque, by large margins. It also considerably narrows the gap to the handcrafted HeidelTime tool. PDF 11 2021
A Simple General Method for Detecting Textual Adversarial Examples Although deep neural networks have achieved state-of-the-art performance in various machine learning and artificial intelligence tasks, adversarial examples, constructed by adding small non-random perturbations to correctly classified inputs, successfully fool highly expressive deep classifiers into incorrect predictions. Approaches to adversarial attacks in natural language tasks have boomed in the last five years using character-level, word-level, phrase-level, or sentence-level textual perturbations. While there is some work in NLP on defending against such attacks through proactive methods, like adversarial training, there are, to our knowledge, no effective reactive approaches to defence via detection of textual adversarial examples, such as are found in the image processing literature. In this paper, we apply distance-based ensemble learning and semantic representations from different representation learning models, based on our understanding of the reason for adversarial examples, to fill this gap. Our technique, the MultiDistance Representation Ensemble Method (MDRE), obtains state-of-the-art results on character-level, word-level, and phrase-level attacks on the IMDB dataset, as well as on the latter two with respect to the MultiNLI dataset. If this paper is accepted, we will publish our code. PDF 11 2021
Cross-domain Named Entity Recognition via Graph Matching Cross-domain NER is a practical yet challenging problem, given the data scarcity of real-world scenarios. A common practice is to first learn an NER model in a rich-resource general domain and then adapt the model to specific domains. Due to the mismatch problem between entity types across domains, the broad knowledge of the general domain cannot effectively transfer to the target-domain NER model. To this end, we model the label relationship as a probability distribution and construct label graphs in both the source and target label spaces. To enhance the contextual representation with label structures, we fuse the label graph into the word embeddings output by BERT. By representing label relationships as graphs, we formulate cross-domain NER as a graph matching problem. Furthermore, the proposed method has good applicability with pre-training methods and is potentially capable of other cross-domain prediction tasks. Empirical results on four datasets show that our method outperforms a series of transfer learning, multi-task learning, and few-shot learning methods. PDF 11 2021
Learning Non-Autoregressive Models from Search for Unsupervised Sentence Summarization Text summarization aims to generate a short summary for an input text. In this work, we propose a Non-Autoregressive Unsupervised Summarization (NAUS) approach, which does not require parallel data for training. Our NAUS first performs edit-based search towards a heuristically defined score, and generates a summary as pseudo-groundtruth. Then, we train an encoder-only non-autoregressive Transformer based on the search result. We also propose a dynamic programming approach for length-control decoding, which is important for the summarization task. Experiments on the Gigaword headline generation and DUC2004 datasets show that NAUS achieves state-of-the-art performance for unsupervised summarization while greatly improving inference efficiency. Further, our algorithm is able to perform length-transfer summary generation. PDF 11 2021
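The flavor of exact length control via dynamic programming can be illustrated with a much simpler toy problem: selecting exactly L tokens (order-preserving) that maximize a per-token score. The paper's decoder-side algorithm is more involved; this sketch only conveys the DP structure.

```python
def best_subsequence(scores, L):
    """Pick exactly L positions maximizing total score, preserving order."""
    n = len(scores)
    NEG = float("-inf")
    # dp[i][j]: best total score using the first i tokens while keeping exactly j
    dp = [[NEG] * (L + 1) for _ in range(n + 1)]
    keep = [[False] * (L + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, min(i, L) + 1):
            skip = dp[i - 1][j]                      # drop token i-1
            take = dp[i - 1][j - 1] + scores[i - 1]  # keep token i-1
            dp[i][j], keep[i][j] = max((skip, False), (take, True))
    chosen, j = [], L
    for i in range(n, 0, -1):                        # backtrace the kept positions
        if keep[i][j]:
            chosen.append(i - 1)
            j -= 1
    return sorted(chosen), dp[n][L]

print(best_subsequence([0.9, -0.2, 0.5, 0.1, 0.8], L=3))   # ([0, 2, 4], 2.2)
```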
Assessing the Coherence Modeling Capabilities of Pretrained Transformer-based Language Models The task of ordering a shuffled set of sentences into a coherent text is used to evaluate the capacity of a model to understand causal and temporal relations between entities and events. Recent approaches rely on pretrained Transformer-based models, but it remains unknown whether the differences between them, such as size, pretraining data and objectives, affect their coherence modeling capacity. We present a simple architecture for sentence ordering that relies exclusively on pretrained Transformer-based encoder-only models. This allows us to compare the coherence modeling capabilities of the monolingual and multilingual versions of BERT, RoBERTa, and DistilBERT. We show that RoBERTa-based models outperform BERT-based models and are more robust when ordering longer documents with more than 10 sentences. Thus, the intuitive advantage offered by sentence-based objectives such as Next Sentence Prediction used in BERT is effectively compensated for by the higher amount and diversity of the training data used in RoBERTa. However, the difference between multilingual versions of BERT and RoBERTa is narrower. This suggests that exposure to different languages partially makes up for the benefits of larger and more diverse training data. PDF 11 2021
M6-T: Exploring Sparse Expert Models and Beyond Sparse expert models can achieve promising results with an outrageously large number of parameters but constant computation cost, and thus they have become a trend in model scaling. Still, it is a mystery how Mixture-of-Experts (MoE) layers leveraging the parameters with sparse activation bring quality gains. In this work, we investigate several key factors in sparse expert models. We find that load imbalance may not be a significant problem affecting model quality, and the auxiliary balancing loss can be removed without significant performance degradation. We further discover that a larger number of sparsely activated experts $k$ may not necessarily benefit performance on a time basis, and we observe diminishing marginal utility in that the performance gap gradually narrows as $k$ increases. We take a step forward to propose a simple method called expert prototyping that splits experts into different prototypes and applies top-$k$ routing for each prototype in parallel. Our experiments demonstrate that the prototyping strategy improves model quality compared with further increasing to a larger $k$ at comparable computation cost. Furthermore, we conduct an exploration on training extremely large-scale models, and we find that the strategy shows greater effectiveness in training larger models. Notably, we push the model scale to over $1$ trillion parameters on only $480$ NVIDIA V100-32GB GPUs. The proposed giant model M6-T with expert prototyping achieves substantial speedup in convergence over the same-size baseline. PDF 11 2021
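A minimal sketch of expert prototyping might look like the following: experts are split into groups ("prototypes"), routing runs inside each group in parallel, and the group outputs are summed. The sizes, gating networks, and the deliberately simple dense dispatch are illustrative assumptions, not the M6-T implementation.

```python
import torch
import torch.nn as nn

class PrototypedMoE(nn.Module):
    def __init__(self, d=64, experts_per_proto=4, n_proto=2, k=1):
        super().__init__()
        self.k = k
        self.experts = nn.ModuleList(
            nn.ModuleList(nn.Linear(d, d) for _ in range(experts_per_proto))
            for _ in range(n_proto))
        self.gates = nn.ModuleList(nn.Linear(d, experts_per_proto)
                                   for _ in range(n_proto))

    def forward(self, x):                             # x: (batch, d)
        out = 0
        for gate, group in zip(self.gates, self.experts):
            probs = torch.softmax(gate(x), dim=-1)
            topv, topi = probs.topk(self.k, dim=-1)   # top-k routing per prototype
            for slot in range(self.k):
                for e, expert in enumerate(group):
                    mask = (topi[:, slot] == e).float().unsqueeze(-1)
                    out = out + mask * topv[:, slot:slot + 1] * expert(x)
        return out                                    # sum over prototype outputs

moe = PrototypedMoE()
print(moe(torch.randn(8, 64)).shape)                  # torch.Size([8, 64])
```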
Semantic-Oriented Unlabeled Priming for Large-Scale Language Models Due to the high costs associated with finetuning large language models, various recent works propose to adapt them to specific tasks without any parameter updates through in-context learning. Unfortunately, for in-context learning there is currently no way to leverage unlabeled data, which is often much easier to obtain in large quantities than labeled examples. In this work, we therefore investigate ways to make use of unlabeled examples to improve the zero-shot performance of pretrained language models without any finetuning: We introduce Semantic-Oriented Unlabeled Priming (SOUP), a method that classifies examples by retrieving semantically similar unlabeled examples, assigning labels to them in a zero-shot fashion, and then using them for in-context learning. We also propose bag-of-contexts priming, a new priming strategy that is more suitable for our setting and enables the usage of more examples than fit into the context window. PDF 11 2021
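A rough sketch of the priming pipeline follows, under the assumption that retrieval uses cosine similarity over sentence embeddings and that demonstrations are formatted with a simple template; both are assumptions, not the paper's exact design.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def soup_prompt(query, query_emb, pool, pool_embs, zero_shot_label, k=2):
    """Retrieve k nearest unlabeled examples, label them zero-shot, prepend them."""
    sims = [cosine(query_emb, e) for e in pool_embs]
    neighbors = [pool[i] for i in np.argsort(sims)[::-1][:k]]
    demos = [f"Review: {x}\nSentiment: {zero_shot_label(x)}" for x in neighbors]
    return "\n\n".join(demos + [f"Review: {query}\nSentiment:"])

# toy usage: stand-in embeddings and a fake zero-shot labeler
pool = ["great movie", "terrible plot", "loved it"]
pool_embs = [np.random.randn(16) for _ in pool]
fake_zero_shot = lambda x: "positive" if ("lov" in x or "great" in x) else "negative"
print(soup_prompt("an awful film", np.random.randn(16), pool, pool_embs, fake_zero_shot))
```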
BERT Learns to Teach: Knowledge Distillation with Meta Learning We present Knowledge Distillation with Meta Learning (MetaDistil), a simple yet effective alternative to traditional knowledge distillation (KD) methods where the teacher model is fixed during training. We show the teacher network can learn to better transfer knowledge to the student network (i.e., \textit{learning to teach}) with the feedback from the performance of the distilled student network in a meta learning framework. Moreover, we introduce a pilot update mechanism to improve the alignment between the inner-learner and meta-learner in meta learning algorithms that focus on an improved inner-learner. Experiments on various benchmarks show that MetaDistil can yield significant improvements compared with traditional KD algorithms and is less sensitive to the choice of student capacity and hyperparameters, facilitating the use of KD on different tasks and models. PDF 11 2021
Learning Disentangled Representations in Natural Language Definitions with Semantic Role Labeling Supervision Disentangling the encodings of neural models is a fundamental aspect for improving interpretability, semantic control and downstream task performance in Natural Language Processing. However, most disentanglement methods are unsupervised or rely on synthetic datasets with known generative factors. We argue that recurrent syntactic and semantic regularities in textual data can be used to provide the models with both structural biases and generative factors. We leverage the semantic structures present in a representative and semantically dense category of sentence types, definitional sentences, for training a Variational Autoencoder to learn disentangled representations. Our experimental results show that the proposed model outperforms unsupervised baselines on several qualitative and quantitative benchmarks for disentanglement, and it also improves the results in the downstream task of definition modeling. PDF 11 2021
k-Rater Reliability: The Correct Unit of Reliability for Aggregated Human Annotations Since the inception of crowdsourcing, aggregation has been a common strategy for dealing with unreliable data. Aggregate ratings are more reliable than individual ones. However, many NLP datasets that rely on aggregate ratings only report the reliability of individual ones, which is the incorrect unit of analysis. In these instances, the data reliability is being under-reported. We present empirical, analytical, and bootstrap-based methods for measuring the reliability of aggregate ratings. We call this k-rater reliability (kRR), a multi-rater extension of inter-rater reliability (IRR). We apply these methods to the widely used word similarity benchmark dataset, WordSim. We conducted two replications of the WordSim dataset to obtain an empirical reference point. We hope this discussion will nudge researchers to report kRR, the correct unit of reliability for aggregate ratings, in addition to IRR. PDF 11 2021
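One way to estimate kRR empirically is a bootstrap over rater groups: repeatedly sample two disjoint groups of k raters, aggregate each group's ratings by averaging, and correlate the two aggregate vectors. The split-and-correlate design below is an illustrative assumption, not necessarily the paper's estimator.

```python
import numpy as np

def k_rater_reliability(ratings, k, n_boot=1000, seed=0):
    """ratings: (n_raters, n_items) matrix; returns mean bootstrap correlation."""
    rng = np.random.default_rng(seed)
    n_raters = ratings.shape[0]
    assert n_raters >= 2 * k, "need two disjoint groups of k raters"
    corrs = []
    for _ in range(n_boot):
        perm = rng.permutation(n_raters)
        g1 = ratings[perm[:k]].mean(axis=0)          # aggregate of k raters
        g2 = ratings[perm[k:2 * k]].mean(axis=0)     # disjoint aggregate of k raters
        corrs.append(np.corrcoef(g1, g2)[0, 1])
    return float(np.mean(corrs))

ratings = np.random.default_rng(1).normal(size=(20, 50))  # toy: 20 raters, 50 items
print(k_rater_reliability(ratings, k=5))
```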
Flooding-X: Improving BERT's Resistance to Adversarial Attacks via Loss-Restricted Fine-Tuning Adversarial robustness has attracted much attention recently, and the mainstream solution is adversarial training. However, the tradition of generating adversarial perturbations for each input embedding (in the settings of NLP) scales up the training computational complexity by the number of gradient steps it takes to obtain the adversarial samples. To address this problem, we leverage the Flooding method, which primarily aims at better generalization, and which we find promising for defending against adversarial attacks. We further propose an effective criterion to bring hyper-parameter-dependent flooding into effect with a narrowed-down search space, by measuring how the gradient steps taken within one epoch affect the loss of each batch. Our approach requires zero adversarial samples for training, and its time consumption is equivalent to fine-tuning, which can be 2-15 times faster than standard adversarial training. We experimentally show that our method improves BERT's resistance to textual adversarial attacks by a large margin, and achieves state-of-the-art robust accuracy on various text classification and GLUE tasks. PDF 11 2021
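For context, the flooding objective this defence builds on (Ishida et al., 2020) keeps the training loss hovering around a flood level b instead of driving it to zero. A minimal sketch follows; the flood level here is chosen arbitrarily.

```python
import torch

def flooded_loss(loss, b=0.1):
    """|loss - b| + b: descends when loss > b, ascends when loss < b."""
    return (loss - b).abs() + b

# toy usage with a standard classification loss
logits = torch.randn(8, 2, requires_grad=True)
labels = torch.randint(0, 2, (8,))
base = torch.nn.functional.cross_entropy(logits, labels)
flooded_loss(base).backward()        # gradients flow as usual through the abs()
```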
Improving End-to-end Speech Translation by Leveraging Auxiliary Speech and Text Data We present a method for introducing a text encoder into pre-training end-to-end speech translation systems. It enhances the ability of adapting one modality (i.e., source-language speech) to another (i.e., source-language text). Thus, the speech translation model can learn from both unlabeled and labeled data, especially when the source-language text data is abundant. Beyond this, we present a denoising method for a robust text encoder that can deal with both normal and noisy text data. Our system sets new state-of-the-art on the MuST-C En-De, En-Fr, and LibriSpeech En-Fr tasks. PDF 11 2021
Rethinking and Refining the Distinct Metric Distinct is a widely used automatic metric for evaluating the diversity of language generation tasks. However, we observe that the original approach to calculating distinct scores has evident biases that tend to add higher penalties to longer sequences. In this paper, we refine the calculation of distinct scores by re-scaling the number of distinct tokens based on its expectation. We provide both empirical and theoretical evidence to show that our method effectively removes the biases exhibited in the original distinct score. Further analyses also demonstrate that the refined score correlates better with human evaluations. PDF 11 2021
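A plausible instantiation of the re-scaling is to divide the distinct-token count by its expectation under uniform sampling from a vocabulary of size V, so that repetition, not length per se, lowers the score. The normalizer below follows the standard expectation E[#distinct] = V(1 - ((V-1)/V)^C) for C sampled tokens; the paper's exact formulation may differ.

```python
def expectation_adjusted_distinct(tokens, vocab_size):
    """Distinct-token count rescaled by its expectation under uniform sampling."""
    c = len(tokens)                                   # total token count
    distinct = len(set(tokens))
    expected = vocab_size * (1 - ((vocab_size - 1) / vocab_size) ** c)
    return distinct / expected

short = ["a", "b", "c"]
long_ = ["a", "b", "c"] * 10
print(expectation_adjusted_distinct(short, vocab_size=50))   # ~1.0: fully diverse
print(expectation_adjusted_distinct(long_, vocab_size=50))   # low: repetition, not
                                                             # length, drives the drop
```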
Knowledgeable Prompt-tuning: Incorporating Knowledge into Prompt Verbalizer for Text Classification Tuning pre-trained language models (PLMs) with task-specific prompts has been a promising approach for text classification. Particularly, previous studies suggest that prompt-tuning has remarkable superiority in the low-data scenario over generic fine-tuning methods with extra classifiers. The core idea of prompt-tuning is to insert text pieces, i.e., a template, into the input and transform a classification problem into a masked language modeling problem, where a crucial step is to construct a projection, i.e., a verbalizer, between a label space and a label word space. A verbalizer is usually handcrafted or searched by gradient descent, which may lack coverage and bring considerable bias and high variance to the results. In this work, we focus on incorporating external knowledge into the verbalizer, forming knowledgeable prompt-tuning (KPT), to improve and stabilize prompt-tuning. Specifically, we expand the label word space of the verbalizer using external knowledge bases (KBs) and refine the expanded label word space with the PLM itself before predicting with the expanded label word space. Extensive experiments on zero- and few-shot text classification tasks demonstrate the effectiveness of knowledgeable prompt-tuning. PDF 11 2021
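As a toy illustration of an expanded verbalizer, the sketch below maps each label to many label words and scores a class by averaging the masked-position logits over its word set; the word lists, vocabulary, and averaging choice are assumptions, not the KPT refinement procedure.

```python
import torch

verbalizer = {                      # hypothetical KB-expanded label word sets
    "SCIENCE": ["science", "physics", "chemistry", "biology"],
    "SPORTS":  ["sports", "football", "tennis"],
}
vocab = {w: i for i, w in enumerate(sum(verbalizer.values(), []) + ["other"])}

mask_logits = torch.randn(len(vocab))    # stand-in MLM logits at the [MASK] position

def label_scores(mask_logits):
    """Class score = mean logit over that class's expanded label words."""
    return {label: mask_logits[[vocab[w] for w in words]].mean().item()
            for label, words in verbalizer.items()}

scores = label_scores(mask_logits)
print(max(scores, key=scores.get))       # predicted label
```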
Towards Unified Prompt Tuning for Few-shot Learning Prompt-based fine-tuning has boosted the performance of Pre-trained Language Models (PLMs) on few-shot learning by employing task-specific prompts. However, PLMs are unfamiliar with prompt-style expressions during pre-training, which limits the few-shot learning performance on downstream tasks. It would be desirable if models could acquire some prompting knowledge before task adaptation. We present the Unified Prompt Tuning (UPT) framework, leading to better few-shot learning for BERT-style models by explicitly capturing prompting semantics from non-target NLP datasets. In UPT, a novel paradigm Prompt-Options-Verbalizer is proposed for joint prompt learning across different NLP tasks, forcing PLMs to capture task-invariant prompting knowledge. We further design a self-supervised task named Knowledge-enhanced Selective Masked Language Modeling to improve the PLM's generalization abilities for accurate adaptation to previously unseen tasks. After multi-task learning, the PLM can be fine-tuned for any target few-shot NLP task using the same prompting paradigm. Experiments over a variety of NLP tasks show that UPT consistently outperforms the state of the art in prompt-based fine-tuning. PDF 11 2021
Prompt Combines Paraphrase: Enhancing Biomedical “Pre-training, Prompt and Predicting” Models by Explaining Rare Biomedical Concepts Prompt-based fine-tuning for pre-trained models has proven effective in the general domain for few-shot learning in downstream tasks. In the biomedical domain, rare biomedical entities, which are quite ubiquitous in healthcare contexts, can affect the performance of pre-trained models, especially in low-resource scenarios. We propose a simple yet effective approach to helping models understand rare biomedical words during tuning with prompts. Experiments demonstrate that our method can achieve up to a 5% improvement in biomedical tasks without any additional parameters or training steps in few-shot vanilla prompt settings. PDF 11 2021
Sparse Progressive Distillation: Resolving Overfitting under Pretrain-and-Finetune Paradigm Conventional wisdom in pruning Transformer-based language models is that pruning reduces the model expressiveness and thus is more likely to underfit rather than overfit. However, under the trending pretrain-and-finetune paradigm, we postulate a counter-traditional hypothesis, that is: pruning increases the risk of overfitting when performed at the fine-tuning phase. In this paper, we aim to address the overfitting problem and improve pruning performance via progressive knowledge distillation with error-bound properties. We show for the first time that reducing the risk of overfitting can help the effectiveness of pruning under the pretrain-and-finetune paradigm. Ablation studies and experiments on the GLUE benchmark show that our method outperforms the leading competitors across different tasks. PDF 11 2021
POLITICS: Pretraining with Same-story Article Comparison for Ideology Prediction and Stance Detection Ideology is at the core of political science. Yet, general-purpose tools that can characterize and predict ideology across different genres of text still do not exist. To this end, we study the training of PLMs using novel ideology-driven pretraining objectives that rely on the comparison of articles covering the same stories but written by media of different ideologies. We further collect a large-scale dataset consisting of more than 3.6M political news articles for experiments. Our model POLITICS and its variants outperform strong baselines on 10 out of the 11 ideology prediction and stance detection tasks. Our analysis further shows that POLITICS is especially good at understanding long or formally written texts, and is also robust in few-shot learning scenarios. PDF 11 2021
INDICXNLI: A Dataset for Studying NLI in Indic Languages While Indic NLP has made rapid advances recently in terms of the availability of corpora and pre-trained models, benchmark datasets for standard NLU tasks are limited. To this end, we introduce INDICXNLI, an NLI dataset for 11 Indic languages. It has been created by high-quality machine translation of the original English XNLI dataset, and our analysis attests to the quality of INDICXNLI. By finetuning different pre-trained LMs on INDICXNLI, we analyze various cross-lingual transfer techniques with respect to the impact of the choice of language models, languages, multi-linguality, mixed-language input, etc. These experiments provide us with useful insights into the behaviour of pre-trained models for a diverse set of languages. INDICXNLI will be publicly available for research. PDF 11 2021
TACO: Pre-training of Deep Transformers with Attention Convolution using Disentangled Positional Representation Word order, a crucial part of understanding natural language, has been carefully considered in pre-trained models by incorporating different kinds of positional encodings. However, existing pre-trained models mostly lack the ability to maintain robustness against minor permutation of words in learned representations. We therefore propose a novel architecture named Transformer with Attention COnvolution (TACO), to explicitly disentangle positional representations and incorporate convolution over multi-source attention maps before softmax in self-attention. Additionally, we design a novel self-supervised task, masked position modeling (MPM), to assist our TACO model in capturing complex patterns with regard to word order. Combining MLM (masked language modeling) and MPM objectives, the proposed TACO model can efficiently learn two disentangled vectors for each token, representing its content and position respectively. Experimental results show that TACO significantly outperforms BERT in various downstream tasks with fewer model parameters. Remarkably, TACO achieves +2.6% improvement over BERT on SQuAD 1.1 task, +5.4% on SQuAD 2.0 and +3.4% on RACE, with only 46K pre-training steps. PDF 11 2021
Continual Pre-training of Language Models for Math Problem Understanding with Syntax-Aware Memory Network Recently, pre-trained language models (PLMs) have shown effectiveness in domain transfer and task adaption. However, two major challenges limit the effectiveness of transferring PLMs to math problem understanding tasks. First, a math problem usually contains a textual description and formulas, and the two types of information have a natural semantic gap. Second, textual and formula information are essential to each other; it is hard but necessary to deeply fuse the two types of information. To address these issues, we enrich the formula information by combining the syntax semantics of the text to construct the math syntax graph, and design syntax-aware memory networks to deeply fuse the characteristics from the graph and text. With the help of syntax relations, a token from the text can trace its semantically related nodes within the formulas, which is able to capture the fine-grained correlations between text and formulas. Besides, we also devise three continual pre-training tasks to further align and fuse the representations of the text and graph. Experimental results on four tasks in the math domain demonstrate the effectiveness of our approach. PDF 11 2021
Layout-Aware Neural Model for Resolving Hierarchical Table Structure While many pipelines for extracting information from tables assume simple table structure, tables in the financial domain frequently have complex, hierarchical structure. The main example would be parent-child relationships between header cells. Most prior datasets of tables annotated from images or PDFs, and most models for extracting table structure, concentrate on the problems of table, cell, row, and column bounding box extraction. The area of fine-grained table structure remains relatively unexplored. In this study, we present a dataset of 887 tables, manually labeled for cell types and column hierarchy relations. The tables are selected from IBM FinTabNet, a much larger dataset of more than 100,000 financial tables having cell, row, and column bounding boxes extracted by deep learning, but not including semantic cell type or cell-to-cell relation labels, which we add. Selection of these 887 tables is performed using heuristics, which results in a much larger proportion of tables with complex hierarchical structure (roughly half) than a random sample from FinTabNet. Further, we fine-tune models based on LayoutLM on the cell-type classification task and on the identification of hierarchical relations among column headers. We achieve F1 scores of 95% and 70% on the respective tasks. Finally, we use the trained model to create soft labels for the entirety of FinTabNet. PDF 11 2021
Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation Back-translation is a critical component of Unsupervised Neural Machine Translation (UNMT), which generates pseudo parallel data from target monolingual data. A UNMT model is trained on the pseudo parallel data with $\text{\bf translated source}$, and translates $\text{\bf natural source}$ sentences in inference. The source discrepancy between training and inference hinders the translation performance of UNMT models. By carefully designing experiments, we identify two representative characteristics of the data gap in source: (1) $\text{\textit{style gap}}$ (i.e., translated vs. natural text style) that leads to poor generalization capability; (2) $\text{\textit{content gap}}$ that induces the model to produce hallucination content biased towards the target language. To narrow the data gap, we propose an online self-training approach, which simultaneously uses the pseudo parallel data $\{$natural source, translated target$\}$ to mimic the inference scenario. Experimental results on several widely-used language pairs show that our approach outperforms two strong baselines (XLM and MASS) by remedying the style and content gaps. PDF 11 2021
Reframing Instructional Prompts to GPTk's Language What kinds of instructional prompts are easier to follow for Language Models (LMs)? We study this question by conducting an extensive empirical analysis that sheds light on important features of successful instructional prompts. We propose several reframing techniques for model designers to manually create more effective prompts. Some examples include decomposing a complex task instruction into multiple simpler tasks or itemizing instructions into sequential steps. Our experiments compare the zero-shot and few-shot performance of LMs prompted with reframed instructions on 12 NLP tasks across 6 categories. Compared with original instructions, our reframed instructions lead to significant improvements across LMs with different sizes, underscoring the cross-model generality of these guidelines. For example, the same reframed prompts boost few-shot performance of GPT3-series and GPT2-series by 12.5% and 6.7% respectively averaged over all tasks. Furthermore, reframed instructions reduce the number of examples required to prompt LMs in the few-shot setting. We hope these empirically-driven techniques will pave the way for more effective ways to prompt LMs in the future. PDF 11 2021
Analyzing Gender Representation in Multilingual Models Multilingual language models were shown to allow for nontrivial transfer across scripts and languages. In this work, we study the structure of the internal representations that enable this transfer. We focus on the representations of gender distinctions as a practical case study, and examine the extent to which the gender concept is encoded in shared subspaces across different languages. Our analysis shows that gender representations consist of several prominent components that are shared across languages, alongside language-specific components. The existence of language-independent and language-specific components provides an explanation for an intriguing empirical observation we make: while gender classification transfers well across languages, bias mitigation interventions trained on a single language do not transfer easily to others. PDF 11 2021
Towards Computationally Feasible Deep Active Learning Active learning (AL) is a prominent technique for reducing the annotation effort required for training machine learning models. Deep learning offers a solution for several essential obstacles to deploying AL in practice but introduces many others. One such problem is the excessive computational resources required to train an acquisition model and estimate its uncertainty on instances in the unlabeled pool. We propose two techniques that tackle this issue for text classification and tagging tasks, offering a substantial reduction of AL iteration duration and of the computational overhead introduced by deep acquisition models in AL. We also demonstrate that our algorithm that leverages pseudo-labeling and distilled models overcomes one of the essential obstacles revealed previously in the literature. Namely, it was shown that due to differences between an acquisition model used to select instances during AL and a successor model trained on the labeled data, the benefits of AL can diminish. We show that our algorithm, despite using a smaller and faster acquisition model, is capable of training a more expressive successor model with higher performance. PDF 11 2021
When More Data Hurts: A Troubling Quirk in Developing Broad-Coverage Natural Language Understanding Systems In natural language understanding (NLU) production systems, the end users' evolving needs necessitate the addition of new abilities, indexed by discrete symbols, requiring additional training data and resulting in dynamic, ever-growing datasets. Dataset growth introduces new challenges: we find that when learning to map inputs to a new symbol from a fixed number of annotations, more data can in fact reduce the model's performance on examples that involve this new symbol. We show that this trend holds for multiple models on two datasets for common NLU tasks: intent recognition and semantic parsing. We demonstrate that the performance decrease is largely associated with an effect we refer to as source signal dilution, which occurs when strong lexical cues in the training data become diluted as the dataset grows. Selectively dropping training examples to prevent source dilution often reverses the performance decrease, suggesting a direction for improving models. We release our code and models at anonymous-link. PDF 11 2021
Discourse Context Primes Hindi Word Order Hindi has a flexible word order, yet certain word orders are consistently preferred over others. A number of factors are known to influence Hindi word order preferences in isolation, including information structure and syntactic complexity. However, the relative impact of these factors on Hindi constituent ordering is not well understood. Inspired by prior work on syntactic priming, we investigate how the words and syntactic structures in a sentence influence the word order of the following sentences. Specifically, we extract sentences from the Hindi-Urdu Treebank corpus (HUTB), we permute the preverbal constituents of those sentences, and we build a classifier to predict which sentences actually occurred in the corpus against our generated distractors. The classifier uses a number of discourse-based features and cognitive features to make its predictions, including dependency length, surprisal, and information status. We find that lexical and syntactic priming and referent givenness drive order preferences. Moreover, along the lines of previous work in psycholinguistics, we find that certain verbs are more susceptible to priming than others. We conclude by situating our results within the broader syntactic priming literature. PDF 11 2021
Get Your Model Puzzled: Introducing Crossword-Solving as a New NLP Benchmark Solving crossword puzzles requires diverse reasoning capabilities, access to a vast amount of knowledge about language and the world, and the ability to satisfy the constraints imposed by the structure of the puzzle. In this work, we introduce solving crossword puzzles as a new natural language understanding task. We release a corpus of crossword puzzles collected from the New York Times daily crossword spanning 25 years and containing a total of 9152 puzzles, with an average of 85 clues per puzzle. These puzzles include a diverse set of clues: historic, factual, word meaning, synonyms/antonyms, fill-in-the-blank, abbreviations, prefixes/suffixes, wordplay, and cross-lingual, as well as clues that depend on the answers to other clues. We separately release the clue-answer pairs from these puzzles as an open-domain question answering dataset containing over half a million unique clue-answer pairs. For the question answering task, our baselines include several sequence-to-sequence and retrieval-based generative models. We also introduce a non-parametric constraint satisfaction baseline for solving the entire crossword puzzle. Finally, we propose an evaluation framework which consists of several complementary performance metrics. PDF 11 2021
Unsupervised Keyphrase Extraction via Interpretable Neural Networks Keyphrase extraction aims at automatically extracting a list of "important" phrases which represent the key concepts in a document. Traditionally, it has been approached from an information-theoretic angle using phrase co-occurrence statistics. This work proposes a novel unsupervised approach to keyphrase extraction that uses a more intuitive notion of phrase importance, inspired by interpretability research. In particular, we use a self-explaining neural model to measure the predictive impact of input phrases on downstream task performance, and consider the resulting interpretations as document keyphrases for the target task. We show the efficacy of our approach on four datasets in two domains---scientific publications and news articles---attaining state-of-the-art results in unsupervised keyphrase extraction. PDF 11 2021
Substructure Distribution Projection for Zero-Shot Cross-Lingual Dependency Parsing We present substructure distribution projection (SubDP), a technique that projects a distribution over structures in one domain to another, by projecting substructure distributions separately. Models for the target domain can then be trained, using the projected distributions as soft silver labels. We evaluate SubDP on zero-shot cross-lingual dependency parsing, taking dependency arcs as substructures: we project the predicted dependency arc distributions in the source language(s) to target language(s), and train a target language parser on the resulting distributions. Given an English treebank as the only source of human supervision, SubDP achieves a better unlabeled attachment score than all prior work on the Universal Dependencies v2.2 (Nivre et al., 2020) test set across eight diverse target languages, as well as the best labeled attachment score on six languages. In addition, SubDP improves zero-shot cross-lingual dependency parsing with very few (e.g., 50) supervised bitext pairs, across a broader range of target languages. PDF 11 2021
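To make the arc-projection idea above concrete, here is a minimal NumPy sketch of projecting a dependency-arc distribution through a soft word alignment; the matrix shapes, the helper name project_arc_distribution, and the column renormalization are illustrative assumptions based only on the abstract, not the paper's exact formulation.

import numpy as np

def project_arc_distribution(P_src, A):
    """Project a source-language dependency-arc distribution to the target language.

    P_src : (m, m) matrix, P_src[h, d] = probability that source word h heads source word d.
    A     : (m, n) soft alignment matrix from m source words to n target words.

    Returns an (n, n) matrix of projected soft ("silver") arc probabilities.
    """
    # An arc (h_src -> d_src) contributes to (h_tgt -> d_tgt) in proportion to
    # A[h_src, h_tgt] * A[d_src, d_tgt].
    P_tgt = A.T @ P_src @ A
    # Renormalize each dependent's head distribution so columns sum to 1.
    col_sums = P_tgt.sum(axis=0, keepdims=True)
    return P_tgt / np.clip(col_sums, 1e-9, None)

# Toy example: 2 source words projected onto 3 target words.
P_src = np.array([[0.1, 0.9],
                  [0.9, 0.1]])
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.5, 0.5]])
print(project_arc_distribution(P_src, A))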
Towards a Progression-Aware Autonomous Dialogue Agent Recent advances in large-scale language modeling and generation have enabled the creation of dialogue agents that exhibit human-like responses in a wide range of conversational scenarios spanning a diverse set of tasks, from general chit-chat to focused goal-oriented discourse. While these agents excel at generating high-quality responses that are relevant to prior context, they suffer from a lack of awareness of the overall direction in which the conversation is headed, and the likelihood of task success inherent therein. Thus, we propose a framework in which dialogue agents can evaluate the progression of a conversation toward or away from desired outcomes, and use this signal to inform planning for subsequent responses. Our framework is composed of three key elements: (1) the notion of a "global" dialogue state (GDS) space, (2) a task-specific progression function (PF) computed in terms of a conversation's trajectory through this space, and (3) a planning mechanism by which a dialogue agent may use progression signals to select its next response. PDF 11 2021
Revisiting Softmax for Uncertainty Approximation in Text Classification Uncertainty approximation in text classification is an important area with applications in domain adaptation and interpretability. The most widely used uncertainty approximation method is Monte Carlo Dropout, which is computationally expensive as it requires multiple forward passes through the model. A cheaper alternative is to simply use the softmax to estimate model uncertainty. However, prior work has indicated that the softmax can generate overconfident uncertainty estimates and can thus be tricked into producing incorrect predictions. In this paper, we perform a thorough empirical analysis of both methods on three datasets with two base neural architectures in order to provide insight into the trade-offs between the two. We compare the methods' uncertainty approximations and downstream text classification performance, while weighing their performance against their computational complexity in a cost-benefit analysis. We find that, while Monte Carlo Dropout produces the best uncertainty approximations, using a simple softmax leads to competitive F1 results for text classification at a much lower computational cost, suggesting that the softmax can in fact be a sufficient uncertainty estimate when computational resources are a concern. PDF 11 2021
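As a rough illustration of the cost-benefit trade-off discussed here, the following PyTorch sketch contrasts the two estimators; the entropy-based scoring and the generic model(x) interface are assumptions for illustration, not the paper's exact experimental setup.

import torch
import torch.nn.functional as F

def softmax_uncertainty(model, x):
    """One forward pass; entropy of the softmax as the uncertainty score."""
    model.eval()
    with torch.no_grad():
        probs = F.softmax(model(x), dim=-1)
    return -(probs * probs.clamp_min(1e-12).log()).sum(-1)

def mc_dropout_uncertainty(model, x, n_samples=20):
    """n_samples stochastic passes with dropout active; entropy of the mean.

    Costs n_samples forward passes, which is exactly the computational
    overhead the paper weighs against the single-pass softmax."""
    model.train()  # keep dropout layers active at inference time
    with torch.no_grad():
        probs = torch.stack([F.softmax(model(x), dim=-1) for _ in range(n_samples)])
    mean_probs = probs.mean(0)
    return -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(-1)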
Data-Driven Adaptive Simultaneous Machine Translation In simultaneous translation (SimulMT), the most widely used strategy is the wait-k policy, thanks to its simplicity and effectiveness in balancing translation quality and latency. However, wait-k suffers from two major limitations: (a) it is a fixed policy that cannot adaptively adjust latency given the context, and (b) its training is much slower than that of full-sentence translation. To alleviate these issues, we propose a novel and efficient training scheme for adaptive SimulMT by augmenting the training corpus with adaptive prefix-to-prefix pairs, while the training complexity remains the same as that of training full-sentence translation models. Experiments on two language pairs show that our method outperforms all strong baselines in terms of translation quality and latency. PDF 11 2021
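For readers unfamiliar with the fixed wait-k policy this paper improves on, a minimal simulation sketch follows; step_fn is a hypothetical stand-in for an incremental decoder, and the read/write schedule shown is the textbook fixed policy rather than this paper's adaptive variant.

def wait_k_translate(src_tokens, k, step_fn, eos="</s>"):
    """Simulate a wait-k simultaneous translation policy.

    Reads the first k source tokens, then alternates: emit one target token,
    read one more source token, until the source is exhausted, then finishes.
    step_fn(src_prefix, tgt_prefix) stands in for the incremental decoder and
    must return the next target token.
    """
    tgt = []
    read = min(k, len(src_tokens))
    while True:
        token = step_fn(src_tokens[:read], tgt)
        tgt.append(token)
        if token == eos or len(tgt) > 2 * len(src_tokens):  # safety cap
            break
        if read < len(src_tokens):
            read += 1  # fixed policy: one READ per WRITE after the initial k
    return tgt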
Program Transfer for Answering Complex Questions over Knowledge Bases Program induction for answering complex questions over knowledge bases (KBs) aims to decompose a question into a multi-step program, whose execution against the KB produces the final answer. Learning to induce programs relies on a large number of parallel question-program pairs for the given KB. However, for most KBs, the gold program annotations are usually lacking, making learning difficult. In this paper, we propose the approach of program transfer, which aims to leverage the valuable program annotations on rich-resourced KBs as external supervision signals to aid program induction for low-resourced KBs that lack program annotations. For program transfer, we design a novel two-stage parsing framework with an efficient ontology-guided pruning strategy. First, a sketch parser translates the question into a high-level program sketch, which is a composition of functions. Second, given the question and sketch, an argument parser searches for the detailed arguments of the functions in the KB. During the search, we incorporate the KB ontology to prune the search space. Experiments on ComplexWebQuestions and WebQuestionsSP show that our method outperforms SOTA methods significantly, demonstrating the effectiveness of program transfer and our framework. PDF 11 2021
ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension Training a referring expression comprehension (ReC) model for a new visual domain requires collecting referring expressions, and potentially corresponding bounding boxes, for images in the domain. While large-scale pre-trained models are useful for image classification across domains, it remains unclear if they can be applied in a zero-shot manner to more complex tasks like ReC. We present ReCLIP, a simple but strong zero-shot baseline that repurposes CLIP, a state-of-the-art large-scale model, for ReC. Motivated by the close connection between ReC and CLIP’s contrastive pre-training objective, the first component of ReCLIP is a region-scoring method that isolates object proposals via cropping and blurring, and passes them to CLIP. However, through controlled experiments on a synthetic dataset, we find that CLIP is largely incapable of performing spatial reasoning off-the-shelf. Thus, the second component of ReCLIP is a spatial relation resolver that handles several types of spatial relations. We reduce the gap between zero-shot baselines from prior work and supervised models by as much as 30% on RefCOCOg, and on RefGTA (video game imagery), we outperform supervised ReC models trained on real images by an absolute 12%. PDF 11 2021
HybriDialogue: An Information-Seeking Dialogue Dataset Grounded on Tabular and Textual Data A pressing challenge in current dialogue systems is to successfully converse with users on topics with information distributed across different modalities. Previous work in multi-turn dialogue systems has primarily focused on either text or table information. In more realistic scenarios, having a joint understanding of both is critical as knowledge is typically distributed over both unstructured and structured forms. We present a new dialogue dataset, HybriDialogue, which consists of crowdsourced natural conversations grounded on both Wikipedia text and tables. The conversations are created through the decomposition of complex multihop questions into simple, realistic multiturn dialogue interactions. We conduct several baseline experiments, including retrieval, system state tracking, and dialogue response generation. Our results show that there is still ample opportunity for improvement, demonstrating the importance of building stronger dialogue systems that can reason over the complex setting of information-seeking dialogue grounded on tables and text. PDF 11 2021
Dict-NMT: Bilingual Dictionary based NMT for Extremely Low Resource Languages Neural Machine Translation (NMT) models have been effective on large bilingual datasets. However, existing methods and techniques show that a model's performance is highly dependent on the number of examples in the training data. For many languages, having such an amount of corpora is a far-fetched dream. Taking inspiration from monolingual speakers exploring new languages using bilingual dictionaries, we investigate the applicability of bilingual dictionaries for languages with extremely low, or no, bilingual corpus. In this paper, we explore methods using bilingual dictionaries with an NMT model to improve translations for extremely low resource languages. We extend this work to multilingual systems, which exhibit zero-shot properties. We present a detailed analysis of the effects of dictionary quality, training dataset size, language family, etc., on translation quality. Results on multiple low-resource test languages show a clear advantage of our bilingual dictionary-based method over the baselines. PDF 11 2021
AdapterBias: Parameter-efficient Token-dependent Embedding Shift for Adapters in NLP Tasks Transformer-based pre-trained models with millions of parameters require large storage. Recent approaches tackle this shortcoming by training adapters, but these approaches still require a relatively large number of parameters. In this study, AdapterBias, a surprisingly simple yet effective adapter architecture, is proposed. AdapterBias adds a token-dependent shift to the embedding to adapt to downstream tasks with only a vector and a linear layer. Extensive experiments are conducted to demonstrate the effectiveness of AdapterBias. The experiments show that our proposed method dramatically reduces the trainable parameters compared with previous works, with a minimal decrease in task performance compared with fine-tuned pre-trained models. We further find that AdapterBias automatically learns to assign more significant shifts to the tokens related to the task under consideration. PDF 11 2021
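A minimal PyTorch reading of the described architecture (a shared vector plus a linear layer producing a token-dependent scale) might look as follows; the module name and the placement of the shift are assumptions based only on the abstract, not the authors' released code.

import torch
import torch.nn as nn

class AdapterBias(nn.Module):
    """Token-dependent embedding shift from a single vector and a linear layer.

    Each token's hidden state x_i receives a shift alpha_i * v, where v is a
    shared learnable vector and alpha_i is a scalar produced by a linear layer
    applied to x_i. Only v and the linear layer are trained; the backbone
    stays frozen, which is where the parameter savings come from.
    """
    def __init__(self, hidden_size):
        super().__init__()
        self.v = nn.Parameter(torch.zeros(hidden_size))   # shared shift direction
        self.alpha = nn.Linear(hidden_size, 1)            # token-dependent scale

    def forward(self, hidden_states):                     # (batch, seq, hidden)
        return hidden_states + self.alpha(hidden_states) * self.v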
Learning Emotion-Aware Contextual Representations for Emotion-Cause Pair Extraction Emotion Cause Pair Extraction (ECPE), a task expanded from the earlier emotion cause extraction (ECE) task, focuses on extracting emotion-cause pairs in text. Two reasons have made ECPE a more challenging, but more applicable, task in real-world scenarios: 1) an ECPE model needs to identify both emotions and their corresponding causes without the annotation of emotions; 2) the ECPE task involves finding causes for multiple emotions in the document context, while ECE targets a single emotion. However, most existing methods for ECPE adopt a unified approach that models emotion extraction and cause extraction jointly through shared contextual representations, which is suboptimal for extracting multiple emotion-cause pairs. In addition, previous ECPE works are evaluated on one ECE dataset, which exhibits a bias that the majority of documents have only one emotion-cause pair. In this work, we propose a simple pipelined approach that builds on two independent encoders, in which the emotion extraction model only provides input features for the cause extraction model. We reconstruct the benchmark dataset to better meet ECPE settings. Based on experiments conducted on the original and reconstructed datasets, we validate that our model can learn distinct contextual representations specific to each emotion, and thus achieves state-of-the-art performance on both datasets, while showing robustness in analyzing more complex document context. PDF 11 2021
Sequence-to-Sequence Multilingual Pre-Trained Models: A Hope for Low-Resource Language Translation? We investigate the capability of mBART, a sequence-to-sequence multilingual pre-trained model, in translating low-resource languages under five factors: the amount of data used in pre-training the original model, the amount of data used in fine-tuning, the noisiness of the data used for fine-tuning, the domain-relatedness between the pre-training, fine-tuning, and testing datasets, and the language relatedness. When limited parallel corpora are available, fine-tuning mBART can measurably improve translation performance over training Transformers from scratch. mBART effectively uses even domain-mismatched text, suggesting that mBART can learn meaningful representations when data is scarce. Still, it founders when given too little data in languages unseen during pre-training. PDF 11 2021
A Self-Adaptive Learning Rate and Curriculum Learning Based Framework for Few-Shot Text Classification Due to the lack of labeled data in many realistic scenarios, a number of few-shot learning methods for text classification have been proposed, among which the meta learning based ones have recently attracted much attention. Such methods usually consist of a learner as the classifier and a meta learner for specializing the learner to tasks. For the learner, the learning rate is crucial to its performance. However, existing methods treat it as a hyperparameter and adjust it manually, which is time-consuming and laborious. Intuitively, for different tasks and neural network layers, the learning rates should be different and self-adaptive. For the meta learner, it requires a good generalization ability so as to quickly adapt to new tasks. Therefore, we propose a novel meta learning framework, called MetaCLSLR, for few-shot text classification. Specifically, we present a novel meta learning mechanism to obtain different learning rates for different tasks and neural network layers so as to enable the learner to quickly adapt to new training data. Moreover, we propose a task-oriented curriculum learning mechanism to help the meta learner achieve a better generalization ability by learning from different tasks with increasing difficulties. Extensive experiments on three benchmark datasets demonstrate the effectiveness of MetaCLSLR. PDF 11 2021
GlobalWoZ: Globalizing MultiWoZ to Develop Multilingual Task-Oriented Dialogue Systems Over the last few years, there has been a move towards data curation for multilingual task-oriented dialogue (ToD) systems that can serve people speaking different languages. However, existing multilingual ToD datasets either have a limited coverage of languages due to the high cost of data curation, or ignore the fact that dialogue entities barely exist in countries speaking these languages. To tackle these limitations, we introduce a novel data curation method that generates GlobalWoZ --- a large-scale multilingual ToD dataset globalized from an English ToD dataset for three unexplored use cases of multilingual ToD systems. Our method is based on translating dialogue templates and filling them with local entities in the target-language countries. Besides, we extend the coverage of target languages to 20 languages. We will release our dataset and a set of strong baselines to encourage research on multilingual ToD systems for real use cases. PDF 11 2021
The AI Doctor Is In: A Survey of Task-Oriented Dialogue Systems for Healthcare Applications Task-oriented dialogue systems in healthcare are increasingly common and have been characterized by diverse architectures and objectives. Although they have been surveyed in the medical community from a non-technical perspective, a systematic review from a rigorous computational perspective remains noticeably absent. This has resulted in limited knowledge of important implementation and replicability details, slowing the pace of innovation. To fill this gap, we investigated an initial pool of 4070 papers from well-known computer science, natural language processing, and artificial intelligence venues, identifying 70 papers discussing the system-level implementation of task-oriented dialogue systems for healthcare applications. We comprehensively reviewed these papers, and present our key findings including identified gaps and corresponding recommendations. PDF 11 2021
Mix and Match: Learning-free Controllable Text Generation using Energy Language Models Due to the unidirectional nature of prevalent autoregressive generation models, recent work on controlled generation based on global text attributes has either required attribute-based fine-tuning of the base language model, or restricted the parametrization of the attribute prediction model to be compatible with the base LM. In this work, we propose Mix and Match LM, a global score-based alternative to controllable text generation that combines arbitrary pretrained blackbox models for the desired attributes without involving any fine-tuning or structural assumptions about the blackbox models. We interpret the task of controllable generation as drawing samples from an energy-based model whose energy values are a linear combination of scores from blackbox models that are separately responsible for fluency, the control attribute, and faithfulness to any conditioning context. We use a Metropolis-Hastings sampling scheme to sample from this energy-based model using bidirectional context and global attribute features. We validate the effectiveness of our approach on various conditional generation and style transfer tasks by outperforming recently proposed methods that involve extra training, finetuning, or restrictive assumptions over the form of models. PDF 11 2021
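The sampling scheme described here can be sketched generically; the snippet below assumes a symmetric proposal (e.g. re-filling a single masked position with a bidirectional LM) so the Metropolis-Hastings acceptance ratio reduces to an energy difference, and energy_fn/propose_fn are hypothetical stand-ins for the weighted blackbox scorers.

import math
import random

def metropolis_hastings(init_tokens, energy_fn, propose_fn, steps=200):
    """Sample from p(x) proportional to exp(-E(x)) with a Metropolis-Hastings loop.

    energy_fn(x): a linear combination of blackbox scores (fluency, control
    attribute, faithfulness), lower is better.
    propose_fn(x): returns a perturbed candidate sequence; assumed symmetric
    here, so the proposal ratio cancels out of the acceptance probability.
    """
    x, e = list(init_tokens), energy_fn(init_tokens)
    for _ in range(steps):
        cand = propose_fn(x)
        e_cand = energy_fn(cand)
        # Accept with probability min(1, exp(e - e_cand)).
        if math.log(random.random() + 1e-12) < e - e_cand:
            x, e = cand, e_cand
    return x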
Revisiting Transformer-based Models for Long Document Classification The recent literature in text classification is biased towards short text sequences (e.g., sentences or paragraphs). In real-world applications, multi-page multi-paragraph documents are common and they cannot be efficiently encoded by vanilla Transformer-based models. We compare different long document classification approaches that aim to mitigate the computational overhead of vanilla transformers to encode much longer text, namely sparse attention and hierarchical encoding methods. We examine several aspects of sparse attention (e.g., size of attention window, use of global attention) and hierarchical (e.g., document splitting strategy) transformers on two different datasets, and we derive practical advice on applying Transformer-based models to long document classification tasks. We find that, if applied properly, Transformer-based models can outperform former state-of-the-art CNN-based models on MIMIC-III, a challenging dataset from the clinical domain. PDF 11 2021
Unsupervised Open-Domain Question Answering with Higher Answerability Open-domain Question Answering (ODQA) has achieved significant results in the supervised setting. However, data annotation is hard to scale to the huge demands of the open domain. Though unsupervised QA and unsupervised Machine Reading Comprehension (MRC) have been explored to some extent, unsupervised ODQA has, to the best of our knowledge, not been touched. This paper thus pioneers the work of unsupervised ODQA by formally introducing the task and proposing a series of key data construction methods. Our exploration in this work shows, encouragingly, that unsupervised ODQA can reach up to 86% of the performance of supervised approaches. PDF 11 2021
When did you become so smart, oh wise one?! Sarcasm Explanation in Multi-modal Multi-party Dialogues Indirect speech such as sarcasm achieves a constellation of discourse goals in human communication. While the indirectness of figurative language warrants speakers to achieve certain pragmatic goals, it is challenging for AI agents to comprehend such idiosyncrasies of human communication. Though sarcasm identification has been a well-explored topic in dialogue analysis, for conversational systems to truly grasp a conversation's innate meaning and generate appropriate responses, simply detecting sarcasm is not enough; it is vital to explain its underlying sarcastic connotation to capture its true essence. In this work, we study the discourse structure of sarcastic conversations and propose a novel task -- Sarcasm Explanation in Dialogue (SED). Set in a multimodal and code-mixed setting, the task aims to generate natural language explanations of satirical conversations. To this end, we curate WITS, a new dataset to support our task. We propose MAF (Modality Aware Fusion), a multimodal context-aware attention and global information fusion module to capture multimodality and use it to benchmark WITS. The proposed attention module surpasses the traditional multimodal fusion baselines and reports the best performance on almost all metrics. Lastly, we carry out detailed analysis both quantitatively and qualitatively. PDF 11 2021
Are All the Datasets in Benchmark Necessary? A Pilot Study of Dataset Evaluation for Text Classification In this paper, we ask whether all the datasets in a benchmark are necessary. We approach this by first characterizing the distinguishability of datasets when comparing different systems. Experiments on 9 datasets and 36 systems show that several existing benchmark datasets contribute little to discriminating top-scoring systems, while less-used datasets exhibit impressive discriminative power. Taking the text classification task as a case study, we further investigate the possibility of predicting a dataset's discrimination from its properties (e.g., average sentence length). Our preliminary experiments promisingly show that, given a sufficient number of training experimental records, a meaningful predictor can be learned to estimate dataset discrimination over unseen datasets. We release all related code at GitHub (https://github.com/annonnlp-demo/acl-V2) and a new benchmark dataset for text classification based on our observations. PDF 11 2021
GradMask: Gradient-Guided Token Masking for Textual Adversarial Example Detection We present a simple model-agnostic textual adversarial example detection scheme called GradMask. It uses gradient signals to detect adversarially perturbed tokens in an input sequence and occludes such tokens by a masking process. GradMask provides several advantages over existing methods, including improved detection performance and a weak interpretation of its decisions. Extensive evaluations on widely adopted natural language processing benchmark datasets demonstrate the efficiency and effectiveness of GradMask. PDF 11 2021
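A bare-bones version of the gradient-signal step might be written as below; the inputs_embeds keyword assumes a HuggingFace-style model exposing .logits, and scoring tokens by the L2 norm of their embedding gradients is one plausible reading of the scheme, not necessarily the paper's exact recipe.

import torch

def gradmask_scores(model, embeddings, labels, loss_fn):
    """Score tokens by the gradient of the loss w.r.t. their input embeddings.

    embeddings: (batch, seq, hidden) input embeddings. Tokens with the
    largest scores are candidates for occlusion via a mask token, after
    which the prediction change can flag adversarial perturbation.
    """
    embeddings = embeddings.detach().requires_grad_(True)
    logits = model(inputs_embeds=embeddings).logits
    loss = loss_fn(logits, labels)
    (grads,) = torch.autograd.grad(loss, embeddings)
    return grads.norm(dim=-1)  # (batch, seq): one L2 norm per token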
MSAMSum: Towards Benchmarking Multi-lingual Dialogue Summarization Dialogue summarization, which helps users capture salient information from various types of dialogues, has received much attention recently. However, current works mainly focus on English dialogue summarization, leaving other languages underexplored. Therefore, we present a multi-lingual dialogue summarization dataset, namely MSAMSum, which covers dialogue-summary pairs in six languages. Specifically, we derive MSAMSum from the standard SAMSum using sophisticated translation techniques and further employ two methods to ensure overall translation quality and summary factual consistency. Given the proposed MSAMSum, we systematically set up five multi-lingual settings for this task, including a novel mix-lingual dialogue summarization setting. To illustrate the utility of our dataset, we benchmark various experiments with pre-trained models under different settings and report results in both supervised and zero-shot manners. We also discuss future work on this task to motivate future research. PDF 11 2021
AmbiPun: Generating Humorous Puns with Ambiguous Context Computational humor has garnered the interest of the natural language processing community due to its wide applications to real-world scenarios. One way to express humor is via the use of puns. A homographic pun plays on words that are spelled the same way but have different meanings. In this paper, we propose a simple yet effective way to generate pun sentences that does not require any pun sentences to train on. Our approach is inspired by humor theories holding that ambiguity comes from the context rather than the pun word itself. Given a pair of definitions of a pun word, our model first produces a list of related concepts through a reverse dictionary. We then utilize one-shot GPT3 to generate context words, and then generate punning sentences that incorporate context words from both worlds. We also investigate how the position of the pun word in the sentence influences the generated results. We compare our proposed AmbiPun with well-crafted baselines. Human evaluation shows that our method successfully generates puns 52% of the time, outperforming the state-of-the-art model by a large margin. PDF 11 2021
Pinyin-bert: A new solution to Chinese pinyin to character conversion task The Pinyin-to-Character conversion (P2C) task is the key task of the Input Method Engine (IME) in commercial input software for Asian languages such as Chinese, Japanese, and Thai. The dominant technique is the N-gram language model together with smoothing techniques. However, the N-gram model's low capacity limits its performance. Following the trend of deep learning, this paper chooses the powerful BERT network architecture and proposes Pinyin-bert to solve the P2C task, which achieves a substantial performance improvement over the N-gram model. Furthermore, we combine Pinyin-bert with the N-gram model under the Markov model framework and improve performance further. Lastly, we design a way to incorporate an external lexicon into Pinyin-bert so as to adapt to out-of-domain input. PDF 11 2021
What Makes The Story Forward? Inferring Commonsense Explanations as Prompts for Future Event Generation Future Event Generation (FEG) aims to generate fluent and reasonable future event descriptions given preceding events. It requires not only fluent text generation but also commonsense reasoning to maintain the coherence of the entire event story. However, existing FEG methods are easily trapped into repeated or generic events without imposing any logical constraint on the generation process. In this paper, we propose a novel explainable FEG framework that consists of a commonsense inference model (IM) and an event generation model (GM). The IM, which is pre-trained on the commonsense knowledge graph ATOMIC, learns to interpret the preceding events and conducts commonsense reasoning to reveal the character’s psychology, such as intent, reaction and needs, as latent variables. The GM further takes the commonsense knowledge as prompts to guide and enforce the generation of logically coherent future events. As a unique merit, the commonsense prompts can be further decoded into textual descriptions, yielding explanations for the future event. Automatic and human evaluation demonstrate that our approach can generate more coherent, specific, and logical future events than the strong baselines. All the programs and resources will be made public upon acceptance. PDF 11 2021
Rethinking News Text Classification from a Timeliness Perspective under the Pre-training and Fine-tuning Paradigm Pre-trained language models (PLMs) have made significant progress in NLP. News text classification is one of the most fundamental tasks in NLP, and various existing works have shown that models fine-tuned on PLMs can reach accuracies of up to 98% on the target task. It seems that this task has been well addressed. However, we discover that news timeliness can have a massive impact on news text classification, dropping results by nearly 20 percentage points. In this paper, we define timeliness issues in news classification and design experiments to measure their influence. Moreover, we investigate several methods to recognize and replace obsolete vocabulary. However, the results show that it is difficult to eliminate the impact of news timeliness from the words' perspective. In addition, we propose a set of large-scale, time-sensitive news datasets to facilitate the study of this problem. PDF 11 2021
MRCLens: an MRC Dataset Bias Detection Toolkit Many recent neural models have shown remarkable empirical results in Machine Reading Comprehension, but evidence suggests that the models sometimes take advantage of dataset biases to predict and fail to generalize on out-of-sample data. While many other approaches have been proposed to address this issue from the computation perspective, such as new architectures or training procedures, we believe a method that allows researchers to discover biases and adjust the data or the models at an earlier stage will be beneficial. Thus, we introduce MRCLens, a toolkit which detects whether biases exist before users train the full model. For the convenience of introducing the toolkit, we also provide a categorization of common biases in MRC. PDF 11 2021
QubitE: Qubit Embedding for Knowledge Graph Completion Knowledge graph embeddings (KGEs) learn low-dimensional representations of entities and relations to predict missing facts based on existing ones. Quantum-based KGEs utilise variational quantum circuits for link prediction and score triples via the probability distribution of measuring the qubit states. However, the measurement used for training variational quantum circuits is not necessarily the best one. Besides, current quantum-based methods ignore theoretical analyses, which are essential for understanding model performance and for downstream tasks such as reasoning, path query answering, complex query answering, etc. To address the measurement issue and bridge the theory gap, we propose QubitE, in which the score of a triple is defined as the similarity between qubit states. Here, our measurements are viewed as kernel methods that separate the qubit states while preserving quantum advantages. Furthermore, we show that (1) QubitE is fully expressive; (2) QubitE can infer various relation patterns including symmetry/antisymmetry, inversion, and commutative/non-commutative composition; (3) QubitE subsumes several existing approaches, e.g., DistMult, pRotatE, RotatE, TransE and ComplEx; (4) QubitE has linear space complexity and linear time complexity. Experimental results on multiple benchmark knowledge graphs demonstrate that QubitE can achieve results comparable to state-of-the-art classical models. PDF 11 2021
All in One: A Multi-Task Learning for Emoji, Sentiment and Emotion Analysis in Code-Mixed Text Code-mixed language and emojis are extensively used on social media to express opinions. In this paper, we propose a novel task that focuses on suggesting appropriate emojis in English-Hindi code-mixed sentences. We aim to exploit the dependency between emotion, sentiment, and emojis to build an end-to-end framework that can simultaneously identify the emotion, sentiment and emojis in code-mixed sentences. We introduce the Code-Mixed Emoji, Emotion and Sentiment aware Dataset (CMEESD), which is an extension of SemEval 2020 Task 9. We establish strong baselines to predict the correct emojis by simultaneously identifying the emotion and sentiment of a given tweet. The sentiment and emotion predictions in turn help with appropriate emoji classification. Empirical results on the CMEESD dataset demonstrate that the proposed multi-task framework yields better performance than the single-task framework. PDF 11 2021
UniSAr: A Unified Structure-Aware Autoregressive Language Model for Text-to-SQL Existing text-to-SQL semantic parsers are typically designed for particular settings, such as handling queries that span multiple tables, domains or turns, which makes them ineffective when applied to different settings. We present UniSAr (Unified Structure-Aware Autoregressive Language Model), which benefits from directly using an off-the-shelf language model architecture and demonstrates consistently high performance under different settings. Specifically, UniSAr extends existing autoregressive language models to incorporate three non-invasive extensions to make them structure-aware: (1) adding structure mark to encode database schema, conversation context, and their relationships; (2) constrained decoding to decode well structured SQL for a given database schema; and (3) SQL completion to complete potential missing JOIN relationships in SQL based on database schema. On seven well-known text-to-SQL datasets including multi-domain, multi-table and multi-turn, UniSAr demonstrates highly comparable or better performance to the most advanced specifically-designed text-to-SQL models. Importantly, our UniSAr is non-invasive, such that other core model advances in text-to-SQL can also adopt our extensions to further enhance performance. PDF 11 2021
It is AI’s Turn to Ask Human a Question: Question and Answer Pair Generation for Children Storybooks in FairytaleQA Dataset Existing question answering (QA) techniques are created mainly to answer questions asked by humans. But in educational applications, teachers and parents sometimes may not know what questions they should ask to best help children develop their narrative understanding abilities. We design an automated question-answer generation (QAG) system for educational purposes: given a storybook at the kindergarten to eighth-grade level, our system can automatically produce QA pairs that are capable of testing a variety of student comprehension skills. Using a new QA dataset, FairytaleQA, that has 278 child-friendly storybooks with 10,580 QA pairs labeled by experts, we design a novel QAG system architecture to generate QA pairs. Automatic and human evaluations show that our model outperforms state-of-the-art QAG systems. On top of our QAG system, we also build an interactive story-telling application for future real-world deployment. PDF 11 2021
Continual Prompt Tuning for Dialog State Tracking A desirable dialog system should be able to continually learn new skills without forgetting old ones, and thereby adapt to new domains or tasks in its life cycle. However, continually training a model often leads to a well-known catastrophic forgetting issue. In this paper, we present Continual Prompt Tuning, a parameter-efficient framework that not only avoids forgetting but also enables knowledge transfer between tasks. To avoid forgetting, we only learn and store a few prompt tokens' embeddings for each task while freezing the backbone pre-trained model. To achieve bi-directional knowledge transfer among tasks, we propose several techniques (continual prompt initialization, query fusion, and memory replay) to transfer knowledge from preceding tasks and a memory-guided technique to transfer knowledge from subsequent tasks. Extensive experiments demonstrate the effectiveness and efficiency of our proposed method on continual learning for dialog state tracking, compared with state-of-the-art baselines. PDF 11 2021
KESA: A Knowledge Enhanced Approach For Sentiment Analysis Though some recent works focus on injecting sentiment knowledge into pre-trained language models, they usually design mask and reconstruction tasks in the post-training phase. In this paper, we aim to benefit from sentiment knowledge in a lighter way. To achieve this goal, we study sentence-level sentiment analysis and, correspondingly, propose two sentiment-aware auxiliary tasks named sentiment word cloze and conditional sentiment prediction. The first task learns to select the correct sentiment words within the input, given the overall sentiment polarity as prior knowledge. Conversely, the second task predicts the overall sentiment polarity given the sentiment polarity of a word as prior knowledge. In addition, two kinds of label combination methods are investigated to unify multiple types of labels in each task. We argue that more prior information can help models learn more profound semantic representations, and we verify this hypothesis in a straightforward implementation. The experimental results demonstrate that our approach consistently outperforms pre-trained models and is additive to existing knowledge-enhanced post-trained models. PDF 11 2021
ComSearch: Equation Searching with Combinatorial Mathematics for Solving Math Word Problems with Weak Supervision Previous studies have introduced a weakly-supervised paradigm for solving math word problems that requires only the answer value annotation. While these methods search for correct value equation candidates as pseudo labels, they search within a narrow sub-space of the enormous equation space. To address this problem, we propose a novel search algorithm based on combinatorial mathematics, ComSearch, which can compress the search space by excluding mathematically equivalent equations. The compression allows the searching algorithm to enumerate all possible equations and obtain high-quality data. Experimental results show that our method achieves state-of-the-art results, especially for problems with more variables. PDF 11 2021
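The core idea of pruning equivalent equations from the search can be illustrated with a toy brute-force search; the left-nested enumeration, the use of sympy for canonicalization, and the constant-only setting are assumptions for illustration, while the paper's combinatorial compression is far more efficient and general.

import itertools
import sympy

def search_equations(numbers, answer, ops="+-*/"):
    """Enumerate left-nested candidate equations over a problem's numbers,
    keeping one representative per mathematical-equivalence class."""
    seen, hits = set(), []
    for perm in itertools.permutations(numbers):
        for op_combo in itertools.product(ops, repeat=len(numbers) - 1):
            expr = str(perm[0])
            for n, op in zip(perm[1:], op_combo):
                expr = f"({expr} {op} {n})"
            canon = sympy.sympify(expr)  # canonical form merges equivalent equations
            if canon in seen:
                continue  # an equivalent equation was already enumerated
            seen.add(canon)
            if canon.is_number and canon.is_finite and abs(float(canon) - answer) < 1e-6:
                hits.append(expr)
    return hits

# Candidate pseudo-label equations over {2, 3, 4} evaluating to 14, e.g. ((3 + 4) * 2).
print(search_equations([2, 3, 4], 14))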
"Is Whole Word Masking Always Better for Chinese BERT?": Probing on Chinese Grammatical Error Correction Whole word masking (WWM), which masks all subwords corresponding to a word at once, makes a better English BERT model. For the Chinese language, however, there is no subword because each token is an atomic character. The meaning of a word in Chinese is different in that a word is a compositional unit consisting of multiple characters. Such difference motivates us to investigate whether WWM leads to better context understanding ability for Chinese BERT. To achieve this, we introduce two probing tasks related to grammatical error correction and ask pretrained models to revise or insert tokens in a masked language modeling manner. We construct a dataset including labels for 19,075 tokens in 10,448 sentences. We train three Chinese BERT models with standard character-level masking (CLM), WWM, and a combination of CLM and WWM, respectively. Our major findings are as follows: First, when one character needs to be inserted or replaced, the model trained with CLM performs the best. Second, when more than one character needs to be handled, WWM is the key to better performance. Finally, when being fine-tuned on sentence-level downstream tasks, models trained with different masking strategies perform comparably. PDF 11 2021
PESTO: A Post-User Fusion Network for Rumour Detection on Social Media Rumour detection on social media is an important topic due to the challenges of misinformation propagation and slow verification of misleading information. Most previous work focuses on the response posts on social media, ignoring the useful characteristics of involved users and their relations. In this paper, we propose a novel framework, Post-User Fusion Network (PESTO), which models the patterns of rumours from both post diffusion and user social networks. Specifically, we propose a novel Chronologically-masked Transformer architecture to model both the temporal sequence and the diffusion structure of rumours, and apply a Relational Graph Convolutional Network to model the social relations of involved users, with a fusion network based on a self-attention mechanism to incorporate the two aspects. Additionally, two data augmentation techniques are leveraged to improve the robustness and accuracy of our models. Empirical results on several benchmarks show the superiority of the proposed method. PDF 11 2021
Sentence-aware Contrastive Learning for Open-Domain Passage Retrieval Training dense passage representations via contrastive learning has been shown effective for Open-Domain Passage Retrieval (ODPR). Existing studies focus on further optimization by improving the negative sampling strategy or adding extra pretraining. However, these studies overlook internal representation conflicts within a passage that arise from improper modeling granularity. This work thus presents a refined model built on a smaller granularity, contextual sentences, to alleviate these conflicts. In detail, we introduce an in-passage negative sampling strategy to encourage a diverse generation of sentence representations within the same passage. Experiments on three benchmark datasets verify the efficacy of our method, especially on datasets where conflicts are severe. Extensive experiments further present good transferability of our method across datasets. PDF 11 2021
Human Language Modeling Natural language is generated by people, yet traditional language modeling views words or documents as if generated independently. Here, we propose human language modeling (HuLM), a hierarchical extension to the language modeling problem whereby a human level exists to connect sequences of documents (e.g. social media messages) and capture the notion that human language is moderated by changing human states. We introduce HaRT, a large-scale transformer model for solving HuLM, pre-trained on approximately 100,000 social media users, and demonstrate its effectiveness in terms of both language modeling (perplexity) for social media and fine-tuning for 4 downstream tasks spanning the document and user levels. Results on all tasks meet or surpass the current state-of-the-art. PDF 11 2021
Automatic Song Translation for Tonal Languages This paper addresses automatic song translation (AST) for tonal languages and the unique challenge of aligning words' tones with the melody of a song in addition to conveying the original meaning. We propose three criteria for effective AST---preserving semantics, singability and intelligibility---and develop objectives for these criteria. We develop a new benchmark for English--Mandarin song translation and develop an unsupervised AST system, the Guided AliGnment for Automatic Song Translation (GagaST), which combines pre-training with three decoding constraints. Both automatic and human evaluations show GagaST successfully balances semantics and singability. PDF 11 2021
Right for the Right Reason: Evidence Extraction for Trustworthy Tabular Reasoning When pre-trained contextualized embeddings-based models developed for unstructured data are adapted for structured tabular data, they perform admirably. However, recent probing studies show that these models use spurious correlations and often ignore or focus on wrong evidence to predict labels. To study this issue, we introduce the task of Trustworthy Tabular Reasoning, where a model needs to extract evidence to be used for reasoning, in addition to predicting the label. As a case study, we propose a two-stage sequential prediction approach, which includes an evidence extraction and an inference stage. To begin, we crowdsource evidence row labels and develop several unsupervised and supervised evidence extraction strategies for InfoTabS, a tabular NLI benchmark. Our evidence extraction strategy outperforms earlier baselines. On the downstream tabular inference task, using the automatically extracted evidence as the only premise, our approach outperforms prior benchmarks. PDF 11 2021
Data Contamination: From Memorization to Exploitation It is common nowadays to train NLP models on massive web-based datasets. Previous works have shown that these datasets sometimes contain downstream test sets, a phenomenon typically referred to as "data contamination". It is not clear however to what extent models exploit the contaminated data for downstream tasks. In this paper we present a principled method to study this question. We pretrain BERT models on joint corpora of Wikipedia and labeled downstream datasets, and fine-tune them on the relevant task. Comparing performance between samples seen and unseen during pretraining enables us to define and quantify levels of memorization and exploitation. Our experiments with two models and three downstream tasks indicate that exploitation exists in some cases, but in others the models memorize the contaminated data, but do not exploit it. We show these two measures are affected by different factors such as contaminated data occurrences, model size, and random seeds. Our results highlight the importance of analyzing massive web-scale datasets to verify that progress in NLP is obtained by better language understanding and not better data exploitation. PDF 11 2021
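The seen-versus-unseen comparison at the core of this method can be summarized in a few lines; per-example scores, the boolean contamination mask, and the interpretation comments below are illustrative assumptions rather than the paper's exact definitions.

from statistics import mean

def seen_unseen_gap(per_example_scores, seen_mask):
    """Average score on pretraining-seen vs unseen test examples.

    A positive gap on the downstream task metric suggests exploitation;
    a positive gap on a memorization probe (e.g. recovering masked tokens
    from the contaminated text) suggests memorization, which may occur
    without any downstream exploitation.
    """
    seen = [s for s, m in zip(per_example_scores, seen_mask) if m]
    unseen = [s for s, m in zip(per_example_scores, seen_mask) if not m]
    return mean(seen) - mean(unseen)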
The patient is more dead than alive: exploring the current state of the multi-document summarisation of the biomedical literature Although multi-document summarisation (MDS) of the biomedical literature is a highly valuable task that has recently attracted substantial interest, evaluation of the quality of biomedical summaries lacks consistency and transparency. In this paper, we examine the summaries generated by two current models in order to understand the deficiencies of existing evaluation approaches in the context of the challenges that arise in the MDS task. Based on this analysis, we propose a new approach to human evaluation and identify several challenges that must be overcome to develop effective biomedical MDS systems. PDF 11 2021
LipKey: A Large-Scale News Dataset with Abstractive Keyphrases and Their Benefits for Summarization Summaries, keyphrases, and titles are different ways of concisely capturing the content of a document. While most previous work has addressed them separately, in this work we jointly use the three elements in the context of document summarization, both via multi-task training and by providing them as joint structured inputs. We release LipKey, the largest news corpus with human-written summaries, titles, and keyphrases, which is also the first large-scale Indonesian keyphrase dataset. We find that including keyphrases and titles as additional context to the source document improves transformer-based summarization models. PDF 11 2021
Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval Existing research on image-text retrieval mainly relies on sentence-level supervision to distinguish matched and mismatched sentences for a query image. However, semantic mismatch between an image and sentences usually happens at a finer grain, i.e., the phrase level. In this paper, we explore introducing additional phrase-level supervision for the better identification of mismatched units in the text. In practice, multi-grained semantic labels are automatically constructed for a query image at both the sentence level and phrase level. We construct text scene graphs for the matched sentences and extract entities and triples as the phrase-level labels. In order to integrate both sentence-level and phrase-level supervision, we propose the Semantic Structure Aware Multimodal Transformer (SSAMT) for multi-modal representation learning. Inside the SSAMT, we utilize different kinds of attention mechanisms to enforce interactions of multi-grained semantic units on both the vision and language sides. For training, we propose multi-scale matching losses from both global and local perspectives, and penalize mismatched phrases. Experimental results on MS-COCO and Flickr30K show the effectiveness of our approach compared to some state-of-the-art models. PDF 11 2021
Meta-learning via Language Model In-context Tuning The goal of meta-learning is to learn to adapt to a new task with only a few labeled examples. Inspired by the recent progress in large language models, we propose in-context tuning (ICT), which recasts task adaptation and prediction as a simple sequence prediction problem: to form the input sequence, we concatenate the task instruction, labeled in-context examples, and the target input to predict; to meta-train the model to learn from in-context examples, we fine-tune a pre-trained language model (LM) to predict the target label given the input sequence on a collection of tasks. We benchmark our method on two collections of text classification tasks: LAMA and BinaryClfs. Compared to MAML which adapts the model through gradient descent, our method leverages the inductive bias of pre-trained LMs to perform pattern matching, and outperforms MAML by an absolute 6% average AUC-ROC score on BinaryClfs, gaining more advantage with increasing model size. Compared to non-fine-tuned in-context learning (i.e. prompting a raw LM), in-context tuning meta-trains the model to learn from in-context examples. On BinaryClfs, ICT improves the average AUC-ROC score by an absolute 10%, and reduces the variance due to example ordering by 6x and example choices by 2x. PDF 11 2021
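The sequence construction described here is simple enough to sketch directly; the prompt template (the "Input:"/"Label:" fields and the separator) is an assumed format for illustration, not necessarily the one used in the paper.

def build_ict_sequence(instruction, support_examples, query_input, sep="\n"):
    """Concatenate task instruction, labeled in-context examples, and the query.

    During meta-training, an LM is fine-tuned to predict the query's label as
    the continuation of this sequence, so learning from in-context examples
    is itself what gets trained.
    """
    parts = [instruction]
    for x, y in support_examples:
        parts.append(f"Input: {x}{sep}Label: {y}")
    parts.append(f"Input: {query_input}{sep}Label:")
    return sep.join(parts)

seq = build_ict_sequence(
    "Classify the review sentiment as positive or negative.",
    [("Great movie!", "positive"), ("Terrible plot.", "negative")],
    "I loved every minute.",
)
print(seq)  # fine-tune so the token(s) after the final "Label:" are the answer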
Relevant CommonSense Subgraphs for "What if..." Procedural Reasoning This work deals with the challenge of learning causal reasoning over procedural text to answer "What if..." questions when external commonsense knowledge is required. We propose a novel multi-hop graph reasoning model to 1) efficiently extract a commonsense subgraph with the most relevant information from a large knowledge graph; 2) predict the causal answer by reasoning over the representations obtained from the commonsense subgraph and the contextual interactions between the questions and context. We evaluate our model on WIQA dataset and achieve state-of-the-art performance compared to the recent models. PDF 11 2021
End-To-End Sign Language Translation via Multitask Learning Sign language translation (SLT) is usually seen as a two-step process of continuous sign language recognition (CSLR) and gloss-to-text translation. We propose a novel, Transformer-based architecture to jointly perform CSLR and sign translation in an end-to-end fashion. We extend the ordinary Transformer decoder with two channels to support multitasking, where each channel is devoted to solving a particular problem. To control the memory footprint of our model, channels are designed to share most of their parameters with each other. However, each channel still has a dedicated set of parameters which is fine-tuned with respect to the channel's task. In order to evaluate the proposed architecture, we focus on translating German signs into English sequences and use the RWTH-PHOENIX-Weather 2014T corpus in our experiments. Evaluation results indicate that the mixture of information provided by the multitask decoder was successful and enabled us to achieve superior performance in comparison to other SLT models. PDF 11 2021
Identifying Semantically Difficult Samples to Improve Text Classification In this paper, we investigate the effect of addressing difficult samples from a given text dataset on the downstream text classification task. We define difficult samples as being non-obvious cases for text classification by analysing them in the semantic embedding space; specifically - (i) semantically similar samples that belong to different classes and (ii) semantically dissimilar samples that belong to the same class. We propose a penalty function to measure the overall difficulty score of every sample in the dataset. We conduct exhaustive experiments on 13 standard datasets to show a consistent improvement of up to 9% and discuss qualitative results to show effectiveness of our approach in identifying difficult samples for a text classification model. PDF 11 2021
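One plausible instantiation of such a penalty function, scoring a sample as difficult when it sits close to other classes or far from its own, is sketched below in NumPy; the exact formula is an assumption for illustration, not the paper's definition.

import numpy as np

def difficulty_scores(embeddings, labels):
    """Per-sample difficulty from semantic neighborhoods (illustrative).

    A sample scores higher when it is (i) close to samples of other classes
    or (ii) far from samples of its own class, measured by cosine similarity
    in the embedding space. Classes with a single sample yield NaN for the
    within-class term.
    """
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = X @ X.T                               # pairwise cosine similarities
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)               # exclude self-pairs
    np.fill_diagonal(sim, np.nan)
    cross_sim = np.nanmean(np.where(~same, sim, np.nan), axis=1)
    within_sim = np.nanmean(np.where(same, sim, np.nan), axis=1)
    return cross_sim - within_sim               # higher = more difficult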
LAGr: Label Aligned Graphs for Better Systematic Generalization in Semantic Parsing Semantic parsing is the task of producing structured meaning representations for natural language sentences. Recent research has pointed out that the commonly-used sequence-to-sequence (seq2seq) semantic parsers struggle to generalize systematically, i.e. to handle examples that require recombining known knowledge in novel settings. In this work, we show that better systematic generalization can be achieved by producing the meaning representation directly as a graph and not as a sequence. To this end we propose LAGr (Label Aligned Graphs), a general framework to produce semantic parses by independently predicting node and edge labels for a complete multi-layer input-aligned graph. The strongly-supervised LAGr algorithm requires aligned graphs as inputs, whereas weakly-supervised LAGr infers alignments for originally unaligned target graphs using approximate maximum-a-posteriori inference. Experiments demonstrate that LAGr achieves significant improvements in systematic generalization upon the baseline seq2seq parsers in both strongly- and weakly-supervised settings. PDF 11 2021
Towards Coding Social Science Datasets with Language Models Researchers often rely on humans to code (label, annotate, etc.) large sets of texts. This is a highly variable task and requires a great deal of time and resources. Efforts to automate this process have achieved human-level accuracies in some cases, but often rely on thousands of hand-labeled training examples, which makes them inapplicable to small-scale research studies and still costly for large ones. At the same time, it is well known that language models can classify text; in this work, we use OpenAI's GPT-3 as a synthetic coder, and explore what classic methodologies and metrics (such as intercoder reliability) look like in this new context. We find that GPT-3 is able to match the performance of typical human coders and frequently outperforms humans in terms of intercoder agreement across a variety of social science tasks, suggesting that language models could be a useful tool to the social sciences. PDF 11 2021
AlephBERT: Language Model Pre-training and Evaluation from Sub-Word to Sentence Level Large Pre-trained Language Models (PLMs) have become ubiquitous in the development of language understanding technology and lie at the heart of many artificial intelligence advances. While advances reported for English using PLMs are unprecedented, reported advances using PLMs for Hebrew are few and far between. The problem is twofold. First, so far, Hebrew resources for training large language models are not of the same magnitude as their English counterparts. Second, there are no accepted benchmarks to evaluate the progress of Hebrew PLMs on, in particular for sub-word (morphological) tasks. We aim to remedy both aspects. We present AlephBERT, a large PLM for Modern Hebrew, trained on a larger vocabulary and a larger dataset than any Hebrew PLM before it. Moreover, we introduce a novel language-agnostic architecture that can recover all of the sub-word morphological segments encoded in contextualized word embedding vectors. Based on this new morphological component, we offer a new PLM evaluation suite consisting of multiple tasks and benchmarks that cover sentence-level, word-level and sub-word-level analyses. On all tasks, AlephBERT obtains state-of-the-art results beyond contemporary Hebrew baselines. We make our AlephBERT model, the morphological extraction model, and the Hebrew evaluation suite publicly available, providing a single point of entry for assessing Hebrew PLMs. PDF 11 2021
Schema-Free Dependency Parsing via Sequence Generation Dependency parsing aims to extract syntactic or semantic dependency structures for sentences. Existing methods suffer from the drawbacks of lacking universality or relying heavily on auxiliary decoders. To remedy these drawbacks, we propose DPSG, which achieves universal and schema-free Dependency Parsing (DP) via Sequence Generation (SG), utilizing only a pre-trained language model (PLM) without any auxiliary structures or parsing algorithms. We first explore different serialization design strategies for converting parsing structures into sequences. Then we design dependency units and concatenate these units into the sequence for DPSG. Thanks to the high flexibility of sequence generation, our DPSG can achieve both syntactic DP and semantic DP using a single model. By concatenating a prefix to indicate the specific schema with the sequence, our DPSG can even accomplish multi-schema parsing. The effectiveness of our DPSG is demonstrated by experiments on widely used DP benchmarks, i.e., PTB, CODT, SDP15, and SemEval16. DPSG achieves comparable results with the first-tier methods on all the benchmarks, and even state-of-the-art (SOTA) performance on CODT and SemEval16. This paper demonstrates that our DPSG has the potential to be a new parsing paradigm. We will release our code upon acceptance. PDF 11 2021
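To make the serialization idea tangible, here is a toy sketch of turning a dependency parse into a flat sequence of units that a seq2seq PLM could generate; the unit format (dependent, relation, head) and the schema prefix are illustrative assumptions rather than the paper's exact design.

```python
# Hypothetical serialization of a dependency parse into "dependency units".
def serialize_parse(tokens, arcs, schema_prefix=None):
    """arcs: list of (head_idx, dep_idx, relation); head_idx -1 denotes root."""
    units = []
    for head, dep, rel in arcs:
        head_word = "ROOT" if head == -1 else tokens[head]
        units.append(f"( {tokens[dep]} , {rel} , {head_word} )")
    seq = " ".join(units)
    # a prefix can indicate which parsing schema the target sequence follows
    return f"{schema_prefix} : {seq}" if schema_prefix else seq

tokens = ["She", "eats", "apples"]
arcs = [(1, 0, "nsubj"), (-1, 1, "root"), (1, 2, "obj")]
print(serialize_parse(tokens, arcs, schema_prefix="PTB"))
```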
Some Languages are More Equal than Others: Probing Deeper into the Linguistic Disparity in the NLP World Linguistic disparity in the NLP world is a problem that has been widely acknowledged recently. However, the different facets of this problem, and the reasons behind this disparity, are seldom discussed within the NLP community. This paper provides a comprehensive analysis of the disparity that exists among the languages of the world. Using an existing language categorisation based on speaker population and vitality, we analyse the distribution of language data resources, the amount of NLP/CL research, inclusion in multilingual web-based platforms, and inclusion in pre-trained multilingual models. We show that many languages are not covered by these resources or platforms, and that even within languages belonging to the same language group there is wide disparity. We analyse the impact of the family, geographical location, and speaker population of languages, provide possible reasons for this disparity, and argue that a solution to this problem should be orchestrated by a wide alliance of stakeholders, of which ACL, as an association, should be a key partner. PDF 11 2021
Learning to Learn Recognising Biomedical Entities from Multiple Domains with Task Hardness Few-shot learning has been a big challenge for many classification tasks, where the final classifier is trained with only a few examples. This problem is amplified when we apply the few-shot setup to recognising named entities from different domains, i.e., few-shot domain adaptation for NER. In this paper, we present a simple yet effective MAML-based NER model that can effectively leverage task hardness information to improve the adaptability of the learnt model in the few-shot setting. Experimental results on biomedical datasets show that our model can achieve significant performance improvements over the recently published MetaNER model. PDF 11 2021
CLD²: Language Documentation Meets Natural Language Processing for Revitalising Endangered Languages Language revitalisation should not be understood as a direct outcome of language documentation, which is mainly focused on the creation of language repositories. Natural language processing (NLP) offers the potential to complement and exploit these repositories through the development of language technologies that may directly impact the vitality status of endangered languages. In this paper, we discuss the current state of the interaction between language documentation and computational linguistics, present a diagnosis of how the outputs of recent documentation projects for endangered languages are underutilised by the NLP community, and discuss how the situation could change from both the documentary linguistics and NLP perspectives. All this is introduced as a bridging paradigm called Computational Language Documentation and Development (CLD²). CLD² calls for (1) the inclusion of NLP-friendly annotated data as a deliverable of future language documentation projects; and (2) the exploitation of language documentation databases by the NLP community to promote the computerisation of endangered languages at a global scale. PDF 11 2021
KNN-BERT: Fine-Tuning Pre-Trained Models with KNN Classifier Pre-trained models are widely fine-tuned on downstream tasks with linear classifiers optimized by the cross-entropy loss, which might face robustness and stability problems. These problems can be alleviated by learning representations that focus on similarities within the same class and variance across different classes when making predictions. In this paper, we utilize a K-Nearest Neighbors (KNN) classifier in pre-trained model fine-tuning. For this KNN classifier, we introduce a supervised momentum contrastive learning framework to learn clustered representations of the supervised downstream tasks. Extensive experiments on text classification tasks and robustness tests show that by incorporating KNNs into the traditional fine-tuning process, we can obtain significant improvements in clean accuracy in both rich-resource and few-shot settings, and can improve robustness against adversarial attacks.\footnote{all codes will be available at https://github.com//} PDF 11 2021
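As a rough illustration of the KNN classification step described above, the sketch below runs a majority-vote KNN over pre-computed sentence embeddings; the contrastive training that shapes those embeddings, and any BERT-specific details, are outside its scope, and random vectors stand in for real [CLS] representations.

```python
# Minimal KNN over fine-tuned sentence embeddings (cosine similarity,
# majority vote); in KNN-BERT these would come from the fine-tuned encoder.
import numpy as np

def knn_predict(query, train_embs, train_labels, k=5):
    a = query / np.linalg.norm(query)
    b = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = b @ a                       # cosine similarity to each training point
    top = np.argsort(-sims)[:k]        # indices of the k nearest neighbours
    return np.bincount(train_labels[top]).argmax()  # majority vote

rng = np.random.default_rng(0)
train_embs = rng.normal(size=(100, 16))          # stand-ins for encoder outputs
train_labels = rng.integers(0, 2, size=100)
print(knn_predict(rng.normal(size=16), train_embs, train_labels))
```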
Fantastic Questions and Where to Find Them: FairytaleQA--An Authentic Dataset for Narrative Comprehension Question answering (QA) is a fundamental means to facilitate the assessment and training of narrative comprehension skills for both machines and young children, yet there is a scarcity of high-quality QA datasets carefully designed to serve this purpose. In particular, existing datasets rarely distinguish fine-grained reading skills, such as the understanding of varying narrative elements. Drawing from education domains where QA is also used to train children's narrative comprehension, we introduce FairytaleQA, a dataset focusing on narrative comprehension for kindergarten to eighth-grade students. Generated by educational experts based on an evidence-based theoretical framework, FairytaleQA consists of 10,580 explicit and implicit questions derived from 278 child-friendly stories, covering seven types of narrative elements/relations. Our dataset is valuable in two ways. First, with annotations on the particular reading skills required for answering each question, FairytaleQA decomposes otherwise scarce performance measures into multiple analysis dimensions that are consistent with human language-learning assessment. We ran existing QA models on our dataset and confirmed that this annotation helps assess models' fine-grained learning skills. Second, the dataset supports question generation (QG) in the education domain. Through benchmarking with QG models, we show that a QG model trained on FairytaleQA is capable of asking high-quality and more diverse questions. PDF 11 2021
Learn More from Less: Improving Conversational Recommender Systems via Contextual and Time-Aware Modeling Conversational Recommender Systems (CRS) aim to perform recommendations through interactive conversations. Prior work on CRS tends to incorporate more external knowledge to enhance performance. Given that too much extra knowledge makes it difficult to balance the sources and degrades generalizability, we propose to fully discover and extract the internal knowledge from the context. We capture both entity-level and contextual-level representations to jointly model user preferences for the recommendation, where time-aware attention is designed to emphasize recently appeared items in the entity-level representations. We further use the pre-trained BART to initialize the generation module to alleviate the data scarcity and enhance the context modeling. Experiments on two public CRS datasets show that our model achieves comparable performance with less external knowledge and generalizes well to other domains. Further analyses demonstrate the effectiveness of our model in different scenarios. PDF 11 2021
ICLEA: Interactive Contrastive Learning for Self-supervised Entity Alignment Self-supervised entity alignment (EA) aims to link equivalent entities across different knowledge graphs (KGs) without seed alignments. The current SOTA self-supervised EA method draws inspiration from contrastive learning, originally designed in computer vision based on instance discrimination and contrastive loss, and suffers from two shortcomings. First, it puts unidirectional emphasis on pushing sampled negative entities far away rather than pulling positively aligned pairs close, as is done in well-established supervised EA. Second, KGs contain rich side information (e.g., entity descriptions), and how to effectively leverage that information has not been adequately investigated in self-supervised EA. In this paper, we propose an interactive contrastive learning model for self-supervised EA. The model not only encodes the structures and semantics of entities (including entity name, entity description, and entity neighborhood), but also conducts cross-KG contrastive learning by building pseudo-aligned entity pairs. Experimental results show that our approach outperforms the previous best self-supervised results by a large margin (over 9% average improvement) and performs on par with previous SOTA supervised counterparts, demonstrating the effectiveness of interactive contrastive learning for self-supervised EA. PDF 11 2021
ROCK: A Causal Inference Framework for Reasoning about Commonsense Causality Commonsense causality reasoning (CCR) aims at identifying plausible causes and effects in natural language descriptions that are deemed reasonable by an average person. Although of great academic and practical interest, this problem is still shadowed by the lack of a well-posed theoretical framework; existing work usually relies on various notions of correlation and is susceptible to confounding co-occurrences. This paper articulates the central question of CCR and develops a novel framework, ROCK, to Reason O(A)bout Commonsense K(C)ausality based on classical causal inference principles. ROCK leverages temporal signals as incidental supervision, and makes use of temporal propensities that are analogous to propensity scores for balancing confounding effects. We implement a modular zero-shot pipeline which is effective and demonstrates good potential for CCR on various datasets. PDF 11 2021
Using Interactive Feedback to Improve the Accuracy and Explainability of Question Answering Systems Post-Deployment Most research on question answering focuses on the pre-deployment stage, i.e., building an accurate model for deployment. In this paper, we ask the question: can we improve QA systems further post-deployment based on user interactions? We focus on two kinds of improvements: 1) improving the QA system's performance itself, and 2) providing the model with the ability to explain the correctness or incorrectness of an answer. We collect a retrieval-based QA dataset, FeedbackQA, which contains interactive feedback from users. We collect this dataset by deploying a base QA system to crowdworkers who then engage with the system and provide feedback on the quality of its answers. The feedback contains both structured ratings and unstructured natural language explanations. We train a neural model with this feedback data that can generate explanations and re-score answer candidates. We show that using the feedback data improves the accuracy of the QA system and helps users make informed decisions about the correctness of answers.\footnote{We will make both the data and the code public.} PDF 11 2021
FRSUM: Towards Faithful Abstractive Summarization via Enhancing Factual Robustness Though current Seq2Seq summarization models are capable of generating fluent and grammatical summaries, they still suffer from the unfaithful generation problem. In this paper, we study the faithfulness of existing systems from the new perspective of factual robustness, which is the ability to correctly generate factual information in the presence of adversarial unfaithful information. We first define the measurement of a model's factual robustness as its success rate in defending against adversarial attacks when generating factual information. A factual robustness analysis of a wide range of current systems shows its good consistency with human judgments on faithfulness. Inspired by these findings, we propose to improve a model's faithfulness by enhancing its factual robustness. Specifically, we propose a novel training strategy, namely FRSUM, which teaches the model to defend against both explicit adversarial samples and implicit factual adversarial perturbations. Extensive automatic and human evaluation results show that FRSUM consistently improves the faithfulness of various Seq2Seq models, such as T5, BART and PEGASUS, and reduces up to 41\% of target errors in summaries. PDF 11 2021
Towards Fully Self-Supervised Learning of Knowledge from Unstructured Text Pre-trained language models (PLMs) like BERT have made significant progress in various downstream NLP tasks. However, recent works find that PLMs fall short in acquiring knowledge from unstructured text when asked to perform cloze-style tests. To understand the internal behavior of PLMs in retrieving knowledge, we first define knowledge-bearing (K-B) tokens and knowledge-free (K-F) tokens for unstructured text and manually label some samples. We then find that PLMs are more likely to predict incorrectly on K-B tokens and to pay less attention to those tokens inside the self-attention module. Based on these observations, we develop two solutions to help the model learn more knowledge from unstructured text in a fully self-supervised manner. Experiments on knowledge probing tasks show the effectiveness of the proposed methods. To our knowledge, we are the first to explore fully self-supervised learning of knowledge in continual pre-training. PDF 11 2021
Computational historical linguistics and language diversity in South Asia South Asia is home to a plethora of languages, most of which are severely lacking access to language technologies that have been developed with the maturity of NLP/CL. This linguistic diversity, however, also results in a research environment conducive to the study of comparative, contact, and historical linguistics---fields which necessitate the gathering of extensive data from many languages. We claim that data scatteredness (rather than scarcity) is the primary obstacle in the development of South Asian language technology, and suggest that the study of language history is uniquely aligned with surmounting this obstacle. We review recent developments in, and the intersection of, South Asian NLP and historical--comparative linguistics, explaining our current efforts in this area while also offering new paths towards breaking the data barrier. PDF 11 2021
Continual Few-shot Relation Learning via Embedding Space Regularization and Data Augmentation Existing continual relation learning (CRL) methods rely on plenty of labeled training data for learning a new task, which can be hard to acquire in real scenarios, as obtaining large and representative labeled data is often expensive and time-consuming. It is therefore necessary for the model to learn novel relational patterns with very few labeled data while avoiding catastrophic forgetting of previous task knowledge. In this paper, we formulate this challenging yet practical problem as continual few-shot relation learning (CFRL). Based on the finding that learning for new emerging few-shot tasks often results in feature distributions that are incompatible with previous tasks' learned distributions, we propose a novel method based on embedding space regularization and data augmentation. Our method generalizes to new few-shot tasks and avoids catastrophic forgetting of previous tasks by enforcing extra constraints on the relational embeddings and by adding extra relevant data in a self-supervised manner. With extensive experiments we demonstrate that our method can significantly outperform previous state-of-the-art methods in CFRL task settings. PDF 11 2021
Text Complexity And Linguistic Features: Is The Relationship Multilingual? Text complexity assessment is a challenging task requiring various linguistic aspects to be taken into consideration. A large number of studies have been introduced in this field. Nevertheless, as the methods and corpora are quite diverse, it may be hard to draw general conclusions about the effectiveness of linguistic information for evaluating text complexity. Moreover, the cross-lingual impact of different features on various datasets has not been investigated. We experimentally assessed seven commonly used feature types on six corpora for text complexity, employing four common machine learning models. We showed which feature types can significantly improve performance and analyzed their impact according to dataset characteristics, language, and origin. PDF 11 2021
Phrase-aware Unsupervised Constituency Parsing Recent studies have achieved inspiring success in unsupervised grammar induction using masked language modeling (MLM) as the proxy task. Despite their high accuracy in identifying low-level structures, prior arts tend to struggle in capturing high-level structures like clauses, since the MLM task usually only requires information from the local context. In this work, we revisit LM-based constituency parsing from a phrase-centered perspective. Inspired by the natural reading process of humans, we propose to regularize the parser with phrases extracted by an unsupervised phrase tagger to help the LM quickly manage low-level structures. For a better understanding of high-level structures, we propose a phrase-guided masking strategy for the LM to emphasize reconstructing non-phrase words. We show that the initial phrase regularization serves as an effective bootstrap, and phrase-guided masking improves the identification of high-level structures. Experiments on the public benchmark with two different backbone models demonstrate the effectiveness and generality of our method. PDF 11 2021
Weakly Supervised Medical Entity Extraction and Linking for Chief Complaints A chief complaint (CC) is the reason for a medical visit as stated in the patient's own words. It helps medical professionals quickly understand a patient's situation, and also serves as a short summary for medical text mining. However, chief complaint records are often entered in a variety of ways, resulting in wide variation in medical notation, which makes them difficult to standardize across different medical institutions for record keeping or text mining. In this study, we propose a weakly supervised method to automatically extract and link entities in chief complaints in the absence of human annotation. We first adopt a split-and-match algorithm to produce weak annotations, including entity mention spans and class labels, on 1.2 million real-world, de-identified and IRB-approved chief complaint records. Then we train a BERT-based model with the generated weak labels to locate entity mentions in chief complaint text and link them to a pre-defined ontology. We conducted extensive experiments, and the results showed that our Weakly Supervised Entity Extraction and Linking (WeSEEL) method produces superior performance over previous methods without any human annotation. PDF 11 2021
Data Augmentation with Sentence Recombination Method for Semi-supervised Text Classification Because obtaining enough labeled data requires a large amount of time and expertise, semi-supervised learning, which utilizes both labeled and unlabeled data, has received much attention. In this paper, we present SeRe: a Sentence Recombination method to augment training data for semi-supervised text classification. SeRe makes full use of the similarities between sentences in different samples through a grouping and recombining process to form rich and varied training data. SeRe generates data from three combinations: labeled, unlabeled, and mixed data. Meanwhile, SeRe uses the self-training framework to iteratively improve the quality of the augmented training data. We apply SeRe to text classification tasks and conduct extensive experiments on four publicly available benchmarks. Experimental results show that SeRe achieves new state-of-the-art performance on all of them. PDF 11 2021
Cross-Task Generalization via Natural Language Crowdsourcing Instructions Humans (e.g., crowdworkers) have a remarkable ability to solve different tasks by simply reading the textual instructions that define them and looking at a few examples. Despite the success of conventional supervised learning on individual datasets, such models often struggle with generalization across tasks (e.g., a question-answering system cannot solve classification tasks). A long-standing challenge in AI is to build a model that learns a new task by understanding the human-readable instructions that define it. To study this, we introduce NATURAL INSTRUCTIONS, a dataset of 61 distinct tasks, their human-authored instructions, and 193k task instances (input-output pairs). The instructions are obtained from the crowdsourcing instructions used to create existing NLP datasets and mapped to a unified schema. Using this meta-dataset, we measure cross-task generalization by training models on seen tasks and measuring generalization to the remaining unseen ones. We adopt generative pre-trained language models to encode task-specific instructions along with the input and generate the task output. Our results indicate that models benefit from instructions when evaluated in terms of generalization to unseen tasks (19% better for models utilizing instructions). These models, however, are far behind an estimated performance upper bound, indicating significant room for more progress in this direction. PDF 11 2021
Tailor: Generating and Perturbing Text with Semantic Controls Controlled text perturbation is useful for evaluating model generalizability and improving model robustness to dataset artifacts. However, current techniques rely on training a perturbation model for every targeted attribute, which is expensive and hard to generalize. We present Tailor, a semantically-controlled text generation system. Tailor builds on a pretrained seq2seq model and produces textual outputs conditioned on $\textbf{control codes}$ derived from semantic representations. We craft a set of operations to modify the control codes, which in turn steer generation towards targeted attributes. These operations can be further composed into higher-level ones, allowing for flexible perturbation strategies. Tailor can be applied in various scenarios. We use it to automatically create high-quality contrast sets for four distinct natural language processing (NLP) tasks. These contrast sets contain fewer spurious biases and are complementary to manually annotated ones in terms of lexical diversity. We show that Tailor helps improve model generalization through data augmentation, with a 5.8-point gain on an NLI challenge set, by perturbing just $\sim2\%$ of training data. PDF 11 2021
Modeling Intensification for Signed Language Generation: A Computational Approach End-to-end sign language generation models do not accurately represent the prosody of the languages. This lack of temporal and spatial variation in generated signs leads to poor quality and lower human perception. In this paper, we seek to improve prosody in generated sign languages by modeling intensification in a data-driven manner with strategies grounded in the linguistics of sign language by enhancing the representation of intensifiers in the gloss annotations. To employ our strategies, we first annotate a subset of the benchmark PHOENIX14T dataset with different levels of intensification. We then use a supervised intensity tagger to extend the tagging to the whole dataset. This enhanced dataset is then used to train state-of-the-art transformer models for sign language generation. We find that our efforts in intensifier modeling yield better results evaluated with automated metrics. Human evaluation also indicates a significantly higher preference of the videos generated using our strategies in the presence of intensity modifiers. PDF 11 2021
IndicBART: A Pre-trained Model for Indic Natural Language Generation We study pre-trained sequence-to-sequence models for a specific language family, with a focus on Indic languages. We present IndicBART, a multilingual, sequence-to-sequence pre-trained model focusing on 11 Indic languages and English. IndicBART utilizes the orthographic similarity between Indic scripts to improve transfer learning between similar Indic languages. We evaluate IndicBART on two NLG tasks: Neural Machine Translation (NMT) and extreme summarization. Our experiments on NMT and extreme summarization show that a language family-specific model like IndicBART is competitive with large pre-trained models like mBART50 despite being significantly smaller. It also performs well in very low-resource translation scenarios: languages not included in pre-training or fine-tuning. Script sharing, multilingual training and better utilization of limited model capacity contribute to the good performance of the compact IndicBART model. PDF 11 2021
Last to Learn Bias: Analyzing and Mitigating a Shortcut in Question Matching Recent studies report that even when deep neural models make correct predictions, they may be relying on shortcuts rather than understanding the semantics of the text. Previous studies indicate that shortcuts deriving from a biased data distribution in the training set create spurious correlations between features and labels. In this paper, we focus on analyzing and mitigating the biased data distribution in question matching by exploring model behavior and performance. In particular, we define bias-words as the shortcut and explore the following questions: (1) Will the bias affect the model? (2) How does the bias affect the model's decisions? Our analysis reveals that bias-words make significantly higher contributions to model predictions than random words, and that models tend to assign labels that are highly correlated with the bias-words. To mitigate the effects of the shortcut, we propose a simple approach that learns no-bias examples first and bias examples last. The experiments demonstrate the effectiveness of the proposed approach. PDF 11 2021
Automatic Identification of Cuneiform Fragments Using String Alignment Algorithms The literature from ancient Mesopotamia is still riddled with textual lacunas. Scores of fragments which could potentially fill those lacunas lie unidentified in museums' cabinets, but their identification has traditionally been slow and laborious due to the ambiguities of cuneiform script. This article presents a novel method for dealing with these ambiguities by using a string alignment algorithm adapted for cuneiform, which makes identification much easier and speeds up the process dramatically. The availability of this algorithm, and of corpora on which to use it, will significantly advance the reconstruction of Mesopotamian literature. PDF 11 2021
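For intuition about the underlying technique, here is a generic global string-alignment score (Needleman-Wunsch); the paper's cuneiform-specific adaptation, such as handling ambiguous sign readings, would replace the simple match/mismatch scoring assumed here.

```python
# Generic Needleman-Wunsch global alignment score between two sequences.
def align_score(a, b, match=1, mismatch=-1, gap=-1):
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + gap          # align prefix of a to gaps
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + gap          # align prefix of b to gaps
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,   # match / substitution
                           dp[i - 1][j] + gap,        # gap in b
                           dp[i][j - 1] + gap)        # gap in a
    return dp[m][n]

# Fragments that score highly against a known composition are candidate matches.
print(align_score("lugal-e", "lugal-a"))
```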
PeerSum: A Peer Review Dataset for Abstractive Multi-document Summarization We present PeerSum, a new MDS dataset using peer reviews of scientific publications. Our dataset differs from existing MDS datasets in that our summaries (i.e., the meta-reviews) are highly abstractive, that they are real summaries of the source documents (i.e., the reviews), and that it features disagreements among source documents. We found that current state-of-the-art MDS models struggle to generate high-quality summaries for PeerSum, offering new research opportunities. PDF 11 2021
A Simple but Effective Pluggable Entity Lookup Table for Pre-trained Language Models Pre-trained language models (PLMs) cannot reliably recall the rich factual knowledge of entities exhibited in large-scale corpora, especially for rare entities. In this paper, we propose to build a simple but effective Pluggable Entity Lookup Table (PELT) on demand by aggregating an entity's output representations over its multiple occurrences in a corpus. PELT can be compatibly plugged in as input to infuse supplemental entity knowledge into PLMs. Compared to previous knowledge-enhanced PLMs, PELT requires only 0.2%~5% of the pre-computation, with the capability of acquiring knowledge from out-of-domain corpora for domain adaptation scenarios. Experiments on knowledge-related tasks demonstrate that our method, PELT, can flexibly and effectively transfer entity knowledge from related corpora into PLMs. We will make all the data and code publicly available to facilitate future research. PDF 11 2021
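A hedged sketch of the aggregation idea described above: pool a PLM's output representations of one entity over its occurrences in a corpus to form a pluggable entity vector. Mean pooling is one simple choice of aggregation (the paper's exact aggregation may differ), and random vectors stand in for real hidden states here.

```python
# Toy aggregation of an entity's contextual output vectors into one embedding.
import numpy as np

def build_entity_embedding(occurrence_states):
    """occurrence_states: (n_occurrences, hidden_dim) output vectors collected
    at the same entity's mentions; averaging is one simple aggregation."""
    return occurrence_states.mean(axis=0)

rng = np.random.default_rng(0)
states = rng.normal(size=(12, 768))      # 12 occurrences of one entity
entity_emb = build_entity_embedding(states)
print(entity_emb.shape)                  # (768,): a pluggable input vector
```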
GauSE: Gaussian Enhanced Self-Attention for Event Extraction Event Extraction (EE) has benefited from pre-trained language models (PLMs), in which the self-attention mechanism can attend to the global relationships between triggers/arguments and context words to enhance performance. However, existing PLM-based methods are not good at capturing local trigger/argument-specific knowledge. To this end, we propose a Gaussian enhanced Self-attention Event extraction framework (GauSE), which, for the first time, models the syntax-related local information of a trigger/argument as a Gaussian bias, paying more attention to the syntactic scope of the local region. Furthermore, existing methods rarely consider multiple occurrences of the same triggers/arguments in EE. We explore global interaction strategies among the multiple localities of the same triggers/arguments to fuse the corresponding distributions and capture more latent information scopes. Compared to traditional GCN-based models, our method introduces syntactic relationships without the over-smoothing problem of deep GCN layers. Experiments on EE datasets demonstrate the effectiveness and generalization of our proposed approach. PDF 11 2021
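To illustrate the general mechanism of a Gaussian locality bias in self-attention, the sketch below adds a log-Gaussian term to raw attention logits around a trigger position; GauSE derives the centre and width from syntax, which this toy version replaces with fixed values.

```python
# Toy Gaussian-biased attention: favour keys near a given (trigger) position.
import numpy as np

def gaussian_biased_attention(scores, center, sigma=2.0):
    """scores: (seq_len, seq_len) raw attention logits."""
    positions = np.arange(scores.shape[-1])
    bias = -((positions - center) ** 2) / (2 * sigma ** 2)  # log of a Gaussian
    scores = scores + bias                 # boost keys near the centre position
    scores -= scores.max(axis=-1, keepdims=True)            # stable softmax
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
attn = gaussian_biased_attention(rng.normal(size=(8, 8)), center=3)
print(attn[0].round(3))   # each row sums to 1, with mass concentrated near index 3
```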
An Investigation of the (In)effectiveness of Counterfactually Augmented Data While pretrained language models achieve excellent performance on natural language understanding benchmarks, they tend to rely on spurious correlations and generalize poorly to out-of-distribution (OOD) data. Recent work has explored using counterfactually-augmented data (CAD)---data generated by minimally perturbing examples to flip the ground-truth label---to identify robust features that are invariant under distribution shift. However, empirical results using CAD during training for OOD generalization have been mixed. To explain this discrepancy, through a toy theoretical example and empirical analysis on two crowdsourced CAD datasets, we show that: (a) while features perturbed in CAD are indeed robust features, training on CAD may prevent the model from learning unperturbed robust features; and (b) CAD may exacerbate existing spurious correlations in the data. Our results thus show that the lack of perturbation diversity limits CAD's effectiveness on OOD generalization, calling for innovative crowdsourcing procedures to elicit diverse perturbations of examples. PDF 11 2021
Pretraining over Interactions for Learning Grounded Object Representations Large language models have been criticized for their limited ability to reason about \textit{affordances} - the actions that can be performed on an object. It has been argued that to accomplish this, models need some form of grounding, i.e., connection, to objects and how they interact in the physical world. Inspired by the way humans learn about the world through interaction, we develop an approach to learning physical properties directly. We introduce a dataset of 200k object interactions in a 3D virtual environment and a self-supervised pretraining objective for learning representations of these objects. We show with probing and clustering experiments that even in the zero-shot setting, derived models learn robust representations of objects and their affordances in an unsupervised manner. Our model outperforms pretrained language and vision models on an affordance prediction baseline, suggesting that pretraining on observed interactions encodes grounded information that is not readily learned in conventional text or vision models. PDF 11 2021
Extreme Multi-label Text Classification with Pseudo Label Descriptions Extreme multi-label text classification (XMTC) is the task of tagging each document with the relevant labels from a large predefined label space, where the label frequency distribution is often highly skewed. That is, a large portion of labels (namely the tail labels) have very few positive instances, posing a hard optimization problem for training the classification models. The severe data sparsity issue with tail labels is even more pronounced in recent neural classifiers, where the embeddings of both the input documents and the output labels need to be jointly learned, and the success of such learning relies on the availability of sufficient training instances. This paper addresses this tough challenge in XMTC by proposing a novel approach that combines the strengths of both traditional bag-of-words (BoW) classifiers and recent neural embedding based classifiers. Specifically, we use a trained BoW model to generate a pseudo description for each label, and apply a neural model to establish the mapping between input documents and target labels in the latent embedding spaces. Our experiments show significant improvements of the proposed approach over other strong baseline methods on benchmark datasets, especially on tail label prediction. We also provide a theoretical analysis relating BoW and neural models w.r.t. a performance lower bound. PDF 11 2021
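As a toy illustration of the pseudo-description idea, the sketch below builds a short word-based description per label from bag-of-words statistics; the paper derives descriptions from a trained BoW classifier, for which a simple tf-idf-style score is only a stand-in.

```python
# Toy pseudo label descriptions: top-k words per label by a tf-idf-like score.
from collections import Counter
import math

def pseudo_descriptions(docs, labels, k=3):
    df = Counter(w for d in docs for w in set(d.split()))  # document frequency
    n = len(docs)
    per_label = {}
    for lab in set(labels):
        tf = Counter(w for d, l in zip(docs, labels) if l == lab
                     for w in d.split())
        scored = {w: c * math.log(n / df[w]) for w, c in tf.items()}
        per_label[lab] = " ".join(sorted(scored, key=scored.get, reverse=True)[:k])
    return per_label

docs = ["cheap flights to paris", "hotel booking paris",
        "python list sort", "sort a dict in python"]
print(pseudo_descriptions(docs, ["travel", "travel", "code", "code"]))
```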
Retrieval-guided Counterfactual Generation for QA Deep NLP models have been shown to be brittle to input perturbations. Recent work has shown that data augmentation using counterfactuals --- i.e. minimally perturbed inputs --- can help ameliorate this weakness. We focus on the task of creating counterfactuals for question answering, which presents unique challenges related to world knowledge, semantic diversity, and answerability. To address these challenges, we develop a Retrieve-Generate-Filter (RGF) technique to create counterfactual evaluation and training data with minimal human supervision. Using an open-domain QA framework and a question generation model trained on original task data, we create counterfactuals that are fluent, semantically diverse, and automatically labeled. Data augmentation with RGF counterfactuals improves performance on out-of-domain and challenging evaluation sets over and above existing methods, in both the reading comprehension and open-domain QA settings. Moreover, we find that RGF data leads to significant improvements in a model's robustness to local perturbations. PDF 11 2021
QA Domain Adaptation using Data Augmentation and Contrastive Adaptation Domain adaptation for question answering (QA) has recently shown impressive results for answering out-of-domain questions. Yet, a common challenge is to build approaches that are effective for niche domains with small text corpora. In this paper, we propose a novel framework called QADA for QA domain adaptation. QADA has two components: (1) A question generation model is used to generate synthetic question-answer samples from the target domain. Different from existing baselines, we enrich the samples via a novel pipeline for data augmentation: for questions, we introduce token-level augmentation (i.e., synonym replacement and token swapping), and, for contexts, we develop hidden-space augmentation which learns to drop context spans via a custom attentive sampling strategy. (2) The QA model is based on transformers. However, unlike existing approaches, we propose to train it via a novel attention-based contrastive adaptation. Here, we use the attention weights to sample informative tokens for discrepancy estimation that helps the QA model separate answers and generalize across source and target domain. To the best of our knowledge, our work is the first in QA domain adaptation to leverage data augmentation and attention-based contrastive adaptation. Our evaluation shows that QADA achieves considerable improvements over state-of-the-art baselines for QA domain adaptation. PDF 11 2021
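The two token-level augmentations named in the QADA abstract above (synonym replacement and token swapping) can be sketched as follows; the synonym table and probabilities are illustrative assumptions, since the abstract does not specify the synonym source.

```python
# Toy token-level augmentation: synonym replacement and token swapping.
import random

SYNONYMS = {"film": ["movie"], "big": ["large", "huge"]}  # illustrative only

def synonym_replace(tokens, p=0.2, rng=random.Random(0)):
    return [rng.choice(SYNONYMS[t]) if t in SYNONYMS and rng.random() < p else t
            for t in tokens]

def token_swap(tokens, n_swaps=1, rng=random.Random(0)):
    tokens = list(tokens)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)  # pick two distinct positions
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

q = "how big is the film industry".split()
print(synonym_replace(q, p=1.0))
print(token_swap(q))
```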
RELiC: Retrieving Evidence for Literary Claims Humanities scholars commonly provide evidence for claims that they make about a work of literature (e.g., a novel) in the form of quotations from the work. We collect a large-scale dataset (RELiC) of 90K literary quotations and surrounding critical analysis and use it to formulate the novel task of literary evidence retrieval, in which models are given an excerpt from a literary analysis surrounding a masked quotation and asked to retrieve the quoted passage from the set of all passages in the work. Solving this retrieval task requires a deep understanding of complex literary and linguistic phenomena, which proves challenging to methods that overwhelmingly rely on lexical and semantic similarity matching. We implement a RoBERTa-based dense passage retriever for this task that outperforms existing pretrained information retrieval baselines; however, experiments and analysis by human domain experts indicate that there is substantial room for improvement. PDF 11 2021
Towards Faithful Response Generation for Chinese Table Question Answering Response generation for TableQA aims to automatically generate a response for end-users from a SQL query and its corresponding execution result (in the form of a table). It is an essential and practical task; however, there has been little work on it in recent years. We attribute this to the lack of large-scale and high-quality datasets in this area. In this paper, we present ResponseNLG, a large-scale and high-quality Chinese dataset for TableQA response generation, to advance the field in both the academic and industrial communities. Further, to bridge the structural gap between the input SQL and table and establish better semantic alignments, we propose a Heterogeneous Graph Transformation approach. In this way, we establish a joint encoding space for the two heterogeneous input data sources and convert this task to a Graph-to-Text problem. We further introduce a Node Segment Embedding to better preserve the original graph structure in PLM-based models. PDF 11 2021
Improving Compositional Generalization with Self-Training for Data-to-Text Generation Data-to-text generation focuses on generating fluent natural language responses from structured meaning representations (MRs). Such representations are compositional and it is costly to collect responses for all possible combinations of atomic meaning schemata, thereby necessitating few-shot generalization to novel MRs. In this work, we systematically study the compositional generalization of the state-of-the-art T5 models in few-shot data-to-text tasks. We show that T5 models fail to generalize to unseen MRs, and we propose a template-based input representation that considerably improves the model's generalization capability. To further improve the model's performance, we propose an approach based on self-training using fine-tuned BLEURT for pseudo-response selection. On the commonly-used SGD and Weather benchmarks, the proposed self-training approach improves tree accuracy by $46\%+$ and reduces the slot error rates by $73\%+$ over the strong T5 baselines in few-shot settings. PDF 11 2021
Textual Entailment with Dynamic Contrastive Learning for Zero-shot NER In this paper, we study the problem of zero-shot NER, which aims at building a Named Entity Recognition (NER) system from scratch. The system needs to identify the entities in given sentences when we have zero token-level annotations for training. Previous works usually use sequence labeling models to solve the NER task and obtain weakly labeled data from entity dictionaries in the zero-shot setting. However, this labeled data is quite noisy, since labels are needed for each token and the entity coverage of the dictionaries is limited. Here we propose to formulate the NER task as a Textual Entailment problem and solve it via Textual Entailment with Dynamic Contrastive Learning (TEDC). TEDC not only alleviates the noisy labeling issue, but also transfers knowledge from pre-trained textual entailment models. Additionally, the dynamic contrastive learning framework contrasts entities and non-entities in the same sentence and improves the model's discrimination ability. Experiments on two datasets show that TEDC achieves state-of-the-art performance on zero-shot NER. PDF 11 2021
On the Difficulties of using NLP for Language Revitalization This paper discusses the general difficulty of applying emerging Natural Language Processing (NLP) technologies to the revitalization of languages. Previous literature has described the social causes of language shift: legal prohibitions, social and economic marginalization, as well as a lack of inclusion in public life, have been identified as the main factors in the non-viability of minority languages. As such, as innovative as they may be, these emerging tools are not enough to rescue languages, and the core issues must be addressed if meaningful results are expected. PDF 11 2021
Logic-Driven Context Extension and Data Augmentation for Logical Reasoning of Text Logical reasoning over text requires identifying critical logical structures in the text and performing inference over them. Existing methods for logical reasoning mainly focus on the contextual semantics of text while struggling to explicitly model the logical inference process. In this paper, we put forward both a logic-driven context extension framework and a logic-driven data augmentation algorithm. The former follows a three-step reasoning paradigm: extract logical expressions as elementary reasoning units, symbolically infer the implicit expressions following equivalence laws, and extend the context to validate the options. The latter augments literally similar but logically different instances and incorporates contrastive learning to better capture logical information, especially logical negative and conditional relationships. We conduct experiments on two benchmark datasets, ReClor and LogiQA. The results show that our method achieves state-of-the-art performance on both datasets and even surpasses human performance on the ReClor dataset. PDF 11 2021
Utterance Rewriting with Contrastive Learning in Multi-turn Dialogue Context modeling plays a significant role in building multi-turn dialogue systems. In order to make full use of context information, systems can use Incomplete Utterance Rewriting (IUR) methods to simplify a multi-turn dialogue into a single turn by merging the current utterance and context information into a self-contained utterance. However, previous approaches ignore the intent consistency between the original query and the rewritten query. The detection of omitted or coreferred locations in the original query can also be further improved. In this paper, we introduce contrastive learning and multi-task learning to jointly model the problem. Our method benefits from carefully designed self-supervised objectives, which act as auxiliary tasks to capture semantics at both the sentence level and the token level. The experiments show that our proposed model achieves state-of-the-art performance on several public datasets. PDF 11 2021
Improving Equation Set Problems with Label Augmentation Math word problem solving has received considerable attention from many NLP researchers. Inspired by the encoder-decoder structure, they have created a series of neural network models to solve arithmetic word problems and equation set problems. However, these encoder-decoder models use the ground truth as the only generation target, resulting in shallow heuristics for generating expressions. In this paper, we propose a simple and effective label augmentation method for equation set problems. Specifically, we transform the ground truth into several equivalent labels using normalization rules, and these new labels are used as additional generation targets for model training. Experimental results on the English dataset DRAW1K and the Chinese dataset HMWP show that the label augmentation method yields up to a 4.5% improvement over state-of-the-art (SoTA) models. PDF 11 2021
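A toy illustration of the label-augmentation idea: produce equivalent target labels by mirroring each equation and by reordering the equations in the set. The paper's normalization rules are richer than these two transformations, which are chosen only because they are safely equivalence-preserving.

```python
# Toy label augmentation for equation set problems.
def augment_labels(equations):
    """equations: list of 'lhs=rhs' strings forming one equation set."""
    variants = {tuple(equations)}
    # mirror each equation: 'x+y=5' is equivalent to '5=x+y'
    mirrored = tuple(f"{e.split('=')[1]}={e.split('=')[0]}" for e in equations)
    variants.add(mirrored)
    # an equation set is unordered, so reversing the list is also equivalent
    variants.add(tuple(reversed(equations)))
    return [list(v) for v in variants]

for v in augment_labels(["x+y=5", "x-y=1"]):
    print(v)   # each variant can serve as an additional generation target
```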
Things not Written in Text: Exploring Spatial Commonsense from Visual Signals Spatial commonsense, the knowledge about spatial positions and relationships between objects (like the relative size of a lion and a girl, and the position of a boy relative to a bicycle when cycling), is an important part of commonsense knowledge. Although pretrained language models (PLMs) succeed in many NLP tasks, they have been shown to be ineffective in spatial commonsense reasoning. Starting from the observation that images are more likely to exhibit spatial commonsense than texts, we explore whether models with visual signals learn more spatial commonsense than text-based PLMs. We propose a spatial commonsense benchmark that focuses on the relative scales of objects and the positional relationships between people and objects under different actions. We probe PLMs and models with visual signals, including vision-language pretrained models and image synthesis models, on this benchmark, and find that image synthesis models are more capable of learning accurate and consistent spatial knowledge than the other models. The spatial knowledge from image synthesis models also helps in natural language understanding tasks that require spatial commonsense. PDF 11 2021
From Stance to Concern: Adaptation of Propositional Analysis to New Tasks and Domains We present a generalized paradigm for adaptation of propositional analysis (predicate-argument pairs) to new tasks and domains, leveraging an analogy between stances (belief-driven sentiment) and concerns (topical issues with moral dimensions/endorsements). A key contribution is the combination of semi-automatic resource building for extraction of domain-dependent concern types (with 2-4 hours of human labor per domain) and an entirely automatic procedure for extraction of domain-independent moral dimensions and endorsement values. Prudent (automatic) selection of terms from propositional structures for lexical expansion (via semantic similarity) produces new moral dimension lexicons at three levels of granularity beyond a strong baseline lexicon. We develop a ground truth (GT) based on expert annotators and compare our concern detection output to GT, to yield 231% improvement in recall over baseline, with only a 10% loss in precision. F1 yields 66% improvement over baseline and 97.8% of human performance. Moreover, our lexically based approach yields large savings in terms of human labor and costly model building. Work produced herein provides to the community a newly expanded moral dimension/value lexicon, annotation guidelines, and GT. PDF 11 2021
Low rank softmax can have unargmaxable classes in theory but rarely in practice Classifiers in natural language processing (NLP) often have a large number of output classes. For example, neural language models (LMs) and machine translation (MT) models both predict tokens from a vocabulary of thousands. The softmax output layer of these models typically receives as input a dense feature representation, which has much lower dimensionality than the output. In theory, the result is that some words may be impossible to predict via argmax, irrespective of input features, and empirically, this has been shown to happen in small language models (Demeter et al., 2020). In this paper we ask whether it can happen in practical large language models and translation models. To do so, we develop algorithms to detect such unargmaxable tokens in public models. We find that 13 out of 150 models do indeed have such tokens; however, they are very infrequent and unlikely to impact model quality. We release our algorithms and code to the public. PDF 11 2021
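A hedged sketch of one way such a detection algorithm can work (the paper's own algorithms may differ in detail; biases are ignored here): token i is argmaxable iff some input x gives (W[i]-W[j])·x > 0 for every j != i, and since this condition is scale-invariant in x, it can be checked exactly with a bounded linear program.

```python
# LP-based check for unargmaxable tokens in a low-rank softmax layer W.
import numpy as np
from scipy.optimize import linprog

def is_argmaxable(W, i):
    d = W.shape[1]
    diffs = np.delete(W - W[i], i, axis=0)        # rows are W[j] - W[i]
    # variables z = [x, t]; maximize t s.t. (W[j]-W[i])·x + t <= 0 for all j
    A_ub = np.hstack([diffs, np.ones((len(diffs), 1))])
    res = linprog(c=np.r_[np.zeros(d), -1.0], A_ub=A_ub,
                  b_ub=np.zeros(len(diffs)),
                  bounds=[(-1, 1)] * d + [(None, 1)])  # box keeps the LP bounded
    return res.fun < -1e-9                        # max t > 0  ->  argmaxable

# Rank-deficient example: class 2 lies in the convex hull of classes 0 and 1,
# so it can never win the argmax regardless of the input features.
W = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
print([is_argmaxable(W, i) for i in range(3)])    # [True, True, False]
```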
Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages We show that unsupervised sequence-segmentation performance can be transferred to extremely low-resource languages by pre-training a Masked Segmental Language Model (Downey et al., 2021) multilingually. Further, we show that this transfer can be achieved by training over a collection of low-resource languages that are typologically similar (but phylogenetically unrelated) to the target language. In our experiments, we transfer from a collection of 10 Indigenous American languages (AmericasNLP, Mager et al., 2021) to K'iche', a Mayan language. We compare our multilingual model to a monolingual (from-scratch) baseline, as well as a model pre-trained on Quechua only. We show that the multilingual pre-trained approach yields consistent segmentation quality across target dataset sizes, exceeding the monolingual baseline in 6/10 experimental settings. Our model yields especially strong results at small target sizes, including a zero-shot performance of 20.6 F1. These results have promising implications for low-resource NLP pipelines involving human-like linguistic units, such as the sparse transcription framework proposed by Bird (2020). PDF 11 2021
MATHion: Solving Math Word Problems with Logically Consistent Problems Solving math word problems (MWPs) is a challenging task. Some existing MWP solvers retrieve textually similar problems and draw on their solutions to solve the given problem. However, textually similar questions are not guaranteed to have similar solutions, and questions can share the same solution despite having different descriptions. Therefore, in this work, we investigate the logical consistency among different problems and propose a novel framework named MATHion, which solves math word problems with logically consistent problems. Experimental results show that our method outperforms many strong baselines, including some pre-trained language model-based methods. Further analysis shows that our retrieval method can learn the logical similarity between questions and plays a key role in our model's performance. PDF 11 2021
One-to-Many and Many-to-One Dialogue Learning via Sentence Semantic Segmentation Guided Conditional Variational Auto-Encoder Due to their complex mapping relations, one-to-many and many-to-one phenomena are huge challenges for the open-domain dialogue generation task, and they tend to make dialogue models generate irrelevant, incoherent or non-diverse responses. Most existing methods avoid learning such phenomena by introducing external information, reconstructing the optimization function or manipulating data samples. However, avoiding these challenges ignores valuable information in such responses, and the dialogue models cannot learn the nature of these phenomena. In this paper, we propose a Sentence Semantic Segmentation guided Conditional Variational Auto-Encoder (SegCVAE) to directly learn one-to-many and many-to-one responses. SegCVAE uses prominent semantics to replace the original semantics when learning the distribution of latent variables, which avoids the gap between latent variables and the context, thus ensuring the relevance and coherence of the generated responses. Furthermore, SegCVAE can segment multiple prominent semantics to ensure the diversity of generated responses. To evaluate the model, we first define two new tasks, named the one-to-many dialogue learning task and the many-to-one dialogue learning task, and then provide two new dialogue datasets, named One-to-Many and Many-to-One, which are extracted from a well-established dataset. Finally, we also propose evaluation strategies based on commonly used metrics. The experimental results show that our model achieves better performance than the baseline models on these two new tasks. PDF 11 2021
Two-Level Supervised Contrastive Learning for Response Selection in Multi-Turn Dialogue Selecting an appropriate response from many candidates given the utterances in a multi-turn dialogue is the key problem for a retrieval-based dialogue system. Existing work formalizes the task as matching between the utterances and a candidate and uses the cross-entropy loss in learning the model. This paper applies contrastive learning to the problem by using the supervised contrastive loss. In this way, the learned representations of positive examples and representations of negative examples can be more distantly separated in the embedding space, and the performance of matching can be enhanced. We further develop a new method for supervised contrastive learning, referred to as two-level supervised contrastive learning, and employ the method in response selection in multi-turn dialogue. Our method exploits two techniques: sentence token shuffling (STS) and sentence re-ordering (SR) for supervised contrastive learning. Experimental results on three benchmark datasets demonstrate that the proposed method significantly outperforms the contrastive learning baseline and the state-of-the-art methods for the task. PDF 11 2021
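The two augmentations named in this abstract are simple to sketch: sentence token shuffling (STS) perturbs word order within an utterance, and sentence re-ordering (SR) perturbs utterance order across the dialogue. How the resulting views feed into the two-level contrastive objective is left out here.

```python
# Toy STS and SR augmentations for building extra views of a dialogue context.
import random

def sentence_token_shuffle(utterance, rng=random.Random(0)):
    tokens = utterance.split()
    rng.shuffle(tokens)                  # shuffle word order within one sentence
    return " ".join(tokens)

def sentence_reorder(context, rng=random.Random(0)):
    context = list(context)
    rng.shuffle(context)                 # shuffle utterance order in the dialogue
    return context

ctx = ["hi , any plans tonight ?", "thinking of a movie .", "which one ?"]
print(sentence_token_shuffle(ctx[0]))
print(sentence_reorder(ctx))
```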
MAML-CL: Edited Model-Agnostic Meta-Learning for Continual Learning Continual learning (CL) aims to learn well across all sequentially seen tasks drawn from various domains. Yet, existing sequential training methods fail to consolidate knowledge learned from earlier tasks due to data distribution shifts, thereby leading to catastrophic forgetting. We devise an optimization-based meta-learning framework for CL in accordance with MAML, where query samples are edited for generalization of learned knowledge. We conduct extensive experiments on text classification in a low-resource CL setup, where we downsize the training set to 10% of its original size. The experimental results demonstrate the superiority of our method in terms of stability, fast adaptation, memory efficiency and knowledge retention across various domains. PDF 11 2021
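For readers unfamiliar with the underlying mechanics, here is a minimal first-order MAML sketch in PyTorch. It shows only the inner/outer-loop skeleton the framework builds on, not the paper's query-sample editing; the function name and hyperparameters are illustrative assumptions.

```python
import copy
import torch

def fomaml_step(model, tasks, inner_lr=1e-2, outer_lr=1e-3, inner_steps=3):
    """One first-order MAML meta-update.

    tasks: list of (support, query) pairs, each a (inputs, targets) batch.
    Cross-entropy is assumed as the task loss here.
    """
    loss_fn = torch.nn.functional.cross_entropy
    meta_grads = [torch.zeros_like(p) for p in model.parameters()]
    for support, query in tasks:
        learner = copy.deepcopy(model)          # task-specific fast weights
        opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)
        for _ in range(inner_steps):            # inner-loop adaptation
            x, y = support
            opt.zero_grad()
            loss_fn(learner(x), y).backward()
            opt.step()
        xq, yq = query                          # evaluate on the query set
        qloss = loss_fn(learner(xq), yq)
        grads = torch.autograd.grad(qloss, learner.parameters())
        for mg, g in zip(meta_grads, grads):
            mg += g / len(tasks)                # first-order meta-gradient
    with torch.no_grad():                       # outer-loop update
        for p, mg in zip(model.parameters(), meta_grads):
            p -= outer_lr * mg
    return model
```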
BERT is Robust! A Case Against Synonym-Based Adversarial Examples in Text Classification In this work, we investigate the robustness of BERT using four word substitution-based attacks. We combine a human evaluation of individual word substitutions and a probabilistic analysis to show that between 96% and 99% of the analyzed attacks do not preserve semantics, indicating that their success is mainly based on feeding poor data to the model. To confirm this further, we introduce an efficient data augmentation procedure and show that many successful attacks can be prevented by including data similar to adversarial examples during training. Compared to traditional adversarial training, our data augmentation procedure requires 30x less computation time per epoch, while achieving better performance on two out of three datasets. We introduce an additional post-processing step that reduces the success rates of state-of-the-art attacks to below 4%, 5%, and 8% on the three considered datasets. Finally, by looking at constraints for word substitutions that better preserve semantics, we conclude that BERT is considerably more robust than previous research suggests. PDF 11 2021
Neural Keyphrase Generation: Analysis and Evaluation Keyphrase generation aims at generating topical phrases from a given text either by copying from the original text (present keyphrases) or by producing new keyphrases (absent keyphrases) that capture the semantic meaning of the text. Encoder-decoder models are most widely used for this task because of their capabilities for absent keyphrase generation. However, there has been little to no analysis of the performance and behavior of such models for keyphrase generation. In this paper, we study various tendencies exhibited by two strong models: T5 (based on a pre-trained transformer) and ExHiRD (based on a recurrent neural network). We analyze prediction confidence scores, model calibration, and the effect of position on present keyphrase generation. Moreover, we motivate and propose a novel metric, SoftKeyScore, to evaluate the similarity between two sets of keyphrases by using soft scores to account for partial matching and semantic similarity. We find that SoftKeyScore performs better than the standard F$_{1}$ metric for evaluating two sets of given keyphrases. We will release our code. PDF 11 2021
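SoftKeyScore's exact formulation is not given in the abstract; a generic soft-matching F1 in this spirit, using phrase embeddings from any encoder, might look like the following sketch (the function name and the assumption of L2-normalized rows are ours):

```python
import numpy as np

def soft_set_f1(pred_vecs: np.ndarray, gold_vecs: np.ndarray) -> float:
    """Soft F1 between two keyphrase sets given embedding matrices.

    pred_vecs: (n_pred, dim), gold_vecs: (n_gold, dim); rows are phrase
    embeddings assumed L2-normalized, so dot products are cosine similarities.
    """
    sim = pred_vecs @ gold_vecs.T            # pairwise cosine similarities
    soft_precision = sim.max(axis=1).mean()  # best gold match per prediction
    soft_recall = sim.max(axis=0).mean()     # best prediction per gold phrase
    return 2 * soft_precision * soft_recall / (soft_precision + soft_recall + 1e-8)
```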
AbductionRules: Training Transformers to Explain Unexpected Inputs Transformers have recently been shown to be capable of reliably performing logical reasoning over facts and rules expressed in natural language, but abductive reasoning - inference to the best explanation of an unexpected observation - has been underexplored despite significant applications to scientific discovery, common-sense reasoning, and model interpretability. This paper presents AbductionRules, a group of natural language datasets designed to train and test generalisable abduction over natural-language knowledge bases. We use these datasets to finetune pretrained Transformers and discuss their performance, finding that our models learned generalisable abductive techniques but also learned to exploit the structure of our data. Finally, we discuss the viability of this approach to abductive reasoning and ways in which it may be improved in future work. PDF 11 2021
A Dual-Channel Framework for Sarcasm Recognition by Detecting Sentiment Conflict Sarcasm employs ambivalence: one says something positive but actually means something negative, or vice versa. The essence of sarcasm, which is also a sufficient and necessary condition, is the conflict between literal and implied sentiments. However, it is difficult to recognize this sentiment conflict because multiple mixed or even implicit sentiments coexist in one text. As a result, recognizing sophisticated and obscure sentiment poses a great challenge for sarcasm detection. In this paper, we propose a dual-channel framework that models literal and implied sentiment separately. Based on this flexible dual-channel framework, we design the Dual-Channel Net (DC-Net) to recognize sentiment conflict. Experiments on political debates (i.e. IAC-V1 and IAC-V2) and Twitter datasets show that our proposed DC-Net achieves state-of-the-art performance on sarcasm recognition. PDF 11 2021
Phone-ing it in: Towards Flexible Multi-Modal Language Model Training by Phonetic Representations of Data Multi-modal techniques offer significant untapped potential to unlock improved NLP functionality for local languages. However, many advances in language model pre-training are focused on text, a fact that only increases systematic inequalities in the performance of NLP tasks across the world's languages. In this work, we propose a multi-modal approach to train language models using whatever text and/or audio data might be available in a language. Initial experiments using Swahili and Kinyarwanda data suggest the viability of the approach for downstream Named Entity Recognition (NER) tasks, with models pre-trained on phone data showing an improvement of up to 6\% F1-score above models that are trained from scratch. PDF 11 2021
Learning to Rank Visual Stories From Human Ranking Data Visual storytelling (VIST) is a typical vision and language task that has seen extensive development in the natural language generation research domain. However, it remains unclear whether conventional automatic evaluation metrics for text generation are applicable on VIST. In this paper, we present the VHED (VIST Human Evaluation Data) dataset, which first re-purposes human evaluation results for automatic evaluation; hence we develop Vrank (VIST Ranker), a novel reference-free VIST metric for story evaluation. We first show that the results from commonly adopted automatic metrics for text generation have little correlation with those obtained from human evaluation, which motivates us to directly utilize human evaluation results to learn the automatic evaluation model. In the experiments, we evaluate the generated texts to predict story ranks using our model as well as other reference-based and reference-free metrics. Results show that Vrank prediction is significantly more aligned to human evaluation than other metrics with almost 30\% higher accuracy when ranking story pairs. Moreover, we demonstrate that only Vrank shows human-like behavior in its strong ability to find better stories when the quality gap between two stories is high. Finally, we show the superiority of Vrank by its generalizability to pure textual stories, and conclude that this reuse of human evaluation results puts Vrank in a strong position for continued future advances. PDF 11 2021
OpenKorPOS: Democratizing Korean Tokenization with Voting-Based Open Corpus Annotation Korean is a language with complex morphology that uses spaces at larger-than-word boundaries, unlike other East-Asian languages. While morpheme-based text generation can provide significant semantic advantages compared to commonly used character-level approaches, Korean morphological analyzers only provide a sequence of morpheme-level tokens, losing information in the tokenization process. Two crucial issues are the loss of spacing information and sub-character-level morpheme normalization, both of which make it challenging to reconstruct the original input string from the tokenization result, deterring application to generative tasks. As this problem originates from the conventional scheme used when creating a POS tagging corpus, we propose an improvement to the existing scheme which makes it friendlier to generative tasks. On top of that, we suggest a semi-automatic corpus annotation approach that leverages public analyzers. We take a vote over the surface forms and POS tags in their outputs and fill the sequence with the selected morphemes, yielding decent-quality tokenization that incorporates spacing information. Our scheme is verified via an evaluation on an external corpus and is subsequently applied to Korean Wikipedia to construct an open, permissive resource. We compare the performance of morphological analyzers trained on our corpus with existing methods, then perform an extrinsic evaluation on a downstream task. PDF 11 2021
HonestBait: Headline Generation via Faithful Forward Reference Current methods for generating attractive headlines often learn directly from data, basing attractiveness on the number of user clicks and views. Although clicks and views do reflect user interest, they can fail to reveal how much interest is raised by the writing style and how much is caused by the event or topic itself. Such approaches can also lead to harmful hallucinations by over-exaggerating the content, aggravating the spread of false information. In this work, we propose HonestBait, a novel framework that addresses these issues from another angle: generating headlines using forward references (FR), a writing technique often used in clickbait. A self-verification process is also included in training to avoid harmful hallucinations. We start with a preliminary user study to understand how FR affects user interest, after which we present PANCO, an innovative dataset containing pairs of fake news with verified news for attractive but faithful news headline generation. Automatic metrics and human evaluations show that our framework yields better results in attractiveness while maintaining high veracity. PDF 11 2021
Improving Aspect Extraction based on Rules through Deep Syntax-Semantics Communication Recent studies show that integrating language resources (lexical, syntactic and semantic) can improve the performance of natural language processing (NLP) tasks. Existing methods mostly perform simple integration by concatenating these resources successively and seldom consider the complementary relationships among them, such as the deep communication of syntactic and semantic relations between words. To enhance deep syntax-semantics communication, this paper takes the aspect term extraction (ATE) task as an example and explores four integration strategies for language resources. These strategies, based on Answer Set Programming (ASP) rules, are interpretable. Experiments on eight ATE datasets show that our strategies achieve superior performance, demonstrating that they are highly effective in integrating language resources. PDF 11 2021
A Simple Information-Based Approach to Unsupervised Domain-Adaptive Aspect-Based Sentiment Analysis Aspect-based sentiment analysis (ABSA) is a fine-grained sentiment analysis task which aims to extract the aspects from sentences and identify their corresponding sentiments. Aspect term extraction (ATE) is the crucial step for ABSA. Due to the expensive annotation for aspect terms, we often lack labeled target-domain data for fine-tuning. To address this problem, many approaches have been proposed recently to transfer common knowledge in an unsupervised way, but such methods have too many modules and require expensive multi-stage preprocessing. In this paper, we propose a simple but effective technique based on mutual information maximization, which can serve as an additional component to enhance any kind of model for cross-domain ABSA and ATE. Furthermore, we provide some analysis of this approach. Experimental results show that our proposed method outperforms the state-of-the-art methods for cross-domain ABSA by 4.32\% Micro-F1 on average over 10 different domain pairs. Apart from that, our method can be extended to other sequence labeling tasks, such as named entity recognition (NER). Codes will be released. PDF 11 2021
Controlling Pretrained Language Generation Models by Learning to Focus Transformer-based language models, which are pretrained on large-scale unsupervised data and then finetuned on task-specific datasets, have become the dominant paradigm for various natural language generation tasks. The finetuning and usage of such models are typically conducted in an end-to-end manner. This work attempts to develop a control mechanism by which a user can select spans of context as "highlights" for the model to focus on while generating output text. To achieve this goal, we augment a pretrained model with trainable "attention vectors" that are directly applied to the model's embeddings, while the model itself is kept fixed. These vectors, trained on automatic annotations derived from attribution methods, act as indicators for context importance. We test our approach on two core generation tasks: dialogue response generation and abstractive summarization. We also collect evaluation data where the highlight-generation pairs are annotated by humans. Our experiments show that the trained attention vectors are effective in steering the model to generate outputs that are relevant to user-selected highlights. PDF 11 2021
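The abstract leaves open exactly how the attention vectors combine with the embeddings; one plausible additive form, with the pretrained weights frozen, is sketched below (the class and argument names are our own):

```python
import torch
import torch.nn as nn

class FocusVector(nn.Module):
    """Trainable vector added to the embeddings of user-highlighted tokens,
    while the pretrained model itself stays frozen."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.focus = nn.Parameter(torch.zeros(hidden_size))

    def forward(self, token_embeddings: torch.Tensor,
                highlight_mask: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq, hidden)
        # highlight_mask: (batch, seq) with 1.0 at highlighted positions
        # Only highlighted positions receive the learned offset.
        return token_embeddings + highlight_mask.unsqueeze(-1) * self.focus
```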
Zero-Shot Script Parsing Script knowledge (Schank and Abelson, 1977) has proved useful for a variety of NLP tasks. However, existing resources cover only a small number of activities, limiting their practical usefulness. In this work, we propose a zero-shot learning approach to script parsing, the task of tagging texts with pre-defined, scenario-specific event and participant types, which makes it possible to acquire script knowledge without domain-specific annotations. We (1) learn representations of potential event and participant mentions by promoting cluster consistency according to the annotated data; (2) perform clustering on the event/participant candidates from unannotated texts that belong to an unseen scenario. We further exploit dependency and coreference information. The model achieves 68.1/74.4 average F1 for event/participant parsing, respectively, outperforming a previous CRF model that has access to domain-specific supervision. PDF 11 2021
Local Differential Privacy for Privacy-Preserving NLP Tasks In this paper, we propose a Local Differentially Private Natural Language Processing (LDP-NLP) model that protects the privacy of user input sentences during both training and inference while requiring no trust in server security. Compared to existing methods, this novel privacy-preserving methodology significantly reduces the calibrated noise power and thus improves model accuracy by incorporating (a) an LDP layer, (b) sub-sampling and up-sampling DP amplification algorithms for training and inference, and (c) DP composition algorithms for noise calibration. This LDP-NLP solution guarantees privacy for the entire training/inference data for the first time, whereas current methods can only guarantee privacy for a single training or inference step. Furthermore, the total privacy cost is reduced to a reasonable range, i.e., less than 10, for the first time, with an accuracy loss of only 2-5\% compared to the accuracy upper bound produced by the original model without privacy guarantees. PDF 11 2021
UCTopic: Unsupervised Contrastive Learning for Phrase Representations and Topic Mining High-quality phrase representations are essential to finding topics and related terms in documents (a.k.a. topic mining). Existing phrase representation learning methods either simply combine unigram representations in a context-free manner or rely on extensive annotations to learn context-aware knowledge. In this paper, we propose UCTopic, a novel unsupervised contrastive learning framework for context-aware phrase representations and topic mining. UCTopic is pretrained at a large scale to distinguish whether the contexts of two phrase mentions have the same semantics. The key to the pretraining is positive pair construction from our phrase-oriented assumptions. However, we find traditional in-batch negatives cause performance decay when finetuning on a dataset with a small number of topics. Hence, we propose cluster-assisted contrastive learning (CCL), which largely reduces noisy negatives by selecting negatives from clusters and further improves phrase representations for topics accordingly. UCTopic outperforms the state-of-the-art phrase representation model by 38.2% NMI on average on four entity clustering tasks. Comprehensive evaluation on topic mining shows that UCTopic can extract coherent and diverse topical phrases. PDF 11 2021
On the interpretability and significance of bias metrics in texts: a PMI-based approach In recent years, the use of word embeddings has become popular for measuring the presence of biases in texts. Although these measures have proven effective in detecting a wide variety of biases, metrics based on word embeddings lack transparency, explainability and interpretability. In this study, we propose a PMI-based metric to quantify biases in texts. We prove that this metric can be approximated by an odds ratio, which allows estimating the confidence interval and statistical significance of textual bias. This PMI-based measure can be expressed as a function of conditional probabilities, providing a simple interpretation in terms of word co-occurrences. Our approach produces performance comparable to GloVe-based and skip-gram-based metrics in experiments on gender-occupation and gender-name associations. We discuss the advantages and disadvantages of using methods based on first-order vs. second-order co-occurrences, from the point of view of the interpretability of the metric and the sparseness of the data. PDF 11 2021
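To make the odds-ratio view concrete, here is a small sketch of how such a bias score and its confidence interval could be computed from raw co-occurrence counts; the exact contingency table the authors use may differ from this illustration.

```python
import math

def log_odds_ratio_bias(c_xa: int, c_xb: int, c_a: int, c_b: int):
    """Bias of word x toward context A vs. context B from co-occurrence counts.

    c_xa / c_xb: co-occurrences of x with context words A / B;
    c_a / c_b:   total counts of context words A / B.
    Counts are assumed nonzero (apply +0.5 smoothing otherwise).
    Returns the log odds ratio and an approximate 95% confidence interval.
    """
    # 2x2 contingency table: x co-occurs / does not co-occur with A and B.
    a, b = c_xa, c_a - c_xa
    c, d = c_xb, c_b - c_xb
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # Woolf's standard error
    return log_or, (log_or - 1.96 * se, log_or + 1.96 * se)
```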
Fix Bugs with Transformer through a Neural-Symbolic Edit Grammar We introduce NSEdit (neural-symbolic edit), a novel Transformer-based code repair method. Given only the source code that contains bugs, NSEdit predicts an editing sequence that can fix the bugs. The edit grammar is formulated as a regular language, and the Transformer uses it as a neural-symbolic scripting interface to generate editing programs. We modify the Transformer and add a pointer network to select the edit locations. An ensemble of rerankers is trained to re-rank the editing sequences generated by beam search. We fine-tune the rerankers on the validation set to reduce over-fitting. NSEdit is evaluated on various code repair datasets and achieves a new state-of-the-art accuracy ($24.04\%$) on the Tufano small dataset of the CodeXGLUE benchmark. NSEdit performs robustly when programs vary from package to package and when buggy programs are concrete. We conduct a detailed analysis of our methods and demonstrate the effectiveness of each component. PDF 11 2021
DAWSON: Data Augmentation using Weak Supervision On Natural Language We propose a novel data augmentation model for text that uses all available data through weak supervision. To improve generalization, recent work in the field uses BERT and masked language modeling to conditionally augment data. These models all rely on a small, high-quality labeled dataset but ignore the abundance of unlabeled data that is likely to be available whenever such a model is considered. Weak supervision methods, such as Snorkel, make use of the vastness of unlabeled data but largely ignore the available ground-truth labels. We combine data augmentation and weak supervision techniques into a holistic method, consisting of 4 training phases and 2 inference phases, to efficiently train an end-to-end model when only a small amount of annotated data is available. We outperform the benchmark (Kumar et al., 2020) for the SST-2 task by 1.5, the QQP task by 4.4, and the QNLI task by 3.0 absolute accuracy points, and show that data augmentation is also effective for natural language understanding tasks, such as QQP and QNLI. PDF 11 2021
Teaching Models new APIs: Domain-Agnostic Simulators for Task Oriented Dialogue We demonstrate that large language models are able to simulate Task Oriented Dialogues in novel domains, provided only with an API implementation and a list of goals. We show these simulations can formulate online, automatic metrics that correlate well with human evaluations. Furthermore, by filtering for dialogues where goals are met, we can use simulation to repeatedly generate training data and improve the quality of the dialogues themselves. With no human intervention or domain-specific training data, our simulations bootstrap end-to-end models which achieve a 37\% error reduction over baseline in previously unseen domains. By including as few as 32 domain-specific conversations, bootstrapped models can match the performance of a fully-supervised model with $10\times$ more data. PDF 11 2021
Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation The performance of multilingual pretrained models is highly dependent on the availability of monolingual or parallel text present in a target language. Thus, the majority of the world’s languages cannot benefit from recent progress in NLP as they have no or limited textual data. To expand possibilities of using NLP technology in these under-represented languages, we systematically study strategies that relax the reliance on conventional language resources through the use of bilingual lexicons, an alternative resource with much better language coverage. We analyze different strategies to synthesize textual or labeled data using lexicons, and how this data can be combined with monolingual or parallel text when available. For 19 under-represented languages across 3 tasks, our methods lead to consistent improvements of up to 5 and 15 points with and without extra monolingual text respectively. Overall, our study highlights how NLP methods can be adapted to thousands more languages that are under-served by current technology. PDF 11 2021
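As a concrete illustration of lexicon-based data synthesis (one of several strategies the paper studies), a minimal word-for-word projection with positional label transfer might look like this; the lexicon, sentence, labels and fallback rule are all hypothetical:

```python
def synthesize_with_lexicon(src_tokens, lexicon):
    """Word-for-word 'translation' of a high-resource sentence into the
    target language using a bilingual lexicon, keeping labels aligned.

    lexicon: dict mapping source words to a list of target-language words.
    Unmatched words are kept as-is (a common fallback in lexicon-based work).
    """
    return [lexicon[w][0] if w in lexicon else w for w in src_tokens]

# Hypothetical example: an English tagged sentence projected positionally.
lexicon = {"the": ["le"], "dog": ["chien"], "runs": ["court"]}
tokens = ["the", "dog", "runs"]
labels = ["O", "O", "O"]          # labels transfer by position
print(synthesize_with_lexicon(tokens, lexicon), labels)
```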
Psych-E: Configurable Response Generation using Personality Traits and Pragmatics Personality traits influence human actions and thoughts, which manifests in day-to-day conversations. Although glimpses of personality traits are observable in existing open-domain conversation corpora, leveraging generic language modelling for response generation overlooks interlocutor idiosyncrasies, resulting in non-customizable, personality-agnostic responses. Motivated by enabling configurable response generators, in this paper we experiment with ways to ground neural response generators on both (i) interlocutor Big-5 personality traits and (ii) discourse intent as control codes, training an end-to-end dialogue agent that can not only leverage the control codes as a policy for nuanced response generation, but also predict and decide the generation policy to be utilized by the generator. Since most existing large-scale open-domain chat corpora do not include Big-5 personality traits and discourse intent, we employ automatic annotation schemes to enrich the corpora with a policy consisting of noisy estimates of these features as control codes, and leverage automatic evaluation metrics along with ablation studies to assess the impact of using control codes for response generation. Additionally, we leverage human judgement to demonstrate the effectiveness of using such personality- and pragmatics-based policies for response generation. Our experiments illustrate the effectiveness of this strategy, resulting in improvements over existing benchmarks. PDF 11 2021
Improving Multimodal Speech Recognition by Data Augmentation and Speech Representations Multimodal speech recognition aims to improve the performance of automatic speech recognition (ASR) systems by leveraging additional visual information that is usually associated to the audio input. While previous approaches make crucial use of strong visual representations, e.g. by finetuning pretrained image recognition networks, significantly less attention has been paid to its counterpart: the speech component. In this work, we investigate ways of improving the base speech recognition system by following similar techniques to the ones used for the visual encoder, namely, transferring representations and data augmentation. First, we show that starting from a pretrained ASR significantly improves the state-of-the-art performance; interestingly, even when building upon a strong unimodal system, we still find gains by including the visual modality. Second, we employ speech data augmentation techniques to encourage the multimodal system to attend to the visual stimuli. This technique replaces previously used word masking and comes with the benefits of being conceptually simpler and yielding consistent improvements in the multimodal setting. We back up our conclusions by empirical results on three multimodal datasets, including the newly introduced Localized Narratives. PDF 11 2021
Can Explanations Be Useful for Calibrating Black Box Models? One often wants to take an existing, trained NLP model and use it on data from a new domain. While fine-tuning or few-shot learning can be used to adapt the base model, there is no simple recipe for getting these to work; moreover, one may not have access to the original model weights if the model is deployed as a black box. To this end, we study how to improve a black box model's performance on a new domain, given examples from that domain, by leveraging explanations of the model's behavior. Our approach first extracts a set of features combining human intuition about the task with model attributions generated by black box interpretation techniques, and then uses a simple model to calibrate or rerank the model's predictions based on the features. We experiment with our method on two tasks, extractive question answering and natural language inference, covering adaptation from several pairs of domains. The experimental results across all the domain pairs show that explanations are useful for calibrating these models. We show that the calibration features transfer to some extent between tasks and shed light on how to effectively use them. PDF 11 2021
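To make the recipe concrete, a minimal sketch of calibrating a black-box model with a simple feature-based model might look as follows; the specific features and numbers are invented purely for illustration, not taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical feature matrix for held-out new-domain examples: each row
# combines the black-box model's confidence with attribution statistics.
features = np.array([
    [0.92, 0.61, 0.10],   # [model confidence, attribution on salient span, attribution entropy]
    [0.55, 0.12, 0.48],
    [0.88, 0.70, 0.05],
    [0.47, 0.09, 0.55],
])
correct = np.array([1, 0, 1, 0])  # whether the black-box prediction was right

# A simple calibrator trained on the features; at test time its probability
# replaces the raw confidence when deciding whether to trust a prediction.
calibrator = LogisticRegression().fit(features, correct)
print(calibrator.predict_proba(features)[:, 1])
```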
Breaking Down Multilingual Machine Translation While multilingual training is now an essential ingredient in machine translation (MT) systems, recent work has demonstrated that it has different effects in different multilingual settings, such as many-to-one, one-to-many, and many-to-many learning. These training settings expose the encoder and the decoder of a machine translation model to different data distributions. In this paper, we examine how different varieties of multilingual training contribute to learning these two components of the MT model. Specifically, we compare bilingual models with encoders and/or decoders initialized by multilingual training. We show that multilingual training is beneficial to encoders in general, while it only benefits decoders for low-resource languages (LRLs). We further find the important attention heads for each language pair and compare their correlations during inference. Our analysis sheds light on how multilingual translation models work and also enables us to propose methods to improve performance by training with highly related languages. Our many-to-one models for high-resource languages and one-to-many models for LRLs outperform the best results reported by Aharoni et al. (2019). PDF 11 2021
RNG-KBQA: Generation Augmented Iterative Ranking for Knowledge Base Question Answering Existing KBQA approaches, despite achieving strong performance on i.i.d. test data, often struggle in generalizing to questions involving unseen KB schema items. Prior ranking-based approaches have shown some success in generalization, but suffer from the coverage issue. We present RnG-KBQA, a Rank-and-Generate approach for KBQA, which remedies the coverage issue with a generation model while preserving a strong generalization capability. Our approach first uses a contrastive ranker to rank a set of candidate logical forms obtained by searching over the knowledge graph. It then introduces a tailored generation model conditioned on the question and the top-ranked candidates to compose the final logical form. We achieve new state-of-the-art results on GrailQA and WebQSP datasets. In particular, our method surpasses the prior state-of-the-art by a large margin on the GrailQA leaderboard. In addition, RnG-KBQA outperforms all prior approaches on the popular WebQSP benchmark, even including the ones that use the oracle entity linking. The experimental results demonstrate the effectiveness of the interplay between ranking and generation, which leads to the superior performance of our proposed approach across all settings with especially strong improvements in zero-shot generalization. PDF 11 2021
Event-Event Relation Extraction using Probabilistic Box Embedding To understand a story with multiple events, it is important to capture the proper relations across these events. However, existing event relation extraction (ERE) frameworks regard it as a multi-class classification task and do not guarantee any coherence between different relation types, such as anti-symmetry. If a phone line "died" after a "storm", it is obvious that the "storm" happened before the line "died". Current ERE frameworks do not guarantee this coherence and thus enforce it via a constraint loss function (Wang et al., 2020). In this work, we propose to modify the underlying ERE model to guarantee coherence by representing each event as a box (BERE), without applying explicit constraints. In our experiments, BERE also shows stronger conjunctive constraint satisfaction while performing on par with or better in F1 than previous models with constraint injection. PDF 11 2021
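The abstract does not detail the box parameterization (probabilistic box models typically use smoothed, e.g. Gumbel, boxes); a hard-box sketch is enough to convey why boxes can encode anti-symmetric relations such as temporal precedence, since vol(A ∩ B)/vol(A) generally differs from vol(A ∩ B)/vol(B):

```python
import torch

def box_containment_prob(min_a, max_a, min_b, max_b, eps=1e-6):
    """P(B | A) under hard box embeddings: volume(A ∩ B) / volume(A).

    Each event is a box given by per-dimension min/max corners (1-D tensors).
    The asymmetry of this quantity lets boxes encode anti-symmetric relations.
    """
    inter_min = torch.maximum(min_a, min_b)
    inter_max = torch.minimum(max_a, max_b)
    inter_vol = torch.clamp(inter_max - inter_min, min=0).prod()
    vol_a = torch.clamp(max_a - min_a, min=0).prod()
    return inter_vol / (vol_a + eps)
```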
Towards Robust Online Dialogue Response Generation Although pre-trained sequence-to-sequence models have achieved great success in dialogue response generation, chatbots still suffer from generating inconsistent responses in real-world practice, especially in multi-turn settings. We argue that this can be caused by a discrepancy between training and real-world testing. At training time, chatbots generate responses given the gold context, while at test time they must generate based on a context consisting of both user utterances and previously model-predicted utterances. As the number of utterances grows, this discrepancy becomes more serious in multi-turn settings. In this paper, we propose a hierarchical sampling-based method consisting of both utterance-level sampling and semi-utterance-level sampling to alleviate the discrepancy, which implicitly increases dialogue coherence. We further adopt reinforcement learning and re-ranking methods to explicitly optimize dialogue coherence during training and inference, respectively. Empirical experiments show the effectiveness of the proposed methods for improving the robustness of chatbots in real practice. PDF 11 2021
Data Augmentation and Learned Layer Aggregation for Improved Multilingual Language Understanding in Dialogue Scaling dialogue systems to a multitude of domains, tasks and languages relies on costly and time-consuming data annotation for different domain-task-language configurations. The annotation efforts might be substantially reduced by the methods that generalise well in zero- and few-shot scenarios, and also effectively leverage external unannotated data sources (e.g., Web-scale corpora). We propose two methods to this aim, offering improved dialogue natural language understanding (NLU) across multiple languages: 1) Multi-SentAugment, and 2) LayerAgg. Multi-SentAugment is a self-training method which augments available (typically few-shot) training data with similar (automatically labelled) in-domain sentences from large monolingual Web-scale corpora. LayerAgg learns to select and combine useful semantic information scattered across different layers of a Transformer model (e.g., mBERT); it is especially suited for zero-shot scenarios as semantically richer representations should strengthen the model's cross-lingual capabilities. Applying the two methods with state-of-the-art NLU models obtains consistent improvements across two standard multilingual NLU datasets covering 16 diverse languages. The gains are observed in zero-shot, few-shot, and even in full-data scenarios. The results also suggest that the two methods achieve a synergistic effect: the best overall performance in few-shot setups is attained when the methods are used together. PDF 11 2021
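LayerAgg's exact selection-and-combination mechanism is not specified in the abstract; for orientation, a common baseline form of learned layer aggregation (ELMo-style scalar mixing over transformer layers) is sketched below, with class and shape conventions of our own choosing:

```python
import torch
import torch.nn as nn

class ScalarLayerMix(nn.Module):
    """Learned softmax-weighted combination of all transformer layers."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_states: torch.Tensor) -> torch.Tensor:
        # layer_states: (num_layers, batch, seq, hidden), e.g. the stacked
        # hidden_states of an mBERT forward pass.
        w = torch.softmax(self.weights, dim=0)
        return self.gamma * torch.einsum("l,lbsh->bsh", w, layer_states)
```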
Towards Coherent Visual Storytelling with Ordered Image Attention We address the problem of visual storytelling, i.e., generating a story for a given sequence of images. While each story sentence should describe a corresponding image, a coherent story also needs to be consistent and relate to both future and past images. Current approaches encode images independently, disregarding relations between images. Our approach learns to encode images with different interactions based on the story position (i.e., past image or future image). To this end, we develop a novel message-passing-like algorithm for ordered image attention (OIA) that collects interactions across all the images in the sequence. Finally, to generate the story's sentences, a second attention mechanism picks the important image attention vectors with an Image-Sentence Attention (ISA). The obtained results improve the METEOR score on the VIST dataset by 1%. Furthermore, a thorough human study confirms the improvements and demonstrates that order-based interactions significantly improve coherency (64.20% vs. 28.70%). PDF 11 2021
ASSIST: Towards Label Noise-Robust Dialogue State Tracking The MultiWOZ 2.0 dataset has greatly boosted the research on dialogue state tracking (DST). However, substantial noise has been discovered in its state annotations. Such noise brings about huge challenges for training DST models robustly. Although several refined versions, including MultiWOZ 2.1-2.4, have been published recently, there are still lots of noisy labels, especially in the training set. Besides, it is costly to rectify all the problematic annotations. In this paper, instead of improving the annotation quality further, we propose a general framework, named ASSIST (lAbel noiSe-robuSt dIalogue State Tracking), to train DST models robustly from noisy labels. ASSIST first generates pseudo labels for each sample in the training set by using an auxiliary model trained on a small clean dataset, then puts the generated pseudo labels and vanilla noisy labels together to train the primary model. We show the validity of ASSIST theoretically. Experimental results also demonstrate that ASSIST improves the joint goal accuracy of DST by up to $28.16\%$ on the initial version MultiWOZ 2.0 and $8.41\%$ on the latest version MultiWOZ 2.4, respectively. PDF 11 2021
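The precise way ASSIST combines the two label sources is not given in the abstract; one simple interpolated-loss reading of "putting the generated pseudo labels and vanilla noisy labels together" is sketched below. The weighting scheme and soft/hard label split are our assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def combined_label_loss(logits, noisy_labels, pseudo_probs, alpha=0.6):
    """Interpolate between vanilla noisy labels and auxiliary-model pseudo
    labels when training the primary DST model.

    logits:       (batch, num_values) slot-value scores of the primary model.
    noisy_labels: (batch,) label indices from the original annotation.
    pseudo_probs: (batch, num_values) soft labels from the auxiliary model.
    """
    ce_noisy = F.cross_entropy(logits, noisy_labels)
    ce_pseudo = -(pseudo_probs * F.log_softmax(logits, dim=-1)).sum(-1).mean()
    return alpha * ce_pseudo + (1 - alpha) * ce_noisy
```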
AMRize, then Parse! Enhancing AMR Parsing with PseudoAMR Data As Abstract Meaning Representation (AMR) implicitly involves compound semantic annotations, we hypothesize that auxiliary tasks which are semantically or formally related can better enhance AMR parsing. With carefully designed control experiments, we find that 1) semantic role labeling (SRL) and dependency parsing (DP) bring much more significant performance gains than unrelated tasks in the text-to-AMR transition; 2) to make a better fit for AMR, data from auxiliary tasks should be properly "AMRized" to PseudoAMR before training; 3) the intermediate-task training paradigm outperforms multitask learning when introducing auxiliary tasks to AMR parsing. From an empirical perspective, we propose a principled method to choose, reform, and train auxiliary tasks to boost AMR parsing. Extensive experiments show that our method achieves new state-of-the-art performance on in-distribution, out-of-distribution, and few-shot benchmarks of AMR parsing. PDF 11 2021
How Do Seq2Seq Models Perform on End-to-End Data-to-Text Generation? With the rapid development of deep learning, the Seq2Seq paradigm has become prevalent for end-to-end data-to-text generation, and BLEU scores have been increasing in recent years. However, it is widely recognized that there is still a gap between the quality of texts generated by models and texts written by humans. In order to better understand the ability of Seq2Seq models, evaluate their performance and analyze the results, we use the Multidimensional Quality Metric (MQM) to evaluate several representative Seq2Seq models on end-to-end data-to-text generation. We annotate the outputs of five models on four datasets with eight error types and find that 1) the copy mechanism helps with Omission and Inaccuracy Extrinsic errors but increases other types of errors such as Addition; 2) pre-training techniques are highly effective, and both pre-training strategy and model size are significant factors; 3) the structure of the dataset also greatly influences the model's performance; 4) some specific types of errors are generally challenging for Seq2Seq models. PDF 11 2021
ERNIE-SPARSE: Learning Hierarchical Efficient Transformer Through Regularized Self-Attention The Sparse Transformer has recently attracted a lot of attention due to its ability to reduce the quadratic dependency on sequence length. We argue that two factors, information bottleneck sensitivity and inconsistency between different attention topologies, can affect the performance of the Sparse Transformer. This paper proposes a well-designed model named ERNIE-Sparse. It consists of two distinctive parts: (i) a Hierarchical Sparse Transformer (HST) to sequentially unify local and global information, and (ii) a Self-Attention Regularization (SAR) method, a novel regularization designed to minimize the distance between transformers with different attention topologies. To evaluate the effectiveness of ERNIE-Sparse, we perform extensive evaluations. First, we perform experiments on a multi-modal long-sequence modeling task benchmark, Long Range Arena (LRA). Experimental results demonstrate that ERNIE-Sparse significantly outperforms a variety of strong baseline methods, including dense attention and other efficient sparse attention methods, achieving an improvement of 2.77% (57.78% vs. 55.01%). Second, to further show the effectiveness of our method, we pretrain ERNIE-Sparse and verify it on 3 text classification and 2 QA downstream tasks, achieving improvements of 0.83% on the classification benchmark (92.46% vs. 91.63%) and 3.24% on the QA benchmark (74.67% vs. 71.43%). These results continue to demonstrate its superior performance. PDF 11 2021
Morphosyntactic Tagging with Pre-trained Language Models for Arabic and its Dialects We present state-of-the-art results on morphosyntactic tagging across different varieties of Arabic using fine-tuned pre-trained transformer language models. Our models consistently outperform existing systems in Modern Standard Arabic and all the Arabic dialects we study, achieving 2.6% absolute improvement over the previous state-of-the-art in Modern Standard Arabic, 2.8% in Gulf, 1.6% in Egyptian, and 8.3% in Levantine. We explore different training setups for fine-tuning pre-trained transformer language models, including training data size, the use of external linguistic resources, and the use of annotated data from other dialects in a low-resource scenario. Our results show that strategic fine-tuning using datasets from other high-resource dialects is beneficial for a low-resource dialect. Additionally, we show that high-quality morphological analyzers as external linguistic resources are beneficial especially in low-resource settings. PDF 11 2021
NEWSFARM: the Largest Chinese Corpus for Long News Summarization Recently, driven by a large number of datasets, the field of natural language processing (NLP) has developed rapidly. However, the lack of large-scale, high-quality Chinese datasets is still a critical bottleneck for further research on automatic text summarization. To close this gap, we searched Chinese news websites of domestic and foreign media, and designed the algorithm HSS (hidden text topic, semantic similarity, and syntactic similarity) to crawl and filter these records to construct NEWSFARM. NEWSFARM is the largest and highest-quality Chinese long-news summarization corpus, containing more than 200K Chinese long news articles and summaries written by professional editors or authors, all of which are released to the public. Based on the corpus, we compute static metrics and design many experiments with baseline models. Comparison with common datasets demonstrates the high quality of our dataset and its training effect on the models, which not only shows the usefulness and challenges of the proposed corpus for automatic text summarization but also provides a benchmark for further research. PDF 11 2021
MDG: Metaphorical Sentence Detection and Generation with Masked Metaphor Modeling This study tackles literal-to-metaphorical sentence generation, presenting a framework that can potentially lead to the production of an infinite number of new metaphors. To achieve this goal, we propose a complete workflow that tackles metaphorical sentence classification and metaphor reconstruction. Unlike similar research on metaphor generation, our approach does not require any custom or closed-source model; hence, with this work we introduce a complete literal-to-metaphorical open-source model. The obtained results show that a good proportion of originally literal sentences, drawn from different data sources and topics, are turned metaphorical. Human evaluation shows that our constructed metaphors are considered more fluent, creative and metaphorical than figurative statements created by a real person. Furthermore, by using our artificial data to increase the training size of a metaphorical sentence classification dataset, we register an improvement of 3% over the baseline. PDF 11 2021
Roles of Words: What Should (n’t) Be Augmented in Text Augmentation on Text Classification Tasks? Text augmentation techniques are widely used in text classification problems to improve the performance of classifiers, especially in low-resource scenarios. Previous text-editing-based methods augment the text in a non-selective manner: the words in the text are treated without distinction during augmentation, which may result in unsatisfactory augmented samples. In this work, we present four kinds of roles of words (ROWs) which have different functions in text classification tasks, and design effective methods to automatically extract these ROWs from statistical and semantic perspectives. Systematic experiments are conducted on which ROWs should (or shouldn't) be augmented for classification tasks. Based on these experiments, we discover some interesting and instructive patterns suggesting that certain ROWs are especially suitable or unsuitable for certain augmentation operations. Guided by these patterns, we propose a set of Selective Text Augmentation (STA) operations, which significantly outperform traditional methods and show outstanding generalization performance. PDF 11 2021
Few-shot Controllable Style Transfer for Low-Resource Multilingual Settings Style transfer is the task of rewriting an input sentence into a target style while approximately preserving its content. While most prior literature assumes access to large style-labelled corpora, recent work (Riley et al. 2021) has attempted "few-shot" style transfer using only 3-10 sentences at inference for extracting the target style. In this work we study a relevant low-resource setting: style transfer for languages where no style-labelled corpora are available. We find that existing few-shot methods perform this task poorly, with a strong tendency to copy inputs verbatim. We push the state of the art for few-shot style transfer with a new method modeling the stylistic difference between paraphrases. When compared to prior work using automatic and human evaluations, our model achieves 2-3x better performance and output diversity in formality transfer and code-mixing addition across seven languages. Moreover, our method is better able to control the amount of style transfer using an input scalar knob. We report promising qualitative results for several attribute transfer directions, including sentiment transfer, text simplification, gender neutralization and text anonymization, all without retraining the model. Finally, we find model evaluation to be difficult due to the lack of evaluation datasets and metrics for many languages. To facilitate further research on formality transfer for Indic languages, we crowdsource annotations for 4000 sentence pairs in four languages, and use this dataset to design our automatic evaluation suite. PDF 11 2021
HSC-Rocket: An interactive dialogue assistant to make agents composing service better through human feedback In today's dynamic service environment, fast and efficient service composition has attracted great attention in recent years. Users prefer to express their personal requirements in natural language, and their real-time feedback can, to a great extent, reflect the effect of service composition. Consequently, this paper designs an interactive dialogue assistant, HSC-Rocket, to better provide service composition by considering human feedback. First, we propose a human-computer interaction dynamic service composition algorithm based on reinforcement learning. The design of the reward mechanism considers both quality of service (QoS) and real-time feedback, which can more accurately meet users' demands. Then, the functional requirements are analyzed through word embeddings to realize the dynamic composition of abstract and concrete services. Furthermore, we utilize a sample enhancement method to alleviate the scarcity of sample data in the initial stage of user interaction, which improves the robustness of our system. Accordingly, we have implemented the HSC-Rocket prototype, which allows users to express multi-domain dialogue requirements. Extensive experiments on the RapidAPI dataset demonstrate the superiority and effectiveness of HSC-Rocket. PDF 11 2021
Generating Authentic Adversarial Examples beyond Meaning-preserving with Doubly Round-trip Translation Generating adversarial examples for Neural Machine Translation (NMT) with single Round-Trip Translation (RTT) has achieved promising results by releasing the meaning-preserving restriction. However, a potential pitfall for this approach is that we cannot decide whether the generated examples are adversarial to the target NMT model or the auxiliary backward one, as the reconstruction error through the RTT can be related to either. To remedy this problem, we propose a new definition for NMT adversarial examples based on the Doubly Round-Trip Translation (DRTT). Specifically, apart from the source-target-source RTT, we also consider the target-source-target one, which is utilized to pick out the authentic adversarial examples for the target NMT model. Additionally, to enhance the robustness of the NMT model, we introduce the masked language models to construct bilingual adversarial pairs based on DRTT, which are used to train the NMT model directly. Extensive experiments on both the clean and noise test sets (including the artificial and natural noise) show that our approach substantially improves the robustness of NMT models. PDF 11 2021
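The core round-trip signal can be made concrete with a small sketch; the translation callables are hypothetical stand-ins for the forward and backward MT systems, and the token-overlap score is a simplification of the reconstruction error the approach relies on:

```python
def reconstruction_gap(src: str, forward, backward) -> float:
    """Round-trip a sentence and measure how much of it survives.

    forward / backward: stand-in callables (hypothetical here) for
    source->target and target->source MT systems.
    Returns a value in [0, 1]; 0 means perfect reconstruction.
    """
    round_trip = backward(forward(src))
    src_toks = set(src.lower().split())
    rtt_toks = set(round_trip.lower().split())
    overlap = len(src_toks & rtt_toks)
    return 1.0 - overlap / max(len(src_toks), 1)

# DRTT additionally runs the target-source-target direction, so that only
# examples whose error persists in both directions count as adversarial
# for the target model rather than for the auxiliary backward one.
```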
Improving Robustness of Language Models from a Geometry-aware Perspective Recent studies have found that removing the norm-bounded projection and increasing search steps in adversarial training can significantly improve robustness. However, we observe that too large a number of search steps can hurt accuracy. We aim to obtain strong robustness efficiently using fewer steps. Through a toy experiment, we find that perturbing the clean data to the decision boundary, without crossing it, does not degrade test accuracy. Inspired by this, we propose friendly adversarial data augmentation (FADA) to generate "friendly" adversarial data. On top of FADA, we propose geometry-aware adversarial training (GAT) to perform adversarial training (e.g., FGM) on friendly adversarial data so that we can save a large number of search steps. Comprehensive experiments across two widely used datasets and three pre-trained language models demonstrate that GAT can obtain stronger robustness via fewer steps. In addition, we provide extensive empirical results and in-depth analyses on robustness to facilitate future studies. PDF 11 2021
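For reference, FGM (the single-step method mentioned above) perturbs the word embeddings along the normalized loss gradient; a minimal sketch for an NLP model follows, with the helper name and restore convention our own:

```python
import torch

def fgm_attack(embedding: torch.nn.Embedding, epsilon: float = 1.0) -> torch.Tensor:
    """Add an FGM perturbation r = epsilon * g / ||g|| to the embedding table,
    using the gradient accumulated by a preceding loss.backward() call.
    Returns the perturbation so it can be subtracted to restore the weights."""
    grad = embedding.weight.grad
    if grad is None:
        return torch.zeros_like(embedding.weight)
    norm = torch.norm(grad)
    if norm == 0 or torch.isnan(norm):
        return torch.zeros_like(embedding.weight)
    r_adv = epsilon * grad / norm
    embedding.weight.data.add_(r_adv)
    return r_adv

# Typical loop: loss.backward(); r = fgm_attack(emb); adversarial forward and
# backward pass; emb.weight.data.sub_(r); optimizer.step(); optimizer.zero_grad()
```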
Making Small Language Models Better Few-Shot Learners Large-scale language models coupled with prompts have shown remarkable performance on few-shot learning. However, through systematic experiments, we find that the few-shot performance of small language models is poor, and using prompts on them brings fewer improvements than on larger ones. In this paper, we propose \textbf{SMASH}, an approach to improve \textbf{SMA}ll language models' few-\textbf{SH}ot ability by training on intermediate tasks before prompt-based fine-tuning on downstream tasks. We design intermediate tasks for sentence-pair tasks and single-sentence classification tasks by creating training examples with prompt templates similar to downstream tasks, using sentences sampled from a large-scale unsupervised corpus, and apply knowledge distillation to distill from the outputs of larger pre-trained models as the training objective. We conduct extensive experiments and show that SMASH can make a 6-layer DistilRoBERTa-base achieve performance on few-shot datasets comparable to a 12-layer RoBERTa-base at a low cost. PDF 11 2021
Challenges in Region-Specific Image Captioning: A Deep Learning Approach Region-specific image captioning is the task of generating a caption from an image such that the caption is about the specific region in that image. This paper describes the challenges involved in region-specific image captioning and provides several methods to utilize the region-specific features to enhance the quality of the captions, in addition to utilizing the features from the whole image. Our experiments on real-world data sets demonstrate that generating region-specific captions is challenging even after utilizing the information specific to the region. We analyze the variables impacting the quality of the captions, which include the bounding box size and the region-specific feature extractor. PDF 11 2021
WeTS: A Benchmark for Translation Suggestion Translation suggestion (TS), which provides alternatives for specific words or phrases given the entire documents generated by machine translation (MT), has been proven to play a significant role in post-editing (PE). There are two main pitfalls in previous research along this line. First, most conventional works only focus on the overall performance of PE but ignore the exact performance of TS, which makes the progress of PE sluggish and less explainable. Second, as there is no publicly available gold dataset to support in-depth research on TS, almost all previous works conduct experiments on in-house datasets or noisy datasets built automatically, which makes their experiments hard to reproduce and compare. To overcome these limitations and spur research on TS, we create a benchmark dataset, called \emph{WeTS}, which is a golden corpus annotated by expert translators covering four translation directions. Apart from the golden corpus, we also propose several methods to generate a synthetic corpus which can substantially improve performance through pre-training. As for the model, we propose a segment-aware self-attention-based Transformer for TS. Experimental results show that our approach achieves state-of-the-art results on all four directions, including English-to-German, German-to-English, Chinese-to-English, and English-to-Chinese. PDF 11 2021
Lot or Not: Identifying Multi-Quantity Offerings in E-Commerce The term \textit{lot} in e-commerce is defined to mean an offering that contains a collection of multiple identical items for sale. In a large online marketplace, lot offerings play an important role, allowing buyers and sellers to set price levels to optimally balance supply and demand needs. In spite of their central role, e-commerce platforms often struggle to identify lot offerings, since explicit lot status identification is frequently not provided by sellers. The ability to identify lot offerings plays a key role in many fundamental e-commerce tasks, from matching offerings to catalog products, through ranking e-commerce search results, to providing effective pricing guidance. In this work, we seek to determine the lot status (and lot size) of each offering, in order to facilitate an improved buyer experience, while reducing the friction for sellers posting new offerings. We demonstrate experimentally the ability to accurately classify offerings as lots (and predict their lot size) using only the offer title, by adapting state-of-the-art natural language techniques to the lot identification problem. PDF 11 2021
A Neural Approach to KGQA via SPARQL Silhouette Generation Semantic parsing is a predominant approach to the Knowledge Graph Question Answering (KGQA) task, in which a natural language question is translated into a logical form such as SPARQL. Semantic parsing-based solutions are mostly modular/pipelined, where noise introduced by the upstream entity/relation linking modules makes it hard to solve complex questions. Recently, Neural Machine Translation (NMT) based approaches have emerged that are capable of handling complex questions. However, NMT-based approaches struggle to handle the large number of test entities and relations that are unseen during training. In this work, we propose a modular two-stage neural approach which combines the best of both worlds: NMT and the semantic parsing pipeline. Stage-I of our approach comprises an NMT-based seq2seq module that translates a question into a sketch of the desired SPARQL, called the SPARQL silhouette. This stage also contains a noise simulator which combines a masking scheme with an entity/relation linker in a novel manner, so as to take care of unseen entities/relations without blowing up the vocabulary of the seq2seq module. Stage-II of our approach comprises a Neural Graph Search (NGS) module which aims to distil the SPARQL silhouette in order to reduce the entity/relation linking noise. Experimental results show that the quality of the generated SPARQL silhouette is impressive for an ideal scenario where the entity/relation linker is noise-free. For the realistic scenario (i.e. a noisy linker), the quality of the SPARQL silhouette drops, but our NGS module recovers it considerably. We show that our proposed approach improves the state of the art on the LC-QuAD-1 dataset by an absolute margin of $3.72\%$ $F_1$. PDF 11 2021
Rethinking Offensive Text Detection as a Multi-Hop Reasoning Problem We introduce the task of implicit offensive text detection in dialogues, where a statement may have either an offensive or non-offensive interpretation, depending on the listener and context. We argue that reasoning is crucial for understanding this broader class of offensive utterances, and create Mh-RIOT ($\textbf{M}$ulti-hop $\textbf{R}$easoning $\textbf{I}$mplicitly $\textbf{O}$ffensive $\textbf{T}$ext Dataset) to support research on this task. Experiments using the dataset show that state-of-the-art methods of offense detection perform poorly when asked to detect implicitly offensive statements, achieving only ${\sim} 0.11$ accuracy. In contrast to existing offensive text detection datasets, Mh-RIOT features human-annotated chains of reasoning which describe the mental process by which an offensive interpretation can be reached from each ambiguous statement. We explore the potential for a multi-hop reasoning approach by utilizing existing entailment models to score the transitions of these chains, and show that even naive reasoning models can result in improved performance in most situations. Analysis of the chains provides insight into the human interpretation process and emphasizes the importance of incorporating additional commonsense knowledge. PDF 11 2021
Investigating Logic Tensor Networks for Neural-Symbolic Argument Mining We present an application of neural-symbolic learning to argument mining. We use Logic Tensor Networks to train neural models to jointly fit the data and satisfy specific domain rules. Our experiments on a corpus of scientific abstracts indicate that including symbolic rules during the training process improves classification performance, compliance with the rules, and robustness of the results. PDF 11 2021
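Logic Tensor Networks ground logical rules in fuzzy semantics so they become differentiable penalties; the stripped-down sketch below shows one such constraint in plain PyTorch. It is not the ltn library API, and the rule and tensor names are our own illustration.

```python
import torch

def rule_violation_loss(link_prob: torch.Tensor,
                        evidence_prob: torch.Tensor) -> torch.Tensor:
    """Penalty for violating the rule 'if sentence i supports a claim,
    then sentence i should be classified as evidence'.

    link_prob:     (batch,) predicted probability of a support relation.
    evidence_prob: (batch,) predicted probability of the 'evidence' label.
    Uses the Reichenbach fuzzy implication I(a, b) = 1 - a + a * b.
    """
    implication = 1.0 - link_prob + link_prob * evidence_prob
    # Maximizing rule truth == minimizing the average of (1 - truth).
    return (1.0 - implication).mean()
```

Added to the usual classification loss, this term pushes the model toward rule-compliant predictions without any hard constraint.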
One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia NLP research is impeded by a lack of resources and awareness of the challenges presented by underrepresented languages and dialects. Focusing on the languages spoken in Indonesia, the second most linguistically diverse and the fourth most populous nation in the world, we provide an overview of the current state of NLP research for Indonesia's 700+ languages. We highlight challenges in Indonesian NLP and how these affect the performance of current NLP systems. Finally, we provide general recommendations to help develop NLP technology not only for the languages of Indonesia but also for other underrepresented languages. PDF 11 2021
Question-Led Semantic Structure Enhanced Attentions for VQA Exploiting semantic structure in the visual question answering (VQA) task is a trending topic, with researchers interested in leveraging internal semantics and bringing in external knowledge to tackle more complex questions. Prevailing approaches either encode the external knowledge separately from the local context, which greatly increases the complexity of the ensemble system, or use graph neural networks to model the semantic structure in the context, which suffer from limited reasoning capability due to their relatively shallow networks. In this work, we propose a question-led structure extraction scheme using external knowledge and explore multiple training methods, including direct attention supervision, SGHMC-EM Bayesian multitask learning, and masking strategies, to aggregate the structural knowledge into deep models without changing their architectures. We conduct extensive experiments on two domain-specific but challenging sub-tasks of the VrR-VG dataset and demonstrate that our proposed methods achieve significant improvements over strong baselines, showing promising potential for broader applicability. PDF 11 2021
Cross-Lingual UMLS Named Entity Linking using UMLS Dictionary Fine-Tuning We study cross-lingual UMLS named entity linking, where mentions in a given source language are mapped to UMLS concepts, most of which are labeled in English. We propose a general solution that can be easily adapted to any source language and demonstrate the method on Hebrew documents. Our cross-lingual framework includes an offline unsupervised construction of a bilingual UMLS dictionary and a per-document pipeline which identifies UMLS candidate mentions and uses a fine-tuned pretrained transformer language model to filter candidates according to context. Our method exploits a small dataset of manually annotated UMLS mentions in the source language and uses this supervised data in two ways: to extend the unsupervised UMLS dictionary and to fine-tune the contextual filtering of candidate mentions in full documents. Our method addresses cross-lingual UMLS NEL in a low resource setting, where the ontology is large, there is a lack of descriptive text defining most entities, and labeled data can only cover a small portion of the ontology. We demonstrate results of our approach on both Hebrew and English. We achieve new state-of-the-art results on the Hebrew Camoni corpus, +8.9 F1 on average across three communities in the dataset. We also achieve new SOTA on the English dataset MedMentions with +7.3 F1. PDF 11 2021
Weight Squeezing: Reparameterization for Knowledge Transfer and Model Compression In this work, we present a novel approach to simultaneous knowledge transfer and model compression called \textbf{Weight Squeezing}. With this method, we perform knowledge transfer from a teacher model \textbf{by learning the mapping from its weights to smaller student model weights}. We applied Weight Squeezing to a pre-trained text classification model based on a BERT-Medium model. We compared our method to various other knowledge transfer and model compression methods using the GLUE multitask benchmark. We observed that our approach produces better results while being significantly faster than other methods for training student models. We also propose a variant of Weight Squeezing called Gated Weight Squeezing, in which we combine fine-tuning a small BERT model with learning a mapping from larger BERT weights. We show that, in most cases, fine-tuning a BERT model with Gated Weight Squeezing outperforms plain fine-tuning. PDF 11 2021
Zero-shot Cross-lingual Conversational Semantic Role Labeling While conversational semantic role labeling (CSRL) has shown its usefulness on Chinese conversational tasks, it is still under-explored in non-Chinese languages due to the lack of multilingual CSRL annotations for parser training. To avoid expensive data collection and the error propagation of translation-based methods, we present a simple but effective approach to perform zero-shot cross-lingual CSRL. Our model implicitly learns language-agnostic, conversational structure-aware and semantically rich representations with hierarchical encoders and elaborately designed pre-training objectives. Experimental results show that our cross-lingual model not only outperforms baselines by large margins but is also robust to low-resource scenarios. More importantly, we confirm the usefulness of CSRL to English conversational tasks such as question-in-context rewriting and multi-turn dialogue response generation by incorporating the CSRL information into downstream conversation-based models. We believe this finding is significant and will facilitate research on English dialogue tasks which suffer from the problems of ellipsis and anaphora. PDF 11 2021
Modeling speech recognition and synthesis simultaneously: Encoding and decoding lexical and sublexical semantic information into speech with no access to speech data Human speakers encode information into raw speech which is then decoded by the listeners. This complex relationship between encoding (production) and decoding (perception) is often modeled separately. Here, we test how decoding of lexical and sublexical semantic information can emerge automatically from raw speech in unsupervised generative deep convolutional networks that combine both the production and perception principle. We introduce, to our knowledge, the most challenging objective in unsupervised lexical learning: an unsupervised network that must learn to assign unique representations for lexical items with no direct access to training data. We train several models (ciwGAN and fiwGAN by Beguš 2021) and test how the networks classify raw acoustic lexical items in the unobserved test data. Strong evidence in favor of lexical learning emerges. The architecture that combines the production and perception principles is thus able to learn to decode unique information from raw acoustic data in an unsupervised manner without ever accessing real training data. We propose a technique to explore lexical and sublexical learned representations in the classifier network. The results bear implications for both unsupervised speech synthesis and recognition as well as for unsupervised semantic modeling as language models increasingly bypass text and operate from raw acoustics. PDF 11 2021
An Information-theoretic Approach to Prompt Engineering Without Ground Truth Labels Pre-trained language models derive substantial linguistic and factual knowledge from the massive corpora on which they are trained, and prompt engineering seeks to align these models to specific tasks. Unfortunately, existing prompt engineering methods require significant amounts of labeled data, access to model parameters, or both. We introduce a new method for selecting prompt templates \textit{without labeled examples} and \textit{without direct access to the model}. Specifically, over a set of candidate templates, we choose the template that maximizes the mutual information between the input and the corresponding model output. Across 8 datasets representing 7 distinct NLP tasks, we show that when a template has high mutual information, it also has high accuracy on the task. On the largest model, selecting prompts with our method gets 90\% of the way from the average prompt accuracy to the best prompt accuracy and requires no ground truth labels. PDF 11 2021
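A minimal sketch of the mutual-information selection criterion, using the standard decomposition I(X;Y) = H(Y) - H(Y|X) with inputs treated as uniform; the template scores and array shapes here are illustrative assumptions, not the paper's exact estimator.

    import numpy as np

    def entropy(p, eps=1e-12):
        p = np.asarray(p, dtype=float)
        return -np.sum(p * np.log(p + eps))

    def mutual_information(output_dists):
        # output_dists: (n_unlabeled_inputs, n_answers) model output
        # probabilities obtained by prompting with one candidate template.
        output_dists = np.asarray(output_dists, dtype=float)
        h_marginal = entropy(output_dists.mean(axis=0))              # H(Y)
        h_conditional = np.mean([entropy(d) for d in output_dists])  # H(Y|X)
        return h_marginal - h_conditional

    template_a = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]]  # confident and varied
    template_b = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]  # uninformative
    best = max([template_a, template_b], key=mutual_information)  # template_a

A template whose outputs are confident yet varied across inputs scores high, without ever consulting gold labels.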
Hate Speech and Counter Speech Detection: Context Does Matter Hate speech is plaguing cyberspace through user-generated content. Adding counter speech has become an effective way to combat hate speech online. Existing datasets and models target either (a) hate speech or (b) hate and counter speech, but disregard the context. This paper investigates the role of context in the annotation and detection of online hate and counter speech, where context is defined as the preceding comment in a conversation thread. We created a context-aware dataset for a 3-way classification task on Reddit comments: hate speech, counter speech, or neutral. Our analyses indicate that context is critical to identify hate and counter speech: human judgments change for most comments depending on whether we show annotators the context. A linguistic analysis draws insights into the language people use to express hate and counter speech. Experimental results show that neural networks obtain significantly better results if context is taken into account. We also present qualitative error analyses shedding light on (a) when and why context is beneficial and (b) the remaining errors made by our best model when context is taken into account. PDF 12 2021
Towards Faithful Personalized Response Selection in Retrieval Based Dialog Systems Personalized response selection systems are generally grounded on persona. However, the angle of emotion influencing response selection has not been explored. Moreover, the faithfulness of these systems to the conversation context plunges when a contradictory or off-topic response is selected. This paper attempts to address these issues by proposing a suite of fusion strategies that capture the interaction between the persona, emotion, and entailment information of the utterances. A concept-flow encoder is designed which captures the relevant concept knowledge in both contexts and responses. Ablation studies on the Persona-Chat dataset show that incorporating emotion and entailment improves the accuracy of response selection. We combine our fusion strategies and concept-flow encoding to train a BERT-based model which outperforms previous methods by margins larger than 1.9% on original personas and 1.7% on revised personas in terms of hits@1 (top-1 accuracy), achieving a new state-of-the-art performance on the Persona-Chat dataset. PDF 12 2021
Time Waits for No One! Analysis and Challenges of Temporal Misalignment When an NLP model is trained on text data from one time period and tested or deployed on data from another, the resulting temporal misalignment can degrade end-task performance. In this work, we establish a suite of eight diverse tasks across different domains (social media, science papers, news, and reviews) and periods of time (spanning five years or more) to quantify the effects of temporal misalignment. Our study is focused on the ubiquitous setting where a pretrained model is optionally adapted through continued domain-specific pretraining, followed by task-specific finetuning. We find stronger effects of temporal misalignment on task performance than have been previously reported. We also find that, while temporal adaptation through continued pretraining can help, these gains are small compared to task-specific finetuning on data from the target time period. Our findings motivate continued research to improve the temporal robustness of NLP models. PDF 12 2021
About Time: Do Transformers Learn Temporal Verbal Aspect? Aspect is a linguistic concept that describes how an action, event, or state of a verb phrase is situated in time. In this paper, we explore whether different transformer models are capable of identifying aspectual features. We focus on two specific aspectual features: telicity and duration. Telicity marks whether the verb's action or state has an endpoint or not (telic/atelic), and duration denotes whether a verb expresses an action (dynamic) or a state (stative). These features are integral to the interpretation of natural language, but also hard to annotate and identify with NLP methods. We perform experiments in English and French, and our results show that transformer models adequately capture information on telicity and duration in their vectors, even in their non-finetuned forms, but are somewhat biased with regard to verb tense and word order. PDF 12 2021
DEMix Layers: Disentangling Domains for Modular Language Modeling We introduce a new domain expert mixture (DEMix) layer that enables conditioning a language model (LM) on the domain of the input text. A DEMix layer includes a collection of expert feedforward networks, each specialized to a domain, which makes the LM modular: experts can be mixed, added, or removed after initial training. Extensive experiments with autoregressive transformer LMs (up to 1.3B parameters) show that DEMix layers reduce test-time perplexity (especially for out-of-domain data), increase training efficiency, and enable rapid adaptation. Mixing experts during inference, using a parameter-free weighted ensemble, enables better generalization to heterogeneous or unseen domains. We also show it is possible to add experts to adapt to new domains without forgetting older ones, and to remove experts to restrict access to unwanted domains. Overall, these results demonstrate the benefits of domain modularity in language models. PDF 12 2021
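A minimal sketch of the parameter-free ensemble idea: weight each domain expert's next-token distribution by a posterior over domains inferred from how well each expert explains the context so far. The uniform prior and the toy numbers are assumptions for illustration, not the paper's exact estimator.

    import numpy as np

    def mix_experts(expert_token_probs, domain_log_likelihoods):
        ll = np.asarray(domain_log_likelihoods, dtype=float)
        weights = np.exp(ll - ll.max())
        weights /= weights.sum()             # p(domain | context), uniform prior
        probs = np.asarray(expert_token_probs)  # (n_experts, vocab_size)
        return weights @ probs               # mixed next-token distribution

    # Two experts over a 3-word vocabulary; the context was far more likely
    # under expert 0, so the mixture stays close to expert 0's prediction.
    mixed = mix_experts([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]], [-5.0, -9.0])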
Knowledge Distillation Improves Stability in Retranslation-based Simultaneous Translation In simultaneous translation, the \emph{retranslation} approach has the advantage of requiring no modifications to the inference engine. However, in order to reduce the undesirable instability (flicker) in the output, previous work has resorted to increasing the latency through masking and to introducing specialised inference, losing the simplicity of the approach. In this paper, we argue that flicker is caused both by non-monotonicity of the training data and by non-determinism of the resulting model. Both of these can be addressed using knowledge distillation. We evaluate our approach using simultaneously interpreted test sets for English-German and English-Czech and demonstrate that the distilled models have an improved flicker-latency tradeoff, with quality similar to the original. PDF 12 2021
PInKS: Preconditioned Commonsense Inference with Weak Supervision Reasoning with preconditions such as "glass can be used for drinking water unless the glass is shattered" remains an open problem for language models. The main challenge lies in the scarcity of precondition data and the models' lack of support for such reasoning. We present PInKS, Preconditioned Commonsense Inference with Weak Supervision, an improved model for reasoning with preconditions under minimal supervision. We show, both empirically and theoretically, that PInKS improves results across benchmarks on reasoning with the preconditions of commonsense knowledge (by up to 0.4 macro-F1). We further investigate the robustness of our method through a PAC-Bayesian informativeness analysis, recall measures, and an ablation study. PDF 12 2021
Creation and evaluation of timelines for longitudinal user posts There is increasing interest in working with user-generated content in social media, especially textual posts over time. Currently there is no consistent way of segmenting user posts into timelines in a meaningful way that can improve the quality and cost of manual annotation. Here we propose a set of methods for segmenting longitudinal user posts into timelines that are likely to contain interesting moments of change in a user's behaviour, based on the content they have shared online and their online activity. We also propose a framework for evaluating the returned timelines in terms of whether they contain candidate moments of change in close proximity to manually annotated timelines dense in such moments. Finally, we present a discussion of the linguistic content of highly ranked timelines. PDF 12 2021
Speeding Up Entmax Softmax is the de facto standard for normalizing logits in modern neural networks for language processing. However, because it produces a dense probability distribution, each token in the vocabulary has a nonzero chance of being selected at each generation step, leading to a variety of reported problems in text generation. The $\alpha$-entmax of Peters et al. (2019) solves this problem, but is unfortunately slower than softmax. In this paper, we propose an alternative to $\alpha$-entmax which keeps its virtuous characteristics but is as fast as optimized softmax, and achieves on-par or better performance on the machine translation task. PDF 12 2021
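For readers unfamiliar with the entmax family, here is sparsemax, the $\alpha = 2$ member: unlike softmax, it can assign exactly zero probability to low-scoring tokens. This is a reference sketch of the family's behaviour, not the faster alternative proposed in the paper.

    import numpy as np

    def sparsemax(z):
        # Euclidean projection of the logits onto the probability simplex
        # (Martins & Astudillo, 2016); entries below threshold tau become 0.
        z = np.asarray(z, dtype=float)
        z_sorted = np.sort(z)[::-1]
        cumsum = np.cumsum(z_sorted)
        k = np.arange(1, len(z) + 1)
        k_max = k[1 + k * z_sorted > cumsum][-1]
        tau = (cumsum[k_max - 1] - 1) / k_max
        return np.maximum(z - tau, 0.0)

    print(sparsemax([3.0, 1.0, 0.1]))  # [1. 0. 0.]: a truly sparse output
    print(sparsemax([1.0, 0.9, 0.1]))  # [0.55 0.45 0.  ]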
Contrastive Learning for Fair Representations Trained classification models can unintentionally lead to biased representations and predictions, which can reinforce societal preconceptions and stereotypes. Existing debiasing methods for classification models, such as adversarial training, are often expensive to train and fragile to optimise. Here, we propose a method for mitigating bias in classifier training by incorporating contrastive learning, in which instances sharing the same class label are encouraged to have similar representations, while instances sharing a protected attribute are forced further apart. In this way, our method learns representations that capture the task label in focused regions while ensuring the protected attribute has a diverse spread, limiting its impact on prediction and thereby resulting in fairer models. Extensive experimental results on three tasks show that our method achieves fairer representations and a larger bias reduction than competitive baselines; it does so without sacrificing main-task performance; and it generalizes across modalities and binary- and multi-class classification tasks, being conceptually simple and agnostic to network architecture, and incurring minimal additional compute cost. PDF 12 2021
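A minimal PyTorch sketch of the intuition (not the paper's exact objective): on a batch of encoder representations, pairs sharing a class label are attracted and pairs sharing a protected attribute are repelled; the equal weighting and temperature are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def fair_contrastive_loss(z, labels, protected, temp=0.1):
        z = F.normalize(z, dim=1)
        sim = z @ z.T / temp                          # pairwise similarities
        off_diag = ~torch.eye(len(z), dtype=torch.bool)
        same_label = (labels[:, None] == labels[None, :]) & off_diag
        same_prot = (protected[:, None] == protected[None, :]) & off_diag
        pull = -sim[same_label].mean()   # attract same-class pairs
        push = sim[same_prot].mean()     # repel same-protected-attribute pairs
        return pull + push

    z = torch.randn(8, 16, requires_grad=True)
    labels = torch.tensor([0, 0, 1, 1, 0, 1, 0, 1])
    protected = torch.tensor([0, 1, 0, 1, 0, 1, 0, 1])
    fair_contrastive_loss(z, labels, protected).backward()

In practice such a term would be added to the usual cross-entropy task loss, so the representation stays predictive of the label while spreading out the protected attribute.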
SHAP-Based Explanation Methods: A Review for NLP Interpretability Model explanations are crucial for the transparent, safe, and trustworthy deployment of machine learning models. The \emph{SHapley Additive exPlanations} (SHAP) framework is considered by many to be a gold standard for local explanations thanks to its solid theoretical background and general applicability. In the years following its publication, several variants appeared in the literature---presenting adaptations in the core assumptions and target applications. In this work, we review all relevant SHAP-based interpretability approaches available to date and provide instructive examples as well as recommendations regarding their applicability to NLP use cases. PDF 12 2021
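As a reference point for what SHAP approximates, the following sketch computes exact Shapley values by brute-force enumeration of coalitions; it is only feasible for a handful of features, and the toy additive model is a hypothetical stand-in for a real classifier.

    from itertools import combinations
    from math import factorial

    def shapley_values(features, value_fn):
        # value_fn(subset) = model output with only `subset` features present.
        n = len(features)
        phi = {}
        for f in features:
            others = [g for g in features if g != f]
            total = 0.0
            for size in range(n):
                for coalition in combinations(others, size):
                    weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                    total += weight * (value_fn(set(coalition) | {f})
                                       - value_fn(set(coalition)))
            phi[f] = total
        return phi

    contrib = {"not": -2.0, "good": 3.0, "movie": 0.5}   # toy additive model
    print(shapley_values(list(contrib), lambda s: sum(contrib[t] for t in s)))
    # Recovers the additive contributions exactly, as Shapley values must.

SHAP variants differ mainly in how they approximate this exponential sum and in what "removing" a feature means for text inputs.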
Negative Sample is Negative in Its Own Way: Tailoring Negative Sentences for Image-Text Retrieval The matching model is essential to image-text retrieval frameworks. Existing research usually trains the model with a triplet loss and explores various strategies to retrieve hard negative sentences from the dataset. We argue that the current retrieval-based negative sample construction approach is limited by the scale of the dataset and thus fails to identify high-difficulty negative samples for every image. We propose TAiloring neGative Sentences with Discrimination and Correction (TAGS-DC) to automatically generate synthetic sentences as negative samples. TAGS-DC is composed of masking and refilling to generate synthetic negative sentences of higher difficulty. To maintain this difficulty during training, we mutually improve retrieval and generation through parameter sharing. To further utilize the fine-grained semantics of mismatches in the negative sentences, we propose two auxiliary tasks, namely word discrimination and word correction, to improve training. In experiments, we verify the effectiveness of our model on MS-COCO and Flickr30K compared with current state-of-the-art models and demonstrate its robustness and faithfulness in further analysis. PDF 12 2021
Embedding Convolutions for Short Text Extreme Classification with Millions of Labels In this paper, we propose a convolutional architecture InceptionXML which is light-weight, yet powerful, and robust to the inherent lack of word-order in short-text queries in search and recommendation tasks. We demonstrate the efficacy of applying convolutions by recasting the operation along the embedding dimension instead of the word dimension as done in conventional usage of CNNs for text classification. Towards scaling our model to problems with millions of labels, we also propose InceptionXML+ framework. This addresses the shortcomings of the dynamic hard-negative mining framework in the recently proposed LightXML by improving the alignment between the label-shortlister and extreme classifier. InceptionXML+ is not only smaller than state-of-the-art deep extreme classifier, Astec, in terms of model size but also significantly outperforms it on popular benchmark datasets. For reproducibility, the code is made available as part of this submission. PDF 12 2021
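A minimal PyTorch sketch of the recast operation; tensor sizes are hypothetical, and this is not the full InceptionXML architecture.

    import torch
    import torch.nn as nn

    batch, seq_len, emb_dim = 4, 8, 64
    x = torch.randn(batch, seq_len, emb_dim)

    # Conventional text CNN: channels = embedding dims, slide over words.
    conv_over_words = nn.Conv1d(emb_dim, 128, kernel_size=3, padding=1)
    y_words = conv_over_words(x.transpose(1, 2))   # (batch, 128, seq_len)

    # Recast: channels = word positions, slide over embedding dimensions,
    # making the features far less sensitive to word order in short queries.
    conv_over_dims = nn.Conv1d(seq_len, 128, kernel_size=3, padding=1)
    y_dims = conv_over_dims(x)                     # (batch, 128, emb_dim)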
WARM: A Weakly (+Semi) Supervised Math Word Problem Solver Solving math word problems (MWPs) is an important and challenging problem in natural language processing. Existing approaches to solve MWPs require full supervision in the form of intermediate equations. However, labeling every MWP with its corresponding equations is a time-consuming and expensive task. In order to address this challenge of equation annotation, we propose a weakly supervised model for solving MWPs by requiring only the final answer as supervision. We approach this problem by first learning to generate the equation using the problem description and the final answer, which we subsequently use to train a supervised MWP solver. We propose and compare various weakly supervised techniques to learn to generate equations directly from the problem description and answer. Through extensive experiments, we demonstrate that without using equations for supervision, our approach achieves accuracy gains of 4.5% and 32% over the state-of-the-art weakly supervised approach (Hong et al., 2021), on the standard Math23K (Wang et al., 2017) and AllArith (Roy and Roth, 2017) datasets respectively. Additionally, we curate and release new datasets of roughly 10k MWPs each in English and in Hindi (a low-resource language). These datasets are suitable for training weakly supervised models. We also present an extension of WARM to semi-supervised learning and present further improvements on results, along with insights. PDF 12 2021
Balancing out Bias: Achieving Fairness Through Balanced Training Bias in natural language processing manifests as disparities in error rates across author demographics, typically disadvantaging minority groups. Although dataset balancing has been shown to be effective in mitigating bias, existing approaches do not directly account for correlations between author demographics and linguistic variables. To achieve Equal Opportunity fairness, this paper introduces a simple but highly effective objective for countering bias using balanced training. We extend the method in the form of a gated model, which incorporates protected attributes as input, and show that it is effective at reducing bias in predictions through demographic input perturbation, outperforming all other bias mitigation techniques when combined with balanced training. PDF 12 2021
BnPC: A Corpus for Paraphrase Detection in Bangla In this paper, we present the first benchmark dataset for paraphrase detection in the Bangla language. Despite Bangla being the sixth most spoken language in the world, paraphrase identification in it is barely explored. Our dataset contains 8,787 human-annotated sentence pairs collected from the headlines of a total of 23 newspaper outlets across four categories. We explore different linguistic features and pre-trained language models to benchmark the dataset. We perform a human evaluation experiment to obtain a better understanding of the task's constraints, which reveals intriguing insights. We make our dataset and code publicly available. PDF 12 2021
ImaginE: An Imagination-Based Automatic Evaluation Metric for Natural Language Generation Automatic evaluations for natural language generation conventionally rely on token-level or embedding-level comparisons with the text references. This is different from human evaluation manners, in which people also form pictures of the text contents in their minds during reading. In this work, we propose ImaginE, an imagination-based automatic evaluation metric for natural language generation. With the help of CLIP and DALL-E, two cross-modal models pre-trained on large-scale image-text pairs, we automatically generate an image as the embodied imagination for the text snippet, and compute the imagination similarity using contextual embeddings. Experiments spanning several text generation tasks demonstrate that adding imagination with our ImaginE displays great potential in introducing multi-modal information into NLG evaluation, and improves existing automatic metrics’ correlations with human similarity judgments in many circumstances. PDF 12 2021
Literature-Augmented Clinical Outcome Prediction We present BEEP (Biomedical Evidence-Enhanced Predictions), a novel approach for clinical outcome prediction that retrieves patient-specific medical literature and incorporates it into predictive models. Based on each individual patient's clinical notes, we train language models (LMs) to find relevant papers and fuse them with information from notes to predict outcomes such as in-hospital mortality. We develop methods to retrieve literature based on noisy, information-dense patient notes, and to augment existing outcome prediction models with retrieved papers in a manner that maximizes predictive accuracy. Our approach boosts predictive performance on three important clinical tasks in comparison to strong recent LM baselines, increasing F1 by up to 5 points and precision@Top-K by a large margin of over 25%. PDF 12 2021
A Survey on Multimodal Disinformation Detection Recent years have witnessed the proliferation of offensive content online such as fake news, propaganda, misinformation, and disinformation. While initially this was mostly about textual content, over time images and videos gained popularity, as they are much easier to consume, attract more attention, and spread further than simple text. As a result, researchers started leveraging different modalities and combinations thereof to combat online multimodal offensive content. In this study, we offer a survey that carefully studies the state-of-the-art on multimodal disinformation detection covering various combinations of modalities: text, images, speech, video, social media network structure, and temporal information. Moreover, while some studies focused on factuality, others investigated how harmful the content is. While these two components in the definition of disinformation -- (i) factuality, and (ii) harmfulness, are equally important, they are typically studied in isolation. Thus, we argue for the need to tackle disinformation detection by taking into account multiple modalities as well as both factuality and harmfulness, in the same framework. Finally, we discuss current challenges and future research directions. PDF 12 2021
Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity We present a new scientific document similarity model based on matching fine-grained aspects of texts. To train our model, we exploit a naturally-occurring source of supervision: sentences in the full-text of papers that cite multiple papers together (co-citations). Such co-citations not only reflect close paper relatedness, but also provide textual descriptions of how the co-cited papers are related. This novel form of textual supervision is used for learning to match aspects across papers. We develop multi-vector representations where vectors correspond to sentence-level aspects of documents, and present two methods for aspect matching: (1) A fast method that only matches single aspects, and (2) a method that makes sparse multiple matches with an Optimal Transport mechanism that computes an Earth Mover's Distance between aspects. Our approach improves performance on document similarity tasks in four datasets. Further, our fast single-match method achieves competitive results, paving the way for applying fine-grained similarity to large scientific corpora. PDF 12 2021
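A simplified sketch of the matching step: sentence-aspect vectors of two papers are matched with an optimal one-to-one assignment, a stand-in for the paper's Optimal Transport / Earth Mover's Distance mechanism, which additionally allows sparse many-to-many matches; the vectors are random placeholders.

    import numpy as np
    from scipy.optimize import linear_sum_assignment
    from scipy.spatial.distance import cdist

    def aspect_match_cost(doc_a, doc_b):
        cost = cdist(doc_a, doc_b, metric="cosine")  # (n_a, n_b) aspect costs
        rows, cols = linear_sum_assignment(cost)     # minimum-cost matching
        return cost[rows, cols].mean(), list(zip(rows, cols))

    doc_a = np.random.rand(3, 32)   # 3 sentence-level aspect vectors
    doc_b = np.random.rand(4, 32)   # 4 sentence-level aspect vectors
    distance, matches = aspect_match_cost(doc_a, doc_b)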
“Find Me a Dataset”: Scientific Dataset Recommendation from Method Descriptions Much of modern science relies on public datasets to develop research ideas. Finding a dataset for a given task can be difficult, particularly for new researchers. We aim to improve the process of dataset discovery by introducing a system called DatasetFinder which recommends relevant datasets given a short natural language description of a research idea. For the new task of dataset recommendation, we construct an English-language dataset that leverages existing annotations and compare several ranking models on this dataset. We also compare our proposed models against existing commercial search engines and find evidence that leveraging natural language descriptions improves search relevance. To encourage development on this new task, we release our constructed dataset and models to the public. PDF 12 2021
ME-GCN: Multi-dimensional Edge-Enhanced Graph Convolutional Networks for Semi-supervised Text Classification Compared to sequential learning models, graph-based neural networks exhibit an excellent ability to capture global information and have been used for semi-supervised learning tasks, including citation network analysis and text classification. Most Graph Convolutional Networks are designed with single-dimensional edge features and fail to utilise the rich edge information in graphs. In this paper, we introduce ME-GCN (Multi-dimensional Edge-enhanced Graph Convolutional Networks) for semi-supervised text classification. A text graph for an entire corpus is first constructed to describe the undirected, multi-dimensional word-to-word, document-to-document, and word-to-document relationships. The graph is initialised with corpus-trained multi-dimensional word and document node representations, and the relations are represented according to the distance between those word/document nodes. The generated graph is then trained with ME-GCN, which treats the edge features as multi-stream signals, with each stream performing a separate graph convolutional operation. ME-GCN can thus integrate a rich source of graph edge information from the entire text corpus. The results demonstrate that our proposed model significantly outperforms state-of-the-art methods across eight benchmark datasets. PDF 12 2021
HyEnA: A Hybrid Method for Extracting Arguments from Opinions The key arguments underlying a large and noisy set of opinions help understand the opinions quickly and accurately. Fully automated methods can extract arguments but (1) require large labeled datasets and (2) work well for known viewpoints, but not for novel points of view. We propose HyEnA, a hybrid (human + AI) method for extracting arguments from opinionated texts, combining the speed of automated processing with the understanding and reasoning capabilities of humans. We evaluate HyEnA on three feedback corpora on COVID-19 relaxation measures. We find that, on the one hand, HyEnA achieves higher coverage and precision than a state-of-the-art automated method, when compared on a common set of diverse opinions, justifying the need for human insight. On the other hand, HyEnA requires less human effort and does not compromise quality compared to (fully manual) expert analysis, demonstrating the benefit of combining human and machine intelligence. PDF 12 2021
Representation of ambiguity in pretrained models and the problem of domain specificity Recent developments in pretrained language models have led to many advances in NLP. These models have excelled at learning powerful contextual representations from very large corpora. Fine-tuning these models for downstream tasks has been one of the most used (and successful) approaches to solving a plethora of NLP problems. But how capable are these models in capturing subtle linguistic traits like ambiguity in their representations? We present results from a probing task designed to test the capability of the models to identify ambiguous sentences under different experimental settings. The results show how different pretrained models fare against each other in the same task. We also explore how domain specificity limits the representational capabilities of the probes. PDF 12 2021
D2U: Distance-to-Uniform Learning for Out-of-Scope Detection Supervised training with cross-entropy loss implicitly forces models to produce probability distributions that follow a discrete delta distribution. Model predictions at test time are expected to be similar to delta distributions if the classifier determines the class of an input correctly. However, the shape of the predicted probability distribution can become similar to the uniform distribution when the model cannot infer properly. We exploit this observation for detecting out-of-scope (OOS) utterances in conversational systems. Specifically, we propose a zero-shot post-processing step, called Distance-to-Uniform (D2U), exploiting not only the classification confidence score but also the shape of the entire output distribution. We later combine it with a learning procedure that uses D2U for loss calculation in the supervised setup. We conduct experiments using six publicly available datasets. Experimental results show that the performance of OOS detection is improved with our post-processing when there is no OOS training data, as well as with the D2U learning procedure when OOS training data is available. PDF 12 2021
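A minimal sketch of the post-processing step, using KL divergence to the uniform distribution as the distance (one natural choice; the paper's exact formulation may differ), with a hypothetical threshold.

    import numpy as np

    def distance_to_uniform(probs, eps=1e-12):
        probs = np.asarray(probs, dtype=float)
        uniform = 1.0 / len(probs)
        return float(np.sum(probs * np.log((probs + eps) / uniform)))

    in_scope = [0.93, 0.03, 0.02, 0.02]   # near a delta: large distance
    oos = [0.28, 0.26, 0.24, 0.22]        # near uniform: tiny distance
    threshold = 0.5                        # hypothetical, tuned on dev data
    for p in (in_scope, oos):
        d = distance_to_uniform(p)
        print(round(d, 3), "OOS" if d < threshold else "in-scope")

Note the direction: confident (delta-like) predictions are far from uniform, so small distances flag likely out-of-scope inputs.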
Enhancing Cross-lingual Prompting with Two-level Augmentation Prompting approaches show promising results in few-shot scenarios. However, their strength for multilingual/cross-lingual problems has not been fully exploited. Zhao and Schütze (2021) made initial explorations in this direction by showing that cross-lingual prompting outperforms cross-lingual finetuning. In this paper, we first conduct a sensitivity analysis of the effect of each component in cross-lingual prompting and derive Universal Prompting across languages. Based on this, we propose a two-level augmentation framework to further improve the performance of prompt-based cross-lingual transfer. Notably, for XNLI, our method achieves 46.54% with only 16 English training examples per class, significantly better than the 34.99% of finetuning. PDF 12 2021
HiURE: Hierarchical Exemplar Contrastive Learning for Unsupervised Relation Extraction Unsupervised relation extraction aims to extract the relationship between entities from natural language sentences without prior information on relational scope or distribution. Existing works either utilize self-supervised schemes to refine relational feature signals by iteratively leveraging adaptive clustering and classification that provoke gradual drift problems, or adopt instance-wise contrastive learning which unreasonably pushes apart those sentence pairs that are semantically similar. To overcome these defects, we propose a novel contrastive learning framework named HiURE, which has the capability to derive hierarchical signals from relational feature space using cross hierarchy attention and effectively optimize relation representation of sentences under exemplar-wise contrastive learning. Experimental results on two public datasets demonstrate the advanced effectiveness and robustness of HiURE on unsupervised relation extraction when compared with state-of-the-art models. PDF 12 2021
Cross-modal Contrastive Learning for Speech Translation How to learn similar representations for spoken utterances and their written text? We believe a unified and aligned representation of speech and text will lead to improvement in speech translation. To this end, we propose ConST, a cross-modal contrastive learning method for end-to-end speech-to-text translation. We evaluate ConST and a variety of previous baselines on multiple language directions (En-De/Fr/Ru) of a popular benchmark MuST-C. Experiments show that the proposed ConST consistently outperforms all previous methods, and achieves the state-of-the-art average BLEU of 28.5. The analysis further verifies that ConST indeed closes the representation gap of different modalities --- its learned representation improves the accuracy of cross-modal text retrieval from 4% to 88%. PDF 12 2021
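A generic PyTorch sketch of the cross-modal contrastive objective (symmetric InfoNCE over a batch of paired speech/text embeddings); the pooling, dimensions, and temperature are assumptions, not ConST's exact formulation.

    import torch
    import torch.nn.functional as F

    def cross_modal_contrastive_loss(speech_emb, text_emb, temp=0.05):
        s = F.normalize(speech_emb, dim=1)
        t = F.normalize(text_emb, dim=1)
        logits = s @ t.T / temp              # (batch, batch) similarities
        targets = torch.arange(len(s))       # positives on the diagonal
        return (F.cross_entropy(logits, targets)
                + F.cross_entropy(logits.T, targets)) / 2

    speech = torch.randn(16, 256)   # pooled speech-encoder outputs
    text = torch.randn(16, 256)     # pooled text-encoder outputs
    loss = cross_modal_contrastive_loss(speech, text)

Minimising this loss pulls each utterance towards its own transcript and away from the other transcripts in the batch, which is what closes the modality gap.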
CPED: A Large-Scale Chinese Personalized and Emotional Dialogue Dataset for Open-Domain Conversation Recently, the personification and empathy capabilities of dialogue systems have received extensive attention from researchers. Although it is straightforward for humans to express themselves personally and empathically, this is highly difficult for dialogue systems, since training data do not provide personality or empathy knowledge. In this paper, we propose CPED, a large-scale Chinese personalized and emotional dialogue dataset, which consists of multi-source knowledge related to empathy and personal characteristics. This knowledge covers 13 emotions, gender, Big Five personality traits, 19 dialogue acts and other knowledge. CPED contains more than 12K dialogues of 392 speakers from 40 TV shows. We also provide several strong baselines for open-domain conversation generation. The results show that explicitly infusing personalized knowledge and emotional information improves the personification level and empathy ability of dialogue systems, but the infusion method needs to be studied further. The dataset and baselines will be released on https://github.com/***/CPED. PDF 12 2021
Adversarially Constructed Evaluation Sets Are More Challenging, but May Not Be Fair More capable language models increasingly saturate existing task benchmarks, in some cases outperforming humans, leaving little headroom with which to measure further progress. Adversarial dataset creation has been proposed as a strategy to construct more challenging datasets, and two common approaches are: (1) filtering out easy examples and (2) model-in-the-loop data collection. In this work, we study the impact of applying each approach to create more challenging evaluation datasets. We adapt the AFLite algorithm to filter evaluation data, and run experiments against 18 different adversary models. We find that AFLite indeed selects more challenging examples, lowering the performance of evaluated models more as stronger adversary models are used. However, the resulting ranking of models can also be unstable and highly sensitive to the choice of adversary model used. Moreover, AFLite oversamples examples with low annotator agreement, meaning that model comparisons hinge on the most contentiously labeled examples. Smaller-scale experiments on the adversarially collected datasets ANLI and AdversarialQA show similar findings, broadly lowering performance with stronger adversaries while disproportionately affecting the adversary model. PDF 12 2021
Few-Shot Self-Rationalization with Natural Language Prompts Self-rationalization models that predict task labels and generate free-text elaborations for their predictions could enable more intuitive interaction with NLP systems. These models are, however, currently trained with a large amount of human-written free-text explanations for each task which hinders their broader usage. We propose to study a more realistic setting of self-rationalization using few training examples. We present FEB---a standardized collection of four existing English-language datasets and associated metrics. We identify the right prompting approach by extensively exploring natural language prompts on FEB. Then, by using this prompt and scaling the model size, we demonstrate that making progress on few-shot self-rationalization is possible. We show there is still ample room for improvement in this task: the average plausibility of generated explanations assessed by human annotators is at most 51\%, while plausibility of human explanations is 76\%. We hope that FEB and our proposed approach will spur the community to take on the few-shot self-rationalization challenge. PDF 12 2021
Flat and Nested Negation and Uncertainty Detection with PubMed BERT Negation and uncertainty detection is an oft-studied challenge in biomedical NLP. Annotation style for the task has not been standardized and as such, the existing datasets not only vary in domain but require various algorithmic designs due to their structural differences. We present a new negation detection dataset in two versions from clinical publications. We further developed two BERT-based models to evaluate on each dataset version. Both models treat the task as a token-level multi-class classification task, one of which is capable of assigning more than one label per token in the case of recursive nesting. Our models achieve F1 scores of 76% and 72% on the development and test sets, respectively. PDF 12 2021
CalBERT - Code-mixed Adaptive Language representations using BERT A code-mixed language is a type of language that involves the combination of two or more language varieties in its script or speech. Code-mixed language has become increasingly prevalent in recent times, especially on social media. However, the exponential increase in the usage of code-mixed language, especially in a linguistically diverse country like India, has led to various inconsistencies. Text analysis is becoming harder to tackle because the language used is not consistent and is not handled by predefined existing models, which are monolingual. We propose a novel approach to improve performance in Transformers by introducing an additional step called "Siamese Pre-Training", which allows pre-trained monolingual Transformers to adapt their language representations to code-mixed languages with a few examples of code-mixed data. Our studies show that CalBERT is able to improve performance over existing pre-trained Transformer architectures on downstream tasks such as sentiment analysis. Multiple CalBERT architectures beat the state-of-the-art F1-score on the Sentiment Analysis for Indian Languages (SAIL) dataset, with the highest improvement being 5.1 points. CalBERT also achieves state-of-the-art accuracy on the IndicGLUE Product Reviews dataset by beating the benchmark by 0.4 points. PDF 12 2021
Improving Tokenisation by Alternative Treatment of Spaces Tokenisation is the first step in almost all NLP tasks, and state-of-the-art transformer-based language models all use subword tokenisation algorithms to process input text. Existing algorithms have problems, often producing tokenisations of limited linguistic validity and representing equivalent strings differently depending on their position within a word. We hypothesise that these problems hinder the ability of transformer-based models to handle complex words, and suggest that they are a result of allowing tokens to include spaces. We thus experiment with an alternative tokenisation approach where spaces are always treated as individual tokens, finding that it alleviates the existing problems and improves the performance of models. Concretely, we apply a modification to the BPE and Unigram algorithms which implements this approach, and find it gives more morphologically correct tokenisations, in particular when handling prefixes. In addition, we show that the modified algorithms give improved performance on downstream NLP tasks that involve handling complex words, whilst having no detrimental effect on performance in general natural language understanding tasks. Given the results of our experiments, we advocate for always treating spaces as individual tokens as a superior tokenisation method. PDF 12 2021
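A minimal sketch of the proposed treatment at the pre-tokenisation stage (the paper modifies BPE and Unigram themselves; this only illustrates the space handling):

    import re

    def pretokenize_spaces_as_tokens(text):
        # Keep each space as its own token instead of letting subwords
        # absorb a leading space, as GPT-2-style BPE does.
        return [t for t in re.split(r"( )", text) if t]

    print(pretokenize_spaces_as_tokens("we tokenise unkindness"))
    # ['we', ' ', 'tokenise', ' ', 'unkindness']
    # With spaces held out, 'unkindness' is segmented the same way whether
    # it starts the sentence or appears mid-sentence.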
Understanding Attention for Vision-and-Language Tasks The attention mechanism has been used as an important component across Vision-and-Language (VL) tasks in order to bridge the semantic gap between visual and textual features. While attention has been widely used in VL tasks, the capability of different attention alignment calculations to bridge the semantic gap between visual and textual clues has not been examined. In this research, we conduct a comprehensive analysis of the role of attention alignment by looking into the attention score calculation methods and checking how they actually represent the significance of visual regions and textual tokens for the global assessment. We also analyse the conditions under which an attention score calculation mechanism is more (or less) interpretable, and how this may impact model performance on three different VL tasks: visual question answering, text-to-image generation, and text-and-image matching (both sentence and image retrieval). Our analysis is the first of its kind and provides useful insights into the importance of each attention alignment score calculation when applied at the training phase of VL tasks, something commonly ignored in attention-based cross-modal models and/or pretrained models. PDF 12 2021
LPC: A Logits and Parameter Calibration Framework on Continual Learning Deep learning based pre-trained natural language processing (NLP) models typically pre-train on large unlabeled corpora first, then fine-tune on new tasks. When we execute such a paradigm on continuously sequential tasks, the model suffers from the catastrophic forgetting problem (i.e., it forgets the parameters learned on previous tasks when trained on newly emerged tasks). Inspired by how humans learn, we aim to maintain old knowledge when transferring to novel content and to calibrate the old and new knowledge. We propose a Logits and Parameter Calibration (LPC) framework to reduce catastrophic forgetting in the continual learning process. The proposed framework includes two important components, Logits Calibration (LC) and Parameter Calibration (PC). The core idea is to reduce the difference between old knowledge and new knowledge by calibrating logits and parameters, so that the model can maintain old knowledge while learning new tasks without preserving data from previous tasks. First, we preserve the parameters learned from the base tasks. Second, we train the existing model on novel tasks and estimate the difference between the base logits and parameters and the novel logits and parameters. Third, we drift from the base tasks to the novel tasks gradually. Furthermore, we integrate the logits and parameter calibration into a new optimization algorithm. Finally, we run experiments on 7 scenarios of the GLUE (General Language Understanding Evaluation) benchmark. The experimental results show that our model achieves state-of-the-art performance on all 7 scenarios. PDF 12 2021
Joint Mitigation of Interactional Bias Machine learning algorithms have been found to discriminate against groups of different social identities, e.g., gender and race. In response to the detrimental effects of these algorithmic biases, researchers have proposed promising approaches for bias mitigation, typically designed for individual bias types. Due to the complex nature of social bias, we argue it is important to study how different biases interact with each other, i.e., how mitigating one bias type (e.g., gender) influences the bias results regarding other social identities (e.g., race and religion). We further question whether jointly debiasing multiple types of bias is desirable in different contexts, e.g., when correlations between biases differ. To address these research questions, we examine bias mitigation in two NLP tasks -- toxicity detection and word embeddings -- on three social identities, i.e., race, gender, and religion. Empirical findings based on benchmark datasets suggest that different biases can be correlated, warranting attention to joint bias mitigation in future research. PDF 12 2021
RoViST: Learning Robust Metrics for Visual Storytelling Visual storytelling (VST) is the task of generating a story paragraph that describes a given image sequence. Most existing storytelling approaches have evaluated their models using traditional natural language generation metrics like BLEU or CIDEr. However, such metrics based on $n$-gram matching tend to correlate poorly with human evaluation scores and do not explicitly consider other criteria necessary for storytelling, such as sentence structure or topic coherence. Moreover, a single score is not enough to assess a story, as it does not tell us what specific errors the model made. In this paper, we propose 3 evaluation metric sets that analyse the aspects we would look for in a good story: 1) visual grounding, 2) coherence, and 3) non-redundancy. We measure the reliability of our metric sets by analysing their correlation with human judgement scores on a sample of machine stories obtained from 4 state-of-the-art models trained on the Visual Storytelling Dataset (VIST). Our metric sets outperform other metrics on human correlation and could serve as a learning-based evaluation metric set complementary to existing rule-based metrics. PDF 12 2021
PaCo: Preconditions Attributed to Commonsense Knowledge Humans can seamlessly reason with circumstantial preconditions of commonsense knowledge. We understand that "a glass is used for drinking water", unless "the glass is broken" or "the water is toxic". Despite state-of-the-art (SOTA) language models’ (LMs) impressive performance on inferring commonsense knowledge, it is unclear whether they understand such circumstantial preconditions. To address this gap, we propose a novel challenge of reasoning with circumstantial preconditions. We collect a dataset, called PaCo, consisting of 12.4 thousand preconditions of commonsense statements expressed in natural language. Based on this dataset, we create three canonical evaluation tasks and use them to examine the capability of existing LMs to understand situational preconditions. Our results reveal a 10-30% gap between machine and human performance on our tasks, which shows that reasoning with preconditions is an open challenge. Upon acceptance, we will release the dataset and the code used to test models. PDF 12 2021
Diagnosing Vision-and-Language Navigation: What Really Matters Vision-and-language navigation (VLN) is a multimodal task where an agent follows natural language instructions and navigates in visual environments. Multiple setups have been proposed, and researchers apply new model architectures or training techniques to boost navigation performance. However, there still exist non-negligible gaps between machines' performance and human benchmarks. Moreover, the agents' inner mechanisms for navigation decisions remain unclear. To the best of our knowledge, how the agents perceive the multimodal input is under-studied and needs investigation. In this work, we conduct a series of diagnostic experiments to unveil agents' focus during navigation. Results show that indoor navigation agents refer to both object and direction tokens when making decisions. In contrast, outdoor navigation agents heavily rely on direction tokens and poorly understand the object tokens. The differences in dataset designs and the visual features lead to distinct behaviors on visual environment understanding. Many models claim that they can align object tokens with specific visual targets when it comes to vision-and-language alignments. We find unbalanced attention on the vision and text input and doubt the reliability of such cross-modal alignments. PDF 12 2021
TASA: Twin Answer Sentences Attack for Adversarial Context Generation in Question Answering We present the Twin Answer Sentences Attack (TASA), a novel question answering (QA) adversarial attack method that produces fluent and grammatical adversarial contexts while maintaining the gold answers. Despite phenomenal progress on general adversarial attacks, few works have investigated vulnerability and adversarial attacks specifically for QA. In this work, we first investigate the biases in existing models and discover that they heavily rely on keyword matching while ignoring the relevant entities from the question. TASA exploits these two biases and attacks the target model in two ways: (1) lowering the model's confidence in the gold answer with a perturbed answer sentence; (2) misguiding the model towards a wrong answer with a distracting answer sentence. Equipped with designed beam search and filtering methods, TASA is able to attack the target model efficiently while sustaining the quality of contexts. Extensive experiments on four QA datasets and human evaluations demonstrate that TASA generates substantially higher-quality attacks than existing textual adversarial attack methods. PDF 12 2021
Explaining Ranking Models using Multiple Explainers Current approaches to interpreting complex ranking models are based on local approximations of the ranking model using a simple ranker in the locality of the query. Since rankings have multiple relevance factors and are aggregations of predictions, existing approaches that use a single ranker might not be sufficient to approximate a complex model resulting in low local fidelity. In this paper, we overcome this problem by considering multiple simple rankers for better approximating the black box ranking model. We pose the problem of local approximation as a Generalized Preference Coverage (GPC) problem that incorporates multiple simple rankers towards the post-hoc interpretability of ranking models. Our approach Multiplex uses a linear programming approach to judiciously extract the explanation terms. We conduct extensive experiments on a variety of ranking models and report fidelity improvements of $37\% - 54\%$ over existing baselines and competitors. We finally qualitatively compare modern neural ranking models in terms of their explanations to better understand the differences between them, showcasing our explainers' practical utility. PDF 12 2021
Cloze Evaluation for Deeper Understanding of Commonsense Stories in Indonesian Story comprehension that involves complex causal and temporal relations is imperative in NLP, but previous studies have focused on English, leaving open the question of how the findings generalize to other languages, such as Indonesian. In this paper, we follow the Story Cloze Test framework of Mostafazadeh et al. (2016) in evaluating story understanding in Indonesian, by constructing a four-sentence story with one correct ending and one incorrect ending. To investigate commonsense knowledge acquisition in language models, we experimented with: (1) a classification task to predict the correct ending; and (2) a generation task to complete the story with a single sentence. We investigate these tasks in two settings: (i) monolingual training and (ii) zero-shot cross-lingual transfer between Indonesian and English. PDF 12 2021
Challenging America: Modeling language in longer time scales The aim of this paper is to apply, to historical texts, the methodology commonly used to solve various NLP tasks defined for contemporary data, i.e., pre-training and fine-tuning large Transformer models. This paper introduces an ML challenge, named Challenging America (ChallAm), based on OCR-ed excerpts from historical newspapers collected from the Chronicling America portal. ChallAm provides a dataset of clippings, labeled with metadata on their origin, and paired with their textual contents retrieved by an OCR tool. Three publicly available ML tasks are defined in the challenge: to determine the article date, to detect the location of the issue, and to deduce a word in a text gap (cloze test). Strong baselines are provided for all three ChallAm tasks. In particular, we pre-trained a RoBERTa model from scratch on the historical texts. We also discuss the issues of discrimination and hate speech present in the historical American texts. PDF 12 2021
A High-Precision Health-relatedness Score for Phrases to Mine Cause-Effect Statements from the Web The measurement of the health-relatedness of a phrase is important when mining the web at scale for health information, e.g., when building a search engine or when carrying out health-sociological analyses. We propose a new termhood scoring scheme that allows for the prediction of the health-relatedness of phrases at high precision. An evaluation on several corpora of cause--effect statements (heuristically and professionally labeled) yields about 60\%~recall at over 90\%~precision, outperforming state-of-the-art vocabulary-based approaches and performing on par with BERT while being less resource-demanding. A new resource of over 4~million health-related cause--effect statements is compiled, such as ``Studies show that stress induces insomnia.'', which explicitly connect symptoms (`stress') as claimed causes for conditions (`insomnia'). It consists of over 4~million sentences from more than 2~million unique web pages and 234,000 unique websites. PDF 12 2021
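One plausible shape for such a termhood score, sketched below as a per-token log frequency ratio between a health-domain corpus and a general web corpus; the smoothing, length normalisation, and toy frequencies are illustrative assumptions, not the paper's actual scheme.

    from math import log

    def health_termhood(phrase, health_freq, general_freq, smooth=1.0):
        # Frequencies are per-million counts in each corpus (assumed).
        ratio = ((health_freq.get(phrase, 0.0) + smooth)
                 / (general_freq.get(phrase, 0.0) + smooth))
        return log(ratio) / max(len(phrase.split()), 1)

    health_freq = {"insomnia": 900.0, "stress": 400.0, "show": 50.0}
    general_freq = {"insomnia": 30.0, "stress": 200.0, "show": 800.0}
    for phrase in ("insomnia", "stress", "show"):
        print(phrase, round(health_termhood(phrase, health_freq, general_freq), 2))
    # 'insomnia' scores high, 'show' scores below zero.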
Training Dynamics for Curriculum Learning: A Study on Monolingual and Cross-lingual NLU Curriculum Learning (CL) is a technique of training models via ranking examples in a typically increasing difficulty trend with the aim of accelerating convergence and improving generalisability. However, current approaches for Natural Language Understanding (NLU) tasks use CL to improve in-domain model performance often via metrics that are detached from the model one aims to improve. In this work, instead, we employ CL for NLU by taking advantage of training dynamics as difficulty metrics, i.e. statistics that measure the behavior of the model at hand on data instances during training. In addition, we propose two modifications of existing CL schedulers based on these statistics. Differently from existing works, we focus on evaluating models on out-of-distribution data as well as languages other than English via zero-shot cross-lingual transfer. We show across four XNLU tasks that CL with training dynamics in both monolingual and cross-lingual settings can achieve significant speedups of up to 58\%. We also find that performance can be improved on challenging tasks, with OOD generalisation up by 8\% and zero-shot cross-lingual transfer up by 1\%. Overall, experiments indicate that training dynamics can lead to better performing models and smoother training compared to other difficulty metrics. PDF 12 2021
Pretrained Language Models Are All You Need For Text-to-SQL Schema Linking The use of Exact Match based Schema Linking (EMSL) has become standard in text-to-SQL: many state-of-the-art text-to-SQL models employ EMSL, and their performance drops significantly when the EMSL component is removed. In this work, however, we demonstrate that EMSL reduces robustness, rendering models vulnerable to synonym substitution and typos. Instead of relying on EMSL to make up for deficiencies in question-schema encoding, we show that by utilizing a pre-trained language model as the encoder, we can improve performance without using EMSL, making the model more robust. Our experiments suggest that EMSL is not the icing on the cake; rather, it is the component that introduces the vulnerability, and it can be replaced by better input encoding. PDF 12 2021
A Two-Stream AMR-enhanced Model for Document-level Event Argument Extraction Most previous studies aim at extracting events from a single sentence, while document-level event extraction still remains under-explored. In this paper, we focus on extracting event arguments from an entire document, which mainly faces two critical problems: a) the long-distance dependency between trigger and arguments over sentences; b) the distracting context towards an event in the document. To address these issues, we propose a \textbf{T}wo-\textbf{S}tream \textbf{A}bstract meaning \textbf{R}epresentation enhanced extraction model (TSAR). TSAR encodes the document from different perspectives by a two-stream encoding module, to utilize local and global information and lower the impact of distracting context. Besides, TSAR introduces an AMR-guided interaction module to capture both intra-sentential and inter-sentential features, based on the locally and globally constructed AMR semantic graphs. An auxiliary boundary loss is introduced to enhance the boundary information for text spans explicitly. Extensive experiments illustrate that TSAR outperforms the previous state-of-the-art by a large margin, with $2.54$ F1 and $5.13$ F1 performance gains on the public RAMS and WikiEvents datasets respectively, demonstrating its superiority in cross-sentence argument extraction. We will release our code upon acceptance. PDF 12 2021
Heterogeneous-Graph Reasoning and Fine-Grained Aggregation for Fact Checking Fact checking is a challenging task that requires corresponding evidences to verify the property of a claim based on reasoning. Previous studies generally i) construct the graph by treating each evidence-claim pair as a node, a simple approach that fails to exploit their implicit interactions, or build a fully-connected graph among the claim and evidences, where the entailment relationship between claim and evidence is treated the same as the semantic relationships among evidences; and ii) aggregate evidences equally without considering their different stances towards the verification of the fact. To address the above issues, we propose a novel heterogeneous-graph reasoning and fine-grained aggregation model with the following two modules: 1) a heterogeneous graph attention network module to distinguish different types of relationships within the constructed graph; and 2) a fine-grained aggregation module which learns the implicit stance of evidences towards the prediction result in detail. Extensive experiments on the benchmark dataset demonstrate that our proposed model achieves much better performance than state-of-the-art methods. PDF 12 2021
Deep Reinforcement Learning-based Authentic Dialogue Generation To Protect Youth From Cybergrooming Cybergrooming is a crime in which perpetrators build close personal relationships with potential victims, especially teens, for the purpose of sexual exploitation via online media. Cyber or online sexual grooming has been recognized as a serious cyber crime. However, there have been insufficient programs to proactively protect the youth from cybergrooming. In this work, we present a generative chatbot framework, called SERI (Stop cybERgroomIng), that can generate simulated conversations between a perpetrator chatbot and a potential victim chatbot. To realize the simulation of authentic conversations in the context of cybergrooming, we adopt deep reinforcement learning (DRL)-based dialogue generation to authentically simulate conversations between a potential victim and a perpetrator (i.e., cybergroomer). SERI is designed to provide a safe and authentic environment that strengthens the youth's precautionary awareness of cybergrooming while removing or minimizing ethical concerns (e.g., the potential misuse of SERI). We developed SERI as a preliminary platform that can deploy the perpetrator chatbot to interact with human users (i.e., youth) to observe their responses to strangers or acquaintances and to collect their reactions when asked for private or sensitive information by the perpetrator. We evaluated the quality of conversations generated by SERI using open-source referenced and unreferenced metrics as well as human evaluation. PDF 12 2021
A Survey on Stance Detection for Mis- and Disinformation Identification Understanding attitudes expressed in texts, also known as stance detection, plays an important role in systems for detecting false information online, be it misinformation (unintentionally false) or disinformation (intentionally false information). Stance detection has been framed in different ways, including (a) as a component of fact-checking, rumour detection, and detecting previously fact-checked claims, or (b) as a task in its own right. While there have been prior efforts to contrast stance detection with other related tasks such as argumentation mining and sentiment analysis, there is no existing survey on examining the relationship between stance detection and mis- and disinformation detection. Here, we aim to bridge this gap by reviewing and analysing existing work in this area, with mis- and disinformation in focus, and discussing lessons learnt and future challenges. PDF 12 2021
A Character-level Ngram-based MT Approach for Lexical Normalization in Social Media This paper presents an ngram-based MT approach that operates at the character level to generate possible canonical forms for lexical variants in social media text. It utilizes a joint n-gram model to learn edit sequences of word pairs, thus overcoming the shortcoming of phrase-based approaches, which are unable to capture dependencies across phrases. We evaluate our approach on two English tweet datasets and observe that the ngram-based approach significantly outperforms the phrase-based approach on the normalization task. Our simple model achieves broad coverage of diverse variants, comparable to complex hybrid systems. PDF 12 2021
Compositional Generalization Requires Compositional Parsers A rapidly growing body of research on compositional generalization investigates the ability of a semantic parser to dynamically recombine linguistic elements seen in training into unseen sequences. We present a systematic comparison of sequence-to-sequence models and models guided by compositional principles on the recent COGS corpus (Kim and Linzen, 2020). Though seq2seq models can perform well on lexical tasks, they perform with near-zero accuracy on structural generalization tasks that require novel syntactic structures; this holds true even when they are trained to predict syntax instead of semantics. In contrast, compositional models achieve near-perfect accuracy on structural generalization; we present new results confirming this from the AM parser (Groschwitz et al., 2021). Our findings show structural generalization is a key measure of compositional generalization and requires models that are aware of complex structure. PDF 12 2021
Hybrid-Regressive Neural Machine Translation Non-autoregressive translation (NAT) with an iterative refinement mechanism has shown performance comparable to its autoregressive counterpart. However, we have empirically found that the decoding acceleration is fragile when using a large batch size and running on the CPU. Through synthetic experiments, we demonstrate that one-pass NAT is sufficient when a few target contexts are provided in advance. Inspired by this, we propose a two-stage translation prototype, Hybrid-Regressive Translation (HRT), to combine the strengths of autoregressive and non-autoregressive translation. Specifically, HRT first generates a discontinuous sequence by autoregression (e.g., making a prediction every k tokens, k>1) and then fills in all previously skipped tokens at once in a non-autoregressive manner. We also propose a bag of techniques to effectively and efficiently train HRT, with almost no increase in parameters. Experimental results on WMT En-Ro, En-De, and NIST Zh-En show that our model outperforms existing semi-autoregressive models and is competitive with current state-of-the-art non-autoregressive models. Moreover, compared to its autoregressive counterpart, HRT has a stable 1.5x acceleration, regardless of batch size and device. PDF 12 2021
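As a rough illustration of the two-stage decoding described above, the following hypothetical Python sketch shows HRT-style inference; `ar_step` and `nat_fill` are assumed stand-ins for the model's autoregressive step and one-shot non-autoregressive infilling pass, and this is not the authors' code.

```python
def hybrid_regressive_decode(ar_step, nat_fill, src, max_len=64, k=2, eos=2):
    # Stage 1: autoregressively predict a "skeleton" of every k-th token.
    skeleton = []
    for _ in range(max_len // k):
        tok = ar_step(src, skeleton)   # assumed: returns next skeleton token
        skeleton.append(tok)
        if tok == eos:
            break
    # Stage 2: interleave mask placeholders with the skeleton, then fill
    # them all at once, conditioned on the source (non-autoregressive).
    MASK = -1
    draft = []
    for tok in skeleton:
        draft.append(tok)
        draft.extend([MASK] * (k - 1))
    return nat_fill(src, draft)        # assumed: replaces every MASK in one pass
```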
Anticipation-free Training for Simultaneous Translation Simultaneous translation (SimulMT) speeds up the translation process by starting to translate before the source sentence is completely available. It is difficult due to limited context and word order differences between languages. Existing methods increase latency or introduce adaptive read-write policies for SimulMT models to handle local reordering and improve translation quality. However, long-distance reordering can cause SimulMT models to learn translations incorrectly. Specifically, the model may be forced to predict target tokens before the corresponding source tokens have been read. This leads to aggressive anticipation during inference, resulting in the hallucination phenomenon. To mitigate this problem, we propose a new framework that decomposes the translation process into a monotonic translation step and a reordering step, and we model the latter with an auxiliary sorting network (ASN). The ASN rearranges the hidden states to match the order in the target language, so that the SimulMT model can learn to translate more reasonably. The entire model is optimized end-to-end and does not rely on external aligners or data. During inference, the ASN is removed to achieve streaming. Experiments show the proposed framework can outperform previous methods with lower latency. PDF 12 2021
Incorporate Dependency Relation Knowledge into Transformer Block for Multi-turn Dialogue Generation Because of the compositionality of natural language, syntactic structure is one of the key factors for semantic understanding. However, the Transformer block, which is widely used for obtaining the distributed representations of sentences in dialogue generation tasks, views sentences as sequences of words and does not effectively learn the syntactic structure. In this work, we explore how to effectively incorporate dependency relation knowledge that contains syntactic structure information into the Transformer block, and propose Dependency Relation Attention (DRA). Experimental results demonstrate that DRA can further improve the performance of state-of-the-art models for multi-turn dialogue generation. PDF 12 2021
Toward Preference-Aware Story Evaluation via Ranking, Rating and Reasoning Existing automatic story evaluation methods place a premium on story coherence, deviating from human preference. We go beyond such restrictions by presenting a more challenging task of \textbf{preference-aware story evaluation}. Given either a machine-generated or a human-written story, the task requires the machine to output a preference score that corresponds to human preference, along with specific ratings and comments for various aspects (e.g., opening, character-shaping). To support this novel task, we introduce a new dataset, namely \textbf{StoR3}, comprising (i) 100k ranked story pairs; and (ii) a set of 46k ratings and comments on various aspects of the story. To move towards preference-aware evaluation, we propose a model using the \textit{upvote count} as the criterion. The experiments show that the scores obtained by our model have a high correlation to human preference. Additionally, we discovered that the combination of aspect ratings and comments improves performance. Our dataset and benchmarks are publicly available to advance the research of story evaluation tasks. PDF 12 2021
Is syntax structure modeling worth? Leveraging pattern-driven modeling to enable affordable sentiment dependency learning Is structure information modeling really worthwhile for Aspect-based sentiment classification (ABSC)? Recent popular works tend to exploit syntactic information to guide sentiment dependency parsing, i.e., structure-based sentiment dependency learning. However, many works fall into the trap of confusing the concepts of syntax dependency and sentiment dependency. Besides, structure information (e.g., a syntactic dependency tree) usually consumes expensive computational resources due to the extraction of the adjacency matrix. Instead, we believe sentiment dependency mostly occurs between adjacent aspects. By proposing sentiment patterns (SP) to boost sentiment dependency learning, we introduce Local dependency aggregating (Lena) to explore sentiment dependency in the text. Experiments show that Lena is more efficient than existing structure-based models, avoiding the expense of constructing and modeling dependency matrices. The performance on all five public ABSC datasets is a big step forward compared to state-of-the-art models, and our work could inspire future research focusing on efficient local sentiment dependency modeling. PDF 12 2021
Iterative Decoding for Compositional Generalization in Transformers Deep learning models generalize well to in-distribution data but struggle to generalize compositionally, i.e., to combine a set of learned primitives to solve more complex tasks. In sequence-to-sequence (seq2seq) learning, transformers are often unable to predict correct outputs for examples longer than those seen at training. This paper introduces iterative decoding, an alternative to seq2seq that (i) improves transformer compositional generalization on the PCFG and Cartesian product datasets and (ii) provides evidence that, in these datasets, seq2seq transformers do not learn iterations that are not unrolled. In iterative decoding, training examples are broken down into a sequence of intermediate steps that the transformer learns iteratively. At inference time, the intermediate outputs are fed back to the transformer as intermediate inputs until an end-of-iteration token is predicted. We conclude by illustrating some limitations of iterative decoding on the CFQ dataset. PDF 12 2021
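A minimal sketch of the iterative decoding loop described above, assuming a seq2seq `model` callable that maps a string to a string and a dedicated end-of-iteration token (both hypothetical names, not the paper's code):

```python
def iterative_decode(model, example, end_token="<EOI>", max_iters=10):
    # Repeatedly rewrite the intermediate output, feeding it back as input,
    # until the model signals that it has finished iterating.
    current = example
    for _ in range(max_iters):
        output = model(current)
        if output.endswith(end_token):
            return output[: -len(end_token)].strip()
        current = output  # intermediate output becomes the next input
    return current
```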
Learning to Win Lottery Tickets in BERT Transfer via Task-agnostic Mask Training Recent studies on the lottery ticket hypothesis (LTH) show that pre-trained language models (PLMs) like BERT contain matching subnetworks that have similar transfer learning performance as the original PLM. These subnetworks are found using magnitude-based pruning. In this paper, we find that the BERT subnetworks have even more potential than these studies have shown. Firstly, we discover that the success of magnitude pruning can be attributed to the preserved pre-training performance, which correlates with the downstream transferability. Inspired by this, we propose to directly optimize the subnetwork structure towards the pre-training objectives, which can better preserve the pre-training performance. Specifically, we train binary masks over model weights on the pre-training tasks, with the aim of preserving the universal transferability of the subnetwork, which is agnostic to any specific downstream tasks. We then fine-tune the subnetworks on the GLUE benchmark and the SQuAD dataset. The results show that, compared with magnitude pruning, mask training can effectively find BERT subnetworks with improved overall performance on downstream tasks. Moreover, our method is also more efficient in searching subnetworks and more advantageous when fine-tuning within a certain range of data scarcity. Our code will be released upon publication. PDF 12 2021
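The core trick of mask training can be sketched as follows; this is a hypothetical PyTorch illustration of learning binary masks over frozen weights with a straight-through estimator, not the paper's released implementation.

```python
import torch

class MaskedLinear(torch.nn.Module):
    # Frozen pre-trained weights; a learnable real-valued score per weight.
    # The forward pass uses a hard binary mask, while gradients pass straight
    # through to the scores (straight-through estimator).
    def __init__(self, linear, threshold=0.0):
        super().__init__()
        self.weight = linear.weight.detach()  # frozen pre-trained weight
        self.scores = torch.nn.Parameter(0.01 * torch.randn_like(self.weight))
        self.threshold = threshold

    def forward(self, x):
        hard = (self.scores > self.threshold).float()
        mask = hard + self.scores - self.scores.detach()  # binary fwd, identity bwd
        return torch.nn.functional.linear(x, self.weight * mask)
```

Training such scores against the pre-training loss, as the paper proposes, leaves the original weights untouched; the learned mask alone defines the subnetwork.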
Your Answer is Incorrect... Would you like to know why? Introducing a Bilingual Short Answer Feedback Dataset Handing in a paper or exercise and merely receiving a "bad" or "incorrect" as feedback is not very helpful when the goal is to improve. Unfortunately, this is currently the kind of feedback given by many Automatic Short Answer Grading (ASAG) systems. One of the reasons for this is a lack of content-focused elaborated feedback datasets. To encourage research on explainable and understandable feedback systems, we present the Short Answer Feedback dataset (SAF). Similar to other ASAG datasets, SAF contains learner responses and reference answers to German and English questions. However, instead of only assigning a label or score to the learners' answers, SAF also contains elaborated feedback explaining the given score. Thus, SAF enables supervised training of models that grade answers and explain where and why mistakes were made. This paper discusses the need for enhanced feedback models in real-world pedagogical scenarios, describes the dataset annotation process, gives a comprehensive analysis of SAF, and demonstrates how SAF challenges T5 Transformer models. PDF 12 2021
Unsupervised Compressive Text Summarisation with Reinforcement Learning Recently, compressive text summarisation offers a balance between the conciseness issue of extractive summarisation and the factual hallucination issue of abstractive summarisation. However, most existing compressive summarisation methods are supervised, relying on the expensive effort of creating a new training dataset with corresponding compressive summaries. In this paper, we propose an unsupervised compressive summarisation method that utilises reinforcement learning to optimise a summary's semantic coverage and fluency by simulating human judgment on summarisation quality. Our model consists of an extractor agent and a compressor agent, and both agents have a multi-head attentional pointer-based structure. The extractor agent first chooses salient sentences from a document, and then the compressor agent compresses these extracted sentences by selecting salient words to form a summary without using reference summaries to compute the summary reward. That is, a parallel dataset with document-summary pairs is not required to train the proposed model. To the best of our knowledge, our proposed method is the first work on unsupervised compressive summarisation. Experimental results on three widely used datasets, Newsroom, CNN/DM, and XSum, show that our model achieves promising performance and significant improvement on Newsroom in terms of the ROUGE metric. PDF 12 2021
AutoGraphex: Zero-shot Biomedical Definition Generation with Automatic Prompting Describing terminologies with definition texts is an important step towards understanding the scientific literature, especially for domains with limited labeled terminologies. Previous works have sought to design supervised neural text generation models to solve the biomedical terminology generation task, but most of them failed to define never-before-seen terminologies in newly emerging research fields. Here, we tackle this challenge by introducing a zero-shot definition generation model based on prompting, a recent approach for eliciting knowledge from pre-trained language models, with automatically generated prompts. Furthermore, we enhanced the biomedical terminology dataset by adding descriptive texts to each biomedical subdiscipline, thus enabling zero-shot learning scenarios. Our model outperformed the existing supervised baseline and a baseline pre-trained language model that employs manually crafted prompts by up to 52 and 6 BLEU points, respectively. PDF 12 2021
When is BERT Multilingual? Isolating Crucial Ingredients for Cross-lingual Transfer While recent work on multilingual language models has demonstrated their capacity for cross-lingual zero-shot transfer on downstream tasks, there is a lack of consensus in the community as to what shared properties between languages enable such transfer. Analyses involving pairs of natural languages are often inconclusive and contradictory since languages simultaneously differ in many linguistic aspects. In this paper, we perform a large-scale empirical study to isolate the effects of various linguistic properties by measuring zero-shot transfer between four diverse natural languages and their counterparts constructed by modifying aspects such as the script, word order, and syntax. Among other things, our experiments show that the absence of sub-word overlap significantly affects zero-shot transfer when languages differ in their word order, and there is a strong correlation between transfer performance and word embedding alignment between languages (e.g., $\rho_s=0.94$ on the task of NLI). Our results call for focus in multilingual models on explicitly improving word embedding alignment between languages rather than relying on its implicit emergence. PDF 12 2021
MOVER: Mask, Over-generate and Rank for Hyperbole Generation Despite being a common figure of speech, hyperbole is under-researched in Figurative Language Processing. In this paper, we tackle the challenging task of hyperbole generation to transfer a literal sentence into its hyperbolic paraphrase. To address the lack of available hyperbolic sentences, we construct HYPO-XL, the first large-scale hyperbole corpus containing 17,862 hyperbolic sentences in a non-trivial way. Based on our corpus, we propose an unsupervised method for hyperbole generation that does not require parallel literal-hyperbole pairs. During training, we fine-tune BART to infill masked hyperbolic spans of sentences from HYPO-XL. During inference, we mask part of an input literal sentence and over-generate multiple possible hyperbolic versions. Then a BERT-based ranker selects the best candidate by hyperbolicity and paraphrase quality. Automatic and human evaluation results show that our model is effective at generating hyperbolic paraphrase sentences and outperforms several baseline systems. PDF 12 2021
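The mask / over-generate / rank recipe can be summarized in a few lines; in this hypothetical sketch, `infill` stands in for the fine-tuned BART infiller and `score` for the BERT-based hyperbolicity-and-paraphrase ranker (neither is the authors' actual API).

```python
import random

def mover(sentence, infill, score, n_candidates=10, mask_token="<mask>"):
    # Mask a random word, over-generate infilled variants, rank, keep the best.
    words = sentence.split()
    candidates = []
    for _ in range(n_candidates):
        i = random.randrange(len(words))
        masked = " ".join(words[:i] + [mask_token] + words[i + 1:])
        candidates.append(infill(masked))   # over-generate
    return max(candidates, key=score)       # rank and select
```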
Extreme Zero-Shot Learning for Extreme Text Classification The eXtreme Multi-label text Classification (XMC) problem concerns finding the most relevant labels for an input text instance from a large label set. However, the XMC setup faces two challenges: (1) it is not generalizable to predict unseen labels in dynamic environments, and (2) it requires a large amount of supervised (instance, label) pairs, which can be difficult to obtain for emerging domains. In this paper, we consider a more practical scenario called Extreme Zero-Shot XMC (EZ-XMC), in which no supervision is needed and only the raw text of instances and labels is accessible. Few-Shot XMC (FS-XMC), an extension to EZ-XMC with limited supervision, is also investigated. To learn the semantic embeddings of instances and labels with raw text, we propose to pre-train Transformer-based encoders with self-supervised contrastive losses. Specifically, we develop a pre-training method $\textbf{MACLR}$, which thoroughly leverages the raw text with techniques including $\textbf{M}$ulti-scale $\textbf{A}$daptive $\textbf{C}$lustering, $\textbf{L}$abel $\textbf{R}$egularization, and self-training with pseudo positive pairs. Experimental results on four public EZ-XMC datasets demonstrate that MACLR achieves superior performance compared to all other leading baseline methods, in particular with approximately 5-10% improvement in precision and recall on average. Moreover, we show that our pre-trained encoder can be further improved on FS-XMC when there are a limited number of ground-truth positive pairs in training. PDF 1 2022
AWS-EP: A Multi-Task Prediction Approach for MBTI/Big5 Personality Tests Personality and preferences are essential variables in computational sociology and social science. They describe differences between people at both individual and group levels. In recent years, automated approaches to detect personality traits have received much attention due to the massive availability of individuals' digital footprints. Furthermore, researchers have demonstrated a strong link between personality traits and various downstream tasks such as personalized filtering, profile categorization, and profile embedding. Therefore, the detection of individuals' personality traits has become a critical process for improving the performance of different tasks. In this paper, we build on the importance of individual personality and propose a novel multitask modeling approach that understands and models user personality based on their textual posts and comments within a multimedia framework. Experiments demonstrate that our model achieves state-of-the-art performance across multiple well-known personality datasets. PDF 1 2022
Lexicon for multiword expression identification Following the idea that lexicons are needed for automatic identification of multiword expressions (MWEs) to handle their unpredictable nature, this paper proposes a lexicon formalism, which can itself be instantiated as a multitude of possible sub-formalisms depending on the linguistic features considered, along with an evaluation method that can be used to compare lexicon formalisms to each other. An exploration of the powerset of features is carried out in order to find the best such subsets of features. The impact of the proposed lexicon formalism on MWE identification is investigated, leading us to conjecture that lexicons indeed have the potential to help MWE identification. PDF 1 2022
Framework for Weakly Supervised Causal Knowledge Extraction from Text In this paper, we address the problem of extracting causal knowledge from text documents in a weakly supervised manner. We target use cases in decision support and risk management, where causes and effects are general phrases without any constraints. We present a unified framework that supports three classes of tasks with varying degrees of available information. We provide approaches for each of the tasks using pre-trained, Natural Language Inference (NLI) and Question Answering (QA) models. We present a novel evaluation scheme and use existing and new benchmark data sets to measure the relative performance of each of the approaches. PDF 1 2022
Headed-Span-Based Projective Dependency Parsing We propose a new method for projective dependency parsing based on headed spans. In a projective dependency tree, the largest subtree rooted at each word covers a contiguous sequence (i.e., a span) in the surface order. We call such a span marked by a root word \textit{headed span}. A projective dependency tree can be represented as a collection of headed spans. We decompose the score of a dependency tree into the scores of the headed spans and design a novel $O(n^3)$ dynamic programming algorithm to enable global training and exact inference. We evaluate our method on PTB, CTB, and UD and it achieves state-of-the-art or competitive results. We will release our code at \url{github.com}. PDF 1 2022
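The headed-span representation itself is easy to illustrate: in a projective tree, the subtree rooted at each word covers a contiguous span. The sketch below recovers those spans from a head array; it shows the representation only, not the paper's $O(n^3)$ parsing algorithm.

```python
def headed_spans(heads):
    # heads[i] is the parent of token i (-1 for the root).
    # Returns, for each word, the (left, right) span of the subtree it heads.
    n = len(heads)
    left, right = list(range(n)), list(range(n))
    changed = True
    while changed:  # propagate span bounds up the tree until fixpoint
        changed = False
        for child, head in enumerate(heads):
            if head >= 0:
                if left[child] < left[head]:
                    left[head], changed = left[child], True
                if right[child] > right[head]:
                    right[head], changed = right[child], True
    return list(zip(left, right))

# "She reads long books": She->reads, reads=root, long->books, books->reads
print(headed_spans([1, -1, 3, 1]))  # [(0, 0), (0, 3), (2, 2), (2, 3)]
```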
EmpHi: Generating Empathetic Responses with Human-like Intents In empathetic conversations, humans express their empathy to others with empathetic intents. However, most existing empathetic conversational methods suffer from a lack of empathetic intents, which leads to monotonous empathy. To address the bias of the empathetic intents distribution between empathetic dialogue models and humans, we propose a novel model to generate empathetic responses with human-like empathetic intents, EmpHi for short. Precisely, EmpHi learns the distribution of potential empathetic intents with a discrete latent variable, then combines both implicit and explicit intent representation to generate responses with various empathetic intents. Experiments show that EmpHi outperforms state-of-the-art models in terms of empathy, relevance, and diversity on both automatic and human evaluation. Moreover, the case studies demonstrate the high interpretability and outstanding performance of our model. PDF 1 2022
Statistical word segmentation in spontaneous child-directed speech of Korean The present study demonstrates advantages of child-directed speech (CDS) over adult-directed speech (ADS) in statistical word segmentation of spontaneous Korean. We derived phonetic input from a phonemic corpus by applying a set of phonological rules. To model statistical word segmentation based on transitional probability (TP), we used two syllable-based algorithms (i.e., Absolute and Relative) in two directions (i.e., Forward TP and Backward TP). Results show that (i) segmentation accuracy is greater with phonetic input than phonemic input, (ii) the model performs better when trained on CDS than ADS, and (iii) segmentation accuracy improves with child age. PDF 1 2022
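The transitional-probability statistic at the heart of this study is simple to state: TP(a -> b) = count(ab) / count(a). A minimal Python sketch follows, with an illustrative threshold-based ("Absolute"-style) segmenter; the threshold value and function names are assumptions, not the study's exact configuration.

```python
from collections import Counter

def forward_tp(corpus):
    # corpus: list of utterances, each a list of syllables.
    uni, bi = Counter(), Counter()
    for utt in corpus:
        uni.update(utt)
        bi.update(zip(utt, utt[1:]))
    return {pair: bi[pair] / uni[pair[0]] for pair in bi}

def segment(utt, tp, threshold=0.5):
    # Cut a word boundary wherever forward TP dips below the threshold
    # (the "Relative" variant would instead cut at local TP minima).
    words, cur = [], [utt[0]]
    for a, b in zip(utt, utt[1:]):
        if tp.get((a, b), 0.0) < threshold:
            words.append(cur)
            cur = []
        cur.append(b)
    words.append(cur)
    return words
```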
Exploring the Impact of Negative Samples of Contrastive Learning: A Case Study of Sentence Embedding Contrastive learning is emerging as a powerful technique for extracting knowledge from unlabeled data. This technique requires a balanced mixture of two ingredients: positive (similar) and negative (dissimilar) samples. This is typically achieved by maintaining a queue of negative samples during training. Prior works in the area typically use a fixed-length negative sample queue, but how the negative sample size affects model performance remains unclear. This opaque impact of the number of negative samples on performance motivated our in-depth exploration. This paper presents a momentum contrastive learning model with a negative sample queue for sentence embedding, namely MoCoSE. We add a prediction layer to the online branch to make the model asymmetric, which, together with the EMA update mechanism of the target branch, prevents the model from collapsing. We define a maximum traceable distance metric, through which we learn to what extent text contrastive learning benefits from the historical information of negative samples. Our experiments find that the best results are obtained when the maximum traceable distance is within a certain range, demonstrating that there is an optimal range of historical information for a negative sample queue. We evaluate the proposed unsupervised MoCoSE on the semantic text similarity (STS) task and obtain an average Spearman's correlation of $77.27\%$. Source code is available \href{https://anonymous.4open.science/r/mocose-3E3C}{here}. PDF 1 2022
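For readers unfamiliar with the queue mechanism being analyzed, here is a hypothetical sketch of a MoCo-style contrastive objective with a negative queue; it shows the general technique, not MoCoSE's exact architecture (which additionally uses a prediction layer and an EMA target branch).

```python
import torch
import torch.nn.functional as F

def queue_contrastive_loss(query, key, queue, temperature=0.05):
    # query: (B, D) online-branch embeddings; key: (B, D) target-branch
    # embeddings; queue: (Q, D) embeddings from earlier batches (negatives).
    query, key, queue = (F.normalize(t, dim=-1) for t in (query, key, queue))
    pos = (query * key).sum(-1, keepdim=True)  # (B, 1) positive similarity
    neg = query @ queue.t()                    # (B, Q) negative similarities
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(len(query), dtype=torch.long)  # positive at index 0
    return F.cross_entropy(logits, labels)
```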
Contrastive Event Extraction Using Video Enhancements Event extraction aims to extract information about triggers and their associated arguments from texts. Recent advanced methods consider multi-modality to tackle the task by pairing the modalities without guaranteeing the alignment of event information across modalities, which negatively impacts model performance. To address the issue, we first constructed the Text Video Event Extraction (TVEE) dataset with an inter-annotator agreement of 83.4\%, containing 7,598 text-video pairs, each of which is connected by event alignments. To the best of our knowledge, this is the first multimodal dataset with aligned event information in each sentence and video pair. Second, we present a \textbf{C}ontrastive \textbf{L}earning based \textbf{E}vent \textbf{E}xtraction model with enhancements from the \textbf{V}ideo modality (CLEEV) to pair videos and texts using event information. CLEEV constructs negative samples by measuring event weights based on occurrences of event types to enhance the contrast. We conducted experiments on the TVEE and VM2E2 datasets by incorporating modalities to assist the event extraction, outperforming SOTA methods with 1.0 and 1.2 percentage point improvements in terms of F-score, respectively. Our experimental results show that the multimedia information improves event extraction from the textual modality. (The dataset and code will be released upon acceptance.) PDF 1 2022
Enhanced Knowledge Graphs Using Typed Entailment Graphs Constructing knowledge graphs from open-domain corpora is a crucial stage in question answering. Most previous works are based on open information extraction methods, which extract relations by parsing sentences into triples <e1, r, e2>. These methods lack inference ability and are limited by the corpus. When a query differs from the relations in the text-based knowledge graph, it is hard to return correct answers. In this paper, we propose a method to enhance knowledge graphs by using typed entailment graphs to add missing links. We construct the enhanced knowledge graph in both dynamic and offline ways. The experiments show that our method outperforms pre-trained language models in zero-shot cloze-style question answering. Furthermore, we find entailment graphs can significantly improve the recall and F-score of knowledge graphs. PDF 1 2022
Evaluating the timing and magnitude of semantic change in diachronic word embedding models Recent studies have suggested that diachronic word embedding models are able to track the direction of changes in public perception. Building on these works, we evaluate the ability of diachronic word embedding models to accurately capture such changes both qualitatively and quantitatively, such as their timing and magnitudes. Using a longitudinal dataset on public perception of brands, we found that evolution of word meaning as captured by diachronic word embedding models, trained on New York Times articles, reflected the timing and magnitudes of general consumer awareness of companies. In contrast, this was not the case for other readily available characteristics, such as stock market prices. This comparison is enabled by a new feature extraction method which summarizes the semantic changes encoded in diachronic word embeddings. PDF 1 2022
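One simple way to summarize the timing and magnitude of drift from a series of aligned yearly embeddings is sketched below; this is an illustrative construction, not the paper's actual feature extraction method.

```python
import numpy as np

def change_profile(vectors_by_year, word):
    # vectors_by_year: {year: {word: np.ndarray}}, spaces already aligned.
    years = sorted(vectors_by_year)
    base = vectors_by_year[years[0]][word]
    cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    drift = {y: 1.0 - cos(base, vectors_by_year[y][word]) for y in years}
    magnitude = max(drift.values())                    # total semantic shift
    steps = [(drift[b] - drift[a], b) for a, b in zip(years, years[1:])]
    timing = max(steps)[1]                             # year of steepest jump
    return timing, magnitude
```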
A Multi-Granularity Opinion Summarization Method Existing opinion mining (OM) is limited to applications on commercial reviews, with the aspect and sentiment of the opinions in a coarse-grained form. In this paper, we further explore the definition of OM by extending the concepts of aspect and sentiment, and propose an opinion summarization method based on Multi-granularity Clustering and BERT (Devlin et al., 2018), i.e., MCB, for emergent online discussion records in keeping with the extended definition. A supporting Chinese corpus, ZH45, comprising 45 groups of discussion, and assorted metrics are also proposed. Experiments based on ZH45 and the metrics demonstrate that MCB produces succinct and insightful opinion summaries. PDF 1 2022
Meta-CQG: A Meta-Learning Framework for Complex Question Generation over Knowledge Graph Complex question generation (CQG) aims to generate questions involving multiple Knowledge Base (KB) relations or functional constraints. Existing methods train an encoder-decoder-based model to fit all questions. However, the questions in the real world exhibit an imbalanced distribution in many dimensions, such as question type, relation class, entity class, and query structure. This results in insufficient learning for minority class samples under different dimensions. To address this problem, we propose a meta-learning framework for complex question generation. It trains a unique generator for each sample via retrieving a few most related training samples, which can deeply and quickly dive into the content features (e.g. relation and entity) and structure features (e.g. query structure) of each sample. As retrieved samples directly determine the effectiveness of each unique generator, we design a self-supervised graph retriever to learn the potential features of samples and retrieve the most related samples according to multiple dimensions. We conduct experiments on both WebQuestionSP and ComplexWebQuestion, the results on the minority class of different dimensions have been significantly improved, which demonstrates the effectiveness of the proposed framework. PDF 1 2022
A Bit Bayesian Facilitates Efficient Training in Token Classification Token classification is a fundamental subject in computational linguistics. Token classification models, like other modern deep neural network models, are usually trained on the entire training set in each epoch, while research has found the entirety of the training data may not be needed in later epochs of training. Moreover, over-training on data that are properly handled may poison the model. Inspired by human pedagogy, we propose a teacher-aware learning structure for token classification models. After each epoch of training, the teacher selects data it is uncertain of and data it predicts differently from the student, which are passed into the structure for training in the next epoch. As a proof of concept, we use a Bayesian linear classifier as the teacher and two commonly used backbone models as the student. Experiments show our method reduces the number of training iterations and improves model performance in most cases. PDF 1 2022
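The selection rule in the proposed structure can be sketched in a few lines; the entropy-based uncertainty criterion and the quantile cutoff below are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def select_next_epoch(teacher_probs, student_preds, quantile=0.75):
    # teacher_probs: (N, C) class probabilities from the Bayesian teacher;
    # student_preds: (N,) hard predictions from the student model.
    entropy = -(teacher_probs * np.log(teacher_probs + 1e-12)).sum(-1)
    uncertain = entropy >= np.quantile(entropy, quantile)  # teacher unsure
    disagree = teacher_probs.argmax(-1) != student_preds   # teacher != student
    return np.where(uncertain | disagree)[0]  # indices to train on next epoch
```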
AdapLeR: Speeding up Inference by Adaptive Length Reduction Pre-trained language models have shown stellar performance in various downstream tasks. However, this usually comes at the cost of high latency and computation, hindering their usage in resource-limited settings. In this work, we propose a novel approach for reducing the computational cost of BERT with minimal loss in downstream performance. Our model dynamically eliminates less contributing tokens through layers, resulting in shorter lengths and consequently lower computational cost. To determine the importance of each token representation, we train a Contribution Predictor for each layer using a gradient-based saliency method. Our experiments on several diverse classification tasks show speedups of up to 17x during inference time. We also validate the quality of the selected tokens in our method using human annotations in the ERASER benchmark. In comparison to other widely used strategies for selecting important tokens, such as saliency and attention, our proposed method has a significantly lower false positive rate in generating rationales. PDF 1 2022
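The layer-wise token-dropping idea can be pictured with the following hypothetical PyTorch sketch; `contribution` stands in for a trained per-layer Contribution Predictor, and the top-k retention rule is an illustrative simplification of the paper's mechanism.

```python
import torch

def drop_tokens(hidden, mask, contribution, keep_ratio=0.7):
    # hidden: (B, T, D) layer outputs; mask: (B, T) 0/1 attention mask.
    scores = contribution(hidden).squeeze(-1)            # (B, T) token scores
    scores = scores.masked_fill(mask == 0, float("-inf"))
    k = max(1, int(keep_ratio * int(mask.sum(-1).max())))
    topk = scores.topk(k, dim=-1).indices
    new_mask = torch.zeros_like(mask).scatter(1, topk, 1) * mask
    return hidden, new_mask  # dropped tokens are simply no longer attended to
```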
Generating a Temporally Coherent Visual Story by Multimodal Recurrent Transformers Story visualization is a challenging text-to-image generation task for the difficulty of rendering visual details from abstract text descriptions. Besides the difficulty of image generation, the generator also needs to conform to the narrative of a multi-sentence story input. While prior arts in this domain have focused on improving semantic relevance between generated images and input text, controlling the generated images to be temporally consistent still remains a challenge. Moreover, existing generators are trained on single text-image pairs and fail to consider the variations of natural language captions that can describe a given image, causing poor model generalization. To address such problems, we leverage a cyclic training methodology involving pseudo-text descriptions as an intermediate step that decouples the image’s visual appearance from the variations of natural language descriptions. Additionally, to generate a semantically coherent image sequence, we consider an explicit memory controller which can augment the temporal coherence of images in the multi-modal autoregressive transformer. To sum up all components, we call it Cyclic Story visualization by MultimodAl Recurrent Transformers or C-SMART for short. Our method generates high-resolution, high-quality images, outperforming prior works by a significant margin across multiple evaluation metrics on the Pororo-SV dataset. PDF 1 2022
Deep Speech Synthesis from Articulatory Features In the articulatory synthesis task, speech is synthesized from input features containing information about the physical behavior of the human vocal tract. This task provides a promising direction for speech synthesis research, as the articulatory space is compact, smooth, and interpretable. Current works have highlighted the potential for deep learning models to perform articulatory synthesis. However, it remains unclear whether these models can achieve the efficiency and fidelity of the human speech production system. To help bridge this gap, we propose a time-domain articulatory synthesis methodology and demonstrate its efficacy with both electromagnetic articulography (EMA) and synthetic articulatory feature inputs. Our model is both computationally efficient and highly intelligible, achieving a transcription word error rate (WER) of 7.14\% for the EMA-to-speech task. Through interpolation experiments, we also highlight the generalizability and interpretability of our approach. PDF 1 2022
How do we get there? Evaluating transformer neural networks as cognitive models for English past tense inflection Neural network models have achieved good performance on morphological inflection tasks, including English past tense inflection. However, whether they can represent human cognitive mechanisms is still under debate. In this work, we examined transformer models with different sizes and distributions of training data to show that: 1) the neural models' performance correlates with adult behavior, but not children's behavior, and the model with small-size training data that matches parents' input distribution has the highest correlation; 2) the neural models' errors are not human-like; however, the errors on regulars and irregulars show a clear distinction. Therefore, we conclude that current transformer models exhibit some resemblance to human behavior, but are insufficient as cognitive models of learning morphological rules. PDF 1 2022
What Role Does BERT Play in the Neural Machine Translation Encoder? Pre-trained language models have been widely applied in various natural language processing tasks. But when it comes to neural machine translation, things are a little different. The differences between the embedding spaces created by BERT and NMT encoder may be one of the main reasons for the difficulty of integrating pre-trained LMs into NMT models. Previous studies illustrate the best way of integration is introducing the output of BERT into the encoder with some extra modules. Nevertheless, it is still unrevealed whether these additional modules will affect the embedding spaces created by the NMT encoder or not and what kind of information the NMT encoder takes advantage of from the output of BERT. In this paper, we start by comparing the changes of embedding spaces after introducing BERT into the NMT encoder trained on different machine translation tasks. Although the changing trends of these embedding spaces vary, introducing BERT into the NMT encoder will not affect the space of the last layer significantly. Subsequent evaluation on several semantic and syntactic tasks proves the NMT encoder is facilitated by the rich syntactic information contained in the output of BERT to boost the translation quality. PDF 1 2022
Quantifying Synthesis and Fusion and their Impact on Machine Translation Theoretical work in morphological typology offers the possibility of measuring morphological diversity on a continuous scale. However, literature in NLP typically labels a whole language with a strict type of morphology, e.g. fusional or agglutinative. In this work, we propose to reduce the theoretical rigidity of such claims, by quantifying the morphological typology at the word and segment level. We consider Payne (2017)'s approach to classify morphology using two indices: synthesis (from 1 for analytic to 3 or more for polysynthetic) and fusion (from 0 for agglutinative to 1 for fusional). For computing synthesis, we test unsupervised and supervised morphological segmentation methods for English, German and Turkish, whereas for fusion, we propose a semi-automatic method using Spanish as a case study. Then, we analyse the relationship between machine translation quality and the degree of synthesis and fusion at the word level (nouns and verbs for English-Turkish, and verbs in English-Spanish) and segment level (the previous language pairs plus English-German in both directions). We complement the word-level analysis with human evaluation, and overall, we observe a consistent impact of both indices on machine translation quality. PDF 1 2022
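The synthesis index has a particularly direct reading: the average number of morphemes per word. A minimal sketch, assuming word segmentations produced by one of the morphological segmenters the paper tests:

```python
def synthesis_index(segmented_words):
    # segmented_words: list of words, each a list of morphemes.
    # ~1 = analytic; 3 or more = polysynthetic (Payne, 2017).
    return sum(len(w) for w in segmented_words) / len(segmented_words)

# "unbelievably strong" -> un-believe-able-ly + strong
print(synthesis_index([["un", "believe", "able", "ly"], ["strong"]]))  # 2.5
```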
JEFF - Just Another EFFicient Reading Comprehension Test Generation We introduce a method for generating vocabulary questions for reading comprehension of a given English article. Our method involves selecting target words in the given English article, finding synonyms as answer keys, and generating seemingly reasonable in-context words as distractors. At run-time, target words in the input article are identified as question items, and one answer key and three distractors are generated automatically. We present an AQG (automatic question generation) system, JEFF, that applies the method to generate questions automatically. Evaluation on a set of questions generated by JEFF shows that they are close to human-designed ones. PDF 1 2022
FAQ Search using Transformers Many websites have bots as guiding agents for answering FAQ questions or directing users to human support. Many of them already have a curated FAQ page that can be used to bootstrap these bots. In this paper, we tackle a real-world question answering problem for bots. Given a user query, the system needs to pick the most relevant answer from a data source such as an FAQ or manuals. The ranking system therefore needs to consider not just the passage but also the provided support questions or titles. This technique also provides the flexibility to add and delete support questions to continuously improve the bot's quality; suggestions can be provided by the system, and the bot developer has control over their data instead of a black-box system. We explore novel techniques to improve the results on a few public sets and on our own judged real user data. We limit our experiments to transformers since they have proven to be significantly better across question answering tasks. We show that significant gains can be obtained using an extra segment embedding as well as pre-training new separators in transformers. PDF 1 2022
Sentence-Level Discourse Parsing as Text-to-Text Generation Previous studies have made great advances in RST discourse parsing through neural frameworks or efficient features, but they usually split the parsing process into two subtasks and heavily depend on gold segmentation. In this paper, we introduce an end-to-end method for sentence-level RST discourse parsing via transforming it into a text-to-text generation task. Our method unifies the traditional two-stage parsing and generates the parsing tree directly from the input text without requiring a complicated model. Moreover, the EDU segmentation can be simultaneously generated and extracted from the parsing tree. Experimental results on the RST Discourse Treebank demonstrate that our proposed method outperforms existing methods in both tasks of sentence-level RST parsing and discourse segmentation. Considering the lack of annotated data in RST parsing, we also create high-quality augmented data and implement self-training, which further improves the performance. PDF 1 2022
Don’t Forget About Pronouns: Removing Gender Bias in Language Models without Losing Factual Gender Information The representations in large language models contain various types of gender information. We focus on two types of such signals in English texts: factual gender information, which is a grammatical or semantic property, and gender bias, which is the correlation between a word and specific gender. We can disentangle the model’s embeddings and identify components encoding both information with probing. We aim to diminish the representation of stereotypical bias while preserving factual gender signal. Our filtering method shows that it is possible to decrease the bias of gender-neutral profession names without deteriorating language modeling capabilities. The findings can be applied to language generation and understanding to mitigate reliance on stereotypes while preserving gender agreement in coreferences. PDF 1 2022
E-MMAD: Multimodal Advertising Caption Generation Based on Structured Information With multimodal tasks becoming increasingly popular in recent years, datasets of large scale and reliable authenticity are in urgent demand. We therefore present an e-commercial multimodal advertising dataset, E-MMAD, which contains 120 thousand valid examples elaborately picked out from 1.3 million real product examples in both Chinese and English. Notably, it is one of the largest video captioning datasets in this field, in which each example has its product video (around 30 seconds), title, caption, and a structured information table that is observed to play a vital role in practice. We also introduce a novel task for vision-language research based on E-MMAD: e-commercial multimodal advertising caption generation, which requires using the aforementioned multimodal product information to generate a textual advertisement. Accordingly, we propose a baseline method that leverages structured information reasoning to address this real-world demand on the dataset. PDF 1 2022
Learning to Embed Multi-Modal Contexts for Situated Conversational Agents The Situated Interactive Multi-Modal Conversations (SIMMC) 2.0 challenge aims to create virtual shopping assistants that can accept complex multi-modal inputs, i.e. visual appearances of objects and user utterances. It consists of four subtasks: multi-modal disambiguation (MM-Disamb), multi-modal coreference resolution (MM-Coref), multi-modal dialog state tracking (MM-DST), and response retrieval and generation. While many task-oriented dialog systems usually tackle each subtask separately, we propose a jointly learned multi-modal encoder-decoder that incorporates visual inputs and performs all four subtasks at once for efficiency. This approach won the MM-Coref and response retrieval subtasks and was runner-up for the remaining subtasks with a single unified model at the 10th Dialog Systems Technology Challenge (DSTC10), setting a high bar for the novel task of multi-modal task-oriented dialog systems. PDF 1 2022
Layout-Aware Neural Model for Resolving Hierarchical Table Structure While many pipelines for extracting information from tables assume simple table structure, tables in the financial domain frequently have a complex, hierarchical structure. The primary example would be parent-child relationships between header cells. Most prior datasets of tables annotated from images or PDFs, and most models for extracting table structure, concentrate on the problems of table boundaries and cell, row, and column bounding-box extraction. The area of fine-grained table structure remains relatively unexplored. This study presents a dataset of 657 tables, manually labeled for cell types and column hierarchy relations. The tables are selected from IBM FinTabNet. The selection of these 657 tables is performed using heuristics, resulting in a much larger proportion, roughly half, of the selected tables having a complex hierarchical structure than a random sample from FinTabNet would. Further, we fine-tune models based on LayoutLM on the cell-type classification task and identify hierarchical relations among column headers. We achieve F1 scores of 97% and 73% on the respective tasks. Finally, we use the trained model to create soft labels for the entirety of FinTabNet. PDF 1 2022
Towards a Progression-Aware Autonomous Dialogue Agent Recent advances in large-scale language modeling and generation have enabled the creation of dialogue agents that exhibit human-like responses in a wide range of conversational scenarios spanning a diverse set of tasks, from general chit-chat to focused goal-oriented discourse. While these agents excel at generating high-quality responses that are relevant to prior context, they suffer from a lack of awareness of the overall direction in which the conversation is headed, and the likelihood of task success inherent therein. Thus, we propose a framework in which dialogue agents can evaluate the progression of a conversation toward or away from desired outcomes, and use this signal to inform planning for subsequent responses. Our framework is composed of three key elements: (1) the notion of a "global" dialogue state (GDS) space, (2) a task-specific progression function (PF) computed in terms of a conversation's trajectory through this space, and (3) a planning mechanism by which a dialogue agent may use progression signals to select its next response. PDF 1 2022
Patching Leaks in the Charformer for Generative Tasks Character-based representations have important advantages over subword-based ones, including increased robustness to noisy input and removing the need for tokenization preprocessing. However, they also have a crucial disadvantage: they notably increase the length of text sequences. The GBST method from Charformer groups (aka downsamples) characters to solve this, but allows information to leak when applied to a Transformer decoder. We introduce novel methodology to solve this information leak issue, which opens up the possibility of using character grouping in the decoder. We show that Charformer downsampling has no apparent benefits in NMT over previous downsampling methods. PDF 1 2022
Probing The Linguistic Capacity of Pre-Trained Vision-Language Models How do recent vision-language pre-trained models compare against language-specific pre-trained models on common linguistic tasks? In this paper, we assess this in a probing setting. Our results suggest that different multimodal pre-training strategies entail distinct strengths. Although pre-trained language models generally fare better, pre-trained vision-language models can obtain higher average scores in certain scenarios (e.g., CLIP is $2\%$ higher than BERT on SST2). We also analyze and illustrate that the different competences in different model layers cause such performance differences. Our work then proposes fine-tuning techniques to improve the abilities of vision-language models on linguistic tasks. PDF 1 2022
Unlearnable Text for Neural Classifiers Neural text classification models are known to explore statistical patterns during supervised learning. However, such patterns include spurious patterns and superficial regularity in the training data. In this paper, we exaggerate superficial regularity in the text to prevent unauthorized exploration of personal data. We propose a gradient-based method to construct text modifications, which can make deep neural networks (DNNs) unlearnable. We then analyze text modifications exposed by the gradient-based method and further propose two simple hypotheses to manually craft unlearnable text. Experiments on four tasks (sentiment classification, topic classification, reading comprehension and gender classification) validate the effectiveness of our method, by which these hypotheses achieve almost untrained performance after training on unlearnable text. PDF 1 2022
Learning Sense Embeddings from Definitions in Dictionaries We introduce a method for learning to embed word senses as defined in a given set of dictionaries. In our approach, sense definition pairs <word, definition> are transformed into low-dimension vectors aimed at maximizing the probability of reconstructing the definitions in an autoencoding setting. The method involves automatically training a sense autoencoder for encoding sense definitions, automatically aligning sense definitions, and automatically generating embeddings of arbitrary descriptions. At run-time, queries from users are mapped to the embedding space and re-ranking is performed on the retrieved sense definitions. We present a prototype sense definition embedding, SenseNet, that applies the method to two dictionaries. Blind evaluation on a set of real queries shows that the method significantly outperforms a baseline based on the Lesk algorithm. Our methodology clearly supports combining multiple dictionaries, resulting in additional improvement in representing sense definitions in dictionaries. PDF 1 2022
Neural Discourse Deixis Resolution in Dialogue We adapt Lee et al.'s (2018) span-based entity coreference model to the task of discourse deixis resolution. The resulting model achieves state-of-the-art results on the four datasets in the CODI-CRAC 2021 shared task. PDF 1 2022
Rethinking Style Transformer by Energy-based Interpretation: Adversarial Unsupervised Style Transfer using Pretrained Model Style control, content preservation, and fluency determine the quality of text style transfer models. To train on a nonparallel corpus, several existing approaches aim to deceive the style discriminator with an adversarial loss. However, adversarial training significantly degrades fluency compared to the other two metrics. In this work, we explain this phenomenon with an energy-based interpretation and leverage a pretrained language model to improve fluency. Specifically, we propose a novel approach of applying the pretrained language model to the text style transfer framework by restructuring the discriminator and the model itself, allowing the generator and the discriminator to take advantage of the power of the pretrained model. We evaluate our model on four public benchmarks (Amazon, Yelp, GYAFC, and Civil Comments) and achieve state-of-the-art performance on the overall metrics. PDF 1 2022
Contrastive Demonstration Tuning for Pre-trained Language Models Pretrained language models can be effectively stimulated by textual prompts or demonstrations, especially in low-data scenarios. Recent works have focused on automatically searching discrete or continuous prompts or optimized verbalizers, yet studies of demonstrations are still limited. Notably, demonstration examples are crucial for excellent final performance in prompt-tuning. In this paper, we propose a novel pluggable, extensible, and efficient approach named contrastive demonstration tuning, which is free of demonstration sampling. Furthermore, the proposed approach can be: (i) plugged into any previous prompt-tuning approach; (ii) extended to widespread classification tasks with a large number of categories. Experimental results on 16 datasets illustrate that our method, integrated with the previous approaches LM-BFF and P-tuning, can yield better performance. PDF 1 2022
Curriculum: A Broad-Coverage Benchmark for Linguistic Phenomena in Natural Language Understanding In the age of large transformer language models, linguistic benchmarks play an important role in diagnosing models' abilities and limitations on natural language understanding. However, current benchmarks show some significant shortcomings. In particular, they do not provide insight into how well a language model captures distinct linguistic phenomena essential for language understanding and reasoning. In this paper, we introduce Curriculum, a new large-scale NLI benchmark for evaluation on broad-coverage linguistic phenomena. We show that our benchmark for linguistic phenomena serves as a more difficult challenge for current state-of-the-art models. Our experiments also provide insight into the limitations of existing benchmark datasets. In addition, we find that sequential training on selected linguistic phenomena effectively improves generalization performance on adversarial NLI under limited training examples. PDF 1 2022
Rethinking Offensive Text Detection as a Multi-Hop Reasoning Problem We introduce the task of implicit offensive text detection in dialogues, where a statement may have either an offensive or non-offensive interpretation, depending on the listener and context. We argue that reasoning is crucial for understanding this broader class of offensive utterances, and create Mh-RIOT ($M$ulti-hop $R$easoning $I$mplicitly $O$ffensive $T$ext Dataset), to support research on this task. Experiments using the dataset show that state-of-the-art methods of offense detection perform poorly when asked to detect implicitly offensive statements, achieving only $<11$ accuracy. PDF 1 2022
DocEE: A Large-Scale and Fine-grained Benchmark for Document-level Event Extraction Event extraction aims to identify an event and then extract the arguments participating in the event. Despite the great success in sentence-level event extraction, events are more naturally presented in the form of documents, with event arguments scattered across multiple sentences. However, a major barrier to promoting document-level event extraction has been the lack of large-scale, practical training and evaluation datasets. In this paper, we present DocEE, a new document-level event extraction dataset including 20,000+ events and 100,000+ arguments. We highlight three features: large-scale manual annotations, fine-grained argument types and application-oriented settings. Experiments show that there is still a big gap between state-of-the-art models and human beings (43\% vs. 85\% in F1 score), indicating that document-level event extraction remains an open problem. We will publish DocEE upon acceptance. PDF 1 2022
Data Augmentation for Low-Resource Dialogue Summarization We present DADS, a novel Data Augmentation technique for low-resource Dialogue Summarization. Our method generates synthetic examples by replacing sections of text from both the input dialogue and summary while keeping the augmented summary a viable summary for the augmented dialogue. We utilize pretrained language models that produce highly likely dialogue alternatives while still being free to generate diverse alternatives. We applied our data augmentation method to the SAMSum dataset in low-resource scenarios, mimicking real-world problems such as chat, thread, and meeting summarization where large-scale supervised datasets with human-written summaries are scarce. Through both automatic and human evaluations, we show that DADS yields strong improvements in low-resource scenarios while generating topically diverse summaries without introducing additional hallucinations. PDF 1 2022
CSL: A Large-scale Chinese Scientific Literature Dataset for Cross-task Evaluation Scientific literature serves as a high-quality corpus that can provide naturally annotated data for much natural language processing (NLP) research. In this work, we introduce a Chinese Scientific Literature dataset – CSL, which contains the titles, abstracts, keywords and academic fields of 400,000 papers. The rich semantic information in this scientific literature supports extensive NLP tasks and provides a natural cross-task scenario. Based on this, we present a cross-task few-shot benchmark. To evaluate the cross-task transferability of models, we design scenarios of different aspects and difficulties. Compared with previous cross-task benchmarks, these tasks are constructed from a homogeneous corpus, allowing researchers to investigate the relationships between tasks without being disturbed by heterogeneous data sources, annotation, and other factors. We analyze the behavior of existing text-to-text models on the proposed benchmark and reveal the challenges for cross-task generalization, which provides a valuable reference for future research. Code and data are publicly available at https://github.com/CSL-Dataset/CSL_Dataset. PDF 1 2022
AlephBERT: Language Model Pre-training and Evaluation from Sub-Word to Sentence Level Large Pre-trained Language Models (PLMs) have become ubiquitous in the development of language understanding technology and lie at the heart of many artificial intelligence advances. While advances reported for English using PLMs are unprecedented, reported advances using PLMs for Hebrew are few and far between. The problem is twofold. First, Hebrew resources for training large language models have not been of the same magnitude as their English counterparts. Second, most benchmarks available to evaluate progress in Hebrew NLP require morphological boundaries which are not readily available in the output of PLMs. In this work we aim to remedy both aspects. We present AlephBERT, a large PLM for Modern Hebrew, trained on a larger vocabulary and a larger dataset than any previous Hebrew PLM. Moreover, we introduce a novel neural architecture that recovers the morphological segments encoded in contextualized embeddings. Based on this new morphological component we offer an evaluation suite consisting of multiple tasks and benchmarks that cover both word-level and sub-word-level analyses. On all tasks, AlephBERT obtains state-of-the-art results beyond all existing Hebrew models. We make AlephBERT, the morphological extraction model, and the Hebrew evaluation suite publicly available. PDF 1 2022
C$^3$KG: A Chinese Commonsense Conversation Knowledge Graph Existing commonsense knowledge bases often organize tuples in an isolated manner, which makes it difficult for commonsense conversational models to plan the next steps. To fill the gap, we curate a large-scale multi-turn human-written conversation corpus, and create the first Chinese commonsense conversation knowledge graph which incorporates both social commonsense knowledge and dialog flow information. To show the potential of our graph, we develop a graph-conversation matching approach, and benchmark two graph-grounded conversational tasks. All the resources in this work will be released to foster future research. PDF 1 2022
Improving Neural Models for Radiology Report Retrieval with Lexicon-based Automated Annotation Many clinical informatics tasks that are based on electronic health records need relevant patient cohorts to be selected based on findings, symptoms, and diseases. Frequently, these conditions are described in radiology reports which can be retrieved using information retrieval (IR) methods. The latest of these techniques utilize neural IR models such as BERT trained on clinical text. However, these methods still lack semantic understanding of the underlying clinical conditions as well as ruled out findings, resulting in poor precision during retrieval. In this paper we combine clinical finding detection with supervised query match learning. Specifically, we use lexicon-driven concept detection to detect relevant findings in sentences. These findings are used as queries to train a Sentence-BERT (SBERT) model using triplet loss on matched and unmatched query-sentence pairs. We show that the proposed supervised training task remarkably improves the retrieval performance of SBERT. The trained model generalizes well to unseen queries and reports from different collections. PDF 1 2022
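The triplet-loss training step described above is standard enough to sketch. The toy encoder and random features below are stand-ins (hypothetical, for illustration only); the paper fine-tunes Sentence-BERT on lexicon-matched and unmatched query-sentence pairs.

```python
# Sketch of triplet-loss training over (query, matched, unmatched) embeddings,
# as in the SBERT fine-tuning described above. The tiny MLP encoder and random
# "sentence features" are placeholders for a real sentence encoder.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(300, 128), nn.ReLU(), nn.Linear(128, 64))
loss_fn = nn.TripletMarginLoss(margin=1.0)
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)

# Toy batch: pretend these are pooled features for query / matched / unmatched sentences.
query, matched, unmatched = (torch.randn(8, 300) for _ in range(3))

anchor, positive, negative = encoder(query), encoder(matched), encoder(unmatched)
loss = loss_fn(anchor, positive, negative)  # pull matches closer than non-matches
loss.backward()
optimizer.step()
```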
Measuring and Improving Compositional Generalization in Text-to-SQL via Component Alignment In text-to-SQL tasks --- as in much of NLP --- compositional generalization is a major challenge: neural networks struggle when training and test distributions differ. However, most recent attempts to improve this are based on word-level synthetic data or specific dataset splits to generate compositional biases. In this work, we propose a clause-level compositional example generation method. We first split the sentences in the Spider text-to-SQL dataset into sub-sentences, annotating each sub-sentence with its corresponding SQL clause, resulting in a new dataset Spider-SS. We then construct a further dataset, Spider-CG, by composing Spider-SS sub-sentences in different combinations, to test the ability of models to generalize compositionally. Experiments show that existing models suffer significant performance degradation when evaluated on Spider-CG, even though every sub-sentence is seen during training. To deal with this problem, we modify a number of state-of-the-art models to train on the segmented data of Spider-SS, and we show that this method improves generalization performance. PDF 1 2022
Reinforcement Learning with Large Action Spaces for Neural Machine Translation Applying reinforcement learning (RL) following pre-training is a versatile method for enhancing neural machine translation (NMT) performance. However, recent work has argued that the gains produced by RL for NMT are mostly due to promoting tokens that have already received a fairly high probability in pre-training. We hypothesize that the large action space is a main obstacle to RL's effectiveness in MT, and conduct two sets of experiments that lend support to our hypothesis, focusing on low-resource settings. First, we find that reducing the size of the vocabulary improves RL's effectiveness. Second, we find that effectively reducing the dimension of the action space without changing the vocabulary also yields notable improvement as evaluated by BLEU, semantic similarity, and human evaluation. Indeed, by replacing the network's final fully connected layer (which maps the network's internal dimension to the vocabulary dimension) with a layer that generalizes over similar actions, we obtain a substantial improvement in RL performance. PDF 1 2022
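The abstract does not specify the replacement layer, but one plausible reading of "a layer that generalizes over similar actions" is to factor the output projection through a low-dimensional action space so that similar tokens share parameters. The sketch below shows that reading only, not the paper's actual layer.

```python
# Hedged sketch of shrinking the effective action space: route logits through
# a low-dimensional bottleneck instead of a direct hidden -> |V| projection.
# This is one plausible interpretation of the paper's layer, not its exact design.
import torch
import torch.nn as nn

class FactorizedOutput(nn.Module):
    def __init__(self, hidden: int, vocab: int, action_dim: int = 64):
        super().__init__()
        self.to_action = nn.Linear(hidden, action_dim)       # small action space
        self.action_to_vocab = nn.Linear(action_dim, vocab)  # shared expansion

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.action_to_vocab(self.to_action(h))

layer = FactorizedOutput(hidden=512, vocab=32000)
logits = layer(torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 32000])
```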
How to Translate Your Samples and Choose Your Shots? Analyzing Translate-train & Few-shot Cross-lingual Transfer Translate-train or few-shot cross-lingual transfer can be used to improve the zero-shot performance of multilingual pretrained language models. Few-shot utilizes high-quality, low-quantity samples (often manually translated from the English corpus). Translate-train employs a machine translation of the English corpus, resulting in samples of lower quality that can be scaled to high quantity. Given the lower cost and higher availability of machine translation compared to manual professional translation, it is important to systematically compare few-shot and translate-train, understand when each has an advantage, and investigate how to choose the shots to translate in order to increase the few-shot gain. This work aims to fill this gap: we compare and quantify the performance gain of few-shot vs. translate-train using three different base models and a varying number of samples for three tasks/datasets (XNLI, PAWS-X, XQuAD) spanning 17 languages. We show that scaling up the training data using machine translation gives a larger gain compared to using the small-scale (higher-quality) few-shot data. When few-shot is beneficial, we show that there are random sets of samples that perform better across languages, and that the performance on English and on the machine translation of the samples can both be used to choose the shots to manually translate for an increased few-shot gain. PDF 1 2022
Multi-stage Distillation Framework for Cross-Lingual Semantic Similarity Matching Previous studies have shown that cross-lingual knowledge distillation can significantly improve the performance of pre-trained models for cross-lingual similarity matching tasks. However, the student model needs to be large in this operation; otherwise, its performance will drop sharply, making it impractical to deploy on memory-limited devices. To address this issue, we delve into cross-lingual knowledge distillation and propose a multi-stage distillation framework for constructing a small-size but high-performance cross-lingual model. In our framework, contrastive learning, bottleneck, and parameter recurrent strategies are delicately combined to prevent performance from being compromised during the compression process. The experimental results demonstrate that our method can compress the size of XLM-R and MiniLM by more than 50%, while the performance is only reduced by about 1%. PDF 1 2022
When Does Translation Require Context? A Data-driven, Multilingual Exploration Although proper handling of discourse phenomena contributes significantly to the quality of machine translation (MT), improvements on these phenomena are not adequately measured in common translation quality metrics. Recent works in context-aware MT attempt to target a small set of these phenomena during evaluation. In this paper, we propose a methodology to systematically identify translations that require context, and use this methodology both to confirm the difficulty of previously studied phenomena and to uncover new ones that have not been addressed in previous work. We then develop the \textbf{Mu}ltilingual \textbf{D}iscourse-\textbf{A}ware (MuDA) benchmark, a series of taggers for these phenomena in 14 different language pairs, which we use to evaluate context-aware MT. We find that commonly studied context-aware MT models make only marginal improvements over context-agnostic models, which suggests these models do not handle these ambiguities effectively. We will release code and data to invite the MT research community to increase efforts on translating discourse phenomena and languages that are currently overlooked. PDF 1 2022
Continual Prompt Tuning for Dialog State Tracking A desirable dialog system should be able to continually learn new skills without forgetting old ones, and thereby adapt to new domains or tasks in its life cycle. However, continually training a model often leads to a well-known catastrophic forgetting issue. In this paper, we present Continual Prompt Tuning, a parameter-efficient framework that not only avoids forgetting but also enables knowledge transfer between tasks. To avoid forgetting, we only learn and store a few prompt tokens' embeddings for each task while freezing the backbone pre-trained model. To achieve bi-directional knowledge transfer among tasks, we propose several techniques (continual prompt initialization, query fusion, and memory replay) to transfer knowledge from preceding tasks and a memory-guided technique to transfer knowledge from subsequent tasks. Extensive experiments demonstrate the effectiveness and efficiency of our proposed method on continual learning for dialog state tracking, compared with state-of-the-art baselines. PDF 1 2022
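The freeze-backbone, learn-prompts recipe above is concrete enough for a short sketch. The toy Transformer below stands in for the pre-trained dialog model, and the placeholder loss is not the paper's dialog state tracking objective; only the per-task prompt embeddings receive gradients.

```python
# Minimal sketch of per-task soft prompts with a frozen backbone, the core
# mechanism behind Continual Prompt Tuning. Backbone and loss are stand-ins.
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)
for p in backbone.parameters():
    p.requires_grad = False  # the backbone stays fixed across all tasks

n_prompt_tokens, dim = 10, 64
task_prompt = nn.Parameter(torch.randn(n_prompt_tokens, dim) * 0.02)  # stored per task

inputs = torch.randn(8, 20, 64)                       # (batch, seq, dim) input embeddings
prompts = task_prompt.unsqueeze(0).expand(8, -1, -1)  # prepend to every example
hidden = backbone(torch.cat([prompts, inputs], dim=1))

loss = hidden.mean()  # placeholder objective, not the DST loss
loss.backward()       # gradients flow only into task_prompt
```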
RGL: A Simple yet Effective Relation Graph Augmented Prompt-based Tuning Approach for Few-Shot Learning Pre-trained language models (PLMs), which carry generic knowledge, can be a good starting point for adapting to downstream applications. However, it is difficult to generalize PLMs to new tasks given only a limited number of labeled samples. In this work, we show that the Relation Graph augmented Learning (RGL) method can obtain better performance in few-shot natural language understanding tasks. During learning, RGL constructs a relation graph based on the label consistency between samples in the same batch, and learns to solve the resultant node classification and link prediction problems on the relation graphs. In this way, RGL fully exploits the limited supervised information, which can boost the tuning effectiveness. Extensive experiments on benchmark tasks show that RGL consistently improves the performance of prompt-based tuning strategies. PDF 1 2022
FaiRR: Faithful and Robust Deductive Reasoning over Natural Language Transformers have been shown to be able to perform deductive reasoning on a logical rulebase containing rules and statements written in natural language. Recent works show that such models can also produce the reasoning steps (i.e., the proof graph) that emulate the model's logical reasoning process. Currently, these black-box models generate both the proof graph and intermediate inferences within the same model and thus may be unfaithful. In this work, we frame the deductive logical reasoning task by defining three modular components: rule selection, fact selection, and knowledge composition. The rule and fact selection steps select the candidate rule and facts to be used, and the knowledge composition then combines them to generate new inferences. This ensures model faithfulness by guaranteeing a causal relation from the proof steps to the inferences. To test our framework, we propose FaiRR (Faithful and Robust Reasoner), where the above three components are independently modeled by transformers. We observe that FaiRR is robust to novel language perturbations, and is faster at inference than previous works on existing reasoning datasets. Additionally, in contrast to black-box generative models, the errors made by FaiRR are more interpretable due to the modular approach. PDF 1 2022
Towards a Fast Response Selection: Selecting the Optimal Dialogue Response Once for All The response selector, an essential component of dialogue systems, aims to pick out the optimal response from a candidate pool to continue the dialogue. The current state-of-the-art methods are mainly based on an encoding paradigm called Cross-Encoder, which separately encodes each context-response pair and ranks the responses according to their fitness scores. However, such a paradigm is both inefficient and ineffective. Specifically, it has to repeatedly encode the same context for each response, which results in heavy inference cost. Also, without considering the relationships among the candidates, it is difficult to tell which one is the best candidate purely based on each candidate's fitness score. To address these problems, we propose a new model called Panoramic-Encoder, which accepts all candidates and the context as inputs at once and allows them to interact with each other through a specially designed attention mechanism. Our method also allows us to naturally integrate effective training techniques such as in-batch negative training. Extensive experiments across four benchmark datasets show that our new method significantly outperforms the current state-of-the-art while achieving an approximately 3X speed-up at inference time. PDF 1 2022
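In-batch negative training, which the abstract mentions integrating, can be sketched in a few lines: each context's gold response sits on the diagonal of a batch similarity matrix, and every other response in the batch acts as a negative.

```python
# Sketch of in-batch negative training for response selection: every other
# response in the batch serves as a negative for a given context, so the
# similarity matrix's diagonal holds the true pairs.
import torch
import torch.nn.functional as F

context_emb = torch.randn(16, 256)   # encoded dialogue contexts
response_emb = torch.randn(16, 256)  # encoded gold responses, aligned by row

scores = context_emb @ response_emb.T  # (16, 16) similarity matrix
labels = torch.arange(scores.size(0))  # correct response is on the diagonal
loss = F.cross_entropy(scores, labels)
```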
Pruning Adapterfusion with the Lottery Ticket Hypothesis Pre-trained language models have shown great success in multiple downstream tasks. However, they are computationally expensive to fine-tune. Thus, transfer learning with adapter modules has been introduced to alleviate this problem, helping to extract knowledge of the downstream tasks. The latest Adapterfusion model can further merge multiple adapters to incorporate knowledge from different tasks. However, merging multiple adapters inevitably causes redundancies, increasing the training and inference time massively. Therefore, in this paper, we propose an approach to identify the influence of each adapter module and a novel way to prune adapters based on the well-known Lottery Ticket Hypothesis. Experiments on GLUE datasets show that the pruned Adapterfusion model with our scheme can achieve state-of-the-art results, reducing sizes significantly while keeping performance intact. PDF 1 2022
Multi-way VNMT for UGC: Improving Robustness and Capacity via Mixture Density Networks This work presents a novel Variational Neural Machine Translation (VNMT) architecture with enhanced robustness properties, which we investigate through a detailed case study addressing noisy French user-generated content (UGC) translation to English. We show that the proposed model, with results comparable or superior to state-of-the-art VNMT, improves performance on UGC translation in a zero-shot evaluation scenario while keeping optimal translation scores on in-domain test sets. We elaborate on these results by visualizing and explaining how neural learning representations behave when processing UGC noise. In addition, we show that VNMT enforces robustness in the learned embeddings, which can later be used for robust transfer learning approaches. PDF 1 2022
LawngNLI: a multigranular, long-premise NLI benchmark for evaluating models’ in-domain generalization from short to long contexts Natural language inference has trended with NLP toward studying reasoning over long contexts, with several datasets moving beyond the sentence level. However, short-sequence models typically perform best despite their sequence limits. Confounded by domain shifts between datasets, it has remained unclear whether long premises are truly needed at fine-tuning time to learn long-premise NLI. We construct LawngNLI, with premises that skew much longer than in existing NLI benchmarks and are multigranular: all contain a short version. LawngNLI is constructed from U.S. legal opinions, with automatic labels with high human-validated accuracy. Evaluating on its long-premise NLI, we show top performance is achieved only with fine-tuning using these long premises. Models fine-tuned only on existing datasets, and even on our short premises (which derive from judge-selected relevant Entail excerpts in source documents, thus controlling for domain), underperform considerably. Top performance comes from short-sequence models prepended with a standard retrieval method that filters each premise, but they underperform without fine-tuning on long premises as inputs. LawngNLI also holds relevance for the legal community, as NLI is a principal cognitive task in developing cases and advice. Models performing well could double as retrieval or implication scoring systems for legal cases. PDF 1 2022
AllWOZ: Towards Multilingual Task-Oriented Dialog Systems for All A commonly observed problem with state-of-the-art natural language technologies, such as Amazon Alexa and Apple Siri, is that their services do not extend to most developing countries' citizens due to language barriers. Such populations suffer due to the lack of resources available in their languages to build NLP products. This paper presents AllWOZ, a multilingual multi-domain task-oriented customer service dialog dataset covering eight languages: English, Mandarin, Korean, Vietnamese, Hindi, French, Portuguese, and Thai. Furthermore, we create a benchmark for our multilingual dataset by applying mT5 in a meta-learning setting. PDF 1 2022
Extending Multi-Text Sentence Fusion Resources via Pyramid Annotations NLP models that process multiple texts often struggle to recognize corresponding and salient information that is often phrased differently, and to consolidate redundancies across texts. To facilitate research on such challenges, the sentence fusion task was proposed, yet previous datasets for this task were very limited in their size and scope. In this paper, we revisit and substantially extend previous dataset creation efforts. With careful modifications, relabeling, and employing complementing data sources, we were able to more than triple the size of a notable earlier dataset. Moreover, we show that our extended version uses more representative texts for multi-document tasks and provides a more diverse training set, which substantially improves model performance. PDF 1 2022
GlobEnc: Quantifying Global Token Attribution by Incorporating the Whole Encoder Layer in Transformers There has been a growing interest in interpreting the underlying dynamics of Transformers. While self-attention patterns were initially deemed the primary choice, recent studies have shown that integrating other components can yield more accurate explanations. This paper introduces a novel token attribution analysis method that incorporates all the components in the encoder block and aggregates them across layers. We quantitatively and qualitatively demonstrate that our method can yield faithful and meaningful global token attributions. Our extensive experiments reveal that incorporating almost every encoder component results in increasingly more accurate analysis in both local (single layer) and global (whole model) settings. Our global attribution analysis surpasses previous methods, achieving significantly higher results on various datasets. PDF 1 2022
MuCGEC: a Multi-Reference Multi-Source Evaluation Dataset for Chinese Grammatical Error Correction This paper presents MuCGEC, a multi-reference multi-source evaluation dataset for Chinese Grammatical Error Correction (CGEC), based on newly proposed annotation guidelines and consisting of 7,063 sentences from three different Chinese-as-a-Second-Language (CSL) learner sources. Each sentence has been corrected by three annotators, and their corrections are meticulously reviewed by an expert, resulting in 2.3 references on average per sentence. We conduct experiments with two mainstream CGEC models, i.e., the sequence-to-sequence (Seq2Seq) model and the sequence-to-edit (Seq2Edit) model, both enhanced with large pretrained language models, achieving competitive benchmark performance on previous and our datasets. We also discuss CGEC evaluation methodologies, including the effect of multiple references and using a char-based metric. We will release our annotation guidelines, data, and code. PDF 1 2022
Evaluating the Text-to-SQL Capabilities of Large Language Models We perform an empirical evaluation of Text-to-SQL capabilities of the Codex language model. We find that, without any finetuning, Codex is a strong baseline on the Spider benchmark; we also analyze the failure modes of Codex in this setting. Furthermore, we demonstrate on the GeoQuery and Scholar benchmarks that a small number of in-domain examples provided in the prompt enables Codex to perform better than state-of-the-art models finetuned on such few-shot examples. PDF 1 2022
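The few-shot setup above amounts to prompt assembly. The sketch below shows one plausible prompt format; the SQL-comment schema line and example layout are assumptions for illustration, not the paper's exact prompt.

```python
# Sketch of a few-shot Text-to-SQL prompt in the spirit of the Codex
# evaluation above. The formatting conventions here are assumptions.
FEW_SHOT = [
    ("How many singers do we have?", "SELECT count(*) FROM singer"),
    ("List the name of all singers.", "SELECT name FROM singer"),
]

def build_prompt(question: str, schema: str) -> str:
    parts = [f"-- Schema: {schema}"]
    for q, sql in FEW_SHOT:
        parts.append(f"-- Question: {q}\n{sql};")
    parts.append(f"-- Question: {question}\nSELECT")  # the model completes the query
    return "\n\n".join(parts)

print(build_prompt("What is the average age of singers?", "singer(id, name, age)"))
```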
AraBART: a Pretrained Arabic Sequence-to-Sequence Model for Abstractive Summarization As in most natural language understanding and generation tasks, state-of-the-art models for summarization are transformer-based sequence-to-sequence architectures that are pretrained on large corpora. While most existing models focus on English, Arabic has remained understudied. In this paper we propose AraBART, the first Arabic model in which the encoder and the decoder are pretrained end-to-end, based on BART. We show that AraBART achieves the best performance on multiple abstractive summarization datasets, outperforming strong baselines including a pretrained Arabic BERT-based model and the multilingual mBART and mT5 models. PDF 1 2022
ERNIE-Layout: Layout-Knowledge Enhanced Multi-modal Pre-training for Document Understanding We propose ERNIE-Layout, a knowledge-enhanced pre-training approach for visual document understanding, which incorporates layout knowledge into pre-training to learn a better joint multi-modal representation of text, layout and image. Previous works directly model serialized tokens from documents in raster-scan order, neglecting the importance of the reading order of documents and leading to sub-optimal performance. We incorporate layout knowledge from Document-Parser into document pre-training, using it to rearrange the tokens into an order more consistent with human reading habits. We also propose a Reading Order Prediction (ROP) task to enhance the interactions within segments and the correlation between segments, together with a fine-grained cross-modal alignment pre-training task named Replaced Regions Prediction (RRP). ERNIE-Layout fuses textual and visual features in a unified Transformer model, based on our newly proposed spatial-aware disentangled attention mechanism. ERNIE-Layout achieves superior performance on various document understanding tasks, setting new SOTA results on four tasks, including information extraction, document classification, and document question answering. PDF 1 2022
Jointly Reinforced User Simulator and Task-oriented Dialog System with Simplified Generative Architecture The large pre-trained language model GPT-2 has been fine-tuned for task-oriented dialog systems and has achieved state-of-the-art performance on many datasets. However, there is little work on reinforcement learning for these GPT-2 based dialog systems, let alone on designing a GPT-2 based user simulator. In this paper, we propose a dialog system and user simulator based on GPT-2 with a simplified generative architecture for reinforcement learning. The experiments are conducted on MultiWOZ2.1, and we evaluate our system with an offline method and an online method respectively. The results show that our dialog system achieves the best performance among all the GPT-2 based models even without RL optimization, and its performance is further improved after RL. We also explore different reward settings in RL and provide a deep analysis of how the model attends to different information and how RL improves the performance of the dialog system. PDF 1 2022
A Self-Adaptive Learning Rate and Curriculum Learning Based Framework for Few-Shot Text Classification Due to the lack of labeled data in many realistic scenarios, a number of few-shot learning methods for text classification have been proposed, among which the meta-learning based ones have recently attracted much attention. Such methods usually consist of a learner as the classifier and a meta-learner for specializing the learner to tasks. For the learner, the learning rate is crucial to its performance. However, existing methods treat it as a hyperparameter and adjust it manually, which is time-consuming and laborious. Intuitively, for different tasks and neural network layers, the learning rates should be different and self-adaptive. For the meta-learner, a good generalization ability is required so that it can quickly adapt to new tasks. Therefore, we propose a novel meta-learning framework, called MetaCLSLR, for few-shot text classification. Specifically, we present a novel meta-learning mechanism to obtain different learning rates for different tasks and neural network layers so as to enable the learner to quickly adapt to new training data. Moreover, we propose a task-oriented curriculum learning mechanism to help the meta-learner achieve a better generalization ability by learning from different tasks with increasing difficulty. Extensive experiments on three benchmark datasets demonstrate the effectiveness of MetaCLSLR. PDF 1 2022
BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models We show that with small-to-medium training data, fine-tuning only the bias terms (or a subset of the bias terms) of pre-trained BERT models is competitive with (and sometimes better than) fine-tuning the entire model. For larger data, bias-only fine-tuning is competitive with other sparse fine-tuning methods. Besides their practical utility, these findings are relevant for the question of understanding the commonly-used process of finetuning: they support the hypothesis that finetuning is mainly about exposing knowledge induced by language-modeling training, rather than learning new task-specific linguistic knowledge. PDF 1 2022
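The BitFit recipe is explicit enough to sketch directly: freeze every parameter whose name does not end in "bias". The toy model below is a stand-in for a pre-trained BERT; the name-based selection follows the standard naming convention of PyTorch Linear/LayerNorm layers.

```python
# BitFit in a few lines: train only the bias terms of a model.
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 768), nn.LayerNorm(768), nn.Linear(768, 2))

for name, param in model.named_parameters():
    param.requires_grad = name.endswith("bias")  # freeze everything else

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"training {trainable}/{total} parameters")  # biases only
```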
SANCL: Multimodal Review Helpfulness Prediction with Selective Attention and Natural Contrastive Learning With the boom of e-commerce, Multimodal Review Helpfulness Prediction (MRHP), which identifies the helpfulness score of multimodal product reviews, has become a research hotspot. Previous work on this task focuses on attention-based modality fusion, information integration, and relation modeling, which primarily exposes the following drawbacks: 1) the model may fail to capture the truly essential information due to its indiscriminate attention formulation; 2) it lacks appropriate modeling methods that take full advantage of the correlations among the provided data. In this paper, we propose SANCL: Selective Attention and Natural Contrastive Learning for MRHP. SANCL adopts a probe-based strategy to enforce high attention weights on the regions of greater significance. It also constructs a contrastive learning framework based on natural matching properties in the dataset. Experimental results on two benchmark datasets with three categories show that SANCL achieves state-of-the-art performance with lower memory consumption. PDF 1 2022
Can BERT Conduct Logical Reasoning? On the Difficulty of Learning to Reason from Data Logical reasoning is needed in a wide range of NLP tasks. In this work, we seek to answer one research question: can we train a BERT model to solve logical reasoning problems written in natural language? We study this problem on a confined problem space and train a BERT model on randomly drawn data. However, we report a rather surprising finding: even if BERT achieves nearly perfect accuracy on the test data, it only learns an incorrect and partial reasoning function; further investigation shows that the behaviour of the model (i.e., the learned partial reasoning function) is unreasonably sensitive to the training data. Our work reveals the difficulty of learning to reason from data and shows that near-perfect performance on randomly drawn data is not a sufficient indicator of models' ability to conduct logical reasoning. PDF 1 2022
Rare Tokens Degenerate All Tokens: Improving Neural Text Generation via Adaptive Gradient Gating for Rare Token Embeddings Recent studies have determined that the learned token embeddings of large-scale neural language models degenerate into an anisotropic distribution with a narrow-cone shape. This phenomenon, called the representation degeneration problem, increases the overall similarity between token embeddings and negatively affects model performance. Although existing methods that address the degeneration problem based on observations of the phenomena it triggers improve text generation performance, the training dynamics of token embeddings behind the degeneration problem remain unexplored. In this study, we analyze the training dynamics of token embeddings, focusing on rare token embeddings. We demonstrate that a specific part of the gradient for rare token embeddings is the key cause of the degeneration problem for all tokens during the training stage. Based on this analysis, we propose a novel method called \textit{adaptive gradient gating} (AGG), which addresses the degeneration problem by gating the specific part of the gradient for rare token embeddings. Experimental results on language modeling, word similarity, and machine translation tasks quantitatively and qualitatively verify the effectiveness of AGG. PDF 1 2022
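The abstract does not spell out the gate itself, but the general mechanism, scaling the gradient on rare-token embedding rows, can be sketched with a backward hook. The fixed rarity threshold and damping factor below are illustrative stand-ins for AGG's adaptive gate.

```python
# Hedged sketch of gating gradients on rare-token embedding rows with a
# backward hook. Which gradient component AGG gates, and how the gate is
# computed, is the paper's contribution; the fixed mask here is a stand-in.
import torch
import torch.nn as nn

vocab, dim = 1000, 64
embedding = nn.Embedding(vocab, dim)
token_freq = torch.randint(1, 10_000, (vocab,)).float()
rare = token_freq < 100  # assumed rarity criterion

def gate_rare_gradients(grad: torch.Tensor) -> torch.Tensor:
    scale = torch.ones(vocab, 1)
    scale[rare] = 0.1  # damp updates to rare-token rows
    return grad * scale

embedding.weight.register_hook(gate_rare_gradients)

loss = embedding(torch.randint(0, vocab, (32,))).pow(2).mean()
loss.backward()  # rare rows now receive 10x smaller gradients
```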
Dynamic Programming in Rank Space: Scaling Structured Inference with Low-Rank HMMs and PCFGs Hidden Markov Models (HMMs) and Probabilistic Context-Free Grammars (PCFGs) are widely used structured models, both of which can be represented as factor graph grammars (FGGs), a powerful formalism capable of describing a wide range of models. Recent research has found it beneficial to use large state spaces for HMMs and PCFGs. However, inference with large state spaces is computationally demanding, especially for PCFGs. To tackle this challenge, we leverage tensor rank decomposition (aka CPD) to decrease the computational complexity of inference for a subset of FGGs subsuming HMMs and PCFGs. We apply CPD to the factors of an FGG and then construct a new FGG defined in the rank space. Inference with the new FGG produces the same result but has a lower time complexity when the rank size is smaller than the state size. We conduct experiments on HMM language modeling and unsupervised PCFG parsing, showing better performance than previous work. We will release our code at $\url{github.com/xxx}$. PDF 1 2022
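The computational saving behind rank-space inference can be seen in a toy numpy example: when a transition matrix factors as A = UV with rank r much smaller than the state size n, the forward-style update alpha @ A can stay in the r-dimensional rank space. This simplified sketch ignores emissions and normalization.

```python
# Rank-space trick in miniature: computing alpha @ (U @ V) as (alpha @ U) @ V
# avoids materializing the n x n transition matrix, cutting the per-step cost
# from O(n^2) to O(n*r).
import numpy as np

n, r = 1000, 20
U = np.random.rand(n, r)
V = np.random.rand(r, n)
alpha = np.random.rand(n)

direct = alpha @ (U @ V)   # materializes the full n x n matrix
ranked = (alpha @ U) @ V   # stays in the r-dimensional rank space

print(np.allclose(direct, ranked))  # True: same result, far cheaper
```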
ValCAT: Generating Variable-Length Contextualized Adversarial Transformations using Encoder-Decoder Adversarial samples are helpful to explore vulnerabilities in neural network models, improve model robustness, and explain their working mechanism. However, the adversarial texts generated by existing word substitution-based methods are trapped in a one-to-one attack pattern, which is inflexible and cramped. In this paper, we propose ValCAT, a black-box attack framework that misleads the language model by applying variable-length contextualized transformations to the original text. Experiments show that our method outperforms state-of-the-art methods on attacking several classification tasks and inference tasks. More comprehensive human evaluations demonstrate that ValCAT has a significant advantage in ensuring the fluency of the adversarial samples and achieves better semantic consistency. We release our code at https://github.com/linerxliner/ValCAT. PDF 1 2022
MultiCite: Modeling realistic citations requires moving beyond the single-sentence single-label setting Despite decades of study, computational methods for citation context analysis (CCA) have largely relied on overly simplistic assumptions of how authors cite, which ignore several important phenomena. For instance, scholarly papers often contain rich discussions of cited work that span multiple sentences and express multiple intents concurrently. Yet, recent work in CCA is often approached as a single-sentence, single-label classification task, and thus many datasets used to develop modern computational approaches fail to capture this interesting discourse. To address this research gap, we highlight three understudied phenomena for CCA and release MultiCite, a new dataset of 12.6K citation contexts from 1.2K computational linguistics papers that fully models these phenomena. Not only is it the largest collection of expert-annotated citation contexts to date, MultiCite also contains multi-sentence, multi-label citation contexts annotated throughout entire full-paper texts. We demonstrate how MultiCite can enable the development of new computational methods on three important CCA tasks. We release our code and dataset at \url{placeholder}. PDF 1 2022
Document-level Neural Machine Translation Using Dependency RST Structure Document-level machine translation (MT) extends the translation unit from the sentence to the whole document. Intuitively, discourse structure can be useful for document-level MT given its helpfulness in long-range dependency modelling. However, little effort has been devoted to leveraging discourse information for document-level neural machine translation (NMT). In this paper, we propose a dependency Rhetorical Structure Theory (RST) tree enhanced NMT model, RST-Transformer. The model only needs to encode the dependency RST tree of the source document via the attention mask, and can enhance both the encoder and the decoder. Experiments on English-German datasets in both non-pretraining and pretraining settings show that our discourse-information-enhanced approach outperforms the current state-of-the-art document-level NMT model. PDF 1 2022
Phrase-level Textual Adversarial Attack with Label Preservation Generating high-quality textual adversarial examples is critical for investigating the pitfalls of natural language processing (NLP) models and further promoting their robustness. Existing attacks are usually realized through word-level or sentence-level perturbations, which either limit the perturbation space or sacrifice fluency and textual quality, both of which affect attack effectiveness. In this paper, we propose PLAT, which generates adversarial samples through phrase-level perturbations. PLAT first extracts the vulnerable phrases as attack targets with a syntactic parser, and then perturbs them with a pretrained blank-infilling model. This flexible perturbation design substantially expands the search space for more effective attacks without introducing too many modifications, and meanwhile maintains textual fluency and grammaticality via contextualized generation using surrounding texts. Moreover, we develop a label-preservation filter that leverages the likelihoods of language models finetuned on each class to rule out perturbations that would potentially alter the original class label for humans. Extensive experiments and human evaluation demonstrate that PLAT has superior attack efficiency as well as better label consistency than strong baselines. PDF 1 2022
Modularized Transfer Learning with Multiple Knowledge Graphs for Zero-shot Commonsense Reasoning Commonsense reasoning systems should be able to generalize to diverse reasoning cases. However, most state-of-the-art approaches depend on expensive data annotations and overfit to a specific benchmark without learning how to perform general semantic reasoning. To overcome these drawbacks, zero-shot QA systems have shown promise as a robust learning scheme by transforming a commonsense knowledge graph (KG) into synthetic QA-form samples for model training. Considering the growing variety of commonsense KGs, this paper aims to extend the zero-shot transfer learning scenario to multiple-source settings, where different KGs can be utilized synergetically. Towards this goal, we propose to mitigate the loss of knowledge caused by interference among the different knowledge sources by developing a modular variant of knowledge aggregation as a new zero-shot commonsense reasoning framework. Results on five commonsense reasoning benchmarks demonstrate the efficacy of our framework, which improves performance with multiple KGs. PDF 1 2022
PNEG: Prompt-based Negative Response Generation for Robust Response Selection Model Dialogue response selection models typically predict an appropriate response by relying on context-response content similarity. However, a selection model that over-relies on superficial features is vulnerable to adversarial responses that are semantically similar but irrelevant to the dialogue context. Recent studies have shown that leveraging these adversarial responses as negative training samples is useful for improving the robustness of the selection model. Nevertheless, existing methods often require further fine-tuning for data creation or have limited scalability. To overcome these limitations, this paper proposes a simple but effective method for generating adversarial negative responses leveraging a large-scale language model. Our method can generate realistic negative responses with only a few human-written examples and a prompt designed to optimize generation quality. Experimental results on the dialogue selection task show that our method outperforms existing methods for creating negative responses. Synthetic quality analyses and ablation studies prove that our method is scalable and can generate high-quality negative responses. These results suggest that our method can be an effective alternative to human annotators in generating adversarial responses. Our code and data will be released after acceptance. PDF 1 2022
Learning from Explanations: Multi-aspect based Age-restricted Rating Prediction in Long Scripts At the Motion Picture Association of America (MPAA), reviewers watch the entire film to determine the age-restricted category (MPAA rating) of the movie and provide explanatory feedback for the rating decision. As such a human expert system is a time-consuming and non-scalable process, this paper proposes a machine review system named MARS that automatically predicts the MPAA ratings of movie scripts. Specifically, in MARS, we first explore the use of the well-studied multi-aspect classification as machine-provided explanations, then leverage them to better learn the target rating prediction models. We demonstrate that MARS outperforms various baselines by around 10 points in F1 score, detecting severe content with a multi-aspect view. PDF 1 2022
Exploring the Low-Resource Transfer-Learning with mT5 model Languages are mortal. While the NLP community tends to expand its competence to multilingual models, there is still a great risk that low-resource languages will vanish before any prototypes appear for them. This paper presents a series of experiments that explore transfer learning for low-resource languages, testing hypotheses about finding the optimal donor language based on typological relations and grammatical features. Our results show that multilingual models like mT5 obtain significantly lower perplexity on 45/46 low-resource languages without training on them. We collected the most varied multilingual training corpus available, covering 288 languages and drawing on linguistic databases, field linguists' resources, the World Atlas of Language Structures, and Wikipedia. PDF 1 2022
MUST: A Framework for Training Task-oriented Dialogue Systems with Multiple User SimulaTors Recent works try to optimize a task-oriented dialogue system with reinforcement learning (RL) by building user simulators. However, most of them focus on training the dialogue system with a single user simulator. In this paper, we propose a framework called MUST to improve the dialogue agent by utilizing multiple user simulators simultaneously, as shown in Figure 1. Two core research problems of the proposed MUST are: (1) how to leverage these different simulators effectively in RL training, and (2) what model architecture to use to learn a user simulator with better generalization capability. To tackle the first problem, we formulate the selection of a simulator for training the system agent as a multi-armed bandit (MAB) problem and modify the Upper Confidence Bound (UCB) algorithm UCB1 to guide this selection process. To deal with the second problem, we present a new user simulator model called U-GPT based on the Generative Pre-trained Transformer (GPT). Extensive empirical results demonstrate that the dialogue system trained with the proposed MUST achieves better performance than those trained with a single user simulator, and that our modified UCB1 algorithm can accelerate MUST training. Furthermore, we reveal that our GPT-based user simulator outperforms previous learning-based simulators through direct and indirect evaluations. PDF 1 2022
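UCB1, which MUST adapts for simulator selection, is a standard bandit rule: pick the arm maximizing mean reward plus an exploration bonus that shrinks as the arm is tried more often. A plain-Python sketch follows; the random reward is a stand-in for dialogue success, and the paper's modifications to UCB1 are not reproduced.

```python
# Vanilla UCB1 over three "simulator" arms. Each step picks the arm with the
# best empirical mean plus an exploration bonus sqrt(2 ln t / n_arm).
import math
import random

def ucb1_select(counts: list[int], rewards: list[float]) -> int:
    total = sum(counts)
    for arm, c in enumerate(counts):
        if c == 0:
            return arm  # try every arm once first
    scores = [rewards[a] / counts[a] + math.sqrt(2 * math.log(total) / counts[a])
              for a in range(len(counts))]
    return max(range(len(scores)), key=scores.__getitem__)

counts, rewards = [0, 0, 0], [0.0, 0.0, 0.0]  # three simulators
for step in range(100):
    arm = ucb1_select(counts, rewards)
    counts[arm] += 1
    rewards[arm] += random.random()  # stand-in for a dialogue success reward
```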
A Study of Pre-trained Language Models for Analogy Generation We propose a novel application of Pre-trained Language Models (PLMs) to generate analogies and study how to design effective prompts that lead a PLM to generate a source concept analogous to a given target concept, as well as to generate an explanation of the similarity between a given pair of target and source concepts. We found that it is feasible to prompt a GPT-3 PLM to generate meaningful analogies, and that the best prompts tend to be precise imperative statements, especially with a low temperature setting. We systematically analyzed the sensitivity of the GPT-3 model to prompt design and temperature, and found that the model is particularly sensitive to certain variations (e.g., questions vs. imperative statements). We also investigated the suitability of using existing reference-based metrics designed for evaluating natural language generation (NLG) to evaluate analogy generation, and found that the recent BLEURT score is better than the others. We further propose a promising consensus measure based on diverse prompts and settings, which can potentially be used both to automatically evaluate generated analogies in the absence of reference text (e.g., in novel domains) and to rank a set of generated analogies to select analogies of different characteristics. Overall, our study shows that PLMs offer a promising new way to generate analogies in unrestricted domains, breaking the limitation of existing analogy generation methods that require structured representations. PDF 1 2022
Classification of Illegal Drug Sales Posts using Clustering-Based Topic Modeling Drugs illegally traded online are causing social problems around the world. One way to address this problem is to automatically delete sales posts quickly, as soon as they are uploaded. We propose a new dataset of Korean illegal drug sales posts collected directly from Twitter. It contains about 100K posts, each of which we labeled directly. Supervised learning-based models generally show high performance, but they require label information, and it is difficult to label every text when large volumes of text are produced. In this work, we propose a topic modeling-based classification model that performs better even with a small number of labels. Experimental results show higher classification performance when topic modeling is used with only a small amount of labeled data. PDF 1 2022
"You might think about slightly revising the title": identifying hedges in peer-tutoring interactions Hedges play an important role in the management of conversational interaction. In peer-tutoring, they are notably used by tutors in dyads (pairs of interlocutors) experiencing low rapport to tone down the impact of instructions and negative feedback. Pursuing the objective of building a tutoring agent that manages rapport with students in order to improve learning, we used a multimodal peer-tutoring dataset to construct a computational framework for identifying hedges. We compared approaches relying on pre-trained resources with others that integrate insights from the social science literature. Our best performance involved a hybrid approach that outperforms the existing baseline while being easier to interpret. We employ a model explainability tool to explore the features that characterize hedges in peer-tutoring conversations, and we identify some novel features, and the benefits of such a hybrid model approach. PDF 1 2022
Identifying and Measuring Token-Level Sentiment Bias in Pre-trained Language Models with Prompts Due to their superior performance, large-scale pre-trained language models (PLMs) have been widely adopted in many aspects of human society. However, we still lack effective tools to understand the potential bias embedded in these black-box models. Recent advances in prompt tuning show the possibility of exploring the internal mechanism of PLMs. In this work, we propose two token-level sentiment tests: the Sentiment Association Test (SAT) and the Sentiment Shift Test (SST), which utilize prompts as probes to detect latent bias in PLMs. Our experiments on a collection of sentiment datasets show that both SAT and SST can identify sentiment bias in PLMs, and that SST is able to quantify the bias. The results also show that fine-tuning can amplify existing bias in PLMs. PDF 1 2022
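A probe in the spirit of SAT can be sketched with a fill-mask pipeline: compare how much probability mass a masked prompt about a target token assigns to positive versus negative words. The template and word lists below are illustrative assumptions, not the paper's actual tests.

```python
# Sketch of a token-level sentiment probe using a masked LM. The prompt
# template and the tiny sentiment lexicons here are illustrative stand-ins.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

def sentiment_scores(token: str, top_k: int = 50):
    preds = fill(f"The {token} is [MASK].", top_k=top_k)
    pos = sum(p["score"] for p in preds if p["token_str"] in {"good", "great", "nice"})
    neg = sum(p["score"] for p in preds if p["token_str"] in {"bad", "awful", "terrible"})
    return pos, neg

print(sentiment_scores("nurse"))
print(sentiment_scores("criminal"))
```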
Challenge for open-domain targeted sentiment analysis Since previous studies on open-domain targeted sentiment analysis are limited in dataset domain variety and restricted to the sentence level, we propose a novel dataset of 6,013 human-labeled examples that extends the data domains to topics of interest and to the document level. Furthermore, we offer a nested target annotation schema to extract the complete sentiment information in documents, boosting the practicality and effectiveness of open-domain targeted sentiment analysis. Moreover, we leverage the pre-trained model BART in a sequence-to-sequence generation method for the task. Benchmark results show that there is substantial room for improvement in open-domain targeted sentiment analysis. Meanwhile, experiments show that challenges remain in the effective use of open-domain data, long documents, the complexity of target structure, and domain variance. PDF 1 2022
Textual Backdoor Attacks Can Be More Harmful via Two Simple Tricks Backdoor attacks are an emergent security threat in deep learning. After a backdoor is injected, a deep neural model will behave normally on standard inputs but give adversary-specified predictions once the input contains specific backdoor triggers. Current textual backdoor attacks have poor attack performance in some tough situations. In this paper, we find two simple tricks that can make existing textual backdoor attacks much more harmful. The first trick is to add an extra training task to distinguish poisoned and clean data during the training of the victim model, and the second one is to use all the clean training data rather than removing the original clean data corresponding to the poisoned data. These two tricks are universally applicable to different attack models. We conduct experiments in three tough situations, including clean-data fine-tuning, low poisoning rates, and label-consistent attacks. Experimental results show that the two tricks can significantly improve attack performance. This paper exhibits the great potential harmfulness of backdoor attacks. All the code and data will be made public to facilitate further research. PDF 1 2022
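The first trick is straightforward to sketch as a multi-task objective: a shared encoder feeds both the main task head and an auxiliary poisoned-vs-clean head. The stand-in encoder, the random data, and the 0.5 loss weight below are assumptions for illustration.

```python
# Sketch of the first trick above: jointly train the victim model on the
# main task and an auxiliary task that discriminates poisoned from clean data.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Linear(300, 128)   # stand-in for the victim text encoder
task_head = nn.Linear(128, 2)   # main classification task
poison_head = nn.Linear(128, 2) # auxiliary head: is the input poisoned?

feats = encoder(torch.randn(32, 300))
task_labels = torch.randint(0, 2, (32,))
poison_labels = torch.randint(0, 2, (32,))  # 1 = poisoned sample

loss = (F.cross_entropy(task_head(feats), task_labels)
        + 0.5 * F.cross_entropy(poison_head(feats), poison_labels))
loss.backward()
```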
Gated Recursive and Sequential Deep Hierarchical Encoding for Detecting Incongruent News Articles With the increase in misinformation across digital platforms, incongruent news detection is becoming an important research problem. Earlier, researchers exploited various feature engineering approaches and deep learning models with embeddings to capture incongruity between news headlines and the body. Recent studies have also shown the advantages of capturing structural properties of the body using hierarchical encoding. Hierarchical encoding decomposes the body of a news article into smaller segments such as sentences or paragraphs. However, existing hierarchical methods have not considered two important aspects: (i) deeper hierarchical levels, and (ii) the importance of different paragraphs in generating the document encoding. Motivated by this, in this paper, we propose a Gated Recursive and Sequential Deep Hierarchical Encoding (GRASHE) method for detecting incongruent news articles, which extends hierarchical encoding up to the word level and incorporates the incongruity weight of each paragraph. Experimental results show that the proposed models outperform bag-of-words features and sequential and hierarchical encoding-based counterparts. We also perform various ablation analyses to support the proposed models. PDF 1 2022
Consecutive Question Generation with Multitask Joint Reranking and Dynamic Rationale Search Automatic question generation (QG) aims to generate a set of questions for a given passage, and can be viewed as a dual task of question answering (QA). However, most current QG methods tend to generate questions independently, one by one, mainly based on specific extracted answer spans. In this paper, we propose to consecutively generate questions over a whole passage, with comprehensive consideration of accuracy, diversity, informativeness, and coverage. First, we examine four key elements in QG, i.e., question, answer, rationale, and context history, and propose a novel multitask framework with one main task generating a question-answer pair and four auxiliary tasks generating the other elements alternately, improving model performance in all these aspects through both joint training and reranking. Further, to learn the connections between questions and fully exploit the important information in every sentence, we propose a new consecutive generation strategy, which dynamically selects the rationales and searches for the best question series globally. Extensive experiments on different datasets show that our method can improve question generation significantly and benefit multiple related NLP tasks. PDF 1 2022
CQMrobust: A Chinese Dataset of Linguistically Perturbed Natural Questions for Evaluating the Robustness of Question Matching Models In this paper, we focus on robustness evaluation for Chinese question matching. Most previous work on analyzing robustness issues focuses on just one or a few types of artificial adversarial examples. Instead, we argue that a comprehensive evaluation of models' linguistic capabilities on natural texts is necessary. For this purpose, we create a Chinese dataset named CQMrobust, which contains natural questions with linguistic perturbations for evaluating the robustness of question matching models. CQMrobust covers 3 categories and 13 subcategories with 32 linguistic perturbations. Extensive experiments demonstrate that CQMrobust has a better ability to distinguish different models. Importantly, the detailed breakdown of evaluation by linguistic phenomenon in CQMrobust helps us easily diagnose the strengths and weaknesses of different models. Additionally, our experimental results show that the effect of artificial adversarial examples does not carry over to natural texts. The dataset and baseline code will be made publicly available in the open-source community. PDF 1 2022
Causal Language Model for Zero-shot Constrained Keyphrase Generation Recently, most state-of-the-art keyphrase prediction models have been based on supervised generative models. Although these show noticeable improvement over statistical methods, they still struggle with low performance on out-of-domain and low-resource data. To overcome these limitations, unsupervised methods have also been studied. However, unsupervised methods have a drawback: they must extract candidates before selecting keyphrases, and since the candidate set does not include all possible phrase forms, they cannot guarantee recovering the oracle keyphrases. In this paper, we present zero-shot constrained keyphrase generation by leveraging a large-scale language model. To generate diverse keyphrases, we explore controlling phrases during generation. Finally, we evaluate on benchmark datasets in the scholarly domain. Our method performs better than unsupervised methods on several datasets without going through a candidate extraction stage. For domain robustness, we evaluate on the out-of-domain DUC dataset and compare with NUS. Since our method is not fine-tuned on a corpus of a specific domain, it generalizes better than supervised sequence-to-sequence methods. PDF 1 2022
Multimodal Semi-supervised Learning for Disaster Tweet Classification During natural disasters, people often use social media platforms, such as Twitter, to post information about casualties and damage produced by disasters. This information can help relief authorities gain situational awareness in nearly real time, and enable them to quickly distribute resources where most needed. However, annotating data for this purpose can be burdensome, subjective and expensive. In this paper, we investigate how to leverage the copious amounts of unlabeled data generated by disaster eyewitnesses and affected individuals during disaster events. To this end, we propose a semi-supervised learning approach to improve the performance of neural models on several multimodal disaster tweet classification tasks. Our approach shows significant improvements, obtaining up to $3.5\%$ F1 performance gain at no additional annotation cost. PDF 1 2022
Investigating the Benefits of Free-Form Rationales Free-form rationales aim to aid model interpretability by supplying the background knowledge that can help understand model decisions. Crowdsourced rationales are provided for commonsense QA instances in popular datasets such as CoS-E and ECQA, but their utility remains under-investigated. We present human studies which show that ECQA rationales indeed provide additional information to understand a decision, while 70% of CoS-E rationales do not. Inspired by this finding, we ask: can the additional context provided by free-form rationales benefit models, similar to human users? We investigate the utility of rationales as an additional source of supervision, by varying the quantity and quality of rationales during training. After controlling for instances where rationales leak the correct answer, we find that incorporating only 5% of rationales during training can boost model performance by 16.89%. We also show that rationale quality matters: compared to crowdsourced rationales, T5-generated rationales not only provide much weaker supervision to models, but are also unhelpful to human users in aiding model interpretability. PDF 1 2022
LexiCon: Lexically Constrained Review Generation via Robust Insertion Existing review generators struggle to generate specific information correctly (e.g., Caesar salad, Snapdragon CPU), which prevents generated reviews from being more informative. In this paper, we propose to introduce lexical constraints into review generation, which can be any key phrases required to appear in the reviews. Compared to soft constraints (e.g., aspects) used in previous work, lexical constraints easily incorporate specific information, which can largely improve the diversity and informativeness of generated reviews. To this end, we present LexiCon, a novel insertion-based review generation framework that can generate personalized reviews containing lexical constraints. Specifically, the proposed method progressively inserts new tokens between existing tokens in a parallel manner until a sequence is completed. Experimental results show that LexiCon outperforms the strongest review generation model by 20% BLEU-2 (coherence) and 68% Distinct-2 (diversity) on average. Human evaluation also shows that LexiCon is more robust to various lexical constraints than the state-of-the-art general-purpose lexically-constrained model. PDF 1 2022
Building a Role Specified Open-Domain Dialogue System Leveraging Large-Scale Language Models Recent open-domain dialogue models have brought numerous breakthroughs. However, building a chat system is not scalable since it often requires a considerable volume of human-human dialogue data, especially when enforcing features such as persona, style, or safety. In this work, we study the challenge of imposing roles on open-domain dialogue systems, with the goal of making the systems maintain consistent roles while conversing naturally with humans. To accomplish this, the system must satisfy a role specification that includes certain conditions on the stated features as well as a system policy on whether or not certain types of utterances are allowed. For this, we propose an efficient data collection framework leveraging in-context few-shot learning of large-scale language models for building a role-satisfying dialogue dataset from scratch. We then compare various architectures for open-domain dialogue systems in terms of meeting role specifications while maintaining conversational abilities. Automatic and human evaluations show that our models return few out-of-bounds utterances while keeping competitive performance on general metrics. We release a Korean dialogue dataset we built for further research. PDF 1 2022
NeuS: Neutral Multi-News Summarization for Framing Bias Mitigation Media framing bias can lead to increased political polarization, and thus the need for automatic mitigation methods is growing. We propose a new task, \textit{neutral} summary generation from multiple news articles across the political spectrum, to facilitate balanced and unbiased news reading. In this paper, we first collect a new dataset, obtain insights about framing bias through a case study, and propose a new effective metric and models for the task. Lastly, we conduct experimental analyses to provide insights about remaining challenges and future directions. One of the most interesting observations is that generation models can hallucinate not only factually inaccurate or unverifiable content but also politically biased content. PDF 1 2022
Detecting Unintended Social Bias in Toxic Language Datasets Hate speech and offensive texts are examples of damaging online content that targets or promotes hatred towards a group or an individual based on their actual or perceived features of identification, such as race, religion, or sexual orientation. Sharing violent and offensive content has had a significant negative impact on society. Such hate speech and offensive content generally contain societal biases. With the rise of online hate speech, automatic detection of such biases as a natural language processing task is gaining popularity. However, not much research has been done on detecting unintended social bias in toxic language datasets. In this paper, we introduce a new dataset, derived from an existing toxic language dataset, for detecting social biases along with their categories and targeted groups. We then report baseline performance on both classification and generation tasks on our curated dataset using transformer-based models. Our study motivates a systematic extraction of social bias data from toxic language data. PDF 1 2022
GraphDiffs: Graph Modeling with Differential Sequence for Document-Grounded Conversation Knowledge-grounded dialogue systems need to incorporate natural transitions between pieces of knowledge for dialogue to flow smoothly. Current systems not only lack good structured representations for knowledge that spans multiple documents, but also effective algorithms that utilize such resources. We design a Co-Referential Multi-Document Graph (CoRM-DoG) that seamlessly captures inter-document correlations and intra-document co-referential knowledge relations. To best linearise this static graph into sequential dialogues, we contribute a Graph Modeling with Differential Sequence (GraphDiffs) method for knowledge transitions in dialogue. GraphDiffs performs knowledge selection by natively accounting for contextual graph structure and introducing differential sequence learning to effectively learn multi-turn knowledge transitions. Our analysis shows that GraphDiffs based on CoRM-DoG significantly outperforms the current state-of-the-art by 9.5% and 7.4% on two public benchmarks, WoW and Holl-E, where the modeling of co-reference and differential sequences are the critical factors for its success. PDF 1 2022
On the Anatomy of Latent-variable Generative Models for Conditional Text Generation Conditional text generation is a non-trivial task that has until now predominantly been performed with latent-variable generative models. In this work, we explore several design choices that are shown to affect the two essential aspects of model performance: expressivity and controllability. We experiment with a series of latent-variable models built around simple design changes under a general unified framework, with a particular focus on prior distributions based on Energy-Based Models (EBMs) instead of the usual standard Gaussian. Our experiments validate the claim that this richer prior allows for better representational power, but it also makes training more difficult. We provide a comprehensive analysis of these difficulties and a close comparison with recent work on EBM-based priors for conditional text generation. PDF 1 2022
Unsupervised Common Sense Relation Extraction Vast and diverse knowledge about relations in the world helps humans comprehend and reason about their environment. Equipping machines with this knowledge is challenging yet essential for general reasoning capabilities. Here, we propose to apply unsupervised relation extraction (URE), aiming to induce general relations between concepts from natural language. Previous work in URE has predominantly focused on relations between named entities in the encyclopedic domain. The more general, and more challenging, domain of common sense relation learning has not yet been addressed, partially due to a lack of datasets. We present a framework for common sense relation extraction from free text, together with two benchmark datasets, and report initial experiments using three state-of-the-art models developed for encyclopedic relation induction. Our results verify the utility of our benchmarks for common sense relation extraction and suggest ample scope for future work on this important, yet challenging, task. PDF 1 2022
ComSearch: Equation Searching with Combinatorial Mathematics for Solving Math Word Problems with Weak Supervision Previous studies have introduced a weakly-supervised paradigm for solving math word problems that requires only the answer value annotation. While these methods search for equation candidates with the correct value as pseudo labels, they explore only a narrow sub-space of the enormous equation space. To address this problem, we propose ComSearch, a novel search algorithm based on combinatorial mathematics that compresses the search space by excluding mathematically equivalent equations. The compression allows the search algorithm to enumerate all possible equations and obtain high-quality data. We investigate the noise in pseudo labels that encode wrong mathematical logic, which we refer to as the false-matching problem, and propose a ranking model to denoise the pseudo labels. Our approach provides a flexible framework that trains two existing supervised math word problem solvers with pseudo labels, both of which achieve state-of-the-art performance in the weak supervision task. PDF 1 2022
Show, Don't Tell: Demonstrations Outperform Descriptions for Schema-Guided Task-Oriented Dialogue Building universal dialogue systems that can seamlessly operate across multiple domains/APIs and can generalize to new ones with minimal supervision and low maintenance is a critical challenge. Recent works have leveraged natural language descriptions for schema elements to build such systems. However, descriptions only provide indirect supervision for downstream tasks, while still requiring effort to construct. In this work, we propose Show, Don't Tell, which uses a short labeled example dialogue to show the semantics of a schema rather than telling the model about the schema elements via descriptions. While requiring similar effort from service developers, we show that using short examples as schema representations with large language models results in stronger performance and better generalization on two popular dialogue state tracking benchmarks: the Schema-Guided Dialogue (SGD) dataset and the MultiWoZ leave-one-out benchmark. PDF 1 2022
ULF: Cross-Validation for Weak Supervision A way to overcome expensive and time-consuming manual data labeling is weak supervision: automatic annotation of data samples via a predefined set of labeling functions (LFs), rule-based mechanisms that generate potentially erroneous labels. In this work, we investigate noise reduction techniques for weak supervision based on the principle of k-fold cross-validation. In particular, we extend two frameworks for detecting erroneous samples in manually annotated data to the weakly supervised setting. Our methods profit from leveraging information about matching LFs and detect noisy samples more accurately. We also introduce a new algorithm for denoising weakly annotated data, called ULF, which refines the allocation of LFs to classes by estimating a reliable joint LFs-to-classes matrix. Evaluation on several datasets shows that ULF successfully improves weakly supervised learning without using any manually labeled data. PDF 1 2022
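The cross-validation principle behind this line of work can be sketched in a few lines: flag samples whose out-of-fold prediction disagrees with their weak label. The snippet below shows only this generic idea with an assumed feature matrix and classifier, not ULF's LFs-to-classes matrix refinement.

```python
# Sketch: flag potentially noisy weak labels via k-fold cross-validation.
# The classifier and features are placeholder assumptions.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

def flag_noisy(X: np.ndarray, weak_y: np.ndarray, k: int = 5) -> np.ndarray:
    """Return a boolean mask over samples whose out-of-fold prediction
    disagrees with the weak label."""
    noisy = np.zeros(len(weak_y), dtype=bool)
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        clf = LogisticRegression(max_iter=1000).fit(X[train_idx], weak_y[train_idx])
        noisy[test_idx] = clf.predict(X[test_idx]) != weak_y[test_idx]
    return noisy
```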
A Relation Semantic Information Attentive Stereoscopic Framework for Relational Triple Extraction Extracting relational triples from unstructured text is crucial for information extraction. Recent methods extract relational triples from a stereoscopic perspective, which can better capture the interaction between entities and relations. However, stereoscopic models introduce redundant triples, which makes it difficult to identify triples accurately. Since the relation is one of the elements of the triples to be extracted, introducing its semantic information can make the triple information more complete, which is helpful for relational triple extraction. In this work, we propose a Relation Semantic Information Attentive Stereoscopic framework (RSIA) which can fully represent and use the semantic information of relations. Specifically, a transformer-based fusion encoder on top of a relation encoder and a sentence encoder is designed to enrich the semantic information of relations. Then, the semantic representation of the relation is integrated into the stereoscopic 3D space as its relation dimension. Our model achieves state-of-the-art performance, with F1 scores of up to 93.5\% and 94.3\% on two public datasets, and delivers consistent performance gains on complex scenarios with overlapping triples. PDF 1 2022
Hierarchical Transformers Are More Efficient Language Models Transformer models yield impressive results on many NLP and sequence modeling tasks. Remarkably, Transformers can handle long sequences, which allows them to produce long coherent outputs: entire paragraphs produced by GPT-3 or well-structured images produced by DALL-E. These large language models are impressive but also very inefficient and costly, which limits their applications and accessibility. We postulate that having an explicit hierarchical architecture is the key to Transformers that efficiently handle long sequences. To verify this claim, we first study different ways to downsample and upsample activations in Transformers so as to make them hierarchical. We use the best performing upsampling and downsampling layers to create Hourglass - a hierarchical Transformer language model. Hourglass improves upon the Transformer baseline given the same amount of computation and can yield the same results as Transformers more efficiently. In particular, Hourglass sets new state-of-the-art for Transformer models on the ImageNet32 generation task and improves language modeling efficiency on the widely studied enwik8 benchmark. PDF 1 2022
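For orientation, here is a minimal sketch of the down- and upsampling of activations such a hierarchical model studies; mean-pooling and position repetition are only one of the simple choices one might compare, and the shapes are illustrative assumptions, not the Hourglass architecture itself.

```python
# Sketch of hierarchical down/upsampling of activations.
# Shapes: (seq_len, d_model); the middle layers work at lower resolution.
import numpy as np

def downsample(x: np.ndarray, rate: int) -> np.ndarray:
    """Mean-pool groups of `rate` consecutive positions."""
    seq_len, d = x.shape
    assert seq_len % rate == 0
    return x.reshape(seq_len // rate, rate, d).mean(axis=1)

def upsample(x: np.ndarray, rate: int) -> np.ndarray:
    """Repeat each position `rate` times to restore the original length."""
    return np.repeat(x, rate, axis=0)

x = np.random.randn(8, 4)
h = downsample(x, rate=2)   # (4, 4): shorter sequence, cheaper attention
y = upsample(h, rate=2)     # (8, 4): back to full resolution
assert y.shape == x.shape
```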
Detection of Word Adversarial Examples in NLP: Benchmark and Baseline via Robust Density Estimation Word-level adversarial attacks have shown success against NLP models, drastically decreasing the performance of transformer-based models in recent years. As a countermeasure, adversarial defense has been explored, but relatively few efforts have been made to detect adversarial examples. However, detecting adversarial examples in NLP may be crucial for automated tasks (e.g., review sentiment analysis) that aim to amass information about a certain population, and can additionally be a step towards a robust defense system. To this end, we release a dataset for four popular attack methods on four datasets and four NLP models to encourage further research in this field. Along with it, we propose a competitive baseline based on density estimation that has the highest \textsc{auc} on 29 out of 30 dataset-attack-model combinations.\footnote{https://github.com/anoymous92874838/text-adv-detection} PDF 1 2022
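The general density-estimation recipe can be sketched as follows: fit a density model on features of clean data and flag inputs whose density is low. The feature extractor, kernel, and threshold below are assumptions, not the paper's exact estimator.

```python
# Sketch: detect adversarial inputs by density estimation over features
# of clean training data; features and threshold are placeholders.
import numpy as np
from sklearn.neighbors import KernelDensity

def fit_detector(clean_feats: np.ndarray) -> KernelDensity:
    return KernelDensity(kernel="gaussian", bandwidth=1.0).fit(clean_feats)

def is_adversarial(kde: KernelDensity, feats: np.ndarray, threshold: float) -> np.ndarray:
    """Flag inputs whose log-density under the clean distribution is low."""
    return kde.score_samples(feats) < threshold

clean = np.random.randn(500, 16)           # e.g., pooled encoder features
kde = fit_detector(clean)
threshold = np.percentile(kde.score_samples(clean), 5)   # 5% false-positive budget
suspect = np.random.randn(3, 16) + 4.0     # off-manifold points
print(is_adversarial(kde, suspect, threshold))
```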
Modeling Exemplification in Long-form Question Answering via Retrieval Exemplification is a process by which writers explain or clarify a concept by providing an example. While common in all forms of writing, exemplification is particularly useful in the task of long-form question answering (LFQA), where a complicated answer can be made more understandable through simple examples. In this paper, we provide the first computational study of exemplification in QA, performing a fine-grained annotation of different types of examples (e.g., hypotheticals, anecdotes) in three corpora. We show that not only do state-of-the-art LFQA models struggle to generate relevant examples, but also that standard evaluation metrics such as ROUGE are insufficient to judge exemplification quality. We propose to treat exemplification as a \emph{retrieval} problem in which a partially-written answer is used to query a large set of human-written examples extracted from a corpus. Our approach enables a reliable ranking-based automatic metric that correlates well with human evaluation. Human evaluation shows that examples retrieved by our retriever are more relevant than examples generated by a state-of-the-art LFQA model. PDF 1 2022
ezCoref: A Scalable Approach for Collecting Crowdsourced Annotations for Coreference Resolution Large-scale high-quality corpora are critical for advancing research in coreference resolution. Coreference annotation is typically time-consuming and expensive, since researchers generally hire expert annotators and train them with an extensive set of guidelines. Crowdsourcing is a promising alternative, but coreference includes complex semantic phenomena that are difficult to explain to untrained crowdworkers, and the clustering structure is difficult to manipulate in a user interface. To address these challenges, we develop and release ezCoref, an easy-to-use coreference annotation tool, together with an annotation methodology that facilitates crowdsourced data collection across multiple domains, currently in English. Instead of teaching crowdworkers how to handle non-trivial cases (e.g., near-identity coreferences), ezCoref provides only a minimal set of guidelines sufficient for understanding the basics of the task. To validate this decision, we deploy ezCoref on Mechanical Turk to re-annotate 240 passages from seven existing English coreference datasets across seven domains, achieving an average rate of 2530 tokens per hour per annotator. This paper is the first to compare the quality of crowdsourced coreference annotations against those of experts, and to identify where their behavior differs to facilitate future annotation efforts. We show that it is possible to collect coreference annotations of reasonable quality in a fraction of the time it would traditionally require. PDF 1 2022
Old BERT, New Tricks: Artificial Language Learning for Pre-Trained Language Models We extend the artificial language learning experimental paradigm from psycholinguistics and apply it to pre-trained language models -- specifically, BERT (Devlin et al., 2019). We treat a pretrained model as a subject in an artificial language learning experimental setting: in order to learn the relation between two linguistic properties $A$ and $B$, we introduce a set of new, non-existent, linguistic items, give the model information about their variation along property $A$, then measure to what extent the model learns property $B$ for these items as a result of training. We show this method at work for degree modifiers (expressions like {\it slightly}, {\it very}, {\it rather}, {\it extremely}) and test the hypothesis that the degree expressed by the modifier (low, medium or high degree) is related to its sensitivity to sentence polarity (whether it shows preference for affirmative or negative sentences or neither). Our experimental results are compatible with existing linguistic observations that relate degree semantics to polarity-sensitivity, including the main one: low degree semantics leads to positive polarity sensitivity (that is, to preference towards affirmative contexts). PDF 1 2022
Tapping BERT for Preposition Sense Disambiguation Prepositions are frequently occurring polysemous words. Disambiguation of prepositions is crucial in tasks like semantic role labelling, question answering, text entailment, and noun compound paraphrasing. In this paper, we propose a novel methodology for preposition sense disambiguation (PSD), which does not use any linguistic tools. In a supervised setting, the machine learning model is presented with sentences wherein prepositions have been annotated with 'senses'. These 'senses' are IDs in what is called 'The Preposition Project (TPP)'. We use the hidden layer representations from pre-trained BERT and its variants. The latent representations are then classified into the correct sense ID using a Multi-Layer Perceptron. The datasets used for this task are from SemEval-2007 Task-6 and Oxford English Corpus (OEC). Our methodology gives an accuracy of 86.85% on the SemEval task, which is better than the state-of-the-art. PDF 1 2022
Revisiting the Roles of “Text” in Text Games Text games present opportunities for natural language understanding (NLU) methods to tackle reinforcement learning (RL) challenges. However, recent work has questioned the necessity of NLU by showing random text hashes could perform decently. In this paper, we pursue a fine-grained investigation into the roles of text in the face of different RL challenges, and reconcile that semantic and non-semantic language representations could be complementary rather than contrasting. Concretely, we propose a simple scheme to extract relevant contextual information into an approximate state hash as extra input for an RNN-based text agent. Such a lightweight plug-in achieves competitive performance with state-of-the-art text agents using advanced NLU techniques such as knowledge graph and passage retrieval, suggesting non-NLU methods might suffice to tackle the challenge of partial observability. However, if we remove RNN encoders and use approximate or even ground-truth state hash alone, the model performs miserably, which confirms the importance of semantic function approximation to tackle the challenge of combinatorially large observation and action spaces. Our findings and analysis provide new insights for designing better text game task setups and agents. PDF 1 2022
Bridge the Gap Between CV and NLP! A Gradient-based Textual Adversarial Attack Framework Despite the success of recent deep learning techniques, they still perform poorly on adversarial examples with small perturbations. While gradient-based adversarial attack methods are well explored in the field of computer vision, it is impractical to apply them directly in natural language processing due to the discrete nature of text. To address the problem, we propose a unified framework extending existing gradient-based methods to craft textual adversarial samples. In this framework, gradient-based continuous perturbations are added to the embedding layer and amplified in the forward propagation process. Then the final perturbed latent representations are decoded with a masked language model head to obtain potential adversarial samples. In this paper, we instantiate our framework with an attack algorithm named Textual Projected Gradient Descent (T-PGD). We conduct comprehensive experiments to evaluate our framework by performing transfer black-box attacks on BERT, RoBERTa, and ALBERT on three benchmark datasets. Experimental results demonstrate that our method achieves overall better performance and produces more fluent and grammatical adversarial samples compared to strong baseline methods. All the code and data will be made public. PDF 1 2022
Investigating and Explaining Feature and Representation Learning in Translationese Classification Recent work has shown that neural feature- and representation-learning approaches, and specifically the BERT model, demonstrate superior performance over traditional manual feature engineering with an SVM classifier for the task of translationese classification across various source and target languages. However, to date it is unclear whether the performance differences are due to better representations, better classifiers, or both. Moreover, it remains unclear whether the features learnt by BERT overlap with commonly used manual features. To answer these questions, we exchange features between BERT-based and SVM classifiers and show that an SVM fed with BERT representations performs at the level of the best BERT classifiers, while BERT trained on hand-crafted features performs at the level of traditional classifiers using those features. Our experiments indicate that our hand-crafted feature set does not provide any additional information that BERT has not already learnt, and is likely a subset of the features automatically learnt by BERT. Finally, we apply Integrated Gradients to examine token importance for the BERT model, and find that part of its top performance is due to mere topic differences and spurious correlations with translationese. PDF 1 2022
Non-Autoregressive Machine Translation: It's Not as Fast as it Seems Efficient machine translation models are commercially important as they can increase inference speed and reduce costs and carbon emissions. Recently, there has been much interest in non-autoregressive (NAR) models, which promise faster translation. In parallel to the research on NAR models, there have been successful attempts to create optimized autoregressive models as part of the WMT shared task on efficient translation. In this paper, we point out flaws in the evaluation methodology present in the literature on NAR models and provide a fair comparison between a state-of-the-art NAR model and the autoregressive submissions to the shared task. We make the case for consistent evaluation of NAR models, and also for the importance of comparing NAR models with other widely used efficiency approaches. We run experiments with a connectionist-temporal-classification-based (CTC) NAR model implemented in C++ and compare it with AR models using wall-clock times. Our results show that, although NAR models are faster on GPUs with small batch sizes, they are nearly always slower under more realistic usage conditions. We call for more realistic and extensive evaluation of NAR models in future work. PDF 1 2022
Mix and Match: Learning-free Controllable Text Generation using Energy Language Models Due to the unidirectional nature of prevalent autoregressive generation models, recent work on controlled generation based on global text attributes has either required attribute-based fine-tuning of the base language model or restricted the parametrization of the attribute prediction model to be compatible with the base LM. In this work, we propose Mix and Match LM, a global score-based alternative for controllable text generation that combines arbitrary pretrained black box models for achieving the desired attributes in the generated text without involving any fine-tuning or structural assumptions about the black box models. We interpret the task of controllable generation as drawing samples from an energy-based model whose energy values are a linear combination of scores from black box models that are separately responsible for fluency, the control attribute, and faithfulness to any conditioning context. We use a Metropolis-Hastings sampling scheme to sample from this energy-based model using bidirectional context and global attribute features. We validate the effectiveness of our approach on various controlled generation and style-based text revision tasks by outperforming recently proposed methods that involve extra training, fine-tuning, or restrictive assumptions over the form of models. PDF 1 2022
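The energy-based sampling idea can be sketched briefly: define the energy as a weighted sum of black-box scores and run Metropolis-Hastings over token-level proposals. The toy scorers and the symmetric proposal below are placeholders, not the paper's actual expert models.

```python
# Sketch of energy-based controllable sampling via Metropolis-Hastings.
# Scorers and the proposal distribution are toy placeholders.
import math, random

def energy(tokens, scorers, weights):
    """Lower energy = more fluent / on-attribute / faithful text."""
    return sum(w * s(tokens) for s, w in zip(scorers, weights))

def mh_step(tokens, propose, scorers, weights):
    """One Metropolis-Hastings step with a symmetric proposal."""
    candidate = propose(tokens)        # e.g., remask + refill one token
    delta = energy(candidate, scorers, weights) - energy(tokens, scorers, weights)
    if delta < 0 or random.random() < math.exp(-delta):
        return candidate               # accept
    return tokens                      # reject

# Toy run: one scorer penalizes length, the other rewards the token "happy".
scorers = [lambda t: len(t), lambda t: 0.0 if "happy" in t else 1.0]
propose = lambda t: t[:-1] + [random.choice(["happy", "sad", "day"])]
tokens = ["a", "sad", "day"]
for _ in range(100):
    tokens = mh_step(tokens, propose, scorers, weights=[0.1, 1.0])
print(tokens)
```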
End-to-end Speech Translation with Spoken-to-Written Style Conversion End-to-end speech translation (ST), which translates speech in a source language directly into text in a target language with a single model, has attracted a great deal of attention in recent years. Compared to cascaded ST, it has the advantages of easier deployment, better efficiency, and less error propagation. Meanwhile, spoken-to-written style conversion has been shown to improve cascaded ST by reducing the gap between the language style of speech transcriptions and the bilingual corpora used for machine translation training. It is therefore desirable to integrate this conversion into end-to-end ST. In this paper, we propose a joint task of speech-to-written-style-text conversion and end-to-end ST, as well as an interactive-attention-based multi-decoder model for the joint task that improves end-to-end ST. Experiments on a Japanese-English lecture ST dataset and CoVoST 2 Native Japanese show that our models outperform a strong baseline on Japanese-English ST. PDF 1 2022
Cheat Codes to Quantify Missing Source Information in Neural Machine Translation This paper describes a method to quantify the amount of information $H(t|s)$ added by the target sentence $t$ that is not present in the source $s$ in a neural machine translation system. We do this by providing the model the target sentence in a highly compressed form (a "cheat code"), and exploring the effect of the size of the cheat code. We find that the model is able to capture extra information from just a single float representation of the target and nearly reproduces the target with two 32-bit floats per target token. PDF 1 2022
Ask Me Anything in Your Native Language Cross-lingual question answering is a thriving field in the modern world, helping people search for information on the web more efficiently. One important scenario is giving an answer even when no answer exists in the language in which the question was asked. We present a novel approach based on a single encoder for query and passage retrieval from a multi-lingual collection, together with a cross-lingual generative reader. It achieves a new state of the art in both retrieval and end-to-end tasks on the XOR TyDi dataset, outperforming previous results by up to 10% on several languages. We find that our approach generalizes to more than 20 languages in a zero-shot setting and outperforms all previous models by 12%. PDF 1 2022
Unbiased Math Word Problems Benchmark for Mitigating Solving Bias In this paper, we revisit solving bias in the evaluation of models on current Math Word Problem (MWP) benchmarks. Current solvers exhibit solving bias, consisting of data bias and learning bias caused by biased datasets and improper training strategies. Our experiments verify that MWP solvers are easily biased by training datasets that do not cover diverse questions for each problem narrative, so a solver learns only shallow heuristics rather than deep semantics for understanding problems. Besides, an MWP can naturally be solved by multiple equivalent equations, while current datasets take only one of the equivalent equations as ground truth, forcing the model to match the labeled ground truth and ignore other equivalent equations. Here, we first introduce a novel MWP dataset named UnbiasedMWP, constructed by varying the grounded expressions in our collected data and manually annotating them with corresponding new questions. Then, to further mitigate learning bias, we propose a Dynamic Target Selection (DTS) strategy to dynamically select more suitable target expressions according to the longest prefix match between the current model output and candidate equivalent equations, which are obtained by applying the commutative law during training. The results show that UnbiasedMWP has significantly fewer biases than its original data and other datasets, posing a promising benchmark for fairly evaluating solvers' reasoning skills rather than their ability to match nearest neighbors. Solvers trained with our DTS achieve higher accuracy on multiple MWP benchmarks. PDF 1 2022
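The longest-prefix-match selection in DTS is straightforward to sketch. In the snippet below, equations are token lists; generating the set of equivalent equations via the commutative law is assumed to happen upstream.

```python
# Sketch of Dynamic Target Selection: among equivalent ground-truth
# equations, pick the one sharing the longest prefix with the model's
# current output, so the training target follows the decoder.
def select_target(model_output: list, equivalents: list) -> list:
    def prefix_len(a, b):
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n
    return max(equivalents, key=lambda eq: prefix_len(model_output, eq))

# Toy example: "3 + 5" and "5 + 3" are equivalent labels.
print(select_target(["5", "+", "?"], [["3", "+", "5"], ["5", "+", "3"]]))
# -> ['5', '+', '3'], the target the decoder is already closest to
```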
Great Truths are Always Simple: A Rather Simple Knowledge Encoder for Enhancing the Commonsense Reasoning Capacity of Pre-Trained Models Commonsense reasoning in natural language is a desired capacity of artificial intelligence systems. For solving complex commonsense reasoning tasks, a typical approach is to enhance pre-trained language models (PTMs) with a knowledge-aware graph neural network (GNN) encoder that leverages commonsense knowledge graphs (CSKGs). Despite their effectiveness, these approaches are built on heavy architectures and cannot clearly explain how external knowledge resources improve the reasoning capacity of PTMs. Considering this issue, we conduct a deep empirical analysis and find that it is indeed \emph{relation features} from CSKGs (but not \emph{node features}) that mainly contribute to the performance improvement of PTMs. Based on this finding, we design a simple MLP-based knowledge encoder that utilizes statistical relation paths as features. Extensive experiments conducted on five benchmarks demonstrate the effectiveness of our approach, which also largely reduces the parameters needed for encoding CSKGs. PDF 1 2022
Sequentially Controlled Text Generation While GPT2 generates sentences that are remarkably human-like, longer documents can ramble and are structurally different from human-written articles. We study the problem of imposing structure on long-range text. We propose a novel controlled text generation task, sequentially controlled text generation, and identify a dataset, NewsDiscourse, as a starting point for this task. We develop a sequentially controlled text generation pipeline with generation and editing, based on extensions of existing classifier-based approaches. We test different degrees of structural awareness and show that, in general, more structural awareness results in higher control-accuracy, grammaticality, global coherency and topicality, approaching human-level writing performance. PDF 1 2022
Zero-Shot Aspect-Based Scientific Document Summarization using Self-Supervised Pre-training We study the zero-shot setting for the aspect-based scientific document summarization task. Summarizing scientific documents with respect to an aspect can remarkably improve document assistance systems and readers' experience. However, existing large-scale datasets contain a limited variety of aspects, causing summarization models to over-fit to a small set of aspects. We establish baseline results for zero-shot performance (over unseen aspects and in the presence of domain shift), paraphrasing, leave-one-out, and limited-supervision experimental setups. We propose a self-supervised pre-training approach to enhance the zero-shot performance. Experimental results on the FacetSum and PubMed aspect-based datasets show promising performance when the model is pre-trained on unlabelled in-domain data. PDF 1 2022
A Dataset for N-ary Relation Extraction of Drug Combinations Combination therapies have become the standard of care for diseases such as cancer, tuberculosis, malaria and HIV. However, the combinatorial set of available multi-drug treatments creates a challenge in identifying effective combination therapies for a given situation. To assist medical professionals in identifying beneficial drug combinations, we construct an expert-annotated dataset for extracting information about the efficacy of drug combinations from the scientific literature. Beyond its practical utility, the dataset also presents a unique NLP challenge, as the first relation extraction dataset consisting of variable-length relations. Furthermore, the relations in this dataset predominantly require language understanding beyond the sentence level, adding to the challenge of this task. We provide a promising baseline model and identify clear areas for further improvement. We release our dataset and code (https://anonymous.4open.science/r/drug-synergy-models--C8B7/README.md) publicly to encourage the NLP community to participate in this task. PDF 1 2022
Speaker Clustering in Textual Dialogue with Utterance Correlation and Cross-corpus Dialogue Act Supervision We propose a textual dialogue speaker clustering model that groups the utterances of a multi-party dialogue without speaker annotations, such that all utterances in a cluster come from the same speaker. We find that, even without knowing the speakers, the interactions between utterances are still implied in the text, and such interactions suggest speaker correlations. In this work, we model the semantic content of an utterance with a pre-trained language model, and the correlations between speakers with an utterance-level pairwise matrix. The semantic content representation can be further enhanced by additional cross-corpus supervised dialogue act modeling. The speaker labels are finally generated by spectral clustering. Experiments show that our model outperforms the sequence classification baseline and benefits from the set-specific dialogue act classification auxiliary task. We also discuss the details of correlation modeling and the step-wise training process. PDF 1 2022
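The final clustering step lends itself to a short sketch: given a pairwise same-speaker affinity matrix (produced by the paper's utterance correlation model, assumed here), spectral clustering recovers the speaker groups.

```python
# Sketch: cluster utterances by speaker from a pairwise affinity matrix.
# How the matrix is produced (the correlation model) is assumed upstream.
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_speakers(affinity: np.ndarray, n_speakers: int) -> np.ndarray:
    """affinity[i, j] ~ estimated probability utterances i and j share a speaker."""
    sc = SpectralClustering(n_clusters=n_speakers, affinity="precomputed", random_state=0)
    return sc.fit_predict(affinity)

# Toy affinity: utterances 0-1 and 2-3 pair up.
A = np.array([[1.0, 0.9, 0.1, 0.1],
              [0.9, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.9],
              [0.1, 0.1, 0.9, 1.0]])
print(cluster_speakers(A, n_speakers=2))   # e.g., [0 0 1 1]
```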
Revisiting Additive Compositionality: AND, OR, and NOT Operations with Word Embeddings It is well known that typical word embedding methods have the property that meaning can be composed by adding up embeddings (additive compositionality). Several theories have been proposed to explain additive compositionality, but the following problems remain: (i) the assumptions of those theories do not hold for practical word embeddings; (ii) ordinary additive compositionality can be seen as an AND operation on word meanings, but it is not well understood how other operations, such as OR and NOT, can be computed with embeddings. We address these issues with the idea of frequency-weighted centering at the core of our approach. This method bridges the gap between practical word embeddings and the theoretical assumptions about additive compositionality, answering (i). We also give a method for taking the OR or NOT of meanings through linear operations on word embeddings, answering (ii). Moreover, we confirm experimentally that the accuracy of the AND operation, i.e., ordinary additive compositionality, can be improved by our post-processing method (a 3.5x improvement in top-100 accuracy), and that the OR and NOT operations can be performed correctly. We also confirm that the proposed method is effective for BERT. PDF 1 2022
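Frequency-weighted centering itself is a one-line transformation; a minimal sketch follows, with the AND operation as plain addition. The paper's OR and NOT constructions are more involved and are not reproduced here; the vocabulary and frequencies are illustrative.

```python
# Sketch of frequency-weighted centering: subtract the frequency-weighted
# mean vector from every embedding, then compose meanings additively (AND).
import numpy as np

def center(embeddings: np.ndarray, freqs: np.ndarray) -> np.ndarray:
    """embeddings: (vocab, dim); freqs: (vocab,) corpus word frequencies."""
    weights = freqs / freqs.sum()
    mean = weights @ embeddings        # frequency-weighted mean vector
    return embeddings - mean

vocab = {"king": 0, "woman": 1, "man": 2}
E = np.random.randn(3, 50)             # stand-in for trained embeddings
f = np.array([1000.0, 5000.0, 6000.0]) # stand-in frequencies
Ec = center(E, f)
# AND composition (ordinary additive compositionality) after centering:
query = Ec[vocab["king"]] + Ec[vocab["woman"]]
```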
A Balanced Data Approach for Evaluating Cross-Lingual Transfer: Mapping the Linguistic Blood Bank We show that the choice of pretraining languages affects downstream cross-lingual transfer for BERT-based models. We inspect zero-shot performance under balanced data conditions to mitigate data size confounds, classifying pretraining languages that increase downstream performance as donors, and languages that are most improved in zero-shot performance as recipients. We develop a method with quadratic time complexity in the number of pretraining languages to estimate these inter-language relations, instead of an exponential exhaustive computation over all possible combinations. We find that our method is effective on a diverse set of languages spanning different linguistic features and two downstream tasks. Our findings can inform developers of future large-scale multilingual language models in choosing better pretraining configurations. PDF 1 2022
AutoAttention: Automatic Attention Head Selection Through Differentiable Pruning Multi-head attention is considered a driving force and key component behind state-of-the-art transformer models. However, recent research reveals that there are many redundant heads with duplicated patterns in each layer. In this work, we propose an automatic pruning strategy using differentiable binary gates to remove redundant heads. We relax the binary head pruning problem into a differentiable optimization by employing Straight-Through Estimators (STEs), in which the model weights and the head-sparse model structure can be jointly learned through back-propagation. In this way, attention heads can be pruned efficiently and effectively. Experimental results on the General Language Understanding Evaluation (GLUE) benchmark are provided using the BERT model. We can remove more than 57% of heads on average with zero or minor accuracy drop on all nine tasks, and even achieve better results than state-of-the-art methods (e.g., Random, HISP, $L_0$ Norm, SMP, etc.). Furthermore, our proposed method can prune more than 79% of heads with only 0.82% accuracy degradation on average. We further illustrate the pruning procedure and parameter changes through attention head visualization, showing how the trainable gate parameters determine the head mask and the final attention map. PDF 1 2022
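The gating mechanism can be sketched with a straight-through estimator: hard 0/1 head masks in the forward pass, sigmoid gradients in the backward pass. The snippet shows only the gate, not the full pruning schedule or any sparsity regularizer the method may use.

```python
# Sketch of a differentiable binary head gate with a straight-through
# estimator (STE): hard values forward, soft gradients backward.
import torch

class STEGate(torch.nn.Module):
    def __init__(self, n_heads: int):
        super().__init__()
        # Initialized positive so all heads start active.
        self.logits = torch.nn.Parameter(torch.ones(n_heads))

    def forward(self) -> torch.Tensor:
        soft = torch.sigmoid(self.logits)
        hard = (soft > 0.5).float()
        # Straight-through trick: forward value is `hard`,
        # but gradients flow through `soft`.
        return hard + soft - soft.detach()

gate = STEGate(n_heads=12)
mask = gate()   # shape (12,), entries in {0, 1}, differentiable w.r.t. logits
# Apply per head: attn_out (batch, heads, seq, d) * mask.view(1, -1, 1, 1)
```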
A Study of Syntactic Multi-Modality in Non-Autoregressive Machine Translation It is difficult for non-autoregressive translation (NAT) models to capture the multi-modal distribution of target translations due to their conditional independence assumption, which is known as the ``multi-modality problem'' and includes lexical multi-modality and syntactic multi-modality. While the former has been well studied, syntactic multi-modality poses a severe challenge to the standard cross-entropy (XE) loss in NAT and is understudied. In this paper, we conduct a systematic study of the syntactic multi-modality problem. Specifically, we decompose it into short- and long-range syntactic multi-modalities and evaluate several recent NAT algorithms with advanced loss functions on both carefully designed synthetic datasets and real datasets. We find that the Connectionist Temporal Classification (CTC) loss and the Order-Agnostic Cross Entropy (OAXE) loss can better handle short- and long-range syntactic multi-modalities, respectively. Furthermore, we take the best of both and design a new loss function to better handle the complicated syntactic multi-modality in real-world datasets. To facilitate practical usage, we provide a guide to using different loss functions for different kinds of syntactic multi-modality. PDF 1 2022
δ-SAM: Sharpness-Aware Minimization with Dynamic Reweighting Deep neural networks are often overparameterized and may not easily achieve model generalization. Adversarial training has shown effectiveness in improving generalization by regularizing the change of loss on top of adversarially chosen perturbations. The recently proposed sharpness-aware minimization (SAM) algorithm conducts adversarial weight perturbation, encouraging the model to converge to a flat minimum. Unfortunately, due to increased computational cost, adversarial weight perturbation can only be efficiently estimated per batch instead of per instance by SAM, leading to degraded performance. In this paper, we tackle this efficiency bottleneck and propose the first instance-based weight perturbation method: sharpness-aware minimization with dynamic reweighting (δ-SAM). δ-SAM dynamically reweights the perturbation within each batch by estimated guardedness (i.e., unguarded instances are up-weighted), serving as a better approximation to per-instance perturbation. Experiments on various tasks demonstrate the effectiveness of δ-SAM. PDF 1 2022
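For reference, here is a sketch of the batch-level SAM update that δ-SAM refines: compute the gradient, ascend to the adversarially perturbed weights, recompute the gradient there, then step. The per-instance reweighting that is δ-SAM's actual contribution is omitted, and the sketch assumes every parameter receives a gradient.

```python
# Sketch of one batch-level SAM step (not δ-SAM's reweighted variant).
import torch

def sam_step(model, loss_fn, batch, optimizer, rho: float = 0.05):
    # 1) Gradient at the current weights.
    loss_fn(model, batch).backward()
    grads = [p.grad.clone() for p in model.parameters()]  # assumes no None grads
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    # 2) Ascend to the adversarially perturbed weights w + rho * g / ||g||.
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.add_(rho * g / (norm + 1e-12))
    model.zero_grad()
    # 3) Gradient at the perturbed point; undo the perturbation; then step.
    loss_fn(model, batch).backward()
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.sub_(rho * g / (norm + 1e-12))
    optimizer.step()
    optimizer.zero_grad()
```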
CrowdChecked: Detecting Previously Fact-Checked Claims in Social Media While there has been substantial progress in developing systems to automate the process of fact-checking, such systems still lack credibility in the eyes of users, and thus human fact-checkers remain the main drivers of the process. In view of that, a middle-ground approach has recently emerged: to do automatic fact-checking by verifying whether the input claim has previously been fact-checked by professional fact-checkers, and to return an article that explains the verdict on the claim. This is a sensible approach as people trust manual fact-checking, and as many claims are repeated multiple times online. Yet, a major issue when building such systems is the small number of known input--verified claim pairs available for training. Here, we aim to bridge this gap by making use of crowd fact-checking, i.e., mining claims in social media for which users have responded with a link to a fact-checking article. In particular, we mine a large-scale collection of 330,000 tweets paired with corresponding fact-checking articles. We further propose a new model to learn from this noisy data based on modified self-adaptive training, in a distant supervision scenario. Our experiments on a standard test set show improvements over the state of the art by two points absolute. PDF 1 2022
Detecting Rumor Veracity with Only Textual Information by Double-Channel Structure The model of Kyle (1985) proposes two types of rumors: informed rumors, which are based on some private information, and uninformed rumors, which are not based on any information (i.e., bluffing). Prior studies also find that when people have a credible source of information, they are likely to use a more confident textual tone when spreading rumors. Motivated by these theoretical findings, we propose a double-channel structure to determine the ex-ante veracity of rumors on social media. Our ultimate goal is to classify each rumor as true, false, or unverifiable. We first assign each text to either the certain (informed rumor) or uncertain (uninformed rumor) category. Then, we apply a lie detection algorithm to informed rumors and a thread-reply agreement detection algorithm to uninformed rumors. Using the dataset of SemEval 2019 Task 7, which requires ex-ante threefold classification (true, false, or unverifiable) of social media rumors, our model yields a macro-F1 score of 0.4027, outperforming all the baseline models and the second-place winner (Gorrell et al., 2019). Furthermore, we empirically validate that the double-channel structure outperforms single-channel structures that apply either the lie detection or the agreement detection algorithm to all posts. PDF 1 2022
A Benchmark for Text Quantification Learning Under Real-World Temporal Distribution Shift Text quantification is a supervised learning task estimating the relative frequency of each class for a collection of uncategorized text documents. Quantification learning has an increasing number of applications in practice and presents unique challenges that are often overlooked in classification problems, such as dealing with distribution shift. Many studies on quantification use artificially re-sampled test sets to evaluate models under varying target label distributions. Despite being a convenient solution, label-based biased sampling changes the underlying test data distribution and makes it hard to rely on the results to deploy models in practice. This paper introduces a text quantification benchmark consisting of 8 datasets across sentiment analysis, document categorization, and toxicity classification. We compare popular quantification baselines on the benchmark and show that there is no model consistently outperforming others. Therefore, we believe the benchmark should enable new community research to tackle text quantification under temporal distribution shift and develop reliable models in real-world applications. PDF 1 2022
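Two standard quantification baselines that such a benchmark would naturally compare can be sketched briefly: classify-and-count (CC) and adjusted classify-and-count (ACC), the latter correcting CC with the classifier's true and false positive rates estimated on held-out data. Whether these exact baselines are among those in the paper is an assumption here.

```python
# Sketch of classify-and-count (CC) and adjusted classify-and-count (ACC),
# two canonical baselines for estimating class prevalence.
import numpy as np

def classify_and_count(preds: np.ndarray) -> float:
    """Prevalence estimate = fraction of documents predicted positive."""
    return preds.mean()

def adjusted_cc(preds: np.ndarray, tpr: float, fpr: float) -> float:
    """Correct CC using validation-set true/false positive rates."""
    cc = classify_and_count(preds)
    return float(np.clip((cc - fpr) / (tpr - fpr), 0.0, 1.0))

preds = np.array([1, 0, 1, 1, 0, 0, 0, 1])    # binary predictions on a test set
print(classify_and_count(preds))               # 0.5
print(adjusted_cc(preds, tpr=0.9, fpr=0.2))    # ~0.43
```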
Modeling Tension in Stories via Commonsense Reasoning and Emotional Word Embeddings Dramatic tension is crucial for generating interesting stories. This paper aims to model dramatic tension from a story text using neural commonsense-reasoning language models and emotional word embeddings. We also propose a method of converting a categorical emotion word into a numerical value. The evaluation results using human-annotated stories demonstrate that our proposed method is promising in predicting tension development in a story. PDF 1 2022
A Meta-transfer Learning framework for Visually Grounded Compositional Concept Learning Humans acquire language in a compositional and grounded manner. They can describe their perceptual world using novel compositions of already-learnt elementary concepts. However, recent research shows that modern neural networks lack such compositional generalization ability. To address this challenge, in this paper, we propose \textit{MetaVL}, a meta-transfer learning framework to train transformer-based vision-and-language (V\&L) models using an optimization-based meta-learning method and episodic training. We carefully created two datasets based on MSCOCO and Flickr30K to specifically target novel compositional concept learning. Our empirical results show that \textit{MetaVL} outperforms baseline models on both datasets. Moreover, \textit{MetaVL} demonstrates higher sample efficiency compared to supervised learning, especially under the few-shot setting. PDF 1 2022
Efficient Weighted Deduction Systems for Earley’s Algorithm The parsing algorithm of Earley (1970), as presented, has a runtime complexity of $\mathcal{O}(N^3\lvert\mathcal{G}\rvert \lvert\mathcal{R}\rvert)$ where $N$ is the length of the sentence, $\lvert\mathcal{G}\rvert$ is the size of the grammar, and $\lvert\mathcal{R}\rvert$ is the number of productions in the grammar. This is unworkable for the large grammars that arise in natural language processing. Fortunately, the dynamic programming algorithm can be improved to run in time $\mathcal{O}(N^3\lvert\mathcal{G}\rvert)$, matching the complexity of running CKY on a binarized version of $\mathcal{G}$. Some of the necessary speed-ups have been presented in part or in full in various parts of the literature. However, there has been no unified, formal treatment that is written as a deduction system or covers the weighted case. We present such a treatment in terms of five proof rules that can be used in weighted deduction, which refine Earley's predict, scan, and complete actions. We also provide a generalization of Earley's algorithm that uses a finite-state automaton to represent the grammar, and whose runtime is proportional to the size of the automaton (and the usual $\mathcal{O}(N^3)$ term), or more precisely the size of the portion of the automaton that is reached while parsing the input sentence. Further speed-ups can then be achieved by minimizing the automaton so that similar productions share transitions. PDF 1 2022
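For orientation, a plain unweighted Earley recognizer with the classic predict/scan/complete actions is sketched below; the paper's refined weighted deduction rules and FSA-based grammar representation go well beyond this, and the sketch ignores the usual epsilon-production subtlety.

```python
# Minimal unweighted Earley recognizer (predict / scan / complete).
def earley_recognize(grammar, start, words):
    # grammar: dict mapping nonterminal -> list of right-hand sides (tuples)
    # An item is (lhs, rhs, dot, origin).
    chart = [set() for _ in range(len(words) + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))
    for i in range(len(words) + 1):
        agenda = list(chart[i])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in grammar:                        # PREDICT
                    for prod in grammar[sym]:
                        new = (sym, prod, 0, i)
                        if new not in chart[i]:
                            chart[i].add(new); agenda.append(new)
                elif i < len(words) and words[i] == sym:  # SCAN
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
            else:                                         # COMPLETE
                for l2, r2, d2, o2 in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        new = (l2, r2, d2 + 1, o2)
                        if new not in chart[i]:
                            chart[i].add(new); agenda.append(new)
    final = chart[len(words)]
    return any((start, rhs, len(rhs), 0) in final for rhs in grammar[start])

G = {"S": [("NP", "VP")], "NP": [("papers",)], "VP": [("parse",)]}
print(earley_recognize(G, "S", ["papers", "parse"]))   # True
```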
Building Sequence-to-Sequence Document Revision Models from Matched and Multiple Partially-Matched Datasets This paper defines the document revision task and proposes a novel modeling method that can utilize not only a matched dataset but also multiple partially-matched datasets. In the document revision task, we aim to simultaneously consider multiple perspectives for writing support. To this end, it is important not only to correct grammatical errors but also to improve readability and perspicuity, through means such as conjunction insertion and sentence reordering. However, it is difficult to prepare enough matched data for the document revision task, since the task has to consider multiple perspectives simultaneously. To mitigate this problem, our idea is to utilize not only a limited matched dataset but also various partially-matched datasets that handle individual perspectives, e.g., correcting grammatical errors or inserting conjunctions. Since suitable partially-matched datasets have either been published or can easily be made, we expect to be able to prepare a large amount of such data. To effectively utilize these multiple datasets, our proposed modeling method incorporates ``on-off'' switches into sequence-to-sequence modeling to distinguish the matched dataset from the individual partially-matched datasets. Experiments using our created document revision datasets demonstrate the effectiveness of the proposed method. PDF 1 2022
HiStruct+: Improving Extractive Text Summarization with Hierarchical Structure Information Transformer-based language models usually treat texts as linear sequences. However, most texts also have an inherent hierarchical structure, i.e., parts of a text can be identified by their position in this hierarchy. In addition, section titles usually indicate the common topic of their respective sentences. We propose a novel approach to extract, encode and inject hierarchical structure (HiStruct) information into an extractive summarization model (the HiStruct+ model) based on a pre-trained, encoder-only language model. Our HiStruct+ model achieves SOTA extractive ROUGE scores on three public summarization datasets (CNN/DailyMail, PubMed, arXiv); the improvement is especially substantial on PubMed and arXiv. Across various experimental settings, our HiStruct+ model outperforms a strong baseline, which differs from our model only in that the HiStruct information is not injected. The ablation study demonstrates that the hierarchical position information is the main contributor to our model's SOTA performance. PDF 1 2022
POLITICS: Pretraining with Same-story Article Comparison for Ideology Prediction and Stance Detection Ideology is at the core of political science research. Yet, general-purpose tools to characterize and predict ideology across different genres of text still do not exist. To this end, we study pretrained language models using novel ideology-driven pretraining objectives that rely on the comparison of articles on the same story written by media of different ideologies. We further collect a large-scale dataset, consisting of more than 3.6M political news articles, for experiments. Our model POLITICS outperforms strong baselines on 8 out of 11 ideology prediction and stance detection tasks. Further analyses show that POLITICS is especially good at understanding long or formally written texts, and is also robust in few-shot learning scenarios. PDF 1 2022
Few-Shot Semantic Parsing with Language Models Trained On Code Large language models can perform semantic parsing with little training data, when prompted with in-context examples. It has been shown that this can be improved by formulating the problem as paraphrasing into canonical utterances, which casts the underlying meaning representation into a controlled natural language-like representation. Intuitively, such models can more easily output canonical utterances as they are closer to the natural language used for pre-training. More recently, models also pre-trained on code, like OpenAI Codex, have risen in prominence. Since semantic parsing requires translating natural language into code, such models may prove more adept at it. In this paper, we test this hypothesis and find that Codex performs better at semantic parsing than equivalent GPT-3 models. We find that unlike GPT-3, Codex performs similarly when targeting meaning representations directly, perhaps because meaning representations used in semantic parsing are structured similarly to code. PDF 1 2022
Exploring Cross-Lingual Guidance in Abstractive Summarization Cross-lingual guidance (CLG) as an augmentation method is often applied in cross-lingual summarization (CLS) to improve its performance. In this paper, we empirically study how cross-lingual information of different quality benefits the encoding and decoding procedures for both cross-lingual and mono-lingual abstractive summarization. We specifically propose a summarization model, DualSum, which can utilize CLG in both encoding and decoding, and construct a dataset, BiRead, with high-quality parallel bilingual document-summary pairs. Our empirical experiments show how CLS and mono-lingual summarization (MLS) are influenced by CLG. PDF 1 2022
On Synthetic Data for Back Translation Back translation (BT) is one of the most significant technologies in NMT research. Existing attempts at BT share a common characteristic: they employ either beam search or random sampling to generate synthetic data with a backward model, but few works study the role of synthetic data in BT performance. This motivates us to ask a fundamental question: what kind of synthetic data contributes to BT performance? Through both theoretical and empirical studies, we identify two key factors of synthetic data that control BT performance: quality and importance. Furthermore, based on our findings, we propose a simple yet effective method to generate synthetic data that better trades off the two factors so as to yield better performance for BT. We run extensive experiments on the WMT14 DE-EN, EN-DE, and RU-EN benchmark tasks. By employing our proposed method to generate synthetic data, our BT model significantly outperforms the standard BT baselines (i.e., beam- and sampling-based methods for data generation), which proves the effectiveness of our proposed method. PDF 1 2022
LongChecker: Improving scientific claim verification by modeling full-abstract context The spread of scientific mis- and dis-information has motivated the development of datasets and models for the task of scientific claim verification. We address two modeling challenges associated with this task. First, existing claim verification systems make predictions by extracting an evidentiary sentence (or sentences) from a larger context, and then predicting whether this sentence supports or refutes the claim in question. This can be problematic, since the meaning of the selected sentence may change when interpreted outside its original context. Second, given the difficulty of collecting high-quality fact-checking annotations in expert domains, there is an unaddressed need for methods to facilitate zero- / few-shot domain adaptation. Motivated by these challenges, we develop LongChecker. Given a claim and an evidence-containing abstract, LongChecker predicts a fact-checking label and identifies evidentiary sentences in a multi-task fashion based on a shared encoding of all available context. This approach enables LongChecker to perform domain adaptation by leveraging weakly-supervised in-domain data. We show that LongChecker achieves state-of-the-art performance on three datasets, and conduct analysis to confirm that its strong performance is due to its ability to model full-abstract context. PDF 1 2022
Understand before Answer: Improve Temporal Reading Comprehension via Precise Question Understanding This work studies temporal reading comprehension (TRC), which reads a free-text passage and answers temporal ordering questions. Precise question understanding is critical for temporal reading comprehension. For example, the questions "What happened before the victory" and "What happened after the victory" share all words except one, while their answers are totally different. Moreover, even when two questions query similar temporal relations, small variations can lead to different answers. For example, although both the question "What usually happened during the press release?" and "What might happen during the press release" query events which happen during "the press release", they convey divergent semantics. To this end, we propose a novel reading comprehension approach with precise question understanding. Specifically, a temporal ordering question is embedded into two vectors to capture the referred event and the temporal relation. Then we evaluate the temporal relation between candidate events and the referred event based on these vectors. Such fine-grained representations offer two benefits. First, they enable a better understanding of the question by focusing on its different elements. Second, they provide good interpretability when evaluating temporal relations. Furthermore, we also harness an auxiliary contrastive loss for representation learning of temporal relations, which aims to distinguish relations with subtle but critical changes. The proposed approach outperforms strong baselines and achieves state-of-the-art performance on the TORQUE dataset. It also increases the accuracy of four pre-trained language models (BERT base, BERT large, RoBERTa base, and RoBERTa large), demonstrating its generic effectiveness on divergent models. PDF 1 2022
LoPE: Learnable Sinusoidal Positional Encoding for Improving Document Transformer Model Positional encoding plays a key role in Transformer-based architectures, where it indicates and embeds token sequential order information. Understanding documents with unreliable reading order information is a real challenge for document Transformer models. This paper proposes a new and generic positional encoding method, learnable sinusoidal positional encoding (LoPE), which combines a sinusoidal positional encoding function with a learnable feed-forward network. We apply LoPE to a document Transformer model and pretrain the model on document datasets. We then finetune and evaluate the model's performance on document understanding tasks in the form and receipt domains. Experimental results not only show that our proposed method outperforms other baselines and state-of-the-art methods, but also demonstrate its robustness and stability in handling noisy data with incorrect order information. PDF 1 2022
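From the description, LoPE composes a fixed sinusoidal table with a trainable feed-forward network. The sketch below shows one plausible PyTorch reading of that idea; the FFN depth, hidden sizes, and the additive application are assumptions, not the paper's exact configuration.

```python
import math
import torch
import torch.nn as nn

class LoPE(nn.Module):
    """Sketch of a learnable sinusoidal positional encoding: fixed
    sinusoids are passed through a trainable feed-forward network.
    Layer sizes are illustrative assumptions."""

    def __init__(self, d_model, max_len=512):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2).float()
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)  # fixed sinusoidal table
        self.ffn = nn.Sequential(       # learnable transformation
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model))

    def forward(self, x):  # x: (batch, seq_len, d_model)
        return x + self.ffn(self.pe[: x.size(1)])

emb = torch.randn(2, 10, 64)
print(LoPE(64)(emb).shape)  # torch.Size([2, 10, 64])
```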
A Cueing Strategy with Prompt Tuning for Relation Extraction Prompt tuning shows great potential for relation extraction because it makes full use of the rich knowledge in pretrained language models (PLMs). However, current prompt tuning models are implemented directly on a raw input, which is weak at encoding the semantic dependencies of a relation instance. In this paper, we design a cueing strategy that implants task-specific cues into the input. It enables PLMs to learn task-specific contextual features and semantic dependencies in a relation instance. Experiments on the ReTACRED corpus and the ACE 2005 corpus show state-of-the-art performance in terms of F1-score. PDF 1 2022
A New Dataset for Summarizing Radiology Reports Radiology report summarization is an important technology in smart healthcare. Compared with medical image processing and disease recognition, which have been comprehensively studied, research on radiology report summarization is much more limited, mainly due to the lack of a high-quality benchmark dataset. In this paper, we present CRRsum, a dataset for radiology report summarization constructed from over 10K real radiology reports that contain diagnostic findings and diagnostic opinions. An extensive evaluation is performed with the current state-of-the-art methods for radiology report summarization on our proposed dataset. Our experiments reveal the challenges of radiology report summarization and provide many opportunities for research going forward. We also show that CRRsum can be used for medical classification to facilitate research on this task. PDF 1 2022
The Impact of Cross-Lingual Adjustment of Contextual Word Representations on Zero-Shot Transfer Large pre-trained multilingual models such as mBERT and XLM-R have enabled effective cross-lingual zero-shot transfer in many NLP tasks. A cross-lingual adjustment of these models using a small parallel corpus can further improve results. This is a more data-efficient method compared to training a machine-translation system or a multilingual model from scratch using only parallel data. In this study, we experiment with zero-shot transfer of English models to four typologically different languages (Spanish, Russian, Vietnamese, and Hindi) and three NLP tasks (QA, NLI, and NER). We carry out a cross-lingual adjustment of an off-the-shelf mBERT model. We show that this adjustment makes embeddings of semantically similar words from different languages closer to each other, while keeping unrelated words apart. In contrast, fine-tuning of mBERT on English data (for a specific task such as NER) draws embeddings of both related and unrelated words closer to each other. The cross-lingual adjustment of mBERT improves NLI in four languages and NER in two languages. However, in the case of QA, performance never improves and sometimes degrades. Moreover, the increase in the amount of parallel data is most beneficial for NLI, whereas QA performance peaks at roughly 5K parallel sentences and decreases as the number of parallel sentences increases further. PDF 1 2022
MWP-BERT: Numeracy-Augmented Pre-training for Math Word Problem Solving Math word problem (MWP) solving faces a dilemma in number representation learning. In order to avoid the number representation issue and reduce the search space of feasible solutions, existing works on MWP solving usually replace real numbers with symbolic placeholders to focus on logic reasoning. However, different from common symbolic reasoning tasks like program synthesis and knowledge graph reasoning, MWP solving has extra requirements in numerical reasoning. In other words, instead of the number value itself, it is the reusable numerical property that matters more in numerical reasoning. Therefore, we argue that injecting numerical properties into symbolic placeholders with a contextualized representation learning schema can provide a way out of this dilemma. In this work, we introduce this idea to popular pre-trained language model (PLM) techniques and build MWP-BERT, an effective contextual number representation PLM. We demonstrate the effectiveness of MWP-BERT on MWP solving and several MWP-specific understanding tasks on both English and Chinese benchmarks. PDF 1 2022
DISAPERE: A Dataset for Discourse Structure in Peer Review Discussions At the foundation of scientific evaluation is the labor-intensive process of peer review. This critical task requires participants to consume vast amounts of highly technical text. Prior work has annotated different aspects of review argumentation, but discourse relations between reviews and rebuttals have yet to be examined. We present DISAPERE, a labeled dataset of 20k sentences contained in 506 review-rebuttal pairs in English, annotated by experts. DISAPERE synthesizes label sets from prior work and extends them to include fine-grained annotation of the rebuttal sentences, characterizing their context in the review and the authors' stance towards review arguments. Further, we annotate every review and rebuttal sentence. We show that discourse cues from rebuttals can shed light on the quality and interpretation of reviews. Further, an understanding of the argumentative strategies employed by the reviewers and authors provides useful signal for area chairs and other decision makers. PDF 1 2022
Morphosyntactic Tagging with Pre-trained Language Models for Arabic and its Dialects We present state-of-the-art results on morphosyntactic tagging across different varieties of Arabic using fine-tuned pre-trained transformer language models. Our models consistently outperform existing systems in Modern Standard Arabic and all the Arabic dialects we study, achieving 2.6% absolute improvement over the previous state-of-the-art in Modern Standard Arabic, 2.8% in Gulf, 1.6% in Egyptian, and 8.3% in Levantine. We explore different training setups for fine-tuning pre-trained transformer language models, including training data size, the use of external linguistic resources, and the use of annotated data from other dialects in a low-resource scenario. Our results show that strategic fine-tuning using datasets from other high-resource dialects is beneficial for a low-resource dialect. Additionally, we show that high-quality morphological analyzers as external linguistic resources are beneficial especially in low-resource settings. PDF 1 2022
That is a good looking car !: Visual Aspect based Sentiment Controlled Personalized Response Generation In a conversational system, generating utterances that communicate consistent and relevant preferences is vital for more personalized conversations. In this paper, we propose the task of generating utterances grounded in an assigned aspect-preference profile. These aspect-preference profiles consist of a list of aspect-sentiment tuples, denoting the preference of the speaker for some aspect in the form of sentiment ("positive" or "negative"). Since no prior dataset containing such profiles is available, we enhance the Image-Chat data by assigning these profiles to each user in a conversation. The conversations in this dataset are based on an image, so the aspects are present in the images as well as the dialogue history. We build a BERT- and ResNet-based encoder-decoder model with a memory network to store the preference profile. Through our experiments, we show that our model can generate responses that convey the sentiment of relevant aspects in accordance with the assigned profile. Both automatic and manual evaluations show the effectiveness of our model and dataset. When using these profiles, our proposed system achieves a BLEU-1 score of 15.93 on this new task, an improvement of 2.92 points over the baseline that does not use aspect-preference profiles. PDF 1 2022
Leveraging Visual Knowledge in Language Tasks: An Empirical Study on Intermediate Pre-training for Cross-modal Knowledge Transfer Pre-trained language models are still far from human performance on tasks that need an understanding of the properties (e.g., appearance, measurable quantity) and affordances of everyday objects in the real world, since text lacks such information due to reporting bias. In this work, we study whether integrating visual knowledge into a language model can fill the gap. We investigate two types of knowledge transfer: (1) text knowledge transfer using image captions that may contain enriched visual knowledge, and (2) cross-modal knowledge transfer using both images and captions with vision-language training objectives. On 5 downstream tasks that may need visual knowledge to solve the problem, we perform extensive empirical comparisons over the presented objectives. Our experiments show that visual knowledge transfer can improve performance in both low-resource and fully supervised settings. PDF 1 2022
Context-Aware Language Modeling for Goal-Oriented Dialogue Systems Goal-oriented dialogue systems face a trade-off between fluent language generation and task-specific control. While supervised learning with large language models is capable of producing realistic text, how to steer such responses towards completing a specific task without sacrificing language quality remains an open question. In this work, we formulate goal-oriented dialogue as a partially observed Markov decision process, interpreting the language model as a representation of both the dynamics and the policy. This view allows us to extend techniques from learning-based control, such as task relabeling, to derive a simple and effective method to finetune language models in a goal-aware way, leading to significantly improved task performance. We additionally introduce a number of training strategies that serve to better focus the model on the task at hand. We evaluate our method, Context-Aware Language Models (CALM), on a practical flight-booking task using AirDialogue. Empirically, CALM outperforms the state-of-the-art method by 7% in terms of task success, matching human-level task performance on this dataset. PDF 1 2022
TVShowGuess: Character Comprehension in Stories as Speaker Guessing We propose a new task for assessing machines' skills at understanding fictional characters in narrative stories. The task, TVShowGuess, builds on the scripts of TV series and takes the form of guessing the anonymous main characters based on the backgrounds of the scenes and the dialogues. Our human study supports that this form of task covers comprehension of multiple types of character persona, including understanding characters' personalities, facts, and memories of personal experience, which are well aligned with the psychological and literary theories about the theory of mind (ToM) of human beings on understanding fictional characters during reading. We further propose new model architectures to support the contextualized encoding of long scene texts. Experiments show that our proposed approaches significantly outperform baselines, yet still largely lag behind the (nearly perfect) human performance. Our work serves as a first step toward the goal of narrative character comprehension. PDF 1 2022
Synopses of Movie Narratives: a Video-Language Dataset for Story Understanding Despite recent advances in AI, story understanding remains an open and under-investigated problem. We collect, preprocess, and publicly release a video-language story dataset, Synopses of Movie Narratives (SyMoN), containing 5,193 video summaries of popular movies and TV series. SyMoN captures naturalistic storytelling videos made by human creators for a human audience, and has higher story coverage and more frequent mental-state references than similar video-language story datasets. Differing from most existing video-text datasets, SyMoN features large semantic gaps between the visual and the textual modalities due to the prevalence of reporting bias and mental state descriptions. We establish benchmarks on video-text retrieval and zero-shot alignment on movie summary videos. With SyMoN, we hope to lay the groundwork for progress in multimodal story understanding. PDF 1 2022
An Empirical Study of Representation, Training and Decoding for Span-based Named Entity Recognition Named Entity Recognition (NER) is an important task in Natural Language Processing with applications in many domains. While the dominant paradigm of NER is sequence labelling, span-based approaches have become very popular in recent times, but are less well understood. In this work, we study different aspects of span-based NER, namely the span representation, learning strategy, and decoding algorithms to avoid span overlap. We also propose an exact algorithm that efficiently finds the set of non-overlapping spans that maximize a global score, given a list of candidate spans. We perform our study on three benchmark NER datasets from different domains. The code and supporting files for the experiments will be made publicly available. PDF 1 2022
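The abstract does not give the exact algorithm, but selecting a maximum-score set of non-overlapping scored spans is the classic weighted-interval-scheduling problem, so a dynamic-programming sketch like the one below illustrates the kind of exact decoder meant here; treat it as an assumption about the approach, not the paper's implementation.

```python
import bisect

def best_nonoverlapping_spans(spans):
    """Exact DP decoder sketch: given scored candidate spans (start, end,
    score) with exclusive ends, return the max-score non-overlapping subset
    (weighted interval scheduling, O(n log n))."""
    spans = sorted(spans, key=lambda s: s[1])  # sort by end position
    ends = [s[1] for s in spans]
    # best[i] = (max total score using the first i spans, chosen spans)
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(spans):
        # rightmost earlier span ending at or before this span's start
        j = bisect.bisect_right(ends, start, 0, i)
        take_score = best[j][0] + score
        if take_score > best[i][0]:  # take span i
            best.append((take_score, best[j][1] + [(start, end, score)]))
        else:                        # skip span i
            best.append(best[i])
    return best[-1]

spans = [(0, 2, 1.5), (1, 3, 2.0), (3, 5, 1.0), (0, 5, 2.4)]
print(best_nonoverlapping_spans(spans))
# (3.0, [(1, 3, 2.0), (3, 5, 1.0)])
```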
Entity Cloze By Date: Understanding what LMs know about unseen entities Language models (LMs) are typically trained once on a large-scale corpus and used for years without being updated. Our world, however, is dynamic, and new entities constantly arise. We propose a framework to analyze what LMs can infer about new entities that did not exist when the LMs were pretrained. We derive a dataset of entities indexed by their origination date and paired with their English Wikipedia articles, from which we can find sentences about each entity. We evaluate LMs' perplexity on masked spans within these sentences. We show that models more informed about the entities, such as those with access to a textual definition of them, achieve lower perplexity on this benchmark. Our experimental results demonstrate that making inferences about new entities remains difficult for LMs. Given its wide coverage on entity knowledge and temporal indexing, our dataset can be used to evaluate LMs and techniques designed to modify or extend their knowledge. Our automatic data collection pipeline can be easily used to continually update our benchmark. PDF 1 2022
Don't Take It Literally: An Edit-Invariant Sequence Loss for Text Generation Neural text generation models are typically trained by maximizing log-likelihood with the sequence cross entropy (CE) loss, which encourages an exact token-by-token match between a target sequence and a generated sequence. Such a training objective is sub-optimal when the target sequence is not perfect, e.g., when the target sequence is corrupted with noise, or when only weak sequence supervision is available. To address this challenge, we propose a novel Edit-Invariant Sequence Loss (EISL), which computes the matching loss of a target $n$-gram with all $n$-grams in the generated sequence. EISL is designed to be robust to various noises and edits in the target sequences. Moreover, the EISL computation is essentially an approximate convolution operation with target $n$-grams as kernels, which is easy to implement and efficient to compute with existing libraries. To demonstrate the effectiveness of EISL, we conduct experiments on a wide range of tasks, including machine translation with noisy target sequences, unsupervised text style transfer with only weak training signals, and non-autoregressive generation with non-predefined generation order. Experimental results show our method significantly outperforms the common CE loss and other strong baselines on all the tasks. EISL has a simple API that can be used as a drop-in replacement for the CE loss: https://anonymous.4open.science/r/EISLLoss. PDF 1 2022
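To make the idea concrete, the toy function below scores every target n-gram against every position of the generated distribution and aggregates with logsumexp, so a correct n-gram contributes to the loss no matter where it appears. It is a readability-first simplification under assumed tensor shapes; the paper realizes the same computation efficiently as a convolution with target n-grams as kernels.

```python
import torch

def eisl_sketch(logp, tgt, n=2):
    """Illustrative (simplified) edit-invariant n-gram loss.
    logp: (T_gen, V) per-position log-probabilities from the model.
    tgt:  (T_tgt,) target token ids.
    For each target n-gram, score its match against every generated
    position, then aggregate positions with logsumexp so the n-gram may
    appear anywhere in the output. Shapes and the aggregation are
    assumptions for this sketch, not the paper's exact formulation."""
    T_gen, _ = logp.shape
    losses = []
    for t in range(len(tgt) - n + 1):
        gram = tgt[t:t + n]
        scores = []
        for s in range(T_gen - n + 1):
            # log-prob that positions s..s+n-1 emit this target n-gram
            scores.append(logp[s:s + n].gather(1, gram.unsqueeze(1)).sum())
        losses.append(-torch.logsumexp(torch.stack(scores), dim=0))
    return torch.stack(losses).mean()

logp = torch.log_softmax(torch.randn(6, 10), dim=-1)
tgt = torch.tensor([1, 4, 2, 7])
print(eisl_sketch(logp, tgt, n=2))
```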
Towards Transparent Interactive Semantic Parsing via Step-by-Step Correction Existing studies on semantic parsing focus on mapping a natural-language utterance to a logical form (LF) in one turn. However, because natural language may contain ambiguity and variability, this is a difficult challenge. In this work, we investigate an interactive semantic parsing framework that explains the predicted LF step by step in natural language and enables the user to make corrections through natural-language feedback for individual steps. We focus on question answering over knowledge bases (KBQA) as an instantiation of our framework, aiming to increase the transparency of the parsing process and help the user trust the final answer. We construct INSPIRED, a crowdsourced dialogue dataset derived from the ComplexWebQuestions dataset. Our experiments show that this framework has the potential to greatly improve overall parse accuracy. Furthermore, we develop a pipeline for dialogue simulation to evaluate our framework w.r.t. a variety of state-of-the-art KBQA models without further crowdsourcing effort. The results demonstrate that our framework promises to be effective across such models. PDF 1 2022
NewsEdits: A Dataset of News Article Revision Histories and a Novel Document-Level Reasoning Challenge News article revision histories provide clues to narrative and factual evolution in news articles. To facilitate analysis of this evolution, we present the first publicly available dataset of news revision histories, NewsEdits. Our dataset is large-scale and multilingual; it contains 1.2 million articles with 4.6 million versions from over 22 English- and French-language newspaper sources based in three countries, spanning 15 years of coverage (2006-2021). We define article-level edit actions: Add, Delete, Edit and Move Sentence, and develop a high-accuracy extraction algorithm to identify these actions. To underscore the factual nature of many edit actions, we conduct analyses showing that added and deleted sentences are more likely to contain updating events, main content and quotes than unchanged sentences. Finally, to explore whether edit actions are predictable, we introduce three novel tasks aimed at predicting actions performed during version updates. We show that these tasks are challenging for large NLP models but are possible for expert humans. We hope this can spur research in narrative framing and help provide predictive tools for journalists chasing breaking news. PDF 1 2022
Fine-grained Location Extraction via Curriculum Learning Named Entity Recognition (NER) seeks to extract entity mentions from texts with predefined categories such as Person and Location. General-domain NER datasets like CoNLL-2003 mostly annotate Location entities in a coarse-grained manner (e.g., a country or a city). However, many applications require identifying fine-grained locations from texts and mapping them precisely to geographic sites (e.g., a crossroad or a store). Therefore, we propose HarveyNER, a new NER dataset with fine-grained locations annotated in tweets. This dataset presents unique challenges and contains many complex and long location mentions in informal descriptions. Since Curriculum Learning can help a system better learn hard samples, we adopt it: we first design two heuristic curricula based on the characteristic difficulties of HarveyNER, and then propose a novel curriculum that takes the commonness of sample difficulty into consideration. Our curricula are simple yet effective, and experimental results show that our methods can improve both hard-case and overall performance on HarveyNER over strong baselines without extra cost. PDF 1 2022
DAQE: Exploring the Direct Assessment on Word-Level Quality Estimation in Machine Translation Word-level Quality Estimation (QE) of Machine Translation (MT) helps to find potential translation errors in translated sentences without a reference. Current QE datasets are typically collected through exact matching between the words of MT sentences and post-edited sentences using the Translation Error Rate (TER) toolkit. However, we find that the data generated by TER cannot faithfully reflect human judgment, which can steer research away from the correct direction. To overcome this limitation, we collect the first direct assessment (DA) dataset for the word-level QE task, namely DAQE, which contains a gold corpus annotated by expert translators on two language pairs. Furthermore, we propose two tag-correcting strategies, namely a tag refinement strategy and a tree-based annotation strategy, to make the TER-based artificial QE tags closer to human judgment, so that the corrected TER-based data can be used to improve QE performance during pre-training. We conduct detailed experiments on our collected DAQE dataset, as well as a comparison with the TER-based QE dataset MLQE-PE. The results not only show that our proposed dataset DAQE is more consistent with human judgment but also confirm the effectiveness of the pre-training approach with the tag-correcting strategies. PDF 1 2022
A Holistic Framework for Analyzing the COVID-19 Vaccine Debate The COVID-19 pandemic has led to an infodemic of low-quality information, leading to poor health decisions. Combating the outcomes of this infodemic is not only a question of identifying false claims, but also of reasoning about the decisions individuals make. In this work we propose a holistic analysis framework connecting stance and reason analysis with fine-grained entity-level moral sentiment analysis. We study how to model the dependencies between the different levels of analysis and incorporate human insights into the learning process. Experiments show that our framework provides reliable predictions even in low-supervision settings. PDF 1 2022
TemporalWiki: A Lifelong Benchmark for Training and Evaluating Ever-Evolving Language Models Language Models (LMs) become outdated as the world changes; they often fail at tasks requiring recent factual information which was absent or different during training, a phenomenon called temporal misalignment. This is an especially challenging problem because the research community still lacks a coherent dataset for assessing the adaptability of LMs to frequently updated knowledge corpora such as Wikipedia. To this end, we introduce TemporalWiki, a lifelong benchmark for ever-evolving LMs that utilizes the differences between consecutive snapshots of Wikipedia and Wikidata for training and evaluation, respectively. The benchmark hence allows one to periodically track an LM's ability to retain previous knowledge and acquire new or updated knowledge at each point in time. We also find that training an LM on the diff data with an adapter achieves similar or better perplexity than training on the entire snapshot in our benchmark, with 12 times less computational cost, which verifies that factual knowledge in LMs can be safely updated with minimal training data via continual learning. The dataset and the code will be available at this link. PDF 1 2022
A Text-Image Pair Is not Enough: Language-Vision Relation Inference with Auxiliary Modality Translation The semantic relations between the language and vision modalities are becoming increasingly important, since they can effectively facilitate downstream multi-modal tasks such as cross-modal retrieval, multi-modal sentiment analysis and entity recognition. Although several approaches have been proposed to handle language-vision relation inference (LVRI), they normally rely on the limited information in a single posted sentence and a single image. In this paper, to extend the information width of the original input, we introduce a concept of modality translation with two potential directions to generate additional modalities, and propose the auxiliary modality translation framework (AMT) for LVRI. This approach can not only generate an additional image by translating the original text, but also additional text by translating the original image. Moreover, to handle the resulting three or four modalities as input, we employ a unified layer-wise transformer structure to perform multi-modal interactions. Systematic experiments and extensive analysis demonstrate that our approach with auxiliary modality translation significantly outperforms conventional LVRI approaches and several competitive baselines for other text-image classification tasks. PDF 1 2022
"my stance decides my language": Modeling of Framing and Political Stance in News Media Framing is a political strategy in which journalists and politicians highlight certain aspects of an issue or a problem to influence public opinion. Frameworks for detecting framing in news articles or social media posts are necessary in order to understand the spread of biased information in our society. Prior research efforts have shown that their framework for framing detection works well by predicting political affiliation afterward. In this paper, rather than predicting stance after detecting frames, we incorporate stance prediction into a framing detection model to jointly capture framing languages better. We take advantage of political stance data, which are more readily available than framing data that require manual annotation of professionals, and propose automatic framing detection models, which can detect previously unseen framing phrases. We compare two different methods of incorporation and show that leveraging stance prediction improves the separation of liberal and conservative biased frame language. PDF 1 2022
ViL-Sum: Enhancing Vision and Language Representations via Multi-task Learning for Multi-modal Summarization With the advance of multimedia on the Internet, multi-modal summarization has drawn much attention. Most current methods follow a pipeline strategy, where an off-the-shelf object detector is used to extract visual features which are then fused with language representations for the decoder to generate from. However, these methods suffer from two issues: (1) separate vision and language representations fail to capture the interrelations between the two modalities; (2) at the local level, the semantic alignments between images and paragraphs are missing. To address these problems, in this paper we propose a novel Vision-Language Summarization (ViL-Sum) model with a multi-task learning framework. Specifically, we train our model with two auxiliary tasks in a multi-task manner, namely image selection and image reordering. In this way, the interrelations between image and text are well captured. Besides, to further enhance the vision-language representation, we employ a unified transformer-based encoder-decoder structure. The encoder simultaneously takes image and text as input and jointly learns the representations of both. Then the representations are used by the decoder to generate the summary. Experimental results show that ViL-Sum significantly outperforms current state-of-the-art methods. In further analysis, we find that the representations enhanced via multi-task training and joint modeling learn reasonable relations between image and text. PDF 1 2022
Improving Conversational Recommendation Systems’ Quality with Context-Aware Item Meta-Information A key challenge of Conversational Recommendation Systems (CRS) is to integrate the recommendation function and the dialog generation function smoothly. Previous works employ graph neural networks with external knowledge graphs (KG) to model individual recommendation items and integrate KGs with language models through attention mechanism for response generation. Although previous approaches prove effective, there is still room for improvement. For example, KG-based approaches only rely on entity relations and bag-of-words to recommend items and neglect the information in the conversational context. We propose to improve the usage of dialog context for both recommendation and response generation using an encoding architecture along with the self-attention mechanism of transformers. In this paper, we propose a simple yet effective architecture comprising a pre-trained language model (PLM) and an item metadata encoder to integrate the recommendation and the dialog generation better. The proposed item encoder learns to map item metadata to embeddings reflecting the rich information of the item, which can be matched with dialog context. The PLM then consumes the context-aware item embeddings and dialog context to generate high-quality recommendations and responses. Experimental results on the benchmark dataset ReDial show that our model obtains state-of-the-art results on both recommendation and response generation tasks. PDF 1 2022
Using Natural Sentence Prompts for Understanding Biases in Language Models Evaluation of biases in language models is often limited to synthetically generated datasets. This dependence traces back to the need for prompt-style datasets to trigger specific behaviors of language models. In this paper, we address this gap by creating a prompt dataset with respect to occupations, collected from real-world natural sentences in Wikipedia. We aim to understand the differences between using template-based prompts and natural sentence prompts when studying gender-occupation biases in language models. We find bias evaluations are very sensitive to the design choices of template prompts, and we propose using natural sentence prompts as a way of more systematically using real-world sentences, moving away from design decisions that may bias the results. PDF 1 2022
Lacuna Reconstruction: Self-supervised Pre-training for Low-Resource Historical Document Transcription We present a self-supervised pre-training approach for learning rich visual language representations for both handwritten and printed historical document transcription. After supervised fine-tuning of our pre-trained encoder representations for low-resource document transcription on two languages, (1) a heterogeneous set of handwritten Islamicate manuscript images and (2) early modern English printed documents, we show a meaningful improvement in recognition accuracy over the same supervised model trained from scratch with as few as 30 line image transcriptions for training. Our masked language model-style pre-training strategy, where the model is trained to be able to identify the true masked visual representation from distractors sampled from within the same line, encourages learning robust contextualized language representations invariant to scribal writing style and printing noise present across documents. PDF 1 2022
Database Search Results Disambiguation for Task-Oriented Dialog Systems As task-oriented dialog systems become increasingly popular in our lives, more realistic tasks have been proposed and explored. However, new practical challenges arise. For instance, current dialog systems cannot effectively handle multiple search results when querying a database, due to the lack of such scenarios in existing public datasets. In this paper, we propose Database Search Result (DSR) Disambiguation, a novel task that focuses on disambiguating database search results, which enhances user experience by allowing users to choose from multiple options instead of just one. To study this task, we augment the popular task-oriented dialog datasets (MultiWOZ and SGD) with turns that resolve ambiguities by (a) synthetically generating turns through a pre-defined grammar, and (b) collecting human paraphrases for a subset. We find that training on our augmented dialog data improves the model's ability to deal with ambiguous scenarios, without sacrificing performance on unmodified turns. Furthermore, pre-fine-tuning and multi-task learning help our model improve performance on DSR disambiguation even in the absence of in-domain data, suggesting that it can be learned as a universal dialog skill. Our data and code will be made publicly available. PDF 1 2022
When do Contrastive Word Alignments Improve Many-to-many Neural Machine Translation? Word alignment has proven to benefit many-to-many neural machine translation (NMT). However, high-quality ground-truth bilingual dictionaries were used for pre-editing in previous methods, which are unavailable for most language pairs. Meanwhile, the contrastive objective can implicitly utilize automatically learned word alignment, which has not been explored in many-to-many NMT. This work proposes a word-level contrastive objective to leverage word alignments for many-to-many NMT. Empirical results show that this leads to 0.8 BLEU gains for several language pairs. Analyses reveal that in many-to-many NMT, the encoder's retrieval performance highly correlates with the translation quality, which explains when the proposed method impacts translation. This motivates future exploration for many-to-many NMT focusing on improving the encoder retrieval performance. PDF 1 2022
Are All the Datasets in Benchmark Necessary? A Pilot Study of Dataset Evaluation for Text Classification In this paper, we ask whether all the datasets in a benchmark are necessary. We approach this by first characterizing the distinguishing ability of datasets when comparing different systems. Experiments on 9 datasets and 36 systems show that several existing benchmark datasets contribute little to discriminating top-scoring systems, while less-used datasets exhibit impressive discriminative power. Taking the text classification task as a case study, we further investigate the possibility of predicting dataset discrimination based on its properties (e.g., average sentence length). Our preliminary experiments promisingly show that, given a sufficient number of training experimental records, a meaningful predictor can be learned to estimate dataset discrimination over unseen datasets. We release all related code at \url{https://github.com/annonnlp-demo/acl-V2} along with a new benchmark dataset for text classification based on our observations. PDF 1 2022
Navigating Connected Memories with a Task-oriented Dialog System Recent years have seen an increasing trend in the volume of personal media captured by users, thanks to the advent of smartphones and smart glasses, resulting in large media collections. Despite conversation being an intuitive human-computer interface, current efforts focus mostly on single-shot, natural-language-based media retrieval to help users query their media and re-live their memories. This severely limits the search functionality, as users can neither ask follow-up queries nor obtain information without first formulating a single-turn query. In this work, we propose dialogs for connected memories as a powerful tool to empower users to search their media collections through multi-turn, interactive conversation. Towards this, we collect a new task-oriented dialog dataset, COMET, which contains 11.5k user-assistant dialogs (totaling 103k utterances), grounded in simulated personal memory graphs. We employ a resource-efficient, two-phase data collection pipeline that uses: (1) a novel multimodal dialog simulator that generates synthetic dialog flows grounded in memory graphs, and (2) manual paraphrasing to obtain natural language utterances. We analyze COMET, formulate four main tasks to benchmark meaningful progress, and adopt state-of-the-art language models as strong baselines, in order to highlight the multimodal challenges captured by our dataset. Our code & data will be made publicly available. PDF 1 2022
Analyzing CodeBERT's Performance on Natural Language Code Search Large language models such as CodeBERT perform very well on tasks such as natural language code search. We show that this is most likely due to the high token overlap and similarity between the queries and the code in datasets obtained from large codebases, rather than any deeper understanding of the syntax or semantics of the query or code. PDF 1 2022
Do Prompts Solve NLP Tasks Using Natural Language? Thanks to advances in large pre-trained language models, prompt-based fine-tuning has been shown to be effective on a variety of downstream tasks. Though many prompting methods have been investigated, it remains unknown which of three types of prompts (i.e., human-designed prompts, schema prompts and null prompts) is most effective. In this work, we empirically compare the three types of prompts under both few-shot and fully-supervised settings. Our experimental results show that schema prompts are the most effective in general. Moreover, the performance gaps tend to diminish as the scale of the training data grows. PDF 1 2022
Commonsense Knowledge Transfer for Pre-trained Language Models Despite serving as the foundation models for a wide range of NLP benchmarks, pre-trained language models have shown limited capabilities for acquiring implicit commonsense knowledge from self-supervision alone, compared to learning the linguistic and factual knowledge that appears more explicitly in the surface patterns of text. In this work, we introduce commonsense knowledge transfer, a framework to transfer the commonsense knowledge stored in a neural commonsense knowledge model to a general-purpose pre-trained language model. It first exploits general texts to form queries for extracting commonsense knowledge from the neural commonsense knowledge model and then refines the language model with two self-supervised objectives: commonsense mask infilling and commonsense relation prediction, which align human language with the underlying commonsense knowledge. Empirical results show that our approach consistently improves the model's performance on downstream tasks that require commonsense reasoning. Moreover, we find that the improvement is more significant in the few-shot setting. This suggests that our approach helps language models better transfer to downstream tasks without extensive supervision by injecting commonsense knowledge into their parameters. PDF 1 2022
A Deep Paradigm for Articulatory Speech Representation Learning via Neural Convolutive Sparse Matrix Factorization Most of the research on data-driven speech representation learning has focused on raw audio in an end-to-end manner, paying little attention to its internal phonological or gestural structure. This work, investigating the speech representations derived from articulatory kinematics signals, uses a neural implementation of convolutive sparse matrix factorization to decompose the articulatory data into interpretable gestures and gestural scores. By applying sparse constraints, the gestural scores leverage the discrete combinatorial properties of phonological gestures. Phoneme recognition experiments were additionally performed to show that gestural scores indeed successfully code phonological information. The proposed work thus builds a bridge between articulatory phonology and deep neural networks to leverage interpretable, intelligible, informative, and efficient speech representations. PDF 1 2022
AdapterBias: Parameter-efficient Token-dependent Representation Shift for Adapters in NLP Tasks Transformer-based pre-trained models with millions of parameters require large storage. Recent approaches tackle this shortcoming by training adapters, but these approaches still require a relatively large number of parameters. In this study, we propose AdapterBias, a surprisingly simple yet effective adapter architecture. AdapterBias adds a token-dependent shift to the hidden output of transformer layers to adapt to downstream tasks, using only a vector and a linear layer. Extensive experiments are conducted to demonstrate the effectiveness of AdapterBias. The experiments show that our proposed method can dramatically reduce the trainable parameters compared to previous works, with a minimal decrease in task performance relative to fine-tuned pre-trained models. We further find that AdapterBias automatically learns to assign more significant representation shifts to the tokens related to the task under consideration. PDF 1 2022
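The "vector plus linear layer" description translates almost directly into code: a shared learnable shift vector, scaled per token by a tiny linear layer. The PyTorch sketch below is one plausible reading of that description; initialization and where the module is inserted in the layer stack are assumptions.

```python
import torch
import torch.nn as nn

class AdapterBias(nn.Module):
    """Sketch of the AdapterBias idea: a shared learnable vector v is
    scaled per token by a tiny linear layer alpha and added to the
    transformer layer's hidden output. Only v and alpha are trained."""

    def __init__(self, d_model):
        super().__init__()
        self.v = nn.Parameter(torch.zeros(d_model))  # shared shift vector
        self.alpha = nn.Linear(d_model, 1)           # token-dependent weight

    def forward(self, hidden):  # hidden: (batch, seq_len, d_model)
        # each token receives its own scalar multiple of the shared vector
        return hidden + self.alpha(hidden) * self.v

h = torch.randn(2, 8, 64)
print(AdapterBias(64)(h).shape)  # torch.Size([2, 8, 64])
```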
Learning to repair: Repairing model output errors after deployment using a dynamic memory of feedback Large language models (LMs), while powerful, are not immune to mistakes, but can be difficult to retrain. Our goal is for an LM to continue to improve after deployment, without retraining, using feedback from the user. Our approach pairs an LM with (i) a growing memory of cases where the user identified an output error and provided general feedback on how to correct it, and (ii) a corrector model, trained to translate this general feedback into specific edits to repair the model output. Given a new, unseen input, our model can then use feedback from similar, past cases to repair output errors that may occur. We instantiate our approach using an existing, fixed model for script generation that takes a goal (e.g., "bake a cake") and generates a partially ordered sequence of actions to achieve that goal, sometimes containing errors. We show that our memory-enhanced system, FBNet, learns to apply user feedback effectively to repair such errors (up to 30 points improvement), while making a start at avoiding similar past mistakes on new, unseen examples (up to 7 points improvement in a controlled setting). This is a first step towards strengthening deployed models, potentially broadening their utility. PDF 1 2022
Surprisingly Simple Adapter Ensembling for Zero-Shot Cross-Lingual Sequence Tagging Adapters are parameter-efficient modules added to pretrained Transformer models that facilitate cross-lingual transfer. Language adapters and task adapters can be trained separately, and zero-shot transfer is enabled by pairing the language adapter in the target language with a task adapter trained on a high-resource language. However, there are many languages and dialects for which training language adapters would be difficult. In this work, we present a simple and efficient ensembling technique to transfer task knowledge to unseen target languages for which no language adapters exist. We compute a uniformly-weighted ensemble model over the top language adapters based on how well they perform on the test set of a high-resource language. We outperform the state-of-the-art model for this specific setting on named entity recognition (NER) and part-of-speech tagging (POS), across nine typologically diverse languages, with relative performance improvements of up to 29% and 9% on NER and POS, respectively, on select target languages. PDF 1 2022
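The ensembling step itself is simple enough to sketch: run the task adapter once per candidate language adapter and uniformly average the resulting probabilities. The shapes and the use of raw logits as inputs below are assumptions for illustration, not the authors' code.

```python
import numpy as np

def ensemble_predict(logits_per_adapter):
    """Hedged sketch of the uniform ensemble over the top-k language
    adapters (selected by their scores on a high-resource test set).
    logits_per_adapter: list of arrays of shape (num_tokens, num_labels)."""
    probs = []
    for logits in logits_per_adapter:
        z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
        p = np.exp(z)
        probs.append(p / p.sum(axis=-1, keepdims=True))
    # uniformly-weighted average of per-adapter probabilities, then argmax
    return np.mean(probs, axis=0).argmax(axis=-1)

# e.g. tag 5 tokens with 3 labels using the top-4 language adapters
preds = ensemble_predict([np.random.randn(5, 3) for _ in range(4)])
print(preds)
```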
The Art of Prompting: Event Detection based on Type Specific Prompts We compare various forms of prompts to represent event types and develop a unified framework to incorporate the event type specific prompts for supervised, few-shot, and zero-shot event detection. The experimental results demonstrate that a well-defined and comprehensive event type prompt can significantly improve the performance of event detection, especially when the annotated data is scarce (few-shot event detection) or not available (zero-shot event detection). By leveraging the semantics of event types, our unified framework shows up to a 24.3% F-score gain over the previous state-of-the-art baselines. PDF 1 2022
PCEE-BERT: Accelerating BERT Inference via Patient and Confident Early Exiting BERT and other pre-trained language models (PLMs) are ubiquitous in modern NLP. Even though PLMs are the state-of-the-art (SOTA) models for almost every NLP task (Qiu et al., 2020), their significant inference latency prevents wider industrial adoption. In this work, we propose Patient and Confident Early Exiting BERT (PCEE-BERT), an off-the-shelf sample-dependent early exiting method that can work with different PLMs and alongside popular model compression methods. With a multi-exit BERT as the backbone model, PCEE-BERT makes the early-exiting decision when a sufficient number (the patience parameter) of consecutive intermediate layers are confident about their predictions. The entropy value measures the confidence level of an intermediate layer's prediction. Experiments on the GLUE benchmark demonstrate that our method outperforms previous SOTA early exiting methods. Ablation studies show that: (a) our method performs consistently well on other PLMs, such as ALBERT and TinyBERT; and (b) PCEE-BERT can achieve different speed-up ratios by adjusting the patience parameter and the confidence threshold. PDF 1 2022
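The exiting rule described in the abstract (exit once `patience` consecutive intermediate classifiers have prediction entropy below a threshold) can be sketched as a simple layer loop. The module structure, pooling, and hyperparameter values below are illustrative assumptions, not PCEE-BERT's actual implementation.

```python
import torch

@torch.no_grad()
def pcee_forward(layers, classifiers, hidden, threshold=0.3, patience=2):
    """Sketch of patient-and-confident early exiting: run layers one at a
    time and stop once `patience` consecutive intermediate classifiers are
    confident, i.e. their prediction entropy is below `threshold`."""
    confident_streak = 0
    logits = None
    for layer, clf in zip(layers, classifiers):
        hidden = layer(hidden)
        logits = clf(hidden.mean(dim=1))  # pooled sentence representation
        probs = torch.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
        confident_streak = confident_streak + 1 if entropy < threshold else 0
        if confident_streak >= patience:
            break  # exit early; later layers are never executed
    return logits

layers = torch.nn.ModuleList(
    [torch.nn.TransformerEncoderLayer(64, 4, batch_first=True)
     for _ in range(6)])
classifiers = torch.nn.ModuleList([torch.nn.Linear(64, 2) for _ in range(6)])
print(pcee_forward(layers, classifiers, torch.randn(1, 10, 64)).shape)
```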
HybriDialogue: An Information-Seeking Dialogue Dataset Grounded on Tabular and Textual Data A pressing challenge for current dialogue systems is to converse successfully with users on topics with information distributed across different modalities. Previous work on multiturn dialogue systems has primarily focused on either text or table information. In more realistic scenarios, a joint understanding of both is critical, as knowledge is typically distributed over both unstructured and structured forms. We present a new dialogue dataset, HybriDialogue, which consists of crowdsourced natural conversations grounded on both Wikipedia text and tables. The conversations are created through the decomposition of complex multihop questions into simple, realistic multiturn dialogue interactions. We conduct several baseline experiments, including retrieval, system state tracking, and dialogue response generation. Our results show that there is still ample opportunity for improvement, demonstrating the importance of building stronger dialogue systems that can reason over the complex setting of information-seeking dialogue grounded on tables and text. PDF 1 2022
Document-Level Event Argument Extraction by Leveraging Redundant Information and Closed Boundary Loss In document-level event argument extraction, an argument is likely to appear multiple times in different expressions in the document. The redundancy of arguments underlying multiple sentences is beneficial but is often overlooked. In addition, in event argument extraction, the majority of entities are regarded as the class "others", i.e., the universum class, which is composed of heterogeneous entities without typical common features. Classifiers trained with the cross-entropy loss can easily misclassify the universum class because of their open decision boundaries. In this paper, to make use of the redundant information underlying a document, we build an entity coreference graph with a graph2token module to produce a comprehensive and coreference-aware representation for every entity, and then build an entity summary graph to merge the multiple extraction results. To better classify the universum class, we propose a new loss function to build classifiers with closed boundaries. Experimental results show that our model outperforms the previous state-of-the-art models by 3.35% in F1-score. PDF 1 2022
Unsupervised Reinforcement Adaptation for Class-Imbalanced Text Classification Unsupervised domain adaptation (UDA) augments model performance using only annotations from the source domain and unlabeled data from the target domain. Existing state-of-the-art UDA models learn domain-invariant representations across domains and are evaluated primarily on class-balanced data. In this work, we propose an unsupervised domain adaptation approach via reinforcement learning that jointly leverages label prediction, domain information, and imbalanced labels across domains. We experiment with the text classification task for its easily accessible datasets and compare the proposed method with five baselines. Experiments on three datasets show that our proposed method can effectively learn robust domain-invariant representations and successfully adapt text classifiers across domains and imbalanced classes. PDF 1 2022
CERES: Pretraining of Graph-Conditioned Transformer for Semi-Structured Session Data User sessions empower many search and recommendation tasks on a daily basis. Such session data are semi-structured: they encode heterogeneous relations between queries and products, and each item is described by unstructured text. Despite recent advances in self-supervised learning for text or graphs, there is a lack of self-supervised learning models that can effectively capture both intra-item semantics and inter-item interactions for semi-structured sessions. To fill this gap, we propose CERES, a graph-based transformer model for semi-structured session data. CERES learns representations that capture both inter- and intra-item semantics with (1) a graph-conditioned masked language pretraining task that jointly learns from item text and item-item relations; and (2) a graph-conditioned transformer architecture that propagates inter-item contexts to item-level representations. We pretrain CERES using ~468 million Amazon sessions and find that CERES outperforms strong pretraining baselines by up to 9% in three session search and entity linking tasks. PDF 1 2022
Incremental Prompting: Episodic Memory Prompt for Lifelong Event Detection Lifelong event detection aims to incrementally update a model with new event types and data while retaining the capability to handle previously learned old types. One critical challenge is that the model would catastrophically forget old types when continually trained on new data. In this paper, we introduce Episodic Memory Prompts (EMP) to explicitly preserve the learned task-specific knowledge. Our method adopts a continuous prompt for each task, optimized to instruct the model's prediction and learn event-specific representations. The EMPs learned in previous tasks are carried along with the model in subsequent tasks, and can serve as a memory module that retains the old knowledge and transfers it to new tasks. Experimental results demonstrate the effectiveness of our method. Furthermore, we also conduct a comprehensive analysis of the new and old event types in lifelong learning. PDF 1 2022
An Emoji-aware Multitask Framework for Multimodal Sarcasm Detection Sarcasm is a case of implicit emotion and needs additional information like context and multimodality for better detection. But sometimes even this additional information fails to help. For example, the utterance "Oh yes, you’ve been so helpful. Thank you so much for all your help", said in a polite tone with a smiling face, can easily be understood as non-sarcastic because of its positive sentiment. But if the above message is accompanied by a frustrated emoji 😤, the negative sentiment of the emoji becomes evident and the intended sarcasm can be easily understood. Thus, in this paper, we propose SEEmoji MUStARD, an extension of the multimodal MUStARD dataset. We annotate each utterance with a relevant emoji, the emoji's sentiment, and the emoji's emotion. We propose an emoji-aware multitask deep learning framework for multimodal sarcasm detection (the primary task), and sentiment and emotion detection (the secondary tasks) in a multimodal conversational scenario. Experimental results on SEEmoji MUStARD show the efficacy of our proposed approach for sarcasm detection over the state-of-the-art. PDF 1 2022
Sparsely Activated Mixture-of-Experts are Robust Multi-Task Learners Traditional multi-task learning (MTL) methods use dense networks that share the same set of weights across several different tasks. This often creates interference, where two or more tasks compete to pull model parameters in different directions. In this work, we study whether sparsely activated Mixture-of-Experts (MoE) improve multi-task learning by specializing some weights for learning shared representations and using the others for learning task-specific information. To this end, we devise task-aware gating functions to route examples from different tasks to specialized experts which share subsets of network weights conditioned on the task. This results in a sparsely activated multi-task model with a large number of parameters, but with the same computational cost as that of a dense model. We demonstrate that such sparse networks improve multi-task learning along three key dimensions: (i) transfer to low-resource tasks from related tasks in the training mixture; (ii) sample-efficient generalization to tasks not seen during training by making use of task-aware routing from seen related tasks; (iii) robustness to the addition of unrelated tasks by avoiding catastrophic forgetting of existing tasks. PDF 1 2022
$\textsc{AmbiPun}$: Generating Humorous Puns with Ambiguous Context In this paper, we propose a simple yet effective way to generate pun sentences that does not require any training on existing puns. Our approach is inspired by humor theories suggesting that ambiguity comes from the context rather than the pun word itself. Given a pair of definitions of a pun word, our model first produces a list of related concepts through a reverse dictionary. We then utilize one-shot GPT-3 to generate context words, and finally generate puns incorporating context words from both concepts. Human evaluation shows that our method successfully generates puns 52\% of the time, outperforming well-crafted baselines and the state-of-the-art models by a large margin. PDF 1 2022
Rationalized Co-Training Co-training is a semi-supervised learning technique that leverages two views of the data. It trains a classifier for each view using a small set of labelled data and uses the classifiers to label training data for each other. Intuitively, co-training works by encouraging agreement between the classifiers; an idea exploited in co-regularization. In this work, we propose rationalized co-training: a variant of co-training that encourages agreement between the rationales of the classifiers' predictions. Experiments on two datasets showed that rationalized co-training reduces the error rates of the partially and fully supervised models by 32.3%. This error rate reduction outperformed that of vanilla co-training by 8.51%. PDF 1 2022
Teaching BERT to Wait: Balancing Accuracy and Latency for Streaming Disfluency Detection In modern interactive speech-based systems, speech is consumed and transcribed incrementally prior to having disfluencies removed. While this post-processing step is crucial for producing clean transcripts and high performance on downstream tasks (e.g. machine translation), most current state-of-the-art NLP models such as the Transformer operate non-incrementally, potentially causing unacceptable delays for the user. In this work we propose a streaming BERT-based sequence tagging model that, combined with a novel training objective, is capable of detecting disfluencies in real-time while balancing accuracy and latency. This is accomplished by training the model to decide whether to immediately output a prediction for the current input or to wait for further context, in essence learning to dynamically size the lookahead window. Our results demonstrate that our model produces comparably accurate predictions and does so sooner than our baselines, with lower flicker. Furthermore, the model attains state-of-the-art latency and stability scores when compared with recent work on incremental disfluency detection. PDF 1 2022
Why Does Surprisal From Smaller GPT-2 Models Provide Better Fit to Human Reading Times? This work presents an in-depth analysis of an observation that contradicts the findings of recent work in computational psycholinguistics, namely that smaller GPT-2 models that show higher test perplexity nonetheless generate surprisal estimates that are more predictive of human reading times. Analysis of the surprisal values shows that rare proper nouns, which are typically tokenized into multiple subword tokens, are systematically assigned lower surprisal values by the larger GPT-2 models. A comparison of residual errors from regression models fit to reading times reveals that regression models with surprisal predictors from smaller GPT-2 models have significantly lower mean absolute errors on words that are tokenized into multiple tokens, while this trend is not observed on words that are kept intact. These results indicate that the ability of larger GPT-2 models to predict internal pieces of rare words more accurately makes their surprisal estimates deviate from humanlike expectations that manifest in self-paced reading times and eye-gaze durations. PDF 1 2022
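To make the tokenization effect above concrete: word-level surprisal is conventionally computed by summing the negative log probabilities of a word's subword tokens, so a rare proper noun split into several pieces aggregates several conditional probabilities. A minimal sketch, assuming the Hugging Face transformers library and the public gpt2 checkpoint (any GPT-2 size would work the same way):

import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def word_surprisals(words):
    # Tokenize each word separately (with a leading space after the first)
    # so we know which subword tokens belong to which word.
    pieces = [tok.encode(w if i == 0 else " " + w) for i, w in enumerate(words)]
    ids = [tok.bos_token_id] + [t for p in pieces for t in p]
    with torch.no_grad():
        logits = model(torch.tensor([ids])).logits[0]
    logprobs = torch.log_softmax(logits, dim=-1)
    out, pos = [], 1
    for word, p in zip(words, pieces):
        # Surprisal of a word = sum of -log2 P(subword | history) over its pieces.
        s = sum(-logprobs[pos + k - 1, t].item() / math.log(2)
                for k, t in enumerate(p))
        pos += len(p)
        out.append((word, s))
    return out

print(word_surprisals("The quick brown fox jumps".split()))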
Hierarchical Relation-Guided Type-Sentence Alignment for Long-Tail Relation Extraction with Distant Supervision Distant supervision uses triple facts in knowledge graphs to label a corpus for relation extraction, leading to wrong-labeling and long-tail problems. Some works use the hierarchy of relations for knowledge transfer to long-tail relations. However, a coarse-grained relation often implies only an attribute (e.g., domain or topic) of the distant fact, making it hard to discriminate relations based solely on sentence semantics. One solution is resorting to entity types, but open questions remain about how to fully leverage the information of entity types and how to align multi-granular entity types with sentences. In this work, we propose a novel model to enrich distantly-supervised sentences with entity types. It consists of (1) a pairwise type-enriched sentence encoding module injecting both context-free and context-related backgrounds to alleviate sentence-level wrong labeling, and (2) a hierarchical type-sentence alignment module enriching a sentence with the triple fact's basic attributes to support long-tail relations. Our model achieves new state-of-the-art results in overall and long-tail performance on benchmarks. PDF 1 2022
In-BoXBART: Get Instructions into Biomedical Multi-task Learning Single-task models have proven pivotal in solving specific tasks; however, they have limitations in real-world applications where multi-tasking is necessary and domain shifts are exhibited. Recently, instructional prompts have shown significant improvement towards multi-task generalization; however, the effect of instructional prompts and Multi-Task Learning (MTL) has not been systematically studied in the biomedical domain. Motivated by this, this paper explores the impact of instructional prompts for biomedical MTL. We introduce the BoX, a collection of 32 instruction tasks for Biomedical NLP across (X) various categories. Using this meta-dataset, we propose a unified model termed In-BoXBART that can jointly learn all tasks of the BoX without any task-specific modules. To the best of our knowledge, this is the first attempt to propose a unified model in the biomedical domain and use instructions to achieve generalization across several biomedical tasks. Experimental results indicate that the proposed model: 1) outperforms the single-task baseline by ~3% and the multi-task (without instruction) baseline by ~18% on average, and 2) shows ~23% improvement compared to the single-task baseline in few-shot learning (i.e., 32 instances per task) on average. Our analysis indicates that there is significant room for improvement across tasks in the BoX, implying scope for future research. PDF 1 2022
Offensive Content Detection Via Synthetic Code-Switched Text The prevalent use of offensive content in social media has become an important reason for concern for online platforms (customer service chat-boxes and social media platforms). Classifying offensive and hate-speech content in online settings is an essential task in many applications that needs to be addressed accordingly. However, text from online platforms can contain code-switching, a combination of more than one language. The non-availability of labeled code-switched data for low-resourced code-switching combinations adds to the difficulty of this problem. To overcome this, we release a synthetic code-switched textual dataset containing around 29k samples for training and a real-world dataset containing around 10k samples for testing, for three language combinations: en-fr, en-es, and en-de. In this paper, we describe our algorithm for creating the synthetic code-switched offensive content data and the process for creating the human-generated data. We also report the results of a keyword classification baseline and a multi-lingual transformer-based classification model. PDF 1 2022
Investigating Non-local Features for Neural Constituency Parsing Thanks to the strong representation power of neural encoders, neural chart-based parsers have achieved highly competitive performance by using local features. Recently, it has been shown that non-local features in CRF structures lead to improvements. In this paper, we investigate injecting non-local features into the training process of a local span-based parser, by predicting constituent $n$-gram non-local patterns and ensuring consistency between non-local patterns and local constituents. Results show that our simple method gives better results than the self-attentive parser on both PTB and CTB. Moreover, our method achieves state-of-the-art BERT-based performance on PTB (95.92 F1) and strong performance on CTB (92.31 F1). Our parser also outperforms the self-attentive parser in multi-lingual and zero-shot cross-domain settings. PDF 1 2022
Towards Robust Online Dialogue Response Generation Although pre-trained sequence-to-sequence models have achieved great success in dialogue response generation, chatbots still suffer from generating inconsistent responses in real-world practice, especially in multi-turn settings. We argue that this can be caused by a discrepancy between training and real-world testing: at training time, the chatbot generates responses given the gold context, whereas at real-world test time it must generate based on a context consisting of both user utterances and the model's own predicted utterances. As the number of utterances grows, this discrepancy becomes more serious in multi-turn settings. In this paper, we propose a hierarchical sampling-based method consisting of both utterance-level sampling and semi-utterance-level sampling to alleviate the discrepancy, which implicitly increases dialogue coherence. We further adopt reinforcement learning and re-ranking methods to explicitly optimize dialogue coherence during training and inference, respectively. Empirical experiments show the effectiveness of the proposed methods for improving the robustness of chatbots in real-world practice. PDF 1 2022
Combinatorial Scientific Discovery: Finding New Concept Combinations Beyond Link Prediction As the number of publications grows tremendously, it is increasingly challenging for researchers to read all the related literature to find the "white space" in a specific research domain. Automatic scientific discovery has been proposed to help researchers identify new research ideas, but it has generally been limited to finding new combinations of concept pairs using link prediction in a knowledge graph. In this paper, we propose the combinatorial scientific discovery task: predicting combinations of more than two concepts. We standardize the task by providing benchmark datasets and initial models. Our solutions demonstrate the challenge, but also the value, of the task for finding new, meaningful scientific ideas, and its advantage over simple link prediction. PDF 1 2022
A Word is Worth A Thousand Dollars: Adversarial Attack on Tweets Fools Stock Prediction More and more investors and machine learning models rely on social media (e.g., Twitter and Reddit) to gather information and predict stock price movements. Although text-based models are known to be vulnerable to adversarial attacks, whether stock prediction models have a similar vulnerability under the necessary constraints is underexplored. In this paper, we experiment with a variety of adversarial attack configurations to fool three stock prediction victim models. We address the task of adversarial generation by solving combinatorial optimization problems with semantics and budget constraints. Our results show that the proposed attack method can achieve consistent success rates and cause significant monetary loss in trading simulation by simply concatenating a perturbed but semantically similar tweet. PDF 1 2022
UBERT: A Novel Language Model for Synonymy Prediction at Scale in the UMLS Metathesaurus The UMLS Metathesaurus integrates more than 200 biomedical source vocabularies. During the Metathesaurus construction process, synonymous terms are clustered into concepts by human editors, assisted by lexical similarity algorithms. This process is error-prone and time-consuming. Recently, a deep learning model (LexLM) has been developed for the UMLS Vocabulary Alignment (UVA) task. This work introduces UBERT, a BERT-based language model pretrained on UMLS terms via a supervised Synonymy Prediction (SP) task that replaces the original Next Sentence Prediction (NSP) task. The effectiveness of UBERT for the UMLS Metathesaurus construction process is evaluated using the UVA task. We show that UBERT outperforms LexLM as well as biomedical BERT-based models. Key to the performance of UBERT are the synonymy prediction task specifically developed for UBERT, the tight alignment of training data to the UVA task, and the similarity of the models used for pretraining UBERT. PDF 1 2022
Real-time ASR Customization via Hypotheses Re-ordering: A Comparative Study of Different Scoring Functions General purpose automatic speech recognizers (ASRs) require customization to the domain and context to achieve practically acceptable accuracy levels when used as part of voice digital assistants. Further, such general purpose ASRs typically output multiple alternative hypotheses for the same input utterance. In this paper, we consider the hypothesis re-ordering framework and evaluate the impact of three different scoring functions for re-ordering the hypotheses: phoneme-based, character-based, and word-based, and determine their strengths and weaknesses. Based on our intuitions and experimental validation, we determine that phoneme-based scoring is best for closed-domain contexts, while character-based and word-based scoring do better for more open-domain contexts. Our results show that character-based scoring gives the best performance improvement in terms of word error rate over general purpose ASRs for voice assistants used in a classroom context. Our analysis also reveals that character-based scoring is preferred for shorter utterances while word-based scoring is preferred for longer utterances. PDF 1 2022
Structured Pruning Learns Compact and Accurate Models The growing size of neural language models has led to increased attention to model compression. The two predominant approaches are pruning, which gradually removes weights from a pre-trained model, and distillation, which trains a smaller compact model to match a larger one. Pruning methods can significantly reduce the model size but hardly achieve speedups as large as distillation does. Distillation methods, however, require large amounts of unlabeled data and are expensive to train. In this work, we aim to close this gap and propose a structured pruning method---MixedPruning---which matches its distillation counterparts in both latency and accuracy while incurring only 5% of the training cost and using no unlabeled data. Our key insight is to jointly prune coarse (e.g., layers) and fine-grained (e.g., heads and hidden units) modules, controlling the pruning decision of each parameter with masks of different granularity. This pruning strategy eases optimization and delivers highly competitive and parallelizable subnetworks that have not been demonstrated before. We also propose a novel layerwise distillation approach to further guide pruning. We evaluate MixedPruning extensively on the SQuAD and GLUE datasets and demonstrate its effectiveness and efficiency over state-of-the-art pruning and distillation methods. PDF 1 2022
Identifying the Source of Vulnerability in Fragile Interpretations: A Case Study in Neural Text Classification Prior works have mainly used input perturbation to test the stability of post-hoc interpretation methods, and have observed fragile interpretations. However, different works show conflicting results on the primary source of fragile interpretations, because input perturbation can affect both the model and the interpretation methods. Instead, this work proposes a simple output perturbation method that circumvents potential effects on the model by slightly modifying the prediction probability. We evaluate the proposed method using two popular post-hoc interpretation methods (LIME and Sample Shapley), with CNN, LSTM, and BERT as the neural classifiers. The results show that post-hoc methods produce only slightly different interpretations under output perturbation, suggesting that the black-box model is the primary source of fragile interpretations. PDF 1 2022
AcTune: Uncertainty-Aware Active Self-Training for Active Fine-Tuning of Pretrained Language Models Although fine-tuning pre-trained language models (PLMs) renders strong performance on many NLP tasks, it relies on large amounts of labeled data. Recently, researchers have resorted to active fine-tuning to enhance the label efficiency of PLM fine-tuning, but existing methods of this type usually ignore the potential of unlabeled data. We develop AcTune, a new framework that improves the label efficiency of active PLM fine-tuning by unleashing the power of unlabeled data via self-training. AcTune switches between data annotation and model self-training based on uncertainty: high-uncertainty unlabeled samples are selected for annotation, while those from low-uncertainty regions are used for model self-training. Additionally, we design (1) a region-aware sampling strategy to avoid redundant samples when querying annotations and (2) a momentum-based memory bank to dynamically aggregate the model's pseudo labels and suppress label noise in self-training. Experiments on 6 text classification datasets show that AcTune outperforms the strongest active learning and self-training baselines and improves the label efficiency of PLM fine-tuning by $56.2\%$ on average. PDF 1 2022
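The uncertainty-driven switching above can be pictured in a few lines of PyTorch. This is a simplified sketch only: entropy stands in as the uncertainty measure, and the names k_annotate and tau are hypothetical; it omits AcTune's region-aware sampling and memory bank.

import torch

def split_unlabeled_pool(probs, k_annotate, tau):
    # probs: [N, C] predicted class probabilities over the unlabeled pool.
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    to_annotate = entropy.argsort(descending=True)[:k_annotate]  # human labels
    confident = (entropy < tau).nonzero().squeeze(-1)            # self-training
    pseudo_labels = probs.argmax(dim=-1)
    return to_annotate, confident, pseudo_labels[confident]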
Dangling-Aware Entity Alignment with Mixed High-Order Proximities We study dangling-aware entity alignment in knowledge graphs (KGs), which is an underexplored but important problem. As different KGs are naturally constructed from different sets of entities, a KG commonly contains dangling entities that cannot find counterparts in other KGs. Dangling-aware entity alignment is therefore more realistic than conventional entity alignment, where prior studies simply ignore dangling entities. We propose a framework using mixed high-order proximities for dangling-aware entity alignment. Our framework utilizes both the local high-order proximity in a nearest-neighbor subgraph and the global high-order proximity in an embedding space for both dangling detection and entity alignment. Extensive experiments with two evaluation settings show that our method more precisely detects dangling entities and better aligns matchable entities. Further investigations demonstrate that our framework can mitigate the hubness problem in dangling-aware entity alignment. PDF 1 2022
Summarize and Generate to Back-translate: Unsupervised Translation of Programming Languages Back-translation is widely known for its effectiveness in neural machine translation when little to no parallel data is available. In this approach, a source-to-target model is coupled with a target-to-source model trained in parallel. While the target-to-source model generates noisy sources, the source-to-target model is trained to reconstruct the targets, and vice versa. Recent developments in multilingual pre-trained sequence-to-sequence models for programming languages have been very effective for a broad spectrum of downstream software engineering tasks. Therefore, it is compelling to train them to build programming language translation systems via back-translation. However, these models cannot be further trained via back-translation since they learn to output sequences in the same language as the inputs during pre-training. As an alternative, we suggest performing back-translation via code summarization and generation. In code summarization, a model learns to generate a natural language (NL) summary given a piece of code; in code generation, the model learns to do the opposite. Therefore, target-to-source generation in back-translation can be viewed as target-to-NL-to-source generation. We take advantage of labeled data for the code summarization task. We show that our proposed framework performs comparably to state-of-the-art methods, if not exceeding them, in translation between Java and Python. PDF 1 2022
Memformer: A Memory-Augmented Transformer for Sequence Modeling Transformers have achieved remarkable success in sequence modeling. However, these models have efficiency issues, as they need to store all the history token-level representations as memory. We present Memformer, an efficient neural network for sequence modeling that utilizes an external dynamic memory to encode and retrieve past information. Our model achieves linear time complexity and constant memory space complexity when processing long sequences. We also propose a new optimization scheme, memory replay back-propagation (MRBP), which promotes long-range back-propagation through time with a significantly reduced memory requirement. Experimental results show that Memformer achieves performance comparable to the baselines while using 8.1x less memory space and running 3.2x faster at inference. Analysis of the attention pattern shows that our external memory slots can encode and retain important information through timesteps. PDF 1 2022
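As an illustration of the external-memory idea (a sketch only; the module and parameter names are hypothetical, and this is not the exact Memformer architecture or its MRBP scheme), tokens can read from a fixed set of memory slots via cross-attention, and the slots are rewritten from the tokens before the next segment:

import torch
import torch.nn as nn

class MemorySlots(nn.Module):
    def __init__(self, n_slots=8, d=256, n_heads=4):
        super().__init__()
        self.init_mem = nn.Parameter(torch.randn(n_slots, d) * 0.02)
        self.read = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.write = nn.MultiheadAttention(d, n_heads, batch_first=True)

    def forward(self, x, mem=None):
        # x: [B, T, d] current segment; mem: [B, S, d] carried across segments.
        if mem is None:
            mem = self.init_mem.unsqueeze(0).expand(x.size(0), -1, -1)
        read_out, _ = self.read(x, mem, mem)   # tokens attend to memory slots
        new_mem, _ = self.write(mem, x, x)     # slots attend to tokens (write)
        # Detaching keeps memory cost constant across segments; MRBP is the
        # paper's more careful way of recovering long-range gradients.
        return x + read_out, new_mem.detach()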
TD-ConE: An Information-Theoretic Approach to Assessing Parallel Text Generation Data Existing data assessment methods are mainly designed for classification-based datasets and are of limited use for natural language generation (NLG) datasets. In this work, we focus on parallel NLG datasets and address this problem through an information-theoretic approach, TD-ConE, which assesses data uncertainty using input-output sequence mappings. Our experiments on text style transfer datasets demonstrate that the proposed simple method leads to better measurement of data uncertainty than some complicated alternatives and correlates highly with downstream model performance. As an extension of TD-ConE, we introduce TD-ConE_Rel to compute the relative uncertainty between two datasets. Our experiments with paraphrase generation datasets demonstrate that selecting data with lower TD-ConE_Rel scores leads to better model performance and decreased validation perplexity. PDF 1 2022
Retrieval Enhanced Data Augmentation for Question Answering on Privacy Policies Prior studies on privacy policies frame the question answering (QA) task as identifying the most relevant text segment or a list of sentences from the policy document for a user query. However, annotating such a dataset is hard as it requires specific domain expertise (e.g., law academics). Even if we manage to annotate a small-scale one, a remaining bottleneck is that the labeled data are heavily imbalanced (only a few segments are relevant), limiting the gains in this domain. Therefore, in this paper, we develop a novel data augmentation framework based on ensembling retriever models that capture the relevant text segments from unlabeled policy documents, expanding the positive examples in the training set. In addition, to improve the diversity and quality of the augmented data, we leverage multiple pre-trained language models (LMs) and cascade them with noise-reduction oracles. Using our augmented corpora on the PrivacyQA benchmark, we elevate the existing baseline by a large margin (11% F1) and achieve a new state-of-the-art F1 score of 50%. Our ablation studies provide further insights into the effectiveness of our approach. PDF 1 2022
Prompt Consistency for Zero-Shot Task Generalization One of the most impressive results in recent NLP history is the ability of pre-trained language models to solve new tasks in a zero-shot setting. To achieve this, NLP tasks are framed as natural language prompts, and the model generates a response indicating the predicted output. Nonetheless, the performance in such settings often lags far behind its supervised counterpart, suggesting a large space for potential improvement. In this paper, we explore methods to utilize unlabeled data to improve zero-shot performance. Specifically, we take advantage of the fact that multiple prompts can be used to specify a single task, and propose to regularize prompt consistency, encouraging consistent predictions over this diverse set of prompts. Our method makes it possible to fine-tune the model either with extra unlabeled training data, or directly on test input at inference time in an unsupervised manner. In experiments, our approach outperforms the state-of-the-art zero-shot learner, T0 (Sanh et al. 2021), on 9 out of 11 datasets across 4 NLP tasks by up to 10.6 absolute points in terms of accuracy. The gains are often attained with a small number of unlabeled examples. PDF 1 2022
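The regularizer can be sketched as pairwise agreement between the model's predictions for the same input under different prompts. A symmetric-KL sketch under stated assumptions (the paper's exact loss may differ):

import torch.nn.functional as F

def prompt_consistency_loss(logits_list):
    # logits_list: one [B, C] logits tensor per prompt, same inputs throughout.
    logps = [F.log_softmax(l, dim=-1) for l in logits_list]
    loss, n = 0.0, 0
    for i in range(len(logps)):
        for j in range(i + 1, len(logps)):
            # Symmetric KL between the two prompts' predictive distributions.
            loss += F.kl_div(logps[i], logps[j], log_target=True, reduction="batchmean")
            loss += F.kl_div(logps[j], logps[i], log_target=True, reduction="batchmean")
            n += 2
    return loss / max(n, 1)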
The Role of Context in Detecting Previously Fact-Checked Claims Recent years have seen the proliferation of disinformation and fake news online. Traditional proposals to mitigate these problems are manual and automatic fact-checking. Recently, another approach has emerged: checking whether the input claim has previously been fact-checked. This can be done automatically, and thus fast, while also offering credibility and explainability, thanks to the human fact-checking and explanations in the associated fact-checking article. Here we focus on claims made in a political debate, where context really matters. We study the impact of modeling the context of the claim, both on the source side, i.e., in the debate, and on the target side, i.e., in the fact-checking explanation document. We do this by modeling the local context, the global context, as well as by means of co-reference resolution and multi-hop reasoning over the sentences of the document describing the fact-checked claim. The experimental results show that each of these represents a valuable information source, but that modeling the source-side context is more important and can yield 10+ points of absolute improvement over a state-of-the-art model. PDF 1 2022
DialSummEval: Revisiting Summarization Evaluation for Dialogues Dialogue summarization is receiving increasing attention from researchers due to its extraordinary difficulty and unique application value. We observe that current dialogue summarization models have flaws that may not be well exposed by frequently used metrics such as ROUGE. In our paper, we re-evaluate 18 categories of metrics in terms of four dimensions: coherence, consistency, fluency, and relevance, and, for the first time, conduct a unified human evaluation of various models. Some noteworthy trends that differ from conventional summarization tasks are identified. We will release DialSummEval, a multi-faceted dataset of human judgments containing the outputs of 14 models on SAMSum. PDF 1 2022
FedNLP: Benchmarking Federated Learning Methods for Natural Language Processing Tasks Increasing concerns and regulations about data privacy and sparsity necessitate the study of privacy-preserving, decentralized learning methods for natural language processing (NLP) tasks. Federated learning (FL) provides promising approaches for a large number of clients (e.g., personal devices or organizations) to collaboratively learn a shared global model that benefits all clients while allowing users to keep their data locally. Despite interest in studying FL methods for NLP tasks, a systematic comparison and analysis is lacking in the literature. Herein, we present FedNLP, a benchmarking framework for evaluating federated learning methods on four different task formulations: text classification, sequence tagging, question answering, and seq2seq. We propose a universal interface between Transformer-based language models (e.g., BERT, BART) and FL methods (e.g., FedAvg, FedOPT, etc.) under various non-IID partitioning strategies. Our extensive experiments with FedNLP provide empirical comparisons between FL methods and help us better understand the inherent challenges of this direction. The comprehensive analysis points to intriguing and exciting future research aimed at developing FL methods for NLP tasks. PDF 1 2022
Opponent Modeling in Negotiation Dialogues by Related Data Adaptation Opponent modeling refers to the task of inferring another party's mental state within the context of non-collaborative social tasks. In a negotiation, it involves identifying the opponent’s priorities, which is crucial for finding high-value deals. Discovering these priorities is helpful for automated negotiation systems deployed in pedagogy and conversational AI. In this work, we propose a transformer-based ranker for identifying these priorities from negotiation dialogues. The model takes in a partial dialogue as input and predicts the priority order of the opponent. We further devise ways to adapt related data sources for this task to provide more explicit supervision for incorporating the opponent preferences and offers, as a proxy to relying on granular utterance-level annotations. We show the utility of our proposed approach through extensive experiments based on two dialogue datasets. We particularly find that the proposed data adaptations lead to strong performance in 0-shot and few-shot scenarios. Moreover, they allow the model to perform better with access to fewer utterances from the opponent. PDF 1 2022
KAT: A Knowledge Augmented Transformer for Vision-and-Language The primary focus of recent work with large-scale transformers has been on optimizing the amount of information packed into the model's parameters. In this work, we ask a complementary question: Can multimodal transformers leverage explicit knowledge in their reasoning? Existing, primarily unimodal, methods have explored approaches under the paradigm of knowledge retrieval followed by answer prediction, but leave open questions about the quality and relevance of the retrieved knowledge used, and how the reasoning processes over implicit and explicit knowledge should be integrated. To address these challenges, we propose the Knowledge Augmented Transformer (KAT), which achieves a strong state-of-the-art result (+6\% absolute) on the open-domain multimodal task of OK-VQA. Our approach integrates implicit and explicit knowledge in an encoder-decoder architecture, while still jointly reasoning over both knowledge sources during answer generation. Additionally, explicit knowledge integration improves the interpretability of model predictions in our analysis. PDF 1 2022
Penguins Don’t Fly: Reasoning about Generics through Instantiations and Exceptions Generic statements (e.g., Birds can fly) express generalizations about the world. However, generics are not universally true -- while sparrows and penguins are both birds, penguins can't fly. Understanding cases when a generic statement is true or false is crucial for machine reasoning. In this work, we present a novel framework to generate pragmatically relevant true and false instances of a generic.We use pre-trained language models, constraining the generation based on our computational framework, and produce ${\sim}20k$ \textsc{exemplars} for ${\sim}650$ generics. Our system outperforms few-shot generation from GPT-3 (by 12.5 precision points) and our analysis highlights the importance of constrained decoding for this task and the implications of generics \textsc{exemplars} for non-monotonic reasoning and NLI. PDF 1 2022
Explaining Toxic Text via Knowledge Enhanced Text Generation Warning: This paper contains content that is offensive and may be upsetting. Biased or toxic speech can be harmful to various demographic groups. Therefore, it is important for models not only to detect such speech, but also to output explanations of why a given text is toxic. Previous literature has mostly focused on classifying and detecting toxic speech, and existing efforts on explaining stereotypes in toxic speech mainly use standard text generation approaches, resulting in generic and repetitive explanations. Building on these prior works, we introduce a novel knowledge-informed encoder-decoder framework that utilizes multiple knowledge sources to generate implications of biased text. Experiments show that our knowledge-informed models significantly outperform prior state-of-the-art models, and can generate detailed explanations of stereotypes in toxic speech compared to baselines, both quantitatively and qualitatively. PDF 1 2022
PREME: Preference-based Meeting Exploration through an Interactive Questionnaire The recent increase in the volume of online meetings necessitates automated tools for managing and organizing the material, especially when an attendee has missed the discussion and needs assistance in quickly exploring it. In this work, we propose a novel end-to-end framework for generating interactive questionnaires for preference-based meeting exploration. As a result, users are supplied with a list of suggested questions reflecting their preferences. Since the task is new, we introduce an automatic evaluation strategy: it measures how answerable the generated questions are, to ensure factual correctness, and how well they cover the source meeting, to gauge the depth of possible exploration. PDF 1 2022
IDPG: An Instance-Dependent Prompt Generation Method Prompt tuning is a new, efficient NLP transfer learning paradigm that adds a task-specific prompt in each input instance during the model training stage. It freezes the pre-trained language model and only optimizes a few task-specific prompts. In this paper, we propose a conditional prompt generation method to generate prompts for each input instance, referred to as the Instance-Dependent Prompt Generation (IDPG). Unlike traditional prompt tuning methods that use a fixed prompt, IDPG introduces a lightweight and trainable component to generate prompts based on each input sentence. Extensive experiments on ten natural language understanding (NLU) tasks show that the proposed strategy consistently outperforms various prompt tuning baselines and is on par with other efficient transfer learning methods such as Compacter while tuning far fewer model parameters. PDF 1 2022
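A minimal sketch of the instance-dependent idea (class and parameter names hypothetical; IDPG's actual generator and insertion points may differ): a small bottleneck network maps a sentence representation to m soft prompt vectors, which are prepended to the input embeddings.

import torch
import torch.nn as nn

class InstancePromptGenerator(nn.Module):
    def __init__(self, d_model=768, m=5, bottleneck=64):
        super().__init__()
        self.m = m
        self.mlp = nn.Sequential(
            nn.Linear(d_model, bottleneck), nn.Tanh(),
            nn.Linear(bottleneck, m * d_model),
        )

    def forward(self, sent_repr, input_embeds):
        # sent_repr: [B, d] (e.g., a pooled encoding of the input sentence);
        # input_embeds: [B, T, d] token embeddings of the same input.
        prompts = self.mlp(sent_repr).view(-1, self.m, input_embeds.size(-1))
        return torch.cat([prompts, input_embeds], dim=1)  # [B, m+T, d]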
Reframing Human-AI Collaboration for Generating Free-Text Explanations Large language models are increasingly capable of generating fluent-appearing text with relatively little task-specific supervision. But can these models accurately explain classification decisions? We consider the task of generating free-text explanations using a small number of human-written examples (i.e., in a few-shot manner). We find that (1) higher-quality, human-authored prompts result in higher quality generations; and (2) surprisingly, in a head-to-head comparison, humans often prefer explanations generated by GPT-3 to crowdsourced explanations in existing datasets. Our human studies also show, however, that while models often produce factual, grammatical, and sufficient explanations, they have room to improve along axes such as providing novel information and supporting the label. We create a pipeline that combines GPT-3 with a supervised filter that incorporates binary acceptability judgments from humans in the loop. Despite significant subjectivity intrinsic to judging acceptability, our approach is able to consistently filter GPT-3 generated explanations deemed acceptable by humans. PDF 1 2022
Exposing the Limits of Video-Text Models through Contrast Sets Recent video-text models can retrieve relevant videos based on text with high accuracy, but to what extent do they comprehend the semantics of the text? Can they discriminate between similar entities and actions? To answer this, we propose an evaluation framework that probes video-text models with hard negatives. We automatically build contrast sets, where true textual descriptions are manipulated in ways that change their semantics while maintaining plausibility. Specifically, we leverage a pre-trained language model and a set of heuristics to create verb- and person-entity-focused contrast sets. We apply these in the multiple-choice video-to-text classification setting. We test the robustness of recent methods on the proposed automatic contrast sets and compare them to additionally collected human-generated counterparts to assess their effectiveness. We see that model performance suffers across all methods, erasing the gap between recent CLIP-based methods and earlier ones. PDF 1 2022
Template-free Prompt Tuning for Few-shot NER Prompt-based methods have been successfully applied to sentence-level few-shot learning tasks, mostly owing to the sophisticated design of templates and label words. However, when applied to token-level labeling tasks such as NER, it would be time-consuming to enumerate template queries over all potential entity spans. In this work, we propose a more elegant method that reformulates NER tasks as LM problems without any templates. Specifically, we discard the template construction process while maintaining the word prediction paradigm of pre-training models to predict a class-related pivot word (or label word) at the entity position. Meanwhile, we also explore principled ways to automatically search for appropriate label words that the pre-trained models can easily adapt to. While avoiding the complicated template-based process, the proposed LM objective also reduces the gap between the objectives used in pre-training and fine-tuning, and thus better benefits few-shot performance. Experimental results demonstrate the effectiveness of the proposed method over bert-tagger and template-based methods under few-shot settings. Moreover, the decoding speed of the proposed method is up to 1930.12 times faster than the template-based method. PDF 1 2022
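The word-prediction reformulation can be illustrated as scoring a small set of label words at every token position with a masked LM. A toy sketch: the label words below are invented, whereas the paper searches for them automatically and fine-tunes the model with this objective rather than using it off the shelf.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-cased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-cased").eval()

# Hypothetical label words; one pivot word per entity class plus "O".
label_ids = {lab: tok.convert_tokens_to_ids(w)
             for lab, w in {"PER": "person", "LOC": "place", "O": "the"}.items()}

def tag(sentence):
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        logits = mlm(**enc).logits[0]           # [T, vocab]
    tokens = tok.convert_ids_to_tokens(enc.input_ids[0])
    tags = []
    for pos in range(1, len(tokens) - 1):       # skip [CLS] and [SEP]
        scores = {lab: logits[pos, i].item() for lab, i in label_ids.items()}
        tags.append((tokens[pos], max(scores, key=scores.get)))
    return tags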
Perturbations in the Wild: Leveraging Human-Written Text Perturbations for Realistic Adversarial Attack and Defense We propose a novel algorithm, ANTHRO, that inductively extracts over 600K human-written text perturbations in the wild and leverages them for realistic adversarial attacks. Unlike existing character-based attacks, which often deductively hypothesize a set of manipulation strategies, our work is grounded in actual observations from real-world texts. We find that adversarial texts generated by ANTHRO achieve the best trade-off between (1) attack success rate, (2) semantic preservation of the original text, and (3) stealthiness, i.e., being indistinguishable from human writing and hence harder to flag as suspicious. Specifically, our attacks accomplished around 83% and 91% attack success rates on BERT and RoBERTa, respectively. Moreover, they outperformed the TextBugger baseline by 50% and 40% in terms of semantic preservation and stealthiness when evaluated by both lay and professional human workers. ANTHRO can further enhance a BERT classifier's performance in understanding different variations of human-written toxic texts via adversarial training, compared to the Perspective API. All source code will be released. PDF 1 2022
XLTime: A Cross-Lingual Knowledge Transfer Framework for Temporal Expression Extraction Temporal Expression Extraction (TEE) is essential for understanding time in natural language. It has applications in Natural Language Processing (NLP) tasks such as question answering, information retrieval, and causal inference. To date, work in this area has mostly focused on English, as there is a scarcity of labeled data for other languages. We propose XLTime, a novel framework for multilingual TEE. XLTime works on top of pre-trained language models and leverages multi-task learning to prompt cross-language knowledge transfer both from English and within the non-English languages. It alleviates problems caused by a shortage of data in the target language. We apply XLTime with different language models and show that it outperforms the previous automatic SOTA methods on French, Spanish, Portuguese, and Basque by large margins. It also considerably closes the gap to the handcrafted HeidelTime method. PDF 1 2022
Zero-shot Cross-lingual Transfer is Under-specified Optimization Pretrained multilingual encoders enable zero-shot cross-lingual transfer, but often produce unreliable models that exhibit high performance variance on the target language. We postulate that this high variance results from zero-shot cross-lingual transfer solving an under-specified optimization problem. We show that any linearly interpolated model between the source-language monolingual model and the source + target bilingual model has equally low source-language generalization error, yet the target-language generalization error reduces smoothly and linearly as we move from the monolingual to the bilingual model, suggesting that the model struggles to identify good solutions for both source and target languages using the source language alone. Additionally, we show that the zero-shot solution lies in a non-flat region of the target-language generalization-error surface, causing the high variance. PDF 1 2022
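The interpolation analysis is easy to reproduce in outline: average the two checkpoints' weights at a grid of mixing coefficients and evaluate each interpolated model on source and target dev sets. A minimal sketch, where evaluate() is a hypothetical stand-in for the task-specific evaluation:

def interpolate_state_dicts(sd_mono, sd_bi, alpha):
    # theta(alpha) = (1 - alpha) * monolingual + alpha * bilingual, per tensor.
    return {k: (1 - alpha) * sd_mono[k] + alpha * sd_bi[k] for k in sd_mono}

# for alpha in [i / 10 for i in range(11)]:
#     model.load_state_dict(interpolate_state_dicts(sd_mono, sd_bi, alpha))
#     print(alpha, evaluate(model, source_dev), evaluate(model, target_dev))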
DOCmT5: Document-Level Pre-training of Multilingual Language Models In this paper, we introduce DOCmT5, a multilingual sequence-to-sequence language model pre-trained with large-scale parallel documents. While previous approaches have focused on leveraging sentence-level parallel data, we try to build a general-purpose pre-trained model that can understand and generate long documents. We propose a simple and effective pre-training objective, Document reordering Machine Translation (DrMT), in which input documents that have been shuffled and masked need to be translated. DrMT brings consistent improvements over strong baselines on a variety of document-level generation tasks, including over 12 BLEU points for seen-language-pair document-level MT, over 7 BLEU points for unseen-language-pair document-level MT, and over 3 ROUGE-1 points for seen-language-pair cross-lingual summarization. We achieve state-of-the-art (SOTA) results on the WMT20 De-En and IWSLT15 Zh-En document translation tasks. We also conduct extensive analysis of various factors for document pre-training, including (1) the effects of pre-training data quality and (2) the effects of combining monolingual and cross-lingual pre-training. We plan to make our model checkpoints publicly available. PDF 1 2022
Meta-learning via Language Model In-context Tuning The goal of meta-learning is to learn to adapt to a new task with only a few labeled examples. Inspired by the recent progress in large language models, we propose $\textit{in-context tuning}$ (ICT), which recasts task adaptation and prediction as a simple sequence prediction problem: to form the input sequence, we concatenate the task instruction, labeled in-context examples, and the target input to predict; to meta-train the model to learn from in-context examples, we fine-tune a pre-trained language model (LM) to predict the target label given the input sequence on a collection of tasks. We benchmark our method on two collections of text classification tasks: LAMA and BinaryClfs. Compared to MAML, which adapts the model through gradient descent, our method leverages the inductive bias of pre-trained LMs to perform pattern matching, and outperforms MAML by an absolute $6\%$ average AUC-ROC score on BinaryClfs, gaining more advantage with increasing model size. Compared to non-fine-tuned in-context learning (i.e. prompting a raw LM), in-context tuning meta-trains the model to learn from in-context examples. On BinaryClfs, ICT improves the average AUC-ROC score by an absolute $10\%$, and reduces the variance due to example ordering by 6x and example choices by 2x. PDF 1 2022
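The sequence construction at the heart of in-context tuning is straightforward; a sketch with a hypothetical prompt format (the paper's exact templates differ):

def build_ict_input(instruction, examples, target_input):
    # Concatenate task instruction, labeled in-context examples, and the
    # target input; the LM is fine-tuned to emit the target label next.
    parts = [instruction]
    parts += [f"Input: {x} Label: {y}" for x, y in examples]
    parts.append(f"Input: {target_input} Label:")
    return "\n".join(parts)

print(build_ict_input(
    "Decide whether the review is positive or negative.",
    [("Great movie!", "positive"), ("Terrible plot.", "negative")],
    "I loved the soundtrack.",
))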
Exploring the Universal Vulnerability of Prompt-based Learning Paradigm The prompt-based learning paradigm bridges the gap between pre-training and fine-tuning, and works effectively under the few-shot setting. However, we find that this learning paradigm inherits vulnerability from the pre-training stage, where model predictions can be misled by inserting certain triggers into the text. In this paper, we explore this universal vulnerability by either injecting \textit{backdoor triggers} or searching for \textit{adversarial triggers} on pre-trained language models using only plain text. In both scenarios, we demonstrate that our triggers can completely control or severely degrade the performance of prompt-based models fine-tuned on arbitrary downstream tasks, reflecting the universal vulnerability of the prompt-based learning paradigm. Further experiments show that adversarial triggers transfer well among language models. We also find that conventional fine-tuning models are not vulnerable to adversarial triggers constructed from pre-trained language models. We conclude by proposing a potential solution to mitigate our attack methods. All the code and data will be made public. PDF 1 2022
Putting the Con in Context: Identifying Deceptive Actors in the Game of Mafia While neural networks demonstrate a remarkable ability to model linguistic content, capturing contextual information related to a speaker's conversational role is an open area of research. In this work, we analyze the effect of speaker role on language use through the game of Mafia, in which participants are assigned either an honest or a deceptive role. In addition to building a framework to collect a dataset of Mafia game records, we demonstrate that there are differences in the language produced by players with different roles. We confirm that classification models are able to rank deceptive players as more suspicious than honest ones based only on their use of language. Furthermore, we show that training models on two auxiliary tasks outperforms a standard BERT-based text classification approach. We also present methods for using our trained models to identify features that distinguish between player roles, which could be used to assist players during the Mafia game. PDF 1 2022
SEQZERO: Few-shot Compositional Semantic Parsing with Sequential Prompts and Zero-shot Models Recent research showed promising results on combining pretrained language models (LMs) with canonical utterances for few-shot semantic parsing. The canonical utterance is often lengthy and complex due to the compositional structure of formal languages, and learning to generate such canonical utterances requires a significant amount of data to reach high performance. Fine-tuned with only few-shot samples, the LMs can easily forget pretrained knowledge, overfit spurious biases, and suffer from compositionally out-of-distribution generalization errors. To tackle these issues, we propose a novel few-shot semantic parsing method, SEQZERO. SEQZERO decomposes the problem into a sequence of sub-problems, which correspond to the sub-clauses of the formal language. Based on the decomposition, the LMs only need to generate short answers using prompts for predicting sub-clauses. Thus, SEQZERO avoids generating a long canonical utterance at once. Moreover, SEQZERO employs not only a few-shot model but also a zero-shot model to alleviate the overfitting. In particular, SEQZERO brings out the merits of both models via an ensemble equipped with our proposed constrained rescaling. SEQZERO achieves SOTA performance on the GeoQuery dataset and a new EcommerceQuery dataset in the few-shot compositional generalization setting. PDF 1 2022
Do Pretrained Contextual Language Models Distinguish between Hebrew Homograph Analyses? Semitic morphologically-rich languages (MRLs) are plagued by word ambiguity; in a standard text, many (and often most) of the words will be homographs with multiple possible analyses. Previous research on MRLs claimed that standardly trained contextualized embeddings based on word-pieces do not sufficiently capture the internal structure of words with hugely ambiguous homographs. Taking Hebrew as a case study, we investigate the extent to which Hebrew homographs can be disambiguated using contextualized embeddings. We evaluate all existing models for contextualized Hebrew embeddings on 75 Hebrew homograph challenge sets. Our empirical results demonstrate that contemporary Hebrew contextualized embeddings outperform non-contextualized embeddings; they are most effective for disambiguation of segmentation and morphological features, less so regarding pure sense disambiguation. We show that these embeddings are more effective when the number of word-piece splits is limited, and they are more effective for 2-way and 3-way ambiguities than for 4-way ambiguity. We show that the embeddings are equally effective for homographs of both balanced and skewed distributions. Finally, we show that these embeddings are as effective for homograph disambiguation with extensive supervised training as with a few-shot setup. PDF 1 2022
Experience Affected in the Act of Remembering: A Study of Discursivity of Verb Tense Shifts in Memory Narrative This article contributes to the empirical understanding of the discursivity of verb morphology and verb tense shifts in memory narratives. Specifically, we explore how the 2016 presidential election result, as a historic and political event of the past decade, is recounted collectively through the lens of language use. In an online survey, 185 undergraduate students in the Computer Science department at the University of Georgia were asked to remember the day they learned about the 2016 presidential election results and write a narrative of their experience. The results from our analysis show a distinct correlation between the political leaning of the surveyed population and verb tense shifts in their stories. PDF 1 2022
SUBS: Subtree Substitution for Compositional Semantic Parsing Although sequence-to-sequence models often achieve good performance in semantic parsing for i.i.d. data, their performance is still inferior in compositional generalization. Several data augmentation methods have been proposed to alleviate this problem. However, prior work only leveraged superficial grammar or rules for data augmentation, resulting in limited improvement. We propose to use subtree substitution for compositional data augmentation, where we consider subtrees with similar semantic functions as exchangeable. Our experiments show that such augmented data leads to significantly better performance on SCAN and GeoQuery, and reaches a new SOTA on the compositional split of GeoQuery. PDF 1 2022
Parsing Natural Language into Propositional and First-Order Logic with Dual Reinforcement Learning Semantic parsing converts natural language paraphrases into structured logical expressions. In this paper, we consider two such formal representations: Propositional Logic (PL) and First-order Logic (FOL). Due to the insufficiency of annotated data in this field, we use dual reinforcement learning (RL) to make full use of labeled and unlabeled data. We further propose a brand-new reward mechanism that avoids the trouble of manually defining the reward in RL. To utilize the training data efficiently and make the learning process consistent with how humans learn, we integrate curriculum learning into our framework. Experimental results show that the proposed method outperforms competitors on different datasets. In addition to the technical contribution, we construct a Chinese-PL/FOL dataset to make up for the lack of data in this field. We aim to release our code as well as the dataset to aid further research in related tasks. PDF 1 2022
Hey AI, Can You Solve Complex Tasks by Talking to Agents? Training giant models from scratch for each complex task is resource- and data-inefficient. To help develop models that can leverage existing systems, we propose a new challenge: learning to solve complex tasks by communicating with existing agents (or models) in natural language. We design a synthetic benchmark, CommaQA, with three complex reasoning tasks (explicit, implicit, numeric) designed to be solved by communicating with existing QA agents, for instance, using text and table QA agents to answer questions such as "Who had the longest javelin throw from USA?". We show that black-box models struggle to learn this task from scratch (accuracy under 50\%) even with access to each agent's knowledge and gold facts supervision. In contrast, models that learn to communicate with agents outperform black-box models, reaching scores of 100\% when given gold decomposition supervision. However, we show that the challenge of learning to solve complex tasks by communicating with existing agents \emph{without relying on any auxiliary supervision or data} still remains highly elusive. We will release CommaQA, along with a compositional generalization test split, to advance research in this direction. PDF 1 2022
Improving Robustness in Multilingual Machine Translation via Data Augmentation Multilingual humans can and do seamlessly switch back and forth between languages when communicating. However, multilingual (machine) translation models are not robust to such sudden changes. In this work, we explore the robustness of multilingual MT models to language switching and propose checks to measure switching capability. We also investigate simple and effective data augmentation methods that can enhance robustness. A glass-box analysis of attention modules demonstrates the effectiveness of these methods in improving robustness. PDF 1 2022
Towards Coding Social Science Datasets with Language Models Researchers often rely on humans to code (label, annotate, etc.) large sets of texts. This is a highly variable task and requires a great deal of time and resources. Efforts to automate this process have achieved human-level accuracies in some cases, but often rely on thousands of hand-labeled training examples, which makes them inapplicable to small-scale research studies and still costly for large ones. At the same time, it is well known that language models can classify text; in this work, we use GPT-3 as a synthetic coder, and compare it to human coders using classic methodologies and metrics, such as intercoder reliability. We find that GPT-3 can match the performance of typical human coders and frequently outperforms them in terms of intercoder agreement across a variety of social science tasks, suggesting that language models could serve as useful coders. PDF 1 2022
Frustratingly Simple Regularization to Improve Zero-shot Cross-lingual Robustness Large-scale multilingual pretrained encoders, such as mBERT and XLM-R, have demonstrated impressive zero-shot cross-lingual transfer capability across multiple NLP tasks. However, as we show in this paper, these models suffer from two major problems: (1) degradation in zero-shot cross-lingual performance after fine-tuning on a single language, and (2) sensitivity of cross-lingual performance to fine-tuning hyperparameters. To address these issues, we evaluate two techniques during fine-tuning, Elastic Weight Consolidation (EWC) and L2-distance regularization, to help the multilingual models retain their cross-lingual ability after being fine-tuned on a single language. We compare the zero-shot cross-lingual performance of mBERT with and without regularization on four different tasks: XNLI, PANX, UDPOS, and PAWSX, and demonstrate that the model fine-tuned with L2-distance regularization outperforms its vanilla fine-tuned counterpart in the zero-shot setting across all tasks by up to 1.64%. Moreover, by fine-tuning mBERT with different hyperparameter settings on the specified tasks, we demonstrate that L2-distance regularization also makes fine-tuning more robust, reducing the standard deviation of zero-shot results by up to 87%. Based on our experiments, EWC does not provide consistent improvements across languages. To test whether additional constraints on the encoder parameters would improve the results further, we also compared L2-distance regularization with techniques that freeze most of the encoder parameters during fine-tuning, such as bitfit, soft prompting, and adapter-based methods; L2-distance regularization still performs the best. PDF 1 2022
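L2-distance regularization here means penalizing the distance between the fine-tuned and pretrained weights. A minimal sketch, where lam is a hypothetical regularization-strength hyperparameter:

def l2_to_pretrained(model, pretrained):
    # pretrained: {name: tensor} snapshot taken before fine-tuning starts.
    return sum(((p - pretrained[n]) ** 2).sum()
               for n, p in model.named_parameters() if p.requires_grad)

# snapshot = {n: p.detach().clone() for n, p in model.named_parameters()}
# loss = task_loss + lam * l2_to_pretrained(model, snapshot)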
Hallucinated but Factual! Inspecting the Factuality of Hallucinations in Abstractive Summarization State-of-the-art abstractive summarization systems often generate hallucinations; i.e., content that is not directly inferable from the source text. Despite being assumed incorrect, we find that much hallucinated content is factual, namely consistent with world knowledge. These factual hallucinations can be beneficial in a summary by providing useful background information. In this work, we propose a novel detection approach that separates factual from non-factual hallucinations of entities. Our method utilizes an entity's prior and posterior probabilities according to pre-trained and finetuned masked language models, respectively. Empirical results suggest that our approach vastly outperforms five baselines and strongly correlates with human judgments. Furthermore, we show that our detector, when used as a reward signal in an off-line reinforcement learning (RL) algorithm, significantly improves the factuality of summaries while maintaining the level of abstractiveness. PDF 1 2022
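The prior-probability half of this idea can be sketched with an off-the-shelf masked LM: mask the entity and read off its probability with no access to the source document. The model name, single-token entity, and decision rule below are illustrative assumptions; the posterior would come analogously from a source-conditioned, fine-tuned model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def entity_prior(text_with_mask, entity_token):
    """P(entity | context alone) for a single-token entity."""
    inputs = tok(text_with_mask, return_tensors="pt")
    mask_pos = (inputs.input_ids == tok.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = mlm(**inputs).logits[0, mask_pos]
    probs = torch.softmax(logits, dim=-1)
    return probs[tok.convert_tokens_to_ids(entity_token)].item()

prior = entity_prior("The summit was held in [MASK].", "geneva")
# The posterior is computed by a model fine-tuned to condition on the source;
# comparing the two suggests whether a hallucinated entity is consistent
# with world knowledge.
```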
EASE: Entity-Aware Contrastive Learning of Sentence Embedding We present EASE, a novel method for learning sentence embeddings via contrastive learning between sentences and their related entities. The advantage of using entity supervision is twofold: (1) entities have been shown to be a strong indicator of text semantics and thus should provide rich training signals for sentence embeddings; (2) entities are defined independently of languages and thus offer useful cross-lingual alignment supervision. We evaluate EASE against other unsupervised models both in monolingual and multilingual settings. We show that EASE exhibits competitive or better performance in English semantic textual similarity (STS) and short text clustering (STC) tasks and it significantly outperforms baseline methods in multilingual settings on a variety of tasks. Our EASE model and newly constructed multilingual STC dataset, MewsC-15, have been made publicly available to catalyze future research on sentence embeddings. PDF 1 2022
Self-Supervised Bot Play for Transcript-Free Conversational Recommendation with Justification Conversational recommender systems offer a way for users to engage in multi-turn conversations to find items they enjoy. Dialog agents for conversational recommendation rely on expensive human dialog transcripts, limiting their usage to domains where such data exists. We develop an alternative, two-part framework for training multi-turn conversational recommenders that accommodate a common paradigm of conversation: experts provide and justify suggestions, while users can critique and respond. We can thus adapt conversational recommendation to a wider range of domains where crowd-sourced ground truth dialogs are not available. First, we train a recommender system to jointly suggest items and justify its reasoning via subjective aspects. We then fine-tune this model to incorporate iterative user feedback via self-supervised bot-play. Experiments on three real-world datasets demonstrate that our system can be applied to different recommendation models across diverse domains to achieve state-of-the-art performance in multi-turn recommendation. Human studies show that systems trained with our framework provide more useful, helpful, and knowledgeable suggestions in warm- and cold-start settings. PDF 1 2022
On Measuring Social Biases in Prompt-Based Learning Large language models, trained on a mixture of NLP tasks converted into a text-to-text format using prompts, can generalize to novel forms of language and handle novel tasks. A large body of work within prompt engineering attempts to understand the effects of input forms and prompts in achieving superior performance. We consider an alternative measure and inquire whether the way in which an input is encoded affects \textit{social biases} promoted in outputs. In this paper, we study T0, a large-scale multi-task text-to-text language model trained using prompt-based learning. We consider two different forms of semantically equivalent inputs: \textit{question-answer} format and \textit{premise-hypothesis} format. We use an existing bias benchmark for the former, BBQ, and create the first bias benchmark in natural language inference, BBNLI, with hand-written hypotheses, while also converting each benchmark into the other form. The results on the two benchmarks suggest that, given two different formulations of essentially the same input, T0 acts conspicuously more biased in the question-answering form, which is seen during training, compared to the premise-hypothesis form, which is unlike its training examples. PDF 1 2022
How do QA models combine knowledge from LM and 100 passages? Retrieval-based generation models achieve high accuracy in open retrieval question answering by assessing rich knowledge sources --- multiple retrieved passages and parametric knowledge in the language model (LM). Yet, little is known about how they blend information stored in their LM parameters with that from retrieved evidence documents. We study this by simulating knowledge conflicts (i.e., where parametric knowledge suggests one answer and different passages suggest different answers). We find that retrieval performance largely decides which knowledge source models use, and a state-of-the-art model barely relies on parametric knowledge when given multiple passages. When presented with passages suggesting multiple answers, however, models use parametric knowledge to break the ties. We discover a troubling trend that contradictions in diverse knowledge sources affect model confidence only marginally. Together, our study helps interpret answers from these models and suggests directions for future work. PDF 1 2022
CofeNet: Context and Former-Label Enhanced Net for Complicated Quotation Extraction Quotation extraction aims to extract quotations from written text. There are three components in a quotation: source refers to the holder of the quotation, cue is the trigger word(s), and content is the main body. Existing solutions for quotation extraction mainly utilize rule-based approaches and sequence labeling models. While rule-based approaches often lead to low recalls, sequence labeling models cannot well handle quotations with complicated structures. In this paper, we propose the Context and Former-Label Enhanced Net (CofeNet) for quotation extraction. CofeNet is able to extract complicated quotations with components of variable lengths and complicated structures. On two public datasets (i.e., PolNeAR and Riqua) and one proprietary dataset (i.e., PoliticsZH), we show that our CofeNet achieves state-of-the-art performance on complicated quotation extraction. PDF 1 2022
One Step Is Enough for Few-Shot Cross-Lingual Transfer: Co-Training with Gradient Optimization The current state-of-the-art for few-shot cross-lingual transfer learning first trains on abundant labeled data in the source language and then fine-tunes with a few examples on the target language, termed target-adapting. Though this has been demonstrated to work on a variety of tasks, in this paper we show some deficiencies of this approach and propose a one-step co-training method that trains on both source and target data with stochastic gradient surgery, a novel gradient-level optimization. Unlike previous studies that focus on one language at a time when target-adapting, we use one model to handle all target languages simultaneously to avoid excessively language-specific models. Moreover, we discuss the impracticality of utilizing large target development sets for model selection, as done in previous literature, and further show that our method is development-free for target languages and avoids overfitting. We conduct a large-scale experiment on 4 diverse NLP tasks across up to 48 languages. Our proposed method achieves state-of-the-art performance on all tasks and outperforms target-adapting by a large margin, especially for languages that are linguistically distant from the source language, e.g., an average of 7.36% absolute F1 improvement on the NER task, up to a gain of 17.60% on Punjabi. PDF 1 2022
A Weakly Supervised Approach to Evaluating Single-Document Summarization via Negative Sampling Canonical automatic summary evaluation metrics, such as ROUGE, focus on lexical similarity, which cannot well capture semantics or linguistic quality, and require a reference summary that is costly to obtain. Recently, there have been a growing number of efforts to alleviate either or both of these two drawbacks. In this paper, we present a proof-of-concept study of a weakly supervised summary evaluation approach that requires no reference summaries. Massive data in existing summarization datasets are transformed for training via simple negative sampling methods. In cross-domain tests, our strategy outperforms baselines with promising improvements, and shows a great advantage in gauging linguistic quality over all metrics. We hope this study can inspire more research using similar strategies. Our code is at https://anonymous.4open.science/r/37CF. PDF 1 2022
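A minimal sketch of the negative-sampling idea follows, under the assumption that simple corruptions such as sentence deletion and reordering are used; the paper's exact transformations may differ.

```python
import random

def make_negatives(summary_sentences):
    """Degrade a reference-quality summary into 'inferior' training negatives."""
    negatives = []
    if len(summary_sentences) > 1:
        drop = random.randrange(len(summary_sentences))  # delete one sentence
        negatives.append([s for i, s in enumerate(summary_sentences) if i != drop])
    shuffled = summary_sentences[:]                      # break coherence
    random.shuffle(shuffled)
    negatives.append(shuffled)
    return negatives

# A scorer is then trained to rank the original summary above each negative.
```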
Interpretable Proof Generation via Iterative Backward Reasoning We present IBR, an Iterative Backward Reasoning model to solve the proof generation task on rule-based Question Answering (QA), where models are required to reason over a series of textual rules and facts to find the related proof path and derive the final answer. We address the limitations of existing works in two ways: 1) enhancing the interpretability of the reasoning procedure with detailed tracking, by predicting nodes and edges in the proof path iteratively backward from the question; 2) promoting efficiency and accuracy by reasoning on elaborate representations of nodes and history paths, without any intermediate texts that may introduce external noise during proof generation. There are three main modules in IBR: QA and proof strategy prediction, to obtain the answer and offer guidance for the following procedure; parent node prediction, to determine a node in the existing proof that a child node will link to; and child node prediction, to find out which new node will be added to the proof. Experiments on both synthetic and paraphrased datasets demonstrate that IBR has better in-domain performance as well as cross-domain transferability than state-of-the-art models. PDF 1 2022
Dynamic Relevance Graph Network for Knowledge-Aware Question Answering This work investigates the challenge of learning and reasoning for Commonsense Question Answering given an external source of knowledge in the form of a knowledge graph. We propose a novel graph neural network architecture, called dynamic relevance graph network (DRGN). DRGN operates on a given KG subgraph based on the question and answer entities and uses the relevance between the nodes to establish new edges dynamically for learning node representations in the graph network. Using the relevance between the graph nodes in learning representations helps the model to not only exploit the existing relationships in the KG subgraph but also recover missing edges. Moreover, our model improves the handling of negative questions by considering the relevance between the global question node and the graph entities. Our proposed approach shows competitive performance on two QA datasets with commonsense knowledge, CommonsenseQA and OpenbookQA, and improves the state-of-the-art published results. PDF 1 2022
Provably Confidential Language Modelling Large language models have been shown to memorize private information, such as social security numbers, in training data. Given the sheer scale of the training corpus, it is challenging to screen and filter this private data, either manually or automatically. In this paper, we propose Confidentially Redacted Training (CRT), a method to train language generation models while protecting the confidential segments. We borrow ideas from differential privacy (which solves a related but distinct problem) and show that our method is able to provably prevent unintended memorization by randomizing parts of the training process. Moreover, we show that redaction with an approximately correct screening policy amplifies the confidentiality guarantee. We implement the method for both LSTM and GPT language models. Our experimental results show that the models trained by CRT obtain almost the same perplexity while preserving strong confidentiality. PDF 1 2022
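One concrete way to avoid training on confidential segments, sketched here as an assumption rather than the paper's exact mechanism, is to zero out the LM loss on tokens that a screening policy flags as confidential:

```python
import torch
import torch.nn.functional as F

def lm_loss_with_redaction(logits, targets, confidential_mask):
    """confidential_mask: 1 where a token is flagged confidential, else 0.
    Flagged tokens contribute nothing to the training loss."""
    per_token = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        targets.view(-1),
        reduction="none",
    )
    keep = (1 - confidential_mask).float().view(-1)
    return (per_token * keep).sum() / keep.sum().clamp(min=1)
```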
The Aligned Multimodal Movie Treebank: An audio, video, dependency-parse treebank Treebanks have traditionally included only text and were derived from written sources such as newspapers or the web. We introduce the Aligned Multimodal Movie Treebank (AMMT), an English language treebank derived from naturalistic dialog in Hollywood movies which includes the source video and audio, transcriptions with word-level alignment to the audio stream, as well as part-of-speech tags and dependency parses in the Universal Dependencies formalism. AMMT consists of 31,264 sentences and 218,090 words, making it the third-largest UD English treebank and the only multimodal treebank in UD. To help with the web-based annotation effort, we also introduce the Efficient Audio Alignment Annotator (EAAA), a companion tool that enables annotators to significantly speed up the annotation process. PDF 1 2022
A Feasibility Study of Answer-Unaware Question Generation for Education We conduct a feasibility study into the applicability of \textit{answer-unaware} question generation models to textbook passages. We show that a significant portion of errors in such systems arise from asking irrelevant or un-interpretable questions and that such errors can be ameliorated by providing summarized input. We find that giving these models human-written summaries instead of the original text results in a significant increase in acceptability of generated questions (33\% -> 83\%) as determined by expert annotators. We also find that, in the absence of human-written summaries, automatic summarization can serve as a good middle ground. PDF 1 2022
Extract, Select and Rewrite: A New Modular Summarization Method Prior works on supervised summarization are mainly based on end-to-end models, leading to low modularity, unfaithfulness and low interpretability. To address this, we propose a new three-phase modular abstractive sentence summarization method. We split up the summarization problem explicitly into three stages, namely knowledge extraction, content selection and rewriting. We utilize multiple knowledge extractors to obtain relation triples from the text, learn a fine-tuned classifier to select content to be included in the summary and use a fine-tuned BART rewriter to rewrite the selected triples into a natural language summary. We find our model shows good modularity as the modules can be trained separately and on different datasets. The automatic and human evaluations demonstrate that our new method is competitive with state-of-the-art methods and more faithful than end-to-end baseline models. PDF 1 2022
Event Linking: Grounding Event Mentions to Wikipedia Comprehending an article requires understanding its constituent events. However, the context where an event is mentioned often lacks the details of this event. A question arises: how can the reader obtain more knowledge about this particular event in addition to what is provided by the local context in the article? This work defines Event Linking, a new natural language understanding task at the event level. Event linking tries to link an event mention appearing in an article to the most appropriate Wikipedia page. This page is expected to provide rich knowledge about what the event mention refers to. To standardize the research in this new direction, we first formally define the Event Linking task. Second, we collect a dataset for this new task. Specifically, we automatically gather a training set from Wikipedia, and then create two evaluation sets: one from the Wikipedia domain, reporting the in-domain performance, and a second from the real-world news domain, to evaluate out-of-domain performance. Third, we propose EveLINK, the first-ever event linking system. Overall, as our analysis shows, Event Linking is a considerably challenging task requiring more effort from the community. Data and code will be publicly released. PDF 1 2022
Target-Guided Dialogue Response Generation Using Commonsense and Data Augmentation Target-guided response generation enables dialogue systems to smoothly transition a conversation from a dialogue context toward a target sentence. Such control is useful for designing dialogue systems that direct a conversation toward specific goals, such as providing counselling and creating non-obtrusive recommendations. In this paper, we introduce a new technique for target-guided response generation, which first finds a bridging path of commonsense knowledge concepts between the source and the target, and then uses the identified bridging path to generate transition responses. Additionally, we propose techniques to re-purpose existing dialogue datasets for target-guided generation. Experiments reveal that the proposed techniques outperform various baselines on this task. Finally, we observe that the existing automated metrics for this task correlate poorly with human judgement ratings. We propose a novel evaluation metric that we demonstrate to be more reliable for target-guided response evaluation. Our work generally enables dialogue system designers to exercise more control over the conversations that their systems produce. PDF 1 2022
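The bridging-path step can be illustrated with a toy concept graph; the actual knowledge source and path-scoring details are not shown here, so the graph and edges below are invented.

```python
import networkx as nx

# Toy commonsense graph; a real system would use something like ConceptNet.
G = nx.Graph()
G.add_edges_from([
    ("movie", "popcorn"), ("popcorn", "snack"),
    ("snack", "healthy food"), ("healthy food", "exercise"),
])

def bridging_path(source_concept, target_concept):
    try:
        return nx.shortest_path(G, source_concept, target_concept)
    except nx.NetworkXNoPath:
        return None

print(bridging_path("movie", "exercise"))
# ['movie', 'popcorn', 'snack', 'healthy food', 'exercise'] -- each hop can
# anchor one transition response steering the dialogue toward the target.
```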
S5 Framework: A Review of Self-Supervised Shared Semantic Space Optimization for Multimodal Zero-Shot Learning In this review, we aim to inspire research into Self-Supervised Shared Semantic Space (S5) multimodal learning problems. We equip non-expert researchers with a framework of informed modeling decisions via an extensive literature review, an actionable modeling checklist, as well as a series of novel zero-shot evaluation tasks. The core idea for our S5 checklist lies in learning contextual multimodal interactions at various granularity levels via a shared Transformer encoder with a denoising loss term, which is also regularized by a contrastive loss term to induce a semantic alignment prior on the contextual embedding space. Essentially, we aim to model human concept understanding and thus learn to ``put a name to a face''. This ultimately enables interpretable zero-shot S5 generalization on a variety of novel downstream tasks. In summary, this review provides sufficient background and actionable strategies for training cutting-edge S5 multimodal networks. PDF 1 2022
Towards Computationally Feasible Deep Active Learning Active learning (AL) is a prominent technique for reducing the annotation effort required for training machine learning models. Deep learning offers a solution for several essential obstacles to deploying AL in practice but introduces many others. One such problem is the excessive computational resources required to train an acquisition model and estimate its uncertainty on instances in the unlabeled pool. We propose two techniques that tackle this issue for text classification and tagging tasks, offering a substantial reduction of AL iteration duration and the computational overhead introduced by deep acquisition models in AL. We also demonstrate that our algorithm that leverages pseudolabeling and distilled models overcomes one of the essential obstacles revealed previously in the literature. Namely, it was shown that due to differences between an acquisition model used to select instances during AL and a successor model trained on the labeled data, the benefits of AL can diminish. We show that our algorithm, despite using a smaller and faster acquisition model, is capable of training a more expressive successor model with higher performance. PDF 1 2022
Multilingual Event Linking to Wikidata We present a task of multilingual linking of events to a knowledge base. We automatically compile a large-scale dataset for this task, comprising 1.8M mentions across 44 languages referring to over 10.9K events from Wikidata. We propose two variants of the event linking task: 1) multilingual, where event descriptions are from the same language as the mention, and 2) crosslingual, where all event descriptions are in English. On the two proposed tasks, we compare multiple event linking systems including BM25+ (Lv and Zhai, 2011) and multilingual adaptations of the biencoder and crossencoder architectures from BLINK (Wu et al., 2020). In our experiments on the two task variants, we find that both biencoder and crossencoder models significantly outperform the BM25+ baseline. Our results also indicate that the crosslingual task is in general more challenging than the multilingual task. We also present a qualitative analysis highlighting various aspects captured by the proposed dataset, including the need for temporal reasoning over context and tackling diverse event descriptions across languages. PDF 1 2022
AutoLEX: An Automatic Framework for Linguistic Exploration Each language has its own complex systems of word, phrase, and sentence construction, the guiding principles of which are often summarized in grammar descriptions for the consumption of linguists or language learners. However, manual creation of such descriptions is a fraught process, as creating descriptions which describe the language in "its own terms" without bias or error requires both a deep understanding of the language at hand and linguistics as a whole. We propose an automatic framework AutoLEX that aims to ease linguists' discovery and extraction of concise descriptions of linguistic phenomena. Specifically, we apply this framework to extract descriptions for three phenomena: morphological agreement, case marking, and word order, across several languages. We evaluate the descriptions with the help of language experts and propose a method for automated evaluation when human evaluation is infeasible. PDF 1 2022
Balanced Adversarial Training: Balancing Tradeoffs Between Oversensitivity and Undersensitivity in NLP Models Traditional (\emph{oversensitive}) adversarial examples involve finding a small perturbation that does not change an input's true label but confuses the classifier into outputting a different prediction. \emph{Undersensitive} adversarial examples are the opposite---the adversary's goal is to find a small perturbation that changes the true label of an input while preserving the classifier's prediction. Adversarial training and certified robust training have shown some effectiveness in improving the robustness of machine-learned models to oversensitive adversarial examples. However, recent work has shown that using these techniques to improve robustness for image classifiers may make a model more vulnerable to undersensitive adversarial examples. We demonstrate the same phenomenon applies to NLP models, showing that training methods that improve robustness to synonym-based attacks (oversensitive adversarial examples) tend to increase a model's vulnerability to antonym-based attacks (undersensitive adversarial examples) for both natural language inference and paraphrase identification tasks. To counter this phenomenon, we introduce \textit{Balanced Adversarial Training} which incorporates contrastive learning to increase robustness against both over- and undersensitive adversarial examples. PDF 1 2022
DECK: Behavioral Tests to Improve Interpretability and Generalizability of BERT Models Detecting Depression from Text Models that accurately detect depression from text are important tools for addressing the post-pandemic mental health crisis. BERT-based classifiers' promising performance and their off-the-shelf availability make them great candidates for this task. However, these models are known to suffer from performance inconsistencies and poor generalization. In this paper, we introduce DECK (\textbf{DE}pression \textbf{C}hec\textbf{K}list), depression-specific model behavioral tests that allow better interpretability and improve generalizability of BERT classifiers in the depression domain. We create 23 tests to evaluate BERT, RoBERTa, and ALBERT depression classifiers on three datasets, two Twitter-based and one clinical interview-based. Our evaluation shows that these models: 1) are robust to certain gender-sensitive variations in text; 2) rely on the increased use of first-person pronouns, an important depressive language marker; 3) fail to detect some other depression symptoms, like suicidal ideation. We also demonstrate that DECK tests can be used to incorporate symptom-specific information in the training data and consistently improve generalizability of all three BERT models, with an out-of-distribution F1-score increase of up to 53.93\%. The DECK tests, together with the associated code, are available for download at https://github.com/Anonymous. PDF 1 2022
{BlonDe}: An Automatic Evaluation Metric for Document-level Machine Translation Standard automatic metrics, e.g. BLEU, are not reliable for document-level MT evaluation. They can neither distinguish document-level improvements in translation quality from sentence-level ones, nor identify the discourse phenomena that cause context-agnostic translations. This paper introduces a novel automatic metric BlonDe to widen the scope of automatic MT evaluation from sentence to document level. BlonDe takes discourse coherence into consideration by categorizing discourse-related spans and calculating the similarity-based F1 measure of categorized spans. We conduct extensive comparisons on a newly constructed dataset BWB. The experimental results show that BlonDe possesses better selectivity and interpretability at the document level, and is more sensitive to document-level nuances. In a large-scale human study, BlonDe also achieves significantly higher Pearson's r correlation with human judgments compared to previous metrics. PDF 1 2022
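The span-category F1 at the heart of such a metric can be sketched as follows; the categories, matching rule, and weighting here are simplified assumptions, not BlonDe itself.

```python
from collections import Counter

def category_f1(hyp_spans, ref_spans):
    """hyp_spans / ref_spans: lists of (category, text) pairs, e.g.
    ("pronoun", "she") or ("entity", "Paris")."""
    hyp, ref = Counter(hyp_spans), Counter(ref_spans)
    overlap = sum((hyp & ref).values())   # spans matched in both category and text
    precision = overlap / max(sum(hyp.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(category_f1([("pronoun", "she"), ("entity", "Paris")],
                  [("pronoun", "she"), ("entity", "London")]))  # 0.5
```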
Faithful and Plausible Explanations of Medical Code Predictions Machine learning models that offer excellent predictive performance often lack the interpretability necessary to support integrated human-machine decision-making. In clinical medicine and other high-risk settings, domain experts may be unwilling to trust model predictions without explanations. Work in explainable AI must balance competing objectives along two different axes: 1) models should ideally be both accurate and simple; 2) explanations must balance faithfulness to the model's decision-making with their plausibility to a domain expert. We propose to train a proxy model that mimics the behavior of a trained model and provides control over these trade-offs. We evaluate our approach on the task of assigning ICD codes to clinical notes to demonstrate that the proxy model is faithful to the trained model's behavior and produces quality explanations. PDF 1 2022
Impart Contextualization to Static Word Embeddings through Semantic Relations Dense word embeddings are the foundation of downstream NLP research. They encode the meanings of words into low-dimensional vector spaces. Recent models with state-of-the-art performance mostly adopt contextualized word embeddings, which can distinguish the various meanings of words by their dynamic context. To impart contextual information to static word embeddings, we formulate 3 semantic relations: the interchangeable, opposite, and relative relations, used to find a subset of dimensions for interpreting a specific context. Experiments show that these relations can be mined from fastText embeddings. PDF 1 2022
Improving Mental Health Classifier Generalization with Pre-Diagnosis Data Recent work has shown that classifiers for depression detection often fail to generalize to new datasets. Most NLP models for this task are built on datasets that use textual reports of a depression diagnosis (e.g., statements on social media) to identify diagnosed users; this approach allows for collection of large-scale datasets, but means that classifiers suffer from a self-report bias. Notably, models tend to capture features that typify direct discussion of mental health rather than more subtle indications of depression symptoms. In this paper, we explore the hypothesis that building classifiers using exclusively social media posts from before a user's diagnosis will lead to less reliance on shortcuts and better generalization. We test our classifiers on a dataset that is based on an external survey rather than textual self-reports, and find that using pre-diagnosis data for training yields improved performance. PDF 1 2022
A Family of Cognitively Realistic Parsing Environments for Deep Reinforcement Learning The hierarchical syntactic structure of natural language is a key feature of human cognition that enables us to recursively construct arbitrarily long sentences supporting communication of complex, relational information. In this work, we describe a framework in which learning cognitively-realistic left-corner parsers can be formalized as a Reinforcement Learning problem, and introduce a family of cognitively realistic chart-parsing environments to evaluate potential psycholinguistic implications of RL algorithms. We report how several baseline Q-learning and Actor Critic algorithms, both tabular and neural, perform on subsets of the Penn Treebank corpus. We observe a sharp increase in difficulty as parse trees get slightly more complex, indicating that hierarchical reinforcement learning might be required to solve this family of environments. PDF 1 2022
Can Rationalization Improve Robustness? A growing line of work has investigated the development of neural NLP models that can produce rationales--subsets of input that can explain their model predictions. In this paper, we ask whether such rationale models can also provide robustness to adversarial attacks in addition to their interpretable nature. Since these models need to first generate rationales (``rationalizer'') before making predictions (``predictor''), they have the potential to ignore noise or adversarially added text by simply masking it out of the generated rationale. To this end, we systematically generate various types of `AddText' attacks for both token and sentence-level rationalization tasks and perform an extensive empirical evaluation of state-of-the-art rationale models across five different tasks. Our experiments reveal that rationale models show promise in improving robustness against AddText attacks, though they struggle in certain scenarios--when the rationalizer is sensitive to position bias or lexical choices of the attack text. Further, leveraging human rationales as supervision does not always translate to better performance. Our study is a first step towards exploring the interplay between interpretability and robustness in the rationalize-then-predict framework. PDF 1 2022
Searching for Effective Multilingual Fine-Tuning Methods: A Case Study in Summarization Recently, a large number of tuning strategies have been proposed to adapt pre-trained language models to downstream tasks. In this paper, we perform an extensive empirical evaluation of various tuning strategies for multilingual learning, particularly in the context of text summarization. Specifically, we explore the relative advantages of three families of multilingual tuning strategies (a total of five models) and empirically evaluate them for summarization over 45 languages. Experimentally, we not only establish a new state-of-the-art on the XL-Sum dataset but also derive a series of observations that can hopefully provide hints for future research on the design of multilingual tuning strategies. PDF 1 2022
StoryQA: Story Grounded Question Answering Dataset The abundance of benchmark datasets supports the recent trend of increased attention given to Question Answering (QA) tasks. However, most of them lack a diverse selection of QA types and more challenging questions. In this work, we present StoryQA, a new task and dataset addressing diverse QA problems for both in-context and out-of-context questions. Additionally, we developed QA models based on large pretrained language models. Our experiments on the new dataset show that our model produces answers of quality comparable to those provided by humans. The resources in this work will be released to foster future research. PDF 1 2022
Modular and Parameter-Efficient Multimodal Fusion with Prompting Recent research has made impressive progress in large-scale multimodal pre-training. In the context of the rapid growth of model size, it is necessary to seek efficient and flexible methods other than fine-tuning. In this paper, we propose to use prompt vectors to align the modalities. We achieve comparable performance to several other multimodal fusion methods in low-resource settings, showing that this approach is modular and parameter-efficient for processing tasks that involve two or more modalities. PDF 1 2022
Novel Chapter Abstractive Summarization using Spinal Tree Aware Sub-Sentential Content Selection Summarizing novel chapters is a difficult task due to the length of the chapter to be summarized and the fact that summary sentences draw content from multiple sentences in the chapter. We present a pipelined extractive-abstractive approach where the extractive step filters the content that is passed to the abstractive component. Extremely lengthy input also results in a dataset highly skewed towards negative instances and we thus adopt a margin ranking loss for extraction to encourage separation between positive and negative input. To generate summary sentences that fuse information from different sentences, our extraction component operates at the constituent level; our novel approach to this problem enriches the text with spinal tree information which provides context to the extraction model. We show an improvement of 3.71 Rouge-1 points over the state-of-the-art on an existing novel chapter dataset. PDF 1 2022
PrefScore: Pairwise Preference Learning for Reference-free Single-document Summarization Quality Assessment Evaluating machine-generated summaries without a human-written reference summary has long been a need. Inspired by preference labeling in existing works on summarization evaluation, we propose to judge summary quality by learning the preference rank of summaries, using the Bradley-Terry power ranking model on inferior summaries generated from a base summary. Despite the simplicity of our method, extensive experiments on several datasets show that our weakly supervised scheme can produce scores that correlate highly with human ratings. PDF 1 2022
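The Bradley-Terry objective itself is compact; the sketch below shows only the loss shape, with the summary scorer left abstract (its architecture is not specified here).

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(score_better, score_worse):
    """Maximize P(better preferred over worse) = sigmoid(s_better - s_worse)."""
    return -F.logsigmoid(score_better - score_worse).mean()

s_base = torch.tensor([1.7, 0.9])      # scorer outputs for base summaries
s_inferior = torch.tensor([0.4, 1.1])  # scorer outputs for degraded summaries
print(bradley_terry_loss(s_base, s_inferior))
```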
Co-training an Unsupervised Constituency Parser with Weak Supervision We introduce a method for unsupervised parsing that relies on bootstrapping classifiers to identify if a node dominates a specific span in a sentence. There are two types of classifiers, an inside classifier that acts on a span, and an outside classifier that acts on everything outside of a given span. Through self-training and co-training with the two classifiers, we show that the interplay between them helps improve the accuracy of both, and as a result, effectively parse. A seed bootstrapping technique prepares the data to train these classifiers. Our analyses further validate that such an approach in conjunction with weak supervision using prior branching knowledge of a known language (left/right-branching) and minimal heuristics injects strong inductive bias into the parser, achieving 63.1 F$_1$ on the English (PTB) test set. In addition, we show the effectiveness of our architecture by evaluating on treebanks for Chinese (CTB) and Japanese (KTB) and achieve new state-of-the-art results.\footnote{For code or data, please contact the authors.} PDF 1 2022
DEEP: DEnoising Entity Pre-training for Neural Machine Translation It has been shown that machine translation models usually generate poor translations for named entities that are infrequent in the training corpus. Earlier named entity translation methods mainly focus on phonetic transliteration, which ignores the sentence context for translation and is limited in domain and language coverage. To address this limitation, we propose DEEP, a DEnoising Entity Pre-training method that leverages large amounts of monolingual data and a knowledge base to improve named entity translation accuracy within sentences. Besides, we investigate a multi-task learning strategy that finetunes a pre-trained neural machine translation model on both entity-augmented monolingual data and parallel data to further improve entity translation. Experimental results on three language pairs demonstrate that DEEP results in significant improvements over strong denoising auto-encoding baselines, with a gain of up to 1.3 BLEU and up to 9.2 entity accuracy points for English-Russian translation. PDF 1 2022
The Case for a Single Model that can Both Generate Continuations and Fill-in-the-Blank The task of inserting text into a specified position in a passage, known as fill-in-the-blank (FitB), is useful for a variety of applications where writers interact with a natural language generation (NLG) system to craft text. While previous work has tackled this problem with models trained specifically to do fill-in-the-blank, a more useful model is one that can effectively perform _both_ FitB and continuation tasks. In this work, we evaluate the feasibility of using a single model to do both tasks. We show that models pre-trained with a FitB-style objective are capable of both tasks, while models pre-trained for continuation are not. Finally, we show how these models can be easily finetuned to allow for fine-grained control over the length and word choice of the generation. PDF 1 2022
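A sentinel-based input format is one simple way to make a single text-to-text model handle both tasks; the token name and formatting below are assumptions for illustration.

```python
def make_fitb_example(prefix, suffix, target):
    """Continuation is just the special case where the blank is at the end."""
    source = f"{prefix} <blank> {suffix}".strip()
    return source, target

# Fill-in-the-blank:
print(make_fitb_example("The hikers reached the", "just before sunset.", "summit"))
# Continuation (empty suffix):
print(make_fitb_example("The hikers reached the summit", "", "just before sunset."))
```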
Efficient layout-aware pretraining for multimodal form understanding Layout-aware language models have been used to create multimodal representations for documents that are in image form, achieving relatively high accuracy in document understanding tasks. However, the large number of parameters in the resulting models makes building and using them prohibitive without access to high-performing processing units with large memory capacity. We propose an alternative approach that can create efficient representations without the need for a neural visual backbone. This leads to an 80% reduction in the number of parameters compared to the smallest SOTA model, widely expanding applicability. Despite using 2.5% of training data, we show competitive performance on two form understanding tasks: semantic labeling and link prediction. PDF 1 2022
Small Changes Make Big Differences: Improving Multi-turn Response Selection in Dialogue Systems via Fine-Grained Contrastive Learning Retrieval-based dialogue response selection aims to find a proper response from a candidate set given a multi-turn context. Methods based on pre-trained language models (PLMs) have yielded significant improvements on this task. The sequence representation plays a key role in learning the degree of matching between the dialogue context and the response. However, we observe that different context-response pairs sharing the same context always have a greater similarity in the sequence representations calculated by PLMs, which makes it hard to distinguish positive responses from negative ones. Motivated by this, we propose a novel Fine-Grained Contrastive (FGC) learning method for the response selection task based on PLMs. This FGC learning strategy helps PLMs generate more distinguishable matching representations of each dialogue at a fine granularity, and thus make better predictions when choosing positive responses. Empirical studies on two benchmark datasets demonstrate that the proposed FGC learning method can generally and significantly improve the model performance of existing PLM-based matching models. PDF 1 2022
Modeling Multi-Granularity Hierarchical Features for Relation Extraction Relation extraction is a key task in Natural Language Processing (NLP), which aims to extract relations between entity pairs from given texts. Recently, relation extraction (RE) has achieved remarkable progress with the development of deep neural networks. Most existing research focuses on constructing explicit structured features using external knowledge such as knowledge graphs and dependency trees. In this paper, we propose a novel method to extract multi-granularity features based solely on the original input sentences. We show that effective structured features can be attained even without external knowledge. Three kinds of features based on the input sentences are fully exploited, at the entity mention level, segment level, and sentence level; all three are jointly and hierarchically modeled. We evaluate our method on three public benchmarks: SemEval 2010 Task 8, TACRED, and TACRED Revisited. To verify the effectiveness, we apply our method to different encoders such as LSTM and BERT. Experimental results show that our method significantly outperforms existing state-of-the-art models that even use external knowledge. Extensive analyses demonstrate that the performance of our model derives from the capture of multi-granularity features and the modeling of their hierarchical structure. PDF 1 2022
What use can and should ACL researchers make of the Cambridge Grammar of the English Language? The Cambridge Grammar of the English Language (henceforth H&P) provides an 1,842-page description of the grammar of English. We analysed the top 75 citations to this grammar in the ACL Anthology. The community has indeed produced work that is strongly influenced by H&P, especially in linguistically challenging areas such as deixis, anaphora and negation. To illustrate the potential of H&P as source material for linguistically informed error analysis in a conceptually complex domain, we extract the examples from chapter 17 (by Sterling and Huddleston), which deals with deixis and anaphora. We show how a representative modern co-reference engine (Stanford's) handles these examples. Since every example in H&P is chosen to illustrate a point about English, and the authors provide text explaining the importance of the point, the error analyst has immediate access to a good proxy for relevant linguistic expertise. PDF 1 2022
Entity Linking via Explicit Mention-Mention Coreference Modeling Learning representations of entity mentions is a core component of modern entity linking systems for both candidate generation and making linking predictions. In this paper, we present and empirically analyze a novel training approach for learning mention and entity representations that is based on building minimum spanning arborescences (i.e., directed spanning trees) over mentions and entities across documents to explicitly model mention coreference relationships. We demonstrate the efficacy of our approach by showing significant improvements in both candidate generation recall and linking accuracy on the Zero-Shot Entity Linking dataset and MedMentions, the largest publicly available biomedical dataset. In addition, we show that our improvements in candidate generation yield higher quality re-ranking models downstream, setting a new SOTA result in linking accuracy on MedMentions. We further demonstrate that our improved mention representations are effective for the discovery of new entities via cross-document coreference. PDF 1 2022
Can Language Models Take A Hint? Prompting for Controllable Contextualized Commonsense Inference Generating commonsense assertions, given a certain story context, is a tough challenge even for modern language models. One of the reasons for this may be that the model has to "guess" what topic or entity in a story to generate an assertion about. Prior work has tackled part of the problem, by providing techniques to align commonsense inferences with stories and training language generation models on these. However, none of the prior work provides means to control the parts of a generated assertion. In this work, we present "hinting", a data augmentation technique for improving inference of contextualized commonsense assertions. Hinting is a prefix prompting strategy that uses both hard and soft prompts. We demonstrate the effectiveness of hinting by showcasing its effect on two contextual commonsense inference datasets: ParaCOMET (Gabriel et al., 2021) and GLUCOSE (Mostafazadeh et al., 2020), for both general and context-specific inference. PDF 1 2022
Zero-Shot On-the-Fly Event Schema Induction What are the events involved in a pandemic outbreak? What steps should be taken when planning a wedding? The answers to these questions can be found by collecting many documents on the complex event of interest, extracting relevant information and analyzing it. We present a new approach in which large language models are utilized to generate source documents that allow predicting, given a high-level event definition, the specific events, arguments, and relations between them to construct a schema that describes the complex event in its entirety. Using our model, complete schemas on any topic can be generated on-the-fly without any data collection needed, i.e., in a zero-shot manner. Moreover, we develop efficient methods to extract pertinent information from texts and demonstrate, in a series of experiments, that these schemas are considered to be more complete than human-curated ones in the majority of examined scenarios. Finally, we show that this framework is comparable in performance with previous supervised schema induction methods that rely on collecting real texts while being more general and flexible by avoiding the need to use a predefined ontology. PDF 1 2022
Quality-Aware Decoding for Neural Machine Translation Despite the progress in machine translation quality estimation and evaluation in the last years, decoding in neural machine translation (NMT) is mostly oblivious to this and centers around finding the most probable translation according to the model (MAP decoding), approximated with beam search. In this paper, we bring together these two lines of research and propose \emph{quality-aware decoding} for NMT, by leveraging recent breakthroughs in reference-free and reference-based MT evaluation through various inference methods like $N$-best reranking and minimum Bayes risk decoding. We perform an extensive comparison of various possible candidate generation and ranking methods across four datasets and two model classes and find that quality-aware decoding consistently outperforms MAP-based decoding according both to state-of-the-art automatic metrics (COMET and BLEURT) and to human assessments. PDF 1 2022
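Minimum Bayes risk decoding, one of the inference methods mentioned above, picks the candidate with the highest expected utility against the other candidates. The sketch below uses a toy token-overlap utility as a stand-in for a learned metric such as COMET.

```python
def mbr_decode(candidates, utility):
    """Return the candidate maximizing expected utility over the candidate pool."""
    def expected_utility(cand):
        others = [o for o in candidates if o is not cand]
        return sum(utility(cand, o) for o in others) / max(len(others), 1)
    return max(candidates, key=expected_utility)

def token_overlap(hyp, ref):  # toy utility; a real system would use COMET etc.
    h, r = set(hyp.split()), set(ref.split())
    return len(h & r) / max(len(h | r), 1)

cands = ["the cat sat", "a cat sat down", "the cat sat down"]
print(mbr_decode(cands, token_overlap))  # -> "the cat sat down"
```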
Learning To Retrieve Prompts for In-Context Learning In-context learning is a recent paradigm in natural language understanding, where a large pre-trained language model (LM) observes a test instance and a few training examples as its input, and directly decodes the output without any update to its parameters. However, performance has been shown to strongly depend on the selected training examples (termed prompts). In this work, we propose an efficient method for retrieving prompts for in-context learning using annotated data and an LM. Given an input-output pair, we estimate the probability of the output given the input and a candidate training example as the prompt, and label training examples as positive or negative based on this probability. We then train an efficient dense retriever from this data, which is used to retrieve training examples as prompts at test time. We evaluate our approach on three sequence-to-sequence tasks where language utterances are mapped to meaning representations, and find that it substantially outperforms prior work and multiple baselines across the board. PDF 1 2022
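The labeling step can be summarized schematically: score each candidate training example by the LM probability of the gold output when the candidate is used as the prompt, then take the top-scoring candidates as positives for retriever training. `lm_logprob` below is a placeholder for whatever scoring LM is used, and the prompt formatting is an assumption.

```python
def label_prompt_candidates(lm_logprob, input_text, gold_output, candidates, k=3):
    """Rank candidates by how much they help the LM produce the gold output."""
    scored = sorted(
        candidates,
        key=lambda cand: lm_logprob(prompt=f"{cand}\n{input_text}",
                                    target=gold_output),
        reverse=True,
    )
    positives, negatives = scored[:k], scored[-k:]
    return positives, negatives
```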
CONFIT: Toward Faithful Dialogue Summarization with Linguistically-Informed Contrastive Fine-tuning Factual inconsistencies in generated summaries severely limit the practical applications of abstractive dialogue summarization. Although significant progress has been achieved by using pre-trained neural language models, substantial amounts of hallucinated content are found during the human evaluation. In this work, we first devise a typology of factual errors to better understand the types of hallucinations generated by current models and conduct a human evaluation on a popular dialogue summarization dataset. We further propose a training strategy that improves the factual consistency and overall quality of summaries via a novel contrastive fine-tuning, called CONFIT. To tackle the top factual errors from our annotation, we introduce an additional contrastive loss with carefully designed hard negative samples and a self-supervised dialogue-specific loss to capture the key information between speakers. We show that our model significantly reduces all kinds of factual errors on both SAMSum dialogue summarization and AMI meeting summarization. On both datasets, we achieve significant improvements over state-of-the-art baselines using both automatic metrics, ROUGE and BARTScore, and human evaluation. PDF 1 2022
The Curious Case of Control Children acquiring English make systematic errors on subject control sentences (Chomsky, 1969), possibly due to heuristics based on semantic roles (Maratsos, 1974). Given the advanced fluency of large generative language models, we ask what kinds of generalizations these models make on object and subject control clauses. We find broad differences between models, with many models adopting positional heuristics that succeed on subject control but fail on object control. This result is surprising, given that object control is orders of magnitude more frequent in text data. PDF 1 2022
A Well-Composed Text is Half Done! Semantic Composition Sampling for Diverse Conditional Generation We propose Composition Sampling, a simple but effective method to generate higher quality diverse outputs for conditional generation tasks, compared to previous stochastic decoding strategies. It builds on recently proposed planning-based neural generation models that are trained to first create a composition of the output using an entity chain and then continue to generate conditioned on the entity chain and the input (Narayan et al., 2021). Our approach avoids text degeneration by first sampling a composition in the form of an entity chain and then using beam search to generate the best possible text grounded to the entity chain. Experiments on summarization (CNN/DailyMail and XSum) and SQuAD question generation tasks, using a wide variety of automatic metrics and human-based evaluation, demonstrate that Composition Sampling is currently the best available decoding strategy for generating diverse meaningful outputs. We further introduce a novel automatic measure for jointly evaluating diversity and faithfulness in summaries. PDF 1 2022
ErAConD: Error Annotated Conversational Dialog Dataset for Grammatical Error Correction Currently available grammatical error correction (GEC) datasets are compiled using well-formed written text, limiting the applicability of these datasets to other domains such as informal writing and conversational dialog. In this paper, we present a novel GEC dataset consisting of parallel original and corrected utterances drawn from open-domain chatbot conversations; this dataset is, to our knowledge, the first GEC dataset targeted to a conversational setting. We also present a detailed annotation scheme which ranks errors by perceived impact on comprehension, making our dataset more representative of real-world language learning applications. To demonstrate the utility of the dataset, we use our annotated data to fine-tune a state-of-the-art GEC model. Experimental results show the effectiveness of our data in improving GEC model performance in a conversational scenario. PDF 1 2022
UserIdentifier: Implicit User Representations for Simple and Effective Personalized Sentiment Analysis Sentiment classification models are typically trained to be as generalizable as possible. Invariance to the specific user is considered desirable since models are shared across multitudes of users. However, these models are often unable to produce personalized responses for individual users, based on their data. Contrary to widely-used personalization techniques based on few-shot and meta-learning, we propose UserIdentifier, a novel scheme for training a single shared model for all users. Our approach produces personalized responses by prepending a fixed, user-specific non-trainable string (called a ``user identifier'') to each user's input text. Unlike prior work, this method doesn't need any additional model parameters, any extra rounds of personal few-shot learning, or any change to the vocabulary. We empirically study different types of user identifiers (numeric, alphanumeric, and also randomly generated) and demonstrate that, surprisingly, randomly generated user identifiers outperform the prefix-tuning-based state-of-the-art approach by up to 13 points on a suite of sentiment analysis datasets. PDF 1 2022
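The mechanism is simple enough to show in a few lines; the identifier length and format below are assumptions matching the randomly generated variant the paper reports works best.

```python
import random
import string

def make_user_identifier(seed, length=10):
    """A fixed, non-trainable, user-specific string (one per user)."""
    rng = random.Random(seed)
    return "".join(rng.choices(string.ascii_lowercase + string.digits, k=length))

user_id = make_user_identifier(seed=42)  # same id at training and inference time
example = f"{user_id} the movie was surprisingly good"
# `example` is fed to the shared model in place of the raw input text.
```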
Explore the Potential Performance of Vision-and-Language Navigation Model: a Snapshot Ensemble Method Given an instruction in a natural language, the vision-and-language navigation (VLN) task requires a navigation model to match the instruction to its visual surroundings and then move to the correct destination. It has been difficult to build VLN models that can generalize as well as humans. In this paper, we provide a new perspective that accommodates the potential variety of interpretations of verbal instructions. We discovered that snapshots of a VLN model, i.e., model versions based on parameters saved at various intervals during its training, behave significantly differently even when their navigation success rates are almost the same. We thus propose a snapshot-based ensemble solution that leverages predictions provided by multiple snapshots. Our approach is effective and generalizable, and can be applied to ensemble snapshots from different models. Constructed on the mixed snapshots of the existing state-of-the-art (SOTA) RecBERT and HAMT models, our proposed ensemble achieves new SOTA performance in the R2R Dataset Challenge in the single-run setting. PDF 1 2022
Retrieval-guided Counterfactual Generation for QA Deep NLP models have been shown to be brittle to input perturbations. Recent work has shown that data augmentation using counterfactuals --- i.e. minimally perturbed inputs --- can help ameliorate this weakness. We focus on the task of creating counterfactuals for question answering, which presents unique challenges related to world knowledge, semantic diversity, and answerability. To address these challenges, we develop a Retrieve-Generate-Filter (RGF) technique to create counterfactual evaluation and training data with minimal human supervision. Using an open-domain QA framework and question generation model trained on original task data, we create counterfactuals that are fluent, semantically diverse, and automatically labeled. Data augmentation with RGF counterfactuals improves performance on out-of-domain and challenging evaluation sets over and above existing methods, in both the reading comprehension and open-domain QA settings. Moreover, we find that RGF data leads to significant improvements in robustness to local perturbations. PDF 1 2022
Efficient Zero-Shot Semantic Parsing with Paraphrasing from Pretrained Language Models Building a domain-specific semantic parser with little or no domain-specific training data remains a challenging task. Previous work has shown that crowdsourced paraphrases of synthetic (grammar-generated) utterances can be used to train semantic parsing models for new domains with good results. We investigate whether semantic parsers for new domains can be built with no additional human effort, obtaining paraphrases of grammar-generated utterances from large neural language models, such as Google's T5 and EleutherAI's GPT-J, as an alternative to crowd-sourcing. While our models trained with automated paraphrases generated by pretrained language models do not outperform supervised models trained with similar amounts of human-generated domain-specific data, they perform well in a zero-shot setting, where no domain-specific data is available for a new domain. Additionally, unlike the current state-of-the-art in zero-shot semantic parsing, our approach does not require the use of large transformer-based language models at inference-time. Using the Overnight dataset, we show that automated paraphrases can be used to train a semantic parsing model that outperforms or is competitive with state-of-the-art-models in the zero-shot setting, while requiring a small fraction of the time and energy costs at inference time. PDF 1 2022
DYLE: Dynamic Latent Extraction for Abstractive Long-Input Summarization Transformer-based models have achieved state-of-the-art performance on short-input summarization. However, they still struggle with summarizing longer text. In this paper, we present DYLE, a novel dynamic latent extraction approach for abstractive long-input summarization. DYLE jointly trains an extractor and a generator and treats the extracted text snippets as the latent variable, allowing dynamic snippet-level attention weights during decoding. To provide adequate supervision, we propose simple yet effective heuristics for oracle extraction as well as a consistency loss term, which encourages the extractor to approximate the averaged dynamic weights predicted by the generator. We evaluate our method on different long-document and long-dialogue summarization tasks: GovReport, QMSum, and arXiv. Experimental results show that DYLE outperforms all existing methods on GovReport and QMSum, with gains up to 6.1 ROUGE, while yielding strong results on arXiv. Further analysis shows that the proposed dynamic weights provide interpretability of our generation process. PDF 1 2022
A Masked Segmental Language Model for Unsupervised Natural Language Segmentation We introduce a Masked Segmental Language Model (MSLM) for joint language modeling and unsupervised segmentation. While near-perfect supervised methods have been developed for segmenting human-like linguistic units in resource-rich languages such as Chinese, many of the world's languages are both morphologically complex, and have no large dataset of ``gold'' segmentations for supervised training. Segmental Language Models offer a unique approach by conducting unsupervised segmentation as the byproduct of a neural language modeling objective. However, current SLMs are limited in their scalability due to their recurrent architecture. We propose a new type of SLM for use in both unsupervised and lightly supervised segmentation tasks. The MSLM is built on a span-masking transformer architecture, harnessing a masked bidirectional modeling context and attention, as well as adding the potential for model scalability. In a series of experiments, our model outperforms the segmentation quality of recurrent SLMs on Chinese, and performs similarly to the recurrent model on English. PDF 1 2022
Better Uncertainty Quantification for Machine Translation Evaluation Neural-based machine translation (MT) evaluation metrics are progressing fast. However, they are often hard to interpret and might produce unreliable scores when human references or assessments are noisy or when data is out-of-domain. Recent work leveraged uncertainty quantification techniques such as Monte Carlo dropout and deep ensembles to provide confidence intervals, but these techniques (as we show) are limited in several ways. In this paper, we introduce more powerful and efficient uncertainty predictors for capturing both aleatoric and epistemic uncertainty, by training the COMET metric with new heteroscedastic regression, divergence minimization, and direct uncertainty prediction objectives. Our experiments show improved results on WMT20 and WMT21 metrics task datasets and a substantial reduction in computational costs. Moreover, they demonstrate the ability of our predictors to identify low quality references and to reveal model uncertainty due to out-of-domain data. PDF 1 2022
Table Retrieval Does Not Necessitate Table-specific Model Design Tables are an important form of structured data for both human and machine readers alike, providing answers to questions that cannot, or cannot easily, be found in texts. Recent work designs and trains special models for table-related tasks such as table-based question answering and table retrieval. Though effective, they add model-data dual complexity to generic text solutions and obscure which elements are truly beneficial. In this work, we focus on the task of table retrieval, and ask: ``are table-specific model designs necessary for table retrieval, or can a text-generic model be effectively used to achieve a similar result?'' We start by analyzing NQ-table, a set of table-answerable questions in the Natural Questions (NQ) dataset, and find that 90\% of the questions can match tables in content with little concern for table structure. Motivated by this, we experiment with a general-purpose Dense Passage Retriever (DPR) for text and a special-purpose Dense Table Retriever (DTR) for tables. We show that DPR, without any design for or training on tables, can perform comparably well to the state-of-the-art DTR model, and neither adding DTR-like table-specific embeddings nor perturbing cell orders leads to significant changes. Both results strongly indicate that table retrieval does not necessitate table-specific model design, as well as the potential of directly applying powerful text-generic retrievers to structured tables. PDF 1 2022
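For illustration, a hedged sketch of one plausible way to hand a table to a text-only dense retriever such as DPR: flatten it row by row into plain text. The "header is cell" serialization below is an assumption for demonstration, not the preprocessing used in the paper.

```python
def linearize_table(title: str, header: list[str], rows: list[list[str]]) -> str:
    """Flatten a table into a text passage so a text-generic retriever can
    index it (illustrative row-major serialization)."""
    parts = [title]
    for row in rows:
        parts.append(", ".join(f"{h} is {c}" for h, c in zip(header, row)))
    return ". ".join(parts)

print(linearize_table("Olympic host cities",
                      ["Year", "City"],
                      [["2008", "Beijing"], ["2012", "London"]]))
# -> "Olympic host cities. Year is 2008, City is Beijing. Year is 2012, City is London"
```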
A Weak supervision with Syntactic Cues for Reference Resolution In recipes, contextual understanding of instructions depends on temporal interpretation of the entities because of their spatio-temporal changes. Accordingly, we propose the use of reference resolution to find the origin action of entities, provided that the entity is an output from a previous action, instead of being a raw ingredient. Here, we introduce a weak supervision method that leverages syntactic features to produce latent links between entities and their origin actions. The results show that our weak supervision outperforms the previous unsupervised studies by 8\% F1. In particular, our approach achieves 82\% resolution performance for pronouns and 85\% for null entities. PDF 1 2022
Improving Unsupervised Sentence Simplification Using Fine-Tuned Masked Language Models Word suggestion in unsupervised sentence simplification is mostly done without considering the context of the input sentence. Fortunately, masked language modeling is a well-established task for predicting the most suitable candidate for a masked token using the surrounding context words. In this paper, we propose a technique that merges pre-trained BERT models with a successful edit-based unsupervised sentence simplification model to bring context-awareness into the simple word suggestion functionality. Next, we show that simply by fine-tuning the BERT model on enough simple sentences, simplification results can be improved and even outperform some of the competing supervised methods. Finally, we introduce a framework that involves filtering an arbitrary amount of unlabeled in-domain text for tuning the model. By removing useless training samples, this preprocessing step speeds up the fine-tuning process where labeled data, both simple and complex, are scarce. PDF 1 2022
Embedding-Enhanced GIZA++: Improving Low-Resource Word Alignment Using Embeddings Word alignment has been dominated until recently by GIZA++, a statistical method based on the 30-year-old IBM models. New methods primarily rely on large machine translation models, massively multilingual language models, or supervision. We introduce Embedding-Enhanced GIZA++, and outperform GIZA++ without any of the aforementioned factors. Taking advantage of monolingual embedding spaces of the source and target language only, we exceed GIZA++'s performance in every tested scenario for three language pairs. In the lowest-resource setting, we outperform GIZA++ by 8.5, 10.9, and 12 AER for Ro-En, De-En, and En-Fr, respectively. We release our code at www.blind-review.code. PDF 1 2022
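As a hedged sketch of how monolingual embedding spaces might drive alignment, assuming the source and target spaces have already been mapped into a shared space; how such similarity scores are combined with GIZA++'s statistics in the paper is not reproduced here.

```python
import numpy as np

def greedy_embedding_alignment(src_vecs: np.ndarray, tgt_vecs: np.ndarray,
                               threshold: float = 0.4):
    """Align each source token to its most cosine-similar target token
    (illustrative greedy rule; `threshold` is an assumed cutoff)."""
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sim = src @ tgt.T                          # (src_len, tgt_len) cosine matrix
    return [(i, int(sim[i].argmax()))
            for i in range(len(src)) if sim[i].max() >= threshold]
```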
A Study of the Attention Abnormality in Trojaned BERTs Trojan attacks raise serious security concerns. In this paper, we investigate the underlying mechanism of Trojaned BERT models. We observe the attention focus drifting behavior of Trojaned models, i.e., when encountering a poisoned input, the trigger token hijacks the attention focus regardless of the context. We provide a thorough qualitative and quantitative analysis of this phenomenon, revealing insights into the Trojan mechanism. Based on the observation, we propose an attention-based Trojan detector to distinguish Trojaned models from clean ones. To the best of our knowledge, we are the first to analyze the Trojan mechanism and develop a Trojan detector based on the transformer's attention. PDF 1 2022
MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation Pre-trained language models have demonstrated superior performance in various natural language processing tasks. However, these models usually contain hundreds of millions of parameters, which limits their practicality because of latency requirements in real-world applications. Existing methods train small compressed models via knowledge distillation. However, performance of these small models drops significantly compared with the pre-trained models due to their reduced model capacity. We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed. We initialize MoEBERT by adapting the feed-forward neural networks in a pre-trained model into multiple experts. As such, representation power of the pre-trained model is largely retained. During inference, only one of the experts is activated, such that speed can be improved. We also propose a layer-wise distillation method to train MoEBERT. We validate the efficiency and efficacy of MoEBERT on natural language understanding and question answering tasks. Results show that the proposed method outperforms existing task-specific distillation algorithms. For example, our method outperforms previous approaches by over $2\%$ on the MNLI (mismatched) dataset. Our code will be publicly available. PDF 1 2022
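A simplified sketch of the adaptation step: partition a pre-trained FFN's hidden units into experts using a precomputed per-unit importance score and route each token to a single expert. MoEBERT's sharing of the most important neurons across experts and its layer-wise distillation are omitted; the hard top-1 router below is an assumption.

```python
import torch
import torch.nn as nn

class FFNToExperts(nn.Module):
    """Split one FFN (ffn_in -> GELU -> ffn_out) into `n_experts` smaller
    experts by importance ranking of hidden units (illustrative sketch)."""
    def __init__(self, ffn_in: nn.Linear, ffn_out: nn.Linear,
                 importance: torch.Tensor, n_experts: int = 4):
        super().__init__()
        self.d_out = ffn_out.out_features
        order = importance.argsort(descending=True)
        self.experts = nn.ModuleList()
        for idx in order.chunk(n_experts):     # hidden-unit ids per expert
            e_in = nn.Linear(ffn_in.in_features, len(idx))
            e_out = nn.Linear(len(idx), self.d_out)
            with torch.no_grad():              # copy the pre-trained weights
                e_in.weight.copy_(ffn_in.weight[idx])
                e_in.bias.copy_(ffn_in.bias[idx])
                e_out.weight.copy_(ffn_out.weight[:, idx])
                e_out.bias.copy_(ffn_out.bias)
            self.experts.append(nn.Sequential(e_in, nn.GELU(), e_out))
        self.router = nn.Linear(ffn_in.in_features, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model); hard top-1 routing activates one expert
        choice = self.router(x).argmax(dim=-1)
        out = x.new_zeros(x.shape[0], self.d_out)
        for i, expert in enumerate(self.experts):
            sel = choice == i
            if sel.any():
                out[sel] = expert(x[sel])
        return out
```

Because only one expert runs per token, the per-token FFN cost drops roughly by a factor of `n_experts`, while the copied weights preserve much of the pre-trained representation.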
Towards Policy-Guided Conversational Recommendation with Dialogue Acts A Conversational Recommender System (CRS) aims to recommend items through natural conversation. Existing works in open-ended CRS mainly focus on recommendation and generation, but lack control over dialogue policy. In addition, the system is unable to adapt the user profile based on the user's feedback. Thus, we present a new dataset named DA-ReDial (Recommendation through Dialogue guided by Dialogue Act). We summarize 10 representative Dialog Acts and label dialogues with the DA schema. To solve the problems above, we also propose a novel CRS called PGCR, which stands for Policy-Guided Conversational Recommendation. It is able to formulate a DA-aware user profile, leverage Dialogue Acts to explicitly model the discourse structure of conversation, and better guide the response generation. Extensive experiments on the new dataset show that our proposed model outperforms most baselines in dialog generation and recommendation. Also, the Policy Network fine-tuned by self-play can better control the dialogue policy and contributes substantially to the recommendation strategy and user engagement in conversation. PDF 1 2022
Towards Multi-Turn Empathetic Dialogs with Positive Emotion Elicitation Emotional support is a crucial skill for many real-world scenarios, including caring for the elderly, mental health support, and customer service chats. This paper presents a novel task of empathetic dialog generation with positive emotion elicitation to promote users' positive emotion, similar to that of emotional support between humans. In this task, the agent conducts empathetic responses along with the target of eliciting the user's positive emotions in the multi-turn dialog. To facilitate the study of this task, we collect a large-scale emotional dialog dataset with positive emotion elicitation, called PosEmoDial (about 820k dialogs, 3M utterances). In these dialogs, the agent tries to guide the user from any possible initial emotional state, e.g., sadness, to a positive emotional state. Then we present a positive-emotion-guided dialog generation model with a novel loss function design. This loss function encourages the dialog model to not only elicit positive emotions from users but also ensure smooth emotional transitions along with the whole dialog. Finally, we establish benchmark results on PosEmoDial, and we will release this dataset and related source code to facilitate future studies. PDF 1 2022
One-Shot Learning from a Demonstration with Hierarchical Latent Language Humans have the capability, aided by the expressive compositionality of their language, to learn quickly by demonstration. They are able to describe unseen task-performing procedures and generalize their execution to other contexts. In this work, we introduce DescribeWorld, an environment designed to test this sort of generalization skill in grounded agents, where tasks are linguistically and procedurally composed of elementary concepts. The agent observes a single task demonstration in a Minecraft-like grid world, and is then asked to carry out the same task in a new map. To enable such a level of generalization, we propose a neural agent infused with hierarchical latent language—both at the level of task inference and subtask planning. Our agent first generates a textual description of the demonstrated unseen task, then leverages this description to replicate it. Through multiple evaluation scenarios and a suite of generalization tests, we find that agents that perform text-based inference are better equipped for the challenge under a random split of tasks. PDF 1 2022
Validated Image Caption Rating (VICR) Scale, Dataset, and Model Assessing the quality of an image caption is a complex task. We propose a new image caption rating system that consists of (1) a robust rating scale that is consistent, teachable, and externally validated, (2) an engaging and scalable data generation approach for the task, (3) a high-quality dataset, and (4) an effective image caption rating predictor. Using contemporary approaches from psychometrics we demonstrate that the proposed scale and rater training routine can support high quality annotation efforts for the task. We introduce two new datasets (one original and another derived) for the task. Our reference-free and multi-level rating predictor performance is on par with state-of-the-art approaches. PDF 1 2022
Investigating the Roots of Gender Bias in Machine Translation: Observations on Gender Transfer between French and English This paper aims at identifying the inner mechanisms that make a translation model choose a masculine rather than a feminine form, an essential step to mitigate gender bias in MT. We conduct two series of experiments using probing and comparing the predictions of a translation model and a language model to show that i) gender information is encoded in all of the decoder's and encoder's representations and ii) the translation model does not need to use information from the source to predict this. PDF 1 2022
Controlling the Focus of Pretrained Language Generation Models The finetuning of pretrained transformer-based language generation models is typically conducted in an end-to-end manner, where the model learns to attend to relevant parts of the input by itself. However, there does not exist a mechanism to directly control the model's focus. This work aims to develop a control mechanism by which a user can select spans of context as ``highlights'' for the model to focus on, and generate relevant output. To achieve this goal, we augment a pretrained model with trainable ``focus vectors'' that are directly applied to the model's embeddings, while the model itself is kept fixed. These vectors, trained on automatic annotations derived from attribution methods, act as indicators for context importance. We test our approach on two core generation tasks: dialogue response generation and abstractive summarization. We also collect evaluation data where the highlight-generation pairs are annotated by humans. Our experiments show that the trained focus vectors are effective in steering the model to generate outputs that are relevant to user-selected highlights. PDF 1 2022
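A minimal sketch of the focus-vector idea under stated assumptions: the pretrained model stays frozen, and only two trainable vectors are learned, one added to embeddings inside user-selected highlight spans and one outside. The exact parameterization in the paper may differ.

```python
import torch
import torch.nn as nn

class FocusVectors(nn.Module):
    """Trainable additive offsets applied to a frozen model's embeddings
    (illustrative form of the focus vectors described above)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.focus = nn.Parameter(torch.zeros(d_model))    # inside highlights
        self.unfocus = nn.Parameter(torch.zeros(d_model))  # outside highlights

    def forward(self, embeddings: torch.Tensor, highlight_mask: torch.Tensor):
        # embeddings: (batch, seq, d_model); highlight_mask: (batch, seq) in {0, 1}
        m = highlight_mask.unsqueeze(-1).float()
        return embeddings + m * self.focus + (1.0 - m) * self.unfocus
```

Since only `2 * d_model` parameters are trained, the mechanism can be bolted onto a generation model without touching its weights.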
Discriminative Models Can Still Outperform Generative Models in Aspect Based Sentiment Analysis Aspect-based Sentiment Analysis (ABSA) helps to explain customers' opinions towards products and services. In the past, ABSA models were discriminative, but more recently generative models have been used to generate aspects and polarities directly from text. In contrast, discriminative models commonly first select aspects from the text, and then classify the aspect's polarity. Previous results showed that generative models outperform discriminative models on several English ABSA datasets. Here, we evaluate and contrast two state-of-the-art discriminative and generative models in several settings: cross-lingual, cross-domain, and combined cross-lingual and cross-domain, to understand generalizability beyond the English monolingual in-domain setting. Our more thorough evaluation shows that, contrary to previous studies, discriminative models can still outperform generative models in almost all settings. PDF 1 2022
Multi2WOZ: A Robust Multilingual Dataset and Conversational Pretraining for Task-Oriented Dialog Research on (multi-domain) task-oriented dialog (TOD) has predominantly focused on the English language, primarily due to the shortage of robust TOD datasets in other languages, preventing the systematic investigation of cross-lingual transfer for this crucial NLP application area. In this work, we introduce Multi2WOZ, a new multilingual multi-domain TOD dataset, derived from the well-established English dataset MultiWOZ, that spans four typologically diverse languages: Chinese, German, Arabic, and Russian. In contrast to concurrent efforts, Multi2WOZ contains gold-standard dialogs in target languages that are directly comparable with development and test portions of the English dataset, enabling reliable and comparative estimates of cross-lingual transfer performance for TOD. This enables us to detect and explore crucial challenges for TOD cross-lingually. We then introduce a new framework for multilingual conversational specialization of pretrained language models (PrLMs) that aims to facilitate cross-lingual transfer for arbitrary downstream TOD tasks. Using such conversational PrLMs specialized for concrete target languages, we systematically benchmark a number of zero-shot and few-shot cross-lingual transfer approaches on two standard TOD tasks: Dialog State Tracking and Response Retrieval. Our experiments show that, in most setups, the best performance entails the combination of (i) conversational specialization in the target language and (ii) few-shot transfer for the concrete TOD task. Most importantly, we show that our conversational specialization in the target language allows for a much more sample-efficient few-shot transfer for downstream TOD tasks. PDF 1 2022
Context-Aware Query Rewriting for Improving Users' Search Experience on E-commerce Websites E-commerce queries are often short and ambiguous. E-commerce query understanding often uses query rewriting to disambiguate user-input queries. While using e-commerce search tools, users tend to enter multiple searches, which we call context, before purchasing. These history searches contain contextual insights about users' true shopping intents. Therefore, modeling such contextual information is critical to a better query rewriting model. However, existing query rewriting models ignore users' history behaviors and consider only the instant search query, which is often a short string offering limited information about the true shopping intent. We propose an end-to-end context-aware query rewriting model to bridge this gap, which takes the search context into account. Specifically, our model builds a session graph using the history search queries, their contained words, and auxiliary category information. We then employ a weighted graph attention mechanism that models cross-query relations and computes contextual information of the session. The model subsequently calculates session representations by combining the contextual information with the instant search query using an aggregation network. The session representations are then decoded to generate rewritten queries. Empirically, we demonstrate the superiority of our method to state-of-the-art approaches under various evaluation metrics. Our code and data will be publicly available. PDF 1 2022
Label-guided Data Augmentation for Prompt-based Few Shot Learners Recent advances on large pre-trained language models (PLMs) lead to impressive gains on many natural language understanding (NLU) tasks with task-specific fine-tuning. However, directly fine-tuning PLMs relies heavily on large amounts of labeled instances, which are expensive and time-consuming to obtain. Prompt-based tuning on PLMs has proven valuable for few-shot tasks. Existing works studying prompt-based tuning for few-shot NLU mainly focus on deriving proper label words with a verbalizer or generating prompt templates for eliciting semantics from PLMs. In addition, conventional data augmentation methods can enrich training data for improving few-shot learning, while ignoring the label semantics. It is promising to leverage the rich label semantics in label words for data augmentation to facilitate prompt-based tuning for downstream NLU tasks. However, work on this is rather limited. Therefore, we study a new problem of data augmentation for prompt-based few-shot learners. We propose a novel label-guided data augmentation method, PromptDA, which exploits the enriched label semantic information for data augmentation. Experimental results on several few-shot text classification tasks show that our proposed framework achieves superior performance by effectively leveraging label semantics and data augmentation in language understanding. PDF 1 2022
Unsupervised Slot Schema Induction for Task-oriented Dialog Carefully-designed schemas describing how to collect and annotate dialog corpora are a prerequisite for building task-oriented dialog systems. In practical applications, manually designing schemas can be error-prone, laborious, iterative, and slow, especially when the schema is complicated. To alleviate this expensive and time-consuming process, we propose an unsupervised approach for slot schema induction from unlabeled dialog corpora. Leveraging in-domain language models and unsupervised parsing structures, our data-driven approach extracts candidate slots without constraints, followed by coarse-to-fine clustering to induce slot types. We compare our method against several strong supervised baselines, and show significant performance improvement in slot schema induction on MultiWoz and SGD datasets. We also demonstrate the effectiveness of induced schemas on downstream applications including dialog state tracking and response generation. PDF 1 2022
Investigating the saliency of sentiment expressions in aspect-based sentiment analysis We examine the behaviour of an aspect-based sentiment classifier built by fine-tuning the English BERT base model on the SemEval 2016 English dataset. In a set of masking experiments, we examine the extent to which the tokens which express the sentiment towards the aspect are being used by the classifier. The enhanced performance of a classifier that only sees the relevant sentiment expressions suggests that they are not being used to their full potential. Furthermore, sentiment expressions which are not directly relevant to the aspect in focus also appear to be used. We then use a gradient-based method to identify the most salient words. A comparison of these salient words, or rationales, with the sentiment expressions reveals only a moderate level of agreement. Some disagreements are related to the fixed length of the rationales and the tendency of the rationales to contain content words related to the aspect itself. PDF 1 2022
Guiding Neural Story Generation with Reader Models Automated storytelling has long captured the attention of researchers for the ubiquity of narratives in everyday life. However, it is challenging to maintain coherence and stay on-topic toward a specific ending when generating narratives with neural language models. In this paper, we introduce Story generation with Reader Models (StoRM), a framework in which a reader model is used to reason about how the story should progress. A reader model infers what a human reader believes about the concepts, entities, and relations in the fictional story world. We show how an explicit reader model represented as a knowledge graph affords story coherence and provides controllability in the form of achieving a given story world state goal. Experiments show that our model produces significantly more coherent and on-topic stories, outperforming baselines in dimensions including plot plausibility and staying on topic. Our system also outperforms outline-guided story generation baselines in composing given concepts without ordering. PDF 1 2022
Match made by BERT? Towards Interpretable Paper-Reviewer Assignments in NLP Both scientific progress and individual researcher careers depend on the quality of peer review, which in turn depends on paper-reviewer matching. Surprisingly, this problem has been mostly approached simply as an automated recommendation problem, rather than as a matter where different stakeholders (authors, reviewers, area chairs) have accumulated experience worth taking into account. We present the results of the first survey of the NLP community, identifying common issues and perspectives on what factors should be considered in paper-reviewer matching. This study contributes actionable recommendations for improving future NLP conferences, and desiderata for interpretable peer review assignments. PDF 1 2022
On Systematic Style Differences between Unsupervised and Supervised MT and an Application for High-Resource Machine Translation Modern unsupervised machine translation (MT) systems reach reasonable translation quality under clean and controlled data conditions. As the performance gap between supervised and unsupervised MT narrows, it is interesting to ask whether the different training methods result in systematically different output beyond what is visible via quality metrics like adequacy or BLEU. We compare translations from supervised and unsupervised MT systems of similar quality, finding that unsupervised output is more fluent and more structurally different in comparison to human translation than is supervised MT. We then demonstrate a way to combine the benefits of both methods into a single system which results in improved adequacy and fluency as rated by human evaluators. Our results open the door to interesting discussions about how supervised and unsupervised MT might be different yet mutually-beneficial. PDF 1 2022
Translated Texts Under the Lens: From Machine Translation Detection to Source Language Identification In this work, we tackle the problem of the detection of translated texts from different angles. On top of addressing the classic task of machine translation detection, we investigate and find the presence of common patterns across different machine translation systems as well as different source languages. Then, we show that it is possible to identify the translation systems used to produce a translated text (F1-score $88.5\%$) as well as the source language of the original text (F1-score $79\%$). We assess our tasks using Books, a new dataset we built from scratch based on excerpts of novels, and the well-known Europarl dataset. PDF 1 2022
An Encoder Attribution Analysis for Dense Passage Retriever in Open-Domain Question Answering The bi-encoder design of dense passage retriever (DPR) is a key factor in its success in open-domain question answering (QA). However, it is unclear how DPR's question encoder and passage encoder individually contribute to the overall performance, which we refer to as the encoder attribution problem. The problem is important as it helps us isolate responsible factors for individual encoders to further improve overall performance. In this paper, we formulate our analysis under a probabilistic framework called encoder marginalization, where we quantify the contribution of a single encoder by marginalizing over other variables. We find that the passage encoder contributes more than the question encoder to the in-domain retrieval accuracy. We further use an example to demonstrate how to find the affecting factors for each encoder, where we train multiple DPR models with different amounts of data and use encoder marginalization to analyze the results. We find that the positive passage overlap and corpus coverage of training data have big impacts on the passage encoder, while the question encoder is mainly affected by training sample complexity under this setting. Based on this framework, we can devise data-efficient training regimes: for example, we manage to train a passage encoder on SQuAD using 60\% less training data without loss of accuracy. These results illustrate the utility of our encoder attribution analysis. PDF 1 2022
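A small sketch of encoder marginalization for the retrieval-accuracy case, assuming precomputed question and passage embeddings and a uniform distribution over the marginalized variable (here, the passage encoder). Function names are hypothetical.

```python
import numpy as np

def retrieval_accuracy(q_emb: np.ndarray, p_emb: np.ndarray,
                       gold: np.ndarray) -> float:
    """Top-1 retrieval accuracy under dot-product scoring."""
    scores = q_emb @ p_emb.T                   # (n_questions, n_passages)
    return float((scores.argmax(axis=1) == gold).mean())

def marginalized_question_encoder_score(q_emb: np.ndarray,
                                        passage_emb_variants: list[np.ndarray],
                                        gold: np.ndarray) -> float:
    """Score one question encoder by averaging accuracy over several
    alternative passage encoders, i.e. marginalizing the passage side out
    (uniform prior; a sketch of the framework, not its exact estimator)."""
    return float(np.mean([retrieval_accuracy(q_emb, p, gold)
                          for p in passage_emb_variants]))
```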
A Multilingual Perspective Towards the Evaluation of Attribution Methods Most evaluations of attribution methods focus on the English language. In this work, we present a multilingual approach for evaluating attribution methods for the Natural Language Inference (NLI) task in terms of plausibility and faithfulness properties. First, we introduce a novel cross-lingual strategy to measure faithfulness based on word alignments, which eliminates the potential downsides of erasure-based evaluations. We then perform a comprehensive evaluation of attribution methods, considering different output mechanisms and aggregation methods. Finally, we augment the XNLI dataset with highlight-based explanations, providing a multilingual NLI dataset with highlights, which may support future exNLP studies. Our results show that attribution methods performing best for plausibility and faithfulness are different. PDF 1 2022
Answer Uncertainty and Unanswerability in Multiple-Choice Machine Reading Comprehension Machine reading comprehension (MRC) has drawn a lot of attention as an approach for assessing the ability of systems to understand natural language. Usually systems focus on selecting the correct answer to a question given a contextual paragraph. However, for many applications of multiple-choice MRC systems there are two additional considerations. For multiple-choice exams there is often a negative marking scheme; there is a penalty for an incorrect answer. This means that the system is required to have an idea of the uncertainty in the predicted answer. The second consideration is that many multiple-choice questions have the option of none of the above (NOA) indicating that none of the answers is applicable, rather than there always being the correct answer in the list of choices. This paper investigates both of these issues by making use of predictive uncertainty. It is shown that uncertainty does allow questions that the system is not confident about to be detected. Additionally we show that uncertainty outperforms a system explicitly built with an NOA option for the ReClor corpus. PDF 1 2022
Product Answer Generation from Heterogeneous Sources: A New Benchmark and Best Practices It is of great value to answer product questions based on heterogeneous information sources available on web product pages, e.g., semi-structured attributes, text descriptions, user-provided contents, etc. However, these sources have different structures and writing styles, which poses challenges for (1) evidence ranking, (2) source selection, and (3) answer generation. In this paper, we build a benchmark with annotations for both evidence selection and answer generation covering 6 information sources. Based on this benchmark, we conduct a comprehensive study and present a set of best practices. We show that all sources are important and contribute to answering questions. Handling all sources within one single model can produce comparable confidence scores across sources and combining multiple sources for training always helps, even for sources with totally different structures. We further propose a novel data augmentation method to iteratively create training samples for answer generation, which achieves close-to-human performance with only a few thousand annotations. Finally, we perform an in-depth error analysis of model predictions and highlight the challenges for future research. PDF 1 2022
Residue-Based Natural Language Adversarial Attack Detection Deep learning based systems are susceptible to adversarial attacks, where a small, imperceptible change at the input alters the model prediction. However, to date the majority of the approaches to detect these attacks have been designed for image processing systems. Many popular image adversarial detection approaches are able to identify adversarial examples from embedding feature spaces, whilst in the NLP domain existing state-of-the-art detection approaches solely focus on input text features, without consideration of model embedding spaces. This work examines what differences result when porting these image-designed strategies to Natural Language Processing (NLP) tasks - these detectors are found to not port over well. This is expected as NLP systems have a very different form of input: discrete and sequential in nature, rather than the continuous and fixed size inputs for images. As an equivalent model-focused NLP detection approach, this work proposes a simple sentence-embedding "residue" based detector to identify adversarial examples. On many tasks, it out-performs ported image domain detectors and recent state-of-the-art NLP-specific detectors. PDF 1 2022
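As one plausible instantiation of a sentence-embedding "residue" detector (the paper's exact residue definition may differ), fit a low-rank subspace to clean embeddings and flag inputs whose off-subspace residue norm is unusually large. The PCA form and the percentile threshold are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

class ResidueDetector:
    """Flag embeddings with a large component outside the subspace
    spanned by clean data (illustrative residue-style detector)."""
    def __init__(self, n_components: int = 32):
        self.pca = PCA(n_components=n_components)
        self.threshold = None

    def _residue(self, x: np.ndarray) -> np.ndarray:
        recon = self.pca.inverse_transform(self.pca.transform(x))
        return np.linalg.norm(x - recon, axis=1)   # off-subspace norm

    def fit(self, clean_embeddings: np.ndarray, percentile: float = 99.0):
        self.pca.fit(clean_embeddings)
        self.threshold = np.percentile(self._residue(clean_embeddings), percentile)

    def is_adversarial(self, embeddings: np.ndarray) -> np.ndarray:
        return self._residue(embeddings) > self.threshold
```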
Privacy, Interpretability, and Fairness in the Multilingual Space Multilingual generalization or compression is an objective for cross-lingual models in natural language processing (NLP). We explore how the compression sought for in such models aligns with other common objectives in NLP such as performance, differential privacy, interpretability, and fairness. We show that compression, which can be quantified by, e.g., sentence retrieval or centered kernel alignment, is compatible with performance and privacy, but that performance and privacy are at odds, leading to non-linear interactions between compression, performance, and privacy. We also demonstrate that privacy is at odds with interpretability, leading to non-linear interactions between compression, privacy, and interpretability. Finally, while fairness and privacy are generally at odds, we show that in the multilingual space, fairness and privacy have common solutions. In sum, our study shows that if we want to learn multilingual models that exhibit good performance and good generalization properties, {\em and} are private, interpretable and fair (or any combination thereof), we need to jointly optimize for these inter-dependent objectives. PDF 1 2022
Evaluating Compositionality in Neural Models Using Arithmetic Expressions We introduce CobA, a dataset designed to evaluate the compositional properties of neural models. The dataset consists of simple arithmetic expressions combining natural integers with addition and multiplication operators. For example, $(5 + 4) \times 2$. We distinguish four aspects of compositionality: localism, substitutivity, productivity, and systematicity. We generate partitions of the dataset with specific in-domain and generalization sets, designed to evaluate the model's ability for each compositional aspect. By carefully selecting expressions from the in-domain and generalization sets, we introduce controlled differences between the two sets. We show that models achieve competitive performance on a random partition, for which there is no controlled difference. Yet, for partitions requiring compositional extrapolation, performance drastically decreases for most encoder architectures. We observe distinctions among architectures, in particular fixed-length context transformers and sequential or tree-structured LSTMs. PDF 1 2022
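A toy generator for expressions of this form, shown only to make the data format concrete; the dataset's actual sampling scheme and operator distribution are assumptions here.

```python
import random

def gen_expr(depth: int, max_int: int = 9) -> str:
    """Sample a fully bracketed expression over natural integers with
    + and * (illustrative; CobA's generation procedure may differ)."""
    if depth == 0:
        return str(random.randint(0, max_int))
    op = random.choice(["+", "*"])
    return f"({gen_expr(depth - 1, max_int)} {op} {gen_expr(depth - 1, max_int)})"

random.seed(0)
expr = gen_expr(2)
print(expr, "=", eval(expr))   # ground-truth value via direct evaluation
```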
Proposition-Level Clustering for Multi-Document Summarization Text clustering methods were traditionally incorporated into multi-document summarization (MDS) as a means for coping with considerable information repetition. Particularly, clusters were leveraged to indicate information saliency as well as to avoid redundancy. Such prior methods focused on clustering sentences, even though closely related sentences usually contain also non-aligned parts. In this work, we revisit the clustering approach, grouping together sub-sentential propositions, aiming at more precise information alignment. Specifically, our method detects salient propositions, clusters them into paraphrastic clusters, and generates a representative sentence for each cluster via text fusion. Our summarization method improves over the previous state-of-the-art MDS method in the DUC 2004 and TAC 2011 datasets, both in automatic ROUGE scores and human preference. PDF 1 2022
On the Effectiveness of Quasi Character-Level Models for Machine Translation Neural Machine Translation (NMT) models often use subword-level vocabularies to deal with rare or unknown words. Although some studies have shown the effectiveness of purely character-based models, these approaches have resulted in highly expensive models in computational terms. In this work, we explore the benefits of quasi-character-level models for low-resource NMT and their ability to mitigate the effects of the catastrophic forgetting problem. We first present a theoretical foundation along with an empirical study on the effectiveness of these models, as a function of the vocabulary and training set size, for a range of languages, domains, and architectures. Next, we study the ability of these models to mitigate the effects of catastrophic forgetting in machine translation. Our work suggests that quasi-character-level models have practically the same generalization capabilities as character-based models but at lower computational costs. Furthermore, they appear to help achieve greater consistency between domains than standard subword-level models, although the catastrophic forgetting problem is not mitigated. PDF 1 2022
TraceNet: Tracing and Locating the Key Elements in Sentiment Analysis In this paper, we study the sentiment analysis task where the outcomes are mainly contributed by a few key elements of the inputs. Motivated by the two-streams hypothesis, we propose a neural architecture, named TraceNet, to address this type of task. It not only learns discriminative representations for the target task via its encoders, but also traces key elements at the same time via its locators. In TraceNet, both encoders and locators are organized in a layer-wise manner, and a smoothness regularization is employed between adjacent encoder-locator combinations. Moreover, sparsity constraints are enforced on locators for tracing purposes, and items are proactively masked according to the item weights output by locators. A major advantage of TraceNet is that the outcomes are easier to understand, since the most responsible parts of inputs are identified. Also, under the guidance of locators, it is more robust to attacks due to its focus on key elements and the proactive masking training strategy. Experimental results show its effectiveness for sentiment classification. Moreover, we provide several case studies to demonstrate its robustness and interpretability. PDF 1 2022
Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries Current pre-trained models applied for summarization are prone to factual inconsistencies that misrepresent the source text. Evaluating the factual consistency of summaries is thus necessary to develop better models. However, the human evaluation setup for evaluating factual consistency has not been standardized. To determine the factors that affect the reliability of the human evaluation, we crowdsource evaluations for factual consistency across state-of-the-art models on two news summarization datasets using the rating-based Likert Scale and ranking-based Best-Worst Scaling. Our analysis reveals that the ranking-based Best-Worst Scaling offers a more reliable measure of summary quality across datasets and that the reliability of Likert ratings highly depends on the target dataset and the evaluation design. To improve crowdsourcing reliability, we extend the scale of the Likert rating and present a scoring algorithm for Best-Worst Scaling that we call value learning. Our crowdsourcing guidelines will be publicly available to facilitate future work on factual consistency in summarization. PDF 1 2022
Knowledge Based Template Machine Translation In Low-Resource Setting Incorporating tagging into neural machine translation (NMT) systems has shown promising results in helping translate rare words such as named entities (NE). However, translating NEs in low-resource settings remains a challenge. In this work, we investigate the effect of using tags and NE hypernyms from knowledge graphs (KGs) in parallel corpora under different levels of resource conditions. We find the tag-and-copy mechanism (tag the NEs in the source sentence and copy them to the target sentence) improves translation in high-resource settings only. Introducing copying also results in polarizing effects in translating different parts-of-speech (POS). Interestingly, we find that copy accuracy for hypernyms is consistently higher than that of entities. As a way of avoiding "hard" copying and utilizing hypernyms to bootstrap rare entities, we introduce a "soft" tagging mechanism and find consistent improvement in both high- and low-resource settings. PDF 1 2022
XQA-DST: Multi-Domain and Multi-Lingual Dialogue State Tracking In a task-oriented dialogue system, Dialogue State Tracking (DST) keeps track of all important information by filling slots with values given through the conversation. Existing methods generally rely on a predefined set of values and struggle to generalise to previously unseen slots in new domains. In this paper, we propose a multi-domain and multi-lingual dialogue state tracker in a neural reading comprehension approach. Our approach fills the slot values using span prediction, where the values are extracted from the dialogue itself. With a novel training strategy and an independent domain classifier, empirical results demonstrate that our model is a domain-scalable and open-vocabulary model that achieves 53.2% Joint Goal Accuracy (JGA) on MultiWOZ 2.1. We show its competitive transferability by zero-shot domain-adaptation experiments on MultiWOZ 2.1 with an average JGA of 31.6% for five domains. In addition, it achieves cross-lingual transfer with state-of-the-art zero-shot results, 64.9% JGA from English to German and 68.6% JGA from English to Italian on WOZ 2.0. PDF 1 2022
Paragraph-based Transformer Pretraining for Multi-Sentence Inference Inference tasks such as answer sentence selection (AS2) or fact verification are typically solved by fine-tuning transformer-based models as individual sentence-pair classifiers. Recent studies show that these tasks benefit from modeling dependencies across multiple candidate sentences `jointly'. In this paper, we first show that popular pretrained transformers perform poorly when used for fine-tuning on multi-candidate inference tasks. We then propose a new pretraining objective that models the paragraph-level semantics across multiple input sentences. Our evaluation on three AS2 datasets and one fact verification dataset demonstrates the superiority of our pretrained joint models over pretrained transformers for multi-candidate inference tasks. PDF 1 2022
Power Norm Based Lifelong Learning for Paraphrase Generations Seq2seq language generation models that are trained offline with multiple domains in a sequential fashion often suffer from catastrophic forgetting. Lifelong learning has been proposed to handle this problem. However, existing work such as experience replay or elastic weight consolidation requires incremental memory space. In this work, we propose an innovative framework, RMR_DSE, that leverages a recall optimization mechanism to selectively memorize important parameters of previous tasks via regularization, and uses a domain drift estimation algorithm to compensate for the drift between different domains in the embedding space. These designs enable the model to be trained on the current task while keeping the memory of previous tasks, and avoid much additional data storage. Furthermore, RMR_DSE can be combined with existing lifelong learning approaches. Our experiments on two seq2seq language generation tasks, paraphrase and dialog response generation, show that RMR_DSE outperforms SOTA models by a considerable margin and greatly reduces forgetting. PDF 1 2022
Logical Satisfiability of Counterfactuals for Faithful Explanations in NLI Evaluating an explanation's faithfulness is desired for many reasons, such as trust, interpretability, and diagnosing the sources of a model's errors. In this work, which focuses on the NLI task, we introduce the methodology of Faithfulness-through-Counterfactuals, which first generates a counterfactual hypothesis based on the logical predicates expressed in the explanation, and then evaluates if the model's prediction on the counterfactual is consistent with that expressed logic (i.e. if the new formula is logically satisfiable). In contrast to existing approaches, this does not require any explanations for training a separate verification model. We first validate the efficacy of automatic counterfactual hypothesis generation, leveraging the few-shot priming paradigm. Next, we show that our proposed metric performs well compared to other metrics using simulatability studies as a proxy task for faithfulness. In addition, we conduct a sensitivity analysis to validate that our metric is sensitive to unfaithful explanations. PDF 1 2022
Extracting and Inferring Personal Attributes from Dialogue Personal attributes represent structured information about a person, such as their hobbies, pets, family, likes and dislikes. We introduce the tasks of extracting and inferring personal attributes from human-human dialogue, and analyze the linguistic demands of these tasks. To meet these challenges, we introduce a simple and extensible model that combines an autoregressive language model utilizing constrained attribute generation with a discriminative reranker. Our model outperforms strong baselines on extracting personal attributes as well as inferring personal attributes that are not contained verbatim in utterances and instead requires commonsense reasoning and lexical inferences, which occur frequently in everyday conversation. Finally, we demonstrate the benefit of incorporating personal attributes in social chit-chat and task-oriented dialogue settings. PDF 1 2022
Cross-stitched Multi-modal Encoders In this paper, we propose a novel architecture for multi-modal speech and text input. We combine pretrained speech and text encoders using multi-headed cross-modal attention and jointly fine-tune on the target problem. The resultant architecture can be used for continuous token-level classification or utterance-level prediction acting on simultaneous text and speech. The resultant encoder efficiently captures both acoustic-prosodic and lexical information. We compare the benefits of multi-headed attention-based fusion for multi-modal utterance-level classification against a simple concatenation of pre-pooled, modality-specific representations. Our model architecture is compact, resource efficient, and can be trained on a single consumer GPU card. PDF 1 2022
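A minimal sketch of the fusion step under stated assumptions: text states attend over speech states with standard multi-headed cross-attention plus a residual connection. The dimensions and the single-layer form are illustrative choices, not the paper's full architecture.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Text-queries-over-speech cross-attention block (illustrative)."""
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text: torch.Tensor, speech: torch.Tensor) -> torch.Tensor:
        # text: (batch, n_tokens, d); speech: (batch, n_frames, d)
        fused, _ = self.attn(query=text, key=speech, value=speech)
        return self.norm(text + fused)         # residual + layer norm

x_text = torch.randn(2, 10, 256)
x_speech = torch.randn(2, 50, 256)
print(CrossModalFusion()(x_text, x_speech).shape)  # torch.Size([2, 10, 256])
```

The fused token-level states can feed a token classifier directly, or be pooled for utterance-level prediction.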
Exact Paired Permutation Testing Algorithms for NLP Systems Significance testing has played a vital role in the development of NLP systems, providing confidence that one system is indeed better than another one. However, many significance tests involve hard computation problems, and so we rely on approximation methods such as Monte Carlo sampling. In this paper, we provide an exact dynamic programming algorithm that runs in quadratic time in the size of the dataset and performs the paired permutation test, a widely used test in comparing two systems, for the case of comparing accuracies between two classification systems. We show that Monte Carlo approximations are often too noisy to reliably determine whether we can reject the null hypothesis at a significance level of $\alpha \approx 0.05$ for any number of sentences $N$. Additionally, we show that our exact algorithm is more efficient than the approximation algorithm for $N\le 10K$. PDF 1 2022
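For the accuracy statistic, per-item score differences lie in {-1, 0, +1}, so the exact sign-flip null distribution collapses to a binomial. The sketch below computes the exact two-sided p-value for this special case; the paper's quadratic-time dynamic program is the more general tool, and this closed form is shown only to make the test concrete.

```python
from math import comb

def exact_paired_permutation_p(correct_a: list[int], correct_b: list[int]) -> float:
    """Exact paired sign-flip test for the difference in accuracy between
    two classifiers over the same evaluation items."""
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    t_obs = abs(sum(diffs))                    # observed |accuracy gap| * N
    m = sum(d != 0 for d in diffs)             # tied items contribute nothing
    if m == 0:
        return 1.0
    # Under the null, each non-tied item flips sign with probability 1/2,
    # so the statistic is S = 2k - m with k ~ Binomial(m, 1/2).
    hits = sum(comb(m, k) for k in range(m + 1) if abs(2 * k - m) >= t_obs)
    return hits / 2 ** m

a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
b = [0, 1, 0, 0, 1, 0, 0, 1, 0, 1]
print(exact_paired_permutation_p(a, b))        # -> 0.125 (two-sided)
```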
Polling Latent Opinions: A Method for Computational Sociolinguistics Using Transformer Language Models Text analysis of social media for sentiment, topics, and other purposes depends initially on the selection of keywords and phrases that will be used to create the research corpora. However, keywords that researchers choose may occur infrequently, leading to errors that arise from using small samples. In this paper, we use the capacity for memorization, interpolation, and extrapolation of Transformer Language Models such as the GPT series to learn the linguistic behaviors of a subgroup within larger corpora of Yelp reviews. We then use prompt-based queries to generate synthetic text that can be analyzed to produce insights into specific opinions held by the populations that the models were trained on. Once learned, more specific sentiment queries can be made of the model with high levels of accuracy when compared to traditional keyword searches. We show that even in cases where a specific keyphrase is limited or not present at all in the training corpora, GPT is able to accurately generate large volumes of text that have the correct sentiment. PDF 1 2022
Inherently Explainable Reinforcement Learning in Natural Language We focus on the task of creating a reinforcement learning agent that is inherently explainable---with the ability to produce immediate local explanations by thinking out loud while performing a task and analyzing entire trajectories post-hoc to produce temporally extended explanations. This Hierarchically Explainable Reinforcement Learning agent (HEX-RL) operates in Interactive Fictions, text-based game environments in which an agent perceives and acts upon the world using textual natural language. These games are usually structured as puzzles or quests with long-term dependencies in which an agent must complete a sequence of actions to succeed---providing ideal environments in which to test an agent's ability to explain its actions. Our agent is designed to treat explainability as a first-class citizen, using an extracted symbolic knowledge graph-based state representation coupled with a Hierarchical Graph Attention mechanism that points to the facts in the internal graph representation that most influenced the choice of actions. Experiments show that this agent provides significantly improved explanations over strong baselines, as rated by human participants generally unfamiliar with the environment, while also matching state-of-the-art task performance. PDF 1 2022
Continual Learning for Seq2Seq Generations with Transformer Calibration Conventional NLP generation models are trained offline with a given dataset for a particular task, which is referred to as isolated learning. Research on sequence-to-sequence language generation aims to study continual learning models that constantly learn from sequentially encountered tasks. However, continual learning studies often suffer from catastrophic forgetting, a persistent challenge for lifelong learning. In this paper, we present a novel NLP transformer model which attempts to mitigate catastrophic forgetting in online continual learning from a new perspective, i.e., attention calibration. We model the attention in the transformer as a calibrated unit in a general formulation, where the attention calibration could give benefits to balance the stability and plasticity of continual learning algorithms through influencing both their forward inference path and backward optimization path. Our experiments on paraphrase generation show that this work outperforms SOTA models by a considerable margin and greatly reduces forgetting. PDF 1 2022
SSCAE: A Novel Semantic, Syntactic, and Context-Aware Natural Language Adversarial Example Generator Training a machine learning model with adversarial examples (AEs) improves its robustness against adversarial attacks. Hence, it is crucial to develop effective generative models to produce high-quality AEs. Developing such models has been much slower in natural language processing (NLP). The current state-of-the-art in NLP generates AEs that are somewhat human-detectable and/or include semantic and linguistic defects. This paper introduces a novel, practical, and efficient adversarial attack model called SSCAE for Semantic, Syntactic, and Context-aware natural language Adversarial Examples generator. SSCAE generates humanly imperceptible context-aware AEs that preserve semantic consistency and the source language's syntactical and grammatical requirements. The effectiveness and superiority of the proposed SSCAE model are illustrated over eleven comparative experiments, extensive ablation studies, and human evaluations. PDF 1 2022
Hybrid Semantic Type Representation for Zero-Shot Event Extraction Event extraction is a significant task in natural language processing. However, it is labor-intensive to get annotations when generalizing to new event types and ontologies. In this paper, we propose the HTR (Hybrid Type Representation) framework for zero-shot event extraction. We make a distinction in abstraction level between events and roles, analyze role semantics, and propose a new representation approach, LRDB (label-related description-based), which is effective both for argument classification and for collaboration with trigger extraction. We conduct extensive evaluation on the ACE2005 dataset and achieve state-of-the-art results. PDF 1 2022
Testing the Ability of Language Models to Interpret Figurative Language Figurative and metaphorical language are commonplace in discourse, and figurative expressions play an important role in communication and cognition. However, figurative language has been a relatively under-studied area in NLP, and it remains an open question to what extent modern language models can interpret nonliteral phrases. To address this question, we introduce Fig-QA, a Winograd-style nonliteral language understanding task consisting of correctly interpreting paired figurative phrases with divergent meanings. We evaluate the performance of several state-of-the-art language models on this task, and find that although language models achieve performance significantly over chance, they still fall short of human performance, particularly in zero- or few-shot settings. This suggests that further work is needed to improve the nonliteral reasoning capabilities of language models. PDF 1 2022
Hardness Masking via Auto-Regressive Language Model Pre-trained masked language models have achieved tremendous success in natural language processing. Most of these methods rely on recovering randomly masked tokens, which is in general not as good as masking tokens based on how well the model can predict them. However, it is costly for a large-scale model to self-identify tokens that it still struggles to predict. On the other hand, we observe that a smaller language model can often effectively find what a large model fails to learn. Inspired by this observation, we propose to leverage a compact bi-directional auto-regressive language model to dynamically discover tokens that a large language model has not learned well and guide its training via hardness masking. Comprehensive experiments demonstrate that our masking method can effectively boost the performance of pre-trained language models on general language understanding benchmarks. PDF 1 2022
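A hedged sketch of the selection rule, assuming per-token log-probabilities from a small guide LM are already available; the actual guide model, masking budget, and selection rule in the paper may differ.

```python
import numpy as np

def hardness_mask(token_logprobs: np.ndarray, mask_ratio: float = 0.15) -> np.ndarray:
    """Mask the tokens the guide LM predicts worst, so the large model's
    training focuses on what is still hard (illustrative rule)."""
    n_mask = max(1, int(len(token_logprobs) * mask_ratio))
    hardest = np.argsort(token_logprobs)[:n_mask]   # lowest log-prob first
    mask = np.zeros(len(token_logprobs), dtype=bool)
    mask[hardest] = True
    return mask

logprobs = np.array([-0.1, -4.2, -0.3, -2.8, -0.2, -0.5, -3.1, -0.4])
print(hardness_mask(logprobs, 0.25))   # masks the two least-predictable tokens
```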
Twitter-COMMs: Detecting Climate, COVID, and Military Multimodal Misinformation Detecting out-of-context media, such as "miscaptioned" images on Twitter, is a relevant problem, especially in domains of high public significance. In this work we aim to develop defenses against such misinformation for the topics of Climate Change, COVID-19, and Military Vehicles. We first present a large-scale multimodal dataset with over 884k tweets relevant to these topics. Next, we propose a detection method, based on the state-of-the-art CLIP model, that leverages automatically generated hard image-text mismatches. While this approach works well on our automatically constructed out-of-context tweets, we aim to validate its usefulness on data representative of the real world. Thus, we test it on a set of human-generated fakes, created by mimicking in-the-wild misinformation. We achieve an 11% detection improvement in a high precision regime over a strong baseline. Finally, we share insights about our best model design and analyze the challenges of this emerging threat. PDF 1 2022
Open Domain Response Generation Guided by Retrieved Conversations Open domain response generation is the task of creating a response given a user query on any topic or domain. Limited by context and reference information, responses generated by current systems are often "bland" or generic. In this paper, we combine a response generation model with a retrieval system that searches for relevant utterances and responses, and extracts keywords from the retrieved results to guide the response generation. Our model uses a keyword extraction module to extract two types of keywords in an unsupervised fashion: (1) keywords in the query not found in the retrieved utterances (DIFFKEY), and (2) overlapping keywords among the retrieved responses (SIMKEY). Given these keywords, we use a two-stage transformer that first decides where to insert the keywords in the response, and then generates the full response given the location of the keywords. The keyword extraction module and the two-stage transformer are connected in a single network, and so our system is trained end-to-end. Experimental results on the Cornell Movie-Dialog corpus, Douban and Weibo demonstrate that our model outperforms state-of-the-art systems in terms of ROUGE, relevance scores and human evaluation. Source code of our model is available at: ANONYMISED. PDF 1 2022
Cross-Lingual Event Detection via Optimized Adversarial Training In this work, we focus on Cross-Lingual Event Detection where a model is trained on data from a source language but its performance is evaluated on data from a second, target, language. Most recent works in this area have harnessed the language-invariant qualities displayed by pre-trained Multi-lingual Language Models. Their performance, however, reveals there is room for improvement as they mishandle delicate cross-lingual instances. We employ Adversarial Language Adaptation to train a Language Discriminator to discern between the source and target languages using unlabeled data. The discriminator is trained in an adversarial manner so that the encoder learns to produce refined, language-invariant representations that lead to improved performance. More importantly, we optimize the adversarial training by only presenting the discriminator with the most informative samples. We base our intuition about what makes a sample informative on two disparate metrics: sample similarity and event presence. Thus, we propose using Optimal Transport as a solution to naturally combine these two distinct information sources into the selection process. Extensive experiments on 8 different language pairs, using 4 languages from unrelated families, show the flexibility and effectiveness of our model that achieves new state-of-the-art results. PDF 1 2022
Slangvolution: A Causal Analysis of Semantic Change and Frequency Dynamics in Slang All living languages are continually undergoing changes, and the mechanisms that underlie language change are still a matter of debate. In this work, we approach language change through the lens of causality in order to model not only how various distributional factors associate with language change, but how they causally affect it. In particular, we study slang, which is an informal language that is typically restricted to a specific group or social setting. We analyze the semantic change and frequency shift of slang words and compare them to those of standard, nonslang words. With causal discovery and causal inference techniques, we measure the effect that word type (slang/nonslang) has on both semantic change and frequency shift, as well as its relationship to frequency, polysemy and part of speech. Our analysis provides some new insights in the study of semantic change, e.g., we show that slang words undergo less semantic change but tend to have larger frequency shifts over time. PDF 1 2022
Looking Into the Black Box - How Are Idioms Processed in BERT? Idioms such as ``call it a day'' and ``piece of cake'' are ubiquitous in natural language. How are idioms processed by language models such as BERT? This study investigates this question with three experiments: (1) an analysis of embedding similarities of idiomatic sentences and their literal spelled-out counterparts, (2) an analysis of word embeddings when the word appears in an idiomatic versus literal context, and (3) an attention analysis of words when they appear in an idiomatic versus literal context. Each of these three experiments analyses results across all layers of BERT. Experiment 1 shows that the cosine similarity of the embeddings of an idiom sentence and its spelled-out counterpart increases the deeper the layer. However, when compared to random controls, layer 8 is where the spelled-out counterpart is ranked highest in embedding similarity. Experiment 2 shows that the embeddings of single words in idiomatic versus literal contexts diverge and likewise become most different in layer 8. Experiment 3 shows that other sentence tokens pay less attention to a word inside an idiom compared to the same word in a literal sentence. Overall, the study suggests that BERT ``understands'' idiomatic expressions, and that it processes them more akin to a syntactic phenomenon than a purely semantic one. A mechanism for this understanding in BERT is attention, which illustrates that idioms are semantically and syntactically idiosyncratic. PDF 1 2022
Learning to Borrow: Relation Representation for Without-Mention Entity-Pairs for Knowledge Graph Completion Prior work on integrating text corpora with knowledge graphs (KGs) to improve Knowledge Graph Embedding (KGE) has obtained good performance for entities that co-occur in sentences in text corpora. Such sentences (textual mentions of entity-pairs) are represented as Lexicalised Dependency Paths (LDPs) between two entities. However, it is not possible to represent relations between entities that do not co-occur in a single sentence using LDPs. In this paper, we propose and evaluate several methods to address this problem, where we \emph{borrow} LDPs from the entity pairs that co-occur in sentences in the corpus (i.e. \emph{with mentions} entity pairs) to represent entity pairs that do \emph{not} co-occur in any sentence in the corpus (i.e. \emph{without mention} entity pairs). We propose a supervised borrowing method, \emph{SuperBorrow}, that learns to score the suitability of an LDP to represent a without-mentions entity pair using pre-trained entity embeddings and contextualised LDP representations. Experimental results show that SuperBorrow improves the link prediction performance of multiple widely-used prior KGE methods such as TransE, DistMult, ComplEx and RotatE. PDF 1 2022
Investigating Zero- and Few-shot Generalization in Fact Verification We explore zero- and few-shot generalization for fact verification (FV), which aims to generalize the FV model trained on well-resourced domains (e.g., Wikipedia) to low-resourced domains that lack human annotations. To this end, we first construct a benchmark dataset collection which contains 11 FV datasets representing 6 domains. We conduct an empirical analysis of generalization across these FV datasets, finding that current models generalize poorly. Our analysis reveals that several factors affect generalization, including dataset size, length of evidence, and the type of claims. Finally, we show that two directions of work improve generalization: 1) incorporating domain knowledge via pretraining on specialized domains, and 2) automatically generating training data via claim generation. PDF 1 2022
Sense Embeddings are also Biased -- Evaluating Social Biases in Static and Contextualised Sense Embeddings Sense embedding learning methods learn different embeddings for the different senses of an ambiguous word. One sense of an ambiguous word might be socially biased while its other senses remain unbiased. In comparison to the numerous prior works evaluating the social biases in pretrained word embeddings, the biases in sense embeddings have been relatively understudied. In this paper, we create a benchmark dataset for evaluating the social biases in sense embeddings and propose novel sense-specific bias evaluation measures. We conduct an extensive evaluation of multiple static and contextualised sense embeddings for various types of social biases using the proposed measures. Our experimental results show that even in cases where no biases are found at the word level, there still exist worrying levels of social biases at the sense level, which are often ignored by word-level bias evaluation measures. PDF 1 2022
Bilingual Lexicon Induction for Low-Resource Languages using Graph Matching via Optimal Transport Bilingual lexicons form a critical component of various NLP applications, including unsupervised and semisupervised machine translation and crosslingual information retrieval. In this work, we improve bilingual lexicon induction performance across 32 diverse language pairs with a graph-matching method based on optimal transport. The method is especially strong with very low amounts of supervision. PDF 1 2022
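The abstract does not spell out its graph-matching objective, so the sketch below shows only the standard primitive such methods build on: Sinkhorn iterations for entropy-regularized optimal transport between two embedding sets. All names and hyperparameters are illustrative:

```python
# A generic Sinkhorn iteration for entropy-regularized optimal transport,
# the common building block behind OT-based matching; this is not the
# paper's exact graph-matching procedure.
import numpy as np

def sinkhorn(cost: np.ndarray, eps: float = 0.05, n_iters: int = 200):
    """Return a soft matching (transport plan) under uniform marginals.

    cost: (n, m) pairwise cost matrix, e.g. distances between word embeddings.
    """
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)   # uniform marginals
    K = np.exp(-cost / eps)                            # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]                 # transport plan

# A hard lexicon can be read off row-wise, e.g. plan.argmax(axis=1).
```

For very small eps a log-domain implementation is numerically safer; this plain form is enough to convey the idea.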
OneAligner: Zero-shot Cross-lingual Transfer with One Rich-Resource Language Pair for Low-Resource Sentence Retrieval Aligning parallel sentences in multilingual corpora is essential to curating data for downstream applications such as Machine Translation. In this work, we present OneAligner, an alignment model specially designed for sentence retrieval tasks. This model is able to train on only one language pair and transfers, in a cross-lingual fashion, to low-resource language pairs with negligible degradation in performance. When trained with all language pairs of a large-scale parallel multilingual corpus (OPUS-100), this model achieves the state-of-the-art result on the Tatoeba dataset, outperforming an equally-sized previous model by $8.0$ points in accuracy while using less than $0.6\%$ of their parallel data. When finetuned on a single rich-resource language pair, be it English-centered or not, our model is able to match the performance of the ones finetuned on all language pairs under the same data budget with less than $2.0$ points decrease in accuracy. Furthermore, with the same setup, scaling up the number of rich-resource language pairs monotonically improves the performance, reaching a minimum of $0.4$ points discrepancy in accuracy, making it less mandatory to collect any low-resource parallel data. Finally, we conclude through empirical results and analyses that the performance of the sentence alignment task depends mostly on the monolingual and parallel data size, up to a certain size threshold, rather than on what language pairs are used for training or evaluation. PDF 1 2022
Orthogonal Language and Task Adapters in Zero-Shot Cross-Lingual Transfer Adapter modules have recently been used for efficient fine-tuning and language specialization of massively multilingual Transformers (MMTs), improving downstream zero-shot cross-lingual transfer. In this work, we propose orthogonal language and task adapters (dubbed orthoadapters) for cross-lingual transfer. They are trained to encode language- and task-specific information that is complementary (i.e., orthogonal) to the knowledge already stored in the pretrained MMT parameters. Our zero-shot transfer experiments, involving three tasks and 10 diverse languages, 1) point to the usefulness of orthoadapters in cross-lingual transfer, especially for the most complex NLI task, but also 2) indicate that the optimal (ortho)adapter configuration highly depends on the task and the target language at hand. We hope that our work will motivate a wider investigation of the usefulness of orthogonality constraints in language- and task-specific fine-tuning of pretrained transformers. PDF 1 2022
Forecasting COVID-19 Caseloads Using Unsupervised Embedding Clusters of Social Media Posts We present a novel approach incorporating transformer-based language models into infectious disease modelling. Text-derived features are quantified by tracking high-density clusters of sentence-level representations of Reddit posts within specific US states' COVID-19 subreddits. We benchmark these clustered embedding features against features extracted from other high-quality datasets. In a threshold-classification task, we show that they outperform all other feature types at predicting upward trend signals, a significant result for infectious disease modelling in areas where epidemiological data is unreliable. Subsequently, in a time-series forecasting task, we fully utilise the predictive power of the caseload and compare the relative strengths of using different supplementary datasets as covariate feature sets in a transformer-based time-series model. PDF 1 2022
Measuring Context-Dependent Syntactic Information Across Layers Probing studies have extensively explored where in neural language models linguistic information is located. While probing classifiers are a common instrument to approach such questions, it is less clear what evaluation metrics to choose, how to compare probes, and which baselines to use. We identify angles from which the question of how linguistic information is structured within a model can be approached, and propose two new setups that fill the gap of explicitly modelling local information gain compared to the previous layer. We apply the new setups, along with two from the literature, to probe models for a syntactic property that explicitly requires context to be retrieved: part-of-speech tags that are not the most common for a specific token. We test the hypothesis that more information is retrieved in deeper layers than for the most common tags, and find that while this is often true, the manifestation varies among metrics and models in different languages. PDF 1 2022
Improving Coherence of Language Model Generation with Latent Semantic State Sentences generated by neural language models (LMs) often suffer from coherence errors: they describe events and situations inconsistent with the state of the world described by preceding text. We show that coherence errors can arise at multiple stages of LM computation, and describe a procedure for distinguishing errors in inferring state from errors in generating sentences. In models with correctable errors of the first type, we show that targeted supervision can address them. We introduce two procedures for using explicit representations of world state as auxiliary supervision. These procedures efficiently improve LM coherence, in some cases providing the benefits of 1,000-9,000 training examples with only 500 state annotations. PDF 1 2022
Massive-scale Decoding for Text Generation using Lattices Conditional neural text generation models generate high-quality outputs, but often concentrate around a mode when what we really want is a diverse set of options. We present a search algorithm to construct lattices encoding a massive number of generation options. First, we restructure decoding as a best-first search, which explores the space differently than beam search and improves efficiency by avoiding pruning paths. Second, we revisit the idea of hypothesis recombination: we can identify pairs of similar generation candidates during search and merge them as an approximation. On both summarization and MT, we show that our algorithm encodes thousands of diverse options that remain grammatical and high-quality into one lattice. This algorithm provides a foundation for building downstream generation applications on top of massive-scale diverse outputs. PDF 1 2022
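To make the search strategy concrete, here is a toy best-first decoder with suffix-based hypothesis recombination. It keeps only the best path per recombination key, whereas the paper's algorithm merges hypotheses into a lattice that preserves all options; `next_token_scores` is a hypothetical stand-in for a real model:

```python
# A toy best-first search with hypothesis recombination: hypotheses ending in
# the same recent n-gram are treated as the same search state, so the explored
# space folds into a lattice-like structure rather than a tree.
import heapq

def best_first_decode(next_token_scores, bos, eos, max_steps=1000, ngram=3):
    """next_token_scores(tokens) -> iterable of (token, log_prob) pairs."""
    frontier = [(0.0, (bos,))]       # (accumulated cost, token sequence)
    seen = {}                        # recombination key -> best cost so far
    finished = []
    while frontier and max_steps > 0:
        max_steps -= 1
        cost, toks = heapq.heappop(frontier)
        if toks[-1] == eos:
            finished.append((-cost, toks))         # store total log prob
            continue
        for tok, logp in next_token_scores(toks):
            new = toks + (tok,)
            key = new[-ngram:]                     # recombine on suffix n-gram
            new_cost = cost - logp                 # cost = negative log prob
            if key in seen and seen[key] <= new_cost:
                continue                           # a better path reached this state
            seen[key] = new_cost
            heapq.heappush(frontier, (new_cost, new))
    return finished                                # may be empty if budget runs out
```

Unlike beam search, nothing is pruned by a fixed width; exploration order is driven purely by the running score.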
Towards Understanding Large-Scale Discourse Structures in Pre-Trained and Fine-Tuned Language Models In this paper, we extend the line of BERTology work by focusing on the important, yet less explored, alignment of pre-trained and fine-tuned PLMs with large-scale discourse structures. We propose a novel approach to infer discourse information for arbitrarily long documents. In our experiments, we find that the captured discourse information is local and general, even across a collection of fine-tuning tasks. We compare the inferred discourse trees with supervised, distantly supervised and simple baselines to explore the structural overlap, finding that constituency discourse trees align well with supervised models, yet contain complementary discourse information. Lastly, we individually explore self-attention matrices to analyze the information redundancy. We find that similar discourse information is consistently captured in the same heads. PDF 1 2022
Improved grammatical error correction by ranking elementary edits We offer a two-stage reranking method for grammatical error correction: the first model serves as edit generator, while the second classifies the proposed edits as correct or false. We show how to use both encoder-decoder and sequence labeling models for the first step of our pipeline. We achieve state-of-the-art quality on the BEA 2019 English dataset even with the weaker BERT-GEC edit generator. Combining our roberta-base scorer with the state-of-the-art GECToR edit generator, we surpass GECToR by $2-3\%$. With a larger model we establish a new SOTA on the BEA development and test sets. Our model also sets a new SOTA on Russian, despite using smaller models and less data than the previous approaches. PDF 1 2022
Improving Data Augmentation in Low-resource Question Answering with Active Learning in Multiple Stages Neural approaches have become very popular in the domain of Question Answering; however, they require a large amount of annotated data. Furthermore, they often yield very good performance but only in the domain they were trained on. In this work we propose a novel approach that combines data augmentation via question-answer generation and active learning to improve performance in low-resource settings, where the target domain is vastly different from the source domain. Furthermore, we investigate data augmentation via generation for question answering in three different low-resource settings relevant in practice and how this can be improved: 1) no labels for the target domain, 2) static, labelled data for the target domain, and 3) an Active Learning approach with labels for the target domain provided by an expert. In all settings we assume a sufficient amount of labelled data from the source domain is available. We perform extensive experiments in each of the above conditions. Our findings show that our novel approach, which combines data augmentation with active learning, boosts performance in the low-resource, domain-specific setting, allowing for low-labelling-effort question answering systems in new, specialized domains. They further demonstrate how to best utilize data augmentation to boost performance in these settings. PDF 1 2022
Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs Do language models have beliefs about the world? Dennett (1995) famously argues that even thermostats have beliefs, on the view that a belief is simply an informational state decoupled from any motivational state. In this paper, we discuss approaches to detecting when models have beliefs about the world, updating model beliefs, and visualizing beliefs graphically. Our main contributions include: (1) new metrics for evaluating belief-updating methods focusing on the logical consistency of beliefs, (2) a training objective for Sequential, Local, and Generalizing updates (SLAG) that improves the performance of learned optimizers for updating beliefs, and (3) the introduction of the belief graph, a new form of interface with language models showing the interdependencies between model beliefs. Our experiments suggest that models possess belief-like qualities to only a limited extent, but update methods can both fix incorrect model beliefs and greatly improve their consistency. Although off-the-shelf optimizers are surprisingly strong belief-updating baselines, our learned optimizers can outperform them in more difficult settings than have been considered in past work. PDF 1 2022
Zombies Eat Brains, You are Safe: A Knowledge Infusion based Multitasking System for Sarcasm Detection in Meme Sarcasm detection is, in itself, a challenging task in the field of Natural Language Processing (NLP), and the task even becomes more complex when the target is a meme. In this paper, we first hypothesize that sarcasm detection is closely associated with emotions present in the meme. We propose a deep learning-based multitask model to perform these two tasks in parallel, where sarcasm detection is the primary task, whereas emotion recognition is an auxiliary task. Furthermore, we propose a novel knowledge infusion (KI) method to get a sentiment-aware knowledge representation on top of our multitasking model. This sentiment-aware knowledge representation is obtained from a pre-trained parent model and subsequently, this representation is used via a novel Gating Mechanism to train our downstream multitasking model. For training and evaluation purposes, we created a large-scale dataset consisting of 7,416 sample Hindi memes as there was no readily available dataset for building such multimodal systems. We collect the Hindi memes from various domains, such as political, religious, racist, and sexist, and manually annotate each instance with three sarcasm categories, i.e., (i) Not Sarcastic, (ii) Mildly Sarcastic, or (iii) Highly Sarcastic, and 13 fine-grained emotion classes. We demonstrate the effectiveness of our proposed work through extensive experiments. The experimental results show that our proposed system achieves a 64.48% macro F1-score, outperforming all the baseline models. Finally, we note that our proposed system is model agnostic and can be used with any downstream model in practice. We will make the resources and code available. PDF 1 2022
Visual content classifier for cultural heritage repositories This work presents a novel approach for the automatic creation of an aligned image / text training set for the generation of descriptions of the visual content of artworks. To do this, we develop a classification tool based on a mix of heuristic rules and deep learning. This classifier is able to identify statements that describe visual art content, out of complex cultural heritage text that contains a mix of many other types of information on context, medium, author, etc. Our results are very promising when tested on texts from the Museo del Prado collections. PDF 1 2022
Conceptualizing Treatment Leakage in Text-based Causal Inference Causal inference methods that control for text-based confounders are becoming increasingly important in the social sciences and other disciplines where text is readily available. However, these methods rely on a critical assumption that there is no treatment leakage: that is, the text contains only information about the confounder and no information about treatment assignment (leading to post-treatment bias). However, this assumption may be unrealistic in real-world situations involving text, as human language is rich and flexible. We first define the leakage problem, discussing the identification and estimation challenges it raises. We also discuss the conditions under which leakage can be addressed by removing the treatment-related signal from the text in a pre-processing step we define as \emph{text distillation}. Then, using simulation, we investigate the mechanics of treatment leakage on estimates of the average treatment effect (ATE). PDF 1 2022
FRUIT: Faithfully Reflecting Updated Information in Text Textual knowledge bases such as Wikipedia require considerable effort to keep up to date and consistent. While automated writing assistants could potentially ease this burden, the problem of suggesting edits grounded in external knowledge has been under-explored. In this paper, we introduce the novel generation task of *faithfully reflecting updated information in text* (FRUIT) where the goal is to update an existing article given new evidence. We release the FRUIT-WIKI dataset, a collection of over 170K distantly supervised examples produced from pairs of Wikipedia snapshots, along with our data generation pipeline and a gold evaluation set of 914 instances whose edits are guaranteed to be supported by the evidence. We provide benchmark results for popular generation systems as well as EDIT5, a T5-based approach tailored to editing, which we introduce and which establishes the state of the art. Our analysis shows that developing models that can update articles faithfully requires new capabilities for neural generation models, and opens doors to many new applications. PDF 1 2022
When a sentence does not introduce a discourse entity, Transformer-based models still often refer to it Understanding longer narratives or participating in conversations requires tracking of discourse entities that have been mentioned. Indefinite noun phrases, such as 'a dog', frequently introduce discourse entities but this behavior is modulated by sentential operators such as negation. For example, 'a dog' in 'Arthur doesn't own a dog' does not introduce a discourse entity due to the presence of negation. In this work, we adapt the psycholinguistic assessment of language models paradigm to higher-level linguistic phenomena and introduce an English evaluation suite that targets the knowledge of the interactions between sentential operators and indefinite noun phrases. We use this evaluation suite for a fine-grained investigation of the entity tracking abilities of the Transformer-based models GPT-2 and GPT-3. We find that while the models are to a certain extent sensitive to the interactions we investigate, they are all challenged by the presence of multiple noun phrases and their behavior is not systematic, which suggests that even models at the scale of GPT-3 do not fully acquire basic entity tracking abilities. PDF 1 2022
Batch-Softmax Contrastive Loss for Pairwise Sentence Scoring Tasks The use of contrastive loss for representation learning has become prominent in computer vision, and it is now getting attention in Natural Language Processing (NLP). Here, we explore the idea of using a batch-softmax contrastive loss when fine-tuning large-scale pre-trained transformer models to learn better task-specific sentence embeddings for pairwise sentence scoring tasks. We introduce and study a number of variations in the calculation of the loss as well as in the overall training procedure; in particular, we find that a special data shuffling can be quite important. Our experimental results show sizable improvements on a number of datasets and pairwise sentence scoring tasks including classification, ranking, and regression. Finally, we offer detailed analysis and discussion, which should be useful for researchers aiming to explore the utility of contrastive loss in NLP. PDF 1 2022
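The generic form of an in-batch softmax contrastive loss is compact enough to show directly. This is a minimal sketch of that form (a symmetric softmax over in-batch similarities), not the paper's exact variants or shuffling scheme; the temperature value is a placeholder:

```python
# A minimal in-batch softmax contrastive loss for paired sentence embeddings:
# each sentence's positive is its own pair; every other in-batch pair serves
# as a negative.
import torch
import torch.nn.functional as F

def batch_softmax_contrastive(emb_a: torch.Tensor,
                              emb_b: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """emb_a, emb_b: (batch, dim) embeddings of the two sides of each pair."""
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.T / temperature                 # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric cross-entropy: the correct match sits on the diagonal.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```

Because negatives come from the batch itself, how examples are shuffled into batches changes which negatives the model sees, which is why the paper's data-shuffling choices matter.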
Falsesum: Generating Document-level NLI Examples for Recognizing Factual Inconsistency in Summarization Neural abstractive summarization models are prone to generate summaries that are factually inconsistent with their source documents. Previous work has introduced the task of recognizing such factual inconsistency as a downstream application of natural language inference (NLI). However, state-of-the-art NLI models perform poorly in this context due to their inability to generalize to the target task. In this work, we show that NLI models can be effective for this task when the training data is augmented with high-quality task-oriented examples. We introduce Falsesum, a data generation pipeline leveraging a controllable text generation model to perturb human-annotated summaries, introducing varying types of factual inconsistencies. Unlike previously introduced document-level NLI datasets, our generated dataset contains examples that are diverse and inconsistent yet plausible. We show that models trained on a Falsesum-augmented NLI dataset improve the state-of-the-art performance across four benchmarks for detecting factual inconsistency in summarization. PDF 1 2022
Multi-Stage Pre-Training for Math-Understanding: $\mu^2$(AL)BERT Understanding mathematics requires not only comprehending natural language, but also mathematical notation. For mathematical language modeling, current pre-training methods for transformer-based language models which were originally developed for natural language need to be adapted. In this work, we propose a multi-stage pre-training scheme including natural language and mathematical notation that is applied on ALBERT and BERT, resulting in two models that can be fine-tuned for downstream tasks: $\mu^2$ALBERT and $\mu^2$BERT. We show that both models outperform the current state-of-the-art model on Answer Ranking. Furthermore, a structural probing classifier is applied in order to test whether operator trees can be reconstructed from the models' contextualized embeddings. PDF 1 2022
Robin: A Novel Online Suicidal Text Corpus of Substantial Breadth and Scale Suicide is a major public health crisis. With more than 20,000,000 suicide attempts each year, the early detection of suicidal intent has the potential to save hundreds of thousands of lives. Traditional mental health screening methods are time-consuming, costly, and often inaccessible to disadvantaged populations; online detection of suicidal intent using machine learning offers a viable alternative. Here we present Robin, the largest non-keyword generated suicidal corpus to date, consisting of over 1.1 million online forum postings. In addition to its unprecedented size, Robin is specially constructed to include various categories of suicidal text, such as suicide bereavement and flippant references, better enabling models trained on Robin to learn the subtle nuances of text expressing suicidal ideation. Experimental results achieve state-of-the-art performance for the classification of suicidal text, both with traditional methods like logistic regression (F1=0.85), as well as with large scale pre-trained language models like BERT (F1=0.92). Finally, we release the Robin dataset publicly as a machine learning resource with the potential to drive the next generation of suicidal sentiment research. PDF 1 2022
Pre-trained language models evaluating themselves - A comparative study Evaluating generated text received new attention with the introduction of model-based metrics in recent years. These new metrics have a higher correlation with human judgments and seemingly overcome many issues of previous n-gram based metrics from the symbolic age. In this work, we examine the recently introduced metrics BERTScore, BLEURT, NUBIA, MoverScore, and Mark-Evaluate (Petersen). We test their sensitivity to different types of semantic deterioration (part-of-speech drop and negation), word order perturbations, word drop, and the common problem of repetition. No metric showed appropriate behaviour for negation, and furthermore no metric was sensitive overall to the other issues mentioned above. PDF 1 2022
BAD-X: Bilingual Adapters Improve Zero-Shot Cross-Lingual Transfer Adapter modules enable modular and efficient zero-shot cross-lingual transfer, where current state-of-the-art adapter-based approaches learn specialized language adapters (LAs) for individual languages. In this work, we show that it is more effective to learn bilingual language pair adapters (BAs) when the goal is to optimize performance for a particular source-target transfer direction. Our novel BAD-X adapter framework trades off some modularity of dedicated LAs for improved transfer performance: we demonstrate consistent gains in three standard downstream tasks, and for the majority of evaluated low-resource languages. PDF 1 2022
Early Guessing for Dialect Identification This paper deals with the problem of incremental dialect identification. Our goal is to reliably determine the dialect before the full utterance is given as input. The major part of the previous research on dialect identification has been model-centric with a focus on performance. We address a new question: How much input is needed to identify a dialect? Our approach is a data-centric analysis that results in general criteria for finding the shortest input needed to make a plausible guess. Working with two sets of dialects (Swiss German and Indo-Aryan languages), we show that the dialect can be identified well before the end of the input utterance. To determine the optimal point for making the first guess, we propose a heuristic that involves calibrated model confidence (temperature scaling) and input length. We show that the same input shortening criteria apply to both of our data sets. While the performance with the early guesses is still below the performance on the full input, the gap is smaller when the overall performance of the fine-tuned model is better. PDF 1 2022
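A minimal sketch of the decision rule described here, assuming the temperature has already been fit on held-out data by minimizing negative log-likelihood (the standard temperature-scaling recipe); the threshold values are placeholders, not the paper's calibrated choices:

```python
# Early guessing for dialect ID: calibrate logits with temperature scaling,
# then commit to a guess once both the calibrated confidence and the input
# length clear their thresholds.
import torch
import torch.nn.functional as F

def early_guess(logits: torch.Tensor, n_tokens: int,
                temperature: float = 1.5,
                conf_threshold: float = 0.9,
                min_tokens: int = 5):
    """logits: (num_dialects,) raw model scores for the current input prefix."""
    probs = F.softmax(logits / temperature, dim=-1)   # calibrated confidence
    conf, pred = probs.max(dim=-1)
    if n_tokens >= min_tokens and conf.item() >= conf_threshold:
        return pred.item()    # make the first guess now
    return None               # keep reading the utterance
```

The rule is applied repeatedly as the utterance streams in, returning a dialect label at the earliest prefix that satisfies both conditions.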
AMRize, then Parse! Enhancing AMR Parsing with PseudoAMR Data As Abstract Meaning Representation (AMR) implicitly involves compound semantic annotations, we hypothesize that auxiliary tasks which are semantically or formally related can better enhance AMR parsing. With carefully designed control experiments, we find that 1) semantic role labeling (SRL) and dependency parsing (DP) bring much more significant performance gains than unrelated tasks in the text-to-AMR transition; 2) to make a better fit for AMR, data from auxiliary tasks should be properly "AMRized" to PseudoAMR before training; 3) the intermediate-task training paradigm outperforms multitask learning when introducing auxiliary tasks to AMR parsing. From an empirical perspective, we propose a principled method to choose, reform and train auxiliary tasks to boost AMR parsing. Extensive experiments show that our method achieves new state-of-the-art performance on in-distribution and out-of-distribution benchmarks of AMR parsing. We will release our code upon acceptance. PDF 1 2022
GREENER: Graph Neural Networks for News Media Profiling We study the problem of profiling news media on the Web with respect to their factuality of reporting and bias. This is an important but under-studied problem related to disinformation and ``fake news'' detection, but it addresses the issue at a coarser granularity compared to looking at an individual article or an individual claim. This is useful as it allows profiling entire media outlets in advance. Unlike previous work, which has focused primarily on text (\emph{e.g.},~on the text of the articles published by the target website, or on the textual description in their social media profiles or in Wikipedia), here our main focus is on modeling the similarity between media outlets based on the overlap of their audience. This is motivated by homophily considerations, {\em i.e.},~the tendency of people to have connections to people with similar interests, which we extend to media, hypothesizing that similar types of media would be read by similar kinds of users. In particular, we propose GREENER (GRaph nEural nEtwork for News mEdia pRofiling), a model that builds a graph of inter-media connections based on their audience overlap, and then uses graph neural networks to represent each medium. We find that such representations, on their own or when augmented with representations for articles and from Twitter, YouTube, Facebook, and Wikipedia, are quite useful for predicting the factuality and the bias of news media outlets, yielding state-of-the-art results on four datasets for the two tasks. PDF 1 2022
How to be Helpful on Online Support Forums? Internet forums such as Reddit offer people a platform to ask for advice when they encounter various issues at work, school or in relationships. Telling helpful comments apart from unhelpful comments to these advice-seeking posts can help people and dialogue agents to become more helpful in offering advice. We propose a dataset that contains both helpful and unhelpful comments in response to such requests. We then relate helpfulness to the closely related construct of empathy. Finally, we analyze the language features that are associated with helpful and unhelpful comments. PDF 1 2022
Conventional clustering-based method for event detection on social networks Social networks are becoming the preferred channel to report and discuss events happening around the world. The information stream such channels contain can be used to detect and describe ongoing events in order to make informed decisions in numerous domains. A typical framework for event detection is to first cluster the stream of tweets, and then analyze the clusters to decide which deal with real-world events. In this context, content representation models and clustering approaches are critical. Classical approaches are usually based on TF-IDF for the representation of the text content and on dynamic clustering for the clustering part. In this paper, we compare TF-IDF with recent text representation models and propose an event detection method based on conventional clustering. We show that, contrary to previous results, language models based on Transformer architectures are competitive with TF-IDF. We also show that our approach outperforms the most used approach in the literature. PDF 1 2022
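The classical pipeline this abstract contrasts against fits in a few lines of scikit-learn. The tweets and cluster count below are placeholders, and swapping the vectorizer for transformer sentence embeddings gives the alternative representation the paper evaluates:

```python
# A minimal TF-IDF + conventional clustering baseline for event detection.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

tweets = ["earthquake hits city center",
          "tremor felt downtown this morning",
          "new phone released today",
          "smartphone launch event livestream"]

# Represent tweets as TF-IDF vectors, then cluster with a conventional method.
X = TfidfVectorizer(min_df=1, stop_words="english").fit_transform(tweets)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# The resulting clusters are then analyzed to decide which correspond to
# real-world events.
print(list(zip(tweets, labels)))
```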
Realistic Zero-Shot Cross-Lingual Transfer in Legal Topic Classification We consider zero-shot cross-lingual transfer in legal topic classification using the recent Multi-EURLEX dataset. Since the original dataset contains parallel documents, which is unrealistic for zero-shot cross-lingual transfer, we develop a new version of the dataset without parallel documents. We use it to show that translation-based methods vastly outperform cross-lingual fine-tuning of multilingually pre-trained models, the best previous zero-shot transfer method for Multi-EURLEX. We also develop a bilingual teacher-student zero-shot transfer approach, which exploits additional unlabeled documents of the target language and performs better than a model fine-tuned directly on labeled target language documents. PDF 1 2022
Lexicon based Fine-tuning of Multilingual Language Models for Sentiment Analysis of Low-resource Languages Massively multilingual language models (MMLM) such as mBERT and XLM-R have shown good cross-lingual transferability. However, they are not specifically trained to capture cross-lingual signals with respect to sentiment words. In this paper, we use a sentiment lexicon of a high-resource language in order to generate an intermediate fine-tuning task for the MMLM, when fine-tuning it for a low-resource sentiment classification task. We show that such a fine-tuning task improves the mapping between similar sentiment words in different languages and improves the sentiment classification task of the low-resource language. PDF 1 2022
DISARM: Detecting the Victims Targeted by Harmful Memes Internet memes have emerged as an increasingly popular means of communication on the web. Although memes are typically intended to elicit humour, they have been increasingly used to spread hatred, trolling, and cyberbullying, as well as to target specific individuals, communities, or society on political, socio-cultural, and psychological grounds. While previous work has focused on detecting harmful, hateful, and offensive memes in general, identifying whom these memes attack (i.e., the `victims') remains a challenging and underexplored area. We attempt to address this problem in this paper. To this end, we create a dataset in which we annotate each meme with its victim(s) such as the name of the targeted person(s), organization(s), and community(ies). We then propose DISARM (Detecting vIctimS targeted by hARmful Memes), a framework that uses named-entity recognition and person identification to detect all entities a meme is referring to, and then, incorporates a novel contextualized multimodal deep neural network to classify whether the meme intends to harm these entities. We perform several systematic experiments on three different test sets, corresponding to entities that are (i) all seen while training, (ii) not seen as a harmful target while training, and (iii) not seen at all while training. The evaluation shows that DISARM significantly outperforms 10 unimodal and multimodal systems. Finally, we demonstrate that DISARM is interpretable and comparatively more generalizable, and that it can reduce the relative error rate of harmful target identification by up to 9% absolute over multimodal baseline systems. PDF 1 2022
CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code Recent work has widely adopted large language model pretraining for source code, suggested source code-specific pretraining objectives, and investigated the applicability of various Transformer-based language model architectures to source code. This work investigates another important aspect of such models, the effect of different subtokenization options, and aims at identifying the most effective and length-efficient subtokenizations, taking source code specifics into account. We propose a subtokenization that reduces average length by 17--40% without a downstream performance drop, and show that a carefully chosen subtokenization may significantly improve quality by 0.5-2%, possibly with some length increase. PDF 1 2022
When does Parameter-Efficient Transfer Learning Work for Machine Translation? We study parameter-efficient transfer learning methods that adapt a pre-trained model by fine-tuning a small number of parameters, for machine translation. We conduct experiments across a diverse set of languages, comparing different fine-tuning methods in terms of (1) parameter budget, (2) language-pair, and (3) different pre-trained models. We show that methods such as adapters and prefix-tuning that add parameters to a pre-trained model perform best. However, methods which fine-tune a subset of existing parameters, e.g. BitFit and cross-attention tuning, are better correlated with pre-trained model capability. Furthermore, we find a large performance variation across language pairs, with parameter-efficient methods particularly struggling for distantly related language-pairs. Finally, we show that increasing model size, but tuning only 0.03% of total parameters, can outperform tuning 100% of the parameters of a smaller model. PDF 1 2022
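BitFit, one of the subset-tuning methods named above, is simple enough to sketch directly: freeze every parameter except the bias terms. This is a generic rendition of bias-only tuning, not the paper's experimental setup:

```python
# BitFit in a few lines: only bias parameters remain trainable, so a tiny
# fraction of the model is tuned.
import torch.nn as nn

def apply_bitfit(model: nn.Module) -> float:
    trainable = total = 0
    for name, param in model.named_parameters():
        param.requires_grad = name.endswith(".bias") or name == "bias"
        trainable += param.numel() if param.requires_grad else 0
        total += param.numel()
    return trainable / total   # fraction of parameters left trainable

# The optimizer should then only receive the trainable subset, e.g.:
# optim = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```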
Why are NLP Models Fumbling at Elementary Math? A Survey of Automatic Word Problem Solvers From the latter half of the last decade, there has been growing interest in developing algorithms for automatically solving mathematical word problems (MWPs). It is an exciting language problem which demands not only surface-level text pattern recognition but also coupling with mathematical reasoning. In spite of the dedicated effort, we are still miles away from building robust representations of elementary math word problems. In this paper, we critically examine the various models that have been developed for solving word problems, their pros and cons, and the challenges ahead. In the last two years, many deep learning models have come out with competing results on benchmark datasets. We take a step back and analyse why, in spite of this, the predominantly used experiment and dataset designs are a stumbling block, and provide a road-map for the future. PDF 1 2022
Improving Contextual Representation with Gloss Regularized Pre-training Though achieving impressive results on many NLP tasks, BERT-like masked language models (MLMs) encounter a discrepancy between pre-training and inference. In light of this gap, we investigate the contextual representation of pre-training and inference from the perspective of word probability distributions. We discover that BERT risks neglecting contextual word similarity in pre-training. To tackle this issue, we propose an auxiliary gloss regularizer module for BERT pre-training (GR-BERT) to enhance word semantic similarity. By predicting masked words and aligning contextual embeddings to corresponding glosses simultaneously, word similarity can be explicitly modeled. We design two architectures for GR-BERT and evaluate our model on downstream tasks. Experimental results show that the gloss regularizer benefits BERT in word-level and sentence-level semantic representation. GR-BERT achieves a new state of the art on the lexical substitution task and greatly improves BERT sentence representations in both unsupervised and supervised STS tasks. PDF 1 2022
Unsupervised Preference-Aware Language Identification Recognizing the language of ambiguous texts has become a main challenge in language identification (LID). When using multilingual applications, users have their own language preferences, which can be regarded as external knowledge for LID. Nevertheless, current studies do not consider these inter-personal variations due to the lack of user-annotated training data. To fill this gap, we introduce preference-aware LID and propose a novel unsupervised learning strategy. Concretely, we construct a pseudo training set for each user by extracting training samples from a standard LID corpus according to his/her historical language distribution. Besides, we contribute the first user-labeled LID test set, called "U-LID". Experimental results reveal that our model can capture user traits and significantly outperforms existing LID systems on handling ambiguous texts. Our code and dataset are released at XXX. PDF 1 2022
Generalized Quantifiers as a Source of Error in Multilingual NLU Benchmarks Logical approaches to representing language have developed and evaluated computational models of quantifier words since the 19th century, but today's NLU models still struggle to capture their semantics. We rely on Generalized Quantifier Theory for language-independent representations of the semantics of quantifier words, to quantify their contribution to the errors of NLU models. We find that quantifiers are pervasive in NLU benchmarks, and their occurrence at test time is associated with performance drops. Multilingual models also exhibit unsatisfying quantifier reasoning abilities, but not necessarily worse for non-English languages. To facilitate directly-targeted probing, we present an adversarial generalized quantifier NLI task (GQNLI) and show that pre-trained language models have a clear lack of robustness in generalized quantifier reasoning. PDF 1 2022
Meet Your Favorite Character: Open-domain Chatbot Mimicking Fictional Characters with only a Few Utterances In this paper, we consider mimicking fictional characters as a promising direction for building engaging conversation models. To this end, we present a new practical task where only a few utterances of each fictional character are available to generate responses mimicking them. Furthermore, we propose a new method named Pseudo Dialog Prompting (PDP) that generates responses by leveraging the power of large-scale language models with prompts containing the target character's utterances. To better reflect the style of the character, PDP builds the prompts in the form of dialog that includes the character's utterances as dialog history. Since only utterances of the characters are available in the proposed task, PDP matches each utterance with an appropriate pseudo-context from a predefined set of context candidates using a retrieval model. Through human and automatic evaluation, we show that PDP generates responses that better reflect the style of fictional characters than baseline methods. PDF 1 2022
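A sketch of the prompt construction that PDP describes, with `retrieve_context` standing in for the paper's retrieval model over predefined context candidates; the exact dialog layout is an assumption based on the abstract:

```python
# Pseudo Dialog Prompting, sketched: pair each of the character's few
# utterances with a retrieved pseudo-context, lay them out as dialog history,
# and let a large LM continue in character.
def build_pdp_prompt(character: str,
                     utterances: list[str],
                     retrieve_context,
                     user_message: str) -> str:
    lines = []
    for utt in utterances:
        ctx = retrieve_context(utt)      # best-matching pseudo-context for utt
        lines.append(f"User: {ctx}")
        lines.append(f"{character}: {utt}")
    lines.append(f"User: {user_message}")
    lines.append(f"{character}:")        # the LM completes this turn in character
    return "\n".join(lines)
```

The point of the retrieval step is that only the character's utterances exist; plausible preceding turns must be supplied to make the examples look like real dialog history.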
A Two-Stage Approach towards Generalization in Knowledge Base Question Answering Most existing approaches for Knowledge Base Question Answering (KBQA) focus on a specific underlying knowledge base either because of inherent assumptions in the approach, or because evaluating it on a different knowledge base requires non-trivial changes. However, many popular knowledge bases share similarities in their underlying schemas that can be leveraged to facilitate generalization across knowledge bases. To achieve this generalization, we introduce a KBQA framework based on a 2-stage architecture that explicitly separates semantic parsing from the knowledge base interaction, facilitating transfer learning across datasets and knowledge graphs. We show that pretraining on datasets with a different underlying knowledge base can nevertheless provide significant performance gains and reduce sample complexity. Our approach achieves comparable or state-of-the-art performance for LC-QuAD (DBpedia), WebQSP (Freebase), SimpleQuestions (Wikidata) and MetaQA (Wikimovies-KG). PDF 1 2022
Efficient Hierarchical Domain Adaptation for Pretrained Language Models Generative language models are trained on diverse, general-domain corpora. However, this limits their applicability to narrower domains, and prior work has shown that continued in-domain training can provide further gains. In this paper, we introduce a method to scale domain adaptation to many diverse domains using a computationally efficient adapter approach. Our method is based on the observation that textual domains are partially overlapping, and we represent domains as a hierarchical tree structure where each node in the tree is associated with a set of adapter weights. When combined with a frozen pretrained language model, this approach enables parameter sharing among related domains, while avoiding negative interference between unrelated ones. Experimental results with GPT-2 and a large fraction of the 100 most represented websites in C4 show across-the-board improvements in-domain. We additionally provide an inference time algorithm for a held-out domain and show that averaging over multiple paths through the tree enables further gains in generalization, while adding only a marginal cost to inference. PDF 1 2022
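One way to realize the described tree of adapters is sketched below: each node owns a residual bottleneck adapter, and an input is transformed along its domain's root-to-leaf path, so related domains share ancestor parameters. Dimensions, class names, and the example tree are illustrative, not the paper's configuration:

```python
# A sketch of hierarchical domain adapters: parameters are shared along
# ancestor nodes, giving related domains overlapping adapter stacks.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))   # residual bottleneck

class AdapterTree(nn.Module):
    def __init__(self, paths: dict[str, list[str]], dim: int = 768):
        super().__init__()
        nodes = {n for path in paths.values() for n in path}
        self.adapters = nn.ModuleDict({n: Adapter(dim) for n in nodes})
        self.paths = paths                              # domain -> node path

    def forward(self, h, domain: str):
        for node in self.paths[domain]:                 # root-to-leaf order
            h = self.adapters[node](h)
        return h

# e.g. paths = {"news":  ["root", "web", "news"],
#               "blogs": ["root", "web", "blogs"]}
# For a held-out domain, outputs can be averaged over several known paths,
# mirroring the inference-time algorithm the abstract mentions.
```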
Representation Learning for Conversational Data using Discourse Mutual Information Maximization Although many pretrained models exist for text or images, there have been relatively few attempts to train representations specifically for dialog understanding. Prior work usually relied on finetuned representations based on generic text representation models like BERT or GPT-2. However, such language modeling pretraining objectives do not take the structural information of conversational text into consideration. Although generative dialog models can learn structural features too, we argue that the structure-unaware word-by-word generation is not suitable for effective conversation modeling. We empirically demonstrate that such representations do not perform consistently across various dialog understanding tasks. Hence, we propose a structure-aware Mutual Information based loss-function DMI (Discourse Mutual Information) for training dialog-representation models, that additionally captures the inherent uncertainty in response prediction. Extensive evaluation on nine diverse dialog modeling tasks shows that our proposed DMI-based models outperform strong baselines by significant margins. PDF 1 2022
Answering Open-Domain Multi-Answer Questions via a Recall-then-Verify Framework Open-domain questions are likely to be open-ended and ambiguous, leading to multiple valid answers. Existing approaches typically adopt the rerank-then-read framework, where a reader reads top-ranking evidence to predict answers. According to our empirical analysis, this framework faces three problems: first, to leverage a large reader under a memory constraint, the reranker should select only a few relevant passages to cover diverse answers, while balancing relevance and diversity is non-trivial; second, the small reading budget prevents the reader from accessing valuable retrieved evidence filtered out by the reranker; third, when using a generative reader to predict answers all at once based on all selected evidence, whether a valid answer will be predicted also pathologically depends on evidence of some other valid answer(s). To address these issues, we propose to answer open-domain multi-answer questions with a recall-then-verify framework, which separates the reasoning process of each answer so that we can make better use of retrieved evidence while also leveraging large models under the same memory constraint. Our framework achieves state-of-the-art results on two multi-answer datasets, and predicts significantly more gold answers than a rerank-then-read system that uses an oracle reranker. PDF 1 2022
Empathic Machines: Using Intermediate Features as Levers to Emulate Emotions in Text-To-Speech Systems We present a method to control the emotional prosody of Text to Speech (TTS) systems by using phoneme-level intermediate features (pitch, energy, and duration) as levers. As a key idea, we propose Differential Scaling (DS) to disentangle features relating to affective prosody from those arising due to acoustics conditions and speaker identity. With thorough experimental studies, we show that the proposed method improves over the prior art in accurately emulating the desired emotions while retaining the naturalness of speech. We extend the traditional evaluation of using individual sentences for a more complete evaluation of HCI systems. We present a novel experimental setup by replacing an actor with a TTS system in offline and live conversations. The emotion to be rendered is either predicted or manually assigned. The results show that the proposed method is strongly preferred over the state-of-the-art TTS system and adds the much-coveted "human touch" in machine dialogue. Audio samples from our experiments are available at: https://emtts.github.io/tts-demo/ PDF 1 2022
ReadE: Learning Relation-Dependent Entity Representation for Knowledge Graph Completion Existing knowledge graph embedding methods that adopt powerful graph neural networks try to aggregate well-preserved neighborhood information into the entity representation. However, they represent each entity solely with a relation-irrespective representation which contains the entire miscellaneous neighborhood information, regardless of the varying emphatic semantics required by different relations when predicting missing entities. To tackle this problem, we propose ReadE, a method to learn relation-dependent entity representations, in which neighborhood information is selectively aggregated and emphasized according to the relation type. First, we propose a relation-controlled gating mechanism that utilizes the relation to control the information flow from neighbors in the aggregation step of the graph neural network. Second, we propose a well-designed contrastive learning method that mixes both relation-level and entity-level negative samples to enhance the semantics preserved in our relation-dependent GNN-based representations. Experiments on three benchmarks show that our proposed model outperforms all strong baselines. The code will be made open-source on GitHub. PDF 1 2022
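The abstract names a relation-controlled gate without giving its form; one plausible minimal parameterization is sketched below, with the shapes, concatenation choice, and mean pooling all being assumptions rather than the paper's design:

```python
# One plausible form of a relation-controlled gate for neighborhood
# aggregation: the query relation decides, per dimension, how much of each
# neighbor's message flows into the entity representation.
import torch
import torch.nn as nn

class RelationGatedAggregation(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)   # conditions on [neighbor; relation]

    def forward(self, neighbor_h: torch.Tensor, relation_h: torch.Tensor):
        """neighbor_h: (n_neighbors, dim); relation_h: (dim,) query relation."""
        rel = relation_h.expand_as(neighbor_h)            # broadcast relation
        g = torch.sigmoid(self.gate(torch.cat([neighbor_h, rel], dim=-1)))
        return (g * neighbor_h).mean(dim=0)               # gated aggregation
```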
Event Detection via Derangement Reading Comprehension Event detection (ED), aiming to detect events from texts and categorize them, is vital to understanding the actual happenings in real life. Recently, ED without triggers has been proposed and gained benefits since it relieves the tedious effort of data labeling. However, it still suffers from several formidable challenges: multi-label, insufficient clues, and imbalanced event types. We, therefore, propose a novel Derangement mechanism on a machine Reading Comprehension (DRC) framework to tackle the above challenges. More specifically, we treat the input text as {\em Context} and concatenate it with all event types, which are deemed {\em Answers}, with an omitted default question. Thus, by appending the input text and event types simultaneously, we can exploit the power of self-attention in pre-trained language models, e.g., BERT, to absorb the semantic relations among them. Moreover, we design a simple yet effective {\em derangement} mechanism to relieve the imbalanced training. By introducing such perturbation mainly on major events, we can prohibit major events from excessive learning or implicitly under-sample the instances of the major events. This yields more balanced training and resolves the imbalanced learning issue. The empirical results show that: (1) our proposed framework attains state-of-the-art performance over previous competitive models, and (2) as a by-product, our model can signal the connection of triggers and arguments to events for further analysis. PDF 1 2022
CSD: A Chinese Dataset for Subtext Problem Subtext is a kind of deep semantics which can be acquired after one or more rounds of expression transformation. As a popular way of expressing one's intentions, it is well worth studying. In this paper, we propose two subtext-related tasks, termed ``subtext recognition'' and ``subtext recovery'', and clearly define their purposes. Moreover, we build a Chinese dataset whose source data comes from popular social media (e.g. Weibo, Netease Music, Zhihu, and Bilibili) and propose a new evaluation metric, termed ``Two-stages Annotation Evaluation'' (TAE), for the validation of a multi-turn annotation process. PDF 1 2022
Meta Learning for Natural Language Processing: A Survey Deep learning has been the mainstream technique in the natural language processing (NLP) area. However, deep learning requires large amounts of labeled data and is less generalizable across domains. Meta-learning is an emerging field in machine learning. It studies approaches to learning better learning algorithms and aims to improve algorithms in various aspects, including data efficiency and generalizability. The efficacy of meta-learning has been shown in many NLP tasks, but there is no systematic survey of these approaches in NLP, which hinders more researchers from joining the field. Our goal with this survey paper is to offer researchers pointers to relevant meta-learning works in NLP and to attract more attention from the NLP community to drive future innovation. This paper first introduces the general concepts of meta-learning and the common approaches. Then we summarize task construction settings and applications of meta-learning for various NLP problems, and review the development of meta-learning in the NLP community. PDF 1 2022
What do tokens know about their characters and how do they know it? Pre-trained language models (PLMs) that use subword tokenization schemes can succeed at a variety of language tasks that require character-level information, despite lacking explicit access to the character composition of tokens. Here, studying a range of models (e.g., GPT-J, BERT, RoBERTa, GloVe), we probe what word pieces encode about character-level information by training classifiers to predict the presence or absence of a particular alphabetical character in an English-language token, based on its embedding (e.g., probing whether the model embedding for "cat" encodes that it contains the character "a"). We find that these models robustly encode character-level information and, in general, larger models perform better at the task. Through a series of experiments and analyses, we investigate the mechanisms through which PLMs acquire character information during training and argue that this knowledge is acquired through multiple phenomena, including a systematic relationship between particular characters and particular parts of speech, as well as natural variability in the tokenization of related strings. PDF 1 2022
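The probing setup described here is simple to reproduce in miniature: train a linear classifier to predict the presence of the character "a" from a token's embedding. In the sketch below, random vectors stand in for real PLM embeddings, so the probe can only memorize its tiny training set; the paper's finding is that real embeddings support this prediction on held-out tokens.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
vocab = ["cat", "dog", "apple", "sky", "banana", "tree", "car", "atlas"]
X = rng.normal(size=(len(vocab), 768))            # stand-in token embeddings
y = np.array([int("a" in tok) for tok in vocab])  # label: token contains "a"?

probe = LogisticRegression(max_iter=1000).fit(X, y)
print(probe.score(X, y))  # trivially high on train data; the real test uses held-out tokens
```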
Measuring and Improving Semantic Diversity of Dialogue Generation Response diversity has become an important criterion for evaluating the quality of open-domain dialogue generation models. However, current evaluation metrics for response diversity do not capture the semantic diversity of generated responses, as they only consider lexical aspects of the responses. In this paper, we introduce a new automatic evaluation metric to measure the semantic diversity of generated responses. Through human evaluation, we demonstrate that our proposed metric correlates more highly with human judgments of response diversity than existing lexical-level diversity metrics. Furthermore, motivated by the analysis of an existing dialogue dataset, we propose a simple yet effective learning method that improves the semantic diversity of generated responses through response re-weighting based on the semantic distribution of the training dataset. Through automatic and human evaluation, we show that our proposed learning method improves both response diversity and coherence more than other baseline methods. PDF 1 2022
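The abstract does not define the metric itself, so the sketch below shows one generic way to score semantic rather than lexical diversity: the mean pairwise cosine distance between sentence embeddings of the generated responses. The embedding dimensionality and the stand-in vectors are assumptions.

```python
import numpy as np

def semantic_diversity(embs: np.ndarray) -> float:
    """Mean pairwise cosine distance between response embeddings: a
    generic semantic-diversity score, not the paper's exact metric."""
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = normed @ normed.T
    iu = np.triu_indices(len(embs), k=1)  # upper triangle = distinct pairs
    return float(1.0 - sims[iu].mean())

# Five stand-in sentence embeddings (e.g., from any sentence encoder).
responses = np.random.default_rng(1).normal(size=(5, 384))
print(semantic_diversity(responses))
```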
Schema Encoding for Transferable Dialogue State Tracking Dialogue state tracking (DST) is an essential sub-task for task-oriented dialogue systems. Recent work has focused on deep neural models for DST. However, the neural models require a large dataset for training. Furthermore, applying them to another domain needs a new dataset because the neural models are trained to imitate the given dataset. In this paper, we propose Schema Encoding for Transferable Dialogue State Tracking (SET-DST), a neural DST method for effective transfer to new domains. Transferable DST can assist the development of dialogue systems even with little data on target domains. We use a schema encoder not just to imitate the dataset but to comprehend the schema of the dataset. We aim to transfer the model to new domains by encoding new schemas and using them for DST. As a result, SET-DST improved the accuracy by 1.46 points on MultiWOZ 2.1. PDF 1 2022
Weakly Supervised Turn-level Engagingness Evaluator for Dialogues The standard approach to evaluating dialogue engagingness is by measuring Conversation Turns Per Session (CTPS), which implies that the dialogue length is the main predictor of the user's engagement with a dialogue system. The main limitation of CTPS is that it can only be measured at the session level, i.e., once the dialogue is over. But a dialogue system has to continuously monitor user engagement throughout the dialogue session as well. Existing approaches to measuring turn-level engagingness require human annotations for training. We pioneer an alternative approach, the Weakly Supervised Engagingness Evaluator (WeSEE), which uses the remaining depth (RD) of each turn as a heuristic weak label for engagingness. WeSEE does not require human annotations and also relates closely to CTPS, thus serving as a good learning proxy for this metric. We show that WeSEE achieves new state-of-the-art results on the Fine-grained Evaluation of Dialog (FED) dataset (0.38 Spearman) and the DailyDialog dataset (0.62 Spearman). PDF 1 2022
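The remaining-depth heuristic can be stated in a few lines: each turn's weak label is the number of turns that follow it in the session. The sketch below computes these labels; any normalization WeSEE applies on top of this is not described in the abstract and is omitted here.

```python
def remaining_depth_labels(num_turns: int) -> list[int]:
    """Weak engagingness label per turn: how many turns remain after
    it (RD). Turns early in a long session get high RD, so a regressor
    trained on RD learns a proxy for session-level CTPS."""
    return [num_turns - (i + 1) for i in range(num_turns)]

print(remaining_depth_labels(5))  # [4, 3, 2, 1, 0]
```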
When More is not Necessarily Better: Multilingual Auxiliary Tasks for Zero-Shot Cross-Lingual Transfer of Hate Speech Detection Models Zero-shot cross-lingual transfer learning has been shown to be highly challenging for tasks involving a lot of linguistic specificities or when a cultural gap is present between languages, such as in hate speech detection. In this paper, we highlight this limitation on several datasets and investigate how training on multilingual auxiliary tasks -- sentiment analysis, named entity recognition, and tasks relying on syntactic information -- impacts the zero-shot transfer of hate speech detection models across languages. We show the positive impact of these tasks, particularly named entity recognition, for bridging the gap between languages. Then, we present cases where the language model's training data prevents hate speech detection models from benefiting from the knowledge proxy brought by auxiliary-task fine-tuning. Our results warrant further investigation into how to best address cultural gap issues in resource-scarce scenarios. PDF 1 2022
Auto-regressive Text Generation with Pre-Trained Language Models: An Empirical Study on Question-type Short Text Generation We present a multi-way parallel math word problem dataset, which covers English, Tamil and Sinhala. We employ this dataset in an empirical analysis of GPT-2, BART, and T5, as well as mT5 and mBART in auto-regressive text generation. Our findings show that BART and T5 perform noticeably better than GPT-2 for the considered task, and text generation with mBART50 and mT5 provides very promising results even for languages under-represented in these pre-trained models. PDF 1 2022
WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models Large pretrained language models (LMs) have become the central building block of many NLP applications. Training these models requires ever more computational resources and most of the existing models are trained on English text only. It is exceedingly expensive to train these models in other languages. To alleviate this problem, we introduce a novel method -- called WECHSEL -- to efficiently and effectively transfer pretrained LMs to new languages. WECHSEL can be applied to any model which uses subword-based tokenization and learns an embedding for each subword. The tokenizer of the source model (in English) is replaced with a tokenizer in the target language, and token embeddings are initialized such that they are semantically similar to the English tokens, utilizing multilingual static word embeddings covering English and the target language. We use WECHSEL to transfer the English RoBERTa and GPT-2 models to four languages (French, German, Chinese and Swahili). We also study the benefits of our method on very low-resource languages. WECHSEL improves over previously proposed methods for cross-lingual parameter transfer and outperforms models of comparable size trained from scratch with up to 64x less training effort. Our method makes training large language models for new languages more accessible and less damaging to the environment. We make our code and models publicly available. PDF 1 2022
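Following the description, a minimal sketch of the initialization step might look as follows: each target subword embedding is a similarity-weighted average of source subword embeddings, with similarities taken from a shared multilingual static space. The top-k truncation and softmax temperature are assumptions for illustration, not WECHSEL's exact weighting.

```python
import numpy as np

def init_target_embeddings(src_emb, src_static, tgt_static, k=10, temp=0.1):
    """Initialize each target-language subword embedding as a
    softmax-weighted average of the k most similar source subwords,
    with similarity measured in a shared multilingual static space."""
    s = src_static / np.linalg.norm(src_static, axis=1, keepdims=True)
    t = tgt_static / np.linalg.norm(tgt_static, axis=1, keepdims=True)
    sims = t @ s.T                              # (tgt_vocab, src_vocab)
    out = np.empty((len(t), src_emb.shape[1]))
    for i, row in enumerate(sims):
        nn = np.argpartition(-row, k)[:k]       # k nearest source subwords
        w = np.exp(row[nn] / temp)
        out[i] = (w / w.sum()) @ src_emb[nn]    # weighted average
    return out

rng = np.random.default_rng(0)
src_emb = rng.normal(size=(100, 32))   # source model's input embeddings
src_sta = rng.normal(size=(100, 64))   # static vectors, source subwords
tgt_sta = rng.normal(size=(20, 64))    # static vectors, target subwords
print(init_target_embeddings(src_emb, src_sta, tgt_sta).shape)  # (20, 32)
```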
Emphasis on Easy Samples for Distantly Supervised Relation Extraction There are many wrongly labeled samples and low-quality samples in automatically generated Distantly Supervised Relation Extraction datasets. Overfitting these samples leads to a decline in generalization. To address this issue, the learning of high-quality samples should be prioritized. In this paper, we propose the Emphasis on Easy Samples (EES) mechanism to emphasize high-quality samples using weight-distribution regularization at the sentence level and priority weighting at the bag level. Experiments on a widely used benchmark show that our approach achieves significant improvements. PDF 1 2022
Investigating Math Word Problems using Pretrained Multilingual Language Models In this paper, we revisit math word problems (MWPs) from the {\em cross-lingual} and {\em multilingual} perspectives. We construct our MWP solvers over pretrained multilingual language models using a sequence-to-sequence model with a copy mechanism. We compare how the MWP solvers perform in cross-lingual and multilingual scenarios. To facilitate the comparison of cross-lingual performance, we first adapt the large-scale English dataset MathQA as a counterpart of the Chinese dataset Math23K. Then we extend several English datasets to bilingual datasets through machine translation plus human annotation. Our experiments show that the MWP solvers may not transfer to a different language even if the target expressions share the same numerical constants and operator set. However, they generalize better if the same problem types exist in both the source and target languages. PDF 1 2022
An Empirical Study on Cross-Lingual and Cross-Domain Transfer for Legal Judgment Prediction Cross-lingual transfer learning has proven useful in a variety of NLP tasks, but it is understudied in the context of legal NLP, and not at all on Legal Judgment Prediction (LJP). We explore transfer learning techniques on LJP using the trilingual Swiss-Judgment-Prediction (SJP) dataset, including cases written in three languages (German, French, Italian). We find that Cross-Lingual Transfer (CLT) improves the overall results across languages, especially when we augment the dataset with machine-translated versions of the original documents, using a $3\times$ larger training corpus. Furthermore, we perform an analysis exploring the effect of cross-domain and cross-regional transfer, i.e., training a model across domains (legal areas) or regions. We find that in both settings (legal areas, origin regions), models trained across all groups perform better overall, and also have improved results in the worst-case scenarios. Finally, we report improved results when we ambitiously apply cross-jurisdiction transfer, augmenting our dataset with Indian legal cases originally written in English. PDF 1 2022
Task Formulation Matters When Learning Continuously: A Case Study in Visual Question Answering Continual learning is a promising alternative to the current pretrain-and-finetune paradigm: It aims to learn a model on a sequence of tasks without forgetting knowledge from preceding tasks. We investigate continual learning for Visual Question Answering and show that performance highly depends on task design, order, and similarity - where tasks may be formulated according to either modality. Our results suggest that incremental learning of language reasoning skills (such as questions about color, count etc.) is more difficult than incrementally learning visual categories. We show that this difficulty is related to task similarity, where heterogeneous tasks lead to more severe forgetting. We also demonstrate that naive finetuning of pretrained models is insufficient, and recent continual learning approaches can reduce forgetting by more than 20%. We propose a simple yet effective Pseudo-Replay algorithm, which improves results while using less memory compared to standard replay. Finally, to measure gradual forgetting we introduce a new metric that takes into account the semantic similarity of predicted answers. PDF 1 2022
Mukayese: Turkish NLP Strikes Back Having sufficient resources for language X lifts it from the $\textit{under-resourced}$ languages class, but not necessarily from the $\textit{under-researched}$ class. In this paper, we address the problem of the absence of organized benchmarks in the Turkish language. We demonstrate that languages such as Turkish are left behind the state-of-the-art in NLP applications. As a solution, we present $\textit{Mukayese}$, a set of NLP benchmarks for the Turkish language that contains several NLP tasks. We work on one or more datasets for each benchmark and present two or more baselines. Moreover, we present four new benchmarking datasets in Turkish for language modeling, sentence segmentation, and spell checking. PDF 1 2022
On the current state of reproducibility and reporting of uncertainty for Aspect-based Sentiment Analysis For the latter part of the past decade, Aspect-Based Sentiment Analysis has been a field of great interest within Natural Language Processing. Supported by the Semantic Evaluation Conferences in 2014 -- 2016, a variety of methods have been developed, competing to improve performance on benchmark data sets. Exploiting the transformer architecture behind BERT, results improved rapidly and efforts in this direction still continue today. Our contribution to this body of research is a holistic comparison of six different architectures which achieved (near) state-of-the-art results at some point in time. We utilize a broad spectrum of five benchmark data sets and introduce a fixed setting with respect to the pre-processing, the train/validation splits, the performance measures and the quantification of uncertainty. Overall, our findings are two-fold: First, we find that the results reported in the scientific articles are hardly reproducible, since in our experiments the observed performance (most of the time) fell short of the reported one. Second, the results are burdened with notable uncertainty (depending on the data splits), which is why reporting uncertainty measures is crucial. PDF 1 2022
An Isotropy Analysis in the Multilingual BERT Embedding Space Several studies have explored various advantages of multilingual pre-trained models (such as multilingual BERT) in capturing shared linguistic knowledge. However, less attention has been paid to their limitations. In this paper, we investigate the multilingual BERT for two known issues of the monolingual models: anisotropic embedding space and outliers. We show that, unlike its monolingual counterpart, the multilingual model exhibits no outlier dimension in its representations while it has a highly anisotropic space. Furthermore, our experimental results demonstrate that increasing the isotropy of multilingual space can significantly improve its representation power and performance, similarly to what had been observed for monolingual CWRs. Our analysis indicates that, although the degenerated directions vary in different languages, they encode similar linguistic knowledge, suggesting a shared linguistic space among languages. PDF 1 2022
A Dual-Channel Framework for Sarcasm Recognition by Detecting Sentiment Conflict Sarcasm employs ambivalence, where one says something positive but actually means something negative, or vice versa. The essence of sarcasm, which is also a necessary and sufficient condition, is a conflict between the literal and implied sentiments expressed in one sentence. However, it is difficult to recognize such sentiment conflict because the sentiments are mixed or even implicit. As a result, the recognition of sophisticated and obscure sentiment poses a great challenge to sarcasm detection. In this paper, we propose a Dual-Channel Framework that models the literal and implied sentiments separately. Based on this dual-channel framework, we design the Dual-Channel Net (DC-Net) to recognize sentiment conflict. Experiments on political debates (i.e., IAC-V1 and IAC-V2) and Twitter datasets show that our proposed DC-Net achieves state-of-the-art performance on sarcasm recognition. PDF 1 2022
Low Resource Style Transfer via Domain Adaptive Meta Learning Text style transfer (TST) without parallel data has achieved some practical success. However, most existing unsupervised text style transfer methods suffer from (i) requiring massive amounts of nonparallel data to guide the transfer of different text styles, and (ii) colossal performance degradation when fine-tuning the model in new domains. In this work, we propose DAML-ATM (Domain Adaptive Meta-Learning with Adversarial Transfer Model), which consists of two parts, DAML and ATM. DAML is a domain adaptive meta-learning approach that refines general knowledge in multiple heterogeneous source domains and is capable of adapting to new, unseen domains with a small amount of data. Moreover, we propose a new unsupervised TST approach, the Adversarial Transfer Model (ATM), which combines a sequence-to-sequence pre-trained language model with adversarial style training for better content preservation and style transfer. Results on multi-domain datasets demonstrate that our approach generalizes well on unseen low-resource domains, achieving state-of-the-art results against ten strong baselines. PDF 1 2022
CLoCE: Contrastive Learning to Optimize Continuous Prompt Embedding Space in Relation Extraction Recent studies have proved that prompt tuning can improve the performance of pre-trained language models (PLMs) on downstream tasks. However, in the task of relation extraction (RE), there are still a large number of confusing samples that hinder prompt-tuning methods from achieving higher accuracy. Inspired by previous works, we utilize contrastive learning to solve this problem. We propose a prompt-tuning-based framework and apply contrastive learning to optimize the representation of input sentences in the embedding space. At the same time, we design a more general template for the RE task and further use knowledge injection to improve the performance of the model. Through extensive experiments on public datasets, the micro F1-score of our model exceeds the existing SOTA on the Re-TACRED and TACREV datasets by 0.5 and 1.0, respectively. Meanwhile, in the few-shot scenario, our model also performs more robustly than fine-tuning methods. PDF 1 2022
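The abstract does not spell out the contrastive objective, so the sketch below uses the standard InfoNCE loss as a stand-in: each sentence representation is pulled toward its positive (e.g., an augmented view or a same-relation sample) and pushed away from the other in-batch samples. Temperature and batch construction are assumptions.

```python
import numpy as np

def info_nce(anchors: np.ndarray, positives: np.ndarray, temp: float = 0.07) -> float:
    """InfoNCE over a batch: anchor i should be closer to positive i
    than to any other in-batch positive. A generic contrastive loss,
    standing in for the paper's exact formulation."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temp                     # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(log_probs).mean())      # diagonal = true pairs

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 16))
positives = anchors + 0.1 * rng.normal(size=(4, 16))  # near-duplicates
print(info_nce(anchors, positives))  # small loss for well-aligned pairs
```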
Towards Coherent and Captivating Topic Transitions in Knowledge-Grounded Conversations Knowledge-grounded conversations require skillful usage of knowledge to generate suitably diverse responses that keep the user captivated while maintaining coherence with the dialogue context. However, current approaches that directly match knowledge with the dialogue context can capture spurious correlations between knowledge and context, leading to either incoherent or mundane topic transitions in the generated dialogues that fail to engage. In this work, we introduce the Coherent and Captivating Topic Transition (C2T2) method to select the appropriate knowledge to be used in the next response, resulting in topic transitions that are coherent with the ongoing conversation while providing adequate topic development for an engaging dialogue. C2T2 employs transition-aware features designed to consider both historical contextual coherence and sequential topic development under a knowledge-shifting constraint to select the next knowledge, thereby generating the response for an engaging conversation. We also design a pointer-network-based knowledge inference module to take into consideration the relations among knowledge candidates during knowledge inference. Extensive experiments on two public benchmarks demonstrate the superiority of C2T2 on knowledge selection. Analysis of fine-grained knowledge selection accuracy also shows that C2T2 can better balance topic adhesion and knowledge diversity in dialogues than existing approaches. PDF 1 2022
DialogueScore: Evaluating Responses in Task-Oriented Dialogue Task-oriented dialogue systems have been widely deployed in real-world applications in the last few years. Yet, evaluations of task-oriented dialogue systems are relatively limited. The inform and success scores only consider the key entities in the generated responses to judge whether the user's goal is achieved. On the other hand, the fluency metric (BLEU score) cannot measure the quality of short responses properly, since the golden responses can be diverse. To better explore the behavior and evaluate the generation ability of task-oriented dialogue systems, we explore the relation between user utterances, system responses, and their follow-up utterances. We therefore design a scorer named \textbf{DialogueScore} based on the natural language inference task and synthesize negative data to train the scorer. Based on the scores of \textbf{DialogueScore}, we observe that dialogue systems fail to generate high-quality responses compared with the reference responses. Our proposed scorer can therefore provide a new perspective for future dialogue system evaluation and construction. PDF 1 2022
A Federated Approach to Predict Emojis in Hindi Tweets The use of emojis adds a visual modality to textual communication. The task of predicting emojis, however, is challenging for computational approaches, as emoji use tends to cluster into frequently used and rarely used emojis. Much of the research on emoji use has focused on high-resource languages and conceptualised the task of predicting emojis around traditional server-side machine learning approaches, which can introduce privacy concerns, as user data is transmitted to central storage. We show that a privacy-preserving approach, Federated Learning, exhibits comparable performance to traditional server-side transformer models. In this paper, we provide a benchmark dataset of $118$k tweets (augmented from $25$k unique tweets) for emoji prediction in Hindi and propose a modification to the CausalFedGSD algorithm aiming to balance model performance and user privacy. We show that our approach obtains comparable scores to more complex centralised models while reducing the amount of data required to optimise the models and minimising risks to user privacy. PDF 1 2022
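As a rough illustration of why federated training preserves privacy here: only model parameters travel to the server, which aggregates them, so raw tweets never leave the device. The sketch below is plain FedAvg (size-weighted parameter averaging), not the paper's CausalFedGSD modification.

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """Federated averaging: aggregate each layer's parameters across
    clients, weighted by local data size. Generic FedAvg sketch."""
    total = sum(client_sizes)
    layers = zip(*client_weights)  # group the same layer across clients
    return [sum(w * (n / total) for w, n in zip(layer, client_sizes))
            for layer in layers]

# Three clients, one 2x2 weight matrix each, with different data sizes.
clients = [[np.ones((2, 2)) * v] for v in (1.0, 2.0, 3.0)]
print(fed_avg(clients, client_sizes=[10, 30, 60]))  # weighted mean = 2.5
```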
Don’t Forget Cheap Training Signals Before Building Unsupervised Bilingual Word Embeddings Bilingual Word Embeddings (BWEs) are one of the cornerstones of cross-lingual transfer of NLP models. They can be built using only monolingual corpora, without supervision, leading to numerous works focusing on unsupervised BWEs. However, most of the current approaches to build unsupervised BWEs do not compare their results with methods based on easy-to-access cross-lingual signals. In this paper, we argue that such signals should always be considered when developing unsupervised BWE methods. The two approaches we find most effective are: 1) using identical words as seed lexicons (which unsupervised approaches incorrectly assume are not available for orthographically distinct language pairs) and 2) combining such lexicons with pairs extracted by matching romanized versions of words under an edit distance threshold. We experiment on thirteen non-Latin languages (and English) and show that such cheap signals work well and outperform more complex unsupervised methods on distant language pairs such as Chinese, Japanese, Kannada, Tamil, and Thai. In addition, we show that our signals are even competitive with the use of high-quality lexicons in supervised approaches. Our results show that these training signals should not be neglected when building BWEs, even for distant languages. PDF 1 2022
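Both cheap signals can be implemented in a few lines: collect identical surface forms, then add pairs whose romanized forms fall within an edit-distance threshold. In the sketch below, `romanize` is a stand-in for a real transliteration tool (e.g., uroman); the toy call just lowercases, and the threshold is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def seed_lexicon(src_vocab, tgt_vocab, romanize, max_dist=1):
    """Cheap supervision: identical surface forms, plus pairs whose
    romanized forms are within a small edit distance."""
    pairs = [(w, w) for w in set(src_vocab) & set(tgt_vocab)]
    for s in src_vocab:
        for t in tgt_vocab:
            if s != t and levenshtein(romanize(s), romanize(t)) <= max_dist:
                pairs.append((s, t))
    return pairs

print(seed_lexicon(["bank", "Paris", "drei"],
                   ["bank", "Pariz", "drei"],
                   romanize=str.lower))  # identical words + (Paris, Pariz)
```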
RED-ACE: Robust Error Detection for ASR using Confidence Embeddings ASR Error Detection (AED) models aim to post-process the output of Automatic Speech Recognition (ASR) systems, in order to detect transcription errors. Modern approaches usually use text-based input, comprised solely of the ASR transcription hypothesis, disregarding additional signals from the ASR model. Instead, we propose to utilize the ASR system's word-level confidence scores for improving AED performance. Specifically, we add an ASR Confidence Embedding (ACE) layer to the AED model's encoder, allowing us to jointly encode the confidence scores and the transcribed text into a contextualized representation. Our experiments show the benefits of ASR confidence scores for AED, their complementary effect over the textual signal, as well as the effectiveness and robustness of ACE for combining these signals. To foster further research, we publish a novel AED dataset consisting of ASR outputs on the LibriSpeech corpus with annotated transcription errors. PDF 1 2022
When Does Syntax Mediate Neural Language Model Performance? Evidence from Dropout Probes Recent causal probing literature reveals when language models and syntactic probes use similar representations. Such techniques may yield ``false negative'' causality results: models may use representations of syntax, but probes may have learned to use redundant encodings of the same syntactic information. We demonstrate that models do encode syntactic information redundantly and introduce a new probe design that guides probes to consider all syntactic information present in embeddings. Using these probes, we find evidence for the use of syntax in models where prior methods did not, allowing us to boost model performance by injecting syntactic information into representations. PDF 1 2022
Seq-GAN-BERT: Sequence Generative Adversarial Learning for Low-Resource Named Entity Recognition Named entity recognition (NER), an important basic task of natural language processing, has been widely studied. When labeled data are relatively sufficient, traditional NER methods achieve remarkable results. However, due to the lack of labeled data in many fields and the difficulty of manual annotation, low-resource NER has become a research hotspot. To effectively improve the recognition accuracy of low-resource NER, this paper proposes the semi-supervised learning model Seq-GAN-BERT, which integrates a generative adversarial network with the pre-trained language model BERT and uses an unlabeled in-domain corpus to train the network to learn general semantic information from the data. The proposed Seq-GAN-BERT method can further optimize BERT-based supervised training and improve entity recognition. The experimental results show that our model greatly reduces the dependence on labeled samples and effectively improves performance on the low-resource NER task. PDF 1 2022
Non-Autoregressive Neural Machine Translation with Consistency Regularization Optimized Variational Framework The Variational Autoencoder (VAE) is an effective way to model interdependency for non-autoregressive neural machine translation (NAT). LaNMT, a representative VAE-based latent-variable NAT framework, achieves great improvements over vanilla models but still suffers from two main issues that lower translation quality: (1) a mismatch between training and inference circumstances and (2) inadequate latent representations. In this work, we address these issues by proposing posterior consistency regularization. Specifically, we first apply stochastic data augmentation to the input samples to better adapt the model to inference circumstances, and then perform consistency training on posterior latent variables to train a more robust posterior network with better latent representations. Experiments on En-De/De-En/En-Ro benchmarks confirm the effectiveness of our methods, with about 1.3/0.7/0.8 BLEU points of improvement over the baseline model while being about $12.6\times$ faster than the autoregressive Transformer. PDF 1 2022
Should We Rely on Entity Mentions for Relation Extraction? Debiasing Relation Extraction with Counterfactual Analysis Recent literature focuses on utilizing entity information in sentence-level relation extraction (RE), but this risks leaking superficial and spurious clues about relations. As a result, RE still suffers from unintended entity bias, i.e., the spurious correlation between entity mentions (names) and relations. Entity bias can mislead RE models to extract relations that do not exist in the text. To combat this issue, some previous work masks the entity mentions to prevent the RE models from over-fitting them. However, this strategy degrades RE performance because it loses the semantic information of entities. In this paper, we propose the CoRE (Counterfactual Analysis based Relation Extraction) debiasing method, which guides RE models to focus on the main effects of textual context without losing the entity information. We first construct a causal graph for RE, which models the dependencies between variables in RE models. Then, we conduct counterfactual analysis on our causal graph to distill and mitigate the entity bias, which captures the causal effects of specific entity mentions in each instance. Note that our CoRE method is model-agnostic and debiases existing RE systems during inference without changing their training processes. Extensive experimental results demonstrate that CoRE yields significant gains in both effectiveness and generalization for RE. PDF 1 2022
A Transformer-based Threshold-Free Framework for Multi-Intent NLU Multi-intent natural language understanding (NLU) has recently gained attention. It detects multiple intents in an utterance, which is better suited to real-world scenarios. However, state-of-the-art joint NLU models mainly detect multiple intents with a threshold-based strategy, resulting in one main issue: the model is extremely sensitive to the threshold settings. In this paper, we propose a transformer-based Threshold-Free Multi-intent NLU model (TFMN) with multi-task learning (MTL). Specifically, we first leverage multiple layers of a transformer-based encoder to generate multi-grain representations. Then we exploit the number of intents in each utterance, without additional manual annotations, and propose an auxiliary detection task: Intent Number Detection (IND). Furthermore, we propose a threshold-free multi-intent classifier that utilizes the output of the IND task and detects multiple intents without depending on a threshold. Extensive experiments demonstrate that our proposed model achieves superior results on two public multi-intent datasets. PDF 1 2022
Learning to Predict Persona Information for Dialogue Personalization without Explicit Persona Description Personalizing dialogue agents is important for dialogue systems to generate more specific, consistent, and engaging responses. However, most current dialogue personalization approaches rely on explicit persona descriptions during inference, which severely restricts its application. In this paper, we propose a novel approach that learns to predict persona information based on the dialogue history to personalize the dialogue agent without relying on any explicit persona descriptions during inference. Experimental results on the PersonaChat dataset show that the proposed method can improve the consistency of generated responses when conditioning on the predicted profile of the dialogue agent (i.e. ``self persona''), and improve the engagingness of the generated responses when conditioning on the predicted persona of the dialogue partner (i.e. ``their persona''). We also find that a trained persona prediction model can be successfully transferred to other datasets and help generate more relevant responses. PDF 1 2022
QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization Factual consistency is an essential quality of text summarization models in practical settings. Existing work in evaluating this dimension can be broadly categorized into two lines of research, entailment-based and question answering (QA)-based metrics, and different experimental setups often lead to contrasting conclusions as to which paradigm performs the best. In this work, we conduct an extensive comparison of entailment and QA-based metrics, demonstrating that carefully choosing the components of a QA-based metric, especially question generation and answerability classification, is critical to performance. Building on those insights, we propose an optimized metric, which we call QAFactEval, that leads to a 14% average improvement over previous QA-based metrics on the SummaC factual consistency benchmark, and also outperforms the best-performing entailment-based metric. Moreover, we find that QA-based and entailment-based metrics can offer complementary signals and be combined into a single metric for a further performance boost. PDF 1 2022
Weakly Supervised Text-to-SQL Parsing through Question Decomposition Text-to-SQL parsers are crucial in enabling non-experts to effortlessly query relational data. Training such parsers, by contrast, generally requires expert annotation of natural language (NL) utterances paired with corresponding SQL queries. In this work, we propose a weak supervision approach for training text-to-SQL parsers. We take advantage of the recently proposed question meaning representation called QDMR, an intermediate between NL and formal query languages. We show that given questions, their QDMR structures (annotated by non-experts or automatically predicted), and the answers, we can automatically synthesize SQL queries that are then used to train text-to-SQL models. Extensive experiments test our approach on five benchmark datasets. The results show that our models perform competitively with those trained on annotated NL-SQL data. Overall, we effectively train text-to-SQL parsers using zero SQL annotations. PDF 1 2022
BOOKSUM: A Collection of Datasets for Long-form Narrative Summarization The majority of existing text summarization datasets include short-form source documents that lack long-range causal and temporal dependencies, and often contain strong layout and stylistic biases. While relevant, such datasets will offer limited challenges for future text summarization systems. We address these issues by introducing BOOKSUM, a collection of datasets for long-form narrative summarization. Our dataset covers documents from the literature domain, such as novels, plays and stories, and includes highly abstractive, human written summaries on three levels of granularity of increasing difficulty: paragraph-, chapter-, and book-level. The domain and structure of our dataset poses a unique set of challenges for summarization systems, which include: processing very long documents, non-trivial causal and temporal dependencies, and rich discourse structures. To facilitate future work, we trained and evaluated multiple extractive and abstractive summarization models as baselines for our dataset. PDF 1 2022
Generate it. A simple method for End-to-End Relation Extraction End-to-end Relation Extraction (RE) is a fundamental problem of information extraction, which includes two tasks: identifying named entities from text and classifying relations between entities. In this work, we propose a simple but effective method to extract entities and relations from text jointly by designing the target output of a BART-based generative model for Named Entity Recognition (NER) without changing its architecture. Compared to existing methods on ChEMU, our method performs better on RE and produces comparable results on NER. Our experimental results also demonstrate that the generative model designed for a single task is capable of joint learning. PDF 1 2022
JointLK: Joint Reasoning with Language Models and Knowledge Graphs for Commonsense Question Answering Existing KG-augmented models for question answering primarily focus on designing elaborate Graph Neural Networks (GNNs) to model knowledge graphs (KGs). However, they ignore (i) effectively fusing and reasoning over the question-context representations and the KG representations, and (ii) automatically selecting relevant nodes from the noisy KGs during reasoning. In this paper, we propose a novel model, JointLK, which solves the above limitations through joint reasoning of LMs and GNNs and a dynamic KG pruning mechanism. Specifically, JointLK performs joint reasoning between the LM and the GNN through a novel dense bidirectional attention module, in which each question token attends to KG nodes and each KG node attends to question tokens, and the two modal representations fuse and update mutually through multi-step interactions. Then, the dynamic pruning module uses the attention weights generated by joint reasoning to recursively prune irrelevant KG nodes. Our results on the CommonsenseQA and OpenBookQA datasets demonstrate that our modal fusion and knowledge pruning methods make better use of relevant knowledge for reasoning. PDF 1 2022
KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation Self-supervised vision-and-language pretraining (VLP) aims to learn transferable multi-modal representations from large-scale image-text data and to achieve strong performance on a broad scope of vision-language tasks after finetuning. Previous mainstream VLP approaches typically adopt a two-step strategy relying on external object detectors to encode images in a multi-modal Transformer framework, which suffers from a restrictive object concept space, limited image context and inefficient computation. In this paper, we propose an object-aware end-to-end VLP framework, which directly feeds image grid features from CNNs into the Transformer and learns the multi-modal representations jointly. More importantly, we propose to perform object knowledge distillation to facilitate learning cross-modal alignment at different semantic levels. To achieve that, we design two novel pretext tasks by taking object features and their semantic labels from external detectors as supervision: (1) an object-guided masked vision modeling task that enforces object-aware representation learning in the multi-modal Transformer; (2) a phrase-region alignment task that improves cross-modal alignment by utilizing the similarities between noun phrases and object labels in the linguistic space. Extensive experiments on a wide range of vision-language tasks demonstrate the efficacy of our proposed framework, and we achieve competitive or superior performance over existing pretraining strategies. PDF 1 2022
Code Summarization: Do Transformers Really Understand Code? Recent approaches for automatic code summarization rely on fine-tuned transformer-based language models, often injected with program analysis information. We perform empirical studies to analyze the extent to which these models understand the code they attempt to summarize. We observe that these models rely heavily on the textual cues present in comments, function names, and variable names, and that masking this information negatively impacts the generated summaries. Further, subtle code transformations which drastically alter program logic have no corresponding impact on the generated summaries. Overall, the quality of the generated summaries, even from state-of-the-art models, is quite poor, raising questions about the utility of current approaches and datasets. PDF 1 2022
Exploring Neural Models for Query-Focused Summarization Query-focused summarization (QFS) aims to produce summaries that answer particular questions of interest, enabling greater user control and personalization. While recently released datasets, such as QMSum or AQuaMuSe, facilitate research efforts in QFS, the field lacks a comprehensive study of the broad space of applicable modeling methods. In this paper we conduct a systematic exploration of neural approaches to QFS, considering two general classes of methods: two-stage extractive-abstractive solutions and end-to-end models. Within those categories, we investigate existing methods and present two model extensions that achieve state-of-the-art performance on the QMSum dataset by a margin of up to 3.38 ROUGE-1, 3.72 ROUGE-2, and 3.28 ROUGE-L. Through quantitative experiments we highlight the trade-offs between different model configurations and explore the transfer abilities between summarization tasks. We also perform human evaluation that suggests the best models produce more comprehensive and factually-consistent summaries compared to a baseline model. Code and checkpoints are made publicly available: https://github.com/anonymized. PDF 1 2022
Shapeshifter and the impact of representation in classification This work explores the representation format of text during the classification process. We define eight types of graph-based representations to study the impact of representation on a model. We build the graphs from the dependency tree and input them into the UGformer to classify the documents. As a result, we observe that the best result is always at least two percentage points higher than the worst result. PDF 1 2022
Exploiting Coreference and Schema Structure for Document-level Event Extraction Document-level event extraction (DEE) extracts structured information of events from a document. Previous studies focus on improving the model architecture. We argue that exploiting data characteristics is also important. We propose to utilize coreference information to obtain better document-level entity representations, and propose the concept of core roles to adjust the schema structure to alleviate error propagation. Experiments demonstrate that our data exploitation methods significantly improve the performance of existing models on both the role-level and record-level metrics. PDF 1 2022
Learning from Mental Disorder Self-tests: Multi-head Siamese Network for Few-shot Knowledge Learning Social media is one of the most highly sought resources for analyzing the language characteristics of its users. In particular, many researchers have utilized various linguistic features to identify users with mental disorders. However, generalizing the linguistic features of such psychiatric patients is challenging since these features depend on cultural or personal language habits. To address this challenge, we make use of symptoms, which are shared properties of people with mental illness, focusing on clinical contents rather than the ways of expressing them. In this paper, we aim to let our classification model identify informative features by training on knowledge about the symptoms. To this end, we propose a multi-head siamese network, which captures informative features based on the knowledge of mental illness symptoms and compares them to those of the target text to be classified. The model is designed to learn the required knowledge by reading just a few questions from self-tests, and to identify similar stories in social media texts. Experimental results demonstrate that our model achieves improved performance as well as human-interpretable results for mental illness symptoms. A case study shows that our proposed model offers the possibility of automatic mental illness diagnosis, grounded on rational reasons. PDF 1 2022
Are Pretrained Multilingual Models Equally Fair Across Languages? Pretrained multilingual language models can help bridge the digital language divide, enabling high-quality NLP models for lower-resourced languages. Studies of multilingual models have so far focused on performance, consistency, and cross-lingual generalization. However, with their widespread application in the wild and downstream societal impact, it is important to put multilingual models under the same scrutiny as monolingual models. This work investigates the group fairness of multilingual models, asking whether these models are equally fair across languages. To this end, we create a new four-way multilingual dataset of parallel cloze test examples (MozArt), equipped with demographic information (balanced with regard to gender and native tongue) about the test participants. We evaluate three multilingual models on MozArt -- mBERT, XLM-R, and mT5 -- and show that across the four target languages, the three models exhibit different levels of group disparity, e.g., exhibiting near-equal risk for Spanish, but high levels of disparity for German. PDF 1 2022
Knowledge Base Index Compression via Dimensionality and Precision Reduction Recently, neural network based approaches to knowledge-intensive NLP tasks, such as question answering, have started to rely heavily on the combination of neural retrievers and readers. Retrieval is typically performed over a large textual knowledge base (KB), which requires significant memory and compute resources, especially when scaled up. On HotpotQA, we systematically investigate reducing the size of the KB index by means of dimensionality reduction (sparse random projections, PCA, autoencoders) and numerical precision reduction. Our results show that PCA is an easy solution that requires very little data and is only slightly worse than autoencoders, which are less stable. All methods are sensitive to pre- and post-processing, and data should always be centered and normalized both before and after dimension reduction. Finally, we show that it is possible to combine PCA with using 1 bit per dimension. Overall we achieve (1) 100$\times$ compression with 75%, and (2) 24$\times$ compression with 92% of the original retrieval performance. PDF 1 2022
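A minimal sketch of the PCA-plus-1-bit pipeline described above, using stand-in vectors: center and normalize, project onto the top principal components via SVD, center and normalize again, then keep only the sign of each dimension. The target dimensionality (128) is an assumption for illustration, not the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(0)
index = rng.normal(size=(2000, 768)).astype(np.float32)  # stand-in passage vectors

# Center and normalize before reduction, as the paper stresses.
index -= index.mean(axis=0)
index /= np.linalg.norm(index, axis=1, keepdims=True)

# PCA via SVD, keeping the top 128 principal directions.
_, _, Vt = np.linalg.svd(index, full_matrices=False)
reduced = index @ Vt[:128].T

# Re-center/normalize after reduction, then keep 1 bit per dimension.
reduced -= reduced.mean(axis=0)
reduced /= np.linalg.norm(reduced, axis=1, keepdims=True)
bits = reduced > 0                   # sign quantization: 1 bit per dim
packed = np.packbits(bits, axis=1)   # 768 float32 -> 16 bytes per vector
print(index.nbytes // packed.nbytes)  # compression factor (192 here)
```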
The Common Readability Formula & Five Adjusted Readability Formulas for Text Simplification, Medical Documents and Other General Uses Traditional readability formulas, or equations, are inaccurate and measure highly limited linguistic properties. Despite the recent machine learning-based readability assessment models, many researchers insist on using the outdated formulas. To replace the linguistically shallow, inaccurate formulas, we: 1. introduce the Common Readability Formula (CoRF), 2. recalibrate outdated formulas (Flesch-Kincaid Grade Level, Fog Index, SMOG Index, Coleman-Liau Index, and Automated Readability Index), 3. evaluate the formulas, and 4. develop a Python library for the wide dispersal of our variations. PDF 1 2022
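For context on what is being recalibrated: the classic Flesch-Kincaid Grade Level is 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59. The sketch below implements it with a crude vowel-group syllable counter; that counter is only an approximation, and this is the original formula, not the paper's adjusted variant.

```python
import re

def fk_grade(text: str) -> float:
    """Classic Flesch-Kincaid Grade Level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59.
    Syllables are approximated by counting vowel groups per word."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower())))
                    for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

print(round(fk_grade("The cat sat on the mat. It was happy."), 2))  # simple text scores low
```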
KALA: Knowledge-Augmented Language Model Adaptation Pre-trained language models (PLMs) have achieved remarkable success on various natural language understanding tasks. Simple fine-tuning of PLMs, on the other hand, might be suboptimal for domain-specific tasks because they cannot possibly cover knowledge from all domains. While adaptive pre-training of PLMs can help them obtain domain-specific knowledge, it requires a large training cost. Moreover, adaptive pre-training can harm the PLM's performance on the downstream task by causing catastrophic forgetting of its general knowledge. To overcome such limitations of adaptive pre-training for PLM adaptation, we propose a novel domain adaptation framework for PLMs coined Knowledge-Augmented Language model Adaptation (KALA), which modulates the intermediate hidden representations of PLMs with domain knowledge, consisting of entities and their relational facts. We validate the performance of KALA on question answering and named entity recognition tasks on multiple datasets across various domains. The results show that, despite being computationally efficient, KALA largely outperforms adaptive pre-training. PDF 1 2022
Sonnet Generation by Training on Non-poetic Texts with Discourse-level Coherence and Poetic Features Poetry generation, and creative language generation in general, usually suffers from the lack of large training data. In this paper, we present a novel framework to generate sonnets that does not require training on poems. We design a hierarchical framework which plans the poem sketch before decoding. Specifically, a content planning module is trained on non-poetic texts to obtain discourse-level coherence; then a rhyme module generates rhyme words and a polishing module introduces imagery and similes for aesthetics purposes. Finally, we design a constrained decoding algorithm to impose the meter-and-rhyme constraint of the generated sonnets. Automatic and human evaluation show that our multi-stage approach without training on poem corpora generates more coherent, poetic, and creative sonnets than several strong baselines. PDF 1 2022
Improving language models fine-tuning with representation consistency targets Fine-tuning contextualized representations learned by pre-trained language models has become a standard practice in the NLP field. However, pre-trained representations are prone to degradation (also known as representation collapse) during fine-tuning, which leads to instability, sub-optimal performance, and weak generalization. In this paper, we propose a novel fine-tuning method that avoids representation collapse during fine-tuning by discouraging undesirable changes of the representations. We show that our approach matches or exceeds the performance of the existing regularization-based fine-tuning methods across 13 language understanding tasks (GLUE benchmark and six additional datasets). We also demonstrate its effectiveness in low-data settings and robustness to label perturbation. Furthermore, we extend previous studies of representation collapse and propose several metrics to quantify it. Using these metrics and previously proposed experiments, we show that our approach obtains significant improvements in retaining the expressive power of representations. PDF 1 2022
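The abstract above describes discouraging undesirable representation changes without giving the exact objective, so the sketch below shows a generic drift penalty of this family: the mean squared distance between fine-tuned and pre-trained representations, added to the task loss with a weight lam. Both the distance measure and the weight are assumptions, not the paper's regularizer.

```python
import numpy as np

def consistency_penalty(finetuned_reprs: np.ndarray,
                        pretrained_reprs: np.ndarray,
                        lam: float = 0.1) -> float:
    """Penalize drift of fine-tuned representations away from the
    pre-trained ones: mean squared L2 distance, scaled by lam, to be
    added to the task loss during fine-tuning. A generic sketch."""
    drift = np.sum((finetuned_reprs - pretrained_reprs) ** 2, axis=-1)
    return lam * float(drift.mean())

# Toy batch of 4 sentence representations with 8 dimensions each.
rng = np.random.default_rng(0)
pre = rng.normal(size=(4, 8))
ft = pre + 0.05 * rng.normal(size=(4, 8))  # small drift -> small penalty
print(consistency_penalty(ft, pre))
```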
It's What You Say and How You Say It: Exploring Textual and Audio Features for Podcast Data Podcasts are a relatively new medium in the form of spoken documents or conversations with a wide range of topics, genres, and styles. With a massive increase in the number of podcasts and their listener base, it is beneficial to understand podcasts better, to derive insights into questions such as what makes certain podcasts more popular than others or which tags help in characterizing a podcast. In this work, we provide a comprehensive analysis of hand-crafted features from two modalities, i.e., text and audio. We explore multiple feature combinations, considering podcast popularity prediction and multi-label tag assignment as proxy downstream tasks. In our experiments, we use document embeddings, affective features, named entities, tags, and topics as the textual features, while multi-band modulation and traditional speech processing features constitute the audio features. We find that the audio prosody feature and the textual affective features (sentiment and emotion) are significant for both downstream tasks. We observe that the combination of textual and audio features helps improve performance on the popularity prediction task. PDF 1 2022
Robust (Controlled) Table-to-Text Generation with Structure-Aware Equivariance Learning Controlled table-to-text generation seeks to generate natural language descriptions for highlighted subparts of a table. Previous SOTA systems still employ a sequence-to-sequence generation method, which merely captures the table as a linear structure and is brittle when table layouts change. We seek to go beyond this paradigm by (1) effectively expressing the relations of content pieces in the table, and (2) making our model robust to content-invariant structural transformations. Accordingly, we propose an equivariance learning framework, encoding tables with a structure-aware self-attention mechanism. This prunes the full self-attention structure into an order-invariant graph attention that captures the connected graph structure of cells belonging to the same row or column, and it differentiates between relevant cells and irrelevant cells from the structural perspective. Our framework also modifies the positional encoding mechanism to preserve the relative position of tokens in the same cell but enforce position invariance among different cells. Our technology is free to be plugged into existing table-to-text generation models, and has improved T5-based models to offer better performance on ToTTo and HiTab. Moreover, on a harder version of ToTTo, we preserve promising performance, while previous SOTA systems, even with transformation-based data augmentation, have seen significant performance drops. PDF 1 2022
CL-ReKD: Cross-lingual Knowledge Distillation for Multilingual Retrieval Question Answering Cross-Lingual Retrieval Question Answering (CL-ReQA) is concerned with retrieving answer documents or passages to a question written in a different language. A common approach to CL-ReQA is to create a multilingual sentence embedding space such that question-answer pairs across different languages are close to each other. In this paper, we propose a novel CL-ReQA method utilizing the concept of knowledge distillation and a new cross-lingual consistency training technique to create a multilingual embedding space for ReQA. To assess the effectiveness of our work, we conducted comprehensive experiments on CL-ReQA and a downstream task, machine reading QA. We compared our proposed method with the current state-of-the-art solutions across three public CL-ReQA corpora. Our method outperforms competitors in 19 out of 21 settings of CL-ReQA. When used with a downstream machine reading QA task, our method outperforms the best existing language-model-based method by 10% in F1 while being 10 times faster in sentence embedding computation. PDF 1 2022
ITA: Image-Text Alignments for Multi-Modal Named Entity Recognition Recently, Multi-modal Named Entity Recognition (MNER) has attracted a lot of attention. Most of the work utilizes image information through region-level visual representations obtained from a pretrained object detector and relies on an attention mechanism to model the interactions between image and text representations. However, it is difficult to model such interactions as image and text representations are trained separately on the data of their respective modality and are not aligned in the same space. As text representations take the most important role in MNER, in this paper, we propose {\bf I}mage-{\bf t}ext {\bf A}lignments (ITA) to align image features into the textual space, so that the attention mechanism in transformer-based pretrained textual embeddings can be better utilized. ITA first aligns the image into regional object tags, image-level captions and optical characters as visual contexts, concatenates them with the input texts as a new cross-modal input, and then feeds it into a pretrained textual embedding model. This makes it easier for the attention module of a pretrained textual embedding model to model the interaction between the two modalities since they are both represented in the textual space. ITA further aligns the output distributions predicted from the cross-modal input and textual input views so that the MNER model can be more practical and robust to noises from images. In our experiments, we show that ITA models can achieve state-of-the-art accuracy on multi-modal Named Entity Recognition datasets, even without image information. PDF 1 2022
Enhancing Robustness in Aspect-based Sentiment Analysis by Better Exploiting Data Augmentation In this paper, we propose to leverage data augmentation to improve the robustness of aspect-based sentiment analysis models. Our method not only exploits augmented data but also makes models focus more on predictive features. We show in experiments that our method compares favorably against strong baselines on both robustness and standard datasets. In contrast, the widely used adversarial training that only leverages the augmented data fails to improve performance due to the distribution shift caused by the augmented data. PDF 1 2022
A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation Recent advances in the pre-training of language models leverage large-scale datasets to create multilingual models. However, low-resource languages are mostly left out of these datasets. This is primarily because many widely spoken languages are not well represented on the web and are therefore excluded from the large-scale crawls used to build datasets. Furthermore, downstream users of these models are restricted to the selection of languages originally chosen for pre-training. This work investigates how to optimally leverage existing pre-trained models to create low-resource translation systems for 16 African languages. We focus on two questions: 1) How can pre-trained models be used for languages not included in the initial pretraining? and 2) How can the resulting translation models effectively transfer to new domains? To answer these questions, we create a novel African news corpus covering 16 languages, of which eight languages are not part of any existing evaluation dataset. We demonstrate that the most effective strategy for transferring both additional languages and additional domains is to leverage small quantities of high-quality translation data to fine-tune large pre-trained models. PDF 1 2022
Detection and Mitigation of Political Bias in Natural Language Processing: A Literature Review With the increasing importance of Natural Language Processing (NLP) tools, their implications for the propagation of societal biases become more and more relevant. In this context, the analysis of political bias in manually written and automatically generated text is a relatively understudied field. Political bias refers to the preference or prejudice towards one political ideology over another. To further the discourse in this subject area, we analyze contemporary studies on detecting and mitigating political bias in this literature review. We further discuss the benefits and potential drawbacks of the considered methods and look at the ethical considerations involved with political bias in NLP, before giving suggestions for future studies. PDF 1 2022
Latent Group Dropout for Multilingual and Multidomain Machine Translation Multidomain and multilingual machine translation often rely on parameter sharing strategies, where large portions of the network are meant to capture the commonalities of the tasks at hand, while smaller parts are reserved to model the peculiarities of a language or a domain. In adapter-based approaches, these strategies are hardcoded in the network architecture, independent of the similarities between tasks. In this work, we propose a new method to better take advantage of these similarities, using a latent-variable model. We also develop new techniques to train this model end-to-end and report experimental results showing that the learned patterns are both meaningful and yield improved translation performance without any increase in model size. PDF 1 2022
The Algorithmic Inflection and Morphological Variability of Russian We present a set of deterministic algorithms for Russian inflection and automated text synthesis. These algorithms are implemented in a publicly available web-service www.passare.ru. This service provides functions for inflection of single words, word matching and synthesis of grammatically correct Russian text. Selected code and datasets are available at https://github.com/passare-ru/PassareFunctions/. Performance of the inflectional functions has been tested against the annotated corpus of Russian language OpenCorpora, compared with that of other solutions, and used for estimating the morphological variability and complexity of different parts of speech in Russian. PDF 1 2022
ANNA: Enhanced Language Representation for Question Answering Pre-trained language models have brought significant improvements in performance to a variety of natural language processing tasks. Most existing models achieving state-of-the-art results treat data processing, pre-training tasks, neural network modeling, and fine-tuning as separate perspectives. In this paper, we demonstrate how each of these approaches affects performance individually, and show that a language model performs best on a specific question answering task when those approaches are jointly considered in pre-training. In particular, we propose an extended pre-training task, and a new neighbor-aware mechanism that attends more closely to neighboring tokens to capture the richness of context for pre-training language modeling. Our best model achieves new state-of-the-art results of 95.7\% F1 and 90.6\% EM on SQuAD 1.1 and also outperforms existing pre-trained language models such as RoBERTa, ALBERT, ELECTRA, and XLNet on the SQuAD 2.0 benchmark. PDF 1 2022
Do BERTs Learn to Use Browser User Interface? Exploring Multi-Step Tasks with Unified Vision-and-Language BERTs Unifying models by reducing task-specific structures has been studied to facilitate the transfer of learned knowledge. A text-to-text framework has pushed the unification of models. However, the framework remains limited because it does not allow content with a layout as input and rests on the basic assumption that the task can be solved in a single step. To address these limitations, in this paper, we explore a new framework in which a model performs a task by manipulating displayed web pages in multiple steps. We develop two types of task web pages with different levels of difficulty and propose a BERT extension for the framework. We trained the BERT extension on those task pages jointly, and made the following observations. (1) The model maintains more than 80% of the performance of the original BERT separately fine-tuned in a single-step framework in five out of six tasks. (2) The model learned to solve tasks at both difficulty levels. (3) The model did not generalize effectively on unseen tasks. These results suggest that although room for improvement exists, we can transfer BERTs to multi-step tasks, such as using graphical user interfaces. PDF 1 2022
Toward Automatic Misinformation Detection Utilizing Fact-checked Information We propose a new task, FCCKB: Fact-Checking by Claim Knowledge Base. The goal is to fact-check a sentence utilizing verified claims stored in a database. To retrieve relevant claims from the large database, we propose applying Semantic Role Labeling (SRL) to the input sentence, which carries rich semantics, and then encoding the results to obtain fine-grained sentence embeddings. This improves semantic matching between the input sentence and the relevant claims. We use three sentence encoders for sentence encoding. On the FEVER dataset, precision and recall improved by more than 5 percent after SRL was applied. PDF 1 2022
Exploring Example Selection for Few-shot Text-to-SQL Semantic Parsing We study example selection methods for few-shot text-to-SQL tasks with unseen databases. Annotating natural language questions with corresponding SQL queries is expensive, but we can use abundant unlabeled questions to efficiently select examples to annotate and then use them to adapt models. Many previous works simply sample a few instances at random for few-shot learning, but such random selection is insufficient for finding representative and informative examples that provide specific domain knowledge. We thus explore methods to efficiently choose annotation examples. We identify two important factors: the diversity of the selected instances and their dissimilarity to the source training data, if any. A diverse training set contains more domain knowledge, while dissimilar examples are selected to fill in the domain gap between the source and target. We show that our best example selection approach substantially improves few-shot text-to-SQL performance both when finetuning T5 and with in-context learning using Codex: average execution accuracy gains of 8.7% and 4.3% over random selection. Our extensive analysis demonstrates the importance of the similarity metric and the embedding method for example representations. We also find that effective example selection reduces syntax errors on the target domains. Our results encourage future work to further explore example selection for efficient adaptation of text-to-SQL models. PDF 1 2022
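As one concrete reading of the diversity factor above, the sketch below selects annotation candidates by clustering unlabeled question embeddings and picking the question nearest each centroid; this is a plausible instantiation for illustration, not necessarily the paper's exact selection algorithm.

```python
# A sketch of diversity-driven example selection, assuming the unlabeled
# questions have already been embedded with some sentence encoder.
import numpy as np
from sklearn.cluster import KMeans

def select_diverse_examples(embeddings: np.ndarray, budget: int) -> list[int]:
    """Cluster the unlabeled pool and annotate the question nearest to each
    centroid, so the chosen set covers diverse regions of the target domain."""
    km = KMeans(n_clusters=budget, n_init=10, random_state=0).fit(embeddings)
    chosen = []
    for center in km.cluster_centers_:
        idx = int(np.argmin(np.linalg.norm(embeddings - center, axis=1)))
        chosen.append(idx)  # dedup omitted for brevity
    return chosen

pool = np.random.randn(500, 384)   # stand-in for question embeddings
print(select_diverse_examples(pool, budget=8))
```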
MANGO: Enhancing the Robustness of VQA Models via Adversarial Noise Generation Large-scale pre-trained vision-and-language (V+L) transformers have propelled the state of the art (SOTA) on Visual Question Answering (VQA) task. Despite impressive performance on the standard VQA benchmark, it remains unclear how robust these models are. To investigate, we conduct a host of evaluations over 4 different types of robust VQA datasets: (i) Linguistic Variation; (ii) Logical Reasoning; (iii) Visual Content Manipulation; and (iv) Answer Distribution Shift. Experiments show that pre-trained V+L models already exhibit better robustness than many task-specific SOTA methods via standard model finetuning. To further enhance model robustness, we propose Mango, a generic and efficient approach that learns a Multimodal Adversarial Noise GeneratOr in the embedding space to fool V+L models. Differing from previous studies focused on one specific type of robustness, Mango is agnostic to robustness types, and enables universal performance lift for both task-specific and pre-trained models over diverse robust VQA datasets designed to evaluate broad aspects of robustness. Comprehensive experiments demonstrate that Mango achieves new SOTA on 7 out of 9 robustness benchmarks. PDF 1 2022
Defending against Backdoor Attacks in Natural Language Generation The frustratingly fragile nature of neural network models makes current natural language generation (NLG) systems prone to backdoor attacks, causing them to generate malicious sequences that could be sexist or offensive. Unfortunately, little effort has been invested in studying how backdoor attacks can affect current NLG models and how to defend against these attacks. In this work, we investigate this problem on two important NLG tasks, machine translation and dialogue generation. By giving a formal definition of backdoor attack and defense, and developing corresponding benchmarks, we design methods to attack NLG models, which achieve high attack success rates in making NLG models generate malicious sequences. To defend against these attacks, we propose to detect the attack trigger by examining the effect of deleting or replacing certain words on the generation outputs, which we find successful for certain types of attacks. We discuss the limitations of this work, and hope it can raise awareness of the backdoor risks concealed in deep NLG systems. PDF 1 2022
Divide and Conquer: Text Semantic Matching with Disentangled Keywords and Intents Text semantic matching is a fundamental task that has been widely used in various scenarios, such as community question answering, information retrieval, and recommendation. Most state-of-the-art matching models, e.g., BERT, directly perform text comparison by processing each word uniformly. However, a query sentence generally comprises content that calls for different levels of matching granularity. Specifically, keywords represent factual information such as action, entity, and event that should be strictly matched, while intents convey abstract concepts and ideas that can be paraphrased into various expressions. In this work, we propose a simple yet effective training strategy for text semantic matching in a divide-and-conquer manner by disentangling keywords from intents. Our approach can be easily combined with pre-trained language models (PLM) without influencing their inference efficiency, achieving stable performance improvements against a wide range of PLMs on three benchmarks. PDF 1 2022
Improving both domain robustness and domain adaptability in machine translation We address two problems of domain adaptation in neural machine translation. First, we want to reach domain robustness, i.e., good quality both on the domains seen in the training data and on domains unseen in the training data. Second, we want our systems to be adaptive, i.e., making it possible to finetune systems with just hundreds of in-domain parallel sentences. In this paper, we introduce a novel combination of two previous approaches, word adaptive modelling, which addresses domain robustness, and meta-learning, which addresses domain adaptability, and we present empirical results showing that our new combination improves both of these properties. Our source code is attached and will be made publicly available. PDF 1 2022
Faster Nearest Neighbor Machine Translation $k$NN based neural machine translation ($k$NN-MT) has achieved state-of-the-art results in a variety of MT tasks. One significant shortcoming of $k$NN-MT lies in its inefficiency in identifying the $k$ nearest neighbors of the query representation in the entire datastore, which is prohibitively time-intensive when the datastore is large. In this work, we propose \textbf{Faster $k$NN-MT} to address this issue. The core idea of Faster $k$NN-MT is to use a hierarchical clustering strategy to approximate the distance between the query and a data point in the datastore, which is decomposed into two parts: the distance between the query and the center of the cluster that the data point belongs to, and the distance between the data point and the cluster center. We propose practical ways to compute these two parts in a significantly faster manner. Through extensive experiments on different MT benchmarks, we show that \textbf{Faster $k$NN-MT} is faster than Fast $k$NN-MT \citep{meng2021fast} and only slightly (1.2 times) slower than its vanilla counterpart, while preserving the performance of $k$NN-MT. Faster $k$NN-MT enables the deployment of $k$NN-MT models on real-world MT services. PDF 1 2022
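The distance decomposition at the heart of this approach can be sketched in a few lines: point-to-center distances are precomputed offline, so only query-to-center distances are computed online. The clustering setup and sizes below are illustrative assumptions.

```python
# A sketch of the clustering-based distance approximation: d(q, x) is
# approximated from d(q, center(x)) plus the precomputed d(x, center(x)).
import numpy as np
from sklearn.cluster import KMeans

datastore = np.random.randn(10000, 64).astype(np.float32)  # stand-in datastore keys
km = KMeans(n_clusters=64, n_init=4, random_state=0).fit(datastore)
centers = km.cluster_centers_
assign = km.labels_  # cluster id of each datastore entry
# Offline: each point's distance to its own cluster center.
point_to_center = np.linalg.norm(datastore - centers[assign], axis=1)

def approx_knn(query: np.ndarray, k: int = 8) -> np.ndarray:
    """Approximate k nearest neighbors: only query-to-center distances are
    computed online; the point-to-center part is read from the cache."""
    q_to_center = np.linalg.norm(centers - query, axis=1)
    # Upper bound on the true distance via the triangle inequality.
    approx_dist = q_to_center[assign] + point_to_center
    return np.argsort(approx_dist)[:k]

print(approx_knn(np.random.randn(64).astype(np.float32)))
```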
DEGREE: A Data-Efficient Generation-Based Event Extraction Model Due to the high cost of human annotations, learning a data-efficient event extraction model that can be trained with only a few labeled examples has become a crucial challenge. In this paper, we focus on low-resource end-to-end event extraction. We propose DEGREE, a model that formulates event extraction as a conditional generation problem. Given a passage and a manually designed prompt, DEGREE learns to summarize the event happening in the passage into a natural sentence that follows a predefined pattern. The final event structure predictions are then extracted from the generated sentence with a deterministic algorithm. DEGREE has the following advantages that allow it to learn well with less training data. First, with our design of prompts, DEGREE obtains semantic guidance by leveraging label semantics and thus better captures the argument roles. In addition, the proposed model is capable of using additional weakly-supervised information, such as the description of events. Finally, learning triggers and argument roles in an end-to-end manner encourages the model to better utilize the shared knowledge and dependencies between them. Our experimental results and ablation studies demonstrate the strong performance of DEGREE for low-resource event extraction. PDF 1 2022
LMTurk: Few-Shot Learners as Crowdsourcing Workers Vast efforts have been devoted to creating high-performance few-shot learners, i.e., large-scale pretrained language models (PLMs) that perform well with little downstream task training data. Training PLMs has incurred significant cost, but utilizing the few-shot learners is still challenging due to their enormous size. This work focuses on a crucial question: How to make effective use of these few-shot learners? We propose LMTurk, a novel approach that treats few-shot learners as crowdsourcing workers. The rationale is that crowdsourcing workers are in fact few-shot learners: They are shown a few illustrative examples to learn about a task and then start annotating. LMTurk employs few-shot learners built upon PLMs as workers. We show that the resulting annotations can be utilized to train models that solve the task well and are small enough to be deployable in practical scenarios. Altogether, LMTurk is an important step towards making effective use of current PLMs. PDF 1 2022
An MRC Framework for Semantic Role Labeling Semantic Role Labeling (SRL) aims at recognizing the predicate-argument structure of a sentence and can be decomposed into two subtasks: predicate disambiguation and argument labeling. Prior work deals with these two tasks independently, which ignores the semantic connection between the two tasks. In this paper, we propose to use the machine reading comprehension (MRC) framework to bridge this gap. We formalize predicate disambiguation as multiple-choice machine reading comprehension, where the descriptions of candidate senses of a given predicate are used as options to select the correct sense. The chosen predicate sense is then used to determine the semantic roles for that predicate, and these semantic roles are used to construct the query for another MRC model for argument labeling. In this way, we are able to leverage both the predicate semantics and the semantic role semantics for argument labeling. We also propose to select a subset of all the possible semantic roles for computational efficiency. Experiments show that the proposed framework achieves state-of-the-art results on both span and dependency benchmarks. PDF 1 2022
Learning Non-Autoregressive Models from Search for Unsupervised Sentence Summarization Text summarization aims to generate a short summary for an input text. In this work, we propose a Non-Autoregressive Unsupervised Summarization (NAUS) approach, which does not require parallel data for training. Our NAUS first performs edit-based search towards a heuristically defined score, and generates a summary as pseudo-groundtruth. Then, we train an encoder-only non-autoregressive Transformer based on the search result. We also propose a dynamic programming approach for length-control decoding, which is important for the summarization task. Experiments on the Gigaword headline generation and DUC2004 datasets show that NAUS achieves state-of-the-art performance for unsupervised summarization while greatly improving inference efficiency. Further, our algorithm is able to perform length-transfer summary generation. PDF 1 2022
Is More Data Better? Using Transformers-Based Active Learning for Efficient and Effective Detection of Abusive Language Annotating abusive language content can cause psychological harm; yet, most machine learning research has prioritized efficacy (i.e., F1 or accuracy scores) while little research has analyzed data efficiency (i.e., how to minimize annotation requirements). In this paper, we use a series of simulated experiments over two datasets at varying percentages of abuse to demonstrate that transformers-based active learning is a promising approach that maintains high efficacy but substantially raises efficiency, requiring a fraction of labeled data to reach equivalent performance to passive training over the full dataset. PDF 1 2022
Data Augmentation for Biomedical Factoid Question Answering We study the effect of seven data augmentation (DA) methods in factoid question answering, focusing on the biomedical domain, where obtaining training instances is particularly difficult. We experiment with data from the BIOASQ challenge, which we augment with training instances obtained from an artificial biomedical machine reading comprehension dataset, or via back-translation, information retrieval, word substitution based on WORD2VEC embeddings, masked language modeling, question generation, or extending the given passage with additional context. We show that DA can lead to very significant performance gains, even when using large pre-trained Transformers, contributing to a broader discussion of if/when DA benefits large pre-trained models. One of the simplest DA methods, WORD2VEC-based word substitution, performed best and is the one we recommend. We release our artificial training instances and code. PDF 1 2022
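For the WORD2VEC-based substitution that performed best, a minimal sketch using gensim might look as follows; the embedding file path and substitution rate are assumptions for illustration, not the paper's configuration.

```python
# A sketch of WORD2VEC-based word substitution for data augmentation:
# a fraction of in-vocabulary tokens is swapped for a nearest neighbor
# in the embedding space.
import random
from gensim.models import KeyedVectors

# Hypothetical path to domain word2vec embeddings in the standard binary format.
kv = KeyedVectors.load_word2vec_format("biomedical_w2v.bin", binary=True)

def substitute_words(tokens: list[str], rate: float = 0.1) -> list[str]:
    """Replace roughly `rate` of the in-vocabulary tokens with a close neighbor."""
    out = []
    for tok in tokens:
        if tok in kv and random.random() < rate:
            neighbor, _score = kv.most_similar(tok, topn=1)[0]
            out.append(neighbor)
        else:
            out.append(tok)
    return out

print(substitute_words("the protein binds the receptor".split()))
```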
HIE-SQL: History Information Enhanced Network for Context-Dependent Text-to-SQL Semantic Parsing Previous works on context-dependent text-to-SQL semantic parsing leverage context-dependence information either from interaction history utterances or from the previously predicted SQL queries, but fail to take advantage of both because of the mismatch between natural language and logic-form SQL. In this work, we propose a History Information Enhanced text-to-SQL model (HIE-SQL) to exploit context-dependence information from both history utterances and the last predicted SQL query. In view of the mismatch, we treat natural language and SQL as two modalities and propose a bimodal pre-trained model to bridge the gap between them. Besides, we design a schema linking graph to enhance connections from utterances and the SQL query to the database schema. We achieve new state-of-the-art results on the two context-dependent text-to-SQL benchmarks, SparC and CoSQL, at the time of writing. PDF 1 2022
SemAttack: Natural Textual Attacks via Different Semantic Spaces Recent studies show that pre-trained language models (LMs) are vulnerable to textual adversarial attacks. However, existing attack methods either suffer from low attack success rates or fail to search efficiently in the exponentially large perturbation space. We propose an efficient and effective framework SemAttack to generate natural adversarial text by constructing different semantic perturbation functions. In particular, SemAttack optimizes the generated perturbations constrained on generic semantic spaces, including typo space, knowledge space (e.g., WordNet), contextualized semantic space (e.g., the embedding space of BERT clusterings), or the combination of these spaces. Thus, the generated adversarial texts are more semantically close to the original inputs. Extensive experiments reveal that state-of-the-art (SOTA) large-scale LMs (e.g., DeBERTa-v2) and defense strategies (e.g., FreeLB) are still vulnerable to SemAttack. We further demonstrate that SemAttack is general and able to generate natural adversarial texts for different languages (e.g., English and Chinese) with high attack success rates. Human evaluations also confirm that our generated adversarial texts are natural and barely affect human performance. PDF 1 2022
Topic-controllable Abstractive Summarization Existing approaches for topic-controllable summarization either incorporate topic embeddings or modify the attention mechanism. The incorporation of such approaches in a particular summarization model requires the adaptation of its codebase, a process that can be complex and time-consuming. Instead, we propose a model-agnostic topic-controllable summarization method employing a simple tagging-based formulation that can effortlessly work with any summarization model. In addition, we propose a new topic-oriented evaluation measure to quantitatively evaluate the generated summaries based on the topic affinity between the generated summary and the desired topic. Experimental results show that the proposed tagging-based formulation can achieve similar or even better performance compared to the embedding-based approach, while at the same time being significantly faster. PDF 1 2022
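The tagging-based formulation can be illustrated with a one-line preprocessing step: the desired topic is injected as a plain control tag in the source text, so an unmodified seq2seq summarizer can condition on it. The tag format below is an assumption for illustration, not the paper's exact scheme.

```python
# A minimal sketch of tagging-based topic control: the topic appears as an
# ordinary token sequence in the source, both at training and inference time,
# so no architectural change to the summarizer is needed.
def add_topic_tag(document: str, topic: str) -> str:
    return f"[TOPIC={topic}] {document}"

src = add_topic_tag("The match ended 2-1 after a late penalty ...", "sports")
print(src)
# `src` is then fed to any summarization model trained with the same tagging.
```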
Addressing Segmentation Ambiguity in Neural Linguistic Steganography Previous studies on neural linguistic steganography, except Ueoka et al. (2021), overlook the fact that the sender must detokenize cover texts to avoid arousing the eavesdropper's suspicion. In this paper, we demonstrate that segmentation ambiguity indeed causes occasional decoding failures at the receiver's side. With the near-ubiquity of subwords, this problem now affects any language. We propose simple tricks to overcome this problem, which are even applicable to languages without explicit word boundaries. PDF 1 2022
BERT Learns to Teach: Knowledge Distillation with Meta Learning We present Knowledge Distillation with Meta Learning (MetaDistil), a simple yet effective alternative to traditional knowledge distillation (KD) methods where the teacher model is fixed during training. We show the teacher network can learn to better transfer knowledge to the student network (i.e., \textit{learning to teach}) with feedback from the performance of the distilled student network in a meta learning framework. Moreover, we introduce a pilot update mechanism to improve the alignment between the inner-learner and meta-learner in meta learning algorithms that focus on an improved inner-learner. Experiments on various benchmarks show that MetaDistil can yield significant improvements compared with traditional KD algorithms and is less sensitive to the choice of student capacity and hyperparameters, facilitating the use of KD on different tasks and models. PDF 1 2022
Progressive Class Semantic Matching for Semi-supervised Text Classification Semi-supervised learning is a promising way to reduce the annotation cost for text classification. Combined with pre-trained language models (PLMs), e.g., BERT, recent semi-supervised learning methods have achieved impressive performance. In this work, we further investigate the marriage between semi-supervised learning and a pre-trained language model. Unlike existing approaches that utilize PLMs only for model parameter initialization, we explore the inherent topic matching capability inside PLMs for building a more powerful semi-supervised learning approach. Specifically, we propose a joint semi-supervised learning process that can progressively build a standard $K$-way classifier and a matching network for the input text and the Class Semantic Representation (CSR). The CSR is initialized from the given labeled sentences and progressively updated through the training process. By means of extensive experiments, we show that our method not only brings remarkable improvements over baselines, but is also overall more stable, and achieves state-of-the-art performance in semi-supervised text classification. PDF 1 2022
Mosaic Augmentation for Text: Cropping and Collaging as Cross-Domain Techniques We present new visually inspired cropping and collaging data augmentations for text. We test how these augmentations impact data-scarce scenarios over multiple NLP tasks: named entity recognition, extractive question answering and abstractive summarization, across 9 prominent datasets. Ablation studies show different prevailing reasons for the augmentations' effectiveness for the different tasks, but all benefit from our approach. We achieve significant improvements over baselines, particularly for limited data use cases. PDF 1 2022
KNN-BERT: Fine-Tuning Pre-Trained Models with KNN Classifier Pre-trained models are widely fine-tuned on downstream tasks with linear classifiers optimized by the cross-entropy loss, which might face robustness and stability problems. These problems can be alleviated by learning representations that focus on similarities within the same class and variance across different classes when making predictions. In this paper, we utilize the K-Nearest Neighbors Classifier in pre-trained model fine-tuning. For this KNN classifier, we introduce a supervised momentum contrastive learning framework to learn the clustered representations of the supervised downstream tasks. Extensive experiments on text classification tasks and robustness tests show that by incorporating KNNs into the traditional fine-tuning process, we can obtain significant improvements in clean accuracy in both rich-resource and few-shot settings and can improve robustness against adversarial attacks.\footnote{All code will be available at https://github.com//} PDF 1 2022
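A minimal sketch of the KNN inference step is shown below, assuming training-set [CLS] features have already been extracted; the momentum contrastive training that shapes these features is omitted, and the shapes and k are illustrative.

```python
# A sketch of KNN classification over pretrained-encoder features:
# training-set [CLS] embeddings act as a memory, and a test example is
# labeled by a vote among its nearest neighbors.
import numpy as np

def knn_predict(train_feats: np.ndarray, train_labels: np.ndarray,
                query: np.ndarray, k: int = 16) -> int:
    """Cosine-similarity KNN vote in the encoder's representation space."""
    t = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    topk = np.argsort(t @ q)[-k:]          # indices of the k most similar examples
    votes = np.bincount(train_labels[topk])
    return int(np.argmax(votes))

feats = np.random.randn(200, 768)          # stand-in for [CLS] features
labels = np.random.randint(0, 2, size=200)
print(knn_predict(feats, labels, np.random.randn(768)))
```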
Rebuild and Ensemble: Exploring Defense Against Text Adversaries Adversarial attacks can mislead strong neural models; in NLP tasks, substitution-based attacks are difficult to defend against. Current defense methods usually assume that the substitution candidates are accessible, so they cannot be widely applied against adversarial attacks unless the mechanism of the attack is known. In this paper, we propose a \textbf{Rebuild and Ensemble} framework to defend against adversarial attacks on text without knowing the candidates. We propose a rebuild mechanism to train a robust model and ensemble the rebuilt texts during inference to achieve good adversarial defense results. Experiments show that our method can improve accuracy under the current strong attack methods. PDF 1 2022
Neighbors Are Not Strangers: Improving Non-Autoregressive Translation under Low-Frequency Lexical Constraints Lexically constrained neural machine translation (NMT) draws much industrial attention for its practical usage in specific domains. However, current autoregressive approaches suffer from high latency. In this paper, we focus on non-autoregressive translation (NAT) for this problem for its efficiency advantage. We identify that current constrained NAT models, which are based on iterative editing, do not handle low-frequency constraints well. To this end, we propose a plug-in algorithm for this line of work, i.e., Aligned Constrained Training (ACT), which alleviates this problem by familiarizing the model with the source-side context of the constraints. Experiments on the general and domain datasets show that our model improves over the backbone constrained NAT model in constraint preservation and translation quality, especially for rare constraints. PDF 1 2022
Balancing the Style-Content Trade-Off in Sentiment Transfer Using Polarity-Aware Denoising We present a polarity-aware denoising-based sentiment transfer model, which accurately controls the sentiment attributes in generated text while preserving the content to a great extent. Though current models have shown good results, two major issues remain: (1) target sentences still retain the sentiment of the source sentences, and (2) content preservation in transferred sentences is insufficient. Our proposed polarity-aware enhanced denoising mechanism helps balance the style-content trade-off in sentiment-controlled generation. Our proposed method is structured around two key stages in the sentiment transfer process: better representation learning using a shared encoder (pre-trained on the general domain) and sentiment-controlled generation using separate decoders. Our extensive experimental results show that our method achieves good results in balancing sentiment transfer with content preservation. PDF 1 2022
Good Examples Make A Faster Learner: Simple Demonstration-based Learning for Low-resource NER Recent advances in prompt-based learning have shown strong results on few-shot text classification by using cloze-style templates. Similar attempts have been made for named entity recognition (NER), manually designing templates to predict entity types for every text span in a sentence. However, such methods may suffer from error propagation induced by entity span detection, high cost due to the enumeration of all possible text spans, and omission of inter-dependencies among token labels in a sentence. Here we present a simple demonstration-based learning method for NER, which prefaces the input with task demonstrations for in-context learning. We perform a systematic study of demonstration strategy regarding what to include (entity examples, with or without surrounding context), how to select the examples, and what templates to use. Results on in-domain learning and domain adaptation show that the model's performance in low-resource settings can be largely improved with a suitable demonstration strategy (e.g., a 4-17% improvement on 25 train instances). We also find that good demonstrations can save many labeled examples, and that consistency in demonstrations contributes to better performance. PDF 1 2022
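A minimal sketch of demonstration-based input construction for NER follows; the verbalization template (one of several strategies the paper studies) and the example entities are illustrative assumptions.

```python
# A sketch of prefacing an NER input with verbalized, labeled demonstrations,
# so the tagger can condition on them in context.
def build_demonstration_input(demos: list[tuple[str, list[tuple[str, str]]]],
                              sentence: str) -> str:
    parts = []
    for demo_sent, entities in demos:
        tags = "; ".join(f"{span} is {etype}" for span, etype in entities)
        parts.append(f"{demo_sent} | {tags}")
    # Demonstrations come first, then the target sentence to be tagged.
    return " [SEP] ".join(parts + [sentence])

demos = [("Barack Obama visited Paris",
          [("Barack Obama", "PER"), ("Paris", "LOC")])]
print(build_demonstration_input(demos, "Angela Merkel arrived in Berlin"))
```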
Beyond Emotion: A Multi-Modal Dataset for Human Desire Understanding Desire is a strong wish to do or have something, which involves not only a linguistic expression, but also underlying cognitive phenomena driving human feelings. As the most primitive and basic human instinct, conscious desire is often accompanied by a range of emotional responses. As a strikingly understudied task, it is difficult for machines to model and understand desire due to the unavailability of benchmarking datasets with desire and emotion labels. To bridge this gap, we present MSED, the first multi-modal and multi-task sentiment, emotion and desire dataset, which contains 9,190 text-image pairs, with English text. Each multi-modal sample is annotated with six desires, three sentiments and six emotions. We also propose the state-of-the-art baselines to evaluate the potential of MSED and show the importance of multi-task and multi-modal clues for desire understanding. We hope this study provides a benchmark for human desire analysis. MSED will be publicly available for research. PDF 1 2022
Word-level Stroke Trajectory Recovery for Handwriting with Gaussian Dynamic Time Warping Handwriting trajectory recovery has recently gained more attention for practical applications such as personalized messages. It is a sequence learning problem from image to handwriting stroke sequence, where Dynamic Time Warping (DTW) is a preferred loss function. However, aligning two sequences of varying length with the DTW loss accumulates the differences between the predicted and ground-truth strokes over the entire line-level text. As a result of averaging over long sequences, the DTW loss cannot distinguish between a small number of perceptually significant errors and a large number of visually insignificant errors. To address this issue, we propose two new strategies. First, we propose applying DTW to words instead of line-level text, so that the DTW loss is not averaged out over all the words in the line-level text. Moreover, to align the predicted and ground-truth sequences for each word, we propose weighting the cost matrix with a Gaussian function so that predicted strokes far from the ground truth are penalized heavily. This strategy for word-level stroke trajectory learning improves quantitative and qualitative results. PDF 1 2022
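A sketch of DTW with a Gaussian-derived weighting of the cost matrix is given below. The paper's exact weighting function is not reproduced here; this is one plausible reading in which point pairs that lie far apart have their cost inflated, so a few large deviations are penalized more than many tiny ones.

```python
# A sketch of Gaussian-weighted DTW between a predicted and a ground-truth
# stroke-point sequence. The weight ~1 for near-zero errors and approaches 2
# for large errors; sigma is an illustrative hyperparameter.
import numpy as np

def gaussian_weighted_dtw(pred: np.ndarray, gt: np.ndarray,
                          sigma: float = 1.0) -> float:
    """pred: (n, 2), gt: (m, 2) arrays of 2D stroke points."""
    n, m = len(pred), len(gt)
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # (n, m)
    w = 2.0 - np.exp(-(d ** 2) / (2 * sigma ** 2))  # heavier weight when far off
    cost = d * w
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):           # standard DTW dynamic program
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    return float(acc[n, m])

print(gaussian_weighted_dtw(np.random.rand(30, 2), np.random.rand(25, 2)))
```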
Can Multilinguality Benefit Non-autoregressive Machine Translation? Non-autoregressive (NAR) machine translation has recently achieved significant improvements, and now outperforms autoregressive (AR) models on some benchmarks, providing an efficient alternative to AR inference. However, while AR translation is often implemented using multilingual models that benefit from transfer between languages and from improved serving efficiency, multilingual NAR models remain relatively unexplored. Taking Connectionist Temporal Classification (CTC) as an example NAR model and Imputer as a semi-NAR model, we present a comprehensive empirical study of multilingual NAR. We test its capabilities with respect to positive transfer between related languages and negative transfer under capacity constraints. As NAR models require distilled training sets, we carefully study the impact of bilingual versus multilingual teachers. Finally, we fit a scaling law for multilingual NAR, which quantifies its performance relative to the AR model as model scale increases. PDF 1 2022
DG2: Data Augmentation Through Document Grounded Dialogue Generation Collecting data for training dialog systems can be extremely expensive due to the involvement of human participants and the need for extensive annotation. Especially in document-grounded dialog systems, human experts need to carefully read the unstructured documents to answer the users' questions. As a result, existing document-grounded dialog datasets are relatively small-scale and obstruct the effective training of dialogue systems. In this paper, we propose an automatic data augmentation technique grounded on documents through a generative dialogue model. The dialogue model consists of a user bot and agent bot that can synthesize diverse dialogues given an input document, which are then used to train a downstream model. When supplementing the original dataset, our method achieves significant improvement over traditional data augmentation methods. We also achieve great performance in the low-resource setting. PDF 1 2022
Causal Distillation for Language Models Distillation efforts have led to language models that are more compact and efficient without serious drops in performance. The standard approach to distillation trains a student model against two objectives: a task-specific objective (e.g., language modeling) and an imitation objective that encourages the hidden states of the student model to be similar to those of the larger teacher model. In this paper, we show that it is beneficial to augment distillation with a third objective that encourages the student to imitate the \emph{causal} dynamics of the teacher through a distillation interchange intervention training objective (DIITO). DIITO pushes the student model to become a \emph{causal abstraction} of the teacher model -- a faithful model with simpler causal structure. DIITO is fully differentiable, easily implemented, and combines flexibly with other objectives. Compared against standard distillation with the same setting, DIITO results in lower perplexity on the WikiText-103M corpus (masked language modeling) and marked improvements on the GLUE benchmark (natural language understanding), SQuAD (question answering), and CoNLL-2003 (named entity recognition). PDF 1 2022
Improving negation detection with negation-focused pre-training Negation is a common linguistic feature that is crucial in many language understanding tasks, yet it remains a hard problem due to diversity in its expression in different types of text. Recent works show that state-of-the-art NLP models underperform on samples containing negation in various tasks, and that negation detection models do not transfer well across domains. We propose a new negation-focused pre-training strategy, involving targeted data augmentation and negation masking, to better incorporate negation information into language models. Extensive experiments on common benchmarks show that our proposed approach improves negation detection performance and generalizability over the strong baseline NegBERT (Khandelwal and Sawant, 2020). PDF 1 2022
Integrating Empirical Knowledge into Multi-View Feature Attention Network for Disease Diagnosis As one of the currently significant problems in AI-enabled healthcare research, disease diagnosis based on medical text has made substantial progress. However, the diagnostic evidences differ in length, making it difficult to capture multi-scale features of each disease. Moreover, recent studies have discovered that structural knowledge from medical text is critical for disease diagnosis. This paper proposes integrating empirical knowledge of diseases into a multi-view feature attention network to address these issues. The multi-view feature attention network employs multiple encoders to capture segment information from the diagnostic evidences of each illness. Meanwhile, we use an abductive causal graph constructed from medical text to extract the empirical knowledge representation of diseases via a graph convolutional network. The evaluation conducted on the MIMIC-III-50 dataset and a Chinese dataset demonstrates that the proposed method outperforms structural knowledge-based state-of-the-art models. PDF 1 2022
Cross-lingual Lifelong Learning The longstanding goal of multi-lingual learning has been to develop a universal cross-lingual model that can withstand the changes in multi-lingual data distributions. However, most existing models assume full access to the target languages in advance, whereas in realistic scenarios this is not often the case, as new languages can be incorporated later on. In this paper, we present the Cross-lingual Lifelong Learning (CLL) challenge, where a model is continually fine-tuned to adapt to emerging data from different languages. We provide insights into what makes multilingual sequential learning particularly challenging. To surmount such challenges, we benchmark a representative set of cross-lingual continual learning algorithms and analyze their knowledge preservation, accumulation, and generalization capabilities compared to baselines on carefully curated datastreams. The implications of this analysis include a recipe for how to measure and balance between different cross-lingual continual learning desiderata, which goes beyond conventional transfer learning. PDF 1 2022
Competition Dynamics in the Meme Ecosystem The creation and sharing of memes is a common modality of online social interactions. The goal of the present work is to better understand the collective dynamics of memes in this accelerating and competitive environment. By taking an ecological perspective and tracking the meme-text from 352 popular memes over the entirety of Reddit, we are able to show that the frequency of memes has scaled almost exactly with the total amount of content created over the past decade. One consequence of limited human attention in the face of a growing number of memes is that the diversity of these memes has decreased at the community level, albeit slightly, in the same period. Another consequence is that the average lifespan of a meme has decreased dramatically, which is further evidence of an increase in competition among memes and a decreasing collective attention span. PDF 1 2022
KCD: Knowledge Walks and Textual Cues Enhanced Political Perspective Detection in News Media Political perspective detection has become an increasingly important task that can help combat echo chambers and political polarization. Previous approaches generally focus on leveraging textual content to identify stances, while they fail to reason with background knowledge or leverage the rich semantic and syntactic textual labels in news articles. In light of these limitations, we propose KCD, a political perspective detection approach to enable multi-hop knowledge reasoning and incorporate textual cues as paragraph-level labels. Specifically, we first generate random walks on external knowledge graphs and infuse them with news text representations. We then construct a heterogeneous information network to jointly model news content as well as semantic, syntactic and entity cues in news articles. Finally, we adopt relational graph neural networks for graph-level representation learning and conduct political perspective detection. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods on two benchmark datasets. We further examine the effect of knowledge walks and textual cues and how they contribute to our approach's data efficiency. PDF 1 2022
Visual Commonsense in Pretrained Unimodal and Multimodal Models Our commonsense knowledge about objects includes their typical visual attributes; we know that bananas are typically yellow or green, and not purple. Text and image corpora, being subject to reporting bias, represent this world-knowledge to varying degrees of faithfulness. In this paper, we investigate to what degree unimodal (language-only) and multimodal (image and language) models capture a broad range of visually salient attributes. To that end, we automatically extract a visually-grounded commonsense dataset covering 5 property types (color, shape, material, size, and visual co-occurrence) for over 5000 subjects. We validate this dataset by showing that our grounded color data correlates much better than ungrounded text-only data with crowdsourced color judgments provided by Paik et al. (2021). We then use our dataset to evaluate pre-trained unimodal models and multimodal models. Our results show that multimodal models better reconstruct attribute distributions, but are still subject to reporting bias. Moreover, increasing model size does not enhance performance, suggesting that the key to visual commonsense lies in the data. PDF 1 2022
Quantifying Adaptability in Pre-trained Language Models with 500 Tasks When a neural language model (LM) is adapted to perform a new task, what aspects of the task predict the eventual performance of the model? In NLP, systematic features of LM generalization to individual examples are well characterized, but systematic aspects of LM adaptability to new tasks are not nearly as well understood. We present a large-scale empirical study of the features and limits of LM adaptability using a new benchmark, TaskBench500, built from 500 procedurally generated sequence modeling tasks. These tasks combine core aspects of language processing, including lexical semantics, sequence processing, memorization, logical reasoning, and world knowledge. Using TaskBench500, we evaluate three facets of adaptability, finding that: (1) adaptation procedures differ dramatically in their ability to memorize small datasets; (2) within a subset of task types, adaptation procedures exhibit compositional adaptability to complex tasks; and (3) failure to match training label distributions is explained by mismatches in the intrinsic difficulty of predicting individual labels. Our experiments show that adaptability to new tasks, like generalization to new examples, can be systematically described and understood, and we conclude with a discussion of additional aspects of adaptability that could be studied using the new benchmark. PDF 1 2022
Tree Knowledge Distillation for Compressing Transformer-Based Language Models Knowledge distillation has emerged as a promising technique for compressing neural language models. However, most knowledge distillation methods focus on extracting the ``knowledge'' from a teacher network to guide the training of a student network, ignoring the ``requirements'' of the student. In this paper, we introduce Tree Knowledge Distillation for Transformer-based teacher and student models, which allows the student to actively extract its ``requirements'' via a tree of tokens. Specifically, we first choose the [CLS] token at the output layer of the Transformer in the student as the root of the tree. We choose the tokens with the highest values in the [CLS] row of the attention feature map at the second-to-last layer as the children of the root. Then we choose the children of these nodes in their corresponding rows of the attention feature map at the next layer, respectively. Later, we connect layers of the Transformer in the student to corresponding layers in the teacher by skipping every $t$ layers. At last, we improve the loss function by adding the summed mean squared errors between the embeddings of the tokens in the tree. The experiments show that tree knowledge distillation achieves competitive performance for compressing BERT among other knowledge distillation methods on the GLUE benchmark. PDF 1 2022
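The token-tree construction can be sketched as follows, with random attention maps standing in for the student's real ones; the branching factor, depth, and function name are illustrative assumptions.

```python
# A sketch of growing the token tree: each node expands to the tokens that
# receive the highest attention in its own row of the next (lower) layer's
# attention map, starting from [CLS] at the top.
import numpy as np

def grow_tree(attn_by_layer: list[np.ndarray], root: int = 0, k: int = 2) -> dict:
    """attn_by_layer: per-layer (seq_len, seq_len) attention maps, ordered
    from the top layer downward. Returns a parent -> children mapping."""
    frontier, tree = [root], {root: []}
    for attn in attn_by_layer:
        next_frontier = []
        for node in frontier:
            children = np.argsort(attn[node])[-k:].tolist()  # top-k attended tokens
            tree.setdefault(node, []).extend(children)
            next_frontier.extend(children)
        frontier = next_frontier
    return tree

layers = [np.random.rand(8, 8) for _ in range(2)]  # stand-in attention maps
print(grow_tree(layers, root=0))                   # token 0 plays the role of [CLS]
```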
Language Models for Code-switch Detection of te reo Māori and English in a Low-resource Setting Te reo Māori, New Zealand's only indigenous language, is code-switched with English. Most Māori speakers are bilingual, and the use of Māori is increasing in New Zealand English. Unfortunately, due to the minimal availability of resources, including digital data, Māori is under-represented in technological advances. Cloud-based systems such as Google and Azure support Māori language detection. However, we provide experimental evidence to show that the accuracy of such systems is low when detecting Māori. Hence, with the support of the Māori community, we collect Māori and bilingual data and use natural language processing (NLP) to improve Māori language detection. We train bilingual sub-word embeddings and provide evidence to show that our bilingual embeddings improve overall accuracy compared to the publicly-available monolingual embeddings. This improvement has been verified for various NLP tasks using three bilingual databases containing formal transcripts and informal social media data. We also show that a BiLSTM with bilingual sub-word embeddings outperforms large-scale contextual language models such as BERT on downstream tasks of detecting Māori language. The best accuracy of 87% was obtained using the BiLSTM with bilingual embeddings for detecting code-switch points in bilingual sentences. PDF 1 2022
MetaICL: Learning to Learn In Context We introduce MetaICL (Meta-training for In-Context Learning), a new meta-training framework for few-shot learning where a pretrained language model is tuned to do in-context learning on a large set of training tasks. This meta-training enables the model to more effectively learn a new task in context at test time, by simply conditioning on a few training examples with no parameter updates or task-specific templates. We experiment on a large, diverse collection of tasks consisting of 142 NLP datasets including classification, question answering, natural language inference, paraphrase detection and more, across seven different meta-training/target splits. MetaICL outperforms a range of baselines including in-context learning without meta-training and multi-task learning followed by zero-shot transfer. We find that the gains are particularly significant for target tasks that have domain shifts from the meta-training tasks, and that using a diverse set of the meta-training tasks is key to improvements. We also show that MetaICL approaches (and sometimes beats) the performance of models fully finetuned on the target task training data, and outperforms much bigger models with nearly 8x as many parameters. PDF 1 2022
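A minimal sketch of how a MetaICL-style instance can be packed: k labeled examples from one task plus a query are concatenated into a single sequence, and the LM learns to emit the query's output as a continuation. The verbalization format below is an assumption for illustration.

```python
# A sketch of packing a meta-training (or test-time) in-context instance:
# the same format is used at meta-training and at inference on unseen tasks,
# with no parameter updates at test time.
def pack_icl_instance(examples: list[tuple[str, str]], query: str) -> str:
    demo = " ".join(f"{x} {y}" for x, y in examples)  # k conditioning examples
    return f"{demo} {query}"  # the LM is trained to produce the label next

train_examples = [("great movie!", "positive"), ("waste of time.", "negative")]
print(pack_icl_instance(train_examples, "an instant classic."))
# During meta-training, the loss is applied only to the target label tokens.
```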
Enhanced Protein-Protein Interactions Extraction from the Literature using Entity Type- and Position-aware Representation Since protein-protein interactions (PPIs) are crucial to understanding living systems, harvesting these data is essential to probe the development of diseases and to understand gene/protein functions and biological processes. Some curated datasets exist containing PPI data derived from the literature and other sources (e.g., IntAct, BioGrid, DIP and HPRD), but these are far from exhaustive and their maintenance is a labor-intensive process. On the other hand, machine learning (ML) methods to automate PPI knowledge extraction from the scientific literature have been limited by a shortage of appropriate annotated data. In this work, we create a unified multi-source PPI corpus with vetted interaction definitions, augmented by binary interaction type labels. We also present a Transformer-based deep learning method, exploiting entity type and positional information for relation representation to improve relation classification performance. We evaluated our model's performance on three widely studied relation extraction datasets from the biology and computer science domains, as well as our work's target PPI datasets, to observe the effectiveness of the representation for relation extraction tasks in various domains, and found it to outperform prior state-of-the-art (SOTA) models. PDF 1 2022
Guiding Topic Flows in the Generative Chatbot by Enhancing the ConceptNet with the Conversation Corpora Human conversations consist of reasonable and natural topic flows, which are observed as the shifts of the mentioned concepts across utterances. Previous chatbots that incorporate an external commonsense knowledge graph prove that modeling the concept shifts can effectively alleviate the dull and uninformative response dilemma. However, there still exists a gap between the concept relations in natural conversation and those in the external commonsense knowledge graph. Specifically, the concept relations in the external commonsense knowledge graph are not intuitively built from the conversational scenario but from world knowledge, which makes them insufficient for chatbot construction. To bridge the above gap, we propose a method to supply more concept relations extracted from conversational corpora and build an enhanced concept graph for chatbot construction. We then introduce the enhanced graph to the response generation process with a designed network. Experimental results on the Reddit conversation dataset indicate our proposed method significantly outperforms strong baseline systems and achieves new SOTA results. Further analysis individually proves the effectiveness of the enhanced concept graph. PDF 1 2022
Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset Research in massively multilingual image captioning has been severely hampered by a lack of high-quality evaluation datasets. In this paper we present and make available the Crossmodal-3600 dataset, a geographically diverse set of 3600 images, each annotated with human-generated reference captions in 36 languages. We select a representative set of images from across the world for this dataset, and annotate it with captions that achieve consistency in terms of style across all languages, while avoiding annotation artifacts due to direct translation. We apply this benchmark to model selection for massively multilingual image captioning models, and show superior correlation results with human evaluations when using the Crossmodal-3600 dataset as golden references for automatic metrics. PDF 1 2022
Detection, Disambiguation, Re-ranking: Autoregressive Entity Linking as a Multi-Task Problem We propose an autoregressive entity linking model that is trained with two auxiliary tasks and learns to re-rank generated samples at inference time. Our proposed novelties address two weaknesses in the literature. First, a recent method proposes to learn mention detection and then entity candidate selection, but relies on predefined sets of candidates. We use encoder-decoder autoregressive entity linking in order to bypass this need, and propose to train mention detection as an auxiliary task instead. Second, previous work suggests that re-ranking could help correct prediction errors. We add a new auxiliary task, match prediction, to learn re-ranking. Without the use of a knowledge base or candidate sets, our model sets a new state of the art in two benchmark datasets of entity linking: COMETA in the biomedical domain, and AIDA-CoNLL in the news domain. We show through ablation studies that each of the two auxiliary tasks increases performance, and that re-ranking is an important factor in the increase. Finally, our low-resource experimental results suggest that performance on the main task benefits from the knowledge learned by the auxiliary tasks, and not just from the additional training data. PDF 1 2022
On a Benefit of Masked Language Model Pretraining: Robustness to Simplicity Bias Despite the success of pretrained masked language models (MLM), why MLM pretraining is useful is still a question not fully answered. In this work we show, theoretically and empirically, that MLM pretraining makes models robust to lexicon-level spurious features, partly answering the question. Our explanation is that MLM pretraining may alleviate problems brought by simplicity bias (Shah et al., 2020), which refers to the phenomenon that a deep model tends to rely excessively on simple features. In NLP tasks, those simple features could be token-level features whose spurious association with the label can be learned easily. We show that MLM pretraining makes learning from the context easier. Thus, pretrained models are less likely to rely excessively on a single token. We also explore theoretical explanations of MLM's efficacy in causal settings. Compared with Wei et al. (2021), we achieve similar results with milder assumptions. Finally, we close the gap between our theories and real-world practice by conducting experiments on real-world tasks. PDF 1 2022
PRIMERA: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization We introduce PRIMERA, a pre-trained model for multi-document representation with a focus on summarization that reduces the need for dataset-specific architectures and large amounts of fine-tuning labeled data. PRIMERA uses our newly proposed pre-training objective designed to teach the model to connect and aggregate information across documents. It also uses efficient encoder-decoder transformers to simplify the processing of concatenated input documents. With extensive experiments on 6 multi-document summarization datasets from 3 different domains in zero-shot, few-shot and fully-supervised settings, PRIMERA outperforms current state-of-the-art dataset-specific and pre-trained models in most of these settings by large margins. PDF 1 2022
Masked Measurement Prediction: Learning to Jointly Predict Quantities and Units from Textual Context Physical measurements constitute a large portion of numbers in academic papers, engineering reports, and web tables. Current benchmarks fall short of properly evaluating numeracy of pretrained language models on measurements, hindering research on developing new methods and applying them to numerical tasks. To that end, we introduce a novel task, Masked Measurement Prediction (MMP), where a model learns to reconstruct a number together with its associated unit given masked text. MMP is useful for both training new numerically informed models as well as evaluating numeracy of existing systems. To address this task, we introduce a new Generative Masked Measurement (GeMM) model that jointly learns to predict numbers along with their units. We perform fine-grained analyses comparing our model with various ablations and baselines. We use linear probing of traditional pretrained transformer models (RoBERTa) to show that they significantly underperform jointly trained number-unit models, highlighting the difficulty of this new task and the benefits of our proposed pretraining approach. We hope this framework accelerates the progress towards building more robust numerical reasoning systems in the future. PDF 1 2022
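Constructing MMP-style training pairs can be sketched with a simple pattern matcher that masks a number together with its unit; the regex, unit list, and mask tokens below are assumptions for illustration, not the paper's exact preprocessing.

```python
# A sketch of building a Masked Measurement Prediction example: a
# number+unit mention is replaced by mask tokens that the model must
# jointly reconstruct from the surrounding textual context.
import re

MEASURE = re.compile(r"(\d+(?:\.\d+)?)\s*(km|kg|m|s|GHz|mm)\b")  # toy unit list

def mask_measurement(text: str):
    m = MEASURE.search(text)
    if m is None:
        return None
    masked = text[:m.start()] + "[NUM] [UNIT]" + text[m.end():]
    return masked, {"number": float(m.group(1)), "unit": m.group(2)}

print(mask_measurement("The bridge spans 1.2 km across the bay."))
# -> ('The bridge spans [NUM] [UNIT] across the bay.', {'number': 1.2, 'unit': 'km'})
```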
CheckDST: Measuring Real-World Generalization of Dialogue State Tracking Performance Neural models that extend the pretrain-then-finetune paradigm continue to achieve new state-of-the-art results in dialogue state tracking (DST) benchmarks on joint goal accuracy (JGA). However, motivated by CheckList (Ribeiro et al. 2020), we argue for a holistic assessment of DST models since JGA is unable to capture robustness to the inevitable test-time distribution shifts. To this end, we build on recent work on robustness testing in task-oriented dialogue and introduce CheckDST, an instantiation of CheckList for DST that quantifies robustness with test set augmentations and new metrics that measure consistency. Using CheckDST, we are able to extensively compare state-of-the-art DST models, finding that, although span-based classification models achieve slightly better JGA on the original test set than generation models, they are significantly less robust to distribution shift. Secondly, we observe that while stopping training early, e.g. at the first epoch, hurts JGA, the resulting models are significantly more robust to distribution shift. Lastly, guided by the weaknesses exposed by CheckDST, we explore training DST models that simultaneously boost JGA and CheckDST metrics and report preliminary success with PrefineDST, a simple generation model pretrained with non-target datasets to internalize reasoning skills relevant to dialogue state tracking. PDF 1 2022
Analyzing Modality Robustness in Multimodal Sentiment Analysis Building robust multimodal models is crucial to achieving reliable deployment in the wild. Despite its importance, little attention has been paid to identifying and improving the robustness of Multimodal Sentiment Analysis (MSA) models. In this work, we address this gap by (i) proposing simple diagnostic checks for modality robustness in a trained multimodal model; using these checks, we find MSA models to be highly sensitive to a single modality, which undermines their robustness; and (ii) analyzing well-known robust training strategies to alleviate these issues. Critically, we observe that robustness can be achieved without compromising the original performance. We hope our extensive study--performed across five models and two benchmark datasets--and proposed procedures will make robustness an integral component in MSA research. Our diagnostic checks and robust training solutions are simple to implement and will be released at https://github.com/XXXX. PDF 1 2022
Boosted Dense Retriever We propose DrBoost, a dense retrieval ensemble inspired by boosting. DrBoost is trained in stages: each component model is learned sequentially and specialized by focusing only on retrieval mistakes made by the current ensemble. The final representation is the concatenation of the output vectors of all the component models, making it a drop-in replacement for standard dense retrievers at test time. DrBoost enjoys several advantages compared to standard dense retrieval models. It produces representations which are 4x more compact, while delivering comparable retrieval results. It also performs surprisingly well under approximate search with coarse quantization, reducing latency and bandwidth needs by another 4x. In practice, this can make the difference between serving indices from disk versus from memory, paving the way for much cheaper deployments. PDF 1 2022
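The drop-in property comes from simple vector concatenation, which is easy to picture. Below is a hedged sketch with toy stand-in encoders; in DrBoost itself the components are sequentially trained dense retrievers.

```python
import numpy as np

def encode_with_ensemble(text, component_encoders):
    """Final representation: concatenation of all component output vectors."""
    parts = [enc(text) for enc in component_encoders]  # each: (dim_i,)
    return np.concatenate(parts)                       # (sum of dim_i,)

# Toy stand-ins that ignore the input; real components are trained encoders.
rng = np.random.default_rng(0)
encoders = [lambda text, W=rng.normal(size=32): W for _ in range(4)]

query_vec = encode_with_ensemble("who wrote hamlet", encoders)
doc_vec = encode_with_ensemble("Hamlet is a tragedy by Shakespeare", encoders)
score = query_vec @ doc_vec  # standard inner-product retrieval score
```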
Logical Story Representations via FrameNet + Semantic Parsing We present a means of obtaining rich semantic representations of stories by combining neural FrameNet identification, a formal logic-based semantic parser, and a hierarchical event schema representation. The final schematic representation of the story abstracts constants to variables, preserving their types and relationships to other individuals in the story. All identified FrameNet frames are incorporated as temporally bound ``episodes'' and related to one another in time. The semantic role information from the frames is also incorporated into the final schema's type constraints. We describe this system as well as its possible applications to question answering and open-domain event schema learning. PDF 1 2022
CONQRR: Conversational Query Rewriting for Retrieval with Reinforcement Learning Compared to standard retrieval tasks, passage retrieval for conversational question answering (CQA) poses new challenges in understanding the current user question, as each question needs to be interpreted within the dialogue context. Moreover, it can be expensive to re-train well-established retrievers such as search engines that are originally developed for non-conversational queries. To facilitate their use, we develop a query rewriting model CONQRR that rewrites a conversational question in the context into a standalone question. It is trained with a novel reward function to directly optimize towards retrieval using reinforcement learning and can be adapted to any fixed retriever. We show that CONQRR achieves state-of-the-art results on a recent open-domain CQA dataset containing conversations from three different sources, and is effective for two different fixed retrievers. Our extensive analysis also shows the robustness of CONQRR to out-of-domain dialogues as well as to limited query rewriting supervision. PDF 1 2022
VEE-BERT: Accelerating BERT Inference for Named Entity Recognition via Vote Early Exiting Named entity recognition (NER) is of great importance for a wide range of tasks, such as medical health record understanding, document analysis, and dialogue understanding. BERT and its variants are the best-performing models for NER. However, these models are notorious for being large and slow during inference, which limits their use in industry. Pilot experiments show that, in the NER task, BERT suffers from a severe over-thinking problem, which motivates letting BERT exit early at intermediate layers. In this work, we propose a novel method, \underline{V}ote \underline{E}arly \underline{E}xiting BERT (VEE-BERT), for improving the early exiting of BERT on NER tasks. To handle complex NER tasks with nested entities, we adopt the Biaffine NER model \citep{yu-etal-2020-named}, which converts a sequence labeling task into a table-filling task. VEE-BERT makes early exiting decisions by comparing the predictions of the current layer with those of the previous layers. Experiments on six benchmark NER tasks demonstrate that our method is effective in accelerating the BERT Biaffine model's inference with less performance loss than the baseline early exiting method. PDF 1 2022
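The abstract's voting rule (exit once the current layer's prediction agrees with those of the previous layers) can be sketched as below; the per-layer prediction interface and the patience threshold are illustrative assumptions, not the paper's exact procedure.

```python
def vote_early_exit(layer_predict_fns, inputs, patience=2):
    """Run intermediate classifiers layer by layer; stop on agreement.

    layer_predict_fns: one decoding function per encoder layer, each
    returning a hashable prediction (e.g., a tuple of entity spans).
    """
    history = []
    for depth, predict in enumerate(layer_predict_fns, start=1):
        history.append(predict(inputs))
        # Exit if the newest prediction matches the previous `patience` ones.
        if len(history) > patience and all(
            h == history[-1] for h in history[-patience - 1:-1]
        ):
            return history[-1], depth  # prediction and layers actually used
    return history[-1], len(layer_predict_fns)
```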
A Simple But Powerful Graph Encoder for Temporal Knowledge Graph Completion While knowledge graphs contain rich semantic knowledge about various entities and the relational information among them, temporal knowledge graphs (TKGs) describe and model the interactions of the entities over time. In this context, automatic temporal knowledge graph completion (TKGC) has gained great interest. Recent TKGC methods aim to integrate advanced deep learning techniques, e.g., Transformers, to boost model performance. However, we find that instead of adopting various kinds of complex modules, it is more beneficial to capture more extensive temporal information. In this paper, we propose a simple but powerful graph encoder for TKGC, namely, TARGCN. TARGCN is parameter-efficient, and it extensively utilizes the information from the whole temporal context. We perform experiments on three benchmark datasets. Our model can achieve a more than 46% relative improvement on the GDELT dataset compared with state-of-the-art models. Meanwhile, it outperforms the strongest baseline on the ICEWS05-15 dataset with around 18% fewer parameters. PDF 1 2022
Semantically Informed Slang Interpretation Slang is a predominant form of informal language making flexible and extended use of words that is notoriously hard for natural language processing systems to interpret. Existing approaches to slang interpretation tend to rely on context but ignore semantic extensions common in slang word usage. We propose a semantically informed slang interpretation (SSI) framework that considers jointly the contextual and semantic appropriateness of a candidate interpretation for a query slang. We perform rigorous evaluation on two large-scale online slang dictionaries and show that our approach not only achieves state-of-the-art accuracy for slang interpretation in English, but also does so in zero-shot and few-shot scenarios where training data is sparse. Furthermore, we show how the same framework can be applied to enhancing machine translation of slang from English to other languages. Our work creates opportunities for the automated interpretation and translation of informal language. PDF 1 2022
Unsupervised Domain Adaptation for Event Detection via Meta Self-Paced Learning A shift in data distribution can have a significant impact on the performance of a model that detects important events in text. Recent methods addressing unsupervised domain adaptation for the event detection task typically extract domain-invariant representations by balancing various objectives to align feature spaces between source and target domains. While effective, these methods are impractical as large-scale language models grow drastically bigger to achieve optimal performance. To this end, we propose to leverage a meta-learning framework to train a neural network-based self-paced learning procedure in an end-to-end manner. Our method, called Meta Self-Paced Domain Adaptation (MSP-DA), effectively tunes domain-specific hyperparameters, including learning schedules, sample weights, and objective balancing coefficients, simultaneously throughout the learning process, by imitating the train-test dataset split based on the difficulties of the source domain's samples. Extensive experiments demonstrate that our framework substantially improves performance on target domains, surpassing state-of-the-art approaches. Detailed analyses validate our method and provide insight into how each domain affects the learned hyperparameters. PDF 1 2022
PROMPT WAYWARDNESS: The Curious Case of Discretized Interpretation of Continuous Prompts Fine-tuning continuous prompts for target tasks has recently emerged as a compact alternative to full model fine-tuning. Motivated by these promising results, we investigate the feasibility of extracting a discrete (textual) interpretation of continuous prompts that is faithful to the problem they solve. In practice, we observe a "wayward" behavior between the task solved by continuous prompts and their nearest neighbor discrete projections: we can find continuous prompts that solve a task while being projected to an arbitrary text (e.g., the definition of a different or even a contradictory task), while remaining within a very small (2%) margin of the best continuous prompt of the same size for the task. We provide intuitions behind this odd and surprising behavior, as well as extensive empirical analyses quantifying the effect of various parameters. For instance, for larger model sizes we observe higher waywardness, i.e., we can find prompts that more closely map to any arbitrary text with a smaller drop in accuracy. These findings have important implications relating to the difficulty of faithfully interpreting continuous prompts and their generalization across models and tasks, providing guidance for future progress in prompting language models. PDF 1 2022
End-to-end Dense Video Captioning as Sequence Generation Dense video captioning aims to identify the events of interest in an input video, and generate descriptive captions for each event. Previous approaches usually follow a two-stage generative process, which first proposes a segment for each event, then renders a caption for each identified segment. Recent advances in large-scale sequence generation pretraining have seen great success in unifying task formulation for a great variety of tasks, but so far, more complex tasks such as dense video captioning are not able to fully utilize this powerful paradigm. In this work, we show how to model the two subtasks of dense video captioning jointly as one sequence generation task, and simultaneously predict the events and the corresponding descriptions. Experiments on YouCook2 and ViTT show encouraging results and indicate the feasibility of training complex tasks such as end-to-end dense video captioning integrated into large-scale pretrained models. PDF 1 2022
Towards Personalized Intelligence at Scale Personalized Intelligence (PI) is the problem of providing customized AI experiences tailored to each individual user. In many applications, PI is preferred or even required. Existing personalization approaches involve fine-tuning pre-trained models to create new customized models. However, these approaches require a significant amount of computation to train, scaling with model size and the number of users, inhibiting PI from being realized widely. In this work, we introduce a novel model architecture and training/inference framework to enable Personalized Intelligence at scale. We achieve this by attaching a Personalization Head (PH) and freezing the base pre-trained LM. Since only the parameters in the PH are updated during training, this results in a model much smaller than the conventional fine-tuned LM when scaled across users. We evaluate on academic and industry-focused datasets and show that this approach is much more scalable than traditional fine-tuning and outperforms the zero-shot baseline in F1 score. We identify key factors required for effective PH design and training. PDF 1 2022
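A minimal PyTorch sketch of the frozen-backbone idea follows: the shared LM is frozen and only a small per-user head is trained, so the per-user cost is just the head's parameters. The head architecture shown is an assumption for illustration, not the paper's exact design.

```python
import torch.nn as nn

class PersonalizationHead(nn.Module):
    """Small per-user module trained on top of a frozen LM's pooled output."""
    def __init__(self, hidden_size, num_labels):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 4),
            nn.Tanh(),
            nn.Linear(hidden_size // 4, num_labels),
        )

    def forward(self, pooled_lm_output):
        return self.proj(pooled_lm_output)

def trainable_parameters(base_lm, head):
    for p in base_lm.parameters():
        p.requires_grad = False     # freeze the shared backbone once
    return list(head.parameters())  # only these reach the optimizer
```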
Evidentiality-guided Generation for Knowledge-Intensive NLP Tasks Retrieval-augmented generation models have shown state-of-the-art performance across many knowledge-intensive NLP tasks such as open question answering and fact verification. These models are trained to generate a final output given retrieved passages that can be irrelevant to an input query, leading to learning spurious cues or memorization. This work introduces a method to incorporate the evidentiality of passages---whether a passage contains correct evidence to support the output---into training the generator. We introduce a multi-task learning framework to jointly generate the final output and predict the {\it evidentiality} of each passage. We introduce a new task-agnostic method for obtaining high-quality silver evidentiality labels, addressing the issues of gold evidentiality labels being unavailable in most domains. Our experiments on five datasets across three knowledge-intensive tasks of open-domain question answering, fact verification, and knowledge-enhanced dialogue show that our new evidentiality-guided generator significantly outperforms its direct counterpart on all of them, and advances the state of the art on three of them. Our analysis shows that multi-task learning and silver evidentiality mining played key roles. PDF 1 2022
On Leakage in Some Popular Benchmarks on Graphs A number of benchmarks are based on graphs. Edges are typically split into train, validation and test splits, using a random partition. Leakage has been discovered in a number of popular benchmarks; FB15k has been replaced by FB15k-237 and WN18 has been replaced by WN18RR, though leakage has been reported even after these corrections. This paper reports a new type of leakage, $A$-leakage, on benchmarks for synonym-antonym classification. $A$-leakage infers labels for pairs of words in the test split, $w_i, w_j$, by exploiting labels on paths from $w_i$ to $w_j$ in the training split. We conclude that it is safer to partition vertices, $V$, than edges, $E$. PDF 1 2022
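Our reading of the $A$-leakage attack can be sketched directly: synonym edges preserve polarity and antonym edges flip it, so the parity of antonym edges on any training-split path between two test words predicts their label. The code below illustrates that idea; it is not the paper's implementation.

```python
from collections import deque

def infer_label(train_edges, w_i, w_j):
    """train_edges: word -> list of (neighbor, 'syn' or 'ant') pairs.
    Returns 'syn', 'ant', or None if no training path connects the words."""
    queue = deque([(w_i, 0)])      # (node, antonym-edge parity so far)
    seen = {w_i}
    while queue:
        node, parity = queue.popleft()
        if node == w_j:
            return "ant" if parity % 2 else "syn"
        for nbr, label in train_edges.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, parity + (label == "ant")))
    return None

edges = {"hot": [("warm", "syn"), ("cold", "ant")],
         "warm": [("hot", "syn")], "cold": [("hot", "ant")]}
print(infer_label(edges, "warm", "cold"))  # 'ant', via warm-hot-cold
```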
Challenges in Generalization in Open Domain Question Answering Recent work on Open Domain Question Answering has shown that there is a large discrepancy in model performance between novel test questions and those that largely overlap with training questions. However, it is unclear which aspects of novel questions make them challenging. Drawing upon studies on systematic generalization, we introduce and annotate questions according to three categories that measure different levels and kinds of generalization: training set overlap, compositional generalization (comp-gen), and novel-entity generalization (novel-entity). When evaluating six popular parametric and non-parametric models, we find that for the established Natural Questions and TriviaQA datasets, even the strongest model performance for comp-gen/novel-entity is 13.1/5.4% and 9.6/1.5% lower compared to that for the full test set -- indicating the challenge posed by these types of questions. Furthermore, we show that whilst non-parametric models can handle questions containing novel entities relatively well, they struggle with those requiring compositional generalization. Lastly, we find that key question difficulty factors are: cascading errors from the retrieval component, frequency of question pattern, and frequency of the entity. PDF 1 2022
Knowledge-Grounded Dialogue Generation with a Unified Knowledge Representation Knowledge-grounded dialogue systems are challenging to build due to the lack of training data and heterogeneous knowledge sources. Existing systems perform poorly on unseen topics due to the limited topic coverage of the training data. In addition, it is challenging to generalize to domains that require different types of knowledge sources. To address these challenges, we present PLUG, a language model that homogenizes different knowledge sources into a unified knowledge representation for knowledge-grounded dialogue generation tasks. We first retrieve relevant information from heterogeneous knowledge sources (e.g., wiki, dictionary, or knowledge graph); then the retrieved knowledge is transformed into text and concatenated with the dialogue history to feed into the language model for generating responses. PLUG is pre-trained on a large-scale knowledge-grounded dialogue corpus. The empirical evaluation on two benchmarks shows that PLUG generalizes well across different knowledge-grounded dialogue tasks. It achieves comparable performance with state-of-the-art methods in the fully-supervised setting and significantly outperforms other approaches in zero-shot and few-shot settings. PDF 1 2022
Multi-Task End-to-End Training Improves Conversational Recommendation In this paper, we analyze the performance of a multitask end-to-end transformer model on the task of conversational recommendations, which aim to provide recommendations based on a user’s explicit preferences expressed in dialogue. While previous works in this area adopt complex multi-component approaches where the dialogue management and entity recommendation tasks are handled by separate components, we show that a unified transformer model, based on the T5 text-to-text transformer model, can perform competitively in both recommending relevant items and generating conversation dialogue. We fine-tune our model on the ReDIAL conversational movie recommendation dataset, and create additional training tasks derived from MovieLens (such as the prediction of movie attributes and related movies based on an input movie), in a multitask learning setting. Using a series of probe studies, we demonstrate that the learned knowledge in the additional tasks is transferred to the conversational setting, where each task leads to an increase in its related probe score. PDF 1 2022
MedDistant19: A Challenging Benchmark for Distantly Supervised Biomedical Relation Extraction Relation Extraction in the biomedical domain is a challenging task due to the lack of labeled data and the long-tail distribution of the entity mentions. Recent works propose distant supervision as a way to tackle the scarcity of annotated data by automatically pairing knowledge graph relationships with raw textual data. In several benchmarks, Distantly Supervised Biomedical Relation Extraction (Bio-DSRE) models can produce very accurate results. However, given the challenging nature of the task, we set out to investigate the validity of such impressive results. We probed the datasets used by \citet{amin2020data} and \citet{hogan2021abstractified} and found a significant overlap between training and evaluation relationships that, once resolved, reduced the accuracy of the models by up to 71%. Furthermore, we noticed several inconsistencies in the data construction process, such as the creation of negative samples and improper handling of redundant relationships. To mitigate these issues we present MedDistant19, a new benchmark dataset obtained by aligning the MEDLINE abstracts with the widely used SNOMED Clinical Terms (SNOMED-CT) knowledge base. We experimented with several state-of-the-art models following our methodology, showing that there is still plenty of room for improvement on the task. We release our code and data for reproducibility. PDF 1 2022
Unsupervised Sentence Simplification via Dependency Parsing Text simplification is the task of rewriting a text so that it is readable and easily understood. In this paper, we propose a simple yet novel unsupervised sentence simplification system that harnesses parsing structures together with sentence embeddings to produce linguistically effective simplifications. This means our model is capable of introducing substantial modifications to simplify a sentence while maintaining its original semantics and adequate fluency. We establish the unsupervised state-of-the-art at 39.13 SARI on TurkCorpus set and perform competitively against supervised baselines on various quality metrics. Furthermore, we demonstrate our framework's extensibility to other languages via a proof-of-concept on Vietnamese data. Code for reproduction is anonymously published at https://anonymous.4open.science/r/USDP-744B. PDF 1 2022
Learning Cross-Lingual IR from an English Retriever We present a new cross-lingual information retrieval (CLIR) system trained using multi-stage knowledge distillation (KD). The teacher relies on a highly effective but expensive two-stage process consisting of query translation and monolingual IR, while the student executes a single CLIR step. We teach the student powerful multilingual encoding as well as CLIR by optimizing two corresponding KD objectives. Learning useful non-English representations from an English-only retriever is accomplished through a cross-lingual token alignment algorithm that relies on the representation capabilities of the underlying multilingual language model. In both in-domain and zero-shot evaluation, the proposed method demonstrates far superior accuracy over direct fine-tuning with labeled CLIR data. One of our systems is also the current best single-model system on the XOR-TyDi leaderboard. PDF 1 2022
Simple Local Attentions Remain Competitive for Long-Context Tasks Many NLP tasks require processing long contexts beyond the length limit of pretrained models. In order to scale these models to longer text sequences, many efficient long-range attention variants have been proposed. Despite the abundance of research along this direction, it is still difficult to gauge the relative effectiveness of these models in practical use cases, e.g., if we apply these models following the pretrain-and-finetune paradigm. In this work, we aim to conduct a thorough analysis of these emerging models with large-scale and controlled experiments. For each attention variant, we pretrain large-size models using the same long-doc corpus and then finetune these models for real-world long-context tasks. Our findings reveal pitfalls of an existing widely-used long-range benchmark and show none of the tested efficient attentions can beat a simple local window attention under standard pretraining paradigms. Further analysis on local attention variants suggests that even the commonly used attention-window overlap is not necessary to achieve good downstream results --- using disjoint local attentions, we are able to build a simpler and more efficient long-doc QA model that matches the performance of Longformer with half of its pretraining compute. PDF 1 2022
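The disjoint local attention pattern the authors end up with is simple to express as a mask: tokens attend only within non-overlapping blocks. The construction below is a generic illustration, not the authors' code.

```python
import torch

def disjoint_block_mask(seq_len, block_size):
    """Boolean (seq_len, seq_len) mask: True where attention is allowed."""
    block_id = torch.arange(seq_len) // block_size   # block index per token
    return block_id[:, None] == block_id[None, :]    # attend iff same block

mask = disjoint_block_mask(seq_len=8, block_size=4)
# Apply as: attn_scores.masked_fill(~mask, float("-inf")) before the softmax.
```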
Cross-document Misinformation Detection based on Event Graph Reasoning For emerging events, human readers are often exposed to both real news and fake news. Multiple news articles may contain complementary or contradictory information that readers can leverage to help detect fake news. Inspired by this process, we propose a novel task of cross-document misinformation detection. Given a cluster of topically related news documents, we aim to detect misinformation at both document level and a more fine-grained level, event level. Due to the lack of data, we generate fake news by manipulating real news, and construct 3 new datasets with 422, 276, and 1,413 clusters of topically related documents, respectively. We further propose a graph-based detector that constructs a cross-document knowledge graph using cross-document event coreference resolution and employs a heterogeneous graph neural network to conduct detection at two levels. We then feed the event-level detection results into the document-level detector. Experimental results show that our proposed method significantly outperforms existing methods by up to 7 F1 points on this new task. PDF 1 2022
Analyzing Gender Representation in Multilingual Models Multilingual language models were shown to allow for nontrivial transfer across scripts and languages. In this work, we study the structure of the internal representations that enable this transfer. We focus on the representations of gender distinctions as a practical case study, and examine the extent to which the gender concept is encoded in shared subspaces across different languages. Our analysis shows that gender representations consist of several prominent components that are shared across languages, alongside language-specific components. The existence of language-independent and language-specific components provides an explanation for an intriguing empirical observation we make: while gender classification transfers well across languages, bias mitigation interventions trained on a single language do not transfer easily to others. PDF 1 2022
Re2G: Retrieve, Rerank, Generate As demonstrated by GPT-3 and T5, transformers grow in capability as parameter spaces become larger and larger. However, for tasks that require a large amount of knowledge, non-parametric memory allows models to grow dramatically with a sub-linear increase in computational cost and GPU memory requirements. Recent models such as RAG and REALM have introduced retrieval into conditional generation. These models incorporate neural initial retrieval from a corpus of passages. We build on this line of research, proposing Re2G, which combines both neural initial retrieval and reranking into a BART-based sequence-to-sequence generation. Our reranking approach also permits merging retrieval results from sources with incomparable scores, enabling an ensemble of BM25 and neural initial retrieval. To train our system end-to-end, we introduce a novel variation of knowledge distillation to train the initial retrieval, reranker and generation using only ground truth on the target sequence output. We find large gains in four diverse tasks: zero-shot slot filling, question answering, fact checking and dialog, with relative gains of 9% to 34% over the previous state-of-the-art on the KILT leaderboard. We make our code available as open source. PDF 1 2022
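The score-merging idea is worth a sketch: BM25 and dense scores live on incomparable scales, so rather than mixing raw scores, one can pool both candidate lists and let a shared reranker assign comparable scores. Function names below are placeholders, not the Re2G API.

```python
def merge_and_rerank(query, bm25_retrieve, dense_retrieve, rerank_score, k=5):
    """Pool candidates from both retrievers, then rank by a shared reranker."""
    pool = set(bm25_retrieve(query)) | set(dense_retrieve(query))  # dedupe
    scored = sorted(pool, key=lambda p: rerank_score(query, p), reverse=True)
    return scored[:k]
```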
Original or Translated? A Causal Analysis of the Impact of Translationese on Machine Translation Performance Human-translated text displays distinct features from naturally written text in the same language. This phenomenon, known as translationese, has been argued to confound machine translation (MT) evaluation. Yet, we find that existing work on translationese neglects some important factors and that its conclusions are mostly correlational rather than causal. In this work, we collect CausalMT, a dataset where the MT training data are also labeled with the human translation directions. We inspect two critical factors: train-test alignment (whether the human translation directions in the training and test sets are aligned) and data-model alignment (whether the model learns in the same direction as the human translation direction in the dataset). We show that these two factors have a large causal effect on MT performance, in addition to the test-model misalignment highlighted by existing work on the impact of translationese in the test set. In light of our findings, we provide a set of suggestions for MT training and evaluation. PDF 1 2022
A Dataset for Cross-Domain Reasoning via Template Filling While several benchmarks exist for reasoning tasks, reasoning across domains is an under-explored area in NLP. Towards this, we present a dataset and a prompt-template-filling approach to enable sequence-to-sequence models to perform cross-domain reasoning. We also present a case study with the commonsense and health and well-being domains, where we study how prompt-template-filling enables pretrained sequence-to-sequence models to reason across domains. Our experiments across several pretrained encoder-decoder models show that cross-domain reasoning is challenging for current models. We also present an in-depth error analysis and discuss avenues for future research on reasoning across domains. PDF 1 2022
Cooperative Self-training of Machine Reading Comprehension Pretrained language models have significantly improved the performance of downstream language understanding tasks, including extractive question answering, by providing high-quality contextualized word embeddings. However, training question answering models still requires large amounts of annotated data for specific domains. In this work, we propose a cooperative self-training framework, RGX, for automatically generating more non-trivial question-answer pairs to improve model performance. RGX is built upon a masked answer extraction task with an interactive learning environment containing an answer entity \textbf{R}ecognizer, a question \textbf{G}enerator, and an answer e\textbf{X}tractor. Given a passage with a masked entity, the generator generates a question around the entity, and the extractor is trained to extract the masked entity with the generated question and raw texts. The framework allows the training of question generation and answering models on any text corpora without annotation. We further leverage a reinforcement learning technique to reward the generation of high-quality questions and to improve the answer extraction model's performance. Experimental results show that RGX outperforms state-of-the-art (SOTA) pretrained language models and transfer learning approaches on standard question-answering benchmarks, and yields new SOTA performance under the given model size and transfer learning settings. PDF 1 2022
Applying SoftTriple Loss for Supervised Language Model Fine Tuning We introduce a new loss function, TripleEntropy, based on cross-entropy and SoftTriple loss, to improve classification performance when fine-tuning general-knowledge pre-trained language models. This loss function improves a robust RoBERTa baseline model fine-tuned with cross-entropy loss by about 0.02%-2.29%. Thorough tests on popular datasets indicate a steady gain. The fewer samples in the training dataset, the higher the gain: for small-sized datasets it is 0.78%, for medium-sized 0.86%, for large 0.20%, and for extra-large 0.04%. PDF 1 2022
Cross-Lingual Speaker Identification from Weak Local Evidence Speaker identification, determining which character said each utterance in text, benefits many downstream tasks. Most existing approaches use expert-defined rules or rule-based features to directly approach this task, but these approaches come with significant drawbacks, such as lack of contextual reasoning and poor cross-lingual generalization. In this work, we propose a speaker identification framework that addresses these issues. We first extract large-scale distant supervision signals in English via general-purpose tools and heuristics, and then apply these weakly-labeled instances with a focus on encouraging contextual reasoning to train a cross-lingual language model. We show that our final model outperforms the previous state-of-the-art methods on two English speaker identification benchmarks by $5.4\%$ in accuracy, as well as two Chinese speaker identification datasets by up to $4.7\%$. PDF 1 2022
Learning to Transpile AMR into SPARQL We propose a transition-based system to transpile Abstract Meaning Representation (AMR) into SPARQL for Knowledge Base Question Answering (KBQA). This allows us to delegate part of the abstraction problem to a strongly pre-trained semantic parser, while learning transpiling with a small amount of paired data. We depart from recent work relating AMR and SPARQL constructs, but rather than applying a set of rules, we teach the BART model to selectively use these relations. Further, we avoid explicitly encoding AMR and instead encode the parser state in the attention mechanism of BART, following recent semantic parsing work. The resulting model is simple, provides supporting text for its decisions, and outperforms recent AMR-based KBQA approaches on LC-QuAD (F1 53.4) and QALD (F1 31.6), while exploiting the same inductive biases. PDF 1 2022
Imagination-Augmented Natural Language Understanding Human brains integrate linguistic and perceptual information simultaneously to understand natural language and hold the critical ability to render imaginations. Such abilities enable us to construct new abstract concepts or concrete objects and are essential in involving applicable knowledge to solve problems in low-resource scenarios. However, most existing methods for Natural Language Understanding (NLU) are mainly focused on textual signals. They do not simulate human visual imagination ability, which hinders models from inferring and learning efficiently from limited data samples. Therefore, we introduce an Imagination-Augmented Cross-modal Encoder (iACE) to solve natural language understanding tasks from a novel learning perspective---imagination-augmented cross-modal understanding. iACE enables visual imagination with the external knowledge transferred from the powerful generative model and pre-trained vision-and-language model. Extensive experiments on GLUE and SWAG datasets show that iACE achieves consistent improvement over visually-supervised pre-trained models. More importantly, results in extreme and normal few-shot settings validate the effectiveness of iACE in low-resource natural language understanding circumstances. PDF 1 2022
Fine-tuning Strategies for Domain Specific Question Answering under Low Annotation Budget Constraints The progress introduced by pre-trained language models and their fine-tuning has resulted in significant improvements in most downstream NLP tasks. The unsupervised fine-tuning of a language model combined with further target task fine-tuning has become the standard QA fine-tuning procedure. In this work, we demonstrate that this strategy is sub-optimal for fine-tuning QA models, especially under a low QA annotation budget, which is a usual setting in practice due to the extractive QA labeling cost. We draw our conclusions by conducting an exhaustive analysis of the performance of the alternatives of the sequential fine-tuning strategy on different QA datasets. Our experiments provide one of the first investigations on how to best fine-tune a QA system under a low budget, and is therefore of the utmost practical interest for the QA practitioner. PDF 1 2022
NewsClaims: A New Benchmark for Claim Detection from News with Background Knowledge Claim detection and verification are crucial for news understanding and have emerged as promising technologies for mitigating news misinformation. However, most existing work has focused on claim sentence analysis while overlooking crucial background attributes (e.g., claimer, claim objects). In this work, we present NewsClaims, a new benchmark for knowledge-aware claim detection in the news domain. We redefine the claim detection problem to include extraction of additional background attributes related to each claim and release 889 claims annotated over 143 news articles. NewsClaims aims to benchmark claim detection systems in emerging scenarios, comprising unseen topics with little or no training data. To this end, we provide a comprehensive evaluation of zero-shot and prompt-based baselines for NewsClaims. PDF 1 2022
Retrieving Visual Facts For Few-Shot Visual Question Answering We introduce the Retrieving Visual Facts (RVF) framework for few-shot visual question answering (VQA). The RVF framework represents an image as a set of natural language facts; for example, in practice these could be tags from an object detector. Critically, the question is used to retrieve $\textit{relevant}$ facts: an image may contain numerous details, and one should attend to the few which may be useful for the question. Finally, one predicts the answer from the retrieved facts and the question, e.g., by prompting a language model as we do here. Compared to PICA (Yang et al., 2021), the previous state-of-the-art in few-shot VQA, a proof-of-concept RVF implementation improves absolute performance by 2.6% and 1.5% respectively on the VQAv2 (Goyal et al., 2017) and OK-VQA (Marino et al., 2019) datasets. We also analyze our implementation's strengths and weaknesses on various question types, highlighting directions for further study. PDF 1 2022
Exploring Topic-Metadata Relationships with the STM: A Bayesian Approach The initial purpose of topic models was to identify latent topical clusters within unstructured text. Meanwhile, the focus of advanced studies has shifted primarily to estimating the relationship between the discovered topical structure and theoretically relevant metadata. Methods used to estimate such relationships must take into account that the topical structure is not directly observed, but is instead itself estimated in an unsupervised fashion. In the Structural Topic Model (STM; Roberts et al., 2016), for instance, multiple repeated linear regressions of sampled topic proportions on metadata covariates are performed. This is done using a Monte Carlo sampling technique known as the \textit{method of composition}. In this paper, we propose two modifications of this approach: first, we implement a substantial correction to the model by replacing linear regression with the more appropriate Beta regression. Second, we provide a fundamental enhancement of the entire estimation framework by substituting the current blending of frequentist and Bayesian methods with a fully Bayesian approach. This allows for a more appropriate quantification of uncertainty. We illustrate our improved methodology by investigating relationships between Twitter posts by German parliamentarians and different metadata covariates related to their electoral districts. PDF 1 2022
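For readers unfamiliar with the method of composition, a compact self-contained sketch follows: draw topic proportions repeatedly, fit a Beta regression to each draw, and pool the coefficients. The MLE-based Beta regression here only illustrates the corrected regression step, not the authors' fully Bayesian estimator.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit
from scipy.stats import beta as beta_dist

def fit_beta_regression(X, y):
    """MLE Beta regression with a logit mean link; y must lie strictly in (0, 1)."""
    def neg_ll(params):
        coef, log_phi = params[:-1], params[-1]
        mu, phi = expit(X @ coef), np.exp(log_phi)
        return -beta_dist.logpdf(y, mu * phi, (1 - mu) * phi).sum()
    init = np.zeros(X.shape[1] + 1)  # coefficients plus log-precision
    return minimize(neg_ll, init, method="BFGS").x[:-1]

def method_of_composition(draw_proportions, X, n_draws=100):
    """draw_proportions() samples one topic-proportion vector per call."""
    return np.array([fit_beta_regression(X, draw_proportions())
                     for _ in range(n_draws)])   # (n_draws, n_covariates)
```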
Uninformative Input Features and Counterfactual Invariance: Two Perspectives on Spurious Correlations in Natural Language The natural language processing community has become increasingly interested in spurious correlations, and in methods for identifying and eliminating them. Gardner et al. (2021) argue that due to the compositional nature of language, \emph{all} correlations between labels and individual input features are spurious. This paper analyzes this proposal in the context of a toy example, demonstrating three distinct conditions that can give rise to feature-label correlations through a simple PCFG. Linking the toy example to a structured causal model shows that (1) feature-label correlations can arise even when the label is invariant to interventions on the feature, and (2) feature-label correlations may be absent even when the label \emph{is} sensitive to interventions on the feature. Because input features will be individually correlated with labels except in very rare circumstances, mitigation and stress tests should focus on those correlations that are counterfactually invariant under plausible causal models. PDF 1 2022
Breaking Character: Are Subwords Good Enough for MRLs After All? Large pretrained language models (PLMs) typically tokenize the input string into contiguous subwords before any pretraining or inference. However, previous studies have claimed that this form of subword tokenization is inadequate for processing morphologically-rich languages (MRLs). We revisit this hypothesis by pretraining a BERT-style masked language model over character sequences instead of word-pieces. We compare the resulting model, dubbed TavBERT, against contemporary PLMs based on subwords for three highly complex and ambiguous MRLs (Hebrew, Turkish, and Arabic), testing them on both morphological and semantic tasks. Our results show, for all tested languages, that while TavBERT obtains mild improvements on surface-level tasks à la POS tagging and full morphological disambiguation, subword-based PLMs achieve significantly higher performance on semantic tasks, such as named entity recognition and extractive question answering. These results showcase and (re)confirm the potential of subword tokenization as a reasonable modeling assumption for many languages, including MRLs. PDF 1 2022
Neural Pipeline for Zero-Shot Data-to-Text Generation In data-to-text (D2T) generation, training on in-domain data leads to overfitting to the data representation and repeating training data noise. We examine how to avoid finetuning the pretrained language models (PLMs) on D2T generation datasets while still taking advantage of surface realization capabilities of PLMs. Inspired by pipeline approaches, we propose to generate text by gradually transforming single-item descriptions with a sequence of modules trained on general-domain text-based operations: ordering, aggregation, and paragraph compression. We train PLMs for performing these operations on a synthetic corpus WikiFluent which we build from English Wikipedia. Our experiments on two major triple-to-text datasets—WebNLG and E2E—show that our approach enables D2T generation from RDF triples in zero-shot settings. PDF 1 2022
GUSUM: Graph-Based Unsupervised Summarization using Sentence-BERT and Sentence Features Unsupervised extractive document summarization aims to extract salient sentences from a document without requiring a labelled corpus. In existing graph-based methods, vertex and edge weights are mostly created by calculating sentence similarities. In this paper, we develop a Graph-Based Unsupervised Summarization method for extractive text summarization. We revive traditional graph ranking algorithms with recent sentence embedding models and sentence features, and modify how sentence centrality is computed. We first use Sentence-BERT, a state-of-the-art method for obtaining sentence embeddings, to better capture sentence meaning. In this way, we define the edges of a graph in which semantic similarities are represented. Then, we create an undirected graph in which the calculated sentence feature scores of each sentence are represented in the vertices. In the last stage, we determine the most important sentences in the document using our ranking method on the constructed graph. Experiments on CNN/Daily Mail and New York Times datasets show our approach achieves high performance on unsupervised graph-based summarization when evaluated both automatically and by humans. PDF 1 2022
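A hedged sketch of this pipeline: embed sentences with Sentence-BERT, use cosine similarities as edge weights, and rank sentences by a degree-style centrality optionally scaled by per-sentence feature scores. The model name and centrality choice are illustrative assumptions.

```python
from sentence_transformers import SentenceTransformer, util

def rank_sentences(sentences, feature_scores=None):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(sentences, convert_to_tensor=True)
    sim = util.cos_sim(emb, emb)            # (n, n) edge weights
    centrality = sim.sum(dim=1)             # degree-style vertex centrality
    if feature_scores is not None:          # optional vertex feature weights
        centrality = centrality * centrality.new_tensor(feature_scores)
    order = centrality.argsort(descending=True)
    return [sentences[i] for i in order]    # most central sentences first
```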
PPL-MCTS: Constrained Textual Generation Through Discriminator-Guided MCTS Decoding Large language models (LMs) based on Transformers can generate plausible long texts. In this paper, we explore how this generation can be further controlled at decoding time to satisfy certain constraints (e.g., being non-toxic, conveying certain emotions, using a specific writing style, etc.) without fine-tuning the LM. Precisely, we formalize constrained generation as a tree exploration process guided by a discriminator that indicates how well the associated sequence respects the constraint. This approach, in addition to being easier and cheaper to train than fine-tuning the LM, allows the constraint to be applied more finely and dynamically. We propose several original methods to search this generation tree, notably Monte Carlo Tree Search (MCTS), which provides theoretical guarantees on search efficiency, but also simpler methods based on re-ranking a pool of diverse sequences using the discriminator scores. These methods are evaluated, with automatic and human-based metrics, on two types of constraints and languages: review polarity and emotion control in French and English. We show that discriminator-guided MCTS decoding achieves state-of-the-art results without having to tune the language model, in both tasks and languages. We also demonstrate that the other proposed decoding methods based on re-ranking can be highly effective when diversity among the generated propositions is encouraged. PDF 1 2022
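The simpler re-ranking variant mentioned above reduces to a few lines: sample a diverse pool of continuations from the LM, then keep the one the constraint discriminator scores highest. `lm_sample` and `discriminator_prob` are placeholders for actual model calls.

```python
def constrained_generate(prompt, lm_sample, discriminator_prob, pool_size=16):
    """Best-of-pool decoding: sample candidates, keep the most on-constraint one."""
    candidates = [lm_sample(prompt) for _ in range(pool_size)]
    return max(candidates, key=discriminator_prob)
```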
Few-Shot Authorship Attribution in English Reddit Posts Authorship attribution (AA), an area of research seeking to identify the author of a particular text, is typically conducted on a closed set of authors, and often on certain forms of text, such as the edited and less colloquial language available in news articles. This paper introduces a few-shot learning approach using prototypical networks and a mix of stylometric and pre-trained transformer-related features, as applied to Reddit data. By employing few-shot learning and applying our efforts to social media text, we look to expand beyond the typical AA application--allowing for disjoint author sets and shorter, more colloquial forms of English. Additionally, using subreddit IDs as a proxy for topics, we explore cross-topic analysis and differentiate performance accordingly. In so doing, we test the limits of AA, with the goal of setting a baseline for performance and assessing the viability of few-shot learning for this task. Of the exhibited models, those trained with transformer embeddings performed well compared to ones with only stylometric features, and accounting for differing subreddits revealed varying performance across models. PDF 1 2022
MixQG: Neural Question Generation with Mixed Answer Types Asking good questions is an essential ability for both human and machine intelligence. However, existing neural question generation approaches mainly focus on short factoid type of answers. In this paper, we introduce a neural question generator, MixQG, to bridge this gap. We combine nine question answering datasets with diverse answer types, including yes/no, multiple-choice, extractive, and abstractive answers, to train a single generative model. We show with empirical results that our model outperforms existing work in both seen and unseen domains, and can generate questions with different cognitive levels when conditioned on different answer types. We run a human evaluation study to assess the quality of generated questions and find that MixQG outperforms the next best model by 10%. Our code and model checkpoints will be released and integrated with the HuggingFace library to facilitate various downstream applications. PDF 1 2022
Generative Pretraining for Paraphrase Evaluation We introduce ParaBLEU, a paraphrase representation learning model and evaluation metric for text generation. Unlike previous approaches, ParaBLEU learns to understand paraphrasis using generative conditioning as a pretraining objective. ParaBLEU correlates more strongly with human judgements than existing metrics, obtaining new state-of-the-art results on the 2017 WMT Metrics Shared Task. We show that our model is robust to data scarcity, exceeding previous state-of-the-art performance using only 50% of the available training data and surpassing BLEU, ROUGE and METEOR with only 40 labelled examples. Finally, we demonstrate that ParaBLEU can be used to conditionally generate novel paraphrases from a single demonstration, which we use to confirm our hypothesis that it learns abstract, generalized paraphrase representations. PDF 1 2022
Incorporating Question Answering-Based Signals into Abstractive Summarization via Salient Span Selection In this work, we propose a method for incorporating question-answering (QA) signals into a summarization model. Our method identifies salient noun phrases (NPs) in the input document by automatically generating wh-questions that are answered by the NPs and automatically determining whether those questions are answered in the gold summaries. This QA-based signal is incorporated into a two-stage summarization model which first marks salient NPs in the input document using a classification model, then conditionally generates a summary. Our experiments demonstrate that the models trained using QA-based supervision generate higher-quality summaries than baseline methods of identifying salient spans on benchmark summarization datasets. Further, we show that the content of the generated summaries can be controlled based on which NPs are marked in the input document. Finally, we propose a method of augmenting the training data so the gold summaries are more consistent with the marked input spans used during training and show how this results in models which learn to better exclude unmarked document content. PDF 1 2022
Measuring Word-Context Biases in Lexical Semantic Datasets State-of-the-art contextualized models, e.g., BERT, use tasks such as WiC and WSD to evaluate their word-in-context representations. This inherently assumes that performance in these tasks reflects how well a model represents the coupled word and context semantics. We question this assumption by presenting the first quantitative analysis of the context-word interaction required and tested in major contextual lexical semantic tasks, taking into account that tasks can be inherently biased and models can learn spurious correlations from datasets. To achieve this, we run probing baselines on masked input, based on which we then propose measures to calculate the degree of context or word bias in a dataset, and plot existing datasets on a continuum. The analyses were performed on both models and humans to decouple biases inherent to the tasks from biases learned from the datasets. We find that (1) to models, most existing datasets fall into the extreme ends of the continuum: retrieval-based tasks, and especially those in the medical domain (e.g., COMETA), exhibit strong target word bias, while WiC-style tasks and WSD show strong context bias; (2) AM2iCo and Sense Retrieval show less extreme model biases and challenge a model more to represent both the context and target words; and (3) a similar trend of biases exists in humans, but humans are much less biased than models, as humans found semantic judgments more difficult with the masked input, indicating that models are learning spurious correlations. This study demonstrates that, with heavy context or target word biases, models are usually not being tested for word-in-context representations in these tasks, and results are therefore open to misinterpretation. We recommend our framework as a sanity check for context and target word biases in future task design and model interpretation in lexical semantics. PDF 1 2022
Experiments with adversarial attacks on text genres Neural models based on pre-trained transformers, such as BERT or XLM-RoBERTa, demonstrate SOTA results in many NLP tasks, including non-topical classification, such as genre identification. However, these approaches often exhibit low robustness to minor alterations of the test texts. One of the problems concerns topical biases in the training corpus: for example, the prevalence of words on a specific topic in a specific genre can trick the genre classifier into recognising any text on this topic as belonging to this genre. In order to mitigate this problem, we investigate techniques for attacking genre classifiers, both to understand the limitations of transformer models and to improve their performance. While simple text attacks, such as those based on word replacement using keywords extracted by tf-idf, are not capable of deceiving powerful models like XLM-RoBERTa, we show that embedding-based algorithms which can replace some of the most ``significant'' words with words similar to them, for example, TextFooler, have the ability to influence model predictions in a significant proportion of cases. PDF 1 2022
Nearest Neighbor Knowledge Distillation for Neural Machine Translation k-nearest-neighbor machine translation ($k$NN-MT), proposed by Khandelwal et al. (2021), has achieved many state-of-the-art results in machine translation tasks. Although effective, $k$NN-MT requires conducting $k$NN searches through the large datastore for each decoding step during inference, prohibitively increasing the decoding cost and making deployment in real-world applications difficult. In this paper, we propose to move the time-consuming $k$NN search forward to the preprocessing phase, and then introduce $k$ Nearest Neighbor Knowledge Distillation ($k$NN-KD) that trains the base NMT model to directly learn the knowledge of $k$NN. Distilling knowledge retrieved by $k$NN can encourage the NMT model to take more reasonable target tokens into consideration, thus addressing the overcorrection problem. Extensive experimental results\footnote{We will release the source code upon acceptance} show that the proposed method achieves consistent improvement over state-of-the-art baselines including $k$NN-MT, while maintaining the same training and decoding speed as the standard NMT model. PDF 1 2022
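As background for what $k$NN-KD distills, here is a hedged sketch of the datastore lookup that vanilla $k$NN-MT performs at every decoding step, using faiss for the search; all names and the distance kernel are illustrative.

```python
import numpy as np
import faiss  # index built offline over decoder hidden states

def knn_distribution(query_state, index, datastore_tokens, vocab_size,
                     k=8, temperature=10.0):
    """Turn the k nearest datastore entries into a next-token distribution."""
    dists, idx = index.search(query_state[None, :].astype("float32"), k)
    weights = np.exp(-dists[0] / temperature)  # closer neighbors count more
    probs = np.zeros(vocab_size)
    for w, i in zip(weights, idx[0]):
        probs[datastore_tokens[i]] += w        # vote for the stored token
    return probs / probs.sum()

# kNN-MT then interpolates: p = lam * knn_probs + (1 - lam) * nmt_probs.
```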
Towards Interactive Language Modeling Interaction between caregivers and children plays a critical role in human language acquisition and development. Given this observation, it is remarkable that explicit interaction plays little to no role in artificial language modeling---which also targets the acquisition of human language, yet by artificial models. Moreover, an interactive approach to language modeling has the potential to make language models substantially more versatile and to considerably impact downstream applications. Motivated by these considerations, we pioneer the space of interactive language modeling. First, we present a road map in which we detail the steps that need to be taken towards interactive language modeling. We then lead by example and take the first steps on this road map, showing the initial feasibility of our approach. As such, this work aims to be the start of a larger research agenda on interactive language modeling. PDF 1 2022
A Generative Language Model for Few-shot Aspect-Based Sentiment Analysis Sentiment analysis is an important task in natural language processing. In recent works, pre-trained language models are often used to achieve state-of-the-art results, especially when training data is scarce. It is common to fine-tune on the downstream task, usually by adding task-specific layers on top of the model. In this paper, we focus on aspect-based sentiment analysis, which involves extracting aspect terms and categories and predicting their corresponding polarities. In particular, we are interested in few-shot settings. We propose to reformulate the extraction and prediction tasks as a sequence generation task, using a generative language model with unidirectional attention (GPT2 is used unless stated otherwise). This way, the model learns to accomplish the tasks via language generation without the need for training task-specific layers. Our evaluation results on single-task polarity prediction show that our approach outperforms the previous state-of-the-art (based on BERT) in average performance by a large margin in few-shot and full-shot settings. More importantly, our generative approach significantly reduces the model variance caused by low-resource data. We further demonstrate that the proposed generative language model can handle joint and multi-tasking settings, unlike previous work. We observe that the proposed sequence generation method achieves further improved performance on polarity prediction when the model is trained via joint and multi-tasking settings. Further evaluation on the similar sentiment analysis datasets SST-2 and SST-5, and on OOS intent detection, validates the superiority and noise robustness of the generative language model in few-shot settings. PDF 1 2022
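One way to picture the reformulation is as serializing the aspect tuples into a target string for the LM to generate; the template below is an illustrative assumption, not the paper's exact format.

```python
def absa_to_sequence(sentence, triples):
    """Serialize (aspect term, category, polarity) tuples as a generation target."""
    target = " ; ".join(f"aspect: {t} | category: {c} | polarity: {p}"
                        for t, c, p in triples)
    return f"Sentence: {sentence} Sentiment:", target

prompt, target = absa_to_sequence(
    "The pasta was great but service was slow.",
    [("pasta", "food", "positive"), ("service", "service", "negative")],
)
# Train the LM on prompt + " " + target; at inference, decode after the prompt.
```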
EVI: Multilingual Spoken Dialogue Tasks and Dataset for Knowledge-Based Enrolment, Verification, and Identification Knowledge-based authentication is crucial for task-oriented spoken dialogue systems that offer personalised and privacy-focused services. Such systems should be able to enrol (E), verify (V), and identify (I) new and recurring users based on their personal information, e.g. postcode, name, and date-of-birth. In this work, we formalise the three authentication tasks and their evaluation protocols, and we present EVI, a challenging spoken multilingual dataset with 5,506 dialogues in English, Polish, and French. Our proposed models set the first competitive benchmarks, explore the challenges of multilingual natural language processing of spoken dialogue, and set directions for future research. PDF 1 2022
Improving Fairness without Demographics by Lagged Dynamic Grouping Machine learning models are prone to social biases in datasets and thus could make discriminatory decisions against demographic minority groups. Most existing fairness-promoting methods assume access to annotations of the demographic information. However, such information could be inaccessible due to the high data annotation cost and privacy restrictions. Recently, distributionally robust optimization (DRO) techniques have been applied to promote fairness without demographic labels. DRO-based methods optimize the individuals/groups with the worst prediction performance, with the intuition that these groups roughly correspond to the minority groups being biased against. However, in complex real-world settings with multiple strong bias attributes, the simple grouping schemes in existing DRO-based methods can fail to identify the ground truth minority groups. In this paper, we propose FreeDRO, a demographic-free group DRO method featuring a more principled grouping scheme, called lagged dynamic grouping. Specifically, FreeDRO dynamically splits the training data based on the ground truth labels and the prediction of the model at an earlier iteration, and then optimizes worst-group performance. Extensive experiments on five real-world datasets show that our method can effectively alleviate the biases and even achieve results comparable to methods with full demographic annotations. The results also verify that our grouping scheme corresponds well with the ground truth demographic grouping. PDF 1 2022
Practical Dataless Text Classification Through Dense Retrieval Dataless text classification aims to classify documents using only class descriptions without any training data. Recent research shows that pre-trained textual entailment models can achieve state-of-the-art dataless classification performance on various tasks. However, such models are not practical in that their prediction speed is slow as they need k forward passes to predict k classes and they are not built for fine-tuning to further improve the initial (often mediocre) performance. This work proposes a simple, effective, and practical dataless classification approach. We use class descriptions as queries to retrieve task-specific or external unlabeled data on which pseudo-labels are assigned to train a classifier. Experiments on a wide range of classification tasks show that the proposed approach consistently outperforms entailment-based models in terms of classification accuracy, prediction speed, and performance gain when fine-tuned on labeled data. PDF 1 2022
Exploring the Value of Multi-View Learning for Session-Aware Query Representation Recent years have witnessed a growing interest in learning distributed query representations that are able to capture search intent semantics. Most existing approaches learn query embeddings using relevance supervision, making them suited only to document ranking tasks. Besides, they generally consider either the user’s query reformulations or the system’s rankings, whereas previous findings show that the user’s query behavior and knowledge change depending on the system’s results, and that the two intertwine and affect each other during the completion of a search task. In this paper, we explore the value of multi-view learning for generic and unsupervised session-aware query representation learning. First, single-view query embeddings are obtained in separate spaces from query reformulations and document ranking representations using transformers. Then, we investigate the use of linear (CCA) and non-linear (UMAP) multi-view learning methods to align those spaces with the aim of revealing similarity traits in the multi-view shared space. Experimental evaluation is carried out on query classification and session-based retrieval downstream tasks using the KDD and TREC session datasets, respectively. The results show that multi-view learning is an effective and controllable approach for unsupervised learning of generic query representations and can reflect search behavior patterns. PDF 1 2022
Augmenting Document Representations for Dense Retrieval with Interpolation and Perturbation Dense retrieval models, which aim at retrieving the most relevant document for an input query on a dense representation space, have gained considerable attention for their remarkable success. Yet, dense models require a vast amount of labeled training data for notable performance, whereas it is often challenging to acquire query-document pairs annotated by humans. To tackle this problem, we propose a simple but effective Document Augmentation for dense Retrieval (DAR) framework, which augments the representations of documents with their interpolation and perturbation. We validate the performance of DAR on retrieval tasks with two benchmark datasets, showing that the proposed DAR significantly outperforms relevant baselines on the dense retrieval of both the labeled and unlabeled documents. PDF 1 2022
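The following is a minimal sketch of the interpolation-and-perturbation idea above, written against generic dense embeddings; the Beta-mixing and Gaussian-noise details are our assumptions rather than DAR's published recipe.

```python
import torch

def augment_document_embeddings(doc_emb, alpha=0.1, sigma=0.01):
    # Interpolation: mix each document representation with a shuffled partner.
    partner = doc_emb[torch.randperm(doc_emb.size(0))]
    lam = torch.distributions.Beta(alpha, alpha).sample((doc_emb.size(0), 1))
    interpolated = lam * doc_emb + (1 - lam) * partner
    # Perturbation: add small Gaussian noise to each representation.
    perturbed = doc_emb + sigma * torch.randn_like(doc_emb)
    return torch.cat([interpolated, perturbed], dim=0)

docs = torch.randn(8, 768)                      # e.g., dense encoder outputs
print(augment_document_embeddings(docs).shape)  # torch.Size([16, 768])
```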
Does Summary Evaluation Survive Translation to Other Languages? The creation of a quality summarization dataset is an expensive, time-consuming effort, requiring the production and evaluation of summaries by both trained humans and machines. The returns to such an effort would increase significantly if the dataset could be used in additional languages without repeating human annotations. To investigate how much we can trust machine translation of summarization datasets, we translate the English SummEval dataset to seven languages and compare performances across automatic evaluation measures. We explore equivalence testing as the appropriate statistical paradigm for evaluating correlations between human and automated scoring of summaries. We also consider the effect of translation on the relative performance between measures. We find some potential for dataset reuse in languages similar to the source and along particular dimensions of summary quality. PDF 1 2022
Lifting the Curse of Multilinguality by Pre-training Modular Transformers Multilingual pre-trained models are known to suffer from the curse of multilinguality, which causes per-language performance to drop as they cover more languages. We address this issue by introducing language-specific modules, which allows us to grow the total capacity of the model, while maintaining the total number of trainable parameters per language. In contrast to prior work which learns language-specific components post-hoc, we pre-train the modules of our Cross-lingual Mod (X-Mod) models from the start. Our experiments on natural language inference, named entity recognition and question answering show that our approach not only mitigates the negative interference between languages, but also enables positive transfer, resulting in improved monolingual and cross-lingual performance. Furthermore, our approach enables adding languages post-hoc with no measurable drop in performance, no longer limiting the model usage to the set of pre-trained languages. PDF 1 2022
Disaggregating Hops: Can We Guide a Multi-Hop Reasoning Language Model to Incrementally Learn at each Hop? Despite the success of state-of-the-art pre-trained language models (PLMs) on a series of multi-hop reasoning tasks, they still suffer from their limited abilities to transfer learning from simple to complex tasks and vice-versa. We argue that one step forward to overcome this limitation is to better understand the behavioral trend of PLMs at each hop over the inference chain. Our critical underlying idea is to mimic human-style reasoning: we envision the multi-hop reasoning process as a sequence of explicit one-hop incremental reasoning steps. Using the SHINRA and ConceptNet resources jointly, we provide automatically generated datasets built upon a set of inference heuristics on relevant phrases and distractors, allowing us to teach the models incremental reasoning skills. We empirically show the effectiveness of the proposed models on multiple-choice question answering (MCQA) and reading comprehension (RC), with relative accuracy improvements of $68.4\%$ and $16.0\%$ w.r.t. classic PLMs, respectively. PDF 1 2022
Unmasking the Trade-off: Measuring Gender Bias Mitigation and Over-debiasing Effects in Pretrained Language Models Pretrained language models (PLMs) have demonstrated success across many natural language processing tasks. However, they have been shown to encode gender bias present in the corpora they are trained on. Existing bias mitigation methods are usually devised to remove all associations related to gender. This can hurt the performance of PLMs, because of a possible loss of typical associations (e.g., not associating the word ``mother'' with female). To measure the extent of loss of typical gender associations (i.e.\ over-debiasing), we introduce the Typical Associations evaluation corpus for Gender (TA-Gender). We find that three popular debiasing methods result in substantial loss of typical gender associations. Our results highlight the importance of mitigating bias without removing typical gender associations, and our dataset constitutes the first benchmark to evaluate information loss. PDF 1 2022
DialogueEIN: Emotional Interaction Network for Emotion Recognition in Conversations Emotion Recognition in Conversations (ERC) is a necessary step for developing empathetic human-computer interaction systems. Existing methods on ERC primarily focus on capturing the context-level and speaker-level information from utterances. However, these methods ignore the causes of human emotion change, resulting in insufficient capture of useful information for emotion prediction. In this work, we propose a more explanatory Emotional Interaction Network (DialogueEIN) based on two main stages: capturing the contextual information over intra- and inter-speaker dependencies directly from utterances, and exploring and analyzing the differentiated contributions of both kinds of information to enable a better understanding of the current utterance in a conversation. Experimental results on two benchmark datasets demonstrate the effectiveness and superiority of our proposed model. PDF 1 2022
Bayesian Deep Learning for Interactive Community Question Answering Human-in-the-loop interactive learning has been shown to be effective for best solution selection tasks. Bayesian Optimisation (BO) reduces the amount of user interaction required but has so far relied on shallow models rather than end-to-end deep learning. This paper leverages recent advances in Bayesian deep learning (BDL) to more accurately identify the best solution from a few rounds of interaction. We apply our approach to community question answering (cQA), finding that our BDL approach significantly outperforms existing methods while remaining robust to noise in the user feedback. PDF 1 2022
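One common way to obtain the uncertainty estimates such an interactive loop needs is Monte-Carlo dropout; the sketch below uses it as a stand-in for the paper's BDL machinery (the scorer architecture and acquisition rule are our assumptions).

```python
import torch
import torch.nn as nn

def mc_dropout_scores(scorer, candidates, n_samples=20):
    scorer.train()  # keep dropout stochastic at inference time
    with torch.no_grad():
        samples = torch.stack([scorer(candidates) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.std(dim=0)  # score and uncertainty

scorer = nn.Sequential(nn.Linear(768, 256), nn.ReLU(),
                       nn.Dropout(0.1), nn.Linear(256, 1))
answers = torch.randn(10, 768)        # candidate answer embeddings
mean, std = mc_dropout_scores(scorer, answers)
best = torch.argmax(mean + std)       # UCB-style choice of answer to show next
```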
DUCK: Rumour Detection on Social Media by Modelling User and Comment Propagation Networks Social media rumours, a form of misinformation, can mislead the public and cause significant economic and social disruption. Motivated by the observation that the user network --- which captures $\textit{who}$ engages with a story --- and the comment network --- which captures $\textit{how}$ they react to it --- provide complementary signals for rumour detection, in this paper, we propose DUCK (rumour $\underline{d}$etection with $\underline{u}$ser and $\underline{c}$omment networ$\underline{k}$s) for rumour detection on social media. We study how to leverage transformers and graph attention networks to jointly model the contents and structure of social media conversations, as well as the network of users who engaged in these conversations. Over four widely used benchmark rumour datasets in English and Chinese, we show that DUCK produces superior performance for detecting rumours, creating a new state-of-the-art. Source code for DUCK is available at: ANONYMISED. PDF 1 2022
How Conservative are Language Models? Adapting to the Introduction of Gender-Neutral Pronouns Gender-neutral pronouns have recently been introduced in many languages to a) include non-binary people and b) as a generic singular. Recent results from psycho-linguistics suggest that gender-neutral pronouns (in Swedish) are not associated with human processing difficulties. This, we show, is in sharp contrast with automated processing. We show that gender-neutral pronouns in Danish, English, and Swedish are associated with higher perplexity, more dispersed attention patterns, and worse downstream performance. We argue that such conservativity in language models may limit widespread adoption of gender-neutral pronouns and must therefore be resolved. PDF 1 2022
Pathway2Text: Dataset and Method for Biomedical Pathway Description Generation Biomedical pathways have been extensively used to characterize the mechanism of complex diseases. One essential step in biomedical pathway analysis is to curate the description of the pathway based on its graph structure and node features. Neural text generation could be a plausible technique to circumvent this tedious manual curation. In this paper, we propose a new dataset, Pathway2Text, which contains 2094 pairs of biomedical pathways and textual descriptions. All pathway graphs are experimentally derived or manually curated. All textual descriptions are written by domain experts. We formulate this problem as a Graph2Text task and propose a novel graph-based text generation approach, $k$NN-Graph2Text, which explicitly exploits descriptions of similar graphs to generate new descriptions. We observe substantial improvements of our method on both Graph2Text and the reverse task of Text2Graph. We further illustrate how our dataset can be used as a novel benchmark for biomedical named entity recognition. Collectively, we envision our dataset will become an important benchmark for evaluating Graph2Text methods and advance biomedical research for complex diseases. PDF 1 2022
Comparative Analysis of Existing and a Novel Approach to Topic Detection on Conversational Dialogue Corpora Topic detection in dialogue corpora has become a major challenge for conversational systems, with efficient conversational topic prediction being a critical part of constructing cohesive and engaging dialogue systems (Sun et al., 2019). This paper proposes unsupervised and semi-supervised techniques for topic detection in conversational dialogue corpora and compares them with existing techniques. Existing topic detection techniques have mostly been applied to textual tweets, blogs, documents, and other textual data on the web; since textual dialogues typically consist of irregular and short sentences, we apply these existing techniques to dialogue corpora and compare them with the proposed approach. The paper proposes a novel approach for topic detection, which combines the clustering of known similar words, TF-IDF scores, and bag-of-words (BOW) techniques with the Parallel Latent Dirichlet Allocation (PLDA) model. The approach also integrates the elbow method for interpretation and validation to select the optimal number of clusters. The paper comprises a comparative analysis of traditional LDA and clustering approaches against the proposed approach on both unlabelled (unsupervised) and partially labelled (semi-supervised) Switchboard corpora. The evaluation results show that the proposed approach performs best on partially labelled dialogue corpora and outperforms traditional and unsupervised methods. PDF 1 2022
Active Gradual Machine Learning for Entity Resolution Recent work has shown that the task of entity resolution (ER) can be effectively performed by gradual machine learning (GML). GML begins with some easy instances, which can be automatically labeled by the machine with high accuracy, and then gradually labels more challenging instances by iterative knowledge conveyance in a factor graph. Without involving manual labeling effort, the current GML solution for ER is unsupervised. However, its performance is limited by inaccurate and insufficient knowledge conveyance. Therefore, there is a need to investigate how to improve knowledge conveyance by manual labeling effort. In this paper, we propose an active learning (AL) approach based on GML for ER. It iteratively generates new knowledge in the form of one-sided rules by manual label verification and instills them into a factor graph for improved knowledge conveyance. We first present a technique of knowledge discovery based on genetic mutations, which can generate effective knowledge rules with very small manual verification cost. Then, we demonstrate how to leverage the generated rules for improved knowledge conveyance by measuring their influence over label status by the metric of skyline distance. We have evaluated the performance of the proposed approach by a comparative study on real benchmark data. Our extensive experiments have shown that it can significantly improve the performance of unsupervised GML with very small manual cost; furthermore, it outperforms the state-of-the-art AL solutions for deep learning by considerable margins in terms of learning efficiency. PDF 1 2022
TaCL: Improving BERT Pre-training with Token-aware Contrastive Learning Masked language models (MLMs) such as BERT have revolutionized the field of Natural Language Understanding in the past few years. However, existing pre-trained MLMs often output an anisotropic distribution of token representations that occupies a narrow subset of the entire representation space. Such token representations are not ideal, especially for tasks that demand discriminative semantic meanings of distinct tokens. In this work, we propose TaCL (Token-aware Contrastive Learning), a novel continual pre-training approach that encourages BERT to learn an isotropic and discriminative distribution of token representations. TaCL is fully unsupervised and requires no additional data. We extensively test our approach on a wide range of English and Chinese benchmarks. The results show that TaCL brings consistent and notable improvements over the original BERT model. Furthermore, we conduct detailed analysis to reveal the merits and inner-workings of our approach. PDF 1 2022
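A minimal rendering of a token-aware contrastive objective in the spirit of TaCL might look as follows (the frozen-teacher setup and in-sequence negatives are our simplified reading of the paper):

```python
import torch
import torch.nn.functional as F

def token_contrastive_loss(student, teacher, temperature=0.05):
    # Each student token should match the teacher's representation of the
    # same position; other positions in the sequence act as negatives.
    s = F.normalize(student, dim=-1)   # [seq_len, dim], being trained
    t = F.normalize(teacher, dim=-1)   # [seq_len, dim], frozen copy of BERT
    logits = s @ t.T / temperature
    labels = torch.arange(s.size(0))   # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

student = torch.randn(32, 768, requires_grad=True)
teacher = torch.randn(32, 768)
print(token_contrastive_loss(student, teacher).item())
```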
GraphCache: Message Passing as Caching for Sentence-Level Relation Extraction Entity types and textual context are essential properties for sentence-level relation extraction (RE). Existing work only encodes these properties within individual instances, which limits the performance of RE given the insufficient features in a single sentence. In contrast, we model these properties from the whole dataset and use the dataset-level information to enrich the semantics of every instance. We propose the GraphCache (Graph Neural Network as Caching) module, which propagates the features across sentences to learn better representations for RE. GraphCache aggregates the features from sentences in the whole dataset to learn global representations of properties, and uses them to augment the local features within individual sentences. The global property features act as dataset-level prior knowledge for RE, and a complement to the sentence-level features. Inspired by the classical caching technique in computer systems, we develop GraphCache to update the property representations in an online manner. Overall, GraphCache yields significant effectiveness gains on RE and enables efficient message passing across all sentences in the dataset. PDF 1 2022
Roof-BERT: Divide Understanding Labour and Join in Work Recent work on enhancing BERT-based language representation models with knowledge graphs (KGs) and knowledge bases (KBs) has shown promising results on multiple NLP tasks. State-of-the-art approaches typically integrate the original input sentences with triples in KGs and feed the combined representation into a BERT model. However, as the sequence length of a BERT model is limited, the framework cannot incorporate much knowledge besides the original input sentences and is thus forced to discard some of it. The problem is especially severe for downstream tasks whose input is a long paragraph or even a document, such as QA or reading comprehension tasks. To address the problem, we propose Roof-BERT, a model with two underlying BERTs and a fusion layer on top of them. One of the underlying BERTs encodes the knowledge resources and the other one encodes the original input sentences, and the fusion layer, like a roof, integrates both BERTs’ encodings. Experimental results on a QA task and the GLUE benchmark reveal the effectiveness of the proposed model. PDF 1 2022
Pinyin-BART: An End-to-End Chinese Input Method A Chinese Input Method Engine (IME) helps users convert a keystroke sequence into the desired Chinese character sequence. It is usually a cascaded process in which the original input sequence is first corrected to remove typos, then segmented into a pinyin token sequence, and finally converted into a Chinese character sequence. Errors are prone to accumulate and propagate in that pipeline. This paper summarizes that process as a Key-to-Character (K2C) conversion task and solves it in a unified end-to-end way. We propose Pinyin-BART, which effectively solves the error propagation problem and significantly improves IME engine performance in our experiments. Moreover, we model users' real input behavior and design a method to generate a training corpus with typos for the K2C task. This further improves the robustness of Pinyin-BART. Finally, we design a non-autoregressive (NAR) decoder for Pinyin-BART and obtain more than 9x acceleration with limited performance degradation, which makes deployment in commercial input software possible. PDF 1 2022
Long-Tail Classification for Distinctive Image Captioning: A Simple yet Effective Remedy for Side Effects of Reinforcement Learning Distinctiveness is a desirable feature of image captions. Captions should cover the characteristic details of input images. However, recent high-performing captioning models that are trained with reinforcement learning (RL) tend to generate overly generic captions despite their high performance in various other criteria. Interestingly, it has also been reported that their outputs are composed of a limited number of common words and rarely contain tail-class words, i.e., low-frequency words in the training corpus. Vocabulary size is closely related to distinctiveness as it is difficult for a model to describe details beyond its vocabulary. Based on this insight, we hypothesize that the limited vocabulary of RL models is the major factor limiting their distinctiveness. We recast distinctive image captioning as a simpler task of long-tail classification to increase the vocabulary and then propose lightweight fine-tuning methods to encourage tail-class word generation. The experimental results demonstrate that our methods significantly enhance the distinctiveness of existing RL models as well as their vocabulary size, without sacrificing quality. Our methods also outperform previous distinctiveness-aware methods with a small computational cost of minor modifications to pre-trained RL models. PDF 1 2022
What do models learn from training on more than text? Measuring visual commonsense knowledge There are limitations in learning language from text alone. Therefore, recent focus has been on developing multimodal models. However, few benchmarks exist that can measure what language models learn about language from multimodal training. We hypothesize that training on a visual modality should improve on the visual commonsense knowledge in language models. Therefore, we introduce two evaluation tasks for measuring visual commonsense knowledge in language models\footnote{A link to a GitHub repo with the evaluation tasks and code necessary for reproducing our results will be placed here. For reviewing purposes, we add it as supplementary material.} and use them to evaluate different multimodal models and unimodal baselines. Primarily, we find that the visual commonsense knowledge is not significantly different between the multimodal models and unimodal baseline models trained on visual text data. PDF 1 2022
Representation Learning for Resource-Constrained Keyphrase Generation State-of-the-art keyphrase generation methods generally depend on large annotated datasets, limiting their performance in domains with constrained resources. To overcome this challenge, we investigate pre-training strategies to learn an intermediate representation suitable for the keyphrase generation task. We introduce salient span recovery and salient span prediction as guided denoising language modeling objectives that condense the domain-specific knowledge essential for keyphrase generation. Through experiments on benchmarks spanning multiple domains, we show the effectiveness of the proposed approaches for facilitating low resource and zero-shot keyphrase generation. PDF 1 2022
Context-Aware Prompt: Customize A Unique Prompt For Each Input After the proposal of BERT, pre-trained language models have become the dominant approach for solving many NLP tasks. Typically, a linear classifier is added to the head of the model for fine-tuning to fit downstream tasks, while a more recent approach, known as prompt-based learning or prompt-learning, which uses prompts to perform various downstream tasks, is considered able to uncover the potential of the language model. Prior studies, however, attempted to find a universal prompt for a given task across all samples. Therefore, we propose a novel method, Context-Aware Prompt (CAP), which provides a unique continuous prompt for each sample input by combining contextual information, to further investigate the potential capabilities of language models. On the SuperGLUE benchmark, our method outperforms multiple models with vanilla fine-tuning. Furthermore, we extend the use of prompts to include Replaced Token Detection (RTD) type prompts, allowing models like ELECTRA and DeBERTaV3 that employ RTD as a training objective to use prompts for downstream tasks. PDF 1 2022
KETOD: Knowledge-Enriched Task-Oriented Dialogue Existing studies in dialogue system research mostly treat task-oriented dialogue and chit-chat as separate domains. Towards building a human-like assistant that can converse naturally and seamlessly with users, it is important to build a dialogue system that conducts both types of conversations effectively. In this work, we investigate how task-oriented dialogue and knowledge-grounded chit-chat can be effectively integrated into a single model. To this end, we create a new dataset, KETOD (Knowledge-Enriched Task-Oriented Dialogue), where we naturally enrich task-oriented dialogues with chit-chat based on relevant entity knowledge. We also propose two new models, SimpleToDPlus and Combiner, for the proposed task. Experimental results on both automatic and human evaluations show that the proposed methods can significantly improve the performance in knowledge-enriched response generation while maintaining a competitive task-oriented dialog performance. We believe our new dataset will be a valuable resource for future studies. The code and the dataset will be made publicly available. PDF 1 2022
HCL-MTC: Hierarchical Contrastive Learning for Multi-label Text Classification Multi-label text classification is a challenging subtask of text classification in which labels generally form a tree structure. Existing solutions learn the label tree structure in a shallow manner and ignore the distinctive information between labels. To address this problem, we propose Hierarchical Contrastive Learning for Multi-label Text Classification (HCL-MTC), which constructs the graph based on the contrastive knowledge between labels. Specifically, we formulate MTC as a multi-task learning problem by introducing a sampling hierarchical contrastive loss, which learns both the correlative and distinctive label information and is beneficial for learning the deep label hierarchy. The experimental results show that the proposed model achieves considerable improvements on both public datasets (i.e., RCV1-v2 and WoS). PDF 1 2022
On Transferability of Prompt Tuning for Natural Language Processing Prompt tuning (PT) is a promising parameter-efficient method to utilize extremely large pre-trained language models (PLMs), which can achieve comparable performance to full-parameter fine-tuning by only tuning a few soft prompts. However, PT requires much more training time than fine-tuning. Intuitively, knowledge transfer can help to improve the efficiency. To explore whether we can improve PT via prompt transfer, we empirically investigate the transferability of soft prompts across different downstream tasks and PLMs in this work. We find that (1) in zero-shot setting, trained soft prompts can effectively transfer to similar tasks and other PLMs with a trained projector on similar tasks; (2) as initialization, trained soft prompts and projected prompts can significantly accelerate training and also improve performance of PT in similar tasks and other PLMs respectively. Moreover, to explore what decides prompt transferability, we investigate various transferability indicators and find that the overlapping rate of activated neurons strongly reflects the transferability, which suggests how the prompts stimulate PLMs is essential for transferability. Our findings show that prompt transfer is promising for improving PT, and further research shall focus more on prompts' stimulation to PLMs. The source code will be publicly released. PDF 1 2022
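The "overlapping rate of activated neurons" indicator could be computed along the following lines; the activation criterion and Jaccard formulation are our assumptions about one plausible instantiation.

```python
import torch

def activation_overlap(acts_a, acts_b):
    # acts_*: [n_probe_examples, n_neurons] feed-forward activations recorded
    # while running the model with two different soft prompts.
    on_a = acts_a.mean(dim=0) > 0       # neuron counted as activated
    on_b = acts_b.mean(dim=0) > 0
    inter = (on_a & on_b).sum().float()
    union = (on_a | on_b).sum().float()
    return (inter / union).item()       # Jaccard overlap in [0, 1]
```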
ProcessBERT: Towards Equivalence Judgment of Variable Definitions among Multiple Engineering Documents Physical models play an important role in the process industry. However, conventional physical model building requires a survey on a huge amount of literature and trial-and-error to improve the model performance. We aim to develop an automated physical model builder (AutoPMoB), which automatically collects documents about a target process from literature databases, extracts necessary information from them, and builds a desired physical model by reorganizing the information. In this study, we proposed a method of judging equivalence of variable definitions, which is one of the fundamental technologies to realize AutoPMoB. We built a large-scale corpus specialized in chemical engineering and developed ProcessBERT, which is a domain-specific language model pre-trained on our corpus. We created datasets from papers related to chemical processes and evaluated the performance of ProcessBERT in the equivalence judgment task. We found that ProcessBERT outperformed the other language models in the similarity-based method. PDF 1 2022
Uncovering Surprising Event Boundaries in Narratives When reading stories, people can naturally identify sentences in which a new event starts, i.e., \textit{event boundaries}, using their knowledge of how events typically unfold, but a computational model to detect event boundaries is not yet available. We characterize and detect sentences with expected or surprising event boundaries in an annotated corpus of short diary-like stories, using a model that combines commonsense knowledge and narrative flow features with a RoBERTa classifier. Our results show that, while commonsense and narrative features can help improve performance overall, detecting event boundaries that are more subjective remains challenging for our model. We also find that sentences marking surprising event boundaries are less likely to be causally related to the preceding sentence, but are more likely to express emotional reactions of story characters, compared to sentences with no event boundary. PDF 1 2022
Multi-Aspect co-Attentional Collaborative Filtering for Extreme Multi-label Text Classification This work proposes a general and effective architecture for extreme multi-label text classification (XMTC) and reformulates the learning task as an interaction function between documents and labels. Recently, many studies have tried to enhance text representations or reduce the number of labels to address the lack of information in a text or the sparsity of the probability vector. In the field of recommendation, a similar problem has already been defined and studied for quite a long time. It is worthwhile to adapt methods from recommendation to XMTC for accurately finding matching relations in large datasets. With a co-attention mechanism and neural collaborative filtering, we not only learn informative label representations enhanced by document-specific label group vectors and label-specific text feature vectors, but also build an effective interaction function to compute matching scores. Extensive comparison experiments with various models demonstrate that the proposed architecture outperforms most of the methods and achieves significant improvements over basic document encoders. PDF 1 2022
Focus-Driven Contrastive Learning for Medical Question Summarization Automatic medical question summarization can significantly help a system understand consumer health questions and retrieve correct answers. The Seq2Seq model based on maximum likelihood estimation (MLE) has been applied to this task, but it faces two general problems: the model cannot capture the question focus well, and the traditional MLE strategy lacks the ability to understand sentence-level semantics. To alleviate these problems, we propose a novel question focus-driven contrastive learning framework (QFCL). Specifically, we propose an easy and effective approach to generate hard negative samples based on the question focus, and exploit contrastive learning at both the encoder and decoder to obtain better sentence-level representations. On three medical benchmark datasets, our proposed model achieves new state-of-the-art results, obtaining performance gains of 12.2%, 28.7% and 9.6% over the baseline BART model on the three datasets, respectively. Further human judgement and detailed analysis prove that our QFCL model learns better sentence representations with the ability to distinguish different sentence meanings, and generates high-quality summaries by capturing the question focus. PDF 1 2022
CONTaiNER: Few-Shot Named Entity Recognition via Contrastive Learning Named Entity Recognition (NER) in Few-Shot setting is imperative for entity tagging in low resource domains. Existing approaches only learn class-specific semantic features and intermediate representations from source domains. This affects generalizability to unseen target domains, resulting in suboptimal performances. To this end, we present CONTaiNER, a novel contrastive learning technique that optimizes the inter-token distribution distance for Few-Shot NER. Instead of optimizing class-specific attributes, CONTaiNER optimizes a generalized objective of differentiating between token categories based on their Gaussian-distributed embeddings. This effectively alleviates overfitting issues originating from training domains. Our experiments in several traditional test domains (OntoNotes, CoNLL'03, WNUT '17, GUM) and a new large scale Few-Shot NER dataset (Few-NERD) demonstrate that on average, CONTaiNER outperforms previous methods by 3%-13% absolute F1 points while showing consistent performance trends, even in challenging scenarios where previous approaches could not achieve appreciable performance. PDF 1 2022
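Since CONTaiNER works with Gaussian-distributed token embeddings, the distance being optimized is a divergence between Gaussians; a sketch of the symmetrized KL for diagonal covariances is below (the symmetrization is our simplified reading of the method).

```python
import torch

def gaussian_kl(mu_p, var_p, mu_q, var_q):
    # KL(p || q) for diagonal Gaussians, summed over embedding dimensions.
    return 0.5 * torch.sum(
        torch.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0,
        dim=-1,
    )

# Two token embeddings, each parameterized as (mean, variance):
mu_a, var_a = torch.randn(64), torch.full((64,), 0.5)
mu_b, var_b = torch.randn(64), torch.full((64,), 0.8)
distance = 0.5 * (gaussian_kl(mu_a, var_a, mu_b, var_b)
                  + gaussian_kl(mu_b, var_b, mu_a, var_a))
```

Training would pull this distance down for token pairs of the same category and push it up otherwise.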
Learning Monolingual Sentence Embeddings with Large-scale Parallel Translation Datasets Although contrastive learning has greatly improved sentence representation, its performance is still limited by the size of monolingual sentence-pair datasets. Meanwhile, there exist large-scale parallel translation pairs (100x larger than monolingual pairs) that are highly semantically correlated but have not been utilized for learning sentence representations. Furthermore, given parallel translation pairs, previous contrastive learning frameworks cannot balance well the monolingual embeddings’ alignment and uniformity, which represent the quality of embeddings. In this paper, we build on top of a dual encoder and propose to freeze the source language encoder, utilizing its consistent embeddings to supervise the target language encoder via contrastive learning, where source-target translation pairs are regarded as positives. We provide the first exploration of utilizing parallel translation sentence pairs to learn monolingual sentence embeddings and show superior performance in balancing alignment and uniformity. We achieve a new state-of-the-art performance on the average score of standard semantic textual similarity (STS), outperforming both SimCSE and Sentence-T5, and the best performance in corresponding tracks on transfer tasks. PDF 1 2022
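The frozen-source supervision described above amounts to an InfoNCE-style loss where translation pairs are positives; a sketch under our own simplifying assumptions:

```python
import torch
import torch.nn.functional as F

def frozen_teacher_contrastive_loss(src_emb, tgt_emb, temperature=0.05):
    # src_emb comes from the frozen source-language encoder (no gradients);
    # tgt_emb comes from the target-language encoder being trained.
    src = F.normalize(src_emb.detach(), dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    logits = tgt @ src.T / temperature        # in-batch negatives
    labels = torch.arange(tgt.size(0))        # aligned pairs on the diagonal
    return F.cross_entropy(logits, labels)
```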
Unsupervised Full Constituency Parsing with Neighboring Distribution Divergence Unsupervised constituency parsing has been explored extensively but is still far from solved, as current mainstream unsupervised constituency parsers only capture the unlabeled structure of sentences. Properties of constituent substitution make it possible to detect constituents with a particular label. We propose an unsupervised and training-free labeling procedure by leveraging a newly introduced metric, Neighboring Distribution Divergence (NDD), which evaluates semantic changes caused by edits. We develop NDD into Dual POS-NDD (DP-NDD) and build templates called "molds" to extract labeled constituents from sentences. We show that DP-NDD labels constituents precisely and induces more accurate unlabeled constituency trees than all previous unsupervised methods. Following two frameworks for labeled constituency tree inference, we set the new state-of-the-art for both unlabeled F1 and labeled F1. Further studies show our approach can be scaled to other span labeling problems, e.g., named entity recognition. PDF 1 2022
Improving Candidate Retrieval with Entity Profile Generation for Wikidata Entity Linking There is little work on entity linking (EL) over Wikidata, even though it is the most extensive crowdsourced knowledge base. The scale of Wikidata can open up many new real-world applications, but its massive number of entities also makes EL challenging. To effectively narrow down the search space, we propose a novel candidate retrieval paradigm based on entity profiling. Wikidata entities and their textual fields are first indexed into a text search engine (e.g., Elasticsearch). During inference, given a mention and its context, we use a sequence-to-sequence (seq2seq) model to generate the profile of the target entity, which consists of its title and description. We use the profile to query the indexed search engine to retrieve candidate entities. Our approach complements the traditional approach of using a Wikipedia anchor-text dictionary, enabling us to further design a highly effective hybrid method for candidate retrieval. Combined with a simple cross-attention reranker, our complete EL framework achieves state-of-the-art results on three Wikidata-based datasets and strong performance on TACKBP-2010. PDF 1 2022
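The retrieval step might look like the sketch below, assuming the official elasticsearch Python client (8.x) and hypothetical index/field names; the generated profile simply becomes a text query over the indexed entity fields.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local entity index

def retrieve_candidates(profile_title, profile_description, k=30):
    query = {
        "bool": {
            "should": [
                {"match": {"title": profile_title}},
                {"match": {"description": profile_description}},
            ]
        }
    }
    resp = es.search(index="wikidata_entities", query=query, size=k)
    return [hit["_source"] for hit in resp["hits"]["hits"]]
```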
KGRefiner: Knowledge Graph Refinement for Improving Accuracy of Translational Link Prediction Methods Link prediction is the task of predicting missing relations between entities of a knowledge graph. Recent work in link prediction has attempted to increase accuracy by using more layers in the neural network architecture. In this paper, we propose a novel method of refining the knowledge graph so that link prediction can be performed more accurately using relatively fast translational models. Translational link prediction models, such as TransE, TransH, and TransD, have less complexity than deep learning approaches. Our method uses the hierarchy of relationships and entities in the knowledge graph to add entity information as auxiliary nodes to the graph and connect them to the nodes that contain this information in their hierarchy. Our experiments show that our method can significantly increase the performance of translational link prediction methods in terms of H@10, MR, and MRR. PDF 1 2022
That's the Wrong Lung! Evaluating and Improving the Interpretability of Unsupervised Multimodal Encoders for Medical Data Pretraining multimodal models on Electronic Health Records (EHRs) provides a means to learn rich representations that might transfer to downstream tasks with minimal supervision. Recent multimodal models induce soft local alignments between modalities (image regions and sentences). This is of particular interest in the medical domain, where alignments could serve to highlight regions in an image relevant to specific phenomena described in free-text. Past work has presented example “heatmaps” as qualitative evidence that cross-modal soft alignments can be interpreted in this manner. However, there has been little quantitative evaluation of such alignments. Here we compare alignments from a state-of-the-art multimodal (image and text) model for EHR with human annotations that associate image regions with sentences. Our main finding is that the text has surprisingly little influence on the attention; alignments do not consistently reflect basic anatomical information. Moreover, synthetic modifications, such as substituting "left" for "right," do not substantially influence attention. We find that simple techniques such as masking out entity names during training show promise in terms of their ability to improve alignments without additional supervision. PDF 1 2022
Do Prompt-Based Models Really Understand the Meaning of Their Prompts? Recently, a boom of papers has shown extraordinary progress in zero-shot and few-shot learning with various prompt-based models. Such success can give the impression that prompts help models to learn faster in the same way that humans learn faster when provided with task instructions expressed in natural language. In this study, we experiment with over 30 prompts manually written for natural language inference (NLI). We find that models learn just as fast with many prompts that are intentionally irrelevant or even pathologically misleading as they do with instructively “good” prompts. Further, such patterns hold even for models as large as 175 billion parameters (Brown et al., 2020) as well as the recently proposed instruction-tuned models which are trained on hundreds of prompts (Sanh et al., 2021; Wei et al., 2021). Despite some success, instruction-tuned models are capable of producing good predictions with misleading prompts even in the zero-shot setting. In sum, notwithstanding prompt-based models’ impressive improvement, we find evidence of serious limitations that question the degree to which language models really understand the meaning of prompts in the way humans do. PDF 1 2022
Learning the Ordering of Coordinate Compounds and Elaborate Expressions in Hmong, Lahu, and Chinese Coordinate compounds (CCs) and elaborate expressions (EEs) are coordinate constructions common in languages of East and Southeast Asia. Mortensen (2006) claims that (1) the linear ordering of EEs and CCs in Hmong, Lahu, and Chinese can be predicted via phonological hierarchies and (2) that these phonological hierarchies lack a clear phonetic rationale. These claims are significant because morphosyntax has often been seen as in a feed-forward relationship with phonology, and phonological generalizations have often been assumed to be phonetically "natural". We investigate whether the ordering of CCs and EEs can be learned empirically and whether computational models (classifiers and sequence-labeling models) learn unnatural hierarchies similar to those posited by Mortensen (2006). We find that decision trees and SVMs learn to predict the order of CCs/EEs on the basis of phonology, beating strong baselines for all three languages, with DTs learning hierarchies strikingly similar to those proposed by Mortensen. However, we also find that a neural sequence labeling model is able to learn the ordering of elaborate expressions in Hmong very effectively without using any phonological information. We argue that EE ordering can be learned through two independent routes: phonology and lexical distribution, presenting a more nuanced picture than previous work. PDF 1 2022
Let the Model Decide its Curriculum for Multitask Learning Curriculum learning strategies in prior multi-task learning approaches arrange datasets in a difficulty hierarchy either based on human perception or by exhaustively searching for the optimal arrangement. However, human perception of difficulty may not always correlate well with machine interpretation, leading to poor performance, and exhaustive search is computationally expensive. Addressing these concerns, we propose two classes of techniques to arrange training instances into a learning curriculum based on difficulty scores computed via model-based approaches. The two classes, i.e., dataset-level and instance-level, differ in the granularity of arrangement. We conduct comprehensive experiments with $12$ datasets and show that instance-level and dataset-level techniques lead to average performance improvements of $4.17\%$ and $3.15\%$ over their respective baseline methods. Furthermore, we find that most of this improvement comes from correctly answering the difficult instances, implying a greater efficacy of our techniques on difficult tasks. PDF 1 2022
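At the instance level, the idea reduces to scoring each example with a model-based difficulty estimate and ordering training accordingly; the per-example-loss scorer below is our illustrative assumption, not necessarily the paper's exact scoring function.

```python
import torch
import torch.nn.functional as F

def difficulty_scores(model, dataset):
    model.eval()
    scores = []
    with torch.no_grad():
        for features, label in dataset:             # one example at a time
            logits = model(features.unsqueeze(0))   # add a batch dimension
            scores.append(F.cross_entropy(logits, label.unsqueeze(0)).item())
    return scores

def build_curriculum(dataset, scores):
    # Arrange instances from easy (low loss) to hard (high loss).
    order = sorted(range(len(dataset)), key=lambda i: scores[i])
    return [dataset[i] for i in order]
```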
STT: Soft Template Tuning for Few-Shot Learning With the rapid expansion of large pre-trained language models, fine-tuning all the model parameters for downstream tasks is becoming computationally prohibitive. The recently developed prompt-based methods freeze the entire model parameters and only update the so-called prompt parameters appended to the inputs, significantly reducing the burden of fully fine-tuning. However, standard prompt-based methods mainly consider the case where sufficient data of downstream tasks are available. It is still unclear whether the advantage can be transferred to the few-shot regime, where only limited data are available for each downstream task. Our empirical studies suggest there is still a gap between prompt tuning and fully fine-tuning for few-shot learning. We propose a new prompt-tuning framework, called Soft Template Tuning (STT), to bridge the gap. STT combines manual prompts and auto-prompts, and treats downstream classification tasks as a masked language modeling task. STT can close the gap between fine-tuning and prompt-based methods without introducing additional parameters. Importantly, it can even outperform the time- and resource-consuming fine-tuning method on sentiment classification tasks. PDF 1 2022
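A minimal sketch of the soft-template idea, under our own assumptions about shapes: trainable prompt vectors are prepended to the embedded input, whose manual template ends in a mask slot read out by the masked-LM head.

```python
import torch
import torch.nn as nn

class SoftTemplate(nn.Module):
    def __init__(self, embed_dim=768, n_soft_tokens=20):
        super().__init__()
        # Trainable soft prompt; the backbone LM stays frozen.
        self.soft = nn.Parameter(torch.randn(n_soft_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds):                 # [batch, seq, dim]
        batch = input_embeds.size(0)
        soft = self.soft.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([soft, input_embeds], dim=1)

# The concatenated sequence would feed a frozen masked LM; classification
# reads the logits of verbalizer tokens (e.g., "great"/"terrible") at [MASK].
module = SoftTemplate()
print(module(torch.randn(4, 128, 768)).shape)        # [4, 148, 768]
```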
DUAL: Textless Spoken Question Answering with Speech Discrete Unit Adaptive Learning Spoken Question Answering (SQA) has gained research attention and made remarkable progress in recent years. However, existing SQA methods rely on Automatic Speech Recognition (ASR) transcriptions, which are time- and cost-prohibitive to collect. This work proposes an ASR transcription-free SQA framework named Discrete Unit Adaptive Learning (DUAL), which leverages unlabeled data for pre-training and is fine-tuned on the SQA downstream task. DUAL can directly predict the time interval of the spoken answer from the spoken document. We also release a new SQA benchmark corpus, Natural Multi-speaker Spoken Question Answering (NMSQA), for testing SQA in realistic scenarios. The experimental results show that DUAL performs competitively with the cascade approach (ASR + text QA) and is robust to real-world speech. We will open-source our code and model to inspire more SQA innovations from the community. PDF 1 2022
Measuring Faithfulness of Abstractive Summaries Recent abstractive summarization systems fail to generate factually consistent, i.e. faithful, summaries, which heavily limits their practical application. Commonly, these models tend to mix concepts from the source or hallucinate new content, completely ignoring the source. Addressing the faithfulness problem is perhaps the most critical challenge for current abstractive summarization systems. The first automatic faithfulness metrics have been proposed, but we argue that existing methods do not yet utilize the full potential this field has to offer, and we introduce new approaches to assess factual correctness. We evaluate existing and proposed methods by correlating them with human judgements and find that BERTScore works well. Finally, we conduct a qualitative and quantitative error analysis, which reveals common problems and indicates means to further improve the metrics. PDF 1 2022
Efficient Machine Translation Domain Adaptation Machine translation models struggle when translating out-of-domain text, which makes domain adaptation a topic of critical importance. However, most domain adaptation methods focus on fine-tuning or training the entire or part of the model on every new domain, which can be costly. On the other hand, semi-parametric models have been shown to successfully perform domain adaptation by retrieving examples from an in-domain datastore (Khandelwal et al., 2021). A drawback of these retrieval-augmented models, however, is that they tend to be substantially slower. In this paper, we explore several approaches to speed up nearest neighbors machine translation. We adapt the methods recently proposed by He et al. (2021) for language modeling, and introduce a simple but effective caching strategy that avoids performing retrieval when similar contexts have been seen before. Translation quality and runtimes for several domains show the effectiveness of the proposed solutions. PDF 1 2022
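The caching strategy can be pictured as follows; thresholds and data layout are our illustrative assumptions, not the paper's exact design.

```python
import torch

class RetrievalCache:
    """Skip the expensive kNN datastore lookup when the current decoder
    context is close enough to one seen before."""
    def __init__(self, threshold=0.95):
        self.keys, self.values, self.threshold = [], [], threshold

    def lookup(self, context):
        if self.keys:
            keys = torch.stack(self.keys)
            sims = torch.nn.functional.cosine_similarity(
                keys, context.unsqueeze(0))
            best = int(torch.argmax(sims))
            if sims[best] >= self.threshold:
                return self.values[best]   # cache hit: reuse old neighbours
        return None                        # cache miss: query the datastore

    def store(self, context, retrieved):
        self.keys.append(context)
        self.values.append(retrieved)
```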
TRUE: Re-evaluating Factual Consistency Evaluation Grounded text generation systems often generate text that contains factual inconsistencies, hindering their real-world applicability. Automatically evaluating such inconsistencies may help to alleviate this limitation by accelerating evaluation cycles, filtering inconsistent outputs and annotating large-scale training data. While attracting increasing attention, such evaluation metrics are usually developed and evaluated in silos for a single task or dataset. Moreover, previous meta-evaluation protocols focused on system-level correlations with human annotations, which leave the example-level accuracy of such metrics unclear. In this work, we introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks, manually annotated for factual consistency. Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations, yielding clearer quality measures. Across diverse state-of-the-art metrics and 11 datasets we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results, and recommend them as a starting point for future evaluations. PDF 1 2022
TEMPLATE: TempRel Classification Model Trained with Embedded Temporal Relation Knowledge Mainstream Temporal Relation (TempRel) classification methods do not take advantage of the large amount of semantic information contained in gold TempRel labels, which is lost with traditional discrete one-hot labels. We therefore propose a new approach that makes full use of gold TempRel label information and improves model performance. First, we build a TempRel classification model consisting of a RoBERTa encoder and a classifier. Second, we establish fine-grained templates to automatically generate sentences that enrich gold TempRel label information and build an enhanced dataset. Third, we use the enhanced dataset to train the knowledge encoder, which has the same structure as the TempRel classification model, and obtain embedded knowledge. Finally, we train the TempRel classification model with EMbedded temPoral reLATion knowledgE (TEMPLATE) using our designed cosine balanced MSE loss function. Extensive experimental results show that our approach achieves new state-of-the-art results on TB-Dense and MATRES, outperforming the TempRel classification model trained with only the traditional cross-entropy loss function by up to 5.51% F1 on TB-Dense and 2.02% F1 on MATRES. PDF 1 2022
Adaptable Adapters State-of-the-art pretrained NLP models contain from hundreds of millions to trillions of parameters. Adapters provide a parameter-efficient alternative to full fine-tuning, in which we fine-tune only lightweight neural network layers on top of pretrained weights. Adapter layers are initialized randomly. However, existing work uses the same adapter architecture---i.e., the same adapter layer on top of each layer of the pretrained model---for every dataset, regardless of the properties of the dataset or the amount of available training data. In this work, we introduce adaptable adapters that (1) learn different activation functions for different layers and different input data, and (2) contain a learnable switch to select and use only the beneficial adapter layers. We show that adaptable adapters achieve on-par performance with the standard adapter architecture while using a considerably smaller number of adapter layers. In addition, we show that the adapter architecture selected by adaptable adapters transfers well across different data settings and similar tasks. We propose using adaptable adapters for designing efficient and effective adapter architectures. The resulting adapters (a) contain about 50% of the learnable parameters of the standard adapter and are therefore more efficient at training and inference, and require less storage space, and (b) achieve considerably higher performance in low-resource scenarios. The code will be publicly available upon publication. PDF 1 2022
Learn to Discover Dialog Intents via Self-supervised Context Pretraining Intent detection is one of the most critical tasks in prevalent task-oriented dialog systems. However, most systems can only identify a fixed set of intents, without covering the ubiquitous space of real-world semantics. Inducing new dialog intents or excluding out-of-scope (OOS) queries is crucial, particularly in complex domains like customer support. We present a simple yet effective intent induction schema via pre-training and contrastive learning. In particular, we first transform pretrained LMs into conversational encoders with in-domain dialogs. Then we conduct context-aware contrastive learning to reveal latent intent semantics via coherence from dialog contexts. By composing a fine-grained intent subspace from in-scope domain data, we demonstrate the effectiveness of our approach in inducing intents with simple clustering algorithms and detecting outliers with probabilistic linear discriminant analysis (pLDA). The experimental results validate the robustness and versatility of our framework, which also achieves superior performance over competitive baselines without label supervision. PDF 1 2022
ViQA-COVID: COVID-19 Machine Reading Comprehension Dataset for Vietnamese After two years, COVID-19 has negatively affected people and normal life around the world. As of January 2022, there were more than 317 million cases and five million deaths worldwide (including nearly two million cases and over thirty-four thousand deaths in Vietnam). The economy and society are both severely affected. The Omicron variant of COVID-19 has broken through countries' disease prevention measures and rapidly increased the number of infections. Resource overload in treatment and epidemic prevention is happening all over the world. Applying artificial intelligence (AI) to support people at this time is therefore extremely necessary. There have been many extremely useful studies applying AI to COVID-19 prevention, including studies on machine reading comprehension (MRC). Realizing this, we created ViQA-COVID, the first MRC dataset about COVID-19 for Vietnamese, which can be used to build models and systems, contributing to disease prevention. Besides, ViQA-COVID is also the first multi-span extraction MRC dataset for Vietnamese; we hope that it can contribute to promoting MRC studies in Vietnamese and multilingual settings. We will publicly release ViQA-COVID soon. PDF 1 2022
Same Author or Just Same Topic? Towards Topic-Independent Style Representations Style is an integral component of language. Recent advances in the development of style representations have increasingly used training objectives from authorship verification (AV): Do two texts have the same author? The assumption underlying the AV training task (same author approximates same writing style) enables self-supervised and, thus, extensive training. However, AV usually controls for topic either not at all or only at a coarse-grained level. The resulting representations might therefore also encode topical information instead of style alone. We introduce a variation of the AV training task that controls for topic, using conversation, domain, or no topic control as a topic proxy. To evaluate whether trained representations prefer style over topic information, we propose an original variation to the recent STEL framework. We find that representations trained by controlling for conversation are better than representations trained with domain or no topic control at representing style independent of topic. PDF 1 2022
Assisting the Human Fact-Checkers: Detecting All Previously Fact-Checked Claims in a Document Given the recent proliferation of false claims online, there has been a lot of manual fact-checking effort. As this is very time-consuming, human fact-checkers can benefit from tools that can support them and make them more efficient. Here, we focus on building a system that could provide such support. Given an input document, it aims to detect all sentences that contain a claim that can be verified by some previously fact-checked claims (from a given database). The output is a re-ranked list of the document sentences, so that those that can be verified are ranked as high as possible, together with corresponding evidence. Unlike previous work, which has looked into claim retrieval, here we take a document-level perspective. We create a new manually annotated dataset for the task, and we propose suitable evaluation measures. We further experiment with a learning-to-rank approach, achieving sizable performance gains over several strong baselines. Our analysis demonstrates the importance of modeling text similarity and stance, while also taking into account the veracity of the retrieved previously fact-checked claims. We believe that this research would be of interest to fact-checkers, journalists, media, and regulatory authorities. PDF 1 2022
Semantics is Actually 82% Distributional, but Neural Networks Aren't. Distributional semantics is often proposed as the linguistic theory underpinning many of the most efficient current NLP systems. In the present paper, we question the linguistic well-foundedness of these models, addressing it from the perspective of distributional substitution. To that end, we provide a dataset of human judgments on the distributional hypothesis, and highlight how humans cannot systematically distinguish pairs of words solely from contextual information. We stress that earlier static embedding architectures are competitive with more modern contextual embeddings on the distributional substitution task, and that neither serves as a good model of human linguistic behavior. PDF 1 2022
Less is More: Learning to Refine Dialogue History for Personalized Dialogue Generation Personalized dialogue systems explore the problem of generating responses that are consistent with the user's personality, which has attracted much attention in recent years. Existing personalized dialogue systems have tried to extract user profiles from the dialogue history to guide personalized response generation. Since the dialogue history is usually long and noisy, most existing methods truncate it to model the user personality. Such methods can generate some personalized responses, but a large part of the dialogue history is wasted, leading to sub-optimal personalized response generation. In this work, we propose to refine the user dialogue history on a large scale, based on which we can handle more dialogue history and obtain more abundant and accurate persona information. Specifically, we design an MSP model which consists of three personal information refiners and a personalized response generator. With these multi-level refiners, we can sparsely extract the most valuable information (tokens) from the dialogue history and leverage other similar users' data to enhance personalization. Experimental results on two real-world datasets demonstrate the superiority of our model in generating more informative and personalized responses. PDF 1 2022
Elastic Weight Consolidation for Reduction of Catastrophic Forgetting in GPT-2 Neural networks are naturally prone to the effects of catastrophic forgetting during fine-tuning. Despite the extensive adoption of transformers, little research has been done to investigate the effects of catastrophic forgetting on attention-based architectures. In this work, we used elastic weight consolidation (EWC) to mitigate catastrophic forgetting caused by fine-tuning in one of the foundation models, GPT-2. We show that by using EWC, we can significantly slow down the forgetting process without a major penalty to performance on the task the model is fine-tuned for. We also determine that the majority of important weights are located in the self-attention layers, and that the parameters most sensitive to change are located in the normalization layers. Finally, we explore the instability of EWC and potential performance issues. PDF 1 2022
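The EWC penalty referred to above is a quadratic anchor on the pre-fine-tuning weights, scaled per parameter by an estimate of the Fisher information. A minimal PyTorch sketch, assuming `fisher` and `old_params` dictionaries computed on the original task (the names are illustrative, not the paper's released code):

```python
import torch

def ewc_penalty(model, fisher, old_params, lam=1000.0):
    """Quadratic EWC term: lam/2 * sum_i F_i * (theta_i - theta*_i)^2.

    `fisher` and `old_params` map parameter names to tensors estimated
    before fine-tuning; both are assumptions of this sketch.
    """
    penalty = torch.tensor(0.0)
    for name, param in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (param - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# total fine-tuning loss: task loss plus the consolidation term
# loss = task_loss + ewc_penalty(model, fisher, old_params)
```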
Context-guided Triple Matching for Multiple Choice Question Answering The task of multiple choice question answering (MCQA) refers to identifying a suitable answer from multiple candidates by estimating the matching score among the \emph{triple} of the passage, question and answer. Despite the general research interest in this regard, existing methods decouple the process into several pair-wise or \emph{dual} matching steps, which limits their ability to assess cases with multiple evidence sentences. To alleviate this issue, this paper introduces a novel \textbf{C}ontext-guided \textbf{T}riple \textbf{M}atching algorithm, achieved by integrating a Triple Matching (TM) module and a Contrastive Regularization (CR) term. The former enumerates each component of the triple in turn as the background context and estimates its semantic matching with the other two. The contrastive term is further proposed to capture the dissimilarity between the correct answer and distractive ones. We validate the proposed algorithm on several benchmark MCQA datasets, where it exhibits competitive performance against state-of-the-art methods. PDF 1 2022
MetaQA: Combining Expert Agents for Multi-Skill Question Answering The recent explosion of question answering (QA) datasets and models has increased the interest in the generalization of models across multiple domains and formats by either training on multiple datasets or by combining multiple models. Despite the promising results of multi-dataset models, some domains or QA formats may require specific architectures, and thus the adaptability of these models might be limited. In addition, current approaches for combining models disregard cues such as question-answer compatibility. In this work, we propose to combine expert agents with a novel, flexible, and training-efficient architecture that considers questions, answer predictions, and answer-prediction confidence scores to select the best answer among a list of answer candidates. Through quantitative and qualitative experiments we show that our model i) creates a collaboration between agents that outperforms previous multi-agent and multi-dataset approaches in both in-domain and out-of-domain scenarios, ii) is highly data-efficient to train, and iii) can be adapted to any QA format. We release our code and a dataset of answer predictions from expert agents for 16 QA datasets to foster future developments of multi-agent systems. PDF 1 2022
Improved and Efficient Conversational Slot Labeling through Question Answering Transformer-based pretrained language models (PLMs) offer unmatched performance across the majority of natural language understanding (NLU) tasks, including a body of question answering (QA) tasks. We hypothesize that improvements in QA methodology can also be directly exploited in dialog NLU; however, dialog tasks must be \textit{reformatted} into QA tasks. In particular, we focus on modeling and studying \textit{slot labeling} (SL), a crucial component of NLU for dialog, through the QA optics, aiming to improve both its performance and efficiency and to make it more effective and resilient when working with limited task data. To this end, we make a series of contributions: 1) We demonstrate how QA-tuned PLMs can be applied to the SL task, reaching new state-of-the-art performance, with large gains especially pronounced in low-data regimes. 2) We propose to leverage contextual information, required to tackle ambiguous values, simply through natural language. 3) Efficiency and compactness of QA-oriented fine-tuning are boosted through the use of lightweight yet effective adapter modules. 4) Trading off some of the quality of QA datasets for their size, we experiment with larger automatically generated QA datasets for QA-tuning, arriving at even higher performance. Finally, our analysis suggests that our novel QA-based slot labeling models, supported by the PLMs, reach a performance ceiling in high-data regimes, calling for more challenging and more nuanced benchmarks in future work. PDF 1 2022
Evons: A Dataset for Fake and Real News Virality Analysis and Prediction We present a new collection of news articles originating from fake and real news media sources for the analysis and prediction of news virality. Unlike existing fake news datasets, which contain either claims or news article headlines and bodies, in this collection each article is accompanied by a Facebook engagement count, which we consider an indicator of the article's virality. In addition, we provide the description and thumbnail image with which each article was shared on Facebook. These images were automatically annotated with object tags and color attributes. Using cloud-based vision analysis tools, thumbnail images were also analyzed for faces, and detected faces were annotated with facial attributes. We empirically investigate the use of this collection on the task of article virality prediction. PDF 1 2022
EiCi: A New Method of Dynamic Embedding Incorporating Contextual Information in Chinese NER With the continuous development of deep learning technology, the field of Named Entity Recognition (NER) has made great achievements in recent years. In Chinese NER, making full use of word information is becoming key to improving model performance. In previous related work, lexicons were applied to add word information. However, the word vectors generated this way are static: they cannot accurately describe polysemous words in a specific context, which hurts performance on the NER task. This paper presents EiCi to solve this problem. Without relying on external pre-trained word vectors, EiCi takes advantage of the pre-trained language model BERT to extract polysemous-word information. To further utilize the word information, a sub-module for type recognition is added to assist the main NER task. Experiments on two main Chinese NER datasets show that EiCi performs better than traditional NER models and other NER models that use word information. PDF 1 2022
Semantic Parsing for Planning Goals as Constrained Combinatorial Contextual Bandits We are working towards AI planning systems with natural language interfaces. In this paper, we tackle the semantic parsing problem of learning to set the logical goals of the planning system based on a natural language description of the task. The current state of the art in semantic parsing is to use supervised learning with deep neural networks, but this needs a lot of labelled data made by domain experts. To reduce this need, we additionally use a reward signal that comes from completing the AI planning task. We formalize this as a constrained combinatorial contextual bandit problem. The context is created by using a deep neural network for feature extraction, and the constrained combinatorial nature of the task can be used to increase the efficiency of learning. We show this result theoretically with our lower regret bound and then experimentally in our extension of the TextWorld problem. PDF 1 2022
Graph Recurrent Neural Network for Text Classification The application of Graph Neural Networks (GNNs) to text classification is currently one of the most popular research directions. Most GNN-based models only focus on the interaction of words in the document, while word order is ignored and the related semantic information is lost. In addition, when the graph density increases, the word nodes become over-smoothed, and as a result the semantic information of the document is destroyed. In this paper, TextGRNN, a GNN-based text classification method, is proposed to solve the above problems. First, the model constructs a document-level graph via a Visibility Graph, which restrains the graph density, and updates the word representations with a GNN. Then TextGRNN utilizes a Bi-LSTM, which can recognize word order, to learn the semantic information of the document. Finally, an attention mechanism is used to highlight the essential words. Numerous experiments on three benchmark datasets demonstrate that our model is preferable to state-of-the-art text classification methods. PDF 1 2022
Bi-SimCut: A Simple Strategy for Boosting Neural Machine Translation We introduce Bi-SimCut: a simple but effective strategy to boost neural machine translation (NMT) performance. It consists of two training procedures: bidirectional pretraining and unidirectional finetuning. Both procedures utilize SimCut, a simple regularization method that forces consistency between the output distributions of the original and the cutoff samples. Without utilizing an extra dataset via back-translation or integrating a large-scale pretrained model, Bi-SimCut achieves strong translation performance across five translation benchmarks (data sizes range from 160K to 20.1M): BLEU scores of $31.16$ for $\texttt{en}\rightarrow\texttt{de}$ and $38.37$ for $\texttt{de}\rightarrow\texttt{en}$ on the IWSLT14 dataset, $30.78$ for $\texttt{en}\rightarrow\texttt{de}$ and $35.15$ for $\texttt{de}\rightarrow\texttt{en}$ on the WMT14 dataset, and $27.17$ for $\texttt{zh}\rightarrow\texttt{en}$ on the WMT17 dataset. SimCut is not a new method, but a version of Cutoff (Shen et al., 2020) simplified and adapted for NMT, and it can be considered a perturbation-based method. Given the universality and simplicity of Bi-SimCut and SimCut, we believe they can serve as strong baselines for future NMT research. PDF 1 2022
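The SimCut regularizer described above forces consistency between the model's output distributions on an original sample and its cutoff-perturbed version. A minimal sketch of one such consistency term, using a symmetric KL divergence; the exact formulation in the paper may differ:

```python
import torch
import torch.nn.functional as F

def simcut_consistency(logits_orig, logits_cut):
    """Bidirectional KL between the output distributions of the original
    and the cutoff (perturbed) input. A sketch of the idea described in
    the abstract, not the paper's released code.
    """
    p = F.log_softmax(logits_orig, dim=-1)
    q = F.log_softmax(logits_cut, dim=-1)
    kl_pq = F.kl_div(q, p.exp(), reduction="batchmean")  # KL(p || q)
    kl_qp = F.kl_div(p, q.exp(), reduction="batchmean")  # KL(q || p)
    return 0.5 * (kl_pq + kl_qp)

# training loss: cross-entropy plus the weighted consistency term
# loss = ce_loss + alpha * simcut_consistency(logits_orig, logits_cut)
```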
Towards Equal Opportunity Fairness through Adversarial Learning Adversarial training is a common approach for bias mitigation in natural language processing. Although most work on debiasing is based around the equal opportunity criterion, it is not explicitly captured in standard adversarial training. In this paper, we propose an augmented discriminator for adversarial training, which takes the target class as input to create richer features and more explicitly model equal opportunity. Experimental results over two datasets show that our method substantially improves over standard adversarial debiasing methods, in terms of the performance--fairness trade-off. PDF 1 2022
Enhancing the Nonlinear Mutual Dependencies in Transformers with Mutual Information The predictive uncertainty problem exists in Transformers. We show that pre-trained Transformers can be further regularized by employing mutual information to alleviate this issue in neural machine translation (NMT). In this paper, to enhance the representation, we explicitly capture the nonlinear mutual dependencies in the two types of attention in the decoder to reduce model uncertainty. Specifically, we employ mutual information to measure the nonlinear mutual dependencies of token-token interactions during attention calculation. Moreover, we resort to InfoNCE for mutual information estimation to avoid intractable computation. By maximizing the mutual information among tokens, we capture more knowledge concerning token-token interactions from the training corpus to reduce model uncertainty. Experimental results on WMT'14 En$\rightarrow$De and WMT'14 En$\rightarrow$Fr demonstrate the consistent effectiveness and evident improvements of our model over strong baselines. Quantifying the model uncertainty again verifies our hypothesis. The proposed plug-and-play approach can be easily incorporated and deployed into pre-trained Transformer models. Code will be released soon. PDF 1 2022
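The InfoNCE estimator mentioned above lower-bounds mutual information by scoring matching pairs against in-batch negatives. A sketch of the standard form (not this paper's code):

```python
import torch
import torch.nn.functional as F

def infonce(queries, keys, temperature=0.1):
    """InfoNCE lower bound on mutual information between paired token
    representations: positives are matching (query, key) rows; all other
    rows in the batch act as negatives.
    """
    q = F.normalize(queries, dim=-1)          # (B, d)
    k = F.normalize(keys, dim=-1)             # (B, d)
    logits = q @ k.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(q.size(0))          # positives on the diagonal
    return F.cross_entropy(logits, labels)
```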
All You May Need for VQA are Image Captions Visual Question Answering (VQA) is a challenge that has benefited tremendously from increasingly sophisticated models, but has not enjoyed the same level of engagement in terms of data creation. We propose here a method that automatically derives VQA examples at volume, by leveraging the abundance of existing image-caption annotations combined with neural models for question generation. We show that the resulting data is powerful enough to boost the state-of-the-art zero-shot results on VQA by double digits, and exhibits a level of robustness that is lacking in models with the same architecture trained on human-annotated data. PDF 1 2022
A Graph Fusion Approach for Cross-Lingual Machine Reading Comprehension Although great progress has been made for Machine Reading Comprehension (MRC) in English, scaling out to a large number of languages remains a huge challenge due to the lack of large amounts of annotated training data in non-English languages. To address this challenge, some recent efforts of cross-lingual MRC employ machine translation to transfer knowledge from English to other languages, through either explicit alignment or implicit attention. For effective knowledge transfer, it is beneficial to leverage both semantic and syntactic information. However, the existing methods fail to explicitly incorporate syntax information in model learning. Consequently, the models are not robust to errors in alignment and noise in attention. In this work, we propose a novel approach, named GraFusionMRC, which jointly models the cross-lingual alignment information and the mono-lingual syntax information using a graph. We develop a series of algorithms including graph construction, learning, and pre-training. The experiments on two benchmark datasets for cross-lingual MRC show that our approach outperforms all strong baselines, which verifies the effectiveness of syntax information for cross-lingual MRC. The code will be open-sourced on GitHub. PDF 1 2022
Distant Supervision for Relation Extraction with Hierarchical Attention-Based Networks Distant supervision employs external knowledge bases to automatically label corpora. The labeled sentences in a corpus are usually packaged and trained for relation extraction using a multi-instance learning paradigm. Automated distant supervision inevitably introduces label noise. Previous studies that used sentence-level attention mechanisms to de-noise considered neither correlations among sentences in a bag nor correlations among bags. This paper proposes hierarchical attention-based networks that can de-noise at both the sentence and bag levels. In the calculation of the bag representation, we weight sentence representations using sentence-level attention that considers correlations among sentences in each bag. Then, we employ bag-level attention to merge similar bags by considering their correlations and to provide more appropriate weights in the calculation of the bag-group representation. Experimental results on the New York Times datasets show that the proposed method outperforms state-of-the-art ones. PDF 1 2022
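Both levels of the hierarchy above reduce to attention-weighted pooling: sentence representations are pooled into a bag representation, then bag representations into a bag-group representation. A generic sketch, assuming a query vector that encodes the target relation (an illustrative simplification of the paper's attention):

```python
import torch
import torch.nn.functional as F

def attention_pool(reps, query):
    """Attention-weighted pooling usable at both levels of the hierarchy.

    reps:  (n, d) item representations (sentences in a bag, or bags in a group)
    query: (d,)   relation query vector (assumed, for illustration)
    """
    scores = reps @ query                 # (n,) relevance of each item
    weights = F.softmax(scores, dim=0)    # normalized attention weights
    return weights @ reps                 # weighted sum, shape (d,)

# sentence level: bag = attention_pool(sentence_reps, relation_query)
# bag level:      group = attention_pool(torch.stack(bags), relation_query)
```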
Transferring Knowledge from Structure-aware Self-attention Language Model to Sequence-to-Sequence Semantic Parsing Semantic parsing considers the task of mapping a natural language sentence into a target formal representation, where various sophisticated sequence-to-sequence (seq2seq) models have been applied with promising results. Generally, these target representations follow a syntax formalism that limits permitted forms. However, it is neither easy nor flexible to explicitly integrate this syntax formalism into a neural seq2seq model. In this paper, we present a structure-aware self-attention language model to capture structural information of target representations and propose a knowledge distillation based approach to incorporating the target language model into a seq2seq model, where grammar rules, sketches, or extra corpora are not required in the training process. An ablation study shows that the proposed language model can notably improve the performance of the baseline model. The experiments show that our method achieves new state-of-the-art performance among neural approaches on four semantic parsing (ATIS, GEO) and Python code generation (Django, CoNaLa) tasks. PDF 1 2022
A Two-stage Attention-based Model for Customer Satisfaction Prediction in E-commerce Customer Service Nowadays, customer satisfaction prediction (CSP) on e-commerce platforms has become a hot research topic for both intelligent and artificial customer service. CSP aims to discover customer satisfaction from the dialogue between a customer and customer service, for the purpose of improving service quality and customer experience. In this paper, we focus on CSP for intelligent customer service chatbots. Although previous works have made progress in many aspects, they mostly ignore the huge differences in expression between customers and customer service, and fail to adequately consider the internal relations between these two kinds of personalized expressions. Thus, to emphasize the importance of modeling the customer part and the service part separately, in this work we propose a two-stage dialogue-level classification model, which contains an intra-stage and an inter-stage to handle the issues above. In the intra-stage, we model the customer part and the service part separately using an attention mechanism combined with personalized context, obtaining a {\it customer state} and a {\it service state}. We then let those two states interact with each other in the inter-stage to capture the final satisfaction representation of the whole dialogue. Experimental results demonstrate that our model achieves better performance than several competitive baselines on our in-house dataset and four public datasets. PDF 1 2022
ProQA: Structural Prompt-based Pre-training for Unified Question Answering Question Answering (QA) is a longstanding challenge in natural language processing. Existing QA works mostly focus on specific question types, knowledge domains, or reasoning skills. This specialization hinders systems from modeling commonalities between tasks and from generalizing to wider applications. To address this issue, we present ProQA, a unified QA paradigm that solves various tasks through a single model. ProQA takes a unified structural prompt as the bridge and improves the QA-centric ability through structural prompt-based pre-training. Through a structurally designed prompt-based input schema, ProQA concurrently models the knowledge generalization for all QA tasks while keeping the knowledge customization for every specific QA task. Furthermore, ProQA is pre-trained on a structural prompt-formatted, large-scale synthesized corpus, which equips the model with the commonly required QA ability. Experimental results on 11 QA benchmarks demonstrate that ProQA consistently boosts performance in full-data fine-tuning, few-shot learning, and zero-shot testing scenarios. Furthermore, ProQA exhibits a strong ability in both continual learning and transfer learning by taking advantage of the structural prompt. PDF 1 2022
Analytical Reasoning of Text Analytical reasoning is an essential and challenging task that requires a system to analyze a scenario involving a set of particular circumstances and perform reasoning over it to make conclusions. However, current neural models with implicit reasoning ability struggle to solve this task. In this paper, we study the challenge of analytical reasoning of text and collect a new dataset consisting of questions from the Law School Admission Test from 1991 to 2016. We analyze what knowledge understanding and reasoning abilities are required to do well on this task, and present an approach dubbed ARM. It extracts knowledge such as participants and facts from the context. Such knowledge is applied to an inference engine to deduce legitimate solutions for drawing conclusions. In our experiments, we find that ubiquitous pre-trained models struggle to deal with this task, as their performance is close to random guessing. Results show that ARM outperforms pre-trained models significantly. Moreover, we demonstrate that ARM has better explicit, interpretable reasoning ability. PDF 1 2022
Sentence-Level Resampling for Named Entity Recognition As a fundamental task in natural language processing, named entity recognition (NER) aims to locate and classify named entities in unstructured text. However, named entities are always the minority among all tokens in the text. This data imbalance problem presents a challenge to machine learning models as their learning objective is usually dominated by the majority of non-entity tokens. To alleviate data imbalance, we propose a set of sentence-level resampling methods where the importance of each training sentence is computed based on its tokens and entities. We study the generalizability of these resampling methods on a wide variety of NER models (CRF, Bi-LSTM, and BERT) across corpora from diverse domains (general, social, and medical texts). Extensive experiments show that the proposed methods improve span-level macro F1-scores of the evaluated NER models on multiple corpora, frequently outperforming sub-sentence-level resampling, data augmentation, and special loss functions such as focal and Dice loss. PDF 1 2022
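As a rough illustration of the sentence-level resampling above, each training sentence can be weighted by its entity content and then sampled with probability proportional to that weight. The weighting function below is one plausible instantiation, not the exact set of functions studied in the paper:

```python
import random

def sentence_weight(tags, smoothing=1.0):
    """Importance of a training sentence from its token-level NER tags:
    sentences with more entity tokens get more weight. An illustrative
    choice; the paper evaluates several weighting schemes.
    """
    entity_tokens = [t for t in tags if t != "O"]
    return smoothing + len(entity_tokens)

def resample(sentences, tag_seqs, k):
    """Sample k training sentences with probability proportional to weight."""
    weights = [sentence_weight(tags) for tags in tag_seqs]
    return random.choices(sentences, weights=weights, k=k)
```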
Prompt-based Zero-shot Relation Classification with Semantic Knowledge Augmentation In relation classification, recognizing unseen (new) relations for which there are no training instances is a challenging task. We propose a prompt-based model with semantic knowledge augmentation (ZS-SKA) to recognize unseen relations under the zero-shot setting. We present a new word-level sentence translation rule and use it to generate augmented instances with unseen relations from instances with seen relations. We design prompts based on an external knowledge graph to integrate semantic knowledge information learned from seen relations. Instead of using the actual label sets in the prompt template, we construct weighted virtual label words. We learn the representations of both seen and unseen relations with augmented instances and prompts. We then calculate the distance between the generated representations using prototypical networks to predict unseen relations. Extensive experiments conducted on three public datasets show that ZS-SKA outperforms state-of-the-art methods under zero-shot scenarios. Our experimental results also demonstrate the effectiveness and robustness of ZS-SKA. PDF 1 2022
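The final prediction step above assigns a query instance to the nearest relation prototype, where each prototype is the mean embedding of that relation's (possibly augmented) instances. A minimal sketch of this prototypical-network step:

```python
import torch

def predict_relation(instance_emb, prototypes):
    """Nearest-prototype classification over seen and unseen relations.

    instance_emb: (d,) embedding of the query instance
    prototypes:   dict mapping relation name -> (d,) mean embedding
    """
    names = list(prototypes)
    protos = torch.stack([prototypes[n] for n in names])     # (R, d)
    dists = torch.cdist(instance_emb.unsqueeze(0), protos)   # (1, R) Euclidean
    return names[dists.argmin().item()]
```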
Cross-lingual Inference with A Chinese Entailment Graph Predicate entailment detection is a crucial task for question-answering from text, where previous work has explored unsupervised learning of entailment graphs from typed open relation triples. In this paper, we present the first pipeline for building Chinese entailment graphs, which involves a novel high-recall open relation extraction (ORE) method and the first Chinese fine-grained entity typing dataset under the FIGER type ontology. Through experiments on the Levy-Holt dataset and a boolean QA task, we verify the strength of our Chinese entailment graph, and reveal the cross-lingual complementarity: on the parallel Levy-Holt dataset, an ensemble of Chinese and English entailment graphs beats both monolinguals, and raises unsupervised SOTA by 4.7 AUC points. PDF 1 2022
Probing the Role of Positional Information in Vision-Language Models In most Vision-Language (VL) models, the understanding of the image structure is enabled by injecting position information (PI) about objects in the image. In our case study of LXMERT, a state-of-the-art VL model, we probe the use of PI in the representation and study its effect on Visual Question Answering. We show that the model is not capable of leveraging PI for the image-text matching task on a challenge set where only position differs. Yet, our probing experiments confirm that PI is indeed present in the representation. We introduce two strategies: (i) Positional Information Pre-training and (ii) Contrastive Learning on PI using Cross-Modality Matching. With these, the model can correctly classify whether an image matches detailed PI statements. In addition to the 2D information from bounding boxes, we introduce the object's depth as a new feature for better object localization in space. Even though we were able to improve the model properties as defined by our probes, this has only a negligible effect on the downstream performance. Our results thus highlight an important issue of multimodal modeling: the mere presence of information detectable by a probing classifier is not a guarantee that the information is available in a cross-modal setup. PDF 1 2022
Accurate Online Posterior Alignments for Principled Lexically-Constrained Decoding Online alignment in machine translation refers to the task of aligning a target word to a source word when the target sequence has only been partially decoded. Good online alignments facilitate important applications such as lexically constrained translation where user-defined dictionaries are used to inject lexical constraints into the translation model. We propose a novel posterior alignment technique that is truly online in its execution and superior in terms of alignment error rates compared to existing methods. Our proposed inference technique jointly considers alignment and token probabilities in a principled manner and can be seamlessly integrated within existing constrained beam-search decoding algorithms. On five language pairs, including two distant language pairs, we achieve consistent drop in alignment error rates. When deployed on seven lexically constrained translation tasks, we achieve significant improvements in BLEU specifically around the constrained positions. PDF 1 2022
On the Limitations of Dataset Balancing: The Lost Battle Against Spurious Correlations Recent work has shown that deep learning models in NLP are highly sensitive to low-level correlations between simple features and specific output labels, leading to overfitting and a lack of generalization. To mitigate this problem, a common practice is to balance datasets by adding new instances or by filtering out "easy" instances (Sakaguchi et al., 2020), culminating in a recent proposal to eliminate single-word correlations altogether (Gardner et al., 2021). In this opinion paper, we identify that despite these efforts, increasingly powerful models keep exploiting ever-smaller spurious correlations, and as a result even balancing all single-word features is insufficient for mitigating all of these correlations. In parallel, a truly balanced dataset may be bound to "throw the baby out with the bathwater" and miss important signal encoding common sense and world knowledge. We highlight several alternatives to dataset balancing, focusing on enhancing datasets with richer contexts, allowing models to abstain and interact with users, and turning from large-scale fine-tuning to zero- or few-shot setups. PDF 1 2022
An Empirical Study of Document-to-document Neural Machine Translation This paper does not aim at introducing a novel method for document NMT. Instead, we head back to the original transformer model with document-level training and hope to answer the following question: Is the capacity of current models strong enough for document-level NMT? Interestingly, we observe that the original transformer with appropriate training techniques can achieve strong results for document translation, even with a length of 2000 words. We evaluate this model and several recent approaches on nine document-level datasets and two sentence-level datasets across six languages. Experiments show that the original Transformer model outperforms sentence-level models and many previous methods in a comprehensive set of metrics, including BLEU, four lexical indices, three newly proposed assistant linguistic indicators, and human evaluation. PDF 1 2022
OkwuGbé: End-to-End Speech Recognition for Fon and Igbo Language is inherent and compulsory for human communication. Whether expressed in a written or spoken way, it ensures understanding between people of the same and different regions. With the growing awareness and effort to include more low-resourced languages in NLP research, African languages have recently been a major subject of research in machine translation, and other text-based areas of NLP. However, there is still very little comparable research in speech recognition for African languages. Interestingly, some of the unique properties of African languages affecting NLP, like their diacritical and tonal complexities, have a major root in their speech, suggesting that careful speech interpretation could provide more intuition on how to deal with the linguistic complexities of African languages for text-based NLP. OkwuGbé is a step towards building speech recognition systems for African low-resourced languages. Using Fon and Igbo as our case study, we conduct a comprehensive linguistic analysis of each language and describe the creation of end-to-end, deep neural network-based speech recognition models for both languages. We present a state-of-the-art ASR model for Fon, as well as benchmark ASR model results for Igbo. Our linguistic analyses (for Fon and Igbo) provide valuable insights and guidance into the creation of speech recognition models for other African low-resourced languages, as well as guide future NLP research for Fon and Igbo. The Fon and Igbo models source code will be publicly available. PDF 1 2022
Document Classification with Word Sense Knowledge The performance of Word Sense Disambiguation (WSD) on a standard evaluation framework has reached an estimated upper bound. However, there is limited research on the application of WSD to relevant NLP tasks due to the high computational cost of supervised systems. In this paper, we propose a partial WSD method with sense category information and incorporate the sense knowledge into a supervised document classification framework. Experimental results show that the proposed method can consistently boost the system's performance on document classification datasets against strong baselines. PDF 1 2022
Multiplicative Position-aware Transformer Models for Language Understanding In order to utilize positional ordering information in transformer models, various flavors of absolute and relative position embeddings have been proposed. However, there is no comprehensive comparison of position embedding methods in the literature. In this paper, we review existing position embedding methods and compare their accuracy on downstream NLP tasks, using our own implementations. We also propose a novel multiplicative embedding method which leads to superior accuracy when compared to existing methods. Finally, we show that our proposed embedding method, served as a drop-in replacement of the default absolute position embedding, can improve the RoBERTa-base and RoBERTa-large models on SQuAD1.1 and SQuAD2.0 datasets. PDF 1 2022
Discourse-Aware Prompt Design for Text Generation Current efficient fine-tuning methods (e.g., adapters, prefix-tuning, etc.) optimize conditional text generation by training a small set of extra parameters of the neural language model while freezing the rest for efficiency. While showing strong performance on some generation tasks, they do not generalize across all generation tasks. In this work, we show that prompt-based conditional text generation can be improved with simple and efficient methods that simulate modeling the discourse structure of human-written text. We introduce two key design choices: First, we show that the higher-level discourse structure of human-written text can be modeled with hierarchical blocking on prefix parameters, which enables spanning different parts of the input and output text and yields more coherent output generations. Second, we propose sparse prefix tuning, introducing attention sparsity on the prefix parameters at different layers of the network and learning sparse transformations of the softmax function. We find that sparse attention enables the prefix-tuning to better control the input contents (salient facts), yielding more efficient tuning of the prefix parameters. Our experiments show that a structured design of prefix parameters yields more coherent, faithful and relevant generations than baseline prefix-tuning on all generation tasks, and performs on par with fine-tuning while being more efficient. PDF 1 2022
Mining Information from Event Structure Relation Graph for Event Argument Extraction Event Argument Extraction is a vital subtask of Event Extraction. Despite the achievements of existing methods, they cannot fully use the event structure information and the rich semantics of the labels, which can provide richer external knowledge for extracting event arguments. To this end, we propose an efficient, end-to-end event argument extraction model based on Event Structure and Question Answering (ESQA-EAE): (1) we model a multi-relational graph of event ontologies to get structure-aware node representations; (2) we encode the questions and event mentions separately to avoid premature fusion of the two features. Experiments on ACE2005 show that ESQA-EAE surpasses the baseline models, which further shows that ESQA-EAE can use the structural information to improve the accuracy of event argument extraction. PDF 1 2022
Minimally-Supervised Relation Induction from Pre-trained Language Model Relation induction is a very practical task in the Natural Language Processing (NLP) area. In practical application scenarios, people want to induce more entity pairs having the same relation from only a few seed entity pairs. Thus, instead of the laborious supervised setting, in this paper we focus on the minimally-supervised setting, where only a couple of seed entity pairs per relation are provided. Although conventional relation induction methods have had some success, their performance depends heavily on the quality of word embeddings. The great success of pre-trained language models, such as BERT, has changed the NLP area a lot, and they are proven to better capture relation knowledge. In this paper, we propose a novel method to induce relations with BERT under the minimally-supervised setting. Specifically, we first extract proper templates from the corpus by using the mask-prediction task in BERT to build pseudo-sentences as the context of entity pairs. Then we use BERT attention weights to better represent the pseudo-sentences. In addition, we use the Integrated Gradients of entity pairs to iteratively select better templates. Finally, with the high-quality pseudo-sentences, we can train a better classifier for relation induction. Experiments on the Google Analogy Test Sets (GATS), the Bigger Analogy Test Set (BATS) and DiffVec demonstrate that our proposed method achieves state-of-the-art performance. PDF 1 2022
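The template-extraction step above can be pictured with BERT's mask-prediction head: mask the connective between an entity pair and keep the top completions as pseudo-sentence candidates. A hedged sketch with Hugging Face transformers; the model name and template pattern are illustrative assumptions, not the paper's artifacts:

```python
from transformers import pipeline

# Fill-mask as a cheap way to turn an entity pair into pseudo-sentences.
fill = pipeline("fill-mask", model="bert-base-uncased")

def pseudo_sentences(head, tail, top_k=5):
    # The "X is the [MASK] of Y" pattern is a hypothetical template.
    prompt = f"{head} is the {fill.tokenizer.mask_token} of {tail}"
    candidates = fill(prompt, top_k=top_k)
    return [c["sequence"] for c in candidates]

print(pseudo_sentences("Paris", "France"))
```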
How Gender Debiasing Affects Internal Model Representations, and Why It Matters Common studies of gender bias in NLP focus either on extrinsic bias measured by model performance on a downstream task or on intrinsic bias found in models' internal representations. However, the relationship between extrinsic and intrinsic bias is relatively unknown. In this work, we illuminate this relationship by measuring both quantities together: we debias a model during downstream fine-tuning, which reduces extrinsic bias, and measure the effect on intrinsic bias, which is operationalized as bias extractability with information-theoretic probing. Through experiments on two tasks and multiple bias metrics, we show that our intrinsic bias metric is a better indicator of debiasing than (a contextual adaptation of) the standard WEAT metric, and can also expose cases of superficial debiasing. Our framework provides a comprehensive perspective on bias in NLP models, which can be applied to deploy NLP systems in a more informed manner. Our code will be made publicly available. PDF 1 2022
Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand Natural language processing researchers have identified limitations of evaluation methodology for generation tasks, with new questions raised about the validity of automatic metrics and of crowdworker judgments. Meanwhile, efforts to improve generation models tend to focus on simple n-gram overlap metrics (e.g., BLEU, ROUGE). We argue that new advances on models and metrics should each more directly benefit and inform the other. We therefore propose a generalization of leaderboards, bidimensional leaderboards (Billboards), that simultaneously tracks progress in language generation tasks and metrics for their evaluation. Unlike conventional unidimensional leaderboards that sort submitted systems by predetermined metrics, a Billboard accepts both generators and evaluation metrics as competing entries. A Billboard automatically creates an ensemble metric that selects and linearly combines a few metrics based on a global analysis across generators. Further, metrics are ranked based on their correlation with human judgments. We release four Billboards for machine translation, summarization, and image captioning. We demonstrate that a linear ensemble of a few diverse metrics sometimes substantially outperforms existing metrics in isolation. Our mixed-effects model analysis shows that most automatic metrics, especially the reference-based ones, overrate machine over human generation, demonstrating the importance of updating metrics as generation models become stronger (and perhaps more similar to humans) in the future. PDF 1 2022
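The ensemble step above can be pictured as fitting a linear combination of automatic metric scores to human judgments. A toy sketch with scikit-learn, omitting the Billboard's metric-selection procedure:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def ensemble_metric(metric_scores, human_scores):
    """Fit a linear combination of automatic metrics to human judgments.

    metric_scores: (n_outputs, n_metrics) array of per-output metric values
    human_scores:  (n_outputs,) array of human judgments
    Returns a callable scoring new outputs; a sketch of the idea only.
    """
    reg = LinearRegression().fit(metric_scores, human_scores)
    return lambda scores: reg.predict(np.atleast_2d(scores))
```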
Transparent Human Evaluation for Image Captioning We establish a rubric-based human evaluation protocol for image captioning models. Our scoring rubrics and their definitions are carefully developed based on machine- and human-generated captions on the MSCOCO dataset. Each caption is evaluated along two main dimensions in a tradeoff (precision and recall) as well as other aspects that measure the text quality (fluency, conciseness, and inclusive language). Our evaluations demonstrate several critical problems of the current evaluation practice. Human-generated captions show substantially higher quality than machine-generated ones, especially in coverage of salient information (i.e., recall), while most automatic metrics say the opposite. Our rubric-based results reveal that CLIPScore, a recent metric that uses image features, better correlates with human judgments than conventional text-only metrics because it is more sensitive to recall. We hope that this work will promote a more transparent evaluation protocol for image captioning and its automatic metrics. PDF 1 2022
IMPLI: Investigating NLI Models' Performance on Figurative Language Natural language inference (NLI) has been widely used as a task to train and evaluate models for language understanding. However, the ability of NLI models to perform inferences requiring understanding of figurative language such as idioms and metaphors remains understudied. We introduce the IMPLI (Idiomatic and Metaphoric Paired Language Inference) dataset, an English dataset consisting of paired sentences spanning idioms and metaphors. We develop novel methods to generate 24k semi-automatic pairs as well as manually creating 1.8k gold pairs. We use IMPLI to evaluate NLI models based on RoBERTa fine-tuned on the MNLI dataset, and show that while they can reliably detect the entailment relationship between figurative phrases and their literal counterparts, they perform poorly on examples where pairs are designed to be non-entailing. This suggests the limits of current NLI models with regard to understanding figurative language, and this dataset serves as a benchmark for future improvements in this direction. PDF 1 2022
Exploiting Topic Information for Joint Intent Detection and Slot Filling Intent detection and slot filling are two important basic tasks in natural language understanding. In practice, an utterance may contain multiple intents, and how to map different intents to the corresponding slots has become a new challenge for recent research. Existing models solve this problem by using neural layers to adaptively capture the related intent information for each slot, but the process of intent selection is not sufficiently transparent. Observing that there is strong consistency between the intents and the topics of a sentence, we exploit topic information for joint intent detection and slot filling via a topic fusion mechanism, where token-level topic information takes the place of intent information to guide slot prediction. In addition, sentence-level topic information is utilized to enhance intent detection. Experimental results show explicit improvements on two public datasets, with a 4.8% improvement in sentence accuracy on MixATIS and a 0.7% improvement in intent detection on MixSNIPS. PDF 1 2022
Global Entity Disambiguation with BERT We propose a global entity disambiguation (ED) model based on BERT. To capture global contextual information for ED, our model treats not only words but also entities as input tokens, and solves the task by sequentially resolving mentions to their referent entities and using resolved entities as inputs. We train the model using a large entity-annotated corpus obtained from Wikipedia. We achieve new state-of-the-art results on five standard ED datasets: AIDA-CoNLL, MSNBC, AQUAINT, ACE2004, and WNED-WIKI. PDF 1 2022
How to Fool Systems and Humans in Visually Grounded Interaction: A Case Study on Adversarial Attacks on Visual Dialog Adversarial attacks change the predictions of deep neural network models while aiming to remain unnoticed by the user. This is a challenge for textual attacks, which target discrete text. In this study, we investigate the robustness of visually grounded dialog models to textual attacks to understand how different input components can mitigate the attack. Our results show that dialog history is important for model robustness: models encoding history are more robust, and when launching an attack on history, model predictions become more uncertain. This is in contrast to prior work, which finds that dialog history is negligible for model performance. We also evaluate how to generate adversarial examples that successfully attack the model but remain undetected by the user. We find that the textual as well as the visual context is important for generating attacks that appear semantically coherent to humans. PDF 1 2022
Fast and Accurate Span-based Semantic Role Labeling as Graph Parsing Currently, BIO-based and tuple-based approaches perform quite well on the span-based semantic role labeling (SRL) task. However, the BIO-based approach usually needs to encode a sentence once for each predicate when predicting its arguments, and the tuple-based approach has to deal with a huge search space of $O(n^3)$, greatly reducing training and inference efficiency. Moreover, both approaches usually consider only local structural information when making predictions. This paper proposes to cast end-to-end span-based SRL as a graph parsing task. Based on a novel graph representation schema, we present a fast and accurate SRL parser on the shoulders of recent work on high-order semantic dependency graph parsing (SDGP). Moreover, we propose a constrained Viterbi procedure to ensure the legality of the output graph. Experiments on the CoNLL05, CoNLL12, and Chinese Proposition Bank 1.0 (CPB1.0) datasets show that our model achieves new state-of-the-art results and can parse over 600 sentences per second. PDF 1 2022
An Exploitation of Heterogeneous Graph Neural Network for Extractive Long Document Summarization Heterogeneous Graph Neural Networks (HeterGNN) have recently been introduced as an emergent approach for many Natural Language Processing (NLP) tasks, enriching the complex information between words and sentences. In this paper, we try to improve the performance of Extractive Document Summarization (EDS) for long-form documents based on the concept of HeterGNN. Specifically, long documents (e.g., scientific papers) are truncated for most neural-based models, which leads to the loss of information about inter-sentence relations. In this regard, we present a new method that exploits the capabilities of HeterGNN and pre-trained language models. In particular, BERT is used to enrich the sentence information in the heterogeneous graph layer. Accordingly, two versions of the proposed method are presented: i) Multi Graph Neural Network (MTGNN-SUM), which combines both a heterogeneous graph layer and a graph attention layer; and ii) HeterGNN with BERT (HeterGNN-BERT-SUM), which integrates BERT directly into the heterogeneous graph structure. Experiments on two benchmark datasets of long documents, PubMed and ArXiv, show that our method outperforms state-of-the-art models in this research field. PDF 1 2022
GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval Dense retrieval approaches can overcome the lexical gap and lead to significantly improved search results. However, they require large amounts of training data which is not available for most domains. As shown in previous work (Thakur et al., 2021b), the performance of dense retrievers severely degrades under a domain shift. This limits the usage of dense retrieval approaches to only a few domains with large training datasets. In this paper, we propose the novel unsupervised domain adaptation method Generative Pseudo Labeling (GPL), which combines a query generator with pseudo labeling from a cross-encoder. On six representative domain-specialized datasets, we find the proposed GPL can outperform an out-of-the-box state-of-the-art dense retrieval approach by up to 8.9 points nDCG@10. GPL requires less (unlabeled) data from the target domain and is more robust in its training than previous methods. We further investigate the role of six recent pre-training methods in the scenario of domain adaptation for retrieval tasks, where only three could yield improved results. The best approach, TSDAE (Wang et al., 2021) can be combined with GPL, yielding another average improvement of 1.0 points nDCG@10 across the six tasks. PDF 1 2022
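The GPL pipeline above pairs a query generator with a cross-encoder pseudo-labeler: synthetic queries are generated for unlabeled target-domain passages, and cross-encoder score margins supervise the dense retriever (e.g., via a MarginMSE loss). A sketch of the two steps; the checkpoint names are assumptions of this sketch, not prescribed by the paper:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from sentence_transformers import CrossEncoder

# Step 1: generate a synthetic query for an unlabeled target-domain passage.
tok = AutoTokenizer.from_pretrained("doc2query/msmarco-t5-base-v1")
gen = AutoModelForSeq2SeqLM.from_pretrained("doc2query/msmarco-t5-base-v1")

def generate_query(passage):
    ids = tok(passage, return_tensors="pt", truncation=True).input_ids
    out = gen.generate(ids, max_length=64, do_sample=True, top_p=0.95)
    return tok.decode(out[0], skip_special_tokens=True)

# Step 2: pseudo-label (query, passage) pairs with a cross-encoder; the
# score margin becomes the training target for the dense retriever.
scorer = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def pseudo_label(query, pos_passage, neg_passage):
    pos, neg = scorer.predict([(query, pos_passage), (query, neg_passage)])
    return pos - neg   # target margin for the bi-encoder student
```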
Extractive Text Summarization with Latent Topics using Heterogeneous Graph Neural Network This paper presents a heterogeneous graph neural network (HeterGNN) model for extractive text summarization (ETS) that uses latent topics to capture the important content of input documents. Topical information has been widely used as global information for sentence selection. However, most recent approaches use neural models that make training more complex and are difficult to extend. In this regard, this study presents a novel graph-based ETS method that adds a new node type of latent topics to HeterGNN for summarization (TopicHeterGraphSum). Specifically, TopicHeterGraphSum includes three types of semantic nodes (topic, word, and sentence) to enrich the cross-sentence relations. Furthermore, an extended version of TopicHeterGraphSum for multi-document extraction is also considered to emphasize the advantage of the proposed method. Experiments on benchmark datasets such as CNN/DailyMail and Multi-News show the promising results of our method compared with state-of-the-art models. PDF 1 2022
Magic Pyramid: Accelerating Inference with Early Exiting and Token Pruning Pre-training and then fine-tuning large language models is commonly used to achieve state-of-the-art performance in natural language processing (NLP) tasks. However, most pre-trained models suffer from low inference speed, and deploying such large models in applications with latency constraints is challenging. In this work, we focus on accelerating inference via conditional computation. To achieve this, we propose a novel idea, Magic Pyramid (MP), to reduce both width-wise and depth-wise computation via token pruning and early exiting for Transformer-based models, particularly BERT. The former saves computation by removing non-salient tokens, while the latter reduces computation by terminating inference before reaching the final layer if the exit condition is met. Our empirical studies demonstrate that, compared to the previous state of the art, MP not only achieves speed-adjustable inference but also surpasses token pruning and early exiting, reducing giga floating point operations (GFLOPs) by up to 70\% with less than a 0.5\% accuracy drop. Token pruning and early exiting express distinctive preferences for sequences of different lengths, yet MP achieves an average 8.06x speedup on two popular text classification tasks regardless of input size. PDF 1 2022
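Depth-wise saving in this style of model typically hinges on an exit test at intermediate classifier heads. A sketch of one common entropy-based criterion; MP's exact condition may differ:

```python
import torch

def should_exit(logits, threshold=0.4):
    """Entropy-based early-exit test at an intermediate classifier head:
    stop forwarding through deeper layers once the head is confident.
    """
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1)
    return entropy.item() < threshold

# inside the layer loop of a BERT-style encoder (names are illustrative):
# for i, layer in enumerate(encoder.layers):
#     hidden = layer(hidden)
#     if should_exit(exit_heads[i](hidden[:, 0])):
#         break   # depth-wise saving; token pruning gives the width-wise saving
```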
Biasly: a machine learning based platform for automatic racial discrimination detection in online texts Detecting hateful, toxic, and otherwise racist or sexist language in user-generated online content has become an increasingly important task in recent years. Indeed, the anonymity, transience, and volume of messages, and the difficulty of moderation, facilitate the diffusion of racist or hateful messages across the Internet. The critical influence of this cyber-racism is no longer limited to social media but also has a significant effect on our society: corporate business operations, users' health, crime, etc. Traditional racist-speech reporting channels have proven inadequate due to the enormous explosion of information, so there is an urgent need for a method to automatically and promptly detect texts containing racial discrimination. In this work, we propose a machine learning-based approach to enable automatic detection of racist text content on the internet. State-of-the-art machine learning models that are able to grasp language structures are adapted in this study. Our main contributions include 1) a large-scale racial discrimination dataset collected from three distinct sources and annotated according to a guideline developed by specialists, 2) a set of machine learning models with various architectures for racial discrimination detection, and 3) web-browser-based software that assists users in debiasing their texts when using the internet. All these resources are made publicly available. PDF 1 2022
Feasibility of BERT Embeddings For Domain-Specific Knowledge Mining Extracting information from large corpora of unstructured text using computational methods presents a challenge. Tshitoyan et al. (2019) demonstrated that unsupervised mathematical word-embeddings produced by a static language model could be utilized to uncover `latent knowledge' within a materials science corpus. The rise of contextualized and massively pre-trained language models like BERT has seen static models surpassed for most NLP tasks. Nevertheless, due to innate architectural and usage differences, BERT requires adaptation for knowledge mining. This study tests the suitability of BERT-derived word embeddings for knowledge mining purposes. It utilizes a variation of the approach described by Bommasani et al. (2020) for creating static-equivalent vectors from multiple contextualized word representations. It is conducted using a biomedical corpus and a biomedical BERT variant, and validated using domain-specific intrinsic benchmarking tools. Novel, layer-wise BERT performance characteristics are demonstrated. A key finding is that layer-wise intrinsic performance differs for nouns and verbs. Performance also varies according to whether a word of interest belongs to BERT's native vocabulary or is built from sub-word representations: BERT-native representations perform best when extracted from earlier layers, while representations requiring multiple tokens perform best when extracted from the middle-to-latter model layers. PDF 1 2022
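The static-equivalent vectors discussed above are obtained by mean-pooling a word's contextual representations over many contexts, in the spirit of Bommasani et al. (2020). A hedged sketch with Hugging Face transformers; the model name is illustrative, and the layer index is left as a free parameter precisely because the study finds the best layer depends on word type:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

def static_vector(word, contexts, layer=8):
    """Static-equivalent embedding: average the word's contextual vectors
    (sub-word pieces mean-pooled) over many contexts, from one chosen layer.
    """
    word_ids = tok(word, add_special_tokens=False).input_ids
    vecs = []
    for ctx in contexts:
        enc = tok(ctx, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**enc).hidden_states[layer][0]     # (seq, dim)
        ids = enc.input_ids[0].tolist()
        for i in range(len(ids) - len(word_ids) + 1):         # locate the word's pieces
            if ids[i:i + len(word_ids)] == word_ids:
                vecs.append(hidden[i:i + len(word_ids)].mean(0))
    return torch.stack(vecs).mean(0)
```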
Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer Machines that can represent and describe environmental soundscapes have practical potential, e.g., for audio tagging and captioning. Prevailing learning paradigms of audio-text connections have been relying on parallel audio-text data, which is, however, scarcely available on the web. We propose VIP-ANT that induces Audio-Text alignment without using any parallel audio-text data. Our key idea is to share the image modality between bi-modal image-text representations and bi-modal image-audio representations; the image modality functions as a pivot and connects audio and text in a tri-modal embedding space implicitly. In a difficult zero-shot setting with no paired audio-text data, our model demonstrates state-of-the-art zero-shot performance on the ESC50 and US8K audio classification tasks, and even surpasses the supervised state of the art for Clotho caption retrieval (with audio queries) by 2.2% R@1. We further investigate cases of minimal audio-text supervision, finding that, e.g., just a few hundred supervised audio-text pairs increase the zero-shot audio classification accuracy by 8% on US8K. However, to match human parity on some zero-shot tasks, our empirical scaling experiments suggest that we would need about $2^{21} \approx 2\text{M}$ supervised audio-caption pairs. Our work opens up new avenues for learning audio-text connections with little to no parallel audio-text data. PDF 1 2022
Simplifying Dataflow Dialogue Design In \citep{andreas2020task-oriented}, a dataflow (DF) based dialogue system was introduced, showing clear advantages over many commonly used current systems. This was accompanied by the release of SMCalFlow, a practically relevant, manually annotated dataset that is more detailed and much larger than any comparable dialogue dataset. Despite these remarkable contributions, the community has not shown further interest in this direction. What are the reasons for this lack of interest? And how can the community be encouraged to engage in research in this direction? One explanation may be the perception that this approach is too complex, regarding both the annotation and the system. This paper argues that this perception is wrong: 1) suggestions for a simplified format for the annotation of the dataset are presented; 2) a basic implementation of the DF execution engine is released, which can serve as a sandbox allowing researchers to easily implement, and experiment with, new DF dialogue designs. The hope is that these contributions will help engage more practitioners in exploring new ideas and designs for DF-based dialogue systems. PDF 1 2022
Lex2Sent: A bagging approach to unsupervised sentiment analysis Unsupervised sentiment analysis is traditionally performed by counting those words in a text that are stored in a sentiment lexicon and then assigning a label depending on the proportion of positive and negative words registered. While these "counting" methods are considered beneficial because they rate a text deterministically, their accuracy decreases when the analyzed texts are short or their vocabulary differs from what the lexicon assumes by default. The model proposed in this paper, called Lex2Sent, is an unsupervised sentiment analysis method that improves on the classification performance of sentiment lexicon methods. For this purpose, a Doc2Vec model is trained to determine the distances between document embeddings and the embeddings of the positive and negative parts of a sentiment lexicon. These distances are then evaluated for multiple executions of Doc2Vec on resampled documents and averaged to perform the classification task. On the three benchmark datasets considered in this paper, the proposed Lex2Sent outperforms every evaluated lexicon, including state-of-the-art lexica such as VADER or the Opinion Lexicon, in terms of accuracy. PDF 1 2022
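As a rough illustration of the pipeline this abstract describes (not the authors' exact implementation), the sketch below embeds documents and the two lexicon halves with gensim's Doc2Vec and averages distance-based scores over resampled training runs; all hyperparameters are placeholders.

```python
# Sketch of the Lex2Sent idea: embed documents and the two lexicon halves
# with Doc2Vec, score by relative cosine distance, and average the scores
# over several resampled training runs (bagging).
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def lex2sent_scores(docs, pos_words, neg_words, runs=5, seed=0):
    """docs: list of token lists; pos_words/neg_words: lexicon halves."""
    rng = np.random.default_rng(seed)
    scores = np.zeros(len(docs))
    for r in range(runs):
        # resample the training documents for this bagging iteration
        sample = rng.choice(len(docs), size=len(docs), replace=True)
        tagged = [TaggedDocument(docs[i], [k]) for k, i in enumerate(sample)]
        model = Doc2Vec(tagged, vector_size=100, epochs=20, min_count=1, seed=r)
        pos_vec = model.infer_vector(pos_words)
        neg_vec = model.infer_vector(neg_words)
        for j, doc in enumerate(docs):
            v = model.infer_vector(doc)
            scores[j] += cosine(v, pos_vec) - cosine(v, neg_vec)
    return scores / runs   # > 0: predicted positive, < 0: predicted negative
```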
Topic Modeling with Topological Data Analysis Recent unsupervised topic modelling approaches that use clustering techniques on word, token or document embeddings can extract coherent topics. However, a common limitation of such approaches is that they reveal nothing about inter-topic relationships which are essential in many real-world application domains. We present an unsupervised topic modelling method which harnesses Topological Data Analysis (TDA) to extract a topological skeleton of the manifold upon which contextualised word embeddings lie. We demonstrate that our approach, which performs on par with a recent baseline, is able to construct a network of coherent topics together with meaningful relationships between them. PDF 1 2022
Deep Continuous Prompt for Contrastive Learning of Sentence Embeddings The performance of sentence representation has been remarkably improved by the framework of contrastive learning. However, recent works still require full fine-tuning, which is quite inefficient for large-scale pre-trained language models. To this end, we present a novel method which freezes the whole language model and only optimizes the deep continuous prompts prepended as a prefix. It not only tunes around 0.1% of the original language model's parameters, but also avoids the cumbersome computation of searching for handcrafted prompts. Experimental results show that our proposed DCPCSE outperforms the state-of-the-art method SimCSE by a large margin. We raise the performance of unsupervised BERT$_{base}$ and supervised RoBERTa$_{large}$ by 2.24 and 1.00 points, respectively. Our code will be released on Github. PDF 1 2022
Adaptive Transfer Learning for Multi-Label Emotion Classification In this study, we explore how data annotated with different taxonomies can be used to improve multi-label emotion classification. We propose a novel transfer learning framework to model the interaction between emotion categories, and introduce an adaptive aggregation mechanism to fuse the information from different taxonomies. The cross-taxonomy emotion interaction allows the source and target tasks to collaborate effectively, resulting in more accurate predictions. The experimental results on the SemEval-2018 dataset show that our approach can effectively boost the performance gain brought by transfer learning, and significantly outperforms existing methods. PDF 1 2022
Night Owls and Majestic Whales: Modeling Metaphor Comprehension as a Rational Speech Act over Vector Representations of Lexical Semantics While they are some of the few computational models that directly capture the pragmatic processes underlying language reasoning, current Rational Speech Act (RSA) models of metaphor are (1) not easily scalable, and (2) not well aligned with contemporary accounts of metaphor comprehension. This research project leverages GloVe word vectors to capture pragmatic language reasoning in metaphoric utterances using an updated RSA framework. This updated framework better aligns model predictions with Relevance Theoretic and Construction Grammatical theories of metaphor semantics. The model yields higher posterior probabilities for attributes of metaphors that humans deem relevant in metaphoric utterances than for erroneous ones in 89% of all cases, validating the methodology for generating prior probabilities for an RSA framework. When presented with biased priors, as listeners are in many naturalistic conversations, the model accurately matches human judgements of the most topical attribute of a topic/target indicated by a metaphoric utterance 90% of the time. PDF 1 2022
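The RSA machinery at the core of this line of work is compact enough to show directly. Below is a generic numpy sketch of the literal-listener / pragmatic-speaker / pragmatic-listener recursion; the semantics matrix, prior, and rationality parameter are toy values standing in for the GloVe-derived quantities the abstract mentions.

```python
# Minimal Rational Speech Acts (RSA) sketch: a pragmatic listener infers
# which attribute a speaker means by an utterance via Bayesian reasoning
# over a literal listener.
import numpy as np

def normalize(m, axis):
    return m / m.sum(axis=axis, keepdims=True)

# rows: utterances ("whale", "owl"); cols: attributes (majestic, large,
# nocturnal); entries are literal fit scores (purely illustrative)
semantics = np.array([[0.9, 0.8, 0.1],
                      [0.2, 0.1, 0.9]])
prior = np.array([0.4, 0.4, 0.2])   # listener's prior over attributes
alpha = 1.5                         # speaker rationality

L0 = normalize(semantics * prior, axis=1)   # literal listener P(attr | utt)
S1 = normalize(L0 ** alpha, axis=0)         # pragmatic speaker P(utt | attr)
L1 = normalize(S1 * prior, axis=1)          # pragmatic listener P(attr | utt)
print(L1[0])   # posterior over attributes given the "whale" utterance
```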
QubitE: Qubit Embedding for Knowledge Graph Completion Knowledge graph embeddings (KGEs) learn low-dimensional representations of entities and relations to predict missing facts based on existing ones. Quantum-based KGEs utilize variational quantum circuits for link prediction and score triples via the probability distribution obtained by measuring the qubit states. But current quantum-based KGEs either lose their quantum advantages during optimization, or require a large number of parameters to store quantum states, leading to overfitting and low performance. Besides, they lack the theoretical analysis that is essential for understanding model performance. To address the performance issue and bridge the theory gap, we propose QubitE, which is lightweight and suitable for the link prediction task. In addition, our model preserves quantum advantages that enable quantum logical computing based on semantics. Furthermore, we prove that (1) QubitE is fully expressive; (2) QubitE can infer various relation patterns including symmetry/antisymmetry, inversion, and commutative/non-commutative composition; (3) QubitE subsumes several existing approaches, e.g., DistMult, pRotatE, RotatE, TransE and ComplEx; (4) QubitE has linear space complexity and linear time complexity. Experiments on multiple benchmark knowledge graphs demonstrate that QubitE achieves results comparable to state-of-the-art classical models. PDF 1 2022
Plot Writing From Pre-Trained Language Models Pre-trained language models (PLMs) fail to generate long-form narrative text because they do not consider global structure. As a result, the generated texts are often incoherent, repetitive, or lacking in content. Recent work in story generation reintroduced explicit content planning in the form of prompts, keywords, or semantic frames. Trained on large parallel corpora, these models can generate more logical event sequences and thus more contentful stories. However, these intermediate representations are often not in natural language and cannot be utilized by PLMs without fine-tuning. We propose generating story plots using off-the-shelf PLMs while maintaining the benefit of content planning to generate cohesive and contentful stories. Our proposed method, ScratchPlot, first prompts a PLM to compose a content plan. Then, we generate the story's body and ending conditioned on the content plan. Furthermore, we take a generate-and-rank approach, using additional PLMs to rank the generated (story, ending) pairs. We benchmark our method against various baselines and achieve superior results in both human and automatic evaluation. PDF 1 2022
CoMPM: Context Modeling with Speaker's Pre-trained Memory Tracking for Emotion Recognition in Conversation As the use of interactive machines grows, the task of Emotion Recognition in Conversation (ERC) has become more important. If machine-generated sentences reflect emotion, more human-like, sympathetic conversations become possible. Since emotion recognition in conversation is inaccurate if previous utterances are not taken into account, many studies incorporate the dialogue context to improve performance. Many recent approaches show performance improvements by combining knowledge into modules learned from external structured data. However, structured data is difficult to access in non-English languages, making it difficult to extend these approaches to other languages. Therefore, we extract pre-trained memory using a pre-trained language model as an extractor of external knowledge. We introduce CoMPM, which combines the speaker's pre-trained memory with the context model, and find that the pre-trained memory significantly improves the performance of the context model. CoMPM achieves the first- or second-best performance on all datasets and is state-of-the-art among systems that do not leverage structured data. In addition, our method shows that it can be extended to other languages because structured knowledge is not required, unlike in previous methods. PDF 1 2022
MAML-CL: Edited Model-Agnostic Meta-Learning for Continual Learning Recent continual learning (CL) models use meta learning to enable efficient cross-domain knowledge transfer and thus enhance sparse experience rehearsal (also called episodic memory replay). However, knowledge transfer can be constrained by its episodic occurrence, especially when the training sets are small and/or the replay frequency is low (usually 1%). This paper studies the feasibility of solely using meta learning to address CL problems. In particular, we devise an optimisation-based meta learning framework for CL in accordance with MAML, where query samples are edited for generalisation of learned knowledge. We conduct extensive experiments on text classification in a low-resource CL setup, where we downsize the training set to 10% of its original size. The experimental results demonstrate the superiority of our method in terms of stability, fast adaptation, memory efficiency and knowledge retention across various domains. PDF 1 2022
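For readers unfamiliar with the MAML machinery this framework builds on, here is a schematic PyTorch sketch of the inner/outer loop. The `model.loss(batch, params=...)` functional interface is an assumption made for brevity, and the paper's query-editing step is elided.

```python
# Schematic MAML step: adapt per task on the support set (inner loop),
# then update the meta-parameters from the query-set loss (outer loop).
import torch

def maml_step(model, tasks, meta_opt, inner_lr=1e-2, inner_steps=1):
    meta_opt.zero_grad()
    for support, query in tasks:   # each task: (support set, query set)
        # "fast" weights start as differentiable clones of the meta-weights
        fast = {n: p.clone() for n, p in model.named_parameters()}
        for _ in range(inner_steps):
            loss = model.loss(support, params=fast)        # assumed interface
            grads = torch.autograd.grad(loss, list(fast.values()),
                                        create_graph=True)  # second-order MAML
            fast = {n: p - inner_lr * g
                    for (n, p), g in zip(fast.items(), grads)}
        # outer-loop loss on the query set, through the adapted weights
        model.loss(query, params=fast).backward()
    meta_opt.step()
```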
Heterogeneous-Graph Reasoning and Fine-Grained Aggregation for Fact Checking Fact checking is a challenging task that requires corresponding evidence to verify a claim through reasoning. Previous studies generally i) construct the graph by treating each evidence-claim pair as a node, a simple approach that fails to exploit their implicit interactions, or build a fully-connected graph among the claim and evidence pieces, where the entailment relationship between claim and evidence is treated the same as the semantic relationship among evidence pieces; or ii) aggregate evidence equally without considering the different stances each piece takes towards the verification of the fact. To address the above issues, we propose a novel heterogeneous-graph reasoning and fine-grained aggregation model with the following two modules: 1) a heterogeneous graph attention network module to distinguish different types of relationships within the constructed graph; 2) a fine-grained aggregation module which learns, in detail, the implicit stance of evidence towards the prediction result. Extensive experiments on the benchmark dataset demonstrate that our proposed model achieves much better performance than state-of-the-art methods. PDF 2 2022
Intent Classification by the use of Automatically Generated Knowledge Graphs Intent classification is an essential task for goal-oriented dialogue systems, in order to automatically identify customers' goals. Although intent classification performs well in general settings, domain-specific user goals can still present a challenge for this task. To address this challenge, we automatically generate knowledge graphs for targeted datasets to capture domain-specific knowledge and leverage embeddings trained on these knowledge graphs for the intent classification task. We compare our results with state-of-the-art pre-trained sentence embeddings. Our evaluation on three datasets shows improvements on the intent classification task in terms of precision. PDF 2 2022
Distilling Task-specific Logical Rules from Large Pre-trained Models Logical rules, both transferable and explainable, are widely used as weakly supervised signals for many downstream tasks such as named entity tagging. To reduce the human effort of writing rules, previous researchers adopt an iterative approach to automatically learn logical rules from several seed rules. However, obtaining more seed rules can only be accomplished through extra human annotation at heavy cost. Limited by the size and quality of the seed rules, the performance of previous systems is bounded. In this paper, we develop a novel framework, STREAM, to distill task-specific logical rules from large pre-trained models. Specifically, we borrow recent prompt-based language models as the knowledge expert to yield initial seed rules, and, based on the high-quality instance pool formed in the process, which plays an intermediary role, we keep teaching the expert to fit our task and learning task-specific logical rules. Experiments on three public benchmarks demonstrate the effectiveness of our proposed framework. Without any manual annotation, our system gains significant improvements over previous state-of-the-art methods. PDF 2 2022
Average Is Not Enough: Caveats of Multilingual Evaluation This paper discusses the problem of multilingual evaluation. Using simple statistics, such as average language performance, might inject linguistic biases in favor of dominant language families into the evaluation methodology. We show that this bias can be found in published works and demonstrate that linguistically-motivated result visualization can detect it. PDF 2 2022
DistilCSE: Effective Knowledge Distillation For Contrastive Sentence Embeddings Large contrastive learning models, e.g., Sentence-T5, have recently been proposed to learn more powerful sentence embeddings. Though effective, such large models are hard to serve online due to computational resource or latency constraints. Knowledge distillation can compress a large "teacher" model into a small "student" model, but it generally suffers from a performance decrease. To tackle that, we propose an effective knowledge distillation framework for contrastive sentence embeddings, termed DistilCSE. It first utilizes knowledge distillation to transfer the capability of a large contrastive learning model to a small student model on a large amount of unlabeled data, and then finetunes the student model with contrastive learning on limited labeled data. We further propose Contrastive Knowledge Distillation (CKD) to enhance the consistency of the training objectives among teacher model training, knowledge distillation, and student model finetuning, which can improve performance in a manner similar to prompt learning. Extensive experiments on seven semantic textual similarity benchmarks show that student models trained with the proposed DistilCSE and CKD suffer little or even no performance decrease and consistently outperform the corresponding counterparts of the same parameter size. Remarkably, our 110M student model can even outperform the latest state-of-the-art (SOTA) model, i.e., Sentence-T5 (11B), with only 1% of the parameters. PDF 2 2022
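A rough sketch of the two training objectives this framework combines: an embedding-matching distillation loss on unlabeled data, followed by an in-batch contrastive loss for finetuning. The loss forms are illustrative assumptions; the paper's CKD objective is not reproduced here.

```python
# Stage 1: match the teacher's sentence-embedding space on unlabeled text.
# Stage 2: finetune the student contrastively on limited labeled pairs.
import torch
import torch.nn.functional as F

def distill_loss(student_emb, teacher_emb):
    # embedding-matching distillation (MSE; cosine distance is also common)
    return F.mse_loss(student_emb, teacher_emb)

def contrastive_loss(emb_a, emb_b, temperature=0.05):
    # in-batch negatives: positives sit on the diagonal of the sim matrix
    sims = F.cosine_similarity(emb_a.unsqueeze(1), emb_b.unsqueeze(0), dim=-1)
    labels = torch.arange(emb_a.size(0))
    return F.cross_entropy(sims / temperature, labels)
```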
Dodging the Data Bottleneck: Automatic Subtitling with Automatically Segmented ST Corpora Speech translation for subtitling (SubST) is the task of automatically translating speech data into well-formed subtitles by inserting subtitle breaks compliant with specific display guidelines. Similar to speech translation (ST), model training requires parallel data comprising audio inputs paired with their textual translations. In SubST, however, the text also has to be annotated with subtitle breaks. So far, this requirement has represented a bottleneck for system development, as confirmed by the dearth of publicly available SubST corpora. To fill this gap, we propose a method to convert existing ST corpora into SubST resources without human intervention. We build a segmenter model that automatically segments texts into proper subtitles by exploiting audio and text in a multimodal fashion, achieving high segmentation quality in zero-shot conditions. Comparative experiments with SubST systems trained on manual and automatic segmentations, respectively, result in similar performance, showing the effectiveness of our approach. PDF 2 2022
ScandEval: A Benchmark for Scandinavian Natural Language Understanding This paper introduces a Scandinavian benchmarking platform, ScandEval, which can benchmark any pretrained or finetuned model on 29 datasets in Danish, Norwegian, Swedish, Icelandic and Faroese, two of which are new. We develop and release a Python package and Command-Line Interface (CLI), scandeval, which can benchmark any model that has been uploaded to the HuggingFace Hub, with reproducible results. Using this package, we benchmark over 60 Scandinavian or multilingual models and present the results in an interactive online leaderboard. The benchmarking results show that the investment in language technology in Norway and Sweden has led to language models that outperform multilingual models such as XLM-RoBERTa and LaBSE. We release the source code for both the package and the leaderboard. PDF 2 2022
LongtoNotes: OntoNotes with Longer Coreference Chains OntoNotes has served as the most important benchmark for coreference resolution. However, for ease of annotation, several long documents in OntoNotes were split into smaller parts. In this work, we build a corpus of coreference-annotated documents of significantly longer length than what is currently available. We do so by providing an accurate, manually-curated merging of annotations from documents that were split into multiple parts in the original OntoNotes annotation process. The resulting corpus, which we call LongtoNotes, contains documents in multiple domains of the English language with varying lengths, the longest of which are up to 8x the length of documents in OntoNotes and 2x those in LitBank. We evaluate state-of-the-art neural coreference systems on this new corpus, analyze the relationships between model architectures/hyperparameters and document length on the performance and efficiency of the models, and demonstrate areas for improvement in long-document coreference modelling revealed by our new corpus. PDF 2 2022
Self-Supervised Losses for One-Class Textual Anomaly Detection Current deep learning methods for anomaly detection in text rely on supervisory signals in inliers that may be unobtainable or bespoke architectures that are difficult to tune. We study a simpler alternative: fine-tuning Transformers on the inlier data with self-supervised objectives and using the losses as an anomaly score. Overall, the self-supervision approach outperforms other methods under various anomaly detection scenarios, improving the AUROC score on semantic anomalies by 11.6% and on syntactic anomalies by 22.8% on average. Additionally, the optimal objective and resultant learnt representation depend on the type of downstream anomaly. The separability of anomalies and inliers signals that a representation is more effective for detecting semantic anomalies, whilst the presence of narrow feature directions signals a representation that is effective for detecting syntactic anomalies. PDF 2 2022
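The scoring recipe described above ("using the losses as an anomaly score") can be illustrated with a masked-LM loss. A sketch assuming HuggingFace transformers; the model name and random-masking scheme are placeholders, and the fine-tuning-on-inliers step is omitted.

```python
# Score a text by its masked-token loss under an MLM (higher = more
# anomalous once the model has been finetuned on inlier data).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased").eval()

def anomaly_score(text, mask_prob=0.15, seed=0):
    torch.manual_seed(seed)
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    labels = enc["input_ids"].clone()
    # randomly choose positions to mask, sparing the special tokens
    mask = torch.rand(labels.shape) < mask_prob
    mask &= labels != tokenizer.cls_token_id
    mask &= labels != tokenizer.sep_token_id
    if not mask.any():
        mask[0, 1] = True        # ensure at least one masked position
    inputs = labels.clone()
    inputs[mask] = tokenizer.mask_token_id
    labels[~mask] = -100         # only score the masked positions
    with torch.no_grad():
        out = model(input_ids=inputs, attention_mask=enc["attention_mask"],
                    labels=labels)
    return out.loss.item()
```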
Counterfactual Debiasing for Fact Verification Fact verification aims to automatically judge the veracity of a claim according to several pieces of evidence. Due to the manual construction of datasets, spurious correlations between claim patterns and veracity (i.e., biases) inevitably exist. Recent studies show that models usually learn such biases instead of understanding the semantic relationship between the claim and evidence. Existing debiasing work can be roughly divided into data-augmentation-based and weight-regularization-based pipelines, where the former is inflexible and the latter relies on uncertain output during the training stage. Unlike previous work, we propose a novel method from a counterfactual view, namely CLEVER, which is augmentation-free and mitigates biases at the inference stage. Specifically, we train a claim-evidence fusion model and a claim-only model independently. Then, we obtain the final prediction by subtracting the output of the claim-only model from the output of the claim-evidence fusion model, which counteracts the biases present in both outputs so that the unbiased part is highlighted. Comprehensive experiments on several datasets demonstrate the effectiveness of CLEVER. PDF 2 2022
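The inference rule described in this abstract reduces to a one-line tensor operation. A minimal sketch; the `bias_weight` knob is our illustrative addition, not something the abstract specifies.

```python
# CLEVER-style debiased inference: subtract the claim-only model's output
# from the fusion model's output to cancel claim-pattern bias.
import torch

def clever_predict(fusion_logits: torch.Tensor,
                   claim_only_logits: torch.Tensor,
                   bias_weight: float = 1.0) -> torch.Tensor:
    # both inputs: (batch, num_classes)
    debiased = fusion_logits - bias_weight * claim_only_logits
    return debiased.argmax(dim=-1)
```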
Persian Natural Language Inference: A Meta-learning approach Incorporating information from other languages can improve the results of tasks in low-resource languages. A powerful method of building functional natural language processing systems for low-resource languages is to combine multilingual pre-trained representations with cross-lingual transfer learning. In general, however, shared representations are learned separately, either across tasks or across languages. This paper proposes a meta-learning approach for natural language inference in Persian. The meta-learner alternately uses information from a different task (such as QA in Persian) or from another language (such as natural language inference in English). We also investigate the role of a task augmentation strategy for forming additional high-quality tasks. We evaluate the proposed method using four languages and an auxiliary task. The proposed model consistently outperforms the baseline approach, improving accuracy by roughly six percent. We also examine the effect of finding appropriate initial parameters using zero-shot evaluation and CCA similarity. PDF 2 2022
Few-shot Query-oriented Summarization with Prefix-merging Query-oriented summarization has been considered an important extension of text summarization. It aims to generate a concise highlight for a given query. Unlike text summarization, query-oriented summarization has long been plagued by a lack of high-quality large-scale datasets. In this paper, we investigate whether we can integrate and transfer the knowledge of text summarization and question answering to assist few-shot learning in query-oriented summarization. Meanwhile, we draw inspiration from prefix-tuning, whose prefix is considered to contain task-specific knowledge. Here, we propose prefix-merging, a prefix-based pretraining strategy for few-shot learning in natural language generation tasks. It allows us to control and integrate task knowledge across multiple basic tasks through a proper prefix design and apply the merged prefix to the downstream task. With only a small number of trainable parameters, prefix-merging outperforms fine-tuning on the query-oriented summarization task. We further discuss the influence of different prefix designs and propose a visualized explanation of how prefix-merging works. PDF 2 2022
Modeling Function Relation for Automatic Code Comment Generation Comments are essential for software maintenance and comprehension. However, comments are often missing, mismatched or outdated in software projects. This paper presents a novel approach to automatically generate descriptive comments for methods and functions. Our work targets a practical problem where hand-written comments are only available for a few methods in a source file, a common problem seen in real-world software development. We develop a novel learning framework to model the code relation among methods based on graph neural networks. Our model learns to utilize the partially contextual information extracted from the existing comments to generate missing comments for all methods in a source file. We evaluate our approach by applying it to Java programs. Experimental results show that our approach outperforms prior methods by a large margin by generating comments that are judged to be helpful by human evaluators and of a higher quality measured by quantified metrics. PDF 2 2022
Multi-task Citation Content Analysis for Clinical Research Publications Citations are essential building blocks in scientific knowledge production. Citation content analysis using NLP methods has been proposed to benefit tasks such as scientific paper summarization and research impact assessment. In this paper, we propose a new task, citation subject matter extraction, and augment an existing citation sentiment corpus with citation context and subject matter annotations to enable a finer-grained study of citation content. We propose a BERT-based multi-task model to jointly address these three classification tasks (i.e., context, subject matter, and sentiment) by enabling knowledge transfer across tasks. Our experimental results show the effectiveness of our joint model over single task models. We also obtain state-of-the-art results for the citation sentiment classification task and demonstrate that isolating the subject matter significantly improves this task. Our error analysis suggests improving annotation consistency and using external knowledge sources could further improve performance. We will make our code, data, and annotation guidelines publicly available upon acceptance. PDF 2 2022
3M: Multi-document Summarization Considering Main and Minor Relationship Multi-document summarization (MDS) is an important branch of information aggregation. Compared with single-document summarization (SDS), MDS faces three major challenges: (1) MDS involves a search space too large for attention to capture; (2) the input of MDS contains a lot of redundant information and more complex logical relationships; (3) differing opinions across documents introduce contradictions. To address these three challenges, we combine the Transformer and Maximal Marginal Relevance (MMR) to design the Multi-document summarization considering Main and Minor relationship (3M) model. In this model, we take one document as the main body and use the information in the other documents as an addition that modifies the generation of the summary. We can therefore reduce the search space and ignore the redundancy in the minor documents. Empirical results on the Multi-News and DUC 2004 datasets show that 3M brings substantial improvements over several strong baselines, and manual evaluation shows that the generated abstracts are fluent and better express the content of the main document. In addition, by selecting different main documents, 3M can generate multiple abstracts with different styles for one set of documents. PDF 2 2022
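MMR, one of the two ingredients of 3M, is a classic greedy criterion worth seeing concretely. A generic numpy-style sketch, not the 3M model itself; the similarity inputs are assumed precomputed.

```python
# Maximal Marginal Relevance: greedily pick sentences that are relevant
# to the main document but not redundant with sentences already chosen.
def mmr(doc_sims, sent_sims, k=3, lam=0.7):
    """doc_sims: length-n list, similarity of each sentence to the main
    document. sent_sims: n x n pairwise sentence similarities."""
    selected, candidates = [], list(range(len(doc_sims)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max((sent_sims[i][j] for j in selected), default=0.0)
            return lam * doc_sims[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected   # indices in the order they were chosen
```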
Analyzing Hate Speech Data along Racial, Gender and Intersectional Axes To tackle the rising phenomenon of hate speech, efforts have been made towards data curation and analysis. When it comes to the analysis of bias, previous work has focused predominantly on race. In our work, we further investigate bias in hate speech datasets along racial, gender and intersectional axes. We identify strong bias against AAE, male, and AAE+Male tweets, which are annotated as disproportionately more hateful and offensive than those from other demographics. We provide evidence that BERT-based models propagate this bias and show that balancing the training data for these protected attributes can lead to fairer models with regard to gender, but not race. PDF 2 2022
Progressive Sentiment Analysis for Code-Switched Text Data Multilingual transformer language models have recently attracted much attention from researchers and are used in cross-lingual transfer learning for many NLP tasks such as text classification and named entity recognition. However, similar methods for transfer learning from monolingual text to code-switched text have not been extensively explored, mainly due to the following challenges: (1) a code-switched corpus, unlike a monolingual corpus, consists of more than one language, and existing methods cannot be applied efficiently; (2) a code-switched corpus is usually made up of resource-rich and low-resource languages, and upon using multilingual pre-trained language models, the final model might be biased towards the resource-rich language. In this paper, we focus on code-switched sentiment analysis where we have a labelled resource-rich language dataset and unlabelled code-switched data. We propose a framework that takes the distinction between resource-rich and low-resource languages into account. Instead of training on the entire code-switched corpus at once, we create buckets based on the fraction of words in the resource-rich language and progressively train from resource-rich language dominated samples to low-resource language dominated samples. Extensive experiments across multiple language pairs demonstrate that progressive training helps low-resource language dominated samples. PDF 2 2022
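The bucketing-and-curriculum step just described is easy to make concrete. A minimal sketch, assuming token lists and a vocabulary set for the resource-rich language; the bucket boundaries are arbitrary placeholders.

```python
# Bucket code-switched samples by their fraction of resource-rich-language
# words, then train from high-fraction buckets down to low-fraction ones.
def fraction_rich(tokens, rich_vocab):
    return sum(t in rich_vocab for t in tokens) / max(len(tokens), 1)

def make_buckets(samples, rich_vocab, edges=(0.75, 0.5, 0.25, 0.0)):
    """samples: list of token lists. Returns buckets ordered from most
    resource-rich-dominated to least; train on them sequentially."""
    buckets = [[] for _ in edges]
    for toks in samples:
        f = fraction_rich(toks, rich_vocab)
        for b, lo in enumerate(edges):
            if f >= lo:
                buckets[b].append(toks)
                break
    return buckets
```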
PromptASTE: Prompting a Dataset from Pre-trained Language Models for Unsupervised Aspect Sentiment Triplet Extraction Aspect sentiment triplet extraction (ASTE) is a sentiment analysis task that aims to extract a view's sentiment polarity, expression, and target (aspect). This paper proposes the first unsupervised method for aspect sentiment triplet extraction. Building on previous findings that pre-trained language models (PLMs) are aware of sentiment, we further leverage the masked language model (MLM) to prompt an ASTE dataset with automatically annotated labels. Our method, PromptASTE, fills in a series of prompts to generate a dataset of related aspects and views. The dataset is then used to train an ASTE model for prediction. Training on PromptASTE results in models with an outstanding capability to discern sentiment polarities and targeted aspects. Our model sets a first, strong baseline for unsupervised ASTE. PDF 2 2022
Whodunit? Learning to Contrast for Authorship Attribution Authorship attribution is the task of identifying the author of a given text. Most existing approaches use manually designed features that capture a dataset's content and style. However, this dataset-dependent approach yields inconsistent performance. Thus, we propose to fine-tune pretrained language representations using a combination of contrastive learning and supervised learning (Contra-X). We show that Contra-X advances the state-of-the-art on multiple human and machine authorship attribution benchmarks, enabling improvements of up to 6.8%. We also show Contra-X to be consistently superior to cross-entropy fine-tuning across different data regimes. Crucially, we present qualitative and quantitative analyses of these improvements. Our learned representations form highly separable clusters for different authors. However, we find that contrastive learning improves overall accuracy at the cost of sacrificing performance for some authors. Resolving this tension will be an important direction for future work. To the best of our knowledge, we are the first to analyze the effect of combining contrastive learning with cross-entropy fine-tuning for authorship attribution. PDF 2 2022
Neural Quadratic Assignment Programming for Sentence Matching Studies have shown that both syntactic structure and word semantics are important for sentence matching. Existing studies usually model syntactic structures and word semantics separately, resulting in matching models that overlook the relations and dependencies between syntactic structures and semantic meanings. How to jointly model syntactic and semantic information has become a challenging problem in sentence matching. To address the issue, we formalize sentence matching as a problem of assigning the words of one sentence to those of another sentence, with the costs determined by the differences between the corresponding syntactic structures and word embedding similarities. The proposed method, referred to as neural quadratic assignment programming for sentence matching (NQAP-SM), represents the syntactic structures and semantic matching signals as an association graph. Solving the relaxed quadratic assignment programming (QAP) problem on this association graph yields the final matching score. Experimental results on three public datasets demonstrated that NQAP-SM outperforms state-of-the-art baselines in an effective and efficient way. The analysis also showed that NQAP-SM matches sentences in an interpretable way. PDF 2 2022
OrderSum: Reading Order-Aware Unsupervised Opinion Summarization Opinion summarization aims to create a concise summary reflecting the subjective information conveyed by multiple user reviews about the same product. To avoid the high expense of curating golden summaries for training, many unsupervised methods have recently been developed. Most state-of-the-art methods utilize the extracted segments, following their salience ranking, as pseudo labels to train a summary generator. However, the extracted salient segments can be verbose, and their reading order has long been overlooked. In this paper, we propose a reading order-aware framework, OrderSum, aiming to generate concise and logical summaries. Specifically, we first formulate the segment ordering problem in pseudo labels as path-choosing and solve it using reinforcement learning. Moreover, to generate a more concise summary, we propose to encourage the generative model to skip useless words based on the token link information derived from concise sentences, which can be collected easily from massive raw reviews by considering the ratio of sentiment/aspect words. Extensive experiments demonstrate that OrderSum benefits from the awareness of reading order and the conciseness modeling, thus being more effective than existing unsupervised methods and achieving state-of-the-art performance. PDF 2 2022
Making Document-Level Information Extraction Right for the Right Reasons Document-level models for information extraction tasks like slot-filling are flexible: they can be applied to settings where information is not necessarily localized in a single sentence. For example, key features of a diagnosis in a radiology report may not be explicitly stated in one place, but nevertheless can be inferred from parts of the report's text. However, these models can easily learn spurious correlations between labels and irrelevant information. This work studies how to ensure that these models make correct inferences from complex text and make those inferences in an auditable way: beyond just being right, are these models "right for the right reasons?" We experiment with post-hoc evidence extraction in a predict-select-verify framework using feature attribution techniques. We show that regularization with small amounts of evidence supervision during training can substantially improve the quality of extracted evidence. We evaluate on two domains: a small-scale labeled dataset of brain MRI reports and a large-scale modified version of DocRED (Yao et al., 2019) and show that models' plausibility can be improved with no loss in accuracy. PDF 2 2022
Multimodal Audio-textual Architecture for Robust Spoken Language Understanding Tandem spoken language understanding (SLU) systems suffer from so-called automatic speech recognition (ASR) error propagation. In this work, we investigate how this problem impacts state-of-the-art NLU models such as BERT (Bidirectional Encoder Representations from Transformers) and RoBERTa. Moreover, a multimodal language understanding (MLU) system is proposed to mitigate SLU performance degradation due to errors present in ASR transcripts. Our solution combines an encoder network to embed audio signals with the state-of-the-art BERT to process text transcripts. A fusion layer is also used to fuse audio and text embeddings. Two fusion strategies are explored: a pooled average of the probabilities from each modality, and a similar scheme with a fine-tuning step. The first approach proved to be the optimal solution for extracting semantic information when the text input is severely corrupted, whereas the second approach was slightly better when the quality of the ASR transcripts was higher. We found that as the quality of ASR transcripts decayed, the performance of BERT and RoBERTa also decayed, compromising overall SLU performance, whereas the proposed MLU proved more robust to poor-quality ASR transcripts. Our model is evaluated on five tasks from three SLU datasets with different complexity levels, and robustness is tested using ASR outputs from three ASR engines. Results show that the proposed approach effectively mitigates the ASR error propagation problem across all datasets. PDF 2 2022
Repetition Facilitates Processing: The Processing Advantage of Construction Repetition in Dialogue Repetitions occur frequently in dialogue. This study focuses on the repetition of lexicalised constructions (i.e., recurring multi-word units) in English open domain spoken dialogues. We hypothesise that construction repetition is an efficient communication strategy that reduces processing effort, and we make three predictions based on this hypothesis. We conduct a quantitative analysis, measuring reduction in processing effort via two surprisal-based measures and estimating surprisal with an adaptive neural language model. Our three predictions are confirmed: (i) repetitions facilitate the processing of constructions and of their linguistic context; (ii) facilitating effects are higher when repetitions accumulate; and (iii) they are lower when repetitions are less locally distributed. Our findings suggest that human-like patterns of repetition can be learned implicitly by utterance generation models equipped with psycholinguistically motivated learning objectives and adaptation mechanisms. PDF 2 2022
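Surprisal, the processing-effort proxy used here, is straightforward to compute with any autoregressive LM. A minimal sketch with HuggingFace transformers and GPT-2; the study's adaptive LM and exact measures are not reproduced, this just shows the underlying quantity, -log2 p(token | context).

```python
# Per-token surprisal under an autoregressive language model.
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def surprisals(text):
    """Return (token, surprisal-in-bits) for each token after the first."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(ids).logits                 # (1, seq_len, vocab)
    logp = torch.log_softmax(logits[0, :-1], dim=-1)
    tok_logp = logp.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
    tokens = tokenizer.convert_ids_to_tokens(ids[0, 1:])
    return [(t, -lp / math.log(2)) for t, lp in zip(tokens, tok_logp.tolist())]
```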
SlovakBERT: Slovak Masked Language Model We introduce a new Slovak masked language model called SlovakBERT. To the best of our knowledge, this is the first paper discussing Transformer-based language models for Slovak. We evaluate our model on several NLP tasks and achieve state-of-the-art results. This evaluation is likewise the first attempt to establish a benchmark for Slovak language models. We publish the masked language model, as well as the fine-tuned models for part-of-speech tagging, sentiment analysis and semantic textual similarity. PDF 2 2022
Hate Speech and Counter Speech Detection: Conversational Context Does Matter Hate speech is plaguing cyberspace along with user-generated content. This paper investigates the role of conversational context in the annotation and detection of online hate and counter speech, where context is defined as the preceding comment in a conversation thread. We created a context-aware dataset for a 3-way classification task on Reddit comments: hate speech, counter speech, or neutral. Our analyses indicate that context is critical to identify hate and counter speech: human judgments change for most comments depending on whether we show annotators the context. A linguistic analysis draws insights into the language people use to express hate and counter speech. Experimental results show that neural networks obtain significantly better results if context is taken into account. We also present qualitative error analyses shedding light on (a) when and why context is beneficial and (b) the remaining errors made by our best model when context is taken into account. PDF 2 2022
Controllable Multi-attribute Dialog Generation with PALs and Grounding Knowledge Today, neural language models are commonly employed to generate natural-sounding responses in dialogue systems. The main issue limiting wide adoption of neural generation is the poor predictability of responses in terms of content, as well as of dialogue attributes such as dialog acts and sentiment. In this paper we propose a method based on projected attention layers (PALs) for controllable multi-attribute knowledge-grounded dialogue generation. We compared a number of methods for training and blending representations produced by PALs combined with a DialoGPT base model. The results of our experiments demonstrate that separate pre-training of PAL branches for different attributes, followed by transfer and fine-tuning of a dense blending layer, gives the highest control accuracy over a generated response with fewer trainable parameters per attribute. Furthermore, we applied our approach for controllable multi-attribute generation with grounding knowledge to the Blenderbot model. Our solution outperforms the baseline Blenderbot and CRAYON models in control accuracy of dialog acts and sentiment on DailyDialog, and demonstrates comparable overall quality of dialogue generation given grounding knowledge on Wizard of Wikipedia. PDF 2 2022
Numerical Claim Detection in Finance: A Weak-Supervision Approach In the past few years, Transformer based models have shown excellent performance across a variety of tasks and domains. However, the black-box nature of these models, along with their high computing and manual annotation costs, has limited the adoption of these models. In this paper, we employ a weak-supervision-based approach to alleviate these concerns. We build and compare models for the financial claim detection task using sentences with numerical information in analyst reports for more than 1500 public companies in the United States from 2017 to 2020. In addition to standard performance metrics, we provide a cost-value analysis of human-annotation and weak-supervision labeling along with estimates of the carbon footprint of our models. We also analyze the performance of our claim detection models across various industry sectors given the considerable variation in numerical financial claims across industries. Our work highlights the potential of weak supervision models for research at the intersection of Finance and Computational Linguistics. PDF 2 2022
WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation A recurring challenge of crowdsourcing NLP datasets at scale is that human writers often rely on repetitive patterns when crafting examples, leading to a lack of linguistic diversity. We introduce a novel approach for dataset creation based on worker and AI collaboration, which brings together the generative strength of language models and the evaluative strength of humans. Starting with an existing dataset, MultiNLI for natural language inference (NLI), our approach uses dataset cartography to automatically identify examples that demonstrate challenging reasoning patterns, and instructs GPT-3 to compose new examples with similar patterns. Machine generated examples are then automatically filtered, and finally revised and labeled by human crowdworkers. The resulting dataset, WANLI, consists of 108,079 NLI examples and presents unique empirical strengths over existing NLI datasets. Remarkably, training a model on WANLI instead of MultiNLI (which is 4 times larger) improves performance on seven out-of-domain test sets we consider, including by 11% on HANS and 9% on Adversarial NLI. Moreover, combining MultiNLI with WANLI is more effective than combining it with other NLI augmentation sets. Our results demonstrate the potential of natural language generation techniques to curate NLP datasets of enhanced quality and diversity. PDF 2 2022
Modeling Hierarchical Reasoning Chains by Linking Discourse Units and Key Phrases for Reading Comprehension Machine reading comprehension (MRC) poses new challenges over logical reasoning, which aims to understand the implicit logical relations entailed in the given contexts and perform inference over them. Due to the complexity of logic, logical relations exist at different granularity levels. However, most existing methods of logical reasoning individually focus on either entity-aware or discourse-based information but ignore the hierarchical relations that may even have mutual effects. In this paper, we propose a holistic graph network (HGN) which deals with context at both discourse level and word level, as the basis for logical reasoning, to provide a more fine-grained relation extraction. Specifically, node-level and type-level relations, which can be interpreted as bridges in the reasoning process, are modeled by a hierarchical interaction mechanism to improve the interpretation of MRC systems. Experimental results on logical reasoning QA datasets (ReClor and LogiQA) and natural language inference datasets (SNLI and ANLI) show the effectiveness and generalization of our method, and in-depth analysis verifies its capability to understand complex logical relations. PDF 2 2022
TestAug: A Framework for Augmenting Capability-based NLP Tests The recently proposed capability-based NLP tests go beyond the traditional held-out evaluation paradigm, allowing model developers to test the different linguistic capabilities of a model. However, existing work on capability-based testing requires the (semi-)manual creation of the test suites (templates); such an approach thus heavily relies on the linguistic and domain expertise of the developers. In this paper, we investigate an automatic approach to generating and augmenting the test suites by prompting the GPT-3 engine. Our experiments show that our approach can generate diverse test suites with better coverage than the existing template-based approaches. The augmented test suites can also be used to detect more errors compared to existing work. Our test suites can be downloaded at https://anonymous-researcher-nlp.github.io/testaug/. PDF 2 2022
Calibrating Trust of Multi-Hop Question Answering Systems with Decompositional Probes Multi-hop Question Answering (QA) is a challenging task since it requires an accurate aggregation of information from multiple context paragraphs and a thorough understanding of the underlying reasoning chains. Recent work in multi-hop QA has shown that performance can be boosted by first decomposing the questions into simpler, single-hop questions. In this paper, we explore one additional utility of the multi-hop decomposition from the perspective of explainable NLP: to create explanation by probing a neural QA model with them. We hypothesize that in doing so, users will be better able to construct a mental model of when the underlying QA system will give the correct answer. Through human participant studies, we verify that exposing the decomposition probes and answers to the probes to users can increase their ability to predict system performance on a question instance basis. We show that decomposition is an effective form of probing QA systems as well as a promising approach to explanation generation. In-depth analyses show the need for improvements in decomposition systems. PDF 2 2022
Improving Faithfulness by Augmenting Negative Summaries from Fake Documents Current abstractive summarization systems tend to hallucinate content that is unfaithful to the source document, posing a risk of misinformation. To mitigate hallucination, we must teach the model to distinguish hallucinated summaries from faithful ones. However, the commonly used maximum likelihood training does not disentangle factual errors from other model errors. To address this issue, we propose a back-translation-style approach to augment negative samples that mimic factual errors made by the model. Specifically, we train an elaboration model that generates hallucinated documents given the reference summaries, and then generate negative summaries from the fake documents. We incorporate the negative samples into training through a controlled generator, which produces faithful/unfaithful summaries conditioned on control codes. Additionally, we find that adding textual entailment data through multi-tasking further boosts the performance. Experiments on XSum, Gigaword, and WikiHow show that our method consistently improves faithfulness without sacrificing informativeness, according to both human evaluation and automatic metrics. PDF 2 2022
Multi-head or Single-head? An Empirical Comparison for Transformer Training Multi-head attention plays a crucial role in the recent success of Transformer, which leads to consistent performance improvements over conventional attention in various applications. The popular belief is that its effectiveness stems from attending to information from multiple representation subspaces. In this paper, we first demonstrate that using multiple subspaces is not a unique feature of multi-head attention, as multi-layer single-head attention also leverages multiple subspaces. Then, we suggest the main advantage of the multi-head attention is the training stability, since it has fewer layers than the single-head attention when using the same number of subspaces. For example, 24-layer 16-head Transformer (BERT-large) and 384-layer single-head Transformer have roughly the same model size and employ the same total subspace number (attention head number), while the multi-head one is significantly shallower. Meanwhile, we show that, with recent advances in deep learning, we can successfully stabilize the training of the deep single-head Transformer. As the training difficulty is no longer a bottleneck, substantially deeper single-head Transformers achieve consistent performance improvements. PDF 2 2022
Query and Extract: Refining Event Extraction as Type-oriented Binary Decoding Event extraction is typically modeled as a multi-class classification problem where event types and argument roles are treated as atomic symbols. These approaches are usually limited to a set of pre-defined types. We propose a novel event extraction framework that uses event types and argument roles as natural language queries to extract candidate triggers and arguments from the input text. With the rich semantics in the queries, our framework benefits from the attention mechanisms to better capture the semantic correlation between the event types or argument roles and the input text. Furthermore, the query-and-extract formulation allows our approach to leverage all available event annotations from various ontologies as a unified model. Experiments on ACE and ERE demonstrate that our approach achieves state-of-the-art performance on each dataset and significantly outperforms existing methods on zero-shot event extraction. We will make all the programs publicly available once the paper is accepted. PDF 2 2022
Double Trouble: How to not explain a text classifier's decisions using counterfactuals synthesized by masked language models? A principle behind dozens of attribution methods is to take the prediction difference between before and after an input feature (here, a token) is removed as its attribution, i.e., the individual treatment effect in causal inference. The recently popular Input Marginalization (IM) method (Kim et al., 2020) uses BERT to replace a token, i.e., simulating the $do(.)$ operator, yielding more plausible counterfactuals. While Kim et al. (2020) reported that IM is effective, we find this conclusion not convincing, as the Deletion-BERT metric used in their paper is biased towards IM. Importantly, this bias should exist in many Deletion-based metrics, e.g., Insertion (Arras et al., 2017), Sufficiency, and Comprehensiveness (DeYoung et al., 2020). Furthermore, our rigorous evaluation using 6 metrics on 3 datasets finds no evidence that IM is better than a Leave-One-Out (LOO) baseline. We provide two explanations for why IM is not better than LOO: (1) deleting a single word from the input only marginally reduces a classifier's accuracy; and (2) a highly predictable word is always given near-zero attribution, which may not match its true importance to the target classifier. PDF 2 2022
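The LOO baseline referenced here is simple enough to sketch generically: a token's attribution is the drop in the target-class probability when that token is removed. The `classify` callable below is a hypothetical stand-in for whatever classifier is being explained.

```python
# Leave-One-Out attribution: attribution(i) = p(target | full input)
#                                           - p(target | input minus token i)
def loo_attributions(classify, tokens, target):
    """classify: callable mapping a token list to class probabilities."""
    base = classify(tokens)[target]
    attrs = []
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]   # delete token i
        attrs.append(base - classify(reduced)[target])
    return attrs   # larger value = token more important to the prediction
```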
Relation Extraction with Weighted Contrastive Pre-training on Distant Supervision Contrastive pre-training on distant supervision has shown remarkable effectiveness for improving supervised relation extraction tasks. However, the existing methods ignore the intrinsic noise of distant supervision during the pre-training stage. In this paper, we propose a weighted contrastive learning method by leveraging the supervised data to estimate the reliability of pre-training instances and explicitly reduce the effect of noise. Experimental results on three supervised datasets demonstrate the advantages of our proposed weighted contrastive learning approach, compared to two state-of-the-art non-weighted baselines. PDF 2 2022
Towards Collaborative Neural-Symbolic Graph Semantic Parsing via Uncertainty Recent work in task-independent graph semantic parsing has shifted from grammar-based symbolic approaches to neural models, showing strong performance on different types of meaning representations. However, it is still unclear what the limitations of these neural parsers are, and whether these limitations can be compensated for by incorporating symbolic knowledge into model inference. In this paper, we address these questions by taking English Resource Grammar (ERG) parsing as a case study. Specifically, we first develop a state-of-the-art neural ERG parser, and then conduct detailed analyses of parser performance within fine-grained linguistic categories and across a wide variety of corpora. The neural parser attains superior performance on the in-distribution test set, but degrades significantly in long-tail and out-of-distribution situations, while the symbolic parser performs more robustly. To address this, we further propose a simple yet principled collaborative framework for neural-symbolic semantic parsing, by designing a decision criterion for beam search that incorporates the prior knowledge from a symbolic parser and accounts for model uncertainty. Experimental results show that the proposed framework yields comprehensive improvements over the neural baseline across long-tail categories and out-of-domain examples, yielding the best known result on the well-studied DeepBank benchmark. PDF 2 2022
Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis Is it possible to build a general and automatic natural language generation (NLG) evaluation metric? Existing learned metrics either perform unsatisfactorily or are restricted to tasks where large human rating data is already available. We introduce SEScore, a model-based metric that is highly correlated with human judgements without requiring human annotation, by utilizing a novel, iterative error synthesis and severity scoring pipeline. This pipeline applies a series of plausible errors to raw text and assigns severity labels by simulating human judgements with entailment. We evaluate SEScore against existing metrics by comparing how their scores correlate with human ratings. SEScore outperforms all prior unsupervised metrics on multiple diverse NLG tasks including machine translation, image captioning, and WebNLG text generation. For WMT 20/21 En-De and Zh-En, SEScore improves the average Kendall correlation with human judgement from 0.154 to 0.195. SEScore even achieves comparable performance to the best supervised metric, COMET, despite receiving no human-annotated training data. PDF 2 2022
LOPS: Learning Order Inspired Pseudo-Label Selection for Weakly Supervised Text Classification Weakly supervised text classification methods typically train a deep neural classifier based on pseudo-labels. The quality of pseudo-labels is crucial to final performance, but they are inevitably noisy due to their heuristic nature, so selecting the correct ones holds huge potential for a performance boost. One straightforward solution is to select samples based on the softmax probability scores in the neural classifier corresponding to their pseudo-labels. However, we show through our experiments that such solutions are ineffective and unstable due to the erroneously high-confidence predictions from poorly calibrated models. Recent studies on the memorization effects of deep neural models suggest that these models first memorize training samples with clean labels and then those with noisy labels. Inspired by this observation, we propose a novel pseudo-label selection method, LOPS, that takes the learning order of samples into consideration. We hypothesize that the learning order reflects the probability of wrong annotation in terms of ranking, and we therefore propose to select the samples that are learnt earlier. LOPS can be viewed as a strong performance-boosting plug-in to most existing weakly-supervised text classification methods, as confirmed in extensive experiments on four real-world datasets. PDF 2 2022
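The selection rule described can be sketched independently of any particular classifier: track when each sample is first fitted, then keep the earliest-learned fraction per class. The bookkeeping interface below is our illustrative assumption, not the paper's exact procedure.

```python
# Learning-order-based pseudo-label selection: keep, per class, the samples
# that the classifier learned (matched its pseudo-label) earliest.
from collections import defaultdict

def lops_select(first_learned_epoch, pseudo_labels, keep_ratio=0.5):
    """first_learned_epoch[i]: epoch at which sample i first matched its
    pseudo-label (use a large sentinel if never). Returns kept indices."""
    by_class = defaultdict(list)
    for i, y in enumerate(pseudo_labels):
        by_class[y].append(i)
    kept = []
    for y, idxs in by_class.items():
        idxs.sort(key=lambda i: first_learned_epoch[i])
        kept.extend(idxs[:int(len(idxs) * keep_ratio)])
    return sorted(kept)
```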
Prompting ELECTRA: Few-Shot Learning with Discriminative Pre-Trained Models Pre-trained masked language models have been successfully used for few-shot learning by formulating downstream tasks as text infilling. However, discriminative pre-trained models like ELECTRA, a strong alternative in full-shot settings, do not fit into this paradigm. In this work, we adapt prompt-based few-shot learning to ELECTRA and show that it outperforms masked language models on a wide range of tasks. ELECTRA is pre-trained to distinguish whether a token is generated or original. We naturally extend that to prompt-based few-shot learning by training to score the originality of the verbalizers without introducing new parameters. Our method can be easily adapted to tasks involving multi-token verbalizers without extra computation overhead. Analysis shows that the distributions learned by ELECTRA align better with downstream tasks. PDF 2 2022
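The core scoring idea, using ELECTRA's replaced-token-detection head to judge how "original" a candidate verbalizer looks in a prompt, can be sketched as follows with HuggingFace transformers. The template, verbalizers, and checkpoint are illustrative, and the paper's training procedure is omitted.

```python
# Fill each candidate verbalizer into the template and score it with the
# discriminator's per-token "replaced vs. original" logits.
import torch
from transformers import ElectraTokenizer, ElectraForPreTraining

name = "google/electra-small-discriminator"
tokenizer = ElectraTokenizer.from_pretrained(name)
model = ElectraForPreTraining.from_pretrained(name).eval()

def score_verbalizers(template, verbalizers):
    scores = {}
    for word in verbalizers:
        enc = tokenizer(template.format(word), return_tensors="pt")
        with torch.no_grad():
            logits = model(**enc).logits[0]   # per-token "replaced" logits
        # locate the verbalizer's token(s) and average their originality
        w_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
        ids = enc["input_ids"][0].tolist()
        for i in range(len(ids) - len(w_ids) + 1):
            if ids[i:i + len(w_ids)] == w_ids:
                scores[word] = -logits[i:i + len(w_ids)].mean().item()
                break
    return scores   # higher = judged more "original" in this context

# e.g., score_verbalizers("The movie was {} overall.", ["great", "terrible"])
```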
Generative Retrieval for Long Sequences Text retrieval is often formulated as mapping the query and the target items (e.g., passages) to the same vector space and finding the item whose embedding is closest to that of the query. In this paper, we explore a generative approach as an alternative, where we use an encoder-decoder model to memorize the target corpus in a generative manner and then finetune it on query-to-passage generation. As GENRE has shown that entities can be retrieved in a generative way, our work can be considered as its generalization to longer text. We show that it consistently achieves comparable performance to traditional bi-encoder retrieval on diverse datasets and is especially strong at retrieving highly structured items, such as reasoning chains and graph relations, while demonstrating superior GPU memory and time complexity. We also conjecture that generative retrieval is complementary to traditional retrieval, as we find that an ensemble of both outperforms homogeneous ensembles. PDF 2 2022
Invariant Language Modeling Modern pretrained language models are critical components of NLP pipelines. Yet, they suffer from spurious correlations, poor out-of-domain generalization, and biases. Inspired by recent progress in causal machine learning, in particular the invariant risk minimization (IRM) paradigm, we propose invariant language modeling, a framework for learning invariant representations that generalize better across multiple environments. In particular, we adapt a game-theoretic implementation of IRM (IRM-games) to language models, where the invariance emerges from a specific training schedule in which all the environments compete to optimize their own environment-specific loss by updating subsets of the model in a round-robin fashion. In a series of controlled experiments, we demonstrate the ability of our method to (i) remove structured noise, (ii) ignore specific spurious correlations without affecting global performance, and (iii) achieve better out-of-domain generalization. These benefits come with a negligible computational overhead compared to standard training, do not require changing the local loss, and can be applied to any language model architecture. We believe this framework is promising to help mitigate spurious correlations and biases in language models. PDF 2 2022
Fair NLP Models with Differentially Private Text Encoders Encoded text representations often capture sensitive attributes about individuals (e.g., race or gender), which raise privacy concerns and can make downstream models unfair to certain groups. In this work, we propose FEDERATE, an approach that combines ideas from differential privacy and adversarial training to learn private text representations which also induce fairer models. We empirically evaluate the trade-off between the privacy of the representations and the fairness and accuracy of the downstream model on four NLP datasets. Our results show that FEDERATE consistently improves upon previous methods, and thus suggest that privacy and fairness can positively reinforce each other. PDF 2 2022
Self-supervised Learning for Formosan Speech Representation and Linguistic Phylogeny Formosan languages, spoken by the indigenous peoples of Taiwan, have unique roles in reconstructing Proto-Austronesian Languages. This paper presents a real-world Formosan language speech dataset, including 144 hours of news footage in 16 Formosan languages. One merit of the dataset is that it allows us to examine the relationships among Formosan languages in vivo. With the help of deep learning models, we can analyze the speech data without transcription. Specifically, we first train a language classifier based on XLSR-53 to classify the 16 Formosan languages with an accuracy of 88%. Then, we extract the speech vector representations learned from the model and compare them with 153 manually coded linguistic typological features. The comparison suggests that the speech vectors reflect the phonological and morphological aspects of Formosan languages. In addition, these linguistic features are used to construct a linguistic phylogeny, and the resulting genealogical grouping corresponds with previous literature. To sum up, the dataset opens up possibilities to investigate the current real-world use of Formosan languages. PDF 2 2022
InceptionXML: A Lightweight Framework with Synchronized Negative Sampling for Short Text Extreme Classification Automatic annotation of short-text data to a large number of target labels, referred to as Short Text Extreme Classification, has found numerous applications including prediction of related searches and product recommendation tasks. In this paper, we propose a convolutional architecture InceptionXML which is light-weight, yet powerful, and robust to the inherent lack of word-order in short-text queries encountered in search and recommendation tasks. We demonstrate the efficacy of applying convolutions by recasting the operation along the embedding dimension instead of the word dimension as applied in conventional CNNs for text classification. Towards scaling our model to datasets with millions of labels, we also propose the InceptionXML+ framework, which improves upon the shortcomings of the recently proposed dynamic hard-negative mining technique for label shortlisting by synchronizing the label-shortlister and extreme classifier. InceptionXML+ not only reduces the inference time to half but is also an order of magnitude smaller than the previous state-of-the-art Astec in terms of model size. Through our proposed models, we outperform all existing approaches on popular benchmark datasets. PDF 2 2022
Training Dynamics for Curriculum Learning: A Study on Monolingual and Cross-lingual NLU Curriculum Learning (CL) is a technique of training models via ranking examples in a typically increasing difficulty trend with the aim of accelerating convergence and improving generalisability. Current approaches for Natural Language Understanding (NLU) tasks use CL to improve in-distribution data performance often via heuristic-oriented difficulties or task-agnostic ones. In this work, instead, we employ CL for NLU by taking advantage of training dynamics as difficulty metrics, i.e. statistics that measure the behavior of the model at hand on specific task-data instances during training, and propose modifications of existing CL schedulers based on these statistics. Differently from existing works, we focus on evaluating models on in-distribution, out-of-distribution as well as zero-shot cross-lingual transfer datasets. We show across several NLU tasks that CL with training dynamics can result in better performance mostly on zero-shot cross-lingual transfer and OOD settings, with improvements of up to 8.5%. Overall, experiments indicate that training dynamics can lead to better-performing models with smoother training compared to other difficulty metrics, while at the same time being up to 51% faster. In addition, through analysis we shed light on the correlations of task-specific versus task-agnostic metrics. PDF 2 2022
Faster and Better Grammar-based Text-to-SQL Parsing via Clause-level Parallel Decoding and Alignment Loss Grammar-based parsers have achieved high performance in the cross-domain text-to-SQL parsing task, but suffer from low decoding efficiency, since the number of actions for grammar selection is much larger than the number of tokens in SQL queries. Meanwhile, how to better align SQL clauses and question segments has been a key challenge for parsing performance. Therefore, this paper proposes clause-level parallel decoding and an alignment loss to enhance two high-performance grammar-based parsers, i.e., RATSQL and LGESQL. Experimental results on the two parsers show that our method obtains consistent improvements in both accuracy and decoding speed. PDF 2 2022
Adversarially Constructed Evaluation Sets Are More Challenging, but May Not Be Fair Large language models increasingly saturate existing task benchmarks, in some cases outperforming humans, leaving little headroom with which to measure further progress. Adversarial dataset creation, which builds datasets using examples that a target system outputs incorrect predictions for, has been proposed as a strategy to construct more challenging datasets, avoiding the more serious challenge of building more precise benchmarks by conventional means. In this work, we study the impact of applying three common approaches for adversarial dataset creation: (1) filtering out easy examples (AFLite), (2) perturbing examples (TextFooler), and (3) model-in-the-loop data collection (ANLI and AdversarialQA), across 18 different adversary models. We find that all three methods can produce more challenging datasets, with stronger adversary models lowering the performance of evaluated models more. However, the resulting ranking of the evaluated models can also be unstable and highly sensitive to the choice of adversary model. Moreover, we find that AFLite oversamples examples with low annotator agreement, meaning that model comparisons hinge on the examples that are most contentious for humans. We recommend that researchers tread carefully when using adversarial methods for building evaluation datasets. PDF 2 2022
Simulating Inconsistencies in Task-oriented Dialog Most existing dialog models are trained on static dialog datasets or in an interactive way with user simulators, and evaluated in the same way. Such methods mostly rest on the idealized assumption that the user behaves consistently with the goal. Nevertheless, inconsistent behaviors are often observed from real users due to unpredictable mind changes or language understanding errors. In this paper, we give a systematic investigation of the inconsistency problem in real-world dialog systems and introduce three kinds of inconsistencies, namely Goal Change, Action Disloyalty and Understanding Deviation. We propose a user model to simulate these three kinds of inconsistencies, which can be used to examine model robustness. The simulation model is further utilized to support Reinforcement Learning and inconsistent data augmentation, which boosts the performance of pipeline and end-to-end dialog models under inconsistent situations. PDF 2 2022
Diverse Parallel Data Synthesis for Cross-Database Adaptation of Text-to-SQL Models Serving novel schemas for semantic parsing of natural language queries over relational databases is a challenging problem owing to a huge diversity of schemas and zero availability of text queries in the target schema until the initial deployment of the parser in the real world. We present REFILL, a framework for synthesizing diverse and high-quality parallel data of Text-SQL pairs for adapting semantic parsing models on a new schema. Unlike prior approaches that synthesize text using an SQL-to-Text model trained on existing datasets, our approach uses a novel method of retrieving diverse existing text, masking their schema-specific tokens, and refilling them to translate to the target schema. We show that this process leads to significantly more diverse text than achievable by sampling the beam of a plain SQL-to-Text model. Experiments across four groups of relational databases establish that finetuning a semantic parser on the datasets synthesized by REFILL offers consistent performance gains over prior data-augmentation methods. PDF 2 2022
Debiasing Pretrained Text Encoders by Paying Attention to Paying Attention Recent studies in fair Representation Learning have observed a strong inclination for Natural Language Processing (NLP) models to exhibit discriminatory stereotypes across gender, religion, race and many such social constructs. In comparison to the progress made in reducing bias from static word embeddings, fairness in sentence-level text encoders has received little consideration despite their wider applicability in contemporary NLP tasks. In this paper, we propose a debiasing method for pre-trained text encoders that both reduces social stereotypes and inflicts next to no semantic damage. Unlike previous studies that directly manipulate the embeddings, we propose to dive deeper into the operation of these encoders, and pay more attention to the way they pay attention to different social groups. We find that most stereotypes are also encoded in the attention layer. Then, we work on model debiasing by redistributing the attention scores of a text encoder such that it forgets any preference for historically advantaged groups, and attends to all social classes with the same intensity. Our experiments confirm that we successfully reduce bias with little damage to semantic representation. PDF 2 2022
Low-resource Data-to-Text Generation Using Pretrained Language Models Expressing natural language descriptions of structured facts or relations -- data-to-text generation -- increases the accessibility of a diverse range of structured knowledge repositories. End-to-end neural models for this task require a large training corpus of relations and corresponding descriptions. While such resources are unrealistic for every domain, we do not fully understand how well different data-to-text generation models can generalize to new relations. This work presents an analysis of data-to-text models for unseen relations based on two pre-trained language models (PLMs): T5 and GPT-2. We consider different strategies, including few-shot learning, prompt-tuning, and incorporating other domain knowledge (natural language description of the unseen relations) to identify effective strategies and remaining challenges for improving the performance of PLMs on new relations. PDF 2 2022
Alleviating Sparsity of Open Knowledge Graphs with Ternary Contrastive Learning Sparsity of formal knowledge and the roughness of non-ontological construction methods make the sparsity problem particularly prominent in Open Knowledge Graphs (OpenKGs). Sparse links make few-shot entities unable to learn potential features. We hypothesize that negative samples could help sparse links highlight discriminative features. However, existing contrastive learning methods on graphs model binary objects; none has studied contrastive learning for modeling the ternary patterns in KGs. In this paper, we propose Ternary Contrastive Learning (TernaryCL) to alleviate the sparsity of OpenKGs. TernaryCL designs (1) Contrastive Entity and (2) Contrastive Relation to mine ternary discriminative features from both negative entities and relations. (3) Contrastive Self constructs a self positive sample to give zero-shot and few-shot entities a chance to learn discriminative features. (4) Contrastive Fusion aggregates graph features by extending the pattern from 1-to-1 to 1-to-N. Extensive experiments on benchmarks show the superiority of TernaryCL over state-of-the-art models. PDF 2 2022
Zero-shot Cross-Linguistic Learning of Event Semantics Typologically diverse languages offer systems of lexical and grammatical aspect that allow speakers to focus on facets of event structure in ways that comport with the specific communicative setting and discourse constraints they face. In this paper, we look specifically at captions of images across Arabic, Chinese, Farsi, German, Russian, and Turkish and describe a computational model for predicting lexical aspects. Despite the heterogeneity of these languages, and the salient invocation of distinctive linguistic resources across their caption corpora, speakers of these languages show surprising similarities in the ways they frame image content. We leverage this observation for zero-shot cross-lingual learning and show that lexical aspects can be predicted for a given language despite not having observed any annotated data for this language at all. PDF 2 2022
Semantic-Preserving Adversarial Code Comprehension Based on the tremendous success of pre-trained language models (PrLMs) for source code comprehension tasks, current literature studies either ways to further improve the performance (generalization) of PrLMs, or their robustness against adversarial attacks. However, they have to compromise on the trade-off between the two aspects and none of them consider improving both sides in an effective and practical way. To fill this gap, we propose Semantic-Preserving Adversarial Code Embeddings (SPACE) to find the worst-case semantic-preserving attacks while forcing the model to predict the correct labels under these worst cases. Experiments and analysis demonstrate that SPACE can stay robust against state-of-the-art attacks while boosting the performance of PrLMs for code. PDF 4 2022
Efficient Structured Knowledge Distillation Structured prediction models aim at solving a type of task where the output is a complex structure rather than a single variable. Performing knowledge distillation for structured prediction models is not trivial due to their exponentially large output space. In this work, we propose an approach that is much simpler in its formulation and far more efficient for training than existing approaches. Specifically, we transfer the knowledge from a teacher model to its student model by locally matching their computations on all internal structures rather than the final outputs. In this manner, we avoid adopting some time-consuming techniques like dynamic programming (DP) for decoding output structures, which permits parallel computation and efficient training. Besides, it encourages the student model to better mimic the internal behavior of the teacher model. Experiments on two structured prediction tasks demonstrate that our approach not only halves the time cost for each training epoch but also outperforms previous methods in task performance. PDF 4 2022
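A minimal sketch of the local-matching idea: instead of distilling the exponentially large distribution over output structures, match the teacher's and student's local scores (here, per-position label logits in a sequence labeler) with a position-wise KL term, so no dynamic programming is needed and every position is matched in parallel. Shapes and the temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def local_distill_loss(student_logits, teacher_logits, temperature=2.0):
    """Both tensors: (batch, seq_len, n_labels). Matches local potentials only."""
    t = temperature
    s = F.log_softmax(student_logits / t, dim=-1)
    te = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(s, te, reduction="batchmean") * (t * t)

student = torch.randn(8, 32, 10, requires_grad=True)
teacher = torch.randn(8, 32, 10)
loss = local_distill_loss(student, teacher)
loss.backward()
print(loss.item())
```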
Multimodal Shannon Game with Images In the Shannon game, the goal is to guess the next letter in a sentence based on the previous context. It has since become a widely known thought experiment on which concepts in psycholinguistics, computational linguistics and natural language processing are based. We extend this game by including an optional extra modality in the form of images and run an experiment on human participants. We find that the presence of an image vastly improves users' confidence and accuracy across all POS. This includes determiners (a, an, the), which should otherwise be predicted solely from the previous (left) context of the sentence. PDF 4 2022
ViLPAct: A Benchmark for Compositional Generalization on Multimodal Human Activities We introduce ViLPAct, a novel vision-language benchmark for human activity planning. It is designed for a task where embodied AI agents can reason and forecast future actions of humans based on video clips about their initial activities and intents in text. The dataset consists of 2.9k videos from Charades extended with intents via crowdsourcing, a multi-choice question test set, and four strong baselines. One of the baselines implements a neurosymbolic approach based on a multi-modal knowledge base (MKB), while the other ones are deep generative models adapted from recent state-of-the-art (SOTA) works. According to our extensive experiments, the key challenges are compositional generalization and effective use of information from both modalities. We will make the benchmark publicly available upon acceptance. PDF 4 2022
Fast and Robust Link Prediction using Ontology Link prediction is the task of predicting missing relations between entities in a knowledge graph (KG). Recent work on link prediction has mainly attempted to increase accuracy by using more layers in the neural network architecture, which relies heavily on computational resources and does not scale to big KGs. This paper proposes a method of refining knowledge graphs to perform link prediction operations more accurately using relatively fast translational models. Translational link prediction models, such as TransE, TransH, TransD, RotatE, and HAKE, have significantly less complexity than deep learning approaches. Our method uses the ontologies of knowledge graphs to add information as auxiliary nodes to the graph. Then, these auxiliary nodes are connected to ordinary nodes of the KG that contain auxiliary information in their hierarchy. Our experiments show that our method can significantly increase the performance of translational link prediction methods in Hit@10, Mean Rank, and Mean Reciprocal Rank. PDF 4 2022
The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models Transformer-based language models have become a key building block for natural language processing. While these models are extremely accurate, they can be too large and computationally intensive to run on standard deployments. A variety of compression methods, including distillation, quantization, structured and unstructured pruning are known to decrease model size and increase inference speed, with low accuracy loss. In this context, this paper's contributions are two-fold. First, we perform an in-depth study of the accuracy-compression trade-off for unstructured weight pruning of BERT models. We introduce Optimal BERT Surgeon (OBS-BERT), an efficient and accurate weight pruning method based on approximate second-order information, which we show to yield state-of-the-art results in both stages of language tasks: pre-training and fine-tuning. Specifically, OBS-BERT extends existing work on unstructured second-order pruning by allowing for pruning blocks of weights, and by being applicable at the BERT scale. Second, we investigate the impact of this pruning method when compounding compression approaches to obtain highly compressed but accurate models for deployment on edge devices. These models significantly push the boundaries of the current state-of-the-art sparse BERT models with respect to all metrics: model size, inference speed and task accuracy. For example, relative to the dense BERT-base, we obtain 10x model size compression (in MB) with <1% accuracy drop, 10x CPU-inference speedup with <2% accuracy drop, and 29x CPU-inference speedup with <7.5% accuracy drop. PDF 4 2022
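For orientation, the classical Optimal Brain Surgeon criterion that this line of work builds on can be stated compactly. The formulas below are the textbook single-weight case; the paper's contribution is an approximate, block-wise version that scales to BERT, not this exact estimator:

$$\rho_i = \frac{w_i^2}{2\,[\mathbf{H}^{-1}]_{ii}}, \qquad \delta\mathbf{w} = -\frac{w_i}{[\mathbf{H}^{-1}]_{ii}}\,\mathbf{H}^{-1}\mathbf{e}_i,$$

where $\mathbf{H}$ is the Hessian of the loss at the current weights, $\rho_i$ estimates the loss increase from pruning weight $w_i$, $\delta\mathbf{w}$ is the compensating update applied to the remaining weights, and $\mathbf{e}_i$ is the $i$-th canonical basis vector.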
Multimodal Conditionality for Natural Language Generation Large scale pretrained language models have demonstrated state-of-the-art performance in language understanding tasks. Their application has recently expanded into multimodality learning, leading to improved representations combining vision and language. However, progress in adapting language models towards conditional Natural Language Generation (NLG) has been limited to a single modality, generally text. We propose MAnTiS, Multimodal Adaptation for Text Synthesis, a general approach for multimodal conditionality in transformer-based NLG models. In this method, we pass inputs from each modality through modality-specific encoders, project to textual token space, and finally join to form a conditionality prefix. We fine-tune the pretrained language model and encoders with the conditionality prefix guiding the generation. We apply MAnTiS to the task of product description generation, conditioning a network on both product images and titles to generate descriptive text. We demonstrate that MAnTiS outperforms strong baseline approaches on standard NLG scoring metrics. Furthermore, qualitative assessments demonstrate that MAnTiS can generate human quality descriptions consistent with given multimodal inputs. PDF 4 2022
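A hedged sketch of a conditionality prefix in torch: encode each modality, project it into the LM's token-embedding space, and prepend the results to the text embeddings before generation. Dimensions and the one-pseudo-token-per-modality choice are placeholders, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ConditionalityPrefix(nn.Module):
    def __init__(self, image_dim=512, title_dim=768, lm_dim=768):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, lm_dim)  # image encoder output -> token space
        self.title_proj = nn.Linear(title_dim, lm_dim)  # title encoder output -> token space

    def forward(self, image_feat, title_feat, token_embeds):
        # each projected modality contributes one "pseudo-token" to the prefix
        prefix = torch.stack(
            [self.image_proj(image_feat), self.title_proj(title_feat)], dim=1
        )
        return torch.cat([prefix, token_embeds], dim=1)  # (batch, 2 + seq, lm_dim)

m = ConditionalityPrefix()
out = m(torch.randn(4, 512), torch.randn(4, 768), torch.randn(4, 16, 768))
print(out.shape)  # torch.Size([4, 18, 768]), fed to the LM in place of plain embeddings
```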
An Empirical Study on Cross-Lingual and Cross-Domain Transfer for Legal Judgment Prediction Cross-lingual transfer learning has proven useful in a variety of NLP tasks, but it is understudied in the context of legal NLP, and not at all on Legal Judgment Prediction (LJP). We explore transfer learning techniques on LJP using the trilingual Swiss-Judgment-Prediction (SJP) dataset, including cases written in three languages (German, French, Italian). We find that Cross-Lingual Transfer (CLT) improves the overall results across languages, especially when we use adapter-based fine-tuning. We then further improve the model's performance by augmenting the training dataset with machine-translated versions of the original documents, using a 3x larger training corpus. Furthermore, we perform an analysis exploring the effect of cross-domain and cross-regional transfer, i.e., training a model across domains (legal areas) or regions. We find that in both settings (legal areas, origin regions), models trained across all groups perform overall better, while they also have improved results in the worst-case scenarios. Finally, we report improved results when we ambitiously apply cross-jurisdiction transfer, where we augment our dataset with Indian legal cases originally written in English. PDF 4 2022
Table-To-Text generation and pre-training with TabT5 Encoder-only transformer models have been successfully applied to different table understanding tasks, as in TAPAS. A major limitation of these architectures is that they are constrained to classification-like tasks such as cell selection or entailment detection. We present TabT5, an encoder-decoder model that generates natural language text based on tables and textual inputs. TabT5 overcomes the encoder-only limitation by incorporating a decoder component and leverages the input structure with table specific embeddings and pre-training. TabT5 achieves new state-of-the-art results on several domains, including spreadsheet formula prediction with a 15% increase in sequence accuracy, QA with a 2.5% increase in sequence accuracy and data-to-text generation with a 2.5% increase in BLEU. PDF 4 2022
Selective Labeling: How to Radically Lower Data-Labeling Costs for Document Extraction Models Building automatic extraction models for visually rich documents like invoices, receipts, bills, tax forms, etc. has received significant attention lately. A key bottleneck in developing extraction models for new document types is the cost of acquiring the several thousand high-quality labeled documents that are needed to train a model with acceptable accuracy. In this paper, we propose selective labeling as a solution to this problem. The key insight is to simplify the labeling task to provide "yes/no" labels for candidate extractions predicted by a model trained on partially-labeled documents. We combine this with a custom active learning strategy to find the predictions that the model is most uncertain about. We show through experiments on document types drawn from 3 different domains that selective labeling can reduce the cost of acquiring labeled data by $10\times$ while achieving negligible loss in accuracy. PDF 4 2022
Target-Level Sentence Simplification as Controlled Paraphrasing Automatic text simplification aims to reduce the linguistic complexity of a text in order to make it easier to understand and more accessible. However, simplified texts are consumed by a diverse array of target audiences and what may be appropriately simplified for one group of readers may differ considerably for another. In this work we investigate a novel formulation of sentence simplification as paraphrasing with controlled decoding, which aims to alleviate the major burden of relying on large amounts of in-domain parallel training data, while at the same time allowing for modular and adaptive simplification. According to a range of automatic metrics, our approach performs competitively against baselines that prove more difficult to adapt to the needs of different target audiences or require complex-simple parallel data. PDF 4 2022
Trial2Vec: Zero-Shot Clinical Trial Document Similarity Search using Self-Supervision Clinical trials are essential for drug development but are extremely expensive and time-consuming to conduct. It is beneficial to study similar historical trials when designing a clinical trial. However, lengthy trial documents and lack of labeled data make trial similarity search difficult. We propose a zero-shot clinical trial retrieval method, called Trial2Vec, which learns through self-supervision without the need for annotating similar clinical trials. Specifically, the \textit{meta-structure} of trial documents (e.g., title, eligibility criteria, target disease) along with clinical knowledge (e.g., UMLS knowledge base \footnote{\url{https://www.nlm.nih.gov/research/umls/index.html}}) are leveraged to automatically generate contrastive samples. Besides, Trial2Vec encodes trial documents considering meta-structure thus producing compact embeddings aggregating multi-aspect information from the whole document. We show that our method yields medically interpretable embeddings by visualization and it gets 15\% average improvement over the best baselines on precision/recall for trial retrieval, which is evaluated on our labeled 1600 trial pairs. In addition, we prove the pretrained embeddings benefit the downstream trial outcome prediction task over 240k trials. PDF 4 2022
Fast-R2D2: A Pretrained Recursive Neural Network based on Pruned CKY for Grammar Induction and Text Representation Chart-based models have shown great potential in unsupervised grammar induction, running recursively and hierarchically, but requiring cubic time-complexity. The Recursive Transformer based on Differentiable Trees (R2D2) makes it possible to scale to large language model pretraining even with a complex tree encoder, by introducing a heuristic pruning method. However, its rule-based pruning process suffers from local optima and slow inference. In this paper, we propose a unified R2D2 method that overcomes these issues. We use a top-down parser as a model-guided pruning method, which also enables parallel encoding during inference. Our parser casts parsing as a split point scoring task, which first scores all split points for a given sentence, and then uses the highest-scoring split point to recursively split a span into two parts. The reverse order of the splits is considered as the order of pruning in the encoder. Besides the bi-directional language model loss, we also optimize the parser by minimizing the Kullback–Leibler distance between tree probabilities from the parser and the R2D2 model. Our experiments show that our Fast-R2D2 significantly improves the grammar induction quality and achieves competitive results in downstream tasks. PDF 4 2022
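A minimal sketch of the top-down split-point parser the abstract describes: score every split point of a span, take the argmax, and recurse on the two halves; the reverse of this split order then serves as the pruning order. The scorer here is a toy stand-in for the learned model.

```python
def parse(tokens, score_split):
    """Return a nested binary tree over tokens.
    score_split(i, j, k) scores splitting span [i, j) at position k."""
    def build(i, j):
        if j - i == 1:
            return tokens[i]
        k = max(range(i + 1, j), key=lambda k: score_split(i, j, k))
        return (build(i, k), build(k, j))
    return build(0, len(tokens))

# toy scorer that prefers splitting right after "brown"
def toy(i, j, k):
    return 1.0 if k == 3 else 0.0

print(parse(["the", "quick", "brown", "fox", "jumps"], toy))
```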
How to Stop an Avalanche? JoDeM: Joint Decision Making through Compare and Contrast for Dialog State Tracking Dialog state tracking (DST) is a core component in task-oriented dialog systems. Existing state-of-the-art DST models incorporate insight and intuition from human experience into the design of supplementary labels, which greatly assists the training of turn-by-turn DST models. Though the turn-by-turn scheme and supplementary labels enable satisfactory performance on the task, most DST models of this fashion label or process the raw dialogue data on the premise that the last-turn dialogue state is always correct, which is usually not the case. In this paper, we refer to the negative impact resulting from this premise as the avalanche phenomenon. We then propose JoDeM, a state-of-the-art DST model which can tackle the avalanche phenomenon with two mechanisms. The first mechanism is a joint decision making method to extract key information from the dialogue. The second mechanism is a compare-and-contrast dialogue update technique to prevent error accumulation. An example study and graph analysis are presented to support our claim about the harmfulness of the avalanche phenomenon. We also conduct quantitative and qualitative experiments on the high-quality MultiWOZ2.3 corpus to demonstrate that the proposed model not only outperforms existing state-of-the-art methods, but also validates the benefit of addressing the avalanche degradation problem. PDF 4 2022
Fine-Tuning Pre-trained Transformers into Decaying Fast Weights Autoregressive Transformers are strong language models but incur $O(T)$ complexity during per-token generation due to the self-attention mechanism. Recent work proposes kernel-based methods to approximate causal self-attention by replacing it with recurrent formulations with various update rules and feature maps to achieve $O(1)$ time and memory complexity. We explore these approaches and find that they are unnecessarily complex, and propose a simple alternative - decaying fast weights - that runs fast on GPU, outperforms prior methods, and retains 99% of attention's performance for GPT-2. We also show competitive performance on WikiText-103 against more complex attention substitutes. PDF 4 2022
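A hedged sketch of a decaying fast-weights recurrence replacing causal self-attention: a fast-weight state is decayed by per-dimension gates and updated with the current key/value outer product, giving constant memory per generated token. This follows the general recipe the abstract describes; the exact parameterization (e.g., how the decay gates are learned) is the paper's, not shown here.

```python
import torch

def decaying_fast_weights(q, k, v, decay):
    """q, k, v: (seq, d); decay: (d,) in (0, 1). Returns outputs of shape (seq, d).
    In practice decay would be e.g. sigmoid of a learned per-head parameter."""
    seq, d = q.shape
    S = torch.zeros(d, d)          # fast-weight state
    z = torch.zeros(d)             # normalizer state
    outs = []
    for t in range(seq):
        S = decay[:, None] * S + torch.outer(k[t], v[t])
        z = decay * z + k[t]
        outs.append(q[t] @ S / (q[t] @ z).clamp(min=1e-6))
    return torch.stack(outs)

q, k, v = (torch.rand(10, 16) for _ in range(3))
print(decaying_fast_weights(q, k, v, torch.full((16,), 0.9)).shape)  # (10, 16)
```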
A Distilled Representation for Zero and Few-Shot Localization of Task-Oriented Dialogue Agents Task-oriented Dialogue (ToD) agents are mostly limited to a few widely-spoken languages, mainly due to the high cost of acquiring training data for each language. Existing low-cost approaches that rely on cross-lingual embeddings or naive machine translation sacrifice a lot of accuracy for data efficiency, and largely fail in creating a usable dialogue agent. We propose automatic methods that use ToD training data in a source language to build a high-quality functioning dialogue agent in another target language that has no training data (i.e. zero-shot) or a small training set (i.e. few-shot). Unlike most prior work in cross-lingual ToD that only focuses on Dialogue State Tracking (DST), we build an end-to-end agent. We show that our approach closes the accuracy gap between few-shot and existing full-shot methods for ToD agents. We achieve this by (1) improving the dialogue data representation, (2) improving entity-aware machine translation, and (3) automatic filtering of noisy translations. We evaluate our approach on the recent bilingual dialogue dataset BiToD. In Chinese to English transfer, in the zero-shot setting, our method achieves 46.7% and 22.0% in Task Success Rate (TSR) and Dialogue Success Rate (DSR) respectively. In the few-shot setting where 10% of the data in the target language is used, we improve the state-of-the-art by 15.2% and 14.0%, coming within 5% of full-shot training. PDF 4 2022
KESA: A Knowledge Enhanced Approach For Sentiment Analysis Though some recent works focus on injecting sentiment knowledge into pre-trained language models, they usually design mask and reconstruction tasks in the post-training phase. In this paper, we aim to benefit from sentiment knowledge in a lighter way. To achieve this goal, we propose two sentiment-aware auxiliary tasks named sentiment word cloze and conditional sentiment prediction and, correspondingly, integrate them into the fine-tuning phase. The first task learns to select the correct sentiment words within the input, given the overall sentiment polarity as prior knowledge. Conversely, the second task predicts the overall sentiment polarity given the sentiment polarity of the word as prior knowledge. In addition, two kinds of label combination methods are investigated to unify multiple types of labels in each task. Experimental results demonstrate that our approach consistently outperforms baselines and is additive to existing knowledge-enhanced post-trained models. PDF 4 2022
Efficient Japanese Tokenization Based on Improved Pointwise Linear Classification This paper proposes multiple techniques to improve the runtime efficiency of Japanese tokenization based on the Pointwise Linear Classification (PLC) framework, which formulates the whole tokenization process as a sequence of linear classification problems. Our techniques are optimized by leveraging the characteristics of the PLC framework and the task definition itself. Specifically, we introduce (1) composing multiple classifications into array-based operations, (2) efficient feature lookup with memory-optimized automata, and (3) three orthogonal preprocessing methods to reduce the amount of actual score calculation. Combining these techniques, our implementation works 5.7 times faster than the existing tokenizer based on the same model without any loss of tokenization accuracy. PDF 4 2022
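A minimal sketch of the underlying pointwise linear classification formulation (not the paper's optimizations): each character boundary is scored independently as a sum of feature weights over a small character window, so the whole sentence reduces to table lookups and additions. The features and weights here are toy assumptions.

```python
def tokenize(text, weights):
    """weights maps (offset, char) features to scores; positive sum = boundary."""
    tokens, start = [], 0
    for i in range(1, len(text)):
        # character-window features around candidate boundary i
        score = sum(weights.get((o, text[i + o]), 0.0)
                    for o in (-2, -1, 0, 1) if 0 <= i + o < len(text))
        if score > 0:
            tokens.append(text[start:i])
            start = i
    tokens.append(text[start:])
    return tokens

w = {(-1, "s"): 1.5, (0, " "): 2.0}
print(tokenize("cats eat fish", w))  # leading spaces kept for simplicity
```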
Generating personalized article edits on collaborative editing platforms NLP methods to generate edits on collaborative editing platforms can help users to edit more efficiently and suggest locations within an article for editing. Existing methods have largely ignored the personalized aspect of editing--the diverse styles, interests, and editing intentions that affect user edits. In this paper, we analyze two personalization methods: augmenting models with user behavior clusters and user tags. We demonstrate that these methods, when combined with a new architecture, generate edits that are closer to ground-truth Wikipedia edits when compared to an existing strong baseline. Our experiments test edits for both edit type (insertion or deletion) and word choice, and include a user study collecting feedback from human evaluators. Finally, we introduce a new dataset of Wikipedia edits to facilitate future innovation. PDF 4 2022
MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages While there has been a recent burgeoning of applications at the intersection of natural and programming languages, such as code generation and code summarization, these applications are usually English-centric. This creates a barrier for program developers who are not proficient in English. To mitigate this gap in technology development across languages, we propose a multilingual dataset, MCoNaLa, to benchmark code generation from natural language commands extending beyond English. Modeled off of the methodology from the English Code/Natural Language Challenge (CoNaLa) dataset, we annotated a total of 896 NL-Code pairs in three languages: Spanish, Japanese, and Russian. We present a systematic evaluation on MCoNaLa by testing state-of-the-art code generation systems. Although the difficulties vary across three languages, all systems lag significantly behind their English counterparts, revealing the challenges in adapting code generation to new languages. PDF 4 2022
D$^2$PSG: Multi-Party Dialogue Discourse Parsing as Sequence Generation Conversational discourse analysis aims to extract the interactions between dialogue turns, which is crucial for modeling complex multi-party dialogues. As the benchmarks are still limited in size and human annotations are costly, the current standard approaches apply pretrained language models, but they still require randomly initialized classifiers to make predictions. These classifiers usually require massive data to work smoothly with the pretrained encoder, causing a severe data-hunger issue. We propose two convenient strategies to formulate this task as a sequence generation problem, where classifier decisions are carefully converted into sequences of tokens. We then adopt a pretrained T5 model to solve this task so that no parameters are randomly initialized. We also leverage the descriptions of the discourse relations to help the model understand their meanings. Experiments on two popular benchmarks show that our approach outperforms previous state-of-the-art models by a large margin, and it is also more robust in zero-shot and few-shot settings. PDF 4 2022
CC-Riddle: A Question Answering Dataset of Chinese Character Riddles The Chinese character riddle is a challenging riddle game which takes a single character as the solution. The riddle describes the pronunciation, shape and meaning of the solution character with rhetorical techniques. In this paper, we propose a Chinese character riddle dataset covering the majority of common simplified Chinese characters, built by crawling riddles from the Web and generating brand new ones. In the generation stage, we provide the Chinese phonetic alphabet, decomposition and explanation of the solution character to the generation model and obtain multiple riddle descriptions for each tested character. The generated riddles are then manually filtered, and the final dataset, CC-Riddle, is composed of both human-written riddles and filtered generated riddles. Furthermore, we build a character riddle QA system based on our dataset and find that existing models struggle to solve such tricky questions. PDF 4 2022
Disentangled Memory Retrieval Towards Math Word Problem Generation The task of math word problem (MWP) generation, which generates an MWP given an equation and relevant topic words, has increasingly attracted researchers' attention. In this work, we propose a seq2seq model with a disentangled memory retrieval module to better take advantage of the logical description and scenario description within an MWP, and of more relevant training data, to improve generation quality. We first disentangle the training MWPs into logical descriptions and scenario descriptions and then record them in respective memory modules. Later, we use the given equation and topic words as queries to retrieve the most relevant logical descriptions and scenario descriptions from the corresponding memory modules. The retrieved results are then used to complement the process of MWP generation. Extensive experiments verify the superior performance and effectiveness of our method. PDF 4 2022
Towards Building Accurate End-to-End Task-Oriented Dialog Systems with a Simple Cache End-to-end task-oriented dialog (TOD) systems have achieved promising performance by leveraging the sophisticated natural language understanding and natural language generation capabilities of pre-trained models. This work gives TOD systems greater flexibility through a simple cache. The cache provides the flexibility to dynamically update the TOD systems to disable existing or add new and unseen domains, intents, slots, etc., without intensive retraining. Towards this end, we first fine-tune a retrieval module to correctly retrieve the Top-$N$ slot information entries from the cache and then train generative end-to-end TOD models with the cache. While performing TOD generation, the models can refer to and ground on both the dialog history and the retrieved information. The introduced cache is easy to construct, and the backbone models of TOD systems are compatible with existing pre-trained generative models. Extensive experiments demonstrate the superior performance of our proposed end-to-end framework over baselines, e.g., the Non-Empty JGA is improved by $6.67\%$ when compared with BART-Large. PDF 4 2022
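A hedged sketch of the cache-grounded setup: embed the dialog history as a query, retrieve the Top-N cache entries by cosine similarity, and prepend them to the generation context. The encoders are stand-ins (random vectors here) and the context format is an illustrative assumption.

```python
import numpy as np

def top_n_from_cache(query_vec, cache_vecs, cache_texts, n=3):
    """Cosine-similarity retrieval over the cache entries."""
    q = query_vec / np.linalg.norm(query_vec)
    c = cache_vecs / np.linalg.norm(cache_vecs, axis=1, keepdims=True)
    idx = np.argsort(-(c @ q))[:n]
    return [cache_texts[i] for i in idx]

cache_texts = ["restaurant: name=Nandos, area=south",
               "hotel: name=Hilton, stars=4",
               "restaurant: name=Pho, area=north"]
cache_vecs = np.random.rand(3, 64)   # stand-in for encoded cache entries
query = np.random.rand(64)           # stand-in for the encoded dialog history
retrieved = top_n_from_cache(query, cache_vecs, cache_texts, n=2)
context = " | ".join(retrieved) + " || user: book a table in the south"
print(context)                        # this string is fed to the generative TOD model
```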
RID: A Unified Framework for Conversational Recommender Systems with Pretrained Language Models Conversational Recommender Systems (CRS), which aim to recommend high-quality items to users through interactive conversations, have gained great research attention recently. A CRS is usually composed of a recommendation module and a generation module. In the previous work, these two modules are loosely connected in the model training and are shallowly integrated during inference, where a simple switching network or copy mechanism is adopted to incorporate recommended items into generated responses. Moreover, the current end-to-end neural models trained on small crowd-sourcing datasets (e.g., 10K dialogs in the ReDial dataset) tend to be overfitting and have poor chit-chat ability. In this work, we propose a novel unified framework called RID that integrates recommendation into the dialog generation by introducing a vocabulary pointer. To tackle the low-resource issue in CRS, we finetune the large-scale pretrained language model to generate fluent and diverse responses, and introduce a knowledge-aware bias learned from an entity-oriented knowledge graph to enhance the recommendation performance. Furthermore, we propose to evaluate the CRS models in an end-to-end manner, which can reflect the overall performance of the entire system rather than the performance of individual modules, compared to the separate evaluations of the two modules used in previous work. Experiments on the benchmark dataset ReDial show our RecInDial model significantly surpasses the state-of-the-art methods. More extensive analyses show the effectiveness of our model. PDF 4 2022
Navigating Connected Memories with a Task-oriented Dialog System Recent years have seen an increasing trend in the volume of personal media captured by users, thanks to the advent of smartphones and smart glasses, resulting in large media collections. Despite conversation being an intuitive human-computer interface, current efforts focus mostly on single-shot natural language based media retrieval to aid users query their media and re-live their memories. This severely limits the search functionality as users can neither ask follow-up queries nor obtain information without first formulating a single-turn query. In this work, we propose dialogs for connected memories as a powerful tool to empower users to search their media collection through a multi-turn, interactive conversation. Towards this, we collect a new task-oriented dialog dataset COMET, which contains $11.5k$ user$\leftrightarrow$assistant dialogs (totalling $103k$ utterances), grounded in simulated personal memory graphs. We employ a resource-efficient, two-phase data collection pipeline that uses: (1) a novel multimodal dialog simulator that generates synthetic dialog flows grounded in memory graphs, and, (2) manual paraphrasing to obtain natural language utterances. We analyze COMET, formulate four main tasks to benchmark meaningful progress, and adopt state-of-the-art language models as strong baselines, in order to highlight the multimodal challenges captured by our dataset. Our code \& data will be made publicly available. PDF 4 2022
Finding Interpretable Word Embedding Subspaces using Covariance and Correlation Maximization This paper proposes a new method for estimating a direction in a word embedding space corresponding to an interpretable semantic property such as gender, race, or religion. Our technique assumes that words can be assigned numerical scores that quantify their association with the target property. We estimate the subspace by maximizing the covariance or correlation of these scores with the projection of word embeddings along the subspace. Using our technique, we show that word embedding spaces in English, French, and Chinese contain subspaces that encode gender, race, religion, sentiment, word length, and national population. We then apply our technique to the mitigation of gender and racial bias from word embeddings. We find that using our technique to estimate a gender or race subspace improves performance on several benchmarks. PDF 4 2022
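A minimal sketch of the covariance-maximization estimate: with centered embeddings $X$ and centered property scores $s$, the unit direction $d$ maximizing $\mathrm{cov}(Xd, s)$ is simply the normalized vector $X^\top s$, since the covariance is linear in $d$.

```python
import numpy as np

def property_direction(embeddings, scores):
    """Unit direction maximizing covariance between projections and scores."""
    X = embeddings - embeddings.mean(axis=0)
    s = scores - scores.mean()
    d = X.T @ s
    return d / np.linalg.norm(d)

# toy data: 100 fake 50-d word vectors with property scores in [-1, 1]
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
s = rng.uniform(-1, 1, size=100)
d = property_direction(X, s)
print(d.shape, np.linalg.norm(d))  # (50,) ~1.0
projections = X @ d                # per-word association with the property
```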
Language-Family Adapters for Multilingual Neural Machine Translation Massively multilingual models, pretrained on monolingual data, yield state-of-the-art results in a wide range of natural language processing tasks. In machine translation, multilingual pretrained models are often fine-tuned on parallel data from one or multiple language pairs. Multilingual fine-tuning improves performance on medium- and low-resource languages but requires modifying the entire model and can be prohibitively expensive. Training a new set of adapters on each language pair or training a single set of adapters on all language pairs (language-pair or language-agnostic adapters) while keeping the pretrained model's parameters frozen has been proposed as a parameter-efficient alternative. However, the former do not learn cross-lingual representations, while the latter share parameters for all languages and potentially have to deal with negative interference. In this paper, we propose training language-family adapters on top of a pretrained multilingual model to facilitate cross-lingual transfer. Our model consistently outperforms other adapter-based approaches. We also demonstrate that language-family adapters provide an effective method to translate to languages unseen during pretraining. PDF 4 2022
Logical Fallacy Detection Reasoning is central to human intelligence. However, fallacious arguments are common, and some exacerbate problems such as spreading misinformation about climate change. In this paper, we propose the task of \textit{logical fallacy detection}, and provide a new dataset (Logic) of logical fallacies generally found in text, together with an additional challenge set for detecting logical fallacies in climate change claims (LogicClimate). Detecting logical fallacies is a hard problem as the model must understand the underlying logical structure of the argument. We find that existing pretrained large language models perform poorly on this task. In contrast, we show that a simple structure-aware classifier outperforms the best language model by 5.46% on Logic and 4.51% on LogicClimate. We encourage future work to explore this task as (a) it can serve as a new reasoning challenge for language models, and (b) it can have potential applications in tackling the spread of misinformation. PDF 4 2022
ClidSum: A Benchmark Dataset for Cross-Lingual Dialogue Summarization We present ClidSum, a benchmark dataset towards building cross-lingual summarization systems on dialogue documents. It consists of 67k+ dialogue documents and 112k+ annotated summaries in different target languages. Based on the proposed ClidSum, we introduce two benchmark settings for supervised and semi-supervised scenarios, respectively. We then build various baseline systems in different paradigms (pipeline and end-to-end) and conduct extensive experiments on ClidSum to provide deeper analyses. Furthermore, we propose mDialBART which extends mBART via further pre-training, where the multiple objectives help the pre-trained model capture the structural characteristics as well as key content in dialogues and the transformation from the source to the target language. Experimental results show the superiority of mDialBART: as an end-to-end model, it outperforms strong pipeline models on ClidSum. Finally, we discuss the specific challenges that current approaches face on this task and give multiple promising directions for future research. PDF 4 2022
The KITMUS Test for Knowledge Integration from Multiple Sources Natural language understanding models make inferences using information from multiple sources. An important class of such inferences are those that require both background knowledge, presumably contained in a model's pretrained parameters, and instance-specific information that is supplied at inference time. However, the integration and reasoning abilities of NLU models in the presence of multiple knowledge sources have been largely understudied. In this work, we propose a test suite of coreference resolution tasks that require reasoning over multiple facts and an accompanying dataset with individual subtasks that we vary in order to control the knowledge source of relevant facts. We evaluate state-of-the-art coreference resolution models on our dataset. Our results indicate that several models struggle to reason on-the-fly over knowledge observed both at train time and at inference time. However, with task-specific training, a subset of models demonstrates the ability to integrate certain knowledge types from multiple sources. PDF 4 2022
Using Calibrator to Improve Robustness in Machine Reading Comprehension Without Performance Sacrificing Machine Reading Comprehension (MRC) has achieved remarkable results since powerful models such as BERT were proposed. However, these models are not robust enough, being vulnerable to adversarial input perturbations and generalization examples. Some works have tried to improve performance under adversarial perturbation by adding related examples to the training data, but this leads to degradation on the in-domain dataset, because the shift in data distribution makes answer ranking based on the model's softmax probability unreliable. In this paper, we propose a method to improve robustness by using a calibrator as a post-hoc reranker, implemented with an XGBoost model. The calibrator combines both manual features and representation-learning features to rerank candidate results. Experimental results on adversarial datasets show that our model can achieve performance improvements of more than 10% and also improve on the in-domain and generalization datasets. PDF 4 2022
Investigating Adapters Effectiveness in Machine Translation Pre-trained language models have received extensive attention in recent years. However, it is still challenging to incorporate a pre-trained model such as BERT into natural language generation tasks. In this work, we investigate a recent method called adapters as an alternative to fine-tuning. We conduct a study to understand whether we can improve the effectiveness of adapters in transformer models for the machine translation task. We explore the behaviour of adapters when placed only in the encoder or only in the decoder, and we also attempt to down-scale the pre-trained model size and recover the performance so that it is comparable to the full-sized BERT model fine-tuned without adapters. We find that the performance of incorporating adapters and pre-trained weights in the encoder alone is on par with including adapters on both the encoder and decoder. In our down-scaling study, we find that leveraging only half the size of the original pre-trained weights can positively impact performance when fine-tuned with adapters. Our experiments show that we can get almost the same performance as the original BERT model after fine-tuning the cross-attention layer. PDF 4 2022
Probing and Label Preserving Mixup for Explicit and Implicit Hatespeech Explicit hate speech comprises cuss, swear, and abusive words, but implicit hate is more subtle, indirect and contextual, which makes it more challenging. In this work, we perform experiments studying explicit and implicit hate speech across the dimensions of probing, label-preserving mixup, cross-learning and domain/task adaptation. We study the efficacy of explicit hate speech for improving the more challenging implicit hate speech detection. We also probe contextual models and observe that higher layers encode implicit hate speech while lower layers focus on explicit hate speech, highlighting the importance of token-level understanding for explicit and context-level understanding for implicit hate speech detection. We propose a simple yet effective input-level data augmentation technique, EasyMix, which improves performance in monolingual and multilingual settings for both explicit and implicit hate speech. Coupled with domain and task adaptation, our method shows consistent improvements of 1-6% across multiple datasets and languages. We also share the code and dataset splits for reproducing our results. PDF 4 2022
Bilingual Lexicon Induction for Low-Resource Languages using Graph Matching via Optimal Transport Bilingual lexicons form a critical component of various NLP applications, including unsupervised and semisupervised machine translation and crosslingual information retrieval. In this work, we improve bilingual lexicon induction performance across 40 diverse language pairs with a graph-matching method based on optimal transport. The method is especially strong with very low amounts of supervision. PDF 4 2022
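A hedged sketch of the optimal-transport step: entropy-regularized Sinkhorn iterations produce a soft matching between source and target word nodes given a cost matrix (which the paper would derive from graph structure; here it is random). The regularization strength and iteration count are illustrative.

```python
import numpy as np

def sinkhorn(cost, reg=0.1, n_iters=200):
    """Entropy-regularized OT with uniform marginals; returns a transport plan."""
    n, m = cost.shape
    K = np.exp(-cost / reg)
    a, b = np.ones(n) / n, np.ones(m) / m   # uniform marginals
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

cost = np.random.rand(5, 5)
P = sinkhorn(cost)
matches = P.argmax(axis=1)   # induced lexicon: source word i -> target word matches[i]
print(P.round(3))
print(matches)
```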
Efficient Ensemble Transformer for Accurate Answer Sentence Ranking Large transformer models can greatly improve the Answer Sentence Selection (AS2) task, but their high computational costs prevent their use in many real-world applications. In this paper, we explore the following research question: How can we make AS2 models more accurate without significantly increasing their model complexity? To address the question, we propose the Multiple Heads Student architecture (MHS), an efficient neural network designed to distill an ensemble of large transformers into a single smaller model. An MHS model consists of two components: a stack of transformer layers that is used to encode inputs, and a set of ranking heads; unlike traditional distillation techniques, each head is trained by distilling a different large transformer architecture in a way that preserves the diversity of the ensemble members. The resulting model captures the knowledge of heterogeneous transformer models using just a few extra parameters. We show the effectiveness of MHS on three English datasets for AS2; our proposed approach outperforms all single-model distillations we consider, rivaling the state-of-the-art large AS2 models that have 2.7x more parameters and run 2.5x slower. PDF 4 2022
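A minimal sketch of a Multiple Heads Student: one shared encoder feeds several ranking heads, and each head is distilled against a different teacher so the ensemble's diversity is preserved. The encoder here is a toy stand-in for the transformer stack, and MSE is one plausible distillation loss, not necessarily the paper's.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadStudent(nn.Module):
    def __init__(self, dim=256, n_heads=3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.heads = nn.ModuleList([nn.Linear(dim, 1) for _ in range(n_heads)])

    def forward(self, x):
        h = self.encoder(x)                                   # shared encoding
        return [head(h).squeeze(-1) for head in self.heads]   # one score per head

def distill_step(student, x, teacher_scores):
    """teacher_scores: list of (batch,) tensors, one per teacher/head."""
    losses = [F.mse_loss(s, t) for s, t in zip(student(x), teacher_scores)]
    return sum(losses)

model = MultiHeadStudent()
x = torch.randn(8, 256)
teachers = [torch.randn(8) for _ in range(3)]
print(distill_step(model, x, teachers).item())
# at inference, average the heads: final_score = torch.stack(model(x)).mean(0)
```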
Towards Robust NLG Evaluation with Syntactically-diverse Prompts We present a robust methodology for demographic bias evaluation in natural language generation (NLG) systems. Previous works use a limited number of fixed prefix templates to analyze the bias in text generated by language models (LMs) with mentions of various demographic groups. These fixed prefix templates could be unstable and generate different outputs when paraphrased. To study this problem, we paraphrase the prompts with different syntactic structures and use the paraphrased prompts to evaluate demographic bias in NLG systems. We analyze the probability distribution of regard scores for various demographic groups and syntactic structures for individual and pairwise analysis both in an aggregated and syntactically-segregated manner. Our results suggest similar overall trends but some syntactic structures lead to contradictory conclusions compared to those of past works. We find that some syntactic structures generate more toxic content than others and some could be more biased than others. This shows the instability of prompts and suggests the importance of not relying on a single prompt with a fixed syntactic structure, and of introducing syntactically diverse prompting for NLG evaluation. PDF 4 2022
Contrastive Learning of Sociopragmatic Meaning in Social Media Recent progress in representation and contrastive learning in NLP has not widely considered the class of \textit{sociopragmatic meaning} (i.e., meaning in interaction within different language communities). To bridge this gap, we propose a novel framework for learning task-agnostic representations transferable to a wide range of sociopragmatic tasks (e.g., emotion, hate speech, humor, sarcasm). Our framework outperforms other contrastive learning methods for both in-domain and out-of-domain data, across both the general and few-shot settings. For example, compared to two popular language models pre-trained with large datasets, our method obtains an improvement of $11.66$ average $F_1$ on $16$ datasets when fine-tuned on only $20$ training samples per dataset. Our method is also language-agnostic, as we demonstrate on four different languages. PDF 4 2022
Towards Robust Visual Question Answering: Making the Most of Biased Samples via Contrastive Learning Models for Visual Question Answering (VQA) often rely on spurious correlations, i.e., language priors, that appear in the biased samples of the training set, which makes them brittle against out-of-distribution (OOD) test data. Recent methods have achieved promising progress in overcoming this problem by reducing the impact of biased samples on model training. However, these models reveal a trade-off: the improvements on OOD data come at the cost of severely degraded performance on the in-distribution (ID) data (which is dominated by the biased samples). Therefore, we propose a novel contrastive learning approach, MMBS, for building robust VQA models by Making the Most of Biased Samples. Specifically, we construct positive samples for contrastive learning by eliminating the information related to spurious correlations from the original training samples, and explore several strategies for using the constructed positive samples for training. Instead of undermining the importance of biased samples in model training, our approach precisely exploits the biased samples for the unbiased information they contain that contributes to reasoning. The proposed method is compatible with various VQA backbones. We validate our contributions by achieving competitive performance on the OOD dataset VQA-CP v2 while preserving robust performance on the ID dataset VQA v2. PDF 6 2022
The Impact of Cross-Lingual Adjustment of Contextual Word Representations on Zero-Shot Transfer We evaluate the effectiveness of an existing approach to cross-lingual adjustment of mBERT using four typologically different languages (Spanish, Russian, Vietnamese, and Hindi) and three NLP tasks (QA, NLI, and NER). The adjustment uses a small parallel corpus to make embeddings of related words across languages similar to each other. It improves NLI in four languages and NER in three languages, while QA performance never improves and sometimes degrades. Analysis of distances between contextualized embeddings of related and unrelated words across languages showed that fine-tuning leads to ``forgetting'' some of the cross-lingual alignment information, which---we conjecture---can negatively affect the effectiveness of the zero-shot transfer. Based on this observation, we further improved performance on NLI using continual learning. Our study contributes to a better understanding of the cross-lingual transfer capabilities of large multilingual language models and of the effectiveness of their cross-lingual adjustment in various NLP tasks. PDF 6 2022
SciFig: A Scientific Figure Dataset for Figure Understanding Most existing large-scale academic search engines are built to retrieve text-based information. However, there are no large-scale retrieval services for non-textual components such as scientific figures and tables. One challenge towards such services is scientific figure understanding that represents visual information by text. A key problem is a lack of datasets containing annotated scientific figures and tables, which can be used for classification, question-answering, and auto-captioning. Here, we design a pipeline that extracts figures and tables from scientific literature and a deep-learning-based framework that classifies scientific figures using visual features. Using this pipeline, we develop the first large-scale annotated corpus, SciFig, consisting of more than 264k scientific figures extracted from $\approx56$k research papers in the ACL Anthology. We make available the SciFig-Pilot dataset that contains 1671 manually labeled scientific figures belonging to 19 different categories. The dataset is publicly accessible at \url{https://bit.ly/3m4u0eq}. PDF 6 2022
APEACH: Attacking Pejorative Expressions with Analysis on Crowd-Generated Hate Speech Evaluation Datasets Detecting toxic or pejorative expressions in online communities has become one of the main concerns for preventing mental harm to users. This has led to the development of large-scale hate speech detection datasets of various domains, which are mainly built upon web-crawled texts with labels by crowd workers. However, for languages other than English, researchers might have to rely on only a small-sized corpus due to the lack of data-driven research on hate speech detection. This sometimes misleads the evaluation of prevalently used pretrained language models (PLMs) such as BERT, given that PLMs often share the domain of the pretraining corpus with the evaluation set, resulting in over-representation of the detection performance. Also, the scope of pejorative expressions might be restricted if the dataset is built on a single-domain text. To alleviate the above problems in Korean hate speech detection, we propose APEACH, a method that allows the collection of hate speech generated by unspecified users. By controlling the crowd-generation of hate speech and adding only minimal post-labeling, we create a corpus that enables the generalizable and fair evaluation of hate speech detection regarding text domain and topic. We compare our outcome with prior work on an annotation-based toxic news comment dataset using publicly available PLMs. We verify that our dataset is less sensitive to the lexical overlap between the evaluation set and the pretraining corpus of PLMs, showing that it helps mitigate the unexpected under/over-representation of model performance. We distribute our dataset publicly online to further facilitate general-domain hate speech detection in Korean. PDF 6 2022
DiscoScore: Evaluating Text Generation with BERT and Discourse Coherence Recently, there has been a growing interest in designing text generation systems from a discourse coherence perspective, e.g., modeling the interdependence between sentences. Still, recent BERT-based evaluation metrics cannot recognize coherence and fail to punish incoherent elements in system outputs. In this work, we introduce DiscoScore, a parametrized discourse metric, which uses BERT to model discourse coherence from different perspectives, driven by Centering theory. Our experiments encompass 16 non-discourse and discourse metrics, including DiscoScore and popular coherence models, evaluated on summarization and document-level machine translation (MT). We find that (i) the majority of BERT-based metrics correlate much worse with human-rated coherence than early discourse metrics invented a decade ago; (ii) the recent state-of-the-art BARTScore is weak when operated at the system level---which is particularly problematic as systems are typically compared in this manner. DiscoScore, in contrast, achieves strong system-level correlation with human ratings, not only in coherence but also in factual consistency and other aspects, and surpasses BARTScore by over 10 correlation points on average. Further, aiming to understand DiscoScore, we justify the importance of discourse coherence for evaluation metrics and explain the superiority of one variant over another. PDF 6 2022
A Lightweight yet Robust Approach to Textual Anomaly Detection Highly imbalanced textual datasets continue to pose a challenge for supervised learning models, especially when the minority class is multi-topical. Viewing such imbalanced text data as an anomaly detection (AD) problem, however, has advantages for certain tasks, such as detecting hate speech or inappropriate and/or offensive language in large social media feeds. There the unwanted content tends to be both rare and non-uniform with respect to its thematic character, and better fits the definition of an anomaly than of a class. Several recent approaches to textual AD use transformer models, achieving good results but with trade-offs in pre-training and inflexibility with respect to new domains. In this paper we compare two linear models within the NMF family, which also have a recent history in textual AD. We introduce a new approach based on an alternative regularization of the NMF objective. Our results surpass other linear AD models and are on par with deep models, performing comparably well even at small anomaly concentrations. PDF 6 2022
Arabic Image Captioning using Pre-training of Deep Bidirectional Transformers Image captioning is the process of automatically generating a textual description of an image. It has a wide range of applications, such as effective image search, auto-archiving, and even helping visually impaired people to see. English image captioning has seen a lot of development lately, while Arabic image captioning is lagging behind. In this paper, we developed and evaluated several Arabic image captioning models with well-established metrics on a public image captioning benchmark. We initialized all models with transformers pre-trained on different Arabic corpora. After initialization, we fine-tuned them with image-caption pairs using a learning method called OSCAR. OSCAR uses object tags detected in images as anchor points to significantly ease the learning of image-text semantic alignments. On the image captioning benchmark, our best performing model scored 0.39, 0.25, 0.15 and 0.092 with BLEU-1,2,3,4 respectively, an improvement over previously published scores of 0.33, 0.19, 0.11 and 0.057. Besides additional evaluation metrics, we complemented our scores with human evaluation on a sample of our output. Our experiments showed that training image captioning models with Arabic captions and English object tags is a viable approach, but that a pure Arabic dataset, with Arabic object tags, would be preferable. PDF 6 2022
AbLit: A Resource for Analyzing and Generating Abridged Versions of English Literature Creating an abridged version of a text involves shortening it while maintaining its linguistic qualities. In this paper, we examine this task from an NLP perspective for the first time. We present a new resource, AbLit, which is derived from abridged versions of English literature books. The dataset captures passage-level alignments between the original and abridged texts. We characterize the linguistic relations of these alignments, and create automated models to predict these relations as well as to generate abridgements for new texts. Our findings establish abridgement as a challenging task, motivating future resources and research. PDF 6 2022
BeamR: Beam Reweighing with Attribute Discriminators for Controllable Text Generation Recent advances in natural language processing have led to the availability of large pre-trained language models (LMs) with rich generative capabilities. Although these models are able to produce fluent and coherent text, it remains a challenge to control various attributes of the generation, including sentiment, formality, topic and many others. We propose a Beam Reweighing (BeamR) method, building on top of standard beam search, in order to control different attributes. BeamR combines any generative LM with any attribute discriminator, offering full flexibility of generation style and attribute, while the beam search backbone maintains fluency across different domains. Notably, BeamR allows practitioners to leverage pre-trained models without the need to train generative LMs together with discriminators. We evaluate BeamR on two diverse tasks: sentiment steering and machine translation formality. Our results show that BeamR performs on par with or better than existing state-of-the-art approaches (including fine-tuned methods), and highlight the flexibility of BeamR in both causal and seq2seq language modeling tasks. PDF 6 2022
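The core rescoring step is easy to sketch. Below is one plausible instantiation, where each beam hypothesis's log-probability is combined log-linearly with an attribute discriminator's probability; the weighting scheme and names are assumptions, not the paper's exact formulation:

```python
# Sketch: reweigh beam-search hypotheses with an attribute discriminator.
# The log-linear combination below is an illustrative assumption.
import math
from typing import Callable, List, Tuple

def reweigh_beam(hypotheses: List[Tuple[str, float]],
                 attr_prob: Callable[[str], float],
                 weight: float = 1.0,
                 beam_size: int = 5) -> List[Tuple[str, float]]:
    """hypotheses: (text, log_prob) pairs from a standard beam-search step;
    attr_prob(text) -> discriminator probability of the desired attribute."""
    rescored = []
    for text, lm_logprob in hypotheses:
        # Add the (log) discriminator probability of the target attribute.
        score = lm_logprob + weight * math.log(max(attr_prob(text), 1e-12))
        rescored.append((text, score))
    rescored.sort(key=lambda x: x[1], reverse=True)
    return rescored[:beam_size]
```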
Digging Errors in NMT: Evaluating and Understanding Model Errors from Partial Hypothesis Space Solid evaluation of neural machine translation (NMT) is key to its understanding and improvement. Current evaluation of an NMT system is usually built upon a heuristic decoding algorithm (e.g., beam search) and an evaluation metric assessing similarity between the translation and a golden reference. However, this system-level evaluation framework is limited by evaluating only one best hypothesis and by the search errors introduced by heuristic decoding algorithms. To better understand NMT models, we propose a novel evaluation protocol, which defines model errors via the model's ranking capability over the hypothesis space. To tackle the problem of the exponentially large space, we propose two approximation methods: top region evaluation, along with an exact top-$k$ decoding algorithm that finds the top-ranked hypotheses in the whole hypothesis space, and Monte Carlo sampling evaluation, which simulates the hypothesis space from a broader perspective. To quantify errors, we define NMT model errors by measuring the distance between the hypothesis array ranked by the model and the ideally ranked hypothesis array. After confirming the strong correlation with human judgment, we apply our evaluation to various NMT benchmarks and model architectures. We show that state-of-the-art Transformer models face serious ranking issues and perform only at the random-chance level in the top region. We further analyze model errors on architectures with different depths and widths, as well as different data-augmentation techniques, showing how these factors affect model errors. Finally, we connect model errors with the search algorithms and report findings on the inductive bias of beam search and its correlation with Minimum Bayes Risk (MBR) decoding. PDF 6 2022
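As one illustration, the distance between the model's ranking of a hypothesis set and an ideal ranking can be instantiated with a standard ranking distance such as Kendall's tau; the paper defines its own measure, so this particular choice is an assumption:

```python
# Sketch: quantify ranking error as disagreement between the model's scores
# and ideal (e.g., metric-based) scores over the same hypothesis set.
# Kendall's tau is an illustrative choice of ranking distance.
from scipy.stats import kendalltau

def ranking_error(model_scores, ideal_scores):
    """Both arguments score the same hypotheses, in the same order."""
    tau, _ = kendalltau(model_scores, ideal_scores)
    return 1.0 - tau   # 0 = perfect agreement; larger = worse ranking
```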
Enhanced Biomedical Knowledge Discovery From Unstructured Text Using Contextual Embeddings Extracting knowledge from large, unstructured text corpora presents a challenge. Recently, authors have utilized unsupervised, static word embeddings to uncover "latent knowledge" contained within domain-specific scientific corpora. Here, semantic-similarity measures between representations of concepts, objects or entities were used to predict relationships, which were later verified using physical methods. Static language models have recently been surpassed at most downstream tasks by massively pre-trained, contextual language models like BERT. Some have postulated that contextualized embeddings potentially yield word representations superior to static ones for knowledge-discovery purposes. In an effort to address this question, two biomedically-trained BERT models (BioBERT, SciBERT) were used to encode $n$ = 500, 1000 or 5000 sentences containing words of interest extracted from a biomedical corpus (Coronavirus Open Research Dataset). The $n$ representations for the words of interest were subsequently extracted and then aggregated to yield static-equivalent word representations. These words belonged to the vocabularies of intrinsic benchmarking tools for the biomedical domain (Bio-SimVerb and Bio-SimLex), which assess the quality of word representations using semantic-similarity and relatedness measures. Using intrinsic benchmarking tasks, the feasibility of using contextualized word representations for knowledge discovery tasks can be assessed: word representations that better encode described reality are expected to perform better (i.e., closer to domain experts). As postulated, BERT embeddings outperform their static counterparts at both the verb and noun benchmarks; however, performance varies by model, and neither model outperforms static models at both tasks. Moreover, unique performance characteristics emerge when the task vocabulary is split between BERT-native words and words requiring sub-word decomposition. PDF 6 2022
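A rough sketch of the aggregation step follows, assuming mean pooling over the target word's sub-tokens and over its occurrences; the model name and the sub-token matching strategy are illustrative, since the study used BioBERT/SciBERT with its own protocol:

```python
# Sketch: derive a "static-equivalent" vector for a word by averaging its
# contextual representations over many sentences. Model and pooling choices
# here are assumptions for illustration.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

def static_equivalent(word, sentences):
    target = tokenizer(word, add_special_tokens=False)["input_ids"]
    vecs = []
    for sent in sentences:
        enc = tokenizer(sent, return_tensors="pt", truncation=True)
        ids = enc["input_ids"][0].tolist()
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]   # (seq_len, dim)
        # Find each occurrence of the word's sub-token sequence and
        # average the contextual vectors of its pieces.
        n = len(target)
        for i in range(len(ids) - n + 1):
            if ids[i:i + n] == target:
                vecs.append(hidden[i:i + n].mean(dim=0))
    assert vecs, "word not found in any sentence"
    # Aggregate across occurrences to obtain one static-style vector.
    return torch.stack(vecs).mean(dim=0)
```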
Improving Graph Clustering with Multi-Granularity Debiased Contrastive Learning Recently, deep graph clustering has achieved significant success by utilizing both the node attribute features and the graph structure information. However, the existing methods still have some limitations: (1) they lack a flexible mechanism to fuse multi-granularity information learned from different views; (2) they introduce noisy positive-negative sample pairs, which reduces model performance. To tackle these problems, we propose a debiased contrastive learning framework, DCL-MGI, which integrates the multi-granularity information of graph data. Specifically, two contrastive learning modules are constructed to capture multi-granularity feature information at the node level and the graph level, respectively. Meanwhile, an adaptive strategy of fusing stable graph structure information and node representations is proposed to select unbiased contrastive sample pairs, which reduces the number of false-negative samples. Furthermore, we utilize a temporal entropy metric to evaluate the sample quality under each view and let the two independent contrastive learning modules communicate in a collaborative training manner. Experimental results on six real-world datasets demonstrate that our proposed framework outperforms state-of-the-art methods on the graph clustering task. PDF 6 2022
Revisiting Transformer-based Models for Long Document Classification The recent literature in text classification is biased towards short text sequences (e.g., sentences or paragraphs). In real-world applications, multi-page multi-paragraph documents are common and they cannot be efficiently encoded by vanilla Transformer-based models. We compare different Transformer-based Long Document Classification approaches that aim to mitigate the computational overhead of vanilla transformers to encode much longer text, namely sparse attention and hierarchical encoding methods. We examine several aspects of sparse attention (e.g., the size of the local attention window, the use of global attention) and hierarchical (e.g., the document splitting strategy) transformers on four document classification datasets covering different domains. We observe a clear benefit from being able to process longer text, and, based on our results, we derive practical advice for applying Transformer-based models to long document classification tasks. PDF 6 2022
LoRaLay: A Multilingual and Multimodal Dataset for Long Range and Layout-Aware Summarization Text Summarization is a popular task and an active area of research for the Natural Language Processing community. By definition, it requires accounting for long input texts, a characteristic which poses computational challenges for neural models. Moreover, real-world documents come in a variety of complex, visually-rich layouts. This information is of great relevance, whether to highlight salient content or to encode long-range interactions between textual passages. Yet, all publicly available summarization datasets only provide plain text content. To facilitate research on how to exploit visual/layout information to better capture long-range dependencies in summarization models, we present LoRaLay, a collection of datasets for long-range summarization with accompanying visual/layout information. We extend existing and popular English datasets (arXiv and PubMed) with layout information and propose four novel datasets -- consistently built from scholarly resources -- covering French, Spanish, Portuguese, and Korean languages. Further, we propose new baselines merging layout-aware and long-range models -- two orthogonal approaches -- and obtain state-of-the-art results, showing the importance of combining both lines of research. PDF 6 2022
Towards Linguistically Robust NLG Systems for Localized Virtual Assistants One of the biggest challenges in localizing the natural language generation of virtual assistants like Alexa, the Google Assistant, or Siri to many languages is the proper handling of entities. Neural machine translation systems may translate entities literally, or introduce grammar mistakes by using the wrong inflections. The diversity of linguistic phenomena for entities across all languages is vast, yet ensuring grammatical correctness for a broad diversity of entities is critical -- native speakers may find entity-related grammatical errors silly, jarring, or even offensive. To assess linguistic robustness, we create a multilingual corpus of linguistically significant entities annotated by expert linguists. We also share a simple algorithm for leveraging this corpus to produce linguistically diverse training and evaluation datasets. Using the Schema-Guided Dialog Dataset (DSTC8) as a test bed, we collect human translations for a subset of linguistically boosted examples to establish quality baselines for neural, template-based, and hybrid NLG systems in French (high-resource), Marathi (low-resource), and Russian (a highly inflected language). We make our corpus and the derived translation-based datasets available for further research. PDF 6 2022
Topic Modelling with Topological Data Analysis Recent unsupervised topic modelling approaches that use clustering techniques on word, token or document embeddings can extract coherent topics. A common limitation of such approaches is that they reveal nothing about inter-topic relationships which are essential in many real-world application domains. We present an unsupervised topic modelling method which harnesses Topological Data Analysis (TDA) to extract a topological skeleton of the manifold upon which contextualised word embeddings lie. We demonstrate that our approach, which performs on par with a recent baseline, is able to construct a network of coherent topics together with meaningful relationships between them. PDF 6 2022
Identifying Weaknesses in Machine Translation Metrics Through Minimum Bayes Risk Decoding: A Case Study for COMET Neural metrics have achieved impressive correlation with human judgements in the evaluation of machine translation systems, but before we can safely optimise towards such metrics, we should be aware of (and ideally eliminate) biases toward bad translations that receive high scores. Our experiments show that sample-based Minimum Bayes Risk decoding can be used to explore and quantify such weaknesses. When applying this strategy to COMET for en$\rightarrow$de and de$\rightarrow$en, we find that COMET models are not sensitive enough to discrepancies in numbers and named entities. We further show that these biases are hard to fully remove by simply training on additional synthetic data. PDF 6 2022
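Sample-based MBR decoding itself is straightforward to sketch: draw samples from the model and pick the candidate with the highest average utility against the other samples, where the utility in the paper's setting would be COMET; the function below is a generic placeholder, not the paper's implementation:

```python
# Sketch: sample-based Minimum Bayes Risk (MBR) decoding. `utility` stands
# in for a learned metric such as COMET; its name is an assumption.
from typing import Callable, List

def mbr_select(samples: List[str],
               utility: Callable[[str, str], float]) -> str:
    """Return the sample with maximal expected utility over the sample pool."""
    best, best_score = samples[0], float("-inf")
    for cand in samples:
        # Expected utility of `cand`, using the other samples as pseudo-references.
        score = sum(utility(cand, ref) for ref in samples if ref is not cand)
        score /= max(len(samples) - 1, 1)
        if score > best_score:
            best, best_score = cand, score
    return best
```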
LongtoNotes: OntoNotes with Longer Coreference Chains OntoNotes has served as the most important benchmark for coreference resolution. However, for ease of annotation, several long documents in OntoNotes were split into smaller parts. In this work, we build a corpus of coreference-annotated documents of significantly longer length than what is currently available. We do so by providing an accurate, manually-curated merging of annotations from documents that were split into multiple parts in the original OntoNotes annotation process. The resulting corpus, which we call LongtoNotes, contains documents in multiple genres of the English language with varying lengths, the longest of which are up to $8$x the length of documents in OntoNotes, and $2$x those in LitBank. We evaluate state-of-the-art neural coreference systems on this new corpus, analyze the relationships between model architectures/hyperparameters and document length with respect to the performance and efficiency of the models, and demonstrate areas of improvement in long-document coreference modelling revealed by our new corpus. PDF 6 2022
How poor is the stimulus? Evaluating hierarchical generalization in neural networks trained on child-directed speech When acquiring syntax, children consistently choose hierarchical rules over competing non-hierarchical possibilities. Is this preference due to a learning bias for hierarchical structure, or due to more general biases that interact with hierarchical cues in children's linguistic input? We explore these possibilities by training LSTMs and Transformers--two types of neural networks without a hierarchical bias--on data similar in quantity and content to children's linguistic input: text from the CHILDES corpus. We then evaluate what these models have learned about English yes/no questions, a phenomenon for which hierarchical structure is crucial. We find that, though they perform well at capturing the surface statistics of child-directed speech (as measured by perplexity), both model types generalize in a way more consistent with an incorrect linear rule than the correct hierarchical rule. These results suggest that human-like generalization from text alone requires stronger biases than the general sequence-processing biases of standard neural network architectures. PDF 6 2022
Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings Learning scientific document representations can be substantially improved through contrastive learning objectives, where the challenge lies in creating positive and negative training samples that encode the desired similarity semantics. Prior work relies on discrete citation relations to generate contrast samples. However, discrete citations enforce a hard cut-off to similarity. This is counter-intuitive to similarity-based learning and ignores that scientific papers can be very similar despite lacking a direct citation - a core problem of finding related research. Instead, we use controlled nearest neighbor sampling over citation graph embeddings for contrastive learning. This control allows us to learn continuous similarity, to sample hard-to-learn negatives and positives, and also to avoid collisions between negative and positive samples by controlling the sampling margin between them. The resulting method SciNCL outperforms the state-of-the-art on the SciDocs benchmark. Furthermore, we demonstrate that it can train (or tune) language models sample-efficiently and that it can be combined with recent training-efficient methods. Perhaps surprisingly, even training a general-domain language model this way outperforms baselines pretrained in-domain. PDF 6 2022
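A small sketch of the controlled nearest-neighbour sampling idea follows, assuming positives are drawn from a near band of the neighbour ranking and hard negatives from a farther band, with skipped ranks in between acting as the collision-avoiding margin; the band boundaries here are arbitrary placeholders:

```python
# Sketch: controlled nearest-neighbour sampling over citation-graph
# embeddings. Positives come from a near rank band, hard negatives from a
# farther band; the gap between bands is the sampling margin.
import numpy as np

def sample_contrast_pair(query_vec, corpus_vecs, rng,
                         pos_band=(1, 5), neg_band=(20, 25)):
    """corpus_vecs: (N, d) pre-normalised embeddings; rank 0 is the query itself."""
    sims = corpus_vecs @ query_vec        # cosine-style similarity
    order = np.argsort(-sims)             # nearest first
    positive = rng.choice(order[pos_band[0]:pos_band[1]])
    hard_negative = rng.choice(order[neg_band[0]:neg_band[1]])
    return int(positive), int(hard_negative)

# Usage: rng = np.random.default_rng(0); sample_contrast_pair(q, M, rng)
```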
Contrastive Learning with Graph Context Modeling for Sparse Knowledge Graph Completion Knowledge Graph Embeddings (KGE) aim to map entities and relations to high-dimensional spaces and have become the de-facto standard for knowledge graph completion. Most existing KGE methods suffer from the sparsity challenge: it is harder to predict entities that appear less frequently in knowledge graphs. In this work, we propose a novel framework, KRACL, to alleviate the widespread sparsity in KGs with graph context and contrastive learning. First, we propose the Knowledge Relational Attention Network (KRAT) to leverage the graph context by jointly aggregating neighbors and relations with the attention mechanism. KRAT is capable of capturing the subtle importance of different context triples and leveraging multi-hop information in knowledge graphs. Second, we propose the knowledge contrastive loss, which combines the contrastive loss with the cross-entropy loss, introducing more negative samples and thus enriching the feedback to sparse entities. Our experiments demonstrate that KRACL achieves superior results across various standard knowledge graph benchmarks, especially on WN18RR and NELL-995, which have many low in-degree entities. Extensive experiments also bear out KRACL's effectiveness in handling sparse knowledge graphs and its robustness against noisy triples. PDF 6 2022
LPC: A Logits and Parameter Calibration Framework for Continual Learning When we execute the typical fine-tuning paradigm on continuously arriving sequential tasks, the model suffers from the catastrophic forgetting problem (i.e., it forgets the parameters learned on previous tasks when trained on newly emerged tasks). Existing replay-based methods need extra storage for old data in order to update the parameters of the previous classifier and overcome catastrophic forgetting. Our work aims to achieve sequential/continual learning of knowledge without accessing the old data. The core idea is to calibrate the parameters and logits (outputs) so that preserving old parameters and generalized learning on new concepts can be achieved simultaneously. Our proposed framework includes two major components: Logits Calibration (LC) and Parameter Calibration (PC). LC focuses on calibrating the learning of the new model against the old model, while PC aims to preserve the parameters of the old model. These two operations maintain old knowledge while learning new tasks, without storing previous data. We run experiments on 9 scenarios of the GLUE (General Language Understanding Evaluation) benchmark. The experimental results show that our model achieves state-of-the-art performance on all scenarios. PDF 6 2022
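The abstract does not give the LC/PC formulations, so the sketch below only illustrates a plausible shape of such a loss, assuming a distillation-style KL term against the old model's logits and an L2 pull toward the old parameters; the paper's actual calibration terms may differ:

```python
# Sketch of a two-part calibration loss, under assumed forms: LC as a KL
# term toward the old model's outputs, PC as an L2 pull toward old weights.
import torch
import torch.nn.functional as F

def lpc_style_loss(new_logits, old_logits, task_loss,
                   new_params, old_params, kl_w=1.0, l2_w=0.01):
    # Logits Calibration (assumed form): keep new outputs near old outputs.
    lc = F.kl_div(F.log_softmax(new_logits, dim=-1),
                  F.softmax(old_logits, dim=-1), reduction="batchmean")
    # Parameter Calibration (assumed form): keep new weights near old weights.
    pc = sum(((p - q.detach()) ** 2).sum()
             for p, q in zip(new_params, old_params))
    return task_loss + kl_w * lc + l2_w * pc
```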
Improving Faithfulness by Augmenting Negative Summaries from Fake Documents Current abstractive summarization systems tend to hallucinate content that is unfaithful to the source document, posing a risk of misinformation. To mitigate hallucination, we must teach the model to distinguish hallucinated summaries from faithful ones. However, the commonly used maximum likelihood training does not disentangle factual errors from other model errors. To address this issue, we propose a back-translation-style approach to augment negative samples that mimic factual errors made by the model. Specifically, we train an elaboration model that generates hallucinated documents given the reference summaries, and then generate negative summaries from the fake documents. We incorporate the negative samples into training through a controlled generator. Additionally, we find that adding textual entailment data through multitasking further boosts the performance. Experiments on three datasets show that our method consistently improves faithfulness without sacrificing informativeness according to both human and automatic evaluation. PDF 6 2022
BIC: Twitter Bot Detection with Text-Graph Interaction and Semantic Consistency Twitter bot detection is an important and meaningful task. Existing bot detection methods use either text modality to detect bots with anomalies in tweet patterns or graph modality to detect bots with abnormal clustering information. They do not allow text and graph modalities to interact with each other, which fails to learn the relative importance of the two modalities. As a result, these methods struggle to detect bots comprehensively. Besides, existing methods ignore the potential consistency within users' semantic information. In this paper, we propose a novel model named BIC that makes the text and graph modalities interactive. BIC also detects semantic consistency within tweet content. Specifically, BIC contains a text propagation module to learn text information, a graph propagation module to learn neighborhood information, and a text-graph interactive module to make the two interact. Besides, BIC contains a semantic consistency detection module to learn semantic consistency information from tweets. Extensive experiments demonstrate that our framework outperforms competitive baselines on a comprehensive Twitter bot benchmark. We also prove the effectiveness of the proposed interaction and semantic consistency detection. PDF 6 2022
Zombies Eat Brains, You are Safe: A Knowledge Infusion based Multitasking System for Sarcasm Detection in Meme In this paper, we hypothesize that sarcasm detection is closely associated with the emotion present in a meme. We therefore propose a deep multitask model that performs these two tasks in parallel, where sarcasm detection is treated as the primary task and emotion recognition as an auxiliary task. We create a large-scale dataset consisting of 7416 memes in Hindi, one of the most widely spoken languages. We collect the memes from various domains, such as politics, religion, racism, and sexism, and manually annotate each instance with three sarcasm categories, i.e., (i) Not Sarcastic, (ii) Mildly Sarcastic or (iii) Highly Sarcastic, and 13 fine-grained emotion classes. Furthermore, we propose a novel Knowledge Infusion (KI) based module which captures sentiment-aware representations from a model trained on the Memotion 2.0 dataset. Detailed empirical evaluation shows that the multitasking model performs better than the single-task model. We also show that adding this KI module on top of our model can boost the performance of sarcasm detection in both single-task and multi-task settings even further. PDF 6 2022
Plug-and-Play Knowledge Injection for Pre-trained Language Models Injecting external knowledge can improve the performance of pre-trained language models (PLMs) in various downstream NLP tasks. However, current knowledge injection methods usually require knowledge-aware pre-training or fine-tuning, which makes the knowledge-enhanced models strongly coupled to specific knowledge bases. Toward flexible knowledge injection, we explore a new paradigm, plug-and-play knowledge injection, which decouples models from knowledge bases. Correspondingly, we propose a plug-and-play injection method, \textit{map-tuning}, which trains a mapping of knowledge embeddings to enrich model inputs with mapped embeddings while keeping PLMs frozen. Experimental results on two typical knowledge-driven NLP tasks show that map-tuning effectively improves the performance of PLMs with little computational cost. Specifically, one mapping network can be plugged into various downstream tasks without any additional training, and one downstream model can work with multiple mapping networks of different knowledge bases in order to adapt to different domains. We will release all the code and models of this paper. PDF 6 2022
Counterfactual Decoding for Anti-Hallucination Knowledge-grounded Dialogue Generation The task of Knowledge-grounded Dialogue (KGD) generation, which intentionally invokes external knowledge resources to produce natural and informative responses, has been a popular topic in recent years. Empowered by large-scale pretrained language models, existing methods have demonstrated impressive performance on this task. However, hallucination remains a serious problem, causing unpredictable factual errors in the generated results. Although existing efforts try to alleviate this phenomenon through data pre-processing or fact-checking, these methods still heavily rely on assistance from external tools or resources. Inspired by counterfactual reasoning, we propose a lightweight and independent anti-hallucination mechanism for KGD based on a causal effect analysis. Benchmark and human evaluation results for our example implementation show that our method can significantly reduce hallucination without degrading model performance. We hope our efforts can draw more attention to utilizing causal inference to solve related issues. PDF 6 2022
TaHiD: Tackling Data Hiding in Fake News Detection with News Propagation Networks Fake news with detrimental societal effects has attracted extensive attention and research. Despite early success, the state-of-the-art methods fall short of considering the propagation of news. News propagates at different times through different mediums, including users, comments, and sources, which form the news propagation network. Moreover, the serious problem of data hiding arises, which means that fake news publishers disguise fake news as real to confuse users by deleting comments that refute the rumor or deleting the news itself when it has been spread widely. Existing methods do not consider the propagation of news and fail to identify what matters in the process, which leads to fake news hiding in the propagation network and escaping from detection. Inspired by the propagation of news, we propose a novel fake news detection framework named TaHiD, which models the propagation as a heterogeneous dynamic graph and contains the propagation attention module to measure the influence of different propagation. Experiments demonstrate that TaHiD extracts useful information from the news propagation network and outperforms state-of-the-art methods on several benchmark datasets for fake news detection. Additional studies also show that TaHiD is capable of identifying fake news in the case of data hiding. PDF 6 2022
Calibrating Zero-shot Cross-lingual (Un-)structured Predictions We investigate model calibration in the setting of zero-shot cross-lingual transfer with large-scale pre-trained language models. The level of model calibration is an important metric for evaluating the trustworthiness of predictive models, and there is an essential need for model calibration when natural language models are deployed in critical tasks. We study different post-training calibration methods in structured and unstructured prediction tasks. We find that models trained with data from the source language become less calibrated when applied to the target language, and that calibration errors increase with intrinsic task difficulty and the relative sparsity of training data. Moreover, we observe a potential connection between the level of calibration error and an earlier proposed measure of the distance from English to other languages. Finally, our comparison demonstrates that, among the evaluated methods, Temperature Scaling (TS) and Gaussian Process Calibration (GPcalib) generalize well to distant languages, but TS fails to calibrate more complex confidence estimates in structured predictions. PDF 6 2022
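Temperature Scaling, one of the compared methods, is simple enough to sketch: a single scalar T is fit on held-out logits by minimizing negative log-likelihood, and test logits are then divided by T before the softmax. A minimal sketch:

```python
# Sketch: standard temperature scaling. A single scalar T is learned on
# validation logits; test logits are divided by T before the softmax.
import torch

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """logits: (N, C) held-out logits; labels: (N,) gold class indices."""
    log_t = torch.zeros(1, requires_grad=True)   # optimise log T so T stays positive
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def closure():
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().item()
```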
DialogConv: A Lightweight Fully Convolutional Network for Multi-view Response Selection Current end-to-end retrieval-based dialogue systems are mainly based on Recurrent Neural Networks or Transformers with attention mechanisms. Although promising results have been achieved, these models often suffer from slow inference or a huge number of parameters. In this paper, we propose a novel lightweight fully convolutional architecture, called DialogConv, for response selection. DialogConv is exclusively built on top of convolution to extract matching features of context and response. Dialogues are modeled in 3D views, where DialogConv performs convolution operations on the embedding view, word view and utterance view to capture richer semantic information from multiple contextual views. On four benchmark datasets, compared with state-of-the-art baselines, DialogConv is on average about 8.5x smaller in size, and 79.39x and 10.64x faster on CPU and GPU devices, respectively. At the same time, DialogConv achieves competitive effectiveness in response selection. PDF 6 2022
Modeling Context With Linear Attention for Scalable Document-Level Translation Document-level machine translation leverages inter-sentence dependencies to produce more coherent and consistent translations. However, these models, predominantly based on transformers, are difficult to scale to long documents as their attention layers have quadratic complexity in the sequence length. Recent efforts on efficient attention improve scalability, but their effect on document translation remains unexplored. In this work, we investigate the efficacy of a recent linear attention model by Peng et al. (2021) on document translation and augment it with a sentential gate to promote a recency inductive bias. We evaluate the model on IWSLT 2015 and OpenSubtitles 2018 against the transformer, demonstrating substantially increased decoding speed on long sequences with similar or better BLEU scores. We show that sentential gating further improves translation quality on IWSLT. PDF 6 2022
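For context, a minimal non-causal linear-attention sketch is shown below, using the common elu+1 feature map; Peng et al.'s model and the sentential gate differ in the details. Keys and values are summarised once, so cost grows linearly rather than quadratically with sequence length:

```python
# Sketch: non-causal, single-head linear attention with the elu+1 feature
# map. Keys/values are summarised before interacting with queries, giving
# O(T) cost in the sequence length T instead of O(T^2).
import torch
import torch.nn.functional as F

def linear_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
    """q, k: (T, d); v: (T, d_v)."""
    phi_q = F.elu(q) + 1.0                 # positive feature maps
    phi_k = F.elu(k) + 1.0
    kv = phi_k.t() @ v                     # (d, d_v): key-value summary
    z = phi_q @ phi_k.sum(dim=0)           # (T,): per-query normaliser
    return (phi_q @ kv) / (z.unsqueeze(-1) + 1e-6)
```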
FineDeb: A Debiased Finetuning Approach for Language Models As language models are increasingly included in human-facing machine learning tools, bias against demographic subgroups has gained attention. We consider the problem of debiasing in language models. Rather than modifying a model's already learned representations, we focus on modifying them during model training itself. We propose a two-phase methodology (FineDeb) that starts with contextual debiasing of the embeddings learned by the language model during training, and then finetunes the model on the original language modelling objective. We apply our method to debias for demographics with multiple classes, demonstrating its effectiveness through extensive experiments and comparison with state-of-the-art techniques on three metrics. PDF 6 2022
Geometry-Aware Supertagging with Heterogeneous Dynamic Convolutions The syntactic categories of categorial grammar formalisms are structured units made of smaller, indivisible primitives, bound together by the underlying grammar’s category formation rules. In the trending approach of constructive supertagging, neural models are increasingly made aware of the internal category structure, which in turn enables them to more reliably predict rare and out-of-vocabulary categories, with significant implications for grammars previously deemed too complex to find practical use. In this work, we revisit constructive supertagging from a graph-theoretic perspective, and propose a framework based on heterogeneous dynamic graph convolutions, aimed at exploiting the distinctive structure of a supertagger’s output space. We test our approach on a number of categorial grammar datasets spanning different languages and grammar formalisms, achieving substantial improvements over previous state of the art scores. PDF 6 2022
Improving Textual Adversarial Attacks using Metric-Guided Rewrite and Rollback Adversarial examples are helpful for analyzing and improving the robustness of classifiers. Generating high-quality adversarial examples is a challenging task, as it requires generating adversarial sentences that are fluent, semantically similar to the original ones, and lead to misclassification. Existing methods prioritize misclassification by maximizing each perturbation's effectiveness at misleading the classifier; thus, the generated adversarial examples fall short in terms of fluency and similarity. In this paper, we define a critique score that synthesizes fluency, similarity, and misclassification metrics. We propose a rewrite and rollback (R&R) framework, guided by the optimization of this score, to improve adversarial attacks. R&R generates high-quality adversarial examples by allowing the exploration of perturbations that have no immediate impact on misclassification, while still optimizing the critique score for better fluency and similarity. We evaluate our method on 5 representative datasets and 3 classifier architectures. Our method outperforms the current state-of-the-art in attack success rate by +16.2%, +12.8%, and +14.0% on the three classifiers, respectively. All code and results will be publicly available. PDF 6 2022
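A hypothetical sketch of such a critique score, assuming a simple product of the three component metrics (the paper defines its own synthesis, and the component scorers are placeholders):

```python
# Sketch: a critique score that synthesises fluency, similarity, and
# misclassification. The product form and component names are assumptions.
from typing import Callable

def critique_score(adv: str, orig: str,
                   fluency: Callable[[str], float],          # e.g. LM-based, in [0, 1]
                   similarity: Callable[[str, str], float],  # e.g. embedding cosine, in [0, 1]
                   misclassified: Callable[[str], float]     # victim's prob. of a wrong label
                   ) -> float:
    # Higher is better on all three axes simultaneously.
    return fluency(adv) * similarity(adv, orig) * misclassified(adv)
```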
Augmentor or Filter? Reconsider the Role of Pre-trained Language Model in Text Classification Augmentation Text augmentation is one of the most effective techniques for addressing the critical problem of insufficient data in text classification. Existing text augmentation methods achieve promising performance in few-shot text data augmentation. However, these methods usually lead to performance degeneration on public datasets due to poor-quality augmentation instances. Our study shows that, even when employing pre-trained language models, existing text augmentation methods generate numerous low-quality instances and cause a feature-space shift in the augmentation instances. We note, however, that a pre-trained language model is good at finding low-quality instances provided that it has been fine-tuned on the target dataset. To alleviate the feature-space shift and performance degeneration of existing text augmentation methods, we propose BOOSTAUG, which reconsiders the role of the language model in text augmentation and emphasizes augmentation instance filtering rather than generation. We evaluate BOOSTAUG on both sentence-level text classification and aspect-based sentiment classification. The experimental results on seven commonly used text classification datasets show that our augmentation method obtains state-of-the-art performance. Moreover, BOOSTAUG is a flexible framework; we release the code, which can help improve existing augmentation methods. PDF 6 2022
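The filtering idea can be sketched as follows, assuming augmented instances are kept only when a classifier fine-tuned on the target dataset still assigns the original label with high confidence; the threshold and interface are assumptions, not BOOSTAUG's exact criterion:

```python
# Sketch: use a fine-tuned classifier as a filter over augmented instances,
# discarding those whose original label it no longer supports.
import torch

def filter_augmented(instances, labels, classify, threshold: float = 0.8):
    """classify(text) -> (C,) probability tensor from the fine-tuned model."""
    kept = []
    for text, label in zip(instances, labels):
        with torch.no_grad():
            probs = classify(text)
        # Low confidence in the original label suggests a low-quality,
        # feature-space-shifted augmentation; drop it.
        if probs[label].item() >= threshold:
            kept.append((text, label))
    return kept
```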
Harvesting Mature Relation Extraction Models from Limited Seed Knowledge: A Self-Development Framework for DS Rule Expansion Distantly-supervised relation extraction (DSRE) is an effective method to scale relation extraction (RE) to large unlabeled corpora by utilizing knowledge bases (KBs), but it suffers from the limited scale of KBs and the noise they introduce. To alleviate these two problems, we propose a novel framework called Self-development Rule Expansion (SOUP), which starts from a limited amount of labeled data and continuously produces low-noise labels on large-scale unlabeled data via a growing set of learnable logical rules. Specifically, SOUP achieves a mutual enhancement of the RE model and the logical rules set: first, a RE model is trained on the labeled data to summarize the knowledge; then, the knowledge is utilized to explore candidate rules from unlabeled data; finally, high-quality candidates are selected in a graph-based ranking manner to extend the logical rules set, and newly rule-labeled data are provided for better RE model training. Experiments on the wiki20 dataset demonstrate that, with limited seed knowledge from small-scale manually labeled data, SOUP achieves significant improvement over baselines by producing continuous growth of both the logical rules and the RE model, and that the labeling noise of SOUP is much lower than that of DS. Furthermore, a RE model enhanced by SOUP with 1.6k logical rules learned from prior knowledge achieves performance equivalent to a model trained on data labeled in the DS manner with 72k relational facts from KBs. PDF 6 2022
Uncertainty Quantification with Pre-trained Language Models: A Large-Scale Empirical Analysis Pre-trained language models (PLMs) have gained increasing popularity due to their compelling prediction performance in diverse natural language processing (NLP) tasks. When formulating a PLM-based prediction pipeline for NLP tasks, it is also crucial for the pipeline to minimize the calibration error, especially in safety-critical applications. That is, the pipeline should reliably indicate when we can trust its predictions. In particular, there are various considerations behind the pipeline: (1) the choice and (2) the size of the PLM, (3) the choice of uncertainty quantifier, (4) the choice of fine-tuning loss, and many more. Although prior work has looked into some of these considerations, it usually draws conclusions from a limited scope of empirical studies. A holistic analysis of how to compose a well-calibrated PLM-based prediction pipeline is still lacking. To fill this void, we compare a wide range of popular options for each consideration based on three prevalent NLP classification tasks and a domain-shift setting. In response, we recommend the following: (1) use ELECTRA for PLM encoding, (2) use larger PLMs if possible, (3) use Temp Scaling as the uncertainty quantifier, and (4) use Focal Loss for fine-tuning. PDF 6 2022
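Of these recommendations, the Focal Loss fine-tuning objective is easy to make concrete; gamma below is the common default, an assumption for this sketch:

```python
# Sketch: focal loss for classification fine-tuning. It down-weights easy,
# confident examples via the (1 - p_t)^gamma modulating factor.
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, labels: torch.Tensor,
               gamma: float = 2.0) -> torch.Tensor:
    """logits: (N, C); labels: (N,) gold class indices."""
    log_probs = F.log_softmax(logits, dim=-1)
    log_pt = log_probs.gather(1, labels.unsqueeze(1)).squeeze(1)  # log p(gold)
    pt = log_pt.exp()
    return (-((1.0 - pt) ** gamma) * log_pt).mean()
```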
A Minimal Model for Compositional Generalization on gSCAN Whether neural networks are capable of compositional generalization has been a topic of much debate. Most previous studies on this subject investigate the generalization capabilities of state-of-the-art deep learning architectures. We here take a more bottom-up approach and design a minimal model that displays generalization on a compositional benchmark, namely, the gSCAN dataset. The model is a hybrid architecture that combines layers trained with gradient descent and a selective attention mechanism optimized with an evolutionary strategy. The architecture has around 60 times fewer trainable parameters than models previously tested on gSCAN, and achieves comparable accuracies on most test splits, even when trained only on a fraction of the dataset. On adverb to verb generalization accuracy, it outperforms previous approaches by 65 to 86\%. Through ablation studies, neuron pruning, and error analyses, we show that weight decay and attention mechanisms facilitate compositional generalization by encouraging sparse representations divorced from irrelevant context. We find that the model's sample efficiency can mainly be attributed to its selective attention mechanism. PDF 6 2022
Measuring Context-Word Biases in Lexical Semantic Datasets State-of-the-art pretrained contextualized models (PCMs), e.g., BERT, use tasks such as WiC and WSD to evaluate their word-in-context representations. This inherently assumes that performance in these tasks reflects how well a model represents the coupled word and context semantics. We question this assumption by presenting the first quantitative analysis of the context-word interaction being tested in major contextual lexical semantic tasks. To achieve this, we run probing baselines on masked input, and propose measures to calculate and visualize the degree of context or word bias in existing datasets. The analysis is performed on both models and humans. Our findings demonstrate that models are usually not being tested for word-in-context semantics in the same way as humans are in these tasks, which helps us better understand the model-human gap. Specifically, for PCMs, most existing datasets fall into the extreme ends (the retrieval-based tasks exhibit strong target-word bias while WiC-style tasks and WSD show strong context bias); in comparison, humans are less biased and achieve much better performance when both word and context are available than with masked input. We recommend our framework for understanding and controlling these biases in model interpretation and future task design. PDF 6 2022
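The masking-based probing setup can be sketched abstractly: run the same probe on the full input, on input with the target word masked, and on input with the context masked, then compare accuracies; the ratio-based measures below are illustrative placeholders, not the paper's exact definitions:

```python
# Sketch: estimate context vs. word bias by re-running one probe on three
# input conditions and comparing its accuracies. All names are placeholders.
def bias_measures(eval_probe, examples):
    """eval_probe(list of (word, context) pairs) -> accuracy vs. gold labels,
    which the probe is assumed to hold internally."""
    full = eval_probe([(w, c) for w, c in examples])
    word_only = eval_probe([(w, "[MASK]") for w, _ in examples])
    context_only = eval_probe([("[MASK]", c) for _, c in examples])
    # If the context-only probe nearly matches the full-input probe, the task
    # is solvable without the target word (context bias), and vice versa.
    return {"context_bias": context_only / full, "word_bias": word_only / full}
```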
Ered: Enhanced Text Representations with Entities and Descriptions External knowledge, e.g., entities and entity descriptions, can help humans understand texts. Many works have explored including external knowledge in pre-trained models. These methods generally design pre-training tasks and implicitly introduce knowledge by updating model weights, or alternatively use it directly together with the original text. Though effective, there are some limitations. On the one hand, the injection is implicit, and only model weights receive attention while the pre-trained entity embeddings are ignored. On the other hand, entity descriptions may be lengthy, and inputting them into the model together with the original text may distract the model's attention. This paper aims to explicitly include both entities and entity descriptions in the fine-tuning stage. First, the pre-trained entity embeddings are fused with the original text representation and updated by the backbone model layer by layer. Second, descriptions are represented by a knowledge module outside the backbone model, and each knowledge layer is selectively connected to one backbone layer for fusion. Third, two knowledge-related auxiliary tasks, i.e., an entity/description enhancement task and an entity enhancement/pollution task, are designed to smooth the semantic gaps among the evolved representations. We conducted experiments on four knowledge-oriented tasks and two common tasks, and the results achieve a new state-of-the-art on several datasets. Besides, we conduct an ablation study to show that each module in our method is necessary. PDF 6 2022
InvBERT: Reconstructing Text from Contextualized Word Embeddings by inverting the BERT pipeline Digital Humanities and Computational Literary Studies apply text mining methods to investigate literature. Such automated approaches enable quantitative studies on large corpora which would not be feasible by manual inspection alone. However, due to copyright restrictions, the availability of relevant digitized literary works is limited. Derived Text Formats (DTFs) have been proposed as a solution. Here, textual materials are transformed in such a way that copyright-critical features are removed, while the use of certain analytical methods remains possible. Contextualized word embeddings produced by transformer-encoders (like BERT) are promising candidates for DTFs because they allow for state-of-the-art performance on various analytical tasks and, at first sight, do not disclose the original text. However, in this paper we demonstrate that under certain conditions the reconstruction of the original copyrighted text becomes feasible and its publication in the form of contextualized token representations is not safe. Our attempts to invert BERT suggest that publishing the encoder as a black box together with the contextualized embeddings is critical, since it allows generating data to train a decoder with a reconstruction accuracy sufficient to violate copyright laws. PDF 6 2022
Reducing Short Circuits in Multiple-Choice Natural Language Reasoning Models with Data Augmentation Statistical biases in the training data may lead to fragility in neural models that make choices in multiple-choice natural language reasoning problems without referring to the context or premises. To encourage the models to pay more attention to the relations between the premise and the choices, we propose two biologically inspired operations that generate new training data which ``forces'' the model to look at the premises, thereby reducing short circuits. They can augment any type of multiple-choice reasoning dataset and can be applied to any supervised learning model. Results show that models trained with the augmented data become more robust on both the stress test and the original test. PDF 6 2022
Average Is Not Enough: Caveats of Multilingual Evaluation This position paper discusses the problem of multilingual evaluation. Using simple statistics, such as average language performance, might inject linguistic biases in favor of dominant language families into evaluation methodology. We argue that a qualitative analysis informed by comparative linguistics is needed for multilingual results to detect this kind of bias. We show in our case study that results in published works can indeed be linguistically biased and we demonstrate that visualization based on URIEL typological database can detect it. PDF 6 2022
SlovakBERT: Slovak Masked Language Model We introduce a new Slovak masked language model called SlovakBERT. To the best of our knowledge, this is the first paper discussing Slovak transformer-based language models. We evaluate our model on several NLP tasks and achieve state-of-the-art results. This evaluation is likewise the first attempt to establish a benchmark for Slovak language models. We publish the masked language model, as well as the fine-tuned models for part-of-speech tagging, sentiment analysis and semantic textual similarity. PDF 6 2022
Improving Gender Fairness of Pre-Trained Language Models without Catastrophic Forgetting Existing studies addressing the gender bias of pre-trained language models usually build a small gender-neutral dataset and conduct a second phase of pre-training on the model with such data. However, given the limited size and concentrated focus of the gender-neutral data, catastrophic forgetting can occur during this second-phase pre-training. Forgetting information in the original training data may damage the model's downstream performance by a large margin. In this work, we empirically show that catastrophic forgetting occurs in such methods by evaluating them on general NLP tasks in GLUE. Then, we propose a new method, GEnder Equality Prompt (GEEP), to improve the gender fairness of pre-trained models with less forgetting. GEEP freezes the pre-trained model and learns gender-related prompts with gender-neutral data. Empirical results show that GEEP not only achieves SOTA performance on gender fairness tasks, but also forgets less and performs better on GLUE by a large margin. PDF 6 2022
Ground then Navigate: Language-guided Navigation in Dynamic Scenes We investigate the problem of Vision-and-Language Navigation (VLN) in the context of autonomous driving in outdoor settings. We solve the problem by explicitly grounding the navigable regions corresponding to the textual command. At each timestamp, the model predicts a segmentation mask corresponding to the intermediate or the final navigable region. Our work contrasts with existing efforts in VLN, which pose this task as a node-selection problem given a discrete connected graph corresponding to the environment. We do not assume the availability of such a discretised map. Our work moves towards continuity in the action space, provides interpretability through visual feedback, and allows VLN on commands like ``park between the two cars'', which require finer manoeuvres. Furthermore, we propose a novel meta-dataset, CARLA-NAV, to allow efficient training and validation. The dataset comprises pre-recorded training sequences and a live environment for validation and testing. We provide extensive qualitative and quantitative empirical results to validate the efficacy of the proposed approach. PDF 7 2022
Simple Recurrence Improves Masked Language Models In this work, we explore whether modeling recurrence in the Transformer architecture can be both beneficial and efficient, by building an extremely simple recurrent module into the Transformer. We compare our model to baselines following the training and evaluation recipe of BERT. Our results confirm that recurrence can indeed improve Transformer models by a consistent margin, without requiring low-level performance optimizations, and while keeping the number of parameters constant. For example, our base model achieves an absolute improvement of 2.1 points averaged across 10 tasks and also demonstrates increased stability in fine-tuning over a range of learning rates. PDF 7 2022
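The abstract does not spell out the recurrent module; the following is one plausible minimal variant, a GRU pass with a residual connection wrapped around a standard Transformer encoder layer:

```python
import torch.nn as nn

class RecurrentTransformerBlock(nn.Module):
    """A Transformer encoder layer followed by a lightweight recurrent pass (illustrative)."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        x = self.layer(x)
        recurrent_out, _ = self.rnn(x)
        return self.norm(x + recurrent_out)  # residual around the recurrence
```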
Automated political bias classification in news agencies: a word-based feature selection approach This study offers a new solution to the problem of developing political bias classification models for news agencies. Our method uses search engine score functions to develop a measure of the relevance of each word in text scraped from news websites. With these scores, we train models using existing feature selection methods and a custom feature selection algorithm that we developed. The resulting models are contrasted with each other and with neural network-based counterparts. Models trained using our proposed method and custom algorithm outperformed the others, achieving macro F1 scores of 0.81 and 0.78 on right-wing and left-wing bias detection, respectively. PDF 7 2022
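The paper's search engine score functions are not reproduced here; as a stand-in, the sketch below scores words with TF-IDF, keeps the top-scoring words as features, and trains a linear classifier (data, labels, and the cutoff are toy placeholders):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus standing in for text scraped from news websites.
docs = ["economy tax cuts growth markets", "welfare healthcare public spending unions"]
labels = [1, 0]  # 1 = right-leaning, 0 = left-leaning (placeholder labels)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
word_scores = np.asarray(X.sum(axis=0)).ravel()  # one relevance score per word
top_features = np.argsort(word_scores)[-100:]    # keep the highest-scoring words

classifier = LogisticRegression().fit(X[:, top_features], labels)
```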
Iterative Evidence Searching over Long Structured Documents for Question Answering We propose a simple yet effective model, DOCHOPPER, for selecting evidence from long structured documents to answer complex questions. Similar to multi-hop question-answering (QA) systems, at each step DOCHOPPER uses a query $q$ to extract information from a document and combines this information with $q$ to produce the next query. However, in contrast to most previous multi-hop QA systems, DOCHOPPER is able to extract either short or long sections of the document, thus emulating a multi-step process of “navigating” through a long document to answer a question. To enable this novel behavior, DOCHOPPER does not combine document information with $q$ by concatenating retrieved text to the text of $q$, but by combining a compact neural representation of $q$ with a compact neural representation of a (potentially large) hierarchical part of the document. We evaluate DOCHOPPER on three different tasks that require reading long structured documents and finding multiple pieces of evidence, and show that DOCHOPPER outperforms Transformer models for plain-text input. Additionally, DOCHOPPER is efficient at inference time, being 10–250 times faster than baselines. PDF 7 2022
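A toy rendering of that iterative query-update loop, assuming pre-computed section encodings; the encoders, the combination function, and the training signal of the real model are not shown:

```python
import torch
import torch.nn as nn

class EvidenceHopper(nn.Module):
    """Sketch: at each hop, retrieve the best-matching section and fold it into the query."""
    def __init__(self, dim=256, hops=3):
        super().__init__()
        self.combine = nn.Linear(2 * dim, dim)  # merges query with retrieved evidence
        self.hops = hops

    def forward(self, query, sections):  # query: (dim,), sections: (n_sections, dim)
        selected = []
        for _ in range(self.hops):
            scores = sections @ query  # dense scoring over document sections
            best = int(scores.argmax())
            selected.append(best)
            query = torch.tanh(self.combine(torch.cat([query, sections[best]])))
        return selected  # indices of the evidence sections chosen
```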
Reasoning over Logically Interacted Conditions for Question Answering Some questions have answers that are correct only if certain conditions apply. Conditions are used to distinguish answers as well as to provide additional information to support them. To answer questions with conditions, models need to first find eligible answers and conditions in the context and then perform logical reasoning to check whether the conditions have been satisfied. We propose TReasoner to model this challenging reasoning process. In addition to finding answers, TReasoner can also identify unsatisfied conditions that are required to support the answers, as some answers are constrained by multiple conditions of which only one or a subset is satisfied. TReasoner consists of an entailment module, a reasoning module, and a generation module (if answers are free-form text spans). TReasoner achieves state-of-the-art performance on two benchmark QA datasets, outperforming the previous state of the art by 3-10 points. PDF 7 2022
Evaluating the Portability of Rheumatoid Arthritis Phenotyping Algorithms: case study on French EHRs at the Patient and Encounter Level High-throughput phenotyping can accelerate the development of statistical analyses from cohorts of Electronic Health Records. Previous work has successfully used machine learning and natural language processing for the phenotyping of Rheumatoid Arthritis (RA) patients in hospitals within the United States and France. Our goal is to evaluate the adaptability of RA phenotyping algorithms to a new hospital, at both the patient and encounter levels. Two algorithms are adapted to the context of the new hospital and evaluated with a newly developed RA gold standard corpus, including annotations at the encounter level. The adapted algorithms offer comparable performance for patient-level phenotyping on the new corpus (F1 0.71 to 0.79), while performance is lower for encounter-level phenotyping (F1 0.54 to 0.57), illustrating the feasibility and cost of adaptation. The first algorithm incurred a heavier adaptation burden because it required manual feature engineering. However, it is less computationally intensive than the second, semi-supervised, algorithm. PDF 7 2022
CoRec: An Easy Approach for Coordination Recognition In this paper, we observe and address the challenges of the coordination recognition task. Most existing methods rely on syntactic parsers to identify the coordinators in a sentence and detect the coordination boundaries. However, state-of-the-art syntactic parsers are slow and suffer from errors, especially for long and complicated sentences. To address these problems, we propose a pipeline model, COordination RECognizer (CoRec), which consists of two components: a coordinator identifier and a conjunct boundary detector. Experimental results on datasets from various domains demonstrate the effectiveness and efficiency of the proposed method. Further experiments show that CoRec positively impacts downstream tasks, improving the yield of state-of-the-art Open IE models. PDF 7 2022
KGRefiner: Knowledge Graph Refinement for Improving Accuracy of Translational Link Prediction Methods Link prediction is the task of predicting missing relations between entities of a knowledge graph (KG). Recent work in link prediction has mainly attempted to increase accuracy by adding more layers to neural network architectures, which relies heavily on computational resources. This paper proposes refining knowledge graphs so that link prediction can be performed more accurately with relatively fast translational models. Translational link prediction models have significantly lower complexity than deep learning approaches, which motivated us to improve their accuracy. Our method uses the ontologies of knowledge graphs to add information as auxiliary nodes to the graph. These auxiliary nodes are then connected to the ordinary nodes of the KG that contain the auxiliary information in their hierarchy. Our experiments show that our method can significantly improve the performance of translational link prediction methods on Hits@10, Mean Rank, and Mean Reciprocal Rank. PDF 7 2022
LEAP: Learnable Pruning for Transformer-based Models Pruning is an effective method to reduce the memory footprint and computational cost associated with large natural language processing models. However, current pruning algorithms either focus on a single pruning category, e.g., structured or unstructured pruning, or require extensive hyperparameter tuning to reach reasonable accuracy. To address these challenges, we propose LEArnable Pruning (LEAP), an effective method that gradually prunes the model based on thresholds learned by gradient descent. Unlike previous learnable pruning methods, which use an L0 or L1 penalty to indirectly affect the final pruning ratio, LEAP introduces a novel regularization function that directly interacts with the preset target pruning ratio. Moreover, to reduce hyperparameter tuning, a novel adaptive regularization coefficient controls the regularization penalty adaptively. With the new regularization term and its associated adaptive coefficient, LEAP can be applied to different pruning granularities, including unstructured, structured, and hybrid pruning, with minimal hyperparameter tuning. We apply LEAP to BERT models on QQP/MNLI/SQuAD under different pruning settings. Our results show that, for all datasets, pruning granularities, and pruning ratios, LEAP achieves on-par or better results compared to previous heavily hand-tuned methods. PDF 7 2022
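A minimal sketch of pruning with a learned threshold; the sigmoid relaxation, the quadratic penalty, and all names below are assumptions for illustration, not LEAP's actual regularizer or adaptive coefficient:

```python
import torch
import torch.nn as nn

class LearnableThresholdMask(nn.Module):
    """Soft magnitude pruning whose threshold is itself trained by gradient descent."""
    def __init__(self, init_threshold=1e-3, temperature=1e-3):
        super().__init__()
        self.threshold = nn.Parameter(torch.tensor(init_threshold))
        self.temperature = temperature

    def forward(self, weight):
        # Differentiable mask: weights below the learned threshold are pushed toward zero.
        mask = torch.sigmoid((weight.abs() - self.threshold) / self.temperature)
        return weight * mask

    def soft_sparsity(self, weight):
        mask = torch.sigmoid((weight.abs() - self.threshold) / self.temperature)
        return 1.0 - mask.mean()  # differentiable proxy for the achieved pruning ratio

def ratio_penalty(soft_sparsity, target_ratio, coeff=1.0):
    # Stand-in regularizer: pushes the achieved pruning ratio toward the preset target.
    return coeff * (soft_sparsity - target_ratio) ** 2
```

Adding `ratio_penalty(mask.soft_sparsity(w), target_ratio)` to the task loss ties the threshold directly to the target ratio, which is the property the abstract contrasts with indirect L0/L1 penalties.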
Evaluating Existing Models for Sentiment Analysis of News Corpora When dealing with domain-specific tasks, using domain-specific data in pre-training has proven to be useful. Improved BERT models for finance, chemistry, and biology already exist, and they significantly outperform base BERT in these fields. However, a model specialized in understanding news is lacking. To achieve SOTA on news corpora, the first step is to find the base model best suited for it. This work evaluates three base models for sentiment analysis of sentence-level and document-level news. Results show that the fine-tuned DeBERTa and Big Bird models achieved the highest F1 scores at the sentence level and document level, respectively. We further explore the underlying reasons and provide empirical advice on which features tend to be more useful than others in understanding news. The code is available at: https://github.com/Richard1007/sentiment_news PDF 7 2022
Classifiers are Better Experts for Controllable Text Generation This paper proposes a simple method for controllable text generation based on weighting logits with a free-form classifier, namely CAIF sampling. Using an arbitrary text classifier, we adjust a small part of a language model's logits and guide text generation towards or away from the classifier's prediction. We experimented with toxicity avoidance and sentiment control tasks and show that the proposed method significantly outperforms the recent PPLM, GeDi, and DExperts approaches on PPL and on task accuracy as measured by an external classifier of the generated texts. In addition, compared to other approaches, it is easier to implement and tune, and has significantly fewer restrictions and requirements. PDF 7 2022
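A sketch of that reweighting step in the spirit of the abstract, where only the top-k logits are adjusted; obtaining `attr_logprobs` (one classifier score per candidate continuation) is left out, and the `alpha`/`top_k` values are illustrative:

```python
import torch

def classifier_weighted_logits(lm_logits, attr_logprobs, alpha=5.0, top_k=100):
    """Adjust only the top-k LM logits with an attribute classifier's log-probabilities.

    lm_logits:     (vocab,) next-token logits from the language model
    attr_logprobs: (vocab,) assumed log p(attribute | prefix + token), per candidate token
    """
    topk_vals, topk_idx = lm_logits.topk(top_k)
    adjusted = topk_vals + alpha * attr_logprobs[topk_idx]  # steer toward the attribute
    out = torch.full_like(lm_logits, float("-inf"))         # mask out all other tokens
    out[topk_idx] = adjusted
    return out  # sample the next token from softmax(out)
```

A negative `alpha` steers generation away from the attribute, which is how toxicity avoidance would be expressed in this scheme.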
Look Ma, Only 400 Samples! Revisiting the Effectiveness of Automatic N-Gram Rule Generation for Spelling Correction in Filipino With 84.75 million Filipinos online, the ability of models to process online text is crucial for developing Filipino NLP applications. To this end, spelling correction is a crucial preprocessing step for downstream processing. However, the lack of data prevents the use of language models for this task. In this paper, we propose an N-Gram + Damerau-Levenshtein distance model with automatic rule extraction. We train the model on 300 samples and show that, despite the limited training data, it achieves good performance and outperforms other deep learning approaches in terms of accuracy and edit distance. Moreover, the model (1) requires little compute power, (2) trains in little time, thus allowing for retraining, and (3) is easily interpretable, allowing for direct troubleshooting, highlighting the success of traditional approaches over more complex deep learning models in settings where data is unavailable. PDF 7 2022
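As a concrete reference point for the distance component, here is a standard Damerau-Levenshtein (optimal string alignment) implementation and a simple dictionary lookup; the paper's automatically extracted n-gram rules, which would re-rank these candidates in context, are not reproduced:

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, substitutions, and adjacent swaps."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def correct(word: str, lexicon: list[str], max_dist: int = 2) -> str:
    """Return the closest lexicon entry, or the word itself if nothing is close enough."""
    best = min(lexicon, key=lambda w: damerau_levenshtein(word, w))
    return best if damerau_levenshtein(word, best) <= max_dist else word

print(correct("mganda", ["maganda", "salamat"]))  # -> "maganda"
```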
Semi-self-supervised Automated ICD Coding Clinical Text Notes (CTNs) contain physicians' reasoning processes, written in an unstructured free-text format as they examine and interview patients. In recent years, several studies have provided evidence for the utility of machine learning for predicting doctors' diagnoses from CTNs, a task known as ICD coding. Data annotation is time-consuming, particularly when a degree of specialization is needed, as is the case for medical data. This paper presents a method for augmenting a sparsely annotated dataset of Icelandic CTNs with machine-learned data imputation in a semi-self-supervised manner. We train a neural network on a small set of annotated CTNs and use it to extract clinical features from a set of un-annotated CTNs. These clinical features consist of answers to about a thousand potential questions that a physician might answer during a consultation with a patient. The features are then used to train a classifier for the diagnosis of certain types of diseases. We report the results of an evaluation of this data augmentation method over three tiers of information available to a physician. Our data augmentation method shows a significant positive effect, which diminishes as an increasing number of clinical features, from the examination of the patient and diagnostics, are made available. We recommend our method for augmenting scarce datasets in systems that make decisions based on clinical features that do not include examinations or tests. PDF 7 2022
A Semi-Autoregressive Graph Generative Model for Dependency Parsing Recent years have witnessed impressive progress in neural dependency parsing. According to how they factorize the joint probability of the graph, existing parsers can be roughly divided into autoregressive and non-autoregressive patterns. The former factorizes the graph into multiple sequentially dependent components, so the graph is built up component by component; the latter assumes these components to be independent, so they can all be output at once. However, when treating the directed edge in the dependency graph as an explicit dependency, we discover that the dependency graph contains a mixture of independent and interdependent components, meaning that neither pattern precisely captures the explicit dependencies among nodes and edges. Based on this property, we design a Semi-Autoregressive Dependency Parser that generates dependency graphs by adding node groups and edge groups autoregressively while emitting all elements within a group in parallel. The model also addresses two problems in graph generation, the uncertainty of generation orders and edge sparsity, by introducing a novel concept of Topological Hierarchy and a Graph Transformer as the decoder. Experiments show the proposed parser outperforms strong baselines on Enhanced Universal Dependencies in $14$ languages. Ablations over model variants also show the importance of specific components. PDF 7 2022
Higher-Order Dependency Parsing for Arc-Polynomial Score Functions via Gradient-Based Methods and Genetic Algorithm We present a novel method for higher-order dependency parsing which takes advantage of the general form of score functions written as arc-polynomials, a framework that encompasses common higher-order score functions and includes new ones. The method is based on non-linear optimization techniques, namely coordinate ascent and genetic search, in which we iteratively update a candidate parse. Updates are formulated as gradient-based operations and are efficiently computed by auto-differentiation libraries. Experiments show that this method matches recent state-of-the-art second-order parsers on three standard datasets. PDF 7 2022
Distantly Supervised Course Concept Extraction in MOOCs with Academic Discipline With the rapid growth of Massive Open Online Courses (MOOCs), it is expensive and time-consuming to manually extract the high-quality knowledge concepts taught in a course to help learners grasp its essence. In this paper, we propose to automatically extract course concepts using distant supervision, which eliminates the heavy work of human annotation by generating labels through matching against an easily accessed dictionary. However, this matching process suffers from severely noisy and incomplete annotations because of the limited dictionary and the diversity of MOOCs. To tackle these challenges, we present a novel three-stage framework, DS-MOCE, which leverages the power of pre-trained language models explicitly and implicitly and employs discipline-embedding models with a self-training strategy based on label generation refinement across different domains. We also provide an expert-labeled dataset spanning $20$ academic disciplines. Experimental results demonstrate the superiority of DS-MOCE over state-of-the-art distantly supervised methods (with a $7\%$ absolute F1 score improvement). PDF 7 2022
BERT on a Data Diet: Finding Important Examples by Gradient-Based Pruning Current pre-trained language models rely on large datasets for achieving state-of-the-art performance. However, past research has shown that not all examples in a dataset are equally important during training. In fact, it is sometimes possible to prune a considerable fraction of the training set while maintaining test performance. Two gradient-based scoring metrics for finding important examples, established on standard vision benchmarks, are GraNd and its estimated version, EL2N. In this work, we employ these two metrics in NLP for the first time. We demonstrate that these metrics need to be computed after at least one epoch of fine-tuning and are not reliable at early steps. Furthermore, we show that by pruning a small portion of the examples with the highest GraNd/EL2N scores, we can not only preserve test accuracy but also surpass it. This paper details the adjustments and implementation choices that enable GraNd and EL2N to be applied to NLP. PDF 7 2022
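For reference, minimal versions of the two scores following their usual definitions from the vision literature; the per-example gradient loop is a slow but simple stand-in, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def el2n_scores(logits, labels, num_classes):
    """EL2N: L2 norm of the error vector, softmax(logits) minus the one-hot label."""
    probs = F.softmax(logits, dim=-1)
    onehot = F.one_hot(labels, num_classes).float()
    return (probs - onehot).norm(dim=-1)  # one score per example; high = important

def grand_scores(model, loss_fn, inputs, labels):
    """GraNd: per-example norm of the loss gradient w.r.t. the model parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    scores = []
    for x, y in zip(inputs, labels):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        scores.append(torch.cat([g.flatten() for g in grads]).norm().item())
    return torch.tensor(scores)
```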
Down-Scaling Language Models in the Era of Scale Is All You Need Large language models are very resource intensive, both financially and environmentally, and require a huge amount of training data, which is only available for a small number of languages. In this work, we focus on low-resource settings. We build language models in two languages trained with different configurations, which are then evaluated on several NLP tasks. Specifically, we analyze three lightweight BERT architectures (with 124M, 51M, and 16M parameters) which are trained on small corpora (125M, 25M, and 5M words) for both Basque and Spanish. The trained models are evaluated on several tasks and compared with traditional, non-neural supervised systems. We also present an estimate of the resources and CO$_2$ emissions needed in each approach, which calls for a compromise between raw performance and environmental cost. PDF 7 2022
Intrinsic Uncertainty-Aware Calibration Metric Deep learning models have made great strides in recent years. Model calibration, and how to measure it, has consequently gained much attention, since the degree of calibration is an indication of a model's reliability. In this study, we explore the limitations of existing calibration metrics and propose a simple calibration metric tailored to natural language generation (NLG) tasks. Unlike existing calibration metrics, ours is not based solely on a single prediction; it considers the distribution mapped by a model. In this regard, the proposed metric takes the intrinsic uncertainty present in natural language into account when quantifying the degree of calibration. The metric has been tested on machine translation datasets, a popular NLG task with intrinsic uncertainty. A thorough analysis illustrates that the proposed metric can handle intrinsic uncertainty and is hence a more suitable measure for NLG tasks. PDF 7 2022
Siamese vs Non-Siamese: Dual Encoders for Intent Detection Bi-encoders have been shown to be effective for intent classification. Current bi-encoders use the same weights to learn the embeddings of both contexts and candidates. However, this can be counter-productive when contexts share overlapping keywords with competing candidate labels, which can lead to an unrelated context and candidate having similar embeddings and being misclassified. In this work, we investigate the potential of non-siamese bi-encoders for intent detection, where separate weights are learned for the context and the candidate. Our results show that non-siamese bi-encoders improve on the performance of traditional bi-encoders across datasets. We also show that using heterogeneous architectures in a non-siamese bi-encoder can effectively reduce memory and computation requirements while maintaining prediction performance. PDF 7 2022
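A sketch of the non-siamese setup with independently parameterized, possibly heterogeneous, towers; the encoder choices and scoring function are illustrative assumptions, not the paper's exact architectures:

```python
import torch.nn as nn

class NonSiameseBiEncoder(nn.Module):
    """Separate weights for the context tower and the candidate-label tower."""
    def __init__(self, context_encoder, candidate_encoder):
        super().__init__()
        self.context_encoder = context_encoder      # e.g., a full-size transformer
        self.candidate_encoder = candidate_encoder  # e.g., a smaller, cheaper encoder

    def forward(self, contexts, candidates):
        ctx = self.context_encoder(contexts)        # (batch, dim)
        cand = self.candidate_encoder(candidates)   # (n_intents, dim)
        return ctx @ cand.T                         # one similarity score per intent
```

Because the two towers no longer share parameters, a heavy context encoder can be paired with a lightweight candidate encoder, which is the memory/compute trade-off the abstract describes.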
CORAL: Contextual Response Retrievability Loss Function for Training Dialog Generation Models Natural Language Generation (NLG) represents a large collection of tasks in the field of NLP. While many of these tasks are handled well by the cross-entropy (CE) loss, dialog generation poses a few unique challenges for this loss function. First, CE loss assumes that, for any given input, the only possible output is the one available as the ground truth in the training dataset. In general, this is not true for any task, as there can be multiple semantically equivalent sentences, each with a different surface form. The problem is exacerbated in dialog generation, as there can be multiple valid responses (for a given context) that not only have different surface forms but are also not semantically equivalent. Second, CE loss does not take the context into consideration while processing the response, and hence grades the response irrespective of the context. To grade the generated response for qualities like relevance and coherence, the loss function should depend on both the context and the generated response. To circumvent these shortcomings of the CE loss, in this paper, we propose a novel loss function, CORAL, that directly optimizes recently proposed estimates of human preference for generated responses. Using CORAL, we can train dialog generation models without assuming that the ground truth is the only valid response. The CORAL loss is also computed based on both the context and the response. Extensive comparisons on two benchmark datasets show that the proposed methods outperform strong state-of-the-art baseline models of different sizes. PDF 7 2022
Revisiting zero-shot cross-lingual topic identification: baselines, languages and evaluation In this paper, we revisit cross-lingual topic identification (ID) in zero-shot settings by taking a deeper dive into current datasets, baseline systems, and the languages covered. We identify shortcomings in the existing MLDoc evaluation protocol and propose a robust alternative scheme, while also extending the cross-lingual experimental setup to 17 languages. We benchmark several systems based on existing multilingual models such as LASER, XLM-R, mUSE, and LaBSE on the new evaluation protocol covering 17 languages. Further, we present a novel Bayesian multilingual document model (MBay) for learning language-independent document embeddings. The model learns to represent document embeddings as Gaussian distributions, thereby encoding uncertainty in their covariance. We propagate the learned uncertainties through linear classifiers, which benefits zero-shot cross-lingual topic ID. Our experiments on 17 languages show that the proposed multilingual Bayesian document model performs competitively with other systems based on LASER, XLM-R, and mUSE on 8 high-resource languages, and outperforms them on 9 mid-resource languages. Finally, we consolidate the observations from all our experiments and discuss points that can benefit future research on cross-lingual topic ID. PDF 7 2022
Enhancing Neural Topic Model with Multi-Level Supervisions from Seed Words Efforts have been made to use topic seed words to improve the interpretability of topic models. However, due to the semantic diversity of natural language, supervision from seed words can be ambiguous, making it hard to incorporate into current neural topic models. In this paper, we propose SeededNTM, a neural topic model enhanced with supervision from seed words at both the word and document levels. We introduce a context-dependency assumption to alleviate ambiguities with contextual document information, and an auto-adaptation mechanism to automatically balance multi-level information. Moreover, an intra-sample consistency regularizer is proposed to deal with noisy supervision by encouraging perturbation and semantic consistency. Extensive experiments on multiple datasets show that SeededNTM derives semantically meaningful topics and outperforms state-of-the-art seeded topic models in terms of topic quality and classification accuracy. PDF 7 2022
Algorithmic Diversity and Tiny Models: Comparing Binary Networks and the Fruit Fly Algorithm on Document Representation Tasks Neural language models have seen a dramatic increase in size in recent years. While many still advocate that `bigger is better', work in model distillation has shown that the number of parameters used by very large networks is actually more than what is required for state-of-the-art performance. This prompts an obvious question: can we build smaller models from scratch, rather than going through the inefficient process of training at scale and subsequently reducing model size? In this paper, we investigate the behaviour of a biologically inspired algorithm based on the fruit fly's olfactory system. This algorithm has shown good performance in the past on the task of learning word embeddings. We now put it to the test on the task of semantic hashing. Specifically, we compare the fruit fly to a standard binary network on the task of generating locality-sensitive hashes for text documents, measuring both task performance and energy consumption. Our results indicate that the two algorithms have complementary strengths while showing similar electricity usage. PDF 7 2022
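The fruit-fly hashing scheme, for reference, amounts to a sparse random projection followed by winner-take-all sparsification; the sizes and connectivity below are illustrative:

```python
import numpy as np

def fly_hash(doc_vector, projection, k=32):
    """Expand into a high-dimensional space, then keep the top-k activations as 1-bits."""
    activations = projection @ doc_vector
    code = np.zeros(projection.shape[0], dtype=np.uint8)
    code[np.argsort(activations)[-k:]] = 1  # winner-take-all: a sparse binary hash
    return code

rng = np.random.default_rng(0)
projection = (rng.random((2000, 300)) < 0.05).astype(np.float32)  # ~5% random connectivity
doc = rng.random(300).astype(np.float32)  # stands in for a document's feature vector
print(fly_hash(doc, projection).sum())    # 32 active bits
```

Similar documents activate overlapping winner sets, so their binary codes collide often, which is what makes the hash locality-sensitive.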
Commonsense Frame Completion and its Probabilistic Evaluation Commonsense knowledge is critical to achieving artificial general intelligence. Large language models have demonstrated impressive performance on commonsense tasks; however, these tasks are often posed as multiple-choice questions, allowing models to exploit systematic biases. Commonsense is also inherently probabilistic: a plumber could repair a sink in a kitchen or a bathroom, or even a basement, although the former answers are more probable. Existing tasks do not capture this probabilistic nature of common sense. To this end, we present commonsense frame completion (CFC), a new generative task which evaluates common sense via multiple open-ended generations. We also propose a method of probabilistic evaluation which correlates strongly with human judgements. Humans drastically outperform strong language model baselines on our dataset, indicating this approach is both a challenging and useful evaluation of machine common sense. PDF 7 2022
To be a Knight-errant Novel Master: Knight-errant Style Transfer via Contrastive Learning Knight-errant style writing is a challenging task for novice writers due to the highly condensed terminology and highly literary language of knight-errant works. To tackle this problem, we propose a new large-scale parallel knight-errant dataset and model knight-errant writing as a text style transfer (TST) task between the modern style and the knight-errant style. We establish benchmark performance for six current SOTA models on knight-errant style transfer. Empirical results demonstrate that existing SOTA TST models are unable to accurately identify and generate knight-errant style sentences. We therefore propose Knight, a TST framework based on contrastive learning. Knight uses multiple strategies to construct positive and negative samples, making it significantly better than existing SOTA models in terms of content fluency, style transfer accuracy, and factuality. The data and code are publicly available at https://anonymous.4open.science/r/knight-errant-style-transfer-C2E1/. PDF 7 2022
We need to talk about random seeds Modern neural network libraries all take as a hyperparameter a random seed, typically used to determine the initial state of the model parameters. This opinion piece argues that there are some safe uses for random seeds: selecting a good model as part of hyperparameter search, creating an ensemble of several models, or measuring the sensitivity of the training algorithm to the random seed hyperparameter. It argues that other uses are risky: using a fixed random seed for "replicability" and varying only the random seed to create score distributions for performance comparison. An analysis of 85 recent publications from the ACL Anthology finds that more than 50% contain risky uses of random seeds. PDF 7 2022
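One of the safe uses named above, measuring the training algorithm's sensitivity to the seed, might look like the following, where `train_and_eval` is a hypothetical user-supplied routine that trains a model and returns a dev score:

```python
import random
import numpy as np
import torch

def seed_sensitivity(train_and_eval, seeds=range(5)):
    """Report the spread of scores across seeds instead of a single cherry-picked run."""
    scores = []
    for seed in seeds:
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        scores.append(train_and_eval(seed))  # hypothetical training-and-evaluation call
    return float(np.mean(scores)), float(np.std(scores))
```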
Gandalf: Data Augmentation is all you need for Extreme Classification Extreme Multi-label Text Classification (XMC) involves learning a classifier that can assign an input a subset of the most relevant labels from millions of label choices. Recent works in this domain increasingly focus on the problem setting with (i) short-text input data, and (ii) labels endowed with meta-data in the form of textual descriptions. Short-text XMC with label features has found numerous applications in areas such as prediction of related searches, product recommendation based on titles, and bid-phrase suggestion, amongst others. In this work, by exploiting the problem characteristics of short-text XMC, we develop postulates stating the desired invariances and propose two data augmentation techniques to achieve them: LabelMix, which augments a data-point by concatenating one of its annotating labels to it; and Gandalf, which generates additional data-points by treating labels as legitimate data-points themselves. The efficacy of the proposed augmentation methods is demonstrated by showing up to 30% relative improvement when applied to a range of existing algorithms, and by proposing an algorithmic framework, InceptionXML-LF, which advances the state of the art on benchmark datasets. PDF 7 2022
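Toy renderings of the two augmentations, under the assumption that each label carries a textual description; the data structures are illustrative, not the paper's implementation:

```python
import random

def gandalf_augment(dataset, label_texts):
    """Gandalf: each label's description becomes a new data-point annotated with that label."""
    augmented = list(dataset)  # dataset: list of (text, label_ids) pairs
    for label_id, description in label_texts.items():
        augmented.append((description, [label_id]))
    return augmented

def labelmix_augment(dataset, label_texts, rng=random.Random(0)):
    """LabelMix: concatenate one annotating label's description onto each data-point."""
    return [(text + " " + label_texts[rng.choice(labels)], labels)
            for text, labels in dataset]
```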