Argument Generation datasets collected from ChangeMyView, ver 1.1
(first released in May 2018, modified in Aug 2018)

Project url: http://xinyuhua.github.io/Resources/

Distributed together with:

  Neural Argument Generation Augmented with Externally Retrieved Evidence
  Xinyu Hua and Lu Wang
  Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL)

If you use these datasets, please cite:

@InProceedings{hua-wang:2018:Long,
  author    = {Hua, Xinyu and Wang, Lu},
  title     = {Neural Argument Generation Augmented with Externally Retrieved Evidence},
  booktitle = {Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  month     = {July},
  year      = {2018},
  address   = {Melbourne, Australia},
  publisher = {Association for Computational Linguistics},
  abstract  = {High quality arguments are essential elements for human reasoning and decision-making processes. However, effective argument construction is a challenging task for both human and machines. In this work, we study a novel task on automatically generating arguments of a different stance for a given statement. We propose an encoder-decoder style neural network-based argument generation model enriched with externally retrieved evidence from Wikipedia. Our model first generates a set of talking point phrases as intermediate representation, followed by a separate decoder producing the final argument based on both input and the keyphrases.}
}

===== Content =====

I.   Description of the datasets
II.  Change log
III. Contact

===== I. Description of the datasets =====

We collect OP posts and replies from ChangeMyView, a subcommunity on reddit:
https://www.reddit.com/r/changemyview/

The datasets consist of two major parts: the core dataset and the pre-training
dataset. The pre-training dataset is used only for parameter initialization,
so it contains only the OP posts and target arguments. The core dataset is
used for training, validation, and testing of our proposed models. It contains
the OP posts, sampled evidence sentences, extracted keyphrases, and target
arguments. We also include the retrieved Wikipedia document titles and the
corresponding ranked evidence sentences, to facilitate further work on
evidence retrieval.

======== (1) Core dataset ========

cmv_processed/:
    op_train.jsonlist.bz2
    op_valid.jsonlist.bz2
    op_test.jsonlist.bz2
    root_replies_train.jsonlist.bz2
    root_replies_valid.jsonlist.bz2
    root_replies_test.jsonlist.bz2

This directory contains the filtered OP posts and root replies used to create
the core dataset. Each OP post has a unique tid (thread id, e.g. "t3_1fv4r6"),
the URL of the whole discussion thread, and the tokenized post. Each root
reply has a unique cid (comment id, e.g. "d3cw1w8"), the tid of the thread it
belongs to, the tokenized post, its karma (#upvotes - #downvotes), and the
delta from the OP poster.

wikipedia_retrieval/:
    wiki_doc_retrieved_from_op_train.jsonlist.bz2
    wiki_doc_retrieved_from_op_valid.jsonlist.bz2
    wiki_doc_retrieved_from_op_test.jsonlist.bz2
    wiki_doc_retrieved_from_rr_train.jsonlist.bz2
    wiki_doc_retrieved_from_rr_valid.jsonlist.bz2
    wiki_doc_retrieved_from_rr_test.jsonlist.bz2

This directory contains Wikipedia article titles retrieved by queries
constructed from the OP posts or root replies. Each OP post or root reply is
first split into sentences, and a query is constructed for each sentence (no
query is built for sentences that are too short). We keep the top 5 retrieved
articles per query (details described in Sec 5.1 of our paper).
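The *.jsonlist.bz2 files above are, as the name suggests, bzip2-compressed
files with one JSON object per line. Below is a minimal loading sketch (not
part of the official release) using only the Python standard library; the
field names in the usage comment (tid, url) follow the description above and
should be verified against the actual keys in your copy of the data.

    import bz2
    import json

    def read_jsonlist(path):
        """Yield one JSON object per line of a bzip2-compressed json-list file."""
        with bz2.open(path, mode="rt", encoding="utf-8") as fin:
            for line in fin:
                line = line.strip()
                if line:
                    yield json.loads(line)

    # Hypothetical usage; adjust field names to match the actual data:
    # for op in read_jsonlist("cmv_processed/op_train.jsonlist.bz2"):
    #     print(op["tid"], op["url"])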
reranked_evidence/:
    selected_evidence_rr_train.jsonlist.bz2
    selected_evidence_rr_valid.jsonlist.bz2
    selected_evidence_rr_test.jsonlist.bz2
    selected_evidence_op_test.jsonlist.bz2

This directory contains the selected evidence sentences and extracted
keyphrases for root replies and OP posts (OP-retrieved evidence is only used
at test time; at training time we use the reply-retrieved evidence as the
oracle). For each sentence in an OP post or root reply, we consider up to ten
evidence sentences and their corresponding keyphrases (details described in
Sec 5.1 and Sec 5.2 of our paper).

trainable/:
    train_core_sample3.src
    train_core_sample3_arg.tgt
    train_core_sample3_kp.tgt
    valid_core_sample3.src
    valid_core_sample3_arg.tgt
    valid_core_sample3_kp.tgt

This directory contains the data we used for training and validation. For each
reply sentence we sample up to three evidence sentences and include them in the
"*.src" files, separated by a special token. The corresponding target arguments
are in the "*_arg.tgt" files, and the keyphrases are in the "*_kp.tgt" files.
(A short loading sketch for these parallel files appears at the end of this
README.)

test/:
    with_system_evidence/:
        test_system.src
        test_system_kp.tgt
    with_oracle_evidence/:
        test.src
        test_kp.tgt
        test_arg.tgt

This directory contains the test data we used for evaluation.
"with_system_evidence/" contains OP posts and evidence sentences retrieved by
using the OP posts themselves. "with_oracle_evidence/" contains OP posts and
evidence sentences retrieved by using the gold-standard root replies
(arguments).

======== (2) Pre-training dataset ========

pretrain_dataset/:
    train_pretrain.src
    train_pretrain.tgt
    valid_pretrain.src
    valid_pretrain.tgt
    test_pretrain.src
    test_pretrain.tgt

This directory contains the pre-training dataset. It includes root and
non-root replies (the non-root replies must reply directly to the OP poster),
from both politics and non-politics threads. We use them to initialize a
vanilla sequence-to-sequence model, so we do not include retrieval results
for them.

===== II. Change log =====

2018/08/23: the following files have been changed because parts of them were
not fully tokenized. The updated files are tokenized with nltk.word_tokenize().
  - trainable/train_core_sample3.src
  - trainable/valid_core_sample3.src
  - test/with_oracle_evidence/test.src
  - test/with_system_evidence/test_system.src

===== III. Contact =====

Should you have any questions, please contact Xinyu Hua at hua.x@husky.neu.edu.
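As referenced in the description of trainable/ above, here is a minimal
sketch (not an official loader) for iterating over the parallel *.src /
*_arg.tgt / *_kp.tgt files, assuming UTF-8 plain text with one example per
line and aligned line counts across the three files. The nltk.word_tokenize()
call noted in the change log is the tokenizer used for the updated *.src
files, so applying the same function to any new input text keeps the
tokenization consistent.

    def read_parallel(src_path, arg_path, kp_path):
        """Yield (source, target_argument, keyphrases) triples, one per line."""
        with open(src_path, encoding="utf-8") as fs, \
             open(arg_path, encoding="utf-8") as fa, \
             open(kp_path, encoding="utf-8") as fk:
            for src, arg, kp in zip(fs, fa, fk):
                yield src.strip(), arg.strip(), kp.strip()

    # Hypothetical usage with the distributed file names:
    # for src, arg, kp in read_parallel("trainable/train_core_sample3.src",
    #                                   "trainable/train_core_sample3_arg.tgt",
    #                                   "trainable/train_core_sample3_kp.tgt"):
    #     ...

    # To tokenize new text consistently with the 2018/08/23 update:
    # from nltk import word_tokenize
    # tokens = word_tokenize("Some new input sentence.")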