Project page for "Neural Argument Generation Augmented with Externally Retrieved Evidence"

View the Project on GitHub XinyuHua/neural-argument-generation


This repository holds the code for the Neural Argument Generation project developed at Northeastern NLP. For details about the framework, please read our ACL 2018 paper, "Neural Argument Generation Augmented with Externally Retrieved Evidence".




Update: The dataset was updated on 2018/08/23. This change fixes some tokenization errors that existed in the previous version.

Please download the dataset from here.

The dataset consists of the following 5 parts:

  1. cmv_processed: filtered OP posts and root replies used to create the core dataset

  2. wikipedia_retrieval: Wikipedia article titles retrieved as evidence sources for OP and root replies

  3. reranked_evidence: selected evidence sentences and extracted keyphrases for OP and root replies

  4. trainable: directly trainable dataset

  5. test: test set we used for evaluation

(A detailed readme file can be found here.)

File structure

Please download the corresponding data and put it under the dat/ folder. If the folder does not exist, please create it by hand:

mkdir -p dat/log
mkdir -p dat/trainable/bin
 ├── src/
 │   ├──
 │   ├──
 │   ├──
 │   ├──
 │   ├──
 │   ├──
 │   ├──
 │   ├──
 │   ├──
 │   └──
 ├── scripts/
 │   ├──
 │   └── (coming soon)
 └── dat/
     ├── vocab.src
     ├── vocab.tgt
     ├── trainable/
     │    ├── train_core_sample3.src
     │    ├── train_core_sample3_arg.tgt
     │    ├── train_core_sample3_kp.tgt
     │    ├── valid_core_sample3.src
     │    ├── valid_core_sample3_arg.tgt
     │    ├── valid_core_sample3_kp.tgt
     │    └── bin/
     └── log/
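
Before preprocessing, the layout above can be checked with a short script. This is a minimal sketch and not part of the repository: the function name `missing_data_files` is hypothetical, and the file list is simply copied from the tree above. It creates the `bin/` and `log/` folders if needed and reports which expected files are still missing.

```python
from pathlib import Path

# File names copied from the tree above; paths are relative to dat/.
EXPECTED_FILES = [
    "vocab.src",
    "vocab.tgt",
    "trainable/train_core_sample3.src",
    "trainable/train_core_sample3_arg.tgt",
    "trainable/train_core_sample3_kp.tgt",
    "trainable/valid_core_sample3.src",
    "trainable/valid_core_sample3_arg.tgt",
    "trainable/valid_core_sample3_kp.tgt",
]
EXPECTED_DIRS = ["trainable/bin", "log"]


def missing_data_files(dat_root="dat"):
    """Create the expected sub-folders and list the files that are missing.

    Hypothetical helper for illustration; it is not shipped with the code.
    """
    root = Path(dat_root)
    for sub in EXPECTED_DIRS:
        # Same effect as `mkdir -p`: parents are created, existing dirs are fine.
        (root / sub).mkdir(parents=True, exist_ok=True)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling `missing_data_files()` from the repository root returns an empty list once every file from the tree is in place.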


This step binarizes the plain text data. Please make sure the plain text data files are in place, as laid out in the file structure above, before running it.

python3 scripts/
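
One way to check the plain text files before binarizing is to compare line counts across the three files of a split. The sketch below assumes that each file holds one example per line and that the `.src`, `_arg.tgt`, and `_kp.tgt` files are line-aligned; neither assumption is stated in this page, and the helper names are hypothetical.

```python
def count_lines(path):
    """Count the lines of a UTF-8 text file without loading it into memory."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_alignment(prefix):
    """Return per-file line counts for one split, raising if they differ.

    `prefix` is a path such as "dat/trainable/train_core_sample3".
    Assumption (not confirmed by the source): the three files are line-aligned.
    """
    suffixes = [".src", "_arg.tgt", "_kp.tgt"]
    counts = {prefix + s: count_lines(prefix + s) for s in suffixes}
    if len(set(counts.values())) != 1:
        raise ValueError("Line counts differ: %r" % counts)
    return counts
```

For example, `check_alignment("dat/trainable/valid_core_sample3")` would raise a `ValueError` if one of the validation files were truncated during download.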

Training and concurrent validation

Train the model by setting --mode=train. While the model is training, start a second process with --mode=eval for concurrent validation. Loss summaries from both runs are logged to the same experiment folder and can be visualized with TensorBoard.

python3 src/ [--mode={train,eval}] [--model={vanilla,seq_dec,shd_dec}] \
                      [--data_path=PATH_TO_BIN_DATA] \
                      [--model_path=PATH_TO_STORE_MODEL] \
                      [--exp_name=EXP_NAME] \
                      [--batch_size=BS] \
                      [--src_vocab_path=PATH_TO_SRC_VOCAB] \
                      [--tgt_vocab_path=PATH_TO_TGT_VOCAB]
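
The train and eval runs can also be started together from Python. The following is a rough sketch under stated assumptions: `build_command` and `launch_train_and_eval` are hypothetical helpers, and the `script` argument must be the path to the entry script under src/, which is not named on this page. Only the flags listed above are used.

```python
import subprocess

MODES = ("train", "eval", "decode")
MODELS = ("vanilla", "seq_dec", "shd_dec")


def build_command(script, mode, model, **flags):
    """Assemble the argument list for one run.

    `script` is the entry script path (supplied by the caller, since it is
    not given here). Extra flags such as exp_name or batch_size are passed
    through as --name=value in sorted order.
    """
    if mode not in MODES:
        raise ValueError("unknown mode: %s" % mode)
    if model not in MODELS:
        raise ValueError("unknown model: %s" % model)
    cmd = ["python3", script, "--mode=" + mode, "--model=" + model]
    for name, value in sorted(flags.items()):
        cmd.append("--%s=%s" % (name, value))
    return cmd


def launch_train_and_eval(script, model, **flags):
    """Start training and concurrent validation as two separate processes."""
    train = subprocess.Popen(build_command(script, "train", model, **flags))
    evaluate = subprocess.Popen(build_command(script, "eval", model, **flags))
    return train, evaluate
```

Passing the same `exp_name` and `model_path` to both runs keeps their summaries in one experiment folder, which matches the setup described above.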


After the model is trained, decode on binarized data using the following command. Note that the default for --ckpt_id is -1, which indicates the newest (not necessarily the best) checkpoint.

python3 src/ [--mode=decode] [--model={vanilla,seq_dec,shd_dec}] \
                      [--data_path=PATH_TO_BIN_DATA] \
                      [--model_path=PATH_TO_STORE_MODEL] \
                      [--exp_name=EXP_NAME] \
                      [--ckpt_id=CKPT_ID] \
                      [--beam_size=BS] \
                      [--src_vocab_path=PATH_TO_SRC_VOCAB] \
                      [--tgt_vocab_path=PATH_TO_TGT_VOCAB]
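
Because the default checkpoint is the newest rather than the best, it can help to pick --ckpt_id explicitly from the validation losses. The sketch below only illustrates that selection logic. It assumes checkpoint ids are integers that grow during training (so the newest has the largest id) and that validation losses per checkpoint have been collected separately, for instance by reading them off TensorBoard; both helpers are hypothetical and not part of the codebase.

```python
def resolve_ckpt_id(ckpt_id, available_ids):
    """Mimic the documented default: -1 selects the newest checkpoint.

    Assumption: ids increase over training, so newest == largest id.
    """
    if not available_ids:
        raise ValueError("no checkpoints available")
    if ckpt_id == -1:
        return max(available_ids)
    if ckpt_id not in available_ids:
        raise ValueError("checkpoint %d not found" % ckpt_id)
    return ckpt_id


def best_ckpt_id(valid_loss_by_id):
    """Return the checkpoint id with the lowest validation loss."""
    if not valid_loss_by_id:
        raise ValueError("no validation losses given")
    return min(valid_loss_by_id, key=valid_loss_by_id.get)
```

With made-up losses `{1000: 4.2, 2000: 3.9, 3000: 4.0}`, `best_ckpt_id` picks 2000, whereas the default of -1 would resolve to 3000.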


[coming soon]

Support or Contact

Please contact Xinyu Hua with any questions about this repository.


Part of this codebase is based on Pointer-generator. The dual attention implementation is adapted from code by Lisa Fan.