This repository contains the code and resources for the paper "Explainable Topic Continuity in Political Discourse: A Sentence Pair BERT Model Analysis". The project leverages Sentence Pair Modeling (SPM), BERT, and the Transformers Interpret library to analyze topic continuity in political discourse.
Topic continuity is defined by specific linguistic features that suggest a sustained subject or theme between two consecutive sentences. This research focuses on analyzing five linguistic features that define topic continuity:
- Coreferentiality
- Lexical cohesion
- Semantic cohesion
- Syntactic parallelism
- Transitional cohesion
The project includes a dataset of 2,884 sentence pairs and a fine-tuned BERT model (TopicContinuityBERT) to analyze how these linguistic features influence topic continuity across sentences.
This paper is part of the doctoral thesis:
"Explaining Large Language Models for Passage-Level Political Statement Extraction Using Linguistic Rule-Based Models"
A doctoral thesis submitted to the Faculty 1: Mathematics, Computer Science, Physics, Electrical Engineering and Information Technology of the Brandenburg University of Technology Cottbus-Senftenberg for the academic degree of Dr.-Ing.
This work was published in: Reyes, J. F., "Explainable Topic Continuity in Political Discourse: A Sentence Pair BERT Model Analysis", International Journal of Computational Linguistics (IJCL), Volume 15, Issue 2.
The model and dataset used in this project are published on Hugging Face:
- Dataset: TopicContinuity, https://doi.org/10.57967/hf/2756
- Model: TopicContinuityBERT, https://doi.org/10.57967/hf/2757
- db.py: Database connection and configuration for Google Sheets integration
- paper_c_1_split_dataset.py: Splits the dataset into train, validation, and test sets
- paper_c_2_train_bert.py: Trains the BERT model for topic continuity classification
- paper_c_3_test_bert.py: Evaluates the trained BERT model on the test dataset
- paper_c_4_inference_bert.py: Performs inference using the trained BERT model
- paper_c_5_plot_embeddings.py: Visualizes BERT embeddings
- paper_c_6_lrbm_classify.py: Implements a Logistic Regression Baseline Model for comparison
- paper_c_7_extend_tokenizer.py: Extends the BERT tokenizer with domain-specific tokens
- paper_c_8_transformers_interpret_analysis.py: Performs explainability analysis using Transformers Interpret
- paper_c_9_bert_hop_training.py: Implements a hyperparameter optimization training approach for BERT
- paper_c_10_feature_analysis.py: Analyzes linguistic features in the dataset
- paper_c_11_word_frequency_analysis.py: Analyzes word frequencies in the dataset
- paper_c_12_eda.py: Performs exploratory data analysis
- continuity_checks.py: Implements checks for topic continuity features
- ner_processing.py: Processes named entities for coreferentiality analysis
- text_utils.py: Provides text processing utilities
- utils.py: Contains general utility functions used across the project
- visualizations.py: Implements visualization functions for analysis results
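To illustrate the kind of rule continuity_checks.py encodes, here is a minimal sketch of a lexical-cohesion check based on content-word overlap between two consecutive sentences. The stopword list, the Jaccard-overlap heuristic, and the 0.1 threshold are assumptions for illustration only, not the repository's actual implementation:

```python
# Illustrative lexical-cohesion check: two sentences are lexically
# cohesive when their content words overlap enough (Jaccard >= threshold).
# The stopword list and threshold are toy assumptions.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are", "on"}

def content_words(sentence: str) -> set:
    """Lowercase, strip surrounding punctuation, and drop stopwords."""
    tokens = [t.strip(".,;:!?").lower() for t in sentence.split()]
    return {t for t in tokens if t and t not in STOPWORDS}

def lexical_cohesion(sent_a: str, sent_b: str, threshold: float = 0.1) -> bool:
    """Flag lexical cohesion when the Jaccard overlap of content words
    between the two sentences meets the threshold."""
    a, b = content_words(sent_a), content_words(sent_b)
    if not a or not b:
        return False
    overlap = len(a & b) / len(a | b)
    return overlap >= threshold

pair = ("The senator defended the new budget.",
        "The budget, she argued, protects social programs.")
print(lexical_cohesion(*pair))  # the repeated word "budget" links the pair
```

A production version would also need lemmatization and coreference handling (see ner_processing.py), which this sketch deliberately omits.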
- paper_c_1_dl_setfit_confusion_matrix.png: Confusion matrix visualization
- paper_c_bert_losses_final_28_06.png: Plot of BERT model training losses
- paper_c_bert_roc_curve.png: ROC curve for the BERT model
- paper_c_plot_bert_embeddings_22_07.png: Visualization of BERT embeddings
- topic_continuity_test.jsonl: Test dataset with sentence pairs and labels
- topic_continuity_train.jsonl: Training dataset with sentence pairs and labels
- topic_continuity_valid.jsonl: Validation dataset with sentence pairs and labels
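The three .jsonl files hold one JSON record per line and can be read with the standard json module. The field names below (text_a, text_b, label) and the label values are assumptions for illustration; consult the published dataset on Hugging Face for the actual schema:

```python
import json

def load_pairs(path: str) -> list:
    """Read one JSON object per line from a .jsonl sentence-pair file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Write a tiny stand-in file so the example is self-contained; the real
# records live in topic_continuity_train.jsonl and its siblings.
sample = [
    {"text_a": "The president announced a new policy.",
     "text_b": "The policy targets rising energy costs.",
     "label": "continue"},
    {"text_a": "Voters head to the polls on Tuesday.",
     "text_b": "Meanwhile, the stock market rallied.",
     "label": "not_continue"},
]
with open("sample_pairs.jsonl", "w", encoding="utf-8") as f:
    for record in sample:
        f.write(json.dumps(record) + "\n")

pairs = load_pairs("sample_pairs.jsonl")
print(len(pairs), pairs[0]["label"])  # → 2 continue
```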
- paper-c.html: The full research paper describing the methodology and findings
- unused_lib_files.md: List of library files not directly used in the main scripts
- Dataset Preparation: Run paper_c_1_split_dataset.py to prepare the dataset
- Model Training: Run paper_c_2_train_bert.py to train the BERT model
- Model Evaluation: Run paper_c_3_test_bert.py to evaluate the model
- Analysis: Run the various analysis scripts (paper_c_8_transformers_interpret_analysis.py, paper_c_10_feature_analysis.py, etc.) to analyze the results
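The dataset-preparation step can be sketched as a seeded shuffle-and-slice; the 80/10/10 ratio and fixed seed below are assumptions for illustration, and the actual paper_c_1_split_dataset.py may use different proportions:

```python
import random

def split_dataset(records: list, seed: int = 42):
    """Shuffle with a fixed seed and split into train/validation/test
    using an assumed 80/10/10 ratio."""
    rng = random.Random(seed)
    shuffled = records[:]          # copy so the input list is untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * 0.8)
    n_valid = int(n * 0.1)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

records = list(range(100))  # stand-in for the 2,884 sentence-pair records
train, valid, test = split_dataset(records)
print(len(train), len(valid), len(test))  # → 80 10 10
```

Fixing the seed keeps the split reproducible across runs, which matters when the test set must stay untouched between training and evaluation.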
The project dependencies are listed in the requirements.txt file. Install them using:
pip install -r requirements.txt
The analysis reveals that coreferentiality, lexical cohesion, and transitional cohesion are pivotal in maintaining thematic consistency across sentence pairs. This research deepens our understanding of political rhetoric and improves the transparency of natural language processing models, offering insights into the dynamics of political discourse.
If you use this code or the findings in your research, please cite the original paper:
Reyes, J. F. (2024). Explainable Topic Continuity in Political Discourse: A Sentence Pair BERT Model Analysis. International Journal of Computational Linguistics (IJCL), Volume 15, Issue 2.