Topic Continuity in Political Discourse

This repository contains the code and resources for the paper "Explainable Topic Continuity in Political Discourse: A Sentence Pair BERT Model Analysis". The project leverages Sentence Pair Modeling (SPM), BERT, and the Transformers Interpret library to analyze topic continuity in political discourse.

Project Overview

Topic continuity is signaled by linguistic features that sustain a subject or theme across two consecutive sentences. This research analyzes five such features:

  • Coreferentiality
  • Lexical cohesion
  • Semantic cohesion
  • Syntactic parallelism
  • Transitional cohesion
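
For example, in the pair "The administration unveiled a new tax plan. It is expected to take effect next year.", the pronoun "It" coreferring with "plan" signals coreferentiality, while the shared fiscal vocabulary signals lexical cohesion.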

The project includes a dataset of 2,884 sentence pairs and a fine-tuned BERT model (TopicContinuityBERT) to analyze how these linguistic features influence topic continuity across sentences.

Academic Context

This paper is part of the doctoral thesis:

"Explaining Large Language Models for Passage-Level Political Statement Extraction Using Linguistic Rule-Based Models"

A doctoral thesis submitted to the Faculty 1: Mathematics, Computer Science, Physics, Electrical Engineering and Information Technology of the Brandenburg University of Technology Cottbus-Senftenberg for the academic degree of Dr.-Ing.

This work was published as: Reyes, J. F. (2024). "Explainable Topic Continuity in Political Discourse: A Sentence Pair BERT Model Analysis". International Journal of Computational Linguistics (IJCL), Volume 15, Issue 2.

Hugging Face Resources

The fine-tuned model (TopicContinuityBERT) and the sentence-pair dataset used in this project are published on Hugging Face.

Repository Structure

Root Directory Python Files

  • db.py: Database connection and configuration for Google Sheets integration
  • paper_c_1_split_dataset.py: Splits the dataset into train, validation, and test sets
  • paper_c_2_train_bert.py: Trains the BERT model for topic continuity classification
  • paper_c_3_test_bert.py: Evaluates the trained BERT model on the test dataset
  • paper_c_4_inference_bert.py: Performs inference using the trained BERT model (a minimal inference sketch follows this list)
  • paper_c_5_plot_embeddings.py: Visualizes BERT embeddings
  • paper_c_6_lrbm_classify.py: Implements the linguistic rule-based model (LRBM) used as a baseline for comparison
  • paper_c_7_extend_tokenizer.py: Extends the BERT tokenizer with domain-specific tokens
  • paper_c_8_transformers_interpret_analysis.py: Performs explainability analysis using Transformers Interpret
  • paper_c_9_bert_hop_training.py: Implements a hyperparameter optimization training approach for BERT
  • paper_c_10_feature_analysis.py: Analyzes linguistic features in the dataset
  • paper_c_11_word_frequency_analysis.py: Analyzes word frequencies in the dataset
  • paper_c_12_eda.py: Performs exploratory data analysis
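
For orientation, here is a minimal sketch of what sentence-pair inference with the fine-tuned model could look like, assuming the standard Hugging Face transformers API. The checkpoint path and label order are hypothetical placeholders; paper_c_4_inference_bert.py is the authoritative implementation.

# Minimal sketch of sentence-pair inference with a fine-tuned BERT classifier.
# The checkpoint path and label order below are hypothetical placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_PATH = "models/TopicContinuityBERT"  # hypothetical local checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
model.eval()

sentence_a = "The senator outlined her plan to reform healthcare."
sentence_b = "She argued that coverage must be expanded to rural areas."

# Sentence Pair Modeling (SPM): both sentences are encoded as a single input,
# separated by [SEP], so BERT can attend across the pair.
inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze()

labels = ["no_continuity", "continuity"]  # hypothetical label order
print({label: round(p.item(), 3) for label, p in zip(labels, probs)})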

Library Files (lib/)

  • continuity_checks.py: Implements checks for topic continuity features (an illustrative toy check is sketched after this list)
  • ner_processing.py: Processes named entities for coreferentiality analysis
  • text_utils.py: Provides text processing utilities
  • utils.py: Contains general utility functions used across the project
  • visualizations.py: Implements visualization functions for analysis results
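
To illustrate the kind of rule continuity_checks.py encodes (the project's actual rules may differ), a toy lexical-cohesion check can test whether two consecutive sentences share content-word lemmas. The sketch below assumes spaCy with the en_core_web_sm model installed.

# Toy lexical-cohesion check based on shared content-word lemmas.
# Illustrative only; see lib/continuity_checks.py for the project's rules.
import spacy

nlp = spacy.load("en_core_web_sm")

def lexical_cohesion(sentence_a: str, sentence_b: str) -> bool:
    """Return True if the two sentences share at least one content lemma."""
    def content_lemmas(text: str) -> set:
        return {tok.lemma_.lower() for tok in nlp(text)
                if tok.pos_ in {"NOUN", "PROPN", "VERB", "ADJ"} and not tok.is_stop}
    return bool(content_lemmas(sentence_a) & content_lemmas(sentence_b))

print(lexical_cohesion("Congress passed the budget.",
                       "The budget cuts defense spending."))  # True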

Images (images/)

  • paper_c_1_dl_setfit_confusion_matrix.png: Confusion matrix visualization
  • paper_c_bert_losses_final_28_06.png: Plot of BERT model training losses
  • paper_c_bert_roc_curve.png: ROC curve for the BERT model
  • paper_c_plot_bert_embeddings_22_07.png: Visualization of BERT embeddings

Datasets (dataset/)

  • topic_continuity_test.jsonl: Test dataset with sentence pairs and labels
  • topic_continuity_train.jsonl: Training dataset with sentence pairs and labels
  • topic_continuity_valid.jsonl: Validation dataset with sentence pairs and labels
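
Each split stores one JSON object per line; since the field schema is not documented in this README, the safest way to see it is to print a record, as in this small sketch:

# Peek at the first record of the test split. The schema is not documented
# here, so we print the field names rather than assume them.
import json

with open("dataset/topic_continuity_test.jsonl", encoding="utf-8") as f:
    record = json.loads(f.readline())
print(sorted(record))  # field names of one sentence-pair example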

Documentation

  • paper-c.html: The full research paper describing the methodology and findings
  • unused_lib_files.md: List of library files not directly used in the main scripts

Usage

  1. Dataset Preparation: Run paper_c_1_split_dataset.py to prepare the dataset
  2. Model Training: Run paper_c_2_train_bert.py to train the BERT model
  3. Model Evaluation: Run paper_c_3_test_bert.py to evaluate the model
  4. Analysis: Run the various analysis scripts (paper_c_8_transformers_interpret_analysis.py, paper_c_10_feature_analysis.py, etc.) to analyze the results; a minimal explainability sketch follows
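
Step 4's explainability analysis uses the Transformers Interpret library. The sketch below shows the library's standard word-attribution API; joining the sentence pair into a single [SEP]-separated string is an assumption about the setup, and paper_c_8_transformers_interpret_analysis.py remains the authoritative script.

# Word-attribution sketch with Transformers Interpret. Joining the sentence
# pair with the tokenizer's separator token is an assumption, not necessarily
# how paper_c_8_transformers_interpret_analysis.py prepares its inputs.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers_interpret import SequenceClassificationExplainer

MODEL_PATH = "models/TopicContinuityBERT"  # hypothetical local checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)

explainer = SequenceClassificationExplainer(model, tokenizer)
pair = f"The senator outlined her plan. {tokenizer.sep_token} She defended it at length."
word_attributions = explainer(pair)  # list of (token, attribution) tuples
print(explainer.predicted_class_name)
for token, score in word_attributions:
    print(f"{token:>15s} {score:+.3f}")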

Requirements

The project dependencies are listed in the requirements.txt file. Install them using:

pip install -r requirements.txt

Research Findings

The analysis reveals that coreferentiality, lexical cohesion, and transitional cohesion are pivotal in maintaining thematic consistency across sentence pairs. This research enhances our understanding of political rhetoric and improves the transparency of natural language processing models, offering insights into the dynamics of political discourse.

Citation

If you use this code or the findings in your research, please cite the original paper:

Reyes, J. F. (2024). Explainable Topic Continuity in Political Discourse: A Sentence Pair BERT Model Analysis. International Journal of Computational Linguistics (IJCL), Volume 15, Issue 2.
