This project is a submission for the Fellowship.ai Cohort 33 challenge. It classifies the sentiment of IMDB movie reviews with both a traditional baseline (TF-IDF features with Logistic Regression) and Transformer-based models, to demonstrate proficiency in natural language processing and machine learning.
The challenge involves building a sentiment analysis system that can effectively classify movie reviews as positive or negative, showcasing:
- Data preprocessing capabilities
- Feature engineering skills
- Model implementation and evaluation
- Code organization and documentation
- Use of modern NLP techniques (SpaCy and Transformers)
- Data preprocessing using SpaCy with GPU acceleration
- TF-IDF vectorization for feature extraction
- Logistic Regression for baseline classification
- Transformer-based models for advanced sentiment analysis
- Comprehensive visualization and evaluation metrics
- Python 3.x
- Google Colab (for original notebook execution)
- Required packages:
  - pandas
  - spacy
  - scikit-learn
  - seaborn
  - matplotlib
  - transformers
  - kaggle
  - tqdm

    def preprocess_text(text, nlp):
        text = clean_text(text)
        doc = nlp(text)
        tokens = [token.lemma_ for token in doc if token.is_alpha and len(token.text) > 2]
        return ' '.join(tokens)

- Utilizes SpaCy's GPU capabilities for faster processing
- Optimized for Google Colab's GPU environment
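
The `clean_text` helper called by `preprocess_text` is not shown in this README. The sketch below is one hypothetical version, assuming the raw IMDB reviews contain HTML line breaks and entities. It also adds a batched variant, `preprocess_corpus` (an illustrative name, not from the notebook), which uses spaCy's `nlp.pipe` so that many reviews are processed per call.

```python
import html
import re

# Hypothetical helper: the original clean_text is not shown, so this is one
# plausible version for IMDB reviews (HTML tags, entities, extra whitespace).
_TAG_RE = re.compile(r"<[^>]+>")
_SPACE_RE = re.compile(r"\s+")


def clean_text(text):
    text = html.unescape(text)        # "&amp;" -> "&"
    text = _TAG_RE.sub(" ", text)     # drop tags such as <br />
    text = _SPACE_RE.sub(" ", text)   # collapse runs of whitespace
    return text.strip()


def preprocess_corpus(texts, nlp, batch_size=256):
    """Lemmatize many reviews with nlp.pipe instead of one nlp() call per review."""
    cleaned = (clean_text(t) for t in texts)
    results = []
    for doc in nlp.pipe(cleaned, batch_size=batch_size):
        # Same token filter as preprocess_text: alphabetic, longer than 2 chars.
        tokens = [tok.lemma_ for tok in doc if tok.is_alpha and len(tok.text) > 2]
        results.append(" ".join(tokens))
    return results


# Typical setup (requires spaCy and the en_core_web_sm model):
#   import spacy
#   spacy.prefer_gpu()                  # uses a GPU if available, else stays on CPU
#   nlp = spacy.load("en_core_web_sm")
#   cleaned_reviews = preprocess_corpus(reviews, nlp)
```

Calling `spacy.prefer_gpu()` before `spacy.load` matters, because the pipeline is allocated on whichever device is active at load time. The batch size of 256 is an arbitrary starting point, not a value from the notebook.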
- Data cleaning and preprocessing
- Feature extraction using TF-IDF
- Model training with Logistic Regression
- Performance evaluation and visualization
- Advanced sentiment analysis using Transformers
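
For the Transformer step, the notebook's exact model and code are not given here. A minimal sketch, assuming the Hugging Face `pipeline("sentiment-analysis")` interface, which returns one `{"label": ..., "score": ...}` dict per input, might wrap the classifier as follows. The function name `label_reviews` is illustrative.

```python
def label_reviews(reviews, classifier):
    """Pair each review with the label and score returned by a sentiment classifier.

    `classifier` is any callable that returns one {"label": ..., "score": ...}
    dict per input text, which is the shape a Hugging Face
    sentiment-analysis pipeline produces.
    """
    reviews = list(reviews)
    outputs = classifier(reviews)
    return [
        (review, out["label"], round(float(out["score"]), 3))
        for review, out in zip(reviews, outputs)
    ]


# With the transformers package installed (a default model is downloaded on first use):
#   from transformers import pipeline
#   classifier = pipeline("sentiment-analysis")
#   label_reviews(["A moving, beautifully shot film."], classifier)
```

Long IMDB reviews can exceed a model's maximum input length, so some form of truncation is likely needed before classifying full reviews.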
- Open the notebook in Google Colab
- Upload your Kaggle credentials
- Run all cells sequentially
- Review the visualizations and performance metrics
- Uses SpaCy's en_core_web_sm model for preprocessing
- Implements TF-IDF vectorization with:
  - max_features: 50,000
  - ngram_range: (1, 2)
- Logistic Regression parameters:
  - C: 1.0
  - max_iter: 1000
  - n_jobs: -1 (parallel processing)
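
The settings listed above map onto scikit-learn roughly as shown below. This is a sketch rather than the notebook's code: `build_baseline` and the toy reviews are illustrative, and `max_iter` is scikit-learn's name for the iteration cap.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def build_baseline():
    """TF-IDF features followed by Logistic Regression, using the listed settings."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))),
        # n_jobs=-1 from the list is left out here: in scikit-learn it parallelizes
        # over classes, so it likely has little effect on a binary task.
        ("clf", LogisticRegression(C=1.0, max_iter=1000)),
    ])


# Toy data for illustration only; the project trains on preprocessed IMDB reviews.
train_texts = [
    "a wonderful heartfelt film",
    "great acting and a great story",
    "a dull boring mess",
    "terrible plot and awful acting",
]
train_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

model = build_baseline()
model.fit(train_texts, train_labels)
predictions = model.predict(["great story", "awful mess"])
print(predictions)
```

Wrapping both steps in a `Pipeline` keeps the vectorizer and classifier together, so raw text can be passed directly to `fit` and `predict`.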
The project provides:
- Classification metrics
- Confusion matrix visualization
- Review length distribution analysis
- Sample predictions using transformer models
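
As an illustration of the classification metrics and confusion matrix, the sketch below uses scikit-learn on made-up labels. The numbers are not results from this project.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up labels for illustration: 1 = positive, 0 = negative.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

# Rows are true classes, columns are predicted classes, in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
accuracy = accuracy_score(y_true, y_pred)
report = classification_report(
    y_true, y_pred, labels=[0, 1], target_names=["negative", "positive"]
)

print(cm)
print(f"accuracy: {accuracy:.2f}")
print(report)
```

The matrix can then be passed to a plotting call such as `seaborn.heatmap(cm, annot=True, fmt="d")` for the visualization. The notebook's own plotting code is not shown here.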
- Implement cross-validation
- Add more advanced preprocessing techniques
- Experiment with different transformer architectures
- Add model comparison metrics
- Implement model serialization
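
Two of the items above, cross-validation and model serialization, could be approached along these lines. This is a sketch on toy data under assumed names (the file name `sentiment_baseline.joblib` is hypothetical), not part of the current project.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy data for illustration only.
texts = [
    "a wonderful heartfelt film", "great acting and a great story",
    "loved every minute", "brilliant and moving",
    "a dull boring mess", "terrible plot and awful acting",
    "hated every minute", "slow and lifeless",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)

# With the vectorizer inside the pipeline, it is refit on each training fold,
# so the vocabulary and IDF weights never see the held-out fold.
scores = cross_val_score(model, texts, labels, cv=2, scoring="accuracy")
print(scores)

# Serialize the fitted pipeline and load it back.
model.fit(texts, labels)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sentiment_baseline.joblib")  # hypothetical name
    joblib.dump(model, path)
    restored = joblib.load(path)

same_predictions = list(restored.predict(texts)) == list(model.predict(texts))
print(same_predictions)
```

On the full dataset a larger fold count such as 5 would be more typical. The value 2 is used here only because the toy set is tiny.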
- This project was developed as part of the Fellowship.ai Cohort 33 application process
- Originally developed in Google Colab for GPU acceleration
- Focuses on demonstrating both traditional ML and modern NLP approaches