
Commit 99cd857

Merge remote-tracking branch 'upstream'

2 parents: d97898c + b8e99e8

30 files changed, +1711 -1483 lines

.github/dependabot.yml

Lines changed: 12 additions & 1 deletion
@@ -1,12 +1,23 @@
-# Docs: https://docs.github.com/code-security/dependabot/dependabot-version-updates/configuration-options-for-the-dependabot.yml-file
+# Docs: https://docs.github.com/en/code-security/dependabot/working-with-dependabot/dependabot-options-reference

 version: 2
 updates:
   - package-ecosystem: 'npm'
     directory: '/website'
     schedule:
       interval: 'monthly'
+    groups:
+      major:
+        update-types: ['major']
+      minor-and-patch:
+        update-types: ['minor', 'patch']
   - package-ecosystem: 'pip'
     directory: '/'
     schedule:
       interval: 'monthly'
+    groups:
+      major:
+        update-types: ['major']
+      minor-and-patch:
+        update-types: ['minor', 'patch']

.github/workflows/firebase-hosting-merge.yml

Lines changed: 6 additions & 1 deletion
@@ -1,8 +1,13 @@
 name: Deploy prod frontend on merge
+
 on:
   push:
     branches:
-      - master
+      - main
+
+permissions:
+  contents: read
+
 jobs:
   build_and_deploy:
     runs-on: ubuntu-latest

.github/workflows/firebase-hosting-pull-request.yml

Lines changed: 3 additions & 0 deletions
@@ -1,9 +1,12 @@
 name: Deploy frontend preview on PR
+
 on: pull_request
+
 permissions:
   checks: write
   contents: read
   pull-requests: write
+
 jobs:
   build_and_preview:
     if: ${{ github.event.pull_request.head.repo.full_name == github.repository }}

.github/workflows/frontend-ci.yml

Lines changed: 5 additions & 0 deletions
@@ -1,5 +1,10 @@
 name: Run Frontend CI on push
+
 on: [push]
+
+permissions:
+  contents: read
+
 jobs:
   frontend-ci:
     runs-on: ubuntu-latest

.prettierrc

Lines changed: 1 addition & 0 deletions
@@ -3,6 +3,7 @@
   "bracketSpacing": false,
   "printWidth": 100,
   "singleQuote": true,
+  "objectWrap": "collapse",
   "trailingComma": "es5",
   "plugins": ["@ianvs/prettier-plugin-sort-imports"],
   "importOrder": ["<BUILT_IN_MODULES>", "", "<THIRD_PARTY_MODULES>", "", "^[.]"]

README.md

Lines changed: 1 addition & 1 deletion
@@ -39,7 +39,7 @@ encyclopedia.

 - [Wikipedia API](https://www.mediawiki.org/wiki/API:Main_page)
 - [Wikipedia database layout](https://www.mediawiki.org/wiki/Manual:Database_layout)
-- [English Wikipedia database dumps](https://dumps.wikimedia.your.org/enwiki)
+- [English Wikipedia database dumps](https://dumps.wikimedia.org/enwiki)

 ## Contributing

WARP.md

Lines changed: 175 additions & 0 deletions
@@ -0,0 +1,175 @@
# WARP.md

This file provides guidance to WARP (warp.dev) when working with code in this repository.

## Project Overview

Six Degrees of Wikipedia finds the shortest path between any two Wikipedia pages through their hyperlinks. The project consists of:

- **Backend**: Python Flask API server (`sdow/`) with SQLite database containing Wikipedia links
- **Frontend**: React/TypeScript website (`website/`) built with Vite
- **Database**: Scripts to download, process, and build Wikipedia dump databases (`scripts/`)
- **Infrastructure**: Configuration for production deployment with Nginx, Gunicorn, Supervisord

## Development Environment Setup

### Initial Setup (One-time)
```bash
# From repo root - create Python virtual environment and install dependencies
virtualenv env
source env/bin/activate
pip install -r requirements.txt

# Create mock database for development
python scripts/create_mock_databases.py

# Setup frontend dependencies
cd website/
npm install
```

### Development Server (Every Session)
**Backend (Terminal 1):**
```bash
# From repo root
source env/bin/activate
cd sdow/
export FLASK_APP=server.py FLASK_DEBUG=1
flask run
# Runs on http://localhost:5000
```

**Frontend (Terminal 2):**
```bash
# From website/ directory
npm start
# Runs on http://localhost:3000
```

## Core Development Commands

### Backend (Python Flask API)
```bash
# Run server in debug mode
cd sdow/ && export FLASK_APP=server.py FLASK_DEBUG=1 && flask run

# Check Python code formatting
pylint sdow/

# Format Python code (uses PEP 8 with 2-space indents, 100-char lines)
autopep8 --in-place --recursive sdow/

# Query mock database directly
litecli sdow/sdow.sqlite
```

### Frontend (React/TypeScript)
```bash
# From website/ directory
npm start            # Development server (port 3000)
npm run build        # Production build
npm run preview      # Preview production build
npm run lint         # Run Prettier + ESLint
npm run format       # Auto-format code
npm run analyze      # Analyze bundle size
npm run update-deps  # Update dependencies (excludes react-router-dom)

# Type checking
npx tsc --noEmit
```

### Database Operations
```bash
# Create mock development databases
python scripts/create_mock_databases.py

# Build full production database from Wikipedia dumps (takes hours/days)
cd scripts/ && ./buildDatabase.sh

# Build for specific Wikipedia dump date
cd scripts/ && ./buildDatabase.sh 20231201

# Upload database to Google Cloud Storage
cd scripts/ && ./uploadToGcs.sh

# Backup searches database
cd scripts/ && ./backupSearchesDatabase.sh
```

## Architecture Overview

### Bi-Directional Breadth-First Search Algorithm
The core search algorithm (`sdow/breadth_first_search.py`) uses bi-directional BFS:

1. **Dual Search**: Searches simultaneously from source and target pages
2. **Adaptive Direction**: Chooses search direction based on link count (fewer outgoing/incoming links)
3. **Optimal Strategy**: Forward search follows outgoing links, backward search follows incoming links
4. **Path Construction**: When searches meet, reconstructs complete path through parent tracking
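
The list above can be illustrated with a minimal bi-directional BFS sketch in Python. This is not the repository's `breadth_first_search.py`: it returns one path found where the two frontiers meet rather than every shortest path, uses frontier size as a stand-in for the link-count heuristic, and assumes `outgoing_links`/`incoming_links` are callables mapping a page ID to its neighboring page IDs.

```python
from collections import deque


def bidirectional_bfs(source, target, outgoing_links, incoming_links):
  """Minimal bi-directional BFS sketch (illustrative, not the repo implementation)."""
  if source == target:
    return [source]

  # parents_* map each visited page to the page it was reached from.
  forward_parents = {source: None}
  backward_parents = {target: None}
  forward_frontier = deque([source])
  backward_frontier = deque([target])

  def build_path(meeting_page):
    # Walk back to the source, then forward to the target.
    path = []
    page = meeting_page
    while page is not None:
      path.append(page)
      page = forward_parents[page]
    path.reverse()
    page = backward_parents[meeting_page]
    while page is not None:
      path.append(page)
      page = backward_parents[page]
    return path

  while forward_frontier and backward_frontier:
    # Expand the smaller frontier first (a stand-in for the link-count heuristic).
    if len(forward_frontier) <= len(backward_frontier):
      frontier, parents, others, links = (
          forward_frontier, forward_parents, backward_parents, outgoing_links)
    else:
      frontier, parents, others, links = (
          backward_frontier, backward_parents, forward_parents, incoming_links)

    for _ in range(len(frontier)):  # process one full BFS level
      page = frontier.popleft()
      for neighbor in links(page):
        if neighbor in parents:
          continue
        parents[neighbor] = page
        if neighbor in others:
          return build_path(neighbor)
        frontier.append(neighbor)

  return None  # no path exists
```

The real implementation also has to resolve redirects and collect every shortest path, which this sketch omits.
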
### Database Schema
- **pages**: `id`, `title`, `is_redirect` - All Wikipedia pages
- **links**: `id`, `outgoing_links_count`, `incoming_links_count`, `outgoing_links`, `incoming_links` - Page link relationships stored as pipe-separated strings
- **redirects**: `source_id`, `target_id` - Page redirects
- **searches**: Search result logging with timing data
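
As a rough illustration of reading that schema (not the repository's `database.py`, and glossing over title sanitization and redirect resolution), a page's outgoing link IDs can be pulled out of the mock database like this:

```python
import sqlite3


def fetch_outgoing_link_ids(db_path, page_title):
  """Look up a non-redirect page by title and return its outgoing link IDs."""
  conn = sqlite3.connect(db_path)
  try:
    cursor = conn.cursor()
    cursor.execute('SELECT id FROM pages WHERE title = ? AND is_redirect = 0;', (page_title,))
    row = cursor.fetchone()
    if row is None:
      return []
    (page_id,) = row
    cursor.execute('SELECT outgoing_links FROM links WHERE id = ?;', (page_id,))
    row = cursor.fetchone()
    if row is None or not row[0]:
      return []
    # Links are stored as a single pipe-separated string of page IDs.
    return [int(link_id) for link_id in row[0].split('|')]
  finally:
    conn.close()
```
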
### API Endpoints
- `POST /paths` - Main search endpoint: `{"source": "Page A", "target": "Page B"}`
- `GET /ok` - Health check
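
With the dev server from the setup section running, the search endpoint can be exercised with a plain `requests` call (no assumptions are made here about the exact response keys):

```python
import requests

# Assumes the Flask dev server is running locally on port 5000.
response = requests.post(
    'http://localhost:5000/paths',
    json={'source': 'Page A', 'target': 'Page B'},
    timeout=30,
)
response.raise_for_status()

# The response carries the shortest paths (arrays of page IDs) plus page metadata.
print(response.json())
```
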
### Data Flow
1. Frontend sends search request to `/paths` with source/target page titles
2. Backend resolves titles to page IDs, handling redirects
3. Bi-directional BFS algorithm finds shortest paths
4. Wikipedia API is queried for page metadata (titles, URLs, summaries)
5. Results returned as JSON with paths (page ID arrays) and page data
6. Frontend renders results as both list and D3.js graph visualization

## Key Files and Patterns

### Backend Structure
- `server.py` - Flask app with CORS, compression, error handling
- `database.py` - SQLite query abstraction and caching
- `breadth_first_search.py` - Core pathfinding algorithm
- `helpers.py` - Wikipedia API integration, error classes

### Frontend Structure (see website/WARP.md)
The frontend is a separate React/TypeScript application with its own WARP.md file containing detailed frontend-specific guidance.

### Database Build Pipeline
Wikipedia dump processing involves multiple stages:

1. **Download**: Gets latest Wikipedia dumps (pages, links, redirects) via wget/torrent
2. **Trim**: Filters to main namespace (0) articles only
3. **Transform**: Converts titles to IDs, resolves redirects
4. **Sort/Dedupe**: Optimizes link data for search performance
5. **Import**: Creates SQLite database with proper indexes
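
As a loose sketch of the Sort/Dedupe idea in stage 4 (not the actual pipeline scripts): once `(source_id, target_id)` pairs are sorted, duplicates can be dropped and each source's targets collapsed into the pipe-separated string format used by the `links` table.

```python
from itertools import groupby
from operator import itemgetter


def group_links(sorted_link_pairs):
  """Collapse sorted (source_id, target_id) pairs into one pipe-separated string per source."""
  grouped = {}
  for source_id, pairs in groupby(sorted_link_pairs, key=itemgetter(0)):
    seen = set()
    targets = []
    for _, target_id in pairs:
      # Dedupe while preserving the sorted order.
      if target_id not in seen:
        seen.add(target_id)
        targets.append(target_id)
    grouped[source_id] = '|'.join(str(t) for t in targets)
  return grouped


# Example: page 1 links to 2 and 3 (with a duplicate), page 2 links to 3.
print(group_links([(1, 2), (1, 2), (1, 3), (2, 3)]))  # {1: '2|3', 2: '3'}
```
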
### Production Architecture
- **Web Server**: Nginx reverse proxy
- **App Server**: Gunicorn WSGI server running Flask app
- **Process Management**: Supervisord for service monitoring
- **Database**: SQLite files (~50GB+ for full Wikipedia)
- **Deployment**: Google Cloud Platform with Firebase hosting for frontend

## Development Patterns

### Python Code Style
- 2-space indentation (configured in .pylintrc, setup.cfg)
- 100-120 character line limits
- PEP 8 compliance with custom indent rules

### Error Handling
- Custom `InvalidRequest` exception class for user-facing errors
- Comprehensive logging with Google Cloud Logging integration
- Graceful degradation when Wikipedia API unavailable
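
A hedged sketch of that user-facing error pattern, with the wiring details assumed rather than taken from `helpers.py`/`server.py`:

```python
from flask import Flask, jsonify

app = Flask(__name__)


class InvalidRequest(Exception):
  """User-facing error carrying an HTTP status code and message (illustrative)."""

  def __init__(self, message, status_code=400):
    super().__init__(message)
    self.message = message
    self.status_code = status_code


@app.errorhandler(InvalidRequest)
def handle_invalid_request(error):
  # Surface the message to the client instead of a generic 500 page.
  return jsonify({'error': error.message}), error.status_code


@app.route('/paths', methods=['POST'])
def paths():
  # Raised here unconditionally just to show the handler in action.
  raise InvalidRequest("You must provide both a 'source' and a 'target' page.")
```
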
### Performance Considerations
- **Link Storage**: Pipe-separated strings in TEXT fields for space efficiency
- **Search Optimization**: Chooses BFS direction based on link density
- **Database Indexes**: Optimized for title lookups and link count queries
- **Caching**: Database connection pooling and query result caching

## Mock vs Production Data

**Development**: Uses `create_mock_databases.py` - creates ~35 mock Wikipedia pages with simple link relationships for testing the search algorithm.

**Production**: Uses `buildDatabase.sh` - downloads and processes full Wikipedia dumps (~6M+ pages, ~150M+ links). Takes significant time and disk space (50GB+).

requirements.txt

Lines changed: 8 additions & 8 deletions
@@ -1,10 +1,10 @@
-flask == 3.0.3
-flask-compress == 1.17
-flask-cors == 5.0.0
-litecli == 1.12.3
-google-cloud-logging == 3.11.3
+flask == 3.1.2
+flask-compress == 1.18
+flask-cors == 6.0.1
+litecli == 1.17.0
+google-cloud-logging == 3.12.1
 google-compute-engine == 2.8.13
 gunicorn == 23.0.0
-protobuf == 5.28.3
-requests == 2.32.3
-supervisor == 4.2.5
+protobuf == 6.32.1
+requests == 2.32.5
+supervisor == 4.3.0

scripts/buildDatabase.sh

Lines changed: 13 additions & 2 deletions
@@ -10,7 +10,11 @@ LANGWIKI=frwiki
 # By default, the latest Wikipedia dump will be downloaded. If a download date in the format
 # YYYYMMDD is provided as the first argument, it will be used instead.
 if [[ $# -eq 0 ]]; then
+<<<<<<< HEAD
   DOWNLOAD_DATE=$(wget -q -O- https://dumps.wikimedia.org/$LANGWIKI/ | grep -Po '\d{8}' | sort | tail -n1)
+=======
+  DOWNLOAD_DATE=$(wget -q -O- https://dumps.wikimedia.org/enwiki/ | grep -Po '\d{8}' | sort | tail -n1)
+>>>>>>> upstream
 else
   if [ ${#1} -ne 8 ]; then
     echo "[ERROR] Invalid download date provided: $1"
@@ -23,9 +27,14 @@ fi
 ROOT_DIR=`pwd`
 OUT_DIR="dump"

+<<<<<<< HEAD
 DELETE_PROGRESSIVELY=false
 DOWNLOAD_URL="https://dumps.wikimedia.org/$LANGWIKI/$DOWNLOAD_DATE"
 TORRENT_URL="https://tools.wmflabs.org/dump-torrents/$LANGWIKI/$DOWNLOAD_DATE"
+=======
+DOWNLOAD_URL="https://dumps.wikimedia.org/enwiki/$DOWNLOAD_DATE"
+TORRENT_URL="https://dump-torrents.toolforge.org/enwiki/$DOWNLOAD_DATE"
+>>>>>>> upstream

 SHA1SUM_FILENAME="$LANGWIKI-$DOWNLOAD_DATE-sha1sums.txt"

@@ -54,8 +63,10 @@ function download_file() {
   if [ $1 != sha1sums ] && command -v aria2c > /dev/null; then
     echo "[INFO] Downloading $1 file via torrent"
     time aria2c --summary-interval=0 --console-log-level=warn --seed-time=0 \
-      "$TORRENT_URL/$2.torrent"
-  else
+      "$TORRENT_URL/$2.torrent" 2>&1 | grep -v "ERROR\|Exception" || true
+  fi
+
+  if [ ! -f $2 ]; then
     echo "[INFO] Downloading $1 file via wget"
     time wget --progress=dot:giga "$DOWNLOAD_URL/$2"
   fi
