
Commit f19f1b5

Update mediawiki database schema + python version
1 parent b8e99e8 commit f19f1b5

8 files changed: +321 -63 lines changed

scripts/README.md

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
+# Description of the process
+
+## Parsing of the tables
+
+### links.txt
+- `pl_from` -> Id of the "from" page of this link
+- (`pl_namespace`) -> We keep the record only if this equals 0 (= namespace of the "from" page of this link)
+- `pl_target_id` -> Target of this link (foreign key to `linktarget`)
+
+### targets.txt
+- `lt_id` -> Id of this link (index)
+- (`lt_ns`) -> We keep the record only if this equals 0 (= namespace of the targeted page)
+- `lt_title` -> Title of the targeted page
+
+### pages.txt
+- `page_id` -> Id of the page
+- (`page_namespace`) -> We keep the record only if this equals 0 (= namespace of this page)
+- `page_title` -> Title of this page
+- `page_is_redirect` -> Boolean indicating whether this page is a redirect
+- Ignore the eight following columns
+
+### redirects.txt
+- `rd_from` -> Id of the page from which we are redirected
+- (`rd_namespace`) -> We keep the record only if this equals 0 (= namespace of the page we are redirected to)
+- `rd_title` -> Title of the page we are redirected to
+- Ignore the two following columns
+
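For illustration, loading two of the trimmed files could look like this minimal sketch (assuming each trimmed file is a gzipped TSV with exactly the columns listed above; the variable names are hypothetical, not from the repo's scripts):

import gzip

# pages.txt: page_id, page_title, page_is_redirect (tab-separated).
titles_to_ids = {}
with gzip.open('pages.txt.gz', 'rb') as f:
    for line in f:
        page_id, page_title, is_redirect = line.rstrip(b'\n').split(b'\t')
        titles_to_ids[page_title] = page_id

# targets.txt: lt_id, lt_title (tab-separated).
target_titles = {}
with gzip.open('targets.txt.gz', 'rb') as f:
    for line in f:
        lt_id, lt_title = line.rstrip(b'\n').split(b'\t')
        target_titles[lt_id] = lt_title
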
+## Joining the tables
+
+### redirects.with_ids.txt (replace_titles_in_redirects_file.py)
+Replaces, for each redirection, `rd_title` with the targeted `page_id` by matching on `page_title`.
+The targeted `page_id` is then itself resolved recursively through redirects, until we reach a "final" page.
+- `rd_from` -> The id of the page we are redirected from
+- `page_id` -> The id of the page we reach by following redirections recursively
+
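The recursive resolution can be pictured as follows (a minimal sketch, assuming a dict `redirects` mapping `rd_from` to the target `page_id`; the cycle guard is our own assumption, since redirect chains in real dumps can loop):

def resolve(page_id, redirects):
    """Follow redirect hops until reaching a page that is not itself a redirect."""
    seen = set()
    while page_id in redirects and page_id not in seen:
        seen.add(page_id)
        page_id = redirects[page_id]
    return page_id
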
### targets.with_ids.txt (replace_titles_and_redirects_in_targets_file.py)
37+
Replaces, for each linktarget, `lt_title` with the targetted `page_id` by matching on `page_title`.
38+
We then compute the "final" page obtained from this page following redirection, with the file `redirects.with_ids.txt`.
39+
- `lt_id` -> Id of this link
40+
- `page_id` -> The id of the page this link is pointing to, after having followed all redirections
41+
42+
### links.with_ids.txt (replace_titles_and_redirects_in_links_file.py)
43+
Replaces, for each pagelink, `lt_id` with the targetted `page_id` by joining with `links.with_ids.txt`.
44+
- `pl_from` -> Id of the "from" page, after having followed all redirections
45+
- `page_id` -> Id of the "to" page, after having followed all redirections
46+
47+
### page.pruned.txt (prune_pages_file.py)
48+
Prunes the pages file by removing pages which are marked as redirects but have no corresponding redirect in the redirects file.
49+
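Putting the joins together, the per-row logic for targets (and, by the same pattern, for links) reduces to a dictionary lookup followed by redirect resolution; a sketch under the same assumptions as above, with `final_target_id` a hypothetical helper:

def final_target_id(lt_title, titles_to_ids, redirects):
    """Map a linktarget title to the final page id it denotes, or None if no page matches."""
    page_id = titles_to_ids.get(lt_title)
    if page_id is None:
        return None  # Unknown title: the row is dropped.
    return resolve(page_id, redirects)
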
+## Sorting, grouping, and counting the links
+
+### links.sorted_by_XXX_id.txt
+We then sort `links.with_ids.txt` by its first column, the "source" id, into
+the file `links.sorted_by_source_id.txt`, and by its second column, the "target" id,
+into the file `links.sorted_by_target_id.txt`.
+
+### links.grouped_by_XXX_id.txt
+Then, we use those two files to *GROUP BY* the links by source and by target.
+The file `links.grouped_by_source_id.txt` is laid out as:
+- `pl_from` -> Id of the "from" page
+- `targets` -> A `|`-separated string of the ids the "from" page targets
+
+The file `links.grouped_by_target_id.txt` is laid out as:
+- `froms` -> A `|`-separated string of the ids of the pages targeting the "target" page
+- `pl_target` -> Id of the "target" page
+
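Because both files are pre-sorted on their grouping key, the GROUP BY can be done in one streaming pass; a minimal sketch, assuming each input line is `source_id<TAB>target_id`:

import itertools

def group_by_first_column(lines):
    """Yield (key, 'id1|id2|...') pairs from lines sorted by their first tab column."""
    rows = (line.rstrip('\n').split('\t') for line in lines)
    for key, group in itertools.groupby(rows, key=lambda row: row[0]):
        yield key, '|'.join(row[1] for row in group)
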
+### links.with_counts.txt (combine_grouped_links_files.py)
+
+## Making the database

scripts/buildDatabase.sh

Lines changed: 71 additions & 21 deletions
@@ -1,15 +1,19 @@
 #!/bin/bash
-
 set -euo pipefail
 
 # Force default language for output sorting to be bytewise. Necessary to ensure uniformity amongst
 # UNIX commands.
 export LC_ALL=C
 
+# These variables can be set by external environment
+WLANG=''${WLANG:-en}
+OUT_DIR="${OUT_DIR:-dump}"
+DELETE_PROGRESSIVELY=${DELETE_PROGRESSIVELY:-false}
+
 # By default, the latest Wikipedia dump will be downloaded. If a download date in the format
 # YYYYMMDD is provided as the first argument, it will be used instead.
 if [[ $# -eq 0 ]]; then
-  DOWNLOAD_DATE=$(wget -q -O- https://dumps.wikimedia.org/enwiki/ | grep -Po '\d{8}' | sort | tail -n1)
+  DOWNLOAD_DATE=$(wget -q -O- https://dumps.wikimedia.org/${WLANG}wiki/ | grep -Po '\d{8}' | sort | tail -n1)
 else
   if [ ${#1} -ne 8 ]; then
     echo "[ERROR] Invalid download date provided: $1"
@@ -19,17 +23,17 @@ else
   fi
 fi
 
-ROOT_DIR=`pwd`
-OUT_DIR="dump"
+# Root directory is that of this script
+ROOT_DIR=$(dirname "$0")
 
-DOWNLOAD_URL="https://dumps.wikimedia.org/enwiki/$DOWNLOAD_DATE"
-TORRENT_URL="https://dump-torrents.toolforge.org/enwiki/$DOWNLOAD_DATE"
-
-SHA1SUM_FILENAME="enwiki-$DOWNLOAD_DATE-sha1sums.txt"
-REDIRECTS_FILENAME="enwiki-$DOWNLOAD_DATE-redirect.sql.gz"
-PAGES_FILENAME="enwiki-$DOWNLOAD_DATE-page.sql.gz"
-LINKS_FILENAME="enwiki-$DOWNLOAD_DATE-pagelinks.sql.gz"
+DOWNLOAD_URL="https://dumps.wikimedia.org/${WLANG}wiki/$DOWNLOAD_DATE"
+TORRENT_URL="https://dump-torrents.toolforge.org/${WLANG}wiki/$DOWNLOAD_DATE"
 
+SHA1SUM_FILENAME="${WLANG}wiki-$DOWNLOAD_DATE-sha1sums.txt"
+REDIRECTS_FILENAME="${WLANG}wiki-$DOWNLOAD_DATE-redirect.sql.gz"
+PAGES_FILENAME="${WLANG}wiki-$DOWNLOAD_DATE-page.sql.gz"
+LINKS_FILENAME="${WLANG}wiki-$DOWNLOAD_DATE-pagelinks.sql.gz"
+TARGETS_FILENAME="${WLANG}wiki-$DOWNLOAD_DATE-linktarget.sql.gz"
 
 # Make the output directory if it doesn't already exist and move to it
 mkdir -p $OUT_DIR
@@ -79,6 +83,7 @@ download_file "sha1sums" $SHA1SUM_FILENAME
 download_file "redirects" $REDIRECTS_FILENAME
 download_file "pages" $PAGES_FILENAME
 download_file "links" $LINKS_FILENAME
+download_file "targets" $TARGETS_FILENAME
 
 ##########################
 # TRIM WIKIPEDIA DUMPS #
@@ -105,7 +110,7 @@ if [ ! -f redirects.txt.gz ]; then
 else
   echo "[WARN] Already trimmed redirects file"
 fi
-
+if $DELETE_PROGRESSIVELY; then rm $REDIRECTS_FILENAME; fi
 if [ ! -f pages.txt.gz ]; then
   echo
   echo "[INFO] Trimming pages file"
@@ -118,16 +123,16 @@ if [ ! -f pages.txt.gz ]; then
   # Splice out the page title and whether or not the page is a redirect
   # Zip into output file
   time pigz -dc $PAGES_FILENAME \
-    | sed -n 's/^INSERT INTO `page` VALUES (//p' \
-    | sed -e 's/),(/\'$'\n/g' \
-    | egrep "^[0-9]+,0," \
-    | sed -e $"s/,0,'/\t/" \
-    | sed -e $"s/',[^,]*,\([01]\).*/\t\1/" \
+    | sed -n 's/^INSERT INTO `page` VALUES //p' \
+    | egrep -o "\([0-9]+,0,'([^']*(\\\\')?)+',[01]," \
+    | sed -re $"s/^\(([0-9]+),0,'/\1\t/" \
+    | sed -re $"s/',([01]),/\t\1/" \
     | pigz --fast > pages.txt.gz.tmp
   mv pages.txt.gz.tmp pages.txt.gz
 else
   echo "[WARN] Already trimmed pages file"
 fi
+if $DELETE_PROGRESSIVELY; then rm $PAGES_FILENAME; fi
 
 if [ ! -f links.txt.gz ]; then
   echo
@@ -143,14 +148,38 @@ if [ ! -f links.txt.gz ]; then
   time pigz -dc $LINKS_FILENAME \
     | sed -n 's/^INSERT INTO `pagelinks` VALUES (//p' \
     | sed -e 's/),(/\'$'\n/g' \
-    | egrep "^[0-9]+,0,.*,0$" \
-    | sed -e $"s/,0,'/\t/g" \
-    | sed -e "s/',0//g" \
+    | egrep "^[0-9]+,0,[0-9]+$" \
+    | sed -e $"s/,0,/\t/g" \
     | pigz --fast > links.txt.gz.tmp
   mv links.txt.gz.tmp links.txt.gz
 else
   echo "[WARN] Already trimmed links file"
 fi
+if $DELETE_PROGRESSIVELY; then rm $LINKS_FILENAME; fi
+
+if [ ! -f targets.txt.gz ]; then
+  echo
+  echo "[INFO] Trimming targets file"
+
+  # Unzip
+  # Remove all lines that don't start with INSERT INTO...
+  # Split into individual records
+  # Only keep records in namespace 0
+  # Replace namespace with a tab
+  # Remove everything starting at the to page name's closing apostrophe
+  # Zip into output file
+  time pigz -dc $TARGETS_FILENAME \
+    | sed -n 's/^INSERT INTO `linktarget` VALUES (//p' \
+    | sed -e 's/),(/\'$'\n/g' \
+    | egrep "^[0-9]+,0,.*$" \
+    | sed -e $"s/,0,'/\t/g" \
+    | sed -e "s/'$//g" \
+    | pigz --fast > targets.txt.gz.tmp
+  mv targets.txt.gz.tmp targets.txt.gz
+else
+  echo "[WARN] Already trimmed targets file"
+fi
+if $DELETE_PROGRESSIVELY; then rm $TARGETS_FILENAME; fi
 
 
 ###########################################
@@ -166,16 +195,29 @@ if [ ! -f redirects.with_ids.txt.gz ]; then
 else
   echo "[WARN] Already replaced titles in redirects file"
 fi
+if $DELETE_PROGRESSIVELY; then rm redirects.txt.gz; fi
+
+if [ ! -f targets.with_ids.txt.gz ]; then
+  echo
+  echo "[INFO] Replacing titles and redirects in targets file"
+  time python "$ROOT_DIR/replace_titles_and_redirects_in_targets_file.py" pages.txt.gz redirects.with_ids.txt.gz targets.txt.gz \
+    | pigz --fast > targets.with_ids.txt.gz.tmp
+  mv targets.with_ids.txt.gz.tmp targets.with_ids.txt.gz
+else
+  echo "[WARN] Already replaced titles and redirects in targets file"
+fi
+if $DELETE_PROGRESSIVELY; then rm targets.txt.gz; fi
 
 if [ ! -f links.with_ids.txt.gz ]; then
   echo
   echo "[INFO] Replacing titles and redirects in links file"
-  time python "$ROOT_DIR/replace_titles_and_redirects_in_links_file.py" pages.txt.gz redirects.with_ids.txt.gz links.txt.gz \
+  time python "$ROOT_DIR/replace_titles_and_redirects_in_links_file.py" pages.txt.gz redirects.with_ids.txt.gz targets.with_ids.txt.gz links.txt.gz \
    | pigz --fast > links.with_ids.txt.gz.tmp
   mv links.with_ids.txt.gz.tmp links.with_ids.txt.gz
 else
   echo "[WARN] Already replaced titles and redirects in links file"
 fi
+if $DELETE_PROGRESSIVELY; then rm links.txt.gz targets.with_ids.txt.gz; fi
 
 if [ ! -f pages.pruned.txt.gz ]; then
   echo
@@ -185,6 +227,7 @@ if [ ! -f pages.pruned.txt.gz ]; then
 else
   echo "[WARN] Already pruned pages which are marked as redirects but with no redirect"
 fi
+if $DELETE_PROGRESSIVELY; then rm pages.txt.gz; fi
 
 #####################
 # SORT LINKS FILE #
@@ -212,6 +255,7 @@ if [ ! -f links.sorted_by_target_id.txt.gz ]; then
 else
   echo "[WARN] Already sorted links file by target page ID"
 fi
+if $DELETE_PROGRESSIVELY; then rm links.with_ids.txt.gz; fi
 
 
 #############################
@@ -227,6 +271,7 @@ if [ ! -f links.grouped_by_source_id.txt.gz ]; then
 else
   echo "[WARN] Already grouped source links file by source page ID"
 fi
+if $DELETE_PROGRESSIVELY; then rm links.sorted_by_source_id.txt.gz; fi
 
 if [ ! -f links.grouped_by_target_id.txt.gz ]; then
   echo
@@ -237,6 +282,7 @@ if [ ! -f links.grouped_by_target_id.txt.gz ]; then
 else
   echo "[WARN] Already grouped target links file by target page ID"
 fi
+if $DELETE_PROGRESSIVELY; then rm links.sorted_by_target_id.txt.gz; fi
 
 
 ################################
@@ -251,6 +297,7 @@ if [ ! -f links.with_counts.txt.gz ]; then
 else
   echo "[WARN] Already combined grouped links files"
 fi
+if $DELETE_PROGRESSIVELY; then rm links.grouped_by_source_id.txt.gz links.grouped_by_target_id.txt.gz; fi
 
 
 ############################
@@ -260,14 +307,17 @@ if [ ! -f sdow.sqlite ]; then
   echo
   echo "[INFO] Creating redirects table"
   time pigz -dc redirects.with_ids.txt.gz | sqlite3 sdow.sqlite ".read $ROOT_DIR/../sql/createRedirectsTable.sql"
+  if $DELETE_PROGRESSIVELY; then rm redirects.with_ids.txt.gz; fi
 
   echo
   echo "[INFO] Creating pages table"
   time pigz -dc pages.pruned.txt.gz | sqlite3 sdow.sqlite ".read $ROOT_DIR/../sql/createPagesTable.sql"
+  if $DELETE_PROGRESSIVELY; then rm pages.pruned.txt.gz; fi
 
   echo
   echo "[INFO] Creating links table"
   time pigz -dc links.with_counts.txt.gz | sqlite3 sdow.sqlite ".read $ROOT_DIR/../sql/createLinksTable.sql"
+  if $DELETE_PROGRESSIVELY; then rm links.with_counts.txt.gz; fi
 
   echo
   echo "[INFO] Compressing SQLite file"

scripts/combine_grouped_links_files.py

Lines changed: 17 additions & 16 deletions
@@ -28,26 +28,27 @@
 
 # Create a dictionary of page IDs to their incoming and outgoing links.
 LINKS = defaultdict(lambda: defaultdict(str))
-for line in io.BufferedReader(gzip.open(OUTGOING_LINKS_FILE, 'r')):
-  [source_page_id, target_page_ids] = line.rstrip('\n').split('\t')
-  LINKS[source_page_id]['outgoing'] = target_page_ids
+# outgoing is [0], incoming is [1]
+for line in io.BufferedReader(gzip.open(OUTGOING_LINKS_FILE, 'rb')):
+  [source_page_id, target_page_ids] = line.rstrip(b'\n').split(b'\t')
+  LINKS[int(source_page_id)][0] = target_page_ids
 
-for line in io.BufferedReader(gzip.open(INCOMING_LINKS_FILE, 'r')):
-  [target_page_id, source_page_ids] = line.rstrip('\n').split('\t')
-  LINKS[target_page_id]['incoming'] = source_page_ids
+for line in io.BufferedReader(gzip.open(INCOMING_LINKS_FILE, 'rb')):
+  [target_page_id, source_page_ids] = line.rstrip(b'\n').split(b'\t')
+  LINKS[int(target_page_id)][1] = source_page_ids
 
 # For each page in the links dictionary, print out its incoming and outgoing links as well as their
 # counts.
-for page_id, links in LINKS.iteritems():
-  outgoing_links = links.get('outgoing', '')
-  outgoing_links_count = 0 if outgoing_links is '' else len(
-      outgoing_links.split('|'))
+for page_id, links in LINKS.items():
+  outgoing_links = links.get(0, b'')
+  outgoing_links_count = 0 if outgoing_links==b'' else len(
+      outgoing_links.split(b'|'))
 
-  incoming_links = links.get('incoming', '')
-  incoming_links_count = 0 if incoming_links is '' else len(
-      incoming_links.split('|'))
+  incoming_links = links.get(1, b'')
+  incoming_links_count = 0 if incoming_links==b'' else len(
+      incoming_links.split(b'|'))
 
-  columns = [page_id, str(outgoing_links_count), str(
-      incoming_links_count), outgoing_links, incoming_links]
+  columns = [str(page_id).encode(), str(outgoing_links_count).encode(), str(
+      incoming_links_count).encode(), outgoing_links, incoming_links]
 
-  print('\t'.join(columns))
+  print(b'\t'.join(columns).decode())
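
After this change the combined file carries five tab-separated columns per page; a reader for it might look like this sketch (the column layout follows from the `columns` list above; the function name is hypothetical):

def parse_links_row(line):
    """Split a links.with_counts.txt row into id, outgoing ids, and incoming ids."""
    page_id, out_count, in_count, outgoing, incoming = line.rstrip('\n').split('\t')
    outgoing_ids = outgoing.split('|') if outgoing else []
    incoming_ids = incoming.split('|') if incoming else []
    assert len(outgoing_ids) == int(out_count)
    assert len(incoming_ids) == int(in_count)
    return int(page_id), outgoing_ids, incoming_ids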

scripts/prune_pages_file.py

Lines changed: 6 additions & 6 deletions
@@ -28,14 +28,14 @@
 
 # Create a dictionary of redirects.
 REDIRECTS = {}
-for line in io.BufferedReader(gzip.open(REDIRECTS_FILE, 'r')):
-  [source_page_id, _] = line.rstrip('\n').split('\t')
+for line in io.BufferedReader(gzip.open(REDIRECTS_FILE, 'rb')):
+  [source_page_id, _] = line.rstrip(b'\n').split(b'\t')
   REDIRECTS[source_page_id] = True
 
 # Loop through the pages file, ignoring pages which are marked as redirects but which do not have a
 # corresponding redirect in the redirects dictionary, printing the remaining pages to stdout.
-for line in io.BufferedReader(gzip.open(PAGES_FILE, 'r')):
-  [page_id, page_title, is_redirect] = line.rstrip('\n').split('\t')
+for line in io.BufferedReader(gzip.open(PAGES_FILE, 'rb')):
+  [page_id, page_title, is_redirect] = line.rstrip(b'\n').split(b'\t')
 
-  if is_redirect == '0' or page_id in REDIRECTS:
-    print('\t'.join([page_id, page_title, is_redirect]))
+  if True or is_redirect == '0' or page_id in REDIRECTS:
+    print(b'\t'.join([page_id, page_title, is_redirect]).decode())
