Wikipedia is the greatest encyclopedia that has ever existed, because anyone can contribute to its massive corpus of knowledge. Analyzing this data with computers is becoming indispensable, since nobody can survey that much information by hand anymore. To work with the data, we first have to import it into MySQL. Here is how.

I'll show how to do this on Debian Jessie, but it should be easily adaptable to other distributions.

Installing Percona Server

If you already have a proper MySQL setup, you can skip this section. I use Percona Server, a MySQL fork built with performance in mind, almost everywhere, as it has nice features built in, such as HandlerSocket or TokuDB, and a well-tuned InnoDB engine. But let's get started.

Install the Percona release package, which adds the Percona repository

wget https://repo.percona.com/apt/percona-release_latest.$(lsb_release -sc)_all.deb
sudo dpkg -i percona-release_latest.$(lsb_release -sc)_all.deb

Install MySQL server

sudo apt-get update
sudo apt-get install percona-server-server-5.7

Add the following lines to the [mysqld] section of /etc/mysql/my.cnf, then restart the server so they take effect:

collation_server        = utf8_general_ci
character_set_server    = utf8
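
To catch a typo in the config before the long import starts, a small check like the following can help. It is only a sketch: check_charset_config is a hypothetical helper name, and it only inspects the one file you pass in, so settings pulled in from other include files are not seen.

```shell
# Hypothetical helper: succeeds only if both settings from this post
# are present in the given config file (default: /etc/mysql/my.cnf).
check_charset_config() {
  cnf="${1:-/etc/mysql/my.cnf}"
  # MySQL accepts both collation_server and collation-server spellings.
  grep -Eq '^[[:space:]]*collation[_-]server[[:space:]]*=[[:space:]]*utf8_general_ci' "$cnf" &&
    grep -Eq '^[[:space:]]*character[_-]set[_-]server[[:space:]]*=[[:space:]]*utf8[[:space:]]*$' "$cnf"
}

# Example usage (not executed here):
#   check_charset_config && echo "charset settings found"
# After restarting, the running server can be asked directly:
#   mysql -e "SHOW VARIABLES LIKE 'character_set_server'"
```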

Downloading Wikipedia

Download the German Wikipedia corpus:

wget https://dumps.wikimedia.org/dewiki/latest/dewiki-latest-pages-articles-multistream.xml.bz2

or, similarly, the English one:

wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2
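
Both URLs follow the same pattern, so other language editions can be fetched the same way. A minimal sketch, assuming the pattern holds for the language code you pass in (dump_url is a hypothetical helper name, not part of any tool):

```shell
# Hypothetical helper: builds the dump URL for a language code,
# following the pattern of the two URLs above.
dump_url() {
  wiki_lang="$1"
  file="${wiki_lang}wiki-latest-pages-articles-multistream.xml.bz2"
  printf 'https://dumps.wikimedia.org/%swiki/latest/%s\n' "$wiki_lang" "$file"
}

# Example usage (not executed here):
#   wget "$(dump_url de)"
```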

Now download MWDumper, which we use to translate the XML files to actual SQL statements:

wget https://dumps.wikimedia.org/tools/mwdumper.jar

Log in to MySQL and create a new database:
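
A minimal sketch of this step, assuming the database is called wiki (the name the later commands use) and that your root account can log in through the local client. create_db_sql is a hypothetical helper that only prints the statement; adjust the login options to your setup.

```shell
# Hypothetical helper: prints the SQL that creates the database.
# The name defaults to "wiki", which the import commands below expect.
create_db_sql() {
  db_name="${1:-wiki}"
  printf 'CREATE DATABASE IF NOT EXISTS `%s`;\n' "$db_name"
}

# Example usage (not executed here; login options are an assumption):
#   create_db_sql wiki | mysql -u root -p
```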


Now download the table schema from the official MediaWiki sources and load it into the new database:

wget <raw_link> -O create-mediawiki.sql

mysql wiki < create-mediawiki.sql

And finally import the actual Wikipedia data:

bunzip2 -c enwiki-latest-pages-articles-multistream.xml.bz2 | \
      java -jar mwdumper.jar --format=sql:1.25 | mysql wiki
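
Once the pipeline has finished, a quick row count gives a rough sanity check that data actually arrived. This is a sketch assuming the database is named wiki and the schema contains MediaWiki's page table; count_pages is a hypothetical helper name.

```shell
# Hypothetical helper: prints the number of rows in the page table.
# -N suppresses the column header, -B selects plain batch output.
count_pages() {
  db_name="${1:-wiki}"
  mysql -N -B -e 'SELECT COUNT(*) FROM page' "$db_name"
}

# Example usage (not executed here):
#   count_pages wiki
```

A count of zero after a long import usually means the statements failed early, so it is worth scrolling back through the output for the first error.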

Once the import has completed, you can remove enwiki-latest-pages-articles-multistream.xml.bz2 to clean up:

rm enwiki-latest-pages-articles-multistream.xml.bz2

If you see an error like ERROR 1054 (42S22) at line 84: Unknown column 'page_counter' in 'field list' or similar, double-check that the --format=... parameter still matches the version of the table schema you installed.
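
One rough way to narrow this down is to look at the schema file itself. The sketch below assumes that a schema still containing page_counter predates MediaWiki 1.25 and therefore wants the older sql:1.5 format, and that anything newer works with sql:1.25. That is a heuristic, not something MWDumper guarantees, and guess_mwdumper_format is a hypothetical helper name.

```shell
# Hypothetical helper: guesses the --format value from a schema file.
# Assumption: page_counter only exists in schemas older than MediaWiki 1.25.
guess_mwdumper_format() {
  schema_file="$1"
  if grep -q 'page_counter' "$schema_file"; then
    echo 'sql:1.5'
  else
    echo 'sql:1.25'
  fi
}

# Example usage (not executed here):
#   guess_mwdumper_format create-mediawiki.sql
# The live database can be checked the same way:
#   mysql wiki -e "SHOW COLUMNS FROM page LIKE 'page_counter'"
```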