Wikipedia is the greatest encyclopedia that has ever existed, because everyone can contribute to its massive knowledge corpus. Analyzing this data with computers is becoming indispensable, as nobody can survey that much information by hand anymore. To work with the data, we first have to import it into MySQL, and here is how that works.
I'll show how to do this on Debian Jessie, but it should be easily adaptable to other distributions.
Installing Percona Server
If you already have a proper MySQL setup, you can skip this section. I use Percona Server, a MySQL fork built with performance in mind, almost everywhere, as it has nice features built in, like HandlerSocket and TokuDB, and a well-tuned InnoDB engine. But let's get started.
Install Percona release information
wget https://repo.percona.com/apt/percona-release_latest.$(lsb_release -sc)_all.deb
sudo dpkg -i percona-release_latest.$(lsb_release -sc)_all.deb
Install MySQL server
sudo apt-get update
sudo apt-get install percona-server-server-5.7
Add the following lines to the [mysqld] section of /etc/mysql/my.cnf and restart the server afterwards:
innodb_file_per_table
collation_server = utf8_general_ci
character_set_server = utf8
skip-character-set-client-handshake
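A typo in my.cnf is easy to miss, so it can be worth checking that all four options actually ended up in the file. The following is only a minimal sketch: `check_mycnf` is a hypothetical helper of my own, it just looks for the option names at the start of a line, and it does not verify that they sit inside the [mysqld] section.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which of the four options listed above
# are present in a given MySQL config file.
# It only matches the option name at the start of a line; it does NOT
# check that the line belongs to the [mysqld] section.
check_mycnf() {
    local file="$1" missing=0 opt
    for opt in innodb_file_per_table collation_server \
               character_set_server skip-character-set-client-handshake; do
        # Match "name", "name = value" or "name=value", ignoring indentation.
        if grep -Eq "^[[:space:]]*${opt}([[:space:]]*=|[[:space:]]*\$)" "$file"; then
            echo "ok: $opt"
        else
            echo "missing: $opt"
            missing=1
        fi
    done
    # Non-zero exit status if at least one option was not found.
    return "$missing"
}

# Usage (path taken from the text above):
#   check_mycnf /etc/mysql/my.cnf
```

The function returns a non-zero exit status when something is missing, so it can also be used in a script before restarting the server.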
Downloading Wikipedia
Download the German Wikipedia corpus:
wget https://dumps.wikimedia.org/dewiki/latest/dewiki-latest-pages-articles-multistream.xml.bz2
or, similarly, the English one:
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream.xml.bz2
Now download MWDumper, which we use to translate the XML files to actual SQL statements:
wget https://dumps.wikimedia.org/tools/mwdumper.jar
Log in to MySQL and create a new database:
CREATE DATABASE wiki;
Now download and install the table schema from the official MediaWiki table definitions:
wget <raw_link> -O create-mediawiki.sql
mysql wiki < create-mediawiki.sql
And finally import the actual Wikipedia data:
bunzip2 -c enwiki-latest-pages-articles-multistream.xml.bz2 | \
java -jar mwdumper.jar --format=sql:1.25 | mysql wiki
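The import runs for a long time, so a rough sanity check afterwards does not hurt. As a sketch, and assuming that each `<page>` tag in the dump sits on its own line (which is how the dumps I have seen are laid out), you can count the pages in the archive and compare the number with the rows in the page table. `count_dump_pages` is a hypothetical helper name, not part of any tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count the <page> elements in a bzip2-compressed dump.
# Assumption: every <page> opening tag is on its own line, so counting
# matching lines gives the number of pages. Closing tags (</page>) do
# not match the pattern.
count_dump_pages() {
    bunzip2 -c "$1" | grep -c '<page>'
}

# Usage, assuming the `wiki` database from above has been filled:
#   count_dump_pages enwiki-latest-pages-articles-multistream.xml.bz2
#   mysql wiki -N -e 'SELECT COUNT(*) FROM page'
```

If the two numbers differ a lot, the import was most likely interrupted somewhere in the middle.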
Afterwards, you can remove enwiki-latest-pages-articles-multistream.xml.bz2 to clean up:
rm enwiki-latest-pages-articles-multistream.xml.bz2
If you see an error like ERROR 1054 (42S22) at line 84: Unknown column 'page_counter' in 'field list' or similar, double-check that the value of the --format=... parameter is still the most current one and matches the schema you installed.
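One way to avoid guessing is to look at the schema file itself. The sketch below rests on an assumption of mine: that a schema which still defines the page_counter column expects the older sql:1.5 output, and one without it expects sql:1.25. `suggest_format` is a hypothetical helper, so treat its answer as a hint and not as the final word.

```shell
#!/usr/bin/env bash
# Hypothetical helper: suggest a --format value for MWDumper based on
# whether the schema file still mentions the page_counter column.
# Assumption: schemas with page_counter go with sql:1.5, schemas
# without it go with sql:1.25.
suggest_format() {
    if grep -q 'page_counter' "$1"; then
        echo "sql:1.5"
    else
        echo "sql:1.25"
    fi
}

# Usage with the schema file downloaded above:
#   bunzip2 -c enwiki-latest-pages-articles-multistream.xml.bz2 | \
#   java -jar mwdumper.jar --format="$(suggest_format create-mediawiki.sql)" | mysql wiki
```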