Please explain how Wordpress works with MySQL character set and collation at a low level

by X-Mann   Last Updated May 13, 2017 18:08 PM

As the question title suggests, I'm looking to understand how Wordpress works with MySQL character sets and collation options. As I will show below, things don't make much sense to me...

I installed Wordpress by following the instructions on their installation page:

As part of the instructions, I followed their advice for manual creation of the MySQL database on the commandline, namely the commands:

mysql> CREATE DATABASE databasename;
Query OK, 1 row affected (0.00 sec)

mysql> GRANT ALL PRIVILEGES ON databasename.* TO "wordpressusername"@"hostname"
-> IDENTIFIED BY "password";
Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.01 sec)

mysql> EXIT

Further, as instructed, I edited the "wp-config.php" file to use UTF-8 character set:

define( 'DB_CHARSET', 'utf8' );

...and left the collation setting blank:

define( 'DB_COLLATE', '' );

Here is where the fun starts...

  1. If I enter a character that is not part of MySQL UTF-8, but is part of UTF-8 MB4, such as 𝌆, into a post, it shows up correctly on the rendered page. I would have expected this not to happen, as I haven't set the character set to UTF-8 MB4, but the more restricted UTF-8 (as defined by MySQL of course, not as generally understood).

  2. If I investigate the issue in MySQL on the commandline, it gets weirder. If I run show variables like 'char%';, I get this response:

    | Variable_name            | Value                      |
    | character_set_client     | utf8                       |
    | character_set_connection | utf8                       |
    | character_set_database   | latin1                     |
    | character_set_filesystem | binary                     |
    | character_set_results    | utf8                       |
    | character_set_server     | latin1                     |
    | character_set_system     | utf8                       |
    | character_sets_dir       | /usr/share/mysql/charsets/ |

I would have expected database character set to be UTF-8, not latin1.

  1. If I run the command show variables like 'collation%';, the output is:

    | Variable_name        | Value             |
    | collation_connection | utf8_general_ci   |
    | collation_database   | latin1_swedish_ci |
    | collation_server     | latin1_swedish_ci |

That's even stranger, for obvious reasons (wouldn't have expected the default latin1_swedish_ci collation in a UTF-8 database).

  1. Finally, if I run show full columns from mywpdatabase.wp_posts;, the output lines, where the value is not NULL, show collation to be:

| post_content_filtered | longtext | utf8mb4_unicode_ci |

My question then - how can this be explained? Why is my Wordpress install correctly rendering UTF-8 MB4 characters, when the database is defined as UTF-8 in the config? And why is the database showing in MySQL as latin1, swedish collation, instead of UTF-8? And how come, that despite all this, the individual fields in the table are utf8mb4_unicode_ci? A low-level explanation of the way Wordpress works with MySQL would be very helpful. Thank you!

Related Questions

Switching Databases in LocalHost

Updated October 01, 2017 11:08 AM