A Finland Swede in Bavaria

Thursday, 14 May 2009

Home-Made Blog Statistics from WordPress MU

For long, I had wanted to get some statistics on the number of blog posts I do per month and blog on blogs.arno.fi. For even longer, I had had abstinence symptoms from not doing almost any coding at all this century. Today, I combined the desire for statistics with the desire for some semi-coding and got some statistics done. And this is how I did it.

First, I logged into MySQL on the server where my WordPress MU resides. And I did a "use wordpress" to start using the Wordpress MU database. A first SHOW TABLES showed a whopping 153 tables. Wow! That comes from me (and my family members) having 18 blogs, each of which requires 8 tables. The remainder, 153 - 8*18, is 9 tables. Here are the tables, and what I found out about them while eagerly SELECTing from them:

wp_blogs: The key WordPress table, listing the blogs



mysql> select blog_id,path,registered,last_updated from wp_blogs;
+---------+-----------------+---------------------+---------------------+
| blog_id | path | registered | last_updated |
+---------+-----------------+---------------------+---------------------+
| 1 | / | 2008-09-20 15:15:54 | 2008-10-19 14:39:15 |
| 2 | /fussball/ | 2008-09-20 15:19:12 | 2008-10-07 22:07:09 |
| 4 | /efib/ | 2008-09-20 15:41:14 | 2009-03-16 09:51:48 |
| 5 | /fib/ | 2008-10-13 10:44:16 | 2009-05-11 21:34:25 |
| 6 | /dolce_vita/ | 2008-10-25 00:54:01 | 2009-05-03 08:18:41 |
| 7 | /fandorin/ | 2008-10-25 02:35:46 | 2009-04-16 11:48:50 |
| 8 | /isit/ | 2008-10-27 14:39:57 | 2009-05-12 16:42:24 |
| 9 | /linnea/ | 2008-10-31 20:26:30 | 2008-11-16 19:13:43 |
| 10 | /poquito/ | 2008-11-03 22:15:21 | 2008-11-09 22:21:48 |
| 11 | /vaba_lava/ | 2008-11-17 08:33:04 | 2008-11-17 10:44:24 |
| 12 | /labrit/ | 2008-11-17 10:02:50 | 2008-11-19 09:10:03 |
| 13 | /laisvas_zodis/ | 2008-11-17 12:22:51 | 2008-11-18 06:48:00 |
| 14 | /alman_degilim/ | 2008-11-18 10:41:46 | 2008-11-20 21:25:48 |
| 15 | /sushi/ | 2008-11-18 23:51:00 | 2008-11-22 13:28:29 |
| 16 | /furuvik/ | 2008-12-07 14:19:10 | 2009-05-04 14:13:53 |
| 17 | /kajsql/ | 2009-02-11 10:25:26 | 2009-05-06 16:28:08 |
| 18 | /guanxi/ | 2009-05-06 18:26:24 | 2009-05-06 18:50:29 |
| 19 | /yilingyi/ | 2009-05-06 18:30:02 | 2009-05-06 20:00:25 |
+---------+-----------------+---------------------+---------------------+
18 rows in set (0.00 sec)


Conclusion: blog_id and path are the main fields

The other eight top-level tables



mysql> show tables where Tables_in_wordpress > "wp_a";
+---------------------+
| Tables_in_wordpress |
+---------------------+
| wp_blog_versions |
| wp_blogs |
| wp_registration_log |
| wp_signups |
| wp_site |
| wp_sitecategories |
| wp_sitemeta |
| wp_usermeta |
| wp_users |
+---------------------+
9 rows in set (0.00 sec)


Browsing through the other tables, they seem to be best left for browsing and editing through the user interface of WordPress itself.

The eight blog-level tables in WordPress MU


The tables on the blog level are named wp_1_blahbah, wp_2_blahblah, wp_3_blahblah and so on. I can't say that this is a better table design than merely entering blog_id as a further key in each of the tables, but hey, I'm a happy WordPress user and not out to redesign their database structure. Besides, they may well have solved some scaling issues by adding eight new tables for each new blog. However, the complexity of the database structure will affect my job of doing statistics. And looking at life from a positive standpoint, it will increase the amount of programming joy I get out of my desire to do relevant blogging statistics.

There are, as said, eight tables for each blog. Let's pick my blog no 8, /isit/ (this very blog, "It's Some Interesting Topic").

mysql> show tables where Tables_in_wordpress like "wp_8_%";
+-------------------------+
| Tables_in_wordpress |
+-------------------------+
| wp_8_comments |
| wp_8_links |
| wp_8_options |
| wp_8_postmeta |
| wp_8_posts |
| wp_8_term_relationships |
| wp_8_term_taxonomy |
| wp_8_terms |
+-------------------------+
8 rows in set (0.01 sec)


The two most interesting ones here are wp_n_posts and wp_n_comments. The three tables named around "terms" are about tagging and automatically updated through WordPress. The wp_n_options (blog level options) and wp_n_links (blogrolls etc.) tables are relevant, but adequately browsable and editable through the WordPress user interface. The same goes for the field settings in wp_n_postmeta.

The wp_n_posts table


The wp_n_posts table has 24 fields. Browsing at them for a while, here are the key ones

  • ID: Primary key (1, 2, 3 ...)

  • post_date: Publishing date of the post. Crucial for statistics. Remember, you can backdate posts or write posts in advance.

  • post_modified: Editing date of the post. This is when you last touched the post.

  • post_type: This is usually "post", for normal, published blog posts. For what WordPress calls pages, it's "page". But there are also plenty of instances with "revision" (for intermediate versions of the blog post, if you save and edit it multiple times) and "attachment" for instance if you upload a .jpg file. Consequently, where post_type in ("page","post") becomes an essential part of many blog statistics SELECTs.

  • post_title: The header of the blog entry

  • comment_count: Number of approved comments

  • comment_status: Either "closed" or "open" depending on whether comments are allowed or not. I have combatted spam by disallowing comments for particularly infested entries, using the normal WordPress UI. I could bulk reallow comments on all closed entries with a simple UPDATE on comment_status, once I've installed better spam prevention.

  • guid: the complete URL to a post, such as http://blogs.arno.fi/isit/?p=211 (which then will be converted to http://blogs.arno.fi/isit/2009/03/16/a-digital-native-and-my-first-nokia/ by WordPress once you surf to it).

  • post_name: contains the "a-digital-native-and-my-first-nokia" part of the URL (which is neatly editable from within the WordPress UI).

  • post_status: it's "inherit" or "publish" depending on whether the post is live or not -- but I prefer using post_type to identify real entries from earlier versions


My first attempt at statistics


This knowledge enables compact statistics, such as:

mysql> select date_format(post_date,"%Y-%m") as Month, count(*) as Blogs
-> from wp_8_posts
-> where post_type in ("page","post")
-> group by 1;
+---------+-------+
| Month | Blogs |
+---------+-------+
| 2008-10 | 11 |
| 2008-11 | 15 |
| 2008-12 | 3 |
| 2009-01 | 4 |
| 2009-02 | 3 |
| 2009-03 | 3 |
| 2009-05 | 2 |
+---------+-------+
7 rows in set (0.00 sec)


However, what I'd really like to have is that very same piece of statistics, but for all my blogs in one report. I don't want to run eighteen reports, one for each blog.

And that's where my coding desire can be satisfied.

My idea is to create a new, aggregate table for statistics, where I insert one row for each blog entry in each of the eighteen blogs.

Let me first create the table, based on the above blog (numbered 8 and called "/isit/") with its 41 blog entries. It's a simple CREATE as documented as the second example on http://dev.mysql.com/doc/refman/5.1/en/create-table.html -- in my case like this:

mysql> CREATE TABLE kaj_wp_stats
-> SELECT 8 as blog_id,"/isit/" as path,
-> ID,post_type,post_date,post_title,comment_count,post_modified,guid
-> FROM wp_8_posts
-> WHERE post_type IN ("page","post");
Query OK, 41 rows affected (0.00 sec)
Records: 41 Duplicates: 0 Warnings: 0


I specifically mentioned 8 as blog_id and "/isit/" as path in order to later be able to separate blog entries by blog.

Then I take the very same SELECT statement and use it for another blog, but substitute the CREATE with the equivalent INSERT statement:

mysql> INSERT INTO kaj_wp_stats
-> SELECT 4 as blog_id,"/fib/" as path,
-> ID,post_type,post_date,post_title,comment_count,post_modified,guid
-> FROM wp_4_posts
-> WHERE post_type IN ("page","post");
Query OK, 27 rows affected (0.00 sec)
Records: 27 Duplicates: 0 Warnings: 0


Note the changes: "INSERT INTO", "4", "fib" and "wp_4_posts".

Generating SELECTs with SELECTs


Now, I could of course do a lot of error-prone manual labour and insert the following 16 blogs in the same way. But I'm lazy and I desire some coding, so instead, what I do is that I create the corresponding code with a SELECT statement from the wp_blogs file. This is my meta-code:

select concat("INSERT INTO kaj_wp_stats SELECT ",blog_id," as blog_id,'",path,
"' as path,ID,post_type,post_date,post_title,comment_count,post_modified,guid from wp_",
blog_id,"_posts WHERE post_type in ('page','post');") from wp_blogs;


In this metacode, the lower-case "select" and "from" is what is executed. The upper case INSERTs and SELECTs end up in the output, which looks like this:

INSERT INTO kaj_wp_stats SELECT 1 as blog_id,'/' as path,ID,post_type,post_date,post_title,comment_count,post_modified,guid from wp_1_posts WHERE post_type in ('page','post');
INSERT INTO kaj_wp_stats SELECT 2 as blog_id,'/fussball/' as path,ID,post_type,post_date,post_title,comment_count,post_modified,guid from wp_2_posts WHERE post_type in ('page','post');
INSERT INTO kaj_wp_stats SELECT 4 as blog_id,'/efib/' as path,ID,post_type,post_date,post_title,comment_count,post_modified,guid from wp_4_posts WHERE post_type in ('page','post');
INSERT INTO kaj_wp_stats SELECT 5 as blog_id,'/fib/' as path,ID,post_type,post_date,post_title,comment_count,post_modified,guid from wp_5_posts WHERE post_type in ('page','post');
INSERT INTO kaj_wp_stats SELECT 6 as blog_id,'/dolce_vita/' as path,ID,post_type,post_date,post_title,comment_count,post_modified,guid from wp_6_posts WHERE post_type in ('page','post');
INSERT INTO kaj_wp_stats SELECT 7 as blog_id,'/fandorin/' as path,ID,post_type,post_date,post_title,comment_count,post_modified,guid from wp_7_posts WHERE post_type in ('page','post');
INSERT INTO kaj_wp_stats SELECT 8 as blog_id,'/isit/' as path,ID,post_type,post_date,post_title,comment_count,post_modified,guid from wp_8_posts WHERE post_type in ('page','post');
INSERT INTO kaj_wp_stats SELECT 9 as blog_id,'/linnea/' as path,ID,post_type,post_date,post_title,comment_count,post_modified,guid from wp_9_posts WHERE post_type in ('page','post');
INSERT INTO kaj_wp_stats SELECT 10 as blog_id,'/poquito/' as path,ID,post_type,post_date,post_title,comment_count,post_modified,guid from wp_10_posts WHERE post_type in ('page','post');
INSERT INTO kaj_wp_stats SELECT 11 as blog_id,'/vaba_lava/' as path,ID,post_type,post_date,post_title,comment_count,post_modified,guid from wp_11_posts WHERE post_type in ('page','post');
INSERT INTO kaj_wp_stats SELECT 12 as blog_id,'/labrit/' as path,ID,post_type,post_date,post_title,comment_count,post_modified,guid from wp_12_posts WHERE post_type in ('page','post');
INSERT INTO kaj_wp_stats SELECT 13 as blog_id,'/laisvas_zodis/' as path,ID,post_type,post_date,post_title,comment_count,post_modified,guid from wp_13_posts WHERE post_type in ('page','post');
INSERT INTO kaj_wp_stats SELECT 14 as blog_id,'/alman_degilim/' as path,ID,post_type,post_date,post_title,comment_count,post_modified,guid from wp_14_posts WHERE post_type in ('page','post');
INSERT INTO kaj_wp_stats SELECT 15 as blog_id,'/sushi/' as path,ID,post_type,post_date,post_title,comment_count,post_modified,guid from wp_15_posts WHERE post_type in ('page','post');
INSERT INTO kaj_wp_stats SELECT 16 as blog_id,'/furuvik/' as path,ID,post_type,post_date,post_title,comment_count,post_modified,guid from wp_16_posts WHERE post_type in ('page','post');
INSERT INTO kaj_wp_stats SELECT 17 as blog_id,'/kajsql/' as path,ID,post_type,post_date,post_title,comment_count,post_modified,guid from wp_17_posts WHERE post_type in ('page','post');
INSERT INTO kaj_wp_stats SELECT 18 as blog_id,'/guanxi/' as path,ID,post_type,post_date,post_title,comment_count,post_modified,guid from wp_18_posts WHERE post_type in ('page','post');
INSERT INTO kaj_wp_stats SELECT 19 as blog_id,'/yilingyi/' as path,ID,post_type,post_date,post_title,comment_count,post_modified,guid from wp_19_posts WHERE post_type in ('page','post');


I then clean up the kaj_wp_stats table by removing it:

mysql> DROP TABLE kaj_wp_stats;
Query OK, 0 rows affected (0.00 sec)


Now I manually substitute the first INSERT INTO with a CREATE TABLE, and run the select-generated SELECTs.

However, I get Warnings: for each of the INSERT statements -- which is due to the unfortunate fact that the first blog has a path "/", which turns into a varchar(1) into which of course none of the other paths fit. Hence I manually add

ALTER TABLE kaj_wp_stats MODIFY path varchar(50);


right after the first CREATE TABLE statement, and then everything runs fine. Here's a shortened version of it all:

mysql> DROP TABLE kaj_wp_stats;
Query OK, 0 rows affected (0.00 sec)

mysql> CREATE TABLE kaj_wp_stats SELECT 1 as blog_id,'/' as path,ID,post_type,post_date,post_title,comment_count,post_modified,guid from wp_1_posts WHERE post_type in ('page','post');
Query OK, 1 row affected (0.00 sec)
Records: 1 Duplicates: 0 Warnings: 0

mysql> ALTER TABLE kaj_wp_stats MODIFY path varchar(50);
Query OK, 1 row affected (0.02 sec)
Records: 1 Duplicates: 0 Warnings: 0

:
:

mysql> INSERT INTO kaj_wp_stats SELECT 5 as blog_id,'/fib/' as path,ID,post_type,post_date,post_title,comment_count,post_modified,guid from wp_5_posts WHERE post_type in ('page','post');
Query OK, 85 rows affected (0.00 sec)
Records: 85 Duplicates: 0 Warnings: 0

:
:

mysql> INSERT INTO kaj_wp_stats SELECT 8 as blog_id,'/isit/' as path,ID,post_type,post_date,post_title,comment_count,post_modified,guid from wp_8_posts WHERE post_type in ('page','post');
Query OK, 41 rows affected (0.00 sec)
Records: 41 Duplicates: 0 Warnings: 0

:
:

mysql> INSERT INTO kaj_wp_stats SELECT 19 as blog_id,'/yilingyi/' as path,ID,post_type,post_date,post_title,comment_count,post_modified,guid from wp_19_posts WHERE post_type in ('page','post');
Query OK, 2 rows affected (0.00 sec)
Records: 2 Duplicates: 0 Warnings: 0


Finally, the statistics I desire!


Now, I can do statistics such as this:

mysql> select blog_id,path,count(*) as blogs from kaj_wp_stats group by blog_id,path;
+---------+-----------------+-------+
| blog_id | path | blogs |
+---------+-----------------+-------+
| 1 | / | 1 |
| 2 | /fussball/ | 6 |
| 4 | /efib/ | 27 |
| 5 | /fib/ | 85 |
| 6 | /dolce_vita/ | 10 |
| 7 | /fandorin/ | 8 |
| 8 | /isit/ | 41 |
| 9 | /linnea/ | 13 |
| 10 | /poquito/ | 9 |
| 11 | /vaba_lava/ | 2 |
| 12 | /labrit/ | 1 |
| 13 | /laisvas_zodis/ | 1 |
| 14 | /alman_degilim/ | 4 |
| 15 | /sushi/ | 2 |
| 16 | /furuvik/ | 38 |
| 17 | /kajsql/ | 2 |
| 18 | /guanxi/ | 1 |
| 19 | /yilingyi/ | 2 |
+---------+-----------------+-------+
18 rows in set (0.00 sec)


Or this

mysql> select date_format(post_date,"%Y-%m") as Month,
-> count(*) as Blogs,
-> sum(if(path="/isit/",1,0)) as isit,
-> sum(if(path="/fib/",1,0)) as fib,
-> sum(if(path="/efib/",1,0)) as efib,
-> sum(if(path="/furuvik/",1,0)) as furuvik,
-> sum(if(path="/dolce_vita/",1,0)) as dolce_vita,
-> sum(if(path="/fandorin/",1,0)) as fandorin,
-> sum(if(path="/poquito/",1,0)) as poquito
-> from kaj_wp_stats
-> where post_date > "2007-12-31"
-> group by 1;
+---------+-------+------+------+------+---------+------------+----------+---------+
| Month | Blogs | isit | fib | efib | furuvik | dolce_vita | fandorin | poquito |
+---------+-------+------+------+------+---------+------------+----------+---------+
| 2008-02 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2008-03 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2008-04 | 3 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2008-05 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2008-06 | 5 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2008-08 | 3 | 0 | 2 | 0 | 1 | 0 | 0 | 0 |
| 2008-09 | 8 | 0 | 4 | 0 | 0 | 0 | 0 | 0 |
| 2008-10 | 50 | 11 | 11 | 9 | 0 | 6 | 3 | 0 |
| 2008-11 | 59 | 15 | 14 | 3 | 4 | 1 | 3 | 9 |
| 2008-12 | 23 | 3 | 9 | 5 | 3 | 2 | 1 | 0 |
| 2009-01 | 29 | 4 | 7 | 6 | 12 | 0 | 0 | 0 |
| 2009-02 | 23 | 3 | 12 | 1 | 6 | 0 | 0 | 0 |
| 2009-03 | 23 | 3 | 11 | 3 | 6 | 0 | 0 | 0 |
| 2009-04 | 4 | 0 | 4 | 0 | 0 | 0 | 0 | 0 |
| 2009-05 | 12 | 2 | 2 | 0 | 5 | 0 | 0 | 0 |
| 2013-06 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
+---------+-------+------+------+------+---------+------------+----------+---------+
16 rows in set (0.00 sec)


Mission accomplished! The statistics are there, and I had fun "coding" with MySQL.

Tuesday, 12 May 2009

Photo Manager: How do you keep track of your pictures?

If you're like me, you have tens if not hundreds of thousands of digital pictures. Tendency: Growing. How do you sort them? Archive them? Tag them? Find the right picture? Keep order in your backups?

Personally, I have no good answer to the above questions. I am still looking for the right program -- ideally, Open Source Software that helps me keep track of my digital pictures. So if you have a good answer, please tell me!

This blog entry is an attempt to define the specifications of what I need. I have labelled the software "FOSS Photo Manager", but I do understand it needs a catchier name. Basically, I'm fairly happy about my software for editing pictures. But not for managing my collection of pictures. Perhaps I should call the dreamt-up software "Robfat" (for rename, order, backup, find, archivetag), as I want to remove excess fat from my HDs (and CD/DVD cabinets). I need a lean, easily accessible collection of all my pictures.  

This is what I have:

  • one Mac laptop with a small HD, storing <10 % of my pictures

  • three external HDs with my pictures (some for backup, some as primary storage)

  • over a hundred CDs and DVDs that I burned as backups prior to having several external HDs


Looking at the pictures, I have:

  • over 100.000 pictures in over 1.000 directories

  • mostly the same directories on the various HDs


I also have disorder:

  • some new pictures (edits) I've saved only onto some HDs

  • some directories have been cleaned of bad pictures that I want to throw away, but these pictures still exist on other HDs (and I don't want them to return from the dead) 

  • some CDs and DVDs are clearly unnecessary by now, but may contain some files found nowhere else


My basic needs are

  1. To simplify my workflow from the point in time where I've downloaded the pics to my HD until the point where I've identified the best pics, sorted them, renamed the best ones, and backed them up

  2. To create and maintain a directory structure (while browsing the pics) that makes the pics easier to find

  3. To quickly tag my pictures while browsing them or as a bulk update of an entire directory, with the tags to be stored in the file's own EXIF data in order for it to be accessible from everywhere 

  4. To quickly rank my pictures (quality, one-to-five) or use the ranking from other software

  5. To easily compare the contents of two directories, identify discrepancies and fix them (by deleting pics I don't want to have, and by copying pics I do want to have) so that directories are identical between original and backup

  6. To swiftly harvest CDs / DVDs for any pics (files) that I don't have anywhere on my HD (remembering that I may have renamed them since), so I can throw away the old CDs / DVDs with peace of mind

  7. To retain data about pics on my external HDs even when I don't have the external HDs around (directory structure, file names, dates, times, sizes, tags)

  8. To mark and bulk copy pictures to new directories (for sharing pics, backing them up, creating copies that other software can freely edit, identify pics to upload to Picasa / Flickr / Facebook, or to create copies

  9. To bulk separate portrait from landscape pictures when copying (for better use in digital frames, phones, iPods)

  10. To identify likely siblings of a picture (edited versions, renamed versions -- based on similar file name, file size, date).

  11. To create statistics of my picture collection: How many pics do I have? By year, by tag, by various other criteria.


Ideally (but this isn't a basic need), I would also like to keep track of where I've used which picture (photo books, phones, digiframes, printouts, screen savers, Picasa, Flickr, Facebook).

The deadly sins I don't want this program to commit are

  1. Keep track only of pictures currently accessible. Nope! I want also the external HDs that aren't around, and the CDs/DVDs.

  2. Touch my pixels. Nope! Don't even turn the pics 90 degrees. All editing should be done with separate software. OK, so I'm happy if this software creates thumbnail copies (using ImageMagick or whatnot), if I clearly instruct it to.

  3. Do stuff not related to keeping track of my pictures. Nope, don't email pics. Don't synch them to the web. Don't print pictures. Don't import pictures from cameras or memory cards. Just keep order on my file system.

  4. Be an island like iPhoto. Nope. Everything this software does to create order (structures, tags, albums) should be visible outside the program, in the directory structure, in file names, in EXIF tags.

  5. Force repetitive tasks on the user. Nope. If all pics in a directory need a particular tag, then make it easy to tag them all. If normal workflow requires frequent hand movements between the keyboard and the mouse, then redesign it. 


So: No reinvention of the wheel -- but yes please, an invention of sensible traffic management that relieves me of queuing in the rush hour! Do you have tips for me?

Wednesday, 6 May 2009

On Princes and Databases, and Retroblogging

Through a recommendation by Peter Vesterbacka on Facebook, I popped into the interesting blog of WILLIAM JULIUS OGEDENGBE, a NIGERIAN PRINCE who is skilled at blogging in ALL CAPS.

His posts, today NINE in total, are hilarious. They prompted me to retroblog something semi-Nigerian I wrote in January 2004, after a long working day for MySQL AB in Japan. You'll find it on my /kajsql/ blog.

Retroblog, I hear you ask. What's that? By  “retroblog” I mean a blog entry that is backdated to the date when it was actually written, albeit at that time not intended as a blog entry. So while in a sense this is a milder genre of creative historical revisionism, the only blatant lie is the backdating of the blog entry. The text itself isn't backdated. (Stalin would still likely be proud of me, had I ever been a Soviet citizen).

Modification before hitting "Publish": Googling on "retroblogging" gave me 159 000 hits, so there is clearly prior art when it comes to defining the concept. What a shame. Still, "backdated blog without backdating text" is the only relevant part of my private little definition, anyway.