The simple answer is not enough: Using weights to make two-step associative searches useful

Some time ago on the MySQL Forums I answered one of the typical beginner questions on how to solve many-to-many relationships. There was only one special thing about the question: The relationship actually involved two steps (a many-to-many-to-many relationship). Just hitting the Post button to my standard answer I realised that (albeit correct from a database design point of view) it would probably not give the users of the system the expected results. The simple answer was not enough.

The original question asked for a schema to manage sound files grouped into themes. The poster mentioned an example of a theme BEACH with sound files such as waves.mp3, wind.mp3 or dog_barking.mp3. So any theme can include an arbitrary number of sounds and it's evident that any of these sound files might also belong to another theme. That's a classic many-to-many relationship.

The question also mentioned another many-to-many relationship: It should be possible to search for a theme using keywords. So searches for sun, vacation or sand should all display the theme BEACH (and possibly other suitable themes). The user interface will also allow to list sounds directly for keyword searches, bypassing the listing of themes.

All of the above means we end up with three different entities to deal with: keywords, themes and sounds. Every entity needs a table in our schema. (Please skip the next few paragraphs if you're already familiar with many-to-many relationships, I just want to give a short introduction to not exclude anybody.)

+---------------+                              +-------------+
| keywords      |                              | themes      |
+---------------+                              +-------------+
| keyword_id PK |                              | theme_id PK |
| keyword       |                              | theme       |
+---------------+                              +-------------+


                        +-------------+
                        | sounds      |
                        +-------------+
                        | sound_id PK |
                        | sound       |
                        +-------------+

The poster of the question wanted to solve the problem using only these three tables with some additional pointer fields. But of course there was no way to correctly account for the links between multiple keywords and themes and between multiple themes and sounds. Let's just create the tables first anyway.

CREATE TABLE `keywords` (
  `keyword_id` int(10) unsigned NOT NULL auto_increment,
  `keyword` varchar(64) default NULL,
  PRIMARY KEY (`keyword_id`),
  UNIQUE KEY `keyword` (`keyword`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE `themes` (
  `theme_id` int(10) unsigned NOT NULL auto_increment,
  `theme` varchar(64) default NULL,
  PRIMARY KEY (`theme_id`),
  UNIQUE KEY `theme` (`theme`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE `sounds` (
  `sound_id` int(10) unsigned NOT NULL auto_increment,
  `sound` varchar(64) default NULL,
  PRIMARY KEY (`sound_id`),
  UNIQUE KEY `sound` (`sound`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

The poster felt that there was something wrong with his three tables approach. Of course you always need an additional table for every many-to-many relationship. This gives us a total of five tables in our example. The two additional tables just list all the connections between keywords and themes and between themes and sounds.

+---------------+     +-----------------+      +-------------+
| keywords      |     | keywords2themes |      | themes      |
+---------------+     +-----------------+      +-------------+
| keyword_id PK |<----| keyword_id   PK |  +-->| theme_id PK |<--+
| keyword       |     | theme_id     PK |--+   | theme       |   |
+---------------+     +-----------------+      +-------------+   |
                                                                 |
                                                                 |
                        +-------------+      +---------------+   |
                        | sounds      |      | themes2sounds |   |
                        +-------------+      +---------------+   |
                        | sound_id PK |<--+  | theme_id   PK |---+
                        | sound       |   +--| sound_id   PK |
                        +-------------+      +---------------+

This is of course the simple standard answer I gave at first. The definitions of the two additional tables would look something like the following in our schema:

CREATE TABLE `keywords2themes` (
  `keyword_id` int(10) unsigned NOT NULL,
  `theme_id` int(10) unsigned NOT NULL,
  PRIMARY KEY (`keyword_id`,`theme_id`),
  UNIQUE KEY `reverse_primary` (`theme_id`, `keyword_id`),
  CONSTRAINT `keywords_fk` FOREIGN KEY (`keyword_id`)
    REFERENCES `keywords` (`keyword_id`),
  CONSTRAINT `keywords2themes_fk` FOREIGN KEY (`theme_id`)
    REFERENCES `themes` (`theme_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE `themes2sounds` (
  `theme_id` int(10) unsigned NOT NULL,
  `sound_id` int(10) unsigned NOT NULL,
  PRIMARY KEY (`theme_id`,`sound_id`),
  UNIQUE KEY `reverse_primary` (`sound_id`, `theme_id`),
  CONSTRAINT `sounds2themes_fk` FOREIGN KEY (`theme_id`)
    REFERENCES `themes` (`theme_id`),
  CONSTRAINT `sounds_fk` FOREIGN KEY (`sound_id`)
    REFERENCES `sounds` (`sound_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

So what's wrong with this, you might ask. Nothing. At first sight. At least not structurally. But let's add some content to our tables. Some of the examples from the first paragraphs plus an additional theme FARM will be a good point to start with.

INSERT INTO keywords (keyword_id, keyword)
VALUES      (1, 'sun'), (2, 'countryside'), (3, 'vacation');
INSERT INTO themes (theme_id, theme)
VALUES      (1, 'BEACH'), (2, 'FARM');
INSERT INTO keywords2themes (keyword_id, theme_id)
VALUES      (1, 1), (2, 2), (3, 1);
INSERT INTO sounds (sound_id, sound)
VALUES      (1, 'waves.mp3'), (2, 'dog_barking.mp3'), (3, 'wind.mp3');
INSERT INTO themes2sounds (theme_id, sound_id)
VALUES      (1, 1), (1, 2), (1, 3), (2, 2);

A SELECT with a few JOINs and GROUP_CONCATs gives us a list of all themes with their associated keywords and sounds, just to get a quick overiew of the current situation.

SELECT     theme,
           GROUP_CONCAT(DISTINCT keyword) AS keywords,
           GROUP_CONCAT(DISTINCT sound) AS sounds
FROM       themes AS t
INNER JOIN keywords2themes AS k2t
ON         k2t.theme_id = t.theme_id
INNER JOIN keywords AS k
ON         k2t.keyword_id = k.keyword_id
INNER JOIN themes2sounds AS t2s
ON         t2s.theme_id = t.theme_id
INNER JOIN sounds AS s
ON         t2s.sound_id = s.sound_id
GROUP BY   t.theme_id;

That's what we currently got in our database:

+-------+--------------+------------------------------------+
| theme | keywords     | sounds                             |
+-------+--------------+------------------------------------+
| BEACH | sun,vacation | waves.mp3,dog_barking.mp3,wind.mp3 |
| FARM  | countryside  | dog_barking.mp3                    |
+-------+--------------+------------------------------------+

Now we can search search for sounds directly by keywords, bypassing themes (as requested by the original question):

SELECT     DISTINCT sound
FROM       keywords AS k
INNER JOIN keywords2themes AS k2t
ON         k.keyword_id = k2t.keyword_id
INNER JOIN themes2sounds AS t2s
ON         k2t.theme_id = t2s.theme_id
INNER JOIN sounds AS s
ON         t2s.sound_id = s.sound_id
WHERE      k.keyword = ?;

A search for vacation using the above query gives the expected results.

+-----------------+
| sound           |
+-----------------+
| waves.mp3       |
| dog_barking.mp3 |
| wind.mp3        |
+-----------------+

But what happens if we start to add more and more keywords and sounds? For example we'd like to add the keyword long drink and associate it with our existing theme BEACH:

INSERT INTO keywords VALUES (4, 'long drink');
INSERT INTO keywords2themes VALUES (4, 1);

A search for this new keyword long drink now gives us surprising results. Or would you expect a barking dog to be listed prominently in the result set?

+-----------------+
| sound           |
+-----------------+
| waves.mp3       |
| dog_barking.mp3 |
| wind.mp3        |
+-----------------+

So what should we do? This kind of two-step associative search is probably always problematic. A simple solution would be to just bypass the detour via themes and link sounds to keywords directly with an additional table keywords2sounds. But that was not what the original question was asking for and there's maybe a good reason why. Maintenance is just one example that jumps to my mind. The database will perhaps list thousands of sound files in the future, many more than themes. Adding a new keyword to this system means going through the list of all sound files and themes and checking for every single one if it corresponds to the new keyword. Not having this additional table keywords2sounds thus saves you from a lot of work when adding new keywords.

But there's another solution that could improve the search quality of the system with just two additional columns to the existing schema.

+---------------+     +-----------------+      +-------------+
| keywords      |     | keywords2themes |      | themes      |
+---------------+     +-----------------+      +-------------+
| keyword_id PK |<----| keyword_id   PK |  +-->| theme_id PK |<--+
| keyword       |     | theme_id     PK |--+   | theme       |   |
+---------------+     | weight (NEW)    |      +-------------+   |
                      +-----------------+                        |
                                                                 |
                        +-------------+      +---------------+   |
                        | sounds      |      | themes2sounds |   |
                        +-------------+      +---------------+   |
                        | sound_id PK |<--+  | theme_id   PK |---+
                        | sound       |   +--| sound_id   PK |
                        +-------------+      | weight (NEW)  |
                                             +---------------+

Every relationship is now weighted. This means that we take advantage of the fact that the felt associativity between keywords and themes or themes and sounds differs substantially. Whereas the sound of waves is very strongly associated with a beach, a barking dog is much less so. The same is probably true for vacation and beach (fairly strong associativity) and long drink and beach (slightly weaker associativity).

So let's add the additional columns for the weighting. We'll use values between 0.1 (weakest associativity) and 1.0 (strongest associativity) in our examples, so a column of type DECIMAL(2, 1) should be fine.

ALTER TABLE keywords2themes ADD COLUMN weight DECIMAL(2, 1) NOT NULL;
ALTER TABLE themes2sounds ADD COLUMN weight DECIMAL(2, 1) NOT NULL;

These are some example weightings that should fit our keywords, themes and sounds:

UPDATE keywords2themes SET weight = 1.0
  WHERE keyword_id = 1 AND theme_id = 1;  -- sun-BEACH             1.0
UPDATE keywords2themes SET weight = 0.8
  WHERE keyword_id = 2 AND theme_id = 2;  -- countryside-FARM      0.8
UPDATE keywords2themes SET weight = 0.9
  WHERE keyword_id = 3 AND theme_id = 1;  -- vacation-BEACH        0.9
UPDATE keywords2themes SET weight = 0.6
  WHERE keyword_id = 4 AND theme_id = 1;  -- long drink-BEACH      0.6
UPDATE themes2sounds SET weight = 1.0
  WHERE theme_id = 1 AND sound_id = 1;    -- BEACH-waves.mp3       1.0
UPDATE themes2sounds SET weight = 0.6
  WHERE theme_id = 1 AND sound_id = 2;    -- BEACH-dog_barking.mp3 0.6
UPDATE themes2sounds SET weight = 0.8
  WHERE theme_id = 1 AND sound_id = 3;    -- BEACH-wind.mp3        0.8
UPDATE themes2sounds SET weight = 0.8
  WHERE theme_id = 2 AND sound_id = 2;    -- FARM-dog_barking.mp3  0.8

To get the final weighting of a sound for any given keyword we just have to multiply the weightings of both relationships. Our query to retrieve sounds from keywords thus looks like so:

SELECT     sound,
           ROUND(MAX(k2t.weight * t2s.weight), 1) AS final_weight
FROM       keywords AS k
INNER JOIN keywords2themes AS k2t
ON         k.keyword_id = k2t.keyword_id
INNER JOIN themes2sounds AS t2s
ON         k2t.theme_id = t2s.theme_id
INNER JOIN sounds AS s
ON         t2s.sound_id = s.sound_id
WHERE      k.keyword = ?
GROUP BY   sound
HAVING     final_weight >= 0.5
ORDER BY   final_weight DESC;

You might wonder why there is a GROUP BY sound clause and a MAX() function on the final_weight in the query: A single sound file might be found multiple times for a given keyword if there are multiple themes associated with that keyword and containing that sound. In such cases we only want the sound to be listed once (that's what the DISTINCT was for in the original query) with its maximum weight.

Now we have a possibility to define a threshold for results (HAVING clause), which allows to filter sounds with a small associativity for a keyword, and we can also sort the result to display the best matches first (ORDER BY clause). This gives us much more intuitive results.

+--------------------------------+
| Results for "vacation"         |
+-----------------+--------------+
| sound           | final_weight |
+-----------------+--------------+
| waves.mp3       |          0.9 |
| wind.mp3        |          0.7 |
| dog_barking.mp3 |          0.5 |
+-----------------+--------------+

The new system also no longer lists a barking dog as a sound associated with a long drink:

+--------------------------+
| Results for "long drink" |
+-----------+--------------+
| sound     | final_weight |
+-----------+--------------+
| waves.mp3 |          0.6 |
| wind.mp3  |          0.5 |
+-----------+--------------+

The users will now get what they asked for, at least as long as themes and weights are organised properly. And we learned that even a correctly designed database schema sometimes is not the final solution. We also have to take the semantics of the stored data into consideration.