Algorithm to identify degree of similarity based on prior categorization

by user1594322   Last Updated December 03, 2017 19:05 PM

I am working with digital images that people have manually categorized on a number of parameters.

I want to be able to find images that are similar or dissimilar based on certain categories. For instance, we might want to pick two random images from our database that have similar "significance", but are otherwise as different as possible.

As an example, this picture of Stonehenge would be categorized like this:


colors: blue, green, grey
themes: natural, land, sky
brightness: light
location: outdoor
shapes: angular
texture: hard
lines: vertical
significance: historical

The categories are mostly fixed and small, so when manually categorized the person is choosing the main colors from a list of ten colors (red, orange, yellow, green, blue, purplse, white, black, grey, brown), and the same is true of the other categories. Most of the categories only have a small number of options, such as location is indoor, outdoor, or unknown, and brightness is light, dark, or neutral.

My naive approach is something like this:

  1. Query the database for pictures with historical significance
  2. Pick a random starting image from the historically significant images
  3. Of the remaining historical images, do a random sampling of N images. Different values of N will give us more variety at the expense of accuracy.
  4. For each randomly selected image, create an ordered binary vector for all possible category values (ex. red=0, blue=1, outdoor=1, etc), producing some binary value.
  5. Compare the binary values for the starting image and the random image, for example by using a bitwise XOR operation (or equivalent set operation), then count the number of 1's in the result.

I am not necessarily interested in finding the most dissimilar images. Any two images that are mostly dissimilar will be fine, and in fact we typically do want some randomness so the same images pairs are not frequently selected.

This naive approach works okay. Are there any proper algorithms or techniques I should look at that might apply to this problem?

Answers 2

Everything you said sounds close, but it's missing some details on vector creation and querying that are important. Ideally you need to form unique vectors for each image. You'll need to convert the color into HSV (or HSL) or a color space that lends itself to comparisons. Assign a number to each choice for the components.

Next assign a weight to each component in your vector. This should normalize the importance of each component. If you can imagine your vector in n-dimensional space you want each component to have equal weight (or be biased by its importance in the comparison). As an example lines might not be important so in the range [0, 1] horizontal = 0.2 and vertical = 0.4. Outdoor might be an important distinction so indoor = 0.5 and outdoor = 1.0. (You don't have to stay between 0 and 1, but it does keep it simple).

For colors you have to handle them specially. For color assign each component a number between [0, 1] (for each HSV component).

My only worry with your setup is your categorization is very simple and your color and theme system is a bit complex having multiple colors and themes. When generating a vector you'd essentially want to union the results with the vector and each color and theme. So for a single image with you'd generate every combination of vector (filtering for components you're searching for). Example:

(blue, natural,light, ...)
(blue, land, light, ...)
(blue, sky, light, ...)
(green, natural, light, ...)
(green, land, light, ...)
(green, sky, light, ...)
(grey, natural, light, ...)
(grey, land, light, ...)
(grey, sky, light, ...)

You'd convert each component to their numeric weighted values, so you'd have a real vector in n-dimensional space. (blue would be converted to 3 HSV components just to be very clear).

When you have vectors based on the parameters you want to compare you'd normally use a k-d tree nearest neighbor look-up.

Then perform for each vector an n neighbor look-up to get a set of vectors associated with an image. Then union the resulting sets. If you find nothing increase n in your sampling. You can use a dot product calculation averaged for all vector comparisons to kind of get a similarity factor if you're looking for a percentage of similarity. (Might need to be clever in that regard if you compare a lot of components since it'll make the images appear more dissimilar than they might seem to the eye).

This is a very rough overview of k-d tree data mining. You'd need to read a data mining book to get the more exact answer and other approaches. (Creating the weights though is a bit of an art since it's up to you if color is more important than theme or shape).

October 02, 2015 19:00 PM

Sirisian's answer started along the right path, but then he lost me, so here is my take on the issue.

The fact that you are dealing with images is irrelevant, and the fact that one of the properties must be identical is also irrelevant. They are both red herrings. All I see here is entities with properties, and the problem can easily be redefined to exclude the property which must be identical, and to only consider the subset of entities that do in fact have the same value on that excluded property.

So, basically, what you have is a set of entities where each entity has a list of N properties, and you have a separate entity which we will call the "reference" entity, and you want to search through that set to find entities that have as dissimilar as possible values from the reference entity.

Essentially, you want a maximum distance in N-dimensional space algorithm.

First of all you need to compute the N-dimensional vector of the values of your reference entity. This is essentially the coordinates of a point in N-dimensional space. (If you want, you can imagine that N = 3, so you can be thinking of the problem in 3-dimensional space.)

Then, you need to loop through the rest of your entities, and for each entity you need to calculate its N-dimensional vector, (imagine another point in 3-dimensional space,) then you need to calculate the distance between the reference vector and this vector, which is going to be another N-vector, and then you need to take the absolute value, a.k.a. magnitude of that vector, which will be a single value, and store it.

Once all the magnitudes have been collected, you need to sort all your entities by this magnitude value, and your desired results will be gathered near the end of the list, where the largest magnitudes will be.

This is the standard way of solving the bulk of your problem, and I would strongly advise solving it precisely like that, so as to be implementing an algorithm which will be understandable by others, and also by yourself, should you revisit it a few months later.

Now, your particular situation has certain peculiarities:

  • The values of your properties are not nice real numbers, they are sets of discrete values. So, you will need to map them to real numbers. A range from 0 to 1 will work just fine, as Sirisian suggested. You can probably combine your discrete color and your discrete brightness together into three property values, whether they be HSV, HSL, or perhaps even RGB, I don't think it will matter much.

  • You want a variable number of results, and you want some randomness. So, select a percentage of entities at the end of the sorted list, and choose a specific number from them, at random.

Mike Nakis
Mike Nakis
October 02, 2015 20:06 PM

Related Questions search algorithm?

Updated December 03, 2017 19:05 PM

Data structure for pattern matching

Updated December 12, 2016 08:02 AM