How to improve your efficiency using inside sorting algorithms
The question is how to improve your efficiency using inside sorting algorithms… Well, we are all familiar with sorting stuff: take the alphabet, for instance; that’s just sorted letters. Or consider any book’s table of contents: all of that book’s contents, conveniently arranged in a high-level representation so they can be found easily. Biology as a discipline is about sorting living organisms, and a calendar is just a sorted representation of time.
The key similarity across all these examples is efficiency: something is represented in such a way that it takes less time to find when needed (and is usually easier to understand, too, but that’s a topic for an entirely different post). The importance of this fact cannot be overstated: just imagine that you wanted to look up the meaning of a word in a dictionary… and the dictionary wasn’t sorted. Gulp. You’d need to check the entries one by one. Even if you could reliably check 5 entries per second (which you probably can’t), assuming a conservative 170,000-word estimate for the size of the English vocabulary, knowing that a word is not in the dictionary would take you close to 10 hours. Without eating or sleeping.
The fact that every time you look up a word in a dictionary you spend thirty seconds instead of hours is thanks to sorting. So, as your mother said, being organized pays off.
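If you want to sanity-check those numbers, here’s a quick back-of-the-envelope sketch in Python. It reuses the assumptions above (170,000 entries, 5 checks per second) and, for the sorted case, assumes you can discard roughly half of the remaining entries with every check, which is essentially what you do when you flip straight to the right section of the dictionary:

```python
import math

VOCABULARY_SIZE = 170_000   # the conservative estimate used above
CHECKS_PER_SECOND = 5       # a generous human reading speed

# Unsorted dictionary: to be sure a word is absent, you scan every entry.
linear_seconds = VOCABULARY_SIZE / CHECKS_PER_SECOND
print(f"unsorted: ~{linear_seconds / 3600:.1f} hours")   # ~9.4 hours

# Sorted dictionary: each check lets you discard half of what is left.
sorted_checks = math.ceil(math.log2(VOCABULARY_SIZE))
print(f"sorted:   ~{sorted_checks} checks, a handful of seconds")  # 18 checks
```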
The limitations of alphabetical order
If you are like us, then you are the kind of person who likes to sort stuff everywhere in your life and, most importantly, to improve efficiency so that you can finish things early and spend the rest of the afternoon doing something truly remarkable, like riding your bike. If that’s the case, then you are probably also frustrated by the limitations of alphabetical order. Wait, what? Oh yeah, we just said it!
As unbelievably useful and as epically transparent as it is, alphabetical order has one rather staggering limitation: it is anchored to left-to-right comparison. That is, if two items have the same first letter, they are sorted based on their second letter, and so on.
Usually that’s great, actually, and all you need to do some meaningful sorting, particularly when sorting dictionary words.
However, at some point or other, we have all run into the problem that predefined and undefined end up pages apart in the dictionary despite being obvious variations of the same root. Alphabetical sorting is efficient and straightforward, but it fails to capture dimensions along which two words may be closer to each other than their spelling suggests.
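A tiny example makes the point. Sort a handful of words and watch the variations of define scatter, while an unrelated word like predator lands right next to predefined simply because it shares a prefix:

```python
words = ["undefined", "predefined", "predator", "understand", "redefined"]
print(sorted(words))
# ['predator', 'predefined', 'redefined', 'undefined', 'understand']
# The three "define" variants are scattered, while "predator" sits next to
# "predefined" purely on shared leading letters.
```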
Today we are concerned with those additional dimensions. What might we be missing when we sort alphabetically that could be better captured along some other dimension? Especially when it comes to elements longer than words: sequences of words like phrases, titles, or sentences. If we sort them alphabetically, we will definitely see some interesting patterns, but only those visible through left-to-right comparison. What other patterns might become apparent if we had another way to sort them?
As a matter of fact, that’s something I’ve been looking into for a while. As a computational linguist, I usually work on making sense of unstructured data (or, far worse than unstructured, sometimes plain unbelievable data). I often need to find patterns in data that simply doesn’t follow any meaningful alphabetical order. The solution I came up with is inside sorting, namely, sorting sequences of words according to the words in them and, more specifically, according to how strongly statistically associated those words are.
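To make “statistically associated” a bit more concrete, here is a minimal sketch using pointwise mutual information (PMI) over a few made-up phrases. PMI is just one common way to measure word association; I’m not claiming it is the exact measure the prototype uses:

```python
import math
from collections import Counter
from itertools import combinations

# Made-up phrases, purely for illustration.
phrases = ["red running shoes", "blue running shoes",
           "red wine glass", "white wine glass"]
n = len(phrases)

word_counts = Counter(w for p in phrases for w in set(p.split()))
pair_counts = Counter(pair for p in phrases
                      for pair in combinations(sorted(set(p.split())), 2))

def pmi(w1, w2):
    """How much more often two words co-occur than chance would predict."""
    joint = pair_counts[tuple(sorted((w1, w2)))] / n
    return math.log2(joint / ((word_counts[w1] / n) * (word_counts[w2] / n)))

print(pmi("running", "shoes"))  # 1.0 -> strongly associated
print(pmi("red", "shoes"))      # 0.0 -> nothing beyond chance
```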
But what exactly is inside sorting?
Consider the sequences of non-contiguous letters below. For simplicity, they are sentences made up of really short words. At this point, they are in no particular order (just the order I typed them in). However, keep an eye on the line index at the beginning of each row, because something is about to happen to them:
Imagine we now inside-sort them, following the word-association criterion described above for our ideal inside-sorting method. The output might look something like this:
Notice the line indexes at the beginning of each row: they are now all over the place! Is this really sorted? you may rightly ask. Not numerically, that’s for sure, we would answer.
If we look more carefully, we’ll start to notice several relevant patterns:
1. All the c’s are in the bottom half.
2. Most a’s are in the upper half, except those co-occurring significantly with c’s (which follows nicely from the previous point).
3. Within the c group in the bottom half, all rows containing a b are clustered together.
4. The rows containing d’s are also clustered together and are actually sitting exactly where they started. As the only ones containing d’s, they were in practice already sorted, and we simply needed to avoid messing them up.
Let’s now reveal the representation the inside-sorting algorithm created for this data. Notice the two intermediate columns added to each row below: the first one contains each row’s cluster index, and the second describes that cluster as a slash-separated sequence of increasingly specific features (think of a folder tree):
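To give a more concrete feel for that kind of representation, here is a toy sketch in Python. The rows and the frequency-based paths below are made up for illustration; the actual prototype builds its representation from word-association statistics rather than raw counts, but the shape of the output (a cluster index plus a folder-tree-like feature path per row) is the same idea:

```python
from collections import Counter

# Toy rows in the spirit of the example above: "sentences" of very short words.
rows = ["a b a", "c a c", "a a b", "c c b",
        "d a d", "c b c", "a b b", "d d a"]

token_counts = Counter(tok for row in rows for tok in row.split())

def feature_path(row):
    # More frequent tokens are more general and come first; rarer tokens
    # narrow things down, like descending into subfolders.
    toks = sorted(set(row.split()), key=lambda t: (-token_counts[t], t))
    return "/".join(toks)

paths = [feature_path(row) for row in rows]
cluster_of = {path: i for i, path in enumerate(sorted(set(paths)))}

# Sorting on the feature path pulls rows that share salient tokens together,
# while the original line indexes end up "all over the place".
for original_index in sorted(range(len(rows)), key=lambda i: paths[i]):
    path = paths[original_index]
    print(original_index, cluster_of[path], path, rows[original_index], sep="\t")
```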
To sum up, inside sorting provides a lightweight pseudo-clustering algorithm that differs in a few ways from actual clustering or clustering-like algorithms (anything from K-Means to K-Nearest Neighbors to LDA). More specifically, inside sorting:
1. is deterministic
2. is not terribly slow (13 seconds for ~8,000 Amazon titles)
3. does not need to be given the target number of partitions as a parameter (it obviously has parameters, but only ones indirectly related to the number of clusters). Usually, you want the algorithm to figure out the appropriate number of clusters for you. If you already knew the clusters, you probably wouldn’t need to cluster in the first place!
4. can be applied to (and is actually designed for) short texts like sentences or phrases: tweets, e-commerce product titles, search queries, etc. We haven’t tried it on blog posts yet, but it may be able to handle them, too.
And now…?
In a follow-up to this post, we will go through some real-world examples and take a deeper look at the code (ugly and obscure as it is 😀 it’s still in its prototype version at this point, I’m afraid), which we definitely hope some of our more eager readers are already googling for. Let us save you some time: go ahead and check out this GitHub repo if you’re interested.
As always, any suggestions welcome!
Happy sorting!!
Originally published at blog.quarizmi.com on February 23, 2016.