From Sequitur to STUMPY: The Evolution of Motif Discovery
Have you ever looked at a long string of data and wondered if there are hidden, repeating patterns within it? This could be a sequence of text, DNA, or even a discretized time series. The task of automatically identifying these recurring subsequences is known as motif discovery, and it’s a fundamental problem in data analysis.
Finding Patterns in Symbols with Sequitur
One of the classic algorithms for this task is Sequitur, developed by Craig Nevill-Manning and Ian H. Witten. It’s an elegant algorithm that processes a sequence of discrete symbols and builds a context-free grammar from it. By identifying and replacing repeated patterns (digrams) with new rules, Sequitur simultaneously compresses the data and reveals its underlying hierarchical structure. The patterns it finds are the motifs.
To get a better feel for how it works, I’ve created a simple, interactive demonstration in JavaScript. You can paste in any text and see the motifs Sequitur discovers in real-time.
The Next Frontier: Continuous Data
But what if your data isn’t a neat sequence of symbols? What about continuous, real-valued time series data, like sensor readings, stock prices, or audio signals? The principles of motif discovery are just as relevant, but the methods need to adapt.
The journey from discrete to continuous motif discovery took an intermediate stop. Eamonn Keogh’s research group first came up with SAX as a way to tokenize a continuous time series and exploit discrete sequence algorithms like sequitur. Realising the shortcomings of such a method, Eamonn Keogh’s research group next came up with Matrix Profile, a revolutionary data structure that enables incredibly fast and scalable motif discovery in massive datasets.
The state-of-the-art implementation of this concept is a powerful Python library called STUMPY. It allows data scientists and analysts to efficiently find patterns, anomalies, and other structural information within their time series data. It’s a testament to how a powerful idea—finding repeated patterns—can evolve to tackle new and more complex challenges. The fact that STUMPY was originally implemented by TD Ameritrade suggests that it had been explored for some financial trading applications.