Time series data contains signals that are easy to miss and expensive to misread.

For businesses working with sensor data, operational telemetry, machine logs, and other sequential data streams, accurate classification directly affects uptime, risk, and maintenance costs, and creates competitive advantage.

Today, organisations still rely on specialists to manually adapt classification pipelines for new data streams. Signal Sphinx's new model changes that. Our methods automate preprocessing and signal discovery, allowing businesses to uncover hidden patterns in raw temporal data without constant manual intervention or domain-specific tuning.

By solving the long-standing problem of automated preprocessing in time series classification, our model attains leading accuracy results on the widely used Bakeoff Redux benchmark, even on a modest computational budget. In this white paper, we show how Signal Sphinx achieves automatic preprocessing, and contrast our approach with the previous state of the art.

Critical difference diagram from the Demsar-style analysis commonly used in time series classification benchmarks. Our model ranks in the top spot. The absence of connecting horizontal bars means there is a statistically significant separation from the runners-up. Our model is highlighted in red, and the model it is based on, KG-MTP, is highlighted in blue.
Our model leads the Bakeoff Redux benchmark among all currently published state-of-the-art models. Critical difference diagram using the Wilcoxon-Holm analysis.

Introduction

The time series classification literature has addressed many aspects of automated preprocessing. Early papers reported good initial results [Gorecki & Luczak, 2013; 2014], but these results did not reproduce when paired with stronger classification pipelines. Later papers argued that it could never be done in the general case [Large et al., 2018], or quietly assumed the same [Lo et al., 2026; others]. Our solution hints at the reason: each of its components must be present and correctly applied for the method to work at all. When every link of the chain is in place, it pulls the model toward higher accuracy and lower compute cost; if just one link is weak or missing, the results are inconclusive at best, and may misleadingly suggest that the approach is fundamentally impossible.

A chain diagram illustrating the six interlinked components of our model: the chain is very strong, but only as long as all the links are present.
The six inter-connected components of our model

1. Different representations

We select alternative representations from a pool of filters that are known for their useful properties: smoothing functions that filter out noise, enhancing filters that expose signal, frequency/energy transforms, and others. These different views, when presented to the machine learning model, highlight different aspects that aid the model’s classification capabilities. If the pool of candidates is too small, we may not be able to find a good representation for a particular dataset.
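As a rough illustration of such a pool, the sketch below builds a few alternative views of one series. The function names, the `CANDIDATE_POOL` dictionary, and the choice of transforms are assumptions made for this example, not the actual filter pool of our model.

```python
import math

def identity(x):
    # The raw representation, passed through unchanged.
    return list(x)

def first_difference(x):
    # Highlights local change; the output is one sample shorter than the input.
    return [b - a for a, b in zip(x, x[1:])]

def moving_average(x, window=3):
    # Simple smoothing filter; assumes len(x) >= window.
    return [sum(x[i:i + window]) / window for i in range(len(x) - window + 1)]

def periodogram(x):
    # Naive O(n^2) power spectrum up to the Nyquist frequency.
    n = len(x)
    power = []
    for k in range(n // 2 + 1):
        re = sum(v * math.cos(2 * math.pi * k * t / n) for t, v in enumerate(x))
        im = sum(v * math.sin(2 * math.pi * k * t / n) for t, v in enumerate(x))
        power.append((re * re + im * im) / n)
    return power

# Hypothetical candidate pool: name -> transform.
CANDIDATE_POOL = {
    "raw": identity,
    "diff": first_difference,
    "smooth": moving_average,
    "periodogram": periodogram,
}

def make_views(series, pool=CANDIDATE_POOL):
    # One alternative representation of the series per candidate transform.
    return {name: transform(series) for name, transform in pool.items()}
```

Calling `make_views(series)` returns one representation per entry in the pool; a larger pool simply means more entries in the dictionary.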

2. Know when to gate

There are two kinds of transforms: universal and niche. We always apply the universally useful ones. Niche transforms are gated: we only admit them when statistical evidence shows they are a good fit. This dual approach reduces the inherent noise of the decision process.

For the raw representation, and a small handful of others, gating would mostly lead to false negatives. These transforms are beneficial on all but a small fraction of datasets, so we want the classifier to always have access to them.

The remaining representations are more or less niche. In most cases, they hurt more than they help. To avoid false positives, we admit them only when a statistical evaluation of the evidence shows that they help.
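The two-tier policy can be sketched in a few lines. Here `gate` is a hypothetical callable standing in for the statistical test, and the representation names are placeholders chosen for the example.

```python
def select_representations(universal, niche, gate):
    """Return the names of the representations handed to the classifier.

    universal: names that are always admitted (never gated)
    niche:     names that are admitted only if the gate accepts them
    gate:      hypothetical callable, name -> bool, standing in for the
               statistical evaluation of a niche transform
    """
    selected = list(universal)
    for name in niche:
        if gate(name):
            selected.append(name)
    return selected

# Example with made-up gate decisions for a single dataset.
decisions = {"hilbert": True, "sine": False, "cosine": False}
chosen = select_representations(["raw", "diff"], decisions, decisions.get)
```

In this example `chosen` contains the two universal representations plus the one niche transform the gate accepted.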

3. Strong metric

We score a filter by how much it increases the classifier’s margin, i.e. the distance from a classification boundary. The metric is based on hinge loss [Cortes & Vapnik, 1995; Crammer & Singer, 2001], which is a measure of how confidently the model can separate different classes.

This picture helps the reader visualize the margin-based metric. Two graphs. The graph on the left shows the margin delta, and the graph on the right shows the metric delta. In-graph legend says: "mean delta loss = 0.383; improved=71% of all samples; mode=multiclass". The metric is overwhelmingly positive, with most deltas being positive, and only a handful being negative. The positive metric deltas are also larger on average than the negative deltas.
Metric visualization -- multiclass problem with four classes labelled y = 0, 1, 2, 3. Samples ordered by raw margin. Samples where margin change was zero have been omitted for clarity. On the left: margin. On the right: hinge loss difference (our metric). Note that the metric does not reward margin increase past a certain good-enough threshold (here, 1.0), but improvements of margin are rewarded equally regardless of whether the classification result is false or true, positive or negative.

Viewed in isolation, hinge loss appears poorly suited to this setting. Accuracy becomes unstable as the margin metric is maximized. The metric only becomes reliable when this instability is regularized by additional ensembling.

The interaction between the metric, the underlying problem, and the transformation being applied is complex. The key takeaway is that the margin metric is not reliable on its own. However, we have discovered it works very well when paired with regularisation.
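To make the metric concrete, here is a minimal sketch of a multiclass hinge loss in the style of Crammer & Singer, and of the per-dataset gain it yields when a filter is applied. The function names and the sign convention (a positive gain means the filter lowered the loss) are assumptions for this example; the threshold of 1.0 matches the figure above.

```python
def multiclass_margin(scores, label):
    # Score of the true class minus the best competing class score.
    rival = max(s for j, s in enumerate(scores) if j != label)
    return scores[label] - rival

def hinge_loss(scores, label, threshold=1.0):
    # Zero once the margin reaches the good-enough threshold.
    return max(0.0, threshold - multiclass_margin(scores, label))

def mean_hinge_gain(raw_scores, filtered_scores, labels):
    # Positive when the filtered representation lowers the loss on average.
    gains = [
        hinge_loss(raw, y) - hinge_loss(filtered, y)
        for raw, filtered, y in zip(raw_scores, filtered_scores, labels)
    ]
    return sum(gains) / len(gains)

# Two samples, both with true class 0.
raw = [[0.2, 0.5, 0.1], [2.0, 0.5, 0.1]]
filtered = [[0.9, 0.5, 0.1], [4.0, 0.5, 0.1]]
gain = mean_hinge_gain(raw, filtered, labels=[0, 0])
```

The first sample moves from a negative margin to a positive one and is rewarded. The second sample was already past the threshold, so its larger margin earns nothing.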

4. Skeleton classifier

We start with a strong, state-of-the-art classifier (one that achieves the highest accuracy), and we keep only the parts of the classification pipeline required to calculate the metric. This skeleton classifier performs work equivalent to hundreds of full iterations in only about 2.5× the time, making the gate feasible.

Starting with a strong classifier is important: a weak classifier has a relative lack of discernment, and can therefore be helped even by simple filters. In the three graphs below, reproduced from a seminal 2014 work by Gorecki & Luczak, we can see three different transforms all yielding relatively comparable improvements when paired with a weak classifier.

Reproduction from the Gorecki & Luczak 2014 paper. Three 4-quadrant graphs show true positives, true negatives, false positives, and false negatives, as results of three transformations: cosine, sine, and the Hilbert transform. Judged from the graphs, all three transforms are beneficial, with many true positives. This is because the base classifier is weak enough to benefit even from simple transforms.
Weak classifier benefits even from simple transforms -- here sine, cosine, and the Hilbert transform all appear to be beneficial, when paired with a weak classifier. Courtesy of [Gorecki & Luczak, 2014 (Figure 9)] -- their visualisations are the best.

A strong classifier already has a much better perspective internally, and will find the rudimentary manipulations of both sine and cosine to be at best neutral, whereas it will be able to leverage the more complex Hilbert transform much better. Therefore, a weak classifier cannot be used in the gate as a stand-in for the classifier we actually plan to use for the final classification.

At the same time, strong performance and computational cost are at odds. We take full advantage of recent advances in computational efficiency: our base models already achieve results comparable to much more computationally expensive models on a much humbler computational budget. We further strip them down to the absolute minimum and preserve the only thing that matters: the performance comparison of the filtered and raw representations.

5. Statistical evaluation

We repeatedly evaluate randomized samples from the training set, and after each iteration we perform a statistical test. We stop once we can say with sufficient confidence that the filter helps or hurts. If the decision stays unclear, or the gain is too small, our model defaults to the unfiltered data.

Our team has been leveraging the speed of the skeleton classifier to run orders of magnitude more experiments, achieving a much higher speed of development. Without this, finding a statistically sound evaluation would likely not have been possible. The central insight is to invest a given compute budget as effectively as possible.

To assess both the stability and the overall effect of the filter, we characterize the gate by a single scalar: the confidence that the decision is correct. The procedure is as follows: (i) repeatedly evaluate randomized samples of the training data for a given filter; (ii) once the confidence in the estimated effect (improvement or degradation) exceeds the threshold, return the decision (see image below); (iii) if the effect remains uncertain after a fixed maximum number of iterations, reject the filter.

The image shows three graphs with explanatory text. The top graph portrays the state after n = 6 samples, where the result is not yet conclusive because the null effect, represented by 0, is still inside the yellow band of plausible mean Δ values. The bottom graphs show the state at n = 8. On the left, 0 has fallen outside the yellow band to the right, indicating mean Δ < 0, so the filter lowers loss and the transform is kept. On the right, the distribution is shifted to the right of 0, indicating mean Δ > 0, so the filter increases loss and the transform is rejected.
By estimating the effect of the filter as a probability distribution, we can characterize the confidence in the computed mean. Once the null effect (represented by 0) has "fallen out" of the band of plausible mean values, the decision is reached.

This allows us to reduce the complex evaluation to a simple question -- are we really sure the filtered representation is better than the raw one -- and answer it in a much more meaningful way.
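The stopping rule described in this section can be sketched as a sequential test. The sketch follows the sign convention of the figure above (Δ is filtered loss minus raw loss, so Δ < 0 means the filter helps). It uses a normal approximation for the confidence band, which is an assumption made to keep the example short; `sample_delta`, `min_iters`, `max_iters` and `min_gain` are hypothetical names and defaults, not our production settings.

```python
from statistics import NormalDist, mean, stdev

def sequential_gate(sample_delta, confidence=0.90, min_iters=3,
                    max_iters=50, min_gain=0.0):
    """Decide whether to keep a filter.

    sample_delta: callable returning one delta = filtered loss - raw loss,
                  measured on a fresh random sample of the training set.
    Returns (decision, iterations_used), decision being "keep" or "reject".
    """
    # Two-sided critical value (normal approximation, an assumption here).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    deltas = []
    for _ in range(max_iters):
        deltas.append(sample_delta())
        n = len(deltas)
        if n < min_iters:
            continue
        centre = mean(deltas)
        half_width = z * stdev(deltas) / n ** 0.5
        if centre + half_width < -min_gain:
            return "keep", n      # band lies below 0: the filter lowers loss
        if centre - half_width > 0:
            return "reject", n    # band lies above 0: the filter raises loss
    # Still unclear, or the gain is too small: default to the unfiltered data.
    return "reject", len(deltas)
```

A filter with a clear effect is decided after a few iterations, while an ambiguous one runs to `max_iters` and is rejected, so the cost of the gate concentrates on the borderline candidates.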

Winner's curse

The effective confidence level drops precipitously: at 7 candidates the odds of a cursed winner are roughly even, and at 50 candidates the winner is almost certainly cursed. The cost of correcting for this is superlinear, but not explosively so. The dominant term is linear in candidate count, so most of the increase comes from the larger candidate set itself. The winner’s curse correction adds a significant but secondary multiplier that grows roughly logarithmically: about 2.8× over the 7 to 2000 range.
Winner's curse -- Effective confidence level and compute requirements of the gate, in units of base classifier compute, at a baseline confidence level of 90%. Model: total compute = 1 + 1.5 × number of candidates × sample multiplier. Base classifier = 1×. Gate cost at 1 candidate = 1.5×. The sample multiplier preserves two-sided 90% confidence using the Šidák correction (the Bonferroni correction gives essentially the same results and is not shown).

There is a downside to having a large pool of candidate transforms: the winner's curse. As the number of candidates grows, we must increase the number of samples accordingly. If we don't, even with a relatively small number of candidates, it becomes more probable that a winning candidate is chosen based on luck rather than performance. This presents us with a dilemma: a trade-off between computational budget and the risk of the gate making a wrong decision. If we overcorrect, we spend compute needlessly. If we undercorrect, we can lose precision drastically, and the gate's decisions become unreliable. Calculating the required number of additional samples exactly is therefore essential when dealing with the winner's curse.
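The arithmetic behind this trade-off can be sketched as follows. The sketch assumes independent tests and assumes that the sample multiplier scales with the squared ratio of normal critical values (because the width of a confidence band shrinks with the square root of the sample count). The function names are hypothetical; the cost model and the 90% baseline are taken from the figure caption above.

```python
from statistics import NormalDist

def effective_confidence(per_test_confidence, n_candidates):
    # Chance that none of n independent uncorrected tests is a false decision.
    return per_test_confidence ** n_candidates

def sidak_confidence(family_confidence, n_candidates):
    # Per-test confidence needed to keep the family-wide confidence.
    return family_confidence ** (1.0 / n_candidates)

def sample_multiplier(family_confidence, n_candidates):
    # Assumption: samples needed grow with the squared critical-value ratio.
    z = NormalDist().inv_cdf
    base = z(0.5 + family_confidence / 2)
    corrected = z(0.5 + sidak_confidence(family_confidence, n_candidates) / 2)
    return (corrected / base) ** 2

def total_compute(n_candidates, family_confidence=0.90, gate_cost=1.5):
    # Cost model from the caption, in units of base classifier compute.
    multiplier = sample_multiplier(family_confidence, n_candidates)
    return 1 + gate_cost * n_candidates * multiplier
```

Under these assumptions, `effective_confidence(0.9, 7)` is about 0.48, which is the "even odds" case, and `total_compute(1)` is 2.5.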

6. Additional ensembling

Recent state-of-the-art time-series classification methods use feature concatenation to ensemble the individual base classifiers, as illustrated below. This feature-level ensembling is one of several innovations that have enabled modern methods to achieve state-of-the-art performance at a fraction of the computational cost of previous approaches. However, this approach is not perfect and leaves a lot of performance on the table: concatenation destroys some of the useful information available to the base classifiers.

Diagram of a time-series classification pipeline. Raw time-series data is transformed into multiple preprocessing representations: identity, differencing, smoothing, wavelets, periodogram, and additional transforms. Random one-dimensional kernels are applied by convolution. Pooling operations summarize the kernel outputs using MAX and MIN, recording the maximum and minimum, respectively, of each dot-product vector. Some pooling summaries are kept and others are pruned as weak or redundant. The remaining summaries are concatenated into a feature vector and passed to a Ridge classifier. Ridge is a regularised variant of linear regression.
How a classifier works: overview of a modern classification pipeline. Adapted from the excellent [Middlehurst et al., 2024 (Figure 26, page 35)].

In addition to this feature-level ensembling, our model also uses traditional score-level ensembling. Each base classifier contributes a separate soft vote, and the final class scores are computed as a weighted linear combination of the models’ class-wise decision scores. These additional votes regularise the final prediction and further improve accuracy, allowing us to use a stronger evaluation metric. In our experiments, this approximately doubles the accuracy gain obtained from ensembling.

Flow diagram for multiclass linear ensembling: class-wise score vectors from models A and B are combined using weight alpha, with each class score computed as (1-alpha) times model A score plus alpha times model B score; the predicted label is the class with the highest combined score.
Linear ensembling (multiclass): Decision scores of base models A and B are combined at weighting ratio α. Final prediction y_hat is the class label with the highest combined score.
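The combination rule in the diagram is small enough to write out directly. The function name and the example scores below are made up for illustration.

```python
def ensemble_predict(scores_a, scores_b, alpha):
    """Combine class-wise decision scores of two base models.

    Each combined class score is (1 - alpha) * A + alpha * B; the predicted
    label is the index of the class with the highest combined score.
    """
    combined = [(1 - alpha) * a + alpha * b for a, b in zip(scores_a, scores_b)]
    label = max(range(len(combined)), key=combined.__getitem__)
    return label, combined

# Model A favours class 0, model B favours class 1.
scores_a = [1.0, 0.2, -0.5]
scores_b = [-0.4, 0.9, 0.1]
label, combined = ensemble_predict(scores_a, scores_b, alpha=0.5)
```

With equal weights the ensemble sides with model B here, because B's preference for class 1 is stronger than A's preference for class 0 once the scores are averaged.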

Regularization

The additional score-level ensemble also acts as a regulariser for the margin-maximised classifiers. Margin maximisation increases the separation between the selected class and the alternatives, but it is not inherently aware of whether the selected class is correct, so it can also pull a sample further toward the wrong class. The key point is that these incorrect margin pulls tend to be brittle and, as a result, differ across base classifiers. When the base classifiers’ scores are combined, many of these unstable incorrect pulls cancel out, while the more robust correct pulls are reinforced.

Conclusion

A chain is only as strong as its weakest link.
— Anne Robinson, probably

Our model represents a significant step forward in time series classification. On the challenging Bakeoff Redux benchmark, it improves accuracy by 0.80 percentage points over the model it builds on. That gain is larger than any single improvement previously achieved across this model’s lineage, making it clear that this is not an incremental refinement.

Line graph showing model accuracy on the Bakeoff Redux benchmark from 2019 to 2026. The plotted series starts with Rocket in 2019 at 86.80%; MiniRocket in 2020 at 87.40%; and MultiRocket in 2021 at 88.18%; showing a steep early improvement. Progress then slows: Hydra-MultiRocket reaches 88.39% in 2022; there is no improvement in 2023; and SelF-Rocket reaches 88.48% in 2024. Accuracy improves again with KG-MTP at 88.88% in 2025, followed by our model at 89.68% in 2026, the largest single jump shown in the graph.
Accuracy gains on the Bakeoff Redux benchmark are hard won. After rapid early gains from Rocket to MultiRocket, progress largely plateaued until KG-MTP. Our model produces the largest single improvement in this lineage, reaching 89.68% average accuracy.

The results depend on all six components working in concert. First, the method requires a wide variety of representations, so that useful structure in the data is likely to be exposed somewhere in the candidate pool. Second, gating must be applied only when it is appropriate: unnecessary gating leaves performance on the table, while not gating when one should leads to performance degradation. Third, the gating decision must be based on the strongest available metric, because the quality of that decision determines which representations are retained and which are discarded. Fourth, the classifier must be made substantially faster without degrading performance with respect to the gating metric; otherwise, the speedup is obtained by changing the problem rather than solving it more efficiently. Fifth, the gating decision must be obtained with sufficient statistical rigour, in order to leverage available information and contrast it against the irreducible noise floor. Finally, model outputs must be regularised by additional ensembling; otherwise the gains from margin maximisation will be outweighed by the instability it introduces.

Free signal-loss evaluation

Employing our model creates practical value for data analysis in environments where accuracy and computational budgets matter. If your business depends on sensor data, telemetry, or other time-dependent data streams, request a free evaluation today. We’ll show you where hidden signal loss may already be costing you money.

References

Cortes & Vapnik (1995). Support-vector networks. Machine Learning, 20, 273–297. https://doi.org/10.1007/BF00994018
Crammer & Singer (2001). On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2, 265–292. https://doi.org/10.5555/944790.944813
Gorecki & Luczak (2013). Using derivatives in time series classification. https://doi.org/10.1007/s10618-012-0251-4
Gorecki & Luczak (2014). Non-isometric transforms in time series classification using DTW. http://dx.doi.org/10.1016/j.knosys.2014.02.011
Large et al. (2018). Can automated smoothing significantly improve benchmark time series classification algorithms? https://arxiv.org/pdf/1811.00894
Lo et al. (2026). Time series classification with random convolution kernels: pooling operators and input representations matter. https://arxiv.org/abs/2409.01115
Middlehurst et al. (2024). Bake off redux: a review and experimental evaluation of recent time series classification algorithms. https://arxiv.org/pdf/2304.13029