Time series data contains signals that are easy to miss and expensive to misread.
For businesses working with sensor data, operational telemetry, machine logs, and other sequential data streams, accurate classification directly affects uptime, risk, and maintenance costs, and creates competitive advantage.
Today, organisations still rely on specialists to manually adapt classification pipelines for new data streams. Signal Sphinx's new model changes that. Our methods automate preprocessing and signal discovery, allowing businesses to uncover hidden patterns in raw temporal data without constant manual intervention or domain-specific tuning.
By solving the long-standing problem of automated preprocessing in time series classification, our model attains leading accuracy results on the widely used Bakeoff Redux benchmark, even on a modest computational budget. In this white paper, we show how Signal Sphinx achieves automatic preprocessing, and contrast our approach with the previous state of the art.

Introduction
The time series classification literature has addressed many aspects of automated preprocessing. Early papers reported good initial results [Gorecki & Luczak, 2013; 2014], but these results did not reproduce when paired with stronger classification pipelines. Later papers argued that it could never be done in the general case [Large et al., 2018], or quietly assumed the same [Lo et al., 2026; others]. Our solution hints at the reason: each of its components must be present and correctly applied for the method to work at all. When every link of the chain is in place, it pulls the model toward higher accuracy and lower compute cost; if just one link is weak or missing, the results are inconclusive at best, and may misleadingly suggest that the approach is fundamentally impossible.

1. Different representations
We select alternative representations from a pool of filters that are known for their useful properties: smoothing functions that filter out noise, enhancing filters that expose signal, frequency/energy transforms, and others. These different views, when presented to the machine learning model, highlight different aspects that aid the model’s classification capabilities. If the pool of candidates is too small, we may not be able to find a good representation for a particular dataset.
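As an illustration only, a small candidate pool could be sketched as follows. The three filters and the names `CANDIDATE_POOL` and `build_views` are assumptions chosen for the example (one smoothing filter, one enhancing filter, one frequency/energy transform); they are not the pool our model actually uses.

```python
import numpy as np

def moving_average(x, width=5):
    """Smoothing filter: centred moving average, same length as the input."""
    kernel = np.ones(width) / width
    return np.convolve(x, kernel, mode="same")

def first_difference(x):
    """Enhancing filter: emphasises changes between consecutive samples."""
    return np.diff(x, prepend=x[0])

def power_spectrum(x):
    """Frequency/energy transform: squared magnitude of the real FFT."""
    return np.abs(np.fft.rfft(x)) ** 2

# Hypothetical candidate pool: name -> transform applied to one series.
CANDIDATE_POOL = {
    "raw": lambda x: x,
    "moving_average": moving_average,
    "first_difference": first_difference,
    "power_spectrum": power_spectrum,
}

def build_views(series):
    """Return every candidate representation of a 1-D series."""
    series = np.asarray(series, dtype=float)
    return {name: transform(series) for name, transform in CANDIDATE_POOL.items()}

# Toy input: an 8-cycle sine wave with additive noise.
t = np.linspace(0.0, 1.0, 128, endpoint=False)
noisy = np.sin(2 * np.pi * 8 * t) + 0.3 * np.random.default_rng(0).normal(size=t.size)
views = build_views(noisy)
for name, view in views.items():
    print(name, view.shape)
```

Each view exposes a different aspect of the same series: the power spectrum, for instance, concentrates the sine wave into a single dominant bin that is hard to see in the noisy raw signal.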
2. Know when to gate
There are two kinds of transforms: universal and niche. We always apply the universally useful ones. Niche transforms are gated: we only admit them when statistical evidence shows they are a good fit. This dual approach reduces the inherent noise of the decision process.
For the raw representation, and a small handful of others, gating would mostly lead to false negatives. These transforms are beneficial on all but a small fraction of datasets, so we want the classifier to always have access to them.
The remaining representations are more or less niche. In most cases, they hurt more than they help. To avoid false positives, we admit them only when a statistical evaluation of the evidence supports them.
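The split between universal and niche transforms can be sketched in a few lines. This is a minimal illustration, not our implementation: the function `select_representations`, the transform names, and the hard-coded gate outcomes are all hypothetical.

```python
from typing import Callable, Dict, List

def select_representations(
    universal: List[str],
    niche: List[str],
    gate: Callable[[str], bool],
) -> List[str]:
    """Always keep universal transforms; keep niche ones only if the gate admits them."""
    admitted = [name for name in niche if gate(name)]
    return list(universal) + admitted

# Hypothetical gate outcomes for one dataset (True = the evidence says it helps).
gate_decisions: Dict[str, bool] = {"hilbert": True, "sine": False, "cosine": False}

selected = select_representations(
    universal=["raw", "first_difference"],   # never gated
    niche=["hilbert", "sine", "cosine"],     # gated
    gate=lambda name: gate_decisions.get(name, False),
)
print(selected)  # ['raw', 'first_difference', 'hilbert']
```

Note that the gate defaults to rejection: a niche transform with no recorded evidence is left out.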
3. Strong metric
We score a filter by how much it increases the classifier’s margin, i.e. the distance from a classification boundary. The metric is based on hinge loss [Cortes & Vapnik, 1995; Crammer & Singer, 2001], which is a measure of how confidently the model can separate different classes.

Viewed in isolation, hinge loss appears poorly suited to this setting. Accuracy becomes unstable as the margin metric is maximized. The metric only becomes reliable when this instability is regularized by additional ensembling.
The interaction between the metric, the underlying problem, and the transformation being applied is complex. The key takeaway is that the margin metric is not reliable on its own. However, we have discovered it works very well when paired with regularisation.
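To make the margin idea concrete, the sketch below computes the multiclass hinge loss in the Crammer & Singer form and scores a filter by how much it lowers that loss relative to the raw representation. The function names and the toy scores are assumptions for illustration; the exact metric used in our model is not reproduced here.

```python
import numpy as np

def multiclass_hinge(scores, labels):
    """Mean Crammer-Singer hinge loss: max(0, 1 + best wrong score - true score)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    rows = np.arange(len(labels))
    true_scores = scores[rows, labels]
    wrong = scores.copy()
    wrong[rows, labels] = -np.inf          # mask out the true class
    best_wrong = wrong.max(axis=1)
    return float(np.mean(np.maximum(0.0, 1.0 + best_wrong - true_scores)))

def filter_score(raw_scores, filtered_scores, labels):
    """Positive when the filtered representation lowers hinge loss (wider margins)."""
    return multiclass_hinge(raw_scores, labels) - multiclass_hinge(filtered_scores, labels)

# Toy decision scores for two samples and two classes.
labels = np.array([0, 1])
raw_scores = np.array([[0.5, 0.0], [0.2, 0.4]])
filtered_scores = np.array([[1.5, 0.0], [0.0, 0.8]])
print(filter_score(raw_scores, filtered_scores, labels))
```

In this toy case both samples are classified correctly under either representation, so accuracy alone cannot tell the two apart; the hinge-based score still registers that the filtered representation separates the classes with a wider margin.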
4. Skeleton classifier
We start with a strong, state-of-the-art classifier (one that achieves the highest accuracy), and we keep only the parts of the classification pipeline required to calculate the metric. This skeleton classifier performs work equivalent to hundreds of full iterations in only about 2.5× the time, making the gate feasible.
Starting with a strong classifier is important: a weak classifier has a relative lack of discernment, and can therefore be helped even by simple filters. In the three graphs below, reproduced from a seminal 2014 work by Gorecki & Luczak, we can see three different transforms all yielding relatively comparable improvements when paired with a weak classifier.

A strong classifier already has a much better perspective internally, and will find the rudimentary manipulations of both sine and cosine to be at best neutral, whereas it will be able to leverage the more complex Hilbert transform much better. A weak classifier therefore cannot be used in the gate as a stand-in for the classifier we actually plan to use for the final classification.
At the same time, strong performance and computational cost are at odds. We take full advantage of recent advances in computational efficiency: our base models already achieve results comparable to much more computationally expensive models on a much humbler computational budget. We further strip them down to the absolute minimum and preserve the only thing that matters: the performance comparison of the filtered and raw representations.
5. Statistical evaluation
We repeatedly evaluate randomized samples from the training set, and after each iteration we perform a statistical test. We stop once we can say with sufficient confidence that the filter helps or hurts. If the decision stays unclear, or the gain is too small, our model defaults to the unfiltered data.
Our team has been leveraging the speed of the skeleton classifier to run orders of magnitude more experiments, achieving a much higher speed of development. Without this, finding a statistically sound evaluation would likely not have been possible. The central insight is to invest a given compute budget as effectively as possible.
To assess both the stability and the overall effect of the filter, we characterize the gate by a single scalar: the confidence that the decision is correct. The procedure is as follows: (i) repeatedly evaluate randomized samples of the training data for a given filter; (ii) once the estimated effect (improvement or degradation) exceeds the confidence threshold, return the decision (see image below); (iii) if the effect remains uncertain after a fixed maximum number of iterations, the filter is rejected.

This allows us to reduce the complex evaluation to a simple question (are we really sure the filtered representation is better than the raw one?) and to answer it in a much more meaningful way.
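The three-step procedure above could be sketched as follows. This is a simplified illustration under stated assumptions: `evaluate_gain` is a hypothetical callable that scores one random subsample and returns the filtered score minus the raw score, and the confidence is estimated with a plain normal approximation, which is not necessarily the test our model uses.

```python
import math
import random
import statistics

def sequential_gate(evaluate_gain, confidence=0.95, min_gain=0.0,
                    min_iters=5, max_iters=200):
    """Return (decision, iterations used), where decision is 'admit' or 'reject'."""
    gains = []
    for i in range(1, max_iters + 1):
        gains.append(evaluate_gain())          # step (i): one randomized sample
        if i < min_iters:
            continue
        mean = statistics.fmean(gains)
        sem = statistics.stdev(gains) / math.sqrt(i)
        if sem == 0.0:
            p_helps = 1.0 if mean > min_gain else 0.0
        else:
            # Normal approximation: confidence that the true gain exceeds min_gain.
            p_helps = statistics.NormalDist().cdf((mean - min_gain) / sem)
        if p_helps >= confidence:              # step (ii): confident it helps
            return "admit", i
        if p_helps <= 1.0 - confidence:        # step (ii): confident it hurts
            return "reject", i
    return "reject", max_iters                 # step (iii): still unclear

# Toy usage with simulated gains (assumed values, not measured results).
rng = random.Random(0)
print(sequential_gate(lambda: rng.gauss(0.02, 0.01)))   # a filter that tends to help
print(sequential_gate(lambda: rng.gauss(-0.02, 0.01)))  # a filter that tends to hurt
```

One caveat of this sketch: testing after every iteration without adjusting the threshold raises the chance of stopping early on a lucky streak, so a production gate would need a sequential test that accounts for the repeated looks.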
Winner's curse

There is a downside to having a large pool of candidate transforms: the winner's curse. As the number of candidates grows, we must increase the number of samples accordingly. If we don't, even with a relatively small number of candidates, it becomes more probable that a winning candidate is chosen based on luck rather than performance. This presents us with a dilemma: a trade-off between computational budget and the risk of the gate making a wrong decision. If we overcorrect, we spend compute needlessly. If we undercorrect, we can lose precision drastically, and the gate's decisions become unreliable. Calculating the required number of additional samples exactly is therefore essential when dealing with the winner's curse.
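A short worked calculation shows how quickly the problem grows. The sketch assumes independent candidates, a Bonferroni correction, and the textbook sample-size formula for a one-sided z-test; these are standard simplifications chosen for illustration and not the exact calculation used in our model.

```python
from statistics import NormalDist

def family_wise_error(alpha, k):
    """Chance that at least one of k independent useless candidates passes at level alpha."""
    return 1.0 - (1.0 - alpha) ** k

def sample_inflation(alpha, k, power=0.8):
    """Factor by which samples per candidate grow under a Bonferroni-corrected
    one-sided z-test, for the same effect size and statistical power."""
    z = NormalDist().inv_cdf
    base = z(1.0 - alpha) + z(power)
    corrected = z(1.0 - alpha / k) + z(power)
    return (corrected / base) ** 2

for k in (1, 5, 20, 100):
    print(k, round(family_wise_error(0.05, k), 3), round(sample_inflation(0.05, k), 2))
```

With 20 useless candidates each tested at the 5% level, the chance that at least one is admitted by luck is already about 64%. Under these assumptions the extra samples needed to compensate grow only slowly with the number of candidates, which is why an exact calculation pays off compared with a blanket increase.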
6. Additional ensembling
Recent state-of-the-art time-series classification methods use feature concatenation to ensemble the individual base classifiers, as illustrated below. This feature-level ensembling is one of several innovations that have enabled modern methods to achieve state-of-the-art performance at a fraction of the computational cost of previous approaches. The approach is not perfect, however, and leaves a lot of performance on the table: concatenation destroys some of the useful information available to the base classifiers.

In addition to this feature-level ensembling, our model also uses traditional score-level ensembling. Each base classifier contributes a separate soft vote, and the final class scores are computed as a weighted linear combination of the models’ class-wise decision scores. These additional votes regularise the final prediction and further improve accuracy, allowing us to use a stronger evaluation metric. In our experiments, this approximately doubles the accuracy gain obtained from ensembling.
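The weighted combination of class-wise decision scores can be sketched as below. The function name, the weights, and the toy scores are assumptions for illustration; how the weights are chosen in our model is not shown here.

```python
import numpy as np

def score_level_ensemble(decision_scores, weights):
    """Weighted linear combination of per-model class scores.

    decision_scores: shape (n_models, n_samples, n_classes)
    weights:         shape (n_models,)
    Returns the combined scores (n_samples, n_classes) and the predicted classes.
    """
    scores = np.asarray(decision_scores, dtype=float)
    w = np.asarray(weights, dtype=float)
    combined = np.tensordot(w, scores, axes=1)   # sum over the model axis
    return combined, combined.argmax(axis=1)

# Three base classifiers, two samples, two classes (toy values).
decision_scores = [
    [[0.6, -0.6], [-0.4, 0.4]],
    [[0.5, -0.5], [0.9, -0.9]],   # this model votes for class 0 on the second sample
    [[0.7, -0.7], [-0.8, 0.8]],
]
combined, predictions = score_level_ensemble(decision_scores, weights=[0.5, 0.25, 0.25])
print(predictions)  # [0 1]
```

On the second sample, one base classifier votes confidently for class 0 while the other two vote for class 1; the combined score follows the two models that agree.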

Regularization
The additional score-level ensemble also acts as a regulariser for the margin-maximised classifiers. Margin maximisation increases the separation between the selected class and the alternatives, but it is not inherently aware of whether the selected class is correct. The key point is that these incorrect margin pulls tend to be brittle, and as a result, differ across base classifiers. When the base classifiers’ scores are combined, many of these unstable incorrect pulls cancel out, while the more robust correct pulls are reinforced.
Conclusion
"A chain is only as strong as its weakest link." (Anne Robinson, probably)
Our model represents a significant step forward in time series classification. On the challenging Bakeoff Redux benchmark, it improves accuracy by 0.80 percentage points over the model it builds on. That gain is larger than any single improvement previously achieved across this model’s lineage, making it clear that this is not an incremental refinement.

The results depend on all six components working in concert. First, the method requires a wide variety of representations, so that useful structure in the data is likely to be exposed somewhere in the candidate pool. Second, gating must be applied only when it is appropriate: unnecessary gating leaves performance on the table, while not gating when one should leads to performance degradation. Third, the gating decision must be based on the strongest available metric, because the quality of that decision determines which representations are retained and which are discarded. Fourth, the classifier must be made substantially faster without degrading performance with respect to the gating metric; otherwise, the speedup is obtained by changing the problem rather than solving it more efficiently. Fifth, the gating decision must be obtained with sufficient statistical rigour, in order to leverage available information and contrast it against the irreducible noise floor. Finally, model outputs must be regularised by additional ensembling; otherwise the gains from margin maximisation will be outweighed by the instability it introduces.
Employing our model creates practical value for data analysis in environments where accuracy and computational budgets matter. If your business depends on sensor data, telemetry, or other time-dependent data streams, request a free evaluation today. We’ll show you where hidden signal loss may already be costing you money.