Fitting models to (experimental) data is one key aspect of the analysis of spectroscopic data. For good reason, optimisation is an entire sub-discipline of applied mathematics, and frequently, spectroscopists are not aware of the intricate details of the diverse available fitting strategies. Furthermore, while implementing actual fitting and optimisation algorithms is usually beyond the scope of the average spectroscopist, using the available packages quickly becomes complicated.
Partly, this complication is due to the inherent complexity of properly fitting models to data. But to a large extent, the complication is unnecessary and originates from differing interfaces and the lack of a unifying abstraction. Furthermore, as always in programming: making something work once (and with one set of data) is rather easy. Doing it in a reproducible way that another person can understand even after a few years is something completely different, and rarely achieved.
The idea behind the FitPy framework: making advanced fitting of spectroscopic data possible for the average scientist while retaining full reproducibility and reliability. To achieve this goal, two things are absolutely essential: relieving users of the duty to document what they have done to their data, and providing a powerful user interface that is simple to use and requires no programming skills.
The problem with fitting
Fitting a model to (experimental) data is a routine task in the analysis of spectroscopic data, as well as in other fields of science. One major reason for its widespread adoption: parameters necessary for interpreting the results can often only be extracted from models fitted to the data, but not directly from the data themselves.
Stating the problem at hand on an abstract level is straightforward: Take a model that reasonably describes the physical reality and tweak its parameters until the distance between the model and the data is minimal (ideally zero). Once done with minimising the distance between model and data, take the parameters of the model and try to make sense of them.
Putting aside the question of how to find an appropriate model for the data, the problem starts with defining the distance between model and data. Every optimisation algorithm needs some measure for this distance, and probably the most widespread criterion is the sum of the squares of the difference between model and data for each individual data point.
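As a minimal sketch of this criterion, the following snippet computes the sum of squared differences between a model and a set of data points. The exponential decay model, the synthetic data, and the names `exponential_decay` and `sum_of_squares` are assumptions chosen for illustration only; they are not part of FitPy or any of the libraries it builds upon.

```python
import math


def exponential_decay(x, amplitude, rate):
    """Illustrative model: amplitude * exp(-rate * x)."""
    return amplitude * math.exp(-rate * x)


def sum_of_squares(model, parameters, x_values, y_values):
    """Sum of squared differences between data and model over all points."""
    return sum(
        (y - model(x, *parameters)) ** 2
        for x, y in zip(x_values, y_values)
    )


# Synthetic, noise-free "data" generated from known parameters
x_values = [0.0, 0.5, 1.0, 1.5, 2.0]
y_values = [exponential_decay(x, 2.0, 1.3) for x in x_values]

# Distance for the true parameters and for a deliberately wrong rate
exact = sum_of_squares(exponential_decay, (2.0, 1.3), x_values, y_values)
off = sum_of_squares(exponential_decay, (2.0, 1.0), x_values, y_values)
print(exact, off)
```

With noise-free data, the distance vanishes only for the parameters used to generate the data. With real, noisy data the minimum is larger than zero, and it is the optimisation algorithm's job to find it.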
The next question concerns the algorithm used for minimising the distance between model and data. Some algorithms use gradients to guide the direction of their search, others work without gradients. For some problems there even exist analytical solutions, although finding these solutions may be prohibitively expensive in terms of computing time. Stochastic algorithms are another approach to optimisation, and as a last resort, one can use a brute-force approach and sample a regular grid in the parameter hyperspace. Genetic or evolutionary algorithms have not even been mentioned until now. While each of these algorithms has its advantages and disadvantages, depending on the problem at hand, making an informed decision requires a certain familiarity with the field and the underlying theory. For an excellent overview of the different algorithms for optimisation and their mathematical details, the interested reader is referred to [Kochenderfer and Wheeler, 2019].
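To make the contrast between two of these families tangible, here is a deliberately simple sketch that minimises the same one-parameter toy objective once by brute-force sampling of a regular grid and once by following a numerically estimated gradient. The objective, the function names, and all settings (grid size, step size, number of iterations) are assumptions for illustration; real problems call for the tested implementations in established libraries.

```python
def objective(rate):
    """Toy distance measure with a single minimum at rate = 1.3."""
    return (rate - 1.3) ** 2


def grid_search(func, lower, upper, steps):
    """Brute force: evaluate func on a regular grid, keep the best point."""
    grid = [lower + i * (upper - lower) / (steps - 1) for i in range(steps)]
    return min(grid, key=func)


def gradient_descent(func, start, step_size=0.1, iterations=200, delta=1e-6):
    """Gradient-based: repeatedly step against the estimated slope."""
    value = start
    for _ in range(iterations):
        # Central finite difference as a stand-in for an analytical gradient
        slope = (func(value + delta) - func(value - delta)) / (2 * delta)
        value -= step_size * slope
    return value


best_grid = grid_search(objective, 0.0, 3.0, 31)
best_gradient = gradient_descent(objective, start=0.0)
print(best_grid, best_gradient)
```

Even this toy case hints at the trade-offs: the grid search can never be more accurate than its grid spacing and scales badly with the number of parameters, whereas the gradient-based search depends on a sensible start value and step size and may get trapped in a local minimum on less benign objectives.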
Having decided upon an adequate distance measure and an algorithm for optimisation, and having libraries at hand that implement the chosen algorithm, the next problem surfaces: Each library comes with its own interface, and regularly, the interfaces for different routines of the same library differ as well, making swapping one algorithm for another a tedious and error-prone task. Given sufficient programming skills and time, all this can of course be dealt with, resulting in fits of sufficient quality.
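The interface problem can be illustrated with two hypothetical routines that solve the same task but expect different inputs and return different kinds of results, together with a thin adapter that hides those differences. Everything here (`descent_routine`, `grid_routine`, `minimise`, and the result format) is invented for this sketch and does not reflect the API of FitPy, SciPy, or lmfit.

```python
def descent_routine(func, start, step_size=0.1, iterations=200):
    """Hypothetical routine A: needs a start value, returns a bare float."""
    value, delta = start, 1e-6
    for _ in range(iterations):
        slope = (func(value + delta) - func(value - delta)) / (2 * delta)
        value -= step_size * slope
    return value


def grid_routine(func, bounds, n_points=31):
    """Hypothetical routine B: needs bounds, returns a dict."""
    lower, upper = bounds
    grid = [lower + i * (upper - lower) / (n_points - 1) for i in range(n_points)]
    best = min(grid, key=func)
    return {"x": best, "fun": func(best)}


def minimise(func, start, bounds, method="descent"):
    """Thin adapter: one call signature and one result format for both."""
    if method == "descent":
        best = descent_routine(func, start)
    elif method == "grid":
        best = grid_routine(func, bounds)["x"]
    else:
        raise ValueError(f"Unknown method: {method}")
    return {"method": method, "parameter": best, "distance": func(best)}


def objective(rate):
    """Toy distance measure with a single minimum at rate = 1.3."""
    return (rate - 1.3) ** 2


# Swapping the algorithm now means changing a single string
results = [
    minimise(objective, start=0.0, bounds=(0.0, 3.0), method=name)
    for name in ("descent", "grid")
]
print(results)
```

Without such an adapter, switching from one routine to the other means rewriting both the call and the code that unpacks the result, which is exactly where errors tend to creep in.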
However, the story is not over yet. We have decided upon a model, a distance measure, and a fitting algorithm, and we have implemented the actual fitting, giving us high-quality results that we can use to interpret the parameters and come to important new conclusions regarding the actual scientific question at hand. How do we make sure we or others can reproduce the results and assess the validity and quality of our approach? How do we keep track of all the decisions taken in between, particularly all the small decisions such as the parameters of the algorithm and the like? What is even more problematic: Many of these decisions are taken for us and often buried somewhere in the code implementing the algorithms. Even if we can sometimes change the defaults by explicitly setting parameters, we might not have access to the default values or even be aware of them. This, however, is a real problem for reproducibility, as default values buried in code can (and will) change over time. If such a default value is the tolerance used as termination criterion, a change might be rather unimportant. If the default concerns which fitting algorithm is used, not being aware of it and of potential changes can become much more of a problem.
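One way to bring such buried defaults to light, sketched here with Python's standard `inspect` module, is to record every parameter a routine was actually called with, including those the caller never set. The `fit` function and its default values are hypothetical stand-ins, and `call_with_record` is an assumption made for this sketch rather than the mechanism used by FitPy or ASpecD.

```python
import inspect
import json


def fit(model, data, method="leastsq", tolerance=1e-8, max_iterations=500):
    """Hypothetical fitting routine with defaults hidden in its signature."""
    return {"model": model, "n_points": len(data)}


def call_with_record(func, *args, **kwargs):
    """Call func and return its result plus every parameter actually used."""
    bound = inspect.signature(func).bind(*args, **kwargs)
    explicit = set(bound.arguments)  # only what the caller passed
    bound.apply_defaults()  # fill in the values the caller did not set
    record = {
        "function": func.__name__,
        "explicit": {k: v for k, v in bound.arguments.items() if k in explicit},
        "implicit": {k: v for k, v in bound.arguments.items() if k not in explicit},
    }
    return func(*bound.args, **bound.kwargs), record


result, record = call_with_record(fit, "gaussian", [1.0, 2.0, 3.0], tolerance=1e-10)
print(json.dumps(record, indent=2))
```

Storing such a record alongside the results means that a later change of a default value in the library no longer silently changes what "the same analysis" means.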
To summarise, the problems with fitting are (i) the inherent complexity of the task given the huge number of decisions one has to make, and (ii) the lack of transparency that comes with most implementations and that leads directly to a lack of reproducibility. One may argue that a third problem is the unnecessary complication of applying fitting strategies to data due to the lack of a unified representation of different optimisation strategies in the respective libraries.
The inherent complexity of any optimisation problem cannot be reduced and can only be addressed by getting familiar with the underlying theory and gaining practical experience. However, the two other problem domains mentioned, the lack of transparency and reproducibility as well as the unnecessary complication, can be addressed by applying appropriate strategies of software engineering. This is the goal and promise of the FitPy framework: providing a unified representation of optimisation strategies and therefore a user-friendly interface, and focussing on reproducibility, helping others to judge both the suitability and the quality of the approach taken and of its results.
What makes FitPy different?
First of all, FitPy does not implement any fitting algorithm. There are clearly excellent libraries available implementing optimisation and fitting algorithms. FitPy itself relies on the SciPy software stack and on lmfit for its fitting capabilities. So why use (and develop) FitPy at all? Several problems with optimisation have been detailed above, and two are explicitly addressed by FitPy: (i) the lack of transparency that comes with most implementations of fitting algorithms and that leads directly to a lack of reproducibility, and (ii) the unnecessary complication of applying fitting strategies to data due to the lack of a unified representation of different optimisation strategies in the respective libraries.
Building upon the ASpecD framework for reproducible analysis of spectroscopic data, FitPy comes with a powerful abstract user interface: recipe-driven data analysis. No need to actually program any more, as all tasks to be performed on the data are described in a structured text file. At the same time, the underlying machinery takes care of logging every single step, including all parameters, explicit and implicit. Therefore, it ensures full reproducibility and allows others to independently assess the analysis, without needing to dig into code. Fitting is “just another” analysis step in the data processing and analysis pipeline and integrates seamlessly with all packages derived from the ASpecD framework.
Many abstractions provided by FitPy (and the underlying ASpecD framework), such as models and treating model parameters as structured data types rather than floats in a vector, exist in other libraries, perhaps most prominently in the excellent lmfit package. However, all these packages fall short of providing the transparency necessary for true reproducibility. As stated in the “Zen of Python”, explicit is better than implicit. This is even more true for science than for software development and programming.
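The difference between floats in a vector and structured parameters can be sketched with nothing but the standard library. The `Parameter` class below, its attribute names, and the example values are hypothetical and serve only as an illustration; they are neither the lmfit nor the FitPy data types.

```python
from dataclasses import dataclass


@dataclass
class Parameter:
    """Hypothetical structured parameter: a value plus its metadata."""
    name: str
    value: float
    lower: float = float("-inf")
    upper: float = float("inf")
    vary: bool = True


# Bare vector: the meaning of each entry lives only in the reader's head
vector = [5.0, 1.5, 0.5]

# Structured: names, bounds, and the "fixed or fitted" decision are explicit
parameters = {
    p.name: p
    for p in (
        Parameter("amplitude", 5.0, lower=3.0, upper=7.0),
        Parameter("position", 1.5, vary=False),
        Parameter("width", 0.5, lower=0.0),
    )
}

varied = [name for name, p in parameters.items() if p.vary]
print(varied)  # → ['amplitude', 'width']
```

The structured form carries exactly the information a reader needs to assess a fit: which parameters were varied, within which limits, and from which start values.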
Ultimately, FitPy helps to answer two crucial questions of scientific data analysis: What has been done? And how has it been done? Of course, even explicitly stating all parameters requires you to trust the implementation of the algorithms used. Given that all libraries are available as open source, this can either be informed blind trust or the result of explicitly checking the implementation yourself. Here, informed blind trust refers to the underlying libraries, mainly lmfit and SciPy, being developed by experts in the field.
A powerful abstraction and a unified representation of optimisation allow for a user-friendly interface, and built-in reproducibility ensures good scientific practice. But that is not all. Well-crafted representations of the results of fitting models to data are highly important for answering the actual scientific questions at hand. FitPy provides two different building blocks here: graphical representations and formatted reports. From our own experience, a well-designed pipeline for setting up the fitting procedure, together with presenting the results in the form of clearly arranged and unified reports, makes all the difference. Having these two tools at hand, a skilled undergraduate student can independently analyse a large number of datasets after a few hours of introduction, not to mention the gain in reproducibility and the ability to independently assess the results.
FitPy is built around a small set of concepts developed within the ASpecD framework, such as models, the actual fitting as an analysis step, and reports, together with recipe-driven data analysis as the main user interface. For further details, carry on reading.