Comparison with Similar Packages
Some other packages have similar functionality as checkpointing
.
However, we handle a few cases better than them, as explained below.
How the cases are written
The cases are generally written in the following format of two tabs.
This denotes:
- First, you have the script in the "1st run" tab. Running it gives you the corresponding output.
- Next, modify the script and change it to what's shown in the "2nd run" tab. Running it gives you another result, and it shows how the function gets skipped or re-executed.
If you want to try it out, please remember to clear the cache directory
(./.checkpointing
by default) between two cases, as one case might affect another.
cachier
The following cases are tested with cachier version 2.2.11.
Code change
cachier does not watch the function code at all, therefore, if the code logic has changed, it will not rerun, leading to erroneous results.
Even though the code is changed to x + 1
,
cachier is using the previously cached value, which is wrong.
Replace cachier with checkpointing, the second execution of the script will rerun foo
,
and gives the correct result, -1
.
Limited hashable types
cachier can only work on minimal data types as arguments.
import numpy as np
from cachier import cachier
@cachier()
def foo(x):
return x
if __name__ == "__main__":
df = pd.DataFrame()
foo(df)
Run this script and it will raise a TypeError
, because it cannot hash the numpy.ndarray
object.
You will have to write your own hasher, even on this commonly used data type.
In contrast, checkpoint has built-in support for commonly used data manipulation packages, such as numpy and pandas.
joblib
The following cases are tested with joblib version 1.3.21.
Irrelevant code change
joblib watches the function code source directly, which means that even adding comments or reformatting the code will cause a function to rerun.
The only difference in the code is a line of comment, which shouldn't affect the function's return value. However, joblib failed to use the cache, causing redundant re-compute.
Replace joblib with checkpointing, in the 2nd run, foo
won't be executed.
We watch the function code with an AST
based approach, which eliminates the effect of type annotations, comments, and formatting.
Global variable
joblib does not consider the reference global variable.
Even though the value of the reference global variable is changed to 2
,
the execution of foo
is skipped and the previous cache is returned, which is wrong.
Replace joblib with checkpointing, in the second execution of the script,
foo
will be invoked and return the correct result 2
.