pydatasentry
The pydatasentry package makes modeling code and data auditable by logging all relevant information for every single model run (e.g., a regression).
Sentry (Command-line Tool)
sentry.py allows instrumenting a Python/pandas program with no modifications to the program itself. Note that only Python 3 is supported.
sentry.py help
sentry.py init <sentry-conf.py>
sentry.py example <filename.py>
sentry.py [run|commit] [-c <sentry-conf.py>] <python-program-to-be-instrumented>
run and commit are almost the same; the latter marks a final run. Only committed runs are stored/uploaded.
bin.sentry.example(path)
Initialize a sentry configuration file.
Parameters: path – sentry configuration file
Signature - Capture attributes
Maintains the overall configuration. Any overrides provided by the user are incorporated into the configuration. The configuration is combined with run-specific data and passed to the post-processing function.
The default configuration is:
{
'debug': False,
'spec': {
# High level experiment information
'experiment': {
'scope': 'offers',
'run': 'conditional-offers',
'version': 1
},
# Which modules should be instrumented
'instrumentation': {
'modules': ['statsmodels.formula.api']
},
# What should be captured
'output': {
'params': [
{
'content': 'attributes.output.default-signature',
'path': 'attributes.output.relative-path',
'filename': 'signature.json'
}
]
},
# Where should they be stored and how
'store': {
'params': ['attributes.storage.local']
},
}
}
pydatasentry.config.initialize_config(update={})
Initialize the configuration of pydatasentry and override it with any user- or run-specific parameters.
Parameters: update – dict that overrides the basic configuration
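To illustrate how such an override merge could behave, here is a minimal, self-contained sketch. The helper merge_overrides is our own hypothetical illustration, not pydatasentry's actual implementation:

```python
from copy import deepcopy

def merge_overrides(base, update):
    """Recursively merge an override dict into a base configuration.

    Illustrative helper only -- pydatasentry's initialize_config performs
    a similar merge internally, but this is not its actual code.
    """
    merged = deepcopy(base)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_overrides(merged[key], value)
        else:
            merged[key] = value
    return merged

default = {
    'debug': False,
    'spec': {
        'experiment': {'scope': 'offers', 'run': 'conditional-offers', 'version': 1},
    },
}

# Override only the experiment version; all other settings are preserved.
config = merge_overrides(default, {'spec': {'experiment': {'version': 2}}})
print(config['spec']['experiment'])
```

A plain dict.update would replace the whole 'spec' subtree; a recursive merge lets a user override a single nested key such as the version.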
pydatasentry.config.validate_config()
Checks whether the specified configuration has all the essential fields, such as the experiment details. More checks will be added over time.
Raises: an "Invalid configuration" exception if there is an issue
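A minimal sketch of what such a validation check might look like, assuming the experiment fields from the default configuration above are the essential ones. This is our own illustration, not pydatasentry's actual implementation:

```python
def validate_config(config):
    """Sketch of configuration validation -- the real
    pydatasentry.config.validate_config may check more fields."""
    experiment = config.get('spec', {}).get('experiment', {})
    missing = [f for f in ('scope', 'run', 'version') if f not in experiment]
    if missing:
        raise ValueError("Invalid configuration: missing %s" % ", ".join(missing))

# A configuration without experiment details fails validation.
try:
    validate_config({'spec': {}})
except ValueError as exc:
    print(exc)
```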
These are the attributes that can be computed for each run. The configuration specifies which subset of these should be stored, how, and where.
The attributes provide a fairly expressive language for collecting information recursively: each attribute can combine the values of multiple other attributes to generate new values.
The default module that is instrumented is statsmodels.formula.api, and the storage is local.
Lineage - Track transformations
Tracks the lineage of any dataset from the point it is loaded and passes it along in the signature.
pydatasentry.lineage.datasets = {}
Maintains an experiment-specific list of datasets.

pydatasentry.lineage.lineage = {}
Maintains an experiment-specific lineage.
Instrumentation - Intercept functions
This is the instrumentation library. It instruments all the functions in the modules specified in the spec (see config).
pydatasentry.instrumentation.cleanup_instrumentation()
Removes the interception for all functions.

pydatasentry.instrumentation.instrument()
Instruments each of the modules specified in config['spec']['instrumentation']['modules'].

pydatasentry.instrumentation.intercept(func, metadata)
Helper wrapper function that captures the input and output of every instrumented function.
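A sketch of an interception wrapper in the spirit of intercept(func, metadata). The captured list and the stand-in ols function are our own illustrations; this is not pydatasentry's actual code:

```python
import functools

captured = []   # records that would be handed to post-processing

def intercept(func, metadata):
    """Sketch of an interception wrapper: capture the inputs and
    output of every call to the wrapped function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        record = {'function': func.__name__, 'args': args,
                  'kwargs': kwargs, 'metadata': metadata}
        result = func(*args, **kwargs)
        record['result'] = result
        captured.append(record)
        return result
    return wrapper

# A stand-in for a model-fitting function such as one from
# statsmodels.formula.api.
def ols(formula, data=None):
    return 'fitted: ' + formula

ols = intercept(ols, {'module': 'statsmodels.formula.api'})
ols('y ~ x')
print(captured[0]['function'], captured[0]['result'])
```

Rebinding each instrumented module attribute to such a wrapper is how inputs and outputs can be captured without modifying the user's program; cleanup would restore the original functions.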
Post-Capture Processing - Organize output
pydatasentry.process.evaluate_attribute(name, run, form=<class 'str'>, depth=0)
Evaluate the signature and other attributes specified by the configuration.
Parameters:
name – name of the attribute
run – combination of configuration and run-specific information (internally generated)
depth – internal parameter to track recursion
pydatasentry.process.lookup_attribute(name, run)
Looks up the run configuration for the value of a given attribute. The function tries a couple of options before giving up; the default is to return the name unmodified.
Parameters:
name – name of the attribute
run – combination of configuration and run-specific information (internally generated)
Returns: dict corresponding to the attribute
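A sketch of how recursive attribute evaluation could work, assuming attributes may reference other attributes by name (as in the default configuration's 'attributes.storage.local'). The function names mirror the documented ones, but the logic is our own illustration, not pydatasentry's actual implementation:

```python
def lookup_attribute(name, run):
    """Sketch: fall back to the name itself when no value is found."""
    return run.get(name, name)

def evaluate_attribute(name, run, depth=0):
    """Sketch of recursive evaluation: an attribute's value may itself
    name other attributes, which are resolved in turn."""
    if depth > 10:
        raise RecursionError("attribute nesting too deep")
    value = lookup_attribute(name, run)
    if isinstance(value, dict):
        return {k: evaluate_attribute(v, run, depth + 1)
                for k, v in value.items()}
    return value

# Hypothetical run data: one attribute refers to another by name.
run = {
    'attributes.storage.local': {'root': 'path-attr'},
    'path-attr': '/tmp/sentry',
}
print(evaluate_attribute('attributes.storage.local', run))
```

This shows why the depth parameter exists: it bounds the recursion when attributes are composed from other attributes' values.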