pydatasentry

The pydatasentry package makes modeling code and data auditable by logging all relevant information for every single model run (e.g., a regression).

Sentry (Commandline Tool)

sentry.py instruments a Python/pandas program without any modification to the program itself. Note that only Python 3 is supported.

sentry.py help 
sentry.py init <sentry-conf.py>
sentry.py example <filename.py>
sentry.py [run|commit] [-c <sentry-conf.py>] <python-program-to-be-instrumented>

run and commit are almost the same; commit marks the run as final.
Only committed runs are stored/uploaded.
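
The usage lines above describe a small sub-command interface. As a rough sketch of how such a command line could be parsed with the standard library (this is not sentry.py's actual parser; build_parser and the attribute names command, conf, and program are assumptions made for illustration):

```python
import argparse


def build_parser():
    """Hypothetical parser that mirrors the usage text; sentry.py may differ."""
    parser = argparse.ArgumentParser(prog="sentry.py")
    sub = parser.add_subparsers(dest="command", required=True)

    # sentry.py help
    sub.add_parser("help")

    # sentry.py init <sentry-conf.py>
    init = sub.add_parser("init")
    init.add_argument("conf")

    # sentry.py example <filename.py>
    example = sub.add_parser("example")
    example.add_argument("filename")

    # sentry.py [run|commit] [-c <sentry-conf.py>] <program>
    for name in ("run", "commit"):
        cmd = sub.add_parser(name)
        cmd.add_argument("-c", dest="conf", default=None)
        cmd.add_argument("program")
    return parser


args = build_parser().parse_args(["commit", "-c", "sentry-conf.py", "model.py"])
print(args.command, args.conf, args.program)
```

Treating run and commit as two sub-commands with identical arguments keeps the "almost the same" relationship explicit: only the value of command differs.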
bin.sentry.example(path)

Initialize a sentry configuration file

Parameters:conf – sentry configuration file
bin.sentry.initialize(conf)

Initialize a sentry configuration file

Parameters:conf – sentry configuration file
bin.sentry.load_configuration(conf)
bin.sentry.load_program()

Load the user’s command line

bin.sentry.main()
bin.sentry.sentry_help()

Signature - Capture attributes

Maintains the overall configuration. Any over-rides provided by the user are incorporated into the configuration. The configuration is then combined with run-specific data and passed to the post-processing function.

The default configuration is

{
   'debug': False, 
   'spec': {         

       # High level experiment information
       'experiment': { 
           'scope': 'offers',
           'run': 'conditional-offers',
           'version': 1
       },

       # Which modules should be instrumented 
       'instrumentation': {
           'modules': ['statsmodels.formula.api']
       },

       # What should be captured 
       'output': {
           'params': [ 
               {
                   'content': 'attributes.output.default-signature',
                   'path': 'attributes.output.relative-path',
                   'filename': 'signature.json'
               }
           ]
       },

       # Where should they be stored and how 
       'store': {
           'params': ['attributes.storage.local']
       }
   }
}
pydatasentry.config.get_config()

Read the configuration

Returns:current configuration
pydatasentry.config.initialize_config(update={})

Initialize the configuration of pydatasentry and over-ride it with any user- or run-specific parameters

Parameters:update – Dict that over-rides the basic configuration
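
The source does not say whether the update dict is merged shallowly or recursively. The sketch below assumes a recursive merge, so that a user can over-ride one nested field without restating the whole spec. deep_update is a hypothetical helper, not part of pydatasentry.

```python
import copy


def deep_update(base, update):
    """Return a copy of `base` with `update` merged in recursively (assumed behavior)."""
    merged = copy.deepcopy(base)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge field by field.
            merged[key] = deep_update(merged[key], value)
        else:
            # Otherwise the over-ride replaces the default outright.
            merged[key] = copy.deepcopy(value)
    return merged


# Abbreviated copy of the default configuration shown above.
defaults = {
    "debug": False,
    "spec": {
        "experiment": {"scope": "offers", "run": "conditional-offers", "version": 1},
        "instrumentation": {"modules": ["statsmodels.formula.api"]},
    },
}

override = {"debug": True, "spec": {"experiment": {"version": 2}}}
config = deep_update(defaults, override)
print(config["spec"]["experiment"])
```

Working on a deep copy leaves the defaults untouched, which also avoids the usual pitfalls of a mutable default argument such as update={}.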
pydatasentry.config.validate_config()

Checks whether the specified configuration has all the essential fields such as the experiment details. More checks will be added over time.

Raises:“Invalid configuration” exception if there is an issue
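
A check of this kind could look roughly as follows. This is a sketch only: the exception class, the function name, and the exact list of required fields are assumptions based on the default configuration shown earlier, not the library's actual rules.

```python
class InvalidConfiguration(Exception):
    """Hypothetical exception type; the library's own class is not documented here."""


# Assumed to be the essential experiment details, taken from the default configuration.
REQUIRED_EXPERIMENT_FIELDS = ("scope", "run", "version")


def validate_config_sketch(config):
    """Raise InvalidConfiguration if the experiment details are incomplete."""
    experiment = config.get("spec", {}).get("experiment")
    if not isinstance(experiment, dict):
        raise InvalidConfiguration("Invalid configuration: spec.experiment is missing")
    missing = [name for name in REQUIRED_EXPERIMENT_FIELDS if name not in experiment]
    if missing:
        raise InvalidConfiguration(
            "Invalid configuration: missing experiment fields: " + ", ".join(missing)
        )


good = {"spec": {"experiment": {"scope": "offers", "run": "conditional-offers", "version": 1}}}
validate_config_sketch(good)  # passes silently

try:
    validate_config_sketch({"spec": {"experiment": {"scope": "offers"}}})
except InvalidConfiguration as error:
    print(error)
```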

This is the set of attributes that can be computed for each run. The configuration specifies which subset of these should be stored, how, and where.

The attributes provide a fairly expressive language to collect information recursively. Each attribute can combine the values of multiple other attributes to generate new values.
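
To illustrate what "collect information recursively" can mean, here is a minimal sketch in which an attribute is either a plain value, a list of other attribute names, or a dict whose values are other attribute names. The attribute names and the evaluate function are invented for the example and do not reflect pydatasentry's real attribute table.

```python
def evaluate(name, attributes, depth=0, max_depth=10):
    """Resolve `name`, expanding any attributes it refers to (illustrative only)."""
    if depth > max_depth:
        raise RecursionError("attribute nesting is too deep: " + name)
    spec = attributes[name]
    if isinstance(spec, list):
        # A list combines several attributes into a sequence of values.
        return [evaluate(child, attributes, depth + 1, max_depth) for child in spec]
    if isinstance(spec, dict):
        # A dict combines several attributes into named fields.
        return {
            key: evaluate(child, attributes, depth + 1, max_depth)
            for key, child in spec.items()
        }
    # Anything else is a leaf value.
    return spec


# Hypothetical attribute table.
attributes = {
    "experiment.scope": "offers",
    "experiment.version": 1,
    "experiment.summary": {"scope": "experiment.scope", "version": "experiment.version"},
    "signature": ["experiment.summary", "experiment.scope"],
}

signature = evaluate("signature", attributes)
print(signature)
```

The depth counter plays the same role as the depth parameter of evaluate_attribute documented below: it guards against attributes that refer to each other in a cycle.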

The default module that is instrumented is statsmodels.formula.api and the storage is local.

Lineage - Track transformations

Track lineage of any dataset from the point it is loaded and pass it in the signature.

pydatasentry.lineage.datasets = {}

Maintain an experiment-specific list of datasets

pydatasentry.lineage.get_lineage()

Returns:lineage (the history until now)
pydatasentry.lineage.lineage = {}

Maintain an experiment-specific lineage

class pydatasentry.lineage.track(action, notes='', **kwargs)

Bases: contextlib.ContextDecorator

Track the lineage of each dataset: load, transform, and store. This is provided as a context manager that can be used in a “with” statement or as a decorator. We expect this to become automatic in the future.
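
Because the class derives from contextlib.ContextDecorator, the same object works in a “with” statement and as a function decorator. The sketch below shows that pattern with a stand-in class. track_sketch, lineage_log, and the recorded fields are assumptions for illustration, not the internals of pydatasentry.lineage.

```python
import contextlib
import time

lineage_log = []  # hypothetical stand-in for the module-level lineage store


class track_sketch(contextlib.ContextDecorator):
    """Record one lineage entry per tracked block or function call."""

    def __init__(self, action, notes="", **kwargs):
        self.action = action
        self.notes = notes
        self.details = kwargs

    def __enter__(self):
        self.started = time.time()
        return self

    def __exit__(self, exc_type, exc, tb):
        lineage_log.append({
            "action": self.action,
            "notes": self.notes,
            "details": self.details,
            "succeeded": exc_type is None,
            "seconds": time.time() - self.started,
        })
        return False  # never swallow exceptions raised by the tracked code


# Used in a "with" statement.
with track_sketch("load", notes="read raw offers", source="offers.csv"):
    rows = [{"offer": 1}, {"offer": 2}]


# Used as a decorator.
@track_sketch("transform", notes="keep the first offer")
def first_only(data):
    return data[:1]


rows = first_only(rows)
print([entry["action"] for entry in lineage_log])
```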

Instrumentation - Intercept functions

This is the instrumentation library. It instruments all the functions in the modules specified in the spec (see config).

pydatasentry.instrumentation.cleanup_instrumentation()

Remove the interception for all functions.

pydatasentry.instrumentation.instrument()

Instrument each of the modules specified in config['spec']['instrumentation']['modules'].

pydatasentry.instrumentation.intercept(func, metadata)

Helper wrapper function that captures the input and output of every instrumented function.

pydatasentry.capture.capture_input(args, kwargs, metadata)

Capture the function parameters for the functions that have been instrumented.

pydatasentry.capture.capture_output(run, result)

Capture the results of the instrumented function.
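
One way to turn raw args and kwargs into named parameters is to bind them against the function's signature. The sketch below uses the standard inspect module for this. It assumes the wrapped function is available to the capture step (for instance through the metadata), which the source does not state. capture_input_sketch and fit_model are hypothetical names.

```python
import inspect


def capture_input_sketch(func, args, kwargs):
    """Map positional and keyword arguments to parameter names, including defaults."""
    bound = inspect.signature(func).bind(*args, **kwargs)
    bound.apply_defaults()
    return dict(bound.arguments)


def fit_model(formula, data=None, subset=None):
    """Stand-in for an instrumented modeling function."""
    return None


recorded = capture_input_sketch(fit_model, ("y ~ x",), {"data": [1, 2, 3]})
print(recorded)
```

Recording by parameter name makes the captured input independent of whether the caller passed a value positionally or by keyword.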

Post-Capture Processing - Organize output

pydatasentry.process.evaluate_attribute(name, run, form=str, depth=0)

Evaluate the signature and other attributes specified by the configuration.

Parameters:
  • name – Name of the attribute
  • run – Combination of configuration and run-specific information (internally generated)
  • depth – <internal parameter to track recursion>
pydatasentry.process.lookup_attribute(name, run)

Looks up the run configuration for the value of a given attribute. The function tries a couple of options before giving up; by default it returns the name unmodified.

Parameters:
  • name – name of the attribute
  • run – Combination of configuration and run-specific information (internally generated)
Returns:dict corresponding to the attribute
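
The source does not list which options the lookup tries. As a sketch of the "try a few options, then fall back to the name" behavior, the function below assumes two options: an exact top-level key, then a dotted path through nested dicts. lookup_sketch and the sample run dict are hypothetical.

```python
def lookup_sketch(name, run):
    """Return the value stored under `name`, or `name` itself if nothing matches."""
    # Option 1 (assumed): the name is a top-level key.
    if name in run:
        return run[name]
    # Option 2 (assumed): the name is a dotted path through nested dicts.
    node = run
    for part in name.split("."):
        if isinstance(node, dict) and part in node:
            node = node[part]
        else:
            # Fallback described above: return the name unmodified.
            return name
    return node


run = {"spec": {"experiment": {"scope": "offers", "version": 1}}}
print(lookup_sketch("spec.experiment", run))
print(lookup_sketch("no.such.attribute", run))
```

Returning the unmodified name lets literal strings pass through an attribute-driven configuration without a separate syntax for constants.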

pydatasentry.process.summarize_run(run)

Post-process the input and output data from the run.

Parameters:run – Combination of configuration and run-specific information (internally generated)
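
The default configuration stores a signature.json file under a relative path using local storage. A rough sketch of that last step is shown below. The directory layout, the store_locally helper, and the contents of the summary are assumptions made for the example. The real signature format is not described here.

```python
import json
import os
import tempfile


def store_locally(summary, root, relative_path, filename="signature.json"):
    """Write `summary` as JSON under root/relative_path and return the file path."""
    target_dir = os.path.join(root, relative_path)
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, filename)
    with open(target, "w") as handle:
        # default=str keeps the dump from failing on values JSON cannot encode.
        json.dump(summary, handle, indent=2, sort_keys=True, default=str)
    return target


# Hypothetical run summary.
summary = {
    "experiment": {"scope": "offers", "run": "conditional-offers", "version": 1},
    "input": {"formula": "y ~ x"},
    "output": {"note": "placeholder for captured results"},
}

with tempfile.TemporaryDirectory() as root:
    path = store_locally(summary, root, os.path.join("offers", "conditional-offers", "1"))
    with open(path) as handle:
        restored = json.load(handle)

print(os.path.basename(path))
```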