PyDataSentry - Memory for Data Science

The pydatasentry package makes modeling code and data auditable by logging all relevant information for every model run (e.g., a regression).

Background

Audience

Mainly data scientists who regularly run statistical models of various kinds using pandas, statsmodels, or scikit-learn.

Problem

The combination of datasets, questions, and types of analysis grows every day. Data scientists find it hard to keep track of all the datasets they have dealt with, what they did with those datasets, and what they presented to the model audience (business stakeholders, etc.).

Solution

The pydatasentry package makes modeling code and data auditable by logging all relevant information for every model run (e.g., a regression). You can use it to audit past results for correctness, share models and results with peers, and search past results to avoid repeating work.

Features

  • Automatic interception of library calls. Right now only statsmodels.formula.api is supported, but support can easily be extended to other libraries.
  • Capture of all relevant context for each run, including the signature (who called, with what parameters, etc.), the input dataset, and the resulting object (full and summary versions). In addition, parameters (sentryopts) can be passed to the library call; these are extracted and captured (e.g., a set of tags).
  • Systematic storage in a local directory and, if needed, remotely.
  • Optional GitHub commit information to identify the code responsible for the call.

All of these can be overridden and extended.
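
As an illustration of the interception idea, here is a minimal sketch. It is not pydatasentry's actual implementation: it only shows how a wrapper can record a call's signature and strip a sentryopts keyword argument before the real function runs. The names intercept, fit_model, and CAPTURED are hypothetical.

```python
import functools

CAPTURED = []  # hypothetical in-memory log; a real tool would persist this


def intercept(func):
    """Wrap func so that every call is recorded before it runs."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Remove the extra options so the wrapped function never sees them.
        opts = kwargs.pop("sentryopts", {})
        CAPTURED.append({
            "module": func.__module__,
            "function": func.__name__,
            "kwargs": sorted(kwargs),
            "opts": opts,
        })
        return func(*args, **kwargs)
    return wrapper


@intercept
def fit_model(formula, data):
    """Stand-in for a library call such as an OLS fit (hypothetical)."""
    return "fitted {} on {} rows".format(formula, len(data))


result = fit_model(formula="A ~ B + C", data=[1, 2, 3],
                   sentryopts={"tags": ["demo"]})
print(result)
print(CAPTURED[0]["function"], CAPTURED[0]["opts"])
```

The same wrapping could in principle be applied to a library function by reassigning the attribute on its module, which is one way to instrument code without changing it.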

Installation

pip install glob2 numpy scipy patsy six statsmodels pandas
pip install pydatasentry

For now, the dependencies must be installed explicitly because scikit-learn depends on numpy, but setuptools does not install dependencies in the order we specify.
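
Before installing pydatasentry, it may help to confirm that the prerequisites are importable. The sketch below uses only the standard library and assumes each package's import name matches its pip name. The helper missing_modules is hypothetical and not part of pydatasentry.

```python
import importlib.util

# Prerequisites named in the pip command above.
REQUIRED = ["glob2", "numpy", "scipy", "patsy", "six", "statsmodels", "pandas"]


def missing_modules(names):
    """Return the names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Install these first: pip install " + " ".join(missing))
else:
    print("All prerequisites are importable.")
```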

Examples

No code change:

$ sentry.py
<see the help>
sentry: Transparently instrument pandas code
sentry.py help
sentry.py init <sentry-conf.py>
sentry.py run [-c|--config <sentry-conf.py>]  <python-program-to-be-instrumented>

$ sentry.py example basic_ols.py
<creates a simple OLS regression file>

$ sentry.py init sentry-conf.py
$ emacs sentry-conf.py
<initialize a sentry configuration file and edit>
<update scope and run>

$ sentry.py run -c sentry-conf.py basic_ols.py

$ find model-output
<see the output>

Minimal:

Only two lines are required by default: importing pydatasentry and calling pydatasentry.initialize(). Check the config module for the defaults governing what is captured (dataframes, statsmodels interface, signature) and where it is stored (local directory ‘model-output’).

#!/usr/bin/env python

import os, sys
import pandas as pd
import statsmodels.formula.api as smf
import pydatasentry

if __name__ == "__main__":

    pydatasentry.initialize()

    df = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                       "B": [20, 30, 10, 40, 50],
                       "C": [32, 234, 23, 23, 42523]})

    result = smf.ols(formula="A ~ B + C",
                     data=df).fit()

    print(result.summary())

The output is stored in an experiment- and time-dependent directory that has a unique identifier associated with it.

$ find model-output
model-output
model-output/offers
model-output/offers/conditional
model-output/offers/conditional/1
model-output/offers/conditional/1/ols
model-output/offers/conditional/1/ols/96f5b468-85ee-11e5-b3b5-0800274d1e8c
model-output/offers/conditional/1/ols/96f5b468-85ee-11e5-b3b5-0800274d1e8c/2015-Nov-08-13:29:08
model-output/offers/conditional/1/ols/96f5b468-85ee-11e5-b3b5-0800274d1e8c/2015-Nov-08-13:29:08/full.pickle
model-output/offers/conditional/1/ols/96f5b468-85ee-11e5-b3b5-0800274d1e8c/2015-Nov-08-13:29:08/signature.json
model-output/offers/conditional/1/ols/96f5b468-85ee-11e5-b3b5-0800274d1e8c/2015-Nov-08-13:29:08/summary.pickle
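
The listing above appears to follow a scope/run/version/function/identifier/timestamp layout. That reading is inferred from the paths, not taken from the package's code, and which of the first two components is the scope and which the run is a guess. Under that assumption, a path of the same shape could be built as follows; output_dir is a hypothetical helper.

```python
import os
import uuid
from datetime import datetime


def output_dir(root, scope, run, version, function, run_id=None, when=None):
    """Build a directory path shaped like the listing above (assumed layout)."""
    run_id = run_id or str(uuid.uuid1())   # time-based unique identifier
    when = when or datetime.now()
    stamp = when.strftime("%Y-%b-%d-%H:%M:%S")
    return os.path.join(root, scope, run, str(version), function, run_id, stamp)


path = output_dir("model-output", "offers", "conditional", 1, "ols",
                  run_id="96f5b468-85ee-11e5-b3b5-0800274d1e8c",
                  when=datetime(2015, 11, 8, 13, 29, 8))
print(path)
```

Note that the timestamp contains colons, which are not valid in directory names on every platform.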

$ cat model-output/offers/conditional/1/ols/96f5b468-85ee-11e5-b3b5-0800274d1e8c/2015-Nov-08-13:29:08/signature.json
{
    "data": {
        "name": "random",
        "columns": [
            "A",
            "B",
            "C"
        ],
        "shape": [
            5,
            3
        ]
    },
    "uuid": "51ef2ae4-85ed-11e5-a8bc-0800274d1e8c",
    "model": {
        "module": "statsmodels.formula.api",
        "formula": "A ~ B + C",
        "function": "ols"
    },
    "experiment": {
        "scope": "test",
        "version": 1,
        "run": "test"
    }
}
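
Because each run leaves a signature.json behind, past results can be searched with a few lines of standard-library code. The sketch below assumes the directory name model-output and the JSON fields shown above. find_runs is a hypothetical helper, not a pydatasentry function.

```python
import json
import os


def find_runs(root, formula=None, function=None):
    """Yield (path, signature) for each matching signature.json under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if "signature.json" not in filenames:
            continue
        path = os.path.join(dirpath, "signature.json")
        with open(path) as handle:
            signature = json.load(handle)
        model = signature.get("model", {})
        # Skip runs that do not match the requested filters.
        if formula is not None and model.get("formula") != formula:
            continue
        if function is not None and model.get("function") != function:
            continue
        yield path, signature


for path, signature in find_runs("model-output", function="ols"):
    print(path, signature["data"]["columns"], signature["experiment"])
```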

Detailed:

pydatasentry gives the user control over every aspect of the process. The example below shows the user overriding the experiment details and output parameters, and tracking lineage.

#!/usr/bin/env python

import os, sys
import pandas as pd
import statsmodels.formula.api as smf
import pydatasentry

if __name__ == "__main__":

    # Specify in detail what is captured and how
    pydatasentry.initialize({
        'debug': True,

        'spec': {
            'experiment': {
                'scope': 'test',
                'run': 'test',
                'version': 1
            },
            'output': {
                'params': [
                    {
                        'content': 'attributes.output.default-signature',
                        'path': 'attributes.output.relative-path',
                        'filename': 'signature.json'
                    }
                ]
            },
        },
    })

    # tracklineage must be imported first; the import is not shown here.
    with tracklineage("load", "sample"):
        df = pd.DataFrame({"A": [10, 20, 30, 40, 50],
                           "B": [20, 30, 10, 40, 50],
                           "C": [32, 234, 23, 23, 42523]})

        result = smf.ols(formula="A ~ B + C",
                         data=df,
                         sentryopts={'dataset': "sample"}).fit()

        print(result.summary())
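
The tracklineage call is not explained further here. As a rough illustration of the general idea only, a context manager can record which action was applied to which dataset and when. The sketch below is not pydatasentry's implementation; track_lineage_sketch and LINEAGE are hypothetical names.

```python
import time
from contextlib import contextmanager

LINEAGE = []  # hypothetical in-memory record of lineage events


@contextmanager
def track_lineage_sketch(action, dataset):
    """Record that `action` was applied to `dataset`, with start and end times."""
    event = {"action": action, "dataset": dataset, "start": time.time()}
    try:
        yield event
    finally:
        # Runs even if the block raises, so every attempt is logged.
        event["end"] = time.time()
        LINEAGE.append(event)


with track_lineage_sketch("load", "sample"):
    rows = [{"A": 10, "B": 20}, {"A": 20, "B": 30}]

print(LINEAGE[0]["action"], LINEAGE[0]["dataset"])
```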

Caveats

  • Only Python 3 is supported.

Next Steps

A number of next steps are planned:

  • Test with several statsmodels libraries.
  • Improve the instrumentation.

If you have suggestions or run into problems, please let me know (pingali@gmail.com) or post an issue.

License

Standard MIT License. See LICENSE.txt

Acknowledgements

Thanks to FourthLion for agreeing to contribute this code back to the community.