Computational Advances in Data-Consistent Inversion

Measure-Theoretic Methods for Improving Predictions

Motivation

Ideas beget prototype notebooks/scripts
Spawned other branches of development
Eventually would attempt to scale code
- school servers, office workstation
- cloud (??)

FAILURE TO RUN

A Pattern

Prototype script for idea: ~10-20 lines
Production-grade class/module: ~200-500 lines

or,

Figure generates on Mac OSX, crashes on Linux
Figure generates on Linux, looks weird on Mac OSX

etc.

Motivation

Result: Began studying Software Development

Dependencies, Packaging
Environment, Reproducibility
- Images, Containers, Docker
- Registries, Cloud Solutions
Unit Testing
Continuous Integration/Deployment
Organization of Git-Based Workflows

What if…?

One applied best practices for software development to all aspects of the dissertation process?

What if…?

Document (PDF) itself can be re-created
- resolve LaTeX dependencies
- package into Docker container
Examples/Figures/Tables Completely Reproducible
- resolve Python dependencies
- re-usable command-line Python scripts
- provide convenience bash scripts

What if…?

Novel math is implemented as open-source code
- update, extend, and enrich BET
- enables thesis examples
- ensures robustness (unit testing)
Package an All-in-One Docker Image
- host in cloud-based registries (DockerHub)
- original build files publicly hosted (GitHub)
- thesis repository contains full instructions

Order of Operations

Resolve environment for thesis repository
Get BET up-to-date
Add features to BET implementing new approach
Make more user-friendly, capable of automation
Write scripts for each example in thesis
Assured of results, write them up
Ensure thesis still compiles, generates figures

A Reproducible Thesis

One-click launch in-browser via Binder
Executable scripts: #!/bin/bash, chmod +x
Minimal tool-set in Dockerfile, options available
- include command-line tools (e.g. image resizing)
Dockerfile for LaTeX (standalone), no figure-creation
Dockerfile for Python+LaTeX (reproducibility)
Dockerfile for full-stack environment (JupyterLab)

BET

[Butler, Estep, Tavener] Method

+ Mattis, Graham, Walsh, Pilosov

github.com/UT-CHG/BET

About BET

Purpose: Implement Measure-Theoretic Methods
- Data-Consistent Inversion
GNU General Public License
First released 2014
Python 2.7, upgraded to 3.6+
- Tested weekly on 2.7, 3.6, 3.7
Fully documented

Documentation

BET is auto-documented using a tool called Sphinx

r"""
This is a description of the method.
All sorts of formatting options are understood by Sphinx.
(kind of something to learn on its own, but usually you
can make sense of syntax by looking at existing ones)
"""

Comments that are formatted inside blocks are converted into web-page documentation
Updating docs through command-line
GitHub pages used to host the resulting output

Unit-Testing, Coverage, Versioning

nose is a Python framework for unit testing
- desirable: compatible with unittest
codecov works alongside it as a tracker
setup.py contains versioning information
- pip install ., or
- `python setup.py install
- Major/Minor releases ~ extent of changes
- The third number in v1.2.3 is for incremental changes, such as bug-fixes, typos, patches, etc.

UNit Testing with Nose

One-to-one file structure
- one test (class) for each sub-module method
- multiple “tests” within each
Class consists of several methods
- setup and teardown required
- test for each function within module
Each test should anticipate mixtures of arguments that could be passed to each function

Unit Testing

from unnecessary_math import multiply
 
def test_numbers_3_4():
    assert multiply(3,4) == 12 
 
def test_strings_a_3():
    assert multiply('a',3) == 'aaa'

nosetests -v test_um_nose.py

Source

Continuous Integration

Cloud instance carries out instructions
Travis runs when you submit a PR to check that everything works
GitHub checks for ability to merge automatically
Passing does not ensure a PR is merged
- Ultimately up to the administrators of the repo
Helpful for contributors to debug before admins take a look
Used to prevent broken master branches

Upgrading BET

Python 2 support ending by 2020
Used a tool called 2to3
- Takes care of most major changes
- Two weeks fixing tests
Fixed tests for CI pipeline (Travis)
- only handled numprocs=1,2
Addressed matplotlib upgrades, warnings
- Released via PR as v2.1.0

Enhancing BET

Ability to measure accuracy of solutions.
- ensure it will be future-compatible
Sampling-based approach
- ensure ability to switch methods
Handle data-driven methods
- be capable of loading/transforming data
Automate some decisions, defaults for users (WIP)
Update documentation, installation options (WIP)
Publish to PyPi, Anaconda

Novel Theoretical Advancements

New framework based on Bayes’ Rule
New framework for “parameter identification”
Motivates different user experience with code
- define “initial” assumptions
“Consistent Bayes” -> “Data-Consistent Inversion”

Data-Consistent Inversion

In directions informed by data,
- “turn off” regularization
Use “initial” distribution to regularize in the nullspace of the QoI map
Existence, Uniqueness, Stability given by Disintegration Theorem

Connection to Deterministic Optimization

$$\pi^{up} = \pi^{in} (\lambda) \frac{\pi^{ob} (Q(\lambda)) }{\pi^{pr} (Q(\lambda)) }$$

Given a linear map, full rank, and
- Gaussian prior/initial
- Gaussian likelihood (1 datum)

There exists a connection to Tikonov regularization.

New Developments

Parameter Identification
- “data-driven maps”
Closed-form solution for linear maps
Iterative Algorithm
- sequential projections onto solution manifolds
- gradient-free optimization
Classification (variant of Naive Bayes)
- handles unequal class-representation
- comparison still unclear

Challenges with Implementation

Implementation v.s. Theory: many nuances
- various packages, approaches, optimizations
How to be efficient with parallel processors?
Where are the “memory bottlenecks?”
- Can we trade off approximation accuracy?
Choice of QoI (columns) impacts solution quality
- Allow flexibility to do feature-selection
Working on a project with wide scope
- Work often siloed in development
- Sensitivity/robustness analysis before MVP

Structure of Thesis (Pt 1)

1. Introduction & Motivations
      i. Preliminaries
     ii. Framework
    iii. Software Contributions

2. Background on DCI
      i. Notation
     ii. Set-Valued
    iii. Sample-Based
     iv. Software Contributions
      v. Illustrative Examples

3. Impact on Accuracy
    - sections mirror Chapter 2.

Structure of Thesis (Pt 2)

4. Data-Driven Maps & Consistent Inversion
      i. Stochastic Map Framework
     ii. Data-Driven Maps
    iii. Software Contributions
     iv. Numerical Results & Analysis

5. Research Directions
    - Fit in extensions
    - Work-in-progress, draft ideas
    - Approaches to approximation

6. References, Appendix

Status

Dissertation repository largely in place
Docker experience level improving
- still missing minimal versions
- one-click launch “working”
Architecture/Structure of thesis mostly settled
Some bash/python scripts completed for examples
- Larger examples in notebooks, uncoverted
Still a lot of content to be written up

Status

Software going through final rounds of testing
Documentation for new features still missing
Some (new) tests failing in parallel
Haven’t fully integrated with mybinder.org yet
No progress on demonstration/example repository
- plan: use thesis examples as basis for content

Fin

Links:

Homepage

Slideshows

BET