Computational Advances in Data-Consistent Inversion
Measure-Theoretic Methods for Improving Predictions
Motivation
- Ideas beget prototype notebooks/scripts
- Spawned other branches of development
- Eventually would attempt to scale code
- school servers, office workstation
- cloud (??)
FAILURE TO RUN
A Pattern
- Prototype script for idea: ~10-20 lines
- Production-grade class/module: ~200-500 lines
or,
- Figure generates on macOS, crashes on Linux
- Figure generates on Linux, looks weird on macOS
etc.
Motivation
Result: Began studying Software Development
- Dependencies, Packaging
- Environment, Reproducibility
- Images, Containers, Docker
- Registries, Cloud Solutions
- Unit Testing
- Continuous Integration/Deployment
- Organization of Git-Based Workflows
What if…?
One applied best practices for software development to all aspects of the dissertation process?
Order of Operations
- Resolve environment for thesis repository
- Get BET up-to-date
- Add features to BET implementing new approach
- Make more user-friendly, capable of automation
- Write scripts for each example in thesis
- Assured of results, write them up
- Ensure thesis still compiles, generates figures
A Reproducible Thesis
- One-click launch in-browser via Binder
- Executable scripts: #!/bin/bash, chmod +x
- Minimal tool-set in Dockerfile, options available
- include command-line tools (e.g. image resizing)
- Dockerfile for LaTeX (standalone), no figure-creation
- Dockerfile for Python+LaTeX (reproducibility)
- Dockerfile for full-stack environment (JupyterLab)
About BET
- Purpose: Implement Measure-Theoretic Methods
- Data-Consistent Inversion
- GNU General Public License
- First released 2014
- Python 2.7, upgraded to 3.6+
- Tested weekly on 2.7, 3.6, 3.7
- Fully documented
Documentation
BET is auto-documented using a tool called Sphinx
r"""
This is a description of the method.
All sorts of formatting options are understood by Sphinx.
(kind of something to learn on its own, but usually you
can make sense of syntax by looking at existing ones)
"""
- Comments that are formatted inside blocks are converted into web-page documentation
- Updating docs through command-line
- GitHub pages used to host the resulting output
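As a minimal sketch of what such a formatted comment block can look like, here is a hypothetical function (`scale` is an illustration, not part of BET) whose docstring uses the reStructuredText field lists that Sphinx's autodoc extension understands:

```python
def scale(values, factor=1.0):
    r"""
    Multiply every entry of ``values`` by ``factor``.

    :param values: numbers to rescale
    :type values: list of float
    :param float factor: multiplier applied to each entry
    :rtype: list of float
    :returns: the rescaled numbers, computed as :math:`f \cdot v_i`
    """
    # The raw-string prefix (r) keeps backslashes in the math markup intact.
    return [factor * v for v in values]
```

The docstring stays readable in the source file, and the same text is what the documentation build would render as a web page.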
Unit-Testing, Coverage, Versioning
nose is a Python framework for unit testing
- desirable: compatible with unittest
codecov works alongside it as a tracker
setup.py contains versioning information
- Install with pip install . or python setup.py install
- Major/Minor releases ~ extent of changes
- The third number in v1.2.3 is for incremental changes, such as bug fixes, typos, patches, etc.
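The three-part numbering can be illustrated with a small sketch. The helpers `parse_version` and `bump` are hypothetical names used only for this example; they are not taken from BET or its setup.py.

```python
def parse_version(tag):
    """Split a tag such as 'v1.2.3' into (major, minor, patch) integers."""
    major, minor, patch = tag.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def bump(tag, part):
    """Return the next tag after a 'major', 'minor', or 'patch' change."""
    major, minor, patch = parse_version(tag)
    if part == "major":
        return f"v{major + 1}.0.0"
    if part == "minor":
        return f"v{major}.{minor + 1}.0"
    if part == "patch":
        # Incremental changes: bug fixes, typos, patches.
        return f"v{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")
```

Under this convention a bug fix takes v1.2.3 to v1.2.4, while a larger change resets the lower numbers.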
Unit Testing with Nose
- One-to-one file structure
- one test (class) for each sub-module method
- multiple “tests” within each
- Class consists of several methods
- setup and teardown required
- test for each function within module
- Each test should anticipate mixtures of arguments that could be passed to each function
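The structure above can be sketched with the standard library's unittest, which nose is able to collect and run. The function `multiply` and the class `TestMultiply` are hypothetical stand-ins for a sub-module method and its test class. Note that unittest spells the fixture hooks `setUp` and `tearDown`.

```python
import unittest


def multiply(a, b):
    """Hypothetical function under test (stands in for a sub-module method)."""
    return a * b


class TestMultiply(unittest.TestCase):
    def setUp(self):
        # Runs before every test method: build shared fixtures here.
        self.pairs = [(3, 4, 12), (0, 5, 0), (-2, 3, -6)]

    def tearDown(self):
        # Runs after every test method: release fixtures here.
        self.pairs = None

    def test_integers(self):
        for a, b, expected in self.pairs:
            self.assertEqual(multiply(a, b), expected)

    def test_string_and_int(self):
        # Mixed argument types: sequence repetition rather than arithmetic.
        self.assertEqual(multiply("a", 3), "aaa")

    def test_two_strings_raises(self):
        # An argument mixture the function cannot handle should fail loudly.
        with self.assertRaises(TypeError):
            multiply("a", "b")
```

Each test method covers one mixture of arguments, so a failure points directly at the case that broke.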
Unit Testing
from unnecessary_math import multiply

def test_numbers_3_4():
    assert multiply(3, 4) == 12

def test_strings_a_3():
    assert multiply('a', 3) == 'aaa'

nosetests -v test_um_nose.py
Source
Continuous Integration
- Cloud instance carries out instructions
- Travis runs when you submit a PR to check that everything works
- GitHub checks for ability to merge automatically
- Passing does not ensure a PR is merged
- Ultimately up to the administrators of the repo
- Helpful for contributors to debug before admins take a look
- Used to prevent broken master branches
Upgrading BET
- Python 2 support ending by 2020
- Used a tool called 2to3
- Takes care of most major changes
- Two weeks fixing tests
- Fixed tests for CI pipeline (Travis)
- only handled numprocs=1,2
- Addressed matplotlib upgrades, warnings
- Released via PR as v2.1.0
Enhancing BET
- Ability to measure accuracy of solutions.
- ensure it will be future-compatible
- Sampling-based approach
- ensure ability to switch methods
- Handle data-driven methods
- be capable of loading/transforming data
- Automate some decisions, defaults for users (WIP)
- Update documentation, installation options (WIP)
- Publish to PyPI, Anaconda
Novel Theoretical Advancements
- New framework based on Bayes’ Rule
- New framework for “parameter identification”
- Motivates different user experience with code
- define “initial” assumptions
- “Consistent Bayes” -> “Data-Consistent Inversion”
Data-Consistent Inversion
- In directions informed by data,
- “turn off” regularization
- Use “initial” distribution to regularize in the nullspace of the QoI map
- Existence, Uniqueness, Stability given by Disintegration Theorem
Connection to Deterministic Optimization
$$\pi^{up}(\lambda) = \pi^{in} (\lambda) \frac{\pi^{ob} (Q(\lambda)) }{\pi^{pr} (Q(\lambda)) }$$
- Given a linear map, full rank, and
- Gaussian prior/initial
- Gaussian likelihood (1 datum)
There exists a connection to Tikhonov regularization.
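A minimal numerical sketch of the update formula in the linear-Gaussian setting follows. All specifics are assumptions chosen for illustration and do not come from the slides or from BET: a two-dimensional parameter, the map Q(λ) = Aλ with A = [1, 1], a standard normal initial density, and an observed density N(1, 0.5²). The predicted density is the push-forward of the initial through A, which is N(0, AAᵀ) here.

```python
import numpy as np


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


# Assumed toy problem: Q(lambda) = A @ lambda, one scalar quantity of interest.
A = np.array([1.0, 1.0])
obs_mean, obs_std = 1.0, 0.5   # observed density on the data space
pred_std = np.sqrt(A @ A)      # push-forward of N(0, I) through A is N(0, A A^T)


def updated_density(lam):
    """Evaluate pi^up at parameter values lam of shape (..., 2)."""
    q = lam @ A
    initial = gaussian_pdf(lam[..., 0], 0.0, 1.0) * gaussian_pdf(lam[..., 1], 0.0, 1.0)
    ratio = gaussian_pdf(q, obs_mean, obs_std) / gaussian_pdf(q, 0.0, pred_std)
    return initial * ratio


# Riemann-sum checks on a grid: total mass and the mean of Q under pi^up.
grid = np.linspace(-6.0, 6.0, 401)
dx = grid[1] - grid[0]
lam = np.stack(np.meshgrid(grid, grid, indexing="ij"), axis=-1)
up = updated_density(lam)
total_mass = up.sum() * dx * dx
pushforward_mean = ((lam @ A) * up).sum() * dx * dx
```

In this sketch `total_mass` should come out close to 1 and `pushforward_mean` close to the observed mean, which is the data-consistency property: the updated density pushes forward to the observed density. Only the direction spanned by A is changed by the data. The orthogonal direction keeps the initial density.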
New Developments
- Parameter Identification
- Closed-form solution for linear maps
- Iterative Algorithm
- sequential projections onto solution manifolds
- gradient-free optimization
- Classification (variant of Naive Bayes)
- handles unequal class-representation
- comparison still unclear
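The slides do not spell out the iterative algorithm. As a loosely related illustration only, the classical Kaczmarz method is one well-known instance of sequential projections for a linear map: each row of the system defines a hyperplane of solutions, and the iterate is projected onto each hyperplane in turn. The function name `kaczmarz` and the 2x2 system below are assumptions for this sketch, not the thesis algorithm.

```python
import numpy as np


def kaczmarz(A, b, x0, sweeps=50):
    """Cyclically project x onto the hyperplanes {x : A[i] @ x = b[i]}."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(sweeps):
        for a_i, b_i in zip(A, b):
            # Orthogonal projection of x onto the i-th hyperplane.
            x += (b_i - a_i @ x) / (a_i @ a_i) * a_i
    return x


# Assumed toy system with a unique solution.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = kaczmarz(A, b, x0=np.zeros(2))
```

Each step only needs one row of the map and one datum, which is what makes this style of iteration attractive when data arrive sequentially.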
Challenges with Implementation
- Implementation vs. theory: many nuances
- various packages, approaches, optimizations
- How to be efficient with parallel processors?
- Where are the “memory bottlenecks”?
- Can we trade off approximation accuracy?
- Choice of QoI (columns) impacts solution quality
- Allow flexibility to do feature-selection
- Working on a project with wide scope
- Work often siloed in development
- Sensitivity/robustness analysis before MVP
Structure of Thesis (Pt 1)
1. Introduction & Motivations
i. Preliminaries
ii. Framework
iii. Software Contributions
2. Background on DCI
i. Notation
ii. Set-Valued
iii. Sample-Based
iv. Software Contributions
v. Illustrative Examples
3. Impact on Accuracy
- sections mirror Chapter 2.
Structure of Thesis (Pt 2)
4. Data-Driven Maps & Consistent Inversion
i. Stochastic Map Framework
ii. Data-Driven Maps
iii. Software Contributions
iv. Numerical Results & Analysis
5. Research Directions
- Fit in extensions
- Work-in-progress, draft ideas
- Approaches to approximation
6. References, Appendix
Status
- Dissertation repository largely in place
- Docker experience level improving
- still missing minimal versions
- one-click launch “working”
- Architecture/Structure of thesis mostly settled
- Some bash/python scripts completed for examples
- Larger examples in notebooks, unconverted
- Still a lot of content to be written up
Status
- Software going through final rounds of testing
- Documentation for new features still missing
- Some (new) tests failing in parallel
- Haven’t fully integrated with mybinder.org yet
- No progress on demonstration/example repository
- plan: use thesis examples as basis for content