Keep .md files in the docs/

ASvyatkovskiy · ASvyatkovskiy · commit 3ff7a5cc790d · 2017-02-24T20:18:08.000-05:00
diff --git a/docs/Model.md b/docs/Model.md
@@ -0,0 +1,31 @@
+# Model builder
+
+Depends on Keras and model loader
+
+Contains 2 classes: ModelBuilder and LossHistory (utility)
+
+`ModelBuilder` takes `conf`, builds a model using the Keras library, and provides methods to manipulate the model (save, load, etc.).
+
+## MPI builder
+
+Serves a similar purpose and provides a set of MPI wrapper classes. Uses Keras SGD with the Theano backend.
+
+# Model runner
+
+Depends on the model loader and performance utils
+
+Contains a set of standalone functions which, given a shot list, perform training, make predictions, run evaluations and produce plots.
+
+
+# Targets
+
+Defines a class hierarchy of targets, specifying loss, activation functions and other params for the NNs
+
+
+# Loader
+
+Depends on `primitives.shots`
+
+Given conf and shotlist, provides tools to load shotlist, get batches, construct patches and manipulate them.
+
+It delivers preprocessed data to the model and prepares it for training.
diff --git a/docs/Preprocessing.md b/docs/Preprocessing.md
@@ -0,0 +1,44 @@
+# Raw data
+
+The raw 0D data comes in a plain structured text format:
+
+ 1. Shot list: a two-column CSV file with a unique shot identifier column and a disruption time column (-1 for non-disruptive shots).
+ 1. Individual shot files: two-column CSV files (one per shot in the shot list) with a time column and a plasma current column. The time grid is common to all files, but shot lengths vary considerably.
+
+See `plasma.jet_signals` for more details.
+
+# Preprocessing
+
+The goal of the preprocessing step is to go from the raw data to the higher-level primitives: Shots and ShotLists.
+In addition, the signal is cut, clipped and resampled (using a univariate linear spline and a log transformation of the signal), and a `ttd` (time-to-disruption) variable is introduced.
+
+Certain shots are marked invalid depending on the magnitude of the plasma current.
+
+Preprocessed results are saved in a NumPy binary `npz` file.
+
+The core methods are:
+  1. `plasma.preprocessor.preprocess.get_signals_and_times_from_file`
+  1. `plasma.preprocessor.preprocess.cut_and_resample_signals`
+  1. `plasma.utils.processing.cut_and_resample_signal`
+
+
+# Normalization
+
+Shot normalization addresses the different scales of plasma signals, which could otherwise have a negative effect on neural network training and inference.
+
+Normalizers are trained on the training shots, which requires one pass over the data before RNN training. Normalizer training essentially means extracting a set of statistics about the shots (mean, std, min-max) and incorporating them into the shot.
+As in the preprocessing step, the entire ShotList is split into sublists, a random sublist is picked, and stats are then extracted shot by shot and saved in a normalizer object.
+
+Example:
+
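+```python
+class MeanVarNormalizer(Normalizer):
+    def __init__(self, conf):
+        Normalizer.__init__(self, conf)
+        self.means = None  # means of the signals in the training shot list
+        self.stds = None   # standard deviations of the signals in the training shot list
+```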
+
+`means` and `stds` will contain lists of the means and standard deviations of the signals in the training shot list.
+
+Normalization is implemented as a class hierarchy, with a base `plasma.preprocessor.Normalizer` class defining how stats are extracted and how training is performed. A set of specific normalization classes, e.g. `MeanVarNormalizer` and `VarNormalizer`, is derived from it, implementing different methods of shot normalization.
diff --git a/docs/Primitives.md b/docs/Primitives.md
@@ -0,0 +1,52 @@
+## Shot
+
+Each shot is a measurement of plasma current as a function of time. A Shot object contains the following attributes:
+
+ 1. number - integer, unique identifier of a shot
+ 1. t_disrupt - double, disruption time in milliseconds (second column in the shotlist input file)
+ 1. ttd - array of doubles, time profile of the shot converted to time-to-disruption values
+ 1. valid - boolean, whether plasma current reaches a certain value during the shot
+ 1. is_disruptive - boolean, whether the shot is disruptive
+
+
+For 0D data, each shot is modeled as a 2D array: time vs. plasma current.
+
+## ShotList
+
+A wrapper around a list of shots. Therefore, it is a list of 2D arrays.
+
+## Sublist
+
+A ShotList is split into sublists, each holding `num_at_once` shots from the entire dataset contained in the ShotList.
+
+## Patch
+
+The length of shots varies by a factor of 20. For data-parallel synchronous training it is essential that the amounts of training data passed to each model replica are about the same size.
+
+Patches are subsets of shot time/signal profiles of equal length. Patch size is approximately equal to the minimum shot length (or the largest number less than or equal to the minimum shot length that is divisible by the LSTM model length).
+
+Since shot lengths are not multiples of the minimum shot length in general, some non-deterministic fraction of patches is created.
+
+## Chunk
+
+A subset of a `patch`. The number of chunks per patch is defined as:
+```
+num_chunks = length of the patch / num_timesteps
+```
+where `num_timesteps` is the sequence length fed to the RNN model.
+
+## Batch
+
+Mini-batch gradient descent is used to train the neural network model.
+`num_batches` represents the number of *patches* per mini-batch.
+
+### Batch input shape
+
+The data in batches fed to the Keras model should have shape:
+
+```
+batch_input_shape = (num_chunks*batch_size, num_timesteps, num_dimensions_of_data)
+```
+
+where `num_dimensions_of_data` is the signal dimensionality. For the 0D dataset we only have a time profile of the plasma current,
+so `num_dimensions_of_data = 1`.
diff --git a/docs/Targets.md b/docs/Targets.md
@@ -0,0 +1,12 @@
+# Understanding targets
+
+An abstract base class, implemented using the Python `abc` library, and a set of classes derived from it.
+
+## Data members
+
+`activation` and `loss`, both of type string
+
+
+## Static methods
+
+`remapper` and `threshold_range`