# post
post is a program for processing structured data files in bulk.
It was originally intended as an automation tool for generating LaTeX
graphs from functionObject data generated by OpenFOAM® simulations,
but has since evolved such that it can be used as a general structured data
processor with optional graph generation support.
Its primary use is processing and formatting data spread over multiple files and/or archives. The main benefit is that the entire process is defined through one or more YAML-formatted run files, hence automating data processing pipelines is fairly simple and no programming is necessary.
## CLI usage
```
Usage:
  post [run file] [flags]
  post [command]

Available Commands:
  completion  Generate the autocompletion script for the specified shell
  graphfile   Generate graph file stub(s)
  help        Help about any command
  runfile     Generate a run file stub

Flags:
      --dry-run             check runfile syntax and exit
  -h, --help                help for post
      --no-graph            don't write or generate graphs
      --no-graph-generate   don't generate graphs
      --no-graph-write      don't write graph files
      --no-output           don't output data
      --no-process          don't process data
      --only-graphs         only write and generate graphs, skip input, processing and output
      --skip strings        a list of pipeline IDs to be skipped during processing
  -v, --verbose             verbose log output
```
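For example, assuming a run file named `run.yaml` (a hypothetical name), typical invocations might look as follows; the `--skip` value is written here in the usual comma-separated form for list flags:

```sh
# check the run file syntax without reading, processing or writing anything
post run.yaml --dry-run

# process and output the data, but skip graph generation entirely
post run.yaml --no-graph

# run everything except the pipelines tagged 'raw' and 'debug'
post run.yaml --skip raw,debug
```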
## Run file structure
post is controlled by a run file, a YAML-format file supplied as a CLI parameter.
The run file consists of a list of pipelines, each defining 4 sections:
`input`, `process`, `output` and `graph`. The `input` section defines input
files and formats from which data is read; the `process` section defines
operations which are applied to the data; the `output` section defines how
the processed data will be output/stored; and the `graph` section defines
how the data will be graphed.
Even though all sections are technically optional, certain sections depend on
others: specifically, the `process` and `output` sections require an `input`
section to be defined in order to work, since 'some data' is necessary for
processing/output. The `graph` section is entirely optional and can be
omitted, defined by itself, or included as part of a pipeline.
A single pipeline has the following fields:

```yaml
- id:
  input:
    type:
    fields:
    type_spec:
  process:
    - type:
      type_spec:
  output:
    - type:
      type_spec:
  graph:
    type:
    graphs:
```
- `id`: the pipeline tag, used to reference the pipeline on the CLI; optional
- `input`: the input section
  - `type`: input type; see Input for type descriptions
  - `fields`: field (column) names of the input data; optional
  - `type_spec`: input type specific configuration
- `process`: the process section
  - `type`: process type; see Processing for type descriptions
  - `type_spec`: process type specific configuration
- `output`: the output section
  - `type`: output type; see Output for type descriptions
  - `type_spec`: output type specific configuration
- `graph`: the graph section
  - `type`: graph type; see Graphing for type descriptions
  - `graphs`: a list of graph type specific graph configurations
A simple run file example is shown below.
```yaml
- input:
    type: dat
    fields: [x, y]
    type_spec:
      file: 'xy.dat'
  process:
    - type: expression
      type_spec:
        expression: '100*y'
        result: 'result'
  output:
    - type: csv
      type_spec:
        file: 'output/data.csv'
  graph:
    type: tex
    graphs:
      - name: xy.tex
        directory: output
        table_file: 'output/data.csv'
        axes:
          - x:
              min: 0
              max: 1
              label: '$x$'
            y:
              min: 0
              max: 100
              label: '$100 y$'
            tables:
              - x_field: x
                y_field: result
                legend_entry: 'result'
```
The example run file instructs post to do the following:

- read data from a `DAT`-formatted file `xy.dat` and rename the fields (columns) to `x` and `y`
- evaluate the expression `100*y` and store the result in a field named `result`
- output the data, now containing the fields `x`, `y` and `result`, to a `CSV`-formatted file `output/data.csv`; if the directory `output` does not exist, it will be created
- generate a graph using TeX in the `output` directory, using `output/data.csv` as the table (data) file, with `x` as the abscissa and `result` as the ordinate
For more examples see the test directory.
## Input
The following is a list of available input types and their descriptions along with their run file configuration stubs.
- `archive` reads input from an archive. The archive format is inferred from the file name extension. The following archive formats are supported: `TAR`, `TAR-GZ`, `TAR-BZIP2`, `TAR-XZ`, `ZIP`. Note that `archive` input wraps one or more input types, i.e., the `archive` configuration only specifies how to read 'some data' from an archive, while the wrapped input type reads the actual data (see the sketch after this list). Another important note is that the contents of the archive are stored in memory the first time the archive is read, so if the same archive is used multiple times as an input source, it will be read from disk only once; each subsequent read will read directly from RAM. Hence it is beneficial to use the `archive` input type when the data consists of a large number of input files, e.g., a large `time-series`.

  ```yaml
  type: archive
  type_spec:
    file: # file path of the archive
    format_spec: # input type configuration, e.g., a CSV input type
  ```
- `csv` reads from a `CSV`-formatted file. If the file contains a header line, the `header` field should be set to `true` and the header column names will be used as the field names for the data. If no header line is present, the `header` field must be set to `false`.

  ```yaml
  type: csv
  type_spec:
    file: # file path of the CSV file
    header: # determines if the CSV file has a header; default 'true'
    comment: # character to denote comments; default '#'
    delimiter: # character to use as the field delimiter; default ','
  ```
- `dat` reads from a white-space-separated-value file. The type and amount of white space between columns is irrelevant, as are leading and trailing white spaces, as long as the number of columns (non-white space fields) is consistent in each row.

  ```yaml
  type: dat
  type_spec:
    file: # file path of the DAT file
  ```
- `multiple` is a wrapper for multiple input types. Data is read from each input type specified and, once all inputs have been read, the data from each input is merged into a single data instance containing all fields (columns) from all inputs. The number and type of input types specified is arbitrary, but each input must yield data with the same number of rows.

  ```yaml
  type: multiple
  type_spec:
    format_specs: # a list of input type configurations
  ```
- `ram` reads data from an in-memory store. For the data to be read it must have been stored previously, e.g., a previous `output` section defines a `ram` output.

  ```yaml
  type: ram
  type_spec:
    name: # key under which the data is stored
  ```
- `time-series` reads data from a time-series of structured data files in the following format:

  ```
  .
  ├── 0.0
  │   ├── data_0.csv
  │   ├── data_1.dat
  │   └── ...
  ├── 0.1
  │   ├── data_0.csv
  │   ├── data_1.dat
  │   └── ...
  └── ...
  ```

  where each `data_*.*` file contains the data, in some format, at the moment in time specified by the directory name. Each series dataset must be output into a different file, i.e., the `data_0.csv` files contain one dataset, the `data_1.dat` files another one, and so on.

  ```yaml
  type: time-series
  type_spec:
    file: # file name (base only) of the time-series data files
    directory: # path to the root directory of the time-series
    time_name: # the time field name; default is 'time'
    format_spec: # input type configuration, e.g., a CSV input type
  ```
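As an illustration of how the `archive` input wraps another input type, the sketch below reads a CSV file from a gzipped tar archive. The file names are hypothetical, and it is assumed, based on the stubs above, that `format_spec` holds a complete input configuration (`type` plus `type_spec`):

```yaml
- input:
    type: archive
    type_spec:
      file: 'results.tar.gz'        # hypothetical archive path
      format_spec:
        type: csv                   # the wrapped input type reads the actual data
        type_spec:
          file: 'forces.csv'        # assumed to be a path within the archive
          header: true
```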
## Processing
The following is a list of available processor types and their descriptions along with their run file configuration stubs.
- `average-cycle` mutates the data by computing the ensemble average of a cycle for all numeric fields. The ensemble average is computed as:

  ```
  Φ(ωt) = 1/N Σ ϕ[ω(t + jT)],    j = 0...N-1
  ```

  where `ϕ` is the slice of values to be averaged, `ω` the angular velocity, `t` the time and `T` the period.

  The resulting data will contain the cycle average of all numeric fields and a time field (named `time`), containing times for each row of cycle average data, in the range (0, T]. The time field will be the last field (column), while the order of the other fields is preserved.

  Time matching can optionally be specified, as well as the match precision, by setting `time_field` and `time_precision` respectively in the configuration. This checks whether the time (step) is uniform and whether there is a mismatch between the expected time of the averaged value, as per the number of cycles defined in the configuration and the supplied data, and the read time. The read time is the one read from the field named `time_field`. Note that in this case the output time field will be named after `time_field`, i.e., the time field name will remain unchanged.

  ```yaml
  type: average-cycle
  type_spec:
    n_cycles: # number of cycles to average over
    time_field: # time field name; optional
    time_precision: # time-matching precision; optional
  ```
- `expression` evaluates an arithmetic expression and appends the resulting field (column) to the data. The expression operands can be scalar values or fields (columns) present in the data, which are referenced by their names. Note that at least one of the operands must be a field present in the data. Each operation involving a field is applied element-wise. The following arithmetic operations are supported: `+`, `-`, `*`, `/`, `**`. A combined example, together with `filter`, is shown after this list.

  ```yaml
  type: expression
  type_spec:
    expression: # an arithmetic expression
    result: # field name of the resulting field
  ```
- `filter` mutates the data by applying a set of row filters as defined in the configuration. The filter behaviour is described by providing the field name `field` to which the filter is applied, the comparison operator `op` and a comparison value `value`. Rows satisfying the comparison are kept, while others are discarded. The following comparison operators are supported: `==`, `!=`, `>`, `>=`, `<`, `<=`. All defined filters are applied at the same time. The way in which they are aggregated is controlled by setting the `aggregation` field in the configuration; `and` and `or` aggregation modes are available. The `or` mode is the default if the `aggregation` field is unset.

  ```yaml
  type: filter
  type_spec:
    aggregation: # aggregation mode; defaults to 'or'
    filters:
      - field: # field name to which the filter is applied
        op: # filtering operation
        value: # comparison value
  ```
- `resample` mutates the data by linearly interpolating all numeric fields, such that the resulting fields have `n_points` values, at uniformly distributed values of the field `x_field`. If `x_field` is not set, a uniform resampling is performed, i.e., as if the values of each field were given at uniformly distributed values of x, where x ∈ [0, 1]. The first and last values of a field are preserved in the resampled field.

  ```yaml
  type: resample
  type_spec:
    n_points: # number of resampling points
    x_field: # field name of the independent variable; optional
  ```
- `select` mutates the data by selecting fields (extracting columns) specified by `fields`, which is a list of field names.

  ```yaml
  type: select
  type_spec:
    fields: # a list of field names
  ```
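As mentioned above, a pipeline's `process` section is a list of steps; assuming steps are applied in the order listed, the sketch below first evaluates an expression and then filters the resulting rows. The field names, values and the rescaling are hypothetical and only follow the stubs above:

```yaml
process:
  - type: expression
    type_spec:
      expression: 'p/1000'     # hypothetical: rescale the field 'p'
      result: 'p_scaled'
  - type: filter
    type_spec:
      aggregation: and         # keep rows satisfying both filters
      filters:
        - field: time
          op: '>='
          value: 0.1
        - field: p_scaled
          op: '<'
          value: 50
```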
## Output
The following is a list of available output types and their descriptions along with their run file configuration stubs.
- `csv` writes `CSV`-formatted data to a file. If `header` is set to `true`, the file will contain a header line with the field names as the column names. Note that, if necessary, directories will be created so as to ensure that `file` specifies a valid path.

  ```yaml
  type: csv
  type_spec:
    file: # file path of the CSV file
    header: # determines if the CSV file has a header; default 'true'
    comment: # character to denote comments; default '#'
    delimiter: # character to use as the field delimiter; default ','
  ```
- `ram` stores data in an in-memory store. Once data is stored, any subsequent `ram` input type can access the data (see the sketch after this list).

  ```yaml
  type: ram
  type_spec:
    name: # key under which the data is stored
  ```
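As a sketch of passing data between pipelines via the in-memory store, the first pipeline below reads a DAT file and stores the data under a key, and the second pipeline reads it back and writes it out as CSV. The ids, file names and the store key are hypothetical:

```yaml
- id: load
  input:
    type: dat
    fields: [x, y]
    type_spec:
      file: 'xy.dat'
  output:
    - type: ram
      type_spec:
        name: xy-data            # key under which the data is stored

- id: export
  input:
    type: ram
    type_spec:
      name: xy-data              # reads the data stored by the 'load' pipeline
  output:
    - type: csv
      type_spec:
        file: 'output/xy.csv'
```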
## Graphing
Currently, only TeX graphing, via TikZ and pgfplots, is supported. Hence,
for graph generation to work, TeX needs to be installed along with any
dependent packages.
Graphing consists of two steps: generating TeX graph files from templates, and generating the graphs from the TeX files. To see the default template files run:

```sh
$ post graphfile --outdir=templates
```
The templates can be user-supplied by setting `template_directory` and
`template_main` (if necessary) in the run file configuration. The templates
use Go template syntax; see the package documentation for more information.
An example of a graph entry using these fields is shown after the stub below.
A `tex` graph configuration stub is given below; note that several fields
expect raw TeX as input.
```yaml
type: tex
graphs:
  - name: # used as a basename for all graph related files
    directory: # optional; output directory name, created if not present
    table_file: # optional; needed if 'tables.table_file' is undefined
    template_directory: # optional; template directory
    template_main: # optional; root template file name
    template_delims: # optional; go template delimiters; ['__{','}__'] by default
    tex_command: # optional; 'pdflatex' by default
    axes:
      - x:
          min:
          max:
          label: # raw TeX
        y:
          min:
          max:
          label: # raw TeX
        width: # optional; raw TeX, axis width option
        height: # optional; raw TeX, axis height option
        legend_style: # optional; raw TeX, axis legend style option
        raw_options: # optional; raw TeX, if defined all other options are ignored
        tables:
          - x_field:
            y_field:
            legend_entry: # raw TeX
            col_sep: # optional; 'comma' by default
            table_file: # optional; needed if 'graphs.table_file' is undefined
```
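For example, a sketch of a graph entry pointing post at user-supplied templates might look as follows; the template directory, delimiters, TeX engine and the plotted fields are hypothetical, and the data file is assumed to have been produced by an earlier output section:

```yaml
type: tex
graphs:
  - name: custom.tex
    directory: output
    table_file: 'output/data.csv'
    template_directory: templates     # e.g., generated by 'post graphfile --outdir=templates'
    template_delims: ['<<', '>>']     # hypothetical custom delimiters
    tex_command: lualatex             # hypothetical alternative TeX engine
    axes:
      - x:
          min: 0
          max: 1
          label: '$x$'
        y:
          min: 0
          max: 100
          label: '$y$'
        tables:
          - x_field: x
            y_field: result
            legend_entry: 'result'
```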