quantpylib.simulator.models

quantpylib.simulator.models houses features for statistical analysis involving market and non-market variables. Its main component is the quantpylib.simulator.models.GeneticRegression class, an abstraction layer written on top of the quantpylib.simulator.gene.Gene class and statsmodels.formula.api that performs no-code regression analysis from simple string specifications.

An example scenario for quantitative analysis is a momentum study on the impact of standardized returns on forward returns. We may specify such a regression study by the following regression formula:

forward_1(logret_1()) ~ div(logret_25(),volatility_25()) + tsargmax_16(close)

GeneticRegression handles this in four steps:

1. Parse the formula into blocks:
   - b0: forward_1(logret_1())
   - b1: div(logret_25(),volatility_25())
   - b2: tsargmax_16(close)
2. Construct the equivalent regression specification: b0 ~ b1 + b2
3. Evaluate each block using the evaluator-parser in the Gene class.
4. Pass the evaluated blocks and the regression specification into statsmodels for regression analysis.
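
The parsing and specification steps can be illustrated with a short sketch. This is not the library's implementation: `split_formula` is a hypothetical helper written for this example, and it assumes that the operators `~ + * :` only act as delimiters when they appear outside parentheses. It numbers blocks in order of appearance and does not deduplicate repeated blocks.

```python
def split_formula(formula: str):
    """Split a regression formula into named blocks and a patsy-style spec.

    Hypothetical helper for illustration only; assumes the operators
    ~ + * : delimit blocks only when they occur outside parentheses.
    """
    operators = set("~+*:")
    blocks = {}   # block name -> Gene formula string
    spec = []     # block names and operators, in order
    token = []    # characters of the block currently being read
    depth = 0     # current parenthesis nesting depth

    def flush():
        text = "".join(token).strip()
        token.clear()
        if text:
            name = f"b{len(blocks)}"
            blocks[name] = text
            spec.append(name)

    for ch in formula:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if depth == 0 and ch in operators:
            flush()          # close the block that just ended
            spec.append(ch)  # keep the operator in the derived spec
        else:
            token.append(ch)
    flush()
    return blocks, "".join(spec)


blocks, spec = split_formula(
    "forward_1(logret_1()) ~ div(logret_25(),volatility_25()) + tsargmax_16(close)"
)
print(blocks)
print(spec)
```

Because commas and nested calls sit inside parentheses, they stay within their block; only top-level operators end up in the derived specification.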

This lets the user rely on the well-tested and familiar statsmodels package while gaining the expressive power of a formulaic language specialized for trading analysis. The full list of primitives (constants and functions) is documented here.

We also provide convenience methods for data binning and aggregation, diagnostics and plotting. Their specifications are given below.

The following notations apply in the documentation:

- y, b0 : response variable
- x[*], b[1..] : independent variable(s)
- y^ : fitted response
- uCI : upper confidence interval
- lCI : lower confidence interval
- res : residuals
- res# : (z-score) normalized residuals
- res+ : internally studentized residuals
- res* : externally studentized residuals
- PRP : partial regression plot
- CCPR : component-component plus residual plot
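
For reference, the residual variants listed above have standard textbook definitions, restated below. This assumes the library follows the usual conventions; the symbols X (design matrix), H (hat matrix), h_ii (leverage), e_i (residual), n (observations) and p (fitted parameters) are introduced here for illustration and are not part of the library's notation.

$$
\begin{aligned}
H &= X (X^\top X)^{-1} X^\top, \qquad h_{ii} = H_{ii} \\
e_i &= y_i - \hat{y}_i, \qquad s^2 = \frac{1}{n-p} \sum_{i=1}^{n} e_i^2 \\
\text{res\#}_i &= \frac{e_i - \bar{e}}{\sigma_e} \\
\text{res+}_i &= \frac{e_i}{s \sqrt{1 - h_{ii}}} \\
\text{res*}_i &= \frac{e_i}{s_{(i)} \sqrt{1 - h_{ii}}}, \qquad
s_{(i)}^2 = \frac{(n-p)\, s^2 - e_i^2 / (1 - h_{ii})}{n - p - 1}
\end{aligned}
$$

Here $\bar{e}$ and $\sigma_e$ are the mean and standard deviation of the residuals, and $s_{(i)}$ is the residual standard error of the fit with observation $i$ left out.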

`Bin`

Bases: Enum

Enumeration representing different binning methods.

Attributes:

Name	Type	Description
`WIDTH`		Binning method where each bin has an equal interval length.
`OBSERVATIONS`		Binning method where each bin contains an equal number of observations.
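
The difference between the two methods is easiest to see on skewed data. The sketch below is an illustration using NumPy only, not the library's own binning code; `bin_labels` and its `equal_observations` flag are hypothetical names chosen for this example.

```python
import numpy as np


def bin_labels(values, bins, equal_observations=True):
    """Assign each value a bin index in [0, bins). Illustrative helper only."""
    values = np.asarray(values, dtype=float)
    if equal_observations:
        # Quantile edges: each bin receives roughly the same number of points.
        edges = np.quantile(values, np.linspace(0.0, 1.0, bins + 1))
    else:
        # Evenly spaced edges: each bin spans the same interval length.
        edges = np.linspace(values.min(), values.max(), bins + 1)
    # Use interior edges only, so the minimum lands in bin 0
    # and the maximum lands in bin (bins - 1).
    return np.digitize(values, edges[1:-1], right=True)


data = [1, 2, 3, 4, 5, 6, 7, 100]  # one large outlier

by_obs = np.bincount(bin_labels(data, 2, equal_observations=True), minlength=2)
by_width = np.bincount(bin_labels(data, 2, equal_observations=False), minlength=2)
print(by_obs)    # observations per bin with equal-count bins
print(by_width)  # observations per bin with equal-width bins
```

With equal-count bins the eight points split evenly, while equal-width bins put the outlier alone in the upper bin, which matters when each bin is later collapsed to a single aggregated value.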

`GeneticRegression`

`__init__(formula='forward_1(logret_1()) ~ div(logret_25(),volatility_25())', start=None, end=None, dfs={}, instruments=[], granularity=Period.DAILY, build=True)`

Initializes a GeneticRegression object.

Parameters:

Name	Type	Description	Default
`formula`	`str`	The regression formula. The formula describes the statistical model being analysed, and is closely inspired by the formula mini-language used in R and S. The model formula should consist of valid string representations of `Gene` formula blocks as dependent and independent variables delimited by the operators [ ~ + * : ] of the `patsy` language.	`'forward_1(logret_1()) ~ div(logret_25(),volatility_25())'`
`start`	`datetime`	Start period for regression analysis. If not tz-aware, assumed UTC. If not given, assume min of dataset in dfs.	`None`
`end`	`datetime`	End period for regression analysis. If not tz-aware, assumed UTC. If not given, assume max of dataset in dfs.	`None`
`dfs`	`dict`	Mapping of instrument to the OHLCV (or other) DataFrame used for computations. Default is an empty dictionary.	`{}`
`instruments`	`list`	List of instruments used in the regression analysis.	`[]`
`granularity`	`Period`	The granularity of the regression analysis. Datapoints of lower granularity than specified are ignored. Last known datapoint of multiple entries in the same granularity interval is taken. Default is `Period.DAILY`.	`DAILY`
`build`	`bool`	Whether to evaluate the formulaic blocks upon initialization. Default is True.	`True`

`build()`

Evaluates the formulaic (dependent and independent) blocks to be used as regression variables using the initialized formula and dataframes provided. If build=False at initialization, then build() needs to be called before any of the regression methods, such as ols, are called.

`diagnose()`

Diagnoses the regression model for multicollinearity and other issues.

Returns:

Type	Description
`dict`	A dictionary containing the following keys. `"cond_num"`: condition number of the derived design matrix. `"vif b[1..]"`: variance inflation factor for the relevant regressor variable.
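
To make the two diagnostics concrete, the sketch below computes them directly with NumPy on synthetic data. It is not the library's implementation: `condition_number` and `vif` are hypothetical helpers, the data is randomly generated, and the VIF here regresses each column on the others plus an intercept, which may differ in detail from how the library builds its design matrix.

```python
import numpy as np


def condition_number(X):
    """Ratio of the largest to the smallest singular value of X."""
    return np.linalg.cond(X)


def vif(X, j):
    """Variance inflation factor of column j: 1 / (1 - R^2), where R^2 comes
    from regressing column j on the remaining columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r2)


# Synthetic regressors: x3 is almost a copy of x1, x2 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + 0.1 * rng.normal(size=500)
X = np.column_stack([x1, x2, x3])

print("cond_num:", condition_number(X))
for j in range(X.shape[1]):
    print(f"vif b{j + 1}:", vif(X, j))
```

A VIF near 1 indicates a regressor that is not explained by the others; large values, as for the two nearly collinear columns here, signal multicollinearity.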

`ols(axis='flatten', bins=0, binned_by=Bin.OBSERVATIONS, bin_block='b0', selector=None, aggregator=lambda x: np.mean(winsorize(x, limits=(0.05, 0.05))))`

Performs Ordinary Least Squares (OLS) regression analysis.

Parameters:

Name	Type	Description	Default
`axis`	`str`	The axis along which the regression analysis is performed. Possible values are `flatten`, `xs`, or `ts`. `flatten` uses all of the available data as regression input. `xs` uses cross sectional data on a particular date as regression input. `ts` uses time series data for a particular instrument as regression input.	`'flatten'`
`bins`	`int`	The number of bins for grouping the data. If 0, no binning is performed. Defaults to 0.	`0`
`binned_by`	`Bin`	The method used for binning the data. Possible values are `Bin.OBSERVATIONS` or `Bin.WIDTH`, corresponding to an equal number of observations in each bin and an equal interval length in each bin, respectively.	`OBSERVATIONS`
`bin_block`	`str`	The block used for binning the data. Defaults to `b0`, the dependent variable.	`'b0'`
`selector`	`str or datetime`	`str` instrument when `axis=ts` and `datetime` index when `axis=xs` to perform regression on. Ignored when `axis=flatten`.	`None`
`aggregator`	`callable or dict`	Used for aggregating data within each bin. If callable is provided, then all of the blocks are aggregated using this function. Different aggregators can be provided for different blocks, by providing dictionary containing `block : aggregator` such as `{"b0" : np.mean, "b1" : np.median}`. Defaults to `lambda x:np.mean(winsorize(x, limits=(0.05, 0.05)))`.	`lambda x: mean(winsorize(x, limits=(0.05, 0.05)))`

Returns:

Type	Description
`RegressionResults`	The statsmodels results of the OLS regression analysis.
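
The callable and dictionary forms of `aggregator` can be pictured with the following sketch, which collapses each block to one value per bin. It is an assumption-laden illustration rather than the library's code: `aggregate_bins`, the toy data and the bin labels are all hypothetical, and how the library treats blocks missing from the dictionary is not covered here.

```python
import numpy as np


def aggregate_bins(data, labels, aggregator):
    """Collapse each block to one value per bin. Illustrative helper only.

    data       : dict mapping block name -> 1-D sequence of observations
    labels     : bin index of each observation
    aggregator : a callable applied to every block, or a dict
                 mapping block name -> callable
    """
    labels = np.asarray(labels)
    out = {}
    for block, values in data.items():
        # A dict selects a per-block function; a callable applies to all blocks.
        fn = aggregator[block] if isinstance(aggregator, dict) else aggregator
        values = np.asarray(values, dtype=float)
        out[block] = np.array([fn(values[labels == b]) for b in np.unique(labels)])
    return out


data = {"b0": [1, 2, 3, 10, 20, 30], "b1": [1, 1, 4, 5, 9, 10]}
labels = [0, 0, 0, 1, 1, 1]

same_for_all = aggregate_bins(data, labels, np.mean)
per_block = aggregate_bins(data, labels, {"b0": np.mean, "b1": np.median})
print(same_for_all)
print(per_block)
```

The regression is then run on the aggregated points (one per bin) rather than on the raw observations, so the choice of aggregator determines how much outliers within a bin influence the fit.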

`parse_formula(formula)` `staticmethod`

Obtains the block-formula mapping and the derived patsy formula describing the regression model.

Parameters:

Name	Type	Description	Default
`formula`	`str`	The regression formula. The formula describes the statistical model being analyzed and follows the syntax of the patsy formula. Supports [ ~ + * : ] operators.	required

Returns:

Name	Type	Description
`tuple`		A tuple `(dict, str)`: the mapping of block names to their respective formula blocks, and the derived patsy formula representing the regression model.

Examples:

>>> GeneticRegression.parse_formula("forward_1(logret_1()) ~ div(logret_25(),volatility_25()) + tsargmax_16(close)")
({'b0': 'forward_1(logret_1())', 'b1': 'div(logret_25(),volatility_25())', 'b2': 'tsargmax_16(close)'}, 'b0~b1+b2')

`plot(fit=True, diagnostics=True, influence=True, leverage=True)`

Plots various diagnostic plots for the regression model.

Parameters:

Name	Type	Description	Default
`fit`	`bool`	If `True`, plots, for each regressor x: uCI ~ x, lCI ~ x, y ~ x, y^ ~ x.	`True`
`diagnostics`	`bool`	If `True`, plots, for each regressor x: y ~ x, y^ ~ x, uCI ~ x, lCI ~ x, res ~ x, PRP, CCPR.	`True`
`influence`	`bool`	If `True`, plots res* ~ leverage	`True`
`leverage`	`bool`	If `True`, plots leverage ~ (res#)**2	`True`
