
quantpylib.simulator.models

quantpylib.simulator.models houses features for statistical analysis involving market and non-market variables. It provides the quantpylib.simulator.models.GeneticRegression class, an abstraction layer written on top of the quantpylib.simulator.gene.Gene class and statsmodels.formula.api that performs no-code regression analysis from simple string specifications.

An example scenario for quantitative analysis is a momentum study on the impact of standardized returns on forward returns. We may specify such a regression study by the following regression formula:

forward_1(logret_1()) ~ div(logret_25(),volatility_25()) + tsargmax_16(close)

GeneticRegression handles this in several steps:

  • Parse the formula into blocks:
    • b0: forward_1(logret_1())
    • b1: div(logret_25(),volatility_25())
    • b2: tsargmax_16(close)
  • Construct the equivalent regression specification: b0 ~ b1 + b2
  • Evaluate each block using our evaluator-parser in the Gene class.
  • Pass the evaluated blocks and regression specification into statsmodels for regression analysis.
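
The first two steps can be illustrated with a small standalone sketch. It is not the library's implementation: `split_formula` is a hypothetical helper that only assumes the operators `~ + * :` act as separators when they appear outside parentheses, so that commas and nested calls inside a block are left untouched.

```python
def split_formula(formula: str):
    """Hypothetical helper: split a formula into named blocks b0, b1, ...

    Operators (~ + * :) are treated as separators only at parenthesis
    depth 0, so the contents of each Gene block are preserved verbatim.
    """
    operators = set("~+*:")
    blocks, spec, current, depth = {}, [], [], 0

    def flush():
        # Close the block collected so far and give it the next name.
        text = "".join(current).strip()
        if text:
            name = f"b{len(blocks)}"
            blocks[name] = text
            spec.append(name)
        current.clear()

    for ch in formula:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if depth == 0 and ch in operators:
            flush()
            spec.append(ch)
        else:
            current.append(ch)
    flush()
    return blocks, "".join(spec)


blocks, spec = split_formula(
    "forward_1(logret_1()) ~ div(logret_25(),volatility_25()) + tsargmax_16(close)"
)
print(blocks)
print(spec)  # b0~b1+b2
```

The sketch gives every block a fresh name, even if the same block appears twice; the documented `GeneticRegression.parse_formula` static method is the authoritative way to obtain the mapping.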

This lets the user leverage the well-tested and familiar statsmodels package while gaining the expressive power of a formulaic language specialized for trading analysis. The full list of primitives (constants and functions) is documented here.

We also provide convenience methods for data binning and aggregation, diagnostics, and plotting. Their specifications are given below.

The following notation is used in this documentation:

- y, b0 : response variable
- x[*], b[1..] : independent variable(s)
- y^ : fitted response
- uCI : upper confidence interval
- lCI : lower confidence interval
- res : residuals
- res# : (z-score) normalized residuals
- res+ : internally studentized residuals
- res* : externally studentized residuals
- PRP : partial regression plot
- CCPR : component-component plus residual plot
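
To make the residual notation concrete, the sketch below computes each quantity with NumPy on synthetic data, using the standard textbook OLS definitions. The data, the variable names, and the reading of res# as a plain z-score of the residuals are assumptions for illustration; the library may compute these through statsmodels instead.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)  # synthetic data

X = np.column_stack([np.ones(n), x])   # design matrix with intercept
p = X.shape[1]                         # number of fitted parameters

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta                       # y^  : fitted response
res = y - y_hat                        # res : residuals

# Leverage is the diagonal of the hat matrix H = X (X'X)^-1 X'.
leverage = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

rss = res @ res
s2 = rss / (n - p)                     # residual variance estimate

# res# : z-score normalized residuals (assumed definition)
res_z = (res - res.mean()) / res.std(ddof=1)

# res+ : internally studentized, scaled by the full-sample variance
res_int = res / np.sqrt(s2 * (1.0 - leverage))

# res* : externally studentized, scaled by the leave-one-out variance
s2_loo = (rss - res**2 / (1.0 - leverage)) / (n - p - 1)
res_ext = res / np.sqrt(s2_loo * (1.0 - leverage))
```

The two studentized forms differ only in whether observation i contributes to its own variance estimate, which is why res* reacts more strongly to outliers.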

Bin

Bases: Enum

Enumeration representing different binning methods.

Attributes:

Name Type Description
WIDTH

Binning method where each bin has an equal interval length.

OBSERVATIONS

Binning method where each bin contains an equal number of observations.
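
The difference between the two methods shows up clearly on skewed data. The following NumPy sketch is an illustration only (the sample values and the use of `np.digitize` are assumptions, not the library's internals): equal-width bins split the value range evenly, while equal-observation bins split the sorted sample evenly.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 100.0])
n_bins = 4

# Bin.WIDTH analogue: edges evenly spaced between min and max.
width_edges = np.linspace(values.min(), values.max(), n_bins + 1)
width_labels = np.digitize(values, width_edges[1:-1])

# Bin.OBSERVATIONS analogue: edges placed at sample quantiles.
obs_edges = np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1))
obs_labels = np.digitize(values, obs_edges[1:-1])

width_counts = np.bincount(width_labels, minlength=n_bins)
obs_counts = np.bincount(obs_labels, minlength=n_bins)

print(width_counts)  # [7 0 0 1]
print(obs_counts)    # [2 2 2 2]
```

With a single outlier at 100, equal-width binning leaves two bins empty, whereas equal-observation binning keeps two points in every bin.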

GeneticRegression

__init__(formula='forward_1(logret_1()) ~ div(logret_25(),volatility_25())', intercept=True, df=None, start=None, end=None, dfs={}, instruments=[], granularity=Period.DAILY, build=True)

Initializes a GeneticRegression object.

Parameters:

Name Type Description Default
formula str

The regression formula. The formula describes the statistical model being analysed, and is closely inspired by the formula mini-language used in R and S. The model formula should consist of valid string representations of Gene formula blocks as dependent and independent variables delimited by the operators [ ~ + * : ] of the patsy language.

'forward_1(logret_1()) ~ div(logret_25(),volatility_25())'
start datetime

Start period for regression analysis. If not tz-aware, assumed UTC. If not given, assume min of dataset in dfs.

None
end datetime

End period for regression analysis. If not tz-aware, assumed UTC. If not given, assume max of dataset in dfs.

None
dfs dict

Mapping of inst : DataFrame, holding the OHLCV or other DataFrames used for computations. Default is an empty dictionary.

{}
instruments list

List of instruments used in the regression analysis.

[]
granularity Period

The granularity of the regression analysis. Datapoints of lower granularity than specified are ignored. When multiple entries fall in the same granularity interval, the last known datapoint is taken. Default is Period.DAILY.

DAILY
build bool

Whether to evaluate the formulaic blocks upon initialization. Default is True.

True
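
The start/end and granularity rules above can be pictured with a short standard-library sketch. The helpers `as_utc` and `last_per_day` are hypothetical and only mirror the documented behaviour (naive timestamps are read as UTC, and the last datapoint within a daily interval is kept); they are not part of quantpylib.

```python
from datetime import datetime, timezone


def as_utc(ts: datetime) -> datetime:
    """Treat naive timestamps as UTC; convert tz-aware ones to UTC."""
    if ts.tzinfo is None:
        return ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc)


def last_per_day(rows):
    """Keep the last observation per UTC calendar day.

    rows: iterable of (timestamp, value) pairs, in any order.
    """
    daily = {}
    for ts, value in sorted(rows, key=lambda row: as_utc(row[0])):
        daily[as_utc(ts).date()] = value  # later entries overwrite earlier ones
    return daily


rows = [
    (datetime(2024, 1, 1, 9, 0), 100.0),
    (datetime(2024, 1, 1, 16, 0), 101.5),
    (datetime(2024, 1, 2, 16, 0), 99.0),
]
daily = last_per_day(rows)
print(daily)
```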

build()

Evaluates the formulaic (dependent and independent) blocks to be used as regression variables using the initialized formula and dataframes provided. If build=False at initialization, then build() needs to be called before any of the regression methods, such as ols, are called.

diagnose()

Diagnoses the regression model for multicollinearity and other issues.

Returns:

Type Description
dict

A dictionary containing the following:

  • "cond_num": Condition number of the derived design matrix
  • "vif b[1..]" : variance inflation factor for the relevant regressor variable.
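
For readers unfamiliar with these diagnostics, here is a NumPy sketch of what the two quantities measure, using the textbook definitions on synthetic regressors. The `vif` helper, the data, and the exact dictionary key spelling are assumptions; the values returned by `diagnose()` come from the library itself.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
b1 = rng.normal(size=n)
b2 = 0.9 * b1 + 0.1 * rng.normal(size=n)  # strongly collinear with b1
b3 = rng.normal(size=n)                   # unrelated regressor

# Design matrix: intercept column followed by the regressors.
X = np.column_stack([np.ones(n), b1, b2, b3])


def vif(X: np.ndarray, j: int) -> float:
    """Variance inflation factor: 1 / (1 - R^2) of column j on the rest."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    coef, *_ = np.linalg.lstsq(others, target, rcond=None)
    resid = target - others @ coef
    centered = target - target.mean()
    r2 = 1.0 - (resid @ resid) / (centered @ centered)
    return 1.0 / (1.0 - r2)


report = {"cond_num": float(np.linalg.cond(X))}
for j, name in enumerate(["b1", "b2", "b3"], start=1):
    report[f"vif {name}"] = vif(X, j)

print(report)
```

In this setup the collinear pair b1 and b2 shows large inflation factors while b3 stays close to 1, which is the pattern to look for when a regression's coefficients are unstable.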

ols(axis='flatten', bins=0, binned_by=Bin.OBSERVATIONS, bin_block='b0', selector=None, aggregator=lambda x: np.mean(winsorize(x, limits=(0.05, 0.05))))

Performs Ordinary Least Squares (OLS) regression analysis.

Parameters:

Name Type Description Default
axis str

The axis along which the regression analysis is performed. Possible values are flatten, xs, or ts. flatten uses all of the available data as regression input. xs uses cross sectional data on a particular date as regression input. ts uses time series data for a particular instrument as regression input.

'flatten'
bins int

The number of bins for grouping the data. If 0, no binning is performed. Defaults to 0.

0
binned_by Bin

The method used for binning the data. Possible values are Bin.OBSERVATIONS or Bin.WIDTH, corresponding to an equal number of observations in each bin and an equal interval length in each bin respectively.

OBSERVATIONS
bin_block str

The block used for binning the data. Defaults to b0, the dependent variable.

'b0'
selector str or datetime

The instrument (str) when axis=ts, or the datetime index when axis=xs, on which to perform the regression. Ignored when axis=flatten.

None
aggregator callable or dict

Used for aggregating data within each bin. If a callable is provided, all of the blocks are aggregated using this function. Different aggregators can be used for different blocks by providing a dictionary of block : aggregator pairs, such as {"b0" : np.mean, "b1" : np.median}. Defaults to lambda x:np.mean(winsorize(x, limits=(0.05, 0.05))).

lambda x: mean(winsorize(x, limits=(0.05, 0.05)))

Returns:

Type Description
RegressionResults

The statsmodels results of the OLS regression analysis.
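
The interplay of bins and aggregator can be sketched as follows. This is an assumed reading of the parameters, not the library's code: observations are first assigned to bins, each block is then reduced to one value per bin, and the regression would run on those reduced points. `aggregate_bins` and the sample arrays are hypothetical.

```python
import numpy as np


def aggregate_bins(blocks, labels, aggregator):
    """Reduce every block to one value per bin.

    blocks:     dict of block name -> 1-D array of observations
    labels:     bin index of each observation
    aggregator: a callable applied to all blocks, or a dict of
                block name -> callable
    """
    out = {}
    for name, values in blocks.items():
        func = aggregator[name] if isinstance(aggregator, dict) else aggregator
        out[name] = np.array(
            [func(values[labels == k]) for k in np.unique(labels)]
        )
    return out


blocks = {
    "b0": np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]),
    "b1": np.array([10.0, 30.0, 20.0, 40.0, 60.0, 500.0]),
}
labels = np.array([0, 0, 0, 1, 1, 1])  # e.g. two equal-observation bins on b0

binned = aggregate_bins(blocks, labels, {"b0": np.mean, "b1": np.median})
print(binned["b0"])  # [2. 5.]
print(binned["b1"])  # [20. 60.]
```

Using the median for b1 keeps the outlier 500 from dominating its bin, which is the same motivation as the winsorized mean used by default.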

parse_formula(formula) staticmethod

Obtains the block-formula mapping and the derived patsy formula describing the regression model.

Parameters:

Name Type Description Default
formula str

The regression formula. The formula describes the statistical model being analyzed and follows the syntax of the patsy formula. Supports [ ~ + * : ] operators.

required

Returns:

Name Type Description
tuple

A tuple containing

  • dict: Mapping of block names to their respective formula blocks.
  • str: The derived patsy formula representing the regression model.

Examples:

>>> GeneticRegression.parse_formula("forward_1(logret_1()) ~ div(logret_25(),volatility_25()) + tsargmax_16(close)")
({'b0': 'forward_1(logret_1())', 'b1': 'div(logret_25(),volatility_25())', 'b2': 'tsargmax_16(close)'}, 'b0~b1+b2')

plot(fit=True, diagnostics=True, influence=True, leverage=True)

Plots various diagnostic plots for the regression model.

Parameters:

Name Type Description Default
fit bool

If True, plots, for each regressor x,

  • uCI ~ x
  • lCI ~ x
  • y ~ x
  • y^ ~ x
True
diagnostics bool

If True, plots, for each regressor x,

  • y ~ x, y^ ~ x, uCI ~ x, lCI ~ x
  • res ~ x
  • PRP
  • CCPR
True
influence bool

If True, plots

  • res* ~ leverage
True
leverage bool

If True, plots

  • leverage ~ (res#)**2
True