Modelling Utilities

balsa.routines.modelling.distance_array(x0: ndarray | Series, y0: ndarray | Series, x1: ndarray | Series, y1: ndarray | Series, *, method: str = 'euclidean', **kwargs) → ndarray | Series

Fast method to compute the distance between two sets of (x, y) points, represented by four separate coordinate arrays, using the NumExpr package. Supports several equations for computing distances.

Parameters:
  • x0 (numpy.ndarray | pandas.Series) – X or Lon coordinate of first point

  • y0 (numpy.ndarray | pandas.Series) – Y or Lat coordinate of first point

  • x1 (numpy.ndarray | pandas.Series) – X or Lon coordinate of second point

  • y1 (numpy.ndarray | pandas.Series) – Y or Lat coordinate of second point

  • method (str, optional) – Defaults to 'EUCLIDEAN'. Specifies the method by which to compute distance. Valid options are: 'EUCLIDEAN', which computes straight-line, ‘as-the-crow-flies’ distance; 'MANHATTAN', which computes the Manhattan distance; and 'HAVERSINE', which computes distance based on lon/lat.

  • **kwargs – Additional scalars to pass into the evaluation context

Kwargs:
  • coord_unit (float) – Factor applied directly to the result, defaulting to 1.0 (no conversion). Useful when the coordinates are provided in one unit (e.g. m) and the desired result is in a different unit (e.g. km). Only used for Euclidean or Manhattan distance.

  • earth_radius_factor (float) – Factor to convert from km to other units when using Haversine distance.

Returns:

Distance from the vector of first points to the vector of second points. A Series is returned when one or more of the coordinate arrays are given as a Series object.

Return type:

numpy.ndarray or pandas.Series
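
The three distance methods described above can be illustrated with a short NumPy sketch. This is not balsa's implementation (which, per the description, relies on NumExpr); the helper name distance_sketch, the mean Earth radius of 6371 km, and the assumption that both factors are simple multipliers on the result are all illustrative choices.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius for this sketch


def distance_sketch(x0, y0, x1, y1, method="EUCLIDEAN",
                    coord_unit=1.0, earth_radius_factor=1.0):
    """Hypothetical NumPy rendering of the three distance equations."""
    x0, y0, x1, y1 = (np.asarray(a, dtype=float) for a in (x0, y0, x1, y1))
    method = method.upper()
    if method == "EUCLIDEAN":
        # Straight-line distance, scaled by the unit conversion factor.
        return np.sqrt((x1 - x0) ** 2 + (y1 - y0) ** 2) * coord_unit
    if method == "MANHATTAN":
        # Sum of the absolute coordinate differences.
        return (np.abs(x1 - x0) + np.abs(y1 - y0)) * coord_unit
    if method == "HAVERSINE":
        # x is longitude and y is latitude, both in degrees.
        lon0, lat0, lon1, lat1 = (np.radians(a) for a in (x0, y0, x1, y1))
        a = (np.sin((lat1 - lat0) / 2) ** 2
             + np.cos(lat0) * np.cos(lat1) * np.sin((lon1 - lon0) / 2) ** 2)
        return 2 * EARTH_RADIUS_KM * earth_radius_factor * np.arcsin(np.sqrt(a))
    raise ValueError(f"Unknown method: {method}")


# Coordinates in metres; coord_unit=0.001 converts the result to km.
x0 = np.array([0.0, 1000.0])
y0 = np.array([0.0, 1000.0])
x1 = np.array([3000.0, 1000.0])
y1 = np.array([4000.0, 3000.0])

euclid_km = distance_sketch(x0, y0, x1, y1, coord_unit=0.001)
manhattan_km = distance_sketch(x0, y0, x1, y1, method="MANHATTAN", coord_unit=0.001)
```

With these inputs the Euclidean result is 5 km and 2 km, and the Manhattan result is 7 km and 2 km. The Haversine branch ignores coord_unit, mirroring the note that it applies only to Euclidean or Manhattan distance.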

balsa.routines.modelling.distance_matrix(x0: ndarray | Series, y0: ndarray | Series, *, labels0: Iterable | Index = None, tall: bool = False, x1: ndarray | Series = None, y1: ndarray | Series = None, labels1: ndarray | Series = None, method: str = 'EUCLIDEAN', **kwargs) → Series | DataFrame | ndarray

Fastest method of computing a distance matrix from vectors of coordinates, using the NumExpr package. Supports several equations for computing distances.

Accepts two or four vectors of x-y coordinates. If only two vectors are provided (x0, y0), the result will be the 2D product of this vector with itself (vector0 * vector0). If all four are provided (x0, y0, x1, y1), the result will be the 2D product of the first and second vector (vector0 * vector1).

Parameters:
  • x0 (numpy.ndarray | pandas.Series) – Vector of x-coordinates, of length N0. Can be a Series to specify labels.

  • y0 (numpy.ndarray | pandas.Series) – Vector of y-coordinates, of length N0. Can be a Series to specify labels.

  • labels0 (pandas.Index-like, optional) – Defaults to None. Override set of labels to use if x0 and y0 are both raw NumPy arrays.

  • x1 (numpy.ndarray | pandas.Series, optional) – Defaults to None. A second vector of x-coordinates, of length N1. Can be a Series to specify labels

  • y1 (numpy.ndarray | pandas.Series, optional) – Defaults to None. A second vector of y-coordinates, of length N1. Can be a Series to specify labels

  • labels1 (pandas.Index-like, optional) – Defaults to None. Override set of labels to use if x1 and y1 are both raw NumPy arrays.

  • tall (bool, optional) – Defaults to False. If True, returns a vector of length N0 x N1. Otherwise, returns a matrix whose shape is (N0, N1).

  • method (str, optional) – Defaults to 'EUCLIDEAN'. Specifies the method by which to compute distance. Valid options are: 'EUCLIDEAN', which computes straight-line, ‘as-the-crow-flies’ distance; 'MANHATTAN', which computes the Manhattan distance; and 'HAVERSINE', which computes distance based on lon/lat.

  • **kwargs – Additional scalars to pass into the evaluation context

Kwargs:
  • coord_unit (float) – Factor applied directly to the result, defaulting to 1.0 (no conversion). Useful when the coordinates are provided in one unit (e.g. m) and the desired result is in a different unit (e.g. km). Only used for Euclidean or Manhattan distance.

  • earth_radius_factor (float) – Factor to convert from km to other units when using Haversine distance.

Returns:

A Series is returned when tall=True and labels can be inferred; it always has a 2-level MultiIndex. A DataFrame is returned when tall=False and labels can be inferred. An ndarray is returned when labels cannot be inferred: if tall=True the array is 1-dimensional, with shape (N0 x N1,); otherwise it is 2-dimensional, with shape (N0, N1).

Return type:

pandas.Series, pandas.DataFrame or numpy.ndarray

Note

The type of the returned object depends on whether labels can be inferred from the arguments. Labels can always be inferred when the labels argument is specified, in which case the returned value uses the cross-product of the labels vector.

Otherwise, the function tries to infer the labels from the x and y objects, if one or both of them are provided as Series.
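
The relationship between the wide (DataFrame) and tall (Series) results can be sketched with NumPy broadcasting and pandas. This is an illustration of the layout only, not balsa's implementation; the zone labels and coordinates are made up, and only the Euclidean method with a single pair of vectors is shown.

```python
import numpy as np
import pandas as pd

# Hypothetical zone centroids; labels are carried by the Series index.
x = pd.Series([0.0, 3.0, 0.0], index=[101, 102, 103])
y = pd.Series([0.0, 4.0, 8.0], index=[101, 102, 103])

# Broadcasting a column against a row yields every origin-destination pair.
dx = x.to_numpy()[:, None] - x.to_numpy()[None, :]
dy = y.to_numpy()[:, None] - y.to_numpy()[None, :]
dist = np.sqrt(dx ** 2 + dy ** 2)

# Wide layout: an (N, N) DataFrame labelled on both axes.
wide = pd.DataFrame(dist, index=x.index, columns=x.index)

# Tall layout: N x N values indexed by the cross-product of the labels.
pairs = pd.MultiIndex.from_product([x.index, x.index])
tall = pd.Series(dist.ravel(), index=pairs)
```

Here wide.loc[101, 102] and tall.loc[(101, 102)] both hold the same distance (5.0), which is why the tall form is convenient for joining onto a table of trips keyed by origin and destination.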

balsa.routines.modelling.tlfd(values: ndarray | Series, *, bin_start: int = 0, bin_end: int = 200, bin_step: int = 2, weights: ndarray | Series = None, intrazonal: ndarray | Series = None, label_type: str = 'MULTI', include_top: bool = False) → Series

Generates a Trip Length Frequency Distribution (i.e. a histogram) from given data. Produces a “pretty” Pandas object suitable for charting.

Parameters:
  • values (numpy.ndarray | pandas.Series) – A vector of trip lengths, with a length of “N”. Can be provided from a table of trips, or from a matrix (in “tall” format).

  • bin_start (int, optional) – Defaults to 0. The minimum bin value, in the same units as values.

  • bin_end (int, optional) – Defaults to 200. The maximum bin value, in the same units as values. Values over this limit are either ignored or counted under a separate category (see include_top).

  • bin_step (int, optional) – Defaults to 2. The size of each bin, in the same units as values.

  • weights (numpy.ndarray | pandas.Series, optional) – Defaults to None. A vector of weights of length “N”, used to produce a weighted histogram.

  • intrazonal (numpy.ndarray | pandas.Series, optional) – Defaults to None. A boolean vector indicating which values are considered “intrazonal”. When specified, prepends an intrazonal category to the front of the histogram.

  • label_type (str, optional) – Defaults to 'MULTI'. The format of the returned index. Options are:
      - MULTI: The returned index will be a 2-level MultiIndex [‘from’, ‘to’];
      - TEXT: The returned index will be text-based: “0 to 2”;
      - BOTTOM: The returned index will be the bottom of each bin; and
      - TOP: The returned index will be the top of each bin.

  • include_top (bool, optional) – Defaults to False. If True, the function will count all values (and weights, if provided) above bin_end, and add them to the returned Series. This bin is described as going from bin_end to inf.

Returns:

The weighted or unweighted histogram, depending on the options configured above.

Return type:

pandas.Series
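
The binning described above can be approximated with numpy.histogram. This is a sketch under stated assumptions rather than balsa's implementation: the trip lengths and weights are invented, the bin edges follow NumPy's convention (each bin is closed on the left, and the last bin is also closed on the right), and the overflow total stands in for what include_top=True is described as adding.

```python
import numpy as np
import pandas as pd

# Hypothetical trip lengths and expansion weights, both of length N.
lengths = np.array([0.5, 1.0, 3.0, 3.5, 7.0, 12.0])
weights = np.array([1.0, 2.0, 1.0, 1.0, 4.0, 5.0])

bin_start, bin_end, bin_step = 0, 8, 2
edges = np.arange(bin_start, bin_end + bin_step, bin_step)  # 0, 2, 4, 6, 8

# Weighted histogram: each bin sums the weights of the trips falling in it.
counts, _ = np.histogram(lengths, bins=edges, weights=weights)

# A 'MULTI'-style index: one (from, to) pair per bin.
index = pd.MultiIndex.from_arrays([edges[:-1], edges[1:]], names=["from", "to"])
hist = pd.Series(counts, index=index)

# Total weight of trips longer than bin_end (the overflow category).
top = weights[lengths > bin_end].sum()
```

The resulting Series holds 3.0, 2.0, 0.0 and 4.0 for the bins 0 to 2, 2 to 4, 4 to 6 and 6 to 8, and the overflow total is 5.0. Whether tlfd treats bin edges the same way as numpy.histogram is not stated in the text above, so values that fall exactly on an edge may be counted differently.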