Matrix Utilities

balsa.routines.matrices.aggregate_matrix(matrix: DataFrame | Series, *, groups: Series | NDArray = None, row_groups: Series | NDArray = None, col_groups: Series | NDArray = None, aggfunc: Callable[[Iterable[int | float]], int | float] = np.sum) -> DataFrame | Series

Aggregates a matrix based on mappings provided for each axis, using a specified aggregation function.

Parameters:
  • matrix (pandas.DataFrame | pandas.Series) – Matrix data to aggregate. DataFrames and Series with 2-level indices are supported.

  • groups (pandas.Series | NDArray, optional) – Shorthand that applies the same grouping series to both row_groups and col_groups.

  • row_groups (pandas.Series | NDArray, optional) – Groups for the rows. If aggregating a DataFrame, this must match the index of the matrix. For a “tall” matrix (a Series with a 2-level index), the grouping series can match either the “full” index of the matrix or only its first level (the same as when aggregating a DataFrame). Alternatively, an array can be provided, but it must be the same length as the DataFrame’s index, or the full length of the Series.

  • col_groups (pandas.Series | NDArray, optional) – Groups for the columns. If aggregating a DataFrame, this must match the columns of the matrix. For a “tall” matrix (a Series with a 2-level index), the grouping series can match either the “full” index of the matrix or only its second level (the same as when aggregating a DataFrame). Alternatively, an array can be provided, but it must be the same length as the DataFrame’s columns, or the full length of the Series.

  • aggfunc (Callable, optional) – The aggregation function to use. Defaults to np.sum.

Returns:

The aggregated matrix, in the same type as was provided, e.g. Series -> Series, DataFrame -> DataFrame.

Return type:

pandas.Series or pandas.DataFrame

Example

matrix:

       1  2  3  4  5  6  7
    1  2  1  9  6  7  8  5
    2  4  1  1  4  8  7  6
    3  5  8  5  3  5  9  4
    4  1  1  2  9  4  9  9
    5  6  3  4  6  9  9  3
    6  7  2  5  8  2  5  9
    7  3  1  8  6  3  5  6

groups:

    1  A
    2  B
    3  A
    4  A
    5  C
    6  C
    7  B

new_matrix = aggregate_matrix(matrix, groups=groups)

new_matrix:

        A   B   C
    A  42  28  42
    B  26  14  23
    C  36  17  25
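The example above can be reproduced with plain pandas, which may help when checking results by hand. This is a minimal sketch of the same aggregation using two groupby passes; it is not balsa's implementation, and the variable names are illustrative only.

```python
import pandas as pd

zones = [1, 2, 3, 4, 5, 6, 7]
matrix = pd.DataFrame(
    [
        [2, 1, 9, 6, 7, 8, 5],
        [4, 1, 1, 4, 8, 7, 6],
        [5, 8, 5, 3, 5, 9, 4],
        [1, 1, 2, 9, 4, 9, 9],
        [6, 3, 4, 6, 9, 9, 3],
        [7, 2, 5, 8, 2, 5, 9],
        [3, 1, 8, 6, 3, 5, 6],
    ],
    index=zones,
    columns=zones,
)
groups = pd.Series(["A", "B", "A", "A", "C", "C", "B"], index=zones)

# Sum the rows that share a group label.
by_row = matrix.groupby(groups).sum()

# Transpose so the original columns become rows, group them the same way,
# then transpose back to get row groups x column groups.
aggregated = by_row.T.groupby(groups).sum().T

print(aggregated)
```

The cell for row group A and column group B, for instance, is the sum of rows 1, 3 and 4 over columns 2 and 7.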

balsa.routines.matrices.disaggregate_matrix(matrix: DataFrame, *, mapping: Series = None, proportions: Series = None, row_mapping: Series = None, row_proportions: Series = None, col_mapping: Series = None, col_proportions: Series = None) -> DataFrame

Splits multiple rows and columns in a matrix all at once. The cells in the matrix MUST be numeric, but the row and column labels need not be.

Parameters:
  • matrix (pandas.DataFrame) – The input matrix to disaggregate.

  • mapping (pandas.Series, optional) – Dict-like Series of “New label”: “Old label”. Sets both the row_mapping and col_mapping arguments if provided (resulting in a square matrix).

  • proportions (pandas.Series, optional) – Dict-like Series of “New label”: “Proportion of old label”. Its index must match the index of the mapping argument. Sets both the row_proportions and col_proportions arguments if provided.

  • row_mapping (pandas.Series, optional) – Same as mapping, except applied only to the rows.

  • row_proportions (pandas.Series, optional) – Same as proportions, except applied only to the rows.

  • col_mapping (pandas.Series, optional) – Same as mapping, except applied only to the columns.

  • col_proportions (pandas.Series, optional) – Same as proportions, except applied only to the columns.

Returns:

An expanded DataFrame with the new indices. The new matrix will sum to the same total as the original.

Return type:

pandas.DataFrame

Examples

df:

        A   B   C
    A  10  30  20
    B  20  10  10
    C  30  20  20

correspondence:

    new  old  prop
    A1   A    0.25
    A2   A    0.75
    B1   B    0.55
    B2   B    0.45
    C1   C    0.62
    C2   C    0.38

new_matrix = disaggregate_matrix(df, mapping=correspondence['old'], proportions=correspondence['prop'])

new_matrix:

    new      A1      A2      B1      B2      C1      C2
    A1    0.625   1.875   4.125   3.375   3.100   1.900
    A2    1.875   5.625  12.375  10.125   9.300   5.700
    B1    2.750   8.250   3.025   2.475   3.410   2.090
    B2    2.250   6.750   2.475   2.025   2.790   1.710
    C1    4.650  13.950   6.820   5.580   7.688   4.712
    C2    2.850   8.550   4.180   3.420   4.712   2.888
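Each cell in the result is the original cell multiplied by the row proportion and the column proportion of the new labels. The sketch below reproduces the example with plain pandas and NumPy to make that arithmetic explicit; it is an illustration under that assumption, not balsa's implementation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[10, 30, 20], [20, 10, 10], [30, 20, 20]],
    index=list("ABC"),
    columns=list("ABC"),
)
correspondence = pd.DataFrame(
    {
        "old": ["A", "A", "B", "B", "C", "C"],
        "prop": [0.25, 0.75, 0.55, 0.45, 0.62, 0.38],
    },
    index=pd.Index(["A1", "A2", "B1", "B2", "C1", "C2"], name="new"),
)
mapping = correspondence["old"]
proportions = correspondence["prop"]

# Repeat each old row and column once per new label, then relabel.
old_labels = mapping.to_numpy()
expanded = df.loc[old_labels, old_labels]
expanded.index = mapping.index
expanded.columns = mapping.index

# Scale every cell by (row proportion) x (column proportion).
weights = np.outer(proportions.to_numpy(), proportions.to_numpy())
new_matrix = expanded * weights

print(new_matrix.round(3))
```

Because the proportions for each old label sum to 1.0, the expanded matrix keeps the original total of 170.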

balsa.routines.matrices.fast_stack(frame: DataFrame, multi_index: MultiIndex, *, deep_copy: bool = True) -> Series

Performs the same action as DataFrame.stack(), but provides better performance when the target stacked index is known beforehand. Useful when converting many matrices from “wide” to “tall” format. The inverse of fast_unstack().

Notes

This function does not check that the entries in the multi_index are compatible with the index and columns of the source DataFrame, only that the lengths are compatible. It can therefore be used to assign a whole new set of labels to the result.

Parameters:
  • frame (pandas.DataFrame) – The DataFrame to stack.

  • multi_index (pandas.MultiIndex) – The 2-level MultiIndex known ahead-of-time.

  • deep_copy (bool, optional) – Defaults to True. A flag indicating if the returned Series should be a view of the underlying data (deep_copy=False) or a copy of it (deep_copy=True). A deep copy takes a little longer to convert and takes up more memory but preserves the original data of the DataFrame. The default value of True is recommended for most uses.

Returns:

The stacked data.

Return type:

pandas.Series
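The idea of stacking against a known index can be illustrated with plain pandas: flatten the values row by row and attach the prepared MultiIndex directly. This is a sketch of the concept only, not balsa's implementation, and it always copies the data.

```python
import pandas as pd

frame = pd.DataFrame([[1, 2], [3, 4]], index=[10, 20], columns=[10, 20])

# Build the target index once; it can be reused for every matrix
# that shares the same zone system.
multi_index = pd.MultiIndex.from_product([frame.index, frame.columns])

# Row-major flattening matches the ordering produced by from_product.
stacked = pd.Series(frame.to_numpy().ravel(), index=multi_index)

print(stacked)
```

As with fast_stack(), only the lengths have to agree here: nothing checks that the labels in multi_index correspond to the labels of the frame.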

balsa.routines.matrices.fast_unstack(series: Series, index: Index, columns: Index, *, deep_copy: bool = True) -> DataFrame

Performs the same action as Series.unstack(), but provides better performance when the target unstacked index and columns are known beforehand. Useful when converting many matrices from “tall” to “wide” format. The inverse of fast_stack().

Notes

This function does not check that the entries in index and columns are compatible with the MultiIndex of the source Series, only that the lengths are compatible. It can therefore be used to assign a whole new set of labels to the result.

Parameters:
  • series (pandas.Series) – The Series with a 2-level MultiIndex to unstack.

  • index (pandas.Index) – The row index known ahead-of-time.

  • columns (pandas.Index) – The columns index known ahead-of-time.

  • deep_copy (bool) – Defaults to True. A flag indicating if the returned DataFrame should be a view of the underlying data (deep_copy=False) or a copy of it (deep_copy=True). A deep copy takes a little longer to convert and takes up more memory but preserves the original data of the Series. The default value of True is recommended for most uses.

Returns:

The unstacked data.

Return type:

pandas.DataFrame
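The reverse operation can be sketched the same way: reshape the flat values into rows and columns and attach the known labels. This is a conceptual illustration in plain pandas, not balsa's implementation, and it assumes the Series is ordered row by row.

```python
import pandas as pd

index = pd.Index([10, 20])
columns = pd.Index([10, 20, 30])
series = pd.Series(
    [1, 2, 3, 4, 5, 6],
    index=pd.MultiIndex.from_product([index, columns]),
)

# Reshape the flat values into len(index) rows by len(columns) columns.
values = series.to_numpy().reshape(len(index), len(columns))
wide = pd.DataFrame(values, index=index, columns=columns)

print(wide)
```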

balsa.routines.matrices.matrix_balancing_1d(m: ndarray[Any, dtype[ScalarType]], a: ndarray[Any, dtype[ScalarType]], axis: int) -> ndarray[Any, dtype[ScalarType]]

Balances a matrix using a single constraint.

Parameters:
  • m (NDArray) – The matrix (a 2-dimensional ndarray) to be balanced.

  • a (NDArray) – The totals vector (a 1-dimensional ndarray) constraint.

  • axis (int) – Direction to constrain (0 = along columns, 1 = along rows).

Returns:

A balanced matrix

Return type:

NDArray
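Single-constraint balancing amounts to scaling each row (or column) by the ratio of its target total to its current sum. The sketch below shows that calculation in NumPy. The function name balance_1d is hypothetical, and the axis convention used here (the same as numpy.sum, so axis=1 makes each row sum match a) is an assumption that may not match balsa's.

```python
import numpy as np


def balance_1d(m, a, axis):
    """Scale m so that m.sum(axis=axis) equals a (hypothetical helper)."""
    m = np.asarray(m, dtype=float)
    a = np.asarray(a, dtype=float)
    sums = m.sum(axis=axis)
    # Rows or columns that sum to zero cannot be scaled; give them factor 0.
    factors = np.divide(a, sums, out=np.zeros_like(a), where=sums != 0)
    if axis == 1:
        return m * factors[:, np.newaxis]  # one factor per row
    return m * factors[np.newaxis, :]  # one factor per column


m = np.array([[1.0, 3.0], [2.0, 2.0]])
a = np.array([8.0, 2.0])
balanced = balance_1d(m, a, axis=1)

print(balanced)
print(balanced.sum(axis=1))
```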

balsa.routines.matrices.matrix_balancing_2d(m: ndarray[Any, dtype[ScalarType]] | DataFrame, a: ndarray[Any, dtype[ScalarType]], b: ndarray[Any, dtype[ScalarType]], *, totals_to_use: str = 'raise', max_iterations: int = 1000, rel_error: float = 0.0001, n_threads: int = 1) -> Tuple[ndarray[Any, dtype[ScalarType]] | DataFrame, float, int]

Balances a two-dimensional matrix using iterative proportional fitting.

Parameters:
  • m (NDArray | pandas.DataFrame) – The matrix (a 2-dimensional ndarray) to be balanced. If a DataFrame is supplied, the output will be returned as a DataFrame.

  • a (NDArray) – The row totals (a 1-dimensional ndarray) to use for balancing.

  • b (NDArray) – The column totals (a 1-dimensional ndarray) to use for balancing.

  • totals_to_use (str, optional) – Defaults to 'raise'. Describes how to scale the row and column totals if their sums do not match. Must be one of ‘rows’, ‘columns’, ‘average’, or ‘raise’:
      - rows: scales the column totals so that their sum matches the sum of the row totals
      - columns: scales the row totals so that their sum matches the sum of the column totals
      - average: scales both the row and column totals to the average of their sums
      - raise: raises an Exception if the sums of the row and column totals do not match

  • max_iterations (int, optional) – Defaults to 1000. Maximum number of iterations.

  • rel_error (float, optional) – Defaults to 1.0E-4. Relative error stopping criterion.

  • n_threads (int, optional) – Defaults to 1. Number of processors for parallel computation. (Not used)

Returns:

A tuple of the balanced matrix, the residual, and the number of iterations performed.

Return type:

Tuple[NDArray | pandas.DataFrame, float, int]
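Iterative proportional fitting alternates between scaling the rows to match the row totals and scaling the columns to match the column totals until the error is small enough. The NumPy sketch below illustrates the loop. The function name ipf_2d and the residual definition (total absolute row error divided by the sum of the row totals) are assumptions made for illustration, and the sketch assumes the two sets of totals already have equal sums.

```python
import numpy as np


def ipf_2d(m, a, b, max_iterations=1000, rel_error=1e-4):
    """Hypothetical IPF helper returning (matrix, residual, n_iterations)."""
    m = np.array(m, dtype=float)  # work on a copy of the seed matrix
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    residual, n_iterations = np.inf, 0
    while n_iterations < max_iterations and residual > rel_error:
        # Row step: scale each row to its target total.
        row_sums = m.sum(axis=1)
        row_factors = np.divide(a, row_sums, out=np.zeros_like(a), where=row_sums != 0)
        m *= row_factors[:, np.newaxis]
        # Column step: scale each column to its target total.
        col_sums = m.sum(axis=0)
        col_factors = np.divide(b, col_sums, out=np.zeros_like(b), where=col_sums != 0)
        m *= col_factors[np.newaxis, :]
        # Columns now match exactly, so measure the remaining row error.
        residual = np.abs(m.sum(axis=1) - a).sum() / a.sum()
        n_iterations += 1
    return m, residual, n_iterations


seed = np.array([[1.0, 2.0], [3.0, 4.0]])
row_totals = np.array([40.0, 60.0])
col_totals = np.array([50.0, 50.0])
balanced, residual, n_iterations = ipf_2d(seed, row_totals, col_totals)

print(balanced.sum(axis=1), balanced.sum(axis=0), n_iterations)
```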

balsa.routines.matrices.matrix_bucket_rounding(m: ndarray[Any, dtype[ScalarType]] | DataFrame, *, decimals: int = 0) -> ndarray[Any, dtype[ScalarType]] | DataFrame

Bucket rounds to the given number of decimals.

Parameters:
  • m (NDArray | pandas.DataFrame) – The matrix to be rounded.

  • decimals (int, optional) – Defaults to 0. Number of decimal places to round to. If decimals is negative, it specifies the number of positions to the left of the decimal point.

Returns:

The rounded matrix

Return type:

NDArray | pandas.DataFrame
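Bucket rounding generally refers to rounding in which the remainder lost in one cell is carried into the next, so that the matrix total is approximately preserved. The sketch below shows one simple running-remainder variant in NumPy. The function name bucket_round and the row-major carry order are assumptions, and balsa's algorithm may distribute the remainder differently.

```python
import numpy as np


def bucket_round(m, decimals=0):
    """Round cells in row-major order, carrying the remainder forward."""
    scale = 10.0 ** decimals
    flat = np.asarray(m, dtype=float).ravel() * scale
    out = np.empty_like(flat)
    carry = 0.0
    for i, value in enumerate(flat):
        target = value + carry  # add what earlier cells lost or gained
        out[i] = np.round(target)
        carry = target - out[i]  # remainder passed to the next cell
    return out.reshape(np.shape(m)) / scale


m = np.array([[0.4, 0.4], [0.4, 0.8]])
rounded = bucket_round(m)

print(rounded, rounded.sum())  # total stays at 2.0
print(np.round(m).sum())  # plain rounding drops the total to 1.0
```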

balsa.routines.matrices.split_zone_in_matrix(base_matrix: DataFrame, old_zone: int, new_zones: List[int], proportions: List[float]) -> DataFrame

Takes a zone in a matrix (as a DataFrame) and splits it into several new zones, prorating affected cells by a vector of proportions (one value for each new zone). The old zone is removed.

Parameters:
  • base_matrix (pandas.DataFrame) – The matrix to re-shape.

  • old_zone (int) – The original zone to split.

  • new_zones (List[int]) – The list of new zones to add.

  • proportions (List[float]) – The proportions by which to split the original zone. The list must be the same length as new_zones and sum to 1.0.

Returns:

The re-shaped matrix

Return type:

pandas.DataFrame
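Splitting a zone prorates the old zone's row and column across the new zones, while the intrazonal cell is split by both the row and column proportions. The pandas sketch below illustrates this under stated assumptions: the helper name split_zone is hypothetical, the matrix is square with identical row and column labels, and the new zones are appended after the remaining zones (balsa's ordering may differ).

```python
import numpy as np
import pandas as pd


def split_zone(base_matrix, old_zone, new_zones, proportions):
    """Hypothetical helper: replace old_zone with prorated new_zones."""
    kept = [z for z in base_matrix.index if z != old_zone]
    labels = kept + list(new_zones)
    # Each result label copies from a source label and takes a share of it.
    source = kept + [old_zone] * len(new_zones)
    share = np.array([1.0] * len(kept) + list(proportions))
    out = base_matrix.loc[source, source].astype(float)
    out.index = labels
    out.columns = labels
    # Scale every cell by (row share) x (column share).
    return out * np.outer(share, share)


base = pd.DataFrame([[10, 20], [30, 40]], index=[1, 2], columns=[1, 2])
result = split_zone(base, old_zone=2, new_zones=[21, 22], proportions=[0.25, 0.75])

print(result)
```

Because the proportions sum to 1.0, the result keeps the original matrix total of 100.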