Matrix Utilities
- balsa.routines.matrices.aggregate_matrix(matrix: DataFrame | Series, *, groups: Series | ndarray[Any, dtype[ScalarType]] = None, row_groups: Series | ndarray[Any, dtype[ScalarType]] = None, col_groups: Series | ndarray[Any, dtype[ScalarType]] = None, aggfunc: Callable[[Iterable[int | float]], int | float] = np.sum) -> DataFrame | Series
Aggregates a matrix based on mappings provided for each axis, using a specified aggregation function.
- Parameters:
matrix (pandas.DataFrame | pandas.Series) – Matrix data to aggregate. DataFrames and Series with 2-level indices are supported.
groups (pandas.Series | NDArray, optional) – Syntactic sugar to specify both row_groups and col_groups to use the same grouping series.
row_groups (pandas.Series | NDArray, optional) – Groups for the rows. If aggregating a DataFrame, this must match the index of the matrix. For a “tall” matrix, this series can match either the “full” index of the series, or it can match the first level of the matrix (it would be the same as if aggregating a DataFrame). Alternatively, an array can be provided, but it must be the same length as the DataFrame’s index, or the full length of the Series.
col_groups (pandas.Series | NDArray, optional) – Groups for the columns. If aggregating a DataFrame, this must match the columns of the matrix. For a “tall” matrix, this series can match either the “full” index of the series, or it can match the second level of the matrix (it would be the same as if aggregating a DataFrame). Alternatively, an array can be provided, but it must be the same length as the DataFrame’s columns, or the full length of the Series.
aggfunc (Callable, optional) – The aggregation function to use. Defaults to np.sum.
- Returns:
The aggregated matrix, in the same type as was provided, e.g. Series -> Series, DataFrame -> DataFrame.
- Return type:
pandas.Series or pandas.DataFrame
Example
matrix:
    |  1  2  3  4  5  6  7
----+---------------------
  1 |  2  1  9  6  7  8  5
  2 |  4  1  1  4  8  7  6
  3 |  5  8  5  3  5  9  4
  4 |  1  1  2  9  4  9  9
  5 |  6  3  4  6  9  9  3
  6 |  7  2  5  8  2  5  9
  7 |  3  1  8  6  3  5  6
groups:
  1 | A
  2 | B
  3 | A
  4 | A
  5 | C
  6 | C
  7 | B
new_matrix = aggregate_matrix(matrix, groups=groups)
new_matrix:
    |   A   B   C
----+------------
  A |  42  28  42
  B |  26  14  23
  C |  36  17  25
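For intuition, the example above can be reproduced with plain pandas. This is a minimal sketch and not balsa's implementation: it assumes a square DataFrame whose index and columns both match the index of groups, and it simply sums by group along each axis in turn.

```python
import pandas as pd

zones = [1, 2, 3, 4, 5, 6, 7]
matrix = pd.DataFrame(
    [
        [2, 1, 9, 6, 7, 8, 5],
        [4, 1, 1, 4, 8, 7, 6],
        [5, 8, 5, 3, 5, 9, 4],
        [1, 1, 2, 9, 4, 9, 9],
        [6, 3, 4, 6, 9, 9, 3],
        [7, 2, 5, 8, 2, 5, 9],
        [3, 1, 8, 6, 3, 5, 6],
    ],
    index=zones,
    columns=zones,
)
groups = pd.Series(list("ABAACCB"), index=zones)

# Sum the rows by group, then sum the columns by group (via a transpose).
by_rows = matrix.groupby(groups).sum()
aggregated = by_rows.T.groupby(groups).sum().T

print(aggregated)
```

The printed frame should match the new_matrix table above; for instance, the A/A cell adds up the nine cells where both the row and the column belong to zones 1, 3 and 4.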
- balsa.routines.matrices.disaggregate_matrix(matrix: DataFrame, *, mapping: Series = None, proportions: Series = None, row_mapping: Series = None, row_proportions: Series = None, col_mapping: Series = None, col_proportions: Series = None) -> DataFrame
Splits multiple rows and columns in a matrix all at once. The cells in the matrix MUST be numeric, but the row and column labels need not be.
- Parameters:
matrix (pandas.DataFrame) – The input matrix to disaggregate.
mapping (pandas.Series, optional) – Dict-like Series of “New label” : “Old label”. Sets both the row_mapping and col_mapping variables if provided (resulting in a square matrix).
proportions (pandas.Series, optional) – Dict-like Series of “New label”: “Proportion of old label”. Its index must match the index of the mapping argument. Sets both the row_proportions and col_proportions arguments if provided.
row_mapping (pandas.Series, optional) – Same as mapping, except applied only to the rows.
row_proportions (pandas.Series, optional) – Same as proportions, except applied only to the rows.
col_mapping (pandas.Series, optional) – Same as mapping, except applied only to the columns.
col_proportions (pandas.Series, optional) – Same as proportions, except applied only to the columns.
- Returns:
An expanded DataFrame with the new indices. The new matrix will sum to the same total as the original.
- Return type:
pandas.DataFrame
Examples
df:
    |   A   B   C
----+------------
  A |  10  30  20
  B |  20  10  10
  C |  30  20  20
correspondence:
new | old  prop
----+----------
 A1 | A    0.25
 A2 | A    0.75
 B1 | B    0.55
 B2 | B    0.45
 C1 | C    0.62
 C2 | C    0.38
new_matrix = disaggregate_matrix(df, mapping=correspondence['old'], proportions=correspondence['prop'])
new_matrix:
new |     A1      A2      B1      B2     C1     C2
----+---------------------------------------------
 A1 |  0.625   1.875   4.125   3.375  3.100  1.900
 A2 |  1.875   5.625  12.375  10.125  9.300  5.700
 B1 |  2.750   8.250   3.025   2.475  3.410  2.090
 B2 |  2.250   6.750   2.475   2.025  2.790  1.710
 C1 |  4.650  13.950   6.820   5.580  7.688  4.712
 C2 |  2.850   8.550   4.180   3.420  4.712  2.888
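The arithmetic behind the example is that each new cell equals the old cell multiplied by the row proportion and by the column proportion. The sketch below reproduces the table with plain pandas under that reading; it is an illustration only, not balsa's implementation.

```python
import pandas as pd

labels = ["A", "B", "C"]
df = pd.DataFrame(
    [[10, 30, 20], [20, 10, 10], [30, 20, 20]], index=labels, columns=labels
)
correspondence = pd.DataFrame(
    {
        "old": ["A", "A", "B", "B", "C", "C"],
        "prop": [0.25, 0.75, 0.55, 0.45, 0.62, 0.38],
    },
    index=pd.Index(["A1", "A2", "B1", "B2", "C1", "C2"], name="new"),
)
old, prop = correspondence["old"], correspondence["prop"]

# Repeat each old row and column once per new label, then relabel both axes.
expanded = df.loc[old.to_numpy(), old.to_numpy()]
expanded.index = old.index
expanded.columns = old.index

# Scale every cell by its row proportion and by its column proportion.
new_matrix = expanded.mul(prop, axis=0).mul(prop, axis=1)

print(new_matrix.round(3))
```

Because the proportions for each old label sum to 1, the expanded matrix keeps the original grand total of 170.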
- balsa.routines.matrices.fast_stack(frame: DataFrame, multi_index: MultiIndex, *, deep_copy: bool = True) -> Series
Performs the same action as DataFrame.stack(), but provides better performance when the target stacked index is known beforehand. Useful for converting many matrices from “wide” to “tall” format. The inverse of fast_unstack().
Notes
This function does not check that the entries in the multi_index are compatible with the index and columns of the source DataFrame, only that the lengths are compatible. It can therefore be used to assign a whole new set of labels to the result.
- Parameters:
frame (pandas.DataFrame) – The DataFrame to stack.
multi_index (pandas.MultiIndex) – The 2-level MultiIndex known ahead-of-time.
deep_copy (bool, optional) – Defaults to True. A flag indicating if the returned Series should be a view of the underlying data (deep_copy=False) or a copy of it (deep_copy=True). A deep copy takes a little longer to convert and takes up more memory but preserves the original data of the DataFrame. The default value of True is recommended for most uses.
- Returns:
The stacked data.
- Return type:
pandas.Series
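The idea described above can be sketched as flattening the frame's values in row-major order and attaching the pre-built MultiIndex, checking only that the lengths agree. The helper name stack_with_known_index is hypothetical and this is not balsa's code.

```python
import pandas as pd

def stack_with_known_index(frame, multi_index, deep_copy=True):
    """Hypothetical helper: flatten row by row and reuse a known MultiIndex."""
    if len(multi_index) != frame.size:
        raise ValueError("multi_index length must equal rows * columns")
    values = frame.to_numpy().reshape(-1)  # row-major, the order stack() uses
    if deep_copy:
        values = values.copy()
    return pd.Series(values, index=multi_index)

frame = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], index=["a", "b"], columns=["x", "y"])

# Build the target index once and reuse it for every matrix of the same shape.
target_index = pd.MultiIndex.from_product([frame.index, frame.columns])
tall = stack_with_known_index(frame, target_index)

print(tall)
```

As in the note above, only the length is validated, so any 4-entry MultiIndex would be accepted here and would relabel the result.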
- balsa.routines.matrices.fast_unstack(series: Series, index: Index, columns: Index, *, deep_copy: bool = True) -> DataFrame
Performs the same action as Series.unstack(), but provides better performance when the target unstacked index and columns are known beforehand. Useful for converting many matrices from “tall” to “wide” format. The inverse of fast_stack().
Notes
This function does not check that the entries in index and columns are compatible with the MultiIndex of the source Series, only that the lengths are compatible. It can therefore be used to assign a whole new set of labels to the result.
- Parameters:
series (pandas.Series) – The Series with 2-level MultiIndex to unstack.
index (pandas.Index) – The row index known ahead-of-time.
columns (pandas.Index) – The columns index known ahead-of-time.
deep_copy (bool, optional) – Defaults to True. A flag indicating if the returned DataFrame should be a view of the underlying data (deep_copy=False) or a copy of it (deep_copy=True). A deep copy takes a little longer to convert and takes up more memory but preserves the original data of the Series. The default value of True is recommended for most uses.
- Returns:
The unstacked data.
- Return type:
pandas.DataFrame
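The reverse operation can be sketched as a reshape of the Series values onto the known row and column labels. The helper name unstack_with_known_axes is hypothetical, the sketch assumes the Series is already ordered row by row, and it is not balsa's code.

```python
import pandas as pd

def unstack_with_known_axes(series, index, columns, deep_copy=True):
    """Hypothetical helper: reshape tall values onto known row/column labels."""
    if len(series) != len(index) * len(columns):
        raise ValueError("series length must equal len(index) * len(columns)")
    # Assumes the values are ordered row by row (all columns of row 0 first).
    values = series.to_numpy().reshape(len(index), len(columns))
    if deep_copy:
        values = values.copy()
    return pd.DataFrame(values, index=index, columns=columns)

rows = pd.Index(["a", "b"])
cols = pd.Index(["x", "y"])
tall = pd.Series(
    [1.0, 2.0, 3.0, 4.0], index=pd.MultiIndex.from_product([rows, cols])
)
wide = unstack_with_known_axes(tall, rows, cols)

print(wide)
```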
- balsa.routines.matrices.matrix_balancing_1d(m: ndarray[Any, dtype[ScalarType]], a: ndarray[Any, dtype[ScalarType]], axis: int) -> ndarray[Any, dtype[ScalarType]]
Balances a matrix using a single constraint.
- Parameters:
m (NDArray) – The matrix (a 2-dimensional ndarray) to be balanced.
a (NDArray) – The totals vector (a 1-dimensional ndarray) constraint.
axis (int) – Direction to constrain (0 = along columns, 1 = along rows).
- Returns:
A balanced matrix
- Return type:
NDArray
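Single-constraint balancing amounts to scaling each row (or each column) so that its sum matches the supplied total. The NumPy sketch below illustrates this; the helper name balance_1d is hypothetical, and it assumes that axis=1 means the row sums are constrained and axis=0 the column sums, which may not match balsa's convention exactly.

```python
import numpy as np

def balance_1d(m, totals, axis):
    """Hypothetical helper: scale rows (axis=1) or columns (axis=0) to totals."""
    m = np.asarray(m, dtype=float)
    totals = np.asarray(totals, dtype=float)
    current = m.sum(axis=axis)
    if current.shape != totals.shape:
        raise ValueError("totals length does not match the constrained axis")
    # Rows or columns that currently sum to zero get a factor of zero.
    scale = np.divide(totals, current, out=np.zeros_like(totals), where=current != 0)
    return m * scale[:, None] if axis == 1 else m * scale[None, :]

seed = np.array([[1.0, 2.0], [3.0, 4.0]])
by_rows = balance_1d(seed, [6.0, 14.0], axis=1)
by_cols = balance_1d(seed, [8.0, 12.0], axis=0)

print(by_rows.sum(axis=1))
print(by_cols.sum(axis=0))
```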
- balsa.routines.matrices.matrix_balancing_2d(m: ndarray[Any, dtype[ScalarType]] | DataFrame, a: ndarray[Any, dtype[ScalarType]], b: ndarray[Any, dtype[ScalarType]], *, totals_to_use: str = 'raise', max_iterations: int = 1000, rel_error: float = 0.0001, n_threads: int = 1) -> Tuple[ndarray[Any, dtype[ScalarType]] | DataFrame, float, int]
Balances a two-dimensional matrix using iterative proportional fitting.
- Parameters:
m (NDArray | pandas.DataFrame) – The matrix (a 2-dimensional ndarray) to be balanced. If a DataFrame is supplied, the output will be returned as a DataFrame.
a (NDArray) – The row totals (a 1-dimensional ndarray) to use for balancing.
b (NDArray) – The column totals (a 1-dimensional ndarray) to use for balancing.
totals_to_use (str, optional) – Defaults to 'raise'. Describes how to scale the row and column totals if their sums do not match. Must be one of ['rows', 'columns', 'average', 'raise'].
  - rows: scales the column totals so that their sum matches the sum of the row totals
  - columns: scales the row totals so that their sum matches the sum of the column totals
  - average: scales both the row and column totals to the average of their sums
  - raise: raises an Exception if the sums of the row and column totals do not match
max_iterations (int, optional) – Defaults to 1000. Maximum number of iterations.
rel_error (float, optional) – Defaults to 1.0E-4. Relative error stopping criterion.
n_threads (int, optional) – Defaults to 1. Number of processors for parallel computation. (Not used)
- Returns:
The balanced matrix, residual, and n_iterations
- Return type:
Tuple[NDArray | pandas.DataFrame, float, int]
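Iterative proportional fitting alternates between scaling the rows to match the row totals and scaling the columns to match the column totals until the mismatch is small. The sketch below shows that loop in NumPy. The helper name ipf_2d is hypothetical, the error measure (largest relative row mismatch after a column pass) is an assumption, and only the 'raise' behaviour for mismatched totals is shown; balsa's own residual and stopping rule may differ.

```python
import numpy as np

def ipf_2d(m, row_totals, col_totals, max_iterations=1000, rel_error=1e-4):
    """Hypothetical helper: basic iterative proportional fitting."""
    m = np.asarray(m, dtype=float).copy()
    a = np.asarray(row_totals, dtype=float)
    b = np.asarray(col_totals, dtype=float)
    if not np.isclose(a.sum(), b.sum()):
        raise ValueError("row and column totals must have the same sum")

    error, n_iterations = np.inf, 0
    while n_iterations < max_iterations and error > rel_error:
        n_iterations += 1
        # Scale each row to its target, then each column to its target.
        row_sums = m.sum(axis=1)
        m *= np.divide(a, row_sums, out=np.zeros_like(a), where=row_sums != 0)[:, None]
        col_sums = m.sum(axis=0)
        m *= np.divide(b, col_sums, out=np.zeros_like(b), where=col_sums != 0)[None, :]
        # Columns now match exactly, so measure the remaining row mismatch.
        row_sums = m.sum(axis=1)
        error = np.max(np.abs(row_sums - a) / np.where(a != 0, a, 1.0))
    return m, error, n_iterations

seed = np.array([[1.0, 2.0], [3.0, 4.0]])
balanced, residual, n_iter = ipf_2d(seed, [4.0, 6.0], [3.0, 7.0])

print(balanced.round(3), residual, n_iter)
```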
- balsa.routines.matrices.matrix_bucket_rounding(m: ndarray[Any, dtype[ScalarType]] | DataFrame, *, decimals: int = 0) -> ndarray[Any, dtype[ScalarType]] | DataFrame
Bucket rounds to the given number of decimals.
- Parameters:
m (NDArray | pandas.DataFrame) – The matrix to be rounded.
decimals (int, optional) – Defaults to 0. Number of decimal places to round to. If decimals is negative, it specifies the number of positions to the left of the decimal point.
- Returns:
The rounded matrix
- Return type:
NDArray | pandas.DataFrame
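Bucket rounding is commonly understood as rounding each cell while carrying the rounding residual into the next cell, so that the total drifts far less than with independent rounding. The sketch below uses a simple row-major carry; the helper name bucket_round and the traversal order are assumptions and may differ from what balsa does.

```python
import numpy as np

def bucket_round(values, decimals=0):
    """Hypothetical helper: round cell by cell, carrying the residual forward."""
    arr = np.asarray(values, dtype=float)
    scale = 10.0 ** decimals
    out = np.empty(arr.size)
    residual = 0.0
    for i, value in enumerate(arr.ravel()):
        target = value * scale + residual  # add what earlier cells gave up
        rounded = np.round(target)
        residual = target - rounded        # pass the leftover to the next cell
        out[i] = rounded / scale
    return out.reshape(arr.shape)

m = np.array([[0.4, 0.4], [0.4, 0.8]])
bucketed = bucket_round(m)

# Independent rounding loses half of the total here; bucket rounding keeps it.
print(np.round(m).sum(), bucketed.sum())
```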
- balsa.routines.matrices.split_zone_in_matrix(base_matrix: DataFrame, old_zone: int, new_zones: List[int], proportions: List[float]) -> DataFrame
Takes a zone in a matrix (as a DataFrame) and splits it into several new zones, prorating affected cells by a vector of proportions (one value for each new zone). The old zone is removed.
- Parameters:
base_matrix (pandas.DataFrame) – The matrix to re-shape.
old_zone (int) – The original zone to split.
new_zones (List[int]) – The list of new zones to add.
proportions (List[float]) – The proportions to split the original zone to. The list must be the same length as new_zones and sum to 1.0.
- Returns:
The re-shaped matrix
- Return type:
pandas.DataFrame
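Splitting a single zone can be viewed as a disaggregation in which every other zone maps to itself with a proportion of 1. The sketch below illustrates that view with plain pandas. The helper name split_zone is hypothetical, it assumes a square matrix with identical row and column labels, and it places the new zones where the old one was, which is an assumption about ordering rather than balsa's documented behaviour.

```python
import pandas as pd

def split_zone(matrix, old_zone, new_zones, proportions):
    """Hypothetical helper: prorate one zone's rows and columns into new zones."""
    if len(new_zones) != len(proportions):
        raise ValueError("new_zones and proportions must be the same length")
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1.0")

    new_labels, old_labels, weights = [], [], []
    for zone in matrix.index:
        if zone == old_zone:
            new_labels.extend(new_zones)
            old_labels.extend([zone] * len(new_zones))
            weights.extend(proportions)
        else:  # untouched zones keep their label and full value
            new_labels.append(zone)
            old_labels.append(zone)
            weights.append(1.0)

    weights = pd.Series(weights, index=new_labels)
    result = matrix.loc[old_labels, old_labels]
    result.index = new_labels
    result.columns = new_labels
    return result.mul(weights, axis=0).mul(weights, axis=1)

base = pd.DataFrame([[10.0, 20.0], [30.0, 40.0]], index=[1, 2], columns=[1, 2])
reshaped = split_zone(base, old_zone=2, new_zones=[21, 22], proportions=[0.25, 0.75])

print(reshaped)
```

Cells that involve the old zone on only one axis are scaled once, while the old intrazonal cell is scaled by both the row and the column proportion, so the grand total is unchanged.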