IPU

Iterative proportional updating is a method developed by Arizona State University that allows the IPF procedure to match household- and person-level marginals. In the basic IPF procedure, all marginal distributions must describe the same thing (e.g. households). IPU allows you to say, for example, that a zone needs a total household count of 500, but also needs 800 people.

Example 1: Simple Example

This example creates a small seed table and target values to illustrate how the package is used. Targets are specified by geography, here a single traffic analysis zone identified by geo_taz. Any field name can be used for the geography as long as it:

Starts with “geo_”
Is included in both the target table(s) and the seed table

This simple example has only household-level marginal targets, so it could be solved with a basic IPF procedure without ipu(). However, it is designed to show the basics needed to run the function.

Seed table creation

The seed table is the starting point for the IPF procedure. In this example, we make up some survey data.

Each row represents a household
The household ID column is named/renamed to pid (“primary ID”)
The geography field is included and starts with “geo_” (geo_taz)
- This example uses a single traffic analysis zone (TAZ). With several zones, each would have its own seed data and targets.

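library(ipfr)    # ipu()
library(dplyr)   # %>%, mutate(), filter(), select()
library(tidyr)   # spread()
library(tibble)  # tribble()

# One row per household; all four households are in TAZ 1
hh_seed <- tribble(
  ~pid, ~siz, ~inc, ~weight, ~geo_taz,
  1,    1,    1,    12,      1,
  2,    1,    2,     3,      1,
  3,    2,    1,     6,      1,
  4,    2,    2,     5,      1
)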

Target creation

The number of households by size (e.g., 1-person, 2-person, etc.) is referred to as a marginal distribution. Often, from the Census, we know the total number of households by each individual marginal. This information becomes the target that the IPU process tries to match.

Marginal targets are specified below for each TAZ:

The geography field geo_taz matches the seed table.
The name of each table in the list (siz, inc) links it to the seed column of the same name.
The column names of each table are the values that appear in that seed column.

hh_targets <- list()
hh_targets$siz <- tribble(
  ~geo_taz, ~`1`, ~`2`,
  1,        18,   12
)
hh_targets$inc <- tribble(
  ~geo_taz, ~`1`, ~`2`,
  1,        20,  10
)

hh_targets

## $siz
## # A tibble: 1 x 3
##   geo_taz   `1`   `2`
##     <dbl> <dbl> <dbl>
## 1       1    18    12
## 
## $inc
## # A tibble: 1 x 3
##   geo_taz   `1`   `2`
##     <dbl> <dbl> <dbl>
## 1       1    20    10
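The link between target tables and seed columns can be sketched in base R. The snippet below rebuilds the targets above as plain data frames, derives category labels of the form column_value, and confirms that both tables describe the same household total. The helper category_labels() is hypothetical and is not part of ipfr.

```r
# Targets from the example above, rebuilt as plain data frames.
hh_targets <- list(
  siz = data.frame(geo_taz = 1, `1` = 18, `2` = 12, check.names = FALSE),
  inc = data.frame(geo_taz = 1, `1` = 20, `2` = 10, check.names = FALSE)
)

# Hypothetical helper (not part of ipfr): pair each table name with its
# non-geography column names to label every marginal category.
category_labels <- function(targets) {
  unlist(lapply(names(targets), function(nm) {
    values <- setdiff(names(targets[[nm]]), "geo_taz")
    paste(nm, values, sep = "_")
  }))
}

labels <- category_labels(hh_targets)

# Total households implied by each target table.
totals <- sapply(hh_targets, function(tbl) {
  sum(tbl[, setdiff(names(tbl), "geo_taz")])
})

labels
totals
```

Both tables sum to the same number of households, so neither needs to be rescaled before fitting.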

Run IPU

result <- ipu(hh_seed, hh_targets)

ipu() returns a named list.

names(result)

## [1] "weight_tbl"   "weight_dist"  "primary_comp"

The first element is the resulting weight table. It is the primary seed table with three columns added:

weight
- The expanded weight of the record.
avg_weight
- The average weight for the geography (total target / number of seed records)
weight_factor
- The weight divided by the average weight.

result$weight_tbl

## # A tibble: 4 x 7
##     pid   siz   inc weight geo_taz avg_weight weight_factor
##   <dbl> <dbl> <dbl>  <dbl>   <dbl>      <dbl>         <dbl>
## 1     1     1     1     12       1        7.5         1.6  
## 2     2     1     2      6       1        7.5         0.8  
## 3     3     2     1      8       1        7.5         1.07 
## 4     4     2     2      4       1        7.5         0.533
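The two derived columns can be reproduced by hand from the definitions above. This is a minimal sketch in base R that uses the weights printed in the table. The variable names are chosen for illustration only.

```r
# Weights as printed in the weight table above.
weight <- c(12, 6, 8, 4)

# Total household target for the zone (siz: 18 + 12).
total_target <- 18 + 12

# Average weight: total target divided by the number of seed records.
avg_weight <- total_target / length(weight)

# Weight factor: each record's weight relative to the average.
weight_factor <- weight / avg_weight

avg_weight
round(weight_factor, 3)
```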

The second element is a histogram of the weight_factor. This provides a quick overview of the distribution of weights.

result$weight_dist

The next element is a comparison back to the targets provided. With complex seed and target tables, this makes investigating results quick and easy.

result$primary_comp

## # A tibble: 4 x 6
##   geo       category target result  diff pct_diff
##   <chr>     <chr>     <dbl>  <dbl> <dbl>    <dbl>
## 1 geo_taz_1 inc_1        20     20     0        0
## 2 geo_taz_1 inc_2        10     10     0        0
## 3 geo_taz_1 siz_1        18     18     0        0
## 4 geo_taz_1 siz_2        12     12     0        0
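The comparison table is simple to reproduce, which can help when debugging a larger run. The sketch below sums the printed weights by category in base R and compares them to the targets. It is an illustration of the arithmetic, not ipfr's internal code.

```r
# Weight table as printed above (only the columns needed here).
weight_tbl <- data.frame(
  siz = c(1, 1, 2, 2),
  inc = c(1, 2, 1, 2),
  weight = c(12, 6, 8, 4)
)

targets <- c(inc_1 = 20, inc_2 = 10, siz_1 = 18, siz_2 = 12)

# Sum the weights within each category of each marginal.
inc_result <- tapply(weight_tbl$weight, weight_tbl$inc, sum)
siz_result <- tapply(weight_tbl$weight, weight_tbl$siz, sum)

comp <- data.frame(
  category = names(targets),
  target = unname(targets),
  result = unname(c(inc_result, siz_result))
)
comp$diff <- comp$result - comp$target
comp$pct_diff <- 100 * comp$diff / comp$target

comp
```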

If secondary targets are provided to ipu(), a fourth item in the list will contain a secondary_comp table.

In addition to making sure the marginal targets are matched, it is important to ensure that the underlying distribution of households still resembles the seed data. As an example, if your seed data says that most low-income households are also one-person households, that information should be preserved.

hh_seed %>%
  mutate(inc = paste0("inc", inc)) %>%
  filter(geo_taz == 1) %>%
  select(siz, inc, weight) %>%
  spread(inc, weight)

## # A tibble: 2 x 3
##     siz  inc1  inc2
##   <dbl> <dbl> <dbl>
## 1     1    12     3
## 2     2     6     5

result$weight_tbl %>%
  mutate(inc = paste0("inc", inc)) %>%
  filter(geo_taz == 1) %>%
  select(siz, inc, weight) %>%
  spread(inc, weight)

## # A tibble: 2 x 3
##     siz  inc1  inc2
##   <dbl> <dbl> <dbl>
## 1     1    12     6
## 2     2     8     4

Example 2: The Arizona Paper Example

In household survey expansion, it is common to control for some attributes that describe households (like size) while also controlling for attributes that describe people (like age). This is possible with the ipu() function.

This example is taken directly from the Arizona paper on page 20: http://www.scag.ca.gov/Documents/PopulationSynthesizerPaper_TRB.pdf

In this example, household type could represent size (e.g. 1-person and 2-person households). Person type could represent age groups (e.g. under 18, between 18 and 50, and over 50).

The code block below re-creates the seed and target tables for both persons and households.

Only a single geography is used (geo_region)
Both seed tables have the pid field
- The pid field in the persons seed table links to the household seed

hh_seed <- data_frame(
  geo_region = 1,
  pid = c(1:8),
  hhtype = c(1, 1, 1, 2, 2, 2, 2, 2)
)

per_seed <- data_frame(
  pid = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 7, 7, 7, 7, 7, 8, 8),
  pertype = c(1, 2, 3, 1, 3, 1, 1, 2, 1, 3, 3, 2, 2, 3, 1, 2, 1, 1, 2, 3, 3, 1, 2)
)

hh_targets <- list()
hh_targets$hhtype <- data_frame(
  geo_region = 1,
  `1` = 35,
  `2` = 65
)

per_targets <- list()
per_targets$pertype <- data_frame(
  geo_region = 1,
  `1` = 91,
  `2` = 65,
  `3` = 104
)

In the interest of keeping vignette build time short, the ipu() algorithm is run for only 30 iterations. After 400 or more iterations, the results closely match those shown in the paper.

The household seed table is the primary_seed
The household target list is the primary_target
The person seed table is the secondary_seed
The person target list is the secondary_target

result <- ipu(hh_seed, hh_targets, per_seed, per_targets, max_iterations = 30)

The first table shows the result. The second table shows the primary comparison table. Since we added secondary seeds and targets, the output now contains a secondary comparison table. Feel free to run the code chunk above for 400 or more iterations and then look again.

result$weight_tbl %>%
  mutate(weight = round(weight, 2))

## # A tibble: 8 x 6
##   geo_region   pid hhtype weight avg_weight weight_factor
##        <dbl> <int>  <dbl>  <dbl>      <dbl>         <dbl>
## 1          1     1      1   3.77       12.5         0.301
## 2          1     2      1  23          12.5         1.84 
## 3          1     3      1   6.5        12.5         0.520
## 4          1     4      2  25.7        12.5         2.06 
## 5          1     5      2  17.4        12.5         1.39 
## 6          1     6      2   7.26       12.5         0.581
## 7          1     7      2   4.21       12.5         0.337
## 8          1     8      2   7.26       12.5         0.581

result$primary_comp %>% 
  mutate(result = round(result, 2))

## # A tibble: 2 x 6
##   geo          category target result  diff pct_diff
##   <chr>        <chr>     <dbl>  <dbl> <dbl>    <dbl>
## 1 geo_region_1 hhtype_1     35   33.3 -1.73    -4.94
## 2 geo_region_1 hhtype_2     65   61.8 -3.15    -4.84

result$secondary_comp %>% 
  mutate(result = round(result, 2))

## # A tibble: 3 x 6
##   geo          category  target result  diff pct_diff
##   <chr>        <chr>      <dbl>  <dbl> <dbl>    <dbl>
## 1 geo_region_1 pertype_1     91   88.4 -2.58    -2.83
## 2 geo_region_1 pertype_2     65   63.8 -1.16    -1.79
## 3 geo_region_1 pertype_3    104  104    0        0
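The person-level results follow from the household weights: each household contributes its weight once for every person of a given type that it contains. The base R sketch below rebuilds that calculation from the rounded weights printed above, so the totals agree with the comparison tables only to within rounding. It illustrates the arithmetic and is not ipfr's internal code.

```r
# Person seed table from this example.
per_seed <- data.frame(
  pid = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 7, 7, 7, 7, 7, 8, 8),
  pertype = c(1, 2, 3, 1, 3, 1, 1, 2, 1, 3, 3, 2, 2, 3, 1, 2, 1, 1, 2, 3, 3, 1, 2)
)

# Household types and the rounded weights printed in the weight table.
hhtype <- c(1, 1, 1, 2, 2, 2, 2, 2)
weight <- c(3.77, 23, 6.5, 25.7, 17.4, 7.26, 4.21, 7.26)

# Rows are households (pid 1 to 8), columns are person types,
# cells count the persons of that type in the household.
person_counts <- unclass(table(per_seed$pid, per_seed$pertype))

# Weighted person totals: sum over households of weight * person count.
person_totals <- drop(crossprod(person_counts, weight))

# Weighted household totals by household type.
household_totals <- tapply(weight, hhtype, sum)

round(person_totals, 1)
round(household_totals, 1)
```

Because a single household weight feeds several person categories at once, adjusting it to fix one person target also moves the others. This is why the combined problem takes many more iterations than the household-only case.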

Example 3: Using Multiple Geographies

ipu() allows different geographies to be specified for different marginal tables. There are a few rules that make this possible, but in short, the geo field on each target table tells the algorithm which scale to constrain to.

The algorithm checks all of the following rules and shows a warning message if one is violated.

The primary/household seed table must contain all geo fields used by any target table and the pid field.
- Do not put any geo fields on the secondary/person seed table.
- This prevents potential errors/inconsistencies between seed tables.
All fields that designate geographies must start with “geo_” e.g.
- geo_cluster
- geo_region
- geo_state
Each target table must have a geo field that is present in the primary seed table
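These rules can be checked before calling ipu(). The base R sketch below uses a hypothetical helper, target_geo_ok(), which is not part of ipfr. It assumes each target table carries exactly one geo_ field, which is an interpretation of the last rule rather than something stated above.

```r
# Hypothetical helper (not part of ipfr): a target passes when it has exactly
# one geo_ field and that field is also a column of the primary seed table.
target_geo_ok <- function(primary_seed, target) {
  geo_cols <- grep("^geo_", names(target), value = TRUE)
  length(geo_cols) == 1 && geo_cols %in% names(primary_seed)
}

# Illustrative tables: the primary seed carries both geographies.
primary_seed <- data.frame(
  pid = 1:4,
  hhtype = c(1, 1, 2, 2),
  geo_cluster = c(1, 1, 2, 2),
  geo_region = 1
)
hh_target <- data.frame(geo_cluster = c(1, 2), `1` = c(35, 35), `2` = c(65, 65),
                        check.names = FALSE)
per_target <- data.frame(geo_region = 1, `1` = 182, `2` = 130, `3` = 208,
                         check.names = FALSE)
# geo_state is not on the primary seed table, so this target should fail.
bad_target <- data.frame(geo_state = 1, `1` = 10, `2` = 20, check.names = FALSE)

checks <- c(
  hh = target_geo_ok(primary_seed, hh_target),
  per = target_geo_ok(primary_seed, per_target),
  bad = target_geo_ok(primary_seed, bad_target)
)
checks
```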

To demonstrate, the Arizona example from Example 2 is modified to add two clusters for the household controls while still controlling the person targets at the regional level.

# Modifying Example 2 for Example 3

# Repeat the hh_seed to create cluster 1 and 2 households
hh_seed <- hh_seed %>%
  rename(geo_cluster = geo_region)
hh_seed <- bind_rows(
  hh_seed,
  hh_seed %>% 
    mutate(geo_cluster = 2, pid = pid + 8)
)
hh_seed$geo_region <- 1

hh_seed

## # A tibble: 16 x 4
##    geo_cluster   pid hhtype geo_region
##          <dbl> <dbl>  <dbl>      <dbl>
##  1           1     1      1          1
##  2           1     2      1          1
##  3           1     3      1          1
##  4           1     4      2          1
##  5           1     5      2          1
##  6           1     6      2          1
##  7           1     7      2          1
##  8           1     8      2          1
##  9           2     9      1          1
## 10           2    10      1          1
## 11           2    11      1          1
## 12           2    12      2          1
## 13           2    13      2          1
## 14           2    14      2          1
## 15           2    15      2          1
## 16           2    16      2          1

# Repeat the household targets for two clusters
hh_targets$hhtype <- bind_rows(hh_targets$hhtype, hh_targets$hhtype)
hh_targets$hhtype <- hh_targets$hhtype %>%
  rename(geo_cluster = geo_region) %>%
  mutate(geo_cluster = c(1, 2))

hh_targets$hhtype

## # A tibble: 2 x 3
##   geo_cluster   `1`   `2`
##         <dbl> <dbl> <dbl>
## 1           1    35    65
## 2           2    35    65

# Repeat the per_seed to create cluster 1 and 2 persons
per_seed <- bind_rows(
  per_seed,
  per_seed %>% 
    mutate(pid = pid + 8)
)

per_seed %>%
  head()

## # A tibble: 6 x 2
##     pid pertype
##   <dbl>   <dbl>
## 1     1       1
## 2     1       2
## 3     1       3
## 4     2       1
## 5     2       3
## 6     3       1

# Double the regional person targets
per_targets$pertype <- per_targets$pertype %>%
  mutate_at(
    .vars = vars("1", "2", "3"),
    .funs = funs(. * 2)
  )

per_targets$pertype

## # A tibble: 1 x 4
##   geo_region   `1`   `2`   `3`
##        <dbl> <dbl> <dbl> <dbl>
## 1          1   182   130   208

Run the IPU algorithm. Again, for vignette build time, only 30 iterations are performed. Run the code yourself with max_iterations set to 600 to see the converged result.

result <- ipu(hh_seed, hh_targets, per_seed, per_targets, max_iterations = 30)

The tables below show the results compared back to targets. More iterations would make a better match.

result$primary_comp %>%
  mutate(result = round(result, 2))

## # A tibble: 4 x 6
##   geo           category target result  diff pct_diff
##   <chr>         <chr>     <dbl>  <dbl> <dbl>    <dbl>
## 1 geo_cluster_1 hhtype_1     35   33.3 -1.73    -4.94
## 2 geo_cluster_1 hhtype_2     65   61.8 -3.15    -4.84
## 3 geo_cluster_2 hhtype_1     35   33.3 -1.73    -4.94
## 4 geo_cluster_2 hhtype_2     65   61.8 -3.15    -4.84

result$secondary_comp %>%
  mutate(result = round(result, 2))

## # A tibble: 3 x 6
##   geo          category  target result  diff pct_diff
##   <chr>        <chr>      <dbl>  <dbl> <dbl>    <dbl>
## 1 geo_region_1 pertype_1    182   177. -5.16    -2.83
## 2 geo_region_1 pertype_2    130   128. -2.33    -1.79
## 3 geo_region_1 pertype_3    208   208   0        0

How IPU addresses common IPF problems

This section shows how ipu() addresses some common problems found in basic IPF procedures. It uses the example data from the first example.

Zero weights

IPF works by successively multiplying the table weights by factors. Cells with a zero weight cannot be modified by this process. As the number of zero weights increases, the flexibility of the process is reduced, and convergence becomes more difficult. ipfr solves this problem by setting a minimum weight of 0.0001 for all cells. This minimum weight can be adjusted using the min_weight parameter and should be arbitrarily small compared to your seed table weights.
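The effect of a zero seed weight can be illustrated with a minimal two-dimensional IPF written in base R. This is a sketch for illustration only, not ipfr's implementation. The function ipf_2d() and the seed and target values are made up for this example, and the floor of 0.0001 mirrors the default minimum weight described above.

```r
# Minimal two-dimensional IPF, written for illustration only (not ipfr code).
ipf_2d <- function(seed, row_targets, col_targets, iterations = 100) {
  w <- seed
  for (i in seq_len(iterations)) {
    w <- sweep(w, 1, row_targets / rowSums(w), "*")  # match row totals
    w <- sweep(w, 2, col_targets / colSums(w), "*")  # match column totals
  }
  w
}

# Made-up seed with one zero cell (row 1, column 2).
seed <- matrix(c(4, 0,
                 2, 3), nrow = 2, byrow = TRUE)
row_targets <- c(10, 20)
col_targets <- c(15, 15)

# Fit with the zero left in place, then with a small floor applied first.
zero_fit <- ipf_2d(seed, row_targets, col_targets)
floored_fit <- ipf_2d(pmax(seed, 0.0001), row_targets, col_targets)

zero_fit
floored_fit
```

In the first fit the zero cell stays at zero no matter how many iterations run, because multiplying zero by any factor gives zero. The floored seed leaves that cell free to take a small positive weight.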

Missing seed information

Not every combination of marginal categories is required to be included in the seed table; however, at least one observation of each category must exist. For example, the combination:

siz = 1
wrk = 1
veh = 0

may not have been observed in the survey, and thus may be missing from the seed table. As long as other combinations of size-1 households exist (e.g. with 0 workers and 1 vehicle), ipfr will work fine. On the other hand, if there are no observations of any size-1 households, ipfr will stop with an error message.

See the first IPU example to see how it works.
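A quick pre-check for this condition can be sketched in base R. The seed values and target categories below are made up, and the check is a hypothetical stand-in rather than the test ipfr performs internally. It lists every target category that has no observation at all in the seed table.

```r
# Made-up seed table with no size-1 households.
seed <- data.frame(
  siz = c(2, 2, 3),
  wrk = c(1, 0, 1)
)

# Categories that the (hypothetical) target tables would ask for.
target_categories <- list(siz = c("1", "2", "3"), wrk = c("0", "1"))

# For each marginal, find target categories never observed in the seed.
missing_categories <- lapply(names(target_categories), function(col) {
  setdiff(target_categories[[col]], as.character(unique(seed[[col]])))
})
names(missing_categories) <- names(target_categories)

missing_categories
```

Here siz = 1 is reported as missing, which is the situation that stops the run, while every wrk category is observed at least once.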

Target agreement

ipfr handles two separate issues concerning marginal agreement:

Agreement within primary or secondary targets
Balance between primary and secondary targets

Agreement within Primary or Secondary Targets

A basic implementation of iterative proportional fitting requires that all targets agree on the total. For example, if the households by size target table has a total of 100 households, but the households by income table has a total of 120, both cannot be satisfied.

ipfr handles this by scaling all tables in the same target list (either primary or secondary) to match the total of the first table.

In the example below, the size marginal sums to a total of 100 households. The vehicle marginal sums to 300. With the verbose option set to TRUE, a message will be displayed telling which, if any, target tables are scaled.

hh_seed <- data_frame(
  geo_region = 1,
  pid = c(1:8),
  hhsiz = c(1, 1, 1, 2, 2, 2, 2, 2),
  hhveh = c(0, 2, 1, 1, 1, 2, 1, 0)
)

hh_targets <- list()
hh_targets$hhsiz <- data_frame(
  geo_region = 1,
  `1` = 35,
  `2` = 65
)
hh_targets$hhveh <- data_frame(
  geo_region = 1,
  `0` = 100,
  `1` = 100,
  `2` = 100
)

result <- ipu(hh_seed, hh_targets, max_iterations = 30, verbose = TRUE)

## Scaling target tables:  hhveh

## Finished iteration  2 . %RMSE =  9.044233
## Finished iteration  3 . %RMSE =  0.4706742
## Finished iteration  4 . %RMSE =  0.02385747
## Finished iteration  5 . %RMSE =  0.001207629

## IPU converged

## All targets matched within the absolute_diff of 10

Importantly, the performance measures below compare the result to the scaled target, not the original. Note that the vehicle targets have been scaled down.

result$primary_comp %>%
  mutate_at(
    .vars = vars(target, result),
    .funs = funs(round(., 2))
  )

## # A tibble: 5 x 6
##   geo          category target result  diff pct_diff
##   <chr>        <chr>     <dbl>  <dbl> <dbl>    <dbl>
## 1 geo_region_1 hhsiz_1    35     35       0        0
## 2 geo_region_1 hhsiz_2    65     65       0        0
## 3 geo_region_1 hhveh_0    33.3   33.3     0        0
## 4 geo_region_1 hhveh_1    33.3   33.3     0        0
## 5 geo_region_1 hhveh_2    33.3   33.3     0        0
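The scaled vehicle targets can be reproduced by hand. The sketch below applies the rule described above (scale every table to the total of the first table in the list) in base R. The variable names are chosen for illustration.

```r
# Original targets: the first table (hhsiz) sets the total.
hhsiz <- c(`1` = 35, `2` = 65)
hhveh <- c(`0` = 100, `1` = 100, `2` = 100)

# Scale the vehicle table so that it sums to the household-size total.
scale_factor <- sum(hhsiz) / sum(hhveh)
hhveh_scaled <- hhveh * scale_factor

round(hhveh_scaled, 1)
sum(hhveh_scaled)
```

Scaling preserves the shares within the vehicle table (one third each here) and changes only its total.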

Balance Between Primary and Secondary Targets

In population synthesis or survey expansion, adding a secondary set of person-level targets can lead to a different issue: target balance. Naturally, the total number of households and the total number of persons will be very different. A balance issue arises when the average weights for household records and person records are very different.

In the Arizona example, note that the average weights for household and person records are similar.

# rowSums() also adds the geo_region value (1), so subtract it to get the target total
avg_hh_weight <- (rowSums(hh_targets$hhtype) - 1) / nrow(hh_seed)
avg_per_weight <- (rowSums(per_targets$pertype) - 1) / nrow(per_seed)

Average household weight = 12.5
Average person weight = 11.3

In real applications, this is often not true. The example below demonstrates the consequences by modifying the Arizona example to double the person targets.

per_targets$pertype <- per_targets$pertype %>%
  mutate_at(
    .vars = vars(`1`, `2`, `3`),
    .funs = funs(. * 2)
  )

result <- ipu(hh_seed, hh_targets, per_seed, per_targets, max_iterations = 30)

The resulting weights tend towards the extremes as the algorithm attempts to match unbalanced primary and secondary targets. In effect, the algorithm is making a large shift to the basic persons-per-household metric found in the seed table. Households with multiple people get large weights, while households with a single person get small weights.

result$weight_dist

ipu() can address the underlying problem with the secondary_importance argument. It is 1 by default, which means the algorithm will attempt to match the absolute values of the secondary targets (as above). As this value is decreased to 0, the secondary targets are scaled to match the average weight of the primary targets.
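One way to picture this scaling is a linear blend between the original person targets and a fully balanced version. The base R sketch below is an assumption made for illustration: only the two endpoints (1 keeps the targets, 0 matches the primary average weight) come from the description above, and the straight-line interpolation between them is not confirmed as ipfr's formula. The helper blend_targets() is hypothetical.

```r
# Household side: total target and number of household seed records.
hh_total <- 35 + 65
n_hh <- 8

# Person side: doubled targets and number of person seed records.
per_targets <- c(`1` = 182, `2` = 130, `3` = 208)
n_per <- 23

avg_hh_weight <- hh_total / n_hh            # average household weight
avg_per_weight <- sum(per_targets) / n_per  # average person weight

# Factor that would bring the person average weight to the household one.
balance_factor <- avg_hh_weight / avg_per_weight

# Hypothetical helper (assumed linear blend, not confirmed ipfr behavior):
# importance = 1 keeps the targets, importance = 0 fully balances them.
blend_targets <- function(importance) {
  per_targets * (importance + (1 - importance) * balance_factor)
}

blend_targets(1)
round(blend_targets(0.2), 1)
round(blend_targets(0), 1)
```

At an importance of 0 the blended person targets imply the same average weight as the household targets, which removes the pressure toward extreme weights at the cost of no longer matching the original person totals.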

The examples below set secondary_importance to 0.80, 0.20, and 0.00 to show the effect on results. With each decrease in importance, the match to person targets gets worse, but weight extremes are reduced.

result <- ipu(hh_seed, hh_targets, per_seed, per_targets, max_iterations = 30,
              secondary_importance = .80)

result

## $weight_tbl
## # A tibble: 8 x 6
##   geo_region   pid hhtype weight avg_weight weight_factor
##        <dbl> <int>  <dbl>  <dbl>      <dbl>         <dbl>
## 1          1     1      1 41.1         12.5        3.29  
## 2          1     2      1  0.635       12.5        0.0508
## 3          1     3      1  1.90        12.5        0.152 
## 4          1     4      2  1.12        12.5        0.0893
## 5          1     5      2  0.782       12.5        0.0626
## 6          1     6      2  3.35        12.5        0.268 
## 7          1     7      2 72.3         12.5        5.78  
## 8          1     8      2  3.35        12.5        0.268 
## 
## $weight_dist

## 
## $primary_comp
## # A tibble: 2 x 6
##   geo          category target result  diff pct_diff
##   <chr>        <chr>     <dbl>  <dbl> <dbl>    <dbl>
## 1 geo_region_1 hhtype_1     35   43.7  8.66     24.7
## 2 geo_region_1 hhtype_2     65   80.9 15.9      24.5
## 
## $secondary_comp
## # A tibble: 3 x 6
##   geo          category  target result   diff pct_diff
##   <chr>        <chr>      <dbl>  <dbl>  <dbl>    <dbl>
## 1 geo_region_1 pertype_1    182   198.  16.0      8.79
## 2 geo_region_1 pertype_2    130   124.  -6.41    -4.93
## 3 geo_region_1 pertype_3    208   189. -18.6     -8.95

result <- ipu(hh_seed, hh_targets, per_seed, per_targets, max_iterations = 30,
              secondary_importance = .20)

result

## $weight_tbl
## # A tibble: 8 x 6
##   geo_region   pid hhtype weight avg_weight weight_factor
##        <dbl> <int>  <dbl>  <dbl>      <dbl>         <dbl>
## 1          1     1      1  19.6        12.5         1.57 
## 2          1     2      1  12.7        12.5         1.02 
## 3          1     3      1   3.30       12.5         0.264
## 4          1     4      2  17.4        12.5         1.39 
## 5          1     5      2  13.0        12.5         1.04 
## 6          1     6      2   4.50       12.5         0.360
## 7          1     7      2  26.7        12.5         2.14 
## 8          1     8      2   4.50       12.5         0.360
## 
## $weight_dist

## 
## $primary_comp
## # A tibble: 2 x 6
##   geo          category target result  diff pct_diff
##   <chr>        <chr>     <dbl>  <dbl> <dbl>    <dbl>
## 1 geo_region_1 hhtype_1     35   35.6  0.63     1.79
## 2 geo_region_1 hhtype_2     65   66.1  1.13     1.73
## 
## $secondary_comp
## # A tibble: 3 x 6
##   geo          category  target result  diff pct_diff
##   <chr>        <chr>      <dbl>  <dbl> <dbl>    <dbl>
## 1 geo_region_1 pertype_1    182  119.  -63.2    -34.7
## 2 geo_region_1 pertype_2    130   84.6 -45.4    -34.9
## 3 geo_region_1 pertype_3    208  134.  -74.4    -35.8

result <- ipu(hh_seed, hh_targets, per_seed, per_targets, max_iterations = 30,
              secondary_importance = 0)

result

## $weight_tbl
## # A tibble: 8 x 6
##   geo_region   pid hhtype weight avg_weight weight_factor
##        <dbl> <int>  <dbl>  <dbl>      <dbl>         <dbl>
## 1          1     1      1   9.29       12.5         0.743
## 2          1     2      1  19.8        12.5         1.58 
## 3          1     3      1   5.65       12.5         0.452
## 4          1     4      2  23.8        12.5         1.90 
## 5          1     5      2  16.0        12.5         1.28 
## 6          1     6      2   6.79       12.5         0.543
## 7          1     7      2  11.2        12.5         0.894
## 8          1     8      2   6.79       12.5         0.543
## 
## $weight_dist

## 
## $primary_comp
## # A tibble: 2 x 6
##   geo          category target result  diff pct_diff
##   <chr>        <chr>     <dbl>  <dbl> <dbl>    <dbl>
## 1 geo_region_1 hhtype_1     35   34.7 -0.26    -0.75
## 2 geo_region_1 hhtype_2     65   64.5 -0.47    -0.73
## 
## $secondary_comp
## # A tibble: 3 x 6
##   geo          category  target result  diff pct_diff
##   <chr>        <chr>      <dbl>  <dbl> <dbl>    <dbl>
## 1 geo_region_1 pertype_1    182  100.  -81.9    -45.0
## 2 geo_region_1 pertype_2    130   71.6 -58.4    -44.9
## 3 geo_region_1 pertype_3    208  115   -93      -44.7

Extreme Weights

Often, it is preferable to constrain weights so that certain under-sampled observations do not end up with extreme weights. ipu() supports this with the min_ratio and max_ratio arguments.

First, the average weight is calculated per geography based on the total of the target tables divided by the number of records in the seed table. Then, the max and min factors set a cap and floor based on a multiple of that average.

Common values to use are:

max_ratio = 5 (5x the average weight)
min_ratio = .2 (1/5 the average weight)

However, care should be taken when moving these arguments from their default values. They impose another constraint on the algorithm and increase the chance of failure. In the example below, very strict values are applied to a small seed table with two clusters.

Values of 1.2 and .8 mean that all weights must be within 20% of the average weight.

hh_seed <- data_frame(
  pid = c(1, 2, 3, 4),
  siz = c(1, 2, 2, 1),
  weight = c(1, 1, 1, 1),
  geo_cluster = c(1, 1, 2, 2)
)

hh_targets <- list()
hh_targets$siz <- data_frame(
  geo_cluster = c(1, 2),
  `1` = c(75, 100),
  `2` = c(25, 150)
)

result <- ipu(hh_seed, hh_targets, max_iterations = 10,
              max_ratio = 1.2, min_ratio = .8)

Consider the effect on geo_cluster 1. With a total target of 100 households and two records in the seed table, the average weight is 50. This means that the weights must be between 40 and 60. The algorithm does not have enough flexibility to meet the controls.

result$primary_comp

## # A tibble: 4 x 6
##   geo           category target result  diff pct_diff
##   <chr>         <chr>     <dbl>  <dbl> <dbl>    <dbl>
## 1 geo_cluster_1 siz_1        75     60   -15      -20
## 2 geo_cluster_1 siz_2        25     40    15       60
## 3 geo_cluster_2 siz_1       100    100     0        0
## 4 geo_cluster_2 siz_2       150    150     0        0
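The result for geo_cluster 1 can be reproduced with a few lines of base R. Because the seed table has exactly one record per size category in that cluster, the weight needed to hit each target equals the target itself, and the cap and floor simply clamp it. This sketch illustrates the arithmetic and is not ipfr's internal code.

```r
# Targets for geo_cluster 1 and the number of seed records in that cluster.
targets <- c(siz_1 = 75, siz_2 = 25)
n_records <- 2

# Average weight, then the floor and cap implied by the ratios.
avg_weight <- sum(targets) / n_records
floor_weight <- 0.8 * avg_weight
cap_weight <- 1.2 * avg_weight

# One seed record per category, so the unconstrained weight is the target.
desired <- targets
clamped <- pmin(pmax(desired, floor_weight), cap_weight)

c(floor = floor_weight, cap = cap_weight)
clamped
```

The clamped weights, 60 and 40, are the values reported in the result column above.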

A second problem can arise from capping weights based on the average weight. In the example below, the targets are changed so that, for geo_cluster 1, they are very unbalanced. Cluster 1 now has 100,000 1-person households but only 5 2-person households.

hh_targets <- list()
hh_targets$siz <- data_frame(
  geo_cluster = c(1, 2),
  `1` = c(100000, 100),
  `2` = c(5, 150)
)

result <- ipu(hh_seed, hh_targets, max_iterations = 10,
              max_ratio = 5, min_ratio = .2)

result$primary_comp

## # A tibble: 4 x 6
##   geo           category target  result  diff pct_diff
##   <chr>         <chr>     <dbl>   <dbl> <dbl>    <dbl>
## 1 geo_cluster_1 siz_1    100000 100000     0         0
## 2 geo_cluster_1 siz_2         5  10000. 9996.   199910
## 3 geo_cluster_2 siz_1       100    100     0         0
## 4 geo_cluster_2 siz_2       150    150     0         0

Even with reasonable values for the weight caps, the minimum allowable weight is much higher than 5. This is an extreme example and is unlikely to be an issue in applications related to housing and population, where the targets are generally on the same scale. However, when expanding a through-trip table, it is common to have some external stations with large targets and others with small ones. In these cases, it is advisable to leave the min_ratio and max_ratio arguments at their default values.
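The size of the floor in this case follows directly from the average weight. The base R sketch below works through the numbers for geo_cluster 1, using the same two-record cluster as above. It illustrates the arithmetic only.

```r
# Unbalanced targets for geo_cluster 1, which has two seed records.
targets <- c(siz_1 = 100000, siz_2 = 5)
n_records <- 2

# The large target dominates the average weight and therefore the floor.
avg_weight <- sum(targets) / n_records
floor_weight <- 0.2 * avg_weight

# Percent difference if the single 2-person record sits at the floor.
pct_diff <- 100 * (floor_weight - targets[["siz_2"]]) / targets[["siz_2"]]

floor_weight
pct_diff
```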

IPU_NR

The function ipu_nr only differs from ipu in one significant way: the method used to balance primary and secondary targets.

As in the more detailed ipu example above, we modify the Arizona example (which is balanced) to double the person targets. This creates a significant imbalance that standard approaches struggle with.

per_targets$pertype <- per_targets$pertype %>%
  mutate_at(
    .vars = vars(`1`, `2`, `3`),
    .funs = funs(. * 2)
  )

While ipu balances the secondary targets directly using secondary_importance, ipu_nr uses an iterative approach and the target_priority argument.

By default, all target tables have an equally high priority, which means that the algorithm will attempt to match all targets exactly. However, target_priority can be modified in several ways. In the code below, a data frame is used to assign the hhtype target a higher priority. (If using a data frame, the column names must be target and priority.) A simple named list can also be used (both options shown below).

# Option 1: a data frame
target_priority <- data_frame(
  target = c("hhtype", "pertype"),
  priority = c(10000, 10)
)

# Option 2: a named list
target_priority <- list()
target_priority$hhtype <- 10000
target_priority$pertype <- 10

result <- ipu_nr(hh_seed, hh_targets, per_seed, per_targets, max_iterations = 30,
                 target_priority = target_priority)

As ipu_nr runs, it relaxes the target constraints on pertype much faster than on hhtype. As a result, the final weights will match the household type much more closely. The two methods generally match targets to the same degree, but often lead to very different distributions of weight ratios. In addition, ipu tends to reach convergence levels around .1 %RMSE faster than ipu_nr, but for levels below that, ipu_nr tends to be faster.

result

## $weight_tbl
## # A tibble: 8 x 6
##   geo_region   pid hhtype weight avg_weight weight_factor
##        <dbl> <int>  <dbl>  <dbl>      <dbl>         <dbl>
## 1          1     1      1  32.2        12.5         2.58 
## 2          1     2      1   2.53       12.5         0.203
## 3          1     3      1   1.86       12.5         0.149
## 4          1     4      2   4.09       12.5         0.327
## 5          1     5      2   5.74       12.5         0.459
## 6          1     6      2   3.00       12.5         0.240
## 7          1     7      2  52.1        12.5         4.17 
## 8          1     8      2   3.00       12.5         0.240
## 
## $weight_dist

## 
## $primary_comp
## # A tibble: 2 x 6
##   geo          category target result  diff pct_diff
##   <chr>        <chr>     <dbl>  <dbl> <dbl>    <dbl>
## 1 geo_region_1 hhtype_1     35   36.6  1.61     4.59
## 2 geo_region_1 hhtype_2     65   67.9  2.91     4.48
## 
## $secondary_comp
## # A tibble: 3 x 6
##   geo          category  target result  diff pct_diff
##   <chr>        <chr>      <dbl>  <dbl> <dbl>    <dbl>
## 1 geo_region_1 pertype_1    182   153. -29.3    -16.1
## 2 geo_region_1 pertype_2    130   104. -26.4    -20.3
## 3 geo_region_1 pertype_3    208   153. -55.2    -26.5

Using ipf

Kyle Ward

2018-06-19
