| Title: | Functions for Epidemiological Analysis using Population Data |
|---|---|
| Description: | Enables computation of epidemiological statistics, including those where counts or mortality rates of the reference population are used. Currently supported: excess hazard models (Dickman, Sloggett, Hills, and Hakulinen (2012) <doi:10.1002/sim.1597>), rates, mean survival times, relative/net survival (in particular the Ederer II (Ederer and Heise (1959)) and Pohar Perme (Pohar Perme, Stare, and Esteve (2012) <doi:10.1111/j.1541-0420.2011.01640.x>) estimators), and standardized incidence and mortality ratios, all of which can be easily adjusted for by covariates such as age. Fast splitting and aggregation of 'Lexis' objects (from package 'Epi') and other computations achieved using 'data.table'. |
| Authors: | Joonas Miettinen [cre, aut] (ORCID: <https://orcid.org/0000-0001-8624-6754>), Matti Rantanen [aut], Karri Seppa [ctb] (ORCID: <https://orcid.org/0000-0002-3847-8814>) |
| Maintainer: | Joonas Miettinen <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.4.13.61 |
| Built: | 2026-06-04 11:11:53 UTC |
| Source: | https://github.com/finnishcancerregistry/popepi |
This function is only intended to be used within a formula
when supplied to e.g. [survtab_ag] and should not be
used elsewhere.
adjust(...)adjust(...)
... |
variables to adjust by, e.g. |
Returns a list of promises of the variables supplied which can be evaluated.
y ~ x + adjust(z)y ~ x + adjust(z)
Lexis dataAggregates a split Lexis object by given variables
and / or expressions into a long-format table of person-years and
transitions / end-points. Automatic aggregation over time scales
by which data has been split if the respective time scales are mentioned
in the aggregation argument to e.g. intervals of calendar time, follow-up time
and/or age.
aggre( lex, by = NULL, type = c("unique", "full"), sum.values = NULL, subset = NULL, verbose = FALSE )aggre( lex, by = NULL, type = c("unique", "full"), sum.values = NULL, subset = NULL, verbose = FALSE )
lex |
a |
by |
variables to tabulate (aggregate) by.
Flexible input, typically e.g.
|
type |
determines output levels to which data is aggregated varying
from returning only rows with |
sum.values |
optional: additional variables to sum by argument
|
subset |
a logical condition to subset by before computations;
e.g. |
verbose |
|
Basics
aggre is intended for aggregation of split Lexis data only.
See [Epi::Lexis] for forming Lexis objects by hand
and e.g. [Epi::splitLexis], [splitLexisDT], and
[splitMulti] for splitting the data. [lexpand]
may be used for simple data sets to do both steps as well as aggregation
in the same function call.
Here aggregation refers to computing person-years and the appropriate events (state transitions and end points in status) for the subjects in the data. Hence, it computes e.g. deaths (end-point and state transition) and censorings (end-point) as well as events in a multi-state setting (state transitions).
The result is a long-format data.frame or data.table
(depending on options("popEpi.datatable"); see ?popEpi)
with the columns pyrs and the appropriate transitions named as
fromXtoY, e.g. from0to0 and from0to1 depending
on the values of lex.Cst and lex.Xst.
The by argument
The by argument determines the length of the table, i.e.
the combinations of variables to which data is aggregated.
by is relatively flexible, as it can be supplied as
a character string vector, e.g. c("sex", "area"),
naming variables existing in lex
an expression, e.g. factor(sex, 0:1, c("m", "f"))
using any variable found in lex
a list (fully or partially named) of expressions, e.g.
list(gender = factor(sex, 0:1, c("m", "f"), area)
Note that expressions effectively allow a variable to be supplied simply as
e.g. by = sex (as a symbol/name in R lingo).
The data is then aggregated to the levels of the given variables
or expression(s). Variables defined to be time scales in the supplied
Lexis are processed in a special way: If any are mentioned in the
by argument, intervals of them are formed based on the breaks
used to split the data: e.g. if age was split using the breaks
c(0, 50, Inf), mentioning age in by leads to
creating the age intervals [0, 50) and [50, Inf)
and aggregating to them. The intervals are identified in the output
as the lower bounds of the appropriate intervals.
The order of multiple time scales mentioned in by matters,
as the last mentioned time scale is assumed to be a survival time scale
for when computing event counts. E.g. when the data is split by the breaks
list(FUT = 0:5, CAL = c(2008,2010)), time lines cut short at
CAL = 2010 are considered to be censored, but time lines cut short at
FUT = 5 are not. See Return.
Aggregation types (styles)
It is almost always enough to aggregate the data to variable levels
that are actually represented in the data
(default aggre = "unique"; alias "non-empty").
For certain uses it may be useful
to have also "empty" levels represented (resulting in some rows in output
with zero person-years and events); in these cases supplying
aggre = "full" (alias "cartesian") causes aggre
to determine the Cartesian product of all the levels of the supplied
by variables or expressions and aggregate to them. As an example
of a Cartesian product, try
merge(1:2, 1:5).
A long data.frame or data.table of aggregated person-years
(pyrs), numbers of subjects at risk (at.risk), and events
formatted fromXtoY, where X and X are states
transitioning from and to or states at the end of each lex.id's
follow-up (implying X = Y). Subjects at risk are computed
in the beginning of an interval defined by any Lexis time scales and
mentioned in by, but events occur at any point within an interval.
When the data has been split along multiple time scales, the last
time scale mentioned in by is considered to be the survival time
scale with regard to computing events. Time lines cut short by the
extrema of non-survival-time-scales are considered to be censored
("transitions" from the current state to the current state).
Joonas Miettinen
[aggregate] for a similar base R solution,
and [ltable] for a data.table based aggregator. Neither
are directly applicable to split Lexis data.
Other aggregation functions:
as.aggre(),
lexpand(),
setaggre(),
summary.aggre()
## form a Lexis object library(Epi) data(sibr) x <- sibr[1:10,] x[1:5,]$sex <- 0 ## pretend some are male x <- Lexis(data = x, entry = list(AGE = dg_age, CAL = get.yrs(dg_date)), exit = list(CAL = get.yrs(ex_date)), entry.status=0, exit.status = status) x <- splitMulti(x, breaks = list(CAL = seq(1993, 2013, 5), AGE = seq(0, 100, 50))) ## these produce the same results (with differing ways of determining aggre) a1 <- aggre(x, by = list(gender = factor(sex, 0:1, c("m", "f")), agegroup = AGE, period = CAL)) a2 <- aggre(x, by = c("sex", "AGE", "CAL")) a3 <- aggre(x, by = list(sex, agegroup = AGE, CAL)) ## returning also empty levels a4 <- aggre(x, by = c("sex", "AGE", "CAL"), type = "full") ## computing also expected numbers of cases x <- lexpand(sibr[1:10,], birth = bi_date, entry = dg_date, exit = ex_date, status = status %in% 1:2, pophaz = popmort, fot = 0:5, age = c(0, 50, 100)) x$d.exp <- with(x, lex.dur*pop.haz) ## these produce the same result a5 <- aggre(x, by = c("sex", "age", "fot"), sum.values = list(d.exp)) a5 <- aggre(x, by = c("sex", "age", "fot"), sum.values = "d.exp") a5 <- aggre(x, by = c("sex", "age", "fot"), sum.values = d.exp) ## same result here with custom name a5 <- aggre(x, by = c("sex", "age", "fot"), sum.values = list(expCases = d.exp)) ## computing pohar-perme weighted figures x$d.exp.pp <- with(x, lex.dur*pop.haz*pp) a6 <- aggre(x, by = c("sex", "age", "fot"), sum.values = c("d.exp", "d.exp.pp")) ## or equivalently e.g. sum.values = list(expCases = d.exp, expCases.p = d.exp.pp).## form a Lexis object library(Epi) data(sibr) x <- sibr[1:10,] x[1:5,]$sex <- 0 ## pretend some are male x <- Lexis(data = x, entry = list(AGE = dg_age, CAL = get.yrs(dg_date)), exit = list(CAL = get.yrs(ex_date)), entry.status=0, exit.status = status) x <- splitMulti(x, breaks = list(CAL = seq(1993, 2013, 5), AGE = seq(0, 100, 50))) ## these produce the same results (with differing ways of determining aggre) a1 <- aggre(x, by = list(gender = factor(sex, 0:1, c("m", "f")), agegroup = AGE, period = CAL)) a2 <- aggre(x, by = c("sex", "AGE", "CAL")) a3 <- aggre(x, by = list(sex, agegroup = AGE, CAL)) ## returning also empty levels a4 <- aggre(x, by = c("sex", "AGE", "CAL"), type = "full") ## computing also expected numbers of cases x <- lexpand(sibr[1:10,], birth = bi_date, entry = dg_date, exit = ex_date, status = status %in% 1:2, pophaz = popmort, fot = 0:5, age = c(0, 50, 100)) x$d.exp <- with(x, lex.dur*pop.haz) ## these produce the same result a5 <- aggre(x, by = c("sex", "age", "fot"), sum.values = list(d.exp)) a5 <- aggre(x, by = c("sex", "age", "fot"), sum.values = "d.exp") a5 <- aggre(x, by = c("sex", "age", "fot"), sum.values = d.exp) ## same result here with custom name a5 <- aggre(x, by = c("sex", "age", "fot"), sum.values = list(expCases = d.exp)) ## computing pohar-perme weighted figures x$d.exp.pp <- with(x, lex.dur*pop.haz*pp) a6 <- aggre(x, by = c("sex", "age", "fot"), sum.values = c("d.exp", "d.exp.pp")) ## or equivalently e.g. sum.values = list(expCases = d.exp, expCases.p = d.exp.pp).
Given a character vector, checks if all names are present in names(data).
Throws error if stops=TRUE, else returns FALSE if some variable name is not present.
all_names_present(data, var.names, stops = TRUE, msg = NULL)all_names_present(data, var.names, stops = TRUE, msg = NULL)
data |
dataset where the variable names should be found |
var.names |
a character vector of variable names, e.g.
|
stops |
logical, stop returns exception |
msg |
Custom message to return instead of default message.
Special: include |
TRUE if all var.names are in data, else FALSE,
Joonas Miettinen
[robust_values]
arrays, data.frames and ratetablesUtilities to transform objects between array, data.frame, and
survival::ratetable.
long_df_to_array(x, stratum.col.nms, value.col.nm) long_df_to_ratetable( x, stratum.col.nms, value.col.nm, dim.types, cut.points = NULL ) long_dt_to_array(x, stratum.col.nms, value.col.nm) long_dt_to_ratetable( x, stratum.col.nms, value.col.nm, dim.types, cut.points = NULL ) array_to_long_df(x) array_to_long_dt(x) array_to_ratetable(x, dim.types, cut.points = NULL) ratetable_to_array(x) ratetable_to_long_df(x) ratetable_to_long_dt(x)long_df_to_array(x, stratum.col.nms, value.col.nm) long_df_to_ratetable( x, stratum.col.nms, value.col.nm, dim.types, cut.points = NULL ) long_dt_to_array(x, stratum.col.nms, value.col.nm) long_dt_to_ratetable( x, stratum.col.nms, value.col.nm, dim.types, cut.points = NULL ) array_to_long_df(x) array_to_long_dt(x) array_to_ratetable(x, dim.types, cut.points = NULL) ratetable_to_array(x) ratetable_to_long_df(x) ratetable_to_long_dt(x)
x |
|
stratum.col.nms |
a vector of column names in |
value.col.nm |
name of column in |
dim.types |
see |
cut.points |
see
|
long_df_to_array: converts a long-format data.frame to an array
with one or more dimensions
long_df_to_ratetable: calls long_df_to_array and then
array_to_ratetable
long_dt_to_array: simply asserts that x is a data.table and
calls long_df_to_array
long_dt_to_ratetable: calls long_dt_to_array and then
array_to_ratetable
array_to_long_df: converts an array with one or more dimensions into
a long-format data.frame; any dimnames are used to name and fill the
stratifying columns; for dimensions without a name, ".dX" is used
for stratifying column number X; for each k, if there are no contents
in dimnames(x)[[k]], the elements of seq(dim(x)[k]) are used to fill
the corresponding stratifying column; the value column always has the name
"value"
array_to_long_dt: calls array_to_long_df and converts result to a
data.table for convenience
array_to_ratetable: converts an array to a survival::ratetable
ratetable_to_array: converts a survival::ratetable to an array
ratetable_to_long_df: calls ratetable_to_array and then
array_to_long_df
ratetable_to_long_dt: calls ratetable_to_array and then
array_to_long_dt
long_df_to_array: an array
long_df_to_ratetable: a survival::ratetable
long_dt_to_array: an array
long_dt_to_ratetable: a survival::ratetable
array_to_long_df: an data.frame
array_to_long_dt: an data.table
array_to_ratetable: a survival::ratetable
ratetable_to_array: an array
ratetable_to_long_df: a data.frame
ratetable_to_long_dt: a data.table
long_dt <- popEpi::popmort arr <- long_df_to_array(long_dt, c("agegroup", "year", "sex"), "haz") rt <- array_to_ratetable(arr, dim.types = c(2L, 4L, 1L)) arr2 <- ratetable_to_array(rt) long_df2 <- array_to_long_df(arr2) identical(sort(long_dt[["haz"]]), sort(long_df2[["value"]]))long_dt <- popEpi::popmort arr <- long_df_to_array(long_dt, c("agegroup", "year", "sex"), "haz") rt <- array_to_ratetable(arr, dim.types = c(2L, 4L, 1L)) arr2 <- ratetable_to_array(rt) long_df2 <- array_to_long_df(arr2) identical(sort(long_dt[["haz"]]), sort(long_df2[["value"]]))
aggre
Coerces an R object to an aggre object, identifying
the object as one containing aggregated counts, person-years and other
information.
as.aggre(x, values = NULL, by = NULL, breaks = NULL, ...) ## S3 method for class 'data.frame' as.aggre(x, values = NULL, by = NULL, breaks = NULL, ...) ## S3 method for class 'data.table' as.aggre(x, values = NULL, by = NULL, breaks = NULL, ...) ## Default S3 method: as.aggre(x, ...)as.aggre(x, values = NULL, by = NULL, breaks = NULL, ...) ## S3 method for class 'data.frame' as.aggre(x, values = NULL, by = NULL, breaks = NULL, ...) ## S3 method for class 'data.table' as.aggre(x, values = NULL, by = NULL, breaks = NULL, ...) ## Default S3 method: as.aggre(x, ...)
x |
a |
values |
a character string vector; the names of value variables |
by |
a character string vector; the names of variables by which
|
breaks |
a list of breaks, where each element is a breaks vector
as usually passed to e.g. |
... |
arguments passed to or from methods |
Returns a copy of x with attributes set to those of an object of class
"aggre".
as.aggre(data.frame): Coerces a data.frame to an aggre object
as.aggre(data.table): Coerces a data.table to an aggre object
as.aggre(default): Default method for as.aggre (stops computations
if no class-specific method found)
Joonas Miettinen
Other aggregation functions:
aggre(),
lexpand(),
setaggre(),
summary.aggre()
library("data.table") df <- data.frame(sex = rep(c("male", "female"), each = 5), obs = rpois(10, rep(7,5, each=5)), pyrs = rpois(10, lambda = 10000)) dt <- as.data.table(df) df <- as.aggre(df, values = c("pyrs", "obs"), by = "sex") dt <- as.aggre(dt, values = c("pyrs", "obs"), by = "sex") class(df) class(dt) BL <- list(fot = 0:5) df <- data.frame(df) df <- as.aggre(df, values = c("pyrs", "obs"), by = "sex", breaks = BL)library("data.table") df <- data.frame(sex = rep(c("male", "female"), each = 5), obs = rpois(10, rep(7,5, each=5)), pyrs = rpois(10, lambda = 10000)) dt <- as.data.table(df) df <- as.aggre(df, values = c("pyrs", "obs"), by = "sex") dt <- as.aggre(dt, values = c("pyrs", "obs"), by = "sex") class(df) class(dt) BL <- list(fot = 0:5) df <- data.frame(df) df <- as.aggre(df, values = c("pyrs", "obs"), by = "sex", breaks = BL)
Coerces an yrs object to a Date object.
Some loss of information comes if year.length = "approx"
was set when using [get.yrs], so the transformation back
to Date will not be perfect there. With year.length = "actual"
the original values are perfectly retrieved.
## S3 method for class 'yrs' as.Date(x, ...)## S3 method for class 'yrs' as.Date(x, ...)
x |
an |
... |
unused, included for compatibility with other |
A vector of Date values based on the input fractional years.
Joonas Miettinen
[get.yrs]
data("sire", package = "popEpi") ## approximate year lengths: here 20 % have an extra day added sire$dg_yrs <- get.yrs(sire$dg_date) summary(sire$dg_yrs) dg_date2 <- as.Date(sire$dg_yrs) summary(as.numeric(dg_date2 - as.Date(sire$dg_date))) ## using actual year lengths sire$dg_yrs <- get.yrs(sire$dg_date, year.length = "actual") summary(sire$dg_yrs) dg_date2 <- as.Date(sire$dg_yrs) summary(as.numeric(dg_date2 - as.Date(sire$dg_date)))data("sire", package = "popEpi") ## approximate year lengths: here 20 % have an extra day added sire$dg_yrs <- get.yrs(sire$dg_date) summary(sire$dg_yrs) dg_date2 <- as.Date(sire$dg_yrs) summary(as.numeric(dg_date2 - as.Date(sire$dg_date))) ## using actual year lengths sire$dg_yrs <- get.yrs(sire$dg_date, year.length = "actual") summary(sire$dg_yrs) dg_date2 <- as.Date(sire$dg_yrs) summary(as.numeric(dg_date2 - as.Date(sire$dg_date)))
data.table/data.frame from long format to wide formatConvenience function for using [data.table::dcast.data.table];
inputs are character strings (names of variables) instead of a formula.
cast_simple(data = NULL, columns = NULL, rows = NULL, values = NULL)cast_simple(data = NULL, columns = NULL, rows = NULL, values = NULL)
data |
a |
columns |
a character string vector; the (unique combinations of the) levels of these variable will be different rows |
rows |
a character string vector; the (unique combinations of the) levels of these variable will be different columns |
values |
a character string; the variable which will be represented
on rows and columns as specified by |
This function is just a small interface for dcast /
dcast.data.table and less flexible than the originals.
Note that all data.table objects are also data.frame
objects, but that each have their own dcast method.
[data.table::dcast.data.table] is faster.
If any values in value.vars need to be
aggregated, they are aggregated using sum.
See ?dcast.
A data.table just like [data.table::dcast].
Matti Rantanen, Joonas Miettinen
library("data.table") ## e.g. silly counts from a long-format table to a wide format test <- data.table::copy(popEpi::sire) test$dg_y <- year(test$dg_date) test$ex_y <- year(test$ex_date) tab <- ltable(test, c("dg_y","ex_y")) cast_simple(tab, columns='dg_y', rows="ex_y", values="obs")library("data.table") ## e.g. silly counts from a long-format table to a wide format test <- data.table::copy(popEpi::sire) test$dg_y <- year(test$dg_date) test$ex_y <- year(test$ex_date) tab <- ltable(test, c("dg_y","ex_y")) cast_simple(tab, columns='dg_y', rows="ex_y", values="obs")
Selects lowest values of each factor after cut() based on the assumption that the value starts from index 2 and end in comma ",".
cut_bound(t, factor = TRUE)cut_bound(t, factor = TRUE)
t |
is a character vector of elements, e.g. "(20,60]" |
factor |
logical; TRUE returns informative character string, FALSE numeric (left value) |
type = 'factor': "[50,52)" -> "50-51" OR "[50,51)" -> "50"
type = 'numeric': lowest bound in numeric.
If factor = TRUE, returns a character vector; else returns a numeric
vector.
Matti Rantanen
cut_bound("[1900, 1910)") ## "1900-1909"cut_bound("[1900, 1910)") ## "1900-1909"
Several functions in popEpi have support for direct standardization of estimates. This document explains the usage of weighting with those functions.
Direct standardization is performed by computing estimates of
E
by the set of adjusting variables A, to which a set of weights
W is applicable. The weighted average over A is then the
direct-adjusted estimate of E (E*).
To enable both quick and easy as well as more rigorous usage of direct standardization with weights, the weights arguments in popEpi can be supplied in several ways. Ability to use the different ways depends on the number of adjusting variables.
The weights are always handled internally to sum to 1, so they do not need to be scaled in this manner when they are supplied. E.g. counts of subjects in strata may be passed.
In the simple case where we are adjusting by only one variable (e.g. by age group), one can simply supply a vector of weights:
FUN(weights = c(0.1, 0.25, 0.25, 0.2, 0.2))
which may be stored in advance:
w <- c(0.1, 0.25, 0.25, 0.2, 0.2)
FUN(weights = w)
The order of the weights matters. popEpi functions with direct
adjusting enabled match the supplied weights to the adjusting variables
as follows: If the adjusting variable is a factor, the order
of the levels is used. Otherwise, the alphabetic order of the unique
values is used (try sort to see how it works). For clarity
and certainty we recommend using factor or numeric variables
when possible. character variables should be avoided: to see why,
try sort(15:9) and sort(as.character(15:9)).
It is also possible to supply a character string corresponding
to one of the age group standardization schemes integrated into popEpi:
'europe_1976_18of5' - European std. population (1976), 18 age groups
'nordic_2000_18of5' - Nordic std. population (2000), 18 age groups
'world_1966_18of5' - world standard (1966), 18 age groups
'world_2000_18of5' - world standard (2000), 18 age groups
'world_2000_20of5' - world standard (2000), 20 age groups
'world_2000_101of1' - world standard (2000), 101 age groups
Additionally, [ICSS] contains international weights used in
cancer survival analysis, but they are not currently usable by passing
a string to weights and must be supplied by hand.
You may also supply weights = "internal" to use internally
computed weights, i.e. usually simply the counts of subjects / person-time
experienced in each stratum. E.g.
FUN(weights = "world_2000_18of5")
will use the world standard population from 2000 as
weights for 18 age groups, that your adjusting variable is
assumed to contain. The adjusting variable must be coded in this case as
a numeric variable containing 1:18 or as a factor with
18 levels (coded from the youngest to the oldest age group).
In the case that you employ more than one adjusting variable, separate
weights should be passed to match to the levels of the different adjusting
variables. When supplied correctly, "grand" weights are formed based on
the variable-specific weights by multiplying over the variable-specific
weights (e.g. if men have w = 0.5 and the age group 0-4 has
w = 0.1, the "grand" weight for men aged 0-4 is 0.5*0.1).
The "grand" weights are then used for adjusting after ensuring they
sum to one.
When using multiple adjusting variables, you
are allowed to pass either a named list of
weights or a data.frame of weights. E.g.
WL <- list(agegroup = age_w, sex = sex_w)
FUN(weights = WL)
where age_w and sex_w are numeric vectors. Given the
conditions explained in the previous section are satisfied, you may also do
e.g.
WL <- list(agegroup = "world_2000_18of", sex = sex_w)
FUN(weights = WL)
and the world standard pop is used as weights for the age groups as outlined in the previous section.
Sometimes using a data.frame can be clearer (and it is fool-proof
as well). To do this, form a data.frame that repeats the levels
of your adjusting variables by each level of every other adjusting variable,
and assign the weights as a column named "weights". E.g.
wdf <- data.frame(sex = rep(0:1, each = 18), agegroup = rep(1:18, 2))
wdf$weights <- rbinom(36, size = 100, prob = 0.25)
FUN(weights = wdf)
If you want to use the counts of subjects in strata as the weights, one way to do this is by e.g.
wdf <- as.data.frame(x$V1, x$V2, x$V3)
names(wdf) <- c("V1", "V2", "V3", "weights")
Joonas Miettinen
Source of the Nordic standard population in 5-year age groups (also contains European & 1966 world standards): https://www-dep.iarc.fr/NORDCAN/english/glossary.htm
Source of the 1976 European standard population:
Waterhouse, J.,Muir, C.S.,Correa, P.,Powell, J., eds (1976). Cancer Incidence in Five Continents, Vol. III. IARC Scientific Publications, No. 15, Lyon, IARC. ISBN: 9789283211150
Source of 2000 world standard population in 1-year age groups: https://seer.cancer.gov/stdpopulations/stdpop.singleages.html
Other weights:
ICSS,
stdpop101,
stdpop18
Other popEpi argument evaluation docs:
flexible_argument
Convert factor variable with numbers as levels into a numeric variable
fac2num(x)fac2num(x)
x |
a factor variable with numbers as levels |
For example, a factor with levels c("5","7") is converted into
a numeric variable with values c(5,7).
A numeric vector based on the levels of x.
[robust_values]
## this is often not intended as.numeric(factor(c(5,7))) ## result: c(1,2) ## but this fac2num(factor(c(5,7))) ## result: c(5,7) ## however as.numeric(factor(c("5","7","a"))) ## 1:3 suppressWarnings( fac2num(factor(c("5","7","a"))) ## c(5,7,NA) )## this is often not intended as.numeric(factor(c(5,7))) ## result: c(1,2) ## but this fac2num(factor(c(5,7))) ## result: c(5,7) ## however as.numeric(factor(c("5","7","a"))) ## 1:3 suppressWarnings( fac2num(factor(c("5","7","a"))) ## c(5,7,NA) )
Certain arguments in popEpi can be passed in multiple ways. This document shows the usage and a pitfall in the usage of such flexible arguments.
Flexible arguments in popEpi are used to pass variables existing
in your data or in the environment where the function is used
(for everyday users this is the global environment - in simple terms,
where your data is / your work space). The flexible arguments
are modelled after the by argument in data.tables -
see ?data.table. There are many ways to supply the same information
to certain functions in popEpi, but the possible ways listed below
may be limited in some of them to only allow for using only a part of them.
Most commonly you may pass variable names as character strings, e.g.
FUN(arg = c("V1", "V2"), data = x)
which may be stored in advance:
vars <- c("V1", "V2")
FUN(arg = vars, data = x)
where x contains those variables. You may also supply variable
names as symbols:
FUN(arg = V1, data = x)
Or as a list of symbols (similarly to as in [aggregate]):
FUN(arg = list(V1, V2), data = x)
Or as a list of expressions:
FUN(arg = list(V1 + 1, factor(V2)), data = x)
A formula without a left-hand-side specified is sometimes allowed as well:
FUN(arg = ~ I(V1 + 1) + factor(V2), data = x)
Using a symbol or a list of symbols/expressions typically causes the function to look for the variable(s) first in the supplied data (if any) and then where the function was called. For everyday users this means you might define e.g.
V3 <- factor(letters)
and do e.g.
FUN(arg = list(V1 + 1, factor(V2), V3), data = x)
provided V1 and V2 exist in x or in the function calling
environment.
There is one way to use flexible arguments incorrectly: By supplying the name of a variable which exists both in the supplied data and the calling environment, and intending the latter to be used. E.g.
vars <- c("V2")
FUN(arg = V3, data = x)
where x has a column named vars. This causes the function to
use x$vars and NOT x$V2.
Function programmers are advised to pass character strings whenever possible. To fool-proof against conflicts as described in the section above, refer to the calling environment explicitly when passing the variable containing the character strings:
TF <- environment() ## current env to refer to
vars <- c("V1", "V2")
FUN(arg = TF$vars, data = x)
Even if x has columns named vars and TF,
using TF$vars does not use those columns but only evaluates
TF$vars
in the calling environment. This is made possible by the fact
that data is always passed as a data.frame, within which evaluation
of expressions using the dollar operator is not possible. Therefore
it is safe to assume the data should not be used. However, lists of
expressions will not be checked for dollar use and will fail in conflict
situations:
TF <- environment() ## current env to refer to
vars <- letters[1:5]
x <- data.frame(vars = 1:5, TF = 5:1, V1 = 10:6)
FUN(arg = list(TF$vars, V1), data = x)
On the other hand you may typically also pass quoted ([quote])
or substituted [substitute] expressions etc., where
the env$object trick will work as well:
q <- quote(list(vars, V1))
FUN(arg = TF$q, data = x)
This works even with
a <- 1:5
V1 <- quote(TF$a)
FUN(arg = TF$V1, data = x)
So no conflicts should occur.
Joonas Miettinen
Other popEpi argument evaluation docs:
direct_standardization
data(sire) ## prepare data for e.g. 5-year "period analysis" for 2008-2012 ## note: sire is a simulated cohort integrated into popEpi. BL <- list(fot=seq(0, 5, by = 1/12)) x <- lexpand(sire, birth = bi_date, entry = dg_date, exit = ex_date, status = status %in% 1:2, breaks = BL) x <- aggre(x, by = fot) ## silly example of referring to pyrs data by fixed character string; ## its possible that the real name wont be fixed in a real-life application. pyrs <- "actual_pyrs" TF <- environment() x$actual_pyrs <- as.numeric(x$pyrs) x$pyrs <- 1 ## this works (uses actual_pyrs eventually) st <- survtab_ag(fot ~ 1, data = x, surv.type = "surv.obs", pyrs = TF$pyrs, d = from0to1, surv.method = "hazard") ## this would be wrong (sees expression 'pyrs' and uses that column, ## which is not what is intended here) st <- survtab_ag(fot ~ 1, data = x, surv.type = "surv.obs", pyrs = pyrs, d = from0to1, surv.method = "hazard")data(sire) ## prepare data for e.g. 5-year "period analysis" for 2008-2012 ## note: sire is a simulated cohort integrated into popEpi. BL <- list(fot=seq(0, 5, by = 1/12)) x <- lexpand(sire, birth = bi_date, entry = dg_date, exit = ex_date, status = status %in% 1:2, breaks = BL) x <- aggre(x, by = fot) ## silly example of referring to pyrs data by fixed character string; ## its possible that the real name wont be fixed in a real-life application. pyrs <- "actual_pyrs" TF <- environment() x$actual_pyrs <- as.numeric(x$pyrs) x$pyrs <- 1 ## this works (uses actual_pyrs eventually) st <- survtab_ag(fot ~ 1, data = x, surv.type = "surv.obs", pyrs = TF$pyrs, d = from0to1, surv.method = "hazard") ## this would be wrong (sees expression 'pyrs' and uses that column, ## which is not what is intended here) st <- survtab_ag(fot ~ 1, data = x, surv.type = "surv.obs", pyrs = pyrs, d = from0to1, surv.method = "hazard")
Using Date objects, calculates given dates as fractional years.
get.yrs(x, year.length = "approx", ...)get.yrs(x, year.length = "approx", ...)
x |
a |
year.length |
character string, either |
... |
additional arguments passed on to |
x should preferably be a Date or IDate
object, although it can also be a character string variable
which is coerced internally to Date format
using [as.Date.character].
When year.length = 'actual', fractional years are calculated as
year + (day_in_year-1)/365 for non-leap-years
and as year + (day_in_year-1)/366 for leap years.
If year.length = 'approx', fractional years are always
calculated as in year + (day_in_year-1)/365.242199.
There is a slight difference, then, between the two methods when calculating durations between fractional years. For meticulous accuracy one might instead want to calculate durations using dates (days) and convert the results to fractional years.
Note that dates are effectively converted to fractional years at
00:00:01 o'clock:
get.yrs("2000-01-01") = 2000, and
get.yrs("2000-01-02") = 2000 + 1/365.242199.
A numeric vector of fractional years.
Joonas Miettinen
[Epi::cal.yr], [as.Date.yrs], [as.Date]
data("sire") sire$dg_yrs <- get.yrs(sire$dg_date) summary(sire$dg_yrs) ## see: ?as.Date.yrs dg_date2 <- as.Date(sire$dg_yrs) summary(as.numeric(dg_date2 - as.Date(sire$dg_date))) ## Epi's cal.yr versus get.yrs d <- as.Date("2000-01-01") Epi::cal.yr(d) ## 1999.999 get.yrs(d) ## 2000 ## "..." passed on to as.Date, so character / numeric also accepted as input ## (and whatever else as.Date accepts) get.yrs("2000-06-01") get.yrs("20000601", format = "%Y%m%d") get.yrs("1/6/00", format = "%d/%m/%y") get.yrs(100, origin = "1970-01-01")data("sire") sire$dg_yrs <- get.yrs(sire$dg_date) summary(sire$dg_yrs) ## see: ?as.Date.yrs dg_date2 <- as.Date(sire$dg_yrs) summary(as.numeric(dg_date2 - as.Date(sire$dg_date))) ## Epi's cal.yr versus get.yrs d <- as.Date("2000-01-01") Epi::cal.yr(d) ## 1999.999 get.yrs(d) ## 2000 ## "..." passed on to as.Date, so character / numeric also accepted as input ## (and whatever else as.Date accepts) get.yrs("2000-06-01") get.yrs("20000601", format = "%Y%m%d") get.yrs("1/6/00", format = "%d/%m/%y") get.yrs(100, origin = "1970-01-01")
Contains three sets age-standardisation weights for age-standardized survival (net, relative or observed).
data.table with columns
age - lower bound of the age group
ICSS1 - first set of weights, sums to 100 000
ICSS2 - second set of weights, sums to 100 000
ICSS3 - third set of weights, sums to 100 000
ICSS weights (US National Cancer Institute website)
Corazziari, Isabella, Mike Quinn, and Riccardo Capocaccia. "Standard cancer patient population for age standardising survival ratios." European Journal of Cancer 40.15 (2004): 2307-2316.
Other popEpi data:
meanpop_fi,
popmort,
sibr,
sire,
stdpop101,
stdpop18
Other weights:
direct_standardization,
stdpop101,
stdpop18
## aggregate weights to a subset of age groups data(ICSS) cut <- c(0, 30, 50, 70, Inf) agegr <- cut(ICSS$age, cut, right = FALSE) aggregate(ICSS1~agegr, data = ICSS, FUN = sum)## aggregate weights to a subset of age groups data(ICSS) cut <- c(0, 30, 50, 70, Inf) agegr <- cut(ICSS$age, cut, right = FALSE) aggregate(ICSS1~agegr, data = ICSS, FUN = sum)
Given a vector or column of year values (numeric or integer), [is_leap_year] returns a vector of equal length
of logical indicators, i.e. a vector where corresponding leap years have value TRUE, and FALSE otherwise.
is_leap_year(years)is_leap_year(years)
years |
a vector or column of year values (numeric or integer) |
A logical vector where TRUE indicates a leap year.
Joonas Miettinen
## can be used to assign new columns easily, e.g. a dummy indicator column df <- data.frame(yrs=c(1900,1904,2005,1995)) df$lyd <- as.integer(is_leap_year(df$yrs)) ## mostly it is useful as a condition or to indicate which rows have leap years which(is_leap_year(df$yrs)) # 2 df[is_leap_year(df$yrs),] # 2nd row## can be used to assign new columns easily, e.g. a dummy indicator column df <- data.frame(yrs=c(1900,1904,2005,1995)) df$lyd <- as.integer(is_leap_year(df$yrs)) ## mostly it is useful as a condition or to indicate which rows have leap years which(is_leap_year(df$yrs)) # 2 df[is_leap_year(df$yrs),] # 2nd row
Date objectTest whether obj inherits one of Date or IDate.
is.Date(obj)is.Date(obj)
obj |
object to test on |
TRUE if obj is of class "Date" or "IDate".
Joonas Miettinen
[get.yrs], [is_leap_year], [as.Date]
Lexis DatasetsMake [Epi::Lexis] objects.
Lexis_fpa( data, birth = NULL, entry = NULL, exit = NULL, entry.status = NULL, exit.status = NULL, subset = NULL, ... ) Lexis_dt(...)Lexis_fpa( data, birth = NULL, entry = NULL, exit = NULL, entry.status = NULL, exit.status = NULL, subset = NULL, ... ) Lexis_dt(...)
data |
a |
birth |
the time of birth; A character string naming the variable in data or an expression to evaluate - see Flexible input |
entry |
the time at entry to follow-up; supplied the
same way as |
exit |
the time at exit from follow-up; supplied the
same way as |
entry.status |
passed on to |
exit.status |
passed on to |
subset |
a logical condition to subset by before passing data
and arguments to |
... |
arguments passed to |
popEpi::Lexis_fpa
Returns a Lexis object with the additional class data.table. It has
the usual columns that Lexis objects
have, and with time scale columns fot, per, and age.
They are calculated as
fot = entry - entry (to ensure correct format, e.g. difftime)
per = entry
and
age = entry - birth.
popEpi::Lexis_dt
Returns a Lexis object that is also a data.table —
therefore output has the classes c("Lexis", "data.table", "data.frame").
popEpi::Lexis_fpa
popEpi::Lexis_fpa collects data from its inputs to to call [Epi::Lexis].
This is a convenience function for making a Lexis object with the
time scales fot, per, and age.
popEpi::Lexis_dt
popEpi::Lexis_dt performs the following steps:
Calls [Epi::Lexis].
Calls [data.table::setDT] to make output into a data.table,
[data.table::setattr] to set its class to
c("Lexis", "data.table", "data.frame"), and
[data.table::setkeyv] to sort the output by lex.id and the time
scales.
Returns the Lexis / data.table object.
Lexis_fpa output is now also a data.table, so the complete class vector
is c("Lexis", "data.table", "data.frame").
New function Lexis_dt, a wrapper of Epi::Lexis which also sets class to
c("Lexis", "data.table", "data.frame").
Other Lexis_functions:
lexis_merge(),
lexis_split_merge_aggregate_by_stratum()
# popEpi::Lexis_fpa data("sire", package = "popEpi") lex <- Lexis_fpa(sire, birth = "bi_date", entry = dg_date, exit = ex_date + 1L, exit.status = "status") ## some special cases myVar <- "bi_date" l <- list(myVar = "bi_date") sire$l <- sire$myVar <- 1 ## conflict: myVar taken from data when "bi_date" was intended lex <- Lexis_fpa(sire, birth = myVar, entry = dg_date, exit = ex_date + 1L, exit.status = "status") ## no conflict with names in data lex <- Lexis_fpa(sire, birth = l$myVar, entry = dg_date, exit = ex_date + 1L, exit.status = "status") stopifnot( identical(class(lex), c("Lexis", "data.table", "data.frame")), Epi::timeScales(lex) %in% c("fot", "per", "age") ) # popEpi::Lexis_dt lex_1 <- popEpi::Lexis_dt( data = popEpi::sire, entry = list( ts_fut = 0L, ts_age = as.integer(dg_date - bi_date), ts_cal = as.integer(dg_date) ), exit = list(ts_cal = as.integer(ex_date)), entry.status = 0L, exit.status = status ) stopifnot( class(lex_1) == c("Lexis", "data.table", "data.frame"), Epi::timeScales(lex_1) %in% c("ts_fut", "ts_age", "ts_cal") ) lex_2 <- popEpi::Lexis_dt( data = popEpi::sire, entry = list( ts_fut = 0L, ts_age = as.integer(dg_date - bi_date), ts_cal = as.integer(dg_date) ), duration = as.integer(ex_date - dg_date), entry.status = 0L, exit.status = status ) stopifnot( class(lex_2) == c("Lexis", "data.table", "data.frame"), Epi::timeScales(lex_2) %in% c("ts_fut", "ts_age", "ts_cal") )# popEpi::Lexis_fpa data("sire", package = "popEpi") lex <- Lexis_fpa(sire, birth = "bi_date", entry = dg_date, exit = ex_date + 1L, exit.status = "status") ## some special cases myVar <- "bi_date" l <- list(myVar = "bi_date") sire$l <- sire$myVar <- 1 ## conflict: myVar taken from data when "bi_date" was intended lex <- Lexis_fpa(sire, birth = myVar, entry = dg_date, exit = ex_date + 1L, exit.status = "status") ## no conflict with names in data lex <- Lexis_fpa(sire, birth = l$myVar, entry = dg_date, exit = ex_date + 1L, exit.status = "status") stopifnot( identical(class(lex), c("Lexis", "data.table", "data.frame")), Epi::timeScales(lex) %in% c("fot", "per", "age") ) # popEpi::Lexis_dt lex_1 <- popEpi::Lexis_dt( data = popEpi::sire, entry = list( ts_fut = 0L, ts_age = as.integer(dg_date - bi_date), ts_cal = as.integer(dg_date) ), exit = list(ts_cal = as.integer(ex_date)), entry.status = 0L, exit.status = status ) stopifnot( class(lex_1) == c("Lexis", "data.table", "data.frame"), Epi::timeScales(lex_1) %in% c("ts_fut", "ts_age", "ts_cal") ) lex_2 <- popEpi::Lexis_dt( data = popEpi::sire, entry = list( ts_fut = 0L, ts_age = as.integer(dg_date - bi_date), ts_cal = as.integer(dg_date) ), duration = as.integer(ex_date - dg_date), entry.status = 0L, exit.status = status ) stopifnot( class(lex_2) == c("Lexis", "data.table", "data.frame"), Epi::timeScales(lex_2) %in% c("ts_fut", "ts_age", "ts_cal") )
Lexis ObjectFunction(s) to merge data into Lexis objects intelligently when merging
is (partially) based on time scales.
lexis_merge( lexis, merge_dt, merge_dt_by, merge_dt_harmonisers = NULL, subset = NULL, optional_steps = NULL, lex_dur_multiplier = NULL )lexis_merge( lexis, merge_dt, merge_dt_by, merge_dt_harmonisers = NULL, subset = NULL, optional_steps = NULL, lex_dur_multiplier = NULL )
lexis |
A |
merge_dt |
A |
merge_dt_by |
Names of columns in both |
merge_dt_harmonisers |
Optional list of quoted expressions which, when evaluated, harmonise
data in
|
subset |
Merge data only into specific rows in
|
optional_steps |
Optional steps to perform during the function's run.
|
lex_dur_multiplier |
When As an aside, of course if you split also by calendar time (and age and
whatever you have in your |
popEpi::lexis_merge
Returns lexis after adding value columns from
merge_dt into lexis.
popEpi::lexis_merge
popEpi::lexis_merge can be used to merge additional information into
Lexis data, allowing the use of the Lexis time scales in the
merge. The typical use-case is to split Lexis data and then merge
population (expected) hazards to the subject-intervals.
popEpi::lexis_merge performs the following steps:
Run
optional_steps[["on_entry"]](call_env = call_env, eval_env = eval_env)
if that optional_steps element exists. call_env is the environment
where this function was called and eval_env is the temporary environment
in which the commands that this function consists of are evaluated.
Run
on.exit(optional_steps[["on_exit"]](call_env = call_env, eval_env = eval_env))
if that optional_steps element exists.
If is.null(merge_dt_harmonisers), attempt to
automatically determine the harmonisers making use of cut for each
time scale column to merge by as follows:
If merge_dt[[col_nm]] is a factor column,
look at the levels() of the column. If they all look
like [x, y[, ]x, y], or x - y, extract breaks from them,
infer whether right = FALSE or TRUE, and use the levels()
as labels. E.g. factor("[60, 61[") produces
list(breaks = 60:61, right = FALSE, labels = "[60, 61[").
If merge_dt[[col_nm]] contains numbers we define cut breaks as the
unique values of
merge_dt[[col_nm]] and as the ceiling
max(merge_dt[[col_nm]]) + last_diff. Here last_diff is the
difference between the highest and second highest values. E.g.
c(1950, 1960, 1970:2020, 2021) for merge_dt[[col_nm]]
containing unique values c(1950, 1960, 1970:2020).
If merge_dt[[col_nm]] is not of class integer, numeric, or
factor, an error is
raised because we don't know how to automatically form a
harmoniser.
Run
optional_steps[["pre_default_harmoniser_creation"]](call_env = call_env, eval_env = eval_env, lapply_eval_env = lapply_eval_env)
if that optional_steps element exists.
Here make_harmoniser_eval_env is similar to eval_env but it is the
evaluation environment of the function passed to lapply
which attempts to make each default harmoniser.
With the cut breaks defined, the automatically created harmoniser
becomes a cut call with arguments inferred above and with
x = col + lex.dur * lex_dur_multiplier, where col is the current
column.
dig.lab = 10.
Depending on what we are harmonising with, the harmoniser will return
either a factor with labels based on the
cut call, e.g. ts_cal = 2021.524 turns into [2021, 2022[
with both 2021 and 2022 in breaks, or the identified interval's
lower bound such as 2021.
Run
optional_steps[["post_default_harmoniser_creation"]](call_env = call_env, eval_env = eval_env, lapply_eval_env = lapply_eval_env)
if that optional_steps element exists.
Run
optional_steps[["post_merge_dt_harmonisers"]](call_env = call_env, eval_env = eval_env)
if that optional_steps element exists.
Armed with either user-defined or automatically created
merge_dt_harmonisers, they are each evaluated to create a temporary
data.table with harmonised data from lexis. This is performed via
eval with envir = lexis and enclos = harmoniser_eval_env where
harmoniser_eval_env is a temporary environment which contains every
argument passed to lexis_merge as well as eval_env and call_env
which you can read about in the detailed explanation of optional_steps.
lexis is a subset if !is.null(subset).
However, if a column
has no harmoniser at this point then it is used as-is. For instance there
is no need to harmonise stratifying columns such as area because they are
not changed by splitting the Lexis data.
Run
optional_steps[["post_harmonisation"]](call_env = call_env, eval_env = eval_env)
if that optional_steps element exists.
Then we perform the actual merge between merge_dt and the harmonised
data. The value columns in merge_dt are added into lexis either
using data.table::set for a data.table or a simple
lexis[, merge_value_col_nms] <- join_result assignment.
Note that data.table::set modifies the lexis object in-place.
Run
optional_steps[["post_merge"]](call_env = call_env, eval_env = eval_env)
if that optional_steps element exists.
Each merged-in column from merge_dt (all columns not in merge_dt_by)
are inspected for missing values. If there are any, an error is raised.
This usually occurs if merge_dt does not contain data for all data in
lexis. For instance it only covers years 1950-2020 but lexis contains
also data for 2021. This error helps you to spot those problems early
isntead of producing nonsense results downstream.
Returns lexis after adding value columns from
merge_dt into lexis.
Other Lexis_functions:
lexis_funs,
lexis_split_merge_aggregate_by_stratum()
# popEpi::lexis_merge lexis <- Epi::Lexis( entry = list(ts_fut = 0.0, ts_cal = 2010.3, ts_age = 56.8), exit = list(ts_cal = 2024.9999), entry.status = 0L, exit.status = 0L ) lexis$sex <- 0L lexis <- popEpi::splitMulti( data = lexis, breaks = list(ts_fut = seq(0, 3, 1 / 12)) ) my_merge_dt <- data.table::CJ(sex = 0:1, ts_age = 0:100, ts_cal = 2000:2025) data.table::set( x = my_merge_dt, j = "merge_value", value = runif(nrow(my_merge_dt)) ) # lexis is also a data.table and is modified in-place. no need to keep # output of popEpi::lexis_merge call. popEpi::lexis_merge( lexis = lexis, merge_dt = my_merge_dt, merge_dt_by = c("sex", "ts_age", "ts_cal") ) stopifnot( "merge_value" %in% names(lexis), !is.na(lexis[["merge_value"]]) ) data.table::set( x = lexis, j = "merge_value", value = NULL ) popEpi::lexis_merge( lexis = lexis, merge_dt = my_merge_dt, merge_dt_by = c("sex", "ts_age", "ts_cal"), merge_dt_harmonisers = list( ts_cal = quote(as.integer(ts_cal)), ts_age = quote(as.integer(ts_age)) ) ) stopifnot( "merge_value" %in% names(lexis), !is.na(lexis[["merge_value"]]) ) # special factor columns in merge_dt data.table::set( x = lexis, j = "merge_value", value = NULL ) data.table::set( x = my_merge_dt, j = c("ts_age", "ts_cal"), value = list( cut(my_merge_dt[["ts_age"]], breaks = 0:101, right = FALSE), cut(my_merge_dt[["ts_cal"]], breaks = 2000:2026, right = FALSE, dig.lab = 4) ) ) popEpi::lexis_merge( lexis = lexis, merge_dt = my_merge_dt, merge_dt_by = c("sex", "ts_age", "ts_cal") ) stopifnot( "merge_value" %in% names(lexis), !is.na(lexis[["merge_value"]]) )# popEpi::lexis_merge lexis <- Epi::Lexis( entry = list(ts_fut = 0.0, ts_cal = 2010.3, ts_age = 56.8), exit = list(ts_cal = 2024.9999), entry.status = 0L, exit.status = 0L ) lexis$sex <- 0L lexis <- popEpi::splitMulti( data = lexis, breaks = list(ts_fut = seq(0, 3, 1 / 12)) ) my_merge_dt <- data.table::CJ(sex = 0:1, ts_age = 0:100, ts_cal = 2000:2025) data.table::set( x = my_merge_dt, j = "merge_value", value = runif(nrow(my_merge_dt)) ) # lexis is also a data.table and is modified in-place. no need to keep # output of popEpi::lexis_merge call. popEpi::lexis_merge( lexis = lexis, merge_dt = my_merge_dt, merge_dt_by = c("sex", "ts_age", "ts_cal") ) stopifnot( "merge_value" %in% names(lexis), !is.na(lexis[["merge_value"]]) ) data.table::set( x = lexis, j = "merge_value", value = NULL ) popEpi::lexis_merge( lexis = lexis, merge_dt = my_merge_dt, merge_dt_by = c("sex", "ts_age", "ts_cal"), merge_dt_harmonisers = list( ts_cal = quote(as.integer(ts_cal)), ts_age = quote(as.integer(ts_age)) ) ) stopifnot( "merge_value" %in% names(lexis), !is.na(lexis[["merge_value"]]) ) # special factor columns in merge_dt data.table::set( x = lexis, j = "merge_value", value = NULL ) data.table::set( x = my_merge_dt, j = c("ts_age", "ts_cal"), value = list( cut(my_merge_dt[["ts_age"]], breaks = 0:101, right = FALSE), cut(my_merge_dt[["ts_cal"]], breaks = 2000:2026, right = FALSE, dig.lab = 4) ) ) popEpi::lexis_merge( lexis = lexis, merge_dt = my_merge_dt, merge_dt_by = c("sex", "ts_age", "ts_cal") ) stopifnot( "merge_value" %in% names(lexis), !is.na(lexis[["merge_value"]]) )
Lexis ObjectFunction(s) which split, merge, and aggregate a Lexis object in one go.
lexis_split_merge_aggregate_by_stratum( lexis, breaks, aggre_exprs, aggre_by = NULL, merge_dt = NULL, merge_dt_by = NULL, merge_optional_args = NULL, subset = NULL, weight_col_nm = NULL, split_lexis_column_exprs = NULL, breaks_collapse_args = NULL, optional_steps = NULL )lexis_split_merge_aggregate_by_stratum( lexis, breaks, aggre_exprs, aggre_by = NULL, merge_dt = NULL, merge_dt_by = NULL, merge_optional_args = NULL, subset = NULL, weight_col_nm = NULL, split_lexis_column_exprs = NULL, breaks_collapse_args = NULL, optional_steps = NULL )
lexis |
A |
breaks |
List of breaks to split |
aggre_exprs |
Defines what is aggregated within every stratum-box
defined via Each element must be either named and an R expression (see e.g. |
aggre_by |
|
merge_dt |
Passed to |
merge_dt_by |
Passed to |
merge_optional_args |
Each element passed to |
subset |
This argument is evaluated within the context of
Due to the non-standard evaluation method used here, you can make use of
data in the calling environment as well. E.g. with variable |
weight_col_nm |
Name of weight column in
|
split_lexis_column_exprs |
Additional columns to create in
|
breaks_collapse_args |
Optional, if you supply this argument then
|
optional_steps |
Optional steps to perform along the way.
|
popEpi::lexis_split_merge_aggregate_by_stratum can be used to split Lexis
([Epi::Lexis]) data, merge something to it after the merge, and
then perform an aggregation step. The following steps are performed:
Handle aggre_exprs as follows:
An aggre_exprs element that is string such as "n_events"
is replaced with the appropriate expression retrieved from a table
of pre-defined expressions (see below).
If a string element is transition-specific, e.g. n_events_[0, 1],
we first turn it into its general form (e.g. n_events_[x, y]).
Additionally, any uses of x and y in the fetched expression
is replaced with the correct states, e.g.
sum(lex.Cst == x & lex.Xst == y) is turned into
sum(lex.Cst == 0 & lex.Xst == 1).
An aggre_exprs element that is an expression already
(e.g. my_n_events = quote(sum(lex.Cst == 0 & lex.Xst != 0)))
does not receive the same treatment regarding the states and when
you write your own expressions you must specify them yourself.
Therefore e.g.
my_n_events = quote(sum(lex.Cst == x & lex.Xst != y)) will not
work.
Regardless of the type of the aggre_exprs element, we remove every
use of * iw in the expression if individual weights are not used.
Table of general aggregation expressions known to popEpi:
| Name | Explanation | Expression |
n_in_follow_up_at_interval_start |
Number of Lexis subjects in follow-up at the start of a follow-up time scale interval. | sum(in_follow_up_at_interval_start) |
n_entered_late_during_interval |
Number of Lexis subjects who entered a follow-up time scale interval late. | sum(entered_late_during_interval) |
n_left_early_during_interval |
Number of Lexis subjects who left a follow-up time scale interval early. | sum(left_early_during_interval) |
n_at_risk_eff |
Effective number of subjects at risk. Whereas in the Kaplan-Meier (product-limit) estimator of survival we simply estimate the conditional survival probabilities as n_events / n_at_risk at each event time point, in the actuarial approach we estimate the conditional survival probabilities of follow-up time intervals such as ]0, 0.1]. Using the number at risk at the start of the interval is problematic because subjects may enter late, leave early, or both. We adjust our number at risk in each of these special cases to arrive out our effective number. The iw in the expression stands for the individual weight and is included to allow for individual weighting. |
sum((in_follow_up_at_interval_start + 0.5 * (entered_late_during_interval &
!left_early_during_interval) + 0.25 * (entered_late_during_interval &
left_early_during_interval) - 0.5 * (!entered_late_during_interval &
left_early_during_interval)) * iw) |
n_events |
Number of events, i.e. the total number of all kinds of transitions. The iw in the expression stands for the individual weight and is included to allow for individual weighting. |
sum((lex.Xst != lex.Cst) * iw) |
t_at_risk |
Total time at risk. The iw in the expression stands for the individual weight and is included to allow for individual weighting. |
sum(lex.dur * iw) |
n_events_exp_e2 |
Number of expected events according to the Ederer II method. Well, there is only one formula for the expected number of events, h_exp * t_at_risk, but this name is reserved for the Ederer II method and is used with observed data (as opposed to extrapolated data according to the Ederer I method). The iw in the expression stands for the individual weight and is included to allow for individual weighting. |
sum(lex.dur * h_exp * iw) |
n_events_pp |
The same n_events but with Pohar Perme weighting. |
sum((lex.Xst != lex.Cst) * pp * iw) |
n_events_pp_double_weighted |
The same as n_events but with Pohar Perme weighting where the weights are squared. |
sum((lex.Xst != lex.Cst) * pp * pp * iw) |
n_events_exp_pp |
The number of expected events with Pohar Perme weighting. The iw in the expression stands for the individual weight and is included to allow for individual weighting. |
sum(lex.dur * h_exp * pp * iw) |
n_at_risk_eff_pp |
The same as n_at_risk_eff but with Pohar Perme weighting. |
sum((in_follow_up_at_interval_start + 0.5 * (entered_late_during_interval &
!left_early_during_interval) + 0.25 * (entered_late_during_interval &
left_early_during_interval) - 0.5 * (!entered_late_during_interval &
left_early_during_interval)) * pp * iw) |
t_at_risk_pp |
The same as t_at_risk but with Pohar Perme weighting. |
sum(lex.dur * pp * iw) |
n_events_[x, y] |
The number of transitions from x to y. The iw in the expression stands for the individual weight and is included to allow for individual weighting. |
sum((lex.Cst %in% x & lex.Xst %in% y) * iw)
|
If weight_col_nm was supplied, include in subset only records where
lexis[[weight_col_nm]] > 0.
Call
optional_steps[["on_entry"]](eval_env = eval_env, call_env = call_env)
if that optional_steps element exists.
eval_env is the temporary evaluation environment of
popEpi::lexis_split_merge_aggregate_by_stratum which contains all
contains all the arguments of
popEpi::lexis_split_merge_aggregate_by_stratum and call_env is the environment
where it was called.
Call
on.exit(optional_steps[["on_exit"]](eval_env = eval_env, call_env = call_env))
if that optional_steps element exists.
For each stratum in aggre_by:
Run
optional_steps[["stratum_on_entry"]](stratum_eval_env = stratum_eval_env, eval_env = eval_env, call_env = call_env)
if that optional_steps element exists.
stratum_eval_env is the environment where the stratum-specific
steps are performed.
Loop over every box defined by all other time scales except the
last one. E.g. if
breaks = list(ts_cal = c(2001, 2004, 2007), ts_fut = 0:5)
then we perform what follows below separately for the boxes
]2001, 2004], ]2004, 2007].
Run
optional_steps[["stratum_pre_split"]](stratum_eval_env = stratum_eval_env, eval_env = eval_env, call_env = call_env)
if that optional_steps element exists.
Using a subset lexis containing data only for the current
stratum (e.g. sex = 0), drop data outside the current
stratum box (e.g. ts_cal outside of
]2001, 2004]). Also crop follow-up to the end of the box.
This does not split the data yet.
If !is.null(breaks_collapse_args), run
popEpi::lexis_breaks_collapse_1d on the last element of
breaks. Arguments lexis and breaks are set
automatically. This makes it possible to "collapse" the breaks
for each stratum and stratum box combination separately.
Run [splitMulti] for the second time, this time splitting the
data by the last time scale in breaks. Remember that we are
inside the box defined by the other time scales already.
Run
optional_steps[["stratum_post_split"]](stratum_eval_env = stratum_eval_env, eval_env = eval_env, call_env = call_env)
if that optional_steps element exists.
Run
[lexis_merge] with merge_dt, merge_dt_by, and
merged_optional_args, if merge_dt has been supplied.
Run
optional_steps[["stratum_post_merge"]](stratum_eval_env = stratum_eval_env, eval_env = eval_env, call_env = call_env)
if that optional_steps element exists.
Evaluate aggre_exprs as follows:
First collect variables mentioned in aggre_exprs using
[all.vars].
There is a pre-defined table of expressions that are evaluated
and added as new columns in the split data if they appear in any
of the aggre_exprs. See below for the table.
Table of expressions which create new columns into split data:
| Name | Explanation | Expression |
in_follow_up_at_interval_start |
This column indicates (TRUE / FALSE) whether a subject is in follow-up at the start of a follow-up time scale interval. It is enough to be within a very small distance of the interval to count as TRUE. |
{
ts_fut_floor <- box_dt[[ts_fut_start_col_nm]][box_id]
absolute_distance_from_floor <- abs(ts_fut - ts_fut_floor)
absolute_distance_from_floor < 1e-10
} |
entered_late_during_interval |
This column indicates (TRUE / FALSE) whether a subject entered a follow-up time scale interval late. TRUE if entered after a small amount of time from the start. E.g. ts_fut = 1.00001 is a late entry in the interval that starts at ts_fut = 1.0. |
{
ts_fut_floor <- box_dt[[ts_fut_start_col_nm]][box_id]
distance_from_floor <- (ts_fut - ts_fut_floor)
distance_from_floor > 1e-10
} |
left_early_during_interval |
This column indicates (TRUE / FALSE) whether a subject left a follow-up time scale interval early. TRUE if left before a small amount of time from the end. E.g. ts_fut = 1.99999 is an early leaver in the interval that ends at ts_fut = 2.0. |
{
out <- duplicated(lex.id, fromLast = TRUE)
out[out] <- lex.Cst[out] == lex.Xst[out]
ts_fut_stop <- ts_fut[out] + lex.dur[out]
ts_fut_ceiling <- box_dt[[ts_fut_stop_col_nm]][box_id[out]]
distance_from_ceiling <- ts_fut_ceiling - ts_fut_stop
out[out] <- distance_from_ceiling > 1e-10
out
} |
pp |
This is the ""Pohar Perme weight"" which we use to weigh summary statistics such as t_at_risk needed by our actuarial Pohar Perme estimator. See: Karri Seppä, Timo Hakulinen, Arun Pokhrel, Choosing the net survival method for cancer survival estimation, European Journal of Cancer, Volume 51, Issue 9, 2015, Pages 1123-1129, ISSN 0959-8049, https://doi.org/10.1016/j.ejca.2013.09.019. |
{
surv_pohar_perme_weight__(lexis = work_dt, ts_fut_breaks = eval_env[['breaks']][[ts_fut_col_nm]],
ts_fut_col_nm = ts_fut_col_nm, hazard_col_nm = 'h_exp')
}
|
+ It can happen that e.g. a survival interval has absolutely no data in
it. Especially in sparse data and with delayed entry. For the
pre-specified aggregation expressions such as `n_events` we ensure
that empty time scale boxes have value zero. Your custom aggregation
expressions will result in `NA` values in empty time scale boxes.
* Run
`optional_steps[["stratum_post_aggregation"]](stratum_eval_env = stratum_eval_env, eval_env = eval_env, call_env = call_env)`
if that `optional_steps` element exists.
Replace NA values in value columns starting with t_ or with n_
with zeroes. E.g. t_at_risk will then be zero in intervals with no
subjects in follow-up. Even a custom expression such as
t_at_risk_squared will be handled in this manner.
Set proper data.table
attributes on the resulting big table and sort it by the stratifying
variables.
Store as metadata a list with names stratum_col_nms, ts_col_nms, and
value_col_nms into the attribute named
lexis_split_merge_aggregate_by_stratum_meta, where
stratum_col_nms = names(aggre_by), ts_col_nms = names(breaks), and
value_col_nms are the names of the columns resulting from aggre_exprs.
Run
optional_steps[["post_aggregation"]](eval_env = eval_env, call_env = call_env)
if that optional_steps element exists.
Return a data.table with stratum columns as specified via
aggre_by, time scale columns specified via breaks and value columns
as specified via aggre_exprs.
Returns a data.table with stratum columns as specified via
aggre_by, time scale columns specified via breaks and value columns
as specified via aggre_exprs.
Other Lexis_functions:
lexis_funs,
lexis_merge()
# popEpi::split_merge_aggregate_by_stratum make_pm <- function() { pm <- data.table::copy(popEpi::popmort) data.table::setnames( pm, c("year", "agegroup", "haz"), c("ts_cal", "ts_age", "h_exp") ) data.table::setkeyv(pm, c("sex", "ts_cal", "ts_age")) data.table::setkeyv(pm, c("sex", "ts_cal", "ts_age")) return(pm[]) } make_column_icss_ag <- function(age) { cut( age, breaks = c(0, 60, 70, 80, Inf), right = FALSE, labels = c("0-59", "60-69", "70-79", "80+") ) } make_standard_weight_dt <- function() { return(popEpi::ICSS[ j = list( weight = as.double(sum(.SD[["ICSS1"]])) ), keyby = list( icss_ag = make_column_icss_ag(popEpi::ICSS[["age"]]) ) ][]) } make_sire <- function() { sire <- popEpi::sire sire <- sire[ sire[["dg_date"]] < sire[["ex_date"]] & data.table::between( sire[["ex_date"]], as.Date("1999-01-01"), as.Date("2003-12-31"), incbounds = TRUE ) & (get.yrs(sire[["ex_date"]]) - get.yrs(sire[["bi_date"]])) < 100 ] sire[j = "my_stratum" := sample(2L, size = nrow(sire), replace = TRUE)] sire <- sire[ j = .SD[as.integer(seq(1L, .N, length.out = 50L))], keyby = "my_stratum" ] # you can also use popEpi::Lexis_dt sire <- Epi::Lexis( entry = list( ts_cal = popEpi::get.yrs(dg_date), ts_age = popEpi::get.yrs(dg_date) - popEpi::get.yrs(bi_date), ts_fut = 0.0 ), duration = popEpi::get.yrs(ex_date) - popEpi::get.yrs(dg_date), entry.status = 0L, exit.status = status, data = sire ) sire[["icss_ag"]] <- make_column_icss_ag(sire[["dg_age"]]) sire[["individual_weight"]] <- popEpi::surv_individual_weights( df = sire, standard_weight_dt = wdt ) sire } pm <- make_pm() wdt <- make_standard_weight_dt() sire <- make_sire() # note that we use here 1-year survival intervals which are too wide for # practical survival estimation. we also aggregate for period analysis # for demonstration purposes. bl <- list(ts_cal = c(1999, 2004), ts_fut = 0:5) # using some pre-defined aggregation expressions agdt <- popEpi::lexis_split_merge_aggregate_by_stratum( lexis = sire, breaks = bl, aggre_exprs = c("n_events", "t_at_risk", "n_events_[0, 1]"), aggre_by = "my_stratum" ) stopifnot( c("n_events", "t_at_risk", "n_events_[0, 1]") %in% names(agdt) ) # using some pre-defined aggregation expressions + using expected hazard. agdt <- popEpi::lexis_split_merge_aggregate_by_stratum( lexis = sire, breaks = bl, aggre_exprs = c( "n_events", "t_at_risk", "n_events_[0, 1]", "n_events_exp_e2" ), aggre_by = "my_stratum", merge_dt = pm, merge_dt_by = c("sex", "ts_cal", "ts_age") ) # using custom expressions --- `h_exp` is merged from `merge_dt` agdt <- popEpi::lexis_split_merge_aggregate_by_stratum( lexis = sire, breaks = bl, aggre_exprs = list( "n_events", "t_at_risk", "n_events_[0, 1]", "n_events_exp_e2", "my_n_events" = quote(sum(lex.Cst == 0 & lex.Xst != 0)), "my_n_events_exp_e2" = quote(sum(h_exp * lex.dur)) ), aggre_by = "my_stratum", merge_dt = pm, merge_dt_by = c("sex", "ts_cal", "ts_age") ) stopifnot( agdt[["my_n_events"]] == agdt[["n_events"]], all.equal(agdt[["my_n_events_exp_e2"]], agdt[["n_events_exp_e2"]]) ) # using custom expressions --- some pre-defined variables which are added # automatically agdt <- popEpi::lexis_split_merge_aggregate_by_stratum( lexis = sire, breaks = bl, aggre_exprs = list( "n_events_pp", "my_n_events_pp" = quote(sum((lex.Cst == 0 & lex.Xst != 0) * pp)), "n_at_risk_eff", "my_n_at_risk_eff" = quote(sum( in_follow_up_at_interval_start + 0.5 * entered_late_during_interval - 0.5 * left_early_during_interval )), "my_n_at_risk_eff_2" = quote({ # custom definition for detecting late entry. in this example we add # a larger tolerance than is used normally. you can find the normal # tolerance used in the table of expressions which create columns into # the split lexis dataset. ts_fut_floor <- eval_env[["box_dt"]][["ts_fut_start"]][box_id] distance_from_floor <- (ts_fut - ts_fut_floor) entered_late_during_interval <- distance_from_floor > 1e-3 sum( in_follow_up_at_interval_start + 0.5 * entered_late_during_interval - 0.5 * left_early_during_interval ) }) ), aggre_by = "my_stratum", merge_dt = pm, merge_dt_by = c("sex", "ts_cal", "ts_age") ) stopifnot( all.equal(agdt[["my_n_events_pp"]], agdt[["n_events_pp"]]), all.equal(agdt[["n_at_risk_eff"]], agdt[["my_n_at_risk_eff"]]), all.equal(agdt[["n_at_risk_eff"]], agdt[["my_n_at_risk_eff_2"]]) ) # using custom expressions --- also a custom column at the split lexis level. # this is for demonstration of what is possible and probably no-one would # need such an h_exp_mean in a real application. agdt <- popEpi::lexis_split_merge_aggregate_by_stratum( lexis = sire, breaks = bl, aggre_exprs = list( "my_stat" = quote(sum(lex.dur * h_exp_mean)) ), aggre_by = "my_stratum", merge_dt = pm, merge_dt_by = c("sex", "ts_cal", "ts_age"), split_lexis_column_exprs = list( h_exp_mean = quote({ dt <- data.table::setDT(list(lex.id = lex.id, h_exp = h_exp)) h_exp_lag1 <- dt[ j = list(lag1 = data.table::shift(h_exp, n = 1L, type = "lag")), by = "lex.id" ][["lag1"]] h_exp_lag1[is.na(h_exp_lag1)] <- h_exp[is.na(h_exp_lag1)] h_exp_lead1 <- dt[ j = list(lead1 = data.table::shift(h_exp, n = 1L, type = "lead")), by = "lex.id" ][["lead1"]] h_exp_lead1[is.na(h_exp_lead1)] <- h_exp[is.na(h_exp_lead1)] (h_exp + h_exp_lag1 + h_exp_lead1) / 3 }) ) ) stopifnot( "my_stat" %in% names(agdt), !is.na(agdt[["my_stat"]]) ) # this function was written for survival estimation but is is in fact more # general. for instance we can look at a calendar year / age grid. agdt <- popEpi::lexis_split_merge_aggregate_by_stratum( lexis = sire, breaks = list(ts_cal = 1999:2004, ts_age = seq(0, 100, 10)), aggre_exprs = list( "n_events_exp_e2" ), merge_dt = pm, merge_dt_by = c("sex", "ts_cal", "ts_age") ) stopifnot( c("ts_cal_start", "ts_age_start") %in% names(agdt) ) agdt_wide <- data.table::dcast( data = agdt, formula = ts_age_start ~ ts_cal_start, value.var = "n_events_exp_e2" ) # print(round(agdt_wide, 2)) # Key: <ts_age_start> # ts_age_start 1999 2000 2001 2002 2003 # <num> <num> <num> <num> <num> <num> # 1: 0 0.00 0.00 0.00 0.00 0.00 # 2: 10 0.00 0.00 0.00 0.00 0.00 # 3: 20 0.00 0.00 0.00 0.00 0.00 # 4: 30 0.00 0.00 0.00 0.00 0.00 # 5: 40 0.00 0.00 0.00 0.00 0.00 # 6: 50 0.01 0.01 0.01 0.00 0.00 # 7: 60 0.04 0.04 0.03 0.01 0.02 # 8: 70 0.33 0.28 0.31 0.16 0.05 # 9: 80 1.36 1.04 0.89 0.48 0.20 # 10: 90 0.39 0.29 0.47 0.37 0.56 # this is what happens when there is an empty interval agdt <- popEpi::lexis_split_merge_aggregate_by_stratum( lexis = popEpi::Lexis_dt( entry = list(ts_fut = c(0.0, 3.5)), exit = list(ts_fut = c(1.5, 4.5)), entry.status = 0L, exit.status = 1L ), breaks = list(ts_fut = 0:5), aggre_exprs = list( "t_at_risk", "n_events", my_n_events = quote(sum(lex.Cst == 0 & lex.Xst != 0)) ) ) stopifnot( identical(agdt[["t_at_risk"]], c(1.0, 0.5, 0.0, 0.5, 0.5)), identical(agdt[["n_events"]], c(0L, 1L, 0L, 0L, 1L)), identical(agdt[["my_n_events"]], c(0L, 1L, NA, 0L, 1L)) ) # this is what happens when an empty stratum is requested agdt <- popEpi::lexis_split_merge_aggregate_by_stratum( lexis = sire, aggre_by = data.table::data.table(sex = 0:2), breaks = list(ts_fut = 0:5), aggre_exprs = list( "t_at_risk", "my_stat" = quote(sum(lex.dur ^ 2)), "t_at_risk_squaredd" = quote(sum(lex.dur ^ 2)) ) ) stopifnot( nrow(agdt) == length(0:2) * (length(0:5) - 1), agdt[["t_at_risk"]][!agdt[["sex"]] %in% sire[["sex"]]] == 0, is.na(agdt[["my_stat"]][!agdt[["sex"]] %in% sire[["sex"]]]), agdt[["t_at_risk_squared"]][!agdt[["sex"]] %in% sire[["sex"]]] == 0 ) # and this is what happens when no records are in the dataset agdt <- popEpi::lexis_split_merge_aggregate_by_stratum( lexis = sire, subset = rep(FALSE, nrow(sire)), aggre_by = data.table::data.table(sex = 0:2), breaks = list(ts_fut = 0:5), aggre_exprs = list( "t_at_risk", "my_stat" = quote(sum(lex.dur ^ 2)) ) ) stopifnot( nrow(agdt) == length(0:2) * (length(0:5) - 1), agdt[["t_at_risk"]] == 0, is.na(agdt[["my_stat"]]) ) # and this is what happens when no records are in the dataset agdt <- popEpi::lexis_split_merge_aggregate_by_stratum( lexis = sire, subset = rep(FALSE, nrow(sire)), aggre_by = data.table::data.table(sex = 0:2), breaks = list(ts_fut = 0:5), aggre_exprs = list( "t_at_risk", "my_stat" = quote(sum(lex.dur ^ 2)) ) ) stopifnot( nrow(agdt) == length(0:2) * (length(0:5) - 1), agdt[["t_at_risk"]] == 0, is.na(agdt[["my_stat"]]) )# popEpi::split_merge_aggregate_by_stratum make_pm <- function() { pm <- data.table::copy(popEpi::popmort) data.table::setnames( pm, c("year", "agegroup", "haz"), c("ts_cal", "ts_age", "h_exp") ) data.table::setkeyv(pm, c("sex", "ts_cal", "ts_age")) data.table::setkeyv(pm, c("sex", "ts_cal", "ts_age")) return(pm[]) } make_column_icss_ag <- function(age) { cut( age, breaks = c(0, 60, 70, 80, Inf), right = FALSE, labels = c("0-59", "60-69", "70-79", "80+") ) } make_standard_weight_dt <- function() { return(popEpi::ICSS[ j = list( weight = as.double(sum(.SD[["ICSS1"]])) ), keyby = list( icss_ag = make_column_icss_ag(popEpi::ICSS[["age"]]) ) ][]) } make_sire <- function() { sire <- popEpi::sire sire <- sire[ sire[["dg_date"]] < sire[["ex_date"]] & data.table::between( sire[["ex_date"]], as.Date("1999-01-01"), as.Date("2003-12-31"), incbounds = TRUE ) & (get.yrs(sire[["ex_date"]]) - get.yrs(sire[["bi_date"]])) < 100 ] sire[j = "my_stratum" := sample(2L, size = nrow(sire), replace = TRUE)] sire <- sire[ j = .SD[as.integer(seq(1L, .N, length.out = 50L))], keyby = "my_stratum" ] # you can also use popEpi::Lexis_dt sire <- Epi::Lexis( entry = list( ts_cal = popEpi::get.yrs(dg_date), ts_age = popEpi::get.yrs(dg_date) - popEpi::get.yrs(bi_date), ts_fut = 0.0 ), duration = popEpi::get.yrs(ex_date) - popEpi::get.yrs(dg_date), entry.status = 0L, exit.status = status, data = sire ) sire[["icss_ag"]] <- make_column_icss_ag(sire[["dg_age"]]) sire[["individual_weight"]] <- popEpi::surv_individual_weights( df = sire, standard_weight_dt = wdt ) sire } pm <- make_pm() wdt <- make_standard_weight_dt() sire <- make_sire() # note that we use here 1-year survival intervals which are too wide for # practical survival estimation. we also aggregate for period analysis # for demonstration purposes. bl <- list(ts_cal = c(1999, 2004), ts_fut = 0:5) # using some pre-defined aggregation expressions agdt <- popEpi::lexis_split_merge_aggregate_by_stratum( lexis = sire, breaks = bl, aggre_exprs = c("n_events", "t_at_risk", "n_events_[0, 1]"), aggre_by = "my_stratum" ) stopifnot( c("n_events", "t_at_risk", "n_events_[0, 1]") %in% names(agdt) ) # using some pre-defined aggregation expressions + using expected hazard. agdt <- popEpi::lexis_split_merge_aggregate_by_stratum( lexis = sire, breaks = bl, aggre_exprs = c( "n_events", "t_at_risk", "n_events_[0, 1]", "n_events_exp_e2" ), aggre_by = "my_stratum", merge_dt = pm, merge_dt_by = c("sex", "ts_cal", "ts_age") ) # using custom expressions --- `h_exp` is merged from `merge_dt` agdt <- popEpi::lexis_split_merge_aggregate_by_stratum( lexis = sire, breaks = bl, aggre_exprs = list( "n_events", "t_at_risk", "n_events_[0, 1]", "n_events_exp_e2", "my_n_events" = quote(sum(lex.Cst == 0 & lex.Xst != 0)), "my_n_events_exp_e2" = quote(sum(h_exp * lex.dur)) ), aggre_by = "my_stratum", merge_dt = pm, merge_dt_by = c("sex", "ts_cal", "ts_age") ) stopifnot( agdt[["my_n_events"]] == agdt[["n_events"]], all.equal(agdt[["my_n_events_exp_e2"]], agdt[["n_events_exp_e2"]]) ) # using custom expressions --- some pre-defined variables which are added # automatically agdt <- popEpi::lexis_split_merge_aggregate_by_stratum( lexis = sire, breaks = bl, aggre_exprs = list( "n_events_pp", "my_n_events_pp" = quote(sum((lex.Cst == 0 & lex.Xst != 0) * pp)), "n_at_risk_eff", "my_n_at_risk_eff" = quote(sum( in_follow_up_at_interval_start + 0.5 * entered_late_during_interval - 0.5 * left_early_during_interval )), "my_n_at_risk_eff_2" = quote({ # custom definition for detecting late entry. in this example we add # a larger tolerance than is used normally. you can find the normal # tolerance used in the table of expressions which create columns into # the split lexis dataset. ts_fut_floor <- eval_env[["box_dt"]][["ts_fut_start"]][box_id] distance_from_floor <- (ts_fut - ts_fut_floor) entered_late_during_interval <- distance_from_floor > 1e-3 sum( in_follow_up_at_interval_start + 0.5 * entered_late_during_interval - 0.5 * left_early_during_interval ) }) ), aggre_by = "my_stratum", merge_dt = pm, merge_dt_by = c("sex", "ts_cal", "ts_age") ) stopifnot( all.equal(agdt[["my_n_events_pp"]], agdt[["n_events_pp"]]), all.equal(agdt[["n_at_risk_eff"]], agdt[["my_n_at_risk_eff"]]), all.equal(agdt[["n_at_risk_eff"]], agdt[["my_n_at_risk_eff_2"]]) ) # using custom expressions --- also a custom column at the split lexis level. # this is for demonstration of what is possible and probably no-one would # need such an h_exp_mean in a real application. agdt <- popEpi::lexis_split_merge_aggregate_by_stratum( lexis = sire, breaks = bl, aggre_exprs = list( "my_stat" = quote(sum(lex.dur * h_exp_mean)) ), aggre_by = "my_stratum", merge_dt = pm, merge_dt_by = c("sex", "ts_cal", "ts_age"), split_lexis_column_exprs = list( h_exp_mean = quote({ dt <- data.table::setDT(list(lex.id = lex.id, h_exp = h_exp)) h_exp_lag1 <- dt[ j = list(lag1 = data.table::shift(h_exp, n = 1L, type = "lag")), by = "lex.id" ][["lag1"]] h_exp_lag1[is.na(h_exp_lag1)] <- h_exp[is.na(h_exp_lag1)] h_exp_lead1 <- dt[ j = list(lead1 = data.table::shift(h_exp, n = 1L, type = "lead")), by = "lex.id" ][["lead1"]] h_exp_lead1[is.na(h_exp_lead1)] <- h_exp[is.na(h_exp_lead1)] (h_exp + h_exp_lag1 + h_exp_lead1) / 3 }) ) ) stopifnot( "my_stat" %in% names(agdt), !is.na(agdt[["my_stat"]]) ) # this function was written for survival estimation but is is in fact more # general. for instance we can look at a calendar year / age grid. agdt <- popEpi::lexis_split_merge_aggregate_by_stratum( lexis = sire, breaks = list(ts_cal = 1999:2004, ts_age = seq(0, 100, 10)), aggre_exprs = list( "n_events_exp_e2" ), merge_dt = pm, merge_dt_by = c("sex", "ts_cal", "ts_age") ) stopifnot( c("ts_cal_start", "ts_age_start") %in% names(agdt) ) agdt_wide <- data.table::dcast( data = agdt, formula = ts_age_start ~ ts_cal_start, value.var = "n_events_exp_e2" ) # print(round(agdt_wide, 2)) # Key: <ts_age_start> # ts_age_start 1999 2000 2001 2002 2003 # <num> <num> <num> <num> <num> <num> # 1: 0 0.00 0.00 0.00 0.00 0.00 # 2: 10 0.00 0.00 0.00 0.00 0.00 # 3: 20 0.00 0.00 0.00 0.00 0.00 # 4: 30 0.00 0.00 0.00 0.00 0.00 # 5: 40 0.00 0.00 0.00 0.00 0.00 # 6: 50 0.01 0.01 0.01 0.00 0.00 # 7: 60 0.04 0.04 0.03 0.01 0.02 # 8: 70 0.33 0.28 0.31 0.16 0.05 # 9: 80 1.36 1.04 0.89 0.48 0.20 # 10: 90 0.39 0.29 0.47 0.37 0.56 # this is what happens when there is an empty interval agdt <- popEpi::lexis_split_merge_aggregate_by_stratum( lexis = popEpi::Lexis_dt( entry = list(ts_fut = c(0.0, 3.5)), exit = list(ts_fut = c(1.5, 4.5)), entry.status = 0L, exit.status = 1L ), breaks = list(ts_fut = 0:5), aggre_exprs = list( "t_at_risk", "n_events", my_n_events = quote(sum(lex.Cst == 0 & lex.Xst != 0)) ) ) stopifnot( identical(agdt[["t_at_risk"]], c(1.0, 0.5, 0.0, 0.5, 0.5)), identical(agdt[["n_events"]], c(0L, 1L, 0L, 0L, 1L)), identical(agdt[["my_n_events"]], c(0L, 1L, NA, 0L, 1L)) ) # this is what happens when an empty stratum is requested agdt <- popEpi::lexis_split_merge_aggregate_by_stratum( lexis = sire, aggre_by = data.table::data.table(sex = 0:2), breaks = list(ts_fut = 0:5), aggre_exprs = list( "t_at_risk", "my_stat" = quote(sum(lex.dur ^ 2)), "t_at_risk_squaredd" = quote(sum(lex.dur ^ 2)) ) ) stopifnot( nrow(agdt) == length(0:2) * (length(0:5) - 1), agdt[["t_at_risk"]][!agdt[["sex"]] %in% sire[["sex"]]] == 0, is.na(agdt[["my_stat"]][!agdt[["sex"]] %in% sire[["sex"]]]), agdt[["t_at_risk_squared"]][!agdt[["sex"]] %in% sire[["sex"]]] == 0 ) # and this is what happens when no records are in the dataset agdt <- popEpi::lexis_split_merge_aggregate_by_stratum( lexis = sire, subset = rep(FALSE, nrow(sire)), aggre_by = data.table::data.table(sex = 0:2), breaks = list(ts_fut = 0:5), aggre_exprs = list( "t_at_risk", "my_stat" = quote(sum(lex.dur ^ 2)) ) ) stopifnot( nrow(agdt) == length(0:2) * (length(0:5) - 1), agdt[["t_at_risk"]] == 0, is.na(agdt[["my_stat"]]) ) # and this is what happens when no records are in the dataset agdt <- popEpi::lexis_split_merge_aggregate_by_stratum( lexis = sire, subset = rep(FALSE, nrow(sire)), aggre_by = data.table::data.table(sex = 0:2), breaks = list(ts_fut = 0:5), aggre_exprs = list( "t_at_risk", "my_stat" = quote(sum(lex.dur ^ 2)) ) ) stopifnot( nrow(agdt) == length(0:2) * (length(0:5) - 1), agdt[["t_at_risk"]] == 0, is.na(agdt[["my_stat"]]) )
Given subject-level data, data is split
by calendar time (per), age, and follow-up
time (fot, from 0 to the end of follow-up)
into subject-time-interval rows according to
given breaks and additionally processed if requested.
lexpand( data, birth = NULL, entry = NULL, exit = NULL, event = NULL, status = status != 0, entry.status = NULL, breaks = list(fot = c(0, Inf)), id = NULL, overlapping = TRUE, aggre = NULL, aggre.type = c("unique", "cartesian"), drop = TRUE, pophaz = NULL, pp = TRUE, subset = NULL, merge = TRUE, verbose = FALSE, ... )lexpand( data, birth = NULL, entry = NULL, exit = NULL, event = NULL, status = status != 0, entry.status = NULL, breaks = list(fot = c(0, Inf)), id = NULL, overlapping = TRUE, aggre = NULL, aggre.type = c("unique", "cartesian"), drop = TRUE, pophaz = NULL, pp = TRUE, subset = NULL, merge = TRUE, verbose = FALSE, ... )
data |
dataset of e.g. cancer cases as rows |
birth |
birth time in date format or fractional years; string, symbol or expression |
entry |
entry time in date format or fractional years; string, symbol or expression |
exit |
exit from follow-up time in date format or fractional years; string, symbol or expression |
event |
advanced: time of possible event differing from |
status |
variable indicating type of event at |
entry.status |
input in the same way as |
breaks |
a named list of vectors of time breaks;
e.g. |
id |
optional; an id variable; e.g. |
overlapping |
advanced, logical; if |
aggre |
e.g. |
aggre.type |
one of |
drop |
logical; if |
pophaz |
a dataset of population hazards to merge with split data; see Details |
pp |
logical; if |
subset |
a logical vector or any logical condition; data is subsetted before splitting accordingly |
merge |
logical; if |
verbose |
logical; if |
... |
e.g. |
Basics
[lexpand] splits a given data set (with e.g. cancer diagnoses
as rows) to subintervals of time over
calendar time, age, and follow-up time with given time breaks
using [splitMulti].
The dataset must contain appropriate
Date / IDate format or
other numeric variables that can be used
as the time variables.
You may take a look at a simulated cohort
[sire] as an example of the
minimum required information for processing data with lexpand.
Many arguments can be supplied as a character string naming the appropriate
variable (e.g. "sex"), as a symbol (e.g. sex) or as an expression
(e.g. factor(sex, 0:1, c("m", "f"))) for flexibility.
Breaks
You should define all breaks as left inclusive and right exclusive
time points (e.g.[a,b) )
for 1-3 time dimensions so that the last member of a breaks vector
is a meaningful "final upper limit",
e.g. per = c(2002,2007,2012)
to create a last subinterval of the form [2007,2012).
All breaks are explicit, i.e. if drop = TRUE,
any data beyond the outermost breaks points are dropped.
If one wants to have unspecified upper / lower limits on one time scale,
use Inf: e.g. breaks = list(fot = 0:5, age = c(0,45,Inf)).
Breaks for per can also be given in
Date / IDateformat, whereupon
they are converted to fractional years before used in splitting.
The age time scale can additionally
be automatically split into common age grouping schemes
by naming the scheme with an appropriate character string:
"18of5": age groups 0-4, 5-9, 10-14, ..., 75-79, 80-84, 85+
"20of5": age groups 0-4, 5-9, 10-14, ..., 85-89, 90-94, 95+
"101of1": age groups 0, 1, 2, ..., 98, 99, 100+
Time variables
If any of the given time variables
(birth, entry, exit, event)
is in any kind of date format, they are first coerced to
fractional years before splitting
using [get.yrs] (with year.length = "actual").
Sometimes in e.g. SIR/SMR calculation one may want the event time to differ
from the time of exit from follow-up, if the subject is still considered
to be at risk of the event. If event is specified, the transition to
status is moved to event from exit
using [Epi::cutLexis]. See Examples.
The status variable
The statuses in the expanded output (lex.Cst and lex.Xst)
are determined by using either only status or both status
and entry.status. If entry.status = NULL, the status at entry
is guessed according to the type of variable supplied via status:
For numeric variables it will be zero, for factors the first level
(levels(status)[1]) and otherwise the first unique value in alphabetical
order (sort(unique(status))[1]).
Using numeric or factor status
variables is strongly recommended. Logical expressions are also allowed
(e.g. status = my_status != 0L) and are converted to integer internally.
Merging population hazard information
To enable computing relative/net survivals with [survtab]
and [relpois], lexpand merges an appropriate
population hazard data (pophaz) to the expanded data
before dropping rows outside the specified
time window (if drop = TRUE). pophaz must, for this reason,
contain at a minimum the variables named
agegroup, year, and haz. pophaz may contain additional variables to specify
different population hazard levels in different strata; e.g. popmort includes sex.
All the strata-defining variables must be present in the supplied data. lexpand will
automatically detect variables with common names in the two datasets and merge using them.
Currently year must be an integer variable specifying the appropriate year. agegroup
must currently also specify one-year age groups, e.g. popmort specifies 101 age groups
of length 1 year. In both
year and agegroup variables the values are interpreted as the lower bounds of intervals
(and passed on to a cut call). The mandatory variable haz
must specify the appropriate average rate at the person-year level;
e.g. haz = -log(survProb) where survProb is a one-year conditional
survival probability will be the correct hazard specification.
The corresponding pophaz population hazard value is merged by using the mid points
of the records after splitting as reference values. E.g. if age=89.9 at the start
of a 1-year interval, then the reference age value is 90.4 for merging.
This way we get a "typical" population hazard level for each record.
Computing Pohar-Perme weights
If pp = TRUE, Pohar-Perme weights
(the inverse of cumulative population survival) are computed. This will
create the new pp variable in the expanded data. pp is a
reserved name and lexpand throws exception if a variable with that name
exists in data.
When a survival interval contains one or several rows per subject
(e.g. due to splitting by the per scale),
pp is cumulated from the beginning of the first record in a survival
interval for each subject to the mid-point of the remaining time within that
survival interval, and that value is given for every other record
that a given person has within the same survival interval.
E.g. with 5 rows of duration 1/5 within a survival interval
[0,1)], pp is determined for all records by a cumulative
population survival from 0 to 0.5. The existing accuracy is used,
so that the weight is cumulated first up to the end of the second row
and then over the remaining distance to the mid-point (first to 0.4, then to
0.5). This ensures that more accurately merged population hazards are fully
used.
Event not at end of follow-up & overlapping time lines
event may be used if the event indicated by status should
occur at a time differing from exit. If event is defined,
cutLexis is used on the data set after coercing it to the Lexis
format and before splitting. Note that some values of event are allowed
to be NA as with cutLexis to accommodate observations
without an event occurring.
Additionally, setting overlapping = FALSE ensures that (irrespective
of using event) the each subject defined by id only has one
continuous time line instead of possibly overlapping time lines if
there are multiple rows in data by id.
Aggregating
Certain analyses such as SIR/SMR calculations require tables of events and
person-years by the unique combinations (interactions) of several variables.
For this, aggre can be specified as a list of such variables
(preferably factor variables but not mandatory)
and any arbitrary functions of the
variables at one's disposal. E.g.
aggre = list(sex, agegr = cut(dg_age, 0:100))
would tabulate events and person-years by sex and an ad-hoc age group variable. Every ad-hoc-created variable should be named.
fot, per, and age are special reserved variables which,
when present in the aggre list, are output as categories of the
corresponding time scale variables by using
e.g.
cut(fot, breaks$fot, right=FALSE).
This only works if
the corresponding breaks are defined in breaks or via "...".
E.g.
aggre = list(sex, fot.int = fot) with
breaks = list(fot=0:5).
The output variable fot.int in the above example will have
the lower limits of the appropriate intervals as values.
aggre as a named list will output numbers of events and person-years
with the given new names as categorizing variable names, e.g.
aggre = list(follow_up = fot, gender = sex, agegroup = age).
The output table has person-years (pyrs) and event counts
(e.g. from0to1) as columns. Event counts are the numbers of transitions
(lex.Cst != lex.Xst) or the lex.Xst value at a subject's
last record (subject possibly defined by id).
If aggre.type = "unique" (alias "non-empty"),
the above results are computed for existing
combinations of expressions given in aggre, but also for non-existing
combinations if aggre.type = "cartesian" (alias "full"). E.g. if a
factor variable has levels "a", "b", "c" but the data is limited
to only have levels "a", "b" present
(more than zero rows have these level values), the former setting only
computes results for "a", "b", and the latter also for "c"
and any combination with other variables or expression given in aggre.
In essence, "cartesian" forces also combinations of variables used
in aggre that have no match in data to be shown in the result.
If aggre is not NULL and pophaz has been supplied,
lexpand also aggregates the expected counts of events, which
appears in the output data by the reserved name d.exp. Additionally,
having pp = TRUE causes lexpand to also compute various
Pohar-Perme weighted figures necessary for computing Pohar-Perme net survivals
with [survtab_ag]. This can be slow, so consider what is really
needed. The Pohar-Perme weighted figures have the suffix .pp.
[0,1)]: R:0,1) [survtab_ag]: R:survtab_ag
If aggre = NULL, returns
a data.table or data.frame
(depending on options("popEpi.datatable"); see ?popEpi)
object expanded to accommodate split observations with time scales as
fractional years and pophaz merged in if given. Population
hazard levels in new variable pop.haz, and Pohar-Perme
weights as new variable pp if requested.
If aggre is defined, returns a long-format
data.table/data.frame with the variable pyrs (person-years),
and variables for the counts of transitions in state or state at end of
follow-up formatted fromXtoY, where X and Y are
the states transitioned from and to, respectively. The data may also have
the columns d.exp for expected numbers of cases and various
Pohar-Perme weighted figures as identified by the suffix .pp; see
Details.
Joonas Miettinen
[Epi::Lexis], [popmort]
Other splitting functions:
splitLexisDT(),
splitMulti()
Other aggregation functions:
aggre(),
as.aggre(),
setaggre(),
summary.aggre()
## prepare data for e.g. 5-year cohort survival calculation x <- lexpand(sire, breaks=list(fot=seq(0, 5, by = 1/12)), birth = bi_date, entry = dg_date, exit = ex_date, status = status != 0, pophaz=popmort) ## prepare data for e.g. 5-year "period analysis" for 2008-2012 BL <- list(fot = seq(0, 5, by = 1/12), per = c("2008-01-01", "2013-01-01")) x <- lexpand(sire, breaks = BL, birth = bi_date, entry = dg_date, exit = ex_date, pophaz=popmort, status = status != 0) ## aggregating BL <- list(fot = 0:5, per = c("2003-01-01","2008-01-01", "2013-01-01")) ag <- lexpand(sire, breaks = BL, status = status != 0, birth = bi_date, entry = dg_date, exit = ex_date, aggre=list(sex, period = per, surv.int = fot)) ## aggregating even more ag <- lexpand(sire, breaks = BL, status = status != 0, birth = bi_date, entry = dg_date, exit = ex_date, aggre=list(sex, period = per, surv.int = fot), pophaz = popmort, pp = TRUE) ## using "..." x <- lexpand(sire, fot=0:5, status = status != 0, birth = bi_date, entry = dg_date, exit = ex_date, pophaz=popmort) x <- lexpand(sire, fot=0:5, status = status != 0, birth = bi_date, entry = dg_date, exit = ex_date, aggre=list(sex, surv.int = fot)) ## using the "event" argument: it just places the transition to given "status" ## at the "event" time instead of at the end, if possible using cutLexis x <- lexpand(sire, status = status, event = dg_date, birth = bi_date, entry = dg_date, exit = ex_date,) ## aggregating with custom "event" time ## (the transition to status is moved to the "event" time) x <- lexpand(sire, status = status, event = dg_date, birth = bi_date, entry = dg_date, exit = ex_date, per = 1970:2014, age = c(0:100,Inf), aggre = list(sex, year = per, agegroup = age))## prepare data for e.g. 5-year cohort survival calculation x <- lexpand(sire, breaks=list(fot=seq(0, 5, by = 1/12)), birth = bi_date, entry = dg_date, exit = ex_date, status = status != 0, pophaz=popmort) ## prepare data for e.g. 5-year "period analysis" for 2008-2012 BL <- list(fot = seq(0, 5, by = 1/12), per = c("2008-01-01", "2013-01-01")) x <- lexpand(sire, breaks = BL, birth = bi_date, entry = dg_date, exit = ex_date, pophaz=popmort, status = status != 0) ## aggregating BL <- list(fot = 0:5, per = c("2003-01-01","2008-01-01", "2013-01-01")) ag <- lexpand(sire, breaks = BL, status = status != 0, birth = bi_date, entry = dg_date, exit = ex_date, aggre=list(sex, period = per, surv.int = fot)) ## aggregating even more ag <- lexpand(sire, breaks = BL, status = status != 0, birth = bi_date, entry = dg_date, exit = ex_date, aggre=list(sex, period = per, surv.int = fot), pophaz = popmort, pp = TRUE) ## using "..." x <- lexpand(sire, fot=0:5, status = status != 0, birth = bi_date, entry = dg_date, exit = ex_date, pophaz=popmort) x <- lexpand(sire, fot=0:5, status = status != 0, birth = bi_date, entry = dg_date, exit = ex_date, aggre=list(sex, surv.int = fot)) ## using the "event" argument: it just places the transition to given "status" ## at the "event" time instead of at the end, if possible using cutLexis x <- lexpand(sire, status = status, event = dg_date, birth = bi_date, entry = dg_date, exit = ex_date,) ## aggregating with custom "event" time ## (the transition to status is moved to the "event" time) x <- lexpand(sire, status = status, event = dg_date, birth = bi_date, entry = dg_date, exit = ex_date, per = 1970:2014, age = c(0:100,Inf), aggre = list(sex, year = per, agegroup = age))
Plot SIR spline lines with R base graphics
## S3 method for class 'sirspline' lines(x, conf.int = TRUE, print.levels = NA, select.spline, ...)## S3 method for class 'sirspline' lines(x, conf.int = TRUE, print.levels = NA, select.spline, ...)
x |
an object returned by function sirspline |
conf.int |
logical; default TRUE draws also the 95 confidence intervals |
print.levels |
name(s) to be plotted. Default plots all levels. |
select.spline |
select which spline variable (a number or a name) is plotted. |
... |
arguments passed on to lines() |
In lines.sirspline most of graphical parameters is user
adjustable.
Desired spline variable can be selected with select.spline and only one
can be plotted at a time. The spline variable can include
several levels, e.g. gender (these are the levels of print
from sirspline). All levels are printed by default, but a
specific level can be selected using argument
print.levels. Printing the levels separately enables e.g. to
give different colours for each level.
Always returns NULL invisibly.
This function is called for its side effects.
Matti Rantanen
Other sir functions:
plot.sirspline(),
sir(),
sir_exp(),
sir_ratio(),
sirspline()
Plots the observed (with extrapolation) and expected survival
curves for all strata in an object created by [survmean]
## S3 method for class 'survmean' lines(x, ...)## S3 method for class 'survmean' lines(x, ...)
x |
a |
... |
arguments passed (ultimately) to |
This function is intended to be a workhorse for [plot.survmean].
If you want finer control over the plotted curves, extract the curves from
the survmean output using
attr(x, "curves")
where x is a survmean object.
Always returns NULL invisibly.
This function is called for its side effects.
Joonas Miettinen
Other survmean functions:
Surv(),
plot.survmean(),
survmean()
lines method for survtab objectsPlot lines from a survtab object
## S3 method for class 'survtab' lines(x, y = NULL, subset = NULL, conf.int = TRUE, col = NULL, lty = NULL, ...)## S3 method for class 'survtab' lines(x, y = NULL, subset = NULL, conf.int = TRUE, col = NULL, lty = NULL, ...)
x |
a |
y |
a variable to plot; a quoted name of a variable
in |
subset |
a logical condition; |
conf.int |
logical; if |
col |
line colour passed to |
lty |
line type passed to |
... |
additional arguments passed on to to a |
Always returns NULL invisibly.
This function is called for its side effects.
Joonas Miettinen
Other survtab functions:
Surv(),
plot.survtab(),
print.survtab(),
summary.survtab(),
survtab(),
survtab_ag()
data(sire) data(sibr) si <- rbind(sire, sibr) si$period <- cut(si$dg_date, as.Date(c("1993-01-01", "2004-01-01", "2013-01-01")), right = FALSE) si$cancer <- c(rep("rectal", nrow(sire)), rep("breast", nrow(sibr))) x <- lexpand(si, birth = bi_date, entry = dg_date, exit = ex_date, status = status %in% 1:2, fot = 0:5, aggre = list(cancer, period, fot)) st <- survtab_ag(fot ~ cancer + period, data = x, surv.method = "lifetable", surv.type = "surv.obs") plot(st, "surv.obs", subset = cancer == "breast", ylim = c(0.5, 1), col = "blue") lines(st, "surv.obs", subset = cancer == "rectal", col = "red") ## or plot(st, "surv.obs", col = c(2,2,4,4), lty = c(1, 2, 1, 2))data(sire) data(sibr) si <- rbind(sire, sibr) si$period <- cut(si$dg_date, as.Date(c("1993-01-01", "2004-01-01", "2013-01-01")), right = FALSE) si$cancer <- c(rep("rectal", nrow(sire)), rep("breast", nrow(sibr))) x <- lexpand(si, birth = bi_date, entry = dg_date, exit = ex_date, status = status %in% 1:2, fot = 0:5, aggre = list(cancer, period, fot)) st <- survtab_ag(fot ~ cancer + period, data = x, surv.method = "lifetable", surv.type = "surv.obs") plot(st, "surv.obs", subset = cancer == "breast", ylim = c(0.5, 1), col = "blue") lines(st, "surv.obs", subset = cancer == "rectal", col = "red") ## or plot(st, "surv.obs", col = c(2,2,4,4), lty = c(1, 2, 1, 2))
selects lowest values of each factor after cut() based on that the value starts from index 2 and end in comma ",".
lower_bound(cut)lower_bound(cut)
cut |
is a character vector of elements "(20,60]" |
A numeric vector.
Matti Rantanen
ltable makes use of data.table
capabilities to tabulate frequencies or
arbitrary functions of given variables into a long format
data.table/data.frame. expr.by.cj is the
equivalent for more advanced users.
ltable( data, by.vars = NULL, expr = list(obs = .N), subset = NULL, use.levels = TRUE, na.rm = FALSE, robust = TRUE ) expr.by.cj( data, by.vars = NULL, expr = list(obs = .N), subset = NULL, use.levels = FALSE, na.rm = FALSE, robust = FALSE, .SDcols = NULL, enclos = parent.frame(1L), ... )ltable( data, by.vars = NULL, expr = list(obs = .N), subset = NULL, use.levels = TRUE, na.rm = FALSE, robust = TRUE ) expr.by.cj( data, by.vars = NULL, expr = list(obs = .N), subset = NULL, use.levels = FALSE, na.rm = FALSE, robust = FALSE, .SDcols = NULL, enclos = parent.frame(1L), ... )
data |
a |
by.vars |
names of variables that are used for categorization,
as a character vector, e.g. |
expr |
object or a list of objects where each object is a function of a variable (see: details) |
subset |
a logical condition; data is limited accordingly before
evaluating |
use.levels |
logical; if |
na.rm |
logical; if |
robust |
logical; if |
.SDcols |
advanced; a character vector of column names
passed to inside the data.table's brackets
|
enclos |
advanced; an environment; the enclosing environment of the data. |
... |
advanced; other arguments passed to inside the
data.table's brackets |
Returns expr for each unique combination of given by.vars.
By default makes use of any and all [levels] present for
each variable in by.vars. This is useful,
because even if a subset of the data does not contain observations
for e.g. a specific age group, those age groups are
nevertheless presented in the resulting table; e.g. with the default
expr = list(obs = .N) all age group levels
are represented by a row and can have obs = 0.
The function differs from the
vanilla [table] by giving a long format table of values
regardless of the number of by.vars given.
Make use of e.g. [cast_simple] if data needs to be
presented in a wide format (e.g. a two-way table).
The rows of the long-format table are effectively Cartesian products
of the levels of each variable in by.vars,
e.g. with by.vars = c("sex", "area") all levels of
area are repeated for both levels of sex
in the table.
The expr allows the user to apply any function(s) on all
levels defined by by.vars. Here are some examples:
.N or list(.N) is a function used inside a data.table to
calculate counts in each group
list(obs = .N), same as above but user assigned variable name
list(sum(obs), sum(pyrs), mean(dg_age)), multiple objects in a list
list(obs = sum(obs), pyrs = sum(pyrs)), same as above with user defined variable names
If use.levels = FALSE, no levels information will
be used. This means that if e.g. the agegroup
variable is a factor and has 18 levels defined, but only 15 levels
are present in the data, no rows for the missing
levels will be shown in the table.
na.rm simply drops any rows from the resulting table where
any of the by.vars values was NA.
A data.table of statistics (e.g. counts) stratified by the columns defined
in by.vars.
expr.by.cj(): Somewhat more streamlined ltable with
defaults for speed. Explicit determination of enclosing environment
of data.
Joonas Miettinen, Matti Rantanen
[table], [cast_simple], [data.table::melt]
data("sire", package = "popEpi") sr <- sire sr$agegroup <- cut(sr$dg_age, breaks=c(0,45,60,75,85,Inf)) ## counts by default ltable(sr, "agegroup") ## any expression can be given ltable(sr, "agegroup", list(mage = mean(dg_age))) ltable(sr, "agegroup", list(mage = mean(dg_age), vage = var(dg_age))) ## also returns levels where there are zero rows (expressions as NA) ltable(sr, "agegroup", list(obs = .N, minage = min(dg_age), maxage = max(dg_age)), subset = dg_age < 85) #### expr.by.cj expr.by.cj(sr, "agegroup") ## any arbitrary expression can be given expr.by.cj(sr, "agegroup", list(mage = mean(dg_age))) expr.by.cj(sr, "agegroup", list(mage = mean(dg_age), vage = var(dg_age))) ## only uses levels of by.vars present in data expr.by.cj(sr, "agegroup", list(mage = mean(dg_age), vage = var(dg_age)), subset = dg_age < 70) ## .SDcols trick expr.by.cj(sr, "agegroup", lapply(.SD, mean), subset = dg_age < 70, .SDcols = c("dg_age", "status"))data("sire", package = "popEpi") sr <- sire sr$agegroup <- cut(sr$dg_age, breaks=c(0,45,60,75,85,Inf)) ## counts by default ltable(sr, "agegroup") ## any expression can be given ltable(sr, "agegroup", list(mage = mean(dg_age))) ltable(sr, "agegroup", list(mage = mean(dg_age), vage = var(dg_age))) ## also returns levels where there are zero rows (expressions as NA) ltable(sr, "agegroup", list(obs = .N, minage = min(dg_age), maxage = max(dg_age)), subset = dg_age < 85) #### expr.by.cj expr.by.cj(sr, "agegroup") ## any arbitrary expression can be given expr.by.cj(sr, "agegroup", list(mage = mean(dg_age))) expr.by.cj(sr, "agegroup", list(mage = mean(dg_age), vage = var(dg_age))) ## only uses levels of by.vars present in data expr.by.cj(sr, "agegroup", list(mage = mean(dg_age), vage = var(dg_age)), subset = dg_age < 70) ## .SDcols trick expr.by.cj(sr, "agegroup", lapply(.SD, mean), subset = dg_age < 70, .SDcols = c("dg_age", "status"))
Mean population counts in Finland year, sex, and age group.
data.table with columns
sex gender coded as male, female (0, 1)
year calendar year 1981-2016
agegroup - coded 0 to 100; one-year age groups
meanpop the mean population count; that is, the mean of the
annual population counts of two consecutive years; e.g. for 1990
meanpop is the mean of population counts for 1990 and 1991
(counted at 1990-01-01 and 1991-01-01, respectively)
Statistics Finland
Other popEpi data:
ICSS,
popmort,
sibr,
sire,
stdpop101,
stdpop18
Given a data.table DT, replaces any NA values
in the variables given in vars in DT. Takes a copy of the
original data and returns the modified copy.
na2zero(DT, vars = NULL)na2zero(DT, vars = NULL)
DT |
|
vars |
a character string vector of variables names in |
Given a data.table object, converts NA values
to numeric (double) zeros for all variables named in vars or
all variables if vars = NULL.
A copy of DT where NA values have been replaced with zero.
Joonas Miettinen
Plot rate estimates with confidence intervals lines using R base graphics
## S3 method for class 'rate' plot(x, conf.int = TRUE, eps = 0.2, left.margin, xlim, ...)## S3 method for class 'rate' plot(x, conf.int = TRUE, eps = 0.2, left.margin, xlim, ...)
x |
a rate object (see |
conf.int |
logical; default TRUE draws the confidence intervals |
eps |
is the height of the ending of the error bars |
left.margin |
set a custom left margin for long variable names. Function tries to do it by default. |
xlim |
change the x-axis location |
... |
arguments passed on to graphical functions points and segment
(e.g. |
This is limited explanatory tool but most graphical parameters are user adjustable.
Always returns NULL invisibly.
This function is called for its side effects.
Matti Rantanen
Plot SIR estimates with error bars
## S3 method for class 'sir' plot( x, conf.int = TRUE, ylab, xlab, xlim, main, eps = 0.2, abline = TRUE, log = FALSE, left.margin, ... )## S3 method for class 'sir' plot( x, conf.int = TRUE, ylab, xlab, xlim, main, eps = 0.2, abline = TRUE, log = FALSE, left.margin, ... )
x |
an object returned by function |
conf.int |
default TRUE draws confidence intervals |
ylab |
overwrites default y-axis label |
xlab |
overwrites default x-axis label |
xlim |
x-axis minimum and maximum values |
main |
optional plot title |
eps |
error bar vertical bar height (works only in 'model' or 'univariate') |
abline |
logical; draws a grey line in SIR = 1 |
log |
logical; SIR is not in log scale by default |
left.margin |
adjust left marginal of the plot to fit long variable names |
... |
arguments passed on to plot(), segment and lines() |
Plot SIR estimates and confidence intervals
univariate - plots SIR with univariate confidence intervals
model - plots SIR with Poisson modelled confidence intervals
Customize
Normal plot parameters can be passed to plot. These can be a vector when plotting error bars:
pch - point type
lty - line type
col - line/point colour
lwd - point/line size
Tips for plotting splines
It's possible to use plot to first draw the
confidence intervals using specific line type or colour and then plotting
again the estimate using lines(... , conf.int = FALSE) with different
settings. This works only when plot.type is 'splines'.
Always returns NULL invisibly.
This function is called for its side effects.
Matti Rantanen
[sir], [sirspline]
# Plot SIR estimates # plot(sir.by.gender, col = c(4,2), log=FALSE, eps=0.2, lty=1, lwd=2, pch=19, # main = 'SIR by gender', abline=TRUE)# Plot SIR estimates # plot(sir.by.gender, col = c(4,2), log=FALSE, eps=0.2, lty=1, lwd=2, pch=19, # main = 'SIR by gender', abline=TRUE)
plot method for sirspline-objectPlot SIR splines using R base graphics.
## S3 method for class 'sirspline' plot(x, conf.int = TRUE, abline = TRUE, log = FALSE, type, ylab, xlab, ...)## S3 method for class 'sirspline' plot(x, conf.int = TRUE, abline = TRUE, log = FALSE, type, ylab, xlab, ...)
x |
an object returned by function sirspline |
conf.int |
logical; default TRUE draws also the 95 confidence intervals |
abline |
logical; draws a reference line where SIR = 1 |
log |
logical; default FALSE. Should the y-axis be in log scale |
type |
select |
ylab |
overwrites default y-axis label; can be a vector if multiple splines fitted |
xlab |
overwrites default x-axis label; can be a vector if multiple splines fitted |
... |
arguments passed on to plot() |
In plot.sirspline almost every graphical parameter are user
adjustable, such as ylim, xlim.
plot.sirsplines calls lines.splines to add lines.
The plot axis without lines can be plotted using option type = 'n'.
On top of the frame it's then possible to add a grid,
abline or text before plotting the lines (see: sirspline).
Always returns NULL invisibly.
This function is called for its side effects.
Matti Rantanen
Other sir functions:
lines.sirspline(),
sir(),
sir_exp(),
sir_ratio(),
sirspline()
Plots the observed (with extrapolation) and expected survival
curves for all strata in an object created by [survmean]
## S3 method for class 'survmean' plot(x, ...)## S3 method for class 'survmean' plot(x, ...)
x |
a |
... |
arguments passed (ultimately) to |
For examples see [survmean]. This function is intended only
for graphically inspecting that the observed survival curves with extrapolation
and the expected survival curves have been sensibly computed in survmean.
If you want finer control over the plotted curves, extract the curves from
the survmean output using
attr(x, "curves")
where x is a survmean object.
Always returns NULL invisibly.
This function is called for its side effects.
Joonas Miettinen
Other survmean functions:
Surv(),
lines.survmean(),
survmean()
plot method for survtab objectsPlotting for survtab objects
## S3 method for class 'survtab' plot( x, y = NULL, subset = NULL, conf.int = TRUE, col = NULL, lty = NULL, ylab = NULL, xlab = NULL, ... )## S3 method for class 'survtab' plot( x, y = NULL, subset = NULL, conf.int = TRUE, col = NULL, lty = NULL, ylab = NULL, xlab = NULL, ... )
x |
a |
y |
survival a character vector of a variable names to plot;
e.g. |
subset |
a logical condition; |
conf.int |
logical; if |
col |
line colour; one value for each stratum; will be recycled |
lty |
line type; one value for each stratum; will be recycled |
ylab |
label for Y-axis |
xlab |
label for X-axis |
... |
additional arguments passed on to |
Always returns NULL invisibly.
This function is called for its side effects.
Joonas Miettinen
Other survtab functions:
Surv(),
lines.survtab(),
print.survtab(),
summary.survtab(),
survtab(),
survtab_ag()
data(sire) data(sibr) si <- rbind(sire, sibr) si$period <- cut(si$dg_date, as.Date(c("1993-01-01", "2004-01-01", "2013-01-01")), right = FALSE) si$cancer <- c(rep("rectal", nrow(sire)), rep("breast", nrow(sibr))) x <- lexpand(si, birth = bi_date, entry = dg_date, exit = ex_date, status = status %in% 1:2, fot = 0:5, aggre = list(cancer, period, fot)) st <- survtab_ag(fot ~ cancer + period, data = x, surv.method = "lifetable", surv.type = "surv.obs") plot(st, "surv.obs", subset = cancer == "breast", ylim = c(0.5, 1), col = "blue") lines(st, "surv.obs", subset = cancer == "rectal", col = "red") ## or plot(st, "surv.obs", col = c(2,2,4,4), lty = c(1, 2, 1, 2))data(sire) data(sibr) si <- rbind(sire, sibr) si$period <- cut(si$dg_date, as.Date(c("1993-01-01", "2004-01-01", "2013-01-01")), right = FALSE) si$cancer <- c(rep("rectal", nrow(sire)), rep("breast", nrow(sibr))) x <- lexpand(si, birth = bi_date, entry = dg_date, exit = ex_date, status = status %in% 1:2, fot = 0:5, aggre = list(cancer, period, fot)) st <- survtab_ag(fot ~ cancer + period, data = x, surv.method = "lifetable", surv.type = "surv.obs") plot(st, "surv.obs", subset = cancer == "breast", ylim = c(0.5, 1), col = "blue") lines(st, "surv.obs", subset = cancer == "rectal", col = "red") ## or plot(st, "surv.obs", col = c(2,2,4,4), lty = c(1, 2, 1, 2))
Computes confidence intervals for Poisson rates
poisson.ci(x, pt = 1, conf.level = 0.95)poisson.ci(x, pt = 1, conf.level = 0.95)
x |
observed |
pt |
expected |
conf.level |
alpha level |
A data.frame with columns
x: arg x
pt: arg pt
rate: result of x / pt
lower: lower bound of CI
upper: upper bound of CI
conf.level: arg conf.level
epitools
poisson.ci(x = 4, pt = 5, conf.level = 0.95)poisson.ci(x = 4, pt = 5, conf.level = 0.95)
Several functions in popEpi make use of population or expected
hazards in computing the intended estimates (e.g. [survtab]).
This document explains using such data sets in this package.
Population hazard data sets (pophaz for short) in popEpi should
be data.frames in the "long" format where one of the columns must be
named haz (for hazard), and other columns define the values or
levels in variables relating to subjects in your data. For example,
[popmort] contains Finnish population mortality hazards
by sex, calendar year, and 1-year age group.
sex |
year |
agegroup |
haz |
| 0 | 1951 | 0 | 0.036363176 |
| 0 | 1951 | 1 | 0.003616547 |
| 0 | 1951 | 2 | 0.002172384 |
| 0 | 1951 | 3 | 0.001581249 |
| 0 | 1951 | 4 | 0.001180690 |
| 0 | 1951 | 5 | 0.001070595 |
The names of the columns should match to the names of the variables
that you have in your subject-level data. Time variables in your pophaz
may also correspond to Lexis time scales; see
[survtab].
Any time variables (as they usually have) should be coded consistently:
When using fractional years in your data, the time variables in your pophaz
must also be coded in fractional years. When using e.g. Dates in your
data, ensure that the pophaz time variables are coded at the level of days
(or Dates for calendar time).
The haz variable in your pophaz should also be coded consistently
with the used time variables. E.g. haz values in life-tables
reported as deaths per person-year should be multiplied by 365.25 when
using day-level time variables. Typically you'll have calendar time and age
expressed in years, which means haz should be expressed as the number
of deaths per person-year.
If you have your population hazards in a ratetable object
usable by functions in survival and relsurv, you may
transform them to long-format data.frames using
[ratetable_to_long_dt]. Ensure, however, that the
created haz column is coded at the right level (events per
days or years typically).
National statistical institutions, the WHO, and e.g. the Human Life-Table Database supply life-table data.
Joonas Miettinen
[pophaz].Population mortality rates in Finland 1951 - 2013 in 101 age groups and
by gender. This is an example of a population hazard table as used in
popEpi; for the general help page, see [pophaz].
data.table with columns
sex gender coded as male, female (0, 1)
year calendar year
agegroup - coded 0 to 100; one-year age groups
haz the average population mortality rate per person-year
(d/(pyrs), where d is the number of deaths and pyrs is the person-years)
Statistics Finland
[pophaz]
Other popEpi data:
ICSS,
meanpop_fi,
sibr,
sire,
stdpop101,
stdpop18
prepExpo uses a Lexis object of periods of exposure
to fill gaps between the periods and overall entry and exit times without
accumulating exposure time in periods of no exposure, and splits the
result if requested.
prepExpo( lex, freezeScales = "work", cutScale = "per", entry = min(get(cutScale)), exit = max(get(cutScale)), by = "lex.id", breaks = NULL, freezeDummy = NULL, subset = NULL, verbose = FALSE, ... )prepExpo( lex, freezeScales = "work", cutScale = "per", entry = min(get(cutScale)), exit = max(get(cutScale)), by = "lex.id", breaks = NULL, freezeDummy = NULL, subset = NULL, verbose = FALSE, ... )
lex |
a |
freezeScales |
a character vector naming |
cutScale |
the |
entry |
an expression; the time of entry to follow-up which may be earlier, at, or after
the first time of exposure in |
exit |
the same as |
by |
a character vector indicating variable names in |
breaks |
a named list of breaks;
e.g. |
freezeDummy |
a character string; specifies the name for a dummy variable
that this function will create and add to output which
identifies rows where the |
subset |
a logical condition to subset data by before computations;
e.g. |
verbose |
logical; if |
... |
additional arguments passed on to |
prepExpo is a convenience function for the purpose of eventually aggregating
person-time and events in categories of not only normally progressing
[Epi::Lexis] time scales but also some time scales which should not
progress sometimes. For example a person may work at a production facility
only intermittently, meaning exposure time (to work-related substances
for example) should not progress outside of periods of work. This allows for
e.g. a correct aggregation of person-time and events by categories of cumulative
time of exposure.
Given an [Epi::Lexis] object containing rows (time lines)
where a subject is exposed to something (and NO periods without exposure),
fills any gaps between exposure periods for each unique combination of by
and the subject-specific "ultimate" entry and exit times,
"freezes" the cumulative exposure times in periods of no exposure,
and splits data using breaks passed to [splitMulti]
if requested. Results in a (split) Lexis object where freezeScales
do not progress in time periods where no exposure was recorded in lex.
This function assumes that entry and exit arguments are the
same for each row within a unique combination of variables named in by.
E.g. with by = "lex.id" only each lex.id has a unique value
for entry and exit at most.
The supplied breaks split the data using splitMulti, with
the exception that breaks supplied concerning any frozen time scales
ONLY split the rows where the time scales are not frozen. E.g.
with freezeScales = "work",
breaks = list(work = 0:10, cal = 1995:2010) splits all rows over
"cal" but only non-frozen rows over "work".
Only supports frozen time scales that advance and freeze contemporaneously: e.g. it would not currently be possible to take into account the cumulative time working at a facility and the cumulative time doing a single task at the facility, if the two are not exactly the same. On the other hand one might use the same time scale for different exposure types, supply them as separate rows, and identify the different exposures using a dummy variable.
Returns a Lexis object that has been split if breaks is specified.
The resulting time is also a data.table if
options("popEpi.datatable") == TRUE (see: ?popEpi)
Function(s) to compute prevalence statistics.
prev_lexis( lexis, observation_time_points, stratum_breaks, aggre_by = NULL, subset = NULL, merge_dt = NULL, merge_optional_args = NULL )prev_lexis( lexis, observation_time_points, stratum_breaks, aggre_by = NULL, subset = NULL, merge_dt = NULL, merge_optional_args = NULL )
lexis |
A |
observation_time_points |
Prevalence observation time points.
The |
stratum_breaks |
Breaks to split |
aggre_by |
Passed to |
subset |
This argument is evaluated within the context of
Due to the non-standard evaluation method used here, you can make use of
data in the calling environment as well. E.g. with variable |
merge_dt |
Table containing survival estimates applicable to
|
merge_optional_args |
Each element passed to |
popEpi::prev_lexis
Returns a data.table with
Stratifying columns defined via aggre_by (if any),
Stratifying time scale columns defined via stratum_breaks (if any),
The observation time scale column defined via observation_time_points,
n_prev, the number of subjects in follow-up, and
n_prev_eff, the above plus the "extrapolated" number in follow-up, if
this was requested.
popEpi::prev_lexis
popEpi::prev_lexis can be used to compute numbers of (potentially
effective numbers of) subjects remaining in follow-up at arbitrary time
points. It performs the following steps:
For each observation time point in observation_time_points[[1]]:
Calls [lexis_split_merge_aggregate_by_stratum] with the
lexis, subset, aggre_by supplied to
popEpi::prev_lexis and with
aggre_exprs = list(n_prev = quote(.N)), and breaks = stratum_breaks.
This produces a table stratified by
aggre_by and all time scales used in stratum_breaks.
The only value column at this point is n_prev, the number
of subjects in follow-up at the given observation time point.
If !is.null(merge_dt),
collect subjects in lexis who were censored before the
current observation time point. This is defined as those in
lexis who have lexis[["lex.Cst"]] == lexis[["lex.Xst"]] and
lexis[[obs_ts_col_nm]] + lexis[["lex.dur"]] < obs_tp, where
obs_ts_col_nm = names(observation_time_points) and
obs_tp is the current observation time point. These collected
subjects are the ones we need to extrapolate to the current
observation time point.
If there are no observations that need to be extrapolated
then we end the extrapolation step early and simply use
n_prev_eff = n_prev. Otherwise proceed as described below.
If inherits(merge_dt, "list"), collect arguments from
merge_dt to call [surv_lexis].
Arguments lexis, subset, and aggre_by are assigned
internally to the respective arguments supplied to prev_lexis
except subset additionally is used to detect only those cases
who entered follow-up (any time) before the current
observation time point. It may be worth emphasizing that lexis
really is the input argument lexis and not only those whose
follow-up we need to "extrapolate".
We also set estimators = "S_ch".
Any arguments that are passed via
merge_dt override even the internally set arguments.
The arguments not set internally or supplied by the user via
merge_dt make use of the defaults of [surv_lexis].
E.g.
merge_dt = list(aggre_by = NULL, breaks = list(ts_fut = futs))
are both passed to [surv_lexis] despite what aggre_by was
for prev_lexis.
If inherits(merge_dt, "list"), we also interpolate survival
estimates in the table we have just produced so that there
are always 1000 intervals starting from zero and ending where
the last survival estimate is available. This is peformed to
ensure that the survival estimates we merge will be available
at a sufficient "resolution". For instance, even if the
survival estimates are for one-year intervals due to sparsity,
we get a reasonable survival estimate for a subject who was
censored at 1.01 — from somewhere close to 1.01 instead from
2.0.
Merge (for the first time) merge_dt with the collected
subjects at the original exit time of each
subject. This yields the survival probability for each subject
at exit, in math S(t_e)
(starting from zero — delayed entry is not supported).
Merge merge_dt for the second time, this time at the
current prevalence observation time point such as at
ts_cal = 2009.999.
In math this is S(t_p) where t_p is the prevalence
observation time point.
With both S(t_e) and S(t_p) available, our "extrapolated" or
"effective" number of being in follow-up is between zero and
one for each subject and defined simply as the conditional
survival up to t_p starting from t_e,
S(t_p|t_e) = S(t_p) / S(t_e). E.g.
S(t_p) / S(t_e) = 0.8 / 0.9 ~ 0.8888889.
Call [lexis_split_merge_aggregate_by_stratum] for the second
time, this time with the subjects collected for extrapolation,
and sum the number of extrapolated subjects in follow-up into a
table with the identical stratification as the one created
before.
Add column n_prev_eff as the sum of the number of extrapolated
subjects and the original n_prev into the first table we
created.
Collect the observation time point-specific results into one big table and return it.
aggre ObjectPrint method function for aggre objects; see
[as.aggre] and [aggre].
## S3 method for class 'aggre' print(x, subset = NULL, ...)## S3 method for class 'aggre' print(x, subset = NULL, ...)
x |
an |
subset |
a logical condition to subset results table by
before printing; use this to limit to a certain stratum. E.g.
|
... |
arguments passed to |
Always returns NULL invisibly.
This function is called for its side effects.
Joonas Miettinen
Print method function for rate objects; see
[rate].
## S3 method for class 'rate' print(x, subset = NULL, ...)## S3 method for class 'rate' print(x, subset = NULL, ...)
x |
an |
subset |
a logical condition to subset results table by
before printing; use this to limit to a certain stratum. E.g.
|
... |
arguments for data.tables print method, e.g. row.names = FALSE suppresses row numbers. |
Always returns NULL invisibly.
This function is called for its side effects.
Matti Rantanen
Print method function for survtab objects; see
[survtab_ag].
## S3 method for class 'survtab' print(x, subset = NULL, ...)## S3 method for class 'survtab' print(x, subset = NULL, ...)
x |
a |
subset |
a logical condition to subset results table by
before printing; use this to limit to a certain stratum. E.g.
|
... |
arguments passed to |
Always returns NULL invisibly.
This function is called for its side effects.
Joonas Miettinen
Other survtab functions:
Surv(),
lines.survtab(),
plot.survtab(),
summary.survtab(),
survtab(),
survtab_ag()
rate calculates adjusted rates using
preloaded weights data or user specified weights.
rate( data, obs = NULL, pyrs = NULL, print = NULL, adjust = NULL, weights = NULL, subset = NULL )rate( data, obs = NULL, pyrs = NULL, print = NULL, adjust = NULL, weights = NULL, subset = NULL )
data |
aggregated data (see e.g. |
obs |
observations variable name in data.
Flexible input, typically e.g.
|
pyrs |
person-years variable name in data.
Flexible input, typically e.g.
|
print |
variable name to stratify the rates.
Flexible input, typically e.g.
|
adjust |
variable for adjusting the rates.
Flexible input, typically e.g.
|
weights |
typically a list of weights or a |
subset |
a logical expression to subset data. |
Input data needs to be in aggregated format with observations
and person-years. For individual data use [lexpand], or
[ltable] and merge person-years manually.
The confidence intervals are based on the normal approximation of the logarithm of the rate. The variance of the log rate that is used to derive the confidence intervals is derived using the delta method.
Returns a data.table with observations, person-years, rates and
adjusted rates, if available. Results are stratified by print.
Adjusted rates are identified with suffix .adj and
.lo and .hi are for confidence intervals lower and upper
95% bounds, respectively.
The prefix SE. stands for standard error.
Matti Rantanen, Joonas Miettinen
[lexpand], [ltable]
Other main functions:
Surv(),
relpois(),
relpois_ag(),
sir(),
sirspline(),
survmean(),
survtab(),
survtab_ag()
Other rate functions:
rate_ratio()
## Prepare data with lexpand and then reformat agegroup. data(sibr) x <- lexpand(sibr, birth = bi_date, entry = dg_date, exit = ex_date, breaks = list(per = c(1990,2000,2010,2020), age = c(0:17*5,Inf)), aggre = list(agegroup = age, year.cat = per), status = status != 0) x$agegroup <- cut(x$agegroup, c(0:17*5,Inf), right = FALSE) ## calculate rates for selected periods with Nordic 2000 weights: r1 <- rate( data = x, obs = from0to1, pyrs = pyrs, print = year.cat, adjust = agegroup, weights = 'nordic') r1 ## use total person-years by stratum as weights (some have zero) w <- ltable(x, by.vars = "agegroup", expr = sum(pyrs)) w[is.na(w$V1),]$V1 <- 0 r2 <- rate( data = x, obs = from0to1, pyrs = pyrs, print = year.cat, adjust = agegroup, weights = w$V1) r2 ## use data.frame of weights: names(w) <- c("agegroup", "weights") r2 <- rate( data = x, obs = from0to1, pyrs = pyrs, print = year.cat, adjust = agegroup, weights = w) r2 ## internal weights (same result as above) r3 <- rate( data = x, obs = from0to1, pyrs = pyrs, print = year.cat, adjust = agegroup, weights = "internal") r3## Prepare data with lexpand and then reformat agegroup. data(sibr) x <- lexpand(sibr, birth = bi_date, entry = dg_date, exit = ex_date, breaks = list(per = c(1990,2000,2010,2020), age = c(0:17*5,Inf)), aggre = list(agegroup = age, year.cat = per), status = status != 0) x$agegroup <- cut(x$agegroup, c(0:17*5,Inf), right = FALSE) ## calculate rates for selected periods with Nordic 2000 weights: r1 <- rate( data = x, obs = from0to1, pyrs = pyrs, print = year.cat, adjust = agegroup, weights = 'nordic') r1 ## use total person-years by stratum as weights (some have zero) w <- ltable(x, by.vars = "agegroup", expr = sum(pyrs)) w[is.na(w$V1),]$V1 <- 0 r2 <- rate( data = x, obs = from0to1, pyrs = pyrs, print = year.cat, adjust = agegroup, weights = w$V1) r2 ## use data.frame of weights: names(w) <- c("agegroup", "weights") r2 <- rate( data = x, obs = from0to1, pyrs = pyrs, print = year.cat, adjust = agegroup, weights = w) r2 ## internal weights (same result as above) r3 <- rate( data = x, obs = from0to1, pyrs = pyrs, print = year.cat, adjust = agegroup, weights = "internal") r3
Calculate rate ratio with confidence intervals for rate objects or observations and person-years.
rate_ratio(x, y, crude = FALSE, SE.method = TRUE)rate_ratio(x, y, crude = FALSE, SE.method = TRUE)
x |
a rate-object, vector of two; rate and standard error or observed and person-years. |
y |
a rate-object, vector of two; rate and standard error or observed and person-years. |
crude |
set TRUE to use crude rates; default is FALSE. |
SE.method |
default TRUE; if |
Calculate rate ratio of two age standardized rate objects (see [rate]).
Multiple rates for each objects is supported if there are an equal number of rates.
Another option is to set x and y as a vector of two.
rate and its standard error, and set SE.method = TRUE.
observations and person-year, and set SE.method = FALSE.
See examples.
A vector length of three: rate_ratio, and lower and upper confidence intervals.
Matti Rantanen
[rate]
Other rate functions:
rate()
# two rate ratios; silly example with female rectal / breast cancer ## mortality rates data("sire", package = "popEpi") data("sibr", package = "popEpi") BL <- list(per = 2000:2005) re <- lexpand(sire, birth = "bi_date", entry = "dg_date", exit = "ex_date", status = status == 1, breaks = BL, aggre = list(per)) br <- lexpand(sibr, birth = "bi_date", entry = "dg_date", exit = "ex_date", status = status == 1, breaks = BL, aggre = list(per)) r_re <- rate(re, obs = "from0to1", pyrs = "pyrs") r_br <- rate(br, obs = "from0to1", pyrs = "pyrs") rate_ratio(r_re, r_br, SE.method = TRUE) # manually set rates (0.003 and 0.005) and SEs (0.001 and 0.002) # so that x = y = c('rate', 'SE') rate_ratio(x= c(0.003, 0.001), y= c(0.005, 0.002), SE.method = TRUE) # observed numbers (10 and 20) and person-years (30000 and 40000): rate_ratio(x = c(10, 30000), y = c(20, 40000), SE.method = FALSE)# two rate ratios; silly example with female rectal / breast cancer ## mortality rates data("sire", package = "popEpi") data("sibr", package = "popEpi") BL <- list(per = 2000:2005) re <- lexpand(sire, birth = "bi_date", entry = "dg_date", exit = "ex_date", status = status == 1, breaks = BL, aggre = list(per)) br <- lexpand(sibr, birth = "bi_date", entry = "dg_date", exit = "ex_date", status = status == 1, breaks = BL, aggre = list(per)) r_re <- rate(re, obs = "from0to1", pyrs = "pyrs") r_br <- rate(br, obs = "from0to1", pyrs = "pyrs") rate_ratio(r_re, r_br, SE.method = TRUE) # manually set rates (0.003 and 0.005) and SEs (0.001 and 0.002) # so that x = y = c('rate', 'SE') rate_ratio(x= c(0.003, 0.001), y= c(0.005, 0.002), SE.method = TRUE) # observed numbers (10 and 20) and person-years (30000 and 40000): rate_ratio(x = c(10, 30000), y = c(20, 40000), SE.method = FALSE)
Estimate a Poisson piecewise constant excess hazards model
relpois(data, formula, fot.breaks = NULL, subset = NULL, check = TRUE, ...)relpois(data, formula, fot.breaks = NULL, subset = NULL, check = TRUE, ...)
data |
a dataset split with e.g. |
formula |
a formula which is passed on to |
fot.breaks |
optional; a numeric vector of [a,b) breaks to specify
survival intervals over the follow-up time; if |
subset |
a logical vector or condition; e.g. |
check |
logical; if |
... |
any argument passed on to |
Basics
relpois employs a custom link function of the Poisson variety
to estimate piecewise constant parametric excess hazards. The pieces
are determined by fot.breaks. A log(person-years) offset
is passed automatically to the glm call.
Formula usage
The formula can be used like any ordinary glm formula. The user must
define the outcome in some manner, which is usually lex.Xst after splitting
with e.g. lexpand. The exception is the possibility of including
the baseline excess hazard terms by including the
reserved term FOT in the formula.
For example, lex.Xst != 0 ~ FOT + agegr estimates a model with constant
excess hazards at the follow-up intervals as specified by
the pertinent breaks used in splitting data,
as well as for the different age groups.
FOT is created ad hoc if it is used in the formula.
If you leave out FOT, the hazard is effectively
assumed to be constant across the whole follow-up time.
You can also simply use your own follow-up time interval variable that
you have created before calling relpois. However, when using
FOT, relpois automatically checks for e.g.
negative excess cases in follow-up intervals,
allowing for quickly finding splitting breaks
where model estimation is possible. It also drops any data outside the
follow-up time window.
Splitting and merging population hazard
The easiest way to both split and to include population hazard information is
by using [lexpand]. You may also fairly easily do it by hand
by splitting first and then merging in your population hazard information.
Data requirements
The population hazard information must be available for each record and named
pop.haz. The follow-up time variable must be named "fot" e.g.
as a result of using lexpand. The lex.dur variable must also
be present, containing person-year information.
A glm object created using a custom Poisson family construct. Some
glm methods are applicable.
Joonas Miettinen, Karri Seppa
Paul W Dickman, Andy Sloggett, Michael Hills, and Timo Hakulinen. Regression models for relative survival. Stat Med. 2004 Jan 15;23(1):51-64. doi:10.1002/sim.1597
[lexpand], [stats::poisson], [stats::glm]
Other main functions:
Surv(),
rate(),
relpois_ag(),
sir(),
sirspline(),
survmean(),
survtab(),
survtab_ag()
Other relpois functions:
RPL,
relpois_ag(),
rpcurve()
## use the simulated rectal cancer cohort data("sire", package = "popEpi") sire$agegr <- cut(sire$dg_age, c(0,45,60,Inf), right=FALSE) ## usable straight away after splitting fb <- c(0,3/12,6/12,1,2,3,4,5) x <- lexpand(sire, birth = bi_date, entry = dg_date, exit = ex_date, status=status, breaks = list(fot=fb), pophaz=popmort) rpm <- relpois(x, formula = lex.Xst %in% 1:2 ~ FOT + agegr) ## some methods for glm work. e.g. test for interaction rpm2 <- relpois(x, formula = lex.Xst %in% 1:2 ~ FOT*agegr) anova(rpm, rpm2, test="LRT") AIC(rpm, rpm2) ## update() won't work currently## use the simulated rectal cancer cohort data("sire", package = "popEpi") sire$agegr <- cut(sire$dg_age, c(0,45,60,Inf), right=FALSE) ## usable straight away after splitting fb <- c(0,3/12,6/12,1,2,3,4,5) x <- lexpand(sire, birth = bi_date, entry = dg_date, exit = ex_date, status=status, breaks = list(fot=fb), pophaz=popmort) rpm <- relpois(x, formula = lex.Xst %in% 1:2 ~ FOT + agegr) ## some methods for glm work. e.g. test for interaction rpm2 <- relpois(x, formula = lex.Xst %in% 1:2 ~ FOT*agegr) anova(rpm, rpm2, test="LRT") AIC(rpm, rpm2) ## update() won't work currently
Estimate a Poisson Piecewise Constant Excess Hazards Model
relpois_ag( formula, data, d.exp, offset = NULL, breaks = NULL, subset = NULL, piecewise = TRUE, check = TRUE, ... )relpois_ag( formula, data, d.exp, offset = NULL, breaks = NULL, subset = NULL, piecewise = TRUE, check = TRUE, ... )
formula |
a formula with the counts of events as the response.
Passed on to |
data |
an |
d.exp |
the counts of expected cases. Mandatory.
E.g. |
offset |
the offset for the Poisson model, supplied as e.g.
|
breaks |
optional; a numeric vector of [a,b) breaks to specify
survival intervals over the follow-up time; if |
subset |
a logical vector or condition; e.g. |
piecewise |
|
check |
|
... |
any other argument passed on to |
A relpois object created using a custom Poisson family construct.
Joonas Miettinen, Karri Seppa
[lexpand], [poisson], [glm]
Other main functions:
Surv(),
rate(),
relpois(),
sir(),
sirspline(),
survmean(),
survtab(),
survtab_ag()
Other relpois functions:
RPL,
relpois(),
rpcurve()
## use the simulated rectal cancer cohort data(sire, package = "popEpi") sire$agegr <- cut(sire$dg_age, c(0,45,60,Inf), right=FALSE) ## create aggregated example data fb <- c(0,3/12,6/12,1,2,3,4,5) x <- lexpand(sire, birth = bi_date, entry = dg_date, exit = ex_date, status=status %in% 1:2, breaks = list(fot=fb), pophaz=popmort, pp = FALSE, aggre = list(agegr, fot)) ## fit model using aggregated data rpm <- relpois_ag(formula = from0to1 ~ fot + agegr, data = x, d.exp = d.exp, offset = log(pyrs)) summary(rpm) ## the usual functions for handling glm models work rpm2 <- update(rpm, . ~ fot*agegr) anova(rpm, rpm2, test="LRT") AIC(rpm, rpm2) ## other features such as residuals or predicting are not guaranteed ## to work as intended.## use the simulated rectal cancer cohort data(sire, package = "popEpi") sire$agegr <- cut(sire$dg_age, c(0,45,60,Inf), right=FALSE) ## create aggregated example data fb <- c(0,3/12,6/12,1,2,3,4,5) x <- lexpand(sire, birth = bi_date, entry = dg_date, exit = ex_date, status=status %in% 1:2, breaks = list(fot=fb), pophaz=popmort, pp = FALSE, aggre = list(agegr, fot)) ## fit model using aggregated data rpm <- relpois_ag(formula = from0to1 ~ fot + agegr, data = x, d.exp = d.exp, offset = log(pyrs)) summary(rpm) ## the usual functions for handling glm models work rpm2 <- update(rpm, . ~ fot*agegr) anova(rpm, rpm2, test="LRT") AIC(rpm, rpm2) ## other features such as residuals or predicting are not guaranteed ## to work as intended.
Brute force solution for ensuring a variable is numeric by coercing a variable of any type first to factor and then to numeric
robust_values(num.values, force = FALSE, messages = TRUE)robust_values(num.values, force = FALSE, messages = TRUE)
num.values |
values to convert to numeric |
force |
logical; if |
messages |
logical; if |
A numeric vector.
Returns NULL if given num.values is NULL.
Joonas Miettinen
## this works values <- c("1", "3", "5") values <- robust_values(values) ## this works values <- c("1", "3", "5", NA) values <- robust_values(values) ## this returns originals and throws warnings values <- c("1", "3", "5", "a") suppressWarnings( values <- robust_values(values) ) ## this forces "a" to NA and works otherwise; throws warning about NAs values <- c("1", "3", "5", "a") suppressWarnings( values <- robust_values(values, force=TRUE) )## this works values <- c("1", "3", "5") values <- robust_values(values) ## this works values <- c("1", "3", "5", NA) values <- robust_values(values) ## this returns originals and throws warnings values <- c("1", "3", "5", "a") suppressWarnings( values <- robust_values(values) ) ## this forces "a" to NA and works otherwise; throws warning about NAs values <- c("1", "3", "5", "a") suppressWarnings( values <- robust_values(values, force=TRUE) )
Fit a marginal relative survival curve based on a relpois fit
rpcurve(object)rpcurve(object)
object |
a |
Estimates a marginal curve, i.e. the average of all possible individual curves.
Only supported when the reserved FOT variable was used in relpois.
Computes a curve for each unique combination of covariates (e.g. 4 sets)
and returns a weighted average curve based on the counts
of subjects for each combination (e.g. 1000, 125, 50, 25 respectively).
Fairly fast when only categorical variables have been used, otherwise
go get a cup of coffee.
If delayed entry is present in data due to period analysis limiting, the marginal curve is constructed only for those whose follow-up started in the respective period.
A data.table of relative survival curves.
Joonas Miettinen
Other relpois functions:
RPL,
relpois(),
relpois_ag()
## use the simulated rectal cancer cohort data("sire", package = "popEpi") ab <- c(0,45,55,65,70,Inf) sire$agegr <- cut(sire$dg_age, breaks = ab, right = FALSE) BL <- list(fot= seq(0,10,1/12)) pm <- data.frame(popEpi::popmort) x <- lexpand(sire, breaks=BL, pophaz=pm, birth = bi_date, entry = dg_date, exit = ex_date, status = status %in% 1:2) rpm <- relpois(x, formula = lex.Xst %in% 1:2 ~ -1+ FOT + agegr, fot.breaks=c(0,0.25,0.5,1:8,10)) pmc <- rpcurve(rpm) ## compare with non-parametric estimates names(pm) <- c("sex", "per", "age", "haz") x$agegr <- cut(x$dg_age, c(0,45,55,65,75,Inf), right = FALSE) st <- survtab(fot ~ adjust(agegr), data = x, weights = "internal", pophaz = pm) plot(st, y = "r.e2.as") lines(y = pmc$est, x = pmc$Tstop, col="red")## use the simulated rectal cancer cohort data("sire", package = "popEpi") ab <- c(0,45,55,65,70,Inf) sire$agegr <- cut(sire$dg_age, breaks = ab, right = FALSE) BL <- list(fot= seq(0,10,1/12)) pm <- data.frame(popEpi::popmort) x <- lexpand(sire, breaks=BL, pophaz=pm, birth = bi_date, entry = dg_date, exit = ex_date, status = status %in% 1:2) rpm <- relpois(x, formula = lex.Xst %in% 1:2 ~ -1+ FOT + agegr, fot.breaks=c(0,0.25,0.5,1:8,10)) pmc <- rpcurve(rpm) ## compare with non-parametric estimates names(pm) <- c("sex", "per", "age", "haz") x$agegr <- cut(x$dg_age, c(0,45,55,65,75,Inf), right = FALSE) st <- survtab(fot ~ adjust(agegr), data = x, weights = "internal", pophaz = pm) plot(st, y = "r.e2.as") lines(y = pmc$est, x = pmc$Tstop, col="red")
A family object for GLM fitting of relative Poisson models
RPLRPL
A list very similar to that created by poisson().
Karri Seppa
Other relpois functions:
relpois(),
relpois_ag(),
rpcurve()
aggre attributes to an object by modifying in placeCoerces an R object to an aggre object, identifying
the object as one containing aggregated counts, person-years and other
information. setaggre modifies in place without taking any copies.
Retains all other attributes.
setaggre(x, values = NULL, by = NULL, breaks = NULL)setaggre(x, values = NULL, by = NULL, breaks = NULL)
x |
a |
values |
a character string vector; the names of value variables |
by |
a character string vector; the names of variables by which
|
breaks |
a list of breaks, where each element is a breaks vector
as usually passed to e.g. |
setaggre sets x to the aggre class in place
without taking a copy as e.g. as.data.frame.XXX functions do; see e.g.
[data.table::setDT].
Returns x invisibly after setting attributes to it without taking a copy.
This function is called for its side effects.
Joonas Miettinen
Other aggregation functions:
aggre(),
as.aggre(),
lexpand(),
summary.aggre()
df <- data.frame(sex = rep(c("male", "female"), each = 5), obs = rpois(10, rep(7,5, each=5)), pyrs = rpois(10, lambda = 10000)) ## without any breaks setaggre(df, values = c("obs", "pyrs"), by = "sex") df <- data.frame(df) df$FUT <- 0:4 ## with breaks list setaggre(df, values = c("obs", "pyrs"), by = "sex", breaks = list(FUT = 0:5))df <- data.frame(sex = rep(c("male", "female"), each = 5), obs = rpois(10, rep(7,5, each=5)), pyrs = rpois(10, lambda = 10000)) ## without any breaks setaggre(df, values = c("obs", "pyrs"), by = "sex") df <- data.frame(df) df$FUT <- 0:4 ## with breaks list setaggre(df, values = c("obs", "pyrs"), by = "sex", breaks = list(FUT = 0:5))
setattr(obj, "class", CLASS)); can add instead of replaceSets the class of an object in place to cl
by replacing or adding
setclass(obj, cl, add = FALSE, add.place = "first")setclass(obj, cl, add = FALSE, add.place = "first")
obj |
and object for which to set class |
cl |
class to set |
add |
if |
add.place |
|
Joonas Miettinen
data.table columns if thereDeletes columns in a data.table conveniently.
May only delete columns that are found silently. Sometimes useful in e.g.
on.exit expressions.
setcolsnull( DT = NULL, delete = NULL, keep = NULL, colorder = FALSE, soft = TRUE )setcolsnull( DT = NULL, delete = NULL, keep = NULL, colorder = FALSE, soft = TRUE )
DT |
a |
delete |
a character vector of column names to be deleted |
keep |
a character vector of column names to keep;
the rest will be removed; |
colorder |
logical; if |
soft |
logical; if |
Always returns NULL invisibly.
This function is called for its side effects.
Joonas Miettinen
sibr is a simulated cohort pertaining female Finnish breast cancer patients
diagnosed between 1993-2012. Instead of actual original dates, the dates are masked
via modest randomization within several time windows. The dataset is additionally
a random sample of 10 000 cases from the pertaining time window.
data.table with columns
sex - gender of the patient (1 = female)
bi_date - date of birth
dg_date - date of cancer diagnosis
ex_date - date of exit from follow-up (death or censoring)
status - status of the person at exit; 0 alive; 1 dead due to pertinent cancer; 2 dead due to other causes
dg_age - age at diagnosis expressed as fractional years
The closing date for the pertinent data was 2012-12-31, meaning status information was
available only up to that point — hence the maximum possible ex_date is 2012-12-31.
Karri Seppa
The Finnish Cancer Registry
Other popEpi data:
ICSS,
meanpop_fi,
popmort,
sire,
stdpop101,
stdpop18
Other survival data:
sire
Poisson modelled standardised incidence or mortality ratios (SIRs / SMRs) i.e. indirect method for calculating standardised rates. SIR is a ratio of observed and expected cases. Expected cases are derived by multiplying the strata-specific population rate with the corresponding person-years of the cohort.
sir( coh.data, coh.obs, coh.pyrs, ref.data = NULL, ref.obs = NULL, ref.pyrs = NULL, ref.rate = NULL, subset = NULL, print = NULL, adjust = NULL, mstate = NULL, test.type = "homogeneity", conf.type = "profile", conf.level = 0.95, EAR = FALSE )sir( coh.data, coh.obs, coh.pyrs, ref.data = NULL, ref.obs = NULL, ref.pyrs = NULL, ref.rate = NULL, subset = NULL, print = NULL, adjust = NULL, mstate = NULL, test.type = "homogeneity", conf.type = "profile", conf.level = 0.95, EAR = FALSE )
coh.data |
aggregated cohort data, see e.g. |
coh.obs |
variable name for observed cases; quoted or unquoted. A vector when using |
coh.pyrs |
variable name for person years in cohort data;
quoted (as a string |
ref.data |
population data. Can be left NULL if |
ref.obs |
variable name for observed cases; quoted or unquoted |
ref.pyrs |
variable name for person-years in population data; quoted or unquoted |
ref.rate |
population rate variable (cases/person-years). Overwrites
arguments |
subset |
logical condition to select data from |
print |
variable names to stratify results; quoted vector or unquoted named list with functions |
adjust |
variable names for adjusting without stratifying output; quoted vector or unquoted list |
mstate |
set column names for cause specific observations; quoted or unquoted. Relevant only
when |
test.type |
Test for equal SIRs. Test available are 'homogeneity' and 'trend'. |
conf.type |
Confidence interval type: 'profile'(=default), 'wald' or 'univariate'. |
conf.level |
Level of type-I error in confidence intervals, default 0.05 is 95% CI. |
EAR |
logical; TRUE calculates Excess Absolute Risks for univariate SIRs. (see details) |
sir is a comprehensive tool for modelling SIRs/SMRs with flexible
options to adjust and print SIRs, test homogeneity and utilize
multi-state data. The cohort data and the variable names for observation
counts and person-years are required.
The reference data is optional, since the cohort data
can be stratified (print) and compared to total.
Adjust and print
A SIR can be adjusted or standardised using the covariates found in both coh.data and ref.data.
Variable to adjust are given in adjust.
Variable names needs to match in both coh.data and ref.data.
Typical variables to adjust by are gender, age group and calendar period.
print is used to stratify the SIR output. In other words, the variables
assigned to print are the covariates of the Poisson model.
Variable levels are treated as categorical.
Variables can be assigned in both print and adjust.
This means the output it adjusted and printed by these variables.
print can also be a list of expressions. This enables changing variable
names or transforming variables with functions such as cut and round.
For example, agegroup can be transformed on-the-go with
print = list(my_ag = cut(agegroup, my_ag_breaks))
ref.rate or ref.obs & ref.pyrs
The population rate variable can be given to the ref.rate parameter.
That is, when using e.g. the popmort or a comparable data file, one may
supply ref.rate instead of ref.obs and ref.pyrs, which
will be ignored if ref.rate is supplied.
Note that if all the stratifying variables in
ref.data are not listed in adjust,
or when the categories are otherwise combined,
the (unweighted) mean of rates is used for computing expected cases.
This might incur a small bias in comparison to when exact numbers of observations
and person-years are available.
mstate
E.g. using lexpand it's possible to compute counts for several outcomes
so that the population at risk is same for each
outcome such as a certain kind of cancer.
The transition counts are in wide data format,
and the relevant columns can be supplied to sir
in a vector via the coh.obs argument.
The name of the corresponding new column in ref.data is given in
mstate. It's recommended to include the mstate variable in adjust,
so the corresponding information should also be available in ref.data.
More examples in sir-vignette.
This approach is analogous to where SIRs are calculated separately their own function calls.
Other parameters
univariate confidence intervals are calculated using exact
Poisson intervals (poisson.ci). The options profile and wald are
is based on a Poisson regression model: profile-likelihood confidence intervals
or Wald's normal-approximation. P-value is Poisson model based conf.type
or calculated using the method described by Breslow and Day. Function automatically
switches to another conf.type if calculation is not possible with a message.
Usually model fit fails if there is print stratum with zero expected values.
The LRT p-value tests the levels of print. The test can be either
"homogeneity", a likelihood ratio test where the model variables defined in
print (factor) is compared to the constant model.
Option "trend" tests if the linear trend of the continuous variable in
print is significant (using model comparison).
EAR: Excess Absolute Risk
Excess Absolute Risk is a simple way to quantify the absolute difference between cohort risk and population risk. Make sure that the person-years are calculated accordingly before using EAR. (when using mstate)
Formula for EAR:
Data format
The data should be given in tabulated format. That is the number of observations and person-years are represented for each stratum. Note that also individual data is allowed as long as each observations, person-years, and print and adjust variables are presented in columns. The extra variables and levels are reduced automatically before estimating SIRs. Example of data format:
| sex | age | period | obs | pyrs |
| 0 | 1 | 2010 | 0 | 390 |
| 0 | 2 | 2010 | 5 | 385 |
| 1 | 1 | 2010 | 3 | 308 |
| 1 | 2 | 2010 | 12 | 315 |
A sir-object that is a data.table with meta information in the attributes.
Matti Rantanen, Joonas Miettinen
[lexpand]
A SIR calculation vignette
Other sir functions:
lines.sirspline(),
plot.sirspline(),
sir_exp(),
sir_ratio(),
sirspline()
Other main functions:
Surv(),
rate(),
relpois(),
relpois_ag(),
sirspline(),
survmean(),
survtab(),
survtab_ag()
data(popmort) data(sire) c <- lexpand( sire, status = status, birth = bi_date, exit = ex_date, entry = dg_date, breaks = list(per = 1950:2013, age = 1:100, fot = c(0,10,20,Inf)), aggre = list(fot, agegroup = age, year = per, sex) ) ## SMR due other causes: status = 2 se <- sir( coh.data = c, coh.obs = 'from0to2', coh.pyrs = 'pyrs', ref.data = popmort, ref.rate = 'haz', adjust = c('agegroup', 'year', 'sex'), print = 'fot') se ## for examples see: vignette('sir')data(popmort) data(sire) c <- lexpand( sire, status = status, birth = bi_date, exit = ex_date, entry = dg_date, breaks = list(per = 1950:2013, age = 1:100, fot = c(0,10,20,Inf)), aggre = list(fot, agegroup = age, year = per, sex) ) ## SMR due other causes: status = 2 se <- sir( coh.data = c, coh.obs = 'from0to2', coh.pyrs = 'pyrs', ref.data = popmort, ref.rate = 'haz', adjust = c('agegroup', 'year', 'sex'), print = 'fot') se ## for examples see: vignette('sir')
Calculate Standardized Mortality Ratios (SMRs) using a single data set that includes observed and expected cases and additionally person-years.
sir_lex solves SMR from an [Epi::Lexis] object
calculated with [lexpand].
sir_ag solves SMR from a [aggre] object
calculated using [lexpand].
sir_exp( x, obs, exp, pyrs = NULL, print = NULL, conf.type = "profile", test.type = "homogeneity", conf.level = 0.95, subset = NULL ) sir_lex(x, print = NULL, breaks = NULL, ...) sir_ag( x, obs = "from0to1", print = attr(x, "aggre.meta")$by, exp = "d.exp", pyrs = "pyrs", ... )sir_exp( x, obs, exp, pyrs = NULL, print = NULL, conf.type = "profile", test.type = "homogeneity", conf.level = 0.95, subset = NULL ) sir_lex(x, print = NULL, breaks = NULL, ...) sir_ag( x, obs = "from0to1", print = attr(x, "aggre.meta")$by, exp = "d.exp", pyrs = "pyrs", ... )
x |
Data set e.g. |
obs |
Variable name of the observed cases in the data set |
exp |
Variable name or expression for expected cases |
pyrs |
Variable name for person-years (optional) |
print |
Variables or expression to stratify the results |
conf.type |
select confidence interval type: (default=) |
test.type |
Test for equal SIRs. Test available are 'homogeneity' and 'trend' |
conf.level |
Level of type-I error in confidence intervals, default 0.05 is 95% CI |
subset |
a logical vector for subsetting data |
breaks |
a named list to split age group (age), period (per) or follow-up (fot). |
... |
pass arguments to |
These functions are intended to calculate SMRs from a single data set
that includes both observed and expected number of cases. For example utilizing the
argument pop.haz of the [lexpand].
sir_lex automatically exports the transition fromXtoY using the first
state in lex.Str as 0 and all other as 1. No missing values
is allowed in observed, pop.haz or person-years.
A sir object
Matti Rantanen
[lexpand]
A SIR calculation vignette
Other sir functions:
lines.sirspline(),
plot.sirspline(),
sir(),
sir_ratio(),
sirspline()
BL <- list(fot = 0:5, per = c("2003-01-01","2008-01-01", "2013-01-01")) ## Aggregated data x1 <- lexpand(sire, breaks = BL, status = status != 0, birth = bi_date, entry = dg_date, exit = ex_date, pophaz=popmort, aggre=list(sex, period = per, surv.int = fot)) sir_ag(x1, print = 'period') # no aggreate or breaks x2 <- lexpand(sire, status = status != 0, birth = bi_date, entry = dg_date, exit = ex_date, pophaz=popmort) sir_lex(x2, breaks = BL, print = 'per')BL <- list(fot = 0:5, per = c("2003-01-01","2008-01-01", "2013-01-01")) ## Aggregated data x1 <- lexpand(sire, breaks = BL, status = status != 0, birth = bi_date, entry = dg_date, exit = ex_date, pophaz=popmort, aggre=list(sex, period = per, surv.int = fot)) sir_ag(x1, print = 'period') # no aggreate or breaks x2 <- lexpand(sire, status = status != 0, birth = bi_date, entry = dg_date, exit = ex_date, pophaz=popmort) sir_lex(x2, breaks = BL, print = 'per')
Calculate ratio of two SIRs/SMRs and the confidence intervals of the ratio.
sir_ratio( x, y, digits = 3, alternative = "two.sided", conf.level = 0.95, type = "exact" )sir_ratio( x, y, digits = 3, alternative = "two.sided", conf.level = 0.95, type = "exact" )
x |
a sir-object or a vector of two; observed and expected cases. |
y |
a sir-object or a vector of two; observed and expected cases. |
digits |
number of digits in the output |
alternative |
The null-hypothesis test: (default:) |
conf.level |
the type-I error in confidence intervals, default 0.95 for 95% CI. |
type |
How the binomial confidence intervals are calculated (default:) |
Function works with pooled sir-objects i.e. the print argument in sir is ignored.
Also x and y can be a vector of two where first index is the
observed cases and second is expected cases (see examples).
Note that the ratio of two SIR's is only applicable when the age distributions are similar
in both populations.
Formula
The observed number of first sir O1 is considered as a Binomial variable with sample
size of O1+O2. The confidence intervals for Binomial proportion A
is solved using exact or asymptotic
method. Now the CI for ratio O1/O2 is B = A/(1 - A). And further the CI for SIR/SMR
is B*E2/E1. (Ederer and Mantel)
A vector length of three: sir_ratio, and lower and upper confidence intervals.
Parameter alternative is always two.sided when parameter
type is set to asymptotic.
Matti Rantanen
Statistics with Confidence: Confidence Intervals and Statistical Guidelines, Douglas Altman, 2000. ISBN: 978-0-727-91375-3
[sir]
A SIR calculation vignette
Other sir functions:
lines.sirspline(),
plot.sirspline(),
sir(),
sir_exp(),
sirspline()
## Ratio for sir-object and the same values given manually: ## create example dataset dt1 <- data.frame(obs = rep(c(5,7), 10), pyrs = rep(c(250,300,350,400), 5), var = 1:20) Ref <- data.frame(obs = rep(c(50,70,80,100), 5), pyrs = rep(c(2500,3000,3500,4000), 5), var = 1:20) ## sir using the function s1 <- sir(coh.data = dt1, coh.obs = obs, coh.pyrs = pyrs, ref.data = Ref, ref.obs = obs, ref.pyrs = pyrs, adjust = var) ## Ratio is simply 1: sir_ratio(s1, c(120, 150))## Ratio for sir-object and the same values given manually: ## create example dataset dt1 <- data.frame(obs = rep(c(5,7), 10), pyrs = rep(c(250,300,350,400), 5), var = 1:20) Ref <- data.frame(obs = rep(c(50,70,80,100), 5), pyrs = rep(c(2500,3000,3500,4000), 5), var = 1:20) ## sir using the function s1 <- sir(coh.data = dt1, coh.obs = obs, coh.pyrs = pyrs, ref.data = Ref, ref.obs = obs, ref.pyrs = pyrs, adjust = var) ## Ratio is simply 1: sir_ratio(s1, c(120, 150))
sire is a simulated cohort pertaining female Finnish rectal cancer patients
diagnosed between 1993-2012. Instead of actual original dates, the dates are masked
via modest randomization within several time windows.
data.table with columns
sex - gender of the patient (1 = female)
bi_date - date of birth
dg_date - date of cancer diagnosis
ex_date - date of exit from follow-up (death or censoring)
status - status of the person at exit; 0 alive; 1 dead due to pertinent cancer; 2 dead due to other causes
dg_age - age at diagnosis expressed as fractional years
The closing date for the pertinent data was 2012-12-31, meaning status information was
available only up to that point — hence the maximum possible ex_date is 2012-12-31.
Karri Seppa
The Finnish Cancer Registry
Other popEpi data:
ICSS,
meanpop_fi,
popmort,
sibr,
stdpop101,
stdpop18
Other survival data:
sibr
Splines for standardised incidence or mortality ratio. A useful tool to e.g. check whether a constant SIR can be assumed for all calendar periods, age groups or follow-up intervals. Splines can be fitted for these time dimensions separately or in the same model.
sirspline( coh.data, coh.obs, coh.pyrs, ref.data = NULL, ref.obs = NULL, ref.pyrs = NULL, ref.rate = NULL, subset = NULL, print = NULL, adjust = NULL, mstate = NULL, spline, knots = NULL, reference.points = NULL, dependent.splines = TRUE )sirspline( coh.data, coh.obs, coh.pyrs, ref.data = NULL, ref.obs = NULL, ref.pyrs = NULL, ref.rate = NULL, subset = NULL, print = NULL, adjust = NULL, mstate = NULL, spline, knots = NULL, reference.points = NULL, dependent.splines = TRUE )
coh.data |
cohort data with observations and at risk time variables |
coh.obs |
variable name for observed cases |
coh.pyrs |
variable name for person-years in cohort data |
ref.data |
aggregated population data |
ref.obs |
variable name for observed cases |
ref.pyrs |
variable name for person-years in population data |
ref.rate |
population rate observed/expected. This overwrites the parameters
|
subset |
logical condition to subset |
print |
variable names for which to estimate SIRs/SMRs and associated splines separately |
adjust |
variable names for adjusting the expected cases |
mstate |
set column names for cause specific observations. Relevant only
when coh.obs length is two or more. See help for |
spline |
variable name(s) for the splines |
knots |
number knots (vector), pre-defined knots (list of vectors) or for optimal number of knots left NULL |
reference.points |
fixed reference values for rate ratios. If left |
dependent.splines |
logical; if TRUE, all splines are fitted in same model. |
See [sir] for help on SIR/SMR estimation in general; usage of splines
is discussed below.
The spline variables
The model can include one, two or three splines variables.
Variables can be included in the same model selecting dependent.splines = TRUE
and SIR ratios are calculated (first one is the SIR, others SIR ratios).
Reference points vector can be set via reference.points
where first element of the vector is the reference point for first ratio.
Variable(s) to fit splines are given as a vector in argument spline.
Order will affect the results.
dependent.splines
By default dependent.splines is FALSE and all splines are fitted in separate models.
If TRUE, the first variable in spline is a function of a SIR and other(s) are ratios.
knots
There are three options to set knots to splines:
Set the number of knots for each spline variable with a vector. The knots are automatically placed to the quantiles of observed cases in cohort data. The first and last knots are always the maximum and minimum values, so knot value needs to be at least two.
Predefined knot places can be set with a list of vectors. The vector for each spline in the list specifies the knot places. The lowest and the largest values are the boundary knots and these should be checked beforehand.
If knots is left NULL, the model searches the optimal number
of knots by model AIC by fitting models iteratively from 2 to 15 knots and
the one with smallest AIC is selected.
If dependent.splines = TRUE, the number of knots is searched by fitting each spline
variable separately.
Splines can be stratified by the levels of variable given in print. If
print is a vector, only the first variable is accounted for. The knots
are placed globally for all levels of print. This also ensures that the likelihood
ratio test is valid.
Splines are also fitted independently for each level of print.
This allows for searching interactions, e.g. by fitting spline for period
(splines='period') for each age group (print = 'agegroup').
p-values
The output p-value is a test of whether the splines are equal (homogenous)
at different levels of print.
The test is based on the likelihood ratio test, where the full model
includes print and is
compared to a null model without it.
When (dependent.splines = TRUE) the p-value returned is a global p-value.
Otherwise the p-value is spline-specific.
A list of data.frames and vectors.
Three spline estimates are named as spline.est.A/B/C and the corresponding values
in spline.seq.A/B/C for manual plotting
Matti Rantanen, Joonas Miettinen
[splitMulti]
A SIR calculation vignette
Other sir functions:
lines.sirspline(),
plot.sirspline(),
sir(),
sir_exp(),
sir_ratio()
Other main functions:
Surv(),
rate(),
relpois(),
relpois_ag(),
sir(),
survmean(),
survtab(),
survtab_ag()
## for examples see: vignette('sir')## for examples see: vignette('sir')
Split a Lexis object along one time scale
(as [Epi::splitLexis]) with speed
splitLexisDT(lex, breaks, timeScale, merge = TRUE, drop = TRUE)splitLexisDT(lex, breaks, timeScale, merge = TRUE, drop = TRUE)
lex |
a Lexis object, split or not |
breaks |
a vector of |
timeScale |
a character string; name of the time scale to split by |
merge |
logical; if |
drop |
logical; if |
splitLexisDT is in essence a data.table version of
splitLexis or survSplit for splitting along a single
time scale. It requires a Lexis object as input, which may have already
been split along some time scale.
Unlike splitLexis, splitLexisDT drops observed time outside
the roof and floor of breaks by default - with drop = FALSE
the functions have identical behaviour.
The Lexis time scale variables can be of any arbitrary
format, e.g. Date,
fractional years (see [Epi::cal.yr]) and [get.yrs].
A data.table or data.frame
(depending on options("popEpi.datatable"); see ?popEpi)
object expanded to accommodate split observations.
Joonas Miettinen
Other splitting functions:
lexpand(),
splitMulti()
library(Epi) data("sire", package = "popEpi") x <- Lexis(data=sire[1000:1100, ], entry = list(fot=0, per=get.yrs(dg_date), age=dg_age), exit=list(per=get.yrs(ex_date)), exit.status=status) BL <- list(fot=seq(0, 5, by = 3/12), per=c(2008, 2013)) x2 <- splitMulti(x, breaks = BL, drop = FALSE) x3 <- splitLexisDT(x, breaks = BL$fot, timeScale = "fot", drop = FALSE) x3 <- splitLexisDT(x3, breaks = BL$per, timeScale = "per", drop = FALSE) x4 <- splitLexis(x, breaks = BL$fot, time.scale = "fot") x4 <- splitLexis(x4, breaks = BL$per, time.scale = "per") ## all produce identical results ## using Date variables x <- Lexis( data=sire[1000:1100, ], entry = list(fot = 0L, per = dg_date, age = as.integer(dg_date - bi_date)), duration = as.integer(ex_date - dg_date), entry.status = 0L, exit.status = status ) BL <- list(fot = 0:5*365.25, per = as.Date(c("2008-01-01", "2013-01-01"))) x2 <- splitMulti(x, breaks = BL, drop = FALSE) x3 <- splitLexisDT(x, breaks = BL$fot, timeScale = "fot", drop = FALSE) x3 <- splitLexisDT(x3, breaks = BL$per, timeScale = "per", drop = FALSE) ## splitLexis may not work when using Dateslibrary(Epi) data("sire", package = "popEpi") x <- Lexis(data=sire[1000:1100, ], entry = list(fot=0, per=get.yrs(dg_date), age=dg_age), exit=list(per=get.yrs(ex_date)), exit.status=status) BL <- list(fot=seq(0, 5, by = 3/12), per=c(2008, 2013)) x2 <- splitMulti(x, breaks = BL, drop = FALSE) x3 <- splitLexisDT(x, breaks = BL$fot, timeScale = "fot", drop = FALSE) x3 <- splitLexisDT(x3, breaks = BL$per, timeScale = "per", drop = FALSE) x4 <- splitLexis(x, breaks = BL$fot, time.scale = "fot") x4 <- splitLexis(x4, breaks = BL$per, time.scale = "per") ## all produce identical results ## using Date variables x <- Lexis( data=sire[1000:1100, ], entry = list(fot = 0L, per = dg_date, age = as.integer(dg_date - bi_date)), duration = as.integer(ex_date - dg_date), entry.status = 0L, exit.status = status ) BL <- list(fot = 0:5*365.25, per = as.Date(c("2008-01-01", "2013-01-01"))) x2 <- splitMulti(x, breaks = BL, drop = FALSE) x3 <- splitLexisDT(x, breaks = BL$fot, timeScale = "fot", drop = FALSE) x3 <- splitLexisDT(x3, breaks = BL$per, timeScale = "per", drop = FALSE) ## splitLexis may not work when using Dates
Split a Lexis object along multiple time scales
with speed and ease
splitMulti( data, breaks = NULL, ..., drop = TRUE, merge = TRUE, verbose = FALSE )splitMulti( data, breaks = NULL, ..., drop = TRUE, merge = TRUE, verbose = FALSE )
data |
a Lexis object with event cases as rows |
breaks |
a list of named numeric vectors of breaks; see Details and Examples |
... |
alternate way of supplying breaks as named vectors;
e.g. |
drop |
logical; if |
merge |
logical; if |
verbose |
logical; if |
splitMulti is in essence a data.table version of
splitLexis or survSplit for splitting along multiple
time scales.
It requires a Lexis object as input.
The breaks must be a list of named vectors of the appropriate type.
The breaks are fully explicit and
left-inclusive and right exclusive, e.g. fot=c(0,5)
forces the data to only include time between
[0,5) for each original row (unless drop = FALSE).
Use Inf or -Inf for open-ended intervals,
e.g. per=c(1990,1995,Inf) creates the intervals
[1990,1995), [1995, Inf).
Instead of specifying breaks, one may make use of the ...
argument to pass breaks: e.g.
splitMulti(x, breaks = list(fot = 0:5))
is equivalent to
splitMulti(x, fot = 0:5).
Multiple breaks can be supplied in the same manner. However, if both
breaks and ... are used, only the breaks in breaks
are utilized within the function.
The Lexis time scale variables can be of any arbitrary
format, e.g. Date,
fractional years (see [Epi::cal.yr]) and [get.yrs],
or other.
A data.table or data.frame
(depending on options("popEpi.datatable"); see ?popEpi)
object expanded to accommodate split observations.
Fixed popEpi::splitMulti when merge = FALSE. Did not include lex.dur
previously which caused an error.
Joonas Miettinen
[Epi::splitLexis], [Epi::Lexis],
[survival::survSplit]
Other splitting functions:
lexpand(),
splitLexisDT()
#### let's prepare data for computing period method survivals #### in case there are problems with dates, we first #### convert to fractional years. library("Epi") library("data.table") data("sire", package = "popEpi") x <- Lexis(data=sire[dg_date < ex_date, ], entry = list(fot=0, per=get.yrs(dg_date), age=dg_age), exit=list(per=get.yrs(ex_date)), exit.status=status) x2 <- splitMulti(x, breaks = list(fot=seq(0, 5, by = 3/12), per=c(2008, 2013))) # equivalently: x2 <- splitMulti(x, fot=seq(0, 5, by = 3/12), per=c(2008, 2013)) ## using dates; note: breaks must be expressed as dates or days! x <- Lexis( data = sire[dg_date < ex_date, ], entry = list(fot = 0L, per = dg_date, age = as.integer(dg_date - bi_date)), duration = as.integer(ex_date - dg_date), entry.status = 0L, exit.status = status ) BL <- list(fot = seq(0, 5, by = 3/12)*365.242199, per = as.Date(paste0(c(1980:2014),"-01-01")), age = c(0,45,85,Inf)*365.242199) x2 <- splitMulti(x, breaks = BL, verbose=TRUE) ## multistate example (healty - sick - dead) sire2 <- data.frame(sire) sire2 <- sire2[sire2$dg_date < sire2$ex_date, ] set.seed(1L) not_sick <- sample.int(nrow(sire2), 6000L, replace = FALSE) sire2$dg_date[not_sick] <- NA sire2$status[!is.na(sire2$dg_date) & sire2$status == 0] <- -1 sire2$status[sire2$status==2] <- 1 sire2$status <- factor(sire2$status, levels = c(0, -1, 1), labels = c("healthy", "sick", "dead")) xm <- Lexis(data = sire2, entry = list(fot=0, per=get.yrs(bi_date), age=0), exit = list(per=get.yrs(ex_date)), exit.status=status) xm2 <- cutLexis(xm, cut = get.yrs(xm$dg_date), timescale = "per", new.state = "sick") xm2[xm2$lex.id == 6L, ] xm2 <- splitMulti(xm2, breaks = list(fot = seq(0,150,25))) xm2[xm2$lex.id == 6L, ]#### let's prepare data for computing period method survivals #### in case there are problems with dates, we first #### convert to fractional years. library("Epi") library("data.table") data("sire", package = "popEpi") x <- Lexis(data=sire[dg_date < ex_date, ], entry = list(fot=0, per=get.yrs(dg_date), age=dg_age), exit=list(per=get.yrs(ex_date)), exit.status=status) x2 <- splitMulti(x, breaks = list(fot=seq(0, 5, by = 3/12), per=c(2008, 2013))) # equivalently: x2 <- splitMulti(x, fot=seq(0, 5, by = 3/12), per=c(2008, 2013)) ## using dates; note: breaks must be expressed as dates or days! x <- Lexis( data = sire[dg_date < ex_date, ], entry = list(fot = 0L, per = dg_date, age = as.integer(dg_date - bi_date)), duration = as.integer(ex_date - dg_date), entry.status = 0L, exit.status = status ) BL <- list(fot = seq(0, 5, by = 3/12)*365.242199, per = as.Date(paste0(c(1980:2014),"-01-01")), age = c(0,45,85,Inf)*365.242199) x2 <- splitMulti(x, breaks = BL, verbose=TRUE) ## multistate example (healty - sick - dead) sire2 <- data.frame(sire) sire2 <- sire2[sire2$dg_date < sire2$ex_date, ] set.seed(1L) not_sick <- sample.int(nrow(sire2), 6000L, replace = FALSE) sire2$dg_date[not_sick] <- NA sire2$status[!is.na(sire2$dg_date) & sire2$status == 0] <- -1 sire2$status[sire2$status==2] <- 1 sire2$status <- factor(sire2$status, levels = c(0, -1, 1), labels = c("healthy", "sick", "dead")) xm <- Lexis(data = sire2, entry = list(fot=0, per=get.yrs(bi_date), age=0), exit = list(per=get.yrs(ex_date)), exit.status=status) xm2 <- cutLexis(xm, cut = get.yrs(xm$dg_date), timescale = "per", new.state = "sick") xm2[xm2$lex.id == 6L, ] xm2 <- splitMulti(xm2, breaks = list(fot = seq(0,150,25))) xm2[xm2$lex.id == 6L, ]
World standard population by 1 year age groups from 1 to 101. Sums to 100 000.
data.table with columns
world_std weight that sums to 100000 (numeric)
agegroup age group from 1 to 101 (numeric)
Standard population is from: world standard population "101of1"
Other popEpi data:
ICSS,
meanpop_fi,
popmort,
sibr,
sire,
stdpop18
Other weights:
ICSS,
direct_standardization,
stdpop18
World, European, and Nordic standard populations by 18 age categories. Sums to 100000.
data.table with columns
agegroup, age group in 18 categories (character)
world, World 2000 standard population (numeric)
europe, European standard population (numeric)
nordic, Nordic standard population (numeric)
Nordcan, 2000
Other popEpi data:
ICSS,
meanpop_fi,
popmort,
sibr,
sire,
stdpop101
Other weights:
ICSS,
direct_standardization,
stdpop101
aggre Objectsummary method function for aggre objects; see
[as.aggre] and [aggre].
## S3 method for class 'aggre' summary(object, by = NULL, subset = NULL, ...)## S3 method for class 'aggre' summary(object, by = NULL, subset = NULL, ...)
object |
an |
by |
list of columns to summarize by - e.g. |
subset |
a logical condition to subset results table by
before summarizing; use this to limit to a certain stratum. E.g.
|
... |
unused |
Returns a data.table — a further aggregated version of object.
Joonas Miettinen
Other aggregation functions:
aggre(),
as.aggre(),
lexpand(),
setaggre()
Summary method function for survtab objects; see
[survtab_ag]. Returns estimates at given time points
or all time points if t and q are both NULL.
## S3 method for class 'survtab' summary(object, t = NULL, subset = NULL, q = NULL, ...)## S3 method for class 'survtab' summary(object, t = NULL, subset = NULL, q = NULL, ...)
object |
a |
t |
a vector of times at which time points (actually intervals that
contain t) to print summary table of survival function estimates by strata;
values not existing in any interval cause rows containing only |
subset |
a logical condition to subset results table by
before printing; use this to limit to a certain stratum. E.g.
|
q |
a named |
... |
unused; required for congruence with other |
Note that this function returns the intervals and NOT the time points corresponding to quantiles / estimates corresponding to time points. If you want precise estimates at time points that are not interval breaks, add the time points as breaks and re-estimate the survival time function. In interval-based estimation, the estimates denote e.g. probability of dying during the interval, so time points within the intervals are not usually considered at all. See e.g. Seppa, Dyba, and Hakulinen (2015).
A data.table: a slice from object based on t, subset, and q.
Joonas Miettinen
Seppa K., Dyba T. and Hakulinen T.: Cancer Survival, Reference Module in Biomedical Sciences. Elsevier. 08-Jan-2015. doi:10.1016/B978-0-12-801238-3.02745-8
Other survtab functions:
Surv(),
lines.survtab(),
plot.survtab(),
print.survtab(),
survtab(),
survtab_ag()
library(Epi) ## NOTE: recommended to use factor status variable x <- Lexis(entry = list(FUT = 0, AGE = dg_age, CAL = get.yrs(dg_date)), exit = list(CAL = get.yrs(ex_date)), data = sire[sire$dg_date < sire$ex_date, ], exit.status = factor(status, levels = 0:2, labels = c("alive", "canD", "othD")), merge = TRUE) ## pretend some are male set.seed(1L) x$sex <- rbinom(nrow(x), 1, 0.5) ## observed survival st <- survtab(Surv(time = FUT, event = lex.Xst) ~ sex, data = x, surv.type = "cif.obs", breaks = list(FUT = seq(0, 5, 1/12))) ## estimates at full years of follow-up summary(st, t = 1:5) ## interval estimate closest to 75th percentile, i.e. ## first interval where surv.obs < 0.75 at end ## (just switch 0.75 to 0.5 for median survival, etc.) summary(st, q = list(surv.obs = 0.75)) ## multiple quantiles summary(st, q = list(surv.obs = c(0.75, 0.90), CIF_canD = 0.20)) ## if you want all estimates in a new data.frame, you can also simply do x <- as.data.frame(st)library(Epi) ## NOTE: recommended to use factor status variable x <- Lexis(entry = list(FUT = 0, AGE = dg_age, CAL = get.yrs(dg_date)), exit = list(CAL = get.yrs(ex_date)), data = sire[sire$dg_date < sire$ex_date, ], exit.status = factor(status, levels = 0:2, labels = c("alive", "canD", "othD")), merge = TRUE) ## pretend some are male set.seed(1L) x$sex <- rbinom(nrow(x), 1, 0.5) ## observed survival st <- survtab(Surv(time = FUT, event = lex.Xst) ~ sex, data = x, surv.type = "cif.obs", breaks = list(FUT = seq(0, 5, 1/12))) ## estimates at full years of follow-up summary(st, t = 1:5) ## interval estimate closest to 75th percentile, i.e. ## first interval where surv.obs < 0.75 at end ## (just switch 0.75 to 0.5 for median survival, etc.) summary(st, q = list(surv.obs = 0.75)) ## multiple quantiles summary(st, q = list(surv.obs = c(0.75, 0.90), CIF_canD = 0.20)) ## if you want all estimates in a new data.frame, you can also simply do x <- as.data.frame(st)
popEpi::Surv simply calls [survival::Surv]. This wrapper was written
simply to avoid doing e.g. library(survival) when Surv is used in a
formula evaluated by popEpi.
Surv( time, time2, event, type = c("right", "left", "interval", "counting", "interval2"), origin = 0 )Surv( time, time2, event, type = c("right", "left", "interval", "counting", "interval2"), origin = 0 )
time |
See |
time2 |
See |
event |
See |
type |
See |
origin |
See |
See [survival::Surv].
Other main functions:
rate(),
relpois(),
relpois_ag(),
sir(),
sirspline(),
survmean(),
survtab(),
survtab_ag()
Other survtab functions:
lines.survtab(),
plot.survtab(),
print.survtab(),
summary.survtab(),
survtab(),
survtab_ag()
Other survmean functions:
lines.survmean(),
plot.survmean(),
survmean()
Functions used for estimation of various survival time statistics. E.g. relative survival.
surv_estimate( dt, ts_fut_col_nm, stratum_col_nms = NULL, value_col_nms = NULL, estimators = "S_ch", weight_dt = NULL, conf_methods = "log", conf_lvls = 0.95 ) surv_collapse_1d( dt, ts_fut_col_nm, stratum_col_nms = NULL, value_col_nms = NULL, test_expr = NULL, mandatory_breaks = NULL ) surv_collapse_strata_list( dt, stratum_col_nms, collapse_stratum_col_nms, value_col_nms = NULL, test_expr = NULL ) surv_collapse_strata_1d( dt, stratum_col_nms, collapse_stratum_col_nm, value_col_nms = NULL, test_expr = quote(min(t_at_risk) > 0) ) surv_brenner_weight_dt( standard_weight_dt, observed_weight_dt = NULL, collapse_stratum_col_nms = NULL ) surv_individual_weights( df, standard_weight_dt, observed_weight_dt = NULL, collapse_stratum_col_nms = NULL ) lexis_breaks_collapse_1d( lexis, breaks_1d, test_expr = NULL, mandatory_breaks = NULL ) surv_lexis( lexis, breaks, merge_dt = NULL, merge_dt_by = NULL, aggre_by = NULL, subset = NULL, split_merge_aggregate_optional_args = NULL, estimators = "S_ch", conf_methods = "log", conf_lvls = 0.95, weights = NULL )surv_estimate( dt, ts_fut_col_nm, stratum_col_nms = NULL, value_col_nms = NULL, estimators = "S_ch", weight_dt = NULL, conf_methods = "log", conf_lvls = 0.95 ) surv_collapse_1d( dt, ts_fut_col_nm, stratum_col_nms = NULL, value_col_nms = NULL, test_expr = NULL, mandatory_breaks = NULL ) surv_collapse_strata_list( dt, stratum_col_nms, collapse_stratum_col_nms, value_col_nms = NULL, test_expr = NULL ) surv_collapse_strata_1d( dt, stratum_col_nms, collapse_stratum_col_nm, value_col_nms = NULL, test_expr = quote(min(t_at_risk) > 0) ) surv_brenner_weight_dt( standard_weight_dt, observed_weight_dt = NULL, collapse_stratum_col_nms = NULL ) surv_individual_weights( df, standard_weight_dt, observed_weight_dt = NULL, collapse_stratum_col_nms = NULL ) lexis_breaks_collapse_1d( lexis, breaks_1d, test_expr = NULL, mandatory_breaks = NULL ) surv_lexis( lexis, breaks, merge_dt = NULL, merge_dt_by = NULL, aggre_by = NULL, subset = NULL, split_merge_aggregate_optional_args = NULL, estimators = "S_ch", conf_methods = "log", conf_lvls = 0.95, weights = NULL )
dt |
Dataset containing aggregate statistics
which can be used to compute survival function estimates. Must also have
column |
ts_fut_col_nm |
Name of time scale column over which survival estimates will be computed.
E.g. |
stratum_col_nms |
Stratum column names in
|
value_col_nms |
Value column names in
|
estimators |
One or more names of estimators whose estimates will be computed into
|
weight_dt |
Weights for direct adjusting or Brenner weighting.
See Details to understand how the
A table of weights must fulfill these requirements:
|
conf_methods |
Passed one at a time to
|
conf_lvls |
Passed one at a time to
|
test_expr |
This expression is evaluated for every interval. If the test does not pass, the interval must be aggregated with one of its neighbours. |
mandatory_breaks |
These breaks cannot be aggregated over. E.g. with |
collapse_stratum_col_nms |
Optional names of stratum column names in both
|
collapse_stratum_col_nm |
As the |
standard_weight_dt |
Table of standardisation weights, e.g. ICSS weights. A table of weights must fulfill these requirements:
|
observed_weight_dt |
Table of weights in your dataset.
A table of weights must fulfill these requirements:
|
df |
Dataset containing stratum columns also found in |
lexis |
A |
breaks_1d |
List of breaks with one element. These define the initial intervals which
are subject to being combined. E.g. |
breaks |
Passed to |
merge_dt |
Passed to |
merge_dt_by |
Passed to |
aggre_by |
Passed to |
subset |
Passed to |
split_merge_aggregate_optional_args |
Optional arguments to pass to
|
weights |
Weights for adjusting estimates.
|
popEpi::surv_estimate
Returns a data.table containing all columns in dt (unless direct
adjusting was performed — then the adjusting column(s) are not present)
and additional columns of estimates, standard errors, and confidence
intervals.
popEpi::surv_individual_weights
Returns a vector of weights.
popEpi::surv_lexis
Returns a data.table as produced by surv_estimate.
popEpi::surv_estimate
Compute survival time function estimates. Performs the following steps:
Handles estimators. Elements of estimators that are character
strings cause the corresponding pre-defined formulae to be used.
E.g. "S_ch".
Pre-defined estimators with their formulae:
| Name | Explanation | Estimator | Standard Error |
h_ch |
Constant hazards estimate of... hazard. The most basic hazard estimate where the number of events is Poisson distributed. | n_events/t_at_risk |
sqrt(h_ch_est/t_at_risk) |
H_ch |
Cumulative hazard function estimate based on the above. | cumsum(delta_t * h_ch_est) |
sqrt(cumsum(delta_t^2 * h_ch_se^2)) |
h_exp_e2_ch |
Expected constant hazards estimated by the Ederer II method. | n_events_exp_e2/t_at_risk |
0 + 0 |
h_exc_e2_ch |
Excess hazard, observed - expected. Based on Ederer II expected hazard. | (n_events - n_events_exp_e2)/t_at_risk |
h_ch_se |
S_ch |
Observed survival, constant hazards estimator. We estimate the hazard first for each survival interval and derive survival from that. | exp(-cumsum(delta_t * h_ch_est)) |
S_ch_est * sqrt(cumsum((delta_t^2) * n_events/(t_at_risk^2))) |
F_ch |
The is the cumulative density / distribution function. It is simply 1 - S. Based on constant hazards. |
1 - S_ch_est |
S_ch_se |
S_lt |
Observed survival, lifetable estimator. We first estimate the (conditional) probabilities of surviving each survival interval and cumulate those. The lifetable estimator in its simplest form assumes that outcomes in an interval are binomially distributed, but we perform corrections on those who were censored etc. See the definition of n_at_risk_eff. |
cumprod(1 - n_events/n_at_risk_eff) |
S_lt_est * sqrt(cumsum(n_events/(n_at_risk_eff * (n_at_risk_eff - n_events)))) |
H_lt |
This cumulative hazard estimator is derived from S_lt. We don't know the standard error formula for this one. |
-log(S_lt_est) |
0 + NA_real_ |
h_lt |
This hazard estimator is derived from S_lt. We don't know the standard error formula for this one. |
diff(c(0, H_lt_est))/delta_t |
0 + NA_real_ |
F_lt |
This is simply 1 - S_lt. |
1 - S_lt_est |
S_lt_se |
S_exp_e2_lt |
Expected Ederer II survival, lifetable estimator. | cumprod(1 - n_events_exp_e2/n_at_risk_eff) |
0 + 0 |
S_exp_e2_ch |
Expected Ederer II survival, constant hazards estimator. | exp(-cumsum(delta_t * h_exp_e2_ch_est)) |
0 + 0 |
RS_e2_lt |
Relative survival, Ederer II, lifetable estimator. | cumprod(1 - (n_events - n_events_exp_e2)/n_at_risk_eff) |
S_lt_se/S_exp_e2_lt_est |
RS_e2_ch |
Relative survival, Ederer II, constant hazards estimator. | exp(-cumsum(delta_t * h_exc_e2_ch_est)) |
S_ch_se/S_exp_e2_ch_est |
NS_pp_lt |
Net survival, Pohar Perme, constan hazards estimator. The Pohar Perme method is the least biased method for estimating net survival when follow-up is complete but it can have larger uncertainty. Net survival is the hypothetical survival probability in a world where the event of interest is the only event possible, e.g. when estimated using a cohort of cancer patients we get the hypothetical survival probability when cancer is the only thing that kills people. Maja Pohar Perme, Janez Stare, Jacques Esteve, On Estimation in Relative Survival, Biometrics, Volume 68, Issue 1, March 2012, Pages 113-120, https://doi.org/10.1111/j.1541-0420.2011.01640.x; Karri Seppa, Timo Hakulinen, Arun Pokhrel, Choosing the net survival method for cancer survival estimation, European Journal of Cancer, Volume 51, Issue 9, 2015, Pages 1123-1129, ISSN 0959-8049, https://doi.org/10.1016/j.ejca.2013.09.019 | cumprod(1 - (n_events_pp - n_events_exp_pp)/n_at_risk_eff_pp) |
NS_pp_lt_est * sqrt(cumsum(n_events_pp_double_weighted/(n_at_risk_eff^2))) |
NS_pp_ch |
Net survival, Pohar Perme, lifetable estimator. The Pohar Perme method is the least biased method for estimating net survival when follow-up is complete but it can have larger uncertainty. Net survival is the hypothetical survival probability in a world where the event of interest is the only event possible, e.g. when estimated using a cohort of cancer patients we get the hypothetical survival probability when cancer is the only thing that kills people. Maja Pohar Perme, Janez Stare, Jacques Esteve, On Estimation in Relative Survival, Biometrics, Volume 68, Issue 1, March 2012, Pages 113-120, https://doi.org/10.1111/j.1541-0420.2011.01640.x; Karri Seppa, Timo Hakulinen, Arun Pokhrel, Choosing the net survival method for cancer survival estimation, European Journal of Cancer, Volume 51, Issue 9, 2015, Pages 1123-1129, ISSN 0959-8049, https://doi.org/10.1016/j.ejca.2013.09.019 | exp(-cumsum(delta_t * (n_events_pp - n_events_exp_pp)/t_at_risk_pp)) |
NS_pp_ch_est * sqrt(cumsum((delta_t^2) * n_events_pp_double_weighted/(t_at_risk_pp^2))) |
h_lt_[x, y] |
Cause-specific (transition-specific) hazard, life table estimator, for transitioning from x to y. We don't know the standard error formula for this one. | -log(1 - `n_events_[x, y]`/n_at_risk_eff) |
0 + NA_real_ |
h_ch_[x, y] |
Cause-specific (transition-specific) hazard, constant hazards estimator, for transitioning from x to y. Both x and y can be sets of statuses, e.g. x = 1:2. |
`n_events_[x, y]`/t_at_risk |
sqrt(`h_ch_[x, y]_est`/t_at_risk) |
H_lt_[x, y] |
Cumulative version of h_lt_[x, y]. We don't know the standard error formula for this one. |
-log(cumprod(1 - `n_events_[x, y]`/n_at_risk_eff)) |
0 + NA_real_ |
H_ch_[x, y] |
Cumulative version of h_ch_[x, y]. |
cumsum(delta_t * `h_ch_[x, y]_est`) |
sqrt(cumsum(delta_t^2 * `h_ch_[x, y]_se`^2)) |
S_lt_[x, y] |
Cause-specific (transition-specific) survival, life table estimator, for transitioning from x to y. Both x and y can be sets of statuses, e.g. x = 1:2.. Cause-specific survival treats all other events but the one of interest as random censoring. |
cumprod(1 - `n_events_[x, y]`/n_at_risk_eff) |
`S_lt_[x, y]_est` * sqrt(cumsum(`n_events_[x, y]`/(n_at_risk_eff * (n_at_risk_eff - `n_events_[x, y]`)))) |
S_ch_[x, y] |
Cause-specific (transition-specific) survival, constant hazards estimator, for transitioning from x to y. Both x and y can be sets of statuses, e.g. x = 1:2.. Cause-specific survival treats all other events but the one of interest as random censoring. |
exp(-`H_ch_[x, y]_est`) |
S_ch_est * sqrt(cumsum((delta_t^2) * `n_events_[x, y]`/(t_at_risk^2))) |
F_lt_[x, y] |
Cause-specific (transition-specific) (absolute) risk, lifetable estimator for transitioning from x to y. Both x and y can be sets of statuses, e.g. x = 1:2. Here we use the estimator presented in Prentice et al. We don't know the standard error formula for this one. Prentice RL, Kalbfleisch JD, Peterson AV Jr, Flournoy N, Farewell VT, Breslow NE. The analysis of failure times in the presence of competing risks. Biometrics. 1978 Dec;34(4):541-54. PMID: 373811 |
{
q <- (1 - S_lt_cond_est) * `n_events_[x, y]`/n_events
q[n_events == 0] <- 0
cumsum(S_lt_est_lag1 * q)
} |
0 + NA_real_ |
F_ch_[x, y] |
Cause-specific (transition-specific) (absolute) risk, constant hazards estimator. Here we use the estimator presented in Prentice et al. We don't know the standard error formula for this one. Prentice RL, Kalbfleisch JD, Peterson AV Jr, Flournoy N, Farewell VT, Breslow NE. The analysis of failure times in the presence of competing risks. Biometrics. 1978 Dec;34(4):541-54. PMID: 373811 | {
q <- (1 - S_ch_cond_est) * `n_events_[x, y]`/n_events
q[n_events == 0] <- 0
cumsum(S_ch_est_lag1 * q)
} |
0 + NA_real_ |
RF_e2_lt |
""Relative"" absolute risk, lifetable estimator. This was proposed by Karri Seppa at the Finnish Cancer Registry. It is a modification of the method presented in Prentice et al above. There is no publication for this. Based on our testing it is very close to the cause-specific estimator of cancer death in cancer patients. We don't know the standard error formula for this one. | {
q <- (1 - S_lt_cond_est) * (n_events - n_events_exp_e2)/n_events
q[n_events == 0] <- 0
cumsum(S_lt_est_lag1 * q)
} |
0 + NA_real_ |
RF_e2_ch |
""Relative"" absolute risk, constant hazards estimator. This was proposed by Karri Seppa at the Finnish Cancer Registry. It is a modification of the method presented in Prentice et al above. There is no publication for this. Based on our testing it is very close to the cause-specific estimator of cancer death in cancer patients. We don't know the standard error formula for this one. | {
q <- (1 - S_ch_cond_est) * (n_events - n_events_exp_e2)/n_events
q[n_events == 0] <- 0
cumsum(S_ch_est_lag1 * q)
} |
0 + NA_real_
|
If weight_dt is a data.table with column "brenner_weight", we
merge the weights into dt, multiply all value_col_nms with the
respective Brenner weight, and sum value_col_nms over the adjusting
strata in weight_dt. E.g. if dt is stratified by sex and ag and
weight_dt contains columns
ag and brenner_weight then the resulting table will have
weighted sums of value_col_nms by sex (and the intervals).
This approach is the same as assigning weights for each individual before
splitting and aggregating when the individual weights are based on
strata only (and not maybe some continuous variable).
Check columns t_at_risk, n_at_risk_eff for zeroes/NAs if they are in
dt. Intervals with e.g. t_at_risk == 0 have no survival probability
defined for them which causes NA values (i.e. zero divided by zero
is not defined). Throw a warning if such intervals are found.
Add column delta_t as the difference between the _stop and _start
columns of the follow-up time scale, e.g. ts_fut_stop - ts_fut_start.
Armed with a list of expressions based on estimates, called
expressions, for each i:
Evaluate each element of expressions[[i]] and add the result into
dt. E.g. S_ch_est and
S_ch_se.
Call
[directadjusting::directly_adjusted_estimates].
If no weights were given for direct adjusting then this step simply
produces the confidence intervals.
If direct adjusting was performed, the summary statistics such as
n_events are summed over the adjusting strata and will be included
in the output. These are not weighted averages/sums but simple sums.
Return a data.table containing all columns in dt (unless direct
adjusting was performed — then the adjusting column(s) are not present)
and additional columns of estimates, standard errors, and confidence
intervals.
popEpi::surv_brenner_weight_dt
popEpi::surv_brenner_weight_dt performs the following steps:
Merge standard_weight_dt and observed_weight_dt into one table.
Scale the standard weights and the observed weights to sum into one
(they are in separate columns). E.g. one stratum has
weight_standard = 0.5 and weight_observed = 0.4.
At this point the weight table might look like this:
| ag | weight_standard | weight_observed |
| [0,15[ | 0.00000 | 0.0000000 |
| [15,20[ | 0.00164 | 0.0000000 |
| [20,25[ | 0.00215 | 0.0004110 |
| [25,30[ | 0.00416 | 0.0030037 |
| [30,35[ | 0.00842 | 0.0092639 |
| [35,40[ | 0.01874 | 0.0174213 |
| [40,45[ | 0.03489 | 0.0317440 |
| [45,50[ | 0.04906 | 0.0641520 |
| [50,55[ | 0.07094 | 0.0962755 |
| [55,60[ | 0.09641 | 0.1094916 |
| [60,65[ | 0.13359 | 0.1531554 |
| [65,70[ | 0.14234 | 0.1582459 |
| [70,75[ | 0.14766 | 0.1212217 |
| [75,80[ | 0.14131 | 0.0954218 |
| [80,85[ | 0.09347 | 0.0718983 |
| [85,Inf[ | 0.05522 | 0.0682939 |
If the weight table contains rows where the observed weight is zero and the standard weight is not, the individual weights will not work properly.
If is.null(collapse_stratum_col_nms), a warning is thrown.
Else we combine strata identified by collapse_stratum_col_nms
to ensure there are zero problematic rows in the weight table.
At this point the weight table looks like e.g. (weight_standard and
weight_observed are the same for the 2nd and 3rd rows because they
were combined — but we keep the original strata because we need them
for the join later)
| ag | weight_standard | weight_observed |
| [0,15[ | 0.00000 | 0.0000000 |
| [15,20[ | 0.00379 | 0.0004110 |
| [20,25[ | 0.00379 | 0.0004110 |
| [25,30[ | 0.00416 | 0.0030037 |
| [30,35[ | 0.00842 | 0.0092639 |
| [35,40[ | 0.01874 | 0.0174213 |
| [40,45[ | 0.03489 | 0.0317440 |
| [45,50[ | 0.04906 | 0.0641520 |
| [50,55[ | 0.07094 | 0.0962755 |
| [55,60[ | 0.09641 | 0.1094916 |
| [60,65[ | 0.13359 | 0.1531554 |
| [65,70[ | 0.14234 | 0.1582459 |
| [70,75[ | 0.14766 | 0.1212217 |
| [75,80[ | 0.14131 | 0.0954218 |
| [80,85[ | 0.09347 | 0.0718983 |
| [85,Inf[ | 0.05522 | 0.0682939 |
Compute the individual weights as the standard weights divided by the
observed weights. E.g. one stratum has
weight_brenner = 0.5 / 0.4 = 1.2. If either the standard or observed
weight is zero then set the Brenner weight to zero.
At this point the table looks like e.g. (note: weight_brenner was made
larger by our combining the strata in rows 2 and 3. In fact it will
always be larger when we combine strata because weight_standard is
summed and becomes greater but weight_observed does not become any
greater. The observations here serve "double duty" and represent all the
strata in the combination.):
| ag | weight_standard | weight_observed | weight_brenner |
| [0,15[ | 0.00000 | 0.0000000 | 0.0000000 |
| [15,20[ | 0.00379 | 0.0004110 | 9.2214110 |
| [20,25[ | 0.00379 | 0.0004110 | 9.2214110 |
| [25,30[ | 0.00416 | 0.0030037 | 1.3849735 |
| [30,35[ | 0.00842 | 0.0092639 | 0.9089002 |
| [35,40[ | 0.01874 | 0.0174213 | 1.0756964 |
| [40,45[ | 0.03489 | 0.0317440 | 1.0991045 |
| [45,50[ | 0.04906 | 0.0641520 | 0.7647460 |
| [50,55[ | 0.07094 | 0.0962755 | 0.7368441 |
| [55,60[ | 0.09641 | 0.1094916 | 0.8805243 |
| [60,65[ | 0.13359 | 0.1531554 | 0.8722511 |
| [65,70[ | 0.14234 | 0.1582459 | 0.8994864 |
| [70,75[ | 0.14766 | 0.1212217 | 1.2180987 |
| [75,80[ | 0.14131 | 0.0954218 | 1.4808988 |
| [80,85[ | 0.09347 | 0.0718983 | 1.3000304 |
| [85,Inf[ | 0.05522 | 0.0682939 | 0.8085640 |
popEpi::surv_individual_weights
Produce a vector of weights, one weight for each row in df.
These weights have been called individual weights, Brenner weights,
(Brenner et al 2004, https://doi.org/10.1016/j.ejca.2004.07.007),
pre-weights, and maybe even others. A beloved child has many names.
The idea of these weights is to weigh the individual contribution of each person to the summary statistics from which the survival function estimates themselves are produced. They pronounce the influence of under-represented data and reduce the influence of over-represented data just as the more conventional computation of weighted averages survival function estimates does. However, in this individual weighting approach even a single event can be included in the summary statistics as e.g. 0.80 or 1.20 events for over and under-represented data respectively.
This function makes producing these weights easier. It performs the following steps:
If is.null(observed_weight_dt), observed_weight_dt is computed
by surv_individual_weights by simply counting the number of cases
in each stratum in standard_weight_dt.
Call popEpi::surv_brenner_weight_dt to form Brenner weights.
Using left-join, produce a vector of length nrow(df) where each row in
df gets an individual weight based on its stratum.
Return a vector of weights.
popEpi::surv_lexis
Compute survival estimates on a Lexis dataset
([Epi::Lexis] / [Lexis_dt]).
Performs the following steps:
estimators is analysed and the following will be appended to
aggre_exprs arg of [lexis_split_merge_aggregate_by_stratum] by
surv_lexis:
Detect variables used in the estimation expressions with [all.vars].
Retain only those known to popEpi — see
?lexis_split_merge_aggregate_by_stratum. This results in a character
string vector that the argument aggre_exprs of
lexis_split_merge_aggregate_by_stratum accepts.
This results in aggre_exprs with both anything that the user defined
and also what was added by surv_lexis. However, we drop duplicates
in aggre_exprs based on both duplicated(names(aggre_exprs)) and
duplicated(aggre_exprs). E.g. if you supply
aggre_exprs = list(n_events = quote(sum(lex.Xst != 0)))
and n_events is also added by surv_lexis then only the one you
supplied is retained.
Call lexis_split_merge_aggregate_by_stratum.
The resulting table of aggregated data is
stratified by both aggre_by and by any stratifying columns found in
weight_dt if a data.tablewas supplied as that argument. E.g. withaggre_by = "sex"andweight_dt = data.table::data.table(ag = 1:3, weight = c(100, 150, 200)), the statistics table is stratified by both sexandag. With aggre_by = "sex"andweights = "individual_weight"' the table is
stratified by sex and contains individually weighted statistics.
Call surv_estimate.
New function surv_estimate for estimating arbitrary survival time
functions using aggregated data.
# popEpi::surv_estimate dt <- data.table::data.table( box_id = 1:60, ts_fut_start = seq(0, 5 - 1 / 12, 1 / 12), ts_fut_stop = seq(1 / 12, 5, 1 / 12), n_events = rpois(n = 60, lambda = 60:1), t_at_risk = rpois(n = 60, lambda = 300:240) ) data.table::set( x = dt, j = "my_h_ch", value = dt[["n_events"]] / dt[["t_at_risk"]] ) sdt <- popEpi::surv_estimate( dt = dt, ts_fut_col_nm = "ts_fut", value_col_nms = c("n_events", "t_at_risk"), estimators = c("h_ch", "S_ch"), conf_methods = "log-log" ) stopifnot( c("h_ch_est", "S_ch_est") %in% names(sdt), dt[["my_h_ch"]] == sdt[["h_ch_est"]] ) sdt <- popEpi::surv_estimate( dt = dt, ts_fut_col_nm = "ts_fut", value_col_nms = c("n_events", "t_at_risk"), estimators = list( "h_ch", my_surv = list( est = quote( exp(-cumsum((ts_fut_stop - ts_fut_start) * h_ch_est)) ), se = quote(rep(0.0, length(ts_fut_start))) ) ), conf_methods = c("log", "none") ) stopifnot( c("h_ch_est", "my_surv_est") %in% names(sdt) ) sdt <- popEpi::surv_estimate( dt = dt, ts_fut_col_nm = "ts_fut", value_col_nms = c("n_events", "t_at_risk"), estimators = list( "h_ch", "S_ch" ), conf_methods = list( "log", list( g = quote(stats::qnorm(p = theta)), g_inv = quote(stats::pnorm(q = g)) ) ) ) stopifnot( c("h_ch_est", "S_ch_est") %in% names(sdt) ) # one approach to brenner weighting --- effectively the same as assigning # a weight for each individual using strata. make_column_ag_icss <- function(age) { cut( age, breaks = c(0, 60, 70, 80, Inf), right = FALSE, labels = c("0-59", "60-69", "70-79", "80+") ) } make_sire <- function() { sire <- popEpi::sire sire <- sire[ sire[["dg_date"]] < sire[["ex_date"]] & data.table::between( sire[["ex_date"]], as.Date("1999-01-01"), as.Date("2003-12-31"), incbounds = TRUE ) & (get.yrs(sire[["ex_date"]]) - get.yrs(sire[["bi_date"]])) < 100 ] sire[j = "my_stratum" := sample(2L, size = nrow(sire), replace = TRUE)] sire <- sire[ j = .SD[as.integer(seq(1L, .N, length.out = 50L))], keyby = "my_stratum" ] # you can also use popEpi::Lexis_dt sire <- Epi::Lexis( entry = list( ts_cal = popEpi::get.yrs(dg_date), ts_age = popEpi::get.yrs(dg_date) - popEpi::get.yrs(bi_date), ts_fut = 0.0 ), duration = popEpi::get.yrs(ex_date) - popEpi::get.yrs(dg_date), entry.status = 0L, exit.status = status, data = sire ) sire[["ag_icss"]] <- make_column_ag_icss(sire[["dg_age"]]) sire } sire <- make_sire() make_weight_dt <- function() { wdt <- popEpi::ICSS[ j = list( standard_weight = as.double(sum(.SD[["ICSS1"]])) ), keyby = list( ag_icss = make_column_ag_icss(popEpi::ICSS[["age"]]) ) ] wdt[ j = "standard_weight" := wdt[["standard_weight"]] / sum(wdt[["standard_weight"]]) ] owdt <- data.table::setDT(as.list(sire))[ j = list(observed_weight = .N), keyby = "ag_icss" ] owdt[ j = "observed_weight" := owdt[["observed_weight"]] / sum(owdt[["observed_weight"]]) ] wdt[ j = "observed_weight" := owdt[["observed_weight"]] ] wdt[ j = "brenner_weight" := wdt[["standard_weight"]] / owdt[["observed_weight"]] ] wdt[i = is.na(wdt[["brenner_weight"]]), j = "brenner_weight" := 0.0] return(wdt[]) } wdt <- make_weight_dt() sdt <- popEpi::lexis_split_merge_aggregate_by_stratum( lexis = sire, breaks = list(ts_fut = seq(0, 5, 1 / 12)), aggre_exprs = c("t_at_risk", "n_events"), aggre_by = "ag_icss" ) sdt_bw <- popEpi::surv_estimate( dt = sdt, ts_fut_col_nm = "ts_fut", estimators = "S_ch", weight_dt = data.table::data.table( ag_icss = wdt[["ag_icss"]], brenner_weight = wdt[["brenner_weight"]] ), value_col_nms = c("n_events", "t_at_risk") ) stopifnot("S_ch_est" %in% names(sdt_bw), !"ag_icss" %in% names(sdt_bw)) # direct adjusting for comparison sdt_da <- popEpi::surv_estimate( dt = sdt, ts_fut_col_nm = "ts_fut", estimators = "S_ch", weight_dt = data.table::data.table( ag_icss = wdt[["ag_icss"]], weight = wdt[["standard_weight"]] ), value_col_nms = c("n_events", "t_at_risk") ) stopifnot("S_ch_est" %in% names(sdt_da), !"ag_icss" %in% names(sdt_da)) stopifnot( max(abs(sdt_bw[["S_ch_est"]] - sdt_da[["S_ch_est"]])) < 0.01 ) # popEpi::surv_collapse_1d sdt <- data.table::data.table( box_id = 1:3, ts_fut_id = 1:3, ts_fut_start = 0:2, ts_fut_stop = 1:3, t_at_risk = c(1.0, 0.0, 0.5), n_events = 0L ) sdt <- surv_collapse_1d( dt = sdt, ts_fut_col_nm = "ts_fut", value_col_nms = c("t_at_risk", "n_events"), test_expr = quote(t_at_risk > 0.0) ) stopifnot( nrow(sdt) == 2, sdt[["ts_fut_start"]] == c(0.0, 1.0), sdt[["ts_fut_stop"]] == c(1.0, 3.0), sdt[["t_at_risk"]] == c(1.0, 0.5) ) # popEpi::surv_collapse_strata_list sdt <- data.table::CJ( ag = 1:3, box_id = 1:5 ) sdt[ i = sdt[["ag"]] == 1, j = "t_at_risk" := 0.0 ] sdt[ i = sdt[["ag"]] == 2, j = "t_at_risk" := c(1, 1, 0, 2, 1) ] sdt[ i = sdt[["ag"]] == 3, j = "t_at_risk" := c(1, 1, 1, 0, 1) ] sdt_collapse_data <- popEpi::surv_collapse_strata_list( dt = sdt, stratum_col_nms = "ag", collapse_stratum_col_nms = "ag" ) stopifnot( "result" %in% names(sdt_collapse_data), inherits(sdt_collapse_data[["result"]][[1]], "list"), c("new", "old") %in% names(sdt_collapse_data[["result"]][[1]]), sdt_collapse_data[["result"]][[1]][["new"]][["stratum_id"]] == 1L ) sdt[ i = sdt[["ag"]] == 1, j = "t_at_risk" := 0.5 ] sdt_collapse_data <- popEpi::surv_collapse_strata_list( dt = sdt, stratum_col_nms = "ag", collapse_stratum_col_nms = "ag" ) stopifnot( "result" %in% names(sdt_collapse_data), inherits(sdt_collapse_data[["result"]][[1]], "list"), c("new", "old") %in% names(sdt_collapse_data[["result"]][[1]]), identical( sdt_collapse_data[["result"]][[1]][["new"]][["stratum_id"]], c(rep(1L, 5L), rep(2L, 10L)) ) ) # popEpi::surv_collapse_strata_1d sdt_collapsed <- popEpi::surv_collapse_strata_1d( dt = sdt, stratum_col_nms = "ag", collapse_stratum_col_nm = "ag" ) stopifnot( nrow(sdt_collapsed) == 10, names(sdt) %in% names(sdt_collapsed), names(sdt_collapsed) %in% names(sdt), is.character(sdt_collapsed[["ag"]]), identical(unique(sdt_collapsed[["ag"]]), c("1", "2 & 3")) ) # popEpi::surv_brenner_weight_dt make_column_icss_ag <- function(age) { cut( age, breaks = c(seq(15, 85, 5), Inf), right = FALSE ) } make_standard_weight_dt <- function() { swdt <- popEpi::ICSS[ j = list( weight = as.double(sum(.SD[["ICSS1"]])) ), keyby = list( icss_ag = make_column_icss_ag(popEpi::ICSS[["age"]]) ) ][] swdt <- swdt[!is.na(swdt[["icss_ag"]]), ] return(swdt[]) } swdt <- make_standard_weight_dt() owdt <- data.table::setDT(list( icss_ag = make_column_icss_ag(c(50.5, 60.5)) ))[ i = data.table::setDT(list(icss_ag = swdt[["icss_ag"]])), on = "icss_ag", j = list(weight = .N), keyby = .EACHI ] # this is what happens when at least one weighting stratum is empty warn_env <- new.env() warn_env[["w"]] <- NULL bwdt_bad <- suppressWarnings(withCallingHandlers( popEpi::surv_brenner_weight_dt( standard_weight_dt = swdt, observed_weight_dt = owdt ), warning = function(w) warn_env[["w"]] <- w )) stopifnot( inherits(warn_env[["w"]], "warning"), bwdt_bad[["weight_brenner"]] >= 0 ) # here is an easy way to handle this problem. note that the weights are # always larger when we need to aggregate strata --- because the standard # weight increases but the observed weight remains the same. the same # observations serve "double duty" now for multiple original strata. bwdt <- popEpi::surv_brenner_weight_dt( standard_weight_dt = swdt, observed_weight_dt = owdt, collapse_stratum_col_nms = "icss_ag" ) stopifnot( nrow(swdt) == nrow(bwdt), bwdt[["weight_brenner"]] > 0 ) # popEpi::surv_individual_weights make_column_icss_ag <- function(age) { cut( age, breaks = c(seq(15, 85, 5), Inf), right = FALSE ) } make_standard_weight_dt <- function() { swdt <- popEpi::ICSS[ j = list( weight = as.double(sum(.SD[["ICSS1"]])) ), keyby = list( icss_ag = make_column_icss_ag(popEpi::ICSS[["age"]]) ) ][] return(swdt[]) } lexis <- Epi::Lexis( entry = list(ts_fut = c(0.0, 0.0)), exit = list(ts_fut = c(1.5, 2.5)), entry.status = 0L, exit.status = 0L ) lexis[["icss_ag"]] <- make_column_icss_ag(c(50.5, 60.5)) swdt <- make_standard_weight_dt() # this is what happens when at least one weighting stratum is empty warn_env <- new.env() warn_env[["w"]] <- NULL iw_bad <- suppressWarnings(withCallingHandlers(popEpi::surv_individual_weights( df = lexis, standard_weight_dt = swdt ), warning = function(w) warn_env[["w"]] <- w)) stopifnot( inherits(warn_env[["w"]], "warning"), iw_bad >= 0 ) # here is an easy way to handle this problem. note that the weights are # always larger when we need to aggregate strata --- because the standard # weight increases but the observed weight remains the same. the same # observations serve "double duty" now for multiple original strata. iw <- popEpi::surv_individual_weights( df = lexis, standard_weight_dt = swdt, collapse_stratum_col_nms = "icss_ag" ) stopifnot( length(iw) == nrow(lexis), iw >= 0.0, iw >= iw_bad ) # popEpi::lexis_breaks_collapse_1d lexis <- Epi::Lexis( entry = list(ts_fut = c(0.0, 3.0)), duration = c(2.5, 10.0), entry.status = 0L, exit.status = c(0L, 1L) ) bl <- list(ts_fut = seq(0, 5, 0.5)) agdt <- popEpi::lexis_split_merge_aggregate_by_stratum( lexis = lexis, breaks = bl, aggre_exprs = "t_at_risk" ) # what to do about empty intervals? we have intervals with no subjects in them. stopifnot( agdt[["t_at_risk"]][-6] > 0, agdt[["t_at_risk"]][6] == 0.0 ) # combine intervals until there are no empty ones left. breaks_ts_fut <- popEpi::lexis_breaks_collapse_1d( lexis = lexis, breaks_1d = bl ) # this has resulted in the combined interval ]2.5, 3.5]. stopifnot(!3.0 %in% breaks_ts_fut) # but sometimes we want to retain specific breaks in the output. for instance # if the goal is to produce survival estimates for every year. # we can make use of arg `mandatory_breaks` to achieve this. breaks_ts_fut <- popEpi::lexis_breaks_collapse_1d( lexis = lexis, breaks_1d = bl, mandatory_breaks = 0:5 ) # this has resulted in the combined interval ]2.0, 3.0]. stopifnot(!2.5 %in% breaks_ts_fut) agdt <- popEpi::lexis_split_merge_aggregate_by_stratum( lexis = lexis, breaks = list(ts_fut = breaks_ts_fut), aggre_exprs = "t_at_risk" ) stopifnot( agdt[["t_at_risk"]] > 0.0 ) # popEpi::surv_lexis make_pm <- function() { pm <- data.table::copy(popEpi::popmort) data.table::setnames( pm, c("year", "agegroup", "haz"), c("ts_cal", "ts_age", "h_exp") ) data.table::setkeyv(pm, c("sex", "ts_cal", "ts_age")) pm <- c( list(pm), lapply(101:110, function(age) { sub_pm <- pm[ pm[["ts_age"]] == 100 ] sub_pm[ j = "ts_age" := age ] sub_pm[] }) ) pm <- data.table::rbindlist(pm) data.table::setkeyv(pm, c("sex", "ts_cal", "ts_age")) return(pm[]) } make_column_icss_ag <- function(age) { cut( age, breaks = c(0, 60, 70, 80, Inf), right = FALSE, labels = c("0-59", "60-69", "70-79", "80+") ) } make_standard_weight_dt <- function() { return(popEpi::ICSS[ j = list( weight = as.double(sum(.SD[["ICSS1"]])) ), keyby = list( icss_ag = make_column_icss_ag(popEpi::ICSS[["age"]]) ) ][]) } make_sire <- function() { sire <- popEpi::sire sire <- sire[ sire[["dg_date"]] < sire[["ex_date"]] & sire[["ex_date"]] >= as.Date("1999-01-01") & (get.yrs(sire[["ex_date"]]) - get.yrs(sire[["bi_date"]])) < 100 ] set.seed(1337) sire[j = "my_stratum" := sample(2L, size = .N, replace = TRUE)] # you can also use popEpi::Lexis_dt sire <- Epi::Lexis( data = sire, entry = list( ts_cal = popEpi::get.yrs(dg_date), ts_age = popEpi::get.yrs(dg_date) - popEpi::get.yrs(bi_date), ts_fut = 0.0 ), duration = popEpi::get.yrs(ex_date) - popEpi::get.yrs(dg_date), entry.status = 0L, exit.status = status ) sire[["icss_ag"]] <- make_column_icss_ag(sire[["dg_age"]]) sire[["individual_weight"]] <- popEpi::surv_individual_weights( df = sire, standard_weight_dt = wdt ) return(sire[]) } pm <- make_pm() wdt <- make_standard_weight_dt() sire <- make_sire() # using breaks on the calendar time causes period analysis to be performed. # this is done here for demonstration purposes but of course everything works # also without ts_cal breaks. bl <- list( ts_cal = c(1999, 2004), ts_fut = popEpi::lexis_breaks_collapse_1d( lexis = sire, breaks_1d = list(ts_fut = seq(0, 5, 1 / 12)), mandatory_breaks = 0:5 ) ) # observed survival sdt <- popEpi::surv_lexis( lexis = sire, breaks = bl, aggre_by = "my_stratum", subset = NULL, estimators = "S_ch", conf_methods = "log", conf_lvls = 0.95 ) stopifnot( inherits(sdt, "data.table"), "my_stratum" %in% names(sdt), c("S_ch_est", "S_ch_se", "S_ch_lo", "S_ch_hi") %in% names(sdt) ) # observed survival with direct adjusting sdt_da <- popEpi::surv_lexis( lexis = sire, breaks = bl, aggre_by = "my_stratum", estimators = "S_ch", weights = wdt ) stopifnot( inherits(sdt_da, "data.table"), "my_stratum" %in% names(sdt_da), c("S_ch_est", "S_ch_se", "S_ch_lo", "S_ch_hi") %in% names(sdt_da) ) # observed survival with individual weighting sdt_iw <- popEpi::surv_lexis( lexis = sire, breaks = bl, aggre_by = "my_stratum", estimators = "S_ch", weights = "individual_weight" ) stopifnot( inherits(sdt_iw, "data.table"), "my_stratum" %in% names(sdt_iw), c("S_ch_est", "S_ch_se", "S_ch_lo", "S_ch_hi") %in% names(sdt_iw), # direct adjusting and individual weighting produce similar results. # the larger the dataset, the smaller the difference between the two. max(abs(sdt_da[["S_ch_est"]] - sdt_iw[["S_ch_est"]])) < 0.02, max(abs(sdt_da[["S_ch_se"]] - sdt_iw[["S_ch_se"]])) < 0.001 ) # observed survival with direct adjusting, multiple period estimates # with the periods as strata in output bl_multi_period <- bl bl_multi_period[["ts_cal"]] <- c(bl[["ts_cal"]], max(bl[["ts_cal"]]) + 5) sdt_da <- popEpi::surv_lexis( lexis = sire, breaks = bl_multi_period, aggre_by = "my_stratum", estimators = "S_ch", weights = wdt ) stopifnot( inherits(sdt_da, "data.table"), c("ts_cal_start", "ts_fut_start") %in% names(sdt_da), c("S_ch_est", "S_ch_se", "S_ch_lo", "S_ch_hi") %in% names(sdt_da) ) # observed survival with individual adjusting, period estimates sdt_iw <- popEpi::surv_lexis( lexis = sire, breaks = bl_multi_period, aggre_by = "my_stratum", estimators = "S_ch", weights = "individual_weight" ) stopifnot( inherits(sdt_iw, "data.table"), c("ts_cal_start", "ts_fut_start") %in% names(sdt_iw), c("S_ch_est", "S_ch_se", "S_ch_lo", "S_ch_hi") %in% names(sdt_iw), # direct adjusting and individual weighting produce similar results. # the larger the dataset, the smaller the difference between the two. max(abs(sdt_da[["S_ch_est"]] - sdt_iw[["S_ch_est"]])) < 0.03, max(abs(sdt_da[["S_ch_se"]] - sdt_iw[["S_ch_se"]])) < 0.006 ) # a few common survival time function estimates sdt <- popEpi::surv_lexis( lexis = sire, breaks = bl, merge_dt_by = c("sex", "ts_cal", "ts_age"), merge_dt = pm, aggre_by = "my_stratum", estimators = c( "S_ch", "RS_e2_ch", "NS_pp_ch" ) ) stopifnot( inherits(sdt, "data.table"), c("S_ch_est", "RS_e2_ch_est", "NS_pp_ch_est") %in% names(sdt) ) # your very own estimator (the example `se` is made up) sdt <- popEpi::surv_lexis( lexis = sire, breaks = bl, merge_dt_by = c("sex", "ts_cal", "ts_age"), merge_dt = pm, aggre_by = "my_stratum", estimators = list( my_estimator = list( est = quote(n_events / n_events_exp_e2), se = quote(sqrt(n_events / (n_events_exp_e2 ^ 2))) ) ), conf_method = "identity" ) stopifnot( c("my_estimator_est", "my_estimator_se") %in% names(sdt) ) # your very own estimator with direct adjusting sdt <- popEpi::surv_lexis( lexis = sire, breaks = bl, merge_dt_by = c("sex", "ts_cal", "ts_age"), merge_dt = pm, aggre_by = "my_stratum", estimators = list( my_estimator = list( est = quote(n_events / n_events_exp_e2), se = quote(sqrt(n_events / (n_events_exp_e2 ^ 2))) ) ), conf_method = "identity", weights = wdt ) stopifnot( c("my_estimator_est", "my_estimator_se") %in% names(sdt) ) # your very own estimator with individual weighting sdt <- popEpi::surv_lexis( lexis = sire, breaks = bl, merge_dt_by = c("sex", "ts_cal", "ts_age"), merge_dt = pm, aggre_by = "my_stratum", estimators = list( my_estimator = list( est = quote(n_events / n_events_exp_e2), se = quote(sqrt(n_events / (n_events_exp_e2 ^ 2))) ) ), conf_method = "identity", weights = "individual_weight" ) stopifnot( c("my_estimator_est", "my_estimator_se") %in% names(sdt) ) # your very own method of confidence interval estimation sdt <- popEpi::surv_lexis( lexis = sire, breaks = bl, estimators = c("S_ch", "S_lt"), conf_methods = list(list( g = quote(qnorm(theta)), g_inv = quote(pnorm(g)) )), conf_lvls = 0.95, weights = "individual_weight" ) stopifnot( c("S_ch_lo", "S_ch_hi") %in% names(sdt) ) # a bunch of estimators, cause-specific ones from state 0 to 1 or 2 # and non-cause-specific ones also sdt <- popEpi::surv_lexis( lexis = sire, breaks = bl, merge_dt_by = c("sex", "ts_cal", "ts_age"), merge_dt = pm, aggre_by = NULL, subset = NULL, estimators = c( "h_ch_[0, 1]", "h_ch_[0, 2]", "H_ch_[0, 1]", "H_ch_[0, 2]", "F_ch_[0, 1]", "F_ch_[0, 2]", "S_ch_[0, 1]", "S_ch_[0, 2]", "S_ch_[0, 1:2]", "H_ch", "H_lt", "h_ch", "h_lt", "F_ch", "F_lt", "S_ch", "S_lt", "NS_pp_ch", "RF_e2_ch" ) ) stopifnot( # hazard is additive. all.equal( sdt[["h_ch_est"]], sdt[["h_ch_[0, 1]_est"]] + sdt[["h_ch_[0, 2]_est"]] ), # cumulative hazard is additive. all.equal( sdt[["H_ch_est"]], sdt[["H_ch_[0, 1]_est"]] + sdt[["H_ch_[0, 2]_est"]] ), # competing risks cdf is additive. in epidemiology the cdf is often called # the cumulative incidence function. all.equal( sdt[["F_ch_est"]], sdt[["F_ch_[0, 1]_est"]] + sdt[["F_ch_[0, 2]_est"]] ), # competing risks survival is NOT additive! sdt[["S_ch_[0, 1]_est"]] + sdt[["S_ch_[0, 2]_est"]] > 1, # you can define multiple states in the cause-specific transition. # this one covers all exit statuses and is therefore the same as overall # survival by definition. sdt[["S_ch_[0, 1:2]_est"]] == sdt[["S_ch_est"]]#, # the different estimation methods produce about the same results. # abs(sdt[["S_ch_est"]] - sdt[["S_lt_est"]]) < 0.01, # abs(sdt[["H_ch_est"]] - sdt[["H_lt_est"]]) < 0.01, # net survival is approximately the same as cause-specific survival in this # rather ideal dataset. # abs(sdt[["S_ch_[0, 1]_est"]] - sdt[["NS_pp_ch_est"]]) < 0.01, # the ederer 2 based "relative cdf" is approximately the same as # cause-specific cdf in this rather ideal dataset. # abs(sdt[["F_ch_[0, 1]_est"]] - sdt[["RF_e2_ch_est"]]) < 0.01 ) # factor status data.table::set( x = sire, j = c("lex.Cst", "lex.Xst"), value = list( factor( x = sire[["lex.Cst"]], levels = 0:2, labels = c("alive", "dead from cancer", "dead from other") ), factor( x = sire[["lex.Xst"]], levels = 0:2, labels = c("alive", "dead from cancer", "dead from other") ) ) ) sdt <- popEpi::surv_lexis( lexis = sire, breaks = bl, merge_dt_by = c("sex", "ts_cal", "ts_age"), merge_dt = pm, aggre_by = NULL, subset = NULL, estimators = c( "F_ch_['alive', 'dead from cancer']", "F_ch_['alive', 'dead from other']" ) ) stopifnot( # produces the estimates as intended. maybe these are not too pretty but # you could use shorter factor labels such as c("a", "dc", "do"). "F_ch_['alive', 'dead from cancer']_est" %in% names(sdt) ) # TRUE/FALSE status data.table::set( x = sire, j = c("lex.Cst", "lex.Xst"), value = list( sire[["lex.Cst"]] != "alive", sire[["lex.Xst"]] != "alive" ) ) sdt <- popEpi::surv_lexis( lexis = sire, breaks = bl, merge_dt_by = c("sex", "ts_cal", "ts_age"), merge_dt = pm, aggre_by = NULL, subset = NULL, estimators = c( "F_ch", "F_ch_[F, T]" ) ) stopifnot( "F_ch_[F, T]_est" %in% names(sdt), max(abs(sdt[["F_ch_[F, T]_est"]] - sdt[["F_ch_est"]])) < 1e-15 ) # more arguments to popEpi::lexis_split_merge_aggregate_by_stratum sdt <- popEpi::surv_lexis( lexis = sire, breaks = bl, split_merge_aggregate_optional_args = list( breaks_collapse_args = list(mandatory_breaks = 0:5) ), estimators = "S_ch" ) stopifnot( 1:5 %in% sdt[["ts_fut_stop"]] )# popEpi::surv_estimate dt <- data.table::data.table( box_id = 1:60, ts_fut_start = seq(0, 5 - 1 / 12, 1 / 12), ts_fut_stop = seq(1 / 12, 5, 1 / 12), n_events = rpois(n = 60, lambda = 60:1), t_at_risk = rpois(n = 60, lambda = 300:240) ) data.table::set( x = dt, j = "my_h_ch", value = dt[["n_events"]] / dt[["t_at_risk"]] ) sdt <- popEpi::surv_estimate( dt = dt, ts_fut_col_nm = "ts_fut", value_col_nms = c("n_events", "t_at_risk"), estimators = c("h_ch", "S_ch"), conf_methods = "log-log" ) stopifnot( c("h_ch_est", "S_ch_est") %in% names(sdt), dt[["my_h_ch"]] == sdt[["h_ch_est"]] ) sdt <- popEpi::surv_estimate( dt = dt, ts_fut_col_nm = "ts_fut", value_col_nms = c("n_events", "t_at_risk"), estimators = list( "h_ch", my_surv = list( est = quote( exp(-cumsum((ts_fut_stop - ts_fut_start) * h_ch_est)) ), se = quote(rep(0.0, length(ts_fut_start))) ) ), conf_methods = c("log", "none") ) stopifnot( c("h_ch_est", "my_surv_est") %in% names(sdt) ) sdt <- popEpi::surv_estimate( dt = dt, ts_fut_col_nm = "ts_fut", value_col_nms = c("n_events", "t_at_risk"), estimators = list( "h_ch", "S_ch" ), conf_methods = list( "log", list( g = quote(stats::qnorm(p = theta)), g_inv = quote(stats::pnorm(q = g)) ) ) ) stopifnot( c("h_ch_est", "S_ch_est") %in% names(sdt) ) # one approach to brenner weighting --- effectively the same as assigning # a weight for each individual using strata. make_column_ag_icss <- function(age) { cut( age, breaks = c(0, 60, 70, 80, Inf), right = FALSE, labels = c("0-59", "60-69", "70-79", "80+") ) } make_sire <- function() { sire <- popEpi::sire sire <- sire[ sire[["dg_date"]] < sire[["ex_date"]] & data.table::between( sire[["ex_date"]], as.Date("1999-01-01"), as.Date("2003-12-31"), incbounds = TRUE ) & (get.yrs(sire[["ex_date"]]) - get.yrs(sire[["bi_date"]])) < 100 ] sire[j = "my_stratum" := sample(2L, size = nrow(sire), replace = TRUE)] sire <- sire[ j = .SD[as.integer(seq(1L, .N, length.out = 50L))], keyby = "my_stratum" ] # you can also use popEpi::Lexis_dt sire <- Epi::Lexis( entry = list( ts_cal = popEpi::get.yrs(dg_date), ts_age = popEpi::get.yrs(dg_date) - popEpi::get.yrs(bi_date), ts_fut = 0.0 ), duration = popEpi::get.yrs(ex_date) - popEpi::get.yrs(dg_date), entry.status = 0L, exit.status = status, data = sire ) sire[["ag_icss"]] <- make_column_ag_icss(sire[["dg_age"]]) sire } sire <- make_sire() make_weight_dt <- function() { wdt <- popEpi::ICSS[ j = list( standard_weight = as.double(sum(.SD[["ICSS1"]])) ), keyby = list( ag_icss = make_column_ag_icss(popEpi::ICSS[["age"]]) ) ] wdt[ j = "standard_weight" := wdt[["standard_weight"]] / sum(wdt[["standard_weight"]]) ] owdt <- data.table::setDT(as.list(sire))[ j = list(observed_weight = .N), keyby = "ag_icss" ] owdt[ j = "observed_weight" := owdt[["observed_weight"]] / sum(owdt[["observed_weight"]]) ] wdt[ j = "observed_weight" := owdt[["observed_weight"]] ] wdt[ j = "brenner_weight" := wdt[["standard_weight"]] / owdt[["observed_weight"]] ] wdt[i = is.na(wdt[["brenner_weight"]]), j = "brenner_weight" := 0.0] return(wdt[]) } wdt <- make_weight_dt() sdt <- popEpi::lexis_split_merge_aggregate_by_stratum( lexis = sire, breaks = list(ts_fut = seq(0, 5, 1 / 12)), aggre_exprs = c("t_at_risk", "n_events"), aggre_by = "ag_icss" ) sdt_bw <- popEpi::surv_estimate( dt = sdt, ts_fut_col_nm = "ts_fut", estimators = "S_ch", weight_dt = data.table::data.table( ag_icss = wdt[["ag_icss"]], brenner_weight = wdt[["brenner_weight"]] ), value_col_nms = c("n_events", "t_at_risk") ) stopifnot("S_ch_est" %in% names(sdt_bw), !"ag_icss" %in% names(sdt_bw)) # direct adjusting for comparison sdt_da <- popEpi::surv_estimate( dt = sdt, ts_fut_col_nm = "ts_fut", estimators = "S_ch", weight_dt = data.table::data.table( ag_icss = wdt[["ag_icss"]], weight = wdt[["standard_weight"]] ), value_col_nms = c("n_events", "t_at_risk") ) stopifnot("S_ch_est" %in% names(sdt_da), !"ag_icss" %in% names(sdt_da)) stopifnot( max(abs(sdt_bw[["S_ch_est"]] - sdt_da[["S_ch_est"]])) < 0.01 ) # popEpi::surv_collapse_1d sdt <- data.table::data.table( box_id = 1:3, ts_fut_id = 1:3, ts_fut_start = 0:2, ts_fut_stop = 1:3, t_at_risk = c(1.0, 0.0, 0.5), n_events = 0L ) sdt <- surv_collapse_1d( dt = sdt, ts_fut_col_nm = "ts_fut", value_col_nms = c("t_at_risk", "n_events"), test_expr = quote(t_at_risk > 0.0) ) stopifnot( nrow(sdt) == 2, sdt[["ts_fut_start"]] == c(0.0, 1.0), sdt[["ts_fut_stop"]] == c(1.0, 3.0), sdt[["t_at_risk"]] == c(1.0, 0.5) ) # popEpi::surv_collapse_strata_list sdt <- data.table::CJ( ag = 1:3, box_id = 1:5 ) sdt[ i = sdt[["ag"]] == 1, j = "t_at_risk" := 0.0 ] sdt[ i = sdt[["ag"]] == 2, j = "t_at_risk" := c(1, 1, 0, 2, 1) ] sdt[ i = sdt[["ag"]] == 3, j = "t_at_risk" := c(1, 1, 1, 0, 1) ] sdt_collapse_data <- popEpi::surv_collapse_strata_list( dt = sdt, stratum_col_nms = "ag", collapse_stratum_col_nms = "ag" ) stopifnot( "result" %in% names(sdt_collapse_data), inherits(sdt_collapse_data[["result"]][[1]], "list"), c("new", "old") %in% names(sdt_collapse_data[["result"]][[1]]), sdt_collapse_data[["result"]][[1]][["new"]][["stratum_id"]] == 1L ) sdt[ i = sdt[["ag"]] == 1, j = "t_at_risk" := 0.5 ] sdt_collapse_data <- popEpi::surv_collapse_strata_list( dt = sdt, stratum_col_nms = "ag", collapse_stratum_col_nms = "ag" ) stopifnot( "result" %in% names(sdt_collapse_data), inherits(sdt_collapse_data[["result"]][[1]], "list"), c("new", "old") %in% names(sdt_collapse_data[["result"]][[1]]), identical( sdt_collapse_data[["result"]][[1]][["new"]][["stratum_id"]], c(rep(1L, 5L), rep(2L, 10L)) ) ) # popEpi::surv_collapse_strata_1d sdt_collapsed <- popEpi::surv_collapse_strata_1d( dt = sdt, stratum_col_nms = "ag", collapse_stratum_col_nm = "ag" ) stopifnot( nrow(sdt_collapsed) == 10, names(sdt) %in% names(sdt_collapsed), names(sdt_collapsed) %in% names(sdt), is.character(sdt_collapsed[["ag"]]), identical(unique(sdt_collapsed[["ag"]]), c("1", "2 & 3")) ) # popEpi::surv_brenner_weight_dt make_column_icss_ag <- function(age) { cut( age, breaks = c(seq(15, 85, 5), Inf), right = FALSE ) } make_standard_weight_dt <- function() { swdt <- popEpi::ICSS[ j = list( weight = as.double(sum(.SD[["ICSS1"]])) ), keyby = list( icss_ag = make_column_icss_ag(popEpi::ICSS[["age"]]) ) ][] swdt <- swdt[!is.na(swdt[["icss_ag"]]), ] return(swdt[]) } swdt <- make_standard_weight_dt() owdt <- data.table::setDT(list( icss_ag = make_column_icss_ag(c(50.5, 60.5)) ))[ i = data.table::setDT(list(icss_ag = swdt[["icss_ag"]])), on = "icss_ag", j = list(weight = .N), keyby = .EACHI ] # this is what happens when at least one weighting stratum is empty warn_env <- new.env() warn_env[["w"]] <- NULL bwdt_bad <- suppressWarnings(withCallingHandlers( popEpi::surv_brenner_weight_dt( standard_weight_dt = swdt, observed_weight_dt = owdt ), warning = function(w) warn_env[["w"]] <- w )) stopifnot( inherits(warn_env[["w"]], "warning"), bwdt_bad[["weight_brenner"]] >= 0 ) # here is an easy way to handle this problem. note that the weights are # always larger when we need to aggregate strata --- because the standard # weight increases but the observed weight remains the same. the same # observations serve "double duty" now for multiple original strata. bwdt <- popEpi::surv_brenner_weight_dt( standard_weight_dt = swdt, observed_weight_dt = owdt, collapse_stratum_col_nms = "icss_ag" ) stopifnot( nrow(swdt) == nrow(bwdt), bwdt[["weight_brenner"]] > 0 ) # popEpi::surv_individual_weights make_column_icss_ag <- function(age) { cut( age, breaks = c(seq(15, 85, 5), Inf), right = FALSE ) } make_standard_weight_dt <- function() { swdt <- popEpi::ICSS[ j = list( weight = as.double(sum(.SD[["ICSS1"]])) ), keyby = list( icss_ag = make_column_icss_ag(popEpi::ICSS[["age"]]) ) ][] return(swdt[]) } lexis <- Epi::Lexis( entry = list(ts_fut = c(0.0, 0.0)), exit = list(ts_fut = c(1.5, 2.5)), entry.status = 0L, exit.status = 0L ) lexis[["icss_ag"]] <- make_column_icss_ag(c(50.5, 60.5)) swdt <- make_standard_weight_dt() # this is what happens when at least one weighting stratum is empty warn_env <- new.env() warn_env[["w"]] <- NULL iw_bad <- suppressWarnings(withCallingHandlers(popEpi::surv_individual_weights( df = lexis, standard_weight_dt = swdt ), warning = function(w) warn_env[["w"]] <- w)) stopifnot( inherits(warn_env[["w"]], "warning"), iw_bad >= 0 ) # here is an easy way to handle this problem. note that the weights are # always larger when we need to aggregate strata --- because the standard # weight increases but the observed weight remains the same. the same # observations serve "double duty" now for multiple original strata. iw <- popEpi::surv_individual_weights( df = lexis, standard_weight_dt = swdt, collapse_stratum_col_nms = "icss_ag" ) stopifnot( length(iw) == nrow(lexis), iw >= 0.0, iw >= iw_bad ) # popEpi::lexis_breaks_collapse_1d lexis <- Epi::Lexis( entry = list(ts_fut = c(0.0, 3.0)), duration = c(2.5, 10.0), entry.status = 0L, exit.status = c(0L, 1L) ) bl <- list(ts_fut = seq(0, 5, 0.5)) agdt <- popEpi::lexis_split_merge_aggregate_by_stratum( lexis = lexis, breaks = bl, aggre_exprs = "t_at_risk" ) # what to do about empty intervals? we have intervals with no subjects in them. stopifnot( agdt[["t_at_risk"]][-6] > 0, agdt[["t_at_risk"]][6] == 0.0 ) # combine intervals until there are no empty ones left. breaks_ts_fut <- popEpi::lexis_breaks_collapse_1d( lexis = lexis, breaks_1d = bl ) # this has resulted in the combined interval ]2.5, 3.5]. stopifnot(!3.0 %in% breaks_ts_fut) # but sometimes we want to retain specific breaks in the output. for instance # if the goal is to produce survival estimates for every year. # we can make use of arg `mandatory_breaks` to achieve this. breaks_ts_fut <- popEpi::lexis_breaks_collapse_1d( lexis = lexis, breaks_1d = bl, mandatory_breaks = 0:5 ) # this has resulted in the combined interval ]2.0, 3.0]. stopifnot(!2.5 %in% breaks_ts_fut) agdt <- popEpi::lexis_split_merge_aggregate_by_stratum( lexis = lexis, breaks = list(ts_fut = breaks_ts_fut), aggre_exprs = "t_at_risk" ) stopifnot( agdt[["t_at_risk"]] > 0.0 ) # popEpi::surv_lexis make_pm <- function() { pm <- data.table::copy(popEpi::popmort) data.table::setnames( pm, c("year", "agegroup", "haz"), c("ts_cal", "ts_age", "h_exp") ) data.table::setkeyv(pm, c("sex", "ts_cal", "ts_age")) pm <- c( list(pm), lapply(101:110, function(age) { sub_pm <- pm[ pm[["ts_age"]] == 100 ] sub_pm[ j = "ts_age" := age ] sub_pm[] }) ) pm <- data.table::rbindlist(pm) data.table::setkeyv(pm, c("sex", "ts_cal", "ts_age")) return(pm[]) } make_column_icss_ag <- function(age) { cut( age, breaks = c(0, 60, 70, 80, Inf), right = FALSE, labels = c("0-59", "60-69", "70-79", "80+") ) } make_standard_weight_dt <- function() { return(popEpi::ICSS[ j = list( weight = as.double(sum(.SD[["ICSS1"]])) ), keyby = list( icss_ag = make_column_icss_ag(popEpi::ICSS[["age"]]) ) ][]) } make_sire <- function() { sire <- popEpi::sire sire <- sire[ sire[["dg_date"]] < sire[["ex_date"]] & sire[["ex_date"]] >= as.Date("1999-01-01") & (get.yrs(sire[["ex_date"]]) - get.yrs(sire[["bi_date"]])) < 100 ] set.seed(1337) sire[j = "my_stratum" := sample(2L, size = .N, replace = TRUE)] # you can also use popEpi::Lexis_dt sire <- Epi::Lexis( data = sire, entry = list( ts_cal = popEpi::get.yrs(dg_date), ts_age = popEpi::get.yrs(dg_date) - popEpi::get.yrs(bi_date), ts_fut = 0.0 ), duration = popEpi::get.yrs(ex_date) - popEpi::get.yrs(dg_date), entry.status = 0L, exit.status = status ) sire[["icss_ag"]] <- make_column_icss_ag(sire[["dg_age"]]) sire[["individual_weight"]] <- popEpi::surv_individual_weights( df = sire, standard_weight_dt = wdt ) return(sire[]) } pm <- make_pm() wdt <- make_standard_weight_dt() sire <- make_sire() # using breaks on the calendar time causes period analysis to be performed. # this is done here for demonstration purposes but of course everything works # also without ts_cal breaks. bl <- list( ts_cal = c(1999, 2004), ts_fut = popEpi::lexis_breaks_collapse_1d( lexis = sire, breaks_1d = list(ts_fut = seq(0, 5, 1 / 12)), mandatory_breaks = 0:5 ) ) # observed survival sdt <- popEpi::surv_lexis( lexis = sire, breaks = bl, aggre_by = "my_stratum", subset = NULL, estimators = "S_ch", conf_methods = "log", conf_lvls = 0.95 ) stopifnot( inherits(sdt, "data.table"), "my_stratum" %in% names(sdt), c("S_ch_est", "S_ch_se", "S_ch_lo", "S_ch_hi") %in% names(sdt) ) # observed survival with direct adjusting sdt_da <- popEpi::surv_lexis( lexis = sire, breaks = bl, aggre_by = "my_stratum", estimators = "S_ch", weights = wdt ) stopifnot( inherits(sdt_da, "data.table"), "my_stratum" %in% names(sdt_da), c("S_ch_est", "S_ch_se", "S_ch_lo", "S_ch_hi") %in% names(sdt_da) ) # observed survival with individual weighting sdt_iw <- popEpi::surv_lexis( lexis = sire, breaks = bl, aggre_by = "my_stratum", estimators = "S_ch", weights = "individual_weight" ) stopifnot( inherits(sdt_iw, "data.table"), "my_stratum" %in% names(sdt_iw), c("S_ch_est", "S_ch_se", "S_ch_lo", "S_ch_hi") %in% names(sdt_iw), # direct adjusting and individual weighting produce similar results. # the larger the dataset, the smaller the difference between the two. max(abs(sdt_da[["S_ch_est"]] - sdt_iw[["S_ch_est"]])) < 0.02, max(abs(sdt_da[["S_ch_se"]] - sdt_iw[["S_ch_se"]])) < 0.001 ) # observed survival with direct adjusting, multiple period estimates # with the periods as strata in output bl_multi_period <- bl bl_multi_period[["ts_cal"]] <- c(bl[["ts_cal"]], max(bl[["ts_cal"]]) + 5) sdt_da <- popEpi::surv_lexis( lexis = sire, breaks = bl_multi_period, aggre_by = "my_stratum", estimators = "S_ch", weights = wdt ) stopifnot( inherits(sdt_da, "data.table"), c("ts_cal_start", "ts_fut_start") %in% names(sdt_da), c("S_ch_est", "S_ch_se", "S_ch_lo", "S_ch_hi") %in% names(sdt_da) ) # observed survival with individual adjusting, period estimates sdt_iw <- popEpi::surv_lexis( lexis = sire, breaks = bl_multi_period, aggre_by = "my_stratum", estimators = "S_ch", weights = "individual_weight" ) stopifnot( inherits(sdt_iw, "data.table"), c("ts_cal_start", "ts_fut_start") %in% names(sdt_iw), c("S_ch_est", "S_ch_se", "S_ch_lo", "S_ch_hi") %in% names(sdt_iw), # direct adjusting and individual weighting produce similar results. # the larger the dataset, the smaller the difference between the two. max(abs(sdt_da[["S_ch_est"]] - sdt_iw[["S_ch_est"]])) < 0.03, max(abs(sdt_da[["S_ch_se"]] - sdt_iw[["S_ch_se"]])) < 0.006 ) # a few common survival time function estimates sdt <- popEpi::surv_lexis( lexis = sire, breaks = bl, merge_dt_by = c("sex", "ts_cal", "ts_age"), merge_dt = pm, aggre_by = "my_stratum", estimators = c( "S_ch", "RS_e2_ch", "NS_pp_ch" ) ) stopifnot( inherits(sdt, "data.table"), c("S_ch_est", "RS_e2_ch_est", "NS_pp_ch_est") %in% names(sdt) ) # your very own estimator (the example `se` is made up) sdt <- popEpi::surv_lexis( lexis = sire, breaks = bl, merge_dt_by = c("sex", "ts_cal", "ts_age"), merge_dt = pm, aggre_by = "my_stratum", estimators = list( my_estimator = list( est = quote(n_events / n_events_exp_e2), se = quote(sqrt(n_events / (n_events_exp_e2 ^ 2))) ) ), conf_method = "identity" ) stopifnot( c("my_estimator_est", "my_estimator_se") %in% names(sdt) ) # your very own estimator with direct adjusting sdt <- popEpi::surv_lexis( lexis = sire, breaks = bl, merge_dt_by = c("sex", "ts_cal", "ts_age"), merge_dt = pm, aggre_by = "my_stratum", estimators = list( my_estimator = list( est = quote(n_events / n_events_exp_e2), se = quote(sqrt(n_events / (n_events_exp_e2 ^ 2))) ) ), conf_method = "identity", weights = wdt ) stopifnot( c("my_estimator_est", "my_estimator_se") %in% names(sdt) ) # your very own estimator with individual weighting sdt <- popEpi::surv_lexis( lexis = sire, breaks = bl, merge_dt_by = c("sex", "ts_cal", "ts_age"), merge_dt = pm, aggre_by = "my_stratum", estimators = list( my_estimator = list( est = quote(n_events / n_events_exp_e2), se = quote(sqrt(n_events / (n_events_exp_e2 ^ 2))) ) ), conf_method = "identity", weights = "individual_weight" ) stopifnot( c("my_estimator_est", "my_estimator_se") %in% names(sdt) ) # your very own method of confidence interval estimation sdt <- popEpi::surv_lexis( lexis = sire, breaks = bl, estimators = c("S_ch", "S_lt"), conf_methods = list(list( g = quote(qnorm(theta)), g_inv = quote(pnorm(g)) )), conf_lvls = 0.95, weights = "individual_weight" ) stopifnot( c("S_ch_lo", "S_ch_hi") %in% names(sdt) ) # a bunch of estimators, cause-specific ones from state 0 to 1 or 2 # and non-cause-specific ones also sdt <- popEpi::surv_lexis( lexis = sire, breaks = bl, merge_dt_by = c("sex", "ts_cal", "ts_age"), merge_dt = pm, aggre_by = NULL, subset = NULL, estimators = c( "h_ch_[0, 1]", "h_ch_[0, 2]", "H_ch_[0, 1]", "H_ch_[0, 2]", "F_ch_[0, 1]", "F_ch_[0, 2]", "S_ch_[0, 1]", "S_ch_[0, 2]", "S_ch_[0, 1:2]", "H_ch", "H_lt", "h_ch", "h_lt", "F_ch", "F_lt", "S_ch", "S_lt", "NS_pp_ch", "RF_e2_ch" ) ) stopifnot( # hazard is additive. all.equal( sdt[["h_ch_est"]], sdt[["h_ch_[0, 1]_est"]] + sdt[["h_ch_[0, 2]_est"]] ), # cumulative hazard is additive. all.equal( sdt[["H_ch_est"]], sdt[["H_ch_[0, 1]_est"]] + sdt[["H_ch_[0, 2]_est"]] ), # competing risks cdf is additive. in epidemiology the cdf is often called # the cumulative incidence function. all.equal( sdt[["F_ch_est"]], sdt[["F_ch_[0, 1]_est"]] + sdt[["F_ch_[0, 2]_est"]] ), # competing risks survival is NOT additive! sdt[["S_ch_[0, 1]_est"]] + sdt[["S_ch_[0, 2]_est"]] > 1, # you can define multiple states in the cause-specific transition. # this one covers all exit statuses and is therefore the same as overall # survival by definition. sdt[["S_ch_[0, 1:2]_est"]] == sdt[["S_ch_est"]]#, # the different estimation methods produce about the same results. # abs(sdt[["S_ch_est"]] - sdt[["S_lt_est"]]) < 0.01, # abs(sdt[["H_ch_est"]] - sdt[["H_lt_est"]]) < 0.01, # net survival is approximately the same as cause-specific survival in this # rather ideal dataset. # abs(sdt[["S_ch_[0, 1]_est"]] - sdt[["NS_pp_ch_est"]]) < 0.01, # the ederer 2 based "relative cdf" is approximately the same as # cause-specific cdf in this rather ideal dataset. # abs(sdt[["F_ch_[0, 1]_est"]] - sdt[["RF_e2_ch_est"]]) < 0.01 ) # factor status data.table::set( x = sire, j = c("lex.Cst", "lex.Xst"), value = list( factor( x = sire[["lex.Cst"]], levels = 0:2, labels = c("alive", "dead from cancer", "dead from other") ), factor( x = sire[["lex.Xst"]], levels = 0:2, labels = c("alive", "dead from cancer", "dead from other") ) ) ) sdt <- popEpi::surv_lexis( lexis = sire, breaks = bl, merge_dt_by = c("sex", "ts_cal", "ts_age"), merge_dt = pm, aggre_by = NULL, subset = NULL, estimators = c( "F_ch_['alive', 'dead from cancer']", "F_ch_['alive', 'dead from other']" ) ) stopifnot( # produces the estimates as intended. maybe these are not too pretty but # you could use shorter factor labels such as c("a", "dc", "do"). "F_ch_['alive', 'dead from cancer']_est" %in% names(sdt) ) # TRUE/FALSE status data.table::set( x = sire, j = c("lex.Cst", "lex.Xst"), value = list( sire[["lex.Cst"]] != "alive", sire[["lex.Xst"]] != "alive" ) ) sdt <- popEpi::surv_lexis( lexis = sire, breaks = bl, merge_dt_by = c("sex", "ts_cal", "ts_age"), merge_dt = pm, aggre_by = NULL, subset = NULL, estimators = c( "F_ch", "F_ch_[F, T]" ) ) stopifnot( "F_ch_[F, T]_est" %in% names(sdt), max(abs(sdt[["F_ch_[F, T]_est"]] - sdt[["F_ch_est"]])) < 1e-15 ) # more arguments to popEpi::lexis_split_merge_aggregate_by_stratum sdt <- popEpi::surv_lexis( lexis = sire, breaks = bl, split_merge_aggregate_optional_args = list( breaks_collapse_args = list(mandatory_breaks = 0:5) ), estimators = "S_ch" ) stopifnot( 1:5 %in% sdt[["ts_fut_stop"]] )
Computes mean survival times based on survival estimation up to
a point in follow-up time (e.g. 10 years),
after which survival is extrapolated
using an appropriate hazard data file (pophaz) to yield the "full"
survival curve. The area under the full survival curve is the mean survival.
survmean( formula, data, adjust = NULL, weights = NULL, breaks = NULL, pophaz = NULL, e1.breaks = NULL, e1.pophaz = pophaz, r = "auto", surv.method = "hazard", subset = NULL, verbose = FALSE )survmean( formula, data, adjust = NULL, weights = NULL, breaks = NULL, pophaz = NULL, e1.breaks = NULL, e1.pophaz = pophaz, r = "auto", surv.method = "hazard", subset = NULL, verbose = FALSE )
formula |
a |
data |
a |
adjust |
variables to adjust estimates by, e.g. |
weights |
weights to use to adjust mean survival times. See the
dedicated help page for more details on
weighting. |
breaks |
a list of breaks defining the time window to compute
observed survival in, and the intervals used in estimation. E.g.
|
pophaz |
a data set of population hazards passed to
|
e1.breaks |
|
e1.pophaz |
Same as |
r |
either a numeric multiplier such as |
surv.method |
passed to |
subset |
a logical condition; e.g. |
verbose |
|
Basics
survmean computes mean survival times. For median survival times
(i.e. where 50 % of subjects have died or met some other event)
use [survtab].
The mean survival time is simply the area under the survival curve. However, since full follow-up rarely happens, the observed survival curves are extrapolated using expected survival: E.g. one might compute observed survival till up to 10 years and extrapolate beyond that (till e.g. 50 years) to yield an educated guess on the full observed survival curve.
The area is computed by trapezoidal integration of the area under the curve. This function also computes the "full" expected survival curve from T = 0 till e.g. T = 50 depending on supplied arguments. The expected mean survival time is the area under the mean expected survival curve. This function returns the mean expected survival time to be compared with the mean survival time and for computing years of potential life lost (YPLL).
Results can be formed by strata and adjusted for e.g. age by using
the formula argument as in survtab. See also Examples.
Extrapolation tweaks
Argument r controls the relative survival ratio (RSR) assumed to
persist beyond the time window where observed survival is computed
(defined by argument breaks; e.g. up to FUT = 10).
The RSR is simply RSR_i = p_oi / p_ei for a time interval i,
i.e. the observed divided by the expected
(conditional, not cumulative) probability of surviving from the beginning of
a time interval till its end. The cumulative product of RSR_i
over time is the (cumulative) relative survival curve.
If r is numeric, e.g. 0.995, that RSR level is assumed
to persist beyond the observed survival curve.
Numeric r should be > 0 and expressed at the annual level
when using fractional years as the scale of the time variables.
E.g. if RSR is known to be 0.95 at the month level, then the
annualized RSR is 0.95^12. This enables correct usage of the RSR
with survival intervals of varying lengths. When using day-level time
variables (such as Dates; see as.Date), numeric r
should be expressed at the day level, etc.
If r is "auto" or "auto1", this function computes
RSR estimates internally and automatically uses the RSR_i
in the last survival interval in each stratum (and adjusting group)
and assumes that to persist beyond the observed survival curve.
Automatic determination of r is a good starting point,
but in situations where the RSR estimate is uncertain it may produce poor
results. Using "autoX" such as "auto6" causes survmean
to use the mean of the estimated RSRs in the last X survival intervals,
which may be more stable.
Automatic determination will not use values >1 but set them to 1.
Visual inspection of the produced curves is always recommended: see
Examples.
One may also tweak the accuracy and length of extrapolation and
expected survival curve computation by using
e1.breaks. By default this is whatever was supplied to breaks
for the survival time scale, to which
c(seq(1/12, 1, 1/12), seq(1.2, 1.8, 0.2), 2:19, seq(20, 50, 5))
is added after the maximum value, e.g. with breaks = list(FUT = 0:10)
we have
..., 10+1/12, ..., 11, 11.2, ..., 2, 3, ..., 19, 20, 25, ... 50
as the e1.breaks. Supplying e1.breaks manually requires
the breaks over time survival time scale supplied to argument breaks
to be reiterated in e1.breaks; see Examples. NOTE: the
default extrapolation breaks assume the time scales in the data to be
expressed as fractional years, meaning this will work extremely poorly
when using e.g. day-level time scales (such as Date variables).
Set the extrapolation breaks manually in such cases.
Returns a data.frame or data.table (depending on
getOptions("popEpi.datatable"); see ?popEpi) containing the
following columns:
est: The estimated mean survival time
exp: The computed expected survival time
obs: Counts of subjects in data
YPLL: Years of Potential Life Lost, computed as
((exp - est) * obs) — though your time data may be in e.g. days,
this column will have the same name regardless.
The returned data also has columns named according to the variables
supplied to the right-hand-side of the formula.
Joonas Miettinen
Other survmean functions:
Surv(),
lines.survmean(),
plot.survmean()
Other main functions:
Surv(),
rate(),
relpois(),
relpois_ag(),
sir(),
sirspline(),
survtab(),
survtab_ag()
library(Epi) ## take 500 subjects randomly for demonstration data(sire) sire <- sire[sire$dg_date < sire$ex_date, ] set.seed(1L) sire <- sire[sample(x = nrow(sire), size = 500),] ## NOTE: recommended to use factor status variable x <- Lexis(entry = list(FUT = 0, AGE = dg_age, CAL = get.yrs(dg_date)), exit = list(CAL = get.yrs(ex_date)), data = sire, exit.status = factor(status, levels = 0:2, labels = c("alive", "canD", "othD")), merge = TRUE) ## phony variable set.seed(1L) x$group <- rbinom(nrow(x), 1, 0.5) ## age group x$agegr <- cut(x$dg_age, c(0,45,60,Inf), right=FALSE) ## population hazards data set pm <- data.frame(popEpi::popmort) names(pm) <- c("sex", "CAL", "AGE", "haz") ## breaks to define observed survival estimation BL <- list(FUT = seq(0, 10, 1/12)) ## crude mean survival sm1 <- survmean(Surv(FUT, lex.Xst != "alive") ~ 1, pophaz = pm, data = x, weights = NULL, breaks = BL) sm1 <- survmean(FUT ~ 1, pophaz = pm, data = x, weights = NULL, breaks = BL) ## mean survival by group sm2 <- survmean(FUT ~ group, pophaz = pm, data = x, weights = NULL, breaks = BL) ## ... and adjusted for age using internal weights (counts of subjects) ## note: need also longer extrapolation here so that all curves ## converge to zero in the end. eBL <- list(FUT = c(BL$FUT, 11:75)) sm3 <- survmean(FUT ~ group + adjust(agegr), pophaz = pm, data = x, weights = "internal", breaks = BL, e1.breaks = eBL) ## visual inspection of how realistic extrapolation is for each stratum; ## solid lines are observed + extrapolated survivals; ## dashed lines are expected survivals plot(sm1) ## plotting object with both stratification and standardization ## plots curves for each strata-std.group combination plot(sm3) ## for finer control of plotting these curves, you may extract ## from the survmean object using e.g. attributes(sm3)$survmean.meta$curves #### using Dates x <- Lexis(entry = list(FUT = 0L, AGE = dg_date-bi_date, CAL = dg_date), exit = list(CAL = ex_date), data = sire[sire$dg_date < sire$ex_date, ], exit.status = factor(status, levels = 0:2, labels = c("alive", "canD", "othD")), merge = TRUE) ## phony group variable set.seed(1L) x$group <- rbinom(nrow(x), 1, 0.5) ## NOTE: population hazard should be reported at the same scale ## as time variables in your Lexis data. data(popmort, package = "popEpi") pm <- data.frame(popmort) names(pm) <- c("sex", "CAL", "AGE", "haz") ## from year to day level pm$haz <- pm$haz/365.25 pm$CAL <- as.Date(paste0(pm$CAL, "-01-01")) pm$AGE <- pm$AGE*365.25 BL <- list(FUT = seq(0, 8, 1/12)*365.25) eBL <- list(FUT = c(BL$FUT, c(8.25,8.5,9:60)*365.25)) smd <- survmean(FUT ~ group, data = x, pophaz = pm, verbose = TRUE, r = "auto5", breaks = BL, e1.breaks = eBL) plot(smd)library(Epi) ## take 500 subjects randomly for demonstration data(sire) sire <- sire[sire$dg_date < sire$ex_date, ] set.seed(1L) sire <- sire[sample(x = nrow(sire), size = 500),] ## NOTE: recommended to use factor status variable x <- Lexis(entry = list(FUT = 0, AGE = dg_age, CAL = get.yrs(dg_date)), exit = list(CAL = get.yrs(ex_date)), data = sire, exit.status = factor(status, levels = 0:2, labels = c("alive", "canD", "othD")), merge = TRUE) ## phony variable set.seed(1L) x$group <- rbinom(nrow(x), 1, 0.5) ## age group x$agegr <- cut(x$dg_age, c(0,45,60,Inf), right=FALSE) ## population hazards data set pm <- data.frame(popEpi::popmort) names(pm) <- c("sex", "CAL", "AGE", "haz") ## breaks to define observed survival estimation BL <- list(FUT = seq(0, 10, 1/12)) ## crude mean survival sm1 <- survmean(Surv(FUT, lex.Xst != "alive") ~ 1, pophaz = pm, data = x, weights = NULL, breaks = BL) sm1 <- survmean(FUT ~ 1, pophaz = pm, data = x, weights = NULL, breaks = BL) ## mean survival by group sm2 <- survmean(FUT ~ group, pophaz = pm, data = x, weights = NULL, breaks = BL) ## ... and adjusted for age using internal weights (counts of subjects) ## note: need also longer extrapolation here so that all curves ## converge to zero in the end. eBL <- list(FUT = c(BL$FUT, 11:75)) sm3 <- survmean(FUT ~ group + adjust(agegr), pophaz = pm, data = x, weights = "internal", breaks = BL, e1.breaks = eBL) ## visual inspection of how realistic extrapolation is for each stratum; ## solid lines are observed + extrapolated survivals; ## dashed lines are expected survivals plot(sm1) ## plotting object with both stratification and standardization ## plots curves for each strata-std.group combination plot(sm3) ## for finer control of plotting these curves, you may extract ## from the survmean object using e.g. attributes(sm3)$survmean.meta$curves #### using Dates x <- Lexis(entry = list(FUT = 0L, AGE = dg_date-bi_date, CAL = dg_date), exit = list(CAL = ex_date), data = sire[sire$dg_date < sire$ex_date, ], exit.status = factor(status, levels = 0:2, labels = c("alive", "canD", "othD")), merge = TRUE) ## phony group variable set.seed(1L) x$group <- rbinom(nrow(x), 1, 0.5) ## NOTE: population hazard should be reported at the same scale ## as time variables in your Lexis data. data(popmort, package = "popEpi") pm <- data.frame(popmort) names(pm) <- c("sex", "CAL", "AGE", "haz") ## from year to day level pm$haz <- pm$haz/365.25 pm$CAL <- as.Date(paste0(pm$CAL, "-01-01")) pm$AGE <- pm$AGE*365.25 BL <- list(FUT = seq(0, 8, 1/12)*365.25) eBL <- list(FUT = c(BL$FUT, c(8.25,8.5,9:60)*365.25)) smd <- survmean(FUT ~ group, data = x, pophaz = pm, verbose = TRUE, r = "auto5", breaks = BL, e1.breaks = eBL) plot(smd)
This function estimates survival time functions: survival, relative/net survival, and crude/absolute risk functions (CIF).
survtab( formula, data, adjust = NULL, breaks = NULL, pophaz = NULL, weights = NULL, surv.type = "surv.rel", surv.method = "hazard", relsurv.method = "e2", subset = NULL, conf.level = 0.95, conf.type = "log-log", verbose = FALSE )survtab( formula, data, adjust = NULL, breaks = NULL, pophaz = NULL, weights = NULL, surv.type = "surv.rel", surv.method = "hazard", relsurv.method = "e2", subset = NULL, conf.level = 0.95, conf.type = "log-log", verbose = FALSE )
formula |
a |
data |
a |
adjust |
can be used as an alternative to passing variables to
argument |
breaks |
a named list of breaks, e.g.
|
pophaz |
a |
weights |
typically a list of weights or a |
surv.type |
one of |
surv.method |
either |
relsurv.method |
either |
subset |
a logical condition; e.g. |
conf.level |
confidence level used in confidence intervals;
e.g. |
conf.type |
character string; must be one of |
verbose |
logical; if |
Returns a table of life time function values and other information with survival intervals as rows. Returns some of the following estimates of survival time functions:
surv.obs - observed (raw, overall) survival
surv.obs.K - observed cause-specific survival for cause K
CIF_k - cumulative incidence function for cause k
CIF.rel - cumulative incidence function using excess cases
r.e2 - relative survival, EdererII
r.pp - relative survival, Pohar-Perme weighted
The suffix .as implies adjusted estimates, and .lo and
.hi imply lower and upper confidence limits, respectively.
The prefix SE. stands for standard error.
This function computes interval-based estimates of survival time functions, where the intervals are set by the user. For product-limit-based estimation see packages survival and relsurv.
if surv.type = 'surv.obs', only 'raw' observed survival
is estimated over the chosen time intervals. With
surv.type = 'surv.rel', also relative survival estimates
are supplied in addition to observed survival figures.
surv.type = 'cif.obs' requests cumulative incidence functions (CIF)
to be estimated.
CIFs are estimated for each competing risk based
on a survival-interval-specific proportional hazards
assumption as described by Chiang (1968).
With surv.type = 'cif.rel', a CIF is estimated with using
excess cases as the ”cause-specific” cases. Finally, with
surv.type = 'surv.cause', cause-specific survivals are
estimated separately for each separate type of event.
In hazard-based estimation (surv.method = "hazard") survival
time functions are transformations of the estimated corresponding hazard
in the intervals. The hazard itself is estimated using counts of events
(or excess events) and total subject-time in the interval. Life table
surv.method = "lifetable" estimates are constructed as transformations
of probabilities computed using counts of events and counts of subjects
at risk.
The vignette survtab_examples has some practical examples.
When surv.type = 'surv.rel', the user can choose
relsurv.method = 'pp', whereupon Pohar-Perme weighting is used.
By default relsurv.method = 'e2', i.e. the Ederer II method
is used to estimate relative survival.
Adjusted estimates in this context mean computing estimates separately by the levels of adjusting variables and returning weighted averages of the estimates. For example, computing estimates separately by age groups and returning a weighted average estimate (age-adjusted estimate).
Adjusting requires specification of both the adjusting variables and
the weights for all the levels of the adjusting variables. The former can be
accomplished by using adjust() with the argument formula,
or by supplying variables directly to argument adjust. E.g. the
following are all equivalent:
formula = fot ~ sex + adjust(agegr) + adjust(area)
formula = fot ~ sex + adjust(agegr, area)
formula = fot ~ sex, adjust = c("agegr", "area")
formula = fot ~ sex, adjust = list(agegr, area)
The adjusting variables must match with the variable names in the
argument weights;
see the dedicated help page.
Typically weights are supplied as a list or
a data.frame. The former can be done by e.g.
weights = list(agegr = VEC1, area = VEC2),
where VEC1 and VEC2 are vectors of weights (which do not
have to add up to one). See
survtab_examples
for an example of using a data.frame to pass weights.
To calculate e.g. period analysis (delayed entry) estimates, limit the data when/before supplying to this function.See survtab_examples.
Perme, Maja Pohar, Janez Stare, and Jacques Esteve. "On estimation in relative survival." Biometrics 68.1 (2012): 113-120. doi:10.1111/j.1541-0420.2011.01640.x
Hakulinen, Timo, Karri Seppa, and Paul C. Lambert. "Choosing the relative survival method for cancer survival estimation." European Journal of Cancer 47.14 (2011): 2202-2210. doi:10.1016/j.ejca.2011.03.011
Seppa, Karri, Timo Hakulinen, and Arun Pokhrel. "Choosing the net survival method for cancer survival estimation." European Journal of Cancer (2013). doi:10.1016/j.ejca.2013.09.019
CHIANG, Chin Long. Introduction to stochastic processes in biostatistics. 1968. ISBN-14: 978-0471155003
Seppa K., Dyba T. and Hakulinen T.: Cancer Survival, Reference Module in Biomedical Sciences. Elsevier. 08-Jan-2015. doi:10.1016/B978-0-12-801238-3.02745-8
[splitMulti], [lexpand],
[ICSS], [sire]
The survtab_examples vignette
Other main functions:
Surv(),
rate(),
relpois(),
relpois_ag(),
sir(),
sirspline(),
survmean(),
survtab_ag()
Other survtab functions:
Surv(),
lines.survtab(),
plot.survtab(),
print.survtab(),
summary.survtab(),
survtab_ag()
data("sire", package = "popEpi") library(Epi) ## NOTE: recommended to use factor status variable x <- Lexis(entry = list(FUT = 0, AGE = dg_age, CAL = get.yrs(dg_date)), exit = list(CAL = get.yrs(ex_date)), data = sire[sire$dg_date < sire$ex_date, ], exit.status = factor(status, levels = 0:2, labels = c("alive", "canD", "othD")), merge = TRUE) ## phony group variable set.seed(1L) x$group <- rbinom(nrow(x), 1, 0.5) ## observed survival. explicit supplying of status: st <- survtab(Surv(time = FUT, event = lex.Xst) ~ group, data = x, surv.type = "surv.obs", breaks = list(FUT = seq(0, 5, 1/12))) ## this assumes the status is lex.Xst (right 99.9 % of the time) st <- survtab(FUT ~ group, data = x, surv.type = "surv.obs", breaks = list(FUT = seq(0, 5, 1/12))) ## relative survival (ederer II) data("popmort", package = "popEpi") pm <- data.frame(popmort) names(pm) <- c("sex", "CAL", "AGE", "haz") st <- survtab(FUT ~ group, data = x, surv.type = "surv.rel", pophaz = pm, breaks = list(FUT = seq(0, 5, 1/12))) ## ICSS weights usage data("ICSS", package = "popEpi") cut <- c(0, 30, 50, 70, Inf) agegr <- cut(ICSS$age, cut, right = FALSE) w <- aggregate(ICSS1~agegr, data = ICSS, FUN = sum) x$agegr <- cut(x$dg_age, cut, right = FALSE) st <- survtab(FUT ~ group + adjust(agegr), data = x, surv.type = "surv.rel", pophaz = pm, weights = w$ICSS1, breaks = list(FUT = seq(0, 5, 1/12))) #### using dates with survtab x <- Lexis(entry = list(FUT = 0L, AGE = dg_date-bi_date, CAL = dg_date), exit = list(CAL = ex_date), data = sire[sire$dg_date < sire$ex_date, ], exit.status = factor(status, levels = 0:2, labels = c("alive", "canD", "othD")), merge = TRUE) ## phony group variable set.seed(1L) x$group <- rbinom(nrow(x), 1, 0.5) st <- survtab(Surv(time = FUT, event = lex.Xst) ~ group, data = x, surv.type = "surv.obs", breaks = list(FUT = seq(0, 5, 1/12)*365.25)) ## NOTE: population hazard should be reported at the same scale ## as time variables in your Lexis data. data(popmort, package = "popEpi") pm <- data.frame(popmort) names(pm) <- c("sex", "CAL", "AGE", "haz") ## from year to day level pm$haz <- pm$haz/365.25 pm$CAL <- as.Date(paste0(pm$CAL, "-01-01")) pm$AGE <- pm$AGE*365.25 st <- survtab(Surv(time = FUT, event = lex.Xst) ~ group, data = x, surv.type = "surv.rel", relsurv.method = "e2", pophaz = pm, breaks = list(FUT = seq(0, 5, 1/12)*365.25))data("sire", package = "popEpi") library(Epi) ## NOTE: recommended to use factor status variable x <- Lexis(entry = list(FUT = 0, AGE = dg_age, CAL = get.yrs(dg_date)), exit = list(CAL = get.yrs(ex_date)), data = sire[sire$dg_date < sire$ex_date, ], exit.status = factor(status, levels = 0:2, labels = c("alive", "canD", "othD")), merge = TRUE) ## phony group variable set.seed(1L) x$group <- rbinom(nrow(x), 1, 0.5) ## observed survival. explicit supplying of status: st <- survtab(Surv(time = FUT, event = lex.Xst) ~ group, data = x, surv.type = "surv.obs", breaks = list(FUT = seq(0, 5, 1/12))) ## this assumes the status is lex.Xst (right 99.9 % of the time) st <- survtab(FUT ~ group, data = x, surv.type = "surv.obs", breaks = list(FUT = seq(0, 5, 1/12))) ## relative survival (ederer II) data("popmort", package = "popEpi") pm <- data.frame(popmort) names(pm) <- c("sex", "CAL", "AGE", "haz") st <- survtab(FUT ~ group, data = x, surv.type = "surv.rel", pophaz = pm, breaks = list(FUT = seq(0, 5, 1/12))) ## ICSS weights usage data("ICSS", package = "popEpi") cut <- c(0, 30, 50, 70, Inf) agegr <- cut(ICSS$age, cut, right = FALSE) w <- aggregate(ICSS1~agegr, data = ICSS, FUN = sum) x$agegr <- cut(x$dg_age, cut, right = FALSE) st <- survtab(FUT ~ group + adjust(agegr), data = x, surv.type = "surv.rel", pophaz = pm, weights = w$ICSS1, breaks = list(FUT = seq(0, 5, 1/12))) #### using dates with survtab x <- Lexis(entry = list(FUT = 0L, AGE = dg_date-bi_date, CAL = dg_date), exit = list(CAL = ex_date), data = sire[sire$dg_date < sire$ex_date, ], exit.status = factor(status, levels = 0:2, labels = c("alive", "canD", "othD")), merge = TRUE) ## phony group variable set.seed(1L) x$group <- rbinom(nrow(x), 1, 0.5) st <- survtab(Surv(time = FUT, event = lex.Xst) ~ group, data = x, surv.type = "surv.obs", breaks = list(FUT = seq(0, 5, 1/12)*365.25)) ## NOTE: population hazard should be reported at the same scale ## as time variables in your Lexis data. data(popmort, package = "popEpi") pm <- data.frame(popmort) names(pm) <- c("sex", "CAL", "AGE", "haz") ## from year to day level pm$haz <- pm$haz/365.25 pm$CAL <- as.Date(paste0(pm$CAL, "-01-01")) pm$AGE <- pm$AGE*365.25 st <- survtab(Surv(time = FUT, event = lex.Xst) ~ group, data = x, surv.type = "surv.rel", relsurv.method = "e2", pophaz = pm, breaks = list(FUT = seq(0, 5, 1/12)*365.25))
This function estimates survival time functions: survival, relative/net survival, and crude/absolute risk functions (CIF).
survtab_ag( formula = NULL, data, adjust = NULL, weights = NULL, surv.breaks = NULL, n = "at.risk", d = "from0to1", n.cens = "from0to0", pyrs = "pyrs", d.exp = "d.exp", n.pp = NULL, d.pp = "d.pp", d.pp.2 = "d.pp.2", n.cens.pp = "n.cens.pp", pyrs.pp = "pyrs.pp", d.exp.pp = "d.exp.pp", surv.type = "surv.rel", surv.method = "hazard", relsurv.method = "e2", subset = NULL, conf.level = 0.95, conf.type = "log-log", verbose = FALSE )survtab_ag( formula = NULL, data, adjust = NULL, weights = NULL, surv.breaks = NULL, n = "at.risk", d = "from0to1", n.cens = "from0to0", pyrs = "pyrs", d.exp = "d.exp", n.pp = NULL, d.pp = "d.pp", d.pp.2 = "d.pp.2", n.cens.pp = "n.cens.pp", pyrs.pp = "pyrs.pp", d.exp.pp = "d.exp.pp", surv.type = "surv.rel", surv.method = "hazard", relsurv.method = "e2", subset = NULL, conf.level = 0.95, conf.type = "log-log", verbose = FALSE )
formula |
a |
data |
since popEpi 0.4.0, a |
adjust |
can be used as an alternative to passing variables to
argument |
weights |
typically a list of weights or a |
surv.breaks |
a vector of breaks on the
survival time scale. Optional if |
n |
variable containing counts of subjects at-risk at the start of a
time interval; e.g. |
d |
variable(s) containing counts of subjects experiencing an event.
With only one type of event, e.g. |
n.cens |
variable containing counts of subjects censored during a
survival time interval; E.g. |
pyrs |
variable containing total subject-time accumulated within a
survival time interval; E.g. |
d.exp |
variable denoting total "expected numbers of events"
(typically computed |
n.pp |
variable containing total Pohar-Perme weighted counts of
subjects at risk in an interval,
supplied as argument |
d.pp |
variable(s) containing Pohar-Perme weighted counts of events,
supplied as argument |
d.pp.2 |
variable(s) containing total Pohar-Perme
"double-weighted" counts of events,
supplied as argument |
n.cens.pp |
variable containing total Pohar-Perme weighted counts
censorings,
supplied as argument |
pyrs.pp |
variable containing total Pohar-Perme weighted subject-times,
supplied as argument |
d.exp.pp |
variable containing total Pohar-Perme weighted counts
of excess events,
supplied as argument |
surv.type |
one of |
surv.method |
either |
relsurv.method |
either |
subset |
a logical condition; e.g. |
conf.level |
confidence level used in confidence intervals;
e.g. |
conf.type |
character string; must be one of |
verbose |
logical; if |
Returns a table of life time function values and other information with survival intervals as rows. Returns some of the following estimates of survival time functions:
surv.obs - observed (raw, overall) survival
surv.obs.K - observed cause-specific survival for cause K
CIF_k - cumulative incidence function for cause k
CIF.rel - cumulative incidence function using excess cases
r.e2 - relative survival, EdererII
r.pp - relative survival, Pohar-Perme weighted
The suffix .as implies adjusted estimates, and .lo and
.hi imply lower and upper confidence limits, respectively.
The prefix SE. stands for standard error.
This function computes interval-based estimates of survival time functions, where the intervals are set by the user. For product-limit-based estimation see packages survival and relsurv.
if surv.type = 'surv.obs', only 'raw' observed survival
is estimated over the chosen time intervals. With
surv.type = 'surv.rel', also relative survival estimates
are supplied in addition to observed survival figures.
surv.type = 'cif.obs' requests cumulative incidence functions (CIF)
to be estimated.
CIFs are estimated for each competing risk based
on a survival-interval-specific proportional hazards
assumption as described by Chiang (1968).
With surv.type = 'cif.rel', a CIF is estimated with using
excess cases as the ”cause-specific” cases. Finally, with
surv.type = 'surv.cause', cause-specific survivals are
estimated separately for each separate type of event.
In hazard-based estimation (surv.method = "hazard") survival
time functions are transformations of the estimated corresponding hazard
in the intervals. The hazard itself is estimated using counts of events
(or excess events) and total subject-time in the interval. Life table
surv.method = "lifetable" estimates are constructed as transformations
of probabilities computed using counts of events and counts of subjects
at risk.
The vignette survtab_examples has some practical examples.
When surv.type = 'surv.rel', the user can choose
relsurv.method = 'pp', whereupon Pohar-Perme weighting is used.
By default relsurv.method = 'e2', i.e. the Ederer II method
is used to estimate relative survival.
Adjusted estimates in this context mean computing estimates separately by the levels of adjusting variables and returning weighted averages of the estimates. For example, computing estimates separately by age groups and returning a weighted average estimate (age-adjusted estimate).
Adjusting requires specification of both the adjusting variables and
the weights for all the levels of the adjusting variables. The former can be
accomplished by using adjust() with the argument formula,
or by supplying variables directly to argument adjust. E.g. the
following are all equivalent:
formula = fot ~ sex + adjust(agegr) + adjust(area)
formula = fot ~ sex + adjust(agegr, area)
formula = fot ~ sex, adjust = c("agegr", "area")
formula = fot ~ sex, adjust = list(agegr, area)
The adjusting variables must match with the variable names in the
argument weights;
see the dedicated help page.
Typically weights are supplied as a list or
a data.frame. The former can be done by e.g.
weights = list(agegr = VEC1, area = VEC2),
where VEC1 and VEC2 are vectors of weights (which do not
have to add up to one). See
survtab_examples
for an example of using a data.frame to pass weights.
To calculate e.g. period analysis (delayed entry) estimates, limit the data when/before supplying to this function.See survtab_examples.
survtab_ag computes estimates of survival time functions using
pre-aggregated data. For using subject-level data directly, use
[survtab]. For aggregating data, see [lexpand]
and [aggre].
By default, and if data is an aggre object (not mandatory),
survtab_ag makes use of the exact same breaks that were used in
splitting the original data (with e.g. lexpand), so it is not
necessary to specify any surv.breaks. If specified, the
surv.breaks must be a subset of the pertinent
pre-existing breaks. When data is not an aggre object, breaks
must always be specified. Interval lengths (delta in output) are
also calculated based on whichever breaks are used,
so the upper limit of the breaks should
therefore be meaningful and never e.g. Inf.
Perme, Maja Pohar, Janez Stare, and Jacques Esteve. "On estimation in relative survival." Biometrics 68.1 (2012): 113-120. doi:10.1111/j.1541-0420.2011.01640.x
Hakulinen, Timo, Karri Seppa, and Paul C. Lambert. "Choosing the relative survival method for cancer survival estimation." European Journal of Cancer 47.14 (2011): 2202-2210. doi:10.1016/j.ejca.2011.03.011
Seppa, Karri, Timo Hakulinen, and Arun Pokhrel. "Choosing the net survival method for cancer survival estimation." European Journal of Cancer (2013). doi:10.1016/j.ejca.2013.09.019
CHIANG, Chin Long. Introduction to stochastic processes in biostatistics. 1968. ISBN-14: 978-0471155003
Seppa K., Dyba T. and Hakulinen T.: Cancer Survival, Reference Module in Biomedical Sciences. Elsevier. 08-Jan-2015. doi:10.1016/B978-0-12-801238-3.02745-8
[splitMulti], [lexpand],
[ICSS], [sire]
The survtab_examples vignette
Other main functions:
Surv(),
rate(),
relpois(),
relpois_ag(),
sir(),
sirspline(),
survmean(),
survtab()
Other survtab functions:
Surv(),
lines.survtab(),
plot.survtab(),
print.survtab(),
summary.survtab(),
survtab()
## see more examples with explanations in vignette("survtab_examples") #### survtab_ag usage data("sire", package = "popEpi") ## prepare data for e.g. 5-year "period analysis" for 2008-2012 ## note: sire is a simulated cohort integrated into popEpi. BL <- list(fot=seq(0, 5, by = 1/12), per = c("2008-01-01", "2013-01-01")) x <- lexpand(sire, birth = bi_date, entry = dg_date, exit = ex_date, status = status %in% 1:2, breaks = BL, pophaz = popmort, aggre = list(fot)) ## calculate relative EdererII period method ## NOTE: x is an aggre object here, so surv.breaks are deduced ## automatically st <- survtab_ag(fot ~ 1, data = x) summary(st, t = 1:5) ## annual estimates summary(st, q = list(r.e2 = 0.75)) ## 1st interval where r.e2 < 0.75 at end plot(st) ## non-aggre data: first call to survtab_ag would fail df <- data.frame(x) # st <- survtab_ag(fot ~ 1, data = x) st <- survtab_ag(fot ~ 1, data = x, surv.breaks = BL$fot) ## calculate age-standardised 5-year relative survival ratio using ## Ederer II method and period approach sire$agegr <- cut(sire$dg_age,c(0,45,55,65,75,Inf),right=FALSE) BL <- list(fot=seq(0, 5, by = 1/12), per = c("2008-01-01", "2013-01-01")) x <- lexpand(sire, birth = bi_date, entry = dg_date, exit = ex_date, status = status %in% 1:2, breaks = BL, pophaz = popmort, aggre = list(agegr, fot)) ## age standardisation using internal weights (age distribution of ## patients diagnosed within the period window) ## (NOTE: what is done here is equivalent to using weights = "internal") w <- aggregate(at.risk ~ agegr, data = x[x$fot == 0], FUN = sum) names(w) <- c("agegr", "weights") st <- survtab_ag(fot ~ adjust(agegr), data = x, weights = w) plot(st, y = "r.e2.as", col = c("blue")) ## age standardisation using ICSS1 weights data(ICSS) cut <- c(0, 45, 55, 65, 75, Inf) agegr <- cut(ICSS$age, cut, right = FALSE) w <- aggregate(ICSS1~agegr, data = ICSS, FUN = sum) names(w) <- c("agegr", "weights") st <- survtab_ag(fot ~ adjust(agegr), data = x, weights = w) lines(st, y = "r.e2.as", col = c("red")) ## cause-specific survival sire$stat <- factor(sire$status, 0:2, c("alive", "canD", "othD")) x <- lexpand(sire, birth = bi_date, entry = dg_date, exit = ex_date, status = stat, breaks = BL, pophaz = popmort, aggre = list(agegr, fot)) st <- survtab_ag(fot ~ adjust(agegr), data = x, weights = w, d = c("fromalivetocanD", "fromalivetoothD"), surv.type = "surv.cause") plot(st, y = "surv.obs.fromalivetocanD.as") lines(st, y = "surv.obs.fromalivetoothD.as", col = "red")## see more examples with explanations in vignette("survtab_examples") #### survtab_ag usage data("sire", package = "popEpi") ## prepare data for e.g. 5-year "period analysis" for 2008-2012 ## note: sire is a simulated cohort integrated into popEpi. BL <- list(fot=seq(0, 5, by = 1/12), per = c("2008-01-01", "2013-01-01")) x <- lexpand(sire, birth = bi_date, entry = dg_date, exit = ex_date, status = status %in% 1:2, breaks = BL, pophaz = popmort, aggre = list(fot)) ## calculate relative EdererII period method ## NOTE: x is an aggre object here, so surv.breaks are deduced ## automatically st <- survtab_ag(fot ~ 1, data = x) summary(st, t = 1:5) ## annual estimates summary(st, q = list(r.e2 = 0.75)) ## 1st interval where r.e2 < 0.75 at end plot(st) ## non-aggre data: first call to survtab_ag would fail df <- data.frame(x) # st <- survtab_ag(fot ~ 1, data = x) st <- survtab_ag(fot ~ 1, data = x, surv.breaks = BL$fot) ## calculate age-standardised 5-year relative survival ratio using ## Ederer II method and period approach sire$agegr <- cut(sire$dg_age,c(0,45,55,65,75,Inf),right=FALSE) BL <- list(fot=seq(0, 5, by = 1/12), per = c("2008-01-01", "2013-01-01")) x <- lexpand(sire, birth = bi_date, entry = dg_date, exit = ex_date, status = status %in% 1:2, breaks = BL, pophaz = popmort, aggre = list(agegr, fot)) ## age standardisation using internal weights (age distribution of ## patients diagnosed within the period window) ## (NOTE: what is done here is equivalent to using weights = "internal") w <- aggregate(at.risk ~ agegr, data = x[x$fot == 0], FUN = sum) names(w) <- c("agegr", "weights") st <- survtab_ag(fot ~ adjust(agegr), data = x, weights = w) plot(st, y = "r.e2.as", col = c("blue")) ## age standardisation using ICSS1 weights data(ICSS) cut <- c(0, 45, 55, 65, 75, Inf) agegr <- cut(ICSS$age, cut, right = FALSE) w <- aggregate(ICSS1~agegr, data = ICSS, FUN = sum) names(w) <- c("agegr", "weights") st <- survtab_ag(fot ~ adjust(agegr), data = x, weights = w) lines(st, y = "r.e2.as", col = c("red")) ## cause-specific survival sire$stat <- factor(sire$status, 0:2, c("alive", "canD", "othD")) x <- lexpand(sire, birth = bi_date, entry = dg_date, exit = ex_date, status = stat, breaks = BL, pophaz = popmort, aggre = list(agegr, fot)) st <- survtab_ag(fot ~ adjust(agegr), data = x, weights = w, d = c("fromalivetocanD", "fromalivetoothD"), surv.type = "surv.cause") plot(st, y = "surv.obs.fromalivetocanD.as") lines(st, y = "surv.obs.fromalivetoothD.as", col = "red")
Attempts to convert a numeric object to integer,
but won't if loss of information is imminent (if values after decimal
are not zero for even one value in obj)
try2int(obj, tol = .Machine$double.eps^0.5)try2int(obj, tol = .Machine$double.eps^0.5)
obj |
a numeric vector |
tol |
tolerance; if each numeric value in |
An integer vector if no information is lost in coercion; else numeric
vector.
James Arnold