Title: | Automatic Database Normalisation for Data Frames |
---|---|
Description: | Automatic normalisation of a data frame to third normal form, with the intention of easing the process of data cleaning. (Usage to design your actual database for you is not advised.) Originally inspired by the 'AutoNormalize' library for 'Python' by 'Alteryx' (<https://github.com/alteryx/autonormalize>), with various changes and improvements. Automatic discovery of functional or approximate dependencies, normalisation based on those, and plotting of the resulting "database" via 'Graphviz', with options to exclude some attributes at discovery time, or remove discovered dependencies at normalisation time. |
Authors: | Mark Webster [aut, cre] |
Maintainer: | Mark Webster <[email protected]> |
License: | BSD_3_clause + file LICENSE |
Version: | 2.3.1 |
Built: | 2025-03-20 05:15:04 UTC |
Source: | https://github.com/charnelmouse/autodb |
Generic function, for fetching attribute sets for elements of a relational object.
attrs(x, ...)
attrs(x, ...) <- value
x |
a relational schema object, such as a |
... |
further arguments passed on to methods. |
value |
A character vector of the same length as |
A list, containing a character vector for each element of x.
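As a sketch of the getter (the schema names and attributes here are illustrative, and assume the autodb package is attached), attrs on a relation_schema returns one character vector per schema:

```r
library(autodb)

# attrs() returns the attribute set of each element in the object
rs <- relation_schema(
  list(
    a = list(c("a", "b"), list("a")),
    b = list(c("b", "c"), list("b"))
  ),
  attrs_order = c("a", "b", "c")
)
attrs(rs) # one character vector per schema, named as the schemas are
```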
Generic function, fetching attribute order for relational objects.
attrs_order(x, ...)
attrs_order(x, ...) <- value
x |
an R object, such as a |
... |
further arguments passed on to methods. |
value |
A character vector of the same length as |
All classes in autodb
contain an attrs_order
attribute. It
gives an easy way to find a list of all attributes/variables involved in an
object, but its main purpose is to also assign those attributes a consistent
order when printing or plotting the object.
A character vector, giving attributes in the order in which they're
prioritised for sorting within x.
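A minimal sketch (assuming the autodb package is attached): reassigning attrs_order changes only the sorting priority, not which attributes the object contains.

```r
library(autodb)

rs <- relation_schema(
  list(a = list(c("a", "b"), list("a"))),
  attrs_order = c("a", "b")
)
attrs_order(rs)
# change the priority used when printing or plotting
attrs_order(rs) <- c("b", "a")
attrs_order(rs)
```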
This is a wrapper function for applying normalise, autoref, and decompose.
It takes a data frame and converts it straight into a database, which is the
main intended use case for the package.
autodb(
  df,
  digits = getOption("digits"),
  single_ref = FALSE,
  ensure_lossless = TRUE,
  remove_avoidable = FALSE,
  constants_name = "constants",
  progress = FALSE,
  progress_file = "",
  ...
)
df |
a data.frame, containing the data to be normalised. |
digits |
a positive integer, indicating how many significant digits are
to be used for numeric and complex variables. This is used for both
pre-formatting in |
single_ref |
a logical, FALSE by default. If TRUE, then only one reference between each relation pair is kept when generating foreign key references. If a pair has multiple references, the kept reference refers to the earliest key for the child relation, as sorted by priority order. |
ensure_lossless |
a logical, indicating whether to check whether the
normalisation is lossless. If it is not, then an additional relation is
added to the final "database", containing a key for |
remove_avoidable |
a logical, indicating whether to remove avoidable attributes in relations. If so, then attributes are removed from relations if the keys can be changed such that they are not needed to preserve the given functional dependencies. |
constants_name |
a scalar character, giving the name for any relation created to store constant attributes. If this is the same as a generated relation name, it will be changed, with a warning, to ensure that all relations have a unique name. |
progress |
a logical, for whether to display progress to the user during
dependency search in |
progress_file |
a scalar character or a connection. If |
... |
further arguments passed on to |
Since decompose only works with functional dependencies, not approximate
dependencies, the accuracy passed to discover is fixed at 1.
A database
, containing the data in df
within the
inferred database schema.
# simple example
autodb(ChickWeight)
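Since ensure_lossless is TRUE by default, the resulting database can be rejoined to recover the original data; a sketch using rejoin and df_equiv, documented elsewhere in this package:

```r
library(autodb)

db <- autodb(ChickWeight)
# the decomposition is lossless: rejoining the relations recovers the
# original data, up to row and column order
stopifnot(df_equiv(rejoin(db), ChickWeight))
```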
Adds foreign key references to a relation_schema
object
automatically, replacing any existing references.
autoref(schema, single_ref = FALSE)
schema |
a |
single_ref |
a logical, FALSE by default. If TRUE, then only one reference between each relation pair is kept when generating foreign key references. If a pair has multiple references, the kept reference refers to the earliest key for the child relation, as sorted by priority order. |
The method for generating references is simple. First, it finds every link
between two relation schemas, where the parent contains all the attributes in
one of the child's keys. This is done separately for each of the child's
keys, so there can be multiple links with the same parent and child, unless
single_ref is TRUE.
Second, any transitive references are removed: if there are link relation pairs a -> b, b -> c, and a -> c, then the latter is transitive, and so is removed. If there is a cyclic reference, e.g. where c -> a, then the choice of which link to remove is arbitrary. Cycles cannot occur in sets of relation schemas resulting from decomposing a single table.
A database_schema
object, containing the given relation
schemas and the created foreign key references.
rs <- relation_schema(
  list(
    a_b_c = list(c("a", "b", "c", "d"), list(c("a", "b", "c"))),
    a_b = list(c("a", "b", "d"), list(c("a", "b"), c("b", "d")))
  ),
  letters[1:4]
)
autoref(rs, single_ref = FALSE)
autoref(rs, single_ref = TRUE)
Create a relation data object, using the given relational schema object, with
the resulting relations empty and ready for data insertion using
insert
.
create(x, ...)
x |
a relational schema object, representing the schema to create an
instance of, such as a |
... |
further arguments passed on to methods. |
An instance of the schema. For example, calling create
on a
database_schema
creates a database
, where all
the relations contain zero records.
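A minimal sketch (assuming the autodb package is attached): creating a database from a one-relation database schema yields a relation with zero records, ready for insertion.

```r
library(autodb)

ds <- database_schema(
  relation_schema(
    list(a = list(c("a", "b"), list("a"))),
    attrs_order = c("a", "b")
  ),
  references = list()
)
db <- create(ds)
# the created relation is empty, ready for insert()
nrow(records(db)[[1]])
```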
Enhances a relation
object with foreign key reference
information.
database(relations, references)
relations |
a |
references |
a list of references, each represented by a list containing four character elements. In order, the elements are a scalar giving the name of the child (referrer) schema, a vector giving the child attribute names, a scalar giving the name of the parent (referee) schema, and a vector giving the parent attribute names. The vectors must be of the same length and contain names for attributes present in their respective schemas, and the parent attributes must form a key. |
Unlike relation_schema
and relation
, and like
database_schema
, database
is not designed to be
vector-like: it only holds a single database. This adheres to the usual
package use case, where a single data frame is being analysed at a time.
However, it inherits from relation
, so is vectorised with
respect to its relations.
As with relation
, duplicate relations, after ordering by
attribute, are allowed, and can be removed with unique
.
References, i.e. foreign key references, are allowed to have different
attribute names in the child and parent relations; this can't occur in the
output for autoref
and normalise
.
Subsetting removes any references that involve removed relations.
Removing duplicates with unique
changes references involving
duplicates to involve the kept equivalent relations instead. Renaming
relations with names<-
also changes their names
in the references.
A database
object, containing relations
with
references
stored in an attribute of the same name. References
are stored with their attributes in the order they appear in their
respective relations.
rels <- relation(
  list(
    a = list(
      df = data.frame(a = logical(), b = logical()),
      keys = list("a")
    ),
    b = list(
      df = data.frame(b = logical(), c = logical()),
      keys = list("b", "c")
    )
  ),
  attrs_order = c("a", "b", "c", "d")
)
db <- database(
  rels,
  list(list("a", "b", "b", "b"))
)
print(db)
attrs(db)
stopifnot(identical(
  attrs(db),
  lapply(records(db), names)
))
keys(db)
attrs_order(db)
names(db)
references(db)

# relations can't reference themselves
## Not run:
database(
  relation(
    list(a = list(df = data.frame(a = 1:5), keys = list("a"))),
    c("a", "b")
  ),
  list(list("a", "a", "a", "a"))
)
database(
  relation(
    list(a = list(df = data.frame(a = 1:5, b = 6:10), keys = list("a"))),
    c("a", "b")
  ),
  list(list("a", "b", "a", "a"))
)
## End(Not run)

# an example with references between differently-named attributes
print(database(
  relation(
    list(
      citation = list(
        df = data.frame(citer = 1:5, citee = 6:10),
        keys = list(c("citer", "citee"))
      ),
      article = list(
        df = data.frame(article = 1:10),
        keys = list("article")
      )
    ),
    c("citer", "citee", "article")
  ),
  list(
    list("citation", "citer", "article", "article"),
    list("citation", "citee", "article", "article")
  )
))

# inserting data
insert(db, data.frame(a = 1L, b = 2L, c = 3L, d = 4L))
# data is only inserted into relations where all columns are given...
insert(db, data.frame(a = 1L, b = 2L, c = 3L))
# and that are listed in relations argument
insert(
  db,
  data.frame(a = 1L, b = 2L, c = 3L, d = 4L),
  relations = "b"
)
# inserted data can't violate keys
## Not run:
insert(
  db,
  data.frame(a = 1L, b = 1:2)
)
## End(Not run)
# inserted data can't violate foreign key references
## Not run:
insert(
  db,
  data.frame(a = 1L, b = 2L, c = 3L, d = 4L),
  relations = "a"
)
## End(Not run)

# vector operations
db2 <- database(
  relation(
    list(
      e = list(df = data.frame(a = 1:5, e = 6:10), keys = list("e"))
    ),
    attrs_order = c("a", "e")
  ),
  list()
)
c(db, db2) # attrs_order attributes are merged
unique(c(db, db))

# subsetting
db[1]
stopifnot(identical(db[[1]], db[1]))
db[c(1, 2, 1, 2)] # replicates the foreign key references
c(db[c(1, 2)], db[c(1, 2)]) # doesn't reference between separate copies of db
unique(db[c(1, 2, 1, 2)]) # unique() also merges references

# another example of unique() merging references
db_merge <- database(
  relation(
    list(
      a = list(
        df = data.frame(a = logical(), b = logical()),
        keys = list("a")
      ),
      b = list(
        df = data.frame(b = logical(), c = logical(), d = logical()),
        keys = list("b")
      ),
      c_d = list(
        df = data.frame(c = logical(), d = logical(), e = logical()),
        keys = list(c("c", "d"))
      ),
      a.1 = list(
        df = data.frame(a = logical(), b = logical()),
        keys = list("a")
      ),
      b.1 = list(
        df = data.frame(b = logical(), c = logical(), d = logical()),
        keys = list("b")
      )
    ),
    c("a", "b", "c", "d", "e")
  ),
  list(
    list("a", "b", "b", "b"),
    list("b.1", c("c", "d"), "c_d", c("c", "d"))
  )
)
print(db_merge)
unique(db_merge)

# reassignment
# can't change keys included in references
## Not run: keys(db)[[2]] <- list("c")
# can't remove attributes included in keys
## Not run: attrs(db)[[2]] <- list("c", "d")
# can't remove attributes included in references
## Not run: attrs(db)[[1]] <- c("a", "d")
db3 <- db
# can change subset of schema, but loses references between altered and
# non-altered subsets
db3[2] <- database(
  relation(
    list(d = list(
      df = data.frame(d = logical(), c = logical()),
      keys = list("d")
    )),
    attrs_order(db3)
  ),
  list()
)
print(db3) # note the schema's name doesn't change
# names(db3)[2] <- "d" # this would change the name
keys(db3)[[2]] <- list(character()) # removing keys first...
# for a database_schema, we could then change the attrs for
# the second database. For a created relation, this is not
# allowed.
## Not run:
attrs(db3)[[2]] <- c("b", "c")
names(records(db3)[[2]]) <- c("b", "c")
## End(Not run)

# changing appearance priority for attributes
attrs_order(db3) <- c("d", "c", "b", "a")
print(db3)

# changing relation schema names changes them in references
names(db3) <- paste0(names(db3), "_long")
print(db3)

# reconstructing from components
db_recon <- database(
  relation(
    Map(list, df = records(db), keys = keys(db)),
    attrs_order(db)
  ),
  references(db)
)
stopifnot(identical(db_recon, db))
db_recon2 <- database(
  subrelations(db),
  references(db)
)
stopifnot(identical(db_recon2, db))

# can be a data frame column
data.frame(id = 1:2, relation = db)
Enhances a relation_schema
object with foreign key reference
information.
database_schema(relation_schemas, references)
relation_schemas |
a |
references |
a list of references, each represented by a list containing four character elements. In order, the elements are a scalar giving the name of the child (referrer) schema, a vector giving the child attribute names, a scalar giving the name of the parent (referee) schema, and a vector giving the parent attribute names. The vectors must be of the same length and contain names for attributes present in their respective schemas, and the parent attributes must form a key. |
Unlike functional_dependency
and relation_schema
,
database_schema
is not designed to be vector-like: it only holds a
single database schema. This adheres to the usual package use case, where a
single data frame is being analysed at a time. However, it inherits from
relation_schema
, so is vectorised with respect to its relation
schemas.
As with relation_schema
, duplicate relation schemas, after
ordering by attribute, are allowed, and can be removed with
unique
.
References, i.e. foreign key references, are allowed to have different
attribute names in the child and parent relations; this can't occur in the
output for autoref
and normalise
.
Subsetting removes any references that involve removed relation schemas.
Removing duplicates with unique
changes references involving
duplicates to involve the kept equivalent schemas instead. Renaming relation
schemas with names<-
also changes their names in
the references.
A database_schema
object, containing relation_schemas
with references
stored in an attribute of the same name.
References are stored with their attributes in the order they appear in
their respective relation schemas.
attrs
, keys
, attrs_order
,
and references
for extracting parts of the information in a
database_schema
; create
for creating a
database
object that uses the given schema; gv
for converting the schema into Graphviz code; rename_attrs
for renaming the attributes in attrs_order
; reduce
for
filtering a schema's relations to those connected to a given relation by
foreign key references; subschemas
to return the
relation_schema
that the given schema contains;
merge_empty_keys
for combining relations with an empty key;
merge_schemas
for combining relations with matching sets of
keys.
rs <- relation_schema(
  list(
    a = list(c("a", "b"), list("a")),
    b = list(c("b", "c"), list("b", "c"))
  ),
  attrs_order = c("a", "b", "c", "d")
)
ds <- database_schema(
  rs,
  list(list("a", "b", "b", "b"))
)
print(ds)
attrs(ds)
keys(ds)
attrs_order(ds)
names(ds)
references(ds)

# relations can't reference themselves
## Not run:
database_schema(
  relation_schema(
    list(a = list("a", list("a"))),
    c("a", "b")
  ),
  list(list("a", "a", "a", "a"))
)
database_schema(
  relation_schema(
    list(a = list(c("a", "b"), list("a"))),
    c("a", "b")
  ),
  list(list("a", "b", "a", "a"))
)
## End(Not run)

# an example with references between differently-named attributes
print(database_schema(
  relation_schema(
    list(
      citation = list(c("citer", "citee"), list(c("citer", "citee"))),
      article = list("article", list("article"))
    ),
    c("citer", "citee", "article")
  ),
  list(
    list("citation", "citer", "article", "article"),
    list("citation", "citee", "article", "article")
  )
))

# vector operations
ds2 <- database_schema(
  relation_schema(
    list(
      e = list(c("a", "e"), list("e"))
    ),
    attrs_order = c("a", "e")
  ),
  list()
)
c(ds, ds2) # attrs_order attributes are merged
unique(c(ds, ds))

# subsetting
ds[1]
stopifnot(identical(ds[[1]], ds[1]))
ds[c(1, 2, 1, 2)] # replicates the foreign key references
c(ds[c(1, 2)], ds[c(1, 2)]) # doesn't reference between separate copies of ds
unique(ds[c(1, 2, 1, 2)]) # unique() also merges references

# another example of unique() merging references
ds_merge <- database_schema(
  relation_schema(
    list(
      a = list(c("a", "b"), list("a")),
      b = list(c("b", "c", "d"), list("b")),
      c_d = list(c("c", "d", "e"), list(c("c", "d"))),
      a.1 = list(c("a", "b"), list("a")),
      b.1 = list(c("b", "c", "d"), list("b"))
    ),
    c("a", "b", "c", "d", "e")
  ),
  list(
    list("a", "b", "b", "b"),
    list("b.1", c("c", "d"), "c_d", c("c", "d"))
  )
)
print(ds_merge)
unique(ds_merge)

# reassignment
# can't change keys included in references
## Not run: keys(ds)[[2]] <- list("c")
# can't remove attributes included in keys
## Not run: attrs(ds)[[2]] <- list("c", "d")
# can't remove attributes included in references
## Not run: attrs(ds)[[1]] <- c("a", "d")
ds3 <- ds
# can change subset of schema, but loses references between altered and
# non-altered subsets
ds3[2] <- database_schema(
  relation_schema(
    list(d = list(c("d", "c"), list("d"))),
    attrs_order(ds3)
  ),
  list()
)
print(ds3) # note the schema's name doesn't change
# names(ds3)[2] <- "d" # this would change the name
keys(ds3)[[2]] <- list(character()) # removing keys first...
attrs(ds3)[[2]] <- c("b", "c") # so we can change the attrs legally
keys(ds3)[[2]] <- list("b", "c") # add the new keys
# add the reference lost during subset replacement
references(ds3) <- c(references(ds3), list(list("a", "b", "b", "b")))
stopifnot(identical(ds3, ds))

# changing appearance priority for attributes
attrs_order(ds3) <- c("d", "c", "b", "a")
print(ds3)

# changing relation schema names changes them in references
names(ds3) <- paste0(names(ds3), "_long")
print(ds3)

# reconstructing from components
ds_recon <- database_schema(
  relation_schema(
    Map(list, attrs(ds), keys(ds)),
    attrs_order(ds)
  ),
  references(ds)
)
stopifnot(identical(ds_recon, ds))
ds_recon2 <- database_schema(
  subschemas(ds),
  references(ds)
)
stopifnot(identical(ds_recon2, ds))

# can be a data frame column
data.frame(id = 1:2, schema = ds)
Decomposes a data frame into several relations, based on the given database schema. It's intended that the data frame satisfies all the functional dependencies implied by the schema, as is the case when the schema was constructed from that same data frame. If this is not the case, the function returns an error.
decompose(df, schema)
df |
a data.frame, containing the data to be normalised. |
schema |
a database schema with foreign key references, such as given by
|
If the schema was constructed using approximate dependencies for the same
data frame, decompose
returns an error, to prevent either duplicate records
or lossy decompositions. This is temporary: for the next update, we plan to
add an option to allow this, or to add "approximate" equivalents of databases
and database schemas.
A database
object, containing the data in df
within the database schema given in schema
.
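A minimal sketch of the intended flow (assuming the autodb package is attached), building the schema from the same data frame so the decomposition is guaranteed to succeed:

```r
library(autodb)

df <- data.frame(
  id = c(1L, 2L, 3L),
  group = c("x", "x", "y")
)
# discover exact functional dependencies, normalise them into a
# database schema, then decompose the same data frame with it
schema <- normalise(discover(df, accuracy = 1))
db <- decompose(df, schema)
db
```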
Generic function, with the only given method fetching dependants for functional dependencies.
dependant(x, ...)
dependant(x, ...) <- value
x |
an R object. For the given method, a
|
... |
further arguments passed on to methods. |
value |
A character vector of the same length as |
A character vector containing dependants.
Generic function, with the only given method fetching determinant sets for functional dependencies.
detset(x, ...)
detset(x, ...) <- value
x |
an R object. For the given method, a
|
... |
further arguments passed on to methods. |
value |
A character vector of the same length as |
A list containing determinant sets, each consisting of a character vector with unique elements.
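A sketch showing both accessors on a hand-written functional_dependency object (constructed as documented elsewhere in this package; the attribute names are illustrative):

```r
library(autodb)

# two dependencies: {a, b} -> c and {a} -> b
fds <- functional_dependency(
  list(list(c("a", "b"), "c"), list("a", "b")),
  attrs_order = c("a", "b", "c")
)
detset(fds)    # the determinant sets (left-hand sides)
dependant(fds) # the dependants (right-hand sides)
```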
duplicated
"determines which elements of a vector or data frame
are duplicates of elements with smaller subscripts, and returns a logical
vector indicating which elements (rows) are duplicates". However, as of R
4.1, calling this on a data frame with zero columns always returns an empty
logical vector. This has repercussions on other functions that use
duplicated
, such as unique
and anyDuplicated
.
These functions add zero-column data frames as a special case.
df_duplicated(x, incomparables = FALSE, fromLast = FALSE, ...)
df_unique(x, incomparables = FALSE, fromLast = FALSE, ...)
df_anyDuplicated(x, incomparables = FALSE, fromLast = FALSE, ...)
df_records(x, use_rownames = FALSE, use_colnames = FALSE)
x |
a data frame. |
incomparables |
a vector of values that cannot be compared.
|
fromLast |
logical indicating if duplication should be considered
from the reverse side, i.e., the last (or rightmost) of identical
elements would correspond to |
... |
arguments for particular methods. |
use_rownames |
a logical, FALSE by default, indicating whether row
values should keep the row names from |
use_colnames |
a logical, FALSE by default, indicating whether row
values should keep the column names from |
For df_duplicated
, a logical vector with one element for each
row.
For df_unique
, a data frame is returned with the same
columns, but possibly fewer rows (and with row names from the first
occurrences of the unique rows).
For df_anyDuplicated
, an integer or real vector of length
one with value the 1-based index of the first duplicate if any, otherwise
0.
For df_records
, a list of the row values in x
. This is
based on a step in duplicated.data.frame
. However, for data
frames with zero columns, special handling returns a list of empty row
values, one for each row in x
. Without special handling, this step
returns an empty list. This was the cause for duplicated
returning incorrect results for zero-column data frames in older versions
of R.
# row values for a 5x0 data frame
x <- data.frame(a = 1:5)[, FALSE, drop = FALSE]
do.call(Map, unname(c(list, x))) # original step returns empty list
df_records(x) # corrected version preserves row count
A convenience function, mostly used for testing that rejoin
works as intended. It checks that data frames have the same dimensions and
column names, with duplicates allowed, then checks they contain the same
data. For the latter step, column names are made unique first, so columns
with duplicate names must be presented in the same order in both data frames.
df_equiv(df1, df2, digits = getOption("digits"))
df1 , df2
|
Data frames. |
digits |
a positive integer, indicating how many significant digits are
to be used for numeric and complex variables. A value of NA results in no
rounding. By default, this uses |
A logical.
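A minimal sketch (assuming the autodb package is attached):

```r
library(autodb)

df1 <- data.frame(a = c(1, 2), b = c("x", "y"))
df2 <- data.frame(a = c(1, 2), b = c("x", "y"))
df_equiv(df1, df2) # same dimensions, column names, and data
```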
rbind
takes "a sequence of vector, matrix or data-frame
arguments", and combines by rows for the latter. However, as of R 4.1,
calling this on data frames with zero columns always returns zero rows, due to
the issue mentioned for df_duplicated
. This function adds
zero-column data frames as a special case.
df_rbind(...)
... |
data frames. |
A data frame containing the ...
arguments row-wise.
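A minimal sketch of the special case (assuming the autodb package is attached): combining zero-column data frames preserves the row count.

```r
library(autodb)

# a 2x0 data frame
x <- data.frame(a = 1:2)[, FALSE, drop = FALSE]
# base rbind() drops the rows in affected R versions;
# df_rbind() keeps all four
nrow(df_rbind(x, x))
```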
Finds all the minimal functional dependencies represented in a data frame.
discover(
  df,
  accuracy,
  digits = getOption("digits"),
  full_cache = TRUE,
  store_cache = TRUE,
  skip_bijections = FALSE,
  exclude = character(),
  exclude_class = character(),
  dependants = names(df),
  detset_limit = ncol(df) - 1L,
  progress = FALSE,
  progress_file = ""
)
df |
a data.frame, the relation to evaluate. |
accuracy |
a numeric in (0, 1]: the accuracy threshold required in order to conclude a dependency. |
digits |
a positive integer, indicating how many significant digits are
to be used for numeric and complex variables. A value of |
full_cache |
a logical, indicating whether to store information about how sets of attributes group the relation records (stripped partitions). Otherwise, only the number of groups is stored. Storing the stripped partition is expected to let the algorithm run more quickly, but might be inefficient for small data frames or small amounts of memory. |
store_cache |
a logical, indicating whether to keep cached information to use when finding dependencies for other dependants. This allows the algorithm to run more quickly by not having to re-calculate information, but takes up more memory. |
skip_bijections |
a logical, indicating whether to skip some dependency
searches that are made redundant by discovered bijections between
attributes. This can significantly speed up the search if |
exclude |
a character vector, containing names of attributes to not
consider as members of determinant sets. If names are given that aren't
present in |
exclude_class |
a character vector, indicating classes of attributes to not consider as members of determinant sets. Attributes are excluded if they inherit from any given class. |
dependants |
a character vector, containing names of all attributes for which to find minimal functional dependencies for which they are the dependant. By default, this is all of the attribute names. A smaller set of attribute names reduces the amount of searching required, so can reduce the computation time if only some potential dependencies are of interest. |
detset_limit |
an integer, indicating the largest determinant set size that should be searched for. By default, this is large enough to allow all possible determinant sets. See Details for comments about the effect on the result, and on the computation time. |
progress |
a logical, for whether to display progress to the user during
dependency search in |
progress_file |
a scalar character or a connection. If |
Column names for df
must be unique.
The algorithm used for finding dependencies is DFD. This searches for determinant sets for each dependent attribute (dependant) by traversing the powerset of the other (non-excluded) attributes, and is equivalent to a depth-first search.
The implementation for DFD differs a little from the algorithm presented in the original paper:
Some attributes, or attribute types, can be designated, ahead of time, as not being candidate members for determinant sets. This reduces the number of candidate determinant sets to be searched, saving time by not searching for determinant sets that the user would remove later anyway.
Attributes that have a single unique value, i.e. are constant, get attributed a single empty determinant set. In the standard DFD algorithm, they would be assigned all the other non-excluded attributes as length-one determinant sets. Assigning them the empty set distinguishes them as constant, allowing for special treatment at normalisation and later steps.
As was done in the original Python library, there is an extra case in seed generation for when there are no discovered maximal non-dependencies. In this case, we take all of the single-attribute nodes, then filter out by minimal dependencies as usual. This is equivalent to taking the empty set as the single maximal non-dependency.
There are three results when checking whether a candidate node is minimal/maximal. TRUE indicates the node is minimal/maximal, as usual. FALSE has been split into FALSE and NA. NA indicates that we can not yet determine whether the node is minimal/maximal. FALSE indicates that we have determined that it is not minimal/maximal, and we can set its category as such. This is done by checking whether any of its adjacent subsets/supersets are dependencies/non-dependencies, instead of waiting to exhaust the adjacent subsets/supersets to visit when picking the next node to visit.
We do not yet keep hashmaps to manage subset/superset relationships, as described in Section 3.5 of the original paper.
skip_bijections
allows for additional optimisation for finding
functional dependencies when there are pairwise-equivalent attributes.
Missing values (NA) are treated as a normal value, with NA = NA being true, and x = NA being false for any non-NA value of x.
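For instance, under this treatment of missing values, repeated NAs do not break a dependency (a hypothetical toy data frame):

```r
library(autodb)

df <- data.frame(a = c(1, 1, 2), b = c(NA, NA, "x"))
# both rows with a = 1 have b = NA, which counts as a consistent value,
# so a -> b is still discovered
discover(df, 1)
```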
Numerical/complex values, i.e. floating-point values, represent difficulties for stating functional dependencies. A fundamental condition for stating functional dependencies is that we can compare two values for the same variable, and they are equivalent or not equivalent.
Usually, this is done by checking they're equal – this is the approach used
in discover
– but we can use any comparison that is an equivalence
relation.
However, checking floating-point values for equality is not simple. ==
is not appropriate, even when comparing non-calculated values we've read from
a file, because how a given number is converted into a float can vary by
computer architecture, meaning that two values can be considered equal on one
computer, and not equal on another. This can happen even if they're both
using 64-bit R, and even though all R platforms work with values conforming
to the same standard (see double
). For example, there are pairs of decimal values that are converted into
different floating-point representations on x86, but the same representation
on ARM, resulting in inequality and equality respectively.
For this and other reasons, checking numerical/complex values for
(near-)equality in R is usually done with all.equal
. This
determines values x and y to be equal if their absolute/relative
difference is within some tolerance value. However, we can not use
this. Equivalence relations must be transitive: if we have values
x, y, and z, and y is equivalent to both x and z,
then x and z must also be equivalent. This tolerance-based
equivalence is not transitive: it is reasonably straightforward to set up
three values so that the outer values are far enough apart to be considered
non-equivalent, but the middle value is close enough to be considered
equivalent to both of them. Using this to determine functional dependencies,
therefore, could easily result in a large number of inconsistencies.
This means we have no good option for comparing numerical/complex values as-is for equivalence, with consistent results across different machines, so we must treat them differently. We have three options:
Round/truncate the values, before comparison, to some low degree of precision;
Coerce the values to another class before passing them into discover
;
Read values as characters if reading data from a file.
discover
takes the first option, with a default number of significant
digits low enough to ensure consistency across different machines. However,
the user can also use any of these options when processing the data before
passing it to discover
. The third option, in particular, is
recommended if reading data from a file.
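A sketch of the first option in practice, with illustrative values:

```r
library(autodb)

df <- data.frame(x = c(1, 1 + 1e-12), y = c("a", "b"))
# at the default precision the two x values round to the same number,
# so x cannot determine y
discover(df, 1)
# with no rounding, the x values are distinct
discover(df, 1, digits = NA)
```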
Skipping bijections allows skipping redundant searches. For example, if the
search discovers that A -> B
and B -> A
, then only one of those
attributes is considered for the remainder of the search. Since the search
time increases exponentially with the number of attributes considered, this
can significantly speed up search times. At the moment, this is only done
for bijections between single attributes, such as A <-> B
; if A
<-> {B, C}
, nothing is skipped. Whether bijections are skipped doesn't
affect which functional dependencies are present in the output, but it might
affect their order.
Skipping bijections for approximate dependencies, i.e. when accuracy < 1
,
should be avoided: it can result in incorrect output, since an approximate
bijection doesn't imply equivalent approximate dependencies.
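A sketch of a search where skipping bijections applies (toy data, illustrative only):

```r
library(autodb)

df <- data.frame(
  id   = rep(1:3, 2),
  code = rep(c("x", "y", "z"), 2),
  val  = c(10, 20, 30, 10, 20, 30)
)
# id <-> code is a bijection, so one of the two attributes can be skipped
# for the rest of the search; the discovered dependencies are the same
# either way, possibly in a different order
discover(df, 1, skip_bijections = TRUE)
```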
Setting detset_limit
smaller than the largest-possible value has
different behaviour for different search algorithms, but the result is always
that discover(x, 1, detset_limit = n)
is equivalent to doing a full
search, fds <- discover(x, 1)
, then
filtering by determinant set size post-hoc, fds[lengths(detset(fds)) <=
n]
.
For DFD, the naive way to implement it is by removing determinant sets larger than the limit from the search tree for possible functional dependencies for each dependant. However, this usually results in the search taking much more time than without a limit.
For example, suppose we search for determinant sets for a dependant that has
none (the dependant is the only key for df
, for example). Using DFD,
we begin with a single attribute, then add other attributes one-by-one, since
every set gives a non-dependency. When we reach a maximum-size set, we can
mark all subsets as also being non-dependencies.
With the default limit, there is only one maximum-size set, containing all of
the available attributes. If there are n candidate attributes for
determinants, the search finishes after visiting n sets.
With a smaller limit k, there are choose(n, k) maximum-size sets
to explore. Since a DFD search adds or removes one attribute at each step,
this means the search must visit at least 2*choose(n, k) - 1 sets,
which is larger than n for all non-trivial cases.
We therefore use a different approach, where any determinant sets above the size limit are not allowed to be candidate seeds for new search paths, and any discovered dependencies with a size above the limit are discarded at the end of the entire DFD search. This means that nodes for determinant sets above the size limit are only visited in order to determine maximality of non-dependencies within the size limit. It turns out to be rare that this results in a significant speed-up, but it never results in the search having to visit more nodes than it would without a size limit, so the average search time is never made worse.
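The post-hoc filtering equivalence described above can be checked directly (a sketch, using the detset accessor documented for functional_dependency objects):

```r
library(autodb)

fds_full <- discover(CO2, 1)
fds_limited <- discover(CO2, 1, detset_limit = 2)
# the limited search only returns determinant sets within the limit
all(lengths(detset(fds_limited)) <= 2)
# compare against filtering the full search post-hoc
fds_full[lengths(detset(fds_full)) <= 2]
```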
A functional_dependency
object, containing the discovered
dependencies. The column names of df
are stored in the attrs
attribute, in order, to serve as a default priority order for the
attributes during normalisation.
Abedjan Z., Schulze P., Naumann F. (2014) DFD: efficient functional dependency discovery. Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management (CIKM '14). New York, U.S.A., 949–958.
# simple example
discover(ChickWeight, 1)
# example with spurious dependencies
discover(CO2, 1)
# exclude attributes that can't be determinants.
# in this case, the numeric attributes are now
# not determined by anything, because of repeat measurements
# with no variable to mark them as such.
discover(CO2, 1, exclude_class = "numeric")
# include only dependencies with dependants of interest.
discover(CO2, 1, dependants = c("Treatment", "uptake"))
Creates a set of functional dependencies with length-one dependants.
functional_dependency(FDs, attrs_order, unique = TRUE)
FDs |
a list of functional dependencies, in the form of two-element lists: the first element contains a character vector of all attributes in the determinant set, and the second element contains the single dependent attribute (dependant). |
attrs_order |
a character vector, giving the names of all attributes.
These need not be present in |
unique |
a logical, TRUE by default, for whether to remove duplicate dependencies. |
When several sets of functional dependencies are concatenated, their
attrs_order
attributes are merged, so as to preserve all of the
original attribute orders, if possible. If this is not possible, because the
orderings disagree, then the returned value of the attrs_order
attribute is their union instead.
A functional_dependency
object, containing the list given in
FDs
, with attrs_order
an attribute of the same name.
Functional dependencies are returned with their determinant sets sorted
according to the attribute order in attrs
. Any duplicates found
after sorting are removed.
detset
, dependant
, and
attrs_order
for extracting parts of the information in a
functional_dependency
; rename_attrs
for renaming the attributes in attrs_order
.
fds <- functional_dependency(
  list(list(c("a", "b"), "c"), list(character(), "d")),
  attrs_order = c("a", "b", "c", "d")
)
print(fds)
detset(fds)
dependant(fds)
attrs_order(fds)
# vector operations
fds2 <- functional_dependency(list(list("e", "a")), c("a", "e"))
c(fds, fds2) # attrs_order attributes are merged
unique(c(fds, fds))
# subsetting
fds[1]
fds[c(1, 2, 1)]
stopifnot(identical(fds[[2]], fds[2]))
# reassignment
fds3 <- fds
fds3[2] <- functional_dependency(list(list("a", "c")), attrs_order(fds3))
print(fds3)
detset(fds3)[[2]] <- character()
dependant(fds3)[[2]] <- "d"
stopifnot(identical(fds3, fds))
# changing appearance priority for attributes
attrs_order(fds3) <- rev(attrs_order(fds3))
fds3
# reconstructing from components
fds_recon <- functional_dependency(
  Map(list, detset(fds), dependant(fds)),
  attrs_order(fds)
)
stopifnot(identical(fds_recon, fds))
# can be a data frame column
data.frame(id = 1:2, fd = fds)
# (in)equality ignores header
stopifnot(all(fds3 == fds))
stopifnot(!any(fds != fds))
Produces text input for Graphviz to make an HTML diagram of a given object.
gv(x, name = NA_character_, ...)
x |
an object to be plotted. |
name |
a scalar character, giving the name of the object, if any. This name is used for the resulting graph, to allow for easier combining of graphs into a single diagram if required. |
... |
further arguments passed to or from other methods. |
Details of what is plotted are given in individual methods. There are expected commonalities, which are described below.
The object is expected to be one of the following:
an object whose elements have the same length. Examples would be data frames, matrices, and other objects that can represent relations, with names for the elements, and an optional name for the object itself.
a graph of sub-objects, each of which represent a relation as described above, possibly with connections between the objects, and an optional name for the graph as a whole.
Each relation is presented as a record-like shape, with the following elements:
An optional header with the relation's name, and the number of (unique) records.
A set of rows, one for each attribute in the relation. These rows have the following contents:
the attribute names.
a depiction of the relation's (candidate) keys. Each column represents a key, and a filled cell indicates that the attribute in that row is in that key. The keys are given in lexical order, with precedence given to keys with fewer attributes, and keys with attributes that appear earlier in the original data frame's attribute order. Default output from other package functions will thus have the primary key given first. In the future, this will be changed to always give the primary key first.
optionally, the attribute types: specifically, the first element
when passing the attribute's values into class
.
Any foreign key references between relations are represented by one-way arrows, one per attribute in the foreign key.
If the object has a name, this name is attached to the resulting graph in Graphviz. This is to allow easier combination of several such graphs into a single image, if a user wishes to do so.
A scalar character, containing text input for Graphviz.
gv.data.frame
, gv.relation_schema
,
gv.database_schema
, gv.relation
, and
gv.database
for individual methods.
# simple data.frame example
txt_df <- gv(ChickWeight, "chick")
cat(txt_df)
if (requireNamespace("DiagrammeR", quietly = TRUE)) {
  DiagrammeR::grViz(txt_df)
}
# simple database example
db <- autodb(ChickWeight)
txt_db <- gv(db)
cat(txt_db)
if (requireNamespace("DiagrammeR", quietly = TRUE)) {
  DiagrammeR::grViz(txt_db)
}
# simple relation schemas
rschema <- synthesise(discover(ChickWeight, 1))
txt_rschema <- gv(rschema)
cat(txt_rschema)
if (requireNamespace("DiagrammeR", quietly = TRUE)) {
  DiagrammeR::grViz(txt_rschema)
}
# simple database schema
dschema <- normalise(discover(ChickWeight, 1))
txt_dschema <- gv(dschema)
cat(txt_dschema)
if (requireNamespace("DiagrammeR", quietly = TRUE)) {
  DiagrammeR::grViz(txt_dschema)
}
# simple relations
rel <- create(synthesise(discover(ChickWeight, 1)))
txt_rel <- gv(rel)
cat(txt_rel)
if (requireNamespace("DiagrammeR", quietly = TRUE)) {
  DiagrammeR::grViz(txt_rel)
}
Produces text input for Graphviz to make an HTML diagram of a given data frame.
## S3 method for class 'data.frame' gv(x, name = NA_character_, ...)
x |
a data.frame. |
name |
a character scalar, giving the name of the record, if any. The
name must be non-empty, since it is also used to name the single table in
the plot. Defaults to |
... |
further arguments passed to or from other methods. |
The rows in the plotted data frame include information about the attribute classes.
A scalar character, containing text input for Graphviz.
The generic gv
.
Produces text input for Graphviz to make an HTML diagram of a given database.
## S3 method for class 'database' gv(x, name = NA_character_, ...)
x |
a database. |
name |
a scalar character, giving the name of the database, if any. This name is used for the resulting graph, to allow for easier combining of graphs into a single diagram if required. |
... |
further arguments passed to or from other methods. |
Each relation in the database is presented as a set of rows, one for each attribute in the relation. These rows include information about the attribute classes.
A scalar character, containing text input for Graphviz.
The generic gv
.
Produces text input for Graphviz to make an HTML diagram of a given database schema.
## S3 method for class 'database_schema' gv(x, name = NA_character_, ...)
x |
a database schema, as given by |
name |
a character scalar, giving the name of the schema, if any. |
... |
further arguments passed to or from other methods. |
Each relation in the schema is presented as a set of rows, one for each attribute in the relation. These rows do not include information about the attribute classes.
Any foreign key references are represented by arrows between the attribute pairs.
A scalar character, containing text input for Graphviz.
The generic gv
.
Produces text input for Graphviz to make an HTML diagram of a given relation.
## S3 method for class 'relation' gv(x, name = NA_character_, ...)
x |
a |
name |
a character scalar, giving the name of the schema, if any. |
... |
further arguments passed to or from other methods. |
Each relation is presented as a set of rows, one for each attribute in the relation. These rows include information about the attribute classes.
A scalar character, containing text input for Graphviz.
The generic gv
.
Produces text input for Graphviz to make an HTML diagram of a given relation schema.
## S3 method for class 'relation_schema' gv(x, name = NA_character_, ...)
x |
a relation schema, as given by |
name |
a character scalar, giving the name of the schema, if any. |
... |
further arguments passed to or from other methods. |
Each relation in the schema is presented as a set of rows, one for each attribute in the relation. These rows do not include information about the attribute classes.
A scalar character, containing text input for Graphviz.
The generic gv
.
Generic function for inserting a data frame of data into an object.
insert(x, vals, relations = names(x), all = FALSE, ...)
x |
a relational data object, into which to insert data, such as a
|
vals |
a data frame, containing data to insert. |
relations |
a character vector, containing names of elements of |
all |
a logical, indicating whether |
... |
further arguments passed on to methods. |
This function is intended for inserting into an object that is itself
comprised of data frames, such as a relation
or a
database
. The given methods have the following behaviour:
If an empty set of data is inserted, into a non-empty object element, nothing happens.
If an empty set of data is inserted into an empty object element, the resulting element is also empty, but takes on the attribute/column classes of the inserted data. This is done to prevent having to know attribute classes during object creation.
Insertion can fail if inserting would violate object constraints. For example, databases cannot have data inserted that would violate candidate/foreign key constraints.
For other cases, the data is inserted in an object element in the
same way as using rbind
, followed by unique
.
While key violations prevent insertion, re-insertion of existing records in
an object element does not. This makes insertion equivalent to an INSERT OR
IGNORE
expression in SQL. In particular, it is somewhat like using this
expression in SQLite, since that implementation uses dynamic typing.
If vals
contains attributes not included in
attrs_order(x)
, insert
throws an error, since those
attributes can't be inserted.
If a partial set of attributes is inserted, and all
is FALSE
,
then data is only inserted into components of x[relations]
whose
required attributes are all present in vals
. If all
is
TRUE
, insert
returns an error instead. This is useful when
specifying relations
: in that case, you often intend to insert
into all of the specified elements, so not including all the required
attributes is a mistake, and all = TRUE
prevents it.
An R object of the same class as x
, containing the additional
new data.
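A sketch of inserting a data frame into an initially empty relation object, built with the other functions documented here:

```r
library(autodb)

# create empty relations from discovered dependencies, then fill them
rel <- create(synthesise(discover(ChickWeight, 1)))
rel <- insert(rel, ChickWeight)
# re-inserting existing records changes nothing,
# as with INSERT OR IGNORE in SQL
rel2 <- insert(rel, ChickWeight)
```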
Generic function, with the only given method fetching candidate key lists for relation schemas.
keys(x, ...) keys(x, ...) <- value
x |
a relational schema object, such as a |
... |
further arguments passed on to methods. |
value |
A list of lists of character vectors, of the same length as
|
A list containing lists of unique character vectors, representing
candidate keys for each element of x
.
Merges an object's schemas that have empty keys into a single schema, which contains all of the attributes from the merged schemas.
merge_empty_keys(x)
x |
a relational schema object, such as a |
This function is not itself generic, but makes use of the generic functions
keys
and merge_schemas
. Any input class with
valid methods for these generic functions can be passed into this function.
For database_schema
objects, references involving the
schemas with empty keys are updated to refer to the merged schema.
An R object of the same class as x
, where relations with an
empty key have been merged into a single relation.
merge_schemas
, on which this function is based.
Generic function that merges pairs of an object's schemas with matching sets of keys. The remaining schemas contain all the attributes from the schemas merged into them.
merge_schemas(x, to_remove, merge_into, ...)
x |
a relational schema object, such as a |
to_remove |
an integer vector, giving the indices for schemas to be merged into other schemas, then removed. |
merge_into |
an integer vector of the same length as |
... |
further arguments passed on to methods. |
An R object of the same class as x
, where the relations have
been merged as indicated.
merge_empty_keys
, which is based on this function.
rs <- relation_schema(
  list(
    a = list(c("a", "b"), list("a")),
    b = list(c("b", "c"), list("b")),
    b.1 = list(c("b", "d"), list("b")),
    d = list(c("d", "e"), list("d", "e"))
  ),
  letters[1:5]
)
ds <- database_schema(
  rs,
  list(
    list("a", "b", "b", "b"),
    list("b.1", "d", "d", "d")
  )
)
merge_schemas(rs, 3, 2) # merging b and b.1
merge_schemas(ds, 3, 2) # also merging their references
# merging a schema into itself just removes it
merge_schemas(rs, 3, 3)
merge_schemas(ds, 3, 3)
Creates a database schema from given functional dependencies, satisfying at least third normal form, using Bernstein's synthesis.
normalise(
  dependencies,
  single_ref = FALSE,
  ensure_lossless = TRUE,
  reduce_attributes = TRUE,
  remove_avoidable = FALSE,
  constants_name = "constants",
  progress = FALSE,
  progress_file = ""
)
dependencies |
a |
single_ref |
a logical, FALSE by default. If TRUE, then only one reference between each relation pair is kept when generating foreign key references. If a pair has multiple references, the kept reference refers to the earliest key for the child relation, as sorted by priority order. |
ensure_lossless |
a logical, TRUE by default. If TRUE, and the decomposition isn't lossless, an extra relation is added to make the decomposition lossless. |
reduce_attributes |
a logical, TRUE by default. If TRUE,
|
remove_avoidable |
a logical, indicating whether to remove avoidable attributes in relations. If so, then an attribute is removed from a relation if the keys can be changed such that it is not needed to preserve the given functional dependencies. |
constants_name |
a scalar character, giving the name for any relation created to store constant attributes. If this is the same as a generated relation name, it will be changed, with a warning, to ensure that all relations have a unique name. |
progress |
a logical, for whether to display progress to the user during
dependency search in |
progress_file |
a scalar character or a connection. If |
This is a wrapper function for applying synthesise
and
autoref
, in order. For creating relation schemas and foreign
key references separately, use these functions directly. See both functions
for examples.
For details on the synthesis algorithm used, see synthesise
.
A database_schema
object, containing the synthesised
relation schemas and the created foreign key references.
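A minimal sketch of the wrapper behaviour; with default settings, the result should correspond to applying the two steps separately (hedged: some arguments differ between the wrapper and the individual functions):

```r
library(autodb)

fds <- discover(ChickWeight, 1)
ds <- normalise(fds)
# in effect, synthesise() followed by autoref()
ds2 <- autoref(synthesise(fds))
```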
Data used for a meta-analysis on the effectiveness of nudges, i.e. choice architecture interventions.
nudge
nudge
A data frame with 447 effect size measurements and 25 columns:
publication_id, integer ID number for the publication. Note that two publications were erroneously assigned the same ID number, so this is not a unique publication identifier.
study_id, integer ID number for the study.
es_id, integer ID number for the effect size measured.
reference, publication citation in "Author(s) (year)" format. Due to two publications being assigned the same reference, this is also not a unique publication identifier.
title, title of the publication. Due to the error in assigning publication ID numbers, this is the unique publication identifier within the data set.
year, year of the publication.
location, geographical location of the intervention. This is given as a factor, rather than an integer, using the information provided in the codebook.
domain, factor giving the intervention's behavioural domain.
intervention_category, factor giving the intervention's category, based on the taxonomy in Münscher et al. (2016).
intervention_technique, factor giving the intervention's technique, based on the taxonomy in Münscher et al. (2016).
type_experiment, factor giving the type of experiment, as defined by Harrison and List (2004).
population, factor giving the intervention's target population. This is given as a factor, rather than an integer, using the information provided in the codebook.
n_study, sample size of the overall study.
n_comparison, combined sample size of the control and the intervention for the measured effect size.
n_control, sample size of the control condition for the measured effect size.
n_intervention, sample size of the intervention condition for the measured effect size.
binary_outcome, logical for whether the outcome scale is binary or continuous.
mean_control, mean of outcome for the control condition.
sd_control, SD of outcome for the control condition.
mean_intervention, mean of outcome for the intervention condition.
sd_intervention, SD of outcome for the intervention condition.
cohens_d, extracted effect size of intervention.
variance_d, variance of extracted effect size.
approximation, logical for whether effect size extraction involved approximation.
wansink, logical for whether the study was (co-)authored by Brian Wansink. This was added on revision, because, a few years before publication, Wansink had many papers retracted or corrected, due to various questionable practices, resulting in Wansink being determined to have committed scientific misconduct. This column was added to check whether the findings were robust to the exclusion of non-retracted studies by the Cornell Food and Brand Laboratory, of which Wansink was the director.
Mertens S., Herberz M., Hahnel U. J. J., Brosch T. (2022) The effectiveness of nudging: A meta-analysis of choice architecture interventions across behavioral domains. Proc. Natl. Acad. Sci. U.S.A., 4, 119(1).
Generic function, for retrieving data contained in a database-like structure. In particular, this is intended for such structures where the individual relations can't be accessed with subsetting.
records(x, ...)

records(x, ...) <- value
x |
a relational data object, such as a |
... |
further arguments passed on to methods. |
value |
A list of data frames of the same length as |
Since the relational data objects in autodb
, relation
and database
, have subsetting methods that return relational
data objects, the data contained within them can't be accessed by subsetting.
This function is intended for accessing it instead.
It's recommended to call records
before doing any subsetting, since
subsetting a relational data object first does extra work that is then
thrown away: subsetting a database
, for example, checks whether foreign key
references should be removed.
A list containing data frames, with elements named for their respective relations.
db <- autodb(ChickWeight)
records(db) # data for Chick and Time_Chick relations

# ways to get data for subsets
records(db)[c(1, 2)]
records(db)[[1]]
records(db)$Chick

# subsetting first isn't recommended: removes the foreign key
# reference, as mentioned above, and you need to subset again anyway
records(db[[1]])[[1]]
Filters an object's relations, keeping only the main relations, and those considered ancestors via foreign key references. Foreign key references involving removed relations are also removed.
reduce(x, ...)
x |
An object whose relations are to be filtered. |
... |
further arguments passed to or from other methods. |
Details on how the main tables are chosen are given in individual methods.
This function is mostly intended for simplifying a database, or a database schema, for the purposes of exploration, particularly by examining plots. While the filtering might remove important auxiliary relations, it's also likely to remove any based on spurious dependencies, of which some databases can contain many.
An object of the same class as x
, with the auxiliary relations
and foreign key references removed.
reduce.database_schema
, reduce.database
.
Filters a database's relations, keeping only the main relations, and those considered ancestors via foreign key references. Foreign key references involving removed relations are also removed.
## S3 method for class 'database'
reduce(x, ...)
x |
A database, whose relations are to be filtered. |
... |
further arguments passed to or from other methods. |
The main relations are considered to be the relations with the largest number of records.
Using rejoin
on the database resulting from reduce
is
likely to fail or return incomplete results.
A database, with the auxiliary relations and foreign key references removed.
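As a minimal sketch (the relation names here assume autodb's usual decomposition of the ChickWeight data set):

```r
db <- autodb(ChickWeight)
# keep the relation(s) with the most records, plus their ancestors
# by foreign key reference; other relations are dropped
reduced <- reduce(db)
names(reduced)
```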
Filters a database schema's relations, keeping only the given relations, and those considered ancestors via foreign key references. Foreign key references involving removed relations are also removed.
## S3 method for class 'database_schema'
reduce(x, main, ...)
x |
A database schema, whose relations are to be filtered. |
main |
A character vector, containing names of relations to be considered as the "main" relations. |
... |
further arguments passed to or from other methods. |
This method takes a given set of main relations, rather than inferring them.
Using rejoin
on the database resulting from decomposing a data
frame with the reduced schema is likely to fail or return incomplete results.
A database schema, with the auxiliary relations and foreign key references removed.
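A sketch of reducing a schema to a chosen main relation; this assumes discover's accuracy argument, and that normalising ChickWeight yields a relation named "Chick":

```r
ds <- normalise(discover(ChickWeight, accuracy = 1))
# keep "Chick" and its ancestors by foreign key reference
reduce(ds, main = "Chick")
```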
Generic function, returning present (foreign key) references.
references(x, ...)

references(x) <- value
x |
an R object with references, such as a |
... |
further arguments passed on to methods. |
value |
A list, of the same length as |
A list, giving references.
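For example, on a database (a sketch; the contents of the result depend on the discovered schema):

```r
db <- autodb(ChickWeight)
references(db) # a list of foreign key references between relations
```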
Rejoins the relations in a database into a single data frame, if possible.
This is the inverse of calling autodb
, except that the rows
might be returned in a different order.
rejoin(database)
database |
A database containing the data to be rejoined, as returned by
|
The rejoining algorithm might not use all of the given relations: it begins with the relation with the largest number of records, then joins it with enough relations to contain all of the present attributes. These joins are not limited to relations that the starting relation is linked to by foreign key references, since in some cases that constraint would make it impossible to rejoin with all of the present attributes.
Since the algorithm may not use all of the given relations, the algorithm may
ignore some types of database inconsistency, where different relations hold
data inconsistent with each other. In this case, the rejoining will be lossy.
Rejoining the results of reduce
can also be lossy.
Due to the above issues, the algorithm will be changed to use all of the relations in the future.
Not all databases can be represented as a single data frame. A simple example is any database where the same attribute name is used for several different sources of data, since rejoining would result in inappropriate merges.
A data frame, containing all of the information in database
, if
it is lossless and self-consistent.
# simple example
db <- autodb(ChickWeight)
rj <- rejoin(db)
rj <- rj[order(as.integer(rownames(rj))), ]
all(rj == ChickWeight) # TRUE

# showing rejoin() doesn't check for inconsistency:
# add another Chick table with the diets swapped
db2 <- db[c(1, 2, 1)]
records(db2)[[3]]$Diet <- rev(records(db2)[[3]]$Diet)
rj2 <- rejoin(db2)
rj2 <- rj2[order(as.integer(rownames(rj2))), ]
all(rj2 == ChickWeight) # TRUE
Creates a set of relation schemas, including the relation's attributes and candidate keys.
relation(relations, attrs_order)
relations |
a named list of relations, in the form of two-element lists: the first element contains a data frame, where the column names are the attributes in the associated schema, and the second element contains a list of character vectors, each representing a candidate key. |
attrs_order |
a character vector, giving the names of all attributes.
These need not be present in |
Relation vectors are unlikely to be needed by the user directly, since they
are essentially database
objects that can't have foreign key
references. They mostly exist to mirror the vector-like
relation_schema
class, which the database_schema
class wraps around. This makes creating a database
from a relation_schema
a two-step process, where the two steps
can be done in either order: adding data, with create
and
insert
, and adding references, with
database_schema
or database
.
Duplicate schemas, after ordering by attribute, are allowed, and can be
removed with unique
.
When several sets of relation schemas are concatenated, their
attrs_order
attributes are merged, so as to preserve all of the original
attribute orders, if possible. If this is not possible, because the orderings
disagree, then the returned value of the attrs_order
attribute is their
union instead.
A relation
object, containing the list given in
relations
, with attrs_order
stored in an attribute of the
same name. Relation schemas are returned with their keys' attributes sorted
according to the attribute order in attrs_order
, and the keys then
sorted by priority order. Attributes in the data frame are also sorted,
first by order of appearance in the sorted keys, then by order in
attrs_order
for non-prime attributes.
records
, attrs
, keys
, and
attrs_order
for extracting parts of the information in a
relation_schema
; gv
for converting the schema into
Graphviz code; rename_attrs
for renaming the attributes in
attrs_order
.
rels <- relation(
  list(
    a = list(
      df = data.frame(a = logical(), b = logical()),
      keys = list("a")
    ),
    b = list(
      df = data.frame(b = logical(), c = logical()),
      keys = list("b", "c")
    )
  ),
  attrs_order = c("a", "b", "c", "d")
)
print(rels)

records(rels)
attrs(rels)
stopifnot(identical(
  attrs(rels),
  lapply(records(rels), names)
))
keys(rels)
attrs_order(rels)
names(rels)

# inserting data
insert(rels, data.frame(a = 1L, b = 2L, c = 3L, d = 4L))
# data is only inserted into relations where all columns are given...
insert(rels, data.frame(a = 1L, b = 2L, c = 3L))
# and that are listed in relations argument
insert(
  rels,
  data.frame(a = 1L, b = 2L, c = 3L, d = 4L),
  relations = "a"
)

# vector operations
rels2 <- relation(
  list(
    e = list(
      df = data.frame(a = logical(), e = logical()),
      keys = list("e")
    )
  ),
  attrs_order = c("a", "e")
)
c(rels, rels2) # attrs_order attributes are merged
unique(c(rels, rels))

# subsetting
rels[1]
rels[c(1, 2, 1)]
stopifnot(identical(rels[[1]], rels[1]))

# reassignment
rels3 <- rels
rels3[2] <- relation(
  list(
    d = list(
      df = data.frame(d = logical(), c = logical()),
      keys = list("d")
    )
  ),
  attrs_order(rels3)
)
print(rels3) # note the relation's name doesn't change
# names(rels3)[2] <- "d" # this would change the name
keys(rels3)[[2]] <- list(character()) # removing keys first...
# for a relation_schema, we could then change the attrs for
# the second relation. For a created relation, this is not
# allowed.
## Not run:
attrs(rels3)[[2]] <- c("b", "c")
names(records(rels3)[[2]]) <- c("b", "c")
## End(Not run)

# changing appearance priority for attributes
rels4 <- rels
attrs_order(rels4) <- c("d", "c", "b", "a")
print(rels4)

# reconstructing from components
rels_recon <- relation(
  Map(list, df = records(rels), keys = keys(rels)),
  attrs_order(rels)
)
stopifnot(identical(rels_recon, rels))

# can be a data frame column
data.frame(id = 1:2, relation = rels)
Creates a set of relation schemas, including the relation's attributes and candidate keys.
relation_schema(schemas, attrs_order)
schemas |
a named list of schemas, in the form of two-element lists: the first element contains a character vector of all attributes in the relation schema, and the second element contains a list of character vectors, each representing a candidate key. |
attrs_order |
a character vector, giving the names of all attributes.
These need not be present in |
Duplicate schemas, after ordering by attribute, are allowed, and can be
removed with unique
.
When several sets of relation schemas are concatenated, their
attrs_order
attributes are merged, so as to preserve all of the original
attribute orders, if possible. If this is not possible, because the orderings
disagree, then the returned value of the attrs_order
attribute is their
union instead.
A relation_schema
object, containing the list given in
schemas
, with attrs_order
stored in an attribute of the same
name. Relation schemas are returned with their keys' attributes sorted
according to the attribute order in attrs_order
, and the keys then
sorted by priority order. Attributes in the schema are also sorted, first
by order of appearance in the sorted keys, then by order in
attrs_order
for non-prime attributes.
attrs
, keys
, and
attrs_order
for extracting parts of the information in a
relation_schema
; create
for creating a
relation
object that uses the given schema; gv
for converting the schema into Graphviz code; rename_attrs
for renaming the attributes in attrs_order
;
merge_empty_keys
for combining relations with an empty key;
merge_schemas
for combining relations with matching sets of
keys.
schemas <- relation_schema(
  list(
    a = list(c("a", "b"), list("a")),
    b = list(c("b", "c"), list("b", "c"))
  ),
  attrs_order = c("a", "b", "c", "d")
)
print(schemas)

attrs(schemas)
keys(schemas)
attrs_order(schemas)
names(schemas)

# vector operations
schemas2 <- relation_schema(
  list(
    e = list(c("a", "e"), list("e"))
  ),
  attrs_order = c("a", "e")
)
c(schemas, schemas2) # attrs_order attributes are merged
unique(c(schemas, schemas))

# subsetting
schemas[1]
schemas[c(1, 2, 1)]
stopifnot(identical(schemas[[1]], schemas[1]))

# reassignment
schemas3 <- schemas
schemas3[2] <- relation_schema(
  list(d = list(c("d", "c"), list("d"))),
  attrs_order(schemas3)
)
print(schemas3) # note the schema's name doesn't change
# names(schemas3)[2] <- "d" # this would change the name
keys(schemas3)[[2]] <- list(character()) # removing keys first...
attrs(schemas3)[[2]] <- c("b", "c") # so we can change the attrs legally
keys(schemas3)[[2]] <- list("b", "c") # add the new keys
stopifnot(identical(schemas3, schemas))

# changing appearance priority for attributes
attrs_order(schemas3) <- c("d", "c", "b", "a")
print(schemas3)

# reconstructing from components
schemas_recon <- relation_schema(
  Map(list, attrs(schemas), keys(schemas)),
  attrs_order(schemas)
)
stopifnot(identical(schemas_recon, schemas))

# can be a data frame column
data.frame(id = 1:2, schema = schemas)
Generic function, for renaming attributes present in a database-like structure.
rename_attrs(x, names, ...)
x |
an object with an |
names |
a character vector of the same length as |
... |
further arguments passed on to methods. |
This function has a different intended use to re-assigning
attrs_order
: that is intended only for rearranging the order of
the attributes, without renaming them. This is intended for renaming the
attributes without re-ordering them.
A relational object of the same type as x
, with attributes
renamed consistently across the whole object.
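A sketch of a consistent renaming, here upper-casing every attribute so that the new names line up with attrs_order by position:

```r
db <- autodb(ChickWeight)
db2 <- rename_attrs(db, toupper(attrs_order(db)))
attrs_order(db2) # all attribute names upper-cased, across the whole object
```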
Generic function, returning subrelations for x
.
subrelations(x, ...)
x |
an R object, intended to be some sort of database-like object that
contains relations, such as a |
... |
further arguments passed on to methods. |
A relation-type object, or a list of relation-type objects if the
subrelation isn't vectorised. For example, if x
is a
database
, the result is the contained relation
.
Generic function, returning subschemas for x
.
subschemas(x, ...)
x |
an R object, intended to be some sort of schema that contains other
schemas, such as a |
... |
further arguments passed on to methods. |
A schema-type object, or a list of schema-type objects if the
subschema isn't vectorised. For example, if x
is a
database_schema
, the result is the contained
relation_schema
.
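For example (a sketch, assuming discover's accuracy argument):

```r
ds <- normalise(discover(ChickWeight, accuracy = 1))
subschemas(ds) # the relation_schema contained in the database_schema
```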
Synthesises the dependency relationships in dependencies into a database schema satisfying at least third normal form, using Bernstein's synthesis.
synthesise(
  dependencies,
  ensure_lossless = TRUE,
  reduce_attributes = TRUE,
  remove_avoidable = FALSE,
  constants_name = "constants",
  progress = FALSE,
  progress_file = ""
)
dependencies |
a |
ensure_lossless |
a logical, TRUE by default. If TRUE, and the decomposition isn't lossless, an extra relation is added to make the decomposition lossless. |
reduce_attributes |
a logical, TRUE by default. If TRUE,
|
remove_avoidable |
a logical, indicating whether to remove avoidable attributes in relations. If so, an attribute is removed from a relation if the keys can be changed such that it is not needed to preserve the given functional dependencies. |
constants_name |
a scalar character, giving the name for any relation created to store constant attributes. If this is the same as a generated relation name, it will be changed, with a warning, to ensure that all relations have a unique name. |
progress |
a logical, for whether to display progress to the user during
dependency search in |
progress_file |
a scalar character or a connection. If |
Bernstein's synthesis is a synthesis algorithm for normalisation of a set of dependencies into a set of relations that are in third normal form. This implementation is based on the version given in the referenced paper.
The implementation also includes a common additional step, to ensure that the resulting decomposition is lossless, i.e. a relation satisfying the given dependencies can be perfectly reconstructed from the relations given by the decomposition. This is done by adding an additional relation, containing a key for all the original attributes, if one is not already present.
As an additional optional step, schemas are checked for "avoidable" attributes, that can be removed without loss of information.
Constant attributes, i.e. those whose only determinant set is empty, get assigned to a relation with no keys.
Output is independent of the order of the input dependencies: schemas are sorted according to their simplest keys.
Schemas are sorted before ensuring losslessness, and before removing avoidable attributes. As a result, neither optional step changes the order of the schemas, and ensuring losslessness can only add an extra schema to the end of the output vector.
A relation_schema
object, containing the synthesised
relation schemas.
3NF synthesis algorithm: Bernstein P. A. (1976) Synthesizing third normal form relations from functional dependencies. ACM Trans. Database Syst., 1, 4, 277–298.
Removal of avoidable attributes: Ling T., Tompa F. W., Kameda T. (1981) An improved third normal form for relational databases. ACM Trans. Database Syst., 6, 2, 329–346.
# example 6.24 from The Theory of Relational Databases by David Maier
# A <-> B, AC -> D, AC -> E, BD -> C
deps <- functional_dependency(
  list(
    list("A", "B"),
    list("B", "A"),
    list(c("A", "C"), "D"),
    list(c("A", "C"), "E"),
    list(c("B", "D"), "C")
  ),
  attrs_order = c("A", "B", "C", "D", "E")
)
synthesise(deps, remove_avoidable = FALSE)
synthesise(deps, remove_avoidable = TRUE)