library(dplyr)
library(palmerpenguins)
library(kableExtra)
For this exercise, we use the penguins data set from the {palmerpenguins} package.
Antonio Fidalgo
Miguel Salema
February 13, 2024
A table that shows the frequencies and/or relative frequencies of each group. Then, add a line for the total at the bottom of the table.
Species | N | % |
---|---|---|
Adelie | 152 | 44.19 |
Gentoo | 124 | 36.05 |
Chinstrap | 68 | 19.77 |
Total | 344 | 100.01 |
Install the following packages [install.packages()] if they are not present on your machine:
{dplyr}, to create a table with summary statistics;
{kableExtra}, to print and style it.
The key function here is summarise(), in conjunction with group_by(), to first count [n()] how many observations each group has.
Outside the grouping, we create [mutate()] the variable for the relative frequencies. If we did it inside the grouping, then the relative frequency would be 1 (that is, 100%) for each group.
The ordering [arrange()] in descending order [desc()] is optional.
df <- penguins |>
  group_by(species) |>
  summarise(n = n()) |>
  ungroup() |>
  mutate(rfreq = (n / sum(n) * 100) |> round(2)) |>
  arrange(desc(rfreq))
n() is the helper function that counts the number of rows.
Put parentheses ( ) around the whole expression that you want to pipe into round().
df
# A tibble: 3 × 3
species n rfreq
<fct> <int> <dbl>
1 Adelie 152 44.2
2 Gentoo 124 36.0
3 Chinstrap 68 19.8
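To see why the relative frequency is computed outside the grouping, here is a minimal sketch that computes it both ways. The object names inside and outside are illustrative and not part of the original code.

```r
library(dplyr)
library(palmerpenguins)

# Inside the grouping: within each group, n is a single number,
# so n / sum(n) is always 1 (100 once multiplied by 100).
inside <- penguins |>
  group_by(species) |>
  summarise(n = n(),
            rfreq = n / sum(n) * 100)

# Outside the grouping: sum(n) is now the total across all species.
outside <- penguins |>
  group_by(species) |>
  summarise(n = n()) |>
  ungroup() |>
  mutate(rfreq = n / sum(n) * 100)

inside$rfreq   # the same value, 100, for every species
outside$rfreq  # shares that add up to 100
```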
We take the object created above and use summarise() again, this time to get the total [sum()] of each variable. It is good practice to make the sum robust to the presence of NAs [na.rm = TRUE].
total_line <- df |>
  summarise(n = sum(n, na.rm = TRUE),
            rfreq = sum(rfreq, na.rm = TRUE),
            species = "Total")
total_line
# A tibble: 1 × 3
n rfreq species
<int> <dbl> <chr>
1 344 100. Total
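The total of the relative frequencies shows up as 100.01 rather than 100 in the final table because the percentages were rounded to two decimals before being summed. A small base R check, using the counts reported above (the object names are illustrative), also shows what na.rm = TRUE changes:

```r
# Counts per species, as reported in the table
n <- c(Adelie = 152, Gentoo = 124, Chinstrap = 68)

rfreq_exact   <- n / sum(n) * 100       # unrounded percentages
rfreq_rounded <- round(rfreq_exact, 2)  # 44.19, 36.05, 19.77

sum(rfreq_exact)    # 100, up to floating-point error
sum(rfreq_rounded)  # 100.01: all three shares happen to be rounded up

# na.rm = TRUE only matters when a value is missing
sum(c(152, NA, 68))                # NA
sum(c(152, NA, 68), na.rm = TRUE)  # 220
```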
In one command, I can ask to calculate [summarise()] the sum [sum()] of each variable [across()] that satisfies [where()] is.numeric(). Notice how the formula for each variable [.x] is introduced [~].
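A minimal sketch of what such a command could look like. To keep it self-contained, df is rebuilt here from the values printed earlier; adding the "Total" label with mutate() is one possible choice, not necessarily the original one.

```r
library(dplyr)

# Frequency table as printed above (values copied from the output)
df <- tibble(
  species = factor(c("Adelie", "Gentoo", "Chinstrap")),
  n       = c(152L, 124L, 68L),
  rfreq   = c(44.19, 36.05, 19.77)
)

# Sum every numeric column in one go: ~ starts the formula and
# .x stands for the column currently being summarised.
total_line <- df |>
  summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE))) |>
  mutate(species = "Total")

total_line
```

The advantage over naming each column is that the command keeps working if more numeric columns are added to the table.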
The bind_rows() function from {dplyr} will take care of matching the columns by their name.
df
# A tibble: 4 × 3
species n rfreq
<chr> <int> <dbl>
1 Adelie 152 44.2
2 Gentoo 124 36.0
3 Chinstrap 68 19.8
4 Total 344 100.
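The combined table above could be produced along these lines. This is a sketch with both pieces rebuilt from the printed values; the name df_total is hypothetical, whereas the output above suggests the result is stored back into df.

```r
library(dplyr)

# The two pieces, with values copied from the outputs above
df <- tibble(
  species = factor(c("Adelie", "Gentoo", "Chinstrap")),
  n       = c(152L, 124L, 68L),
  rfreq   = c(44.19, 36.05, 19.77)
)
total_line <- tibble(n = 344L, rfreq = 100.01, species = "Total")

# Columns are matched by name, so the different column order of
# total_line does not matter. Combining a factor column with a
# character column gives a character column.
df_total <- bind_rows(df, total_line)

df_total
```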
We use the function kable() from {kableExtra}.
Here, we change the names [col.names =] and style the table [full_width = FALSE]. Importantly, we add a line below the penultimate row.
df |>
  kable(table.attr = 'data-quarto-disable-processing="true"',
        escape = FALSE,
        col.names = c("Species",
                      "N",
                      "%")) |>
  kable_styling(full_width = FALSE) |>
  column_spec(1, width = "10em") |>
  row_spec(nrow(df) - 1,
           extra_css = "border-bottom: 1px solid")
With escape = FALSE, we can better control the possible special characters.
full_width is a self-explanatory argument that can be used with other functions. kable_styling() styles the table with a variety of arguments.
We set the width of the first column [column_spec()].
Since df is already in the environment, we can refer in this way to the penultimate row of df.
We add the line with a CSS rule [extra_css].
Species | N | % |
---|---|---|
Adelie | 152 | 44.19 |
Gentoo | 124 | 36.05 |
Chinstrap | 68 | 19.77 |
Total | 344 | 100.01 |
For printing a table in a PDF document, via \(\LaTeX\), a few adjustments are needed.
The code above needs two changes [booktabs, extra_latex_after].
df |>
  kable(booktabs = TRUE,
        escape = FALSE,
        col.names = c("Species",
                      "$N$",
                      "\\%")) |>
  kable_styling(full_width = FALSE) |>
  column_spec(1, width = "10em") |>
  row_spec(nrow(df) - 1,
           extra_latex_after = "\\hline")
booktabs = TRUE improves the aesthetic quality of tables thanks to a thought-through format.
The $ $ environment introduces math symbols. I find it appropriate here.
The % character introduces comments in \(\LaTeX\), a catastrophe in the middle of the code of a table. We must escape it twice [\\].
The column width is given in em units. We could go for cm instead.
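The double backslash is a matter of R strings rather than of \(\LaTeX\): inside an R string literal, "\\" stands for a single backslash. A quick base R check (the name pct is illustrative):

```r
# In an R string literal, "\\" is one literal backslash.
pct <- "\\%"

nchar(pct)  # 2 characters: a backslash and a percent sign
cat(pct)    # prints \% which LaTeX reads as a literal percent sign
```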
Let me suggest three small aesthetic changes:
A different extra_latex_after value [\\cline{2-3}].
Bold text [row_spec, bold = TRUE].