library(dplyr)
library(palmerpenguins)
library(kableExtra)
For this exercise, we use the penguins data set from the {palmerpenguins} package.
Antonio Fidalgo
Miguel Salema
February 13, 2024
A table that shows the frequencies and/or relative frequencies of each group. Then, add a line for the total at the bottom of the table.
| Species | N | % |
|---|---|---|
| Adelie | 152 | 44.19 |
| Gentoo | 124 | 36.05 |
| Chinstrap | 68 | 19.77 |
| Total | 344 | 100.01 |
We use {dplyr} to create a table with summary statistics, and {kableExtra} to format it.
Install the following packages [install.packages()] if they are not present on your machine.
The key function here is summarise(), in conjunction with group_by(), to first count [n()] how many observations each group has.
Outside the grouping, we create [mutate()] the variable for the relative frequencies. If we did it inside the grouping, then the relative frequency would be 1 for each group.
The ordering [arrange()] in descending order [desc()] is optional.
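To see why the relative frequency belongs outside the grouping, here is a minimal sketch on a small made-up table. The object `counts` and its values are hypothetical stand-ins, not the penguins data.

```r
library(dplyr)

# Hypothetical stand-in counts, not the penguins data
counts <- tibble(group = c("a", "b", "c"), n = c(5L, 3L, 2L))

# Inside the grouping: sum(n) is computed within each group,
# so every share equals 1
inside <- counts |>
  group_by(group) |>
  mutate(share = n / sum(n)) |>
  ungroup()

# Outside the grouping: sum(n) is the grand total
outside <- counts |>
  mutate(share = n / sum(n))

inside$share   # 1 1 1
outside$share  # 0.5 0.3 0.2
```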
df <- penguins |>
  group_by(species) |>
  summarise(n = n()) |>
  ungroup() |>
  mutate(rfreq = (n / sum(n) * 100) |> round(2)) |>
  arrange(desc(rfreq))

n() is the helper function that counts the number of rows.
Put parentheses ( ) around the whole expression that you want to pipe into round().
df
# A tibble: 3 × 3
species n rfreq
<fct> <int> <dbl>
1 Adelie 152 44.2
2 Gentoo 124 36.0
3 Chinstrap 68 19.8
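The parentheses in the mutate() step matter because the native pipe binds more tightly than `*` and `/`. A small base R sketch, with a made-up vector `x`, shows the difference:

```r
x <- c(1, 2, 4)  # made-up values for illustration

# Without parentheses, only 100 is piped into round():
# this is read as x / sum(x) * round(100, 2)
unrounded <- x / sum(x) * 100 |> round(2)

# With parentheses, the whole expression is rounded
rounded <- (x / sum(x) * 100) |> round(2)

unrounded
rounded  # 14.29 28.57 57.14
```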
We take the object created above and use summarise() again, this time to get the total [sum()] of each variable. It is good practice to make the sum robust to the presence of NAs [na.rm = TRUE].
total_line <- df |>
  summarise(n = sum(n, na.rm = TRUE),
            rfreq = sum(rfreq, na.rm = TRUE),
            species = "Total")

total_line
# A tibble: 1 × 3
n rfreq species
<int> <dbl> <chr>
1 344 100. Total
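The penguin counts have no missing values, so na.rm = TRUE changes nothing here. A quick sketch with a hypothetical vector containing an NA shows what it protects against:

```r
counts <- c(152L, NA, 68L)  # hypothetical counts with one missing value

sum(counts)                # NA
sum(counts, na.rm = TRUE)  # 220
```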
In one command, I can ask to calculate [summarise()] the sum [sum()] of each variable [across()] that satisfies [where()] is.numeric(). Notice how the formula for each variable [.x] is introduced [~].
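The one-command version is not shown here, so the following is only a sketch of what it might look like. It rebuilds a stand-in `df` from the values printed above. Adding the label with mutate() is my assumption, not necessarily the original approach.

```r
library(dplyr)

# Stand-in for the df built earlier, using the values shown in the tables
df <- tibble(
  species = factor(c("Adelie", "Gentoo", "Chinstrap")),
  n = c(152L, 124L, 68L),
  rfreq = c(44.19, 36.05, 19.77)
)

total_line <- df |>
  # sum every numeric column; .x stands for the current column
  summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE))) |>
  # assumption: the row label is added afterwards
  mutate(species = "Total")

total_line
```

The advantage over naming each variable is that the command does not change if more numeric columns are added later.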
The bind_rows() function from {dplyr} will take care of matching the columns by their name.
df
# A tibble: 4 × 3
species n rfreq
<chr> <int> <dbl>
1 Adelie 152 44.2
2 Gentoo 124 36.0
3 Chinstrap 68 19.8
4 Total 344 100.
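The binding step that produces a four-row table like the one above can be sketched as follows. The two objects are rebuilt as stand-ins from the printed values, and note that the column order of `total_line` differs from that of `df`.

```r
library(dplyr)

# Stand-ins for the objects created earlier
df <- tibble(
  species = factor(c("Adelie", "Gentoo", "Chinstrap")),
  n = c(152L, 124L, 68L),
  rfreq = c(44.19, 36.05, 19.77)
)
total_line <- tibble(n = 344L, rfreq = 100.01, species = "Total")

# Columns are matched by name, not by position
df <- bind_rows(df, total_line)

df
```

Because species is a factor in one table and a character in the other, the combined column becomes a character, which is why the printed type changes from <fct> to <chr>.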
We use the function kable(), which comes from {knitr} and is made available by loading {kableExtra}.
Here, we change the column names [col.names =] and style the table [full_width = FALSE]. Importantly, we add a line below the penultimate row.
df |>
  kable(table.attr = 'data-quarto-disable-processing="true"',
        escape = FALSE,
        col.names = c("Species",
                      "N",
                      "%")) |>
  kable_styling(full_width = FALSE) |>
  column_spec(1, width = "10em") |>
  row_spec(nrow(df) - 1,
           extra_css = "border-bottom: 1px solid")

With escape = FALSE, we can better control the possible special characters.
full_width is a self-explanatory argument that can be used with other functions. kable_styling() styles the table with a variety of arguments.
We set the width of the first column [column_spec()].
Since df is already in the environment, we can write nrow(df) - 1 for the penultimate row of df.
We draw a line below that row [extra_css].
| Species | N | % |
|---|---|---|
| Adelie | 152 | 44.19 |
| Gentoo | 124 | 36.05 |
| Chinstrap | 68 | 19.77 |
| Total | 344 | 100.01 |
Printing a table in a PDF document, via \(\LaTeX\), requires a few adjustments.
The code above needs two changes [booktabs, extra_latex_after].
df |>
  kable(booktabs = TRUE,
        escape = FALSE,
        col.names = c("Species",
                      "$N$",
                      "\\%")) |>
  kable_styling(full_width = FALSE) |>
  column_spec(1, width = "10em") |>
  row_spec(nrow(df) - 1,
           extra_latex_after = "\\hline")

booktabs = TRUE improves the aesthetic quality of tables thanks to a thought-through format.
The $ $ environment introduces math symbols. I find it appropriate here.
The % character introduces comments in \(\LaTeX\), a catastrophe in the middle of the code of a table. We must escape it twice [\\].
The column width is given in em units. We could go for cm instead.
Let me suggest three small aesthetic changes:
Use a different extra_latex_after value [\\cline{2-3}].
Make a row bold [row_spec, bold = TRUE].
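A sketch of how the \\cline{2-3} and bold suggestions could be combined is given below. The data frame is a stand-in built from the values in the table. Which row should be bold is not stated, so applying it to the total row is my assumption. I also set format = "latex" explicitly so that the result can be inspected outside a PDF document.

```r
library(kableExtra)

# Stand-in for the final df, using the values shown in the table
df <- data.frame(
  species = c("Adelie", "Gentoo", "Chinstrap", "Total"),
  n = c(152L, 124L, 68L, 344L),
  rfreq = c(44.19, 36.05, 19.77, 100.01)
)

tex <- df |>
  kable(format = "latex",
        booktabs = TRUE,
        escape = FALSE,
        col.names = c("Species", "$N$", "\\%")) |>
  kable_styling(full_width = FALSE) |>
  column_spec(1, width = "10em") |>
  # partial rule under columns 2 and 3 only, instead of a full \hline
  row_spec(nrow(df) - 1, extra_latex_after = "\\cline{2-3}") |>
  # assumption: the bold row is the total row
  row_spec(nrow(df), bold = TRUE)

cat(tex)
```

Printing the result with cat() shows the generated \(\LaTeX\) code, which is a convenient way to check where the rule and the bold face end up.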