Exploring Publicly Available Data in Singapore

Data.gov.sg was launched in 2011 as a one-stop portal to its publicly-available data sets from 70 public agencies. It recently underwent a revamp, and has a new-look website. Other than data from data.gov.sg, I also use data from the Department of Statistics Singapore. This is a personal project to practice R programming, and in the process hope to learn some insights from the data sets.

Demographics

Live Births and Fertility

Singapore has one of the lowest fertility rates globally World Bank.

# Load data
live_births <- read_csv("data/births-and-fertility-annual/live-births.csv")

## Rows: 118 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): level_1, value
## dbl (1): year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Convert value to numeric
live_births$value <- as.numeric(as.character(live_births$value))

## Warning: NAs introduced by coercion

# Calculate long term mean
live_births %>%
  group_by(level_1) %>% 
  summarize(across(value,
                   mean,
                   na.rm = TRUE)) %>% 
  ungroup()

## Warning: There was 1 warning in `summarize()`.
## ℹ In argument: `across(value, mean, na.rm = TRUE)`.
## ℹ In group 1: `level_1 = "Resident Live-births"`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
## 
##   # Previously
##   across(a:b, mean, na.rm = TRUE)
## 
##   # Now
##   across(a:b, \(x) mean(x, na.rm = TRUE))

## # A tibble: 2 × 2
##   level_1               value
##   <chr>                 <dbl>
## 1 Resident Live-births 40763.
## 2 Total Live-births    45152.

# Plot number of live-births
live_births %>% 
  drop_na() %>% 
  ggplot(aes(x = year,
             y = value,
             colour = level_1)) +
  geom_line(linewidth = 1.05) +
  geom_hline(yintercept = 45000,
             linetype = 2,
             linewidth = 0.5) +
  facet_wrap(~level_1,
             scales = "free_x") +
  scale_y_continuous(labels = label_number(big.mark = ",")) +
  theme_classic() +
  theme(legend.position = "none") +
  scale_colour_paletteer_d("ggsci::default_jco") +
  labs(x = "",
       y = "",
       title = "Number of Live-births in Singapore",
       subtitle = "Dashed line at 45,000 represents the long term mean",
       caption = "*Note: Resident Live-Births refers to births with at least one parent who is a Singapore citizen or permanent resident\nData: Department of Statistics (data.gov.sg) | Graphic: @weiyuet")

Total Fertility Rate by Ethnic Groups

The Total Fertility Rate refers to the average number of live-births each female would have during her reproductive years if she were to experience the age-specific fertility rates prevailing during the period. It is derived by aggregating the age-specific fertility rates of females in each of the reproductive ages for a specific year.

# Load data
total_fertility_rate_by_ethnic_group <- read_csv("data/births-and-fertility-annual/total-fertility-rate-by-ethnic-group.csv")

## Rows: 177 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): level_1, level_2
## dbl (2): year, value
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Plot fertility rate by ethnic groups
total_fertility_rate_by_ethnic_group %>% 
  ggplot(aes(x = year,
             y = value,
             colour = level_2)) +
  geom_line() +
  geom_point(aes(shape = level_2)) +
  geom_hline(yintercept = 2.1,
             linetype = 2,
             linewidth = 0.5) +
  theme_classic() +
  theme(legend.title = element_blank(),
        legend.position = c(0.85, 0.55)) +
  scale_x_continuous(expand = c(0.01, 0),
                     limits = c(1960, 2020),
                     breaks = seq(1960, 2020, 5)) +
  scale_y_continuous(breaks = seq(0, 8, 0.5),
                     labels = label_number(decimal.mark = '.',
                                           accuracy = 0.1)) +
  scale_colour_paletteer_d("ggsci::default_jco") +
  labs(x = "",
       y = "",
       title = "Total Fertility Rate by Ethnic Groups",
       subtitle = "Dashed line at 2.1 represents the population replacement rate",
       caption = "Data: Department of Statistics (data.gov.sg) | Graphic: @weiyuet")

## Warning: A numeric `legend.position` argument in `theme()` was deprecated in ggplot2
## 3.5.0.
## ℹ Please use the `legend.position.inside` argument of `theme()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Health

Life Expectancy

In contrast to falling birth rates and fertility rates, the life expectancy has been increasing steadily.

# Load data
life_expectancy <- read_csv('data/life-expectancy-by-sex-annual/life-expectancy-at-birth-and-age-65-years.csv')

## Rows: 86 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): level_1
## dbl (2): year, value
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Plot data
# Line plot of life expectancy
life_expectancy %>%
       ggplot(aes(x = year,
                  y = value,
                  colour = level_1)) +
       geom_line(linewidth = 1) +
       facet_wrap(vars(level_1),
                  scales = "free_y") +
       scale_x_continuous(breaks = seq(1960, 2020, 10)) +
       scale_colour_paletteer_d("ggsci::default_jco") +
       labs(x = "",
            y = "",
            title = "Life Expectancy of Residents in Singapore",
            caption = "Data: Ministry of Trade and Industry - Department of Statistics (data.gov.sg) | Graphic: @weiyuet") +
        theme_classic() +
        theme(legend.title = element_blank(),
              legend.position = "none")

Number of Students in Primary and Secondary Schools

Together with the falling birth rate, the number of students in primary and secondary schools have also been declining.

This has lead to the Ministry of Education merging and closing schools. Although, the student-teacher ratio has been decreasing (which is considered a positive metric).

Number of Students in Primary Schools

# Load data
students_and_teachers_primary_schools <- read_csv("data/students-and-teachers-in-schools/students-and-teachers-primary-schools.csv")

## Rows: 168 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): sex, school_type
## dbl (3): year, students_pri, teachers_pri
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Plot number of students in primary schools
students_and_teachers_primary_schools %>% 
  ggplot(aes(x = year,
             y = students_pri,
             colour = school_type)) +
  geom_line(linewidth = 1) +
  facet_wrap(vars(sex)) +
  guides(colour = guide_legend(nrow = 1)) +
  scale_colour_paletteer_d("ggsci::default_jco") +
  scale_x_continuous(breaks = seq(1980, 2020, 5)) +
  scale_y_continuous(labels = label_number(big.mark = ","), 
                     limits = c(10000, 250000)) +
  labs(x = "",
       y = "",
       title = "Number of Students in Primary Schools",
       colour = "School Type",
       caption = "Data: Ministry of Education (data.gov.sg) | Graphic: @weiyuet") +
  theme_classic() +
  theme(legend.position = "bottom",
        legend.title = element_blank())

Student-Teacher Ratio in Primary Schools

# Wrangle data
# Calculate student-teacher ratio
students_and_teachers_primary_schools <- students_and_teachers_primary_schools %>% 
  mutate(student_teacher_ratio = students_pri/teachers_pri)

students_and_teachers_primary_schools %>% 
  filter(sex == "MF") %>%
  ggplot(aes(x = year,
             y = student_teacher_ratio,
             colour = school_type)) +
  geom_smooth() +
  geom_jitter() +
  facet_wrap(vars(school_type)) +
  scale_colour_paletteer_d("ggsci::default_jco") +
  labs(x = "",
       y = "",
       title = "Student-Teacher Ratio in Primary Schools",
       caption = "Data: Ministry of Education (data.gov.sg) | Graphic: @weiyuet") +
  theme_classic() +
  theme(legend.position = "none")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Number of Students in Secondary Schools

# Load data
students_and_teachers_secondary_schools <- read_csv("data/students-and-teachers-in-schools/students-and-teachers-secondary-schools.csv")

## Rows: 484 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): sex, school_type
## dbl (3): year, student_sec, teacher_sec
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Fix duplicate school types "Specialised Independant" and "Specialised Independent" and "Auto"
students_and_teachers_secondary_schools <- students_and_teachers_secondary_schools %>%
  mutate(school_type = case_when(school_type == "Specialised Independant" ~ "Specialised Independent",
                                 school_type == "Auto" ~ "Autonomous",
                                 TRUE ~ school_type))

# Plot number of secondary school students
students_and_teachers_secondary_schools %>% 
  ggplot(aes(x = year,
             y = student_sec,
             colour = school_type)) +
  geom_line(linewidth = 1) +
  facet_wrap(vars(sex)) +
  guides(colour = guide_legend(nrow = 1)) +
  scale_colour_paletteer_d("ggsci::default_jco") +
  scale_x_continuous(breaks = seq(1980, 2020, 5)) +
  scale_y_continuous(breaks = seq(0, 200000, 25000),
                     labels = label_number(big.mark = ",")) +
  labs(x = "",
       y = "",
       title = "Number of Students in Secondary Schools",
       colour = "School Type",
       caption = "Data: Ministry of Education (data.gov.sg) | Graphic: @weiyuet") +
  theme_classic() +
  theme(legend.position = "bottom",
        legend.title = element_blank())

Student-Teacher Ratio in Seondary Schools

# Wrangle data
# Calculate student-teacher ratio
students_and_teachers_secondary_schools <- students_and_teachers_secondary_schools %>% 
  mutate(student_teacher_ratio = student_sec/teacher_sec)

students_and_teachers_secondary_schools %>% 
  filter(sex == "MF") %>%
  ggplot(aes(x = year,
             y = student_teacher_ratio,
             colour = school_type)) +
  geom_smooth() +
  geom_jitter() +
  facet_wrap(vars(school_type),
             scales = "free") +
  scale_colour_paletteer_d("ggsci::default_jco") +
  labs(x = "",
       y = "",
       title = "Student-Teacher Ratio in Secondary Schools",
       caption = "Data: Ministry of Education (data.gov.sg) | Graphic: @weiyuet") +
  theme_classic() +
  theme(legend.position = "none")

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

## Warning: Removed 84 rows containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 84 rows containing missing values or values outside the scale range
## (`geom_point()`).

Infrastructure

Housing

Being an island state, land is limited, and thus housing is a perennial issue.
The majority of people live in public housing, and the term evokes a different image compared to other countries.

Flats Constructed By Housing And Development Board, Annual

# Load data
flats_constructed <- read_csv("data/flats-constructed/flats-constructed-by-housing-and-development-board-annual.csv")

## Rows: 41 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): year, flats_constructed
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Plot number of flats constructed
p <- flats_constructed %>%
  ggplot(aes(x = year,
             y = flats_constructed)) +
  geom_step() +
  scale_x_continuous(expand = c(0.01, 0),
                     limits = c(1975, 2020),
                     breaks = seq(1975, 2020, 5)) +
  scale_y_continuous(limits = c(2000, 75000),
                     breaks = seq(5000, 75000, 10000),
                     labels = label_number(big.mark = ",")) +
  labs(x = "",
       y = "",
       title = glue("Number of Flats Constructed by the Housing and Development Board ({min(flats_constructed$year)}-{max(flats_constructed$year)})"),
       caption = "Data: Housing and Development Board (data.gov.sg) | Graphic: @weiyuet") +
  theme_classic()

# Annotate
p + (annotate("text",
              x = 1984,
              y = 70000,
              label = "67,017",
              size = 3)) + 
       (annotate("text",
                 x = 2006,
                 y = 8000,
                 label = "2,733",
                 size = 3))

Economy

Container Throughput Monthly

# Load data
container_throughput_monthly <- read_csv("data/container-throughput-monthly-total/container-throughput-monthly.csv")

## Rows: 371 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): month
## dbl (1): container_throughput
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Separate month and year
container_throughput_monthly <- container_throughput_monthly %>% 
  separate(month, 
           c("year", "month"))

# Convert year and month into numeric
container_throughput_monthly <- container_throughput_monthly %>% 
  mutate(across(1:2, 
                as.numeric))

# Plot container throughput monthly
container_throughput_monthly %>% 
  ggplot(aes(x = month,
             y = container_throughput)) +
  geom_step() +
  facet_wrap(vars(year)) +
  scale_x_continuous(breaks = seq(1:12),
                     labels = month.abb[seq(1:12)]) +
  scale_y_continuous(expand = c(0.1, 0)) +
  labs(x = "",
       y = "",
       title = glue("Container Throughput ({min(container_throughput_monthly$year)}-{max(container_throughput_monthly$year)})"),
       subtitle = "'000 Twenty-foot equivalent units",
       caption = "Data: Maritime and Port Authority of Singapore (data.gov.sg) | Graphic: @weiyuet") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90,
                                   vjust = 0.5,
                                   hjust = 1))

Climate

Number of Rain Days Monthly

# Load data
number_of_rain_days_monthly <- read_csv("data/rainfall-monthly-number-of-rain-days/rainfall-monthly-number-of-rain-days.csv")

## Rows: 527 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): month
## dbl (1): no_of_rainy_days
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Separate year and month
number_of_rain_days_monthly <- number_of_rain_days_monthly %>% 
  separate(month, c("year", "month"))

# Convert year and month into numeric
number_of_rain_days_monthly <- number_of_rain_days_monthly %>% 
  mutate(year = as.numeric(year),
         month = as.numeric(month))

# Plot number of rain days monthly
number_of_rain_days_monthly %>%
  ggplot(aes(x = month,
             y = no_of_rainy_days)) +
  geom_point() +
  geom_line() +
  geom_hline(yintercept = 15,
             linetype = "dotted",
             colour = "red") +
  facet_wrap(vars(year)) +
  scale_x_continuous(breaks = 1:12,
                     labels = month.abb[1:12]) +
  scale_y_continuous(breaks = seq(0, 30, 5)) +
  labs(x = "",
       y = "",
       title = glue("Number of Rain Days per Month in Singapore ({min(number_of_rain_days_monthly$year)}-{max(number_of_rain_days_monthly$year)})"),
       subtitle = "Recorded at Changi Climate Station (1.3667, 103.9833)",
       caption = "Data: National Environment Agency (data.gov.sg) | Graphic: @weiyuet") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90,
                                   vjust = 0.5,
                                   hjust = 1,
                                   size = 7))

Endemic Diseases

Dengue Fever

# Load data
weekly_infectious_disease <- read_csv('data/weekly-infectious-disease-bulletin-cases/weekly-infectious-disease-bulletin-cases.csv')

## Rows: 20070 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): epi_week, disease
## dbl (1): no._of_cases
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Create new empty columns
col <- paste("col", 1:2)

# Split year and week column
weekly_infectious_disease <- weekly_infectious_disease %>%
  separate(col = epi_week,
           sep = "-",
           into = col,
           remove = TRUE)

weekly_infectious_disease <- weekly_infectious_disease %>% 
  rename(year = "col 1",
         week = "col 2",
         cases = "no._of_cases")

# Extract numbers from week column
weekly_infectious_disease$week <- as.numeric(str_extract(weekly_infectious_disease$week,
                                                         "[0-9]+"))

# Convert year into numeric
weekly_infectious_disease <- weekly_infectious_disease %>% 
  mutate(year = as.numeric(year))

# Clean disease names
weekly_infectious_disease <- weekly_infectious_disease %>% 
  mutate(disease = case_when(disease == "HFMD" ~ "Hand, Foot Mouth Disease",
                             disease == "Zika Virus Infection" ~ "Zika",
                             TRUE ~ disease))

# Plot weekly case numbers of Dengue Fever
disease_selected <- c("Dengue Fever")

# Calculate mean from sample period
weekly_infectious_disease %>% 
  filter(disease %in% disease_selected) %>% 
  group_by(disease) %>% 
  summarize(across(cases,
                   mean,
                   na.rm = TRUE)) %>% 
  ungroup()

## # A tibble: 1 × 2
##   disease      cases
##   <chr>        <dbl>
## 1 Dengue Fever  286.

weekly_infectious_disease %>%
  filter(disease %in% disease_selected) %>%
  ggplot(aes(x = week,
             y = cases)) +
  geom_col() +
  facet_wrap(vars(year)) +
  geom_hline(yintercept = 286,
             linetype = "dashed",
             colour = "red") +
  scale_x_continuous(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0)) +
  labs(x = "Week#",
       y = "",
       title = glue("Weekly Case Numbers of Dengue Fever in Singapore ({min(weekly_infectious_disease$year)}-{max(weekly_infectious_disease$year)})"),
       subtitle = "Horizontal dashed line represents the mean of the entire data sample (mean = 286)",
       caption = "Data: Ministry of Health (data.gov.sg) | Graphic: @weiyuet") +
  theme_classic() +
  theme(axis.ticks.x = element_blank())

Singapore Data Project

Created: 26 May, 2022. Updated: 31 December, 2025