Using R Programming To Look At Number of Schools In Ontario School Boards

Hi there. In this programming post I showcase R programming for analyzing a dataset on the number of schools in the province of Ontario in Canada. The number of schools is categorized by school boards (2021-2022).

The data is publicly available in .txt format or as an Excel file in English or French. Link


Pixabay Image Source

Loading The File


Load up the tidyverse package as well as the openxlsx package. I am using the excel file to load in the data into RStudio.

# Load tidyverse package & openxlsx
library(tidyverse)
library(openxlsx)

# Load excel spreadsheet from website

file_path <- paste("https://data.ontario.ca/dataset/81c48bca-938c-413c-8c35-a33f96ed8660/",
             "resource/220a6820-0084-46ee-9747-f295f58301f2/download/schools_by_board_2021-2022_en.xlsx",
             sep= "")

# Using openxlsx package command
ont_school_boards <- read.xlsx(file_path, sheet = 1)

 

The head() function can be used to preview a sample of the dataframe.

head_board_counts.PNG

Using str() allows for viewing the dimensions of the dataframe. It also allows you to see the column names and the data types for the values in each column.

str_board_counts.PNG

 


Pixabay Image Source

Convert Into Numeric Columns


The Elementary.Schools column and the Secondary.Schools columns contain characters with its numbers. I convert these two columns into numeric columns. I plan to sort the data by the number of schools later. You cannot sort numbers when the values are strings/characters.

#### Dataframe Formatting

df <- ont_school_boards

# Change Elementary.Schools & Secondary.Schools values into numeric.

df$Elementary.Schools <- as.numeric(df$Elementary.Schools)
df$Secondary.Schools <- as.numeric(df$Secondary.Schools)

Dealing With Missing Values


I do check if the dataset contains any missing values or NA values.

> ## Dealing with missing values:
> 
> # Number of NA values
> sum(is.na(df))
[1] 6

 

There are six missing values in this dataset. I locate the rows with NA values with the use of filter() from the tidyverse package. (Technically filter() is from dplyr).

locate_NAs.PNG

These NA values are in the Elementary Schools and Secondary schools columns. For this case it is safe for the NA values to be replaced by zero.

There is a replace_na() function that allows for replacing NAs to a value of the user's choice. Use list(), specify the columns and the values for replacing NAs.

replace_NAs.PNG


Pixabay Image Source

 

Using Filter To Obtain School Counts


The filter() function is a very useful function to obtain parts of a dataset that meets a certain condition.

French Schools In Ontario
filter_French_schools.PNG

 

Ontario Catholic Schools

In the province of Ontario we have public schools, private schools (not featured here) and public Catholic schools.

With this one my condition is where the Board Type is the Catholic District School board. I have %>% head(15) for displaying the top 15 Catholic schools from the original dataset order.

filter_catholic_schools15.PNG

French Catholic Schools In Ontario

There can be more than one condition when it comes to using filter(). I filter by Language being French and the Board Type.

filter_frenchCatholic.PNG

 

Toronto Schools

Toronto is the capital of the province of Ontario. Here are the Toronto school boards.

filter_Toronto_schools.PNG


Pixabay Image Source

 

Sorting Data By School Board Counts


The arrange() function allows for sorting given a column to sort by. I sort by Total.Schools in descending order.

sort_total_counts_head.PNG

The Toronto District School Board (Toronto DSB) has the largest number of schools out of the school boards in Ontario.

In second it is the Peel DSB with 259 schools. The Peel district school board is west of Toronto with Mississauga as one of the big cities there near the Toronto Pearson Airport.

This following code chunk is for making sure that the horizontal bar graphs are in order from the highest counts to the lowest counts.

# Sort by counts
total_counts$Board.Name <- factor(total_counts$Board.Name, 
                          levels = total_counts$Board.Name[order(total_counts$Total.Schools)])

 

The ggplot2 package that is inside the tidyverse package allows for the creation of bar graphs, line graphs, histograms and other data visualizations.

I don't really want get into too much details with the plotting code below. The main part is ggplot(head(total_counts, 15), aes(x = Board.Name, y = Total.Schools) for specifying the dataset input, the x variable and the y variable.

Stating geom_bar(stat = "identity", width = 0.8) gives a bar graph with the bars of width 0.8.

Having coord_flip() converts the default vertical bar graph into a horizontal bar graph.

The other items such as geom_text(), labs() and theme() is for centering the titles, adding labels, font size and label colours.

# ggplot - Show Top 15 Counts for Ontario School Board

ggplot(head(total_counts, 15), aes(x = Board.Name, y = Total.Schools)) + 
  geom_bar(stat = "identity", width = 0.8) +
  coord_flip() +
  geom_text(aes(label = Total.Schools), hjust = 1.2, colour = "white", fontface = "bold") +
  labs(x = "\n School Board", y = "Count \n", title = "Ontario School Board Counts (21-22)\n") +
  theme(plot.title = element_text(hjust = 0.5, size = 16),
  axis.title.x = element_text(face="bold", colour="darkgreen", size = 12),
  axis.title.y = element_text(face="bold", colour="darkgreen", size = 12))

ontario_school_counts_bar.png

York Region District School Board is in third place. This region is just north of Toronto. Many of the schools in Ontario are in the Toronto and surrounding area.

 

Sort Counts By Number Of Ontario Elementary Schools

The code for elementary schools and secondary schools (high schools) are more or less the same as the previous section. Here it is.

### Sort Counts by Elementary Schools

elem_counts <- df %>% arrange(desc(Elementary.Schools))
head(elem_counts, 10)

# Sort by Elem school counts
elem_counts$Board.Name <- factor(elem_counts$Board.Name, 
                                 levels = elem_counts$Board.Name[order(elem_counts$Elementary.Schools)])

# ggplot - Show Top 15 Counts for Elem Schools Ontario

ggplot(head(elem_counts, 15), aes(x = Board.Name, y = Elementary.Schools)) + 
  geom_bar(stat = "identity", width = 0.8) +
  coord_flip() +
  geom_text(aes(label = Elementary.Schools), hjust = 1.2, colour = "white", fontface = "bold") +
  labs(x = "\n School Board", y = "Count \n", title = "Ontario Elementary School \nBoard Counts (21-22)\n") +
  theme(plot.title = element_text(hjust = 0.5, size = 16),
        axis.title.x = element_text(face="bold", colour="darkgreen", size = 12),
        axis.title.y = element_text(face="bold", colour="darkgreen", size = 12))

ontario_elem_counts_bar.png

 

Sort Counts By Number Of Ontario Secondary Schools

### Sort Counts By Secondary Schools (High School)

hs_counts <- df %>% arrange(desc(Secondary.Schools))
head(hs_counts, 10)

# Sort by High school counts
hs_counts$Board.Name <- factor(hs_counts$Board.Name, 
                              levels = hs_counts$Board.Name[order(hs_counts$Secondary.Schools)])

# ggplot - Show Top 15 Counts for Elem Schools Ontario

ggplot(head(hs_counts, 15), aes(x = Board.Name, y = Secondary.Schools)) + 
  geom_bar(stat = "identity", width = 0.8) +
  coord_flip() +
  geom_text(aes(label = Secondary.Schools), hjust = 1.2, colour = "white", fontface = "bold") +
  labs(x = "\n School Board", y = "Count \n", title = "Ontario Secondary School \nBoard Counts (21-22)\n") +
  theme(plot.title = element_text(hjust = 0.5, size = 16),
        axis.title.x = element_text(face="bold", colour="darkgreen", size = 12),
        axis.title.y = element_text(face="bold", colour="darkgreen", size = 12))

ontario_secondary_counts_bar.png

The numbers for secondary school is much lower compared to the elementary schools. I do wonder what the school sizes are like and what the class sizes are like. That is something to investigate beyond this data.


Pixabay Image Source