A new tool to make it easier to work with ABS data

This post will only be of interest to people who work with ABS time series data in R.

To use most time series data from the Australian Bureau of Statistics1 you need to click around on their website – with your hands, like an animal – and download a spreadsheet, like it’s the 90s or something. Horrifying stuff.2 Even worse, once you’ve got the spreadsheet you’ll find the data is formatted in a way that makes it hard to work with in statistical software.

Building on earlier work by Zoe Meers and Jaron Lee3, I’ve written an R package called readabs that makes it easier to work with ABS time series data.

The package has one key function, read_abs(), which will download, import, and tidy ABS time series data for you.

The messiness of ABS time series

If you want to visualise or analyse data in R, you will often need to tidy it first. In tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

ABS time series data is not tidy. Tidying it requires a bit of work. This screenshot of an ABS time series spreadsheet shows some of the problems, namely:

  • metadata and data are in the same columns;
  • in some cases, the data are spread across multiple worksheets;
  • each time series has its own column; and
  • dates are in an Excel format (eg. Feb-1978 is stored as 28522), which is a pain to convert.

readabs does a lot of the work of tidying these spreadsheets for you, so you can get to your analysis more quickly.

Installing readabs

The readabs package is on CRAN, so you can install it in R by running:

install.packages("readabs")

If you want to install the latest work-in-progress version, you can get it from my GitHub:

devtools::install_github("mattcowgill/readabs")

Using readabs

There’s one main function in the package: read_abs(). To download all the time series spreadsheets from an ABS catalogue number, just run read_abs("9999.0"), replacing “9999.0” with the catalogue number you’re interested in. For example, to get all the time series from the Wage Price Index, catalogue number 6345.0 you’d run:

library(readabs)

all_wpi <- read_abs("6345.0")
#> Finding filenames for tables from ABS catalogue 6345.0
#> Attempting to download files from cat. no. 6345.0, Wage Price Index, Australia
#> Extracting data from downloaded spreadsheets
#> Tidying data from imported ABS spreadsheets

str(all_wpi)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    54261 obs. of  12 variables:
#>  $ table_no        : chr  "634501" "634501" "634501" "634501" ...
#>  $ sheet_no        : chr  "Data1" "Data1" "Data1" "Data1" ...
#>  $ table_title     : chr  "Table 1. Total Hourly Rates of Pay Excluding Bonuses: Sector, Original, Seasonally Adjusted and Trend" "Table 1. Total Hourly Rates of Pay Excluding Bonuses: Sector, Original, Seasonally Adjusted and Trend" "Table 1. Total Hourly Rates of Pay Excluding Bonuses: Sector, Original, Seasonally Adjusted and Trend" "Table 1. Total Hourly Rates of Pay Excluding Bonuses: Sector, Original, Seasonally Adjusted and Trend" ...
#>  $ date            : Date, format: "1997-09-01" "1997-12-01" ...
#>  $ series          : chr  "Quarterly Index ;  Total hourly rates of pay excluding bonuses ;  Australia ;  Private ;  All industries ;" "Quarterly Index ;  Total hourly rates of pay excluding bonuses ;  Australia ;  Private ;  All industries ;" "Quarterly Index ;  Total hourly rates of pay excluding bonuses ;  Australia ;  Private ;  All industries ;" "Quarterly Index ;  Total hourly rates of pay excluding bonuses ;  Australia ;  Private ;  All industries ;" ...
#>  $ value           : num  67.4 67.9 68.5 68.8 69.6 70 70.4 70.8 71.5 71.9 ...
#>  $ series_type     : chr  "Original" "Original" "Original" "Original" ...
#>  $ data_type       : chr  "INDEX" "INDEX" "INDEX" "INDEX" ...
#>  $ collection_month: chr  "3" "3" "3" "3" ...
#>  $ frequency       : chr  "Quarter" "Quarter" "Quarter" "Quarter" ...
#>  $ series_id       : chr  "A2603039T" "A2603039T" "A2603039T" "A2603039T" ...
#>  $ unit            : chr  "Index Numbers" "Index Numbers" "Index Numbers" "Index Numbers" ...

Maybe you only want a particular table? Here’s how you get a single table:


wpi_t1 <- read_abs("6345.0", tables = 1)
#> Finding filenames for tables from ABS catalogue 6345.0
#> Attempting to download files from cat. no. 6345.0, Wage Price Index, Australia
#> Extracting data from downloaded spreadsheets
#> Tidying data from imported ABS spreadsheets

If you want multiple tables, but not the whole catalogue, that’s easy too:


wpi_t1_t5 <- read_abs("6345.0", tables = c("1", "5a"))
#> Finding filenames for tables from ABS catalogue 6345.0
#> Attempting to download files from cat. no. 6345.0, Wage Price Index, Australia
#> Extracting data from downloaded spreadsheets
#> Tidying data from imported ABS spreadsheets

For more examples, see the readabs vignette.

If you have any suggestions for improvements to the package – or if you find any bugs – please let me know by filing a GitHub issue or emailing me at mattcowgill at gmail.

  1. Apart from the few that are on ABS.Stat and accessible via the API.
  2. I should clarify here that I love and cherish the ABS.
  3. I initially created a package called ‘getabs’; Zoe and Jaron saw this and generously invited me to merge it with their readabs package and take over the readabs name.