Our Triple Zero outage: the facts, the cause, and what's next

When I first spoke about the Triple Zero outage we experienced on March 1, I committed to conducting a swift, thorough and forensic investigation and to sharing the findings publicly.
Vicki Brady · 27 March 2024
Our Triple Zero investigation findings

Firstly – I want you to trust that when you call Triple Zero in an emergency, you can rely on the technology and the team behind the scenes to work seamlessly and get you the help you need as quickly as possible. Anything less is not good enough.

Secondly – I would like to thank our Triple Zero call takers for the work they do to support all Australians. They have one of the toughest jobs of anyone I know, and on March 1 that job became much harder. When our processes and systems failed, they did the best they possibly could, and for that I am so grateful.

What happened

At 3:30am AEDT on March 1 our Triple Zero team identified an issue with Calling Line Identification (CLI) not appearing for calls coming into the service.  

CLI is critical as it provides us with the location and phone number of the person calling and is needed to transfer that call to the relevant emergency services operators.  

Our technical team immediately began investigating the cause of the issue and working on a fix, while our Triple Zero team enacted our backup process.   

That backup process involves our operator asking for the caller’s location and then manually connecting them to the relevant emergency service.  

This was successful for 346 of the 494 calls made during the incident.  

Of the remaining 148 calls, 127 had to go through a manual email transfer and callback process because some of the phone numbers for the relevant emergency services operators stored in the backup database were incorrect.

After a discussion with our Triple Zero team, the remaining 21 callers advised they did not require emergency assistance.  

Within 90 minutes of the incident beginning, the team had restarted the impacted server and the service returned to normal. 

What caused the issue

Our investigation has concluded that the issue on the morning of March 1 was caused by a combination of a technical fault, an issue in our backup process and a communication error that occurred in the heat of the moment.

Technical fault

At 3:30am AEDT on March 1 there was a high volume of registration requests to the platform that manages CLI for the Triple Zero service. These requests came from medical alert devices: IoT devices that are designed to make emergency calls.

At the time of the incident, these devices were not making calls; they were registering on our network so they would be ready should they need to make an emergency call in the future.

This would not ordinarily cause an issue. On this occasion, however, it coincided with other system activity that caused the number of connections to the database to reach its maximum limit. This triggered an existing but previously undetected software fault, which in turn caused the platform to become unresponsive and unable to recover on its own.

Backup process issue

We store a backup phone number for each of the 24 state emergency services operators in a secondary database, so that calls can be transferred manually if needed. For eight of the 24 operators, the stored number was incorrect, which prevented our team from manually transferring calls to the respective emergency services operator.

Communication error

Finally, there were delays in some callers receiving call backs from emergency services operators, mainly in Victoria.

When our team identified the issue with the phone numbers, they resorted to email and were supplied an updated email address for Triple Zero Victoria during the incident. That address was entered into the system incorrectly. Our team identified the error within 13 minutes, but it still caused a delay.

Ensuring we have the right contact numbers for emergency services operators is a basic requirement, and something we should have got right.

The team turned to email as a last resort when our manual transfer backup failed. Relying on email as a fallback in this situation is far from ideal, and it introduced a delay that is entirely unacceptable.

What’s next

We need to get this right every time. I am personally overseeing the work to implement a number of improvements identified both immediately after the incident and through our investigation.

What we did immediately

  • Increased the connection capacity of our CLI platform database to mitigate the risk of the issue happening again.

  • Introduced additional monitoring and notifications for our CLI platform.

  • Updated work instructions for our teams to enable us to diagnose and address any future issues more quickly.

  • Implemented a hold on any changes to our Triple Zero platforms while the investigation was underway.

  • Updated the eight incorrect numbers for emergency services organisations in our backup database and scheduled regular reviews to ensure these are accurate at all times.

What we have done following the investigation

  • Conducted an investigation to identify the cause of the CLI issue, and reproduced the issue in our lab environment.

  • Worked with the enterprise customers who manage the IoT devices to correct the devices' behaviour so that they only send registration requests when they need to make an emergency call.

  • Commenced testing of a software fix that will remove the fault that was responsible for the system becoming unresponsive. 

  • Reviewed our end-to-end approach for Triple Zero, to identify improvements in our backup processes.

  • Reviewed our end-to-end monitoring and alarming for Triple Zero to ensure we can identify and respond to any issues as quickly as possible. 

What we still need to do

  • Finish the testing and deployment of the software change required to fix the fault in the CLI platform (by early April 2024).

Throughout this process we have worked closely with the Federal Government and state emergency services operators, and we continue to cooperate with the ACMA in its ongoing investigation.  

Our commitment

Let me reinforce that the series of failures that occurred on March 1 is unacceptable. The Australian public rely on Triple Zero in their times of greatest need, and we let them down by not being prepared enough for the situation.

As CEO of Telstra, I apologise to everyone who tried to call Triple Zero during this issue, and in particular to the family of a man who suffered a cardiac arrest and tragically passed away.

Networks and technology platforms are complex and may occasionally face issues. It's our job to work tirelessly to reduce that risk, and, if issues do occur, to have backup processes that mean critical services like Triple Zero can continue to operate.

I want to reassure the Australian public that we have worked quickly to understand what occurred, learn from our mistakes and put in place improvements so that all Australians can trust that Triple Zero will be there to support them.  

By Vicki Brady

Chief Executive Officer and Managing Director