A technical fault at the internet registry SRS Plus (part of Network Solutions), which is used by the Zeald domain and email manager, led to an intermittent outage of domain name resolution for the domain names hosted on it.
Domain name resolution is the process by which your web browser finds out where to access a website, by converting a domain name like “www.zeald.com” into an “IP Address” (like 188.8.131.52). The IP Address operates much like a phone number, and allows the web browser to find the site. Because of this, while the site itself was working fine during this issue, the web browser could not find it and so displayed an error message instead of loading the site.
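For those who want to see this conversion in practice, here is a minimal sketch in Python. It uses the standard library's socket.getaddrinfo, which asks the operating system's resolver to turn a name into addresses. The function name resolve is our own invention for illustration.

```python
import socket

def resolve(hostname):
    """Return the sorted IP addresses for hostname, or [] if the lookup fails."""
    try:
        results = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # The situation described above: the site itself may be running,
        # but the name cannot be turned into an address.
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # the first element of sockaddr is the IP address.
    return sorted({sockaddr[0] for *_, sockaddr in results})
```

Calling resolve("www.zeald.com") on a machine with working DNS should return one or more addresses. During the outage, a lookup that reached the broken server would have ended up in the except branch instead.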
The following diagram portrays the process by which your web browser displays a website. The successful route, which is what should happen, is in green. The unsuccessful route, which is what happened to customers affected by this issue, is in red. Further explanation is below:
1. You open your web browser (e.g. Firefox) and type in the domain name (e.g. www.zeald.com), and your web browser sends a request to your ISP (Internet Service Provider, e.g. Telecom) to find the website.
2. Your ISP looks at the DNS root servers, which tell the ISP that the domain registrar for this domain is SRSplus. SRSplus is the registry where domains registered via the Zeald Domain & Email manager are ultimately registered.
3. It then finds a list of DNS servers, and the first one to respond to the request sends the DNS information for Zeald back to the ISP so that it can find Zeald’s DNS servers.
4. The browser then makes another request to Zeald’s DNS server, which returns the IP address of the website so your browser can show the website.
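The four steps above can be sketched as a chain of lookups. This is a deliberately simplified model that mirrors the steps as described here, not real resolver code. Every server name and address in it is invented for illustration (the addresses come from a range reserved for documentation).

```python
# Hypothetical data: all server names and addresses below are made up.
ROOT = {
    # Step 2: the root level points to the registry-level servers.
    "com": ["registry-ns1.example", "registry-ns2.example"],
}

REGISTRY = {
    # Step 3: the registry-level servers return the domain's nameserver
    # together with that nameserver's address (the "glue record").
    "zeald.com": {"nameserver": "ns1.zeald.example", "glue": "192.0.2.53"},
}

AUTHORITATIVE = {
    # Step 4: the domain's own nameserver, keyed by its address,
    # holds the record for the website itself.
    "192.0.2.53": {"www.zeald.com": "192.0.2.10"},
}

def lookup(hostname):
    """Follow the steps and return the website's IP address, or None."""
    domain = ".".join(hostname.split(".")[-2:])  # "www.zeald.com" -> "zeald.com"
    tld = domain.split(".")[-1]                  # "zeald.com" -> "com"

    if not ROOT.get(tld):                        # step 2 fails
        return None

    delegation = REGISTRY.get(domain)            # step 3
    if delegation is None or delegation["glue"] is None:
        return None                              # nameserver cannot be located

    zone = AUTHORITATIVE.get(delegation["glue"], {})
    return zone.get(hostname)                    # step 4
```

In this model, a missing entry at any level makes lookup return None, even though the website at the end of the chain is untouched.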
There are a number of DNS servers registered with SRSplus, and one of these servers wasn’t working. This is why the issue was intermittent: the DNS server that responds to the request changes all the time. When an ISP receives a result from the broken server, it can’t find the website until it next receives a result from a working server. (For those interested in the technical details, this server was not returning a “glue record” for Zeald-controlled domains - see http://en.wikipedia.org/wiki/Domain_Name_System#Circular_dependencies_and_glue_records)
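To illustrate why one broken server out of several produces an intermittent fault rather than a total one, here is a small simulation. The number of servers (four), their names and the address are assumptions made purely for the example. We do not know how many servers were actually involved.

```python
import random

# Hypothetical registry-level servers; one is missing the glue record.
SERVERS = {
    "registry-ns1.example": "192.0.2.53",
    "registry-ns2.example": "192.0.2.53",
    "registry-ns3.example": None,  # the broken server: no glue record
    "registry-ns4.example": "192.0.2.53",
}

def ask_any_server(rng):
    """Pick whichever server 'answers first' and return its glue record."""
    name = rng.choice(sorted(SERVERS))
    return SERVERS[name]

def failure_rate(trials, seed=0):
    """Fraction of lookups that happen to land on the broken server."""
    rng = random.Random(seed)
    failures = sum(ask_any_server(rng) is None for _ in range(trials))
    return failures / trials
```

With one broken server out of four, failure_rate(10000) comes out at roughly a quarter. Most lookups succeed, some fail, and which visitors are affected looks random.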
The DNS system makes extensive use of caching: your computer and your ISP remember the DNS information, so that when you go to the website again it loads more quickly because the DNS doesn’t need to be looked up again. DNS information can be kept for a day or even longer, which is why, even after this issue was fixed, you may still have seen the error message until the DNS refreshed itself.
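The effect of caching can be sketched with a small time-limited cache. This is an illustration only: the class and its names are hypothetical, the one-day lifetime is taken from the figure mentioned above, and whether and for how long a failed lookup is remembered varies between resolvers.

```python
import time

MISS = object()  # sentinel: nothing usable is cached, a fresh lookup is needed

class DnsCache:
    """Remembers lookup results (including failures) until they expire."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so the passage of time can be simulated
        self._entries = {}    # hostname -> (answer, expiry time)

    def put(self, hostname, answer, ttl_seconds):
        self._entries[hostname] = (answer, self._clock() + ttl_seconds)

    def get(self, hostname):
        entry = self._entries.get(hostname)
        if entry is None:
            return MISS
        answer, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[hostname]  # expired: force a fresh lookup
            return MISS
        return answer
```

If a failed lookup (stored here as None) is put into the cache with a lifetime of 86400 seconds, get keeps returning that failure until the day is up, even if the underlying fault was fixed minutes later.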
Because there are so many steps here, it was very difficult to pinpoint exactly where the connection had broken down. If our servers are down, we know straight away and can usually resolve the issue on our end. However, an ICANN-approved registry such as SRSPlus/Network Solutions is part of the infrastructure of the internet. They aren’t even directly a supplier of ours, and it took some time to isolate the issue to one of their servers.
Why wasn’t this server working?
We are still trying to find out from our provider Webdrive, who in turn have the direct relationship with SRSplus, why the server stopped working correctly. After extensive testing and research, on both our end and Webdrive’s end, we found that this SRSplus server wasn’t resolving our DNS. It couldn’t ‘see’ the IP address for our nameservers, meaning that it couldn’t find where the website was, and your browser would therefore show an error message instead.
SRSplus is a global registrar, one of the world’s largest, which we use to host our nameservers. Nameservers hold all of the information that your ISP needs in order to let you view the website (such as the A record, CNAME record and MX records), so they need to be visible to the ISP in order for your website to display. The nameserver information that should have been accessible in the root server simply wasn’t there. We think some change must have been made for the server to suddenly stop resolving our DNS, as this isn’t something that should normally happen.
How can we prevent this?
SRS Plus are still trying to find out the reason that our DNS information wasn’t available, after which we can look at preventative methods and permanent fixes.
It’s important to understand that, due to the distributed nature of the internet, nobody can ever guarantee that things won’t break and that your website won’t go down. For your website to display correctly, dozens of different systems and organisations must all operate faultlessly, and none of them is 100% reliable. At no time during this outage was there a fault with the servers displaying your site, or “downtime” of any system under our control, or even under the control of our direct suppliers. Despite this, we need to make sure we do everything possible to ensure minimal interruptions for you and your customers. Network Solutions are a major international domain provider (they were the first, developing the DNS system under a government grant) and this is the first time we’ve experienced issues with their service.