System Status Archive

Written by Jonathon Sim on May 1st, 2010.

Database server overloaded

22/5/10 8:00-9:50
A server failure combined with higher-than-usual load for a Saturday resulted in a database server being overloaded.  This manifested as websites loading very slowly and, in some cases, timing out completely.  The affected server has been rebooted and we are monitoring the situation closely through the weekend.

Network outage in one of the datacentres.

11 May 2010 15.15 - 15:40
There was a network hardware failure in one of the datacentres. The outage affected some NZ websites.
The issue was resolved promptly by datacentre engineers. It also caused an e-mail outage; e-mail services have since been restored. For more information please read here: http://iserve.net.nz/announcements.php

Database server outage on zes14 cluster (affects approx 80 sites)

8 May 2010 5:10AM - 9:30AM
The database server for this cluster ran out of disk space. This triggered a previously unseen bug in the MySQL database server software, whereby all user accounts were lost and had to be restored from backup before this hosting cluster would work again.
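
An out-of-disk condition like this one is usually easiest to catch with a free-space check that runs before the volume is full. The following is only a minimal sketch of such a check, not the monitoring actually used on this cluster; the 10% threshold and the function names are assumptions made for illustration.

```python
import shutil

# Hypothetical threshold: flag the volume when less than 10% of it is free.
MIN_FREE_FRACTION = 0.10


def is_low_on_space(total_bytes, free_bytes, min_free_fraction=MIN_FREE_FRACTION):
    """Return True when the free share of the volume is below the threshold."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes < min_free_fraction


def check_path(path):
    """Check the volume that holds `path` using the standard library."""
    usage = shutil.disk_usage(path)  # named tuple with total, used and free bytes
    return is_low_on_space(usage.total, usage.free)


if __name__ == "__main__":
    print("low on space:", check_path("/"))
```

Run from a scheduled job, a check of this kind could raise an alert while there is still room to clean up or resize the volume.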

Broken links on some sites

30 April 02.00 - 9.35
A product release to fix a bug in URL redirect handling unfortunately caused another (worse!) bug, which meant that many normal non-redirected page links were broken.  This has now been fixed. A server restart is required to deploy the change and this may result in a 30s - 1min outage for some sites.

10 minute outage affecting some of the customers.

20 April 18.20 - 18.30 A short network outage between datacentres made some of the sites inaccessible for up to 10 min. The issue has been resolved and is being raised with datacentre staff for further investigation.

Database replication delay.

14 April 09.40 - 10.00 A product release caused a delay of up to 10 minutes in database replication for a period of approximately 20 minutes. This manifested in the back end as changes appearing not to save but the website front ends were unaffected.

Websites down on cluster10.

12 April 16.30 - 18.00. Sites on cluster10 went down for 90 min. at 16.30 on 12th of April.
The cause of this issue was an out-of-disk-space condition caused by one of the customers uploading an extremely large number of image files. While we were resizing the volume the file system became corrupted, which prolonged the downtime. This issue has now been resolved and websites should be functioning normally.

Spike in Visitor stats

30 March - In the last month Google and Yahoo changed how they identify themselves to our servers, which resulted in them no longer being excluded from the website stats. This issue is fixed now, but some sites may still see a spike in stats over that period. We are working on regenerating the stats for these dates to exclude the search engine visits.
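
The entry above does not say how search engines are recognised. A common approach, assumed here purely for illustration, is to match known crawler tokens in the User-Agent header and skip those requests when counting visits. The token list and function names below are hypothetical, and a list like this needs updating whenever a crawler changes how it identifies itself, which is the kind of change that caused this spike.

```python
# Hypothetical list of lower-case substrings that mark a request as a crawler.
CRAWLER_TOKENS = ("googlebot", "slurp", "bingbot", "msnbot")


def is_crawler(user_agent):
    """Return True if the User-Agent string contains a known crawler token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def count_human_visits(user_agents):
    """Count requests whose User-Agent does not look like a crawler."""
    return sum(1 for ua in user_agents if not is_crawler(ua))


if __name__ == "__main__":
    sample = [
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)",
        "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)",
    ]
    print(count_human_visits(sample))
```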

10min loss of International Connectivity

30 March 2010 (from 9:42AM NZDT - 09:53AM NZDT)
Services Affected: International connections to NZ hosted websites
A loss of international connectivity at our upstream provider resulted in all NZ hosted websites being inaccessible to international visitors for 10 minutes this morning.

Advanced pricing issues for some websites

18 March 2010 (from approx 10:30AM NZDT - 1.30pm)
Services Affected: Advanced pricing on some websites
A product release, base-3.7.3.188, intended to fix pricing errors on some websites with multicurrency, instead caused pricing errors with Advanced Pricing. This error would manifest itself as the advanced pricing being ignored and the base price for the product being used instead.  This release was rolled back within approximately 15 minutes of the first report of issues with pricing.

However, websites that had been reconfigured during this time (the main way this would happen is by editing shipping or some preferences in the website admin) kept the faulty pricing routines (and thus the bug) until this issue was discovered at approximately 1:30pm. A full restart of all servers has now been done to resolve the issue fully.

Release 3.7.3.188 is currently in testing to try to determine the underlying cause of the problem and why it was not caught by our automated regression testing before being deployed.

Problems with pop3 authentication for email

22 February 2010 (from approx 3PM NZDT - 5.30pm)
Services Affected: Email for some websites

Due to an issue with a database server at iserve, authentication for pop3 email is occasionally failing, meaning that customers are asked to enter their username/password again as though it were incorrect.  The problem is intermittent, so if you keep trying you will eventually get in.  Mail is being queued and no email has been lost.
http://www.iserve.co.nz/announcements.php

22 February - Iserve email hosting down for a few minutes in the morning

Downtime - NZ hosting

20 February 2010 (from approx 3:30PM NZDT - approx 9PM NZDT)
Services Affected: Approx 5% of NZ hosted websites
Status : Ongoing
An outage of Orcon's cloud hosting infrastructure in Wellington caused an outage of the websites hosted on it.

Downtime - NZ hosting

22 January 2009 (from 2:48PM - 3:01 PM NZDT)
Services Affected: Approx 20% of NZ hosted websites
Status : Fixed
Hardware failure on a fileserver caused an outage on this hosting network while the server was rebooted.  Loss of the primary fileserver caused the secondary redundant backup fileserver to crash and reboot as well (it seems to be a related hardware failure, caused by the increased load caused by the first failure).

We are in the process of migrating websites off these two servers so we can temporarily decommission them until we can isolate the issue.
Update 26/1/2009
The unusual circumstances of this double failure seem to have caused the order counters on the affected sites to not increment for orders made during this time period. This has various unpleasant consequences, the major one being orders potentially overwriting each other.  To resolve this, a script is currently being developed to identify where this may have happened and resolve it by restoring data from logs (please note that very few sites are likely to actually be affected, given the short time period involved).

Downtime - NZ hosting

18 December 2009 (from 01.42 - 08.45 NZDT)
Services Affected: Initially cluster4, later all NZ clusters, as well as some mail services.
Status : Fixed
A combination of hardware failure (a disk controller causing a kernel panic on one of the servers at 01.42 NZDT) and software malfunction (the high-availability system failing to fail over to the running machine) caused overload on the Frontends, which resulted in the downtime. All clusters apart from cluster4 were brought back up at 8.45, while cluster4 was brought back up at a later time. The configuration on the Frontends has been adjusted for this scenario, and we are investigating the cause of the hardware and software failures described above.

US Cluster was temporarily inaccessible

7 December 2009 (from 13.00NZDT - ongoing)
Services Affected: US hosted websites are inaccessible
Status : Resolved
A major network gear upgrade is in progress in the datacentre that hosts the affected servers.
The outage consists of multiple 1-5min outages (one for each network device that is installed). The engineers estimate that this will be over by 14.00 NZDT.

DOS-related outage on some NZ sites

2 December 2009 (from 3:50 - 4:15PM (estimated))
Services Affected: Reduced performance and then outage for approx 10% of NZ hosted websites
Status : Resolved
High load caused by a denial of service attack triggered a hardware failure on a primary NFS server for this hosting cluster.  Although the system correctly failed over to the secondary as expected, the unexpectedly degraded performance that followed caused a load spike that led to cascading failure in other systems.

To resolve, a full reboot was required, causing an outage to this hosting cluster until the NFS server had restarted.

Intermittent error on zes5 hosting cluster

29 November 2009 8:20 PM  - 30 November 2009  9:15AM
Services Affected: Intermittent "Catalog not found" error for approx 10% of NZ hosted websites
Status : Resolved
Due to a denial of service attack on one of our NZ hosted sites, one server was blocked from the database server.  This resulted in errors whenever this server was used to service a request.

The configuration on the database server has been changed to avoid this problem recurring.

High load, slow performance on NZ Hosting

20 November 2009 (from 12:15PM NZDT - 1PM NZDT)
Services Affected: Reduced performance for approx 10% of NZ hosted websites
Status : Resolved
A popular website is receiving extremely high traffic due to a (highly effective!) Christmas promotion - this is slowing down performance for the entire hosting cluster it is on.

Update: 12:57PM
We are in the process of bringing spare capacity online to improve this, and carefully managing traffic to minimise the disruption this causes.  We expect performance to improve over the next 15 minutes as these spare servers come online, but we will continue to manage the load carefully and further performance problems may occur.

Intermittent load issues on NZ hosting

13 November 2009 (from 10:22PM NZDT - 12:58AM NZDT)
Services Affected: approx 10% of NZ hosted websites
Status : Resolved
Flaws in a scheduled task running on this cluster caused excessive load while generating website statistics.  This caused poor performance and, at various times, websites failing to load.

This has been resolved by updating the database in question.

Intermittent silence issue with Zeald office phone line

22 October (from ~9.00AM NZDT - 11:00AM NZDT)
Services Affected: Zeald main phone lines
Status : Resolved
When the Zeald main phone line is called, the caller gets intermittent silence instead of the greeting. We have reported this problem to our VoIP provider, but at this stage we haven't been given an ETA. This issue affects their whole network.
In the meantime, if you have an urgent matter you can contact us via e-mail (support@zeald.com). Alternatively you could try calling again, as the problem is intermittent.
UPDATE: our VoIP provider has contacted us to say that the problem has been fixed.

Email, DNS outage for iserve-hosted customers

14 October (from 7:43AM NZDT - 11:02AM NZDT)
Services Affected: Email and DNS hosted on iserve.  This also means websites reliant on iserve DNS are inaccessible
Status : Resolved
Iserve, who provide email and DNS servers to many of our customers, are having a major network outage.  The iserve network is inaccessible from most major New Zealand ISPs, meaning email and website requests will not succeed.

More up-to-date information about this outage may be available on iserve's status page

Update 9:20 AM NZDT
Iserve advise that this is an outage at an upstream provider that should be resolved within approximately 1 hour.

Update 9:50 AM NZDT
The iserve network appears to be reachable now.  It may not yet be reachable from all ISPs.  The update from iserve is:
"Our upstream providers have advised that they have isolated the issue and are currently working to resolve a hardware issue. We will post further updates as information comes to hand"

Update 11:02AM NZDT
Iserve advise that this problem is resolved

Update 10.30AM 15/10 NZDT
There have been caching issues with ISPs/customers, which have now been resolved

Downtime on Australian cluster

Tue 8th September (from 17.00 NZDT to 18.00NZDT)
Services Affected: Australian sites that are pointed to the old IP address (those who are affected: please point the domain for your Australian site to the new IP address 119.148.66.58).
Status : Fixed
There is a networking issue in the Australian data centre. This issue is beyond our control, but we have notified the technicians at that data centre. It looks like only two sites are affected, both of which still use the old IP address.

Slow service, 403 Forbidden on inquiry pages

Mon 7th September (from 9.30 to 16.00 NZDT)
Services Affected: NZ Cluster
Status : Fixed
We have been subjected to a DDoS attack. Our investigation indicates that the attack targeted inquiry forms with the aim of exploiting them to send spam. The exploit was unsuccessful but created a lot of traffic. Due to the distributed nature of this attack, it was very difficult to differentiate the attackers (an exploited network of computers all over the world) from legitimate users, so the temporary measure was to deny access to IE6 users, as the attacking machines were posing as plain IE6 (an obsolete browser by today's standards). This restriction appeared to IE6 users as a 403 Forbidden error when they tried to place an inquiry. The restriction lasted approximately one hour, until we pinpointed the difference that allowed us to craft a better measure. From a security point of view the attack was unsuccessful, but from a service point of view it unfortunately caused extreme load on our servers.
We have tightened security around enquiry forms, which could show up as a 403 error if an enquiry form is used abnormally or abused.

Downtime on US cluster

Mon 31st August (from 16.15 to 17.00 NZDT)
Services Affected: US cluster.
Status : Fixed
Due to a huge unexpected traffic spike, the US server ran out of resources and was temporarily off-line. We have allocated more resources to deal with the higher load.

Downtime on US cluster

Mon 27th July (from 10.30 to 10.45)
Services Affected: US cluster.
Status : Fixed
This was a scheduled 15 min. downtime by Dallas data centre staff. The core network equipment has been upgraded.

Downtime on Cluster 3 (NZ)

Mon 6th July (from 10.30 to 11.15)
Services Affected: Cluster 3 (167 sites).
Status : Fixed
Due to scheduled maintenance (involving a reboot) of a server in that cluster, we had an unforeseen load spike on the rest of the servers, which brought them down as well. It took under 45 min for the load to settle down, causing downtime of between 20 and 45 min for some sites.

Website mail delayed

Monday 22nd June (11.30am) Duration: under 3hr
Status : Fixed
The website and e-mail marketing mail servers have been upgraded to a newer version. Due to a change in the structure of the configuration files, some of the mail was not delivered until later in the day. The offending misconfiguration has been fixed.

Intermittent broken images on zesuk-1 sites

Friday 29th (5.30pm) Duration: intermittent broken images until 12.00 2nd June
Services Affected: 2 sites.
Status : Fixed
Ahead of the decommissioning of the old UK server, we moved the 2 remaining sites to new servers on Friday afternoon, with a proxy in between (until the domain names are pointed to the new IP). Unfortunately we missed a firewall rule that rate-limits connections from a single IP (when we tested, there was not enough traffic to trip it). This was fixed promptly upon discovery.
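
The interaction between a proxy and a per-IP rate limit can be shown with a small simulation. This is only a sketch under assumed numbers, not the actual firewall rule: the limit of 3 connections per second, the class name and the example addresses (taken from the documentation ranges) are all hypothetical.

```python
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Sliding-window limiter: at most max_connections per window for each IP."""

    def __init__(self, max_connections, window_seconds):
        self.max_connections = max_connections
        self.window_seconds = window_seconds
        self._seen = defaultdict(deque)  # ip -> timestamps of accepted connections

    def allow(self, ip, now):
        recent = self._seen[ip]
        # Forget connections that have fallen out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_connections:
            return False
        recent.append(now)
        return True


def count_rejected(limiter, events):
    """Feed (ip, timestamp) pairs to the limiter and count the refusals."""
    return sum(1 for ip, now in events if not limiter.allow(ip, now))


if __name__ == "__main__":
    # Five visitors connecting directly, each from their own address.
    direct = [("203.0.113.%d" % n, 0.0) for n in range(1, 6)]
    # The same five requests relayed through one proxy address.
    proxied = [("192.0.2.10", 0.0)] * 5

    print("direct rejected:", count_rejected(PerIpRateLimiter(3, 1.0), direct))
    print("proxied rejected:", count_rejected(PerIpRateLimiter(3, 1.0), proxied))
```

With light test traffic the proxied case never reaches the limit, which would explain why a rule like this only shows up under real load.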

Downtime on NZ Cluster

Wed. May 13th (from 11.30) Duration: intermittent server errors until 13.00 next day
Services Affected: Legacy cluster  - 8 sites, Cluster 4 (190 sites).
Status : Fixed
As part of scheduled power upgrades (part of our constant improvement and future-proofing of our infrastructure), one of the servers in Cluster 4 (whose uptime was over 560 days) was shut down. Unfortunately something went wrong with the file system and we lost one of the databases, which caused intermittent internal server errors on Cluster 4 due to extremely high load (resulting from reduced capacity). The same issue was responsible for losing the file server on the legacy cluster. We have rebuilt the database server and restored connectivity to the legacy cluster file server.

Downtime on NZ Cluster

Wed. May 6th (from 11.30) Duration: intermittent server errors until 16.00
Services Affected: Whole NZ cluster
Status : Fixed
Due to scheduled power upgrades in the datacentre, approximately 50% of the servers were shut down and started up sequentially. Unfortunately this did not go as smoothly as we hoped: 50% of the capacity was not enough, and the load crashed the servers that were still serving. We have postponed the continuation of the upgrades until next week.
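
The capacity question behind this outage is simple arithmetic: the servers left running must be able to carry the peak load on their own. The sketch below uses made-up numbers (10 servers, 100 requests per second each, a peak of 700 requests per second) because the real figures are not given here; the function name is hypothetical too.

```python
import math


def max_servers_offline(total_servers, capacity_per_server, peak_load):
    """Return how many servers can be down at once without exceeding capacity."""
    if capacity_per_server <= 0:
        raise ValueError("capacity_per_server must be positive")
    # Servers needed to carry the peak load, rounded up to a whole machine.
    needed = math.ceil(peak_load / capacity_per_server)
    return max(0, total_servers - needed)


if __name__ == "__main__":
    # Hypothetical cluster: with these numbers only 3 of 10 servers may be
    # offline at once, so shutting down half of them (5) would overload the rest.
    print(max_servers_offline(total_servers=10, capacity_per_server=100, peak_load=700))
```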

Reboot on US server

Wed. April 8th (15.20) approximately 2 min. duration
Services Affected:  US cluster, 9 sites were affected.
Status : Fixed
A server resource upgrade required a reboot. This upgrade will ensure that the server performs normally during load spikes.

Downtime on Australian cluster

Tuesday April 7th (17.30) approximately 10 min. duration
Services Affected:  Australian cluster, 14 sites were affected.
Status : Fixed
A routing issue in the datacentre where the server is hosted caused the range of IP addresses that covers our Australian host to be unroutable.
Technicians at the datacentre fixed the issue.

Downtime on Australian cluster

Monday April 6th (20.30) approximately 30 min. duration
Services Affected:  Australian cluster, 14 sites were affected.
Status : Fixed
Very high load created by three spider bots indexing at the same time (Google, MSN and Yahoo) left the server unable to serve content. A restart of the server fixed the after-effects.

Downtime on all New Zealand clusters

Saturday April 4th (14.00) approximately 1 hour duration
Services Affected:  most New Zealand sites.
Status : Fixed
A runaway subversion process consumed all memory (due to a non-hosting server being down). This caused high load on all servers, as every server was running that process simultaneously. The issue was resolved promptly as the services are under intensive monitoring.

Downtime on Australian cluster

Friday April 3rd (14.00) approximately 30 min. duration
Services Affected:  Australian cluster, 1 site was affected.
Status : Fixed
The zesau-2 host was brought down for emergency maintenance because it was not responding normally.

Downtime on Australian cluster

Saturday March 21 (16.30) - Sunday March 22 (04.30)
Services Affected:  Australian cluster, 14 sites were affected.
Status : Fixed
A high load spike caused the server to run out of memory, resulting in downtime.
The ultimate resolution of the issue is in the pipeline: a new dedicated server has been built and is ready for sites to be moved to.

Undefined catalog error on US cluster

Saturday March 7 (0.30) - Sunday March 8 (00.30)
Services Affected:  US cluster, 7 sites were affected.
Status : Fixed
The reported "Undefined catalog" errors were caused by a crashed database (a runaway process used up all the memory, making it impossible to start new processes).
The duration of the down time was approximately 12 hours.
Monitoring has been improved to detect such failures in future.

NFS lost configuration on the legacy cluster

Tuesday Feb 24 (9pm) - Wednesday Feb 25 (9.30am)
Services Affected:  8 websites  (0.7% of NZ hosted websites)
Status : Fixed
A glitch in the fail-over system on our legacy cluster (current clusters were not affected by this problem) caused it to fail over to a faulty configuration. Due to the nature of this failure the monitoring system did not pick up the fault, as the actual servers were running fine. The failure resulted in 403 Forbidden errors.

Server restart

Monday Feb 9, between 3:00 - 3:20PM
Services Affected: Approx 1-5 minutes downtime on 14% of NZ hosted websites
Status : Fixed
Serious performance problems, caused by an issue with one of our servers, necessitated a server restart for all websites to apply a settings change.  This resulted in approx 1-5 minutes of downtime for affected sites.

High Christmas Load

Monday Dec 8
Services Affected: Website hosting, email marketing
Status : Fixed
Update 15/12/2008 - 19/12/2008
The addition of four extra servers to our hosting infrastructure appears to have resolved this issue this week.  We are monitoring the situation as there is still the possibility that huge traffic spikes on any one site will adversely affect performance - but since there is now a significant over-provisioning of server resource we are much better able to handle this scenario.

We do not anticipate further performance issues in the website frontend

Update 10/12/2008
Various improvements we have made have helped with this - most websites are operating at normal performance levels.  However, to help further, we are currently building and purchasing four additional "emergency servers" to cover this load (we have limited options for server hardware that can be delivered this side of Christmas).  We hope to have this extra resource live on Dec 10/11 depending on delivery schedules.

Previous information about this outage

Christmas 2008 has been the most successful ever for our customers, especially for e-commerce customers running on the zeald platform.  Unfortunately this high load is causing intermittent problems, especially when any one site has extreme load spikes due to highly successful marketing. 

We are monitoring our servers very carefully in order to try and allocate server resources exactly where needed. However, in some cases our systems are running much more slowly than we would like, especially in the backend administration of sites. We are trying to prioritise front-end customers' concerns over backend users to ensure that orders and enquiries keep flowing.

The following services are experiencing the most issues:
  • Logging into the backend at peak times (generally from about 10AM - 4PM weekdays) may be slow. As editing content and/or items reduces the performance of your site, you may find that it is better to perform such edits outside of peak times if your site is one of those experiencing high traffic
  • The currency conversion service we use is experiencing high load on their servers, causing delays.  We have added a note on the page directing customers to a different service if they have issues.
  • Email marketing delivery has been running slower than usual.  We hope this issue has mostly been resolved, but if a number of large campaigns run simultaneously it can take time for the mail queues to clear.  To help reduce the impact we have separated the delivery of order and enquiry emails to ensure they do not get held up by bulk mail.

Website outage

Mon Dec 8 2008 8:30AM - 9AM
Services Affected: Some websites
Status: Fixed
A sharp load spike on Monday morning resulted in serious performance degradation for some sites, until servers could be migrated to this cluster from elsewhere.

Slow email delivery for email marketing emails

Mon Dec 1 2008
Services Affected: Email marketing mail delivery
Status: Fixed

On December 1, due to an unexpectedly large volume of mail being sent as part of 2008 Christmas marketing, email marketing messages took longer than usual to be delivered.

This has been resolved on our end now by the addition of extra server resource to our email infrastructure.

Here are the statistics for how many messages were delivered within specific timeframes on this date:

Under 1m  23.1%
5m  30.6%
15m  37.5%
30m  43.6%
1h  53.3%
3h  81.7%
6h  95.7%
12h  99.9%
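
The figures above appear to be cumulative (each row includes the rows before it). Assuming that reading is correct, the share of mail delivered within each individual bracket can be derived by subtracting successive rows, as in this small sketch; the function name is illustrative only.

```python
# Cumulative share of messages delivered within each timeframe, from the table above.
CUMULATIVE = [
    ("under 1m", 23.1),
    ("5m", 30.6),
    ("15m", 37.5),
    ("30m", 43.6),
    ("1h", 53.3),
    ("3h", 81.7),
    ("6h", 95.7),
    ("12h", 99.9),
]


def per_bracket(cumulative):
    """Turn cumulative percentages into the share delivered in each bracket."""
    shares = []
    previous = 0.0
    for label, pct in cumulative:
        shares.append((label, round(pct - previous, 1)))
        previous = pct
    return shares


if __name__ == "__main__":
    for label, share in per_bracket(CUMULATIVE):
        print("%-9s %5.1f%%" % (label, share))
```

Read this way, the largest single bracket is the one between 1 and 3 hours, which accounts for 28.4% of messages.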

Outage on NZ hosting cluster

Mon Dec 1 2008 15:00 - 17:00
Services Affected: Slow/non-responsive hosting for Approx 20% of NZ Hosted Websites
Status: Fixed

Unexpectedly high Christmas load on one of our server clusters caused the hosting and database servers to crash.  This problem has now been resolved by the addition of more server resource to that hosting cluster.

Unfortunately several of the websites with the highest load happened to be on the same website cluster.  This, combined with an inordinate number of website promotions (associated both with the beginning of December and the beginning of the week), led to a massive load spike on this cluster (approximately four times as many people visited today as on other days).

To resolve this we have added three additional servers, rebalanced the load by moving some sites to different clusters, and we are monitoring the situation to see whether additional resource will be required in the lead-up to Christmas.

 

Outage on NZ hosting cluster

Wed Nov 19 2008 10:00 - 11:30
Services Affected: Slow/non-responsive hosting for NZ Hosted Websites
Status: Fixed
A script released as part of an update to handle higher than usual email load over the Christmas season malfunctioned and consumed all memory on a significant number of web servers before the problem could be diagnosed and fixed.

Although this particular issue was fixed fairly promptly, the fact that approx. 75% of our server capacity needed to be rebooted in order to free up memory meant that the remaining servers were brought down by the excessive load this generated.  As server capacity progressively came back online this problem was slowly resolved after approx 11:00 - however system performance was degraded for up to 30 minutes after this time.

NFS server failure on NZ server cluster

Fri Sep 25 2008 09:10 - 09:20
Services Affected: Approx. 1/6th of websites on the New Zealand hosting cluster
Status: Fixed
Overnight, filesystem corruption errors caused by a failed hard drive forced one of our cluster's NFS servers to switch to read-only mode, causing intermittent problems for the websites on that cluster.

To resolve this issue we were forced to shut this server down and run a filesystem check - this resulted in a 10 minute outage for all the websites affected.

Slow international connections to NZ-based websites

Wed Sep 25 2008
Status: Fixed

We are experiencing slow international connections from and to our NZ-based website hosting servers.  We are investigating this issue with our upstream bandwidth providers and will update this notice as more information is found.

update 29/9/2008
This problem is still continuing, and it is still unclear what is causing it.  There seems to be a very high error rate on international connections, but our collocation provider is unsure as to the cause.  We are hopeful that a planned configuration change on their router overnight will fix it.

In the meantime we have moved the routing of email to a different connection - this helps reduce the effect of the problem by freeing up bandwidth on our international connection.

Temporary Fix! update 29/9/2008
After spending time testing this issue with our provider's engineers they have managed to find a workaround - increasing our upstream bandwidth limit seems to work around the problem.  This probably hasn't solved the underlying issue (as we were nowhere near saturating our international link, increasing the limit should make no difference) but in our testing from offshore locations web traffic is now 100-200 times faster than it has been over the last few days.  With this temporary fix we can work to isolate this issue without the problem affecting customers.

Orders incorrectly logged in website database

Wed Sep 24 2008 AM
Status: Fixed
This morning a database update broke the logging of orders to the website database.  This problem has since been fixed; however, for the orders that occurred during the outage:
  • The website admin does not show all the customer information, or in some cases show the order at all
  • This information is available in our logs, we are working on restoring it to website databases.  As this requires us to build and test custom data porting scripts this may take some time - depending on the complexity involved some sites may not see this data until tomorrow.
  • However the order email sent to the administrator *does* show correct information.  We recommend that all our customers process their orders from these emails until the data can be restored in the website database
Update 25/9/2008
These orders have now been restored to the website databases of the affected websites

Slow international connections

Updated Wed Sep 17 2008
Status: Fixed

Over the last several days we have been experiencing major performance degradation on our international link.  This has been causing a number of problems:

  • Slow connections to websites from offshore - especially to the website admin
  • Email sent to offshore addresses (eg via email marketing) has been delayed - larger emails have also been bouncing - this most noticeably affects emails to xtra.co.nz addresses which are now hosted in Australia
  • It would appear that in some form this issue has been happening for some time - however the impact has increased markedly in the last week or so (possibly due to other factors like an increase in spiders/bots).
We are working with our upstream bandwidth provider to attempt to isolate this issue - in the meantime we have a number of workarounds in place:
  • We are routing email via a temporary mail server at our offices in Albany, which takes a lot of load off of our international link thereby speeding up web connections. 
  • Other non-essential batch tasks have been disabled or delayed in an attempt to free up some bandwidth.
At this time:
  • Email is flowing freely again - there is a considerable backlog of email marketing emails that is currently being delivered from the queue and this may take several hours.  Mail especially to xtra may arrive late and in some cases days late
  • International connections should be faster than before however they are still running slower than usual
  • We are auditing several pieces of equipment within our network to see if they are malfunctioning, our bandwidth providers are performing similar problem solving in their own network, and by tomorrow we hope to be able to provide an ETA for resolution

Email Delay

Between approx 4PM 26/8/2008 and 10 AM 27/8/2008 NZDT
Status: Fixed
Services affected: Email sent from websites within our New Zealand hosting infrastructure
Mail was queued for a period of 18 hours overnight. No mail was lost, however mail was in some cases delayed.  As at 10AM August 27 mail is sending normally - there is however a significant queue of outgoing mail - depending on the rates at which your incoming mail server configuration allows mail to be delivered it may take some time for this backlog to be delivered.

This was caused by the disk on one of our mail servers filling up with logs - this issue was resolved quickly but led to the mailserver crashing.  Outgoing mail on the webservers was then queued until the mailserver began accepting mail again at approximately 10AM 27/8/2008.

Two five minute outages for websites on one server cluster

Between 9:30AM and 10 AM 12/8/2008 NZDT
Status: Fixed
Services affected: Approx 20% of NZ hosted websites for less than 10 minutes total
Possibly as a result of the high load the previous day, one of our secondary database servers experienced corrupted tables overnight at approximately 3AM (a very unusual scenario we have never seen before - probably caused by a bug in the database server software itself).  The system then correctly and automatically removed this server from the cluster, falling back to using just one database server. Note that as all data is replicated on at least two separate servers within our hosting infrastructure this scenario does not result in data loss for our clients, and the system is capable of running with only one database server under normal load without users of the websites even noticing.

The data on this failed database server was not recoverable - in the morning we needed to copy the data from a snapshot of the other database server in order to bring this server online.  As this is the same cluster that was affected by the load spike the previous day, a decision was made to do this immediately rather than waiting until outside of working hours - running on just one database server, the load experienced the previous day during peak hours would have crashed the system and necessitated several hours of downtime.

This involved shutting down the master database server for long enough to take a snapshot of the database, and then start it up again - an outage of a few minutes was involved.  The first time this was attempted however the server itself crashed (a kernel bug?), necessitating a 5 minute server restart.  After the server was restarted the process needed to be done again (the second time the process completed correctly) causing another outage of a few minutes.

Very slow/intermittently inaccessible websites on one server cluster

Intermittently 2-2:30PM 11/8/2008 NZDT
Status: Fixed
Services affected: Approx 20% of NZ hosted websites
One website on our NZ based hosting servers experienced a very high load spike (some 20,000% higher than normal load!) due to a (very successful!) online marketing campaign.  This load spike exceeded the usual capacity of our system to adjust to varying system load and caused a number of servers to crash.

Once this issue was identified, extra server resource was brought online within that server cluster.  These servers take approximately 15 minutes to boot and be configured. Once this completed it quickly fixed the immediate issue (websites failing to load), although websites on this cluster may have experienced degraded performance throughout the remainder of the afternoon.

Intermittent website outages on UK-based server

Intermittently over the course of 6/8/2008 NZDT
Status: Fixed
Services affected: Website hosting on our UK based server
On the UK-based server, an issue was discovered where too much load is generated on the database server when users of a high-volume website view their website traffic reports.  This causes the websites hosted on that server to either run very slowly or to time out and fail to show at all.

As a temporary workaround we have disabled website reporting on this server, which will stop this from happening - we hope to fix the bug that causes the underlying issue soon and re-enable access to the website reports.

Server outage on NZ-based hosting

Wed July 30, 12:00 - 12:13
Status: Fixed
Services affected: Hosting on our New Zealand-based hosting infrastructure

A reboot of a server within our New Zealand hosting cluster led to an IP address conflict that resulted in websites returning a "404 Not Found" error.

Shutting down this IP address and then resetting the switch were required to fix this issue, which was completed by 12:13.

Partial Server outage for US-based hosting server

Friday 25 July, 9:40 PM - Saturday 26 July, 11:32 AM
Status: Fixed
Services affected: Website hosting on our Dallas based server
Over the weekend our USA based server ran into an issue (caused by a memory leak in a piece of system software) where it no longer had sufficient memory to serve websites.

Unfortunately our server monitoring system was unable to detect this type of error, as the HTTP service was still "up" even though the websites themselves were either running very slowly or not at all.  This bug in the monitoring system has now been fixed to ensure it does check that a valid page is being generated.
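
As an illustration only, a check that looks at the generated page rather than just the HTTP service might be sketched in Python as follows. The function names, the `</html>` marker and the five second threshold are assumptions for the example, not details of the actual monitoring system.

```python
import time
import urllib.request


def page_is_healthy(status: int, body: str, elapsed_s: float,
                    marker: str = "</html>", max_elapsed_s: float = 5.0) -> bool:
    """Decide whether a response looks like a complete, promptly served page."""
    if status != 200:
        return False
    if elapsed_s > max_elapsed_s:
        # The service answered, but too slowly to count as working.
        return False
    # An empty or truncated page will be missing the closing marker.
    return marker in body.lower()


def check_url(url: str, timeout_s: float = 10.0) -> bool:
    """Fetch a page and apply page_is_healthy; any network error counts as unhealthy."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            body = response.read().decode("utf-8", errors="replace")
            status = response.status
    except OSError:
        # URLError, HTTPError and socket timeouts are all subclasses of OSError.
        return False
    return page_is_healthy(status, body, time.monotonic() - started)
```

The point of the sketch is that a server can accept connections and still fail this check, which a simple "is the port open" test would miss.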

The issue itself has also been fixed (the system software at fault has been reconfigured and also a system put in place to automatically restart it should it use too much memory).
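
As an illustration only, the "restart it should it use too much memory" rule can be sketched in Python as below. This assumes a Linux server, where resident memory is reported as the `VmRSS` line of `/proc/<pid>/status`; the function names and the limit are hypothetical and are not the actual mechanism put in place.

```python
def rss_kib(status_text: str) -> int:
    """Extract resident memory (VmRSS, in KiB) from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:    123456 kB".
            return int(line.split()[1])
    raise ValueError("VmRSS line not found")


def should_restart(status_text: str, limit_kib: int) -> bool:
    """Return True when the process is using more memory than the allowed limit."""
    return rss_kib(status_text) > limit_kib


# Sample status text for a process using roughly 1.5 GiB of memory.
sample = "Name:\tdaemon\nVmPeak:\t 2000000 kB\nVmRSS:\t 1572864 kB\n"
print(should_restart(sample, limit_kib=1048576))
```

A watchdog built on this would run the check on a schedule and trigger whatever restart command the service normally uses.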

SSL Certificate Re-Issue

Tuesday June 6th
Status: Fixed
Services affected: SSL Certificates
Due to a bug in security software worldwide we have had to re-issue our SSL certificates. This means the current certificate has to be revoked before a new one can be issued.
There was some downtime for SSL certificates, which customers may have noticed. This has been resolved.


UK Server Downtime

Tuesday May 20th
Status: Fixed
Services affected: Access to UK Hosted websites
There was an issue with our UK servers which resulted in downtime overnight. This issue has been resolved and the servers are now being closely monitored to ensure this does not happen again.


International Speed issues

Friday May 9
Status: Fixed
Services affected: Access to New Zealand Hosted websites from offshore and some New Zealand ISPs
Several customers have noticed slow speeds when accessing websites hosted on our New Zealand servers from offshore.  This problem also seems to affect connections from at least one New Zealand ISP (callplus/slingshot).

We are investigating the issue with our upstream provider.

Update 16/5/2008
There appears to be evidence of a problem with the international bandwidth supplier being used by our upstream provider.

Thursday May 1 15:00 - 15:35
Status: Fixed
Services affected: Websites hosted on Brisbane server
Websites hosted on our Brisbane-based Australian server (note, unless you have specifically requested this, most websites are not hosted there) were unavailable due to a failure of the Optus uplink in the Australian datacentre.

Friday March 8 02:09 - 08:56
Status: Fixed
Services affected: Website outage for some websites
Due to a large load spike on one of our websites, one of our database servers ran out of disk space.  This in turn led to a problem loading websites.  As all relevant services were still running, however, our monitoring systems were unable to detect the fault.

Once the problem was detected at approximately 7:30, steps were taken to add disk space to the affected server to get it operational.  Then a full server restart was required to get the affected websites live.

Tuesday December 4 10:00 - 10:20
Status: Fixed
Services affected: Website outage for some websites, reduced performance for up to an hour afterward.  Page logging disabled for three hours.
Effect: Intermittent ability to access websites for some customers
Details:
A routine database change was made to one of our servers early this morning - however this had unexpected performance implications that slowed this server to the point where it began causing errors on websites & eventually crashed under the built-up load.

This problem was fixed twenty minutes later (by restarting the affected server).  However flow-on performance effects resulted in reduced performance for several of our larger websites (especially for clients in the admin, and users while completing an order) for up to an hour afterwards. 

In order to reduce load to within acceptable limits we were forced to disable page logging during this time, meaning that clients will notice a three-hour "gap" in their page view statistics for this day.

Sunday 2007-11-25 - Wednesday 2007-11-28
Status: Completed
Services affected: Website hosting for some websites
Effect: Possible slowdown as your site is copied
Details:
In order to prepare for the Christmas rush we are in the process of performing server upgrades in our hosting platform.  This may involve us moving some websites to new servers, a process which should be transparent to the user.  However while this is happening users may experience some slow-down, and in some cases a website may need to be switched to read-only mode to avoid order inconsistencies.  This move should generally take less than three minutes per website.

Friday 2007-11-16 8:01 AM - 9:05 AM, again at 9:45 AM - 10:00 AM
Status: Fixed
Services affected: Website hosting for some websites (Other websites were unaffected)
Effect: Website intermittently slow/inaccessible
Details:

Disk space was exhausted on one of the three "overflow" servers, which are used to temporarily provide extra hosting capacity to websites experiencing higher-than-average load. These servers have recently been added to help handle the Christmas rush & other unexpected load spikes that sometimes affect our clients.

This resulted in page requests to that server hanging, causing 1 in 3 page requests to fail.  This server was disabled while the problem was diagnosed, but was later re-enabled by another engineer, causing the problem to recur later that morning.

To avoid this problem in future the scheduled task that cleans up used disk space on these servers has been modified to run more frequently, and more sophisticated monitoring of disk usage has been implemented for these temporary servers.
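
As an illustration only, a basic disk usage check of the kind mentioned above might be sketched in Python as follows. The function names and the 80% and 95% thresholds are assumptions for the example, not the monitoring actually deployed on these servers.

```python
import shutil


def disk_alert_level(used_fraction: float,
                     warn_at: float = 0.80, critical_at: float = 0.95) -> str:
    """Map the fraction of disk space in use to an alert level."""
    if used_fraction >= critical_at:
        return "critical"
    if used_fraction >= warn_at:
        return "warning"
    return "ok"


def check_path(path: str = ".") -> str:
    """Measure the filesystem holding `path` and return its alert level."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return disk_alert_level(usage.used / usage.total)


print(check_path("."))
```

Warning well before the disk is full leaves time for the clean-up task to run, or for an engineer to step in, before page requests start to hang.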

Thursday 2007-9-20 9:30 - 10:24 AM
Status: Fixed
Services affected: Website hosting for 20% of websites (Other websites were unaffected)
Effect: Intermittently, access slow or site inaccessible
Details:

Unexpected load overwhelmed one of our five file servers, causing it to run extremely slowly.  This resulted in the websites being slowed down (in some cases to the point where people were unable to access them at all).

To resolve this we are moving some of our websites' file hosting off this server to increase its spare capacity.  Within the next month we plan to replace this server with a much more recent and faster system as part of our ongoing upgrade processes.

Tuesday 2007-9-11 2:38 - 2:50 PM
Status: Fixed
Services affected: Website hosting for 20% of websites (Other websites were unaffected)
Effect: Access slow or site inaccessible
Cause:
A backup script hanging because of excessive disk usage on one server caused scheduled tasks that should run overnight to run in the middle of the day during peak website load.  This overloaded one file server, slowing down the websites running off it to the point where they no longer responded at all.

To fix this issue this scheduled task is being moved to another server where it cannot be blocked by this backup script.

Friday 2007-8-31 10:30 - 10:35 AM
Status: Fixed
Services affected: Website hosting for 20% of websites (Other websites were unaffected)
Effect: Website access blocked for 5 minutes
Cause:

A hardware failure of one of our fileservers blocked website access for approximately five minutes.  Automatic failover of the machine to its backup failed due to the nature of the hardware error (a partial failure, which our failover system could not detect, rather than a complete failure). However, our engineers were notified and were able to manually fail over the server.

Saturday/Sunday 2007-04-29 10:40PM-2:06AM

Status: Fixed

Services affected:

Website Hosting: yes

Email: No

Domains: No

A database server outage meant that many websites were inaccessible during this period.

Detailed description

Our main database server filled a disk with logs while experiencing much higher than average load.  This outage also affected the SMS notifications from our monitoring system, meaning only email notifications worked.  Combined with the fact that this outage occurred overnight on a Sunday, this resulted in the long outage window.

2007-04-26 9:06 - 13:01
Status: Fixed

An issue with our email provider meant that all customers were being asked for their passwords whenever they tried to check their email.
Affected Services
Hosting: No
Email: Yes
Domains: No
Topics: System Status
 
