Wednesday, September 26, 2007

File availability audit 23 September 2007

The following message was sent to MediaMax and posted as a comment on an earlier blog entry by Martin Hood. I thought it would be useful to have here. I'll add a permanent link at the side so we can update progress on this entry and start tagging entries by theme:

---
An Open Letter to MediaMax
(accompanying the graph for MediaMax Files Audit #4 2007.09.23)

MediaMax File recovery progress graph http://tinyurl.com/2tuyxx
(included below).

MediaMax,

I first posted this graph 5 weeks ago to show that you WERE recovering the missing files. It was my intention that it would provide encouragement to those who doubted that the recovery was happening and would support your assertions that things were getting better. Regrettably the end result seems far removed from my hopeful optimism of 5 weeks ago.


Even the most casual look at the graph will show that it contains very disappointing news. By your own predictions, your file restoration process should be pretty much complete by now. Certainly no spectacular changes to the number of restored files would be expected at this late stage.

There was a temporary reversal in the week leading up to 16 September, when the number of missing files from those uploaded between April and July 2007 jumped from close to zero up to 30% - 40% missing. John Hood indirectly explained this in one of his blog postings by mentioning that some servers had been taken off line as part of the recovery process during this week. That also probably explained why about one third of recent uploads became inaccessible during that week. Most (but not all) of these files are now accessible again.

In the past 7 days a few minor gains have been made. The slivers of orange on the graph are the only improvements apart from the large area of orange between May and July 2007 which is just clawing back gains that were undone by the offline servers during the previous week.

Regrettably the broad summary result is very gloomy indeed.

Upload time period    lost / total = % lost
2007 Jan - June        418 / 3503  = 12%
2006 Jul - Dec        1009 / 4856  = 23%
2006 Jan - Jun        1592 / 4195  = 38%
2005 Jan - Dec        2242 / 7753  = 29%
Pre 2005              5012 / 7872  = 64%
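As a rough check on the arithmetic, the percentages can be recomputed from the lost and total counts. The sketch below is only an illustration: the function names are made up, and the counts are copied as printed in the table. Recomputed this way, the 2006 Jul - Dec row comes out nearer 21% than the 23% shown, so either the count or the percentage in that row may have been mistyped; the other rows agree.

```python
# (period, lost, total) copied as printed in the table above
AUDIT = [
    ("2007 Jan - June", 418, 3503),
    ("2006 Jul - Dec", 1009, 4856),
    ("2006 Jan - Jun", 1592, 4195),
    ("2005 Jan - Dec", 2242, 7753),
    ("Pre 2005", 5012, 7872),
]

def loss_percent(lost, total):
    """Return the share of lost files as a whole-number percentage."""
    return round(100 * lost / total)

def loss_table(rows):
    """Return (period, percent lost) for every row."""
    return [(period, loss_percent(lost, total)) for period, lost, total in rows]

for period, pct in loss_table(AUDIT):
    print(f"{period:<16} {pct:>3}%")
```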

Fortunately the restoration process has done a better job for more recently uploaded files, with some months almost completely restored, although a 12% average loss rate can hardly be called 'a complete recovery' by even the most liberal interpretation of the term.

Unfortunately the loss rate gets progressively worse the further back we go with some spectacular peaks. What did happen at Streamload in November 2005 that has caused the loss rate to remain at a whopping 91%?

Anyone who has files uploaded prior to 2005 is looking at a devastating 40% - 75% (average 64%) loss rate for these files. This is completely unforgivable for a file storage 'service' that still claims 'Store your files SECURELY on the web'. Anything above a 0% loss rate is NOT secure. 75% loss rate is patently INSECURE.

I CAN get over losing 9656 out of 27472 files (35%) that were stored on MediaMax given that local copies still exist, but I certainly cannot even consider uploading them all again. Some people who trusted you when you said their uploaded files were 'secure' have learnt a very hard lesson when they did not keep local copies.

Time and time again during the botched migration from Streamload to MediaMax in August 2006 your spin was 'Your files are safe and secure'. I know this because I kept a copy of the free-for-all Streamload blog from 22 August to 2 October 2006 and indeed the old blog pages are still on-line at http://blog.mediamax.com (archives). I cannot confirm that the files were in fact 'safe and secure' 12 months ago immediately after the migration from Streamload to MediaMax, but they certainly are not 'safe and secure' now.

You have been much less forthcoming with comments during this latest debacle and, unsurprisingly, neither 'safe' nor 'secure' has been used anywhere in the recent MediaMax blog.

On behalf of all MediaMax users I ask the following questions:

1. Will you yet recover a substantial portion of the files that still show as missing (or indeed any of them) or are they lost irrecoverably?

2. Will you work with your user community to develop an efficient mechanism to identify lost files and provide a way of rehabilitating these files without having to upload modified copies?

3. What assurance can you give that files uploaded in the future will in fact be 'safe and secure' given that your track record to date has been woeful?

Martin Hood

7 comments:

JD said...

I ran a test similar to Martin's today for the second time and the results are not good.

Last week I had 40% ok and 34% connection/timeout problems.
This week I had 27% ok and 50% timeout/connection problems.

If I add these together as possibly good or recovered files, the total goes from 74% to 77%.

The rise in connection problems also looks suspicious, given that uploading has been almost impossible this week. My guess is that there is some serious stability problem as well.

Martin said...

I'll wait another week before performing another complete audit as it takes quite a while for the large number of files involved and recently improvements have been minor.

During my testing I found that the response time, and hence the number of timeouts, was very variable and seemed to be loosely related to the time of day. From mid morning to mid afternoon Australian time the response was generally quite fast, while at other times it could crawl to a standstill. It is probably related to overall system load at MediaMax: when America is awake MediaMax gets a lot more of a pounding than in the middle of their night. I increased the timeout delay to 100 seconds and this kept the number of timed out links to less than 0.5%.
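For anyone wanting to try something similar, here is a minimal sketch of a single link check with a 100 second timeout. It is not the script used for the audits above. The name `check_link`, the three status labels and the `opener` parameter (there only so the function can be tried without touching the network) are all assumptions, as is the choice to count every HTTP error status as 'not found' and every other connection failure as a timeout.

```python
import socket
import urllib.error
import urllib.request

TIMEOUT_SECONDS = 100  # the timeout delay mentioned above

def check_link(url, timeout=TIMEOUT_SECONDS, opener=urllib.request.urlopen):
    """Classify one file link as 'ok', 'not_found' or 'timeout'."""
    try:
        with opener(url, timeout=timeout):
            return "ok"
    except urllib.error.HTTPError:
        # Assumption: any HTTP error status counts as a missing file.
        return "not_found"
    except (urllib.error.URLError, socket.timeout, TimeoutError):
        # Connection failures and timeouts are lumped together.
        return "timeout"
```

Running `check_link` over a list of file URLs and counting the three labels gives the kind of OK / Not Found / Timeout percentages discussed in these comments.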

Getting large numbers of timeouts increases the uncertainty of the result because a timeout could be either a 'found' or a 'not found'. It is probably unfair to add the timeout count to either the 'found' count or the 'not found' count; it is better to ignore them.

Your first test had 40% OK, 26% Not Found and 34% Timeout. Ignoring the timeouts and scaling the other two figures gives 53% OK, 47% not OK, with significant uncertainty which would depend on the number of files you tested.

The second test had 27% OK, 23% Not Found and 50% timeout. Ignoring Timeouts and scaling these numbers gives 54% OK and 46% Not OK, with an even greater level of uncertainty because of the larger number of Timeouts. Effectively the results are identical.

Did you find there was consistency between the two runs as to which individual files showed as OK and which showed as Not Found? I checked this with my own data to test the validity of the measuring method. If significant numbers of individual files were to dither over time in their OK/Not Found status, this would be an indication that the measuring method was not reliable. I found a high correlation between the various tests I performed. Apart from the one aberration where MediaMax took servers off-line, the only change was for Not Found files to change status to OK.
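A consistency check of this kind could be sketched as below. It assumes each run is stored as a mapping from file name to status ('ok', 'not_found' or 'timeout'); that layout, the function name and the sample data are invented for illustration and are not the actual audit data.

```python
from collections import Counter

def status_changes(earlier, later):
    """Count (earlier, later) status pairs for files tested in both runs.

    Files that timed out in either run are skipped, since their true
    status is unknown.
    """
    changes = Counter()
    for name, old in earlier.items():
        new = later.get(name)
        if new is None or "timeout" in (old, new):
            continue
        changes[(old, new)] += 1
    return changes

# Made-up example data, not real audit results.
run_1 = {"a.mp3": "ok", "b.mp3": "not_found", "c.mp3": "not_found", "d.mp3": "timeout"}
run_2 = {"a.mp3": "ok", "b.mp3": "ok", "c.mp3": "not_found", "d.mp3": "ok"}

changes = status_changes(run_1, run_2)
regressions = changes[("ok", "not_found")]  # files that went from OK to Not Found
```

A large `regressions` count between two runs would suggest files dithering in status, which is the sign of an unreliable measuring method described above.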

JD said...

I did the tests around the same time on a Friday between 12.00 and 15.00 CET. This is when the USA is still sleeping or waking up.
I have not looked into comparing the results of not found. I will do that this weekend.
The number of files I tested is around 9500.
I agree on ignoring the timeout count. But what do you mean by scaling, how did you do this math?

Martin said...

jd,

Oops! Where did those numbers for the first test come from! They are completely wrong! Sorry.

By 'scaling' I took the connection timeout number and subtracted it from 100% (34% timeouts = 66% found or not found) and then divided the 'found' and 'not found' numbers by 66/100 = 0.66 to normalise them to add up to 100%.
Test 1: 40% OK / 0.66 = 61%, 26% NF / 0.66 = 39%
Test 2: 27% OK / 0.50 = 54%, 23% NF / 0.50 = 46%
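The same scaling can be written as a few lines of code. This is just a sketch of the arithmetic above; the function name `normalise` is made up.

```python
def normalise(ok, not_found, timeout):
    """Rescale OK and Not Found percentages so they add up to 100%,
    ignoring the share of links that timed out."""
    answered = 100 - timeout  # percentage of links that gave a definite answer
    return round(100 * ok / answered), round(100 * not_found / answered)

test_1 = normalise(ok=40, not_found=26, timeout=34)
test_2 = normalise(ok=27, not_found=23, timeout=50)
print("Test 1: %d%% OK, %d%% NF" % test_1)
print("Test 2: %d%% OK, %d%% NF" % test_2)
```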

So actually the second test is noticeably worse than the first. With 9500 samples in the set, even with half of them not testing, the result should still be reasonably reliable unless there is some systematic skew whereby the normally 'found' files are more likely to time out.

JD said...

I had a feeling that the second test led to worse results, so this confirms it.
In that week uploading was also almost impossible. I have the suspicion they have capacity and stability problems as well, so timing could be an issue.
This week seemed to be even worse up till yesterday evening, when I tested uploading again.
Will do another test tomorrow and compare files not found.

JD said...

Ok, third test done this afternoon.
Results: 47% found, 22% not found, 23% connection/timeout.
Normalized this means
found 70% (47/(1-0.23)/100)
not found 32%
So this test seems to indicate we are on the move. Because of the big difference from the second test I will redo this test asap.

When I compare the files with connection/timeout problems it is quite clear that it is the same set of files which have the problems.
As the numbers above already indicate, there has been an improvement.

BTW Martin, did you receive an answer to your letter from MediaMax?

Martin said...

My latest file collection audit, made around 2 October, shows only very minor improvements. Overall 35 more files have been restored. That leaves 9625 files still missing. See graph
http://martinsotr.dnsalias.com/mediamax/Mediamax%202007.10.02.gif

The tiny slivers of pale green indicate the minuscule improvement made during the past week.

jd, the only responses I have received from MediaMax were a prompt one-liner saying my 'open letter' would be passed on to management and the recent composite response from John Hood elsewhere in this blog. There has been no personal response from anyone at MediaMax.

I almost forgot! There was also the boilerplate letter (that was sent out to everyone) in response to my sending the second graph to John Hood. It gave me the strong impression that my email had not even been read, since my email had nothing to do with the problem referred to in their bulk mailing.