Unfortunately, no amount of training can prevent this: humans are prone to making mistakes. I spent 12 hours of my precious Saturday devising recovery logic for corrupted data at a customer site thousands of miles away. Those last few words say it all: this is a global phenomenon, and you cannot fully stop it.
But there should be ways to minimize it. Learnings from mistakes have to be incorporated into the day-to-day policies guiding the operational practices of Telco IT divisions and other teams manning the production utilities and systems of such an arena.
- Sensitizing the workforce towards quality over reckless speed is a good starting point.
- Smaller teams, organized along the fragmentation lines between the modules of the system, give each team a better sense of ownership of its module.
- Education about the link points between the various modules is often not well developed within teams. These link points are the hotspots where inconsistencies at the handover points (or breakpoints) in the system get propagated from one module to another and finally cause widespread data corruption.
- The point above helps IT teams build comprehensive check mechanisms: by automating those learnings, 'system checks' or data sanity checks can be run far more regularly.
- The benefit is that data corruptions will be caught much faster than they are today, where a corruption left unattended interacts with other data through related functional processes; the software then makes wrong decisions on the corrupted data and corrupts further data.
- Also notable: a mature attack philosophy against these problems keeps teams from falling prey to knee-jerk reactions whenever a data corruption is found. On many occasions it is these knee-jerk reactions that make the situation harder to correct. They create even greater corruptions that stay hidden for now and surface at a later date, by which time the level of criticality will have increased and the time available to fix them will have shortened.
- Validation of data-correction results should be done with an even more serious mindset, to really understand to what extent we have been able to correct the data, and whether the correction mechanism itself is an efficient one. If result validation causes several round trips back to the correction phase, then that correction strategy/plan is not a wholesome one.
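To make the idea of automated checks at module handover points concrete, here is a minimal sketch in Python. The module names (billing, provisioning), the `subscriber_id` key, and the `plan` field are purely hypothetical illustrations, not a real Telco schema; the point is only the shape of a cross-module consistency check.

```python
def check_handover_consistency(billing_records, provisioning_records):
    """Flag subscriber IDs that exist in one module but not the other,
    or whose shared fields disagree at the handover point."""
    issues = []
    billing_by_id = {r["subscriber_id"]: r for r in billing_records}
    prov_by_id = {r["subscriber_id"]: r for r in provisioning_records}

    # Records present in one module but missing in the other
    for sid in billing_by_id.keys() - prov_by_id.keys():
        issues.append((sid, "present in billing, missing in provisioning"))
    for sid in prov_by_id.keys() - billing_by_id.keys():
        issues.append((sid, "present in provisioning, missing in billing"))

    # Shared fields must agree across the handover point
    for sid in billing_by_id.keys() & prov_by_id.keys():
        if billing_by_id[sid]["plan"] != prov_by_id[sid]["plan"]:
            issues.append((sid, "plan mismatch between modules"))
    return issues
```

Run on a schedule, a check like this catches an inconsistency before downstream processes act on the corrupted record and spread the damage further.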
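The last bullet, on measuring round trips between correction and validation, can be sketched as a simple loop. `correct` and `validate` here are hypothetical placeholders for a real correction script and its result-validation checks; the number of rounds used is the signal the bullet describes.

```python
def run_correction(data, correct, validate, max_rounds=3):
    """Apply a correction repeatedly until validation passes.
    Returns (data, rounds_used, passed). A high rounds_used means
    the correction strategy itself is not a wholesome one."""
    for rounds in range(1, max_rounds + 1):
        data = correct(data)
        if validate(data):
            return data, rounds, True
    return data, max_rounds, False
```

For example, a correction that clamps negative values should validate in one round; a strategy that keeps bouncing back to the correction phase is worth rethinking rather than re-running.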
Well, quite strained today, but I wanted to get these thoughts down here while they were still raging HOT in the penthouse upstairs :-).
All the best !!