Operational Mistakes in B/OSS Systems – How can we get Better

Unfortunately, no amount of training can fully prevent this. Humans are prone to making mistakes. I spent 12 hours of my precious Saturday trying to come up with recovery logic for corrupted data at a customer site thousands of miles away. That distance is the point: this is a global phenomenon, and you cannot stop it completely.

But there should be ways to minimize it. Learnings from mistakes have to be incorporated into the day-to-day policies that guide the operational practices of Telco IT divisions, or of other teams manning the production utilities and systems in such an arena.

  • Sensitizing the workforce to value quality over reckless speed should be the starting point.
  • Smaller teams, planned along the fragmentation lines between the modules of the system, give each team a better sense of ownership of its module.
  • Education about the link points between the various modules is often not well developed within teams. This is the hotspot: inconsistencies at the handover points (or breakpoints) in the system get propagated from one module to the next and finally cause widespread data corruption.
  • Understanding the link points helps IT teams come up with comprehensive check mechanisms. Automating those learnings makes it possible to run a ‘systems check’ or data sanity check far more regularly.
  • The benefit of regular checks is that data corruption is caught much faster than it is now. Today, a corruption left unattended interacts with other data through related functional processes; the software then makes wrong decisions based on the corrupted data and corrupts further data.
  • A mature attack philosophy against these problems also helps teams avoid knee-jerk reactions whenever a data corruption is found. On many occasions it is these knee-jerk reactions that lead to actions which make the situation harder to correct. They cause even greater data corruption that stays hidden at first and appears at a later date, by which time the criticality will have increased and the time available to fix it will have shrunk.
  • Validation of data correction results should be done with an even more serious mindset, to really understand to what extent we have been able to correct the data. Also check whether the correction mechanism is an efficient one: if result validation causes several round trips back to the correction phase, the correction strategy/plan is not a sound one.
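
To make the points about handover checks and correction validation a little more concrete, here is a minimal sketch in Python. It is only an illustration under assumptions of my own: the two "modules" are plain dictionaries keyed by a shared record ID, and every name in it (reconcile, correction_progress, Mismatch, the sample order and billing records) is hypothetical rather than taken from any real B/OSS product. The idea is to compare both sides of a handover point, then rerun the same comparison after a correction to see what was fixed, what remains, and what the correction itself broke.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mismatch:
    """One inconsistency found at a handover point (hypothetical structure)."""
    key: str
    kind: str    # "missing_downstream", "missing_upstream" or "field_mismatch"
    detail: str


def reconcile(upstream, downstream, fields):
    """Compare records on both sides of a handover point, keyed by a shared ID."""
    mismatches = []
    for key in sorted(upstream.keys() | downstream.keys()):
        up = upstream.get(key)
        down = downstream.get(key)
        if down is None:
            mismatches.append(Mismatch(key, "missing_downstream", ""))
        elif up is None:
            mismatches.append(Mismatch(key, "missing_upstream", ""))
        else:
            for field in fields:
                if up.get(field) != down.get(field):
                    detail = f"{field}: {up.get(field)!r} != {down.get(field)!r}"
                    mismatches.append(Mismatch(key, "field_mismatch", detail))
    return mismatches


def correction_progress(before, after):
    """Summarize a correction by comparing check results before and after it."""
    before_set, after_set = set(before), set(after)
    return {
        "fixed": len(before_set - after_set),
        "remaining": len(before_set & after_set),
        "introduced": len(after_set - before_set),  # damage done by the fix itself
    }


if __name__ == "__main__":
    # Hypothetical sample data: order management (upstream) vs billing (downstream).
    orders = {
        "A1": {"plan": "gold", "status": "active"},
        "A2": {"plan": "silver", "status": "active"},
        "A3": {"plan": "gold", "status": "active"},
    }
    billing = {
        "A1": {"plan": "gold", "status": "active"},
        "A2": {"plan": "gold", "status": "active"},
        "A4": {"plan": "silver", "status": "active"},
    }
    for mismatch in reconcile(orders, billing, ["plan", "status"]):
        print(mismatch)
```

Run on a schedule, a check like this is what catches a corruption before other processes start acting on it. A non-zero "introduced" count, or a "remaining" count that barely moves across several correction rounds, is the signal that the correction plan itself needs rethinking.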

Well, I am quite strained today, but I wanted to get my thoughts down here while they were still raging HOT in the penthouse upstairs :-).

All the best !!

One thought on “Operational Mistakes in B/OSS Systems – How can we get Better”

  1. Amit, your posts are so lucid and well thought out that I love reading them from time to time. This one is really good. Food for thought. We ignore these issues at times, especially with legacy systems such as those at a tier-1 mega customer site, where data cleansing is a major paid activity. There are social factors attached which bring a sense of insecurity and resistance into the overall process. But if a migration is to be planned, cleansing is a must to avoid the pitfalls you mention. However, creating generic robots to perform cleansing seems far from reality in this age of big data, unless there is a good amount of R&D spending in this area. I hope one day the world will be rid of legacies, and there will be standardization all around us so we can worry less. Thanks for the post.
