The e-mail request and accompanying ticket from the QA staff came in fairly early in the morning: a routine upgrade of some middle office components in the production shadow environment (QA). However, this type of change would also require flushing shared memory segments and restarting the matching engine that served the specific fixed income instruments managed by the middle office components being upgraded. Both the middle office components and the matching engine happened to be running on the same host, and this was true for both the QA environment and the Production environment.
At that instant, my heart was in my throat. What followed immediately was a quick recovery effort by our team to restore the components to their original versions and restart them. We then engaged Market Operations to determine whether any open orders had existed on those instruments at the time of the restart; fortunately, there were none. Nonetheless, the potential for loss, both monetary and in customer relations, was significant.

It was a human error, and the human was me. I was a relatively new member of the Application Support Engineering Team, which was responsible for second and third level support issues intraday; application upgrades and configuration changes to the QA environment intraday; and similar application upgrades and changes to production after hours, along with ready-for-business tests. Established policies and procedures for performing changes did exist; however, all the operations and controls were essentially manual, which introduced the potential for errors. There were no automations or tools in place to reliably and consistently perform application release deployments, configuration changes and rollbacks, nor any access controls to limit production access. Since rcp, remsh and manual backups for rollbacks were the accepted practices, I was able to accept that. However, it was the apparent lack of consideration given to the naming conventions of these particular systems that really drove me crazy.

In the episode I confessed to above, the hostnames of the QA system and the production system involved were almost identical. The first 6 characters of the names, which described the platform, were identical; only the last 3 differed. But nothing in these names stated, or let you derive, whether the host was the production host or the QA host.
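To make the point concrete, here is a minimal sketch of the kind of pre-flight check that becomes possible once a host name carries an environment marker. Everything in it is an assumption for illustration: the marker strings, the host names and the function names are hypothetical, and nothing like this existed in the environment described.

```python
from typing import Optional

# Hypothetical markers: "SHAD" for shadow production (QA), "PROD" for production.
ENV_MARKERS = {"SHAD": "qa", "PROD": "production"}


def environment_of(hostname: str) -> Optional[str]:
    """Return the environment implied by a host name, or None if it gives no clear hint."""
    upper = hostname.upper()
    found = {env for marker, env in ENV_MARKERS.items() if marker in upper}
    # Exactly one marker must match; zero or several means the name is ambiguous.
    return found.pop() if len(found) == 1 else None


def confirm_target(hostname: str, intended_env: str) -> None:
    """Raise ValueError before any change is made if the host does not match the intended environment."""
    actual = environment_of(hostname)
    if actual is None:
        raise ValueError(f"{hostname!r} does not identify its environment; refusing to proceed")
    if actual != intended_env:
        raise ValueError(f"{hostname!r} is a {actual} host, but this change targets {intended_env}")
```

A deployment script would call `confirm_target(host, "qa")` before touching anything. With names like the ones in this incident, `environment_of` returns `None` and the check can only refuse, which is exactly the problem.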
The explanation for how these names were arrived at was that these were legacy systems acquired in a merger, and the existing names were retained. While this was primarily a Sun shop, the platforms involved in this incident were actually Data General AViiONs running DG/UX. I was told that it would be virtually impossible to change the hostnames of these systems to something more intuitive. So the solution I proposed going forward was simply to add DNS alias names for these systems. Since the QA environment was typically referred to as "shadow production", I suggested that the string "SHAD" be incorporated into the alias for the QA server, and the string "PROD" into the alias for the Production server.

Proper naming conventions are one of those initiatives that I consider "low hanging fruit". They are easy and cheap to implement if you have the luxury of setting and enforcing the policies early on in a deployment. However, it's often very difficult to retrofit a naming convention further down the road. Of course, this is not restricted to hostnames. Elements such as directory structures, file names, script names and object names are candidates as well.

Recently, I was engaged in architecting an Application Release Management and Server Lockdown facility. The core tool chosen by the client for implementing the requirements was BMC's BladeLogic. The initial rollout was slated to have over 2,000 applications managed within BladeLogic. One of my responsibilities was to develop BladeLogic training curricula and best practices for the client's internal development and operations staff. A BladeLogic Object Naming Convention was among the deliverables. The BladeLogic Configuration Manager essentially provides for the creation of Packages (BLPackage) of proprietary applications, third party software bundles, bug fixes, patches, etc.
Along with the creation of packages, there are also deployment jobs, snapshot jobs, audit jobs and script jobs. Each of these is considered an object which must be created and uniquely named in the BladeLogic Configuration Manager. Additionally, the Workspace of the Configuration Manager allows for the creation of folders and sub-folders, which provide a hierarchical namespace. With many objects potentially being created for each application by many different business units and groups, it was clear that an object naming convention needed to be adopted early on in the roll-out. The odds of the namespace growing out of control were too high.

What I proposed was essentially a multi-part, hierarchical name that would identify objects with sufficient granularity without being too verbose. There was also sufficient motivation to enforce some grammar rules in the name formats. The reason was that we'd be able to develop reporting tools on the namespace using the BladeLogic CLI. Presuming that object names adhered to these grammar rules, we'd be able to easily traverse the object namespace and filter as needed. One of the maintenance tasks we anticipated was some sort of namespace audit and periodic cleanup, so the structured name would give object owners the information they needed to decide whether an object should be retained or not.

So, borrowing a few lines from William Shakespeare: "What's in a name? That which we call a rose by any other name would smell as sweet." I've got no issue with that ... but please try to at least give your QA and Production systems names that have some relevant meaning.

BladeLogic Objects Naming Convention
Naming Parameter Descriptions
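As a minimal sketch of why grammar rules make a namespace easy to report on, the example below parses and filters multi-part names with a regular expression. The field layout, separator and allowed values are illustrative assumptions only, not the convention adopted at the client, and the code uses plain Python rather than the BladeLogic CLI.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Hypothetical grammar: <UNIT>_<APP>_<KIND>_<ENV>_<VERSION>
NAME_PATTERN = re.compile(
    r"(?P<unit>[A-Z]{2,6})_"
    r"(?P<app>[A-Z0-9]{2,12})_"
    r"(?P<kind>PKG|DEPLOY|SNAP|AUDIT|SCRIPT)_"
    r"(?P<env>SHAD|PROD)_"
    r"(?P<version>\d+(?:\.\d+)*)"
)


class ObjectName(NamedTuple):
    unit: str     # business unit that owns the object
    app: str      # application the object belongs to
    kind: str     # package, deployment job, snapshot job, audit job or script job
    env: str      # target environment
    version: str  # dotted release number


def parse_name(name: str) -> Optional[ObjectName]:
    """Return the parsed fields, or None if the name breaks the grammar."""
    match = NAME_PATTERN.fullmatch(name)
    return ObjectName(**match.groupdict()) if match else None


def filter_names(names: Iterable[str], **criteria: str) -> List[ObjectName]:
    """Return conforming names whose fields equal every given criterion."""
    parsed = (parse_name(n) for n in names)
    return [
        p for p in parsed
        if p is not None and all(getattr(p, k) == v for k, v in criteria.items())
    ]
```

With names such as `FI_MATCHENG_PKG_SHAD_1.4.2`, a cleanup report reduces to a call like `filter_names(all_names, unit="FI", env="SHAD")`, and any name for which `parse_name` returns `None` can be flagged for its owner to rename.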
William King started his career as a Software Engineer before working as both a full-time employee and a consultant at several top NYC Financial Services firms. He has held roles such as Senior Infrastructure Engineer, UNIX Engineer and Application Support Specialist, with extensive experience supporting production environments for electronic trading systems and investment banking user communities, and with responsibilities requiring hands-on development and technical support of systems monitoring, application release management and deployment tools. Mr. King holds a B.S. in Electronic Engineering Technology. You may contact Mr. King at willyjk@optonline.net or link with him at http://www.linkedin.com/in/williamjking
Last Updated on Wednesday, 22 July 2009 13:58


