Can/should Data Management be part of the Configuration Management process?

Stuart Jamieson asked:

I am currently working on a financial system which is in development. It is a batch process: data goes in one end, passes through a number of processes, and a report containing numbers is output at the end.

What happens is that when the data file arrives, a developer manually "cleans it up" and it gets loaded into another application where the data is processed. It then gets manually "enriched" by somebody who cuts and pastes data from another file and merges the two. It passes on down the line, and the next person may find a mistake in the data, so it goes back to the previous step to get fixed, and so on.

I have all the "applications/scripts" that process the data under version/change management, but nobody in the company appears to be worried about managing the data itself at the intermediate steps, as they believe it will all eventually be automated. Should I be concerned?

Do I really need to capture the data inputs and outputs at each step in the process?

Can data be a Configuration Item?

thanks

Stuart

8 Answers

Bob Aiello replied on September 2, 2011 - 11:06am.

That is a very good question, and the answer may very well be yes. If possible, I would ask the corporate audit team for their input on this (make sure you get the OK from your manager to speak with them first).

Configuration management involves setting up environments along with source code management, build engineering, and release and deployment, so data is fair game. Similarly, many organizations are working on data privacy issues, and there you have to clean up the data (removing customer information) before it is loaded into a test environment.
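To make that concrete, here is a minimal Python sketch of the kind of scrubbing step I mean. The column names (customer_name, account_number, ssn) and file names are hypothetical and will differ from any real extract; the point is only that the masking should be a repeatable, versioned script rather than an ad hoc manual edit.

    import csv

    # Hypothetical column names and masking rules; the real extract will differ.
    MASKS = {
        "customer_name": lambda i: f"Test Customer {i:04d}",
        "account_number": lambda i: f"ACCT{i:08d}",
        "ssn": lambda i: "000-00-0000",
    }

    def scrub(src_path, dst_path):
        """Copy a CSV extract, replacing sensitive columns with fake values."""
        with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
            reader = csv.DictReader(src)
            writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
            writer.writeheader()
            for i, row in enumerate(reader, start=1):
                for field, mask in MASKS.items():
                    if field in row:
                        row[field] = mask(i)
                writer.writerow(row)

    if __name__ == "__main__":
        scrub("production_extract.csv", "test_load.csv")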

In my book on configuration management best practices, I talk about a "Mickey Mouse" incident where my boss yelled at me (in public) for having sensitive data on the screen. If she had looked closer she would have seen Mickey Mouse, Donald Duck and a few other Disney characters were on the payroll :-)

Bob

Marc Girod replied on September 2, 2011 - 11:27am.

It depends on your (i.e. your tool's) ability to manage it.

Under clearmake (part of ClearCase), these intermediate (as well as final) data may be identified and managed as "derived objects."

They get a system-provided object ID and are discriminated on the basis of their (audited) dependencies.

Under such a system, they become the prototypical Configuration Item, much richer than hand-made source artefacts.

Marc

bglangston replied on September 2, 2011 - 5:16pm.

Stuart,

Can data be a configuration item? Of course! Should it be? As a generality, I would say no. However, as Bob pointed out, it really depends on the data, or perhaps more correctly the type of data. I usually try to use three categories: fixed data, semi-fixed data, and dynamic data.

"Fixed data" I equate to data values that:
1. Will not change or very seldom change, and
2. Provide a lynchpin for Db returns (i.e., return values).
These I consider to be part of the Db schema construct, so if you change one of these values, you are in essence changing the Db design.

"Dynamic data" equates to raw data and many other, but not all, inputs.

"Semi-fixed data" is not so well defined. It concerns that area of data that can change, but may require control to ensure the correct functional performance of the Db or provides the correct operand (in the same sense as a constant in an algebraic equation.

So, within the process you describe, which type of data are these people dealing with? That might help you determine which should be under formal control.

With that said, frankly I don't think it would or should be your responsibility to ensure the accuracy of the inputs and outputs of the stages in the process. Or more directly: no, it is not your concern (as the CM specialist).

Granted, someone should, as a general rule of good business practice, keep each stage's inputs until the outputs are verified. Using your description as an example: when the data file arrives, Jill possibly should keep a safe copy of it, her other inputs, and her output to Jack, at least until Jack (the next person) verifies the data. Retaining these artifacts allows her to readdress her activities if there is a conflict downstream.

Jack then should do the same with his "cleaned up" work, and so should the cutter/paster/merger, and so on. But this is a product of the business process, not a CM process. If errors are occurring from the process, it is up to the business manager to determine whether it is the process, the people, or the automated parts producing those errors. If it is the process AND IF the process is under configuration control, it is your concern to ensure that any process changes are conducted according to "Hoyle." If the problem is with one or more automated elements, then conducting changes to correct those elements, NOT the data, is your concern.
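If the team wants a lightweight way to practice that retention habit before anything is automated, even something as small as the following Python sketch would do. The archive location, stage names, and file names here are made up purely for illustration:

    import shutil
    import time
    from pathlib import Path

    # Hypothetical archive location; point it at wherever the team keeps working files.
    ARCHIVE_ROOT = Path("stage_archive")

    def snapshot(stage_name, *files):
        """Keep a timestamped, read-only copy of a stage's inputs and outputs."""
        stamp = time.strftime("%Y%m%dT%H%M%S")
        dest = ARCHIVE_ROOT / stage_name / stamp
        dest.mkdir(parents=True, exist_ok=True)
        for f in files:
            copied = dest / Path(f).name
            shutil.copy2(f, copied)
            copied.chmod(0o444)  # discourage accidental edits to the archived copy
        return dest

    # e.g. Jill archives the raw feed and her cleaned-up output before handing off to Jack
    snapshot("01_cleanup", "raw_feed.csv", "cleaned_feed.csv")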

When you say "...a developer then manually cleans it up...," what do you mean by "a developer"? Jill or Jack is not developing anything in this process. Maybe they were hired as developers, but when they are performing this process, they are data entry "specialists." I think if you keep that in mind, it may be clearer that their activities, inputs, and outputs in this process are not your concern.

jptownsend replied on September 6, 2011 - 2:37pm.

Billy,

I have to disagree with you on this one. I think the data should be managed at each step of the process in whatever versioning tool is available. I feel like one of the basic tenets of CM is to give the "developers" the ability to recreate the work they have done. While I understand that keeping copies of the files on the LAN as they go through the process is one option, I don't think it is the most prudent.

As part of the process, the file should be checked in and checked out at each step. The other main concern I have is the manual manipulation of the data. While it may seem to be OK since they are going to automate the process eventually, it flies in the face of anyone who would audit this process. Manual manipulation of data is extremely risky, especially if someone manually updates it to, say, deposit money in their personal bank account or a Swiss bank account.

So I say until it is automated, all artifacts, objects, flat files, or whatever should be archived as part of the process and audited daily to ensure nothing unethical is occurring.
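The check-in step does not have to be elaborate, either. A rough Python sketch of the kind of wrapper I have in mind is below; it assumes the working directory is already a Subversion checkout, and the step and file names are invented for illustration:

    import subprocess
    from pathlib import Path

    def check_in(path, step, note):
        """Add (if new) and commit a data file so every manual step leaves a trace."""
        path = Path(path)
        # 'svn add --force' tolerates files that are already under version control
        subprocess.run(["svn", "add", "--force", str(path)], check=False)
        subprocess.run(["svn", "commit", "-m", f"[{step}] {note}", str(path)], check=True)

    # e.g. record the file before and after the manual enrichment step
    check_in("feed_2011_09.csv", "enrichment-in", "file as received from upstream")
    # ... manual cut-and-paste enrichment happens here ...
    check_in("feed_2011_09.csv", "enrichment-out", "after merging broker data")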

Regards,

Joe

Stuart Jamieson replied:

Thank you all very much for your input; it has been most helpful and has helped me to clarify an approach.

First, a bit more background. The company in question is in the financial services industry. Government regulators have decreed that new reports should be produced; this involves taking data from various parts of the company, running it against financial models, and spitting out an answer. So yes, there is going to be a high degree of auditability, perhaps even external auditors. All the effort to date has been on producing the final "numbers" and documenting WHY they used the models they did to come up with the numbers. They have now realised that the auditor is also interested in HOW these numbers are produced. You also spotted my mistake: there are no "software developers" on the project. We are talking very talented financial engineers, mathematicians, actuaries, etc.

Forgetting job titles for the moment, I was brought in to manage the contents of the various test environments, so the data that is loaded onto these environments has already been "enriched" beforehand. They are now looking to me to manage the data flowing through this "enrichment" process, which is a new challenge for me. Luckily, another team is documenting the process for me.

Based on the replies above, my approach will be to identify and allocate a unique identifier to all the data going into and out of the various process steps (e.g. INPUT DATA -> PROCESS -> OUTPUT DATA).

If it is a manual process, which includes loading the input data, dumping the output results, or manually manipulating it in the middle, then the versions of the input and output need to be recorded.

If it is an automated process then the application / script / macro involved should be versioned.

If the data feed is dynamic (e.g. currency exchange rates), then it cannot be versioned but needs to be identified.

The final task would be to manage their relationships (i.e. inputs/outputs/process/documentation?).
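To make this concrete for myself, a rough Python sketch of the kind of manifest I have in mind is below. The file names, step names, and fields are only illustrative, not a finished design:

    import csv
    import hashlib
    import time
    from pathlib import Path

    MANIFEST = Path("data_manifest.csv")  # hypothetical location for the record

    def sha256(path):
        """A content hash acts as the unique identifier for one version of a data file."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    def record_step(step, inputs, outputs, process_version, note=""):
        """Append one row relating the identifiers of a step's inputs and outputs."""
        new_file = not MANIFEST.exists()
        with open(MANIFEST, "a", newline="") as f:
            writer = csv.writer(f)
            if new_file:
                writer.writerow(["timestamp", "step", "inputs", "outputs",
                                 "process_version", "note"])
            writer.writerow([
                time.strftime("%Y-%m-%dT%H:%M:%S"),
                step,
                "; ".join(f"{p}@{sha256(p)[:12]}" for p in inputs),
                "; ".join(f"{p}@{sha256(p)[:12]}" for p in outputs),
                process_version,
                note,
            ])

    # e.g. one enrichment step, run with a versioned merge macro and a dated FX snapshot
    record_step("enrichment",
                inputs=["cleaned_feed.csv", "broker_positions.csv"],
                outputs=["enriched_feed.csv"],
                process_version="merge_macro v1.4",
                note="manual merge; FX rates snapshot taken 09:00 GMT")

The content hash would give me the unique identifier for each version of a data file, and each manifest row records the relationship between the inputs, the outputs, and the process (or script version) that produced them.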

I wonder if I should tell them that Subversion is not a Configuration Management tool?

thanks once again

Stuart

Bob Aiello replied on September 7, 2011 - 10:37am.

I don't want to get into the debates on version control tools, but Subversion (SVN) is a widely used open source version control tool. We could have a healthy debate about whether or not it is appropriate in a financial services firm, but that is a different topic. Configuration Management tools serve many different functions (e.g. build engineering, workflow automation, etc.).

Stuart Jamieson replied:

Of course you are right, Bob. I did not mean in any way to be disrespectful about Subversion.

I was just reminding myself that the project doesn't quite understand the difference between Version Management and Configuration Management.

(hope I haven't opened a can of worms here)

bglangston replied on September 7, 2011 - 4:40pm.

Stuart,
Your clarifications provide a different kettle of fish to fry.

Sounds like the data and reports are actually part of the test environment. Way back in my early days, a programmer once told me that his best approach was to figure out what he needed an app to do (requirements), do a process manually until it worked properly (design development), and then automate, test and deploy it.

Sounds like you guys are doing a sort of "development of proof of concept" by "testing" the steps with predetermined data and data formats.

The point is that in this case I would say you should have the data under control so you can reduce the number of variables. For example, if you modify the wing connections at the same time as increasing engine power, how will you know which caused the wing to fall off?

I am assuming that the data inputs and outputs are occasionally modified in regard to the data, the form and/or format.

You said, "They are now looking at me to manage the data flowing through this 'enrichment' process which is new challenge for me." Of course they are! In my mind, this is very similar to creating a release for stages in a life cycle because someone must make sure that the materials released are the ones designated for the release -- and that someone appears to be you in this case.

We once ran across a software program that passed the developers' test but failed system testing. The internal tester claimed that he was performing internal system testing, so he couldn't understand why it passed with him but failed in the external environment. After he explained where he had obtained the system test materials (which he wasn't supposed to have), his team lead did a little research and discovered that the test materials had been modified (formally) several times. We also had a case where an external tester was using test materials from his workstation instead of the materials forwarded by the CM Librarian.

I suspect it is especially to prevent such situations that they want you to control the test materials.

Regarding tools used as baseline version repositories (Subversion included), it is not uncommon to store artifacts in them, even if there is no intent to change the artifacts (a la change control). Typically, a good repository tool, especially one developed specifically for CM work, has security features that can help guarantee the integrity of the files.

Speaking of "CM tools," would someone please define what contitutes a "CM tool?"
