“The process of cleansing data is known as data hygiene in the computer database world. The phrase refers to the detection and correction of inaccuracies in databases. Inconsistencies and outright errors are ferreted out during the process and either corrected to agree with other information in the database or removed completely. The original data may have become corrupted through storage errors on the part of the computer or through human error. The process relies on tools such as parsers, format converters, duplicate detectors and statistical checks. It is criticized for its cost and for other business concerns.
The Process
To begin, the data in question is audited. The purpose of this audit is simply to detect errors and anomalies. This is followed by workflow specification, which determines how the errors may have occurred and how they are to be handled. If the errors in the database are typos, for example, this stage may help in determining how to avoid them in the first place; employees may need to take refresher courses in basic typing. The next stage is called workflow execution. Here the specified workflow is run once its correctness has been verified. Finally, post-processing and controlling finish the process. Many errors are corrected automatically by computers during the process, but those that are not are revised manually by employees at this stage. This often requires that the process start anew in order to see whether further errors were introduced during correction.
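The audit, correct and re-audit cycle described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names audit and cleanse, the rule and fix dictionaries, and the "zip" field are all hypothetical and do not come from any particular tool.

```python
from typing import Callable

def audit(records: list[dict], rules: dict[str, Callable[[str], bool]]) -> list[tuple[int, str]]:
    """Return (row index, field name) for every value that fails its rule."""
    problems = []
    for i, rec in enumerate(records):
        for field, is_valid in rules.items():
            if not is_valid(rec.get(field, "")):
                problems.append((i, field))
    return problems

def cleanse(records, rules, fixes, max_passes=3):
    """Audit, apply automatic fixes in place, and repeat until nothing changes.

    Returns the problems that remain, which would go to manual review.
    """
    for _ in range(max_passes):
        changed = False
        for i, field in audit(records, rules):
            fix = fixes.get(field)
            if fix is None:
                continue  # no automatic correction known for this field
            old = records[i].get(field, "")
            new = fix(old)
            if new != old:
                records[i][field] = new
                changed = True
        if not changed:
            break
    return audit(records, rules)

# Hypothetical example: a five-digit postal code field.
records = [{"zip": " 02139 "}, {"zip": "ABCDE"}]
rules = {"zip": lambda v: v.isdigit() and len(v) == 5}
fixes = {"zip": str.strip}
remaining = cleanse(records, rules, fixes)  # the second record still needs a human
```

The final audit after the loop mirrors the point made above: corrections can themselves introduce errors, so the data is checked again before anything is handed to employees.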
Why Cleanse Data
It is necessary to cleanse data because people now place a great deal of confidence in the results of computer analyses. Computer databases inform investment decisions at large companies and medical tests on human patients. Lives and fortunes are at stake, and it is critical that computer users can trust the results.
Popular Methods
Four methods of carrying out data hygiene are particularly popular: parsing, data transformation, duplicate elimination and statistical methods. Each has a different emphasis.
The parsing method seeks to detect syntax errors. Just as in language, syntax refers to the rules governing how information is structured. A parser determines whether a stream of data conforms to the expected format and is therefore intelligible.
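As a minimal sketch of a syntax check, assuming a column that is supposed to hold dates written as YYYY-MM-DD (the pattern and the function name find_syntax_errors are illustrative assumptions), a regular expression can flag values that do not have the expected shape:

```python
import re

# Assumed target format: four digits, dash, two digits, dash, two digits.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")

def find_syntax_errors(values):
    """Return the indexes of values that do not match the YYYY-MM-DD shape."""
    return [i for i, v in enumerate(values) if not DATE_PATTERN.fullmatch(v)]

bad = find_syntax_errors(["2023-01-15", "15/01/2023", "2023-1-5"])
# bad == [1, 2]
```

A check like this only looks at form. A value such as "2023-99-99" would pass even though it is not a real date, so a semantic check would still be needed afterwards.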
The method known as data transformation allows the program to convert data from its present format into the required format. This occurs, for example, when numeric values are confined to the upper and lower limits expected by the required format.
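The example of confining numeric values to expected limits might look like the sketch below. The input format (a percentage stored as text) and the function name to_percentage are assumptions chosen for illustration.

```python
def to_percentage(raw: str, low: float = 0.0, high: float = 100.0) -> float:
    """Convert text such as ' 87.5% ' to a float confined to [low, high]."""
    value = float(raw.strip().rstrip("%"))  # raises ValueError on non-numeric text
    return max(low, min(high, value))

print(to_percentage(" 87.5% "))  # 87.5
print(to_percentage("120"))      # 100.0
print(to_percentage("-3"))       # 0.0
```

Silently clamping a value hides the fact that it was out of range, so in practice it may be preferable to log or flag such values rather than only correcting them.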
Duplicate elimination saves time and space by removing unnecessary copies of information. An algorithm in the program compares records to decide which ones represent the same item. It is designed to avoid erasing information that merely resembles other entries but is not really a duplicate.
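One simple way to approach this, shown here only as a sketch, is to compare records on a normalized key so that differences in case and spacing do not hide a duplicate, while records that differ in substance are kept. The field names "name" and "email" and the function name deduplicate are hypothetical.

```python
KEY_FIELDS = ("name", "email")  # hypothetical fields that identify a record

def deduplicate(records, key_fields=KEY_FIELDS):
    """Keep the first record for each normalized key and drop later copies."""
    seen = set()
    unique = []
    for rec in records:
        # Normalize: lowercase, trim, and collapse runs of whitespace.
        key = tuple(" ".join(str(rec.get(f, "")).lower().split()) for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

people = [
    {"name": "Ann Lee", "email": "ann@example.com"},
    {"name": " ann  lee", "email": "ANN@example.com"},    # same person, messy entry
    {"name": "Ann Lee", "email": "ann.lee@example.com"},  # similar, but not a duplicate
]
cleaned = deduplicate(people)  # keeps the first and third records
```

Exact matching on a normalized key is deliberately conservative. Real tools often use fuzzy matching as well, which catches more duplicates at the risk of merging records that are merely similar.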
Data is also analyzed using statistical methods such as standard deviation or clustering algorithms. These tools allow experts to identify values that are unexpected and therefore possibly erroneous.
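A standard deviation check can be sketched as follows. The threshold of two standard deviations and the function name flag_outliers are assumptions for illustration, not a recommended setting.

```python
from statistics import mean, stdev

def flag_outliers(values, threshold=2.0):
    """Return indexes of values more than `threshold` sample standard
    deviations from the mean. Needs at least two values."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

readings = [10, 11, 9, 10, 12, 10, 11, 95]
suspects = flag_outliers(readings)
# suspects == [7], the index of the value 95
```

A flagged value is only a candidate for review. An unusual reading may be correct, and in small samples a single extreme value inflates the standard deviation and can mask other outliers.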
Criticism
This apparently simple and straightforward operation is not without criticism. Efforts to cleanse data often fail at the start. The reasons vary but include unexpectedly high costs, a lack of employee time to review results and concerns about security while data hygiene is underway.”
