[Net 2000 Ltd. Home][Data Masker Home][Data Masker Manual][Data Masker FAQ]
Scrambling Data in the Scott Schema
The Data Masker software contains a set of masking rules for the sample Scott schema supplied with the Oracle database. You are encouraged to add your own masking rules to this masking set. The first section of this page discusses the existing rules (as supplied with the installation) and the second part provides some hints of additional rules which might be added. An ER diagram for the Scott schema is available.
Masking Rules in the Sample Scott Masking Set
The Existing Rules
- If you wish to explore each rule further, open the masking set with the Data Masker software and double click on the rule with the mouse to launch it in editing mode. You can then view the configuration of the rule in detail.
- Rule 01-0001
- This is a Rule Controller. Notice how the other rules are dependent on it. A Rule Controller defines the schema on which the other masking rules will act. If you change the login information in the Rule Controller its rules will operate on the new schema. It is up to you to ensure that the Rule Controller points at an appropriate schema. All other masking rule types must have a parent Rule Controller and every masking set must contain at least one Rule Controller.
- Rule 01-0002
- A Substitution Rule on the EMP table. This rule uses the Random Surnames dataset to provide the substitution data which masks the ENAME column.
- Rule 01-0003
- A Shuffle Rule on the JOB column in the EMP table. Job titles could act as an indicator of the original row contents. A title such as President is probably a unique identifier and is an example of the Isolated Case Phenomena. Your analysis of how a schema should be masked must take this factor into consideration. The Data Scrambling Issues white paper discusses this and many other subtle issues in considerable detail. We highly recommend that you read this paper.
- Rule 01-0004
- A Substitution Rule on the EMP table. This rule uses the Number Variance dataset to mask the SAL column. In cases such as dates or numbers it is often not desirable to randomly replace the data as doing so will change the distribution of the values. The Number Variance dataset will modify the existing values so that after masking they contain a value which is varied by a random percentage (within specified bounds) of the original value.
- Rules 20-0005 and xx-0006
- A chain of a Substitution Rule and Row-Internal Sync Rule on the COMM column in the EMP table. These rules form a pair: Rule 20-0005 masks the COMM field in all rows in the table using the NULL Values dataset to remove the contents. Then rule xx-0006 goes back over the same data and replaces the data in a random sampling of the rows (66%) with a value which is a random percentage of the previously masked SAL column.
Important Point:There are a number of important techniques illustrated in rules 20-0005 and xx-0006. First lets discuss the rule requirements and then discuss the techniques used to implement those requirements.
- The basic decision taken before designing the masking rules is that the COMM field is too sensitive to leave any trace of the original values - including even the fact that the individual might or might not have received a commission.
- However, for the purposes of the end users of the test schema, a random selecton of the COMM fields must be populated and those values are required to be a sensible percentage of the rows SAL values.
- Remember the Data Masker software can run multiple rules simultaneously so rule xx-0006 cannot begin until rule 20-0005 completes. Steps must be taken to enforce this.
- Likewise the SAL column is, in itself, a masked value so the masking operation on the COMM field cannot start until rule 01-0004 has finished. Accordingly, steps must be taken to configure the rules so that the collection of rules 20-0005 and xx-0006 are not run until rule 01-0004 has completed.
At this point it is highly recommended that you read the Rule Blocks and Dependencies help page and also view the Using Rule Blocks and Dependencies tutorial to better understand how to explicitly control the execution order of the masking rules. [local] [internet]
- Rule 0005 is implemented as a standard Substitution rule on the COMM field using the NULL Values dataset. When this rule is run all values in the column will be replaced with nulls.
- After rule 0005 completes, we wish to set the COMM field to a value related to other fields in the same row. This is what Row-Internal Sync Rules are designed to do. Double click on rule 0006 and have a careful look at how it is configured - it may also be useful to read the help page for this rule. You may also wish to look at the Creating Row-Internal Synchronization Rule tutorial [local] [internet]
- We cannot permit rules 0005 and rules 0006 to run simultaneously. Otherwise the actions of rule 0006 could be undone by rule 0005. There are two ways of controlling execution order: Rule Blocks and Dependencies. In this case since both rules are operating on the same column we chose to make rule 0006 dependent on rule 0005. This is done by dragging rule 0006 with the mouse and dropping it onto rule 0005. The dependency relationship ensures that rule 0005 will complete before rule 0006 is started.
- It is also required that rules 0005 and 0006 should not start until rule 01-0004 completes. Notice the structure of a rule ID - there are two parts: the rule block and the rule number. The 01 in rule ID 01-0004 indicates that the rule is in rule block 01.
- All rules in rule block 01 will completely execute before the next highest rule block begins. Accordingly, rule 0005 is changed to have a rule block higher than that of rule 0004 (20 in this example). This ensures that rule 20-0005 will not start until every other rule with a higher rule block completes. To change a rule block just select it with the mouse and edit it.
- Note that rule 0006 gets a rule block of xx once it is made dependent. A dependent rule assumes the rule block of the parent and executes immediately after the parent rule - irregardless of its rule block.
- The decison to use rule blocks to control the execution order for rule 20-0005 and a dependency for rule 20-0006 is mostly for visual effect to make the rules more readable and self documenting. The thinking is that since rules 0005 and rules 0006 operate on the same column, the dependency relationship emphasizes that the two rules are part of the same operation. Since rules 0004 and 0005 operate on separate columns, the distinctive run order is emphasized by using a rule block rather than a dependency.
- Below are some suggestions for additional rules which can be added to the Scott schema sample masking set. An ER diagram for Scott is available to assist you in your analysis of the schema structure. If you do add new masking rules, it is a good idea to save the modified masking set under a new name - that way if you upgrade the Data Masker software you will not overwrite your enhancements.
- Have a look at the contents of the HIREDATE column in the EMP table. In a small organization this could be a distinct identifier of the original contents of the row. Implement a Substitution Rule and the Date Variance dataset to mask the HIREDATE column.
- Possibly it might be a good idea to mask the contents of the DNAME column. You could try using a Shuffle Rule on this column or read about User Defined Datasets, make some of your own department names, and use them in a Substitution Rule on the DNAME column.
- Apply a Substitution Rule with the Number Variance dataset to the LOSAL column in the SALGRADE table. The HISAL value could be similarly masked. Build a Row-Internal Sync rule to tidy things up and make sure the HISAL is not a lower value than the LOSAL. Make sure to
set the rule block on this rule so that it always runs after the two variance rules.
[Net 2000 Ltd. Home][Data Masker Home][Data Masker Manual][Data Masker FAQ]