Check Duplicate IDs and Records

This module could be used to

(1)   Check data error

(2)   Report on repeated measurements data.

Duplication by data entry error or ID assignment error

Duplicate records means all field are same.  Duplicated IDs means ID are same but there are some other fields not same.

If a record was mistakenly entered more than once (all fields are same) then your data will have duplicate records. If the repeated entry had some fields (except ID variable) entry error then your data will have duplicate IDs (not all fields are same), or If one ID was mistakenly assigned to more than one study subject then your data will have duplicate IDs. 

Use this module to pick up these mistakes and save a clean data file.

To check for duplicate records, basically this module will compare all fields, if there are 2 or more records that all fields are same, they are duplicate records.  However, your data may have an automatic increment variable representing the record number.  This variable should be excluded when checking for duplicate records. If there are auto-incrementing variable, you need to specify variables for checking duplicated records. Otherwise all variables will be used.

Duplicated IDs by repeated measurement

If your data had repeated measurements, for example, your data had N families and each family had 1-10 persons, each person had one record, then each family will have 1-10 records, you want to report number of families by family size (number of persons), this module can be applied.

A sample screen shot of input window:

Example 1: check duplicate IDs and records, output a clean data file

ScreenHunter_03 Jan. 12 19.33.gif

In this example, variable “SNO” is record number, it was not included in variables list that were used for check duplicate records.

This example, will check for duplicate ID and records, and output a clean data file “example_chk”.

Sample output and explanation:

Number of duplicated records: 3

 Number of duplicates Number of records

                    2                 1

                    3                 1

 

     ID SEX  AGE HEIGHT WEIGHT SBP DBP OCCU EDU PSMK SMOKE ALH SNP1 SNP2 SNO

42  116   1 42.1   1.65   60.5 128  68    0   2    0     0   0    2    1  42

43  116   1 42.1   1.65   60.5 128  68    0   2    0     0   0    2    1  43

123 361   1 26.7   1.72   63.0 118  62    0   2    1     2   0    2    1 123

124 361   1 26.7   1.72   63.0 118  62    0   2    1     2   0    2    1 124

125 361   1 26.7   1.72   63.0 118  62    0   2    1     2   0    2    1 125

                                      

Number of records with duplicated ID: 9

 Number of duplicates Number of ID

                    2            3

                    3            1

     ID SEX  AGE HEIGHT WEIGHT SBP DBP OCCU EDU PSMK SMOKE ALH SNP1 SNP2 SNO

42  116   1 42.1   1.65   60.5 128  68    0   2    0     0   0    2    1  42

43  116   1 42.1   1.65   60.5 128  68    0   2    0     0   0    2    1  43

433 270   2 33.0   1.50   50.0 118  70    0   1    0     0   0    2    1 433

436 270   2 33.0   1.50   50.0 122  70    0   1    0     0   0    2    1 436

123 361   1 26.7   1.72   63.0 118  62    0   2    1     2   0    2    1 123

124 361   1 26.7   1.72   63.0 118  62    0   2    1     2   0    2    1 124

125 361   1 26.7   1.72   63.0 118  62    0   2    1     2   0    2    1 125

145 399   1 39.3   1.65   64.5 126  66    0   4    0     2   0    2    1 145

143 399   1 39.3   1.65   64.5 126  72    0   4    0     2   0    2    1 143

                                                                               

 Removed duplicated records: if 3 records were same, remove 2 and keep 1 record

                                                                      

 Removed duplicated IDs: if 3 records had same ID, remove all 3 records

 

In this example output:

There are 3 duplicate records, 1 record (ID=116) had 2 duplicates including itself (SNO=42 and 43), 1 record (ID=361) had 3 duplicates including itself (SNO=123, 124 and 125).

There are 9 records with duplicated IDs, 3 IDs (ID=116, 270, 399) each had 2 records, and 1 ID (ID=361) had 3 records.  Duplicate IDs includes duplicate records.