Paper number: TT02 CHECK YOUR DATA MORE EFFICIENTLY
Jian Hua (Daniel) Huang, Forest Laboratories Inc, NJ
ABSTRACT:
%CHKDATA is a SAS macro program designed to check the data in an efficient and user-friendly way. First,
the macro can check the data structure by generating three types of key information: the contents of dataset,
its associated SAS format, and a collection of all variable names listed horizontally. Second, the macro can
generate the distinct values and frequency counts for any specified variables. Third, the macro can define
any potential data issues and generate the reports. In addition, %CHKDATA has one important special
feature. The macro can work on multiple datasets at the same time. When processing multiple datasets, it
combines data information from each input dataset and list them side-by-side in one report, therefore,
people can easily review and even compare the information among all input datasets. This is especially
helpful for people working on data integration or anything where multiple datasets are used and compared.
In summary, %CHKDATA is a very useful tool for anyone who wants to review and understand data quickly
BACKGROUND:
Whether the task is to create a report or to derive a new dataset, people start from some source data and,
more importantly, they need to understand the source data correctly and comprehensively. Normally, people
review the source data by simply opening the dataset and reading it directly. However, if the data is huge,
which includes dozens of variables and thousands of records, and if there is not only one single dataset but
multiple datasets are involved, and if these datasets are not from one study but from many different studies,
reading all the data one-by-one could be slow and unreliable. It is better to have utility programs that help us
review and understand the data more efficiently.
%CHKDATA is a SAS macro program designed to check the data in a more efficient and user-friendly way.
The macro is originally called ‘%GETSTART’ (sounds like ‘get start’). Because, at the beginning, it was
designed only to check the data structure and generate distinct values, so people could use such
information and get works (i.e. table programming) started quickly. Later, more functions were added to the
macro. In addition to collecting information on data structures, the macro also checks for the potential data
issues and reports them properly and timely. The macro also becomes more powerful by its ability to
process multiple datasets at the same time. Due to its increased functional scope, the macro was renamed
to %CHKDATA (sounds like ‘check data’).
%CHKDATA MACRO:
The %CHKDATA macro has two major functions. First, the macro can check the data structures for input
dataset and generate distinct values for any specified variables. Second, the macro can define and
summarize potential data issues, and report them properly. In addition, %CHKDATA has one special
feature in which the macro can process multiple datasets at the same time.
USING %CHKDATA TO CHECK THE DATA STRUCTURE AND GENERATE DISTINCT VALUES:
As mentioned before, %CHKDATA can generate three types of key information for data structures: the
contents of dataset, the SAS format values, and a collection of all variable names which are listed
horizontally on one page. Let us review these functions with below examples.
Example 1; display the contents of input dataset.
%chkdata(lib=xxxx21d, data=addm, content=yes, report=addm_content1); Reported by SAS user: JHHUANG, through SAS program: TEST_CHKDATA.SAS, on date: 2008-09-29 at time: 14:53 List of the contents of each input dataset side-by-side
Variable Attrib XXXX21D Variable Attrib XXXX21D Names -utes ADDM Names -utes ADDM ----------------------------------------------- ----------------------------------------------- AGE Type 1 Format DATE Length 8 Label Age DOBN Type 1 Format Length 8 Label Date of Birth (numeric) CENTRE Type 2 Format Length 3 Label Centre Number INVNO Type 2 Format $ Length 4 Label Investigator ID COUNTRY Type 2 Format $ Length 14 Label Country ITT Type 2 Format Length 7 Label ITT COUNTRYC Type 2 Format $YNS
The contents of dataset: ADDM from one single study: XXXX21D is generated by calling %CHKDATA and
the sample output is listed above. The output lists variable names and the values of four key attributes:
‘type’, ‘length’, ‘label’ and ‘format’. To save page space, if there is only one input dataset, each page
contains two panels. If there are multiple input datasets, then the contents of each dataset will be listed side-
by-side. We will discuss later about how %CHKDATA deal with multiple datasets.
Example 2; display the SAS formats of input study.
%chkdata(lib=xxxx21d, printfmt=yes, report=xxxx21d_format1); Reported by SAS user: JHHUANG, through SAS program: TEST_CHKDATA.SAS, on date: 2008-09-26 at time: 15:21 List of SAS format of each input study side-by-side ______________________________________ ______________________________________ xxxx21d xxxx21d Format _____________________________ Format _____________________________ Names Start End Label Names Start End Label -------------------------------------- -------------------------------------- ACCEPTS 1 1 YES, ENTERELY 6 6 NEW TREATMENT ACCEPTABLE GIVEN 2 2 YES, SOMEWHAT 7 7 OTHERS ACCEPTABLE 8 8 STOPPED 3 3 UNCERTAIN (TEMPORAL) 4 4 NO, SOMEWHAT 9 9 STOPPED UNACCEPTABLE (DEFINITIVE) 5 5 NO, ENTERELY 10 10 DOSAGE REDUCED UNACCEPTABLE 11 11 INTERRUPTED 98 98 N.D. 12 12 DISCONTINUED 99 99 N.A. (PERMANENTLY) 13 13 DOSAGE INCREASED ACTIO1S 1 1 None 14 14 DOSE NOT CHANGED 2 2 Dosage reduced 15 15 DOSE REDUCED 3 3 Interrupted 16 16 DRUG WITHDRAWN 4 4 Discontinued 17 17 DOSE INCREASED (permanently) 18 18 DRUG WITHDRAWN 5 5 Dosage Increased PERMANENTLY 6 6 Dose not changed 98 98 N.D. 7 7 Dose reduced 99 99 N.A. 8 8 Drug withdrawn 9 9 Dose increased ACTIV1S 0 0 Screen 98 98 N.D. 1 1 Change Subject 99 99 N.A. Status
The SAS format of one single study: XXXX21D is generated. The output includes format name, its start and
end values, and most importantly, the label. To save page space, if there is only one input study, each page
Example 3, list of all variable names on one page horizontally.
%chkdata(lib=issd, data=d_prof2, allvar=yes, report=prof2_allvar); Reported by SAS user: JHHUANG, through SAS program: TEST_CHKDATA.SAS, on date: 2008-09-26 at time: 15:21 Check data from: ISSD.D_PROF2 List all variable names horizotally, so they are convenient to be reviewed or edited Obs var1 var2 var3 var4 var5 var6 var7 var8 var9 var10 var11 var12 var13 var14 var15 1 AGE AGEGRP AGEGRP_C BIRTHDT BIRTHDTC BMI BMIGRP BMIGRP_C CENTER COMPLETE COMPLE_C COPDSEV COPDSE_C COPDSE_O COUNTRY Obs var16 var17 var18 var19 var20 var21 var22 var23 var24 var25 var26 var27 var28 var29 var30 1 DEATH DEATHDT DEATHDTC DEATH_C DESIGN DOSE DOSE_C DUR_STDY DUR_TRT FDSDT FDSDT99 FDSDT99C FDSDTC HEIGHTCM HEIGHTIN Obs var31 var32 var33 var34 var35 var36 var37 var38 var39 var40 var41 var42 var43 var44 var45 var46 var47 1 INITIALS INVNO ITT ITT_C LASTDT LASTDTC LDSDT LDSDT99 LDSDT99C LDSDTC PATYRS PERIOD PERIOD_C PID PP PP_C RACE Obs var48 var49 var50 var51 var52 var53 var54 var55 var56 var57 var58 var59 var60 var61 var62 var63 1 RACEOTH RACETYPE RACETY_C RACE_C RACE_O RAND RAND_C SAFETY SAFETY_C SCREENO SCRNDT SCRNDTC SEX SEX_C SMOKER SMOKER_C Obs var64 var65 var66 var67 var68 var69 var70 var71 var72 var73 var74 var75 1 SMOKER_O STUDYID TERMSPEC TREASON TREASO_C TREASO_O TREATC TREATGP TREATGPC TREATSQ WEIGHTKG WEIGHTLB
There are totally 75 variables (var1-var75) existed in input dataset: D_PROF2. All variable names are
collected and listed on one page horizontally, therefore, it is easy to review the variable names and copy and
In addition to displaying data structure, %CHKDATA can also check the distinct values and their frequency
counts for any specified variables. The following two examples demonstrate how %CHKDATA generates the
distinct values for any specified variables from the input datasets.
Example 4; display the distinct values of any specified variables (the output
%chkdata(lib=xxxx22d, data=medicati, report=cmed1_chkdata1, Check data from:xxxx22D.MEDICATI List unique values and frequency count of: STARTYYY List First 30 Observation Only Obs STARTYYY COUNT 1 04 1 2 1974 1 3 1976 1 4 1978 1 5 1980 2 6 1982 1 7 1984 5 8 1985 2 9 1986 2 10 NA 4 11 UK 15
The distinct values and their frequency counts of variable: STARTYYY are generated and listed as above.
Some strange values of STARTYYY, such as: ‘04’, ‘NA’, ‘UK’, have been detected by %CHKDATA. The
information is very useful as it reminds people to pay attention to those strange values when working on the
dataset. Btw, if multiple variables are listed under the macro option ‘UNIVAR=’, the macro will generate
distinct values and frequency counts for each input variable and list them on separated pages. Again, it is
also able to apply this function on multiple datasets at the same time. We will discuss this special feature
Example 5; display the distinct values of any specified variables (the output
%chkdata(lib=xxxxpk09d, data=advs, report=vital2_chkdata1, univar=timing, idvar=visitno visitid period); Check data from: XXXXPK09D.ADVS List unique values and frequency count of: TIMING, sorted by ID variables: VISITNO VISITID PERIOD Obs VISITNO VISITID PERIOD TIMING COUNT 1 -2.00 SCREENING 999 -99.00 198 2 -1.00 TREATMENT PERIOD 1 DAY -1 999 -99.00 63 3 1.00 TREATMENT PERIOD 1 DAY 1 1 0.00 63 4 1.00 TREATMENT PERIOD 1 DAY 1 1 0.08 63 5 1.00 TREATMENT PERIOD 1 DAY 1 1 0.50 63 6 1.00 TREATMENT PERIOD 1 DAY 1 1 2.00 63 7 1.00 TREATMENT PERIOD 1 DAY 1 1 6.00 63 8 1.00 TREATMENT PERIOD 1 DAY 1 1 12.00 63 9 2.00 TREATMENT PERIOD 1 DAY 2 1 24.00 63 10 3.00 TREATMENT PERIOD 1 DAY 3 1 0.00 63 11 3.00 TREATMENT PERIOD 1 DAY 3 1 0.08 63 12 3.00 TREATMENT PERIOD 1 DAY 3 1 0.50 63 13 3.00 TREATMENT PERIOD 1 DAY 3 1 2.00 63 14 3.00 TREATMENT PERIOD 1 DAY 3 1 6.00 63 15 3.00 TREATMENT PERIOD 1 DAY 3 1 12.00 63 16 4.00 TREATMENT PERIOD 1 DAY 4 1 24.00 63 17 5.00 TREATMENT PERIOD 1 DAY 5 1 48.00 63 18 6.00 TREATMENT PERIOD 2 DAY -1 1 -99.00 57 19 7.00 TREATMENT PERIOD 2 DAY 1 2 0.00 57 20 7.00 TREATMENT PERIOD 2 DAY 1 2 0.08 57 21 7.00 TREATMENT PERIOD 2 DAY 1 2 0.50 57 22 7.00 TREATMENT PERIOD 2 DAY 1 2 2.00 57 23 7.00 TREATMENT PERIOD 2 DAY 1 2 6.00 57 24 7.00 TREATMENT PERIOD 2 DAY 1 2 12.00 57 25 8.00 TREATMENT PERIOD 2 DAY 2 2 -99.00 12 26 8.00 TREATMENT PERIOD 2 DAY 2 2 24.00 57 27 9.00 TREATMENT PERIOD 2 DAY 3 2 0.00 57 28 9.00 TREATMENT PERIOD 2 DAY 3 2 0.08 57 29 9.00 TREATMENT PERIOD 2 DAY 3 2 0.50 57 30 9.00 TREATMENT PERIOD 2 DAY 3 2 2.00 57 31 9.00 TREATMENT PERIOD 2 DAY 3 2 6.00 57 32 9.00 TREATMENT PERIOD 2 DAY 3 2 12.00 57 33 10.00 TREATMENT PERIOD 2 DAY 4 2 24.00 57 34 11.00 TREATMENT PERIOD 2 DAY 5 2 48.00 57
In this case, the variable TIMING is first sorted by three ID variables: VISITNO, VISITID, and PERIOD. Then
its distinct values and frequency counts, within sorted variables, are generated and listed as above.
USING %CHKDATA TO DEFINE AND REPORT POTENTIAL DATA ISSUES:
We just discussed in detail how %CHKDATA can check the data structures and data values. In the next
paragraph, we will discuss how %CHKDATA can check for any potential data issues. We use the following
Example 6, check data and report potential data issue.
%chkdata (lib=xxxx22d, data=medicati, report=cmed1_issue1, listobs=15, idvar=MEDIC_TR dose dose1, fmtvar=MEDIC_TR, issue=%str(which variable to use as the correct CMED dosage: dose or Report data issue: which variable to use as the correct CMED dosage: dose or dose1? Reported by SAS user: JHHUANG, through SAS program: TEST_CHKDATA.SAS, on date: 2008-09-26 at time: 15:22 Check data from: XXXX22D.MEDICATI List first 30 observation only Obs MEDIC_TR DOSE dose1 1 ENAP 10.00000 200 2 BECLAZONE 250.00000 200 3 FLIXOTIDE 250.00000 100 4 MONOPRIL 10.00000 100 5 VERAPAMIL 80.00000 100 6 ZINNAT 500.00000 100 7 FLIXOTIDE 250.00000 . 8 PRESTARIUM 4.00000 . 9 OMEZ 20.00000 50 10 BECLAZONE 250.00000 50 11 URSOSAN 500.00000 18 12 ENALAPRIL 10.00000 25 13 BECOTIDE 300.00000 25 14 VERAPAMIL 80.00000 400 15 CAPOTEN 12.50000 400
In this case, two variables: DOSE, DOSE1 are found from the same input dataset: MEDICATI. These two
variables have the similar name and label, it is hard to tell which one represents the real dosage of
concomitant medication (MEDIC_TR). The potential data issue is defined as: ‘which variable to use as the
correct CMED dosage: dose or dose1?’. The macro defines this potential data issue in the where statement:
‘where= (dose1 ^=dose)’ and then generates corresponding output as above. This report is saved as a
permanent list file by the macro option ‘report=cmed1_issue1’. Later, the report can be sent to the
corresponding group, i.e. data management group, for further review or data cleaning. In addition, by
assigning macro option ‘fmtvar=MEDIC_TR’, it removes its associated format of variable ‘MEDIC_TR’, so
the value of MEDIC_TR can be printed and fit on one page.
SPECIAL FEATURE, %CHKDATA CHECKS MULTIPLE DATASETS AT THE SAME TIME:
%CHKDATA has one special feature; it can deal with multiple input datasets, from different studies, at the
same time. Below are examples of this important feature.
Example 7; display the contents of multiple datasets from different studies.
%chkdata(lib=xxxx22d xxxx30d xxxx31d, data=adcm medicati, content=yes, Reported by SAS user: JHHUANG, through SAS program: TEST_CHKDATA.SAS, on date: 2008-09-26 at time: 15:21 List of the contents of each input dataset side-by-side Variable Attrib XXXX22D XXXX30D XXXX31D Names -utes MEDICATI ADCM ADCM ----------------------------------------------------------------------------------------------------- ATC_TEXT Type 2 2 2 Length 200 200 200 Label ATC Text ATC Text ATC Text Format $ $ $ BATCHNO Type 1 Length 8 Label Batch number Format CAS Type 2 2 Length 10 10 Label CAS number CAS number Format $ $
The contents of two datasets: ADDM and MEDICATI, from three different studies:XXXX22D XXXX30D
XXXX31D, are generated and listed on one page, side-by-side. The output allows people to review and
compare the contents of all input datasets in an easy and quick way. This special feature is very helpful for
people who work on the data integration or anything dealing with multiple datasets at the same time. It is
worth to mention here that every input dataset, i.e. ADDM and MEDICATI, is not necessary to be existed in
each input study. %CHKDATA can automatically detect which dataset exists, and then list the contents for
those datasets that exist. For example, in this case, MEDICATI exists only in XXXX22D, and ADCM exists in
the other two studies: XXXX30D, XXXX31D.
Example 8; display SAS format for multiple studies (list 3 studies per page).
%chkdata (lib=xxxx21d xxxx24d xxxx25d xxxxpk09d xxxx30d xxxx31d, printfmt=yes, ncolpage=3, report=prof2_format2); Reported by SAS user: JHHUANG, through SAS program: TEST_CHKDATA.SAS, on date: 2008-09-26 at time: 15:21 List of SAS format of each input study side-by-side ________________________________________________________________________________________________________________________________ xxxx21d xxxx24d xxxx25d Format _______________________________________ _______________________________________ _______________________________________ Names Start End Label Start End Label Start End Label -------------------------------------------------------------------------------------------------------------------------------- ACCEPTS 1 1 YES, ENTERELY ACCEPTABLE 1 1 YES, ENTERELY ACCEPTABLE 1 1 YES, ENTERELY ACCEPTABLE 2 2 YES, SOMEWHAT ACCEPTABLE 2 2 YES, SOMEWHAT ACCEPTABLE 2 2 YES, SOMEWHAT ACCEPTABLE 3 3 UNCERTAIN 3 3 UNCERTAIN 3 3 UNCERTAIN 4 4 NO, SOMEWHAT UNACCEPTABLE 4 4 NO, SOMEWHAT UNACCEPTABLE 4 4 NO, SOMEWHAT UNACCEPTABLE 5 5 NO, ENTERELY UNACCEPTABLE 5 5 NO, ENTERELY UNACCEPTABLE 5 5 NO, ENTERELY UNACCEPTABLE 98 98 N.D. 98 98 N.D. 98 98 N.D. 99 99 N.A. 99 99 N.A. 99 99 N.A. ACTIO1S 1 1 None 1 1 None 1 1 None 2 2 Dosage reduced 2 2 Dosage reduced 2 2 Dosage reduced 3 3 Interrupted 3 3 Interrupted 3 3 Interrupted 4 4 Discontinued (permanently) 4 4 Discontinued (permanently) 4 4 Discontinued (permanently) 5 5 Dosage Increased 5 5 Dosage Increased 5 5 Dosage Increased 6 6 Dose not changed 6 6 Dose not changed 6 6 Dose not changed 7 7 Dose reduced 7 7 Dose reduced 7 7 Dose reduced 8 8 Drug withdrawn 8 8 Drug withdrawn 8 8 Drug withdrawn 9 9 Dose increased 9 9 Dose increased 9 9 Dose increased 98 98 N.D. 98 98 N.D. 98 98 N.D. 99 99 N.A. 99 99 N.A. 99 99 N.A. Reported by SAS user: JHHUANG, through SAS program: TEST_CHKDATA.SAS, on date: 2008-09-26 at time: 15:21 List of SAS format of each input study side-by-side ________________________________________________________________________________________________________________________________ xxxxpk09d xxxx30d xxxx31 Format _______________________________________ _______________________________________ _______________________________________ Names Start End Label Start End Label Start End Label -------------------------------------------------------------------------------------------------------------------------------- ACCEPTS 1 1 YES, ENTIRELY ACCEPTABLE 1 1 YES, ENTERELY ACCEPTABLE 1 1 YES, ENTERELY ACCEPTABLE 2 2 YES, SOMEWHAT ACCEPTABLE 2 2 YES, SOMEWHAT ACCEPTABLE 2 2 YES, SOMEWHAT ACCEPTABLE 3 3 UNCERTAIN 3 3 UNCERTAIN 3 3 UNCERTAIN 4 4 NO, SOMEWHAT UNACCEPTABLE 4 4 NO, SOMEWHAT UNACCEPTABLE 4 4 NO, SOMEWHAT UNACCEPTABLE 5 5 NO, ENTIRELY UNACCEPTABLE 5 5 NO, ENTERELY UNACCEPTABLE 5 5 NO, ENTERELY UNACCEPTABLE 98 98 N.D. 98 98 N.D. 98 98 N.D. 99 99 N.A. 99 99 N.A. 99 99 N.A.
%CHKDATA generates SAS formats for six studies at one time and lists them side-by-side. The macro
option of ‘ncolpage=3’ (stands for ‘number of columns listed per page’) requests that each page lists SAS
SUMMARY:
In summary, %CHKDATA is a very useful tool to check data efficiently. It can check both the data structures
and distinct data values; therefore, people can get start quickly on their work. In addition, the macro can also
define potential data issues detected from input datasets, and report these issues properly to the data
management team. It helps staff to continually keep the data clean and correct. Finally, the macro can check
multiple datasets, even from different studies, at the same time, and summarize the information in one
report. This special feature is especially helpful for people who work on data integration or anything dealing
with multiple datasets at the same time.
%CHKDATA has its limitation as well. Once the macro defines a potential data issue, it generates the report
and saves it into a separated file. It will be better if the macro can automatically concatenate all correlated
data issues together, and save them into one big final report.
CONCLUSION:
%CHKDATA macro is simple and practical. It will be a very useful tool for people who want to review and
CONTACT INFORMATION:
Your comments and questions are valued and encouraged. Please contact the author at:
SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of
SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product
names are registered trademarks or trademarks of their respective companies
La Voz de Galicia. El diario más leído de Galicia gracias a la participa. http://www.lavozdegalicia.es/SSEE/print.jsp?idContenido=00031281. Viernes 13 de agosto del 2010 Rimonabant, un medicamentopara perder peso, acarreaproblemas psiquiátricosUna investigación revela que la medicina aumenta el riesgo dedesarrollar graves problemas psiquiátricos, y que en algunospacientes se dieron vario
Dean DeRoberts, M.D. Wake Forest University Baptist Medical Center Intern, Resident, and Chief Resident Plastic and Reconstructive Surgery State University of New York Health Science Center at Syracuse University of Buffalo Cornell University Bachelor of Science, Biology - Anatomy and Animal Physiology The American Board of Plastic Surgery Board Certified Plastic Surgeon