Statistics with Stata
Updated for Version 9

Lawrence C. Hamilton
University of New Hampshire
THOMSON * BROOKS/COLE
Statistics with Stata: Updated for Version 9
Lawrence C. Hamilton

Publisher: Curt Hinrichs
Senior Assistant Editor: Ann Day
Editorial Assistant: Daniel Geller
Technology Project Manager: Fiona Chong
Marketing Manager: Joe Rogove
Marketing Assistant: Brian Smith
Executive Marketing Communications Manager: Darlene Amidon-Brent
Project Manager, Editorial Production: Kelsey McGee
Creative Director: Rob Hugel
Art Director: Lee Friedman
Print Buyer: Darlene Suruki
Permissions Editor: Kiely Sisk
Cover Designer: Denise Davidson/Simple Design
Cover Image: © Imtek Imagineering/Masterfile
Cover Printing, Printing & Binding: Webcom Limited

© 2006 Duxbury, an imprint of Thomson Brooks/Cole, a part of The Thomson Corporation.
Thomson, the Star logo, and Brooks/Cole are trademarks used herein under license.

ALL RIGHTS RESERVED. No part of this work covered by the copyright hereon may be
reproduced or used in any form or by any means (graphic, electronic, or mechanical,
including photocopying, recording, taping, web distribution, information storage and
retrieval systems, or in any other manner) without the written permission of the publisher.

Printed in Canada
1 2 3 4 5 6 7   09 08 07 06 05

Thomson Higher Education
10 Davis Drive
Belmont, CA 94002-3098
USA
Asia (including India)
Thomson Learning
5 Shenton Way
#01-01 UIC Building
Singapore 068808
Australia/New Zealand
Thomson Learning Australia
102 Dodds Street
Southbank, Victoria 3006
Australia
Canada
Thomson Nelson
1120 Birchmount Road
Toronto, Ontario M1K 5G4
Canada

Europe/Middle East/Africa
Thomson Learning
High Holborn House
50/51 Bedford Row
London WC1R 4LR
United Kingdom
Contents
Preface    ix

1   Stata and Stata Resources    1
    A Typographical Note    1
    An Example Stata Session    2
    Stata's Documentation and Help Files    7
    Searching for Information    8
    Stata Corporation    9
    Statalist    10
    The Stata Journal    10
    Books Using Stata    11

2   Data Management    12
    Example Commands    13
    Creating a New Dataset    15
    Specifying Subsets of the Data: in and if Qualifiers    19
    Generating and Replacing Variables    23
    Using Functions    27
    Converting between Numeric and String Formats    32
    Creating New Categorical and Ordinal Variables    35
    Using Explicit Subscripts with Variables    38
    Importing Data from Other Programs    39
    Combining Two or More Stata Files    42
    Transposing, Reshaping, or Collapsing Data    47
    Weighting Observations    54
    Creating Random Data and Random Samples    56
    Writing Programs for Data Management    60
    Managing Memory    61

3   Graphs    64
    Example Commands    65
    Histograms    67
    Scatterplots    72
    Line Plots    77
    Connected-Line Plots    83
    Other Twoway Plot Types    84
    Box Plots    90
    Pie Charts    92
    Bar Charts    94
    Dot Plots    99
    Symmetry and Quantile Plots    100
    Quality Control Graphs    105
    Adding Text to Graphs    109
    Overlaying Multiple Twoway Plots    110
    Graphing with Do-Files    115
    Retrieving and Combining Graphs    116

4   Summary Statistics and Tables    120
    Example Commands    120
    Summary Statistics for Measurement Variables    121
    Exploratory Data Analysis    124
    Normality Tests and Transformations    126
    Frequency Tables and Two-Way Cross-Tabulations    130
    Multiple Tables and Multi-Way Cross-Tabulations    133
    Tables of Means, Medians, and Other Summary Statistics    136
    Using Frequency Weights    138

5   ANOVA and Other Comparison Methods    141
    Example Commands    142
    One-Sample Tests    143
    Two-Sample Tests    146
    One-Way Analysis of Variance (ANOVA)    149
    Two- and N-Way Analysis of Variance    152
    Analysis of Covariance (ANCOVA)    153
    Predicted Values and Error-Bar Charts    155

6   Linear Regression Analysis    159
    Example Commands    159
    The Regression Table    162
    Multiple Regression    164
    Predicted Values and Residuals    165
    Basic Graphs for Regression    168
    Correlations    171
    Hypothesis Tests    175
    Dummy Variables    176
    Automatic Categorical-Variable Indicators and Interactions    183
    Stepwise Regression    186
    Polynomial Regression    188
    Panel Data    191

7   Regression Diagnostics    196
    Example Commands    196
    SAT Score Regression, Revisited    198
    Diagnostic Plots    200
    Diagnostic Case Statistics    205
    Multicollinearity    210

8   Fitting Curves    215
    Example Commands    215
    Band Regression    217
    Lowess Smoothing    219
    Regression with Transformed Variables - 1    223
    Regression with Transformed Variables - 2    227
    Conditional Effect Plots    230
    Nonlinear Regression - 1    232
    Nonlinear Regression - 2    235

9   Robust Regression    239
    Example Commands    240
    Regression with Ideal Data    240
    Y Outliers    243
    X Outliers (Leverage)    246
    Asymmetrical Error Distributions    248
    Robust Analysis of Variance    249
    Further rreg and qreg Applications    255
    Robust Estimates of Variance - 1    256
    Robust Estimates of Variance - 2    258

10  Logistic Regression    262
    Example Commands    263
    Space Shuttle Data    265
    Using Logistic Regression    270
    Conditional Effect Plots    273
    Diagnostic Statistics and Plots    274
    Logistic Regression with Ordered-Category y    278
    Multinomial Logistic Regression    280

11  Survival and Event-Count Models    288
    Example Commands    289
    Survival-Time Data    291
    Count-Time Data    293
    Kaplan-Meier Survivor Functions    295
    Cox Proportional Hazard Models    299
    Exponential and Weibull Regression    305
    Poisson Regression    309
    Generalized Linear Models    313

12  Principal Components, Factor, and Cluster Analysis    318
    Example Commands    319
    Principal Components    320
    Rotation    322
    Factor Scores    323
    Principal Factoring    326
    Maximum-Likelihood Factoring    327
    Cluster Analysis - 1    329
    Cluster Analysis - 2    333

13  Time Series Analysis    339
    Example Commands    339
    Smoothing    341
    Further Time Plot Examples    346
    Lags, Leads, and Differences    349
    Correlograms    351
    ARIMA Models    354

14  Introduction to Programming    361
    Basic Concepts and Tools    361
    Example Program: Moving Autocorrelation    369
    Ado-File    373
    Help File    375
    Matrix Algebra    378
    Bootstrapping    382
    Monte Carlo Simulation    387

References    395

Index    401
1
Stata and Stata Resources

Stata is a full-featured statistical program for Windows, Macintosh, and Unix computers. It
combines ease of use with speed, a library of pre-programmed analytical and data-management
capabilities, and programmability that allows users to invent and add further capabilities as
needed. Most operations can be accomplished either via the pull-down menu system or more
directly via typed commands. Menus help newcomers to learn Stata, and help anyone to apply
an unfamiliar procedure. The consistent, intuitive syntax of Stata commands frees experienced
users to work more efficiently, and also makes it straightforward to develop programs for
complex or repetitious tasks. Menu and command instructions can be mixed as needed during
a Stata session. Extensive help, search, and link features make it easy to look up command
syntax and other information instantly, on the fly.
After introductory information, we'll begin with an example Stata session to give you a
sense of the flow of data analysis, and of how analytical results might be used. Later chapters
look at the commands and their output in more detail. Even without explanations, however,
you can see how straightforward the commands are: use filename to retrieve dataset
filename.dta, summarize when you want summary statistics, correlate to get a
correlation matrix, and so forth. Alternatively, the same results can be obtained by making
choices from the Data or Statistics menus.

Stata users have available a variety of resources to help them learn about Stata and solve
problems at any level of difficulty. These resources come not just from Stata Corporation but
also from a worldwide community of users. Sections of this chapter introduce some key
resources: Stata's printed documentation; where to phone, fax, write, or e-mail for technical
help; Stata's web site (www.stata.com), which provides many services including updates and
answers to frequently asked questions; the Statalist Internet forum; and the refereed
Stata Journal.
A Typographical Note
This book employs several typographical conventions as a visual cue to how words are used:
■   Commands typed by the user appear in a bold Courier font. When the whole command
    line is given, it starts with a period, as seen in a Stata Results window or log file:

    . list year boats men penalty

    Variable or file names within these commands appear in italics to emphasize the
    fact that they are arbitrary and not a fixed part of the command.
■
Names of variables or files also appear in italics within the main text to distinguish them
from ordinary words.
■   Items from Stata's menus are shown in an Arial font, with successive options separated by
    a dash. For example, we can open an existing dataset by selecting File - Open, and then
    finding and clicking on the name of the particular dataset. Note that some common menu
    actions can be accomplished either with text choices from Stata's top menu bar,

    File   Edit   Prefs   Data   Graphics   Statistics   User   Window   Help

    or with the row of icons below these. For example, selecting File - Open is equivalent to
    clicking the leftmost icon, an opening file folder. One could also accomplish the same
    thing by typing a direct command of the form

    use filename
■   Stata output as seen in the Results window is shown in a small Courier font. The small
    font allows Stata's 80-column output to fit within the margins of this book. Thus, we
    show the calculation of summary statistics for a variable named penalty as follows:

    . summarize penalty
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     penalty |        10          63    59.59493         11        183
These typographic conventions exist only in this book, and not within the Stata program
itself. Stata can display a variety of onscreen fonts, but it does not use italics in commands.
Once Stata log files have been imported into a word processor, or a results table copied and
pasted, you might want to format them in a Courier font, 10 point or smaller, so that columns
will line up correctly.
In its commands and variable names, Stata is case sensitive. Thus, summarize is a
command, but Summarize and SUMMARIZE are not. Penalty and penalty would be two
different variables.
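As a quick check of this, typing the command in the wrong case produces an error rather than
a summary table (the output below is a sketch; the exact wording of the error message can
vary slightly across Stata versions), while the lowercase form works as usual:

    . SUMMARIZE penalty
    unrecognized command:  SUMMARIZE
    r(199);

    . summarize penalty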
An Example Stata Session
As a preview showing Stata at work, this section retrieves and analyzes a previously-created
dataset named lofoten.dta. Jentoft and Kristoffersen (1989) originally published these data in
an article about self-management among fishermen on Norway's arctic Lofoten Islands. There
are 10 observations (years) and 5 variables, including penalty, a count of how many fishermen
were cited each year for violating fisheries regulations.
If we might eventually want a record of our session, the best way to prepare for this is by
opening a "log file" at the start. Log files contain commands and results tables, but not graphs.
To begin a log file, click the scroll-shaped Begin Log icon, and specify a name and folder
for the resulting log file. Alternatively, a log file could be started by choosing File - Log -
Begin from the top menu bar, or by typing a direct command such as

. log using monday1
Multiple ways of doing such things are common in Stata. Each has its own advantages, and
each suits different situations or user tastes.
Log files can be created either in a special Stata format (.smcl), or in ordinary text or ASCII
format (.log). A .smcl ("Stata markup and control language") file will be nicely formatted for
viewing or printing within Stata. It could also contain hyperlinks that help to understand
commands or error messages. .log (text) files lack such formatting, but are simpler to use if you
plan later to insert or edit the output in a word processor. After selecting which type of log file
you want, click Save. For this session, we will create a .smcl log file named monday1.smcl.
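If we know in advance that we want plain-text output, the log can also be started in .log format
directly from the command line; a command along these lines (using the same hypothetical
file name) should work:

. log using monday1, text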
An existing Stata-format dataset named lofoten.dta will be analyzed here. To open or
retrieve this dataset, we again have several options:

    select File - Open - lofoten.dta using the top menu bar;
    click the open-file icon and select lofoten.dta; or
    type the command  use lofoten .

Under its default Windows configuration, Stata looks for data files in folder C:\data. If the file
we want is in a different folder, we could specify its location in the use command,

. use c:\books\sws8\chapter01\lofoten

or change the session's default folder by issuing a cd (change directory) command:

. cd c:\books\sws8\chapter01\
. use lofoten

Often, the simplest way to retrieve a file will be to choose File - Open and browse through
folders in the usual way.
To see a brief description of the dataset now in memory, type
. describe
Contains data from C:\data\lofoten.dta
  obs:            10                          Jentoft & Kristoffersen '89
 vars:             5                          30 Jun 2005 10:36
 size:           130 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
--------------------------------------------------------------------------
year            int    %9.0g                  Year
boats           int    %9.0g                  Number of fishing boats
men             int    %9.0g                  Number of fishermen
penalty         int    %9.0g                  Number of penalties
decade          byte   %9.0g       decade     Early 1970s or early 1980s
--------------------------------------------------------------------------
Sorted by:  year
Many Stata commands can be abbreviated to their first few letters. For example, we could
shorten describe to just the letter d. Using menus, the same table could be obtained by
choosing Data - Describe data - Describe variables in memory - OK.
This dataset has only 10 observations and 5 variables, so we can easily list its contents by
typing the command list (or the letter l; or Data - Describe data - List data - OK):
. list

     +-------------------------------------------+
     | year   boats    men   penalty   decade    |
     |-------------------------------------------|
  1. | 1971    1809   5281        71    1970s    |
  2. | 1972    2017   6304       152    1970s    |
  3. | 1973    2068   6794       183    1970s    |
  4. | 1974    1693   5227        39    1970s    |
  5. | 1975    1441   4077        36    1970s    |
     |-------------------------------------------|
  6. | 1981    1540   4033        11    1980s    |
  7. | 1982    1689   4267        15    1980s    |
  8. | 1983    1842   4430        34    1980s    |
  9. | 1984    1847   4622        74    1980s    |
 10. | 1985    1365   3514        15    1980s    |
     +-------------------------------------------+
Analysis could begin with a table of means, standard deviations, minimum values and
maximum values (type summarize or su; or select Statistics - Summaries, tables, & tests -
Summary statistics - Summary statistics - OK):
. summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        year |        10        1978    5.477226       1971       1985
       boats |        10      1731.1    232.1328       1365       2068
         men |        10      4854.9    1045.577       3514       6794
     penalty |        10          63    59.59493         11        183
      decade |        10          .5    .5270463          0          1
To print results from the session so far, bring the Results window to the front by clicking
on this window or on the Bring Results Window to Front icon, and then click the Print icon.

To copy a table, commands, or other information from the Results window into a word
processor document, first make sure the Results window is in front by clicking on this window
or its icon. Drag the mouse to select the results you want, right-click the mouse, and then choose
Copy Text from the mouse's menu. Finally, switch to your word processor and, at the desired
insertion point, choose Edit - Paste or click a "clipboard" icon on the word processor's
menu bar.
Tabulating penalty by decade shows that there were more penalties in the 1970s:

. tabulate decade, sum(penalty)

Early 1970s |
   or early |   Summary of Number of penalties
      1980s |        Mean   Std. Dev.       Freq.
------------+------------------------------------
      1970s |        96.2    67.41439           5
      1980s |        29.8   26.281172           5
------------+------------------------------------
      Total |          63   59.594929          10
The same table could be obtained through menus: Statistics - Summaries, tables, & tests -
Tables - One/two-way table of summary statistics, then fill in decade as variable 1, and
penalty as the variable to be summarized. Although menu choices are often straightforward to
use, you can see that they tend to be more complicated to describe than the simple text
commands. From this point on, we will focus primarily on the commands, mentioning menu
alternatives only occasionally. Fully exploring the menus, and working out how to use them
to accomplish the same things, will be left to the reader. For similar reasons, the Stata reference
manuals likewise take a command-based approach.
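Yet another command-based route to roughly the same summary, sketched here as an
illustration rather than taken from the original session, is the table command with a
contents() option listing the statistics wanted:

. table decade, contents(mean penalty sd penalty freq)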
Perhaps the number of penalties declined because fewer people were fishing in the 1980s.
The number of penalties correlates strongly (r > .8) with the number of boats and fishermen:
. correlate boats men penalty
(obs=10)

             |    boats      men  penalty
-------------+---------------------------
       boats |   1.0000
         men |   0.8748   1.0000
     penalty |   0.8259   0.9312   1.0000
A graph might help clarify these interrelationships. Figure 1.1 plots men and penalty
against year. It was produced by the graph twoway connected command. We first ask
for a connected-line plot of men against year, using the left-hand y axis, yaxis(1). After
the separator || , we next ask for a connected-line plot of penalty against year, this time
using the right-hand y axis, yaxis(2). The resulting graph visualizes the correspondence
between the number of fishermen and the number of penalties over time.

. graph twoway connected men year, yaxis(1)
     || connected penalty year, yaxis(2)
[Figure 1.1: connected-line plots of Number of fishermen (left y axis) and Number of
penalties (right y axis) against Year]
Because the years 1976 to 1980 are missing in these data, Figure 1.1 shows 1975 connected
to 1981. For some purposes, we might hesitate to do this. Instead, we could either find the
missing values or leave those segments unconnected by issuing a slightly more complicated
set of commands.
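One such approach, sketched here under the assumption that we simply want the line broken
at the gap, is to overlay separate connected plots for the 1970s and the 1980s so that no
segment spans the missing years:

. graph twoway connected men year if year < 1980, yaxis(1)
     || connected men year if year > 1980, yaxis(1)
     || connected penalty year if year < 1980, yaxis(2)
     || connected penalty year if year > 1980, yaxis(2)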
To print this graph, click on the Graph window or on the Bring Graph Window to Front icon,
and then click the Print icon.

To copy the graph directly into a word processor or other document, bring the Graph
window to the front, right-click on the graph, and select Copy. Switch to your word processor,
go to the desired insertion point, and issue an appropriate "paste" command such as Edit -
Paste or Edit - Paste Special (Metafile), or click a "clipboard" icon (different word processors
will handle this differently).
To save the graph for future use, either right-click and Save, or select File - Save Graph
from the top menu bar. The Save As Type submenu offers several different file formats to
choose from. On a Windows system, the choices include

    Stata graph (*.gph)
        (A "live" graph, containing enough information for Stata to edit.)
    As-is graph (*.gph)
        (A more compact Stata graph format.)
    Windows Metafile (*.wmf)
    Enhanced Metafile (*.emf)
    Portable Network Graphics (*.png)
    TIFF (*.tif)
    PostScript (*.ps)
    Encapsulated PostScript with TIFF preview (*.eps)
    Encapsulated PostScript (*.eps)
Regardless of which graphics format we want, it might be worthwhile also to save a copy of our
graph in “live” .gph format. Live .gph graphs can later be retrieved, combined, recolored or
reformatted using the graph use or graph combine commands (Chapter 3).
Instead of using menus, graphs can be saved by adding a saving(filename) option
to any graph command. To save a graph with the filename figure1.gph, add another
separator || , a comma, and saving(figure1). Chapter 3 explains more about the logic
of graph commands. The complete command now contains the following (typed in the Stata
Command window with as many spaces as you want, but no hard returns):

. graph twoway connected men year, yaxis(1)
     || connected penalty year, yaxis(2)
     || , saving(figure1)
Through all of the preceding analyses, the log file monday1.smcl has been storing our
results. There are several possible ways to review this file to see what we have done:

    File - Log - View - OK
    clicking the Log button - View snapshot of log file - OK
    typing the command  view monday1.smcl

We could print the log file by choosing Print. Log files close automatically at the end
of a Stata session, or earlier if instructed by one of the following:
File - Log - Close
clicking the Log button - Close log file - OK
typing the command log close
Once closed, the file monday1.smcl could be opened again through File - View during a
subsequent Stata session. To make an output file that can be opened easily by your word
processor, either translate the log file from .smcl (a Stata format) to .log (standard ASCII text
format) by typing

. translate monday1.smcl monday1.log

or start out by creating the file in .log instead of .smcl format.
Stata’s Documentation and Help Files
The complete Stata 9 Documentation Set includes over 6,000 pages in 15 volumes: a slim
Getting Started manual (for example, Getting Started with Stata for Windows), the more
extensive User’s Guide, the encyclopedic three-volume Base Reference Manual, and separate
reference manuals on data management, graphics, longitudinal and panel data, matrix
programming (Mata), multivariate statistics, programming, survey data, survival analysis and
epidemiological tables, and time series analysis. Getting Started helps you do just that, with
the basics of installation, window management, data entry, printing, and so on. The User's
Guide contains an extended discussion of general topics, including resources and
troubleshooting. Of particular note for new users is the User's Guide section on “Commands
everyone should know.” The Base Reference Manual lists all Stata commands alphabetically.
Entries for each command include the full command syntax, descriptions of all available
options, examples, technical notes regarding formulas and rationale, and references for further
reading. Data management, graphics, panel data, etc. are covered in the general references, but
these complicated topics get more detailed treatment and examples in their own specialized
manuals. A Quick Reference and Index volume rounds out the whole collection.
When we are in the midst of a Stata session, it is often simpler to ask for onscreen help
instead of consulting the manuals. Selecting Help from the top menu bar invokes a drop-down
menu of further choices, including help on specific commands, general topics, online updates,
the Stata Journal, or connections to Stata’s web site (www.stata.com). Alternatively, we can
bring the Viewer to the front and use its Search or Contents features to find information.
We can also use the help command. Typing help correlate , for example, causes
help information to appear in a Viewer window. Like the reference manuals, onscreen help
provides command syntax diagrams and complete lists of options. It also includes some
examples, although often less detailed and without the technical discussions found in the
manuals. The Viewer help has several advantages over the manuals, however. It can search
for keywords in the documentation or on Stata’s web site. Hypertext links take you directly to
related entries. Onscreen help can also include material about recent updates, or the
“unofficial” Stata programs that you have downloaded from Stata’s web site or from other
users.
Searching for Information

Selecting Help - Search - Search documentation and FAQs provides a direct way to search for
information in Stata's documentation or in the web site's FAQs (frequently asked questions)
and other pages. The equivalent Stata command is
. search keywords
Options available with search allow us to limit our search to the documentation and FAQs,
to net resources including the Stata Journal, or to both. For example,
. search median regression
will search the documentation and FAQs for information indexed by both keywords, “median”
and "regression." To search for these keywords across Stata's Internet resources in addition
to the documentation and FAQs, type
. search median regression, all
Search results in the Viewer window contain clickable hyperlinks leadingto further information
or original citations.
One specialized use for the search command is to provide more information on those
occasions when our command does not succeed as planned, but instead results in one of Stata’s
cryptic numerical error messages. For example, typing the one-word command table
produces the error or "return code" r(100):

. table
varlist required
r(100);

The table command evidently requires a list of variables. Often, however, the meaning of
an error message is less obvious. To learn more about what return code r(100) refers to, type

. search rc 100
Keyword search

        Keywords:  rc 100
          Search:  (1) Official help files, FAQs, Examples, SJs, and STBs

Search of official help files, FAQs, Examples, SJs, and STBs

[P]     error . . . . . . . . . . . . . . . . . . . .  Return code 100
        varlist required;
        = exp required;
        using required;
        by() option required;
        Certain commands require a varlist or another element of the
        language.  The message specifies the required item that was
        missing from the command you gave.  See the command's syntax
        diagram.  For example, merge requires using be specified;
        perhaps you meant to type append.  Or, ranksum requires a by()
        option; see [R] signrank.

(end of search)
Type help search for more about this command.
Stata Corporation
For orders, licensing, and upgrade information, you can contact Stata Corporation by e-mail at

    stata@stata.com

or visit their web site at

    http://www.stata.com

Stata's extensive web site contains a wealth of user-support information and links to resources.
Stata Press also has its own web site, containing information about Stata publications including
the datasets used for examples:

    http://www.stata-press.com

Both web sites are well worth exploring.
The mailing or physical address is
Stata Corporation
4905 Lakeway Drive
College Station, TX 77845 USA

Telephone access includes an easy-to-remember 800 number.

    telephone:   1-800-STATAPC (1-800-782-8272)    U.S. and Canada
                 1-979-696-4600                    International
    fax:         1-800-248-8272                    U.S. and Canada
                 1-979-696-4601                    International
Online updates within major versions are free to licensed Stata users. These provide a fast
and simple way to obtain the latest enhancements, bug fixes, etc. for your current version. To
find out whether updates exist for your Stata, and to initiate the simple online update process
itself, type the command

. update query

Technical support can be obtained by sending e-mail messages with your Stata serial
number in the subject line to

    tech_support@stata.com

Before calling or writing for technical help, though, you might want to look at
www.stata.com to see whether your question is a FAQ. The site also provides product,
ordering, and help information; international notes; and assorted news and announcements.
Much attention is given to user support, including the following:
FAQs — Frequently asked questions and their answers. If you are puzzled by something and
can't find the answer in the manuals, check here next — it might be a FAQ. Example questions
range from basic — “How can I convert other packages’ files to Stata format data files?" to
more technical queries such as "How do I impose the restriction that rho is zero using the
heckman command with full ml?”
UPDATES — Frequent minor updates or bug fixes, downloadable at no cost by licensed Stata
users.
OTHER RESOURCES — Links and information including online Stata instruction
(NetCourses); enhancements from the Stata Journal; an independent listserver (Statalist) for
discussions among Stata users; a bookstore selling books about Stata and other up-to-date
statistical references; downloadable datasets and programs for Stata-related books; and links
to statistical web sites including Stata’s competitors.
The following sections describe some of the most important user-support resources.
Statalist
Statalist provides a valuable online forum for communication among active Stata users. It is
independent of Stata Corporation, although Stata programmers monitor it and often contribute
to the discussion. To subscribe to Statalist, send an e-mail message to
majordomo@hsphsun2.harvard.edu
The body of this message should contain only the following words:
subscribe statalist
The list processor will acknowledge your message and send instructions for using the list,
including how to post messages of your own. Any message sent to the following address goes
out to all current subscribers:
statalist@hsphsun2.harvard.edu
Do not try to subscribe or unsubscribe by sending messages directly to the statalist address.
This does not work, and your mistake goes to hundreds of subscribers. To unsubscribe from
the list, write to the same majordomo address you used to subscribe:
majordomo@hsphsun2.harvard.edu
but send only the message
unsubscribe statalist
or send the equivalent message
signoff statalist
If you plan to be traveling or offline for a while, unsubscribing will keep your mailbox from
tilling up with Statalist messages. You can always re-subscribe.
Searchable Statalist archives are available at
http://www.stata.com/statalist/archive
The material on Statalist includes requests for programs, solutions, or advice, as well as
answers and general discussion. Along with the Stata Journal (discussed below), Statalist plays
a major role in extending the capabilities both of Stata and of serious Stata users.
The Stata Journal
From 1991 through 2001, a bimonthly publication called the Stata Technical Bulletin (STB)
served as a means of distributing new commands and Stata updates, both user-written and
official. Accumulated STB articles were published in book form each year as Stata Technical
Bulletin Reprints, which can be ordered directly from Stata Corporation.
With the growth of the Internet, instant communication among users became possible
through vehicles such as Statalist. Program files could easily be downloaded from distant
sources. A bimonthly printed journal and disk no longer provided the best avenues either for
communicating among users, or for distributing updates and user-written programs. To adapt
to a changing world, the STB had to evolve into something new.
The Stata Journal was launched to meet this challenge and the needs of Stata's broadening
user community. It distributes user-written commands, along with unofficial commands written
by Stata Corporation employees. New commands are not its primary focus, however. The
Stata Journal also contains refereed expository articles about statistics, book reviews, and a
number of interesting columns, including "Speaking Stata" by Nicholas J. Cox, on effective use
of the Stata programming language. The Stata Journal is intended for novice as well as
experienced Stata users. For example, here are the contents from one recent issue:
    "Exploratory analysis of single nucleotide polymorphism (SNP) for
        quantitative traits"                                        M. A. Cleves
    "Value label utilities: labeldup and labelrename"               J. Weesie
    "Multilingual datasets"                                         J. Weesie
    "Multiple imputation of missing values: update"                 P. Royston
    "Estimation and testing of fixed-effect panel-data systems"     J. L. Blackwell, III
    "Data inspection using biplots"                                 U. Kohler & M. Luniak
    "Stata in space: Econometric analysis of spatially explicit
        raster data"                                                D. Muller
    "Using the file command to produce formatted output for other
        applications"                                               E. Slaymaker
    "Speaking Stata: Density probability plots"                     N. J. Cox
    Review of ". . . Linear, Logistic, Survival, and Repeated Measures Models"
The Stata Journal is published quarterly. Subscriptions can be purchased directly from
Stata Corporation by visiting www.stata.com.
Books Using Stata
In addition to Stata's own reference manuals, a growing library of books describe Stata, or use
it to illustrate analytical techniques. These books include general introductions; disciplinary
applications such as social science, biostatistics, or econometrics; and focused texts concerning
survey analysis, experimental data, categorical dependent variables, and other subjects. The
Bookstore pages on Stata's web site have up-to-date lists, with descriptions of contents:

    http://www.stata.com/bookstore/

These pages are a convenient place to learn about and order books related to Stata.
2
Data Management

To bring new data into Stata, we can type them into the Data Editor, read a raw-data (text)
file, or use a file-transfer program to translate the dataset directly from a system file created by
another spreadsheet, database, or statistical program. Once Stata has the data in memory, we
can save them in Stata format for easy retrieval and updating in the future.

Data management encompasses the initial tasks of creating a dataset, editing to correct
errors, and adding internal documentation such as variable and value labels. It also
encompasses many other jobs required by ongoing projects, such as adding new observations
or variables; reorganizing, simplifying, or sampling from the data; separating, combining, or
collapsing datasets; converting variable types; and creating new variables through algebraic or
logical expressions. When data-management tasks become complex or repetitive, Stata users
can write their own programs to automate the work. Although Stata is best known for its
analytical capabilities, it possesses a broad range of data-management features as well. This
chapter introduces some of the basics.
The User's Guide provides an overview of the different methods for inputting data,
followed by eight rules for determining which input method to use. Input, editing, and many
other operations discussed in this chapter can be accomplished through the Data menus. Data
menu subheadings refer to the general category of task:
Describe data
Data editor
Data browser (read-only editor)
Create or change variables
Sort
Combine datasets
Labels
Notes
Variable utilities
Matrices
Other utilities
Example Commands
append using olddata
Reads previously-saved dataset olddata.dta and adds all its observations to the data
currently in memory. Subsequently typing save newdata, replace will save the
combined dataset as newdata.dta.
browse
Opens the spreadsheet-like Data Browser for viewing the data. The Browser looks similar
to the Data Editor, but it has no editing capability, so there is no risk of inadvertently
changing your data. Alternatively, click ^|.
. browse boats men if year > 1980
Opens the Data Browser showing only the variables boats and men for observations in
which year is greater than 1980. This example illustrates the if qualifier, which can be
used to focus the operation of many Stata commands.
. compress
Automatically converts all variables to their most efficient storage types to conserve
memory and disk space. Subsequently typing the command save filename,
replace will make these changes permanent.
. drawnorm zl z2 z3, n(5000)
Creates an artificial dataset with 5,000 observations and three random variables, z1, z2, and
z3, sampled from uncorrelated standard normal distributions. Options could specify other
means, standard deviations, and correlation or covariance matrices.
. edit
Opens the spreadsheet-like Data Editor where data can be entered or edited. Alternatively,
choose Window - Data Editor or click
edit boats year men
Opens the Data Editor with only the variables boats, year, and men (in that order) visible
and available for editing.
. encode stringvar, gen(numvar)
Creates a new variable named numvar, with labeled numerical values based on the string
(non-numeric) variable stringvar.
.
format rainfall %8.2f
Establishes a fixed (f) display format for numeric variable rainfall: 8 columns wide, with
two digits always shown after the decimal.
. generate newvar = (x + y)/100
Creates a new variable named newvar, equal to the sum of x and y, divided by 100.
. generate newvar = uniform()
Creates a new variable with values sampled from a uniform random distribution over the
interval ranging from 0 to nearly 1, written [0,1).
•
infile x y z using data.raw
Reads an ASCII file named data.raw containing data on three variables: x, y, and z. The
values of these variables are separated by one or more white-space characters (blanks,
tabs, and newlines, whether carriage return, linefeed, or both) or by commas. With white-space
delimiters, missing values are represented by periods, not blanks. With comma-delimited
data, missing values are represented by a period or by two consecutive commas. Stata also
provides for extended missing values, which we will discuss later. Other commands are
better suited for reading tab-delimited, comma-delimited, or fixed-column raw data; type
help infiling for more information.
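For instance, a spreadsheet saved as a comma-separated text file with variable names in its
first row can usually be read with insheet; a minimal sketch, assuming a hypothetical file
named data.csv:

. insheet using data.csv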
list
Lists the data in default or “table” format. If the dataset contains many variables, table
format becomes hard to read, and list, display produces better results. See help
list for other options controlling the format of data lists.
.
list x y z in 5/20
Lists the x, y, and z values of the 5th through 20th observations, as the data are presently
sorted. The in qualifier works in similar fashion with most other Stata commands as
well.
. merge id using olddata
Reads the previously-saved dataset olddata.dta and matches observations from olddata
with observations in memory that have identical id values. Both olddata (the "using" data)
and the data currently in memory (the "master" data) must already be sorted by id.
.
replace oldvar = 100 * oldvar
Replaces the values of oldvar with 100 times their previous values.
sample 10
Drops all the observations in memory except for a 10% random sample. Instead of
selecting a certain percentage, we could select a certain number of cases. For example,
sample 55, count would drop all but a random sample of size n = 55.
.
save newfile
Saves the data currently in memory, as a file named newfile.dta. If newfile.dta already
exists, and you want to write over the previous version, type save newfile,
replace. Alternatively, use the menus: File - Save or File - Save As . To save
newfile.dta in the format of Stata version 7, type saveold newfile .
• set memory 24m
(Windows or Unix systems only) Allocates 24 megabytes of memory for Stata data. The
amount set could be greater or less than the current allocation. Virtual memory (disk space)
is used if the request exceeds physical memory. Type clear to drop the current data
from memory before using set memory.
.
sort x
Sorts the data from lowest to highest values of x. Observations with missing x values
appear last after sorting because Stata views missing values as very high numbers. Type
help gsort for a more general sorting command that can arrange values in either
ascending or descending order and can optionally place the missing values first.
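As an illustration (a sketch, not part of the original example list), placing a minus sign in front
of a variable name asks gsort for descending order:

. gsort -x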
.
tabulate x if y > 65
Produces a frequency table for x using only those observations that have y values above 65.
The if qualifier works similarly with most other Stata commands.
. use oldfile
Retrieves previously-saved Stata-format dataset oldfile.dta from disk, and places it in
memory. If other data are currently in memory, and you want to discard those data without
saving them, type use oldfile, clear . Alternatively, these tasks can be
accomplished through File - Open or by clicking the open-file icon.
Creating a New Dataset
To practice creating a new dataset, we will enter the data on Canadian provinces and territories
listed in Table 2.1. (From the Federal, Provincial and Territorial Advisory Committee on
Population Health, 1996. Canada's newest territory, Nunavut, is not listed here because it
was part of the Northwest Territories until 1999.)
Table 2.1:  Data on Canada and Its Provinces

                          1995 Pop.   Unemployment    Male Life    Female Life
Place                      (1000's)   Rate (percent)  Expectancy   Expectancy
-------------------------------------------------------------------------------
Canada                      29606.1        10.6           75.1         81.1
Newfoundland                  575.4        19.6           73.9         79.8
Prince Edward Island          136.1        19.1           74.8         81.3
Nova Scotia                   937.8        13.9           74.2         80.4
New Brunswick                 760.1        13.8           74.8         80.6
Quebec                       7334.2        13.2           74.5         81.2
Ontario                     11100.3         9.3           75.5         81.1
Manitoba                     1137.5         8.5           75.0         80.8
Saskatchewan                 1015.6         7.0           75.2         81.8
Alberta                      2747.0         8.4           75.5         81.4
British Columbia             3766.0         9.8           75.8         81.4
Yukon                          30.1                       71.3         80.4
Northwest Territories          65.8                       70.2         78.0
The simplest way to create a dataset from Table 2.1 is through Stata's spreadsheet-like Data
Editor, which is invoked either by clicking the Data Editor icon, selecting Window - Data
Editor from the top menu bar, or typing the command edit. Then begin typing values for
each variable in columns that Stata automatically calls var1, var2, etc. Thus, var1 contains
place names (Canada, Newfoundland, etc.); var2, populations; and so forth.
[Screenshot: the Stata Data Editor, showing the first rows of the new variables var1 (place
names), var2 (population), var3 (unemployment), var4 (male life expectancy), and var5
(female life expectancy)]
We can assign more descriptive variable names by double-clicking on the column headings
(such as var1) and then typing a new name in the resulting dialog box; eight characters or
fewer works best, although names with up to 32 characters are allowed. We can also create
variable labels that contain a brief description. For example, var2 (population) might be
renamed pop, and given the variable label "Population in 1000s, 1995".
Renaming and labeling variables can also be done outside of the Data Editor through the
rename and label variable commands:
. rename var2 pop
. label variable pop "Population in 1000s, 1995"
Cells left empty, such as unemployment rates for the Yukon and Northwest Territories, will
automatically be assigned Stata's system (default) missing value code, a period. At any time,
we can close the Data Editor and then save the dataset to disk. Clicking the Data Editor icon
or choosing Window - Data Editor brings the Editor back.
If the first value entered for a variable is a number, as with population, unemployment, and
life expectancy, then Stata assumes that this column is a "numerical variable" and it will
thereafter permit only numerical values. Numerical values can also begin with a plus or minus
sign, include decimal points, or be expressed in scientific notation. For example, we could
represent Canada's population as 2.96061e+7, which means 2.96061 × 10^7 or about 29.6 million
people. Numerical values should not include any commas, such as 29,606,100. If we did
happen to put commas within the first value typed in a column, Stata would interpret this as a
"string variable" (next paragraph) rather than as a number.
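To see how Stata reads a value typed in scientific notation, we can ask the display command
(used here simply as a calculator, not a command from the original session) to echo it back;
it should print the value in ordinary notation:

. display 2.96061e+7
29606100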
If the first value entered for a variable includes non-numerical characters, as did the place
names above (or “1,000” with the comma), then Stata thereafter considers this column to be a
string variable. String variable values can be almost any combination of letters, numbers,
symbols, or spaces up to 80 characters long in Intercooled or Small Stata, and up to 244
characters in Stata/SE. We can thus store names, quotations, or other descriptive information.
String variable values can be tabulated and counted, but do not allow the calculation of means,
correlations, or most other statistics. In the Data Editor or Data Browser, string variable values
appear in red, so we can visually distinguish the two variable types.
After typing in the information from Table 2.1 in this fashion, we close the Data Editor and
save our data, perhaps with the name canada0.dta:

. save canada0
Stata automatically adds the extension .dta to any dataset name, unless we tell it to do
otherwise. If we already had saved and named an earlier version of this file, it is possible to
write over that with the newest version by typing
save, replace
At this point, our new dataset looks like this:
. describe

Contains data from C:\data\canada0.dta
  obs:            13
 vars:             5                          3 Jul 2005 10:30
 size:               (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
--------------------------------------------------------------------------
var1            str21  %21s
pop             float  %9.0g                  Population in 1000s, 1995
var3            float  %9.0g
var4            float  %9.0g
var5            float  %9.0g
--------------------------------------------------------------------------
Sorted by:
. list

     +---------------------------------------------------------+
     |                  var1       pop   var3   var4   var5    |
     |---------------------------------------------------------|
  1. |                Canada   29606.1   10.6   75.1   81.1    |
  2. |          Newfoundland     575.4   19.6   73.9   79.8    |
  3. |  Prince Edward Island     136.1   19.1   74.8   81.3    |
  4. |           Nova Scotia     937.8   13.9   74.2   80.4    |
  5. |         New Brunswick     760.1   13.8   74.8   80.6    |
     |---------------------------------------------------------|
  6. |                Quebec    7334.2   13.2   74.5   81.2    |
  7. |               Ontario   11100.3    9.3   75.5   81.1    |
  8. |              Manitoba    1137.5    8.5     75   80.8    |
  9. |          Saskatchewan    1015.6      7   75.2   81.8    |
 10. |               Alberta      2747    8.4   75.5   81.4    |
     |---------------------------------------------------------|
 11. |      British Columbia      3766    9.8   75.8   81.4    |
 12. |                 Yukon      30.1      .   71.3   80.4    |
 13. | Northwest Territories      65.8      .   70.2     78    |
     +---------------------------------------------------------+
. summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        var1 |         0
         pop |        13    4554.769    8214.304       30.1    29606.1
        var3 |        11    12.10909    4.250048          7       19.6
        var4 |        13    74.29231    1.673052       70.2       75.8
        var5 |        13    80.71539     .975402         78       81.8
Examining such output tables gives us a chance to look for errors that should be corrected.
The summarize table, for instance, provides several numbers useful in proofreading the new
dataset (canada0.dta).

The next step is to make our dataset more self-documenting. The variables could be given
more descriptive names, such as the following:
. rename var1 place
. rename var3 unemp
. rename var4 mlife
. rename var5 flife

Stata also permits us to add several kinds of labels to the data.  label data describes
the dataset as a whole. For example,

. label data "Canadian dataset 0"

label variable describes an individual variable. For example,

. label variable place "Place name"
. label variable unemp "% 15+ population unemployed, 1995"
. label variable mlife "Male life expectancy years"
. label variable flife "Female life expectancy years"

By labeling data and variables, we obtain a dataset that is more self-explanatory:

. describe
Contains data from C:\data\canada0.dta
  obs:            13                          Canadian dataset 0
 vars:             5                          3 Jul 2005 10:45
 size:           533 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
--------------------------------------------------------------------------
place           str21  %21s                   Place name
pop             float  %9.0g                  Population in 1000s, 1995
unemp           float  %9.0g                  % 15+ population unemployed, 1995
mlife           float  %9.0g                  Male life expectancy years
flife           float  %9.0g                  Female life expectancy years
--------------------------------------------------------------------------
Sorted by:
Once labeling is completed, we should save the data to disk by using File - Save or typing
save, replace
We can later retrieve these data any time through File - Open, or by typing

. use c:\data\canada0
(Canadian dataset 0)

We can then proceed with a new analysis. We might notice, for instance, that male and female
life expectancies correlate positively with each other and also negatively with the
unemployment rate. The life expectancy-unemployment rate correlation is slightly stronger
for males.
. correlate unemp mlife flife
(obs=11)

             |    unemp    mlife    flife
-------------+---------------------------
       unemp |   1.0000
       mlife |  -0.7440   1.0000
       flife |  -0.6173   0.7631   1.0000
The order of observations within a dataset can be changed through the sort command.
For example, to rearrange observations from smallest to largest in population, type
I
.
sort pop
String variables are sorted alphabetically instead of numerically. Typing sort place will
rearrange observations putting Alberta first, British Columbia second, and so on.
We can control the order of variables in the data by using the order command. For
example, we could make unemployment rate the second variable, and population last:
order place unemp mlife flife pop
The Data Editor also has buttons that perform these functions. The Sort button applies to
the column currently highlighted by the cursor. The « and » buttons move the current
variable to the beginning or end of the variable list, respectively. As with any other editing,
these changes only become permanent if we subsequently save our data.
The Data Editor’s Hide button does not rearrange the data, but rather makes a column
temporarily invisible on the spreadsheet. This feature is convenient if, for example, we need
to type in more variables and want to keep the province names or some other case identification
column in view, adjacent to the "active" column where we are entering data.
We can also restrict the Data Editor beforehand to work only with certain variables, in a
specified order, or with a specified range of values. For example,
edit place mlife flife
or
edit place unemp if pop > 100
The last example employs an if qualifier, an important tool described in the next section.
Specifying Subsets of the Data:
in and if Qualifiers
Many Stata commands can be restricted to a subset of the data by adding an in or if
qualifier. (Qualifiers are also available for many menu selections: look for an if/in or by/if/in
tab along the top of the menu.)  in specifies the observation numbers to which the command
applies. For example, list in 5 tells Stata to list only the 5th observation. To list the 1st
through 20th observations, type

. list in 1/20

The letter l denotes the last case, and -4, for example, the fourth-from-last. Thus, we could
list the four most populous Canadian places (which will include Canada itself) as follows:

. sort pop
. list place pop in -4/l

Note the important, although typographically subtle, distinction between 1 (number one, or
first observation) and l (letter "el," or last observation). The in qualifier works in a similar
way with most other analytical or data-editing commands. It always refers to the data as
presently sorted.
The if qualifier also has broad applications, but it selects observations based on specific
variable values. As noted, the observations in canadaO.dta include not only 12 Canadian
provinces or territories, but also Canada as a whole. For many purposes, we might want to
exclude Canada from analyses involving the 12 territories and provinces. One way to do so is
to restrict the analysis to only those places with populations below 20 million (20,000
thousand); that is, every place except Canada:
. summarize if pop < 20000

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       place |         0
         pop |        12    2467.158    3435.521       30.1    11100.3
       unemp |        10       12.26     4.44877          7       19.6
       mlife |        12      74.225    1.728965       70.2       75.8
       flife |        12    80.68333      1.0116         78       81.8
Compare this with the earlier summarize output to see how much has changed. The
previous mean of population, for example, was grossly misleading because it counted every
person twice.
<
==
(is less than) sign is one of six relational operators:
is equal to
!=
is not equal to (
>
is greater than
<
is less than
>=
is greater than orequal to
<=
is less than or equal to
The
also works)
A double equals sign, " == ", denotes the logical test, "Is the value on the left side the same
as the value on the right?” To Stata, a single equals sign means something different: "Make
the value on the left side be the same as the value on the right.” The single equals sign is not
a relational operator and cannot be used within if qualifiers. Single equals signs have other
meanings. They are used with commands that generate new variables, or replace the values of
old ones, according to algebraic expressions. Single equals signs also appear in certain
specialized applications such as weighting and hypothesis tests.
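A brief sketch using the canada0 data may help fix the distinction; lowunemp below is a
hypothetical new variable, not one from the original dataset. The double equals sign tests
values inside an if qualifier, while the single equals sign assigns values in a generate
statement:

. list place unemp if unemp == 7
. generate lowunemp = 1 if unemp < 10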
Any of these relational operators can be used to select observations based on their values
for numerical variables. Only two operators, == and !=, make sense with string variables.
To use string variables in an if qualifier, enclose the target value in double quotes. For
example, we could get a summary excluding Canada (leaving in the 12 provinces and
territories):

. summarize if place != "Canada"

Two or more relational operators can be combined within a single if expression by the
use of logical operators. Stata's logical operators are the following:

    &    and
    |    or   (symbol is a vertical bar, not the number one or letter "el")
    !    not  ( ~ also works)
The Canadian territories (Yukon and Northwest) both have fewer than 100,000 people. To find
the mean unemployment and life expectancies for the 10 Canadian provinces only, excluding
both the smaller places (territories) and the largest (Canada), we could use this command:
. summarize unemp mlife flife if pop > 100 & pop < 20000

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       unemp |        10       12.26     4.44877          7       19.6
       mlife |        10       74.92    .6051633       73.9       75.8
       flife |        10       80.98     .586515       79.8       81.8
Parentheses allow us to specify the precedence among multiple operators. For example,
we might list all the places that either have unemployment below 9, or have life expectancies
of at least 75.4 for men and 81.4 for women:
. list if unemp < 9 | (mlife >= 75.4 & flife >= 81.4)

     +-----------------------------------------------------+
     |            place      pop   unemp   mlife   flife   |
     |-----------------------------------------------------|
  8. |         Manitoba   1137.5     8.5      75    80.8   |
  9. |     Saskatchewan   1015.6       7    75.2    81.8   |
 10. |          Alberta     2747     8.4    75.5    81.4   |
 11. | British Columbia     3766     9.8    75.8    81.4   |
     +-----------------------------------------------------+
A note of caution regarding missing values: Stata ordinarily shows missing values as a
period, but in some operations (notably sort and if, although not in statistical calculations
such as means or correlations), these same missing values are treated as if they were large
positive numbers. Watch what happens if we sort places from lowest to highest unemployment
rate, and then ask to see places with unemployment rates above 15%:
. sort unemp
. list if unemp > 15

     +---------------------------------------------------------+
     |                 place     pop   unemp   mlife   flife   |
     |---------------------------------------------------------|
 10. |  Prince Edward Island   136.1    19.1    74.8    81.3   |
 11. |          Newfoundland   575.4    19.6    73.9    79.8   |
 12. |                 Yukon    30.1       .    71.3    80.4   |
 13. | Northwest Territories    65.8       .    70.2      78   |
     +---------------------------------------------------------+
The two places with missing unemployment rates were included among those “greater than 15.”
In this instance the result is obvious, but with a larger dataset we might not notice. Suppose
that we were analyzing a political opinion poll. A command such as the following would
tabulate the variable vote not only for people with ages older than 65, as intended, but also for
any people whose age values were missing:
.
tabulate vote if age > 65
Where missing values exist, we might have to deal with them explicitly as part of the if
expression.
.
tabulate vote if age > 65
& age <
.
A less-than inequality such as age < . is a general way to select observations with
nonmissing values. Stata permits up to 27 different missing values codes, although we are
using only the default "." here. The other 26 codes are represented internally as numbers
even larger than ".", so  < .  avoids them all. Type help missing for more details.
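An equivalent and often more readable guard, shown here as an alternative sketch rather than
a command from the original text, uses the missing() function:

. tabulate vote if age > 65 & !missing(age)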
The in and if qualifiers set observations aside temporarily so that a particular
command does not apply to them. These qualifiers have no effect on the data in memory, and
the next command will apply to all observations, unless it too has an in or if qualifier. To
drop variables from the data in memory, use the drop command. For example, to drop mlife
and flife from memory, type

. drop mlife flife

We can drop observations from memory by using either the in qualifier or the if
qualifier. Because we earlier sorted on unemp, the two territories occupy the 12th and 13th
positions in the data. Canada itself is 6th. One way to drop these three nonprovinces employs
the in qualifier; drop in 12/13 means "drop the 12th through the 13th observations."
. list

     +------------------------------------------+
     |                 place       pop   unemp  |
     |------------------------------------------|
  1. |          Saskatchewan    1015.6       7  |
  2. |               Alberta      2747     8.4  |
  3. |              Manitoba    1137.5     8.5  |
  4. |               Ontario   11100.3     9.3  |
  5. |      British Columbia      3766     9.8  |
     |------------------------------------------|
  6. |                Canada   29606.1    10.6  |
  7. |                Quebec    7334.2    13.2  |
  8. |         New Brunswick     760.1    13.8  |
  9. |           Nova Scotia     937.8    13.9  |
 10. |  Prince Edward Island     136.1    19.1  |
     |------------------------------------------|
 11. |          Newfoundland     575.4    19.6  |
 12. |                 Yukon      30.1       .  |
 13. | Northwest Territories      65.8       .  |
     +------------------------------------------+
. drop in 12/13
(2 observations deleted)
. drop in 6
(1 observation deleted)

The same change could have been accomplished through an if qualifier, with a command
that says "drop if place equals Canada or population is less than 100."

. drop if place == "Canada" | pop < 100
(3 observations deleted)
After dropping Canada, the territories, and the variables mlife and flife, we have the
following reduced dataset:
. list

     +-----------------------------------------+
     |                place      pop   unemp   |
     |-----------------------------------------|
  1. |         Saskatchewan   1015.6       7   |
  2. |              Alberta     2747     8.4   |
  3. |             Manitoba   1137.5     8.5   |
  4. |              Ontario  11100.3     9.3   |
  5. |     British Columbia     3766     9.8   |
     |-----------------------------------------|
  6. |               Quebec   7334.2    13.2   |
  7. |        New Brunswick    760.1    13.8   |
  8. |          Nova Scotia    937.8    13.9   |
  9. | Prince Edward Island    136.1    19.1   |
 10. |         Newfoundland    575.4    19.6   |
     +-----------------------------------------+
We can also drop selected variables or observations through the Delete button in the Data
Editor.
Rather than specifying which variables or observations to drop, it sometimes is simpler to
specify which to keep. The same reduced dataset could have been obtained as follows:

. keep place pop unemp
. keep if place != "Canada" & pop >= 100
(3 observations deleted)

Like any other changes to the data in memory, none of these reductions affect disk files
until we save the data. At that point, we can choose between writing over the old dataset
(save, replace), thereby destroying it, or just saving the newly modified dataset with a
new name (by choosing File - Save As, or by typing a command with the form save
newname) so that both versions exist on disk.
Generating and Replacing Variables

The generate and replace commands allow us to create new variables or change the
values of existing variables. For example, in Canada, as in most industrial societies, women
tend to live longer than men. To analyze regional variations in this gender gap, we might
retrieve dataset canada1.dta and generate a new variable equal to female life expectancy
(flife) minus male life expectancy (mlife). Note that in the main part of a generate
statement (unlike in if qualifiers) we use a single equals sign.

. use canada1, clear
(Canadian dataset 1)

. generate gap = flife - mlife

. label variable gap "Female-male gap life expectancy"
. describe

Contains data from C:\data\canada1.dta
  obs:            13                          Canadian dataset 1
 vars:             6                          3 Jul 2005 10:48
 size:           595 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
place           str21  %21s                   Place name
pop             float  %9.0g                  Population in 1000s, 1995
unemp           float  %9.0g                  % 15+ population unemployed, 1995
mlife           float  %9.0g                  Male life expectancy years
flife           float  %9.0g                  Female life expectancy years
gap             float  %9.0g                  Female-male gap life expectancy
-------------------------------------------------------------------------------
Sorted by:
. list place flife mlife gap

     +--------------------------------------------------+
     |                 place   flife   mlife        gap |
     |--------------------------------------------------|
  1. |                Canada    81.1    75.1          6 |
  2. |          Newfoundland    79.8    73.9   5.900002 |
  3. |  Prince Edward Island    81.3    74.8        6.5 |
  4. |           Nova Scotia    80.4    74.2   6.200005 |
  5. |         New Brunswick    80.6    74.8   5.799995 |
     |--------------------------------------------------|
  6. |                Quebec    81.2    74.5   6.699997 |
  7. |               Ontario    81.1    75.5   5.599998 |
  8. |              Manitoba    80.8      75   5.800003 |
  9. |          Saskatchewan    81.8    75.2   6.600006 |
 10. |               Alberta    81.4    75.5   5.900002 |
     |--------------------------------------------------|
 11. |      British Columbia    81.4    75.8   5.599998 |
 12. |                 Yukon    80.4    71.3   9.099998 |
 13. | Northwest Territories      78    70.2   7.800003 |
     +--------------------------------------------------+
For the province of Newfoundland, the true value of gap should be 79.8 − 73.9 = 5.9 years,
but the output shows this value as 5.900002 instead. Like all computer programs, Stata stores
numbers in binary form, and 5.9 has no exact binary representation. The small inaccuracies that
arise from approximating decimal fractions in binary are unlikely to affect statistical
calculations much, because calculations are done in double precision (8 bytes per number).
They appear disconcerting in data lists, however. We can change the display format so that
Stata shows only a rounded-off version. The following command specifies a fixed display
format four numerals wide, with one digit to the right of the decimal:

. format gap %4.1f

The resulting display shows 5.9; however, a command such as the following will return no
observations:

. list if gap == 5.9

This occurs because Stata believes the value does not exactly equal 5.9. (More technically,
Stata stores gap values in single precision but does all calculations in double, and the single-
and double-precision approximations of 5.9 are not identical.)
Display formats, as well as variable names and labels, can also be changed by double-
clicking on a column in the Data Editor. Fixed numeric formats such as %4.1f are one of
the three most common numeric display format types. These are

%w.dg     General numeric format, where w specifies the total width or number of columns
          displayed and d the minimum number of digits that must follow the decimal
          point. Exponential notation (such as 1.00e+07, meaning 1.00 × 10^7 or 10 million)
          and shifts in the decimal-point position will be used automatically as needed, to
          display values in an optimal (but varying) fashion.

%w.df     Fixed numeric format, where w specifies the total width or number of columns
          displayed and d the fixed number of digits that must follow the decimal point.

%w.de     Exponential numeric format, where w specifies the total width or number of
          columns displayed and d the fixed number of digits that must follow the decimal
          point.
For example, as we saw in Table 2.1, the 1995 population of Canada was approximately
29,606,100 people, and the Yukon Territory population was 30,100. Below we see how these
two numbers appear under several different display formats:
     format          Canada          Yukon
     %9.0g         2.96e+07          30100
     %9.1f       29606100.0        30100.0
     %12.5e     2.96061e+07    3.01000e+04
Although the displayed values look different, their internal values are identical. Statistical
calculations remain unaffected by display formats. Other numerical display formatting options
include the use of commas, left- and right-justification, or leading zeroes. There also exist
special formats for dates, time series variables, and string variables. Type help format
for more information.
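For instance, a comma version of the general numeric format makes large populations easier
to read. The command below is a minimal sketch; the %12.0gc format is standard Stata, but
applying it to pop is simply our illustration:

. format pop %12.0gc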
replace can make the same sorts of calculations as generate, but it changes values
of an existing variable instead of creating a new variable. For example, the variable pop in our
dataset gives population in thousands. To convert this to simple population, we just multiply
(“*” means multiply) all values by 1,000:

. replace pop = pop * 1000

replace can make such wholesale changes, or it can be used with in or if qualifiers
to selectively edit the data. To illustrate, suppose that we had questionnaire data with variables
including age and year born (born). A command such as the following would correct one or
more typos where a subject’s age had been incorrectly typed as 229 instead of 29:

. replace age = 29 if age == 229

Alternatively, the following command could correct an error in the value of age for observation
number 1453:

. replace age = 29 in 1453

For a more complicated example,

. replace age = 2005-born if age >= . | age < 2005-born

This replaces values of variable age with 2005 minus the year of birth if age is missing or if the
reported age is less than 2005 minus the year of birth.
generate and replace provide tools to create categorical variables as well. We
noted earlier that our Canadian dataset includes several types of observations: 2 territories, 10
provinces, and 1 country combining them all. Although in and if qualifiers allow us to
separate these, and drop can eliminate observations from the data, it might be most
convenient to have a categorical variable that indicates the observation’s “type.” The following
example shows one way to create such a variable. We start by generating type as a constant
equal to 1 for each observation. Next, we replace this with the value 2 for the Yukon and
Northwest Territories, and with 3 for Canada. The final steps involve labeling the new variable
type and defining labels for values 1, 2, and 3.
. use canada1, clear
(Canadian dataset 1)

. generate type = 1

. replace type = 2 if place == "Yukon" | place == "Northwest Territories"
(2 real changes made)

. replace type = 3 if place == "Canada"
(1 real change made)

. label variable type "Province, territory or nation"

. label define typelbl 1 "Province" 2 "Territory" 3 "Nation"

. label values type typelbl
. list place flife mlife gap type

     +----------------------------------------------------------------+
     |                 place   flife   mlife        gap        type |
     |----------------------------------------------------------------|
  1. |                Canada    81.1    75.1          6      Nation |
  2. |          Newfoundland    79.8    73.9   5.900002    Province |
  3. |  Prince Edward Island    81.3    74.8        6.5    Province |
  4. |           Nova Scotia    80.4    74.2   6.200005    Province |
  5. |         New Brunswick    80.6    74.8   5.799995    Province |
     |----------------------------------------------------------------|
  6. |                Quebec    81.2    74.5   6.699997    Province |
  7. |               Ontario    81.1    75.5   5.599998    Province |
  8. |              Manitoba    80.8      75   5.800003    Province |
  9. |          Saskatchewan    81.8    75.2   6.600006    Province |
 10. |               Alberta    81.4    75.5   5.900002    Province |
     |----------------------------------------------------------------|
 11. |      British Columbia    81.4    75.8   5.599998    Province |
 12. |                 Yukon    80.4    71.3   9.099998   Territory |
 13. | Northwest Territories      78    70.2   7.800003   Territory |
     +----------------------------------------------------------------+
As illustrated, labeling the values of a categorical variable requires two commands. The
label define command specifies what labels go with what numbers. The label
values command specifies to which variable these labels apply. One set of labels (created
through one label define command) can apply to any number of variables (that is, be
referenced in any number of label values commands). Value labels can have up to
32,000 characters, but work best for most purposes if they are not too long.
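To illustrate the point about reuse, the sketch below defines a single yes/no label set and
attaches it to several variables in turn. The variable names (q1 through q3) and the label name
yesno are hypothetical, not part of the Canadian dataset:

. label define yesno 0 "No" 1 "Yes"
. label values q1 yesno
. label values q2 yesno
. label values q3 yesno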
generate can create new variables, and replace can produce new values, using any
mixture of old variables, constants, random values, and expressions. For numeric variables, the
following arithmetic operators apply:

   +   add
   -   subtract
   *   multiply
   /   divide
   ^   raise to power

Parentheses will control the order of calculation. Without them, the ordinary rules of
precedence apply. Of the arithmetic operators, only addition, “+”, works with string variables,
where it connects two string values into one.
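For example, string addition can paste two name fields together. The sketch below assumes
two string variables, firstname and lastname, which are our own illustration rather than part
of this chapter’s datasets:

. generate fullname = firstname + " " + lastname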
Although their purposes differ, generate and replace have similar syntax. Either
can use any mathematically or logically feasible combination of Stata operators and in or
if qualifiers. These commands can also employ Stata’s broad array of special functions,
introduced in the following section.
Using Functions

This section lists many of the functions available for use with generate or replace.
For example, we could create a new variable named loginc, equal to the natural logarithm of
income, by using the natural log function ln within a generate command:

. generate loginc = ln(income)

ln is one of Stata’s mathematical functions. These functions are as follows:
abs(x)            Absolute value of x.
acos(x)           Arc-cosine returning radians. Because 360 degrees = 2π radians,
                  acos(x)*180/_pi gives the arc-cosine returning degrees (_pi denotes
                  the mathematical constant π).
asin(x)           Arc-sine returning radians.
atan(x)           Arc-tangent returning radians.
atan2(y,x)        Two-argument arc-tangent returning radians.
atanh(x)          Arc-hyperbolic tangent returning radians.
ceil(x)           Integer n such that n-1 < x <= n.
cloglog(x)        Complementary log-log of x:  ln(-ln(1-x))
comb(n,k)         Combinatorial function (number of possible combinations of n things
                  taken k at a time).
cos(x)            Cosine of radians. To find the cosine of x degrees, type
                     . generate y = cos(x*_pi/180)
digamma(x)        d lnΓ(x) / dx
exp(x)            Exponential (e to the x power).
floor(x)          Integer n such that n <= x < n+1.
trunc(x)          Integer obtained by truncating x towards zero.
invcloglog(x)     Inverse of the complementary log-log:  1 - exp(-exp(x))
invlogit(x)       Inverse of logit of x:  exp(x)/(1 + exp(x))
ln(x)             Natural (base e) logarithm. For any other base number B, to find the
                  base B logarithm of x, type
                     . generate y = ln(x)/ln(B)
lnfactorial(x)    Natural log of factorial. To find x factorial, type
                     . generate y = round(exp(lnfactorial(x)),1)
lngamma(x)        Natural log of Γ(x). To find Γ(x), type
                     . generate y = exp(lngamma(x))
log(x)            Natural logarithm; same as ln(x).
log10(x)          Base 10 logarithm.
logit(x)          Log of the odds ratio of x:  ln(x/(1-x))
max(x1,x2,..,xn)  Maximum of x1, x2, ..., xn.
min(x1,x2,..,xn)  Minimum of x1, x2, ..., xn.
mod(x,y)          Modulus of x with respect to y.
reldif(x,y)       Relative difference:  |x - y| / (|y| + 1)
round(x)          Round x to nearest whole number.
round(x,y)        Round x in units of y.
sign(x)           -1 if x<0, 0 if x=0, +1 if x>0.
sin(x)            Sine of radians.
sqrt(x)           Square root.
total(x)          Running sum of x (also see help egen).
tan(x)            Tangent of radians.
tanh(x)           Hyperbolic tangent of x.
trigamma(x)       d² lnΓ(x) / dx²
Many probability functions exist as well, and are listed below. Consult help probfun
and the reference manuals for important details, including definitions, constraints on
parameters, and the treatment of missing values.
betaden(a,b,x)      Probability density of the beta distribution.
Binomial(n,k,p)     Probability of k or more successes in n trials when the probability of
                    a success on a single trial is p.
binormal(h,k,r)     Joint cumulative distribution of bivariate normal with correlation r.
chi2(n,x)           Cumulative chi-squared distribution with n degrees of freedom.
chi2tail(n,x)       Reverse cumulative (upper-tail, survival) chi-squared distribution with
                    n degrees of freedom.  chi2tail(n,x) = 1 - chi2(n,x)
dgammapda(a,x)      Partial derivative of the cumulative gamma distribution gammap(a,x)
                    with respect to a.
dgammapdx(a,x)      Partial derivative of the cumulative gamma distribution gammap(a,x)
                    with respect to x.
dgammapdada(a,x)    2nd partial derivative of the cumulative gamma distribution
                    gammap(a,x) with respect to a.
dgammapdadx(a,x)    2nd partial derivative of the cumulative gamma distribution
                    gammap(a,x) with respect to a and x.
dgammapdxdx(a,x)    2nd partial derivative of the cumulative gamma distribution
                    gammap(a,x) with respect to x.
F(n1,n2,f)          Cumulative F distribution with n1 numerator and n2 denominator
                    degrees of freedom.
Fden(n1,n2,f)       Probability density function for the F distribution with n1 numerator
                    and n2 denominator degrees of freedom.
Ftail(n1,n2,f)      Reverse cumulative (upper-tail, survival) F distribution with n1
                    numerator and n2 denominator degrees of freedom.
                    Ftail(n1,n2,f) = 1 - F(n1,n2,f)
gammaden(a,b,g,x)   Probability density function for the gamma family, where
                    gammaden(a,1,0,x) = the probability density function for the
                    cumulative gamma distribution gammap(a,x).
gammap(a,x)         Cumulative gamma distribution for a; also known as the incomplete
                    gamma function.
ibeta(a,b,x)        Cumulative beta distribution for a, b; also known as the incomplete
                    beta function.
invbinomial(n,k,P)  Inverse binomial. For P <= 0.5, probability p such that the
                    probability of observing k or more successes in n trials is P; for P >
                    0.5, probability p such that the probability of observing k or fewer
                    successes in n trials is 1 - P.
invchi2(n,p)        Inverse of chi2().  If chi2(n,x) = p, then invchi2(n,p) = x
invchi2tail(n,p)    Inverse of chi2tail().  If chi2tail(n,x) = p,
                    then invchi2tail(n,p) = x
invF(n1,n2,p)       Inverse cumulative F distribution.  If F(n1,n2,f) = p,
                    then invF(n1,n2,p) = f
invFtail(n1,n2,p)   Inverse reverse cumulative F distribution.  If Ftail(n1,n2,f) = p,
                    then invFtail(n1,n2,p) = f
invgammap(a,p)      Inverse cumulative gamma distribution.  If gammap(a,x) = p,
                    then invgammap(a,p) = x
invibeta(a,b,p)     Inverse cumulative beta distribution.  If ibeta(a,b,x) = p,
                    then invibeta(a,b,p) = x
invnchi2(n,L,p)     Inverse cumulative noncentral chi-squared distribution.
                    If nchi2(n,L,x) = p, then invnchi2(n,L,p) = x
invnFtail(n1,n2,L,p)  Inverse reverse cumulative noncentral F distribution.
                    If nFtail(n1,n2,L,f) = p, then invnFtail(n1,n2,L,p) = f
invnibeta(a,b,L,p)  Inverse cumulative noncentral beta distribution.
                    If nibeta(a,b,L,x) = p, then invnibeta(a,b,L,p) = x
invnormal(p)        Inverse cumulative standard normal distribution.
                    If normal(z) = p, then invnormal(p) = z
invttail(n,p)       Inverse reverse cumulative Student's t distribution.
                    If ttail(n,t) = p, then invttail(n,p) = t
nbetaden(a,b,L,x)   Noncentral beta density with shape parameters a, b, noncentrality
                    parameter L.
nchi2(n,L,x)        Cumulative noncentral chi-squared distribution with n degrees of
                    freedom and noncentrality parameter L.
nFden(n1,n2,L,x)    Noncentral F density with n1 numerator and n2 denominator degrees
                    of freedom, noncentrality parameter L.
nFtail(n1,n2,L,x)   Reverse cumulative (upper-tail, survival) noncentral F distribution
                    with n1 numerator and n2 denominator degrees of freedom,
                    noncentrality parameter L.
nibeta(a,b,L,x)     Cumulative noncentral beta distribution with shape parameters a and
                    b, and noncentrality parameter L.
normal(z)           Cumulative standard normal distribution.
normalden(z)        Standard normal density, mean 0 and standard deviation 1.
normalden(z,s)      Normal density, mean 0 and standard deviation s.
normalden(x,m,s)    Normal density, mean m and standard deviation s.
npnchi2(n,x,p)      Noncentrality parameter L for the noncentral cumulative chi-squared
                    distribution.  If nchi2(n,L,x) = p, then npnchi2(n,x,p) = L
tden(n,t)           Probability density function of Student's t distribution with n degrees
                    of freedom.
ttail(n,t)          Reverse cumulative (upper-tail) Student's t distribution with n degrees
                    of freedom. This function returns the probability T > t.
uniform()           Pseudo-random number generator, returning values from a uniform
                    distribution theoretically ranging from 0 to nearly 1, written [0,1).
Nothing goes inside the parentheses with uniform(). Optionally, we can control the
random generator’s starting seed, and hence the stream of “random” numbers, by first
issuing a set seed # command — where # could be any integer from 0 to 2^31 − 1,
inclusive. Omitting the set seed command corresponds to set seed 123456789,
which will always produce the same stream of numbers.
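As a minimal sketch of how these pieces fit together (the variable name u is our own), we
could set a seed and then draw a column of uniform random numbers:

. set seed 12345
. generate u = uniform()

Repeating both commands reproduces exactly the same values of u; changing the seed
produces a different stream.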
Stata also provides a number of more specialized functions for working with dates and
date-related variables. A complete listing can be found in the reference manuals or through
Stata’s online help. Some basic date functions appear below. “Elapsed dates” are the
number of days elapsed since January 1, 1960.
date(s1,s2[,y])   Elapsed date corresponding to s1. s1 is a string variable indicating the date
                  in virtually any format. Months can be spelled out, abbreviated to three
                  characters, or given as numbers; years can include or exclude the century;
                  blanks and punctuation are allowed. s2 is any permutation of m, d, and
                  [##]y, with their order defining the order that month, day, and year occur in
                  s1. ## gives the century for two-digit years in s1; the default is 19y.
d(l)              A date literal convenience function. For example, typing d(2jan1960) is
                  equivalent to typing 1.
mdy(m,d,y)        Elapsed date corresponding to m, d, and y.
day(e)            Numeric day of the month corresponding to e, the elapsed date.
month(e)          Numeric month corresponding to e, the elapsed date.
year(e)           Numeric year corresponding to e, the elapsed date.
dow(e)            Numeric day of the week corresponding to e, the elapsed date.
doy(e)            Numeric day of the year corresponding to e, the elapsed date.
week(e)           Numeric week of the year corresponding to e, the elapsed date.
quarter(e)        Numeric quarter of the year corresponding to e, the elapsed date.
halfyear(e)       Numeric half of the year corresponding to e, the elapsed date.
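To see how these functions combine, the short sketch below builds an elapsed-date variable
from a hypothetical string variable named birthdate (holding values such as "7 Jul 1948")
and then recovers the year; the %d display format shows elapsed dates as readable dates:

. generate bdate = date(birthdate, "dmy")
. format bdate %d
. generate byear = year(bdate)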
Some useful special functions include the following:

autocode(x,n,xmin,xmax)  Forms categories from x by partitioning the interval from xmin
                  to xmax into n equal-length intervals and returning the upper bound of the
                  interval that contains x.
cond(x,a,b)       Returns a if x evaluates to “true” and b if x evaluates to “false.”
                     . generate y = cond(inc1 > inc2, inc1, inc2)
                  creates the variable y as the maximum of inc1 and inc2 (assuming neither
                  is missing).
group(x)          Creates a categorical variable that divides the data as presently sorted into
                  x subsamples that are as nearly equal-sized as possible.
trunc(x)          Returns the integer obtained by truncating (dropping fractional parts of) x.
max(x1,x2,..,xn)  Returns the maximum of x1, x2, ..., xn. Missing values are ignored.
                  For example, max(3+2,1) evaluates to 5.
min(x1,x2,..,xn)  Returns the minimum of x1, x2, ..., xn.
recode(x,x1,x2,..,xn)  Returns missing if x is missing, x1 if x <= x1, or x2 if x <= x2, and
                  so on.
round(x,y)        Returns x rounded to the nearest y.
sign(x)           Returns -1 if x < 0, 0 if x = 0, and +1 if x > 0 (missing if x is missing).
total(x)          Returns the running sum of x, treating missing values as zero.
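For instance, recode() can bin a measurement in one step. In the sketch below the variable
names (age and agegrp) and the cutpoints are purely illustrative:

. generate agegrp = recode(age,29,44,64,120)

Each value of age at or below 29 becomes 29, values above 29 and up to 44 become 44, and
so on, while missing ages remain missing.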
String functions, not described here, help to manipulate and evaluate string variables. Type
help strfun for a complete list of string functions. The reference manuals and User's
Guide give examples and details of these and other functions.
Multiple functions, operators, and qualifiers can be combined in one command as needed.
The functions and algebraic operators just described can also be used in another way that does
not create or change any dataset variables. The display command performs a single
calculation and shows the results onscreen. For example:

. display 2+3
5

. display log10(10^83)
83

. display invttail(120,.025) * 34.1/sqrt(975)
2.1622305

Thus, display works as an onscreen statistical calculator.
Unlike a calculator, display, generate, and replace have direct access to
Stata’s statistical results. For example, suppose that we summarized the unemployment rates
from dataset canada1.dta:

. summarize unemp

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       unemp |        11    12.10909    4.250048          7       19.6

After summarize, Stata temporarily stores the mean as a macro named r(mean).
. display r(mean)
12.109091

We could use this result to create variable unempDEV, defined as deviations from the mean:

. gen unempDEV = unemp - r(mean)
(2 missing values generated)

. summ unemp unempDEV

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       unemp |        11    12.10909    4.250048          7       19.6
    unempDEV |        11    4.33e-08    4.250048  -5.109091    7.49091
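Other results stored by summarize can be used the same way. As a minimal sketch (the
variable name zunemp is our own), the saved mean and standard deviation yield a
standardized version of unemp:

. summarize unemp
. generate zunemp = (unemp - r(mean)) / r(sd)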
Stata also provides another variable-creation command, egen (“extensions to
generate”), which has its own set of functions to accomplish tasks not easily done by
generate. These include such things as creating new variables from the sums, maxima,
minima, medians, interquartile ranges, standardized values, or moving averages of existing
variables or expressions. For example, the following command creates a new variable named
zscore, equal to the standardized (mean 0, variance 1) values of x:

. egen zscore = std(x)

Or, the following command creates new variable avg, equal to the row mean of each
observation’s values on x, y, z, and w, ignoring any missing values:

. egen avg = rowmean(x,y,z,w)

To create a new variable named sum, equal to the row sum of each observation’s values on x,
y, z, and w, treating missing values as zeroes, type

. egen sum = rowsum(x,y,z,w)

The following command creates new variable xrank, holding ranks corresponding to values of
x:  xrank = 1 for the observation with highest x, xrank = 2 for the second highest, and so forth.

. egen xrank = rank(x)

Consult help egen for a complete list of egen functions, or the reference manuals for
further examples.
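Many egen functions also accept a by() option for group-wise calculations. The sketch
below, using hypothetical variables income and region, stores each region’s mean income
alongside every observation in that region:

. egen regioninc = mean(income), by(region)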
Converting between Numeric and String Formats

Dataset canada2.dta contains one string variable, place. It also has a labeled categorical
variable, type. Both seem to have nonnumerical values.

. use canada2, clear
(Canadian dataset 2)

. list place type
     +-------------------------------------+
     |                 place        type |
     |-------------------------------------|
  1. |                Canada      Nation |
  2. |          Newfoundland    Province |
  3. |  Prince Edward Island    Province |
  4. |           Nova Scotia    Province |
  5. |         New Brunswick    Province |
     |-------------------------------------|
  6. |                Quebec    Province |
  7. |               Ontario    Province |
  8. |              Manitoba    Province |
  9. |          Saskatchewan    Province |
 10. |               Alberta    Province |
     |-------------------------------------|
 11. |      British Columbia    Province |
 12. |                 Yukon   Territory |
 13. | Northwest Territories   Territory |
     +-------------------------------------+
Beneath the labels, however, type remains a numeric variable, as we can see if we ask for the
nolabel option:

. list place type, nolabel
     +------------------------------+
     |                 place   type |
     |------------------------------|
  1. |                Canada      3 |
  2. |          Newfoundland      1 |
  3. |  Prince Edward Island      1 |
  4. |           Nova Scotia      1 |
  5. |         New Brunswick      1 |
     |------------------------------|
  6. |                Quebec      1 |
  7. |               Ontario      1 |
  8. |              Manitoba      1 |
  9. |          Saskatchewan      1 |
 10. |               Alberta      1 |
     |------------------------------|
 11. |      British Columbia      1 |
 12. |                 Yukon      2 |
 13. | Northwest Territories      2 |
     +------------------------------+
String and labeled numeric variables look similar when listed, but they behave differently
when analyzed. Most statistical operations and algebraic relations are not defined for string
variables, so we might want to have both string and labeled-numeric versions of the same
information in our data. The encode command generates a labeled-numeric variable from
a string variable. The number 1 is given to the alphabetically first value of the string variable,
2 to the second, and so on. In the following example, we create a labeled numeric variable
named placenum from the string variable place:

. encode place, gen(placenum)
The opposite conversion is possible, too: The decode command generates a string
variable from a labeled numeric variable. Here we create a string variable typestr from the
labeled numeric variable type:

. decode type, gen(typestr)
When listed, the new numeric variable placenum, and the new string variable typestr, look
similar to the originals:
. list place placenum type typestr

     +-------------------------------------------------------------------------+
     |                 place                placenum        type     typestr |
     |-------------------------------------------------------------------------|
  1. |                Canada                  Canada      Nation      Nation |
  2. |          Newfoundland            Newfoundland    Province    Province |
  3. |  Prince Edward Island    Prince Edward Island    Province    Province |
  4. |           Nova Scotia             Nova Scotia    Province    Province |
  5. |         New Brunswick           New Brunswick    Province    Province |
     |-------------------------------------------------------------------------|
  6. |                Quebec                  Quebec    Province    Province |
  7. |               Ontario                 Ontario    Province    Province |
  8. |              Manitoba                Manitoba    Province    Province |
  9. |          Saskatchewan            Saskatchewan    Province    Province |
 10. |               Alberta                 Alberta    Province    Province |
     |-------------------------------------------------------------------------|
 11. |      British Columbia        British Columbia    Province    Province |
 12. |                 Yukon                   Yukon   Territory   Territory |
 13. | Northwest Territories   Northwest Territories   Territory   Territory |
     +-------------------------------------------------------------------------+
But with the nolabel option, the differences become visible. Stata views placenum and
type basically as numbers.

. list place placenum type typestr, nolabel
     +--------------------------------------------------------------------------+
     |                 place               placenum   type    typestr |
     |--------------------------------------------------------------------------|
  1. |                Canada   3.00000000000000e+00      3     Nation |
  2. |          Newfoundland   6.00000000000000e+00      1   Province |
  3. |  Prince Edward Island   1.00000000000000e+01      1   Province |
  4. |           Nova Scotia   8.00000000000000e+00      1   Province |
  5. |         New Brunswick   5.00000000000000e+00      1   Province |
     |--------------------------------------------------------------------------|
  6. |                Quebec   1.10000000000000e+01      1   Province |
  7. |               Ontario   9.00000000000000e+00      1   Province |
  8. |              Manitoba   4.00000000000000e+00      1   Province |
  9. |          Saskatchewan   1.20000000000000e+01      1   Province |
 10. |               Alberta   1.00000000000000e+00      1   Province |
     |--------------------------------------------------------------------------|
 11. |      British Columbia   2.00000000000000e+00      1   Province |
 12. |                 Yukon   1.30000000000000e+01      2   Territory |
 13. | Northwest Territories   7.00000000000000e+00      2   Territory |
     +--------------------------------------------------------------------------+
Statistical analyses, such as finding means and standard deviations, work only with
basically numeric variables. For calculation purposes, numeric variables’ labels do not matter.

. summarize place placenum type typestr

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       place |         0
    placenum |        13           7     3.89444          1         13
        type |        13    1.307692    .6304252          1          3
     typestr |         0
Occasionally we encounter a string variable where the values are all or mostly numbers.
To convert these string values into their numerical counterparts, use the real function. For
example, the variable siblings below is a string variable, although it only has one value, “4 or
more,” that could not be represented just as easily by a number.
. describe siblings

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
siblings        str9   %9s                    Number of siblings (string)

. list

     +-----------+
     |  siblings |
     |-----------|
  1. |         0 |
  2. |         1 |
  3. |         2 |
  4. |         3 |
  5. | 4 or more |
     +-----------+

. generate sibnum = real(siblings)
(1 missing value generated)
The new variable sibnum is numeric, with a missing value where siblings had “4 or more.”
. list

     +---------------------+
     |  siblings   sibnum |
     |---------------------|
  1. |         0        0 |
  2. |         1        1 |
  3. |         2        2 |
  4. |         3        3 |
  5. | 4 or more        . |
     +---------------------+
The destring command provides a more flexible method for converting string
variables to numeric. In the example above, we could have accomplished the same thing by
typing

. destring siblings, generate(sibnum) force

See help destring for information about syntax and options.
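Going the other direction, the string() function turns numeric values into strings. A
minimal sketch, reusing the sibnum variable created above:

. generate sibstr = string(sibnum)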
Creating New Categorical and Ordinal Variables
A previous section illustrated how to construct a categorical variable called type to distinguish
among territories, provinces, and nation in our Canadian dataset. You can create categorical
or ordinal variables in many other ways. This section gives a few examples.
type has three categories:
. tabulate type

   Province, |
territory or |
      nation |      Freq.     Percent        Cum.
-------------+-----------------------------------
    Province |         10       76.92       76.92
   Territory |          2       15.38       92.31
      Nation |          1        7.69      100.00
-------------+-----------------------------------
       Total |         13      100.00
For some purposes, we might want to re-express a multicategory variable as a set of
dichotomies or “dummy variables,” each coded 0 or 1. tabulate will create dummy
variables automatically if we add the generate option. In the following example, this
results in a set of variables called type1, type2, and type3, each representing one of the three
categories of type:
. tabulate type, generate(type)

   Province, |
territory or |
      nation |      Freq.     Percent        Cum.
-------------+-----------------------------------
    Province |         10       76.92       76.92
   Territory |          2       15.38       92.31
      Nation |          1        7.69      100.00
-------------+-----------------------------------
       Total |         13      100.00
. describe

Contains data from C:\data\canada2.dta
  obs:            13                          Canadian dataset 2
 vars:            10                          3 Jul 2005 10:48
 size:           637 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
place           str21  %21s                   Place name
pop             float  %9.0g                  Population in 1000s, 1995
unemp           float  %9.0g                  % 15+ population unemployed, 1995
mlife           float  %9.0g                  Male life expectancy years
flife           float  %9.0g                  Female life expectancy years
gap             float  %9.0g                  Female-male gap life expectancy
type            byte   %9.0g       typelbl    Province, territory or nation
type1           byte   %8.0g                  type==Province
type2           byte   %8.0g                  type==Territory
type3           byte   %8.0g                  type==Nation
-------------------------------------------------------------------------------
Sorted by:
Note:  dataset has changed since last saved
. list place type type1-type3

     +--------------------------------------------------------------+
     |                 place        type   type1   type2   type3 |
     |--------------------------------------------------------------|
  1. |                Canada      Nation       0       0       1 |
  2. |          Newfoundland    Province       1       0       0 |
  3. |  Prince Edward Island    Province       1       0       0 |
  4. |           Nova Scotia    Province       1       0       0 |
  5. |         New Brunswick    Province       1       0       0 |
     |--------------------------------------------------------------|
  6. |                Quebec    Province       1       0       0 |
  7. |               Ontario    Province       1       0       0 |
  8. |              Manitoba    Province       1       0       0 |
  9. |          Saskatchewan    Province       1       0       0 |
 10. |               Alberta    Province       1       0       0 |
     |--------------------------------------------------------------|
 11. |      British Columbia    Province       1       0       0 |
 12. |                 Yukon   Territory       0       1       0 |
 13. | Northwest Territories   Territory       0       1       0 |
     +--------------------------------------------------------------+
Re-expressing categorical information as a set of dummy variables involves no loss of
information; in this example, type1 through type3 together tell us exactly as much as type itself
does. Occasionally, however, analysts choose to re-express a measurement variable in
categorical or ordinal form, even though this does result in a substantial loss of information.
For example, unemp in canada2.dta gives a measure of the unemployment rate. Excluding
Canada itself from the data, we see that unemp ranges from 7% to 19.6%, with a mean of 12.26:
. summarize unemp if type != 3

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       unemp |        10       12.26     4.44877          7       19.6
Having Canada in the data becomes a nuisance at this point, so we drop it:
. drop if type == 3
(1 observation deleted)
Two commands create a dummy variable named unemp2 with values of 0 when
unemployment is below average (12.26), 1 when unemployment is equal to or above average,
and missing when unemp is missing. In reading the second command, recall that Stata’s sorting
and relational operators treat missing values as very large numbers.

. generate unemp2 = 0 if unemp < 12.26
(7 missing values generated)

. replace unemp2 = 1 if unemp >= 12.26 & unemp < .
(5 real changes made)
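The same dummy variable could be built in a single step, because a true/false expression in
parentheses evaluates to 1 or 0. A minimal sketch (unemp2b is simply our own name for the
copy):

. generate unemp2b = (unemp >= 12.26) if unemp < .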
We might want to group the values of a measurement variable, thereby creating an ordered-
category or ordinal variable. The autocode function (see “Using Functions” earlier in this
chapter) provides automatic grouping of measurement variables. To create new ordinal variable
unemp3, which groups values of unemp into three equal-width groups over the interval from
5 to 20, type

. generate unemp3 = autocode(unemp,3,5,20)
(2 missing values generated)
A list of the data shows how the new dummy (unemp2) and ordinal (unemp3) variables
correspond to values of the original measurement variable unemp.

. list place unemp unemp2 unemp3

     +--------------------------------------------------+
     |                 place   unemp   unemp2   unemp3 |
     |--------------------------------------------------|
  1. |          Newfoundland    19.6        1       20 |
  2. |  Prince Edward Island    19.1        1       20 |
  3. |           Nova Scotia    13.9        1       15 |
  4. |         New Brunswick    13.8        1       15 |
  5. |                Quebec    13.2        1       15 |
     |--------------------------------------------------|
  6. |               Ontario     9.3        0       10 |
  7. |              Manitoba     8.5        0       10 |
  8. |          Saskatchewan       7        0       10 |
  9. |               Alberta     8.4        0       10 |
 10. |      British Columbia     9.8        0       10 |
     |--------------------------------------------------|
 11. |                 Yukon       .        .        . |
 12. | Northwest Territories       .        .        . |
     +--------------------------------------------------+
Both strategies just described dealt appropriately with missing values, so that Canadian
places with missing values on unemp likewise receive missing values on the variables derived
from it. A simpler approach works best if our data contain no missing values. To
illustrate, we begin by dropping the Yukon and Northwest Territories:

. drop if unemp >= .
(2 observations deleted)

A greater-than-or-equal-to inequality such as unemp >= . will select any user-specified
missing value codes, in addition to the default code “.” Type help missing for details.
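Stata’s extended missing-value codes run from .a through .z, and all of them sort above every
ordinary number and above the default “.”. As a small illustration with a hypothetical survey
variable age, we might reserve .a for refusals recorded as 999; the >= . test still catches such
observations, while == . would not:

. replace age = .a if age == 999
. count if age >= .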
Having dropped observations with missing values, we now can use the group function
to create an ordinal variable not with approximately equal-width groupings, as autocode
did, but instead with groupings of approximately equal size. We do this in two steps. First,
sort the data (assuming no missing values) on the variable of interest. Second, generate a new
variable using the group(#) function, where # indicates the number of groups desired. The
example below divides our 10 Canadian provinces into 5 groups.

. sort unemp

. generate unemp5 = group(5)

. list place unemp unemp2 unemp3 unemp5
     +------------------------------------------------------------+
     |                place   unemp   unemp2   unemp3   unemp5 |
     |------------------------------------------------------------|
  1. |         Saskatchewan       7        0       10        1 |
  2. |              Alberta     8.4        0       10        1 |
  3. |             Manitoba     8.5        0       10        2 |
  4. |              Ontario     9.3        0       10        2 |
  5. |     British Columbia     9.8        0       10        3 |
     |------------------------------------------------------------|
  6. |               Quebec    13.2        1       15        3 |
  7. |        New Brunswick    13.8        1       15        4 |
  8. |          Nova Scotia    13.9        1       15        4 |
  9. | Prince Edward Island    19.1        1       20        5 |
 10. |         Newfoundland    19.6        1       20        5 |
     +------------------------------------------------------------+
Another difference is that autocode assigns values equal to the upper bound of each
interval, whereas group simply assigns 1 to the first group, 2 to the second, and so forth.
Using Explicit Subscripts with Variables
When Stata has data in memory, it also defines certain system variables that describe those data.
For example, _N represents the total number of observations. _n represents the observation
number:  _n = 1 for the first observation, _n = 2 for the second, and so on to the last observation
(_n = _N). If we issue a command such as the following, it creates a new variable, caseID,
equal to the number of each observation as presently sorted:

. generate caseID = _n

Sorting the data another way will change each observation’s value of _n, but its caseID value
will remain unchanged. Thus, if we do sort the data another way, we can later return to the
earlier order by typing
. sort caseID

Creating and saving unique case identification numbers that store the order of observations at
an early stage of dataset development can greatly facilitate later data management.
We can use explicit subscripts with variable names, to specify particular observation
numbers. For example, the 6th observation in dataset canada1.dta (if we have not dropped or
resorted any observations) is Quebec. Consequently, pop[6] refers to Quebec’s population,
7334.2 thousand.

. display pop[6]
7334.2002

Similarly, pop[12] is the Yukon’s population:

. display pop[12]
30.1
Explicit subscripting and the _n system variable have additional relevance when our data
form a series. If we had the daily stock market price of a particular stock as a variable named
price, for instance, then either price or, equivalently, price[_n] denotes the value of the nth
observation or day. price[_n-1] denotes the previous day’s price, and price[_n+1] denotes
the next day’s price. To generate a new variable, difprice, equal to the change in price since
the previous day, type

. generate difprice = price - price[_n-1]

Chapter 13, on time series analysis, returns to this topic.
Importing Data from Other Programs
Previous sections illustrated how to enter and edit data by typing into the Data Editor. If our
original data reside in an appropriately formatted spreadsheet, a shortcut can speed up this
work: it may be possible to copy and paste multi-column blocks of data (not including column
labels) directly into the Data Editor. This requires some care and perhaps experimentation,
because Stata will interpret any column containing non-numeric values as a string variable.
Single columns (variables) of data could also be pasted into the Data Editor from a spreadsheet
or word processor document. Once data have been successfully pasted into Editor columns,
we assign variable names, labels, and so on in the usual manner.
Copy-and-paste methods are quick, but for larger projects it is important to have tools
that work directly with computer files created by other programs. Such files fall into two
general categories: raw-data ASCII (text) files, which can be read into Stata with the
appropriate Stata commands; and system files, which must be translated to Stata format by a
special third-party program before Stata can read them.
To illustrate ASCII file methods, we return to the Canadian data of Table 2.1. Suppose
that instead of typing these data into Stata’s Data Editor, we typed them into our word
processor with at least one space between each value. String values must be in double quotes
if they contain internal spaces, as does “Prince Edward Island”. For other string values quotes
are optional. Word processors allow the option of saving documents as ASCII (text) files, a
simpler and more universal type than the word processor’s usual saved-file format. We can
thus create an ASCII file named canada.raw that looks something like this:
"Canada" 29606.1 10.6 75.1 81.1
"Newfoundland" 575.4 19.6 73.9 79.8
"Prince Edward Island" 136.1 19.1 74.8 81.3
"Nova Scotia" 937.8 13.9 74.2 80.4
"New Brunswick" 760.1 13.8 74.8 80.6
"Quebec" 7334.2 13.2 74.5 81.2
"Ontario" 11100.3 9.3 75.5 81.1
"Manitoba" 1137.5 8.5 75 80.8
"Saskatchewan" 1015.6 7 75.2 81.8
"Alberta" 2747 8.4 75.5 81.4
"British Columbia" 3766 9.8 75.8 81.4
"Yukon" 30.1 . 71.3 80.4
"Northwest Territories" 65.8 . 70.2 78
Note the use of periods, not blanks, to indicate missing values for the Yukon and Northwest
Territories. If the dataset should have five variables, then for every observation, exactly five
values (including periods for missing values) must exist.
infile reads into memory an ASCII file, such as canada.raw, in which the values are
separated by one or more whitespace characters — blanks, tabs, and newlines (carriage return,
line feed, or both) — or by commas. Its basic form is

. infile variable-list using filename.raw

With purely numeric data, the variable list could be omitted, in which case Stata assigns the
names var1, var2, var3, and so forth. On the other hand, we might want to give each variable
a distinctive name. We also need to identify string variables individually. For canada.raw, the
infile command might be

. infile str30 place pop unemp mlife flife using canada.raw, clear
(13 observations read)
The infile variable list specifies variables in the order that they appear in the data file. The
clear option drops any current data from memory before reading in the new file.
If any string variables exist, their names must each be preceded by a str# statement.
str30, for example, informs Stata that the next-named variable (place) is a string variable
with as many as 30 characters. Actually, none of the Canadian place names involve more than
21 characters, but we do not need to know that in advance. It is often easier to overestimate
string variable lengths. Then, once data are in memory, use compress to ensure that no
variable takes up more space than it needs. The compress command automatically changes
all variables to their most memory-efficient storage type.

. compress
place was str30 now str21

. describe
Contains data
  obs:            13
 vars:             5
 size:           533 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
place           str21  %21s
pop             float  %9.0g
unemp           float  %9.0g
mlife           float  %9.0g
flife           float  %9.0g
-------------------------------------------------------------------------------
Sorted by:
We can now proceed to label variables and data as described earlier. At any point, the
commands save canada0 (or save canada0, replace) would save the new
dataset in Stata format, as file canada0.dta. The original raw-data file, canada.raw, remains
unchanged on disk.
If our variables have non-numeric values (for example, “male” and “female”) that we want
to store as labeled numeric variables, then adding the option automatic will accomplish
this. For example, we might read in raw survey data through this infile command:

. infile gender age income vote using survey.raw, automatic
Spreadsheet and database programs commonly write ASCII files that have only one
observation per line, with values separated by tabs or commas. To read these files into Stata,
use insheet. Its general syntax resembles that of infile, with options telling Stata
whether the data are delimited by tabs, commas, or other characters. For example, assuming
tab-delimited data,

. insheet variable-list using filename.raw, tab

Or, assuming comma-delimited data with the first row of the file containing variable names
(also comma-delimited),

. insheet variable-list using filename.raw, comma names

With insheet we do not need to separately identify string variables. If we include no
variable list, and do not have variable names in the file’s first row, Stata automatically assigns
the variable names var1, var2, var3 ....  Errors will occur if some values in our ASCII file are
not separated by tabs, commas, or some other delimiter as specified in the insheet
command.
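As a concrete sketch, suppose the Canadian data had been exported from a spreadsheet as a
comma-delimited file named canada.csv whose first row holds the variable names (the file
name here is hypothetical):

. insheet using canada.csv, comma names clear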
Raw data files created by other statistical packages can be in “fixed-column” format, where
the values are not necessarily delimited at all, but do occupy predefined column positions. Both
infile and the more specialized command infix permit Stata to read such files. In the
command syntax itself, or in a “data dictionary” existing in a separate file or as the first part of
the data file, we have to specify exactly how the columns should be read.
Here is a simple example. Data exist in an ASCII file named nfresour.raw:
198624087641691000
198725247430001044
198825138637481086
198925358964371140
1990    8615731195
1991    7930001262
These data concern natural resource production in Newfoundland. The four variables occupy
fixed column positions: columns 1–4 are the years (1986...1991); columns 5–8 measure
forestry production in thousands of cubic meters (2408...missing); columns 9–14 measure mine
production in thousands of dollars (764,169...793,000); and columns 15–18 are the consumer
price index relative to 1986 (1000...1262). Notice that in fixed-column format, unlike space-
or tab-delimited files, blanks indicate missing values, and the raw data contain no decimal
points. To read nfresour.raw into Stata, we specify each variable’s column position:
. infix year 1-4 wood 5-8 mines 9-14 CPI 15-18 using nfresour.raw, clear
(6 observations read)

. list

     +--------------------------------+
     | year   wood    mines      CPI |
     |--------------------------------|
  1. | 1986   2408   764169     1000 |
  2. | 1987   2524   743000     1044 |
  3. | 1988   2513   863748     1086 |
  4. | 1989   2535   896437     1140 |
  5. | 1990      .   861573     1195 |
  6. | 1991      .   793000     1262 |
     +--------------------------------+
More complicated fixed-column formats might require a data “dictionary.” Data
dictionaries can be straightforward, but they offer many possible choices. Typing help
infix or help infile2 obtains brief outlines of these commands. For more examples
and explanation, consult the User's Guide and reference manuals. Stata also can load, write,
or view data from ODBC (Open Database Connectivity) sources; see help odbc.
What if we need to export data from Stata to some other, non-ODBC program? The
outfile command writes ASCII files to disk. A command such as the following will create
a space-delimited ASCII file named canada6.raw, containing whatever data were in memory:

. outfile using canada6
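If the receiving program prefers comma-separated values, outfile can supply them as well;
the file name below simply continues the same hypothetical example:

. outfile using canada6, comma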
The infile, insheet, infix, and outfile commands just described all
manipulate raw data in ASCII files. A second, very quick, possibility is to copy your data from
Stata’s Browser and paste this directly into a spreadsheet such as Excel. Often the best option,
however, is to transfer data directly between the specialized system files saved by various
spreadsheet, database, or statistical programs. Several third-party programs perform such
translations. Stat/Transfer, for example, will transfer data across many different formats
including dBASE, Excel, FoxPro, Gauss, JMP, Lotus, MATLAB, Minitab, OSIRIS, Paradox,
S-Plus, SAS, SPSS, SYSTAT, and Stata. It is available through Stata Corporation
(www.stata.com) or from its maker, Circle Systems (www.stattransfer.com). Transfer programs
prove indispensable for analysts working in multi-program environments or exchanging data
with colleagues.
Combining Two or More Stata Files
We can combine Stata datasets in two general ways: append a second dataset that contains
additional observations; or merge with other datasets that contain new variables or values.
In keeping with this chapter’s Canadian theme, we will illustrate these procedures using data
on Newfoundland. File newf1.dta records the province’s population for 1985 to 1989.
. use newf1, clear
(Newfoundland 1985-89)

. describe

Contains data from C:\data\newf1.dta
  obs:             5                          Newfoundland 1985-89
 vars:             2                          3 Jul 2005 10:49
 size:            50 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
year            int    %9.0g                  Year
pop             float  %9.0g                  Population
-------------------------------------------------------------------------------
Sorted by:

. list

     +---------------+
     | year      pop |
     |---------------|
  1. | 1985   580700 |
  2. | 1986   580200 |
  3. | 1987   568200 |
  4. | 1988   568000 |
  5. | 1989   570000 |
     +---------------+
File newf2.dta has population and unemployment counts for some later years:

. use newf2
(Newfoundland 1990-95)

. describe
Contains data from C:\data\newf2.dta
  obs:             6                          Newfoundland 1990-95
 vars:             3                          3 Jul 2005 10:49
 size:            84 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
year            float  %9.0g                  Year
pop             float  %9.0g                  Population
jobless         float  %9.0g                  Number of people unemployed
-------------------------------------------------------------------------------
Sorted by:

. list

     +---------------------------+
     | year      pop   jobless |
     |---------------------------|
  1. | 1990   573400     42000 |
  2. | 1991   573500     45000 |
  3. | 1992   575600     49000 |
  4. | 1993   584400     49000 |
  5. | 1994   582400     50000 |
  6. | 1995   575449         . |
     +---------------------------+
To combine these datasets, with newf2.dta already in memory, we use the append command:
. append using newf1

. list

     +---------------------------+
     | year      pop   jobless |
     |---------------------------|
  1. | 1990   573400     42000 |
  2. | 1991   573500     45000 |
  3. | 1992   575600     49000 |
  4. | 1993   584400     49000 |
  5. | 1994   582400     50000 |
     |---------------------------|
  6. | 1995   575449         . |
  7. | 1985   580700         . |
  8. | 1986   580200         . |
  9. | 1987   568200         . |
 10. | 1988   568000         . |
     |---------------------------|
 11. | 1989   570000         . |
     +---------------------------+
Because variable jobless occurs in newf2 (1990 to 1995) but not in newf1, its 1985 to 1989
values are missing in the combined dataset. We can now put the observations in order from
earliest to latest and save these combined data as a new file, newf3.dta:

. sort year

. list
     +---------------------------+
     | year      pop   jobless |
     |---------------------------|
  1. | 1985   580700         . |
  2. | 1986   580200         . |
  3. | 1987   568200         . |
  4. | 1988   568000         . |
  5. | 1989   570000         . |
     |---------------------------|
  6. | 1990   573400     42000 |
  7. | 1991   573500     45000 |
  8. | 1992   575600     49000 |
  9. | 1993   584400     49000 |
 10. | 1994   582400     50000 |
     |---------------------------|
 11. | 1995   575449         . |
     +---------------------------+

. save newf3
append might be compared to lengthening our dataset, taping additional observations onto
the bottom of the data already in memory. merge, in contrast, adds new variables or values by
matching observations. File newf4.dta contains further Newfoundland time series: the numbers
of births and divorces over the years 1980 to 1994. It has some observations in common with
our earlier dataset newf3.dta, as well as one variable (year) in common, but it also has two new
variables not present in newf3.dta.
. use newf4
(Newfoundland 1980-94)

. describe

Contains data from C:\data\newf4.dta
  obs:            15                          Newfoundland 1980-94
 vars:             3                          3 Jul 2005 10:49
 size:           150 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
year            int    %9.0g                  Year
births          int    %9.0g                  Number of births
divorces        int    %9.0g                  Number of divorces
-------------------------------------------------------------------------------
Sorted by:
. list

     +----------------------------+
     | year   births   divorces |
     |----------------------------|
  1. | 1980    10332        555 |
  2. | 1981    11310        569 |
  3. | 1982     9173        625 |
  4. | 1983     9630        711 |
  5. | 1984     8560        590 |
     |----------------------------|
  6. | 1985     8080        561 |
  7. | 1986     8320        610 |
  8. | 1987     7656       1002 |
  9. | 1988     7396        884 |
 10. | 1989     7996        981 |
     |----------------------------|
 11. | 1990     7354        973 |
 12. | 1991     6929        912 |
 13. | 1992     6689        867 |
 14. | 1993     6360        930 |
 15. | 1994     6295        933 |
     +----------------------------+
We want to merge newf3 with newf4, matching observations according to year wherever
possible. To accomplish this, both datasets must be sorted by the index variable (which in this
example is year). We earlier issued a sort year command before saving newf3.dta, so we
now do the same with newf4.dta. Then we merge the two, specifying year as the index variable
to match.

. sort year

. merge year using newf3
. describe

Contains data from newf4.dta
  obs:            16                          Newfoundland 1980-94
 vars:             6                          3 Jul 2005 10:49
 size:           304 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
year            int    %9.0g                  Year
births          int    %9.0g                  Number of births
divorces        int    %9.0g                  Number of divorces
pop             float  %9.0g                  Population
jobless         float  %9.0g                  Number of people unemployed
_merge          byte   %8.0g
-------------------------------------------------------------------------------
Sorted by:
Note:  dataset has changed since last saved
. list

     +----------------------------------------------------------+
     | year   births   divorces      pop   jobless   _merge |
     |----------------------------------------------------------|
  1. | 1980    10332        555        .         .        1 |
  2. | 1981    11310        569        .         .        1 |
  3. | 1982     9173        625        .         .        1 |
  4. | 1983     9630        711        .         .        1 |
  5. | 1984     8560        590        .         .        1 |
     |----------------------------------------------------------|
  6. | 1985     8080        561   580700         .        3 |
  7. | 1986     8320        610   580200         .        3 |
  8. | 1987     7656       1002   568200         .        3 |
  9. | 1988     7396        884   568000         .        3 |
 10. | 1989     7996        981   570000         .        3 |
     |----------------------------------------------------------|
 11. | 1990     7354        973   573400     42000        3 |
 12. | 1991     6929        912   573500     45000        3 |
 13. | 1992     6689        867   575600     49000        3 |
 14. | 1993     6360        930   584400     49000        3 |
 15. | 1994     6295        933   582400     50000        3 |
     |----------------------------------------------------------|
 16. | 1995        .          .   575449         .        2 |
     +----------------------------------------------------------+
If the master and using datasets contain some of the same variables, then by default, values
from the master data (those already in memory) are retained, and those of the “using” data are
ignored. The merge command has several options, however, that override this default. A
command of the following form would allow any missing values in the master data to be
replaced by corresponding nonmissing values found in the using data (here, newf5.dta):

. merge year using newf5, update

A command such as the following causes any values from the master data to be replaced by
nonmissing values from the using data, if the latter are different:

. merge year using newf5, update replace
The same value of the merging variable can occur more than once in the master data; for
example, suppose that the year 1990 occurs twice. Then values from the using data will be
matched with each occurrence of year == 1990 in the master data. You can use this behavior
for tasks such as combining background data on individual patients with data on any number
of separate doctor visits they made. Although merge makes this and many other
data-management tasks straightforward, analysts should look closely at the results to be certain
that the command is accomplishing what they intend.
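As a sketch of that patient-and-visits situation (every file and variable name here is
hypothetical), the visit-level file would be the master data and the patient-level file the using
data, with both files sorted on the identifier beforehand:

. use visits, clear
. sort patientid
. merge patientid using patients

Each visit record then picks up the matching patient’s background variables.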
As a diagnostic aid, merge automatically creates a new variable called _merge. Unless
update was specified, _merge codes have the following meanings:

   1   Observation from the master dataset only.
   2   Observation from the using dataset only.
   3   Observation from both master and using data (using values ignored if different).
If the update option was specified, _merge codes convey what happened:

   1   Observation from the master dataset only.
   2   Observation from the using dataset only.
   3   Observation from both, master data agrees with using.
   4   Observation from both, master data updated if missing.
   5   Observation from both, master data replaced if different.

Before performing another merge operation, it will be necessary to discard or rename this
variable. For example,

. drop _merge

Or,

. rename _merge _merge1
I
I
We can merge multiple datasets with a single merge command. For example, if newf5.dta
through newfS.dta are four datasets, each sorted by the variable year, then merging all four with
the master dataset could be accomplished as follows.
. merge year using newfS newf6 newf7 newfB,
update replace
Other merge options include checks on whether the merging-variable values are unique, and
theabilitytospecify which variables to keep for the final dataset. Type help merge for
details.
I
Transposing, Reshaping, or Collapsing Data

Long after a dataset has been created, we might discover that for some analytical purposes it
has the wrong organization. Fortunately, several commands facilitate drastic restructuring of
datasets. We will illustrate these using data (growth1.dta) on recent population growth in five
eastern provinces of Canada. In these data, unlike our previous examples, province names are
represented by a numerical variable with eight-character labels.
. use growth1, clear
(Eastern Canada growth)

. describe

Contains data from C:\data\growth1.dta
  obs:             5                          Eastern Canada growth
 vars:             5                          3 Jul 2005 10:48
 size:           105 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
provinc2        byte   %8.0g       provinc2   Eastern Canadian province
grow92          float  %9.0g                  Pop. gain in 1000s, 1991-92
grow93          float  %9.0g                  Pop. gain in 1000s, 1992-93
grow94          float  %9.0g                  Pop. gain in 1000s, 1993-94
grow95          float  %9.0g                  Pop. gain in 1000s, 1994-95
-------------------------------------------------------------------------------
Sorted by:
. list

     +------------------------------------------------+
     | provinc2   grow92   grow93   grow94   grow95 |
     |------------------------------------------------|
  1. | New Brun       10      2.5      2.2      2.4 |
  2. | Newfound      4.5       .8       -3     -5.8 |
  3. | Nova Sco     12.1      5.8      3.5      3.9 |
  4. |  Ontario    174.9    169.1    120.9    163.9 |
  5. |   Quebec     80.6     77.4     48.5     47.1 |
     +------------------------------------------------+
In this organization, population growth for each year is stored as a separate variable. We
could analyze changes in the mean or variation of population growth from year to year. On the
other hand, given this organization, Stata could not readily draw a simple time plot of
population growth against year, nor can Stata find the correlation between population growth
in New Brunswick and Newfoundland. All the necessary information is here, but such analyses
require different organizations of the data.
One simple reorganization involves transposing variables and observations. In effect, the
dataset rows become its columns, and vice versa. This is accomplished by the xpose
command. The option clear is required with this command, because it always clears the
present data from memory. Including the varname option creates an additional variable
(named _varname) in the transposed dataset, containing the original variable names as strings.

. xpose, clear varname
. describe

Contains data
  obs:             5
 vars:             6
 size:           160 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
v1              float  %9.0g
v2              float  %9.0g
v3              float  %9.0g
v4              float  %9.0g
v5              float  %9.0g
_varname        str8   %9s
-------------------------------------------------------------------------------
Sorted by:
Note:  dataset has changed since last saved
. list

     +-------------------------------------------------+
     |   v1     v2     v3      v4     v5   _varname |
     |-------------------------------------------------|
  1. |    1      2      3       4      5   provinc2 |
  2. |   10    4.5   12.1   174.9   80.6     grow92 |
  3. |  2.5     .8    5.8   169.1   77.4     grow93 |
  4. |  2.2     -3    3.5   120.9   48.5     grow94 |
  5. |  2.4   -5.8    3.9   163.9   47.1     grow95 |
     +-------------------------------------------------+
Value labels are lost along the way, so provinces in the transposed dataset are indicated
only by their numbers (1 = New Brunswick, 2 = Newfoundland, and so on). The second
through last values in each column are the population gains for that province, in thousands.
Thus, variable v1 has a province identification number (1, meaning New Brunswick) in its first
row, and New Brunswick’s population growth values for 1992 to 1995 in its second through
fifth rows. We can now find correlations between population growth in different provinces, for
instance, by typing a correlate command with an in 2/5 (second through fifth
observations only) qualifier:
. correlate v1-v5 in 2/5
(obs=4)

             |       v1       v2       v3       v4       v5
-------------+---------------------------------------------
          v1 |   1.0000
          v2 |   0.8058   1.0000
          v3 |   0.9742   0.8978   1.0000
          v4 |   0.5070   0.4803   0.6204   1.0000
          v5 |   0.6526   0.9362   0.8049   0.6765   1.0000
The strongest correlation appears between the growth of neighboring maritime provinces New
Brunswick (v1) and Nova Scotia (v3): r = .9742. Newfoundland’s (v2) growth has a much
weaker correlation with that of Ontario (v4): r = .4803.
More sophisticated restructuring is possible through the reshape command. This
command switches datasets between two basic configurations termed “wide” and “long.”
Dataset growth1.dta is initially in wide format.

. use growth1, clear
(Eastern Canada growth)

. list
     +------------------------------------------------+
     | provinc2   grow92   grow93   grow94   grow95 |
     |------------------------------------------------|
  1. | New Brun       10      2.5      2.2      2.4 |
  2. | Newfound      4.5       .8       -3     -5.8 |
  3. | Nova Sco     12.1      5.8      3.5      3.9 |
  4. |  Ontario    174.9    169.1    120.9    163.9 |
  5. |   Quebec     80.6     77.4     48.5     47.1 |
     +------------------------------------------------+
A reshape command switches this to long format.
. reshape long grow, i(provinc2) j(year)
(note: j = 92 93 94 95)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                        5   ->      20
Number of variables                   5   ->       3
j variable (4 values)                     ->   year
xij variables:
                  grow92 grow93 ... grow95   ->   grow
-----------------------------------------------------------------------------
Listing the data shows how they were reshaped. A sepby() option with the list
command produces a table with horizontal lines visually separating the provinces, instead of
every five observations (the default).

. list, sepby(provinc2)
     +---------------------------+
     | provinc2   year    grow |
     |---------------------------|
  1. | New Brun     92      10 |
  2. | New Brun     93     2.5 |
  3. | New Brun     94     2.2 |
  4. | New Brun     95     2.4 |
     |---------------------------|
  5. | Newfound     92     4.5 |
  6. | Newfound     93      .8 |
  7. | Newfound     94      -3 |
  8. | Newfound     95    -5.8 |
     |---------------------------|
  9. | Nova Sco     92    12.1 |
 10. | Nova Sco     93     5.8 |
 11. | Nova Sco     94     3.5 |
 12. | Nova Sco     95     3.9 |
     |---------------------------|
 13. |  Ontario     92   174.9 |
 14. |  Ontario     93   169.1 |
 15. |  Ontario     94   120.9 |
 16. |  Ontario     95   163.9 |
     |---------------------------|
 17. |   Quebec     92    80.6 |
 18. |   Quebec     93    77.4 |
 19. |   Quebec     94    48.5 |
 20. |   Quebec     95    47.1 |
     +---------------------------+
. label data "Eastern Canadian growth--long"

. label variable grow "Population growth in 1000s"

. save growth2
file C:\data\growth2.dta saved
The reshape command above began by stating that we want to put the dataset in long
form. Next, it named the new variable to be created, grow. The i(provinc2) option
specified the observation identifier, or the variable whose unique values denote logical
observations. In this example, each province forms a logical observation. The j(year)
option specifies the sub-observation identifier, or the variable whose unique values (within each
logical observation) denote sub-observations. Here, the sub-observations are years within each
province.
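The same syntax extends to several measurement variables at once. As a minimal sketch with hypothetical variables (a wide file holding pop1990, pop1995, inc1990, and inc1995 for each country), one reshape call converts both stubs together:

. reshape long pop inc, i(country) j(year)

Here pop and inc are the variable stubs, country identifies the logical observation, and the year suffixes become values of the new j variable.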
Figure 2.1 shows a possible use for the long-format dataset. With one graph command,
we can now produce time plots comparing the population gains in New Brunswick,
Newfoundland, and Nova Scotia (observations for which provinc2 < 4). The graph
command below calls for connected-line plots of grow (as the y-axis variable)
against year (x axis) if provinc2 < 4, with horizontal lines at y = 0 (zero population growth),
and separate plots for each value of provinc2.
. graph twoway connected grow year if provinc2 < 4, yline(0) by(provinc2)

[Figure 2.1: connected-line time plots of population growth, 1992-1995, for New Brunswick, Newfoundland, and Nova Scotia; graphs by Eastern Canadian province]
Declines in their fisheries during the early 1990s contributed to economic hardships in these
three provinces. Growth slowed dramatically in New Brunswick and Nova Scotia, while
Newfoundland (the most fisheries-dependent province) actually lost population.
reshape works equally well in reverse, to switch data from "long" to "wide" format.
Dataset growth2.dta serves as an example of long format.

. use growth2, clear
(Eastern Canadian growth--long)
. list, sepby(provinc2)

     +-----------------------------+
     | provinc2    grow   year |
     |-----------------------------|
  1. | New Brun      10     92 |
  2. | New Brun     2.5     93 |
  3. | New Brun     2.2     94 |
  4. | New Brun     2.4     95 |
     |-----------------------------|
  5. | Newfound     4.5     92 |
  6. | Newfound      .8     93 |
  7. | Newfound      -3     94 |
  8. | Newfound    -5.8     95 |
     |-----------------------------|
  9. | Nova Sco    12.1     92 |
 10. | Nova Sco     5.8     93 |
 11. | Nova Sco     3.5     94 |
 12. | Nova Sco     3.9     95 |
     |-----------------------------|
 13. |  Ontario   174.9     92 |
 14. |  Ontario   169.1     93 |
 15. |  Ontario   120.9     94 |
 16. |  Ontario   163.9     95 |
     |-----------------------------|
 17. |   Quebec    80.6     92 |
 18. |   Quebec    77.4     93 |
 19. |   Quebec    48.5     94 |
 20. |   Quebec    47.1     95 |
     +-----------------------------+
To convert this to wide format, we use reshape wide:

. reshape wide grow, i(provinc2) j(year)
(note: j = 92 93 94 95)

Data                                   long   ->   wide
------------------------------------------------------------------------
Number of obs.                           20   ->      5
Number of variables                       3   ->      5
j variable (4 values)                  year   ->   (dropped)
xij variables:
                                       grow   ->   grow92 grow93 ... grow95
------------------------------------------------------------------------
. list

     +-----------------------------------------------------+
     | provinc2   grow92   grow93   grow94   grow95 |
     |-----------------------------------------------------|
  1. | New Brun       10      2.5      2.2      2.4 |
  2. | Newfound      4.5       .8       -3     -5.8 |
  3. | Nova Sco     12.1      5.8      3.5      3.9 |
  4. |  Ontario    174.9    169.1    120.9    163.9 |
  5. |   Quebec     80.6     77.4     48.5     47.1 |
     +-----------------------------------------------------+
Notice that we have recreated the organization of dataset growth1.dta.
Another important tool for restructuring datasets is the collapse command, which
creates an aggregated dataset of statistics (for example, means, medians, or sums). The long
growth2 dataset has four observations for each province.

. use growth2, clear
(Eastern Canadian growth--long)

. list, sepby(provinc2)

     +-----------------------------+
     | provinc2    grow   year |
     |-----------------------------|
  1. | New Brun      10     92 |
  2. | New Brun     2.5     93 |
  3. | New Brun     2.2     94 |
  4. | New Brun     2.4     95 |
     |-----------------------------|
  5. | Newfound     4.5     92 |
  6. | Newfound      .8     93 |
  7. | Newfound      -3     94 |
  8. | Newfound    -5.8     95 |
     |-----------------------------|
  9. | Nova Sco    12.1     92 |
 10. | Nova Sco     5.8     93 |
 11. | Nova Sco     3.5     94 |
 12. | Nova Sco     3.9     95 |
     |-----------------------------|
 13. |  Ontario   174.9     92 |
 14. |  Ontario   169.1     93 |
 15. |  Ontario   120.9     94 |
 16. |  Ontario   163.9     95 |
     |-----------------------------|
 17. |   Quebec    80.6     92 |
 18. |   Quebec    77.4     93 |
 19. |   Quebec    48.5     94 |
 20. |   Quebec    47.1     95 |
     +-----------------------------+
Applying collapse with the by(provinc2) option creates a new dataset containing the
mean of grow for each value of provinc2: one observation per group, that is, one province.

. collapse (mean) grow, by(provinc2)
. list

     +----------------------+
     | provinc2        grow |
     |----------------------|
  1. | New Brun       4.275 |
  2. | Newfound   -.8750001 |
  3. | Nova Sco       6.325 |
  4. |  Ontario       157.2 |
  5. |   Quebec        63.4 |
     +----------------------+
Unless we specify a new variable name, as with grow in the previous example or births and
deaths below, the collapsed variable takes on the same name as the old variable. In the command
below, for instance, the summed births and deaths keep their original names, while the mean and
median of income are stored under the new names meaninc and medinc.

. collapse (sum) births deaths (mean) meaninc = income
     (median) medinc = income, by(provinc2)
collapse can create variables based on the following summary statistics:

mean     Means (the default; used if the type of statistic is not specified)
sd       Standard deviations
sum      Sums
rawsum   Sums ignoring optionally specified weight
count    Number of nonmissing observations
max      Maximums
min      Minimums
median   Medians
p1       1st percentiles
p2       2nd percentiles (and so forth to p99)
iqr      Interquartile ranges
Weighting Observations
Stata understands four types of weighting:

aweight   Analytical weights, used in weighted least squares (WLS) regression and similar
          procedures.
fweight   Frequency weights, counting the number of duplicated observations. Frequency
          weights must be integers.
iweight   Importance weights, however you define "importance."
pweight   Probability or sampling weights, equal to the inverse of the probability that an
          observation is included due to the sampling strategy.
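Weights of any type attach to a command in square brackets after the variable list. As a brief sketch with hypothetical variables (count holding frequencies and wt holding analytical weights):

. summarize income [fweight = count]

. regress income education [aweight = wt]

The bracketed weight goes after the variable list and before any comma-separated options.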
Researchers sometimes speak of "weighted data." This might mean that the original sampling
scheme selected observations in a deliberately disproportionate way, as reflected by weights
equal to 1/(probability of selection). Appropriate use of pweight can compensate for
disproportionate sampling in certain analyses. On the other hand, "weighted data" might mean
something different: an aggregate dataset, perhaps constructed from a frequency table or
cross-tabulation, with one or more variables indicating how many times a particular value or
combination of values occurred. In that case, we need fweight.
Not all types of weighting have been defined for all types of analyses. We cannot, for
example, use pweight with the tabulate command. Using weights in any analysis
requires a clear understanding of what we want weighting to accomplish in that particular
analysis. The weights themselves can be any variable in the dataset.
The following small dataset (nfschool.dta), containing results from a survey of 1,381 rural
Newfoundland high school students, illustrates a simple application of frequency weighting.

. describe

Contains data from C:\data\nfschool.dta
  obs:             6                          Newf. school/univer. (Seyfrit 93)
 vars:             3                          3 Jul 2005 10:50
 size:            48 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
---------------------------------------------------------------------------
univers         byte   %8.0g       yes        Expect to attend university?
year            byte   %8.0g                  What year of school now?
count           int    %8.0g                  observed frequency
---------------------------------------------------------------------------
Sorted by:

. list, sep(3)

     +--------------------------+
     | univers   year   count |
     |--------------------------|
  1. |      no     10     210 |
  2. |      no     11     260 |
  3. |      no     12     274 |
     |--------------------------|
  4. |     yes     10     224 |
  5. |     yes     11     235 |
  6. |     yes     12     178 |
     +--------------------------+
At first glance, the dataset seems to contain only 6 observations, and when we cross-tabulate
whether students expect to attend a university (univers) by their current year in high
school (year), we get a table with one observation per cell.
. tabulate univers year

 Expect to |
    attend |      What year of school now?
university |        10         11         12 |     Total
-----------+---------------------------------+----------
        no |         1          1          1 |         3
       yes |         1          1          1 |         3
-----------+---------------------------------+----------
     Total |         2          2          2 |         6
To understand these data, we need to apply frequency weights. The variable count gives
frequencies: 210 of these students are tenth graders who said they did not expect to attend a
university, 260 are eleventh graders who said no, and so on. Specifying [fweight =
count] obtains a cross-tabulation showing responses of all 1,381 students.
. tabulate univers year [fweight = count]

 Expect to |
    attend |      What year of school now?
university |        10         11         12 |     Total
-----------+---------------------------------+----------
        no |       210        260        274 |       744
       yes |       224        235        178 |       637
-----------+---------------------------------+----------
     Total |       434        495        452 |     1,381
Carrying the analysis further, we might add options asking for a table with column
percentages (col), no cell frequencies (nof), and a chi-squared test of independence (chi2). This
reveals a statistically significant relationship (P = .001). The percentage of students expecting
to go to college declines with each year of high school.
. tabulate univers year [fw = count], col nof chi2

 Expect to |
    attend |      What year of school now?
university |        10         11         12 |     Total
-----------+---------------------------------+----------
        no |     48.39      52.53      60.62 |     53.87
       yes |     51.61      47.47      39.38 |     46.13
-----------+---------------------------------+----------
     Total |    100.00     100.00     100.00 |    100.00

          Pearson chi2(2) =  13.8967   Pr = 0.001
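When memory allows, an alternative to frequency weighting is to convert the aggregated records into individual-level observations with the expand command. A brief sketch (count is the frequency variable already in nfschool.dta; no new variables are assumed):

. expand count

. tabulate univers year

After expand, each of the 1,381 students appears as a separate observation, so the unweighted cross-tabulation should match the frequency-weighted one above.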
Survey data often reflect complex sampling designs, based on one or more of the following:
disproportionate sampling -- for example, oversampling particular subpopulations, in order
to get enough cases to draw conclusions about them.
clustering -- for example, selecting voting precincts at random, and then sampling individuals
within the selected precincts.
stratification -- for example, dividing precincts into "urban" and "rural" strata, and then
sampling precincts and/or individuals within each stratum.
Complex sampling designs require specialized analytical tools; pweights and Stata's
ordinary analytical commands do not suffice.
Stata's procedures for complex survey data include special tabulation, means, regression,
logit, probit, tobit, and Poisson regression commands. Before applying these commands, users
must first set up their data by identifying variables that indicate the PSUs (primary sampling
units) or clusters, strata, finite population correction, and probability weights. This is
accomplished through the svyset command. For example:

. svyset precinct [pweight=invPsel], strata(urb_rur) fpc(finite)

For each observation in this example, the value of variable precinct identifies the PSU or cluster,
values of urb_rur identify the strata, finite gives the finite population correction, and invPsel
gives the probability weight or inverse of the probability of selection. After the data have been
svyset and saved, the survey analytical procedures are relatively straightforward.
Commands are typically prefixed by svy:, as in

. svy: mean income

or

. svy: regress income education experience gender
The Survey Data Reference Manual contains full details and examples of Stata’s extensive
survey-analysis capabilities. For online guidance, type help svy and follow the links to
particular commands.
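Once the design has been declared, it can be reviewed at any time. A minimal sketch, assuming the svyset example above has been run on data in memory:

. svydescribe

svydescribe summarizes the strata and primary sampling units that Stata will use, which helps confirm that the design was declared as intended.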
Creating Random Data and Random Samples
The pseudo-random number function uniform() lies at the heart of Stata's ability to
generate random data or to sample randomly from the data at hand. The Base Reference
Manual (Functions) provides a technical description of this 32-bit pseudo-random generator.
If we presently have data in memory, then a command such as the following creates a new
variable named randnum, having apparently random 16-digit values over the interval [0,1) for
each case in the data.

. generate randnum = uniform()
Alternatively, we might create a random dataset from scratch. Suppose we want to start a
new dataset containing 10 random values. We first clear any other data from memory (if they
were valuable, save them first). Next, set the number of observations desired for the new
dataset. Explicitly setting the seed number makes it possible to later reproduce the same
random results. Finally, we generate our random variable.

. clear

. set obs 10
obs was 0, now 10

. set seed 12345

. generate randnum = uniform()
. list

     +----------+
     |  randnum |
     |----------|
  1. |  .309106 |
  2. | .6852276 |
  3. | .1277815 |
  4. | .5617244 |
  5. | .3134516 |
     |----------|
  6. | .5047374 |
  7. | .7232868 |
  8. | .4176817 |
  9. | .6768828 |
 10. | .3657581 |
     +----------+
In combination with Stata's algebraic, statistical, and special functions, uniform() can
simulate values sampled from a variety of theoretical distributions. If we want newvar sampled
from a uniform distribution over [0,428) instead of the usual [0,1), we type

. generate newvar = 428 * uniform()

These will still be 16-digit values. Perhaps we want only integers from 1 to 428 (inclusive):
. generate newvar = 1 + trunc(428 * uniform())

To simulate 1,000 rolls of a six-sided die, type

. clear

. set obs 1000
obs was 0, now 1000

. generate roll = 1 + trunc(6 * uniform())

. tabulate roll

       roll |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        171       17.10       17.10
          2 |        164       16.40       33.50
          3 |        150       15.00       48.50
          4 |        170       17.00       65.50
          5 |        169       16.90       82.40
          6 |        176       17.60      100.00
------------+-----------------------------------
      Total |      1,000      100.00
We might theoretically expect 16.67% ones, 16.67% twos, and so on, but in any one sample like
these 1,000 “rolls,” the observed percentages will vary randomly around their expected values.
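The same logical-expression trick generates categorical outcomes directly. A small sketch (coin is a hypothetical new variable; the expression in parentheses evaluates to 1 when true and 0 when false):

. generate coin = uniform() < .5

. tabulate coin

Roughly half the observations should receive a 1 ("heads"), with sample-to-sample variation just as in the die example.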
To simulate 1,000 rolls of a pair of six-sided dice, type

. generate dice = 2 + trunc(6 * uniform()) + trunc(6 * uniform())

. tabulate dice

       dice |      Freq.     Percent        Cum.
------------+-----------------------------------
          2 |         26        2.60        2.60
          3 |         62        6.20        8.80
          4 |         78        7.80       16.60
          5 |        120       12.00       28.60
          6 |        153       15.30       43.90
          7 |        149       14.90       58.80
          8 |        146       14.60       73.40
          9 |         96        9.60       83.00
         10 |         88        8.80       91.80
         11 |         53        5.30       97.10
         12 |         29        2.90      100.00
------------+-----------------------------------
      Total |      1,000      100.00
We can use _n to begin an artificial dataset as well. The following commands create a new
5,000-observation dataset with one variable named index, containing values from 1 to 5,000.

. set obs 5000
obs was 0, now 5000

. generate index = _n

. summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       index |      5000      2500.5     1443.52          1       5000
It is possible to generate variables from a normal (Gaussian) distribution using
uniform(). The following example creates a dataset with 2,000 observations and two
variables: z from an N(0,1) population, and x from N(500,75).

. clear

. set obs 2000
obs was 0, now 2000

. generate z = invnormal(uniform())

. generate x = 500 + 75*invnormal(uniform())

The actual sample means and standard deviations differ slightly from their theoretical values:
. summarize

    Variable |       Obs        Mean    Std. Dev.         Min        Max
-------------+-----------------------------------------------------------
           z |      2000    .0375032    1.026784   -3.536209   4.038878
           x |      2000     503.322    75.68551    244.3384   743.1377
If z follows a normal distribution, v = e^z follows a lognormal distribution. To form a
lognormal variable v based upon a standard normal z,

. generate v = exp(invnormal(uniform()))

To form a lognormal variable w based on an N(100,15) distribution,

. generate w = exp(100 + 15*invnormal(uniform()))

Taking logarithms, of course, normalizes a lognormal variable.
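As a quick check of that last point, we can take logs of the simulated variable and summarize the result (logw is a hypothetical new variable name):

. generate logw = ln(w)

. summarize logw

The mean and standard deviation of logw should fall near 100 and 15, the parameters of the underlying normal distribution.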
To simulate y values drawn randomly from an exponential distribution with mean and
standard deviation mu = sigma = 3,

. generate y = -3 * ln(uniform())

For other means and standard deviations, substitute other values for 3.
X1 follows a chi-squared distribution with one degree of freedom, which is the same as a squared
standard normal:

. generate X1 = (invnormal(uniform()))^2

By similar logic, X2 follows a chi-squared with two degrees of freedom:

. generate X2 = (invnormal(uniform()))^2 + (invnormal(uniform()))^2

Other statistical distributions, including t and F, can be simulated along the same lines. In
addition, programs have been written for Stata to generate random samples following
distributions such as binomial, Poisson, gamma, and inverse Gaussian.
Although invnormal(uniform()) can be adjusted to yield normal variates with
particular correlations, a much easier way to do this is through the drawnorm command. To
generate 5,000 observations from N(0,1), type

. clear

. drawnorm z, n(5000)

. summarize

    Variable |       Obs        Mean    Std. Dev.         Min        Max
-------------+-----------------------------------------------------------
           z |      5000   -.0005951    1.019788   -4.518918   3.923464
Suppose we instead want three variables to have the following population correlations:

          x1     x2     x3
   x1    1.0    0.4   -0.8
   x2    0.4    1.0    0.0
   x3   -0.8    0.0    1.0
The procedure for creating such data requires first defining the correlation matrix C, and then
using C in the drawnorm command:

. mat C = (1, .4, -.8 \ .4, 1, 0 \ -.8, 0, 1)

. drawnorm x1 x2 x3, means(0,100,500) sds(1,15,75) corr(C)
. summarize x1-x3

    Variable |       Obs        Mean    Std. Dev.         Min        Max
-------------+-----------------------------------------------------------
          x1 |      5000    .0024364     1.01646   -3.478467   3.598916
          x2 |      5000    100.1826    14.91325    46.13897   150.7634
          x3 |      5000    500.7747    76.93925    211.5596   769.6074
. correlate x1-x3
(obs=5000)

             |       x1       x2       x3
-------------+---------------------------
          x1 |   1.0000
          x2 |   0.3951   1.0000
          x3 |  -0.8134  -0.0072   1.0000
Compare the sample variables' correlations and means with the theoretical values given earlier.
Random data generated in this fashion can be viewed as samples drawn from theoretical
populations. We should not expect the samples to have exactly the theoretical population
parameters (in this example, an x3 mean of 500, an x1-x2 correlation of 0.4, an x1-x3 correlation of
-0.8, and so forth).
The command sample makes unobtrusive use of uniform()'s random generator to
obtain random samples of the data in memory. For example, to discard all but a 10% random
sample of the original data, type

. sample 10

When we add an in or if qualifier, sample applies only to those observations meeting
our criteria. For example,

. sample 10 if age < 26

would leave us with a 10% sample of those observations with age less than 26, plus 100% of
the original observations with age of 26 or more.
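Because sample relies on the same random-number generator, setting the seed first makes a particular random subsample reproducible. A brief sketch (the seed value is arbitrary):

. set seed 54321

. sample 10

Running the same two commands again on the original dataset would select the identical 10% subsample.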
We could also select random samples of a particular size. To discard all but 90 randomly
selected observations from the dataset in memory, type

. sample 90, count
The sections in Chapter 14 on bootstrapping and Monte Carlo simulations provide further
examples of random sampling and random variable generation.
Writing Programs for Data Management
Data management on larger projects often involves repetitive or error-prone tasks that are best
handled by writing specialized Stata programs. Advanced programming can become very
technical, but we can also begin by writing simple programs that consist of nothing more than
a sequence of Stata commands, typed and saved as an ASCII file. ASCII files can be created
using your favorite word processor or text editor, which should offer "ASCII text file" among
its options under File - Save As. An even easier way to create such text files is through Stata's
Do-file Editor, which is brought up by clicking Window - Do-file Editor or its toolbar icon.
Alternatively, bring up the Do-file Editor by typing the command doedit, or doedit
filename if filename exists.
For example, using the Do-file Editor we might create a file named canada.do, which
contains the commands to read in a raw data file named canada.raw, then label the dataset and
its variables, compress it, and save it in Stata format. The commands in this file are identical
to those seen earlier when we went through the example step by step.

infile str30 place pop unemp mlife flife using canada.raw
label data "Canadian dataset 1"
label variable pop "Population in 1000s, 1995"
label variable unemp "% 15+ population unemployed, 1995"
label variable mlife "Male life expectancy years"
label variable flife "Female life expectancy years"
compress
save canada1, replace
Once this canada.do file has been written and saved, simply typing the following command
causes Stata to read the file and run each command in turn:

. do canada
Such batch-mode programs, termed "do-files," are usually saved with a .do extension. More
elaborate programs (defined by do-files or "automatic do" files) can be stored in memory, and
can call other programs in turn, creating new Stata commands and opening worlds of
possibility for adventurous analysts. The Do-file Editor has several other features that you
might find useful. Chapter 3 describes a simple way to use do-files in building graphs. For
further information, see the Getting Started manual on Using the Do-file Editor.
Stata ordinarily interprets the end of a command line as the end of that command. This is
reasonable onscreen, where the line can be arbitrarily long, but does not work as well when we
are typing commands in a text file. One way to avoid line-length problems is through the
#delimit command, which can set some other character as the end-of-command delimiter.
In the following example, we make a semicolon the delimiter; then type two long commands
that do not end until a semicolon appears; and then finally reset the delimiter to its usual value,
a carriage return (cr):
#delimit ;
infile str30 place pop unemp mlife flife births deaths
    marriage medinc mededuc using newcan.raw;
order place pop births deaths marriage medinc mededuc
    unemp mlife flife;
#delimit cr
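Another way to spread one command across several lines in a do-file, without changing the delimiter, is the /// line-continuation comment (it works in do-files, not in commands typed interactively). A brief sketch using the same hypothetical raw file:

infile str30 place pop unemp mlife flife births deaths ///
    marriage medinc mededuc using newcan.raw

Everything after /// is treated as a comment, and the command continues on the next line.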
Stata normally pauses each time the Results window becomes full of information, and waits
to proceed until we press any key (or click the --more-- message). Instead of pausing, we can ask
Stata to continue scrolling until the output is complete. Typed in the Command window or as
part of a program, the command

. set more off

calls for continuous scrolling. This is convenient if our program produces much screen output
that we don't want to see, or if it is writing to a log file that we will examine later. Typing

. set more on

returns to the usual mode of waiting for keyboard input before scrolling.
Managing Memory
When we use or File - Open a dataset, Stata reads the disk file and loads it into memory.
Loading the data into memory permits rapid analysis, but it is only possible if the dataset can
fit within the amount of memory currently allocated to Stata. If we try to open a dataset that
is too large, we get an elaborate error message saying "no room to add more observations," and
advising what to do next.
. use C:\data\gbank2.dta
(Scientific surveys off S. Newfoundland)
no room to add more observations
    An attempt was made to increase the number of observations beyond what is
    currently possible.  You have the following alternatives:

    1.  Store your variables more efficiently; see help compress.  (Think of
        Stata's data area as the area of a rectangle; Stata can trade off
        width and length.)

    2.  Drop some variables or observations; see help drop.

    3.  Increase the amount of memory allocated to the data area using the
        set memory command; see help memory.
r(901);
Small Stata allocates a fixed amount of memory to data, and this limit cannot be changed.
Intercooled Stata and Stata/SE versions are flexible, however. Default allocations equal 1
megabyte for Intercooled and 10 megabytes for Stata/SE. If we have Intercooled or Stata/SE,
running on a computer with enough physical memory, we can set Stata's memory allocation
higher with the set memory command. To allocate 20 megabytes to data, type

. set memory 20m

Current memory allocation

                    current                                 memory usage
    settable          value     description                 (1M = 1024k)
    --------------------------------------------------------------------
    set maxvar         5000     max. variables allowed           1.733M
    set memory          20M     max. data space                 20.000M
    set matsize         400     max. RHS vars in models          1.254M
                                                            -------------
                                                                22.987M

If there are data already in memory, first type the command clear to remove them. To reset
the memory allocation "permanently," so it will be the same next time we start up, type

. set memory 20m, permanently

In the example given earlier, gbank2.dta is an 11.3-megabyte dataset that would not fit into
the default allocation. Asking for a 20-megabyte allocation has now given us more than enough
room for these data.
Contains data from C:\data\gbank2.dta
  obs:        74,078                          Spring scientific surveys NAFO
                                                3KLNOPQ, 1971-93
                                              2 Mar 2000 21:28
                                              (46.0% of memory free)

[The variable table of the describe output is omitted here. It lists the survey variables --
vessel, trip, set, year, month, day, set type, stratum, NAFO division, area grid square,
light, wind direction and force, bottom type, time, duration of set, distance towed, gear,
depth and temperature measures, position (latitude and longitude), species codes with Latin
and common names, number of individual fish, and catch weight in kilograms -- together
with their storage types, display formats, and variable labels.]

Sorted by:  id
When we describe the data (above), Stata reports "46.0% of memory free," meaning not 46% of the
computer's total resources, but 46% of the 20 megabytes we allocated for Stata data. It is
usually advisable to ask for more memory than our data actually require. Many statistical and
data-management operations consume additional memory, in part because they temporarily
create new variables as they work.
It is possible to set memory to values higher than the computer's available physical
memory. In that case, Stata uses "virtual memory," which is really disk storage. Although
virtual memory allows bypassing hardware limitations, it can be terribly slow. If you regularly
work with datasets that push the limits of your computer, you might soon conclude that it is
time to buy more memory.
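Two commands help keep track of how the allocation is being used. A brief sketch, applicable to whatever dataset is in memory:

. compress

. memory

compress stores each variable in the smallest suitable storage type, and memory reports how much of the allocated data space is currently in use.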
Type help limits to see a list of limitations in Stata, not only on dataset size but also
on most commands, including matrix sizes, command lengths, lengths of names, and numbers of
variables in commands. Some of these limitations can be adjusted by the user.
Graphs
Graphs appear in every chapter of this book, one indication of their value and integration
with other analyses in Stata. Indeed, graphics have always been one of Stata's strengths, and
reason enough for many users to choose Stata over other packages. The graph command
evolved incrementally from Stata versions 1 through 7. Stata version 8 marked a major step
forward, however: graph underwent a fundamental redesign, expanding its capabilities for
sophisticated, publication-quality analytical graphics. Output appearance and choices were
much improved as well. With the new graph command syntax and defaults, or alternatively
the menus, attractive (and publishable) basic graphs are quite easy to draw.
Graphically ambitious users who visualize non-basic graphs will find their efforts supported by
a truly impressive array of tools and options, described in the 500-page Graphics Reference
Manual.
In the much shorter space of this chapter, the spectrum from elementary to creative
graphing will be covered, taking an example rather than syntax-oriented approach (see the
Graphics Reference Manual or help graph for thorough coverage of syntax). We begin
by illustrating seven basic types of graphs.
histogram      histograms
graph twoway   two-variable scatterplots, line plots, and many others
graph matrix   scatterplot matrices
graph box      box plots
graph pie      pie charts
graph bar      bar charts
graph dot      dot plots
Each of these types has many options, and that is especially true for the versatile
graph twoway family.
More specialized graphs such as symmetry plots, quantile plots, and quantile-normal plots
exist for examining details of variable distributions. A few examples of these, and also
of graphs for industrial quality control, appear in this chapter. Type help graph other
for more details.
Finally, the chapter concludes with techniques particularly useful in building data-rich, self-contained
graphics for publication. Such techniques include adding text to graphs, overlaying
multiple twoway plots, retrieving and reformatting saved graphs, and combining multiple
graphs into one. As our graphing commands grow more complicated, simple batch programs
(do-files) can help to write and re-use them. The full range of graphical choices goes far
beyond what this book can cover, but the concluding examples point out a few of the
possibilities. Later chapters supply further examples.
The Graphics menu provides point-and-click access to most of these graphing procedures.
A note to long-time Stata users: The graphical capabilities of Stata 8 and 9 outshine those
of earlier versions. For analysts comfortable with old Stata, there is much new material to
learn. Menus allow a quick entry, and the new graphics commands, like the old ones, follow
a consistent logic that becomes clear with practice. Fortunately, the changeover need not be
sudden. Version 7-style graphics remain available if needed; they have been moved to the
command graph7. For example, an old-version scatterplot would formerly have been drawn
by the command

. graph income education

which does not work in the newer Stata. Instead, the command

. graph7 income education

will reproduce the familiar old type of graph. The options of graph7 are similar to those of
the old-style graph. To see an updated version of this same scatterplot, type the new
graphics command

. graph twoway scatter income education

Further examples of new commands appear in the next section, which should give a sense of
what has changed (and what is familiar) with the redesigned graphical capabilities.
Example Commands
. histogram y, frequency
Draws a histogram of variable y, showing frequencies on the vertical axis.

. histogram y, start(0) width(10) norm fraction
Draws a histogram of y with bins 10 units wide, starting at 0. Adds a normal curve based on
the sample mean and standard deviation, and shows fractions of the data on the vertical axis.

. histogram y, by(x, total) fraction
In one figure, draws separate histograms of y for each value of x, and also a "total"
histogram for the sample as a whole.

. kdensity x, generate(xpoints xdensity) width(20) biweight
Produces and graphs a kernel density estimate of the distribution of x. Two new variables
are created: xpoints containing the x values at which the density is estimated, and xdensity
with the density estimates themselves. width(20) specifies the halfwidth of the kernel,
in units of the variable x. (If width() is not specified, the default follows a simple
formula for "optimal" width.) The biweight option in this example calls for a biweight
kernel, instead of the default epanechnikov.

. graph twoway scatter y x
Displays a basic two-variable scatterplot of y against x.
. graph twoway lfit y x || scatter y x
Visualizes the linear regression of y on x by overlaying two twoway graphs: the
regression (linear fit, or lfit) line, and the y vs. x scatterplot. To include a 95%
confidence band for the regression line, replace lfit with lfitci.

. graph twoway scatter y x, xlabel(0(10)100) ylabel(-3(1)6, angle(horizontal))
Constructs a scatterplot of y vs. x, with the x axis labeled at 0, 10, ..., 100. The y axis is
labeled at -3, -2, ..., 6, with labels written horizontally instead of vertically (the default).

. graph twoway scatter y x, mlabel(country)
Constructs a scatterplot of y vs. x, with data points (markers) labeled by the values of variable
country.

. graph twoway scatter y x1, by(x2)
In one figure, draws separate y vs. x1 scatterplots for each value of x2.

. graph twoway scatter y x1 [fweight = population], msymbol(Oh)
Draws a scatterplot of y vs. x1. Marker symbols are hollow circles (Oh), with their size
(area) proportional to the frequency-weight variable population.

. graph twoway connected y time
A basic time plot of y against time. Data points are shown connected by line segments. To
include line segments but no data-point markers, use line instead of connected:

. graph twoway line y time

. graph twoway line y1 y2 time
Draws a time plot (in this example, a line plot) with two y variables that both have the same
scale, and are graphed against an x variable named time.

. graph twoway line y1 time, yaxis(1) || line y2 time, yaxis(2)
Draws a time plot with two y variables that have different scales, by overlaying two
individual line plots. The left-hand y axis, yaxis(1), gives the scale for y1, while the
right-hand y axis, yaxis(2), gives the scale for y2.

. graph matrix x1 x2 x3 x4 y
Constructs a scatterplot matrix, showing all possible scatterplot pairs among the variables
listed.

. graph box y1 y2 y3
Constructs box plots of variables y1, y2, and y3.

. graph box y, over(x) yline(.22)
Constructs box plots of y for each value of x, and draws a horizontal line at y = .22.

. graph pie a b c
Draws one pie chart with slices indicating the relative amounts of variables a, b, and c. The
variables must have similar units.
. graph bar (sum) a b c
Shows the sums of variables a, b, and c as side-by-side bars in a bar chart. To obtain means
instead of sums, type graph bar (mean) a b c. Other options include bars
representing medians, percentiles, or counts of each variable.

. graph bar (mean) a, over(x)
Draws a bar chart showing the mean of variable a at each value of variable x.

. graph bar (asis) a b c, over(x) stack
Draws a bar chart in which the values ("as is") of variables a, b, and c are stacked on top
of one another, at each value of variable x.
. graph dot (median) y, over(x)
Draws a dot plot, in which dots along a horizontal scale mark the median value of y at each
level of x. Other options include means, percentiles, or counts of each variable.

. qnorm y
Draws a quantile-normal plot (normal probability plot) showing quantiles of y versus
corresponding quantiles of a normal distribution.

. rchart x1 x2 x3 x4 x5, connect(l)
Constructs a quality-control R chart graphing the range of values represented by variables
x1 through x5.
Graph options, such as those controlling titles, labels, and tick marks on the axes, are
common across graph types wherever this makes sense. Moreover, the underlying logic of
Stata's graph commands is consistent from one type to the next. These common elements are
the key to gaining graph-building fluency, as the basics begin to fall into place.
Histograms
Histograms, displaying the distribution of measurement variables, are most easily produced
with their own command histogram. For examples, we turn to states.dta, which contains
selected environment and education measures on the 50 U.S. states plus the District of
Columbia (data from the League of Conservation Voters 1991; National Center for Education
Statistics 1992, 1993; World Resources Institute 1993).
. use states
(U.S. states data 1990-91)
. describe

Contains data from c:\data\states.dta
  obs:            51                          U.S. states data 1990-91
 vars:            21                          4 Jul 2005 12:07
 size:         4,080 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
------------------------------------------------------------------------------
state           str20  %20s                   State
region          byte   %9.0g       region     Geographical region
pop             float  %9.0g                  1990 population
area            float  %9.0g                  Land area, square miles
density         float  %7.2f                  People per square mile
metro           float  %5.1f                  Metropolitan area population, %
waste           float  %5.2f                  Per capita solid waste, tons
energy          int    %8.0g                  Per capita energy consumed, Btu
miles           int    %8.0g                  Per capita miles/year, 1,000
toxic           float  %5.2f                  Per capita toxics released, lbs
green           float  %5.2f                  Per capita greenhouse gas, tons
house           byte   %8.0g                  House '91 environ. voting, %
senate          byte   %8.0g                  Senate '91 environ. voting, %
csat            int    %9.0g                  Mean composite SAT score
vsat            int    %8.0g                  Mean verbal SAT score
msat            int    %8.0g                  Mean math SAT score
percent         byte   %9.0g                  % HS graduates taking SAT
expense         int    %9.0g                  Per pupil expenditures prim&sec
income          long   %10.0g                 Median household income, $1,000
high            float  %9.0g                  % adults HS diploma
college         float  %9.0g                  % adults college degree
------------------------------------------------------------------------------
Sorted by:  state
Figure 3.1 shows a simple histogram of college, the percentage of a state’s over-25
population with a bachelor’s degree or higher. It was produced by the following command:
. histogram college, frequency title("Figure 3.1")
[Figure 3.1: histogram of college (% adults college degree), with frequency on the vertical axis]
Under the Prefs - Graph Preferences menus, we have the choice of several pre-designed
schemes for the default colors and shading of our graphs. Custom schemes can be defined
as well. The examples in this book employ the s2mono (monochrome) scheme, which among
other things calls for shaded margins around each graph. The s1mono scheme does not have
such margins. Experimenting with the different monochrome and color schemes helps to
determine which works best for a particular purpose. A graph drawn and saved under one
scheme can subsequently be retrieved and re-saved under a different one, as described later in
this chapter.
Options can be listed in any order following the comma in a graph command. Figure 3.1
illustrates two options: frequency (instead of density, the default) is shown on the vertical axis,
and the title Figure 3.1 appears over the graph. Once a graph is onscreen, menu choices
provide the easiest way to print it, save it to disk, or cut and paste it into another program such
as a word processor.
Figure 3.1 reveals the positive skew of this distribution, with a mode above 15 and an
outlier around 35. It is hard to describe the graph more specifically because the bars do not line
up with x-axis tick marks. Figure 3.2 contains a version with several improvements (based on
some quick experiments to find the right values):
1. The x axis is labeled from 12 to 34, in increments of 2.
2. The y axis is labeled from 0 to 12, in increments of 2.
3. Tick marks are drawn on the y axis from 1 to 13, in increments of 2.
4. The histogram's first bar (bin) starts at 12.
5. The width of each bar (bin) is 2.

. histogram college, frequency title("Figure 3.2") xlabel(12(2)34)
     ylabel(0(2)12) ytick(1(2)13) start(12) width(2)
[Figure 3.2: histogram of college with x axis labeled 12(2)34, y axis (frequency) labeled 0(2)12, and bins of width 2 starting at 12]
Figure 3.2 helps us to describe the distribution more specifically. For example, we now see that
in 13 states, the percent with college degrees is between approximately 16 and 18.
Other useful histogram options include:

bin(#)       Draw a histogram with # bins (bars). We can specify either bin(#) or, as
             in Figure 3.2, start(#) and width(#), but not both.
percent      Show percentages on the vertical axis. ylabel and ytick then refer to
             percentage values. Another possibility, frequency, is illustrated in Figure
             3.2. We could also ask for fraction of the data. The default histogram
             shows density, meaning that bars are scaled so that the sum of their areas
             equals 1.
gap(#)       Leave a gap between bars. # is relative, 0 <= # < 100; experiment to find a
             suitable value.
addlabels    Label the heights of histogram bars. A separate option, addlabopts,
             controls how the labels look.
discrete     Specify discrete data, requiring one bar for each value of x.
norm         Overlay a normal curve on the histogram, based on the sample mean and
             standard deviation.
kdensity     Overlay a kernel-density estimate on the histogram. The option kdenopts
             controls density computation; see help kdensity for details.
With histograms or most other graphs, we can also override the defaults and specify our
own titles for the horizontal and vertical axes. The option ytitle controls y-axis titles, and
xtitle controls x-axis titles. Figure 3.3 illustrates such titles, together with some other
histogram options. Note the incremental buildup from basic (Figure 3.1) to more elaborate
(Figure 3.3) graphs. This is the usual pattern of graph construction in Stata: we start simply,
then experimentally add options to earlier commands retrieved from the Review window, as we
work toward an image that most clearly presents our findings. Figure 3.3 actually is over-elaborate,
but is drawn here to show off multiple options.

. histogram college, frequency title("Figure 3.3") ylabel(0(2)12)
     ytick(1(2)13) xlabel(12(2)34) start(12) width(2) addlabel
     norm gap(15)
[Figure 3.3: histogram of college with frequency labels on the bars, an overlaid normal curve, and gaps between bars]
Suppose we want to see how the distribution of college varies by region. The by() option
obtains a separate histogram for each value of region. Other options work as they do for single
histograms. Figure 3.4 shows an example in which we ask for percentages on the vertical axis,
and the data grouped into 8 bins.
. histogram college, by(region) percent bin(8)

[Figure 3.4: histograms of college by region (West, N. East, South, Midwest), percentages on the vertical axis, 8 bins; graphs by Geographical region]
Figure 3.5, below, contains a similar set of four regional graphs, but includes a fifth that
shows the distribution for all regions combined.

. histogram college, percent bin(8) by(region, total)
[Figure 3.5: histograms of college by region plus a Total panel for all regions combined; graphs by Geographical region]
Axis labeling, tick marks, titles, and the by(varname) or by(varname, total)
options work in a similar fashion with other Stata graphing commands, as seen in the following
sections.
Scatterplots
Basic scatterplots are obtained through commands of the general form

. graph twoway scatter y x

where y is the vertical or y-axis variable, and x the horizontal or x-axis one. For example, again
using the states.dta dataset, we could plot waste (per capita solid waste) against metro (percent
population in metropolitan areas), with the result shown in Figure 3.6. Each point in Figure 3.6
represents one of the 50 U.S. states (or Washington, DC).

. graph twoway scatter waste metro
[Figure 3.6: scatterplot of waste (per capita solid waste, tons) against metro (metropolitan area population, %)]
As with histograms, we can use xlabel, xtick, xtitle, etc. to control axis labels,
tick marks, or titles. Scatterplots also allow control of the shape, color, size, and other
attributes of markers. Figure 3.6 employs the default markers, which are solid circles. The
same effect would result if we included the option msymbol(circle), or wrote this option
in abbreviated form as msymbol(O). msymbol(diamond) or msymbol(D) would
produce a graph with diamond markers, and so forth. The following table lists possible shapes.
msymbol()           Abbreviation      Description
circle              O                 circle, solid
diamond             D                 diamond, solid
triangle            T                 triangle, solid
square              S                 square, solid
plus                +                 plus sign
x                   X                 letter x
smcircle            o                 small circle, solid
smdiamond           d                 small diamond, solid
smsquare            s                 small square, solid
smtriangle          t                 small triangle, solid
smplus              smplus            small plus sign
smx                 x                 small letter x
circle_hollow       Oh                circle, hollow
diamond_hollow      Dh                diamond, hollow
triangle_hollow     Th                triangle, hollow
square_hollow       Sh                square, hollow
smcircle_hollow     oh                small circle, hollow
smdiamond_hollow    dh                small diamond, hollow
smtriangle_hollow   th                small triangle, hollow
smsquare_hollow     sh                small square, hollow
point               p                 very small dot
none                i                 invisible
The mcolor option controls marker colors. For example, the command
. graph twoway scatter waste metro, msymbol(S) mcolor(purple)
would produce a scatterplot in which the symbols were large purple squares. Type help
colorstyle for a list of available colors.
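Marker size can be adjusted in the same spirit with the msize() option. A small sketch combining the three marker options (the particular choices here are arbitrary):

. graph twoway scatter waste metro, msymbol(Dh) mcolor(gs8) msize(large)

This draws hollow gray diamonds somewhat larger than the default; see help marker_options for the full list of marker controls.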
One interesting possibility with scatterplots is to make symbol size (area) proportional to
a third variable, thereby giving the data points different visual "weight." For example, we
might redraw the scatterplot of waste against metro, but make the symbols' size reflect each
state's population (pop). This can be done as shown in Figure 3.7, using the [fweight = ]
(frequency weight) feature. Hollow circles, msymbol(Oh), provide a suitable shape.
Frequency weights are useful with some other graph types as well. Weighting can be a
deceptively complex topic, because "weights" come in several types and have different
meanings in different contexts. For an overview of weighting in Stata, type help weight.

. graph twoway scatter waste metro [fweight = pop], msymbol(Oh)
[Figure 3.7: scatterplot of waste against metro, with hollow-circle markers sized proportional to state population]
Stata's density-distribution sunflower plots provide an alternative to scatterplots for
high-density data. Basically, they resemble scatterplots in which some of the individual
data points are replaced with sunflower-like symbols to indicate more than one observation at
that location. Figure 3.8 shows a sunflower-plot version of Figure 3.6, in which some of the
flower symbols (those with four "petals") represent up to four individual data points, or states.
The table printed after the sunflower command provides a key regarding how many
observations each flower represents. The number of petals and the darkness of the flower
correspond to the density of data.
. sunflower waste metro, addplot(lfit waste metro)

Bin width        =  11.3714
Bin height       =  .286522
Bin aspect ratio =  .0218209
Max obs in a bin =  4
Light            =  3
Dark             =  13
X-center         =  67.55
Y-center         =  .96
Petal weight     =  1

  flower    petal    No. of    No. of    estimated    actual
    type   weight    petals   flowers         obs.      obs.
  ------------------------------------------------------------
    none                                         23        23
   light        1         3         5           15        15
   light        1         4         3           12        12
  ------------------------------------------------------------
                                                 50        50
[Figure 3.8: density-distribution sunflower plot of waste against metro, with a fitted regression line; legend: per capita solid waste (circles), 1 petal = 1 obs., fitted values]
Sunflower plots are most helpful with large datasets in which many observations plot
at the same (or identical) coordinates. The example in Figure 3.8 includes a regression line,
added through the addplot() option.
Markers can also be labeled with the values of another variable, such as state names.
Labeling every point at once, however, would turn the graph into a visual jumble.
Concentrating on one region such as the West seems more feasible. An if qualifier
accomplishes this, producing the results seen in Figure 3.9 below.
. graph twoway scatter waste metro if region==1, mlabel(state)

[Figure 3.9: scatterplot of waste against metro for Western states (region==1), with markers labeled by state name]
Figure 3.10 (below) shows separate waste vs. metro scatterplots for each region. The
relationship between these two variables appears noticeably steeper in the South and Midwest
than it does in the West and Northeast, an impression we will later confirm. The ylabel and
xlabel options in this example give the y- and x-axis labels three-digit (maximum) fixed
display formats with no decimals, making them easier to read in the small subplots.

. graph twoway scatter waste metro, by(region)
     ylabel(, format(%3.0f)) xlabel(, format(%3.0f))
[Figure 3.10: scatterplots of waste against metro by region (West, N. East, South, Midwest); graphs by Geographical region]
Scatterplot matrices, produced by graph matrix, prove useful in multivariate analysis.
They provide a compact display of the relationships between a number of variable pairs,
allowing the analyst to scan for signs of nonlinearity, outliers, or clustering that might affect
statistical modeling. Figure 3.11 shows a scatterplot matrix involving four variables from
states.dta.

. graph matrix miles metro income waste, half msymbol(oh)
[Figure 3.11: lower-triangular scatterplot matrix of miles (per capita miles/year), metro (metropolitan area population, %), income (median household income, $1,000), and waste (per capita solid waste, tons), with small hollow-circle markers]
The half option specified that Figure 3.11 should include only the lower triangular part
of the matrix. The upper triangular part is symmetrical and, for many purposes, redundant.
msymbol(oh) called for small hollow circles as markers, just as we might with a scatterplot.
Control of the axes is more complicated, because there are as many axes as variables; type
help graph_matrix for details.
When the variables of interest include one dependent or "effect" variable and several
independent or "cause" variables, it helps to list the dependent variable last in the graph
matrix variable list. That results in a neat row of dependent-versus-independent variable
graphs across the bottom.
Line Plots
Mechanically, line plots are scatterplots in which the points are connected by line segments.
Like scatterplots, the various types of line plots belong to Stata's versatile graph twoway
family. The scatterplot options that control axis labeling and markers work much the same with
line plots, too. New options control the characteristics of the lines themselves.
Line plots tend to have different uses than scatterplots. For example, as time plots they
depict changes in a variable over time. Dataset cod.dta contains time-series data reflecting the
unhappy story of Newfoundland's Northern Cod fishery. This fishery, which had been among
the world's richest, collapsed in 1992 primarily due to overfishing.
Contains data from C:\data\cod.dta
  obs:            38                          Newfoundland's Northern Cod
                                                fishery, 1960-1997
 vars:             5                          4 Jul 2005 15:02
 size:           684 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
------------------------------------------------------------------------
year            int    %8.0g                  Year
cod             float  %8.0g                  Total landings, 1000t
Canada          int    %8.0g                  Canadian landings, 1000t
TAC             int    %8.0g                  Total Allowable Catch, 1000t
biomass         float  %9.0g                  Estimated biomass, 1000t
------------------------------------------------------------------------
Sorted by:  year
A simple time plot showing Canadian and total landings can be constructed by drawing line
graphs of both variables against year. Figure 3.12 does this, showing the "killer spike" of
international overfishing in the late 1960s, followed by a decade of Canadian fishing pressure
in the 1980s, leading up to the 1992 collapse of the Northern Cod.

. graph twoway line cod Canada year
[Figure 3.12: line plot of Total landings (1000t) and Canadian landings (1000t) against Year, 1960-2000; a legend at the bottom identifies the two lines]
In Figure 3.12, Stata automatically chose a solid line for the first-named y variable, cod, and
a dashed line for the second, Canada. A legend at the bottom explains these meanings. We
could improve this graph by rearranging and relabeling the legend, as illustrated in Figure 3.13.
. graph twoway line cod Canada year,
     legend(label(1 "all nations") label(2 "Canada") position(2) ring(0) rows(2))

[Figure 3.13: the same time plot, with the legend ("all nations" and "Canada") placed inside the plot region at the upper right]
I
The legend option for Figure 3.13 breaks down
as follows. Note that all of these
suboptions occur within the parentheses followine legend
I
label (1
"allnations")
label first-namedvariable “all nations”
label (2
"Canada")
label second-named,v variable “Canada”
position (2)
piace the Iegend at 2 o,clock position (upper right)
ring (01
Place the legend within the plot space
rOWS (2 J
organize the legend to have two rows
toyshowrttehneidSatthe
‘abelS
Pl3CinS tHem Uithin tfK pl0t SpaCe- we Ieave more room
to show the data and create a more attractive, readable figure, legend works similarlv for
other graph styles that have legends. Type help legend_option to see a list ofthe man v
suboptions available.
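If no legend is wanted at all, for instance when the lines are identified some other way, the entire key can be dropped. A minimal sketch:

. graph twoway line cod Canada year, legend(off)

legend(off) suppresses the legend entirely, which is often sensible for single-variable plots where the default key adds little.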
Figures 3.12 and 3.13 simply connect each data point with line segments. Several other
connecting styles are possible, using the connect() option. For example,

connect(stairstep)

or, equivalently,

connect(J)

will cause points to be connected in stairstep (flat, then vertical) fashion. Figure 3.14 illustrates
with a stairstep time plot of the government-set Total Allowable Catch (TAC) variable from
cod.dta.
. graph twoway line TAC year, connect(stairstep)

[Figure 3.14: stairstep time plot of TAC (Total Allowable Catch) against Year]
Other connect() choices are listed below. The default, straight line segments,
corresponds to connect(direct) or connect(l). For more details, see help
connectstyle.

connect()    Abbreviation       Description
none         i                  do not connect
direct       l (letter "el")    connect with straight lines
ascending    L                  direct, but only if x[i+1] > x[i]
stairstep    J                  flat, then vertical
stepstair                       vertical, then flat
Figure 3.15, below, repeats this stairstep plot of TAC, but with some
enhancements of axis labels and titles. The option xtitle("") requests no x-axis title
(because "year" is obvious). We added tick marks at two-year intervals to the x axis, labeled
the y axis at intervals of 100, and printed the y-axis labels horizontally instead of vertically (the
default).
. graph twoway line TAC year, connect(stairstep) clpattern(dash)
     xtitle("") xtick(1960(2)2000) ytitle("Thousands of tons")
     ylabel(0(100)800, angle(horizontal))

[Figure 3.15: stairstep plot of TAC drawn with a dashed line; y axis ("Thousands of tons") labeled 0(100)800 with horizontal labels, x axis ticked every two years, no x-axis title]
Instead of letting Stata determine the line patterns (solid, dashed, etc.) in Figure 3.15, we
used the clpattern(dash) option to call for a dashed line. Possible line pattern choices
are listed in the table below (also see help linepatternstyle).

clpattern()     Description
solid           solid line
dash            dashed line
dot             dotted line
dash_dot        dash then dot
shortdash       short dash
shortdash_dot   short dash followed by dot
longdash        long dash
longdash_dot    long dash followed by dot
blank           invisible line
formula         for example, clpattern(-.) or clpattern(--..)
Before we move on to other examples and types, Figure 3.16 unites the three variables
discussed in this section to create a single graphic showing the tragedy of the Northern Cod.
Note how the connect(), clpattern(), and legend() options work in this
three-variable context.

. graph twoway line cod Canada TAC year, connect(line line stairstep)
     clpattern(solid longdash dash) xtitle("") xtick(1960(2)2000)
     ytitle("Thousands of tons") ylabel(0(100)800, angle(horizontal))
     legend(label(1 "all nations") label(2 "Canada") label(3 "TAC")
     position(2) ring(0) rows(3))
[Figure 3.16: time plot of total landings (all nations), Canadian landings, and TAC, 1960-2000, with the legend inside the plot region]
Connected-Line Plots
Most of the options just described for graph twoway line apply to graph twoway
connected as well. Figure 3.17 shows a default example, a connected-line time plot of the
cod biomass variable (bio) from cod.dta.

. graph twoway connected bio year
[Figure 3.17: connected-line time plot of Estimated biomass (1000t) against Year]
The dataset contains biomass values only for 1978 through 1997, resulting in much empty
space in Figure 3.17. if qualifiers allow us to restrict the range of years. Figure 3.18, below,
does this. It also dresses up the image to show control of marker symbols,
line patterns, axes, and legends. With cod landings and biomass both in the same image, we
can see the biomass falling through the late 1980s, before a crisis was officially
recognized.
. graph twoway connected bio cod year if year > 1977 & year < 1999,
     msymbol(T Oh) clpattern(dash solid) xlabel(1978(2)1996)
     xtick(1979(2)1997) ytitle("Thousands of tons") xtitle("")
     ylabel(0(500)2500, angle(horizontal))
     legend(label(1 "Estimated biomass") label(2 "Total landings")
     position(2) rows(2) ring(0))
[Figure 3.18: connected-line plot of estimated biomass (triangles, dashed line) and total landings (hollow circles, solid line), 1978-1996, with the legend inside the plot region]
Other Twoway Plot Types
In addition to basic line plots and scatterplots, the graph twoway command encompasses
a wide variety of other types. The following table lists the possibilities.

graph twoway   Description
scatter        scatterplot
line           line plot
connected      connected-line plot
scatteri       scatter with immediate arguments (data given in the command line)
area           line plot with shading
bar            twoway bar plot (different from graph bar)
spike          twoway spike plot
dropline       dropline plot (spikes dropped vertically or horizontally to a given value)
dot            twoway dot plot (different from graph dot)
rarea          range plot, shading the area between high and low values
rbar           range plot with bars between high and low values
rspike         range plot with spikes between high and low values
rcap           range plot with capped spikes
rcapsym        range plot with spikes capped with symbols
rscatter       range plot with scatterplot marker symbols
rline          range plot with lines
rconnected     range plot with lines and markers
pcspike        paired-coordinate plot with spikes
pccapsym       paired-coordinate plot with spikes capped with symbols
pcarrow        paired-coordinate plot with arrows
pcbarrow       paired-coordinate plot with arrows having two heads
pcscatter      paired-coordinate plot with markers
pci            pcspike with immediate arguments
pcarrowi       pcarrow with immediate arguments
tsline         time-series plot
tsrline        time-series range plot
mband          straight line segments connect the (x, y) cross-medians within bands
mspline        cubic spline curve connects the (x, y) cross-medians within bands
lowess         LOWESS (locally weighted scatterplot smoothing) curve
lfit           linear regression line
qfit           quadratic regression curve
fpfit          fractional polynomial plot
lfitci         linear regression line with confidence band
qfitci         quadratic regression curve with confidence band
fpfitci        fractional polynomial plot with confidence band
function       line plot of function
histogram      histogram plot
kdensity       kernel density plot
The usual options to control line patterns, marker symbols, and so forth work where appropriate with all twoway commands. For more information about a particular command, type help twoway_mband, help twoway_function, etc. (using any of the names above). Note that graph twoway bar is a different command from graph bar. Similarly, graph twoway dot differs from graph dot. The twoway versions provide various methods for plotting a measurement y variable against a measurement x variable, analogous to a scatterplot or a line plot. The non-twoway versions, on the other hand, provide ways to plot summary statistics (such as means or medians) of one or more measurement y variables against categories of one or more x variables. The twoway versions thus are comparatively specialized, although (as with all twoway plots) they can be overlaid with other twoway plots for more complex graphical effects.
Many of these plot types are most useful in composite figures, constructed by overlaying
two or more simple plots as described later in this chapter. Others produce nice stand-alone
graphs. For example, Figure 3.19 shows an area plot of the Newfoundland cod landings.
. graph twoway area cod Canada year, ytitle("")

Figure 3.19  [area plot of Total landings (1000 t) and Canadian landings (1000 t) vs. Year, 1960-2000]
The shading in area graphs and other types with shaded regions can be controlled through the option bcolor(). Type help colorstyle for a list of the available colors, which include gray scales. The darkest gray, gs0, is actually black. The lightest gray, gs16, is white. Other values are in between. For example, Figure 3.20 shows a light-gray version of this graph.
. graph twoway area cod Canada year, ytitle("") bcolor(gs12 gs14)

Figure 3.20  [light-gray version of the area plot of total and Canadian landings vs. Year, 1960-2000]
Unusually cold atmosphere/ocean conditions played a secondary role in Newfoundland's fisheries disaster, which involved not only the Northern Cod but also other species and populations. For example, key fish species in the neighboring Gulf of St. Lawrence declined during this period as well (Hamilton, Haedrich and Duncan 2003). Dataset gulf.dta describes environment and Northern Gulf cod catches (raw data from DFO 2003).
Contains data from C:\data\gulf.dta
  obs:            56                          Gulf of St. Lawrence environment
                                                and cod fishery
 vars:             7                          10 Jul 2005 11:51
 size:         1,344 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
------------------------------------------------------------------------------
winter          int    %8.0g                  Winter
minarea         float  %9.0g                  Minimum ice area, 1000 km^2
maxarea         float  %9.0g                  Maximum ice area, 1000 km^2
mindays         byte   %8.0g                  Minimum ice days
maxdays         byte   %8.0g                  Maximum ice days
cil             float  %9.0g                  Cold Intermediate Layer
                                                temperature minimum, C
cod             float  %9.0g                  N. Gulf cod catch, 1000 tons
------------------------------------------------------------------------------
Sorted by:  winter
The maximum annual ice cover averaged 173,017 km2 during these years.
. summarize maxarea

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     maxarea |        38    173.0172    37.18623     47.8901   220.1905
Figure 3.21 uses this mean (173 thousand) as the base for a spike plot, in which spikes above and below the line show above- and below-average ice cover, respectively. The yline(173) option draws a horizontal line at 173.
. graph twoway spike maxarea winter if winter > 1963, base(173)
     yline(173) ylabel(40(20)220, angle(horizontal))
     xlabel(1965(5)2000)

Figure 3.21  [spike plot of maximum ice area (1000 km^2) vs. Winter, 1965-2000, with base and reference line at 173]
The base() format of Figure 3.21 emphasizes the succession of unusually harsh winters (above-average maximum ice cover) during the late 1980s and early 1990s, around the time of Newfoundland's fisheries crisis. We also see an earlier spell of mild winters in the early 1980s, and hints of a recent warming trend.
A different view of the same data, in Figure 3.22, employs lowess regression to smooth the time series. The bandwidth option, bwidth(.4), specifies a curve based on smoothed data points that are calculated from weighted regressions within a moving band containing 40% of the sample. Lower bandwidths such as bwidth(.2), or 20% of the data, would give us a more jagged, less smoothed curve that more closely resembles the raw data. Higher bandwidths such as bwidth(.8), the default, will smooth more radically. Regardless of the bandwidth chosen, smoothed points towards either extreme of the x values must be calculated from increasingly narrow bands, and therefore will show less smoothing. Chapter 8 contains more about lowess smoothing.
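To see the bandwidth effect directly, two curves can be drawn in one image using the overlay ("||") syntax covered later in this chapter. The sketch below is our own illustration, not one of the book's figures:

    * Our own illustration (not a book figure): compare two lowess bandwidths in one plot
    graph twoway lowess maxarea winter if winter > 1963, bwidth(.2)   ///
        || lowess maxarea winter if winter > 1963, bwidth(.8)         ///
        || , legend(label(1 "bwidth(.2)") label(2 "bwidth(.8)") ring(0) position(2))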
. graph twoway lowess maxarea winter if winter > 1963, bwidth(.4)
     yline(173) ylabel(40(20)220, angle(horizontal))
     xlabel(1965(5)2000)

Figure 3.22  [lowess-smoothed curve of maximum ice area vs. Winter, 1965-2000, with reference line at 173]
Range plots connect high and low y values at each level of x, using bars, spikes, or shaded areas. Daily stock market prices are often graphed in this way. Figure 3.23 shows a capped spike range plot using the minimum and maximum ice cover variables from gulf.dta.
. graph twoway rcap minarea maxarea winter if winter > 1963,
     ylabel(0(20)220, angle(horizontal)) ytitle("Ice area, 1000 km^2")
     xlabel(1965(5)2000)

Figure 3.23  [capped-spike range plot of minimum and maximum ice area (1000 km^2) vs. Winter, 1965-2000]
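The same high and low series could also be shown as a shaded band rather than capped spikes. The following variant is our own sketch, not one of the book's figures:

    * Our own sketch (not a book figure): shaded band between minimum and maximum ice area
    graph twoway rarea minarea maxarea winter if winter > 1963,   ///
        ytitle("Ice area, 1000 km^2") xlabel(1965(5)2000)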
These examples by no means exhaust the possibilities for twoway graphs. Other
applications appear throughout the book. Later in this chapter, we will see examples involving
overlays of two or more twoway graphs, forming a single image.
Box Plots
Box plots convey information about center, spread, symmetry, and outliers at a glance. To
obtain a single box plot, type a command of the form
. graph box y
If several different variables have roughly similar scales, we can visually compare their
distributions through commands of the form
. graph box w x y z
One of the most common applications for box plots involves comparing the distribution of one variable over categories of a second. Figure 3.24 compares the distribution of college across states of four U.S. regions, from dataset states.dta.

. graph box college, over(region) yline(19.1)

Figure 3.24  [box plots of % adults with college degrees, by region: West, N. East, South, Midwest]
The median proportion of adults with college degrees tends to be highest in the Northeast and lowest in the South. On the other hand, southern states are more variable. Regional medians (lines within boxes) in Figure 3.24 can be compared visually to the 50-state median indicated by the yline(19.1) option. This median was obtained by typing

. summarize college if region < ., detail
Chapter 4 describes the summarize, detail command. The if region < . qualifier above restricted our analysis to observations that have nonmissing values of region; that is, to every place except Washington DC.
The box in a box plot extends from approximate first to third quartiles, a distance called the interquartile range (IQR). It therefore contains approximately the middle 50% of the data. Outliers, defined as observations more than 1.5 IQR beyond the first or third quartile, are plotted individually in a box plot. No outliers appear among the four distributions in Figure 3.24. Stata's box plots define quartiles in the same manner as summarize, detail. This is not the same approximation used to calculate "fourths" for letter-value displays, lv (Chapter 4). See Frigge, Hoaglin, and Iglewicz (1989) and Hamilton (1992b) for more about quartile approximations and their role in identifying outliers.
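As a small worked illustration of the 1.5 IQR rule (our own sketch, not from the book), the fences can be computed from the quartiles that summarize, detail leaves behind in its r() results:

    * Our own sketch: the outlier fences implied by the quartiles of college
    quietly summarize college, detail
    display "lower fence = " r(p25) - 1.5*(r(p75) - r(p25))
    display "upper fence = " r(p75) + 1.5*(r(p75) - r(p25))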
Numerous options control the appearance, shading, and details of boxes in a box plot; see help graph_box for a list. Figure 3.25 demonstrates some of these options, and also the horizontal arrangement of graph hbox, using per capita energy consumption from states.dta. The option over(region, sort(1)) calls for boxes sorted in ascending order according to their medians on the first-named (and in this case, the only) y variable. intensity(30) controls the intensity of shading in the boxes, setting this somewhat lower (less dark) than the default seen in Figure 3.24. Counterintuitively, the vertical line marking the overall median (320) in Figure 3.25 requires a yline option, rather than xline.
. graph hbox energy, over(region, sort(1)) yline(320) intensity(30)

Figure 3.25  [horizontal box plots of per capita energy consumed (Btu), by region in ascending order of medians: N. East, Midwest, West, South]
The energy box plots in Figure 3.25 make clear not only the differences among medians, but also the presence of outliers: four very high-consumption states in the West and South. With a bit of further investigation, we find that these are oil-producing states: Wyoming, Alaska, Texas, and Louisiana. Box plots excel at drawing attention to outliers, which are easily overlooked (and often cause trouble) in other steps of statistical analysis.
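One quick way to carry out that investigation is to list the states above a cutoff read off the plot. The 600 Btu threshold below is our own eyeballed choice, and we assume the dataset's state-name variable is called state; neither detail comes from the book:

    * Our own check (cutoff of 600 chosen by eye from Figure 3.25): identify the outliers
    list state energy if energy > 600 & energy < .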
Pie Charts
Pie charts are popular tools for "presentation graphics," although they have little value for analytical work. Stata's basic pie chart command has the form

. graph pie w x y z

where the variables w, x, y, and z all measure quantities of something in similar units (for example, all are in dollars, hours, or people).
Dataset AKethnic.dta, on the ethnic composition of Alaska's population, provides an illustration. Alaska's indigenous Native population divides into three broad cultural/linguistic groups: Aleut, Indian (including Athabaskan, Tlingit, and Haida), and Eskimo (Yupik and Inupiat). The variables aleut, Indian, eskimo, and nonnativ are population counts for each group, taken from the 1990 U.S. Census. This dataset contains only three observations, representing three types or sizes of communities: cities of 10,000 people or more; towns of 1,000 to 10,000; and villages with fewer than 1,000 people.
Contains data from C:\data\AKethnic.dta
  obs:             3                          Alaska ethnicity 1990
 vars:             7                          4 Jul 2005 12:06
 size:            63 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
------------------------------------------------------------------------------
comtype         byte   %8.0g       popcat     Community type (size)
pop             float  %9.0g                  Population
n               int    %8.0g                  number of communities
aleut           int    %8.0g                  Aleut
Indian          int    %8.0g                  Indian
eskimo          int    %8.0g                  Eskimo
nonnativ        float  %9.0g                  Non-Native
------------------------------------------------------------------------------
Sorted by:
The majority of the state's population is non-Native, as clearly seen in a pie chart (Figure 3.26). The option pie(3, explode) causes the third-named variable, eskimo, to be "exploded," or offset from the rest of the pie. pie(4, color(gs13)) causes the fourth slice (nonnativ) to be shaded a light gray; other possibilities such as color(blue) or color(cranberry) exist. Type help colorstyle for a complete list. plabel(3 percent, gap(20)) causes a percentage label for the third slice to be placed a gap of 20 relative radial units from the center. We see that about 8% of Alaska's population is Eskimo (Inupiat or Yupik). The legend option calls for a four-row box placed at the 11 o'clock position within the plot.
. graph pie aleut Indian eskimo nonnativ, pie(3, explode)
     pie(4, color(gs13)) plabel(3 percent, gap(20))
     legend(position(11) rows(4) ring(0))

Figure 3.26  [pie chart of Alaska population: Aleut, Indian, Eskimo (8.072%, exploded slice), Non-Native]
Non-Natives are the dominant group in Figure 3.26, but if we draw separate pies for each type of community by adding a by(comtype) option, new details emerge (Figure 3.27, next page). The option angle0() specifies the angle of the first slice of pie. Setting this first-slice angle at 0 (horizontal) orients the pies in Figure 3.27 in such a way that the labels are more readable. The figure shows that whereas Natives are only a small fraction of the population in Alaska cities, they constitute the majority among those living in villages. In particular, Eskimos make up a large fraction of villagers: 35% across all villages, and more than 90% in some. This gives Alaska villages a different character from Alaska cities.
. graph pie aleut Indian eskimo nonnativ, pie(3, explode)
     pie(4, color(gs13)) plabel(3 percent, gap(8))
     legend(rows(1)) by(comtype) angle0(0)

Figure 3.27  [pie charts of ethnic composition by community type (villages, towns, cities); legend: Aleut, Indian, Eskimo, Non-Native; graphs by Community type (size)]
Bar Charts
Although they contain less information than box plots, bar charts provide simple and versatile displays for comparing sets of summary statistics such as means, medians, sums, or counts. To obtain vertical bars showing the mean of y across categories of x, for example, type

. graph bar (mean) y, over(x)

For horizontal bars showing the sum of y across categories of x1 within categories of x2, type

. graph hbar (sum) y, over(x1) over(x2)
The bar chart could display any of the following statistics:
    mean       Means (the default; used if the type of statistic is not specified)
    sd         Standard deviations
    sum        Sums
    rawsum     Sums ignoring optionally specified weight
    count      Numbers of nonmissing observations
    max        Maximums
    min        Minimums
    median     Medians
    p1         1st percentiles
    p2         2nd percentiles (and so forth to p99)
    iqr        Interquartile ranges

This roster of summary statistics is the same as that used by the collapse command (see Chapter 2), and by a number of other commands including graph dot (next section) and table (Chapter 4).
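For instance (our own illustration, not one of the book's figures), median per capita energy consumption by region could be charted from the states.dta variables used earlier:

    * Our own illustration (not a book figure): medians rather than the default means
    graph bar (median) energy, over(region)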
Dataset statehealth.dta contains data on the U.S. states, combining socioeconomic measures from the 1990 Census with several health-risk indicators from the Centers for Disease Control (2003), averaged over 1994-98.
Contains data from C:\data\statehealth.dta
  obs:            51                          Health indicators 1994-96 (CDC)
 vars:            12                          9 Jul 2005 11:56
 size:         3,315 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
------------------------------------------------------------------------------
state           str20  %20s                   US State
region          byte   %9.0g       region     Geographical region
income          long   %10.0g                 Median household income, 1990
income2         float  %11.0g      income2    Median income low or high
high            float  %9.0g                  % adults HS diploma, 1990
college         float  %9.0g                  % adults college degree, 1990
overweight      float  %9.0g                  % overweight
inactive        float  %9.0g                  % inactive in leisure time
smokeM          float  %9.0g                  % male adults smoking
smokeF          float  %9.0g                  % female adults smoking
smokeT          float  %9.0g                  % adults smoking
motor           float  %9.0g                  Age-adjusted motor-vehicle
                                                related deaths/100,000
------------------------------------------------------------------------------
Sorted by:  state
Figure 3.28 shows a bar chart of the median percentage of adults who are inactive in their leisure time, by region. The blabel(bar) option labels each bar with its value, and bar(1, bcolor(gs10)) asks for the bars of the first y variable (here there is only one) to be filled with light gray (gs10). Median inactivity rates are highest in the South (36%), and lowest in the West (21%).
. graph bar (median) inactive, over(region) blabel(bar)
     bar(1, bcolor(gs10))

Figure 3.28  [bar chart of median % inactive in leisure time by region (West, N. East, South, Midwest), with bars labeled by their values]
Figure 3.29 elaborates this idea by adding a second variable, overweight, and coloring its bars a darker gray. The bar labels are size(medium) in Figure 3.29, making them larger than the defaults, size(small), used in Figure 3.28. Other possibilities for size() suboptions include labels that are tiny, medsmall, medlarge, or large. See help textsizestyle for a complete list. Figure 3.29 shows that regional differences in the prevalence of overweight individuals are less pronounced than differences in inactivity, although both variables' medians are highest in the South and Midwest.
. graph bar (median) inactive overweight, over(region)
     blabel(bar, size(medium))
     bar(1, bcolor(gs10)) bar(2, bcolor(gs7))

Figure 3.29  [bar chart of median % inactive and median % overweight by region; legend: P 50 of inactive, P 50 of overweight]
Figure 3.30 charts mean motor-vehicle fatality rates across low- and high-income states (states having median household incomes below or above the national median), revealing a striking correlation with wealth. Within each region, low-income states exhibit higher mean fatality rates. Across both income categories, fatality rates are higher in the South, and lower in the Northeast. The order of the two over() options in the command controls their order in organizing the chart. For this example we chose a horizontal bar chart, or hbar. In such horizontal charts, ytitle, yline, etc. refer to the horizontal axis. yline(17.2) marks the overall mean.
. graph hbar (mean) motor, over(income2) over(region) yline(17.2)
     ytitle("Mean motor-vehicle related fatalities/100,000")

Figure 3.30  [horizontal bar chart of mean motor-vehicle related fatalities/100,000, by income group (low, high) within region]
Bars also can be stacked, as shown in Figure 3.31. This plot, based on the Alaska ethnicity data (AKethnic.dta), employs all the defaults to display ethnic composition by type of community (village, town, or city).

. graph bar (sum) nonnativ aleut Indian eskimo, over(comtype) stack

Figure 3.31  [stacked bar chart of population sums by community type (villages, towns, cities); legend: sum of nonnativ, sum of aleut, sum of Indian, sum of eskimo]
Figure 3.32 redraws this plot with better legend and axis labels. The over option now relabels the community types, so the horizontal axis is more informative. The legend() option specifies four rows in the same vertical order as the bars themselves, and is placed in the 11 o'clock position inside the plot space; it also improves the legend labels. ytitle, ylabel, and ytick options format the vertical axis.
. graph bar (sum) nonnativ aleut Indian eskimo,
     over(comtype, relabel(1 "Villages <1,000" 2 "Towns 1,000-10,000"
     3 "Cities >10,000"))
     stack ytitle(Population)
     ylabel(0(100000)300000) ytick(50000(100000)350000)
     legend(rows(4) order(4 3 2 1) position(11) ring(0)
     label(1 "Non-native") label(2 "Aleut")
     label(3 "Indian") label(4 "Eskimo"))

Figure 3.32  [stacked bar chart of Population by community type (Villages <1,000, Towns 1,000-10,000, Cities >10,000); legend: Eskimo, Indian, Aleut, Non-native]
Whereas the pie charts in Figure 3.27 showed each group's proportions within each community type, this bar chart shows their absolute sizes. Consequently, Figure 3.32 makes clear that, although Natives form the majority in villages, most of Alaska's population lives in cities and is non-Native.
Dot Plots
Dot plots serve much the same purpose as bar charts: visually comparing statistical summaries of one or more measurement variables. The organization and options closely parallel those of the bar chart commands. For a dot plot comparing the medians of variables x, y, z, and w, type

. graph dot (median) x y z w
For a dot plot comparing the mean of y across categories of x, type

. graph dot (mean) y, over(x)
Figure 3.33 shows a dot plot of male and female smoking rates by region, from statehealth.dta. The over option includes a suboption, sort(smokeM), which calls for the regions to be sorted in order of their mean values of smokeM. The marker() options specify a solid triangle for smokeM and a hollow circle for smokeF.

. graph dot (mean) smokeM smokeF, over(region, sort(smokeM))
     marker(1, msymbol(T)) marker(2, msymbol(Oh))
Figure 3.33  [dot plot of mean smokeM and mean smokeF by region, sorted by smokeM: West, N. East, Midwest, South]
Although Figure 3.33 displays only eight means, it does so in a way that facilitates several different comparisons at once. A further advantage of dot plots is their compactness. Dot plots (particularly when rows are sorted by the statistic of interest, as in Figure 3.33) remain easily readable even with a dozen or more rows.
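To see that readability with many rows, a dot plot over all the states could be tried. This is our own illustration, not a book figure:

    * Our own illustration (not a book figure): one dot per state, sorted by smoking rate
    graph dot (mean) smokeT, over(state, sort(1))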
Symmetry and Quantile Plots
Symmetry and quantile plots are less widely used than the summary graphs described above, but convey more detailed information.
A histogram of per-capita energy consumption in the 50 U.S. states (from states.dta) appears in Figure 3.34. The distribution includes a handful of very high-consumption states, which happen to be oil producers. A superimposed normal (Gaussian) curve indicates that the distribution has a lighter-than-normal left tail and a heavier-than-normal right tail, the definition of positive skew.

. histogram energy, start(100) width(100) xlabel(0(100)1000)
     frequency norm

Figure 3.34  [histogram of per capita energy consumed (Btu), 0-1000, with overlaid normal curve]
Figure 3.35 depicts this distribution as a symmetry plot. It plots the distance of the ith observation above the median (vertical) against the distance of the ith observation below the median. All points would lie on the diagonal line if this distribution were symmetrical. Instead, we see that distances above the median grow steadily larger than corresponding distances below the median, a symptom of positive skew. Unlike Figure 3.34, Figure 3.35 also reveals that the energy-consumption distribution is approximately symmetrical near its center.
. symplot energy

Figure 3.35  [symmetry plot of per capita energy consumed (Btu): distance above median vs. distance below median]
Quantiles are values below which a certain fraction of the data lie. For example, a .3 quantile is that value higher than 30% of the data. If we sort n observations in ascending order, the ith value forms the (i - .5)/n quantile. The following commands would calculate quantiles of variable energy:

. drop if energy >= .
. sort energy
. generate quant = (_n - .5)/_N

As mentioned in Chapter 2, _n and _N are Stata system variables, always unobtrusively present when there are data in memory. _n represents the current observation number, and _N the total number of observations.
Quantile plots automatically calculate what fraction of the observations lie below each data
value, and display the results graphically as in Figure 3.36. Quantile plots provide a graphic
reference for someone who does not have the original data at hand. From well-labeled quantile
plots, we can estimate order statistics such as median (.5 quantile) or quartiles (.25 and .75
quantiles). The IQR equals the rise between .25 and .75 quantiles. We could also read a
quantile plot to estimate the fraction of observations falling below a given value.
. quantile energy

Figure 3.36  [quantile plot of per capita energy consumed (Btu) vs. fraction of the data]
Quantile-normal plots, also called normal probability plots, compare quantiles of a variable's distribution with quantiles of a theoretical normal distribution having the same mean and standard deviation. They allow visual inspection for departures from normality in every part of a distribution, which can help guide decisions regarding normality assumptions and efforts to find a normalizing transformation. Figure 3.37, a quantile-normal plot of energy, confirms the severe positive skew that we had already observed. The grid option calls for a set of lines marking the .05, .10, .25 (first quartile), .50 (median), .75 (third quartile), .90, and .95 quantiles of both distributions. The .05, .50, and .95 quantile values are printed along the top and right-hand axes.
. qnorm energy, grid
Figure 3.37  [quantile-normal plot of per capita energy consumed vs. inverse normal; grid lines mark the 5, 10, 25, 50, 75, 90, and 95 percentiles]
Quantile-quantile plots resemble quantile-normal plots, but they compare quantiles
(ordered data points) of two empirical distributions instead of comparing one empirical
distribution with a theoretical normal distribution. On the following page, Figure 3.38 shows
a quantile-quantile plot of the mean math SAT score versus the mean verbal SAT score in 50
states and the District of Columbia. If the two distributions were identical, we would see points
along the diagonal line. Instead, data points form a straight line roughly parallel to the
diagonal, indicating that the two variables have different means but similar shapes and standard
deviations.
. qqplot msat vsat

Figure 3.38  [quantile-quantile plot of mean math SAT score vs. mean verbal SAT score]
Regression with Graphics (Hamilton 1992a) includes an introduction to reading quantile-based plots. Chambers et al. (1983) provide more details. Related Stata commands include pnorm (standard normal probability plot), pchi (chi-squared probability plot), and qchi (quantile-chi-squared plot).
Quality Control Graphs
Quality control charts help to monitor output from a repetitive process such as industrial production. Stata offers four basic types: c chart, p chart, R chart, and x chart. A fifth type, called Shewhart after the inventor of these methods, consists of vertically aligned x and R charts. Iman (1994) provides a brief introduction to R and x charts, including the tables used in calculating their control limits. The Base Reference Manual gives the command details and formulas used by Stata. Basic outlines of these commands are as follows:
. cchart defects unit
Constructs a c chart with the number of nonconformities or defects (defects) graphed against the unit number (unit). Upper and lower control limits, based on the assumption that the number of nonconformities per unit follows a Poisson distribution, appear as horizontal lines in the chart. Observations with values outside these limits are said to be "out of control."
. pchart rejects unit ssize
Constructs a p chart with the proportion of items rejected (rejects / ssize) graphed against
the unit number (unit). Upper and lower control limit lines derive from a normal
approximation, taking sample size (ssize) into account. If ssize varies across units the
control limits will vary too, unless we add the stabilize option.
. rchart x1 x2 x3 x4 x5, connect(l)
Constructs an R (range) chart using the replicated measurements in variables x1 through x5; in this example, five replications per sample. Graphs the range within each sample against the sample number, and (optionally) connects successive ranges with line segments. Horizontal lines indicate the mean range and control limits. Control limits are estimated from the sample size if the process standard deviation is unknown. When sigma is known, we can include this information in the command. For example, assuming sigma = 10,

. rchart x1 x2 x3 x4 x5, connect(l) std(10)
. xchart x1 x2 x3 x4 x5, connect(l)
Constructs an x (mean) chart using the replicated measurements in variables x1 through x5. Graphs the mean within each sample against the sample number, and connects successive means with line segments. The process mean is estimated from the mean of sample means, and control limits from the sample size, unless we override these defaults. For example, if we know that the process actually has mu = 50 and sigma = 10,

. xchart x1 x2 x3 x4 x5, connect(l) mean(50) std(10)

Alternatively, we could specify particular upper and lower control limits:

. xchart x1 x2 x3 x4 x5, connect(l) mean(50) upper(60) lower(40)

. shewhart x1 x2 x3 x4 x5, mean(50) std(10)
In one figure, vertically aligns an x chart with an R chart.
To illustrate a p chart, we turn to the quality inspection data in quality1.dta.
Contains data from C:\data\quality1.dta
  obs:            16                          Quality control example 1
 vars:             3                          4 Jul 2005 12:07
 size:           112 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
------------------------------------------------------------------------------
day             byte   %8.0g                  Day sampled
ssize           byte   %8.0g                  Number of units sampled
rejects         byte   %8.0g                  Number of units rejected
------------------------------------------------------------------------------
Sorted by:
. list in 1/5

[listing of the first five observations: day (6 to 58, not in order), ssize (51-53), rejects (10-12)]
Note that sample size varies from unit to unit, and that the units (days) are not in order.
. pchart rejects day ssize

Figure 3.39  [p chart of fraction rejected vs. Day sampled, with control limits; note: 2 units are out of control]
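Because the sample size varies from day to day, the control limits in Figure 3.39 wander. A standardized version can be requested with the stabilize option mentioned earlier; this command is our own illustration, not a figure from the book:

    * Our own illustration (not a book figure): standardized p chart for unequal sample sizes
    pchart rejects day ssize, stabilize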
Dataset quality2.dta, borrowed from Iman (1994:662), serves to illustrate rchart and xchart. Variables x1 through x4 represent repeated measurements from an industrial production process; 25 units with four replications each form the dataset.
Contains data from C:\data\quality2.dta
  obs:            25                          Quality control (Iman 1994:662)
 vars:             4                          4 Jul 2005 12:07
 size:           500 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
------------------------------------------------------------------------------
x1              float  %9.0g
x2              float  %9.0g
x3              float  %9.0g
x4              float  %9.0g
------------------------------------------------------------------------------
Sorted by:

. list in 1/5

     +--------------------------+
     |  x1     x2     x3    x4  |
     |--------------------------|
  1. | 4.6      2      4   3.6  |
  2. | 6.7    3.8    5.1   4.7  |
  3. | 4.6    4.3    4.5   3.9  |
  4. | 4.9      6    4.8   5.7  |
  5. | 7.6    6.9    2.5   4.7  |
     +--------------------------+
Figure 3.40, an R chart, graphs variation in the process range over the 25 units. rchart informs us that one unit's range is "out of control."
. rchart x1 x2 x3 x4, connect(l)

Figure 3.40  [R chart of sample ranges over 25 samples, with mean range and control limits; note: 1 unit is out of control]
Figure 3.41, an x chart, shows variation in the process mean. None of these 25 means falls outside the control limits.

. xchart x1 x2 x3 x4, connect(l)

Figure 3.41  [x chart of sample means over 25 samples, with control limits; note: 0 units are out of control]
Adding Text to Graphs
Titles, captions, and notes can be added to make graphs more self-explanatory. The default versions of titles and subtitles appear above the plot space; notes (which might document the data source, for instance) and captions appear below. These defaults can be overridden, of course. Type help title_options for more information about placement of titles, or help textbox_options for details concerning their content. Figure 3.42 demonstrates the default versions of these four options in a scatterplot of the prevalence of smoking and college graduates among U.S. states, using statehealth.dta. Figure 3.42 also includes titles for both the left and right y axes, yaxis(1 2), and top and bottom x axes, xaxis(1 2). Subsequent ytitle and xtitle options refer to the second axes specifically, by including the axis(2) suboption. y axis 2 is not necessarily on the right, and x axis 2 is not necessarily on the top, as we will see later; but these are their default positions.
. graph twoway scatter smokeT college, yaxis(1 2) xaxis(1 2)
     title("This is the TITLE") subtitle("This is the SUBTITLE")
     caption("This is the CAPTION") note("This is the NOTE")
     ytitle("Percent adults smoking")
     ytitle("This is Y AXIS 2", axis(2))
     xtitle("Percent adults with Bachelor's degrees or higher")
     xtitle("This is X AXIS 2", axis(2))

Figure 3.42  [scatterplot of percent adults smoking vs. percent adults with Bachelor's degrees or higher, showing default positions of TITLE, SUBTITLE, NOTE, CAPTION, and the second y and x axes]
Titles add text boxes outside of the plot space. We can also add text boxes at specified coordinates within the plot space. Several outliers stand out in this scatterplot. Upon investigation, they turn out to be Washington DC (highest college value, at far right), Utah (lowest smokeT value, at bottom center), and Nevada (highest smokeT value, at upper left). Text boxes provide a way for us to identify these observations within our graph, as demonstrated in Figure 3.43. The option text(15.5 22.5 "Utah") places the word "Utah" at position y = 15.5, x = 22.5 in the scatterplot, directly above Utah's data point.
Figure 3.43  [scatterplot of percent adults smoking vs. percent adults with Bachelor's degrees or higher, with text boxes identifying Utah, Nevada, and Washington DC]

Two or more twoway plots can also be overlaid to form a single image. Figure 3.44, below, uses the lfitci plot type listed earlier, which draws a linear regression line with its confidence band; here smokeT (percent adults smoking) is graphed against college.
. graph twoway lfitci smokeT college

Figure 3.44  [linear fit with 95% CI: fitted values of % adults smoking vs. % adults college degree, 1990]
A more informative graph results when we overlay a scatterplot on top of the regression line plot, as seen in Figure 3.45. To do this, we essentially give two distinct graphing commands, separated by "||".
. graph twoway lfitci smokeT college || scatter smokeT college

Figure 3.45  [lfitci plot overlaid with scatterplot; legend: 95% CI, % adults smoking, Fitted values]
The second plot (scatterplot) overprints the first in Figure 3.45. This order has consequences for the default line style (solid, dashed, etc.) and also for the marker symbols (squares, circles, etc.) used by each sub-plot. More importantly, it superimposes the scatterplot points on the confidence bands so the points remain visible. Try reversing the order of the two plots in the command, to see how this works.
Figure 3.46 takes this idea a step further, improving the image through axis labeling and legend options. Because these options apply to the graph as a whole, not just to one of the subplots, the options are placed after a second || separator, followed by a comma. Most of these options resemble those used in previous examples. The order(2 1) option here does something new: it omits one of the three legend items, so that only two of them (2, the regression line, followed by 1, the confidence interval) appear in the figure. Compare this legend with Figure 3.45 to see the difference. Although we list only two legend items in Figure 3.46, it is still necessary to specify a rows(3) legend format as if all three were retained.
. graph twoway lfitci smokeT college
     || scatter smokeT college
     || , xlabel(12(2)34) ylabel(14(2)32, angle(horizontal))
     xtitle("Percent adults with Bachelor's degrees or higher")
     ytitle("Percent adults smoking")
     note("Data from CDC and US Census")
     legend(order(2 1) label(1 "95% c.i.") label(2 "regression line")
     rows(3) position(1) ring(0))

Figure 3.46  [regression line with 95% c.i. and scatterplot of percent adults smoking vs. percent adults with Bachelor's degrees or higher; note: Data from CDC and US Census]
The two separate plots (lfitci and scatter) overlaid in Figure 3.46 share the same y and x scales, so a single set of axes applies to both. When the variables of interest have different scales, we need independently scaled axes. Figure 3.47 illustrates this with an overlay of two line plots based on the Gulf of St. Lawrence environmental data in gulf.dta. This figure combines time series of the minimum mean temperature of the Gulf's cold intermediate layer waters (cil), in degrees Celsius, and maximum winter ice cover (maxarea), in thousands of square kilometers. The cil plot makes use of yaxis(1), which by default is on the left. The maxarea plot makes use of yaxis(2), which by default is on the right. The various ylabel, ytitle, yline, and yscale options each include an axis(1) or axis(2) suboption, declaring which y axis they refer to. Extra spaces inside the quotation marks for ytitle provided a quick way to place the words of these titles where we want them, near the numerical labels. (For a different approach, see Figure 3.48.) The text box containing "Northern Gulf fisheries decline and collapse" is drawn with medium-wide margins around the text; see help marginstyle for other choices. yscale(range()) options give both y axes a range wider than their data, with specific values chosen after experimenting to find the best vertical separation between the two series.
. graph twoway line cil winter, yaxis(1) yscale(range(-1,3) axis(1))
     ytitle("Degrees C          ", axis(1)) yline(0)
     ylabel(-1(.5)1.5, axis(1) angle(horizontal) nogrid)
     text(1 1992 "Northern Gulf" "fisheries decline" "and collapse",
     box margin(medium))
     || line maxarea winter,
     yaxis(2) ylabel(50(50)200, axis(2) angle(horizontal))
     yscale(range(-100,221) axis(2))
     ytitle("          1000s of km^2", axis(2))
     yline(173.6, axis(2) lpattern(dot))
     || if winter > 1949,
     xtitle("") xlabel(1950(10)2000) xtick(1950(2)2002)
     legend(position(11) ring(0) rows(2) order(2 1)
     label(1 "Max ice area") label(2 "Min CIL temp"))
     note("Source: Hamilton, Haedrich and Duncan (2003); data from DFO (2003)")

Figure 3.47  [overlaid line plots of Min CIL temp (Degrees C, left axis) and Max ice area (1000s of km^2, right axis), 1950-2000, with a text box "Northern Gulf fisheries decline and collapse"; note: Source: Hamilton, Haedrich and Duncan (2003); data from DFO (2003)]
The text box on the right in Figure 3.47 marks the late-1980s and early-1990s period when key fisheries including the Northern Gulf cod declined or collapsed. As the graph shows, the fisheries declines coincided with the most sustained cold and ice conditions on record.
To place cod catches in the same graph with temperature and ice, we need three independent vertical scales. Figure 3.48 involves three overlaid plots, with all y axes at left (default). The basic forms of the three component plots are as follows:

connected maxarea winter
A connected-line plot of maxarea vs. winter, using y axis 3 (which will be leftmost in our final graph). The y axis scale ranges from -300 to +220, with no grid of horizontal lines. Its title is "Ice area, 1000 km^2." This title is placed in the "northwest" position, placement(nw).

line cil winter
A line plot of cil vs. winter, using y axis 2. The y scale ranges from -4 to +3, with default labels.

connected cod winter
A connected-line plot of cod vs. winter, using y axis 1. The title placement is "southwest," placement(sw).

Bringing these three component plots together, the full command for Figure 3.48 appears on the next page. y ranges for each of the overlaid plots were chosen by experimenting to find the "right" amount of vertical separation among the three series. Options applied to the whole graph restrict the analysis to years since 1959, specify legend and x axis labeling, and request vertical grid lines.
. graph twoway connected maxarea winter, yaxis(3)
     yscale(range(-300,220) axis(3)) ylabel(50(50)200, nogrid axis(3))
     ytitle("Ice area, 1000 km^2", axis(3) placement(nw))
     clpattern(dash)
     || line cil winter, yaxis(2) yscale(range(-4,3) axis(2))
     ylabel(, nogrid axis(2))
     ytitle("CIL temperature, degrees C", axis(2)) clpattern(solid)
     || connected cod winter, yaxis(1) yscale(range(0,200) axis(1))
     ylabel(, nogrid axis(1))
     ytitle("Cod catch, 1000 tons", axis(1) placement(sw))
     || if winter > 1959,
     xtitle("") xlabel(1960(5)2000, grid)
     legend(ring(0) position(7) label(1 "Max ice area")
     label(2 "Min CIL temp") label(3 "Cod catch") rows(3))

Figure 3.48  [three overlaid time plots, 1960-2000: Max ice area (1000 km^2), Min CIL temp (degrees C), and Cod catch (1000 tons), each on its own y axis]
Graphing with Do-Files
Complicated graphics like Figure 3.48 require graph commands that are many physical lines long (although Stata views the whole command as one logical line). Do-files, introduced in Chapter 2, help in writing such multi-line commands. They also make it easy to save the command for future re-use, in case we later want to modify the graph or draw it again.
The following commands, typed into Stata's Do-file Editor and saved with the file name fig03_48.do, become a new do-file for drawing Figure 3.48. Typing

. do fig03_48

then causes the do-file to execute, redrawing the graph and saving it in two formats.
#delimit ;
use c:\data\gulf.dta, clear ;
graph twoway connected maxarea winter, yaxis(3)
    yscale(range(-300,220) axis(3)) ylabel(50(50)200, nogrid axis(3))
    ytitle("Ice area, 1000 km^2", axis(3) placement(nw))
    clpattern(dash)
    || line cil winter, yaxis(2) yscale(range(-4,3) axis(2))
    ylabel(, nogrid axis(2))
    ytitle("CIL temperature, degrees C", axis(2)) clpattern(solid)
    || connected cod winter, yaxis(1) yscale(range(0,200) axis(1))
    ylabel(, nogrid axis(1))
    ytitle("Cod catch, 1000 tons", axis(1) placement(sw))
    || if winter > 1959,
    xtitle("") xlabel(1960(5)2000, grid)
    legend(ring(0) position(7) label(1 "Max ice area")
    label(2 "Min CIL temp") label(3 "Cod catch") rows(3))
    saving(c:\data\fig03_48.gph, replace) ;
graph export c:\data\fig03_48.eps, replace ;
#delimit cr
The #delimit ; command at the top tells Stata to treat a semicolon as the end of a command, so Stata treats this all as one logical line that ends with the semicolon after the saving() option. This option saves the graph in Stata's .gph format.
Next, the graph export command creates a second version of the same graph in Encapsulated PostScript format, as indicated by the .eps suffix in the filename fig03_48.eps. (Type help graph export to learn more about this command, which is particularly useful for writing programs or do-files that will create graphs repeatedly.)
At the end of the do-file, the #delimit cr command re-sets a carriage return as the end-of-line delimiter, Stata's usual mode. Although invisible on the page, the carriage return is what we produce by pressing the Enter key.
Retrieving and Combining Graphs
Any graph saved in Stata's "live" .gph format can subsequently be retrieved into memory by the graph use command. For example, we could retrieve Figure 3.48 by typing

. graph use fig03_48

Once retrieved, the graph is displayed onscreen and can be printed or saved again, whether in .gph format or exported to another format such as Encapsulated PostScript or Enhanced Windows Metafile. We also could change the color scheme, either through menus or directly with the graph use command. fig03_48.gph was saved in the s2 monochrome scheme, but we could see how it looks in the s1 color scheme by typing

. graph use fig03_48, scheme(s1color)
Graphs saved on disk can also be combined by the graph combine command. This provides a way to bring multiple plots into the same image. For illustration, we return to the Gulf of St. Lawrence data shown earlier in Figure 3.48. The following commands draw three simple time plots (not shown), saving them with the names fig03_49a.gph, fig03_49b.gph, and fig03_49c.gph. The margin(medium) suboptions specify the margin width for title boxes within each plot.
. graph twoway line maxarea winter if winter > 1964, xtitle("")
     xlabel(1965(5)2000, grid) ylabel(50(50)200, nogrid)
     title("Maximum winter ice area", position(4) ring(0) box
     margin(medium))
     ytitle("1000 km^2") saving(fig03_49a)

. graph twoway line cil winter if winter > 1964, xtitle("")
     xlabel(1965(5)2000, grid) ylabel(-1(.5)1.5, nogrid)
     title("Minimum CIL temperature", position(1) ring(0) box
     margin(medium))
     ytitle("Degrees C") saving(fig03_49b)

. graph twoway line cod winter if winter > 1964, xtitle("")
     xlabel(1965(5)2000, grid) ylabel(0(20)100, nogrid)
     title("Northern Gulf cod catch", position(1) ring(0) box
     margin(medium))
     ytitle("1000 tons") saving(fig03_49c)
To combine these plots, we type the following command. Because the three plots have
identical x scales, it makes sense to align the graphs vertically, in three rows. The imargin
option specifies “very small” margins around the individual plots of Figure 3.49.
. graph combine fig03_49a.gph fig03_49b.gph fig03_49c.gph,
     imargin(vsmall) rows(3)

Figure 3.49  [three vertically aligned time plots, 1965-2000: Maximum winter ice area, Minimum CIL temperature, Northern Gulf cod catch]
Type help graph combine for more information on this command. Options control details including the number of rows and columns, the size of text and markers (which otherwise shrink as more plots are combined), and the margins between individual plots. They can also specify whether x or y axes of twoway plots have common scales, or assign all components a common color scheme. Titles can be added to the combined graph, which can be printed, saved, retrieved, or for that matter combined again in the usual ways.
Our final example illustrates several of the graph combine options, and a combined graph with unequal-sized components. Suppose we want a scatterplot similar to the plot shown earlier in Figure 3.42, but with box plots of the y and x variables drawn beside their respective axes. Using statehealth.dta, we might first try to do this by drawing a box plot of smokeT, a scatterplot of smokeT vs. college, and a horizontal box plot of college, and then combining the three plots into one image:
. graph box smokeT, saving(wrong1)
. graph twoway scatter smokeT college, saving(wrong2)
. graph hbox college, saving(wrong3)
. graph combine wrong1.gph wrong2.gph wrong3.gph
The combined graph produced by the commands above would look wrong, however. We would end up with two fat box plots, each the size of the whole scatterplot, and none of the axes aligned. For a more satisfactory version, we need to start by creating a thin vertical box plot of smokeT. The fxsize(20) option in the following command fixes the plot's x (horizontal) size at 20% of normal, resulting in a normal height but only 20% width plot. Two empty caption lines are included for spacing reasons that will be apparent in the final graph.
. graph box smokeT, fxsize(20) caption("" "")
     ytitle("") ylabel(none) ytick(15(5)35, grid) saving(fig03_50a)
For the second component, we create a straightforward scatterplot of smokeT vs. college.

. graph twoway scatter smokeT college,
     ytitle("Percent adults smoking")
     xtitle("Percent adults with Bachelor's degrees or higher")
     ylabel(, grid) xlabel(, grid)
     saving(fig03_50b)
The third component is a thin horizontal box plot of college. This plot should have normal x size, but y size reduced to 20% of normal. For spacing reasons, two empty left titles are included.

. graph hbox college, fysize(20) l1title("") l2title("")
     ylabel(none) ytick(10(5)35, grid) ytitle("")
     saving(fig03_50c)
These three components come together in Figure 3.50. The graph combine command's cols(2) option arranges the plots in two columns, like a 2-by-2 table with one empty cell. The holes(3) option specifies that the empty cell should be the third one, so our three component graphs fill positions 1, 2, and 4. iscale(1.05) enlarges marker symbols and text by about 5%, for readability. The empty captions or titles we built into the original box plots compensate for the two lines of text (title and label) on each axis of the scatterplot, so the box plots align (although not quite perfectly) with the scatterplot axes.
. graph combine fig03_50a.gph fig03_50b.gph fig03_50c.gph,
     cols(2) holes(3) iscale(1.05)

Figure 3.50  [combined graph: thin vertical box plot of smokeT, scatterplot of percent adults smoking vs. percent adults with Bachelor's degrees or higher, and thin horizontal box plot of college aligned with the scatterplot axes]
Summary Statistics and Tables
Flexible arrangements of summary statistics are available through the command tabstat. Frequency tables and cross-tabulations come from tabulate, while table produces tables whose cells contain statistics such as frequencies, sums, means, or medians of other variables. Some further one-variable procedures, including normality tests and transformations, are not described in this chapter.
Example Commands
. summarize y1 y2 y3
Calculates basic summary statistics (number of nonmissing observations, mean, standard deviation, minimum, and maximum) for the variables y1, y2, and y3.

. summarize y1 y2 y3, detail
Obtains detailed summary statistics, including percentiles, median, mean, standard deviation, variance, skewness, and kurtosis.
. summarize y1 if x1 > 3 & x2 < .
Summarizes y1 using only those observations for which variable x1 is greater than 3 and x2 is not missing.
. summarize y1 [fweight = w], detail
Calculates detailed summary statistics for y1, using the frequency weights in variable w.
. tabstat y1, stats(mean sd skewness kurtosis n)
Calculates only the specified summary statistics for variable y1.

. tabstat y1, stats(min p5 p25 p50 p75 p95 max) by(x1)
Calculates the specified percentile statistics for y1, separately for each value of x1.
. tabulate x1
Displays a frequency distribution table for all nonmissing values of variable x1.
. tabulate x1, sort miss
Displays a frequency distribution of x1, including the missing values. Rows (values) are sorted from most to least frequent.
. tab1 x1 x2 x3 x4
Displays a series of frequency distribution tables, one for each of the variables listed.
. tabulate x1 x2
Displays a two-variable cross-tabulation with x1 as the row variable and x2 as the columns.
. tabulate x1 x2, chi2 nof column
Produces a cross-tabulation and Pearson chi-squared test of independence. Does not show cell frequencies, but instead gives the column percentages in each cell.
. tabulate x1 x2, missing row all
Produces a cross-tabulation that includes missing values in the table and in the calculation of percentages. Calculates "all" available statistics (Pearson and likelihood-ratio chi-squared, Cramer's V, Goodman and Kruskal's gamma, and Kendall's tau-b).
. tab2 x1 x2 x3 x4
Performs all possible two-way cross-tabulations of the listed variables.
. tabulate x1, summ(y)
Produces a one-way table showing the mean, standard deviation, and frequency of y values within each category of x1.
. tabulate x1 x2, summ(y) means
Produces a two-way table showing the mean of y at each combination of x1 and x2 values.
. by x3, sort: tabulate x1 x2, exact
Creates a three-way cross-tabulation, with subtables for x1 (row) by x2 (column) at each value of x3. Calculates Fisher's exact test for each subtable. by varname, sort: works as a prefix for almost any Stata command where it makes sense. The sort option is unnecessary if the data already are sorted on varname.
. table y x2 x3, by(x4 x5) contents(freq)
Creates a five-way cross-tabulation of y (row) by x2 (column) by x3 (supercolumn), by x4 (superrow 1) by x5 (superrow 2). Cells contain frequencies.
. table x1 x2, contents(mean y1 median y2)
Creates a two-way table of x1 (row) by x2 (column). Cells contain the mean of y1 and the median of y2.
Dataset VTtown.dta contains information from residents of a town in Vermont. A survey was
conducted soon after routine state testing had detected trace amounts of toxic chemicals in the
town's water supply. Higher concentrations were found in several private wells and near the
public schools. Worried citizens held meetings to discuss possible solutions to this problem.
Contains data from C:\data\VTtown.dta
  obs:           153                          VT town survey (Hamilton 1985)
 vars:             7                          11 Jul 2005 18:05
 size:         1,683 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
------------------------------------------------------------------------------
gender          byte   %8.0g       sexlbl     Respondent's gender
lived           byte   %8.0g                  Years lived in town
kids            byte   %8.0g       kidlbl     Have children <19 in town?
educ            byte   %8.0g                  Highest year school completed
meetings        byte   %8.0g       kidlbl     Attended meetings on pollution
contam          byte   %8.0g       contamlb   Believe own property/water
                                                contaminated
school          byte   %8.0g       close      School closing opinion
------------------------------------------------------------------------------
Sorted by:
To find the mean and standard deviation of the variable lived (years the respondent had lived in town), type

. summarize lived

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       lived |       153    19.26797    16.95466          1         81
This table also gives the number of nonmissing observations and the variable's minimum and
maximum values. If we had simply typed summarize with no variable list, we would obtain
means and standard deviations for every numerical variable in the dataset.
To see more detailed summary statistics, type

. summarize lived, detail

                        Years lived in town
-------------------------------------------------------------
      Percentiles      Smallest
 1%            1             1
 5%            2             1
10%            3             1       Obs                 153
25%            5             1       Sum of Wgt.         153

50%           15                     Mean           19.26797
                       Largest       Std. Dev.      16.95466
75%           29            65
90%           42            65       Variance       287.4606
95%           55            68       Skewness       1.208804
99%           68            81       Kurtosis       4.025642
This summarize, detail output includes basic statistics plus the following:

Percentiles:    Notably the first quartile (25th percentile), median (50th percentile), and third quartile (75th percentile). Because many samples do not divide evenly into quarters or other standard fractions, these percentiles are approximations.

Four smallest and four largest values, where outliers might show up.
Sum of weights:  Stata understands four types of weights: analytical weights (aweight), frequency weights (fweight), importance weights (iweight), and sampling weights (pweight). Different procedures allow, and make sense with, different kinds of weights. summarize, detail, for example, permits aweight or fweight. For explanations see help weights.

Variance:        Standard deviation squared (more properly, standard deviation equals the square root of variance).

Skewness:        The direction and degree of asymmetry. A perfectly symmetrical distribution has skewness = 0. Positive skew (heavier right tail) results in skewness > 0; negative skew (heavier left tail) results in skewness < 0.

Kurtosis:        Tail weight. A normal (Gaussian) distribution is symmetrical and has kurtosis = 3. If a symmetrical distribution has heavier-than-normal tails (that is, is sharply peaked), it will have kurtosis > 3. Kurtosis < 3 indicates lighter-than-normal tails.
The tabstat command provides a more flexible alternative to summarize. We can specify just which summary statistics we want to see. For example,

. tabstat lived, stats(mean range skewness)

    variable |      mean     range  skewness
-------------+------------------------------
       lived |  19.26797        80  1.208804
With a by(varname) option, tabstat constructs a table containing summary
statistics for each value of varname. The following example contains means, standard
deviations, medians, interquartile ranges, and number of nonmissing observations of lived, for
each category of gender. The means and medians both indicate that, on average, the women
in this sample had lived in town for fewer years than the men. Note that the median column is
labeled “p50”, meaning 50th percentile.
. tabstat lived, stats(mean sd median iqr n) by(gender)

Summary for variables: lived
     by categories of: gender (Respondent's gender)

  gender |      mean        sd       p50       iqr         N
---------+---------------------------------------------------
    male |  23.48333  19.69125      19.5        28        60
  female |  16.54839  14.39468        13        19        93
---------+---------------------------------------------------
   Total |  19.26797  16.95466        15        24       153
Statistics available for the stats() option of tabstat include:

    mean       Mean
    count      Count of nonmissing observations
    n          Same as count
    sum        Sum
    max        Maximum
    min        Minimum
    range      Range = max - min
    sd         Standard deviation
    var        Variance
    cv         Coefficient of variation = sd / mean
    semean     Standard error of mean = sd / sqrt(n)
    skewness   Skewness
    kurtosis   Kurtosis
    median     Median (same as p50)
    p1         1st percentile (similarly, p5, p10, p25, p50, p75, p95, or p99)
    iqr        Interquartile range = p75 - p25
    q          Quartiles; equivalent to specifying p25 p50 p75
Further tabstat options give control over the table layout and labeling. Type help
tabstat to see a complete list.
The statistics produced by summarize or tabstat describe the sample at hand. We might also want to draw inferences about the population, for example, by constructing a 99% confidence interval for the mean of lived:
. ci lived, level(99)

    Variable |       Obs        Mean    Std. Err.     [99% Conf. Interval]
-------------+-------------------------------------------------------------
       lived |       153    19.26797    1.370703       15.69241    22.84354
Based on this sample, we could be 99% confident that the population mean lies somewhere
in the interval from 15.69 to 22.84 years. Here we used a level ( ) option to specify a 99%
confidence interval. If we omit this option, ci defaults to a 95% confidence interval.
Other options allow ci to calculate exact confidence intervals for variables that follow
binomial or Poisson distributions. A related command, cii , calculates normal, binomial, or
Poisson confidence intervals directly from summary statistics, such as we might encounter in
a published article. It does not require the raw data. Type help ci for details about both
commands.
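For instance, suppose a published article reported only that n = 153, the mean was 19.27, and the standard deviation was 16.95. A sketch of how cii could reproduce the 99% interval from those summary statistics alone (numbers rounded from the output above):

. cii 153 19.27 16.95, level(99)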
Exploratory Data Analysis
Statistician John Tukey invented a toolkit of methods for exploratory data analysis (EDA),
which involves analyzing data in an exploratory and skeptical way without making unneeded
assumptions (see Tukey 1977; also Hoaglin, Mosteller, and Tukey 1983, 1985). Box plots,
introduced in Chapter 3, are one of Tukey's best-known innovations. Another is the stem-and-leaf
display, a graphical arrangement of ordered data values in which initial digits form the
"stems" and following digits for each observation make up the "leaves."
. stem lived

Stem-and-leaf plot for lived (Years lived in town)

  0* | 1111111222223333333344444444
  0. | 55555555555566666666777889999
  1* | 0000001122223333334
  1. | 55555567788899
  2* | 000000111112224444
  2. | 56778899
  3* | 00000124
  3. | 5555666789
  4* | 0012
  4. | 59
  5* | 00134
  5. | 556
  6* |
  6. | 5558
  7* |
  7. |
  8* | 1
stem automatically chose a double-stem version here, in which 1* denotes first digits
of 1 and second digits of 0-4 (that is, respondents who had lived in town 10-14 years), and 1.
denotes first digits of 1 and second digits of 5-9 (15-19 years). We can control the number
of lines per initial digit with the lines() option. For example, a five-stem version in which
the 1* stem holds leaves of 0-1, 1t leaves of 2-3, 1f leaves of 4-5, 1s leaves of 6-7, and
1. leaves of 8-9 could be obtained by typing

. stem lived, lines(5)

Type help stem for information about other options.
Letter-value displays (lv) use order statistics to dissect a distribution.

. lv lived
                       #  153          Years lived in town
-----------------------------------------------------------------------------
           M    77  |            15           |    spread    pseudosigma
           F    39  |      5     17      29   |        24        17.9731
           E    20  |      3     21      39   |        36       15.86391
           D  10.5  |      2     27      52   |        50       16.62351
           C   5.5  |      1  30.75    60.5   |      59.5       16.26523
           B     3  |      1     33      65   |        64       15.15955
           A     2  |      1   34.5      68   |        67       14.59762
           Z   1.5  |      1  37.75    74.5   |      73.5       15.14113
                 1  |      1     41      81   |        80       15.32737
-----------------------------------------------------------------------------
                    |                         |   # below    # above
        inner fence |    -31             65   |        0          5
        outer fence |    -67            101   |        0          0
M denotes the median, and F the fourths (quartiles, using a slightly different approximation than
the quartile approximation used by summarize, detail and tabstat). E, D, C,
. . . denote cutoff points such that roughly 1/8, 1/16, 1/32, . . . of the distribution remains
outside in the tails. The second column of numbers gives the "depth," or distance from the nearest
extreme, for each letter value. Within the center box, the middle column gives
midsummaries, which are averages of the two letter values. If midsummaries drift away from
the median, as they do for lived, this tells us that the distribution becomes progressively more
skewed as we move farther out into its tails. The spreads measure the distance between each
pair of letter values; the F-spread is the approximate interquartile range. Finally, "pseudosigmas"
in the right-hand column tell us what the standard deviation should be if these letter values
described a Gaussian population. The F-pseudosigma, sometimes called a "pseudo standard
deviation" (PSD), supplies an outlier-resistant check for approximate normality. Two simple
comparisons help with this check:
1. Comparing the mean with the median diagnoses overall skew:
      mean > median     positive skew
      mean = median     symmetry
      mean < median     negative skew

2. Comparing the standard deviation with the PSD diagnoses tail weight:
      standard deviation > PSD     heavier-than-normal tails
      standard deviation = PSD     normal tails
      standard deviation < PSD     lighter-than-normal tails
Letter-value displays also check for outliers, using the inner and outer fences. A value x is a
mild outlier if it lies outside the inner fences but not outside the outer fences:

      F1 - 3(IQR) <= x < F1 - 1.5(IQR)    or    F2 + 1.5(IQR) < x <= F2 + 3(IQR)

The value x is a severe outlier if it lies outside the outer fences:

      x < F1 - 3(IQR)    or    x > F2 + 3(IQR)

lv gives these cutoffs and the number of outliers of each type. Severe outliers, values beyond
the outer fences, occur sparsely (about two per million) in normally distributed populations, so
a substantial number of severe outliers provides evidence against normality. The distribution of
lived produces no severe outliers, although five values lie beyond the upper inner fence. The
remainder of this chapter describes formal normality tests, and transformations that can make
skewed distributions more nearly normal.
Normality Tests and Transformations

We have already seen several exploratory ways to check a variable for approximate
normality, extending the graphical tools (such as symmetry and quantile-normal plots)
presented in Chapter 3. The skewness and kurtosis test sktest, unlike summarize,
detail, can more formally evaluate the null hypothesis that the sample at hand came from
a normally-distributed population.
. sktest lived

                   Skewness/Kurtosis tests for Normality
                                                   ------- joint ------
    Variable |  Pr(Skewness)   Pr(Kurtosis)   adj chi2(2)    Prob>chi2
-------------+-----------------------------------------------------------
       lived |      0.000          0.028          24.79         0.0000
sktest here rejects normality: lived appears significantly nonnormal in skewness (P =
.000), kurtosis (P = .028), and in both statistics considered jointly (P = .0000). Stata rounds off
displayed probabilities to three or four decimals; “0.0000” really means P < .00005.
Other normality or log-normality tests include the Shapiro-Wilk W (swilk) and Shapiro-Francia W' (sfrancia) methods. Type help sktest to see the options.
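For example, applying both of these tests to lived might look like this (a sketch; output not shown):

. swilk lived
. sfrancia lived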
Nonlinear transformations such as square roots and logarithms are often employed to
change distribution shapes, with the aim of making skewed distributions more symmetrical
and perhaps more nearly normal. Transformations might also help linearize relationships
between variables (Chapter 8). Table 4.1 shows a progression called the “ladder of powers”
(Tukey 1977) that provides guidance for choosing transformations to change distributional
shape. The variable lived exhibits mild positive skew, so its square root might be more
symmetrical. We could create a new variable equal to the square root of lived by typing
. generate srlived = lived^.5

Instead of lived^.5, we could equally well have written sqrt(lived).
Logarithms are another transformation that can reduce positive skew. To generate a new
variable equal to the natural (base e) logarithm of lived, type
. generate loglived = ln(lived)
In the ladder of powers and related transformation schemes such as Box-Cox, logarithms take
the place of a “0” power. Their effect on distribution shape is intermediate between .5 (square
root) and -.5 (reciprocal root) transformations.
Table 4.1: Ladder of Powers

Transformation                 Formula                Effect
---------------------------------------------------------------------------------
cube                           new = old^3            reduce severe negative skew
square                         new = old^2            reduce mild negative skew
raw                            new = old              no change (raw data)
square root                    new = old^.5           reduce mild positive skew
log (or log10)                 new = ln(old)
                               new = log10(old)       reduce positive skew
negative reciprocal root       new = -(old^-.5)       reduce severe positive skew
negative reciprocal            new = -(old^-1)        reduce very severe positive skew
negative reciprocal square     new = -(old^-2)
negative reciprocal cube       new = -(old^-3)
When raising to a power less than zero, we take negatives of the result to preserve the
original order — the highest value of old becomes transformed into the highest value of new,
and so forth. When old itself contains negative or zero values, it is necessary to add a constant
before transformation. For example, if arrests measures the number of times a person has been
arrested (0 for many people), then a suitable log transformation could be

. generate larrests = ln(arrests + 1)
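To judge how much a transformation has reduced skewness, we can compare the raw and transformed variables directly. The following sketch (not from the original text) reuses the srlived and loglived variables generated above:

. tabstat lived srlived loglived, stats(skewness kurtosis) columns(statistics)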
The ladder command combines the ladder of powers with sktest tests for
normality. It tries each power on the ladder, and reports whether the result is significantly
nonnormal. This can be illustrated using the severely skewed variable energy, per capita energy
consumption, from states.dta.
. ladder energy

Transformation          formula               chi2(2)      P(chi2)
--------------------------------------------------------------------
cube                    energy^3                53.74        0.000
square                  energy^2                45.53        0.000
raw                     energy                  33.25        0.000
square-root             sqrt(energy)            25.03        0.000
log                     log(energy)             15.88        0.000
reciprocal root         1/sqrt(energy)           7.36        0.025
reciprocal              1/energy                 1.32        0.517
reciprocal square       1/(energy^2)             4.13        0.127
reciprocal cube         1/(energy^3)            11.56        0.003
It appears that the reciprocal transformation, 1/energy (or energy^-1), most closely resembles
a normal distribution. Most of the other transformations (including the raw data) are
significantly nonnormal. Figure 4.1 (produced by the gladder command) visually supports
this conclusion by comparing histograms of each transformation to normal curves.
. gladder energy

[Figure 4.1: histograms of each ladder-of-powers transformation of per capita energy consumed, Btu (panels: cubic, square, identity, sqrt, log, 1/sqrt, inverse, 1/square, 1/cubic), each compared with a normal curve. Histograms by transformation.]
Figure 4.2 shows a corresponding set of quantile-normal plots for these ladder-of-powers
transformations, obtained by the "quantile ladder" command qladder. To make the tiny
plots more readable in this example, we scale the labels and marker symbols up by 25% with
the scale(1.25) option. The axis labels (which would be unreadable and crowded) are
suppressed by the options ylabel(none) xlabel(none).

. qladder energy, scale(1.25) ylabel(none) xlabel(none)
[Figure 4.2: quantile-normal plots of each transformation of per capita energy consumed, Btu (panels: cubic, square, identity, sqrt, log, 1/sqrt, inverse, 1/square, 1/cubic). Quantile-Normal plots by transformation.]
A related command, bcskew0, finds the Box-Cox transformation

      y(L) = (y^L - 1)/L        for L > 0 or L < 0
      y(L) = ln(y)              for L = 0

such that y(L) has approximately 0 skewness. Applying this to energy, we obtain the
transformed variable benergy:

. bcskew0 benergy = energy, level(95)

     Transform |         L      [95% Conf. Interval]     Skewness
---------------+----------------------------------------------------
(energy^L-1)/L |  -1.246052    -2.052503    -.6163383      .000281

(1 missing value generated)
The best-fitting parameter is L = -1.246, which leaves essentially zero skewness (.000281).
This differs somewhat from our ladder-of-powers choice, the reciprocal (energy^-1). The 95%
confidence interval for L,

      -2.0525 < L < -.6163

rejects some other possibilities, including logarithms (L = 0) or square roots (L = .5).
Chapter 8 describes a Box-Cox approach to regression modeling.
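As a rough check on this choice (not shown in the original text), we could generate the ladder-of-powers reciprocal ourselves and re-run sktest on it; invenergy here is a made-up variable name:

. generate invenergy = -1/energy
. sktest invenergy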
Frequency Tables and Two-Way Cross-Tabulations

Categorical variables call for different summaries. The tabulate command produces
one-way frequency distributions — for example, tabulating the categorical variable meetings
(did the respondent attend meetings about the pollution problem?):
. tabulate meetings

   Attended |
meetings on |
  pollution |      Freq.     Percent        Cum.
------------+-----------------------------------
         no |        106       69.28       69.28
        yes |         47       30.72      100.00
------------+-----------------------------------
      Total |        153      100.00
tabulate can construct frequency distributions even for variables that have thousands of values.
To construct a readable table for a variable with many values, however, you might first want to
group those values by using generate with the recode or autocode functions (see Chapter 2 for
examples). Listing two variables after tabulate produces a cross-tabulation. For example,
here is meetings by kids (whether the respondent has children under 19 living in town):
. tabulate meetings kids

  Attended |
  meetings |  Have children <19 in
        on |         town?
 pollution |        no        yes |     Total
-----------+----------------------+----------
        no |        52         54 |       106
       yes |        11         36 |        47
-----------+----------------------+----------
     Total |        63         90 |       153
The first-named variable forms the rows, and the second-named variable forms the columns
in the resulting table. We see that only 11 of the 63 non-parents attended the meetings.

tabulate has a number of options that are useful with frequency tables:
all             Equivalent to specifying chi2 lrchi2 V gamma taub. Note that gamma and taub
                assume that both variables have ordered categories, whereas chi2, lrchi2, and
                V do not.
cchi2           Displays the contribution to Pearson chi-squared in each cell of a two-way table.
cell            Shows total percentages for each cell.
chi2            Pearson chi-squared test of the hypothesis that row and column variables are
                independent.
clrchi2         Displays the contribution to likelihood-ratio chi-squared in each cell of a two-way
                table.
column          Shows column percentages for each cell.
exact           Fisher's exact test of the independence hypothesis. Superior to chi2 if the table
                contains thin cells with low expected frequencies. Often too slow to be practical
                in large tables, however.
expected        Displays the expected frequency under the assumption of independence in each
                cell of a two-way table.
gamma           Goodman and Kruskal's γ (gamma), with its asymptotic standard error (ASE).
                Measures association between ordinal variables, based on the number of
                concordant and discordant pairs (ignoring ties). -1 ≤ γ ≤ 1.
generate(new)   Creates a set of dummy variables named new1, new2, and so on to represent
                the values of the tabulated variable.
lrchi2          Likelihood-ratio chi-squared test of the independence hypothesis. Not obtainable
                if the table contains any empty cells.
matcell(matname) Saves the reported frequencies in matname.
matcol(matname)  Saves the numeric values of the 1 x c column stub in matname.
matrow(matname)  Saves the numeric values of the r x 1 row stub in matname.
missing         Includes "missing" as one row and/or column of the table.
nofreq          Does not show cell frequencies.
nokey           Suppresses the display of a key above two-way tables. The default is to display the
                key if more than one cell statistic is requested and otherwise to omit it. Specifying
                key forces the display of the key.
nolabel         Shows numerical values rather than value labels of labeled numeric variables.
plot            Produces a simple bar chart of the relative frequencies in a one-way table.
replace         Indicates that the immediate data specified as arguments to the tabi command
                are to be left as the current data in memory, replacing whatever data were there.
row             Shows row percentages for each cell.
sort            Displays the rows in descending order of frequency (and ascending order of the
                variable within equal values of frequency).
subpop(varname) Excludes observations for which varname = 0 in tabulating frequencies.
                The identities of the rows and columns will be determined from all the data,
                including the varname = 0 group, and so there may be entries in the table with
                frequency 0.
taub            Kendall's τ-b (tau-b), with its asymptotic standard error (ASE). Measures
                association between ordinal variables. taub is similar to gamma, but uses a
                correction for ties. -1 ≤ τ-b ≤ 1.
V               Cramer's V (note capitalization), a measure of association for nominal variables.
                In 2 x 2 tables, -1 ≤ V ≤ 1; in larger tables, 0 ≤ V ≤ 1.
wrap            Requests that Stata take no action on wide, two-way tables to make them readable.
                Unless wrap is specified, wide tables are broken into pieces for readability.
To get the column percentages (because the column variable, kids, is the independent
variable) and a chi-squared test for the cross-tabulation of meetings by kids, type

. tabulate meetings kids, column chi2
+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

  Attended |
  meetings |  Have children <19 in
        on |         town?
 pollution |        no        yes |     Total
-----------+----------------------+----------
        no |        52         54 |       106
           |     82.54      60.00 |     69.28
-----------+----------------------+----------
       yes |        11         36 |        47
           |     17.46      40.00 |     30.72
-----------+----------------------+----------
     Total |        63         90 |       153
           |    100.00     100.00 |    100.00

          Pearson chi2(1) =   8.8464   Pr = 0.003
Forty percent of the respondents with children attended meetings, compared with about
17 percent of the respondents without children. This association is statistically significant (P = .003).
Occasionally we might need to re-analyze a published table, without access to the original
raw data. A special command, tabi ("immediate" tabulation), accomplishes this. Type the
cell frequencies on the command line, with table rows separated by "\". For illustration, here
is how tabi could reproduce the previous chi-squared analysis, given only the four cell frequencies:

. tabi 52 54 \ 11 36, column chi2
+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

           |          col
       row |         1          2 |     Total
-----------+----------------------+----------
         1 |        52         54 |       106
           |     82.54      60.00 |     69.28
-----------+----------------------+----------
         2 |        11         36 |        47
           |     17.46      40.00 |     30.72
-----------+----------------------+----------
     Total |        63         90 |       153
           |    100.00     100.00 |    100.00

          Pearson chi2(1) =   8.8464   Pr = 0.003
Unlike tabulate, tabi does not require or refer to any data in memory. By adding the
replace option, however, we can ask tabi to replace whatever data are in memory with
the new cross-tabulation. Statistical options (chi2, exact, nofreq, and so forth) work
the same for tabi as they do with tabulate.
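For instance, if a published table had cells with small expected frequencies, we might prefer Fisher's exact test. A sketch using the same four cell frequencies (not from the original text):

. tabi 52 54 \ 11 36, exact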
Multiple Tables and Multi-Way Cross-Tabulations

With surveys and other large datasets, we sometimes need frequency distributions of many
different variables. Instead of asking for each table separately, for example by typing
tabulate meetings, then tabulate gender, and finally tabulate kids, we
could simply use another specialized command, tab1:

. tab1 meetings gender kids

Or to produce one-way frequency tables for each variable from gender through school in this
dataset (the maximum is 30 variables at one time), type

. tab1 gender-school
Similarly, tab2 creates multiple two-way tables. For example, the following command
cross-tabulates every two-way combination of the listed variables:
.
tab2 meetings gender kids
tab1 and tab2 offer the same options as tabulate.
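So, for example, a quick screening of every pairwise association with chi-squared tests, suppressing the cell frequencies themselves, might look like this (an illustrative sketch, not from the original text):

. tab2 meetings gender kids, chi2 nofreq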
To form multi-way contingency tables, one approach uses the ordinary tabulate
command with a by prefix. Here is a three-way cross-tabulation of meetings by kids by
contam (respondent believes his or her own property or water contaminated), with chi-squared
tests for the independence of meetings and kids within each level of contam:

. by contam, sort: tabulate meetings kids, nofreq col chi2
-> contam = no

  Attended |
  meetings |  Have children <19 in
        on |         town?
 pollution |        no        yes |     Total
-----------+----------------------+----------
        no |     91.30      68.75 |     78.18
       yes |      8.70      31.25 |     21.82
-----------+----------------------+----------
     Total |    100.00     100.00 |    100.00

          Pearson chi2(1) =   7.9814   Pr = 0.005
-> contam = yes

  Attended |
  meetings |  Have children <19 in
        on |         town?
 pollution |        no        yes |     Total
-----------+----------------------+----------
        no |     58.82      38.46 |     46.51
       yes |     41.18      61.54 |     53.49
-----------+----------------------+----------
     Total |    100.00     100.00 |    100.00

          Pearson chi2(1) =   1.7131   Pr = 0.191
Parents were more likely to attend meetings, among both the contaminated and uncontaminated
groups. Only among the larger uncontaminated group is this "parenthood effect" statistically
significant, however. As multi-way tables separate the data into smaller subsamples, the size
of these subsamples has noticeable effects on significance-test outcomes.

This approach can be extended to tabulations of greater complexity. For example, to get
a four-way cross-tabulation of gender by contam by meetings by kids, with chi-squared tests for each
meetings by kids subtable (results not shown), type the command

. by gender contam, sort: tabulate meetings kids, column chi2
A different way to create frequency tables, if we do not need percentages or statistical tests,
is through Stata's general table-making command, table. This versatile command has many
options, only a few of which are illustrated here. To construct a simple frequency table of
meetings, type

. table meetings, contents(freq)

-----------------------
 Attended |
 meetings |
       on |
pollution |      Freq.
----------+------------
       no |        106
      yes |         47
-----------------------
For a two-way frequency table or cross-tabulation, type

. table meetings kids, contents(freq)

---------------------------------
 Attended |
 meetings |    Have children
       on |    <19 in town?
pollution |       no        yes
----------+----------------------
       no |       52         54
      yes |       11         36
---------------------------------
If we specify a third categorical variable, it forms the "supercolumns" of a three-way table:

. table meetings kids contam, contents(freq)

--------------------------------------------------
          |    Believe own property/water
 Attended |    contaminated and Have
 meetings |    children <19 in town?
       on |  ------ no ------   ------ yes ------
pollution |      no       yes       no       yes
----------+---------------------------------------
       no |      42        44       10        10
      yes |       4        20        7        16
--------------------------------------------------
More complicated tables require the by() option, which allows up to four "superrow"
variables. One table command thus can produce up to seven-way tables: one row, one column,
one supercolumn, and up to four superrows. Here is a four-way example:
. table meetings kids contam, contents(freq) by(gender)

----------------------------------------------------
Responden |
t's gender|
   and    |    Believe own property/water
 Attended |    contaminated and Have
 meetings |    children <19 in town?
       on |  ------ no ------   ------ yes ------
pollution |      no       yes       no       yes
----------+-----------------------------------------
male      |
       no |      18        18        3         3
      yes |       2         7        3         6
----------+-----------------------------------------
female    |
       no |      24        26        7         7
      yes |       2        13        4        10
----------------------------------------------------
The contents() option of table specifies what statistics the table's cells contain:

contents(freq)              Frequency
contents(mean varname)      Mean of varname
contents(sd varname)        Standard deviation of varname
contents(sum varname)       Sum of varname
contents(rawsum varname)    Sum ignoring optionally specified weight
contents(count varname)     Count of nonmissing observations of varname
contents(n varname)         Same as count
contents(max varname)       Maximum of varname
contents(min varname)       Minimum of varname
contents(median varname)    Median of varname
contents(iqr varname)       Interquartile range (IQR) of varname
contents(p1 varname)        1st percentile of varname
contents(p2 varname)        2nd percentile of varname (and so forth to p99)

The next section illustrates several more of these options.
Tables of Means, Medians, and Other Summary Statistics

tabulate readily produces tables of means and standard deviations within categories of the
tabulated variable. For example, here is a one-way table with means of lived within each
category of meetings:

. tabulate meetings, summ(lived)

   Attended |     Summary of Years lived in town
meetings on |
  pollution |        Mean   Std. Dev.       Freq.
------------+-------------------------------------
         no |   21.509434   17.743833         106
        yes |   14.212766   13.911139          47
------------+-------------------------------------
      Total |   19.267974   16.954663         153
The people who attended meetings tend to be relative newcomers, averaging 14.2 years in town,
compared with 21.5 years for those who did not attend.

We can also use tabulate to form a two-way table of means by typing

. tabulate meetings kids, sum(lived) means
                  Means of Years lived in town

   Attended |
   meetings |    Have children <19
         on |        in town?
  pollution |         no         yes |      Total
------------+-----------------------+------------
         no |  28.307692   14.962963 |  21.509434
        yes |  23.363636   11.416667 |  14.212766
------------+-----------------------+------------
      Total |  27.444444   13.544444 |  19.267974
19.267974
Both parents and nonparents among the meeting attenders tend to have lived fewer years in
refle^tiTn
SXlytoaJenT " SPUri°US
The means option used above called fora table containing only means. Otherwise we
get a bulkier table with means, standard deviations, and frequencies in each cell. Chapter 5
describes statistical tests for hypotheses about subgroup means.
Although it performs no tests, table nicely builds up to seven-way tables containing
means standard deviations, sums, medians, or other statistics (see the option list in previous
section). Here is a one-way table showing means of lived within categories of meetings:
. table meetings, contents(mean lived)

--------------------------
 Attended |
 meetings |
       on |
pollution | mean(lived)
----------+---------------
       no |     21.5094
      yes |     14.2128
--------------------------
A two-way table of means is a straightforward extension:

. table meetings kids, contents(mean lived)

-----------------------------------
 Attended |
 meetings |    Have children <19
       on |        in town?
pollution |        no         yes
----------+------------------------
       no |   28.3077      14.963
      yes |   23.3636     11.4167
-----------------------------------
Table cells can contain more than one statistic. Suppose we want a two-way table with both
means and medians of the variable lived:

. table meetings kids, contents(mean lived median lived)

-----------------------------------
 Attended |
 meetings |    Have children <19
       on |        in town?
pollution |        no         yes
----------+------------------------
       no |   28.3077      14.963
          |      27.5        12.5
          |
      yes |   23.3636     11.4167
          |        21           6
-----------------------------------
The cell contents shown by table could be means, medians, sums, or other summary
statistics for two or more different variables.
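As a further sketch (not shown in the original), we could also ask table for row and column margins, and for the number of observations behind each mean:

. table meetings kids, contents(mean lived n lived) row col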
Using Frequency Weights

summarize, tabulate, table, and related commands can be used with frequency
weights that indicate the number of replicated observations. For example, file sextab2.dta
contains data from a British survey of sexual behavior (Johnson 1992):
Contains data from C:\data\sextab2.dta
  obs:            48                          British sex survey (Johnson 92)
 vars:             4                          11 Jul 2005 18:05
 size:           432 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
--------------------------------------------------------------------------
age             byte   %8.0g       age        Age
gender          byte   %8.0g       gender     Gender
lifepart        byte   %8.0g       partners   # heterosex partners lifetime
count           int    %8.0g                  Number of individuals
--------------------------------------------------------------------------
Sorted by:  age
Each observation in sextab2.dta represents one combination of age group, gender, and lifetime
number of heterosexual partners. The variable count records how many survey respondents gave
each combination of answers. Listing the first five observations shows, for example, that 405
men aged 16-24 reported no heterosexual partners:
. list in 1/5

     +-----------------------------------------+
     |   age     gender   lifepart      count  |
     |-----------------------------------------|
  1. | 16-24       male       none        405  |
  2. | 16-24     female       none        465  |
  3. | 16-24       male        one        323  |
  4. | 16-24     female        one        606  |
  5. | 16-24       male        two        194  |
     +-----------------------------------------+
We use count as a frequency weight to create a cross-tabulation of lifepart by gender:

. tabulate lifepart gender [fw = count]

         # |
 heterosex |
  partners |        Gender
  lifetime |      male     female |     Total
-----------+----------------------+----------
      none |       544        586 |      1130
       one |      1734       4146 |      5880
       two |       887       1777 |      2664
       3-4 |      1542       1908 |      3450
       5-9 |      1630       1364 |      2994
       10+ |      2048        708 |      2756
-----------+----------------------+----------
     Total |      8385      10489 |     18874
The usual tabulate options work as expected with frequency weights. Here is the same
table showing column percentages instead of frequencies:
. tabulate lifepart gender [fweight = count], column nofreq

         # |
 heterosex |
  partners |        Gender
  lifetime |      male     female |     Total
-----------+----------------------+----------
      none |      6.49       5.59 |      5.99
       one |     20.68      39.53 |     31.15
       two |     10.58      16.94 |     14.11
       3-4 |     18.39      18.19 |     18.28
       5-9 |     19.44      13.00 |     15.86
       10+ |     24.42       6.75 |     14.60
-----------+----------------------+----------
     Total |    100.00     100.00 |    100.00
Other types of weights such as probability or analytical weights do not work as well with
tabulate because their meanings are unclear regarding the command's principal options.

A different application of frequency weights can be demonstrated with summarize. File
college1.dta contains information on a random sample consisting of 11 U.S. colleges, drawn
from Barron's Compact Guide to Colleges (1992).
Contains data from C:\data\college1.dta
  obs:            11                          Colleges sample 1 (Barron's)
 vars:             5                          11 Jul 2005 18:05
 size:           429 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
--------------------------------------------------------------------------
school          str28  %28s                   College or university
enroll          int    %8.0g                  Full-time students 1991
pctmale         byte   %8.0g                  Percent male 1991
msat            int    %8.0g                  Average math SAT
vsat            int    %8.0g                  Average verbal SAT
--------------------------------------------------------------------------
Sorted by:

The variables include msat, the mean math Scholastic Aptitude Test score at each of the 11
schools.
. list school enroll msat

     +----------------------------------------------------------+
     |                        school        enroll        msat  |
     |----------------------------------------------------------|
  1. |              Brown University          5550         680  |
  2. |                   U. Scranton          3821         554  |
  3. |   U. North Carolina/Asheville          2035         540  |
  4. |             Claremont College           849         660  |
  5. |             DePaul University          6197         547  |
     |----------------------------------------------------------|
  6. |        Thomas Aquinas College           201         570  |
  7. |              Davidson College          1543         640  |
  8. |          U. Michigan/Dearborn          3541         485  |
  9. |          Mass. College of Art           961         482  |
 10. |               Oberlin College          2765         640  |
     |----------------------------------------------------------|
 11. |           American University          5228         587  |
     +----------------------------------------------------------+
We can easily find the mean msat value among these 11 schools by typing

. summarize msat

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+----------------------------------------------------------
        msat |        11    580.4545    67.63155        482        680
This summary table gives each school's mean math SAT score the same weight. DePaul
University, however, has 30 times as many students as Thomas Aquinas College. To take the
different enrollments into account, we could weight by enroll:

. summarize msat [fweight = enroll]

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+----------------------------------------------------------
        msat |     32691     583.064    63.10665        482        680
Typing

. summarize msat [freq = enroll]

would accomplish the same thing.

The enrollment-weighted mean, unlike the unweighted mean, is equivalent to the mean for
the 32,691 students at these colleges (assuming they all took the SAT). Note, however, that we
could not say the same thing about the standard deviation, minimum, or maximum. Apart from
the mean, most individual-level statistics cannot be calculated simply by weighting data that
already are aggregated. Thus, we need to use weights with caution. They might make sense
in the context of one particular analysis, but seldom do for the dataset as a whole, when many
different kinds of analyses are needed.
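One illustration of this caution: summarize also accepts analytic weights, which treat enroll as a precision weight rather than as a count of replicated observations. The weighted mean is the same, but the reported number of observations and the standard deviation differ (a sketch; output not shown):

. summarize msat [aweight = enroll]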
ANOVA and Other
Comparison Methods
Analysis of variance (ANOVA) encompasses a set of methods for testing hypotheses about
differences between means. Its applications range from simple analyses where we compare the
means of v across categories of.v, to more complicated situations with multiple categorical and
measurement x variables, t tests for hypotheses regarding a single mean (one-sample) or a pair
of means (two-sample) correspond to elementary forms of ANOVA.
Rank-based “nonparametric” tests, including sign, Mann-Whitney, and Kruskal-Wallis,
take a different approach to comparing distributions. These tests make weaker assumptions
about measurement, distribution shape, and spread. Consequently, they remain valid under a
wider range of conditions than ANOVA and its “parametric” relatives. Careful analysts
sometimes use parametric and nonparametric tests together, checking to see whether both point
toward similar conclusions. Further troubleshooting is called for when parametric and
nonparametric results disagree.
I
anova is the first of Stata’s model-fitting commands to be introduced in this book. Like
the others, it has considerable flexibility encompassing a wide variety of models, anova can
fit one-way and N-way ANOVA or analysis of covariance (ANCOVA) for balanced and
unbalanced designs, including designs with missing cells. It can also fit factorial, nested,
mixed, or repeated-measures designs. One follow-up command, predict , calculates
predicted values, several types of residuals, and assorted standard errors and diagnostic
statistics after anova. Another followup command, test, obtains tests of user-specified
null hypotheses. Both predict and test work similarly with other Stata model-fitting
commands, such as regress (Chapter 6).
The following menu choices give access to most operations described in this chapter:
Statistics - Summaries, tables, & tests - Classical tests of hypotheses
Statistics - Summaries, tables, & tests - Nonparametric tests of hypotheses
Statistics - ANOVA/MANOVA
Statistics - General post-estimation - Obtain predictions, residuals, etc., after estimation
Graphics - Overlaid twoway graphs
Example Commands
. anova y x1 x2
Performs a two-way ANOVA, testing for differences among the means of y across categories
of x1 and of x2.
. anova y x1 x2 x1*x2
Performs a two-way factorial ANOVA, including both the main and interaction (x1*x2)
effects of categorical variables x1 and x2.
. anova y x1 x2 x3 x1*x2 x1*x3 x2*x3 x1*x2*x3
Performs a three-way factorial ANOVA, including the three-way interaction x1*x2*x3 as
well as all two-way interactions and main effects.
. anova reading curriculum / teacher|curriculum
Fits a nested model to test the effects of three types of curriculum on students' reading
ability (reading). teacher is nested within curriculum (teacher|curriculum)
because several different teachers were assigned to each curriculum. The Base Reference
Manual provides other nested ANOVA examples, including a split-plot design.
.
anova headache subject medication, repeated(medication)
Fits a repeated-measures ANOVA model to test the effects of three types of headache
medication (medication) on the severity of subjects’ headaches (headache). The sample
consists of 20 subjects who report suffering from frequent headaches. Each subject tried
each of the three medications at separate times during the study.
. anova y x1 x2 x3 x4 x2*x3, continuous(x3 x4) regress
Performs analysis of covariance (ANCOVA) with four independent variables, two of them
(x1 and x2) categorical and two of them (x3 and x4) measurements. Includes the x2*x3
interaction, and shows results in the form of a regression table instead of the default
ANOVA table.
. kwallis y, by(x)
Performs a Kruskal-Wallis test of the null hypothesis that y has identical rank distributions
across the k categories of x (k > 2).
. oneway y x
Performs a one-way analysis of variance (ANOVA), testing for differences among the
means of y across categories of x. The same analysis, with a different output table, is
produced by anova y x.
. oneway y x, tabulate scheffe
Performs one-way ANOVA, including a table of sample means and Scheffe multiple-comparison tests in the output.
. ranksum y, by(x)
Performs a Wilcoxon rank-sum test (also known as a Mann-Whitney U test) of the null
hypothesis that y has identical rank distributions for both categories of dichotomous
variable x. If we assume that both rank distributions possess the same shape, this amounts
to a test for whether the two medians of y are equal.
. serrbar ymean se x, scale(2)
Constructs a standard-error-bar plot from a dataset of means. Variable ymean holds the
group means of y; se the standard errors; and x the values of categorical variable x.
scale (2) asks for bars extending to ±2 standard errors around each mean (default is ±1
standard error).
. signrank y1 = y2
Performs a Wilcoxon matched-pairs signed-rank test for the equality of the rank
distributions of y1 and y2. We could test whether the median of y1 differs from a constant
such as 23.4 by typing the command signrank y1 = 23.4.
. signtest y1 = y2
Tests the equality of the medians of y1 and y2 (assuming matched data; that is, both
variables measured on the same sample of observations). Typing signtest y1 = 5
would perform a sign test of the null hypothesis that the median of y1 equals 5.
. ttest y = 5
Performs a one-sample t test of the null hypothesis that the population mean of y equals 5.

. ttest y1 = y2
Performs a one-sample (paired difference) t test of the null hypothesis that the population
mean of y1 equals that of y2. The default form of this command assumes that the data are
paired. With unpaired data (y1 and y2 are measured from two independent samples), add
the option unpaired.
. ttest y, by(x) unequal
Performs a two-sample t test of the null hypothesis that the population mean of y is the
same for both categories of variable x. Does not assume that the populations have equal
variances. (Without the unequal option, ttest does assume equal variances.)
One-Sample Tests

One-sample t tests have two seemingly different applications:
1. Testing whether a sample mean differs significantly from a hypothesized value μ0.
2. Testing whether the means of y1 and y2, two variables measured over the same set of
   observations, differ significantly from each other. This is equivalent to testing whether the
   mean of a "difference score" variable created by subtracting y1 from y2 equals zero.
We use essentially the same formulas for either application, although the second starts with
information on two variables instead of one.
The data in writing.dta were collected to evaluate a college writing course based on word
processing (Nash and Schwartz 1987). Measures such as the number of sentences completed
in timed writing were collected both before and after students took the course. The researchers
wanted to know whether the post-course measures showed improvement.
. describe

Contains data from C:\data\writing.dta
  obs:            24                          Nash and Schwartz (1987)
 vars:             9                          12 Jul 2005 10:16
 size:           312 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
--------------------------------------------------------------------------
id              byte   %8.0g                  Student ID
preS            byte   %8.0g                  # of sentences (pre-test)
preP            byte   %8.0g                  # of paragraphs (pre-test)
preC            byte   %8.0g                  Coherence scale 0-2 (pre-test)
preE            byte   %8.0g                  Evidence scale 0-6 (pre-test)
postS           byte   %8.0g                  # of sentences (post-test)
postP           byte   %8.0g                  # of paragraphs (post-test)
postC           byte   %8.0g                  Coherence scale 0-2 (post-test)
postE           byte   %8.0g                  Evidence scale 0-6 (post-test)
--------------------------------------------------------------------------
Sorted by:
Suppose that we knew that students in previous years were able to complete an average of
10 sentences. Before examining whether the students in writing.dta improved during the
course, we might want to learn whether at the start of the course they were essentially like
earlier students — in other words, whether their pre-test (preS) mean differs significantly from
the mean of previous students (10). To see a one-sample t test of μ = 10, type

. ttest preS = 10
One-sample t test

------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    preS |      24    10.79167    .9402034    4.606037    8.846708    12.73663
------------------------------------------------------------------------------
Degrees of freedom: 23

                           Ho: mean(preS) = 10

     Ha: mean < 10           Ha: mean != 10             Ha: mean > 10
       t =  0.8420              t =  0.8420               t =  0.8420
   P < t =  0.7958          P > |t| =  0.4084           P > t =  0.2042
The notation P > t means "the probability of a greater value of t" — that is, the one-tail
test probability. The two-tail probability of a greater absolute t appears as P > |t| =
.4084. Because this probability is high, we have no reason to reject H0: μ = 10. Note that
ttest automatically provides a 95% confidence interval for the mean. We could get a
different confidence interval, such as 90%, by adding a level(90) option to this command.
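For example (a sketch; output not shown):

. ttest preS = 10, level(90)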
I
A nonparametnc counterpart, the sign test, employs the binomial distribution to test
hypotheses about single medians. For example, we could test whether the median of preS
equals 10. sign test gives us no reason to reject that null hypothesis either.
. signtest preS = 10
Sign test

        sign |    observed    expected
-------------+--------------------------
    positive |          12          11
    negative |          10          11
        zero |           2           2
-------------+--------------------------
         all |          24          24

One-sided tests:
  Ho: median of preS - 10 = 0 vs.
  Ha: median of preS - 10 > 0
      Pr(#positive >= 12) =
         Binomial(n = 22, x >= 12, p = 0.5) =  0.4159

  Ho: median of preS - 10 = 0 vs.
  Ha: median of preS - 10 < 0
      Pr(#negative >= 10) =
         Binomial(n = 22, x >= 10, p = 0.5) =  0.7383

Two-sided test:
  Ho: median of preS - 10 = 0 vs.
  Ha: median of preS - 10 != 0
      Pr(#positive >= 12 or #negative >= 12) =
         min(1, 2*Binomial(n = 22, x >= 12, p = 0.5)) =  0.8318
Like ttest, signtest includes right-tail, left-tail, and two-tail probabilities. Unlike
the symmetrical t distributions used by ttest, however, the binomial distributions used by
signtest have different left- and right-tail probabilities. In this example, only the two-tail
probability matters because we were testing whether the writing.dta students "differ" from their
predecessors.

Next, we can test for improvement during the course by testing the null hypothesis that the
mean number of sentences completed before and after the course (that is, the means of preS and
postS) are equal. The ttest command accomplishes this as well, finding a significant
improvement.
. ttest postS = preS

Paired t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
   postS |      24      26.375    1.693779    8.297787    22.87115    29.87885
    preS |      24    10.79167    .9402034    4.606037    8.846708    12.73663
---------+--------------------------------------------------------------------
    diff |      24    15.58333    1.383019    6.775382    12.72234    18.44433
------------------------------------------------------------------------------

              Ho: mean(postS - preS) = mean(diff) = 0

 Ha: mean(diff) < 0        Ha: mean(diff) != 0         Ha: mean(diff) > 0
    t =  11.2676               t =  11.2676               t =  11.2676
P < t =   1.0000           P > |t| =   0.0000          P > t =   0.0000
Because we expect "improvement," not just "difference," between the preS and postS
means, a one-tail test is appropriate. The displayed one-tail probability rounds off at four decimal
places to zero ("0.0000" really means P < .00005). Students' mean sentence completion does
improve. Based on this sample, we are 95% confident that it improves by between roughly 12.7
and 18.4 sentences.

t tests assume that variables follow a normal distribution. This assumption usually is not
critical because the tests are moderately robust. When nonnormality involves severe outliers,
however, or occurs in small samples, we might be safer turning to medians instead of means and
employing a nonparametric test that does not assume normality. The Wilcoxon signed-rank
test, for example, assumes only that the distributions are symmetrical and continuous. Applying
a signed-rank test to these data yields essentially the same conclusion as ttest, that
students' sentence completion significantly improved. Because both tests agree on this
conclusion, we can assert it with more assurance.
. signrank postS = preS

Wilcoxon signed-rank test

        sign |      obs   sum ranks    expected
-------------+-----------------------------------
    positive |       24         300         150
    negative |        0           0         150
        zero |        0           0           0
-------------+-----------------------------------
         all |       24         300         300

unadjusted variance     1225.00
adjustment for ties       -1.63
adjustment for zeros       0.00
                        -------
adjusted variance       1223.38

Ho: postS = preS
             z =   4.289
    Prob > |z| =  0.0000
Two-Sample Tests
The remainder of this chapter draws examples from a survey of college undergraduates by
Ward and Ault (1990) (student2.dta).
. describe

Contains data from C:\data\student2.dta
  obs:           243                          Student survey (Ward & Ault 1990)
 vars:            19                          12 Jul 2005 10:16
 size:         6,561 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
--------------------------------------------------------------------------
id              int    %8.0g                  Student ID
year            byte   %8.0g       year       Year in college
age             byte   %8.0g                  Age at last birthday
gender          byte   %9.0g       s          Gender (male)
major           byte   %8.0g       v4         Student major
relig           byte   %8.0g                  Religious preference
drink           byte   %9.0g                  33-point drinking scale
gpa             float  %9.0g                  Grade Point Average
grades          byte   %8.0g       grades     Grades this semester
belong          byte   %8.0g       belong     Belong to fraternity/sorority
live            byte   %8.0g       vlO        Where do you live?
miles           byte   %8.0g                  How many miles from campus?
study           byte   %8.0g                  Avg. hours/week studying
athlete         byte   %8.0g       yes        Are you a varsity athlete?
employed        byte   %8.0g       yes        Are you employed?
allnight        byte   %8.0g       allnight   How often study all night?
ditch           byte   %8.0g       times      How many class/month ditched?
hsdrink         byte   %9.0g                  High school drinking scale
aggress         byte   %9.0g                  Aggressive behavior scale
--------------------------------------------------------------------------
Sorted by:  id
About 19% of these students belong to a fraternity or sorority:

. tabulate belong

  Belong to |
fraternity/ |
   sorority |      Freq.     Percent        Cum.
------------+-----------------------------------
     member |         47       19.34       19.34
  nonmember |        196       80.66      100.00
------------+-----------------------------------
      Total |        243      100.00
Another variable, drink, measures how often and heavily a student drinks alcohol, on a 33-point
scale. Campus rumors might lead one to suspect that fraternity/sorority members tend to
differ from other students in their drinking behavior. Box plots comparing the median drink
values of members and nonmembers, and a bar chart comparing their means, both appear
consistent with these rumors. Figure 5.1 combines these two separate plot types in one image.

. graph box drink, over(belong) ylabel(0(5)35) saving(fig05_01a)

. graph bar (mean) drink, over(belong) ylabel(0(5)35) saving(fig05_01b)

. graph combine fig05_01a.gph fig05_01b.gph, col(2) iscale(1.05)
[Figure 5.1: box plots of drink (left) and a bar chart of mean drink (right) for members and nonmembers, each on a 0-35 scale.]
The ttest command, used earlier for one-sample and paired-difference tests, can
perform two-sample tests as well. In this application its general syntax is ttest
measurement, by(categorical). For example,

. ttest drink, by(belong)
Two-sample t test with equal variances

------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
  member |      47     24.7234    .7124518    4.884323    23.28931     26.1575
nonmembe |     196     17.7602    .4575013    6.405018    16.85792    18.66249
---------+--------------------------------------------------------------------
combined |     243      19.107     .431224    6.722117    18.25756    19.95643
---------+--------------------------------------------------------------------
    diff |              6.9632    .9978608                4.997558    8.928842
------------------------------------------------------------------------------
Degrees of freedom: 241

            Ho: mean(member) - mean(nonmembe) = diff = 0

   Ha: diff < 0              Ha: diff != 0              Ha: diff > 0
     t =  6.9781               t =  6.9781                t =  6.9781
 P < t =  1.0000           P > |t| =  0.0000          P > t =  0.0000
. ttest drink, by(belong) unequal

Two-sample t test with unequal variances

------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
  member |      47     24.7234    .7124518    4.884323    23.28931     26.1575
nonmembe |     196     17.7602    .4575013    6.405018    16.85792    18.66249
---------+--------------------------------------------------------------------
combined |     243      19.107     .431224    6.722117    18.25756    19.95643
---------+--------------------------------------------------------------------
    diff |              6.9632    .8466965                5.280627    8.645773
------------------------------------------------------------------------------
Satterthwaite's degrees of freedom:  88.22

            Ho: mean(member) - mean(nonmembe) = diff = 0

   Ha: diff < 0              Ha: diff != 0              Ha: diff > 0
     t =  8.2240               t =  8.2240                t =  8.2240
 P < t =  1.0000           P > |t| =  0.0000          P > t =  0.0000
Whether or not we assume equal variances, these t tests reject the null hypothesis that members
and nonmembers drink equally. A nonparametric alternative, the Wilcoxon rank-sum
(Mann-Whitney) test, reaches the same conclusion without assuming normality. If we assume that
both rank distributions have similar shape, the rank-sum test here indicates that we can reject the
null hypothesis of equal population medians.
. ranksum drink, by(belong)

Two-sample Wilcoxon rank-sum (Mann-Whitney) test

      belong |      obs    rank sum    expected
-------------+----------------------------------
      member |       47        8535        5734
   nonmember |      196       21111       23912
-------------+----------------------------------
    combined |      243       29646       29646

unadjusted variance    187310.67
adjustment for ties      -472.30
                       ---------
adjusted variance      186838.36

Ho: drink(belong==member) = drink(belong==nonmember)
             z =   6.480
    Prob > |z| =  0.0000
One-Way Analysis of Variance (ANOVA)

Analysis of variance (ANOVA) provides another way, more general than t tests, to test for
differences among means. The simplest case, one-way ANOVA, tests whether the means of
y differ across categories of x. One-way ANOVA can be performed by the oneway command,
with the general form oneway measurement categorical. For example,
.
I
oneway drink belong,
Belong to I
fraternity/ |
sorority |
member |
nonmember I
Total |
tabulate
Summary of 33-point dtinkina s~aiMean
Std. Dev.
'
n a
60204
X
4.8843233
6.4 0 51 ' -
• A
= .':ce
MS
Source
Between groups
Within groups
Total
1838.08426
9097.13385
241
1838.08426
37.7474433
2131242
4r.18685:"
Bartlett's test for equal variances:
chi. (1)
=
r
Prob > F
4 8.69
0.1000
4.8378
9rob>chi2 = 0.128
The F test here (F = 48.69 = 6.9781 squared) is equivalent to our earlier two-sample t test that
assumed equal variances; the previous section showed how ttest can also be run
abandoning the equal-variances assumption. oneway formally tests the equal-variances
assumption, using Bartlett's chi-squared. A low Bartlett's probability implies that ANOVA's
equal-variance assumption is implausible, in which case we should not trust the ANOVA F test
results. In the oneway drink belong example above, Bartlett's P = .028 casts doubt on
the ANOVA's validity.
ANOVA's real value lies not in two-sample comparisons, but in more complicated
comparisons of three or more means. For example, we could test whether mean drinking
behavior varies by year in college:

. oneway drink year, tabulate scheffe
     Year in |    Summary of 33-point drinking scale
     college |        Mean   Std. Dev.       Freq.
-------------+-------------------------------------
    Freshman |      18.975   6.9226033          40
   Sophomore |   21.169231   6.5444853          65
      Junior |   19.453333   6.2866081          75
      Senior |   16.650794   6.6409257          63
-------------+-------------------------------------
       Total |   19.106996   6.7221166         243

                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
-------------------------------------------------------------------------
Between groups      666.200518      3   222.066839      5.17     0.0018
 Within groups      10269.0176    239   42.9666038
-------------------------------------------------------------------------
    Total           10935.2181    242   45.1868517

Bartlett's test for equal variances:  chi2(3) =   0.5103  Prob>chi2 = 0.917

          Comparison of 33-point drinking scale by Year in college
                                (Scheffe)

Row Mean-|
Col Mean |   Freshman   Sophomor     Junior
---------+------------------------------------
Sophomor |    2.19423
         |      0.429
         |
  Junior |    .478333    -1.7159
         |      0.987      0.498
         |
  Senior |   -2.32421   -4.51844   -2.83254
         |      0.382      0.002
We can reject the hypothesis of equal means (P = .0018), but not the hypothesis of equal
variances (P = .917). The latter is "good news" regarding the ANOVA's validity.

Box plots in Figure 5.2 (next page) support this conclusion, showing similar variation
within each category. This figure, which combines separate box plots and dot plots, shows that
differences among medians and among means follow similar patterns.

. graph hbox drink, over(year) ylabel(0(5)35) saving(fig05_02a)

. graph dot (mean) drink, over(year) ylabel(0(5)35, grid) marker(1, msymbol(S)) saving(fig05_02b)

. graph combine fig05_02a.gph fig05_02b.gph, row(2) iscale(1.05)
[Figure 5.2: horizontal box plots (top panel) and dot plots of mean drink (bottom panel) by year in college (Freshman, Sophomore, Junior, Senior), each on a 0-35 scale.]
The scheffe option (Scheffe multiple-comparison test) produces a table showing the
differences between each pair of means. The freshman mean equals 18.975 and the sophomore
mean equals 21.16923, so the sophomore-freshman difference is 21.16923 - 18.975 = 2.19423,
not statistically distinguishable from zero (P = .429). Of the six contrasts in this table, only the
senior-sophomore difference, 16.6508 - 21.1692 = -4.5184, is significant (P = .002). Thus,
our overall conclusion that these four groups' means are not the same arises mainly from the
contrast between seniors (the lightest drinkers) and sophomores (the heaviest).

oneway offers three multiple-comparison options: scheffe, bonferroni, and
sidak (see the Base Reference Manual for definitions). The Scheffe test remains valid under
a wider variety of conditions, although it is sometimes less sensitive.
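For example, to repeat the comparison with Bonferroni-adjusted tests instead (a sketch; output not shown):

. oneway drink year, tabulate bonferroni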
The Kruskal-Wallis test (kwallis), a K-sample generalization of the two-sample rank-sum
test, provides a nonparametric alternative to one-way ANOVA. It tests the null hypothesis
of equal population medians.
. kwallis drink, by(year)

Test: Equality of populations (Kruskal-Wallis test)

       year |  Obs |  Rank Sum
------------+------+-----------
   Freshman |   40 |   4914.00
  Sophomore |   65 |   9341.50
     Junior |   75 |   9300.50
     Senior |   63 |   6090.00

  chi-squared =    14.453 with 3 d.f.
  probability =     0.0023

  chi-squared with ties =    14.490 with 3 d.f.
  probability =     0.0023
Here, the kwallis results (P = .0023) agree with our oneway findings of significant
differences in drink by year in college. Kruskal-Wallis is generally safer than ANOVA if we
have reason to doubt ANOVA's equal-variances or normality assumptions, or if we suspect
problems caused by outliers. kwallis, like ranksum, makes the weaker assumption of
similar-shaped distributions within each group. In principle, ranksum and kwallis
should produce similar results when applied to two-sample comparisons, but in practice this is
true only if the data contain no ties. ranksum incorporates an exact method for dealing with
ties, which makes it preferable for two-sample problems.
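To see this in practice, we could apply kwallis to the two-group comparison examined earlier and set its results beside the ranksum output above (an illustrative sketch; output not shown):

. kwallis drink, by(belong)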
Two- and N-Way Analysis of Variance

One-way ANOVA examines how the means of measurement variable y vary across categories
of one other variable x. N-way ANOVA generalizes this approach to deal with two or more
categorical x variables. For example, we might consider how drinking behavior varies not only
by fraternity or sorority membership, but also by gender. We start by examining a two-way
table of means:
. table belong gender, contents(mean drink) row col

------------------------------------------------
Belong to  |
fraternity |       Gender (male)
/sorority  |    Female       Male      Total
-----------+-------------------------------------
    member |  22.44444   26.13793    24.7234
 nonmember |  16.51724    19.5625    17.7602
           |
     Total |  17.31343   21.31193     19.107
------------------------------------------------
It appears that in this sample, males drink more than females and members drink more than
nonmembers. The member-nonmember difference appears similar among males and females.
Stata's N-way ANOVA command, anova, can test for significant differences among these
means attributable to belonging to a fraternity or sorority, gender, or the interaction of
belonging and gender (written belong*gender).

. anova drink belong gender belong*gender
                           Number of obs =     243     R-squared     =  0.2222
                           Root MSE      = 5.96592     Adj R-squared =  0.2123

                  Source |  Partial SS    df       MS           F     Prob > F
          ---------------+-----------------------------------------------------
                   Model |  2428.67237     3   809.557456     22.75     0.0000
                         |
                  belong |   1416.2366     1    1416.2366     39.79     0.0000
                  gender |  408.520097     1   408.520097     11.48     0.0008
           belong*gender |  3.78016612     1   3.78016612      0.11     0.7448
                         |
                Residual |  8506.54574   239   35.5922416
          ---------------+-----------------------------------------------------
                   Total |  10935.2181   242   45.1868517
In this example of "two-way factorial ANOVA," the output shows significant main effects
for belong (P = .0000) and gender (P = .0008), but their interaction contributes little to the
model (P = .7448). This interaction cannot be distinguished from zero, so we might prefer to
fit a simpler model without the interaction term (results not shown):

. anova drink belong gender

To include any interaction term with anova, specify the variable names joined by *.
Unless the number of observations with each combination of x values is the same (a condition
called "balanced data"), it can be hard to interpret the main effects in a model that also includes
interactions. This does not mean that the main effects in such models are unimportant,
however. Regression analysis might help to make sense of complicated ANOVA results, as
illustrated in the following section.
Analysis of Covariance (ANCOVA)

Analysis of covariance (ANCOVA) extends N-way ANOVA to encompass a mix of categorical
and continuous x variables. This is accomplished through the anova command if we specify
which variables are continuous. For example, when we include gpa (college grade point
average) among the independent variables, we find that it, too, is related to drinking behavior.

. anova drink belong gender gpa, continuous(gpa)
                           Number of obs =     218     R-squared     =  0.2970
                           Root MSE      = 5.68939     Adj R-squared =  0.2872

                  Source |  Partial SS    df       MS           F     Prob > F
          ---------------+-----------------------------------------------------
                   Model |  2927.03087     3   975.676958     30.14     0.0000
                         |
                  belong |  1489.31999     1   1489.31999     46.01     0.0000
                  gender |  405.137843     1   405.137843     12.52     0.0005
                     gpa |    407.0089     1     407.0089     12.57     0.0005
                         |
                Residual |  6926.99206   214   32.3691218
          ---------------+-----------------------------------------------------
                   Total |  9854.02294   217   45.4102439
From this analysis we know that a significant relationship exists between drink and gpa
when we control for belong and gender. Beyond their F tests for statistical significance,
however, ANOVA or ANCOVA ordinarily do not provide much descriptive information about
how variables are related. Regression, with its explicit model and parameter estimates, does
a better descriptive job. Because ANOVA and ANCOVA amount to special cases of
regression, we could restate these analyses in regression form. Stata does so automatically if
we add the regress option to anova. For instance, we might want to see regression
output in order to understand results from the following ANCOVA.
. anova drink belong gender belong*gender gpa, continuous(gpa) regress

      Source |       SS       df       MS              Number of obs =     218
-------------+------------------------------           F(  4,   213) =   22.57
       Model |  2933.45823     4  733.364558           Prob > F      =  0.0000
    Residual |   6920.5647   213  32.4909141           R-squared     =  0.2977
-------------+------------------------------           Adj R-squared =  0.2845
       Total |  9854.02294   217  45.4102439           Root MSE      =  5.7001

-------------------------------------------------------------------------------
        drink |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
        _cons |   27.47676   2.439962    11.26   0.000      22.6672    32.28633
       belong |
            1 |   6.925384   1.286774     5.38   0.000     4.388942    9.461826
            2 |  (dropped)
       gender |
            1 |  -2.629057   .8917152    -2.95   0.004    -4.386774   -.8713407
            2 |  (dropped)
          gpa |  -3.054633   .8593498    -3.55   0.000    -4.748552   -1.360713
belong*gender |
          1 1 |  -.8656158   1.946211    -0.44   0.657    -4.701916    2.970685
          1 2 |  (dropped)
          2 1 |  (dropped)
          2 2 |  (dropped)
-------------------------------------------------------------------------------
With the regress option, we get the anova output formatted as a regression table.
The top part gives the same overall F test and R-squared as a standard ANOVA table. The bottom part
describes the following regression: We construct a separate dummy variable {0,1} representing
each category of each x variable, except for the highest categories, which are dropped.
Interaction terms (if specified in the variable list) are constructed from the products of every
possible combination of these dummy variables. Regress y on all these dummy variables and
interactions, and also on any continuous variables specified in the command line.

The previous example therefore corresponds to a regression of drink on four x variables:
1. a dummy coded 1 = fraternity/sorority member, 0 otherwise (highest category of belong,
   nonmember, gets dropped);
2. a dummy coded 1 = female, 0 otherwise (highest category of gender, male, gets dropped);
3. the continuous variable gpa;
4. an interaction term coded 1 = sorority female, 0 otherwise.

Interpret the individual dummy variables' regression coefficients as effects on predicted or
mean y values. For example, the coefficient on the first category of gender (female) equals -2.629057.
This informs us that the mean drinking scale levels for females are about 2.63 points lower than
those of males with the same grade point average and membership status. And we know that
among students of the same gender and membership status, mean drinking scale values decline
by 3.054633 with each one-point increase in grades. Note also that we have confidence
intervals and individual t tests for each coefficient; there is much more information in the
anova, regress output than in the ANOVA table alone.
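After anova, the follow-up test command mentioned at the start of this chapter can examine user-specified null hypotheses. For example, an F test of whether the gpa term contributes to the model (a sketch; output not shown):

. test gpa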
Predicted Values and Error-Bar Charts

After anova, the follow-up command predict calculates predicted values, residuals,
standard errors, and diagnostic statistics. One application for such statistics is in drawing
graphical representations of the model's predictions, in the form of error-bar charts. For a
simple illustration, we return to the one-way ANOVA of drink by year.

. anova drink year
                           Number of obs =     243     R-squared     =  0.0609
                           Root MSE      = 6.55489     Adj R-squared =  0.0491

                  Source |  Partial SS    df       MS           F     Prob > F
          ---------------+-----------------------------------------------------
                   Model |  666.200518     3   222.066839      5.17     0.0018
                         |
                    year |  666.200518     3   222.066839      5.17     0.0018
                         |
                Residual |  10269.0176   239   42.9666008
          ---------------+-----------------------------------------------------
                   Total |  10935.2181   242   45.1868517
To calculate predicted means from the most recent anova, type predict followed by a new
variable name:

. predict drinkmean
(option xb assumed; fitted values)

. label variable drinkmean "Mean drinking scale"

With the stdp option, predict calculates standard errors of the predicted means:

. predict SEdrink, stdp

Using these new variables, we apply the serrbar command to create an error-bar chart.
The scale(2) option tells serrbar to draw error bars of plus and minus two standard
errors, from drinkmean - 2*SEdrink to drinkmean + 2*SEdrink.

In a serrbar command, the first-listed variable should be the means or y variable; the
second-listed, the standard error or standard deviation (depending on which you want to show);
and the third-listed variable defines the x axis. The plot() option for serrbar can
specify a second plot to overlay on the standard-error bars. In Figure 5.3, we overlay a line plot
that connects the drinkmean values with solid line segments.
. serrbar drinkmean SEdrink year, scale(2) plot(line drinkmean year, clpattern(solid)) legend(off)
[Figure 5.3: error-bar chart of mean drink (plus or minus two standard errors) by year in college, with a line connecting the four group means.]
For a two-way factorial ANOVA, error-bar charts help us to visualize main and interaction effects. Although the usual error-bar command serrbar can, with effort, be adapted for this purpose, an alternative approach using the more flexible graph twoway family will be illustrated below. First, we perform ANOVA, obtain group means (predicted values) and their standard errors, then generate new variables equal to the group means plus or minus two standard errors. The example examines the relationship between students' aggressive behavior (aggress), gender, and year in college. Both the main effects of gender and year, and their interaction, are statistically significant.
. anova aggress gender year gender*year

                        Number of obs =     243     R-squared     =  0.2503
                        Root MSE      = 1.45652     Adj R-squared =  0.2280

              Source |  Partial SS    df       MS          F     Prob > F
          -----------+----------------------------------------------------
               Model |  166.432503     7   23.7832147    11.21     0.0000
                     |
              gender |  94.3505972     1   94.3505972    44.47     0.0000
                year |  19.0404045     3   6.34680149     2.99     0.0317
         gender*year |  24.1029759     3   8.03432529     3.79     0.0111
                     |
            Residual |  498.538073   235   2.12143861
          -----------+----------------------------------------------------
               Total |  665.020576   242   2.74801891
. predict aggmean
(option xb assumed; fitted values)
. label variable aggmean "Mean aggressive behavior scale"
. predict SEagg, stdp
. gen agghigh = aggmean + 2*SEagg
. gen agglow = aggmean - 2*SEagg
. graph twoway connected aggmean year
    || rcap agghigh agglow year
    || , by(gender, legend(off) note(""))
      ytitle("Mean aggressive behavior scale")
[Figure 5.4. Error-bar charts of mean aggressive behavior scale by year in college, in separate panels for females and males.]
Figure 5.4 built error-bar charts by overlaying two pairs of plots. The first pair are female and male connected-line plots, connecting the group means of aggress (which we calculated using predict, and saved as the variable aggmean). The second pair are female and male capped-spike range plots (twoway rcap) in which the vertical spikes connect the variables agghigh (group means of aggress plus two standard errors) and agglow (group means of aggress minus two standard errors). The by(gender) option produced sub-plots for females and males. Notice that to suppress legends and notes in a graph that uses a by( ) option, legend(off) and note("") must appear as suboptions within by( ).

The resulting error-bar chart (Figure 5.4) shows female means on the aggressive-behavior scale fluctuating at comparatively low levels during the four years of college. Male means are higher throughout, with a sophomore-year peak that resembles the pattern seen earlier for drinking (Figures 5.2 and 5.3). Thus, the relationship between aggress and year is different for males and females. This graph helps us to understand and explain the significant interaction effect.
predict works the same way with regression analysis (regress) as it does with anova because the two share a common mathematical framework. A list of some other predict options appears in Chapter 6, and further examples using these options are given in Chapter 7. The options include residuals that can be used to check assumptions regarding error distributions, as well as a suite of diagnostic statistics (such as leverage and Cook's D) that measure the influence of individual observations on model results. The Durbin-Watson test (dwstat), described in Chapter 13, can also be used after anova to test for first-order autocorrelation. Conditional effect plotting (Chapter 7) provides a graphical approach that can aid interpretation of more complicated regression, ANOVA, or ANCOVA models.
Linear Regression Analysis
Stata offers an exceptionally broad range of regression procedures. A partial list of the possibilities can be seen by typing help regress. This chapter introduces regress and related commands that perform simple and multiple ordinary least squares (OLS) regression. One followup command, predict, calculates predicted values, residuals, and diagnostic statistics such as leverage or Cook's D. Another followup command, test, performs tests of user-specified hypotheses. regress can accomplish other analyses including weighted least squares and two-stage least squares. Regression with dummy variables, interaction effects, polynomial terms, and stepwise variable selection are covered briefly in this chapter, along with a first look at residual analysis.
The following menus access most of the operations discussed:
Statistics - Linear regression and related - Linear regression
Statistics - Linear regression and related - Regression diagnostics
Statistics - General post-estimation - Obtain predictions, residuals, etc., after estimation
Graphics - Overlaid twoway graphs
Statistics - Cross-sectional time series
Example Commands
. regress y x
    Performs ordinary least squares (OLS) regression of variable y on one predictor, x.
. regress y x if ethnic == 3 & income > 50
    Regresses y on x using only that subset of the data for which variable ethnic equals 3 and income is greater than 50.
. predict yhat
    Generates a new variable (here arbitrarily named yhat) equal to the predicted values from the most recent regression.
. predict e, resid
    Generates a new variable (here arbitrarily named e) equal to the residuals from the most recent regression.
. graph twoway lfit y x || scatter y x
    Draws the simple regression line (lfit, or linear fit) with a scatterplot of y vs. x.
. graph twoway mspline yhat x || scatter y x
    Draws a simple regression line with a scatterplot of y vs. x by connecting (with a smooth cubic spline curve) the regression's predicted values (in this example named yhat).
    Note: There are many alternative ways to draw regression lines or curves in Stata. These alternatives include the twoway graph types mspline (illustrated above), mband, line, lfit, lfitci, qfit, and qfitci, each of which has its own advantages and options. Usually we combine (overlay) the regression line or curve with a scatterplot. If the scatterplot comes second in our graph twoway command, as in the example above, then scatterplot points will print on top of the regression line. Placing the scatterplot first in the command causes the line to print on top of the scatter. Examples throughout this and the following chapters illustrate some of these different possibilities.
. rvfplot
    Draws a residual versus fitted (predicted values) plot, automatically based on the most recent regression.
. graph twoway scatter e yhat, yline(0)
    Draws a residual versus predicted values plot using the variables e and yhat.
. regress y x1 x2 x3
    Performs multiple regression of y on three predictor variables, x1, x2, and x3.
. regress y x1 x2 x3, robust
    Calculates robust (Huber/White) estimates of standard errors. See the User's Guide for details. The robust option works with many other model-fitting commands as well.
. regress y x1 x2 x3, beta
    Performs multiple regression and includes standardized regression coefficients ("beta weights") in the output table.
. correlate x1 x2 x3 y
    Displays a matrix of Pearson correlations, using only observations with no missing values on all of the variables specified. Adding the option covariance produces a variance-covariance matrix instead of correlations.
. pwcorr x1 x2 x3 y, sig
    Displays a matrix of Pearson correlations, using pairwise deletion of missing values and showing probabilities from t tests of H0: ρ = 0 on each correlation.
. graph matrix x1 x2 x3 y, half
    Draws a scatterplot matrix. Because their variable lists are the same, this example yields a scatterplot matrix having the same organization as the correlation matrix produced by the preceding pwcorr command. Listing the dependent (y) variable last creates a matrix in which the bottom row forms a series of y-versus-x plots.
. test x1 x2
    Performs an F test of the null hypothesis that coefficients on x1 and x2 both equal zero in the most recent regression model.
. xi: regress y x1 x2 i.catvar*x2
    Performs "expanded interaction" regression of y on predictors x1, x2, a set of dummy variables created automatically to represent categories of catvar, and a set of interaction terms equal to those dummy variables times measurement variable x2. help xi gives more details.
. sw regress y x1 x2 x3, pr(.05)
    Performs stepwise regression using backward elimination until all remaining predictors are significant at the .05 level. All listed predictors are entered on the first iteration. Thereafter, each iteration drops the one predictor with the highest P value, until all predictors remaining have probabilities below the "probability to retain," pr(.05). Options permit forward or hierarchical selection, as sketched below. Stepwise variants exist for many other model-fitting commands as well; type help sw for a list.
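    A forward-selection variant might look like the following sketch, using the same hypothetical variables; pe(.05) sets the significance level a predictor must reach in order to enter the model:

    . sw regress y x1 x2 x3, pe(.05)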
. regress y x1 x2 x3 [aweight = w]
    Performs weighted least squares (WLS) regression of y on x1, x2, and x3. Variable w holds the analytical weights, which work as if we had multiplied each variable and the constant by the square root of w, and then performed an ordinary regression. Analytical weights are often employed to correct for heteroskedasticity when the y and x variables are means, rates, or proportions, and w is the number of individuals making up each aggregate observation (e.g., city or school) in the data. If the y and x variables are individual-level, and the weights indicate numbers of replicated observations, then use frequency weights [fweight = v] instead. See help svy if the weights reflect design factors such as disproportionate sampling.
. regress y1 y2 x (x z)
. regress y2 y1 z (x z)
    Estimates the reciprocal effects of y1 and y2, using instrumental variables x and z. The first parts of these commands specify the structural equations:
        y1 = α0 + α1 y2 + α2 x + ε1
        y2 = β0 + β1 y1 + β2 z + ε2
    The parentheses in the commands enclose variables that are exogenous to all of the structural equations. regress accomplishes two-stage least squares (2SLS) in this example.
. svy: regress y x1 x2 x3
    Regresses y on predictors x1, x2, and x3, with appropriate adjustments for a complex survey sampling design. We assume that a svyset command has previously been used to set up the data, by specifying the strata, clusters, and sampling probabilities. help svy lists the many procedures available for working with complex survey data; help regress outlines the syntax of this particular command; follow references to the User's Guide and the Survey Data Reference Manual for details.
. xtreg y x1 x2 x3 x4, re
    Fits a panel (cross-sectional time series) model with random effects by generalized least squares (GLS). An observation in panel data consists of information about unit i at time t, and there are multiple observations (times) for each unit. Before using xtreg, the variable identifying the units was specified by an iis ("i is") command, and the variable identifying time by tis ("t is"). Once the data have been saved, these definitions are retained for future analysis by xtreg and other xt procedures. help xt lists available panel estimation procedures; help xtreg gives the syntax of this command and references to the printed documentation. If your data include many observations for each unit, a time-series approach could be more appropriate. Stata's time series procedures (introduced in Chapter 13) provide further tools for analyzing panel data. Consult the Longitudinal/Panel Data Reference Manual for a full description.
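    As a sketch of that setup (the unit and time variable names firm and year are hypothetical), the declarations might look like:

    . iis firm
    . tis year
    . xtreg y x1 x2 x3 x4, re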
. xtmixed population year || city: year
    Assume that we have yearly data on population, for a number of different cities. The xtmixed population year part specifies a "fixed-effect" model, similar to ordinary regression, which describes the average trend in population. The || city: year part specifies a "random-effects" model, allowing unique intercepts and slopes (different starting points and growth rates) for each city.
. xtmixed SAT grades prepcourse || district: pctcollege || region:
    Fits a hierarchical (nested or multi-level) linear model predicting students' SAT scores as a function of the individual students' grades and whether they took a preparation course; the percent college graduates among their school district's adults; and region of the country (region affecting the y-intercept only). See the Longitudinal/Panel Data Reference Manual for much more about the xtmixed command, which is new with Stata 9.
The Regression Table
File states.dta contains educational data on the U.S. states and District of Columbia:
. describe state csat expense percent income high college region

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
state           str20  %20s                   State
csat            int    %9.0g                  Mean composite SAT score
expense         int    %9.0g                  Per pupil expenditures prim&sec
percent         byte   %9.0g                  % HS graduates taking SAT
income          long   %10.0g                 Median household income
high            float  %9.0g                  % adults HS diploma
college         float  %9.0g                  % adults college degree
region          byte   %9.0g       region     Geographical region
Political leaders occasionally use mean Scholastic Aptitude Test (SAT) scores to make pointed comparisons between the educational systems of different U.S. states. For example, some have raised the question of whether SAT scores are higher in states that spend more money on education. We might try to address this question by regressing mean composite SAT scores (csat) on per-pupil expenditures (expense). The appropriate Stata command has the form regress y x, where y is the predicted or dependent variable, and x the predictor or independent variable.
. regress csat expense

      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------           F(  1,    49) =   13.61
       Model |  48708.3001     1  48708.3001           Prob > F      =  0.0006
    Residual |   175306.21    49  3577.67775           R-squared     =  0.2174
-------------+------------------------------           Adj R-squared =  0.2015
       Total |   224014.51    50   4480.2902           Root MSE      =  59.814

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     expense |  -.0222756   .0060371    -3.69   0.001    -.0344077   -.0101436
       _cons |   1060.732   32.70089    32.44   0.000     995.0175    1126.447
------------------------------------------------------------------------------
This regression tells an unexpected story: the more money a state spends on education, the lower its students' mean SAT scores. Any causal interpretation is premature at this point, but the regression table does convey information about the linear statistical relationship between csat and expense. At upper right it gives an overall F test, based on the sums of squares at the upper left. This F test evaluates the null hypothesis that coefficients on all x variables in the model (here there is only one x variable, expense) equal zero. The F statistic, 13.61 with 1 and 49 degrees of freedom, leads easily to rejection of this null hypothesis (P = .0006). Prob > F means "the probability of a greater F statistic" if we drew samples randomly from a population in which the null hypothesis is true.

At upper right, we also see the coefficient of determination, R² = .2174. Per-pupil expenditures explain about 22% of the variance in states' mean composite SAT scores. Adjusted R², R²a = .2015, takes into account the complexity of the model relative to the complexity of the data. This adjusted statistic is often more informative for research.
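As a worked check (not additional Stata output), the adjustment follows the usual formula R²a = 1 − (1 − R²)(n − 1)/(n − K − 1) = 1 − (1 − .2174)(50/49) ≈ .2015, with n = 51 observations and K = 1 predictor; any small discrepancy in the last digit comes from rounding R².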
The lower half of the regression table gives the fitted model itself. We find coefficients (slope and y-intercept) in the first column, here yielding the prediction equation

    predicted csat = 1060.732 − .0222756 expense

The second column lists estimated standard errors of the coefficients. These are used to calculate t tests (columns 2-4) and confidence intervals (columns 5-6) for each regression coefficient. The t statistics (coefficients divided by their standard errors) test null hypotheses that the corresponding population coefficients equal zero. At the α = .05 significance level, we could reject this null hypothesis regarding both the coefficient on expense (P = .001) and the y-intercept (P = 0.000, really meaning P < .0005). Stata's modeling commands print 95% confidence intervals routinely, but we can request other levels by specifying the level( ) option, as shown in the following:
. regress csat expense, level(99)
Because these data do not represent a random sample from some larger population of U.S. states, hypothesis tests and confidence intervals lack their usual meanings. They are discussed in this chapter anyway for purposes of illustration.

The term _cons stands for the regression constant, usually set at one. Stata automatically includes a constant unless we tell it not to. The nocons option causes Stata to suppress the constant, performing regression through the origin. For example,

. regress y x, nocons

or

. regress y x1 x2 x3, nocons

In certain advanced applications, you might need to specify your own constant. If the independent variables include a user-supplied constant (named c, for example), employ the hascons option instead of nocons:

. regress y c x, hascons

Using nocons in this situation would result in a misleading F test and R². Consult the Base Reference Manual or help regress for more about hascons.
Multiple Regression
Multiple regression allows us to estimate how expense predicts csat, while adjusting for a number of other possible predictor variables. We can incorporate other predictors of csat simply by listing these variables in the command:

. regress csat expense percent income high college
      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------           F(  5,    45) =   42.23
       Model |  184663.309     5  36932.6617           Prob > F      =  0.0000
    Residual |  39351.2012    45  874.471137           R-squared     =  0.8243
-------------+------------------------------           Adj R-squared =  0.8048
       Total |   224014.51    50   4480.2902           Root MSE      =  29.571

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     expense |   .0033528   .0044709     0.75   0.457     -.005652    .0123576
     percent |  -2.618177   .2538491   -10.31   0.000    -3.129455   -2.106898
      income |   .0001056   .0011661     0.09   0.928     -.002243    .0024542
        high |   1.630841    .992247     1.64   0.107     -.367647    3.629329
     college |   2.030894   1.660118     1.22   0.228    -1.312756    5.374544
       _cons |   851.5649   59.29228    14.36   0.000     732.1441    970.9857
------------------------------------------------------------------------------
This yields the multiple regression equation

    predicted csat = 851.56 + .00335 expense − 2.618 percent + .0001 income + 1.63 high + 2.03 college

Controlling for four other variables weakens the coefficient on expense from −.0223 to .00335, which is no longer statistically distinguishable from zero. The unexpected negative relationship between expense and csat found in our earlier simple regression evidently can be explained by other predictors.
Only the coefficient on percent (percentage of high school graduates taking the SAT) attains significance at the .05 level. We could interpret this "fourth-order partial regression coefficient" (so called because its calculation adjusts for four other predictors) as follows.

    b2 = −2.618: Predicted mean SAT scores decline by 2.618 points with each one-point increase in the percentage of high school graduates taking the SAT, if expense, income, high, and college do not change.

Taken together, the five x variables in this model explain about 80% of the variance in states' mean composite SAT scores (R²a = .8048). In contrast, our earlier simple regression with expense as the only predictor explained only 20% of the variance in csat.
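As a small sketch (assuming the multiple regression above is still the most recent estimation), the R² values are stored by regress and can be redisplayed directly:

. display "R-squared = " e(r2) "   adjusted = " e(r2_a)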
To obtain standardized regression coefficients (“beta weights”) with any regression, add
the beta option. Standardized coefficients are what we would see in a regression where all
the variables had been transformed into standard scores (means 0, standard deviations 1).
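As a sketch of what that transformation involves (the z-prefixed variable names are arbitrary), one could standardize the variables with egen's std() function and regress the standard scores; in simple regression the resulting slope equals the beta weight:

. egen zcsat = std(csat)
. egen zpercent = std(percent)
. regress zcsat zpercent

The beta option reports the same quantities without creating any new variables, as shown next.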
. regress csat expense percent income high college, beta

      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------           F(  5,    45) =   42.23
       Model |  184663.309     5  36932.6617           Prob > F      =  0.0000
    Residual |  39351.2012    45  874.471137           R-squared     =  0.8243
-------------+------------------------------           Adj R-squared =  0.8048
       Total |   224014.51    50   4480.2902           Root MSE      =  29.571

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
     expense |   .0033528   .0044709     0.75   0.457                  .070185
     percent |  -2.618177   .2538491   -10.31   0.000                -1.024538
      income |   .0001056   .0011661     0.09   0.928                 .0101321
        high |   1.630841    .992247     1.64   0.107                 .1361672
     college |   2.030894   1.660118     1.22   0.228                 .1263952
       _cons |   851.5649   59.29228    14.36   0.000                        .
------------------------------------------------------------------------------
The standardized regression equation is

    predicted csat* = .070 expense* − 1.0245 percent* + .010 income* + .136 high* + .126 college*

where csat*, expense*, etc. denote these variables in standard-score form. We might interpret the standardized coefficient on percent, for example, as follows:

    b2* = −1.0245: Predicted mean SAT scores decline by 1.0245 standard deviations with each one-standard-deviation increase in the percentage of high school graduates taking the SAT, if expense, income, high, and college do not change.

The F and t tests, R², and other aspects of the regression remain the same.
Predicted Values and Residuals
After any regression, the predict command can obtain predicted values, residuals, and other case statistics. Suppose we have just done a regression of composite SAT scores on their strongest single predictor:

. regress csat percent

Now, to create a new variable called yhat containing predicted y values from this regression, type

. predict yhat
. label variable yhat "Predicted mean SAT score"

Through the resid option, we can also create another new variable containing the residuals, here named e:

. predict e, resid
. label variable e "Residual"
We might instead have obtained the same predicted y and residuals through two generate commands:

. generate yhat0 = _b[_cons] + _b[percent]*percent
. generate e0 = csat - yhat0

Stata temporarily remembers coefficients and other details from the recent regression. Thus _b[varname] refers to the coefficient on independent variable varname, and _b[_cons] refers to the coefficient on _cons (usually, the y-intercept). These stored values are useful in programming and some advanced applications, but for most purposes, predict saves us the trouble of generating yhat0 and e0 "by hand" in this fashion.
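For instance, assuming the regression of csat on percent is still the most recent estimation, the stored values could simply be displayed:

. display "intercept = " _b[_cons] "   slope = " _b[percent]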
Residuals contain information about where the model fits poorly, and so are important for diagnostic or troubleshooting analysis. Such analysis might begin just by sorting and examining the residuals. Negative residuals occur when our model overpredicts the observed values. That is, in these states the mean SAT scores are lower than we would expect, based on what percentage of students took the test. To list the states with the five lowest residuals, type

. sort e
. list state percent csat yhat e in 1/5
     +----------------------------------------------------------+
     |          state   percent   csat       yhat           e   |
     |----------------------------------------------------------|
  1. | South Carolina        58    832   894.3333    -62.3333   |
  2. |  West Virginia        17    926   986.0953   -60.09526   |
  3. | North Carolina        57    844   896.5714    -52.5714   |
  4. |          Texas        44    874   925.6666   -51.66666   |
  5. |         Nevada        25    919   968.1905   -49.19049   |
     +----------------------------------------------------------+
The four lowest residuals belong to southern states, suggesting that we might be able to improve our model, or better understand variation in mean SAT scores, by somehow taking region into account.

Positive residuals occur when actual y values are higher than predicted. Because the data already have been sorted by e, to list the five highest residuals we add the qualifier in -5/l. The "-5" in this qualifier means the 5th-from-last observation, and the letter "l" (note that this is not the number "1") stands for the last observation. The qualifiers in 47/l or in 47/51 could accomplish the same thing.
. list state percent csat yhat e in -5/l

     +----------------------------------------------------------+
     |         state   percent   csat       yhat           e    |
     |----------------------------------------------------------|
 47. | Massachusetts        79    896   847.3333    48.66673    |
 48. |   Connecticut        81    897   842.8571    54.14292    |
 49. |  North Dakota         6   1073   1010.714    62.28567    |
 50. | New Hampshire        75    921   856.2856    64.71434    |
 51. |          Iowa         5   1093   1012.952    80.04758    |
     +----------------------------------------------------------+
predict also derives other statistics from the most recently fitted model. Below are some predict options that can be used after anova or regress.

. predict new                Predicted values of y. predict new, xb means the same
                             thing (referring to Xb, the vector of predicted y values).
. predict new, cooksd        Cook's D influence measures.
. predict new, covratio      COVRATIO influence measures; effect of each observation
                             on the variance-covariance matrix of estimates.
. predict DFx1, dfbeta(x1)   DFBETAs measuring each observation's influence on the
                             coefficient of predictor x1.
. predict new, dfits         DFITS influence measures.
. predict new, hat           Diagonal elements of hat matrix (leverage).
. predict new, resid         Residuals.
. predict new, rstandard     Standardized residuals.
. predict new, rstudent      Studentized (jackknifed) residuals.
. predict new, stdf          Standard errors of predicted individual y, sometimes
                             called the standard errors of forecast or prediction.
. predict new, stdp          Standard errors of predicted mean y.
. predict new, stdr          Standard errors of residuals.
. predict new, welsch        Welsch's distance influence measures.

Further options obtain predicted probabilities and expected values; type help regress for a list. All predict options create case statistics, which are new variables (like predicted values and residuals) that have a value for each observation in the sample.
When using predict, substitute a new variable name of your choosing for new in the commands shown above. For example, to obtain Cook's D influence measures, type

. predict D, cooksd

Or you can find hat matrix diagonals by typing

. predict h, hat

The names of variables created by predict (such as yhat, e, D, h) are arbitrary and are invented by the user. As with other elements of Stata commands, we could abbreviate the options to the minimum number of letters it takes to identify them uniquely. For example,

. predict e, resid

could be shortened to

. pre e, re
Basic Graphs for Regression
This section introduces some elementary graphs you can use to represent a regression model or examine its fit. Chapter 7 describes more specialized graphs that aid post-regression diagnostic work.

In simple regression, predicted values lie on the line defined by the regression equation. By plotting and connecting predicted values, we can make that line visible. The lfit (linear fit) command automatically draws a simple regression line.
. graph twoway lfit csat percent

Ordinarily, it is more interesting to overlay a scatterplot on the regression line, as done in Figure 6.1.

. graph twoway lfit csat percent
    || scatter csat percent
    || , ytitle("Mean composite SAT score") legend(off)
[Figure 6.1. Mean composite SAT score versus % HS graduates taking SAT, with the fitted regression line.]
We could draw the same Figure 6.1 graph "by hand" using the predicted values (yhat) generated after the regression, and a command of the form

. graph twoway mspline yhat percent, bands(50)
    || scatter csat percent
    || , legend(off) ytitle("Mean composite SAT score")

The second approach is more work, but offers greater flexibility for advanced applications such as conditional effect plots or nonlinear regression. Working directly with the predicted values also keeps the analyst closer to the data, and to what a regression model is doing. graph twoway mspline (cubic spline curve fit to 50 cross-medians) simply draws a straight line when applied to linear predicted values, but will equally well draw a smooth curve in the case of nonlinear predicted values.
Residual-versus-predicted-values plots provide useful diagnostic tools (Figure 6.2). After
any regression analysis (also after some other models, such as ANOVA) we can automatically
draw a residual-versus-fitted (predicted values) plot just by typing
. rvfplot, yline(0)
[Figure 6.2. Residuals versus fitted values from the regression of csat on percent.]
The "by-hand" alternative for drawing Figure 6.2 would be

. graph twoway scatter e yhat, yline(0)

Figure 6.2 reveals that our present model overlooks an obvious pattern in the data. The residuals or prediction errors appear to be mostly positive at first (due to too-low predictions), then mostly negative, followed by mostly positive residuals again. Later sections will seek a model that better fits these data.
predict can generate two kinds of standard errors for the predicted y values, which have two different applications. These applications are sometimes distinguished by the names "confidence intervals" and "prediction intervals." A "confidence interval" in this context expresses our uncertainty in estimating the conditional mean of y at a given x value (or a given combination of x values, in multiple regression). Standard errors for this purpose are obtained through

. predict SE, stdp

Select an appropriate t value. With 49 degrees of freedom, for 95% confidence we should use t = 2.01, found by looking up the t distribution or simply by asking Stata:

. display invttail(49,.05/2)
2.0095752

Then the lower confidence limit is approximately

. generate low1 = yhat - 2.01*SE

and the upper confidence limit is

. generate high1 = yhat + 2.01*SE
Confidence bands in simple regression have an hourglass shape, narrowest at the mean of x. We could graph these using an overlaid twoway command such as the following.

. graph twoway mspline low1 percent, clpattern(dash) bands(50)
    || mspline high1 percent, clpattern(dash) bands(50)
    || mspline yhat percent, clpattern(solid) bands(50)
    || scatter csat percent
    || , legend(off) ytitle("Mean composite SAT score")
Shaded-area range plots (see help twoway_rarea) offer a different way to draw such graphs, shading the range between low1 and high1. Alternatively, lfitci can do this automatically, and take care of the confidence-band calculations, as illustrated in Figure 6.3. Note the stdp option, calling for a conditional-mean confidence band (actually, the default).

. graph twoway lfitci csat percent, stdp
    || scatter csat percent, msymbol(O)
    || , ytitle("Mean composite SAT score") legend(off)
      title("Confidence bands for conditional means (stdp)")
[Figure 6.3. Confidence bands for conditional means (stdp): mean composite SAT score versus % HS graduates taking SAT.]
The second type of confidence interval for regression predictions is sometimes called a "prediction interval." This expresses our uncertainty in estimating the unknown value of y for an individual observation with known x value(s). Standard errors for this purpose are obtained by typing

. predict SEyhat, stdf
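By analogy with the by-hand limits above, approximate 95% limits for individual predictions could be generated as a sketch (flow and fhigh are arbitrary new variable names):

. generate flow = yhat - 2.01*SEyhat
. generate fhigh = yhat + 2.01*SEyhat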
Figure 6.4 (next page) graphs this prediction band using lfitci with the stdf option. Predicting the y values of individual observations as done in Figure 6.4 inherently involves greater uncertainty, and hence wider bands, than does predicting the conditional mean of y (Figure 6.3). In both instances, the bands are narrowest at the mean of x.
. graph twoway lfitci csat percent, stdf
    || scatter csat percent, msymbol(O)
    || , ytitle("Mean composite SAT score") legend(off)
      title("Confidence bands for individual-case predictions (stdf)")
[Figure 6.4. Confidence bands for individual-case predictions (stdf): mean composite SAT score versus % HS graduates taking SAT.]
As with other confidence intervals and hypothesis tests in OLS regression, the standard errors and bands just described depend on the assumption of independent and identically distributed errors. Figure 6.2 has cast doubt on this assumption, so the results in Figures 6.3 and 6.4 could be misleading.
Correlations
correlate obtains Pearson product-moment correlations between variables.

. correlate csat expense percent income high college
(obs=51)
             |     csat  expense  percent   income     high  college
-------------+-------------------------------------------------------
        csat |   1.0000
     expense |  -0.4663   1.0000
     percent |  -0.8758   0.6509   1.0000
      income |  -0.4713   0.6784   0.6733   1.0000
        high |   0.0858   0.3133   0.1413   0.5099   1.0000
     college |  -0.3729   0.6400   0.6091   0.7234   0.5319   1.0000
correlate uses only a subset of the data that has no missing values on any of the variables listed (with these particular variables, that does not matter because no observations have missing values). In this respect, the correlate command resembles regress, and given the same variable list, they will use the same subset of the data. Analysts not employing regression or other multi-variable techniques, however, might prefer to find correlations based upon all of the observations available for each variable pair. The command pwcorr (pairwise correlation) accomplishes this, and can also furnish t-test probabilities for the null hypotheses that each individual correlation equals zero.
. pwcorr csat expense percent income high college, sig

             |     csat  expense  percent   income     high  college
-------------+-------------------------------------------------------
        csat |   1.0000
             |
     expense |  -0.4663   1.0000
             |   0.0006
             |
     percent |  -0.8758   0.6509   1.0000
             |   0.0000   0.0000
             |
      income |  -0.4713   0.6784   0.6733   1.0000
             |   0.0005   0.0000   0.0000
             |
        high |   0.0858   0.3133   0.1413   0.5099   1.0000
             |   0.5495   0.0252   0.3226   0.0001
             |
     college |  -0.3729   0.6400   0.6091   0.7234   0.5319   1.0000
             |   0.0070   0.0000   0.0000   0.0000   0.0001
             |
It is worth recalling here that if we drew many random samples from a population in which all variables really had 0 correlations, about 5% of the sample correlations would nonetheless test "statistically significant" at the .05 level. Analysts who review many individual hypothesis tests, such as those in a pwcorr matrix, to identify the handful that are significant at the .05 level, therefore run a much higher than .05 risk of making a Type I error. This problem is called the "multiple comparison fallacy." pwcorr offers two methods, Bonferroni and Sidak, for adjusting significance levels to take multiple comparisons into account. Of these, the Sidak method is more precise.
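The Bonferroni version could be requested the same way, as in this sketch:

. pwcorr csat expense percent income high college, bonferroni sig

The Sidak-adjusted probabilities follow.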
. pwcorr csat expense percent income high college, sidak sig

             |     csat  expense  percent   income     high  college
-------------+-------------------------------------------------------
        csat |   1.0000
             |
     expense |  -0.4663   1.0000
             |   0.0084
             |
     percent |  -0.8758   0.6509   1.0000
             |   0.0000   0.0000
             |
      income |  -0.4713   0.6784   0.6733   1.0000
             |   0.0072   0.0000   0.0000
             |
        high |   0.0858   0.3133   0.1413   0.5099   1.0000
             |   1.0000   0.3180   0.9971   0.0020
             |
     college |  -0.3729   0.6400   0.6091   0.7234   0.5319   1.0000
             |   0.1004   0.0000   0.0000   0.0000   0.0009
             |
Comparing the test probabilities in the table above with those of the previous pwcorr provides some idea of how much adjustment occurs. In general, the more variables we correlate, the more the adjusted probabilities will exceed their unadjusted counterparts. See the Base Reference Manual's discussion of oneway for the formulas involved.
correlate itself offers several important options. Adding the covariance option produces a matrix of variances and covariances instead of correlations:

. correlate w x y z, covariance

Typing the following after a regression analysis displays the matrix of correlations between estimated coefficients, sometimes used to diagnose multicollinearity (see Chapter 7):

. correlate, _coef

The following command will display the estimated coefficients' variance-covariance matrix, from which standard errors are derived:

. correlate, _coef covariance
Pearson correlation coefficients measure how well an OLS regression line fits the data. They consequently share the assumptions and weaknesses of OLS, and like OLS, should generally not be interpreted without first reviewing the corresponding scatterplots. A scatterplot matrix provides a quick way to do this, using the same organization as the correlation matrix. Figure 6.5 shows a scatterplot matrix corresponding to the pwcorr matrix given earlier. Only the lower-triangular half of the matrix is drawn, and plus signs are used as plotting symbols. We suppress y- and x-axis labeling here to keep the graph uncluttered.

. graph matrix csat expense percent income high college,
    half msymbol(+) maxis(ylabel(none) xlabel(none))
[Figure 6.5. Scatterplot matrix (lower half) of Mean composite SAT score, Per pupil expenditures, % HS graduates taking SAT, Median household income, % adults HS diploma, and % adults college degree.]
To obtain a scatterplot matrix corresponding to a correlate correlation matrix, from which all observations having missing values have been dropped, we would need to qualify the command. If all of the variables had some missing values, we could type a command such as

. graph matrix csat expense percent income high college if csat < .
    & expense < . & income < . & high < . & college < .

To reduce the likelihood of confusion and mistakes, it might make sense to create a new dataset keeping only those observations that have no missing values:

. keep if csat < . & expense < . & income < . & high < . & college < .
. save nmvstate
In this example, we immediately saved the reduced dataset with a new name, so as to avoid inadvertently writing over and losing the information in the old, more complete dataset. An alternative way to eliminate missing values uses drop instead of keep:

. drop if csat >= . | expense >= . | income >= . | high >= . | college >= .
. save nmvstate
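One further alternative, sketched here with an arbitrary new variable name nmiss, uses egen's rowmiss() function to count missing values across the relevant variables and then keeps only complete cases:

. egen nmiss = rowmiss(csat expense income high college)
. keep if nmiss == 0
. save nmvstate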
In addition to Pearson correlations, Stata can also calculate several rank-based correlations. These can be employed to measure associations between ordinal variables, or as an outlier-resistant alternative to Pearson correlation for measurement variables. To obtain the Spearman rank correlation between csat and expense, equivalent to the Pearson correlation if these variables were transformed into ranks, type

. spearman csat expense

 Number of obs =      51
Spearman's rho =     -0.4282

Test of Ho: csat and expense are independent
    Prob > |t| =       0.0017
Kendall's τa (tau-a) and τb (tau-b) rank correlations can be found easily for these data, although with larger datasets their calculation becomes slow:

. ktau csat expense

  Number of obs =       51
Kendall's tau-a =  -0.2925
Kendall's tau-b =  -0.2932
Kendall's score =     -373
    SE of score =  123.095   (corrected for ties)

Test of Ho: csat and expense are independent
     Prob > |z| =   0.0025  (continuity corrected)
For comparison, here is the Pearson correlation with its (unadjusted) P-value:
. pwcorr csat expense, sig

             |     csat  expense
-------------+-------------------
        csat |   1.0000
             |
     expense |  -0.4663   1.0000
             |   0.0006
             |
In this example, both spearman (-.4282) and pwcorr (-.4663) yield higher correlations than ktau (-.2925 or -.2932). All three agree that null hypotheses of no association can be rejected.
Hypothesis Tests
Two types of hypothesis tests appear in regress output tables. As with other common
hypothesis tests, they begin from the assumption that observations in the sample at hand were
drawn randomly and independently from an infinitely large population.
1. Overall F test: The F statistic at the upper right in the regression table evaluates the null hypothesis that in the population, coefficients on all the model's x variables equal zero.
2. Individual t tests: The third and fourth columns of the regression table contain t tests for each individual regression coefficient. These evaluate the null hypotheses that in the population, the coefficient on each particular x variable equals zero.

The t test probabilities are two-sided. For one-sided tests, divide these P-values in half.

In addition to these standard F and t tests, Stata can perform F tests of user-specified hypotheses. The test command refers back to the most recent model-fitting command such as anova or regress. For example, individual t tests from the following regression report that neither the percent of adults with at least high school diplomas (high) nor the percent with college degrees (college) has a statistically significant individual effect on composite SAT scores.

. regress csat expense percent income high college
Conceptually, however, both predictors reflect the level of education attained by a state's population, and for some purposes we might want to test the null hypothesis that both have no effect. To do this, we begin by repeating the multiple regression quietly, because we do not need to see its full output again. Then use the test command:

. quietly regress csat expense percent income high college
. test high college

 ( 1)  high = 0.0
 ( 2)  college = 0.0

       F(  2,    45) =    3.32
            Prob > F =    0.0451
Unlike the individual null hypotheses, the joint hypothesis that coefficients on high and college both equal zero can reasonably be rejected (P = .0451). Such tests on subsets of coefficients are useful when we have several conceptually related predictors or when individual coefficient estimates appear unreliable due to multicollinearity (Chapter 7).

test could duplicate the overall F test:

. test expense percent income high college

test could also duplicate the individual-coefficient tests:

. test expense
. test percent
. test income

and so forth. Applications of test more useful in advanced work include

1. Test whether a coefficient equals a specified constant. For example, to test the null hypothesis that the coefficient on income equals 1 (H0: β3 = 1), instead of the usual null hypothesis that it equals 0 (H0: β3 = 0), type

. test income = 1

2. Test whether two coefficients are equal. For example, the following command evaluates the null hypothesis H0: β4 = β5:

. test high = college

3. Finally, test understands some algebraic expressions. We could request something like the following, which would test H0: β3 = (β4 + β5)/100:

. test income = (high + college)/100

Consult help test for more information and examples.
Dummy Variables
Categorical variables can become predictors in a regression when they are expressed as one or
more {0,1} dichotomies called dummy variables.” For example, we have reason to suspect
that regional differences exist in states mean SAT scores. The tabulatse command will
generate one dummy variable for each category of the tabulated variable if we add a gen
(generate) option. Below, we create four dummy variables from the four-category variable
legion. The dummies are named regl, reg2, reg3 and reg4. regl equals 1 for Western states
and 0 for others; regl equals 1 for Northeastern states and 0 for others; and so forth.
. tabulate region, gen(reg)

Geographica |
   l region |      Freq.     Percent        Cum.
------------+-----------------------------------
       West |         13       26.00       26.00
    N. East |          9       18.00       44.00
      South |         16       32.00       76.00
    Midwest |         12       24.00      100.00
------------+-----------------------------------
      Total |         50      100.00
. describe reg1-reg4

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
reg1            byte   %8.0g                  region==West
reg2            byte   %8.0g                  region==N. East
reg3            byte   %8.0g                  region==South
reg4            byte   %8.0g                  region==Midwest

. tabulate reg1

region==Wes |
          t |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         37       74.00       74.00
          1 |         13       26.00      100.00
------------+-----------------------------------
      Total |         50      100.00

. tabulate reg2

 region==N. |
       East |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         41       82.00       82.00
          1 |          9       18.00      100.00
------------+-----------------------------------
      Total |         50      100.00
Regressing csat on one dummy variable, reg2 (Northeast), is equivalent to performing a
two-sample t test of whether mean csat is the same across categories of reg2. That is, is the
mean csat the same in the Northeast as in other U.S. states?
. regress csat reg2
      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  1,    48) =    9.50
       Model |  35191.4017     1  35191.4017           Prob > F      =  0.0034
    Residual |  177769.978    48  3703.54121           R-squared     =  0.1652
-------------+------------------------------           Adj R-squared =  0.1479
       Total |   212961.38    49  4346.15061           Root MSE      =  60.857

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        reg2 |   -69.0542   22.40167    -3.08   0.003    -114.0958   -24.01262
       _cons |   958.6098   9.504224   100.86   0.000     939.5002    977.7193
------------------------------------------------------------------------------
The dummy variable coefficient’s t statistic (t = -3.08, P = .003) indicates a significant
difference. According to this regression, mean SAT scores are 69.0542 points lower (because
b = -69.0542) among Northeastern states. We get exactly the same result (t = 3.08, P = .003)
from a simple t test, which also shows the means as 889.5556 (Northeast) and 958.6098 (other
states), a difference of 69.0542.
. ttest csat, by(reg2)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |      41    958.6098    10.36563    66.37239      937.66    979.5595
       1 |       9    889.5556    4.652094    13.95628    878.8278    900.2833
---------+--------------------------------------------------------------------
combined |      50      946.18    9.323251    65.92534    927.4442    964.9158
---------+--------------------------------------------------------------------
    diff |              69.0542    22.40167                24.01262    114.0958
------------------------------------------------------------------------------
Degrees of freedom: 48

                      Ho: mean(0) - mean(1) = diff = 0

     Ha: diff < 0               Ha: diff != 0              Ha: diff > 0
       t =   3.0825               t =   3.0825               t =   3.0825
   P < t =   0.9983          P > |t| =   0.0034          P > t =   0.0017
This conclusion proves spurious, however, once we control for the percentage of students
taking the test. We do so with a multiple regression of csat on both reg2 and percent.
. regress csat reg2 percent
      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  2,    47) =  107.18
       Model |  174664.983     2  87332.4916           Prob > F      =  0.0000
    Residual |  38296.3969    47  814.816955           R-squared     =  0.8202
-------------+------------------------------           Adj R-squared =  0.8125
       Total |   212961.38    49  4346.15061           Root MSE      =  28.545

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        reg2 |   57.52437   14.28326     4.03   0.000     28.79016    86.25858
     percent |  -2.793009   .2134796   -13.08   0.000    -3.222475   -2.363544
       _cons |   1033.749   7.270285   142.19   0.000     1019.123    1048.374
------------------------------------------------------------------------------
The Northeastern region variable reg2 now has a statistically significant positive coefficient (b = 57.52437, P < .0005). The earlier negative relationship was misleading. Although mean SAT scores among Northeastern states really are lower, they are lower because higher percentages of students take this test in the Northeast. A smaller, more "elite" group of students, often less than 20% of high school seniors, take the SAT in many of the non-Northeast states. In all Northeastern states, however, large majorities (64% to 81%) do so. Once we adjust for differences in the percentages taking the test, SAT scores actually tend to be higher in the Northeast.

To understand dummy variable regression results, it can help to write out the regression equation, substituting zeroes and ones. For Northeastern states, the equation is approximately

    predicted csat = 1033.7 + 57.5 reg2 − 2.8 percent
                   = 1033.7 + 57.5 × 1 − 2.8 percent
                   = 1091.2 − 2.8 percent
For other states, the predicted csat is 57.5 points lower at any given level of percent:

    predicted csat = 1033.7 + 57.5 × 0 − 2.8 percent
                   = 1033.7 − 2.8 percent

Dummy variables in models such as this are termed "intercept dummy variables," because they describe a shift in the y-intercept or constant.
From a categorical variable with k categories we can define k dummy variables, but one of these will be redundant. Once we know a state's values on the West, Northeast, and Midwest dummy variables, for example, we can already guess its value on the South variable. For this reason, no more than k − 1 of the dummy variables (three, in the case of region) can be included in a regression. If we try to include all the possible dummies, Stata will automatically drop one because multicollinearity otherwise makes the calculation impossible.
. regress csat reg1 reg2 reg3 reg4 percent
      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  4,    45) =   64.61
       Model |  181378.099     4  45344.5247           Prob > F      =  0.0000
    Residual |  31583.2811    45  701.850691           R-squared     =  0.8517
-------------+------------------------------           Adj R-squared =  0.8385
       Total |   212961.38    49  4346.15061           Root MSE      =  26.492

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        reg1 |  -23.77315   11.12578    -2.14   0.038    -46.18162   -1.364676
        reg2 |   25.79985   16.96365     1.52   0.135    -8.366693    59.96639
        reg3 |  -33.29951   10.85443    -3.07   0.004    -55.16146   -11.43757
        reg4 |  (dropped)
     percent |  -2.546058   .2140196   -11.90   0.000    -2.977116   -2.115001
       _cons |   1047.638   8.273625   126.62   0.000     1030.974    1064.302
------------------------------------------------------------------------------
The model's fit, including R², F tests, predictions, and residuals, remains essentially the same regardless of which dummy variable we (or Stata) choose to omit. Interpretation of the coefficients, however, occurs with reference to that omitted category. In this example, the Midwest dummy variable (reg4) was omitted. The regression coefficients on reg1, reg2, and reg3 tell us that, at any given level of percent, the predicted mean SAT scores are approximately as follows:
    23.8 points lower in the West (reg1 = 1) than in the Midwest;
    25.8 points higher in the Northeast (reg2 = 1) than in the Midwest; and
    33.3 points lower in the South (reg3 = 1) than in the Midwest.
The West and South both differ significantly from the Midwest in this respect, but the Northeast does not.
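As a sketch of how the choice of reference category works, one could instead omit the South dummy (reg3); each remaining regional coefficient would then describe a contrast with the South rather than the Midwest, while the overall fit stays the same:

. regress csat reg1 reg2 reg4 percent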
An alternative command, areg, fits the same model without going through dummy variable creation. Instead, it "absorbs" the effect of a k-category variable such as region. The model's fit, F test on the absorbed variable, and other key aspects of the results are the same as those we could obtain through explicit dummy variables. Note that areg does not provide estimates of the coefficients on individual dummy variables, however.
. areg csat percent, absorb(region)
                                                       Number of obs =      50
                                                       F(  1,    45) =  141.52
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.8517
                                                       Adj R-squared =  0.8385
                                                       Root MSE      =  26.492

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     percent |  -2.546058   .2140196   -11.90   0.000    -2.977116   -2.115001
       _cons |   1035.445    8.38689   123.46   0.000     1018.553    1052.337
-------------+----------------------------------------------------------------
      region |      F(3, 45) =      9.465   0.000          (4 categories)
------------------------------------------------------------------------------
Although its output is less informative than regression with explicit dummy variables, areg does have two advantages. It speeds up exploratory work, providing quick feedback about whether a dummy variable approach is worthwhile. Secondly, when the variable of interest has many values, creating dummies for each of them could lead to too many variables or too large a model for our particular Stata configuration. areg thus works around the usual limitations on dataset and matrix size.
Explicit dummy variables have other advantages, however, including ways to model interaction effects. Interaction terms called "slope dummy variables" can be formed by multiplying a dummy times a measurement variable. For example, to model an interaction between Northeast/other region and percent, we create a slope dummy variable called reg2perc.

. generate reg2perc = reg2 * percent
(1 missing value generated)

The new variable, reg2perc, equals percent for Northeastern states and zero for all other states. We can include this interaction term among the regression predictors:
. regress csat reg2 percent reg2perc
      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  3,    46) =   82.27
       Model |   179506.19     3  59835.3968           Prob > F      =  0.0000
    Residual |  33455.1897    46  727.286733           R-squared     =  0.8429
-------------+------------------------------           Adj R-squared =  0.8327
       Total |   212961.38    49  4346.15061           Root MSE      =  26.968

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        reg2 |  -241.3574   116.6278    -2.07   0.044     -476.117   -6.597821
     percent |  -2.858829   .2032947   -14.06   0.000     -3.26804   -2.449618
    reg2perc |   4.179666   1.620009     2.58   0.013     .9187559    7.440576
       _cons |   1035.519   6.902898   150.01   0.000     1021.624    1049.414
------------------------------------------------------------------------------
The interaction is statistically significant (t = 2.58, P = .013). Because this analysis includes both intercept (reg2) and slope (reg2perc) dummy variables, it is worthwhile to write out the equations. The regression equation for Northeastern states is approximately
    predicted csat = 1035.5 − 241.4 reg2 − 2.9 percent + 4.2 reg2perc
                   = 1035.5 − 241.4 × 1 − 2.9 percent + 4.2 × 1 × percent
                   = 794.1 + 1.3 percent

For other states it is

    predicted csat = 1035.5 − 241.4 × 0 − 2.9 percent + 4.2 × 0 × percent
                   = 1035.5 − 2.9 percent

An interaction implies that the effect of one variable changes, depending on the values of some other variable. From this regression, it appears that percent has a relatively weak and positive effect among Northeastern states, whereas its effect is stronger and negative among the other states.
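One way to obtain the Northeastern slope on percent directly, with a standard error and confidence interval, is the lincom command; this sketch assumes the interaction regression above is still the most recent estimation:

. lincom percent + reg2perc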
To visualize the results from a slope-and-intercept dummy variable regression, we have several graphing possibilities. Without even fitting the model, we could ask lfit to do the work as follows, with the results seen in Figure 6.6.

. label define reg2 0 "other regions" 1 "Northeast"
. label values reg2 reg2
. graph twoway lfit csat percent
    || scatter csat percent
    || , by(reg2, legend(off) note("")) ytitle("Mean composite SAT score")
[Figure 6.6. Mean composite SAT score versus % HS graduates taking SAT, with fitted regression lines, in separate panels for other regions and the Northeast.]
Alternatively, we could fit the regression model, calculate predicted values, and use those to make a more refined plot such as Figure 6.7. The bands(50) options with both mspline commands specify median splines based on 50 vertical bands, which is more than enough to cover the range of the data.

. quietly regress csat reg2 percent reg2perc
. predict yhat1
. graph twoway scatter csat percent if reg2 == 0
    || mspline yhat1 percent if reg2 == 0, clpattern(solid) bands(50)
    || scatter csat percent if reg2 == 1, msymbol(Sh)
    || mspline yhat1 percent if reg2 == 1, clpattern(solid) bands(50)
    || , ytitle("Composite mean SAT score")
      legend(order(1 3) label(1 "other regions")
      label(3 "Northeast states") position(12) ring(0))
[Figure 6.7. Composite mean SAT score versus % HS graduates taking SAT, with separate fitted lines for other regions and Northeast states.]
Figure 6.7 involves four overlays: two scatterplots (csat vs. percent for Northeast and other states) and two median-spline plots (connecting predicted values, yhat1, graphed against percent for Northeast and others). The Northeast states are plotted as hollow squares, msymbol(Sh). The ytitle and legend options simplify the y-axis title and the legend; in their default form, both would be crowded and unclear.

Figures 6.6 and 6.7 both show the striking difference, captured by our interaction effect, between Northeastern and other states. This raises the question of what other regional differences exist. Figure 6.8 explores this question by drawing a csat-percent scatterplot with different symbols for each of the four regions. In this plot, the Midwestern states, with one exception (Indiana), seem to have their own steeply negative regional pattern at the left side of the graph. Southern states are the most heterogeneous group.
. graph twoway scatter csat percent if reg1 == 1
    || scatter csat percent if reg2 == 1, msymbol(Sh)
    || scatter csat percent if reg3 == 1, msymbol(T)
    || scatter csat percent if reg4 == 1, msymbol(+)
    || , legend(position(1) ring(0) label(1 "West")
      label(2 "Northeast") label(3 "South") label(4 "Midwest"))
[Figure 6.8. csat versus % HS graduates taking SAT, plotted with different symbols for West, Northeast, South, and Midwest states.]
Automatic Categorical-Variable Indicators and Interactions
The xi (expand interactions) command simplifies the jobs of expanding multiple-category variables into sets of dummy and interaction variables, and including these as predictors in regression or other models. For example, in dataset student2.dta (introduced in Chapter 5) there is a four-category variable year, representing a student's year in college (freshman, sophomore, etc.). We could automatically create a set of three dummy variables by typing

. xi, prefix(ind) i.year

The three new dummy variables will be named indyear_2, indyear_3, and indyear_4. The prefix() option specified the prefix used in naming the new dummy variables. If we typed simply

. xi i.year

giving no prefix() option, the names _Iyear_2, _Iyear_3, and _Iyear_4 would be assigned (and any previously calculated variables with those names would be overwritten by the new variables). Typing

. drop _I*

employs the wildcard * notation to drop all variables that have names beginning with _I.
By default, xi omits the lowest value of the categorical variable when creating dummies, but this can be controlled. Typing the command

. char _dta[omit] prevalent

will cause subsequent xi commands to automatically omit the most prevalent category (note the use of square brackets). char _dta[ ] preferences are saved with the data; to restore the default, type

. char _dta[omit]

Typing

. char year[omit] 3

would omit year 3. To restore the default, type

. char year[omit]
xi can also create interaction terms involving two categorical variables, or one categorical and one measurement variable. For example, we could create a set of interaction terms for year and gender by typing

. xi i.year*i.gender

From the four categories of year and the two categories of gender, this xi command creates seven new variables: four dummy variables and three interactions. Because their names all begin with _I, we can use the wildcard notation _I* to describe these variables:
. describe _I*

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
_Iyear_2        byte   %8.0g                  year==2
_Iyear_3        byte   %8.0g                  year==3
_Iyear_4        byte   %8.0g                  year==4
_Igender_1      byte   %8.0g                  gender==1
_IyeaXgen_2_1   byte   %8.0g                  year==2 & gender==1
_IyeaXgen_3_1   byte   %8.0g                  year==3 & gender==1
_IyeaXgen_4_1   byte   %8.0g                  year==4 & gender==1
To create interaction terms for categorical variable year and measurement variable drink (33-point drinking behavior scale), type

. xi i.year*drink
Six new variables result: three dummy variables for year, and three interaction terms representing each of the year dummies times drink. For example, for a sophomore student, _Iyear_2 = 1 and _IyeaXdrink_2 = 1 × drink = drink. For a junior student, _Iyear_2 = 0 and _IyeaXdrink_2 = 0 × drink = 0; also _Iyear_3 = 1 and _IyeaXdrink_3 = 1 × drink = drink, and so forth.
. describe _Iyea*

variable name   storage  display    value
                  type    format     label      variable label
-------------------------------------------------------------------------
_Iyear_2          byte    %8.0g                 year==2
_Iyear_3          byte    %8.0g                 year==3
_Iyear_4          byte    %8.0g                 year==4
_IyeaXdrink_2     float   %9.0g                 (year==2)*drink
_IyeaXdrink_3     float   %9.0g                 (year==3)*drink
_IyeaXdrink_4     float   %9.0g                 (year==4)*drink
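To see how this coding works case by case, a quick listing (a sketch, not shown in the text) can be compared against the rule just described:
. list year drink _Iyear_2 _IyeaXdrink_2 in 1/5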
The real convenience of xi comes from its ability to generate dummy variables and interactions automatically within a regression or other model-fitting command. For example, to regress variable gpa (student's college grade point average) on drink and a set of dummy variables for year, simply type
. xi: regress gpa drink i.year
This command automatically creates the necessary dummy variables, following the same rules described above. Similarly, to regress gpa on drink, year, and the interaction of drink and year, type
. xi: regress gpa drink i.year*drink
i.year            _Iyear_1-4          (naturally coded; _Iyear_1 omitted)
i.year*drink      _IyeaXdrink_#       (coded as above)

      Source |       SS       df       MS              Number of obs =     218
-------------+------------------------------           F(  7,   210) =    3.75
       Model |  5.08865901     7  .726951288           Prob > F      =  0.0007
    Residual |  40.6630801   210  .193633715           R-squared     =  0.1112
-------------+------------------------------           Adj R-squared =  0.0816
       Total |  45.7517391   217  .210837507           Root MSE      =  .44004

------------------------------------------------------------------------------
         gpa |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       drink |  -.0285369   .0140402    -2.03   0.043    -.0562146   -.0008591
    _Iyear_2 |  -.5839268    .314782    -1.86   0.065    -1.204464    .0366107
    _Iyear_3 |  -.2859424   .3044178    -0.94   0.349    -.8860487    .3141639
    _Iyear_4 |  -.2203783   .2939595    -0.75   0.454     -.799868    .3591114
       drink |  (dropped)
_IyeaXdrin~2 |   .0199977   .0164436     1.22   0.225    -.0124179    .0524133
_IyeaXdrin~3 |   .0108977    .016348     0.67   0.506    -.0213297     .043125
_IyeaXdrin~4 |   .0104239    .016369     0.64   0.525    -.0218446    .0426925
       _cons |   3.432132   .2523984    13.60   0.000     2.934572    3.929691
------------------------------------------------------------------------------
The xi: command can be applied in the same way before many other model-fitting
procedures such as logistic (Chapter 10). In general, it allows us to include predictor
(right-hand-side) variables such as the following, without first creating the actual dummy
variable or interaction terms.
i.catvar              Creates j-1 dummy variables representing the j categories of
                      catvar.
i.catvar1*i.catvar2   Creates j-1 dummy variables representing the j categories of
                      catvar1; k-1 dummy variables from the k categories of catvar2;
                      and (j-1)(k-1) interaction variables (dummy x dummy).
i.catvar*measvar      Creates j-1 dummy variables representing the j categories of
                      catvar, and j-1 variables representing interactions with the
                      measurement variable (dummy x measvar).
After any xi command, the new variables remain in the dataset.
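For instance (a sketch, not an example worked in the text), a model with two categorical predictors and their interaction could be requested in one step:
. xi: regress gpa drink i.year*i.gender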
Stepwise Regression
With the regional dummy variables we added earlier to the state-level data in states.dta, we have many possible predictors of csat. This results in an overly complicated model, with several coefficients statistically indistinguishable from zero.
. regress csat expense percent income college high reg1 reg2 reg2perc reg3
      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  9,    40) =   49.51
       Model |  195420.517     9  21713.3908           Prob > F      =  0.0000
    Residual |   17540.863    40  438.521576           R-squared     =  0.9176
-------------+------------------------------           Adj R-squared =  0.8991
       Total |   212961.38    49  4346.15061           Root MSE      =  20.941

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     expense |  -.0022508   .0041333    -0.54   0.589    -.0106045     .006103
     percent |   -2.93786   .2302596   -12.76   0.000    -3.403232   -2.472488
      income |  -.0004919   .0010255    -0.48   0.634    -.0025645    .0015806
     college |   3.900087   1.719409     2.27   0.029     .4250318    7.375142
        high |   2.175542   1.171767     1.86   0.071     -.192688    4.543771
        reg1 |  -33.78456   9.302983    -3.63   0.001    -52.58659   -14.98253
        reg2 |  -143.5149   101.1244    -1.42   0.164    -347.8949    60.86509
    reg2perc |   2.506616   1.404483     1.78   0.082    -.3319506    5.345183
        reg3 |  -8.799205   12.54658    -0.70   0.487    -34.15679    16.55838
       _cons |   839.2209   76.35942    10.99   0.000     684.8927     993.549
------------------------------------------------------------------------------
We might now try to simplify this model, dropping first the predictor with the highest t probability (income, P = .634), then refitting the model and deciding whether to drop something further. Through this process of backward elimination, we seek a more parsimonious model, one that is simpler but fits almost equally well. Ideally, this strategy is pursued with attention both to the statistical results and to the substantive or theoretical implications of keeping or discarding certain variables.
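A first manual step in this direction (a sketch, not shown in the text) would simply refit the model without income and re-examine the remaining t probabilities:
. regress csat expense percent college high reg1 reg2 reg2perc reg3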
For analysts in a hurry, stepwise methods provide ways to automate the process of model
selection. They work either by subtracting predictors from a complicated model, or by adding
predictors to a simpler one according to some pre-set statistical criteria. Stepwise methods
cannot consider the substantive or theoretical implications of their choices, nor can they do
much troubleshooting to evaluate possible weaknesses in the models produced at each step.
Despite their drawbacks, stepwise methods meet certain practical needs and have been widely
used.
For automatic backward elimination, we issue a sw regress command that includes all of our possible predictor variables, and a maximum P value required to retain them. Setting the P-to-retain criterion as pr(.05) ensures that only predictors having coefficients that are significantly different from zero at the .05 level will be kept in the model.
. sw regress csat expense percent income college high reg1 reg2 reg2perc reg3, pr(.05)

                      begin with full model
p = 0.6341 >= 0.0500  removing income
p = 0.5273 >= 0.0500  removing reg3
p = 0.4215 >= 0.0500  removing expense
p = 0.2107 >= 0.0500  removing reg2
      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  5,    44) =   91.01
       Model |  194185.761     5  38837.1521           Prob > F      =  0.0000
    Residual |  18775.6194    44  426.718624           R-squared     =  0.9118
-------------+------------------------------           Adj R-squared =  0.9018
       Total |   212961.38    49  4346.15061           Root MSE      =  20.657

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        reg1 |  -30.59218   8.479395    -3.61   0.001    -47.68128   -13.50309
     percent |  -3.119155   .1804553   -17.28   0.000    -3.482839   -2.755471
    reg2perc |   .5833272   .1545969     3.77   0.000     .2717577    .8948967
     college |   3.995495   1.359331     2.94   0.005     1.255944    6.735046
        high |   2.231294   .8178968     2.73   0.009     .5829313    3.879657
       _cons |    806.672   49.98744    16.14   0.000     705.9289    907.4151
------------------------------------------------------------------------------
sw regress dropped first income, then reg3, expense, and finally reg2 before settling on the final model. Although it has four fewer coefficients, this final model has almost the same R2 (.9118 versus .9176) and a higher adjusted R2 (.9018 versus .8991) compared with the earlier version.
If, instead of a P-to-retain, pr(.05), we specify a P-to-enter value such as pe(.05), then sw regress performs forward inclusion (starting with an "empty" or constant-only model) instead of backward elimination. Other stepwise options include hierarchical selection and locking certain predictors into the model. For example, the following command specifies that the first term (x1) should be locked into the model and not subject to possible removal:
. sw regress y x1 x2 x3, pr(.05) lockterm1
The following command calls for forward inclusion of any predictors found significant at the .10 level, but with variables x4, x5, and x6 treated as one unit, either entered or left out together:
. sw regress y x1 x2 x3 (x4 x5 x6), pe(.10)
The following command invokes hierarchical backward elimination with a P = .20 criterion:
. sw regress y x1 x2 x3 (x4 x5 x6) x7, pr(.20) hier
The hier option specifies that the terms are ordered: consider dropping the last term (x7) first, and stop if it is not dropped. If x7 is dropped, next consider the second-to-last term (x4 x5 x6), and so forth.
Many other Stata commands besides regress also have stepwise variants that work in
a similar manner. Available stepwise procedures include the following:
sw clogit       Conditional (fixed-effects) logistic regression
sw cloglog      Maximum likelihood complementary log-log estimation
sw cnreg        Censored normal regression
sw glm          Generalized linear models
sw logistic     Logistic regression (odds)
sw logit        Logistic regression (coefficients)
sw nbreg        Negative binomial regression
sw ologit       Ordered logistic regression
sw oprobit      Ordered probit regression
sw poisson      Poisson regression
sw probit       Probit regression
sw qreg         Quantile regression
sw regress      OLS regression
sw stcox        Cox proportional hazard model regression
sw streg        Parametric survival-time model regression
sw tobit        Tobit regression
Type help sw for details about the stepwise options and logic.
Polynomial Regression
Earlier in this chapter, Figures 6.1 and 6.2 revealed an apparently curvilinear relationship between mean composite SAT scores (csat) and the percentage of high school seniors taking the test (percent). Figure 6.6 illustrated one way to model the upturn in SAT scores at high percent values: as a phenomenon peculiar to the Northeastern states. That interaction model fit reasonably well (R2a = .8327). But Figure 6.9 (next page), a residuals versus predicted values plot for the interaction model, still exhibits signs of trouble. Residuals appear to trend upwards at both high and low predicted values.
. quietly regress csat reg2 percent reg2perc
. rvfplot, yline(0)
[Figure 6.9. Residuals versus fitted values for the interaction model]
Chapter 8 presents a variety of techniques for curvilinear and nonlinear regression. "Curvilinear regression" here refers to intrinsically linear OLS regressions (for example, regress) that include nonlinear transformations of the original y or x variables. Although curvilinear regression fits a curved model with respect to the original data, this model remains linear in the transformed variables. (Nonlinear regression, also discussed in Chapter 8, applies non-OLS methods to fit models that cannot be linearized through transformation.)
One simple type of curvilinear regression, called polynomial regression, often succeeds in fitting U or inverted-U shaped curves. It includes as predictors both an independent variable and its square (and possibly higher powers if necessary). Because the csat-percent relationship appears somewhat U-shaped, we generate a new variable equal to percent squared, then include percent and percent2 as predictors of csat. Figure 6.10 graphs the resulting curve.
. generate percent2 = percent^2
. regress csat percent percent2
      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------           F(  2,    48) =  153.48
       Model |  193721.829     2  96860.9146           Prob > F      =  0.0000
    Residual |  30292.6806    48  631.097513           R-squared     =  0.8648
-------------+------------------------------           Adj R-squared =  0.8591
       Total |   224014.51    50   4480.2902           Root MSE      =  25.122

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     percent |  -6.111993   .6715406    -9.10   0.000    -7.462216    -4.76177
    percent2 |   .0495819   .0084179     5.89   0.000     .0326566    .0665072
       _cons |   1065.921   9.285379   114.80   0.000     1047.252    1084.591
------------------------------------------------------------------------------
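As a side calculation (a sketch, not from the text), the turning point of the fitted parabola can be recovered from the stored coefficients; it falls near percent = 62, where the predicted csat curve bottoms out:
. display -_b[percent]/(2*_b[percent2])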
. predict yhat2
(option xb assumed; fitted values)
. graph twoway mspline yhat2 percent, bands(50)
     || scatter csat percent
     || , legend(off) ytitle("Mean composite SAT score")
[Figure 6.10. Mean composite SAT score versus % HS graduates taking SAT, with the fitted polynomial curve]
If we only wanted to see the graph, and did not need the regression analysis, a curve similar to Figure 6.10 could have been obtained by typing
. graph twoway qfit csat percent || scatter csat percent
The polynomial model of Figure 6.10 matches the data slightly better than our interaction model (adjusted R2 = .8591 versus .8327). Because the curvilinear pattern is now less striking in a residual-versus-fitted plot (Figure 6.11), the usual assumption of independent, identically distributed errors appears more plausible for this polynomial model.
. quietly regress csat percent percent2
. rvfplot, yline(0)
[Figure 6.11. Residuals versus fitted values for the polynomial model]
In Figures 6.7 and 6.10, we have two alternative models for the observed upturn in SAT
scores at high levels of student participation. Statistical evidence seems to lean towards the
polynomial model at this point. For serious research, however, we ought to choose between
similar-fitting alternative models on substantive as well as statistical grounds. Which model
seems more useful, or makes more sense? Which, if either, regression model suggests or
corresponds to a good real-world explanation for the upturn in test scores at high levels of
student participation?
Although it can closely fit sample data, polynomial regression also has important statistical weaknesses. The different powers of x might be highly correlated with each other, giving rise to multicollinearity. Furthermore, polynomial regression tends to track observations that have unusually large positive or negative x values, so a few data points can exert disproportionate influence on the results. For both reasons, polynomial regression results can sometimes be sample-specific, fitting one dataset well but generalizing poorly to other data. Chapter 7 takes a second look at this example, using tools that check for potential problems.
Panel Data
Panel data, also called cross-sectional time series, consist of observations on i analytical units
or cases, repeated over t points in time. The Longitudinal/Panel Data Reference Manual
describes a wide range of methods for analyzing such data. Most of the relevant Stata
commands begin with the letters xt; type help xt for an overview. As mentioned in the
documentation, some xt procedures require time series or tsset data; see Chapter 13, or
type help tsset, for more about this step.
This section considers the relatively simple case of linear regression with panel data, accomplished by the command xtreg. Our example dataset, newfdiv.dta, contains information about the 10 census divisions of the Canadian province of Newfoundland (Avalon Peninsula, Burin Peninsula, and 8 others), for the years 1992-96.
Contains data from C:\data\newfdiv.dta
  obs:            50                          Newfoundland Census divisions
                                              (source: Statistics Canada)
 vars:             7                          18 Jul 2005 10:28
 size:         2,250 (99.9% of memory free)

variable name   storage  display    value
                  type    format     label      variable label
-------------------------------------------------------------------------
cendiv            byte    %9.0g      cd         Census Division
divname           str20   %20s                  Census Division name
year              int     %9.0g                 Year
pop               double  %9.0g                 Population, 1000s
unemp             float   %9.0g                 Total unemployment, 1000s
outmig            int     %9.0g                 Out-migration
tcrime            float   %9.0g                 Total crimes reported, 1000s
-------------------------------------------------------------------------
Sorted by:  cendiv  year
. list in 1/10

     +------------------------------------------------------------------+
     |           cendiv   year       pop   unemp   outmig   tcrime      |
     |------------------------------------------------------------------|
  1. | Avalon Peninsula   1992   259.587   58.56     6556   26.211      |
  2. | Avalon Peninsula   1993   261.083   52.23     6449   21.039      |
  3. | Avalon Peninsula   1994   259.296   44.81     6907   20.201      |
  4. | Avalon Peninsula   1995   257.546   39.35        .   19.536      |
  5. | Avalon Peninsula   1996   255.723   38.68        .   21.268      |
     |------------------------------------------------------------------|
  6. |  Burin Peninsula   1992    29.865     9.5      874    1.903      |
  7. |  Burin Peninsula   1993    29.611    9.18      928     1.94      |
  8. |  Burin Peninsula   1994    29.327    8.41      584    2.063      |
  9. |  Burin Peninsula   1995    26.898    7.12        .    1.923      |
 10. |  Burin Peninsula   1996    28.126    6.61        .        .      |
     +------------------------------------------------------------------+
Figure 6.12 visualizes the panel data, graphing variations in the number of crimes reported each year for 9 of the 10 census divisions. Census division 1, the Avalon Peninsula, is by far the largest in Newfoundland. Setting it temporarily aside by specifying if cendiv != 1 makes the remaining 9 plots in Figure 6.12 more readable. The imargin(left=3 right=3) option in this example calls for left and right subplot margins equal to 3% of the graph width, giving more separation than the default.
. graph twoway connected tcrime year if cendiv != 1,
     by(cendiv, note("")) xtitle("") imargin(left=3 right=3)
[Figure 6.12. Total crimes reported (1000s) by year, 1992-1996, in nine census divisions: Burin, S Coast, St Georg, Humber, Central, Bonavist, Notre D, N Pen, and Labrador]
The dataset contains 50 observations total. Because the 50 observations represent only 10
individual cases, however, the usual assumptions of OLS and other common statistical methods
do not apply. Instead, we need models with complex error specifications, allowing for both
unit-specific and individual-observation disturbances.
Consider the regression of y on two predictors, x and w. OLS regression estimates the regression coefficients a, b, and c, and calculates the associated standard errors and tests, assuming a model of the form
     y_i = a + b x_i + c w_i + e_i
where the residuals for each observation, e_i, are assumed to represent errors that have independent and identical distributions. The i.i.d. errors assumption appears unlikely with panel data, where the observations consist of the same units measured repeatedly.
A more plausible panel-data model includes two error terms. One is common to each of the i units, but differs between units (u_i). The second is unique to each of the i,t observations (e_it):
     y_it = a + b x_it + c w_it + u_i + e_it
In order to fit such a model, Stata needs to know which variable identifies the i units, and
which variable is the time index t. This can be done within an xt command, or more
efficiently for the dataset as a whole. The commands iis ("i is") and tis ("t is") specify
the i and t variables, respectively. For newfdiv.dta, the units are census divisions (cendiv) and
the time index is year.
. iis cendiv
. tis year
. save, replace
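Alternatively (a sketch assuming the tsset command described in Chapter 13), both settings can be declared in a single step:
. tsset cendiv year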
Saving the dataset preserves the i and t specifications, so the iis and tis commands are not required in a future session. Having set these variables, we can now fit a random-effects (meaning that the common errors u_i are assumed to be variable, rather than fixed) model regressing tcrime on unemp and pop.
. xtreg tcrime unemp pop, re
Random-effects GLS regression                   Number of obs      =        50
Group variable (i): cendiv                      Number of groups   =        10

R-sq:  within  = 0.5265                         Obs per group: min =         5
       between = 0.9717                                        avg =       5.0
       overall = 0.9634                                        max =         5

Random effects u_i ~ Gaussian                   Wald chi2(2)       =    705.54
corr(u_i, X)       = 0 (assumed)                Prob > chi2        =    0.0000

------------------------------------------------------------------------------
      tcrime |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       unemp |   .1645266   .0381813     4.31   0.000     .0896925    .2393607
         pop |   .0558997   .0073437     7.61   0.000     .0415062    .0702931
       _cons |  -.7264381    .301522    -2.41   0.016     -1.31741   -.1354659
-------------+----------------------------------------------------------------
     sigma_u |  .34458437
     sigma_e |  .42064667
         rho |  .40157462   (fraction of variance due to u_i)
------------------------------------------------------------------------------
The xtreg output table contains regression coefficients, standard errors, tests, and confidence intervals that resemble those of an OLS regression. In this example we see that the coefficient on unemp (.1645) is positive and statistically significant. The predicted number of crimes increases by .1645 for each additional person unemployed, if population is held constant. Holding unemployment constant, predicted crimes increase by 5.59 with each 100-person increase in population. Echoing the individual-coefficient z tests, the Wald chi-square test at upper right (chi2 = 705.54, df = 2, P < .00005) allows us to reject the joint null hypothesis that the coefficients on unemp and pop are both zero.
This output table gives further information related to the two error terms. At lower left in the table we find
sigma_u     standard deviation of the common residuals u_i
sigma_e     standard deviation of the unique residuals e_it
rho         fraction of the unexplained variance due to differences among the units (i.e.,
            differences among the 10 Newfoundland census divisions):
            rho = Var[u_i] / (Var[u_i] + Var[e_it])
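As a quick check (a sketch, not from the text), rho can be recomputed from the estimates that xtreg leaves behind in e():
. display e(sigma_u)^2 / (e(sigma_u)^2 + e(sigma_e)^2)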
At upper left, the table gives three "R2" statistics. The definitions for these differ from the true R2 of OLS. In the case of xtreg, the "R2" values are based on fits between several kinds of observed and predicted y values.
R2 within      Explained variation within units, defined as the squared correlation between
               deviations of the y_it values from their unit means (y_it - ybar_i) and
               deviations of the predicted values from their unit-mean predicted values
               (yhat_it - yhatbar_i).
R2 between     Explained variation between units, defined as the squared correlation between
               the unit means (ybar_i) and the y values predicted from unit means of the
               independent variables.
R2 overall     Explained variation overall, defined as the squared correlation between
               observed (y_it) and predicted (yhat_it) values.

Our example model does a very good job fitting the observed crimes overall (R2 = .96), and also the variations among census division means (R2 = .97). Variations around the means within census divisions are somewhat less predictable (R2 = .53).
The random-effects option employed for this example is one of several possible choices.
re      generalized least squares (GLS) random-effects estimator; the default
be      between regression estimator
fe      fixed-effects (within) regression estimator
mle     maximum-likelihood random-effects estimator
pa      population-averaged estimator
Consult help xtreg for further options and syntax. The Longitudinal/Panel Data Reference Manual gives examples, references, and technical details.
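As a brief sketch (not an example worked in the text), the fixed-effects variant of the model fit above would be requested simply by changing this option:
. xtreg tcrime unemp pop, fe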
Regression Diagnostics

Do the data give us any reason to distrust our regression results? Can we find better ways to specify the model, or to estimate its parameters? Careful diagnostic work, checking for potential problems and evaluating the plausibility of key assumptions, forms a crucial step in modern data analysis. We fit an initial model, but then look closely at our results for signs of trouble or ways in which the model needs improvement. Many of the general methods introduced in earlier chapters, such as scatterplots, box plots, normality tests, or just sorting and listing the data, prove useful for troubleshooting. Stata also provides a toolkit of specialized diagnostic techniques designed for this purpose.
Autocorrelation, a complication that often affects regression with time series data, is not covered in this chapter. Chapter 13, Time Series Analysis, introduces Stata's library of time series procedures including Durbin-Watson tests, autocorrelation graphs, lag operators, and time-series regression techniques.
Regression diagnostic procedures can be found under these menu selections:
     Statistics - Linear regression and related - Regression diagnostics
     Statistics - General post-estimation - Obtain predictions, residuals, etc., after estimation

Example Commands

The commands illustrated in this section all assume that you have just fit a model using either anova or regress. The commands' results refer back to that model. These followup commands are of three basic types:
1. predict options that generate new variables containing case statistics such as predicted values, residuals, standard errors, and influence statistics. Chapter 6 noted some key options; type help regress for a complete listing.
2. Diagnostic tests for statistical problems such as autocorrelation, heteroskedasticity, specification errors, or variance inflation (multicollinearity). Type help regdiag for a list.
3. Diagnostic plots such as added-variable or leverage plots, residual-versus-fitted plots, residual-versus-predictor plots, and component-plus-residual plots. Again, typing help regdiag obtains a full listing of regression and ANOVA diagnostic plots. General graphs for diagnosing distribution shape and normality were covered in Chapter 2; type help diagplots for a list of those.
predict Options

. predict new, cooksd
Generates a new variable equal to Cook's distance D, summarizing how much each observation influences the fitted model.
. predict new, covratio
Generates a new variable equal to Belsley, Kuh, and Welsch's COVRATIO statistic. COVRATIO measures the ith case's influence upon the variance-covariance matrix of the estimated coefficients.
. predict DFx1, dfbeta(x1)
Generates DFBETA case statistics measuring how much each observation affects the coefficient on predictor x1. The dfbeta command accomplishes the same thing more conveniently, and in this example will automatically name the resulting statistics DFx1:
. dfbeta x1
To create a complete set of DFBETAs for all predictors in the model, simply type the command dfbeta without arguments.
. predict new, dfits
Generates DFITS case statistics, summarizing the influence of each observation on the fitted model (similar in purpose to Cook's D and Welsch's W).
Diagnostic Tests

. dwstat
Calculates the Durbin-Watson test for first-order autocorrelation. Chapter 13 gives examples of this and other time series procedures. See also:
     help durbina       Durbin-Watson h statistic
     help bgodfrey      Breusch-Godfrey LM (Lagrange multiplier) statistic
. hettest
Performs Cook and Weisberg's test for heteroskedasticity. If we have reason to suspect that heteroskedasticity is a function of a particular predictor x1, we could focus on that predictor by typing hettest x1.
. ovtest, rhs
Performs the Ramsey regression specification error test (RESET) for omitted variables. The option rhs calls for using powers of the right-hand-side variables, instead of powers of predicted y (the default).
. vif
Calculates variance inflation factors to check for multicollinearity.
Diagnostic Plots

. acprplot x1, mspline msopts(bands(7))
Constructs an augmented component-plus-residual plot (also known as an augmented partial residual plot), often better than cprplot in screening for nonlinearities. The options mspline msopts(bands(7)) call for connecting with line segments the cross-medians of seven vertical bands. Alternatively, we might ask for a lowess-smoothed curve with bandwidth 0.5 by specifying the options lowess lsopts(bwidth(.5)).
. avplot x1
Constructs an added-variable plot (also called a partial-regression or leverage plot) showing the relationship between y and x1, both adjusted for other x variables. Such plots help to notice outliers and influence points.
. avplots
Draws and combines in one image all the added-variable plots from the recent anova or regress.
. cprplot x1
Constructs a component-plus-residual plot (also known as a partial-residual plot) showing the adjusted relationship between y and predictor x1. Such plots help detect nonlinearities in the data.
. lvr2plot
Constructs a leverage-versus-squared-residual plot (also known as an L-R plot).
. rvfplot
Graphs the residuals versus the fitted (predicted) values of y.
. rvpplot x1
Graphs the residuals against values of predictor x1.
SAT Score Regression, Revisited
Diagnostic techniques have been described as tools for "regression criticism," because they help us examine our regression models for possible flaws and for ways that the models could be improved. In this spirit, we return now to the state Scholastic Aptitude Test regressions of Chapter 6. A three-predictor model explains about 92% of the variance in mean state SAT scores. The predictors are percent (percent of high school graduates taking the test), percent2 (percent squared), and high (percent of adults with a high school diploma).
. generate percent2 = percent^2
. regress csat percent percent2 high
      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------           F(  3,    47) =  193.37
       Model |  207225.103     3  69075.0343           Prob > F      =  0.0000
    Residual |  16789.4069    47  357.221424           R-squared     =  0.9251
-------------+------------------------------           Adj R-squared =  0.9203
       Total |   224014.51    50   4480.2902           Root MSE      =   18.90

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     percent |  -6.520312   .5095805   -12.80   0.000    -7.545455   -5.495168
    percent2 |   .0536555   .0063678     8.43   0.000     .0408452    .0664658
        high |   2.986509   .4857502     6.15   0.000     2.009305    3.963712
       _cons |   844.8207   36.63387    23.06   0.000     771.1228    918.5185
------------------------------------------------------------------------------

The regression equation is
     predicted csat = 844.82 - 6.52 percent + .05 percent2 + 2.99 high
The scatterplot matrix in Figure 7.1 shows the interrelations among these four variables. As noted in Chapter 6, the relation between csat and percent is visibly curvilinear, which is what led us to add percent2 to the model.
. graph matrix percent percent2 high csat, half msymbol(+)

[Figure 7.1. Scatterplot matrix of % HS graduates taking SAT (percent), percent2, % over 25 w/HS diploma (high), and mean composite SAT score (csat)]

Several diagnostic tests are available after regress. The ovtest command performs the Ramsey regression specification error test (RESET), which asks whether powers of the fitted values of csat could improve the model; it then tests the null hypothesis that the coefficients on those added terms all equal zero. With the csat regression, we find no significant evidence of omitted variables:
. ovtest

Ramsey RESET test using powers of the fitted values of csat
       Ho:  model has no omitted variables
                  F(3, 44) =      1.48
                  Prob > F =      0.2319

The assumption of constant error variance can be checked by examining whether squared standardized residuals are linearly related to the fitted values (see Cook and Weisberg 1994 for an example). A heteroskedasticity test along these lines suggests that in this instance we should reject the null hypothesis of constant variance:
. hettest

Cook-Weisberg test for heteroskedasticity using fitted values of csat
     Ho:  Constant variance
          chi2(1)      =     4.86
          Prob > chi2  =     0.0274
“Significant” heteroskedasticity implies that our standard errors and hypothesis tests might be
invalid. Figure 7.2, in the next section, shows why this result occurs.
Diagnostic Plots
Chapter 6 demonstrated how predict can create new variables holding residual and predicted values after a regress command. To obtain these values from our regression of csat on percent, percent2, and high, we type the two commands:
. predict yhat3
. predict e3, resid
The new variables named e3 (residuals) and yhat3 (predicted values) could be displayed in a residual-versus-predicted graph by typing graph twoway scatter e3 yhat3, yline(0). The rvfplot (residual-versus-fitted) command obtains such graphs in a single step. Figure 7.2 includes a horizontal line at 0 (the residual mean), which helps in reading such plots.
. rvfplot, yline(0)
[Figure 7.2. Residuals versus fitted values for the three-predictor csat regression]
Figure 7.2 shows residuals symmetrically distributed around 0 (symmetry is consistent with the normal-errors assumption), and with no evidence of outliers or curvilinearity. The dispersion of the residuals appears somewhat greater for above-average predicted values of y, however, which is why hettest earlier rejected the constant-variance hypothesis.
Residual-versus-fitted plots provide a one-graph overview of the regression residuals. For more detailed study, we can plot residuals against each predictor variable separately through "residual-versus-predictor" commands. To graph the residuals against predictor high (not shown), type
. rvpplot high
The one-variable graphs described in Chapter 3 can also be employed for residual analysis. For example, we could use box plots to check the residuals for outliers or skew, or quantile-normal plots to evaluate the assumption of normal errors.
Added-variable plots are valuable diagnostic tools, known by different names including partial-regression leverage plots, adjusted partial residual plots, or adjusted variable plots. They depict the relationship between y and one x variable, adjusting for the effects of other x variables. If we regressed y on x2 and x3, and likewise regressed x1 on x2 and x3, then took the residuals from each regression and graphed these residuals in a scatterplot, we would obtain an added-variable plot for the relationship between y and x1, adjusted for x2 and x3. An avplot command performs the necessary calculations automatically. We can draw the added-variable plot for predictor high, for example, just by typing
. avplot high
Speeding the process further, we could type avplots to obtain a complete set of tiny
added-variable plots with each of the predictor variables in the preceding regression. Figure
7.3 shows the results from the regression of csat on percent, percent2, and high. The lines
drawn in added-variable plots have slopes equal to the corresponding partial regression
coefficients. For example, the slope of the line at lower left in Figure 7.3 equals 2.99 which
is the coefficient on high.
.
avplots
[Figure 7.3. Added-variable plots for percent (coef = -6.5203116, se = .50958046, t = -12.8), percent2 (coef = .05365555, se = .00636777, t = 8.43), and high (coef = 2.9865088, se = .48575023, t = 6.15)]
Added-variable plots help to uncover observations exerting a disproportionate influence on
the regression model. In simple regression with one x variable, ordinary scatterplots suffice for
this purpose. In multiple regression, however, the signs of influence become more subtle. An
observation with an unusual combination of values on several x variables might have high
leverage, or potential to influence the regression, even though none of its individual x values
is unusual by itself. High-leverage observations show up in added-variable plots as points
horizontally distant from the rest of the data. We see no such problems in Figure 7.3, however.
If outliers appear, we might identify which observations these are by including observation labels for the markers in an added-variable plot. This is done using the mlabel() option, just as with scatterplots. Figure 7.4 illustrates using state names (values of the string variable state) as labels. Although such labels tend to overprint each other where the data are dense, individual outliers remain more readable.
. avplot high, mlabel(state)
[Figure 7.4. Added-variable plot for high with state-name marker labels (coef = 2.9865088, se = .48575023, t = 6.15)]
Component-plus-residual plots, produced by commands of the form cprplot x1, take a different approach to graphing multiple regression. The component-plus-residual plot for variable x1 graphs each observation's residual plus its component predicted from x1,
     e_i + b1*x1_i
against values of x1. Such plots might help diagnose nonlinearities and suggest alternative functional forms. An augmented component-plus-residual plot (Mallows 1986) works somewhat better, although both types often seem inconclusive. Figure 7.5 shows an augmented component-plus-residual plot from the regression of csat on percent, percent2, and high.
. acprplot high, lowess
[Figure 7.5. Augmented component-plus-residual plot for high (% over 25 w/HS diploma), with lowess-smoothed curve]
The straight line in Figure 7.5 corresponds to the regression model. The curved line reflects
lowess smoothing based on the default bandwidth of .5, or half the data. The curve’s downturn
at far right can be disregarded as a lowess artifact, because only a few cases determine its
location toward the extremes (see Chapter 8). If more central parts of the lowess curve showed
a systematically curved pattern, departing from the linear regression model, we would have
reason to doubt the model’s adequacy. In Figure 7.5, however, the component-plus-residuals
medians closely follow the regression model. This plot reinforces the conclusion we reached
earlier from Figure 7.2, that the present regression model adequately accounts for all
nonlinearity visible in the raw data (Figure 7.1), leaving none apparent in its residuals.
As its name implies, a leverage-versus-squared-residuals plot graphs leverage (hat matrix diagonals) against the residuals squared. Figure 7.6 shows such a plot for the csat regression. To identify individual outliers, we label the markers with the values of state. The option mlabsize(medsmall) calls for "medium small" marker labels, somewhat larger than the default size of "small." (See help textsizestyle for a list of other choices.) Most of the state names form a jumble at lower left in Figure 7.6, but a few outliers stand out.
. lvr2plot, mlabel(state) mlabsize(medsmall)
[Figure 7.6. Leverage versus normalized residual squared, with state labels; Connecticut, Massachusetts, Mississippi, and Utah stand out on leverage, while Iowa and Tennessee stand out on squared residuals]
Lines in a leverage-versus-squared-residuals plot mark the means of leverage (horizontal
line) and squared residuals (vertical line). Leverage tells us how much potential for influencing
the regression an observation has, based on its particular combination of x values. Extreme x values or unusual combinations give an observation high leverage. A large squared residual indicates an observation with a value much different from that predicted by the regression model. Connecticut, Massachusetts, and Mississippi have the greatest potential leverage, but
the model fits them relatively well. (This is not necessarily good. Sometimes, although not
here, high-leverage observations exert so much influence that they control the regression, and
it must fit them well.) Iowa and Tennessee are poorly fit, but have less potential influence.
Utah stands out as one observation that is both ill fit and potentially influential. We can read
its values by listing just this state. Because state is a string variable, we enclose the value
“Utah” in double quotes.
. list csat yhat3 percent high e3 if state == "Utah"

     +--------------------------------------------------+
     | csat      yhat3   percent   high          e3     |
     |--------------------------------------------------|
     | 1031   1067.712         5   85.1   -36.71239     |
     +--------------------------------------------------+
Only 5% of Utah students took the SAT, and 85.1% of the state’s adults graduated from
high school. This unusual combination of near-extreme values on both x variables is the source
of the state’s leverage, and leads our model to predict mean SAT scores 36.7 points higher than
what Utah students actually achieved. To see exactly how much difference this one observation
makes, we could repeat the regression using Stata's "not equal to" qualifier != to set Utah
aside.
. regress csat percent percent2 high if state != "Utah"

      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  3,    46) =  202.67
       Model |  201097.423     3  67032.4744           Prob > F      =  0.0000
    Residual |  15214.1968    46  330.741235           R-squared     =  0.9297
-------------+------------------------------           Adj R-squared =  0.9251
       Total |   216311.52    49  4414.52082           Root MSE      =  18.186

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     percent |  -6.778706   .5044217   -13.44   0.000    -7.794054   -5.763357
    percent2 |   .0563562   .0062509     9.02   0.000     .0437738    .0689387
        high |   3.281765   .4865854     6.74   0.000     2.302319     4.26121
       _cons |   827.1159   36.17138    22.87   0.000     754.3067    899.9252
------------------------------------------------------------------------------
In the n = 50 (instead of n = 51) regression, all three coefficients strengthened a bit because we deleted an ill-fit observation. The general conclusions remain unchanged, however. Chambers et al. (1983) and Cook and Weisberg (1994) provide more detailed examples and explanations of diagnostic plots and other graphical methods for data analysis.
Diagnostic Case Statistics

After using regress or anova, we can obtain a variety of diagnostic statistics through the predict command (see Chapter 6 or type help regress). The variables created by predict are case statistics, meaning that they have values for each observation in the data. Diagnostic work usually begins by calculating the predicted values and residuals.
There is some overlap in purpose among other predict statistics. Many attempt to measure how much each observation influences regression results. "Influencing regression results," however, could refer to several different things: effects on the y-intercept, on a particular slope coefficient, on all the slope coefficients, or on the estimated standard errors, for example. Consequently, we have a variety of alternative case statistics designed to measure influence.
Standardized and studentized residuals (rstandard and rstudent) help to identify outliers among the residuals, observations that particularly contradict the regression model. Studentized residuals have the most straightforward interpretation. They correspond to the t statistic we would obtain by including in the regression a dummy predictor coded 1 for that observation and 0 for all others. Thus, they test whether a particular observation significantly shifts the y-intercept.
i
Hat matrix diagonals ( hat ) measure leverage, meaning the potential to influence
regression coefficients. Observations possess high leverage when their x values (or their
combination ofx values) are unusual.
I
Several other statistics measure actual influence on coefficients. DFBETAs indicate by how
many standard errors the coefficient on xl would change if observation i were dropped from
the regression. These can be obtained fora single predictor, xl, in either of two ways: through
the predict option dfbeta(xl) or through the command dfbeta .
1'
206
Statistics with Stata
Cook's D (cooksd), Welsch's distance (welsch), and DFITS (dfits), unlike DFBETA, all summarize how much observation i influences the regression model as a whole or, equivalently, how much observation i influences the set of predicted values. COVRATIO measures the influence of the ith observation on the estimated standard errors. Below we generate a full set of diagnostic statistics including DFBETAs for all three predictors. Note that predict supplies variable labels automatically for the variables it creates, but dfbeta does not. We begin by repeating our original regression to ensure that these post-regression diagnostics refer to the proper (n = 51) model.
. quietly regress csat percent percent2 high
. predict standard, rstandard
. predict student, rstudent
. predict h, hat
. predict D, cooksd
. predict DFITS, dfits
. predict W, welsch
. predict COVRATIO, covratio
. dfbeta
                 DFpercent:  DFbeta(percent)
                DFpercent2:  DFbeta(percent2)
                    DFhigh:  DFbeta(high)
. describe standard - DFhigh

variable name   storage  display    value
                  type    format     label      variable label
-------------------------------------------------------------------------
standard          float   %9.0g                 Standardized residuals
student           float   %9.0g                 Studentized residuals
h                 float   %9.0g                 Leverage
D                 float   %9.0g                 Cook's D
DFITS             float   %9.0g                 Dfits
W                 float   %9.0g                 Welsch distance
COVRATIO          float   %9.0g                 Covratio
DFpercent         float   %9.0g
DFpercent2        float   %9.0g
DFhigh            float   %9.0g
. summarize standard - DFhigh

    Variable |       Obs        Mean    Std. Dev.        Min        Max
-------------+----------------------------------------------------------
    standard |        51   -.0031359    1.010579  -2.099976   2.233379
     student |        51     -.00162    1.032723  -2.182423   2.336977
           h |        51    .0784314    .0373011   .0336437   .2151227
           D |        51    .0219941    .0364003   .0000135   .1860992
       DFITS |        51   -.0107348    .3064762   -.896658   .7444486
-------------+----------------------------------------------------------
           W |        51    -.089723    2.278704  -6.854601    5.52468
    COVRATIO |        51    1.092452    .1316834   .7607449   1.360136
   DFpercent |        51     .000938    .1498813  -.5067295   .5269799
  DFpercent2 |        51   -.0010659    .1370372   -.440771   .4253958
      DFhigh |        51   -.0012204    .1747835  -.6316988   .3414851
summarize shows us the minimum and maximum values of each statistic, so we can quickly check whether any are large enough to cause concern. For example, special tables could be used to determine whether the observation with the largest absolute studentized residual (student) constitutes a significant outlier. Alternatively, we could apply the Bonferroni inequality and a t distribution table: max|student| is significant at level α if |t| is significant at α/n. In this example, we have max|student| = 2.337 (Iowa) and n = 51. For Iowa to be a significant outlier (that is, to cause a significant shift in intercept) at α = .05, t = 2.337 must be significant at the .05/51 level:
. display .05/51
.00098039
Stata's ttail() function can approximate the probability of |t| > 2.337, given df = n - K - 1 = 51 - 3 - 1 = 47:
. display 2*ttail(47, 2.337)
.02375138
The obtained P-value (P = .0238) is not below α/n = .00098, so Iowa is not a significant outlier at α = .05.
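Equivalently (a sketch, not from the text), Stata's invttail() function gives the critical value, roughly 3.5, that max|student| would have to exceed under this Bonferroni rule:
. display invttail(47, .00098039/2)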
Studentized residuals measure the ith observation's influence on the y-intercept. Cook's D, DFITS, and Welsch's distance all measure the ith observation's influence on all coefficients in the model (or, equivalently, on all n predicted y values). To list the 5 most influential observations as measured by Cook's D, type
. sort D
. list state yhat3 D DFITS W in -5/l
              state      yhat3          D       DFITS           W
 47.   North Dakota   1036.696   .0705921    .5493086    4.020527
 48.        Wyoming   1017.005   .0789454   -.5820746   -4.270465
 49.      Tennessee   974.6981    .111718    .6992343    5.162398
 50.           Iowa    1052.78   .1265392    .7444486     5.52468
 51.           Utah   1067.712   .1860992    -.896658   -6.854601
The in -5/l qualifier tells Stata to list only the fifth-from-last (-5) through last (lowercase letter "l") observations. Figure 7.7 shows one way to display influence graphically: symbols in a residual-versus-predicted plot are given sizes proportional to values of Cook's D through the "analytical weight" option [aweight = D]. Five influential observations stand out, with large positive or negative residuals and high predicted csat values.
. graph twoway scatter e3 yhat3 [aweight = D], msymbol(oh) yline(0)
[Figure 7.7. Residuals versus fitted values, with marker sizes proportional to Cook's D]
Although they have different statistical rationales, Cook's D, Welsch's distance, and DFITS are closely related. In practice they tend to flag the same observations as influential. Figure 7.8 shows their similarity in the example at hand.
. graph matrix D W DFITS, half
[Figure 7.8. Scatterplot matrix of Cook's D, Welsch distance, and Dfits]
DFBETAs indicate how much each observation influences each regression coefficient. Typing dfbeta after a regression automatically generates DFBETAs for each predictor. In this example, they received the names DFpercent (the DFBETA for predictor percent), DFpercent2, and DFhigh. Figure 7.9 graphs their distributions as box plots.
. graph box DFpercent DFpercent2 DFhigh, legend(cols(3))
[Figure 7.9. Box plots of DFpercent, DFpercent2, and DFhigh]
From left to right, Figure 7.9 shows the distributions of DFBETAs for percent, percent2,
and high. (We could more easily distinguish them in color.) The extreme values in each plot
belong to Iowa and Utah, which also have the two highest Cook’s D values. For example,
Utah’s DFhigh = -.63. This tells us that Utah causes the coefficient on high to be .63 standard
errors lower than it would be if Utah were set aside. Similarly, DFpercent = .53 indicates that
with Utah present, the coefficient on percent is .53 standard errors higher (because the percent
regression coefficient is negative, “higher” means closer to 0) than it otherwise would be.
Thus, Utah weakens the apparent effects of both high and percent.
The most direct way to learn how particular observations affect a regression is to repeat the
regression with those observations set aside. For example, we could set aside all states that
move any coefficient by half a standard error (that is, have absolute DFBETAs of .5 or more):
. regress csat percent percent2 high if abs(DFpercent) < .5 & abs(DFpercent2) < .5 & abs(DFhigh) < .5

      Source |       SS       df       MS              Number of obs =      48
-------------+------------------------------           F(  3,    44) =  215.47
       Model |  175366.782     3  58455.5939           Prob > F      =  0.0000
    Residual |  11937.1351    44  271.298525           R-squared     =  0.9363
-------------+------------------------------           Adj R-squared =  0.9319
       Total |  187303.917    47  3985.18972           Root MSE      =  16.471

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     percent |  -6.510868   .4700719   -13.85   0.000    -7.458235     -5.5635
    percent2 |   .0538131    .005779     9.31   0.000     .0421664    .0654599
        high |    3.35664   .4577103     7.33   0.000     2.434186    4.279095
       _cons |   815.0279   33.93199    24.02   0.000     746.6424    883.4133
------------------------------------------------------------------------------
Careful inspection will reveal the details in which this regression table (based on n = 48) differs from its n = 51 or n = 50 counterparts seen earlier. Our central conclusion, that mean state SAT scores are well predicted by the percent of adults with high school diplomas and, curvilinearly, by the percent of students taking the test, remains unchanged, however.
Although diagnostic statistics draw attention to influential observations, they do not answer
the question of whether we should set those observations aside. That requires a substantive
decision based on careful evaluation of the data and research context. In this example, we have
no substantive reason to discard any states, and even the most influential of them do not
fundamentally change our conclusions.
Using any fixed definition of what constitutes an “outlier,” we are liable to see more of
them in larger samples. For this reason, sample-size-adjusted cutoffs are sometimes
recommended for identifying unusual observations. After fitting a regression model with K
coefficients (including the constant) based on n observations, we might look more closely at
those observations for which any of the following are true:
     leverage h > 2K/n
     Cook's D > 4/n
     |DFITS| > 2*sqrt(K/n)
     |Welsch's W| > 3*sqrt(K)
     |DFBETA| > 2/sqrt(n)
     |COVRATIO - 1| > 3K/n
The reasoning behind these cutoffs, and the diagnostic statistics more generally, can be found in Cook and Weisberg (1982, 1994); Belsley, Kuh, and Welsch (1980); or Fox (1991).
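As a minimal sketch (not from the text), these rules can be applied directly to the diagnostic variables created earlier for the csat regression, where K = 4 and n = 51:
. list state D DFITS if D > 4/51 | abs(DFITS) > 2*sqrt(4/51)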
Multicollinearity
If perfect multicollinearity (an exact linear relationship) exists among the predictors, regression equations become unsolvable. Stata handles this by warning the user and then automatically dropping one of the offending predictors. High but not perfect multicollinearity causes more subtle problems. When we add a new x variable that is strongly related to x variables already in the model, symptoms of possible trouble include the following:
1. Substantially higher standard errors, with correspondingly lower t statistics.
2. Unexpected changes in coefficient magnitudes or signs.
3. Nonsignificant coefficients despite a high R2.
Multiple regression attempts to estimate the independent effects of each x variable. There is little information for doing so, however, if one or more of the x variables does not have much independent variation. The symptoms listed above warn that coefficient estimates have become unreliable, and might shift drastically with small changes in the sample or model. Further troubleshooting is needed to determine whether multicollinearity really is at fault and, if so, what should be done about it.
Multicollinearity cannot necessarily be detected, or ruled out, by examining a matrix of correlations between variables. A better assessment comes from regressing each x on all of the other x variables. Then we calculate 1 - R2 from this regression to see what fraction of the first x variable's variance is independent of the other x variables. For example, about 97% of high's variance is independent of percent and percent2:
. quietly regress high percent percent2
. display 1 - e(r2)
.96942331
After regression, e(r2) holds the value of R2. Similar commands reveal that only 4% of
percent's variance is independent of the other two predictor variables:
. quietly regress percent high percent2
. display 1 - e(r2)
.04010307
This finding about percent and percent2 is not surprising. In polynomial regression or
regression with interaction terms, some x variables are calculated directly from other x
variables. Although strictly speaking their relationship is nonlinear, it often is close enough to
linear to raise problems of multicollinearity.
The post-regression command vif , for variance inflation factor, performs similar
calculations automatically. This gives a quick and straightforward check for multicollinearity.
. quietly regress csat percent percent2 high
. vif
    Variable |      VIF       1/VIF
-------------+----------------------
     percent |    24.94    0.040103
    percent2 |    24.78    0.040354
        high |     1.03    0.969423
-------------+----------------------
    Mean VIF |    16.92
The 1/VIF column at right in a vif table gives values equal to 1 - R2 from the regression of each x on the other x variables, as can be seen by comparing the values for high (.969423) or percent (.040103) with our earlier display calculations. That is, 1/VIF (or 1 - R2) tells us what proportion of an x variable's variance is independent of all the other x variables. A low proportion, such as the .04 (4% independent variation) of percent and percent2, indicates potential trouble. Some analysts set a minimum level, called tolerance, for the 1/VIF value, and automatically exclude predictors that fall below their tolerance criterion.
The VIF column at center in a vif table reflects the degree to which other coefficients’
variances (and standard errors) are increased due to the inclusion of that predictor. We see that
high has virtually no impact on other variances, but percent and percent2 affect the variances
substantially. VIF values provide guidance but not direct measurements of the increase in
coefficient variances. The following commands show the impact directly by displaying
standard error estimates for the coefficient on percent, when percent? is and is not included in
the model.
. quietly regress csat percent percent2 high
. display _se[percent]
.50958046
. quietly regress csat percent high
. display _se[percent]
.16162193
With percent2 included in the model, the standard error for percent is three times higher:
     .50958046/.16162193 = 3.1529166
This corresponds to a tenfold increase in the coefficient's variance.
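Squaring that ratio (a quick sketch, not in the text) confirms the roughly tenfold variance inflation:
. display (.50958046/.16162193)^2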
How much variance inflation is too much? Chatterjee, Hadi, and Price (2000) suggest the
following as guidelines for the presence of multicollinearity:
1. The largest VIF is greater than 10; or
2. the mean VIF is larger than 1.
With our largest VIFs close to 25, and the mean almost 17, the csat regression clearly meets
both criteria. How troublesome the problem is, and what, if anything, should be done about it,
are the next questions to consider.
Because percent and percent2 are closely related, we cannot estimate their separate effects with nearly as much precision as we could the effect of either predictor alone. That is why the standard error for the coefficient on percent increases threefold when we compare the regression of csat on percent and high to a polynomial regression of csat on percent, percent2, and high. Despite this loss of precision, however, we can still distinguish all the coefficients from zero. Moreover, the polynomial regression obtains a better prediction model. For these reasons, the multicollinearity in this regression does not necessarily pose a great problem, or require a solution. We could simply live with it as one feature of an otherwise acceptable model.
When solutions are needed, a simple trick called “centering” often succeeds in reducing
multicollinearity in polynomial or interaction-effect models. Centering involves subtracting the
mean from x variable values before generating polynomial or product terms. Subtracting the
mean creates a new variable centered on zero and much less correlated with its own squared
values. The resulting regression fits the same as an uncentered version. By reducing
multicollinearity, centering often (but not always) yields more precise coefficient estimates with
lower standard errors. The commands below generate a centered version of percent named
Cpercent, and then obtain squared values of Cpercent named Cpercent2.
. summarize percent

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     percent |        51    35.76471    26.19281          4         81

. generate Cpercent = percent - r(mean)
. generate Cpercent2 = Cpercent^2
. correlate Cpercent Cpercent2 percent percent2 high csat
(obs=51)

             | Cpercent Cperce~2  percent percent2     high     csat
-------------+--------------------------------------------------------
    Cpercent |   1.0000
   Cpercent2 |   0.3791   1.0000
     percent |   1.0000   0.3791   1.0000
    percent2 |   0.9794   0.5582   0.9794   1.0000
        high |   0.1413  -0.0417   0.1413   0.1176   1.0000
        csat |  -0.8758  -0.0428  -0.8758  -0.7946   0.0858   1.0000
Whereas percent and percent2 have a near-perfect correlation with each other (r = .9794), the centered versions Cpercent and Cpercent2 are just moderately correlated (r = .3791). Otherwise, correlations involving percent and Cpercent are identical because centering is a linear transformation. Correlations involving Cpercent2 are different from those with percent2, however. Figure 7.10 shows scatterplots that help to visualize these correlations, and the transformation's effects.
. graph matrix Cpercent Cpercent2 percent percent2 high csat, half msymbol(+)
[Figure 7.10. Scatterplot matrix of Cpercent, Cpercent2, % HS graduates taking SAT (percent), percent2, % over 25 w/HS diploma (high), and mean composite SAT score (csat)]
The R2, overall F test, predictions, and many other aspects of a model should be unchanged after centering. Differences will be most noticeable in the centered variable's coefficient and standard error.
. regress csat Cpercent Cpercent2 high
      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------           F(  3,    47) =  193.37
       Model |  207225.103     3  69075.0343           Prob > F      =  0.0000
    Residual |   16789.407    47  357.221426           R-squared     =  0.9251
-------------+------------------------------           Adj R-squared =  0.9203
       Total |   224014.51    50   4480.2902           Root MSE      =   18.90

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    Cpercent |  -2.682362   .1119085   -23.97   0.000    -2.907493   -2.457231
   Cpercent2 |   .0536555   .0063678     8.43   0.000     .0408452    .0664659
        high |   2.986509   .4857502     6.15   0.000     2.009305    3.963712
       _cons |   680.2552   37.82329    17.99   0.000     604.1646    756.3458
------------------------------------------------------------------------------
In this example, the standard error of the coefficient on Cpercent is actually lower (.1119085 compared with .16162193) when Cpercent2 is included in the model. The t statistic is correspondingly larger. Thus, it appears that centering did improve that coefficient estimate's precision. The VIF table now gives less cause for concern: each of the three predictors has more than 80% independent variation, compared with 4% for percent and percent2 in the uncentered regression.
. vif

    Variable |      VIF       1/VIF
-------------+----------------------
    Cpercent |     1.20    0.831528
   Cpercent2 |     1.18    0.846991
        high |     1.03    0.969423
-------------+----------------------
    Mean VIF |     1.14
Another diagnostic table sometimes consulted to check for multicollinearity is the matrix of correlations between the estimated coefficients (not between the variables). This matrix can be displayed after regress, anova, or other model-fitting procedures by typing
. correlate, _coef
             | Cpercent Cperce~2     high    _cons
-------------+-------------------------------------
    Cpercent |   1.0000
   Cpercent2 |  -0.3893   1.0000
        high |  -0.1700   0.1041   1.0000
       _cons |   0.2105  -0.2151  -0.9912   1.0000
High correlations between pairs of coefficients indicate possible collinearity problems. By adding the option covariance, we can see the coefficients' variance-covariance matrix, from which the standard errors are derived.
. correlate, _coef covariance

             |  Cpercent  Cperce~2      high     _cons
-------------+------------------------------------------
    Cpercent |   .012524
   Cpercent2 |  -.000277  .0000405
        high |  -.009239   .000322   .235953
       _cons |   .891126  -.051817  -18.2105    1430.6