Statistics with Stata
Updated for Version 9

Lawrence C. Hamilton
University of New Hampshire

THOMSON * BROOKS/COLE
Australia • Brazil • Canada • Mexico • Singapore • Spain • United Kingdom • United States

Statistics with Stata: Updated for Version 9
Lawrence C. Hamilton

Publisher: Curt Hinrichs
Senior Assistant Editor: Ann Day
Editorial Assistant: Daniel Geller
Technology Project Manager: Fiona Chong
Marketing Manager: Joe Rogove
Marketing Assistant: Brian Smith
Executive Marketing Communications Manager: Darlene Amidon-Brent
Project Manager, Editorial Production: Kelsey McGee
Creative Director: Rob Hugel
Art Director: Lee Friedman
Print Buyer: Darlene Suzuki
Permissions Editor: Kiely Sisk
Cover Designer: Denise Davidson/Simple Design
Cover Image: © Imtek Imagineering/Masterfile
Cover Printing, Printing & Binding: Webcom Limited

© 2006 Duxbury, an imprint of Thomson Brooks/Cole, a part of The Thomson Corporation. Thomson, the Star logo, and Brooks/Cole are trademarks used herein under license.

ALL RIGHTS RESERVED. No part of this work covered by the copyright hereon may be reproduced or used in any form or by any means (graphic, electronic, or mechanical, including photocopying, recording, taping, web distribution, information storage and retrieval systems, or in any other manner) without the written permission of the publisher.

Printed in Canada
1 2 3 4 5 6 7   09 08 07 06 05

Thomson Higher Education
10 Davis Drive
Belmont, CA 94002-3098
USA

Asia (including India)
Thomson Learning
5 Shenton Way
#01-01 UIC Building
Singapore 068808

Australia/New Zealand
Thomson Learning Australia
102 Dodds Street
Southbank, Victoria 3006
Australia

Canada
Thomson Nelson
1120 Birchmount Road
Toronto, Ontario M1K 5G4
Canada

Europe/Middle East/Africa
Thomson Learning
High Holborn House
50/51 Bedford Row
London WC1R 4LR
United Kingdom


Contents

Preface  ix

1  Stata and Stata Resources  1
     A Typographical Note  1
     An Example Stata Session  2
     Stata's Documentation and Help Files  7
     Searching for Information  8
     Stata Corporation  9
     Statalist  10
     The Stata Journal  10
     Books Using Stata  11

2  Data Management  12
     Example Commands  13
     Creating a New Dataset  15
     Specifying Subsets of the Data: in and if Qualifiers  19
     Generating and Replacing Variables  23
     Using Functions  27
     Converting between Numeric and String Formats  32
     Creating New Categorical and Ordinal Variables  35
     Using Explicit Subscripts with Variables  38
     Importing Data from Other Programs  39
     Combining Two or More Stata Files  42
     Transposing, Reshaping, or Collapsing Data  47
     Weighting Observations  54
     Creating Random Data and Random Samples  56
     Writing Programs for Data Management  60
     Managing Memory  61

3  Graphs  64
     Example Commands  65
     Histograms  67
     Scatterplots  72
     Line Plots  77
     Connected-Line Plots  83
     Other Twoway Plot Types  84
     Box Plots  90
     Pie Charts  92
     Bar Charts  94
     Dot Plots  99
     Symmetry and Quantile Plots  100
     Quality Control Graphs  105
     Adding Text to Graphs  109
     Overlaying Multiple Twoway Plots  110
     Graphing with Do-Files  115
     Retrieving and Combining Graphs  116

4  Summary Statistics and Tables  120
     Example Commands  120
     Summary Statistics for Measurement Variables  121
     Exploratory Data Analysis  124
     Normality Tests and Transformations  126
     Frequency Tables and Two-Way Cross-Tabulations  130
     Multiple Tables and Multi-Way Cross-Tabulations  133
     Tables of Means, Medians, and Other Summary Statistics  136
     Using Frequency Weights  138

5  ANOVA and Other Comparison Methods  141
     Example Commands  142
     One-Sample Tests  143
     Two-Sample Tests  146
     One-Way Analysis of Variance (ANOVA)  149
     Two- and N-Way Analysis of Variance  152
     Analysis of Covariance (ANCOVA)  153
     Predicted Values and Error-Bar Charts  155

6  Linear Regression Analysis  159
     Example Commands  159
     The Regression Table  162
     Multiple Regression  164
     Predicted Values and Residuals  165
     Basic Graphs for Regression  168
     Correlations  171
     Hypothesis Tests  175
     Dummy Variables  176
     Automatic Categorical-Variable Indicators and Interactions  183
     Stepwise Regression  186
     Polynomial Regression  188
     Panel Data  191

7  Regression Diagnostics  196
     Example Commands  196
     SAT Score Regression, Revisited  198
     Diagnostic Case Statistics  200
     Diagnostic Plots  205
     Multicollinearity  210

8  Fitting Curves  215
     Example Commands  215
     Band Regression  217
     Lowess Smoothing  219
     Regression with Transformed Variables - 1  223
     Regression with Transformed Variables - 2  227
     Conditional Effect Plots  230
     Nonlinear Regression - 1  232
     Nonlinear Regression - 2  235

9  Robust Regression  239
     Example Commands  240
     Regression with Ideal Data  240
     Y Outliers  243
     X Outliers (Leverage)  246
     Asymmetrical Error Distributions  248
     Robust Analysis of Variance  249
     Further rreg and qreg Applications  255
     Robust Estimates of Variance - 1  256
     Robust Estimates of Variance - 2  258

10  Logistic Regression  262
     Example Commands  263
     Space Shuttle Data  265
     Using Logistic Regression  270
     Conditional Effect Plots  273
     Diagnostic Statistics and Plots  274
     Logistic Regression with Ordered-Category Variables  278
     Multinomial Logistic Regression  280

11  Survival and Event-Count Models  288
     Example Commands  289
     Survival-Time Data  291
     Count-Time Data  293
     Kaplan-Meier Survivor Functions  295
     Cox Proportional Hazard Models  299
     Exponential and Weibull Regression  305
     Poisson Regression  309
     Generalized Linear Models  313

12  Principal Components, Factor, and Cluster Analysis  318
     Example Commands  319
     Principal Components  320
     Rotation  322
     Factor Scores  323
     Principal Factoring  326
     Maximum-Likelihood Factoring  327
     Cluster Analysis - 1  329
     Cluster Analysis - 2  333

13  Time Series Analysis  339
     Example Commands  339
     Smoothing  341
     Further Time Plot Examples  346
     Lags, Leads, and Differences  349
     Correlograms  351
     ARIMA Models  354

14  Introduction to Programming  361
     Basic Concepts and Tools  361
     Example Program: Moving Autocorrelation  369
     Ado-File  373
     Help File  375
     Matrix Algebra  378
     Bootstrapping  382
     Monte Carlo Simulation  387

References  395

Index  401

Stata and Stata Resources

Stata is a full-featured statistical program for Windows, Macintosh, and Unix computers. It combines ease of use with speed, a library of pre-programmed analytical and data-management capabilities, and programmability that allows users to invent and add further capabilities as needed. Most operations can be accomplished either via the pull-down menu system, or more directly via typed commands. Menus help newcomers to learn Stata, and help anyone to apply an unfamiliar procedure. The consistent, intuitive syntax of Stata commands frees experienced users to work more efficiently, and also makes it straightforward to develop programs for complex or repetitious tasks. Menu and command instructions can be mixed as needed during a Stata session. Extensive help, search, and link features make it easy to look up command syntax and other information instantly, on the fly.

After introductory information, we'll begin with an example Stata session to give you a sense of the flow of data analysis, and of how analytical results might be used. Later chapters explain these commands in more detail. Even without explanations, however, you can see how straightforward the commands are: use filename to retrieve dataset filename, summarize when you want summary statistics, correlate to get a correlation matrix, and so forth. Alternatively, the same results can be obtained by making choices from the Data or Statistics menus.

Stata users have available a variety of resources to help them learn about Stata and solve problems at any level of difficulty. These resources come not just from Stata Corporation but also from a worldwide community of users. Sections of this chapter introduce some key resources: Stata's documentation; where to phone, fax, write, or e-mail for technical help; Stata's web site (www.stata.com), which provides many services including updates and answers to frequently asked questions; the Statalist Internet forum; and the refereed Stata Journal.

A Typographical Note

This book employs several typographical conventions as a visual cue to how words are used:

Commands typed by the user appear in a bold Courier font. When the whole command line is given, it starts with a period, as seen in a Stata Results window or log (.log) file:

     . list year boats men penalty

Variable or file names within these commands appear in italics to emphasize the fact that they are arbitrary, and not a fixed part of the command.



Names of variables or files also appear in italics within the main text to distinguish them from ordinary words.

Items from Stata's menus are shown in an Arial font, with successive options separated by a dash. For example, we can open an existing dataset by selecting File - Open, and then finding and clicking on the name of the particular dataset. Note that some common menu actions can be accomplished either with text choices from Stata's top menu bar,

     File  Edit  Prefs  Data  Graphics  Statistics  User  Window  Help

or with the row of icons below these. For example, selecting File - Open is equivalent to clicking the leftmost icon, an opening file folder. One could also accomplish the same thing by typing a direct command of the form

     . use filename

Stata output as seen in the Results window is shown in a smaller Courier font. The small

Thus, we show the calculation of summary statistics for a variable named penalty as follows:

. summarize penalty

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     penalty |        10          63    59.59493         11        183

These typographic conventions exist only in this book, and not within the Stata program itself. Stata can display a variety of onscreen fonts, but it does not use italics in commands. Once Stata log files have been imported into a word processor, or a results table copied and pasted, you might want to format them in a Courier font, 10 point or smaller, so that columns will line up correctly.

In its commands and variable names, Stata is case sensitive. Thus, summarize is a command, but Summarize and SUMMARIZE are not. Penalty and penalty would be two different variables.
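For example, here is a brief sketch using the generate command (introduced in Chapter 2) with the variable penalty from the session below; the name logpen is made up for illustration:

     . generate logpen = log(penalty)
     . generate Logpen = log(penalty)

Because names are case sensitive, these two commands create two distinct variables, logpen and Logpen, even though both hold the same values.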

An Example Stata Session

As a preview showing Stata at work, this section retrieves and analyzes a previously-created dataset named lofoten.dta. Jentoft and Kristoffersen (1989) originally published these data in an article about self-management among fishermen on Norway's arctic Lofoten Islands. There are 10 observations (years) and 5 variables, including penalty, a count of how many fishermen were cited each year for violating fisheries regulations.

If we might eventually want a record of our session, the best way to prepare for this is by opening a "log file" at the start. Log files contain commands and results tables, but not graphs. To begin a log file, click the scroll-shaped Begin Log icon, and specify a name and folder for the resulting log file. Alternatively, a log file could be started by choosing File - Log - Begin from the top menu bar, or by typing a direct command such as

. log using monday1

Multiple ways of doing such things are common in Stata. Each has its own advantages, and each suits different situations or user tastes.

Log files can be created either in a special Stata format (.smcl), or in ordinary text or ASCII format (.log). A .smcl ("Stata markup and control language") file will be nicely formatted for viewing or printing within Stata. It could also contain hyperlinks that help to understand commands or error messages. .log (text) files lack such formatting, but are simpler to use if you plan later to insert or edit the output in a word processor. After selecting which type of log file you want, click Save. For this session, we will create a .smcl log file named monday1.smcl.

An existing Stata-format dataset named lofoten.dta will be analyzed here. To open or retrieve this dataset, we again have several options:

select File - Open - lofoten.dta using the top menu bar;
click the Open (file folder) icon and select lofoten.dta; or
type the command use lofoten.

Under its default Windows configuration, Stata looks for data files in folder C:\data. If the file we want is in a different folder, we could specify its location in the use command,

. use c:\books\sws8\chapter01\lofoten

or change the session's default folder by issuing a cd (change directory) command:

. cd c:\books\sws8\chapter01\
. use lofoten

Often, the simplest way to retrieve a file will be to choose File - Open and browse through folders in the usual way.
To see a brief description of the dataset now in memory, type

. describe

Contains data from C:\data\lofoten.dta
  obs:            10                          Jentoft & Kristoffersen '89
 vars:             5                          30 Jun 2005 10:36
 size:           130 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
year            int    %9.0g                  Year
boats           int    %9.0g                  Number of fishing boats
men             int    %9.0g                  Number of fishermen
penalty         int    %9.0g                  Number of penalties
decade          byte   %9.0g       decade     Early 1970s or early 1980s
-------------------------------------------------------------------------------
Sorted by:  year

Many Stata commands can be abbreviated to their first few letters. For example, we could shorten describe to just the letter d. Using menus, the same table could be obtained by choosing Data - Describe data - Describe variables in memory - OK.

This dataset has only 10 observations and 5 variables, so we can easily list its contents by typing the command list (or the letter l; or Data - Describe data - List data - OK):


. list

     +------------------------------------------+
     | year   boats    men   penalty   decade   |
     |------------------------------------------|
  1. | 1971    1809   5281        71    1970s   |
  2. | 1972    2017   6304       152    1970s   |
  3. | 1973    2068   6794       183    1970s   |
  4. | 1974    1693   5227        39    1970s   |
  5. | 1975    1441   4077        36    1970s   |
     |------------------------------------------|
  6. | 1981    1540   4033        11    1980s   |
  7. | 1982    1689   4267        15    1980s   |
  8. | 1983    1842   4430        34    1980s   |
  9. | 1984    1847   4622        74    1980s   |
 10. | 1985    1365   3514        15    1980s   |
     +------------------------------------------+

Analysis could begin with a table of means, standard deviations, minimum values and maximum values (type summarize or su; or select Statistics - Summaries, tables, & tests - Summary statistics - Summary statistics - OK):

. summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        year |        10        1978    5.477226       1971       1985
       boats |        10      1731.1    232.1328       1365       2068
         men |        10      4854.9    1045.577       3514       6794
     penalty |        10          63    59.59493         11        183
      decade |        10          .5    .5270463          0          1

To print results from the session so far, bring the Results window to the front by clicking on this window or on the Bring Results Window to Front icon, and then click the Print icon.

To copy a table, commands, or other information from the Results window into a word processor, make sure the Results window is in front. Drag the mouse to select the results you want, right-click the mouse, and then choose Copy Text from the mouse's menu. Finally, switch to your word processor and, at the desired insertion point, choose Edit - Paste or click a "clipboard" icon on the word processor's menu bar.

Tabulating penalty by decade confirms that there were more penalties in the 1970s:

. tabulate decade, sum(penalty)

Early 1970s |
   or early |    Summary of Number of penalties
      1980s |        Mean   Std. Dev.       Freq.
------------+------------------------------------
      1970s |        96.2   67.41439            5
      1980s |        29.8   26.281172           5
------------+------------------------------------
      Total |          63   59.594929          10

The same table could be obtained through menus: Statistics - Summaries, tables, & tests - Tables - One/two-way table of summary statistics, then fill in decade as variable 1, and penalty as the variable to be summarized. Although menu choices are often straightforward to use, you can see that they tend to be more complicated to describe than the simple text commands. From this point on, we will focus primarily on the commands, mentioning menu alternatives only occasionally. Fully exploring the menus, and working out how to use them to obtain similar results, will be left to the reader. For similar reasons, the Stata reference manuals likewise take a command-based approach.
Perhaps the number of penalties declined because fewer people were fishing in the 1980s. The number of penalties correlates strongly (r > .8) with the number of boats and fishermen:

. correlate boats men penalty
(obs=10)

             |    boats      men  penalty
-------------+---------------------------
       boats |   1.0000
         men |   0.8748   1.0000
     penalty |   0.8259   0.9312   1.0000

A graph might help clarify these interrelationships. Figure 1.1 plots men and penalty against year. It was produced by the graph twoway connected command. We first ask for a connected-line plot of men against year, using the left-hand y axis, yaxis(1). After the separator ||, we next ask for a connected-line plot of penalty against year, this time using the right-hand y axis, yaxis(2). The resulting graph visualizes the correspondence between the number of fishermen and the number of penalties over time.

. graph twoway connected men year, yaxis(1) || connected penalty year, yaxis(2)

[Figure 1.1. Connected-line plots of Number of fishermen (left y axis) and Number of penalties (right y axis) against Year, 1970-1985.]

Because the years 1976 to 1980 are missing in these data, Figure 1.1 shows 1975 connected to 1981. For some purposes, we might hesitate to do this. Instead, we could either find data for the missing years, or leave those points unconnected by issuing a slightly more complicated set of commands.


To print this graph, click on the Graph window or on the Bring Graph Window to Front icon, and then click the Print icon.

To copy the graph directly into a word processor or other document, bring the Graph window to the front, right-click on the graph, and select Copy. Switch to your word processor, go to the desired insertion point, and issue an appropriate "paste" command such as Edit - Paste or Edit - Paste Special (Metafile), or click a "clipboard" icon (different word processors will handle this differently).

To save the graph for future use, either right-click and Save, or select File - Save Graph from the top menu bar. The Save As Type submenu offers several different file formats to choose from. On a Windows system, the choices include

Stata graph (*.gph)          (A "live" graph, containing enough information for Stata to edit.)
As-is graph (*.gph)          (A more compact Stata graph format.)
Windows Metafile (*.wmf)
Enhanced Metafile (*.emf)
Portable Network Graphics (*.png)
TIFF (*.tif)
PostScript (*.ps)
Encapsulated PostScript with TIFF preview (*.eps)
Encapsulated PostScript (*.eps)

Regardless of which graphics format we want, it might be worthwhile also to save a copy of our graph in "live" .gph format. Live .gph graphs can later be retrieved, combined, recolored, or reformatted using the graph use or graph combine commands (Chapter 3).
Instead of using menus, graphs can be saved by adding a saving(filename) option to any graph command. To save a graph with the file name figure1.gph, add another separator ||, a comma, and saving(figure1). Chapter 3 explains more about the logic of graph commands. The complete command now contains the following (typed in the Stata Command window with as many spaces as you want, but no hard returns):

. graph twoway connected men year, yaxis(1) || connected penalty year, yaxis(2) ||, saving(figure1)

Through all of the preceding analyses, the log file monday1.smcl has been storing our results. There are several possible ways to review this file to see what we have done:

selecting File - Log - View - OK;
clicking the Log icon and choosing View snapshot of log file - OK; or
typing the command view monday1.smcl

We could print the log file by choosing Print. Log files close automatically at the end of a Stata session, or earlier if instructed by one of the following:


selecting File - Log - Close;
clicking the Log icon and choosing Close log file - OK; or
typing the command log close

Once closed, the file monday1.smcl could be opened again through File - View during a subsequent Stata session. To make an output file that can be opened easily by your word processor, either translate the log file from .smcl (a Stata format) to .log (standard ASCII text format) by typing

. translate monday1.smcl monday1.log

or start out by creating the file in .log instead of .smcl format.
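Putting the pieces of this session together, the same analysis could be run start to finish with typed commands alone. A condensed sketch, using the file names from this chapter:

     . log using monday1
     . use lofoten
     . describe
     . summarize
     . tabulate decade, sum(penalty)
     . correlate boats men penalty
     . log close
     . translate monday1.smcl monday1.log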

Stata’s Documentation and Help Files
The complete Stata 9 Documentation Set includes over 6,000 pages in 15 volumes: a slim
Getting Started manual (for example, Getting Started with Stata for Windows), the more
extensive User’s Guide, the encyclopedic three-volume Base Reference Manual, and separate
reference manuals on data management, graphics, longitudinal and panel data, matrix
programming (Mata), multivariate statistics, programming, survey data, survival analysis and
epidemiological tables, and time series analysis. Getting Started helps you do just that, with
the basics of installation, window management, data entry, printing, and so on. The User's
Guide contains an extended discussion of general topics, including resources and
troubleshooting. Of particular note for new users is the User's Guide section on “Commands
everyone should know.” The Base Reference Manual lists all Stata commands alphabetically.
Entries for each command include the full command syntax, descriptions of all available
options, examples, technical notes regarding formulas and rationale, and references for further
reading. Data management, graphics, panel data, etc. are covered in the general references, but
these complicated topics get more detailed treatment and examples in their own specialized
manuals. A Quick Reference and Index volume rounds out the whole collection.
When we are in the midst of a Stata session, it is often simpler to ask for onscreen help
instead of consulting the manuals. Selecting Help from the top menu bar invokes a drop-down
menu of further choices, including help on specific commands, general topics, online updates,
the Stata Journal, or connections to Stata’s web site (www.stata.com). Alternatively, we can
bring the Viewer to the front and use its Search or Contents features to find information.
We can also use the help command. Typing help correlate, for example, causes
help information to appear in a Viewer window. Like the reference manuals, onscreen help
provides command syntax diagrams and complete lists of options. It also includes some
examples, although often less detailed and without the technical discussions found in the
manuals. The Viewer help has several advantages over the manuals, however. It can search
for keywords in the documentation or on Stata’s web site. Hypertext links take you directly to
related entries. Onscreen help can also include material about recent updates, or the
“unofficial” Stata programs that you have downloaded from Stata’s web site or from other
users.


Searching for Information

Selecting Help - Search - Search documentation and FAQs provides a direct way to search for information in Stata's documentation or in the web site's FAQs (frequently asked questions) and other pages. The equivalent Stata command is

. search keywords

Options available with search allow us to limit our search to the documentation and FAQs, to net resources including the Stata Journal, or to both. For example,

. search median regression

will search the documentation and FAQs for information indexed by both keywords, "median" and "regression". To search for these keywords across Stata's Internet resources in addition to the documentation and FAQs, type

. search median regression, all

Search results in the Viewer window contain clickable hyperlinks leading to further information
or original citations.

One specialized use for the search command is to provide more information on those occasions when our command does not succeed as planned, but instead results in one of Stata's cryptic numerical error messages. For example, typing the one-word command table produces the error or "return code" r(100):

. table
varlist required
r(100);

The table command evidently requires a list of variables. Often, however, the meaning of an error message is less obvious. To learn more about what return code r(100) refers to, type

. search rc 100

Keyword search

        Keywords:  rc 100
          Search:  (1) Official help files, FAQs, Examples, SJs, and STBs

Search of official help files, FAQs, Examples, SJs, and STBs

[P]     error . . . . . . . . . . . . . . . . . . . . . Return code 100
        varlist required;
        = exp required;
        using required;
        by() option required;
        Certain commands require a varlist or another element of the
        language.  The message specifies the required item that was
        missing from the command you gave.  See the command's syntax
        diagram.  For example, merge requires using be specified;
        perhaps you meant to type append.  Or, ranksum requires a by()
        option; see [R] signrank.

(end of search)

Type help search for more about this command.


Stata Corporation

For orders, licensing, and upgrade information, you can contact Stata Corporation by e-mail at

stata@stata.com

or visit their web site at

http://www.stata.com

Stata's extensive web site contains a wealth of user-support information and links to resources. Stata Press also has its own web site, containing information about Stata publications including the datasets used for examples:

http://www.stata-press.com

Both web sites are well worth exploring.

The mailing or physical address is

Stata Corporation
4905 Lakeway Drive
College Station, TX 77845 USA

Telephone access includes an easy-to-remember 800 number:

                U.S. and Canada            International
telephone:      1-800-STATAPC              1-979-696-4600
                (1-800-782-8272)
fax:            1-800-248-8272             1-979-696-4601

Online updates within major versions are free to licensed Stata users. These provide a fast and simple way to obtain the latest enhancements, bug fixes, etc. for your current version. To find out whether updates exist for your Stata, and initiate the simple online update process itself, type the command

. update query

Technical support can be obtained by sending e-mail messages with your Stata serial number in the subject line to

tech_support@stata.com

Before calling or writing for technical help, though, you might want to look at www.stata.com to see whether your question is a FAQ. The site also provides product, ordering, and help information; international notes; and assorted news and announcements. Much attention is given to user support, including the following:

FAQs — Frequently asked questions and their answers. If you are puzzled by something and can't find the answer in the manuals, check here next — it might be a FAQ. Example questions range from basic ("How can I convert other packages' files to Stata format data files?") to more technical queries such as "How do I impose the restriction that rho is zero using the heckman command with full ml?"


UPDATES — Frequent minor updates or bug fixes, downloadable at no cost by licensed Stata
users.
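If update query reports that newer official files are available, a follow-up command downloads and installs them. A brief sketch (Stata 9 syntax; an Internet connection is assumed):

     . update query
     . update all

update all obtains both executable and ado-file updates in a single step.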


OTHER RESOURCES — Links and information including online Stata instruction (NetCourses); enhancements from the Stata Journal; an independent listserver (Statalist) for discussions among Stata users; a bookstore selling books about Stata and other up-to-date statistical references; downloadable datasets and programs for Stata-related books; and links to statistical web sites including Stata's competitors.

The following sections describe some of the most important user-support resources.

Statalist

Statalist provides a valuable online forum for communication among active Stata users. It is independent of Stata Corporation, although Stata programmers monitor it and often contribute to the discussion. To subscribe to Statalist, send an e-mail message to

majordomo@hsphsun2.harvard.edu

The body of this message should contain only the following words:

subscribe statalist

The list processor will acknowledge your message and send instructions for using the list, including how to post messages of your own. Any message sent to the following address goes out to all current subscribers:

statalist@hsphsun2.harvard.edu

Do not try to subscribe or unsubscribe by sending messages directly to the statalist address. This does not work, and your mistake goes to hundreds of subscribers. To unsubscribe from the list, write to the same majordomo address you used to subscribe:

majordomo@hsphsun2.harvard.edu

but send only the message

unsubscribe statalist

or send the equivalent message

signoff statalist

If you plan to be traveling or offline for a while, unsubscribing will keep your mailbox from filling up with Statalist messages. You can always re-subscribe.

Searchable Statalist archives are available at

http://www.stata.com/statalist/archive

The material on Statalist includes requests for programs, solutions, or advice, as well as answers and general discussion. Along with the Stata Journal (discussed below), Statalist plays a major role in extending the capabilities both of Stata and of serious Stata users.

The Stata Journal

From 1991 through 2001, a bimonthly publication called the Stata Technical Bulletin (STB) served as a means of distributing new commands and Stata updates, both user-written and official. Accumulated STB articles were published in book form each year as Stata Technical Bulletin Reprints, which can be ordered directly from Stata Corporation.

With the growth of the Internet, instant communication among users became possible through vehicles such as Statalist. Program files could easily be downloaded from distant sources. A bimonthly printed journal and disk no longer provided the best avenues either for communicating among users, or for distributing updates and user-written programs. To adapt to a changing world, the STB had to evolve into something new.

The Stata Journal was launched to meet this challenge and the needs of Stata's broadening user community. It distributes user-written commands, along with unofficial commands written by Stata Corporation employees. New commands are not its primary focus, however. The Stata Journal also contains refereed expository articles about statistics, book reviews, and a number of interesting columns, including "Speaking Stata" by Nicholas J. Cox on effective use of the Stata programming language. The Stata Journal is intended for novice as well as experienced Stata users. For example, here are the contents from one recent issue:
"Exploratory analysis of single nucleotide polymorphism (SNP) for quantitative traits"  M.A. Cleves
"Value label utilities: labeldup and labelrename"  J. Weesie
"Multilingual datasets"  J. Weesie
"Multiple imputation of missing values: update"  P. Royston
"Estimation and testing of fixed-effect panel-data systems"  J.L. Blackwell, III
"Data inspection using biplots"  U. Kohler & M. Luniak
"Stata in space: Econometric analysis of spatially explicit raster data"  D. Muller
"Using the file command to produce formatted output for other applications"
"Speaking Stata: Density probability plots"  N.J. Cox
Review of Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models  S. Lemeshow & M.L. Moeschberger

The Stata Journal is published quarterly. Subscriptions can be purchased directly from Stata Corporation by visiting www.stata.com.

Books Using Stata
In addition to Stata's own reference manuals, a growing library of books describe Stata, or use
Stata to explain statistical techniques. These books include general introductions; disciplinary
applications such as social science, biostatistics, or econometrics; and focused texts concerning
survey analysis, experimental data, categorical dependent variables, and other subjects. The
Bookstore pages on Stata's web site have up-to-date lists, with descriptions of contents:
http://www.stata.com/bookstore/
Such books provide one more place to learn about, and make fuller use of, Stata.

Data Management

There are several ways to bring data into Stata: we can type the data into the Data Editor,
read a raw-data (ASCII) file, or, with the help of a data transfer program, translate the dataset
directly from a system file created by another spreadsheet, database, or statistical program.
Once Stata has the data in memory, we can save the data in Stata format for easy retrieval
and updating in the future.
   Data management encompasses the initial tasks of creating a dataset, editing to correct
errors, and adding internal documentation such as variable and value labels. It also
encompasses many other jobs required by ongoing projects, such as adding new observations
or variables; reorganizing, simplifying, or sampling from the data; separating, combining, or
collapsing datasets; converting variable types; and creating new variables through algebraic or
logical expressions. When data-management tasks become complex or repetitive, Stata users
often write their own programs to automate the work. Although Stata is best known for its
analytical capabilities, it possesses a broad range of data-management features as well. This
chapter introduces some of the basics.
   The User's Guide provides an overview of the different methods for inputting data,
followed by eight rules for determining which input method to use. Input, editing, and many
other operations discussed in this chapter can be accomplished through the Data menus. Data
menu subheadings refer to the general category of task:
Describe data
Data editor

Data browser (read-only editor)
Create or change variables
Sort

Combine datasets
Labels
Notes

Variable utilities
Matrices

Other utilities


Example Commands

append using olddata

Reads previously-saved dataset olddata.dta and adds all its observations to the data
currently in memory. Subsequently typing save newdata, replace will save the
combined dataset as newdata.dta.

browse

Opens the spreadsheet-like Data Browser for viewing the data. The Browser looks similar
to the Data Editor, but it has no editing capability, so there is no risk of inadvertently
changing your data. Alternatively, click the Data Browser button on the toolbar.

. browse boats men if year > 1980

Opens the Data Browser showing only the variables boats and men for observations in
which year is greater than 1980. This example illustrates the if qualifier, which can be
used to focus the operation of many Stata commands.
. compress

Automatically converts all variables to their most efficient storage types to conserve
memory and disk space. Subsequently typing the command save filename,
replace will make these changes permanent.
. drawnorm zl z2 z3, n(5000)

Creates an artificial dataset with 5,000 observations and three random variables, z1, z2, and
z3, sampled from uncorrelated standard normal distributions. Options could specify other
means, standard deviations, and correlation or covariance matrices.
. edit

Opens the spreadsheet-like Data Editor where data can be entered or edited. Alternatively,
choose Window - Data Editor or click the Data Editor button.
. edit boats year men

Opens the Data Editor with only the variables boats, year, and men (in that order) visible
and available for editing.
. encode stringvar, gen(numvar)

Creates a new variable named numvar, with labeled numerical values based on the string
(non-numeric) variable stringvar.
. format rainfall %8.2f

Establishes a fixed (f) display format for numeric variable rainfall: 8 columns wide, with
two digits always shown after the decimal.
. generate newvar = (x + y)/100

Creates a new variable named newvar, equal to the sum of x and y divided by 100.
. generate newvar = uniform()

Creates a new variable with values sampled from a uniform random distribution over the
interval ranging from 0 to nearly 1, written [0,1).


infile x y z using data.raw

Reads an ASCII file named data.raw containing data on three variables: x, y, and z. The
values of these variables are separated by one or more white-space characters (blanks,
tabs, and newlines), or by commas. With white-space
delimiters, missing values are represented by periods, not blanks. With comma-delimited
data, missing values are represented by a period or by two consecutive commas. Stata also
provides for extended missing values, which we will discuss later. Other commands are
better suited for reading tab-delimited, comma-delimited, or fixed-column raw data; type
help infiling for more information.
list

Lists the data in default or “table” format. If the dataset contains many variables, table
format becomes hard to read, and list, display produces better results. See help
list for other options controlling the format of data lists.
. list x y z in 5/20

Lists the x, y, and z values of the 5th through 20th observations, as the data are presently
sorted. The in qualifier works in similar fashion with most other Stata commands as
well.
. merge id using olddata

Reads the previously-saved dataset olddata.dta and matches observations from olddata
with observations in memory that have identical lvalues. Both olddata (the “using” data)
and the data currently in memory (the “master” data) must already be sorted by id.
. replace oldvar = 100 * oldvar

Replaces the values of oldvar with 100 times their previous values.
sample 10

Drops all the observations in memory except for a 10% random sample. Instead of
selecting a certain percentage, we could select a certain number of cases. For example,
sample 55, count would drop all but a random sample of size n = 55.
. save newfile

Saves the data currently in memory, as a file named newfile.dta. If newfile.dta already
exists, and you want to write over the previous version, type save newfile,
replace. Alternatively, use the menus: File - Save or File - Save As . To save
newfile.dta in the format of Stata version 7, type saveold newfile .
• set memory 24m

(Windows or Unix systems only) Allocates 24 megabytes of memory for Stata data. The
amount set could be greater or less than the current allocation. Virtual memory (disk space)
is used if the request exceeds physical memory. Type clear to drop the current data
from memory before using set memory.
. sort x

Sorts the data from lowest to highest values of x. Observations with missing x values
appear last after sorting because Stata views missing values as very high numbers. Type
help gsort for a more general sorting command that can arrange values in either
ascending or descending order and can optionally place the missing values first.
. tabulate x if y > 65

Produces a frequency table for x using only those observations that have y values above 65.
The if qualifier works similarly with most other Stata commands.


. use oldfile
Retrieves the previously-saved Stata-format dataset oldfile.dta from disk, and places it in
memory. If other data are currently in memory, and you want to discard those data without
saving them, type use oldfile, clear . Alternatively, these tasks can be
accomplished through File - Open or by clicking the Open button.

Creating a New Dataset

To practice creating a dataset, we will enter the data on Canadian provinces and territories
listed in Table 2.1. (From the Federal, Provincial and Territorial Advisory Committee on
Population Health, 1996. Canada's newest territory, Nunavut, is not listed here because it
was part of the Northwest Territories until 1999.)
Table 2.1: Data on Canada and Its Provinces

Place                     1995 Pop.   Unemployment     Male Life    Female Life
                          (1000's)    Rate (percent)   Expectancy   Expectancy
Canada                     29606.1        10.6            75.1         81.1
Newfoundland                 575.4        19.6            73.9         79.8
Prince Edward Island         136.1        19.1            74.8         81.3
Nova Scotia                  937.8        13.9            74.2         80.4
New Brunswick                760.1        13.8            74.8         80.6
Quebec                      7334.2        13.2            74.5         81.2
Ontario                    11100.3         9.3            75.5         81.1
Manitoba                    1137.5         8.5            75.0         80.8
Saskatchewan                1015.6         7.0            75.2         81.8
Alberta                     2747.0         8.4            75.5         81.4
British Columbia            3766.0         9.8            75.8         81.4
Yukon                         30.1                        71.3         80.4
Northwest Territories         65.8                        70.2         78.0

The simplest way to create a dataset from Table 2.1 is through Stata's spreadsheet-like Data
Editor, which is invoked either by clicking the Data Editor button, selecting Window - Data
Editor from the top menu bar, or by typing the command edit. Then begin typing values for
each variable in columns that Stata automatically calls var1, var2, etc. Thus, var1 contains
place names (Canada, Newfoundland, etc.); var2, populations; and so forth.

[Data Editor window, showing the first rows of the new data in columns var1 through var5]


We can assign more descriptive variable names by double-clicking on the column headings
(such as var1) and then typing a new name in the resulting dialog box; eight characters or
fewer works best, although names with up to 32 characters are allowed. We can also create
variable labels that contain a brief description. For example, var2 (population) might be
renamed pop, and given the variable label "Population in 1000s, 1995".
Renaming and labeling variables can also be done outside of the Data Editor through the
rename and label variable commands:
. rename var2 pop

. label variable pop "Population in 1000s, 1995"

Cells left empty, such as unemployment rates for the Yukon and Northwest Territories, will
automatically be assigned Stata's system (default) missing value code, a period. At any time,
we can close the Data Editor and then save the dataset to disk. Clicking the Data Editor
button or Window - Data Editor brings the Editor back.
If the first value entered for a variable is a number, as with population, unemployment, and
life expectancy, then Stata assumes that this column is a “numerical variable” and it will
thereafter permit only numerical values. Numerical values can also begin with a plus or minus
sign, include decimal points, or be expressed in scientific notation. For example, we could
represent Canada's population as 2.96061e+7, which means 2.96061 x 10^7 or about 29.6 million
people. Numerical values should not include any commas, such as 29,606,100. If we did
happen to put commas within the first value typed in a column, Stata would interpret this as a
“string variable” (next paragraph) rather than as a number.
If the first value entered for a variable includes non-numerical characters, as did the place
names above (or “1,000” with the comma), then Stata thereafter considers this column to be a
string variable. String variable values can be almost any combination of letters, numbers,
symbols, or spaces up to 80 characters long in Intercooled or Small Stata, and up to 244
characters in Stata/SE. We can thus store names, quotations, or other descriptive information.
String variable values can be tabulated and counted, but do not allow the calculation of means,
correlations, or most other statistics. In the Data Editor or Data Browser, string variable values
appear in red, so we can visually distinguish the two variable types.
After typing in the information from Table 2.1 in this fashion, we close the Data Editor and
save our data, perhaps with the name canada0.dta:
. save canada0

Stata automatically adds the extension .dta to any dataset name, unless we tell it to do
otherwise. If we already had saved and named an earlier version of this file, it is possible to
write over that with the newest version by typing
save, replace

At this point, our new dataset looks like this:

L

r

Data Management

. describe

Contains data from C:\data\canada0.dta
  obs:            13
 vars:             5                          3 Jul 2005 10:30
 size:           533 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
---------------------------------------------------------------------
var1            str21  %21s
pop             float  %9.0g                  Population in 1000s, 1995
var3            float  %9.0g
var4            float  %9.0g
var5            float  %9.0g
---------------------------------------------------------------------
Sorted by:

. list

     +-------------------------------------------------------+
     |                  var1       pop   var3   var4   var5 |
     |-------------------------------------------------------|
  1. |                Canada   29606.1   10.6   75.1   81.1 |
  2. |          Newfoundland     575.4   19.6   73.9   79.8 |
  3. |  Prince Edward Island     136.1   19.1   74.8   81.3 |
  4. |           Nova Scotia     937.8   13.9   74.2   80.4 |
  5. |         New Brunswick     760.1   13.8   74.8   80.6 |
     |-------------------------------------------------------|
  6. |                Quebec    7334.2   13.2   74.5   81.2 |
  7. |               Ontario   11100.3    9.3   75.5   81.1 |
  8. |              Manitoba    1137.5    8.5     75   80.8 |
  9. |          Saskatchewan    1015.6      7   75.2   81.8 |
 10. |               Alberta      2747    8.4   75.5   81.4 |
     |-------------------------------------------------------|
 11. |      British Columbia      3766    9.8   75.8   81.4 |
 12. |                 Yukon      30.1      .   71.3   80.4 |
 13. | Northwest Territories      65.8      .   70.2     78 |
     +-------------------------------------------------------+

. summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        var1 |         0
         pop |        13    4554.769    8214.304       30.1    29606.1
        var3 |        11    12.10909    4.250048          7       19.6
        var4 |        13    74.29231    1.673052       70.2       75.8
        var5 |        13    80.71539    .9754028         78       81.8

Examining such output tables gives us a chance to look for errors that should be corrected.
The summarize table, for instance, provides several numbers useful in proofreading: var3
(unemployment rate) has only 11 nonmissing observations, and the maximum of pop belongs
to one observation (Canada) that is not a province at all.
The next step is to make our dataset more self-documenting. The variables could be given
more descriptive names, such as the following:

. rename var1 place

. rename var3 unemp

. rename var4 mlife

. rename var5 flife

Stata also permits us to add several kinds of labels to the data. label data describes
the dataset as a whole. For example,

. label data "Canadian dataset 0"

label variable describes an individual variable. For example,

. label variable place "Place name"

. label variable unemp "% 15+ population unemployed, 1995"

. label variable mlife "Male life expectancy years"

. label variable flife "Female life expectancy years"

By labeling data and variables, we obtain a dataset that is more self-explanatory:

. describe

Contains data from C:\data\canada0.dta
  obs:            13                          Canadian dataset 0
 vars:             5                          3 Jul 2005 10:45
 size:           533 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
---------------------------------------------------------------------
place           str21  %21s                   Place name
pop             float  %9.0g                  Population in 1000s, 1995
unemp           float  %9.0g                  % 15+ population unemployed, 1995
mlife           float  %9.0g                  Male life expectancy years
flife           float  %9.0g                  Female life expectancy years
---------------------------------------------------------------------
Sorted by:

Once labeling is completed, we should save the data to disk by using File - Save or typing

. save, replace

We can later retrieve these data any time through File - Open, or by typing

. use c:\data\canada0
(Canadian dataset 0)

We can then proceed with a new analysis. We might notice, for instance, that male and female
life expectancies correlate positively with each other and also negatively with the
unemployment rate. The life expectancy-unemployment rate correlation is slightly stronger
for males.

. correlate unemp mlife flife
(obs=11)

             |    unemp    mlife    flife
-------------+---------------------------
       unemp |   1.0000
       mlife |  -0.7440   1.0000
       flife |  -0.6173   0.7631   1.0000

The order of observations within a dataset can be changed through the sort command.
For example, to rearrange observations from smallest to largest in population, type

. sort pop

String variables are sorted alphabetically instead of numerically. Typing sort place will
rearrange observations putting Alberta first, British Columbia second, and so on.
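To sort in descending order instead, Stata provides the more general gsort command, where a minus sign before a variable name reverses that variable's sort order. A small sketch using this chapter's variables (these commands are illustrative additions, not part of the original session):

. gsort -pop
. list place pop in 1/3

The first command arranges observations from largest to smallest population, so the list would begin with Canada.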
We can control the order of variables in the data using the order command. For
example, we could make unemployment rate the second variable, and population last:

. order place unemp mlife flife pop

The Data Editor also has buttons that perform these functions. The Sort button applies to
the column currently highlighted by the cursor. The « and » buttons move the current
variable to the beginning or end of the variable list, respectively. As with any other editing,
these changes only become permanent if we subsequently save our data.
   The Data Editor's Hide button does not rearrange the data, but rather makes a column
temporarily invisible on the spreadsheet. This feature is convenient if, for example, we need
to type in more variables and want to keep the province names or some other case identification
column in view, adjacent to the "active" column where we are entering data.
   We can also restrict the Data Editor beforehand to work only with certain variables, in a
specified order, or with a specified range of values. For example,

. edit place mlife flife

or

. edit place unemp if pop > 100

The last example employs an if qualifier, an important tool described in the next section.

Specifying Subsets of the Data: in and if Qualifiers

Many Stata commands can be restricted to a subset of the data by adding an in or if
qualifier. (Qualifiers are also available for many menu selections: look for an if/in or by/if/in
tab along the top of the menu.) in specifies the observation numbers to which the command
applies. For example, list in 5 tells Stata to list only the 5th observation. To list the 1st
through 20th observations, type

. list in 1/20

The letter l denotes the last case, and -4, for example, the fourth-from-last. Thus, we could
list the four most populous Canadian places (which will include Canada itself) as follows:

. sort pop

. list place pop in -4/l

Note the important, although typographically subtle, distinction between 1 (number one, or
first observation) and l (letter "el," or last observation). The in qualifier works in a similar
way with most other analytical or data-editing commands. It always refers to the data as
presently sorted.


The if qualifier also has broad applications, but it selects observations based on specific
variable values. As noted, the observations in canadaO.dta include not only 12 Canadian
provinces or territories, but also Canada as a whole. For many purposes, we might want to
exclude Canada from analyses involving the 12 territories and provinces. One way to do so is
to restrict the analysis to only those places with populations below 20 million (20,000
thousand); that is, every place except Canada:
. summarize if pop < 20000

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       place |         0
         pop |        12    2467.158    3435.521       30.1    11100.3
       unemp |        10       12.26     4.44877          7       19.6
       mlife |        12      74.225    1.728965       70.2       75.8
       flife |        12    80.68333      1.0116         78       81.8

Compare this with the earlier summarize output to see how much has changed. The
previous mean of population, for example, was grossly misleading because it counted every
person twice.

The < (is less than) sign is one of six relational operators:

==      is equal to
!=      is not equal to ( ~= also works)
>       is greater than
<       is less than
>=      is greater than or equal to
<=      is less than or equal to

A double equals sign, " == ", denotes a logical test: "Is the value on the left side the same
as the value on the right?" To Stata, a single equals sign means something different: "Make
the value on the left side be the same as the value on the right." The single equals sign is not
a relational operator and cannot be used within if qualifiers. Single equals signs have other
meanings. They are used with commands that generate new variables, or replace the values of
old ones, according to algebraic expressions. Single equals signs also appear in certain
specialized applications such as weighting and hypothesis tests.
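Both signs can appear in one command. A brief sketch using this chapter's data (the variable big is hypothetical, not part of the Canadian dataset):

. generate big = 1 if pop >= 1000
. replace big = 0 if big == .

In the first command, the single = assigns the value 1 to the new variable big for observations meeting the if condition. In the second, the double == tests whether big is missing, while the single = assigns the replacement value 0.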

Any of these relational operators can be used to select observations based on their values
for numerical variables. Only two operators, == and !=, make sense with string variables.
To use string variables in an if qualifier, enclose the target value in double quotes. For
example, we could get a summary excluding Canada (leaving in the 12 provinces and
territories):

. summarize if place != "Canada"

Two or more relational operators can be combined within a single if expression by the
use of logical operators. Stata's logical operators are the following:

&       and
|       or (symbol is a vertical bar, not the number one or letter "el")
!       not ( ~ also works)


The Canadian territories (Yukon and Northwest) both have fewer than 100,000 people. To find
the mean unemployment and life expectancies for the 10 Canadian provinces only, excluding
both the smaller places (territories) and the largest (Canada), we could use this command:
. summarize unemp mlife flife if pop > 100 & pop < 20000

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       unemp |        10       12.26     4.44877          7       19.6
       mlife |        10       74.92    .6051633       73.9       75.8
       flife |        10       80.98     .586515       79.8       81.8


Parentheses allow us to specify the precedence among multiple operators. For example,
we might list all the places that either have unemployment below 9, or have life expectancies
of at least 75.4 for men and 81.4 for women:

. list if unemp < 9 | (mlife >= 75.4 & flife >= 81.4)

     +---------------------------------------------------+
     |            place      pop   unemp   mlife   flife |
     |---------------------------------------------------|
  8. |         Manitoba   1137.5     8.5      75    80.8 |
  9. |     Saskatchewan   1015.6       7    75.2    81.8 |
 10. |          Alberta     2747     8.4    75.5    81.4 |
 11. | British Columbia     3766     9.8    75.8    81.4 |
     +---------------------------------------------------+

A note of caution regarding missing values: Stata ordinarily shows missing values as a
period, but in some operations (notably sort and if, although not in statistical calculations
such as means or correlations), these same missing values are treated as if they were large
positive numbers. Watch what happens if we sort places from lowest to highest unemployment
rate, and then ask to see places with unemployment rates above 15%:

. sort unemp

. list if unemp > 15

     +--------------------------------------------------------+
     |                 place     pop   unemp   mlife    flife |
     |--------------------------------------------------------|
 10. |  Prince Edward Island   136.1    19.1    74.8     81.3 |
 11. |          Newfoundland   575.4    19.6    73.9     79.8 |
 12. |                 Yukon    30.1       .    71.3     80.4 |
 13. | Northwest Territories    65.8       .    70.2       78 |
     +--------------------------------------------------------+

The two places with missing unemployment rates were included among those "greater than 15."
In this instance the result is obvious, but with a larger dataset we might not notice. Suppose
that we were analyzing a political opinion poll. A command such as the following would
tabulate the variable vote not only for people with ages older than 65, as intended, but also for
any people whose age values were missing:

. tabulate vote if age > 65

Where missing values exist, we might have to deal with them explicitly as part of the if
expression.

. tabulate vote if age > 65 & age < .

A less-than inequality such as age < . is a general way to select observations with
nonmissing values. Stata permits up to 27 different missing value codes, although we are
using only the default "." here. The other 26 codes are represented internally as numbers
even larger than ".", so < . avoids them all. Type help missing for more details.
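The extended codes .a through .z can record different reasons for missingness. A hypothetical sketch in the spirit of the opinion-poll example (the survey codes 998 and 999 are invented for illustration):

. replace age = .a if age == 998
. replace age = .b if age == 999

Because .a and .b sort above the system missing value ".", a qualifier such as if age < . still excludes them all.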

The in and if qualifiers set observations aside temporarily so that a particular
command does not apply to them. These qualifiers have no effect on the data in memory, and
the next command will apply to all observations, unless it too has an in or if qualifier. To
drop variables from the data in memory, use the drop command. For example, to drop mlife
and flife from memory, type

. drop mlife flife

We can drop observations from memory by using either the in qualifier or the if
qualifier. Because we earlier sorted on unemp, the two territories occupy the 12th and 13th
positions in the data. Canada itself is 6th. One way to drop these three nonprovinces employs
the in qualifier; drop in 12/13 means "drop the 12th through the 13th observations."

. list

     +-----------------------------------------+
     |                 place       pop   unemp |
     |-----------------------------------------|
  1. |          Saskatchewan    1015.6       7 |
  2. |               Alberta      2747     8.4 |
  3. |              Manitoba    1137.5     8.5 |
  4. |               Ontario   11100.3     9.3 |
  5. |      British Columbia      3766     9.8 |
     |-----------------------------------------|
  6. |                Canada   29606.1    10.6 |
  7. |                Quebec    7334.2    13.2 |
  8. |         New Brunswick     760.1    13.8 |
  9. |           Nova Scotia     937.8    13.9 |
 10. |  Prince Edward Island     136.1    19.1 |
     |-----------------------------------------|
 11. |          Newfoundland     575.4    19.6 |
 12. |                 Yukon      30.1       . |
 13. | Northwest Territories      65.8       . |
     +-----------------------------------------+

. drop in 12/13
(2 observations deleted)

. drop in 6
(1 observation deleted)

The same change could have been accomplished through an if qualifier, with a command
that says "drop if place equals Canada or population is less than 100."

. drop if place == "Canada" | pop < 100
(3 observations deleted)

After dropping Canada, the territories, and the variables mlife and flife, we have the
following reduced dataset:

. list

     +----------------------------------------+
     |                place       pop   unemp |
     |----------------------------------------|
  1. |         Saskatchewan    1015.6       7 |
  2. |              Alberta      2747     8.4 |
  3. |             Manitoba    1137.5     8.5 |
  4. |              Ontario   11100.3     9.3 |
  5. |     British Columbia      3766     9.8 |
     |----------------------------------------|
  6. |               Quebec    7334.2    13.2 |
  7. |        New Brunswick     760.1    13.8 |
  8. |          Nova Scotia     937.8    13.9 |
  9. | Prince Edward Island     136.1    19.1 |
 10. |         Newfoundland     575.4    19.6 |
     +----------------------------------------+

We can also drop selected variables or observations through the Delete button in the Data
Editor.
   Rather than specifying which variables or observations to drop, it sometimes is simpler to
specify which to keep. The same reduced dataset could have been obtained as follows:

. keep place pop unemp

. keep if place != "Canada" & pop >= 100
(3 observations deleted)

Like any other changes to the data in memory, none of these reductions affect disk files
until we save the data. At that point, we have the option of writing over the old dataset
(save, replace), thereby destroying it, or just saving the newly modified dataset with a
new name (by choosing File - Save As, or by typing a command with the form save
newname) so that both versions exist on disk.

Generating and Replacing Variables

The generate and replace commands allow us to create new variables or change the
values of existing variables. For example, in Canada, as in most industrial societies, women
tend to live longer than men. To analyze regional variations in this gender gap, we might
retrieve dataset canada1.dta and generate a new variable equal to female life expectancy
(flife) minus male life expectancy (mlife). Note that in the generate statement (unlike
if qualifiers) we use a single equals sign.

. use canada1, clear
(Canadian dataset 1)

. generate gap = flife - mlife

. label variable gap "Female-male gap life expectancy"
. describe

Contains data from C:\data\canada1.dta
  obs:            13                          Canadian dataset 1
 vars:             6                          3 Jul 2005 10:48
 size:           595 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
---------------------------------------------------------------------
place           str21  %21s                   Place name
pop             float  %9.0g                  Population in 1000s, 1995
unemp           float  %9.0g                  % 15+ population unemployed, 1995
mlife           float  %9.0g                  Male life expectancy years
flife           float  %9.0g                  Female life expectancy years
gap             float  %9.0g                  Female-male gap life expectancy
---------------------------------------------------------------------
Sorted by:

. list place flife mlife gap

     +--------------------------------------------------+
     |                 place   flife   mlife        gap |
     |--------------------------------------------------|
  1. |                Canada    81.1    75.1          6 |
  2. |          Newfoundland    79.8    73.9   5.900002 |
  3. |  Prince Edward Island    81.3    74.8        6.5 |
  4. |           Nova Scotia    80.4    74.2   6.200005 |
  5. |         New Brunswick    80.6    74.8   5.799995 |
     |--------------------------------------------------|
  6. |                Quebec    81.2    74.5   6.699997 |
  7. |               Ontario    81.1    75.5   5.599998 |
  8. |              Manitoba    80.8      75   5.800003 |
  9. |          Saskatchewan    81.8    75.2   6.600006 |
 10. |               Alberta    81.4    75.5   5.900002 |
     |--------------------------------------------------|
 11. |      British Columbia    81.4    75.8   5.599998 |
 12. |                 Yukon    80.4    71.3   9.099998 |
 13. | Northwest Territories      78    70.2   7.800003 |
     +--------------------------------------------------+

For the province of Newfoundland, the true value of gap should be 79.8 - 73.9 = 5.9 years,
but the output shows this value as 5.900002 instead. Like all computer programs, Stata stores
numbers in binary form, and 5.9 has no exact binary representation. The small inaccuracies that
arise from approximating decimal fractions in binary are unlikely to affect statistical
calculations much, because calculations are done in double precision (8 bytes per number).
They appear disconcerting in data lists, however. We can change the display format so that
Stata shows only a rounded-off version. The following command specifies a fixed display
format four numerals wide, with one digit to the right of the decimal:
. format gap %4.1f

Now the display shows 5.9; however, a command such as the following will return no
observations:

. list if gap == 5.9

This occurs because Stata believes the value does not exactly equal 5.9. (More technically,
Stata stores gap values in single precision but does all calculations in double, and the single-
and double-precision approximations of 5.9 are not identical.)
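One common workaround, not shown in this chapter's session, is Stata's float() function, which rounds a double-precision constant to single precision before the comparison. Assuming gap is stored as a float, the following should match Newfoundland's value:

. list if gap == float(5.9)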
Display formats, as well as variable names and labels, can also be changed by double-
clicking on a column in the Data Editor. Fixed numeric formats such as %4.1f are one of
the three most common numeric display format types. These are

%w.dg   General numeric format, where w specifies the total width or number of columns
        displayed and d the minimum number of digits that must follow the decimal
        point. Exponential notation (such as 1.00e+07, meaning 1.00 x 10^7 or 10 million)
        and shifts in the decimal-point position will be used automatically as needed, to
        display values in an optimal (but varying) fashion.
%w.df   Fixed numeric format, where w specifies the total width or number of columns
        displayed and d the fixed number of digits that must follow the decimal point.
%w.de   Exponential numeric format, where w specifies the total width or number of
        columns displayed and d the fixed number of digits that must follow the decimal
        point.


For example, as we saw in Table 2.1, the 1995 population of Canada was approximately
29,606,100 people, and the Yukon Territory population was 30,100. Below we see how these
two numbers appear under several different display formats:

format      Canada         Yukon
%9.0g       2.96e+07       30100
%9.1f       29606100.0     30100.0
%12.5e      2.96061e+07    3.01000e+04

Although the displayed values look different, their internal values are identical. Statistical
calculations remain unaffected by display formats. Other numerical display formatting options
include the use of commas, left- and right-justification, or leading zeroes. There also exist
special formats for dates, time-series variables, and string variables. Type help format
for more information.
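As one instance of the comma option, a c suffix on a fixed format inserts comma separators. A small sketch (the width and digits here are arbitrary choices for illustration):

. format pop %12.0fc
. list place pop in 1/2

With this format, Canada's population in thousands would display as 29,606 rather than 29606.1.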

replace can make the same sorts of calculations as generate, but it changes values
of an existing variable instead of creating a new variable. For example, the variable pop in our
dataset gives population in thousands. To convert this to simple population, we just multiply
(" * " means multiply) all values by 1,000:

. replace pop = pop * 1000

replace can make such wholesale changes, or it can be used with in or if qualifiers
to selectively edit the data. To illustrate, suppose that we had questionnaire data with variables
including age and year born (born). A command such as the following would correct one or
more typos where a subject's age had been incorrectly typed as 229 instead of 29:

. replace age = 29 if age == 229

Alternatively, the following command could correct an error in the value of age for observation
number 1453:

. replace age = 29 in 1453

For a more complicated example,

. replace age = 2005 - born if age >= . | age < 2005 - born

This replaces values of variable age with 2005 minus the year of birth if age is missing or if the
reported age is less than 2005 minus the year of birth.
generate and replace provide tools to create categorical variables as well. We
noted earlier that our Canadian dataset includes several types of observations: 2 territories, 10
provinces, and 1 country combining them all. Although in and if qualifiers allow us to
separate these, and drop can eliminate observations from the data, it might be most
convenient to have a categorical variable that indicates the observation's "type." The following
example shows one way to create such a variable. We start by generating type as a constant
equal to 1 for each observation. Next, we replace this with the value 2 for the Yukon and
Northwest Territories, and with 3 for Canada. The final steps involve labeling new variable
type and defining labels for values 1, 2, and 3.

. use canada1, clear
(Canadian dataset 1)

. generate type = 1

Statistics with Stata

. replace type = 2 if place == "Yukon" | place == "Northwest Territories"
(2 real changes made)

. replace type = 3 if place == "Canada"
(1 real change made)

. label variable type "Province, territory or nation"
. label define typelbl 1 "Province" 2 "Territory" 3 "Nation"
. label values type typelbl
. list place flife mlife gap type

     +--------------------------------------------------------------+
     |                 place   flife   mlife        gap        type |
     |--------------------------------------------------------------|
  1. |                Canada    81.1    75.1          6      Nation |
  2. |          Newfoundland    79.8    73.9   5.900002    Province |
  3. |  Prince Edward Island    81.3    74.8        6.5    Province |
  4. |           Nova Scotia    80.4    74.2   6.200005    Province |
  5. |         New Brunswick    80.6    74.8   5.799995    Province |
     |--------------------------------------------------------------|
  6. |                Quebec    81.2    74.5   6.699997    Province |
  7. |               Ontario    81.1    75.5   5.599998    Province |
  8. |              Manitoba    80.8      75   5.800003    Province |
  9. |          Saskatchewan    81.8    75.2   6.600006    Province |
 10. |               Alberta    81.4    75.5   5.900002    Province |
     |--------------------------------------------------------------|
 11. |      British Columbia    81.4    75.8   5.599998    Province |
 12. |                 Yukon    80.4    71.3   9.099998   Territory |
 13. | Northwest Territories      78    70.2   7.800003   Territory |
     +--------------------------------------------------------------+
As illustrated, labeling the values of a categorical variable requires two commands. The
label define command specifies what labels go with what numbers. The label
values command specifies to which variable these labels apply. One set of labels (created
through one label define command) can apply to any number of variables (that is, be
referenced in any number of label values commands). Value labels can have up to
32,000 characters, but work best for most purposes if they are not too long.

generate can create new variables, and replace can produce new values, using any
mixture of old variables, constants, random values, and expressions. For numeric variables, the
following arithmetic operators apply:
+    add
-    subtract
*    multiply
/    divide
^    raise to power

Parentheses will control the order of calculation. Without them, the ordinary rules of
precedence apply. Of the arithmetic operators, only addition, "+", works with string variables,
where it connects two string values into one.
Although their purposes differ, generate and replace have similar syntax. Either
can use any mathematically or logically feasible combination of Stata operators and in or
if qualifiers. These commands can also employ Stata’s broad array of special functions,
introduced in the following section.

Data Management


Using Functions
This section lists many of the functions available for use with generate or replace.
For example, we could create a new variable named loginc, equal to the natural logarithm of
income, by using the natural log function ln within a generate command:
. generate loginc = ln(income)

ln is one of Stata's mathematical functions. These functions are as follows:
abs(x)          Absolute value of x.
acos(x)         Arc-cosine returning radians. Because 360 degrees = 2*pi
                radians, acos(x)*180/_pi gives the arc-cosine returning
                degrees (_pi denotes the mathematical constant pi).
asin(x)         Arc-sine returning radians.
atan(x)         Arc-tangent returning radians.
atan2(y,x)      Two-argument arc-tangent returning radians.
atanh(x)        Arc-hyperbolic tangent returning radians.
ceil(x)         Integer n such that n-1 < x <= n.
cloglog(x)      Complementary log-log of x: ln(-ln(1-x)).
comb(n,k)      	Combinatorial function (number of possible combinations of
                n things taken k at a time).
cos(x)          Cosine of radians. To find the cosine of y degrees, type
                . generate x = cos(y*_pi/180)
digamma(x)      d ln Gamma(x) / dx
exp(x)          Exponential (e to the power x).
floor(x)        Integer n such that n <= x < n+1.
trunc(x)        Integer obtained by truncating x towards zero.
invcloglog(x)   Inverse of the complementary log-log: 1 - exp(-exp(x)).
invlogit(x)     Inverse of logit of x: exp(x) / (1 + exp(x)).
ln(x)           Natural (base e) logarithm. For any other base number B, to
                find the base B logarithm of x, type
                . generate y = ln(x)/ln(B)
lnfactorial(x)  Natural log of factorial. To find x factorial, type
                . generate y = round(exp(lnfactorial(x)),1)
lngamma(x)      Natural log of Gamma(x). To find Gamma(x), type
                . generate y = exp(lngamma(x))
log(x)          Natural logarithm; same as ln(x).
log10(x)        Base 10 logarithm.
logit(x)        Log of odds ratio of x: ln(x/(1-x)).
max(x1,x2,...,xn)   Maximum of x1, x2, ..., xn.
min(x1,x2,...,xn)   Minimum of x1, x2, ..., xn.


mod(x,y)        Modulus of x with respect to y.
reldif(x,y)     Relative difference: |x - y| / (|y| + 1).
round(x)        Round x to nearest whole number.
round(x,y)      Round x in units of y.
sign(x)         -1 if x < 0, 0 if x = 0, +1 if x > 0.
sin(x)          Sine of radians.
sqrt(x)         Square root.
sum(x)          Running sum of x (also see help egen).
tan(x)          Tangent of radians.
tanh(x)         Hyperbolic tangent of x.
trigamma(x)     d2 ln Gamma(x) / dx2
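Several of the functions above come in mutually inverse pairs. As a quick cross-check on the definitions, the logit, invlogit, cloglog, and invcloglog formulas can be written out in a few lines of Python (shown purely to illustrate the arithmetic; this is not Stata code):

```python
import math

def logit(x):
    # log of the odds ratio: ln(x / (1 - x))
    return math.log(x / (1 - x))

def invlogit(x):
    # inverse logit: exp(x) / (1 + exp(x))
    return math.exp(x) / (1 + math.exp(x))

def cloglog(x):
    # complementary log-log: ln(-ln(1 - x))
    return math.log(-math.log(1 - x))

def invcloglog(x):
    # inverse complementary log-log: 1 - exp(-exp(x))
    return 1 - math.exp(-math.exp(x))

# Each inverse undoes its partner:
p = 0.3
assert abs(invlogit(logit(p)) - p) < 1e-12
assert abs(invcloglog(cloglog(p)) - p) < 1e-12
```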

Many probability functions exist as well, and are listed below. Consult help probfun
and the reference manuals for important details, including definitions, constraints on
parameters, and the treatment of missing values.
betaden(a,b,x)      Probability density of the beta distribution.
Binomial(n,k,p)     Probability of k or more successes in n trials when the
                    probability of a success on a single trial is p.
binormal(h,k,r)     Joint cumulative distribution of bivariate normal with
                    correlation r.
chi2(n,x)           Cumulative chi-squared distribution with n degrees of
                    freedom.
chi2tail(n,x)       Reverse cumulative (upper-tail, survival) chi-squared
                    distribution with n degrees of freedom.
                    chi2tail(n,x) = 1 - chi2(n,x)
dgammapda(a,x)      Partial derivative of the cumulative gamma distribution
                    gammap(a,x) with respect to a.
dgammapdx(a,x)      Partial derivative of the cumulative gamma distribution
                    gammap(a,x) with respect to x.
dgammapdada(a,x)    2nd partial derivative of the cumulative gamma
                    distribution gammap(a,x) with respect to a.
dgammapdadx(a,x)    2nd partial derivative of the cumulative gamma
                    distribution gammap(a,x) with respect to a and x.
dgammapdxdx(a,x)    2nd partial derivative of the cumulative gamma
                    distribution gammap(a,x) with respect to x.
F(n1,n2,f)          Cumulative F distribution with n1 numerator and n2
                    denominator degrees of freedom.
Fden(n1,n2,f)       Probability density function for the F distribution with
                    n1 numerator and n2 denominator degrees of freedom.
Ftail(n1,n2,f)      Reverse cumulative (upper-tail, survival) F distribution
                    with n1 numerator and n2 denominator degrees of freedom.
                    Ftail(n1,n2,f) = 1 - F(n1,n2,f)


gammaden(a,b,g,x)   Probability density function for the gamma family, where
                    gammaden(a,1,0,x) is the probability density function for
                    the cumulative gamma distribution gammap(a,x).
gammap(a,x)         Cumulative gamma distribution for a; also known as the
                    incomplete gamma function.
ibeta(a,b,x)        Cumulative beta distribution for a, b; also known as the
                    incomplete beta function.
invbinomial(n,k,P)  Inverse binomial. For P <= 0.5, probability p such that
                    the probability of observing k or more successes in n
                    trials is P; for P > 0.5, probability p such that the
                    probability of observing k or fewer successes in n trials
                    is 1 - P.
invchi2(n,p)        Inverse of chi2(). If chi2(n,x) = p,
                    then invchi2(n,p) = x
invchi2tail(n,p)    Inverse of chi2tail(). If chi2tail(n,x) = p,
                    then invchi2tail(n,p) = x
invF(n1,n2,p)       Inverse cumulative F distribution. If F(n1,n2,f) = p,
                    then invF(n1,n2,p) = f
invFtail(n1,n2,p)   Inverse reverse cumulative F distribution.
                    If Ftail(n1,n2,f) = p, then invFtail(n1,n2,p) = f
invgammap(a,p)      Inverse cumulative gamma distribution.
                    If gammap(a,x) = p, then invgammap(a,p) = x
invibeta(a,b,p)     Inverse cumulative beta distribution.
                    If ibeta(a,b,x) = p, then invibeta(a,b,p) = x
invnchi2(n,L,p)     Inverse cumulative noncentral chi-squared distribution.
                    If nchi2(n,L,x) = p, then invnchi2(n,L,p) = x
invnFtail(n1,n2,L,p)  Inverse reverse cumulative noncentral F distribution.
                    If nFtail(n1,n2,L,f) = p, then invnFtail(n1,n2,L,p) = f
invnibeta(a,b,L,p)  Inverse cumulative noncentral beta distribution.
                    If nibeta(a,b,L,x) = p, then invnibeta(a,b,L,p) = x
invnormal(p)        Inverse cumulative standard normal distribution.
                    If normal(z) = p, then invnormal(p) = z
invttail(n,p)       Inverse reverse cumulative Student's t distribution.
                    If ttail(n,t) = p, then invttail(n,p) = t
nbetaden(a,b,L,x)   Noncentral beta density with shape parameters a, b, and
                    noncentrality parameter L.
nchi2(n,L,x)        Cumulative noncentral chi-squared distribution with n
                    degrees of freedom and noncentrality parameter L.
nFden(n1,n2,L,x)    Noncentral F density with n1 numerator and n2 denominator
                    degrees of freedom, noncentrality parameter L.
nFtail(n1,n2,L,x)   Reverse cumulative (upper-tail, survival) noncentral F
                    distribution with n1 numerator and n2 denominator degrees
                    of freedom, noncentrality parameter L.


nibeta(a,b,L,x)     Cumulative noncentral beta distribution with shape
                    parameters a and b, and noncentrality parameter L.
normal(z)           Cumulative standard normal distribution.
normalden(z)        Standard normal density, mean 0 and standard deviation 1.
normalden(z,s)      Normal density, mean 0 and standard deviation s.
normalden(x,m,s)    Normal density, mean m and standard deviation s.
npnchi2(n,x,p)      Noncentrality parameter L for the noncentral cumulative
                    chi-squared distribution. If nchi2(n,L,x) = p,
                    then npnchi2(n,x,p) = L
tden(n,t)           Probability density function of Student's t distribution
                    with n degrees of freedom.
ttail(n,t)          Reverse cumulative (upper-tail) Student's t distribution
                    with n degrees of freedom. This function returns
                    probability T > t.
uniform()           Pseudo-random number generator, returning values from a
                    uniform distribution theoretically ranging from 0 to
                    nearly 1, written [0,1).

Nothing goes inside the parentheses with uniform(). Optionally, we can control the
generator's starting seed, and hence the stream of "random" numbers, by first
issuing a set seed # command — where # could be any integer from 0 to 2^31 - 1,
inclusive. Omitting the set seed command corresponds to set seed 123456789,
which will always produce the same stream of numbers.
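The role a seed plays can be illustrated in Python (Python's Mersenne Twister generator is used here purely for illustration; it is not Stata's generator, and the seed value simply echoes the default mentioned above):

```python
import random

# Seeding makes the "random" stream reproducible -- the same idea as
# issuing `set seed #` in Stata before drawing with uniform().
random.seed(123456789)
first = [random.random() for _ in range(3)]

random.seed(123456789)          # reset to the same seed...
second = [random.random() for _ in range(3)]

assert first == second          # ...and the stream repeats exactly
assert all(0 <= u < 1 for u in first)   # uniform draws lie in [0, 1)
```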

Stata provides more than 40 functions for working with dates and date-related values. A
listing can be found in the reference manuals or by typing help dates. Below are some of
the elapsed-date functions. "Elapsed dates" count the days from January 1, 1960.

date(s1,s2[,y])  Elapsed date corresponding to s1. s1 is a string variable
                 indicating the date in virtually any format. Months can be
                 spelled out, abbreviated to three characters, or given as
                 numbers; years can include or exclude the century; blanks
                 and punctuation are allowed. s2 is any permutation of m, d,
                 and [##]y, with their order defining the order that month,
                 day, and year occur in s1. ## gives the century for
                 two-digit years in s1; the default is 19y.
d(l)             A date literal convenience function. For example, typing
                 d(2jan1960) is equivalent to typing 1.
mdy(m,d,y)       Elapsed date corresponding to m, d, and y.
day(e)           Numeric day of the month corresponding to e, the elapsed date.
month(e)         Numeric month corresponding to e, the elapsed date.
year(e)          Numeric year corresponding to e, the elapsed date.
dow(e)           Numeric day of the week corresponding to e, the elapsed date.
doy(e)           Numeric day of the year corresponding to e, the elapsed date.
week(e)          Numeric week of the year corresponding to e, the elapsed date.
quarter(e)       Numeric quarter of the year corresponding to e, the elapsed date.
halfyear(e)      Numeric half of the year corresponding to e, the elapsed date.
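The elapsed-date bookkeeping is easy to reproduce with Python's datetime module (an illustration of the day-counting convention only, not Stata code):

```python
from datetime import date

# Stata's "elapsed dates" count days from January 1, 1960 (day 0).
EPOCH = date(1960, 1, 1)

def mdy(m, d, y):
    # days since 1 Jan 1960, like Stata's mdy(m,d,y)
    return (date(y, m, d) - EPOCH).days

assert mdy(1, 1, 1960) == 0      # the epoch itself
assert mdy(1, 2, 1960) == 1      # matches d(2jan1960) = 1 above
assert mdy(12, 31, 1959) == -1   # dates before 1960 are negative
```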


Some useful special functions include the following:

autocode(x,n,xmin,xmax)  Forms categories from x by partitioning the interval
                from xmin to xmax into n equal-length intervals and returning
                the upper bound of the interval that contains x.
cond(x,a,b)     Returns a if x evaluates to "true" and b if x evaluates to
                "false." For example,
                . generate y = cond(inc1 > inc2, inc1, inc2)
                creates the variable y as the maximum of inc1 and inc2
                (assuming neither is missing).
group(x)        Creates a categorical variable that divides the data as
                presently sorted into x subsamples that are as nearly
                equal-sized as possible.
trunc(x)        Returns the integer obtained by truncating (dropping
                fractional parts of) x.
max(x1,x2,...,xn)  Returns the maximum of x1, x2, ..., xn. Missing values
                are ignored. For example, max(3+2,1) evaluates to 5.
min(x1,x2,...,xn)  Returns the minimum of x1, x2, ..., xn.
recode(x,x1,x2,...,xn)  Returns missing if x is missing, x1 if x <= x1,
                x2 if x <= x2, and so on.
round(x,y)      Returns x rounded to the nearest y.
sign(x)         Returns -1 if x < 0, 0 if x = 0, and +1 if x > 0 (missing
                if x is missing).
sum(x)          Returns the running sum of x, treating missing values as zero.
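The autocode definition above is concrete enough to sketch in Python (an illustrative reconstruction of the rule stated in the table, not Stata's internal code; the cutpoints 10, 15, 20 anticipate the unemp example later in this chapter):

```python
# autocode(x, n, xmin, xmax): split [xmin, xmax] into n equal-length
# intervals and return the upper bound of the interval containing x.
def autocode(x, n, xmin, xmax):
    width = (xmax - xmin) / n
    for i in range(1, n + 1):
        upper = xmin + i * width
        if x <= upper:
            return upper
    return xmax

# With 3 equal groups over [5, 20], the cutpoints are 10, 15, and 20:
assert autocode(7.0, 3, 5, 20) == 10.0
assert autocode(13.2, 3, 5, 20) == 15.0
assert autocode(19.6, 3, 5, 20) == 20.0
```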

String functions, not described here, help to manipulate and evaluate string variables. Type
help strfun for a complete list of string functions. The reference manuals and User's
Guide give examples and details of these and other functions.
Multiple functions, operators, and qualifiers can be combined in one command as needed.
The functions and algebraic operators just described can also be used in another way that does
not create or change any dataset variables. The display command performs a single
calculation and shows the results onscreen. For example:
. display 2+3
5

. display log10(10^83)
83

. display invttail(120,.025)*34.1/sqrt(975)
2.1622305

Thus, display works as an onscreen statistical calculator.

Unlike a calculator, display, generate, and replace have direct access to
Stata's statistical results. For example, suppose that we summarized the unemployment rates
from dataset canada1.dta:
. summarize unemp

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       unemp |        11    12.10909    4.250048          7       19.6

After summarize, Stata temporarily stores the mean as a macro named r(mean).


. display r(mean)
12.109091

We could use this result to create variable unempDEV, defined as deviations from the mean:
. gen unempDEV = unemp - r(mean)
(2 missing values generated)

. summ unemp unempDEV

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       unemp |        11    12.10909    4.250048          7       19.6
    unempDEV |        11    4.33e-08    4.250048  -5.109091    7.49091
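The arithmetic behind unempDEV is plain mean-deviation; here it is reproduced in Python using the eleven non-missing unemployment rates from canada1.dta (illustration only, not Stata code):

```python
from statistics import mean

# Unemployment rates for Canada plus the 10 provinces (the two
# territories are missing on unemp), as listed in this chapter.
unemp = [10.6, 19.6, 19.1, 13.9, 13.8, 13.2, 9.3, 8.5, 7.0, 8.4, 9.8]

m = mean(unemp)
dev = [u - m for u in unemp]      # like: gen unempDEV = unemp - r(mean)

assert abs(m - 12.10909) < 1e-4
assert abs(sum(dev)) < 1e-9       # deviations from the mean sum to ~0
assert abs(min(dev) - (7.0 - m)) < 1e-9   # min deviation = -5.109091
```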

Stata also provides another variable-creation command, egen ("extensions to
generate"), which has its own set of functions to accomplish tasks not easily done by
generate. These include such things as creating new variables from the sums, maxima,
minima, medians, interquartile ranges, standardized values, or moving averages of existing
variables or expressions. For example, the following command creates a new variable named
zscore, equal to the standardized (mean 0, variance 1) values of x:
. egen zscore = std(x)

Or, the following command creates new variable avg, equal to the row mean of each
observation's values on x, y, z, and w, ignoring any missing values.
. egen avg = rowmean(x,y,z,w)

To create a new variable named sum, equal to the row sum of each observation's values on x,
y, z, and w, treating missing values as zeroes, type
. egen sum = rowsum(x,y,z,w)

The following command creates new variable xrank, holding ranks corresponding to values of
x: xrank = 1 for the observation with the highest x, xrank = 2 for the second highest, and so forth.
. egen xrank = rank(x)

Consult help egen for a complete list of egen functions, or the reference manuals for
further examples.
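The standardization that egen's std() function performs is simple to write out by hand; here is an illustrative Python sketch (not Stata code; x is a made-up variable, and the sample standard deviation is computed explicitly):

```python
from statistics import mean

# Standardize: subtract the mean, divide by the sample SD.
x = [2.0, 4.0, 6.0, 8.0]
m = mean(x)
n = len(x)
sd = (sum((v - m) ** 2 for v in x) / (n - 1)) ** 0.5   # sample SD

zscore = [(v - m) / sd for v in x]

assert abs(mean(zscore)) < 1e-12                 # mean ~ 0
var = sum(z * z for z in zscore) / (n - 1)
assert abs(var - 1.0) < 1e-12                    # variance ~ 1
```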

Converting between Numeric and String Formats
Dataset canada2.dta contains one string variable, place. It also has a labeled categorical
variable, type. Both seem to have nonnumerical values.
. use canada2, clear
(Canadian dataset 2)
. list place type
     +----------------------------------+
     |                 place       type |
     |----------------------------------|
  1. |                Canada     Nation |
  2. |          Newfoundland   Province |
  3. |  Prince Edward Island   Province |
  4. |           Nova Scotia   Province |
  5. |         New Brunswick   Province |
     |----------------------------------|
  6. |                Quebec   Province |
  7. |               Ontario   Province |
  8. |              Manitoba   Province |
  9. |          Saskatchewan   Province |
 10. |               Alberta   Province |
     |----------------------------------|
 11. |      British Columbia   Province |
 12. |                 Yukon  Territory |
 13. | Northwest Territories  Territory |
     +----------------------------------+

Beneath the labels, however, type remains a numeric variable, as we can see if we ask for the
nolabel option:
. list place type, nolabel

     +------------------------------+
     |                 place   type |
     |------------------------------|
  1. |                Canada      3 |
  2. |          Newfoundland      1 |
  3. |  Prince Edward Island      1 |
  4. |           Nova Scotia      1 |
  5. |         New Brunswick      1 |
     |------------------------------|
  6. |                Quebec      1 |
  7. |               Ontario      1 |
  8. |              Manitoba      1 |
  9. |          Saskatchewan      1 |
 10. |               Alberta      1 |
     |------------------------------|
 11. |      British Columbia      1 |
 12. |                 Yukon      2 |
 13. | Northwest Territories      2 |
     +------------------------------+

String and labeled numeric variables look similar when listed, but they behave differently
when analyzed. Most statistical operations and algebraic relations are not defined for string
variables, so we might want to have both string and labeled-numeric versions of the same
information in our data. The encode command generates a labeled-numeric variable from
a string variable. The number 1 is given to the alphabetically first value of the string variable,
2 to the second, and so on. In the following example, we create a labeled numeric variable
named placenum from the string variable place:
. encode place, gen(placenum)

The opposite conversion is possible, too: The decode command generates a string
variable from the values of a labeled numeric variable. Here we create string variable typestr
from labeled numeric variable type:
. decode type, gen(typestr)
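The alphabetical-coding rule behind encode can be sketched in Python (illustration of the idea only, not Stata code; the short places list is a made-up example):

```python
# encode: assign 1 to the alphabetically first string value, 2 to the
# next, and so on, remembering the labels.
places = ["Canada", "Alberta", "Yukon", "Alberta"]

labels = sorted(set(places))                  # alphabetical label list
codes = {s: i + 1 for i, s in enumerate(labels)}
placenum = [codes[s] for s in places]

assert labels == ["Alberta", "Canada", "Yukon"]
assert placenum == [2, 1, 3, 1]

# decode reverses the mapping, code -> label:
decoded = [labels[c - 1] for c in placenum]
assert decoded == places
```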

When listed, the new numeric variable placenum, and the new string variable typestr, look
similar to the originals:
. list place placenum type typestr

     +----------------------------------------------------------------------------+
     |                 place                placenum        type      typestr |
     |------------------------------------------------------------------------|
  1. |                Canada                  Canada      Nation       Nation |
  2. |          Newfoundland            Newfoundland    Province     Province |
  3. |  Prince Edward Island    Prince Edward Island    Province     Province |
  4. |           Nova Scotia             Nova Scotia    Province     Province |
  5. |         New Brunswick           New Brunswick    Province     Province |
     |------------------------------------------------------------------------|
  6. |                Quebec                  Quebec    Province     Province |
  7. |               Ontario                 Ontario    Province     Province |
  8. |              Manitoba                Manitoba    Province     Province |
  9. |          Saskatchewan            Saskatchewan    Province     Province |
 10. |               Alberta                 Alberta    Province     Province |
     |------------------------------------------------------------------------|
 11. |      British Columbia        British Columbia    Province     Province |
 12. |                 Yukon                   Yukon   Territory    Territory |
 13. | Northwest Territories   Northwest Territories   Territory    Territory |
     +------------------------------------------------------------------------+

But with the nolabel option, the differences become visible. Stata views placenum and
type basically as numbers.
. list place placenum type typestr, nolabel

     +---------------------------------------------------------------------+
     |                 place               placenum   type      typestr |
     |-------------------------------------------------------------------|
  1. |                Canada   3.00000000000000e+00      3       Nation |
  2. |          Newfoundland   6.00000000000000e+00      1     Province |
  3. |  Prince Edward Island   1.00000000000000e+01      1     Province |
  4. |           Nova Scotia   8.00000000000000e+00      1     Province |
  5. |         New Brunswick   5.00000000000000e+00      1     Province |
     |-------------------------------------------------------------------|
  6. |                Quebec   1.10000000000000e+01      1     Province |
  7. |               Ontario   9.00000000000000e+00      1     Province |
  8. |              Manitoba   4.00000000000000e+00      1     Province |
  9. |          Saskatchewan   1.20000000000000e+01      1     Province |
 10. |               Alberta   1.00000000000000e+00      1     Province |
     |-------------------------------------------------------------------|
 11. |      British Columbia   2.00000000000000e+00      1     Province |
 12. |                 Yukon   1.30000000000000e+01      2    Territory |
 13. | Northwest Territories   7.00000000000000e+00      2    Territory |
     +-------------------------------------------------------------------+

Statistical analyses, such as finding means and standard deviations, work only with
basically numeric variables. For calculation purposes, numeric variables' labels do not matter.
. summarize place placenum type typestr

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       place |         0
    placenum |        13           7     3.89444          1         13
        type |        13    1.307692    .6304252          1          3
     typestr |         0

Occasionally we encounter a string variable where the values are all or mostly numbers.
To convert these string values into their numerical counterparts, use the real function. For
example, the variable siblings below is a string variable, although it only has one value, "4 or
more," that could not be represented just as easily by a number.


. describe siblings

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
siblings        str9   %9s                    Number of siblings (string)

. list

     +-----------+
     |  siblings |
     |-----------|
  1. |         0 |
  2. |         1 |
  3. |         2 |
  4. |         3 |
  5. | 4 or more |
     +-----------+

. generate sibnum = real(siblings)
(1 missing value generated)

The new variable sibnum is numeric, with a missing value where siblings had “4 or more.”
. list

     +--------------------+
     |  siblings   sibnum |
     |--------------------|
  1. |         0        0 |
  2. |         1        1 |
  3. |         2        2 |
  4. |         3        3 |
  5. | 4 or more        . |
     +--------------------+
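The behavior of real() — convert what can be converted, return missing otherwise — is easy to mimic in Python (illustration only; the real helper and the use of None for Stata's missing "." are this sketch's conventions, not Stata code):

```python
# Convert string values to numbers, yielding missing (None here)
# where conversion fails, as Stata's real() does.
def real(s):
    try:
        return float(s)
    except ValueError:
        return None   # Stata would store missing (.)

siblings = ["0", "1", "2", "3", "4 or more"]
sibnum = [real(s) for s in siblings]

assert sibnum == [0.0, 1.0, 2.0, 3.0, None]   # "4 or more" -> missing
```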

The destring command provides a more flexible method for converting string
variables to numeric. In the example above, we could have accomplished the same thing by
typing
. destring siblings, generate(sibnum) force

See help destring for information about syntax and options.

Creating New Categorical and Ordinal Variables
A previous section illustrated how to construct a categorical variable called type to distinguish
among territories, provinces, and nation in our Canadian dataset. You can create categorical
or ordinal variables in many other ways. This section gives a few examples.
type has three categories:
. tabulate type

  Province, |
  territory |
  or nation |      Freq.     Percent        Cum.
------------+-----------------------------------
   Province |         10       76.92       76.92
  Territory |          2       15.38       92.31
     Nation |          1        7.69      100.00
------------+-----------------------------------
      Total |         13      100.00

For some purposes, we might want to re-express a multicategory variable as a set of
dichotomies or "dummy variables," each coded 0 or 1. tabulate will create dummy
variables automatically if we add the generate option. In the following example, this
results in a set of variables called type1, type2, and type3, each representing one of the three
categories of type:
. tabulate type, generate(type)
Province,
|
territory or|
nation
|
Freq.
Percent

Province I
Territory |
Nation |

10
2
1

76. 92
15.38
7.69

Total |

13

100.00

Cum.

76.92
92.31
100.00

. describe

Contains data from C:\data\canada2.dta
  obs:            13                          Canadian dataset 2
 vars:            10                          3 Jul 2005 10:48
 size:           637 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
place           str21  %21s                   Place name
pop             float  %9.0g                  Population in 1000s, 1995
unemp           float  %9.0g                  % 15+ population unemployed, 1995
mlife           float  %9.0g                  Male life expectancy years
flife           float  %9.0g                  Female life expectancy years
gap             float  %9.0g                  Female-male gap life expectancy
type            byte   %9.0g       typelbl    Province, territory or nation
type1           byte   %8.0g                  type==Province
type2           byte   %8.0g                  type==Territory
type3           byte   %8.0g                  type==Nation
-------------------------------------------------------------------------------
Sorted by:
     Note:  dataset has changed since last saved

. list place type type1-type3

     +------------------------------------------------------------+
     |                 place        type   type1   type2   type3 |
     |------------------------------------------------------------|
  1. |                Canada      Nation       0       0       1 |
  2. |          Newfoundland    Province       1       0       0 |
  3. |  Prince Edward Island    Province       1       0       0 |
  4. |           Nova Scotia    Province       1       0       0 |
  5. |         New Brunswick    Province       1       0       0 |
     |------------------------------------------------------------|
  6. |                Quebec    Province       1       0       0 |
  7. |               Ontario    Province       1       0       0 |
  8. |              Manitoba    Province       1       0       0 |
  9. |          Saskatchewan    Province       1       0       0 |
 10. |               Alberta    Province       1       0       0 |
     |------------------------------------------------------------|
 11. |      British Columbia    Province       1       0       0 |
 12. |                 Yukon   Territory       0       1       0 |
 13. | Northwest Territories   Territory       0       1       0 |
     +------------------------------------------------------------+
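The dummy coding that tabulate's generate() option produces can be sketched in Python, using the category distribution from this dataset (an illustration of the coding scheme only, not Stata code):

```python
# One 0/1 indicator ("dummy") per category of type.
types = ["Nation"] + ["Province"] * 10 + ["Territory"] * 2

type1 = [1 if t == "Province"  else 0 for t in types]
type2 = [1 if t == "Territory" else 0 for t in types]
type3 = [1 if t == "Nation"    else 0 for t in types]

assert sum(type1) == 10 and sum(type2) == 2 and sum(type3) == 1
# Every observation falls in exactly one category:
assert all(a + b + c == 1 for a, b, c in zip(type1, type2, type3))
```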

Re-expressing categorical information as a set of dummy variables involves no loss of
information; in this example, type1 through type3 together tell us exactly as much as type itself
does. Occasionally, however, analysts choose to re-express a measurement variable in
categorical or ordinal form, even though this does result in a substantial loss of information.
For example, unemp in canada2.dta gives a measure of the unemployment rate. Excluding
Canada itself from the data, we see that unemp ranges from 7% to 19.6%, with a mean of 12.26:
. summarize unemp if type != 3

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       unemp |        10       12.26     4.44877          7       19.6

Having Canada in the data becomes a nuisance at this point, so we drop it:
. drop if type == 3
(1 observation deleted)

Two commands create a dummy variable named unemp2 with values of 0 when
unemployment is below average (12.26), 1 when unemployment is equal to or above average,
and missing when unemp is missing. In reading the second command, recall that Stata's sorting
and relational operators treat missing values as very large numbers.
. generate unemp2 = 0 if unemp < 12.26
(7 missing values generated)
. replace unemp2 = 1 if unemp >= 12.26 & unemp < .
(5 real changes made)

We might want to group the values of a measurement variable, thereby creating an ordered-
category or ordinal variable. The autocode function (see "Using Functions" earlier in this
chapter) provides automatic grouping of measurement variables. To create new ordinal variable
unemp3, which groups values of unemp into three equal-width groups over the interval from
5 to 20, type
. generate unemp3 = autocode(unemp,3,5,20)
(2 missing values generated)

A list of the data shows how the new dummy (unemp2) and ordinal (unemp3) variables
correspond to values of the original measurement variable unemp.
. list place unemp unemp2 unemp3

     +---------------------------------------------------+
     |                 place   unemp   unemp2   unemp3 |
     |---------------------------------------------------|
  1. |          Newfoundland    19.6        1       20 |
  2. |  Prince Edward Island    19.1        1       20 |
  3. |           Nova Scotia    13.9        1       15 |
  4. |         New Brunswick    13.8        1       15 |
  5. |                Quebec    13.2        1       15 |
     |---------------------------------------------------|
  6. |               Ontario     9.3        0       10 |
  7. |              Manitoba     8.5        0       10 |
  8. |          Saskatchewan       7        0       10 |
  9. |               Alberta     8.4        0       10 |
 10. |      British Columbia     9.8        0       10 |
     |---------------------------------------------------|
 11. |                 Yukon       .        .        . |
 12. | Northwest Territories       .        .        . |
     +---------------------------------------------------+


Both strategies just described dealt appropriately with missing values, so that Canadian
places with missing values on unemp likewise receive missing values on the variables derived
from it. A simpler approach works best if our data contain no missing values. To
illustrate, we begin by dropping the Yukon and Northwest Territories:
. drop if unemp >= .
(2 observations deleted)

A greater-than-or-equal-to inequality such as unemp >= . will select any user-specified
missing value codes, in addition to the default code ".". Type help missing for details.
Having dropped observations with missing values, we now can use the group function
to create an ordinal variable, not with approximately equal-width groupings as autocode
did, but with groupings of approximately equal size. We do this in two steps. First,
sort the data (assuming no missing values) on the variable of interest. Second, generate a new
variable using the group(#) function, where # indicates the number of groups desired. The
example below divides our 10 Canadian provinces into 5 groups.
. sort unemp
. generate unemp5 = group(5)

. list place unemp unemp2 unemp3 unemp5

     +-------------------------------------------------------------+
     |                 place   unemp   unemp2   unemp3   unemp5 |
     |-------------------------------------------------------------|
  1. |          Saskatchewan       7        0       10        1 |
  2. |               Alberta     8.4        0       10        1 |
  3. |              Manitoba     8.5        0       10        2 |
  4. |               Ontario     9.3        0       10        2 |
  5. |      British Columbia     9.8        0       10        3 |
     |-------------------------------------------------------------|
  6. |                Quebec    13.2        1       15        3 |
  7. |         New Brunswick    13.8        1       15        4 |
  8. |           Nova Scotia    13.9        1       15        4 |
  9. |  Prince Edward Island    19.1        1       20        5 |
 10. |          Newfoundland    19.6        1       20        5 |
     +-------------------------------------------------------------+

Another difference is that autocode assigns values equal to the upper bound of each
interval, whereas group simply assigns 1 to the first group, 2 to the second, and so forth.
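The equal-size grouping idea can be sketched in Python for sorted data (an illustrative reconstruction that reproduces the assignment shown in the listing above; the formula is this sketch's, not Stata's internal code):

```python
# After sorting, split n_obs observations into n_groups groups of as
# nearly equal size as possible; return a 1-based group number for
# each observation position 0..n_obs-1.
def group(n_obs, n_groups):
    return [i * n_groups // n_obs + 1 for i in range(n_obs)]

# 10 sorted provinces into 5 groups of 2, as in unemp5 above:
assert group(10, 5) == [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
```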

Using Explicit Subscripts with Variables
When Stata has data in memory, it also defines certain system variables that describe those data.
For example, _N represents the total number of observations. _n represents the observation
number: _n = 1 for the first observation, _n = 2 for the second, and so on to the last observation
(_n = _N). If we issue a command such as the following, it creates a new variable, caseID,
equal to the number of each observation as presently sorted:
. generate caseID = _n

Sorting the data another way will change each observation's value of _n, but its caseID value
will remain unchanged. Thus, if we do sort the data another way, we can later return to the
earlier order by typing

. sort caseID

Creating and saving unique case identification numbers that store the order of observations at
an early stage of dataset development can greatly facilitate later data management.
We can use explicit subscripts with variable names to specify particular observation
numbers. For example, the 6th observation in dataset canada1.dta (if we have not dropped or
re-sorted anything) is Quebec. Consequently, pop[6] refers to Quebec's population, 7,334.2
thousand:
. display pop[6]
7334.2002

Similarly, pop[12] is the Yukon's population:
. display pop[12]
30.1

Explicit subscripting and the _n system variable have additional relevance when our data
form a series. If we had the daily stock market price of a particular stock as a variable named
price, for instance, then either price or, equivalently, price[_n] denotes the value of the nth
observation or day. price[_n-1] denotes the previous day's price, and price[_n+1] denotes
the next day's. Thus, we could generate a new variable difprice, equal to the change in price
since the previous day:
. generate difprice = price - price[_n-1]

Chapter 13, on time series analysis, returns to this topic.
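The lag-and-difference idea behind price[_n-1] can be sketched in Python (illustration only, not Stata code; the price values are made up):

```python
# A lagged copy and a first difference, like price - price[_n-1].
# The first observation has no predecessor, so its difference is
# missing (None here), just as Stata would leave it missing.
price = [10.0, 10.5, 10.2, 11.0]

lag = [None] + price[:-1]                       # price[_n-1]
difprice = [None if l is None else p - l        # price - price[_n-1]
            for p, l in zip(price, lag)]

assert difprice[0] is None
assert abs(difprice[1] - 0.5) < 1e-12
assert abs(difprice[2] - (-0.3)) < 1e-12
```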

Importing Data from Other Programs
Previous sections illustrated how to enter and edit data by typing into the Data Editor. If our
original data reside in an appropriately formatted spreadsheet, a shortcut can speed up this
process: it is possible to copy and paste multi-column blocks of data (not including column
labels) directly into the Data Editor. This requires some care and perhaps experimentation,
because Stata will interpret any column containing non-numeric values as a string variable.
Single columns (variables) of data could also be pasted into the Data Editor from a word
processor document. Once data have been successfully pasted into Editor columns, we assign
variable names, labels, and so on in the usual manner.
Copy-and-paste methods are quick, but for larger projects it is important to have tools that
work directly with computer files created by other programs. Such files fall into two
general categories: raw-data ASCII (text) files, which can be read into Stata with the
appropriate Stata commands; and system files, which must be translated to Stata format by a
special third-party program before Stata can read them.
To illustrate ASCII file methods, we return to the Canadian data of Table 2.1. Suppose
that instead of typing these data into Stata's Data Editor, we typed them into our word
processor with at least one space between each value. String values must be in double quotes
if they contain internal spaces, as does "Prince Edward Island". For other string values, quotes
are optional. Word processors allow the option of saving documents as ASCII (text) files, a
simpler and more universal type than the word processor's usual saved-file format. We can
thus create an ASCII file named canada.raw that looks something like this:
"Canada" 29606.1 10.6 75.1 81.1
"Newfoundland" 575.4 19.6 73.9 79.8
"Prince Edward Island" 136.1 19.1 74.8 81.3
"Nova Scotia" 937.8 13.9 74.2 80.4
"New Brunswick" 760.1 13.8 74.8 80.6
"Quebec" 7334.2 13.2 74.5 81.2
"Ontario" 11100.3 9.3 75.5 81.1
"Manitoba" 1137.5 8.5 75 80.8
"Saskatchewan" 1015.6 7 75.2 81.8
"Alberta" 2747 8.4 75.5 81.4
"British Columbia" 3766 9.8 75.8 81.4
"Yukon" 30.1 . 71.3 80.4
"Northwest Territories" 65.8 . 70.2 78

Note the use of periods, not blanks, to indicate missing values for the Yukon and Northwest
Territories. If the dataset should have five variables, then for every observation, exactly five
values (including periods for missing values) must exist.
inf ile reads into memory an ASCII file, such as canada.raw, in which the values are
separated by one or more whitespace characters — blanks, tabs, and newlines (carriage return
line feed, or both) — or by commas. Its basic form is
. infile variable-list using filename.raw

With purely numeric data, the variable list could be omitted, in which case Stata assigns the
names varl, var2, var3, and so forth. On the other hand, we might want to give each variable
a distinctive name. We also need to identify string variables individually. For Canada raw the
inf xle command might be
. infile str30 place pop unemp mlife flife using Canada.raw, clear
(13 observations read)

The infile variable list specifies variables in the order that they appear in the data file. The clear option drops any current data from memory before reading in the new file.
If any string variables exist, their names must each be preceded by a str# statement. str30, for example, informs Stata that the next-named variable (place) is a string variable with as many as 30 characters. Actually, none of the Canadian place names involves more than 21 characters, but we do not need to know that in advance. It is often easier to overestimate string variable lengths. Then, once data are in memory, use compress to ensure that no variable takes up more space than it needs. The compress command automatically changes all variables to their most memory-efficient storage type.

. compress
place was str30 now str21

. describe

Contains data
  obs:            13
 vars:             5
 size:           533 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------
place           str21  %21s
pop             float  %9.0g
unemp           float  %9.0g
mlife           float  %9.0g
flife           float  %9.0g
-------------------------------------------------------------------
Sorted by:

We can now proceed to label variables and data as described earlier. At any point, the commands save canada0 (or save canada0, replace) would save the new dataset in Stata format, as file canada0.dta. The original raw-data file, canada.raw, remains unchanged on disk.

If our variables have non-numeric values (for example, "male" and "female") that we want to store as labeled numeric variables, then adding the option automatic will accomplish this. For example, we might read in raw survey data through this infile command:

. infile gender age income vote using survey.raw, automatic

Spreadsheet and database programs commonly write ASCII files that have only one observation per line, with values separated by tabs or commas. To read these files into Stata, use insheet. Its general syntax resembles that of infile, with options telling Stata whether the data are delimited by tabs, commas, or other characters. For example, assuming tab-delimited data,

. insheet variable-list using filename.raw, tab

Or, assuming comma-delimited data with the first row of the file containing variable names (also comma-delimited),

. insheet variable-list using filename.raw, comma names

With insheet we do not need to separately identify string variables. If we include no variable list, and do not have variable names in the file's first row, Stata automatically assigns the variable names var1, var2, var3, and so forth. Errors will occur if some values in our ASCII file are not separated by tabs, commas, or some other delimiter as specified in the insheet command.
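As a further sketch (the file name here is hypothetical), a comma-delimited file whose first row already holds variable names can be read with no variable list at all:

. insheet using canada.csv, comma names

Stata then takes both the variable names and the values from the file itself.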
Raw data files created by other statistical packages can be in “fixed-column” format, where
the values are not necessarily delimited at all, but do occupy predefined column positions. Both
infile and the more specialized command infix permit Stata to read such files. In the
command syntax itself, or in a “data dictionary” existing in a separate file or as the first part of
the data file, we have to specify exactly how the columns should be read.

Here is a simple example. Data exist in an ASCII file named nfresour.raw:

198624087641691000
198725247430001044
198825138637481086
198925358964371140
1990    8615731195
1991    7930001262

These data concern natural resource production in Newfoundland. The four variables occupy fixed column positions: columns 1-4 are the years (1986...1991); columns 5-8 measure forestry production in thousands of cubic meters (2408...missing); columns 9-14 measure mine production in thousands of dollars (764,169...793,000); and columns 15-18 are the consumer price index relative to 1986 (1000...1262). Notice that in fixed-column format, unlike space- or tab-delimited files, blanks indicate missing values, and the raw data contain no decimal points. To read nfresour.raw into Stata, we specify each variable's column position:
. infix year 1-4 wood 5-8 mines 9-14 CPI 15-18 using nfresour.raw, clear
(6 observations read)

. list

     | year   wood    mines    CPI |
     |-----------------------------|
  1. | 1986   2408   764169   1000 |
  2. | 1987   2524   743000   1044 |
  3. | 1988   2513   863748   1086 |
  4. | 1989   2535   896437   1140 |
  5. | 1990      .   861573   1195 |
     |-----------------------------|
  6. | 1991      .   793000   1262 |
     +-----------------------------+

More complicated fixed-column formats might require a data "dictionary." Data dictionaries can be straightforward, but they offer many possible choices. Typing help infix or help infile2 obtains brief outlines of these commands. For more examples and explanation, consult the User's Guide and reference manuals. Stata also can load, write, or view data from ODBC (Open Database Connectivity) sources; see help odbc.
What if we need to export data from Stata to some other, non-ODBC program? The outfile command writes ASCII files to disk. A command such as the following will create a space-delimited ASCII file named canada6.raw, containing whatever data were in memory:
. outfile using canada6
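outfile also accepts options controlling the output format. For instance, a comma option writes comma-separated rather than space-separated values, and replace permits overwriting an existing file of the same name. A sketch, reusing whatever data are in memory:

. outfile using canada6, comma replace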

The infile, insheet, infix, and outfile commands just described all manipulate raw data in ASCII files. A second, very quick, possibility is to copy your data from Stata's Browser and paste them directly into a spreadsheet such as Excel. Often the best option, however, is to transfer data directly between the specialized system files saved by various spreadsheet, database, or statistical programs. Several third-party programs perform such translations. Stat/Transfer, for example, will transfer data across many different formats, including dBASE, Excel, FoxPro, Gauss, JMP, Lotus, MATLAB, Minitab, OSIRIS, Paradox, S-Plus, SAS, SPSS, SYSTAT, and Stata. It is available through Stata Corporation (www.stata.com) or from its maker, Circle Systems (www.stattransfer.com). Transfer programs prove indispensable for analysts working in multi-program environments or exchanging data with colleagues.

Combining Two or More Stata Files

We can combine Stata datasets in two general ways: append a second dataset that contains additional observations; or merge with other datasets that contain new variables or values. In keeping with this chapter's Canadian theme, we will illustrate these procedures using data on Newfoundland. File newf1.dta records the province's population for 1985 to 1989.

. use newf1, clear
(Newfoundland 1985-89)

. describe

Contains data from C:\data\newf1.dta
  obs:             5                          Newfoundland 1985-89
 vars:             2                          3 Jul 2005 10:49
 size:            50 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------
year            int    %9.0g                  Year
pop             float  %9.0g                  Population
-------------------------------------------------------------------
Sorted by:

. list

     | year      pop |
     |---------------|
  1. | 1985   580700 |
  2. | 1986   580200 |
  3. | 1987   568200 |
  4. | 1988   568000 |
  5. | 1989   570000 |
     +---------------+

File newf2.dta has population and unemployment counts for some later years:

. use newf2
(Newfoundland 1990-95)

. describe

Contains data from C:\data\newf2.dta
  obs:             6                          Newfoundland 1990-95
 vars:             3                          3 Jul 2005 10:49
 size:            84 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------
year            float  %9.0g                  Year
pop             float  %9.0g                  Population
jobless         float  %9.0g                  Number of people unemployed
-------------------------------------------------------------------
Sorted by:

. list

     | year      pop   jobless |
     |-------------------------|
  1. | 1990   573400     42000 |
  2. | 1991   573500     45000 |
  3. | 1992   575600     49000 |
  4. | 1993   584400     49000 |
  5. | 1994   582400     50000 |
     |-------------------------|
  6. | 1995   575449         . |
     +-------------------------+

To combine these datasets, with newf2.dta already in memory, we use the append command:

. append using newf1


. list

     | year      pop   jobless |
     |-------------------------|
  1. | 1990   573400     42000 |
  2. | 1991   573500     45000 |
  3. | 1992   575600     49000 |
  4. | 1993   584400     49000 |
  5. | 1994   582400     50000 |
     |-------------------------|
  6. | 1995   575449         . |
  7. | 1985   580700         . |
  8. | 1986   580200         . |
  9. | 1987   568200         . |
 10. | 1988   568000         . |
     |-------------------------|
 11. | 1989   570000         . |
     +-------------------------+

Because variable jobless occurs in newf2 (1990 to 1995) but not in newf1, its 1985 to 1989 values are missing in the combined dataset. We can now put the observations in order from earliest to latest and save these combined data as a new file, newf3.dta.
. sort year

. list

     | year      pop   jobless |
     |-------------------------|
  1. | 1985   580700         . |
  2. | 1986   580200         . |
  3. | 1987   568200         . |
  4. | 1988   568000         . |
  5. | 1989   570000         . |
     |-------------------------|
  6. | 1990   573400     42000 |
  7. | 1991   573500     45000 |
  8. | 1992   575600     49000 |
  9. | 1993   584400     49000 |
 10. | 1994   582400     50000 |
     |-------------------------|
 11. | 1995   575449         . |
     +-------------------------+

. save newf3

If appending a second dataset can be compared to lengthening the data in memory by adding observations, then merging can be compared to widening them with new variables or values. File newf4.dta contains further Newfoundland time series: the numbers of births and divorces over the years 1980 to 1994. It has some observations in common with our earlier dataset newf3.dta, as well as one variable (year) in common, but it also has two new variables not present in newf3.dta.
. use newf4
(Newfoundland 1980-94)

. describe

Contains data from C:\data\newf4.dta
  obs:            15                          Newfoundland 1980-94
 vars:             3                          3 Jul 2005 10:49
 size:           150 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------
year            int    %9.0g                  Year
births          int    %9.0g                  Number of births
divorces        int    %9.0g                  Number of divorces
-------------------------------------------------------------------
Sorted by:

. list

     | year   births   divorces |
     |--------------------------|
  1. | 1980    10332        555 |
  2. | 1981    11310        569 |
  3. | 1982     9173        625 |
  4. | 1983     9630        711 |
  5. | 1984     8560        590 |
     |--------------------------|
  6. | 1985     8080        561 |
  7. | 1986     8320        610 |
  8. | 1987     7656       1002 |
  9. | 1988     7396        884 |
 10. | 1989     7996        981 |
     |--------------------------|
 11. | 1990     7354        973 |
 12. | 1991     6929        912 |
 13. | 1992     6689        867 |
 14. | 1993     6360        930 |
 15. | 1994     6295        933 |
     +--------------------------+

We want to merge newf3 with newf4, matching observations according to year wherever possible. To accomplish this, both datasets must be sorted by the index variable (which in this example is year). We earlier issued a sort year command before saving newf3.dta, so we now do the same with newf4.dta. Then we merge the two, specifying year as the index variable to match.

. sort year

. merge year using newf3
. describe

Contains data from newf4.dta
  obs:            16                          Newfoundland 1980-94
 vars:             6                          3 Jul 2005 10:49
 size:           304 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------
year            int    %9.0g                  Year
births          int    %9.0g                  Number of births
divorces        int    %9.0g                  Number of divorces
pop             float  %9.0g                  Population
jobless         float  %9.0g                  Number of people unemployed
_merge          byte   %8.0g
-------------------------------------------------------------------
Sorted by:
Note:  dataset has changed since last saved

. list

     | year   births   divorces      pop   jobless   _merge |
     |-------------------------------------------------------|
  1. | 1980    10332        555        .         .        1 |
  2. | 1981    11310        569        .         .        1 |
  3. | 1982     9173        625        .         .        1 |
  4. | 1983     9630        711        .         .        1 |
  5. | 1984     8560        590        .         .        1 |
     |-------------------------------------------------------|
  6. | 1985     8080        561   580700         .        3 |
  7. | 1986     8320        610   580200         .        3 |
  8. | 1987     7656       1002   568200         .        3 |
  9. | 1988     7396        884   568000         .        3 |
 10. | 1989     7996        981   570000         .        3 |
     |-------------------------------------------------------|
 11. | 1990     7354        973   573400     42000        3 |
 12. | 1991     6929        912   573500     45000        3 |
 13. | 1992     6689        867   575600     49000        3 |
 14. | 1993     6360        930   584400     49000        3 |
 15. | 1994     6295        933   582400     50000        3 |
     |-------------------------------------------------------|
 16. | 1995        .          .   575449         .        2 |
     +-------------------------------------------------------+

When both datasets contain values for the same variable and observation, values of the master data (those already in memory) are retained, and those of the "using" data are ignored. The merge command has several options, however, that override this default. A merge of the following form would allow any missing values in the master data to be replaced by corresponding nonmissing values found in the using data (here, newf5.dta):

. merge year using newf5, update

A command such as the following causes any values from the master data to be replaced by nonmissing values from the using data, if the latter are different:

. merge year using newf5, update replace

Values of the merge variable could occur more than once in the master data; for example, suppose that the year 1990 occurs twice. Then values from the using data would be matched with each occurrence of year = 1990 in the master data. You can use this feature for tasks such as combining background data on individual patients with data on any number of separate doctor visits they made. Although merge makes this and many other data-management tasks straightforward, analysts should look closely at the results to be certain that the command is accomplishing what they intend.
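As a sketch of such a many-to-one match (the dataset and variable names here are hypothetical), suppose patients.dta holds one record per patient and visits.dta holds one record per doctor visit, both sorted by patid:

. use visits, clear

. merge patid using patients

Each patient's background values would then be repeated across all of his or her visits.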
As a diagnostic aid, merge automatically creates a new variable called _merge. Unless update was specified, _merge codes have the following meanings:

1   Observation from the master dataset only.
2   Observation from the using dataset only.
3   Observation from both master and using data (using values ignored if different).


If the update option was specified, _merge codes convey what happened:

1   Observation from the master dataset only.
2   Observation from the using dataset only.
3   Observation from both, master data agrees with using.
4   Observation from both, master data updated if missing.
5   Observation from both, master data replaced if different.

Before performing another merge operation, it will be necessary to discard or rename this variable. For example,

. drop _merge

Or,

. rename _merge _merge1

We can merge multiple datasets with a single merge command. For example, if newf5.dta through newf8.dta are four datasets, each sorted by the variable year, then merging all four with the master dataset could be accomplished as follows.

. merge year using newf5 newf6 newf7 newf8, update replace

Other merge options include checks on whether the merging-variable values are unique, and the ability to specify which variables to keep for the final dataset. Type help merge for details.
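After any merge, a quick look at the distribution of _merge codes helps confirm that the match worked as intended. For example,

. tabulate _merge

An unexpected number of 1 or 2 codes would flag observations that failed to match.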


Transposing, Reshaping, or Collapsing Data
Long after a dataset has been created, we might discover that for some analytical purposes it has the wrong organization. Fortunately, several commands facilitate drastic restructuring of datasets. We will illustrate these using data (growth1.dta) on recent population growth in five eastern provinces of Canada. In these data, unlike our previous examples, province names are represented by a numerical variable with eight-character labels.
. use growth1, clear
(Eastern Canada growth)

. describe

Contains data from C:\data\growth1.dta
  obs:             5                          Eastern Canada growth
 vars:             5                          3 Jul 2005 10:48
 size:           105 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------
provinc2        byte   %8.0g       provinc2   Eastern Canadian province
grow92          float  %9.0g                  Pop. gain in 1000s, 1991-92
grow93          float  %9.0g                  Pop. gain in 1000s, 1992-93
grow94          float  %9.0g                  Pop. gain in 1000s, 1993-94
grow95          float  %9.0g                  Pop. gain in 1000s, 1994-95
-------------------------------------------------------------------
Sorted by:

. list

     | provinc2   grow92   grow93   grow94   grow95 |
     |-----------------------------------------------|
  1. | New Brun       10      2.5      2.2      2.4 |
  2. | Newfound      4.5       .8       -3     -5.8 |
  3. | Nova Sco     12.1      5.8      3.5      3.9 |
  4. |  Ontario    174.9    169.1    120.9    163.9 |
  5. |   Quebec     80.6     77.4     48.5     47.1 |
     +-----------------------------------------------+

In this organization, population growth for each year is stored as a separate variable. We could analyze changes in the mean or variation of population growth from year to year. On the other hand, given this organization, Stata could not readily draw a simple time plot of population growth against year, nor can Stata find the correlation between population growth in New Brunswick and Newfoundland. All the necessary information is here, but such analyses require different organizations of the data.
One simple reorganization involves transposing variables and observations. In effect, the dataset's rows become its columns, and vice versa. This is accomplished by the xpose command. The option clear is required with this command, because it always clears the present data from memory. Including the varname option creates an additional variable (named _varname) in the transposed dataset, containing original variable names as strings.

. xpose, clear varname

. describe

Contains data
  obs:             5
 vars:             6
 size:           160 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------
v1              float  %9.0g
v2              float  %9.0g
v3              float  %9.0g
v4              float  %9.0g
v5              float  %9.0g
_varname        str8   %9s
-------------------------------------------------------------------
Sorted by:
Note:  dataset has changed since last saved

. list

     |  v1     v2     v3      v4     v5   _varname |
     |----------------------------------------------|
  1. |   1      2      3       4      5   provinc2 |
  2. |  10    4.5   12.1   174.9   80.6     grow92 |
  3. | 2.5     .8    5.8   169.1   77.4     grow93 |
  4. | 2.2     -3    3.5   120.9   48.5     grow94 |
  5. | 2.4   -5.8    3.9   163.9   47.1     grow95 |
     +----------------------------------------------+

Value labels are lost along the way, so provinces in the transposed dataset are indicated only by their numbers (1 = New Brunswick, 2 = Newfoundland, and so on). The second through last values in each column are the population gains for that province, in thousands. Thus, variable v1 has a province identification number (1, meaning New Brunswick) in its first row, and New Brunswick's population growth values for 1992 to 1995 in its second through fifth rows. We can now find correlations between population growth in different provinces, for instance, by typing a correlate command with an in 2/5 (second through fifth observations only) qualifier:
. correlate v1-v5 in 2/5
(obs=4)

             |       v1       v2       v3       v4       v5
-------------+---------------------------------------------
          v1 |   1.0000
          v2 |   0.8058   1.0000
          v3 |   0.9742   0.8978   1.0000
          v4 |   0.5070   0.4803   0.6204   1.0000
          v5 |   0.6526   0.9362   0.8049   0.6765   1.0000

The strongest correlation appears between the growth of the neighboring maritime provinces New Brunswick (v1) and Nova Scotia (v3): r = .9742. Newfoundland's (v2) growth has a much weaker correlation with that of Ontario (v4): r = .4803.
More sophisticated restructuring is possible through the reshape command. This command switches datasets between two basic configurations, termed "wide" and "long." Dataset growth1.dta is initially in wide format.

. use growth1, clear
(Eastern Canada growth)

. list

     | provinc2   grow92   grow93   grow94   grow95 |
     |-----------------------------------------------|
  1. | New Brun       10      2.5      2.2      2.4 |
  2. | Newfound      4.5       .8       -3     -5.8 |
  3. | Nova Sco     12.1      5.8      3.5      3.9 |
  4. |  Ontario    174.9    169.1    120.9    163.9 |
  5. |   Quebec     80.6     77.4     48.5     47.1 |
     +-----------------------------------------------+

A reshape command switches this to long format.

. reshape long grow, i(provinc2) j(year)
(note: j = 92 93 94 95)

Data                               wide   ->   long
---------------------------------------------------------------------
Number of obs.                        5   ->     20
Number of variables                   5   ->      3
j variable (4 values)                     ->   year
xij variables:
                   grow92 grow93 ... grow95   ->   grow
---------------------------------------------------------------------

Listing the data shows how they were reshaped. A sepby() option with the list command produces a table with horizontal lines visually separating the provinces, instead of every five observations (the default).

. list, sepby(provinc2)

     | provinc2   year   grow |
     |------------------------|
  1. | New Brun     92     10 |
  2. | New Brun     93    2.5 |
  3. | New Brun     94    2.2 |
  4. | New Brun     95    2.4 |
     |------------------------|
  5. | Newfound     92    4.5 |
  6. | Newfound     93     .8 |
  7. | Newfound     94     -3 |
  8. | Newfound     95   -5.8 |
     |------------------------|
  9. | Nova Sco     92   12.1 |
 10. | Nova Sco     93    5.8 |
 11. | Nova Sco     94    3.5 |
 12. | Nova Sco     95    3.9 |
     |------------------------|
 13. |  Ontario     92  174.9 |
 14. |  Ontario     93  169.1 |
 15. |  Ontario     94  120.9 |
 16. |  Ontario     95  163.9 |
     |------------------------|
 17. |   Quebec     92   80.6 |
 18. |   Quebec     93   77.4 |
 19. |   Quebec     94   48.5 |
 20. |   Quebec     95   47.1 |
     +------------------------+

. label data "Eastern Canadian growth--long"

. label variable grow "Population growth in 1000s"

. save growth2
file C:\data\growth2.dta saved

The reshape command above began by stating that we want to put the dataset in long form. Next, it named the new variable to be created, grow. The i(provinc2) option specified the observation identifier, or the variable whose unique values denote logical observations. In this example, each province forms a logical observation. The j(year) option specifies the sub-observation identifier, or the variable whose unique values (within each logical observation) denote sub-observations. Here, the sub-observations are years within each province.
Figure 2.1 shows a possible use for the long-format dataset. With one graph command, we can now produce time plots comparing the population gains in New Brunswick, Newfoundland, and Nova Scotia (observations for which provinc2 < 4). The graph command below calls for connected-line plots of grow (as y-axis variable) against year (x axis) if provinc2 < 4, with horizontal lines at y = 0 (zero population growth), and separate plots for each value of provinc2.

. graph twoway connected grow year if provinc2 < 4, yline(0) by(provinc2)

[Figure 2.1: connected-line plots of population gain (in 1000s) against year, 92-95, with reference lines at y = 0, in separate panels for New Brun, Newfound, and Nova Sco. Graphs by Eastern Canadian province.]

Declines in their fisheries during the early 1990s contributed to economic hardships in these three provinces. Growth slowed dramatically in New Brunswick and Nova Scotia, while Newfoundland (the most fisheries-dependent province) actually lost population.
reshape works equally well in reverse, to switch data from "long" to "wide" format. Dataset growth3.dta serves as an example of long format.
. use growth3, clear
(Eastern Canadian growth--long)

. list, sepby(provinc2)

     | provinc2   year   grow |
     |------------------------|
  1. | New Brun     92     10 |
  2. | New Brun     93    2.5 |
  3. | New Brun     94    2.2 |
  4. | New Brun     95    2.4 |
     |------------------------|
  5. | Newfound     92    4.5 |
  6. | Newfound     93     .8 |
  7. | Newfound     94     -3 |
  8. | Newfound     95   -5.8 |
     |------------------------|
  9. | Nova Sco     92   12.1 |
 10. | Nova Sco     93    5.8 |
 11. | Nova Sco     94    3.5 |
 12. | Nova Sco     95    3.9 |
     |------------------------|
 13. |  Ontario     92  174.9 |
 14. |  Ontario     93  169.1 |
 15. |  Ontario     94  120.9 |
 16. |  Ontario     95  163.9 |
     |------------------------|
 17. |   Quebec     92   80.6 |
 18. |   Quebec     93   77.4 |
 19. |   Quebec     94   48.5 |
 20. |   Quebec     95   47.1 |
     +------------------------+

To convert this to wide format, we use reshape wide:

. reshape wide grow, i(provinc2) j(year)
(note: j = 92 93 94 95)

Data                               long   ->   wide
---------------------------------------------------------------------
Number of obs.                       20   ->      5
Number of variables                   3   ->      5
j variable (4 values)              year   ->   (dropped)
xij variables:
                                   grow   ->   grow92 grow93 ... grow95
---------------------------------------------------------------------

. list

     | provinc2   grow92   grow93   grow94   grow95 |
     |-----------------------------------------------|
  1. | New Brun       10      2.5      2.2      2.4 |
  2. | Newfound      4.5       .8       -3     -5.8 |
  3. | Nova Sco     12.1      5.8      3.5      3.9 |
  4. |  Ontario    174.9    169.1    120.9    163.9 |
  5. |   Quebec     80.6     77.4     48.5     47.1 |
     +-----------------------------------------------+

Notice that we have recreated the organization of dataset growth1.dta.
Another important tool for restructuring datasets is the collapse command, which creates an aggregated dataset of statistics (for example, means, medians, or sums). The long growth3 dataset has four observations for each province.

. use growth3, clear
(Eastern Canadian growth--long)

. list, sepby(provinc2)

     | provinc2   year   grow |
     |------------------------|
  1. | New Brun     92     10 |
  2. | New Brun     93    2.5 |
  3. | New Brun     94    2.2 |
  4. | New Brun     95    2.4 |
     |------------------------|
  5. | Newfound     92    4.5 |
  6. | Newfound     93     .8 |
  7. | Newfound     94     -3 |
  8. | Newfound     95   -5.8 |
     |------------------------|
  9. | Nova Sco     92   12.1 |
 10. | Nova Sco     93    5.8 |
 11. | Nova Sco     94    3.5 |
 12. | Nova Sco     95    3.9 |
     |------------------------|
 13. |  Ontario     92  174.9 |
 14. |  Ontario     93  169.1 |
 15. |  Ontario     94  120.9 |
 16. |  Ontario     95  163.9 |
     |------------------------|
 17. |   Quebec     92   80.6 |
 18. |   Quebec     93   77.4 |
 19. |   Quebec     94   48.5 |
 20. |   Quebec     95   47.1 |
     +------------------------+

The following command collapses these data into their means, leaving one observation for each value of provinc2 — that is, one province.

. collapse (mean) grow, by(provinc2)

. list

     | provinc2        grow |
     |----------------------|
  1. | New Brun       4.275 |
  2. | Newfound   -.8750001 |
  3. | Nova Sco       6.325 |
  4. |  Ontario       157.2 |
  5. |   Quebec        63.4 |
     +----------------------+

When we specify only the original variable name, as with grow in the previous example, the collapsed variable takes on the same name as the old variable. We can instead assign new names, and request several statistics at once:

. collapse (sum) births deaths (mean) meaninc = income (median) medinc = income, by(provinc2)

collapse can create variables based on the following summary statistics:

mean     Means (the default; used if the type of statistic is not specified)
sd       Standard deviations
sum      Sums
rawsum   Sums ignoring optionally specified weight
count    Number of nonmissing observations
max      Maximums
min      Minimums
median   Medians
p1       1st percentiles
p2       2nd percentiles (and so forth to p99)
iqr      Interquartile ranges


Weighting Observations

Stata understands four types of weighting:

aweight   Analytical weights, used in weighted least squares (WLS) regression and similar procedures.
fweight   Frequency weights, counting the number of duplicated observations. Frequency weights must be integers.
iweight   Importance weights, however you define "importance."
pweight   Probability or sampling weights, equal to the inverse of the probability that an observation is included due to sampling strategy.
Researchers sometimes speak of "weighted data." This might mean that the original sampling scheme selected observations in a deliberately disproportionate way, as reflected by weights equal to 1/(probability of selection). Appropriate use of pweight can compensate for disproportionate sampling in certain analyses. On the other hand, "weighted data" might mean an aggregate dataset, perhaps constructed from a frequency table or cross-tabulation, with one or more variables indicating how many times a particular value or combination of values occurred. In that case, we need fweight.
Not all types of weighting have been defined for all types of analyses. We cannot, for example, use pweight with the tabulate command. Using weights in any analysis requires a clear understanding of what we want weighting to accomplish in that particular analysis. The weights themselves can be any variable in the dataset.
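A related tool is the expand command, which converts frequency-weighted data into individual observations. As a sketch, if a variable count holds the frequencies,

. expand count

would leave each observation repeated count times, after which no fweight is needed. This trades memory for convenience, since every analytical command can then treat the data as ordinary individual records.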
The following small dataset (nfschool.dta), containing results from a survey of 1,381 rural Newfoundland high school students, illustrates a simple application of frequency weighting.
. describe

Contains data from C:\data\nfschool.dta
  obs:             6                          Newf.school/univer.(Seyfrit 93)
 vars:             3                          3 Jul 2005 10:50
 size:            48 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------
univers         byte   %8.0g       yes        Expect to attend university?
year            byte   %8.0g                  What year of school now?
count           int    %8.0g                  observed frequency
-------------------------------------------------------------------
Sorted by:

. list, sep(3)

     | univers   year   count |
     |------------------------|
  1. |      no     10     210 |
  2. |      no     11     260 |
  3. |      no     12     274 |
     |------------------------|
  4. |     yes     10     224 |
  5. |     yes     11     235 |
  6. |     yes     12     178 |
     +------------------------+

At first glance, the dataset seems to contain only 6 observations, and when we cross-tabulate whether students expect to attend a university (univers) by their current year in high school (year), we get a table with one observation per cell.
. tabulate univers year

 Expect to |
    attend |    What year of school now?
university?|        10         11         12 |     Total
-----------+---------------------------------+----------
        no |         1          1          1 |         3
       yes |         1          1          1 |         3
-----------+---------------------------------+----------
     Total |         2          2          2 |         6

To understand these data, we need to apply frequency weights. The variable count gives
frequencies: 210 of these students are tenth graders who said they did not expect to attend a
university, 260 are eleventh graders who said no, and so on. Specifying [fweight =
count] obtains a cross-tabulation showing responses of all 1,381 students.
. tabulate univers year [fweight = count]

 Expect to |
    attend |    What year of school now?
university?|        10         11         12 |     Total
-----------+---------------------------------+----------
        no |       210        260        274 |       744
       yes |       224        235        178 |       637
-----------+---------------------------------+----------
     Total |       434        495        452 |     1,381

Carrying the analysis further, we might add options asking for a table with column percentages (col), no cell frequencies (nof), and a χ² test of independence (chi2). This reveals a statistically significant relationship (P = .001). The percentage of students expecting to go to college declines with each year of high school.
. tabulate univers year [fw = count], col nof chi2

 Expect to |
    attend |    What year of school now?
university?|        10         11         12 |     Total
-----------+---------------------------------+----------
        no |     48.39      52.53      60.62 |     53.87
       yes |     51.61      47.47      39.38 |     46.13
-----------+---------------------------------+----------
     Total |    100.00     100.00     100.00 |    100.00

          Pearson chi2(2) =  13.8967   Pr = 0.001

Survey data often reflect complex sampling designs, based on one or more of the following:
disproportionate sampling — for example, oversampling particular subpopulations in order to get enough cases to draw conclusions about them.
clustering — for example, selecting voting precincts at random, and then sampling individuals within the selected precincts.
stratification — for example, dividing precincts into "urban" and "rural" strata, and then sampling precincts and/or individuals within each stratum.
Complex sampling designs require specialized analytical tools; pweights and Stata's ordinary analytical commands do not suffice.
Stata's procedures for complex survey data include special tabulation, means, regression, logit, probit, tobit, and Poisson regression commands. Before applying these commands, users must first set up their data by identifying variables that indicate the PSUs (primary sampling units) or clusters, strata, finite population correction, and probability weights. This is accomplished through the svyset command. For example:

. svyset precinct [pweight=invPsel], strata(urb_rur) fpc(finite)

For each observation in this example, the value of variable precinct identifies the PSU or cluster, values of urb_rur identify the strata, finite gives the finite population correction, and invPsel gives the probability weight or inverse of the probability of selection. After the data have been svyset and saved, the survey analytical procedures are relatively straightforward. Commands are typically prefixed by svy:, as in

. svy: mean income

or

. svy: regress income education experience gender

The Survey Data Reference Manual contains full details and examples of Stata’s extensive
survey-analysis capabilities. For online guidance, type help svy and follow the links to
particular commands.

Creating Random Data and Random Samples
The pseudo-random number function uniform() lies at the heart of Stata's ability to generate random data or to sample randomly from the data at hand. The Base Reference Manual (Functions) provides a technical description of this 32-bit pseudo-random generator. If we presently have data in memory, then a command such as the following creates a new variable named randnum, having apparently random 16-digit values over the interval (0,1) for each case in the data.

. generate randnum = uniform()

Alternatively, we might create a random dataset from scratch. Suppose we want to start a new dataset containing 10 random values. We first clear any other data from memory (if they were valuable, save them first). Next, set the number of observations desired for the new dataset. Explicitly setting the seed number makes it possible to later reproduce the same random results. Finally, we generate our random variable.
. clear

. set obs 10
obs was 0, now 10

. set seed 12345

. generate randnum = uniform()

. list

     |  randnum |
     |----------|
  1. |  .309106 |
  2. | .6852276 |
  3. | .1277815 |
  4. | .5617244 |
  5. | .3134516 |
     |----------|
  6. | .5047374 |
  7. | .7232868 |
  8. | .4176817 |
  9. | .6768828 |
 10. | .3657581 |
     +----------+

In combination with Stata's algebraic, statistical, and special functions, uniform() can simulate values sampled from a variety of theoretical distributions. If we want newvar sampled from a uniform distribution over [0,428) instead of the usual [0,1), we type

. generate newvar = 428 * uniform()

These will still be 16-digit values. Perhaps we want only integers from 1 to 428 (inclusive):

. generate newvar = 1 + trunc(428 * uniform())

To simulate 1,000 rolls of a six-sided die, type

. clear

. set obs 1000
obs was 0, now 1000

. generate roll = 1 + trunc(6 * uniform())

. tabulate roll
        die |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        171       17.10       17.10
          2 |        164       16.40       33.50
          3 |        150       15.00       48.50
          4 |        170       17.00       65.50
          5 |        169       16.90       82.40
          6 |        176       17.60      100.00
------------+-----------------------------------
      Total |       1000      100.00
We might theoretically expect 16.67% ones, 16.67% twos, and so on, but in any one sample like
these 1,000 “rolls,” the observed percentages will vary randomly around their expected values.
To simulate 1,000 rolls of a pair of six-sided dice, type

. generate dice = 2 + trunc(6 * uniform()) + trunc(6 * uniform())

. tabulate dice

       dice |      Freq.     Percent        Cum.
------------+-----------------------------------
          2 |         26        2.60        2.60
          3 |         62        6.20        8.80
          4 |         78        7.80       16.60
          5 |        120       12.00       28.60
          6 |        153       15.30       43.90
          7 |        149       14.90       58.80
          8 |        146       14.60       73.40
          9 |         96        9.60       83.00
         10 |         88        8.80       91.80
         11 |         53        5.30       97.10
         12 |         29        2.90      100.00
------------+-----------------------------------
      Total |       1000      100.00

We can use _n to begin an artificial dataset as well. The following commands create a new
5,000-observation dataset with one variable named index, containing values from 1 to 5,000.

. set obs 5000
obs was 0, now 5000

. generate index = _n

. summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       index |      5000      2500.5     1443.52          1       5000

It is possible to generate variables from a normal (Gaussian) distribution using
invnormal(uniform()). The following example creates a dataset with 2,000 observations and 2
variables, z from an N(0,1) population, and x from N(500,75).

. clear

. set obs 2000
obs was 0, now 2000

. generate z = invnormal(uniform())

. generate x = 500 + 75*invnormal(uniform())

The actual sample means and standard deviations differ slightly from their theoretical values:

. summarize

    Variable |       Obs        Mean    Std. Dev.        Min        Max
-------------+----------------------------------------------------------
           z |      2000    .0375032    1.026784  -3.536209   4.038878
           x |      2000     503.322    75.68551   244.3324   743.1377

If z follows a normal distribution, v = e^z follows a lognormal distribution. To form a
lognormal variable v based upon a standard normal z,

. generate v = exp(invnormal(uniform()))

To form a lognormal variable w based on an N(100,15) distribution,

. generate w = exp(100 + 15*invnormal(uniform()))

Taking logarithms, of course, normalizes a lognormal variable.
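As a quick check of that last point (a sketch, not a command from the original text; logw is a hypothetical variable name), the log of w should again look roughly normal, with sample mean near 100 and standard deviation near 15:

. generate logw = ln(w)
. summarize logw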
To simulate y values drawn randomly from an exponential distribution with mean and
standard deviation μ = σ = 3,

. generate y = -3 * ln(uniform())

For other means and standard deviations, substitute other values for 3.
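For instance, to draw values with mean and standard deviation 5, following the same pattern (y5 is a hypothetical name, not from the original text):

. generate y5 = -5 * ln(uniform())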
X1 follows a χ² distribution with one degree of freedom, which is the same as a squared
standard normal:

. generate X1 = (invnormal(uniform()))^2

By similar logic, X2 follows a χ² with two degrees of freedom:

. generate X2 = (invnormal(uniform()))^2 + (invnormal(uniform()))^2

Other statistical distributions, including t and F, can be simulated along the same lines. In
addition, programs have been written for Stata to generate random samples following
distributions such as binomial, Poisson, gamma, and inverse Gaussian.
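For example, a t variable can be built from pieces already defined (a sketch, not from the original text; t2 is a hypothetical name). A t distribution with 2 degrees of freedom equals a standard normal divided by the square root of an independent χ²(2) variable over its degrees of freedom:

. generate t2 = invnormal(uniform()) / sqrt(X2/2)

Here X2 is the χ²(2) variable generated above; because each invnormal(uniform()) draw is independent, the numerator and denominator are independent as required.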

Although invnormal(uniform()) can be adjusted to yield normal variates with
particular correlations, a much easier way to do this is through the drawnorm command. To
generate 5,000 observations from N(0,1), type

. clear

. drawnorm z, n(5000)

. summarize

    Variable |       Obs        Mean    Std. Dev.        Min        Max
-------------+----------------------------------------------------------
           z |      5000   -.0005951    1.019788  -4.518918   3.923464

Suppose we want three variables to have the following population correlations:

          x1     x2     x3
   x1    1.0    0.4   -0.8
   x2    0.4    1.0    0.0
   x3   -0.8    0.0    1.0

The procedure for creating such data requires first defining the correlation matrix C, and then
using C in the drawnorm command:

. mat C = (1, .4, -.8 \ .4, 1, 0 \ -.8, 0, 1)

. drawnorm x1 x2 x3, means(0,100,500) sds(1,15,75) corr(C)

. summarize x1-x3

    Variable |       Obs        Mean    Std. Dev.        Min        Max
-------------+----------------------------------------------------------
          x1 |      5000    .0024364     1.01646  -3.478467   3.598916
          x2 |      5000    100.1826    14.91325   46.13897   150.7634
          x3 |      5000    500.7747    76.93925   211.5596   769.6044

. correlate x1-x3
(obs=5000)

             |       x1       x2       x3
-------------+----------------------------
          x1 |   1.0000
          x2 |   0.3951   1.0000
          x3 |  -0.8134  -0.0072   1.0000

Compare the sample variables' correlations and means with the theoretical values given earlier.
Random data generated in this fashion can be viewed as samples drawn from theoretical
populations. We should not expect the samples to have exactly the theoretical population
parameters (in this example, an x3 mean of 500, x1-x2 correlation of 0.4, x1-x3 correlation of
-0.8, and so forth).

The command sample makes unobtrusive use of uniform()'s random generator to
obtain random samples of the data in memory. For example, to discard all but a 10% random
sample of the original data, type

. sample 10

When we add an in or if qualifier, sample applies only to those observations meeting
our criteria. For example,

. sample 10 if age < 26

would leave us with a 10% sample of those observations with age less than 26, plus 100% of
the original observations with age 26 or older.
We could also select random samples of a particular size. To discard all but 90 randomly
selected observations from the dataset in memory, type

. sample 90, count

The sections in Chapter 14 on bootstrapping and Monte Carlo simulations provide further
examples of random sampling and random variable generation.

Writing Programs for Data Management

Data management on larger projects often involves repetitive or error-prone tasks that are best
handled by writing specialized Stata programs. Advanced programming can become very
technical, but we can also begin by writing simple programs that consist of nothing more than
a sequence of Stata commands, typed and saved as an ASCII file. ASCII files can be created
using your favorite word processor or text editor, which should offer "ASCII text file" among
its options under File - Save As. An even easier way to create such text files is through Stata's
Do-file Editor, which is brought up by clicking Window - Do-file Editor or the Do-file Editor
icon. Alternatively, bring up the Do-file Editor by typing the command doedit, or doedit
filename if filename exists.

For example, using the Do-file Editor we might create a file named canada.do, which
contains the commands to read in a raw data file named canada.raw, then label the dataset and
its variables, compress it, and save it in Stata format. The commands in this file are identical
to those seen earlier, when we went through the example step by step.

infile str30 place pop unemp mlife flife using canada.raw
label data "Canadian dataset 1"
label variable pop "Population in 1000s, 1995"
label variable unemp "% 15+ population unemployed, 1995"
label variable mlife "Male life expectancy years"
label variable flife "Female life expectancy years"
compress
save canada1, replace

Once this canada.do file has been written and saved, simply typing the following command
causes Stata to read the file and run each command in turn:

. do canada

Such batch-mode programs, termed "do-files," are usually saved with a .do extension. More
elaborate programs (defined by do-files or "automatic do" files) can be stored in memory, and
can call other programs in turn, creating new Stata commands and opening worlds of
possibility for adventurous analysts. The Do-file Editor has several other features that you
might find useful. Chapter 3 describes a simple way to use do-files in building graphs. For
further information, see the Getting Started manual on Using the Do-file Editor.

Stata ordinarily interprets the end of a command line as the end of that command. This is
reasonable onscreen, where the line can be arbitrarily long, but does not work as well when we
are typing commands in a text file. One way to avoid line-length problems is through the
#delimit command, which can set some other character as the end-of-command delimiter.
In the following example, we make a semicolon the delimiter; then type two long commands
that do not end until a semicolon appears; and then finally reset the delimiter to its usual value,
a carriage return (cr):

#delimit ;
infile str30 place pop unemp mlife flife births deaths
   marriage medinc mededuc using newcan.raw ;
order place pop births deaths marriage medinc mededuc
   unemp mlife flife ;
#delimit cr

Stata normally pauses each time the Results window becomes full of information, and waits
to proceed until we press a key. Instead of pausing, we can ask Stata to continue
scrolling until the output is complete. Typed in the Command window or as part of a program,
the command

. set more off

calls for continuous scrolling. This is convenient if our program produces much screen output
that we don't want to see, or if it is writing to a log file that we will examine later. Typing

. set more on

returns to the usual mode of waiting for keyboard input before scrolling.

Managing Memory

When we use (or File - Open) a dataset, Stata reads the disk file and loads it into memory.
Loading the data into memory permits rapid analysis, but it is only possible if the dataset can
fit within the amount of memory currently allocated to Stata. If we try to open a dataset that
is too large, we get an elaborate error message saying "no room to add more observations," and
advising what to do next.

. use C:\data\gbank2.dta
(Scientific surveys off S. Newfoundland)
no room to add more observations
An attempt was made to increase the number of observations beyond what is
currently possible.  You have the following alternatives:
    1.  Store your variables more efficiently; see help compress.  (Think of
        Stata's data area as the area of a rectangle; Stata can trade off
        width and length.)
    2.  Drop some variables or observations; see help drop.
    3.  Increase the amount of memory allocated to the data area using the
        set memory command; see help memory.
r(901);

Small Stata allocates a fixed amount of memory to data, and this limit cannot be changed.
Intercooled Stata and Stata/SE versions are flexible, however. Default allocations equal 1
megabyte for Intercooled, and 10 megabytes for Stata/SE. If we have Intercooled or Stata/SE,
running on a computer with enough physical memory, we can set Stata's memory allocation
higher with the set memory command. To allocate 20 megabytes to data, type

. set memory 20m

Current memory allocation

                    current                                 memory usage
    settable        value      description                  (1M = 1024k)
    --------------------------------------------------------------------
    set maxvar      5000       max. variables allowed            1.733M
    set memory      20M        max. data space                  20.000M
    set matsize     400        max. RHS vars in models           1.254M
                                                            ------------
                                                                 22.987M

If there are data already in memory, first type the command clear to remove them. To reset
the memory allocation "permanently," so it will be the same next time we start up, type

. set memory 20m, permanently

In the example given earlier, gbank2.dta is an 11.3-megabyte dataset that would not fit into
the default allocation. Asking for a 20-megabyte allocation has now given us more than enough
room for these data.

. describe

Contains data from C:\data\gbank2.dta
  obs:  74,078                          Spring scientific surveys NAFO
 vars:                                    3KLNOPQ, 1971-93
 size:                                  2 Mar 2000 21:28
                                        (46.0% of memory free)

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------
rec_type        byte   %9.0g                  original case number
vessel          int    %4.0g                  Vessel
trip            int    %4.0g                  Trip number
set             int    %4.0g                  Set number
rank            byte   %8.0g
assembla        str7   %7s
year            int    %4.0g                  Year
month           byte   %4.0g                  Month
day             byte   %4.0g                  Day
set_type        byte   %8.0g       set_type   Set type
stratum         int    %8.0g                  Stratum or line fished
division        str2   %2s                    NAFO division
unit_are        str3   %3s                    Nfld. area grid map square
light           byte   %4.0g                  Light conditions
wind_dir        byte   %4.0g                  Wind direction
wind_for        byte   %4.0g                  Wind force
sea             byte   %4.0g
bottom          byte   %4.0g                  Type of bottom
time_mid        int    %8.0g                  Time (midpoint)
duration        byte   %8.0g                  Duration of set
tow_dist        int    %8.0g                  Distance towed
gear_oy         byte   %4.0g                  Operation of gear
depthcat        byte   %4.0g                  Category of depth
min_dept        int    %3.0g                  Depth (minimum)
max_dept        int    %3.0g                  Depth (maximum)
bot_dept        int    %3.0g                  Depth (bottom if MWT)
temp_sur        float  %3.0g                  Temperature (surface)
tempcat         byte   %5.0g                  Category of temperature
temp_fs_        float  %3.0g                  Temperature (fishing depth)
lat             float  %9.0g                  Latitude (decimal)
long            float  %9.0g                  Longitude (decimal)
pos_meth        byte   %2.0g
gear            int    %9.0g                  Gear
total           float  %4.0g
species         int    %8.0g                  Species
number          float  %9.0g                  Number of individual fish
weight          double %9.0g                  Catch weight in kilograms
latin           str31  %31s                   Species -- Latin name
common          str27  %27s                   Species -- common name
surtemp         float  %9.0g                  Surface temperature degrees C
fishtemp        float  %9.0g                  Fishing depth temperature C
depth           float  %9.0g                  Mean trawl depth in meters
ispecies        byte   %9.0g                  Indicator species
-------------------------------------------------------------------------
Sorted by:  id

When we describe the data (above), Stata reports "46.0% of memory free," meaning not 46% of the
computer's total resources, but 46% of the 20 megabytes we allocated for Stata data. It is
usually advisable to ask for more memory than our data actually require. Many statistical and
data-management operations consume additional memory, in part because they temporarily
create new variables as they work.

It is possible to set memory to values higher than the computer's available physical
memory. In that case, Stata uses "virtual memory," which is really disk storage. Although
virtual memory allows bypassing hardware limitations, it can be terribly slow. If you regularly
work with datasets that push the limits of your computer, you might soon conclude that it is
time to buy more memory.

Type help limits to see a list of limitations in Stata, not only on dataset size but also
on other design elements, including matrix sizes, command lengths, lengths of names, and numbers of
variables in commands. Some of these limitations can be adjusted by the user.
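For example (a sketch, not a command from the original text; 800 is an arbitrary value), the matrix-size limit shown in the memory table earlier can be raised with

. set matsize 800

which permits models with more right-hand-side variables, at the cost of additional memory.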

Graphs

Graphs appear in every chapter of this book, one indication of their value and integration
with other analyses in Stata. Indeed, graphics have always been one of Stata's strengths, and
reason enough for many users to choose Stata over other packages. The graph command
evolved incrementally from Stata versions 1 through 7. Stata version 8 marked a major step
forward, however: graph underwent a fundamental redesign, expanding its capabilities for
sophisticated, publication-quality analytical graphics. Output appearance and choices were
much improved as well. With the new graph command syntax and defaults, or alternatively
through the menus, attractive (and publishable) basic graphs are quite easy to draw.
Graphically ambitious users who visualize non-basic graphs will find their efforts supported by
a truly impressive array of tools and options, described in the 500-page Graphics Reference
Manual.

In the much shorter space of this chapter, the spectrum from elementary to creative
graphing will be covered, taking an example- rather than syntax-oriented approach (see the
Graphics Reference Manual or help graph for thorough coverage of syntax). We begin
by illustrating seven basic types of graphs.

histogram        histograms
graph twoway     two-variable scatterplots, line plots, and many others
graph matrix     scatterplot matrices
graph box        box plots
graph pie        pie charts
graph bar        bar charts
graph dot        dot plots

Each of these commands has numerous options; that is especially true for the versatile
graph twoway.

More specialized graphs such as symmetry plots, quantile plots, and quantile-normal plots
offer tools for examining details of variable distributions. A few examples of these, and also
of graphs for industrial quality control, appear in this chapter. Type help graph other
for more details.

Finally, the chapter concludes with techniques particularly useful in building data-rich, self-
contained graphics for publication. Such techniques include adding text to graphs, overlaying
multiple twoway plots, retrieving and reformatting saved graphs, and combining multiple
graphs into one. As our graphing commands grow more complicated, simple batch programs
(do-files) can help to write and re-use them. The full range of graphical choices goes far
beyond what this book can cover, but the concluding examples point out a few of the
possibilities. Later chapters supply further examples.

The Graphics menu provides point-and-click access to most of these graphing procedures.

A note to long-time Stata users: The graphical capabilities of Stata 8 and 9 outshine those
of earlier versions. For analysts comfortable with old Stata, there is much new material to
learn. Menus allow a quick entry, and the new graphics commands, like the old ones, follow
a consistent logic that becomes clear with practice. Fortunately, the changeover need not be
sudden. Version 7-style graphics remain available if needed. They have been moved to the
command graph7. For example, an old-version scatterplot would formerly have been drawn
by the command

. graph income education

which does not work in the newer Stata. Instead, the command

. graph7 income education

will reproduce the familiar old type of graph. The options of graph7 are similar to those of
the old-style graph. To see an updated version of this same scatterplot, type the new
graphics command

. graph twoway scatter income education

Further examples of new commands appear in the next section, which should give a sense of
what has changed (and what is familiar) with the redesigned graphical capabilities.

Example Commands

. histogram y, frequency
     Draws a histogram of variable y, showing frequencies on the vertical axis.

. histogram y, start(0) width(10) norm fraction
     Draws a histogram of y with bins 10 units wide, starting at 0. Adds a normal curve based on
     the sample mean and standard deviation, and shows the fraction of the data on the vertical axis.

. histogram y, by(x, total) fraction
     In one figure, draws separate histograms of y for each value of x, and also a "total"
     histogram for the sample as a whole.

. kdensity x, generate(xpoints xdensity) width(20) biweight
     Produces and graphs a kernel density estimate of the distribution of x. Two new variables
     are created: xpoints containing the x values at which the density is estimated, and xdensity
     with the density estimates themselves. width(20) specifies the halfwidth of the kernel,
     in units of the variable x. (If width() is not specified, the default follows a simple
     formula for "optimal.") The biweight option in this example calls for a biweight
     kernel, instead of the default epanechnikov.
. graph twoway scatter y x
     Displays a basic two-variable scatterplot of y against x.

. graph twoway lfit y x || scatter y x
     Visualizes the linear regression of y on x by overlaying two twoway graphs: the
     regression (linear fit, or lfit) line, and the y vs. x scatterplot. To include a 95%
     confidence band for the regression line, replace lfit with lfitci.

. graph twoway scatter y x, xlabel(0(10)100) ylabel(-3(1)6, horizontal)
     Constructs a scatterplot of y vs. x, with the x axis labeled at 0, 10, ..., 100. The y axis is
     labeled at -3, -2, ..., 6, with labels written horizontally instead of vertically (the default).

. graph twoway scatter y x, mlabel(country)
     Constructs a scatterplot of y vs. x, with data points (markers) labeled by the values of variable
     country.

. graph twoway scatter y x1, by(x2)
     In one figure, draws separate y vs. x1 scatterplots for each value of x2.

. graph twoway scatter y x1 [fweight = population], msymbol(Oh)
     Draws a scatterplot of y vs. x1. Marker symbols are hollow circles (Oh), with their size
     (area) proportional to the frequency-weight variable population.

. graph twoway connected y time
     A basic time plot of y against time. Data points are shown connected by line segments. To
     include line segments but no data-point markers, use line instead of connected:

. graph twoway line y time

. graph twoway line y1 y2 time
     Draws a time plot (in this example, a line plot) with two y variables that both have the same
     scale, and are graphed against an x variable named time.

. graph twoway line y1 time, yaxis(1) || line y2 time, yaxis(2)
     Draws a time plot with two y variables that have different scales, by overlaying two
     individual line plots. The left-hand y axis, yaxis(1), gives the scale for y1, while the
     right-hand y axis, yaxis(2), gives the scale for y2.

. graph matrix x1 x2 x3 x4 y
     Constructs a scatterplot matrix, showing all possible scatterplot pairs among the variables
     listed.

. graph box y1 y2 y3
     Constructs box plots of variables y1, y2, and y3.

. graph box y, over(x) yline(.22)
     Constructs box plots of y for each value of x, and draws a horizontal line at y = .22.

. graph pie a b c
     Draws one pie chart with slices indicating the relative amounts of variables a, b, and c. The
     variables must have similar units.

. graph bar (sum) a b c
     Shows the sums of variables a, b, and c as side-by-side bars in a bar chart. To obtain means
     instead of sums, type graph bar (mean) a b c. Other options include bars
     representing medians, percentiles, or counts of each variable.
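For instance, a median version follows the same pattern (a sketch along the same lines, not a command from the text):

. graph bar (median) a b c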
. graph bar (mean) a, over(x)
     Draws a bar chart showing the mean of variable a at each value of variable x.

. graph bar (asis) a b c, over(x) stack
     Draws a bar chart in which the values ("as is") of variables a, b, and c are stacked on top
     of one another, at each value of variable x.

. graph dot (median) y, over(x)
     Draws a dot plot, in which dots along a horizontal scale mark the median value of y at each
     level of x. Other options include means, percentiles, or counts of each variable.

. qnorm y
     Draws a quantile-normal plot (normal probability plot) showing quantiles of y versus
     corresponding quantiles of a normal distribution.

. rchart x1 x2 x3 x4 x5, connect(l)
     Constructs a quality-control R chart graphing the range of values represented by variables
     x1 through x5.

Graph options, such as those controlling titles, labels, and tick marks on the axes, are
common across graph types wherever this makes sense. Moreover, the underlying logic of
Stata's graph commands is consistent from one type to the next. These common elements are
the key to gaining graph-building fluency, as the basics begin to fall into place.
Histograms

Histograms, displaying the distribution of measurement variables, are most easily produced
with their own command, histogram. For examples, we turn to states.dta, which contains
selected environment and education measures on the 50 U.S. states plus the District of
Columbia (data from the League of Conservation Voters 1991; National Center for Education
Statistics 1992, 1993; World Resources Institute 1993).

. use states
(U.S. states data 1990-91)

. describe

Contains data from c:\data\states.dta
  obs:     51                           U.S. states data 1990-91
 vars:     21                           4 Jul 2005 12:07
 size:  4,080 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------
state           str20  %20s                   State
region          byte   %9.0g       region     Geographical region
pop             float  %9.0g                  1990 population
area            float  %7.2f                  Land area, square miles
density         float  %5.1f                  People per square mile
metro           float  %5.2f                  Metropolitan area population, %
waste           float  %5.2f                  Per capita solid waste, tons
energy          int    %8.0g                  Per capita energy consumed, Btu
miles           int    %8.0g                  Per capita miles/year, 1,000
toxic           float  %5.2f                  Per capita toxics released, lbs
green           float  %5.2f                  Per capita greenhouse gas, tons
house           byte   %8.0g                  House '91 environ. voting, %
senate          byte   %8.0g                  Senate '91 environ. voting, %
csat            int    %9.0g                  Mean composite SAT score
vsat            int    %8.0g                  Mean verbal SAT score
msat            int    %8.0g                  Mean math SAT score
percent         byte   %9.0g                  % HS graduates taking SAT
expense         int    %9.0g                  Per pupil expenditures prim&sec
income          long   %10.0g                 Median household income, $1,000
high            float  %9.0g                  % adults HS diploma
college         float  %9.0g                  % adults college degree
-------------------------------------------------------------------------
Sorted by:  state

Figure 3.1 shows a simple histogram of college, the percentage of a state's over-25
population with a bachelor's degree or higher. It was produced by the following command:

. histogram college, frequency title("Figure 3.1")

[Figure 3.1: frequency histogram of college (% adults college degree), x axis roughly 10 to 35]
Under the Prefs - Graph Preferences menu, we have the choice of several pre-designed
schemes for the default colors and shading of our graphs. Custom schemes can be defined
as well. The examples in this book employ the s2mono (monochrome) scheme, which among
other things calls for shaded margins around each graph. The s1mono scheme does not have
such margins. Experimenting with the different monochrome and color schemes helps to
determine which works best for a particular purpose. A graph drawn and saved under one
scheme can subsequently be retrieved and re-saved under a different one, as described later in
this chapter.

Options can be listed in any order following the comma in a graph command. Figure 3.1
illustrates two options: frequency (instead of density, the default) is shown on the vertical axis;
and the title Figure 3.1 appears over the graph. Once a graph is onscreen, menu choices
provide the easiest way to print it, save it to disk, or cut and paste it into another program such
as a word processor.

Figure 3.1 reveals the positive skew of this distribution, with a mode above 15 and an
outlier around 35. It is hard to describe the graph more specifically because the bars do not line
up with x-axis tick marks. Figure 3.2 contains a version with several improvements (based on
some quick experiments to find the right values):

1. The x axis is labeled from 12 to 34, in increments of 2.
2. The y axis is labeled from 0 to 12, in increments of 2.
3. Tick marks are drawn on the y axis from 1 to 13, in increments of 2.
4. The histogram's first bar (bin) starts at 12.
5. The width of each bar (bin) is 2.

. histogram college, frequency title("Figure 3.2") xlabel(12(2)34)
     ylabel(0(2)12) ytick(1(2)13) start(12) width(2)
[Figure 3.2: frequency histogram of college (% adults college degree), x axis labeled 12 to 34]
Figure 3.2 helps us to describe the distribution more specifically. For example, we now see that
in 13 states, the percent with college degrees is between approximately 16 and 18.

Other useful histogram options include:

bin(#)       Draw a histogram with # bins (bars). We can specify either bin(#)
             or, as in Figure 3.2, start(#) and width(#), but not both.

percent      Show percentages on the vertical axis. ylabel and ytick then refer
             to percentage values. Another possibility, frequency, is
             illustrated in Figure 3.2. We could also ask for the fraction of
             the data. The default histogram shows density, meaning that bars
             are scaled so that the sum of their areas equals 1.

gap(#)       Leave a gap between bars. # is relative, 0 <= # < 100; experiment
             to find a suitable value.

addlabels    Label the heights of histogram bars. A separate option,
             addlabopts, controls how the labels look.

discrete     Specify discrete data, requiring one bar for each value of x.

norm         Overlay a normal curve on the histogram, based on the sample mean
             and standard deviation.

kdensity     Overlay a kernel-density estimate on the histogram. The option
             kdenopts controls density computation; see help kdensity for
             details.

With histograms or most other graphs, we can also override the defaults and specify our
own titles for the horizontal and vertical axes. The option ytitle controls y-axis titles, and
xtitle controls x-axis titles. Figure 3.3 illustrates such titles, together with some other
histogram options. Note the incremental buildup from basic (Figure 3.1) to more elaborate
(Figure 3.3) graphs. This is the usual pattern of graph construction in Stata: we start simply,
then experimentally add options to earlier commands retrieved from the Review window, as we
work toward an image that most clearly presents our findings. Figure 3.3 actually is over-
elaborate, but drawn here to show off multiple options.

. histogram college, frequency title("Figure 3.3") ylabel(0(2)12)
     ytick(1(2)13) xlabel(12(2)34) start(12) width(2) addlabels
     norm gap(15)
[Figure 3.3: frequency histogram of college with labeled bar heights, an overlaid normal curve, and gaps between bars]

Suppose we want to see how the distribution of college varies by region. The by option
obtains a separate histogram for each value of region. Other options work as they do for single
histograms. Figure 3.4 shows an example in which we ask for percentages on the vertical axis,
and the data grouped into 8 bins.

. histogram college, by(region) percent bin(8)

[Figure 3.4: percent histograms of college in four panels (West, N. East, South, Midwest); graphs by Geographical region]

Figure 3.5, below, contains a similar set of four regional graphs, but includes a fifth that
shows the distribution for all regions combined.

. histogram college, percent bin(8) by(region, total)

[Figure 3.5: percent histograms of college in five panels (West, N. East, South, Midwest, Total); graphs by Geographical region]

Axis labeling, tick marks, titles, and the by(varname) or by(varname, total)
options work in a similar fashion with other Stata graphing commands, as seen in the following
sections.
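For instance (a sketch, not a command from the text), the same by() syntax produces separate box plots of college for each region:

. graph box college, by(region)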

Scatterplots

Basic scatterplots are obtained through commands of the general form

. graph twoway scatter y x

where y is the vertical or y-axis variable, and x the horizontal or x-axis one. For example, again
using the states.dta dataset, we could plot waste (per capita solid waste) against metro (percent
population in metropolitan areas), with the result shown in Figure 3.6. Each point in Figure 3.6
represents one of the 50 U.S. states (or Washington, DC).

. graph twoway scatter waste metro
[Figure 3.6: scatterplot of waste (per capita solid waste, tons) against metro (metropolitan area population, %, from 20.0 to 100.0)]
As with histograms, we can use xlabel, xtick, xtitle, etc. to control axis labels,
tick marks, or titles. Scatterplots also allow control of the shape, color, size, and other
attributes of markers. Figure 3.6 employs the default markers, which are solid circles. The
same effect would result if we included the option msymbol(circle), or wrote this option
in abbreviated form as msymbol(O). msymbol(diamond) or msymbol(D) would
produce a graph with diamond markers, and so forth. The following table lists possible shapes.

msymbol()            Abbreviation    Description
circle               O               circle, solid
diamond              D               diamond, solid
triangle             T               triangle, solid
square               S               square, solid
plus                 +               plus sign
x                    X               letter x
smcircle             o               small circle, solid
smdiamond            d               small diamond, solid
smsquare             s               small square, solid
smtriangle           t               small triangle, solid
smplus               smplus          small plus sign
smx                  x               small letter x
circle_hollow        Oh              circle, hollow
diamond_hollow       Dh              diamond, hollow
triangle_hollow      Th              triangle, hollow
square_hollow        Sh              square, hollow
smcircle_hollow      oh              small circle, hollow
smdiamond_hollow     dh              small diamond, hollow
smtriangle_hollow    th              small triangle, hollow
smsquare_hollow      sh              small square, hollow
point                p               very small dot
none                 i               invisible

The mcolor option controls marker colors. For example, the command

. graph twoway scatter waste metro, msymbol(S) mcolor(purple)

would produce a scatterplot in which the symbols were large purple squares. Type help
colorstyle for a list of available colors.

One interesting possibility with scatterplots is to make symbol size (area) proportional to
a third variable, thereby giving the data points different visual "weight." For example, we
might redraw the scatterplot of waste against metro, but make the symbol size reflect each
state's population (pop). This can be done as shown in Figure 3.7, using the [fweight = ]
(frequency weight) feature. Hollow circles, msymbol(Oh), provide a suitable shape.

Frequency weights are useful with some other graph types as well. Weighting can be a
deceptively complex topic, because "weights" come in several types, and have different
meanings in different contexts. For an overview of weighting in Stata, type help weight.

. graph twoway scatter waste metro [fweight = pop], msymbol(Oh)

[Figure 3.7: scatterplot of waste against metro with hollow-circle markers sized in proportion to state population]

Stata’s density-distribution sunflower plots provide an alternative to scatterplots for high-density data. Basically, they resemble scatterplots in which some of the individual data points are replaced with sunflower-like symbols to indicate more than one observation at that location. Figure 3.8 shows a sunflower-plot version of Figure 3.6, in which some of the flower symbols (those with four “petals”) represent up to four individual data points, or states. The table printed after the sunflower command provides a key regarding how many observations each flower represents. The number of petals and the darkness of the flower correspond to the density of data.

. sunflower waste metro, addplot(lfit waste metro)
Bin width         =   11.3714
Bin height        =   .286522
Bin aspect ratio  =   .0218209
Max obs in a bin  =   4
Light             =   3
Dark              =   13
X-center          =   67.55
Y-center          =   .96
Petal weight      =   1

  flower     petal     No. of     No. of     estimated     actual
    type    weight     petals    flowers          obs.       obs.
-----------------------------------------------------------------
    none                                            23         23
   light         1          3          5            15         15
   light         1          4          3            12         12
-----------------------------------------------------------------
                                                    50         50

Graphs


Figure 3.8  [sunflower plot of Per capita solid waste, tons against Metropolitan area population, %, with fitted regression line; legend: 1 petal = 1 obs.; Fitted values]

Sunflower plots are most useful with large datasets in which many observations plot at the same (or nearly identical) coordinates. The example in Figure 3.8 includes a regression line, added through the addplot() option. Marker labels, requested with the mlabel() option, print text identifying each data point. Labeling all 50 states, however, would turn the graph into a visual jumble. Concentrating on one region such as the West seems more practical. An if qualifier accomplishes this, producing the results seen in Figure 3.9 on the following page.


. graph twoway scatter waste metro if region==1, mlabel(state)

Figure 3.9  [scatterplot of waste against metro for Western states, each marker labeled with its state name: California, Oregon, Hawaii, Washington, New Mexico, Alaska, Idaho, Nevada, Arizona, Montana, Wyoming, Colorado, Utah]

Figure 3.10 (below) shows separate waste - metro scatterplots for each region. The relationship between these two variables appears noticeably steeper in the South and Midwest than it does in the West and Northeast, an impression we will later confirm. The ylabel and xlabel options in this example give the y- and x-axis labels three-digit (maximum) fixed display formats with no decimals, making them easier to read in the small subplots.

. graph twoway scatter waste metro, by(region)
    ylabel(, format(%3.0f)) xlabel(, format(%3.0f))
Figure 3.10  [scatterplots of waste against Metropolitan area population, % in four panels: West, N. East, South, Midwest; graphs by Geographical region]


Scatterplot matrices, produced by graph matrix, prove useful in multivariate analysis. They provide a compact display of the relationships between a number of variable pairs, allowing the analyst to scan for signs of nonlinearity, outliers, or clustering that might affect statistical modeling. Figure 3.11 shows a scatterplot matrix involving four variables from states.dta.

. graph matrix miles metro income waste, half msymbol(oh)

Figure 3.11  [lower-triangular scatterplot matrix of Per capita miles/year, 1,000; Metropolitan area population, %; Median household income, $1,000; and Per capita solid waste, tons]

The half option specified that Figure 3.11 should include only the lower triangular part of the matrix. The upper triangular part is symmetrical and, for many purposes, redundant. msymbol(oh) called for small hollow circles as markers, just as we might with a scatterplot. Control of the axes is more complicated, because there are as many axes as variables; type help graph_matrix for details.

When the variables of interest include one dependent or “effect” variable, and several independent or “cause” variables, it helps to list the dependent variable last in the graph matrix variable list. That results in a neat row of dependent-versus-independent variable graphs across the bottom.
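With many variables or observations, even the half matrix can grow crowded. One hedged variation on the command above uses the point marker, msymbol(p), to keep the panels legible:

. graph matrix miles metro income waste, half msymbol(p)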

Line Plots

Mechanically, line plots are scatterplots in which the points are connected by line segments. Like scatterplots, the various types of line plots belong to Stata’s versatile graph twoway family. The scatterplot options that control axis labeling and markers work much the same with line plots, too. New options control the characteristics of the lines themselves.
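One detail worth keeping in mind: twoway line connects points in the order the observations appear in the data. When the x variable is not already sorted (time-series years usually are), the sort option avoids a zigzag path. A sketch using the states.dta variables:

. graph twoway line waste metro, sort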
Line plots tend to have different uses than scatterplots. For example, as time plots they depict changes in a variable over time. Dataset cod.dta contains time-series data reflecting the unhappy story of Newfoundland’s Northern Cod fishery. This fishery, which had been among the world’s richest, collapsed in 1992 primarily due to overfishing.
Contains data from C:\data\cod.dta
  obs:            38                          Newfoundland's Northern Cod
                                                fishery, 1960-1997
 vars:             5                          4 Jul 2005 15:02
 size:           684 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
----------------------------------------------------------------------
year            int    %8.0g                  Year
cod             float  %8.0g                  Total landings, 1000t
Canada          int    %8.0g                  Canadian landings, 1000t
TAC             int    %8.0g                  Total Allowable Catch, 1000t
biomass         float  %9.0g                  Estimated biomass, 1000t
----------------------------------------------------------------------
Sorted by:  year

A simple time plot showing Canadian and total landings can be constructed by drawing line graphs of both variables against year. Figure 3.12 does this, showing the “killer spike” of international overfishing in the late 1960s, followed by a decade of Canadian fishing pressure in the 1980s, leading up to the 1992 collapse of the Northern Cod.

. graph twoway line cod Canada year
Figure 3.12  [time plot of Total landings, 1000t (solid) and Canadian landings, 1000t (dashed) against Year, 1960-2000]

In Figure 3.12, Stata automatically chose a solid line for the first-named variable, cod, and a dashed line for the second, Canada. A legend at the bottom explains these meanings. We could improve this graph by rearranging the legend, and suppressing the redundant y-axis title, as illustrated in Figure 3.13.

. graph twoway line cod Canada year, ytitle("")
    legend(label(1 "all nations") label(2 "Canada")
    position(2) ring(0) rows(2))

Figure 3.13  [time plot with shortened legend (“all nations”, “Canada”) placed inside the plot space at upper right, and no y-axis title]

The legend() option for Figure 3.13 breaks down as follows. Note that all of these suboptions occur within the parentheses following legend.

label(1 "all nations")    label the first-named variable “all nations”
label(2 "Canada")         label the second-named variable “Canada”
position(2)               place the legend at the 2 o’clock position (upper right)
ring(0)                   place the legend within the plot space
rows(2)                   organize the legend to have two rows

By shortening the default labels and placing them within the plot space, we leave more room to show the data and create a more attractive, readable figure. legend() works similarly for other graph styles that have legends. Type help legend_option to see a list of the many suboptions available.
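Legends need not sit inside the plot region, either. As a sketch, placing the legend in a single column to the right of the plot:

. graph twoway line cod Canada year, legend(position(3) cols(1))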

Figures 3.12 and 3.13 simply connect each data point with line segments. Several other connecting styles are possible, using the connect() option. For example,

    connect(stairstep)

or equivalently,

    connect(J)

will cause points to be connected in stairstep (flat, then vertical) fashion. Figure 3.14 illustrates with a stairstep time plot of the government-set Total Allowable Catch (TAC) variable from cod.dta.


. graph twoway line TAC year, connect(stairstep)

Figure 3.14  [stairstep time plot of Total Allowable Catch, 1000t against Year, 1960-2000]

Other connect choices are listed below. The default, straight line segments, corresponds to connect(direct) or connect(l). For more details, see help connectstyle.

connect( )     Abbreviation        Description
none           i                   do not connect
direct         l (letter “el”)     connect with straight lines
ascending      L                   direct, but only if x[i+1] > x[i]
stairstep      J                   flat, then vertical
stepstair                          vertical, then flat

Figure 3.15 (on the following page) repeats this stairstep plot of TAC, but with some enhancements of axis labels and titles. The option xtitle("") requests no x-axis title (because “year” is obvious). We added tick marks at two-year intervals to the x axis, labeled the y axis at intervals of 100, and printed y-axis labels horizontally instead of vertically (the default).


. graph twoway line TAC year, connect(stairstep) clpattern(dash)
    xtitle("") xtick(1960(2)2000) ytitle("Thousands of tons")
    ylabel(0(100)800, angle(horizontal))

Figure 3.15  [stairstep plot of TAC as a dashed line; y axis titled Thousands of tons, labeled 0-800 horizontally; x ticks at two-year intervals, 1960-2000]

Instead of letting Stata determine the line patterns (solid, dashed, etc.) in Figure 3.15, we used the clpattern(dash) option to call for a dashed line. Possible line pattern choices are listed in the table below (also see help linepatternstyle).


clpattern( )     Description
solid            solid line
dash             dashed line
dot              dotted line
dash_dot         dash then dot
shortdash        short dash
shortdash_dot    short dash followed by dot
longdash         long dash
longdash_dot     long dash followed by dot
blank            invisible line
formula          for example, clpattern("-.") or clpattern("--..")
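Pattern options combine freely with line width. A sketch (the clwidth() option parallels clpattern(); see help linewidthstyle for the available widths):

. graph twoway line cod year, clpattern(longdash_dot) clwidth(medthick)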


Before we move on to other examples and types, Figure 3.16 unites the three variables discussed in this section to create a single graphic showing the tragedy of the Northern Cod. Note how the connect(), clpattern(), and legend() options work in this three-variable context.

. graph twoway line cod Canada TAC year, connect(line line stairstep)
    clpattern(solid longdash dash) xtitle("") xtick(1960(2)2000)
    ytitle("Thousands of tons") ylabel(0(100)800, angle(horizontal))
    legend(label(1 "all nations") label(2 "Canada") label(3 "TAC")
    position(2) ring(0) rows(3))

Figure 3.16  [time plot of all nations (solid), Canada (long-dash), and TAC (dashed stairstep); y axis: Thousands of tons, 0-800; 1960-2000]


Connected-Line Plots

Options for graph twoway scatter and graph twoway line both apply to graph twoway connected as well. Figure 3.17 shows a default example, a connected-line time plot of the cod biomass variable (bio) from cod.dta.

. graph twoway connected bio year

Figure 3.17  [connected-line time plot of Estimated biomass, 1000t against Year, 1960-2000]

The dataset contains biomass values only for 1978 through 1997, resulting in much empty space in Figure 3.17. if qualifiers allow us to restrict the range of years. Figure 3.18, on the following page, does this. It also dresses up the image to show control of marker symbols, line patterns, axes, and legends. With cod landings and biomass both in the same image, we can see how the biomass was declining through the 1980s, before a crisis was officially recognized.

. graph twoway connected bio cod year if year > 1977 & year < 1999,
    msymbol(T Oh) clpattern(dash solid) xlabel(1978(2)1996)
    xtick(1979(2)1997) ytitle("Thousands of tons") xtitle("")
    ylabel(0(500)2500, angle(horizontal))
    legend(label(1 "Estimated biomass") label(2 "Total landings")
    position(2) rows(2) ring(0))
Figure 3.18  [connected-line plot of Estimated biomass (dashed, triangle markers) and Total landings (solid, hollow circles), in Thousands of tons, 1978-1996]

Other Twoway Plot Types
In addition to basic line plots and scatterplots, the graph twoway command encompasses a wide variety of other types. The following table lists the possibilities.

graph twoway    Description
scatter         scatterplot
line            line plot
connected       connected-line plot
scatteri        scatter with immediate arguments (data given in the command line)
area            line plot with shading
bar             twoway bar plot (different from graph bar)
spike           twoway spike plot
dropline        dropline plot (spikes dropped vertically or horizontally to given value)
dot             twoway dot plot (different from graph dot)
rarea           range plot, shading the area between high and low values
rbar            range plot with bars between high and low values
rspike          range plot with spikes between high and low values
rcap            range plot with capped spikes
rcapsym         range plot with spikes capped with symbols
rscatter        range plot with scatterplot marker symbols
rline           range plot with lines
rconnected      range plot with lines and markers
pcspike         paired-coordinate plot with spikes
pccapsym        paired-coordinate plot with spikes capped with symbols
pcarrow         paired-coordinate plot with arrows
pcbarrow        paired-coordinate plot with arrows having two heads
pcscatter       paired-coordinate plot with markers
pci             pcspike with immediate arguments
pcarrowi        pcarrow with immediate arguments
tsline          time-series plot
tsrline         time-series range plot
mband           straight line segments connect the (x, y) cross-medians within bands
mspline         cubic spline curve connects the (x, y) cross-medians within bands
lowess          LOWESS (locally weighted scatterplot smoothing) curve
lfit            linear regression line
qfit            quadratic regression curve
fpfit           fractional polynomial plot
lfitci          linear regression line with confidence band
qfitci          quadratic regression curve with confidence band
fpfitci         fractional polynomial plot with confidence band
function        line plot of function
histogram       histogram plot
kdensity        kernel density plot

The usual options to control line patterns, marker symbols, and so forth work where appropriate with all twoway commands. For more information about a particular command, type help twoway_mband, help twoway_function, etc. (using any of the names above). Note that graph twoway bar is a different command from graph bar. Similarly, graph twoway dot differs from graph dot. The twoway versions provide various methods for plotting a measurement y variable against a measurement x variable, analogous to a scatterplot or a line plot. The non-twoway versions, on the other hand, provide ways to plot summary statistics (such as means or medians) of one or more measurement y variables against categories of one or more x variables. The twoway versions thus are comparatively specialized, although (as with all twoway plots) they can be overlaid with other twoway plots for more complex graphical effects.
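Overlaying itself is straightforward: separate the component plots with || , or enclose each in parentheses. As a sketch, a scatterplot of waste against metro overlaid with its linear regression line:

. graph twoway scatter waste metro || lfit waste metro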
Many of these plot types are most useful in composite figures, constructed by overlaying
two or more simple plots as described later in this chapter. Others produce nice stand-alone
graphs. For example, Figure 3.19 shows an area plot of the Newfoundland cod landings.


. graph twoway area cod Canada year, ytitle("")

Figure 3.19  [area plot of Total landings, 1000t and Canadian landings, 1000t against Year, 1960-2000]

The shading in area graphs and other types with shaded regions can be controlled through the option bcolor. Type help colorstyle for a list of the available colors, which include gray scales. The darkest gray, gs0, is actually black. The lightest gray, gs16, is white. Other values are in between. For example, Figure 3.20 shows a light-gray version of this graph.


. graph twoway area cod Canada year, ytitle("") bcolor(gs12 gs14)
Figure 3.20  [light-gray area plot of Total landings, 1000t and Canadian landings, 1000t against Year, 1960-2000]

Unusually cold atmosphere/ocean conditions played a secondary role in Newfoundland’s fisheries disaster, which involved not only the Northern Cod but also other species and populations. For example, key fish species in the neighboring Gulf of St. Lawrence declined during this period as well (Hamilton, Haedrich and Duncan 2003). Dataset gulf.dta describes environment and Northern Gulf cod catches (raw data from DFO 2003).
Contains data from C:\data\gulf.dta
  obs:            56                          Gulf of St. Lawrence environment
                                                and cod fishery
 vars:             7                          10 Jul 2005 11:51
 size:         1,344 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
----------------------------------------------------------------------
winter          int    %8.0g                  Winter
minarea         float  %9.0g                  Minimum ice area, 1000 km^2
maxarea         float  %9.0g                  Maximum ice area, 1000 km^2
mindays         byte   %8.0g                  Minimum ice days
maxdays         byte   %8.0g                  Maximum ice days
cil             float  %9.0g                  Cold Intermediate Layer
                                                temperature minimum, C
cod             float  %9.0g                  N. Gulf cod catch, 1000 tons
----------------------------------------------------------------------
Sorted by:  winter

The maximum annual ice cover averaged 173,017 km^2 during these years.

. summarize maxarea

    Variable |     Obs        Mean    Std. Dev.       Min        Max
-------------+------------------------------------------------------
     maxarea |      38    173.0172    37.18623     47.8901   220.1905


Figure 3.21 uses this mean (173 thousand) as the base for a spike plot, in which spikes above and below the line show above- and below-average ice cover, respectively. The yline(173) option draws a horizontal line at 173.

. graph twoway spike maxarea winter if winter > 1963, base(173)
    yline(173) ylabel(40(20)220, angle(horizontal)) xlabel(1965(5)2000)
Figure 3.21  [spike plot of Maximum ice area, 1000 km^2 against Winter, 1965-2000, with spikes based at the mean line, 173]

The base() format of Figure 3.21 emphasizes the succession of unusually harsh winters (above-average maximum ice cover) during the late 1980s and early 1990s, around the time of Newfoundland’s fisheries crisis. We also see an earlier spell of mild winters in the early 1980s, and hints of a recent warming trend.

A different view of the same data, in Figure 3.22, employs lowess regression to smooth the time series. The bandwidth option, bwidth(.4), specifies a curve based on smoothed data points that are calculated from weighted regressions within a moving band containing 40% of the sample. Lower bandwidths such as bwidth(.2), or 20% of the data, would give us a more jagged, less smoothed curve that more closely resembles the raw data. Higher bandwidths such as bwidth(.8), the default, will smooth more radically. Regardless of the bandwidth chosen, smoothed points towards either extreme of the x values must be calculated from increasingly narrow bands, and therefore will show less smoothing. Chapter 8 contains more about lowess smoothing.
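For comparison, a sketch of the same smoothed plot at the heavier default bandwidth:

. graph twoway lowess maxarea winter if winter > 1963, bwidth(.8)
    yline(173)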


. graph twoway lowess maxarea winter if winter > 1963, bwidth(.4)
    yline(173) ylabel(40(20)220, angle(horizontal)) xlabel(1965(5)2000)



Figure 3.22  [lowess-smoothed curve (bwidth .4) of maximum ice area against winter, 1965-2000, with line at 173]

Range plots connect high and low y values at each level of x, using bars, spikes, or shaded areas. Daily stock market prices are often graphed in this way. Figure 3.23 shows a capped-spike range plot using the minimum and maximum ice cover variables from gulf.dta.

. graph twoway rcap minarea maxarea winter if winter > 1963,
    ylabel(0(20)220, angle(horizontal)) ytitle("Ice area, 1000 km^2")
    xlabel(1965(5)2000)

Figure 3.23  [capped-spike range plot of minimum to maximum Ice area, 1000 km^2 against Winter, 1965-2000]


These examples by no means exhaust the possibilities for twoway graphs. Other
applications appear throughout the book. Later in this chapter, we will see examples involving
overlays of two or more twoway graphs, forming a single image.

Box Plots
Box plots convey information about center, spread, symmetry, and outliers at a glance. To
obtain a single box plot, type a command of the form

. graph box y

If several different variables have roughly similar scales, we can visually compare their distributions through commands of the form

. graph box w x y z

One of the most common applications for box plots involves comparing the distribution of one variable over categories of a second. Figure 3.24 compares the distribution of college across states of four U.S. regions, from dataset states.dta.

. graph box college, yline(19.1) over(region)
Figure 3.24  [box plots of % adults with college degree over four regions: West, N. East, South, Midwest; horizontal line at 19.1]

The median proportion of adults with college degrees tends to be highest in the Northeast and lowest in the South. On the other hand, southern states are more variable. Regional medians (lines within boxes) in Figure 3.24 can be compared visually to the 50-state median indicated by the yline(19.1) option. This median was obtained by typing

. summarize college if region < ., detail


Chapter 4 describes the summarize, detail command. The if region < . qualifier above restricted our analysis to observations that have nonmissing values of region; that is, to every place except Washington, DC.
The box in a box plot extends from approximate first to third quartiles, a distance called the interquartile range (IQR). It therefore contains approximately the middle 50% of the data. Outliers, defined as observations more than 1.5 IQR beyond the first or third quartile, are plotted individually in a box plot. No outliers appear among the four distributions in Figure 3.24. Stata’s box plots define quartiles in the same manner as summarize, detail. This is not the same approximation used to calculate “fourths” for letter-value displays, lv (Chapter 4). See Frigge, Hoaglin, and Iglewicz (1989) and Hamilton (1992b) for more about quartile approximations and their role in identifying outliers.
Numerous options control the appearance, shading, and details of boxes in a box plot; see help graph_box for a list. Figure 3.25 demonstrates some of these options, and also the horizontal arrangement of graph hbox, using per capita energy consumption from states.dta. The option over(region, sort(1)) calls for boxes sorted in ascending order according to their medians on the first-named (and in this case, the only) y variable. intensity(30) controls the intensity of shading in the boxes, setting this somewhat lower (less dark) than the default seen in Figure 3.24. Counterintuitively, the vertical line marking the overall median (320) in Figure 3.25 requires a yline option, rather than xline.


. graph hbox energy, over(region, sort(1)) yline(320) intensity(30)

Figure 3.25  [horizontal box plots of Per capita energy consumed, Btu, with regions sorted by median: N. East, Midwest, West, South; line at 320; axis 200-1,000]

The energy box plots in Figure 3.25 make clear not only the differences among medians, but also the presence of outliers — four very high-consumption states in the West and South. With a bit of further investigation, we find that these are oil-producing states: Wyoming, Alaska, Texas, and Louisiana. Box plots excel at drawing attention to outliers, which are easily overlooked (and often cause trouble) in other steps of statistical analysis.
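One way to draw even more attention to such outliers is to label the outside values directly. A sketch, assuming we want each outlier marked with its state name:

. graph hbox energy, over(region) marker(1, mlabel(state))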


Pie Charts
Pie charts are popular tools for “presentation graphics,” although they have little value for analytical work. Stata’s basic pie chart command has the form

. graph pie w x y z

where the variables w, x, y, and z all measure quantities of something in similar units (for example, all are in dollars, hours, or people).
Dataset AKethnic.dta, on the ethnic composition of Alaska’s population, provides an illustration. Alaska’s indigenous Native population divides into three broad cultural/linguistic groups: Aleut, Indian (including Athabaskan, Tlingit, and Haida), and Eskimo (Yupik and Inupiat). The variables aleut, Indian, eskimo, and nonnativ are population counts for each group, taken from the 1990 U.S. Census. This dataset contains only three observations, representing three types or sizes of communities: cities of 10,000 people or more; towns of 1,000 to 10,000; and villages with fewer than 1,000 people.
Contains data from C:\data\AKethnic.dta
  obs:             3                          Alaska ethnicity 1990
 vars:             7                          4 Jul 2005 12:06
 size:            63 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
----------------------------------------------------------------------
comtype         byte   %8.0g       popcat     Community type (size)
pop             float  %9.0g                  Population
n               int    %8.0g                  number of communities
aleut           int    %8.0g                  Aleut
Indian          int    %8.0g                  Indian
eskimo          int    %8.0g                  Eskimo
nonnativ        float  %9.0g                  Non-Native
----------------------------------------------------------------------
Sorted by:

The majority of the state’s population is non-Native, as clearly seen in a pie chart (Figure 3.26). The option pie(3, explode) causes the third-named variable, eskimo, to be “exploded,” or offset from the rest of the pie. pie(4, color(gs13)) causes the fourth slice to be shaded a light gray; other possibilities such as color(blue) or color(cranberry) exist. Type help colorstyle for a list. plabel(3 percent, gap(20)) causes a percentage label for the third slice to be placed at a gap of 20 relative radial units from the center. We see that about 8% of Alaska’s population is Eskimo (Inupiat or Yupik). The legend option calls for a four-row box placed at the 11 o’clock position within the plot space.

. graph pie aleut Indian eskimo nonnativ, pie(3, explode)
    pie(4, color(gs13)) plabel(3 percent, gap(20))
    legend(position(11) rows(4) ring(0))
Figure 3.26  [pie chart with slices Aleut, Indian, Eskimo (exploded, labeled 8.072%), and Non-Native]

Non-Natives are the dominant group in Figure 3.26, but if we draw separate pies for each type of community by adding a by(comtype) option, new details emerge (Figure 3.27, next page). The option angle0() specifies the angle of the first slice of pie. Setting this first-slice angle at 0 (horizontal) orients the pies in Figure 3.27 in such a way that the labels are more readable. The figure shows that whereas Natives are only a small fraction of the population in Alaska cities, they constitute the majority among those living in villages. In particular, Eskimos make up a large fraction of villagers — 35% across all villages, and more than 90% in some. This gives Alaska villages a different character from Alaska cities.


. graph pie aleut Indian eskimo nonnativ, pie(3, explode)
    pie(4, color(gs13)) plabel(3 percent, gap(8))
    legend(rows(1)) by(comtype) angle0(0)

Figure 3.27  [separate pie charts for villages (Eskimo 34.67%), towns (Eskimo 8.141%), and cities (Eskimo 2.332%); legend: Aleut, Indian, Eskimo, Non-Native; graphs by Community type (size)]

Bar Charts
Although they contain less information than box plots, bar charts provide simple and versatile displays for comparing sets of summary statistics such as means, medians, sums, or counts. To obtain vertical bars showing the mean of y across categories of x, for example, type

. graph bar (mean) y, over(x)

For horizontal bars showing the sum of y across categories of x1 within categories of x2, type

. graph hbar (sum) y, over(x1) over(x2)
The bar chart could display any of the following statistics:

mean      Means (the default; used if the type of statistic is not specified)
sd        Standard deviations
sum       Sums
rawsum    Sums ignoring optionally specified weight
count     Numbers of nonmissing observations
max       Maximums
min       Minimums
median    Medians
p1        1st percentiles
p2        2nd percentiles (and so forth to p99)
iqr       Interquartile ranges

This menu of summary statistics is the same as that available for the collapse command, and for a number of other commands including graph dot (next section) and table (Chapter 4).
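Different statistics can also be mixed within a single chart by prefixing each variable list separately. A sketch (assuming the statehealth.dta smoking variables described below):

. graph bar (mean) smokeM (median) smokeF, over(region)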

Dataset statehealth.dta contains data on the U.S. states, combining socioeconomic measures from the 1990 Census with several health-risk indicators from the Centers for Disease Control (2003), averaged over 1994-98.
Contains data from C:\data\statehealth.dta
  obs:            51                          Health indicators 1994-96 (CDC)
 vars:            12                          9 Jul 2005 11:56
 size:         3,315 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
----------------------------------------------------------------------
state           str20  %20s                   US State
region          byte   %9.0g       region     Geographical region
income          long   %10.0g                 Median household income, 1990
income2         float  %11.0g      income2    Median income low or high
high            float  %9.0g                  % adults HS diploma, 1990
college         float  %9.0g                  % adults college degree, 1990
overweight      float  %9.0g                  % overweight
inactive        float  %9.0g                  % inactive in leisure time
smokeM          float  %9.0g                  % male adults smoking
smokeF          float  %9.0g                  % female adults smoking
smokeT          float  %9.0g                  % adults smoking
motor           float  %9.0g                  Age-adjusted motor-vehicle
                                                related deaths/100,000
----------------------------------------------------------------------
Sorted by:  state

Figure 3.28 shows a bar chart of median inactivity (% inactive in leisure time) by region, with each bar labeled by its value through the blabel(bar) option. Median inactivity rates are highest in the South (36%), and lowest in the West (21%). Note that the vertical axis describes the y variable, of which there is only one here.


. graph bar (median) inactive, over(region) blabel(bar)
    bar(1, bcolor(gs10))
Figure 3.28  [bar chart of median % inactive in leisure time by region (West, N. East, South, Midwest); bar labels 20.9, 28.3, 36.05, 29.1]

Figure 3.29 (below) elaborates this idea by adding a second variable, overweight, and coloring its bars a darker gray. The bar labels are size(medium) in Figure 3.29, making them larger than the defaults, size(small), used in Figure 3.28. Other possibilities for size() suboptions include labels that are tiny, medsmall, medlarge, or large. See help textsizestyle for a complete list. Figure 3.29 shows that regional differences in the prevalence of overweight individuals are less pronounced than differences in inactivity, although both variables’ medians are highest in the South and Midwest.


. graph bar (median) inactive overweight, over(region)
    blabel(bar, size(medium))
    bar(1, bcolor(gs10)) bar(2, bcolor(gs7))

Figure 3.29  [bar chart of P 50 of inactive and P 50 of overweight by region (West, N. East, South, Midwest); labels include 20.9, 27.6, 28.3, 27.1, 36.05, 31.3, 29.1, 31.2]

Figure 3.30 compares mean motor-vehicle related fatality rates for low and high-income states (states having median household incomes below or above the national median), revealing a striking correlation with wealth. Within each region, the lower-income states exhibit higher mean fatality rates. Across both income categories, fatality rates are higher in the South, and lower in the Northeast. The order of the two over() options in the command controls their order in organizing the chart. For this example we chose a horizontal bar chart, or hbar. In such horizontal charts, ytitle, yline, etc. refer to the horizontal axis. yline(17.2) marks the overall mean.


. graph hbar (mean) motor, over(income2) over(region) yline(17.2)
ytitle("Mean motor-vehicle related fatalities/100,000")
Figure 3.30  [horizontal bar chart of Mean motor-vehicle related fatalities/100,000 by income (Low, High) within region (West, N. East, South, Midwest); axis 0-25]

Bars also can be stacked, as shown in Figure 3.31. This plot, based on the Alaska ethnicity data (AKethnic.dta), employs all the defaults to display ethnic composition by type of community (village, town, or city).
. graph bar (sum) nonnativ aleut Indian eskimo, over(comtyp) stack

Figure 3.31  [stacked bar chart of sums of nonnativ, aleut, Indian, and eskimo over community types: villages, towns, cities]


Figure 3.32 redraws this plot with better legend and axis labels. The over() option now relabels the community types so the horizontal axis is more informative. The legend() option specifies four rows in the same vertical order as the bars themselves, placed in the 11 o’clock position inside the plot space. It also improves the legend labels. ytitle, ylabel, and ytick options format the vertical axis.
. graph bar (sum) nonnativ aleut Indian eskimo,
    over(comtype, relabel(1 "Villages <1,000" 2 "Towns 1,000-10,000"
    3 "Cities >10,000"))
    legend(rows(4) order(4 3 2 1) position(11) ring(0)
    label(1 "Non-native") label(2 "Aleut")
    label(3 "Indian") label(4 "Eskimo"))
    stack ytitle(Population)
    ylabel(0(100000)300000) ytick(50000(100000)350000)

Figure 3.32  [stacked bar chart of Population by community type (Villages <1,000; Towns 1,000-10,000; Cities >10,000); legend: Eskimo, Indian, Aleut, Non-native]

Whereas the pie charts in Figure 3.27 showed the relative sizes of the ethnic groups within each community type, this bar chart shows their absolute sizes. Consequently, Figure 3.32 tells us something Figure 3.27 could not: the majority of Alaska’s population lives in its cities.
Dot Plots

Dot plots serve much the same purpose as bar charts: visually comparing statistical summaries
of one or more measurement variables. The organization of graph dot commands is
similar to that of graph bar. For example, to obtain a dot plot
comparing the medians of variables x, y, z, and w, type


. graph dot (median) x y z w


For a dot plot comparing the mean of y across categories of x, type


. graph dot (mean) y, over(x)

Figure 3.33 shows a dot plot of male and female smoking rates by region, from
statehealth.dta. The over option includes a suboption, sort(smokeM), which calls for
the regions to be sorted in order of their mean values of smokeM. The marker options
specify a solid triangle (msymbol(T)) for smokeM and a hollow circle (msymbol(Oh))
for smokeF.

. graph dot (mean) smokeM smokeF, over(region, sort(smokeM))
     marker(1, msymbol(T)) marker(2, msymbol(Oh))

Figure 3.33

[Dot plot: mean smokeM (solid triangle) and mean smokeF (hollow circle), 0-30, for the West, N. East, Midwest, and South regions, sorted by smokeM]

Although Figure 3.33 displays only eight means, it does so in a way that facilitates several
different comparisons. A further advantage of dot plots is their compactness. Dot plots
(particularly when rows are sorted by the statistic of interest, as in Figure 3.33) remain
easily readable even with a dozen or more rows.

Symmetry and Quantile Plots

Symmetry and quantile plots require more effort to read than histograms and other
summary graphs, but convey more detailed information.


A histogram of per-capita energy consumption in the 50 U.S. states (from states.dta)
appears in Figure 3.34. The distribution includes a handful of very high-consumption states,
which happen to be oil producers. A superimposed normal (Gaussian) curve indicates that
the distribution has a lighter-than-normal left tail and a heavier-than-normal right tail:
the definition of positive skew.

. histogram energy, start(100) width(100) xlabel(0(100)1000)
     frequency norm

Figure 3.34

[Frequency histogram of per-capita energy consumed (Btu), x axis labeled 0 to 1000, with an overlaid normal curve]


Figure 3.35 depicts this distribution as a symmetry plot. It plots the distance of the ith
observation above the median (vertical) against the distance of the ith observation below the
median. All points would lie on the diagonal line if this distribution were symmetrical. Instead,
we see that distances above the median grow steadily larger than corresponding distances below
the median, a symptom of positive skew. Unlike Figure 3.34, Figure 3.35 also reveals that the
energy-consumption distribution is approximately symmetrical near its center.
. symplot energy
Figure 3.35

[Symmetry plot of per-capita energy consumed (Btu): distance above median vs. distance below median (0-150), with a diagonal reference line]
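The symmetry-plot construction is easy to verify outside Stata. A minimal Python sketch (illustrative only; the sample values are made up, and symplot handles details, such as even sample sizes, that this ignores):

```python
# Symmetry-plot coordinates: for the i-th ordered value from each end,
# plot (distance below median, distance above median).
from statistics import median

def symplot_coords(values):
    xs = sorted(values)
    med = median(xs)
    n = len(xs)
    # pair the i-th smallest with the i-th largest observation
    return [(med - xs[i], xs[n - 1 - i] - med) for i in range(n // 2)]

# Positive skew: distances above the median outgrow those below it,
# so the points climb away from the diagonal line.
print(symplot_coords([1, 2, 3, 4, 5, 7, 10, 20, 40]))
# -> [(4, 35), (3, 15), (2, 5), (1, 2)]
```

In a symmetrical sample the two distances in each pair would be equal, putting every point on the diagonal.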

Quantiles are values below which a certain fraction of the data lie. For example, a .3
quantile is that value higher than 30% of the data. If we sort n observations in ascending order,
the ith value forms the (i - .5)/n quantile. The following commands would calculate quantiles
of variable energy:

. drop if energy >= .
. sort energy
. generate quant = (_n - .5)/_N

As mentioned in Chapter 2, _n and _N are Stata system variables, always unobtrusively present
when there are data in memory. _n represents the current observation number, and _N the total
number of observations.
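The same (i - .5)/n calculation can be checked outside Stata; a small Python sketch with made-up data, for illustration only:

```python
# Quantile positions: after sorting, the i-th of n values (i = 1..n)
# lies at the (i - .5)/n quantile, mirroring (_n - .5)/_N in Stata.
def quantile_positions(values):
    xs = sorted(values)
    n = len(xs)
    return [((i - 0.5) / n, x) for i, x in enumerate(xs, start=1)]

for q, x in quantile_positions([300, 120, 180, 240, 900]):
    print(f"{x:>4} is the {q:.1f} quantile")
```

With five observations the ordered values fall at the .1, .3, .5, .7, and .9 quantiles, so the middle value (here 240) is the median.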


Quantile plots automatically calculate what fraction of the observations lie below each data
value, and display the results graphically as in Figure 3.36. Quantile plots provide a graphic
reference for someone who does not have the original data at hand. From well-labeled quantile
plots, we can estimate order statistics such as median (.5 quantile) or quartiles (.25 and .75
quantiles). The IQR equals the rise between .25 and .75 quantiles. We could also read a
quantile plot to estimate the fraction of observations falling below a given value.
. quantile energy
Figure 3.36

[Quantile plot of energy: quantiles of per-capita energy consumed vs. fraction of the data, with reference lines at .25, .5, and .75]


Quantile-normal plots, also called normal probability plots, compare quantiles of a
variable's distribution with quantiles of a theoretical normal distribution having the same mean
and standard deviation. They allow visual inspection for departures from normality in every
part of a distribution, which can help guide decisions regarding normality assumptions and
efforts to find a normalizing transformation. Figure 3.37, a quantile-normal plot of energy,
confirms the severe positive skew that we had already observed. The grid option calls for
a set of lines marking the .05, .10, .25 (first quartile), .50 (median), .75 (third quartile), .90, and
.95 quantiles of both distributions. The .05, .50, and .95 quantile values are printed along the
top and right-hand axes.
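The coordinates behind such a plot can be sketched with Python's standard library (an approximation of the idea; Stata's qnorm may use slightly different plotting positions):

```python
# Quantile-normal coordinates: pair each ordered value with the quantile
# of a normal distribution (same mean and sd) at the (i - .5)/n fraction.
from statistics import NormalDist, mean, stdev

def qnorm_coords(values):
    xs = sorted(values)
    n = len(xs)
    ref = NormalDist(mean(xs), stdev(xs))
    return [(ref.inv_cdf((i - 0.5) / n), x) for i, x in enumerate(xs, 1)]

# A heavy right tail pulls the largest data values above the diagonal.
for inv_normal, x in qnorm_coords([100, 110, 120, 130, 140, 150, 400]):
    print(f"{inv_normal:8.1f}  {x}")
```

If the data were exactly normal, each value would equal its inverse-normal counterpart and the points would fall on the diagonal line.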
. qnorm energy, grid
Figure 3.37

[Quantile-normal plot: quantiles of per-capita energy consumed vs. Inverse Normal; the .05, .50, and .95 quantile values (111.872, 354.5, 597.129) are printed along the top and right-hand axes]

Grid lines are 5, 10, 25, 50, 75, 90, and 95 percentiles


Quantile-quantile plots resemble quantile-normal plots, but they compare quantiles
(ordered data points) of two empirical distributions instead of comparing one empirical
distribution with a theoretical normal distribution. On the following page, Figure 3.38 shows
a quantile-quantile plot of the mean math SAT score versus the mean verbal SAT score in 50
states and the District of Columbia. If the two distributions were identical, we would see points
along the diagonal line. Instead, data points form a straight line roughly parallel to the
diagonal, indicating that the two variables have different means but similar shapes and standard
deviations.


. qqplot msat vsat

Figure 3.38

[Quantile-quantile plot of mean math SAT score vs. mean verbal SAT score (roughly 400-550), with a diagonal reference line]

Regression with Graphics (Hamilton 1992a) includes an introduction to reading
quantile-based plots. Chambers et al. (1983) provide more details. Related Stata commands
include pnorm (standard normal probability plot), pchi (chi-squared probability plot), and
qchi (quantile-chi-squared plot).

Quality Control Graphs

Quality control charts help to monitor output from a repetitive process such as industrial
production. Stata offers four basic types: c chart, p chart, R chart, and x chart. A fifth type,
called Shewhart after the inventor of these methods, consists of vertically aligned x and R
charts. Iman (1994) provides a brief introduction to R and x charts, including the tables used
in calculating their control limits. The Base Reference Manual gives the command details and
formulas used by Stata. Basic outlines of these commands are as follows:
. cchart defects unit

Constructs a c chart with the number of nonconformities or defects (defects) graphed
against the unit number (unit). Upper and lower control limits, based on the assumption
that the number of nonconformities per unit follows a Poisson distribution, appear as
horizontal lines in the chart. Observations with values outside these limits are said to be
"out of control."
. pchart rejects unit ssize

Constructs a p chart with the proportion of items rejected (rejects/ssize) graphed against
the unit number (unit). Upper and lower control limit lines derive from a normal
approximation, taking sample size (ssize) into account. If ssize varies across units, the
control limits will vary too, unless we add the option stabilize.
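The normal approximation behind those limits can be sketched as a generic 3-sigma calculation (this illustrates the textbook formula, not Stata's exact implementation): with overall proportion pbar and unit sample size n, the limits are pbar plus or minus 3 times sqrt(pbar(1 - pbar)/n).

```python
# Per-unit 3-sigma p-chart limits: pbar +/- 3*sqrt(pbar*(1 - pbar)/n),
# clipped to [0, 1]; limits vary across units when sample sizes differ.
from math import sqrt

def p_chart_limits(rejects, ssize):
    pbar = sum(rejects) / sum(ssize)   # overall proportion rejected
    out = []
    for n in ssize:
        half = 3 * sqrt(pbar * (1 - pbar) / n)
        out.append((max(0.0, pbar - half), min(1.0, pbar + half)))
    return pbar, out

# Illustrative reject counts and sample sizes for five units
pbar, limits = p_chart_limits([10, 12, 12, 10, 10], [53, 53, 52, 52, 51])
print(round(pbar, 3), [(round(lo, 3), round(hi, 3)) for lo, hi in limits])
```

Because the half-width shrinks with sqrt(n), units with larger samples get tighter limits, which is why unequal sample sizes produce the wavy limit lines that stabilize smooths away.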

. rchart x1 x2 x3 x4 x5, connect(1)

Constructs an R (range) chart using the replicated measurements in variables x1 through
x5, in this example five replications per sample. Graphs the range within each sample
against the sample number, and (optionally) connects successive ranges with line
segments. Horizontal lines indicate the mean range and control limits. Control limits are
estimated from the sample size if the process standard deviation is unknown. When sigma is
known, we can include this information in the command. For example, assuming sigma = 10,

. rchart x1 x2 x3 x4 x5, connect(1) std(10)

. xchart x1 x2 x3 x4 x5, connect(1)

Constructs an x (mean) chart using the replicated measurements in variables x1 through
x5. Graphs the mean within each sample against the sample number, and connects successive
means with line segments. The process mean is estimated from the mean of sample means,
and control limits from the sample size, unless we override these defaults. For example, if we
know that the process actually has mu = 50 and sigma = 10,


. xchart x1 x2 x3 x4 x5, connect(1) mean(50) std(10)

Alternatively, we could specify particular upper and lower control limits:

. xchart x1 x2 x3 x4 x5, connect(1) mean(50) upper(60) lower(40)

. shewhart x1 x2 x3 x4 x5, mean(50) std(10)

In one figure, vertically aligns an x chart with an R chart.
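When mu and sigma are not supplied, the textbook estimate behind the x chart (the calculation Iman's tables support) takes the grand mean plus or minus A2 times the mean range, where A2 is a tabled constant that depends on subgroup size. A Python sketch of that calculation, assuming the standard constant A2 = 0.729 for samples of four; Stata's internal computation may differ in detail:

```python
# Estimated x-bar chart limits: grand mean +/- A2 * mean range.
# A2 = 0.729 is the standard control-chart constant for subgroups of 4.
A2_N4 = 0.729

def xbar_limits(samples):
    means = [sum(s) / len(s) for s in samples]
    ranges = [max(s) - min(s) for s in samples]
    grand_mean = sum(means) / len(means)
    rbar = sum(ranges) / len(ranges)
    return grand_mean - A2_N4 * rbar, grand_mean, grand_mean + A2_N4 * rbar

# Three illustrative samples of four replicated measurements each
lcl, center, ucl = xbar_limits([(4.6, 2, 4, 3.6), (6.7, 3.8, 5.1, 4.7),
                                (4.6, 4.3, 4.5, 3.9)])
print(f"LCL = {lcl:.2f}, center = {center:.2f}, UCL = {ucl:.2f}")
```

A sample mean outside (LCL, UCL) would be flagged as "out of control," just as in the xchart figures below.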

To illustrate a p chart, we turn to the quality inspection data in quality1.dta.
Contains data from C:\data\quality1.dta
  obs:            16                   Quality control example 1
 vars:             3                   4 Jul 2005 12:07
 size:           112 (99.9% of memory free)

              storage  display    value
variable name   type   format     label     variable label
------------------------------------------------------------------
day             byte   %9.0g                Day sampled
ssize           byte   %9.0g                Number of units sampled
rejects         byte   %9.0g                Number of units rejected
------------------------------------------------------------------
Sorted by:

. list in 1/5

     +-----------------------+
     | day   ssize   rejects |
     |-----------------------|
  1. |   5      53        10 |
  2. |   8      53        12 |
  3. |  26      52        12 |
  4. |  21      52        10 |
  5. |   6      51        10 |
     +-----------------------+

Note that the sample size varies from unit to unit, and that the units (days) are not in order.


. pchart rejects day ssize

Figure 3.39

[p chart: fraction of units rejected vs. day sampled (0-60), with control limits]

2 units are out of control

Dataset quality2.dta, borrowed from Iman (1994:662), serves to illustrate rchart and
xchart. Variables x1 through x4 represent repeated measurements from an industrial
production process; 25 units with four replications each form the dataset.
Contains data from C:\data\quality2.dta
  obs:            25                   Quality control (Iman 1994:662)
 vars:             4                   4 Jul 2005 12:07
 size:           500 (99.9% of memory free)

              storage  display    value
variable name   type   format     label     variable label
------------------------------------------------------------------
x1              float  %9.0g
x2              float  %9.0g
x3              float  %9.0g
x4              float  %9.0g
------------------------------------------------------------------
Sorted by:

. list in 1/5

     +--------------------------+
     |  x1     x2     x3    x4  |
     |--------------------------|
  1. | 4.6      2      4   3.6  |
  2. | 6.7    3.8    5.1   4.7  |
  3. | 4.6    4.3    4.5   3.9  |
  4. | 4.9      6    4.8   5.7  |
  5. | 7.6    6.9    2.5   4.7  |
     +--------------------------+


Figure 3.40, an R chart, graphs variation in the process range over the 25 units. rchart
informs us that one unit's range is "out of control."

. rchart x1 x2 x3 x4, connect(1)

Figure 3.40

[R chart: sample range vs. sample number (1-25), with mean range and control limits]

1 unit is out of control

Figure 3.41, an x chart, shows variation in the process mean. None of these 25 means falls
outside the control limits.

. xchart x1 x2 x3 x4, connect(1)

Figure 3.41

[x chart: sample mean vs. sample number (1-25), with control limits]

0 units are out of control



Adding Text to Graphs

Titles, captions, and notes can be added to make graphs more self-explanatory. The default
versions of titles and subtitles appear above the plot space; notes (which might document the
data source, for instance) and captions appear below. These defaults can be overridden, of
course. Type help title_options for more information about placement of titles, or
help textbox_options for details concerning their content. Figure 3.42 demonstrates
the default versions of these four options in a scatterplot of the prevalence of smoking and
college graduates among U.S. states, using statehealth.dta. Figure 3.42 also includes titles for
both the left and right y axes, yaxis(1 2), and top and bottom x axes, xaxis(1 2).
Subsequent ytitle and xtitle options refer to the second axes specifically, by
including the axis(2) suboption. y axis 2 is not necessarily on the right, and x axis 2 is not
necessarily on top, as we will see later; but these are their default positions.
. graph twoway scatter smokeT college, yaxis(1 2) xaxis(1 2)
     title("This is the TITLE") subtitle("This is the SUBTITLE")
     caption("This is the CAPTION") note("This is the NOTE")
     ytitle("Percent adults smoking")
     ytitle("This is Y AXIS 2", axis(2))
     xtitle("Percent adults with Bachelor's degrees or higher")
     xtitle("This is X AXIS 2", axis(2))
Figure 3.42

[Scatterplot of smokeT vs. college showing the default positions of the TITLE and SUBTITLE (top), NOTE and CAPTION (bottom), with second axes titled Y AXIS 2 (right) and X AXIS 2 (top)]

Titles add text boxes outside of the plot space. We can also add text boxes at specified
coordinates within the plot space. Several outliers stand out in this scatterplot. Upon
investigation, they turn out to be Washington DC (highest college value, at far right), Utah
(lowest smokeT value, at bottom center), and Nevada (highest smokeT value, at upper left).
Text boxes provide a way for us to identify these observations within our graph, as
demonstrated in Figure 3.43. The option text(15.5 22.5 "Utah") places the word
"Utah" at position y = 15.5, x = 22.5 in the scatterplot, directly above Utah's data point.

Figure 3.43

[Scatterplot of smokeT vs. college with text() boxes inside the plot space labeling Utah, Nevada, and Washington DC]

Another way to make a graph more informative is to overlay more than one twoway plot
in the same image. The plot type lfitci, for example, graphs a linear regression line
together with its confidence interval band. Figure 3.44 shows a basic lfitci plot of
smokeT on college.

. graph twoway lfitci smokeT college

Figure 3.44

[lfitci plot: fitted regression line of smokeT on % adults college degree, 1990, with 95% CI band; legend: 95% CI, Fitted values]

A more informative graph results when we overlay a scatterplot on top of the regression
line plot, as seen in Figure 3.45. To do this, we essentially give two distinct graphing
commands, separated by "||".
. graph twoway lfitci smokeT college || scatter smokeT college
Figure 3.45

m _l

D)C0

o

E
wo
5

■u
(Q

(D

J3
ro

H

ss wk

**

ft -

Is -

U-

o
m m zr

O)

15

10



20
25
% adults college degree, 1990

95% Cl
% adults smoking

----------

30

Fitted values

35


The second plot (scatterplot) overprints the first in Figure 3.45. This order has
consequences for the default line style (solid, dashed, etc.) and also for the marker symbols
(squares, circles, etc.) used by each sub-plot. More importantly, it superimposes the scatterplot
points on the confidence bands so the points remain visible. Try reversing the order of the two
plots in the command, to see how this works.

Figure 3.46 takes this idea a step further, improving the image through axis labeling and
legend options. Because these options apply to the graph as a whole, not just to one of the
subplots, the options are placed after a second | | separator, followed by a comma. Most of
these options resemble those used in previous examples. The order(2 1) option here does
something new: it omits one of the three legend items, so that only two of them (2, the
regression line, followed by 1, the confidence interval) appear in the figure. Compare this
legend with Figure 3.45 to see the difference. Although we list only two legend items in Figure
3.46, it is still necessary to specify a rows(3) legend format as if all three were retained.
. graph twoway lfitci smokeT college || scatter smokeT college
  || , xlabel(12(2)34) ylabel(14(2)32, angle(horizontal))
     xtitle("Percent adults with Bachelor's degrees or higher")
     ytitle("Percent adults smoking")
     note("Data from CDC and US Census")
     legend(order(2 1) label(1 "95% c.i.") label(2 "regression line")
        rows(3) position(1) ring(0))
Figure 3.46

[Overlaid lfitci and scatterplot with relabeled axes (Percent adults with Bachelor's degrees or higher, 12-34; Percent adults smoking, 14-32) and a two-item legend (regression line, 95% c.i.) inside the plot at upper left]

Data from CDC and US Census

The two separate plots (lfitci and scatter) overlaid in Figure 3.46 share the same
y and x scales, so a single set of axes applies to both. When the variables of interest have
different scales, we need independently scaled axes. Figure 3.47 illustrates this with an overlay
of two line plots based on the Gulf of St. Lawrence environmental data in gulf.dta. This figure
combines time series of the minimum mean temperature of the Gulf's cold intermediate layer
waters (cil), in degrees Celsius, and maximum winter ice cover (maxarea), in thousands of
square kilometers. The cil plot makes use of yaxis(1), which by default is on the left. The
maxarea plot makes use of yaxis(2), which by default is on the right. The various
ylabel, ytitle, yline, and yscale options each include an axis(1) or
axis(2) suboption, declaring which y axis they refer to. Extra spaces inside the quotation
marks for ytitle provided a quick way to place the words of these titles where we want
them, near the numerical labels. (For a different approach, see Figure 3.48.) The text box
containing "Northern Gulf fisheries decline and collapse" is drawn with medium-wide margins
around the text; see help marginstyle for other choices. yscale(range())
options give both y axes a range wider than their data, with specific values chosen after
experimenting to find the best vertical separation between the two series.
. graph twoway line cil winter, yaxis(1) yscale(range(-1,3) axis(1))
     ytitle("Degrees C          ", axis(1)) yline(0)
     ylabel(-1(.5)1.5, axis(1) angle(horizontal) nogrid)
     text(1 1992 "Northern Gulf" "fisheries decline" "and collapse",
        box margin(medium))
  || line maxarea winter,
     yaxis(2) ylabel(50(50)200, axis(2) angle(horizontal))
     yscale(range(-100,221) axis(2))
     ytitle("          1000s of km^2", axis(2))
     yline(173.6, axis(2) lpattern(dot))
  || if winter > 1949,
     xtitle("") xlabel(1950(10)2000) xtick(1950(2)2002)
     legend(position(11) ring(0) rows(2) order(2 1)
        label(1 "Max ice area") label(2 "Min CIL temp"))
     note("Source: Hamilton, Haedrich and Duncan (2003); data from DFO (2003)")
Figure 3.47

[Two overlaid time series, 1950-2000: Min CIL temp (left axis, Degrees C, with a line at 0) and Max ice area (right axis, 1000s of km^2, with a dotted line at 173.6); a text box reads "Northern Gulf fisheries decline and collapse"]

Source: Hamilton, Haedrich and Duncan (2003); data from DFO (2003)

The text box on the right in Figure 3.47 marks the late-1980s and early-1990s period when
key fisheries including the Northern Gulf cod declined or collapsed. As the graph shows, the
fisheries declines coincided with the most sustained cold and ice conditions on record.


To place cod catches in the same graph with temperature and ice, we need three
independent vertical scales. Figure 3.48 involves three overlaid plots, with all y axes at left
(default). The basic forms of the three component plots are as follows:

connected maxarea winter

A connected-line plot of maxarea vs. winter, using y axis 3 (which will be leftmost in our final
graph). The y axis scale ranges from -300 to +220, with no grid of horizontal lines. Its title is
"Ice area, 1000 km^2." This title is placed in the "northwest" position, placement(nw).
line cil winter

A line plot of cil vs. winter, using y axis 2. Its y scale ranges from -4 to +3, with default labels.

connected cod winter

A connected-line plot of cod vs. winter, using y axis 1. The title placement is "southwest,"
placement(sw).

Bringing these three component plots together, the full command for Figure 3.48 appears
on the next page. y ranges for each of the overlaid plots were chosen by experimenting to find
the "right" amount of vertical separation among the three series. Options applied to the whole
graph restrict the analysis to years since 1959, specify legend and x axis labeling, and request
vertical grid lines.


. graph twoway connected maxarea winter, yaxis(3)
     yscale(range(-300,220) axis(3)) ylabel(50(50)200, nogrid axis(3))
     ytitle("Ice area, 1000 km^2", axis(3) placement(nw))
     clpattern(dash)
  || line cil winter, yaxis(2) yscale(range(-4,3) axis(2))
     ylabel(, nogrid axis(2))
     ytitle("CIL temperature, degrees C", axis(2)) clpattern(solid)
  || connected cod winter, yaxis(1) yscale(range(0,200) axis(1))
     ylabel(, nogrid axis(1))
     ytitle("Cod catch, 1000 tons", axis(1) placement(sw))
  || if winter > 1959,
     legend(ring(0) position(7) label(1 "Max ice area")
        label(2 "Min CIL temp") label(3 "Cod catch") rows(3))
     xtitle("") xlabel(1960(5)2000, grid)

Figure 3.48

[Three overlaid time series, 1960-2000, each on its own left-hand y axis: Max ice area (connected, dashed), Min CIL temp (solid line), and Cod catch, 1000 tons (connected); legend at lower left]

Graphing with Do-Files

Complicated graphics like Figure 3.48 require graph commands that are many physical lines
long (although Stata views the whole command as one logical line). Do-files, introduced in
Chapter 2, help in writing such multi-line commands. They also make it easy to save the
command for future re-use, in case we later want to modify the graph or draw it again.
The following commands, typed into Stata’s Do-file Editor and saved with the file name
fig03_48.do, become a new do-file for drawing Figure 3.48. Typing
. do fig03_48

then causes the do-file to execute, redrawing the graph and saving it in two formats.


#delimit ;
use c:\data\gulf.dta, clear ;
graph twoway connected maxarea winter, yaxis(3)
     yscale(range(-300,220) axis(3)) ylabel(50(50)200, nogrid axis(3))
     ytitle("Ice area, 1000 km^2", axis(3) placement(nw))
     clpattern(dash)
  || line cil winter, yaxis(2) yscale(range(-4,3) axis(2))
     ylabel(, nogrid axis(2))
     ytitle("CIL temperature, degrees C", axis(2)) clpattern(solid)
  || connected cod winter, yaxis(1) yscale(range(0,200) axis(1))
     ylabel(, nogrid axis(1))
     ytitle("Cod catch, 1000 tons", axis(1) placement(sw))
  || if winter > 1959,
     legend(ring(0) position(7) label(1 "Max ice area")
        label(2 "Min CIL temp") label(3 "Cod catch") rows(3))
     xtitle("") xlabel(1960(5)2000, grid)
     saving(c:\data\fig03_48.gph, replace) ;
graph export c:\data\fig03_48.eps, replace ;
#delimit cr

The #delimit ; command at the start resets the end-of-line delimiter to a semicolon, so Stata
treats this all as one logical line that ends with the semicolon after the saving() option.
This option saves the graph in Stata's .gph format.

Next, the graph export command creates a second version of the same graph in
Encapsulated PostScript format, as indicated by the .eps suffix in the filename fig03_48.eps.
(Type help graph_export to learn more about this command, which is particularly
useful for writing programs or do-files that will create graphs repeatedly.)

The final #delimit cr command re-sets the carriage return as the end-of-line
delimiter, going back to Stata's usual mode. Although a carriage return is invisible
onscreen, one must follow the last line of the do-file.

Retrieving and Combining Graphs

Any graph saved in Stata's "live" .gph format can subsequently be retrieved into memory by
the graph use command. For example, we could retrieve Figure 3.48 by typing

. graph use fig03_48

Once the graph is retrieved, it is displayed onscreen and can be printed or saved again in a
different format. With graph export, we could save it as Encapsulated PostScript (.eps),
Enhanced Windows Metafile (.emf), or the newly supported Portable Network Graphics
(.png). We also could change the color scheme, either through menus or directly with the
graph use command. fig03_48.gph was saved in the s2 monochrome scheme, but we
could see how it looks in the s1 color scheme by typing

. graph use fig03_48, scheme(s1color)


Graphs saved on disk can also be combined by the graph combine command. This
provides a way to bring multiple plots into the same image. For illustration, we return to the
Gulf of St. Lawrence data shown earlier in Figure 3.48. The following commands draw three
simple time plots (not shown), saving them with the names fig03_49a.gph, fig03_49b.gph, and
fig03_49c.gph. The margin(medium) suboptions specify the margin width for title boxes
within each plot.
. graph twoway line maxarea winter if winter > 1964, xtitle("")
     xlabel(1965(5)2000, grid) ylabel(50(50)200, nogrid)
     title("Maximum winter ice area", position(4) ring(0) box
        margin(medium))
     ytitle("1000 km^2") saving(fig03_49a)

. graph twoway line cil winter if winter > 1964, xtitle("")
     xlabel(1965(5)2000, grid) ylabel(-1(.5)1.5, nogrid)
     title("Minimum CIL temperature", position(1) ring(0) box
        margin(medium))
     ytitle("Degrees C") saving(fig03_49b)

. graph twoway line cod winter if winter > 1964, xtitle("")
     xlabel(1965(5)2000, grid) ylabel(0(20)100, nogrid)
     title("Northern Gulf cod catch", position(1) ring(0) box
        margin(medium))
     ytitle("1000 tons") saving(fig03_49c)

To combine these plots, we type the following command. Because the three plots have
identical x scales, it makes sense to align the graphs vertically, in three rows. The imargin
option specifies “very small” margins around the individual plots of Figure 3.49.
. graph combine fig03_49a.gph fig03_49b.gph fig03_49c.gph,
     imargin(vsmall) rows(3)

Figure 3.49

[Three vertically aligned time plots, 1965-2000: Maximum winter ice area (1000 km^2), Minimum CIL temperature (Degrees C), and Northern Gulf cod catch (1000 tons)]


Type help graph_combine for more information on this command. Options control
details including the number of rows and columns, the size of text and markers (which otherwise
shrink as the number of combined plots grows), and the margins between individual plots.
They can also specify whether the x or y axes of twoway plots have common scales, or assign all
components a common color scheme. Titles can be added to the combined graph, which can
be printed, saved, retrieved, or for that matter combined again in the usual ways.

Our final example illustrates several other graph combine options, and a
graph with unequal-sized components. Suppose we want a scatterplot similar to the
one shown earlier in Figure 3.42, but with box plots of the y and x
variables drawn beside their respective axes. Using statehealth.dta, we might first try to do this
by drawing a box plot of smokeT, a scatterplot of smokeT vs. college, and a horizontal
box plot of college, then combining the three plots into one image as shown below:

. graph box smokeT, saving(wrong1)
. graph twoway scatter smokeT college, saving(wrong2)
. graph hbox college, saving(wrong3)
. graph combine wrong1.gph wrong2.gph wrong3.gph

The combined graph produced by the commands above would look wrong, however. We
would end up with two fat box plots, each the size of the whole scatterplot, and none of the axes
aligned. For a more satisfactory version, we need to start by creating a thin vertical box plot
of smokeT. The fxsize(20) option in the following command fixes the plot's x (horizontal)
size at 20% of normal, resulting in a normal-height but only 20%-width plot. Two empty
caption lines are included for spacing reasons that will be apparent in the final graph.

. graph box smokeT, fxsize(20) caption("" "")
     ytitle("") ylabel(none) ytick(15(5)35, grid) saving(fig03_50a)

For the second component, we create a straightforward scatterplot of smokeT vs. college.

. graph twoway scatter smokeT college,
     ytitle("Percent adults smoking")
     xtitle("Percent adults with Bachelor's degrees or higher")
     ylabel(, grid) xlabel(, grid) saving(fig03_50b)

The third component is a thin horizontal box plot of college. This plot should have normal
x size, but y (vertical) size 20% of normal. For spacing reasons, two empty left titles
are included.

. graph hbox college, fysize(20) l1title("") l2title("")
     ylabel(none) ytick(10(5)35, grid) ytitle("")
     saving(fig03_50c)


These three components come together in Figure 3.50. The graph combine
command's cols(2) option arranges the plots in two columns, like a 2-by-2 table with one
empty cell. The holes(3) option specifies that the empty cell should be the third one, so
our three component graphs fill positions 1, 2, and 4. iscale(1.05) enlarges marker
symbols and text by about 5%, for readability. The empty captions or titles we built into the
original box plots compensate for the two lines of text (title and label) on each axis of the
scatterplot, so the box plots align (although not quite perfectly) with the scatterplot axes.

. graph combine fig03_50a.gph fig03_50b.gph fig03_50c.gph,
     cols(2) holes(3) iscale(1.05)
Figure 3.50

[Combined graph: scatterplot of percent adults smoking vs. percent adults with Bachelor's degrees or higher, with a thin vertical box plot of smokeT beside the y axis and a thin horizontal box plot of college below the x axis]

Summary Statistics and Tables

Flexible arrangements of summary statistics are available through the command table,
whose cells can contain statistics such as frequencies, sums, means, or medians. Further
one-variable procedures, including normality tests and transformations, are not described
in this chapter.

Example Commands
. summarize y1 y2 y3

Obtains basic summary statistics (number of observations, mean, standard deviation,
minimum, and maximum) for y1, y2, and y3.

. summarize y1 y2 y3, detail

Obtains detailed summary statistics, including percentiles, median, mean, standard
deviation, variance, skewness, and kurtosis.
deviation, variance, skewness, and kurtosis.


summarize yl if xl > 3 & x2 <

°"'V

ObSe™,iOnS

V™b,e

■ summarize yl [fweight = w], detail

Cafato detailed s„,™ary s.a.lsfa forj7 „sing ,he frequei,cy wc|gh<s jn


* *

y1'

Stats<mean =d skewness kurtosis n)

Calculates only the specified summary statistics for variableyl.
'

!0

Stat?(min P5 P25 p50 P75 P95 “ax) by(xl)

is

Summary Statistics and Tables

121

. tabulate x1

Displays a frequency distribution table for all nonmissing values of variable x1.
.

tabulate xl,

sort miss

Displays a frequency distribution of.v/. including the missing values. Rows (values) are
sorted from most to least frequent.
.

tabl xl x2 x3 x4

Displays a series of frequency distribution tables, one for each of the variables listed.
.

tabulate xl x2

Displays a two-variable cross-tabulation with.v/ as the row variable, and.v? as the columns.
. tabulate xl x2, chi2 nof column

Produces a cross-tabulation and Pearson
test of independence. Does not show cell
frequencies, but instead gives the column percentages in each cell.
tabulate xl x2, missing row all

Produces a cross-tabulation that includes missing values in the table and in the calculation
of percentages. Calculates “all” available statistics (Pearso:
(Pearson and likelihood x:, Cramer’s
of
K Goodman and Kruskal’s gamma, and Kendall’s rb).
.

tab2 xl x2 x3 x4

Performs all possible two-way cross-tabulations of the listed variables.
tabulate xl,

summ(y)

Produces a one-way table showing the mean, standard deviation, and frequency of r values
within each category of xl.
tabulate xl x2, summ(y) means

Produces a two-way table showing the mean ofjy at each combination ofxl and x2 values.
. by x3, sort:

tabulate xl x2, exact

Creates a three-way cross-tabulation, with subtables for.tf (row) by.v2 (column) at each
value of.rd. Calculates Fisher’s exact test for each subtable, by varname, sort:
works as a prefix for almost any Stata command where it makes sense. The sort option
is unnecessary if the data already are sorted on vamame.
table y x2 x3, by(x4 x5) contents(freq)

Creates a five-way cross-tabulation, of r (row) by r’ (column) by x3 (supercolumn), by.v4
(superrow I) by x5 (superrow 2). Cells contain frequencies.
. table xl x2, contents(mean yl median y2)

Creates a two-way table of.v/ (row) by x2 (column). Cells contain the mean of vl and the
median ofy2.

Dataset VTtown.dta contains information from residents of a town in Vermont. A survey was
conducted soon after routine state testing had detected trace amounts of toxic chemicals in the
town's water supply. Higher concentrations were found in several private wells and near the
public schools. Worried citizens held meetings to discuss possible solutions to this problem.


Contains data from C:\data\VTtown.dta
  obs:           153                          VT town survey (Hamilton 1985)
 vars:             7                          11 Jul 2005 18:05
 size:         1,683 (99.9% of memory free)

              storage  display    value
variable name   type   format     label      variable label
-------------------------------------------------------------------------
gender          byte   %8.0g      sexlbl     Respondent's gender
lived           byte   %8.0g                 Years lived in town
kids            byte   %8.0g      kidlbl     Have children <19 in town?
educ            byte   %8.0g                 Highest year school completed
meetings        byte   %8.0g      kidlbl     Attended meetings on pollution
contam          byte   %8.0g      contamlb   Believe own property/water
                                               contaminated
school          byte   %8.0g      close      School closing opinion
-------------------------------------------------------------------------
Sorted by:

To find the mean and standard deviation of the variable lived (years the respondent had
lived in town), type

. summarize lived

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       lived |       153    19.26797    16.95466          1         81

This table also gives the number of nonmissing observations and the variable's minimum and
maximum values. If we had simply typed summarize with no variable list, we would obtain
means and standard deviations for every numerical variable in the dataset.
To see more detailed summary statistics, type

. summarize lived, detail

                      Years lived in town
-------------------------------------------------------------
      Percentiles      Smallest
 1%            1              1
 5%            2              1
10%            3              1
25%            5              1       Obs                 153
50%           15                      Sum of Wgt.         153

                        Largest       Mean           19.26797
75%           29             65       Std. Dev.      16.95466
90%           42             65
95%           55             68       Variance       287.4606
99%           68             81       Skewness       1.208804
                                      Kurtosis       4.025642

This summarize, detail output includes basic statistics plus the following:

Percentiles:  Notably the first quartile (25th percentile), median (50th percentile), and third
              quartile (75th percentile). Because many samples do not divide evenly into
              quarters or other standard fractions, these percentiles are approximations.

Four smallest and four largest values, where outliers might show up.



Sum of weights:  Stata understands four types of weights: analytical weights (aweight),
              frequency weights (fweight), importance weights (iweight), and
              sampling weights (pweight). Different procedures allow, and make sense
              with, different kinds of weights. summarize, detail, for example,
              permits aweight or fweight. For explanations see help weights.
Variance:     Standard deviation squared (more properly, standard deviation equals the
              square root of variance).
Skewness:     The direction and degree of asymmetry. A perfectly symmetrical distribution
              has skewness = 0. Positive skew (heavier right tail) results in skewness > 0;
              negative skew (heavier left tail) results in skewness < 0.
Kurtosis:     Tail weight. A normal (Gaussian) distribution is symmetrical and has kurtosis
              = 3. If a symmetrical distribution has heavier-than-normal tails (that is, is
              sharply peaked), it will have kurtosis > 3. Kurtosis < 3 indicates lighter-than-normal tails.
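The moment formulas behind these skewness and kurtosis figures are easy to verify outside Stata. The following Python sketch (an illustration of the conventional definitions, not Stata's own code) computes skewness = m3/m2^1.5 and kurtosis = m4/m2^2 from the central moments m_k:

```python
def moments_summary(xs):
    """Skewness and kurtosis from central moments. Kurtosis here is not
    'excess' kurtosis: a normal distribution gives 3, matching the
    convention described above."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n   # variance (divisor n)
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2

# A perfectly symmetrical sample has skewness 0:
skew, kurt = moments_summary([1, 2, 3, 4, 5])
```

A heavier right tail pushes the m3 term, and hence skewness, above zero, the pattern summarize, detail reported for lived (skewness 1.21, kurtosis 4.03).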

The tabstat command provides a more flexible alternative to summarize. We can
specify just which summary statistics we want to see. For example,

. tabstat lived, stats(mean range skewness)

    variable |      mean     range  skewness
-------------+------------------------------
       lived |  19.26797        80  1.208804

With a by(varname) option, tabstat constructs a table containing summary
statistics for each value of varname. The following example contains means, standard
deviations, medians, interquartile ranges, and number of nonmissing observations of lived, for
each category of gender. The means and medians both indicate that, on average, the women
in this sample had lived in town for fewer years than the men. Note that the median column is
labeled "p50", meaning 50th percentile.

. tabstat lived, stats(mean sd median iqr n) by(gender)

Summary for variables: lived
     by categories of: gender (Respondent's gender)

  gender |      mean        sd       p50       iqr         N
---------+---------------------------------------------------
    male |  23.48333  19.69125      19.5        28        60
  female |  16.54839  14.39468        13        19        93
---------+---------------------------------------------------
   Total |  19.26797  16.95466        15        24       153
-------------------------------------------------------------

Statistics available for the stats() option of tabstat include:

mean       Mean
count      Count of nonmissing observations
n          Same as count
sum        Sum
max        Maximum
min        Minimum
range      Range = max - min
sd         Standard deviation
var        Variance
cv         Coefficient of variation = sd / mean
semean     Standard error of mean = sd / sqrt(n)
skewness   Skewness
kurtosis   Kurtosis
median     Median (same as p50)
p1         1st percentile (similarly, p5, p10, p25, p50, p75, p95, or p99)
iqr        Interquartile range = p75 - p25
q          Quartiles; equivalent to specifying p25 p50 p75

Further tabstat options give control over the table layout and labeling. Type help
tabstat to see a complete list.

The statistics produced by summarize or tabstat describe the sample at hand. We
might also want to draw inferences about the population, for example, by constructing a 99%
confidence interval for the mean of lived:

. ci lived, level(99)

    Variable |       Obs        Mean    Std. Err.    [99% Conf. Interval]
-------------+-----------------------------------------------------------
       lived |       153    19.26797    1.370703     15.69241    22.84354

Based on this sample, we could be 99% confident that the population mean lies somewhere
in the interval from 15.69 to 22.84 years. Here we used a level() option to specify a 99%
confidence interval. If we omit this option, ci defaults to a 95% confidence interval.
Other options allow ci to calculate exact confidence intervals for variables that follow
binomial or Poisson distributions. A related command, cii , calculates normal, binomial, or
Poisson confidence intervals directly from summary statistics, such as we might encounter in
a published article. It does not require the raw data. Type help ci for details about both
commands.
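The arithmetic behind this interval can be checked by hand: the standard error is sd/sqrt(n), and the bounds are mean ± t·se. Here is a minimal Python sketch using the summarize results above; the t quantile t(0.995, df = 152) ≈ 2.609 is an assumed approximate value (Stata computes it exactly):

```python
import math

# Values taken from the summarize lived output in the text.
n, mean, sd = 153, 19.26797, 16.95466
se = sd / math.sqrt(n)                 # standard error of the mean
t = 2.609                              # approx. Student-t quantile, 99%, df = 152
lo, hi = mean - t * se, mean + t * se  # 99% confidence interval
```

With these inputs the bounds land within about 0.001 of the ci output above; the small discrepancy comes from rounding the t quantile.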

Exploratory Data Analysis
Statistician John Tukey invented a toolkit of methods for exploratory data analysis (EDA),
which involves analyzing data in an exploratory and skeptical way without making unneeded
assumptions (see Tukey 1977; also Hoaglin, Mosteller, and Tukey 1983, 1985). Box plots,
introduced in Chapter 3, are one of Tukey's best-known innovations. Another is the stem-and-leaf display, a graphical arrangement of ordered data values in which initial digits form the
"stems" and following digits for each observation make up the "leaves."


. stem lived

Stem-and-leaf plot for lived (Years lived in town)

  0* | 1111111222223333333344444444
  0. | 55555555555566666666777889999
  1* | 0000001122223333334
  1. | 55555567788899
  2* | 000000111112224444
  2. | 56778899
  3* | 00000124
  3. | 5555666789
  4* | 0012
  4. | 59
  5* | 00134
  5. | 556
  6* |
  6. | 5558
  7* |
  7. |
  8* | 1

stem automatically chose a double-stem version here, in which 1* denotes first digits
of 1 and second digits of 0-4 (that is, respondents who had lived in town 10-14 years), and 1.
denotes first digits of 1 and second digits of 5 to 9 (15-19 years). We can control the number
of lines per initial digit with the lines() option. For example, a five-stem version in which
the 1* stem holds leaves of 0-1, 1t leaves of 2-3, 1f leaves of 4-5, 1s leaves of 6-7, and 1.
leaves of 8-9 could be obtained by typing

. stem lived, lines(5)

Type help stem for information about other options.
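The double-stem layout above (a "*" row for leaves 0-4 and a "." row for leaves 5-9) is simple to mimic. Here is a small Python sketch of the idea, an illustration rather than stem's algorithm (unlike stem, it omits stems that have no leaves):

```python
from collections import defaultdict

def stem_leaf(xs):
    """Double-stem display: '*' rows hold leaves 0-4, '.' rows hold 5-9."""
    rows = defaultdict(str)
    for x in sorted(xs):
        stem, leaf = divmod(x, 10)
        rows[(stem, "*" if leaf < 5 else ".")] += str(leaf)
    # '*' sorts before '.' in ASCII, so rows come out in display order.
    return [f"{s}{half} | {leaves}" for (s, half), leaves in sorted(rows.items())]

lines = stem_leaf([1, 3, 15, 17, 22])
```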
Letter-value displays (lv) use order statistics to dissect a distribution.

. lv lived

   #  153                Years lived in town
---------------------------------------------------------------------
             |                               |   spread   pseudosigma
M     77     |             15                |
F     39     |      5      17       29       |       24       17.9731
E     20     |      3      21       39       |       36      15.86391
D   10.5     |      2      27       52       |       50      16.62351
C    5.5     |      1   30.75     60.5       |     59.5      16.26523
B      3     |      1      33       65       |       64      15.15955
A      2     |      1    34.5       68       |       67      14.59762
Z    1.5     |      1   37.75     74.5       |     73.5      15.14113
       1     |      1      41       81       |       80      15.32737
---------------------------------------------------------------------
inner fence  |    -31               65       |  # below 0   # above 5
outer fence  |    -67              101       |  # below 0   # above 0

M denotes the median, and F the fourths (quartiles, using a different approximation than
that used by summarize, detail and tabstat). E, D, C,
. . . denote cutoff points such that roughly 1/8, 1/16, 1/32, . . . of the distribution remains
outside in the tails. The second column of numbers gives the "depth," or distance from the nearest
extreme, for each letter value. Within the center box, the middle column gives
midsummaries, which are averages of the two letter values. If midsummaries drift away from
the median, as they do for lived, this tells us that the distribution becomes progressively more
skewed farther out in its tails.


Finally, "pseudosigmas" in the right-hand column are estimates of what the standard deviation should
be if these letter values described a Gaussian population. The F pseudosigma, sometimes called
a "pseudo standard deviation" (PSD), is based on the interquartile range. Comparing such
outlier-resistant measures with ordinary summary statistics provides a check for
approximate normality:

1. Comparing mean with median diagnoses overall skew:

     mean > median     positive skew
     mean = median     symmetry
     mean < median     negative skew

2. Comparing the standard deviation with the PSD diagnoses tail weight:

     standard deviation > PSD     heavier-than-normal tails
     standard deviation = PSD     normal tails
     standard deviation < PSD     lighter-than-normal tails

lv also displays inner and outer fences, cutoffs used to identify outliers. The value x
is a "mild outlier" if it lies outside the inner fences but not outside the outer fences, that is, if

     F_L - 3(IQR) <= x < F_L - 1.5(IQR)     or     F_U + 1.5(IQR) < x <= F_U + 3(IQR)

The value x is a "severe outlier" if it lies outside the outer fences:

     x < F_L - 3(IQR)     or     x > F_U + 3(IQR)

lv gives these cutoffs and the number of outliers of each type. Severe outliers, values beyond
the outer fences, occur sparsely (about two per million observations) in normal populations,
so their presence is a strong sign of nonnormality. For lived, the letter-value display shows
five mild outliers above the inner fence and no severe outliers. The following sections describe
formal normality tests and transformations.
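The fence rules above can be sketched in Python as a simple illustration; note that lv uses its own quartile approximation, whereas the hypothetical quartile() helper below interpolates linearly:

```python
def classify_outliers(xs):
    """Mild outliers lie between the inner fences (F +/- 1.5*IQR) and the
    outer fences (F +/- 3*IQR); severe outliers lie beyond the outer fences."""
    s = sorted(xs)
    n = len(s)

    def quartile(p):  # linear interpolation, not lv's approximation
        k = p * (n - 1)
        f = int(k)
        return s[f] + (k - f) * (s[min(f + 1, n - 1)] - s[f])

    fl, fu = quartile(0.25), quartile(0.75)
    iqr = fu - fl
    mild = [x for x in s
            if (fl - 3 * iqr <= x < fl - 1.5 * iqr)
            or (fu + 1.5 * iqr < x <= fu + 3 * iqr)]
    severe = [x for x in s if x < fl - 3 * iqr or x > fu + 3 * iqr]
    return mild, severe
```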

Normality Tests and Transformations

Many statistical techniques assume that variables follow normal distributions. The previous
section described exploratory methods to check for approximate normality, extending the
graphical tools (symmetry plots, quantile plots, and quantile-normal plots) presented in
Chapter 3. The sktest command uses skewness and kurtosis statistics, like those of
summarize, detail, to more formally evaluate the null hypothesis that the sample at
hand came from a normally distributed population.


. sktest lived

                 Skewness/Kurtosis tests for Normality
                                                ------- joint ------
    Variable |  Pr(Skewness)  Pr(Kurtosis)   adj chi2(2)   Prob>chi2
-------------+-------------------------------------------------------
       lived |      0.000         0.028          24.79        0.0000

sktest here rejects normality: lived appears significantly nonnormal in skewness (P =
.000), kurtosis (P = .028), and in both statistics considered jointly (P = .0000). Stata rounds off
displayed probabilities to three or four decimals; "0.0000" really means P < .00005.

Other normality or log-normality tests include the Shapiro-Wilk W (swilk) and Shapiro-Francia W' (sfrancia) methods. Type help sktest to see the options.
Nonlinear transformations such as square roots and logarithms are often employed to
change distribution shapes, with the aim of making skewed distributions more symmetrical
and perhaps more nearly normal. Transformations might also help linearize relationships
between variables (Chapter 8). Table 4.1 shows a progression called the "ladder of powers"
(Tukey 1977) that provides guidance for choosing transformations to change distributional
shape. The variable lived exhibits mild positive skew, so its square root might be more
symmetrical. We could create a new variable equal to the square root of lived by typing

. generate srlived = lived^.5

Instead of lived^.5, we could equally well have written sqrt(lived).
Logarithms are another transformation that can reduce positive skew. To generate a new
variable equal to the natural (base e) logarithm of lived, type

. generate loglived = ln(lived)

In the ladder of powers and related transformation schemes such as Box-Cox, logarithms take
the place of a “0” power. Their effect on distribution shape is intermediate between .5 (square
root) and -.5 (reciprocal root) transformations.
Table 4.1: Ladder of Powers

Transformation               Formula              Effect
cube                         new = old^3          reduce severe negative skew
square                       new = old^2          reduce mild negative skew
raw                          new = old            no change (raw data)
square root                  new = old^.5         reduce mild positive skew
log (e or base 10)           new = ln(old)        reduce positive skew
                             new = log10(old)
negative reciprocal root     new = -(old^-.5)     reduce severe positive skew
negative reciprocal          new = -(old^-1)      reduce very severe positive skew
negative reciprocal square   new = -(old^-2)
negative reciprocal cube     new = -(old^-3)

When raising to a power less than zero, we take negatives of the result to preserve the
original order: the highest value of old becomes transformed into the highest value of new,
and so forth. When old itself contains negative or zero values, it is necessary to add a constant
before transformation. For example, if arrests measures the number of times a person has been
arrested (0 for many people), then a suitable log transformation could be

. generate larrests = ln(arrests + 1)
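A quick Python check of the order-preservation rule above: for positive values, the plain reciprocal reverses order, while the negative reciprocal keeps it (a small made-up example):

```python
# For positive old values, -(old**-1) is an increasing function of old,
# so ranks are unchanged; old**-1 alone would reverse them.
old = [0.5, 1, 2, 8, 40]             # already in increasing order
plain = [x ** -1 for x in old]       # reciprocal: order reversed
negated = [-(x ** -1) for x in old]  # negative reciprocal: order preserved
```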

The ladder command combines the ladder of powers with sktest tests for
normality. It tries each power on the ladder, and reports whether the result is significantly
nonnormal. This can be illustrated using the severely skewed variable energy, per capita energy
consumption, from states.dta.

. ladder energy

Transformation       formula           chi2(2)    P(chi2)
-----------------------------------------------------------
cube                 energy^3            53.74      0.000
square               energy^2            45.53      0.000
raw                  energy              33.25      0.000
square-root          sqrt(energy)        25.03      0.000
log                  log(energy)         15.88      0.000
reciprocal root      1/sqrt(energy)       7.36      0.025
reciprocal           1/energy             1.32      0.517
reciprocal square    1/(energy^2)         4.13      0.127
reciprocal cube      1/(energy^3)        11.56      0.003

It appears that the reciprocal transformation, 1/energy (or energy^-1), most closely resembles
a normal distribution. Most of the other transformations (including the raw data) are
significantly nonnormal. Figure 4.1 (produced by the gladder command) visually supports
this conclusion by comparing histograms of each transformation to normal curves.
. gladder energy

Figure 4.1
[Histograms of per capita energy consumed, Btu, under each ladder-of-powers transformation (cubic, square, identity, sqrt, log, 1/sqrt, inverse, 1/square, 1/cubic), each compared with a normal curve. Histograms by transformation.]

Figure 4.2 shows a corresponding set of quantile-normal plots for these ladder of powers
transformations, obtained by the "quantile ladder" command qladder. To make the tiny
plots more readable in this example, we scale the labels and marker symbols up by 25% with
the scale(1.25) option. The axis labels (which would be unreadable and crowded) are
suppressed by the options ylabel(none) xlabel(none).

. qladder energy, scale(1.25) ylabel(none) xlabel(none)

Figure 4.2
[Quantile-normal plots of per capita energy consumed, Btu, for each ladder-of-powers transformation (cubic, square, identity, sqrt, log, 1/sqrt, inverse, 1/square, 1/cubic). Quantile-Normal plots by transformation.]

The bcskew0 command finds a value of λ for the Box-Cox transformation

     y(λ) = (y^λ - 1)/λ          λ > 0 or λ < 0
     y(λ) = ln(y)                λ = 0

such that y(λ) has approximately 0 skewness. Applying this to energy, we obtain the
transformed variable benergy:

. bcskew0 benergy = energy, level(95)

     Transform |         L     [95% Conf. Interval]     Skewness
---------------+-------------------------------------------------
(energy^L-1)/L | -1.246052    -2.052503    -.6163383     .000281

(1 missing value generated)

The transformed variable benergy has skewness close to zero (.000281). The Box-Cox
parameter, λ = -1.246, is not far from our ladder-of-powers choice, the -1 power (reciprocal).
The level(95) option asked for a 95% confidence interval for λ:

     -2.0525 < λ < -.6163

This interval rejects some other possibilities, including logarithms (λ = 0) or square roots (λ = .5).
Chapter 8 describes a Box-Cox approach to regression modeling.
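The idea behind bcskew0 can be mimicked with a crude grid search over λ. This Python sketch follows the formulas above but is not Stata's algorithm, and the sample ys below is made-up:

```python
import math

def skewness(xs):
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def boxcox(y, lam):
    """Box-Cox: (y^lam - 1)/lam for lam != 0, ln(y) at lam = 0."""
    return math.log(y) if lam == 0 else (y ** lam - 1) / lam

def bcskew0_grid(ys, grid):
    """Return the grid value of lambda whose transformed skewness is nearest 0."""
    return min(grid, key=lambda lam: abs(skewness([boxcox(y, lam) for y in ys])))

ys = [1, 1, 2, 2, 3, 5, 9, 20, 60]            # hypothetical, positively skewed
lam = bcskew0_grid(ys, [i / 10 for i in range(-20, 11)])
```

Because the data are positively skewed, the search settles on a λ well below 1, and the transformed skewness is smaller in magnitude than the raw skewness.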


Frequency Tables and Two-Way Cross-Tabulations

Categorical variables call for different descriptive tools than measurement variables. To see
how many survey respondents said they had attended meetings about the pollution problem, we
can tabulate the categorical variable meetings:

. tabulate meetings
   Attended |
meetings on |
  pollution |      Freq.     Percent        Cum.
------------+-----------------------------------
         no |        106       69.28       69.28
        yes |         47       30.72      100.00
------------+-----------------------------------
      Total |        153      100.00

tabulate can construct frequency distributions for variables that have thousands of values.
To construct a readable table for such a variable, however, you might first want to group its
values, perhaps with the recode or autocode options (see Chapter 2 for details).
tabulate also produces two-way cross-tabulations. For example, here is a cross-tabulation
of meetings by kids (whether the respondent has children under 19 living in town):

. tabulate meetings kids

   Attended |
   meetings |  Have children <19 in
         on |         town?
  pollution |        no        yes |     Total
------------+----------------------+----------
         no |        52         54 |       106
        yes |        11         36 |        47
------------+----------------------+----------
      Total |        63         90 |       153

The first-named variable forms the rows, and the second-named variable forms the columns in
the resulting table. We see that only 11 of the 63 non-parents attended the meetings.
tabulate has a number of options that are useful with frequency tables:
all          Equivalent to specifying chi2 lrchi2 V gamma taub. gamma and taub assume
             that both variables have ordered categories, whereas chi2, lrchi2, and V
             do not.

cchi2        Displays the contribution to Pearson χ² (chi-squared) in each cell of a
             two-way table.

cell         Shows total percentages for each cell.

chi2         Pearson χ² test of the hypothesis that row and column variables are
             independent.

clrchi2      Displays the contribution to likelihood-ratio χ² in each cell of a
             two-way table.

column       Shows column percentages for each cell.

exact        Fisher's exact test of the independence hypothesis. Superior to chi2 if
             the table contains thin cells with low expected frequencies. Often too
             slow to be practical in large tables, however.

expected     Displays the expected frequency under the assumption of independence in
             each cell of a two-way table.

gamma        Goodman and Kruskal's γ (gamma), with its asymptotic standard error
             (ASE). Measures association between ordinal variables, based on the
             number of concordant and discordant pairs (ignoring ties). -1 ≤ γ ≤ 1.

generate(new)  Creates a set of dummy variables named new1, new2, and so on to
             represent the values of the tabulated variable.

lrchi2       Likelihood-ratio χ² test of independence hypothesis. Not obtainable if
             the table contains any empty cells.

matcell(matname)  Saves the reported frequencies in matname.

matcol(matname)   Saves the numeric values of the 1 x c column stub in matname.

matrow(matname)   Saves the numeric values of the r x 1 row stub in matname.

missing      Includes "missing" as one row and/or column of the table.

nofreq       Does not show cell frequencies.

nokey        Suppresses the display of a key above two-way tables. The default is to
             display the key if more than one cell statistic is requested and
             otherwise to omit it. Specifying key forces the display of the key.

nolabel      Shows numerical values rather than value labels of labeled numeric
             variables.

plot         Produces a simple bar chart of the relative frequencies in a one-way
             table.

replace      Indicates that the immediate data specified as arguments to the tabi
             command are to be left as the current data in memory, replacing whatever
             data were there.

row          Shows row percentages for each cell.

sort         Displays the rows in descending order of frequency (and ascending order
             of the variable within equal values of frequency).

subpop(varname)  Excludes observations for which varname = 0 in tabulating
             frequencies. The identities of the rows and columns will be determined
             from all the data, including the varname = 0 group, and so there may be
             entries in the table with frequency 0.

taub         Kendall's τb (tau-b), with its asymptotic standard error (ASE). Measures
             association between ordinal variables. taub is similar to gamma, but
             uses a correction for ties. -1 ≤ τb ≤ 1.

V            Cramér's V (note capitalization), a measure of association for nominal
             variables. In 2 x 2 tables, -1 ≤ V ≤ 1; in larger tables, 0 ≤ V ≤ 1.

wrap         Requests that Stata take no action on wide, two-way tables to make them
             readable. Unless wrap is specified, wide tables are broken into pieces
             for readability.

To get the column percentages (because the column variable, kids, is the independent
variable) and a χ² test for the cross-tabulation of meetings by kids, type

. tabulate meetings kids, column chi2

+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

   Attended |
   meetings |  Have children <19 in
         on |         town?
  pollution |        no        yes |     Total
------------+----------------------+----------
         no |        52         54 |       106
            |     82.54      60.00 |     69.28
------------+----------------------+----------
        yes |        11         36 |        47
            |     17.46      40.00 |     30.72
------------+----------------------+----------
      Total |        63         90 |       153
            |    100.00     100.00 |    100.00

          Pearson chi2(1) =   8.8464   Pr = 0.003

Forty percent of the respondents with children attended meetings, compared with about
17 percent of the respondents without children. This association is statistically significant (P = .003).
Occasionally we might need to re-analyze a published table, without access to the original
raw data. A special command, tabi ("immediate" tabulation), accomplishes this. Type the
cell frequencies on the command line, with table rows separated by "\". For illustration, here
is how tabi could reproduce the previous χ² analysis, given only the four cell frequencies:

. tabi 52 54 \ 11 36, column chi2

+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

           |          col
       row |         1          2 |     Total
-----------+----------------------+----------
         1 |        52         54 |       106
           |     82.54      60.00 |     69.28
-----------+----------------------+----------
         2 |        11         36 |        47
           |     17.46      40.00 |     30.72
-----------+----------------------+----------
     Total |        63         90 |       153
           |    100.00     100.00 |    100.00

          Pearson chi2(1) =   8.8464   Pr = 0.003


Unlike tabulate, tabi does not require or refer to any data in memory. By adding the
replace option, however, we can ask tabi to replace whatever data are in memory with
the new cross-tabulation. Statistical options (chi2, exact, nofreq, and so forth) work
the same for tabi as they do with tabulate.
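The χ² arithmetic here is simple enough to verify directly: expected counts are (row total × column total)/grand total, and χ² sums (observed − expected)²/expected over the cells. A Python cross-check using the four cell frequencies above:

```python
def pearson_chi2(table):
    """Pearson chi-squared statistic for a table given as a list of rows."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = rows[i] * cols[j] / total  # expected count under independence
            chi2 += (obs - exp) ** 2 / exp
    return chi2

chi2 = pearson_chi2([[52, 54], [11, 36]])    # cells from the tabi example
```

This reproduces the 8.8464 that tabulate and tabi both report.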

Multiple Tables and Multi-Way Cross-Tabulations

With surveys and other large datasets, we sometimes need frequency distributions of many
different variables. Instead of asking for each table separately, for example by typing
tabulate meetings, then tabulate gender, and finally tabulate kids, we
could simply use another specialized command, tab1:

. tab1 meetings gender kids

Or to produce one-way frequency tables for each variable from gender through school in this
dataset (the maximum is 30 variables at one time), type

. tab1 gender-school

Similarly, tab2 creates multiple two-way tables. For example, the following command
cross-tabulates every two-way combination of the listed variables:

. tab2 meetings gender kids

tab1 and tab2 offer the same options as tabulate.

To form multi-way contingency tables, one approach uses the ordinary tabulate
command with a by prefix. Here is a three-way cross-tabulation of meetings by kids by
contam (respondent believes his or her own property or water contaminated), with χ² tests for
the independence of meetings and kids within each level of contam:

. by contam, sort: tabulate meetings kids, nofreq col chi2

-> contam = no

   Attended |
   meetings |  Have children <19 in
         on |         town?
  pollution |        no        yes |     Total
------------+----------------------+----------
         no |     91.30      68.75 |     78.18
        yes |      8.70      31.25 |     21.82
------------+----------------------+----------
      Total |    100.00     100.00 |    100.00

          Pearson chi2(1) =   7.9814   Pr = 0.005


-> contam = yes

   Attended |
   meetings |  Have children <19 in
         on |         town?
  pollution |        no        yes |     Total
------------+----------------------+----------
         no |     58.82      38.46 |     46.51
        yes |     41.18      61.54 |     53.49
------------+----------------------+----------
      Total |    100.00     100.00 |    100.00

          Pearson chi2(1) =   1.7131   Pr = 0.191

Parents were more likely to attend meetings, among both the contaminated and uncontaminated
groups. Only among the larger uncontaminated group is this "parenthood effect" statistically
significant, however. As multi-way tables separate the data into smaller subsamples, the size
of these subsamples has noticeable effects on significance-test outcomes.
This approach can be extended to tabulations of greater complexity. For example, to get
a four-way cross-tabulation of gender by contam by meetings by kids, with χ² tests for each
meetings-by-kids subtable (results not shown), type the command

. by gender contam, sort: tabulate meetings kids, column chi2

A different approach to multi-way tables, useful if we do not need percentages or statistical tests,
is through Stata's general table-making command, table. This versatile command has many
options, only a few of which are illustrated here. To construct a simple frequency table of
meetings, type

. table meetings, contents(freq)

----------------------
Attended  |
meetings  |
on        |
pollution |      Freq.
----------+-----------
       no |        106
      yes |         47
----------------------
For a two-way frequency table or cross-tabulation, type

. table meetings kids, contents(freq)

-------------------------------
Attended  |   Have children
meetings  |   <19 in town?
on        |
pollution |      no       yes
----------+--------------------
       no |      52        54
      yes |      11        36
-------------------------------


If we specify a third categorical variable, it forms the "supercolumns" of a three-way table:

. table meetings kids contam, contents(freq)

--------------------------------------------------
          |        Believe own
          |      property/water
Attended  |   contaminated and Have
meetings  |   children <19 in town?
on        |  ----- no -----  ---- yes ----
pollution |      no     yes     no     yes
----------+---------------------------------------
       no |      42      44     10      10
      yes |       4      20      7      16
--------------------------------------------------

More complicated tables require the by() option, which allows up to four "superrow"
variables. One table thus can produce up to seven-way tables: one row, one column, one
supercolumn, and up to four superrows. Here is a four-way example:

. table meetings kids contam, contents(freq) by(gender)

----------------------------------------------------
Responden |
t's       |        Believe own
gender    |      property/water
and       |   contaminated and Have
Attended  |   children <19 in town?
meetings  |  ----- no -----  ---- yes ----
on        |
pollution |      no     yes     no     yes
----------+-----------------------------------------
male      |
       no |      18      18      3       3
      yes |       2       7      3       6
----------+-----------------------------------------
female    |
       no |      24      26      7       7
      yes |       2      13      4      10
----------------------------------------------------
The contents() option of table specifies what statistics the table's cells contain:

contents(freq)             Frequency
contents(mean varname)     Mean of varname
contents(sd varname)       Standard deviation of varname
contents(sum varname)      Sum of varname
contents(rawsum varname)   Sum ignoring optionally specified weight
contents(count varname)    Count of nonmissing observations of varname
contents(n varname)        Same as count
contents(max varname)      Maximum of varname
contents(min varname)      Minimum of varname
contents(median varname)   Median of varname
contents(iqr varname)      Interquartile range (IQR) of varname
contents(p1 varname)       1st percentile of varname
contents(p2 varname)       2nd percentile of varname (and so forth to p99)

The next section illustrates several more of these options.

Tables of Means, Medians, and Other Summary Statistics

tabulate readily produces tables of means and standard deviations within categories of
another variable. Here is a one-way table with means of lived within categories of meetings:

. tabulate meetings, summ(lived)

   Attended |     Summary of Years lived in town
meetings on |
  pollution |        Mean   Std. Dev.       Freq.
------------+------------------------------------
         no |   21.509434   17.743833         106
        yes |   14.212766   13.911139          47
------------+------------------------------------
      Total |   19.267974   16.954663         153

The meeting attenders tend to be relative newcomers, averaging 14.2 years in town, compared
with 21.5 years for those who did not attend.
We can also use tabulate to form a two-way table of means by typing

. tabulate meetings kids, summ(lived) means

              Means of Years lived in town

   Attended |
   meetings |    Have children <19
         on |        in town?
  pollution |         no         yes |      Total
------------+------------------------+-----------
         no |  28.307692   14.962963 |  21.509434
        yes |  23.363636   11.416667 |  14.212766
------------+------------------------+-----------
      Total |  27.444444   13.544444 |  19.267974

Both parents and nonparents among the meeting attenders tend to have lived fewer years in
town, on average, than their counterparts who did not attend.
The means option used above called for a table containing only means. Otherwise we
get a bulkier table with means, standard deviations, and frequencies in each cell. Chapter 5
describes statistical tests for hypotheses about subgroup means.
Although it performs no tests, table nicely builds up to seven-way tables containing
means, standard deviations, sums, medians, or other statistics (see the option list in the previous
section). Here is a one-way table showing means of lived within categories of meetings:


. table meetings, contents(mean lived)

--------------------------
Attended  |
meetings  |
on        |
pollution | mean(lived)
----------+---------------
       no |     21.5094
      yes |     14.2128
--------------------------

A two-way table of means is a straightforward extension:

. table meetings kids, contents(mean lived)

----------------------------------
Attended  |  Have children <19
meetings  |      in town?
on        |
pollution |      no        yes
----------+-----------------------
       no | 28.3077     14.963
      yes | 23.3636    11.4167
----------------------------------

Table cells can contain more than one statistic. Suppose we want a two-way table with both
means and medians of the variable lived:

. table meetings kids, contents(mean lived median lived)

----------------------------------
Attended  |  Have children <19
meetings  |      in town?
on        |
pollution |      no        yes
----------+-----------------------
       no | 28.3077     14.963
          |    27.5       12.5
          |
      yes | 23.3636    11.4167
          |      21          6
----------------------------------

The cell contents shown by table could be means, medians, sums, or other summary
statistics for two or more different variables.


Using Frequency Weights

summarize, tabulate, table, and related commands can be used with frequency
weights that indicate the number of replicated observations. For example, file sextab2.dta
contains results from a British survey of sexual behavior (Johnson 1992).

Contains data from C:\data\sextab2.dta
  obs:            48                          British sex survey (Johnson 92)
 vars:             4                          11 Jul 2005 18:05
 size:           432 (99.9% of memory free)

              storage  display    value
variable name   type   format     label      variable label
------------------------------------------------------------------
age             byte   %8.0g      age        Age
gender          byte   %8.0g      gender     Gender
lifepart        byte   %8.0g      partners   # heterosex partners lifetime
count           int    %8.0g                 Number of individuals
------------------------------------------------------------------
Sorted by:  age

Each observation represents a group of survey respondents, such as the 405 males aged
16-24 who reported no heterosexual partners. The variable count tells how many individuals
belong to each group:

. list in 1/5

     +-----------------------------------------+
     |   age     gender   lifepart     count   |
     |-----------------------------------------|
  1. | 16-24       male       none       405   |
  2. | 16-24     female       none       465   |
  3. | 16-24       male        one       323   |
  4. | 16-24     female        one       606   |
  5. | 16-24       male        two       194   |
     +-----------------------------------------+

We use count as a frequency weight to create a cross-tabulation of lifepart by gender:

. tabulate lifepart gender [fw = count]

         # |
 heterosex |
  partners |        Gender
  lifetime |      male     female |     Total
-----------+----------------------+----------
      none |       544        586 |      1130
       one |      1734       4146 |      5880
       two |       887       1777 |      2664
       3-4 |      1542       1908 |      3450
       5-9 |      1630       1364 |      2994
       10+ |      2048        708 |      2756
-----------+----------------------+----------
     Total |      8385      10489 |     18874

The usual tabulate options work as expected with frequency weights. Here is the same table showing column percentages instead of frequencies:

. tabulate lifepart gender [fweight = count], column nofreq

         # |
 heterosex |        Gender
  partners |
  lifetime |      male     female |     Total
-----------+----------------------+----------
      none |      6.49       5.59 |      5.99
       one |     20.68      39.53 |     31.15
       two |     10.58      16.94 |     14.11
       3-4 |     18.39      18.19 |     18.28
       5-9 |     19.44      13.00 |     15.86
       10+ |     24.42       6.75 |     14.60
-----------+----------------------+----------
     Total |    100.00     100.00 |    100.00
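Column percentages here are just weighted cell counts divided by their column totals. As a quick cross-check outside Stata, the arithmetic for the male column can be reproduced in a few lines of Python (the counts are copied from the tabulate output above):

```python
# Weighted cell counts for the male column, from the tabulate output above.
male = {"none": 544, "one": 1734, "two": 887,
        "3-4": 1542, "5-9": 1630, "10+": 2048}

total = sum(male.values())   # column total: 8385

# Column percentage = 100 * cell count / column total
col_pct = {k: round(100 * v / total, 2) for k, v in male.items()}

print(col_pct["none"], col_pct["10+"])  # 6.49 24.42
```

Each value matches the corresponding cell of the column-percentage table.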

Other types of weights such as probability or analytical weights do not work as well with
tabulate because their meanings are unclear regarding the command’s principal options.
A different application of frequency weights can be demonstrated with summarize. File college1.dta contains information on a random sample of 11 U.S. colleges, drawn from Barron's Compact Guide to Colleges (1992).
Contains data from C:\data\college1.dta
  obs:            11                          Colleges sample 1 (Barron's)
 vars:             5                          11 Jul 2005 18:05
 size:           429 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
--------------------------------------------------------------------------
school          str28  %28s                   College or university
enroll          int    %8.0g                  Full-time students 1991
pctmale         byte   %8.0g                  Percent male 1991
msat            int    %8.0g                  Average math SAT
vsat            int    %8.0g                  Average verbal SAT
--------------------------------------------------------------------------
Sorted by:

The variables include msat, the mean math Scholastic Aptitude Test score at each of the 11
schools.
. list school enroll msat

     +---------------------------------------------+
     |                      school   enroll   msat |
     |---------------------------------------------|
  1. |         American University     5228    587 |
  2. |            Brown University     5550    680 |
  3. |                 U. Scranton     3821    554 |
  4. | U. North Carolina/Asheville     2035    540 |
  5. |           Claremont College      849    660 |
     |---------------------------------------------|
  6. |           DePaul University     6197    547 |
  7. |      Thomas Aquinas College      201    570 |
  8. |            Davidson College     1543    640 |
  9. |        U. Michigan/Dearborn     3541    485 |
 10. |        Mass. College of Art      961    482 |
 11. |             Oberlin College     2765    640 |
     +---------------------------------------------+


We can easily find the mean msat value among these 11 schools by typing

. summarize msat

    Variable |     Obs        Mean    Std. Dev.       Min        Max
-------------+------------------------------------------------------
        msat |      11    580.4545    67.63155        482        680

This summary table gives each school's mean math SAT score the same weight. DePaul University, however, has 30 times as many students as Thomas Aquinas College. To take the different enrollments into account, we could weight by enroll.

. summarize msat [fweight = enroll]

    Variable |     Obs        Mean    Std. Dev.       Min        Max
-------------+------------------------------------------------------
        msat |   32691     583.064    63.10665        482        680

Typing

. summarize msat [freq = enroll]

would accomplish the same thing.

The enrollment-weighted mean, unlike the unweighted mean, is equivalent to the mean for the 32,691 students at these colleges (assuming they all took the SAT). Note, however, that we could not say the same thing about the standard deviation, minimum, or maximum. Apart from the mean, most individual-level statistics cannot be calculated simply by weighting data that already are aggregated. Thus, we need to use weights with caution. They might make sense in the context of one particular analysis, but seldom do for the dataset as a whole, when many different kinds of analyses are needed.
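Both summarize results can be verified by hand. The sketch below, in Python rather than Stata, recomputes the unweighted mean (each school counted once) and the frequency-weighted mean (each school counted once per enrolled student) from the enroll and msat values listed above:

```python
# (enroll, msat) pairs for the 11 colleges listed above
schools = [(5228, 587), (5550, 680), (3821, 554), (2035, 540),
           (849, 660), (6197, 547), (201, 570), (1543, 640),
           (3541, 485), (961, 482), (2765, 640)]

# Unweighted mean: each school counts once
unweighted = sum(m for _, m in schools) / len(schools)

# Frequency-weighted mean: each school counts once per enrolled student
n = sum(e for e, _ in schools)                  # 32,691 students
weighted = sum(e * m for e, m in schools) / n

print(round(unweighted, 4))  # 580.4545
print(round(weighted, 3))    # 583.064
```

The two results match the unweighted and enrollment-weighted summarize output above.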


ANOVA and Other Comparison Methods

Analysis of variance (ANOVA) encompasses a set of methods for testing hypotheses about differences between means. Its applications range from simple analyses where we compare the means of y across categories of x, to more complicated situations with multiple categorical and measurement x variables. t tests for hypotheses regarding a single mean (one-sample) or a pair of means (two-sample) correspond to elementary forms of ANOVA.
Rank-based “nonparametric” tests, including sign, Mann-Whitney, and Kruskal-Wallis,
take a different approach to comparing distributions. These tests make weaker assumptions
about measurement, distribution shape, and spread. Consequently, they remain valid under a
wider range of conditions than ANOVA and its “parametric” relatives. Careful analysts
sometimes use parametric and nonparametric tests together, checking to see whether both point
toward similar conclusions. Further troubleshooting is called for when parametric and
nonparametric results disagree.

anova is the first of Stata's model-fitting commands to be introduced in this book. Like the others, it has considerable flexibility, encompassing a wide variety of models. anova can fit one-way and N-way ANOVA or analysis of covariance (ANCOVA) for balanced and unbalanced designs, including designs with missing cells. It can also fit factorial, nested, mixed, or repeated-measures designs. One follow-up command, predict, calculates predicted values, several types of residuals, and assorted standard errors and diagnostic statistics after anova. Another follow-up command, test, obtains tests of user-specified null hypotheses. Both predict and test work similarly with other Stata model-fitting commands, such as regress (Chapter 6).

The following menu choices give access to most operations described in this chapter:

    Statistics - Summaries, tables, & tests - Classical tests of hypotheses
    Statistics - Summaries, tables, & tests - Nonparametric tests of hypotheses
    Statistics - ANOVA/MANOVA
    Statistics - General post-estimation - Obtain predictions, residuals, etc., after estimation
    Graphics - Overlaid twoway graphs

Example Commands

. anova y x1 x2

    Performs a two-way ANOVA, testing for differences among the means of y across categories of x1 and x2.
. anova y x1 x2 x1*x2

    Performs a two-way factorial ANOVA, including both the main and interaction (x1*x2) effects of categorical variables x1 and x2.
. anova y x1 x2 x3 x1*x2 x1*x3 x2*x3 x1*x2*x3

    Performs a three-way factorial ANOVA, including the three-way interaction x1*x2*x3 as well as all two-way interactions and main effects.
. anova reading curriculum / teacher|curriculum

    Fits a nested model to test the effects of three types of curriculum on students' reading ability (reading). teacher is nested within curriculum (teacher|curriculum) because several different teachers were assigned to each curriculum. The Base Reference Manual provides other nested ANOVA examples, including a split-plot design.
. anova headache subject medication, repeated(medication)

    Fits a repeated-measures ANOVA model to test the effects of three types of headache medication (medication) on the severity of subjects' headaches (headache). The sample consists of 20 subjects who report suffering from frequent headaches. Each subject tried each of the three medications at separate times during the study.
. anova y x1 x2 x3 x4 x2*x3, continuous(x3 x4) regress

    Performs analysis of covariance (ANCOVA) with four independent variables, two of them (x1 and x2) categorical and two of them (x3 and x4) measurements. Includes the x2*x3 interaction, and shows results in the form of a regression table instead of the default ANOVA table.
. kwallis y, by(x)

    Performs a Kruskal-Wallis test of the null hypothesis that y has identical rank distributions across the k categories of x (k > 2).

. oneway y x

    Performs a one-way analysis of variance (ANOVA), testing for differences among the means of y across categories of x. The same analysis, with a different output table, is produced by anova y x.
. oneway y x, tabulate scheffe

    Performs one-way ANOVA, including a table of sample means and Scheffe multiple-comparison tests in the output.

. ranksum y, by(x)

    Performs a Wilcoxon rank-sum test (also known as a Mann-Whitney U test) of the null hypothesis that y has identical rank distributions for both categories of dichotomous variable x. If we assume that both rank distributions possess the same shape, this amounts to a test for whether the two medians of y are equal.


. serrbar ymean se x, scale(2)

    Constructs a standard-error-bar plot from a dataset of means. Variable ymean holds the group means of y; se the standard errors; and x the values of categorical variable x. scale(2) asks for bars extending to plus or minus 2 standard errors around each mean (default is plus or minus 1 standard error).

. signrank y1 = y2

    Performs a Wilcoxon matched-pairs signed-rank test for the equality of the rank distributions of y1 and y2. We could test whether the median of y1 differs from a constant such as 23.4 by typing the command signrank y1 = 23.4.
. signtest y1 = y2

    Tests the equality of the medians of y1 and y2 (assuming matched data; that is, both variables measured on the same sample of observations). Typing signtest y1 = 5 would perform a sign test of the null hypothesis that the median of y1 equals 5.
. ttest y = 5

    Performs a one-sample t test of the null hypothesis that the population mean of y equals 5.

. ttest y1 = y2

    Performs a one-sample (paired difference) t test of the null hypothesis that the population mean of y1 equals that of y2. The default form of this command assumes that the data are paired. With unpaired data (y1 and y2 are measured from two independent samples), add the option unpaired.

. ttest y, by(x) unequal

    Performs a two-sample t test of the null hypothesis that the population mean of y is the same for both categories of variable x. Does not assume that the populations have equal variances. (Without the unequal option, ttest does assume equal variances.)

One-Sample Tests

One-sample t tests have two seemingly different applications:
1. Testing whether a sample mean differs significantly from a hypothesized value u0.
2. Testing whether the means of y1 and y2, two variables measured over the same set of observations, differ significantly from each other. This is equivalent to testing whether the mean of a "difference score" variable created by subtracting y1 from y2 equals zero.
We use essentially the same formulas for either application, although the second starts with information on two variables instead of one.

The data in writing.dta were collected to evaluate a college writing course based on word processing (Nash and Schwartz 1987). Measures such as the number of sentences completed in timed writing were collected both before and after students took the course. The researchers wanted to know whether the post-course measures showed improvement.


. describe

Contains data from C:\data\writing.dta
  obs:            24                          Nash and Schwartz (1987)
 vars:             9                          12 Jul 2005 10:16
 size:           312 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
--------------------------------------------------------------------------
id              byte   %8.0g       slbl       Student ID
preS            byte   %8.0g                  # of sentences (pre-test)
preP            byte   %8.0g                  # of paragraphs (pre-test)
preC            byte   %8.0g                  Coherence scale 0-2 (pre-test)
preE            byte   %8.0g                  Evidence scale 0-6 (pre-test)
postS           byte   %8.0g                  # of sentences (post-test)
postP           byte   %8.0g                  # of paragraphs (post-test)
postC           byte   %8.0g                  Coherence scale 0-2 (post-test)
postE           byte   %8.0g                  Evidence scale 0-6 (post-test)
--------------------------------------------------------------------------
Sorted by:

Suppose that we knew that students in previous years were able to complete an average of 10 sentences. Before examining whether the students in writing.dta improved during the course, we might want to learn whether at the start of the course they were essentially like earlier students; in other words, whether their pre-test (preS) mean differs significantly from the mean of previous students (10). To see a one-sample t test of Ho: u = 10, type

. ttest preS = 10

One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    preS |      24    10.79167    .9402034    4.606037    8.846708    12.73663
------------------------------------------------------------------------------
Degrees of freedom: 23

                          Ho: mean(preS) = 10

   Ha: mean < 10            Ha: mean != 10            Ha: mean > 10
     t =  0.8420              t =  0.8420               t =  0.8420
 P < t =  0.7958          P > |t| =  0.4084          P > t =  0.2042

The notation P > t means "the probability of a greater value of t"; that is, the one-tail test probability. The two-tail probability of a greater absolute t appears as P > |t| = .4084. Because this probability is high, we have no reason to reject Ho: u = 10. Note that ttest automatically provides a 95% confidence interval for the mean. We could get a different confidence interval, such as 90%, by adding a level(90) option to this command.
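The t statistic itself is simple arithmetic on the numbers in this output: the sample mean minus the hypothesized mean, divided by the standard error (standard deviation over the square root of n). A minimal Python check, using the summary statistics shown above:

```python
import math

# Summary statistics from the ttest output above
n, mean, sd = 24, 10.79167, 4.606037
mu0 = 10                         # hypothesized mean

se = sd / math.sqrt(n)           # standard error of the mean
t = (mean - mu0) / se

print(round(se, 4), round(t, 3))  # 0.9402 0.842
```

Both values agree with the Std. Err. and t reported by ttest.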

A nonparametric counterpart, the sign test, employs the binomial distribution to test hypotheses about single medians. For example, we could test whether the median of preS equals 10. signtest gives us no reason to reject that null hypothesis either.

. signtest preS = 10

Sign test

        sign |    observed    expected
-------------+------------------------
    positive |          12          11
    negative |          10          11
        zero |           2           2
-------------+------------------------
         all |          24          24

One-sided tests:
  Ho: median of preS - 10 = 0 vs.
  Ha: median of preS - 10 > 0
      Pr(#positive >= 12) =
         Binomial(n = 22, x >= 12, p = 0.5) =  0.4159

  Ho: median of preS - 10 = 0 vs.
  Ha: median of preS - 10 < 0
      Pr(#negative >= 10) =
         Binomial(n = 22, x >= 10, p = 0.5) =  0.7383

Two-sided test:
  Ho: median of preS - 10 = 0 vs.
  Ha: median of preS - 10 != 0
      Pr(#positive >= 12 or #negative >= 12) =
         min(1, 2*Binomial(n = 22, x >= 12, p = 0.5)) =  0.8318

Like ttest, signtest includes right-tail, left-tail, and two-tail probabilities. Unlike the symmetrical t distributions used by ttest, however, the binomial distributions used by signtest have different left- and right-tail probabilities. In this example, only the two-tail probability matters because we were testing whether the writing.dta students "differ" from their predecessors.
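The sign test probabilities come straight from the binomial distribution: with the 2 zero differences dropped, we ask how likely 12 or more positive signs out of n = 22 would be if positive and negative signs were equally likely (p = 0.5). A short Python check of that arithmetic:

```python
from math import comb

# 12 positive and 10 negative differences; the 2 zeros are dropped, so n = 22
n, pos = 22, 12

def upper_tail(n, k, p=0.5):
    """Pr(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

one_sided = upper_tail(n, pos)
two_sided = min(1, 2 * one_sided)

print(round(one_sided, 4), round(two_sided, 4))  # 0.4159 0.8318
```

These match the one-sided and two-sided probabilities in the signtest output.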

Next, we can test for improvement during the course by testing the null hypothesis that the mean numbers of sentences completed before and after the course (that is, the means of preS and postS) are equal. The ttest command accomplishes this as well, finding a significant improvement.

. ttest postS = preS

Paired t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
   postS |      24      26.375    1.693779    8.297787    22.87115    29.87885
    preS |      24    10.79167    .9402034    4.606037    8.846708    12.73663
---------+--------------------------------------------------------------------
    diff |      24    15.58333    1.383019    6.775382    12.72234    18.44433
------------------------------------------------------------------------------

              Ho: mean(postS - preS) = mean(diff) = 0

 Ha: mean(diff) < 0       Ha: mean(diff) != 0       Ha: mean(diff) > 0
    t =  11.2676             t =  11.2676              t =  11.2676
 P < t =  1.0000         P > |t| =  0.0000          P > t =  0.0000

Because we expect "improvement," not just "difference," between the preS and postS means, a one-tail test is appropriate. The displayed one-tail probability rounds off at four decimal places to zero ("0.0000" really means P < .00005). Students' mean sentence completion does improve. Based on this sample, we are 95% confident that it improves by an average of between 12.7 and 18.4 sentences.
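The paired t statistic is computed from the diff row alone: the mean difference divided by its standard error (the standard deviation of the differences over the square root of n). Reproducing that arithmetic in Python from the summary statistics above:

```python
import math

# Summary of diff = postS - preS, from the paired ttest output above
n, mean_diff, sd_diff = 24, 15.58333, 6.775382

se = sd_diff / math.sqrt(n)   # standard error of the mean difference
t = mean_diff / se            # tests Ho: mean(diff) = 0

print(round(se, 4), round(t, 4))  # 1.383 11.2676
```

Both numbers match the diff row's Std. Err. and the reported t.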
t tests assume that variables follow a normal distribution. This assumption usually is not critical because the tests are moderately robust. When nonnormality involves severe outliers, however, or occurs in small samples, we might be safer turning to medians instead of means and employing a nonparametric test that does not assume normality. The Wilcoxon signed-rank test, for example, assumes only that the distributions are symmetrical and continuous. Applying a signed-rank test to these data yields essentially the same conclusion as ttest: that students' sentence completion significantly improved. Because both tests agree on this conclusion, we can assert it with more assurance.
. signrank postS = preS

Wilcoxon signed-rank test

        sign |      obs   sum ranks    expected
-------------+---------------------------------
    positive |       24         300         150
    negative |        0           0         150
        zero |        0           0           0
-------------+---------------------------------
         all |       24         300         300

unadjusted variance     1225.00
adjustment for ties       -1.63
adjustment for zeros       0.00
                        -------
adjusted variance       1223.38

Ho: postS = preS
             z =   4.289
    Prob > |z| =  0.0000

Two-Sample Tests

The remainder of this chapter draws examples from a survey of college undergraduates by
Ward and Ault (1990) (student2.dta).
. describe

Contains data from C:\data\student2.dta
  obs:           243                          Student survey (Ward & Ault 1990)
 vars:            19                          12 Jul 2005 10:16
 size:         6,561 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
--------------------------------------------------------------------------
id              int    %8.0g                  Student ID
year            byte   %8.0g       year       Year in college
age             byte   %8.0g                  Age at last birthday
gender          byte   %9.0g       s          Gender (male)
major           byte   %8.0g                  Student major
relig           byte   %8.0g                  Religious preference
drink           byte   %9.0g                  33-point drinking scale
gpa             float  %9.0g                  Grade Point Average
grades          byte   %8.0g       v4         Stressed-grades this semester
belong          byte   %8.0g       belong     Belong to fraternity/sorority
live            byte   %8.0g       v10        Where do you live?
miles           byte   %8.0g                  How many miles from campus?
study           byte   %8.0g                  Avg. hours/week studying
athlete         byte   %8.0g       yes        Are you a varsity athlete?
employed        byte   %8.0g       yes        Are you employed?
allnight        byte   %8.0g       allnight   How often study all night?
ditch           byte   %8.0g       times      How many class/month ditched?
hsdrink         byte   %9.0g                  High school drinking scale
aggress         byte   %9.0g                  Aggressive behavior scale
--------------------------------------------------------------------------
Sorted by:  id

About 19% of these students belong to a fraternity or sorority:

. tabulate belong

  Belong to |
 fraternity |
  /sorority |      Freq.     Percent        Cum.
------------+-----------------------------------
     member |         47       19.34       19.34
  nonmember |        196       80.66      100.00
------------+-----------------------------------
      Total |        243      100.00

Another variable, drink, measures how often and heavily a student drinks alcohol, on a 33-point scale. Campus rumors might lead one to suspect that fraternity/sorority members tend to differ from other students in their drinking behavior. Box plots comparing the median drink values of members and nonmembers, and a bar chart comparing their means, both appear consistent with these rumors. Figure 5.1 combines these two separate plot types in one image.

. graph box drink, ylabel(0(5)35) over(belong) saving(fig05_01a)

. graph bar (mean) drink, over(belong) ylabel(0(5)35) saving(fig05_01b)

. graph combine fig05_01a.gph fig05_01b.gph, col(2) iscale(1.05)

Figure 5.1

[Box plots of drink (left) and a bar chart of mean drink (right), each on a 0 to 35 scale, for members and nonmembers.]

The ttest command, used earlier for one-sample and paired-difference tests, can perform two-sample tests as well. In this application its general syntax is ttest measurement, by(categorical). For example,

. ttest drink, by(belong)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
  member |      47     24.7234    .7124518    4.884323    23.28931     26.1575
nonmembe |     196     17.7602    .4575013    6.405018    16.85792    18.66249
---------+--------------------------------------------------------------------
combined |     243      19.107     .431224    6.722117    18.25756    19.95643
---------+--------------------------------------------------------------------
    diff |              6.9632    .9978608                4.997558    8.928842
------------------------------------------------------------------------------
Degrees of freedom: 241

                 Ho: mean(member) - mean(nonmembe) = diff = 0

   Ha: diff < 0              Ha: diff != 0             Ha: diff > 0
     t =  6.9781               t =  6.9781               t =  6.9781
 P < t =  1.0000           P > |t| =  0.0000          P > t =  0.0000

. ttest drink, by(belong) unequal

Two-sample t test with unequal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
  member |      47     24.7234    .7124518    4.884323    23.28931     26.1575
nonmembe |     196     17.7602    .4575013    6.405018    16.85792    18.66249
---------+--------------------------------------------------------------------
combined |     243      19.107     .431224    6.722117    18.25756    19.95643
---------+--------------------------------------------------------------------
    diff |              6.9632    .8466965                5.280627    8.645773
------------------------------------------------------------------------------
Satterthwaite's degrees of freedom: 88.22

                 Ho: mean(member) - mean(nonmembe) = diff = 0

   Ha: diff < 0              Ha: diff != 0             Ha: diff > 0
     t =  8.2240               t =  8.2240               t =  8.2240
 P < t =  1.0000           P > |t| =  0.0000          P > t =  0.0000
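Both versions of the two-sample t statistic can be recomputed from the group summaries alone. The equal-variances test pools the two sample variances; the unequal-variances (Welch) test simply adds the two squared standard errors. A Python sketch using the Obs, Mean, and Std. Dev. values above:

```python
import math

# Group summaries (Obs, Mean, Std. Dev.) from the ttest output above
n1, m1, s1 = 47, 24.7234, 4.884323     # members
n2, m2, s2 = 196, 17.7602, 6.405018    # nonmembers
diff = m1 - m2

# Equal variances: pool the two sample variances
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
t_pooled = diff / math.sqrt(sp2 * (1 / n1 + 1 / n2))

# Unequal variances (Welch): add the squared standard errors
t_welch = diff / math.sqrt(s1**2 / n1 + s2**2 / n2)

print(round(t_pooled, 3), round(t_welch, 3))  # 6.978 8.224
```

Up to rounding of the inputs, these reproduce t = 6.9781 and t = 8.2240 from the two tables.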
Assuming that the two rank distributions have similar shape, the rank-sum test below indicates that we can reject the null hypothesis of equal population medians.

. ranksum drink, by(belong)

Two-sample Wilcoxon rank-sum (Mann-Whitney) test

      belong |      obs    rank sum    expected
-------------+---------------------------------
      member |       47        8535        5734
   nonmember |      196       21111       23912
-------------+---------------------------------
    combined |      243       29646       29646

unadjusted variance   187310.67
adjustment for ties     -472.30
                      ---------
adjusted variance     186838.36

Ho: drink(belong==member) = drink(belong==nonmember)
             z =   6.480
    Prob > |z| =   0.0000
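The z statistic reported by ranksum is the member group's rank sum minus its expectation under the null hypothesis, divided by the square root of the tie-adjusted variance, all of which appear in the output above. In Python:

```python
import math

# From the ranksum output above: member rank sum, its expected value
# under Ho, and the tie-adjusted variance
rank_sum, expected, variance = 8535, 5734, 186838.36

z = (rank_sum - expected) / math.sqrt(variance)
print(round(z, 2))  # 6.48
```

This matches z = 6.480 in the output.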

One-Way Analysis of Variance (ANOVA)

Analysis of variance (ANOVA) provides another way, more general than t tests, to test for differences among means. The simplest case, one-way ANOVA, tests whether the means of y differ across categories of x. One-way ANOVA can be performed by a oneway command with the general form oneway measurement categorical. For example,

. oneway drink belong, tabulate

  Belong to |
 fraternity |    Summary of 33-point drinking scale
  /sorority |        Mean   Std. Dev.       Freq.
------------+------------------------------------
     member |     24.7234   4.8843233          47
  nonmember |   17.760204   6.4050181         196
------------+------------------------------------
      Total |   19.106996   6.7221166         243

                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      1838.08426      1   1838.08426     48.69     0.0000
 Within groups      9097.13385    241   37.7474433
------------------------------------------------------------------------
    Total           10935.2181    242   45.1868517

Bartlett's test for equal variances:  chi2(1) = 4.8378  Prob>chi2 = 0.028

Earlier, we compared these two groups with ttest drink, by(belong) unequal, abandoning the equal-variances assumption. oneway formally tests the equal-variances assumption, using Bartlett's chi-squared. A low Bartlett's probability implies that ANOVA's equal-variance assumption is implausible, in which case we should not trust the ANOVA F test results. In the oneway drink belong example above, Bartlett's P = .028 casts doubt on the ANOVA's validity.
ANOVA's real value lies not in two-sample comparisons, but in more complicated comparisons of three or more means. For example, we could test whether mean drinking behavior varies by year in college:

. oneway drink year, tabulate scheffe

     Year in |    Summary of 33-point drinking scale
     college |        Mean   Std. Dev.       Freq.
-------------+------------------------------------
    Freshman |      18.975   6.9226033          40
   Sophomore |   21.169231   6.5444853          65
      Junior |   19.453333   6.2866081          75
      Senior |   16.650794   6.6409257          63
-------------+------------------------------------
       Total |   19.106996   6.7221166         243

                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      666.200518      3   222.066839      5.17     0.0018
 Within groups      10269.0176    239   42.9666008
------------------------------------------------------------------------
    Total           10935.2181    242   45.1868517

Bartlett's test for equal variances:  chi2(3) = 0.5103  Prob>chi2 = 0.917

          Comparison of 33-point drinking scale by Year in college
                                (Scheffe)
Row Mean-|
Col Mean |   Freshman   Sophomor     Junior
---------+---------------------------------
Sophomor |    2.19423
         |      0.429
         |
  Junior |    .478333    -1.7159
         |      0.987      0.498
         |
  Senior |   -2.32421   -4.51844   -2.80254
         |      0.382      0.002

We can reject the hypothesis of equal means (P = .0018), but not the hypothesis of equal variances (P = .917). The latter is "good news" regarding the ANOVA's validity.

Box plots in Figure 5.2 (next page) support this conclusion, showing similar variation within each category. This figure, which combines separate box plots and dot plots, shows that differences among medians and among means follow similar patterns.
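The ANOVA table can be rebuilt from the group summaries alone: the between-groups sum of squares measures how far group means fall from the grand mean, and the within-groups sum of squares pools the group variances. A Python sketch using the Mean, Std. Dev., and Freq. columns above:

```python
# Group summaries from the oneway drink year output above: (n, mean, sd)
groups = [
    (40, 18.975,    6.9226033),   # Freshman
    (65, 21.169231, 6.5444853),   # Sophomore
    (75, 19.453333, 6.2866081),   # Junior
    (63, 16.650794, 6.6409257),   # Senior
]

N = sum(n for n, _, _ in groups)                    # 243 students
grand = sum(n * m for n, m, _ in groups) / N        # grand mean 19.106996
k = len(groups)

ss_between = sum(n * (m - grand)**2 for n, m, _ in groups)
ss_within = sum((n - 1) * s**2 for n, _, s in groups)

F = (ss_between / (k - 1)) / (ss_within / (N - k))

print(round(ss_between, 1), round(ss_within, 1), round(F, 2))
# 666.2 10269.0 5.17
```

These reproduce the Between groups and Within groups sums of squares and F = 5.17 from the ANOVA table.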

. graph hbox drink, over(year) ylabel(0(5)35) saving(fig05_02a)

. graph dot (mean) drink, over(year) ylabel(0(5)35, grid) marker(1, msymbol(S)) saving(fig05_02b)

. graph combine fig05_02a.gph fig05_02b.gph, row(2) iscale(1.05)

Figure 5.2

[Horizontal box plots (top) and a dot plot of mean drink (bottom) by year in college (Freshman, Sophomore, Junior, Senior), with the 33-point drinking scale from 0 to 35 on the horizontal axis.]

The scheffe option (Scheffe multiple-comparison test) produces a table showing the differences between each pair of means. The freshman mean equals 18.975 and the sophomore mean equals 21.16923, so the sophomore-freshman difference is 21.16923 - 18.975 = 2.19423, not statistically distinguishable from zero (P = .429). Of the six contrasts in this table, only the senior-sophomore difference, 16.6508 - 21.1692 = -4.5184, is significant (P = .002). Thus, our overall conclusion that these four groups' means are not the same arises mainly from the contrast between seniors (the lightest drinkers) and sophomores (the heaviest).

oneway offers three multiple-comparison options: scheffe, bonferroni, and sidak (see Base Reference Manual for definitions). The Scheffe test remains valid under a wider variety of conditions, although it is sometimes less sensitive.

The Kruskal-Wallis test (kwallis), a K-sample generalization of the two-sample rank-sum test, provides a nonparametric alternative to one-way ANOVA. It tests the null hypothesis of equal population medians.

. kwallis drink, by(year)

Test: Equality of populations (Kruskal-Wallis test)

  +-------------------------------+
  |      year |  Obs |   Rank Sum |
  |-----------+------+------------|
  |  Freshman |   40 |    4914.00 |
  | Sophomore |   65 |    9341.50 |
  |    Junior |   75 |    9300.50 |
  |    Senior |   63 |    6090.00 |
  +-------------------------------+

  chi-squared =     14.453 with 3 d.f.
  probability =     0.0023

  chi-squared with ties =     14.490 with 3 d.f.
  probability =     0.0023


Here, the kwallis results (P = .0023) agree with our oneway findings of significant differences in drink by year in college. Kruskal-Wallis is generally safer than ANOVA if we have reason to doubt ANOVA's equal-variances or normality assumptions, or if we suspect problems caused by outliers. kwallis, like ranksum, makes the weaker assumption of similar-shaped distributions within each group. In principle, ranksum and kwallis should produce similar results when applied to two-sample comparisons, but in practice this is true only if the data contain no ties. ranksum incorporates an exact method for dealing with ties, which makes it preferable for two-sample problems.
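The Kruskal-Wallis statistic (ignoring ties) depends only on the group sizes and rank sums shown above: H = 12/(N(N+1)) * sum(R_i^2/n_i) - 3(N+1). A Python check:

```python
# Group sizes and rank sums from the kwallis output above: (n, rank sum)
groups = [(40, 4914.0), (65, 9341.5), (75, 9300.5), (63, 6090.0)]
N = sum(n for n, _ in groups)   # 243

# Kruskal-Wallis H, ignoring ties
H = 12 / (N * (N + 1)) * sum(r**2 / n for n, r in groups) - 3 * (N + 1)

print(round(H, 3))  # 14.453
```

This reproduces the chi-squared value (without the ties correction) in the kwallis output.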

Two- and N-Way Analysis of Variance

One-way ANOVA examines how the means of measurement variable y vary across categories of one other variable x. N-way ANOVA generalizes this approach to deal with two or more categorical x variables. For example, we might consider how drinking behavior varies not only by fraternity or sorority membership, but also by gender. We start by examining a two-way table of means:

. table belong gender, contents(mean drink) row col

-----------------------------------------------
 Belong to  |
 fraternity |         Gender (male)
 /sorority  |    Female       Male      Total
------------+----------------------------------
     member |  22.44444   26.13793    24.7234
  nonmember |  16.51724    19.5625    17.7602
            |
      Total |  17.31343   21.31193     19.107
-----------------------------------------------
It appears that in this sample, males drink more than females and members drink more than nonmembers. The member-nonmember difference appears similar among males and females. Stata's N-way ANOVA command, anova, can test for significant differences among these means attributable to belonging to a fraternity or sorority, gender, or the interaction of belonging and gender (written belong*gender).

. anova drink belong gender belong*gender
                        Number of obs =     243     R-squared     =  0.2221
                        Root MSE      = 5.96592     Adj R-squared =  0.2123

            Source |  Partial SS    df       MS          F     Prob > F
       ------------+----------------------------------------------------
             Model |  2428.67236     3   809.557456    22.75     0.0000
                   |
            belong |   1416.2366     1    1416.2366    39.79     0.0000
            gender |  408.520097     1   408.520097    11.48     0.0008
     belong*gender |  3.79216612     1   3.79216612     0.11     0.7448
                   |
          Residual |  8506.54574   239   35.5922416
       ------------+----------------------------------------------------
             Total |  10935.2181   242   45.1868517

In this example of "two-way factorial ANOVA," the output shows significant main effects for belong (P = .0000) and gender (P = .0008), but their interaction contributes little to the model (P = .7448). This interaction cannot be distinguished from zero, so we might prefer to fit a simpler model without the interaction term (results not shown):

. anova drink belong gender

To include any interaction term with anova, specify the variable names joined by *. Unless the number of observations with each combination of x values is the same (a condition called "balanced data"), it can be hard to interpret the main effects in a model that also includes interactions. This does not mean that the main effects in such models are unimportant, however. Regression analysis might help to make sense of complicated ANOVA results, as illustrated in the following section.

Analysis of Covariance (ANCOVA)

Analysis of covariance (ANCOVA) extends N-way ANOVA to encompass a mix of categorical and continuous x variables. This is accomplished through the anova command if we specify which variables are continuous. For example, when we include gpa (college grade point average) among the independent variables, we find that it, too, is related to drinking behavior.

. anova drink belong gender gpa, continuous(gpa)

                        Number of obs =     218     R-squared     =  0.2970
                        Root MSE      = 5.68939     Adj R-squared =  0.2872

            Source |  Partial SS    df       MS          F     Prob > F
       ------------+----------------------------------------------------
             Model |  2927.03087     3   975.676958    30.14     0.0000
                   |
            belong |  1489.31999     1   1489.31999    46.01     0.0000
            gender |  405.137843     1   405.137843    12.52     0.0005
               gpa |    407.0089     1     407.0089    12.57     0.0005
                   |
          Residual |  6926.99206   214   32.3691218
       ------------+----------------------------------------------------
             Total |  9854.02294   217   45.4102439

From this analysis we know that a significant relationship exists between drink and gpa when we control for belong and gender. Beyond their F tests for statistical significance, however, ANOVA or ANCOVA ordinarily do not provide much descriptive information about how variables are related. Regression, with its explicit model and parameter estimates, does a better descriptive job. Because ANOVA and ANCOVA amount to special cases of regression, we could restate these analyses in regression form. Stata does so automatically if we add the regress option to anova. For instance, we might want to see regression output in order to understand results from the following ANCOVA.


. anova drink belong gender belong*gender gpa, continuous(gpa) regress

      Source |       SS       df       MS              Number of obs =     218
-------------+------------------------------           F(  4,   213) =   22.57
       Model |  2933.45823     4  733.364558           Prob > F      =  0.0000
    Residual |   6920.5647   213  32.4909141           R-squared     =  0.2977
-------------+------------------------------           Adj R-squared =  0.2845
       Total |  9854.02294   217  45.4102439           Root MSE      =  5.7001

-------------------------------------------------------------------------------
        drink |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
        _cons |   27.47676   2.439962    11.26   0.000      22.6672    32.28633
       belong |
            1 |   6.925384   1.286774     5.38   0.000     4.388942    9.461826
            2 |  (dropped)
       gender |
            1 |  -2.629057   .8917152    -2.95   0.004    -4.386774   -.8713407
            2 |  (dropped)
          gpa |  -3.054633   .8593498    -3.55   0.000    -4.748552   -1.360713
belong*gender |
          1 1 |  -.8656158   1.946211    -0.44   0.657    -4.701916    2.970685
          1 2 |  (dropped)
          2 1 |  (dropped)
          2 2 |  (dropped)
-------------------------------------------------------------------------------

With the regress option, we get the anova output formatted as a regression table. The top part gives the same overall F test and R-squared as a standard ANOVA table. The bottom part describes the following regression: we construct a separate dummy variable {0,1} representing each category of each x variable, except for the highest categories, which are dropped. Interaction terms (if specified in the variable list) are constructed from the products of every possible combination of these dummy variables. Regress y on all these dummy variables and interactions, and also on any continuous variables specified in the command line.

The previous example therefore corresponds to a regression of drink on four x variables:
1. a dummy coded 1 = fraternity/sorority member, 0 otherwise (the highest category of belong, nonmember, gets dropped);
2. a dummy coded 1 = female, 0 otherwise (the highest category of gender, male, gets dropped);
3. the continuous variable gpa;
4. an interaction term coded 1 = sorority female, 0 otherwise.

Interpret the individual dummy variables' regression coefficients as effects on predicted or "adjusted" means. For example, the coefficient on the first category of gender (female) equals -2.629057. This informs us that the mean drinking scale levels for females are about 2.63 points lower than those of males with the same grade point average and membership status. And we know that among students of the same gender and membership status, mean drinking scale values decline by 3.054633 with each one-point increase in grades. Note also that we have confidence intervals and individual t tests for each coefficient; there is much more information in the anova, regress output than in the ANOVA table alone.


Predicted Values and Error-Bar Charts

After anova, the followup command predict calculates predicted values, residuals, standard errors, and diagnostic statistics. One application for such statistics is in drawing graphical representations of the model's predictions, in the form of error-bar charts. For a simple illustration, we return to the one-way ANOVA of drink by year.
. anova drink year

                       Number of obs =     243     R-squared     =  0.0609
                       Root MSE      = 6.55489    Adj R-squared =  0.0491

              Source |  Partial SS    df       MS           F     Prob > F
          -----------+----------------------------------------------------
               Model |  666.200518     3   222.066839      5.17     0.0018
                     |
                year |  666.200518     3   222.066839      5.17     0.0018
                     |
            Residual |  10269.0176   239   42.9666008
          -----------+----------------------------------------------------
               Total |  10935.2181   242   45.1868517

To calculate predicted means from the recent anova, type predict followed by a new
variable name:
. predict drinkmean
(option xb assumed; fitted values)
. label variable drinkmean "Mean drinking scale"

With the stdp option, predict calculates standard errors of the predicted means:
. predict SEdrink, stdp

Using these new variables, we apply the serrbar command to create an error-bar chart. The scale(2) option tells serrbar to draw error bars of plus and minus two standard errors, from drinkmean - 2*SEdrink to drinkmean + 2*SEdrink. In a serrbar command, the first-listed variable should be the means or y variable; the second-listed, the standard error or standard deviation (depending on which you want to show); and the third-listed variable defines the x axis. The plot( ) option for serrbar can specify a second plot to overlay on the standard-error bars. In Figure 5.3, we overlay a line plot that connects the drinkmean values with solid line segments.
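The quantities serrbar plots are simple to compute. This Python sketch uses invented group data (not the student survey); note that Stata's stdp after a one-way anova is based on the pooled Root MSE, whereas this sketch uses each group's own standard deviation, so it only approximates that calculation:

```python
from statistics import mean, stdev
from math import sqrt

# invented drinking-scale values for three class years (illustration only)
drink_by_year = {1: [20, 24, 19], 2: [25, 27, 23], 3: [22, 21, 24]}

bars = {}
for year, values in drink_by_year.items():
    m = mean(values)
    se = stdev(values) / sqrt(len(values))      # standard error of the group mean
    bars[year] = (m - 2 * se, m, m + 2 * se)    # lower bar, mean, upper bar
```

The three numbers stored for each year are exactly what an error-bar chart draws: the group mean and its plus/minus two standard-error limits.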


Statistics with Stata

. serrbar drinkmean SEdrink year, scale(2)
     plot(line drinkmean year, clpattern(solid)) legend(off)
[Figure 5.3: error-bar chart of mean drinking scale by year in college]

For a two-way factorial ANOVA, error-bar charts help us to visualize main and interaction
effects. Although the usual error-bar command serrbar can, with effort, be adapted for this
purpose, an alternative approach using the more flexible graph twoway family will be
illustrated below. First, we perform ANOVA, obtain group means (predicted values) and their
standard errors, then generate new variables equal to the group means plus or minus two
standard errors. The example examines the relationship between students’ aggressive behavior
(aggress), gender, and year in college. Both the main effects of gender and year, and their
interaction, are statistically significant.
. anova aggress gender year gender*year

                       Number of obs =     243     R-squared     =  0.2503
                       Root MSE      = 1.45652    Adj R-squared =  0.2280

              Source |  Partial SS    df       MS           F     Prob > F
         ------------+----------------------------------------------------
               Model |  166.432503     7   23.7832147     11.21     0.0000
                     |
              gender |  94.3505972     1   94.3505972     44.47     0.0000
                year |  19.0404045     3   6.34680149      2.99     0.0317
         gender*year |  24.1029759     3   8.03432529      3.79     0.0111
                     |
            Residual |  498.538073   235   2.12143861
         ------------+----------------------------------------------------
               Total |  665.020576   242   2.74801891


. predict aggmean
(option xb assumed; fitted values)
. label variable aggmean "Mean aggressive behavior scale"
. predict SEagg, stdp
. gen agghigh = aggmean + 2*SEagg
. gen agglow = aggmean - 2*SEagg
. graph twoway connected aggmean year
     || rcap agghigh agglow year
     || , by(gender, legend(off) note("")) ytitle("Mean aggressive behavior scale")
[Figure 5.4: error-bar charts of mean aggressive behavior scale by year in college, with separate panels for females and males]

Figure 5.4 built error-bar charts by overlaying two pairs of plots. The first pair are female and male connected-line plots, connecting the group means of aggress (which we calculated using predict, and saved as the variable aggmean). The second pair are female and male capped-spike range plots (twoway rcap) in which the vertical spikes connect the variables agghigh (group means of aggress plus two standard errors) and agglow (group means of aggress minus two standard errors). The by(gender) option produced sub-plots for females and males. Notice that to suppress legends and notes in a graph that uses a by( ) option, legend(off) and note("") must appear as suboptions within by( ).

The resulting error-bar chart (Figure 5.4) shows female means on the aggressive-behavior scale fluctuating at comparatively low levels during the four years of college. Male means are higher throughout, with a sophomore-year peak that resembles the pattern seen earlier for drinking (Figures 5.2 and 5.3). Thus, the relationship between aggress and year is different for males and females. This graph helps us to understand and explain the significant interaction effect.

predict works the same way with regression analysis ( regress ) as it does with anova because the two share a common mathematical framework. A list of some other predict options appears in Chapter 6, and further examples using these options are given in Chapter 7. The options include residuals that can be used to check assumptions regarding error distributions, and also a suite of diagnostic statistics (such as leverage, Cook's D, and DFITS) that measure the influence of individual observations on model results. The Durbin-Watson test ( dwstat ), described in Chapter 13, can also be used after anova to test for first-order autocorrelation. Conditional effect plotting (Chapter 7) provides a graphical approach that can aid interpretation of more complicated regression, ANOVA, or ANCOVA models.


Linear Regression Analysis

Stata offers an exceptionally broad range of regression procedures. A partial list of the possibilities can be seen by typing help regress . This chapter introduces regress and related commands that perform simple and multiple ordinary least squares (OLS) regression. One followup command, predict , calculates predicted values, residuals, and diagnostic statistics such as leverage or Cook's D. Another followup command, test , performs tests of user-specified hypotheses. regress can accomplish other analyses including weighted least squares and two-stage least squares. Regression with dummy variables, interaction effects, polynomial terms, and stepwise variable selection are covered briefly in this chapter, along with a first look at residual analysis.

The following menus access most of the operations discussed:
Statistics - Linear regression and related - Linear regression
Statistics - Linear regression and related - Regression diagnostics
Statistics - General post-estimation - Obtain predictions, residuals, etc., after estimation
Graphics - Overlaid twoway graphs

Statistics - Cross-sectional time series

Example Commands
. regress y x

Performs ordinary least squares (OLS) regression of variable y on one predictor, x.
. regress y x if ethnic == 3 & income > 50

Regresses y on x using only that subset of the data for which variable ethnic equals 3 and
income is greater than 50.
. predict yhat

Generates a new variable (here arbitrarily named yhat) equal to the predicted values from the most recent regression.
. predict e, resid

Generates a new variable (here arbitrarily named e) equal to the residuals from the most recent regression.
. graph twoway lfit y x || scatter y x

Draws the simple regression line ( lfit , linear fit) with a scatterplot of y vs. x.


. graph twoway mspline yhat x || scatter y x

Draws a simple regression line with a scatterplot of y vs. x by connecting (with a smooth cubic spline curve) the regression's predicted values (in this example named yhat).
Note: There are many alternative ways to draw regression lines or curves in Stata. These alternatives include the twoway graph types mspline (illustrated above), mband , line , lfit , lfitci , qfit , and qfitci , each of which has its own advantages and options. Usually we combine (overlay) the regression line or curve with a scatterplot. If the scatterplot comes second in our graph twoway command, as in the example above, then scatterplot points will print on top of the regression line. Placing the scatterplot first in the command causes the line to print on top of the scatter. Examples throughout this and the following chapters illustrate some of these different possibilities.

. rvfplot

Draws a residual versus fitted (predicted values) plot, automatically based on the most recent regression.


. graph twoway scatter e yhat, yline(0)

Draws a residual versus predicted values plot using the variables e and yhat.
. regress y x1 x2 x3

Performs multiple regression of y on three predictor variables, x1 , x2 , and x3 .
. regress y x1 x2 x3, robust

Calculates robust (Huber/White) estimates of standard errors. See the User's Guide for details. The robust option works with many other model fitting commands as well.
. regress y x1 x2 x3, beta

Performs multiple regression and includes standardized regression coefficients ("beta weights") in the output table.
. correlate x1 x2 x3 y

Displays a matrix of Pearson correlations, using only observations with no missing values on all of the variables specified. Adding the option covariance produces a variance-covariance matrix instead of correlations.
. pwcorr x1 x2 x3 y, sig

Displays a matrix of Pearson correlations, using pairwise deletion of missing values and showing probabilities from t tests of H0: rho = 0 on each correlation.
. graph matrix x1 x2 x3 y, half

Draws a scatterplot matrix. Because their variable lists are the same, this example yields a scatterplot matrix having the same organization as the correlation matrix produced by the preceding pwcorr command. Listing the dependent (y) variable last creates a matrix in which the bottom row forms a series of y-versus-x plots.
. test x1 x2

Performs an F test of the null hypothesis that coefficients on x1 and x2 both equal zero in the most recent regression model.
. xi: regress y x1 x2 i.catvar*x2

Performs "expanded interaction" regression of y on predictors x1 , x2 , a set of dummy variables created automatically to represent categories of catvar, and a set of interaction terms equal to those dummy variables times measurement variable x2 . help xi gives more details.
. sw regress y x1 x2 x3, pr(.05)

Performs stepwise regression using backward elimination until all remaining predictors are significant at the .05 level. All listed predictors are entered on the first iteration. Thereafter, each iteration drops the one predictor with the highest P value, until all predictors remaining have probabilities below the "probability to retain," pr(.05) . Options permit forward or hierarchical selection. Stepwise variants exist for many other model-fitting commands as well; type help sw for a list.
. regress y x1 x2 x3 [aweight = w]

Performs weighted least squares (WLS) regression of y on x1 , x2 , and x3 . Variable w holds the analytical weights, which work as if we had multiplied each variable and the constant by the square root of w, and then performed an ordinary regression. Analytical weights are often employed to correct for heteroskedasticity when the y and x variables are means, rates, or proportions, and w is the number of individuals making up each aggregate observation (e.g., city or school) in the data. If the y and x variables are individual-level, and the weights indicate numbers of replicated observations, then use frequency weights [fweight = v] instead. See help svy if the weights reflect design factors such as disproportionate sampling.
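The mechanics of analytic weights can be checked numerically. This Python sketch (toy data and weights, invented for illustration) solves the weighted normal equations for a one-predictor model, which gives the same answer as multiplying each variable and the constant by the square root of w and running OLS:

```python
# toy data and analytic weights (e.g. group sizes), invented for illustration
x = [1.0, 2.0, 3.0]
y = [2.0, 3.0, 5.0]
w = [4.0, 1.0, 1.0]

W = sum(w)
xbar = sum(wi * xi for wi, xi in zip(w, x)) / W   # weighted mean of x
ybar = sum(wi * yi for wi, yi in zip(w, y)) / W   # weighted mean of y

# weighted sums of squares and cross-products
sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))

slope = sxy / sxx                 # WLS slope
intercept = ybar - slope * xbar   # WLS intercept
```

Because the first observation carries weight 4, the fitted line is pulled toward it, exactly as if that row had been duplicated with proportional weight before an ordinary regression.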
. regress y1 y2 x (x z)
. regress y2 y1 z (x z)

Estimates the reciprocal effects of y1 and y2, using instrumental variables x and z. The first parts of these commands specify the structural equations:
     y1 = a0 + a1*y2 + a2*x + e1
     y2 = b0 + b1*y1 + b2*z + e2
The parentheses in the commands enclose variables that are exogenous to all of the structural equations. regress accomplishes two-stage least squares (2SLS) in this example.
. svy: regress y x1 x2 x3

Regresses y on predictors x1 , x2 , and x3 , with appropriate adjustments for a complex survey sampling design. We assume that a svyset command has previously been used to set up the data, by specifying the strata, clusters, and sampling probabilities. help svy lists the many procedures available for working with complex survey data. help regress outlines the syntax of this particular command; follow references to the User's Guide and the Survey Data Reference Manual for details.
. xtreg y x1 x2 x3 x4, re

Fits a panel (cross-sectional time series) model with random effects by generalized least squares (GLS). An observation in panel data consists of information about unit i at time t, and there are multiple observations (times) for each unit. Before using xtreg , the variable identifying the units was specified by an iis ("i is") command, and the variable identifying time by tis ("t is"). Once the data have been saved, these definitions are retained for future analysis by xtreg and other xt procedures. help xt lists available panel estimation procedures; help xtreg gives the syntax of this command and references to the printed documentation. If your data include many observations for each unit, a time-series approach could be more appropriate. Stata's time series procedures (introduced in Chapter 13) provide further tools for analyzing panel data. Consult the Longitudinal/Panel Data Reference Manual for a full description.
. xtmixed population year || city: year

Assume that we have yearly data on population, for a number of different cities. The xtmixed population year part specifies a "fixed-effects" model, similar to ordinary regression, which describes the average trend in population. The || city: year part specifies a "random-effects" model, allowing unique intercepts and slopes (different starting points and growth rates) for each city.
. xtmixed SAT grades prepcourse || district: pctcollege || region:
Fits a hierarchical (nested or multi-level) linear model predicting students' SAT scores as a function of the individual students' grades and whether they took a preparation course; the percent college graduates among their school district's adults; and region of the country (region affecting the y-intercept only). See the Longitudinal/Panel Data Reference Manual for much more about the xtmixed command, which is new with Stata 9.

The Regression Table
File states.dta contains educational data on the U.S. states and District of Columbia:
. describe state csat expense percent income high college region
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
state           str20  %20s                   State
csat            int    %9.0g                  Mean composite SAT score
expense         int    %9.0g                  Per pupil expenditures prim&sec
percent         byte   %9.0g                  % HS graduates taking SAT
income          long   %10.0g                 Median household income
high            float  %9.0g                  % adults HS diploma
college         float  %9.0g                  % adults college degree
region          byte   %9.0g       region     Geographical region

Political leaders occasionally use mean Scholastic Aptitude Test (SAT) scores to make pointed comparisons between the educational systems of different U.S. states. For example, some have raised the question of whether SAT scores are higher in states that spend more money on education. We might try to address this question by regressing mean composite SAT scores (csat) on per-pupil expenditures (expense). The appropriate Stata command has the form regress y x , where y is the predicted or dependent variable, and x the predictor or independent variable.
. regress csat expense

      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------           F(  1,    49) =   13.61
       Model |  48708.3001     1  48708.3001           Prob > F      =  0.0006
    Residual |   175306.21    49  3577.67775           R-squared     =  0.2174
-------------+------------------------------           Adj R-squared =  0.2015
       Total |   224014.51    50   4480.2902           Root MSE      =  59.814

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     expense |  -.0222756   .0060371    -3.69   0.001    -.0344077   -.0101436
       _cons |   1060.732    32.7009    32.44   0.000     995.0175    1126.447
------------------------------------------------------------------------------


This regression tells an unexpected story: the more money a state spends on education, the lower its students' mean SAT scores. Any causal interpretation is premature at this point, but the regression table does convey information about the linear statistical relationship between csat and expense. At upper right it gives an overall F test, based on the sums of squares at the upper left. This F test evaluates the null hypothesis that coefficients on all x variables in the model (here there is only one x variable, expense) equal zero. The F statistic, 13.61 with 1 and 49 degrees of freedom, leads easily to rejection of this null hypothesis (P = .0006). Prob > F means "the probability of a greater F statistic" if we drew samples randomly from a population in which the null hypothesis is true.

At upper right, we also see the coefficient of determination, R2 = .2174. Per-pupil expenditures explain about 22% of the variance in states' mean composite SAT scores. Adjusted R2, R2a = .2015, takes into account the complexity of the model relative to the complexity of the data. This adjusted statistic is often more informative for research.
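The adjusted figure can be recomputed from numbers already shown in the table. With n = 51 observations and k = 1 predictor, a quick Python check:

```python
# quantities from the regression table above
n, k = 51, 1
R2 = 0.2174

# adjusted R-squared penalizes model complexity relative to sample size
R2a = 1 - (1 - R2) * (n - 1) / (n - k - 1)
```

which reproduces the Adj R-squared shown in the output to about three decimal places.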
The lower half of the regression table gives the fitted model itself. We find coefficients (slope and y-intercept) in the first column, here yielding the prediction equation

predicted csat = 1060.732 - .0222756 expense

The second column lists estimated standard errors of the coefficients. These are used to calculate t tests (columns 2-4) and confidence intervals (columns 5-6) for each regression coefficient. The t statistics (coefficients divided by their standard errors) test null hypotheses that the corresponding population coefficients equal zero. At the alpha = .05 significance level, we could reject this null hypothesis regarding both the coefficient on expense (P = .001) and the y-intercept (P = 0.000, really meaning P < .0005). Stata's modeling commands print 95% confidence intervals routinely, but we can request other levels by specifying the level( ) option, as shown in the following:
. regress csat expense, level(99)

Because these data do not represent a random sample from some larger population of U.S.
states, hypothesis tests and confidence intervals lack their usual meanings. They are discussed
in this chapter anyway for purposes of illustration.
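For readers who want to see where the Coef. column comes from, here is a minimal Python sketch of the simple-regression formulas (toy x and y values, not the states.dta variables):

```python
from statistics import mean

# toy predictor and response, invented for illustration
x = [2.0, 4.0, 6.0, 8.0]
y = [5.0, 4.0, 2.0, 1.0]

xbar, ybar = mean(x), mean(y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)

slope = sxy / sxx                # corresponds to the coefficient on x
intercept = ybar - slope * xbar  # corresponds to _cons
```

These are the same least-squares quantities regress reports; the standard errors, t statistics, and confidence intervals in the remaining columns all derive from them and the residual variance.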
The term _cons stands for the regression constant, usually set at one. Stata automatically includes a constant unless we tell it not to. The nocons option causes Stata to suppress the constant, performing regression through the origin. For example,
. regress y x, nocons

or

. regress y x1 x2 x3, nocons

In certain advanced applications, you might need to specify your own constant. If the independent variables include a user-supplied constant (named c, for example), employ the hascons option instead of nocons :
. regress y c x, hascons

Using nocons in this situation would result in a misleading F test and R2. Consult the Base Reference Manual or help regress for more about hascons .


Multiple Regression

Multiple regression allows us to estimate how expense predicts csat, while adjusting for a
number of other possible predictor variables. We can incorporate other predictors of csat
simply by listing these variables in the command
. regress csat expense percent income high college

      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------           F(  5,    45) =   42.23
       Model |  184663.309     5  36932.6617           Prob > F      =  0.0000
    Residual |  39351.2012    45  874.471137           R-squared     =  0.8243
-------------+------------------------------           Adj R-squared =  0.8048
       Total |   224014.51    50   4480.2902           Root MSE      =  29.571

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     expense |   .0033528   .0044709     0.75   0.457     -.005652    .0123576
     percent |  -2.618177   .2538491   -10.31   0.000    -3.129455   -2.106898
      income |   .0001056   .0011661     0.09   0.928     -.002243    .0024542
        high |   1.630841    .992247     1.64   0.107     -.367647    3.629329
     college |   2.030894   1.660118     1.22   0.228    -1.312756    5.374544
       _cons |   851.5649   59.29228    14.36   0.000     732.1441    970.9857
------------------------------------------------------------------------------

This yields the multiple regression equation

predicted csat = 851.56 + .00335 expense - 2.618 percent + .0001 income + 1.63 high + 2.03 college

Controlling for four other variables weakens the coefficient on expense from -.0223 to .00335, which is no longer statistically distinguishable from zero. The unexpected negative relationship between expense and csat found in our earlier simple regression evidently can be explained by other predictors.


Only the coefficient on percent (percentage of high school graduates taking the SAT) attains significance at the .05 level. We could interpret this "fourth-order partial regression coefficient" (so called because its calculation adjusts for four other predictors) as follows:

b2 = -2.618. Predicted mean SAT scores decline by 2.618 points, with each one-point increase in the percentage of high school graduates taking the SAT — if expense, income, high, and college do not change.

Taken together, the five x variables in this model explain about 80% of the variance in states' mean composite SAT scores (R2a = .8048). In contrast, our earlier simple regression with expense as the only predictor explained only 20% of the variance in csat.
To obtain standardized regression coefficients (“beta weights”) with any regression, add
the beta option. Standardized coefficients are what we would see in a regression where all
the variables had been transformed into standard scores (means 0, standard deviations 1).
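For a single predictor, the beta weight is just the unstandardized slope rescaled by the ratio of standard deviations, beta = b * sd(x) / sd(y); the same per-predictor rescaling applies in multiple regression. A toy Python check (invented data; b is the OLS slope of y on x for these points):

```python
from statistics import stdev

# toy data, invented for illustration; b is the unstandardized OLS slope
# of y on x for exactly these points
x = [2.0, 4.0, 6.0, 8.0]
y = [5.0, 4.0, 2.0, 1.0]
b = -0.7

beta = b * stdev(x) / stdev(y)   # standardized ("beta weight") coefficient
```

In this one-predictor case beta also equals the Pearson correlation between x and y, which is why simple-regression beta weights never exceed 1 in magnitude.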

. regress csat expense percent income high college, beta

      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------           F(  5,    45) =   42.23
       Model |  184663.309     5  36932.6617           Prob > F      =  0.0000
    Residual |  39351.2012    45  874.471137           R-squared     =  0.8243
-------------+------------------------------           Adj R-squared =  0.8048
       Total |   224014.51    50   4480.2902           Root MSE      =  29.571

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|                     Beta
-------------+----------------------------------------------------------------
     expense |   .0033528   .0044709     0.75   0.457                  .070185
     percent |  -2.618177   .2538491   -10.31   0.000                -1.024538
      income |   .0001056   .0011661     0.09   0.928                 .0101321
        high |   1.630841    .992247     1.64   0.107                 .1361672
     college |   2.030894   1.660118     1.22   0.228                 .1263952
       _cons |   851.5649   59.29228    14.36   0.000                        .
------------------------------------------------------------------------------

The standardized regression equation is

predicted csat* = .07 expense* - 1.0245 percent* + .01 income* + .136 high* + .126 college*

where csat*, expense*, etc. denote these variables in standard-score form. We might interpret the standardized coefficient on percent, for example, as follows:

b2* = -1.0245. Predicted mean SAT scores decline by 1.0245 standard deviations, with each one-standard-deviation increase in the percentage of high school graduates taking the SAT — if expense, income, high, and college do not change.

The F and t tests, R2, and other aspects of the regression remain the same.

Predicted Values and Residuals
After any regression, the predict command can obtain predicted values, residuals, and
other case statistics. Suppose we have just done a regression of composite SAT scores on their
strongest single predictor:
. regress csat percent

Now, to create a new variable called yhat containing predicted y values from this regression, type
. predict yhat
. label variable yhat "Predicted mean SAT score"

Through the resid option, we can also create another new variable containing the
residuals, here named e:
. predict e, resid
. label variable e "Residual"

We might instead have obtained the same predicted y and residuals through two
generate commands:
. generate yhat0 = _b[_cons] + _b[percent]*percent

. generate e0 = csat - yhat0

Stata temporarily remembers coefficients and other details from the recent regression. Thus _b[varname] refers to the coefficient on independent variable varname, and _b[_cons] refers to the coefficient on _cons (usually, the y-intercept). These stored values are useful in programming and some advanced applications, but for most purposes, predict saves us the trouble of generating yhat0 and e0 "by hand" in this fashion.
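To make the "by hand" route concrete, here is a Python sketch with made-up coefficient values standing in for _b[_cons] and _b[percent] (the real values would come from the fitted regression):

```python
# hypothetical coefficients, for illustration only
b_cons = 1010.0     # stands in for _b[_cons]
b_percent = -2.0    # stands in for _b[percent]

percent = [10.0, 50.0]   # toy predictor values
csat = [1000.0, 900.0]   # toy observed responses

yhat0 = [b_cons + b_percent * p for p in percent]   # predicted values
e0 = [c - yh for c, yh in zip(csat, yhat0)]         # residuals
```

The residual is always observed minus predicted, so a negative e0 marks an observation the model overpredicts, just as described below.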
Residuals contain information about where the model fits poorly, and so are important for
diagnostic or troubleshooting analysis. Such analysis might begin just by sorting and
examining the residuals. Negative residuals occur when our model overpredicts the observed
values. That is, in these states the mean SAT scores are lower than we would expect, based on
what percentage of students took the test. To list the states with the five lowest residuals, type


. sort e
. list state percent csat yhat e in 1/5

     +----------------------------------------------------------+
     |          state   percent   csat       yhat            e  |
     |----------------------------------------------------------|
  1. | South Carolina        58    832   894.3333     -62.3333  |
  2. |  West Virginia        17    926   986.0953    -60.09526  |
  3. | North Carolina        57    844   896.5714     -52.5714  |
  4. |          Texas        44    874   925.6666    -51.66666  |
  5. |         Nevada        25    919   968.1905    -49.19049  |
     +----------------------------------------------------------+

The four lowest residuals belong to southern states, suggesting that we might be able to improve
our model, or better understand variation in mean SAT scores, by somehow taking region into
account.
Positive residuals occur when actual y values are higher than predicted. Because the data already have been sorted by e, to list the five highest residuals we add the qualifier in -5/l . The "-5" in this qualifier means the 5th-from-last observation, and the letter "l" (note that this is not the number "1") stands for the last observation. The qualifiers in 47/l or in 47/51 could accomplish the same thing.
. list state percent csat yhat e in -5/l

     +---------------------------------------------------------+
     |         state   percent   csat       yhat            e  |
     |---------------------------------------------------------|
 47. | Massachusetts        79    896   847.3333     48.66673  |
 48. |   Connecticut        81    897   842.8571     54.14292  |
 49. |  North Dakota         6   1073   1010.714     62.28567  |
 50. | New Hampshire        75    921   856.2856     64.71434  |
 51. |          Iowa         5   1093   1012.952     80.04758  |
     +---------------------------------------------------------+


predict also derives other statistics from the most recently fitted model. Below are some predict options that can be used after anova or regress .
. predict new
     Predicted values of y. predict new, xb means the same thing (referring to Xb, the vector of predicted y values).

. predict new, cooksd
     Cook's D influence measures.

. predict new, covratio
     COVRATIO influence measures; effect of each observation on the variance-covariance matrix of estimates.

. predict DFx1, dfbeta(x1)
     DFBETAs measuring each observation's influence on the coefficient of predictor x1.

. predict new, dfits
     DFITS influence measures.

. predict new, hat
     Diagonal elements of hat matrix (leverage).

. predict new, resid
     Residuals.

. predict new, rstandard
     Standardized residuals.

. predict new, rstudent
     Studentized (jackknifed) residuals.

. predict new, stdf
     Standard errors of predicted individual y, sometimes called the standard errors of forecast or the standard errors of prediction.

. predict new, stdp
     Standard errors of predicted mean y.

. predict new, stdr
     Standard errors of residuals.

. predict new, welsch
     Welsch's distance influence measures.
Further options obtain predicted probabilities and expected values; type help regress for a list. All predict options create case statistics, which are new variables (like predicted values and residuals) that have a value for each observation in the sample.
When using predict, substitute a new variable name of your choosing for new in the
commands shown above. For example, to obtain Cook’s D influence measures, type
. predict D, cooksd

Or you can find hat matrix diagonals by typing
. predict h, hat

The names of variables created by predict (such as yhat, e, D, h) are arbitrary and are invented by the user. As with other elements of Stata commands, we could abbreviate the options to the minimum number of letters it takes to identify them uniquely. For example,
. predict e, resid

could be shortened to
. pre e, re


Basic Graphs for Regression
This section introduces some <elementary
*
graphs you can use to represent a regression model
or examine its fit. Chapter 7 describes
_j more specialized graphs that aid post-regression
diagnostic work.

In simple regression, predicted values lie on the line defined by the regression equation. By plotting and connecting predicted values, we can make that line visible. The lfit (linear fit) command automatically draws a simple regression line.

. graph twoway lfit csat percent

Ordinarily, it is more interesting to overlay a scatterplot on the regression line, as done in Figure 6.1.

. graph twoway lfit csat percent || scatter csat percent
     || , ytitle("Mean composite SAT score") legend(off)

[Figure 6.1: simple regression line with scatterplot of mean composite SAT score versus % HS graduates taking SAT]

We could draw the same Figure 6.1 graph "by hand" using the predicted values (yhat) generated after the regression, and a command of the form

. graph twoway mspline yhat percent, bands(50) || scatter csat percent
     || , legend(off) ytitle("Mean composite SAT score")

The second approach is more work, but offers greater flexibility for advanced applications such
as conditional effect plots or nonlinear regression. Working directly with the predicted values
also keeps the analyst closer to the data, and to what a regression model is doing, graph
twoway mspline (cubic spline curve fit to 50 cross-medians) simply draws a straight line
when applied to linear predicted values, but will equally well draw a smooth curve in the case
of nonlinear predicted values.


Residual-versus-predicted-values plots provide useful diagnostic tools (Figure 6.2). After
any regression analysis (also after some other models, such as ANOVA) we can automatically
draw a residual-versus-fitted (predicted values) plot just by typing
. rvfplot, yline(0)
[Figure 6.2: residual-versus-fitted plot; residuals plotted against fitted values, with a horizontal reference line at zero]

The “by-hand” alternative for drawing Figure 6.2 would be
. graph twoway scatter e yhat, yline(0)

Figure 6.2 reveals that our present model overlooks an obvious pattern in the data. The residuals or prediction errors appear to be mostly positive at first, then mostly negative (due to too-high predictions), followed by mostly positive residuals again. Later sections will seek a model that better fits these data.
predict can generate two kinds of standard errors for the predicted y values, which have two different applications. These applications are sometimes distinguished by the names "confidence intervals" and "prediction intervals": A "confidence interval" in this context expresses our uncertainty in estimating the conditional mean of y at a given x value (or a given combination of x values, in multiple regression). Standard errors for this purpose are obtained through
. predict SE, stdp

Select an appropriate t value. With 49 degrees of freedom, for 95% confidence we should use t = 2.01, found by looking up the t distribution or simply by asking Stata:
. display invttail(49,.05/2)
2.0095752

Then the lower confidence limit is approximately

. generate low1 = yhat - 2.01*SE

and the upper confidence limit is

. generate high1 = yhat + 2.01*SE

Confidence bands in simple regression have an hourglass shape, narrowest at the mean of x. We could graph these using an overlaid twoway command such as the following:

. graph twoway mspline low1 percent, clpattern(dash) bands(50)
     || mspline high1 percent, clpattern(dash) bands(50)
     || mspline yhat percent, clpattern(solid) bands(50)
     || scatter csat percent
     || , legend(off) ytitle("Mean composite SAT score")

Shaded-area range plots (see help twoway_rarea ) offer a different way to draw such graphs, shading the range between low1 and high1. Alternatively, lfitci can do this automatically, and take care of the confidence-band calculations, as illustrated in Figure 6.3. Note the stdp option, calling for a conditional-mean confidence band (actually, the default).

. graph twoway lfitci csat percent, stdp || scatter csat percent, msymbol(O)
     || , ytitle("Mean composite SAT score") legend(off)
     title("Confidence bands for conditional means (stdp)")
[Figure 6.3: "Confidence bands for conditional means (stdp)"; mean composite SAT score versus % HS graduates taking SAT, with shaded confidence band]

The second type of confidence interval for regression predictions is sometimes called a
“prediction interval.” This expresses our uncertainty in estimating the unknown value of y for
an individual observation with known x value(s). Standard errors for this purpose are obtained
by typing

. predict SEyhat, stdf

Figure 6.4 (next page) graphs this prediction band using lfitci with the stdf option.
Predicting the y values of individual observations as done in Figure 6.4 inherently involves
greater uncertainty, and hence wider bands, than does predicting the conditional mean of y
(Figure 6.3). In both instances, the bands are narrowest at the mean of x.
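The two standard errors are directly related. Writing s for the root mean squared error
(Root MSE) of the regression, the standard error used by stdf combines the uncertainty
about the conditional mean with the residual variation of individual observations:

     SE(stdf) = sqrt( SE(stdp)^2 + s^2 )

This is why the stdf bands in Figure 6.4 are wider than the stdp bands in Figure 6.3 at
every value of x.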

Linear Regression Analysis

171

. graph twoway lfitci csat percent, stdf
     || scatter csat percent, msymbol(O)
     || , ytitle("Mean composite SAT score") legend(off)
       title("Confidence bands for individual-case predictions (stdf)")

[Figure 6.4: Confidence bands for individual-case predictions (stdf). Mean composite SAT score (y-axis) versus % HS graduates taking SAT (x-axis).]

As with other confidence intervals and hypothesis tests in OLS regression, the standard
errors and bands just described depend on the assumption of independent and identically
distributed errors. Figure 6.2 has cast doubt on this assumption, so the results in Figures 6.3
and 6.4 could be misleading.

Correlations
correlate obtains Pearson product-moment correlations between variables.

. correlate csat expense percent income high college

(obs=51)

             |     csat  expense  percent   income     high  college
-------------+------------------------------------------------------
        csat |   1.0000
     expense |  -0.4663   1.0000
     percent |  -0.8758   0.6509   1.0000
      income |  -0.4713   0.6784   0.6733   1.0000
        high |   0.0858   0.3133   0.1413   0.5099   1.0000
     college |  -0.3729   0.6400   0.6091   0.7234   0.5319   1.0000

correlate uses only a subset of the data that has no missing values on any of the
variables listed (with these particular variables, that does not matter because no observations
have missing values). In this respect, the correlate command resembles regress, and
given the same variable list, they will use the same subset of the data. Analysts not employing


regression or other multi-variable techniques, however, might prefer to find correlations based
upon all of the observations available for each variable pair. The command pwcorr
(pairwise correlation) accomplishes this, and can also furnish t-test probabilities for the null
hypotheses that each individual correlation equals zero.
. pwcorr csat expense percent income high college, sig

             |     csat  expense  percent   income     high  college
-------------+------------------------------------------------------
        csat |   1.0000
             |
     expense |  -0.4663   1.0000
             |   0.0006
     percent |  -0.8758   0.6509   1.0000
             |   0.0000   0.0000
      income |  -0.4713   0.6784   0.6733   1.0000
             |   0.0005   0.0000   0.0000
        high |   0.0858   0.3133   0.1413   0.5099   1.0000
             |   0.5495   0.0252   0.3226   0.0001
     college |  -0.3729   0.6400   0.6091   0.7234   0.5319   1.0000
             |   0.0070   0.0000   0.0000   0.0000   0.0001

It is worth recalling here that if we drew many random samples from a population in which
all variables really had 0 correlations, about 5% of the sample correlations would nonetheless
test “statistically significant” at the .05 level. Analysts who review many individual hypothesis
tests, such as those in a pwcorr matrix, to identify the handful that are significant at the .05
level, therefore run a much higher than .05 risk of making a Type I error. This problem is called
the “multiple comparison fallacy.” pwcorr offers two methods, Bonferroni and Sidak, for
adjusting significance levels to take multiple comparisons into account. Of these, the Sidak
method is more precise.
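The adjustments themselves are simple formulas: Bonferroni multiplies each p-value by the number of comparisons k, while Sidak uses 1 - (1 - p)^k. A Python sketch (function names are ours; note that Stata adjusts the unrounded p-values, so its results differ slightly from adjusting the rounded .0006 displayed above):

```python
def bonferroni(p, k):
    # Bonferroni adjustment: multiply by the number of comparisons, cap at 1
    return min(1.0, k * p)

def sidak(p, k):
    # Sidak adjustment: chance that at least one of k independent tests
    # would show a p-value this small
    return 1 - (1 - p) ** k

# six variables yield k = 15 distinct pairwise correlations
k = 6 * 5 // 2
print(round(sidak(0.0006, k), 4), round(bonferroni(0.0006, k), 4))
```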
. pwcorr csat expense percent income high college, sidak sig

             |     csat  expense  percent   income     high  college
-------------+------------------------------------------------------
        csat |   1.0000
             |
     expense |  -0.4663   1.0000
             |   0.0084
     percent |  -0.8758   0.6509   1.0000
             |   0.0000   0.0000
      income |  -0.4713   0.6784   0.6733   1.0000
             |   0.0072   0.0000   0.0000
        high |   0.0858   0.3133   0.1413   0.5099   1.0000
             |   1.0000   0.3180   0.9971   0.0020
     college |  -0.3729   0.6400   0.6091   0.7234   0.5319   1.0000
             |   0.1004   0.0000   0.0000   0.0000   0.0009


Comparing the test probabilities in the table above with those of the previous pwcorr
provides some idea of how much adjustment occurs. In general, the more variables we
correlate, the more the adjusted probabilities will exceed their unadjusted counterparts. See the
Base Reference Manual's discussion of oneway for the formulas involved.
correlate itself offers several important options. Adding the covariance option
produces a matrix of variances and covariances instead of correlations:

. correlate w x y z, covariance

Typing the following after a regression analysis displays the matrix of correlations between
estimated coefficients, sometimes used to diagnose multicollinearity (see Chapter 7):

. correlate, _coef

The following command will display the estimated coefficients' variance-covariance matrix,
from which standard errors are derived:

. correlate, _coef covariance

Pearson correlation coefficients measure how well an OLS regression line fits the data.
They consequently share the assumptions and weaknesses of OLS, and like OLS, should
generally not be interpreted without first reviewing the corresponding scatterplots. A
scatterplot matrix provides a quick way to do this, using the same organization as the
correlation matrix. Figure 6.5 shows a scatterplot matrix corresponding to the pwcorr
matrix given earlier. Only the lower-triangular half of the matrix is drawn, and plus signs are
used as plotting symbols. We suppress y- and x-axis labeling here to keep the graph uncluttered.

. graph matrix csat expense percent income high college,
      half msymbol(+) maxis(ylabel(none) xlabel(none))

[Figure 6.5: Scatterplot matrix of Mean composite SAT score; Per pupil expenditures; % HS graduates taking SAT; Median household income, $1,000; % adults HS diploma; % adults college degree.]

To obtain a scatterplot matrix corresponding to a correlate correlation matrix, from
which all observations having missing values have been dropped, we would need to qualify the
command. If all of the variables had some missing values, we could type a command such as

. graph matrix csat expense percent income high college if csat < .
      & expense < . & income < . & high < . & college < .

To reduce the likelihood of confusion and mistakes, it might make sense to create a new dataset
keeping only those observations that have no missing values:

. keep if csat < . & expense < . & income < . & high < . & college < .
. save nmvstate

In this example, we immediately saved the reduced dataset with a new name, so as to avoid
inadvertently writing over and losing the information in the old, more complete dataset. An
alternative way to eliminate missing values uses drop instead of keep:

. drop if csat >= . | expense >= . | income >= . | high >= . | college >= .
. save nmvstate

In addition to Pearson correlations, Stata can also calculate several rank-based correlations.
These can be employed to measure associations between ordinal variables, or as an outlier-resistant
alternative to Pearson correlation for measurement variables. To obtain the Spearman
rank correlation between csat and expense, equivalent to the Pearson correlation if these
variables were transformed into ranks, type

. spearman csat expense

 Number of obs =      51
Spearman's rho =  -0.4282

Test of Ho: csat and expense are independent
    Prob > |t| =     0.0017

Kendall's tau-a and tau-b rank correlations can be found easily for these data, although
with larger datasets their calculation becomes slow:

. ktau csat expense

  Number of obs =       51
Kendall's tau-a =  -0.2925
Kendall's tau-b =  -0.2932
Kendall's score =     -373
    SE of score =  123.095   (corrected for ties)

Test of Ho: csat and expense are independent
     Prob > |z| =   0.0025  (continuity corrected)

For comparison, here is the Pearson correlation with its (unadjusted) P-value:


. pwcorr csat expense, sig

             |     csat  expense
-------------+------------------
        csat |   1.0000
             |
     expense |  -0.4663   1.0000
             |   0.0006

In this example, both spearman (-.4282) and pwcorr (-.4663) yield larger correlations
than ktau (-.2925 or -.2932). All three agree that null hypotheses of no association can be
rejected.
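The rank-based measures have simple definitions, which the following Python sketch illustrates (an illustration of the formulas only, not Stata's implementation; ties receive midranks in spearman, and this tau-a version applies no tie correction):

```python
def ranks(xs):
    # average ranks, with midranks for tied values
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    # Spearman's rho: the Pearson correlation of the ranks
    return pearson(ranks(x), ranks(y))

def kendall_tau_a(x, y):
    # tau-a: (concordant - discordant) pairs over n(n-1)/2
    n, score = len(x), 0
    for i in range(n):
        for j in range(i + 1, n):
            d = (x[i] - x[j]) * (y[i] - y[j])
            score += (d > 0) - (d < 0)
    return score / (n * (n - 1) / 2)
```

On a perfectly reversed ranking, both measures return -1.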

Hypothesis Tests
Two types of hypothesis tests appear in regress output tables. As with other common
hypothesis tests, they begin from the assumption that observations in the sample at hand were
drawn randomly and independently from an infinitely large population.
1. Overall F test: The F statistic at the upper right in the regression table evaluates the null
   hypothesis that in the population, coefficients on all the model's x variables equal zero.
2. Individual t tests: The third and fourth columns of the regression table contain t tests for
   each individual regression coefficient. These evaluate the null hypotheses that in the
   population, the coefficient on each particular x variable equals zero.

The t test probabilities are two-sided. For one-sided tests, divide these P-values in half.
In addition to these standard F and t tests, Stata can perform F tests of user-specified
hypotheses. The test command refers back to the most recent model-fitting command such
as anova or regress . For example, individual t tests from the following regression
report that neither the percent of adults with at least high school diplomas (high) nor the percent
with college degrees (college) has a statistically significant individual effect on composite SAT
scores.
. regress csat expense percent income high college

Conceptually, however, both predictors reflect the level of education attained by a state's
population, and for some purposes we might want to test the null hypothesis that both have
no effect. To do this, we begin by repeating the multiple regression quietly, because we do
not need to see its full output again. Then we use the test command:

. quietly regress csat expense percent income high college
. test high college

 ( 1)  high = 0.0
 ( 2)  college = 0.0

       F(  2,    45) =    3.32
            Prob > F =  0.0451


Unlike the individual null hypotheses, the joint hypothesis that coefficients on high and
college both equal zero can reasonably be rejected (P = .0451). Such tests on subsets of
coefficients are useful when we have several conceptually related predictors or when individual
coefficient estimates appear unreliable due to multicollinearity (Chapter 7).
test could duplicate the overall F test:

. test expense percent income high college

test could also duplicate the individual-coefficient tests:

. test expense
. test percent
. test income

and so forth. Applications of test more useful in advanced work include the following.

1. Test whether a coefficient equals a specified constant. For example, to test the null
   hypothesis that the coefficient on income equals 1 (H0: β3 = 1), instead of the usual null
   hypothesis that it equals 0 (H0: β3 = 0), type

. test income = 1

2. Test whether two coefficients are equal. For example, the following command evaluates
   the null hypothesis H0: β4 = β5:

. test high = college

3. Finally, test understands some algebraic expressions. We could request something like
   the following, which would test H0: β3 = (β4 + β5)/100:

. test income = (high + college)/100

Consult help test for more information and examples.

Dummy Variables
Categorical variables can become predictors in a regression when they are expressed as one or
more {0,1} dichotomies called “dummy variables.” For example, we have reason to suspect
that regional differences exist in states' mean SAT scores. The tabulate command will
generate one dummy variable for each category of the tabulated variable if we add a gen
(generate) option. Below, we create four dummy variables from the four-category variable
region. The dummies are named reg1, reg2, reg3, and reg4. reg1 equals 1 for Western states
and 0 for others; reg2 equals 1 for Northeastern states and 0 for others; and so forth.

. tabulate region, gen(reg)

Geographical |
      region |      Freq.     Percent        Cum.
-------------+-----------------------------------
        West |         13       26.00       26.00
     N. East |          9       18.00       44.00
       South |         16       32.00       76.00
     Midwest |         12       24.00      100.00
-------------+-----------------------------------
       Total |         50      100.00


. describe reg1-reg4

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------
reg1            byte   %8.0g                  region==West
reg2            byte   %8.0g                  region==N. East
reg3            byte   %8.0g                  region==South
reg4            byte   %8.0g                  region==Midwest

. tabulate reg1

region==Wes |
          t |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         37       74.00       74.00
          1 |         13       26.00      100.00
------------+-----------------------------------
      Total |         50      100.00

. tabulate reg2

 region==N. |
       East |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         41       82.00       82.00
          1 |          9       18.00      100.00
------------+-----------------------------------
      Total |         50      100.00

Regressing csat on one dummy variable, reg2 (Northeast), is equivalent to performing a
two-sample t test of whether mean csat is the same across categories of reg2. That is, is the
mean csat the same in the Northeast as in other U.S. states?
. regress csat reg2

      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  1,    48) =    9.50
       Model |  35191.4017     1  35191.4017           Prob > F      =  0.0034
    Residual |  177769.978    48  3703.54121           R-squared     =  0.1652
-------------+------------------------------           Adj R-squared =  0.1479
       Total |   212961.38    49  4346.15061           Root MSE      =  60.857

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        reg2 |   -69.0542   22.40167    -3.08   0.003    -114.0958   -24.01262
       _cons |   958.6098   9.504224   100.86   0.000     939.5002    977.7193
------------------------------------------------------------------------------

The dummy variable coefficient’s t statistic (t = -3.08, P = .003) indicates a significant
difference. According to this regression, mean SAT scores are 69.0542 points lower (because
b = -69.0542) among Northeastern states. We get exactly the same result (t = 3.08, P = .003)
from a simple t test, which also shows the means as 889.5556 (Northeast) and 958.6098 (other
states), a difference of 69.0542.


. ttest csat, by(reg2)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |      41    958.6098    10.36563    66.37239      937.66    979.5595
       1 |       9    889.5556    4.652094    13.95628    878.8278    900.2833
---------+--------------------------------------------------------------------
combined |      50      946.18    9.323251    65.92534    927.4442    964.9158
---------+--------------------------------------------------------------------
    diff |             69.0542    22.40167                24.01262    114.0958
------------------------------------------------------------------------------
Degrees of freedom: 48

                     Ho: mean(0) - mean(1) = diff = 0

     Ha: diff < 0               Ha: diff != 0              Ha: diff > 0
       t =   3.0825               t =   3.0825               t =   3.0825
   P < t =   0.9983          P > |t| =   0.0034          P > t =   0.0017

This conclusion proves spurious, however, once we control for the percentage of students
taking the test. We do so with a multiple regression of csat on both reg2 and percent.
. regress csat reg2 percent

      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  2,    47) =  107.18
       Model |  174664.983     2  87332.4916           Prob > F      =  0.0000
    Residual |  38296.3969    47  814.816955           R-squared     =  0.8202
-------------+------------------------------           Adj R-squared =  0.8125
       Total |   212961.38    49  4346.15061           Root MSE      =  28.545

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        reg2 |   57.52437   14.28326     4.03   0.000     28.79016    86.25858
     percent |  -2.793009   .2134796   -13.08   0.000    -3.222475   -2.363544
       _cons |   1033.749   7.270285   142.19   0.000     1019.123    1048.374
------------------------------------------------------------------------------

The Northeastern region variable reg2 now has a statistically significant positive coefficient
(b = 57.52437, P < .0005). The earlier negative relationship was misleading. Although mean
SAT scores among Northeastern states really are lower, they are lower because higher
percentages of students take this test in the Northeast. A smaller, more “elite” group of
students, often less than 20% of high school seniors, take the SAT in many of the non-Northeast
states. In all Northeastern states, however, large majorities (64% to 81%) do so. Once we
adjust for differences in the percentages taking the test, SAT scores actually tend to be higher
in the Northeast.
To understand dummy variable regression results, it can help to write out the regression
equation, substituting zeroes and ones. For Northeastern states, the equation is approximately

predicted csat = 1033.7 + 57.5reg2 - 2.8percent
               = 1033.7 + 57.5 × 1 - 2.8percent
               = 1091.2 - 2.8percent


For other states, the predicted csat is 57.5 points lower at any given level of percent:

predicted csat = 1033.7 + 57.5 × 0 - 2.8percent
               = 1033.7 - 2.8percent
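The arithmetic of the intercept shift is easy to check; a minimal Python sketch using the rounded coefficients from the text (the function name is ours, for illustration):

```python
def predicted_csat(reg2, percent):
    # rounded coefficients from the regression of csat on reg2 and percent
    return 1033.7 + 57.5 * reg2 - 2.8 * percent

# the dummy shifts only the intercept; the slope on percent is unchanged
print(round(predicted_csat(1, 0), 1))   # 1091.2 (Northeast intercept)
print(round(predicted_csat(0, 0), 1))   # 1033.7 (other states)
```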

Dummy variables in models such as this are termed “intercept dummy variables,” because they
describe a shift in the y-intercept or constant.
From a categorical variable with k categories we can define k dummy variables, but one of
these will be redundant. Once we know a state's values on the West, Northeast, and Midwest
dummy variables, for example, we can already guess its value on the South variable. For this
reason, no more than k - 1 of the dummy variables — three, in the case of region — can be
included in a regression. If we try to include all the possible dummies, Stata will automatically
drop one because multicollinearity otherwise makes the calculation impossible.
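The k - 1 coding rule can be sketched in a few lines of Python (make_dummies is our own illustration, not a Stata feature):

```python
def make_dummies(values, omit):
    # expand a k-category variable into k-1 {0,1} indicator columns,
    # leaving one category out as the reference group
    cats = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values]
            for c in cats if c != omit}

regions = ["West", "NE", "South", "Midwest", "NE"]
dummies = make_dummies(regions, omit="Midwest")
# a row that is zero on all three kept indicators must be a Midwest state
```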
. regress csat reg1 reg2 reg3 reg4 percent

      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  4,    45) =   64.61
       Model |  181378.099     4  45344.5247           Prob > F      =  0.0000
    Residual |  31583.2811    45  701.850691           R-squared     =  0.8517
-------------+------------------------------           Adj R-squared =  0.8385
       Total |   212961.38    49  4346.15061           Root MSE      =  26.492

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        reg1 |  -23.77315   11.12578    -2.14   0.038    -46.18162   -1.364676
        reg2 |   25.79985   16.96365     1.52   0.135    -8.366693    59.96639
        reg3 |  -33.29951   10.85443    -3.07   0.004    -55.16146   -11.43757
        reg4 |  (dropped)
     percent |  -2.546058   .2140196   -11.90   0.000    -2.977116   -2.115001
       _cons |   1047.638   8.273625   126.62   0.000     1030.974    1064.302
------------------------------------------------------------------------------

The model's fit — including R², F tests, predictions, and residuals — remains essentially
the same regardless of which dummy variable we (or Stata) choose to omit. Interpretation of
the coefficients, however, occurs with reference to that omitted category. In this example, the
Midwest dummy variable (reg4) was omitted. The regression coefficients on reg1, reg2, and
reg3 tell us that, at any given level of percent, the predicted mean SAT scores are
approximately as follows:
   23.8 points lower in the West (reg1 = 1) than in the Midwest;
   25.8 points higher in the Northeast (reg2 = 1) than in the Midwest; and
   33.3 points lower in the South (reg3 = 1) than in the Midwest.
The West and South both differ significantly from the Midwest in this respect, but the Northeast
does not.

An alternative command, areg, fits the same model without going through dummy
variable creation. Instead, it “absorbs” the effect of a k-category variable such as region. The
model's fit, F test on the absorbed variable, and other key aspects of the results are the same
as those we could obtain through explicit dummy variables. Note that areg does not provide
estimates of the coefficients on individual dummy variables, however.

. areg csat percent, absorb(region)

                                                       Number of obs =      50
                                                       F(  1,    45) =  141.52
                                                       Prob > F      =  0.0000
                                                       R-squared     =  0.8517
                                                       Adj R-squared =  0.8385
                                                       Root MSE      =  26.492

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     percent |  -2.546058   .2140196   -11.90   0.000    -2.977116   -2.115001
       _cons |   1035.445    8.38689   123.46   0.000     1018.553    1052.337
-------------+----------------------------------------------------------------
      region |      F(3, 45) =     9.465   0.000          (4 categories)
------------------------------------------------------------------------------

Although its output is less informative than regression with explicit dummy variables,
areg does have two advantages. It speeds up exploratory work, providing quick feedback
about whether a dummy variable approach is worthwhile. Secondly, when the variable of
interest has many values, creating dummies for each of them could lead to too many variables
or too large a model for our particular Stata configuration. areg thus works around the usual
limitations on dataset and matrix size.


Explicit dummy variables have other advantages, however, including ways to model
interaction effects. Interaction terms called “slope dummy variables” can be formed by
multiplying a dummy times a measurement variable. For example, to model an interaction
between Northeast/other region and percent, we create a slope dummy variable called reg2perc.

. generate reg2perc = reg2 * percent
(1 missing value generated)

The new variable, reg2perc, equals percent for Northeastern states and zero for all other states.
We can include this interaction term among the regression predictors:
. regress csat reg2 percent reg2perc

      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  3,    46) =   82.27
       Model |   179506.19     3  59835.3968           Prob > F      =  0.0000
    Residual |  33455.1897    46  727.286733           R-squared     =  0.8429
-------------+------------------------------           Adj R-squared =  0.8327
       Total |   212961.38    49  4346.15061           Root MSE      =  26.968

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        reg2 |  -241.3574   116.6278    -2.07   0.044     -476.117   -6.597821
     percent |  -2.858829   .2032947   -14.06   0.000     -3.26804   -2.449618
    reg2perc |   4.179666   1.620009     2.58   0.013     .9187559    7.440576
       _cons |   1035.519   6.902898   150.01   0.000     1021.624    1049.414
------------------------------------------------------------------------------

The interaction is statistically significant (t = 2.58, P = .013). Because this analysis
includes both intercept (reg2) and slope (reg2perc) dummy variables, it is worthwhile to write
out the equations. The regression equation for Northeastern states is approximately

predicted csat = 1035.5 - 241.4reg2 - 2.9percent + 4.2reg2perc
               = 1035.5 - 241.4 × 1 - 2.9percent + 4.2 × 1 × percent
               = 794.1 + 1.3percent

For other states it is

predicted csat = 1035.5 - 241.4 × 0 - 2.9percent + 4.2 × 0 × percent
               = 1035.5 - 2.9percent
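With both an intercept and a slope dummy in play, a small Python sketch using the rounded coefficients makes the two regional equations concrete (the function name is ours, for illustration):

```python
def predicted_csat(reg2, percent):
    # rounded coefficients from the interaction (slope dummy) model
    return 1035.5 - 241.4 * reg2 - 2.9 * percent + 4.2 * reg2 * percent

# Northeast: intercept 1035.5 - 241.4 = 794.1, slope -2.9 + 4.2 = +1.3
print(round(predicted_csat(1, 10), 1))   # 807.1
print(round(predicted_csat(0, 10), 1))   # 1006.5
```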

An interaction implies that the effect of one variable changes, depending on the values of
some other variable. From this regression, it appears that percent has a relatively weak and
positive effect among Northeastern states, whereas its effect is stronger and negative among the
others.
To visualize the results from a slope-and-intercept dummy variable regression, we have
several graphing possibilities. Without even fitting the model, we could ask lfit to do the
work as follows, with the results seen in Figure 6.6.

. label define reg2 0 "other regions" 1 "Northeast"
. label values reg2 reg2

. graph twoway lfit csat percent
     || scatter csat percent
     || , by(reg2, legend(off) note(""))
       ytitle("Mean composite SAT score")
[Figure 6.6: lfit lines and scatterplots of csat versus percent, by reg2 (other regions / Northeast). Mean composite SAT score (y-axis) versus % HS graduates taking SAT (x-axis).]


Alternatively, we could fit the regression model, calculate predicted values, and use those
to make a more refined plot such as Figure 6.7. The bands(50) options with both
mspline commands specify median splines based on 50 vertical bands, which is more than
enough to cover the range of the data.
. quietly regress csat reg2 percent reg2perc
. predict yhat1
. graph twoway scatter csat percent if reg2 == 0
     || mspline yhat1 percent if reg2 == 0, clpattern(solid) bands(50)
     || scatter csat percent if reg2 == 1, msymbol(Sh)
     || mspline yhat1 percent if reg2 == 1, clpattern(solid) bands(50)
     || , ytitle("Composite mean SAT score")
       legend(order(1 3) label(1 "other regions")
       label(3 "Northeast states") position(12) ring(0))
[Figure 6.7: Scatterplots and median-spline fits of csat versus percent for other regions and Northeast states. Composite mean SAT score (y-axis) versus % HS graduates taking SAT (x-axis).]

Figure 6.7 involves four overlays: two scatterplots (csat vs. percent for Northeast and other
states) and two median-spline plots (connecting predicted values, yhat1, graphed against
percent for Northeast and others). The Northeast states are plotted as hollow squares,
msymbol(Sh). ytitle and legend options simplify the y-axis title and the legend; in
their default form, both would be crowded and unclear.
Figures 6.6 and 6.7 both show the striking difference, captured by our interaction effect,
between Northeastern and other states. This raises the question of what other regional
differences exist. Figure 6.8 explores this question by drawing a csat-percent scatterplot with
different symbols for each of the four regions. In this plot, the Midwestern states, with one
exception (Indiana), seem to have their own steeply negative regional pattern at the left side of
the graph. Southern states are the most heterogeneous group.


. graph twoway scatter csat percent if reg1 == 1
     || scatter csat percent if reg2 == 1, msymbol(Sh)
     || scatter csat percent if reg3 == 1, msymbol(T)
     || scatter csat percent if reg4 == 1, msymbol(+)
     || , legend(position(1) ring(0) label(1 "West")
       label(2 "Northeast") label(3 "South") label(4 "Midwest"))
[Figure 6.8: Scatterplot of csat versus percent with separate symbols for West, Northeast, South, and Midwest. Mean composite SAT score (y-axis) versus % HS graduates taking SAT (x-axis).]
Automatic Categorical-Variable Indicators and Interactions

The xi (expand interactions) command simplifies the jobs of expanding multiple-category
variables into sets of dummy and interaction variables, and including these as predictors in
regression or other models. For example, in dataset student2.dta (introduced in Chapter 5) there
is a four-category variable year, representing a student's year in college (freshman, sophomore,
etc.). We could automatically create a set of three dummy variables by typing

. xi, prefix(ind) i.year
The three new dummy variables will be named indyear_2, indyear_3, and indyear_4. The
prefix() option specified the prefix used in naming the new dummy variables. If we typed
simply

. xi i.year

giving no prefix() option, the names _Iyear_2, _Iyear_3, and _Iyear_4 would be assigned
(and any previously calculated variables with those names would be overwritten by the new
variables). Typing

. drop _I*

employs the wildcard * notation to drop all variables that have names beginning with _I.


By default, xi omits the lowest value of the categorical variable when creating dummies,
but this can be controlled. Typing the command

. char _dta[omit] prevalent

will cause subsequent xi commands to automatically omit the most prevalent category (note
the use of square brackets). char _dta[ ] preferences are saved with the data; to restore
the default, type

. char _dta[omit]
Typing

. char year[omit] 3

would omit year 3. To restore the default, type

. char year[omit]

xi can also create interaction terms involving two categorical variables, or one categorical
and one measurement variable. For example, we could create a set of interaction terms for year
and gender by typing
. xi i.year*i.gender

From the four categories of year and the two categories of gender, this xi command creates
seven new variables — four dummy variables and three interactions. Because their names all
begin with _I, we can use the wildcard notation _I* to describe these variables:

. describe _I*

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------
_Iyear_2        byte   %8.0g                  year==2
_Iyear_3        byte   %8.0g                  year==3
_Iyear_4        byte   %8.0g                  year==4
_Igender_1      byte   %8.0g                  gender==1
_IyeaXgen_2_1   byte   %8.0g                  year==2 & gender==1
_IyeaXgen_3_1   byte   %8.0g                  year==3 & gender==1
_IyeaXgen_4_1   byte   %8.0g                  year==4 & gender==1

To create interaction terms for categorical variable year and measurement variable drink
(33-point drinking behavior scale), type

. xi i.year*drink

Six new variables result: three dummy variables for year, and three interaction terms
representing each of the year dummies times drink. For example, for a sophomore student,
_Iyear_2 = 1 and _IyeaXdrink_2 = 1 × drink = drink. For a junior student, _Iyear_2 = 0 and
_IyeaXdrink_2 = 0 × drink = 0; also _Iyear_3 = 1 and _IyeaXdrink_3 = 1 × drink = drink, and so
forth.

. describe _Iyea*

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------
_Iyear_2        byte   %8.0g                  year==2
_Iyear_3        byte   %8.0g                  year==3
_Iyear_4        byte   %8.0g                  year==4
_IyeaXdrink_2   float  %9.0g                  (year==2)*drink
_IyeaXdrink_3   float  %9.0g                  (year==3)*drink
_IyeaXdrink_4   float  %9.0g                  (year==4)*drink

The real convenience of xi comes from its ability to generate dummy variables and
interactions automatically within a regression or other model-fitting command. For example,
to regress variable gpa (student's college grade point average) on drink and a set of dummy
variables for year, simply type

. xi: regress gpa drink i.year

This command automatically creates the necessary dummy variables, following the same rules
described above. Similarly, to regress gpa on drink, year, and the interaction of drink and year,
type

. xi: regress gpa drink i.year*drink

i.year            _Iyear_1-4          (naturally coded; _Iyear_1 omitted)
i.year*drink      _IyeaXdrink_#       (coded as above)

      Source |       SS       df       MS              Number of obs =     218
-------------+------------------------------           F(  7,   210) =    3.75
       Model |  5.08865901     7  .726951288           Prob > F      =  0.0007
    Residual |  40.6630801   210  .193633715           R-squared     =  0.1112
-------------+------------------------------           Adj R-squared =  0.0816
       Total |  45.7517391   217  .210837507           Root MSE      =  .44004

------------------------------------------------------------------------------
         gpa |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       drink |  -.0285369   .0140402    -2.03   0.043    -.0562146   -.0008591
    _Iyear_2 |  -.5839268    .314782    -1.86   0.065    -1.204464    .0366107
    _Iyear_3 |  -.2859424   .3044178    -0.94   0.349    -.8860487    .3141639
    _Iyear_4 |  -.2203783   .2939595    -0.75   0.454     -.799868    .3591114
       drink |  (dropped)
_IyeaXdrin~2 |   .0199977   .0164436     1.22   0.225    -.0124179    .0524133
_IyeaXdrin~3 |   .0108977    .016348     0.67   0.506    -.0213297     .043125
_IyeaXdrin~4 |   .0104239    .016369     0.64   0.525    -.0218446    .0426925
       _cons |   3.432132   .2523984    13.60   0.000     2.934572    3.929691
------------------------------------------------------------------------------

The xi: command can be applied in the same way before many other model-fitting
procedures such as logistic (Chapter 10). In general, it allows us to include predictor
(right-hand-side) variables such as the following, without first creating the actual dummy
variable or interaction terms.

i.catvar             Creates j-1 dummy variables representing the j categories
                     of catvar.
i.catvar1*i.catvar2  Creates j-1 dummy variables representing the j categories
                     of catvar1; k-1 dummy variables from the k categories of
                     catvar2; and (j-1)(k-1) interaction variables
                     (dummy × dummy).
i.catvar*measvar     Creates j-1 dummy variables representing the j categories
                     of catvar, and j-1 variables representing interactions
                     with the measurement variable (dummy × measvar).

After any xi command, the new variables remain in the dataset.


Stepwise Regression

With the regional dummy variable terms we added earlier to the state-level data in states.dta,
we have many possible predictors of csat. This results in an overly complicated model, with
several coefficients statistically indistinguishable from zero.

. regress csat expense percent income college high reg1 reg2
      reg2perc reg3

      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  9,    40) =   49.51
       Model |  195420.517     9  21713.3908           Prob > F      =  0.0000
    Residual |   17540.863    40  438.521576           R-squared     =  0.9176
-------------+------------------------------           Adj R-squared =  0.8991
       Total |   212961.38    49  4346.15061           Root MSE      =  20.941

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     expense |  -.0022508   .0041333    -0.54   0.589    -.0106045     .006103
     percent |   -2.93786   .2302596   -12.76   0.000    -3.403232   -2.472488
      income |  -.0004919   .0010255    -0.48   0.634    -.0025645    .0015806
     college |   3.900087   1.719409     2.27   0.029     .4250318    7.375142
        high |   2.175542   1.171767     1.86   0.071     -.192688    4.543771
        reg1 |  -33.78456   9.302983    -3.63   0.001    -52.58659   -14.98253
        reg2 |  -143.5149   101.1244    -1.42   0.164    -347.8949    60.86509
    reg2perc |   2.506616   1.404483     1.78   0.082    -.3319506    5.345183
        reg3 |  -8.799205   12.54658    -0.70   0.487    -34.15679    16.55838
       _cons |   839.2209   76.35942    10.99   0.000     684.8927     993.549
------------------------------------------------------------------------------

ii
We might now try to simplify this model, dropping first the predictor with the highest t
probability (income, P = .634), then refitting the model and deciding whether to drop something
further. Through this process of backward elimination, we seek a more parsimonious model,
one that is simpler but fits almost equally well. Ideally, this strategy is pursued with attention
both to the statistical results and to the substantive or theoretical implications of keeping or
discarding certain variables.
For analysts in a hurry, stepwise methods provide ways to automate the process of model
selection. They work either by subtracting predictors from a complicated model, or by adding
predictors to a simpler one according to some pre-set statistical criteria. Stepwise methods
cannot consider the substantive or theoretical implications of their choices, nor can they do
much troubleshooting to evaluate possible weaknesses in the models produced at each step.
Despite their drawbacks, stepwise methods meet certain practical needs and have been widely
used.
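Although sw regress does this work for us, the backward-elimination logic itself is simple. The following Python sketch (an illustration on synthetic data, using a critical |t| value in place of Stata's exact P-to-retain criterion) repeatedly drops the weakest predictor until every survivor passes the threshold:

```python
import numpy as np

def ols(X, y):
    """OLS fit: return coefficients, standard errors, and t statistics."""
    n, k = X.shape
    b = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ b
    s2 = resid @ resid / (n - k)                 # residual variance
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    return b, se, b / se

def backward_eliminate(X, y, names, t_crit=2.0):
    """Drop the predictor with the smallest |t| until all survivors
    exceed t_crit; the constant '_cons' is never dropped."""
    X, names = X.copy(), list(names)
    while True:
        _, _, t = ols(X, y)
        cand = [i for i, nm in enumerate(names) if nm != "_cons"]
        if not cand:
            return names
        worst = min(cand, key=lambda i: abs(t[i]))
        if abs(t[worst]) >= t_crit:
            return names
        X = np.delete(X, worst, axis=1)
        del names[worst]

# Synthetic data: y depends on x1 and x2; 'junk' is an irrelevant predictor
rng = np.random.default_rng(1)
n = 300
x1, x2, junk = rng.normal(size=(3, n))
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2, junk])
kept = backward_eliminate(X, y, ["_cons", "x1", "x2", "junk"])
print(kept)   # x1 and x2 survive; junk is usually eliminated
```

Like sw regress, this sketch cannot weigh substantive considerations; it only automates the mechanical part of the search.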

For automatic backward elimination, we issue a sw regress command that includes
all of our possible predictor variables, and a maximum P value required to retain them. Setting
the P-to-retain criterion as pr(.05) ensures that only predictors having coefficients that are
significantly different from zero at the .05 level will be kept in the model.

Linear Regression Analysis


. sw regress csat expense percent income college high reg1 reg2
     reg2perc reg3, pr(.05)

                      begin with full model
p = 0.6341 >= 0.0500  removing income
p = 0.5273 >= 0.0500  removing reg3
p = 0.4215 >= 0.0500  removing expense
p = 0.2107 >= 0.0500  removing reg2

      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  5,    44) =   91.01
       Model |  194185.761     5  38837.1521           Prob > F      =  0.0000
    Residual |  18775.6194    44  426.718624           R-squared     =  0.9118
-------------+------------------------------           Adj R-squared =  0.9018
       Total |   212961.38    49  4346.15061           Root MSE      =  20.657

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        reg1 |  -30.59218   8.479395    -3.61   0.001    -47.68128   -13.50309
     percent |  -3.119155   .1804553   -17.28   0.000    -3.482839   -2.755471
    reg2perc |   .5833272   .1545969     3.77   0.000     .2717577    .8948967
     college |   3.995495   1.359331     2.94   0.005     1.255944    6.735046
        high |   2.231294   .8178968     2.73   0.009     .5829313    3.879657
       _cons |    806.672   49.98744    16.14   0.000     705.9289    907.4151
------------------------------------------------------------------------------

sw regress dropped first income, then reg3, expense, and finally reg2 before settling on
the final model. Although it has four fewer coefficients, this final model has almost the same
R2 (.9118 versus .9176) and a higher adjusted R2 (.9018 versus .8991) compared with the
earlier version.

If, instead of a P-to-retain, pr(.05), we specify a P-to-enter value such as pe(.05),
then sw regress performs forward inclusion (starting with an "empty" or constant-only
model) instead of backward elimination. Other stepwise options include hierarchical selection
and locking certain predictors into the model. For example, the following command specifies
that the first term (x1) should be locked into the model and not subject to possible removal:
. sw regress y x1 x2 x3, pr(.05) lockterm1

The following command calls for forward inclusion of any predictors found significant at
the .10 level, but with variables x4, x5, and x6 treated as one unit, either entered or left out
together:
. sw regress y x1 x2 x3 (x4 x5 x6), pe(.10)

The following command invokes hierarchical backward elimination with a P = .20 criterion:
. sw regress y x1 x2 x3 (x4 x5 x6) x7, pr(.20) hier

The hier option specifies that the terms are ordered: consider dropping the last term (x7)
first, and stop if it is not dropped. If x7 is dropped, next consider the second-to-last term (x4
x5 x6), and so forth.
Many other Stata commands besides regress also have stepwise variants that work in
a similar manner. Available stepwise procedures include the following:
sw clogit
Conditional (fixed-effects) logistic regression
sw cloglog

Maximum likelihood complementary log-log estimation

sw cnreg
Censored normal regression
sw glm
Generalized linear models
sw logistic
Logistic regression (odds)
sw logit
Logistic regression (coefficients)
sw nbreg
Negative binomial regression
sw ologit
Ordered logistic regression
sw oprobit
Ordered probit regression
sw poisson
Poisson regression
sw probit
Probit regression
sw qreg
Quantile regression
sw regress
OLS regression
sw stcox
Cox proportional hazard model regression
sw streg
Parametric survival-time model regression
sw tobit
Tobit regression
Type help sw for details about the stepwise options and logic.

Polynomial Regression

Earlier in this chapter, Figures 6.1 and 6.2 revealed an apparently curvilinear relationship
between mean composite SAT scores (csat) and the percentage of high school seniors taking
the test (percent). Figure 6.6 illustrated one way to model the upturn in SAT scores at high
percent values: as a phenomenon peculiar to the Northeastern states. That interaction model
fit reasonably well (adjusted R2 = .8327). But Figure 6.9 (next page), a residuals versus
predicted values plot for the interaction model, still exhibits signs of trouble. Residuals appear
to trend upwards at both high and low predicted values.
. quietly regress csat reg2 percent reg2perc
. rvfplot, yline(O)

[Figure 6.9: residuals versus fitted values for the interaction model; x axis "Fitted values"
(850-1050)]

Chapter 8 presents a variety of techniques for curvilinear and nonlinear regression.
"Curvilinear regression" here refers to intrinsically linear OLS regressions (for example,
regress) that include nonlinear transformations of the original y or x variables. Although
curvilinear regression fits a curved model with respect to the original data, this model remains
linear in the transformed variables. (Nonlinear regression, also discussed in Chapter 8, applies
non-OLS methods to fit models that cannot be linearized through transformation.)
One simple type of curvilinear regression, called polynomial regression, often succeeds in
fitting U or inverted-U shaped curves. It includes as predictors both an independent variable
and its square (and possibly higher powers if necessary). Because the csat-percent relationship
appears somewhat U-shaped, we generate a new variable equal to percent squared, then include
percent and percent2 as predictors of csat. Figure 6.10 graphs the resulting curve.
. generate percent2 = percent^2

. regress csat percent percent2
      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------           F(  2,    48) =  153.48
       Model |  193721.829     2  96860.9146           Prob > F      =  0.0000
    Residual |  30292.6806    48  631.097513           R-squared     =  0.8648
-------------+------------------------------           Adj R-squared =  0.8591
       Total |   224014.51    50   4480.2902           Root MSE      =  25.122

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     percent |  -6.111993   .6715406    -9.10   0.000    -7.462216    -4.76177
    percent2 |   .0495819   .0084179     5.89   0.000     .0326566    .0665072
       _cons |   1065.921   9.285379   114.80   0.000     1047.252    1084.591
------------------------------------------------------------------------------
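Outside Stata, the same quadratic fit amounts to ordinary least squares on a design matrix containing 1, x, and x squared. A Python sketch with synthetic data (the coefficients below merely mimic the csat equation; this is not the SAT dataset):

```python
import numpy as np

# Quadratic polynomial regression: regress y on x and x**2, as in the
# csat-on-percent example. Data are synthetic, generated from a curve
# resembling the fitted equation, plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 80, size=100)
y = 1065.9 - 6.1 * x + 0.05 * x**2 + rng.normal(0, 25, size=100)

X = np.column_stack([np.ones_like(x), x, x**2])   # design matrix: 1, x, x^2
b = np.linalg.lstsq(X, y, rcond=None)[0]          # [_cons, b_x, b_x2]
yhat = X @ b
r2 = 1 - ((y - yhat)**2).sum() / ((y - y.mean())**2).sum()
print(b.round(3), round(r2, 3))
```

Because the model is linear in the coefficients, nothing beyond ordinary least squares is needed; only the columns of the design matrix are transformed.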

. predict yhat2
(option xb assumed; fitted values)

. graph twoway mspline yhat2 percent, bands(50)
     || scatter csat percent
     || , legend(off) ytitle("Mean composite SAT score")

[Figure 6.10: median-spline curve (yhat2) with scatter of csat versus percent; x axis "% HS
graduates taking SAT" (0-80), y axis "Mean composite SAT score"]

If we only wanted to see the graph, and did not need the regression analysis, there is a
quicker way: a curve similar to Figure 6.10 could have been obtained by typing
. graph twoway qfit csat percent || scatter csat percent

The polynomial model of Figure 6.10 matches the data slightly better than our interaction
model (adjusted R2 = .8591 versus .8327). Because the curvilinear pattern is now less
striking in a residual-versus-fitted plot, the usual assumption of independent errors appears
more plausible applied to this polynomial model.


. quietly regress csat percent percent2
. rvfplot, yline(0)

[Figure 6.11: residuals versus fitted values for the polynomial model; x axis "Fitted values"
(850-1050)]

In Figures 6.7 and 6.10, we have two alternative models for the observed upturn in SAT
scores at high levels of student participation. Statistical evidence seems to lean towards the
polynomial model at this point. For serious research, however, we ought to choose between
similar-fitting alternative models on substantive as well as statistical grounds. Which model
seems more useful, or makes more sense? Which, if either, regression model suggests or
corresponds to a good real-world explanation for the upturn in test scores at high levels of
student participation?

Although it can closely fit sample data, polynomial regression also has important statistical
weaknesses. The different powers of x might be highly correlated with each other, giving rise
to multicollinearity. Furthermore, polynomial regression tends to track observations that have
unusually large positive or negative x values, so a few data points can exert disproportionate
influence on the results. For both reasons, polynomial regression results can sometimes be
sample-specific, fitting one dataset well but generalizing poorly to other data. Chapter 7 takes
a second look at this example, using tools that check for potential problems.

Panel Data
Panel data, also called cross-sectional time series, consist of observations on i analytical units
or cases, repeated over t points in time. The Longitudinal/Panel Data Reference Manual
describes a wide range of methods for analyzing such data. Most of the relevant Stata
commands begin with the letters xt; type help xt for an overview. As mentioned in the
documentation, some xt procedures require time series or tsset data; see Chapter 13, or
type help tsset, for more about this step.

This section considers the relatively simple case of linear regression with panel data,
accomplished by the command xtreg. Our example dataset, newfdiv.dta, contains
information about the 10 census divisions of the Canadian province of Newfoundland (Avalon
Peninsula, Burin Peninsula, and 8 others), for the years 1992-96.
Contains data from C:\data\newfdiv.dta
  obs:            50                          Newfoundland Census divisions
                                                (source: Statistics Canada)
 vars:             7                          18 Jul 2005 10:28
 size:         2,250 (99.9% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
cendiv          byte   %9.0g       cd         Census Division
divname         str20  %20s                   Census Division name
year            int    %9.0g                  Year
pop             double %9.0g                  Population, 1000s
unemp           float  %9.0g                  Total unemployment, 1000s
outmig          int    %9.0g                  Out-migration
tcrime          float  %9.0g                  Total crimes reported, 1000s
-------------------------------------------------------------------------------
Sorted by:

. list in 1/10

     +----------------------------------------------------------------------+
     | cendiv            divname   year       pop   unemp   outmig   tcrime |
     |----------------------------------------------------------------------|
  1. | Avalon   Avalon Peninsula   1992   259.587   58.56     6556   26.211 |
  2. | Avalon   Avalon Peninsula   1993   261.083   52.23     6449   21.039 |
  3. | Avalon   Avalon Peninsula   1994   259.296   44.81     6907   20.201 |
  4. | Avalon   Avalon Peninsula   1995   257.546   39.35            19.536 |
  5. | Avalon   Avalon Peninsula   1996   255.723   38.68            21.268 |
     |----------------------------------------------------------------------|
  6. | Burin     Burin Peninsula   1992    29.865    9.5       874    1.903 |
  7. | Burin     Burin Peninsula   1993    29.611    9.18      928     1.94 |
  8. | Burin     Burin Peninsula   1994    29.327    8.41      584    2.063 |
  9. | Burin     Burin Peninsula   1995    26.898    7.12             1.923 |
 10. | Burin     Burin Peninsula   1996    28.126    6.61                   |
     +----------------------------------------------------------------------+

Figure 6.12 visualizes the panel data, graphing variations in the number of crimes reported
each year for 9 of the 10 census divisions. Census division 1, the Avalon Peninsula, is by far
the largest in Newfoundland. Setting it temporarily aside by specifying if cendiv != 1
makes the remaining 9 plots in Figure 6.12 more readable. The imargin(left=3 right=3)
option in this example calls for left and right subplot margins equal to 3% of the graph width,
giving more separation than the default.


. graph twoway connected tcrime year if cendiv != 1,
     by(cendiv, note("")) xtitle("") imargin(left=3 right=3)

[Figure 6.12: connected-line plots of tcrime by year, 1992-1996, in nine panels: Burin,
S Coast, St Georg, Humber, Central, Bonavist, Notre D, N Pen, Labrador]

The dataset contains 50 observations total. Because the 50 observations represent only 10
individual cases, however, the usual assumptions of OLS and other common statistical methods
do not apply. Instead, we need models with complex error specifications, allowing for both
unit-specific and individual-observation disturbances.
Consider the regression of y on two predictors, x and w. OLS regression estimates the
regression coefficients a, b, and c, and calculates the associated standard errors and tests,
assuming a model of the form
     y_i = a + b*x_i + c*w_i + e_i
where the residuals for each observation, e_i, are assumed to represent errors that have
independent and identical distributions. The i.i.d. errors assumption appears unlikely with
panel data, where the observations consist of the same units measured repeatedly.
A more plausible panel-data model includes two error terms. One is common to each of
the i units, but differs between units (u_i). The second is unique to each of the i,t
observations (e_it):
     y_it = a + b*x_it + c*w_it + u_i + e_it
In order to fit such a model, Stata needs to know which variable identifies the i units, and
which variable is the time index t. This can be done within an xt command, or more
efficiently for the dataset as a whole. The commands iis ("i is") and tis ("t is") specify
the i and t variables, respectively. For newfdiv.dta, the units are census divisions (cendiv)
and the time index is year.
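A small simulation can make the two-error structure concrete. This Python sketch (hypothetical data, not newfdiv.dta, and not Stata's xtreg estimator) generates the model above and recovers the variance components from the combined residuals by the method of moments:

```python
import numpy as np

# Simulate y_it = a + b*x_it + u_i + e_it, where u_i is shared by all
# observations of unit i and e_it is unique to each observation.
rng = np.random.default_rng(0)
n_units, T = 500, 5
sd_u, sd_e = 0.34, 0.42                       # chosen variance components

u = rng.normal(0, sd_u, size=(n_units, 1))    # common (unit) errors
e = rng.normal(0, sd_e, size=(n_units, T))    # unique errors
x = rng.normal(size=(n_units, T))
y = 1.0 + 0.5 * x + u + e

resid = y - (1.0 + 0.5 * x)                   # combined error u_i + e_it

# Method-of-moments variance components:
# deviations within a unit involve only e_it, while the variance of the
# unit means equals Var[u] + Var[e]/T.
within = resid - resid.mean(axis=1, keepdims=True)
var_e_hat = (within ** 2).sum() / (n_units * (T - 1))
var_u_hat = resid.mean(axis=1).var(ddof=1) - var_e_hat / T
rho_hat = var_u_hat / (var_u_hat + var_e_hat)
print(round(rho_hat, 2))   # near 0.34**2 / (0.34**2 + 0.42**2), about 0.40
```

Residuals from the same unit share u_i and so are correlated, which is exactly why the i.i.d. assumption of pooled OLS fails for panel data.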


. iis cendiv
. tis year
. save, replace

Saving the dataset preserves the i and t specifications, so the iis and tis commands
are not required in a future session. Having set these variables, we can now fit a
random-effects (meaning that the common errors u_i are assumed to be variable, rather than
fixed) model regressing tcrime on unemp and pop.
. xtreg tcrime unemp pop, re

Random-effects GLS regression                   Number of obs      =        50
Group variable (i): cendiv                      Number of groups   =        10

R-sq:  within  = 0.5265                         Obs per group: min =         5
       between = 0.9717                                        avg =       5.0
       overall = 0.9634                                        max =         5

Random effects u_i ~ Gaussian                   Wald chi2(2)       =    705.54
corr(u_i, X)       = 0 (assumed)                Prob > chi2        =    0.0000

------------------------------------------------------------------------------
      tcrime |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       unemp |   .1645266   .0381813     4.31   0.000     .0896925    .2393607
         pop |   .0558997   .0073437     7.61   0.000     .0415062    .0702931
       _cons |  -.7264381    .301522    -2.41   0.016     -1.31741   -.1354659
-------------+----------------------------------------------------------------
     sigma_u |  .34458437
     sigma_e |  .42064667
         rho |  .40157462   (fraction of variance due to u_i)
------------------------------------------------------------------------------

The xtreg output table contains regression coefficients, standard errors, z tests, and
confidence intervals that resemble those of an OLS regression. In this example we see that the
coefficient on unemp (.1645) is positive and statistically significant. The predicted number of
crimes increases by .1645 for each additional person unemployed, if population is held
constant. Holding unemployment constant, predicted crimes increase by 5.59 with each
100-person increase in population. Echoing the individual-coefficient z tests, the Wald
chi-square test at upper right (chi2 = 705.54, df = 2, P < .00005) allows us to reject the joint
null hypothesis that the coefficients on unemp and pop are both zero.
This output table gives further information related to the two error terms. At lower left in
the table we find
sigma_u
     standard deviation of the common residuals u_i
sigma_e
     standard deviation of the unique residuals e_it
rho
     fraction of the unexplained variance due to differences among the units (i.e.,
     differences among the 10 Newfoundland census divisions):
     Var[u_i] / (Var[u_i] + Var[e_it])
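Because sigma_u and sigma_e are standard deviations, rho follows from them by squaring. A quick Python check using the values reported in the table above:

```python
# rho = Var[u_i] / (Var[u_i] + Var[e_it]); sigma_u and sigma_e are
# standard deviations copied from the xtreg output, so square them first.
sigma_u = 0.34458437
sigma_e = 0.42064667
rho = sigma_u**2 / (sigma_u**2 + sigma_e**2)
print(round(rho, 5))   # 0.40157, matching rho in the output table
```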

At upper left the table gives three "R2" statistics. The definitions for these differ from the
true R2 of OLS. In the case of xtreg, the "R2" values are based on fits between several kinds
of observed and predicted y values.

R2 within
     Explained variation within units: defined as the squared correlation between
     deviations of y_it values from unit means (y_it - ybar_i) and deviations of
     predicted values from unit-mean predicted values (yhat_it - ybarhat_i).
R2 between
     Explained variation between units: defined as the squared correlation between
     unit means (ybar_i) and the values predicted from unit means of the
     independent variables.
R2 overall
     Explained variation overall: defined as the squared correlation between
     observed (y_it) and predicted (yhat_it) values.

Our example model does a very good job fitting the observed crimes overall (R2 = .96), and
also the variations among census division means (R2 = .97). Variations around the means
within census divisions are somewhat less predictable (R2 = .53).
The random-effects option employed for this example is one of several possible choices.

re        Generalized least squares (GLS) random-effects estimator; default
be        between regression estimator
fe        fixed-effects (within) regression estimator
mle       maximum-likelihood random-effects estimator
pa        population-averaged estimator

Consult help xtreg for further options and syntax. The Longitudinal/Panel Data
Reference Manual gives examples, references, and technical details.

Regression Diagnostics

Do the data give us any reason to distrust our regression results? Can we find better ways to
specify the model, or to estimate its parameters? Careful diagnostic work, checking for
potential problems and evaluating the plausibility of key assumptions, forms a crucial step in
modern data analysis. We fit an initial model, but then look closely at our results for signs of
trouble or ways in which the model needs improvement. Many of the general methods
introduced in earlier chapters, such as scatterplots, box plots, normality tests, or just sorting
and listing the data, prove useful for troubleshooting. Stata also provides a toolkit of
specialized diagnostic techniques designed for this purpose.
Autocorrelation, a complication that often affects regression with time series data, is not
covered in this chapter. Chapter 13, Time Series Analysis, introduces Stata's library of time
series procedures including Durbin-Watson tests, autocorrelation graphs, lag operators, and
time-series regression techniques.
Regression diagnostic procedures can be found under these menu selections:
     Statistics - Linear regression and related - Regression diagnostics
     Statistics - General post-estimation - Obtain predictions, residuals, etc., after estimation

Example Commands

The commands illustrated in this section all assume that you have just fit a model using either
anova or regress. The commands' results refer back to that model. These follow-up
commands are of three basic types:
1. predict options that generate new variables containing case statistics such as predicted
   values, residuals, standard errors, and influence statistics. Chapter 6 noted some key
   options; type help regress for a complete listing.
2. Diagnostic tests for statistical problems such as autocorrelation, heteroskedasticity,
   specification errors, or variance inflation (multicollinearity). Type help regdiag for
   a list.
3. Diagnostic plots such as added-variable or leverage plots, residual-versus-fitted plots,
   residual-versus-predictor plots, and component-plus-residual plots. Again, typing help
   regdiag obtains a full listing of regression and ANOVA diagnostic plots. General
   graphs for diagnosing distribution shape and normality were covered in Chapter 2;
   type help diagplots for a list of those.

predict Options
. predict new, cooksd

Generates a new variable equal to Cook’s distance D, summarizing how much each
observation influences the fitted model.
. predict new, covratio

Generates a new variable equal to Belsley, Kuh, and Welsch’s COVRATIO statistic.
COVRATIO measures the ith case’s influence upon the variance-covariance matrix of the
estimated coefficients.
. predict DFx1, dfbeta(x1)

Generates DFBETA case statistics measuring how much each observation affects the
coefficient on predictor x1. The dfbeta command accomplishes the same thing more
conveniently, and in this example will automatically name the resulting statistics DFx1:
. dfbeta x1

To create a complete set of DFBETAs for all predictors in the model, simply type the
command dfbeta without arguments.
. predict new, dfits

Generates DFITS case statistics, summarizing the influence of each observation on the
fitted model (similar in purpose to Cook's D and Welsch's W).
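Two of these influence statistics are easy to compute by hand from the hat matrix H = X(X'X)^(-1)X'. The Python sketch below (synthetic data; it illustrates the standard formulas rather than Stata's internals) computes leverage h_i and Cook's distance D_i:

```python
import numpy as np

# Leverage (hat-matrix diagonal) and Cook's distance, computed directly
# from the standard formulas on synthetic data.
rng = np.random.default_rng(0)
n, k = 30, 2
x = rng.normal(size=n)
x[0] = 6.0                                # one deliberately extreme x value
y = 1 + 2 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
h = np.diag(H)                            # leverage of each observation
resid = y - H @ y
s2 = resid @ resid / (n - k)              # residual variance

# Cook's D: D_i = e_i^2 / (k * s^2) * h_i / (1 - h_i)^2
cooks_d = resid**2 / (k * s2) * h / (1 - h)**2
print(h.argmax())   # observation 0 (the extreme x) has the greatest leverage
```

Leverage depends only on the x values; Cook's D combines leverage with the size of the residual, which is why a high-leverage point that the line fits well can still have a modest D.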
Diagnostic Tests

. dwstat

Calculates the Durbin-Watson test for first-order autocorrelation. Chapter 13 gives
examples of this and other time series procedures. See also:
help durbina
Durbin-Watson h statistic
help bgodfrey
Breusch-Godfrey LM (Lagrange multiplier) statistic
. hettest

Performs Cook and Weisberg’s test for heteroskedasticity. If we have reason to suspect
that heteroskedasticity is a function of a particular predictor xl, we could focus on that
predictor by typing hettest xl.
. ovtest, rhs

Performs the Ramsey regression specification error test (RESET) for omitted variables. The
option rhs calls for using powers of the right-hand-side variables, instead of powers of
predicted y (the default).
. vif

Calculates variance inflation factors to check for multicollinearity.
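The variance inflation factor for predictor j is VIF_j = 1/(1 - R2_j), where R2_j comes from regressing x_j on the other predictors. A Python sketch with deliberately collinear synthetic data (an illustration of the formula, not Stata's vif code):

```python
import numpy as np

# VIF_j = 1 / (1 - R2_j), where R2_j is from regressing predictor j on
# the remaining predictors (plus a constant).
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.5 * rng.normal(size=n)     # deliberately collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    others = np.delete(X, j, axis=1)
    Z = np.column_stack([np.ones(len(X)), others])
    fit = Z @ np.linalg.lstsq(Z, X[:, j], rcond=None)[0]
    xj = X[:, j]
    r2 = 1 - ((xj - fit)**2).sum() / ((xj - xj.mean())**2).sum()
    return 1 / (1 - r2)

print([round(vif(X, j), 1) for j in range(3)])   # x1 and x2 inflated, x3 not
```

A VIF near 1 means the predictor is nearly orthogonal to the others; large values flag the multicollinearity that vif is designed to detect.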
Diagnostic Plots

. acprplot x1, mspline msopts(bands(7))

Constructs an augmented component-plus-residual plot (also known as an augmented
partial residual plot), often better than cprplot in screening for nonlinearities. The
options mspline msopts(bands(7)) call for connecting with line segments the
cross-medians of seven vertical bands. Alternatively, we might ask for a lowess-smoothed
curve with bandwidth 0.5 by specifying the options lowess lsopts(bwidth(.5)).

. avplot x1

Constructs an added-variable plot (also called a partial-regression or leverage plot) showing
the relationship between y and x1, both adjusted for other x variables. Such plots help to
notice outliers and influence points.
. avplots

Draws and combines in one image all the added-variable plots from the recent anova or
regress.
. cprplot x1

Constructs a component-plus-residual plot (also known as a partial-residual plot) showing
the adjusted relationship between y and predictor x1. Such plots help detect nonlinearities
in the data.
. lvr2plot

Constructs a leverage-versus-squared-residual plot (also known as an L-R plot).
. rvfplot

Graphs the residuals versus the fitted (predicted) values of y.
. rvpplot x1

Graphs the residuals against values of predictor x1.

SAT Score Regression, Revisited

Diagnostic techniques have been described as tools for “regression criticism,” because they help
us examine our regression models for possible flaws and for ways that the models could be
improved. In this spirit, we return now to the state Scholastic Aptitude Test regressions of
Chapter 6. A three-predictor model explains about 92% of the variance in mean state SAT
scores. The predictors are percent (percent of high school graduates taking the test), percent?
(percent squared), and high (percent of adults with a high school diploma).
. generate percent2 = percent^2

. regress csat percent percent2 high

      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------           F(  3,    47) =  193.37
       Model |  207225.103     3  69075.0343           Prob > F      =  0.0000
    Residual |  16789.4069    47  357.221424           R-squared     =  0.9251
-------------+------------------------------           Adj R-squared =  0.9203
       Total |   224014.51    50   4480.2902           Root MSE      =   18.90

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     percent |  -6.520312   .5095805   -12.80   0.000    -7.545455   -5.495168
    percent2 |   .0536555   .0063678     8.43   0.000     .0408452    .0664658
        high |   2.986509   .4857502     6.15   0.000     2.009305    3.963712
       _cons |   844.8207   36.63387    23.06   0.000     771.1228    918.5185
------------------------------------------------------------------------------

The regression equation is
predicted csat = 844.82 - 6.52*percent + .05*percent2 + 2.99*high

The scatterplot matrix in Figure 7.1 shows the interrelations among these four variables. As
noted in Chapter 6, percent has a visibly curvilinear relationship with csat, which the
percent2 term allows our model to fit.
. graph matrix percent percent2 high csat, half msymbol(+)

[Figure 7.1: scatterplot matrix of % HS graduates taking SAT (percent), percent2, % adults
with HS diploma (high), and Mean composite SAT score (csat)]

Several diagnostic tests can check aspects of this regression. The ovtest command
performs the Ramsey regression specification error test (RESET), regressing y on powers of
the fitted values and then testing the hypothesis that the coefficients on those powers all
equal zero. With the csat regression, we cannot reject the null hypothesis of no omitted
variables.
. ovtest

Ramsey RESET test using powers of the fitted values of csat
       Ho:  model has no omitted variables
                 F(3, 44) =      1.48
                 Prob > F =      0.2319

A heteroskedasticity test probes the assumption of constant error variance by examining
whether squared standardized residuals are linearly related to predicted y (see Cook and
Weisberg 1994 for an example). A significant test statistic casts doubt on the assumption of
constant variance. In this instance we should reject the null hypothesis of constant variance
for the csat regression.

. hettest

Cook-Weisberg test for heteroskedasticity using fitted values of csat
       Ho:  Constant variance
                chi2(1)      =     4.86
                Prob > chi2  =     0.0274


“Significant” heteroskedasticity implies that our standard errors and hypothesis tests might be
invalid. Figure 7.2, in the next section, shows why this result occurs.
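The idea behind this score test can be sketched directly: regress squared standardized residuals on the fitted values, and compare half the explained sum of squares to a chi-squared distribution with 1 degree of freedom. A Python illustration on synthetic, strongly heteroskedastic data (a sketch of the Breusch-Pagan/Cook-Weisberg approach, not Stata's code):

```python
import numpy as np

# Score test for heteroskedasticity: under constant variance, half the
# explained sum of squares from regressing squared standardized
# residuals on the fitted values is approximately chi2(1).
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(1, 10, size=n)
y = 2 + 3 * x + rng.normal(0, x, size=n)   # error s.d. grows with x
X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
yhat = X @ b
e = y - yhat
u = e**2 / (e**2).mean()                   # squared standardized residuals

Z = np.column_stack([np.ones(n), yhat])
g = Z @ np.linalg.lstsq(Z, u, rcond=None)[0]
ess = ((g - u.mean())**2).sum()            # explained sum of squares
chi2 = ess / 2
print(chi2 > 3.84)   # rejects constant variance at the .05 level
```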

Diagnostic Plots
Chapter 6 demonstrated how predict can create new variables holding residual and
predicted values after a regress command. To obtain these values from our regression of
csat on percent, percent2, and high, we type the two commands:
. predict yhat3
. predict e3, resid

The new variables named e3 (residuals) and yhat3 (predicted values) could be displayed in a
residual-versus-predicted graph by typing graph twoway scatter e3 yhat3,
yline(0). The rvfplot (residual-versus-fitted) command obtains such graphs in a single
step. Figure 7.2 includes a horizontal line at 0 (the residual mean), which helps in reading
such plots.
. rvfplot, yline(0)

[Figure 7.2: residuals versus fitted values; x axis "Fitted values" (850-1050), horizontal line
at 0]

Figure 7.2 shows residuals symmetrically distributed around 0 (symmetry is consistent with
the normal-errors assumption), and with no evidence of outliers or curvilinearity. The
dispersion of the residuals appears somewhat greater for above-average predicted values of y,
however, which is why hettest earlier rejected the constant-variance hypothesis.
Residual-versus-fitted plots provide a one-graph overview of the regression residuals. For
more detailed study, we can plot residuals against each predictor variable separately through
rvpplot ("residual-versus-predictor") commands. To graph the residuals against predictor
high (not shown), type

. rvpplot high

The one-variable graphs described in Chapter 3 can also be employed for residual analysis.
For example, we could use box plots to check the residuals for outliers or skew, or
quantile-normal plots to evaluate the assumption of normal errors.
Added-variable plots are valuable diagnostic tools, known by different names including
partial-regression leverage plots, adjusted partial residual plots, or adjusted variable plots.
They depict the relationship between y and one x variable, adjusting for the effects of other x
variables. If we regressed y on x2 and x3, and likewise regressed x1 on x2 and x3, then took
the residuals from each regression and graphed these residuals in a scatterplot, we would
obtain an added-variable plot for the relationship between y and x1, adjusted for x2 and x3.
An avplot command performs the necessary calculations automatically. We can draw the
added-variable plot for predictor high, for example, just by typing
. avplot high
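We can verify this construction numerically. In the Python sketch below (synthetic data), the slope of adjusted y on adjusted x1 reproduces the multiple-regression coefficient on x1 exactly (the Frisch-Waugh-Lovell result that added-variable plots depict):

```python
import numpy as np

# Added-variable construction: residuals of y on the other predictors,
# against residuals of x1 on the other predictors.
rng = np.random.default_rng(0)
n = 200
x1, x2, x3 = rng.normal(size=(3, n))
y = 1 + 2 * x1 - 1 * x2 + 0.5 * x3 + rng.normal(size=n)

def resid_on(Z, v):
    """Residuals from regressing v on the columns of Z (plus a constant)."""
    Z = np.column_stack([np.ones(len(v)), Z])
    return v - Z @ np.linalg.lstsq(Z, v, rcond=None)[0]

others = np.column_stack([x2, x3])
ey = resid_on(others, y)        # y adjusted for x2 and x3
ex = resid_on(others, x1)       # x1 adjusted for x2 and x3

# Slope of ey on ex equals the multiple-regression coefficient on x1
slope = (ex @ ey) / (ex @ ex)
X = np.column_stack([np.ones(n), x1, x2, x3])
b_full = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(slope, b_full[1]))   # True
```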

Speeding the process further, we could type avplots to obtain a complete set of tiny
added-variable plots with each of the predictor variables in the preceding regression. Figure
7.3 shows the results from the regression of csat on percent, percent2, and high. The lines
drawn in added-variable plots have slopes equal to the corresponding partial regression
coefficients. For example, the slope of the line at lower left in Figure 7.3 equals 2.99, which
is the coefficient on high.
. avplots

[Figure 7.3: added-variable plots for percent (coef = -6.5203116, se = .50958046, t = -12.8),
percent2 (coef = .05365555, se = .00636777, t = 8.43), and high (coef = 2.9865088,
se = .48575023, t = 6.15)]

Added-variable plots help to uncover observations exerting a disproportionate influence on
the regression model. In simple regression with one x variable, ordinary scatterplots suffice for
this purpose. In multiple regression, however, the signs of influence become more subtle. An
observation with an unusual combination of values on several x variables might have high
leverage, or potential to influence the regression, even though none of its individual x values

is unusual by itself. High-leverage observations show up in added-variable plots as points
horizontally distant from the rest of the data. We see no such problems in Figure 7.3, however.
If outliers appear, we might identify which observations these are by including observation
labels for the markers in an added-variable plot. This is done using the mlabel ( ) option,
just as with scatterplots. Figure 7.4 illustrates using state names (values of the string variable
state)as labels. Although such labels tend to overprint each other where the data are dense,
individual outliers remain more readable.
. avplot high, mlabel(state)

[Figure 7.4: added-variable plot for high with state-name marker labels; coef = 2.9865088,
se = .48575023, t = 6.15]

Component-plus-residual plots, produced by commands of the form cprplot x1, take
a different approach to graphing multiple regression. The component-plus-residual plot for
variable x1 graphs each observation's residual plus its component predicted from x1,
     e_i + b_1 x1_i
against values of x1. Such plots might help diagnose nonlinearities and suggest alternative
functional forms. An augmented component-plus-residual plot (Mallows 1986) works
somewhat better, although both types often seem inconclusive. Figure 7.5 shows an augmented
component-plus-residual plot from the regression of csat on percent, percent2, and high.
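The plotted quantity e_i + b_1 x1_i is easy to compute directly. In this Python sketch (synthetic data with a deliberately nonlinear term), the component-plus-residual values bend upward at the extremes of x1, revealing curvature that the fitted linear coefficient alone hides:

```python
import numpy as np

# Component-plus-residual values e_i + b1*x1_i for a model that wrongly
# treats a quadratic relationship in x1 as linear.
rng = np.random.default_rng(0)
n = 300
x1 = rng.uniform(-2, 2, size=n)
x2 = rng.normal(size=n)
y = 1 + x1**2 + 0.5 * x2 + rng.normal(0, 0.3, size=n)   # nonlinear in x1

X = np.column_stack([np.ones(n), x1, x2])
b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
cpr = e + b[1] * x1           # component-plus-residual for x1

# The curvature shows up as a U shape: cpr runs higher where |x1| is
# large than near x1 = 0.
edge = cpr[np.abs(x1) > 1.5].mean()
middle = cpr[np.abs(x1) < 0.5].mean()
print(edge > middle)
```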


. acprplot high, lowess

[Figure 7.5: augmented component-plus-residual plot for high, with linear fit and lowess
curve; x axis "% over 25 w/HS diploma" (65-85)]
The straight line in Figure 7.5 corresponds to the regression model. The curved line reflects
lowess smoothing based on the default bandwidth of .5, or half the data. The curve’s downturn
at far right can be disregarded as a lowess artifact, because only a few cases determine its
location toward the extremes (see Chapter 8). If more central parts of the lowess curve showed
a systematically curved pattern, departing from the linear regression model, we would have
reason to doubt the model’s adequacy. In Figure 7.5, however, the component-plus-residuals
medians closely follow the regression model. This plot reinforces the conclusion we reached
earlier from Figure 7.2, that the present regression model adequately accounts for all
nonlinearity visible in the raw data (Figure 7.1), leaving none apparent in its residuals.
As its name implies, a leverage-versus-squared-residuals plot graphs leverage (hat matrix
diagonals) against the residuals squared. Figure 7.6 shows such a plot for the csat regression.
To identify individual outliers, we label the markers with the values of state. The option
mlabsize(medsmall) calls for "medium small" marker labels, somewhat larger than the
default size of "small." (See help textsizestyle for a list of other choices.) Most of
the state names form a jumble at lower left in Figure 7.6, but a few outliers stand out.

Statistics with Stata

. lvr2plot, mlabel(state) mlabsize(medsmall)

Figure 7.6
[leverage-versus-squared-residuals plot; x axis: Normalized residual squared; labeled outliers include Connecticut, Massachusetts, Mississippi, Utah, Alaska, New Hampshire, Iowa, and Tennessee]

Lines in a leverage-versus-squared-residuals plot mark the means of leverage (horizontal
line) and squared residuals (vertical line). Leverage tells us how much potential for influencing
the regression an observation has, based on its particular combination of x values. Extreme x
values or unusual combinations give an observation high leverage. A large squared residual
indicates an observation with value much different from that predicted by the regression
model. Connecticut, Massachusetts, and Mississippi have the greatest potential leverage but
the model fits them relatively well. (This is not necessarily good. Sometimes, although not
here, high-leverage observations exert so much influence that they control the regression, and
it must fit them well.) Iowa and Tennessee are poorly fit, but have less potential influence.
Utah stands out as one observation that is both ill fit and potentially influential. We can read
its values by listing just this state. Because state is a string variable, we enclose the value
“Utah” in double quotes.
. list csat yhat3 percent high e3 if state == "Utah"

     +------------------------------------------------+
     | csat      yhat3   percent   high           e3 |
     |------------------------------------------------|
  1. | 1031   1067.712         5   85.1   -36.71239 |
     +------------------------------------------------+

Only 5% of Utah students took the SAT, and 85.1% of the state’s adults graduated from
high school. This unusual combination of near-extreme values on both x variables is the source
of the state’s leverage, and leads our model to predict mean SAT scores 36.7 points higher than
what Utah students actually achieved. To see exactly how much difference this one observation
makes, we could repeat the regression using Stata's "not equal to" qualifier != to set Utah
aside.

. regress csat percent percent2 high if state != "Utah"

      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  3,    46) =  202.67
       Model |  201097.423     3  67032.4744           Prob > F      =  0.0000
    Residual |  15214.1968    46  330.741235           R-squared     =  0.9297
-------------+------------------------------           Adj R-squared =  0.9251
       Total |   216311.52    49  4414.52082           Root MSE      =  18.186

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     percent |  -6.778706   .5044217   -13.44   0.000    -7.794054   -5.763357
    percent2 |   .0563562   .0062509     9.02   0.000     .0437738    .0689387
        high |   3.281765   .4865854     6.74   0.000     2.302319     4.26121
       _cons |   827.1159   36.17138    22.87   0.000     754.3067    899.9252
------------------------------------------------------------------------------

In the n = 50 (instead of n = 51) regression, all three coefficients strengthened a bit because
we deleted an ill-fit observation. The general conclusions remain unchanged, however.
Chambers et al. (1983) and Cook and Weisberg (1994) provide more detailed examples and
explanations of diagnostic plots and other graphical methods for data analysis.

Diagnostic Case Statistics


After using regress or anova, we can obtain a variety of diagnostic statistics through the
predict command (see Chapter 6 or type help regress). The variables created by
predict are case statistics, meaning that they have values for each observation in the data.
Diagnostic work usually begins by calculating the predicted values and residuals.
There is some overlap in purpose among other predict statistics. Many attempt to
measure how much each observation influences regression results. "Influencing regression
results," however, could refer to several different things — effects on the y-intercept, on a
particular slope coefficient, on all the slope coefficients, or on the estimated standard errors,
for example. Consequently, we have a variety of alternative case statistics designed to measure
influence.
Standardized and studentized residuals (rstandard and rstudent) help to identify
outliers among the residuals — observations that particularly contradict the regression model.
Studentized residuals have the most straightforward interpretation. They correspond to the t
statistic we would obtain by including in the regression a dummy predictor coded 1 for that
observation and 0 for all others. Thus, they test whether a particular observation significantly
shifts the y-intercept.

Hat matrix diagonals (hat) measure leverage, meaning the potential to influence
regression coefficients. Observations possess high leverage when their x values (or their
combination of x values) are unusual.
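For a regression with a single predictor, the hat diagonal has a closed form that makes the idea concrete: h_i = 1/n + (x_i - xbar)^2 / sum((x_j - xbar)^2). The pure-Python sketch below uses made-up x values (not the csat data) to check two standard properties of leverage:

```python
# Hat-matrix diagonals (leverage) for a one-predictor regression:
# h_i = 1/n + (x_i - xbar)^2 / sum((x_j - xbar)^2)
# Synthetic x values, not the state data.
x = [4.0, 10.0, 25.0, 40.0, 81.0]
n = len(x)
xbar = sum(x) / n
sxx = sum((v - xbar) ** 2 for v in x)
h = [1 / n + (v - xbar) ** 2 / sxx for v in x]

# Leverage depends only on x: the most extreme x value gets the largest h,
# and the diagonals always sum to the number of coefficients (here 2).
print(max(h) == h[-1])     # True: x = 81 lies farthest from the mean
print(round(sum(h), 10))   # 2.0
```

This is why unusual x values, or unusual combinations of x values, produce high leverage regardless of y.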

Several other statistics measure actual influence on coefficients. DFBETAs indicate by how
many standard errors the coefficient on x1 would change if observation i were dropped from
the regression. These can be obtained for a single predictor, x1, in either of two ways: through
the predict option dfbeta(x1) or through the command dfbeta.


Cook's D (cooksd), Welsch's distance (welsch), and DFITS (dfits), unlike
DFBETA, all summarize how much observation i influences the regression model as a whole,
or equivalently, how much observation i influences the set of predicted values. COVRATIO
measures the influence of the ith observation on the estimated standard errors. Below we
generate a full set of diagnostic statistics including DFBETAs for all three predictors. Note that
predict supplies variable labels automatically for the variables it creates, but dfbeta
does not. We begin by repeating our original regression to ensure that these post-regression
diagnostics refer to the proper (n = 51) model.
. quietly regress csat percent percent2 high
. predict standard, rstandard
. predict student, rstudent
. predict h, hat
. predict D, cooksd
. predict DFITS, dfits
. predict W, welsch
. predict COVRATIO, covratio
. dfbeta
                   DFpercent:  DFbeta(percent)
                  DFpercent2:  DFbeta(percent2)
                      DFhigh:  DFbeta(high)

. describe standard - DFhigh

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
standard        float  %9.0g                  Standardized residuals
student         float  %9.0g                  Studentized residuals
h               float  %9.0g                  leverage
D               float  %9.0g                  Cook's D
DFITS           float  %9.0g                  Dfits
W               float  %9.0g                  Welsch distance
COVRATIO        float  %9.0g                  Covratio
DFpercent       float  %9.0g                  DFbeta(percent)
DFpercent2      float  %9.0g                  DFbeta(percent2)
DFhigh          float  %9.0g                  DFbeta(high)

. summarize standard - DFhigh

    Variable |       Obs        Mean    Std. Dev.        Min        Max
-------------+---------------------------------------------------------
    standard |        51   -.0031359    1.010579  -2.099976   2.233379
     student |        51     -.00162    1.032723  -2.182423   2.336977
           h |        51    .0784314    .0373011   .0336437   .2151227
           D |        51    .0219941    .0364003   .0000135   .1860992
       DFITS |        51   -.0107348    .3064762   -.896658   .7444486
-------------+---------------------------------------------------------
           W |        51    -.089723    2.278704  -6.854601    5.52468
    COVRATIO |        51    1.092452    .1316834   .7607449   1.360136
   DFpercent |        51     .000938    .1498813  -.5067295   .5269799
  DFpercent2 |        51   -.0010659    .1370372   -.440771   .4253958
      DFhigh |        51   -.0012204    .1747835  -.6316988   .3414851


summarize shows us the minimum and maximum values of each statistic, so we can
quickly check whether any are large enough to cause concern. For example, special tables
could be used to determine whether the observation with the largest absolute studentized
residual (student) constitutes a significant outlier. Alternatively, we could apply the Bonferroni
inequality and a t distribution table: max|student| is significant at level α if |t| is significant at
α/n. In this example, we have max|student| = 2.337 (Iowa) and n = 51. For Iowa to be a
significant outlier (cause a significant shift in intercept) at α = .05, t = 2.337 must be significant
at α/n:
. display .05/51
.00098039

Stata's ttail() function can approximate the probability of |t| > 2.337, given df = n - K - 1
= 51 - 3 - 1 = 47:
. display 2*ttail(47, 2.337)
.02375138

The obtained P-value (P = .0238) is not below α/n = .00098, so Iowa is not a significant outlier
at α = .05.
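The same check can be reproduced outside Stata. The sketch below (pure Python, no external libraries) approximates ttail() by numerically integrating the t density; the helper name t_tail is our own, not a Stata or Python built-in:

```python
import math

def t_tail(t0, df, upper=60.0, steps=100000):
    """Approximate P(T > t0) for Student's t with df degrees of freedom,
    by trapezoid-rule integration of the density on [t0, upper]."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda t: c * (1 + t * t / df) ** (-(df + 1) / 2)
    width = (upper - t0) / steps
    area = 0.5 * (pdf(t0) + pdf(upper))
    for i in range(1, steps):
        area += pdf(t0 + i * width)
    return area * width

p = 2 * t_tail(2.337, 47)   # two-tailed, as in: display 2*ttail(47, 2.337)
cutoff = 0.05 / 51          # Bonferroni-adjusted level, as in: display .05/51
print(round(p, 4))          # 0.0238
print(round(cutoff, 8))     # 0.00098039
print(p < cutoff)           # False: Iowa is not a significant outlier
```

The printed values match the two Stata display results above.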
Studentized residuals measure the ith observation's influence on the y-intercept. Cook's
D, DFITS, and Welsch's distance all measure the ith observation's influence on all coefficients
in the model (or, equivalently, on all n predicted y values). To list the 5 most influential
observations as measured by Cook's D, type
. sort D
. list state yhat3 D DFITS W in -5/l

     +------------------------------------------------------------+
     |        state      yhat3          D      DFITS           W |
     |------------------------------------------------------------|
 47. | North Dakota   1036.696   .0705921   .5493086    4.020527 |
 48. |      Wyoming   1017.005   .0789454  -.5820746   -4.270465 |
 49. |    Tennessee   974.6981    .111718   .6992343    5.162398 |
 50. |         Iowa    1052.78   .1265392   .7444486     5.52468 |
 51. |         Utah   1067.712   .1860992   -.896658   -6.854601 |
     +------------------------------------------------------------+

The in -5/l qualifier tells Stata to list only the fifth-from-last (-5) through last
(lowercase letter "l") observations. Figure 7.7 shows one way to display influence graphically:
symbols in a residual-versus-predicted plot are given sizes proportional to values of Cook's D
through the "analytical weight" option [aweight = D]. Five influential observations stand
out, with large positive or negative residuals and high predicted csat values.


. graph twoway scatter e3 yhat3 [aweight = D], msymbol(oh) yline(0)
Figure 7.7
[residual-versus-fitted plot; x axis: Fitted values, 850-1050; marker areas proportional to Cook's D]

Although they have different statistical rationales, Cook's D, Welsch's distance, and DFITS
are closely related. In practice they tend to flag the same observations as influential. Figure
7.8 shows their similarity in the example at hand.
. graph matrix D W DFITS, half

Figure 7.8
[scatterplot matrix of Cook's D, Welsch distance, and Dfits]

DFBETAs indicate how much each observation influences each regression coefficient.
Typing dfbeta after a regression automatically generates DFBETAs for each predictor. In



this example, they received the names DFpercent (DFBETA for predictor percent), DFpercent2,
and DFhigh. Figure 7.9 graphs their distributions as box plots.
. graph box DFpercent DFpercent2 DFhigh, legend(cols(3))

Figure 7.9
[box plots of DFpercent, DFpercent2, and DFhigh]

From left to right, Figure 7.9 shows the distributions of DFBETAs for percent, percent2,
and high. (We could more easily distinguish them in color.) The extreme values in each plot
belong to Iowa and Utah, which also have the two highest Cook’s D values. For example,
Utah’s DFhigh = -.63. This tells us that Utah causes the coefficient on high to be .63 standard
errors lower than it would be if Utah were set aside. Similarly, DFpercent = .53 indicates that
with Utah present, the coefficient on percent is .53 standard errors higher (because the percent
regression coefficient is negative, “higher” means closer to 0) than it otherwise would be.
Thus, Utah weakens the apparent effects of both high and percent.
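DFBETA can also be computed by brute force: drop observation i, refit, and scale the resulting change in a coefficient by its new standard error. A minimal single-predictor sketch with made-up data (not the state data), where the last point is deliberately influential:

```python
import math

# Made-up (x, y) values; the last point pulls the slope upward.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 14.0]

def slope_and_se(xs, ys):
    """OLS slope and its standard error for y = a + b*x."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((v - xbar) ** 2 for v in xs)
    b = sum((xs[i] - xbar) * (ys[i] - ybar) for i in range(n)) / sxx
    a = ybar - b * xbar
    sse = sum((ys[i] - a - b * xs[i]) ** 2 for i in range(n))
    return b, math.sqrt(sse / (n - 2) / sxx)

b_full, _ = slope_and_se(x, y)
dfbeta = []
for i in range(len(x)):
    b_i, se_i = slope_and_se(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    dfbeta.append((b_full - b_i) / se_i)   # slope shift, in standard errors

# The planted influential point should give much the largest |DFBETA|.
print(max(range(len(x)), key=lambda i: abs(dfbeta[i])))   # 5
```

Stata's dfbeta computes an algebraically equivalent quantity without refitting n times, but the drop-one interpretation above is what the numbers mean.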
The most direct way to learn how particular observations affect a regression is to repeat the
regression with those observations set aside. For example, we could set aside all states that
move any coefficient by half a standard error (that is, have absolute DFBETAs of .5 or more):
. regress csat percent percent2 high if abs(DFpercent) < .5 &
     abs(DFpercent2) < .5 & abs(DFhigh) < .5

      Source |       SS       df       MS              Number of obs =      48
-------------+------------------------------           F(  3,    44) =  215.47
       Model |  175366.782     3  58455.5939           Prob > F      =  0.0000
    Residual |  11937.1351    44  271.298525           R-squared     =  0.9363
-------------+------------------------------           Adj R-squared =  0.9319
       Total |  187303.917    47  3985.18972           Root MSE      =  16.471

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     percent |  -6.510868   .4700719   -13.85   0.000    -7.458235     -5.5635
    percent2 |   .0538131    .005779     9.31   0.000     .0421664    .0654599
        high |    3.35664   .4577103     7.33   0.000     2.434186    4.279095
       _cons |   815.0279   33.93199    24.02   0.000     746.6424    883.4133
------------------------------------------------------------------------------


Careful inspection will reveal the details in which this regression table (based on n = 48)
differs from its n = 51 or n = 50 counterparts seen earlier. Our central conclusion — that mean
state SAT scores are well predicted by the percent of adults with high school diplomas and,
curvilinearly, by the percent of students taking the test — remains unchanged, however.
Although diagnostic statistics draw attention to influential observations, they do not answer
the question of whether we should set those observations aside. That requires a substantive
decision based on careful evaluation of the data and research context. In this example, we have
no substantive reason to discard any states, and even the most influential of them do not
fundamentally change our conclusions.
Using any fixed definition of what constitutes an “outlier,” we are liable to see more of
them in larger samples. For this reason, sample-size-adjusted cutoffs are sometimes
recommended for identifying unusual observations. After fitting a regression model with K
coefficients (including the constant) based on n observations, we might look more closely at
those observations for which any of the following are true:
leverage h > 2K/n
Cook's D > 4/n
|DFITS| > 2√(K/n)
Welsch's W > 3√K
|DFBETA| > 2/√n
|COVRATIO - 1| > 2K/n
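For the model at hand, with K = 4 coefficients and n = 51 observations, the cutoffs work out as follows (a simple arithmetic sketch, using the same formulas as the list above):

```python
import math

K, n = 4, 51   # coefficients (including the constant) and observations

cutoffs = {
    "leverage h":     2 * K / n,
    "Cook's D":       4 / n,
    "|DFITS|":        2 * math.sqrt(K / n),
    "Welsch's W":     3 * math.sqrt(K),
    "|DFBETA|":       2 / math.sqrt(n),
    "|COVRATIO - 1|": 2 * K / n,
}
for name, value in cutoffs.items():
    print(f"{name:>15}: {value:.4f}")

# Compared with the summarize output earlier: max h = .2151 and max D = .1861
# both exceed their cutoffs (.1569 and .0784), flagging Utah for a closer look.
```

As the comment notes, Utah exceeds the leverage and Cook's D cutoffs, consistent with the diagnostic plots.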
The reasoning behind these cutoffs, and the diagnostic statistics more generally, can be found
in Cook and Weisberg (1982, 1994); Belsley, Kuh, and Welsch (1980); or Fox (1991).



Multicollinearity
If perfect multicollinearity (linear relationship) exists among the predictors, regression
equations become unsolvable. Stata handles this by warning the user and then automatically
dropping one of the offending predictors. High but not perfect multicollinearity causes more
subtle problems. When we add a new x variable that is strongly related to x variables already
in the model, symptoms of possible trouble include the following:
1. Substantially higher standard errors, with correspondingly lower t statistics.
2. Unexpected changes in coefficient magnitudes or signs.
3. Nonsignificant coefficients despite a high R².
Multiple regression attempts to estimate the independent effects of each x variable. There is
little information for doing so, however, if one or more of the x variables does not have much
independent variation. The symptoms listed above warn that coefficient estimates have become
unreliable, and might shift drastically with small changes in the sample or model. Further
troubleshooting is needed to determine whether multicollinearity really is at fault and, if so,
what should be done about it.
Multicollinearity cannot necessarily be detected, or ruled out, by examining a matrix of
correlations between variables. A better assessment comes from regressing each x on all of the
other x variables. Then we calculate 1 − R² from this regression to see what fraction of the first
x variable's variance is independent of the other x variables. For example, about 97% of high's
variance is independent of percent and percent2:


. quietly regress high percent percent2
. display 1 - e(r2)
.96942331

After regression, e(r2) holds the value of R². Similar commands reveal that only 4% of
percent's variance is independent of the other two predictor variables:
. quietly regress percent high percent2
. display 1 - e(r2)
.04010307

This finding about percent and percent2 is not surprising. In polynomial regression or
regression with interaction terms, some x variables are calculated directly from other x
variables. Although strictly speaking their relationship is nonlinear, it often is close enough to
linear to raise problems of multicollinearity.

The post-regression command vif, for variance inflation factor, performs similar
calculations automatically. This gives a quick and straightforward check for multicollinearity.
. quietly regress csat percent percent2 high

. vif

    Variable |      VIF       1/VIF
-------------+----------------------
     percent |     24.94    0.040103
    percent2 |     24.78    0.040354
        high |      1.03    0.969423
-------------+----------------------
    Mean VIF |     16.92

The 1/VIF column at right in a vif table gives values equal to 1 − R² from the regression
of each x on the other x variables, as can be seen by comparing the values for high (.969423)
or percent (.040103) with our earlier display calculations. That is, 1/VIF (or 1 − R²) tells
us what proportion of an x variable's variance is independent of all the other x variables. A low
proportion, such as the .04 (4% independent variation) of percent and percent2, indicates
potential trouble. Some analysts set a minimum level, called tolerance, for the 1/VIF value, and
automatically exclude predictors that fall below their tolerance criterion.
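The link between the 1/VIF (tolerance) column and the auxiliary regressions is simply VIF = 1/(1 − R²). A quick check using the two e(r2) values displayed earlier:

```python
# Auxiliary R-squared values, recovered from the two "display 1 - e(r2)"
# results shown above (1 - e(r2) is the independent fraction).
r2_aux = {"percent": 1 - 0.04010307, "high": 1 - 0.96942331}

for name, r2 in r2_aux.items():
    vif = 1 / (1 - r2)   # variance inflation factor
    print(name, round(vif, 2), round(1 / vif, 6))
```

The printed pairs (24.94, 0.040103) and (1.03, 0.969423) reproduce the percent and high rows of the vif table.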
The VIF column at center in a vif table reflects the degree to which other coefficients’
variances (and standard errors) are increased due to the inclusion of that predictor. We see that
high has virtually no impact on other variances, but percent and percent2 affect the variances
substantially. VIF values provide guidance but not direct measurements of the increase in
coefficient variances. The following commands show the impact directly by displaying
standard error estimates for the coefficient on percent, when percent? is and is not included in
the model.
. quietly regress csat percent percent2 high

. display _se[percent]
.50958046

. quietly regress csat percent high

. display _se[percent]
.16162193

With percent2 included in the model, the standard error for percent is three times higher:
.50958046/.16162193 = 3.1529166
This corresponds to a tenfold increase in the coefficient’s variance.
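The arithmetic behind the "tenfold" figure is just that squaring the ratio of standard errors gives the ratio of variances:

```python
se_with = 0.50958046     # se of percent, with percent2 in the model
se_without = 0.16162193  # se of percent, with percent2 omitted

ratio = se_with / se_without
print(round(ratio, 7))       # 3.1529166, as computed above
print(round(ratio ** 2, 2))  # 9.94 -- roughly a tenfold variance increase
```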
How much variance inflation is too much? Chatterjee, Hadi, and Price (2000) suggest the
following as guidelines for the presence of multicollinearity:
1. The largest VIF is greater than 10; or
2. the mean VIF is larger than 1.
With our largest VIFs close to 25, and the mean almost 17, the csat regression clearly meets
both criteria. How troublesome the problem is, and what, if anything, should be done about it,
are the next questions to consider.

Because percent and percent2 are closely related, we cannot estimate their separate effects
with nearly as much precision as we could the effect of either predictor alone. That is why the
standard error for the coefficient on percent increases threefold when we compare the
regression of csat on percent and high to a polynomial regression of csat on percent, percent2,
and high. Despite this loss of precision, however, we can still distinguish all the coefficients
from zero. Moreover, the polynomial regression obtains a better prediction model. For these
reasons, the multicollinearity in this regression does not necessarily pose a great problem, or
require a solution. We could simply live with it as one feature of an otherwise acceptable
model.

When solutions are needed, a simple trick called “centering” often succeeds in reducing
multicollinearity in polynomial or interaction-effect models. Centering involves subtracting the
mean from x variable values before generating polynomial or product terms. Subtracting the
mean creates a new variable centered on zero and much less correlated with its own squared
values. The resulting regression fits the same as an uncentered version. By reducing
multicollinearity, centering often (but not always) yields more precise coefficient estimates with
lower standard errors. The commands below generate a centered version of percent named
Cpercent, and then obtain squared values of Cpercent named Cpercent2.
. summarize percent

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     percent |        51    35.76471    26.19281          4         81
. generate Cpercent = percent - r(mean)

. generate Cpercent2 = Cpercent^2

. correlate Cpercent Cpercent2 percent percent2 high csat
(obs=51)

             | Cpercent Cperce~2  percent percent2     high     csat
-------------+------------------------------------------------------
    Cpercent |   1.0000
   Cpercent2 |   0.3791   1.0000
     percent |   1.0000   0.3791   1.0000
    percent2 |   0.9794   0.5582   0.9794   1.0000
        high |   0.1413  -0.0417   0.1413   0.1176   1.0000
        csat |  -0.8758  -0.0428  -0.8758  -0.7946   0.0858   1.0000

Whereas percent and percent2 have a near-perfect correlation with each other (r = .9794),
the centered versions Cpercent and Cpercent2 are just moderately correlated (r = .3791).
Otherwise, correlations involving percent and Cpercent are identical because centering is a
linear transformation. Correlations involving Cpercent2 are different from those with percent2,
however. Figure 7.10 shows scatterplots that help to visualize these correlations, and the
transformation's effects.
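The effect of centering on the correlation between a variable and its square is easy to reproduce in a few lines of pure Python; the values below are synthetic, spread evenly over roughly the same 4–81 range as percent:

```python
def corr(a, b):
    """Pearson correlation, computed from scratch."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((a[i] - ma) * (b[i] - mb) for i in range(n))
    va = sum((v - ma) ** 2 for v in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5

x = [4 + 77 * i / 50 for i in range(51)]   # evenly spread 4..81
x2 = [v ** 2 for v in x]

m = sum(x) / len(x)
cx = [v - m for v in x]                    # centered version
cx2 = [v ** 2 for v in cx]

print(round(corr(x, x2), 4))         # near 1: a raw value and its square
print(round(abs(corr(cx, cx2)), 4))  # 0.0: for symmetric x, centering
                                     # removes the linear overlap entirely
```

With real, non-symmetric data such as percent the centered correlation is moderate rather than exactly zero (.3791 above), but the mechanism is the same.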
. graph matrix Cpercent Cpercent2 percent percent2 high csat, half msymbol(+)

Figure 7.10
[scatterplot matrix: Cpercent, Cpercent2, % HS graduates taking SAT (percent), percent2, % over 25 w/HS diploma (high), and Mean composite SAT score (csat)]

The R², overall F test, predictions, and many other aspects of a model should be unchanged
after centering. Differences will be most noticeable in the centered variable's coefficient and
standard error.
. regress csat Cpercent Cpercent2 high

      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------           F(  3,    47) =  193.37
       Model |  207225.103     3  69075.0343           Prob > F      =  0.0000
    Residual |   16789.407    47  357.221426           R-squared     =  0.9251
-------------+------------------------------           Adj R-squared =  0.9203
       Total |   224014.51    50   4480.2902           Root MSE      =   18.90

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    Cpercent |  -2.682362   .1119085   -23.97   0.000    -2.907493   -2.457231
   Cpercent2 |   .0536555   .0063678     8.43   0.000     .0408452    .0664659
        high |   2.986509   .4857502     6.15   0.000     2.009305    3.963712
       _cons |   680.2552   37.82329    17.99   0.000     604.1646    756.3458
------------------------------------------------------------------------------

In this example, the standard error of the coefficient on Cpercent is actually lower
(.1119085 compared with .16162193) when Cpercent2 is included in the model. The t statistic
is correspondingly larger. Thus, it appears that centering did improve that coefficient estimate's


precision. The VIF table now gives less cause for concern: each of the three predictors has
more than 80% independent variation, compared with 4% for percent and percent2 in the
uncentered regression.
. vif

    Variable |      VIF       1/VIF
-------------+----------------------
    Cpercent |      1.20    0.831528
   Cpercent2 |      1.18    0.846991
        high |      1.03    0.969423
-------------+----------------------
    Mean VIF |      1.14

Another diagnostic table sometimes consulted to check for multicollinearity is the matrix
of correlations between estimated coefficients (not variables). This matrix can be displayed
after regress, anova, or other model-fitting procedures by typing
. correlate, _coef

             | Cpercent Cperce~2     high    _cons
-------------+------------------------------------
    Cpercent |   1.0000
   Cpercent2 |  -0.3893   1.0000
        high |  -0.1700   0.0240   1.0000
       _cons |   0.2105  -0.0151  -0.9912   1.0000

High correlations between pairs of coefficients indicate possible collinearity problems.
By adding the option covariance, we can see the coefficients' variance-covariance
matrix, from which standard errors are derived.
. correlate, _coef covariance

             |  Cpercent  Cperce~2      high     _cons
-------------+----------------------------------------
    Cpercent |   .012524
   Cpercent2 |  -.000277   .000041
        high |  -.009239   .000074   .235953
       _cons |   .891126  -.003637  -18.2105    1430.6
