Clever Uses of collapse in Stata


2023-06-18 02:36 | Source: compiled from the web

Stata Learning Modules: Collapsing data across observations

Sometimes you have data files that need to be collapsed to be useful to you. For example, you might have student data but you really want classroom data, or you might have weekly data but you want monthly data, etc. We will illustrate this using an example showing how you can collapse data across kids to make family level data.

Here is a file containing information about the kids in three families. There is one record per kid. birth is the order of birth (i.e., 1 is first); age, wt and sex are the child's age, weight and sex. We will use this file to show how to collapse data across observations.

use http://www.ats.ucla.edu/stat/stata/modules/kids, clear

list

          famid    kidname      birth        age         wt        sex

  1.         1       Beth          1          9         60          f 

  2.         1        Bob          2          6         40          m 

  3.         1       Barb          3          3         20          f 

  4.         2       Andy          1          8         80          m 

  5.         2         Al          2          6         50          m 

  6.         2        Ann          3          2         20          f 

  7.         3       Pete          1          6         60          m 

  8.         3        Pam          2          4         40          f 

  9.         3       Phil          3          2         20          m  
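If the URL above cannot be reached, the same nine records can be typed in directly from the listing. This is only a sketch: the storage types (str4 for kidname, str1 for sex, and the default numeric type for the rest) are assumptions, since the listing does not show how the original file stores its variables.

```stata
* Recreate the kids data by hand from the listing above.
* Storage types are assumed; the original file may differ.
clear
input famid str4 kidname birth age wt str1 sex
1 "Beth" 1 9 60 "f"
1 "Bob"  2 6 40 "m"
1 "Barb" 3 3 20 "f"
2 "Andy" 1 8 80 "m"
2 "Al"   2 6 50 "m"
2 "Ann"  3 2 20 "f"
3 "Pete" 1 6 60 "m"
3 "Pam"  2 4 40 "f"
3 "Phil" 3 2 20 "m"
end

* Show the nine kid-level records
list
```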

Consider the collapse command below. It collapses across all of the observations to make a single record with the average age of the kids.

collapse age

list

           age

  1.  5.111111  

The above collapse command was not very useful, but you can combine it with the by(famid) option, and then it creates one record for each family that contains the average age of the kids in the family.

use http://www.ats.ucla.edu/stat/stata/modules/kids, clear

collapse age, by(famid)

list

         famid        age

  1.         1          6 

  2.         2   5.333333 

  3.         3          4  
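Note that collapse replaces the data in memory with the collapsed records, which is why the examples here reload the kids file each time. As an alternative, the sketch below wraps collapse in preserve and restore so the kid-level data comes back afterwards. It assumes the data are typed in with input (with assumed storage types) rather than read from the URL.

```stata
* Kid-level data typed in from the listing (storage types assumed)
clear
input famid str4 kidname birth age wt str1 sex
1 "Beth" 1 9 60 "f"
1 "Bob"  2 6 40 "m"
1 "Barb" 3 3 20 "f"
2 "Andy" 1 8 80 "m"
2 "Al"   2 6 50 "m"
2 "Ann"  3 2 20 "f"
3 "Pete" 1 6 60 "m"
3 "Pam"  2 4 40 "f"
3 "Phil" 3 2 20 "m"
end

* Keep a copy of the kid-level data before collapsing
preserve

* One record per family with the mean age
collapse (mean) avgage=age, by(famid)
list

* Bring back the original nine kid-level records
restore
assert _N == 9
```

This avoids reloading the file when you only need the family-level summary temporarily.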

The following collapse command does exactly the same thing as the one above, except that the average of age is named avgage and we have explicitly told the collapse command that we want it to compute the mean.

use http://www.ats.ucla.edu/stat/stata/modules/kids, clear

collapse (mean) avgage=age, by(famid)

list

         famid     avgage

  1.         1          6 

  2.         2   5.333333 

  3.         3          4 

We can request averages for more than one variable. Here we get the average for age and for wt all in the same command.

use http://www.ats.ucla.edu/stat/stata/modules/kids, clear

collapse (mean) avgage=age avgwt=wt, by(famid)

list

         famid     avgage      avgwt

  1.         1          6         40 

  2.         2   5.333333         50 

  3.         3          4         40  
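The mean is not the only statistic collapse can compute; (min) and (max), for example, work the same way. The sketch below is an illustration rather than part of the original module: the names youngest and oldest are hypothetical, and the data are typed in from the listing with assumed storage types.

```stata
* Kid-level data typed in from the listing (storage types assumed)
clear
input famid str4 kidname birth age wt str1 sex
1 "Beth" 1 9 60 "f"
1 "Bob"  2 6 40 "m"
1 "Barb" 3 3 20 "f"
2 "Andy" 1 8 80 "m"
2 "Al"   2 6 50 "m"
2 "Ann"  3 2 20 "f"
3 "Pete" 1 6 60 "m"
3 "Pam"  2 4 40 "f"
3 "Phil" 3 2 20 "m"
end

* Youngest and oldest age plus mean weight, one record per family
* (youngest and oldest are hypothetical names chosen for this sketch)
collapse (min) youngest=age (max) oldest=age (mean) avgwt=wt, by(famid)
list

* Family 1 has kids aged 9, 6 and 3
assert youngest == 3 & oldest == 9 if famid == 1
```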

This command gets the average of age and wt like the command above, and also computes numkids, the number of kids in each family. The count is obtained by counting the observations with nonmissing values of birth, so a kid whose birth value is missing would not be counted.

use http://www.ats.ucla.edu/stat/stata/modules/kids, clear 

collapse (mean) avgage=age avgwt=wt (count) numkids=birth, by(famid)

list

         famid     avgage      avgwt    numkids

  1.         1          6         40          3 

  2.         2   5.333333         50          3 

  3.         3          4         40          3  
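Because (count) only counts nonmissing values, the result depends on the variable being counted. The sketch below makes this visible by setting one value of birth to missing, purely for illustration (the listing shows no missing values). The variables one and nrecords are hypothetical helpers: summing a constant 1 counts every record regardless of missing values.

```stata
* Kid-level data typed in from the listing (storage types assumed)
clear
input famid str4 kidname birth age wt str1 sex
1 "Beth" 1 9 60 "f"
1 "Bob"  2 6 40 "m"
1 "Barb" 3 3 20 "f"
2 "Andy" 1 8 80 "m"
2 "Al"   2 6 50 "m"
2 "Ann"  3 2 20 "f"
3 "Pete" 1 6 60 "m"
3 "Pam"  2 4 40 "f"
3 "Phil" 3 2 20 "m"
end

* Illustration only: pretend Beth's birth order is unknown
replace birth = . in 1

* A constant whose sum counts every record (hypothetical helper)
generate one = 1

collapse (count) numkids=birth (sum) nrecords=one, by(famid)
list

* Family 1 now has 2 nonmissing birth values but still 3 records
assert numkids == 2 & nrecords == 3 if famid == 1
```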

Suppose you wanted a count of the number of boys and girls in the family. We can do that with one extra step. We will create a dummy variable that is 1 if the kid is a boy (0 if not), and a dummy variable that is 1 if the kid is a girl (and 0 if not). The sum of the boy dummy variable is the number of boys and the sum of the girl dummy variable is the number of girls.

First, let's use the kids file (and clear out the existing data).

use http://www.ats.ucla.edu/stat/stata/modules/kids, clear 

We use tabulate with the generate option to make the dummy variables.

tabulate sex, generate(sexdum)

        sex |      Freq.     Percent        Cum.

------------+-----------------------------------

          f |          4       44.44       44.44

          m |          5       55.56      100.00

------------+-----------------------------------

      Total |          9      100.00

We can look at the dummy variables. tabulate numbers the dummies in the sorted order of the values of sex (f, then m), so sexdum1 is the dummy variable for girls and sexdum2 is the dummy variable for boys. The sum of sexdum1 is the number of girls in the family. The sum of sexdum2 is the number of boys in the family.

list famid sex sexdum1 sexdum2

          famid        sex   sexdum1   sexdum2

  1.         1          f         1         0 

  2.         1          m         0         1 

  3.         1          f         1         0 

  4.         2          m         0         1 

  5.         2          m         0         1 

  6.         2          f         1         0 

  7.         3          m         0         1 

  8.         3          f         1         0 

  9.         3          m         0         1  

The command below creates girls which is the number of girls in the family, and boys which is the number of boys in the family.

collapse (count) numkids=birth (sum) girls=sexdum1 boys=sexdum2, by(famid)

We can list out the data to confirm that it worked correctly.

list famid boys girls numkids

         famid      boys     girls    numkids

  1.         1         1         2          3 

  2.         2         2         1          3 

  3.         3         2         1          3  
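The dummy variables can also be created directly with generate instead of tabulate, which makes it explicit which dummy stands for which sex. This is a sketch of that alternative, not the method used in this module; the names boy and girl are hypothetical, and the data are typed in from the listing with assumed storage types.

```stata
* Kid-level data typed in from the listing (storage types assumed)
clear
input famid str4 kidname birth age wt str1 sex
1 "Beth" 1 9 60 "f"
1 "Bob"  2 6 40 "m"
1 "Barb" 3 3 20 "f"
2 "Andy" 1 8 80 "m"
2 "Al"   2 6 50 "m"
2 "Ann"  3 2 20 "f"
3 "Pete" 1 6 60 "m"
3 "Pam"  2 4 40 "f"
3 "Phil" 3 2 20 "m"
end

* A logical expression evaluates to 1 when true and 0 when false
generate boy  = (sex == "m")
generate girl = (sex == "f")

* Summing the dummies within each family gives the counts
collapse (sum) boys=boy girls=girl, by(famid)
list

* Family 1 has one boy (Bob) and two girls (Beth, Barb)
assert boys == 1 & girls == 2 if famid == 1
```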

Summary

To create one record per family (famid) with the average of age within each family.

collapse age, by(famid)

To create one record per family (famid) with the average of age (called avgage) and average weight (called avgwt) within each family.

collapse (mean) avgage=age avgwt=wt, by(famid)

Same as the example above, but also counts the number of kids within each family, calling that count numkids.

collapse (mean) avgage=age avgwt=wt (count) numkids=birth, by(famid)

Counts the number of boys and girls in each family by using tabulate to create dummy variables based on sex and then summing the dummy variables within each family.

tabulate sex, generate(sexdum)

collapse (sum) girls=sexdum1 boys=sexdum2, by(famid)

 


