Data cleaning and tidying with the R package dplyr
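Main functions of the R package dplyr: pick observations by their values (filter()). Reorder the rows (arrange()). Pick variables by their names (select()). Create new variables as functions of existing variables (mutate()). Collapse many values down to a single summary (summarise()). All of these can be combined with group_by(), which changes the scope of each function from operating on the entire data set to operating on it group by group.

1. filter: select rows that satisfy a condition

    # install.packages("tidyverse")
    # install.packages("dplyr")
    library(ggplot2)
    library(dplyr)
    # library(tidyverse)
    library(nycflights13)
    # ?flights

    ## filter: select rows that satisfy a condition
    filter(flights, month == 1, day == 1)
    filter(flights, month == 11 | month == 12)
    # The next two calls follow the R4DS chapter cited at the end
    nov_dec <- filter(flights, month %in% c(11, 12))
    filter(flights, !(arr_delay > 120 | dep_delay > 120))
    filter(starwars, hair_color == "none" & eye_color == "black")

    # filter() drops rows where the condition is NA unless they are requested explicitly
    df <- tibble(x = c(1, NA, 3))  # example values as in the cited R4DS chapter
    filter(df, x > 1)
    filter(df, is.na(x) | x > 1)

    # Grouped filter: keep characters heavier than the mean of their gender group
    starwars %>%
      group_by(gender) %>%
      filter(mass > mean(mass, na.rm = TRUE))

2. arrange: order rows by column values

    ## arrange: order rows by column values
    arrange(flights, year, month, day)
    arrange(flights, desc(dep_delay))  # descending order

5. summarise: collapse values into a summary

    flights %>%
      group_by(dest) %>%  # grouping by dest is assumed from the dest filter below
      summarise(
        count = n(),  # n() gives the current group size
        dist = mean(distance, na.rm = TRUE),
        delay = mean(arr_delay, na.rm = TRUE)
      ) %>%
      filter(count > 20, dest != "HNL")

    # Keep only flights that actually departed and arrived
    not_cancelled <- flights %>%
      filter(!is.na(dep_delay), !is.na(arr_delay))

    not_cancelled %>%
      group_by(year, month, day) %>%
      summarise(mean = mean(dep_delay))

    # Commonly used functions inside summarise():
    # Center: mean(), median()
    # Spread: sd(), IQR(), mad()
    # Range: min(), max(), quantile()
    # Position: first(), last(), nth()
    # Count: n(), n_distinct()
    # Logical: any(), all()

    # Rank flights within each day by departure time, latest first
    not_cancelled %>%
      group_by(year, month, day) %>%
      mutate(r = min_rank(desc(dep_time)))

    # Count flights per day that left before 5:00
    not_cancelled %>%
      group_by(year, month, day) %>%
      summarise(n_early = sum(dep_time < 500))

    not_cancelled %>%
      ungroup() %>%  # remove grouping
      summarise(flights = n())

6. Merging data frames

    ## Join functions
    semi_join(): returns all rows from x that have a match in y.
    anti_join(): returns all rows from x that have no match in y.
    inner_join(): includes only rows with a match in both x and y.
    left_join(): includes all rows in x.
    right_join(): includes all rows in y.
    full_join(): includes all rows in x or y.

Reference: https://r4ds.had.co.nz/transform.html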
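The join functions above are described but not demonstrated. The following is a minimal sketch of how they differ, using two hypothetical toy tables `x` and `y` keyed by an `id` column. These tables, their column names, and their values are illustrative assumptions and are not part of nycflights13.

```r
library(dplyr)

# Hypothetical toy tables sharing the key column `id` (illustration only)
x <- tibble(id = c(1, 2, 3), x_val = c("a", "b", "c"))
y <- tibble(id = c(2, 3, 4), y_val = c("B", "C", "D"))

# Mutating joins: add columns from y to x
inner <- inner_join(x, y, by = "id")  # ids present in both tables: 2, 3
left  <- left_join(x, y, by = "id")   # every row of x; y_val is NA for id 1
right <- right_join(x, y, by = "id")  # every row of y; x_val is NA for id 4
full  <- full_join(x, y, by = "id")   # every id from either table: 1 to 4

# Filtering joins: keep or drop rows of x, never add columns from y
semi <- semi_join(x, y, by = "id")    # rows of x with a match in y: ids 2, 3
anti <- anti_join(x, y, by = "id")    # rows of x without a match in y: id 1
```

Passing `by = "id"` explicitly avoids the message dplyr prints when it has to guess the key columns, and makes the intended key clear to the reader.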