How to set some values within each column to zero based on the number of their repetition?
How to set some values within each column to zero based on the number of their repetition?
I have a dataframe of 117,000 rows with 10,000 columns containing values, some being repeated frequently and some not. I aim to replace those values being repeated less than 100 times within each column to zero. I put a small example here:
my input is like this:
1 200 444
2 310 000
3 310 000
4 444 444
5 200 444
6 200 112
7 310 444
8 310 876
9 310 876
10 444 876
I need in my output any value which is repeated less than 3 times within a column be set to zero (for example in column 1, value 444 is repeated less than 3 times, so it should be set as zero and the same should be done for values 000 and 112 in column 2) :
1 200 444
2 310 0
3 310 0
4 0 444
5 200 444
6 200 0
7 310 444
8 310 876
9 310 876
10 0 876
Could any body help me by writing a script to do this in R for a huge data.frame? (117000 rows and 10000 column)?
0
000
@jlhoward: that is right. you can consider them as characters.
– zara
Sep 22 '15 at 18:45
If that's the case, you cannot mix numeric and character in a single column. You could set to
NA though.– jlhoward
Sep 22 '15 at 18:48
NA
2 Answers
2
Updated to incorporate @Arun's comments, and to demo with a dataset more representative of the real case.
This question is really interesting primarily because the full dataset is so large (117,000 rows X 10,000 columns). So the main issue is not how to flag the less common entries, but how to do that efficiently. This shows three options: a data.frame option, a naive data.table option (mine), and a sophisticated data.table option (@Arun's). It's an object lesson in the superiority of data.table, especially when used correctly.
# create sample: 117,000 rows, 100 columns (1% of real case)
set.seed(1) # for reproducibility
ltrs <- sapply(letters,function(x)paste(rep(x,3),collapse=""))
df <- data.frame(replicate(100,sample(ltrs, 117e3, replace = TRUE,
p=c(0.5,0.3,0.1,0.09,rep(0.01/22,22)))),
stringsAsFactors = FALSE)
So this data-set has 117,000 rows and 100 columns (100% of the rows, 1% of the columns). Each entry is a three letter string ("AAA", "BBB", etc). AAA - DDD account for 99% of the entries and the other 22 letters collectively account for the remaining 1%. So there are bound to be instances of these with frequency < 100.
# data.frame option
set.col <- function(x) {
tbl <- table(x)
x[x%in%names(tbl)[tbl<100]]<-NA
x
}
system.time(
result.1 <- as.data.frame(sapply(df,set.col))
)
# user system elapsed
# 44.52 0.27 44.95
So the data.frame option runs in ~44 sec (on my system). The real case would run in ~4400 sec, or about 73min.
# naive data.table
library(data.table)
result.2 <- as.data.table(df)
system.time(
for (j in 1:ncol(df)) {
tbl <- table(result.2[[j]])
set(result.2, i=which(result.2[[j]]%in%names(tbl)[tbl<100]),j=j, NA)
}
)
# user system elapsed
# 1.51 0.05 1.56
The naive data.table approach still uses table(...) but takes advantage of set(...). It runs about 30X faster. The full data-set would run in about 3 min.
table(...)
set(...)
# sophisticated data.table
# install.packages("data.table", type="source") # requires 1.9.6+
library(data.table)
result.3 <- as.data.table(df)
system.time(
for (j in 1:ncol(df)) {
tbl <- result.3[,.N,by=c(names(result.3)[j])][N<100]
result.3[tbl, c(names(result.3)[j]):=NA, on=c(names(result.3)[j])]
}
)
# user system elapsed
# 0.65 0.00 0.67
identical(result.2,result.3)
# [1] TRUE
The sophisticated data.table solution runs 2.5X faster still, and about 70X faster than the data frame option. The full data-set should run in about 65 sec.
The following just demonstrates that option 3 works with the example in the question:
# import sample data
df <- read.table(header=F, row.names=1, colClasses=c("character"),
text="1 200 444
2 310 000
3 310 000
4 444 444
5 200 444
6 200 112
7 310 444
8 310 876
9 310 876
10 444 876")
result.3 <- as.data.table(df)
for (j in 1:ncol(df)) {
tbl <- result.3[,.N,by=c(names(result.3)[j])][N<3]
result.3[tbl, c(names(result.3)[j]):=NA, on=c(names(result.3)[j])]
}
result.3
# V2 V3
# 1: 200 444
# 2: 310 NA
# 3: 310 NA
# 4: NA 444
# 5: 200 444
# 6: 200 NA
# 7: 310 444
# 8: 310 876
# 9: 310 876
# 10: NA 876
To write the result to a csv file, use something like:
write.csv(result.3, file="myfile.csv")
@ jIhoward: why did you name it as df? and how could you extract this output from R ?
– zara
Sep 22 '15 at 22:26
[1] There's nothing special about the name. You didn't specify. [2] Not sure what you mean by "extract this output from R", but to save the result to a csv file, use, e.g.,
write.csv(df,file="mydf.csv"). [3] Also, I assumed the first column was an actual column, and not row names.– jlhoward
Sep 22 '15 at 23:13
write.csv(df,file="mydf.csv")
@ jlhoward: the first column was row names. should I change something in script please?
– zara
Sep 22 '15 at 23:44
I reworked the answer based on your input in the comments, and mindful that your actual dataset is extremely large ( > 1GB in size). I also included the code used to import your example, just so we have a common understanding of
df.– jlhoward
Sep 23 '15 at 3:55
df
One I could think of is to replace
table() with dt[, .N, by=names(dt)[j][N < 3L] and then do a join in the next step. Runs in about 80s on my machine. Second way is (if the column values are all > 0L, and are integers) is to replace table() with tabulate().– Arun
Sep 23 '15 at 22:42
table()
dt[, .N, by=names(dt)[j][N < 3L]
table()
tabulate()
OK, here is a solution for a similar problem to the one you laid out in your simplified version. The comments should explain it as it goes along.
Here we replace all elements which appear strictly less than four times, so the whole second column of the test_df should go to zero.
test_df
# Creates fake test dataframe
col1 <- c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3)
col2 <- c(3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)
test_df <- data.frame(col1, col2)
# Finds the number of occurences of every element in the dataframe
occurences <- table(unlist(test_df))
# Find the unique elements across the whole dataframe
elements <- unique(unlist(test_df))
# Creates an empty vector for all elements less than four
elements_less_than_four <- c()
# Loops through all elements in the dataframe and if they appear less than
# four times puts them in a list
for(el in elements){
if( occurences[[el]] < 4){
elements_less_than_four <- c(elements_less_than_four, el)
}
}
# Unlist the df for quick comparison, turn all necessary values to zero.
unlisted_df <- as.vector(unlist(test_df))
correct_values <- replace(unlisted_df, unlisted_df %in% elements_less_than_four, 0)
# Reformats the dataframe
finished_df <- data.frame(matrix(correct_values, nrow = nrow(test_df),
ncol = ncol(test_df), byrow = FALSE))
I ran your script and in the result I saw for second column vale 3 did not set as zero while it repeated only once within column2. could you please edit your script in the way it 3 get zero within column2?
– zara
Sep 22 '15 at 19:24
@JCoolerton:I ran your script and in the result I saw for second column vaule 3 did not set as zero while it was repeated only once within column2. could you please edit your script in the way it 3 get zero within column2?
– zara
Sep 22 '15 at 19:31
Oh, actually it's right that 3 didn't go to zero, as it's repeated more than four times across the whole data frame. So that code should do what it's supposed to :).
– JCollerton
Sep 22 '15 at 19:35
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
Are these "numbers" numbers or character? The number
0would not normally display as000.– jlhoward
Sep 22 '15 at 18:33