How to set some values within each column to zero based on the number of their repetition?

I have a dataframe of 117,000 rows with 10,000 columns containing values, some being repeated frequently and some not. I aim to replace those values being repeated less than 100 times within each column to zero. I put a small example here:
my input is like this:

1 200 444 2 310 000 3 310 000 4 444 444 5 200 444 6 200 112 7 310 444 8 310 876 9 310 876 10 444 876

I need in my output any value which is repeated less than 3 times within a column be set to zero (for example in column 1, value 444 is repeated less than 3 times, so it should be set as zero and the same should be done for values 000 and 112 in column 2) :

1 200 444 2 310 0 3 310 0 4 0 444 5 200 444 6 200 0 7 310 444 8 310 876 9 310 876 10 0 876

Could any body help me by writing a script to do this in R for a huge data.frame? (117000 rows and 10000 column)?

Are these "numbers" numbers or character? The number 0 would not normally display as 000.
– jlhoward
Sep 22 '15 at 18:33

0

000

@jlhoward: that is right. you can consider them as characters.
– zara
Sep 22 '15 at 18:45

If that's the case, you cannot mix numeric and character in a single column. You could set to NA though.
– jlhoward
Sep 22 '15 at 18:48

NA

2 Answers
2

Updated to incorporate @Arun's comments, and to demo with a dataset more representative of the real case.

This question is really interesting primarily because the full dataset is so large (117,000 rows X 10,000 columns). So the main issue is not how to flag the less common entries, but how to do that efficiently. This shows three options: a data.frame option, a naive data.table option (mine), and a sophisticated data.table option (@Arun's). It's an object lesson in the superiority of data.table, especially when used correctly.

# create sample: 117,000 rows, 100 columns (1% of real case) set.seed(1) # for reproducibility ltrs <- sapply(letters,function(x)paste(rep(x,3),collapse="")) df <- data.frame(replicate(100,sample(ltrs, 117e3, replace = TRUE, p=c(0.5,0.3,0.1,0.09,rep(0.01/22,22)))), stringsAsFactors = FALSE)

So this data-set has 117,000 rows and 100 columns (100% of the rows, 1% of the columns). Each entry is a three letter string ("AAA", "BBB", etc). AAA - DDD account for 99% of the entries and the other 22 letters collectively account for the remaining 1%. So there are bound to be instances of these with frequency < 100.

# data.frame option set.col <- function(x) { tbl <- table(x) x[x%in%names(tbl)[tbl<100]]<-NA x } system.time( result.1 <- as.data.frame(sapply(df,set.col)) ) # user system elapsed # 44.52 0.27 44.95

So the data.frame option runs in ~44 sec (on my system). The real case would run in ~4400 sec, or about 73min.

# naive data.table library(data.table) result.2 <- as.data.table(df) system.time( for (j in 1:ncol(df)) { tbl <- table(result.2[[j]]) set(result.2, i=which(result.2[[j]]%in%names(tbl)[tbl<100]),j=j, NA) } ) # user system elapsed # 1.51 0.05 1.56

The naive data.table approach still uses table(...) but takes advantage of set(...). It runs about 30X faster. The full data-set would run in about 3 min.

table(...)

set(...)

# sophisticated data.table # install.packages("data.table", type="source") # requires 1.9.6+ library(data.table) result.3 <- as.data.table(df) system.time( for (j in 1:ncol(df)) { tbl <- result.3[,.N,by=c(names(result.3)[j])][N<100] result.3[tbl, c(names(result.3)[j]):=NA, on=c(names(result.3)[j])] } ) # user system elapsed # 0.65 0.00 0.67 identical(result.2,result.3) # [1] TRUE

The sophisticated data.table solution runs 2.5X faster still, and about 70X faster than the data frame option. The full data-set should run in about 65 sec.

The following just demonstrates that option 3 works with the example in the question:

# import sample data df <- read.table(header=F, row.names=1, colClasses=c("character"), text="1 200 444 2 310 000 3 310 000 4 444 444 5 200 444 6 200 112 7 310 444 8 310 876 9 310 876 10 444 876") result.3 <- as.data.table(df) for (j in 1:ncol(df)) { tbl <- result.3[,.N,by=c(names(result.3)[j])][N<3] result.3[tbl, c(names(result.3)[j]):=NA, on=c(names(result.3)[j])] } result.3 # V2 V3 # 1: 200 444 # 2: 310 NA # 3: 310 NA # 4: NA 444 # 5: 200 444 # 6: 200 NA # 7: 310 444 # 8: 310 876 # 9: 310 876 # 10: NA 876

To write the result to a csv file, use something like:

write.csv(result.3, file="myfile.csv")

@ jIhoward: why did you name it as df? and how could you extract this output from R ?
– zara
Sep 22 '15 at 22:26

[1] There's nothing special about the name. You didn't specify. [2] Not sure what you mean by "extract this output from R", but to save the result to a csv file, use, e.g., write.csv(df,file="mydf.csv"). [3] Also, I assumed the first column was an actual column, and not row names.
– jlhoward
Sep 22 '15 at 23:13

write.csv(df,file="mydf.csv")

@ jlhoward: the first column was row names. should I change something in script please?
– zara
Sep 22 '15 at 23:44

I reworked the answer based on your input in the comments, and mindful that your actual dataset is extremely large ( > 1GB in size). I also included the code used to import your example, just so we have a common understanding of df.
– jlhoward
Sep 23 '15 at 3:55

df

One I could think of is to replace table() with dt[, .N, by=names(dt)[j][N < 3L] and then do a join in the next step. Runs in about 80s on my machine. Second way is (if the column values are all > 0L, and are integers) is to replace table() with tabulate().
– Arun
Sep 23 '15 at 22:42

table()

dt[, .N, by=names(dt)[j][N < 3L]

table()

tabulate()

OK, here is a solution for a similar problem to the one you laid out in your simplified version. The comments should explain it as it goes along.

Here we replace all elements which appear strictly less than four times, so the whole second column of the test_df should go to zero.

test_df

# Creates fake test dataframe col1 <- c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3) col2 <- c(3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14) test_df <- data.frame(col1, col2) # Finds the number of occurences of every element in the dataframe occurences <- table(unlist(test_df)) # Find the unique elements across the whole dataframe elements <- unique(unlist(test_df)) # Creates an empty vector for all elements less than four elements_less_than_four <- c() # Loops through all elements in the dataframe and if they appear less than # four times puts them in a list for(el in elements){ if( occurences[[el]] < 4){ elements_less_than_four <- c(elements_less_than_four, el) } } # Unlist the df for quick comparison, turn all necessary values to zero. unlisted_df <- as.vector(unlist(test_df)) correct_values <- replace(unlisted_df, unlisted_df %in% elements_less_than_four, 0) # Reformats the dataframe finished_df <- data.frame(matrix(correct_values, nrow = nrow(test_df), ncol = ncol(test_df), byrow = FALSE))

I ran your script and in the result I saw for second column vale 3 did not set as zero while it repeated only once within column2. could you please edit your script in the way it 3 get zero within column2?
– zara
Sep 22 '15 at 19:24

@JCoolerton:I ran your script and in the result I saw for second column vaule 3 did not set as zero while it was repeated only once within column2. could you please edit your script in the way it 3 get zero within column2?
– zara
Sep 22 '15 at 19:31

Oh, actually it's right that 3 didn't go to zero, as it's repeated more than four times across the whole data frame. So that code should do what it's supposed to :).
– JCollerton
Sep 22 '15 at 19:35

By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

搜尋此網誌

Search between a Gas Station