When you work with big data sets, it is a good idea to store everything in a MySQL database. For statistics, you will often need a random sample of your data, perhaps drawn from a subpopulation.
The usual way is to run an SQL query with an ORDER BY RAND() clause.
Unfortunately, this is slow as hell.
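For illustration, here is roughly what that looks like in R (a minimal sketch that reuses the connection con, the table mytable, and the column col from the example further down; the LIMIT of 1000 matches the sample size used there):

# the slow way: MySQL shuffles every matching row
# before returning the first 1000 of them
slow <- dbGetQuery(con, "SELECT * FROM mytable
                         WHERE col = 'select'
                         ORDER BY RAND()
                         LIMIT 1000;")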
A better way is to first load all auto-IDs or row numbers of the subpopulation, select a random sample in R, and then run a second query that addresses the sampled rows directly. This is very fast even with huge data sets and has the advantage that you can control the randomness via set.seed.
Here is an example:
require(RMySQL)

# set up the RMySQL connection
con <- dbConnect(RMySQL::MySQL(), host = "myhost", user = "myusername",
                 password = "mypassword", dbname = "mydb")

# select the auto-ID (or row number) of every row where the column "col" has the value "select"
IDs <- dbSendQuery(con, "SELECT autoid FROM mytable WHERE col = 'select';")

# fetch the results; n = -1 returns all rows instead of the default 500
IDs <- dbFetch(IDs, n = -1)

# set a seed for reproducibility
set.seed(12)

# draw the random sample from the IDs (1000 is just an example)
IDs <- IDs[sample(dim(IDs)[1], 1000), ]

# make the second request by collapsing the sampled IDs with paste
res <- dbSendQuery(con, paste0("SELECT * FROM mytable WHERE col = 'select' AND autoid IN (",
                               paste(IDs, collapse = ","), ");"))

# fetch the results
res <- dbFetch(res, n = -1)

# close the connection
dbDisconnect(con)