When you work with big data sets, it is a good idea to store everything in a MySQL database. For statistics, you will often need a random sample of your data, perhaps drawn from a subpopulation.
The usual way is to run an SQL query with an ORDER BY RAND() clause.
Unfortunately, this is slow as hell.
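For illustration, here is roughly what that looks like in R (a minimal sketch that reuses the connection con, the table mytable, and the column col from the example further down; the LIMIT of 1000 matches the sample size used there):

# the slow way: MySQL shuffles every matching row
# before returning the first 1000 of them
slow <- dbGetQuery(con, "SELECT * FROM mytable
                         WHERE col = 'select'
                         ORDER BY RAND()
                         LIMIT 1000;")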
A better way is to first load all auto-IDs or row numbers of the subpopulation, select a random sample in R, and then run a second query that addresses the sampled rows directly. This is very fast even with huge data sets and has the advantage that you can control the randomness via set.seed.
Here is an example:
require(RMySQL)

# set up the RMySQL connection
con <- dbConnect(RMySQL::MySQL(), host = "myhost", user = "myusername",
                 password = "mypassword", dbname = "mydb")

# select the auto-ID (or row number) of every row where the column "col" has the value "select"
IDs <- dbSendQuery(con, "SELECT autoid FROM mytable WHERE col = 'select';")

# fetch the results; n = -1 returns all rows instead of the default 500
IDs <- dbFetch(IDs, n = -1)

# set a seed for reproducibility
set.seed(12)

# draw the random sample from the IDs (1000 is just an example)
IDs <- IDs[sample(dim(IDs)[1], 1000), ]

# make the second request by collapsing the sampled IDs with paste
res <- dbSendQuery(con, paste0("SELECT * FROM mytable WHERE col = 'select' AND autoid IN (",
                               paste(IDs, collapse = ","), ");"))

# fetch the results
res <- dbFetch(res, n = -1)

# close the connection
dbDisconnect(con)