When you work with big data sets, it is a good idea to store everything in a MySQL database. For statistics you will often need a random sample of your data, perhaps from a subpopulation of the data. The usual way is to run an SQL query with ORDER BY RAND() appended. Unfortunately, this is slow as hell. A better way is to first load only the auto-IDs (or row numbers) of the subpopulation, draw a random sample of those IDs in R, and then run a second query that addresses the sampled rows directly. This is very fast even with huge data sets and has the advantage that you can control the randomness via set.seed. Here is an example:

require(RMySQL)

# set up the RMySQL connection
con <- dbConnect(RMySQL::MySQL(), host = "myhost", user = "myusername",
                 password = "mypassword", dbname = "mydb")

# select the auto-ID or row number where the column "col" has the value "select"
IDs <- dbSendQuery(con, ...
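Since the example above is cut off, here is a minimal sketch of the complete two-step workflow. The table name `mytable`, the ID column `id`, the filter column `col`, and the sample size are assumptions for illustration, not part of the original example:

```r
library(RMySQL)

# connect as in the example above
con <- dbConnect(RMySQL::MySQL(), host = "myhost", user = "myusername",
                 password = "mypassword", dbname = "mydb")

# Step 1: fetch only the auto-IDs of the subpopulation (small, fast result set)
ids <- dbGetQuery(con, "SELECT id FROM mytable WHERE col = 'select'")$id

# Step 2: draw a reproducible random sample of those IDs in R
set.seed(42)
sample_ids <- sample(ids, size = 1000)

# Step 3: fetch only the sampled rows, addressing them directly by ID
query <- sprintf("SELECT * FROM mytable WHERE id IN (%s)",
                 paste(sample_ids, collapse = ", "))
result <- dbGetQuery(con, query)

dbDisconnect(con)
```

Because the sample is drawn in R rather than via ORDER BY RAND(), MySQL never has to shuffle the whole table, and rerunning the script with the same seed reproduces the exact same sample.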