Many of my consulting projects involve interfacing a Matlab program to an SQL database. In such cases, MathWorks’ Database Toolbox is a viable solution. Users who don’t have the toolbox can also connect to the database directly, using either the standard ODBC bridge (which is horrible in terms of performance and stability) or a direct JDBC connection (which is what the Database Toolbox itself uses under the hood). I explained this Matlab-JDBC interface in detail in chapter 2 of my Matlab-Java programming book. A bare-bones implementation of an SQL SELECT query follows (data-update queries are a bit different and are not discussed here):
% Load the appropriate JDBC driver class into Matlab's memory
% (but not directly, to bypass JIT pre-processing - we must do it in run-time!)
driver = eval('com.mysql.jdbc.Driver');  % or com.microsoft.sqlserver.jdbc.SQLServerDriver or whatever

% Connect to DB
dbPort = '3306';  % mySQL=3306; SQLServer=1433; Oracle=...
connectionStr = ['jdbc:mysql://' dbURL ':' dbPort '/' schemaName];  % or ['jdbc:sqlserver://' dbURL ':' dbPort ';database=' schemaName ';'] or whatever
dbConnObj = java.sql.DriverManager.getConnection(connectionStr, username, password);

% Send an SQL query statement to the DB and get the ResultSet
stmt = dbConnObj.createStatement(java.sql.ResultSet.TYPE_SCROLL_INSENSITIVE, java.sql.ResultSet.CONCUR_READ_ONLY);
try stmt.setFetchSize(1000); catch, end  % the default fetch size is ridiculously small in many DBs
rs = stmt.executeQuery(sqlQueryStr);

% Get the column names and data-types from the ResultSet's meta-data
MetaData = rs.getMetaData;
numCols = MetaData.getColumnCount;
data = cell(0,numCols);  % initialize
for colIdx = numCols : -1 : 1
    ColumnNames{colIdx} = char(MetaData.getColumnLabel(colIdx));
    ColumnType{colIdx}  = char(MetaData.getColumnClassName(colIdx));  % http://docs.oracle.com/javase/7/docs/api/java/sql/Types.html
end
ColumnType = regexprep(ColumnType,'.*\.','');

% Get the data from the ResultSet into a Matlab cell array
rowIdx = 1;
while rs.next  % loop over all ResultSet rows (records)
    for colIdx = 1 : numCols  % loop over all columns in the row
        switch ColumnType{colIdx}
            case {'Float','Double'}
                data{rowIdx,colIdx} = rs.getDouble(colIdx);
            case {'Long','Integer','Short','BigDecimal'}
                data{rowIdx,colIdx} = double(rs.getDouble(colIdx));
            case 'Boolean'
                data{rowIdx,colIdx} = logical(rs.getBoolean(colIdx));
            otherwise %case {'String','Date','Time','Timestamp'}
                data{rowIdx,colIdx} = char(rs.getString(colIdx));
        end
    end
    rowIdx = rowIdx + 1;
end

% Close the connection and clear resources
try rs.close();    catch, end
try stmt.close();  catch, end
try dbConnObj.closeAllStatements(); catch, end
try dbConnObj.close(); catch, end  % comment this to keep the dbConnObj open and reuse it for subsequent queries
Naturally, a real-world implementation also needs to handle database timeouts and various other errors, support data-manipulation queries (not just SELECTs), and so on.
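For example, a data-manipulation query can reuse the same Statement object via its executeUpdate method, and a query timeout can guard against a hung server. Here is a minimal sketch (the sqlUpdateStr string and the 30-second timeout below are just placeholders, not part of the code above):

% Minimal sketch: a data-manipulation (UPDATE/INSERT/DELETE) query via the same Statement object
try stmt.setQueryTimeout(30); catch, end   % seconds; some drivers/DBs do not support this
numRowsAffected = stmt.executeUpdate(sqlUpdateStr);   % e.g. 'UPDATE mytable SET ... WHERE ...'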
Anyway, this works well in general, but when you try to fetch a ResultSet with many thousands of records you start to feel the pain: the SQL statement itself may execute quickly on the DB server (the time taken by the stmt.executeQuery call), yet the subsequent double-loop processing that fetches the data from the Java ResultSet object into a Matlab cell array takes much longer.
In one of my recent projects, performance was of paramount importance, and the DB query speed from the code above was simply not good enough. You might think that this is because the data cell array is not pre-allocated, but that turns out to be incorrect: the speed remains nearly unchanged when data is properly pre-allocated. The main problem is Matlab’s non-negligible overhead for calling methods of Java objects. Since the JDBC interface only enables retrieving a single data item at a time (bulk retrieval is not possible), we have a double loop over all the data’s rows and columns, calling the appropriate Java getter method for each value based on the column’s type. The Java methods themselves are extremely efficient, but once Matlab’s invocation overhead is added, the total processing time becomes far longer.
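You can get a feel for this overhead without any database at all, by timing a trivial Java method call in a loop. This is only a rough illustration (the object and iteration count are arbitrary, and the absolute numbers vary by platform and Matlab release):

% Rough sketch: measure Matlab's per-call Java invocation overhead with a trivial Java method
jDate = java.util.Date;          % any simple Java object will do
tic
for k = 1 : 10000
    t = jDate.getTime();         % the method itself does almost no work
end
toc   % elapsed time is dominated by Matlab's method-invocation overhead, not the Java work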
So what can be done? As Andrew Janke explained in much detail, we basically need to push the double loop down into the Java level, so that Matlab receives whole arrays of primitive values that can then be processed in a vectorized manner.
So let’s create a simple Java class to do this:
// Copyright (c) Yair Altman UndocumentedMatlab.com

import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Types;

public class JDBC_Fetch {

    public static int DEFAULT_MAX_ROWS = 100000;  // default cache size = 100K rows (if DB does not support non-forward-only ResultSets)

    public static Object[] getData(ResultSet rs) throws SQLException {
        try {
            if (rs.last()) {                  // data is available
                int numRows = rs.getRow();    // row # of the last row
                rs.beforeFirst();             // get back to the top of the ResultSet
                return getData(rs, numRows);  // fetch the data
            } else {                          // no data in the ResultSet
                return null;
            }
        } catch (Exception e) {
            return getData(rs, DEFAULT_MAX_ROWS);
        }
    }

    public static Object[] getData(ResultSet rs, int maxRows) throws SQLException {
        // Read column number and types from the ResultSet's meta-data
        ResultSetMetaData metaData = rs.getMetaData();
        int numCols = metaData.getColumnCount();
        int[] colTypes = new int[numCols+1];
        int numDoubleCols  = 0;
        int numBooleanCols = 0;
        int numStringCols  = 0;
        for (int colIdx = 1; colIdx <= numCols; colIdx++) {
            int colType = metaData.getColumnType(colIdx);
            switch (colType) {
                case Types.FLOAT:
                case Types.DOUBLE:
                case Types.REAL:
                    colTypes[colIdx] = 1;  // double
                    numDoubleCols++;
                    break;
                case Types.DECIMAL:
                case Types.INTEGER:
                case Types.TINYINT:
                case Types.SMALLINT:
                case Types.BIGINT:
                    colTypes[colIdx] = 1;  // double
                    numDoubleCols++;
                    break;
                case Types.BIT:
                case Types.BOOLEAN:
                    colTypes[colIdx] = 2;  // boolean
                    numBooleanCols++;
                    break;
                default:  // 'String','Date','Time','Timestamp',...
                    colTypes[colIdx] = 3;  // string
                    numStringCols++;
            }
        }

        // Loop over all ResultSet rows, reading the data into the 2D matrix caches
        int rowIdx = 0;
        double [][] dataCacheDouble  = new double [numDoubleCols] [maxRows];
        boolean[][] dataCacheBoolean = new boolean[numBooleanCols][maxRows];
        String [][] dataCacheString  = new String [numStringCols] [maxRows];
        while (rs.next() && rowIdx < maxRows) {
            int doubleColIdx  = 0;
            int booleanColIdx = 0;
            int stringColIdx  = 0;
            for (int colIdx = 1; colIdx <= numCols; colIdx++) {
                try {
                    switch (colTypes[colIdx]) {
                        case 1:  dataCacheDouble [doubleColIdx++] [rowIdx] = rs.getDouble(colIdx);  break;  // numeric
                        case 2:  dataCacheBoolean[booleanColIdx++][rowIdx] = rs.getBoolean(colIdx); break;  // boolean
                        default: dataCacheString [stringColIdx++] [rowIdx] = rs.getString(colIdx);  break;  // string
                    }
                } catch (Exception e) {
                    System.out.println(e);
                    System.out.println(" in row #" + rowIdx + ", col #" + colIdx);
                }
            }
            rowIdx++;
        }

        // Return only the actual data in the ResultSet
        int doubleColIdx  = 0;
        int booleanColIdx = 0;
        int stringColIdx  = 0;
        Object[] data = new Object[numCols];
        for (int colIdx = 1; colIdx <= numCols; colIdx++) {
            switch (colTypes[colIdx]) {
                case 1:  data[colIdx-1] = dataCacheDouble [doubleColIdx++];  break;  // numeric
                case 2:  data[colIdx-1] = dataCacheBoolean[booleanColIdx++]; break;  // boolean
                default: data[colIdx-1] = dataCacheString [stringColIdx++];          // string
            }
        }
        return data;
    }
}
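Before this class can be called from Matlab, it needs to be compiled with a JDK whose version is compatible with Matlab’s JVM (as reported by version('-java')), and the resulting class file must be placed on Matlab’s Java classpath. A minimal sketch (the folder name below is just an example):

% Compile with:  javac JDBC_Fetch.java   (using a JDK compatible with Matlab's JVM - see version('-java'))
% Then add the folder containing the compiled JDBC_Fetch.class to the dynamic Java classpath:
javaaddpath('C:\MyMatlabUtils\javaClasses');   % example folder - use wherever you placed the class file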
So now we have a JDBC_Fetch class that we can use in our Matlab code, replacing the slow double loop with a single call to JDBC_Fetch.getData(), followed by vectorized conversion into a Matlab cell array:
% Get the data from the ResultSet using the JDBC_Fetch wrapper
data = cell(JDBC_Fetch.getData(rs));
for colIdx = 1 : numCols
    switch ColumnType{colIdx}
        case {'Float','Double'}
            data{colIdx} = num2cell(data{colIdx});
        case {'Long','Integer','Short','BigDecimal'}
            data{colIdx} = num2cell(data{colIdx});
        case 'Boolean'
            data{colIdx} = num2cell(data{colIdx});
        otherwise %case {'String','Date','Time','Timestamp'}
            %data{colIdx} = cell(data{colIdx});  % no need to do anything here!
    end
end
data = [data{:}];
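If named access is more convenient than a plain rows-by-columns cell array, the fetched data can then be repackaged using the column labels collected earlier. A small sketch (this assumes the labels in ColumnNames are valid Matlab field names):

% Sketch: convert the rows-by-columns cell array into a struct array, one element per record
% (assumes ColumnNames, collected from the ResultSet meta-data above, holds valid field names)
records = cell2struct(data, ColumnNames, 2);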
In my specific program the resulting speedup was 15x (this is not a typo: fifteen times faster). My fetches are no longer limited by the Matlab post-processing, but rather by the DB’s processing of the SQL statement (where DB indexes, clustering, SQL tuning etc. come into play).
Additional speedups can be achieved by parsing dates at the Java level (rather than returning strings), as well as by several other tweaks in the Java and Matlab code (refer to Andrew Janke’s post for some ideas). But certainly the main benefit (the 80% of the gain achieved in 20% of the work) comes from pushing the main double processing loop down into the Java level, leaving Matlab with just a single Java call to JDBC_Fetch.
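For example, if the Java class were extended to return date/time columns as numeric epoch milliseconds (via rs.getTimestamp(colIdx).getTime()) instead of strings, converting them to Matlab serial date numbers becomes a single vectorized operation. A sketch, assuming javaMillis is such a numeric array returned by the modified class:

% Sketch: convert Java epoch milliseconds (msec since 1-Jan-1970) into Matlab datenums
% (assumes javaMillis is a numeric array returned by the modified Java class; ignores time-zone offsets)
matlabDateNums = javaMillis/86400000 + datenum(1970,1,1);   % 86,400,000 msec per day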
Many additional ideas for speeding up database queries, and Matlab programs in general, can be found in my second book, Accelerating Matlab Performance.
If you’d like me to help you speed up your Matlab program, please email me (altmany at gmail), or fill out the query form on my consulting page.