Speeding up Matlab-JDBC SQL queries

Many of my consulting projects involve interfacing a Matlab program to an SQL database. In such cases, using MathWorks’ Database Toolbox is a viable solution. Users who don’t have the toolbox can also connect directly to the database, either via the standard ODBC bridge (which is horrible in terms of performance and stability) or via a direct JDBC connection (which is what the Database Toolbox itself uses under the hood). I explained this Matlab-JDBC interface in detail in chapter 2 of my Matlab-Java programming book. A bare-bones implementation of an SQL SELECT query follows (data update queries are a bit different and will not be discussed here):

% Load the appropriate JDBC driver class into Matlab's memory
% (not directly, but via eval, so as to bypass JIT pre-processing - the driver class must be resolved at run-time!)
driver = eval('com.mysql.jdbc.Driver');  % or com.microsoft.sqlserver.jdbc.SQLServerDriver or whatever
 
% Connect to DB
dbPort = '3306'; % mySQL=3306; SQLServer=1433; Oracle=...
connectionStr = ['jdbc:mysql://' dbURL ':' dbPort '/' schemaName];  % or ['jdbc:sqlserver://' dbURL ':' dbPort ';database=' schemaName ';'] or whatever
dbConnObj = java.sql.DriverManager.getConnection(connectionStr, username, password);
 
% Send an SQL query statement to the DB and get the ResultSet
stmt = dbConnObj.createStatement(java.sql.ResultSet.TYPE_SCROLL_INSENSITIVE, java.sql.ResultSet.CONCUR_READ_ONLY);
try stmt.setFetchSize(1000); catch, end  % the default fetch size is ridiculously small in many DBs
rs = stmt.executeQuery(sqlQueryStr);
 
% Get the column names and data-types from the ResultSet's meta-data
MetaData = rs.getMetaData;
numCols = MetaData.getColumnCount;
data = cell(0,numCols);  % initialize
for colIdx = numCols : -1 : 1
    ColumnNames{colIdx} = char(MetaData.getColumnLabel(colIdx));
    ColumnType{colIdx}  = char(MetaData.getColumnClassName(colIdx));  % http://docs.oracle.com/javase/7/docs/api/java/sql/Types.html
end
ColumnType = regexprep(ColumnType,'.*\.','');
 
% Get the data from the ResultSet into a Matlab cell array
rowIdx = 1;
while rs.next  % loop over all ResultSet rows (records)
    for colIdx = 1 : numCols  % loop over all columns in the row
        switch ColumnType{colIdx}
            case {'Float','Double'}
                data{rowIdx,colIdx} = rs.getDouble(colIdx);
            case {'Long','Integer','Short','BigDecimal'}
                data{rowIdx,colIdx} = double(rs.getDouble(colIdx));
            case 'Boolean'
                data{rowIdx,colIdx} = logical(rs.getBoolean(colIdx));
            otherwise %case {'String','Date','Time','Timestamp'}
                data{rowIdx,colIdx} = char(rs.getString(colIdx));
        end
    end
    rowIdx = rowIdx + 1;
end
 
% Close the connection and clear resources
try rs.close();   catch, end
try stmt.close(); catch, end
try dbConnObj.closeAllStatements(); catch, end
try dbConnObj.close(); catch, end  % comment this to keep the dbConnObj open and reuse it for subsequent queries

Naturally, in a real-world implementation you also need to handle database timeouts and various other errors, handle data-manipulation queries (not just SELECTs), etc.
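For example, a data-manipulation (INSERT/UPDATE/DELETE) statement can be sent over the same connection object using a JDBC PreparedStatement and its executeUpdate() method. Here is a minimal sketch - the table and column names are made up purely for illustration:

% Send a data-manipulation query via a PreparedStatement (hypothetical table/columns)
updateQueryStr = 'UPDATE my_table SET price=? WHERE id=?';
pstmt = dbConnObj.prepareStatement(updateQueryStr);
pstmt.setDouble(1, 99.9);                 % bind the 1st placeholder (price)
pstmt.setInt(2, 12345);                   % bind the 2nd placeholder (id)
numRowsAffected = pstmt.executeUpdate();  % returns the number of affected rows
try pstmt.close(); catch, end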

Anyway, this works well in general, but when you try to fetch a ResultSet that has many thousands of records you start to feel the pain: the SQL statement itself may execute quickly on the DB server (the time taken by the stmt.executeQuery call), yet the subsequent double-loop processing that copies the data from the Java ResultSet object into a Matlab cell array takes much longer.

In one of my recent projects, performance was of paramount importance, and the DB query speed of the code above was simply not good enough. You might think that this is because the data cell array is not pre-allocated, but that turns out to be incorrect: the speed remains nearly unaffected when you pre-allocate data properly. The main problem is Matlab’s non-negligible overhead in calling methods of Java objects. Since the JDBC interface only enables retrieving a single data item at a time (bulk retrieval is not possible), we have a double loop over all the data’s rows and columns, calling the appropriate Java getter method for each item based on its column’s type. The Java methods themselves are extremely efficient, but once Matlab’s invocation overhead is added for every single call, the total processing time becomes far longer.
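This overhead is easy to demonstrate even without a database: timing a trivial Java method called repeatedly from Matlab shows the fixed per-call cost that the fetch loop pays for every single data item. The following is only an illustrative sketch, not the actual fetch code:

% Illustrative sketch of Matlab-to-Java invocation overhead (not the actual fetch loop)
jStr = java.lang.String('abc');  % any small Java object will do
tic
for k = 1 : 100000
    len = jStr.length();  % a trivial Java method call, repeated 100K times
end
toc  % nearly all of the measured time is Matlab's per-call overhead, not length() itself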

So what can be done? As Andrew Janke explained in much detail, we basically need to push our double loop down into the Java level, so that Matlab receives whole arrays of primitive values that it can then process in a vectorized manner.

So let’s create a simple Java class to do this:

// Copyright (c) Yair Altman UndocumentedMatlab.com
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Types;
import java.util.Arrays;
 
public class JDBC_Fetch {
 
	public static int DEFAULT_MAX_ROWS = 100000;   // default cache size = 100K rows (if DB does not support non-forward-only ResultSets)
 
	public static Object[] getData(ResultSet rs) throws SQLException {
		try {
			if (rs.last()) {  // data is available
				int numRows = rs.getRow();    // row # of the last row
				rs.beforeFirst();             // get back to the top of the ResultSet
				return getData(rs, numRows);  // fetch the data
			} else {  // no data in the ResultSet
				return null;
			}
		} catch (Exception e) {
			return getData(rs, DEFAULT_MAX_ROWS);
		}
	}
 
	public static Object[] getData(ResultSet rs, int maxRows) throws SQLException {
		// Read column number and types from the ResultSet's meta-data
		ResultSetMetaData metaData = rs.getMetaData();
		int numCols = metaData.getColumnCount();
		int[] colTypes = new int[numCols+1];
		int numDoubleCols = 0;
		int numBooleanCols = 0;
		int numStringCols = 0;
		for (int colIdx = 1; colIdx <= numCols; colIdx++) {
			int colType = metaData.getColumnType(colIdx);
			switch (colType) {
				case Types.FLOAT:
				case Types.DOUBLE:
				case Types.REAL:
					colTypes[colIdx] = 1;  // double
					numDoubleCols++;
					break;
				case Types.DECIMAL:
				case Types.INTEGER:
				case Types.TINYINT:
				case Types.SMALLINT:
				case Types.BIGINT:
					colTypes[colIdx] = 1;  // double
					numDoubleCols++;
					break;
				case Types.BIT:
				case Types.BOOLEAN:
					colTypes[colIdx] = 2;  // boolean
					numBooleanCols++;
					break;
				default: // 'String','Date','Time','Timestamp',...
					colTypes[colIdx] = 3;  // string
					numStringCols++;
			}
		}
 
		// Loop over all ResultSet rows, reading the data into the 2D matrix caches
		int rowIdx = 0;
		double [][] dataCacheDouble  = new double [numDoubleCols] [maxRows];
		boolean[][] dataCacheBoolean = new boolean[numBooleanCols][maxRows];
		String [][] dataCacheString  = new String [numStringCols] [maxRows];
		while (rs.next() && rowIdx < maxRows) {
			int doubleColIdx = 0;
			int booleanColIdx = 0;
			int stringColIdx = 0;
			for (int colIdx = 1; colIdx <= numCols; colIdx++) {
				try {
					switch (colTypes[colIdx]) {
						case 1:  dataCacheDouble[doubleColIdx++][rowIdx]   = rs.getDouble(colIdx);   break;  // numeric
						case 2:  dataCacheBoolean[booleanColIdx++][rowIdx] = rs.getBoolean(colIdx);  break;  // boolean
						default: dataCacheString[stringColIdx++][rowIdx]   = rs.getString(colIdx);   break;  // string
					}
				} catch (Exception e) {
					System.out.println(e);
					System.out.println(" in row #" + rowIdx + ", col #" + colIdx);
				}
			}
			rowIdx++;
		}
 
		// Return only the data actually read from the ResultSet (trim any unused cache rows)
		int doubleColIdx = 0;
		int booleanColIdx = 0;
		int stringColIdx = 0;
		Object[] data = new Object[numCols];
		for (int colIdx = 1; colIdx <= numCols; colIdx++) {
			switch (colTypes[colIdx]) {
				case 1:   data[colIdx-1] = Arrays.copyOf(dataCacheDouble[doubleColIdx++],   rowIdx);  break;  // numeric
				case 2:   data[colIdx-1] = Arrays.copyOf(dataCacheBoolean[booleanColIdx++], rowIdx);  break;  // boolean
				default:  data[colIdx-1] = Arrays.copyOf(dataCacheString[stringColIdx++],   rowIdx);           // string
			}
		}
		return data;
	}
}
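
Before this class can be used, it needs to be compiled with a JDK that is compatible with Matlab’s JVM (check with version -java) and its class file placed on Matlab’s Java classpath. A minimal sketch, where the folder path is of course just an assumption:

% Compile the class once, outside Matlab (or via the system function):
%    javac JDBC_Fetch.java
% Then add the folder containing JDBC_Fetch.class to Matlab's dynamic Java classpath:
javaaddpath('/path/to/folder/containing/the/class');  % hypothetical folder path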

So now we have a JDBC_Fetch class that we can use in our Matlab code, replacing the slow double loop with a single call to JDBC_Fetch.getData(), followed by vectorized conversion into a Matlab cell array (matrix):

% Get the data from the ResultSet using the JDBC_Fetch wrapper
data = cell(JDBC_Fetch.getData(rs));
for colIdx = 1 : numCols
   switch ColumnType{colIdx}
      case {'Float','Double'}
          data{colIdx} = num2cell(data{colIdx});
      case {'Long','Integer','Short','BigDecimal'}
          data{colIdx} = num2cell(data{colIdx});
      case 'Boolean'
          data{colIdx} = num2cell(data{colIdx});
      otherwise %case {'String','Date','Time','Timestamp'}
          %data{colIdx} = cell(data{colIdx});  % no need to do anything here!
   end
end
data = [data{:}];

In my specific program the resulting speedup was 15x (this is not a typo: 15 times faster). My fetches are no longer limited by the Matlab post-processing, but rather by the DB’s processing of the SQL statement (where DB indexes, clustering, SQL tuning etc. come into play).

Additional speedups can be achieved by parsing dates at the Java level (rather than returning strings), as well as by several other tweaks in the Java and Matlab code (refer to Andrew Janke’s post for some ideas). But the main benefit (the proverbial 80% of the gain for 20% of the work) comes from pushing the main double processing loop down into the Java level, leaving Matlab with just a single Java call to JDBC_Fetch.
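As one simple illustration of the Matlab-side tweaks mentioned above, a column of timestamp strings returned by JDBC_Fetch can be converted into serial date numbers with a single vectorized datenum call, rather than string by string. This sketch assumes the strings come back as 'yyyy-mm-dd HH:MM:SS' - adjust the format specifier (and the column index) to whatever your driver actually returns:

% Hedged sketch: vectorized conversion of one timestamp-string column
dateColIdx = 3;  % hypothetical index of a Timestamp column in the resulting data cell matrix
dateNums = datenum(data(:,dateColIdx), 'yyyy-mm-dd HH:MM:SS');  % one call for the whole column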

Many additional ideas for speeding up database queries, and Matlab programs in general, can be found in my second book, Accelerating Matlab Performance.

If you’d like me to help you speed up your Matlab program, please email me (altmany at gmail), or fill out the query form on my consulting page.

