March 15 Update - Batch execution

Mar 16, 2014 at 8:24 AM
Edited Mar 16, 2014 at 8:39 AM
I have just pushed the code with one new and very interesting feature - batch execution. I'm quite excited and would like to describe it a bit, and explain why I think this capability is important and worth some discussion.
Batch execution concerns the way the framework saves multiple changes to the database. When business code completes updates to business entities and calls session.SaveChanges, VITA starts the update process. Before batched updates, it made a separate call to a stored procedure for each entity/record modified (inserted/updated/deleted). This resulted in many roundtrips from the app server to the database server.
With batch execution enabled, the process is different. The system builds a single SQL statement consisting of multiple calls to stored procedures, with parameters listed as literals. The calls are enclosed in a BEGIN TRANSACTION/COMMIT TRANSACTION block. The resulting multi-line SQL text is sent to the database server in one roundtrip. The result is a drastic - 5 to 10-fold - improvement in performance.
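To make the mechanics concrete, here is a rough sketch of the batch-assembly step - in Python rather than VITA's C#, with heavily simplified literal formatting. It illustrates the idea only; it is not VITA's actual code:

```python
# Sketch: collapse many stored-procedure calls into one multi-line SQL text
# wrapped in a transaction, so only a single roundtrip is needed.

def format_literal(value):
    """Render an argument value as a SQL literal (simplified)."""
    if value is None:
        return "NULL"
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"  # escape single quotes
    return str(value)

def build_batch(calls):
    """calls: list of (proc_name, args) pairs -> one batch SQL text."""
    lines = ["BEGIN TRANSACTION;"]
    for proc, args in calls:
        arg_list = ", ".join(format_literal(a) for a in args)
        lines.append("EXEC %s %s;" % (proc, arg_list))
    lines.append("COMMIT TRANSACTION;")
    return "\n".join(lines)

batch = build_batch([
    ("[books].[PublisherInsert]", ("pub-1", "MS Books")),
    ("[books].[PublisherInsert]", ("pub-2", "Kids Books")),
])
print(batch)
```

Instead of one roundtrip per record, the client sends a single command; the real framework also handles parameters, output values and provider-specific call syntax.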
To see batch execution in action, run both unit test projects with 'useBatchMode=true' in the app.config file. Batch mode is supported for all database servers except SQL CE. After test execution completes, open the log file (in the bin folder) and look at the logged SQL commands.
Here is an example batch produced by the VitaBooks sample application. This is the SampleDataGenerator code creating the initial data (books, authors, publishers) used by the unit tests:
-- BEGIN BATCH ---------------------------
EXEC [books].[PublisherInsert] '66c9e8ac-abdf-11e3-9e90-f0def1783701', 'MS Books';
EXEC [books].[PublisherInsert] '66c9e8ad-abdf-11e3-9e90-f0def1783701', 'Kids Books';
EXEC [books].[UserInsert] '66c9e8bb-abdf-11e3-9e90-f0def1783701', 'linda', 'linda', 4, True;
EXEC [books].[UserInsert] '66c9e8ba-abdf-11e3-9e90-f0def1783701', 'Diego', 'Diego', 1, False;
EXEC [books].[UserInsert] '66c9e8b9-abdf-11e3-9e90-f0def1783701', 'Dora', 'Dora', 1, True;
EXEC [books].[UserInsert] '66c9e8be-abdf-11e3-9e90-f0def1783701', 'jack', 'jack', 2, True;
EXEC [books].[UserInsert] '66c9e8bc-abdf-11e3-9e90-f0def1783701', 'charlie', 'charlie', 8, True;
EXEC [books].[UserInsert] '66c9e8bd-abdf-11e3-9e90-f0def1783701', 'john', 'john', 2, True;
EXEC [books].[AuthorInsert] '66c9e8b0-abdf-11e3-9e90-f0def1783701', 'Jim', 'Hacker', NULL, NULL;
EXEC [books].[AuthorInsert] '66c9e8af-abdf-11e3-9e90-f0def1783701', 'Jack', 'Pound', NULL, '66c9e8be-abdf-11e3-9e90-f0def1783701';
EXEC [books].[AuthorInsert] '66c9e8ae-abdf-11e3-9e90-f0def1783701', 'John', 'Sharp', @P0, '66c9e8bd-abdf-11e3-9e90-f0def1783701';
EXEC [books].[BookInsert] '66c9e8b2-abdf-11e3-9e90-f0def1783701', 'VB Programming', 'Expert programming in VB',  '2013-03-14T00:00:00.0000000-07:00', NULL, 0, 3, 25.0, '66c9e8ac-abdf-11e3-9e90-f0def1783701';
EXEC [books].[BookInsert] '66c9e8b5-abdf-11e3-9e90-f0def1783701', 'IronMan', 'IronMan comics.', NULL, NULL, 1, 1, 3, '66c9e8ad-abdf-11e3-9e90-f0def1783701';
EXEC [books].[BookInsert] '66c9e8b1-abdf-11e3-9e90-f0def1783701', 'c# Programming', 'Expert programming in c#', '2013-03-14T00:00:00.0000000-07:00', @P1, 0, 5, 20.0, '66c9e8ac-abdf-11e3-9e90-f0def1783701';
EXEC [books].[BookInsert] '66c9e8b3-abdf-11e3-9e90-f0def1783701', 'Three little pigs', 'Folk tale',  '2013-03-14T00:00:00.0000000-07:00', NULL, 2, 2, 10.0, '66c9e8ad-abdf-11e3-9e90-f0def1783701';
EXEC [books].[BookInsert] '66c9e8b4-abdf-11e3-9e90-f0def1783701', 'Windows Programming', 'Introduction to Windows Programming', '2003-03-14T00:00:00.0000000-08:00', NULL, 0, 2, 30.0, '66c9e8ac-abdf-11e3-9e90-f0def1783701';
EXEC [books].[BookAuthorInsert] '66c9e8b1-abdf-11e3-9e90-f0def1783701', '66c9e8af-abdf-11e3-9e90-f0def1783701';
EXEC [books].[BookAuthorInsert] '66c9e8b1-abdf-11e3-9e90-f0def1783701', '66c9e8ae-abdf-11e3-9e90-f0def1783701';
EXEC [books].[BookAuthorInsert] '66c9e8b4-abdf-11e3-9e90-f0def1783701', '66c9e8ae-abdf-11e3-9e90-f0def1783701';
EXEC [books].[BookAuthorInsert] '66c9e8b2-abdf-11e3-9e90-f0def1783701', '66c9e8af-abdf-11e3-9e90-f0def1783701';
EXEC [books].[CouponInsert] '66c9e8b6-abdf-11e3-9e90-f0def1783701', '2014-03-15T01:16:19.9744546Z', '2014-04-14T18:16:19.9744546-07:00', 'C1', 10, NULL;
EXEC [books].[CouponInsert] '66c9e8b8-abdf-11e3-9e90-f0def1783701', '2014-03-15T01:16:19.9744546Z', '2014-04-14T18:16:19.9744546-07:00', 'C3', 10, NULL;
EXEC [books].[CouponInsert] '66c9e8b7-abdf-11e3-9e90-f0def1783701', '2014-03-15T01:16:19.9744546Z', '2014-04-14T18:16:19.9744546-07:00', 'C2', 10, NULL;
 -- Parameters: @P0='  Lorem ipsum dolor ...  ...pteur sint occaecat ', @P1='Expert guide to prog...  ...rts.  Covers c# 4.0.' 
 -- Time 3 ms, [2014/03/14 18:16:19] 
-- END BATCH ---------------------------
This is a copy from the log file - you can grab it from _vitaBooks.log in the bin folder of the unit test project when you run it with 'useBatchMode=true' in the app.config file (it is set to true by default in the download zip). The actual batch SQL sent to the server starts with the line 'BEGIN TRANSACTION' and ends with 'COMMIT TRANSACTION'. All calls are inserts because we are creating new data, but batching works just as well with a full mix of insert, update, and delete commands.
The first thing to note is that the procedure call format is server-specific. The example shown is for MS SQL Server; for MySql and PostgreSql it will be different. Stored procedure call syntax is not covered by the SQL standard, unlike basic SQL syntax, so it varies from server to server. VITA adjusts the format automatically - so if you run the unit tests against different servers (using conditional defines on the project), you will see slightly different syntax for the batch calls in the log file. Batch mode is supported for MS SQL Server (2008 or 2012), MySql and PostgreSql. SQL Compact Edition does not allow multiple statements in one SQL command, so batch mode is off for this provider.
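As a rough illustration of per-server formatting, a framework can keep one call template per server family. The templates below are my assumptions about typical call syntax, not copied from VITA:

```python
# Sketch: pick a stored-procedure call template per server family.
# Templates are illustrative; VITA's actual output may differ in detail.

CALL_TEMPLATES = {
    "mssql":    "EXEC {proc} {args};",      # MS SQL Server
    "mysql":    "CALL {proc}({args});",     # MySql
    "postgres": "SELECT {proc}({args});",   # PostgreSql procs are functions
}

def format_proc_call(server, proc, args):
    return CALL_TEMPLATES[server].format(proc=proc, args=", ".join(args))

print(format_proc_call("mssql", "[books].[PublisherInsert]", ["'p1'", "'MS Books'"]))
print(format_proc_call("mysql", "PublisherInsert", ["'p1'", "'MS Books'"]))
```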
Another thing to note is the execution time for the batch - 3 milliseconds. Quite impressive for 20+ CRUD commands! Even on my weak laptop (compared to typical monster servers in production), the execution time is in the 100-microsecond range for a single CRUD operation. The improvement compared to non-batch mode is between 5 and 10-fold. I have the app code and database server running on the same machine, so it is reasonable to expect an even bigger improvement ratio in a production-like environment where the app server and database server are separated by a network, and the number of roundtrips has a much bigger impact.
Finally, the log lists two batch parameters: @P0 and @P1. These are used 'inline' in the SQL as arguments of stored procedure calls. Most of the argument values are embedded as literals, but an exception is made for 'too-long' values - binary or text values exceeding 100 characters. In the log file, parameter values are printed after the batch text, trimmed in the middle for better readability. Using parameters for long values is done partly for readability of the log files and partly to improve performance - to avoid to-string/from-string conversions of big chunks of data, especially binary data.
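The rule for long values can be sketched like this. The 100-character threshold comes from the description above; the function and names are hypothetical:

```python
# Sketch: inline short values as literals, hoist long text values into
# @Pn parameters that travel alongside the batch SQL text.

MAX_LITERAL_LENGTH = 100  # threshold mentioned in the post

def render_args(values):
    """Return (rendered argument list, parameter dictionary)."""
    params = {}
    rendered = []
    for value in values:
        if isinstance(value, str) and len(value) > MAX_LITERAL_LENGTH:
            name = "@P%d" % len(params)   # next free parameter name
            params[name] = value          # value is sent as a parameter
            rendered.append(name)
        elif value is None:
            rendered.append("NULL")
        elif isinstance(value, str):
            rendered.append("'" + value.replace("'", "''") + "'")
        else:
            rendered.append(str(value))
    return rendered, params

args, params = render_args(["Three little pigs", "x" * 200])
# args -> ["'Three little pigs'", "@P0"]; the long value sits in params["@P0"]
```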
I guess you already have a question in mind - what about identity values in inserts? Or timestamps - values generated in the database that should be returned and put into the entities? This is supported too, although not for all servers. For MS SQL Server, an even trickier scenario is supported - inserting multiple parent/child records linked by a foreign key targeting an identity-based primary key. The challenge here is that the values for the foreign key column in the child entity are not known at the time we build the batch - they are generated by the preceding insert of the parent record. VITA solves this by using parameters for such values.
The VITA unit test project contains a test, 'IdentityTests', that exercises this scenario - we have IPerson and ICar entities, both with identity primary keys. The ICar.Owner column is a foreign key pointing to a person. Here is a batch sent by the unit test code to the server, as seen in the log file:
-- BEGIN BATCH ---------------------------
EXEC [ident].[PersonInsert] @P0 OUTPUT, 'John S';
EXEC [ident].[CarInsert] @P1 OUTPUT, 'Beatle', @P0;
EXEC [ident].[CarInsert] @P2 OUTPUT, 'Explorer', @P0;
-- Parameters: @P0=69, @P1=137, @P2=138 
 -- Time 0 ms, [2014/03/14 18:16:05] 
-- END BATCH ---------------------------
For all three primary key values the system uses OUTPUT parameters. It also detected that the Owner_id values for the cars should come from another parameter (@P0) - and it puts a reference to this parameter in the proper places. Not only is the newly generated Person id passed between calls, but all three generated identity values are returned to the client and put into the corresponding entities after the database call completes. The log is written after the batch is executed, so the parameter values reported in the log are the values returned from the database call.
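The parameter plumbing behind that batch can be sketched like this - a simplified model in Python, with made-up class and function names, purely for illustration:

```python
# Sketch: each identity primary key becomes an OUTPUT parameter; a child
# row's foreign key that targets a not-yet-known identity references the
# parent's parameter instead of a literal value.

class IdentityRef:
    """Placeholder for an identity value produced by an earlier insert."""
    def __init__(self, name):
        self.name = name    # parameter name, e.g. "@P0"
        self.value = None   # filled in after the batch executes

def build_identity_batch(inserts):
    """inserts: list of (proc, pk_ref, args); args may contain IdentityRef."""
    lines = []
    for proc, pk_ref, args in inserts:
        rendered = [pk_ref.name + " OUTPUT"]
        for arg in args:
            rendered.append(arg.name if isinstance(arg, IdentityRef) else "'%s'" % arg)
        lines.append("EXEC %s %s;" % (proc, ", ".join(rendered)))
    return "\n".join(lines)

person_id = IdentityRef("@P0")
car1_id, car2_id = IdentityRef("@P1"), IdentityRef("@P2")
sql = build_identity_batch([
    ("[ident].[PersonInsert]", person_id, ["John S"]),
    ("[ident].[CarInsert]", car1_id, ["Beatle", person_id]),
    ("[ident].[CarInsert]", car2_id, ["Explorer", person_id]),
])
print(sql)
```

This reproduces the shape of the logged batch; in the real framework the provider then copies the returned parameter values back into the entities.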

(continued in the next post)
Mar 16, 2014 at 8:30 AM

It all works very well for MS SQL. For MySql and Postgres, things are not so smooth. MySql does not allow using command parameters as arguments for OUT procedure parameters. There is a workaround (reassign to a local variable, call, assign back); it might work and I may try it in the future, but not in this version. Postgres has really 'strange' rules for calling stored procedures with output parameters (and in general quite 'tricky' rules for calling stored procedures from SQL, complicated by name overloading). So I gave up on this scenario in Postgres. As a side note, my initial excitement about Postgres is quickly fading; it looks like these guys wanted to build something clever and sophisticated, with an object-oriented flavor, name overloading and other stuff, but what came out is quite a pain to use, and sometimes simply unusable - as in this case with OUT parameters.
Note that the inability of the MySql and Postgres drivers to handle identities in batch mode does not pose any complications for the client code - VITA automatically detects whether a SaveChanges call involves output parameters and, if it does, switches to the regular, non-batch way of executing commands.

Some general observations
Batch execution is enabled only if the application uses stored procedures - which makes sense, as bundling together multiple plain SQL statements does not seem to be a good idea. For one thing, multiple SQL statements in one command would confuse the SQL compiler and query optimizer, which would not be able to reuse previously cached plans. This brings back an age-old discussion - are stored procedures really needed or not? Microsoft has lately been downplaying the use of stored procedures, mainly because their own tools like Entity Framework do not use them as the primary way to do CRUD (you can use them with some extra work, but it is not something that comes out of the box). MS folks argue that with the latest improvements in the SQL Server query optimizer and plan caching, the performance of dynamic SQL is the same as that of equivalent stored procedures. However, the batch execution facility described here provides a strong case for using CRUD stored procedures - bundling multiple proc calls into a batch provides a multi-fold boost in performance.
I also think that, in addition to the pure performance gains as seen by the client - reduced overall time for update operations - batch mode helps ease the load on heavily hit production servers. Admittedly I'm not an expert on the internals of database servers, so this is just a guess, but less 'chatty' communication with the clients should help the server better manage its resources - it receives several times fewer hits, and each hit contains all the operations for the client's overall transaction. Just common sense, although I'm not sure about the size of the positive impact.
How important is this improvement anyway, in the context of modern applications? I consider it quite important, especially for business and enterprise applications. Lightweight web apps that mostly read and occasionally write a record or two might not see big benefits from batch execution, but line-of-business apps have many heavy-write, long-running processes that can benefit a lot. Applications that deal with financial/accounting data are a good example. Accounting rules for business are extremely complex and multi-level, involving many small pieces of information, each going into its proper place. A seemingly simple transaction like the purchase of a chair or a computer may result in hundreds of entries into account balances, tracking documents, various ledgers, funding/expense records, inventory tracks, deferred tax records, etc. The transaction literally explodes into a multitude of micro updates/inserts that must go to the central database. Your monthly home utility bill may show just a few numbers, but believe me, hundreds of records were posted to dozens of tables when the bill was generated. In a system I saw not long ago, generating such a bill took more than 1 second per bill. That is too much: with a customer base of 50,000 households, the bill generation takes many hours, which presents a huge problem. Take my word for it - this system is not a single bad case; the majority of 'modern' business systems are just like that, no better. 1 second per document seems like a huge amount of time for modern hardware, but that's how it worked - no complex math, most of the time spent in hundreds of read/write calls to the database server.
It is kind of fascinating to watch the big names in the database business - Oracle, IBM, Microsoft - fight each other over TPC benchmarks with crazy numbers like 100K+ transactions per second, and at the same time see the sad reality - 1 second per utility bill. Our problem is that we fail to translate these crazy capabilities of modern hardware and software into decent performance of the actual applications. At a deeper level, the problem is the way we write the applications - it is not possible to improve (without a full rewrite) the mentioned bill-processing software so that it can use the database server efficiently, for example by using batch mode or cached data. Many thorny data access operations are spread throughout the business code, and changing anything about data access means a complete rewrite. As an example, most older systems use identity columns to generate primary keys, so when they need to insert multiple parent/child records, the code needs an explicit insert call for the parent, and then uses the generated id as the foreign key value in the child entities. So we are stuck with micro-managing the database calls in the business code.
VITA takes a different approach, and this is my attempt to show how things should be done. All database access details are sealed behind a wall and isolated from the business code. Even if you use identities, you can create a whole web of parent/child entities and then submit them all to the database - with VITA, you use object references, and VITA propagates the id values automatically behind the scenes. To enable data caching for reads and batched updates, you don't need to make any changes in your business code - just set a few flags in the initial setup, and things start happening automatically. With data caching, query caching and batch execution we are getting closer to the point where we can say we really use the amazing capabilities of modern database servers to the fullest extent possible. Add modular construction using entity modules and automatic database schema handling - and you get what you need to close the gap. A few important pieces are still missing - but they deserve a separate post.

Until then - happy coding

Truly yours