I went to the LondonR meetup yesterday sponsored by Mango Solutions and also the workshop on using Rcpp - c++ functions in R.
A great introduction to Rcpp, thanks due to Hadley Wickham. Thanks also to Dirk Eddelbuettel and Romain Francois for the package. Romain was also there helping out.
The takeaway is that it is astonishingly easy to create and use c++ functions within R using Rstudio passing in and out vectors of strings or numbers, so much so that I felt driven to post. If performance in some bottleneck in R is a problem and you've vectorised your code etc. there's another chunk of improvement (4 to 20 times depending on some samples we saw) available for very little effort. It's impressive what Dirk and Romain have done.
The three talks after were worth listening to, Andrie de Vries on an overview of using survival analysis to pick up what influenced a marketing success, Richard Pugh on analytics applications including R and Hadley Wickham giving some thoughts where he's trying to take some of packages.
I was also interested to see the broad range of users there for the meetup, which reflects how people are finding R useful in so many walks of life.
Wednesday, 17 July 2013
Wednesday, 3 July 2013
Excel Addin performance
I was working on improving the performance of a spreadsheet and wondered what the performance of different ways of calling functions was.
----------
Findings
Note for anything critical and like all performance testing one would need to repeat the testing with the final configurations etc.. But there are some bits of guidance.
The quickest way to do a simple calculation is in a worksheet cell with worksheet functions.
Using a vanilla XLL would take around twice as long, a COM addin something like 18 times as long and a VBA UDF something like 13 times as long.
Using the ExcelDNA way of creating a dot Net XLL is astonishingly painless, I was really impressed how quick and easy it was. However on the calling it was 4 to 5 times slower than a simple native Delphi XLL - but still quicker than a vba udf or a delphi com automation add in. Unless the function is being called a large number of times, this is a great way of creating XLL's
But if you are going to do large numbers of calculations and where the time of calling is important, it's better to use an native xll rather than writing it in VBA or writing a com automation addin.
Arrays only were a little slower when not referring with a name and looped in VBA, otherwise it made no difference - given there are lots of articles saying to avoid them on performance, maybe that's something that's improved in Excel 2010
Calculating the sum of two columns in a table is slower than calculating outside the table, arguing for calculations to be kept out of a table.
using sum() rather than + was a little quicker
using a name for a range is more readable and made no noticeable difference to the calculation
--------------
Some summary timings
I was trying get some idea of the cost of calling a function, so the function was simply to add the value of one cell to another.
This was on 32 bit Excel 2010, 4 core older xeon desktop, 16Gb ram
I used the range calculation code from http://msdn.microsoft.com/en-us/library/ff700515(v=office.14).aspx
set up a spreadsheet in its own Excel instance and created 100,000 rows of two columns of random numbers, which I then copied and pasted as values.
I then timed the calculation for these as shown below, using different ways of calculating in the worksheet, a vba user defined function and then a com add in (delphi and add in express) and an xll add in (delphi also a version with Visual Studio 2010 with ExcelDNA for a comparison), both looping through the recalculation of 10,000 rows 100 times and also one off calculations of 100,000 rows.
Although I got some variation and, there was enough consistency to make some rough conclusions which I've noted above
summarised results below, timings in microseconds
----------
Findings
Note for anything critical and like all performance testing one would need to repeat the testing with the final configurations etc.. But there are some bits of guidance.
The quickest way to do a simple calculation is in a worksheet cell with worksheet functions.
Using a vanilla XLL would take around twice as long, a COM addin something like 18 times as long and a VBA UDF something like 13 times as long.
Using the ExcelDNA way of creating a dot Net XLL is astonishingly painless, I was really impressed how quick and easy it was. However on the calling it was 4 to 5 times slower than a simple native Delphi XLL - but still quicker than a vba udf or a delphi com automation add in. Unless the function is being called a large number of times, this is a great way of creating XLL's
But if you are going to do large numbers of calculations and where the time of calling is important, it's better to use an native xll rather than writing it in VBA or writing a com automation addin.
Arrays only were a little slower when not referring with a name and looped in VBA, otherwise it made no difference - given there are lots of articles saying to avoid them on performance, maybe that's something that's improved in Excel 2010
Calculating the sum of two columns in a table is slower than calculating outside the table, arguing for calculations to be kept out of a table.
using sum() rather than + was a little quicker
using a name for a range is more readable and made no noticeable difference to the calculation
--------------
Some summary timings
I was trying get some idea of the cost of calling a function, so the function was simply to add the value of one cell to another.
This was on 32 bit Excel 2010, 4 core older xeon desktop, 16Gb ram
I used the range calculation code from http://msdn.microsoft.com/en-us/library/ff700515(v=office.14).aspx
set up a spreadsheet in its own Excel instance and created 100,000 rows of two columns of random numbers, which I then copied and pasted as values.
I then timed the calculation for these as shown below, using different ways of calculating in the worksheet, a vba user defined function and then a com add in (delphi and add in express) and an xll add in (delphi also a version with Visual Studio 2010 with ExcelDNA for a comparison), both looping through the recalculation of 10,000 rows 100 times and also one off calculations of 100,000 rows.
Although I got some variation and, there was enough consistency to make some rough conclusions which I've noted above
summarised results below, timings in microseconds
| Source | Formula | As Range | loop100 |
| us | us | ||
| =B32+A32 | 0.8 | 1.2 | |
| =addingTest!$A32 + addingTest!$B32 | 0.7 | 1.0 | |
| =random1 + random2 | 0.8 | 1.1 | |
| array | {=A32:A100031 +B32:B100031} | 0.7 | 1.7 |
| array | {=random1 + random2} | 0.8 | 1.0 |
| vba udf | =sumolevar(A32,B32) | 12.9 | 16.2 |
| vba udf | =sumolevarDec(A32,B32) | 12.7 | 16.1 |
| =addingTest!$A32+addingTest!$B32 | 1.3 | ||
| vba udf | =sumolevar(A32,B32) | 14.2 | |
| vba udf | =sumolevarDec(A32,B32) | 13.1 | |
| com Addin | =AddCellToCell(A32,B32) | 18.6 | 18.9 |
| =SUM(addingTest!$A32:$B32) | 0.9 | ||
| =SUM(random1 + random2) | 0.9 | ||
| xll | =AddAtoB($A32,$B32) | 2.4 | 2.9 |
| =SUM(A32,B32) | 1.0 | 1.0 | |
| =SUM(A32+B32) | 0.9 | ||
| =SUM(A32:B32) | 1.0 | ||
| in Table | =[@random1]+[@random2] | 1.6 | |
| =A32 + B32 | 0.7 | ||
| =A32:A100031 +B32:B100031 | 0.8 | ||
| vba udf | =sumolevar(Table1[@random1],Table1[@random2]) | 15.3 | |
| vba udf | =sumolevarDec(Table1[@random1],Table1[@random2]) | 15.2 | |
| xll | =AddAtoB(Table1[@random1],Table1[@random2]) | 15.2 | |
| ExcelDNA xll | =AddTwoCells(A32,B32) | 11.6 |
Thursday, 28 February 2013
Excel addin to and from Redis Implementation
I've completed a working version of the Excel add-ins, both the RTD receive Excel add in and the publish addin to connect Excel to Redis now working.
That enables any excel files to share a value, and also pick up the latest value if they can connect to the
Redis server. The point of this for me was that they'd monitor some control values from Streambase,
which are also posted to Redis.
Takeaways - Redis is really nice and easy to access - most of my testing was via Ruby and the Redis library which are very simple to use. Add in Express hides a lot of the idiosyncrasies of writing Excel add-ins, in this instance with Delphi. The OmniThead libraries are great, I still find it relatively easy to get things working with Delphi and Streambase makes it pretty simple to write event servers. Again another nice Redis library, Jedis, makes it a pleasure.
I used the OmniThread libraries to fire off the thread and their non locking queues
I'm using the Add In express vcl libraries which are good and their support is excellent.
Running on my four non hyper threaded cores (xeon 5150 2.66 GHz) machine
it seems happy to send and receive over 4000 updates a second - given Microsoft's stats there's clearly some scope for improvement, but that may be enough for the current project.
I've put my code on GitHub and a compiled versions on Mega
Further work could include some auditing (who sent the updates), sample handling of other data structures,
more robust error handling and notifications (keys and values with spaces in them don't really work)
I've also forked Nils Achterholt's delphi-redis adding a couple of methods for the publication and subscription
That enables any excel files to share a value, and also pick up the latest value if they can connect to the
Redis server. The point of this for me was that they'd monitor some control values from Streambase,
which are also posted to Redis.
Takeaways - Redis is really nice and easy to access - most of my testing was via Ruby and the Redis library which are very simple to use. Add in Express hides a lot of the idiosyncrasies of writing Excel add-ins, in this instance with Delphi. The OmniThead libraries are great, I still find it relatively easy to get things working with Delphi and Streambase makes it pretty simple to write event servers. Again another nice Redis library, Jedis, makes it a pleasure.
I used the OmniThread libraries to fire off the thread and their non locking queues
I'm using the Add In express vcl libraries which are good and their support is excellent.
Running on my four non hyper threaded cores (xeon 5150 2.66 GHz) machine
it seems happy to send and receive over 4000 updates a second - given Microsoft's stats there's clearly some scope for improvement, but that may be enough for the current project.
I've put my code on GitHub and a compiled versions on Mega
Further work could include some auditing (who sent the updates), sample handling of other data structures,
more robust error handling and notifications (keys and values with spaces in them don't really work)
I've also forked Nils Achterholt's delphi-redis adding a couple of methods for the publication and subscription
Connecting Streambase to Redis
I've been working on writing my first input and output adapters to Redis from Streambase
The standard Java library seems to be Jedis and there is a good example of use of Jedis for the pub sub
approach. It's pretty easy to use.
I've only got a windows server to put the Redis server on, but following Microsoft's OpenTech effort
I go a Redis 2.4 server up and running fairly easily
The behaviour I want is a cache of the last value.
e.g something publishes values to Redis
A client connects, it can get the last value published and then any updates
An approach is each publish event in a Redis pipeline sets a key's value and also publishes to a channel.
A client on connecting subscribes to the channel and picks up the key's value, thereby getting any updates.
The subscription for Jedis is a blocking subscription, but it's not clear how safely to add further subscriptions to the same connection One approach is the first subscription is to a control channel, all further requests are sent via Redis to the subscription and in the OnMessage routine you had further subscriptions
This seems elegant though a bit wasteful of trips etc. (i.e. you have to have the check for a control message in the OnMessage handler, and the addition of a subscription requires a trip to Redis and back) But as one subscribes and unsubscribes rarely it's not a problem.
For the initial release I've just pSubscribed to the * pattern to pick up all the messages - I can filter them later in Streambase. The sample input and output adapter is pushing happily 4900 messages and sets a second even using the ide which is probably more than enough for what I'm trying to do, even without any clumping of the messages to pipeline them or send them with protocol buffers.
I'll submit it to the Streambase component exchange, but it's also on github
I've also written Excel addins to publish and receive - the use case here was to have cached monitoring centrally of Streambase - the current Streambase add in for excel is in the process of being rewritten and you'd still need to implement the caching somehow.
The standard Java library seems to be Jedis and there is a good example of use of Jedis for the pub sub
approach. It's pretty easy to use.
I've only got a windows server to put the Redis server on, but following Microsoft's OpenTech effort
I go a Redis 2.4 server up and running fairly easily
The behaviour I want is a cache of the last value.
e.g something publishes values to Redis
A client connects, it can get the last value published and then any updates
An approach is each publish event in a Redis pipeline sets a key's value and also publishes to a channel.
A client on connecting subscribes to the channel and picks up the key's value, thereby getting any updates.
The subscription for Jedis is a blocking subscription, but it's not clear how safely to add further subscriptions to the same connection One approach is the first subscription is to a control channel, all further requests are sent via Redis to the subscription and in the OnMessage routine you had further subscriptions
This seems elegant though a bit wasteful of trips etc. (i.e. you have to have the check for a control message in the OnMessage handler, and the addition of a subscription requires a trip to Redis and back) But as one subscribes and unsubscribes rarely it's not a problem.
For the initial release I've just pSubscribed to the * pattern to pick up all the messages - I can filter them later in Streambase. The sample input and output adapter is pushing happily 4900 messages and sets a second even using the ide which is probably more than enough for what I'm trying to do, even without any clumping of the messages to pipeline them or send them with protocol buffers.
I'll submit it to the Streambase component exchange, but it's also on github
I've also written Excel addins to publish and receive - the use case here was to have cached monitoring centrally of Streambase - the current Streambase add in for excel is in the process of being rewritten and you'd still need to implement the caching somehow.
Friday, 15 February 2013
Delphi object references
I was wondering whether a selection of objects in a program were in fact the same object as they were supposed to be.
As far as I understand an object points to the actual instance of the object, i.e. the value of an object is a reference to the memory where the actual object is.
A simple console app to clarify this is below, giving the results
Object1 >7036359053610256
Object1 loc >4341464
Object2 >7036359021690880
Object2 loc>4341468
set Object2 to Object1
Object2 >7036359053610256
Object2 loc>4341468
Object1 and Object2 have two different addresses. 4341464 and 4341468
However on setting object1 to object2 the contents at those memory addresses is the same
7036359053610256 which points to the exact same instance of an object.
As far as I understand an object points to the actual instance of the object, i.e. the value of an object is a reference to the memory where the actual object is.
A simple console app to clarify this is below, giving the results
Object1 >7036359053610256
Object1 loc >4341464
Object2 >7036359021690880
Object2 loc>4341468
set Object2 to Object1
Object2 >7036359053610256
Object2 loc>4341468
Object1 and Object2 have two different addresses. 4341464 and 4341468
However on setting object1 to object2 the contents at those memory addresses is the same
7036359053610256 which points to the exact same instance of an object.
program testObjRef;
{$APPTYPE CONSOLE}
{$R *.res}
uses
System.SysUtils,windows;
var
object1 : TObject;
object2 : TObject;
function GetObjectReference(thisObject : TObject) : string;
var
p: Pointer;
begin
p := @thisObject;
Result := intToStr(uint64(p^));
end;
begin
try
{ TODO -oUser -cConsole Main : Insert code here }
object1 := TObject.Create;
writeln('Object1 >' + GetObjectReference(object1));
writeln('Object1 loc >' + IntToStr(uint64(@object1)));
writeln('Object2 >' + GetObjectReference(object2));
writeln('Object2 loc>' + IntToStr(uint64(@object2)));
writeln('set Object2 to Object1');
object2 := object1;
writeln('Object2 >' + GetObjectReference(object2));
writeln('Object2 loc>' + IntToStr(uint64(@object2)));
except
on E: Exception do
Writeln(E.ClassName, ': ', E.Message);
end;
end.
Thursday, 10 January 2013
Excel Add-in to and from Redis
I was looking for a cache for messages to and from Streambase which could then be picked up from any number of excel addins
Redis seemed a good candidate for the key value cache - it's got reasonable traction in the community and performance.
The idea of the client is that it would pick up the initial value from Redis's get and subscribe to a channel with the same name for updates, similarly on publishing a value it would be set in the keyvalue pair and published in the same named channel.
I'm using the Excel add in in Delphi, which should be the easiest way of writing a native code add-in when one uses the AddIn Express libraries to avoid having to handle COM idiosyncracies (and the Addin Express support are really helpful)
Microsoft says on modest hardware one should be able to update 20,000 topics 3 times a second
http://code.google.com/p/delphi-redis/
https://github.com/ik5/redis_client.fpc
Kenny Kerr has written well and extensively on RTD addins
Alex Zhang has a nice overview and Microsoft support has the odd article worth reading
Essentially a series of strings in Excel in the RTD formula constitutes a topic that the RTD server must supply a value for.
The interchange between Excel and the RTD server can be grouped into four areas with corresponding methods in the RTDServer interface
ServerStart and ServerTerminate
starting and stopping the RTD server
ConnectData and DisconnectData
notifying the RTD server of new desired topics or ones no longer wanted
Heartbeat called by Excel to check the RTDServer is alive (1 being yes it is)
RefreshData, UpdateNotify
The RTD server lets Excel know that there is updated data via UpdateNotify. It doesn't say what data is updated.
Every throttleInterval seconds (set as Application.RTD.ThrottleInterval) in Excel, Excel calls RefreshData in the RTDServer and expects two pieces of information - the number of topics updated, and a 2 dimensional Variant array where each row is and array of the topicID and the new value for each updated value.
Note the TopicID is an integer assigned to the list of strings by Excel and given in the ConnectData call by Excel.
As the UpdateNotify doesn't say what is updated, there is no point in calling it more than once between each throttleInterval; one reasonable approach is just to stick the call on a timer and check then if there any updates (it needs to be called from the same threading appartment as the RTDServer is created in.
having RefreshData (which is called by the main Excel thread) do as little as possible is important for performance, and also on opening a spreadsheet with large numbers of topics, connectData should do as little as possible (maybe pushed the topics on a thread safe queue to be handled by the threads working with Redis)
Something else to be careful of is the handling of threads the updateNotify must be called from the Com appartment which typically means the thread that called the constructor for the RTDServer.
Redis seemed a good candidate for the key value cache - it's got reasonable traction in the community and performance.
The idea of the client is that it would pick up the initial value from Redis's get and subscribe to a channel with the same name for updates, similarly on publishing a value it would be set in the keyvalue pair and published in the same named channel.
I'm using the Excel add in in Delphi, which should be the easiest way of writing a native code add-in when one uses the AddIn Express libraries to avoid having to handle COM idiosyncracies (and the Addin Express support are really helpful)
Microsoft says on modest hardware one should be able to update 20,000 topics 3 times a second
http://code.google.com/p/delphi-redis/
https://github.com/ik5/redis_client.fpc
Kenny Kerr has written well and extensively on RTD addins
Alex Zhang has a nice overview and Microsoft support has the odd article worth reading
Essentially a series of strings in Excel in the RTD formula constitutes a topic that the RTD server must supply a value for.
The interchange between Excel and the RTD server can be grouped into four areas with corresponding methods in the RTDServer interface
ServerStart and ServerTerminate
starting and stopping the RTD server
ConnectData and DisconnectData
notifying the RTD server of new desired topics or ones no longer wanted
Heartbeat called by Excel to check the RTDServer is alive (1 being yes it is)
RefreshData, UpdateNotify
The RTD server lets Excel know that there is updated data via UpdateNotify. It doesn't say what data is updated.
Every throttleInterval seconds (set as Application.RTD.ThrottleInterval) in Excel, Excel calls RefreshData in the RTDServer and expects two pieces of information - the number of topics updated, and a 2 dimensional Variant array where each row is and array of the topicID and the new value for each updated value.
Note the TopicID is an integer assigned to the list of strings by Excel and given in the ConnectData call by Excel.
As the UpdateNotify doesn't say what is updated, there is no point in calling it more than once between each throttleInterval; one reasonable approach is just to stick the call on a timer and check then if there any updates (it needs to be called from the same threading appartment as the RTDServer is created in.
having RefreshData (which is called by the main Excel thread) do as little as possible is important for performance, and also on opening a spreadsheet with large numbers of topics, connectData should do as little as possible (maybe pushed the topics on a thread safe queue to be handled by the threads working with Redis)
Something else to be careful of is the handling of threads the updateNotify must be called from the Com appartment which typically means the thread that called the constructor for the RTDServer.
Subscribe to:
Posts (Atom)