There is not enough documentation on configuring the AppFabric Event Collection service. The best doc I can find is this MSDN page:
We would like fine-grained control over when workflow tracking events are persisted. However, the document does not explain exactly how events are scheduled to be persisted to the monitoring data store. The main configurable settings are the “eventBufferSize” and “maxWriteDelay” attributes in the root web.config:
<collector name="" session="0">
<settings retryCount="10" eventBufferSize="10000" retryWait="00:00:15" maxWriteDelay="00:00:05" aggregationEnabled="true" />
where the attributes are described as follows:
eventBufferSize: Maximum number of events the collector buffers before writing them to the store
maxWriteDelay: If no event has arrived in this time period then events are written to the store. The collector may choose to write events even if events have arrived during this time period.
However, it turns out the document has a bug: the “maxWriteDelay” attribute is not supported in AppFabric 1.1. The equivalent in AppFabric 1.1 is the “samplingInterval” attribute.
By setting these values in the root web.config we can control some of the behaviour, but we still can’t guarantee when the events will be written to the store. For example, if we set eventBufferSize to 100 (the minimum value allowed) and samplingInterval to 00:00:05 (5 seconds, the minimum interval allowed), it still takes up to 1 minute for some events to be written to the AppFabric monitoring DB.
So I decided to find out the exact logic behind this by decompiling the AppFabric code. Using a free .NET decompiler, I was able to read the source code, and here is a summary of how the event collector service determines when to write to the monitoring store:
- The following parameters are defined in the root web.config:
- samplingInterval: minimum 5, maximum 60, default 5 (seconds)
- eventBufferSize: minimum 100, maximum 32767, default 10000
- maxBuffers: minimum 3, maximum 100, default 5
- The value of “schemaSamplingInterval” is the maximum value of samplingInterval, taken from the attribute’s validation parameter in ApplicationServers_schema.xml under windows\system32\inetsrv\config\schemas; by default it is 60. These are the schema definitions for the attributes above:
<attribute name="retryCount" type="int" defaultValue="5" validationType="integerRange" validationParameter="0,100" />
<attribute name="eventBufferSize" type="int" defaultValue="10000" validationType="integerRange" validationParameter="100,32767" />
<attribute name="maxBuffers" type="int" defaultValue="5" validationType="integerRange" validationParameter="3,100" />
<attribute name="retryWait" type="timeSpan" defaultValue="00:00:15" validationType="timeSpanRange" validationParameter="10,120,1" />
<attribute name="samplingInterval" type="timeSpan" defaultValue="00:00:05" validationType="timeSpanRange" validationParameter="5,60,1" />
<attribute name="aggregationEnabled" required="false" type="bool" defaultValue="true" />
- Workflow tracking events are held in an in-memory buffer by the event collector before they are written to the monitoring store. Either of the following 2 conditions triggers a write to the data store (flushing the buffer):
- When a new event is added and the buffer becomes full: the events in the buffer are flushed to the store and the current buffer is released into an available buffer pool. A new buffer is then allocated from the available buffer pool, or created if the maxBuffers threshold has not been reached.
- A timer job runs every samplingInterval seconds. Each time the timer fires, it flushes and releases the buffer if both of the following conditions are met:
(1) there are 1 or more events in the buffer
(2) either of the following is true:
a. no new events came into the buffer since the last time the timer job was invoked
b. more than “schemaSamplingInterval” has passed since the buffer was last flushed (this is actually tracked by comparing the number of times the timer job has been invoked since the last flush to schemaSamplingInterval / samplingInterval, so it is not exactly the value of schemaSamplingInterval)
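To make the two triggers easier to follow, here is a small Python sketch of the rules as I have described them. It is a toy model, not the decompiled AppFabric code: the class name, method names and fields are all made up for illustration, the threshold uses integer division, and the tick counter keeps counting while the buffer is empty (both are my assumptions).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CollectorSketch:
    """Toy model of the flush rules described in this post (hypothetical names)."""
    event_buffer_size: int = 100         # eventBufferSize
    sampling_interval: int = 5           # samplingInterval, in seconds
    schema_sampling_interval: int = 60   # max of samplingInterval in the schema
    buffer: List[str] = field(default_factory=list)
    flushed: List[List[str]] = field(default_factory=list)  # batches "written" to the store
    events_since_last_tick: int = 0
    ticks_since_flush: int = 0

    def _flush(self) -> None:
        # Stand-in for writing the buffer to the monitoring store.
        self.flushed.append(self.buffer)
        self.buffer = []
        self.ticks_since_flush = 0

    def add_event(self, event: str) -> None:
        # Trigger 1: adding an event fills the buffer.
        self.buffer.append(event)
        self.events_since_last_tick += 1
        if len(self.buffer) >= self.event_buffer_size:
            self._flush()

    def on_timer_tick(self) -> None:
        # Trigger 2: called once every sampling_interval seconds.
        self.ticks_since_flush += 1
        idle = self.events_since_last_tick == 0
        # Assumption: the threshold is an integer number of ticks.
        max_ticks = self.schema_sampling_interval // self.sampling_interval
        overdue = self.ticks_since_flush >= max_ticks
        if self.buffer and (idle or overdue):
            self._flush()
        self.events_since_last_tick = 0


collector = CollectorSketch()
collector.add_event("e1")
collector.on_timer_tick()   # an event arrived during this interval: no flush
collector.on_timer_tick()   # no new events since the previous tick: flush
print(collector.flushed)
```

With a steady stream of events (one or more per tick), the idle rule never fires in this model, so the buffer is only flushed when it fills up or when the tick counter reaches schemaSamplingInterval / samplingInterval.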
For example, if eventBufferSize is 100, samplingInterval is 00:00:05, and schemaSamplingInterval is the default 60, then events will be persisted to the monitoring store when the number of events in the buffer reaches 100. They will also be persisted if no new events arrive between two 5-second timer ticks (the period with no new events before the buffer is flushed is at least 5 seconds and at most about 10 seconds), or if the buffer has not been flushed for 60 seconds.
So depending on how fast new events are inserted into the buffers, it can take anywhere between 0 and 60 seconds for an event to be flushed to the monitoring store.
This leads to a potential way of better controlling how often the events are persisted: in addition to the values in the root web.config, we can also change values like schemaSamplingInterval to affect the schedule, and from the analysis above we can work out the relationships. For example, if we want events written to the monitoring store more frequently, we can set the validation range for samplingInterval to “2,2,1” in ApplicationServer_schema.xml, which means the min and max values for the “samplingInterval” attribute are both 2, and also set samplingInterval to 00:00:02 in the root web.config.
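The relationship between the two values can be worked out with a few lines of arithmetic. The Python helper below is only a rough sketch based on my reading of the timer rules: the function name and the returned keys are hypothetical, the integer division is an assumption, and the buffer-full trigger is ignored because it depends on the event rate.

```python
def flush_bounds(sampling_interval: int, schema_sampling_interval: int) -> dict:
    """Rough timing bounds, in seconds, implied by the timer rules in this post."""
    # Assumption: the tick threshold is an integer division, with at least one tick.
    ticks_to_force = max(1, schema_sampling_interval // sampling_interval)
    forced_flush_after = ticks_to_force * sampling_interval
    # An idle buffer is flushed on the first tick that sees no new events,
    # which is between one and two sampling intervals after the last event,
    # unless the forced flush comes first.
    idle_flush_within = min(2 * sampling_interval, forced_flush_after)
    return {
        "ticks_to_force": ticks_to_force,
        "forced_flush_after": forced_flush_after,
        "idle_flush_within": idle_flush_within,
    }


default_settings = flush_bounds(5, 60)   # samplingInterval 5, schema max 60
tightened = flush_bounds(2, 2)           # validation range "2,2,1", samplingInterval 2
print(default_settings)
print(tightened)
```

Under these assumptions the default settings force a flush after 12 ticks (60 seconds), while the tightened settings force one on every 2-second tick as long as the buffer is not empty.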
Before ending this post, I want to talk a little about another piece that we don’t want to miss. When the Event Collection Service flushes the buffer and writes to the monitoring DB, it writes to a staging table in the monitoring DB. On SQL Server there is a SQL Agent job that is executed every 10 seconds (10 seconds is the minimum interval for any SQL Agent job in SQL 2008) to move all records from the staging table to the persistent tracking tables in a batch. If we don’t want to wait up to 10 seconds to see the tracking records in the persistent tables, one workaround is to duplicate the job but run it on a different schedule. For example, the default job runs at seconds 10, 20, 30… of every minute, while another job runs at seconds 5, 15, 25, etc., which essentially makes the staging table job trigger every 5 seconds.
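As a quick sanity check of the staggered schedule idea, the Python snippet below just does the arithmetic on the start seconds of the two jobs. It does not talk to SQL Server Agent at all; the function names are made up for illustration, and I am assuming both jobs start exactly on their scheduled second.

```python
from typing import List


def fire_seconds(offset: int, period: int = 10) -> List[int]:
    """Seconds within a minute at which a job with this offset and period starts."""
    return list(range(offset, 60, period))


def max_gap(seconds: List[int]) -> int:
    """Longest wait between consecutive runs, wrapping around the minute boundary."""
    ordered = sorted(set(seconds))
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    gaps.append(ordered[0] + 60 - ordered[-1])  # last run of this minute to first run of the next
    return max(gaps)


default_job = fire_seconds(0)   # 0, 10, 20, ... (second 60 is second 0 of the next minute)
shifted_job = fire_seconds(5)   # 5, 15, 25, ...
combined = default_job + shifted_job

print(max_gap(default_job))
print(max_gap(combined))
```

The longest gap drops from 10 seconds with the default job alone to 5 seconds with both jobs, which matches the reasoning above.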