Speaker: Saud Al-Mishari, MSFT PFE – think he’s based in the UK
The session is on the new replication model: RCM, DRS, and SEDO.
Key Concepts
- SQL replication in ConfigMgr 2012 is nothing do do with SQL Server Transaction Replication
- Data Replication Service (DRS)
Terminology
- Stored procedure: sproc
- SSB: SQL Service Broker
- Change Tracking: SQL Server Change Tracking
More:
- RCM: Replication Configuration Management/Monitoring
- Replication Pattern: a set of rules on what will replicate
- Replication group: a set of tables that are monitored and replicated together
- Replication Link: a replication connection between two SQL servers for a particular RG
- Backlog: Unable to write data t the SQL Server DB after being received in the SSB Queue (usually SQL Server write performance)
New Replication Model
- Global data is anything an admin creates and is replicated everywhere, e.g. collection rules
- Site data is stuff like status, collection membership results, replicated up to parent site.
Client generates XML file and copies to management point. MP copies MIF to the site server. Site server process it. DRS replicates the changed data to the parent CAS contains the discovery data.
SQL Server Change Tracking
- Change tracking allows application to keep a record of rows in a table that have been changed: insert/update/deete
- Does not track changed data – obtained directly each sync
- Added in SQL Server 2008 … not to be confused with Change Data Capture
- Is enabled at the DB level and at the table level.
DO NOT ALTER THIS SETTING ON A SITE DATABASE
SQL Service Broker
Messaging service:
- Asynchronous queue based service
- Guaranteed delivery (not infrastructural guarantee – developer guarantee)
- Allows messages to be grouped into a conversation … messages processed in order, allows for multiple threads to process queue
Elasticity:
- Allows scalability
Replication Patterns
- Global data flows in both directions. CAS and primaries all have the same data, e.g. collections and package meta data.
- Site data flows up.
- Global-proxy is admin and control data for secondary sites. A primary and secondary sites all have the same data. Subset of global data that secondary sites needs. Leverages SQL 2008 R2 Express at the secondary site with 10 GB limit.
Select * from vReplicationData to find all RGs and their sync schedules
ID is the key field in here.
Provider Access
SMS_ReplicationGroup is a new WMI class that supports replication. 1 instance per RG. Status propert allow you to determine the sttus of the RG.
What’s in an RG?
Select * from vArticleData where ReplicationID = XX …. using ID from above query
How big is the RG?
EXEC spDiagGetSpaceUsed
If a site goes down for a week or two, how much data must you send across? Use the above query to figure out how much data must be replicated by the RG.
Demo
In the SQL Management Studio. Select * from vReplicationData. Can see all the patterns for global, site and global-proxy. SyncInterval is the number of minutes between replications. DRS runs every 5 minutes .. no control over that.
Select * from vArticleData where ReplictionID = 7. Looks like Endpoint Protection data being replicated here.
Runs spDiagGetSpaceUsed .. takes a while. Returns the size of the tables. Replication Pattern shows the amount of data to replicate if you lose a site for the 3 patterns (global, site, global_proxy).
DRS Architecture
- RCM handles replication link setup, maintenance and monitoring – command and control. It’s a thread of SMSEXEC.
- SSB is the transmission engine of replication
- The Sender still lives and is used for bulk copy for initialization and re-init.
- 5 day limit on DRS for outages – Due to the need to retain changes. It retains 5 days of data. Try to expand this for a 30 day outage and ConfigMgr needs to maintain 30 days of data. It’s 5 days to handle a long weekend apparently – site breaks at start of holiday, come back 4 days later and fix it.
Initialisation:
- BCP: to extract table data
- Sender: SMS EXEC sender thread
- SMB/CIFS: copy data to the destination
On-going replication
- SQL Server Change Tracking
- DRS sprocs and SQLCLR
- SQL Server Service Broker
- XML
Demo – Break replication
SQL DBA has a bad day and disables dbo.ConfMgrDRSQueue. CMTrace is started from DVD. Opens rcmctrl log on site server. See that the queue not running causes and error. We can see that ConfigMgr actually reached out into SQL and re-enabled the queue.
In CMconsole , we have send demo. The link is degraded in one direction but not the other under Database Replication. Looks like TCP 1433 connectivity issue.
Site Initialisation
- Setup start
- Setup asks CAS for site number. If you have more than 50,000 clients, then you need SQL Enterprise Edition to chunk up data in the DB and partition it.
- Setup finished and waits for replication to initialise.
- The replication configuration data is requested. This group tells RCM as the primary how replication should be setup
- CSA receives request and BCPS out the data and sends it via sender back to the primary
- Primary now request remaining Global Replication Groups. CAS creates the BC packages and send them back to the primary. Primary then applies the new data from the CAS.
- Primary site receives BCP fles and inserts all the data from the CAS> The primary can now switch to normal replication.
DRS Message Replication
- Provider executes query that modifies table
- SQL Server writes entries into change tracking table
- On DRS sync: changes are packages up and inserted into SQL Server message queue sing a stored proc.
- Message Broker transmits the message to the receiving site.
- RCM monitors the queue launching activation stored procs to process
- And more on receiving side to insert modifications on receiving side
WARNING: When A CAS Goes Offline
When the CAS goes offline for more than 5 days, don’t make changes on the Primary as a substitute as the CAS. The CAS will re-initialise the primaries after more than 5 days outage, thus wiping the Primary’s changes.
DRS Troubleshooting
- The Replication Link Analyser RLS should be yur first stop. It’s predictable and can do some fairly complex remediation
- RCM Log should be the follow up. But this is just a summary of what has happend.
- For transmissions layer errors, the SSB queue is sometimes the most immediate source for error messages (of this type)
Views for Detailed Info
- The main logging view: vLogs. They log into the DB. Select top 1000 * from vLogs order by LogTime desc. Limit that number. DO not select everything. Will hammer prod environment and compund the issue.
- SMS_Replication_Configuration_Monitor registry key to configure logging
DRS Troubleshooting
- Ensure that TCP 1433 exception is there for SQL Service and 4022 for SQL Broker.
- SSB keys transmitted through setup – monitoring with Hman.
- spDiagDRS will give you an overview of the state of DRS replication at the site. SiteStatus (coded), Replication Group Initialization Status, DRSQueueStates, QueueLenghts (ideally 0 and 0 or you have a backlog), Replication Group Status deltails the last time messages sent
Demo: View Queues
Click on the queues in SQL under service broker under CM database.
Procedural troubleshooting of DRS DEMO
Turns of SQL Broker. Makes a change to Client Policy.
- Run spDiagDRS: EXEC spDiagDRS in SQL MS. We see messages jammed in the outbound queue.
- SSB transmission_queue:
- Service broker queues: We see connection failed errors. Telnet to the port and we see it fails.
- vLogs: select * from vLogs ORDER BY LogTime DESC (beware * in real world … too much data)
- RCM_ReplicationLinkStatus
The Database Replication link in CM console will flip to degraded and then flip to fail after about 25 minutes. Can run Replication Link Analyzer (RLA). In the demo it shows that there’s a network connectivity issue.
Invoke-WmiMethod –namespace rootrootsmssite_CAS –path SMS_ReplicatinGroup –Name InitializeData = arguementlist “20”, “CAS”, “PR1” to reinitialize a RG. RLA should do this for you if required.
SEDO – Why do we need a way of controlling changes?
- As global data is replicated everywhere, a user on a primary site culd change an object at the same time as a user on the CAS or another primary.
- This is an unavoidable consequence of multi-master replicated data model – ask AD.
- SEDO is the solution to this.
What is SEDO?
- SEDO = Serialized Editing of Data/Distributed Objects
- Provides a way to enforece a single user editing of an object at any one time.
- A lock request round trip can take less than 200ms from Primary to CAS to Primary
- Default Timeout is 5 minutes.
- Only SEDO enabled objects require users to get a lock
- Supports explicit and implicit lock handling.
- This is all transparent to admins. Important for devs building extensions to CM.
Once again, an AWESOME blog from Aidan!!!