Best Practices

Jobs Distribution

It is recommended to split the backup target between different groups of entities, or even to have one job per entity (user, group, site, etc.). This way, errors in one job will not invalidate a whole backup cycle in which some entities were backed up successfully while others had errors. It also makes it easier to identify the cause of an error.
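
As an illustration, a per-entity split could be expressed with one FileSet (and one Job) per entity. This is only a sketch: the m365 plugin keyword and the user=/site= selection values used here are assumptions to be adapted to your environment and checked against the plugin parameter reference.

   # One FileSet, and therefore one job, per entity: a failure in one
   # entity does not invalidate the backups of the others.
   FileSet {
     Name = "fs-m365-user-jsmith"
     Include {
       Options { Signature = MD5 }
       Plugin = "m365: service=email user=jsmith@example.com"
     }
   }

   FileSet {
     Name = "fs-m365-site-Marketing"
     Include {
       Options { Signature = MD5 }
       Plugin = "m365: service=sharepoint site=Marketing"
     }
   }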

Concurrency

Microsoft public APIs impose a variety of boundaries that need to be considered. If a boundary is crossed, the corresponding API call will fail and the application will need to wait some amount of time before retrying; the waiting time differs depending on the boundary that was crossed.

It is crucial to plan an adequate strategy to back up all the elements without reaching API boundaries. A single job implements some parallelism, which can be reduced, if necessary, using the variable backup_queue_size (default value is 30). This variable controls the size of the internal queues communicating the internal threads that are designed to fetch, open and send every item to the Bacula core. Reducing its size (ultimately to a value of 1, for example) produces an execution very similar to a single-threaded process. On the other hand, the plugin has concurrent_threads, which controls the number of simultaneous processes fetching and downloading data (default value is 5).
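
As a minimal sketch, both variables can be lowered directly in the plugin command of the FileSet when throttling is observed (the m365 plugin keyword, the service and the user shown here are placeholders to adapt):

   # Smaller internal queues and fewer download threads
   # (defaults are backup_queue_size=30 and concurrent_threads=5)
   Plugin = "m365: service=email user=jsmith@example.com backup_queue_size=10 concurrent_threads=2"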

If you are going to launch different jobs in parallel, it is recommended to configure a different service for each of them (for example, protect email, drive and contacts in parallel). However, be careful with concurrency over the same service (in general, a maximum of 4-5 jobs working with the same service is recommended) and plan a step-by-step testing scenario before putting it into production. Another important point is the timing schedule, as some boundaries are related to time frames (number of requests per 10 minutes or per hour, for example). If you detect that you reach boundaries when running all your backups during a single day of the week, try to spread the load over 2 or 3 days in order to achieve better results.
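
For instance, the load can be spread over different days of the week with standard Bacula Schedule resources (a sketch only; resource names, levels and times are illustrative):

   # Email jobs run on Mondays and Sharepoint jobs on Wednesdays, so the
   # two groups never hit the Graph API within the same time frame.
   Schedule {
     Name = "m365-email-schedule"
     Run = Full 1st sun at 21:00
     Run = Incremental mon at 21:00
   }

   Schedule {
     Name = "m365-sharepoint-schedule"
     Run = Full 1st sat at 21:00
     Run = Incremental wed at 21:00
   }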

More information about Microsoft 365 Graph API boundaries may be found here:

https://docs.microsoft.com/graph/throttling

Disk Space

It is necessary to have at least enough disk space available for the size of the largest file in the backup session. If you use concurrency between jobs or within the same job (which is the default behavior through the concurrent_threads=5 parameter), you need at least the size of the largest file multiplied by the number of parallel operations you run.
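
As a worked example (the 2 GB figure is chosen here only for illustration):

   largest file in the backup session:             2 GB
   parallel operations (concurrent_threads=5):   x    5
   minimum free space in the working directory:  10 GB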

Read more details in the Backup of Attachments and Files section.

Performance

The performance of this plugin is highly dependent on many external factors:

  • ISP latency and bandwidth

  • Network infrastructure

  • FD Host hardware

  • FD Load

In summary, it is not possible to establish an exact reference for how much time a backup will need to complete.

As reference values, in testing environments and using sequential calls, we obtained the following results:

  • 3000 emails in the same folder of a single mailbox, with emails ranging from 500 to 3000 words - about 10K to 30K in size

  • The total time required to back them up was: 88243 ms.

  • That implies an average time per email of about 30 ms.

Concurrency has also been tested. For example, when using single-threaded mode (concurrent_threads=1) with emails, good values were found for 4-6 concurrent jobs, which allowed us to back up around 200 emails per second (sizes between 20K and 30K) when no attachments were involved. As more attachments are involved, the number of emails per second decreases very significantly, while the overall throughput increases. This can be generalized:

  • Many small objects to protect -> more objects per second, but lower speed (MB/s)

  • Big files to protect -> fewer objects per second, but greater speed (MB/s)

It is recommended to benchmark your own environment according to your own requirements and needs.

The automatic parallelization mechanism (using concurrent_threads=x, default is 5) should work well for most scenarios. However, fine tuning is possible by defining one job per entity and controlling how many of them run in parallel, together with decreasing the concurrent_threads value, in order to avoid throttling from the MS Graph API.
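
A possible way to combine both controls is to cap how many of the per-entity jobs run simultaneously with the standard Maximum Concurrent Jobs directive, while also lowering the per-job parallelism in the plugin command. This is a sketch under assumed resource names and values:

   # Limit how many jobs the dedicated M365 client runs at the same time
   Client {
     Name = "m365-fd"
     Address = m365-proxy.example.com
     Password = "xxx"
     Catalog = "MyCatalog"
     Maximum Concurrent Jobs = 3
   }

   # ...and reduce the parallelism inside each job
   Plugin = "m365: service=drive user=jsmith@example.com concurrent_threads=2"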

There are many different possible strategies to use this plugin, so please study what suits your needs best before deploying the jobs for your entire environment, so that you can get the best possible results:

  • You can have a job per entity (users, groups, sites…) and all services

  • You can have multiple entities and only some services inside a job

  • You can split your workload through a schedule, or try to run all your jobs together.

  • You can run jobs in parallel, or take advantage of concurrent_threads and thus run fewer jobs in parallel

  • You can select which services to back up, or back them all up

  • You can back up whole services, or select precisely which elements you really need inside each service (folders, paths, exclusions…)

  • For delegated permission services, you can perform the login prior to the backups (recommended), or you can wait for the job to ask for what it needs

  • etc.

Best Practices by Service

Each Microsoft 365 service presents its own particularities. Moreover, each deployment and each environment can be protected through different strategies depending on the size, the number of entities and other parameters. However, there are some general considerations that can be helpful for a selection of services:

Sharepoint

Sharepoint Online is the service where the target data varies the most among the M365 services. Sites can be structured very differently and there are many different kinds of Sharepoint lists. In addition to that, some of them can have very large document libraries, while others can be very sparse.

For these reasons, it is recommended to split Sharepoint jobs to be as atomic as possible.

Ideally, jobs should be split down to one backup job per site. This helps to monitor the health of the backups, to keep good backups for all sites presenting no issues, and to quickly troubleshoot any site presenting any kind of problem. Usually, this is especially needed when the Sharepoint service is intensively used.

If sites are created and removed often, another option to avoid frequent configuration changes is to employ the site_regex_include plugin variable and use naming patterns to set up the jobs, so that they will also include sites created in the future (possible examples: .*Team, .*Web, Dev.* …).
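
For instance, a single job covering all development sites, present and future, could rely on a naming pattern like the following (a sketch; the regular expression is only an example):

   # Back up every Sharepoint site whose name starts with "Dev",
   # including sites created after this FileSet was written
   Plugin = "m365: service=sharepoint site_regex_include=Dev.*"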

In order to facilitate the job split, it is possible to use the listing or query commands described in List general info: Users, groups, sites to obtain the list of site names.

Another best practice is to split the backup of Sharepoint sites into two parts with two different jobs:

  • Pure Sharepoint job: use the sharepoint service (service=sharepoint site=SiteName) and the parameter sharepoint_include_drive_items=no.

  • Drive part job: use the drive service (service=drive site=SiteName).
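
Using the parameters above, the plugin commands of such a pair of jobs could look as follows (SiteName is a placeholder):

   # Job 1 - pure Sharepoint structure, without document library items
   Plugin = "m365: service=sharepoint site=SiteName sharepoint_include_drive_items=no"

   # Job 2 - document libraries (drive part) of the same site
   Plugin = "m365: service=drive site=SiteName"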

Sharepoint jobs, when using the associated PowerShell commands, cannot run in parallel: one job will wait for any other session to finish before executing the corresponding commands. Using the above strategy makes Sharepoint jobs much quicker and lighter, so it is a lot easier to control such concurrency. Other ways to control the concurrency in Bacula Enterprise are:

  • Designing separate time windows through scheduling

  • Sending all Sharepoint jobs to the same pool and the associated storage resource (both of them dedicated to Sharepoint backups) and limiting Maximum Concurrent Jobs to the desired value in that particular storage (see the sketch after this list)

  • Having one specific client resource for Sharepoint and using the Maximum Concurrent Jobs directive there.
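
For the storage-based option, a minimal sketch could be the following (resource names, addresses and the device are illustrative):

   # Dedicated storage for Sharepoint backups: only one Sharepoint job
   # can write at a time, regardless of how many are scheduled.
   Storage {
     Name = "sharepoint-storage"
     Address = backup-sd.example.com
     Password = "xxx"
     Device = "SharepointFileDevice"
     Media Type = File
     Maximum Concurrent Jobs = 1
   }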

In general, it is not necessary to enable system lists, hidden lists, or even the versioning of elements. All those options directly impact the performance of the jobs, so you should plan the usage of such parameters carefully.

Some Sharepoint sites present an issue with the hash check of the files contained in some of their document libraries. If the hash on the M365 side is not correct, it does not match the one calculated locally by the plugin, even though the file itself is correct. This situation causes a lot of error messages around the hash, as well as many extra calls to the service to re-check it. When this happens, it is recommended to disable the hash check mechanism for the affected site using the parameter drive_disable_hashcheck=true.
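
For an affected site, the parameter is simply appended to the corresponding plugin command (a sketch; SiteName is a placeholder):

   # Skip the hash re-check for this site to avoid the spurious error
   # messages and the extra service calls described above
   Plugin = "m365: service=drive site=SiteName drive_disable_hashcheck=true"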

Onenote

Onenote is the module most sensitive to the throttling limits of the Microsoft 365 APIs: https://learn.microsoft.com/en-us/graph/throttling

It is recommended to be selective with the information to be protected and to create one job for each selected entity: that is, one job per user, group or site.

If throttling problems appear with this strategy, jobs will need to be spread over larger time windows.

It is not recommended to include onenote in entity-focused strategies where there is one job per user including all services (email, drive, contact, calendar…). It is recommended to have specific onenote jobs.
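
A dedicated per-entity Onenote job could therefore look like this (a sketch; the service=onenote and user= selection values shown here are assumptions to be checked against the plugin parameter reference):

   # Dedicated Onenote job for a single user, kept separate from the
   # jobs that cover email, drive, contacts and calendar
   Plugin = "m365: service=onenote user=jsmith@example.com"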

Chats and Todo

These services use delegated permissions. They can be protected together, but we recommend running the ‘login’ query command for each individual user before running the actual jobs.

Email

The email service allows you to protect not only emails and attachments, but also settings, user categories and an additional MIME copy of each email with its attachments. MIME files are very useful if you are planning to migrate your data from M365 to another email tool; however, the cost of getting each MIME file is very high and it doubles backup time. MIME files are not needed to perform a local restore or a restore to M365. Therefore, it is recommended to disable MIME backup on a general basis and to use it only when planning a complete migration of the data.

Similarly to other services, it is recommended to split the backup set into small sets of users (ideally one backup per mailbox, but it is also fine to group them by 5 or 10), especially if there is a large number of mailboxes and they contain thousands of emails. To make this easier, the plugin offers regular expression selection parameters (user_regex_include), as well as listing and query commands to analyze the backup target (List general info: Users, groups, sites).
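
For example, a group of mailboxes could be selected by a naming pattern instead of being listed one by one (a sketch; the regular expression is only illustrative):

   # Protect the mailboxes of every user whose name starts with "sales"
   # in one email job, separate from the rest of the tenant
   Plugin = "m365: service=email user_regex_include=sales.*"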

Go back to Microsoft 365 (M365) Plugin article.