.. _dbsync: :mod:`dbsync` --- Database Syncing ================================== .. module:: dbsync :synopsis: Enhydris Databasy Synchronization Details .. moduleauthor:: Seraphim Mellos .. sectionauthor:: Seraphim Mellos The dbsync module implements the database replication and synchronization features. The core part of this module is the syncdb management command which takes care of fetching and installing remote objects from JSON files using the :ref:`Webservice API `. .. admonition:: Note The dbsync application is currently barely working and should be rewritten. DBSync Objects -------------- Each instance of the :class:`Database` class represents a remote enhydris instance. Once such an object has been added to the local database, then the remote instance it refers to can be used in the replication routine. .. class:: Database(name, ip_address, hostname, descr) .. attribute:: name This is the name of the database. It's not mandatory that it's the same to the actual name of the database. This is only used for local reference. .. attribute:: ip_address This field should contain the ip of the host that holds the remote enhydris instance. .. attribute:: hostname This field must contain the FQDN from which the enhydris instance is accessible (this is especially required when using vhosts on a server so that the replication script knows which vhost uses which database). .. note:: A fully qualified domain name (FQDN), sometimes referred to as an absolute domain name, is a domain name that specifies its exact location in the tree hierarchy of the Domain Name System (DNS). It specifies all domain levels, including the top-level domain, relative to the root domain. A fully qualified domain name is distinguished by this absoluteness in the name space. .. attribute:: descr This is a textfield that holds the description for the specific database. DBSync Management Command ------------------------- The core functionality of the DBSync module is to provide a management command with which one can replicate completely a remote instance (or multiple remote instances) of the enhydris web application. The replication script can also update existing entries with changes when run multiple consecutive times but doesn't handle item deletion. The code for the replication scripts resides under the ``enhydris/dbsync/management/commands/`` directory, inside the ``hcore_remotesyncdb.py`` file. You can check out the available options for the script by issuing the following command:: # ./manage.py hcore_remotesyncdb -h Usage: ./manage.py hcore_remotesyncdb [options] This command is used to synchronize the local database using data from a remote instance Options: -v VERBOSITY, --verbosity=VERBOSITY Verbosity level; 0=minimal output, 1=normal output, 2=all output --settings=SETTINGS The Python path to a settings module, e.g. "myproject.settings.main". If this isn't provided, the DJANGO_SETTINGS_MODULE environment variable will be used. --pythonpath=PYTHONPATH A directory to add to the Python path, e.g. "/home/djangoprojects/myproject". --traceback Print traceback on exception -r REMOTE, --remote=REMOTE Remote instance to sync from -p PORT, --port=PORT Specify custom port. Default is 80. -a APP, --app=APP Application which should be synced -e EXCLUDE, --exclude=EXCLUDE State which models of the apps you want excluded from the sync -f, --fetch-only Doesn't actually submit any changes, just fetches remote dumps and saves them locally. -w CWD, --work-dir=CWD Define the tmp dir in which all temporary files will be stored -N, --no-backups Default behaviour is to take a backup of the local db before doing any changes. This overrides this behavior. -s, --skip If skip is specified, then syncing will skip any problems continue execution. Default behavior is to halt on all errors. -R, --resume With resume, no files are fetched but the local ones are used. -S, --silent Suppress all log messages --version show program's version number and exit -h, --help show this help message and exit The most important command line options are the ``-a`` and ``-r`` which are used to specify which application you want to replicate (in our case ``hcore``) and which is the remote instance from which the data should be pulled. A sample execution of the replication script from the command line should look something like this:: # ./manage.py hcore_remotesyncdb -a hcore -r itia.hydroscope.gr -e UserProfile /usr/local/lib/python2.6/dist-packages/django_registration-0.7-py2.6.egg/registration/models.py:4: DeprecationWarning: the sha module is deprecated; use the hashlib module instead Checking port availability on host 147.102.160.28, port 80 Remote host is up. Continuing with the sync. The following models will be synced: ['EventType', 'FileType', 'Garea', 'Gentity', 'GentityAltCode', 'GentityAltCodeType', 'GentityEvent', 'GentityFile', 'Gline', 'Gpoint', 'Instrument', 'InstrumentType', 'Lentity', 'Organization', 'Overseer', 'Person', 'PoliticalDivision', 'Station', 'StationType', 'TimeStep', 'TimeZone', 'Timeseries', 'UnitOfMeasurement', 'Variable', 'WaterBasin', 'WaterDivision'] The following models will be excluded ['UserProfile'] Syncing model EventType - Downloading EventType fixtures : done Syncing model FileType - Downloading FileType fixtures : done Syncing model Garea - Downloading Garea fixtures : done Syncing model Gentity - Downloading Gentity fixtures : done Syncing model GentityAltCode - Downloading GentityAltCode fixtures : done Syncing model GentityAltCodeType - Downloading GentityAltCodeType fixtures : done Syncing model GentityEvent - Downloading GentityEvent fixtures : done Syncing model GentityFile - Downloading GentityFile fixtures : done Syncing model Gline - Downloading Gline fixtures : done Syncing model Gpoint - Downloading Gpoint fixtures : done Syncing model Instrument - Downloading Instrument fixtures : done Syncing model InstrumentType - Downloading InstrumentType fixtures : done Syncing model Lentity - Downloading Lentity fixtures : done Syncing model Organization - Downloading Organization fixtures : done Syncing model Overseer - Downloading Overseer fixtures : done Syncing model Person - Downloading Person fixtures : done Syncing model PoliticalDivision - Downloading PoliticalDivision fixtures : done Syncing model Station - Downloading Station fixtures : done Syncing model StationType - Downloading StationType fixtures : done Syncing model TimeStep - Downloading TimeStep fixtures : done Syncing model TimeZone - Downloading TimeZone fixtures : done Syncing model Timeseries - Downloading Timeseries fixtures : done Syncing model UnitOfMeasurement - Downloading UnitOfMeasurement fixtures : done Syncing model Variable - Downloading Variable fixtures : done Syncing model WaterBasin - Downloading WaterBasin fixtures : done Syncing model WaterDivision - Downloading WaterDivision fixtures : done Creating Generic objects Finished with Generic objects Installing fixtures from file EventType.json Installing fixtures from file FileType.json Installing fixtures from file Gentity.json Installing fixtures from file Garea.json Installing fixtures from file GentityAltCode.json Installing fixtures from file GentityAltCodeType.json Installing fixtures from file GentityEvent.json Installing fixtures from file GentityFile.json Installing fixtures from file Gline.json Installing fixtures from file Gpoint.json Installing fixtures from file Instrument.json Installing fixtures from file InstrumentType.json Installing fixtures from file Lentity.json Installing fixtures from file Organization.json Installing fixtures from file Overseer.json Installing fixtures from file Person.json Installing fixtures from file PoliticalDivision.json Installing fixtures from file Station.json Installing fixtures from file StationType.json Installing fixtures from file TimeStep.json Installing fixtures from file TimeZone.json Installing fixtures from file Timeseries.json Installing fixtures from file UnitOfMeasurement.json Installing fixtures from file Variable.json Installing fixtures from file WaterBasin.json Installing fixtures from file WaterDivision.json Reinitializing foreign keys: done Successfully installed 7319 objects from 26 fixtures. The command above, replicates all remote data except for the UserProfiles ( defined using the ``-e|--exclude`` option) keeping all data and foreign keys intact but without preserving the object ids. If run multiple times, the script can also update existing entries along with adding new ones. It's important to note that when replicating an enhydris database we should *ALWAYS* exclude the UserProfile since we don't want user specific data to be transfered along with the rest of the database. When adding a cronjob, if you don't want a regural mail to come after every sync, you should use the ``--silent`` option which redirects ``stdout`` to ``/dev/null`` and only prints ``stderr``. This, coupled with the ``-W`` python flag can be used to make a cronjob send an email only whenever a problem was encountered. A sample cronjob which runs every night would be something like this:: 1 0 * * * /usr/bin/python -Wignore manage.py hcore_remotesyncdb -a hcore -r itia.hydroscope.gr -e UserProfile --silent .. admonition:: How stuff works In this section, we'll analyze the replication script and see how it operates behind the scenes. Of course, if you want to understand how it works it's probably better if you looked directly into its source code. Regarding the API which provides us with the database objects, it's been fully documented :ref:`here `. Here, we'll see how the replication script handles that data and adds it in the local database. One important thing that you should be familiar with before we delve into the code is the difficulties that we came across when trying to implement this feature. Postgres (and most databases by design) keep track of foreign keys using the primary key of an object which in all of enhydris models happens to be the object id. Since we want to aggregate multiple instances into one, it's only natural that there will be id collissions should we try to load the objects in the database while keeping their original id. Thus, we decided that keeping the ids intact was not an option and we had to find a way to preserve foreign keys and many to many relations without counting on object ids. The best workaround is to add the objects without their foreign keys and many to many relationships and once the objects are in the database we could reinitialize all object relationships. To do that, we added two extra fields on all top-level objects named ``original_id`` and ``original_db`` which can be used to identify a specific object during the syncing process given that we know its id and the database that we're pulling the data from. Now the only thing was to somehow store the foreign relations in a way that could be parsed easily and quite fast after the object initialization. This was achieved using a multilevel dictionary which stores all object foreign relations and parsing this would be a breeze using python's optimized dictionary parsing routines. Of course, that's when the real problems surfaced. Many objects have ``Null=false`` in some foreign keys which caused the replication to fail when trying to save objects with null foreign keys. In order to circumvent that, when firing up the replication script we create a set of ``Dummy Objects`` aka objects that have null values and are used to fill-in the not-Null foreign key dependencies of the to-be-installed objects. Once the replication objects are into the database, we delete the Dummy Objects and update the foreign relations to the original ones which we have stored in the dictionary mentioned above. This may be a slow process but is the only feasible solution that we came up with at the time. Having said all that, we can see what the workflow of the script looks like. First of all, given the application name, it tries to import the specified app and list all available models in it. Using a multipass bubblesort algorithm, it sorts all models using their dependencies as specified in the ``f_dependencies`` model field and given that there are no circular dependencies, the final list contains the models in the correct replication order. Using the model list, the script asks from the remote instance the JSON fixture of each model in the list which is fetched and saved in a temporary dir (by default this is ``/tmp``). Once all JSON fixtures have been fetched, the script creates the generic objects and then deserializes each JSON file in the same order it was fetched. For each object within the fixture, it first strips all foreign relations and reinitializes the not-null ones using the generic objects. Also, the fields original_id and original_db are filled in and the foreign keys and many to many relations are saved in a multilevel dictionary for future reference. Once the deserialization of all fixtures has been completed, all objects are saved under the same transaction management because we don't want to have any objects left out from the replication routine. If everything has been completed successfully, the script reinitializes all foreign keys and many to many relations from the dictionary and exits after cleaning up. If a problem occurs all transactions are rolled back and the database is exactly as it was before the replication attempt. **Note**: The generic objects which are used to fill temporary `Not Null` foreign relations are handcrafted. This means that should the Enhydris database schema change drastically, this would probably require an update as well.