dbsync
— Database Syncing¶
The dbsync module implements the database replication and synchronization features. The core part of this module is the syncdb management command which takes care of fetching and installing remote objects from JSON files using the Webservice API.
Note
The dbsync application is currently barely working and should be rewritten.
DBSync Objects¶
Each instance of the Database
class represents a remote enhydris
instance. Once such an object has been added to the local database, then the
remote instance it refers to can be used in the replication routine.
-
class
dbsync.
Database
(name, ip_address, hostname, descr)¶ -
name
¶ This is the name of the database. It’s not mandatory that it’s the same to the actual name of the database. This is only used for local reference.
-
ip_address
¶ This field should contain the ip of the host that holds the remote enhydris instance.
-
hostname
¶ This field must contain the FQDN from which the enhydris instance is accessible (this is especially required when using vhosts on a server so that the replication script knows which vhost uses which database).
Note
A fully qualified domain name (FQDN), sometimes referred to as an absolute domain name, is a domain name that specifies its exact location in the tree hierarchy of the Domain Name System (DNS). It specifies all domain levels, including the top-level domain, relative to the root domain. A fully qualified domain name is distinguished by this absoluteness in the name space.
-
descr
¶ This is a textfield that holds the description for the specific database.
-
DBSync Management Command¶
The core functionality of the DBSync module is to provide a management command with which one can replicate completely a remote instance (or multiple remote instances) of the enhydris web application. The replication script can also update existing entries with changes when run multiple consecutive times but doesn’t handle item deletion.
The code for the replication scripts resides under the
enhydris/dbsync/management/commands/
directory, inside the
hcore_remotesyncdb.py
file. You can check out the available options for
the script by issuing the following command:
# ./manage.py hcore_remotesyncdb -h
Usage: ./manage.py hcore_remotesyncdb [options]
This command is used to synchronize the local database using data from a
remote instance
Options:
-v VERBOSITY, --verbosity=VERBOSITY
Verbosity level; 0=minimal output, 1=normal output,
2=all output
--settings=SETTINGS The Python path to a settings module, e.g.
"myproject.settings.main". If this isn't provided, the
DJANGO_SETTINGS_MODULE environment variable will be
used.
--pythonpath=PYTHONPATH
A directory to add to the Python path, e.g.
"/home/djangoprojects/myproject".
--traceback Print traceback on exception
-r REMOTE, --remote=REMOTE
Remote instance to sync from
-p PORT, --port=PORT Specify custom port. Default is 80.
-a APP, --app=APP Application which should be synced
-e EXCLUDE, --exclude=EXCLUDE
State which models of the apps you want excluded from
the sync
-f, --fetch-only Doesn't actually submit any changes, just fetches
remote dumps and saves them locally.
-w CWD, --work-dir=CWD
Define the tmp dir in which all temporary files will
be stored
-N, --no-backups Default behaviour is to take a backup of the local db
before doing any changes. This overrides this
behavior.
-s, --skip If skip is specified, then syncing will skip any
problems continue execution. Default behavior is to
halt on all errors.
-R, --resume With resume, no files are fetched but the local ones
are used.
-S, --silent Suppress all log messages
--version show program's version number and exit
-h, --help show this help message and exit
The most important command line options are the -a
and -r
which are
used to specify which application you want to replicate (in our case
hcore
) and which is the remote instance from which the data should be
pulled. A sample execution of the replication script from the command line
should look something like this:
# ./manage.py hcore_remotesyncdb -a hcore -r itia.hydroscope.gr -e UserProfile
/usr/local/lib/python2.6/dist-packages/django_registration-0.7-py2.6.egg/registration/models.py:4:
DeprecationWarning: the sha module is deprecated; use the hashlib module instead
Checking port availability on host 147.102.160.28, port 80
Remote host is up. Continuing with the sync.
The following models will be synced: ['EventType', 'FileType', 'Garea',
'Gentity', 'GentityAltCode', 'GentityAltCodeType', 'GentityEvent',
'GentityFile', 'Gline', 'Gpoint', 'Instrument', 'InstrumentType', 'Lentity',
'Organization', 'Overseer', 'Person', 'PoliticalDivision', 'Station',
'StationType', 'TimeStep', 'TimeZone', 'Timeseries', 'UnitOfMeasurement',
'Variable', 'WaterBasin', 'WaterDivision']
The following models will be excluded ['UserProfile']
Syncing model EventType
- Downloading EventType fixtures : done
Syncing model FileType
- Downloading FileType fixtures : done
Syncing model Garea
- Downloading Garea fixtures : done
Syncing model Gentity
- Downloading Gentity fixtures : done
Syncing model GentityAltCode
- Downloading GentityAltCode fixtures : done
Syncing model GentityAltCodeType
- Downloading GentityAltCodeType fixtures : done
Syncing model GentityEvent
- Downloading GentityEvent fixtures : done
Syncing model GentityFile
- Downloading GentityFile fixtures : done
Syncing model Gline
- Downloading Gline fixtures : done
Syncing model Gpoint
- Downloading Gpoint fixtures : done
Syncing model Instrument
- Downloading Instrument fixtures : done
Syncing model InstrumentType
- Downloading InstrumentType fixtures : done
Syncing model Lentity
- Downloading Lentity fixtures : done
Syncing model Organization
- Downloading Organization fixtures : done
Syncing model Overseer
- Downloading Overseer fixtures : done
Syncing model Person
- Downloading Person fixtures : done
Syncing model PoliticalDivision
- Downloading PoliticalDivision fixtures : done
Syncing model Station
- Downloading Station fixtures : done
Syncing model StationType
- Downloading StationType fixtures : done
Syncing model TimeStep
- Downloading TimeStep fixtures : done
Syncing model TimeZone
- Downloading TimeZone fixtures : done
Syncing model Timeseries
- Downloading Timeseries fixtures : done
Syncing model UnitOfMeasurement
- Downloading UnitOfMeasurement fixtures : done
Syncing model Variable
- Downloading Variable fixtures : done
Syncing model WaterBasin
- Downloading WaterBasin fixtures : done
Syncing model WaterDivision
- Downloading WaterDivision fixtures : done
Creating Generic objects
Finished with Generic objects
Installing fixtures from file EventType.json
Installing fixtures from file FileType.json
Installing fixtures from file Gentity.json
Installing fixtures from file Garea.json
Installing fixtures from file GentityAltCode.json
Installing fixtures from file GentityAltCodeType.json
Installing fixtures from file GentityEvent.json
Installing fixtures from file GentityFile.json
Installing fixtures from file Gline.json
Installing fixtures from file Gpoint.json
Installing fixtures from file Instrument.json
Installing fixtures from file InstrumentType.json
Installing fixtures from file Lentity.json
Installing fixtures from file Organization.json
Installing fixtures from file Overseer.json
Installing fixtures from file Person.json
Installing fixtures from file PoliticalDivision.json
Installing fixtures from file Station.json
Installing fixtures from file StationType.json
Installing fixtures from file TimeStep.json
Installing fixtures from file TimeZone.json
Installing fixtures from file Timeseries.json
Installing fixtures from file UnitOfMeasurement.json
Installing fixtures from file Variable.json
Installing fixtures from file WaterBasin.json
Installing fixtures from file WaterDivision.json
Reinitializing foreign keys: done
Successfully installed 7319 objects from 26 fixtures.
The command above, replicates all remote data except for the UserProfiles (
defined using the -e|--exclude
option) keeping all data and foreign keys
intact but without preserving the object ids. If run multiple times, the
script can also update existing entries along with adding new ones. It’s
important to note that when replicating an enhydris database we should
ALWAYS exclude the UserProfile since we don’t want user specific data to be
transfered along with the rest of the database.
When adding a cronjob, if you don’t want a regural mail to come after every
sync, you should use the --silent
option which redirects stdout
to
/dev/null
and only prints stderr
. This, coupled with the -W
python
flag can be used to make a cronjob send an email only whenever a problem was
encountered. A sample cronjob which runs every night would be something like
this:
1 0 * * * /usr/bin/python -Wignore manage.py hcore_remotesyncdb -a hcore -r itia.hydroscope.gr -e UserProfile --silent
How stuff works
In this section, we’ll analyze the replication script and see how it operates behind the scenes. Of course, if you want to understand how it works it’s probably better if you looked directly into its source code. Regarding the API which provides us with the database objects, it’s been fully documented here. Here, we’ll see how the replication script handles that data and adds it in the local database.
One important thing that you should be familiar with before we delve into the code is the difficulties that we came across when trying to implement this feature. Postgres (and most databases by design) keep track of foreign keys using the primary key of an object which in all of enhydris models happens to be the object id. Since we want to aggregate multiple instances into one, it’s only natural that there will be id collissions should we try to load the objects in the database while keeping their original id. Thus, we decided that keeping the ids intact was not an option and we had to find a way to preserve foreign keys and many to many relations without counting on object ids.
The best workaround is to add the objects without their foreign keys and
many to many relationships and once the objects are in the database we
could reinitialize all object relationships. To do that, we added two
extra fields on all top-level objects named original_id
and
original_db
which can be used to identify a specific object during the
syncing process given that we know its id and the database that we’re
pulling the data from. Now the only thing was to somehow store the foreign
relations in a way that could be parsed easily and quite fast after the
object initialization. This was achieved using a multilevel dictionary
which stores all object foreign relations and parsing this would be a
breeze using python’s optimized dictionary parsing routines.
Of course, that’s when the real problems surfaced. Many objects have
Null=false
in some foreign keys which caused the replication to fail
when trying to save objects with null foreign keys. In order to circumvent
that, when firing up the replication script we create a set of
Dummy Objects
aka objects that have null values and are used to
fill-in the not-Null foreign key dependencies of the to-be-installed
objects. Once the replication objects are into the database, we delete the
Dummy Objects and update the foreign relations to the original ones which
we have stored in the dictionary mentioned above. This may be a slow
process but is the only feasible solution that we came up with at the
time.
Having said all that, we can see what the workflow of the script looks
like. First of all, given the application name, it tries to import the
specified app and list all available models in it. Using a multipass
bubblesort algorithm, it sorts all models using their dependencies as
specified in the f_dependencies
model field and given that there are
no circular dependencies, the final list contains the models in the
correct replication order.
Using the model list, the script asks from the remote instance the JSON
fixture of each model in the list which is fetched and saved in a temporary
dir (by default this is /tmp
). Once all JSON fixtures have been
fetched, the script creates the generic objects and then deserializes
each JSON file in the same order it was fetched. For each object within
the fixture, it first strips all foreign relations and reinitializes the
not-null ones using the generic objects. Also, the fields original_id and
original_db are filled in and the foreign keys and many to many relations
are saved in a multilevel dictionary for future reference.
Once the deserialization of all fixtures has been completed, all objects are saved under the same transaction management because we don’t want to have any objects left out from the replication routine. If everything has been completed successfully, the script reinitializes all foreign keys and many to many relations from the dictionary and exits after cleaning up. If a problem occurs all transactions are rolled back and the database is exactly as it was before the replication attempt.
Note:
The generic objects which are used to fill temporary Not Null foreign relations are handcrafted. This means that should the Enhydris database schema change drastically, this would probably require an update as well.