






































                                      SSlloonnyy--II
                        AA rreepplliiccaattiioonn ssyysstteemm ffoorr PPoossttggrreeSSQQLL

                               -II-mm-pp-ll-ee-mm-ee-nn-tt-aa-tt-ii-oo-nn--dd-ee-tt-aa-ii-ll-ss-

                                     _J_a_n _W_i_e_c_k
                                  Afilias USA INC.
                             Horsham, Pennsylvania, USA


                                      _A_B_S_T_R_A_C_T

                      This  document  describes several implementa
                 tion details of the Slony-I replication engine and
                 related components.



            Slony-I                      -i-            Working document


                                 TTaabbllee ooff CCoonntteennttss


            1. Control data
            1.1. Table sl_node
            1.2. Table sl_path
            1.3. Table sl_listen
            1.4. Table sl_set
            1.5. Table sl_table
            1.6. Table sl_subscribe
            1.7. Table sl_event
            1.8. Table sl_confirm
            1.9. Table sl_setsync
            1.10. Table sl_log_1
            1.11. Table sl_log_2
            2. Replication Engine Architecture
            2.1. Sync Thread
            2.2. Cleanup Thread
            2.3. Local Listen Thread
            2.4. Remote Listen Threads
            2.5. Remote Worker Threads




            11..  CCoonnttrrooll ddaattaa



              sl_node       no_id (PK), no_active, no_comment
              sl_path       pa_server (PK1), pa_client (PK2),
                            pa_conninfo, pa_connretry
              sl_listen     li_origin (PK1), li_provider (PK2),
                            li_receiver (PK3)
              sl_subscribe  sub_set (PK2), sub_provider,
                            sub_receiver (PK1), sub_forward, sub_active
              sl_set        set_id (PK), set_origin, set_comment
              sl_table      tab_id (PK), tab_reloid, tab_set,
                            tab_attkind, tab_comment
              sl_event      ev_origin (PK1), ev_seqno (PK2),
                            ev_timestamp, ev_minxid, ev_maxxid, ev_xip,
                            ev_type, ev_data1 ... ev_data8
              sl_confirm    con_origin, con_received, con_seqno,
                            con_timestamp
              sl_setsync    ssy_setid (PK), ssy_origin, ssy_seqno,
                            ssy_minxid, ssy_maxxid, ssy_xip,
                            ssy_action_list
              sl_log_[1|2]  sl_origin, sl_xid, sl_tableid,
                            sl_actionseq, sl_cmdtype, sl_cmddata

                                      Figure 1

                 Figure 1 shows the Entity Relationship Diagram of the
            Slony-I configuration and runtime data. Although Slony-I is
            a master-slave replication technology, the nodes building a
            cluster do not have any particular role. All nodes contain
            the same configuration data and run the same replication
            engine process. At any given time, a collection of tables,
            called a set, has one node as its origin. The origin of a
            table is the only node that permits updates by regular
            client applications. The fact that all nodes are
            functionally identical and share the entire configuration
            data makes failover and failback a lot easier. All the
            objects are kept in a separate namespace based on the
            cluster name.

            11..11..  TTaabbllee ssll__nnooddee

                 Lists all nodes that belong to the cluster. The
            attribute no_active is NOT intended for any short-term
            enable/disable games with the node in question. The
            transition of a node from disabled to enabled requires full
            synchronization with the cluster, possibly resulting in a
            full set copy operation.



            11..22..  TTaabbllee ssll__ppaatthh

                 Defines the connection information that the pa_client
            node would use to connect to the pa_server node, and the
            retry interval in seconds if the connection attempt fails.
            Not all nodes need to be able to connect to each other, but
            it is good practice to define all possible connections so
            that the configuration is in place for an eventual
            failover. An sl_path entry alone does not actually cause a
            connection to be established; that requires sl_listen
            and/or sl_subscribe entries as well.

            11..33..  TTaabbllee ssll__lliisstteenn

                 Specifies that the li_receiver node will select and
            process events originating on li_origin over the database
            connection to the node li_provider. In a normal master-
            slave scenario with a classical hierarchy, events will
            travel along the same paths as the replication data. But
            scenarios where multiple sets originate on different nodes
            can make it necessary to distribute events more
            redundantly.

            11..44..  TTaabbllee ssll__sseett

                 A set is a collection of tables and sequences that
            originates on one node and is the smallest unit that can be
            subscribed to by any other node in the cluster.

            11..55..  TTaabbllee ssll__ttaabbllee

                 Lists  the  tables  and their set relationship. It also
            specifies the attribute kinds of  the  table,  used  by  the
            replication  trigger to construct the update information for
            the log data.
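                 As an illustration of how an attribute-kind string
            could drive the trigger's key construction, consider the
            following sketch. The encoding assumed here (one character
            per column, 'k' marking key columns, 'v' plain data
            columns) and the helper name are illustrative, not taken
            from the Slony-I source.

            ```python
            # Sketch (hypothetical encoding): pick the columns the
            # replication trigger would use to identify a row, given
            # an attribute-kind string with one character per column.

            def key_columns(colnames, attkind):
                """Return the names of the columns marked 'k' (key)."""
                return [name for name, kind in zip(colnames, attkind)
                        if kind == 'k']

            # 'kkv': id and region form the key, price is plain data.
            print(key_columns(['id', 'region', 'price'], 'kkv'))
            ```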

            11..66..  TTaabbllee ssll__ssuubbssccrriibbee

                 Specifies what nodes are subscribed to what  data  sets
            and  where  they  actually get the log data from. A node can
            receive the data from the set origin or any other node  that
            is subscribed with forwarding (cascading).

            11..77..  TTaabbllee ssll__eevveenntt

                 This is the message passing table. A node generating
            an event (configuration change or data sync event) inserts
            a new row into this table and notifies all other nodes
            listening for events. A remote node listening for events
            will then select these records, change the local
            configuration or replicate data, store the sl_event row in
            its own local sl_event table, and notify there. This way,
            the event cascades through the whole cluster. For SYNC
            events, the columns ev_minxid, ev_maxxid and ev_xip contain
            the transaction's serializable snapshot information. This
            is the same


            information used by MVCC in PostgreSQL to tell if a
            particular change is already visible to the transaction or
            considered to be in the future. Data is replicated in
            Slony-I as single operations on the row level, but grouped
            into one transaction containing all the changes that
            happened between two SYNC events. Applying the last and the
            current SYNC event's transaction information according to
            the MVCC visibility rules is the filter mechanism that does
            this grouping.
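                 The grouping can be sketched as follows (illustrative
            Python, not Slony-I source). A change made by transaction
            xid is visible to a snapshot (minxid, maxxid, xip) if it
            committed before the snapshot's window, or falls inside the
            window without appearing in the in-progress list xip; a log
            row belongs to the interval between two SYNC events when it
            is visible to the newer snapshot but not to the older one.

            ```python
            # Sketch: how two SYNC snapshots group log rows. A snapshot
            # is (minxid, maxxid, xip); xip lists transactions that
            # were still in progress when the snapshot was taken.

            def visible(xid, snapshot):
                """True if a change by transaction xid is visible."""
                minxid, maxxid, xip = snapshot
                if xid < minxid:
                    return True        # committed before the window
                if xid >= maxxid:
                    return False       # started after the snapshot
                return xid not in xip  # in window: visible unless open

            def rows_between(log_rows, last_sync, this_sync):
                """Log rows that became visible between two SYNCs."""
                return [(xid, data) for xid, data in log_rows
                        if visible(xid, this_sync)
                        and not visible(xid, last_sync)]

            log = [(90, 'a'), (101, 'b'), (105, 'c'), (110, 'd')]
            last_sync = (100, 104, {101})  # xid 101 open at last SYNC
            this_sync = (106, 112, {110})  # xid 110 open at this SYNC
            print(rows_between(log, last_sync, this_sync))
            ```

            Here xid 90 was already replicated by the last SYNC, and
            xid 110 is still uncommitted, so only 101 and 105 fall into
            this group.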

            11..88..  TTaabbllee ssll__ccoonnffiirrmm

                 Every event processed by a node is confirmed in this
            table. The confirmations cascade through the system
            similarly to the events. The local cleanup thread of the
            replication engine periodically condenses this information
            and then removes all entries in sl_event that have been
            confirmed by all nodes.
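                 The condensing step can be sketched like this
            (illustrative Python, not the actual Cleanup() stored
            procedure): keep only the highest confirmed sequence number
            per (origin, receiver) pair, then treat an event as
            removable once every other node has confirmed at least its
            sequence number.

            ```python
            # Sketch: condense sl_confirm rows and find sl_event rows
            # that are safe to delete.

            def condense(confirm_rows):
                """Max confirmed seqno per (origin, receiver) pair."""
                best = {}
                for origin, receiver, seqno in confirm_rows:
                    key = (origin, receiver)
                    best[key] = max(best.get(key, 0), seqno)
                return best

            def removable_events(events, best, nodes):
                """Events (origin, seqno) confirmed by all other nodes."""
                out = []
                for origin, seqno in events:
                    others = [n for n in nodes if n != origin]
                    if all(best.get((origin, n), 0) >= seqno
                           for n in others):
                        out.append((origin, seqno))
                return out

            confirms = [(1, 2, 17), (1, 2, 19), (1, 3, 18)]
            best = condense(confirms)
            print(removable_events([(1, 17), (1, 18), (1, 19)],
                                   best, [1, 2, 3]))
            ```

            Node 3 has only confirmed up to event 18, so event 19 must
            be kept until that confirmation cascades in.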

            11..99..  TTaabbllee ssll__sseettssyynncc

                 This table describes, for the local node only, the
            current sync status of every subscribed data set. This
            status information is not duplicated to other nodes in the
            system. The information is used for two purposes. During
            replication, the node uses the transaction snapshot to
            identify the log rows that have not been visible during the
            last replication cycle. When a node does the initial data
            copy of a newly subscribed data set, it uses this
            information to remember what sync points and additional log
            data are already contained in this data snapshot.

            11..1100..  TTaabbllee ssll__lloogg__11

                 This table contains the actual row-level changes,
            logged by the replication trigger. The data is frequently
            removed by the cleanup thread after all nodes have
            confirmed the corresponding events.

            11..1111..  TTaabbllee ssll__lloogg__22

                 The system has the ability to switch between the
            sl_log_1 table and this one. Under normal circumstances it
            is better to keep the system using the same log table, with
            the cleanup thread deleting old log information and using
            vacuum to add the freed space to the free space map.
            PostgreSQL can use multiple blocks found in the free space
            map to better parallelize insert operations under high
            concurrency. In the case that nodes have been offline or
            have fallen far behind by other means, log data collecting
            in the table might have increased its size significantly.
            There is no other way than running a full vacuum to reclaim
            the space in such a case, but this would cause an exclusive
            table lock


            and  effectively  stop  the  application. To avoid this, the
            system can be switched to the other log table in this  case,
            and  after  the  old log table is logically empty, it can be
            truncated.
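                 A minimal sketch of such a decision rule, with a
            hypothetical bloat threshold and function name (not taken
            from the Slony-I source):

            ```python
            # Sketch: decide whether to keep deleting from the current
            # log table or switch to the spare one so the old table
            # can later be truncated instead of full-vacuumed.

            def choose_log_action(table_bytes, live_bytes,
                                  bloat_ratio=4.0):
                """'delete+vacuum' normally; 'switch' when the table
                has grown far beyond its live data (e.g. because a
                node was offline for a long time)."""
                if live_bytes == 0 or table_bytes / live_bytes >= bloat_ratio:
                    return 'switch'      # drain old table, then TRUNCATE
                return 'delete+vacuum'   # vacuum feeds the free space map
            ```

            The threshold itself is a tuning knob; the point is only
            that TRUNCATE reclaims space without the exclusive lock a
            long-running full vacuum would hold.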





            22..  RReepplliiccaattiioonn EEnnggiinnee AArrcchhiitteeccttuurree




                          ----S-Y-N-C----Sync Thread

                          ----------C-leanup Thread
                           CleanUp
                  |     No|tify, Event                     |       |
                  |       +-C-o-n-f-i-r-m---Local Listen          |       |
                  |       |                               |       |
                  |Local  |        Remote Listen          |Remote |
                  |  DB   |         1 thread peNro-t-i-f-y-,--E-v-e-n+t  DB   |
                  |       |   Eventevent provider         |       |

                                   Remote Worker
                          -----------1 thread per----D-a-t-a----
                     Event, Data, Cornefmiortme node  Confirm

                                      Figure 2


                 Figure 2 illustrates the thread architecture of the
            Slony-I replication engine. It is important to keep in mind
            that there is no predefined role for any of the nodes in a
            Slony-I cluster. Thus, this engine runs once per database
            that is a node of any cluster, and all the engines together
            form one distributed replication system.

            22..11..  SSyynncc TThhrreeaadd

                 The Sync Thread maintains one connection to the local
            database. At a configurable interval it checks whether the
            action sequence has been modified, which indicates that
            some replicable database activity has happened. It then
            generates a SYNC event by calling CreateEvent(). There are
            no interactions with other threads.
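                 A simplified, in-memory sketch of that polling loop
            (create_event() stands in for the CreateEvent() stored
            procedure; there is no database access here):

            ```python
            # Sketch: the Sync Thread emits a SYNC event only when the
            # action sequence has advanced since the last interval.

            class SyncThread:
                def __init__(self):
                    self.last_seen = 0
                    self.events = []

                def create_event(self, ev_type):
                    """Stand-in for the CreateEvent() stored procedure."""
                    self.events.append(ev_type)

                def check_once(self, current_action_seq):
                    """One polling interval of the sync loop."""
                    if current_action_seq != self.last_seen:
                        self.create_event('SYNC')
                        self.last_seen = current_action_seq

            s = SyncThread()
            for seq in [0, 3, 3, 7]:   # action sequence at each interval
                s.check_once(seq)
            print(s.events)            # two intervals saw activity
            ```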

            22..22..  CClleeaannuupp TThhrreeaadd

                 The Cleanup Thread maintains one connection to the
            local database. At a configurable interval it calls the
            Cleanup() stored procedure, which removes old confirm,
            event and log data. At another interval it vacuums the
            confirm, event and log tables. There are no interactions
            with other threads.

            22..33..  LLooccaall LLiisstteenn TThhrreeaadd

                 The Local Listen Thread maintains one connection to
            the local database. It waits for "Event" notifications and
            scans for events that originate at the local node. When it
            receives new configuration events, caused by administrative
            programs


            calling the stored procedures to change the cluster
            configuration, it modifies the in-memory configuration of
            the replication engine accordingly.

            22..44..  RReemmoottee LLiisstteenn TThhrreeaaddss

                 There is one Remote Listen Thread per remote node that
            the local node receives events from (event provider).
            Regardless of the number of nodes in the cluster, a typical
            leaf node will only have one Remote Listen Thread, since it
            receives events from all origins through the same provider.
            A Remote Listen Thread maintains one database connection to
            its event provider. Upon receiving notifications for events
            or confirmations, it selects the new information from the
            respective tables and feeds it into the internal message
            queues for the worker threads. The engine starts one node-
            specific worker thread (see below) per remote node.
            Messages are forwarded on an internal message queue to this
            node-specific worker for processing and confirmation.
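                 The queue hand-off between listen and worker threads
            can be sketched with Python threads (illustrative only; the
            real engine is written in C and talks to PostgreSQL):

            ```python
            # Sketch: a Remote Listen Thread dispatches events to one
            # queue per remote origin node; the matching Remote Worker
            # Thread drains its own queue.
            import queue
            import threading

            worker_queues = {1: queue.Queue(), 2: queue.Queue()}
            processed = []
            lock = threading.Lock()

            def remote_listen(incoming):
                """Dispatch each (origin, seqno) event to its queue."""
                for origin, seqno in incoming:
                    worker_queues[origin].put((origin, seqno))
                for q in worker_queues.values():
                    q.put(None)              # sentinel: no more events

            def remote_worker(origin):
                q = worker_queues[origin]
                while (event := q.get()) is not None:
                    with lock:
                        processed.append(event)  # apply + confirm here

            workers = [threading.Thread(target=remote_worker, args=(o,))
                       for o in (1, 2)]
            for w in workers:
                w.start()
            remote_listen([(1, 10), (2, 5), (1, 11)])
            for w in workers:
                w.join()
            print(sorted(processed))
            ```

            Each worker sees its origin's events in arrival order, so
            per-origin event ordering is preserved without any locking
            between workers.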

            22..55..  RReemmoottee WWoorrkkeerr TThhrreeaaddss

                 There is one Remote Worker Thread per remote node. A
            remote worker thread maintains one local database
            connection to do the actual replication data application,
            event storing and confirmation. Every set originating on
            the remote node the worker is handling has one data
            provider (which may, but need not, be identical to the
            event provider). Per distinct data provider over these
            sets, the worker thread maintains one database connection
            to perform the actual replication data selection. A remote
            worker thread waits on its internal message queue for
            events forwarded by the remote listen thread(s). It then
            processes these events, including data selection,
            application and confirmation. This also includes
            maintaining the engine's in-memory configuration
            information.
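                 The connection grouping can be sketched as follows
            (illustrative Python, not Slony-I source): sets are grouped
            by their data provider, so the worker opens one data
            connection per distinct provider rather than one per set.

            ```python
            # Sketch: a worker handling one remote origin needs one
            # data connection per distinct data provider across that
            # origin's subscribed sets.

            def provider_connections(subscriptions):
                """subscriptions: set_id -> data provider node id.
                Returns provider -> list of sets served over that
                connection."""
                conns = {}
                for set_id, provider in subscriptions.items():
                    conns.setdefault(provider, []).append(set_id)
                return conns

            # Sets 1 and 2 come from node 10, set 3 from node 20:
            # only two data connections are needed.
            print(provider_connections({1: 10, 2: 10, 3: 20}))
            ```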

