BFilter: Efficient XML Message Filtering and Matching in Publish/Subscribe Systems

Liang Dai, Chung-Horng Lung
2016 Journal of Software  
XML message filtering and matching are important operations for application-layer XML message multicast. As a publish/subscribe system and a specific case of content-based multicast in the application layer, XML message multicast depends heavily on the data filtering and matching processes. As XML applications emerge, efficient XML message filtering and matching become more desirable. Many XML filtering techniques have been proposed in the literature. Most of those techniques do not
... ss complex queries with predicates, twig patterns or branches; some require post-processing or a special coding scheme, which is either time consuming or difficult to manage when user queries change dynamically. This paper addresses this gap in the literature and proposes a new technique called BFilter, which performs XML message filtering and matching by leveraging branch points in both the XML publication document and the user requests or queries. BFilter evaluates user queries by matching branch points backward, delaying further matching until the branch points in the XML publication document and the user query match. With this backward branch point matching technique, XML message filtering can be performed more efficiently because the probability of mismatching during the matching process is reduced. A number of experiments have been conducted, and the results demonstrate that for complex queries BFilter performs better than the well-known YFilter.
from any location in the network. Network layer multicast, however, uses IP addresses to restrict receivers to certain subnets. This means that the receivers within a subnet are usually grouped geographically, which is not appropriate for pub/sub systems, where the receivers (subscribers) with a particular interest can be located anywhere in the network. Application layer multicast has a few disadvantages. First, packets are sent along the application overlay from a source to a destination instead of following the shortest path at the network layer. Thus, the path traversed by a packet may be longer than in network layer multicast. Second, duplicate packets may appear on some links. Thus, the main challenge in application layer multicast is for end systems to construct effective overlay structures.
In pub/sub systems, a subscriber registers a subscription with the pub/sub service and receives published messages that match the subscription. Naively, the sources (publishers) could send all the data to all subscribers and let the subscribers retain whatever they want. This approach is not efficient, because the many duplicated data packets reduce system throughput, waste network resources, and increase the processing overhead of the intermediate nodes. Many research efforts have discussed multicast in the context of pub/sub systems, e.g., [1]-[17]. Generally speaking, there are two ways to carry out multicast in this area. The first is to identify the subscribers by using the subscription information, and then send the appropriate data to these subscribers. Data matching can be performed either at the source or at some centralized brokers. The second method is to perform data matching on the fly. In this method, the source simply pushes the data into a network that has a multicast tree composed of routers, called brokers in this context. The application-layer routers or brokers on the tree have a filtering mechanism that dispatches the proper subsets of the received messages to their children. The children in turn perform data matching and dispatching, and forward the matched data to their own children. This continues until the filtered data reaches the subscribers. The first approach may use keyword-based multicast or distributed hash table-based multicast. Keyword-based multicast groups subscribers using the keywords in their subscriptions [2], [7], [18]-[22]. Distributed hash table-based multicast uses hash functions to assign keys to subscribers based on their subscriptions [23]. These methods are efficient in terms of delivery speed. However, the keyword-based approach is less expressive because the subscriptions contain only keywords. The distributed hash table approach is not content-aware.
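As an aside on the second method above (matching on the fly), the following is a minimal Python sketch of a broker tree in which each node forwards a message only to the children that accept it. It is an illustration under stated assumptions, not the scheme of any cited system: the `Node`, `subscriber`, `broker`, and `deliver` names, the dictionary message format, and the exact-match `topic` subscriptions are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Message = Dict[str, str]
Filter = Callable[[Message], bool]


@dataclass
class Node:
    """A broker (has children) or a subscriber (leaf) in the multicast tree."""
    name: str
    accepts: Filter
    children: List["Node"] = field(default_factory=list)

    def publish(self, msg: Message, delivered: List[str]) -> None:
        # Drop the message here if nothing at or below this node wants it.
        if not self.accepts(msg):
            return
        if not self.children:
            delivered.append(self.name)  # leaf: a matching subscriber
        for child in self.children:
            child.publish(msg, delivered)


def subscriber(name: str, topic: str) -> Node:
    # Hypothetical subscription: exact match on a "topic" field.
    return Node(name, lambda m: m.get("topic") == topic)


def broker(name: str, children: List[Node]) -> Node:
    # A broker forwards a message if at least one of its children accepts it.
    return Node(name, lambda m: any(c.accepts(m) for c in children), children)


def deliver(root: Node, msg: Message) -> List[str]:
    """Push one message into the tree and return the subscribers it reached."""
    delivered: List[str] = []
    root.publish(msg, delivered)
    return delivered


tree = broker("root", [
    broker("b1", [subscriber("s1", "sports"), subscriber("s2", "news")]),
    broker("b2", [subscriber("s3", "sports")]),
])

print(deliver(tree, {"topic": "sports"}))   # ['s1', 's3']
print(deliver(tree, {"topic": "weather"}))  # []
```

In this sketch a message that no subscriber wants is dropped at the root, so it never travels further down the tree. A real broker would keep an aggregated filter rather than re-evaluating every descendant, which is where the cost of the filtering algorithm matters.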
In both of these methods, data matching is based on the key or keywords but not the content. The second approach delivers data according to the content. The subscription description is used to perform the matching. The subscription can be expressed either as an n-tuple containing n information spaces, or as XPath expressions [8], [24]-[26]. An XPath expression is used for addressing portions of an XML file. XPath is more expressive than the n-tuple form [24]-[26] and has been used in this research. An XML file is a tree-based structure for describing information. The data content is placed between a start tag and an end tag. The pair of tags not only scopes the data it contains, but also describes the data, possibly with some constraints on the tags. One XML document has one root tag pair. The root tag pair can have child tag pairs, the children can have their own child tag pairs, and so on. This structure forms a tree with a single root. Because an XML file is semi-structured, filters can naturally be applied along the hierarchy to perform data matching and delivery. XML-based multicast can properly match and deliver messages to subscribers. However, indexing and identifying the elements in an XML file is more difficult than in the content-based message format, which can be considered to be an n-dimensional array containing keywords, so the filtering process in each node is time consuming. Hence, the performance of XML-based multicast depends heavily on the approach used to process the XML message. A preliminary report described the basic idea of a novel XML message filtering algorithm, BFilter [27].
P(Q) = P(last branch) × P(rest | last branch), where P(last branch) is the probability of matching the last branch point of Q and P(rest | last branch) is the probability of matching the rest of Q given that the last branch is matched.
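The decomposition of P(Q) above can be checked empirically on a toy collection. The sketch below uses Python's standard `xml.etree.ElementTree` and its limited XPath support. It makes several assumptions that do not come from the paper: the four sample documents are invented, the query is taken to be `/a/b[c][d]`, and the element `b` with children `c` and `d` is treated as the "last branch point". It does not implement BFilter; it only illustrates the probability identity.

```python
from fractions import Fraction
import xml.etree.ElementTree as ET

# Hypothetical publications; none of these come from the paper.
DOCS = [
    "<a><b><c/><d/></b></a>",  # branch point matches, rest of Q matches
    "<x><b><c/><d/></b></x>",  # branch point matches, rest of Q does not
    "<a><b><c/></b></a>",      # branch point does not match (no d child)
    "<a><e/></a>",             # branch point does not match (no b at all)
]


def last_branch_matches(root: ET.Element) -> bool:
    """Assumed 'last branch point': some b with both a c child and a d child."""
    return bool(root.findall(".//b[c][d]"))


def query_matches(root: ET.Element) -> bool:
    """Assumed full query Q = /a/b[c][d]."""
    return root.tag == "a" and bool(root.findall("./b[c][d]"))


roots = [ET.fromstring(doc) for doc in DOCS]
with_branch = [r for r in roots if last_branch_matches(r)]

# Relative frequencies over the toy collection, kept exact with Fraction.
p_last = Fraction(len(with_branch), len(roots))
p_rest_given_last = Fraction(
    sum(query_matches(r) for r in with_branch), len(with_branch)
)
p_q = Fraction(sum(query_matches(r) for r in roots), len(roots))

# P(Q) = P(last branch) x P(rest | last branch)
assert p_q == p_last * p_rest_given_last
print(p_last, p_rest_given_last, p_q)  # 1/2 1/2 1/4
```

Two of the four documents fail at the branch point, so a matcher that tests the branch point first can discard them without examining the rest of the query.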
doi:10.17706/jsw.11.4.376-402 fatcat:i2o7zzxoyfaw3c7dzge6y5unyu