Verbalizing phylogenomic conflict: Representation of node congruence across competing reconstructions of the neoavian explosion

Nico M. Franz, Lukas J. Musher, Joseph W. Brown, Shizhuo Yu, Bertram Ludäscher, Sergei L. Kosakovsky Pond
2019 PLoS Computational Biology  
Phylogenomic research is accelerating the publication of landmark studies that aim to resolve deep divergences of major organismal groups. Meanwhile, systems for identifying and integrating the products of phylogenomic inference, such as newly supported clade concepts, have not kept pace. However, the ability to verbalize node concept congruence and conflict across multiple, in effect simultaneously endorsed phylogenomic hypotheses, is a prerequisite for building synthetic data environments for
biological systematics and other domains impacted by these conflicting inferences. Here we develop a novel solution to the conflict verbalization challenge, based on a logic representation and reasoning approach that utilizes the language of Region Connection Calculus (RCC-5) to produce consistent alignments of node concepts endorsed by incongruent phylogenomic studies. The approach employs clade concept labels to individuate concepts used by each source, even if these carry identical names. Indirect RCC-5 modeling of intensional (property-based) node concept definitions, facilitated by the local relaxation of coverage constraints, allows parent concepts to attain congruence in spite of their differentially sampled children. To demonstrate the feasibility of this approach, we align two recent phylogenomic reconstructions of higher-level avian groups that entail strong conflict in the "neoavian explosion" region. According to our representations, this conflict is constituted by 26 instances of input "whole concept" overlap. These instances are further resolvable in the output labeling schemes and visualizations as "split concepts", which provide the labels and relations needed to build truly synthetic phylogenomic data environments. Because the RCC-5 alignments fundamentally reflect the trained, logic-enabled judgments of systematic experts, future designs for such environments need to promote a culture where experts routinely assess the intensionalities of node concepts published by our peers, even and especially when we are not in agreement with each other.
Synthetic platforms for phylogenomic knowledge tend to manage conflict between different evolutionary reconstructions in the following way: "If we do not agree, then it is either our view over yours, or we just collapse all conflicting node concepts into polytomies". We argue that this is not an equitable way to realize synthesis in this domain.
For instance, it would not be an adequate solution for building a unified data environment where authors can endorse and yet also reconcile their diverging perspectives, side by side. Hence, we develop a novel system for verbalizing (i.e., consistently identifying and aligning) incongruent node concepts that reflects a more forward-looking attitude: "We may not agree with you, but nevertheless we understand your phylogenomic inference well enough to express our disagreements in a logic-compatible syntax. We can therefore maximize the translatability of data linked to our diverging phylogenomic hypotheses". We show that achieving phylogenomic synthesis fundamentally depends on the application of trained expert judgment to assert parent node congruence in spite of incongruently sampled children.
... are enhanced to highlight congruence [4], rooted galled networks [16] or neighbor-net visualizations [17] that show split networks for conflicting topology regions, or simply provide a consensus tree in which incongruent bifurcating branch inferences are collapsed into a polytomy [6]. Verbalizing phylogenomic congruence and conflict in open, synthetic knowledge environments [13] constitutes a novel challenge for which traditional naming solutions in systematics are inadequate. The aforementioned studies implicitly support this claim. All use overlapping sets of Code-compliant [18] and other higher-level names in the Linnaean tradition, with sources including [19] or [20]. To identify these source-specific name usages, we will utilize the taxonomic concept label convention of [14]. Accordingly, name usages sec. 2014.JEA are prefixed with "2014.", whereas name usages sec. 2015.PEA are prefixed with "2015.". We diagnose the verbalization challenge as follows. (1) In some instances, identical clade names are polysemic (i.e., have multiple meanings) across studies.
For instance, 2015.Pelecaniformes excludes 2015.Phalacrocoracidae, yet 2014.Pelecaniformes includes 2014.Phalacrocoracidae, reflecting two incongruent meanings of "Pelecaniformes". (2) In other cases, two or more non-identical names have congruent meanings, e.g., 2015.Strisores and 2014.Caprimulgimorphae. (3) Names that are unique to just one study (e.g., 2015.Aequorlitornithes or 2014.Cursorimorphae) are not always reconcilable in meaning without additional human effort, thereby adding an element of referential uncertainty to the apparent conflict. (4) Lastly, many of the newly inferred and conflicting edges are not named at all. There is an implicit preference for labeling edges when suitable names are already available. However, unnamed edges can create situations where conflict cannot be verbalized and reconciled in a data environment, due to the lack of syntactic structure ("names"). Jointly, the effects of polysemic names, synonymous names, exclusive yet hard-to-reconcile names, and conflicting unnamed edges are symptomatic of an information culture that is not ready for the identifier and identifier-to-identifier relationship challenges inherent in representing phylogenomic conflict. Suppose we wish to build a collaborative knowledge environment towards inferring "the tree of life" (though see [12]). The design should allow us to individually represent and at the same time integrate conflicting hierarchies, from the tips to the root. The system should respond to name-based data queries across these hierarchies, and return whether they are congruent or how they conflict in meaning. Clearly, the name usages of each individual source are not suited for this integration task. Traditional Linnaean conventions allow names to have evolving phylogenomic meanings across hierarchies and are therefore too underpowered for our purpose [21].
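The kind of name-based query described above can be illustrated with a minimal sketch. It assumes that each labeled concept can be approximated extensionally as a set of terminal members, which is a deliberate simplification: the approach described in the abstract models intensional definitions with relaxed coverage constraints, and this sketch does not. The member sets, the placeholder terminals `t1` to `t4`, the helper names `rcc5` and `verbalize`, and the relation symbols are hypothetical choices made for illustration only; they are not the sampled taxa or the tooling of either study.

```python
from enum import Enum


class RCC5(Enum):
    """The five RCC-5 base relations (the symbols are an assumed notation)."""
    CONGRUENT = "=="
    INCLUDES = ">"       # left concept properly includes right concept
    INCLUDED_IN = "<"    # left concept is properly included in right concept
    OVERLAPS = "><"      # concepts share some, but not all, members
    EXCLUDES = "!"       # concepts share no members


def rcc5(a: frozenset, b: frozenset) -> RCC5:
    """Return the RCC-5 relation between two non-empty member sets."""
    if not a or not b:
        raise ValueError("RCC-5 regions must be non-empty")
    if a == b:
        return RCC5.CONGRUENT
    if a > b:            # proper superset
        return RCC5.INCLUDES
    if a < b:            # proper subset
        return RCC5.INCLUDED_IN
    if a & b:            # non-empty intersection, neither contains the other
        return RCC5.OVERLAPS
    return RCC5.EXCLUDES


# Hypothetical toy data keyed by taxonomic concept labels ("2014." / "2015.").
# Only the Phalacrocoracidae membership pattern follows the text; t1..t4 are
# placeholders, not real terminals.
CONCEPTS = {
    "2014.Pelecaniformes": frozenset({"t1", "t2", "Phalacrocoracidae"}),
    "2015.Pelecaniformes": frozenset({"t1", "t2"}),
    "2015.Phalacrocoracidae": frozenset({"Phalacrocoracidae"}),
    "2015.Strisores": frozenset({"t3", "t4"}),
    "2014.Caprimulgimorphae": frozenset({"t3", "t4"}),
}


def verbalize(label_a: str, label_b: str, concepts=CONCEPTS) -> str:
    """Express the relation between two labeled concepts as one statement."""
    relation = rcc5(concepts[label_a], concepts[label_b])
    return f"{label_a} {relation.value} {label_b}"


if __name__ == "__main__":
    # Same name, different meaning (polysemy) under the toy data.
    print(verbalize("2014.Pelecaniformes", "2015.Pelecaniformes"))
    # Different names, same meaning (synonymy) under the toy data.
    print(verbalize("2015.Strisores", "2014.Caprimulgimorphae"))
```

Keying the lookup on concept labels rather than bare names is what lets the two usages of "Pelecaniformes" coexist in one table and still be compared.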
At root, this is a novel conceptual challenge for systematics and comparative evolutionary biology, made imperative by the accelerated generation and ingestion of phylogenomic trees into open, dynamic knowledge bases for reliable integration and re-use [11, 13, 22, 23, 24]. The services that such environments aspire to provide require an appropriate theory of node identity, and hence a conception of multi-node congruence or incongruence across individual trees and entire synthesis versions. Here we propose a solution to the phylogenomic conflict representation challenge. This solution requires collaboration between systematic experts, platform designers, and users of phylogenomic information. It is an extension of prior "concept taxonomy" research [14, 25, 26], and deploys logic reasoning to align tree hierarchies based on Region Connection Calculus (RCC-5) assertions of node congruence [27, 28, 29]. We demonstrate the feasibility of this approach by aligning subregions and entire phylogenomic trees inferred by 2015.PEA and 2014.JEA. In doing so, we address key representation challenges, such as the paraphyly of classification schemes used to label tree regions, and the inference of higher-level node congruence in spite of differentially sampled terminals. The alignment products for this use case constitute ...
doi:10.1371/journal.pcbi.1006493 fatcat:5cvxrsc3trhm7ozcuekwvgnexy