Abstract:
Schemata and buiding blocks have been used in Genetic Programming
(GP) in several contexts including subroutines, theoretical analysis and
for empirical analysis. Of these three the least explored is empirical analysis.
This thesis presents a powerful GP empirical analysis technique for
analysis of all schemata of a given form occurring in any program of a
given population at scales not previously possible for the kinds of global
analysis performed.
There are many competing GP forms of schema and, rather than choosing
one for analysis, the thesis defines the match-tree meta-form of schema as
a general language expressing forms of schema for use by the analysis system.
This language can express most forms of schema previously used in
tree-based GP.
The new method can perform wide-ranging analyses on the prohibitively
large set of all schemata in the programs by introducing the concepts of
maximal schema, maximal program subset, representative set of schemata, and
representative program subset. These structures are used to optimize the
analysis, shrinking its complexity to a manageable size without sacrificing
the result.
Characterization experiments analyze GP populations of up to 501 60-
node programs, using 11 forms of schema including rooted-hyperschemata
and non-rooted fragments. The new method has close to quadratic complexity
on population size, and quartic complexity on program size. Efficacy
experiments present example analyses using the new method. The
experiments offer interesting insights into the dynamics of GP runs including
fine-grained analysis of convergence and the visualization of schemata
during a GP evolution. Future work will apply the many possible extensions of this new method
to understanding how GP operates, including studies of convergence, building
blocks and schema fitness. This method provides a much finer-resolution
microscope into the inner workings of GP and will be used to provide accessable
visualizations of the evolutionary process.