Evolutionary Neural Network AutoML: LEAF Explained


Contents

 1. Introduction
 2. Related Work on Hyperparameter Tuning
 3. Related Work on Architecture Search
 4. LEAF AutoML Overview
 5. Comparing LEAF vs. MOE (AutoML from Yelp) vs. Google AutoML vs. others

INTRODUCTION

AutoML is expected to handle the following: algorithm selection, hyper-parameter tuning of models, iterative modeling, and model assessment. In other words, AutoML's core tasks are hyper-parameter tuning and algorithm selection. For those not familiar with AutoML, I recommend the following link: https://determined.ai/blog/neural-architecture-search/. The author clearly explains the definitions of AutoML, neural architecture search, and hyper-parameter optimization. CASH (Combined Algorithm Selection and Hyper-parameter optimization) is the challenge of finding, for a given task, both the best algorithm or neural network model and the best set of hyper-parameters for it.

But why do we need AutoML? There are two main objectives that AutoML can serve:

 * to find innovative configurations of DNNs that perform well without the need for domain experts (HYPER-PARAMETER AND ARCHITECTURE SEARCH)
 * to find configurations that are small enough to be practical, such as DNNs running on a mobile phone (NETWORK COMPLEXITY)


BACKGROUND

Many techniques are used in AutoML research; the popular ones are reinforcement learning, evolutionary algorithms, and Bayesian optimization. Most of this research does "partial optimization": it only attempts to optimize one of the two aspects of the CASH challenge, either hyper-parameter tuning (HPT) or neural architecture search (NAS).

BACKGROUND: HYPER-PARAMETER TUNING

Popular but less sophisticated techniques for HPT are grid search and random search. Grid search is used when no hyper-parameters are KNOWN TO BE MORE IMPORTANT THAN OTHERS; otherwise, random search is preferred. Furthermore, the cost of running grid search increases exponentially as the number of hyper-parameters increases.
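To make the difference concrete, here is a minimal sketch of grid search versus random search over a toy search space. The `train_and_score` function and the hyper-parameter values are made up for illustration; they are not from the paper.

```python
import itertools
import random

def train_and_score(params):
    # Stand-in for real training: the "best" configuration here is
    # lr=1e-3, dropout=0.25, num_layers=2 (purely illustrative).
    return -(abs(params["learning_rate"] - 1e-3)
             + abs(params["dropout"] - 0.25)
             + abs(params["num_layers"] - 2))

search_space = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "dropout": [0.0, 0.25, 0.5],
    "num_layers": [1, 2, 3],
}

# Grid search: evaluate every combination. The number of trials grows
# exponentially with the number of hyper-parameters (3**3 = 27 here).
grid_trials = [dict(zip(search_space, values))
               for values in itertools.product(*search_space.values())]
best_grid = max(grid_trials, key=train_and_score)

# Random search: sample a fixed budget of configurations, which is preferred
# when some hyper-parameters matter much more than others.
budget = 10
random_trials = [{k: random.choice(v) for k, v in search_space.items()}
                 for _ in range(budget)]
best_random = max(random_trials, key=train_and_score)

print(best_grid, best_random)
```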


A more sophisticated HPT technique is Bayesian optimization with Gaussian processes (GP). (Video reference for more detail: GAUSSIAN PROCESSES.) It first fits a probability distribution over functions (the Gaussian process) to the observed values of the objective function, and then USES THAT DISTRIBUTION TO DETERMINE WHERE TO SAMPLE NEXT. The disadvantage of GPs is that they are computationally intensive and scale cubically with the number of sample points (more points mean more information and better predictions, but much more compute). In addition, once the number of hyper-parameters grows beyond roughly 10 to 15, a GP's ability to find the best hyper-parameters drops noticeably.
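Here is a minimal sketch of one such GP-based optimization loop, using scikit-learn's GaussianProcessRegressor and an upper-confidence-bound acquisition rule. The toy objective and the single log-learning-rate search space are my own illustration, not the paper's setup.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Toy objective: pretend validation accuracy as a function of log10(learning
# rate), peaking at a learning rate of 1e-3 (purely illustrative).
def objective(log_lr):
    return float(np.exp(-(log_lr + 3.0) ** 2))

candidates = np.linspace(-6, -1, 200).reshape(-1, 1)     # log10(lr) grid
X = [[-5.0], [-1.5]]                                      # two initial samples
y = [objective(x[0]) for x in X]

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(10):
    gp.fit(np.array(X), np.array(y))              # O(n^3) in the number of samples
    mean, std = gp.predict(candidates, return_std=True)
    ucb = mean + 1.96 * std                       # favor promising or uncertain points
    x_next = candidates[int(np.argmax(ucb))]      # decide where to sample next
    X.append([float(x_next[0])])
    y.append(objective(float(x_next[0])))

best_log_lr = X[int(np.argmax(y))][0]
print("best learning rate ~", 10 ** best_log_lr)
```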


AutoML is expected to be a very profitable business. For this reason, many companies are building their own AutoML systems, including giants such as Google (Google AutoML), Yelp (Metric Optimization Engine, MOE), and Microsoft (TLC). Each company has its own strategy. For example, Google AutoML is a black box, while MOE is transparent but only does hyper-parameter tuning.

The presented paper introduces a new algorithm called LEAF (Learning Evolutionary AI Framework). The advantage of LEAF is that it does both HPT and NAS. At its core, LEAF uses another evolutionary algorithm, called CoDeepNEAT. An evolutionary algorithm improves iteratively by selecting the best of the "population" each time the solution "evolves" (select samples, evolve, repeat).
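In skeleton form, that select-evolve-repeat loop looks something like the following. The genome, the fitness stand-in, and the mutation rates are hypothetical; the sketch only shows the shape of the loop.

```python
import random

POP_SIZE, GENERATIONS, ELITE = 20, 30, 5

def random_individual():
    # Hypothetical genome: one real-valued and one integer hyper-parameter.
    return {"learning_rate": 10 ** random.uniform(-5, -1),
            "num_layers": random.randint(1, 6)}

def fitness(ind):
    # Stand-in for "train the network and measure validation accuracy".
    return -(abs(ind["num_layers"] - 3) + abs(ind["learning_rate"] - 1e-3) * 100)

def mutate(ind):
    child = dict(ind)
    child["learning_rate"] *= 10 ** random.gauss(0, 0.3)    # perturb in log space
    if random.random() < 0.3:
        child["num_layers"] = max(1, child["num_layers"] + random.choice([-1, 1]))
    return child

population = [random_individual() for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    # Select: keep the fittest individuals of the current population.
    population.sort(key=fitness, reverse=True)
    parents = population[:ELITE]
    # Evolve: refill the population with mutated copies of the parents.
    population = parents + [mutate(random.choice(parents))
                            for _ in range(POP_SIZE - ELITE)]
# Repeat until the budget is exhausted; the fittest individual is the answer.
print(max(population, key=fitness))
```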


BACKGROUND: NEURAL ARCHITECTURE SEARCH

Reinforcement learning is often used in NAS, but its downfall is that its optimization process can only be a linear or tree-like structure, so the search space must be selected by the user beforehand. It is also VERY computationally intensive (it fails to satisfy the second objective mentioned above). Evolutionary NAS models are often modified versions of the NEAT model; other techniques include genetic programming and hierarchical evolution. An EA is a black box, which is an advantage in that it allows a larger search space.

LEAF OVERVIEW

The search space must be defined before running LEAF, e.g. LSTMs or CNNs. There are three components of LEAF:

 * Algorithm layer
 * System layer
 * Problem-domain layer

Figure 1: A visualization of LEAF and its internal subsystems. The three main components are: (1) the algorithm layer, which uses CoDeepNEAT to evolve hyperparameters or neural networks; (2) the system layer, which helps train and evaluate the networks evolved by the algorithm layer; and (3) the problem-domain layer, which utilizes the two previous layers to optimize DNNs.

ALGORITHM LAYER


Terminology:

>  The population divides into many species.
>  One species contains many chromosomes (individuals).
>  Species = Modules
>  Chromosome, Individual = Graph
>  Node = Layer
>  An assembled network is composed of many chromosomes within the same species.

LEAF algorithm, step by step (a simplified sketch of the assembly and fitness-attribution steps follows this list):

 * Group the population into species; each species evolves independently, in proportion to its fitness.
 * A chromosome (graph) evolves by replacing its nodes with members of its species (i.e. nodes are replaced by other chromosomes within the same species).
 * Each node within a chromosome mutates its hyper-parameters (through a Gaussian distribution and random bit flips).
 * The fitness of each assembled network as a whole is attributed back to the BLUEPRINT and the MODULES, as the average fitness of all the assembled networks containing that blueprint or module (ONE BLUEPRINT IS PART OF MANY ASSEMBLIES -> BLUEPRINT FITNESS = AVERAGE OVER ALL ITS ASSEMBLIES).
 * Repeat until a stopping criterion is satisfied.
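As a rough illustration of the assembly and fitness-attribution steps above, here is a highly simplified sketch. The species, module names, and random "fitness" values are placeholders of my own; in LEAF the assembled candidates are actual DNNs that get trained and evaluated.

```python
import random
from collections import defaultdict

# Illustrative populations: two module species, each holding a few module
# "chromosomes", and two blueprints whose nodes point to species ids.
species = {0: ["conv_small", "conv_large"], 1: ["dense_small", "dense_large"]}
blueprints = [[0, 1], [0, 0, 1]]            # each blueprint node -> a species id

def evaluate(assembled_modules):
    # Stand-in for training the assembled DNN and measuring its accuracy.
    return random.random()

blueprint_scores = defaultdict(list)
module_scores = defaultdict(list)

# Assemble several networks per blueprint by sampling one module from each
# species the blueprint points to, then attribute the network's fitness back
# to the blueprint and to every module it used.
for b_id, blueprint in enumerate(blueprints):
    for _ in range(3):
        modules = [random.choice(species[s]) for s in blueprint]
        fit = evaluate(modules)
        blueprint_scores[b_id].append(fit)
        for m in modules:
            module_scores[m].append(fit)

# Blueprint/module fitness = average fitness of all assembled networks
# that contained that blueprint or module.
blueprint_fitness = {b: sum(v) / len(v) for b, v in blueprint_scores.items()}
module_fitness = {m: sum(v) / len(v) for m, v in module_scores.items()}
print(blueprint_fitness, module_fitness)
```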


A node holds the local hyper-parameters: the type of layer and the properties of that layer (kernel size, number of neurons, activation function). A chromosome holds the global hyper-parameters, such as the learning rate, training algorithm, and data preprocessing.

CoDeepNEAT is similar to a conventional genetic algorithm in that there is elitist selection and the hyperparameters UNDERGO UNIFORM MUTATION AND CROSSOVER. It still has NEAT's speciation mechanism, which PROTECTS NEW AND INNOVATIVE HYPERPARAMETERS. On the other hand, CoDeepNEAT differs from NEAT in that a node is NOT A NEURON BUT A LAYER; its main purpose is to narrow down the search space. A node contains real-valued and binary parameters, and mutation is done in two ways:

 * mutation through a uniform Gaussian distribution
 * random bit flipping

Unlike NEAT, edges carry no weights; they only indicate connections between nodes (this removes the bias towards particular combinations of a few nodes and instead focuses on the graph as a whole).
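A tiny sketch of those two mutation operators on a node's local hyper-parameters. The dictionary encoding of a node is hypothetical, only meant to show Gaussian perturbation versus bit flipping.

```python
import random

def mutate_node(node, sigma=0.1, flip_prob=0.05):
    """Mutate one node's local hyper-parameters (illustrative encoding):
    real-valued parameters get Gaussian noise, binary ones get bit flips."""
    mutated = dict(node)
    for key, value in node.items():
        if isinstance(value, bool):                 # binary parameter
            if random.random() < flip_prob:
                mutated[key] = not value
        elif isinstance(value, float):              # real-valued parameter
            mutated[key] = value + random.gauss(0.0, sigma)
    return mutated

# Example node: a layer with a real-valued dropout rate and binary switches.
layer = {"dropout": 0.25, "use_batchnorm": True, "use_maxpool": False}
print(mutate_node(layer))
```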


SYSTEM LAYER

LEAF runs the search in parallel, with GPUs provisioned on a cloud service; the paper runs on Microsoft Azure. LEAF uses a completion-service API called StudioML. Step by step:

 * The algorithm layer sends networks that are ready for fitness evaluation, serialized as Keras JSON, to the system layer's server node.
 * The system layer submits each network to the completion service.
 * Results are returned one at a time, with no ordering guarantee.
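Conceptually, this hand-off behaves like an asynchronous job queue. The sketch below mimics that behavior locally with a thread pool; it is not StudioML's actual API, and the JSON payload and dummy fitness values are placeholders.

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed

def evaluate_remotely(network_json):
    # Stand-in for the completion service: in the real system the Keras-JSON
    # description would go to a cloud worker that trains and evaluates the
    # network on a GPU instance.
    network = json.loads(network_json)
    return network["id"], 0.5 + 0.05 * network["id"]     # (id, dummy fitness)

# Networks produced by the algorithm layer, serialized as (Keras-style) JSON.
pending = [json.dumps({"id": i, "layers": ["conv", "dense"]}) for i in range(8)]

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(evaluate_remotely, net) for net in pending]
    # Results come back one at a time, in completion order with no ordering
    # guarantee, so each fitness is matched back to its network by id.
    for future in as_completed(futures):
        net_id, fitness = future.result()
        print(f"network {net_id}: fitness {fitness:.2f}")
```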


PROBLEM-DOMAIN LAYER

LEAF can do NAS for both single-task and multitask problems. The foundation is the SOFT-ORDERING MULTITASK ARCHITECTURE, where each level of a fixed linear DNN consists of a fixed number of modules (one task level has many species). These modules are then used to a different degree for the different tasks (the optimization is customized to each task). LEAF extends this multitask learning (MTL) architecture by coevolving both the module architectures and the blueprint (the routing between the modules) of the DNN.
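To make "used to a different degree for different tasks" concrete, here is a simplified numpy sketch of the soft-ordering idea: a small set of shared modules is applied at every level, and each task mixes them with its own scalars. This only illustrates the underlying MTL architecture; LEAF additionally evolves the module architectures and the routing rather than using a fixed mixture like this.

```python
import numpy as np

rng = np.random.default_rng(0)
DEPTH, NUM_MODULES, WIDTH, NUM_TASKS = 3, 4, 16, 2

# Shared modules: plain dense layers with ReLU; the weights are random
# stand-ins for trained parameters.
modules = [rng.normal(size=(WIDTH, WIDTH)) for _ in range(NUM_MODULES)]

# Soft-ordering scalars: how strongly each task uses each module at each level.
scalars = rng.random(size=(NUM_TASKS, DEPTH, NUM_MODULES))
scalars /= scalars.sum(axis=-1, keepdims=True)            # normalize per level

def forward(x, task):
    # Every level applies all shared modules and mixes them with the
    # task-specific scalars, so tasks reuse the same modules to different degrees.
    h = x
    for d in range(DEPTH):
        outputs = [np.maximum(h @ W, 0.0) for W in modules]
        h = sum(scalars[task, d, m] * outputs[m] for m in range(NUM_MODULES))
    return h

x = rng.normal(size=(1, WIDTH))
print(forward(x, task=0).shape, forward(x, task=1).shape)
```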


LEAF also supports multi-objective search. In single-objective search, elitist modules and blueprints are ranked purely by the fitness function, and the top-ranked ones are selected. In multi-objective search, ranking is computed from Pareto fronts generated from the objectives (the paper uses two objectives, a primary and a secondary one), and one can vary the trade-off between the primary and secondary objectives accordingly. A PARETO FRONT contains many optimal solutions that cannot be compared against one another (neither dominates the other), while every solution not on the front is dominated by at least one solution on the front.
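In code, Pareto dominance and front extraction boil down to a few lines. The candidate tuples below are made-up (fitness, negative network size) pairs so that both objectives are maximized.

```python
def dominates(a, b):
    """a dominates b if a is no worse in every objective and strictly
    better in at least one (both objectives are maximized here)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    """Return the non-dominated points: solutions on the front cannot be
    compared with each other, and every other point is dominated by at
    least one of them."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# Illustrative candidates: (fitness, -network_size), made-up numbers.
candidates = [(0.90, -125_000), (0.89, -56_000), (0.85, -200_000), (0.80, -56_000)]
print(pareto_front(candidates))
```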


EXPERIMENTAL RESULTS

LEAF's performance can be compared to other AutoML techniques because LEAF has a clear trade-off between performance and time or cost. In the Wikipedia Comment Toxicity Classification domain, the task is to classify toxic versus non-toxic comments from 190K examples: 93K training, 31K validation, and 31K test. From the figure, you can see that LEAF did progressively better as the number of hours increased, and it achieved state-of-the-art performance, beating the hand-crafted solution (Kaggle best) after 40 generations, or 4619 hours of training.

At this point, you may ask yourself:

> Can LEAF also minimize the size of these networks without sacrificing much in performance?

The solution is multi-objective search: MAXIMIZE FITNESS AND MINIMIZE NETWORK SIZE. With this setup, LEAF actually converged faster during evolution, and, as expected, multi-objective LEAF was also able to DISCOVER NETWORKS WITH FEWER PARAMETERS. An interesting behavior of multi-objective search is that in early generations (around the 10th) it tends to have higher average fitness, but as the number of generations increases (around the 40th), the COST OF THE TRADE-OFF BETWEEN THE PRIMARY AND SECONDARY OBJECTIVES BECOMES OBVIOUS. As a result, the networks found by multi-objective search BECAME CONSIDERABLY LESS COMPLEX than those from single-objective search, while the gap in average fitness REMAINED NARROW between multi-objective and single-objective search. I suspect that

> the slope between the primary and secondary objectives in the Pareto front becomes STEEPER AS THE NUMBER OF GENERATIONS INCREASES.


Figure 5: Visualizations of networks with different complexity discovered by multi-objective LEAF. The performance of the smaller 56K network (Figure 5a) is nearly as good as that of the larger 125K network (Figure 5b). The smaller network uses only two instances of the module architecture shown in Figure 5c, while the larger network uses four instances of the same module. These two networks show that multi-objective LEAF is able to find good trade-offs between two conflicting objectives by cleverly using modules.

I believe that LEAF could potentially find species, chromosomes, and nodes that consistently perform well across a particular domain of problems.

> This is because LEAF can optimize any objective, and as one objective's trade-off becomes too severe, LEAF could just hop onto the next objective and start training again.

Maybe cycling through the objectives in a recursive fashion (OBJ1 -> OBJ2 -> OBJ3 -> OBJ1 -> ...) would help the overall result TO CONVERGE FASTER AND ACHIEVE BETTER FITNESS IN ALL OBJECTIVES (compared to optimizing all objectives at once), transferring information from one search space to another.