Concept: scheduler (category: Apache Airflow)

This is an excerpt from Manning's book Data Pipelines with Apache Airflow MEAP V05.
The Airflow webserver visualizes the DAGs parsed by the scheduler and provides the main interface for users to monitor DAG runs and their results.
Figure 1.8 Overview of the main components involved in Airflow (e.g., the Airflow webserver, scheduler, and workers).
The scheduler’s responsibility is twofold: it parses DAG files (storing the extracted DAG and task information in the database) and it determines which tasks are ready to run and offers them to the executor.
Figure 12.4 The ideal flow of a task, and the task state for which the components of the scheduler are responsible. The dotted line represents the full scheduler responsibility. When running with the SequentialExecutor or LocalExecutor, this is a single process; the CeleryExecutor and KubernetesExecutor run the task executor in separate processes, designed to scale out over multiple machines.
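For reference, the sketch below shows the kind of DAG file the scheduler picks up from the DAGs folder, parses, and schedules. The dag_id and task ids match the "hello_world" DAG that appears in the logs later in this section; the operators, schedule interval, and start date are assumptions made purely for illustration.

```python
# hello_world.py - a minimal sketch of a DAG file the scheduler could process.
# The dag_id and task ids mirror the log excerpts below; the operators,
# schedule_interval, and start_date are illustrative assumptions.
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.utils.dates import days_ago

dag = DAG(
    dag_id="hello_world",
    start_date=days_ago(3),
    schedule_interval="@daily",  # the scheduler creates one DAG run per day
)

hello = BashOperator(task_id="hello", bash_command="echo hello", dag=dag)
world = BashOperator(task_id="world", bash_command="echo world", dag=dag)

hello >> world  # "world" is only queued after "hello" has succeeded
```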
Unlike the webserver, the scheduler writes logs to files by default. Looking at the $AIRFLOW_HOME/logs directory again, we see various files related to scheduler logs:
.
├── dag_processor_manager
│   └── dag_processor_manager.log
└── scheduler
    └── 2020-04-14
        ├── hello_world.py.log
        └── second_dag.py.log

This directory tree is the result of processing two DAGs, "hello_world" and "second_dag". Every time the scheduler processes a DAG file, a number of lines are written to the corresponding log file. These lines are key to understanding how the scheduler operates. Let’s take a look at hello_world.py.log:
Listing 12.10 With each scheduler iteration, DAG files are read and corresponding DAGs/tasks are created
[2020-04-14 17:06:05,310] {scheduler_job.py:154} INFO - Started process (PID=46) to work on /opt/airflow/dags/hello_world.py
[2020-04-14 17:06:05,314] {scheduler_job.py:1562} INFO - Processing file /opt/airflow/dags/hello_world.py for tasks to queue #A
[2020-04-14 17:06:05,316] {logging_mixin.py:112} INFO - [2020-04-14 17:06:05,316] {dagbag.py:396} INFO - Filling up the DagBag from /opt/airflow/dags/hello_world.py
[2020-04-14 17:06:05,347] {scheduler_job.py:1574} INFO - DAG(s) dict_keys(['hello_world']) retrieved from /opt/airflow/dags/hello_world.py #B
[2020-04-14 17:06:05,471] {scheduler_job.py:1284} INFO - Processing hello_world #C
[2020-04-14 17:06:05,533] {scheduler_job.py:1294} INFO - Created <DagRun hello_world @ 2020-04-11 00:00:00+00:00: scheduled__2020-04-11T00:00:00+00:00, externally triggered: False> #D
[2020-04-14 17:06:05,541] {scheduler_job.py:759} INFO - Examining DAG run <DagRun hello_world @ 2020-04-11 00:00:00+00:00: scheduled__2020-04-11T00:00:00+00:00, externally triggered: False> #E
[2020-04-14 17:06:05,584] {scheduler_job.py:447} INFO - Skipping SLA check for <DAG: hello_world> because no tasks in DAG have SLAs #F
[2020-04-14 17:06:05,595] {scheduler_job.py:1640} INFO - Creating / updating <TaskInstance: hello_world.hello 2020-04-11 00:00:00+00:00 [scheduled]> in ORM #G
[2020-04-14 17:06:05,617] {scheduler_job.py:1640} INFO - Creating / updating <TaskInstance: hello_world.world 2020-04-11 00:00:00+00:00 [scheduled]> in ORM #G
[2020-04-14 17:06:05,636] {scheduler_job.py:162} INFO - Processing /opt/airflow/dags/hello_world.py took 0.327 seconds #H

These steps (processing a DAG file, loading the DAG objects from the file, and checking whether conditions such as the DAG schedule are met) are executed over and over again and form part of the core functionality of the scheduler. From these logs we can derive whether or not the scheduler is working as intended.
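The same cycle can be summarized in simplified Python. The sketch below is only a conceptual illustration of the steps visible in the log above (decide whether a schedule interval has passed, create a DAG run, create its task instances in the "scheduled" state); it is not Airflow's actual implementation, and every class and function name in it is made up for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List

# Hypothetical, heavily simplified stand-ins for Airflow's internals.
@dataclass
class Task:
    task_id: str

@dataclass
class Dag:
    dag_id: str
    schedule_interval: timedelta
    tasks: List[Task]

@dataclass
class DagRun:
    dag_id: str
    execution_date: datetime
    task_states: Dict[str, str] = field(default_factory=dict)

def process_dag(dag: Dag, last_run: datetime, now: datetime, db: List[DagRun]) -> None:
    """One simplified scheduler pass over a single DAG, mirroring the log above:
    once a schedule interval has fully passed, create a DAG run for it and
    register its task instances as 'scheduled' so an executor can pick them up."""
    next_execution_date = last_run + dag.schedule_interval
    if now >= next_execution_date + dag.schedule_interval:  # interval has passed
        run = DagRun(dag.dag_id, next_execution_date)        # "Created <DagRun ...>"
        for task in dag.tasks:                               # "Creating / updating <TaskInstance ...> in ORM"
            run.task_states[task.task_id] = "scheduled"
        db.append(run)

# Example: a daily DAG with the two tasks from the log excerpt.
dag = Dag("hello_world", timedelta(days=1), [Task("hello"), Task("world")])
db: List[DagRun] = []
process_dag(dag, datetime(2020, 4, 10), datetime(2020, 4, 14, 17, 6), db)
print(db)  # one DagRun for execution_date 2020-04-11, as in the log
```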
Besides the files in the scheduler directory, there is also a single file named dag_processor_manager.log (rotated once it reaches 100 MB), which displays an aggregated view (by default covering the last 30 seconds) of which files the scheduler has processed (Listing 12.2).
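When checking whether the scheduler is picking up your DAG files, it can help to inspect these log files programmatically. The sketch below simply lists the per-DAG scheduler logs and reports the size of dag_processor_manager.log; the use of pathlib and the AIRFLOW_HOME environment variable (falling back to the default ~/airflow) are assumptions about your setup.

```python
import os
from pathlib import Path

# Sketch: list per-DAG scheduler logs and report the size of
# dag_processor_manager.log (which is rotated once it reaches 100 MB).
# Paths follow the directory tree shown above; adjust for your installation.
airflow_home = Path(os.environ.get("AIRFLOW_HOME", "~/airflow")).expanduser()
log_root = airflow_home / "logs"

for log_file in sorted((log_root / "scheduler").glob("*/*.log")):
    print(log_file.relative_to(log_root))

manager_log = log_root / "dag_processor_manager" / "dag_processor_manager.log"
if manager_log.exists():
    size_mb = manager_log.stat().st_size / (1024 * 1024)
    print(f"dag_processor_manager.log: {size_mb:.1f} MB (rotated at 100 MB)")
```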