Set NUM_REDUCE_TASKS based on the size of your cluster. If the cluster is not being used for anything else, a good rule of thumb is to make NUM_REDUCE_TASKS equal the number of available reduce slots on your cluster. See the Hadoop docs to determine the number of reduce slots available on your cluster.

s3:// can be replaced with hdfs:// in all examples below.

lms-creds.json:

{
    "host": "your.mysql.host.com",
    "port": "3306",
    "username": "someuser",
    "password": "passwordforsomeuser"
}
AnswerDistributionWorkflow --local-scheduler \
--src ["s3://path/to/tracking/logs/"] \
--dest s3://folder/where/intermediate/files/go/ \
--name unique_name \
--output-root s3://final/output/path/ \
--include ["*tracking.log*.gz"] \
--manifest "s3://scratch/path/to/manifest.txt" \
--base-input-format "org.edx.hadoop.input.ManifestTextInputFormat" \
--lib-jar ["hdfs://localhost:9000/edx-analytics-pipeline/packages/edx-analytics-hadoop-util.jar"] \
--n-reduce-tasks $NUM_REDUCE_TASKS \
--marker $dest/marker \
--credentials s3://secure/path/to/result_store_credentials.json
--src: This should be a list of HDFS/S3 paths to the root (or roots) of your tracking logs, expressed as a JSON list.
--dest: This can be any location in HDFS/S3 that doesn’t exist yet.
--name: This can be any alphanumeric string; using the same string will attempt to reuse the same intermediate outputs, etc.
--output-root: This can be any location in HDFS/S3 that doesn’t exist yet.
--include: This glob pattern should match all of your tracking log files, and be expressed as a JSON list.
--manifest: This can be any path in HDFS/S3 that doesn’t exist yet; a file will be written here.
--base-input-format: This is the name of the class within the jar to use to process the manifest.
--lib-jar: This is the path to the jar containing the above class. Note that it should be an HDFS/S3 path, and expressed as a JSON list.
--n-reduce-tasks: Number of reduce tasks to schedule.
--marker: This should be an HDFS/S3 path that doesn’t exist yet. If this marker exists, the job will think it has already run.
--credentials: See the discussion of credential files above. These should be the credentials for the result store database that the results are written to.

For example, the same workflow can be launched on an analyticstack instance via remote-task:

remote-task AnswerDistributionWorkflow --host localhost --user ubuntu --remote-name analyticstack --skip-setup --wait \
--local-scheduler --verbose \
--src ["hdfs://localhost:9000/data"] \
--dest hdfs://localhost:9000/tmp/pipeline-task-scheduler/AnswerDistributionWorkflow/1449177792/dest \
--name pt_1449177792 \
--output-root hdfs://localhost:9000/tmp/pipeline-task-scheduler/AnswerDistributionWorkflow/1449177792/course \
--include ["*tracking.log*.gz"] \
--manifest hdfs://localhost:9000/tmp/pipeline-task-scheduler/AnswerDistributionWorkflow/1449177792/manifest.txt \
--base-input-format "org.edx.hadoop.input.ManifestTextInputFormat" \
--lib-jar ["hdfs://localhost:9000/edx-analytics-pipeline/site-packages/edx-analytics-hadoop-util.jar"] \
--n-reduce-tasks 1 \
--marker hdfs://localhost:9000/tmp/pipeline-task-scheduler/AnswerDistributionWorkflow/1449177792/marker \
--credentials /edx/etc/edx-analytics-pipeline/output.json
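The --lib-jar value assumes the utility jar has already been uploaded to HDFS; a minimal sketch, with the destination chosen to match the first example above:

hdfs dfs -mkdir -p /edx-analytics-pipeline/packages/
hdfs dfs -put edx-analytics-hadoop-util.jar /edx-analytics-pipeline/packages/

Whichever path you choose, it must match the path given to --lib-jar.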
$FROM_DATE can be any string that is accepted by the unix utility date. Here are a few examples: “today”, “yesterday”, and “2016-05-01”.

overwrite_mysql controls whether or not the MySQL tables are replaced in a transaction during processing. Set this flag if you are fully replacing the table; otherwise it defaults to false.

overwrite_hive controls whether or not the Hive intermediate table metadata is removed and replaced during processing. Set this flag if you want the metadata to be fully recreated; otherwise it defaults to false.

ImportEnrollmentsIntoMysql --local-scheduler \
--interval $(date +%Y-%m-%d -d "$FROM_DATE")-$(date +%Y-%m-%d -d "$TO_DATE") \
--n-reduce-tasks $NUM_REDUCE_TASKS \
--overwrite-mysql \
--overwrite-hive
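As a sanity check, the --interval argument is just a <start>-<end> date range built with date; for example, with illustrative dates:

date +%Y-%m-%d -d "2016-05-01"      # prints 2016-05-01
date +%Y-%m-%d -d "today"           # prints today's date, e.g. 2016-09-29
# so the --interval argument expands to something like 2016-05-01-2016-09-29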
On September 29, 2016 we merged a modification of the Enrollment workflow to master. The new code calculates Enrollment incrementally, rather than entirely from scratch each time, and it involves a new parameter: overwrite_n_days.
The workflow now assumes that new Hive-ready data has been written persistently to the course_enrollment_events directory under warehouse_path by CourseEnrollmentEventsTask. The workflow uses overwrite_n_days to determine how many days back to repopulate this data. The idea is that before this point, events are not expected to change, but there might be new events that have arrived in the last few days. We are currently running with a value of 3, and we define that as an enrollment parameter in our override.cfg file. You can define it there or on the command line.
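For example, the override.cfg entry might look like the following (the [enrollments] section name here is an assumption; use whatever section your enrollment tasks actually read):

[enrollments]
overwrite_n_days = 3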
This means for us that only the last three days of raw events get scanned daily. It is assumed that the previous days’ data has been loaded by previous runs, or by performing a historical load.
To load the historical enrollment events, you would need to first run:
CourseEnrollmentEventsTask --local-scheduler \
--interval $(date +%Y-%m-%d -d "$FROM_DATE")-$(date +%Y-%m-%d -d "$TO_DATE") \
--n-reduce-tasks $NUM_REDUCE_TASKS
The location workflow requires a geolocation data file to be available (its location is specified in the geolocation section of the config file). Getting a data file could look like this:

wget http://geolite.maxmind.com/download/geoip/database/GeoLiteCountry/GeoIP.dat.gz
gunzip GeoIP.dat.gz
mv GeoIP.dat geo.dat
hdfs dfs -put geo.dat /edx-analytics-pipeline/
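The pipeline is then pointed at that file through the geolocation section of the config file; a sketch (the geolocation_data option name is an assumption, so verify it against your configuration):

[geolocation]
geolocation_data = hdfs://localhost:9000/edx-analytics-pipeline/geo.dat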
InsertToMysqlLastCountryPerCourseTask --local-scheduler \
--interval $(date +%Y-%m-%d -d "$FROM_DATE")-$(date +%Y-%m-%d -d "$TO_DATE") \
--course-country-output $INTERMEDIATE_OUTPUT_ROOT/$(date +%Y-%m-%d -d "$TO_DATE")/country_course \
--n-reduce-tasks $NUM_REDUCE_TASKS \
--overwrite
On November 19, 2016 we merged a modification of the Location workflow to master. The new code calculates Location incrementally, rather than entirely from scratch each time, and it involves a new parameter: overwrite_n_days.
The workflow now assumes that new Hive-ready data has been written persistently to the last_ip_of_user_id directory under warehouse_path by LastDailyIpAddressOfUserTask. (Before May 9, 2018, this used the last_ip_of_user directory for output.) The workflow uses overwrite_n_days to determine how many days back to repopulate this data. The idea is that before this point, events are not expected to change, but there might be new events that have arrived in the last few days. We are currently running with a value of 3, and we define that in our override.cfg file. You can define it there (as overwrite_n_days in the [location-per-course] section) or on the command line (as --overwrite-n-days).
This means for us that only the last three days of raw events get scanned daily. It is assumed that the previous days’ data has been loaded by previous runs, or by performing a historical load.
Another change is to allow the interval start to be defined in configuration (as interval_start in the [location-per-course] section). One can then specify instead just the end date on the workflow:
InsertToMysqlLastCountryPerCourseTask --local-scheduler \
--interval-end $(date +%Y-%m-%d -d "$TO_DATE") \
--course-country-output $INTERMEDIATE_OUTPUT_ROOT/$(date +%Y-%m-%d -d "$TO_DATE")/country_course \
--n-reduce-tasks $NUM_REDUCE_TASKS \
--overwrite
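A sketch of the corresponding override.cfg entries (the date and count shown are illustrative):

[location-per-course]
interval_start = 2016-01-01
overwrite_n_days = 3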
On December 5, 2016 the --course-country-output parameter was removed. That data is instead written to the warehouse_path.
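After that change, the invocation is simply the command above without that parameter:

InsertToMysqlLastCountryPerCourseTask --local-scheduler \
--interval-end $(date +%Y-%m-%d -d "$TO_DATE") \
--n-reduce-tasks $NUM_REDUCE_TASKS \
--overwrite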
To load the historical location data, you would need to first run:
LastDailyIpAddressOfUserTask --local-scheduler \
--interval $(date +%Y-%m-%d -d "$FROM_DATE")-$(date +%Y-%m-%d -d "$TO_DATE") \
--n-reduce-tasks $NUM_REDUCE_TASKS
Note that this does not use the interval_start configuration value, so specify the full interval.
Set overwrite_hive to True if you want the Hive intermediate table metadata to be removed and recreated.

InsertToMysqlCourseActivityTask --local-scheduler \
--end-date $(date +%Y-%m-%d -d "$TO_DATE") \
--weeks 24 \
--credentials $CREDENTIALS \
--n-reduce-tasks $NUM_REDUCE_TASKS \
--overwrite-mysql
On December 05, 2017 we merged a modification of the Engagement workflow to master. The new code calculates Engagement incrementally, rather than entirely from scratch each time, and it involves a new parameter: overwrite_n_days.
Also, the workflow has been renamed from CourseActivityWeeklyTask to InsertToMysqlCourseActivityTask.
The workflow now assumes that new Hive-ready data has been written persistently to the user_activity directory under warehouse_path by UserActivityTask. The workflow uses overwrite_n_days to determine how many days back to repopulate this data. The idea is that before this point, events are not expected to change, but there might be new events that have arrived in the last few days. We are currently running the workflow daily with a value of 3, and we define that as a user-activity parameter in our override.cfg file. You can define it there or on the command line.
This means for us that only the last three days of raw events get scanned daily. It is assumed that the previous days’ data has been loaded by previous runs, or by performing a historical load.
If this workflow is run weekly, an overwrite_n_days value of 10 would be more appropriate.
To load the historical user-activity counts, you would need to first run:
UserActivityTask --local-scheduler \
--interval $(date +%Y-%m-%d -d "$FROM_DATE")-$(date +%Y-%m-%d -d "$TO_DATE") \
--n-reduce-tasks $NUM_REDUCE_TASKS
or you could run the incremental workflow with an overwrite_n_days value large enough that it calculates the historical user-activity counts the first time it is run (the 169 below presumably covers the 24-week window: 24 × 7 = 168 days, plus one):
InsertToMysqlCourseActivityTask --local-scheduler \
--end-date $(date +%Y-%m-%d -d "$TO_DATE") \
--weeks 24 \
--credentials $CREDENTIALS \
--n-reduce-tasks $NUM_REDUCE_TASKS \
--overwrite-n-days 169
After the first run, you can change overwrite_n_days to 3 or 10, depending on whether you plan to run it daily or weekly.
InsertToMysqlAllVideoTask --local-scheduler \
--interval $(date +%Y-%m-%d -d "$FROM_DATE")-$(date +%Y-%m-%d -d "$TO_DATE") \
--n-reduce-tasks $NUM_REDUCE_TASKS
On October 16, 2017 we merged a modification of the Video workflow to master. The new code calculates Video incrementally, rather than entirely from scratch each time, and it involves a new parameter: overwrite_n_days.
The workflow now assumes that new Hive-ready data has been written persistently to the user_video_viewing_by_date directory under warehouse_path by UserVideoViewingByDateTask. The workflow uses overwrite_n_days to determine how many days back to repopulate this data. The idea is that before this point, events are not expected to change, but there might be new events that have arrived in the last few days, particularly if they come from a mobile source. We are currently running the workflow daily with a value of 3, and we define that as a video parameter in our override.cfg file. You can define it there or on the command line.
This means for us that only the last three days of raw events get scanned daily. It is assumed that the previous days’ data has been loaded by previous runs, or by performing a historical load.
To load the historical video counts, you would need to first run:
UserVideoViewingByDateTask --local-scheduler \
--interval $(date +%Y-%m-%d -d "$FROM_DATE")-$(date +%Y-%m-%d -d "$TO_DATE") \
--n-reduce-tasks $NUM_REDUCE_TASKS
or you could run the incremental workflow with an overwrite_n_days value large enough that it calculates the historical video counts the first time it is run:
InsertToMysqlAllVideoTask --local-scheduler \
--interval $(date +%Y-%m-%d -d "$FROM_DATE")-$(date +%Y-%m-%d -d "$TO_DATE") \
--n-reduce-tasks $NUM_REDUCE_TASKS \
--overwrite-n-days 169
After the first run, you can change overwrite_n_days to 3.
The workflow assumes that new Hive-ready data has been written persistently to the module_engagement directory under warehouse_path by ModuleEngagementIntervalTask. The workflow uses overwrite_n_days to determine how many days back to repopulate this data. The idea is that before this point, events are not expected to change, but there might be new events that have arrived in the last few days. We are currently running with a value of 3, and this can be overridden on the command line or defined as a [module-engagement] parameter in the override.cfg file. This means for us that only the last three days of raw events get scanned daily. It is assumed that the previous days’ data has been loaded by previous runs, or by performing a historical load.
To load module engagement history, you would first need to run:
ModuleEngagementIntervalTask --local-scheduler \
--interval $(date +%Y-%m-%d -d "$FROM_DATE")-$(date +%Y-%m-%d -d "$TO_DATE") \
--n-reduce-tasks $NUM_REDUCE_TASKS \
--overwrite-from-date $(date +%Y-%m-%d -d "$TO_DATE") \
--overwrite-mysql
Since module engagement in Insights only looks at the last two weeks of activity, you only need FROM_DATE to be two weeks ago. The TO_DATE need only be within N days of today (as specified by --overwrite-n-days). Setting --overwrite-mysql will ensure that all the historical data is also written to the MySQL result store. Using --overwrite-from-date is important when “fixing” data (for whatever reason): setting it earlier (i.e. to FROM_DATE) will cause the Hive data to also be overwritten for those earlier days.
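For example, the dates for this historical load could be set like this (“2 weeks ago” is a valid GNU date expression; values are illustrative):

FROM_DATE="2 weeks ago"    # Insights only uses the last two weeks of activity
TO_DATE="today"            # must be within --overwrite-n-days of today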
Another prerequisite before running the module engagement workflow below is to have run enrollment first. It is assumed that the course_enrollment directory under warehouse_path has been populated by running enrollment with a TO_DATE matching that used for the module engagement workflow (i.e. today).
We run the module engagement job daily, which adds the most recent day to this data while overwriting the last N days (as set by the --overwrite-n-days parameter). This calculates aggregates and loads them into Elasticsearch and MySQL.
ModuleEngagementWorkflowTask --local-scheduler \
--date $(date +%Y-%m-%d -d "$TO_DATE") \
--indexing-tasks 5 \
--throttle 0.5 \
--n-reduce-tasks $NUM_REDUCE_TASKS
The value of TO_DATE is today.