This guide provides information about the edX data pipeline.
Set NUM_REDUCE_TASKS based on the size of your cluster. If the cluster is not being used for anything else, a good rule of thumb is to make NUM_REDUCE_TASKS equal to the number of available reduce slots on your cluster. See the Hadoop docs to determine the number of reduce slots available on your cluster.

s3:// can be replaced with hdfs:// in all examples below.

Example credentials file, lms-creds.json:

{
  "host": "your.mysql.host.com",
  "port": "3306",
  "username": "someuser",
  "password": "passwordforsomeuser"
}
AnswerDistributionWorkflow --local-scheduler \
--src ["s3://path/to/tracking/logs/"] \
--dest s3://folder/where/intermediate/files/go/ \
--name unique_name \
--output-root s3://final/output/path/ \
--include ["*tracking.log*.gz"] \
--manifest "s3://scratch/path/to/manifest.txt" \
--base-input-format "org.edx.hadoop.input.ManifestTextInputFormat" \
--lib-jar ["hdfs://localhost:9000/edx-analytics-pipeline/packages/edx-analytics-hadoop-util.jar"] \
--n-reduce-tasks $NUM_REDUCE_TASKS \
--marker $dest/marker \
--credentials s3://secure/path/to/result_store_credentials.json
--src: This should be a list of HDFS/S3 paths to the root (or roots) of your tracking logs, expressed as a JSON list.
--dest: This can be any location in HDFS/S3 that doesn't exist yet.
--name: This can be any alphanumeric string; using the same string will attempt to reuse the same intermediate outputs, etc.
--output-root: This can be any location in HDFS/S3 that doesn't exist yet.
--include: This glob pattern should match all of your tracking log files, and be expressed as a JSON list.
--manifest: This can be any path in HDFS/S3 that doesn't exist yet; a file will be written here.
--base-input-format: This is the name of the class within the jar to use to process the manifest.
--lib-jar: This is the path to the jar containing the above class. Note that it should be an HDFS/S3 path, expressed as a JSON list.
--n-reduce-tasks: Number of reduce tasks to schedule.
--marker: This should be an HDFS/S3 path that doesn't exist yet. If this marker exists, the job will think it has already run.
--credentials: See the discussion of credential files above. These should be the credentials for the result store database that the results are written to.

An example run against an analyticstack:

remote-task AnswerDistributionWorkflow --host localhost --user ubuntu --remote-name analyticstack --skip-setup --wait \
--local-scheduler --verbose \
--src ["hdfs://localhost:9000/data"] \
--dest hdfs://localhost:9000/tmp/pipeline-task-scheduler/AnswerDistributionWorkflow/1449177792/dest \
--name pt_1449177792 \
--output-root hdfs://localhost:9000/tmp/pipeline-task-scheduler/AnswerDistributionWorkflow/1449177792/course \
--include ["*tracking.log*.gz"] \
--manifest hdfs://localhost:9000/tmp/pipeline-task-scheduler/AnswerDistributionWorkflow/1449177792/manifest.txt \
--base-input-format "org.edx.hadoop.input.ManifestTextInputFormat" \
--lib-jar ["hdfs://localhost:9000/edx-analytics-pipeline/site-packages/edx-analytics-hadoop-util.jar"] \
--n-reduce-tasks 1 \
--marker hdfs://localhost:9000/tmp/pipeline-task-scheduler/AnswerDistributionWorkflow/1449177792/marker \
--credentials /edx/etc/edx-analytics-pipeline/output.json
$FROM_DATE can be any string that is accepted by the unix utility date. Here are a few examples: "today", "yesterday", and "2016-05-01".

overwrite_mysql controls whether or not the MySQL tables are replaced in a transaction during processing. Set this flag if you are fully replacing the table; it defaults to false otherwise.

overwrite_hive controls whether or not the Hive intermediate table metadata is removed and replaced during processing. Set this flag if you want the metadata to be fully recreated; it defaults to false otherwise.

ImportEnrollmentsIntoMysql --local-scheduler \
--interval $(date +%Y-%m-%d -d "$FROM_DATE")-$(date +%Y-%m-%d -d "$TO_DATE") \
--n-reduce-tasks $NUM_REDUCE_TASKS \
--overwrite-mysql \
--overwrite-hive
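
For concreteness, here is a hedged sketch of how the variables used throughout these examples might be set; the values are illustrative, not recommendations:

FROM_DATE="2016-01-01"
TO_DATE="today"
NUM_REDUCE_TASKS=12
# With these values, the --interval argument above expands to 2016-01-01-<today's date>.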
On September 29, 2016 we merged a modification of the Enrollment workflow to master. The new code calculates Enrollment incrementally, rather than entirely from scratch each time, and it introduces a new parameter: overwrite_n_days.

The workflow now assumes that new Hive-ready data has been written persistently to the course_enrollment_events directory under warehouse_path by CourseEnrollmentEventsTask. The workflow uses overwrite_n_days to determine how many days back to repopulate this data. The idea is that before this point, events are not expected to change, but there might be new events that have arrived in the last few days. We are currently running with a value of 3, and we define that as an enrollment parameter in our override.cfg file. You can define it there or on the command line.
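
A minimal override.cfg sketch for this setting follows; the section name [enrollments] is an assumption here, so check which section your version of the pipeline actually reads for this task:

[enrollments]
overwrite_n_days = 3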
This means for us that only the last three days of raw events get scanned daily. It is assumed that the previous days’ data has been loaded by previous runs, or by performing a historical load.
To load the historical enrollment events, you would need to first run:
CourseEnrollmentEventsTask --local-scheduler \
--interval $(date +%Y-%m-%d -d "$FROM_DATE")-$(date +%Y-%m-%d -d "$TO_DATE") \
--n-reduce-tasks $NUM_REDUCE_TASKS
A geolocation data file is also needed; it should be pointed to by the geolocation_data setting in the geolocation section of the config file. Getting a data file could look like this:

wget http://geolite.maxmind.com/download/geoip/database/GeoLiteCountry/GeoIP.dat.gz
gunzip GeoIP.dat.gz
mv GeoIP.dat geo.dat
hdfs dfs -put geo.dat /edx-analytics-pipeline/
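
The configuration then needs to point at the uploaded file. A sketch, assuming the same HDFS namenode address used in the other examples in this guide:

[geolocation]
geolocation_data = hdfs://localhost:9000/edx-analytics-pipeline/geo.dat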
InsertToMysqlLastCountryPerCourseTask --local-scheduler \
--interval $(date +%Y-%m-%d -d "$FROM_DATE")-$(date +%Y-%m-%d -d "$TO_DATE") \
--course-country-output $INTERMEDIATE_OUTPUT_ROOT/$(date +%Y-%m-%d -d "$TO_DATE")/country_course \
--n-reduce-tasks $NUM_REDUCE_TASKS \
--overwrite
On November 19, 2016 we merged a modification of the Location workflow to master. The new code calculates Location incrementally, rather than entirely from scratch each time, and it introduces a new parameter: overwrite_n_days.

The workflow now assumes that new Hive-ready data has been written persistently to the last_ip_of_user_id directory under warehouse_path by LastDailyIpAddressOfUserTask. (Before May 9, 2018, this used the last_ip_of_user directory for output.) The workflow uses overwrite_n_days to determine how many days back to repopulate this data. The idea is that before this point, events are not expected to change, but there might be new events that have arrived in the last few days. We are currently running with a value of 3, and we define that in our override.cfg file. You can define it there (as overwrite_n_days in the [location-per-course] section) or on the command line (as --overwrite-n-days).
This means for us that only the last three days of raw events get scanned daily. It is assumed that the previous days’ data has been loaded by previous runs, or by performing a historical load.
Another change is to allow the interval start to be defined in configuration (as interval_start in the [location-per-course] section). One can then specify just the end date on the workflow:
InsertToMysqlLastCountryPerCourseTask --local-scheduler \
--interval-end $(date +%Y-%m-%d -d "$TO_DATE") \
--course-country-output $INTERMEDIATE_OUTPUT_ROOT/$(date +%Y-%m-%d -d "$TO_DATE")/country_course \
--n-reduce-tasks $NUM_REDUCE_TASKS \
--overwrite
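
Putting the two configuration values mentioned above together, an override.cfg sketch might look like the following (the interval_start date is illustrative):

[location-per-course]
overwrite_n_days = 3
interval_start = 2016-01-01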
On December 5, 2016 the --course-country-output
parameter was removed. That data is instead written to the warehouse_path.
To load the historical location data, you would need to first run:
LastDailyIpAddressOfUserTask --local-scheduler \
--interval $(date +%Y-%m-%d -d "$FROM_DATE")-$(date +%Y-%m-%d -d "$TO_DATE") \
--n-reduce-tasks $NUM_REDUCE_TASKS
Note that this does not use the interval_start
configuration value, so specify the full interval.
To have the Hive intermediate table metadata recreated, set overwrite_hive to True.

InsertToMysqlCourseActivityTask --local-scheduler \
--end-date $(date +%Y-%m-%d -d "$TO_DATE") \
--weeks 24 \
--credentials $CREDENTIALS \
--n-reduce-tasks $NUM_REDUCE_TASKS \
--overwrite-mysql
On December 5, 2017 we merged a modification of the Engagement workflow to master. The new code calculates Engagement incrementally, rather than entirely from scratch each time, and it introduces a new parameter: overwrite_n_days.

Also, the workflow has been renamed from CourseActivityWeeklyTask to InsertToMysqlCourseActivityTask.

The workflow now assumes that new Hive-ready data has been written persistently to the user_activity directory under warehouse_path by UserActivityTask. The workflow uses overwrite_n_days to determine how many days back to repopulate this data. The idea is that before this point, events are not expected to change, but there might be new events that have arrived in the last few days. We are currently running the workflow daily with a value of 3, and we define that as a user-activity parameter in our override.cfg file. You can define it there or on the command line.
This means for us that only the last three days of raw events get scanned daily. It is assumed that the previous days’ data has been loaded by previous runs, or by performing a historical load.
If this workflow is run weekly, an overwrite_n_days
value of 10 would be more appropriate.
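
A corresponding override.cfg sketch; the section name [user-activity] is assumed from the wording above, so verify it against your pipeline's configuration:

[user-activity]
overwrite_n_days = 3
# Use a value around 10 instead if the workflow is run weekly.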
To load the historical user-activity counts, you would need to first run:
UserActivityTask --local-scheduler \
--interval $(date +%Y-%m-%d -d "$FROM_DATE")-$(date +%Y-%m-%d -d "$TO_DATE") \
--n-reduce-tasks $NUM_REDUCE_TASKS
Alternatively, you could run the incremental workflow with an overwrite_n_days value large enough that it calculates the historical user-activity counts the first time it is run:
InsertToMysqlCourseActivityTask --local-scheduler \
--end-date $(date +%Y-%m-%d -d "$TO_DATE") \
--weeks 24 \
--credentials $CREDENTIALS \
--n-reduce-tasks $NUM_REDUCE_TASKS \
--overwrite-n-days 169
After the first run, you can change overwrite_n_days to 3 or 10, depending on whether you plan to run it daily or weekly.
InsertToMysqlAllVideoTask --local-scheduler \
--interval $(date +%Y-%m-%d -d "$FROM_DATE")-$(date +%Y-%m-%d -d "$TO_DATE") \
--n-reduce-tasks $NUM_REDUCE_TASKS
On October 16, 2017 we merged a modification of the Video workflow to master. The new code calculates Video incrementally, rather than entirely from scratch each time, and it introduces a new parameter: overwrite_n_days.

The workflow now assumes that new Hive-ready data has been written persistently to the user_video_viewing_by_date directory under warehouse_path by UserVideoViewingByDateTask. The workflow uses overwrite_n_days to determine how many days back to repopulate this data. The idea is that before this point, events are not expected to change, but there might be new events that have arrived in the last few days, particularly if coming from a mobile source. We are currently running the workflow daily with a value of 3, and we define that as a video parameter in our override.cfg file. You can define it there or on the command line.
This means for us that only the last three days of raw events get scanned daily. It is assumed that the previous days’ data has been loaded by previous runs, or by performing a historical load.
To load the historical video counts, you would need to first run:
UserVideoViewingByDateTask --local-scheduler \
--interval $(date +%Y-%m-%d -d "$FROM_DATE")-$(date +%Y-%m-%d -d "$TO_DATE") \
--n-reduce-tasks $NUM_REDUCE_TASKS
Alternatively, you could run the incremental workflow with an overwrite_n_days value large enough that it calculates the historical video counts the first time it is run:
InsertToMysqlAllVideoTask --local-scheduler \
--interval $(date +%Y-%m-%d -d "$FROM_DATE")-$(date +%Y-%m-%d -d "$TO_DATE") \
--n-reduce-tasks $NUM_REDUCE_TASKS \
--overwrite-n-days 169
After the first run, you can change overwrite_n_days
to 3.
The workflow assumes that new Hive-ready data has been written persistently to the module_engagement
directory under warehouse_path by ModuleEngagementIntervalTask. The workflow uses overwrite_n_days to determine how many days back to repopulate this data. The idea is that before this point, events are not expected to change, but there might be new events that have arrived in the last few days. We are currently running with a value of 3, and this can be overridden on the command line or defined as a [module-engagement] parameter in the override.cfg file. This means for us that only the last three days of raw events get scanned daily. It is assumed that the previous days' data has been loaded by previous runs, or by performing a historical load.
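
An override.cfg sketch using the [module-engagement] section named above:

[module-engagement]
overwrite_n_days = 3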
To load module engagement history, you would first need to run:
ModuleEngagementIntervalTask --local-scheduler \
--interval $(date +%Y-%m-%d -d "$FROM_DATE")-$(date +%Y-%m-%d -d "$TO_DATE") \
--n-reduce-tasks $NUM_REDUCE_TASKS \
--overwrite-from-date $(date +%Y-%m-%d -d "$TO_DATE") \
--overwrite-mysql
Since module engagement in Insights only looks at the last two weeks of activity, you only need FROM_DATE to be two weeks ago. The TO_DATE need only be within N days of today (as specified by --overwrite-n-days). Setting --overwrite-mysql will ensure that all the historical data is also written to the MySQL result store. Using --overwrite-from-date is important when "fixing" data for some reason: setting it earlier (i.e. to FROM_DATE) will cause the Hive data to also be overwritten for those earlier days.
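
As a sketch, assuming GNU date (which accepts relative strings like these), the dates for this workflow could be set as:

FROM_DATE="2 weeks ago"
TO_DATE="today"
# date +%Y-%m-%d -d "$FROM_DATE" then yields the date two weeks before today.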
Another prerequisite before running the module engagement workflow below is to have run enrollment first. It is assumed that the course_enrollment
directory under warehouse_path has been populated by running enrollment with a TO_DATE
matching that used for the module engagement workflow (i.e. today).
We run the module engagement job daily, which adds the most recent day to this data while overwriting the last N days (as set by the --overwrite-n-days parameter). This calculates aggregates and loads them into Elasticsearch and MySQL.
ModuleEngagementWorkflowTask --local-scheduler \
--date $(date +%Y-%m-%d -d "$TO_DATE") \
--indexing-tasks 5 \
--throttle 0.5 \
--n-reduce-tasks $NUM_REDUCE_TASKS
The value of TO_DATE
is today.
This page is intended to provide answers to particular questions that may arise while using analyticstack for development. Some topics may also be relevant to those working outside of analyticstack.
To test a new pipeline feature, I need new events in the tracking.log file.
Don’t try to edit the tracking.log in HDFS directly as it is frequently overwritten by a cron job. Instead:
Create a new file titled something like “custom-tracking.log” (filename must end in “tracking.log”)
Add the events that you need to the file, one event per line.
Make sure that any course_id field has the value "edX/DemoX/Demo_Course" and org_id is "edX" (an example event line is shown at the end of this answer).
To view example events in the existing tracking.log, run (as the hadoop user in the analyticstack):
hadoop fs -cat /data/tracking.log

Upload the file to HDFS. Run (as the hadoop user in the analyticstack):

hadoop fs -put custom-tracking.log /data/custom-tracking.log

Now you can run the task you are testing. The output should print that it is sourcing events from 2 files now.
If you need to modify the events you added, edit the “custom-tracking.log” on the normal file system and then run the following:
hadoop fs -rm /data/custom-tracking.log
hadoop fs -put custom-tracking.log /data/custom-tracking.log
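
For reference, a hypothetical custom event line might look like the following. This is illustrative only; the exact fields a task requires depend on the event types it processes, but note the course_id and org_id values in the context:

{"username": "staff", "event_source": "server", "event_type": "edx.course.enrollment.activated", "event": {"course_id": "edX/DemoX/Demo_Course", "user_id": 5, "mode": "honor"}, "context": {"course_id": "edX/DemoX/Demo_Course", "org_id": "edX", "user_id": 5}, "time": "2016-05-01T00:00:00.000000+00:00"}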
I keep getting Java out-of-memory errors (error code 143) when I run tasks.
Something is likely misconfigured, and the JVM is not allocating enough memory.
First, make sure the virtual machine has enough virtual memory configured. Open the VirtualBox GUI and check that the machine whose title starts with “analyticstack” has around 4GB of memory assigned to it.
If you are getting an error about virtual memory exceeding a limit, turn off the vmem check in YARN. As the vagrant user in the analytics stack, edit the yarn-site.xml config to add a property:
sudo nano /edx/app/hadoop/hadoop/etc/hadoop/yarn-site.xml
Inside the <configuration> element, add the property:
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
Then run the following to restart yarn so that the config change is registered:
sudo service yarn restart
I made local changes to acceptance tests, but tests don’t seem to be running with the changes.
Acceptance tests are checked out from your branch, so commit your changes and rerun the acceptance tests. Note that a git commit is the only thing required, since the tests pull from your local branch, not the remote.
- Commit your changes to your branch.
- Rerun the acceptance tests.
I re-ran a task, but the output didn’t change.
One possible reason for this issue is that the task is not actually being re-run.
The task scheduler will skip running a task if it recognizes that the task has been run before and it can use the existing output instead of re-running it. At the beginning of the command's output, each task is listed as it is scheduled. If the line ends with "(DONE)" then the scheduler has recognized that the task was run before and will not rerun it. If it is marked as "(PENDING)" then it is actually scheduled to run.
There are a few ways of tricking the scheduler into re-running tasks:
Pass different parameters to the task command on the command line. As long as the task has not been run with those parameters before, this may force tasks to re-run because the source data is different.
Remove the output of the task. The task scheduler (luigi) runs the “complete” function on each task to determine whether a task has been run before. This can be different for every task, but typically it checks the output table of the command for any data. Deleting the output table can cause the complete function to return false and force a re-run.
If the output is a hive table, then, as the hadoop user in the analyticstack, run:
hive -e "drop table <table_name>;"

If the output is a MySQL table, then, as the vagrant user in the analyticstack, run:

sudo mysql --database=<database_name> -e "drop table <table_name>;"

If the output is a set of files in the "warehouse" location in HDFS, then, as the hadoop user in analyticstack, run:
hadoop fs -rm -R /edx-analytics-pipeline/warehouse/<tablename>
I need to see more detailed logs than what is sent to standard-out.
In the analyticstack, /edx/app/analytics_pipeline/analytics_pipeline/edx_analytics.log includes what goes to standard-out plus DEBUG level logging.
To see detailed hadoop logs, find and copy the application_id printed in the output of a task run and pass it to this command:
yarn logs -applicationId <application_id>
My acceptance tests are failing, and I want to look at the hive tables.
Query the acceptance test DB via hive.
Within the analytics devstack, switch to the hadoop user:
sudo su hadoop

Start up hive:
hive
Find the acceptance test database:
show databases;

Show the tables for your database (your database name will be different):

use test_283482342;
show tables;

Execute your queries.
For the Docker analyticstack setup, follow the instructions in the Getting Started on Analytics section of the Devstack repository.
Access analytics pipeline shell:
make analytics-pipeline-shell
Before running the user-location workflows, a geolocation Maxmind data file must be downloaded. This file can be in HDFS or S3, for example, and should be pointed to by the geolocation_data setting in the geolocation section of your configuration file. To use the default location used by acceptance tests, execute the following:
curl -fSL http://geolite.maxmind.com/download/geoip/database/GeoLiteCountry/GeoIP.dat.gz -o /var/tmp/GeoIP.dat.gz
cd /var/tmp/ && gunzip /var/tmp/GeoIP.dat.gz
mv GeoIP.dat geo.dat
hdfs dfs -put geo.dat /edx-analytics-pipeline/
To run the full test suite, execute the following command in the shell:
make docker-test-acceptance-local-all
To run individual tests, execute the following:
make docker-test-acceptance-local ONLY_TESTS=edx.analytics.tasks.tests.acceptance.<test_script_file> # e.g.
make docker-test-acceptance-local ONLY_TESTS=edx.analytics.tasks.tests.acceptance.test_user_activity
Access analytics pipeline shell:
make analytics-pipeline-shell
Generate egg files
If you plan to run Spark workflows that use imports which in turn require a plugin mechanism, those imports must be stored locally as egg files. These imports are then identified in the spark section of the configuration file. Opaque keys is one such import, and the two egg files used by Spark can be made as follows.
make generate-spark-egg-files
launch-task UserActivityTaskSpark --local-scheduler --interval 2017-03-16
Make sure there are no errors during the provision command. If there are errors, do not rerun the provision command without first cleaning up after the failures.
For cleanup, there are 2 options.
Reset containers (this will remove all containers and volumes)
make destroy
Manual cleanup
make mysql-shell
mysql
DROP DATABASE reports;
DROP DATABASE edx_hive_metastore;
DROP DATABASE edxapp;       # Only drop if provisioning failed while loading the LMS schema.
DROP DATABASE edxapp_csmh;  # Only drop if provisioning failed while loading the LMS schema.
# exit the mysql shell
make down
The following documentation is generated dynamically from comments in the code. There is detailed documentation for every Luigi task.
Load records into elasticsearch clusters.
edx.analytics.tasks.common.elasticsearch_load.ElasticsearchIndexTask(*args, **kwargs)
Index a stream of documents in an elasticsearch index.
This task is intended to do the following:
- Create a new index that is unique to this task run (all significant parameters).
- Load all of the documents into this unique index.
- If the alias is already pointing at one or more indexes, switch it so that it only points at this newly loaded index.
Support executing map reduce tasks.
edx.analytics.tasks.common.mapreduce.MapReduceJobTask(*args, **kwargs)
Execute a map reduce job. Typically using Hadoop, but can execute the job in process as well.

edx.analytics.tasks.common.mapreduce.MultiOutputMapReduceJobTask(*args, **kwargs)
Produces multiple output files from a map reduce job.
The mapper output tuple key is used to determine the name of the file that reducer results are written to. Different reduce tasks must not write to the same file. Since all values for a given mapper output key are guaranteed to be processed by the same reduce task, we only allow a single file to be output per key for safety. In the future, the reducer output key could be used to determine the output file name, however.
Support for loading data into a Mysql database.
edx.analytics.tasks.common.mysql_load.IncrementalMysqlInsertTask(*args, **kwargs)
A MySQL table that is mostly appended to, but occasionally has parts of it overwritten.
When overwriting, the task is responsible for populating some records that need to be easy to identify. There should be a one-to-one relationship between a row and the task that was used to write it. It should be straightforward to construct a where clause that selects all of the rows generated by this task.

edx.analytics.tasks.common.mysql_load.MysqlInsertTask(*args, **kwargs)
A task for inserting a data set into RDBMS.
Gather data using Sqoop table dumps run on RDBMS databases.
edx.analytics.tasks.common.sqoop.SqoopImportFromMysql(*args, **kwargs)
An abstract task that uses Sqoop to read data out of a MySQL database and writes it to a file in CSV format.
By default, the output format is defined by the --mysql-delimiters option, which uses the delimiter defaults of the mysqldump tool.

edx.analytics.tasks.common.sqoop.SqoopImportTask(*args, **kwargs)
An abstract task that uses Sqoop to read data out of a database and writes it to a file in CSV format.
Import data from external RDBMS databases specific to enterprise into Hive.
edx.analytics.tasks.enterprise.enterprise_database_imports.ImportBenefitTask(*args, **kwargs)
Ecommerce: Imports offer benefit information from an ecommerce table to a destination directory and a HIVE metastore.

edx.analytics.tasks.enterprise.enterprise_database_imports.ImportConditionalOfferTask(*args, **kwargs)
Ecommerce: Imports conditional offer information from an ecommerce table to a destination directory and a HIVE metastore.

edx.analytics.tasks.enterprise.enterprise_database_imports.ImportDataSharingConsentTask(*args, **kwargs)
Imports the consent_datasharingconsent table to S3/Hive.

edx.analytics.tasks.enterprise.enterprise_database_imports.ImportEnterpriseCourseEnrollmentUserTask(*args, **kwargs)
Imports the enterprise_enterprisecourseenrollment table to S3/Hive.

edx.analytics.tasks.enterprise.enterprise_database_imports.ImportEnterpriseCustomerTask(*args, **kwargs)
Imports the enterprise_enterprisecustomer table to S3/Hive.

edx.analytics.tasks.enterprise.enterprise_database_imports.ImportEnterpriseCustomerUserTask(*args, **kwargs)
Imports the enterprise_enterprisecustomeruser table to S3/Hive.

edx.analytics.tasks.enterprise.enterprise_database_imports.ImportStockRecordTask(*args, **kwargs)
Ecommerce: Imports the partner_stockrecord table from the ecommerce database to a destination directory and a HIVE metastore.
A voucher is a discount coupon that can be applied to ecommerce purchases.

edx.analytics.tasks.enterprise.enterprise_database_imports.ImportUserSocialAuthTask(*args, **kwargs)
Imports the social_auth_usersocialauth table to S3/Hive.

edx.analytics.tasks.enterprise.enterprise_database_imports.ImportVoucherTask(*args, **kwargs)
Ecommerce: Imports the voucher_voucher table from the ecommerce database to a destination directory and a HIVE metastore.
A voucher is a discount coupon that can be applied to ecommerce purchases.
A canonical calendar that can be joined with other tables to provide information about dates.
edx.analytics.tasks.insights.calendar_task.CalendarTableTask(*args, **kwargs)
Ensure a hive table exists for the calendar so that we can perform joins.

edx.analytics.tasks.insights.calendar_task.CalendarTask(*args, **kwargs)
Generate a canonical calendar.
This table provides information about every day in every year that is being analyzed. It captures many complex details associated with calendars and standardizes references to concepts like "weeks" since they can be defined in different ways by various systems.
It is also intended to contain business-specific metadata about dates in the future, such as fiscal year boundaries, fiscal quarter boundaries and even holidays or other days of special interest for analysis purposes.
Import data from external RDBMS databases into Hive.
edx.analytics.tasks.insights.database_imports.ImportAllDatabaseTablesTask(*args, **kwargs)
Imports a set of database tables from an external LMS RDBMS.

edx.analytics.tasks.insights.database_imports.ImportAuthUserProfileTask(*args, **kwargs)
Imports user demographic information from an external LMS DB to both a destination directory and a HIVE metastore.

edx.analytics.tasks.insights.database_imports.ImportAuthUserTask(*args, **kwargs)
Imports user information from an external LMS DB to a destination directory.

edx.analytics.tasks.insights.database_imports.ImportCouponVoucherIndirectionState(*args, **kwargs)
Ecommerce: Current: Imports the voucher_couponvouchers table from the ecommerce database to a destination directory and a HIVE metastore.
This table is just an extra layer of indirection in the source schema design and is required to translate a 'couponvouchers_id' into a coupon id. Coupons are represented as products in the product table, which is imported separately. A coupon can have many voucher codes associated with it.

edx.analytics.tasks.insights.database_imports.ImportCouponVoucherState(*args, **kwargs)
Ecommerce: Current: Imports the voucher_couponvouchers_vouchers table from the ecommerce database to a destination directory and a HIVE metastore.
A coupon can have many voucher codes associated with it. This table associates voucher IDs with 'couponvouchers_id's, which are stored in the voucher_couponvouchers table and have a 1:1 relationship to coupon IDs.

edx.analytics.tasks.insights.database_imports.ImportCourseEntitlementTask(*args, **kwargs)
Imports the table containing learners' course entitlements.

edx.analytics.tasks.insights.database_imports.ImportCourseModeTask(*args, **kwargs)
Course Information: Imports course_modes table to both a destination directory and a HIVE metastore.

edx.analytics.tasks.insights.database_imports.ImportCourseUserGroupTask(*args, **kwargs)
Imports course cohort information from an external LMS DB to both a destination directory and a HIVE metastore.

edx.analytics.tasks.insights.database_imports.ImportCourseUserGroupUsersTask(*args, **kwargs)
Imports user cohort information from an external LMS DB to both a destination directory and a HIVE metastore.

edx.analytics.tasks.insights.database_imports.ImportCurrentOrderDiscountState(*args, **kwargs)
Ecommerce: Current: Imports current order discount records from an ecommerce table to a destination directory and a HIVE metastore.

edx.analytics.tasks.insights.database_imports.ImportCurrentOrderLineState(*args, **kwargs)
Ecommerce: Current: Imports current order line items from an ecommerce table to a destination directory and a HIVE metastore.
edx.analytics.tasks.insights.database_imports.ImportCurrentOrderState(*args, **kwargs)
Ecommerce Current: Imports current orders from an ecommerce table to both a destination directory and a HIVE metastore.

edx.analytics.tasks.insights.database_imports.ImportCurrentRefundRefundLineState(*args, **kwargs)
Ecommerce: Current: Imports current refund line items from an ecommerce table to both a destination directory and a HIVE metastore.

edx.analytics.tasks.insights.database_imports.ImportEcommercePartner(*args, **kwargs)
Ecommerce: Current: Imports Partner information from an ecommerce table to a destination directory and a HIVE metastore.

edx.analytics.tasks.insights.database_imports.ImportEcommerceUser(*args, **kwargs)
Ecommerce: Users: Imports users from an external ecommerce table to a destination dir.

edx.analytics.tasks.insights.database_imports.ImportGeneratedCertificatesTask(*args, **kwargs)

edx.analytics.tasks.insights.database_imports.ImportIntoHiveTableTask(*args, **kwargs)
Abstract class to import data into a Hive table.
Requires four properties and a requires() method to be defined.

edx.analytics.tasks.insights.database_imports.ImportMysqlToHiveTableTask(*args, **kwargs)
Dumps data from an RDBMS table, and imports into Hive.
Requires override of table_name and columns properties.

edx.analytics.tasks.insights.database_imports.ImportPersistentCourseGradeTask(*args, **kwargs)
Imports the grades_persistentcoursegrade table to S3/Hive.

edx.analytics.tasks.insights.database_imports.ImportProductCatalog(*args, **kwargs)
Ecommerce: Products: Imports product catalog from an external ecommerce table to both a destination directory and a HIVE metastore.

edx.analytics.tasks.insights.database_imports.ImportProductCatalogAttributeValues(*args, **kwargs)
Ecommerce: Products: Imports product catalog attribute values from an external ecommerce table to both a destination directory and a HIVE metastore.

edx.analytics.tasks.insights.database_imports.ImportProductCatalogAttributes(*args, **kwargs)
Ecommerce: Products: Imports product catalog attributes from an external ecommerce table to both a destination directory and a HIVE metastore.

edx.analytics.tasks.insights.database_imports.ImportProductCatalogClass(*args, **kwargs)
Ecommerce: Products: Imports product catalog classes from an external ecommerce table to a destination dir.
edx.analytics.tasks.insights.database_imports.ImportShoppingCartCertificateItem(*args, **kwargs)
Imports certificate items from an external LMS DB shopping cart table to both a destination directory and a HIVE metastore.

edx.analytics.tasks.insights.database_imports.ImportShoppingCartCoupon(*args, **kwargs)
Imports coupon definitions from an external LMS DB shopping cart table to both a destination directory and a HIVE metastore.

edx.analytics.tasks.insights.database_imports.ImportShoppingCartCouponRedemption(*args, **kwargs)
Imports coupon redemptions from an external LMS DB shopping cart table to both a destination directory and a HIVE metastore.

edx.analytics.tasks.insights.database_imports.ImportShoppingCartCourseRegistrationCodeItem(*args, **kwargs)
Imports course registration codes from an external ecommerce table to both a destination directory and a HIVE metastore.

edx.analytics.tasks.insights.database_imports.ImportShoppingCartDonation(*args, **kwargs)
Imports donations from an external LMS DB shopping cart table to both a destination directory and a HIVE metastore.

edx.analytics.tasks.insights.database_imports.ImportShoppingCartOrder(*args, **kwargs)
Imports orders from an external LMS DB shopping cart table to both a destination directory and a HIVE metastore.

edx.analytics.tasks.insights.database_imports.ImportShoppingCartOrderItem(*args, **kwargs)
Imports individual order items from an external LMS DB shopping cart table to both a destination directory and a HIVE metastore.

edx.analytics.tasks.insights.database_imports.ImportShoppingCartPaidCourseRegistration(*args, **kwargs)
Imports paid course registrations from an external LMS DB shopping cart table to both a destination directory and a HIVE metastore.

edx.analytics.tasks.insights.database_imports.ImportStudentCourseEnrollmentTask(*args, **kwargs)
Imports course enrollment information from an external LMS DB to a destination directory.
Various helper utilities that are commonly used when working with Hive
edx.analytics.tasks.util.hive.BareHiveTableTask(*args, **kwargs)
Abstract class that represents the metadata associated with a Hive table.
Note that all this task does is ensure that the table is created, it does not populate it with any data, simply runs the DDL commands to create the table.
Also note that it will not change the schema of the table if it already exists unless the overwrite parameter is set to True.

edx.analytics.tasks.util.hive.HivePartitionTask(*args, **kwargs)
Abstract class that represents the metadata associated with a partition in a Hive table.
Note that all this task does is ensure that the partition is created, it does not populate it with any data, simply runs the DDL commands to create the partition.

edx.analytics.tasks.util.hive.HiveTableFromQueryTask(*args, **kwargs)
Creates a hive table from the results of a hive query.

edx.analytics.tasks.util.hive.HiveTableTask(*args, **kwargs)
Abstract class to import data into a Hive table.
Currently supports a single partition that represents the version of the table data. This allows us to use a consistent location for the table and swap out the data in the tables by simply pointing at different partitions within the folder that contain different "versions" of the table data. For example, if a snapshot is taken of an RDBMS table, we might store that in a partition with today's date. Any subsequent jobs that need to join against that table will continue to use the data snapshot from the beginning of the day (since that is the "live" partition). However, the next time a snapshot is taken a new partition is created and loaded and becomes the "live" partition that is used in all joins etc.
Important note: this code currently does not clean up old unused partitions, they will just continue to exist until they are cleaned up by some external process.

edx.analytics.tasks.util.hive.OverwriteAwareHiveQueryDataTask(*args, **kwargs)
A generalized Data task whose output is a hive table populated from a hive query.
Support URLs. Specifically, we want to be able to refer to data stored in a variety of locations and formats using a standard URL syntax.
Examples:
s3://some-bucket/path/to/file
/path/to/local/file.gz
hdfs://some/directory/
edx.analytics.tasks.util.url.ExternalURL(*args, **kwargs)
Simple Task that returns a target based on its URL.
Parameters: url (Parameter)

edx.analytics.tasks.util.url.UncheckedExternalURL(*args, **kwargs)
An ExternalURL task that does not verify if the source file exists, which can be expensive for S3 URLs.
Parameters: url (Parameter)