Doing data science means building pipelines that manipulate and analyze data. Write your Conducto data pipeline in Python: the pipeline specifies commands and when they should run. Subtasks that can run in parallel are expressed naturally and automatically run at scale according to the available compute resources. Conducto is agnostic to where or how you store your data; invoke any shell command that calls any Linux package, library, or tool to access and manipulate it. Launch your pipeline into our free cloud, or run it for free on your local machine; upgrade to paid cloud mode for more scale. View and interact with your pipeline in the browser.
Getting started with data science in Conducto requires our Python package:

- Install it: `pip install conducto`.
- Run `conducto-profile add`, then log in with your username and password.
- Save the code below to `pipeline.py`.
- Run `python pipeline.py --cloud --run` to launch into our free cloud.
- View your first pipeline in the browser.
```python
import conducto as co

# Build a Docker image using contents of a Git repo
IMG = co.Image(
    "python:3.8-slim",
    copy_url="https://github.com/conducto/examples",
    copy_branch="main",
    reqs_packages=["cloc"],
    reqs_py=["pandas"],
)

def main() -> co.Parallel:
    with co.Parallel(image=IMG) as root:
        # Count lines of code in the remote Git repo.
        root["lines of code"] = co.Exec("cloc .")
        # Run a simple data analysis script located there.
        root["biggest US cities"] = co.Exec(
            "cd features/copy_url && python analyze.py cities.csv"
        )
    return root

if __name__ == "__main__":
    co.main(default=main)
```
Now that you have run your first pipeline, you are all set up.
- Import your own software and code.
- Edit your pipeline to include custom logic.
- Upgrade your toolbox with free local mode.
- Combine sample user data with transaction data to build a model that predicts customer churn. [Sandbox][GitHub]
- Download US weather data then visualize it. [Sandbox][GitHub]
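The churn example above combines user data with transaction data. As a rough sketch of that kind of join-and-label step, in plain Python with hypothetical field names and sample records (not the actual sandbox code, which lives in the linked repo):

```python
from datetime import date

# Hypothetical stand-ins for the example's user and transaction tables.
users = [
    {"user_id": 1, "signup": date(2020, 1, 5)},
    {"user_id": 2, "signup": date(2020, 3, 9)},
]
transactions = [
    {"user_id": 1, "amount": 20.0, "when": date(2020, 6, 1)},
    {"user_id": 1, "amount": 35.0, "when": date(2020, 7, 15)},
    {"user_id": 2, "amount": 10.0, "when": date(2020, 3, 20)},
]

def churn_features(users, transactions, today):
    """Join transactions onto users; label a user churned if they
    have made no purchase in the last 90 days."""
    feats = []
    for u in users:
        txns = [t for t in transactions if t["user_id"] == u["user_id"]]
        last = max(t["when"] for t in txns)
        feats.append({
            "user_id": u["user_id"],
            "n_txns": len(txns),
            "total_spend": sum(t["amount"] for t in txns),
            "churned": (today - last).days > 90,
        })
    return feats

print(churn_features(users, transactions, today=date(2020, 9, 1)))
```

In the sandbox version, a step like this would run inside a `co.Exec` node, with a model-fitting step downstream of it.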
Join us on Slack in the #data-science channel.