Copying files using the command line
Learning to copy files using the command line is one of the most difficult tasks some students will encounter during Workshop practicals.
The faculty are not forcing the students to copy files using the command line based on a “that’s the way we did it” mentality, but rather on our current experience. Much of the data analysis that happens today is done on computing clusters in which the users’ only interaction with the computer is on the command line. Learning to use the command line effectively is an extremely important skill for the toolbox of anybody with data analysis ambitions.
This document is long because it attempts to explain things from as near to first principles as possible. You are likely familiar with many of the concepts discussed in the next section, in which case skip to the topics that will be useful.
Quick start
If you are already familiar with navigating directories and using the command line to copy files, then you should find getting started with the practicals to be straightforward.
You will create some directories to organize things, and then in most cases copy scripts and data out of the appropriate directory in /faculty to the directories you’ve created.
Some definitions
Because this may be a completely new topic for many of the students, we’ll start with some definitions.
Computer words
- command line: The command line refers to a text-based interface with a computer. Examples of these are MS Windows’ Command Shell and PowerShell, and macOS’ Terminal. In this course, connections to the cloud computing environment’s command line will be made using SSH.
- cli: An abbreviation for “command line interface”. This is often used when describing programs that are run at the command line. For example, R has a cli, but there are also other methods to interact with R, such as RStudio. plink and PRSice also have a cli.
- SSH: A versatile tool for securely transferring data between computers. In this course students may use SSH to access a command line in the cloud computing environment, or to copy files to or from the cloud computing environment.
- directory: A directory, also called a “folder”, is an organizational unit for computer files. Files exist within a directory. Directories may contain files, other directories (subdirectories), or even nothing at all.
- folder: Synonym for “directory”. The two terms can be used interchangeably.
- directory you’re in: This, or other phrases involving “in”, refers to the current working directory for the command line. Commands that do not specify a different directory will operate on the directory you are in. For example, less foo will show the contents of a file named “foo” in the directory you are in. less /faculty/foo will show the contents of a file named “foo” in the /faculty directory.
- cd “change directory”: The command used to change which directory you are in. It is similar to setwd() in R.
- home directory: Each user has a “home directory” where all of their files are stored. This can be abbreviated as ~ (the tilde symbol).
- faculty directory: For convenience, all of the faculty members’ home directories are assembled in a folder called /faculty.
- subdirectory: A directory that is within another directory. All directories (except the “root” directory) are subdirectories of other directories. Usually “subdirectory” will be used when this relationship is important. For example, instructions may say “copy the ‘HW2’ directory into a subdirectory of your ‘Day1’ directory.”
- / “forward slash”: The forward slash is the Unix/Linux/macOS directory separator. When writing out directory names, the “/” is used to separate directories and subdirectories. For example, ~/Day1/HW2 refers to a subdirectory named “HW2”, which is inside a directory named “Day1”, which is inside of your home directory.
- path: “Path” is used to refer to a series of directories and subdirectories. ~/Day1/HW2 is a path.
- file: an entity on a computer file system. Files may contain text, data, program instructions, or application specific data, such as a PowerPoint slide deck.
- . “dot”: (A single period, or “dot”) This represents the directory the command line is operating in.
- .. “dot dot”: (Two periods, or “dots”) This represents the directory one level higher in the hierarchy.
- copy: The act of duplicating a file or directory from its origin to a different location or name. This action is usually done with the cp command.
- move: The act of removing a file or directory from its origin, and putting it in a different location, or changing its name. This action is usually done with the mv command.
- cp: The Unix/Linux/macOS command used to copy files or directories.
- mv: The Unix/Linux/macOS command used to move or rename files or directories.
- ls: The Unix/Linux/macOS list command. It shows the names of files and directories, and can also show other information about them.
- less “sometimes less is more”: A general purpose tool for looking at the contents of a text file.
- mkdir “make directory”: The command to create a directory. For example, mkdir foo will create an empty directory named “foo”.
- */wild cards/globbing: These are characters which can be used to match multiple other characters. They are a powerful tool to avoid having to type multiple file names when an action is to be performed on several files. For example, foo.* could be used to match foo.bar, foo.baz, and foo.boz. The collective name of the characters used is “wild cards,” and the action of matching wild cards to files is called globbing.
- command line switches or options: Extra text given to a command to affect its behavior. Switches are often preceded by - or --. For example, in cp -v the “-v” is a switch to the “cp” command.
- command line arguments: The text after a command that tells the command what to operate on. For example, in cp foo bar, “foo” and “bar” are arguments to the “cp” command. Some commands may require switches before some arguments.
- ENTER or RETURN: After typing a command at the command line, the ENTER or RETURN key must be pressed to submit the command.
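The definitions of wild cards, switches, and arguments above can be tried out safely with a small sketch like the one below. It assumes a throwaway directory created with mktemp, and the file names are made up purely for illustration. Using echo with a wild card is a harmless way to preview which names the shell will match before running a real command.

```shell
# Work in a throwaway directory so no real files are touched.
scratch=$(mktemp -d)
cd "$scratch"

# Create four empty example files (hypothetical names).
touch foo.bar foo.baz foo.boz other.txt

# The shell replaces foo.* with every matching name before echo runs.
matched=$(echo foo.*)
echo "$matched"        # prints: foo.bar foo.baz foo.boz

# Here -l is a switch, and foo.bar is an argument to ls.
ls -l foo.bar
```

Note that other.txt is not matched, because its name does not start with “foo.”.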
Display conventions in this document
Text will be shown in several different fonts and formats to express meaning.
Text in a fixed, or typewriter, font represents text on a command line: either something that the user types or something that the computer outputs.
A screen shot of a terminal will show a sequence of command line entries and responses.
Required arguments to a command will be represented by text surrounded by pointy brackets < >. For example, cp <source> <destination> shows that some argument must be provided in the “source” and “destination” positions. When substituting in real values for the arguments, the pointy brackets are not included. So the typed command would look like cp source destination, to copy the file “source” to a file named “destination”.
Optional arguments are shown with square brackets [ ]. These are arguments which are not necessary for the command to function, but may be provided by the user to achieve desired results.
Anatomy of a command line
There are several items on the default command line used at the Workshop.
- The first part is your username. In this case the example username is student.
- @ is a separator.
- Then comes the computer name. In this example it is ip-10-0-201-191, but the exact name will be different depending on which cloud node you are connected to.
- : is a separator.
- ~ shows the current directory path. ~ is used as a shorthand for the current user’s home directory.
- $ is the end of the command line. Anything typed will appear after the $. Instructions later may show, for example, $ ls which will mean the user has typed ls at the command line.
- The green rectangle is the cursor. Depending on your SSH client and exact terminal settings, the exact color and shape of the cursor will vary.
Putting that all together, if you see a command line showing
smith12@ip-10-0-200-233:~/day2/R-files$
that means the user smith12 is logged into the compute node ip-10-0-200-233 and is currently in the R-files subdirectory of the day2 directory, which is itself inside their home directory.
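As a rough illustration of where the pieces of the prompt come from, the sketch below rebuilds a similar line from standard commands. It is only an approximation: the real prompt is produced by the shell's own settings, and it abbreviates the home directory to ~, whereas pwd prints the full path. The variable names are made up for this example.

```shell
user=$(id -un)     # the username part of the prompt
host=$(uname -n)   # the computer (node) name part
dir=$(pwd)         # the current directory, written out in full

# Assemble the pieces with the same separators the prompt uses.
prompt="${user}@${host}:${dir}\$"
echo "$prompt"
```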
Looking at files and directories
The list command
ls is the command used to list the names of files and directories.
The command ls is run at the command line, and it shows that there are two things in the current directory, named “R” and “foo”.
Switches can be given to ls to have it provide more information.
The “d” in drwxr-xr-x shows that the thing named “R” is a directory, and the first “-” in -rw-r--r-- shows that “foo” is a regular file. The letters following the first one have to do with permissions, and aren’t important at the moment.
Next is shown the owner of the files, “student,” and the group of the files, “students.” These also aren’t important for what we’re doing.
Next is shown the size of the file, then the date and time the file was last modified, and finally the name of the file or directory.
ls and ls -l are extremely useful for seeing what files and directories exist.
ls can be given a directory as an argument, and it will show the contents of that directory.
In all of these examples, ls is showing directories in blue. That will probably be how your screen looks, but depending on exactly which terminal and SSH client you use, directories may be shown in the same color as regular files.
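The ls variations described in this section can be reproduced with a sketch like the following. It assumes a throwaway directory that contains a directory “R” and a file “foo”, mirroring the example above. The -d switch and the cut command are additions not covered in this document: -d asks ls to describe a directory itself rather than its contents, and cut -c1 keeps only the first character of each line.

```shell
# Build a small example layout in a throwaway directory.
scratch=$(mktemp -d)
cd "$scratch"
mkdir R
echo "some random text" > foo

ls        # names only
ls -l     # long listing: type and permissions, owner, group, size, date, name
ls R      # contents of the R directory (empty here, so nothing is printed)

# The first character of a long listing tells a directory from a regular file.
ls -ld R | cut -c1     # prints: d
ls -l foo | cut -c1    # prints: -
```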
Looking inside a file
The less command can be used to view the contents of a text file. Many files, such as R scripts and some data files are just text, and can be easily viewed with less.
To view the contents of a file, run less followed by the name of the file.
student@ip-10-0-200-228:~$ less foo
will show the contents of foo, which is literally filled with some random text. The final line, foo (END), is a status message from less. It gives the name of the file being viewed and shows the position in the file.
If the file is long enough, it can be scrolled by pressing the arrow keys.
To exit less, press the q (quit) key.
Navigating directories
Moving between directories is done using the cd “change directory” command. The syntax of the command is cd [destination], where the destination is the name of the directory you want to move into. The destination is optional, because running cd with no destination will return you to your home directory.
Entering cd R has moved the user into the “R” directory, and the command prompt has been updated to reflect this change.
A full path can be given as the argument to cd, and you will be moved to the final directory in the path. The effect is the same as using multiple cd commands.
As can be seen in the previous few examples, the “/” (forward slash) character is extremely important, and it has different meanings depending on where it is in the path.
When at the start of a name, it is telling the computer to look in the “root” directory for that item. For example “/faculty” is in the “root” folder.
When in between names, it tells the computer that those are different directories or files. For example “elizabeth/2022/corrs.csv” is referencing something named “corrs.csv” which is in the “elizabeth” directory and then the “2022” subdirectory.
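Leaving out the leading “/” means that you are referencing something in the directory you are currently in. For example, cd R looks for “R” in the current directory, while cd /R looks for “R” in the root directory.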
There is no directory called “/R”, so it is not possible to change there. An error is shown, “No such file or directory”. This error is not serious, and does not cause any problems. It just means that the change directory command could not complete, and you should check for typos, a misplaced /, or other problems.
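The navigation rules in this section can be tried with a sketch along these lines. It assumes a throwaway directory holding a day2/R-files layout like the one in the prompt example, and /no-such-directory is a made-up name chosen because it should not exist. The variable names are also only for illustration.

```shell
# Build a small directory tree in a throwaway location.
scratch=$(mktemp -d)
mkdir "$scratch/day2"
mkdir "$scratch/day2/R-files"
cd "$scratch"

# A relative path (no leading /) starts from the directory you are in.
cd day2/R-files
pwd

# .. means one level up, so this lands in day2.
cd ..
after_up=$(pwd)
echo "$after_up"

# A path with a leading / starts from the root directory.
cd "$scratch/day2/R-files"

# A failed cd prints an error and leaves you where you were.
cd /no-such-directory || echo "cd failed, still in $(pwd)"
where_now=$(pwd)

# cd with no destination returns to the home directory.
cd
pwd
```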
Actually copying files
Using cp
“cp” is the primary command used to copy files at the command line.
The basic syntax
The basic syntax for cp is
cp <source> <destination>
cp creates a duplicate of the source file (or directory in some circumstances) at the destination.
Copying the file “foo” to another file called “bar” is done with the command
cp foo bar
This will result in two identical files, foo and bar, in the current directory.
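A minimal sketch of this basic copy, run in a throwaway directory, might look like the following. The cmp command is an addition not covered elsewhere in this document; it compares two files and succeeds only when their contents match.

```shell
# Work in a throwaway directory and create an example file.
scratch=$(mktemp -d)
cd "$scratch"
echo "some random text" > foo

# Copy foo to a new file named bar. foo is left untouched.
cp foo bar
ls

# Confirm that the two files have the same contents.
cmp foo bar && echo "foo and bar are identical"   # prints: foo and bar are identical
```

One thing to keep in mind: if a file named bar already exists, cp will normally overwrite it without asking.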
cp can be combined with wild cards to copy multiple files at the same time. For example
cp /faculty/elizabeth/2022/*.R .
will copy all of the files in /faculty/elizabeth/2022 that end in “.R” to the current directory, which is referenced by “.”, usually spoken as “dot”.
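The same pattern can be practiced without touching /faculty. The sketch below assumes a throwaway directory with a made-up “source” directory standing in for the faculty directory and a “mine” directory standing in for your own. The file names are hypothetical.

```shell
# Set up a stand-in source directory and a destination directory.
scratch=$(mktemp -d)
mkdir "$scratch/source" "$scratch/mine"
touch "$scratch/source/a.R" "$scratch/source/b.R" "$scratch/source/notes.txt"
cd "$scratch/mine"

# Preview what the wild card matches before copying anything.
echo ../source/*.R

# Copy every file ending in .R into the current directory (the final dot).
cp ../source/*.R .

# a.R and b.R are now here; notes.txt was not matched, so it was not copied.
ls
```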
cp can be given the -r “recursive” switch to cause it to copy a directory, and everything in that directory. For example, cp -r /faculty/elizabeth/2022 . copies everything in the directory /faculty/elizabeth/2022 to the current directory. Afterwards there is a new 2022 directory which contains a copy of everything that is in the /faculty/elizabeth/2022 directory.
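A recursive copy can be rehearsed in a throwaway directory with a sketch like this one. The “elizabeth/2022” layout and the corrs.csv file imitate the examples in this document, but they are created here only for illustration, as is the name “copy-attempt”.

```shell
# Build a stand-in for a faculty directory, plus a directory to copy into.
scratch=$(mktemp -d)
mkdir "$scratch/elizabeth" "$scratch/elizabeth/2022" "$scratch/home"
echo "x,y" > "$scratch/elizabeth/2022/corrs.csv"
cd "$scratch/home"

# -r copies the 2022 directory and everything inside it.
cp -r ../elizabeth/2022 .
ls 2022

# Without -r, cp declines to copy a directory and reports an error.
cp ../elizabeth/2022 copy-attempt || echo "cp without -r did not copy the directory"
```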
At the start of most practicals, you will use either cp -r or cp with wild cards to copy files out of the appropriate directory under /faculty to one of your directories.
Using mv
“mv” is the primary command used to move files at the command line. mv is used in a similar way to cp, but there are some very important differences. The most important is that mv removes the source file. After running mv you still have the same number of files or directories you started with, they are just located someplace else, or have a different name.
For example
mv foo bar
renames the file “foo” to “bar”. “foo” could also be a directory, in which case it would be renamed to a directory called “bar”.
When moving multiple files (or files and directories), the destination must be a directory.
mv *.R My-R
will move all of the files in the current directory that end in “.R” into the directory “My-R”. The destination directory, “My-R” in the example, must exist before running the mv command. It will not be created automatically.
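Both uses of mv can be tried in a throwaway directory with a sketch like the one below. The file names are made up for illustration.

```shell
# Work in a throwaway directory with a few example files.
scratch=$(mktemp -d)
cd "$scratch"
echo "some random text" > foo
touch a.R b.R

# Rename foo to bar. Afterwards foo no longer exists.
mv foo bar

# The destination directory must be created before moving files into it.
mkdir My-R
mv *.R My-R

ls        # shows bar and My-R
ls My-R   # shows a.R and b.R
```

Creating the destination first matters: if My-R did not exist and the wild card matched only one file, mv would rename that file to My-R instead of moving it into a directory.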
Creating directories with mkdir
The command to “make directories” is mkdir. It is very simple to use: just type mkdir followed by the name of the new directory.
To create a directory called “foo” just run
mkdir foo
Your friends TAB and Up Arrow
Two huge time savers are the use of the TAB key and the Up Arrow key.
TAB
TAB is used to complete text on the command line. For example, if I want to copy files from /faculty/elizabeth/2022, I don’t need to type out all of those characters. This is what my typing will actually look like, with #TAB# for each time I press the TAB key.
cp /f#TAB#
completes to
cp /faculty/
and then continue typing
cp /faculty/el#TAB#
which completes to
cp /faculty/elizabeth/
If a completion isn’t unique, then pressing TAB a second time will list the possible completions. If nothing is listed after repeated pressings of TAB, then there aren’t any possible completions.
This is what that might look like at a terminal, with a red mark inserted each time I pressed the TAB key, and the rest of the line being what was automatically added by the computer.
Use of the TAB key is highly recommended to avoid typos in long file and directory names.
Up Arrow
The Up Arrow is used to recall previously typed commands. Those commands can then be edited or used again as they are. The red arrow shows where I pressed the Up Arrow.
I pressed the Up Arrow once to recover ls, pressed ENTER, and then I pressed the Up Arrow again to recover ls, but edited the line to add a -l before pressing ENTER.
That is a very trivial example, and it is hardly worth pressing the Up Arrow to recover ls, but on long and complicated commands, the Up Arrow is a large time saver.
After pressing the Up Arrow multiple times and getting into your command “history”, it is possible to use the Down Arrow to move to more recent commands. You can return to an empty command line by either pressing the Down Arrow until you are back at a bare prompt, or pressing ctrl-c.
Long commands can be edited by using the left and right arrows, and when modified to your satisfaction, pressing the ENTER (or RETURN) key will submit the command line.