# ATen: A TENsor library

ATen is a simple tensor library that exposes the Tensor operations in Torch
and PyTorch directly in C++11. The wrapper respects the semantics of operators
in PyTorch, except for minor details due to differences between C++ and Python in
the way default arguments are handled. See the [documentation for tensors](http://pytorch.org/docs/tensors.html) in PyTorch for what these operations do.
ATen's API is auto-generated from the same declarations PyTorch uses, so the
two APIs will track each other over time.

Tensor types are resolved dynamically, such that the API is generic and
does not include templates. That is, there is one `Tensor` type. It can hold a
CPU or CUDA Tensor, and the tensor may hold doubles, floats, ints, etc. This design
makes it easy to write generic code without templating everything.
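
To make the idea of a single, dynamically typed tensor type concrete, here is a minimal standalone sketch. It does not use ATen and is not how ATen is implemented; `ScalarKind`, `DynTensor`, `sums_match`, and `dyn_tensor_demo` are hypothetical names chosen for illustration. It only shows the general pattern: the element type is a runtime property, so calling code needs no templates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical runtime tag for the element type (not ATen's actual enum).
enum class ScalarKind { Float, Double, Int };

// A single, non-templated tensor-like type: the element type is a runtime
// property of the object rather than a template parameter.
class DynTensor {
 public:
  explicit DynTensor(std::vector<float> v)
      : kind_(ScalarKind::Float), f_(std::move(v)) {}
  explicit DynTensor(std::vector<double> v)
      : kind_(ScalarKind::Double), d_(std::move(v)) {}
  explicit DynTensor(std::vector<int32_t> v)
      : kind_(ScalarKind::Int), i_(std::move(v)) {}

  ScalarKind kind() const { return kind_; }

  // One generic entry point; the element type is resolved by a runtime switch.
  double sum() const {
    switch (kind_) {
      case ScalarKind::Float:  return sum_of(f_);
      case ScalarKind::Double: return sum_of(d_);
      case ScalarKind::Int:    return sum_of(i_);
    }
    throw std::logic_error("unknown scalar kind");
  }

 private:
  template <typename T>
  static double sum_of(const std::vector<T>& v) {
    double s = 0;
    for (const T& x : v) s += static_cast<double>(x);
    return s;
  }

  ScalarKind kind_;
  std::vector<float> f_;  // only the vector matching kind_ is populated
  std::vector<double> d_;
  std::vector<int32_t> i_;
};

// Generic caller code: no templates, works for every element type.
inline bool sums_match(const DynTensor& a, const DynTensor& b) {
  return a.sum() == b.sum();
}

inline void dyn_tensor_demo() {
  DynTensor f(std::vector<float>{1.0f, 2.0f, 3.0f});
  DynTensor i(std::vector<int32_t>{1, 2, 3});
  assert(f.kind() != i.kind());  // different element types...
  assert(sums_match(f, i));      // ...handled by the same non-templated code
}
```

The cost of this design is one runtime branch per operation, which is negligible for tensor-wide operations.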

See the _generated_ [`Tensor.h` file](doc/Tensor.h) and [`Functions.h` file](doc/Functions.h) for the provided API. Excerpt:
```c++
Tensor atan2(const Tensor & other) const;
Tensor & atan2_(const Tensor & other);
Tensor pow(Scalar exponent) const;
Tensor pow(const Tensor & exponent) const;
Tensor & pow_(Scalar exponent);
Tensor & pow_(const Tensor & exponent);
Tensor lerp(const Tensor & end, Scalar weight) const;
Tensor & lerp_(const Tensor & end, Scalar weight);
Tensor histc() const;
Tensor histc(int64_t bins) const;
Tensor histc(int64_t bins, Scalar min) const;
Tensor histc(int64_t bins, Scalar min, Scalar max) const;
```

In-place operations are also provided and are always suffixed with `_` to indicate that they modify the Tensor.

### Installation

TH/THC/THNN/THCUNN are provided (as git subtrees), so the repo is standalone. You will need a C++11 compiler, CMake, and the `pyyaml` Python package.
```

# Install pyyaml used by python code generation to read API declarations

# OSX: if you don't have pip
sudo easy_install pip
# Ubuntu: if you don't have pip
apt-get -y install python-pip

# if you don't have pyyaml
sudo pip install pyyaml

mkdir build
cd build
cmake .. -DCMAKE_INSTALL_PREFIX=/where/you/want # specify your dest directory
# cmake .. -DNO_CUDA=true  # for CPU only machines
make install
```

### Example usage

Here is a simple example; again, the syntax follows Torch semantics.

```c++
using namespace at; // assumed in the following

Tensor d = CPU(kFloat).ones({3, 4});
Tensor r = CPU(kFloat).zeros({3,4});
for(auto i = 0; i < 100000; i++) {
  r = r.add(d);
  // equivalently
  r = r + d;
  // or
  r += d;
}
```

Want this running on the GPU?
```c++
using namespace at; // assumed in the following

Tensor d = CUDA(kFloat).ones({3, 4});
Tensor r = CUDA(kFloat).zeros({3,4});
for(auto i = 0; i < 100000; i++) {
  r = r.add(d);
  // equivalently
  r = r + d;
  // or
  r += d;
}
```

Expressions like `CUDA(kFloat)` are first-class `at::Type` objects that represent
the type of a Tensor and are used to create Tensors when their type cannot be
inferred. See the _generated_ [Type header](doc/Type.h) for its API.

See more in [sample files](src/ATen/test).

### Creating your kernel

It is easy to create new kernels, thanks to the `dispatch<>()` templated function. Example:
```c++

// a simple sum kernel (for CPU only)
template<typename T>
struct sum_op {
  // dispatch handles variable arguments for you
  Tensor CPU(const Type & t, Tensor & x_)
  {
    Tensor x = x_.contiguous();
    auto x_p = x.data<T>();
    int64_t size = x.numel();
    T sum = 0;
    for(int64_t i = 0; i < size; i++) {
      sum += x_p[i];
    }
    return sum;
  };
  // same arguments as CPU above, since dispatch forwards them unchanged
  Tensor CUDA(const Type & t, Tensor & x) {
    throw std::invalid_argument("device not supported");
  };
};

Tensor a = CPU(kFloat).rand({3, 7});
std::cout << a << std::endl;
std::cout << dispatch<sum_op>(a.type(),a) << " == " << a.sum() << std::endl;
```
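
The general technique behind a templated dispatcher can be sketched without ATen. The following is an illustration only, not ATen's `dispatch<>()` implementation: `ScalarKind`, `Buffer`, `sum_kernel`, `dispatch_sketch`, and `dispatch_demo` are hypothetical names. It shows how a runtime type tag can select a compile-time instantiation of a kernel template while forwarding the remaining arguments.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <utility>

// Hypothetical runtime type tag and type-erased buffer standing in for
// at::Type and at::Tensor; neither is ATen's actual definition.
enum class ScalarKind { Float, Double };

struct Buffer {
  ScalarKind kind;
  const void* data;
  std::size_t size;
};

// A kernel written once as a template over the element type T.
template <typename T>
struct sum_kernel {
  static double apply(const Buffer& b) {
    assert(b.data != nullptr || b.size == 0);  // precondition
    const T* p = static_cast<const T*>(b.data);
    T sum = 0;
    for (std::size_t i = 0; i < b.size; i++) {
      sum += p[i];
    }
    return static_cast<double>(sum);
  }
};

// Maps the runtime tag to a compile-time instantiation of Op and forwards
// the remaining arguments unchanged.
template <template <typename> class Op, typename... Args>
double dispatch_sketch(ScalarKind kind, Args&&... args) {
  switch (kind) {
    case ScalarKind::Float:
      return Op<float>::apply(std::forward<Args>(args)...);
    case ScalarKind::Double:
      return Op<double>::apply(std::forward<Args>(args)...);
  }
  throw std::invalid_argument("unsupported scalar kind");
}

inline double dispatch_demo() {
  static const float xs[] = {1.0f, 2.0f, 3.5f};
  Buffer b{ScalarKind::Float, xs, 3};
  return dispatch_sketch<sum_kernel>(b.kind, b);  // runs sum_kernel<float>
}
```

The switch runs once per kernel call, so the loop inside the kernel is compiled for a concrete element type and pays no per-element dispatch cost.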

### Efficient access to tensor elements

When using Tensor-wide operations, the relative cost of dynamic dispatch is very small.
However, there are cases, especially in your own kernels, where efficient element-wise access is needed,
and the cost of dynamic dispatch inside the element-wise loop is very high.
ATen provides _accessors_ that are created with a single dynamic check that a Tensor has the expected type and number of
dimensions. Accessors then expose an API for accessing the Tensor elements efficiently:

```c++

Tensor foo = CPU(kFloat).rand({12,12});

// assert foo is 2-dimensional and holds floats.
auto foo_a = foo.accessor<float,2>();
float trace = 0;

for(int i = 0; i < foo_a.size(0); i++) {
  // use the accessor foo_a to get tensor data.
  trace += foo_a[i][i];
}
```

Accessors are temporary views of a Tensor. They are only valid for the lifetime of the tensor that they
view and hence should only be used locally in a function, like iterators.
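
As a rough illustration of why accessors are cheap, here is a standalone sketch of the pattern, assuming a strided, type-erased tensor. It is not ATen's accessor implementation; `RawTensor`, `kind_of`, `AccessorSketch`, `make_accessor`, and `trace_demo` are hypothetical names. The type and dimensionality are checked once when the accessor is created, and indexing afterwards is plain pointer arithmetic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

enum class ScalarKind { Float, Double };

// Maps a C++ element type to its runtime tag.
template <typename T> struct kind_of;
template <> struct kind_of<float> {
  static ScalarKind value() { return ScalarKind::Float; }
};
template <> struct kind_of<double> {
  static ScalarKind value() { return ScalarKind::Double; }
};

// Hypothetical type-erased, non-owning strided tensor.
struct RawTensor {
  ScalarKind kind;
  void* data;
  std::vector<int64_t> sizes;
  std::vector<int64_t> strides;  // measured in elements
};

// Statically typed view; operator[] peels off one dimension with no checks.
template <typename T, int N>
class AccessorSketch {
 public:
  AccessorSketch(T* data, const int64_t* sizes, const int64_t* strides)
      : data_(data), sizes_(sizes), strides_(strides) {}
  int64_t size(int d) const { return sizes_[d]; }
  AccessorSketch<T, N - 1> operator[](int64_t i) const {
    return AccessorSketch<T, N - 1>(data_ + i * strides_[0], sizes_ + 1,
                                    strides_ + 1);
  }

 private:
  T* data_;
  const int64_t* sizes_;    // points into the tensor, hence the lifetime rule
  const int64_t* strides_;
};

// Base case: indexing a 1-D accessor yields an element reference.
template <typename T>
class AccessorSketch<T, 1> {
 public:
  AccessorSketch(T* data, const int64_t* sizes, const int64_t* strides)
      : data_(data), sizes_(sizes), strides_(strides) {}
  int64_t size(int d) const { return sizes_[d]; }
  T& operator[](int64_t i) const { return data_[i * strides_[0]]; }

 private:
  T* data_;
  const int64_t* sizes_;
  const int64_t* strides_;
};

// The only dynamic check, performed once when the accessor is created.
template <typename T, int N>
AccessorSketch<T, N> make_accessor(RawTensor& t) {
  assert(t.sizes.size() == t.strides.size());
  if (t.kind != kind_of<T>::value() ||
      t.sizes.size() != static_cast<std::size_t>(N)) {
    throw std::invalid_argument("unexpected element type or dimensionality");
  }
  return AccessorSketch<T, N>(static_cast<T*>(t.data), t.sizes.data(),
                              t.strides.data());
}

inline float trace_demo() {
  std::vector<float> storage = {1, 2, 3, 4};  // 2x2, row-major
  RawTensor t{ScalarKind::Float, storage.data(), {2, 2}, {2, 1}};
  auto a = make_accessor<float, 2>(t);  // single dynamic check
  float trace = 0;
  for (int64_t i = 0; i < a.size(0); i++) {
    trace += a[i][i];  // no type dispatch inside the loop
  }
  return trace;  // 1 + 4
}
```

Because the sketched accessor stores pointers into the tensor's size and stride arrays, it dangles as soon as the tensor is destroyed, which mirrors the lifetime restriction described above.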

### Using externally created data

If you already have your tensor data allocated in memory (CPU or CUDA),
you can view that memory as a Tensor in ATen:

```c++
float data[] = { 1, 2, 3,
                 4, 5, 6};
auto f = CPU(kFloat).tensorFromBlob(data, {2,3});
std::cout << f << std::endl;
```

These tensors cannot be resized because ATen does not own the memory, but otherwise
behave as normal tensors.
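
The ownership rule can be illustrated with a small standalone sketch. This is not how `tensorFromBlob` is implemented; `BlobView2D` and `blob_demo` are hypothetical names, and the view is restricted to contiguous 2-D float data for brevity. It shows a non-owning view whose writes land directly in the caller's memory and which refuses to resize.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical non-owning 2-D view over caller-allocated, row-major floats.
class BlobView2D {
 public:
  BlobView2D(float* data, int64_t rows, int64_t cols)
      : data_(data), rows_(rows), cols_(cols) {}

  // There is deliberately no destructor that frees data_: the caller keeps
  // ownership of the memory.

  float& at(int64_t r, int64_t c) const {
    assert(r >= 0 && r < rows_ && c >= 0 && c < cols_);
    return data_[r * cols_ + c];
  }

  // Resizing would mean reallocating memory the view does not own.
  void resize(int64_t /*rows*/, int64_t /*cols*/) {
    throw std::logic_error("cannot resize a view over external memory");
  }

 private:
  float* data_;
  int64_t rows_;
  int64_t cols_;
};

inline float blob_demo() {
  float data[] = {1, 2, 3,
                  4, 5, 6};
  BlobView2D f(data, 2, 3);
  f.at(1, 2) = 60.0f;  // write through the view
  return data[5];      // the write went straight to the caller's array
}
```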

### Scalars and zero-dimensional tensors

In addition to the `Tensor` objects, ATen also includes `Scalar`s that represent a single number.
Like a Tensor, Scalars are dynamically typed and can hold any one of ATen's [number types](doc/Type.h).
Scalars can be implicitly constructed from C++ number types. Scalars are needed because some functions like `addmm` take numbers along with Tensors and expect these
numbers to be the same dynamic type as the tensor. They are also used in the API to indicate places where
a function will _always_ return a Scalar value, like `sum`.

```c++
Tensor addmm(Scalar beta, const Tensor & self,
             Scalar alpha, const Tensor & mat1,
             const Tensor & mat2);
Scalar sum(const Tensor & self);

//usage
Tensor a = ...
Tensor b = ...
Tensor c = ...
Tensor r = addmm(1.0, a, .5, b, c);
```

In addition to Scalars, ATen also allows Tensor objects to be zero-dimensional. These Tensors hold
a single value, and they can be references to a single element in a larger Tensor. They can be used anywhere a Tensor is expected. They are normally created by operators like `select`, which reduce the number of dimensions of
a Tensor.

```c++
Tensor two = CPU(kFloat).rand({10,20});
two[1][2] = 4;
//~~~~~~~  zero-dimensional Tensor
```
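
To see why a zero-dimensional tensor can act as a reference into a larger one, consider the following standalone sketch. It does not use ATen; `ViewSketch` and `zero_dim_demo` are hypothetical names, and the sketch assumes float storage with element strides. Each indexing step drops one dimension, and once no dimensions remain the view designates exactly one element of the original storage.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical strided, non-owning view over float storage.
class ViewSketch {
 public:
  ViewSketch(float* data, std::vector<int64_t> sizes,
             std::vector<int64_t> strides)
      : data_(data), sizes_(std::move(sizes)), strides_(std::move(strides)) {}

  std::size_t dim() const { return sizes_.size(); }

  // Selecting index i along the first dimension drops that dimension.
  // (Bounds on i are not checked in this sketch.)
  ViewSketch operator[](int64_t i) const {
    if (dim() == 0) {
      throw std::out_of_range("cannot index a zero-dimensional view");
    }
    return ViewSketch(
        data_ + i * strides_[0],
        std::vector<int64_t>(sizes_.begin() + 1, sizes_.end()),
        std::vector<int64_t>(strides_.begin() + 1, strides_.end()));
  }

  // Assigning a number writes through to the single referenced element.
  ViewSketch& operator=(float v) {
    if (dim() != 0) {
      throw std::logic_error("only a zero-dimensional view holds one value");
    }
    *data_ = v;
    return *this;
  }

 private:
  float* data_;
  std::vector<int64_t> sizes_;
  std::vector<int64_t> strides_;
};

inline float zero_dim_demo() {
  std::vector<float> storage(10 * 20, 0.0f);
  ViewSketch two(storage.data(), {10, 20}, {20, 1});
  assert(two[1][2].dim() == 0);  // two indexing steps leave no dimensions
  two[1][2] = 4;                 // writes into the larger storage
  return storage[1 * 20 + 2];
}
```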

It is possible to convert between Scalar and zero-dim Tensors:

```c++
Tensor zero_dim = CPU(kFloat).scalarTensor(4);
Scalar from_tensor = Scalar(zero_dim); //only valid when zero_dim.dim() == 0;
```

### Avoiding unnecessary CUDA synchronization in your kernels when using Scalars

Moving a single number from the GPU to the CPU introduces a synchronization point
that can add latency to your program. In certain cases, the result of a GPU operator like `sum`, which
returns a Scalar, may be plugged into another GPU operator as an argument. If Scalars were always copied
to the CPU, this would result in two copies: one from the GPU to the CPU and one back to the GPU. To avoid
these synchronizations, Scalar objects can be optionally backed by a zero-dim Tensor, and are only copied
to the CPU when requested.

```c++
auto a = CUDA(kFloat).rand({3,4});
Scalar on_gpu = Scalar(a[1][1]); //backed by zero-dim Tensor
assert(on_gpu.isBackedByTensor());

double value = on_gpu.toDouble(); // copied to CPU, if it was backed by GPU Tensor.
Scalar svalue = on_gpu.local(); // force the Scalar to become local to CPU.

// get the scalar as a zero-dim tensor. If it was already backed
// by a zero-dim Tensor then this op has no synchronization.
// if the Scalar was local on CPU, it performs the copy
Tensor same_tensor = CUDA(kFloat).scalarTensor(on_gpu);
```

Operators aware of the location of Scalars can arrange to do the minimal number of copies required.
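
The copy-avoidance idea can be modeled on the CPU alone. The sketch below is a simulation under stated assumptions, not ATen's `Scalar`: `DeviceCell` stands in for a value in GPU memory and counts simulated device-to-host copies, and `LazyScalar`, `device_fill`, and `lazy_scalar_demo` are hypothetical names. It shows a scalar that defers the host copy until a host value is requested, and an operator that consumes a device-backed scalar without triggering that copy.

```cpp
#include <cassert>

// Hypothetical stand-in for a value living in GPU memory. read_to_host()
// models the synchronizing device-to-host copy and counts how often it runs.
struct DeviceCell {
  double value;
  int host_copies;
  double read_to_host() {
    ++host_copies;
    return value;
  }
};

// A scalar that is either a local host value or a handle to a device value.
class LazyScalar {
 public:
  explicit LazyScalar(double v) : local_(v), cell_(nullptr) {}
  explicit LazyScalar(DeviceCell* cell) : local_(0.0), cell_(cell) {}

  bool is_backed_by_device() const { return cell_ != nullptr; }
  DeviceCell* device_cell() const { return cell_; }

  // Copies to the host only when a host value is actually requested.
  double to_double() const { return cell_ ? cell_->read_to_host() : local_; }

 private:
  double local_;
  DeviceCell* cell_;
};

// A location-aware "device" operator: a device-backed scalar is consumed
// in place, so no host copy happens.
inline void device_fill(DeviceCell& dst, const LazyScalar& s) {
  if (s.is_backed_by_device()) {
    dst.value = s.device_cell()->value;  // models a device-to-device copy
  } else {
    dst.value = s.to_double();           // models a host-to-device copy
  }
}

inline int lazy_scalar_demo() {
  DeviceCell src{2.5, 0};
  DeviceCell dst{0.0, 0};
  LazyScalar s(&src);

  device_fill(dst, s);  // stays on the simulated device
  assert(dst.value == 2.5 && src.host_copies == 0);

  double host = s.to_double();  // the first and only host copy
  assert(host == 2.5);
  return src.host_copies;
}
```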

### Developer notes

ATen relies heavily on code generation to automatically generate headers
and implementations for all of the tensor methods it supports.  The main
entry point for the script which does all this work is
[`src/ATen/gen.py`](src/ATen/gen.py), which ingests
[`src/ATen/Declarations.cwrap`](src/ATen/Declarations.cwrap),
[`src/ATen/nn.yaml`](src/ATen/nn.yaml) and the THNN/THCUNN headers and
produces all of the headers and wrapping code necessary to generate
the ATen interface.

If you need to understand how ATen understands a declaration after all
of this processing occurs, it's helpful to look at the generated file
`Declarations.yaml` (NB: not cwrap), which contains information for all
ATen methods in a uniform manner.  This file is used by PyTorch,
which further extends the ATen interface with support for automatic
differentiation.