===========================================================
NEP 34 — Disallow inferring ``dtype=object`` from sequences
===========================================================

:Author: Matti Picus
:Status: Accepted
:Type: Standards Track
:Created: 2019-10-10
:Resolution: https://mail.python.org/pipermail/numpy-discussion/2019-October/080200.html

Abstract
--------

When users create arrays from sequences of sequences, they sometimes fail to
match the lengths of the nested sequences_, producing what are commonly called
"ragged arrays". Here we refer to such input as ragged nested sequences.
Creating an array via ``np.array([<ragged_nested_sequence>])`` with no ``dtype``
keyword argument currently defaults to an ``object``-dtype array. This NEP
proposes changing that behaviour to raise a ``ValueError`` instead.

Motivation and Scope
--------------------

Users who specify lists-of-lists when creating a `numpy.ndarray` via
``np.array`` may mistakenly pass in lists of different lengths. Currently we
accept this input and automatically create an array with ``dtype=object``. This
can be confusing, since it is rarely what is desired. Changing the automatic
dtype detection so that it never returns ``object`` for ragged nested sequences
(defined as a recursive sequence of sequences in which not all the sequences on
the same level have the same length) will force users who actually wish to
create ``object`` arrays to say so explicitly. Note that ``list``, ``tuple``,
and ``np.ndarray`` objects are all sequences [0]_. See for instance
`issue 5303`_.

Usage and Impact
----------------

After this change, array creation with ragged nested sequences must explicitly
define a dtype:

    >>> np.array([[1, 2], [1]])
    ValueError: cannot guess the desired dtype from the input

    >>> np.array([[1, 2], [1]], dtype=object)
    # succeeds, with no change from current behaviour
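
The behaviour described above can be checked with a small helper. This is a
minimal sketch, not part of the proposal: ``is_rejected`` is a hypothetical
name, and the sketch assumes a NumPy version that implements this NEP, where
ragged input either emits a warning (during the deprecation period) or raises
``ValueError``. The exact error message may differ from the one shown above::

    import warnings

    import numpy as np

    def is_rejected(nested):
        """Return True if np.array(nested), with no dtype, warns or raises."""
        with warnings.catch_warnings():
            # Turn warnings into exceptions so that the deprecation period is
            # detected as well as the final ValueError.
            warnings.simplefilter("error")
            try:
                np.array(nested)
            except (ValueError, Warning):
                return True
        return False

    rectangular = [[1, 2], [3, 4]]
    ragged = [[1, 2], [1]]

    rectangular_rejected = is_rejected(rectangular)  # regular input is unaffected
    ragged_rejected = is_rejected(ragged)            # ragged input, no dtype

    # Passing dtype=object explicitly is accepted, as before.
    explicit = np.array(ragged, dtype=object)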

The deprecation will affect any call that internally calls ``np.asarray``.  For
instance, the ``assert_equal`` family of functions calls ``np.asarray``, so
users will have to change code like::

    np.testing.assert_equal(a, [[1, 2], 3])

to::

    np.testing.assert_equal(a, np.array([[1, 2], 3], dtype=object))
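
As an illustration of the rewritten call, the sketch below builds both sides as
explicit ``object`` arrays and compares them. It uses
``np.testing.assert_equal``, which is where NumPy exposes this function; the
variable names and values are placeholders chosen for this example::

    import numpy as np

    # Hypothetical value under test: an object array holding a list and an int.
    a = np.array([[1, 2], 3], dtype=object)

    # The expected value is built with an explicit dtype, so no dtype is
    # inferred from the ragged input.
    expected = np.array([[1, 2], 3], dtype=object)

    # Raises AssertionError if any pair of elements differs.
    np.testing.assert_equal(a, expected)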

Detailed description
--------------------

Because it is sometimes hard to determine what shape is desired, users can set
the shape of the object array explicitly by allocating it first and then
assigning the values:

    >>> arr = np.empty(correct_shape, dtype=object)
    >>> arr[...] = values
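
A concrete version of this pattern might look like the following sketch. The
input ``values`` and the choice of a one-dimensional result are assumptions
made for illustration. Elements are assigned one at a time so that each nested
list is stored as a single object regardless of its length::

    import numpy as np

    values = [[1, 2], [1], [3, 4, 5]]  # example ragged input

    # Allocate the outer shape first: one slot per nested list.
    arr = np.empty(len(values), dtype=object)

    # Integer indexing stores each list as one element of the object array.
    for i, item in enumerate(values):
        arr[i] = item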

We will also reject mixed sequences of non-sequence and sequence elements. For
instance, all of these will be rejected:

    >>> arr = np.array([np.arange(10), [10]])
    >>> arr = np.array([[range(3), range(3), range(3)], [range(3), 0, 0]])

Related Work
------------

`PR 14341`_ tried to raise an error when ragged nested sequences were specified
with a numeric dtype, as in ``np.array([[1], [2, 3]], dtype=int)``, but failed
due to false positives, for instance ``np.array([1, np.array([5])], dtype=int)``.

.. _`PR 14341`: https://github.com/numpy/numpy/pull/14341

Implementation
--------------

The code to be changed is inside ``PyArray_GetArrayParamsFromObject`` and the
internal ``discover_dimensions`` function. See `PR 14794`_.

Backward compatibility
----------------------

Anyone depending on creating object arrays from ragged nested sequences will
need to modify their code. There will be a deprecation period during which the
current behaviour will emit a ``DeprecationWarning``. 

Alternatives
------------

- We could continue with the current situation.

- It was also suggested to add a kwarg ``depth`` to array creation, or perhaps
  to add another array creation API function ``ragged_array_object``. The goal
  was to eliminate the ambiguity in creating an object array from ``array([[1,
  2], [1]], dtype=object)``: should the returned array have a shape of
  ``(1,)``, or ``(2,)``? This NEP does not deal with that issue, and only
  deprecates the use of ``array`` with no ``dtype=object`` for ragged nested
  sequences. Users of ragged nested sequences may face another deprecation
  cycle in the future. Rationale: we expect that very few users intend to use
  ragged arrays like that, and this was never intended as a use case of NumPy
  arrays. Such users are likely better off with `another library`_ or with
  plain lists of lists.

- It was also suggested to deprecate all automatic creation of ``object``-dtype
  arrays, which would require adding an explicit ``dtype=object`` for something
  like ``np.array([Decimal(10), Decimal(10)])``. This too is out of scope for
  the current NEP. Rationale: it is harder to assess the impact of this larger
  change, and we are not sure how many users it may affect.

Discussion
----------

Comments to `issue 5303`_ indicate this is unintended behaviour as far back as
2014. Suggestions to change it have been made in the ensuing years, but none
have stuck. The WIP implementation in `PR 14794`_ seems to point to the
viability of this approach.

References and Footnotes
------------------------

.. _`issue 5303`: https://github.com/numpy/numpy/issues/5303
.. _sequences: https://docs.python.org/3.7/glossary.html#term-sequence
.. _`PR 14794`: https://github.com/numpy/numpy/pull/14794
.. _`another library`: https://github.com/scikit-hep/awkward-array

.. [0] ``np.ndarray`` objects are not recursed into; rather, their shape is
   used directly. This will not emit warnings::

      ragged = np.array([[1], [1, 2, 3]], dtype=object)
      np.array([ragged, ragged]) # no dtype needed

Copyright
---------

This document has been placed in the public domain.