https://doi.org/10.1186/1472-6939-15-8
© 2014 Masaki et al.; licensee BioMed Central Ltd.
Received: 7 July 2013
Accepted: 6 January 2014
Published: 4 February 2014
Since Japan adopted the concept of informed consent from the West, its inappropriate acquisition from patients in the Japanese clinical setting has continued, due in part to cultural aspects. Here, we discuss the current status of and contemporary issues surrounding informed consent in Japan, and how these are influenced by Japanese culture.
Current legal norms towards informed consent and information disclosure are obscure in Japan. For instance, physicians in Japan do not have a legal duty to inform patients of a cancer diagnosis. To gain a better understanding of these issues, we present five court decisions related to informed consent and information disclosure. We then discuss Japanese culture through reviews of published opinions and commentaries regarding how culture affects decision making and obtaining informed consent. We focus on two contemporary problems involving informed consent and relevant issues in clinical settings: the misuse of informed consent and persistence in obtaining consent. For the former issue, the phrase "informed consent" is often used to express an opportunity to disclose medical conditions and recommended treatment choices. The casual use of the expression "informed consent" likely reflects deep-rooted cultural influences. For the latter issue, physicians may try to obtain a signature by doing whatever it takes, lacking a deep understanding of important ethical principles, such as protecting human dignity, serving the patient’s best interest, and doing no harm in decision-making for patients.
There is clearly a misunderstanding of the concept of informed consent and a lack of complete understanding of ethical principles among Japanese healthcare professionals. Although similar in some respects to informed consent as it originated in the United States, our review makes it clear that informed consent in Japan has clear distinguishing features.
Japanese healthcare professionals should aim to understand the basic nature of informed consent, irrespective of their attitudes about individualism, liberalism, and patient self-determination. If they believe that the concept of informed consent is important and essential in Japanese clinical settings, efforts should be made to obtain informed consent in an appropriate manner.
Beauchamp and Childress argued that virtually all codes of medical ethics and institutional regulations should require physicians to obtain informed consent from patients prior to substantial interventions, with the protection of patient autonomy as the primary justification for this requirement. They also claimed that informed consent is an individual’s autonomous authorization and postulated seven structural elements[1], including threshold elements (competence to understand and decide; voluntariness in deciding), information elements (disclosure of material information; recommendation of a plan; understanding of the information and recommended plan), and consent elements (decision in favor of the plan; authorization of the chosen plan)[1]. We think that these elements are quite clear and comprehensive, and could provide a useful framework for the critical review of various contemporary issues surrounding informed consent acquisition in Japan.
The concept of informed consent received a great deal of attention during the 1980s in Japan. In 1990, informed consent was translated into Japanese as "setsumei to doi" (back-translated as "explanation and consent"). This Japanese translation, however, carries a connotation that informed consent is a duty owed to patients and does not properly convey the notion that informed consent is a patient’s right[2]. In other words, the Japanese translation fails to grasp the "consent elements" of the framework described above. Currently in Japan, informed consent is often obtained without the patient’s understanding, physician’s recommendation, or adequate time to think[3]. In Japan as well as in other countries, many difficult issues regarding patient self-determination and acquisition of informed consent remain even after an ethical norm to obtain informed consent from patients in clinical settings and for research projects has been developed and established. They include compulsive interventions, treatment decisions for incompetent patients or minors, and issues surrounding treatment refusal[4, 5].
In this paper, we discuss current situations and cultural characteristics concerning informed consent in Japan to outline the problems that we think are common and relevant. First, we review five court decisions related to informed consent and information disclosure. Next, we discuss the characteristics of Japanese culture by reviewing published opinions and commentaries. Then, we describe two contemporary issues concerning informed consent in current clinical settings in Japan: misuse of informed consent and persistence in obtaining consent. Finally, we present our opinion on current situations surrounding informed consent in Japan. Our focus is on informed consent in clinical settings; we do not address informed consent in research settings.
In the past three decades, the Japanese Supreme Court has set forth decisions in four cases concerning truth-telling and informed consent, and one district court considered a case about the necessity of disclosure to families of patients. The first case concerned disclosure of a cancer diagnosis. A physician failed to inform a patient that she had gall bladder cancer, but instead told her that she had a gall stone that required inpatient care. However, the patient did not come back to the hospital and, as a result, the physician did not inform either the patient or the patient’s family. In 1995, the Japanese Supreme Court concluded that a physician does not need to disclose a cancer diagnosis on the ground that the physician can overlook a patient’s right to self-determination, if, in their judgment, the actual diagnosis could have an adverse impact on the patient[2]. In this case, the principle of non-maleficence was prioritized over respect for patient autonomy.
So far, we haven’t explored LSTMs. We’ve merely set up a foundation for them. And there’s one glaring issue with our foundation: if we just keep adding information to the cell state, it could grow and grow and grow, and essentially act as a counter that only increments. This is not very useful, and could regularly lead to explosion. We want finer, richer control over memory. Well, worry not, because this is exactly what LSTMs are capable of doing.
LSTM cells handle memory in a very intelligent way, enabling them to learn long-term dependencies and perform well. How, exactly? Well, the cell state is sort of like an internal memory that allows for context; it “forgets” (a.k.a. resets) information it doesn’t find useful from the previous cell state, “writes” in new information it finds useful from the current input and/or previous hidden state, and similarly only “reads” out part of its information (the good stuff) in the computation of h_t . These respectively correspond to the concepts of resetting memory, writing to memory, and reading from memory. It’s very similar to how a modern computer system works, and we often describe an LSTM cell as a “memory cell”.
The “writing to memory” part is additive — it’s what I showed you in the initial diagrams. Information flows through and we add stuff we think is relevant to it. The “resetting memory” part is multiplicative, and occurs before writing to memory; when information from the previous cell state initially flows in, we multiply it by a vector with values between 0 and 1 to reset or retain parts of it we find useless and useful respectively. The “reading from memory” part is also multiplicative with a similar 0–1 range vector, but it doesn’t modify the information flowing through the cell states. Rather, it modifies the information flowing into the hidden states and thus decides what the hidden state is influenced by.
Both of these multiplications are element-wise (Hadamard products), like so:
In this equation, when a = 0 the information of c is lost (this is what resetting does), and when a = 1 it is fully retained. I also imagine that values such as 0.5 could be used to diminish the importance of certain information, but not completely wipe it out.
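To make this gating multiplication concrete, here’s a tiny numpy sketch; the vectors are made up purely for illustration:

```python
import numpy as np

# Previous cell state carrying four pieces of information.
c_prev = np.array([2.0, -1.5, 0.8, 3.0])

# A gate vector with values in [0, 1]: 0 wipes a slot, 1 keeps it,
# and 0.5 merely diminishes it.
a = np.array([0.0, 1.0, 0.5, 1.0])

gated = a * c_prev  # element-wise multiplication
print(gated)        # first slot reset, second kept, third halved
```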
Our (unfinished) cell state computational graph now looks like this:
Sidenote: don’t be scared whenever you see the word “multiplicative” and don’t immediately think of “vanishing” or “exploding”. It depends on the context. Here, as I’ll show mathematically in a bit, it’s fine.
This concept in general is known as gating , because we “gate” what can flow in and out of the LSTM cell. What we actually multiply and add by to reset, write, and read are known as the “gates”. There are four such gates: the forget gate f (resets parts of the cell state), the input gate i and the candidate gate g (which together decide what gets written), and the output gate o (which decides what gets read out into the hidden state).
Here’s our updated computational graph for the cell state:
Looks like I’m starting to create a complex diagram of my own. Damn. 😞 I guess LSTMs and immediately interpretable diagrams just weren’t meant to be!
Basically, f interacts with the cell state through a multiplication. i interacts with g through a multiplication as well, the result of which interacts with the cell state through an addition. Finally, the cell state leaks into a tanh (that’s the shape of the tanh function in the circle), the result of which then interacts with o through multiplication to compute h_t . This does not disrupt the cell state, which flows to the next timestep. h_t then flows forward (and it could flow upward as well).
Here’s the equation form:
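Reconstructed in LaTeX, consistent with the gate names used throughout this article, the two updates are:

```latex
c_t = f_t \odot c_{t-1} + i_t \odot g_t
\qquad
h_t = o_t \odot \tanh(c_t)
```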
As you can see, our cell state has no activation function; the activation function is simply the identity function! Yet, the cell state usually doesn’t explode — it stays stable by “forgetting” and “writing”, and does interesting things with this gating to promote context, fine control over memory, and long-term dependency learning.
So, how are the gates calculated? Well, all of these gates have their own learnable weights and are functions of the previous timestep’s hidden state flowing in and any current timestep inputs, not the cell state (contrary to what I may have implied earlier with the gradient flow diagrams). This should make sense when you think about it; I mean, firstly, the g and i gates literally represent input, so they had better be functionally dependent on hidden states and input data! On an intuitive level, the gates help us modify the cell state, and we modify the cell state based on our current context. External stimuli that provide context should be used to compute these gates, and since context = input + hidden states, our gates are functionally dependent on input and hidden states.
Since every gate has a different value at each timestep, we index by timestep t just like for hidden states, cell states, or something similar.
We could generalize for multiple hidden layers as well:
But, for simplicity’s sake, let’s assume we are at the first hidden layer, or that there is only one hidden layer in the LSTM. This way, we can drop the ℓ term and ignore influence from hidden states at the previous depth. We’ll also forget about edge cases and assume input exists at the current timestep. In practice, we obviously can’t make these assumptions, but for the sake of demonstrating the equations it becomes too tedious otherwise.
Sidenote: we make this assumption for the rest of the discussion on LSTMs in this article.
Like with the RNN hidden state, the index of each weight matrix is descriptive; for example, W_xf are the weights that map input x to the forget gate f . Each gate has weight matrices that map the input and hidden state to itself, plus a bias.
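Putting the gate equations and the cell update together, here’s a minimal numpy sketch of a single LSTM cell step. The function name and the tiny random setup are my own; the weight naming (W_xf, W_hf, and so on) follows the convention above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_forward(x_t, h_prev, c_prev, params):
    """One LSTM cell step: gates f, i, g, o, then the cell/hidden updates.

    `params` holds one weight matrix per (source, gate) pair, named as in
    the text (e.g. W_xf maps input x to the forget gate f), plus a bias
    per gate.
    """
    p = params
    f = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["b_f"])  # forget gate
    i = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["b_i"])  # input gate
    g = np.tanh(p["W_xg"] @ x_t + p["W_hg"] @ h_prev + p["b_g"])  # candidate
    o = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["b_o"])  # output gate
    c_t = f * c_prev + i * g   # reset/retain old memory, then write new info
    h_t = o * np.tanh(c_t)     # read out part of the memory
    return h_t, c_t

# Tiny sanity check with a 3-unit hidden state and 2-dim input.
rng = np.random.default_rng(0)
H, X = 3, 2
params = {}
for gate in "figo":
    params[f"W_x{gate}"] = rng.standard_normal((H, X)) * 0.1
    params[f"W_h{gate}"] = rng.standard_normal((H, H)) * 0.1
    params[f"b_{gate}"] = np.zeros(H)

h, c = lstm_cell_forward(rng.standard_normal(X), np.zeros(H), np.zeros(H), params)
print(h.shape, c.shape)  # (3,) (3,)
```

Note that because h_t = o ⊙ tanh(c_t), every entry of the hidden state is bounded in magnitude by 1, while the cell state itself is unbounded.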
And this is the beauty of LSTMs: the whole thing is end-to-end differentiable. These gates can learn when to allow data to flow and what data should flow, depending on the context they see (the input and the hidden states). They learn this based on patterns seen during training. In this sense, it’s sort of like how a CNN learns feature detectors for images, but the patterns are far more complex and less human-interpretable with LSTMs. This is why they perform so well.
Okay, this looks scarier, but it’s actually not much different to what we had before, especially once you look past the intimidating web of arrows. One notable change is that we’re showing the previous hidden state in time and the current input flowing in. This diagram makes the assumption that we’re in the first layer and at some timestep > 1 where input exists. We then show how the f , i , g, and o gates are computed from this information — the hidden state and inputs are fed into an activation function like sigmoid (or, for g , a tanh; you can tell because it’s double the height of the others) — and it’s expressed through the web of arrows. It’s implied that we weight the two terms entering our activation functions, adding them up with a bias vector, but it’s not necessarily explicit in the diagram.
Let’s embed this into our overall LSTM diagram for a single timestep:
Now let’s zoom out and view our entire unrolled single layer, three timestep LSTM:
It’s beautiful, isn’t it? The full screen width size just adds to the effect! Here’s a link to the full res version.
The only thing that would look more beautiful would be multiple LSTM cells that stack on top of each other (multiple hidden layers)! 😍
You’ve come a long way, young padawan. But there’s still a bit left to go. Part I focused on the motivation for LSTMs, how they work, and a bit on why they reduce the vanishing gradient problem. Now, with a full understanding of LSTMs, Part II will home in on that last part, analyzing at a closer, more technical level why our gradients stop vanishing as quickly. You won’t find a lot of this information online easily; I had to search and ask left and right to find an explanation better and more comprehensive than what you’ll find in other current tutorials.
Firstly, truncated BPTT is often used with LSTMs; it’s a method to speed up training. In particular, note that if we input a sequence of length 1000 into an LSTM and want to train it, it’s equivalent to training a 1000-layer neural network. Doing forward and backward passes through this is very memory- and time-consuming, especially while backpropagating the error, when we need to compute a derivative like this:
…which would include a huge number of terms.
When we backprop the error, and add all the gradients up, this is what we get:
Truncated BPTT does two things: it runs a backward pass every k1 timesteps instead of once at the end of the sequence, and each backward pass only propagates the error back k2 timesteps.
For example, if t = 20 and k1 = 10 , our first (of 20 ÷ 10 = 2) rounds of BPTT would be:
So, with t = 20 , k2 = 10 , and k1 = 10 , our second round of BPTT would follow:
Both k1 and k2 are hyperparameters. k1 does not have to equal k2 .
These two techniques combined enable truncated BPTT to retain the ability to learn long-term dependencies. Here’s a formal definition:
The same paper gives nice pseudocode for truncated BPTT:
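I won’t reproduce the paper’s pseudocode here, but a rough sketch of the schedule it describes, in my own (hypothetical) helper function, looks like this:

```python
def truncated_bptt_schedule(T, k1, k2):
    """Sketch of the truncated-BPTT schedule (not the paper's exact pseudocode).

    Every k1 timesteps we run a backward pass, and each backward pass only
    propagates gradients back over at most the last k2 timesteps.
    Returns the half-open timestep ranges each round backprops over.
    """
    rounds = []
    for t in range(k1, T + 1, k1):  # parameter update every k1 steps
        start = max(0, t - k2)      # backprop at most k2 steps back
        rounds.append((start, t))   # gradient flows over [start, t)
    return rounds

print(truncated_bptt_schedule(20, 10, 10))  # [(0, 10), (10, 20)]
```

The printed schedule matches the worked example above: with t = 20 and k1 = k2 = 10, we get exactly two rounds of BPTT.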
The rest of the math in this section will not assume truncated backprop, because it’s a training technique rather than something rooted in the mathematical foundation of LSTMs.
Moving on — before, we saw this diagram:
In this context, ƒw = i ⊙ g , because it’s the value we’re adding to the cell state.
But this diagram is a bit of a lie. Why? It ignores forget gates. So, does the presence of forget gates affect the vanishing gradient problem? Quite significantly, actually. How? Let’s bring up our cell state equation to see:
With the forget gate, we now include a multiplicative interaction. Our new diagram will look like this:
When our gradients flow back, they will be affected by this multiplicative interaction. So, let’s compute the new derivative:
This seems super neat, actually. The gradient will be f , because f acts as a blocker and controls how much c_t-1 influences c_t ; it’s the gate that you can fully or partially open and close to let information from c_t-1 flow through! It’s intuitive that the gradient would propagate back through it the same way.
But, if you’ve paid close attention so far, you might be asking: “what about backpropagating through ƒw itself?” If you’re a hardcore mathematician, you might also be worried that we’re content with leaving the gradient as just f . The worry is justified, because the gates f , i , and g are all functions of c_t-1 : they are functions of h_t-1 , which is, in turn, a function of c_t-1 ! The diagram shows this visually, as well. It seems we’re failing to apply calculus properly. We’d need to backprop through f and through i ⊙ g to complete the derivative.
Let’s walk through the differentiation to show why you’re actually not wrong , but neither am I:
Now, with the first derivative, we need to apply product rule. Why? Because we’re differentiating the product of two functions of c_t-1 . The former being the forget gate, and the latter being just c_t-1 . Let’s do it:
Then, from product rule:
That’s the first derivative done. We purposely choose not to expand the derivative of the forget gate with respect to the previous cell state. You’ll see why in a bit.
Now let’s tackle the second one:
You’ll notice that it’s also two functions of c_t-1 multiplied together, so we use the product rule again:
So:
Thus, our overall derivative becomes:
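Collecting the product-rule pieces above, the full derivative (a reconstruction in LaTeX) is:

```latex
\frac{\partial c_t}{\partial c_{t-1}}
= f_t
+ c_{t-1} \odot \frac{\partial f_t}{\partial c_{t-1}}
+ g_t \odot \frac{\partial i_t}{\partial c_{t-1}}
+ i_t \odot \frac{\partial g_t}{\partial c_{t-1}}
```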
Pay attention to the caption of the diagram.
This is actually our derivative. Modern LSTM implementations just use an auto-differentiation library to compute derivatives, so this is what they’ll come up with. However, effectively (or, rather, approximately), our gradient is just the forget gate, because the other three terms tend towards zero. Yup: they vanish. Why?
When we backprop error in LSTMs, we backprop through cell states to propagate the error from the outputs to the cell state we want. For example, if we want to backprop the error from the output at time t down k timesteps, then we need to compute the derivative of the cell state at time t to the cell state at time t-k . Look what happens when we do that:
We didn’t simplify the gate-w.r.t.-cell-state derivatives for a reason; as we backpropagate through time, they begin to vanish. Thus, whatever they multiply with is killed off from contributing to the gradient, too. So, effectively:
The rationale behind this is pretty simple, and we don’t need math for it: these gates are the outputs of non-linearities, e.g. sigmoid and tanh. If we were to differentiate them while computing our cell state derivative, the result would contain the derivatives of sigmoid/tanh. But just because we don’t need math to show this doesn’t mean we can’t 😏:
Recall from our vanishing gradient article that the maximum of sigmoid’s first-order derivative is 0.25, and tanh’s derivative is likewise bounded (by 1) and decays just as quickly in the tails. This becomes the textbook vanishing gradient problem. As we backprop through more and more cell states, the gradient terms become longer and longer, and these contributions will vanish. When they don’t fully vanish, they’ll be very minor contributions, so we can leave them out for brevity.
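We can verify those bounds numerically; this quick check is mine, not part of the original derivation:

```python
import numpy as np

# Numerically confirm that sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
# peaks at 0.25 (at z = 0), and that tanh'(z) = 1 - tanh(z)^2 peaks at 1.
z = np.linspace(-10, 10, 10001)   # grid includes z = 0 exactly
sig = 1.0 / (1.0 + np.exp(-z))
sig_grad = sig * (1 - sig)
tanh_grad = 1 - np.tanh(z) ** 2

print(sig_grad.max())   # 0.25
print(tanh_grad.max())  # 1.0
```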
Ultimately, the reason I obfuscate these terms that vanish in the derivative is because I would like to show the effect of the forget gate on gradient flow now. If I included the other terms, the same implications would be present, but the math would just take longer to type out and render.
Because ƒw = i ⊙ g , we can redraw our diagram showing that ƒw won’t make any contributions to the gradient flowing back. Again: strictly speaking it does, but the contribution is effectively negligible, so we can just exclude it from our updated gradient flow diagram, which follows:
But wait! This doesn’t look good; the gradients have to multiply by this f_t gate at each timestep. Before, they didn’t have to multiply by anything (or, in other words, they multiplied by 1) and flowed past super easily.
Machine learning researchers coined a name for the type of function we had before we introduced the forget gate, where the derivative of one cell state w.r.t. the previous is 1.0: “Constant Error Carousel” (CEC). With our new function, the derivative is equal to f . You’ll see this referred to as a “linear carousel” in papers.
Before we introduced a forget gate — where all we had was the additive interaction from ƒw — our cell state function was a CEC:
The derivative of this cell state w.r.t. the previous one, again as long as we don’t backprop through the i and g gates, is just 1. That’s why gradients flow back super comfortably, without vanishing at all. Basically, for a CEC to exist in this context, the coefficient of c_t-1 needs to be 1.
Once we introduced this multiplicative interaction (for good reason), we got a linear carousel; the coefficient of c_t-1 is f . So, in our case, when f = 1 (when we’re not going to forget) our function becomes a CEC, and our gradients will pretty much never vanish. If it’s close to 0, though, the gradient term will immediately die. Gradients will stay on the carousel for a while until the forget gate is triggered; the effect on the gradient is like a step function, in that it’s constant with a value of 1 and then drops off to zero/dies when we have f ≈ 0 .
Intuitively, this seems problematic. Let’s do some math to investigate:
The derivative of a cell state to the previous is f_t . The derivative of a cell state to two prior cell states is f_t ⊙ f_t-1 . Thus:
As we backpropagate through time, these forget gates keep chaining up and multiplying together to form the overall gradient term.
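A quick numerical sketch of this chaining; the gate values are invented for illustration:

```python
import numpy as np

def chained_gradient(forget_gates):
    """Effective gradient dc_t/dc_{t-k}: the element-wise product of the
    intervening forget gates (ignoring the terms that vanish anyway)."""
    out = np.ones_like(forget_gates[0])
    for f in forget_gates:
        out = out * f
    return out

# Forget gates saturated near 1: the gradient survives 100 timesteps.
open_gates = [np.full(4, 0.999) for _ in range(100)]
print(chained_gradient(open_gates))   # ~0.905 in every slot

# Typical mid-range gates (~0.5): the gradient vanishes.
half_gates = [np.full(4, 0.5) for _ in range(100)]
print(chained_gradient(half_gates))   # ~8e-31 in every slot
```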
Now, imagine an LSTM with 100 timesteps. If we wanted to get the derivative of the error w.r.t. a weight like W_xi , to optimize it, remember that with BPTT we add up or average all the gradients from the different timesteps:
OK. Now let’s look at an early (in time) term, like the gradient propagated from the error to the third cell:
Remember that J is an addition of errors from Y individual outputs, so we backpropagate through each of the outputs first:
The first few terms, where we backprop y_k to c_3 where k < 3 , would just be equal to zero because c_3 only exists after these outputs have been computed.
Let’s assume that Y = 100 and continue with our assumption that t = 100 (so each timestep gives rise to an output), for simplicity. With this, let’s now look at the last term in this sum.
That’s a lot of forget gates chained together. If one of these forget gates is [approximately] zero, the whole gradient dies. If they instead tend to be small numbers between 0 and 1, the whole thing will vanish, and c_3 won’t make any contributions to the gradient here.
This isn’t an issue though! Because, when a forget gate is zero, it means that cell is no longer making any contributions past that point. If f_4 is zero, then any y outputs at/past timestep 4 won’t be influenced by c_3 (as well as c_2 and c_1 ) because we “erased” it from memory. Therefore that particular gradient should be zero. If f_80 is zero, then any outputs at/past timestep 80 won’t be influenced by c_1, c_2, …, c_79 . Same story here. If these forget gates are between 0 and 1, then the influence of our cell decays over time anyway, and our gradients will be very small, so they’ll reflect that.
The literature calls this “releasing resources”.
Cell c_3 will still contribute to the overall gradient, though. For example, take this term:
Here, we’re looking at y_12 instead of y_100 . Chances are that, if you have a sequence of length 100, your 100th cell state isn’t drawing from your 3rd; the forget gate would have been triggered at some point by then. However, the 12th cell state probably will still be drawing from the ones before it.
If we decide not to forget in the first 12 timesteps, i.e. f_1 … f_12 are each not far from 1, then c_3 would have more influence over y_12 and the error that stems from y_12 . Thus, the gradient would not vanish and c_3 still contributes to updating W_xi ; it just doesn’t contribute a gradient where it’s not warranted (that is, where it doesn’t actually contribute to any activation, because it’s been forgotten). To summarize: one activated forget gate will indeed kill off gradient flow to cell(s), but that is a good thing, because the network is learning that that gradient from the future has no benefit and is completely irrelevant to those particular cell(s), since those cells have been forgotten by then. In practice, different cells learn different ranges of context, some short, some long. This is a major advantage for LSTMs.
So, given a gradient between two cell states in time, when all of these forget gates are [approximately] equal to 1, the gradient signal will remain stable, because we’re multiplying by 1 at each timestep — effectively, not multiplying by anything at all. In such a case, our gradient flow diagram would look like this:
The gradient will have literally zero interactions or disturbances, and will just flow through like it’s driving 150 mph on an empty American countryside highway. The beauty of CECs is that they behave exactly like this.
But, let’s get back to reality. LSTMs aren’t CECs. One disadvantage of these forget gates chaining together is that it could block learning. That is, when we set out to train our LSTM, the forget gates have not been learned; we have to learn them while we learn everything else. So, if they all start around 0, no gradients will flow through our cell states when we perform BPTT, and learning won’t happen at all.
The obvious solution is to initialize the forget gate bias to a large positive value, so the gate starts out near 1 instead of near 0 (sigmoid approaches 1 for large positive inputs, so a large bias pushes the output towards 1). In the early stages of training, forget gates equalling/approximating 1 mean learning isn’t blocked. So many papers do this, and mention it explicitly, that this forget gate bias could even be considered a hyperparameter.
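A minimal sketch of that initialization trick, with an arbitrary hidden size and bias value of my choosing:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

H = 8              # hidden size (arbitrary for this sketch)
forget_bias = 5.0  # the "large value" hyperparameter from the text

# At the start of training the weights are small, so the pre-activation of
# the forget gate is roughly just its bias.
b_f = np.full(H, forget_bias)
f_initial = sigmoid(b_f)
print(f_initial[0])  # ~0.993: gates start open, so gradients can flow
```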
By introducing forget gates, we stray from CECs and thus the guarantee that our gradients will never ever vanish. But, again, we do it for good reason. And when gradients vanish it’s because we chose to forget that cell — so it’s not necessarily a bad thing. We just need to make sure the forget gates don’t block learning in initial stages of training; in such a case, we shouldn’t need to bother about vanishing gradients too much.
Here’s a more technical explanation: