Read through this if you want to know the problem and why the fix works. Scroll to the bottom for the fix.
I'm sure most of you know about the soft d/c bug that has plagued Odin sources since their inception. For the uninformed: you would generally get d/ced to the login screen from common actions, usually around item drops after killing monsters. There has been some discussion about what causes this, but I don't think anyone has released the "real" fix for it. I discovered the fix a long time ago, but I didn't take the time to understand why it happens, so I was not confident it was a "real" fix. A recent help thread reminded me of this problem, and I decided to actually investigate.
Why it happens:
The problem stems from an encryption issue. The MapleClient and MapleServer each have a send IV and a receive IV, used to encrypt/decrypt packets, and each IV is updated when a packet is sent or received, respectively. The send IV the server uses to encrypt a packet must, at all times, match the one the client is expecting - if they ever differ, the client immediately closes the socket, thereby d/cing you to the login screen.
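To make the IV-matching constraint concrete, here is a toy model. The update function below is made up purely for illustration - it is NOT MapleStory's real IV shuffle. The point it demonstrates: both sides start from the same IV and must apply the same per-packet updates in the same order; swap two updates and the states diverge, which is exactly the point where the client drops the socket.

```java
import java.util.Arrays;

class IvSync {
    // Hypothetical stand-in for the per-packet IV update. The real
    // algorithm differs; only the order-sensitivity matters here.
    static byte[] next(byte[] iv, byte packetByte) {
        byte[] out = iv.clone();
        for (int i = 0; i < out.length; i++) {
            // Mix the packet byte into every position of the IV state.
            out[i] = (byte) (out[i] * 31 + packetByte + i);
        }
        return out;
    }
}
```

Applying updates for packets 1 and 2 in one order on the server and the opposite order on the client yields different IVs, so the next packet fails to decrypt and the client closes the connection.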
Why it manifests (usually) from drops:
Drops in most sources are coded poorly. In most sources (and perhaps this has changed recently), a new task is scheduled for each drop of each monster - so when you use a map-wide skill to kill 10-20 monsters, you are suddenly scheduling 30 tasks that each send a packet, all at once! No other place in the source spams packets from N async tasks at the same time (other than drop expiry, which is similarly poorly coded). ALL sources should change the drop code to create only one task per monster, covering all of its drops. This will likely reduce the occurrence significantly for some people, but the underlying problem will remain.
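As a sketch of that restructuring - class and method names here are hypothetical, and a plain ScheduledExecutorService stands in for whatever timer your source uses - the idea is one scheduled task per monster that announces all of its drops, instead of one task (and one packet burst) per drop:

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class DropScheduler {
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    final AtomicInteger tasksScheduled = new AtomicInteger();

    // ONE task per monster: every drop is announced from a single Runnable,
    // so killing a monster queues one task no matter how many drops it has.
    void scheduleDropsForMonster(List<String> drops, long delayMs) {
        tasksScheduled.incrementAndGet();
        timer.schedule(() -> {
            for (String drop : drops) {
                broadcastDrop(drop); // would send the drop packet to the map
            }
        }, delayMs, TimeUnit.MILLISECONDS);
    }

    void broadcastDrop(String drop) {
        // placeholder for the real packet broadcast
    }

    void shutdown() {
        timer.shutdown();
    }
}
```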
History of solutions:
Initially (5+ years ago), we all thought it was a MapleAESOFB/MapleEncoder problem, because MapleAESOFB works with an IV whose state needs to stay synchronized between packets. Incorrect synchronization of the IV state during encryption would cause a discrepancy between the server's IV and the client's IV, causing a soft d/c. Many people tried to improve the thread-safety of MapleEncoder, and they eventually got it right: most MapleEncoders are now completely thread-safe with respect to MapleAESOFB. Some people also tried some silly optimizations to cut the number of System.arraycopy calls (lol). The problem persisted, though, which left people confused.
MINA at a high level:
It turns out we cannot understand the problem (nor the fix!) until we understand a little about MINA. MINA is set up to push packets through a filter chain that you define when setting up the socket acceptor. Take a typical MapleStory socket acceptor setup:
PHP:
public LoginServer() {
    acceptor = new SocketAcceptor();
    SocketAcceptorConfig config = new SocketAcceptorConfig();
    config.getFilterChain().addLast("codec", new ProtocolCodecFilter(new MapleCodecFactory()));
    config.getSessionConfig().setTcpNoDelay(true);
    try {
        acceptor.bind(new InetSocketAddress(LOGIN_PORT), new ServerHandler(-1, -1, MapleClient.LOGIN_SERVER), config);
    } catch (Exception e) {
        log.error("Could not bind. ", e);
    }
    log.info("Login Server: Listening on port " + LOGIN_PORT);
}
In MapleStory's case, since we only use bytes, we only need to supply one filter: a ProtocolCodecFilter to encrypt/decrypt packets. At a high level, when a packet is sent from the source through
PHP:
client.getSession().write()
, MINA takes the filter chain (an instance of SocketFilterChain for the TCP protocol), layers an ExecutorFilterChain in between to fire notifications for events like messageReceived and messageSent, and runs through the entire filter chain. At the end, it appends the message to send (now encrypted by our MapleEncoder) to the session's writeQueue. A SocketIoProcessor worker thread then selects the available channels and processes their messages (polls the writeQueue and writes the bytes) in an infinite loop.
A dive into MINA code:
The code that actually encrypts and adds the packet to the writeQueue is in the filterWrite() method of the ProtocolCodecFilter, but involves a few different classes/methods:
PHP:
@Override
public void filterWrite(NextFilter nextFilter, IoSession session, WriteRequest writeRequest) throws Exception {
    Object message = writeRequest.getMessage();
    if (message instanceof ByteBuffer) {
        nextFilter.filterWrite(session, writeRequest);
        return;
    }
    ProtocolEncoder encoder = getEncoder(session);
    ProtocolEncoderOutputImpl encoderOut = getEncoderOut(session, nextFilter, writeRequest);
    try {
        encoder.encode(session, message, encoderOut);
        encoderOut.flush();
        nextFilter.filterWrite(session, new WriteRequest(new MessageByteBuffer(writeRequest.getMessage()), writeRequest.getFuture(), writeRequest.getDestination()));
    } catch (Throwable t) {
        ProtocolEncoderException pee;
        if (t instanceof ProtocolEncoderException) {
            pee = (ProtocolEncoderException) t;
        } else {
            pee = new ProtocolEncoderException(t);
        }
        throw pee;
    }
}
Notice that a ProtocolEncoderOutputImpl is created, which holds the result of the encoder.encode() method - this is the 'out' parameter passed into MapleEncoder.encode(). Then the result is flushed, and the filter chain moves on to the next filter. Looking at the flush() method of the EncoderOutput:
PHP:
public WriteFuture flush() {
    Queue<ByteBuffer> bufferQueue = this.bufferQueue;
    WriteFuture future = null;
    if (bufferQueue.isEmpty()) {
        return null;
    } else {
        for (;;) {
            ByteBuffer buf = bufferQueue.poll();
            if (buf == null) {
                break;
            }
            // Flush only when the buffer has remaining.
            if (buf.hasRemaining()) {
                future = doFlush(buf);
            }
        }
    }
    return future;
}
We see that it ends up calling ProtocolEncoderOutputImpl.doFlush():
PHP:
@Override
protected WriteFuture doFlush(ByteBuffer buf) {
    WriteFuture future = new DefaultWriteFuture(session);
    nextFilter.filterWrite(session, new WriteRequest(new HiddenByteBuffer(buf), future, writeRequest.getDestination()));
    return future;
}
Thus, for any given packet, MINA will:
- Start it through the filter chain until it reaches the ProtocolCodecFilter
- Encrypt the bytes and create a new WriteRequest which goes through the rest of the chain
- Add it onto a writeQueue after it has passed the filterChain
The root of the problem:
All of this seems totally reasonable, so what's the real problem here? Think back to our initial solutions: we thought there were concurrency issues with the encryption IVs, so we fixed those. But there is a bigger concurrency issue at play. When we send packets from various threads at the same time, we have no guarantee of ordering. This directly contradicts our requirement: the client and the server must each compute an IV based on the order of sent packets.
So despite us making the encryption threadsafe, there is a lot of stuff on MINA's end that happens before a packet gets put onto the writeQueue, and it is possible for a session.write() call from Thread B to reach the writeQueue first, before Thread A's call, even though the client expects Thread A's packet first (based on the IV), and even though the server encrypted them in order!
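Here is a toy model of the race, with made-up names: assigning a sequence number stands in for advancing the IV, and a plain queue stands in for MINA's writeQueue. The essence of the fix is that both steps happen atomically per session, so queue order always matches IV order - remove the synchronized block and nothing stops Thread B from enqueueing between Thread A's two steps:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

class OrderedSender {
    private final AtomicInteger ivState = new AtomicInteger();           // stands in for the send IV
    private final Queue<Integer> writeQueue = new ConcurrentLinkedQueue<>(); // stands in for MINA's writeQueue
    private final Object sendLock = new Object();

    // The fixed path: "encryption" (sequence-number assignment) and
    // enqueueing happen atomically, so queue order matches IV order.
    void send() {
        synchronized (sendLock) {
            int seq = ivState.getAndIncrement();
            writeQueue.add(seq);
        }
    }

    // True iff the queue holds 0, 1, 2, ... in order, i.e. the order
    // the client's IV state expects.
    boolean isInOrder() {
        int expected = 0;
        for (int seq : writeQueue) {
            if (seq != expected++) {
                return false;
            }
        }
        return true;
    }
}
```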
The "Fix":
Well, after all that build up, you expect the fix to be something very complex and involved, right? You'll be surprised. Change your MapleEncoder's encode() method to match this:
PHP:
public void encode(IoSession session, Object message, ProtocolEncoderOutput out) throws Exception {
    try {
        final MapleClient client = (MapleClient) session.getAttribute(MapleClient.KEY);
        final byte[] input = ((MaplePacket) message).getBytes();
        if (client != null) {
            final byte[] toEncrypt = new byte[input.length];
            System.arraycopy(input, 0, toEncrypt, 0, input.length);
            final byte[] ret = new byte[toEncrypt.length + HEADER_SIZE];
            encrypt(toEncrypt);
            client.getSendLock().lock();
            try {
                final byte[] header = client.getSendCrypto().getPacketHeader(toEncrypt.length);
                client.getSendCrypto().crypt(toEncrypt);
                System.arraycopy(header, 0, ret, 0, HEADER_SIZE);
                System.arraycopy(toEncrypt, 0, ret, HEADER_SIZE, toEncrypt.length);
                ByteBuffer out_buffer = ByteBuffer.wrap(ret);
                session.write(out_buffer);
            } finally {
                client.getSendLock().unlock();
            }
        } else { // no client object created yet, send unencrypted (handshake)
            out.write(ByteBuffer.wrap(input));
        }
    } catch (Exception e) {
        log.error("ENCRYPTION EXCEPTION: ", e);
    }
}
Specifically, change your
PHP:
out.write(out_buffer);
to
PHP:
session.write(out_buffer);
The session.write() call MUST go inside the critical section (inside the lock or synchronized block).
How it works:
On the first pass through the filter chain (after the initial call to client.getSession().write()), the packet bytes get encrypted by encoder.encode(). Then, because we called session.write(out_buffer) instead of out.write() inside encode(), the filter chain process starts again, procedurally, in the same thread - but this time inside a critical section! So on the second pass, our already-encrypted packet hits the first case in filterWrite:
PHP:
Object message = writeRequest.getMessage();
if (message instanceof ByteBuffer) {
    nextFilter.filterWrite(session, writeRequest);
    return;
}
And it puts the packet on the writeQueue while still inside the critical section. This shrinks the possibility of an out-of-order packet to zero, because we now have guaranteed ordering on a per-session basis.
Do note that this is not a real, bona fide fix to the problem. Ordering anything (not just packets) is tricky when dealing with asynchronous operations - this is a big benefit of having a game loop, where scattered message sending can be consolidated into one ordered queue. Perhaps the biggest gripe I have with OdinMS's design is its message-based architecture.
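The game-loop idea mentioned above can be sketched like this (names are illustrative, not from any real source): producer threads only enqueue, and a single dedicated thread drains the queue and performs the encrypt-and-write, so packet order is fixed entirely by queue order:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class SendLoop {
    private final BlockingQueue<byte[]> outbox = new LinkedBlockingQueue<>();
    final List<byte[]> written = new ArrayList<>(); // what hit the "socket", in order

    // Any thread may call this; it never touches the IV or the socket.
    void enqueue(byte[] packet) {
        outbox.add(packet);
    }

    // Run on ONE dedicated thread: encryption and writing happen only
    // here, so the IV advances in exactly the order packets are written.
    void drainOnce() throws InterruptedException {
        byte[] packet = outbox.take();
        // placeholder for MapleAESOFB header + crypt + session write
        written.add(packet);
    }
}
```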
Well, thanks for sitting through this, and I hope this helped (and you learned something)!
An alternative solution here would be to wrap client.getSession().write() into another method, and synchronize it:
PHP:
private final Object lock = new Object();

public void write(MaplePacket o) {
    synchronized (lock) {
        session.write(o);
    }
}
Then, if you replace all getSession().write() calls with MapleClient.write(), you wouldn't need synchronization in the encoder. This would guarantee we get the packets onto the writeQueue in order as well, and we can switch back to out.write(out_buffer).
There really isn't a reason to expose the IoSession from MapleClient anyway.
Edit:
For MINA 2.x, they replaced ByteBuffer with IoBuffer. If you are using MINA 2.x, you would replace the
ByteBuffer out_buffer = ByteBuffer.wrap(ret);
with
IoBuffer out_buffer = IoBuffer.wrap(ret);