Read through this if you want to know the problem and why the fix works. Scroll to the bottom for the fix.
I'm sure most of you know about the soft d/c bug that has plagued Odin sources since their inception. For the uninformed: you would generally get d/ced to the login screen from common actions, usually around item drops after killing monsters. There has been some discussion about what causes this, but I don't think anyone has released the "real" fix for it. I discovered the fix a long time ago, but I didn't take the time to understand why it happens, so I was not confident it was a "real" fix. A recent help thread reminded me of this problem, and I decided to actually investigate.
Why it happens:
The problem stems from an encryption issue. The MapleClient and MapleServer each have a send IV and a receive IV, used to encrypt/decrypt packets, and each IV is updated when a packet is sent or received, respectively. The send IV the server uses to encrypt a packet must, at all times, match the one the client is expecting - if they ever differ, the client immediately closes the socket, thereby d/cing you to the login screen.
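To make the IV-matching constraint concrete, here is a toy model. The update function below is made up purely for illustration - it is NOT MapleStory's real IV shuffle. The point it demonstrates: both sides start from the same IV and must apply the same per-packet updates in the same order; swap two updates and the states diverge, which is exactly the point where the client drops the socket.

```java
import java.util.Arrays;

class IvSync {
    // Hypothetical stand-in for the per-packet IV update. The real
    // algorithm differs; only the order-sensitivity matters here.
    static byte[] next(byte[] iv, byte packetByte) {
        byte[] out = iv.clone();
        for (int i = 0; i < out.length; i++) {
            // Mix the packet byte into every position of the IV state.
            out[i] = (byte) (out[i] * 31 + packetByte + i);
        }
        return out;
    }
}
```

Applying updates for packets 1 and 2 in one order on the server and the opposite order on the client yields different IVs, so the next packet fails to decrypt and the client closes the connection.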
Why it manifests (usually) from drops:
Drops in most sources are coded poorly. In most sources (and perhaps this has changed recently), a new task is scheduled for each drop of each monster - so when you use a map-wide skill to kill 10-20 monsters, you are suddenly scheduling 30 tasks that each send a packet, all at once! No other place in the source spams packets from N async tasks at the same time (other than drop expiry, which is similarly poorly coded). ALL sources should change the drop code to create only one task per monster, covering all of its drops. This will likely reduce the occurrence significantly for some people, but the underlying problem will remain.
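As a sketch of that restructuring - class and method names here are hypothetical, and a plain ScheduledExecutorService stands in for whatever timer your source uses - the idea is one scheduled task per monster that announces all of its drops, instead of one task (and one packet burst) per drop:

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class DropScheduler {
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    final AtomicInteger tasksScheduled = new AtomicInteger();

    // ONE task per monster: every drop is announced from a single Runnable,
    // so killing a monster queues one task no matter how many drops it has.
    void scheduleDropsForMonster(List<String> drops, long delayMs) {
        tasksScheduled.incrementAndGet();
        timer.schedule(() -> {
            for (String drop : drops) {
                broadcastDrop(drop); // would send the drop packet to the map
            }
        }, delayMs, TimeUnit.MILLISECONDS);
    }

    void broadcastDrop(String drop) {
        // placeholder for the real packet broadcast
    }

    void shutdown() {
        timer.shutdown();
    }
}
```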
History of solutions:
Initially (5+ years ago), we all thought it was a MapleAESOFB/MapleEncoder problem, because MapleAESOFB works with an IV whose state needs to stay synchronized between packets. Incorrect synchronization of the IV state during encryption would cause a discrepancy between the server's IV and the client's IV, causing a soft d/c. Many people tried to improve the thread-safety of MapleEncoder, and they eventually got it right: most MapleEncoders are now completely thread-safe with respect to MapleAESOFB. Some people also tried some silly optimizations to cut the number of System.arraycopy calls (lol). The problem persisted, though, which left people confused.
MINA at a high level:
It turns out we cannot understand the problem (nor the fix!) until we understand a little about MINA. MINA is set up to push packets through a filter chain that you define when setting up the socket acceptor. Take a typical MapleStory socket acceptor setup:
PHP:
public LoginServer() {
    acceptor = new SocketAcceptor();
    SocketAcceptorConfig config = new SocketAcceptorConfig();
    config.getFilterChain().addLast("codec", new ProtocolCodecFilter(new MapleCodecFactory()));
    config.getSessionConfig().setTcpNoDelay(true);
    try {
        acceptor.bind(new InetSocketAddress(LOGIN_PORT), new ServerHandler(-1, -1, MapleClient.LOGIN_SERVER), config);
    } catch (Exception e) {
        log.error("Could not bind. ", e);
    }
    log.info("Login Server: Listening on port " + LOGIN_PORT);
}
In MapleStory's case, since we only use bytes, we only need to supply one filter: a ProtocolCodecFilter to encrypt/decrypt packets. At a high level, when a packet is sent from the source through
PHP:
client.getSession().write()
, MINA takes the filter chain (an instance of SocketFilterChain for the TCP protocol), layers an ExecutorFilterChain in between to fire notifications for events like messageReceived and messageSent, and runs through the entire filter chain. At the end, it appends the message to send (now encrypted by our MapleEncoder) to the session's writeQueue. A SocketIoProcessor worker thread then selects the available channels and processes their messages (polls the writeQueue and writes the bytes) in an infinite loop.
A dive into MINA code:
The code that actually encrypts and adds the packet to the writeQueue is in the filterWrite() method of the ProtocolCodecFilter, but involves a few different classes/methods:
PHP:
@Override
public void filterWrite(NextFilter nextFilter, IoSession session, WriteRequest writeRequest) throws Exception {
    Object message = writeRequest.getMessage();
    if (message instanceof ByteBuffer) {
        nextFilter.filterWrite(session, writeRequest);
        return;
    }
    ProtocolEncoder encoder = getEncoder(session);
    ProtocolEncoderOutputImpl encoderOut = getEncoderOut(session, nextFilter, writeRequest);
    try {
        encoder.encode(session, message, encoderOut);
        encoderOut.flush();
        nextFilter.filterWrite(session, new WriteRequest(new MessageByteBuffer(writeRequest.getMessage()), writeRequest.getFuture(), writeRequest.getDestination()));
    } catch (Throwable t) {
        ProtocolEncoderException pee;
        if (t instanceof ProtocolEncoderException) {
            pee = (ProtocolEncoderException) t;
        } else {
            pee = new ProtocolEncoderException(t);
        }
        throw pee;
    }
}
Notice that a ProtocolEncoderOutputImpl is created, which holds the result of the encoder.encode() method - this is the 'out' parameter passed into MapleEncoder.encode(). Then the result is flushed, and the filter chain moves on to the next filter. Looking at the flush() method of the EncoderOutput:
PHP:
public WriteFuture flush() {
    Queue<ByteBuffer> bufferQueue = this.bufferQueue;
    WriteFuture future = null;
    if (bufferQueue.isEmpty()) {
        return null;
    } else {
        for (;;) {
            ByteBuffer buf = bufferQueue.poll();
            if (buf == null) {
                break;
            }
            // Flush only when the buffer has remaining.
            if (buf.hasRemaining()) {
                future = doFlush(buf);
            }
        }
    }
    return future;
}
We see that it ends up calling ProtocolEncoderOutputImpl.doFlush():
PHP:
@Override
protected WriteFuture doFlush(ByteBuffer buf) {
    WriteFuture future = new DefaultWriteFuture(session);
    nextFilter.filterWrite(session, new WriteRequest(new HiddenByteBuffer(buf), future, writeRequest.getDestination()));
    return future;
}
Thus, for any given packet, MINA will:
- Start it through the filter chain until it reaches the ProtocolCodecFilter
- Encrypt the bytes and create a new WriteRequest which goes through the rest of the chain
- Add it onto a writeQueue after it has passed the filterChain
The root of the problem:
All of this seems totally reasonable, so what's the real problem here? Think back to our initial solutions: we thought there were concurrency issues with the encryption IVs, so we fixed those. But there is a bigger concurrency issue at play. When we send packets from various threads at the same time, we have no guarantee of ordering. This directly contradicts our requirement: the client and the server must each compute an IV based on the order of sent packets.
So despite us making the encryption threadsafe, there is a lot of stuff on MINA's end that happens before a packet gets put onto the writeQueue, and it is possible for a session.write() call from Thread B to reach the writeQueue first, before Thread A's call, even though the client expects Thread A's packet first (based on the IV), and even though the server encrypted them in order!
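Here is a toy model of the race, with made-up names: assigning a sequence number stands in for advancing the IV, and a plain queue stands in for MINA's writeQueue. The essence of the fix is that both steps happen atomically per session, so queue order always matches IV order - remove the synchronized block and nothing stops Thread B from enqueueing between Thread A's two steps:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

class OrderedSender {
    private final AtomicInteger ivState = new AtomicInteger();           // stands in for the send IV
    private final Queue<Integer> writeQueue = new ConcurrentLinkedQueue<>(); // stands in for MINA's writeQueue
    private final Object sendLock = new Object();

    // The fixed path: "encryption" (sequence-number assignment) and
    // enqueueing happen atomically, so queue order matches IV order.
    void send() {
        synchronized (sendLock) {
            int seq = ivState.getAndIncrement();
            writeQueue.add(seq);
        }
    }

    // True iff the queue holds 0, 1, 2, ... in order, i.e. the order
    // the client's IV state expects.
    boolean isInOrder() {
        int expected = 0;
        for (int seq : writeQueue) {
            if (seq != expected++) {
                return false;
            }
        }
        return true;
    }
}
```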
The "Fix":
Well, after all that build up, you expect the fix to be something very complex and involved, right? You'll be surprised. Change your MapleEncoder's encode() method to match this:
PHP:
public void encode(IoSession session, Object message, ProtocolEncoderOutput out) throws Exception {
    try {
        final MapleClient client = (MapleClient) session.getAttribute(MapleClient.KEY);
        final byte[] input = ((MaplePacket) message).getBytes();
        if (client != null) {
            final byte[] toEncrypt = new byte[input.length];
            System.arraycopy(input, 0, toEncrypt, 0, input.length);
            final byte[] ret = new byte[toEncrypt.length + HEADER_SIZE];
            encrypt(toEncrypt);
            client.getSendLock().lock();
            try {
                final byte[] header = client.getSendCrypto().getPacketHeader(toEncrypt.length);
                client.getSendCrypto().crypt(toEncrypt);
                System.arraycopy(header, 0, ret, 0, HEADER_SIZE);
                System.arraycopy(toEncrypt, 0, ret, HEADER_SIZE, toEncrypt.length);
                ByteBuffer out_buffer = ByteBuffer.wrap(ret);
                session.write(out_buffer);
            } finally {
                client.getSendLock().unlock();
            }
        } else { // no client object created yet, send unencrypted (handshake)
            out.write(ByteBuffer.wrap(input));
        }
    } catch (Exception e) {
        log.error("ENCRYPTION EXCEPTION: ", e);
    }
}
Specifically, change your
PHP:
out.write(out_buffer);
to
PHP:
session.write(out_buffer);
The session.write() call MUST go inside the critical section (inside the lock or synchronized block).
How it works:
On the first pass through the filter chain (after the initial call to client.getSession().write()), the packet bytes get encrypted by encoder.encode(). Then, because we called session.write(out_buffer) instead of out.write() inside encode(), the filter chain process starts again, procedurally, in the same thread - but this time inside a critical section! So on the second pass, our already-encrypted packet hits the first case in filterWrite:
PHP:
Object message = writeRequest.getMessage();
if (message instanceof ByteBuffer) {
    nextFilter.filterWrite(session, writeRequest);
    return;
}
And it puts the packet on the writeQueue while still inside the critical section. This shrinks the possibility of an out-of-order packet to zero, because we now have guaranteed ordering on a per-session basis.
Do note that this is not a real, bona fide fix to the problem. Ordering anything (not just packets) is tricky when dealing with asynchronous operations - this is a big benefit of having a game loop, where scattered message sending can be consolidated into one ordered queue. Perhaps the biggest gripe I have with OdinMS's design is its message-based architecture.
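The game-loop idea mentioned above can be sketched like this (names are illustrative, not from any real source): producer threads only enqueue, and a single dedicated thread drains the queue and performs the encrypt-and-write, so packet order is fixed entirely by queue order:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class SendLoop {
    private final BlockingQueue<byte[]> outbox = new LinkedBlockingQueue<>();
    final List<byte[]> written = new ArrayList<>(); // what hit the "socket", in order

    // Any thread may call this; it never touches the IV or the socket.
    void enqueue(byte[] packet) {
        outbox.add(packet);
    }

    // Run on ONE dedicated thread: encryption and writing happen only
    // here, so the IV advances in exactly the order packets are written.
    void drainOnce() throws InterruptedException {
        byte[] packet = outbox.take();
        // placeholder for MapleAESOFB header + crypt + session write
        written.add(packet);
    }
}
```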
Well, thanks for sitting through this, and I hope this helped (and you learned something)!
An alternative solution here would be to wrap client.getSession().write() into another method, and synchronize it:
PHP:
private final Object lock = new Object();

public void write(MaplePacket o) {
    synchronized (lock) {
        session.write(o);
    }
}
Then, if you replace all getSession().write() calls with MapleClient.write(), you wouldn't need synchronization in the encoder. This would guarantee we get the packets onto the writeQueue in order as well, and we can switch back to out.write(out_buffer).
There really isn't a reason to expose the IoSession from MapleClient anyway.
Edit:
For MINA 2.x, they replaced ByteBuffer with IoBuffer. If you are using MINA 2.x, you would replace the
ByteBuffer out_buffer = ByteBuffer.wrap(ret);
with
IoBuffer out_buffer = IoBuffer.wrap(ret);